Block-Parallel IDA* for GPUs (SoCS17)

Our paper on parallel IDA* search on GPUs was accepted at SoCS 2017, the 10th Annual Symposium on Combinatorial Search.

This is the result of a study from my master's course, submitted after graduation. The paper was accepted on May 4th, and the conference was held in Pittsburgh on June 16 and 17. Since I was already working full-time, I asked my advisor, who is also my co-author, to present at the conference.

IDA*

I will not explain IDA* in detail here; there is a Wikipedia article on the topic.

Some colleagues in our lab wrote a series of posts on graph search in Japanese in 2015, and IDA* is explained in the 8th post.

Roughly speaking, IDA* is an algorithm that finds an optimal (minimal-cost) sequence of actions, exploiting knowledge about the problem being solved, without requiring an exponential amount of memory. (Incidentally, Richard E. Korf, who developed this algorithm, is my advisor's advisor.)
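
To make this concrete, here is a minimal sequential sketch of IDA* (illustrative only, not code from the paper). `State`, `successors`, `h`, and `is_goal` are hypothetical hooks to be supplied by a concrete domain such as the sliding-tile puzzle:

```cpp
#include <limits>
#include <utility>
#include <vector>

struct State { /* domain-specific representation */ };
std::vector<std::pair<State, int>> successors(const State&); // (child, edge cost)
int h(const State&);        // admissible heuristic estimate to the goal
bool is_goal(const State&);

constexpr int FOUND = -1;

// Depth-first search bounded by f = g + h. Returns FOUND on success,
// otherwise the smallest f-value that exceeded the bound.
int search(const State& s, int g, int bound) {
    int f = g + h(s);
    if (f > bound) return f;
    if (is_goal(s)) return FOUND;
    int next = std::numeric_limits<int>::max();
    for (const auto& [child, cost] : successors(s)) {
        int t = search(child, g + cost, bound);
        if (t == FOUND) return FOUND;
        if (t < next) next = t;
    }
    return next;
}

// Iterative deepening: repeat the bounded DFS with a growing threshold.
// Only the current path is kept, so memory use is linear in search depth.
bool ida_star(const State& root) {
    int bound = h(root);
    while (true) {
        int t = search(root, 0, bound);
        if (t == FOUND) return true;
        if (t == std::numeric_limits<int>::max()) return false; // exhausted
        bound = t;
    }
}
```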

Popular applications of this algorithm include solving certain kinds of puzzles, route finding in car navigation systems, and searching for protein structures with desired properties.

Background

Below is the background of the paper. (This is a rough summary; please see the paper for more details.)

The main achievement of this research is a method for efficiently implementing IDA* on GPUs. One previous work on parallelizing A* on GPUs is Zhou and Zeng, 2015. In A*, however, memory usage typically grows exponentially with the solution length; even in sequential A* on a CPU, several gigabytes of memory can be consumed within a minute. This makes it quite difficult for that approach to solve the large-scale problems we actually want to solve on GPUs. With IDA*, on the other hand, memory usage grows only linearly with the solution length, so we might be able to solve large-scale problems that A* could not handle because of its memory requirements.

When parallelizing IDA* on a GPU, the key question is how to divide up the search. In simple searches such as depth-first or breadth-first search, the search tree is highly symmetric, so the work can be divided evenly in simple ways. In IDA*, however, the search tree has little symmetry, and the shapes of the subtrees are unknown until they are actually searched.

There has been much research on how to divide up and parallelize the search in IDA*, but most of it targets CPUs or SIMD computers. The execution model of GPUs (which NVIDIA calls SIMT) has different characteristics. We re-implemented several methods from past research, but could not obtain satisfactory performance, since these methods were not designed for GPUs.

Block-parallel IDA*

In this research, we therefore propose Block-parallel IDA*, a method for efficiently implementing IDA* on GPUs.

In IDA* parallelizations, a short search from the initial state is generally performed as a pre-process, and the nodes it produces are then distributed among the threads. Each thread searches, in parallel, the subtree rooted at the node assigned to it. Since the sizes of these subtrees are unknown in advance, the amount of work per thread is uneven, and the machinery needed to dynamically correct this imbalance makes the implementation complex.
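
As a concrete illustration of this conventional scheme, here is a sketch of the pre-process (again illustrative, reusing the hypothetical `State` and `successors` hooks from the IDA* sketch above):

```cpp
#include <deque>
#include <vector>

// Expand the shallowest nodes breadth-first until there is at least one
// frontier node per worker. Real implementations over-decompose (many
// more nodes than workers) to soften the load imbalance.
std::vector<State> build_frontier(const State& root, std::size_t num_workers) {
    std::deque<State> frontier{root};
    while (!frontier.empty() && frontier.size() < num_workers) {
        State s = frontier.front();
        frontier.pop_front();
        for (const auto& [child, cost] : successors(s))
            frontier.push_back(child);
    }
    return {frontier.begin(), frontier.end()};
}
// Worker i then runs the bounded DFS from frontier[i]. Since subtree
// sizes are unknown, some workers finish long before others.
```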

In contrast, in the Block-parallel IDA* we propose, nodes are assigned to thread blocks rather than to individual threads. This was inspired by Rocki and Suda, 2010, which used GPUs for game-tree search in Othello. We showed how, in parallel IDA* search, the threads within a block can search in parallel without explicit synchronization. Moreover, by relying on the GPU's hardware scheduler, the allocation of nodes among blocks also requires no explicit synchronization. Because dynamic redistribution of work and explicit synchronization are unnecessary, the implementation is very simple.
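
The following is a highly simplified CUDA sketch of the resulting launch structure. `Node`, `expand`, and the capacity constants are hypothetical placeholders, and for clarity this version uses block-level barriers and per-round claiming, whereas the scheme in the paper avoids explicit synchronization:

```cpp
#define STACK_CAP 512  // per-block stack capacity (fits in shared memory)
#define MAX_KIDS 4     // e.g., the 15-puzzle has at most 4 moves

struct Node { /* packed state and g-value */ };

// Domain-specific (placeholder): expands n, writes children whose
// f = g + h is within the bound to out, and returns true if n is a goal.
__device__ bool expand(const Node& n, int bound, Node* out, int* n_out);

__global__ void block_parallel_ida(const Node* roots, int bound, int* found) {
    __shared__ Node stack[STACK_CAP]; // per-block open stack
    __shared__ int top;               // current stack size

    if (threadIdx.x == 0) {           // block b owns the subtree at roots[b]
        stack[0] = roots[blockIdx.x];
        top = 1;
    }
    __syncthreads();

    while (top > 0) {
        // Claim phase: each thread pops at most one node.
        Node mine;
        bool have = false;
        int idx = atomicSub(&top, 1) - 1;
        if (idx >= 0) { mine = stack[idx]; have = true; }
        else atomicAdd(&top, 1);      // stack was empty; undo the pop
        __syncthreads();

        // Expand phase: push surviving children back (capacity checks omitted).
        if (have) {
            Node kids[MAX_KIDS];
            int nk = 0;
            if (expand(mine, bound, kids, &nk)) atomicExch(found, 1);
            int base = atomicAdd(&top, nk);
            for (int i = 0; i < nk; ++i) stack[base + i] = kids[i];
        }
        __syncthreads();
    }
}
```

Launching one block per frontier node, with far more blocks than streaming multiprocessors, lets the hardware scheduler start the next block as soon as one finishes; this is what evens out the load without any inter-block coordination.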

Experiment

We confirmed the following through experiments:

  • On a 1536-core GPU, parallel search is 660 times faster than sequential search.
  • Work distribution is almost perfectly even.
  • Parallel search on one 1536-core GPU is five times faster than sequential search on a CPU. (It is effectively seven times faster if we account for the overhead introduced by parallelization.)

Comments and future work

The achievement of this research, a five- (or seven-) fold speedup using a GPU, might not seem so impressive. However, I believe there is value in this result: we achieved an acceleration on a problem that seems ill-suited to GPUs.

I believe our method may scale well to multi-GPU environments without major modification, because it runs without global synchronization or communication. However, we did not get to confirm this hypothesis in this research, and other difficulties might emerge at the scale of 100 or 1000 GPUs.

Furthermore, it would be meaningful to evaluate the method on a wider variety of problems. Block-parallel IDA* is designed to be as problem-independent as possible, but small issues could still arise. For example, for some problems the shared memory may not be able to accommodate all the information needed for the search. (We believe this can be addressed simply by prefetching the necessary information from global memory into shared memory.) Clarifying the range of problems for which our method is effective, and resolving such issues, are left for future work.
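
As a rough sketch of what such staging could look like (the table name and size here are hypothetical):

```cpp
#define TABLE_WORDS 2048  // must fit within the per-block shared memory budget

// At kernel start, the block cooperatively copies a read-only lookup
// table (e.g., part of a heuristic table) from global into shared memory,
// so the search loop reads fast shared memory instead of global memory.
__global__ void search_with_staging(const int* __restrict__ table_global) {
    __shared__ int table[TABLE_WORDS];
    for (int i = threadIdx.x; i < TABLE_WORDS; i += blockDim.x)
        table[i] = table_global[i];
    __syncthreads();  // the table is ready before the search begins
    // ... the per-block search proceeds here, consulting `table` ...
}
```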

Call for V100

Programming GPUs is a joy, so I'm asking any oil magnate with an interest in combinatorial search technology to buy me a V100, which costs just over $100,000.