A new Spatial Hashing approach to Neighborhood Queries on the GPU
A lot of algorithms hinge on knowing, for any point $\vec p$ in a set $P$, its neighbors closer than a fixed radius of influence $R$:
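\begin{equation*} N(\vec p) = \left\{\, \vec q \in P \;:\; \lVert \vec q - \vec p \rVert \leq R \,\right\} \end{equation*}

(Whether to use $\leq$ or $<$, and whether to exclude $\vec p$ itself, is a matter of convention.)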
This problem is known as the fixed-radius near neighbors problem – or FRNN. Algorithms for solving it have applications ranging from particle simulations, to protein folding, to statistics, and more.
Having a good GPU implementation of an algorithm to compute this set is crucial, not just for the raw speedup a GPU can provide, but because many of the problems where this kind of neighborhood query shows up are otherwise embarrassingly parallel in nature, so you want to solve it with data already living on the GPU.
Existing algorithms fall short because they are not well adapted to GPUs, or because they impose restrictions on the simulation domain which in turn restrict the algorithms' usefulness. In particular, GPU algorithms often restrict the simulation domain to a finite grid, which forces you to consider edge cases when particles can hit the boundary. Approaches based on spatial hashing have been devised which don't impose this restriction, but the bookkeeping stemming from hash collisions makes hash tables tricky to implement efficiently on the GPU.
In this post, I want to present an alternative algorithm for constructing a hash table-like acceleration structure for this sort of query. This was part of the work on particle simulations I did to obtain my master's thesis.
The key insight is that you can construct spatial hash functions that preserve corresponding equivalence classes between positions in the simulation domain and buckets in the hash table, completely eliminating hash collisions in the immediate vicinity of a particle, thus removing the need to do bookkeeping in the hash table implementation. This makes the hash table easy to implement on the GPU.
Existing Algorithms
Parallel fixed-radius neighbor search algorithms generally work by imposing a regular square grid, with spacing $\Delta \geq R$, and narrowing down the search to adjacent grid cells. Since $\Delta \geq R$, two points within distance $R$ of each other can be at most one cell apart along each axis, so only points in adjacent grid cells are neighbor candidates:
This superset of points is much easier to find, and in many cases, such as when the radius $R$ represents the range of a force which diminishes quickly as a function of distance, finding an approximation of the set of neighbors is good enough.
If you really need the set of neighbors exactly, the superset is still a decent starting point: You could for instance try masking out the contributions of false positives using a step function to minimize thread divergence.
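As a sketch of what that masking might look like – plain Julia, with `f` standing in for a hypothetical per-pair contribution function:

```julia
# Accumulate per-pair contributions over neighbor candidates, masking out
# false positives with a 0/1 step factor instead of branching on the test.
function masked_sum(f, p, candidates, R)
    total = 0.0
    for q in candidates
        d2 = sum(abs2, p .- q)          # squared distance, no sqrt needed
        total += (d2 <= R^2) * f(p, q)  # step function: false positives add 0
    end
    return total
end
```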
It is simple to compute what cell a point is in by rounding its coordinates down to multiples of $\Delta$:
\begin{equation*} \texttt{cell}(x_1, x_2) = \left( \left\lfloor \frac {x_1} \Delta \right\rfloor , \left\lfloor \frac {x_2} \Delta \right\rfloor \right) \end{equation*}
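In code, the cell computation is a one-liner; here is a plain-Julia sketch (a GPU kernel would do the same arithmetic):

```julia
# Cell coordinates of a point: round each coordinate down to a multiple of Δ.
cell(x₁, x₂, Δ) = (floor(Int, x₁ / Δ), floor(Int, x₂ / Δ))

cell(3.7, -1.2, 1.0)  # == (3, -2) – the floor handles negative coordinates
```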
Restricting your search only to points in adjacent grid cells is an extremely effective way to narrow down the search space. But herein lies the challenge: it is easy to figure out what cell a point is in, but how do you efficiently figure out which points are in the cells adjacent to it?

Our ambition is, based on knowing which cell a particle is in, to build an acceleration structure for answering this inverse question. The construction needs to scale to the thousands of threads running on a GPU.
The final algorithm builds on two existing approaches, so let's first take a look at where existing algorithms fall short.
Index Sorting
One particular challenge of FRNN is that you don't know ahead of time how many neighbors a point has, so even just storing the set of neighbors compactly is a challenge. GPUs are, to put it mildly, not particularly well suited for dynamically reallocating and growing buffers on the fly, so you need to allocate space up front – but on the other hand, you don't actually know how much space you need.
In some cases, for example in protein folding or incompressible fluid simulations, intermolecular forces put a hard limit on how densely the points can pack, so you can place a reliable bound on the number of particles in a cell. In general, however, there is no limit to how many points a cell can contain.
Doing a full global sort of the particles sounds like lunacy, but in practice it can perform quite well, and it solves the issue of memory allocation. With a finite set of cells, you can sort the particles by cell index using a parallel counting sort. Cells can be assigned an index using classic strided indexing: for an $m \times n$ grid and $\texttt{cell}(\vec x) = (i, j)$,
\begin{equation*} \texttt{index}(\vec x) = nj + i \end{equation*}
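Or, mirroring the formula in Julia (0-based, with $i$ the column and $j$ the row):

```julia
# Strided index of cell (i, j) on an m × n grid, exactly as in the formula.
index(i, j, n) = n * j + i
```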
Sorting the points based on cell indices nicely arranges the points in a cell contiguously in memory, effectively repurposing the particle buffer as several dynamically sized arrays, one per cell.

Let's work through an example. To start, we compute for each particle its corresponding cell index:
Then we sort the particles by cell index, and compute a lookup table storing where each cell's slice starts.
In practice, you basically get this lookup table for free when you perform a counting sort, as you need to calculate the offset at which to store points with different cell indices. Storing a similar lookup table for the ends of cells simplifies iteration over the slices, and this can be computed by adding the counts from the counting sort to the offset.
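To make the construction concrete, here is a minimal serial Julia sketch; a real implementation would do the histogram and prefix sum with parallel GPU primitives, and the names are my own. `cellof[p]` is assumed to hold particle `p`'s cell index, shifted to 1-based (`index(i, j, n) + 1`):

```julia
# Counting sort of particle ids by cell index, yielding the sorted order plus
# lookup tables with the start and end of each cell's slice.
function build_cell_lists(cellof::Vector{Int}, ncells::Int)
    counts = zeros(Int, ncells)
    for c in cellof                          # histogram: particles per cell
        counts[c] += 1
    end
    starts = cumsum(counts) .- counts .+ 1   # exclusive prefix sum of counts
    ends   = starts .+ counts .- 1           # slice ends come for free
    order  = Vector{Int}(undef, length(cellof))
    cursor = copy(starts)
    for (p, c) in enumerate(cellof)          # scatter each particle id into
        order[cursor[c]] = p                 # its cell's slice
        cursor[c] += 1
    end
    return order, starts, ends  # cell c's particles: order[starts[c]:ends[c]]
end
```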
An alternative approach is to write a kernel that compares consecutive cell indices in the sorted buffer and records the boundaries where they change.
With this structure, you can find all the neighbor candidates of a point by computing its cell, offsetting the index to find adjacent cells, look up the slices in the lookup table, and iterate just over these slices. For example, to find the set of neighbors of $C$, the indices of adjacent cells are $\{ 1,2,3, 5,6,7, 9,10,11\}$, and by following the lookup table, we narrow down the set of neighboring points to the following set of slices:
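In code, the traversal might look like the following sketch, reusing `index` and the tables from `build_cell_lists` above (clamping at the grid boundary is elided – which is exactly the edge-case handling a finite grid forces on you):

```julia
# Visit every neighbor candidate of a point in cell (i, j): loop over the 3×3
# block of adjacent cells and scan each cell's slice of the sorted buffer.
function foreach_candidate(f, i, j, n, order, starts, ends)
    for dj in -1:1, di in -1:1
        c = index(i + di, j + dj, n) + 1  # adjacent cell, shifted to 1-based
        for k in starts[c]:ends[c]
            f(order[k])                   # order[k] is a candidate particle id
        end
    end
end
```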
This approach is simple and fast, but assigning a unique index in the lookup table to every cell means we are restricted to a finite number of cells!
A common strategy for mapping a large set into a small table is using hashing. You might reasonably assume that we could simply replace the cell indices by a simple spatial hash, but the devil is in the details.
Naive Spatial Hashing
Approaches incorporating spatial hashing have already been implemented with some pretty nice results, but have failed to gain traction on the GPU.
Since you can't preallocate space for the buckets ahead of time, this hash table can not be implemented exactly as described, but sorting particles by a spatial hash is a nice starting point. This way, you could theoretically get the benefit of index sorting, namely repurposing the particle buffer as a set of dynamically sized arrays, and also the benefits of hashing, namely mapping an infinite simulation domain into a finite table. Compared to index sorting, you only pay the extra overhead of calculating a simple hash.
The problem is that hashes can collide within the neighborhood of a particle, which means that the set of neighbors can not be found by naively traversing the slices without some additional bookkeeping to deal with hash collisions, or we risk counting particles more than once.
As an example, consider the following assignment of hash values to cell indices:
\begin{equation*} \left\{\hspace{.5em} \begin{array}{lccl} \texttt{hash} : & 2 & \longmapsto & 3 \\ & 3 & \longmapsto & 1 \\ & 4 & \longmapsto & 0 \\ & 6 & \longmapsto & 2 \\ & 7 & \longmapsto & 5 \\ & 8 & \longmapsto & 2 \\ & 9 & \longmapsto & 3 \end{array} \right. \end{equation*}

To compute the hash of a point, we just apply $\texttt{hash}$ to its cell index.
This is a pretty stupid hash function – it is not even total! It was simply constructed to demonstrate what happens in the event of a hash collision.
Notice that particles in cells $2$ and $9$ both hash to $3$, so the bucket with index $3$ in the hash table will contain the set $\{ B, H, I \}$. Now, consider searching for neighbors of $C$:
If we naively traverse $C$'s adjacent cells $\{ 1,2,3 , 5,6,7 , 9,10,11 \}$ and compute their hashes, the set $\{ B, H, I \}$ will correspond to two adjacent grid cells – both $2$ and $9$:
If this is not deduplicated, and we just move on to iterating over the slice corresponding to each adjacent cell, these particles will in turn be counted twice.
Obviously, this is bad.
This forces deduplication further down the line – or, in the worst case, if we don't do anything about it, we have a subtle bug where in rare cases points contribute twice to a mean, particles contribute their forces twice to a pressure gradient, and so on.
This deduplication is fairly trivial on the CPU; there are any number of ways to solve it: you could store the visited cell hashes in a hash set, or do a quick scan of the hashes visited so far and skip repeat visits, to name two simple solutions. Crucially, on the CPU one thread doesn't pay for the bookkeeping done on another, so simple bookkeeping is cheap. Not so on the GPU: any test of whether a hash leads to a repeat visit, or any comparison of hashes to remove duplicates, is an opportunity for the program to branch, causing serialized execution. Highly divergent, branchy, loopy code should be avoided if at all possible.
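For contrast, here is roughly what the second of those CPU-side fixes looks like – a sketch, with `bucket_of` standing in for whatever maps a cell index to its bucket:

```julia
# Visit each bucket of the adjacent cells at most once, by scanning the
# buckets already visited. Trivial on a CPU, but the `in` test and the early
# `continue` are exactly the kind of branching that diverges on a GPU.
function visit_buckets_once(f, bucket_of, adjacent_cells)
    seen = Int[]
    for c in adjacent_cells
        b = bucket_of(c)
        b in seen && continue   # repeat visit: skip it
        push!(seen, b)
        f(b)
    end
end
```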
This seems like quite the pickle.
It would be best, it seems, if we could avoid hash collisions specifically in the neighborhood of a particle.
"Locally Perfect" Spatial Hashing
Perfect hash functions are hash functions specifically constructed to have no collisions for a given finite set of keys. They are a common optimization for hash tables where the keys are known ahead of time, because you can guarantee $O(1)$ access and don't need to do any probing. Our set of keys – the cells – is an infinite set, and mapping an infinite set of keys into a finite hash table without collisions is impossible, so a perfect hash function simply cannot be constructed.
While perfect hashing is not going to be possible, we can still take some pointers from this technique and try to solve the deduplication at the hash level. The simulation domain may be infinite, but the immediate vicinity of a point contains a predictable, finite set of cells.
Let's zoom in on the neighborhood of a point!
The idea is to impose a finite number of equivalence classes on the cells, in such a way that all the cells in a particle's neighborhood are in different equivalence classes. For a $3 \times 3$ neighborhood in 2D space, we need 9 equivalence classes:
For a $3 \times 3 \times 3$ neighborhood in 3D space, we would need 27, and so on. In general, we need $M = 3^d$ equivalence classes for $d$-dimensional space, represented as the set $\mathbb Z _{M}$ of integers modulo $M$.
A simple way to assign equivalence classes to cells is to impose a structure repeating in all directions using modular arithmetic, and compute a strided index. For $\texttt{cell}(\vec x) = (i, j)$:
\begin{equation*} \kappa = 3 \left( i \;\texttt{mod}\; 3 \right) + \left( j \;\texttt{mod}\; 3 \right) \end{equation*}

This produces the assignment of classes illustrated above.
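In Julia this is a one-liner – note `mod`, which, unlike `rem`, stays non-negative for negative cell coordinates (the domain is unbounded, so negative coordinates do occur):

```julia
# Equivalence class of cell (i, j): coordinates mod 3, strided into 0..8.
κ(i, j) = 3 * mod(i, 3) + mod(j, 3)

κ(-1, 4)  # == 3 * 2 + 1 == 7
```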
Next, we want to impose corresponding equivalence classes on the buckets in the hash table. A simple way to do this is to say that each index $\texttt{mod}\, M$ is its own equivalence class. In other words, for $M=3^2$, the buckets $\{ 1, 10, 19, \dots \}$ form one equivalence class, the buckets $\{ 2, 11, 20, \dots \}$ form another, and so on.
The final piece of the puzzle is a hash function that preserves class information between the simulation domain and the hash table. In other words, the hash function needs to map particles in grid cells with $\kappa = 1$ to one of the buckets $\{ 1, 10, \dots \}$, particles with $\kappa = 2$ to $\{ 2, 11, \dots \}$, and so on.
The hash function
For $d = 2$, $M=3^2$, and a hash table of size $K = MN$, a hash function can be defined as follows:
\begin{equation*} \left\{\hspace{.5em} \begin{array}{lccl} \texttt{hash} : & \mathbb R^2 & \longrightarrow & \mathbb Z_K \\ & \vec x & \longmapsto & M \beta + \kappa \end{array} \right. \end{equation*}

where $\kappa$ is the equivalence class of the cell containing $\vec x$, and
\begin{equation*} \beta = (P_1 \, i \;\texttt{xor}\; P_2 \, j) \;\texttt{mod}\; N \end{equation*}

selects a "bucket" within the class $\kappa$. Here $P_1$ and $P_2$ are large prime numbers, and the $\texttt{xor}$ is taken bitwise. The reason this works is that adding a multiple of $M$ to $\kappa$ does not change its value $\texttt{mod}\; M$, so $\texttt{hash}(\vec x)$ preserves the cyclic structure of $\kappa$!
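Putting $\kappa$, $\beta$ and the strided index together in a Julia sketch (the two primes are a pair commonly used for spatial hashing, and the value of $N$ is an arbitrary illustrative choice):

```julia
# Locally perfect spatial hash for d = 2, M = 3² = 9, table size K = M * N.
const N  = 1 << 16      # buckets per equivalence class; tune for table size
const P₁ = 73856093     # two large primes for the mixing step
const P₂ = 19349663

function hash_cell(i::Int, j::Int)
    κ = 3 * mod(i, 3) + mod(j, 3)      # equivalence class, 0..8
    β = mod(xor(P₁ * i, P₂ * j), N)    # bucket within the class, 0..N-1
    return 9 * β + κ                   # strided index into 0..9N-1
end

hash_position(x₁, x₂, Δ) = hash_cell(floor(Int, x₁ / Δ), floor(Int, x₂ / Δ))
```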
The choice of $\beta$ is pretty much arbitrary; almost any hash function would do. The main consideration is that $\beta$ should distribute the cells belonging to each class evenly across that class's buckets.
The crux of the hash function is using the strided indexing $M \beta + \kappa$ to "multiplex" $\beta$ between equivalence classes $\kappa$. Any hash function constructed in this way ensures that there are no collisions between cells in the neighborhood of a point.
For example, looking at the same neighborhood as before,
we see that the cells in this neighborhood do indeed end up in distinct "columns" of the hash table:
Thus, there is no need to do any bookkeeping to deduplicate hash collisions between cells. Every cell adjacent to a point will always hash to a distinct class, and thus to a distinct bucket, and thus correspond to a distinct slice of the underlying sorted buffer of particles; there is no longer any risk of scanning the same slice twice when iterating over adjacent grid cells.
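This is easy to sanity-check numerically with the `hash_cell` sketch from above:

```julia
# Any 3×3 neighborhood hashes to 9 distinct buckets: the bucket index is
# congruent to κ mod 9, and all nine κ values in a neighborhood differ.
for _ in 1:10_000
    i, j = rand(-10^9:10^9), rand(-10^9:10^9)
    buckets = [hash_cell(i + di, j + dj) for di in -1:1, dj in -1:1]
    @assert length(unique(buckets)) == 9
end
```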
The size of the hash table can be changed freely by modifying $N$. This allows you to tune the hash collision rate depending on performance and memory constraints, while retaining locally perfect hashing.
In other words, a hash-based sorting scheme with this kind of hash function actually works.
Notes on the choice of hash function
The hash function determines the order of the particles in memory, and as such its choice can have a significant impact on performance. Ideally, the hash function should be chosen with the GPU kernels' memory access patterns in mind, to minimize uncoalesced reads and writes.
The hash function I presented here is almost certainly not optimal with respect to memory access patterns; it was simply chosen because it demonstrates the point in a simple manner and lends itself nicely to illustrations.
I reckon optimizing the interaction between how you structure the kernels implementing the table construction and how the hash function scatters particles in memory is an interesting topic to investigate in its own right, and maybe a good starting point for a future post.
Where's the code?
While working on my thesis, I implemented the outlined algorithm using CUDA.jl, but since CUDA code is not portable, and CUDA itself is proprietary, I don't think it would be very satisfying to publish; many people – particularly laptop users – would not be able to run it.
Besides, I think the post is long enough as is; I would quite like to survey the landscape of portable GPU programming in Julia, and make a follow-up post.