Last year, I explained how we use dynamic shape simplification and partitioning algorithms to renderhuge polygon and polyline datasets in Mapbox GL JS. This approach unfortunately doesn’t work for individual points: so how do we display hundreds of thousands of points in a meaningful way on a map?
Mapbox GL JS recently got a new feature to address this — point clustering. Let’s see how it can handle loading 40MB of GeoJSON data with 400,000 points in the browser:
By adjusting the style of the clusters, we can also turn this into a heatmap:
After a few seconds loading the data, you can browse it smoothly at 60fps. To enable this kind of performance, I built a new JavaScript clustering library calledSupercluster, which now powers Mapbox GL JS. Next, we’re porting Supercluster to C++ to bring clustering to our iOS and Android SDKs.
Let’s find out how Supercluster works under the hood.
Hierarchical greedy clustering
A simple and efficient way to group points into clusters for a specific zoom is called greedy clustering. It works like this:
- Start with any point from the dataset.
- Find all points within a certain radius around that point.
- Form a new cluster with the nearby points.
- Choose a new point that isn’t part of a cluster, and repeat until we have visited all the points.
This would be expensive to do for every zoom level of the map — for example, if zoom levels range from 0 to 18, we would have to process the whole dataset 19 times, and clustering would get too slow on lower zoom levels because each cluster would need to accommodate an exponentially increasing number of points.
We avoid this problem by reusing calculations. After we cluster the points on zoom level 18, we can group the resulting clusters (with weighted centroids) into new z17 clusters. Then we can use z17 clusters to form z16 clusters, and so on. Since the number of points we need to process with each such step decreases exponentially, we can cluster points for all zoom levels really fast.
This approach is called hierarchical greedy clustering, and was popularized by Dave Leaver with his fantastic Leaflet.markercluster plugin. Unlike more sophisticated clustering algorithms, it can be fast enough to handle millions of points in the browser, and it’s good enough to use for browsing point datasets on an interactive map.
Indexing points and clusters
Implementation of this clustering approach in an interactive map has two potentially expensive operations:
- Finding all points within a certain radius.
- Finding clusters on the current screen.
The simplest way to do a radius query is to loop through all the points and save those that are close enough to our query point. But since we need to do a lot of such queries – one for each potential cluster – this would be too slow. Similarly, we don’t want to loop through all the points for screen queries, which are necessary every time we drag or zoom the map.
To do both efficiently, we must use a spatial index— processing points into a special data structure once and then using it to do any number of later queries almost instantly.
Spatial indexing is an indispensable building block in many algorithms, so a lot of research went into building the fastest JavaScript implementation,RBush, which asks for a separate blog post.
Each clustering step for a specific zoom level requires indexing points from the earlier step. But after the clustering process, having an index for each zoom allows us to instantly query clusters for any map view.
Clustering performance
Here’s a breakdown of times for each clustering step for the 400,000 points dataset we’ve seen in the video:
399601 points prepared in 123ms
z16: indexed in 516ms clustered in 156ms 46805 clusters
z15: indexed in 53.4ms clustered in 40.8ms 20310 clusters
z14: indexed in 12.4ms clustered in 17.2ms 10632 clusters
z13: indexed in 7.9ms clustered in 12.9ms 6578 clusters
z12: indexed in 3.4ms clustered in 6.9ms 4514 clusters
z11: indexed in 2.4ms clustered in 3.4ms 3252 clusters
z10: indexed in 1.4ms clustered in 2.3ms 2329 clusters
z9: indexed in 1ms clustered in 1.5ms 1608 clusters
z8: indexed in 0.6ms clustered in 1.5ms 1083 clusters
z7: indexed in 0.4ms clustered in 0.7ms 721 clusters
z6: indexed in 0.2ms clustered in 0.4ms 421 clusters
z5: indexed in 0.1ms clustered in 0.2ms 202 clusters
z4: indexed in 0.1ms clustered in 0.1ms 82 clusters
z3: indexed in 0ms clustered in 0ms 34 clusters
z2: indexed in 0ms clustered in 0ms 14 clusters
z1: indexed in 0ms clustered in 0ms 7 clusters
z0: indexed in 0ms clustered in 0ms 3 clusters
total time: 972ms
Supercluster spends less than 1 second total processing this dataset. Since processing happens off the main thread (in a Web Worker), it doesn’t block map rendering. After the data is clustered, any screen query happens instantly (less than 1ms).
Using the library
Supercluster API is minimal:
// cluster GeoJSON pointsvarindex=supercluster({radius:40,maxZoom:16}).load(geojson.features);// get GeoJSON clusters given a bounding box and zoomvarclusters=index.getClusters([-180,-85,180,85],2);// get a JSON vector tile in the same format as GeoJSON-VTvartile=index.getTile(7,523,125);
Because Supercluster works so well for Mapbox GL JS, we released it as a standalone library so that other software can take advantage of its fast algorithms. For example, here’s a video of it being used together with Leaflet (which powers Mapbox.js) to browse 6 million points loaded in the browser (source code):
You can also use it to cluster points on the server with Node.js, and convert the resulting clusters into Mapbox vector tiles using the vt-pbf module by Anand Thakker.
We’re looking forward to seeing more amazing map apps with clustering from you! And don’t hesitate to read through the source code— it’s around 200 lines of code. Stay tuned for clustering support in our mobile SDKs, and hit me up on Twitter if you have any questions or comments.