Amazon DynamoDB Streams, which provide a streaming log of events for Amazon DynamoDB, officially launched today, and we’re using them to migrate our primary database to DynamoDB. We’re leaving behind a self-managed CouchDB cluster that served us well for years but required management overhead for our team. DynamoDB delivers even more stability, redundancy, and speed for our document-based data and requires less hands-on administration, so we can focus on building the software that powers Mapbox.
Data at a Global Scale
DynamoDB Streams unlock cross-region replication via our DynamoDB Replicator. Cross-region replication allows us to distribute data across the world for redundancy and speed. This is a key ingredient in how we serve maps fast to over 100 million unique users every month.
Mapbox stores copies of its data in AWS regions across the planet. When a user requests a map, the data is served from the closest database location, massively increasing the speed at which our maps are delivered and rendered. To keep these independent databases in sync, we need to replicate the data in every part of the world to every other part of the world, so that users in Australia are seeing the same maps as users in Virginia.
Having multiple redundant copies of the same data improves the availability of the Mapbox databases, ensuring that a failure in one region will not cause failure of the entire Mapbox infrastructure. When one region becomes slow or unavailable, database requests are automatically redirected to a stable region. Replica tables ensure users' data is safe and always available.
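The redirect-on-failure behavior described above can be illustrated with a small sketch. It is not how Mapbox routes requests (the post does not say where the redirect happens); it only shows the idea of trying the closest region first and falling back to the next one. The function name `read_with_failover` and the shape of the `readers` argument are assumptions made for this example.

```python
from typing import Any, Callable, Optional, Sequence, Tuple

# Hypothetical type: a region name paired with a function that reads one key
# from that region's replica table.
Reader = Tuple[str, Callable[[str], Any]]


def read_with_failover(readers: Sequence[Reader], key: str) -> Tuple[str, Any]:
    """Try each region in order and return (region, value) from the first that answers."""
    last_error: Optional[Exception] = None
    for region, read in readers:
        try:
            return region, read(key)
        except Exception as error:  # a real system would catch narrower errors
            last_error = error
    raise RuntimeError("all regions failed") from last_error
```

Listing the readers from closest to farthest means a healthy nearby region always answers first, and a distant replica is only used when the nearer ones fail.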
Amazon DynamoDB Streams
A DynamoDB Stream is a continuous pipeline of every modification made to a DynamoDB table. The moment a document is inserted, modified, or removed from the primary table, the DynamoDB Stream emits an event with information about the change, including the old and new versions of the modified document. Every DynamoDB table has the option to enable streams, which can be consumed by libraries provided by Amazon Web Services. While Amazon provides an open source library for using DynamoDB Streams to replicate tables across regions, this library needs to be run and maintained on a standalone server. To provide cross-region replication without the need to set up self-managed EC2 instances, we turned to AWS Lambda.
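To make the change event described above concrete, here is a hand-written example of a single stream record as a Lambda function would receive it, with a small helper that pulls out the old and new versions of the document. The field names (`eventName`, `Keys`, `OldImage`, `NewImage`, `SequenceNumber`) follow the DynamoDB Streams record format; the sample values and the helper `summarize_change` are invented for illustration.

```python
from typing import Any, Dict

# Invented sample: one MODIFY event on a table whose primary key is "id".
# Attribute values use DynamoDB's typed JSON form, e.g. {"S": "text"}.
SAMPLE_RECORD: Dict[str, Any] = {
    "eventName": "MODIFY",  # one of INSERT, MODIFY, REMOVE
    "dynamodb": {
        "Keys": {"id": {"S": "map-123"}},
        "OldImage": {"id": {"S": "map-123"}, "name": {"S": "Draft"}},
        "NewImage": {"id": {"S": "map-123"}, "name": {"S": "Streets"}},
        "SequenceNumber": "111",
        "StreamViewType": "NEW_AND_OLD_IMAGES",
    },
}


def summarize_change(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return the action, keys, and old/new document versions of one record."""
    change = record["dynamodb"]
    return {
        "action": record["eventName"],
        "keys": change["Keys"],
        # INSERT events have no OldImage and REMOVE events have no NewImage.
        "old": change.get("OldImage"),
        "new": change.get("NewImage"),
    }
```

Both images are only present when the stream is configured with the `NEW_AND_OLD_IMAGES` view type.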
Lambda is a service that runs code in response to events, such as records arriving on a stream, without the need to manage the actual hardware that the function is executed on. The Lambda function we created processes events that are emitted by the DynamoDB Stream from the primary table, and replays those modifications onto the replica table in another region. Using Lambda, we reduce our management overhead even further and can focus on providing the most efficient and robust structure for serving maps to our users.
Every write to the primary table invokes the Lambda function, which writes the record to the replica tables.
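The replay step can be sketched roughly as follows. This is not the DynamoDB Replicator source, which is a separate project; it is a minimal Python illustration that assumes the stream delivers new images and that `replica_client` behaves like a boto3 DynamoDB client pointed at the replica's region. The function names and the table name are hypothetical, and binary attributes, retries, and ordering concerns are ignored.

```python
from typing import Any, Dict, Tuple


def to_replica_operation(record: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Translate one stream record into the write to replay on the replica."""
    change = record["dynamodb"]
    if record["eventName"] == "REMOVE":
        return ("delete", change["Keys"])
    # INSERT and MODIFY both become a put of the full new document.
    return ("put", change["NewImage"])


def replicate(event: Dict[str, Any], replica_client: Any, table_name: str) -> int:
    """Replay every record in a Lambda stream event; return how many were applied."""
    applied = 0
    for record in event.get("Records", []):
        action, payload = to_replica_operation(record)
        if action == "delete":
            replica_client.delete_item(TableName=table_name, Key=payload)
        else:
            replica_client.put_item(TableName=table_name, Item=payload)
        applied += 1
    return applied
```

Replaying a modification as a full put keeps the replica's copy identical to the primary's new version without having to compute a diff.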
The possibilities with DynamoDB Streams aren’t limited to replication. Stream events are also used to create backups of data on S3 that capture every version of a document, which can help us recover from multiple types of failure quickly.
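One way to keep every version of a document, as described above, is to give each stream event its own object key. The sketch below is an assumption about how such keys could be laid out, not the layout Mapbox uses: the `prefix/table/item/sequence` scheme and the function names are hypothetical. It only builds the key and body; uploading them to S3 is left out.

```python
import hashlib
import json
from typing import Any, Dict


def backup_key(prefix: str, table: str, record: Dict[str, Any]) -> str:
    """Build an object key that is unique per document version."""
    change = record["dynamodb"]
    # Hash the primary key so any key schema maps to a safe path segment.
    canonical_keys = json.dumps(change["Keys"], sort_keys=True)
    item_id = hashlib.md5(canonical_keys.encode("utf-8")).hexdigest()
    # The stream's sequence number distinguishes successive versions.
    return f"{prefix}/{table}/{item_id}/{change['SequenceNumber']}"


def backup_body(record: Dict[str, Any]) -> str:
    """Serialize the version to store; a REMOVE is kept as an item of null."""
    change = record["dynamodb"]
    return json.dumps(
        {
            "action": record["eventName"],
            "keys": change["Keys"],
            "item": change.get("NewImage"),
        },
        sort_keys=True,
    )
```

Because all versions of one document share a key prefix, listing that prefix yields the document's full history, including the point at which it was deleted.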
Introducing DynamoDB Replicator
At Mapbox, we create open source tools that developers everywhere can use to improve the way they handle geospatial problems. As part of the transition to DynamoDB, we have developed our tools in the open, and the source code for these tools can be found on GitHub:
- Dyno, a JavaScript wrapper around the AWS DynamoDB SDK. Dyno abstracts away the verbose DynamoDB API and implements batch methods which abstract over per-request limits.
- Streambot, an automation tool for deploying to Lambda. Streambot provides tools that simplify the process of packaging and deploying a Lambda function and running it as a persistent service.
- DynamoDB Replicator, the Lambda function code we use to process DynamoDB Streams. It is a small library packed with tools to perform replication between tables, create backups, and check replica tables for consistency with the primary table. The library is designed so that it can be used on any DynamoDB table.
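As an illustration of what "abstracting over per-request limits" means for the batch methods mentioned in the list above: DynamoDB's `BatchWriteItem` call accepts at most 25 put or delete requests at a time, so a wrapper has to split larger batches. The sketch below is written in Python rather than Dyno's JavaScript and does not reproduce Dyno's API; `chunk_requests` is a hypothetical helper.

```python
from typing import Any, Iterator, List, Sequence

# BatchWriteItem accepts at most 25 put/delete requests per call.
BATCH_WRITE_LIMIT = 25


def chunk_requests(requests: Sequence[Any], size: int = BATCH_WRITE_LIMIT) -> Iterator[List[Any]]:
    """Yield consecutive slices of `requests`, each no longer than `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(requests), size):
        yield list(requests[start:start + size])
```

A caller can then issue one batch write per chunk, so writing 60 documents becomes three requests instead of one that would be rejected.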
With these tools, developers have the ability to build systems as reliable as our high-volume infrastructure. We’ll be using them to run the fastest, most reliable spatial infrastructure at scale.