Caching increases map performance significantly, improving the experience for hundreds of millions of people zooming around our map. In addition to using a global edge network that brings our maps as close to the end user as possible, we’ve architected our own in-memory caching layer to sit behind our CDN, next to the origin servers that composite raw map data. This layer lets us serve low-latency maps to users all around the globe.
How our in-memory cache in Sydney, Australia helps us avoid a round trip back to Virginia for map data.
While our raw map data lives in only a few key regions around the world, our caching layer boosts performance in each additional origin region. This prevents faraway regions like Sydney, Australia from making slow, round-trip requests back to one of our core regions (for example, the eastern United States).
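To make the pattern concrete, here is a minimal read-through sketch in Python. The hostnames, key scheme, TTL, and the use of a Redis client are illustrative assumptions rather than our actual implementation: the regional cache answers whatever it can, and only a miss triggers the long round trip back to a core region.

```python
# Read-through sketch (hypothetical hostnames and key scheme): serve a tile
# from the regional in-memory cache when possible, and only fall back to a
# core-region origin on a miss.
import redis
import requests

REGIONAL_CACHE = redis.Redis(host="cache.ap-southeast-2.example.internal", port=6379)
CORE_ORIGIN = "https://origin.us-east-1.example.internal"
TILE_TTL_SECONDS = 3600  # assumed TTL, for illustration only

def get_tile(z: int, x: int, y: int) -> bytes:
    key = f"tile:{z}/{x}/{y}"
    cached = REGIONAL_CACHE.get(key)
    if cached is not None:
        return cached  # fast path: no trip back to the core region
    # Miss: make the long round trip once, then keep the tile nearby.
    resp = requests.get(f"{CORE_ORIGIN}/tiles/{z}/{x}/{y}", timeout=5)
    resp.raise_for_status()
    REGIONAL_CACHE.set(key, resp.content, ex=TILE_TTL_SECONDS)
    return resp.content
```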
Running on the AWS EC2 Spot market
Caching technology like this not only lets us serve maps faster to mobile phones around the world, it also saves us significant costs in operating our platform.
We modeled this homegrown caching service after ElastiCache, AWS’s managed in-memory store. We built it in-house in order to run on the AWS EC2 Spot market: our entire platform runs on Spot, which saves us 50-90% on our EC2 costs. ElastiCache doesn’t support Spot, so we rolled our own service that does.
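For illustration, here is a hedged boto3 sketch of requesting Spot capacity for cache nodes. The AMI ID, instance type, bid price, and security group are placeholders, not the values we actually run with.

```python
# Sketch of launching cache nodes on the Spot market with boto3.
# All identifiers below are placeholders for illustration.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.request_spot_instances(
    SpotPrice="0.10",           # maximum hourly price we're willing to pay (placeholder)
    InstanceCount=3,
    Type="persistent",          # re-request capacity if instances are reclaimed
    LaunchSpecification={
        "ImageId": "ami-0123456789abcdef0",    # placeholder cache-node AMI
        "InstanceType": "r5.xlarge",           # memory-optimized for an in-memory store
        "SecurityGroupIds": ["sg-0123456789abcdef0"],
    },
)

for req in response["SpotInstanceRequests"]:
    print(req["SpotInstanceRequestId"], req["State"])
```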
Global cache replacement
Our growth over the last year demanded a re-architecture of our cache (specifically, how we manage sharding) in order to support our scale. This upgrade, which we deployed last week, required a complete replacement of our old caching layer in each of the 7 regions where we run our Maps API. We performed the replacement one region at a time. A cache replacement can cause a severe latency penalty for end users, so we avoided this by re-routing all traffic around a region for the duration of the deploy. This happened at the DNS level, and once we had performed the upgrade, we pre-warmed each new cache before putting it back in service.
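As a sketch of that pre-warming step, the snippet below replays a set of popular tile requests against a freshly replaced regional endpoint before it takes live traffic again. The endpoint and the way popular tiles are chosen are assumptions for illustration, not our exact tooling.

```python
# Pre-warming sketch (hypothetical endpoint and seed list): before a freshly
# replaced cache goes back into service, replay the region's most popular tile
# requests so real traffic doesn't land on a cold cache.
import requests

REGIONAL_ENDPOINT = "https://maps.ap-southeast-2.example.internal"

def prewarm(popular_tiles):
    """popular_tiles: iterable of (z, x, y) tuples, e.g. drawn from recent access logs."""
    warmed = 0
    for z, x, y in popular_tiles:
        resp = requests.get(f"{REGIONAL_ENDPOINT}/tiles/{z}/{x}/{y}", timeout=5)
        if resp.ok:
            warmed += 1
    return warmed

# Example seed: the top of the zoom pyramid, plus whatever the logs say is hot.
seed = [(z, x, y) for z in range(0, 4) for x in range(2 ** z) for y in range(2 ** z)]
print(f"warmed {prewarm(seed)} tiles")
```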
Routing traffic between two of our North America regions during a cache replacement
The trick is making sure that when we reroute traffic around one region, the other regions can handle the additional traffic. If we aren’t careful, the extra traffic could look and act like a sophisticated DDoS attack on a smaller region, potentially causing our origin infrastructure there to buckle under the load. These delicate procedures are nearing full automation, with the appropriate seatbelts. We lovingly call the DNS-level traffic director our “kill-switch,” and we can control it with a few keystrokes in Slack.
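One way a DNS-level kill-switch like this could be implemented is with weighted records. The sketch below uses Route 53 via boto3 purely as an example; the DNS provider, hosted zone, record name, and endpoints are all assumptions for illustration. Dropping a region’s weight to zero drains it before a cache replacement; restoring the weight brings it back.

```python
# Sketch of a DNS-level "kill-switch" using weighted records. Route 53 and all
# identifiers below are illustrative assumptions, not a description of our setup.
import boto3

route53 = boto3.client("route53")
HOSTED_ZONE_ID = "Z0123456789EXAMPLE"   # placeholder hosted zone
RECORD_NAME = "api.maps.example.com."   # placeholder record

def set_region_weight(region_endpoint: str, set_identifier: str, weight: int):
    """Set weight=0 to drain a region before a cache replacement, then restore it."""
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Comment": f"kill-switch: weight {set_identifier} -> {weight}",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": RECORD_NAME,
                    "Type": "CNAME",
                    "SetIdentifier": set_identifier,
                    "Weight": weight,
                    "TTL": 60,
                    "ResourceRecords": [{"Value": region_endpoint}],
                },
            }],
        },
    )

# Drain Sydney while its cache is replaced, then bring it back.
set_region_weight("maps.ap-southeast-2.example.internal", "ap-southeast-2", 0)
# ... perform the replacement and pre-warm ...
set_region_weight("maps.ap-southeast-2.example.internal", "ap-southeast-2", 100)
```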
Does building low-cost, resilient caching infrastructure & redirecting traffic around the world sound like fun? The Platform team is hiring!