Probe data is the byproduct of sensors and applications: hidden in logs, pings, and metadata lie tiny morsels of information like position and time. Individually, every record has little meaning, but the aggregate can be extremely valuable for augmenting other datasets.
We’re starting to use probe data as a cue for semi-automated improvements to OpenStreetMap. With algorithms that filter out noise and infer things like directionality and popularity of routes, we can identify roads missing from OpenStreetMap, turn restrictions, and even speed limits for Smart Directions. The data also reveals finer-grained rules such as time-based turn restrictions: for example, you can’t turn left from Main St. onto Broadway between 4pm and 6pm, Monday through Friday. The number of applications is amazing.
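To make the time-based case concrete, here is a minimal sketch of how such a restriction might be spotted in aggregated turn counts. It is an illustration only, not our production algorithm: the function name `restricted_hours`, the `(hour, maneuver)` record shape, and both thresholds are assumptions made up for this example.

```python
from collections import Counter


def restricted_hours(observations, maneuver="left", min_traffic=20, max_share=0.01):
    """Return the sorted hours in which `maneuver` is almost never observed.

    observations: iterable of (hour, maneuver) pairs for a single approach to
    one intersection. Both thresholds are illustrative assumptions.
    """
    total = Counter()  # all observed maneuvers per hour
    turns = Counter()  # observations of the maneuver of interest per hour
    for hour, observed in observations:
        total[hour] += 1
        if observed == maneuver:
            turns[hour] += 1
    # Flag an hour only if there is enough traffic to trust the absence of turns.
    return sorted(
        hour
        for hour, count in total.items()
        if count >= min_traffic and turns[hour] / count <= max_share
    )


# Hypothetical sample: left turns are common at 2pm and 3pm, absent from 4pm to 6pm.
sample = []
for hour in (14, 15):
    sample += [(hour, "through")] * 30 + [(hour, "left")] * 10
for hour in (16, 17):
    sample += [(hour, "through")] * 40
print(restricted_hours(sample))
```

The minimum-traffic check matters: an hour with only a handful of traces says nothing about whether a turn is allowed, so quiet hours are never flagged.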
Here’s a look at data from around the city of Denver shared with us by a partner. In a matter of minutes, our sophisticated approach matched data from 5,000 of the probes with high accuracy. Not only did we match the data, we also classified the quality of the results for further processing. To ensure robust anonymity, we remove all identifiers from datapoints, publish only aggregated and fuzzed data, and elide areas where low density makes deanonymization possible.
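The anonymization steps in that last sentence can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the pipeline we run: the `device_id` field, the function name `aggregate_probes`, the grid size, and the minimum count are all hypothetical, and snapping to a grid stands in for whatever fuzzing is actually applied.

```python
import math
from collections import Counter


def aggregate_probes(points, cell_deg=0.001, min_count=5):
    """Aggregate probe points into grid-cell counts and drop sparse cells.

    points: iterable of dicts with 'lon' and 'lat' keys; any other keys
    (such as a hypothetical 'device_id') are ignored and never copied.
    """
    cells = Counter()
    for point in points:
        # Coarsen the position to a grid cell so exact coordinates are not kept.
        cell = (
            math.floor(point["lon"] / cell_deg),
            math.floor(point["lat"] / cell_deg),
        )
        cells[cell] += 1
    # Elide low-density cells, where a few points could identify a person.
    return {cell: count for cell, count in cells.items() if count >= min_count}


# Hypothetical sample: six probes on one downtown block, one isolated probe.
probes = [
    {"device_id": f"d{i}", "lon": -104.9903 + i * 0.00001, "lat": 39.7392}
    for i in range(6)
]
probes.append({"device_id": "lonely", "lon": -105.2, "lat": 39.9})
print(aggregate_probes(probes))
```

Only cell indices and counts leave the function, so the output carries no identifiers, and the isolated probe disappears entirely rather than showing up as a cell with a count of one.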
White is the trace as we receive it, green is a good match, orange is an uncertain match, and red indicates issues with matching or the data. Where the information shows a previously unknown road, our data team can use imagery and other methods for ground-truthing. Other data projects help power Mapbox’s internal algorithms for trip ETAs and mode classifications, categorizing every path as drivable, walkable, or bikeable.
We’re open sourcing the tools and contributing much of our analysis to OpenStreetMap as new roads and turn restrictions. As we collect more probe data, improve to-fix, and expand data teams in Peru and India, we’re gathering the pieces we need to build the most complete and accurate road network. In early March, we’ll present a major update on this work at FOSS4GNA in San Francisco.