Store Locations, Google Maps and Honeycombs - Why we wrote an algorithm for tiling an arbitrary polygon with hexagons
At String our engineers probably spend too much time ensuring our data is correct. Our commitment to data quality is how we went from answering the question “find all Lululemon store locations” to this GitHub discussion https://github.com/Turfjs/turf/discussions/2518 about covering an arbitrary polygon with circles (or, as you’ll see, hexagons).
String covers a range of data categories, or slices. One slice in our catalog is store locations, and it presents an interesting data validation problem. Many companies list the locations of all of their stores on their website, and getting this data from a single .com domain is fairly trivial. However, multinational companies have international locations, and sometimes those locations are operated by a partner under a white-label arrangement. An international white-label partner (or a domestic entity) might operate under a different domain or might not list store locations online at all. So how do you determine brick & mortar coverage if a company doesn’t share it? And even if they do, how do you ensure the data is accurate and up-to-date?
To ensure accuracy, completeness, and timeliness in our data we leverage third-party data sources*. These vary by slice; for store locations the most complete data source (validated by our engineers) is Google Maps. For example, only Google Maps accurately tracked Lululemon stores in the Arabian Peninsula (operated under a white-label arrangement).
While Google Maps has accurate location data worldwide, that data is not exposed as a queryable database. Instead, a user sets a location and makes a request to search for matches in their vicinity. This presents multiple challenges: our search criteria can match arbitrarily many data points inside this circle, the search range is limited, and we learn nothing about points lying outside of it. A request might result in something like this.
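To make those constraints concrete, here is a minimal sketch of one such request. It uses Google’s Places Nearby Search endpoint as an illustrative stand-in (it exhibits exactly these constraints); the keyword and API-key handling here are assumptions for the example, not our production setup:

```typescript
// Illustrative only: Places Nearby Search caps the radius at 50 km and
// returns at most 20 results per page (60 total with pagination), so one
// request can never enumerate a dense or unbounded area.
const GOOGLE_API_KEY = process.env.GOOGLE_API_KEY; // assumption: key supplied via env

async function nearbySearch(lat: number, lng: number, radiusMeters: number) {
  const url =
    "https://maps.googleapis.com/maps/api/place/nearbysearch/json" +
    `?location=${lat},${lng}` +
    `&radius=${Math.min(radiusMeters, 50_000)}` + // the API rejects radii above 50 km
    `&keyword=lululemon` +
    `&key=${GOOGLE_API_KEY}`;
  const res = await fetch(url);
  const body = await res.json();
  return body.results; // matches inside the circle, and nothing about points outside it
}
```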
Despite these difficulties, the most complete data is only available via Google Maps, so we must work within these constraints. This is where hexagons come in.
We can define the problem of getting store locations from Google Maps as: given an arbitrary polygon and a maximum radius, how do you place (possibly overlapping) circles to cover that polygon with as few circles, and therefore as few requests, as possible?
The most effective way to fully cover a polygon with circles of fixed radius is (likely) to tile the plane with hexagons arranged in a “honeycomb” pattern and circumscribe a circle around each hexagon - this has a covering efficiency of ≈0.827 (thanks to this discussion for pointing us in the right direction!). This efficiency is also the limit of the Disc Covering Problem, and while we don’t have a formal proof that a honeycomb structure is the optimal design pattern, it seems intuitively correct.
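Where does 0.827 come from? A regular hexagon with circumradius r is exactly inscribed in a circle of radius r, and hexagons tile the plane with no gaps, so the fraction of each circle’s area doing useful covering work is the hexagon-to-circle area ratio:

```latex
\frac{A_{\text{hexagon}}}{A_{\text{circle}}}
  = \frac{\tfrac{3\sqrt{3}}{2}\, r^{2}}{\pi r^{2}}
  = \frac{3\sqrt{3}}{2\pi}
  \approx 0.827
```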
Knowing this, we wrote an algorithm that does the following (see the sketch below):
1. Tile the polygon’s bounding box with a honeycomb of regular hexagons whose circumradius equals the maximum search radius.
2. Discard hexagons that don’t intersect the polygon.
3. Issue one search per remaining hexagon, centered on the hexagon’s centroid with the maximum radius, so the circle fully covers the hexagon.
4. Merge and de-duplicate results across overlapping circles.
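Here is a minimal sketch of the tiling step using Turf.js (the library behind the discussion linked above); the function name and the kilometers unit are illustrative choices, not our production code:

```typescript
import { bbox, centroid, hexGrid } from "@turf/turf";
import type { Feature, Point, Polygon } from "geojson";

// Return one search-circle center per hexagon needed to cover `polygon`.
// A regular hexagon's circumradius equals its side length, so a circle
// of radius `radiusKm` centered on a hexagon built with
// cellSide = radiusKm covers that hexagon exactly.
function searchCenters(
  polygon: Feature<Polygon>,
  radiusKm: number
): Feature<Point>[] {
  const grid = hexGrid(bbox(polygon), radiusKm, {
    units: "kilometers",
    mask: polygon, // drop hexagons that don't intersect the polygon
  });
  return grid.features.map((hex) => centroid(hex));
}
```

The mask option keeps only grid cells that intersect the polygon, which is exactly the discard step above; each returned centroid then becomes the center of one search request.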
In addition to filling in missing data, we use this approach to flag likely stale data (e.g. when a company stops updating its own website) long before data aggregators that rely solely on the primary source catch on.
Yes, we wrote a fancy algorithm. But does it really matter? Here’s one example:
The top SEO result when Googling “Lululemon store locations” is this Statista page (as of November 2023). The data, for convenience:
I’ll call your attention to a few things:
At String we don’t just extract data once, we track it over time. This enables you to ask questions of the data that require temporal context, opening up a much richer set of insights (and time series visualizations).
You could search “list of store locations Lululemon” and see results from the various vendors that have bought ads pushing their one-time dumps of the data found on a .com domain. But this data won’t tell the whole story. While a one-time extraction might work for some use cases, it won’t give you historical context. And unless you can trust that your vendor is doing due diligence to validate data integrity, you could end up making decisions based on data that never even mentions the existence of a region of the business that has operated since 2015.
If you found this article interesting or want to explore collecting high quality web data, feel free to reach out to us!
*It is important to note that NO data source is guaranteed to be up to date; you must also track data freshness (which is something we do for you at String 😊)