Technology

Challenges In Collecting Store Locations Data

Store Locations, Google Maps and Honeycombs - Why we wrote an algorithm for tiling an arbitrary polygon with hexagons

5 mins
September 4, 2024
Logan Harless
CTO

At String, our engineers probably spend too much time ensuring our data is correct. Our commitment to data quality is how we went from answering the question “find all Lululemon store locations” to this GitHub discussion (https://github.com/Turfjs/turf/discussions/2518) about covering an arbitrary polygon with circles (or, as you’ll see, hexagons).

Background

String covers a range of data categories, or slices. One slice in our catalog is store locations, and it presents an interesting data validation problem. Many companies list the locations of all of their stores on their website, and getting this data from a single .com domain is fairly trivial. However, multinational companies have international locations, and sometimes those locations are operated by a white-label partner. An international white-label partner (or a domestic entity) might operate under a different domain, or might not list store locations online at all. So how do you determine brick & mortar coverage if a company doesn’t share it? And even if they do, how do you ensure the data is accurate and up-to-date?

3rd Party Sources To The Rescue!

To ensure accuracy, completeness and timeliness in our data we leverage 3rd party data sources*. These vary by slice; for store locations the most complete data source (validated by our engineers) is Google Maps. For example, only Google Maps accurately tracked Lululemon stores in the Arabian Peninsula, which operate under a white-label partner.

Or maybe not…

While Google Maps has accurate location data worldwide, that data is not exposed as a queryable database. Instead, a user sets their location and makes a request to search for matches in their vicinity. This presents multiple challenges: our search criteria can match arbitrarily many data points inside the search circle, the search range is limited, and we learn nothing about points lying outside of that circle. A request might result in something like this.

A user searching “lululemon” in Google Maps API from the Houston, TX area
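To make the constraint concrete, here is a sketch of what one such radius-limited request could look like against the Places API Nearby Search endpoint. The exact parameters, the placeholder key, and the 50,000-meter cap reflect Google's documented interface, but treat the details as illustrative rather than our production code:

```python
from urllib.parse import urlencode

def nearby_search_url(lat, lng, keyword, radius_m=50_000, api_key="YOUR_KEY"):
    """Build a Places Nearby Search URL for one search circle.

    The API caps `radius` at 50,000 meters, so a single request can never
    see beyond that circle -- which is exactly the tiling problem we face.
    """
    base = "https://maps.googleapis.com/maps/api/place/nearbysearch/json"
    params = {
        "location": f"{lat},{lng}",
        "radius": min(radius_m, 50_000),  # documented maximum
        "keyword": keyword,
        "key": api_key,
    }
    return f"{base}?{urlencode(params)}"

# One circle centered on Houston, TX, searching for "lululemon"
url = nearby_search_url(29.76, -95.36, "lululemon")
```

Each such request answers only “what matches near this point?”, so covering a whole country means issuing many of them.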

Despite these difficulties, the most complete data is only available via Google Maps, so we must work within these constraints. This is where hexagons come in.

Covering a polygon with hexagons

We can define the problem of getting store locations from Google Maps as: given an arbitrary polygon and a maximum search radius, how do you place circles (allowing overlap) to cover that polygon with as few circles, and therefore as few requests, as possible?

The most effective way to fully cover a polygon with circles of fixed radius is (likely) tiling the plane with hexagons arranged in a “honeycomb” pattern, with each circle circumscribing one hexagon - this has a covering efficiency of ≈0.827 (thanks to this discussion for pointing us in the right direction!). This efficiency is also the known limit of the Disc Covering Problem, and while we don’t have a formal proof that a honeycomb structure is the optimal design pattern, it seems intuitively correct.

A map of the US tiled with circles of radius 100 miles.
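That ≈0.827 figure falls out of simple geometry: a circle of radius r circumscribes a regular hexagon of area (3√3/2)·r², and in a honeycomb each circle “owns” exactly one hexagon, so the useful (non-overlapping) fraction of each circle is the hexagon-to-circle area ratio. A quick check:

```python
import math

def covering_efficiency():
    """Fraction of a circle's area covered by its inscribed regular hexagon.

    Hexagon with circumradius r has area (3*sqrt(3)/2) * r^2; the circle has
    area pi * r^2. Their ratio is the honeycomb covering efficiency.
    """
    r = 1.0  # radius cancels out; any value works
    hexagon_area = (3 * math.sqrt(3) / 2) * r**2
    circle_area = math.pi * r**2
    return hexagon_area / circle_area

print(round(covering_efficiency(), 3))  # → 0.827
```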

Knowing this, we wrote an algorithm that does the following:

  1. Define a region/country to extract data from.
  2. Represent that region as a GeoJSON polygon (ignoring the curvature of the Earth).
  3. Generate lat/longs that optimally cover that polygon with fixed-radius circles (with overlap).
  4. Make a request at each of those lat/longs.
  5. Filter results from overlapping regions to prevent duplication.
  6. Save the data to the database!
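The heart of step 3 is generating circle centers on a triangular (“honeycomb”) lattice: rows of points spaced so that circles of the given radius just cover the plane, shrunk slightly so adjacent circles overlap. The sketch below uses shapely for the polygon math and planar coordinates (per step 2, Earth curvature is ignored); it is a minimal illustration, not our production algorithm:

```python
import math
from shapely.geometry import Point, Polygon

def hex_cover_centers(polygon, radius, overlap=0.95):
    """Return centers of radius-`radius` circles covering `polygon`.

    Centers sit on a triangular lattice: circles of radius r centered on a
    lattice with nearest-neighbor spacing sqrt(3)*r exactly cover the plane,
    so we multiply spacings by `overlap` < 1 for a safety margin.
    """
    minx, miny, maxx, maxy = polygon.bounds
    dx = math.sqrt(3) * radius * overlap  # spacing within a row
    dy = 1.5 * radius * overlap           # spacing between rows
    centers = []
    row = 0
    y = miny - 2 * radius
    while y <= maxy + 2 * radius:
        # Odd rows shift half a step, producing the honeycomb offset.
        x = minx - 2 * radius + (dx / 2 if row % 2 else 0.0)
        while x <= maxx + 2 * radius:
            # Keep a center only if its circle actually reaches the polygon;
            # dropped circles contribute nothing to covering it.
            if polygon.distance(Point(x, y)) <= radius:
                centers.append((x, y))
            x += dx
        y += dy
        row += 1
    return centers
```

Each returned center becomes one search request in step 4; the `overlap` factor trades a few extra requests for robustness at circle boundaries.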

In addition to filling in missing data, we use this approach to flag likely-stale data (i.e. when a company stops updating its own website) long before data aggregators that rely solely on the primary source catch on.
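Step 5 above, filtering duplicates from overlapping circles, can be as simple as keying results on a stable identifier. The sketch below assumes each result carries a `place_id` field (as Google Maps results do); any stable per-location key would work:

```python
def dedupe_results(batches):
    """Merge result batches from overlapping search circles.

    `batches` is an iterable of result lists, one per circle. A location
    that falls inside two overlapping circles appears in both batches;
    keying on its identifier keeps exactly one copy, preserving the order
    in which records were first seen.
    """
    seen = set()
    unique = []
    for batch in batches:
        for record in batch:
            if record["place_id"] not in seen:
                seen.add(record["place_id"])
                unique.append(record)
    return unique
```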

Lululemon - A real example

Yes, we wrote a fancy algorithm. But does it really matter? Here’s one example:

The top SEO result when Googling “Lululemon store locations” is this Statista page (as of November 2023). Here is the data, for convenience:

I’ll call your attention to a few things you may notice:

  1. Data frequency is annual. With only annual snapshots, the range of questions this data can answer is limited, e.g. “could new store locations be driving growth in the month of January?” Hard to say if you don’t know the month those stores were added.
  2. Data is aggregated at the country level. This aggregation further limits the value of the data. It’s possible to see that Japan reduced its store count by 1, then increased it again. But which store? Was it the same one reopening or a new location? Could the closure have been due to external factors? Any questions you might ask about expansion strategy are severely limited by this aggregation.
  3. Locations from the Arabian Peninsula (Oman, UAE, Qatar and Saudi Arabia) and other countries are missing. Lululemon has had store locations in the Arabian Peninsula since 2015; however, aggregators (such as the one backing Statista) rarely validate their data quality.

Conclusion

At String we don’t just extract data once, we track it over time. This enables you to ask questions of the data that require temporal context, opening up a much richer set of insights (and time series visualizations).

You could search “list of store locations Lululemon” and see results from the various vendors that have bought ads pushing their one-time dumps of the data found on a .com domain. But this data won’t tell the whole story. While a one-time extraction might work for some use cases, it won’t give you historical context. And unless you can trust that your vendor is doing due diligence to validate data integrity, you could end up making decisions based on data that never even mentions the existence of a region of the business that has operated since 2015.

If you found this article interesting or want to explore collecting high quality web data, feel free to reach out to us!

*It is important to note that NO data source is guaranteed to be up to date; you must also track data freshness (which is something we do for you at String 😊).

© Copyright 2024 String AI
Created In New York City, NY