
Simple City Geocoding with SQLite & FTS5

I was recently working on a project where I had some fake address data for customers and businesses. I wanted to mark their locations on a map, but I only had addresses, not coordinates, so I needed a way to turn one into the other.

The process of taking an address and turning it into coordinates is called geocoding. It's generally something that requires an external API or a heavyweight pre-indexed database running on its own infrastructure. In this case, I wasn't looking to bring in anything that large: the project wasn't very important, and I didn't need anything finer than city-level geocoding.

While looking around at different data sources, I stumbled onto simplemaps.com, which provides a few different map data sources in simple CSV formats. Each source has a limited but openly licensed (CC BY 4.0) dataset, as well as much larger purchasable datasets with permissive licenses. In this case, I was interested in the US Cities dataset, which has rows like the following:

┌────────────────┬────────────────┬──────────┬────────────┬─────────────┬─────────┬──────────┬───────┐
│ city           │ city_ascii     │ state_id │ state_name │ county_name │ lat     │ lng      │ zips  │
├────────────────┼────────────────┼──────────┼────────────┼─────────────┼─────────┼──────────┼───────┤
│ New York Mills │ New York Mills │ NY       │ New York   │ Oneida      │ 43.1007 │ -75.2932 │ 13417 │
│ West New York  │ West New York  │ NJ       │ New Jersey │ Hudson      │ 40.7857 │ -74.0094 │ 07093 │
│ New York Mills │ New York Mills │ MN       │ Minnesota  │ Otter Tail  │ 46.5198 │ -95.3732 │ 56567 │
│ New City       │ New City       │ NY       │ New York   │ Rockland    │ 41.1542 │ -73.9912 │ 10956 │
│ New Cassel     │ New Cassel     │ NY       │ New York   │ Nassau      │ 40.7602 │ -73.5649 │ 11590 │
└────────────────┴────────────────┴──────────┴────────────┴─────────────┴─────────┴──────────┴───────┘

This is only a subset of the columns and, as of writing, there are 31,255 cities in the open dataset. For fake data, this is great. For production use cases, you may be interested in the larger datasets, which have over 100k cities. The price is also pretty reasonable: $199 for just the US, or $499 for the entire world's cities (4.3M of them!).

SQLite and FTS5

SQLite is a great database. It has tons of features, it's fast, and it can be embedded basically anywhere. It ships with a built-in extension called FTS5, which provides full-text search over table rows. Using SQL, you can query an FTS table with a string like "New York" and get back every row that contains the string in any of its columns, with ranking. As an example, the following is a query for all cities with "Chicago" in them:

SELECT rank, city, state_id 
FROM cities
WHERE cities MATCH 'Chicago'
ORDER BY rank
LIMIT 4;

This query could produce an output like:

┌───────────────────┬─────────────────┬──────────┐
│ rank              │ city            │ state_id │
├───────────────────┼─────────────────┼──────────┤
│ -11.606091491254  │ Chicago         │ IL       │
│ -10.9698140943145 │ North Chicago   │ IL       │
│ -10.9698140943145 │ Chicago Heights │ IL       │
│ -10.9698140943145 │ East Chicago    │ IN       │
└───────────────────┴─────────────────┴──────────┘

Rank is automatically calculated for you and available under the rank keyword. It's computed by the bm25() function, which is described in the SQLite FTS5 documentation. For our use case, more negative means a higher ranking.
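The rank column is equivalent to calling bm25() with default weights, so the query above could also be written with an explicit bm25() call, as sketched below. If some columns should matter more than others, bm25() also accepts optional per-column weights in table column order, e.g. bm25(cities, 10.0) to weight matches in the first column 10x; unspecified trailing columns keep the default weight of 1.0:

SELECT bm25(cities) AS score, city, state_id
FROM cities
WHERE cities MATCH 'Chicago'
ORDER BY score
LIMIT 4;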

Creating the geocoder

As you can probably guess, we can use FTS5 to provide a simple search interface on top of this US Cities dataset. Before that, though, we have to make a small change to its formatting.

By default, each city has a list of ZIPs in one of its columns. This list can be very large for certain cities, like Chicago, LA, NY, and so on. The bm25() algorithm penalizes rows based on their length, meaning a short row that mentions a keyword once is ranked higher than a lengthy row that also mentions the keyword only once. That doesn't make sense for us: a search for "Chicago" would rank "Chicago Heights" above the actual "Chicago", because the real city has 84 ZIPs and Chicago Heights has 2.

There are a handful of ways to solve this, but the one I'm going with is to expand the ZIPs into multiple rows with duplicated data. If a city has 10 ZIPs, then we'll have 10 rows for that city, each with a different ZIP. This does mean the FTS table holds a lot more data than it needs to, but at least it's all in one table. A more space-efficient alternative could be having multiple FTS tables and querying the right one as needed.

First, let's get rid of some of the extra data about each region that we don't need. The following bash keeps only the columns we care about and saves the result to a new file:

cut -d',' -f1-4,6-8,16 uscities.csv > uscities-trunc.csv

Now we start up an Elixir interpreter and use the CSV package to read the file and expand each city into multiple rows:

Mix.install([:csv])

File.stream!("uscities-trunc.csv")
# Drop the CSV header row
|> Stream.drop(1)
|> CSV.decode!()
# Expand each city into one row per ZIP, with the ZIP as the first column
|> Stream.flat_map(fn x ->
  zips = Enum.at(x, -1) |> String.split()

  Enum.reduce(zips, [], fn y, acc ->
    # This could probably be better
    [[y | Enum.slice(x, 0..6)] | acc]
  end)
end)
|> CSV.encode()
|> Stream.into(File.stream!("uscities-trunc-split.csv"))
|> Stream.run()

We can then import this truncated and split dataset into a SQLite database. We'll do this by creating a SQL file that creates our FTS5 table and then runs the import, using the CSV import feature built into the sqlite3 CLI tool:

CREATE VIRTUAL TABLE cities USING fts5(
  zip, city, city_ascii, state_id, state_name, county_name, lat, lng);
.import --csv uscities-trunc-split.csv cities

And finally, we create our DB using that SQL file:

sqlite3 cities.db < import-cities.sql
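
As a quick sanity check, we can count how many rows made it in. The exact number depends on the dataset version, but after the ZIP expansion it should be well above the 31,255 original cities:

sqlite3 cities.db "SELECT COUNT(*) FROM cities;"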

Now let's run some queries against the DB to check how it looks:

$ sqlite3 cities.db
SQLite version 3.43.2 2023-10-10 13:08:14
Enter ".help" for usage hints.
sqlite> SELECT city, zip, county_name FROM cities WHERE cities MATCH 'Chicago' ORDER BY rank LIMIT 5;
┌─────────┬───────┬─────────────┐
│ city    │ zip   │ county_name │
├─────────┼───────┼─────────────┤
│ Chicago │ 60699 │ Cook        │
│ Chicago │ 60696 │ Cook        │
│ Chicago │ 60695 │ Cook        │
│ Chicago │ 60694 │ Cook        │
│ Chicago │ 60690 │ Cook        │
└─────────┴───────┴─────────────┘

You'll notice we're getting multiple instances of Chicago, each with a different ZIP. For some use cases this isn't great, since we really want to search across city names. We can fix this with a GROUP BY clause:

sqlite> SELECT city, zip, county_name FROM cities WHERE cities MATCH 'Chicago' GROUP BY city ORDER BY rank LIMIT 5;
┌─────────────────┬───────┬─────────────┐
│ city            │ zip   │ county_name │
├─────────────────┼───────┼─────────────┤
│ Chicago         │ 60699 │ Cook        │
│ Chicago Heights │ 60412 │ Cook        │
│ Chicago Ridge   │ 60415 │ Cook        │
│ East Chicago    │ 46312 │ Lake        │
│ New Chicago     │ 46405 │ Lake        │
└─────────────────┴───────┴─────────────┘

Now we're getting multiple cities. We can also display the rank and list all of the ZIPs for each city:

sqlite> SELECT MIN(rank), city, GROUP_CONCAT(zip) FROM cities WHERE cities MATCH 'Chicago' GROUP BY city ORDER BY rank LIMIT 3;
┌───────────────────┬─────────────────┬──────────────────────────────────────────────────────────────┐
│ MIN(rank)         │ city            │ GROUP_CONCAT(zip)                                            │
├───────────────────┼─────────────────┼──────────────────────────────────────────────────────────────┤
│ -8.757506826533   │ Chicago         │ 60699,60696,60695,60694,60690,60689,60688,60687,60686,60685, │
│                   │                 │ 60684,60681,60680,60678,60677,60675,60674,60673,60670,60669, │
│                   │                 │ 60668,60666,60664,60499,60707,60628,60629,60624,60625,60626, │
│                   │                 │ 60620,60621,60622,60623,60651,60655,60654,60657,60656,60653, │
│                   │                 │ 60652,60659,60106,60608,60609,60602,60603,60601,60606,60607, │
│                   │                 │ 60604,60605,60633,60632,60631,60630,60637,60636,60634,60639, │
│                   │                 │ 60638,60660,60661,60827,60619,60618,60611,60610,60613,60612, │
│                   │                 │ 60615,60614,60617,60616,60646,60647,60644,60645,60642,60643, │
│                   │                 │ 60640,60641,60649,60018                                      │
├───────────────────┼─────────────────┼──────────────────────────────────────────────────────────────┤
│ -8.32111573856792 │ Chicago Heights │ 60412,60411                                                  │
├───────────────────┼─────────────────┼──────────────────────────────────────────────────────────────┤
│ -8.32111573856792 │ Chicago Ridge   │ 60415                                                        │
└───────────────────┴─────────────────┴──────────────────────────────────────────────────────────────┘

Lastly, we can still search by ZIP (or any other field for that matter) and get coordinates out, which is the whole point of a geocoder:

sqlite> SELECT city, lat, lng FROM cities WHERE cities MATCH '60632';
┌─────────┬─────────┬──────────┐
│ city    │ lat     │ lng      │
├─────────┼─────────┼──────────┤
│ Chicago │ 41.8375 │ -87.6866 │
└─────────┴─────────┴──────────┘

Notice that I'm not using LIMIT here. Each ZIP maps to exactly one city in this dataset, so only one row should come back.
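
If you ever want to double check that for a particular ZIP, a quick count works; it should come back as exactly 1:

sqlite> SELECT COUNT(*) FROM cities WHERE cities MATCH '60632';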

Note: There is one issue here. Because we split cities into multiple rows, we have to group by city to get just one row per city. That means that if multiple cities share the same name but are in different states, we'll lose all but one of them while grouping. We can fix this by grouping on both city and state, like the following:

sqlite> SELECT city, state_id, lat, lng FROM cities WHERE cities MATCH 'Portland' GROUP BY city LIMIT 3;
┌────────────────┬──────────┬─────────┬───────────┐
│ city           │ state_id │ lat     │ lng       │
├────────────────┼──────────┼─────────┼───────────┤
│ Portland       │ OR       │ 45.5371 │ -122.6500 │
│ South Portland │ ME       │ 43.6310 │ -70.2895  │
└────────────────┴──────────┴─────────┴───────────┘
sqlite> SELECT city, state_id, lat, lng FROM cities WHERE cities MATCH 'Portland' GROUP BY city, state_id LIMIT 3;
┌──────────┬──────────┬─────────┬───────────┐
│ city     │ state_id │ lat     │ lng       │
├──────────┼──────────┼─────────┼───────────┤
│ Portland │ AR       │ 33.2379 │ -91.5109  │
│ Portland │ CO       │ 38.0892 │ -107.6952 │
│ Portland │ IA       │ 43.1254 │ -93.1351  │
└──────────┴──────────┴─────────┴───────────┘
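
To tie this together, here's roughly what the lookup could look like from the Elixir side. This is a minimal sketch assuming the exqlite package; Geocoder and geocode/2 are names I made up, and the low-level Exqlite.Sqlite3 API has shifted a bit between versions, so check its docs:

Mix.install([:exqlite])

defmodule Geocoder do
  alias Exqlite.Sqlite3

  # Returns up to 5 [city, state_id, lat, lng] rows for a free-text query,
  # which can be a city name ("Portland") or a ZIP ("60632").
  def geocode(db_path, query) do
    {:ok, conn} = Sqlite3.open(db_path)

    {:ok, stmt} =
      Sqlite3.prepare(conn, """
      SELECT city, state_id, lat, lng
      FROM cities
      WHERE cities MATCH ?
      GROUP BY city, state_id
      ORDER BY MIN(rank)
      LIMIT 5
      """)

    :ok = Sqlite3.bind(conn, stmt, [query])
    {:ok, rows} = Sqlite3.fetch_all(conn, stmt)

    :ok = Sqlite3.release(conn, stmt)
    :ok = Sqlite3.close(conn)

    # FTS5 columns are untyped text, so lat/lng come back as strings
    rows
  end
end

Geocoder.geocode("cities.db", "Portland")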

Conclusion

This post was a quick run-through of creating our own geocoder using just SQLite, its full-text search extension, and a great openly licensed source of US cities geographical data. I'd expect this to scale to the full US Cities dataset, but for the 4.3M-row full-world dataset you might need a more sophisticated database layout, like sharding, or to just switch to Elasticsearch.
