Rants and musings of Paul Scott
Paul Scott on 2011-09-20 10:23:59
HOWTO create a massively scalable and very fast Geo database using MongoDB
First things first, I will assume that you have MongoDB set up and running on your server. If you do not, it is really easy on Ubuntu; for other distros, I am not sure. RTFM.
OK, so we need to create a geographic database for geocoding, reverse geocoding and some simple spatial queries. MongoDB is fast and easy to do this with, so we shall proceed from there.
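To make those three operations a bit more concrete, here is a minimal sketch of the query documents they might translate to once the data is loaded. It assumes the "loc" and "asciiname" fields that the import script further down creates; the helper names are mine and purely illustrative. The code is Python 3 and only builds plain dicts, which you would then hand to pymongo's find(). Current servers spell the box operator $geoWithin; servers from around the time of writing called it $within.

```python
def geocode_query(name):
    """Forward geocoding: look a place up by its ASCII name."""
    return {"asciiname": name}


def near_query(lon, lat):
    """Reverse geocoding: places ordered by distance from a point.

    Coordinates are given as [longitude, latitude], matching "loc".
    """
    return {"loc": {"$near": [lon, lat]}}


def box_query(west, south, east, north):
    """Simple spatial query: places inside a bounding box.

    $box takes the bottom-left corner first, then the upper-right one.
    """
    return {"loc": {"$geoWithin": {"$box": [[west, south], [east, north]]}}}


# Hypothetical usage with pymongo (not executed here):
#   db.places.find(near_query(18.42, -33.92)).limit(1)
```

Keeping the query shapes in small functions like this makes them easy to check without a running server.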
NOTE: If you are using a 32 bit operating system, your dataset will be limited to about 2GB. Keep that in mind as we proceed. If you are going to try and import the entire CC-BY licensed dataset in this HOWTO, you will need a 64 bit OS.
Now to the fun part:
1. Make sure that MongoDB is up and running on your server (there is an upstart script to do so: start mongodb)
2. Make sure that you can connect to your MongoDB server with mongo (client). If you can, you are ready to proceed.
3. Open up a terminal (if you use Windows or something, you have bigger problems) and grab some of the geonames.org data (or all of it) with:
wget -r -l2 -nd -Nc -A.zip http://download.geonames.org/export/dump/
unzip '*.zip'
mkdir zip
mv *.zip zip/
You may also want to remove the cities*.txt and other lowercase file names, as well as the README file for the next step.
4. Make sure that all of the .txt files have an uppercase 2 letter ISO country code as the filename. Any other files will more than likely break the next step, as I really didn't take too much time to write good code. The next bit is used once only.
5. Write a quick Python script to insert the data into MongoDB:
import glob
import csv
from pymongo import Connection, GEO2D

# Raise the csv field size limit well above the default.
csv.field_size_limit(1000000000)

# Use the "geopoints" database and put a 2D geospatial index on places.loc.
db = Connection().geopoints
db.places.create_index([("loc", GEO2D)])

def unicode_csv_reader(utf8_data, dialect=csv.excel, **kwargs):
    # The Python 2 csv module yields byte strings, so decode each cell from UTF-8.
    csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
    for row in csv_reader:
        yield [unicode(cell, 'utf-8') for cell in row]

# One tab-delimited .txt file per country code.
li = glob.glob('*.txt')
for filename in li:
    print "Doing " + filename
    reader = unicode_csv_reader(open(filename), delimiter="\t", quoting=csv.QUOTE_MINIMAL)
    for row in reader:
        # "loc" is stored as [longitude, latitude]; every other value is kept as a one-element list.
        insdict = {"loc": [float(row[5]), float(row[4])], "geonameid": [row[0]], "name": [row[1]], "asciiname": [row[2]], "alternatenames": [row[3]], "latitude": [float(row[4])], "longitude": [float(row[5])], "featureclass": [row[6]], "featurecode": [row[7]], "countrycode": [row[8]], "cc2": [row[9]], "admin1code": [row[10]], "admin2code": [row[11]], "admin3code": [row[12]], "admin4code": [row[13]], "population": [row[14]], "elevation": [row[15]], "gtopo30": [row[16]], "timezone": [row[17]], "modificationdate": [row[18]]}
        db.places.insert(insdict)
    print filename + " Done"
6. You will notice that the code above (excuse me if the indentation is messed up...) makes use of pymongo, the Python driver for MongoDB, which you will also need. Again, on Ubuntu it is as simple as: apt-get install python-pymongo
7. Execute the script.
8. Next, download a copy of Chisimba and the geo module (WIP) and install that. Configure your MongoDB connection in the sysconfig and Voila! You have a geocoding and reverse geocoding API. Easy!
9. Profit!
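If you would rather not rely on step 4 being done by hand, the filename rule and the row mapping from step 5 can be sketched as two small, testable functions. This is a rough Python 3 sketch, not the script above: the names is_country_file and row_to_place are made up for illustration, only a handful of the columns are mapped, values are stored as plain scalars rather than one-element lists, and the sample row is hand-written rather than real geonames data.

```python
import csv
import io
import re

# Step 4's rule: an uppercase two-letter country code plus ".txt".
COUNTRY_FILE = re.compile(r"^[A-Z]{2}\.txt$")


def is_country_file(filename):
    """Return True only for names like 'ZA.txt'."""
    return COUNTRY_FILE.match(filename) is not None


def row_to_place(row):
    """Map one tab-separated row to a document (subset of the columns)."""
    return {
        "loc": [float(row[5]), float(row[4])],  # [longitude, latitude]
        "geonameid": row[0],
        "name": row[1],
        "asciiname": row[2],
        "countrycode": row[8],
    }


def read_places(text_stream):
    """Yield one document per row of a tab-delimited text stream."""
    for row in csv.reader(text_stream, delimiter="\t"):
        yield row_to_place(row)


# Hand-written sample row, for illustration only.
columns = ["1", "Example Town", "Example Town", "", "-33.9", "18.5", "P", "PPL", "ZA"]
sample = "\t".join(columns) + "\n"
places = list(read_places(io.StringIO(sample)))
```

With something like this, the import loop could skip any file for which is_country_file() returns False instead of falling over on the README.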
NOTES: You will notice that the pymongo script creates a 2D geospatial index for your data. I would also recommend that you create some additional indexes according to your app.
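As a rough idea of what "additional indexes" might look like, here is a small sketch. The fields chosen (countrycode and featurecode) are only a guess at what an app might filter on, and ensure_indexes is a hypothetical helper name. The two constants mirror the values of pymongo.GEO2D and pymongo.ASCENDING so that the snippet has no dependencies; in real code you would import them from pymongo and pass in db.places.

```python
GEO2D = "2d"     # same value as pymongo.GEO2D
ASCENDING = 1    # same value as pymongo.ASCENDING


def ensure_indexes(places, extra_fields=("countrycode", "featurecode")):
    """Create the 2D index on "loc" plus one ascending index per extra field.

    `places` is anything with a pymongo-style create_index() method.
    Returns whatever create_index() returned for each index, in order.
    """
    created = [places.create_index([("loc", GEO2D)])]
    for field in extra_fields:
        created.append(places.create_index([(field, ASCENDING)]))
    return created


# Hypothetical usage (not executed here):
#   ensure_indexes(db.places)
```

Which fields deserve an index depends entirely on the queries your app actually runs, so treat the defaults here as placeholders.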
The database and collection names can be changed to anything you like. Just keep the whole db.collection string less than 128 chars long.
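If you do rename things, a quick sanity check along these lines can catch an over-long name before you import anything. It is only a sketch: namespace_ok is a made-up helper, and the 128 limit is simply the figure quoted above, counted in characters.

```python
def namespace_ok(db_name, collection_name, limit=128):
    """Return True if "db.collection" is shorter than `limit` characters."""
    return len(db_name + "." + collection_name) < limit


# The names used in this HOWTO are comfortably short.
howto_names_ok = namespace_ok("geopoints", "places")
```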
The CSV field size limit is huge. I doubt it needs to be THAT big, but it is safe, and better safe than sorry, right?
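If you want to experiment with a smaller limit, csv.field_size_limit() can both report and change it: called with no argument it returns the current limit, and called with a new value it sets that value and returns the old one. The 10 MB figure below is an arbitrary number picked for illustration, not something I have measured against the geonames data.

```python
import csv

# With no argument, field_size_limit() just reports the current limit.
current_limit = csv.field_size_limit()

# With an argument it sets the new limit and returns the previous one.
smaller_limit = 10 * 1024 * 1024  # 10 MB, an arbitrary choice
previous_limit = csv.field_size_limit(smaller_limit)
```

If the import then fails with a csv.Error about a field being larger than the field limit, the number was too small and needs raising again.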
If your dataset is to grow more, or you need even more performance, look at sharding with mongos too.

