Using a job management system to process geodata
Making maps well requires a wide variety of skills and tools.
My first two posts here showed WebGL’s potential as a tool for quickly drawing custom-projected maps. Once geographic data — in the right form — is loaded into the browser, a graphical representation can be drawn in just a few hundredths of a second.
Converting geographic data into the right form, however, can sometimes take hundreds of hours. One small map may be distilled out of many gigabytes of source data and base imagery.
Job management
In this post we’ll explore one tool for reliably and efficiently handling cartography’s heavy lifting: a job management system. In particular, we’ll look at RQMS, an experimental CouchDB-based job queue I wrote in node.js.
Now a “job” is just an item of work that needs to get done. It could be a big task (reproject all the map tiles within region X) or a small task (combine four tile images into overview tile Z) but the map won’t be ready until every job has been accomplished. A job management system keeps track of remaining tasks, making items available to the code performing the actual work. As a job “queue”, RQMS additionally tries to ensure that jobs are performed roughly in order (more on this later).
Dividing and conquering
Let’s say I have a dozen or so 500 MB source images I want to display in a web map viewer like Polymaps or Tile5. I might divide the work into the following job types:
- Given a path to a source image, reproject and split it into 256×256 tiles at the maximum zoom level
- Given four tiles at a particular zoom level, combine them into one “pyramid” tile in the next smaller zoom level
- Given a map tile, upload it to the destination server
For each job type, I write a Python script that asks the RQMS server for one item from that job’s queue. This item contains the task-specific information, e.g. the path of one tile and the server where it should be uploaded. Once the script has successfully uploaded the item’s file, it will inform the queue that it has finished that subtask and the item will be deleted. Multiple copies of each script may be run, even on multiple computers; RQMS keeps track of which work items are already “in progress” and only hands out available jobs.
Now when processing many thousands of these little jobs across multiple machines, something is bound to go wrong: a worker script runs out of memory and abends, an upload fails due to a network glitch… RQMS simply expects each job to be finished before a configurable time limit expires; otherwise the job will be handed out again for another try, until it succeeds. So for a task like reprojecting a large source image, the worker script might check out the job for half an hour. Normally it will succeed and delete the job within fifteen minutes or so, but if it crashes, another worker process will retry after the deadline is reached.
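To make the checkout-and-retry behavior concrete, here is a minimal in-memory sketch in Python. It is not RQMS and does not use RQMS's actual API; the names `LeaseQueue`, `checkout`, `finish` and `run_worker` are hypothetical, chosen only to illustrate the idea of a job that becomes available again once its time limit passes.

```python
import itertools
import time


class LeaseQueue:
    """In-memory stand-in for a job queue with timed checkouts (not RQMS's API)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock              # injectable so expiry can be simulated
        self._ids = itertools.count()
        self._jobs = {}                  # job id -> [sort key, payload, lease deadline]

    def put(self, key, payload):
        job_id = next(self._ids)
        # A deadline of -inf means "not checked out by anyone".
        self._jobs[job_id] = [key, payload, float("-inf")]
        return job_id

    def checkout(self, lease_seconds):
        """Hand out the lowest-keyed job whose lease is absent or expired."""
        now = self._clock()
        available = [(job[0], job_id)
                     for job_id, job in self._jobs.items() if job[2] <= now]
        if not available:
            return None
        _, job_id = min(available)
        self._jobs[job_id][2] = now + lease_seconds
        return job_id, self._jobs[job_id][1]

    def finish(self, job_id):
        """Delete a completed job so it is never handed out again."""
        self._jobs.pop(job_id, None)

    def __len__(self):
        return len(self._jobs)


def run_worker(queue, handler, lease_seconds):
    """Process jobs until none are available; return how many succeeded."""
    done = 0
    while (item := queue.checkout(lease_seconds)) is not None:
        job_id, payload = item
        try:
            handler(payload)
        except Exception:
            # Leave the job alone: it is handed out again once its lease expires.
            continue
        queue.finish(job_id)
        done += 1
    return done
```

A worker for the reprojection step might call `run_worker(queue, reproject, 30 * 60)`, where `reproject` is whatever function does the real work. The important design choice is that a crashed worker never has to report its failure: doing nothing is enough, because the expired lease puts the job back in circulation.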

Assembly lines
This simple job management makes parallel processing of geodata much more reliable, and it can also make it more efficient. Each job script can feed new work, as soon as it becomes available, into subsequent queues in the overall process. In this case, as each tile is split out of a source image, that task’s script adds a new task to the upload queue and also sets a job in the pyramid queue. Once the pyramid script combines the four tiles into one it adds another upload task for its own output, as well as setting a new pyramid task in the next lower zoom level. Since RQMS hands out tasks in a roughly sorted order, the pyramid worker process will typically combine tiles from the higher zoom levels before the lower.
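The chaining between queues can be sketched in a few lines of Python. This assumes the usual web map tile addressing, where a tile is a `(zoom, x, y)` triple and each zoom level doubles the tile count in both directions; the function names and the `("upload", tile)` job tuples are hypothetical and are not RQMS's own format.

```python
def parent_tile(z, x, y):
    """Tile one zoom level lower that covers tile (z, x, y)."""
    if z <= 0:
        raise ValueError("zoom level 0 has no parent")
    return (z - 1, x // 2, y // 2)


def child_tiles(z, x, y):
    """The four tiles one zoom level higher that combine into (z, x, y)."""
    return [(z + 1, 2 * x + dx, 2 * y + dy) for dy in (0, 1) for dx in (0, 1)]


def follow_up_jobs(tile):
    """Jobs a worker might enqueue right after writing `tile` to disk."""
    z, x, y = tile
    jobs = [("upload", tile)]
    if z > 0:
        # All four siblings name the same parent, so a queue keyed by the
        # parent tile would hold one pyramid job rather than four.
        jobs.append(("pyramid", parent_tile(z, x, y)))
    return jobs


def queue_order(tile):
    """Sort key that places higher zoom levels first."""
    z, x, y = tile
    return (-z, x, y)
```

Because every sibling produces the same parent coordinate, "setting" a pyramid job by that key is idempotent. The pyramid worker still has to confirm that all of `child_tiles(...)` exist before combining them, and a sort key like `queue_order` is one way to have the higher zoom levels handled first.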
In this way, the work of geodata processing can be spread across many CPUs, with the various subtasks being performed in a pipelined manner. If one queue is filling faster than it is being processed, more worker processes may be started to speed things up. If, say, the tile upload processes do not have enough bandwidth to keep up, RQMS just buffers their jobs until they can catch up. And if any worker script (or even the whole system) gets interrupted, the overall process will pick up right where it left off once restarted.
A simple system
As an experiment, RQMS has proven itself useful for tackling a large but dividable task like map tile generation. I call it experimental because it uses CouchDB in an atypical, suboptimal way and therefore has some fundamental issues that degrade its performance, especially when fetching many thousands of tiny tasks. (I will likely migrate my processing scripts to a high-performance realtime job management system that one of my coworkers at &yet is building.) In the meantime, using RQMS to feed my custom geodata wrapper scripts has almost magically turned them into an efficient workflow whose subtasks are tracked reliably and can be processed in parallel.


