A journey of modern mapmaking.
As recorded by @natevw, explorer for &yet.

Using a job management system to process geodata

Making maps well requires a wide variety of skills and tools.

My first two posts here showed WebGL’s potential as a tool for quickly drawing custom-projected maps. Once geographic data — in the right form — is loaded into the browser, a graphical representation can be drawn in just a few hundredths of a second.

Converting geographic data into the right form, however, can sometimes take hundreds of hours. One small map may be distilled out of many gigabytes of source data and base imagery.

Job management

In this post we’ll explore one tool for reliably and efficiently handling cartography’s heavy lifting: a job management system. In particular, we’ll look at RQMS, an experimental CouchDB-based job queue I wrote in node.js.

Now a “job” is just an item of work that needs to get done. It could be a big task (reproject all the map tiles within region X) or a small task (combine four tile images into overview tile Z) but the map won’t be ready until every job has been accomplished. A job management system keeps track of remaining tasks, making items available to the code performing the actual work. As a job “queue”, RQMS additionally tries to ensure that jobs are performed roughly in order (more on this later).

Dividing and conquering

Let’s say I have some dozen 500MB source images I want to display in a web map viewer like Polymaps or Tile5. I might divide the work into the following job types:

  1. Given a path to a source image, reproject and split it into 256×256 tiles at the maximum zoom level
  2. Given four tiles at a particular zoom level, combine them into one “pyramid” tile in the next smaller zoom level
  3. Given a map tile, upload it to the destination server

For each job type, I write a Python script that asks the RQMS server for one item from that job’s queue. This item contains the task-specific information: for an upload job, say, the path of one tile and the server where it should be uploaded. Once the script has successfully completed its work (here, uploading the item’s file), it informs the queue that it has finished that subtask and the item is deleted. Multiple copies of each script may be run, even on multiple computers; RQMS keeps track of which work items are already “in progress” and only hands out available jobs.

When processing many thousands of these little jobs across multiple machines, something is bound to go wrong: a worker script runs out of memory and abends, an upload fails due to a network glitch… RQMS simply expects each job to be finished before a configurable time limit expires; otherwise the job is handed out again for another try, and this repeats until it succeeds. So for a task like reprojecting a large source image, the worker script might check out the job for half an hour. Normally it will succeed and delete the job within fifteen minutes or so, but if it crashes, another worker process will retry after the deadline is reached.
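To make the checkout-with-deadline idea concrete, here is a minimal in-memory sketch in Python. It is not RQMS’s actual API or storage model: the `LeaseQueue` class, its method names, and the `region_x.tif` path are all hypothetical, and the clock is passed in as `now` (in seconds) so the behavior is easy to follow.

```python
import itertools


class LeaseQueue:
    """Hypothetical in-memory stand-in for a queue that hands out timed leases."""

    def __init__(self):
        self._jobs = {}                 # job id -> [payload, lease_expires_at]
        self._ids = itertools.count(1)  # ids increase, so sorting them gives queue order

    def put(self, payload):
        job_id = next(self._ids)
        self._jobs[job_id] = [payload, 0.0]  # 0.0 means "available right away"
        return job_id

    def checkout(self, lease_seconds, now):
        """Return (job_id, payload) for the oldest available job, or None."""
        for job_id in sorted(self._jobs):
            payload, expires_at = self._jobs[job_id]
            if expires_at <= now:
                # Mark the job as "in progress" until the lease runs out.
                self._jobs[job_id][1] = now + lease_seconds
                return job_id, payload
        return None

    def finish(self, job_id):
        """Delete a completed job so it is never handed out again."""
        self._jobs.pop(job_id, None)

    def __len__(self):
        return len(self._jobs)


queue = LeaseQueue()
queue.put({"source": "region_x.tif"})                # hypothetical payload

first = queue.checkout(lease_seconds=1800, now=0)    # worker A takes the job
blocked = queue.checkout(lease_seconds=1800, now=900)   # worker B finds nothing to do
retry = queue.checkout(lease_seconds=1800, now=1800)    # A never finished; lease expired
queue.finish(retry[0])                               # worker B succeeds and deletes it
```

A real worker would pass the current time (for example from `time.time()`) rather than hand-picked numbers. The point of the design is that a crashed worker needs no cleanup: its job simply becomes available again once the lease runs out.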

Diagram: putting map tasks into queues

Assembly lines

This simple job management makes parallel processing of geodata much more reliable, and it can also make it more efficient. Each job script can feed new work, as soon as it becomes available, into subsequent queues in the overall process. In this case, as each tile is split out of a source image, that task’s script adds a new task to the upload queue and also sets a job in the pyramid queue. Once the pyramid script combines the four tiles into one it adds another upload task for its own output, as well as setting a new pyramid task in the next lower zoom level. Since RQMS hands out tasks in a roughly sorted order, the pyramid worker process will typically combine tiles from the higher zoom levels before the lower.
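The bookkeeping behind that hand-off is mostly tile arithmetic. The sketch below assumes the common web-map convention where a tile is addressed as `(z, x, y)` and each zoom level doubles the number of tiles along both axes; the function names and the `("upload", …)` / `("pyramid", …)` job tuples are made up for illustration and are not RQMS’s format.

```python
def parent_tile(z, x, y):
    """The tile one zoom level down that tile (z, x, y) contributes to."""
    if z == 0:
        raise ValueError("zoom level 0 has no parent")
    return (z - 1, x // 2, y // 2)


def child_tiles(z, x, y):
    """The four tiles at zoom z + 1 that combine into tile (z, x, y)."""
    return [(z + 1, 2 * x + dx, 2 * y + dy) for dy in (0, 1) for dx in (0, 1)]


def follow_up_jobs(z, x, y):
    """Jobs to enqueue once tile (z, x, y) has been written to disk."""
    jobs = [("upload", (z, x, y))]
    if z > 0:
        # All four siblings name the same parent, so a queue that stores
        # jobs by key would end up with a single pyramid job per parent.
        jobs.append(("pyramid", parent_tile(z, x, y)))
    return jobs


# Four sibling tiles at zoom 5 all point at the same zoom-4 pyramid job.
siblings = child_tiles(4, 3, 6)
pyramid_jobs = {
    job for tile in siblings for job in follow_up_jobs(*tile) if job[0] == "pyramid"
}

# Handing out higher zoom levels first means a pyramid tile's inputs
# tend to exist before the tile itself is attempted.
pending = [(3, 1, 3), (5, 6, 12), (4, 3, 6)]
pending.sort(key=lambda tile: -tile[0])
```

Keying the pyramid job by its parent tile is one plausible reading of “setting” a job; a worker would still need to cope with a parent whose four inputs are not all present yet, for instance at the edge of the source imagery.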

In this way, the work of geodata processing can be spread across many CPUs with the various subtasks being performed in a pipelined manner. If one queue is filling faster than it is being processed, more worker processes may be started to speed processing. If, say, the tile upload processes do not have enough bandwidth to keep up, RQMS just buffers their jobs until they can catch up. And if any worker script (or even the whole system) gets interrupted, the overall process will pick up right where it left off once restarted.

A simple system

As an experiment, RQMS has proven itself useful for tackling a large but dividable task like map tile generation. I call it experimental because it uses CouchDB in an atypical, suboptimal way and therefore has some fundamental issues that degrade its performance, especially when fetching many thousands of tiny tasks. (I will likely migrate my processing scripts to a high-performance realtime job management system that one of my coworkers at &yet is building.) In the meantime, using RQMS to feed my custom geodata wrapper scripts has almost magically turned them into an efficient workflow whose subtasks are tracked reliably and can be processed in parallel.

Making the world go round

I’ve shown that map data can be projected and drawn at animation speed within an HTML5 document, using WebGL to harness the number-crunching, pixel-pushing power of the GPU.

Having conquered the initial hurdle of learning OpenGL ES 2.0, I’m now taking on some of the trickier issues involved in real-world (pun intended) map drawing. The next hurdle, which I’ve now tackled, was spinning the globe.

My first demo cheated. On the globe, the equator wraps around and around and around the world, in a circle. But in our source data the round earth had been unwrapped into a flat rectangle. In both the base raster image and the overlaid vector boundaries, the equator is just a line segment that starts at -180° and stops at +180° longitude. Conveniently, the first demo’s map showed the same range. It basically just had to stretch the flat source rectangle into different shapes. The equator stayed in place, still going from -180° back to 180° longitude (since it’s actually the same longitude line named two ways).

Diagram showing our treatment of longitude as a circle, measured and unwrapped arbitrarily.

To spin the globe, we need to constantly move the longitude line that splits the round earth we’re unwrapping. Splitting along a different longitude line introduces a problem covered up by the pre-projected source data. The vector borders are like a connect-the-dots puzzle drawing a line from each point to the next. Imagine what happens when some of these points are projected to the opposite side of the map:

World centered on Pacific Ocean, ruined by horizontal lines between opposite sides of the map.

Whoops. When spinning the globe, we don’t want to connect all the dots! Sometimes, instead of taking the long way across the map, the line from one point to the next should go through the magic portal connecting the map’s east to its west. A proper reprojection needs to split where the target projection splits, which is not necessarily where conventional longitude numbering is split.
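The problem itself can be illustrated on the CPU in a few lines of Python. This is only a sketch of the idea, not the GPU technique used in the demo: the function names are made up, it assumes neighboring points in the source data are never really more than 180° apart, and it simply breaks the line at the seam without computing where exactly the line crosses it.

```python
def wrap_longitude(lon, center):
    """Shift a longitude into the range [center - 180, center + 180)."""
    return (lon - center + 180.0) % 360.0 - 180.0 + center


def split_at_seam(points, center):
    """Break a polyline of (lon, lat) points wherever it crosses the map seam.

    The seam sits 180 degrees away from `center`. A jump of more than
    180 degrees between consecutive wrapped longitudes is taken to mean
    the line went "through the portal", so a new piece is started there.
    """
    pieces = []
    current = []
    previous_lon = None
    for lon, lat in points:
        wrapped = wrap_longitude(lon, center)
        if previous_lon is not None and abs(wrapped - previous_lon) > 180.0:
            pieces.append(current)
            current = []
        current.append((wrapped, lat))
        previous_lon = wrapped
    if current:
        pieces.append(current)
    return pieces


# A short border segment that straddles the prime meridian.
border = [(-10.0, 0.0), (10.0, 0.0)]

atlantic_view = split_at_seam(border, center=0.0)    # seam far away: one piece
pacific_view = split_at_seam(border, center=180.0)   # seam at 0 degrees: two pieces
```

With the map centered on the Pacific, the same two dots land on opposite edges of the map, and the function refuses to connect them.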

There are two tricks I ended up using to accomplish this on the massively-parallel GPU, keeping the amount of single-threaded JavaScript work to a minimum. I describe these techniques more on the new demo page, but the end result is a fluid spinning map without the reprojection dot-connecting issues shown above.

Video link - View demo

If you’ve got a WebGL-enabled browser available, you can view the spinning map demo on your own computer. Note that there is still one strange issue I’m trying to track down that affects at least one older nVidia card. If you don’t have a WebGL-enabled browser, a development build of Chromium (Mac OS X, Windows, 64-bit Linux or 32-bit Linux) has been an easy way to try it out without any manual configuration.

Making maps with WebGL

At &yet, we like to think that what’s the future for the web is the future for us. Well, WebGL is an upcoming HTML5 framework that will let web apps take advantage of the fast graphics cards in computers and smartphones, and it may be standard soon. Our resident “cloud cartographer” has explored some uncommon ways of using WebGL, and is now writing about the results in third person.

What You Are About To See uses WebGL to quickly transform raster and vector data using map projections. Rather than focusing on the 3D capabilities of graphics cards and street-level map data, this prototype concentrates on making a nicely shaped world map, and then animating between projections to test its speed.

View demo
diagram of square raster and vector coordinates getting churned through GPU to produce initial Robinson picture

(Note: you’ll need to use Chromium [Mac OS X, Windows, 64-bit Linux or 32-bit Linux] or another properly-configured nightly build browser.)

This works by doing the complicated projection math in a vertex shader on the graphics card’s processor instead of in JavaScript on the main CPU. The GPU is great at drawing pixels from shapes and images, and most CPUs have enough other things to keep them busy, so it’s really helpful to move work to the GPU in this way.
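For a feel of the per-point math involved, here is a small Python sketch. It is not the demo’s shader code: a real vertex shader is written in GLSL and runs once per vertex on the GPU, the sinusoidal projection is used here only because its formula is short, and blending two projections linearly is an assumption about how such an animation could be done.

```python
import math


def equirectangular(lon_deg, lat_deg):
    """Plate carree: longitude and latitude map straight to x and y (radians)."""
    return math.radians(lon_deg), math.radians(lat_deg)


def sinusoidal(lon_deg, lat_deg):
    """Sinusoidal projection: x shrinks with the cosine of the latitude."""
    lam, phi = math.radians(lon_deg), math.radians(lat_deg)
    return lam * math.cos(phi), phi


def blend(projection_a, projection_b, t, lon_deg, lat_deg):
    """Linearly interpolate one point between two projections (t from 0 to 1)."""
    ax, ay = projection_a(lon_deg, lat_deg)
    bx, by = projection_b(lon_deg, lat_deg)
    return ax + (bx - ax) * t, ay + (by - ay) * t


# One frame of a hypothetical animation, halfway between the two shapes.
halfway = blend(equirectangular, sinusoidal, 0.5, lon_deg=90.0, lat_deg=60.0)
```

Every point is transformed independently of every other point, which is exactly the kind of work a GPU can spread across its many parallel units.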

With a bit more work on this prototype, and a bit more time for browsers to finalize their implementations, you can imagine being able to embed world maps in webpages that interactively display and animate data. With the flexibility and speed of WebGL, the map wouldn’t have to be pre-generated in a fixed projection, but could instead be dynamically reshaped based on the user’s zoom level and area of interest.
