Rockfax Digital Deep Dives Part 1 - The Data Pipeline

12th May, 2021


Rockfax Digital, for anyone who doesn't know, is the digital version of all the Rockfax climbing information (plus a growing number of third-party guidebooks). There are links at the end of the article if you want to check it out.

This is the second in a series of articles where our developers discuss some of the interesting parts of how Rockfax Digital is put together. You can see the first article, about the topo view, here.

- Introduction
  - Database Structure
  - Extracting the Data
- In InDesign
- Managing the Data in klang (our custom data-management application)
- On the Command Line
  - Generating the Crag Packages

In this article I'm going to talk about the process of getting the information out of the printed guidebooks and into the apps. The complexity of this data pipeline is what makes it interesting to discuss, but it also makes it hard to write about without drifting out into the weeds. I'll try to give a high-level overview of each part of the procedure, and hopefully not get too bogged down in the details.

Introduction

Rockfax is, at its core, a print company. This means that the "source of truth" for all the data we have is the desktop-publishing files for our books; any corrections or updates happen in these files, ready for the next edition. When I started work on Rockfax Digital, around 2014, I was confronted with the problem of getting all the data out of our back catalogue and wrangling it into a format we could distribute on phones. At that time, we already had the Rockfax online database, but this only had textual information in it (crag/sector/route names, descriptions, grades etc). What we needed was everything: the topo photos, the hand-drawn maps, the route lines, the topo notes, the arrows, the topo numbers (the coloured numbers indicating which route the line depicts), the lot.

In the event that you're not a climber: "topo" means a drawing or photo of a cliff face with route lines and other information superimposed on top. These days everyone uses a photo where possible. In the old days it was usually a hand-drawn picture, and in the old-old days it was just a description: "Start 50' left of the faint groove" etc.

If we were a digital-first company, our authors would be inputting all the data into some content management system that created the structure for us as they went. We're a print-first company though, so instead we've got loads of InDesign documents, and InDesign files are made for laying things out on pages, not maintaining complex relationship-models. So we have to suck it all out and build up the relationships we need outside of InDesign.

InDesign is a piece of desktop-publishing software in the Adobe Creative suite, alongside Photoshop and Illustrator. It's the industry standard for laying out print documents.

It may be a surprise that there was more work involved in creating the system to extract and manage the data for Rockfax Digital than there was in creating any of the apps. Munging (computing speak for transforming) all the information contained in the back catalogue into a highly structured relational database involved years of work, gradually building the tools and know-how that now allow us to take a finished print guidebook and release it in the app within a few days.

Database Structure

InDesign has no concept of a route, a route line or a sector. It knows only text, graphics, shapes and the like. One thing it does have, however, is a JavaScript API. When you couple that with the visually-structured nature of Rockfax guidebooks, it allows for some quite interesting things.

Into-the-weeds-bar: The InDesign API is really quite thorough, and kind of amazing considering it's been there since the '90s (I think), but sadly the dialect of JavaScript that's used is awful. It's based on the ECMAScript 3 standard, which dates from 1999. Still, it works, but there are no modern-JavaScript niceties. Since 2019 there has been some chatter about all of the Adobe suite getting an embedded NodeJS runtime, but nothing has happened yet... Recently, however, we've stopped writing JavaScript code directly, and instead write TypeScript, which can transpile to this god-awful 90s JS so I don't have to deal with it.

The apps need the data to be in some sort of relational database, so at runtime we can easily do all the things the apps do, like highlight a route line when you tap that route's description, or open a crag when you tap a flag on the offline maps.

To give an insight into what this schema looks like (schema means the structure of a database, the tables and columns etc), let's imagine a route object in an app: it needs a name, a grade, a description. It needs a relationship with a topo-photo, which needs a relationship with topo numbers, route lines, topo notes, approach arrows, loweroffs, belays etc. The topo also needs a relationship to the sectors/buttresses that appear on it. Those sectors need a crag, the crag needs an area, the area needs a country… etc, etc.

At the time of writing, the iOS app has 51 different database entities (route, sector, crag, topo etc), which together comprise 134 different relationships (topo.routes etc), and 506 properties (route.grade, route.name etc). All of this relationship/property information exists implicitly in the guidebooks, but only as a visual hierarchy that can be inferred by the reader.

As humans reading a guidebook, we know that a route belongs to a certain sector and crag because the names of each are at the top of the page we're looking at. We know which route line on the topo depicts which route because the line has a number on it that is the same as the number at the start of the route's description.
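To make that inference concrete, here is a minimal sketch of the number-matching idea. The object shapes and the `linkRoutesToLines` name are assumptions made up for illustration; they are not the real Rockfax schema.

```javascript
// Hypothetical shapes: each description and each line on a topo carries the printed number.
function linkRoutesToLines(routes, routeLines) {
  // Index the lines by their topo number so each lookup is cheap.
  const lineByNumber = new Map(routeLines.map((line) => [line.topoNumber, line]));
  return routes.map((route) => ({
    ...route,
    // null marks a route with no matching line, which a later check could flag.
    routeLine: lineByNumber.get(route.number) ?? null,
  }));
}

const linkedRoutes = linkRoutesToLines(
  [{ number: 1, name: "Route A" }, { number: 2, name: "Route B" }, { number: 3, name: "Route C" }],
  [{ topoNumber: 2, id: "line-b" }, { topoNumber: 1, id: "line-a" }]
);
console.log(linkedRoutes.map((r) => `${r.name}: ${r.routeLine ? r.routeLine.id : "no line"}`));
```

The same pattern repeats up the hierarchy: a sector or crag name at the top of a page plays the role that the number plays here.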

Topo-relationship inference © rockfax

Due to InDesign's scripting engine, all this information can be exposed to our parsing pipeline (once we've written enough glue code!).

Extracting the Data

Once a book is finished and off to the printers, the process of data extraction begins. We begin by making a copy of the document, safeguarding the exact version we sent to the printers, as the data extraction process for Rockfax Digital involves altering the layout of the pages.

The goal we had from the start was to never lose fidelity with the translation from paper to digital; any information that could be inferred from the books needed to be available in the apps, and needed to behave as a 21st-century user would expect. For example, route lines should be tappable, as should crag and sector flags, and the images should be as high a resolution as we can reasonably get without requiring users to have really expensive phones.

There's quite a list of things that now need to happen to wrangle the data into a format we can ingest in the apps. These are done using a custom framework of InDesign scripts and a desktop application we call klang, because I'm bad at naming things.

In InDesign

Completing Route Lines

Route lines are drawn by our authors in the simplest way possible for printing. If route ② branches from route ①, then only that branch is drawn - no need to draw the shared start because the reader of a printed guide wouldn't see it anyway. The reader needs to infer any shared segments by looking at the topo and figuring it out with the help of the description.

This is no good for Rockfax Digital. It would be pretty crap if when you tap on a route that has a shared start and finish you only see highlighted the short independent segment in the middle. You can see what I'm talking about in the animation below. The coloured indicators show first how the lines would be highlighted if we did nothing, then how it will be after we've joined them up.

To explain the original way we'd do this, let's consider the following route lines (I've made the lines different weights for clarity):

Lines in need of joining © rockfax

If we take route 3 as an example, the way you'd manually join it with the other lines would be to copy/paste the line for route 2, cut it where it meets route 3, delete the top part of this cut line, then merge the remaining segment with the line for 3, as in the animation below.

We did it like this for a while, but you can imagine how boring that gets when a single book can have thousands of route lines in it. Plus, there's no way I'd be doing that for the catwalk at Malham!

So, as is the way with these things, as soon as we found the time, we wrote a script to do it for us. The solution was to analyse all the lines on a page, then figure out which ones shared segments with others, then generate full paths to replace the partial ones on the page. We then throw in the coloured markers you see in the top animation so we can see at a glance that the script got it right.
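The general shape of that analysis can be sketched as follows. This is a heavily simplified illustration, not the real script: it treats lines as straight-segment polylines rather than Bézier paths, only handles shared starts (not shared finishes), and the `completeLine` name, the `points` shape and the tolerance value are all assumptions.

```javascript
// Shortest distance from point p to the segment a-b.
function distanceToSegment(p, a, b) {
  const dx = b.x - a.x;
  const dy = b.y - a.y;
  const lengthSquared = dx * dx + dy * dy;
  // t is the position of the closest point along the segment, clamped to 0..1.
  const t = lengthSquared === 0
    ? 0
    : Math.max(0, Math.min(1, ((p.x - a.x) * dx + (p.y - a.y) * dy) / lengthSquared));
  return Math.hypot(p.x - (a.x + t * dx), p.y - (a.y + t * dy));
}

// Returns the points of the full route: the shared start (if any) plus the drawn branch.
function completeLine(line, allLines, tolerance = 0.5, visited = new Set()) {
  visited.add(line); // guards against two lines that appear to branch from each other
  const start = line.points[0];
  for (const other of allLines) {
    if (visited.has(other)) continue;
    for (let i = 0; i + 1 < other.points.length; i++) {
      if (distanceToSegment(start, other.points[i], other.points[i + 1]) > tolerance) continue;
      // The parent may itself be a branch, so complete it first.
      const parentPath = completeLine(other, allLines, tolerance, visited);
      const prefixLength = parentPath.length - other.points.length;
      // Keep the parent up to the start of the segment the branch leaves from.
      return [...parentPath.slice(0, prefixLength + i + 1), ...line.points];
    }
  }
  return line.points.slice(); // starts on the ground: nothing to add
}
```

Given a main line from (0, 0) up to (0, 20) and a branch drawn from (0, 10) to (5, 15), `completeLine` on the branch returns a path that starts at (0, 0), which is the line we'd want highlighted when the route is tapped.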

Merging Topos

Again due to the print nature of the documents, we've got some spreads (a spread is two facing pages) with what is logically a single topo, but has been chopped into two so that no information is hidden in the spine of the book. It's no good having this result in two topos in the data, so we need to merge them into a single topo. Another instance of a boring task that is pretty easy to automate with a script.
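A minimal sketch of the merge, assuming the two halves sit side by side at the same scale with no gap between them, and that each half stores its objects in its own local coordinates. The `mergeTopoHalves` name and the object shapes are made up for illustration.

```javascript
// Merge the left-page and right-page halves of one logical topo.
function mergeTopoHalves(left, right) {
  // Objects on the right half move across by the width of the left half.
  const shift = (point) => ({ x: point.x + left.width, y: point.y });
  return {
    width: left.width + right.width,
    height: Math.max(left.height, right.height),
    lines: [
      ...left.lines,
      ...right.lines.map((line) => ({ ...line, points: line.points.map(shift) })),
    ],
  };
}

const mergedTopo = mergeTopoHalves(
  { width: 200, height: 150, lines: [{ id: "a", points: [{ x: 10, y: 140 }] }] },
  { width: 180, height: 150, lines: [{ id: "b", points: [{ x: 20, y: 140 }] }] }
);
console.log(mergedTopo.width, mergedTopo.lines[1].points[0].x);
```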

Assigning Objects to Topos/Crags/Sectors

This is the bit where we figure out which crag, sector and topo an object belongs to, based on where it is in the InDesign document.

For the most part this should be pretty simple, and could be inferred from what page an object is on – if its bounding rectangle falls within a topo's frame, that type of thing. Unfortunately we get tripped up again by the print books. Because, for some smaller sectors, we might have multiple topos for multiple sectors on the same spread (or even multiple crags on the same spread), any inference system we made would be prone to errors. So instead we have a manual process for every topo where we select everything related to it (route lines, route numbers, route descriptions etc) and explicitly tag them in a way that we can export to our intermediate format. The tag tells us which topo, crag and sector an object belongs to, which allows us to build the relationship graph later in klang.

Assigning Topo Numbers to Route Lines

During the tagging step above we also figure out which topo numbers go with which route lines, then add tags that can be used by klang later to build the relationships. Usually, this is a totally automated step. Most topo numbers are placed, by convention, at the start of a route line. This makes it easy to do something like this:

// this is pseudo code, it doesn't actually work
function assignTopoNumberToLine( line ) {
  // the first point of the path is where the route starts
  const lineStartPoint = line.points[0]
  for ( const num of findTopoNumbers() ) {
    // does this number's frame sit on the start of the line?
    if ( num.boundsContainsPoint( lineStartPoint ) ) {
      line.addTag( { topoNumber: num.value } )
      break
    }
  }
}

This simple implementation works for most cases, but the majority of pages in a Rockfax guidebook have at least one topo number that isn't placed on the start of its route line. Routes that branch off another one, for example, have the number placed just after the branch point (see route 14 in the image below).

Branching route © rockfax

These are a bit harder to work with. Ideally we'd just be able to ask InDesign if a path intersects another one, but this is an area where the API is a bit lacking, so we had to write our own implementation.

Freeform paths in InDesign are represented by a type of mathematical curve called a cubic Bézier curve. Although I didn't even know what one was when I started work on all this parsing stuff, I've really grown to appreciate the beauty of these things. I encourage anyone who's not familiar with them to check out this brilliant primer article.

Long story short, to figure out if a Bézier passes through a rectangle (our topo number's bounds) we need to use some Bézier mathematics functions to sample points along the path, then check if these points fall inside the rectangle. The frequency of the sampling has to be such that you definitely won't miss the rectangle by sampling one side of it then jumping past it for your next sample.
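Here is a rough sketch of that sampling approach for a single cubic Bézier segment. The function names and the rectangle shape (`left`, `right`, `top`, `bottom`, with y increasing down the page) are assumptions. The sample count relies on the fact that a cubic Bézier never moves faster than three times its longest control-polygon leg, so consecutive samples end up no more than half the rectangle's shorter side apart. A curve that only clips a corner could still slip through, so treat it as an illustration rather than a watertight test.

```javascript
// Evaluate a cubic Bézier at parameter t (0..1) from its four control points.
function cubicBezierPoint(p0, p1, p2, p3, t) {
  const u = 1 - t;
  const a = u * u * u, b = 3 * u * u * t, c = 3 * u * t * t, d = t * t * t;
  return {
    x: a * p0.x + b * p1.x + c * p2.x + d * p3.x,
    y: a * p0.y + b * p1.y + c * p2.y + d * p3.y,
  };
}

function rectContainsPoint(rect, p) {
  return p.x >= rect.left && p.x <= rect.right && p.y >= rect.top && p.y <= rect.bottom;
}

// curve is an array of four control points: [p0, p1, p2, p3].
function bezierPassesThroughRect(curve, rect) {
  const [p0, p1, p2, p3] = curve;
  const leg = (a, b) => Math.hypot(b.x - a.x, b.y - a.y);
  // Upper bound on how far the curve travels per unit of t.
  const maxSpeed = 3 * Math.max(leg(p0, p1), leg(p1, p2), leg(p2, p3));
  // Largest gap we allow between consecutive samples.
  const maxGap = Math.min(rect.right - rect.left, rect.bottom - rect.top) / 2;
  const samples = Math.max(1, Math.ceil(maxSpeed / maxGap));
  for (let i = 0; i <= samples; i++) {
    if (rectContainsPoint(rect, cubicBezierPoint(p0, p1, p2, p3, i / samples))) return true;
  }
  return false;
}
```

A full route line is a chain of these segments, so the real check would run this over each segment in turn.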

Into-the-weeds-bar: Doing loads of point sampling creates a problem with the JavaScript engine because it's really slow, but we can reduce the number of samples we need to take with some simple optimisations like only sampling lines whose bounding box (the smallest rectangle we can draw that would include all of a line) intersects the bounding box of the topo number we're checking against. This one check can easily reduce the number of calculations we have to do by 90% or more.
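The bounding-box pre-check might look something like this. It leans on the property that a Bézier curve always lies inside the box around its control points, so that box is a cheap, slightly generous stand-in for the true bounds. The names and object shapes are assumptions for illustration.

```javascript
// Smallest axis-aligned box containing all the given points.
function boundingBox(points) {
  const xs = points.map((p) => p.x);
  const ys = points.map((p) => p.y);
  return {
    left: Math.min(...xs),
    right: Math.max(...xs),
    top: Math.min(...ys),
    bottom: Math.max(...ys),
  };
}

// True when the two boxes overlap or touch.
function boxesIntersect(a, b) {
  return a.left <= b.right && b.left <= a.right && a.top <= b.bottom && b.top <= a.bottom;
}

// Keep only the lines worth the expensive point sampling.
function candidateLines(lines, numberBounds) {
  return lines.filter((line) => boxesIntersect(boundingBox(line.controlPoints), numberBounds));
}
```

Only the lines returned by `candidateLines` need sampling; everything else is ruled out with four comparisons.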

There are a few more complexities around route lines with shared starts, where a topo dot might be laid on top of multiple lines. In these instances we have to work backwards, figuring out which lines intersect only one topo dot then through a process of elimination figure out the rest.
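That process of elimination can be sketched like this, assuming we've already worked out which topo numbers each line touches. The `assignByElimination` name and the input shape are hypothetical.

```javascript
// candidates: { lineId: [topo numbers that line touches] }
function assignByElimination(candidates) {
  const remaining = new Map(
    Object.entries(candidates).map(([id, numbers]) => [id, new Set(numbers)])
  );
  const assigned = {};
  let progress = true;
  while (progress) {
    progress = false;
    for (const [id, numbers] of remaining) {
      if (numbers.size !== 1) continue;
      // Only one possibility left, so this line must own that number.
      const [number] = numbers;
      assigned[id] = number;
      remaining.delete(id);
      // No other line can have it now.
      for (const others of remaining.values()) others.delete(number);
      progress = true;
    }
  }
  // Anything still here needs a human to decide.
  return { assigned, unresolved: [...remaining.keys()] };
}

console.log(assignByElimination({ a: [1], b: [1, 2], c: [1, 2, 3] }));
```

Lines left in `unresolved` are the ones that would fall through to manual matching.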

Routes with shared starts © UKC Articles

Finally, occasionally a route line doesn't even intersect with a topo number, so then we have to do some manual matching up to let the system know the relation. This would probably be pretty easy to automate away, but at the moment I prefer the safety of requiring an explicit choice.

Assigning topo dots manually © rockfax

You can tell that this is an internal tool from the fact that I've never bothered to fix the missing "l" in "cancel"...

Geo-locating Maps

Rockfax books have a particular style of map that has evolved over the years. These maps often have more relevant information on them than more mainstream maps would because we tailor them to our precise needs (we also remove a lot of unneeded info, which makes them clearer when trying to navigate to a specific location, i.e. whichever crag we're covering).

RFDigital shows locations for crags and parkings etc. on the system map of whatever device it's being used on, but this alone would lose the extra detail we've already gone to the effort of documenting on our hand-drawn maps. Because of this we decided to also export our hand-drawn maps just like we do our topos. This has the added benefit of not needing a network connection once you've downloaded a crag package, so we refer to them as "offline maps" in RFDigital.

The Offline Map View in the iOS App © UKC Articles

Originally the maps were painstakingly hand-drawn by Alan James (digitally, not on paper or anything like that). These days we get the base map and contours from actual mapping sources using QGIS, then painstakingly style it to our liking. I'm not sure the process is any quicker now, but the results are way better. Plus, because the data is very accurate, it means we can do some stuff to show a user their current location as a blue dot.

To do that, we need to attach some metadata to each map that lets klang know the area it describes on the planet. The system we settled on is to drop two little flags onto each map in InDesign, ideally in opposite corners, and specify the coordinate that the flagpole is touching. We pick spots as close to opposite corners as we can, but we have to pick a feature we can easily identify so we can get its coordinate. A crossroads or corner of a building is ideal.

A Geo-anchor in InDesign © UKC Articles

In klang we can use this information to calculate the latitude or longitude of the edges of the map with a bit of simple maths:

- calculate the location of each flagpole in its parent map in fractions, giving an {x, y} result like {x:0.5, y:0.5} which would describe the exact centre of the map. {x:0.0, y:0.0} would describe the top-left corner.
- calculate the x- and y-distance between the flagpoles on the page, again the unit is fractions of the parent map.
- with the above info and the latitude and longitude we already have from the text in the flags we can calculate the latitude and longitude of the four edges of the map.
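The three steps above can be sketched as follows. This assumes a north-up map that is small enough for latitude and longitude to vary linearly across the page; the `mapEdgesFromAnchors` name and the anchor shape are made up for the example.

```javascript
// Each anchor: { x, y } as fractions of the map (0,0 = top-left), plus its lat/lon.
function mapEdgesFromAnchors(a, b) {
  if (a.x === b.x || a.y === b.y) {
    throw new Error("The two flags must differ in both x and y");
  }
  // Degrees covered by the full width and full height of the map.
  const lonPerWidth = (b.lon - a.lon) / (b.x - a.x);
  const latPerHeight = (b.lat - a.lat) / (b.y - a.y);
  return {
    west: a.lon - a.x * lonPerWidth,
    east: a.lon + (1 - a.x) * lonPerWidth,
    north: a.lat - a.y * latPerHeight,
    south: a.lat + (1 - a.y) * latPerHeight,
  };
}

const mapEdges = mapEdgesFromAnchors(
  { x: 0.25, y: 0.25, lat: 53.75, lon: -1.75 },
  { x: 0.75, y: 0.75, lat: 53.25, lon: -1.25 }
);
console.log(mapEdges);
```

The guard at the top also shows why the flags go in opposite corners: the further apart they are on both axes, the less a small placement error gets magnified.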

Then to show the user location in an app we do sort of the reverse of the above – take a coordinate and figure out where that would fall inside the frame of the map.
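A minimal sketch of that reverse lookup, under the same assumptions (north-up map, linear interpolation). The `locateOnMap` name and the shape of the `edges` object are hypothetical.

```javascript
// edges: { north, south, west, east } in degrees; coordinate: { lat, lon }.
function locateOnMap(edges, coordinate) {
  // Fractions of the map, with (0, 0) at the top-left corner.
  const x = (coordinate.lon - edges.west) / (edges.east - edges.west);
  const y = (edges.north - coordinate.lat) / (edges.north - edges.south);
  // Outside 0..1 means the user is off the edge of this map.
  const onMap = x >= 0 && x <= 1 && y >= 0 && y <= 1;
  return { x, y, onMap };
}

const exampleEdges = { north: 54, south: 53, west: -2, east: -1 };
console.log(locateOnMap(exampleEdges, { lat: 53.5, lon: -1.5 }));
```

Multiplying `x` and `y` by the map image's pixel width and height gives the spot to draw the blue dot.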

A side-effect of having these flags in InDesign is that we can put this same algorithm into a script and use it to fill in the parking locations for us. Automation like this makes me happy =]

Managing the Data in klang

Klang is our in-house data-wrangling software. In it, we do things like assign a rockfax_id and ukc_id to each route and crag, and perform loads of integrity checks. Some examples are below:

- making sure routes have:
  - a route line
  - a topo number
  - a topo
  - a valid grade
  - the correct colour for their grade
  - the correct number (it's sequential compared to the routes around it)
- making sure topos have their frame set so that it crops the underlying image (not doing this results in whitespace around the topo in rfdigital)
- crags/buttresses have valid coordinates
- crags/buttresses appear on offline maps
- maps have geo-anchors
- there are no more book-centric references ("see page…" etc)

…plus a load more. There are something like 60 different checks that happen once the data has been extracted.
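To give a flavour of what a few of these checks might look like in code, here is a small sketch. The grade list is a tiny illustrative subset, and the route shape, check names and `runRouteChecks` function are assumptions rather than klang's actual code.

```javascript
// Illustrative subset only; the real list of grades is much longer.
const VALID_GRADES = new Set(["VS", "HVS", "E1", "E2"]);

const routeChecks = [
  { name: "has a route line", passes: (route) => route.routeLine != null },
  { name: "has a topo number", passes: (route) => route.topoNumber != null },
  { name: "has a valid grade", passes: (route) => VALID_GRADES.has(route.grade) },
  { name: "has no book-centric references", passes: (route) => !/see page/i.test(route.description) },
];

// routes: the routes of one sector, in the order they appear in the book.
function runRouteChecks(routes) {
  const failures = [];
  routes.forEach((route, index) => {
    for (const check of routeChecks) {
      if (!check.passes(route)) failures.push({ route: route.name, check: check.name });
    }
    // Numbers should count up by one from the previous route.
    if (index > 0 && route.number !== routes[index - 1].number + 1) {
      failures.push({ route: route.name, check: "has a sequential number" });
    }
  });
  return failures;
}
```

Keeping each check as a small named function makes it easy to keep adding checks as new kinds of mistake turn up.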

Klang is architected using the "small core" paradigm. The core is written in Objective-C and Swift ("native"), but most of its functionality is written using a JavaScript plugin system. This means it's not quite as fast as it could be, but the nature of the books means that klang is under constant development trying to keep up with changes to the layout and styling, so being able to modify its functionality without having to recompile the application is a massive boon. Only the most performance-intensive tasks are written natively, such as the image tiling functions, but all the parsing happens in JavaScript.

When klang launches, it watches the plugin directory for any changes, then reloads the plugin system when it detects one. The plugin system includes a library of functionality and a menu hierarchy, which reflects the directory hierarchy on disk.

klang menus © rockfax

I've noticed whilst making this image that the menus get created in reverse-alphabetical order. That bug must have existed for 6 or more years...

Anyway, you can see how a directory on disk results in a menu in klang (and a sub-directory results in a sub-menu, but we don't see that in this image). Then any .js files result in a menu item that calls the .js file when clicked. The special "file-with-many-dashes" creates a separator in the menu.
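The directory-to-menu mapping can be sketched roughly like this. It works from a plain list of relative paths rather than reading the disk, and it assumes a separator file is one whose name is made up entirely of dashes; both of those, along with the `buildMenu` name, are guesses for illustration.

```javascript
// paths: plugin files relative to the plugin directory, e.g. "Export/topos.js".
function buildMenu(paths) {
  const root = { title: "root", items: [] };
  // Sorting first gives alphabetical menus.
  for (const path of [...paths].sort()) {
    const parts = path.split("/");
    let menu = root;
    // Each directory along the way becomes a (sub-)menu.
    for (const dir of parts.slice(0, -1)) {
      let sub = menu.items.find((item) => item.type === "menu" && item.title === dir);
      if (!sub) {
        sub = { type: "menu", title: dir, items: [] };
        menu.items.push(sub);
      }
      menu = sub;
    }
    const file = parts[parts.length - 1];
    if (/^-+(\.js)?$/.test(file)) {
      menu.items.push({ type: "separator" });
    } else if (file.endsWith(".js")) {
      menu.items.push({ type: "item", title: file.slice(0, -3), script: path });
    }
  }
  return root;
}

console.log(JSON.stringify(buildMenu(["Export/topos.js", "Checks/routes.js"]), null, 2));
```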

Extraction of Data from InDesign to an Intermediate Format

Because we're extracting almost all of the contents of each InDesign document, and the InDesign scripting API is slow, we extract it to an intermediate format. This means that if we've done the extraction once, then change one little thing in the InDesign document, we can just re-extract the one spread that change was on. This saves us a lot of time, at the expense of a bit more complexity and more disk space used.
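One way to picture the per-spread re-extraction is as a cache keyed by spread. In this sketch the `modified` marker, the cache shape and the `extractChangedSpreads` name are all assumptions; the article doesn't say how a changed spread is actually identified.

```javascript
// spreads: [{ id, modified }], cache: Map of id -> { modified, data }.
// extractSpread is the slow step that talks to InDesign (passed in here as a stand-in).
function extractChangedSpreads(spreads, cache, extractSpread) {
  const reExtracted = [];
  for (const spread of spreads) {
    const cached = cache.get(spread.id);
    // Skip spreads whose cached copy is still current.
    if (cached && cached.modified === spread.modified) continue;
    cache.set(spread.id, { modified: spread.modified, data: extractSpread(spread) });
    reExtracted.push(spread.id);
  }
  return reExtracted;
}

const spreadCache = new Map();
const fakeExtract = (spread) => `contents of spread ${spread.id}`;
console.log(extractChangedSpreads([{ id: 1, modified: "a" }, { id: 2, modified: "b" }], spreadCache, fakeExtract));
console.log(extractChangedSpreads([{ id: 1, modified: "a" }, { id: 2, modified: "c" }], spreadCache, fakeExtract));
```

The first call extracts both spreads; the second only touches spread 2, which is where the time saving comes from.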

On The Command Line

The final steps in the process are handled by a command-line application. This might seem a little weird when we already have a desktop application (klang) for manipulating and massaging the data. However, the format we now use for the packages (a zip file containing SQLite files with the image tiles in) isn't the format we originally used, so when we moved to this new format there were loads of old-format packages that needed converting. Klang is document-based (its files have a book as the logical unit), so it just made sense to have this tooling outside klang at first. Plus it's much quicker writing a command-line interface (CLI) than a graphical one, and much easier to update.

Generating the Crag Packages

The topos, overviews and maps have all been output into directories by klang, each containing hundreds or thousands of small image files. Now the CLI finds this data, along with the .json file with all the data for a crag, and creates the final packages, each one describing a crag in its entirety.

Well, nearly its entirety – maps and overviews are separate because they are shared between crags, so we don't want to download them multiple times if they already exist on the device.

From these packages, we then read the crag data, and this crag data is thinned out a bit and used to build up the metadata for what is available to download.
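As a rough illustration of that thinning-out step, here is a sketch that reduces full crag records to a small summary and lists each shared map or overview only once. Every field name here (`mapIds`, `overviewIds`, `routeCount` and so on) is invented for the example; the real metadata format isn't described in this article.

```javascript
// crags: full crag records as read back from the packages.
function buildDownloadMetadata(crags) {
  const sharedPackages = new Set();
  const cragSummaries = crags.map((crag) => {
    const requires = [...crag.mapIds, ...crag.overviewIds];
    // Maps and overviews are shared, so each is listed once however many crags use it.
    for (const id of requires) sharedPackages.add(id);
    return {
      id: crag.id,
      name: crag.name,
      routeCount: crag.routes.length,
      requires,
    };
  });
  return { crags: cragSummaries, sharedPackages: [...sharedPackages] };
}

console.log(buildDownloadMetadata([
  { id: "crag-1", name: "Crag One", routes: [{}, {}], mapIds: ["map-1"], overviewIds: ["overview-1"] },
  { id: "crag-2", name: "Crag Two", routes: [{}], mapIds: ["map-1"], overviewIds: [] },
]));
```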

Finally, the CLI does a load of checking of both the packages and the metadata, then it uploads it all to the server ready for installation by our users.

That's it for now. Once I've recovered from writing this one, maybe I'll dive into klang in more detail for the next article, but we really will be getting into the weeds there.

You can find Rockfax Digital on the App Store for iOS and the Play Store for Android, and sign up for a subscription at rockfax.digital.

UKH Articles and Gear Reviews by Stephen Horne Rockfax


Comments

Chris Craggs

12 May, 2021

Thanks for that Steve, all pretty amazing tbh - and that's only the 5% I understood!

Chris

planetmarshall

12 May, 2021

https://youtu.be/Z7_QQn_2McM

wintertree

12 May, 2021

Great stuff. I look forwards to part 2.

Always nice to find someone else has hit the point where they decide it’s going to be easier to cast a load of points and see if any fall in the box, than to solve the intercept equations! Very glad I’ve never had to do so in JS...

remus

13 May, 2021

Cool article, interesting to peek behind the curtains! Very impressed with the amount of automation you've managed to build into extracting info from the topos too; even for something like rf topos that appear fairly well structured I imagine there's a whole world of weird little edge cases and idiosyncrasies to take into account.

Do you think rockfax will convert to being digital first at some point? i.e. will the database + interactive topos become the reference rather than the books? From my totally naive "wouldn't it be easier if you just..." point of view, I would have thought going from a structured database into a book format would have been easier than doing things the other way round.

I'd also be interested to understand how you deal with changes in formatting between different books. I imagine the layout in the books has evolved through time, so is it tricky accounting for those changes when you're extracting the data?

robert-hutton

13 May, 2021

Nice article, would it be possible for the menus to be given accordion menu actions e.g crags and routes on guide books and logbook on dates so saving on the endless scroll.

Plus I quite like the public voting of grade on UKC but not on the app, which might make a community of being involved in any possible grade change.
