A Brief Summary
After a hackathon a few months back, we were joking about creating an easy way to take the data we’d painstakingly parsed from PDFs, Word documents, and XML files, and “translate” it back into a format that government agencies are used to. Many of us have been shell-shocked in dealing with PDFs from government agencies, which are often scanned documents, off-kilter and photocopied many times over. Fundamentally, they’re very difficult to pry information out of. For the OpenGov Foundation’s April Fools’ prank, we created Govify.org, a tool to convert plain text into truly ugly PDFs.
A quick [one-line ImageMagick command](https://gist.github.com/krues8dr/9437567) was the first version. We quickly produced a few sample documents, and decided that it would be fantastic if users could upload their own files and convert them. Very quickly it became clear that the process might take a couple of seconds, and a decent amount of CPU – so to deal with any sort of load, we’d need a modular, decentralized process, rather than a single webpage to do everything.
— Ben Balter (@BenBalter) April 1, 2014
As Ben Balter points out, there are a lot of moving pieces to this relatively simple setup. Govify.org is actually a combination of PHP, Compass + SASS, Capistrano, Apache and Varnish, Rackspace Cloud services and their great API tools, Python and Supervisord, and ImageMagick with a bash script wrapper. Why in the world would you use such a hodgepodge of tools across so many languages? Or, as most people are asking these days, “why not just build the whole thing with Node.js?”
The short answer is, the top concern was time. We put the whole project together in a weekend, using very small pushes to build standalone, modular systems. We reused components wherever possible and tried to wholly avoid known pitfalls via the shortest route around them. A step-by-step breakdown of those challenges follows.
We started with just a single ImageMagick command, which:
Takes text as an input
Fills the text into separate images
Adds noise to the images
Rotates the images randomly
And finally outputs all of the pages as a PDF.
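The steps above can be sketched in Python as a function that assembles an ImageMagick `convert` argument list. This is not the original one-liner (that lives in the gist linked earlier); it is a rough illustration, and the page size, noise level, tilt range, and file names are all assumptions made for the example. It only builds the command, so ImageMagick does not need to be installed to read or test it.

```python
import random


def build_convert_command(text_files, output_pdf, max_tilt=2.0, seed=None):
    """Build a `convert` argument list: one noisy, tilted page per text file.

    All numeric values here are illustrative guesses, not the settings
    used by Govify.org.
    """
    rng = random.Random(seed)
    cmd = ["convert"]
    for path in text_files:
        # Pick a small random tilt so every page is slightly crooked.
        angle = round(rng.uniform(-max_tilt, max_tilt), 2)
        cmd += [
            "(",
            "-size", "1275x1650",            # assumed page size in pixels
            "-background", "white",
            "-fill", "black",
            f"caption:@{path}",              # render the file's text, word-wrapped
            "-attenuate", "0.5",             # tone the noise down a little
            "+noise", "Gaussian",            # add scanner-style grain
            "-rotate", str(angle),           # tilt the page
            ")",
        ]
    cmd.append(output_pdf)                   # multiple images become PDF pages
    return cmd
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would run it on a machine with ImageMagick installed. Because the arguments are passed as a list rather than through a shell, the parentheses need no escaping.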
Using that to create a few sample documents, we began putting together a rough website to show them off. Like everyone else who needs to build a website in zero time, we threw Bootstrap onto a really basic template (generated with HTML5 Boilerplate). We used a few SASS libraries – Compass, SASS Bootstrap, and Keanu – to get some nice helpers, and copied in our standard brand styles that we use everywhere else. A few minutes in Photoshop and some filler text later, and we had a full website.
We needed a nice way to deploy the site as we made changes, and our preferred tool is Capistrano. There are other tools available, like Fabric for Python or Rocketeer for PHP, but Capistrano excels in being easy to use, easy to modify, and mostly standalone. It has also been around for a very long time, and it’s the one that we’ve been using the longest.
We’re using Rackspace for most of our hosting, so we stood up a box with Varnish in front of Apache and stuck the files on there. Website shipped!
Once that was done, we made the decision to allow users to upload their own files. At OpenGov, we’re primarily a PHP shop, so we decided to use PHP. OK, OK – stop groaning already. PHP is not the most elegant language in the world, and never will be. It has lots of horns and warts, and people like to trash it as a result. That being said, there are a few things it’s great at.
First and foremost, it’s incredibly easy to optimize. Tools like APC and HipHop VM allow you to take existing PHP scripts and make them run *very* well. The variety of optimization tools for PHP makes it a very attractive language for dealing with high-performance apps, generally.
Second, it’s a “web-first” language, rather than one that’s been repurposed for the web – and as a result, it’s very quick to build handlers for common web tasks without using a single additional library or package. (And most of those tasks are very well documented on the PHP website as well.) Handling file uploads in PHP is a very simple pattern.
So in no time at all, we were able to create a basic form where users could input a file to upload, have that file processed on the server, and output back our PDF. Using the native PHP ImageMagick functions to translate the files seemed like a lot of extra work for very little benefit, so we kept that part as a tiny shell script.
At this point, however, we realized that the file processing itself was slow enough that any significant load could slow the server considerably. Rather than spinning up a bunch of identical servers, a job queue seemed like an ideal solution.
Creating a Job Queue
A very common pattern for large websites that do processing of data is the job queue, where single items that need processing are added to a list somewhere by one application, and pulled off the list to be processed by another. (Visual explanation, from the Wikipedia Thread Queue article.) Since we’re using Rackspace already, we were able to use Rackspace Cloud Files to store our files for processing, and the Rackspace Queue to share the messages across the pieces of the application. The entire Rackspace Cloud stack is controllable via their API, and there are nice libraries for many languages available.
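To make the pattern concrete, here is a minimal in-process sketch using only Python's standard library. It stands in for the hosted queue described above and does not use the Rackspace API; the function name `run_jobs` and the sentinel convention are choices made for this example.

```python
import queue
import threading


def run_jobs(jobs, handler):
    """Enqueue jobs on one side and process them on a worker thread."""
    q = queue.Queue()
    results = []

    def worker():
        # The "backend": pull items off the list until told to stop.
        while True:
            job = q.get()
            if job is None:          # sentinel meaning "no more work"
                q.task_done()
                break
            results.append(handler(job))
            q.task_done()

    t = threading.Thread(target=worker)
    t.start()
    for job in jobs:                 # the "frontend": add work and move on
        q.put(job)
    q.put(None)
    q.join()                         # wait until every item is marked done
    t.join()
    return results
```

For example, `run_jobs(["a.txt", "b.txt"], str.upper)` hands each file name to the worker in order. The point of the split is that the side adding jobs never waits on the slow processing step.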
On our frontend, we were able to drop in the php-opencloud library to get access to the API. Instead of just storing the file locally, we push it up to Rackspace Cloud Files, and then insert a message into our queue, listing the details of the job. We also now collect the user’s email address, so that we can email them to let them know that their file is ready.
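A job message of this kind might look something like the sketch below. The field names (`container`, `object`, `email`) are hypothetical and chosen for illustration; the post does not say what the real messages contained. The frontend here is PHP, so this Python version only shows the shape of the data, which either side could produce or consume as JSON.

```python
import json

# Hypothetical field names; the actual message format is not documented here.
REQUIRED_FIELDS = {"container", "object", "email"}


def make_job_message(container, object_name, email):
    """Serialize the details a worker needs to find and process one upload."""
    body = {"container": container, "object": object_name, "email": email}
    return json.dumps(body, sort_keys=True)


def parse_job_message(raw):
    """Decode a job message and reject it early if anything is missing."""
    body = json.loads(raw)
    missing = REQUIRED_FIELDS - body.keys()
    if missing:
        raise ValueError(f"job message missing fields: {sorted(missing)}")
    return body
```

Keeping the message small, with a pointer to the stored file rather than the file itself, lets the queue stay lightweight while the file store carries the actual upload.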
The backend processing, however, presented a different set of challenges. Generally, you want an always-running process that is constantly checking the queue for new files to process. For processes that take a variable amount of time, you don’t want just a Cron job, since the processes can start stacking up and choke the server – instead we just have a single run loop that runs indefinitely, a daemon or service.
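A run loop of that kind can be sketched as follows. The callables `fetch_job`, `process_job`, and `should_stop` are placeholders for whatever talks to the real queue, and the sleep intervals are arbitrary assumptions. Because one job is handled at a time, slow jobs delay the next one instead of piling up the way overlapping Cron runs can.

```python
import time


def run_loop(fetch_job, process_job, should_stop,
             idle_sleep=1.0, max_sleep=30.0, sleep=time.sleep):
    """Process jobs one at a time until should_stop() returns True.

    Returns the number of jobs that were processed successfully.
    """
    delay = idle_sleep
    processed = 0
    while not should_stop():
        job = fetch_job()
        if job is None:
            # Queue is empty: wait, and back off so we don't hammer the API.
            sleep(delay)
            delay = min(delay * 2, max_sleep)
            continue
        delay = idle_sleep               # work arrived, reset the backoff
        try:
            process_job(job)
            processed += 1
        except Exception as exc:
            # One bad file should not take the whole worker down.
            print(f"job failed: {exc!r}")
    return processed
```

The `sleep` parameter exists only so the loop can be exercised without actually waiting; in normal use the default `time.sleep` is fine.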
For all the things that PHP is good at, memory management is not on the list. Garbage collection is not done very well, so large processes can start eating memory rapidly. PHP also has a hard memory limit, and the process is simply killed in an uncatchable way when it hits that limit.
Python, on the other hand, does a rather admirable job of this. Creating a quick script to get the job back out of the Rackspace Queue, pull down the file to be manipulated, and push that file back up was a rather simple task using the Rackspace Pyrax library. After several failed attempts at using both the python-daemon and daemonize packages as a runner for the script, we reverted to using Supervisor to keep the script going instead.
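One practical consequence of running under Supervisor is that the script stays in the foreground and is stopped with a signal (SIGTERM by default) rather than managing its own daemonization. A small sketch of how a worker might notice that signal and finish its current job before exiting is shown below; the `StopFlag` class is a made-up helper for illustration, not part of the Govify code.

```python
import signal


class StopFlag:
    """Remembers whether a shutdown signal has arrived."""

    def __init__(self):
        self.stopping = False

    def install(self):
        # Call once from the main thread at startup.
        signal.signal(signal.SIGTERM, self._handle)   # Supervisor's default stop signal
        signal.signal(signal.SIGINT, self._handle)    # Ctrl-C when run by hand

    def _handle(self, signum, frame):
        # Only set a flag; the main loop decides when it is safe to exit.
        self.stopping = True

    def __call__(self):
        return self.stopping
```

A worker would call `stop = StopFlag(); stop.install()` at startup and then loop with `while not stop(): ...`, so a restart never interrupts a file halfway through conversion.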
Obviously, this isn’t the most elegant architecture ever created. It would have made far more sense to use a single language for the whole application – most likely Python, even though very little is shared across the different pieces aside from the API.
That being said, this thing scales remarkably well. Everything is nicely decentralized, and would perform well under significant load. However, we didn’t really get very significant load from our little prank – most people were just viewing the site and example PDFs, and very few were uploading their own. Sometimes overengineering is its own reward.
Not bad for three days of work, if I do say so myself.
All of the pieces are available on GitHub, licensed under the GPLv2, for examining, forking, and commenting on.