For San Francisco, and governments everywhere, technology startups are the perfect partners

A recent article in the New York Times describes SMART Muni, “an Apple iPad app that uses Global Positioning System technology to track all of the city’s buses in real time, allowing transit managers and passengers to monitor problems and delays”. It sounds like the perfect success story: civic coders taking open data (Muni tracks its buses and trains with NextBus, which provides an XML data feed) and using that data to improve operations and create real value for the agency.

Unfortunately, it’s not a success story: the app has never been used in production. As the article explains, “Muni hopes to put the app to good use some day, but the agency is $29 million over budget and cannot afford to buy the iPads required to run the software…[nor] is the city willing to invest $100,000 to run a pilot program.”

The costs involved here—a few hundred dollars each for some iPads, perhaps a few thousand dollars to fund a stipend for a civic coder, even $100,000 for a pilot—pale in comparison to the costs associated with the big-name IT consulting firms that governments are used to dealing with.

In addition, startups, teams of civic coders, and open source projects can often deliver a working prototype or even a completed project much faster than conventional development teams. As the New York Times describes, “a small team of volunteers took just 10 days last summer to create [the app].”

Unfortunately, the City of San Francisco is out of touch with the realities of technology: “‘Start-ups fail at a high rate,’ said Jay Nath, chief innovation officer of San Francisco. ‘As stewards of taxpayer dollars, we need to be thoughtful of using that money wisely and not absorbing too much risk.’” Nath is right about one thing: start-ups do fail at an alarming rate. But that’s not the risk you might think it is, because startups aren’t like conventional development projects.

Unlike conventional projects, startups fail fast. Instead of wasting years and millions of dollars, a startup whose idea isn’t going anywhere winds down quickly. Maybe it was a bad (or even outright infeasible) idea to begin with, or the startup had the wrong team, or it tried to do too much at once. Maybe its idea has been superseded by a newer, even better technology. Whatever the reason, the startup doesn’t just grind away for years, running up a million-dollar bill. Instead, it admits that it can’t deliver, and gets out gracefully.

Consider, for example, the FBI’s Virtual Case File, a five-year, $170-million development effort that never actually delivered any working software. Imagine if the VCF project had failed after three or six months, not five years. Imagine if it had spent less than a million dollars before failing, not $170 million. Of course, the project still wouldn’t be done—but we’d have known that something was wrong up front, instead of finding out five years later, after millions of taxpayer dollars had been wasted on a doomed development effort.

More importantly, startups do have the agility necessary to keep up with the ever-changing technology marketplace. A development effort that takes five or ten years is bound to deliver a product that is obsolete as soon as it arrives, unless major changes are made along the way.

The conventional development practices used by many government agencies and their contractors don’t incorporate that kind of agility. Specifications and requirements are written early in the project’s life, perhaps even before a development team has been selected (if the project must be put out for bids). Even if the requirements are found to be lacking—or flat-out wrong—development marches on. In the end, the team will deliver a product that meets the requirements (thus satisfying the bean-counters) but which is already out-of-date and which doesn’t actually do what users need it to do.

I alluded to these problems in my recent coverage of WMATA’s initiative to install real-time information displays at bus stops. By only considering bids from vendors with “standard, proven products” and “successful existing and fully operational implementations, in multiple transit agencies”, they potentially shut out innovative startups (or even teams of civic coders, like the Mobility Lab).

It’s entirely possible that the first team to tackle a thorny problem may fail—but rather than casting them as “failures that burn holes in the city’s budget”, we’ve got to communicate to governments and taxpayers alike that not all failures are the same. There’s a big difference between a project that runs for years, spends millions of dollars, and has nothing to show for it in the end, and a project that fails after just a few months, has spent well less than a million dollars, and can identify what went wrong, so the next project will be more successful.

When it comes to technology, the best way for governments to be good ‘stewards of taxpayer dollars’ is to adopt successful development practices: small, agile, competent teams that build inexpensive, flexible products, and fail quickly if they can’t get the job done. The old way (forking over millions and millions to high-priced contractors until they finally declare defeat, then taking it up in a years-long legal battle) just doesn’t look like good stewardship anymore. Sure, established companies may have a long track record that startups don’t, but what’s it a record of? We don’t need any more million-dollar failures. We need smart civic coders developing next-generation solutions like SMART Muni, and we need governments to accept, embrace, and support them.

Rendering issue in Basic Maths with Google Chrome

Basic Maths, the WordPress theme I use on this site, seems to have developed a rendering problem in Google Chrome. The problem appears only in Chrome and not in Safari, so it’s not a general WebKit issue, and Firefox (Gecko) doesn’t exhibit it either.

You can see the issue on any individual post page, such as this one from the demo site:

Basic Maths post page rendering in Chrome 17.0.932.0 on Mac OS X 10.6.8

Basic Maths post page rendering in Firefox 5.0.1 on Mac OS X 10.6.8

Notice how the borders at the top and bottom of the meta box overrun the first sidebar column in Chrome.

Not being a CSS wizard, I’m not in a good position to derive a minimal test case from the theme’s CSS, and I can’t say precisely when the issue first appeared in Chrome, either (although I will say that I run the Chrome dev version). I know this makes for a bad bug report, but I figure it’s better than nothing to document the issue publicly. The issue doesn’t seem to be specific to certain Basic Maths installations, either; both this site and the demo site, as well as others I’ve come across online, are similarly affected in Chrome.

Air and Space

Taming MTA New York City Transit’s bus GTFS feeds

If you go to the MTA’s developer resources page, you’ll find that while there is one GTFS feed to download for the subway (and Staten Island Railway), there are five feeds for bus data—one per borough. Your first reaction might be one of annoyance—after all, the agency almost certainly keeps data for all five boroughs in the same system internally, so why not release the data in the same structure?

However, if you look at the files more closely, you’ll soon see why they’re structured the way they are: they are, simply put, massive. The problem is in the stop_times.txt file; the largest, for Brooklyn, is nearly 700 megabytes. Concatenate them together, and you get a 2-gigabyte file containing more than 30 million records. (This is a result of how the feeds are constructed, as dissected in this thread on the developer mailing list.)
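If you’d like to check record counts like these for yourself, a minimal shell sketch along the following lines should do. The function names count_records and total_records are my own invention for illustration, and the sketch assumes each file has exactly one header line and ends with a newline.

```shell
# Count the data records (every line after the header) in one GTFS text file.
count_records() {
    tail -n +2 "$1" | wc -l | tr -d ' '
}

# Sum the record counts over any number of GTFS text files.
total_records() {
    total=0
    for f in "$@"; do
        total=$((total + $(count_records "$f")))
    done
    echo "$total"
}
```

For example, `total_records google_transit_*/stop_times.txt` would print the combined number of stop times, assuming the five feeds have been unpacked into directories matching that pattern.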

Most tools designed for working with GTFS feeds simply can’t handle anything that large (or they choke badly). Yet, at the same time, many tools also assume that there will be a single feed per agency, so the per-borough feeds (which have some degree of overlap) can be something of a pain to work with.

This leads to a conundrum: you can work with the feeds one borough at a time (although even then, with some difficulty, as even the individual borough feeds are rather large), but there’s no good way to see the whole city’s bus service at once.

It turns out that with some ingenuity, this problem can be solved, although doing so takes some time and CPU resources. The basic strategy is to first naively merge the feeds together, and then refactor the merged feed, to reduce the number of stop times. The refactoring is described in this post by Brian Ferris.

Actually merging the feeds together isn’t that hard; the agency.txt, calendar.txt, calendar_dates.txt, routes.txt, and shapes.txt files are identical across the five feeds. The stops.txt file has to be merged and then deduplicated, but this can be done with simple command-line tools. For the trips.txt and stop_times.txt files, there’s no other option than to concatenate them together. This does result in a massive stop_times.txt file, but it’s only temporary.

After producing the naively concatenated feed, apply the previously mentioned OneBusAway GTFS transformer (described in more detail here) to the feed.

The transformer will need about 8 GB of memory to run (so launch the JVM with -Xmx10G, or thereabouts), and on an EC2 large instance, it’ll take about 10 minutes. When it’s done, you’ll have a stop_times.txt file which contains around 6 million records, which isn’t quite so bad (considering that the entire merged and refactored feed for the five boroughs ends up being about the same size as the unmodified feed for Brooklyn alone, it’s actually almost good).

As an aside, here’s how I constructed the merged feed; I’m always a fan of solutions which make use of basic Unix tools.

# Work in a fresh directory for the merged feed (the brace expansion below needs bash)
mkdir nyct_bus_merged
cd nyct_bus_merged

# These files are identical across the five feeds, so copy them from any one borough
cp ../google_transit_manhattan/{agency.txt,calendar.txt,calendar_dates.txt,routes.txt,shapes.txt} .

# stops.txt: keep one header line, then append the sorted, deduplicated data rows
head -n 1 ../google_transit_manhattan/stops.txt > stops.txt
for file in ../google_transit_{manhattan,bronx,brooklyn,queens,staten_island}/stops.txt; do
	tail -n +2 "$file"
done | sort -u >> stops.txt

# trips.txt: keep one header line, then concatenate the data rows from every borough
head -n 1 ../google_transit_manhattan/trips.txt > trips.txt
for file in ../google_transit_{manhattan,bronx,brooklyn,queens,staten_island}/trips.txt; do
	tail -n +2 "$file" >> trips.txt
done

# stop_times.txt: same approach; this is the (temporarily) massive file
head -n 1 ../google_transit_manhattan/stop_times.txt > stop_times.txt
for file in ../google_transit_{manhattan,bronx,brooklyn,queens,staten_island}/stop_times.txt; do
	tail -n +2 "$file" >> stop_times.txt
done
# then zip the feed and apply the GTFS transformer
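
One caveat with deduplicating stops.txt by whole lines is that it only removes rows that are byte-for-byte identical; if two boroughs describe the same stop with slightly different fields, both rows survive. A quick check along these lines might catch that before zipping the feed. The function name duplicate_ids is hypothetical, and the sketch assumes the ID sits in the first column (or in the column number given as a second argument) and that no field before it contains a quoted comma.

```shell
# Print every ID that appears on more than one data line of a GTFS text file.
# $1 = file to check, $2 = 1-based column number of the ID (defaults to 1).
duplicate_ids() {
    file=$1
    column=${2:-1}
    tail -n +2 "$file" | cut -d, -f"$column" | sort | uniq -d
}
```

Running `duplicate_ids stops.txt` on the merged feed should print nothing if every stop ID is unique; any IDs it does print would need to be reconciled by hand.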

(Finally, a disclaimer: I haven’t extensively tested the feed which is the result of the process described in this post. It’s possible that this process has unintended consequences which could affect its integrity or usefulness for certain applications.)

Announcing htmlbib, a tool for rendering BibTeX files as interactive HTML

For some time now, I’ve been working on an annotated bibliography of articles on various topics in transportation (particularly the history of automatic fare collection from 1960 to the present, as well as the SelTrac train control system and its origins in Germany). I’ve been compiling the information using BibDesk, and I’d like to be able to share it with a wider audience, in the hope that it might be useful to someone.

At a bare minimum, posting the BibTeX file online somewhere would fulfill my desire to get the information out there. But not everyone out there who might benefit from the bibliography uses BibTeX. For many people, I fear a .bib file would be nothing more than unintelligible gibberish; outside of academic circles (and even then, outside of the hard sciences), TeX is not particularly well-known.

The next alternative would be to post the bibliography online as a PDF or HTML file. This alternative is considerably more accessible to non-BibTeX users, but actually makes life harder for people who would like to be able to copy references (as BibTeX source) to use in their own BibTeX files (common practice in communities of TeX users). Merely rendering the entire contents of the file also loses some of the metadata—the comments associated with entries, the groups and keywords, etc.

There are also specialized tools (like bibtex2html) for converting a BibTeX file to HTML. But there, still, the results fall short; the output is mostly static text. I wanted a tool that would make good use of the keywords entered in BibDesk, and which would provide links between publications and authors. I also wanted a tool which would be equally useful for BibTeX users, who would be helped by having access to the BibTeX source for each entry, and non-BibTeX users, who would be helped by having formatted bibliography entries. I therefore set out to build a tool that would meet my needs; the result is htmlbib.

One of the items of concern for me was that the bibliography entries be formatted properly; after having taken care to make sure that the information was added to BibDesk so that it would be rendered well, I did not want to have some generic template used to create HTML for each entry. So, I ended up cobbling together an arrangement that actually uses BibTeX and tex4ht to produce HTML for each entry using the desired BibTeX style (in my case, IEEEtran), so that the entries look the same in the preview as they would in an actual publication. This is slow, but the preview results are cached, so subsequent runs are faster.

As for parsing the BibTeX file, since I’m already familiar with scripting BibDesk, I decided to use appscript to call BibDesk from Python. The result is therefore not portable from OS X, but it suits my needs. There are BibTeX parsing libraries for Python, so porting to another platform would only require substituting one of those libraries for the calls to BibDesk; the rest is pure Python, with the exception of lxml, and the aforementioned preview code, which expects a functioning TeX installation on the system.

The HTML is produced using Jinja2 templates, which for now are stored in the application egg. The default, built-in template is built on Blueprint CSS and jQuery along with jQuery Tools. It wouldn’t be too hard to provide an option for using user-specified templates instead of the built-in template.

I’ve uploaded some sample output to demonstrate what htmlbib does.