Tag Open data

Open standards are a force multiplier for civic software

In software engineering, software modularity and reusability are considered best practices. Unfortunately, in the civic software world, these principles are often ignored, because governments and public bodies fail to use open standards and interfaces for their data.

When governments and other public bodies adopt open standards, everyone wins. Consider, for example, the Open311 standard. When a city implements an Open311 endpoint, its citizens suddenly have the option of using any software which has been developed to support the Open311 standard. There’s no need for civic hackers in Chicago to develop one iPhone app, only for another group of civic hackers in New York to implement a substantially similar app, just because the two cities use different APIs.

For those developers, the use of open standards acts as a force multiplier: they don’t have to know anything about the cities where their apps are being used, because they all adhere to the same standards. Civic software is, for the most part, the domain of non-profits and individuals working on their own time. Development resources are far from unlimited, and, simply put, we have neither the time nor money to spend on what might be characterized as niche applications which only have use in a limited geographic area, or with one government’s proprietary API.

Closer to home, I’ve watched over the past few years as developers have expended needless effort building transit apps for the Washington, D.C. metropolitan area, simply to accommodate local transit authorities’ refusal to publish clean, high-quality data in standard formats.

Arlington County’s Mobility Lab Transit Tech initiative has developed two applications which rely on data from transit authorities. One is a package for driving real-time transit signs, and the other is Transit Near Me, a mobile webapp for mapping transit options.

I want to emphasize that I don’t mean to minimize the work of the Mobility Lab developers—in the end, they did what they needed to in order to be able to ship a working product, given the data they had access to.

Having said that, though, these applications are not all that different from similar transit apps which have already been built. The Mobility Lab’s real-time sign, for example, is (in terms of basic design concepts) not all that different from the OneBusAway sign mode.

Granted, the Mobility Lab’s real-time sign looks more polished, and includes support for transit modes like bike sharing, but imagine if instead of building a new piece of software from the ground up, the Mobility Lab developers had worked to polish the OneBusAway sign mode and add support for other transit modes?

Had they done so, every city which uses OneBusAway would have been able to benefit immediately from the improvements.

But, there’s a problem. OneBusAway consumes real-time transit information in the GTFS-realtime and SIRI VM formats. Out of the agencies in the region, only Montgomery County Ride On and VRE provide GTFS-realtime data. The other agencies which provide real-time data use proprietary formats which are incompatible with GTFS-realtime. Without detouring too deeply into technical territory, WMATA’s proprietary API actually provides all of the information that would be necessary to construct a GTFS-realtime feed for Metrobus, were it not for the fact that the API uses route, stop, and trip identifiers which are completely different from those in the static GTFS schedule.

The same goes for Transit Near Me; it is, in essence, a mobile version of the OpenTripPlanner system map. Cities around the world have adopted OpenTripPlanner; wouldn’t they also benefit from an interactive system map optimized for mobile devices?

OpenTripPlanner is designed to consume clean, well-constructed GTFS feeds, while instead Transit Near Me must include various work-arounds for idiosyncrasies in WMATA’s data: bad shapes which must be replaced with data from shapefiles, stop IDs which are only available in the API and not the GTFS feed, etc.

I should emphasize again that this isn’t just about the Mobility Lab; their work happens to highlight the problem particularly well, but they’re not the only developers to get caught up in this maelstrom:

What’s the solution? Civic hackers need to stand together with each other, and stand up for good software engineering principles. I doubt that any one developer alone will be able to convince WMATA to get their data in order (goodness knows I’ve tried). But if we stand together and recognize that reinventing the wheel over and over again is not a productive use of our time, we may be able to convince data providers to embrace open standards. When we do, it will have benefits not just locally, but for people around the world who benefit from the work of civic hackers.

For San Francisco, and governments everywhere, technology startups are the perfect partners

A recent article in the New York Times describes SMART Muni, “an Apple iPad app that uses Global Positioning System technology to track all of the city’s buses in real time, allowing transit managers and passengers to monitor problems and delays”. It sounds like the perfect success story: civic coders taking open data (Muni tracks its buses and trains with NextBus, which provides an XML data feed) and using that data to improve operations and create real value for the agency.

Unfortunately, it’s not a success story: the app has never been used in production. As the article explains, “Muni hopes to put the app to good use some day, but the agency is $29 million over budget and cannot afford to buy the iPads required to run the software…[nor] is the city willing to invest $100,000 to run a pilot program.”

The costs involved here—a few hundred dollars each for some iPads, perhaps a few thousand dollars to fund a stipend for a civic coder, even $100,000 for a pilot—pale in comparison to the costs associated with the big-name IT consulting firms that governments are used to dealing with.

In addition, startups, teams of civic coders, and open source projects can often deliver a working prototype or even a completed project much faster than conventional development teams. As the New York Times describes, “a small team of volunteers took just 10 days last summer to create [the app].”

Unfortunately, the City of San Francisco is out of touch with the realities of technology: “‘Start-ups fail at a high rate,’ said Jay Nath, chief innovation officer of San Francisco. ‘As stewards of taxpayer dollars, we need to be thoughtful of using that money wisely and not absorbing too much risk.’” Nath is right about one thing: start-ups do fail at an alarming rate. But that’s not the risk you might think it is, because startups aren’t like conventional development projects.

Unlike conventional projects, startups fail fast. Instead of wasting years and millions of dollars, when a startup has an idea that isn’t going anywhere, it winds up quickly. Maybe it was a bad (or even outright infeasible) idea to begin with, or the startup had the wrong team, or they tried to do too much at once. Maybe their idea’s been superseded by a newer, even better technology. Whatever the reason may be, the startup doesn’t just grind away for years, running up a million-dollar bill. Instead, they admit that they can’t deliver, and get out gracefully.

Consider, for example, the FBI’s Virtual Case File, a five-year, $170-million development effort that never actually delivered any working software. Imagine if the VCF project had failed after three or six months, not five years. Imagine if it had spent less than a million dollars before failing, not $170 million. Of course, the project still wouldn’t be done—but we’d have known that something was wrong up front, instead of finding out five years later, after millions of taxpayer dollars had been wasted on a doomed development effort.

More importantly, startups do have the agility necessary to keep up with the ever-changing technology marketplace. A development effort that takes five or ten years is bound to deliver a product that is obsolete as soon as it arrives, unless major changes are made along the way.

The conventional development practices used by many government agencies and their contractors don’t incorporate that kind of agility. Specifications and requirements are written early in the project’s life, perhaps even before a development team has been selected (if the project must be put out for bids). Even if the requirements are found to be lacking—or flat-out wrong—development marches on. In the end, the team will deliver a product that meets the requirements (thus satisfying the bean-counters) but which is already out-of-date and which doesn’t actually do what users need it to do.

I alluded to these problems in my recent coverage of WMATA’s initiative to install real-time information displays at bus stops. By only considering bids from vendors with “standard, proven products” and “successful existing and fully operational implementations, in multiple transit agencies”, they potentially shut out innovative startups (or even teams of civic coders, like the Mobility Lab).

It’s entirely possible that the first team to tackle a thorny problem may fail—but rather than casting them as “failures that burn holes in the city’s budget”, we’ve got to communicate to governments and taxpayers alike that not all failures are the same. There’s a big difference between a project that runs for years, spends millions of dollars, and has nothing to show for it in the end, and a project that fails after just a few months, has spent well less than a million dollars, and can identify what went wrong, so the next project will be more successful.

When it comes to technology, the best way for governments to be good ‘stewards of taxpayer dollars’ is to adopt successful development practices: small, agile, competent teams, that build inexpensive, flexible products, and fail quickly if they can’t get the job done. The old way—forking over millions and millions to high-priced contractors until they finally declare defeat, then taking it up in a years-long legal battle—just doesn’t look like good stewardship anymore. Sure, established companies may have a long track record that startups don’t, but what’s it a record of? We don’t need any more million-dollar failures. We need smart civic coders developing next-generation solutions like SMART Muni, and we need governments to accept, embrace, and support them.

Taming MTA New York City Transit’s bus GTFS feeds

If you go to the MTA’s developer resources page, you’ll find that while there is one GTFS feed to download for the subway (and Staten Island Railway), there are five feeds for bus data—one per borough. Your first reaction might be one of annoyance—after all, the agency almost certainly keeps data for all five boroughs in the same system internally, so why not release the data in the same structure?

However, if you look at the files more closely, you’ll soon see why they’re structured the way they are: they are, simply put, massive. The problem is in the stop_times.txt file; the largest, for Brooklyn, is nearly 700 megabytes. Concatenate them together, and you get a 2 gigabyte file containing more than 30 million records. (This is a result of how the feeds are constructed, as dissected in this thread on the developer mailing list)

Most tools designed for working with GTFS feeds simply can’t handle anything that large (or they choke badly). Yet, at the same time, many tools also assume that there will be a single feed per agency, so the per-borough feeds (which have some degree of overlap) can be something of a pain to work with.

This leads to a conundrum: you can work with the feeds one borough at a time (although even then, with some difficulty, as even the individual borough feeds are rather large), but there’s no good way to see the whole city’s bus service at once.

It turns out that with some ingenuity, this problem can be solved, although doing so takes some time and CPU resources. The basic strategy is to first naively merge the feeds together, and then refactor the merged feed, to reduce the number of stop times. The refactoring is described in this post by Brian Ferris.

Actually merging the feeds together isn’t that hard; the agency.txt, calendar.txt, calendar_dates.txt, routes.txt, and shapes.txt files are identical across the five feeds. The stops.txt file has to be merged and then deduplicated, but this can be done with simple command-line tools. For the trips.txt and stop_times.txt files, there’s no other option than to concatenate them together. This does result in a massive stop_times.txt file, but it’s only temporary.

After producing the naively concatenated feed, apply the previously-mentioned OneBusAway GTFS transformer (described in more detail here) to the feed.

The transformer will need about 8 GB of memory to run (so launch the JVM with -Xmx10G, or thereabouts), and on an EC2 large instance, it’ll take about 10 minutes. When it’s done, you’ll have a stop_times.txt file which contains around 6 million records, which isn’t quite so bad (considering that the entire merged and refactored feed for the five boroughs ends up being about the same size as the unmodified feed for Brooklyn alone, it’s actually almost good).

As an aside, here’s how I constructed the merged feed; I’m always a fan of solutions which make use of basic Unix tools.

mkdir nyct_bus_merged
cd nyct_bus_merged
cp ../google_transit_manhattan/{agency.txt,calendar.txt,calendar_dates.txt,routes.txt,shapes.txt} .
 
for file in ../google_transit_{manhattan,bronx,brooklyn,queens,staten_island}/stops.txt; do
	tail -n +2 $file >> stops_unmerged.txt
done;

head -n 1 ../google_transit_manhattan/stops.txt > stops.txt
cat stops_unmerged.txt | sort | uniq >> stops.txt
rm stops_unmerged.txt


head -n 1 ../google_transit_manhattan/trips.txt > trips.txt
for file in ../google_transit_{manhattan,bronx,brooklyn,queens,staten_island}/trips.txt; do
	tail -n +2 $file >> trips.txt
done;

head -n 1 ../google_transit_manhattan/stop_times.txt > stop_times.txt
for file in ../google_transit_{manhattan,bronx,brooklyn,queens,staten_island}/stop_times.txt; do
	tail -n +2 $file >> stop_times.txt
done;
#then zip the feed and apply the GTFS transformer

(Finally, a disclaimer: I haven’t extensively tested the feed which is the result of the process described in this post. It’s possible that this process has unintended consequences which could affect its integrity or usefulness for certain applications.)

Updates to tph.py

I have made some updates to tph.py, my tool for generating plots of transit service levels from GTFS feeds. Most importantly, these updates fix compatibility issues which kept it from working with certain agencies’ GTFS feeds. To start, here are two plots generated from the BART GTFS feed, using the new version of tph.py.

The first plot’s not actually all that useful, since it mixes up rail service and the AirBART bus shuttle, but it does demonstrate a point, which will be explained later. The second plot is actually useful; it readily demonstrates that the BART system was not designed to provide frequent service to the extremities of the network—while service through the core peaks at 22 trains per hour, most branches on average get four trains per hour.

Anyway, supporting the BART GTFS feed required two major changes: supporting feeds which do not use the direction_id field, and supporting feeds which use the frequencies.txt file (which is used in the BART GTFS feed for AirBART, hence its inclusion above) rather than explicit stoptimes for every trip. As a result of these changes, tph.py should now support any GTFS feed. However, feeds which do not use the direction_id field do require additional configuration to assign directions to routes and trips. This is all documented in the new documentation on the configuration file format.

In addition, tph.py‘s innards have had an overhaul; it no longer uses Google’s transitfeed module for parsing GTFS feeds. Instead, it uses a fork of the gtfs module, originally developed by Brandon Martin-Anderson and William Lachance. gtfs imports the feed and stores it in a SQLite database, using SQLAlchemy. This takes time upfront, but makes tph.py a lot faster to run. It also makes the code cleaner; some operations which previously required several nested for loops can now be done with a single SQL query.

Announcing cadorsfeed

Today I am proud to announce the launch of cadorsfeed, my application for parsing reports from the Transport Canada Civil Aviation Daily Occurrence Reporting System. A demonstration instance is available now at http://cfs.kurtraschke.me/, which has been pre-loaded with reports from the beginning of 2011 to the present. cadorsfeed loads the reports into a database, and provides a clean Web-based frontend as well as Atom and JSON feeds. I first got the idea for cadorsfeed a few years ago—I enjoyed reading CADORS reports, but the Web frontend was slow, not well-designed, and, besides, why wasn’t there a feed I could subscribe to? I had previously written a script to parse NRC Event Notification Reports into an Atom feed, so I was familiar with the territory, but this time I wanted to do more. This time, I wanted to make the information itself more useful, by extracting content from the report narratives and generating links based on that data.