How can I list all the stops associated with a route using GTFS?

Cam Saul picture Cam Saul · Nov 15, 2012 · Viewed 10.3k times · Source

I'm working with some GTFS data and would like to be able to create a list of all stops associated served by a route. I don't really understand how to do with with GTFS data.

Trips.txt comes in a format like this:

route_id,service_id,trip_id,trip_headsign,direction_id,block_id,shape_id 1,A20120610WKD,A20120610WKD_000800_1..S03R,SOUTH FERRY,1,,1..S03R 1,A20120610WKD,A20120610WKD_002700_1..S03R,SOUTH FERRY,1,,1..S03R 1,A20120610WKD,A20120610WKD_004700_1..S03R,SOUTH FERRY,1,,1..S03R 1,A20120610WKD,A20120610WKD_006700_1..S03R,SOUTH FERRY,1,,1..S03R 1,A20120610WKD,A20120610WKD_008700_1..S03R,SOUTH FERRY,1,,1..S03R

I tried reading in the matching shape using the shape_id and then looking for stops with matching latitudes and longitudes but that doesn't seem to work reliably. Does anybody know how to do this?

Answer

user473305 picture user473305 · Nov 30, 2012

As you've noticed, there isn't a direct relationship between routes and stops in GTFS. Instead, stops are associated with trips, where each trip represents a single "run" of a vehicle along a particular route. This reflects the fact a route does not necessarily serve every one of its stops at all times—on weekends it might skip stops outside a high school, for instance.

So getting a list of every stop served by a route involves combining several models:

  • routes.txt gives you the route ID for the route you're interested in.
  • trips.txt gives you a set of trip IDs for that route.
  • stop_times.txt gives you a set of stop IDs for the stops served on each of these trips.
  • stops.txt gives you information about each of these stops.

Assuming you're using an SQL database to store your GTFS data, you might use a query like this (once you've obtained the route ID):

SELECT stop_id, stop_name FROM stops WHERE stop_id IN (
  SELECT DISTINCT stop_id FROM stop_times WHERE trip_id IN (
    SELECT trip_id FROM trips WHERE route_id = <route_id>));

Remember, though, this will output a record for every stop that is ever served by the route. If you're generating schedule information for a rider you'll probably want to limit the query to only trips running today and only stop times with departures in, say, the next thirty minutes.


Update: I wrote the above SQL query the way I did as I felt it most simply illustrated the relationship between the GTFS models, but btse is correct (in his answer below) that a query like this would never actually be used in production. It's too slow. You would instead use table joins and indices to keep query times reasonable.

Here is an equivalent query, written in a way more suited to being copied and pasted into a real application:

SELECT DISTINCT stops.stop_id, stops.stop_name
  FROM trips
  INNER JOIN stop_times ON stop_times.trip_id = trips.trip_id
  INNER JOIN stops ON stops.stop_id = stop_times.stop_id
  WHERE route_id = <route_id>;

Typically you would also create an index for each column used in a JOIN or WHERE clause, which in this case would mean:

CREATE INDEX stop_times_trip_id_index ON stop_times(trip_id);

CREATE INDEX trips_route_id_index ON trips(route_id);

(Note that RDBMSes normally index each table by its primary key automatically, so there is no need to explicitly create an index on stops.stop_id.)

Many further optimizations are possible, depending on the specific DBMS in use and your willingness to sacrifice disk space for performance. But these commands will yield good performance on virtually any RDBMS without needlessly sacrificing clarity.