mysql - How can I make my GTFS queries run faster? -
i'm trying play gtfs database, namely 1 provided ratp paris , suburbs.
the set of data huge. stop_times
table has 14 million rows.
here's tables schemas: https://github.com/mauryquijada/gtfs-mysql/blob/master/gtfs-sql.sql
i'm trying efficient way available routes @ specific location. far understand gtfs spec, here tables , links data (lat/lon) routes:
stops | stop_times | trips | routes -----------+----------------+------------+-------------- lat | stop_id | trip_id | route_id lon | trip_id | route_id | stop_id | | |
i have compiled want in 3 steps (actually 3 links have between 4 tables above), published under gist clarity: https://gist.github.com/benoitduffez/4eba85e3598ebe6ece5f
here's how created script.
i have been able find stops within walking distance (say, 200m) in less second. use:
$ . mysql.ini && time mysql -h $host -n -b -u $user -p${pass} $name -e "select stop_id, (6371000*acos(cos(radians(48.824699))*cos(radians(s.stop_lat))*cos(radians(2.3243)-radians(s.stop_lon))+sin(radians(48.824699))*sin(radians(s.stop_lat)))) distance stops s group s.stop_id having distance < 200 order distance asc" | awk '{print $1}' 3705271 4472979 4036891 4036566 3908953 3908755 3900765 3900693 3900607 4473141 3705272 4472978 4036892 4036472 4035057 3908952 3705288 3908814 3900832 3900672 3900752 3781623 3781622 real 0m0.797s user 0m0.000s sys 0m0.000s
then, getting stop_times later today (with stop_times.departure_time > '``date +%t``'
) takes lot of time:
"select trip_id stop_times stop_id in ($stops) , departure_time >= '$now' group trip_id"
with $stops
containing list of stops obtained first step. here's example:
$ . mysql.ini && time mysql -h $host -n -b -u $user -p${pass} $name -e "select stop_id, (6371000*acos(cos(radians( stops s group s.stop_id having distance < 200 order distance asc" | awk '{print $1}' 3705271 4472979 4036891 4036566 3908953 ... 9916360850964321 9916360920964320 9916360920964321 real 1m21.399s user 0m0.000s sys 0m0.000s
there more 2000 lines in result.
my last step select routes match these trip_id
s. it's quite easy, , rather fast:
$ . mysql.ini && time mysql -h $host -u $user -p${pass} $name -e "select r.id, r.route_long_name trips t, routes r t.trip_id in (`cat trip_ids | tr '\n' '#' | sed -e 's/##$//' -e 's/#/,/g'`) , r.route_id = t.route_id group t.route_id" +------+-------------------------------------------------------------------------+ | id | route_long_name | +------+-------------------------------------------------------------------------+ | 290 | (place de clichy <-> chatillon metro) - aller | | 291 | (place de clichy <-> chatillon metro) - retour | | 404 | (porte d'orleans-metro <-> ecole veterinaire de maison-alfort) - aller | | 405 | (porte d'orleans-metro <-> ecole veterinaire de maison-alfort) - retour | | 453 | (porte d'orleans-metro <-> lycee polyvalent) - retour | | 457 | (porte d'orleans-metro <-> lycee polyvalent) - retour | | 479 | (porte d'orleans-metro <-> velizy 2) - retour | | 810 | (place de la liberation <-> gare montparnasse) - aller | | 989 | (porte d'orleans-metro) - retour | | 1034 | (place de la liberation <-> hotel de ville de paris_4e__ar) - aller | +------+-------------------------------------------------------------------------+ real 0m1.070s user 0m0.000s sys 0m0.000s
with here file trip_ids
containing 2k trip ids.
how can result faster? there better way crawl through data rather stops>stop_times>trips>routes
path have taken?
the total time here around 30s 1 'query': "what routes available 200m location?". that's much...
the short answer is: use table joins , indices.
here's longer answer:
you have right idea here , understanding of how tables relate 1 correct. however, asking dbms match field values list (using where...in
) rather joining tables requiring lot more work needs to.
what want execute single query, using join
clauses link tables together. try this, additionally joins calendars
, calendar_dates
tables limit results routes operating today:
select distinct r.id, r.route_long_name (select s.stop_id, (6371000 * acos(cos(radians(48.824699)) * cos(radians(s.stop_lat)) * cos(radians(2.3243) - radians(s.stop_lon)) + sin(radians(48.824699)) * sin(radians(s.stop_lat)))) distance stops s) i_s inner join stop_times st on st.stop_id = i_s.stop_id inner join (select trip_id, route_id trips t inner join (select service_id calendars start_date <= '2014-09-09' , end_date >= '2014-09-09' , tuesday = 1 union select service_id calendar_dates date = '2014-09-09' , exception_type = 1 except select service_id calendar_dates date = '2014-09-09' , exception_type = 2) c on c.service_id = t.service_id) t_r on t_r.trip_id = st.trip_id inner join routes r on r.route_id = t_r.route_id st.departure_time > '$now' , i_s.distance < 200;
here inner join
used "add in" columns of table, including rows match condition in on
clause. should much faster generating list of results 1 query , feeding in next.
to better performance, though, want create indices prevent dbms having scan linearly through tables. rule of thumb have index defined each column used in either join
or where
clause. here indices defined, should find make above query perform quite well:
create index calendar_dates_date_exception_type_service_id_index on calendar_dates (date, exception_type, service_id); create index trips_service_id_trip_id_route_id_index on trips (service_id, trip_id, route_id); create index stop_times_trip_id_departure_time_stop_id_index on stop_times (trip_id, departure_time, stop_id); create index routes_route_id_index on routes (route_id); create index stops_stop_id_index on stops (stop_id);
Comments
Post a Comment