mysql - How can I make my GTFS queries run faster? -


i'm trying play gtfs database, namely 1 provided ratp paris , suburbs.

the set of data huge. stop_times table has 14 million rows.

here's tables schemas: https://github.com/mauryquijada/gtfs-mysql/blob/master/gtfs-sql.sql

i'm trying efficient way available routes @ specific location. far understand gtfs spec, here tables , links data (lat/lon) routes:

stops      | stop_times     | trips      | routes -----------+----------------+------------+-------------- lat        | stop_id        | trip_id    | route_id lon        | trip_id        | route_id   | stop_id    |                |            | 

i have compiled want in 3 steps (actually 3 links have between 4 tables above), published under gist clarity: https://gist.github.com/benoitduffez/4eba85e3598ebe6ece5f


here's how created script.

i have been able find stops within walking distance (say, 200m) in less second. use:

$ . mysql.ini && time mysql -h $host -n -b -u $user -p${pass} $name -e "select stop_id, (6371000*acos(cos(radians(48.824699))*cos(radians(s.stop_lat))*cos(radians(2.3243)-radians(s.stop_lon))+sin(radians(48.824699))*sin(radians(s.stop_lat)))) distance stops s group s.stop_id having distance < 200 order distance asc" | awk '{print $1}' 3705271 4472979 4036891 4036566 3908953 3908755 3900765 3900693 3900607 4473141 3705272 4472978 4036892 4036472 4035057 3908952 3705288 3908814 3900832 3900672 3900752 3781623 3781622  real    0m0.797s user    0m0.000s sys     0m0.000s 

then, getting stop_times later today (with stop_times.departure_time > '``date +%t``') takes lot of time:

"select trip_id  stop_times       stop_id in ($stops) , departure_time >= '$now'   group trip_id" 

with $stops containing list of stops obtained first step. here's example:

$ . mysql.ini && time mysql -h $host -n -b -u $user -p${pass} $name -e "select stop_id, (6371000*acos(cos(radians( stops s group s.stop_id having distance < 200 order distance asc" | awk '{print $1}' 3705271 4472979 4036891 4036566 3908953 ... 9916360850964321 9916360920964320 9916360920964321  real    1m21.399s user    0m0.000s sys     0m0.000s 

there more 2000 lines in result.

my last step select routes match these trip_ids. it's quite easy, , rather fast:

$ . mysql.ini && time mysql -h $host -u $user -p${pass} $name -e "select r.id, r.route_long_name trips t, routes r t.trip_id in (`cat trip_ids | tr '\n' '#' | sed -e 's/##$//' -e 's/#/,/g'`) , r.route_id = t.route_id group t.route_id" +------+-------------------------------------------------------------------------+ | id   | route_long_name                                                         | +------+-------------------------------------------------------------------------+ |  290 | (place de clichy <-> chatillon metro) - aller                           | |  291 | (place de clichy <-> chatillon metro) - retour                          | |  404 | (porte d'orleans-metro <-> ecole veterinaire de maison-alfort) - aller  | |  405 | (porte d'orleans-metro <-> ecole veterinaire de maison-alfort) - retour | |  453 | (porte d'orleans-metro <-> lycee polyvalent) - retour                   | |  457 | (porte d'orleans-metro <-> lycee polyvalent) - retour                   | |  479 | (porte d'orleans-metro <-> velizy 2) - retour                           | |  810 | (place de la liberation <-> gare montparnasse) - aller                  | |  989 | (porte d'orleans-metro) - retour                                        | | 1034 | (place de la liberation <-> hotel de ville de paris_4e__ar) - aller     | +------+-------------------------------------------------------------------------+  real    0m1.070s user    0m0.000s sys     0m0.000s 

with here file trip_ids containing 2k trip ids.

how can result faster? there better way crawl through data rather stops>stop_times>trips>routes path have taken?

the total time here around 30s 1 'query': "what routes available 200m location?". that's much...

the short answer is: use table joins , indices.

here's longer answer:

you have right idea here , understanding of how tables relate 1 correct. however, asking dbms match field values list (using where...in) rather joining tables requiring lot more work needs to.

what want execute single query, using join clauses link tables together. try this, additionally joins calendars , calendar_dates tables limit results routes operating today:

select distinct r.id, r.route_long_name   (select s.stop_id, (6371000 *           acos(cos(radians(48.824699)) * cos(radians(s.stop_lat)) *           cos(radians(2.3243) - radians(s.stop_lon)) +           sin(radians(48.824699)) * sin(radians(s.stop_lat)))) distance           stops s) i_s   inner join stop_times st on st.stop_id = i_s.stop_id   inner join (select trip_id, route_id trips t                 inner join (select service_id calendars                               start_date <= '2014-09-09'                                 , end_date >= '2014-09-09'                                 , tuesday = 1                               union                                 select service_id calendar_dates                                   date = '2014-09-09'                                     , exception_type = 1                               except                                 select service_id calendar_dates                                   date = '2014-09-09'                                     , exception_type = 2) c                    on c.service_id = t.service_id) t_r     on t_r.trip_id = st.trip_id   inner join routes r on r.route_id = t_r.route_id   st.departure_time > '$now'     , i_s.distance < 200; 

here inner join used "add in" columns of table, including rows match condition in on clause. should much faster generating list of results 1 query , feeding in next.

to better performance, though, want create indices prevent dbms having scan linearly through tables. rule of thumb have index defined each column used in either join or where clause. here indices defined, should find make above query perform quite well:

create index calendar_dates_date_exception_type_service_id_index   on calendar_dates (date, exception_type, service_id);  create index trips_service_id_trip_id_route_id_index   on trips (service_id, trip_id, route_id);  create index stop_times_trip_id_departure_time_stop_id_index   on stop_times (trip_id, departure_time, stop_id);  create index routes_route_id_index on routes (route_id);  create index stops_stop_id_index on stops (stop_id); 

Comments

Popular posts from this blog

javascript - RequestAnimationFrame not working when exiting fullscreen switching space on Safari -

Python ctypes access violation with const pointer arguments -