---
title: "gtfsrouter"
author: "Mark Padgham"
date: "`r Sys.Date()`"
output: 
    html_document:
        toc: true
        toc_float: true
        number_sections: false
        theme: flatly
vignette: >
  %\VignetteIndexEntry{gtfsrouter}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


```{r pkg-load, echo = FALSE, message = FALSE}
library (gtfsrouter)
```
```{r DTthread, echo = FALSE}
# Necessary for CRAN to avoid CPU / elapsed time ratios being too high
nthr <- data.table::setDTthreads (1)
```


# 1 Background: GTFS and other R packages

GTFS - General Transit Feed Specification - began life in 2005 as the "Google
Transit Feed Specification," and was renamed to "General" in 2009.  It provides
a standardised scheme for representing data on public transport services,
routes, frequencies, and timetables. A GTFS data set consists of several
comma-delimited (`.csv`) files detailing routes, stops, trips, transfers, and
other aspects, all bundled in a single `.zip`-compressed archive file. For full
details, see the relevant [google developer
site](https://developers.google.com/transit/gtfs/).

There are currently two other **R** packages which handle GTFS data:

1. [`gtfsr`](https://github.com/ropensci-archive/gtfsr), hosted by
   [rOpenSci](https://ropensci.org), developed by [Danton
   Noriega](https://github.com/dantonnoriega), but no longer under active
   development.
2. [`tidytransit`](https://github.com/r-transit/tidytransit), which began as a
   fork of [`gtfsr`](https://github.com/ropensci-archive/gtfsr), and currently
   represents its successor. This package can be used to, "map transit stops
   and routes, calculate transit frequencies, and validate transit feeds [as
   well as to read] the General Transit Feed Specification into tidyverse and
   simple features dataframes."

The one thing neither of these packages enable is the use of GTFS data for
transit routing. The `gtfsrouter` package enables both one-to-one and
one-to-many routing.  Functionality is demonstrated here through the sample data
set included with the package, provided by the "Verkehrsverbund
Berlin-Brandenburg" (VBB; or Transport Network Berlin-Brandenburg). The
`berlin_gtfs` data represents a reduced version of the full GTFS data,
containing only six tables, and a timetable reduced to the single hour between
12:00-13:00. Like all GTFS software including
[`tidytransit`](https://github.com/r-transit/tidytransit), this package is
designed to work directly with GTFS data in `.zip`-archived format, and so
includes a helper function, `berlin_gtfs_to_zip()`, which exports the internal
data set to a locally-stored `.zip` archive in the `tempdir()` of the current
**R** session. These data can be exported and re-imported with:
```{r berlin_gtfs}
berlin_gtfs_to_zip ()
f <- file.path (tempdir (), "vbb.zip")
file.exists (f)
gtfs <- extract_gtfs (f)
```
That simply re-creates the original package data, `berlin_gtfs` (although the
extracted data differ through having a couple of additional attributes defining
it as a `gtfs` object).
     

# 2. Routing

The primary routing function is `gtfs_route()`, the example of which uses the
`gtfs` data for the VBB created as described above. In simplest form, routing
requires a start and end point, defaulting to the current time as desired start
time, and routing for the current day of the week.
```{r route1-fakey, eval = FALSE}
from <- "Innsbrucker Platz"
to <- "Alexanderplatz"
gtfs_route (
    gtfs,
    from = from,
    to = to
)
```
```{r route1, eval = TRUE, echo = FALSE}
from <- "Innsbrucker Platz"
to <- "Alexanderplatz"
knitr::kable (gtfs_route (
    gtfs,
    from = from,
    to = to,
    start_time = "12:00:00"
))
```

Both the start time and day of the week can be explicitly specified:
```{r route2, eval = TRUE}
route <- gtfs_route (
    gtfs,
    from = from,
    to = to,
    start_time = "12:00:00",
    day = "Sunday"
)
```

## 2.1 GTFS Timetables

The `gtfsrouter` package uses the [Connection Scan
Algorithm](https://arxiv.org/abs/1703.05997), which requires converting the
"stop_times" table to a column-wise timetable. The "stop_times" table has
row-wise entries for each distinct "trip_id", with consecutive rows for a given
value of "trip_id" holding sequential values for stops and associated times (and
potentially additional variables). In contrast, the timetables processed by this
package have separate columns for departure and arrival stations and times. All
routing queries pre-process the original GTFS data with the `gtfs_timetable()`
function, which appends this timetable data, along with two single-column tables
of stop and trip ID values. (The timetable itself contains strictly integer
values for stops and trips, which are indices into these latter tables.)

The only important point of that from a user's perspective is that routing
queries will be faster if this pre-processing step is explicitly implemented
with `gtfs_timetable()` prior to calling `gtfs_route()`. This is easy to
demonstrate using the sample data:
```{r timetable}
gtfs <- extract_gtfs (f)
from <- "Innsbrucker Platz"
to <- "Alexanderplatz"
system.time (
    gtfs_route (
        gtfs,
        from = from,
        to = to,
        start_time = "12:00:00",
        day = "Sunday"
    )
)
names (gtfs)
# explicit pre-processing to extract timetable for Sunday
gtfs <- gtfs_timetable (gtfs,
    day = "Sunday"
)
names (gtfs)
system.time (gtfs_route (
    gtfs,
    from = from,
    to = to,
    start_time = "12:00:00"
))
```
Note that the `day` parameter is used to extract the timetable, after which it
is no longer required in the actual call to `gtfs_route()`.


## 2.2. Routing by mode of transport

It is also possible to filter by desired mode of transport. This is done by
matching the pattern to those given in the `route_short_name` column of the
`gtfs$route` table:
```{r gtfs_route_table-fakey, eval = FALSE}
head (gtfs$route)
```
```{r gtfs_route_table, echo = FALSE}
knitr::kable (head (gtfs$route))
```

These short names will differ for each GTFS, with the two primary train systems
in Berlin being the underground trains denoted "U" (although not always
travelling underground), and street-level trains denoted "S". The default route
from `r from` to `r to` above was via two "U" services.  We can also specify
that we'd prefer to travel by "S" services, noting that the `route_pattern =
"S"` specifies a `route_short_name` that *starts with* (`"^"`) "S":
```{r route3-fakey, eval = FALSE}
gtfs_route (
    gtfs,
    from = from,
    to = to,
    start_time = "12:00:00",
    day = "Sunday",
    route_pattern = "^S"
)
```
```{r route3, echo = FALSE, eval = TRUE}
knitr::kable (gtfs_route (
    gtfs,
    from = from,
    to = to,
    start_time = "12:00:00",
    day = "Sunday",
    route_pattern = "^S"
))
```

## 2.3. Routing for earliest arrivals or earliest departures

The above route with the "S" services leaves one minute later, and arrives two
minutes later. Importantly, `gtfs_route()` searches by default for the service
which arrives at the nominated destination station at the earliest time. This
may not always be the first available service departing from the nominated start
station. Routing with the earliest *departing* service, instead of the earliest
*arriving* service, can be specified with the binary `earliest_arrival`
parameter:
```{r route4a-fakey, eval = FALSE}
from <- "Alexanderplatz"
to <- "Pankow"
gtfs_route (
    gtfs,
    from = from,
    to = to,
    start_time = "12:00:00",
    day = "Sunday",
    earliest_arrival = FALSE
)
```
```{r route4a, eval = TRUE, echo = FALSE}
from <- "Alexanderplatz"
to <- "Pankow"
r1 <- gtfs_route (
    gtfs,
    from = from,
    to = to,
    start_time = "12:00:00",
    day = "Sunday"
)
r2 <- gtfs_route (
    gtfs,
    from = from,
    to = to,
    start_time = "12:00:00",
    day = "Sunday",
    earliest_arrival = FALSE
)
knitr::kable (r2)
```

And the earliest-departing route arrives at `r to` at 
`r r2$arrival_time [nrow (r2)]`, departing `r from` at 
`r r2$departure_time [1]`. In contrast, the earliest-arriving service is:
```{r route4b-fakey, eval = FALSE}
gtfs_route (
    gtfs,
    from = from,
    to = to,
    start_time = "12:00:00",
    day = "Sunday",
    earliest_arrival = TRUE
)
```
```{r route4b, eval = TRUE, echo = FALSE}
knitr::kable (r1)
diff_arrival <- gtfsrouter:::convert_time (r2$arrival_time [nrow (r2)]) -
    gtfsrouter:::convert_time (r1$arrival_time [nrow (r1)])
diff_depart <- gtfsrouter:::convert_time (r1$departure_time [1]) -
    gtfsrouter:::convert_time (r2$departure_time [1])
mm <- floor (diff_depart / 60)
ss <- diff_depart - mm * 60
diff_depart <- paste0 (mm, "min, ", ss, "s")

diff_total <- (gtfsrouter:::convert_time (r2$arrival_time [nrow (r2)]) -
    gtfsrouter:::convert_time (r2$departure_time [1])) -
    (gtfsrouter:::convert_time (r1$arrival_time [nrow (r1)]) -
        gtfsrouter:::convert_time (r1$departure_time [1]))
mm <- floor (diff_total / 60)
ss <- diff_total - mm * 60
diff_total <- paste0 (mm, "min, ", ss, "s")
```

This service departs `r diff_depart` later at `r r1$departure_time [1]`, and
arrives `r diff_arrival` seconds earlier at `r r1$arrival_time [nrow (r1)]`. The
earliest-arriving service thus entails `r diff_total` less travel time than the
earliest departing service.  It is nevertheless important to note that queries
for earliest-arriving services require two full routing runs, whereas
earliest-departing services can be executed in a single run. This, bulk queries
for analytic purposes will generally be up to twice as first with
`earliest_arrival = FALSE`.


# 3. Convenience Functions: `go_home()` and `go_to_work()`

The `gtfsrouter` package is intended both to enable statistical analyses of GTFS
data sets, as well as for personal, pragmatic purposes. In the latter regard,
the package provides two "convenience" functions to allow single-call queries
for next available services to "home" and "work" stations. These functions
require some initial set-up through specifying environmental variables, but once
done can be executed as single calls from any **R** session to return the next
available service.
```{r home-work-fakey1, eval = FALSE}
go_home ()
```
```{r home-work-setup, echo = FALSE}
Sys.setenv ("gtfs_home" = "Innsbrucker Platz")
Sys.setenv ("gtfs_work" = "Alexanderplatz")
Sys.setenv ("gtfs_data" = file.path (tempdir (), "vbb.zip"))
data.table::setDTthreads (1) # See ?setDTthreads: setenv resets it
process_gtfs_local () # If not already done
go_home (start_time = "12:20") # next available service
```
The complementary function, `go_to_work()` routes in the reverse direction. These
functions are intended to allow real-time queries of public transport schedules
from within the comfort of an **R** session, and will generally be much quicker
-- and hopefully easier -- than the arguably burdensome necessity of switching
attention from productive **R** programming to the usual app or website
otherwise needed to answer the simple question of when I ought to leave today?

Successfully calling that function requires setting three environmental
variables:
```{r, home-work-setup2, eval = FALSE}
Sys.setenv ("gtfs_home" = "<my home station>")
Sys.setenv ("gtfs_work" = "<my work station>")
Sys.setenv ("gtfs_data" = "/full/path/to/gtfs.zip")
```
along with execution of the single command:
```{r, home-work-setup3, eval = FALSE}
process_gtfs_local ()
```
This command attempts to reduce the size of the locally-stored GTFS data to the
minimum required for local routing, and saves the result as an internal `.Rds`
object in the same location as the `gtfs_data` environmental variable. Having
done that, `go_home()` will search for the next available service from the
nominated work station to the nominated home station, while `go_to_work()` will
search for connections in the other direction.

An even easier way to use these functions is to automatically load those
environmental variables at the start of each **R** session, which can be
achieved simply by creating a file named `.Renviron` in the user's root
directory (or opening if it already exists), and pasting or appending the
definitions to that file - in this case, without the **R**-specific
`Sys.setenv()` calls:
```{bash Renviron, eval = FALSE}
gtfs_home = "<my home station>"
gtfs_work = "<my work station>"
gtfs_data = "/full/path/to/gtfs.zip"
```
Of course, this function will only route using locally-stored data, so it is up
to the user to ensure their local copy of `gtfs.zip` is kept up to date.

The functions include one additional feature. Having found the next service with
`go_home()`, I may suspect that I can keep working until the following service.
The simple parameter `wait` enables searching for that following service:
```{r home-work-fakey2, eval = FALSE}
go_home (wait = 1)
```
```{r home-work2, echo = FALSE}
go_home (start_time = "12:20", wait = 1)
```
The service after that can be retrieved with `go_home (wait = 2)`, and so on.

```{r DTthread-reset, echo = FALSE}
data.table::setDTthreads (nthr)
```