Efficient Live Public Transport Data Sharing for
Route Planning on the Web

HTML version at https://julianrojas.org/papers/icwe2020-main-track/

Abstract

Web-based information services have transformed how we interact with public transport. Discovering alternatives to reach destinations and obtaining live updates about them are necessary to optimize journeys and improve the quality of travellers’ experience. However, keeping travellers updated with opportune information is demanding. Traditional Web APIs for live public transport data follow a polling approach and allocate all data processing to either data providers, lowering data accessibility, or data consumers, increasing the costs of innovative solutions. Moreover, the data processing load increases further because previously obtained route plans are fully recalculated when live updates occur. In-between solutions that share processing load between clients and servers, and alternative Web API architectures, have not been thoroughly investigated yet. We study the performance trade-offs of polling- and push-based Web architectures to efficiently publish and consume live public transport data. We implement (i) alternative architectures that share data processing load between clients and servers, and evaluate their performance following polling- and push-based approaches; and (ii) a rollback mechanism that extends the Connection Scan Algorithm to avoid unnecessary full route plan recalculations upon live updates. Evaluations show polling as the more efficient alternative on CPU and RAM, but hint towards push-based alternatives when bandwidth is a concern. Clients update route plan results 8–10 times faster with our rollback approach. Smarter API design combining polling- and push-based Web interfaces for live public transport data reduces the intrinsic costs of data sharing by equitably distributing the processing load between clients and servers. Future work can investigate more complex multimodal transport scenarios.

Introduction

Thanks to route planning applications, understanding printed timetables at a bus or train station has turned into instantly retrieving route planning advice on our smartphones. Access to such information may increase the usage of public transport services [1] and has a positive impact on its quality of experience [2]: travellers with access to live updates can reduce their waiting times, adjust their travelling choices for more efficient journeys and achieve higher satisfaction levels [3]. Today, many cities around the world recognize the value of providing reliable access to live information and devote considerable effort to maintaining and evolving their Web APIs, e.g., London, Helsinki or San Francisco [4].

However, sharing live public transport data is a resource-demanding process that may (i) limit data accessibility; (ii) increase the costs of new solutions; and (iii) impose additional processing load for route plan recalculations upon live updates. Traditional Web API architectures for live public transport data usually follow a polling-based approach that allocates all data processing load on the computational infrastructure of either data providers (e.g., public transport operators) or data consumers (e.g., route planning applications). In terms of required computational resources, two main strategies have prevailed: (i) publishing a fully-fledged live route planning API; or (ii) providing a data dump or feed containing live schedule updates.

With live route planning APIs, most data providers limit data accessibility due to high maintenance and scalability costs [4]. Despite offering reliable information, live route planning APIs entail high costs because all processing resides on the data provider’s infrastructure. These costs increase proportionally with the number of clients and motivate API request limits, which in turn reduce data accessibility. For data consumers, this approach requires minimal effort in terms of processing resources, as clients only need one request per route planning query. Dealing with live updates also means that consumers check for query updates more frequently, further increasing the load on the provider’s infrastructure and potentially conflicting with API request limits.

In contrast, a data dump or feed increases the computational costs for data consumers, because they need to handle data integration. This approach represents a low-cost solution for data providers, as they only need to keep a resource with the latest updates online. However, static and live data need to be separated for client consumption. This separation increases infrastructure costs for consumers, who need to store and integrate static schedules and their live updates for one or more public transport services.

In both cases, handling live data requires additional processing for route plan recalculations upon data updates. Changing transport schedules may quickly render previously calculated route plans invalid. For example, a vehicle route could have been cancelled or a suggested transfer may no longer be possible due to delays. Applications need to refresh query results as soon as new updates are available to avoid providing incorrect information. This generally involves full algorithm recalculations, which increase the processing load. Thus, both data providers and data consumers need cost-efficient data sharing solutions that give travellers full access to accurate and opportune information.

An in-between approach, the Linked Connections (LC) specification [5], defines a data sharing strategy that equitably distributes the processing load of route planning queries over public transport networks between data providers and consumers. To measure the impact of polling- and push-based data sharing approaches on CPU, RAM, bandwidth and query response time, we implemented LC-based server- and client-side applications. We (i) measure processing costs and query execution performance using the Connection Scan Algorithm (CSA) [6] for route plan calculation; (ii) simulate a Web-scale environment with up to 1,500 concurrent clients, using real data from the Belgian railway operator; and (iii) extend CSA with a rollback mechanism that allows efficient route plan refreshing upon live updates, avoiding full query recalculations.

The results show polling as a more efficient alternative on CPU and RAM for data providers, but hint towards push-based alternatives when bandwidth is a concern. For data consumers there is no significant difference in CPU and RAM usage for route planning query processing. However, query processing performance improves significantly, by 8–10 times, when using our proposed CSA rollback mechanism. Smarter API design that combines polling- and push-based data sharing approaches can reduce the intrinsic costs of data sharing on the Web by equitably distributing processing loads between providers and consumers. Moreover, by moving algorithm execution to the client side, clients gain more granular control over application state (e.g., previous query executions), which in this case increases the efficiency of handling live data updates.

The remainder of this paper is organized as follows. We first present an overview of related work regarding live public transport data sharing on the Web and route planning over public transport networks. In Section 3 we describe the reference architecture, data models and implementation details of the evaluated approaches. Section 4 presents the experimental setup, including the data characterization, queries and Web interfaces used to achieve comprehensive results. In Section 5 we present the obtained results. Finally, in Section 6 we discuss the results and present our conclusions and our vision for future work.

Reference Architecture

We designed and implemented a system architecture (Fig. 1) to evaluate the performance of different strategies for live public transport data sharing on the Web. In this section we present the design choices and implementation details of its different modules.

[Figure 1]
Fig. 1: Reference architecture used to evaluate polling (HTTP) and push-based (SSE) approaches for publishing and consuming live public transport data for route planning.

Publishing Live Public Transport Updates

We follow the LC specification for public transport data publishing. LC achieves higher cost efficiency compared to equivalent full server-side route planning APIs [5]. The data publishing module in the reference architecture is called LC Server. For its implementation we extended the Linked Connections Server (LCS), given its out-of-the-box support for integrating live updates [10]. The server is implemented as a Node.js application and consists of the following sub-modules:

We study the cost and performance of polling and pushing Web interfaces. The polling interface of the LCS defines the access URL {operator}/connections?departureTime={iso-date}, where operator is the name of the public transit operator and iso-date is the date and time for which a client requires information about vehicle departures. Upon a request, the LCS retrieves the document that contains the Connections departing closest to the given departure time. It also checks whether any available live updates involve the requested document and merges them. An example of a Connection is shown in Listing 1.

{
    "@context": {
        "lc": "http://semweb.mmlab.be/ns/linkedconnections#",
        "gtfs": "http://vocab.gtfs.org/gtfs.ttl#"
    },
    "@id": "http://irail.be/connections/8885001/20200131/IC3231",
    "@type": "lc:Connection",
    "lc:departureStop": "http://irail.be/stations/NMBS/008885001",
    "lc:arrivalStop": "http://irail.be/stations/NMBS/008885068",
    "lc:departureTime": "2020-01-31T09:54:00.000Z",
    "lc:arrivalTime": "2020-01-31T09:58:00.000Z",
    "lc:departureDelay": 60,
    "lc:arrivalDelay": 60,
    "lc:direction": "Courtrai",
    "gtfs:trip": "http://irail.be/vehicle/IC3231/20200131",
    "gtfs:route": "http://irail.be/vehicle/IC3231",
    "gtfs:pickupType": "gtfs:Regular",
    "gtfs:dropOffType": "gtfs:Regular"
}

Listing 1: A Linked Connection formatted as JSON-LD. The properties departureDelay and arrivalDelay indicate that live data is available for this Connection.
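
As an illustration, a client could request such a document through the polling interface as in the following sketch. This is a minimal example under our own assumptions: the base URL is a hypothetical placeholder, and the exact JSON-LD layout of the response (Connections grouped under an "@graph" member) may differ in a concrete LCS deployment.

// Minimal polling sketch. Assumptions: Node.js 18+ (global fetch) or a browser;
// BASE_URL is a hypothetical placeholder for an LC Server instance.
const BASE_URL = "https://example.org";

interface Connection {
  "@id": string;
  "lc:departureStop": string;
  "lc:arrivalStop": string;
  "lc:departureTime": string;
  "lc:arrivalTime": string;
  "lc:departureDelay"?: number;
  "lc:arrivalDelay"?: number;
}

// Fetch the Connections document whose departures are closest to the given time.
async function getConnectionsPage(operator: string, departureTime: Date): Promise<Connection[]> {
  const url = `${BASE_URL}/${operator}/connections` +
    `?departureTime=${encodeURIComponent(departureTime.toISOString())}`;
  const response = await fetch(url, { headers: { Accept: "application/ld+json" } });
  if (!response.ok) throw new Error(`HTTP ${response.status} for ${url}`);
  const page = await response.json();
  return page["@graph"] as Connection[]; // assumed response layout
}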

Schedule documents can be cached by clients, which can reuse them to answer more than one query. This reduces the number of requests that the server needs to handle. However, live updates quickly invalidate caches, and clients need to request updated LC documents again for new queries. In the worst case, all Connections of a document change due to a live update, but most of the time only a handful of Connections are updated, which means that significant parts of LC documents are sent over and over again. For this reason, we extended the LCS implementation and added a new resource to its HTTP API that allows retrieving only the Connections that have changed since a given time: {operator}/events?lastSyncDate={iso-date}. This resource allows clients to synchronize their local caches with the latest available data.
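
A client could keep its local cache synchronized against this delta resource as sketched below. It reuses the hypothetical BASE_URL and Connection declarations from the previous sketch; the response layout is again an assumption rather than the documented LCS contract.

// Poll only the Connections that changed since the last synchronization.
async function syncCache(
  operator: string,
  lastSyncDate: Date,
  cache: Map<string, Connection>
): Promise<Date> {
  const url = `${BASE_URL}/${operator}/events` +
    `?lastSyncDate=${encodeURIComponent(lastSyncDate.toISOString())}`;
  const response = await fetch(url, { headers: { Accept: "application/ld+json" } });
  if (!response.ok) throw new Error(`HTTP ${response.status} for ${url}`);
  const delta = await response.json();
  // Overwrite only the updated Connections; untouched ones stay cached.
  for (const connection of delta["@graph"] as Connection[]) {
    cache.set(connection["@id"], connection);
  }
  return new Date(); // becomes lastSyncDate for the next poll
}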

We also implemented a pushing Web interface using Server-Sent Events (SSE). Clients can subscribe to it on {operator}/events/sse and receive the latest schedule updates as they occur. We use the W3C-standardized SOSA ontology to identify and semantically describe schedule updates for particular Connections. Our implementation is available online.
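
A minimal subscription to this pushing interface could look as follows. It assumes the browser EventSource API (or a Node.js polyfill) and reuses the declarations from the earlier sketches; the payload shape and the use of the default message event are assumptions, not the server’s documented contract.

// Subscribe to server-pushed schedule updates over SSE.
function subscribeToUpdates(operator: string, onUpdate: (c: Connection) => void): EventSource {
  const source = new EventSource(`${BASE_URL}/${operator}/events/sse`);
  source.onmessage = (event: MessageEvent) => {
    // Each pushed message is assumed to carry an array of updated Connections.
    const updates = JSON.parse(event.data) as Connection[];
    updates.forEach(onUpdate);
  };
  source.onerror = () => {
    // EventSource reconnects automatically after transient failures.
    console.warn("SSE connection error, retrying...");
  };
  return source;
}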

Consuming Live Public Transport Updates

A command line interface (CLI) client application was implemented for processing route planning queries on top of LC-based data. It implements the Profile variant of CSA [6]. This allows calculating not only the Earliest Arrival Time route but also later route alternatives, given a maximum number of desired vehicle transfers along the way. The selection of this algorithm is based on the data model defined by the LC specification: LC defines an array of Connections sorted by departure time, which is the data structure that CSA requires to process queries.
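
To illustrate why a departure-time-sorted Connections array suits CSA, the sketch below shows the basic Earliest Arrival Time scan only, with transfers and footpaths omitted and simplified types of our own; the client itself implements the more involved Profile variant.

// A single pass over Connections sorted by ascending departure time: keep, per
// stop, the earliest known arrival time and relax it with every reachable Connection.
interface SimpleConnection {
  departureStop: string;
  arrivalStop: string;
  departureTime: number; // epoch millis, with delays already applied
  arrivalTime: number;   // epoch millis
}

function earliestArrival(
  connections: SimpleConnection[],
  source: string,
  target: string,
  departureTime: number
): number {
  const arrival = new Map<string, number>([[source, departureTime]]);
  for (const c of connections) {
    if (c.departureTime < departureTime) continue; // departs before the query time
    const reachableAt = arrival.get(c.departureStop) ?? Infinity;
    if (reachableAt <= c.departureTime) {          // we can board this Connection
      const best = arrival.get(c.arrivalStop) ?? Infinity;
      if (c.arrivalTime < best) arrival.set(c.arrivalStop, c.arrivalTime);
    }
  }
  return arrival.get(target) ?? Infinity;          // Infinity: target unreachable
}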

This client consists of a library called QRail that was built using the Qt framework. We selected this framework due to its cross-platform portability, including Android and iOS. The client’s modules are the following:

Dynamic Rollbacks for CSA

We extended CSA to address the problem of having to perform complete route plan recalculations every time a new update is available on the client. Full route plan recalculations increase the number of requests and the processing that both clients and servers need to handle, thus increasing computational costs.

Given a route planning query (e.g., from Bruges to Brussels departing today at 17:00), CSA starts its execution by scanning Connections departing no earlier than 17:00 until it finds the earliest arrival route. From this point, the algorithm performs predefined scan cycles, going back in time over the Connections array and adding each time a fixed amount, for example 30 minutes, to the earliest arrival time. This allows finding later alternative route plans for the given query.

In our implementation, every time a new Connections document is requested during algorithm execution, we create a snapshot of CSA’s internal state containing the Connections currently involved in the routes discovered so far. Thanks to these snapshots we can determine the exact index in the Connections array from which CSA needs to recalculate when route plans are affected by updates. Given that CSA Profile goes back in time over the Connections array, the closer an updated Connection is to the departure stop, the faster the recalculation process will be. The set of snapshots is kept in memory and updated every time a new query is processed.
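
The following sketch outlines the snapshot bookkeeping behind this rollback idea, using simplified data structures of our own rather than the QRail internals; the State type stands for whatever the algorithm needs to resume scanning (e.g., the profiles and the Connections of the routes discovered so far).

// Snapshots of the algorithm state, indexed by the scan position at which they were taken.
interface Snapshot<State> {
  scanIndex: number; // position in the departure-time-sorted Connections array
  state: State;
}

class RollbackRegistry<State> {
  private snapshots: Snapshot<State>[] = [];

  // Called whenever a new Connections document is requested during a scan.
  record(scanIndex: number, state: State): void {
    this.snapshots.push({ scanIndex, state: structuredClone(state) });
  }

  // The Profile scan processes the array from later to earlier departures. When
  // the Connection at position "updatedIndex" changes, resume from the latest
  // snapshot taken before the scan reached that position, i.e., the most recent
  // snapshot whose scanIndex is still greater than or equal to it.
  resumePointFor(updatedIndex: number): Snapshot<State> | undefined {
    return [...this.snapshots].reverse().find(s => s.scanIndex >= updatedIndex);
  }

  clear(): void {
    this.snapshots = []; // refreshed every time a new query is processed
  }
}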

Evaluation

We study different approaches to efficiently publish and consume live public transport data, in terms of computational resources (CPU, RAM and bandwidth) and route planning query processing performance. For this we define the following research questions:

To address these research questions, we defined a hypothesis for each of them:

We designed two different experiments to test our hypotheses. Next we describe the testing data and the setups for each of the experiments.

Real-World Test Data

We used real data from the Belgian railway operator NMBS. NMBS publishes both GTFS and GTFS-RT data dumps as open data. We collected the timetable and all emitted live updates for November 2019 (we make the data available online). We analyzed these data to understand how live updates happen, i.e., what the low and peak hours normally are, so that our experiments consider representative data. Fig. 2(a) shows the number of Connections updated during the entire month, with 28 November as the day with the highest number (over 13.63 million). Fig. 2(b) zooms in on this day, showing peaks around 07:00 and 17:00, and registering 17:00 as the peak hour of the day with more than 900,000 updated Connections.

[Figure 2]
Fig. 2: The amount and distribution of Connection updates visualize the behavior of the transport network regarding its low and peak times. Weekdays, and especially their mornings and afternoons, consistently show a higher number of updates.

We used the iRail query log dataset as a reference. This dataset contains a registry of over one million real route planning queries per day, received by the iRail API. We analyzed the queries that were executed on weekdays during November 2019 and classified them by the number of Connections used by their Earliest Arrival Time route, given that routes with more Connections require more processing effort to compute.

Experiment 1: Publishing Live Public Transport Updates

The first experiment was designed to test the computational resource (CPU, RAM and bandwidth) consumption of LC-based live public transport data publishing, following polling (HTTP API) and push-based (SSE) approaches. We set up a server with a quad-core Intel E5520 (2.2 GHz) CPU and 12 GB of RAM. We progressively instantiate up to 1,500 clients requesting/receiving live data updates. Each client starts its operation 0.5 seconds after the previous one, to avoid overload peaks in the measurements and to simulate a more realistic environment where clients issue requests at any point in time, not necessarily synchronized with the live update frequency.

For the polling scenario, every client requests a new data update every 30 seconds, which corresponds to the update frequency of the NMBS GTFS-RT feed. We also disabled server-side caching to obtain a clearer image of the actual operational costs of the server. In the pushing scenario, the clients subscribe once to the SSE interface and the server pushes new data to them every 30 seconds. The experiment was executed for 20 minutes in each scenario.
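
For illustration, the polling load pattern of this setup can be reproduced roughly as sketched below. This is our own simplification, not the original test harness, and SERVER is a hypothetical placeholder URL.

// A new simulated client every 0.5 s, up to 1,500 clients, each polling every 30 s.
const SERVER = "https://example.org/nmbs";
const MAX_CLIENTS = 1500;
const SPAWN_DELAY_MS = 500;      // a new client every 0.5 s
const POLL_INTERVAL_MS = 30_000; // matches the GTFS-RT update frequency

function startPollingClient(): void {
  let lastSync = new Date();
  setInterval(async () => {
    // In the pushing scenario this repeated request is replaced by a single
    // long-lived SSE subscription per client.
    await fetch(`${SERVER}/events?lastSyncDate=${encodeURIComponent(lastSync.toISOString())}`);
    lastSync = new Date();
  }, POLL_INTERVAL_MS);
}

for (let i = 0; i < MAX_CLIENTS; i++) {
  setTimeout(() => startPollingClient(), i * SPAWN_DELAY_MS);
}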

Experiment 2: Consuming Live Public Transport Updates

The second experiment was designed to measure the CPU, RAM and bandwidth usage of an LC-based client consuming live public transport updates following polling (HTTP API) and push-based (SSE) approaches. The client application was deployed on a machine with a quad-core Intel E5520 (2.2 GHz) CPU and 12 GB of RAM. We handpicked routes with different amounts of updated Connections. This was intended to avoid evaluating routes without updates and thus with no significant impact on client resources.

We ran the experiment for each selected query using our client in a polling and a pushing scenario, and using a reference test client without the rollback mechanism described in Subsection 3.3. The evaluation ran for 15 minutes in each scenario, where clients requested/received live updates every 30 seconds. Table 1 shows an overview of the selected routes.

From              To                 Connections   Travel Time (min)
Hasselt           Sint-Truiden       2             15
Leuven            Diest              2             32
Landen            Diest              5             43
Eppegem           Brussels-Schuman   6             23
Mechelen          Brussels-Congress  6             30
Leuven            Schaarbeek         11            29
Asse              Antwerp-Berchem    18            87
Antwerp-Central   Alken              23            94

Table 1: Set of route planning queries extracted from the iRail API logs. The table shows the number of Connections and the total travel time of the Earliest Arrival Time route.

Results

Here we present the results obtained for the experiments described in Section 4.

Fig. 3 presents the measurements made for experiment 1. Fig. 3(a) shows a mean CPU usage of 10.8% for the pushing approach, compared to a mean usage of 1.7% for the polling approach.

In terms of RAM, Fig. 3(b) shows a mean consumption of 563.8 MB for the pushing approach, against a mean consumption of 423 MB for the polling approach. Bandwidth measurements for pushing showed a total data exchange of 6.5 GB serving up to 1,500 clients during the measured time (20 minutes). For polling, the server exchanged a total of 15.8 GB under the same conditions (figure not included for the sake of space).

[Figure 3]
Fig. 3: Polling shows a lower resource consumption for both CPU and RAM.

Fig. 4(a) shows the bandwidth usage for the three test scenarios defined in experiment 2. After 800 seconds the reference client exchanged a total of 45.7 MB, while the rollback clients exchanged 5.4 MB and 3.6 MB for polling and pushing respectively. In terms of RAM (Fig. 4(b)), no significant difference was measured between polling and pushing, with mean consumptions of 70.6 MB and 71.1 MB respectively. The reference client shows a higher average RAM usage of 102.1 MB. CPU usage (figure not shown for the sake of space) followed the same tendency, with average consumptions of 12.15% (polling), 12.22% (pushing) and 19.2% (reference).

[Figure 4]
Fig. 4: The rollback clients give a significant reduction of bandwidth. There are no major differences in terms of RAM consumption.

Fig. 5 presents the results obtained for testing our rollback mechanism and its impact on route planning query solving. The rollback mechanism is between 8 and 10 times faster in every set of route planning queries.

[Figure 5]
Fig. 5: The rollback mechanism significantly improves the performance of query recalculation.

Conclusion and Future Work

Experiment 1 was designed to test H1, related to RQ1. In terms of CPU usage, the polling approach uses on average 10 times fewer resources than the pushing approach. RAM measurements are closer between the two approaches, with pushing using around 140 MB more than polling. This can be explained by the operational characteristics of pushing and also by the setup of the experiment. In the pushing approach the server sends new data to all subscribed clients whenever a new live update is received, which explains the various CPU peaks and the higher average usage. RAM consumption is also higher for pushing, because the server needs to keep an in-memory registry of all subscribed clients.

The lower resource usage of polling can also be explained by the distribution of client requests. A new client was added every 0.5 s from the start of the experiment and every client issued a new request every 30 s, giving a maximum request load of 50 requests/second with 1,500 clients. This load was easily handled by the server given its hardware capabilities. Further tests with synchronized clients and different live update frequencies can provide a more complete picture for polling. Nevertheless, the results give an indication of the resources required by each approach, showing polling as the less demanding approach for CPU and RAM.

Pushing requires lower bandwidth because clients do not send requests for new data updates. In conclusion, we reject H1 and consider polling the more cost-efficient approach, especially considering that with server-side caching in place, resource consumption could be reduced further.

Results of experiment 2 relate to RQ2 and H2, and show no significant difference in CPU or RAM between polling and pushing on the client. This is explained by the usage of our rollback mechanism, and confirmed by the higher resource consumption measured for the reference client. Therefore, we reject H2. However, pushing uses the least amount of bandwidth, which is of utmost importance if we consider that route planning applications may run on mobile devices with data connections subject to charges.

The second part of experiment 2 relates to RQ3 and H3. Results show significant improvements in route planning query recalculations when using our rollback mechanism. By keeping snapshots of CSA’s internal state, full recalculations are avoided and faster response times are achieved. Therefore we can accept H3. This is an important result from the perspective of end users, who need to be informed of journey changes as soon as possible.

In general, the results show that there is no silver-bullet approach. Different aspects need to be considered to fulfill data provider, data consumer and end-user requirements. Further evaluations may be performed to complement these results. However, this study already provides an informative baseline for public transport data publishers and route planning application developers. It also highlights a benefit of moving algorithm execution to the client side: having more granular control of the application’s internal state, which makes it possible to design smarter and more efficient clients.

Future work could study the impact of client-side route planning integrating multiple modes of transportation, as well as explore alternative algorithms and evaluate their performance, including end-user testing.