Ridesharing Is Caring — Geospatial Density-Based Clustering (Part 1)
October 6th, 2025
This is Part 1 of two blog posts that I’m writing to highlight my recent projects using density-based clustering to explore geospatial datasets and urban dynamics. In this post, I will define and analyze taxi ridesharing efficiency using a public dataset of NYC yellow cab rides, and in Part 2 I’ll adapt this approach to analyze urban density patterns across towns and cities in the US and abroad.
What is density-based clustering?
Density-based clustering is a nonparametric method that can detect continuous regions with similar densities of observations across a dataset. Many popular implementations such as DBSCAN (Density-Based Spatial Clustering of Applications with Noise) allow outliers to remain unclustered. These algorithms generally rely on two key parameters: the neighborhood radius $\epsilon$, and the minimum number of neighbors required for “core” observations. In the example below, darker regions show core observations while lighter regions depict the borders between clusters and noise.
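To make the two parameters concrete, here's a minimal pure-Python sketch of the DBSCAN idea. It is a toy for illustration, not the optimized implementation found in libraries and not the code behind the figure. The function name `dbscan`, the toy points, and the chosen `eps` and `min_samples` values are all assumptions made for this example.

```python
from math import dist


def dbscan(points, eps, min_samples):
    """Return (labels, core): a cluster id per point (-1 = noise) and the core indices."""
    n = len(points)
    # Each neighborhood includes the point itself, a common DBSCAN convention.
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    core = {i for i in range(n) if len(neighbors[i]) >= min_samples}

    labels = [-1] * n
    cluster_id = 0
    for i in range(n):
        if i not in core or labels[i] != -1:
            continue
        # Grow a new cluster outward from an unlabeled core point.
        labels[i] = cluster_id
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster_id
                    # Only core points keep expanding; border points stop here.
                    if q in core:
                        stack.append(q)
        cluster_id += 1
    return labels, core


# Toy data: two tight groups, one border point, and one isolated outlier.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),  # group A
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),  # group B
    (0.3, 0.1),                                      # border of group A
    (10.0, 0.0),                                     # far from everything
]
labels, core = dbscan(points, eps=0.25, min_samples=4)
print(labels)
print(sorted(core))
```

In this toy run, the point at (0.3, 0.1) joins group A without being a core point, and the point at (10, 0) stays unclustered, which mirrors the core/border/noise distinction described above.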
Because humans are heterogeneously organized across regions, these methods are well suited for exploratory analysis of transportation and geodemographic data. For example, geographic features such as rivers and mountains can separate cities and populations, whereas transit hubs like train stations (see below) can promote local clustering of passengers. Neither situation is well-approximated by a parametric prior such as a Gaussian, and we typically don’t know the number of clusters a priori. Instead, density-based clustering relies on the contrast between local densities to identify clusters, regardless of their shapes or sizes.
Density-based clustering of NYC taxi trips
I recently completed a project exploring the ridesharing efficiency of NYC taxi trips using an iterative density-based clustering algorithm to aggregate trips. The rest of this post will outline the main ingredients to successfully explore geospatial datasets using density-based methods and highlight any interesting (and surprising!) results along the way.
Downloading and cleaning the data
Thanks to a FOIA request by Chris Whong, the NYC Taxi and Limousine Commission released trip and fare data from January through December 2013, containing medallion numbers, pickup and dropoff datetimes/locations, passenger counts, and payment breakdowns. For my analysis, I focused on data from the first full week of June: Monday (6/3) to Sunday (6/9). After merging trip/fare data and selecting rides on these dates, I used the following filters to keep only high-quality trips:
# Per the comments below, dt is the 1-minute threshold and dlim holds the (0.1, 30) mile limits
trip = (trip
    .loc[(trip.passenger_count > 0) & (trip.passenger_count < 10)]  # 0 < number of passengers < 10
    .loc[trip.trip_time > dt]  # trip time > 1 min
    .loc[trip.time_delta > dt]  # end minus start time > 1 min
    .loc[(trip.trip_time - trip.time_delta).abs() < dt]  # trip time within 1 min of time delta
    .loc[(trip.trip_distance > dlim[0]) & (trip.trip_distance < dlim[1])]  # .1 mile < actual distance < 30 miles
    .loc[(trip.euclidean_distance > dlim[0]) & (trip.euclidean_distance < dlim[1])]  # .1 mile < linear distance < 30 miles
    .loc[trip.trip_distance < 2 * trip.euclidean_distance]  # actual distance < 2x linear distance
    .loc[(trip.avg_speed > 2) & (trip.avg_speed < 50)]  # 2 mph < average speed < 50 mph
    .loc[trip.fare_amount < 200]  # fare amount < $200
    .loc[trip.tip_fare_ratio < 5]  # tip-to-fare ratio < 5
    .loc[(trip.sum_charges - trip.total_amount).abs() < 1e-2]  # total price equals charges (to the cent)
    .loc[~trip.index_pickup.isna() | ~trip.index_dropoff.isna()])  # pickup or dropoff in NYC
After deduplication, my data cleaning pipeline selected just over 1.5 million trips for downstream analysis.
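The filters above rely on derived quantities such as a straight-line distance and an average speed. The post doesn't say how these were computed, so the sketch below is only one plausible way to do it: it uses the haversine (great-circle) formula for the linear distance, and the helper names `straight_line_miles`, `avg_speed_mph`, and `passes_distance_checks` are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def straight_line_miles(lat0, lon0, lat1, lon1):
    """Great-circle (haversine) distance between two lat/lon points, in miles."""
    phi0, phi1 = radians(lat0), radians(lat1)
    dphi = phi1 - phi0
    dlmb = radians(lon1 - lon0)
    a = sin(dphi / 2) ** 2 + cos(phi0) * cos(phi1) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


def avg_speed_mph(distance_miles, duration_seconds):
    """Average speed in miles per hour."""
    return distance_miles / (duration_seconds / 3600)


def passes_distance_checks(trip_distance, linear_distance, dlim=(0.1, 30.0)):
    """Mirror the distance filters: both distances within dlim, and the
    driven distance less than twice the straight-line distance."""
    lo, hi = dlim
    return (
        lo < trip_distance < hi
        and lo < linear_distance < hi
        and trip_distance < 2 * linear_distance
    )


# Approximate coordinates for Times Square and Grand Central (illustrative only).
linear = straight_line_miles(40.7580, -73.9855, 40.7527, -73.9772)
print(round(linear, 2), passes_distance_checks(0.8, linear))
```

A meter reading of 0.8 miles for this pair would pass, whereas a reading several times the straight-line distance would be dropped as a likely detour or data error.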
Iterative density-based clustering
Because density-based clustering allows for outliers, these algorithms tend to leave many observations unclustered. To aggregate as many trips into clusters as possible, I implemented an iterative HDBSCAN (Hierarchical DBSCAN) approach that sequentially clusters any remaining observations from the previous step while relaxing the minimum cluster size from 6 to 2 riders (more on these values later). My input features consisted of pickup locations $x_0,y_0$ and times $t_0$ and dropoff locations $x_1,y_1$, and the outputs were cluster labels $k$ for each trip/passenger. Here’s what my clustering results looked like after each iteration:
Iteration 0 (min_cluster_size = 6): % clustered = 10.89
Iteration 1 (min_cluster_size = 5): % clustered = 27.17
Iteration 2 (min_cluster_size = 4): % clustered = 44.57
Iteration 3 (min_cluster_size = 3): % clustered = 62.15
Iteration 4 (min_cluster_size = 2): % clustered = 84.11
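The iterative scheme can be sketched as a loop that re-clusters only the leftovers while lowering the minimum cluster size. This is a rough illustration under stated assumptions, not the original pipeline: `iterative_clustering` and `toy_cluster` are hypothetical names, and `toy_cluster` is a deliberately simple stand-in (connected components within a fixed radius) rather than HDBSCAN.

```python
from math import dist


def toy_cluster(points, min_cluster_size, radius=1.0):
    """Stand-in clusterer (NOT HDBSCAN): connected components of the
    'within radius' graph, keeping components with enough members."""
    n = len(points)
    labels = [-1] * n
    seen = set()
    next_label = 0
    for start in range(n):
        if start in seen:
            continue
        # Flood-fill the component that contains `start`.
        component, stack = [], [start]
        seen.add(start)
        while stack:
            p = stack.pop()
            component.append(p)
            for q in range(n):
                if q not in seen and dist(points[p], points[q]) <= radius:
                    seen.add(q)
                    stack.append(q)
        if len(component) >= min_cluster_size:
            for p in component:
                labels[p] = next_label
            next_label += 1
    return labels


def iterative_clustering(points, cluster_fn, sizes=(6, 5, 4, 3, 2)):
    """Cluster, then re-cluster only the leftovers with a smaller minimum size."""
    final = [-1] * len(points)
    remaining = list(range(len(points)))  # indices still unclustered
    offset = 0                            # keeps labels unique across iterations
    history = []                          # (min size, fraction clustered so far)
    for size in sizes:
        if not remaining:
            break
        sub_labels = cluster_fn([points[i] for i in remaining], size)
        for idx, lab in zip(remaining, sub_labels):
            if lab >= 0:
                final[idx] = offset + lab
        offset += max(sub_labels, default=-1) + 1
        remaining = [i for i in remaining if final[i] == -1]
        history.append((size, 1 - len(remaining) / len(points)))
    return final, history


# Toy points: groups of 6, 3, and 2 riders plus one isolated rider.
points = (
    [(0.5 * i, 0.0) for i in range(6)]
    + [(10.0 + 0.5 * i, 0.0) for i in range(3)]
    + [(20.0, 0.0), (20.5, 0.0), (30.0, 0.0)]
)
final, history = iterative_clustering(points, toy_cluster)
for size, frac in history:
    print(f"min_cluster_size = {size}: % clustered = {100 * frac:.2f}")
```

With the `hdbscan` package installed, `cluster_fn` could instead wrap something like `hdbscan.HDBSCAN(min_cluster_size=size).fit_predict(X)`. The loop itself doesn't depend on which clusterer is plugged in.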
Another important parameter for my implementation was the relative scaling of time vs. distance, which controls the tradeoff between clusters’ spatial and temporal coherence. To select the best value, I performed a parameter sweep from 10 to 60 minutes/mile for one HDBSCAN iteration only. The results showed that smaller values ($\leq$ 10 min/mile) favored tighter temporal clusters, while larger values ($\geq$ 60 min/mile) favored tighter spatial clusters:
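To make the scaling concrete, here's a small sketch of how a minutes-per-mile factor changes the distance between two trips. It assumes that time is converted to "miles-equivalent" by dividing by the scaling factor and that a plain Euclidean distance is taken over the five features. Both are assumptions for illustration, as are the name `scaled_features` and the two made-up trips.

```python
from math import dist


def scaled_features(x0, y0, t0_minutes, x1, y1, minutes_per_mile):
    """Express pickup time in miles-equivalent so it can sit next to coordinates."""
    return (x0, y0, t0_minutes / minutes_per_mile, x1, y1)


# Two hypothetical trips on a local grid measured in miles: pickups are
# 0.2 miles and 10 minutes apart, and the dropoff location is identical.
trip_a = dict(x0=0.0, y0=0.0, t0_minutes=0.0, x1=3.0, y1=4.0)
trip_b = dict(x0=0.2, y0=0.0, t0_minutes=10.0, x1=3.0, y1=4.0)

distances = {}
for scale in (10, 25, 60):
    distances[scale] = dist(
        scaled_features(**trip_a, minutes_per_mile=scale),
        scaled_features(**trip_b, minutes_per_mile=scale),
    )
    print(f"{scale} min/mile -> feature distance {distances[scale]:.3f}")
```

At 10 min/mile the 10-minute gap counts as a full mile and dominates the distance, so trips must be close in time to cluster together. At 60 min/mile the same gap shrinks and the 0.2-mile spatial offset becomes the larger contribution.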
Quality of the identified clusters
Based on the parameter sweep, I used a spatiotemporal scaling of 25 minutes/mile to identify clusters with ~5 passengers and pickup/dropoff locations ~0.2 miles and ~5 minutes apart:
Defining and exploring ridesharing efficiency
Identifying coherent clusters of trips is already interesting and raises important questions for transit agencies and companies. For example, if passengers are departing/arriving at similar locations/times, is it possible to transport customers more efficiently by increasing their overall ridesharing? To answer this question, I implemented a complementary metric to assess ridesharing efficiency across clustered taxi trips.
Demand-responsive transport and packing efficiency
One common approach to increase ridesharing is demand-responsive transport, where vans or small buses operate on flexible routes according to overall demand and passengers’ pickup/dropoff locations. I used this microtransit model as a benchmark for comparison with actual ridesharing in my taxi-trip clusters. I defined the efficiency $E$ of an observed rider/vehicle configuration for a given cluster $k$ as follows:
$E = \frac{c_v}{c} = \frac{\text{cost per capita microtransit}}{\text{cost per capita actual}}$.
This ratio is the inverse of most efficiency definitions, but I chose it because $E$ is bounded on the interval $(0,1]$: $E=1$ indicates that a taxi rider/vehicle configuration was as efficient as microtransit, and $E<1$ indicates that it was comparatively inefficient. If we assume that each microtransit trip costs a scalar multiple $\alpha$ of the average taxi-trip cost in the cluster, the average trip cost and the number of riders cancel in the ratio, and $E$ becomes a measure of rider packing:
$E = \alpha \cdot \frac{M_v}{M} \rightarrow \frac{E}{\alpha} = \frac{M_v}{M}$,
where $E/\alpha$ is packing efficiency (unitless), $M$ is the total number of taxi trips, and $M_v$ is the total number of van trips, which depends on the number of passengers in cluster $k$ and the total van capacity. Here I assumed a van seating capacity of six passengers, which is also the minimum cluster size that I used for Iteration 0 of my density-based clustering approach.
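As a quick worked example of $E/\alpha = M_v/M$, the sketch below computes packing efficiency for a single cluster. It assumes that the number of van trips is the passenger count divided by van capacity, rounded up. The post only says that $M_v$ depends on these two quantities, so that formula and the function name `packing_efficiency` are assumptions.

```python
from math import ceil


def packing_efficiency(n_passengers, n_taxi_trips, van_capacity=6):
    """Packing efficiency E/alpha = M_v / M for one cluster."""
    if n_passengers <= 0 or n_taxi_trips <= 0:
        raise ValueError("a cluster needs at least one passenger and one taxi trip")
    # Assumption: vans are filled to capacity, so M_v = ceil(passengers / capacity).
    van_trips = ceil(n_passengers / van_capacity)
    return van_trips / n_taxi_trips


# Six riders who each took their own taxi: one van could have replaced six trips.
print(packing_efficiency(n_passengers=6, n_taxi_trips=6))
# The same six riders sharing two taxis pack three times as well.
print(packing_efficiency(n_passengers=6, n_taxi_trips=2))
# Five riders already sharing a single taxi match the van benchmark.
print(packing_efficiency(n_passengers=5, n_taxi_trips=1))
```

Under this assumption the ratio stays at or below 1 as long as no single taxi carries more riders than one van can hold.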
These are the key advantages/disadvantages of using $E/\alpha$ to measure ridesharing efficiency:
Advantages | Disadvantages |
---|---|
$E/\alpha$ is scale-free and can meaningfully compare both long and short trips | $E/\alpha$ depends on urban density and may not accurately compare dense and sparse regions |
$M_v$ is directly tunable to optimize van/bus capacity across different regions/times | $E/\alpha$ assesses aggregation/configuration and is agnostic of trip distance/duration |
No need to introduce systematic errors due to biased/imprecise direct cost estimates | |
Does not require mid-trip pickups if calibrated to the minimum cluster size | |
Efficiency trends by time and location
Across both Manhattan and the outer boroughs, packing efficiency drops every day between ~6 AM and Noon. On weekdays, the outer boroughs show efficiency peaks before and after this trough (~Midnight-6 AM and ~Noon-6 PM), while in Manhattan both weekday and weekend packing efficiency show a broad peak across the rest of the day (~Noon-6 AM):
Focusing on Manhattan, we can see that weekday demand for taxis rises between ~6 AM and Noon, ~6 hours before efficiency does:
This suggests that passengers aren’t sharing taxis when commuting to their jobs but are instead sharing rides with their co-workers after leaving the office in the afternoon. This insight presents a significant opportunity to optimize ridesharing efficiency (and profits!) during the morning rush hour.
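One way to set up this kind of demand-versus-efficiency comparison is to bin clusters into six-hour windows and summarize each window. The sketch below is only an illustration: the names `window_of` and `summarize` are hypothetical, the unweighted mean is an assumed choice, and the input rows are made-up values that exercise the function rather than results from the taxi data.

```python
from collections import defaultdict

WINDOWS = ["Midnight-6 AM", "6 AM-Noon", "Noon-6 PM", "6 PM-Midnight"]


def window_of(hour):
    """Map an hour of day (0-23) to one of four six-hour windows."""
    return WINDOWS[hour // 6]


def summarize(clusters):
    """clusters: iterable of (pickup_hour, n_taxi_trips, packing_efficiency).

    Returns {window: (total taxi trips, unweighted mean packing efficiency)}.
    """
    trips = defaultdict(int)
    effs = defaultdict(list)
    for hour, n_trips, eff in clusters:
        w = window_of(hour)
        trips[w] += n_trips
        effs[w].append(eff)
    return {w: (trips[w], sum(effs[w]) / len(effs[w])) for w in trips}


# Made-up clusters purely to exercise the function (not data from this post).
toy = [(8, 6, 1 / 6), (9, 4, 0.25), (15, 2, 0.5), (16, 3, 1 / 3)]
summary = summarize(toy)
for window, (demand, eff) in summary.items():
    print(f"{window}: trips = {demand}, mean packing efficiency = {eff:.2f}")
```

Weighting each cluster by its passenger count instead of taking a plain mean would be an equally reasonable choice, depending on whether the question is about typical clusters or typical riders.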
If we break down the morning efficiency dropoff by neighborhood, we can see that Upper Manhattan and the Lower East Side show comparatively better packing efficiency and the Upper East Side shows the worst efficiency:
Across the outer boroughs, we can see that demand peaks while efficiency drops between ~6 PM and Midnight:
This means that the evening rush hour represents the best opportunity to optimize efficiency/profits in these boroughs.
The evening efficiency dropoff is also distributed unevenly across Brooklyn/Queens neighborhoods, with the airports (LGA/JFK) showing the worst packing efficiency and Bedford-Stuyvesant, Bushwick, and Long Island City showing the best efficiency:
Wrapping up
To recap, a successful exploration of geospatial datasets using density-based methods requires data- or domain-specific strategies for 1) iterative clustering and 2) evaluation metrics. When evaluating taxi-trip clusters against a microtransit benchmark, I implemented an approach that sequentially relaxed the minimum cluster size from the maximum van capacity to two.
In the next post, I’ll explore a dynamic approach to iterative density-based clustering using information theory and apply it to a geodemographic dataset of population density from cities and towns in the US and abroad. Happy clustering!
Image credits
- NYC taxis — Wikimedia Commons, CC BY-SA 4.0
- DBSCAN data — Wikimedia Commons, CC BY-SA 3.0
- Train station — HDBSCAN Read the Docs, BSD 3-Clause