Analysing Shared Bike Rides in New York City

Author: Anuradha Damle

Introduction

At a time when ESG is a topic of everyday conversation, the introduction of shared hire-bicycles in NYC has been transformative in lowering carbon emissions over its 15+ year history. Promoting fitness, reducing congestion, and building a shared community effort to tame rising pollution are additional benefits. But has it been as resounding a success as we would like to believe? This analysis delves into the who, where, and when of bike usage, and examines the adoption trend for tell-tale signs of success.

Based on this study, many users prefer cycling to work in the morning and back in the evening. These are typically 6 to 8-hour rentals. Additionally, many users prefer riding in parks. These are 2 to 3-hour rentals. As expected, the service is used more in summer than in winter. There are a few opportunities to optimize bike utilization, based on overused and least-used bikes. Detailed findings are listed in the sections below.

Citi Bike is a popular bike-sharing system that provides easy and affordable bike trips around New York City and Jersey City. Its 24/7 availability and convenient stations have made Citi Bike a great option for quick trips.

In this article, we explore bike ride data to generate insights for the business. We begin by examining every feature of the bike data, including data quality. Next, the data is analysed to generate insights on trends and important features. We conclude with advanced analysis and model fitting, where we identify business-impacting insights about idle bikes and the sufficiency of bikes across stations, and predict rides.

Acknowledgements

New York City Bike makes regular open data releases (this dataset is a transformed version of the data from this link). The dataset contains anonymised information on 25,846,892 rides made from Oct 2015 to Jul 2017.

This dataset is the property of NYC Bike Share, LLC and Jersey City Bike Share, LLC (“Bikeshare”), who operate New York City’s Citi Bike bicycle sharing service (for the T&C, click here).

I want to take a moment to express my sincere gratitude to AltoMeta Consulting and their exceptional team (Raghavachari Madhavan & Aditya Kulkarni) for their unwavering support throughout the analysis & writing process. Their expertise and guidance have been invaluable in shaping the ideas and ensuring the coherence and relevance of the content. 

I would also like to acknowledge the product “Pragya”, a no-code data analysis environment from AltoMeta Consulting, for generating simple English-language insights that help in understanding the data quickly. This helped me reduce the overall effort and focus on the important features for further analysis.

Collaborating with the AltoMeta Consulting team has been an enriching experience, and I am grateful for their dedication, professionalism, and commitment to excellence. It is through such partnerships that we can deliver meaningful and impactful content to our readers.


Executive Summary:

This report analyzes key trends in a bike-sharing system, focusing on user demographics, ride patterns, and predictive modeling.

User demographics reveal that Subscribers and Male users are predominant, with popular age groups being Gen X, Gen Y.1, and Gen Y.2. Short rides, primarily for weekday commuting, are prevalent, with notable activity around park areas. Peak ride hours occur in the morning (7am-9am) and evening (4.30pm-6.30pm), primarily in the South-North and Southeast-Northwest directions. Evening rides surpass morning rides in frequency.

The analysis indicates that bikes spend more time idle than in use, suggesting opportunities for optimizing inventory organization to increase business or ride time. The average speed falls within a safe range of 4 to 8 mph.

Despite a relatively short observation period of 22 months, the forecasting model effectively captures trends and seasonality, predicting an overall increasing trend in rides. 


The objectives of this exercise are:

  1. Data Profile: Derived Features, Feature analysis 
  2. Data Quality 
  3. Insights from data: Influential features, trends, observations 
  4. Predict Rides   

Data Profile

Bike ID: the ID of the bike used for the ride
Column type: Numeric, 5-digit identifier
Data distribution chart: [chart]
Observations:
  • Total 14,063 bikes.
  • Bike IDs range from 14,529 to 30,337.
  • Average of 1,838 rides per bike over 22 months.
  • Number of bikes with rides at 25%: 3517; at 50%: 7034; at 75%: 10550 (rides percentile of the total rides).
  • Number of bikes with rides at the 100th percentile: 14063.
  • Left skewed.
  Based on duration, the distribution is as follows:
  • Average of 29,762.03 min per bike.
  • Number of bikes with duration of rides at 25%: 3516; at 50%: 7032; at 75%: 10547 (rides percentile of the total rides).
  • Right skewed.
  • 9 bikes have extreme duration (> 150,000 min). These records are excluded from further analysis.

Trip duration: duration of each ride
Column type: Numeric, time in seconds. Excluded.
Data distribution chart: [chart]

start station id: the station number at the start of the ride
Column type: Numeric, 4-digit number. Excluded.
Data distribution chart: [chart]
Observations:
  • Total 741 stations.
  • Station IDs range from 72 to 3478.

start station name
Column type: Text
Observations:
  • Total 749 stations.
  • Average rides per start station: 34,508.53 over 22 months.
  • Number of start stations with rides at 25%: 188; at 50%: 375; at 75%: 562 (rides percentile of the total rides).
  • Every start station name has an ID in start station id.
  • Every start station name has a latitude and longitude.
  • Start station names are a subset of end station names.
  • The number of station IDs differs from the number of station names. Hence the id-name combination is used for further analysis.

end station id: the station number at the end of the ride
Column type: Text
Observations:
  • Total 746 stations.
  • Station IDs range from 72 to 3478.

end station name
Column type: Text
Observations:
  • Total 755 station names.
  • Average rides per end station: 34,234.29 over 22 months.
  • Number of end stations with rides at 25%: 189; at 50%: 378; at 75%: 566.
  • Every end station has a latitude and longitude.
  • The number of station IDs differs from the number of station names. Hence the id-name combination is used for further analysis.

Start Station Latitude AND Start Station Longitude: latitude and longitude of the start station
Column type: Numeric, decimal numbers representing the latitude and longitude of the start station
Data distribution chart: [chart]
Observations:
  • Unique (latitude, longitude) pairs of start stations: 756.
  • 7 rides do not have valid latitude-longitude details.
  • Station locations are in New York City and Jersey City.

End Station Latitude AND End Station Longitude
Column type: Numeric, decimal numbers representing latitude and longitude
Data distribution chart: [chart]
Observations:
  • Unique (latitude, longitude) pairs of end stations: 763.
  • 153 rides do not have valid latitude-longitude details.
  • Station locations are in New York City and Jersey City.

User Type: type of user of the ride
Column type: Text
Observations:
  • User types are Subscriber and Customer.
  • Customer = renting based on duration or number of rides.
  • Subscriber = annual member (with different agreements).
  • Null values are replaced by “unknown”.
  • Subscribers account for 89% of rides, Customers for 10.83% (about 11%), and unknown user types for 0.2%.

Birth Year: birth year of the rider
Column type: Year, numeric, 4 digits
Data distribution chart: [chart]

Gender: gender of the rider
Column type: Numeric values representing gender: 0 = Unknown, 1 = Male, 2 = Female
Observations:
  • Male riders are 67%.
  • Female riders are 22%.
  • Unknown riders are 11%.

Start Time: start date-time of the ride
Column type: Date-time in format dd/mm/yyyy hh:mm:ss
Observations:
  • min: 2015-10-01 00:00:02
  • max: 2017-07-31 23:59:57
  • Date-time formats differ between data files. Hence a new column “start_dttime”, with a single common format, is created.

Stop Time: end date-time of the ride
Column type: Date-time in format dd/mm/yyyy hh:mm:ss
Observations:
  • min: 2015-10-01 00:02:5
  • max: 2017-08-01 10:51:09
  • Date-time formats differ between data files. Hence a new column “end_dttime”, with a single common format, is created.
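
The mixed date-time formats noted for Start Time and Stop Time can be normalised by trying a list of candidate formats in turn. The sketch below is a minimal illustration in Python. The format list and the function names `parse_ride_time` and `to_common_format` are assumptions made for this example, since the article does not state which formats the source files actually used.

```python
from datetime import datetime

# Candidate formats are assumptions for illustration; the real files may differ.
# Order matters: day-first is tried before month-first for slash-separated dates.
CANDIDATE_FORMATS = (
    "%Y-%m-%d %H:%M:%S",
    "%d/%m/%Y %H:%M:%S",
    "%m/%d/%Y %H:%M",
)


def parse_ride_time(raw: str) -> datetime:
    """Return a datetime using the first candidate format that matches."""
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognised date-time: {raw!r}")


def to_common_format(raw: str) -> str:
    """Render any recognised input as 'YYYY-MM-DD HH:MM:SS'."""
    return parse_ride_time(raw).strftime("%Y-%m-%d %H:%M:%S")
```

Slash-separated dates are inherently ambiguous between day-first and month-first, so in practice the format would be chosen per source file rather than per value.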

Data Quality

  • Null values: the columns with null values are ‘usertype’ and ‘birth year’. For ‘usertype’, nulls are replaced with “unknown”. For ‘birth year’, nulls are replaced with 0.
  • No duplicate records.
  • Birth Year has values less than 1900 (3315 rows, negligible), and birth years from 1900 to 1964 are categorized as Baby Boomers in the derived ‘Age Group’ column.
  • Some Station Latitude and Longitude values are 0. Such rides are not considered for the analysis of Speed and Distance.
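
The cleaning rules above can be sketched as follows. This is a minimal illustration that treats each ride as a plain Python dict. The dictionary keys and the function names are assumptions for the example, not the column handling actually used in the analysis.

```python
def clean_ride(ride: dict) -> dict:
    """Apply the null-value rules to one ride record and return a copy."""
    cleaned = dict(ride)
    # A null user type becomes the literal "unknown".
    if cleaned.get("usertype") in (None, ""):
        cleaned["usertype"] = "unknown"
    # A null birth year becomes 0.
    if cleaned.get("birth year") in (None, ""):
        cleaned["birth year"] = 0
    return cleaned


# Hypothetical key names for the four coordinate columns.
COORDINATE_KEYS = (
    "start station latitude",
    "start station longitude",
    "end station latitude",
    "end station longitude",
)


def has_valid_coordinates(ride: dict) -> bool:
    """False when any start or end coordinate is missing or equal to 0."""
    return all(ride.get(key) not in (None, 0) for key in COORDINATE_KEYS)
```

Rides for which `has_valid_coordinates` returns False would be left out of the speed and distance analysis only, and kept for everything else.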

Derived Features

Age Group: age group of the rider
Column type: Text
Observations:
  • Generation X and Y are the prominent users (74%), i.e. ages 21 to 52.
  • Baby Boomers (above 52) are 18% of Gen X and Gen Y combined.
  • Very young people (below 21) are negligible (0.83%).
  • Gen Z: 0.83%
  • Unknown: 10.87%
  • Baby Boomers: 13.91%
  • Gen Y.1: 20.18%
  • Gen Y.2: 23.74%
  • Gen X: 30.46%

duration_min: duration of each ride in minutes
Column type: Numeric, duration in minutes
Data distribution chart: [chart]
Observations:
  • min: 1
  • max: 272163
  • 92% of rides last up to 30 min.
  • 99% of rides last up to 75 min.
  • Median duration is 5.75 min.
  • Average duration per ride: 16
  • The 5 most popular durations in minutes are 6, 5, 7, 8, 4.
  • Skewed distribution.
  • Durations of 5 min, 6 min and 7 min are popular in all generations.

startStation_id_nm: It is found that there is
Observations:
  • 758 stations
  • Average rides per startStation_id_nm: 34098
  • Number of start stations with rides at 25%: 190; at 50%: 379; at 75%: 568
  • Popular stations:
    1. '519-Pershing Square North'
    2. '435-W 21 St & 6 Ave'
    3. '497-E 17 St & Broadway'
    4. '402-Broadway & E 22 St'
    5. '426-West St & Chambers St'

endStation_id_nm
Column type: Text; end station id plus end station name
Observations:
  • 764 stations
  • Average rides per endStation_id_nm: 33831
  • Number of end stations with rides at 25%: 191; at 50%: 382; at 75%: 573
  • Top 5 stations:
    1. '519-Pershing Square North'
    2. '497-E 17 St & Broadway'
    3. '402-Broadway & E 22 St'
    4. '435-W 21 St & 6 Ave'
    5. '426-West St & Chambers St'

Routes: routes for the rides, created from start and end stations
Column type: Text; startStation_id_nm plus endStation_id_nm
Observations:
  • 305293 routes
  • 2% of rides have the same start and end station.
  • Average rides per route: 84.66
  • Number of routes with rides at 25%: 87112; at 50%: 155321; at 75%: 229175

Idle Time: idle time for the bike, derived from the gap between 2 consecutive rides of the same bike
Column type: Numeric, representing a time duration
Data distribution chart: [chart]
Observations:
  • Idle time is the time between 2 consecutive rides of the same bike.
  • Idle time 0 means the bike is continuously used.
  • min: 0
  • max: 613 days 16:51:04
  • Average idle time per bike: 418 days 14:55:34
  • 36% of bikes have idle time above 500 days.
  • Fewer bikes remain idle once idle time crosses 190 days. Once a bike crosses 300 days of idle time, its idle time stays in the range of 300 to 430 days. Between 450 and 600 days, fewer bikes remain idle.
  • The number of bikes idle for around 650 days is almost 2.5 times the number idle for 150 days or for 400 days.

Distance: distance of travel based on start and end stations
Column type: Numeric, in miles
Data distribution chart: [chart]
Observations:
  • Derived from the distance between start and end stations using the Haversine formula.
  • Min distance: 2.107e-06
  • Max distance: 5387
  • Average distance per ride: 1.17
  • The distance, and therefore the speed, of some rides is 0 because the start and end stations are the same. These rides are excluded from analysis.

Speed: speed of the ride
Column type: Numeric, in miles/hr
Data distribution chart: [chart]
Observations:
  • Derived from duration in seconds and distance.
  • Min speed: 1 mph
  • Max speed: 20 mph
  • Mean speed: 5.64 mph.
  • Median speed: between 4 and 6 mph.
  • The distance, and therefore the speed, of some rides is 0 because the start and end stations are the same. These rides are excluded.
  • Rides where the latitude or longitude of the start or end station is 0 are excluded.

DirectionofTravel: direction of travel based on the direction of the vector from start to end station
Column type: Text
Observations:
  • 8 directions, based on the angle of the vector from start to end station: NorthEast, North, NorthWest, West, SouthWest, South, SouthEast, East.
  • Share of rides by direction:
    1. North: 17.28%
    2. South: 16.16%
    3. Southeast: 15.66%
    4. Northwest: 15.08%
    5. Southwest: 9.91%
    6. Northeast: 9.84%
    7. East: 8.18%
    8. West: 7.89%
  • Within the North-South pair and within the East-West pair, ride shares differ only slightly.
  • 64% of traffic is in the North-South and Northwest-Southeast directions.

City symbol: ‘NY’ or ‘NJ’, added to each ride record
Column type: Text
Observations:
  • The data covers rides in 2 cities, New York City and Jersey City. Hence ‘NY’ and ‘NJ’ are the symbols added to each ride record.
  • Rides from NY: 25401155 (98%)
  • Rides from NJ: 445737 (2%)
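
The Distance, Speed and DirectionofTravel features above can be sketched in a few lines of Python. The Haversine formula is the one named in the table. The Earth-radius constant, the use of the initial great-circle bearing for direction, the 45-degree sector boundaries and all function names are assumptions for illustration, as the article does not spell out these details.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an assumed constant


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def speed_mph(distance_miles, duration_sec):
    """Average speed from station-to-station distance and duration in seconds."""
    return distance_miles / (duration_sec / 3600.0)


# Sector 0 is centred on north; sectors proceed clockwise in 45-degree steps.
COMPASS = ("North", "Northeast", "East", "Southeast",
           "South", "Southwest", "West", "Northwest")


def direction_of_travel(lat1, lon1, lat2, lon2):
    """Bucket the initial bearing from start to end into 8 compass sectors."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlmb))
    bearing = (math.degrees(math.atan2(y, x)) + 360.0) % 360.0
    return COMPASS[int((bearing + 22.5) // 45) % 8]
```

Because the distance is measured in a straight line between stations, the resulting speed is a lower bound on the rider's actual speed along the streets. Rides with identical start and end stations yield a distance of 0 and are excluded, as noted above.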

Insights

• Subscribers’ ride time is higher than that of Customers

  • The ride duration of Subscriber users is almost 3 times that of Customer users.
  • The numbers of both Subscribers and Customers fall in winter and rise in spring and summer. For both, the peak is observed in August to September.
  • The unknown user type is seen for 1 year (Mar 2016 to Mar 2017), possibly because the data was set incorrectly.
  • The sharpest rise for both Subscriber and Customer users is in March 2017: 91% for Subscribers and 637% for Customers. The sharpest decline for Subscribers is 36%, in Nov 2016. The sharpest decline for Customers is 65%, in Dec 2015.

• Trip duration across the timeline

  • The fastest growth in total trip duration is in March 2017 (148%). The sharpest reduction in total trip duration is in November 2016 (40%). Subscriber users are the largest contributor to the growth, with a contribution of 53.5%, and also the largest contributor to the reduction, with a contribution of 73%.
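
Growth and contribution figures of this kind can be computed as below. This is a minimal sketch: the function names are hypothetical, and it assumes that "contribution" means the share of the total month-over-month change attributable to one user type, which the article does not define explicitly.

```python
def month_over_month_change(monthly_totals):
    """Percentage change versus the previous month.

    monthly_totals: ordered list of (month_label, total) pairs.
    Returns {month_label: percent_change}, skipping months after a zero total.
    """
    changes = {}
    for (_, prev), (month, cur) in zip(monthly_totals, monthly_totals[1:]):
        if prev:
            changes[month] = 100.0 * (cur - prev) / prev
    return changes


def contribution_percent(prev_part, cur_part, prev_total, cur_total):
    """Share (in %) of the total change that comes from one user type."""
    total_change = cur_total - prev_total
    if total_change == 0:
        raise ValueError("total did not change; contribution is undefined")
    return 100.0 * (cur_part - prev_part) / total_change
```

For example, a total that moves from 100 to 248 is a 148% rise, and a user type that moves from 50 to 124 over the same month contributes half of that change.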

• Rides are popular with Gen X, Gen Y.2 & Gen Y.1, irrespective of gender (male or female)

Trip Duration for Female riders:
  • Gen Y.2, Gen X and Gen Y.1 use rides more than Gen Z or Baby Boomers.
  • More rides are taken during the months of February to October.
Trip Duration for Male riders:
  • Gen X, Gen Y.2 and Gen Y.1 use rides more than Gen Z or Baby Boomers.
  • Gen X riders are almost 1.5 times as many as Gen Y.1 or Gen Y.2 riders.
Patterns of average trip duration are similar for male and female riders.
  • Average duration for female riders is 17% higher than for male riders.

• Insights for Routes and Stations

• The nature of rides is almost the same across the 25 most popular routes.
  • Out of the top 25 routes, 6 are at a park.
  • When all park routes are combined and all other routes are combined, the nature of the two groups of routes is similar.
  • New routes started in July and August 2016 (e.g. 432-E 7 St & Avenue A TO 3263-Cooper Square & E 7 St, and 2006-Central Park S & 6 Ave TO 3282-5 Ave & E 88 St).
• End Station
  • There are 6 stations which are used as end stations only.
  • These stations, and the number of rides ending at each, are:
    • 3019-NYCBS Depot – DEL — 325
    • 3039-NYCBS Depot – DYR — 3
    • 3247-SSP – Basement — 3
    • 3439-Broadway & E 22 St – Valet Scan — 1
    • 475-E 15 St & Irving Pl — 1
    • 3442-Indiana — 1
  • Each of these 6 end stations has 3 or fewer rides ending at it, except 3019-NYCBS Depot – DEL.
• Routes
  • Out of 305293 routes, 747 have the same start and end station. Some popular stations for such rides are 2006-Central Park S & 6 Ave, 281-Grand Army Plaza & Central Park S, 387-Centre St & Chambers St, and 3182-Yankee Ferry Terminal.
  • 2% of rides have the same start and end station. By duration, such rides account for 4%. Rides with the same start and end station are in park areas.
  • Popular Routes
  • Popular routes by duration
  1. 2006-Central Park S & 6 Ave TO 3143-5 Ave & E 78 St
  2. 3215-Central Ave TO 3267-Morris Canal
  3. 3052-Lewis Ave & Madison St TO 3432-NYCBS Depot – GOW
  4. 281-Grand Army Plaza & Central Park S TO 2006-Central Park S & 6 Ave
  5. 514-12 Ave & W 40 St TO 426-West St & Chambers St

Popular routes by duration mainly have few rides with high durations, so they lie in the skewed region of the distribution. Further analysis of routes is therefore based on number of rides and on short-duration trips.

  • Popular routes by number of rides
  1. 2006-Central Park S & 6 Ave TO 2006-Central Park S & 6 Ave
  2. 3203-Hamilton Park TO 3186-Grove St PATH
  3. 2006-Central Park S & 6 Ave TO 3143-5 Ave & E 78 St
  4. 3209-Brunswick St TO 3186-Grove St PATH
  5. 435-W 21 St & 6 Ave TO 509-9 Ave & W 22 St
  • Popular routes for short duration trips
  1. 459-W 20 St & 11 Ave TO 426-West St & Chambers St
  2. 3165-Central Park West & W 72 St TO 2006-Central Park S & 6 Ave
  3. 3119-Vernon Blvd & 50 Ave TO 3124-46 Ave & 5 St
  4. 3165-Central Park West & W 72 St TO 3137-5 Ave & E 73 St
  5. 426-West St & Chambers St TO 459-W 20 St & 11 Ave
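
The route-level figures in this section (rides per route, share of rides that start and end at the same station, and stations that only ever appear as end stations) can be derived from id-name station keys. The sketch below is illustrative only: the function names and the input shape, a list of (start, end) key pairs, are assumptions.

```python
from collections import Counter


def station_key(station_id, station_name):
    """Combine id and name, e.g. '519-Pershing Square North'."""
    return f"{station_id}-{station_name}"


def route_key(start_key, end_key):
    """Name a route in the 'START TO END' style used in this article."""
    return f"{start_key} TO {end_key}"


def summarise_routes(rides):
    """Summarise an iterable of (start_key, end_key) pairs."""
    route_counts = Counter()
    starts, ends = set(), set()
    same_station = 0
    total = 0
    for start, end in rides:
        route_counts[route_key(start, end)] += 1
        starts.add(start)
        ends.add(end)
        same_station += int(start == end)
        total += 1
    return {
        "route_counts": route_counts,
        "same_station_share": same_station / total if total else 0.0,
        "end_only_stations": ends - starts,
    }
```

`route_counts.most_common(5)` then gives the five most popular routes by number of rides. Ranking by duration would sum trip durations per route instead of counting rides.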

 

  • The range of trip duration is wide when stations are near parks.

Popular Routes by Duration (Trip Duration Vs Rides)

Route: 2006-Central Park S & 6 Ave TO 3143-5 Ave & E 78 St
Comments:
  • The route has more rides in the morning. The reverse route has more rides in the afternoon.
  • The route has a median duration of 26 min; the reverse route has 16 min.
  • The number of rides on the reverse route is almost 30% lower.

Route: 281-Grand Army Plaza & Central Park S TO 2006-Central Park S & 6 Ave
Comments:
  • The route has more rides in the morning half of the day, yet is busy from 9 to 18. The reverse route has almost the same ride traffic, but with more rides in the later half of the day.

Route: 3165-Central Park West & W 72 St TO 2006-Central Park S & 6 Ave
Comments:
  • Both the route and the reverse route have more mid-day rides.
  • The route has a median duration of 5 min; the reverse route has 8 min and 24 min.

Route: 3165-Central Park West & W 72 St TO 3137-5 Ave & E 73 St
Comments:
  • Both the route and the reverse route have more rides in the mid-day hours.
  • The median duration for both routes is 5 min.
  • Both routes have almost the same number of rides.

  • For stations other than park stations, trip duration falls in a narrower range.

Popular Routes by Duration (Trip Duration Vs Rides)

Route: 514-12 Ave & W 40 St TO 426-West St & Chambers St
Comments:
  • The route has rides from 8 to 20, with more rides in the evening. The reverse route also has more rides in the evening.
  • The median trip duration is 22 min for the route and 21 min for the reverse route.

Route: 459-W 20 St & 11 Ave TO 426-West St & Chambers St
Comments:
  • Both the route and the reverse route have more rides in the evening.
  • Both routes have a median duration of 14 min.
  • The number of rides is almost the same for both routes.

Route: 3119-Vernon Blvd & 50 Ave TO 3124-46 Ave & 5 St
Comments:
  • The route has more rides in the evening, and the reverse route has more rides in the morning.
  • The median duration is 3 min for both routes.
  • Both routes have almost the same number of rides.

• The speed of the Rider

  • Speed is calculated from the locations of the start and end stations and the ride duration. Hence, for rides with the same start and end station, the exact speed cannot be calculated. For other rides, the speed range is 1 to 20 miles/hr.
  • 90% of rides have a speed below 8 miles/hr.
  • 39% of rides have a speed between 4 and 6 miles/hr, and 32% between 6 and 8 miles/hr.
  • The average speed is 5.64 miles/hr.
  • This is lower than the Citi Bike average of 8.3 miles/hr. The lower speed improves safety by reducing the risk of accidents.
  • Gen Y.1 and Gen Y.2 have the highest average speed.
Generation      Average speed (miles/hr)
Baby Boomers    5.45
Gen X           5.85
Gen Y.1         5.99
Gen Y.2         5.99
Gen Z           5.58

• Idle Time for Bikes, Unused Bikes, Sufficiency of Bikes

The idle time of a bike is its non-business time, calculated as the time between consecutive trips of each bike.
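
A minimal sketch of this calculation follows. It assumes that a bike's idle time is the sum of the gaps between each trip's stop time and the next trip's start time. The overused and unused thresholds default to the 100-day and 500-day limits quoted in this section. The function names and the input shape are hypothetical.

```python
from collections import defaultdict
from datetime import timedelta


def total_idle_time_per_bike(rides):
    """Sum the gaps between consecutive trips of each bike.

    rides: iterable of (bike_id, start_datetime, stop_datetime).
    Returns {bike_id: timedelta}.
    """
    trips_by_bike = defaultdict(list)
    for bike_id, start, stop in rides:
        trips_by_bike[bike_id].append((start, stop))

    idle = {}
    for bike_id, trips in trips_by_bike.items():
        trips.sort()  # chronological order by start time
        total = timedelta(0)
        for (_, prev_stop), (next_start, _) in zip(trips, trips[1:]):
            gap = next_start - prev_stop
            if gap > timedelta(0):  # ignore overlapping or out-of-order records
                total += gap
        idle[bike_id] = total
    return idle


def classify_bike(idle, overused_max_days=100, unused_min_days=500):
    """Label a bike by its total idle time, expressed in days."""
    days = idle / timedelta(days=1)
    if days <= overused_max_days:
        return "overused"
    if days >= unused_min_days:
        return "unused"
    return "in between"
```

Time before a bike's first recorded ride and after its last one is not counted here. Whether the original analysis counted it is not stated.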

  1. Average trip duration per bike is 20.67 days out of 670.
  2. Overused and unused bikes
    • The number of overused bikes is 1050 (idle time between 0 and 100 days per bike, out of 670 days).
    • The number of unused bikes is 5757 (idle time between 500 and 650 days, out of 670 days).
    • Unused bikes outnumber overused bikes, so there is no need to buy new bikes. Operational activity such as moving bikes to the most-used stations can reduce idle time and increase bike utilization.
  3. Requirement of bikes at each station: heat map for daily hours

The heat map shows the bikes present at a particular station at a particular hour of the day, as the difference between incoming and outgoing bikes. If incoming bikes exceed outgoing bikes, the station has more than enough bikes, and vice versa.
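
The quantity behind such a heat map can be sketched as a net flow per station and hour: each ride removes a bike from its start station and adds one to its end station. The function name and the input shape below are assumptions for illustration.

```python
from collections import Counter


def net_flow_by_station_hour(rides):
    """Incoming minus outgoing bikes for every (station, hour) cell.

    rides: iterable of (start_station, start_hour, end_station, end_hour).
    A positive value means a surplus of bikes, a negative value a shortage.
    """
    net = Counter()
    for start_station, start_hour, end_station, end_hour in rides:
        net[(start_station, start_hour)] -= 1  # a bike leaves
        net[(end_station, end_hour)] += 1      # a bike arrives
    return net
```

Arranging the result with stations as rows and hours 0 to 23 as columns gives the grid that a heat map would colour.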

  • Bikes at station 402-Broadway & E 22 St:
  1. The heat map shows that in the morning hours from 8 am to 10 am, there is an excess of bikes.
  2. In the evening hours of 5 pm and 6 pm, there is a shortage of bikes.
  • The heat map shows that demand for bikes is higher in the evening and that sufficient bikes are not available at certain stations. Supplying idle bikes to these stations at the evening peak can increase rides.

Heat map analysis was done for all 22 months. It shows that, although bikes are being moved, station 519-Pershing Square North is short of bikes throughout all the months, while station 402-Broadway & E 22 St always has more bikes than required.

• The direction of rides is mainly South in the morning hours and North in the evening hours.

  • Popular directions in the morning are South and Southeast. In the evening they are North and Northwest.
  • Ride directions in the morning and evening are opposite.

• Trend of Duration and Rides

  • Annual trend for trip duration: the overall trend looks increasing. However, the data spans only 1 full year, so the overall trend remains uncertain.
  • During winter (Dec, Jan, Feb) the duration of rides is lower. Rides start increasing from March and peak during summer (Aug, Sept, Oct), after which they drop again.
  • Weekly trend of rides: rides are higher on weekdays than on weekends, with a peak on Wednesday. Wednesday rides are almost 30% higher than weekend rides.
  • Trend of rides over the hours of the day: the busy hours are in the morning and evening.
  • The morning peak hours are 7 to 9, with the highest number of rides at 8 am.
  • The evening peak hours are 4.30 to 6.30. Evening rides are distributed across these hours.
  • Peak rides appear to be higher in the evening than in the morning. The night hours from 2 to 4 appear to be idle time. Average rides between the 2 peak times are almost 1/3 of evening peak rides.
  • This overall hourly trend is the same across all seasons.
  • Summer rides are almost double the rides in winter.
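
The hourly and weekly profiles described above reduce to simple counts over ride start times. The following is a minimal sketch with a hypothetical function name. It assumes each ride is attributed to the hour and weekday in which it starts.

```python
from collections import Counter
from datetime import datetime

# Fixed English names, so the result does not depend on the system locale.
WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")


def rides_by_hour_and_weekday(start_times):
    """Count rides per start hour (0-23) and per weekday name."""
    by_hour = Counter()
    by_weekday = Counter()
    for ts in start_times:
        by_hour[ts.hour] += 1
        by_weekday[WEEKDAYS[ts.weekday()]] += 1
    return by_hour, by_weekday
```

`by_hour.most_common(1)` returns the busiest start hour. Running the same function on a single season's rides would allow the hourly profile to be compared across seasons.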

• Predict Rides

With monthly ride values available for over 22 months, a SARIMA model was used for prediction.
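
For reference, the general form of a seasonal ARIMA model, usually written SARIMA(p, d, q)(P, D, Q)_s, is given below using the backshift operator B, where B y_t = y_{t-1}. The orders and the seasonal period s used in this analysis are not stated in the article, so the symbols are generic rather than the fitted values.

```latex
\phi_p(B)\,\Phi_P(B^{s})\,(1-B)^{d}\,(1-B^{s})^{D}\,y_t
  = \theta_q(B)\,\Theta_Q(B^{s})\,\varepsilon_t
```

Here y_t is the number of rides at time t, \phi_p and \theta_q are the non-seasonal autoregressive and moving-average polynomials of orders p and q, \Phi_P and \Theta_Q are their seasonal counterparts of orders P and Q, d and D are the non-seasonal and seasonal differencing orders, and \varepsilon_t is white noise.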

User Type – Subscriber

From the beginning, the Subscriber trend increases steadily. Rides have been popular among Subscriber users from the start, with a steady increase in the number of rides.

User Type – Customer

Customer users increased their volume of rides from April 2017. Until April 2017 the volume of rides is almost flat, and after April 2017 it shows a significantly increasing trend. This shows that the popularity of rides among short-period users (Usertype = Customer) is increasing.

Conclusion

Insights

  1. Popular categories: by user type, Subscribers are the most popular. By gender, male users are the most popular. By age group, Gen X, Gen Y.1 and Gen Y.2 are popular.
  2. Short-duration rides are more popular, as the rides appear to be for commuting to the workplace on weekdays. Rides are also significant around park areas.
  3. The busiest hours for rides are 7 am to 9 am in the morning and 4.30 pm to 6.30 pm in the evening. The directions with the most ride traffic are South-North and Southeast-Northwest.
    • Evening ride traffic is higher than morning traffic.
  4. The idle time of bikes exceeds their business/ride time. Considering the inventory at stations, reorganizing idle bikes can increase the business or ride time of bikes.
  5. The speed range is 4 mph to 8 mph, which is a safe speed zone.

Prediction –

The observation period (22 months) is short for model fitting and prediction. Yet the trend and seasonality are well captured by the SARIMA model.

The overall trend of rides is increasing.

The prediction values for 2 Usertypes (Subscriber and Customer) are as below:

For Subscriber User

Date         Forecast Value   Lower Bound   Upper Bound
2017-07-31   58421.00         58421.00      58421.00
2017-08-01   59293.02         38936.31      79649.73
2017-08-02   63769.61         42247.56      85291.66
2017-08-03   61010.41         38382.96      83637.86
2017-08-04   56069.17         32387.86      79750.48
2017-08-05   46179.83         21489.60      70870.07
2017-08-06   46960.04         21300.53      72619.56
2017-08-07   55790.90         29197.41      82384.39
2017-08-08   60652.38         33060.15      88244.61

For Customer User

Date         Forecast Value   Lower Bound   Upper Bound
2017-07-31   7102.00          7102.00       7102.00
2017-08-01   7371.86          3405.25       11338.47
2017-08-02   6844.37          2388.52       11300.22
2017-08-03   6481.24          1881.28       11081.20
2017-08-04   6742.30          2083.88       11400.72
2017-08-05   12444.95         7754.26       17135.64
2017-08-06   13992.53         9278.96       18706.09
2017-08-07   6903.97          2171.31       11636.63
2017-08-08   7386.67          2538.45       12234.89