1 Introduction

New York is a biking city. Over the past 10 years, bike ridership has more than quadrupled, making cycling the city's fastest-growing mode of transportation for New Yorkers looking for a safe, convenient, and affordable way to travel. With its mostly flat geography and high density, New York City is an ideal place for a bike share system. Citi Bike has capitalized on these strengths by offering an unlimited number of affordable, short bike trips. Polls released by Quinnipiac University have shown that over 70% of New Yorkers support Citi Bike's system. Another reason the public has strongly embraced the system is the exhaustive and highly participatory planning process used to site Citi Bike stations across New York City. Citi Bike targets New Yorkers looking for a time-efficient, healthier, and fun way of traveling in the city. The initiative is not only healthy for the rider, thanks to the exercise involved, but also healthy for the environment, since it saves gasoline, prevents carbon emissions, and keeps pollutants out of the air.

Although Citi Bike's system is a win-win for all, it faces a costly maintenance issue: rebalancing bikes across stations. Existing systems frequently suffer from situations where a station has no bikes left to rent, or no empty docks to return a bike to, because of ad-hoc rebalancing practices. Asymmetric demand creates empty and full stations across the city. For example, docking stations in affluent Manhattan neighborhoods see far more rentals than returns. Many factors, such as time, weather, past demand, and space, affect bike share demand across the city. There are also hidden factors: New York City's Upper West Side, for example, is an uphill area with high observed demand, and because people generally dislike riding uphill, its stations see more rentals than returns. Such hidden factors help explain why demand is asymmetric across the city, and they might play a large role in optimizing the re-balancing model.

In this analysis, we look into the Citi Bike share system and attempt to predict demand across different bike stations at different times of the day. With a demand forecast in hand, we then devise a re-balancing plan to cater to that demand. For the scope of this analysis, we consider three resources to carry out the re-balancing: (i) trucks, which can carry a large number of bikes but generally require low traffic; (ii) trikes, which are much smaller than trucks and can reach a bike station even in heavy traffic; and (iii) corrals, which artificially increase the capacity of a bike station.

2 Setup

We start the analysis by importing the required libraries and helper functions, and defining some color palettes.

# Import necessary libraries
library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(gganimate)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(gifski)
library(tidycensus)
library(viridis)
library(osmdata)
library(FNN)

options(tigris_class = "sf")

# Import required functions
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

# Initialize color palettes
palette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
palette3 <- c("#eff3ff","#6baed6","#08519c")
palette4 <- c("#D2FBD4","#92BCAB","#527D82","#123F5A")
palette2 <- c("#6baed6","#08519c")

3 Getting the Data

3.1 Bike Share

As mentioned in Section 1, we analyze the Citi Bike trip data. To narrow the scope, we consider only the trips from May 2018 and the first week of June 2018, giving us a total of five weeks for the analysis.

# Uncomment to process data in real-time

# # Get data for May 2018
# ride_2018_05 <- read.csv("../raw_data/201805-citibike-tripdata.csv") %>%
#   select(starttime, start.station.id, start.station.latitude, start.station.longitude,
#          end.station.id, end.station.latitude, end.station.longitude)
# 
# # Get data for first week of June 2018
# ride_2018_06 <- read.csv("../raw_data/201806-citibike-tripdata.csv") %>%
#   select(starttime, start.station.id, start.station.latitude, start.station.longitude,
#          end.station.id, end.station.latitude, end.station.longitude) %>%
#   filter(starttime < ymd("2018-06-04"))
# 
# 
# # combine the two dataframes
# ride <- rbind(ride_2018_05, ride_2018_06)
# 
# save(ride, file = "ride.RData")

# Load the saved data
load("ride.RData")

3.2 Census

3.2.1 Downloading census tracts

Next, we download census data for the three counties spanned by the Citi Bike system: New York County, Kings County, and Queens County. We will not use the census variables in our predictive models (they would be collinear with the bike station ids); rather, we use the tract boundaries for mapping purposes.

# Setting the CRS (a Lambert conformal conic projection with units in meters)
crs <- st_crs("+proj=lcc +lat_1=40 +lat_2=41.66666666666666 +lat_0=39.33333333333334 +lon_0=-122 +x_0=2000000 +y_0=500000 +ellps=GRS80 +units=m +no_defs")

# Downloading census tracts data
nycCensus <- 
  get_acs(geography = "tract", variables = c("B01003_001", "B19013_001", "B02001_002", 
                                             "B08013_001", "B08012_001", "B08301_001", 
                                             "B08301_010"), 
          year = 2018, state = "NY", geometry = TRUE, county=c("New York County", "Kings County", "Queens County")) %>%
  st_transform(crs=crs) %>% 
  mutate(variable = 
          case_when(variable == "B01003_001" ~ "Total_Population",
                    variable == "B19013_001" ~ "Median_Household_Income",
                    variable == "B02001_002" ~ "Total_White_Population",
                    variable == "B08013_001" ~ "Aggregate_Travel_Time_to_Work",
                    variable == "B08012_001" ~ "Sex_of_Workers_by_Travel_Time_to_work",
                    variable == "B08301_001" ~ "Means_of_Transportation_to_Work",
                    variable == "B08301_010" ~ "Total_Public_Trans_excl_Taxi")) %>%
  select(variable, estimate, GEOID, geometry) %>%
  spread(variable,estimate) %>%
  mutate(Percent_White = Total_White_Population / Total_Population,
         Mean_Commute_Time_for_Workers = 
           Aggregate_Travel_Time_to_Work / Sex_of_Workers_by_Travel_Time_to_work,
         Percent_Taking_Public_Trans = 
           Total_Public_Trans_excl_Taxi / Means_of_Transportation_to_Work) %>%
  gather(Variable,Value, -GEOID, -geometry) 

# Creating the boundary
nycBoundary <- st_union(nycCensus)

3.2.2 Filtering Out Study Area

To mark our study area, we next look at how the bike stations are distributed across the three counties. As can be seen from the left-hand plot in Figure 1, the majority of the stations fall inside New York County, while the others lie near the boundaries of Kings and Queens Counties.

To extract our study area, we create a quarter-mile buffer around each bike station and select the census tracts that touch this buffer. The right-hand plot in Figure 1 shows the density of bike stations in the resulting study area.

# Filter out bike stations 

bikeStations <- ride %>% 
  select(start.station.id, start.station.latitude, start.station.longitude) %>%
  distinct() %>% 
  filter(start.station.latitude < 42) %>% 
  st_as_sf(coords = c("start.station.longitude", "start.station.latitude"), crs = 4326, agr = "constant") %>%
  st_transform(crs=crs) %>% 
  mutate(start.station.id = as.character(start.station.id))

# Filtering ride data based on valid bike station ids
ride <- ride %>% filter(start.station.id %in% bikeStations$start.station.id)

# Filtering out census tracts within a quarter mile (~402 m) of a bike station
buffer_around_bikestations <- st_union(st_buffer(bikeStations, 402))

# Plotting the data

p1 <- ggplot() + 
  geom_sf(data = nycCensus, color = "grey", fill = "transparent") + 
  geom_sf(data = buffer_around_bikestations, fill = "red", alpha = 0.3) + 
  geom_sf(data = bikeStations, alpha = 0.3) + 
  labs(title = "Study Area",
       subtitle = "Red area denotes the area of study \nwhile black dots denote the bike stations") + 
  mapTheme()

# Getting the new boundaries

nycCensus <- nycCensus %>% .[buffer_around_bikestations, ] 
nycBoundary <- st_union(nycCensus)

p2 <- ggplot() + 
  geom_sf(data = nycCensus, color = "grey", fill = "transparent") +
  stat_density2d(data = data.frame(st_coordinates(bikeStations)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_viridis() +
  scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
  labs(title = "Density of bike stations") +
  mapTheme(title_size = 14) + theme(legend.position = "none")


grid.arrange(ncol = 2, p1, p2, bottom="Figure 1: Area of study: parts of New York, Kings and Queens Counties")

3.3 Weather

Figure 2 below shows the weather patterns for May and early June 2018. As can be seen, May is one of New York City's loveliest months, with temperatures ranging from 50 to 80°F, warm but not hot, and perfect for a bike ride. There were also a few spells of rain during this period, which might affect the demand for bike share rides.

# Getting the weather data
weather.Data <- 
  riem_measures(station = "NYC", date_start = "2018-05-01", date_end = "2018-06-04")


# Processing weather data to get hourly metrics
weather.Panel <-
    weather.Data %>%
    mutate_if(is.character, list(~replace(as.character(.), is.na(.), "0"))) %>% 
    replace(is.na(.), 0) %>%
    mutate(interval60 = ymd_h(substr(valid, 1, 13))) %>%
    mutate(week = week(interval60),
           dotw = wday(interval60, label=TRUE)) %>%
    group_by(interval60) %>%
    summarize(Temperature = max(tmpf),
              Percipitation = sum(p01i),
              Wind_Speed = max(sknt)
              ) 

# Plotting
grid.arrange(top = "Weather Data - NYC - May & June, 2018",
  ggplot(weather.Panel, aes(interval60,Percipitation)) + geom_line() + 
    labs(title="Percipitation", x="Hour", y="Percipitation") + plotTheme(),
  ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line() + 
    labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme(),
  ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line() + 
    labs(title="Temperature", x="Hour", y="Temperature") + plotTheme(),
  bottom = "Figure 2: Weather patterns across the months")

4 Data Exploration

Following our data gathering process, we next jump into exploring the characteristics of our bike share dataset.

4.1 Distribution of Trip Count

First, we look at the distribution of trip counts across stations and hours of the day. From Figure 3 below, we can see that this distribution is heavily right-skewed: low trip counts occur with high frequency (many station-hours have zero trips), while high trip counts are rare. From a re-balancing point of view, this plot indicates that we should focus our attention on the particular stations and times of day where demand is high.

# Extract time related features from the ride data

ride2 <-
  ride %>%
  mutate(interval60 = floor_date(ymd_hms(starttime), unit = "hour"),
         interval15 = floor_date(ymd_hms(starttime), unit = "15 mins"),
         week = week(interval60),
         dotw = wday(interval60, label=TRUE),
         start.station.id = as.character(start.station.id),
         end.station.id = as.character(end.station.id))


# Bike share trips per hr by station
ggplot(ride2 %>%
       group_by(interval60, start.station.id) %>%
       tally())+
  geom_histogram(aes(n), binwidth = 5)+
  labs(title="Bike share trips per hr by station. NYC, May-June, 2018",
       x="Trip Counts", 
       y="Number of Stations",
       caption = "Figure 3")+
  plotTheme()

To dig deeper, we break Figure 3 down by time of day: morning rush (7AM-10AM), mid-day (10AM-3PM), evening rush (3PM-7PM), and overnight (7PM-7AM). From Figure 4, the distributions for the morning and evening rush hours have heavier right tails, in simple words, more demand, compared to mid-day and overnight. This makes intuitive sense: demand peaks when people commute to and from work and want to use bikes to avoid traffic, especially in a crowded city like New York.

From the point of view of re-balancing, this plot suggests that the overnight period is the most favorable time to move bikes: demand is low and so is traffic, which favors trucks as the mode of transport. We could also forecast demand for the entire next day (as we will see with the model built on just the 1-day lag variable in Section 5.1) and plan to re-balance each station accordingly, say by its average forecasted demand.
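
To make the next-day idea concrete, here is a minimal sketch of such a plan. It assumes the station/hour panel ride.panel (built in Section 4.2.2) and the fitted 1-day-lag model reg4 (Section 5.1) are available; the planning date is just an illustrative example.

# A minimal sketch (not run here): average forecasted demand per station
# for one example day, which an overnight crew could use as a target
plan.day <- ride.panel %>%
  st_drop_geometry() %>%
  filter(as.Date(interval60) == as.Date("2018-05-21"))  # illustrative date

plan.day$Predicted <- predict(reg4, newdata = plan.day)  # 1-day-lag model

overnight.plan <- plan.day %>%
  group_by(start.station.id) %>%
  summarize(avg_forecast = mean(Predicted, na.rm = TRUE)) %>%
  arrange(desc(avg_forecast))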

# Mean Number of Hourly Trips Per Station

ride2 %>%
  mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                         hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                         hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                         hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  group_by(interval60, start.station.id, time_of_day) %>%
  tally()%>%
  group_by(start.station.id, time_of_day)%>%
  summarize(mean_trips = mean(n))%>%
  ggplot()+
  geom_histogram(aes(mean_trips), binwidth = 1)+
  labs(title="Mean Number of Hourly Trips Per Station. NYC, May-June, 2018",
       x="Number of trips", 
       y="Frequency",
       caption = "Figure 4")+
  facet_wrap(~time_of_day)+
  plotTheme()

4.2 Space/Time Dependencies in Data

4.2.1 Day & Time of the Week

Next, we look at how bike share trips vary across the days of the week and by time of day. From Figure 5 below, we can see that demand is noticeably higher on weekdays than on weekends. On weekdays, we note two significant peaks in demand: (i) between 5AM and 10AM and (ii) between 3PM and 8PM, which map to the morning and evening rush hour categories in Figure 4. Overall, this plot indicates that re-balancing efforts should be focused on weekdays, especially during these two peaks, to cater to consumer demand.

# Bike share trips by day of the week

p3 <- ggplot(ride2 %>% mutate(hour = hour(starttime)))+
  geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
  labs(title="Bike share trips in NYC\nby day of the week",
       subtitle = "NYC, May-June 2018",
       x="Hour", 
       y="Trip Counts")+
  plotTheme() + 
  theme(legend.position="bottom")
  
# Bike share trips - weekend vs weekday,

p4 <- ggplot(ride2 %>% 
      mutate(hour = hour(starttime),
             weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")
             )
      )+
  geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
  labs(title="Bike share trips in NYC\nWeekend vs Weekday",
       subtitle = "NYC, May-June 2018",
       x="Hour", 
       y="Trip Counts")+
  plotTheme() + 
  theme(legend.position="bottom")

grid.arrange(ncol=2, p3, p4, bottom="Figure 5: Bike share trips across days of week")

To build on Figure 5, we next look at where in space demand is highest. From Figure 6 below, we see that on weekdays, and across almost all time periods, demand for bikes is higher in Upper West, Midtown, and Downtown Manhattan than in other areas of the city. From a re-balancing point of view, this indicates that we need to focus on the Manhattan area to cater to this higher demand.

Considering our re-balancing resources, setting up corrals in these places can be a good idea to ensure that sufficient bikes are present. Trucks can be used to re-balance the bike stands overnight, and trikes can then be used during the morning and evening rush hours to move bikes from areas of low demand into Manhattan.

ggplot()+
  geom_sf(data = nycCensus, color = "grey", fill = "transparent")+
  geom_sf(data = ride2 %>% 
                mutate(hour = hour(starttime),
                       weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
                       time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
                group_by(start.station.id, start.station.latitude, start.station.longitude, weekend, time_of_day) %>%
                tally()  %>% 
                st_as_sf(coords = c("start.station.longitude", "start.station.latitude"), crs = 4326, agr = "constant") %>%
                st_transform(crs=crs)
            ,
            aes(color = n), 
            fill = "transparent", alpha = 0.6, size = 0.5)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  facet_grid(weekend ~ time_of_day)+
  labs(title="Bike share trips per hr by station",
       subtitle = "NYC, May-June, 2018",
       caption = "Figure 6")+
  mapTheme()

4.2.2 Exploring Time Lag

Next, we explore how demand at a station depends on demand in past units of time. From Figure 7, we see that demand at a station in the previous hour correlates strongly with demand in the current hour, with the correlation decaying as the lag grows. A similar conclusion holds for demand at the same time on the previous day. For re-balancing, this indicates that both the past hour's demand and the previous day's demand are good indicators of expected current demand, although acting on the past hour's demand is operationally challenging, since the re-balancing effort would have to be completed within an hour.

Note: We do not consider a holiday lag variable here. Memorial Day is the only holiday in the selected time frame, so it would not be possible to check how our predictive model performs around holiday demand patterns.
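
For completeness, here is a sketch of how a holiday indicator and its one-day lag could be built if the panel covered more holidays; the names below are illustrative and these features are not used in our models.

# A sketch only: a holiday dummy and its 1-day lag (not used in our models)
holidays <- as.Date("2018-05-28")  # Memorial Day 2018

ride2.holiday <- ride2 %>%
  mutate(holiday = as.integer(as.Date(interval60) %in% holidays),
         holidayLag1 = as.integer((as.Date(interval60) - 1) %in% holidays))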

study.panel <- 
  expand.grid(interval60 = unique(ride2$interval60), 
              start.station.id = unique(ride2$start.station.id)) %>% 
  mutate(start.station.id = as.character(start.station.id))

# Aggregating Trip by Bike Station

ride.panel <- 
  ride2 %>%
    select(start.station.id, interval60) %>% 
    mutate(Trip_Counter = 1) %>%
    right_join(study.panel) %>% 
    group_by(start.station.id, interval60) %>%
    summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>% 
    left_join(weather.Panel, by = "interval60") %>%
    left_join(bikeStations) %>%
    mutate(week = week(interval60),
          dotw = wday(interval60, label = TRUE)) %>%
    st_sf()

# Get lag features 

ride.panel <- 
  ride.panel %>% 
    arrange(start.station.id, interval60) %>% 
    group_by(start.station.id) %>% 
    mutate(lagHour = dplyr::lag(Trip_Count,1),
           lag2Hours = dplyr::lag(Trip_Count,2),
           lag3Hours = dplyr::lag(Trip_Count,3),
           lag4Hours = dplyr::lag(Trip_Count,4),
           lag12Hours = dplyr::lag(Trip_Count,12),
           lag1day = dplyr::lag(Trip_Count,24)) %>% 
   ungroup()

# Plotting 

plotData.lag <-
  filter(as.data.frame(ride.panel), week == 19) %>%
  dplyr::select(starts_with("lag"), Trip_Count) %>%
  gather(Variable, Value, -Trip_Count) %>%
  mutate(Variable = fct_relevel(Variable, "lagHour","lag2Hours","lag3Hours",
                                          "lag4Hours","lag12Hours","lag1day"))
correlation.lag <-
  group_by(plotData.lag, Variable) %>%
  summarize(correlation = round(cor(Value, Trip_Count, use = "complete.obs"), 2)) %>% 
  mutate(correlation = paste0("R = ", correlation))


plotData.lag %>%
  ggplot(aes(x = Value, y = Trip_Count)) + 
  geom_point() +
  geom_smooth(method = "lm") + 
  facet_wrap(~Variable) +
  geom_text(data = correlation.lag, mapping = aes(x = -Inf, y = Inf, label = correlation), hjust = -0.1, vjust = 1.25, color="blue") + 
  labs(
    title = "Rideshare trip count as a function of time lags",
    subtitle = "One week in May 2018",
    x = "Lag Trip Count",
    caption = "Figure 7"
  ) + 
  plotTheme()

4.2.3 Animation Map

The animation in Figure 8 is a great summary of our findings so far; watching it, one can almost see the city breathing throughout the day. The two spikes in demand, the morning and evening rush hours, are clearly visible, with relatively more demand in the Manhattan area.

ride.animation.data <- filter(ride.panel , week == 19 & dotw == "Mon") %>% 
   mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
                             Trip_Count > 0 & Trip_Count <= 5 ~ "1-5 trips",
                             Trip_Count > 5 & Trip_Count <= 10 ~ "5-10 trips",
                             Trip_Count > 10 & Trip_Count <= 15 ~ "10-15 trips",
                             Trip_Count > 15 ~ "15+ trips")) %>%
    mutate(Trips  = fct_relevel(Trips, "0 trips","1-5 trips","5-10 trips",
                                       "10-15 trips","15+ trips"))


rideshare_animation <-
    ggplot() +
    geom_sf(data = nycCensus, color = "grey", fill = "transparent") + 
    geom_sf(data = ride.animation.data, aes(color = Trips)) +
    scale_color_manual(values = palette5) +
    labs(title = "Rideshare pickups for a Monday in May 2018",
         subtitle = "60 minute intervals: {current_frame}",
         caption = "Figure 8") +
    transition_manual(interval60) +
    mapTheme()

animate(rideshare_animation, duration=20, renderer = gifski_renderer())

4.3 Exploring Weather

Next, we look at the weather-related features we extracted in Section 3.3, specifically temperature and precipitation. As can be seen from Figure 9, the average number of trips is generally higher during hours without rain, which makes intuitive sense. From the point of view of re-balancing, this indicates that re-balancing efforts can be relaxed when precipitation is likely.

# Plotting the effect of precipitation on bike share rides

st_drop_geometry(ride.panel) %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = mean(Trip_Count),
            Percipitation = first(Percipitation)) %>%
  mutate(isPercip = ifelse(Percipitation > 0,"Rain/Snow", "None")) %>%
  drop_na() %>% 
  group_by(isPercip) %>%
  summarize(Mean_Trip_Count = mean(Trip_Count)) %>%
    ggplot(aes(isPercip, Mean_Trip_Count)) + geom_bar(stat = "identity") +
      labs(title="Does ridership vary with percipitation?",
           x="Percipitation", y="Mean Trip Count",
           caption = "Figure 9") +
      plotTheme()

Figure 10 shows how the number of trips varies with temperature: trips increase as temperature rises, across all weeks. In simple terms, people use the bike share system more as the weather gets pleasant (say around 70°F) than when it is relatively colder (say around 50°F). From a re-balancing point of view, this calls for increasing re-balancing efforts as temperatures become more favorable for a bike ride (relying on the day's weather forecast).

# Plotting how bike share rides vary across temperature. 

st_drop_geometry(ride.panel) %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = mean(Trip_Count),
            Temperature = first(Temperature)) %>%
  mutate(week = week(interval60)) %>%
  ggplot(aes(Temperature, Trip_Count)) + 
    geom_point() + geom_smooth(method = "lm", se= FALSE) +
    facet_wrap(~week, ncol=8) + 
    labs(title="Trip Count as a fuction of Temperature by week",
         x="Temperature", y="Mean Trip Count",
         caption = "Figure 10") +
    plotTheme() 

4.4 Experimenting with Exposure Variables

Next, we consider some interesting exposure variables, such as transit stations and universities, and examine how proximity to these amenities alters demand at a bike station. Figure 11 below plots the total trips at each station on a typical day against the distance to the nearest such amenity. Both plots reveal a negative correlation: as one moves away from these amenities, demand goes down. From a re-balancing point of view, this indicates that our plan should focus more on areas close to transit stations and universities, where demand is likely to be high.

# Average distance from each point in `measureFrom` to its k nearest
# neighbors in `measureTo` (uses FNN::get.knnx)
nn_function <- function(measureFrom, measureTo, k) {
  measureFrom_Matrix <- as.matrix(measureFrom)
  measureTo_Matrix <- as.matrix(measureTo)
  nn <- get.knnx(measureTo_Matrix, measureFrom_Matrix, k)$nn.dist
  output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>% 
    dplyr::select(-thisPoint) %>%
    pull()
  
  return(output)  
}

st_c <- st_coordinates

nycBoundary.4326 <- nycBoundary %>% st_transform(crs=4326)

xmin <- st_bbox(nycBoundary.4326)[[1]]
ymin <- st_bbox(nycBoundary.4326)[[2]]
xmax <- st_bbox(nycBoundary.4326)[[3]]  
ymax <- st_bbox(nycBoundary.4326)[[4]]

bbox <- c(xmin, ymin, xmax, ymax)

# Transit station

transit_station <- opq(bbox = bbox) %>% 
    add_osm_feature(key = 'railway', value = c("station","subway")) %>%
    osmdata_sf()

transit_station <- 
  transit_station$osm_points %>%
  .[nycBoundary.4326,] %>% 
  st_transform(crs=crs) %>% 
  mutate(Legend = "Transit_Station") %>% 
  dplyr::select(Legend)

# University

university <- opq(bbox = bbox) %>% 
    add_osm_feature(key = 'amenity', value = c("university")) %>%
    osmdata_sf()

university <- 
  university$osm_points %>%
  .[nycBoundary.4326,] %>% 
  st_transform(crs=crs) %>% 
  mutate(Legend = "University") %>% 
  dplyr::select(Legend)

# Nearest features

ride.panel <- ride.panel %>%
  mutate(
    TransitStation.nn = nn_function(st_c(ride.panel), st_c(transit_station),1),
    University.nn = nn_function(st_c(ride.panel), st_c(university),1)
    ) 

amenities <- st_drop_geometry(ride.panel) %>% 
  filter(week == 20, dotw == "Tue") %>% 
  select(start.station.id, TransitStation.nn, University.nn) %>% 
  distinct()

# Plotting

st_drop_geometry(ride.panel) %>% 
  filter(week == 20, dotw == "Tue") %>% 
  group_by(start.station.id) %>%
  summarise(
    total_trips = sum(Trip_Count)
  ) %>% 
  left_join(amenities) %>% 
  gather(Variable, Value, -start.station.id, -total_trips) %>% 
  ggplot(aes(Value, total_trips)) +
  geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#FA7800") +
  facet_wrap(~Variable, ncol = 2, scales = "free") +
  labs(title = "Bike share demand as a function of amenity",
       subtitle = "Based on bike share demand data for Tuesday of Week 20 ",
       x = "Distance",
       y = "Trip Count",
       caption = "Figure 11") +
  plotTheme()

5 Predicting Bike Share Demand

Following our Exploratory Data Analysis (EDA) of the bike share data, we next move on to modeling the bike share demand at each station.

5.1 Building model

We start the model-building process by splitting the bike share data into training and test sets. We use the last three weeks of data for training and the first two weeks for testing, since the time lag variables introduce NA values at the start of the panel. Figure 12 below shows the time series of our bike share data with the training and testing divisions.

We consider five different models for predicting bike share demand, which at a high level capture the following aspects: (i) time, (ii) space, (iii) both time and space, (iv) time and space along with the previous day's demand, and (v) time and space along with demand at several past time lags. These models also include the weather features discussed in Section 3.3.

Note: We do not use the exposure features we experimented with in Section 4.4 as predictors, because they would be collinear with the bike station id; including them would make no difference to the model.
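
As a quick sanity check of this collinearity claim, the sketch below (using the training split ride.Train created just below) adds a station-constant covariate to a model that already has station fixed effects; lm() finds it aliased and returns an NA coefficient for it.

# A sketch: a station-constant covariate added to a model with station
# fixed effects is aliased, so lm() returns NA for its coefficient
demo.fit <- lm(Trip_Count ~ start.station.id + TransitStation.nn,
               data = st_drop_geometry(ride.Train))
tail(coef(demo.fit), 1)  # NA: TransitStation.nn is absorbed by station id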

# Splitting the data

ride.Train <- filter(ride.panel, week >= 20)
ride.Test <- filter(ride.panel, week < 20)

mondays <- 
  mutate(ride.panel,
         monday = ifelse(dotw == "Mon" & hour(interval60) == 1,
                         interval60, 0)) %>%
  filter(monday != 0) 

# memorial_day   <- as.POSIXct("2018-05-28 06:00:00 UTC")

# Plotting the trip time series 
st_drop_geometry(rbind(
  mutate(ride.Train, Legend = "Training"), 
  mutate(ride.Test, Legend = "Testing"))) %>%
    group_by(Legend, interval60) %>% 
      summarize(Trip_Count = sum(Trip_Count)) %>%
      ungroup() %>% 
      ggplot(aes(interval60, Trip_Count, colour = Legend)) + geom_line() +
        scale_colour_manual(values = palette2) +
        # geom_vline(xintercept = memorial_day, linetype = "dotted") +
        geom_vline(data = mondays, aes(xintercept = monday)) +
        labs(title="Rideshare trips by week: May-June",
             x="Day", y="Trip Count",
             caption = "Figure 12") +
        plotTheme() + theme(panel.grid.major = element_blank()) 

# Build models 

# Uncomment this to build the models in real time

# reg1 <- lm(Trip_Count ~  hour(interval60) + dotw + Temperature,  data=ride.Train)
# 
# reg2 <- lm(Trip_Count ~  start.station.id + dotw + Temperature,  data=ride.Train)
# 
# reg3 <- lm(Trip_Count ~  start.station.id + hour(interval60) + dotw + Temperature ,
#                          data=ride.Train)
# 
# reg4 <- lm(Trip_Count ~  start.station.id +  hour(interval60) + dotw + Temperature + Wind_Speed + Percipitation + 
#                          lag1day,
#                          data=ride.Train)
# 
# reg5 <- lm(Trip_Count ~  start.station.id +  hour(interval60) + dotw + Temperature + Wind_Speed + Percipitation + 
#                          lagHour + lag2Hours + lag3Hours + lag12Hours + lag1day,
#                          data=ride.Train)
# 
# 
# save(reg1, file = "reg1.RData")
# save(reg2, file = "reg2.RData")
# save(reg3, file = "reg3.RData")
# save(reg4, file = "reg4.RData")
# save(reg5, file = "reg5.RData")

# Load the saved models 

load("reg1.RData")
load("reg2.RData")
load("reg3.RData")
load("reg4.RData")
load("reg5.RData")

5.2 Evaluating Performance

After training each of the five models, we next look at how they perform on our held-out test set.

5.2.1 Visualizing Predictions

Figure 13 shows the predictions from each of our five models on the test set. As can be seen, the model with all the lag features does the best job of tracking demand (predictions are missing for the first day of the test period owing to the lag variables included in this model). However, even this best model still under-predicts demand during the two daily peaks/rush hours.

# Validate test set by time 

ride.Test.weekNest <- 
  as.data.frame(ride.Test) %>%
  nest(-week) 

# ride.Test.weekNest

model_pred <- function(dat, fit){
   pred <- predict(fit, newdata = dat)
}

# Get predictions for each week

week_predictions <- 
  ride.Test.weekNest %>% 
    mutate(A_Time_FE = map(.x = data, fit = reg1, .f = model_pred),
           B_Space_FE = map(.x = data, fit = reg2, .f = model_pred),
           C_Space_Time_FE = map(.x = data, fit = reg3, .f = model_pred),
           D_Space_Time_1_Day_Lag = map(.x = data, fit = reg4, .f = model_pred),
           E_Space_Time_All_Lags = map(.x = data, fit = reg5, .f = model_pred))

# week_predictions 

# Get errors

week_predictions <- week_predictions %>%  
    gather(Regression, Prediction, -data, -week) %>% 
    mutate(Observed = map(data, pull, Trip_Count),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
           sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))

# Plotting predictions

week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         start.station.id = map(data, pull, start.station.id)) %>%
  dplyr::select(interval60, start.station.id, Observed, Prediction, Regression) %>%
  unnest() %>%
  gather(Variable, Value, -Regression, -interval60, -start.station.id) %>%
    group_by(Regression, Variable, interval60) %>%
    summarize(Value = mean(Value)) %>%
    ggplot(aes(interval60, Value, colour=Variable)) + geom_line(size = 1.1) + 
      facet_wrap(~Regression, ncol=1) +
      scale_colour_manual(values = palette2) +
      labs(title = "Mean Predicted/Observed bike share by hourly interval", 
           x = "Hour", y= "Rideshare Trips",
           caption = "Figure 13") +
      plotTheme()

5.2.2 Test Errors Across Models

Figure 14 below reinforces the conclusion from the plot above, showing that the model with all lag variables yields the lowest mean absolute error (MAE) across the test weeks.

week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette5) +
    labs(title = "Mean Absolute Errors by model specification and week",
         caption = "Figure 14") +
  plotTheme()

5.2.3 Test Error by Space and Time

Next, we stick with our best model and analyze the errors it makes across space and time. From Figure 15, the model predicts demand well for stations in Kings and Queens Counties (typically low-demand areas, per Figure 6) but performs somewhat worse for stations in Manhattan (typically high demand). For a substantial number of these stations the error is between 2 and 5 bikes, so we could improve our re-balancing plan by carrying a buffer of up to 5 bikes per station per hour when targeting re-balancing in Manhattan.
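
As an illustration of this buffered plan, the sketch below combines a hypothetical table of hourly station predictions, station.pred, with the per-station MAE derived from error.byWeek (computed below), capping the safety buffer at 5 bikes.

# A sketch of the buffered target. `station.pred` is hypothetical (hourly
# predictions with columns start.station.id and Prediction); error.byWeek
# is computed below
station.mae <- error.byWeek %>%
  st_drop_geometry() %>%
  group_by(start.station.id) %>%
  summarize(MAE = mean(MAE))

rebalance.target <- station.pred %>%
  left_join(station.mae, by = "start.station.id") %>%
  mutate(buffer = pmin(ceiling(MAE), 5),          # cap the buffer at 5 bikes
         target_bikes = ceiling(Prediction) + buffer)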

# Validate test set by space

error.byWeek <-
  filter(week_predictions, Regression == "E_Space_Time_All_Lags") %>% 
  unnest() %>% st_sf() %>%
  dplyr::select(start.station.id, Absolute_Error, week, geometry) %>%
  gather(Variable, Value, -start.station.id, -week, -geometry) %>%
    group_by(Variable, start.station.id, week) %>%
    summarize(MAE = mean(Value, na.rm = TRUE))

error.byWeek %>% 
  mutate(MAE = case_when(MAE < 2 ~ "< 2 trip",
                         MAE >= 2 & MAE <= 5 ~ "2-5 trips",
                         # MAE >= 2 & MAE <= 5 ~ "2-5 trips",
                         MAE > 5 ~ "5+ trips")
         )%>% 
  ggplot() + 
  geom_sf(data = nycBoundary) + 
  geom_sf(aes(color = MAE)) + 
  scale_color_manual(values = palette3) + 
  facet_wrap(~week) + 
  labs(title = "Mean absolute error by bike station by week",
       caption = "Figure 15") +
  mapTheme()

Next, we look at how the model performs in predicting demand across different times of the day. Figure 16 below shows the model errors by hour of the day for the Monday of week 19. As can be seen, errors are typically low overnight but grow significantly during the rush hours.

error.byHour <-
  filter(week_predictions, Regression == "E_Space_Time_All_Lags") %>% 
  unnest() %>% 
  st_sf() %>%
  dplyr::select(start.station.id, Absolute_Error, geometry, interval60) %>%
  gather(Variable, Value, -interval60, -start.station.id, -geometry) %>%
  filter(wday(interval60, label = TRUE) == "Mon" & week(interval60) == 19) %>%
  group_by(hour = hour(interval60), start.station.id) %>%
  summarize(MAE = mean(Value, na.rm=TRUE)) 


error.byHour %>% 
  mutate(MAE = case_when(MAE < 2 ~ "< 2 trip",
                         MAE >= 2 & MAE <= 5 ~ "2-5 trips",
                         # MAE >= 2 & MAE <= 5 ~ "2-5 trips",
                         MAE > 5 ~ "5+ trips")
         )%>% 
  ggplot() + 
  geom_sf(data = nycBoundary) + 
  geom_sf(aes(color = MAE)) + 
  scale_color_manual(values = palette3) + 
  facet_wrap(~hour, ncol = 8) + 
  labs(title = "Mean absolute error by bike station by hour",
       subtitle = "For Monday of week 19",
       caption = "Figure 16") +
  mapTheme()

Figure 17 next shows how the predictions compare to the observed demand across weekday/weekend and the different time periods. The key takeaway from this plot is that the model under-predicts demand in almost all scenarios, especially during the AM and PM rush hours on weekdays.

# Difference between observed and predicted
week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start.station.id = map(data, pull, start.station.id), 
           dotw = map(data, pull, dotw)) %>%
    select(interval60, start.station.id,  Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "E_Space_Time_All_Lags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  ggplot()+
  geom_point(aes(x= Observed, y = Prediction))+
    geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "red")+
    geom_abline(slope = 1, intercept = 0)+
  facet_grid(time_of_day~weekend)+
  labs(title="Observed vs Predicted",
       x="Observed trips", 
       y="Predicted trips",
       caption = "Figure 17")+
  plotTheme()

Figure 18 adds one more dimension to Figure 17, indicating that these errors are concentrated mostly in the Manhattan region.

# Across space/time 
week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start.station.id = map(data, pull, start.station.id), 
           dotw = map(data, pull, dotw) ) %>%
    select(interval60, start.station.id, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "E_Space_Time_All_Lags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(start.station.id, weekend, time_of_day) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  left_join(bikeStations) %>% 
  st_sf() %>% 
  ggplot(.)+
  geom_sf(data = nycCensus, color = "grey", fill = "transparent")+
  geom_sf(aes(color = MAE), fill = "transparent", size = 0.5, alpha = 0.6)+
  scale_colour_viridis(direction = -1, discrete = FALSE, option = "D")+
  facet_grid(weekend~time_of_day)+
  labs(title="Mean Absolute Errors, Test Set",
       caption = "Figure 18")+
  mapTheme()

5.2.4 LOGO CV

Finally, we perform leave-one-group-out cross-validation (LOGO-CV) across hours of the day to investigate whether our model generalizes across time. Figure 19 below shows the distribution of the errors, and it can be seen that the model fails to generalize fully: the hours where the MAE exceeds 1 correspond to hours 7 and 8 (AM rush) and hour 17 (PM rush), again reinforcing the model's difficulty in capturing peak-hour demand.

crossValidate <- function(dataset, id) {

allPredictions <- data.frame()
cvID_list <- unique(dataset[[id]])

for (i in cvID_list) {

  thisFold <- i
  cat("This hold out fold is", thisFold, "\n")

  fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() 
  
  fold.test  <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame()
  
  regression <-
    lm(Trip_Count ~  start.station.id +  hr + dotw + Temperature + Wind_Speed + Percipitation +
                         lagHour + lag2Hours + lag3Hours + lag12Hours + lag1day, 
                         data=fold.train)
  
  thisPrediction <- 
    mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
    
  allPredictions <-
    rbind(allPredictions, thisPrediction)
    
  }
  return(st_sf(allPredictions))
}

# reg.spatialCV <- crossValidate(dataset = ride.panel %>% mutate(hr = hour(interval60)), id = "hr")

# save(reg.spatialCV, file = "regSpatialCV.RData")
load("regSpatialCV.RData")


reg.spatialCV <- reg.spatialCV %>% mutate(Error = Prediction - Trip_Count)

# reg.spatialCV

error_by_reg_and_fold <-
  reg.spatialCV %>%
    mutate(hr = hour(interval60)) %>%
    group_by(hr) %>%
    summarize(Mean_Error = mean(Prediction - Trip_Count, na.rm = T),
              MAE = mean(abs(Mean_Error), na.rm = T),
              SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>%
  ungroup()

# error_by_reg_and_fold


error_by_reg_and_fold %>%
  ggplot(aes(MAE)) +
    geom_histogram(bins = 30, colour="black", fill = "#FDE725FF") +
    labs(title="Distribution of MAE", subtitle = "LOGO-CV performed across hour of the day",
         x="Mean Absolute Error", y="Count",
         caption = "Figure 19") +
    plotTheme()

6 Conclusion

We believe our model performs reasonably well in predicting bike demand. From the analysis above, we found that the model mostly under-predicts demand in the Manhattan area, especially during rush hours, which are precisely the peak place and time for bike demand. This should not deter us from using the model in the re-balancing plan. From the error analysis, we know that at most of these Manhattan stations during rush hours we under-predict by 2-5 trips on average. We could therefore load a buffer of 2-5 bikes, in addition to the predicted demand, onto the trikes to satisfy customer demand, since trikes are a more convenient and quicker mode of transporting bikes than trucks during peak hours of demand and vehicular traffic. An alternative is to set up corrals at the bike stations inside Manhattan, allowing stations to store more bikes than usual. In that case, it would be a good idea to stock an additional 10-15 bikes at these stations overnight, so that we do not need to maintain an hourly buffer transported by trikes or trucks, thus reducing transportation costs. For overnight transportation we can use trucks, since demand and vehicular traffic are lowest then and trucks can carry a bigger load of bikes than trikes.
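
As a final sketch, the overnight corral-stocking rule described above could be computed per station; am.forecast and its columns are hypothetical placeholders for a next-morning forecast.

# A sketch of the overnight stocking rule; `am.forecast` is a hypothetical
# table with start.station.id, predicted_am_demand and an in_manhattan flag
stocking.plan <- am.forecast %>%
  mutate(corral_buffer = ifelse(in_manhattan, 12, 0),  # ~10-15 extra bikes
         overnight_stock = ceiling(predicted_am_demand) + corral_buffer)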

At the same time, this is a temporary solution, and we need to capture the factors that could correct the model's under-prediction. For example, Upper Manhattan is an uphill area, so people tend to rent bikes from its stations rather than return them, which raises the observed rental demand there.

Our model would also fail when the unprecedented hits. The current pandemic is the best example: ridership could surge dramatically after officials urge the public to avoid public transportation, and then plunge just as dramatically after the Governor issues a stay-at-home order. Such cases show how vulnerable our model, and any re-balancing plan built on it, would be.