Introduction

This is the report of the capstone project for my Google Data Analytics Professional Certificate program. I am using R programming language and RStudio Desktop. Note that the free version of RStudio Cloud cannot handle the amount of data needed for this project.

Scenario

I am a junior data analyst working in the marketing team of Cyclistic, a bike-share company in Chicago. Note: This is a fictional name, but the company is real and is called Divvy https://divvybikes.com. The director of marketing believes that the company’s future success depends on maximizing the number of annual memberships. My team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, my team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve my recommendations, so my recommendations must be backed up with compelling data insights and professional data visualizations.

Ask

Three questions will guide the future marketing program:

  1. How do annual members and casual riders use bikes differently?
  2. Why would casual riders buy annual membership?
  3. How can we use digital media to influence casual riders to become annual members?

The director of marketing and my manager, Lily Moreno, has assigned me the first question to answer.

Prepare

I use Cyclistic’s monthly trip data https://divvy-tripdata.s3.amazonaws.com/index.html. According to Divvy https://divvybikes.com/system-data, the data has been processed to remove trips that are taken by staff as they service and inspect the system, and any trips that were below 60 seconds in length (potentially false starts or users trying to re-dock a bike to ensure it was secure).
To see the effect of all seasons on rides, 12 months of data is used. To make it easier to understand the seasonality, the period from January to December 2024 is used.
The data was downloaded to the RStudio work directory on my computer. To identify the work directory, I used the getwd() command. All trip data is in comma-delimited (.CSV) format.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(patchwork)
library(sf)
## Linking to GEOS 3.13.0, GDAL 3.10.1, PROJ 9.5.1; sf_use_s2() is TRUE
library(leaflet)
library(purrr)
library(viridis)
## Loading required package: viridisLite
jan24 <- read_csv("202401-divvy-tripdata.csv") # I made sure that these files were in my local RStudio work directory
## Rows: 144873 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
feb24 <- read_csv("202402-divvy-tripdata.csv")
## Rows: 223164 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
mar24 <- read_csv("202403-divvy-tripdata.csv")
## Rows: 301687 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
apr24 <- read_csv("202404-divvy-tripdata.csv")
## Rows: 415025 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
may24 <- read_csv("202405-divvy-tripdata.csv")
## Rows: 609493 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jun24 <- read_csv("202406-divvy-tripdata.csv")
## Rows: 710721 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jul24 <- read_csv("202407-divvy-tripdata.csv")
## Rows: 748962 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
aug24 <- read_csv("202408-divvy-tripdata.csv")
## Rows: 755639 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sep24 <- read_csv("202409-divvy-tripdata.csv")
## Rows: 821276 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
oct24 <- read_csv("202410-divvy-tripdata.csv")
## Rows: 616281 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nov24 <- read_csv("202411-divvy-tripdata.csv")
## Rows: 335075 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dec24 <- read_csv("202412-divvy-tripdata.csv")
## Rows: 178372 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

I Checked in my RStudio Environment pane to make sure these files were actually uploaded. Then, I vertically merged these files into tripdata table.

tripdata <- bind_rows(jan24, feb24, mar24, apr24, may24, jun24, jul24, aug24, sep24, oct24, nov24, dec24)

Process

Clean and Prepare data for analysis.

colnames(tripdata)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
str(tripdata)
## spc_tbl_ [5,860,568 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:5860568] "C1D650626C8C899A" "EECD38BDB25BFCB0" "F4A9CE78061F17F7" "0A0D9E15EE50B171" ...
##  $ rideable_type     : chr [1:5860568] "electric_bike" "electric_bike" "electric_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:5860568], format: "2024-01-12 15:30:27" "2024-01-08 15:45:46" ...
##  $ ended_at          : POSIXct[1:5860568], format: "2024-01-12 15:37:59" "2024-01-08 15:52:59" ...
##  $ start_station_name: chr [1:5860568] "Wells St & Elm St" "Wells St & Elm St" "Wells St & Elm St" "Wells St & Randolph St" ...
##  $ start_station_id  : chr [1:5860568] "KA1504000135" "KA1504000135" "KA1504000135" "TA1305000030" ...
##  $ end_station_name  : chr [1:5860568] "Kingsbury St & Kinzie St" "Kingsbury St & Kinzie St" "Kingsbury St & Kinzie St" "Larrabee St & Webster Ave" ...
##  $ end_station_id    : chr [1:5860568] "KA1503000043" "KA1503000043" "KA1503000043" "13193" ...
##  $ start_lat         : num [1:5860568] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:5860568] -87.6 -87.6 -87.6 -87.6 -87.7 ...
##  $ end_lat           : num [1:5860568] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:5860568] -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ member_casual     : chr [1:5860568] "member" "member" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(tripdata)
##    ride_id          rideable_type        started_at                    
##  Length:5860568     Length:5860568     Min.   :2024-01-01 00:00:39.00  
##  Class :character   Class :character   1st Qu.:2024-05-20 19:47:53.00  
##  Mode  :character   Mode  :character   Median :2024-07-22 20:36:16.27  
##                                        Mean   :2024-07-17 07:55:47.61  
##                                        3rd Qu.:2024-09-17 20:14:22.56  
##                                        Max.   :2024-12-31 23:56:49.84  
##                                                                        
##     ended_at                      start_station_name start_station_id  
##  Min.   :2024-01-01 00:04:20.00   Length:5860568     Length:5860568    
##  1st Qu.:2024-05-20 20:07:54.75   Class :character   Class :character  
##  Median :2024-07-22 20:53:59.16   Mode  :character   Mode  :character  
##  Mean   :2024-07-17 08:13:06.54                                        
##  3rd Qu.:2024-09-17 20:27:46.02                                        
##  Max.   :2024-12-31 23:59:55.70                                        
##                                                                        
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:5860568     Length:5860568     Min.   :41.64   Min.   :-87.91  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.65  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :42.07   Max.   :-87.52  
##                                                                        
##     end_lat         end_lng        member_casual     
##  Min.   :16.06   Min.   :-144.05   Length:5860568    
##  1st Qu.:41.88   1st Qu.: -87.66   Class :character  
##  Median :41.90   Median : -87.64   Mode  :character  
##  Mean   :41.90   Mean   : -87.65                     
##  3rd Qu.:41.93   3rd Qu.: -87.63                     
##  Max.   :87.96   Max.   : 152.53                     
##  NA's   :7232    NA's   :7232
head(tripdata)
## # A tibble: 6 × 13
##   ride_id          rideable_type started_at          ended_at           
##   <chr>            <chr>         <dttm>              <dttm>             
## 1 C1D650626C8C899A electric_bike 2024-01-12 15:30:27 2024-01-12 15:37:59
## 2 EECD38BDB25BFCB0 electric_bike 2024-01-08 15:45:46 2024-01-08 15:52:59
## 3 F4A9CE78061F17F7 electric_bike 2024-01-27 12:27:19 2024-01-27 12:35:19
## 4 0A0D9E15EE50B171 classic_bike  2024-01-29 16:26:17 2024-01-29 16:56:06
## 5 33FFC9805E3EFF9A classic_bike  2024-01-31 05:43:23 2024-01-31 06:09:35
## 6 C96080812CD285C5 classic_bike  2024-01-07 11:21:24 2024-01-07 11:30:03
## # ℹ 9 more variables: start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>
colSums(is.na(tripdata))
##            ride_id      rideable_type         started_at           ended_at 
##                  0                  0                  0                  0 
## start_station_name   start_station_id   end_station_name     end_station_id 
##            1073951            1073951            1104653            1104653 
##          start_lat          start_lng            end_lat            end_lng 
##                  0                  0               7232               7232 
##      member_casual 
##                  0

There are 13 columns (variables) and over 5.8 million rows (rides). The Min and Max of end_lat and end_lng are far from Chicago, probably due to signal drift, station mislabeling, or technical glitches. Also, approximately 7000 end_lat and end_lng are null (NA). In addition, over 1 million station names and ids are null (NA) which are likely due to dockless stations or any other place the riders abandoned their bike.

To see if there are any ride_id duplicates:

sum(duplicated(tripdata$ride_id))
## [1] 211

There are 211 ride_id duplicate rows.

tripdata_unique <- tripdata[!duplicated(tripdata$ride_id), ]

Next, I filtered out the coordinates that were outside Chicago in order to prevent map distortion.

tripdata_clean_lat_lng <- tripdata_unique %>%
  filter(
    between(start_lat, 41.6, 42.1),
    between(start_lng, -88.0, -87.5),
    between(end_lat, 41.6, 42.1),
    between(end_lng, -88.0, -87.5)
  )

For later analysis, I calculated the ride_lengths,

# This ensures consistency with BigQuery's TIMESTAMP_DIFF(..., MINUTE), which truncates fractional minutes.
tripdata_clean_lat_lng <- tripdata_clean_lat_lng %>% 
  mutate(ride_length = floor(as.numeric(difftime(ended_at, started_at, units = "mins"))))

and summarized the ride_lengths:

summary(tripdata_clean_lat_lng$ride_length)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2749.00     5.00     9.00    14.99    17.00  1509.00

I saw that the min was negative, which is impossible. Therefore, I filtered out observations with negative ride_length as well as any observations with ride_length < 1 minute (potentially false starts or users trying to re-dock a bike to ensure it was secure).

tripdata_clean <- tripdata_clean_lat_lng %>%
  filter(ride_length >= 1) 

Analyze

Compare annual members and casual riders

tripdata_clean %>% 
  group_by(member_casual) %>% 
  summarise(
    ride_count = n(),
    ride_percentage = round((n() / nrow(tripdata_clean)) * 100, 2)
  )
## # A tibble: 2 × 3
##   member_casual ride_count ride_percentage
##   <chr>              <int>           <dbl>
## 1 casual           2080278            36.4
## 2 member           3641227            63.6

Members take nearly twice as many rides as casuals.

tripdata_clean %>%
  group_by(member_casual) %>%
  summarise(
    average_ride_length = round(mean(ride_length), 2),
    median_length = round(median(ride_length), 2),
    max_ride_length = round(max(ride_length), 2),
    min_ride_length = round(min(ride_length), 2)
  )
## # A tibble: 2 × 5
##   member_casual average_ride_length median_length max_ride_length
##   <chr>                       <dbl>         <dbl>           <dbl>
## 1 casual                       21.3            12            1509
## 2 member                       12.0             8            1499
## # ℹ 1 more variable: min_ride_length <dbl>

Casual riders take 1.5 to 2 times longer rides than annual members on average.

All riders are charged an extra fee for each minute the ride is over 3 hours. So, it is informative to analyze rides with ride_length <= 180 minutes and ride_length > 180 minutes.

filtered_data <- tripdata_clean %>%
  filter(ride_length <= 180)

ggplot(filtered_data, aes(x = ride_length, fill = member_casual)) +
  geom_histogram(binwidth = 2) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE) ) +
  labs(title = "Distribution of Ride Lengths (≤ 180 mins)",
       x = "Ride Length (minutes)",
       y = "Count") +
  facet_wrap(~member_casual, ncol = 2) +
  theme(legend.position = "none") 

Casual riders tend to take longer rides than members, with a broader spread and a heavier tail.

tripdata_clean %>%
  filter(ride_length > 180) %>%
  group_by(member_casual) %>%
  summarise(
    long_ride_count = n(),
    percentage = round((n() / nrow(tripdata_clean)) * 100, 2)
  )
## # A tibble: 2 × 3
##   member_casual long_ride_count percentage
##   <chr>                   <int>      <dbl>
## 1 casual                   9850       0.17
## 2 member                   4220       0.07

Casual riders are more than twice as likely as members to take ultra-long rides. While both percentages are small, the absolute number of casual long rides is significant: almost 10,000 instances.

Seasonality?

tripdata_clean <- tripdata_clean %>%
  mutate(
    month = format(as.Date(started_at), "%B"),
    month = factor(month, levels = c(
      "January", "February", "March", "April", "May", "June",
      "July", "August", "September", "October", "November", "December"
    ), ordered = TRUE)
  )
seasonality_summary <- tripdata_clean %>%
  group_by(member_casual, month) %>%
  summarise(
    number_of_rides = n(),
    average_ride_length = round(mean(ride_length),2),
    .groups = "drop"
  ) %>%
  arrange(member_casual, month)
# Plot 1
p1 <- seasonality_summary %>%
  ggplot(aes(x = month, y = number_of_rides, fill = member_casual)) +
  geom_col(width = 0.5, position = position_dodge(width = 0.5)) +
  labs(title = "Number of Rides", x = "Month", y = "Number of Rides") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

# Plot 2
p2 <- seasonality_summary %>%
  ggplot(aes(x = month, y = average_ride_length, fill = member_casual)) +
  geom_col(width = 0.5, position = position_dodge(width = 0.5)) +
  labs(title = "Average Ride Lengths", x = "Month", y = "Average Ride Length (minutes)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Combine them
combined_plot <- p1 / p2 +
  plot_annotation(
    title = "Seasonality Analysis: Number of Rides and Average Ride Lengths",
    theme = theme(plot.title = element_text(size = 14, face = "bold", hjust = 0.5))
  )
# Show the combined plot
combined_plot

Both user types bike more in the summer. Members ride more than casual riders in every month. Average ride lengths are longer in the summer, especially for casual riders. Casual riders have longer average ride lengths than members in every month.

Day of week effect?

tripdata_clean <- tripdata_clean %>% 
  mutate(
    day_of_week = format(as.Date(started_at), "%A"),
    day_of_week = factor(day_of_week, 
                         levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"),
                         ordered = TRUE)
  ) 
day_summary <- tripdata_clean %>%
  group_by(member_casual, day_of_week) %>%  
  summarise(
    number_of_rides = n(), 
    average_ride_length = round(mean(ride_length),2),
    .groups = "drop"
  ) %>%
  arrange(member_casual, day_of_week)
p1 <- day_summary %>%
  ggplot(aes(x = day_of_week, y = number_of_rides, fill = member_casual)) +
    geom_col(width=0.5, position = position_dodge(width=0.5)) +
    labs(title ="Total Rides", x = "Day of the Week", y = "Number of Rides") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

p2 <- day_summary %>%
  ggplot(aes(x = day_of_week, y = average_ride_length, fill = member_casual)) +
  geom_col(width = 0.5, position = position_dodge(width = 0.5)) +
  labs(title = "Average Ride Lengths", x = "Day of the Week", y = "Average Ride Length (minutes)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

combined_plot <- p1 / p2 +
  plot_annotation(
    title = "Day-of-week Analysis: Number of Rides and Average Ride Lengths",
    theme = theme(plot.title = element_text(size = 14, face = "bold", hjust = 0.5))
  )
# Show the combined plot
combined_plot 

Members consistently take more rides than casuals every day of the week. Weekdays (Mon–Fri) show a steady, high volume of member rides — consistent with commuting behavior. Casual ridership peaks on weekends, suggesting leisure or recreation.

Casual riders consistently take longer rides than members every day of the week. Casual ride lengths peak on Sunday and Saturday. Member ride lengths are shorter and relatively stable across the week.

Hour of day effect?

tripdata_clean <- tripdata_clean %>% 
  mutate(start_hour = as.numeric(strftime(started_at, "%H")))

hour_summary <- tripdata_clean %>%
  group_by(member_casual, start_hour) %>%  
  summarise(
    number_of_rides = n(), 
    average_ride_length = round(mean(ride_length),2),
    .groups = "drop"
  ) %>%
  arrange(member_casual, start_hour)
p1 <- ggplot(hour_summary, aes(x = start_hour, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  scale_x_continuous(breaks = 0:23) +
  labs(title = "Total Rides", x = "Hour of the Day", y = "Number of Rides") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

p2 <- ggplot(hour_summary, aes(x = start_hour, y = average_ride_length, fill = member_casual)) +
  geom_col(position = "dodge") +
  scale_x_continuous(breaks = 0:23) +
  labs(title = "Average Ride Lengths", x = "Hour of the Day", y = "Avg Ride Length (min)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

combined_plot <- p1 / p2 +
  plot_annotation(
    title = "Hour-of-day Analysis: Number of Rides and Average Ride Lengths",
    theme = theme(plot.title = element_text(size = 14, face = "bold", hjust = 0.5))
  )

combined_plot

Demand of members for bikes is higher than casuals during every hour of the day. For both groups the peak demand occurs just after noon, but members have another lower peak during early mornings, showing a commuter-like pattern, whereas casuals’ demand distribution suggest a leisure/tourism behavior.
The average ride length is steady for members throughout the day suggesting a commuter pattern, but casuals’ average ride length peaks in the morning suggesting a lesiure pattern. Casual ride lengths are longer than members during every hour of the day.

ggplot(tripdata_clean, aes(x = start_hour, fill = member_casual)) +
  geom_bar(position = "dodge") +
  facet_wrap(~ day_of_week) +
  labs(
    title = "Bike Demand by Hour and Day of Week",
    x = "Hour of the Day",
    y = "Number of Rides"
  ) +
  scale_x_continuous(breaks = 0:23) +
  theme(plot.title = element_text(size = 14, face = "bold", hjust = 0.5))

The demand by hour for each day of the week clearly shows that members do have commuter behavior during the week days but in the weekend they behave leasurely, like casuals.

Top start stations?

top_member <- tripdata_clean %>%
  filter(member_casual == "member", !is.na(start_station_name)) %>%
  group_by(start_station_name) %>%
  summarise(start_count = n(), .groups = "drop") %>%
  slice_max(start_count, n = 10)

top_member_coords <- tripdata_clean %>%
  filter(member_casual == "member", start_station_name %in% top_member$start_station_name) %>%
  group_by(start_station_name) %>%
  summarise(
    start_lat = mean(start_lat, na.rm = TRUE),
    start_lng = mean(start_lng, na.rm = TRUE),
    .groups = "drop"
  )

top_member <- left_join(top_member, top_member_coords, by = "start_station_name")
m1 <- leaflet(data = top_member) %>%    # interactive map
  addProviderTiles("OpenStreetMap.Mapnik") %>%
  setView(lng = -87.63, lat = 41.85, zoom = 11) %>%  # Sets the **initial center and zoom level** of the map view.
  addCircles(
    lng = ~start_lng,
    lat = ~start_lat,
    radius = ~sqrt(start_count) * 2,  
    color = "#1f78b4",
    stroke = FALSE,   # no outline
    fillOpacity = 0.6,
    label = ~paste0(start_station_name, ": ", start_count, " rides")
  ) %>%
  addControl("<strong>Top 10 Start Stations for Members</strong>", position = "topright")
m1

Most of the top 10 starting stations for members are concentrated in downtown Chicago, with additional clusters near residential areas on the North and South Sides.

top_casual <- tripdata_clean %>%
  filter(member_casual == "casual", !is.na(start_station_name)) %>%
  group_by(start_station_name) %>%
  summarise(start_count = n(), .groups = "drop") %>%
  slice_max(start_count, n = 10)

top_casual_coords <- tripdata_clean %>%
  filter(member_casual == "casual", start_station_name %in% top_casual$start_station_name) %>%
  group_by(start_station_name) %>%
  summarise(
    start_lat = mean(start_lat, na.rm = TRUE),
    start_lng = mean(start_lng, na.rm = TRUE),
    .groups = "drop"
  )

top_casual <- left_join(top_casual, top_casual_coords, by = "start_station_name")

m2 <- leaflet(data = top_casual) %>%
  addProviderTiles("OpenStreetMap.Mapnik") %>%
  setView(lng = -87.63, lat = 41.89, zoom = 12) %>%
  addCircles(
    lng = ~start_lng,
    lat = ~start_lat,
    radius = ~sqrt(start_count)* 2,  # adjust multiplier to your data
    color = "#e31a1c",
    stroke = FALSE,
    fillOpacity = 0.6,
    label = ~paste0(start_station_name, ": ", start_count, " rides")
  ) %>%
  addControl("<strong>Top 10 Start Stations for Casuals</strong>", position = "topright")
m2

The top 10 starting stations for casual users are heavily clustered along the lakefront and near downtown tourist areas, reflecting strong recreational and sightseeing usage.

Top routes for members

# Step 1: Identify top 10 member routes by name
top_member_routes <- tripdata_clean %>%
  filter(member_casual == "member", !is.na(start_station_name), !is.na(end_station_name)) %>%
  group_by(start_station_name, end_station_name) %>%
  summarise(route_count = n(), .groups = "drop") %>%
  slice_max(route_count, n = 10)

# Step 2: Get average coordinates for start and end stations
station_coords <- tripdata_clean %>%
  filter(!is.na(start_station_name), !is.na(end_station_name)) %>%
  group_by(start_station_name, end_station_name) %>%
  summarise(
    start_lat = mean(start_lat, na.rm = TRUE),
    start_lng = mean(start_lng, na.rm = TRUE),
    end_lat   = mean(end_lat, na.rm = TRUE),
    end_lng   = mean(end_lng, na.rm = TRUE),
    .groups = "drop"
  )

# Step 3: Merge coordinates with top routes and jitter self-loops
top_member_routes <- left_join(top_member_routes, station_coords,
                               by = c("start_station_name", "end_station_name")) %>%
  mutate(
    is_self_loop = start_station_name == end_station_name,
    end_lat_jittered = ifelse(is_self_loop, start_lat + runif(n(), 0.002, 0.004), end_lat),
    end_lng_jittered = ifelse(is_self_loop, start_lng + runif(n(), 0.002, 0.004), end_lng)
  )

# Step 4: Create LINESTRING geometry 
member_sf <- st_sf(
  top_member_routes,
  geometry = st_sfc(
    pmap(
      list(top_member_routes$start_lng, top_member_routes$start_lat,
           top_member_routes$end_lng_jittered, top_member_routes$end_lat_jittered),
      ~ st_linestring(matrix(c(..1, ..2, ..3, ..4), ncol = 2, byrow = TRUE))
    ),
    crs = 4326
  )
)

# Step 5: Leaflet map 
leaflet(member_sf) %>%
  addProviderTiles("OpenStreetMap.Mapnik") %>%
  addPolylines(
    color = "#1f78b4",  # standard blue for members
    weight = 6,
    opacity = 1,
    label = ~paste0(start_station_name, " → ", end_station_name, " (", route_count, " rides)")
  ) %>%
  setView(lng = -87.63, lat = 41.84, zoom = 12) %>%
  addControl("<strong>Top 10 Routes for Member (with self-loops jittered)</strong>", position = "topright")

The top 10 routes by members are primarily concentrated in the Hyde Park and South Shore areas, with a couple of high-traffic segments extending into the West Side.

### Top routes for casuals

# Get top 10 routes for casuals
top_casual_routes <- tripdata_clean %>%
  filter(member_casual == "casual", !is.na(start_station_name), !is.na(end_station_name)) %>%
  group_by(start_station_name, end_station_name) %>%
  summarise(route_count = n(), .groups = "drop") %>%
  slice_max(route_count, n = 10)

# Get average coordinates for each start and end station
casual_coords <- tripdata_clean %>%
  filter(!is.na(start_station_name), !is.na(end_station_name)) %>%
  group_by(start_station_name, end_station_name) %>%
  summarise(
    start_lat = mean(start_lat, na.rm = TRUE),
    start_lng = mean(start_lng, na.rm = TRUE),
    end_lat   = mean(end_lat, na.rm = TRUE),
    end_lng   = mean(end_lng, na.rm = TRUE),
    .groups = "drop"
  )

# Merge coordinates with top routes
top_casual_routes <- left_join(top_casual_routes, casual_coords,
                               by = c("start_station_name", "end_station_name"))

# Jitter self-loop coordinates
top_casual_routes <- top_casual_routes %>%
  mutate(
    is_self_loop = start_station_name == end_station_name,
    end_lat_jittered = ifelse(is_self_loop, start_lat + runif(n(), 0.002, 0.004), end_lat),
    end_lng_jittered = ifelse(is_self_loop, start_lng + runif(n(), 0.002, 0.004), end_lng)
  )

# Create sf object
casual_sf <- st_sf(
  top_casual_routes,
  geometry = st_sfc(
    pmap(
      list(top_casual_routes$start_lng, top_casual_routes$start_lat,
           top_casual_routes$end_lng_jittered, top_casual_routes$end_lat_jittered),
      ~ st_linestring(matrix(c(..1, ..2, ..3, ..4), ncol = 2, byrow = TRUE))
    ),
    crs = 4326
  )
)

# Draw leaflet map
leaflet(casual_sf) %>%
  addProviderTiles("OpenStreetMap.Mapnik") %>%
  addPolylines(
    color = "#e31a1c",  # standard red for casuals
    weight = 6,
    opacity = 1,
    label = ~paste0(start_station_name, " → ", end_station_name, " (", route_count, " rides)")
  ) %>%
  setView(lng = -87.63, lat = 41.91, zoom = 12) %>%
  addControl("<strong>Top 10 Routes for Casuals (with self-loops jittered)</strong>", position = "topright")

The top 10 routes by casual users are concentrated along Chicago’s lakefront and Museum Campus, particularly around Millennium Park, Navy Pier, and the Shedd Aquarium—popular tourist destinations.

Share

This phase will be done by presentation, but here we use R Markdown Notebook to share.

Main Insights and Conclusions

  1. User Type Strongly Influences Ride Behavior Members tend to take shorter, more frequent trips distributed across a wider geographic area, consistent with commuting, errands, or utilitarian use. Casual users typically take longer rides concentrated near tourist attractions and the lakefront, suggesting primarily recreational or sightseeing behavior.

  2. Spatial Mapping Deepens Understanding Spatial visualizations of start stations and top routes clearly differentiate the behavior of user types. Members dominate high-frequency routes in Hyde Park, the South Side, and the West Loop, aligning with local and routine use. Casual users are highly clustered along Chicago’s lakefront, including Millennium Park, Shedd Aquarium, and Navy Pier, highlighting a focus on leisure and sightseeing.

Act

Top 3 Marketing Actions to Convert Casual Users to annual Members

  1. Deploy Membership Promos in High-Casual Zones Casual riders are heavily concentrated near lakefront and downtown attractions such as Millennium Park, Navy Pier, and DuSable Harbor. Action: Deploy location-triggered promos (e.g., via QR codes at docking stations or geo-push notifications) offering trial memberships, discounted monthly rates, or priority bike access in these high-traffic casual zones.

  2. Educate Casual Riders on Membership Value Many casual riders take longer rides and self-loops, which may indicate a lack of awareness about cost savings with membership. Action: Use in-app nudges, ride-end receipts, or email follow-ups to showcase savings from joining. Target repeat casual users or those with rides over 20–30 minutes with messaging like: “You could’ve saved $X on this ride with membership.” Include comparisons to encourage conversion (“You’ve taken 4 rides this week—members ride free for the first 45 minutes!”).

  3. Run Weekend-Focused Membership Campaigns Casual usage peaks on weekends and afternoons, aligning with recreational patterns. Action: Run weekend-limited conversion campaigns: “Join today—ride free this weekend!” or “$1 Membership Trial—This Weekend Only.” Combine with event partnerships (e.g., Taste of Chicago, Air & Water Show) to offer bundled perks (e.g., ride credits with festival tickets or museum entries).