3  Results

Code
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
Code
library(ggplot2)
library(stringr)

data <- read.csv("collisions.csv")

# converting the empty char cells to NA
data <-  mutate_all(data, ~ifelse(is.character(.x) & nchar(.x) == 0, NA, .x))

3.1 When is it safe to drive or walk (basically get home alive)?

We have a heatmap to represent the frequency of crashes by day of the week and hour of the day. This reveals the patterns in crash occurrence, suggesting potential countermeasures like increased enforcement or public awareness campaigns during high-risk periods.

Code
library(lubridate)

Attaching package: 'lubridate'
The following objects are masked from 'package:base':

    date, intersect, setdiff, union
Code
# convert CRASH.DATE column from char to date
data$CRASH.DATE <- as.Date(data$CRASH.DATE, format = "%m/%d/%Y")

data2 <- data[c("CRASH.DATE", "CRASH.TIME")]

# convert CRASH.TIME column from char to POSIXct time
data2$CRASH.TIME <- as.POSIXct(data$CRASH.TIME, format = "%H:%M")

# Extract day of the week and hour of the day
data2 <- data2 %>%
  mutate(DayOfWeek = weekdays(CRASH.DATE),
         HourOfDay = hour(CRASH.TIME))

# Convert DayOfWeek to factor 
data2$DayOfWeek <- factor(data2$DayOfWeek, levels = rev(c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")))

# group by DayOfWeek and HourOfDay to represent
summarized_data2 <- data2 %>%
  group_by(DayOfWeek, HourOfDay) %>%
  summarise(Frequency = n())
`summarise()` has grouped output by 'DayOfWeek'. You can override using the
`.groups` argument.
Code
# summarized_data2

# Create a heatmap
ggplot(summarized_data2, aes(y = DayOfWeek, x = HourOfDay, fill= Frequency )) +
  geom_tile(stat = "identity") +
  scale_fill_viridis_c(trans = "reverse") +  
  labs(title = "Heatmap for DayOfWeek and HourOfDay",
       x = "Hour of the Day",
       y = "Day of the Week",
       fill = "Frequency") +
  theme_minimal()

From the heatmap above shows that the frequency of collisions has been fairly low after 1AM till 5AM, but as the day goes on, the numbers of collisions increase and it is maximum around 4PM-5PM.

Another insight from the above visualization is that there is a clear difference in frequency of collisions on Saturday and Sunday, compared to the other days of the week. We can notice that the frequency of collisions on these two days is comparatively less than the other days.

3.2 Should you buy a Sedan?

Code
#taking top 30 most frequent vehicles involved in collisions
vehicles <- sort(table(data$VEHICLE.TYPE.CODE.1),decreasing = TRUE)[seq(1,30)]

# Convert the vehicles vector to a dataframe
data31 <- data.frame(Values = vehicles)
colnames(data31) <- c("vehicle", "frequency")
# data31

# Create a Cleveland dot plot
ggplot(data31, aes(x = vehicle, y = frequency)) +
  geom_point(position = "jitter", color = "blue", size = 2) +
  labs(title = "Cleveland Dot Plot with Top 30 Most Frequently Involved Vehicles", 
       x = "Vehicles", 
       y = "Number of Cases") +
  theme_minimal() +
  scale_y_continuous(labels = scales::comma) +
  coord_flip()

The bar graph shows that the type of vehicle that has been most frequently involved in the collision cases has been Sedan with over 500K reported cases in the last 11 years, followed by Station Wagon/Sport Utility Vehicle, Passenger Vehicle and Taxi.

3.3 What causes Collisions to happen?

Code
#taking top 20 most frequent contributing factors in collisions
factors <- sort(table(data$CONTRIBUTING.FACTOR.VEHICLE.1), decreasing = TRUE)[seq(1,20)]

# Convert the factors vector to a dataframe 
data32 <- data.frame(Values = factors)
colnames(data32) <- c("factor", "frequency")
# data32

# Create a Cleveland dot plot
ggplot(data32, aes(x = factor, y = frequency)) +
  geom_point(position = "jitter", color = "blue", size = 2) +
  labs(title = "Cleveland Dot Plot with Top 20 Most Frequently Contributing Factors",
       x = "Contributing Factors",
       y = "Number of Cases") +
  theme_minimal() +
  scale_y_continuous(labels = scales::comma) +
  coord_flip()

As we see, maximum number of cases have unspecified cause of collision (with almost 700K cases).
But in known reasons of collision, Driver Inattention or Distraction stands on top. Following that there’s failure to yield right-of-way, following the vehicle too closely or backing unsafely.

It’s interesting to notice that alcohol involvement and unsafe speed stand a little less high on the most frequently contributing factors list than they are generally thought to be.

One contributing factor that most people would not guess to be so frequent is Prescription Medication. Almost 20K cases has reported that to be the major contributing factor of collision.

3.4 Who gets hurt and how?

Code
library(tidyr)

# You can use your entire dataset or filter it based on your analysis needs
data4 <- data[c("NUMBER.OF.PEDESTRIANS.INJURED", "NUMBER.OF.PEDESTRIANS.KILLED",
                "NUMBER.OF.CYCLIST.INJURED", "NUMBER.OF.CYCLIST.KILLED",
                "NUMBER.OF.MOTORIST.INJURED", "NUMBER.OF.MOTORIST.KILLED")]

# Pivot longer the columns
data4_long <- data4 %>%
  pivot_longer(
    cols = starts_with("NUMBER.OF."),
    names_to = "Category",
    values_to = "Count"
  ) 

# summarizing by group sum
data4_summarized <- data4_long %>%
  group_by(Category) %>%
  summarise(Count = sum(Count))

# adding new column person_involved and fatality_type with information for the existing data
data4_summarized <- data4_summarized %>%
  mutate(
    person_involved = gsub("NUMBER.OF.(\\w+)\\.(\\w+)", "\\1", Category),
    fatality_type = gsub("NUMBER.OF.(\\w+)\\.(\\w+)", "\\2", Category)
  )

# Drop the Category column
data4_summarized <- data4_summarized[, c("person_involved", "fatality_type", "Count")]

# Calculate percentages for each combination of fatality_type and person_involved
data4_summarized <- data4_summarized %>%
  group_by(fatality_type) %>%
  mutate(percentage = Count / sum(Count))

# data4_summarized

#create a stacked bar chart
ggplot(data4_summarized, aes(x = fatality_type, y=Count, fill = person_involved)) +
  geom_bar(position = "fill", stat = "identity") +
  geom_text(aes(label = scales::percent(percentage)),
            position = position_fill(vjust = 0.5),  # Adjust text position
            show.legend = FALSE) +  # Hide legend for the text
  labs(title = "Percent Stacked Bar Chart of Fatalities",
       x = "Fatality Type",
       y = "Percentage",
       fill = "Person Involved") +
  scale_y_continuous(labels = scales::percent_format(scale = 100)) +  # Format y-axis as percentage
  theme_minimal()

The above stacked bar chart gives some interesting information and comparison in data.

Of all the people who are injured or killed, there is a similar percentage that is cyclists. But percentage of pedestrians killed in collision is thrice the percentage of pedestrians injured. This is opposite for the motorists. This implies that given a person gets into a crash, the chances of them dying are way higher if they are a pedestrian than a motorist.

Of all the people who get killed in collisions, more than 50 percent are pedestrians. This says that pedestrians are the most prone to a serious fatality in a collision.

3.5 How did the number of Collisions changed over time?

Code
# Create a new dataset with month-year and count columns
data5 <- data %>%
  mutate(MonthYear = format(CRASH.DATE, "%Y-%m")) %>%
  group_by(MonthYear, BOROUGH) %>%
  summarise(CrashCount = n())
`summarise()` has grouped output by 'MonthYear'. You can override using the
`.groups` argument.
Code
# converting MonthYear in data5 to date format
data5$MonthYear <- as.Date(paste(data5$MonthYear, "01", sep = "-"))

# omit NA values
data5 <- na.omit(data5)

# Create a separate dataset for overall pattern
overall_pattern <- data5 %>%
  group_by(MonthYear) %>%
  summarise(TotalCases = sum(CrashCount))

# Create a time series line plot
ggplot(data5, aes(x = MonthYear, y = CrashCount, color = BOROUGH)) +
  geom_line(size = 1, alpha = 0.5) +
  geom_line(data = overall_pattern, aes(x = MonthYear, y = TotalCases, color = "OVERALL"), linetype = "solid", size = 1.3) +
  labs(title = "Time Series Line Plot of Case Counts by Borough",
       x = "Year",
       y = "Case Count") +
  theme_minimal() +
  scale_x_date(date_labels = "%Y", date_breaks = "1 year")

The time series plot of case count shows the trends of number of cases over the years in different boroughs of New York and also as a whole (“overall”).

The most visible trend in the graph is that there was a major dip in the number of reported collision cases in 2020, which is potentially because of the pandemic and the lockdown. Even though we can see that the cases started to rise after 2020, the numbers have still been nearly half of what it used to be before 2020.

Brooklyn has been on top when it comes to the number of reported collisions, followed by Queens, Manhattan and Da Bronx. Staten Island has had the least reported cases since 2012, and it is the only borough that did not see any decrease in the number of cases during lockdown.

3.6 Who is more likely to get involved in Collisions? - Demographic Study

Code
library(vcd)
Loading required package: grid
Code
library(ggmosaic)

Attaching package: 'ggmosaic'
The following objects are masked from 'package:vcd':

    mosaic, spine
Code
# reading the people data
people_data <- read.csv("people.csv")
people_data <-  mutate_all(people_data, ~ifelse(is.character(.x) & nchar(.x) == 0, NA, .x))

# Filter the data based on the condition
people_data <- subset(people_data, (PERSON_TYPE == "Occupant" & POSITION_IN_VEHICLE == "Driver") | PERSON_TYPE == "Bicyclist" | PERSON_TYPE == "Pedestrian")

# Replace "Occupant" with "Driver" in the PERSON_TYPE column
people_data$PERSON_TYPE <- ifelse(people_data$PERSON_TYPE == "Occupant", "Driver", people_data$PERSON_TYPE)

#drop the rows where PERSON_SEX is "U"
people_data <- subset(people_data, PERSON_SEX != "U")

# taking a subset of these three columns only
data6 <- people_data[c("PERSON_AGE", "PERSON_SEX", "PERSON_TYPE")]

# omitting the rows with NA values
data6 <- na.omit(data6)

# Filter rows based on the condition for PERSON_AGE
data6 <- data6 %>%
  filter(PERSON_AGE >= 1 & PERSON_AGE <= 100)

# Create age bins with intervals of 10
labels = c("0-9", "10-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79", "80-89", "90-100")

# Add a new column 'AgeBin' to the dataframe
data6 <- data6 %>%
  mutate(AgeBin = cut(PERSON_AGE, breaks = seq(0, 100, by=10), 
                      labels = labels, 
                      include.lowest = TRUE))

# Create a mosaic plot
vcd::mosaic(PERSON_SEX ~ PERSON_TYPE + AgeBin, data = data6, highlighting_fill = c("lightgreen", "cornflowerblue"), main = "Mosaic Plot", direction=c("v","v","h"))

Of all the people getting into collisions, a big majority of them are drivers, and a very few are bicyclists or pedestrians.

Almost 80% of all bicyclist and 75% of drivers are male. But in case of pedestrians, there is a nearly 50-50 between men and women over all ages.