Classifying Fatal Chicago Traffic Crashes

Bonny Nichol
5 min read · Nov 17, 2020
Image Credit: https://energynews.us

The Chicago Traffic Crashes dataset, an open source dataset provided by the City of Chicago, has real data representing traffic crashes since September 2017 (with a few districts being represented since 2015). The dataset was downloaded from the Chicago Data Portal on November 8, 2020.

The goal of the project is to gather information from the dataset and to implement machine learning models in order to advise a hypothetical Vehicle Advisory Board on preventing future accidents. This project focuses on Decision Tree Classifiers and ensemble methods, using GridSearchCV for hyperparameter tuning. The target variable for this research is fatal crashes. The outcome is an accurate model whose selected features suggest which crash conditions, particularly for fatal crashes, the Vehicle Advisory Board can work on improving.

Data Exploration

Quick observations show the dataset contains 49 columns and over 453,000 entries. We can begin to understand which features might be relevant to fatal accidents. Looking at the fatal injuries column, we can see the distribution of crashes. The data is imbalanced: non-fatal crashes far outnumber fatal ones. We will account for this later in the ML model with the class_weight='balanced' parameter. For now, we will divide this column into two groups: fatal and non-fatal.
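
A minimal sketch of loading the data and building the binary target, assuming the CSV export from the Chicago Data Portal and an INJURIES_FATAL count column (the file name and column names are assumptions):

```python
import pandas as pd

# Load the Chicago Traffic Crashes export (file name is an assumption)
crashes = pd.read_csv("Traffic_Crashes_-_Crashes.csv")
print(crashes.shape)  # roughly 453,000 rows x 49 columns

# Binary target: a crash is fatal if it involved at least one fatal injury
crashes["FATAL"] = (crashes["INJURIES_FATAL"].fillna(0) > 0).astype(int)

# Shows the heavy class imbalance between non-fatal and fatal crashes
print(crashes["FATAL"].value_counts(normalize=True))
```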

Plotting Geographical Data

Visualizing the locations of the non-fatal (serious), non-fatal (not serious), and fatal accidents overlaid on a map of Chicago is an important tool for understanding which parts of town are most impacted by traffic accidents. This is valuable knowledge for a Vehicle Safety Board, which could deploy task teams to the most impacted areas to help prevent future accidents. We use GeoPandas to overlay this data as points onto a shapefile of Chicago.
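
A minimal sketch of the overlay, assuming a local Chicago boundary shapefile and LATITUDE/LONGITUDE columns (the file path and column names are assumptions):

```python
import geopandas as gpd
import matplotlib.pyplot as plt

# City boundary shapefile (path is an assumption; the Data Portal offers boundary exports)
chicago = gpd.read_file("chicago_boundary.shp")

# Build a point GeoDataFrame from the crash coordinates
valid = crashes.dropna(subset=["LATITUDE", "LONGITUDE"])
points = gpd.GeoDataFrame(
    valid,
    geometry=gpd.points_from_xy(valid["LONGITUDE"], valid["LATITUDE"]),
    crs="EPSG:4326",
).to_crs(chicago.crs)

# Plot fatal crashes on top of the city outline
fig, ax = plt.subplots(figsize=(8, 10))
chicago.plot(ax=ax, color="lightgrey")
points[points["FATAL"] == 1].plot(ax=ax, markersize=2, color="red")
ax.set_title("Fatal Accidents in Chicago")
plt.show()
```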

Fatal Accidents in Chicago

We can see in our first map, showing fatal accidents, that the accidents are sprinkled throughout Chicago, yet certain zones have a higher rate of fatal accidents.

The second map shows non-fatal but serious injuries. These injuries were disabling, and the person in the accident was not able to walk away from it. This data is also very valuable to visualize because, even though the accident was not fatal, it still impacted someone’s life. As we can see in the second map, the areas with a high rate of non-fatal serious accidents align with the areas of fatal accidents. This is valuable information for the Vehicle Safety Board, which can target these areas to avoid future serious accidents.

The final map shows non-fatal and not serious accidents. These accidents make up the large majority of the dataset and highlight the same areas of Chicago with a higher rate of car crashes.

Machine Learning

Decision Tree Classifier and DT Classifier with Entropy

A Decision Tree Classifier is a useful ML tool for this dataset because it uses discrete variables as input (encoded from our categorical variables) and is one of the most powerful (and oldest!) tools in machine learning. The results of Decision Trees are also easier to interpret than those of more complex models. The general idea of the Decision Tree model is that it recursively partitions the feature space, checking a condition and making a decision at each node. For the parameters of the decision tree, I will use class_weight='balanced' to account for the imbalanced sample sizes; there are far fewer fatal crashes than non-fatal crashes. I will also use max_depth=5 because, without this parameter, the model completely overfits. Later on, when tuning the hyperparameters, we will explore these settings further to find the best fit for the model.
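
A minimal sketch of this baseline tree, assuming a one-hot-encoded feature matrix built from a few of the dataset’s columns (the feature subset shown here is an illustrative assumption) and the crashes DataFrame from the loading sketch above:

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Illustrative feature subset; the full project encodes many more columns
features = ["CRASH_HOUR", "POSTED_SPEED_LIMIT", "WEATHER_CONDITION", "LIGHTING_CONDITION"]
X = pd.get_dummies(crashes[features], dtype=int)
y = crashes["FATAL"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Balanced class weights offset the rarity of fatal crashes; max_depth limits overfitting
dt = DecisionTreeClassifier(class_weight="balanced", max_depth=5, random_state=42)
dt.fit(X_train, y_train)
print(classification_report(y_test, dt.predict(X_test)))
```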

Comparing a DT Classifier with no set parameters, a DT Classifier with max_depth=8, and a DT Classifier that uses the entropy criterion, the entropy-based classifier worked best in terms of accuracy for this dataset.
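
The comparison can be sketched as a short loop over the three configurations, reusing the train/test split from above (the exact parameter combinations are assumptions):

```python
from sklearn.tree import DecisionTreeClassifier

variants = {
    "default": DecisionTreeClassifier(random_state=42),
    "max_depth=8": DecisionTreeClassifier(max_depth=8, random_state=42),
    "entropy": DecisionTreeClassifier(criterion="entropy", max_depth=8, random_state=42),
}

# Reporting train and test accuracy side by side also exposes overfitting
for name, model in variants.items():
    model.fit(X_train, y_train)
    print(f"{name}: train={model.score(X_train, y_train):.3f}, "
          f"test={model.score(X_test, y_test):.3f}")
```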

Bag of Trees and Random Forest Classifier

Bagged Decision Trees are trained on slightly different bootstrap samples of the training set, so each tree makes slightly different predictions. Bagging and Random Forest Classifiers reduce the variance of a model that would otherwise overfit. Boosting, in contrast, builds models sequentially to reduce bias and ensure the model is not underfit. A Random Forest Classifier creates many decision trees from data samples and then aggregates their predictions by majority vote.
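
A minimal sketch of both ensembles, reusing the split from above (the tree counts and depths are illustrative assumptions):

```python
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Bag of trees: each tree sees a different bootstrap sample of the training data
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(class_weight="balanced", max_depth=5),  # base_estimator in older scikit-learn
    n_estimators=100,
    random_state=42,
)

# Random forest: bootstrap samples plus a random subset of features at each split
rf = RandomForestClassifier(
    n_estimators=100, max_depth=5, class_weight="balanced", random_state=42
)

for name, model in [("bagged trees", bag), ("random forest", rf)]:
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))
```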

Hyperparameter Tuning

Grid Search
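
A minimal sketch of a GridSearchCV run over the decision-tree parameters, reusing the split from above (the grid values and scoring choice are illustrative assumptions):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 8, 12],
    "min_samples_leaf": [1, 10, 50],
}

# Recall on the fatal class matters more than raw accuracy for this imbalanced problem
grid = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=42),
    param_grid,
    scoring="recall",
    cv=5,
    n_jobs=-1,
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```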

XGBoost
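
A minimal sketch of an XGBoost classifier on the same split, weighting the positive (fatal) class by the imbalance ratio (the parameter values are illustrative assumptions):

```python
from xgboost import XGBClassifier

# Ratio of non-fatal to fatal crashes, used to upweight the rare positive class
ratio = (y_train == 0).sum() / (y_train == 1).sum()

xgb = XGBClassifier(
    n_estimators=200,
    max_depth=5,
    learning_rate=0.1,
    scale_pos_weight=ratio,
    eval_metric="logloss",
)
xgb.fit(X_train, y_train)
print(xgb.score(X_test, y_test))
```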

Final Conclusions

The hour of a crash has the biggest impact on fatal accidents. The Vehicle Safety Board could advise having more officers patrol during these hours, or lowering speed limits during these hours, to prevent more fatal accidents.

Geographic location indicates that certain areas are more likely to have fatal accidents. Conditions in these areas could be considered to see if there are any specific road problems that can be addressed.

Because this dataset is reliable (created from police reports) and is continuously updated, I recommend further and ongoing review of this dataset to find more patterns that could help explain fatal (and serious) accidents.
