1. Introduction
2. Literature review
2.1. Factors influencing cyclist crash severity
2.2. Methodologies employed in previous studies
3. Data
4. Method
4.1. Classification tree
4.2. Mixed logit
4.3. Performance metrics
5. Results
5.1. Classification tree
5.2. Mixed logit
6. Discussion
7. Conclusions
References

Mixed Logit Model and Classification Tree to Investigate Cyclists Crash Severity

Abstract

Growing concerns about emissions, urban traffic congestion, and the promotion of an active lifestyle are inducing more people to choose the bike for their daily commute. The increase in bike usage underscores the need to improve cyclist safety. Our study examined the 72,363 cyclist crashes that occurred in Great Britain in the period 2016-2019 with the objective of (1) examining how various factors influence cyclist crash severity, (2) identifying complex interactions among these crash patterns, and (3) proposing countermeasures aimed at addressing the identified risk factors. To achieve these goals, a Classification Tree (CT) model was used as an exploratory tool to detect patterns and interactions that may not have been hypothesized a priori, and an econometric approach, the Mixed Logit Model (MLM), was used to quantify global effects and to test both the interactions identified by the CT and all the explanatory variables within a statistically rigorous framework. Specifically, six interaction variables were identified from the CT terminal nodes with the highest probability of fatal crashes by tracing back their pathways to the root node. These interactions were then included as additional explanatory variables in the MLM so that all risk factors were tested within a unified statistical framework. Interestingly, all the interactions were statistically significant. Thus, the CT model is explicitly used as a supporting tool to identify potential interactions, while conclusions are drawn from the MLM results. Based on the identified risk factors, a set of targeted safety countermeasures has been proposed to minimize cyclist crash severity and improve overall road safety.

1. Introduction

Encouraging cycling is of great importance for European cities and towns because it is a key element in reaching climate objectives, such as the EU’s ambitious target to reduce emissions by a minimum of 55% by 2030 (European Union, 2021). Rising concerns regarding greenhouse gas emissions and urban traffic congestion, together with the promotion of active lifestyles, have transformed transportation habits. Indeed, an increasing number of road users choose the bike as their daily commuting mode (European Commission, 2020).

Moreover, in Great Britain, cyclist traffic increased by 50% between 2004 and 2022 (Department for Transport, 2023). The rise in bike usage is a significant phenomenon that indicates an expanding environmental awareness and a commitment to sustainable mobility. However, despite this rise, Great Britain reports relatively low cycling rates compared to other European countries, with an average of 80 km cycled per person per year between 2016 and 2018. In contrast, countries such as the Netherlands and Denmark have higher rates, with 865 km and 508 km cycled per person per year, respectively (Adminaitė & Jost, 2020). This disparity highlights the varying levels of cycling engagement across Europe and shows the need for complete streets and more tailored design and maintenance projects to create a community that promotes safety, connectivity, and attractiveness through a transportation network that accommodates all modes (Montella et al., 2022). Indeed, the way streets are designed significantly impacts the transport system. At the same time, the high number of cyclist fatalities and serious injuries shows that a significant and rigorous effort to improve cyclist safety is essential to tackle the challenge of sustainable mobility.

More than ever, cyclist safety deserves attention and in-depth research, because a growing number of cities are promoting cycling as an economical, green, healthy, and efficient mode of transport. To this end, research on vulnerable road users (VRUs) helps to identify patterns that affect crash severity and to develop safety countermeasures that reduce fatalities and serious injuries involving VRUs. This study aims to contribute to the research focused on improving cyclist safety. Indeed, crash data analysis is necessary for a deeper understanding of the factors associated with cyclist crashes and supports the implementation of more effective countermeasures (Scarano, Aria, et al., 2023; Scarano, Rella Riccardi, et al., 2023).

The objectives of the research include (1) examining how various factors, such as road-related features, environmental conditions, involved vehicle characteristics, driver attributes, and cyclist-related aspects, influence cyclist crash severity, (2) identifying complex interactions among these crash patterns and refining the regression analysis, and (3) offering useful insights that can support the development of effective countermeasures linked to the identified risk factors.

To achieve these objectives, this study employs both a machine learning tool, the Classification Tree (CT), and an econometric model, the Mixed Logit Model (MLM). The CT model serves as an exploratory tool, detecting patterns and potential interactions that may not have been hypothesized a priori. These insights are then used to refine the MLM, so that all relevant variables and interactions are tested within a unified statistical framework. In this way, conclusions are based on a rigorous and statistically consistent methodology.

These associations were examined using data on the 72,363 cyclist crashes that were recorded in Great Britain in the period 2016-2019. This integrative approach, combining econometric and machine learning methodologies, is relatively uncommon in the existing literature.

2. Literature review

In the field of cyclist safety research, understanding the factors affecting crash severity is important. In recent years, several studies have explored this critical area, aiming to uncover the contributors to the outcome of crashes involving cyclists. Despite the extensive research conducted in this field, the issue remains unresolved. Consequently, more investigations are needed to understand the factors contributing to cyclist crash severity, and our study aims to further investigate these factors by combining two different methodologies: the CT and the MLM. The CT helps to identify critical scenarios by tracing the decision paths within the model. The MLM, on the other hand, is particularly advantageous because it accounts for unobserved heterogeneity and provides quantitative and easily interpretable results on the impact of each factor. This combination of models is relatively uncommon in existing research, and it makes it possible not only to identify the key factors influencing crash severity but also to pinpoint critical associations that can help inform targeted safety measures.

Table 1 provides a concise summary of insights derived from prior studies on cyclist injury severity, employing various methodologies from traditional econometric models to advanced machine learning techniques.

2.1. Factors influencing cyclist crash severity

Several studies in the last decade have emphasized the association of higher speed limits with the occurrence of the most severe outcomes (Behnood & Mannering, 2017; Chen & Shen, 2016; Dash et al., 2022; S. Islam & Hossain, 2015; S. Liu et al., 2021; Zhu, 2021).

Crash severity is significantly influenced by roadway factors, including the area in which the crash occurs, road and pavement characteristics, the number of lanes, and the presence of road signage (Anysz et al., 2021; Das et al., 2023; Dash et al., 2022; Katanalp & Eren, 2020; Prati et al., 2017; Sener et al., 2019; Zhu, 2021). This reinforces the important role that the roadway plays in determining cyclist safety.

Environmental factors are also important as the analysis indicates that the most severe cyclist crashes are more likely to occur during the summer season, at night, or in low-illumination conditions (Chen & Shen, 2016; Das et al., 2023; Dash et al., 2022; S. Islam & Hossain, 2015; Salon & Mcintyre, 2018; Sun et al., 2022b).

Similarly, individual cyclist characteristics including age, gender, and race have been identified as factors associated with an increase in crash severity (Bahrololoom et al., 2020; Balakrishnan et al., 2019; Behnood & Mannering, 2017; Chen & Shen, 2016; Joo et al., 2017; J. Liu et al., 2020; Macioszek & Granà, 2022; Prati et al., 2017; Sun et al., 2022a; Zhu, 2021). Among cyclist behaviours, alcohol consumption, riding on the wrong side of the road, and the use of helmets and reflective clothing significantly influence crash severity (S. Islam & Hossain, 2015; S. Liu et al., 2021; Wahi et al., 2018).

In motor-vehicle crashes involving cyclists, various factors associated with both the driver and the vehicle affect injury severity. Prior studies identified a significant influence of driver gender, age, alcohol consumption, distraction, and driving on the wrong side of the road on the cyclist crash outcome (Balakrishnan et al., 2019; Das et al., 2023; Liu & Fan, 2021; Rahimi et al., 2020).

Moreover, the involvement of heavy vehicles and the extent of damage sustained by the vehicles contribute to the likelihood of severe injuries in cyclist crashes. Recognizing this dual aspect highlights the need to consider not only factors specific to cyclists but also vehicle characteristics (Joo et al., 2017; Katanalp & Eren, 2020).

Several studies focused on crashes occurring at intersections, revealing noteworthy insights. Akgun et al. (2018) found that a higher speed limit reduces safety for cyclists at roundabouts, with the probability of a serious casualty increasing for each additional lane on approach and with a higher entry path radius. Shen et al. (2020) extended this exploration to various intersection types, including roundabouts, crossroads, and T-junctions, highlighting differences in the factors influencing cycling safety across these intersections. Another perspective was offered by Wahi et al. (2018), who identified factors contributing to bike-motor vehicle crash severity under different traffic control measures at intersections. Their findings indicated increased injuries for cyclists aged 40-49 and 60+ without helmets, particularly at regulated intersections. In particular, at-fault cyclists experienced more severe injuries at stop signs, with speed emerging as an important risk factor specifically at these intersections. Bahrololoom et al. (2020) focused on bike-car crashes at intersections, showing that pre-collision car kinetic energy and post-collision bike kinetic energy affect injury severity. The study further highlighted that older cyclists are more susceptible to severe crashes, while helmet use mitigates injury severity.

2.2. Methodologies employed in previous studies

The literature analysis indicates different methodology families employed in previous studies (Table 1). Among them, econometric models have a solid foundation for crash analysis. They can be classified into ordinal and non-ordinal methods. The ordinal models, such as Ordered Logit (S. Liu et al., 2021), Ordered Mixed Logit (Balakrishnan et al., 2019; Chen & Shen, 2016), Ordered Probit (Joo et al., 2017; S. Liu et al., 2021), Geographically Weighted Ordinal Logit (J. Liu et al., 2020) and Generalized Ordered Logit (Chen & Shen, 2016; Shen et al., 2020) have been used to analyse the crash severity in a graduated manner, considering increasing levels of injuries.

On the other hand, non-ordinal models included the Generalized Additive Model (Chen & Shen, 2016), the Generalized Linear Model (Xing et al., 2020), the Logit Model (Akgun et al., 2018; Dash et al., 2022; Macioszek & Granà, 2022; Salon & Mcintyre, 2018; Sener et al., 2019), the Mixed Logit (Bahrololoom et al., 2020; Behnood & Mannering, 2017; Das et al., 2023; S. Islam & Hossain, 2015; Sun et al., 2022b, 2022a; Wahi et al., 2018), the Partial Proportional Odds Logit (S. Liu et al., 2021; Shen et al., 2020), and the Latent Class Analysis (Sun et al., 2022a).

Furthermore, the literature analysis indicates a growing use of machine learning techniques such as Association Rules (Anysz et al., 2021), Decision Trees (Dash et al., 2022; Joo et al., 2017; Katanalp & Eren, 2020; Prati et al., 2017; Rahimi et al., 2020; Xing et al., 2020), DT-Based Converted Fuzzy Logic (Katanalp & Eren, 2020), Random Forest (Dash et al., 2022; Rahimi et al., 2020), Gradient Boosting (Zhu, 2021) and Bayesian Network (Prati et al., 2017; Sun et al., 2022b).

Machine learning methods are characterised by their ability to handle complex and nonlinear relationships in data without specific assumptions about the data distribution. Thus, in the presence of large and complex datasets, machine learning techniques may be preferred to econometric models, since they require little data preparation time and provide adequate accuracy (F. Mannering et al., 2020; Montella et al., 2021; Moral-Garcia et al., 2019; Rella Riccardi, Galante, et al., 2022; Zhu, 2021).

Thus, while econometric models provide a solid analytical framework, machine learning methods offer more flexibility to address complex relationships and datasets.

In this paper, both econometric methods, such as the Mixed Logit Model (MLM), and machine learning tools, specifically the Classification Tree (CT), were employed to identify the factors influencing the cyclist crash severity.

Among econometric models, the MLM was chosen because it accounts for unobserved heterogeneity among observations, providing a fuller understanding of variability in the data. This is particularly useful in the context of cyclist crashes, where individual characteristics may affect outcomes (Rella Riccardi, Mauriello, Sarkar, et al., 2022; Scarano et al., 2022b). Moreover, our study introduces an innovative approach that combines traditional econometric methods with machine learning techniques. While traditional econometric models provide a solid foundation in the fields of statistics and econometrics, our research takes a step forward by also using advanced machine learning tools such as the CT. This integration represents an unconventional practice that overcomes the limitations of any single method.

Finally, our study is in line with the contemporary trend in the literature, which shows a growing reliance on machine learning techniques. The inclusion of decision trees as data mining tools, as shown in a recent analysis carried out by Scarano et al. (2023), represents a step forward in our research. Indeed, the CT emerges as a practical technique able to effectively manage discrete variables or variables with a high number of categories. This approach not only handles outliers well but also eliminates the need for a priori assumptions regarding variable distributions, overcoming the limitations of regression models (Prati et al., 2017).

Table 1. Literature review (Part A)
References | Method | Objective | Main findings
[b@442069] | Logit | To investigate which design factors influence cyclist severity at give-way (non-signalized) roundabouts with mixed traffic | A higher speed limit reduces the safety for cyclists at roundabouts. The probability of a serious casualty increases five times for each additional number of lanes on approach and by 4% with a higher entry path radius.
[b@442073] | Association Rules | To identify factors and conditions contributing to cyclists’ serious crashes | Daylight reduces traffic signal impact on slight injuries in urban areas, with no significant rules for fatal and severe injuries. In non-urban areas, the absence of daylight and dual roadway results in fatal and severe injuries.
[b@442075] | Newtonian Mechanics; Mixed Logit | To investigate the relationship of cyclist’s injury severity with various parameters in bike-car crashes at intersections | Pre-collision car kinetic energy and post-collision bike kinetic energy impact injury severity. Older cyclists suffer more severe crashes, helmet usage reduced injury severity.
[b@442076] | Ordered Mixed Logit | To identify the socioeconomic characteristics influencing injury severity in heavy truck-VRU crashes | Injury severities are affected by the age and gender of road users and sociodemographic factors such as age, language, occupation, religion, and income.
[b@442077] | Mixed Logit | To investigate factors that contribute to the cyclist injury severity in bike-MV crashes while systematically accounting for unobserved heterogeneity within the crash data | Injury severities are affected by cyclist and driver race and gender, alcohol impairment in cyclists or drivers, the age of cyclists, riding or driving on the wrong side of the road, drivers’ unsafe speeding, and the absence of a cyclist helmet.
[b@442082] | Generalized Ordered Logit; Generalized Additive | To estimate the effects of built environment factors on cyclist injury severity in MV-bike crashes, and to accommodate possible spatial dependence among crash locations | Higher employment density correlates with lower injury severity. Land use mixture reduces severe injuries or fatalities. Reflective clothing and improved street lighting decrease cyclist injuries, while higher speed limits, older age, and involvement of trucks increase the risk of severe crashes.
[b@442084] | Mixed Logit | To examine the key contributing factors influencing cyclist injuries | Distracted drivers, elderly cyclists, and riding in dark conditions increase the likelihood of severe injuries in MV-cyclist crashes. Conversely, straight level roadways and city streets decrease the odds of severe injuries.
[b@442085] | Logit; Decision Tree; Random Forest | To understand the critical factors that influence severe cyclist crash and identify countermeasures | Inadequate lighting conditions, crashes on roadways, speed limits, average annual daily traffic, number of lanes, and weekends are the factors affecting cyclist crashes.
[b@442093] | Mixed Logit | To compare the influence of various variables on injury severities in pedestrian-MV and bike-MV crashes | Injury severities are affected by use of retroreflective clothing, older age, dark roadway, and low speed roadway. Time of day and weather resulted in random parameters.
[b@442095] | Decision Tree; Ordered Probit | To examine the impacts of contributing factors on the cyclist injury severity | Factors affecting injury severity were the cyclist age and the presence of trucks.
[b@442098] | C4.5 Decision Tree; DT-Based Converted Fuzzy Logic | To examine the effect of several parameters on cyclist injury severity and validate the new fuzzy decision approaches for the classification of cyclist injury severity | The most important factors in bike-MV crash severity are gender, vehicle damage extent, road type, pavement type, crash type, and MV manoeuvres.
[b@442102] | Ordered Logit; Partial Proportional Odds Logit | To investigate factors influencing injury severity in bike crashes on weekdays and weekends | Factors like older aged cyclists, riding direction, pickup, older aged drivers, male drivers, and specific time periods are significant only on weekdays. Speed limits, time of day, alcohol usage, and specific road characteristics affect injury severities only on weekends.

Abbreviations: VRUs = vulnerable road users; MV = motor vehicle

Table 1. Literature review (Part B)
References | Method | Objective | Main findings
[b@442101] | Geographically Weighted Ordinal Logit | To generate new knowledge about bike safety using spatial modelling | Cyclist injury severity increases with older cyclists, cyclist intoxication, and higher MV speeds.
[b@442105] | Logit | To identify factors that influence the occurrence and severity of cyclist injury in cyclist-MV crashes | Factors affecting cyclist injuries and deaths include driver attributes (gender, age, alcohol, speeding), cyclist attributes (age, alcohol, speed), and crash details (vehicle type, crash location, time, type).
[b@442122] | CHAID Decision Tree; Bayesian Network | To investigate the factors predicting the severity of bike crashes | The key predictors were road type, crash type, cyclist’s age, road signage, cyclist’s gender, type of opponent vehicle, month, road segment type, and opponent vehicle type.
[b@442123] | Random Forests; Decision Trees | To recognize fatality patterns in large truck-involved pedestrian/bike crashes | On local roads (AADT > 38,000), intersections and vehicle speeding are crucial for injury severity. On high-volume roads (non-local), young/middle-aged truck drivers, divided roadways, speeding, and midday increase fatal outcomes. On low-volume roads, clear weather, median barrier, and curb shoulder near a signalized intersection affected fatal crashes.
[b@442143] | Separate Mixed Logit | To identify factors that contribute to bike-MV crash severity under different traffic control measures at intersections | Injuries increase for cyclists aged 40-49 and 60+ without helmets, particularly at regulated intersections. Faulty cyclists experience more severe injuries at stop signs, with speed being a risk factor specifically at stop signs.
[b@442129] | Bivariate Analysis; Logit | To illuminate key determinants of crash severity for both pedestrian and cyclist | State highway crashes, late-night hours, and daylight contribute to crash severity. Summer increases serious outcomes for faulty cyclist crashes. Cyclist characteristics such as race, sobriety, age over 64 affect crash severity.
[b@442134] | Logit | To identify factors associated with crash frequency and severity for pedestrians and cyclists | Crash frequency and severity are affected by travel demand, commute behaviours, network characteristics, and sociodemographic features.
[b@442135] | Generalized Ordered Logit; Partial Proportional Odds | To explore the contributing factors to cyclist injury severity on roundabouts, crossroads, and T-junctions | The factors that affect cycling safety at various intersections show enormous differences.
[b@442137] | Mixed Logit; Bayesian Network | To analyse VRUs-MV crashes across seasons using a hybrid method to identify significant factors and their interactions | Certain factors increase fatality likelihood when combined, like functional zone in spring and summer and MV type in fall and winter.
[b@442138] | Latent Class Analysis; Mixed Logit | To identify contributing factors to cyclist crashes in cold regions | Male cyclists have a higher winter fatality risk.
[b@442148] | Generalized Linear Model; Classification Tree | To identify the influence factors on injury severity of electric and non-electric bike crashes | The factors influencing non-electric bike crashes are signal control mode, MV type, visibility, lighting condition, motor vehicle license possession, and the type of MV.
Zhu, S., 2021 | Gradient Boosting | To investigate the contributing factors to fatalities and injuries involving cyclist | Key predictors are gender, arterial road type, 100 km/h speed zone, young drivers, dark conditions, crash type, and cyclist age.

Abbreviations: VRUs = vulnerable road users; MV = motor vehicle

3. Data

Crash data were retrieved from the STATS19 dataset (https://www.gov.uk/transport-statistics-notes-and-guidance-road-accident-and-safety). This dataset includes records of road crashes resulting in personal injuries, involving at least one vehicle, that occurred on public roads in Great Britain. Notably, the dataset does not include information about Property Damage Only (PDO) crashes. These data were collected by law enforcement at the crash scene or reported by the public at a police station. For the purposes of this study, we analysed crashes that occurred in Great Britain during the period from 2016 to 2019.

Initially, the dataset was presented in three separate subsets, containing information on the crashes, the involved vehicles, and the casualties, respectively. The crash dataset consisted of thirty-two variables to describe the crash, the vehicle dataset included twenty-two variables describing all the vehicles involved, and the casualty dataset contained sixteen variables characterising the road users injured or killed in the crashes. To work with a unified dataset, we merged these three subsets using the unique crash index for each crash.

To improve the performance of our statistical tools, initial data preprocessing was conducted by merging specific categories and reorganizing redundant information. Regarding the vehicles involved in the crash, the analysis focuses on the bike and a second vehicle, because a very small percentage of cyclist-related crashes involved more than two vehicles.

The final dataset includes a total of 72,363 cyclist-related crashes. This dataset was reorganized into forty explanatory variables, as shown in Appendix A, Tables A1-A4. The variables were divided into roadway (Table A1), environmental (Table A2), vehicle (Tables A2 and A3), cyclist, and driver-related variables (Table A4). The crash severity, which represents the response variable, was based on the injury severity of the most seriously injured person involved in the crash. It was classified into three categories: slight injury, serious injury, and fatal. A fatal crash was defined as a crash in which at least one person was killed instantly or died within 30 days of the crash. A serious injury crash was defined as a crash resulting in injuries that required hospitalization or involved fractures, concussion, internal injuries, burns (excluding friction burns), severe cuts, severe general shock requiring medical treatment, or injuries causing death 30 or more days after the crash. Finally, slight injury crashes were those resulting only in minor injuries.
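The merge by crash index and the derivation of the response variable can be sketched as follows. This is only a minimal illustration: the field names (`crash_index`, `severity`) and the string labels for the three severity classes are assumptions made for the example and are not the STATS19 column names.

```python
from collections import defaultdict

# Assumed coding for this sketch: a higher rank means a worse outcome.
SEVERITY_RANK = {"slight": 1, "serious": 2, "fatal": 3}


def merge_by_crash_index(crashes, vehicles, casualties):
    """Attach vehicles and casualties to each crash and derive crash severity.

    Each argument is a list of dicts sharing a hypothetical 'crash_index' key.
    Crash severity is the severity of the most seriously injured casualty.
    """
    vehicles_by_crash = defaultdict(list)
    for vehicle in vehicles:
        vehicles_by_crash[vehicle["crash_index"]].append(vehicle)

    casualties_by_crash = defaultdict(list)
    for casualty in casualties:
        casualties_by_crash[casualty["crash_index"]].append(casualty)

    merged = []
    for crash in crashes:
        idx = crash["crash_index"]
        people = casualties_by_crash.get(idx, [])
        if not people:
            continue  # an injury crash record is expected to have a casualty
        worst = max(people, key=lambda p: SEVERITY_RANK[p["severity"]])
        merged.append({
            **crash,
            "vehicles": vehicles_by_crash.get(idx, []),
            "casualties": people,
            "crash_severity": worst["severity"],
        })
    return merged
```

A crash with one slightly injured and one seriously injured casualty would therefore be recorded as a serious injury crash.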

The database used in this research comprised 429 fatal crashes (0.59% of all crashes), 14,890 serious injury crashes (20.58% of all crashes), and 57,044 slight injury crashes (78.83% of all crashes).

4. Method

To ensure a complete understanding of the factors influencing cyclist crash severity, this study adopts an integrative methodological approach that combines econometric modelling with machine learning techniques. The CT model is leveraged as a preliminary exploratory tool to uncover complex interactions and high-risk crash scenarios. The insights derived from the CT analysis are then systematically incorporated into the MLM, allowing for a statistically rigorous evaluation of both main effects and interaction terms.

4.1. Classification tree

Classification trees represent a methodology aimed at obtaining a hierarchical segmentation of the cyclist crashes by iteratively partitioning the dataset into subgroups that are homogeneous in terms of cyclist crash severity. This process reveals the variables that have the strongest influence on the outcome.

The tree is a directed graph that starts from a root node. Then, this root node is progressively split into leaf nodes by using an explanatory variable named “splitter”. Although all independent variables are considered potential candidates for splits at each internal node, the predictor that makes the best partition is selected.

We implemented the Classification and Regression Trees (CART) algorithm introduced by Breiman et al. (1984). The impurity at each node was evaluated using the Gini index, and splits were selected according to the Gini reduction criterion: a larger reduction in the Gini index indicates increased homogeneity in the nodes resulting from the split. The Gini index of a node $t$ is calculated as follows:

$$i(t) = 1 - \sum_{j} p(j \mid t)^{2}$$

where $p(j \mid t)$ represents the proportion of observations in node $t$ belonging to the $j$th class of the response variable. A node is pure when it contains only observations belonging to a single severity class, resulting in an impurity value of zero.

The total impurity of the tree $T$ is calculated as follows:

$$I(T) = \sum_{t \in \tilde{T}} i(t)\, p(t)$$

where $i(t)$ represents the impurity of the node $t$, $p(t) = N(t)/N$ is the node’s weight ($N(t)$ is the count of observations in node $t$, $N$ is the overall count of observations), and $\tilde{T}$ denotes the set of all the terminal nodes in the tree. The terminal nodes exhibit lower impurity levels in contrast to the root node. The overall impurity of the tree decreases through the identification, at each tree node, of the optimal partitioning of observations into distinct severity classes, ensuring external heterogeneity and internal homogeneity.
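As a small numerical illustration of these two quantities, the sketch below computes the Gini impurity of a node (one minus the sum of squared class proportions) and the tree impurity as the weighted sum over terminal nodes. The function names and the representation of a node as a plain list of severity labels are assumptions made for the example.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of one node: 1 minus the sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def tree_impurity(terminal_nodes):
    """Weighted impurity over terminal nodes; weights are node sizes over N."""
    total = sum(len(node) for node in terminal_nodes)
    return sum(len(node) / total * gini_impurity(node) for node in terminal_nodes)
```

For example, a node holding only slight injury crashes has impurity 0, while a node split evenly between two severity classes has impurity 0.5.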

The CART algorithm comprises the following steps:

  1. generate the set S of all possible partitions obtained from the predictor set X;

  2. for each split s in the set S, calculate the impurity reduction;

  3. determine the best partition associated with the maximum impurity reduction.
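The three steps above can be sketched for categorical predictors and binary splits as follows. This is an illustrative reading of the procedure, not the SPSS implementation used in the study; the row format (a dict per crash) and the function names are assumptions.

```python
from collections import Counter
from itertools import combinations


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def candidate_splits(categories):
    """Step 1: all binary partitions of a set of categories (left subsets)."""
    cats = sorted(categories)
    if not cats:
        return
    first, rest = cats[0], cats[1:]
    # Keeping the first category on the left avoids listing partitions twice;
    # r stops before len(rest) so the right side is never empty.
    for r in range(len(rest)):
        for combo in combinations(rest, r):
            yield {first, *combo}


def best_split(rows, labels, predictors):
    """Steps 2 and 3: impurity reduction of each split, keep the maximum."""
    n = len(labels)
    parent = gini(labels)
    best = (0.0, None, None)  # (reduction, predictor, left categories)
    for name in predictors:
        for left_cats in candidate_splits({row[name] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[name] in left_cats]
            right = [y for row, y in zip(rows, labels) if row[name] not in left_cats]
            reduction = (parent
                         - len(left) / n * gini(left)
                         - len(right) / n * gini(right))
            if reduction > best[0]:
                best = (reduction, name, left_cats)
    return best
```

Enumerating every subset grows exponentially with the number of categories, so this sketch is only practical for predictors with few levels.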

The algorithm is applied to each node until the tree construction stops. Before choosing the most appropriate stopping criterion for the tree, a literature review was carried out (Table 2).

Table 2. Literature analysis for classification tree stopping criteria
References | Levels* | Stopping criterion
[b@442067] | 4 | Sensitivity analysis: entropy decrease
[b@442081] | 5 | Sensitivity analysis: confusion matrix
[b@442090] | 4 | Minimum impurity decreases (improvement = 0.0001)
[b@442097] | 4 | Minimum evaluation error
[b@442099] | 4 | Minimum cardinality of node subject to dividing; minimum cardinality of the node resulting from dividing; maximum number of tree levels
[b@442100] | 3 | Missing**
[b@442103] | 4 | Maximum tree depth set at 4 levels
[b@442109] | 3 | Minimum evaluation error
[b@442112; @442113; @442115] | 3, 4 | Minimum impurity decreases (improvement = 0.001) or maximum tree depth set at 4 levels
[b@442121] | Missing** | Minimum impurity decreases; Limit for the minimum number of observations in a leaf
[b@442124; @442125] | 4 | Minimum impurity decreases (improvement = 0.001) or maximum tree depth set at 4 levels
[b@442128] | 20 | Minimum cross-validated error; at least two observations in each leaf

* Levels: the different subdivisions or partitions of the data during the tree-building process.
**Missing: missing information

Based on the information presented in Table 2, it was decided to adopt stopping techniques according to two criteria: (1) when the decrease in Gini index fell below a minimum threshold set at 0.0001; and (2) when the tree reached a maximum depth of 4 levels.
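The two stopping rules can be expressed as a simple check applied before each split. The sketch below is illustrative: the function name and the convention that the root node sits at depth 0 are assumptions, and only the two thresholds come from the text.

```python
MIN_GINI_DECREASE = 0.0001  # threshold on the decrease in the Gini index
MAX_DEPTH = 4               # maximum number of tree levels


def should_stop(depth, best_gini_decrease):
    """Return True when a node must not be split any further.

    depth: level of the node, with the root assumed to be at depth 0.
    best_gini_decrease: impurity reduction of the best available split.
    """
    return depth >= MAX_DEPTH or best_gini_decrease < MIN_GINI_DECREASE
```

Readers working outside SPSS may find comparable controls in other libraries, for example the `max_depth` and `min_impurity_decrease` parameters of scikit-learn's `DecisionTreeClassifier`, although the exact definition of the impurity decrease may differ between tools.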

For each node, the assigned severity class depends on the highest value of the Posterior Classification Ratio (PCR). The PCR compares the percentage of observations for each severity class in the terminal node with those in the root node (Rella Riccardi, Galante, et al., 2022):

$$\mathrm{PCR}(j \mid t) = \frac{p(j \mid t)}{p(j \mid t_{root})}$$

where $p(j \mid t)$ denotes the proportion of crashes in node $t$ belonging to severity class $j$, and $t_{root}$ is the tree root node.

For each node, the class with the highest PCR value determines the class of that node, selected as follows:

$$j^{*}(t) = \arg\max_{j} \mathrm{PCR}(j \mid t)$$

The terminal nodes allow us to identify relationships between variables that affect the severity of cyclist crashes. The classification trees were built using the SPSS software.
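To illustrate why this ratio matters for a rare class, the sketch below assigns a class to a node by dividing each class share in the node by the corresponding share in the root node and taking the largest ratio. The root counts are those reported in the Data section; the terminal node counts are hypothetical values invented for the example.

```python
def pcr_class(node_counts, root_counts):
    """Return the class with the highest ratio of node share to root share."""
    node_total = sum(node_counts.values())
    root_total = sum(root_counts.values())
    ratios = {
        cls: (node_counts.get(cls, 0) / node_total) / (count / root_total)
        for cls, count in root_counts.items()
        if count > 0
    }
    return max(ratios, key=ratios.get), ratios


# Root node: severity distribution of the full dataset (from the Data section).
root = {"fatal": 429, "serious": 14890, "slight": 57044}
# Hypothetical terminal node, for illustration only.
node = {"fatal": 20, "serious": 300, "slight": 680}

label, ratios = pcr_class(node, root)
```

In this hypothetical node slight injury crashes are still the majority, yet the node is labelled fatal because the share of fatal crashes is more than three times the share observed in the root node.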

4.2. Mixed logit

The MLM represents a generalized version of multinomial logistic regression (Rella Riccardi, Mauriello, Scarano, et al., 2022). Unlike the standard logit model, the MLM allows the coefficients of the variables βj to vary across observations or groups of observations. Thus, the βj coefficients can be decomposed into their means ($\bar{\beta}_j$) and standard deviations ($\sigma_j$) (F. L. Mannering et al., 2016):

$$\beta_{ij} = \bar{\beta}_j + \sigma_j \nu_{ij}$$

where $\nu_{ij}$ is a random term with zero mean and unit variance.

In the context in which the MLM is applied to predict crash injury severity, the propensity of crash i (where i = 1,…, 72,363) towards the severity category j (where j varies from 1 to 3: 1 = slight injury, 2 = serious injury, 3 = fatal) is expressed through the injury-severity function Sij, which is determined by the sum of Vij (the systematic component) and εij (the unobservable stochastic error):

$$S_{ij} = V_{ij} + \varepsilon_{ij} = \boldsymbol{\beta}_j^{\top} \mathbf{X}_{ij} + \varepsilon_{ij}$$

where $\mathbf{X}_{ij}$ is a column vector of characteristics (explanatory variables) that influence the cyclist injury severity level j, $\boldsymbol{\beta}_j$ is a column vector of the parameters to be estimated for the crash severity category j, and $\varepsilon_{ij}$ is the error term, assumed to be independently and identically distributed (McFadden & Train, 2000; Washington et al., 2020).

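The mixed logit probability is a weighted average of the standard logit probabilities evaluated at different values of the parameter vector β. In other words, it is the integral of the standard logit probabilities over the density of β, defined as follows (Alogaili & Mannering, 2020; Train, 2009):

$$P_{ij} = \int \frac{\exp(\boldsymbol{\beta}_j^{\top} \mathbf{X}_{ij})}{\sum_{k=1}^{3} \exp(\boldsymbol{\beta}_k^{\top} \mathbf{X}_{ik})} \, f(\boldsymbol{\beta} \mid \sigma) \, d\boldsymbol{\beta}$$

where $P_{ij}$ is the probability that crash i results in severity category j, f(β|σ) is the continuous density function of the vector β, σ is a vector of parameters that specify the density function (mean and variance), and all other terms are as previously defined (Anastasopoulos & Mannering, 2011). Note that in the special case where f(β|σ) is degenerate at fixed parameters (f(β|σ) = 1 at a single value of β and 0 elsewhere), the model reduces to the standard logit (Washington et al., 2020).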

A normal distribution is chosen for f(β|σ) because of its proven suitability over other distributions in previous studies (Azimi & Asgari, 2023; M. Islam et al., 2022; Uddin & Huynh, 2020).
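Because the integral over the density of β has no closed form, mixed logit probabilities are usually approximated by simulation. The sketch below illustrates the idea with normally distributed coefficients: standard logit probabilities are averaged over random draws. It is a simplified illustration with one indicator variable and three severity categories; the intercepts, means, standard deviations, and function names are hypothetical and are not the estimates or the estimation routine of this study.

```python
import math
import random


def logit_probs(utilities):
    """Standard logit probabilities for a list of systematic utilities."""
    m = max(utilities)  # subtract the maximum for numerical stability
    expu = [math.exp(u - m) for u in utilities]
    total = sum(expu)
    return [e / total for e in expu]


def simulated_mixed_logit(intercepts, means, sds, x, draws=5000, seed=0):
    """Average standard logit probabilities over normal draws of the coefficients.

    intercepts, means, sds: one value per severity category (base category first).
    x: value of a single explanatory (indicator) variable.
    """
    rng = random.Random(seed)
    totals = [0.0] * len(means)
    for _ in range(draws):
        # One draw of the coefficient vector from independent normal densities.
        betas = [rng.gauss(m, s) for m, s in zip(means, sds)]
        probs = logit_probs([a + b * x for a, b in zip(intercepts, betas)])
        totals = [t + p for t, p in zip(totals, probs)]
    return [t / draws for t in totals]


# Hypothetical values; category order: slight (base), serious, fatal.
example_probs = simulated_mixed_logit(
    intercepts=[0.0, -0.3, -1.9],
    means=[0.0, 1.0, 3.3],
    sds=[0.0, 0.0, 0.7],  # only the fatal coefficient is random
    x=1,
)
print(example_probs)
```

Setting all standard deviations to zero makes every draw identical, so the simulated probabilities coincide with the standard logit probabilities.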

To determine which variables should be treated as random parameters, we employed a forward stepwise selection procedure based on the improvement of model fit, as measured by the Akaike Information Criterion (AIC), which balances model complexity and goodness of fit (Burnham & Anderson, 2004). Specifically, we first estimated a base model with only fixed effects and then sequentially introduced random effects for different variables, retaining only those that improved model performance in terms of AIC. Furthermore, since categorical variables were transformed into dummy variables, only specific categories, rather than entire variables, were tested as random.
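The selection logic can be sketched as follows. This is one possible reading of a forward stepwise procedure (at each step, the candidate that lowers the AIC the most is retained), not necessarily the exact routine used here. The `fit` argument is a hypothetical user-supplied function that estimates the model for a given set of random parameters and returns its log-likelihood and number of estimated parameters; `toy_fit` and its values are made up purely to make the example runnable.

```python
def aic(loglik, n_params):
    """Akaike Information Criterion: lower values indicate a better trade-off."""
    return 2 * n_params - 2 * loglik


def forward_select_random(candidates, fit):
    """Greedy forward selection of random parameters by AIC.

    `fit(random_set)` must return (log-likelihood, number of estimated parameters).
    """
    selected = set()
    best_aic = aic(*fit(frozenset(selected)))  # base model: fixed effects only
    improved = True
    while improved:
        improved = False
        best_candidate = None
        for cand in candidates:
            if cand in selected:
                continue
            trial_aic = aic(*fit(frozenset(selected | {cand})))
            if trial_aic < best_aic:
                best_aic, best_candidate = trial_aic, cand
                improved = True
        if best_candidate is not None:
            selected.add(best_candidate)
    return selected, best_aic


# Toy stand-in for model estimation: made-up log-likelihood gains per random parameter.
TOY_GAINS = {"cyclist_age_75plus": 6.0, "cyclist_male": 3.5, "driver_age_55_64": 0.4}


def toy_fit(random_set):
    loglik = -1000.0 + sum(TOY_GAINS[name] for name in random_set)
    n_params = 10 + len(random_set)  # each random parameter adds one standard deviation
    return loglik, n_params


chosen, chosen_aic = forward_select_random(list(TOY_GAINS), toy_fit)
print(sorted(chosen), chosen_aic)
```

In the toy example the third candidate is rejected because its gain in log-likelihood (0.4) does not offset the AIC penalty of one additional parameter.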

The βj coefficients explain the effect of the independent variables. For a proper interpretation of the model results, it is important to assess the magnitude of the impact of the independent variables on the dependent variable. Various methodological approaches are available to evaluate the influence of the explanatory variables on crash severity, including the odds ratio, marginal effects, and elasticities (Lord et al., 2021). Among these, the odds ratio (OR) stands out as the most practical choice (Norton & Dowd, 2018). The OR is the exponential of βj, exp(βj), and shows the proportional increase (OR > 1) or decrease (OR < 1) in the odds of the outcome when the corresponding indicator variable is set to 1 (Lord et al., 2021). Thus, the OR provides a clear measure of the impact of each variable on the outcome and represents a robust and convenient tool for analysing the factors influencing cyclist crash severity.
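As a short worked example of the relationship OR = exp(βj), the snippet below converts two coefficients for fatal crashes reported in Table 5 into odds ratios. The function name is illustrative only.

```python
import math


def odds_ratio(beta):
    """Odds ratio of an indicator variable: exp(beta)."""
    return math.exp(beta)


# Coefficients for fatal crashes as reported in Table 5.
or_speed_40 = odds_ratio(0.47)     # speed limit 40 mph vs. 30 mph
or_roundabout = odds_ratio(-0.43)  # roundabout vs. single carriageway
print(round(or_speed_40, 2), round(or_roundabout, 2))  # 1.6 0.65
```

A positive coefficient yields an OR above 1 (higher odds than the baseline category), whereas a negative coefficient yields an OR below 1.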

The MLM was estimated in the R software environment using the “mlogit” package.

As part of the preliminary analysis, a chi-square test was conducted to identify any significant relationships between our categorical variables. At a significance level of 10%, the test showed no statistically significant relationship between the variables.

4.3. Performance metrics

In cyclist crash severity analysis, assessing model performance is important to ensure a sound understanding of the factors associated with crash severity.

For the MLM, McFadden’s Pseudo R2 was employed:

$$R^2_{McFadden} = 1 - \frac{LL_{full}}{LL_0}$$

where LL0 is the log-likelihood of the null model and LLfull is the log-likelihood of the model including all the statistically significant variables.

McFadden’s Pseudo R2 works as a goodness-of-fit indicator in discrete choice models. Specifically, it shows how much the full model, which incorporates the independent variables, improves on the null model in explaining the data. McFadden’s Pseudo R2 ranges between 0 and 1, with higher values indicating better model performance. A value greater than 0.20 indicates a very good fit (Andreß et al., 2013).

For both the MLM and the CT, we also consider the G-mean and the F-measure as performance indicators. They are composite indicators that integrate several individual metrics into a unified performance measure (Guo et al., 2008).
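As a quick check of the definition, the snippet below computes McFadden’s Pseudo R2 from the two log-likelihood values reported in the note to Table 5 (null model: -79 499; full model: -62 825). The function name is illustrative only.

```python
def mcfadden_r2(ll_null, ll_full):
    """McFadden's pseudo R2: 1 - LL_full / LL_null."""
    return 1.0 - ll_full / ll_null


# Log-likelihood values reported below Table 5.
r2 = mcfadden_r2(ll_null=-79499, ll_full=-62825)
print(round(r2, 2))  # 0.21
```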

The G-mean (Eq. 9) shows how well the model can correctly identify both positive and negative instances:

$$G\text{-mean} = \sqrt{TNR \times TPR} \tag{9}$$

where:

$TNR = \frac{TN}{TN + FP}$ is the true negative rate; TN is the number of true negatives; FP is the number of false positives;

$TPR = \frac{TP}{TP + FN}$ is the true positive rate; TP is the number of true positives; FN is the number of false negatives.

The F-measure (Eq. 11) balances precision and recall, providing an assessment of the model performance in terms of the classification of true cases. Thus, a higher F-measure value reflects a better trade-off between precision and recall.

$$F\text{-measure} = \frac{(1 + \beta^2) \times Precision \times Recall}{\beta^2 \times Precision + Recall} \tag{11}$$

where:

$Precision = \frac{TP}{TP + FP}$ is the accuracy of positive predictions among the instances predicted as positive;

$Recall = \frac{TP}{TP + FN}$ is the model capacity to identify all actual positive instances;

$\beta$ is a coefficient to adjust the relative importance of precision versus recall.

$\beta$ was set to 1, indicating an equal emphasis on both precision and recall (Bekkar et al., 2013).
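The two composite indicators can be computed from one-vs-rest confusion-matrix counts as sketched below. The counts are hypothetical (they are not taken from the study) and were chosen to mimic a rare class such as fatal crashes; the function names are illustrative, and the F-measure uses the standard F-beta form with beta = 1.

```python
import math


def g_mean(tp, fn, tn, fp):
    """Geometric mean of the true positive rate and the true negative rate."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return math.sqrt(tpr * tnr)


def f_measure(tp, fn, fp, beta=1.0):
    """F-measure; beta = 1 gives equal weight to precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical one-vs-rest counts for a rare class (illustrative only).
tp, fn, fp, tn = 80, 20, 2400, 7500
g = g_mean(tp, fn, tn, fp)
f = f_measure(tp, fn, fp)
print(round(g, 2), round(f, 2))  # 0.78 0.06
```

The example shows that, for a rare class, a high G-mean can coexist with a low F-measure: most positive cases are detected, but the false positives far outnumber the true positives, so precision is low.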

5. Results

5.1. Classification tree

The CT is composed of 13 terminal nodes (Figure 1). Among these, 7 are associated with fatal crashes, 4 with serious injuries, and 2 with slight injuries.

The CT starts with a first split based on the speed limit. The following splits are based on nine specific variables, including the manoeuvre of the second vehicle, the type of the second vehicle, driver age, junction type and details, the first road class, the first point of impact of the second vehicle, cyclist age, and whether the second vehicle hit an object in the carriageway.

The PCR was calculated for all terminal nodes to evaluate how well each node predicts a crash severity class (Table 3). Nodes 15, 17, and 21 show the highest PCRs for the fatal class (10.62, 8.97, and 7.71, respectively), highlighting their effectiveness in predicting fatal crashes.

Node 15 has a PCR of 10.62. It describes a scenario in which the speed limit is 40 mph or higher, the second vehicle is overtaking or going ahead, the crash occurs outside an intersection or at a crossroads, and the first point of impact of the second vehicle is its front. This combination of factors creates a dangerous situation that increases the likelihood of a fatal crash.

Node 17, which has a PCR of 8.97, follows a first split on speed limits of 20 or 30 mph. At these speed limits, fatal crashes are associated with specific conditions: truck involvement, a second-vehicle driver aged between 25 and 74 years, and a first road class of A or motorway. Together, these conditions create a hazardous scenario in which even moderate speeds can result in fatal cyclist crashes.

Node 21 has a PCR of 7.71 and represents another dangerous scenario, which includes a speed limit of 30 mph and the involvement of a car or a two-wheeler (e.g., bike or powered two-wheeler, PTW) going ahead or overtaking and colliding with a kerb. The collision with the kerb increases the proportion of fatal crashes from 0.003 (node 13) to 0.046 (node 21).

Regarding variable importance, Figure 2 shows the variables with normalized importance above 20%. Six of these variables have a normalized importance greater than 50%, indicating that they play a major role in classifying crash severity.

Figure 1. Classification tree
Table 3. PCR for terminal nodes
Terminal node   PCR (Fatal)   PCR (Serious)   PCR (Slight)   Predicted class
7               5.15          1.89            0.74           Fatal
8               0.21          1.27            0.93           Serious
10              0.33          1.06            0.99           Serious
15              10.62         1.96            0.68           Fatal
16              2.46          1.42            0.88           Fatal
17              8.97          1.12            0.91           Fatal
18              2.37          1.34            0.90           Fatal
19              2.47          1.53            0.85           Fatal
20              0.50          1.30            0.93           Serious
21              7.71          1.08            0.93           Fatal
22              0.54          0.95            1.02           Slight
23              0.54          1.19            0.95           Serious
24              0.04          0.75            1.07           Slight
Figure 2. Normalized importance of the variables (%)

The results of the CT analysis were used as input for the MLM. Specifically, we identified the two terminal nodes with the highest PCR for fatal crashes, namely nodes 15 and 17, and traced their pathways back to the root node. This yielded six interactions: 1-4, 4-9, and 9-15 along the path to node 15, and 2-5, 5-11, and 11-17 along the path to node 17.

The performance metrics, expressed through F-measure and G-mean, provide an understanding of our model’s predictive capabilities (Table 4).

For fatal crashes, the G-mean reaches a notably high value of 0.78, indicating that the model identifies both positive and negative instances in a balanced way. However, the F-measure, which balances precision and recall, is much lower (0.06), which points to low precision for this rare severity level: many crashes classified as fatal are not actually fatal.

For serious injury crashes, the F-measure is higher (0.20), indicating a better balance between precision and recall. Conversely, the G-mean for this category is lower (0.38), indicating a reduced capacity of the model to recognize both positive and negative instances in a balanced manner.

Table 4. CT performance metrics
Severity level    F-measure   G-mean
Fatal             0.06        0.78
Serious injury    0.20        0.38
Slight injury     0.79        0.42

5.2. Mixed logit

The findings for both the fixed and random parameters are presented in Table 5, where, for each significant variable (p-value < 0.05), we report the estimated coefficient (βj) and its OR.

To refine the regression analysis and explore potential interaction effects, we incorporated the interactions derived from the CT analysis as dummy variables: interaction 1-4, interaction 4-9, interaction 9-15, interaction 2-5, interaction 5-11, interaction 11-17.
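The construction of such an interaction dummy can be sketched as follows: the variable equals 1 only when a crash satisfies all the split conditions joined by the interaction. The conditions for interaction 1-4 follow its description in Table 5; the field names, category labels, and example records are hypothetical.

```python
def interaction_dummy(record, conditions):
    """Return 1 if the record satisfies every condition of the interaction, else 0.

    `conditions` maps a field name to the set of categories allowed at that split.
    """
    return int(all(record.get(field) in allowed for field, allowed in conditions.items()))


# Interaction 1-4 as described in Table 5 (field names and labels are hypothetical).
INTERACTION_1_4 = {
    "speed_limit": {"40", ">=50"},
    "v2_manoeuvre": {"going ahead", "overtaking"},
}

crash_a = {"speed_limit": ">=50", "v2_manoeuvre": "overtaking"}
crash_b = {"speed_limit": "30", "v2_manoeuvre": "overtaking"}
flags = [interaction_dummy(c, INTERACTION_1_4) for c in (crash_a, crash_b)]
print(flags)  # [1, 0]
```

The resulting 0/1 column can then enter the regression like any other indicator variable.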

In estimating fatal crashes, 21 explanatory variables and 45 indicator variables emerged as statistically significant, in addition to 6 interaction variables. For serious outcomes, 17 explanatory variables, 30 indicator variables, and 5 interaction variables were statistically significant.

For fixed parameters, for each severity category, positive β coefficients indicate an increase in odds and negative β coefficients indicate a decrease in odds, relative to the reference category.

Interpreting the random parameters is more complex. The normal distribution of a random parameter implies that the variable increases the probability of a given severity level for some observations and decreases it for others. In our model, cyclist age ≥75 (specific to fatal crashes) is a random parameter with a mean of 3.29 and a standard deviation of 0.66. According to the normal distribution, more than 99.9% of the crashes involving an elderly cyclist had a higher probability of resulting in a fatality, while the remaining small share (< 0.1%) were more likely to result in slight or serious injuries. Another significant random parameter is the indicator variable driver age = 55-64 (specific to the serious injury level), with a mean of -0.12 and a standard deviation of 0.51. Because the mean is negative, the distribution indicates that 59.0% of the crashes in which the cyclist collides with a driver aged between 55 and 64 years had a lower probability of resulting in a serious injury, while the remaining 41.0% had a higher probability. Additionally, the indicator variable cyclist gender = male is a significant random parameter for both fatal and serious outcomes. For fatal crashes, its mean and standard deviation were 0.37 and 0.97 (in absolute value), respectively, indicating that 64.7% of the crashes involving a male cyclist had a higher probability of resulting in a fatal outcome, while 35.3% were more likely to result in a slight or serious injury. For serious outcomes, its mean and standard deviation were -0.11 and 1.56, respectively. Here the negative mean implies that 52.8% of the crashes involving a male cyclist had a lower probability of resulting in a serious injury, while 47.2% had a higher probability.
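The shares quoted for a normally distributed random parameter follow from the standard normal cumulative distribution function: the proportion of observations with a positive parameter is Φ(mean / standard deviation). The sketch below applies this to the rounded estimates for fatal crashes in Table 5; the function name is illustrative, and small differences from the shares quoted in the text (a few tenths of a percentage point) are to be expected because the inputs are rounded.

```python
import math


def share_positive(mean, sd):
    """Share of a normal(mean, sd) distribution lying above zero."""
    z = mean / abs(sd)  # the sign of an estimated standard deviation is irrelevant
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Rounded estimates for fatal crashes from Table 5.
share_age_75 = share_positive(3.29, 0.66)  # cyclist age >= 75
share_male = share_positive(0.37, 0.97)    # cyclist gender = male
print(f"{share_age_75:.4f} {share_male:.3f}")
```

A parameter with a mean of zero would be positive for exactly half of the observations, and a negative mean places more than half of the distribution below zero.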

Among the fixed parameters, a high speed limit has a marked effect on crash severity. The OR of fatal crashes associated with a speed limit of 40 mph (with a baseline of 30 mph) is 1.60 and increases to 2.96 for speed limits ≥ 50 mph. If a cyclist crash occurs in a rural area, the probability of both fatal (OR = 1.78) and serious injury crashes (OR = 1.25) increases. Compared to the single carriageway, the road type with the lowest probability of a fatal crash is the roundabout, with an OR of 0.65, whereas dual carriageways and slip roads have ORs of 1.24 and 2.06, respectively. Darkness shows a greater propensity towards the most severe crashes, with an OR of 1.66 for fatal crashes and 1.13 for serious crashes. Wet/frozen pavement is significant for both fatal crashes (OR = 1.50) and serious crashes (OR = 1.09), and weekends increase the probability of both fatal and serious crashes, with ORs of 1.53 and 1.16, respectively. A bike leaving the carriageway nearside, offside, or straight ahead is strongly associated with crash severity, with ORs of 3.17, 12.91, and 18.95, respectively, for fatal crashes. With a vehicle 2 engine capacity of 1501-2000 as the baseline, an engine capacity >3000 is associated with a higher probability of both fatal crashes (OR = 2.24) and serious crashes (OR = 1.30). When the second vehicle skids or overturns, there is a significant increase in the likelihood of both fatal (OR = 2.79) and serious injury crashes (OR = 1.60). Regarding driver-related factors, young drivers (≤17 and 18-24 years old) are associated with an increased probability of fatal crashes. The most influential variable is cyclist age. Compared to cyclists aged 25-34, older cyclists face an increasing probability of fatal crashes: 35-44 with an OR of 1.21, 45-54 with an OR of 1.37, 55-64 with an OR of 3.57, 65-74 with an OR of 7.26, and ≥75 (random parameter) with an OR of 26.90.

The results of the MLM also confirm that all tested interaction terms are statistically significant, highlighting the risk associated with specific combinations of factors. The interaction between higher speed limits (≥40 mph) and the manoeuvre of the second vehicle (going ahead or overtaking) is associated with an increase in the probability of fatal crashes (interaction 1-4, OR = 1.48). However, this effect is not significant for serious crashes, suggesting that crashes occurring under these conditions are more likely to be fatal than serious. Similarly, the interaction between the second vehicle’s manoeuvre and the junction type (not at a junction, crossroads, or other junctions) has a positive effect on both fatal (OR = 2.22) and serious crashes (OR = 1.23) (interaction 4-9). This finding indicates that crashes occurring outside intersections, or at crossroads, become more severe when the second vehicle is going ahead or overtaking.

Another significant interaction is observed between junction type and the first point of impact of the second vehicle (front) (interaction 9-15). This interaction increases the probability of fatal crashes (OR = 1.55). Speed limits of 20–30 mph combined with the involvement of a truck also affect crash severity (interaction 2-5). While this interaction increases the likelihood of fatal crashes (OR = 1.37), it is negatively associated with serious crashes (OR = 0.66). This suggests that although lower-speed environments generally reduce crash severity, the presence of trucks can still result in fatal consequences.

The most pronounced effect is observed for the interaction between the involvement of a truck and the age of the second driver (25–74 years old) (interaction 5-11). This interaction significantly increases the likelihood of both fatal (OR = 6.04) and serious crashes (OR = 2.53).

Finally, the interaction between driver age and the type of road (A roads or motorways) is associated with an increase in fatal crash risk (OR = 1.49) and a slight increase in serious crashes (OR = 1.06) (interaction 11-17). This suggests that driving behaviour among older age groups, particularly in high-speed road environments, can influence crash severity outcomes.

Thus, the results indicate that specific combinations of factors, such as speed limits, vehicle type, driver age, and road design, can significantly increase crash severity risks.

Table 5. Estimated parameters and OR for cyclist crash severity
Variable   Fatal: β   OR   Std. Err.   p-value   Serious: β   OR   Std. Err.   p-value
Intercept -1.86 0.16 0.06 <0.001 -0.33 0.72 0.04 0.01
Speed Limit (30 mph as baseline)
20 -0.08 0.92 0.03 <0.001
40 0.47 1.60 0.06 <0.001 0.17 1.19 0.05 <0.001
≥ 50 1.08 2.96 0.06 <0.001 0.50 1.64 0.06 <0.001
Area (Urban as baseline)
Rural 0.57 1.78 0.04 <0.001 0.22 1.25 0.03 <0.001
Junction control (Not at junction or within 20m as baseline)
Give way/Stop -0.53 0.59 0.03 <0.001 -0.05 0.95 0.02 0.02
Traffic lights 0.32 1.38 0.05 <0.001
Pedestrian crossing physical facilities (No physical crossing facilities within 50m baseline)
Central refuge 0.69 2.00 0.06 <0.001
Pedestrian phase at traffic signal junction 0.14 1.15 0.05 <0.001 -0.19 0.83 0.04 <0.001
Zebra 0.22 1.24 0.04 <0.001 -0.15 0.86 0.04 0.00
Road type (Single carriageway as baseline)
Dual carriageway 0.22 1.24 0.04 <0.001 0.09 1.09 0.04 0.01
Roundabout -0.43 0.65 0.05 <0.001 -0.10 0.90 0.03 0.00
Slip road 0.72 2.06 0.13 <0.001
Lighting (Daylight as baseline)
Darkness 0.51 1.66 0.03 <0.001 0.12 1.13 0.02 <0.001
Pavement (Dry as baseline)
Wet/Frozen 0.41 1.50 0.04 <0.001 0.08 1.09 0.03 <0.001
Weather (Clear as baseline)
Raining -0.48 0.62 0.05 <0.001 -0.19 0.83 0.04 <0.001
Day of week (Weekday as baseline)
Weekend 0.43 1.53 0.03 <0.001 0.15 1.16 0.02 <0.001
Number of bikes (1 as baseline)
>1 0.61 1.83 0.10 <0.001 0.77 2.17 0.08 <0.001
Bike 1st point of impact (No impact as baseline)
Back -0.55 0.57 0.05 -0.32 0.73 0.04 <0.001
Front -0.35 0.71 0.04 <0.001 0.06 1.07 0.03 0.05
Nearside/Offside -0.63 0.53 0.05 <0.001 -0.14 0.87 0.03 <0.001
Bike leaving carriageway (No as baseline)
Nearside 1.15 3.17 0.07 <0.001 0.35 1.42 0.06 <0.001
Offside 2.56 12.91 0.19 <0.001 0.68 1.98 0.20 <0.001
Straight 2.94 18.95 0.34 <0.001
Vehicle 2 skidding and overturning (No as baseline)
Yes 1.03 2.79 0.11 <0.001 0.47 1.60 0.10 <0.001
Vehicle 2 engine capacity (1501-2000 as baseline)
≤ 1000 0.22 1.25 0.05 <0.001 -0.08 0.92 0.04 0.04
1000-1500 -0.26 0.77 0.03 <0.001 -0.09 0.92 0.02 <0.001
>3000 0.81 2.24 0.08 <0.001 0.26 1.30 0.06 <0.001
Vehicle 2 age (≤ 15 as baseline)
>15 -0.42 0.65 0.06 <0.001 -0.49 0.61 0.01 <0.001
Driver 2 gender (Female as baseline)
Male -0.08 0.92 0.02 <0.001
Driver 2 age (25-34 as baseline)
≤17 0.84 2.33 0.14 <0.001 0.20 1.22 0.12 0.10
18-24 0.33 1.40 0.05 <0.001 0.21 1.24 0.04 <0.001
35-44 <0.001
45-54 -0.36 0.70 0.04 <0.001
55-64 (random variable for serious crashes) 0.15 1.16 0.04 <0.001 -0.12 0.89 0.05 0.01
St. dev. 55-64 0.51 0.19 0.01
65-74 -0.72 0.48 0.06 <0.001
≥75 -0.19 0.83 0.07 <0.001 0.26 1.30 0.05 <0.001
Cyclist IMD decile (Less deprived as baseline)
More deprived -0.11 0.89 0.02 <0.001
Cyclist gender (Female as baseline)
Male (random variable for serious and fatal crashes) 0.37 1.44 0.02 <0.001 -0.11 0.90 0.03 <0.001
St. dev. Male -0.97 0.03 <0.001 1.56 0.06 <0.001
Cyclist age (25-34 as baseline)
≤17 -0.40 0.67 0.04 <0.001 -0.10 0.91 0.03 <0.001
18-24 -0.11 0.90 0.04 0.01
35-44 0.19 1.21 0.04 <0.001 0.17 1.19 0.03 <0.001
45-54 0.31 1.37 0.04 <0.001 0.35 1.42 0.03 <0.001
55-64 1.27 3.57 0.04 <0.001 0.54 1.72 0.04 <0.001
65-74 1.98 7.26 0.07 <0.001 0.68 1.98 0.06 <0.001
≥75 (random variable for fatal crashes) 3.29 26.90 0.07 <0.001 0.97 2.64 0.12 <0.001
St. dev. ≥75 -0.66 0.10 <0.001
Cyclist journey purpose (Commuting to from work as baseline)
Journey as part of work -0.51 0.60 0.06 <0.001 -0.06 0.94 0.03 0.08
To/from school -0.23 0.79 0.05 <0.001
Interactions
Inter 1-4 (SL= 40, ≥50 – V2man= going ahead, overtaking) 0.39 1.48 0.07 <0.001
Inter 4-9 (V2man = going ahead, overtaking – JunDet= not at junction, crossroads, other junctions) 0.80 2.22 0.03 <0.001 0.20 1.23 0.02 <0.001
Inter 9-15 (JunDet= not at junction, crossroads, other junctions – V2FPI= front) 0.44 1.55 0.03 <0.001 -0.08 0.92 0.02 <0.001
Inter 2-5 (SL=30, 20 – V2Type= truck, other) 0.32 1.37 0.10 <0.001 -0.41 0.66 0.09 <0.001
Inter 5-11 (V2Type= truck, other – D2age=25-34, 35-44, 45-54, 55-64, 65-74) 1.80 6.04 0.12 <0.001 0.93 2.53 0.12 <0.001
Inter 11-17 (D2age=25-34, 35-44, 45-54, 55-64, 65-74 – FRC= A, Motorway) 0.40 1.49 0.03 <0.001 0.06 1.06 0.02 0.01

Log likelihood null model =-79 499; Log likelihood full model = -62 825.
Abbreviations: SL= speed limit; V2man= vehicle 2 manoeuvre; JunDet= Junction detail; V2FPI= Vehicle 2 First point of impact; V2Type= Vehicle 2 type; D2age= Driver 2 age, FRC= First road class

The MLM shows a good fit, with a McFadden Pseudo R2 of 0.21. The F-measure is reasonable for serious injuries (0.29) but less satisfactory for fatal crashes (0.06). The G-mean reveals differences in classification performance, with the model being more effective for fatal crashes (0.80) than for serious injuries (0.51) (Table 6).

Table 6. MLM performance metrics
Severity level    F-measure   G-mean   McFadden R2
Fatal             0.06        0.80     0.21
Serious injury    0.29        0.51
Slight injury     0.73        0.59

6. Discussion

The study findings confirm the significant role of speed, with higher speed limits increasing the probability of serious and fatal outcomes. These results align with previous research (Boufous et al., 2012; Isaksson-Hellman & Toreki, 2019; Kaplan et al., 2014; Montella et al., 2015; Z. Wang et al., 2021). Higher speeds are particularly critical when the second vehicle is going ahead or overtaking (interaction 1-4). However, even at moderate speed limits (20-30 mph), specific conditions, such as the involvement of a truck, can increase fatality risk (interaction 2-5). To reduce these risks, speed management strategies and bike network improvements are needed, as shown by other researchers (Damsere-Derry & Bawa, 2018; Samerei et al., 2021; Tuckel, 2021). Furthermore, developing cycle paths and marked lanes, physically separating cyclists from other road users, can significantly improve cyclist safety. To accommodate human error, it is recommended to implement “forgiving” sloped kerbs, given that studies show their superiority over right-angled kerbs (Janssen et al., 2018).

Among road-related variables, crashes occurring in rural areas, which are characterized by higher speed limits, are significantly more likely to result in fatal or serious injuries. Furthermore, roundabouts are associated with a lower risk of both fatal and serious crashes. Roundabouts achieve this result by forcing vehicles to reduce speed through deflection, as shown by previous studies (Gross et al., 2013; Montella, 2007, 2011; Rella Riccardi, Augeri, et al., 2022; Silvano & Linder, 2017). On the other hand, the interaction between the second vehicle’s manoeuvre (going ahead or overtaking) and specific junction types (crossroads or other junctions) highlights an increased risk of severe crashes in these scenarios (interaction 4-9). Additionally, crashes where the second vehicle’s first point of impact is the front, occurring at these same junction types, show a higher probability of being fatal (interaction 9-15). These findings suggest that intersection safety needs to be improved, for example by implementing junctions with clear cyclist priority and improved road markings.

Beyond road-related factors, environmental conditions are strongly associated with cyclist crash severity. In particular, lighting is a critical factor, since crashes occurring at night are associated with an increased probability of fatal and serious outcomes. The use of light-emitting diode (LED) lighting systems, the enforcement of mandatory bike lights, and the promotion of reflective clothing can improve visibility at night and on unlit streets. Moreover, wet or frozen pavements are identified as significant contributors to an increased likelihood of fatal cyclist crashes. These observations are in line with the findings of previous research (Wahi et al., 2018; C. Wang et al., 2015). Improving road surface conditions is also necessary, particularly in wet or frozen conditions where skidding is more likely. Permeable surfaces, such as permeable asphalt or concrete, help water drain into the ground and reduce the risk of skidding.

The analysis also highlights the impact of the second vehicle’s characteristics on crash severity. In particular, truck involvement is associated with increased fatality risks for cyclists, especially when the truck is driven by a driver aged 25–74 (interaction 5-11). This interaction shows that driving experience is not always enough to avoid dangers when operating heavy vehicles. Interactions between trucks and cyclists often involve complex spatial dynamics, longer braking distances for trucks, and blind spots that may limit the truck driver’s capacity to detect cyclists. These results are in line with the findings of previous research (Chen & Shen, 2016; Damsere-Derry & Bawa, 2018; Joo et al., 2017; Kaplan et al., 2014; van Haperen et al., 2018). The development and integration of safety systems in trucks, particularly active safety systems, can improve road safety. Moreover, placing mirrors on signal posts can help eliminate blind spots for truck drivers, improving visibility and reducing crash risks.

Regarding bike-related factors, a bike leaving the carriageway offside is associated with more severe outcomes. This can be attributed to the increased exposure of the cyclist to collisions with other motor vehicles or obstacles. To address this issue, effective strategies are needed to promote safe riding behaviours, such as targeted road safety education and the development of safe infrastructure with appropriate and clear road markings and signs.

As regards cyclist characteristics, the model results indicate a correlation between cyclist crash severity and the ages of both the cyclist and the driver. In fact, as cyclist age increases, there is a corresponding rise in the probability of serious or fatal outcomes. Conversely, an inverse relationship is observed for the age of the driver, with younger drivers associated with an increased likelihood of severe consequences. Moreover, the interaction between driver age and road class shows that crashes on A roads and motorways are more likely to result in severe outcomes when the driver is between 25 and 74 years old (interaction 11-17). These findings are in line with previous studies (Oikawa et al., 2019; Scholes et al., 2018; C. Wang et al., 2015). To address these critical factors, targeted interventions are needed, such as awareness campaigns emphasizing safe cycling practices or driver education programs. Furthermore, advanced driver assistance systems in vehicles may reduce the crash risk.

7. Conclusions

Road crash statistics indicate that, despite the efforts made by the member states of the European Union and the United Nations to protect VRUs, coupled with significant scientific and engineering progress, the issue of cyclist crashes cannot be deemed resolved. Thus, especially now that a growing number of cities are promoting cycling as an economical, green, healthy, and efficient mode of transport, cycling safety deserves attention as well as in-depth research. To identify patterns associated with severe injuries and fatal crashes involving cyclists and to improve cyclist safety, this research combined a traditional econometric model, the MLM, with an emerging machine learning methodology, the CT algorithm.

The study focused on the analysis of cyclist crash data spanning the period from 2016 to 2019 in Great Britain. The CT model served as an exploratory tool to detect patterns and potential interactions that might not have been hypothesized a priori, whereas the MLM was used to test all variables consistently, quantify global effects, and obtain statistical significance. Furthermore, the interactions derived from the CT analysis were incorporated into the MLM. Interestingly, all the interactions were statistically significant.

The CT model is explicitly used as a supporting tool to identify potential interactions, while conclusions are extracted from the MLM results. Based on the identified risk factors, a set of targeted safety countermeasures has been proposed to minimize cyclist crash severity and improve overall road safety.

In conclusion, this research not only shows the persistent and significant issue of cyclist crashes but also introduces a new perspective in their understanding and management. Thus, the study ensures that the factors influencing cyclist crash severity are tested in a statistically consistent manner within the MLM, using insights from the CT model to refine the regression analysis. Furthermore, these insights are translated into actionable countermeasures, offering a detailed plan aimed at mitigating cyclist crash severity and improving overall road safety.


CRediT contribution statement

Antonella Scarano: Conceptualization, Formal analysis, Methodology, Software, Writing—original draft, Writing—review & editing. Maria Rella Riccardi: Conceptualization, Methodology, Writing—review & editing. Filomena Mauriello: Conceptualization, Methodology, Writing—review & editing. Carmelo D’Agostino: Conceptualization, Writing—review & editing. Alfonso Montella: Conceptualization, Methodology, Supervision, Writing—review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Declaration of generative AI use in writing

The authors declare that no generative AI was used in this work.

Acknowledgement

An earlier version of this work was presented as a poster at the 35th ICTCT Conference in Catania, Italy, on 26–27 October 2023, titled “Challenges and Actual Opportunities Offered by New Technologies to Improve Traffic Safety.” It was awarded the Best Presentation Award.

Ethics statement

Crash data used in this research were collected by the Department for Transport (GOV.UK) and were retrieved from the STATS19 dataset (https://www.gov.uk/transport-statistics-notes-and-guidance-road-accident-and-safety). Thus, this research was exempt from ethical approval.

Funding

No external funding was used in this research.

Editorial information

One of the authors, Carmelo D’Agostino, is a member of the editorial board of the Traffic Safety Research journal. To minimize the risk of a conflict of interest during the review process, an external guest editor was invited to handle this paper.

Guest editor: Ragnhild Davidse, SWOV Institute for Road Safety Research, the Netherlands

Reviewer: Paul Schepers, Ministry of Infrastructure and Water Management, the Netherlands

Appendix A

Table A1. Descriptive statistics related to crash data (Part A)
Variable Fatal Serious Slight Total
N % N % N % N %
Crash severity 429 0.59 14 890 20.58 57 044 78.83 72 363 100.00
First road class
A 213 0.29 5 865 8.10 23 298 32.20 29 376 40.60
B 63 0.09 1 972 2.73 6 787 9.38 8 822 12.19
C 29 0.04 930 1.29 3 889 5.37 4 848 6.70
Motorway 1 0.00 1 0.00 3 0.00 5 0.01
Missing 123 0.17 6 122 8.46 23 067 31.88 29 312 40.51
Road type
Single carriageway 341 0.47 11 694 16.16 43 581 60.23 55 616 76.86
Dual carriageway 59 0.08 1 154 1.59 4 082 5.64 5 295 7.32
One way street 8 0.01 378 0.52 1 784 2.47 2 170 3.00
Roundabout 16 0.02 1 383 1.91 5 920 8.18 7 319 10.11
Slip road 4 0.01 83 0.11 353 0.49 440 0.61
Missing 1 0.00 198 0.27 1 324 1.83 1 523 2.10
Speed limit (mph)
20 24 0.03 1 600 2.21 7 830 10.82 9 454 13.06
30 187 0.26 10 379 14.34 42 868 59.24 53 434 73.84
40 46 0.06 1 028 1.42 3 076 4.25 4 150 5.73
≥50 172 0.24 1 880 2.60 3 254 4.50 5 306 7.33
Missing 0 0.00 3 0.00 16 0.02 19 0.03
Junction detail
Not at junction 240 0.33 4 858 6.71 15 212 21.02 20 310 28.07
Crossroads 144 0.20 7 007 9.68 28 820 39.83 35 971 49.71
Other junctions 22 0.03 1 073 1.48 4 629 6.40 5 724 7.91
Roundabout 23 0.03 1 886 2.61 7 896 10.91 9 805 13.55
Missing 0 0.00 66 0.09 487 0.67 553 0.76
Junction control
Not at junction or within 20 metres 240 0.33 4 858 6.71 15 212 21.02 20 310 28.07
Traffic lights 57 0.08 1 479 2.04 6 374 8.81 7 910 10.93
Give way/Stop 132 0.18 8 310 11.48 33 310 46.03 41 752 57.70
Missing 0 0.00 243 0.34 2 148 2.97 2 391 3.30
Second road class
A 33 0.05 1 424 1.97 6 354 8.78 7 811 10.79
B 13 0.02 681 0.94 2 676 3.70 3 370 4.66
C 12 0.02 581 0.80 2 741 3.79 3 334 4.61
Motorway 1 0.00 21 0.03 56 0.08 78 0.11
Missing 370 0.51 12 183 16.84 45 217 62.49 57 770 79.83
Pedestrian crossing physical facilities
No physical crossing facilities within 50 metres 335 0.46 11 419 15.78 40 788 56.37 52 542 72.61
Central refuge 17 0.02 458 0.63 1 896 2.62 2 371 3.28
Pedestrian phase at traffic signal junction 45 0.06 1 169 1.62 5 517 7.62 6 731 9.30
Pelican, puffin, toucan or similar non-junction pedestrian light crossing 25 0.03 1 006 1.39 3 817 5.27 4 848 6.70
Zebra 6 0.01 579 0.80 2 921 4.04 3 506 4.85
Missing 1 0.00 259 0.36 2 105 2.91 2 365 3.27

Table A2. Descriptive statistics related to crash data (Part B)
Variable Fatal Serious Slight Total
N % N % N % N %
Area
Urban 200 0.28 11 106 15.35 48 276 66.71 59 582 82.34
Rural 229 0.32 3 784 5.23 8 768 12.12 12 781 17.66
Day of week
Weekday 306 0.42 11 489 15.88 46 367 64.08 58 162 80.38
Weekend 123 0.17 3 401 4.70 10 677 14.75 14 201 19.62
Lighting
Daylight 317 0.44 11 647 16.10 44 798 61.91 56 762 78.44
Darkness 112 0.15 3 243 4.48 12 246 16.92 15 601 21.56
Weather
Clear 377 0.52 13 006 17.97 48 430 66.93 61 813 85.42
Raining 33 0.05 1 158 1.60 5 022 6.94 6 213 8.59
Other 9 0.01 272 0.38 1 254 1.73 1 535 2.12
Missing 10 0.01 454 0.63 2 338 3.23 2 802 3.87
Pavement
Dry 334 0.46 11 923 16.48 45 387 62.72 57 644 79.66
Wet/frozen 94 0.13 2 827 3.91 10 730 14.83 13 651 18.86
Missing 1 0.00 140 0.19 927 1.28 1 068 1.48
Number of bikes
1 413 0.57 14 479 20.01 56 420 77.97 71 312 98.55
>1 16 0.02 411 0.57 624 0.86 1 051 1.45
Bike skidding and overturning
No 357 0.49 12 599 17.41 48 250 66.68 61 206 84.58
Yes 71 0.10 1 847 2.55 4 669 6.45 6 587 9.10
Missing 1 0.00 444 0.61 4 125 5.70 4 570 6.32
Bike leaving carriageway*
No 343 0.47 13 711 18.95 51 551 71.24 65 605 90.66
Nearside 63 0.09 622 0.86 1 304 1.80 1 989 2.75
Offside 19 0.03 117 0.16 171 0.24 307 0.42
Missing 2 0.00 426 0.59 3 961 5.47 4 389 6.07
Bike hit off carriageway
None 400 0.55 14 284 19.74 52 898 73.10 67 582 93.39
Barrier/Pole/Tree/Wall 8 0.01 120 0.17 152 0.21 280 0.39
Other 21 0.03 102 0.14 204 0.28 327 0.45
Missing 0 0.00 384 0.53 3 790 5.24 4 174 5.77
Bike 1st point of Impact
No impact 42 0.06 1 121 1.55 2 798 3.87 3 961 5.47
Back 105 0.15 1 457 2.01 6 449 8.91 8 011 11.07
Front 157 0.22 7 724 10.67 27 800 38.42 35 681 49.31
Nearside/Offside 124 0.17 4 263 5.89 17 303 23.91 21 690 29.97
Missing 1 0.00 325 0.45 2 694 3.72 3 020 4.17
Bike object hit in carriageway
None 404 0.56 13 671 18.89 51 086 70.60 65 161 90.05
Kerb 14 0.02 200 0.28 423 0.58 637 0.88
Parked vehicle 6 0.01 461 0.64 1 290 1.78 1 757 2.43
Other 3 0.00 120 0.17 161 0.22 284 0.39
Missing 2 0.00 438 0.61 4 084 5.64 4 524 6.25
Bike manoeuvre
Going ahead 362 0.50 11 753 16.24 42 128 58.22 54 243 74.96
Moving off 9 0.01 360 0.50 1 696 2.34 2 065 2.85
Overtaking 7 0.01 633 0.87 2 462 3.40 3 102 4.29
Turning left/right/U/Reversing 37 0.05 1 100 1.52 4 011 5.54 5 148 7.11
Other 13 0.02 635 0.88 2 947 4.07 3 595 4.97
Missing 1 0.00 409 0.57 3 800 5.25 4 210 5.82

Abbreviations: na = not admissible
*Bike Leaving Carriageway indicates whether the cyclist left the main roadway, either due to a single-bike crash or a collision with another vehicle.

Table A3. Descriptive statistics related to crash data (Part C)
Variable Fatal Serious Slight Total
N % N % N % N %
Vehicle 2 skidding and overturning
No 336 0.46 12 638 17.46 49 920 68.99 62 894 86.91
Yes 21 0.03 241 0.33 520 0.72 782 1.08
na 71 0.10 1 631 2.25 2 850 3.94 4 552 6.29
Missing 1 0.00 380 0.53 3 754 5.19 4 135 5.71
Vehicle 2 leaving carriageway
No 330 0.46 12 665 17.50 50 066 69.19 63 061 87.15
Nearside 12 0.02 150 0.21 388 0.54 550 0.76
Offside 15 0.02 80 0.11 152 0.21 247 0.34
na 71 0.10 1 631 2.25 2 850 3.94 4 552 6.29
Missing 1 0.00 364 0.50 3 588 4.96 3 953 5.46
Vehicle 2 hit off carriageway
None 346 0.48 12 849 17.76 50 652 70.00 63 847 88.23
Barrier/Pole/Tree/Wall 9 0.01 54 0.07 83 0.11 146 0.20
Other 3 0.00 35 0.05 59 0.08 97 0.13
na 71 0.10 1 631 2.25 2 850 3.94 4 552 6.29
Missing 0 0.00 321 0.44 3 400 4.70 3 721 5.14
Vehicle 2 1st point of impact
No impact 10 0.01 860 1.19 2 680 3.70 3 550 4.91
Back 13 0.02 807 1.12 3 202 4.42 4 022 5.56
Front 246 0.34 6 251 8.64 25 419 35.13 31 916 44.11
Nearside/Offside 89 0.12 5 087 7.03 20 569 28.42 25 745 35.58
na 71 0.10 1 631 2.25 2 850 3.94 4 552 6.29
Missing 0 0.00 254 0.35 2 324 3.21 2 578 3.56
Vehicle 2 engine capacity
≤1000 26 0.04 749 1.04 3 024 4.18 3 799 5.25
1001-1500 54 0.07 2 809 3.88 11 413 15.77 14 276 19.73
1501-2000 106 0.15 4 428 6.12 18 572 25.67 23 106 31.93
2001-3000 38 0.05 1 359 1.88 5 301 7.33 6 698 9.26
>3000 78 0.11 520 0.72 1 439 1.99 2 037 2.81
Missing 50 0.07 3 165 4.37 14 140 19.54 17 355 23.98
na 77 0.11 1 860 2.57 3 155 4.36 5 092 7.04
Vehicle 2 propulsion code
Petrol 113 0.16 4 979 6.88 20 294 28.04 25 386 35.08
Heavy oil 189 0.26 4 626 6.39 18 015 24.90 22 830 31.55
Hybrid electric 2 0.00 267 0.37 1 409 1.95 1 678 2.32
Other 1 0.00 38 0.05 174 0.24 213 0.29
na 77 0.11 1 860 2.57 3 155 4.36 5 092 7.04
Missing 47 0.06 3 120 4.31 13 997 19.34 17 164 23.72
Vehicle 2 age
≤15 281 0.39 9 132 12.62 36 775 50.82 46 188 63.83
>15 19 0.03 652 0.90 2 412 3.33 3 083 4.26
Missing 47 0.06 3 077 4.25 13 725 18.97 16 849 23.28
na 82 0.11 2 029 2.80 4 132 5.71 6 243 8.63
Vehicle 2 type
Car 242 0.33 11 903 16.45 50 346 69.57 62 491 86.36
Two-wheelers 22 0.03 512 0.71 1 346 1.86 1 880 2.60
Truck 83 0.11 656 0.91 1 947 2.69 2 686 3.71
Other 11 0.02 129 0.18 400 0.55 540 0.75
na 71 0.10 1 631 2.25 2 850 3.94 4 552 6.29
Missing 0 0.00 59 0.08 155 0.21 214 0.30
Vehicle 2 towing and articulation
No 322 0.44 12 674 17.51 52 343 72.33 65 339 90.29
Articulated vehicle 20 0.03 90 0.12 121 0.17 231 0.32
Other 9 0.01 115 0.16 310 0.43 434 0.60
na 77 0.11 1 860 2.57 3 155 4.36 5 092 7.04
Missing 1 0.00 151 0.21 1 115 1.54 1 267 1.75
Vehicle 2 manoeuvre
Going ahead 260 0.36 5 099 7.05 17 646 24.39 23 005 31.79
Moving off 9 0.01 1 078 1.49 5 254 7.26 6 341 8.76
Overtaking 30 0.04 764 1.06 2 857 3.95 3 651 5.05
Turning left/right/U/Reversing 37 0.05 4 235 5.85 18 099 25.01 22 371 30.91
Other 21 0.03 1 710 2.36 6 670 9.22 8 401 11.61
na 71 0.10 1 631 2.25 2 850 3.94 4 552 6.29
Missing 1 0.00 373 0.52 3 668 5.07 4 042 5.59
Vehicle 2 object hit in carriageway
None 346 0.48 12 766 17.64 50 082 69.21 63 194 87.33
Kerb 8 0.01 42 0.06 112 0.15 162 0.22
Parked vehicle 1 0.00 361 0.50 3 723 5.14 4 085 5.65
Other 71 0.10 1 631 2.25 2 850 3.94 4 552 6.29
na 2 0.00 43 0.06 143 0.20 188 0.26
Missing 1 0.00 47 0.06 134 0.19 182 0.25

Abbreviations: na = not admissible, PTW = Powered two-wheeler

Table A4. Descriptive statistics related to crash data (Part D)
Variable Fatal Serious Slight Total
N % N % N % N %
Cyclist journey purpose
Commuting to/from work 55 0.08 2 724 3.76 11 340 15.67 14 119 19.51
Journey as part of work 12 0.02 793 1.10 3 351 4.63 4 156 5.74
To/from school 6 0.01 327 0.45 2 047 2.83 2 380 3.29
Other 45 0.06 1 647 2.28 4 204 5.81 5 896 8.15
Missing 311 0.43 9 399 12.99 36 102 49.89 45 812 63.31
Cyclist gender
Female 58 0.08 2 559 3.54 11 120 15.37 13 737 18.98
Male 369 0.51 12 201 16.86 45 433 62.78 58 003 80.16
Missing 2 0.00 130 0.18 491 0.68 623 0.86
Cyclist age
≤17 36 0.05 1 710 2.36 8 331 11.51 10 077 13.93
18-24 28 0.04 1 432 1.98 6 825 9.43 8 285 11.45
25-34 52 0.07 2 651 3.66 13 004 17.97 15 707 21.71
35-44 51 0.07 2 616 3.62 10 050 13.89 12 717 17.57
45-54 72 0.10 2 897 4.00 8 788 12.14 11 757 16.25
55-64 74 0.10 1 679 2.32 4 228 5.84 5 981 8.27
65-74 55 0.08 675 0.93 1 379 1.91 2 109 2.91
≥75 36 0.05 278 0.38 462 0.64 776 1.07
Missing 25 0.03 952 1.32 3 977 5.50 4 954 6.85
Cyclist IMD
Less deprived 198 0.27 6 291 8.69 21 527 29.75 28 016 38.72
More deprived 154 0.21 6 637 9.17 28 020 38.72 34 811 48.11
Missing 77 0.11 1 962 2.71 7 497 10.36 9 536 13.18
Cyclist home area
Urban 247 0.34 11 039 15.26 44 616 61.66 55 902 77.25
Rural 69 0.10 1 065 1.47 2 585 3.57 3 719 5.14
Small town 36 0.05 825 1.14 2 348 3.24 3 209 4.43
Missing 77 0.11 1 961 2.71 7 495 10.36 9 533 13.17
Driver 2 journey purpose
Commuting to-from work/school 27 0.04 1 653 2.28 5 719 7.90 7 399 10.22
Journey as part of work 122 0.17 2 193 3.03 7 676 10.61 9 991 13.81
Other 37 0.05 1 254 1.73 3 417 4.72 4 708 6.51
na 71 0.10 1 631 2.25 2 850 3.94 4 552 6.29
Missing 172 0.24 8 159 11.28 37 382 51.66 45 713 63.17
Driver 2 gender
Female 55 0.08 3 452 4.77 14 031 19.39 17 538 24.24
Male 287 0.40 7 766 10.73 29 765 41.13 37 818 52.26
na 71 0.10 1 631 2.25 2 850 3.94 4 552 6.29
Missing 16 0.02 2 041 2.82 10 398 14.37 12 455 17.21
Driver 2 age
≤17 3 0.00 94 0.13 315 0.44 412 0.57
18-24 45 0.06 1 160 1.60 3 902 5.39 5 107 7.06
25-34 75 0.10 2 330 3.22 8 852 12.23 11 257 15.56
35-44 66 0.09 2 004 2.77 8 077 11.16 10 147 14.02
45-54 61 0.08 2 118 2.93 8 075 11.16 10 254 14.17
55-64 57 0.08 1 489 2.06 5 279 7.30 6 825 9.43
65-74 16 0.02 779 1.08 2 837 3.92 3 632 5.02
≥75 13 0.02 616 0.85 1 826 2.52 2 455 3.39
na 76 0.11 1 817 2.51 3 850 5.32 5 743 7.94
Missing 17 0.02 2 483 3.43 14 031 19.39 16 531 22.84
Driver 2 IMD
Less deprived 135 0.19 4 667 6.45 16 488 22.79 21 290 29.42
More deprived 160 0.22 4 926 6.81 19 578 27.06 24 664 34.08
na 71 0.10 1 631 2.25 2 850 3.94 4 552 6.29
Missing 63 0.09 3 666 5.07 18 128 25.05 21 857 30.20
Driver 2 home area
Urban 203 0.28 7 785 10.76 30 728 42.46 38 716 53.50
Rural 57 0.08 1 013 1.40 3 007 4.16 4 077 5.63
Small town 35 0.05 796 1.10 2 333 3.22 3 164 4.37
na 71 0.10 1 631 2.25 2 850 3.94 4 552 6.29
Missing 63 0.09 3 665 5.06 18 126 25.05 21 854 30.20

Abbreviations: na= not admissible
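
The percentages in Tables A1 to A4 appear to be expressed relative to all 72 363 crashes rather than within each variable category. The minimal Python sketch below illustrates this reading using the "Crash severity" row of Table A1; the function name `pct_of_total` and the rounding to two decimals are assumptions made for illustration, not the authors' code.

```python
# Total number of cyclist crashes, taken from the "Crash severity" row of Table A1.
TOTAL_CRASHES = 72_363


def pct_of_total(count: int, total: int = TOTAL_CRASHES) -> float:
    """Return `count` as a percentage of all crashes, rounded to two decimals.

    Hypothetical helper: it assumes every "%" column in the appendix uses
    the full sample as its denominator.
    """
    return round(100 * count / total, 2)


# Counts by severity level, copied from Table A1.
severity_counts = {"Fatal": 429, "Serious": 14_890, "Slight": 57_044}

# The three severity levels should add up to the full sample.
assert sum(severity_counts.values()) == TOTAL_CRASHES

severity_pct = {level: pct_of_total(n) for level, n in severity_counts.items()}
print(severity_pct)  # {'Fatal': 0.59, 'Serious': 20.58, 'Slight': 78.83}
```

Under the same assumption, any other cell can be checked in the same way, for example `pct_of_total(213)` for fatal crashes on A roads, which gives the 0.29 reported in Table A1.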

References

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0). View this license’s legal deed at https://creativecommons.org/licenses/by/4.0 and legal code at https://creativecommons.org/licenses/by/4.0/legalcode for more information.