Start
Introduction
Methodology
NB regression
MBRP algorithm
Data
Analysis and results
NB regression
MBRP model
Model comparison
Conclusions
Funding
Declaration of generative AI use in writing
Declaration of competing interests
Ethics statement
CRediT contribution statement
Acknowledgements
References

Applying model-based recursive partitioning to improve pedestrian exposure models to support transportation safety analyses


The Pennsylvania State University, the United States of America

Guest editor: Nikiforos Stamatiadis, University of Kentucky, the United States of America

Reviewers: Mariusz Kiec, Cracow University of Technology, Poland
Mike Sewell, Gresham Smith, the United States of America

DOI: https://doi.org/10.55329/aseb7655

Received: 25 May 2024; Accepted: 18 November 2024; Published: 17 January 2025

Abstract

Pedestrians are among the most vulnerable road users in urban areas, and their safety is a growing concern for transportation planners and engineers. Pedestrians are at disproportionately high risk for injuries or fatalities in crashes with motor vehicles, highlighting the critical need to address their safety. To address the dangers urban pedestrians face, the relationship between pedestrian crashes and their contributing factors must first be understood. One way to do this is to use statistical models relating pedestrian crash frequency with quantifiable contributing factors, such as land use, demographics, and roadway characteristics. Perhaps the most important of these factors is pedestrian exposure, which is often difficult to obtain because pedestrian volumes are not as widely available as vehicle volumes. Since pedestrian volumes are not available across an entire network, they are often estimated using statistical models—for example, negative binomial (NB) regression—rather than being directly observed. These models are typically a ‘one-size-fits-all’ approach, applying the same model to estimate pedestrian exposure across the entire network. However, relationships between pedestrian exposure and explanatory features—such as population, infrastructure design, and land use context—might differ significantly with respect to the context of an individual location. To address this issue, this paper proposes a model-based recursive partitioning (MBRP) algorithm to develop pedestrian exposure models. The MBRP approach combines traditional statistical methods (e.g. NB regression) with recursive data partitioning techniques commonly found in tree-based machine learning methods. This innovative approach yields a collection of exposure models stratified according to selected input variables with unique relationships between explanatory variables and exposure. The proposed method was tested on pedestrian exposure data from North Carolina significantly improved predictions of pedestrian volumes by approximately 10%. Therefore, the MBRP algorithm presents a promising tool for advancing pedestrian safety analyses in practical applications.

Keywords

exposure model, model based recursive partitioning, negative binomial regression, pedestrians, pedestrian safety

Introduction

Pedestrian safety is a topic of growing concern in the United States, especially in urban areas where 84% of pedestrian fatalities in traffic crashes occur (NHTSA, 2023). Widely considered to be the most vulnerable roadway users, pedestrians are disproportionately represented in fatality statistics, and their representation is growing at an alarming rate. Data from the Fatality Analysis Reporting System (FARS) indicates that 7 388 pedestrians were killed in traffic crashes in 2021, about 19% of all traffic fatalities in the United States. As the greatest pedestrian fatality total since 1981, that pedestrian fatality total also represents a 12.5% increase from the previous year, following a trend in yearly increases dating back to 2013 when pedestrian fatalities accounted for 3% less of all traffic fatalities (NHTSA, 2023). To address the growing concern for pedestrian safety, we must first understand the relationship between factors that contribute to pedestrian crashes and crash outcomes.

Often these relationships are quantified through statistical models that relate frequency of pedestrian crashes with observable factors that contribute to the crashes. Many different models have been used to investigate the relationships between pedestrian crashes and their contributing factors – most often including combinations of variables such as land use, socio-demographic information, and roadway characteristics. But perhaps the most important predictor of pedestrian safety outcomes is pedestrian exposure. Unfortunately, pedestrian exposure can be difficult to obtain because pedestrian volumes are not as widely available across a roadway network as vehicular volumes. This is because network-scale pedestrian counts are resource-intensive due to high costs associated with labor for manual counts and prices charged for automated counts. Some pedestrian volume data is more widely available from fitness-tracking sources such as Strava, but data from fitness-trackers is flawed in that they relay a self-reporting sample and the trackers are subject to error due to poor GPS reception, which is especially an issue in urban environments (Lee & Sener, 2020). Accurate volumes can also be difficult to obtain and even impractical to use due to highly variable daily volumes, short trips which may not be observed, and difficulties in detecting individuals (Lagerwey et al., 2015). Instead of directly using counts directly, pedestrian volumes are usually estimated through statistical models.

Several studies have developed pedestrian exposure models to predict the amount of pedestrian activity at a given level based on input variables that reflect the built environment, roadway features and other variables. These have traditionally been developed using log-linear ordinary least squares regression (OLS) or negative binomial (NB) regression (Behnam & Patel, 1977; Griswold et al., 2019; Hankey et al., 2012; Haynes et al., 2010; Lindsey et al., 2006; Lindsey et al., 2007; Liu & Griswold, 2009; Miranda-Moreno & Fernandes, 2011; Pulugurtha & Repaka, 2008; Schneider et al., 2009; Schneider et al., 2012). Pedestrian exposure has also been estimated using Tobit models and by modifying NB regression techniques to artificially inflate zero value pedestrian counts, in both cases relating the counts to demographics, land use, and traffic data (Lee et al., 2019). Still other studies suggest using stepwise linear regression to account for spatial variations in independent variables (Hankey & Lindsey, 2016; Hankey et al., 2017; Lu et al., 2018).

While these model types have their own merit, they also have various flaws when applied to estimating pedestrian exposure. OLS regression does not account for the count nature of volume data, and using a log-linear form assumes a logarithmic distribution that may not be observed. Similarly, Tobit models assume a normal distribution for the dependent variable which is not typically observed in pedestrian count data. Stepwise linear regression may result in atheoretical coefficient estimates, which makes interpretation complicated and may limit transferability to other datasets. NB regression is the most appropriate and the most common, because of its ability to account for overdispersion in fluctuating pedestrian volumes and the count nature of the data. Even so, all these models are flawed in that they are typically a ‘one-size-fits-all’ approach in which the same model is used to estimate pedestrian exposure at all locations within a transportation network. However, the relationships between pedestrian exposure and explanatory features—such as population, infrastructure design, and land use context—might differ significantly with respect to the context of an individual location, which may not be known a priori. Incorporating these differences could help improve the exposure model and provide more accurate predictions.

To help address this issue, this paper proposes a model-based recursive partitioning (MBRP) algorithm to develop pedestrian exposure models. The MBRP approach combines traditional statistical methods (e.g. NB regression) with the recursive data partitioning techniques commonly found in tree-based machine learning methods. The proposed method was tested on pedestrian exposure data obtained in North Carolina and shown to significantly improve predictions of pedestrian volumes by approximately 10%. Therefore, the MBRP algorithm presents a promising tool for advancing pedestrian safety analysis in practical applications.

The remainder of this paper is organized as follows. Section 2 provides a description of the methodology used for NB regression and the proposed MBRP approach. Section 3 describes the dataset used. Section 4 provides an analysis of the results from the MBRP model. Finally, Section 5 contains concluding remarks.

Methodology

The goal of this research is to demonstrate the potential of the MBRP algorithm to estimate pedestrian exposure better than traditional regression methods. To do so, exposure models were developed over a training dataset using traditional NB regression methods and the MBRP algorithm. The performance of the MBRP model was judged relative to the NB regression model based on goodness of fit statistics over a separate test dataset and cumulative residual (CURE) plots.

NB regression

Pedestrian counts are always a non-negative integer and are therefore most appropriately modeled using count models. Though there are many count regression models, NB regression models are used most extensively in research due to their ability to account for overdispersion in the dataset, which is commonly observed in pedestrian count data. This paper’s NB regression procedure is adapted from Hankey et al. (2012) and Lee et al. (2019).

NB regression can be described through the following formulation: Let i = 1 , 2 , 3 , , N represent the index of a given location where N is the number of locations in the dataset. In the NB model, the pedestrian count at a location i , takes an exponential form as shown in Equation 1:

λ i = E y i = e x p β X i + ε i

where λ i is the predicted pedestrian count at location i , y i is the observed pedestrian count at location i , β is the vector of estimated parameters, X i is the associated vector of explanatory variable values observed at location i , and ε i is an error term such that e x p ε i has a gamma distribution. A maximum likelihood estimation (MLE) method was adopted using the probability distribution for NB regression as presented in Equation 2, and the likelihood function as presented in Equation 3. The MLE method allows for estimation of the coefficient parameter, β , and the overdispersion parameter, α , relying on the gamma function, denoted by Γ ( · ) .

P y i = Γ 1 α + y i Γ 1 α y i ! α λ i 1 + α λ i y i 1 1 + α λ i 1 α
L ( α , β ) = i = 1 N P y i

Selecting parameters which maximize the likelihood function presented in Equation 3 establishes a model expected to best fit the data. Typically, the likelihood value found by maximizing Equation 3 is very small, so frequently the natural log of Equation 3 is optimized in place and is reported as log-likelihood.

MBRP algorithm

The MBRP algorithm combines traditional statistical modeling, such as NB regression, with recursive data partitioning techniques commonly found in tree-based machine learning methods. In the MBRP algorithm, the root node is a parametric model fitted over the entire dataset. Child nodes are then formed through splitting based on a decision rule, which continues until the terminal node of the tree model is reached. Figure 1 depicts a general form of the tree model (Kashani & Mohaymany, 2011). The following formulation for the MBRP algorithm draws extensively from Seibold et al. (2016) and descriptions from Tang and Donnell (2019).

Developing a model using the MBRP algorithm is a process that occurs in three steps:

  1. Fit a parametric model to all observations in the dataset;

  2. Test coefficient stability over the splitting variable; and

  3. Determine the optimal cut point of the splitting variable.

In Step 1, the parametric model with parameter vector, θ (represented by the root node in ) may be estimated through methods such as OLS or MLE which inform the objective function, Ψ ( · ) as found in Equation 4. The coefficients are then estimated through a partial score function as shown in Equation 5, where ψ ( · ) represents the score function, and β is the estimated parameter (Seibold et al., 2016).

https://typeset-prod-media-server.s3.amazonaws.com/article_uploads/b4c20676-0003-47f6-98f8-f54dc3dd1e1b/image/6e34704b-bed1-4f41-bfb3-76117edcc972-ufigure_1.png
Figure 1: General structure of a tree model (Kashani & Mohaymany, 2011)
θ ^ = a r g m i n i = 1 N Ψ ( y , X i ,   θ )
i = 1 N Ψ y , X i ,   θ β = i = 1 N ψ β y , X i ,   θ

In Step 2, a generalized M-fluctuation test is used to test coefficient stability over splitting variables, with the null hypothesis stating that the partial score functions from Equation 5 are independent of partitioning variables. This hypothesis indicates that global estimates of an independent variable are appropriate. Equation 6 shows the functional form of the null hypothesis, where Z is the splitting variable and J is the number of splitting variables (Seibold et al., 2016).

H 0 β j :   ψ β Y , X ,   θ ^ Z j ,   j = 1 , , J

The splitting variable is selected based on the greatest correlation with partial score functions, and the optimal cut point of the splitting variable is determined by evaluating segmented objective functions as shown in Equation 7 and choosing the minimum value. In Equation 7, I b is the set of observations belonging to b under splitting rule Β (Tang & Donnell, 2019).

S C O R E = b = 1 B i I b Ψ Y i , θ b

The process outlined above results in child nodes after the splitting rule is followed. The process is then repeated on all child nodes until terminal nodes are reached, where there is no longer coefficient instability as determined in Step 2 of the process. Each of the terminal nodes represents a parametric model, and any observation in the root nodes that meets the splitting criteria to fall into the terminal node can be predicted using the corresponding parametric model. Additional information about the MBRP algorithm can be found in Zeileis and Hornik (2007) and Zeileis et al. (2008).

Data

Pedestrian count data analyzed in this study comes from the North Carolina Department of Transportation (NCDOT) and was collected and analyzed for a previous study investigating factors contributing to pedestrian crash outcomes (Gayah et al., 2022). The counts are not a random sample, arising from a variety of convenience samples such as turning movement counts (TMCs) and pedestrian safety studies at particularly dangerous intersections or intersections with high-volumes relative to other intersections in the city. The non-random sample is still considered to be a valid sample because the counts are known to demonstrate a wide range of pedestrian count values, as well as to accurately represent the expected levels of pedestrian activity near the count location. The counts come from the following sources:

  • 1 993 counts collected by NCDOT as part of turning movement counts (TMCs) or other similar analyses

  • 496 counts collected between 2011 and 2020 by City of Charlotte TMC program as part of a FHWA funded research project

  • 19 counts from downtown Raleigh, NC as collected for a NCDOT pedestrian safety study

  • 539 counts from Greensboro DOT/Greenville Urban Area Metropolitan Planning Organization (MPO)

  • 387 counts provided by Gaston-Cleveland-Lincoln MPO

  • 184 counts provided by the City of Durham.

Each observation in the dataset included vehicular volume data, roadway features at the count location, census-level demographic statistics, and land-use statistics for a 0.5-mile radius surrounding the count location. A summary of the variables that were included in final models is provided in Table 1 and Table 2. The land use mix variable is an aggregate of land uses in the surrounding area adapted from the methodology outlined in Frank et al. (2004) and Gayah et al. (2022). The land use mix value is based on four land use types: high intensity developed, medium intensity developed, low intensity developed, and all other land use classifications combined, where a value of 1.00 represents a perfect balance of all four land uses, and 0.00 represents only a single land use present. Alcohol sales locations are not commonly considered as an indicator variable; however, including this variable in indicator form can be justified through contextualization—the two categories represent the difference between an area with single liquor store and a couple of restaurants, and an area with densely clustered bars expecting high foot traffic and fewer vehicles. Including this variable in this form was found to improve model performance relative to a continuous form of alcohol sales locations.

Table 1: Continuous variables included in final NB and MBRP models

Mean

Standard deviation

Minimum

Maximum

Land use mix

0.80

0.17

0.10

1.00

Proportion of non-motorized commuters within 0.25 miles of count location

0.07

0.09

0.00

0.55

AADT (veh/day)

20 260

11 133

1 200

77,000

Number of parcels within 0.25 miles of count location (parcel count)

92.98

91.93

2

568

Table 2: Discrete variables included in final NB and MBRP models

1

0

Count duration = 13 hours (1 indicates yes, 0 indicates no)

74.49%

25.51%

Count duration = 16 hours (1 indicates yes, 0 indicates no)

5.27%

94.73%

Indicator for 0-5 alcohol sales locations within 0.25 miles of count location (1 indicates yes, 0 indicates no)

40.00%

60.00%

Indicator for ≥ 6 alcohol sales locations within 0.25 miles of count location (1 indicates yes, 0 indicates no)

29.29%

70.71%

Indicator for presence of sidewalk (1 indicates present, 0 indicates not present)

69.88%

30.12%

Indicator for presence of crosswalk (1 indicates present, 0 indicates not present)

32.76%

67.24%

Indicator for bus stop located within 0.25 miles of count location (1 indicates present, 0 indicates not present)

58.72%

41.28%

Indicator variable for posted speed limit ≥ 40mph (1 indicates yes, 0 indicates no)

45.68%

54.32%

Of the 3 618 counts provided, the majority were performed at intersections, and therefore intersections were selected as the unit of analysis. Count locations that were not intersections, in non-urban environments, and count durations less than or equal to 2.5 hours or equal to 24 hours were removed from the analysis to limit the analysis to characteristically similar count locations. The remaining 2 430 pedestrian counts ranged from 0 to 14 854 pedestrians over the respective durations, though 75% of counts registered 67 or fewer pedestrians. An 80/20 split was performed over the full 2 430 counts, creating a training dataset (1 942 observations) and a testing dataset (488 observations) such that both the NB and MBRP models would be estimated over the training dataset, and the models’ fitness would be assessed using the goodness of fit statistics based on their predictions over the test dataset. The training and test datasets were selected to be representative samples of the full dataset with approximately equal summary statistics, but each dataset was also vetted to ensure representation of outlier counts. Due to the nature of the outlier counts, exact matches for maximum count values could not be achieved, and the training dataset is known to contain a greater proportion of lower count values because of the higher maximum count but otherwise similar summary statistics. Table 3 shows the summary statistics of the final split datasets.

Table 3: Summary statistics of pedestrian counts

Minimum

25th-percentile

Median

Mean

75th-percentile

Maximum

Full dataset

0

4

19

145.7

67.00

14 854

Training dataset

0

4

19

146.4

67.75

14 854

Test dataset

0

4

19

143.0

65.50

9 839

Analysis and results

This section describes the estimation results from the NB regression model and the MBRP model, including the assessments of goodness of fit through mean absolute error (MAE), mean absolute percentage error (MAPE), and root mean square error (RMSE). Additionally, a cumulative residual (CURE) plot was generated with both models plotted against the 95-percent confidence interval as determined for the NB regression model, according to the methodology outlined in Hauer (2015).

The results of the MLE process for the NB regression model and the model generated from the MBRP algorithm are presented in Table 4. In the modeling process, only the variables found to be statistically significant at the 5% significance level were retained. Exceptions were made for some variables found to be insignificant at the 5% level but were deemed theoretically important predictors of pedestrian counts. Additionally, any variables that were found to be inconsistent with theory regarding the sign of the estimated parameter were vetted for their contribution to the model’s predictive performance before being considered for removal.

Table 4: NB regression and MBRP model estimation results

Variable

NB regression

MBRP

Parcel count ≤ 71

‘Low urban density’

Parcel count > 71

‘High urban density’

Constant

2.538

3.226

4.926

Count duration = 13 hours (1 indicates yes, 0 indicates no)

0.428

0.747

0.293

Count duration = 16 hours (1 indicates yes, 0 indicates no)

-0.408

0.078

-0.965

Land use mix

1.265

---

---

Indicator for 0–5 alcohol sales locations within 0.25 miles of count location (1 indicates yes, 0 indicates no)

0.172

0.210

0.603

Indicator for ≥ 6 alcohol sales locations within 0.25 miles of count location (1 indicates yes, 0 indicates no)

0.680

0.743

1.054

Proportion of non-motorized commuters within 0.25 miles of count location

9.262

8.958

9.009

Natural logarithm of AADT (veh/day)

-0.301

-0.228

-0.260

Indicator for presence of sidewalk (1 indicates present, 0 indicates not present)

0.972

1.046

0.696

Indicator for presence of crosswalk (1 indicates present, 0 indicates not present)

0.576

1.019

-0.089

Indicator for bus stop located within 0.25 miles of count location (1 indicates present, 0 indicates not present)

0.214

0.418

0.008

Natural logarithm of parcel count

0.262

---

---

Indicator variable for posted speed limit ≥ 40mph (1 indicates yes, 0 indicates no)

-0.636

-0.746

-0.381

Overdispersion parameter

1.550

2.435

1.226

Akaike Information Criterion (AIC)

17 889

17 857

Values in italics are not significant at 95% confidence level.

NB regression

All variables included in the final NB regression model were found to be statistically significant at the 95% confidence level or higher and were found to meet expectations consistent with engineering judgement:

  • Count duration. In the dataset, the base condition is a 12-hour count. Therefore, it would be expected that a 13-hour count would yield a higher prediction due to a longer period for pedestrians to be observed. The same logic would indicate that a 16-hour count parameter should be greater than the 13-hour count parameter, however, 16-hour counts are typically conducted at two-way stop intersections being considered for signalization. Generally, these intersections have high traffic volumes which make pedestrian access uncomfortable, resulting in fewer pedestrians. Therefore, a smaller or even negative coefficient may be appropriate to describe the effects of a 16-hour count on pedestrians observed.

  • Traffic conditions. As previously described, pedestrians are less comfortable walking near roadways with higher traffic volumes where they may be more vulnerable to crashes (Miranda-Moreno & Fernandes, 2011). That trend extends also to roadways with higher speed limits as crashes involving pedestrians and vehicles moving at higher speeds are known to result in more severe outcomes for pedestrians (Pulugurtha & Repaka, 2008). In the NB regression model, we find that the coefficients associated with the natural logarithm of AADT and speed limits greater than or equal to 40 miles per hour are negative, indicating a negative impact on pedestrian counts.

  • Pedestrian-friendly infrastructure. Presence of features such as crosswalks and sidewalks are expected to contribute positively to pedestrian counts. Crosswalks provide priority to pedestrians crossing at intersections, and sidewalks provide a separate right of way for pedestrians to walk safely next to a road (Lee et al., 2019; Lu et al., 2018). These expectations are met by the positive coefficients associated with presence of a crosswalk or sidewalk in the NB regression model. Bus stops are also associated with greater pedestrian presence due to the access distance between the trip origin/destination and the transit stop (Hankey et al., 2012; Pulugurtha & Repaka, 2008).

  • Land development. Denser, more varied use of land is expected to result in greater pedestrian counts due to convenience of the walking mode. Increases in parcel count, a measure of urban density, should indicate shorter distances to potential destinations, while land use mix increases suggest a greater variety of residential, commercial, and industrial uses within a small radius (Gayah et al., 2022). Walking becomes more convenient over shorter distances, and therefore more pedestrians are likely to be observed with greater parcel count and land use mix. Additionally, alcohol sales locations are frequently accessed via walking due to the dangers associated with driving under the influence (Gayah et al., 2022).

  • Demographics. People who reported that they commute by non-motorized means are vastly more likely to be observed in a pedestrian count, and therefore a greater proportion of non-motorized commuters is expected to increase pedestrian counts, as observed by the positive coefficient in the NB regression model (Griswold et al., 2019).

MBRP model

Under the framework discussed in the methodology section, a MBRP model splitting over the natural logarithm of parcel count was developed. Due to the relatively small dataset, the maximum depth of the tree structure was limited to ensure that each of the sub-models was trained on a sufficiently large dataset. This also allowed for greater control over variable inclusion based on statistical significance and consistency with theory. The optimal split value of the parcel counts was found to be 71, with 1 197 of the training dataset observations registering parcel counts less than or equal to the optimal split value. Figure 2 depicts the tree structure for the MBRP model developed.

https://typeset-prod-media-server.s3.amazonaws.com/article_uploads/b4c20676-0003-47f6-98f8-f54dc3dd1e1b/image/5d5d5403-0dfe-4206-9618-dc821ba64874-ufigure-2.jpg
Figure 2: MBRP model tree structure

In comparing the estimation results between the NB regression model and the MBRP model, land use mix was found to be statistically insignificant in the MBRP model and was therefore removed from the model. The indicator variables for crosswalks and bus stops were found to be insignificant in the higher urban density model of the MBRP model, though the coefficient for bus stops was consistent with theory, and the contribution from the crosswalk variable was minor enough to overlook given the magnitude of its contribution in the lower urban density model. Aside from these differences, the NB and MBRP model coefficients take the same signs and indicate the same relationships between explanatory variables and pedestrian count outcomes.

The MBRP model allows for additional interpretation of model coefficients between the sub-models, which captures the difference in relationships between the explanatory features and the pedestrian count outcome as they vary with urban density. For example, based on differences in the magnitudes of coefficients presented in the second and third columns of Table 4, we observe that pedestrian counts are more sensitive to the presence of alcohol sales locations in denser urban environments. On the other hand, pedestrians in denser urban environments are less sensitive to the presence of sidewalks, crosswalks, bus stops, and greater speed limits than their counterparts in less dense urban areas. Across all models, the coefficients estimated for the proportion of non-motorized commuters and the natural logarithm of AADT are approximately equal, which indicates that pedestrians everywhere are about as sensitive to both variables, regardless of urban density.

Model comparison

Following model development, both models' cumulative residuals were evaluated over the testing dataset. The results were plotted on a CURE plot (Figure 3) with the 95% confidence interval from the NB regression plotted to demonstrate how the MBRP model compares to the ‘baseline’ NB regression model. Overall, we can tell that both models are robust, plotting nearly all observations within the confidence interval. From the CURE plot, we see that for lower count values predicted, the cumulative residuals are about equal across both models. Both models appear to underpredict observations of approximately 200 pedestrians, though the residuals for the MBRP model are smaller. The trend of smaller residuals can be observed over most of the predicted range for the MBRP model, indicating a generally better fit for the data than the NB regression model.

https://typeset-prod-media-server.s3.amazonaws.com/article_uploads/b4c20676-0003-47f6-98f8-f54dc3dd1e1b/image/0cadd133-2d2b-4b37-a9a8-7ec1a80b3a69-ufigure_3.png
Figure 3: Test data cumulative residual plots

The models were also tested for goodness of fit by standard error measurements to determine how well the models predict the pedestrian counts relative to the observed pedestrian counts in the test dataset. Table 5 presents the goodness of fit statistics for both models, where the better performance measure is bolded. To better depict the differences between the models, the goodness of fit statistics are presented for arbitrarily determined low-, mid-, and high-count ranges, as well as over the entire test dataset. Due to the nature of MAPE and the presence of zero-count observations, MAPE could not be applied to the low-count range, nor the full test dataset. In other studies, MAPE has been applied to zero-count locations through either log-transformation or by adding a negligibly small value to zero-counts. In this case, neither adjustment was applied to avoid artificially inflating the percentage error of zero-counts, for which any predicted value other than zero would incur immense errors.

Table 5: Goodness of fit statistics

Low-Count (0–100)

(393 observations)

Mid-Count (101–1000)

(82 observations)

High-Count (1000+)

(13 observations)

Full Test Dataset

(488 observations)

NB

MBRP

NB

MBRP

NB

MBRP

NB

MBRP

MAE

47.40

40.78

256.10

245.99

2116.33

1964.48

137.58

126.51

MAPE

n/a

n/a

118%

109%

61%

55%

n/a

n/a

RMSE

125.24

97.49

409.42

479.26

3058.23

2865.33

538.47

514.75

Bolded values indicate best performing model.

Based on the AIC presented in Table 4, we find that the MBRP model is an overall improvement over the traditional NB regression model. But the statistics presented in Table 5 allow for more distinct quantitative results. Over the whole test dataset, the MAE and RMSE both indicate that the MBRP model predicts more accurately than the NB regression model by about 8.0% and 4.4% respectively. This trend is consistently observed across the low- and high-count ranges, where MAE improves by 14.0% and 7.2%, respectively, and the RMSE is improved by 22.2% and 6.3%, respectively. The exception to the trend is in the mid-count range, where the MAE improves by 3.9%, but the RMSE increases by 17.1%. This anomaly can be attributed to a single predicted value with a large residual, which is then highly weighted in the RMSE calculation due to the squared component of the error statistic. The evaluation that the MBRP model outperforms the NB regression model even in the mid-count range is further supported by the 9.0% improvement in the MAPE, which is also observed in the high-count range.

Accounting for evidence provided by the AIC as found in model development, the CURE plots generated for both models, and the goodness of fit statistics as calculated over the test dataset, the MBRP model improves upon the traditional NB regression model in predicting pedestrian count values. Though error evaluations differ in quantity, based on trends in the MAE, RMSE, and MAPE indicate that using MBRP to model pedestrian count data can improve prediction accuracy by approximately 10% compared to traditional NB regression models.

Conclusions

This study investigates a possible improvement in pedestrian exposure modeling, by augmenting traditional NB regression models with a tree-based machine learning component, the MBRP algorithm. It was hypothesized that the inclusion of the MBRP algorithm would not only improve predictive accuracy but would allow for additional interpretation of the relationships between explanatory variables and pedestrian exposure estimates which may differ contextually in ways that are not known a priori. Data from North Carolina is subdivided into training and test dataset. Training data is applied to develop a traditional NB regression model and an MBRP model, and the two models’ predictive performance is assessed based on the predictions for the test dataset.

The use of the MBRP algorithm resulted a model indicating that a pedestrian count’s relationship with explanatory variables varies with the natural logarithm of parcel count, which suggests that pedestrian counts in denser urban environments are more sensitive to presence of alcohol sales locations, and less sensitive to presence of sidewalks, crosswalks, bus stops, and roadways with higher speed limits. These trends are not observable through traditional NB models, which are typically more of a ‘one-size-fits-all approach which does not consider how the relationships between explanatory variables may vary with context.

The model developed using the MBRP algorithm improved predictions of pedestrian counts over the test dataset by approximately 10% as measured by MAE, MAPE, and RMSE. The model’s fitness was also shown to be an improvement over the NB regression model through lower AIC and the MBRP plotting more points within the confidence interval and nearer to zero on a CURE plot.

The proposed use of the MBRP algorithm to improve pedestrian exposure estimates overcomes the shortcomings of traditional NB regression methods by capturing relationships between explanatory variables in context. Fortunately, the methodology is quite transferable to other regions outside of North Carolina, given that the MBRP algorithm inherently contextualizes coefficient estimates and naturally defines its own optimal splitting points. However, given the difficulty of retaining statistically significant variables across sub-models in the MBRP model, it should be noted that the variables selected for the MBRP model presented in this paper may not be statistically significant in another region. Further optimization may be required for application elsewhere. The MBRP algorithm and other machine learning methods are also known to be ‘data hungry’, and therefore the predictive power of this model may be limited by the size of the available dataset. In practice, a larger dataset would yield better results and could allow for a deeper understanding of the relationships between explanatory variables, which may be more complex than the restricted model presented in this paper suggests.

Funding

No external funding was used in this research.

Declaration of generative AI use in writing

During the preparation of this work the authors used GPT-4 to check writing for grammatical errors and improve clarity of language in the final manuscript. The outputs were reviewed and revised by the authors who take full responsibility for the content of the publication.

Declaration of competing interests

The authors declare no competing interests.

Ethics statement

The research performed in this paper does not qualify as human subjects research and thus does not require any IRB protocol.

CRediT contribution statement

Jakob C. Wiegand: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Visualization, Writing—original draft, Writing—review & editing.

Vikash V. Gayah: Conceptualization, Data curation, Methodology, Supervision, Validation, Writing—review & editing.