Application of the logistic regression model to assess the risk of death in road…

in shaping the overall strategy on road safety.


INTRODUCTION
Road safety is an important and topical issue. Statistics show that more than 1.3 million people die worldwide every year owing to traffic incidents. Traffic accidents are estimated to rank eighth among the main causes of human mortality, and first in the age group of 5 to 26 years.
Therefore, it is important to take measures to minimize and counteract their negative effects. This is facilitated by the identification and assessment of factors that affect the risk of being involved in a traffic incident. Mathematical models allow the strength and direction of these effects to be estimated and, as a result, can support decisions on improving road safety.
This article features an analysis of road accidents in the Mazowieckie Voivodeship, which has the highest mortality rate among all voivodeships (regions) in Poland. Figure 1 shows the number of fatalities in individual provinces in Poland in 2016-2018. Every year, an average of 473 people die in this area, which is 16% of all nationwide fatalities and means that every sixth victim loses their life in the Mazowieckie Voivodeship. Logistic regression was applied in the study. The collected observations and preliminary analyses made it possible to isolate the main factors conducive to deaths in road accidents, and on this basis the model predictors were selected. These included qualitative variables characterizing accident participants and quantitative variables describing the atmospheric conditions prevailing at the time. The estimated parameters of the regression model made it possible to examine the relationships between the predictors and the risk of death. The quality of the proposed model was verified based on an analysis of the significance of its parameters and an evaluation of goodness of fit using the ROC curve (Receiver Operating Characteristic Curve) and the AUC (Area Under the ROC Curve).

LITERATURE REVIEW - PROBLEM STATEMENT
Road safety has been the subject of many scientific studies and publications. Most often, quantitative analyses are carried out, aimed at estimating the parameters of mathematical models describing the phenomenon under consideration and predicting its future development.
The frequency of occurrence of road incidents and their numbers are investigated [2,9]. There are also assessments of the number of traffic accidents causing fatalities [8,11] or causing injuries [13,34]. Analyses covering the number of fatalities as well as injuries are less common [5,17].
Widely used in this area are generalized linear regression models with adopted probability distributions, including the Poisson distribution, the gamma-Poisson, i.e. negative binomial (NB), distribution [1,10], the binomial distribution [21], the Poisson lognormal (PL) distribution [26], and distributions with an excess of zero counts, i.e., zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) [18]. In addition, solutions based on complex econometric models are used, among others the Bayes method, a Monte Carlo simulation method based on Markov chains [14], and time series analysis methods [11]. The main purpose of qualitative analyses is to present a descriptive characteristic of the research objects, assess the severity of road accidents broken down into individual categories of victims [24], and analyze the concentration of incidents and the spatial differentiation of the level of road safety [25]. Among the studies available in the literature, the most frequently mentioned explanatory variables are parameters related to vehicle traffic: average daily traffic volume [12], transport performance [6], and the participation of heavy goods vehicles [15].
A review of the literature also revealed the use of the logistic regression function [5]. The available studies include research on the severity of road accidents depending on the location of the accident [28] or the causes of the incident (excessive speed, inadequate distance between vehicles, red light violations, etc.) [5]. Other authors consider the age of the perpetrator and other participants of the incident to be significant, as well as the time when the traffic accident occurred [31]. This is demonstrated, for example, by Shakaya and Marsani in their analysis of the death rate of accidents in the Kathmandu Valley. Based on an analysis of events in Shijiazhuang, Ma et al. indicated the significant effect of variables related to the cross-section of the road, the place of the accident, the type of the road, the density of intersections, and lighting conditions [28]. Ahmed, however, identified the speed and type of the vehicle and the location of the incident as the main causes [4].
The analysis of the literature also showed that research conducted for large, demographically diverse areas does not always fully reflect the characteristics of the phenomena occurring there [24]. Therefore, this article assesses one territorially distinguished region in which the number of accidents was the highest and which is characterized by significant population migration resulting from professional and educational factors. It was found that the risk of death as a result of traffic accidents is significantly influenced by factors related to the type of the perpetrator and the participant of the incident, the urbanization of the place where it occurred, the manner in which the road is lit, as well as the sex of the victim and the driver's experience.

LOGISTIC REGRESSION
The analysis and assessment of the mortality rate of road accidents in the Mazowieckie Voivodeship was carried out using the logistic regression model. It allows mathematical recording of the effect of several variables X1, X2, …, Xk on the dichotomous variable Y, which takes the value 1 in the event of death as a result of a traffic incident, and 0 when participants are injured or do not sustain any injuries. The link function of the model is the logistic function, which takes the following form [23]:

$$f(x) = \frac{e^{x}}{1 + e^{x}} = \frac{1}{1 + e^{-x}} \tag{1}$$

where: e - Euler's number, and x - value of the explanatory variable X.
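The logistic regression model is described by the following equation:

$$P(Y = 1 \mid x_1, x_2, \ldots, x_k) = \frac{e^{b_0 + \sum_{i=1}^{k} b_i x_i}}{1 + e^{b_0 + \sum_{i=1}^{k} b_i x_i}} \tag{2}$$

where: b_i, i = 0,…,k - regression coefficients, and x1, x2, …, xk - independent variables that can be measurable or qualitative.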
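The logistic function and the model probability described above can be illustrated with a minimal Python sketch. The function names, coefficients and predictor values below are hypothetical and serve only to show the calculation; they are not the estimates obtained in this study.

```python
import math
from typing import Sequence


def logistic(x: float) -> float:
    """Logistic function e^x / (1 + e^x), evaluated in an overflow-safe way."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predict_probability(b0: float, b: Sequence[float], x: Sequence[float]) -> float:
    """P(Y = 1 | x_1..x_k) for intercept b0 and coefficients b_1..b_k."""
    linear = b0 + sum(bi * xi for bi, xi in zip(b, x))
    return logistic(linear)


# Hypothetical coefficients and predictor values (illustration only).
p = predict_probability(-1.0, [0.5, -0.25], [2.0, 4.0])
print(f"P(Y = 1) = {p:.4f}")
```

The two-branch form of `logistic` only avoids numerical overflow for large negative or positive arguments; mathematically both branches are the same function.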
The model is based on the method of expressing probability through odds, which is the ratio of the probability of occurrence of an incident to the probability that it will not happen. For a given incident A, the aforementioned definition takes the following form:

$$\mathrm{Odds}(A) = \frac{P(A)}{1 - P(A)} \tag{3}$$

The most commonly used measure of association is the odds ratio OR, which, for two compared groups A and B, is defined as the ratio of the odds of occurrence in group A to the odds of occurrence in group B. The odds ratio is recorded as follows:

$$OR = \frac{\mathrm{Odds}(A)}{\mathrm{Odds}(B)} = \frac{P(A) / (1 - P(A))}{P(B) / (1 - P(B))} \tag{4}$$

Regression coefficients b_i, where i = 0,…,k, are estimated using the maximum likelihood estimation method (MLE), which maximizes the likelihood function or, equivalently, its logarithm. The likelihood function takes the following form [33]:

$$L(x_1, x_2, \ldots, x_n; \theta_1, \theta_2, \ldots, \theta_k) = \prod_{i=1}^{n} p(x_i; \theta_1, \theta_2, \ldots, \theta_k) \quad \text{for discrete variables,} \tag{5}$$

$$L(x_1, x_2, \ldots, x_n; \theta_1, \theta_2, \ldots, \theta_k) = \prod_{i=1}^{n} f(x_i; \theta_1, \theta_2, \ldots, \theta_k) \quad \text{for continuous variables,} \tag{6}$$

where: θ1, θ2, …, θk - unknown distribution parameters; x1, x2, …, xn - the values of the variable X observed in the random sample; p(x) - distribution determined by the probability function; f(x) - distribution determined by the density function, and L - likelihood function.
Assuming that all observations are independent of each other, the probability of observing the entire data set (with the model parameters set) is equal to the product of the probabilities (likelihood functions) of individual samples. The estimates of parameters are those values for which the likelihood is the highest (the higher, the better the fit of the model to the data).
The evaluation of the model and the quality of the prediction is carried out by: applying statistics used in the evaluation of diagnostic tests, i.e. accuracy (ACC), sensitivity (SE), and specificity (SP); performing the Hosmer-Lemeshow test; and plotting the ROC curve and analyzing the AUC.
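As an illustration of the maximum likelihood principle, the sketch below fits a one-predictor logistic model by plain gradient ascent on the log-likelihood. It is a minimal example under stated assumptions: the data are made up, the optimizer (fixed-step gradient ascent) was chosen for readability and is not the estimation routine used in this study, and statistical packages typically use Newton-type iterations instead.

```python
import math


def sigmoid(z):
    """Overflow-safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_likelihood(b0, b1, xs, ys):
    """Sum of log Bernoulli probabilities of the observed outcomes."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def fit_by_gradient_ascent(xs, ys, learning_rate=0.1, steps=20000):
    """Climb the averaged log-likelihood surface starting from b0 = b1 = 0."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            residual = y - sigmoid(b0 + b1 * x)  # observed minus predicted
            g0 += residual
            g1 += residual * x
        b0 += learning_rate * g0 / n
        b1 += learning_rate * g1 / n
    return b0, b1


# Made-up, non-separable toy data (illustration only).
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
b0_hat, b1_hat = fit_by_gradient_ascent(xs, ys)
print(b0_hat, b1_hat, log_likelihood(b0_hat, b1_hat, xs, ys))
```

At the maximum, the gradient of the log-likelihood vanishes, so the sum of residuals (observed minus predicted) is zero; this is a convenient check that the iteration has converged.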
The application of statistics used in the evaluation of diagnostic tests requires the determination of a value called the cut-off point. This is the value of the examined variable that optimally divides the group into two parts, the first in which the phenomenon occurs and the second in which it does not. The cut-off point is therefore the determined value of π0 from the range [0,1], for which the following is assumed:

$$\hat{y} = \begin{cases} 1, & \hat{\pi} \ge \pi_0 \\ 0, & \hat{\pi} < \pi_0 \end{cases} \tag{7}$$

where \hat{\pi} denotes the probability predicted by the model and \hat{y} the predicted class. To evaluate the quality of classification, two measures are most often used simultaneously: sensitivity (SE) and specificity (SP), the latter expressed as the difference 1 - SP. The obtained pairs of values 1 - SP and SE are marked on a plane, where the horizontal axis represents the values of 1 - SP and the vertical axis the values of SE. Connecting the obtained points creates the so-called ROC curve, which allows an overall assessment of the predictive quality of the model. It shows all possible cut-off points and the statistics related to them. In addition, it does not depend on the adopted scale and allows easy reading of sensitivity and specificity values.
If there is a continuous independent variable in the logistic regression model, many different cut-off points can be tested to distinguish occurrences from non-occurrences, depending on the predicted probability. It is important to find a point that has high values of both sensitivity and specificity. This is the optimal cut-off point, which is determined using the Youden index (J). It takes the following form:

$$J = SE + SP - 1 \tag{8}$$

The optimal cut-off point corresponds to the case where the J value reaches its maximum.
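The construction of ROC points and the choice of the cut-off by the Youden index can be sketched as follows. The scores and labels are made up, and the convention that a case is classified as positive when its score is greater than or equal to the cut-off is an assumption of this example.

```python
def roc_points(scores, labels):
    """Return (cut-off, 1 - SP, SE) for every distinct score used as cut-off."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((cutoff, fp / n_neg, tp / n_pos))
    return points


def youden_cutoff(scores, labels):
    """Cut-off maximizing J = SE + SP - 1, i.e. SE - (1 - SP)."""
    return max(roc_points(scores, labels), key=lambda point: point[2] - point[1])


# Made-up predicted probabilities and true outcomes (illustration only).
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90]
labels = [0, 0, 0, 1, 0, 1, 0, 1]

best_cutoff, one_minus_sp, se = youden_cutoff(scores, labels)
print(best_cutoff, one_minus_sp, se, se - one_minus_sp)
```

Plotting the second element of each tuple against the third gives the ROC curve; the Youden index is the vertical distance of a point from the diagonal SE = 1 - SP.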

MODEL OF LOGISTIC REGRESSION USED FOR TESTING ROAD TRAFFIC MORTALITY RATE IN THE MAZOWIECKIE VOIVODESHIP
This article analyzes road accidents recorded in 2016-2018 in the Mazowieckie Voivodeship. The focus was on the victims of the aforementioned incidents as well as the categories of injuries sustained; the group included persons who died, were seriously or slightly injured, or did not sustain injuries. The dependent variable was the mortality index with a binomial distribution, for which 1 indicates the loss of life as a result of a road accident. Model building began by selecting factors that may affect the explained variable. Factors were selected that determine the atmospheric conditions at the time of the incident, i.e. air temperature, wind speed, and total precipitation, as well as parameters characterizing the participants and the place of the incident, i.e., the type of the perpetrator and the traffic participant, the sex and age of the victim, the effect of narcotics, the area of the incident, road lighting, and the driver's experience.
In the first step, the effect of the aforementioned variables on the dependent variable was evaluated. In the group of qualitative variables, the analysis was carried out based on contingency tables and by gauging association strength using the χ² independence test. In the group of quantitative variables, basic descriptive statistics were calculated first. The obtained values of skewness and kurtosis indicate that the distributions of the variables do not comply with the normal distribution (absolute values greater than 1 for skewness and greater than 2 for kurtosis) (Table 1). These results are confirmed by the results of the Kolmogorov-Smirnov test (Fig. 2, Tab. 2). Therefore, the assessment of the effect of quantitative variables on the fatality of road accidents was based on a rank-based counterpart of the analysis of variance, the nonparametric Kruskal-Wallis test [22]. The performed Kruskal-Wallis test, for which the null hypothesis states that the medians of all samples do not differ from each other, showed that only the participant's age variable has a statistically significant effect on the explained variable. The results of the conducted test are shown in Table 3.
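For readers who wish to reproduce this kind of comparison, the Kruskal-Wallis statistic for two groups (for example, fatal versus non-fatal cases) can be computed with the standard library alone. The sketch below is an illustration under stated assumptions: the ages are made up, only the two-group case is handled, and the p value uses the chi-square approximation with one degree of freedom.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def kruskal_wallis_two_groups(a, b):
    """H statistic (tie-corrected) and approximate p value for two samples.

    Assumes the pooled sample is not entirely tied.
    """
    pooled = list(a) + list(b)
    n = len(pooled)
    ranks = average_ranks(pooled)
    rank_sum_a = sum(ranks[:len(a)])
    rank_sum_b = sum(ranks[len(a):])
    h = 12.0 / (n * (n + 1)) * (rank_sum_a ** 2 / len(a) + rank_sum_b ** 2 / len(b))
    h -= 3.0 * (n + 1)
    counts = {}
    for value in pooled:
        counts[value] = counts.get(value, 0) + 1
    tie_term = sum(c ** 3 - c for c in counts.values())
    h /= 1.0 - tie_term / (n ** 3 - n)
    h = max(h, 0.0)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(h / 2.0))
    return h, p_value


# Made-up ages in two outcome groups (illustration only).
ages_non_fatal = [1, 2, 3, 4, 5]
ages_fatal = [6, 7, 8, 9, 10]
h_stat, p_val = kruskal_wallis_two_groups(ages_non_fatal, ages_fatal)
print(h_stat, p_val)
```

A p value below the adopted significance level leads to rejecting the null hypothesis that the groups come from the same distribution.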
In the next step, the significance of the effect of the selected qualitative variables on traffic accident mortality was evaluated. For this purpose, contingency tables and the χ² independence test were used. The results are presented in Table 4. The analysis showed that only the variable effect of narcotics does not significantly affect the dependent variable studied. This is confirmed by the results of the χ² independence test, according to which, for all variables (excluding effect of narcotics), the null hypothesis should be rejected in favor of the alternative hypothesis that road traffic mortality depends on the qualitative variable being studied.
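The χ² independence test on a contingency table can be sketched in a few lines. The table below is made up and does not come from Table 4; the p value helper covers only the case of one degree of freedom (a 2 x 2 table), and no continuity correction is applied.

```python
import math


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


def p_value_one_dof(statistic):
    """Upper-tail probability of the chi-square distribution with 1 df."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Made-up 2 x 2 table: rows = category of a qualitative variable,
# columns = (fatal, non-fatal). Illustration only.
table = [[30, 70],
         [10, 90]]
stat, dof = chi_square_independence(table)
print(stat, dof, p_value_one_dof(stat))
```

With expected counts of 20 and 80 in each row, every cell deviates by 10, which gives a statistic of 12.5 and a p value well below 0.05, so independence would be rejected for this toy table.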
The use of the logistic regression model requires that the logarithm of the odds depend linearly on the quantitative independent variables. The quantitative predictors were evaluated in this respect. A significant effect was found only for the participant's age variable. A graphical evaluation of the scatter plot of this variable against the corresponding value of the logarithm of the odds was carried out (Fig. 3) [33]. In addition, a linearity test was performed. The calculated p value indicates that there are no grounds for rejecting the null hypothesis of a linear effect of the studied variable on the analyzed phenomenon.
In the next stage, the construction of the multivariate model was started. Variables whose effect on mortality owing to traffic accidents was significant were used. For qualitative variables, parametrization was carried out with an adopted reference level (Table 4). When building the model, cross-validation was selected as the method of its validation. The estimated values of the logistic regression parameters are presented in Table 5.
All estimated model parameters proved to be statistically significant. For all variables, the p value corresponding to the calculated Wald test statistic is smaller than the assumed significance level α = 0.05. On this basis, it can be concluded that the listed factors have a significant effect on the risk of death as a result of a traffic incident. The equation of the logistic regression model is as follows:

$$P(Y = 1) = \frac{e^{a}}{1 + e^{a}} \tag{9}$$

where: a = -2.01 + 0.02·participant's age - 1.17·type of perpetrator.dr + 0.55·type of participant.p + 0.08·type of participant.dr - 1.14·built-up area - 0.42·lighting day light + 0.31·lighting night unlit road + 0.59·gender male + 0.65·experience no data - 0.44·experience high.

The next step examined the goodness of fit of the model based on a comparison of predicted and observed values. For this purpose, the Hosmer-Lemeshow (HS) test was used. The HS test is based on a comparison of the predicted number of occurrences of the modelled class with the number actually observed. The null hypothesis assumes that the observed and expected numbers of occurrences do not differ from each other. The calculated p value of the HS test is 0.64 and does not provide grounds for rejecting the null hypothesis, indicating that the model is well fitted to the data.
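The fitted equation can be evaluated for a single, hypothetical participant as in the sketch below. The dictionary keys are illustrative names introduced here, the coding of qualitative variables as 0/1 indicators relative to the reference level is an assumption, and the coefficient for gender is taken as positive, consistent with the reported odds ratio of 1.80 for men.

```python
import math

# Coefficients as reported in the model equation; key names are illustrative.
COEFFICIENTS = {
    "participant_age": 0.02,
    "perpetrator_driver": -1.17,
    "participant_pedestrian": 0.55,
    "participant_driver": 0.08,
    "built_up_area": -1.14,
    "lighting_daylight": -0.42,
    "lighting_night_unlit": 0.31,
    "gender_male": 0.59,        # sign assumed positive (odds ratio 1.80)
    "experience_no_data": 0.65,
    "experience_high": -0.44,
}
INTERCEPT = -2.01


def linear_predictor(features):
    """Value of a in the model equation; missing features count as 0."""
    return INTERCEPT + sum(
        coef * features.get(name, 0.0) for name, coef in COEFFICIENTS.items()
    )


def death_probability(features):
    """P(Y = 1) = e^a / (1 + e^a)."""
    return 1.0 / (1.0 + math.exp(-linear_predictor(features)))


# Hypothetical case: 40-year-old experienced male driver who caused the
# accident, in a built-up area, in daylight.
example = {
    "participant_age": 40,
    "perpetrator_driver": 1,
    "participant_driver": 1,
    "built_up_area": 1,
    "lighting_daylight": 1,
    "gender_male": 1,
    "experience_high": 1,
}
print(linear_predictor(example), death_probability(example))
```

Changing a single indicator in `example`, for instance replacing daylight with night on an unlit road, shows directly how the predicted probability responds to that factor while the others are held fixed.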
Next, the odds ratios of death in a traffic accident were calculated, taking into account the effect of individual factors. The sign of each estimated parameter indicates whether the analyzed factor increases or decreases the odds of the occurrence of the studied phenomenon. The calculated odds ratio values (Table 6) indicate that each of the model parameters significantly affects the risk of death in a road accident. In relation to the age of the incident participant, the odds ratio indicates that with each subsequent year of life the odds of death as a result of a traffic incident increase by 2%. Considering the type of the perpetrator, a driver's odds of dying are estimated at 0.31 times those of a pedestrian. An analysis of the odds ratio of the type of the participant indicates that, in relation to a passenger, the odds of death are 1.74 times as high for a pedestrian and 1.09 times as high for a driver. The odds of losing one's life in a traffic accident that takes place in a built-up area are 0.32 times those in an undeveloped area. Considering the type of road lighting, in relation to driving after dark or at dawn, the odds of losing life owing to a traffic incident in daylight are 0.65 times as high, whereas at night on an unlit road they are 1.37 times as high. In addition, the results indicate that the odds of men dying in traffic accidents are 1.80 times those of women. In turn, the experience of incident participants shows that the odds of dying for drivers with extensive experience are 0.64 times those of persons who have held a driving license for a shorter time.
The developed model can be used to predict the risk of death as a result of a traffic incident; therefore, the quality of prediction was evaluated. For this purpose, the ROC curve was plotted; the most important evaluation parameter was the AUC, i.e. the area under the ROC curve. It takes values from 0 to 1. The greater the area under the curve, the greater the discriminatory performance of the model; for its evaluation, Kleinbaum and Klein proposed the following classification (Table 7) [33].
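The link between a regression coefficient b and its odds ratio is OR = e^b. The sketch below recomputes a few odds ratios from the rounded coefficients quoted in the model equation; because the inputs are rounded to two decimals, the last digit may differ slightly from the values in Table 6. The dictionary keys are illustrative names.

```python
import math

# Rounded coefficients taken from the model equation; key names are illustrative.
coefficients = {
    "participant_age": 0.02,
    "perpetrator_driver": -1.17,
    "built_up_area": -1.14,
    "experience_high": -0.44,
}

# Odds ratio for a one-unit change in each predictor.
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}
for name, value in odds_ratios.items():
    print(f"{name}: OR = {value:.2f}")

# For a quantitative predictor the effect compounds multiplicatively:
# the odds ratio for a 10-year age difference is exp(10 * b).
or_ten_years = math.exp(10 * coefficients["participant_age"])
print(f"10-year age difference: OR = {or_ten_years:.2f}")
```

An odds ratio below 1 means the factor lowers the odds relative to the reference level, and a value above 1 means it raises them; the percentage change in odds is (OR - 1)·100%.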
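The AUC can also be interpreted as the probability that a randomly chosen case in which the event occurred receives a higher predicted score than a randomly chosen case in which it did not. The sketch below computes the AUC directly from this pairwise definition, counting ties as one half; the scores and labels are made up.

```python
def auc_by_pairs(scores, labels):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up predicted probabilities and true outcomes (illustration only).
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90]
labels = [0, 0, 0, 1, 0, 1, 0, 1]
auc_value = auc_by_pairs(scores, labels)
print(auc_value)
```

The pairwise loop is quadratic in the sample size, which is acceptable for an illustration; for data sets of the size analyzed here, a rank-based or trapezoidal computation would be preferable.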
As a method of model validation, cross-validation was used, for which the sample was split into 10 subgroups containing the training and test sets [26]. The study was then conducted in accordance with the LOO (leave one out) strategy, which assumes that all observations are used to estimate the model, with the exception of one observation, which is used to calculate the prediction error. In cross-validation, the global evaluation of prediction is an average of errors from individual sets [33]. Owing to the use of cross-validation for the model, ROC curves were plotted separately for the training and test sample sets and the AUC values were calculated (Figs. 4 and 5). In addition, the cut-off point and the matrix of classification of actual observations and those predicted by the model were determined (Tables 8 and 9). The calculated AUC values are 0.7589 for the training set and 0.7567 for the test set, which indicates fair discrimination. This shows both a sufficient fit to the data and sufficient quality of the model in the event of receiving new data. In addition, the calculated AUC values are similar (the difference between them does not exceed 0.05), which allows the model to be considered correct. For the proposed cut-off point, the sensitivity is 0.67, whereas the specificity is 0.71 (1 - SP = 0.29). A total of 14,040 cases (692 true-positive and 13,348 true-negative) have been correctly classified, whereas 5,909 cases (5,570 false-positive and 339 false-negative) have been classified incorrectly. For such a cut-off point, the value of the accuracy indicator assessing the prediction efficiency is 70.37%, allowing the model to be considered satisfactory. In addition, the percentage share of correct indications of death as a result of a traffic incident is 67.11%, which demonstrates that the model has good predictive properties.
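The diagnostic statistics can be recomputed from the reported classification counts as a check. The sketch below uses the four counts given in the text; with these counts the specificity works out to about 0.71, i.e. 1 - SP is about 0.29. The variable names are illustrative.

```python
# Classification counts as reported in the text.
true_positive = 692
true_negative = 13_348
false_positive = 5_570
false_negative = 339

total = true_positive + true_negative + false_positive + false_negative

# Sensitivity: share of actual deaths that the model flags as deaths.
sensitivity = true_positive / (true_positive + false_negative)
# Specificity: share of non-fatal cases that the model classifies as non-fatal.
specificity = true_negative / (true_negative + false_positive)
# Accuracy: share of all cases classified correctly.
accuracy = (true_positive + true_negative) / total
# Youden index for this cut-off.
youden_j = sensitivity + specificity - 1.0

print(f"n = {total}")
print(f"SE = {sensitivity:.4f}, SP = {specificity:.4f}, 1 - SP = {1 - specificity:.4f}")
print(f"ACC = {accuracy:.4f}, J = {youden_j:.4f}")
```

Because fatal cases make up only about 5% of the sample, accuracy is dominated by the non-fatal class, which is why sensitivity and specificity are reported alongside it.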

CONCLUSIONS
The study presented in the article was conducted on the basis of data on accidents registered only in the Mazowieckie Voivodeship. The scope was adopted on account of demographic and social diversity as well as differences related to the quality of the available road infrastructure, which could affect the quality of the model. The territorial limitation of the area enabled an accurate evaluation of the incidents occurring there.
The aim of the article was to develop a model for estimating the level of risk of death as a result of traffic incidents and evaluating the factors that affect it. Reference was made not to the number of accidents but to their consequences, which made it possible to clearly highlight the scale of the threat. Analysis concentrating solely on the frequency of such situations does not fully reflect their consequences, as many of them do not lead to fatalities, which may obscure the proper assessment of the problem. The developed model can be applied in processes related to road safety management.
Among the significant factors, age was distinguished; it was assessed that with each subsequent year of life, the risk of death in a road accident increased by 2%. It was also pointed out that pedestrians are the group most exposed to death, both as perpetrators and victims. In addition, the significance of the effect of time of day on the phenomenon under study was emphasized; the study demonstrated that the risk of death was greatest when driving at night on an unlit road. Furthermore, according to the conducted study, men were the sex most exposed to the negative effects of road incidents: the risk of death for men is 1.8 times higher than for women. The effect of driving experience is also important; the odds of death for drivers with longer practice are 0.64 times those of less experienced drivers.
The research results available in the literature indicate that various variables affect the mortality rate in road accidents. The results obtained in this study are similar to those of the analyses carried out in the Kathmandu region, where the victim's age (the risk of death increases by 2.3% with each subsequent year of life) and road lighting (driving at night on unlit roads) were also significant. By contrast, studies conducted in Iraq and Saudi Arabia have shown that mortality is influenced by other factors, such as driving speed, type of car, or the scene of the accident. The proposed model indicates the factors that most significantly increase the risk of death in an accident. Thus, it can support the decision-making of traffic participants concerning their route planning. In addition, it may support public safety and law enforcement authorities in carrying out preventive actions, including preventive police actions, aimed at reducing the number of road accidents, as well as social campaigns aimed at increasing awareness of the factors contributing to the tragic effects of road accidents.

Fig. 1. Graph of the number of fatalities in Poland by voivodeship

Fig. 2. Histograms of the observed distribution of variables

Fig. 3. A scatter plot of the logarithm of the odds as a function of the predictor Participant's age

Table 1
Descriptive statistics for the independent variables

Table 2
Kolmogorov-Smirnov test results for normality of distribution

Table 8
Matrix of classification of actual observations and the model's predictions