OLS Regression: Basics and Applications

Introduction

Ordinary least squares (OLS) regression is a statistical method used to determine the strength of the relationship between a continuous dependent variable and one or more explanatory variables, and to predict values of the dependent variable from those explanatory variables (Hutcheson, 2011). A condition for OLS regression is that the dependent variable is continuous and that any categorical explanatory variables are appropriately coded; otherwise, alternative regression methods can be used (Hutcheson, 2011). OLS regression is essentially a linear modelling method, in which the dependent variable is assumed to be linearly related to the explanatory variables. Non-linear relationships can nevertheless be modelled by applying suitable transformations to one or more of the variables so that the relationship becomes linear (Hutcheson, 2011). If a non-linear relationship is left untransformed, a model can still be fitted, but it will be unreliable (Hutcheson, 2011). An advantage of the method is that analysis of variance and analysis of covariance can be carried out within the same framework (Hutcheson, 2011).

OLS regression can be divided into two types: simple and multiple. Simple OLS regression uses only one explanatory variable, whereas multiple OLS regression uses two or more (Hutcheson, 2011). In multiple regression, the effect of each explanatory variable on the dependent variable is estimated while controlling for the other variables. The multiple OLS regression equation has the same form as the simple one, with one additional term for each extra explanatory variable (Hutcheson, 2011).
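The effect of a linearising transformation can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the exponential relationship, the noise level and the helper name r_squared are invented for the example and are not part of the workshop data.

```python
import numpy as np

def r_squared(observed, fitted):
    """Share of the variance in `observed` that is explained by `fitted`."""
    rss = np.sum((observed - fitted) ** 2)
    tss = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - rss / tss

# Synthetic data (assumed): y = 2 * exp(0.8 * x) with multiplicative noise
rng = np.random.default_rng(7)
x = np.linspace(0.0, 4.0, 60)
y = 2.0 * np.exp(0.8 * x) * np.exp(rng.normal(scale=0.05, size=x.size))

# Straight-line fit on the original scale (relationship is not linear here)
slope_raw, intercept_raw = np.polyfit(x, y, 1)
r2_raw = r_squared(y, intercept_raw + slope_raw * x)

# Straight-line fit after transforming y: ln(y) = ln(2) + 0.8 * x
slope_log, intercept_log = np.polyfit(x, np.log(y), 1)
r2_log = r_squared(np.log(y), intercept_log + slope_log * x)

print(f"original scale: R2 = {r2_raw:.3f}")
print(f"log scale:      R2 = {r2_log:.3f}, slope = {slope_log:.3f}")
```

On the log scale the fitted slope recovers the rate used to generate the data, and the straight line fits better than on the original scale.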

Geographically weighted regression (GWR) is a spatial analysis method used in geography (Wheeler & Páez, 2010). GWR is designed for non-stationary relationships, that is, relationships between variables that vary across the study area (Wheeler & Páez, 2010). An advantage of GWR is that each local model is estimated from a subset of the data selected by geographic location (Wheeler & Páez, 2010). In GWR, a regression model is fitted at each location in the data, and a kernel function is used to compute the weight given to each data location (Wheeler & Páez, 2010). A disadvantage of GWR is that reusing the same data to fit models at many locations increases the probability that statistical significance tests return significant results (Wheeler & Páez, 2010). Overall, GWR extends the OLS regression model by taking the spatial structure into account and estimating local models with different coefficients for each location in the data (Matthews & Yang, 2012).
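The local fitting idea can be sketched in Python with NumPy. This is a minimal illustration, not the ArcGIS implementation: the Gaussian kernel, the fixed bandwidth, the function names gaussian_weights and gwr_coefficients, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def gaussian_weights(coords, focal, bandwidth):
    """Kernel weights that decay with distance from the focal location."""
    d = np.linalg.norm(coords - focal, axis=1)
    return np.exp(-0.5 * (d / bandwidth) ** 2)

def gwr_coefficients(coords, X, y, bandwidth):
    """Fit one weighted least-squares model per location.

    Returns an (n, p + 1) array; row i holds the intercept and slopes
    estimated with kernel weights centred on location i.
    """
    n = len(y)
    A = np.column_stack([np.ones(n), X])      # design matrix with intercept
    betas = np.empty((n, A.shape[1]))
    for i in range(n):
        w = gaussian_weights(coords, coords[i], bandwidth)
        sw = np.sqrt(w)
        # Weighted least squares equals OLS on rows scaled by sqrt(weight)
        betas[i], *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return betas

# Synthetic example (assumed): the slope of x grows from west (0) to east (10)
rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(200, 2))
x = rng.normal(size=200)
true_slope = 1.0 + 0.3 * coords[:, 0]
y = 2.0 + true_slope * x + rng.normal(scale=0.1, size=200)

betas = gwr_coefficients(coords, x[:, None], y, bandwidth=1.5)
west = betas[coords[:, 0] < 3, 1].mean()
east = betas[coords[:, 0] > 7, 1].mean()
print(f"mean local slope, west: {west:.2f}; east: {east:.2f}")
```

A single global OLS fit would return one slope for the whole area, whereas the local fits recover the west-to-east trend built into the synthetic data.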

There are two main differences between OLS regression and GWR. First, OLS regression is called a global model because the coefficients of the regression equation are fixed for the whole study area, whereas in GWR the coefficients vary from one location to another, so it is called a local model (Wheeler & Páez, 2010). Second, OLS regression is commonly used in non-spatial modelling, whereas GWR is a spatial modelling method commonly used in geography.

Aim

The data used in this workshop are from the Esri online training website (Esri, 2020a). 911 incidents in Portland, Oregon, are displayed in a hot spot map. The objective of this workshop is to identify the factors that explain the occurrence of high and low values in different areas and to model them using OLS regression, exploratory regression and GWR. The number of 911 incidents is the dependent variable, and several explanatory variables are used, such as population, jobs and the percentage of low education. Furthermore, a future prediction is obtained using GWR. These analyses are performed in ArcGIS Pro (version 2.4, 2020).

Methods

The first step in the regression analysis is to define the dependent variable y, which in this workshop is the number of 911 calls. The x variables are candidate explanatory variables, and the analysis identifies which of them best explain y, following the form of Equation (1).

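Equation 1: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon$

where $y$ is the dependent variable, $x_1, \dots, x_n$ are the explanatory variables, $\beta_0, \dots, \beta_n$ are the coefficients and $\varepsilon$ represents the residuals.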
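As a rough illustration of how the coefficients in Equation 1 are estimated, the sketch below fits a two-variable model by least squares with NumPy. The function name fit_ols and the synthetic data are assumptions made for the example; the workshop itself uses the ArcGIS Pro tool rather than this code.

```python
import numpy as np

def fit_ols(X, y):
    """Estimate the coefficients of y = b0 + b1*x1 + ... + bn*xn + e.

    Returns the coefficient vector (intercept first) and the residuals.
    """
    A = np.column_stack([np.ones(len(y)), X])   # prepend the intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ beta
    return beta, residuals

# Synthetic data (assumed): y = 3 + 1.5*x1 - 2*x2 + small noise
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.05, size=100)

beta, residuals = fit_ols(X, y)
print("estimated coefficients:", np.round(beta, 2))
```

The sign of each estimated coefficient gives the direction of the relationship between that explanatory variable and y, which is what the second model validation question examines.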

Each model produced in parts 1-3 must be checked: the model validation questions are used to identify the passing models and the best model. After OLS regression has been performed, GWR can be used if the test statistic shows statistically significant non-stationarity. The best model is then used in parts 4-5, where the GWR model and future predictions are obtained. The model validation questions that must be satisfied are:

1) Do the explanatory variables help the model?

2) Are the relationships what you expected?

3) Are any of the explanatory variables redundant?

4) Is the model biased?

5) Do you have all the key explanatory variables?

6) How well are you explaining your dependent variable?

Any model that satisfies these questions can be considered a passing model.

In part 1 of this workshop, OLS regression is performed with a single explanatory variable, unemployment, using the Ordinary Least Squares geoprocessing tool. The tool generates a report that helps to answer the model validation questions and so to decide whether the model passes or fails.

In part 2 of this workshop, a multivariable regression model is created from the explanatory variables population, employment, percentage of low education and distance to the nearest urban centre. The Ordinary Least Squares geoprocessing tool is used in this part as well. The tool generates a report that helps to answer the model validation questions and so to decide whether the model passes or fails.

In part 3 of this workshop, exploratory regression is used to evaluate many possible combinations of explanatory variables and to search for the best OLS model. The geoprocessing tool used in this part is the Exploratory Regression tool, and the explanatory variables evaluated are population, jobs, percentage of low education, distance to the nearest urban centre, unemployment, alcoholX, population density and median income. In the search criteria, the minimum number of explanatory variables per model is set to 2 and the maximum to 5. The tool generates a report, and the passing models listed in it also satisfy the OLS model validation questions. In every passing model, the coefficients of the explanatory variables are statistically significant; the coefficients describe the relationship between the dependent variable and each explanatory variable; no explanatory variable is redundant, since all variance inflation factor (VIF) values are below 7.5; the model is unbiased, since the residuals are normally distributed; and the p-value for spatial autocorrelation of the residuals is not statistically significant (Esri, 2020b).
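The size of the search implied by these criteria can be checked with a few lines of Python. The labels in the list below are illustrative stand-ins for the eight candidate variables, not the field names of the dataset, and the helper candidate_models is a hypothetical name.

```python
from itertools import combinations

# Illustrative labels for the eight candidate explanatory variables (assumed names)
candidates = [
    "population", "jobs", "low_education", "dist_urban_centre",
    "unemployment", "alcoholX", "population_density", "median_income",
]

def candidate_models(variables, min_size=2, max_size=5):
    """Yield every subset of `variables` with min_size to max_size members."""
    for k in range(min_size, max_size + 1):
        yield from combinations(variables, k)

models = list(candidate_models(candidates))
n_models = len(models)
print(n_models, "candidate models")
```

With eight candidates and between 2 and 5 variables per model there are 28 + 56 + 70 + 56 = 210 combinations, each of which would be fitted and screened against the passing criteria.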

In part 4 of this workshop, a passing model is used to obtain a better result by performing GWR. GWR can be used when the Koenker test statistic shows statistically significant non-stationarity. The best OLS model is the one with the highest adjusted R2 and the lowest Akaike information criterion (AIC) (Zhao & Park, 2004). The geoprocessing tool used is the Geographically Weighted Regression tool. The explanatory variables are those of the best OLS model, the neighbourhood type is Number of Neighbors, and the neighbourhood selection method is Golden Search. The diagnostic statistics, including the adjusted R2 and AIC, can be used to compare the GWR results with the OLS regression results.
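The two diagnostics used to rank models can be sketched as follows. The AIC shown is one common least-squares form, n ln(RSS/n) + 2k with constant terms dropped; software packages differ in constants and small-sample corrections, so the values need not match a tool's report, and only differences between models fitted to the same data are meaningful. The function names and the synthetic data are assumptions made for the example.

```python
import numpy as np

def fit_and_predict(X, y):
    """Return the OLS fitted values for y regressed on X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ beta

def adjusted_r2(y, fitted, n_explanatory):
    """R2 penalised for the number of explanatory variables."""
    n = len(y)
    rss = np.sum((y - fitted) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - rss / tss
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_explanatory - 1)

def gaussian_aic(y, fitted, n_coefficients):
    """AIC for a least-squares fit, up to an additive constant."""
    n = len(y)
    rss = np.sum((y - fitted) ** 2)
    return n * np.log(rss / n) + 2 * n_coefficients

# Synthetic data (assumed): y depends on both explanatory variables
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 2))
y = 1.0 + 2.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.5, size=80)

fit_one = fit_and_predict(X[:, :1], y)   # model with the first variable only
fit_two = fit_and_predict(X, y)          # model with both variables
scores = {
    "one variable": (adjusted_r2(y, fit_one, 1), gaussian_aic(y, fit_one, 2)),
    "two variables": (adjusted_r2(y, fit_two, 2), gaussian_aic(y, fit_two, 3)),
}
for name, (adj, aic) in scores.items():
    print(f"{name}: adjusted R2 = {adj:.3f}, AIC = {aic:.1f}")
```

The model that includes both relevant variables has the higher adjusted R2 and the lower AIC, which is the pattern used here to select the best model.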

In part 5 of this workshop, GWR is used to predict future values of y, again using the Geographically Weighted Regression tool. The selected model type is Continuous (Gaussian), which indicates that the dependent variable is continuous (Esri, 2020b). The explanatory variables are population, jobs, low education and distance to urban centre. The neighbourhood type is Distance Band, which gives a fixed neighbourhood size for each feature (Esri, 2020b), and the neighbourhood selection method is Golden Search, which means that the neighbourhood size is chosen based on the characteristics of the data (Esri, 2020b). The prediction options must then be expanded to complete the prediction step. The Data 911 Calls layer is selected as the prediction location. For the explanatory variables, PopFY from the prediction locations is matched to Pop from the input features, JobsFY to Jobs, LowEducFY to LowEduc and Dst2UrbCen to Dst2UrbCen. Robust Prediction is also checked. Finally, the Apply Symbology From Layer geoprocessing tool is used to match the symbology of the two layers, so that the GWRPredict layer displays the number of calls.
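The idea behind a golden-section search can be sketched generically in Python. This is a textbook version under stated assumptions and does not reproduce the internals of the ArcGIS tool: the function golden_section_minimise and the quadratic stand-in score are hypothetical, whereas in a GWR setting the score would typically be a model-fit criterion evaluated at each candidate neighbourhood size.

```python
import math

def golden_section_minimise(score, low, high, tol=1e-4):
    """Approximate the minimiser of a unimodal `score` on [low, high]."""
    inv_phi = (math.sqrt(5) - 1) / 2           # about 0.618
    a, b = low, high
    c = b - inv_phi * (b - a)                  # lower interior point
    d = a + inv_phi * (b - a)                  # upper interior point
    fc, fd = score(c), score(d)
    while b - a > tol:
        if fc < fd:
            # Minimum lies in [a, d]: reuse c as the new upper interior point
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = score(c)
        else:
            # Minimum lies in [c, b]: reuse d as the new lower interior point
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = score(d)
    return (a + b) / 2

def stand_in_score(bandwidth):
    """Hypothetical fit criterion with its minimum at a bandwidth of 2.5."""
    return (bandwidth - 2.5) ** 2 + 10.0

best = golden_section_minimise(stand_in_score, 0.5, 10.0)
print(f"selected bandwidth: {best:.3f}")
```

Each iteration reuses one earlier evaluation of the score, so only one new model fit is needed per step, which matters when every evaluation means refitting a full set of local regressions.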

Results and Discussion

According to Currit (2002), the main assumptions of OLS regression are that all variables are linearly related, that the residuals are normally distributed about the regression line, that the dependent and explanatory variables have the same distribution, and that the explanatory variables are randomly selected; in addition, OLS regression statistics are affected by outliers. Alternative regression methods have been developed because of these many assumptions. GWR is one such method and is derived from OLS regression. In GWR, the coefficients vary locally, which makes it possible to generate maps of the coefficients (Currit, 2002).

Appendix J shows the report generated by the OLS regression with a single explanatory variable (unemployment), together with the result of applying the global Moran’s I spatial autocorrelation test to the residuals. According to Appendix J, the data are stationary, with a Koenker statistic of 24.48, and the relationship between 911 calls and unemployment is moderately strong. Stationary data means that the relationship between the dependent variable and the explanatory variables is global and constant over space (Koh, Lee & Lee, 2020). The Jarque-Bera diagnostic indicates that the model is unbiased and that the residuals are normally distributed. The spatial autocorrelation test shows that the residuals have a significant clustering pattern. The AIC is approximately 767.4 and the adjusted R2 is 0.53. Overall, this model fails the model validation questions. Furthermore, the predictive power of a regression model decreases when a global OLS method is applied to non-stationary spatial data (Koh, Lee & Lee, 2020).
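The global Moran's I statistic applied to the residuals can be sketched as follows. The example uses an assumed toy layout of 20 locations on a line with binary adjacency weights, and the function names morans_i and chain_weights are hypothetical; it computes only the index, not the significance test reported by the tool.

```python
import numpy as np

def morans_i(values, weights):
    """Global Moran's I for `values` given a spatial weights matrix."""
    values = np.asarray(values, dtype=float)
    z = values - values.mean()
    n = values.size
    return (n / weights.sum()) * (z @ weights @ z) / (z @ z)

def chain_weights(n):
    """Binary weights for n locations on a line: neighbours are adjacent."""
    w = np.zeros((n, n))
    idx = np.arange(n - 1)
    w[idx, idx + 1] = 1.0
    w[idx + 1, idx] = 1.0
    return w

w = chain_weights(20)
clustered = np.array([1.0] * 10 + [-1.0] * 10)   # similar residuals side by side
alternating = np.array([1.0, -1.0] * 10)         # similar residuals never adjoin

i_clustered = morans_i(clustered, w)
i_alternating = morans_i(alternating, w)
print(f"clustered: {i_clustered:.3f}, alternating: {i_alternating:.3f}")
```

Clustered residuals give a value close to +1 (about 0.89 in this layout) and the alternating pattern gives -1. Values near the expected value under no autocorrelation, -1/(n-1), are what an unbiased model's residuals should show.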

Appendix K shows the report generated by the OLS regression with four explanatory variables (population, employment, percentage of low education and distance to the nearest urban centre), together with the result of applying the global Moran’s I spatial autocorrelation test to the residuals. According to Appendix K, the explanatory variables are statistically significant, and the relationship between 911 calls and the four variables tends to be strong. The VIF is less than 7.5 for all variables, so there are no redundant explanatory variables. The Jarque-Bera diagnostic indicates that the model is unbiased and that the residuals are normally distributed. The spatial autocorrelation test shows that the residuals have a significant clustering pattern. The AIC is approximately 694.88 and the adjusted R2 is 0.80. The R2 value measures how well the model fits the data (Hutcheson, 2011). Overall, this model passes the model validation questions.
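The redundancy check against the 7.5 threshold can be sketched with the standard definition of the variance inflation factor, VIF = 1 / (1 - R2), where R2 comes from regressing one explanatory variable on all the others. The function name vif and the synthetic variables are assumptions made for the example.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples, n_vars)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic data (assumed): x3 is almost a copy of x1, x2 is independent
rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.1 * rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

vifs = vif(X)
redundant = vifs > 7.5
print("VIF:", np.round(vifs, 1), "redundant:", redundant)
```

The two near-duplicate variables exceed the threshold while the independent one stays close to 1, which is the pattern that would flag a redundant explanatory variable.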

A limitation of OLS regression is that it cannot detect location-specific relationships or spatial autocorrelation in the model residuals (Su, Xiao & Zhang, 2012). Exploratory regression helps to determine the best model, which is then used for GWR. Appendix L shows the report generated by the exploratory regression, which lists different combinations of the 8 candidate variables. The best model chosen from these combinations has the variables x1 = +population, x2 = +Jobs, x3 = +LOWEDUC, x4 = -DIS2URBCEN, where the sign indicates the direction of the coefficient. The fourth variable has a negative coefficient, which means that 911 calls and DIS2URBCEN are negatively related. This model achieves an AIC of 683.470629 and explains 83.10% of the variation in the dependent variable. Therefore, this model is used to perform the GWR.

GWR is more powerful than OLS regression because it yields a higher adjusted R2 and a lower AIC (Su, Xiao & Zhang, 2012). A higher adjusted R2 means that more of the variance in the dependent variable is explained, and a lower AIC means that the regression model reflects reality more closely (Su, Xiao & Zhang, 2012).

Appendices A, C, E and G show the GWR coefficient maps for each variable and illustrate that the coefficient values vary locally over the study area. The larger the coefficient value, the greater the variable's influence in that area. According to Appendix E, the coefficients of the LOWEDUC variable are higher than those of the other three variables, and the highest LOWEDUC coefficients are located in the western region of the study area. According to Su, Xiao and Zhang (2012), coefficient maps are useful for visualising geographical interactions and provide helpful descriptions. Appendices B, D, F and H show the GWR standard error maps for each variable and illustrate that the standard errors also vary locally over the study area. According to Appendix F, the standard error of the LOWEDUC variable is higher than that of the other three variables over the whole study area.

In addition, GWR can be used to predict values at any location in the study area. Different bandwidth functions produce different results, so GWR is sensitive to the bandwidth function chosen (Mishra et al., 2010). In large study areas the data are usually not uniformly distributed, so the bandwidth size can be varied with the data density so that the same number of data points is involved in each local estimate (Mishra et al., 2010). GWR models achieve higher-quality predictions than OLS regression models (Koh, Lee & Lee, 2020). Hence, the GWR model is used to predict future 911 calls. Appendix I shows the predicted 911 calls over the study area, which range from 0 to 160 calls. The northern region of the study area has the highest predicted number of 911 calls, with predictions of 41 to 160 calls for most places in the region. The western region has the lowest predicted number, with predictions of 0 to 25 calls for most places in the region.
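The difference between a fixed bandwidth and one that adapts to data density can be sketched as follows. The point layout (one dense cluster and one sparse region), the radius of 0.5, the choice of 10 neighbours and the function names are all assumptions made for the illustration.

```python
import numpy as np

def pairwise_distances(coords):
    """Euclidean distance between every pair of points."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=2)

def fixed_band_counts(coords, radius):
    """Number of other points within a fixed distance of each point."""
    d = pairwise_distances(coords)
    return (d <= radius).sum(axis=1) - 1       # subtract the point itself

def adaptive_bandwidths(coords, k):
    """Distance from each point to its k-th nearest neighbour."""
    d = pairwise_distances(coords)
    return np.sort(d, axis=1)[:, k]            # column 0 is the point itself

# Synthetic layout (assumed): 150 points in a dense cluster, 50 spread thinly
rng = np.random.default_rng(5)
dense = rng.uniform(0.0, 1.0, size=(150, 2))
sparse = rng.uniform(5.0, 10.0, size=(50, 2))
coords = np.vstack([dense, sparse])

counts = fixed_band_counts(coords, radius=0.5)
bandwidths = adaptive_bandwidths(coords, k=10)
print("fixed band, mean neighbours (dense, sparse):",
      counts[:150].mean().round(1), counts[150:].mean().round(1))
print("adaptive, mean bandwidth (dense, sparse):",
      bandwidths[:150].mean().round(2), bandwidths[150:].mean().round(2))
```

With a fixed distance band, local models in the sparse region are estimated from very few points, whereas the adaptive bandwidth widens there so that every local model uses the same number of neighbours.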

Summary

The number of 911 incidents in Portland, Oregon, is the dependent variable, and the explanatory variables are determined by applying the OLS regression and Exploratory Regression tools in ArcGIS Pro (version 2.4, 2020). The models are judged against the six validation questions; models that satisfy them are classified as passing models, and the best model is used to perform GWR and to predict 911 calls over the study area using the GWR tool in ArcGIS Pro (version 2.4, 2020). The best model found has the explanatory variables x1 = +population, x2 = +Jobs, x3 = +LOWEDUC, x4 = -DIS2URBCEN. This model achieves the lowest AIC, 683.47, and the highest adjusted R2, 0.83. Overall, this workshop shows that GWR performs better than OLS regression.


References

Currit, N., 2002. Inductive regression: overcoming OLS limitations with the general regression neural network. Computers, environment and urban systems, 26(4), pp.335-353.

Hutcheson, G.D., 2011. Ordinary least-squares regression. In: L. Moutinho and G.D. Hutcheson, eds. The SAGE dictionary of quantitative management research, pp.224-228.

Koh, E.H., Lee, E. and Lee, K.K., 2020. Application of geographically weighted regression models to predict spatial characteristics of nitrate contamination: Implications for an effective groundwater management strategy. Journal of Environmental Management, 268, p.110646.

Matthews, S.A. and Yang, T.C., 2012. Mapping the results of local statistics: Using geographically weighted regression. Demographic research, 26, p.151.

Mishra, U., Lal, R., Liu, D. and Van Meirvenne, M., 2010. Predicting the spatial variation of the soil organic carbon pool at a regional scale. Soil Science Society of America Journal, 74(3), pp.906-914.

Su, S., Xiao, R. and Zhang, Y., 2012. Multi-scale analysis of spatially varying relationships between agricultural landscape patterns and urbanization using geographically weighted regression. Applied Geography, 32(2), pp.360-375.

Wheeler, D.C. and Páez, A., 2010. Geographically weighted regression. In Handbook of applied spatial analysis (pp. 461-486). Springer, Berlin, Heidelberg.

Zhao, F. and Park, N., 2004. Using geographically weighted regression models to estimate annual average daily traffic. Transportation research record, 1879(1), pp.99-107.