This report's research claim states that the average level of unemployment is affected by some amount by each of ten variables. The error term captures all the variables not already included in the equation; it is added at the end of the equation to make the model more reliable. The model uses secondary panel data (monthly, 1986-2017) drawn as a sample from a larger population. This study tests whether the number of unemployed persons (B0) is affected by the following variables:
● B1: people in the labour force,
● B2: unemployed total males,
● B3: unemployed total females,
● B4: average weekly full-time wage,
● B5: gross domestic income (GDI),
● B6: real gross domestic product (GDP),
● B7: total company labour costs,
● B8: total welfare spending,
● B9: total occupation skill shortage ratings, and
● B10: gross household product (GHP).
Equation before estimation: U = B1 + B2 + B3 + B4 + B5 + B6 + B7 + B8 + B9 + B10 + ɛt
Hypothesis of each variable:
“The null hypothesis is denoted by H0, and the alternative hypothesis is denoted by H1.” (Mann, 2010, p.382) The null hypothesis is usually the hypothesis assumed to be true to begin with: in statistics it states that a given claim (or statement) about a population parameter is true. For the model to be significant, the test must find that at least one variable's coefficient is unequal to zero.
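Formally, using the report's B-notation, the joint hypothesis being tested can be written as:

```latex
H_0:\; B_1 = B_2 = \dots = B_{10} = 0 \quad \text{(no variable affects unemployment)}
\qquad
H_1:\; B_j \neq 0 \ \text{for at least one } j \in \{1, \dots, 10\}
```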
The variables chosen include quantitative data, as well as qualitative data that has been converted into ratings, making it quantitative. The first set of secondary data collected was the monthly change in total unemployed persons between 1986 and 2017. This is the dependent variable in the equation, since unemployment is the subject of the original hypothesis question. Secondly, the total number of people in the labour force was selected because the labour force comprises both unemployed and employed persons, and so determines the fraction of the labour force who are unemployed. The next two variables are the total number of unemployed males and the total number of unemployed females, which together make up the total number of unemployed persons and can show whether males or females account for the larger fraction of unemployment. Another variable is the average weekly full-time wage (in dollars), which aims to determine whether an increase or decrease in wages has any effect on unemployment.
Three more variables, Gross Domestic Income, Gross Domestic Product and Gross Household Product (all in dollars), are included, all three being major contributors to the overall economic indicators. GDI can explain whether a collective change in income affects unemployment. GDP can explain whether the collective production of the economy affects unemployment. GHP can explain whether the economy's collective unpaid work and household capital (including unpaid domestic work, unpaid assistance to a person with a disability, unpaid child care and voluntary work for an organisation or charity) affects unemployment. GHP is significant to the original research question because the failure of statistical organisations to provide official estimates of the household economy means that almost half of the valuable economic activity undertaken by Australians is ignored by economists and policy makers.
The last variables are total company labour costs in dollars, for their overall effect on unemployment (higher costs causing fewer jobs); total welfare spending in dollars, for its overall effect on unemployment (fewer jobs taken due to lesser financial need); and total occupation skill shortage ratings, for their overall effect on unemployment (more shortages in occupational skills causing more unemployment). All these variables were chosen in an attempt to cover all types of unemployment: frictional (workers temporarily transitioning between jobs, e.g. recent graduates), structural (workers with insufficient skills, which employers no longer demand) and cyclical (the result of a recession or a drop in the business cycle). Frictional and structural unemployment are unavoidable and always occur in the economy; together they make up the natural rate of unemployment (the amount of unemployment the economy experiences even at full employment).
In choosing to analyse the data using linear regression, part of the process involves checking that the data can validly be analysed this way. Linear regression is only appropriate if the data pass the six assumptions required for it to give a valid result.
Assumption #1: The variables should be measured at the continuous level (i.e., they are either interval or ratio variables). Here the variables are interval variables.
Assumption #2: There needs to be a linear relationship between the dependent and independent variables:
When graphing the data, it may be stationary or non-stationary depending on its condition. Stationary data is the ideal condition, obtained here by taking log returns. Non-stationary data shows serial autocorrelation, meaning a trend in the data. The graphing differences for each are represented below.
Non-Stationary Unemployment Data (Females and Males) 1986-2017:
Stationary Unemployment Data (Females and Males) 1986-2017:
As the relationship displayed in the scatterplot is not linear, the non-stationary data is converted to stationary data in order to eliminate the trend, so that the error term has a mean of zero and a constant standard deviation.
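The conversion to stationary data described above (taking log returns) can be sketched in a few lines of Python; the unemployment figures below are made up purely for illustration:

```python
import math

# Hypothetical monthly unemployment counts (thousands), trending upward
unemployed = [570.0, 575.2, 581.1, 588.4, 596.0, 604.9]

# Log returns: differencing the logged series removes a multiplicative
# trend, leaving a roughly stationary series centred near zero
log_returns = [math.log(b) - math.log(a)
               for a, b in zip(unemployed, unemployed[1:])]

print([round(r, 4) for r in log_returns])
```

Plotting the log returns rather than the raw counts is what produces stationary charts like the one shown above.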
Assumption #3: There should be no significant outliers (an observed data point that has a dependent variable value that is very different to the value predicted by the regression equation). An outlier is far away from the regression line indicating a large residual, as highlighted below.
Assumption #4: There needs to be independence of observations, which is checked using the Durbin-Watson statistic; this is analysed on page fifteen.
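For reference, the Durbin-Watson statistic is computed from the regression residuals as d = Σ(eₜ − eₜ₋₁)² / Σeₜ², with values near 2 indicating no first-order autocorrelation; a minimal sketch with made-up residuals:

```python
def durbin_watson(residuals):
    """d = sum of squared successive differences / sum of squared residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Made-up residuals with no strong pattern: d lands near 2, inside the
# 1.5 < d < 2.5 range treated as "no autocorrelation"
resid = [0.5, 0.1, -0.3, 0.4, -0.2, -0.5, 0.3, 0.2]
print(round(durbin_watson(resid), 3))  # ≈ 2.054
```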
Assumption #5: The data needs to show homoscedasticity (variances along the line of best fit remain similar along the line). Three scatterplots show two examples of data that fail the assumption (heteroscedasticity) and one of homoscedasticity:
Assumption #6: The residuals (errors) of the regression line are approximately normally distributed, checked using a histogram (with a superimposed normal curve); this is done on page sixteen.
SPSS Statistics generates quite a few tables of output for a linear regression, but only the three main tables (Model Summary, ANOVA and Coefficients) are required to understand the results of the linear regression procedure. In this regression, unemployment is the dependent variable and all other variables are independents. Equation before estimation: U = B1 + B2 + B3 + B4 + B5 + B6 + B7 + B8 + B9 + B10 + ɛt. All ten variables are secondary data, and the determinants of unemployment also include the error term ɛt in the equation.
Variables B1, B4, B5 and B6 are dropped by the model, as shown below. Therefore more than one table and model will be run, with various variable combinations and logs with interaction variables. There may be unfulfilled assumptions present that negatively affect the data; before investigating these data issues, the next step is to interpret the results.
To interpret the results, the level of significance acts as a validity check. “The size of the rejection region in a statistics problem of a test of hypothesis depends on the value assigned to alpha (α).” (Mann, 2010, p.384) Therefore a value must be assigned to alpha before beginning the interpretation. Commonly used values are 0.01, 0.05 and 0.10; usually the value assigned to alpha does not exceed 10%. Here the significance level taken is 5% (a 95% confidence level).
Coefficients in SPSS Linear regression model, with all independent variables and unemployment as dependent variable:
“Assuming that the null hypothesis is true, the p-value can be defined as the probability that a sample statistic (such as the sample mean) is at least as far away from the hypothesized value in the direction of the alternative hypothesis as the one obtained from the sample data under consideration.” (Mann, 2010, p.391) Note that the p-value is the smallest significance level at which the null hypothesis is rejected.
Beta values are the standardized coefficients, obtained by standardizing all regression variables (dependent and independent). Because this puts them on the same scale, the magnitudes of the coefficients show which variables have more of an effect; larger betas are associated with larger t-values and lower p-values. The t-statistics are calculated as B / Std. Error, so these t-values are the calculated values.
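The t-value arithmetic can be checked directly; the sketch below uses the Python stdlib normal distribution as a large-sample approximation for the p-value, and the coefficient and standard error are hypothetical:

```python
from statistics import NormalDist

def t_and_p(coef, std_error):
    """t = B / Std. Error; two-sided p-value via the large-sample
    normal approximation (reasonable here, since n is in the hundreds)."""
    t = coef / std_error
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p

# Hypothetical coefficient and standard error, in the range reported
t, p = t_and_p(0.537, 0.031)
print(round(t, 2), round(p, 4))  # t ≈ 17.32, p ≈ 0
```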
The coefficients table provides information on how statistically significant each variable is to the model (the “Sig.” column). The intercept is non-significant, while the coefficients for total unemployed females, total unemployed males and labour costs are highly significant. B2 shows a strong positive B value: for every 1-unit increase in unemployed males there is a .537 increase in unemployment. For every 1-unit increase in labour costs there is a .543 decrease in unemployment; to explain this, labour costs must include the additional wage costs of taking on new employees, which is represented in the negative coefficient. All other variables are insignificant, implying there is no statistical evidence that they affect unemployment.
Mathematical equation: U = .001 (C) + .537 (B2) + .099 (B3) - .543 (B7) - .006 (B8) - .008 (B9) + .086 (B10)
Sub equation: U = B5 + B6 + B10 + ɛt
These variables (GDP, GDI and GHP) all contain large data values and are major economic indicators. Even though B5 and B6 are discarded from the model, B10 now shows a significance value of .055, compared with .459 in the original equation's regression results.
Mathematical equation: U = .002 (C) + .304 (B10)
The paired variables show their combined impact, and the sample has no missing values. B8 and B10 are now both significant at .000, whereas previously B8 had a Sig. of .892 and B10 of .459. This could imply that the calculated values for welfare spending and GHP combined were greater than the 1.65 critical value, rejecting the null hypothesis.
Lastly, we can check the normality of residuals with a normal P-P plot. The points should generally follow the normal (diagonal) line with no strong deviations, indicating that the residuals are normally distributed. In this plot, however, the points deviate noticeably from the diagonal line.
In this section there will be an analysis of the variables, including R-squared and the F-test, and whether the regression shows signs of multicollinearity, autocorrelation or heteroscedasticity.
The information in the table above also allows us to check for multicollinearity in our multiple linear regression model. Tolerance should be > 0.1 (equivalently, VIF < 10) for all variables, which is the case here.
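VIF is simply the reciprocal of tolerance, VIF = 1 / (1 − R²ⱼ), so either figure can be checked from the other; the tolerance values below are hypothetical:

```python
def vif_from_tolerance(tolerance):
    # VIF is the reciprocal of tolerance: VIF = 1 / (1 - R_j^2)
    return 1.0 / tolerance

# Hypothetical tolerances; both satisfy the rule of thumb used above
# (tolerance > 0.1, equivalently VIF < 10)
for tol in (0.45, 0.12):
    print(round(vif_from_tolerance(tol), 2))
```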
This table provides the R and R² values. R, the square root of R², is the correlation between the observed and predicted values of the dependent variable. The R value represents the simple correlation and is 0.699 (the “R” column), which indicates a high degree of correlation. The R² value of .488 (the “R Square” column) indicates how much of the total variation in the dependent variable, unemployment, can be explained by the independent variables. In this case 48.8% can be explained, which is less than half. This is an overall measure of the strength of association and does not reflect the extent to which any particular independent variable is associated with the dependent variable. The adjusted R² penalizes the addition of extraneous predictors to the model. The standard error of the estimate (the root mean squared error) is the standard deviation of the error term and the square root of the Mean Square for the residuals in the ANOVA table. The Durbin-Watson statistic is 1.965, which lies between the two critical values of 1.5 < d < 2.5; therefore, we can assume that there is no first-order linear autocorrelation in our multiple linear regression data.
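The adjustment mentioned above follows the standard formula, adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1); a sketch using the reported R² of .488, with an assumed n = 384 observations and k = 6 retained predictors:

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 penalises extra predictors:
    1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Reported R^2 of .488 with assumed n = 384 observations and
# k = 6 retained predictors (B2, B3, B7, B8, B9, B10)
print(round(adjusted_r_squared(0.488, 384, 6), 3))  # slightly below .488
```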
This table reports how well the regression equation fits the data (i.e., predicts the dependent variable). The Sums of Squares are associated with the three sources of variance: total, model and residual. The total variance splits into the part explained by the independent variables (regression) and the part not explained by them (residual), each with its associated degrees of freedom; the total variance has N-1 degrees of freedom: 383-1 = 382. The table indicates that the regression model predicts the dependent variable significantly well. Looking at the “Regression” row under the “Sig.” column, which gives the statistical significance of the regression model that was run, p < 0.0005, which is less than 0.05 and indicates that, overall, the regression model statistically significantly predicts the outcome variable (i.e., it is a good fit for the data).
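The F-statistic in the ANOVA table is the ratio of mean squares, F = (SS_regression / k) / (SS_residual / (n − k − 1)); the sums of squares below are hypothetical, chosen only to illustrate the calculation:

```python
def f_statistic(ss_regression, ss_residual, k, n):
    """F = (SS_regression / k) / (SS_residual / (n - k - 1))."""
    ms_regression = ss_regression / k        # mean square for the model
    ms_residual = ss_residual / (n - k - 1)  # mean square for the residuals
    return ms_regression / ms_residual

# Hypothetical sums of squares with k = 6 predictors and n = 384,
# chosen so that F comes out large, as in the report
print(round(f_statistic(12.0, 12.6, 6, 384), 1))
```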
The last things to check are the homoscedasticity and normality of the residuals. The histogram indicates that the residuals approximate a positively skewed distribution (skewness 2.3 > 0); against the kurtosis benchmark of three, the distribution shows leptokurtic peakedness (kurtosis 128.2 > 3). Both values are calculated using Excel.
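The moment-based skewness and kurtosis figures quoted above (calculated in Excel) follow standard formulas that can be reproduced directly; the sample below is made up:

```python
def skewness_and_kurtosis(data):
    """Population skewness (third standardised moment) and kurtosis
    (fourth standardised moment); a normal distribution has kurtosis 3."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2

# A made-up right-skewed sample: skewness > 0 and kurtosis > 3,
# the same pattern described for the residuals above
skew, kurt = skewness_and_kurtosis([1, 1, 2, 2, 3, 10])
print(round(skew, 2), round(kurt, 2))
```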
A t-test is used when the sample size (n) is less than 30; however, this model has n = 384, so a z-test is taken instead, the population standard deviation also being known (0.021004 calculated in Excel, or 0.0121 in SPSS). For a one-tailed test at the 5% level, 1 - 0.05 = 0.95, giving a critical z-value of 1.65 (a two-tailed test at 5% would instead use 1.96).
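The 1.65 critical value used in the comparisons below can be verified from the standard normal distribution with the Python stdlib; it is the one-tailed 5% point (the 0.95 quantile), usually tabulated as 1.645:

```python
from statistics import NormalDist

# One-tailed critical z at the 5% significance level: the point with
# 95% of the standard normal distribution below it
critical = NormalDist().inv_cdf(0.95)
print(round(critical, 3))  # 1.645, rounded to 1.65 in the report
```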
- B2: 17.502 > 1.65, as the calculated value is greater than the critical value we reject the null hypothesis.
- B3: 4.554 > 1.65, as the calculated value is greater than the critical value we reject the null hypothesis.
- B7: |-5.946| = 5.946 > 1.65, as the absolute calculated value is greater than the critical value we reject the null hypothesis (consistent with the highly significant labour-costs coefficient above).
- B8: |-1.36| = 1.36 < 1.65, as the absolute calculated value is less than the critical value we do not reject the null hypothesis.
- B9: |-.654| = .654 < 1.65, as the absolute calculated value is less than the critical value we do not reject the null hypothesis.
- B10: .742 < 1.65, as the calculated value is less than the critical value we do not reject the null hypothesis.
- F-test: 60 far exceeds its critical value (taken from the F-distribution), so we reject the null hypothesis.
As we reject the null hypothesis of the F-test (the overall validity of the model), at least one of the variables is statistically significant, so the model has statistical validity. We remain mindful of both types of error: a Type I error occurs when a true null hypothesis is rejected, and the value of alpha (α) represents the probability of committing it; a Type II error occurs when a false null hypothesis is not rejected, and the value of beta (β) represents the probability of committing it.
Mann, P. S. (2010). Introductory Statistics, International Student Edition (7th ed.).
Laerd Statistics. (2018). Retrieved from: <https://statistics.laerd.com/features-overview.php>.
Secondary Data Sources:
ABS 6302.0 – Average Weekly Earnings, Australia
ABS Australian Industry Labour costs
AIHW welfare expenditure database reports
Census data quality
Indicative Department of Jobs and Small Business skill shortage ratings at the national level, 1986 to 2017.
Labour force status by Sex, Australia – Trend, Seasonally adjusted and Original
RBA Gross Domestic Product & Income – H1