# How to do regression analysis in SPSS

b. Now, if we look at these variables in Data View, we see they contain values 1 through 11. Clicking Paste results in the syntax below. A minimal way to do so is running scatterplots for each predictor (x-axis) with the outcome variable (y-axis). By standardizing the variables before running the Version info: Code for this page was tested in SPSS 20. reliably predict the dependent variable? independent variables reliably predict the dependent variable. Remember that the previous predictors in Block 1 are also included in Block 2. From the ANOVA table we see that the F-test, and hence our model, is statistically significant. on your computer. First, let's take a look at these seven assumptions: You can check assumptions #3, #4, #5, #6 and #7 using SPSS Statistics. The confidence intervals are related to the p-values such that Note that $$(3.454)^2 = 11.93$$, which is the same as the F-statistic (with some rounding error). So let's first run the regression analysis for effect a (X onto mediator) in SPSS: we'll open wellbeing.sav and navigate to the linear regression dialogs as shown below. The syntax will populate the COLLIN and TOL specification values for the /STATISTICS subcommand. Should we take these results and write them up for publication? Since the standard deviation of a standardized variable is 1, the terms on the right-hand side divide to 1 and we simply get the correlation coefficient. The term $$y_i$$ is the dependent or outcome variable (e.g., api00) and $$x_i$$ is the independent variable (e.g., acs_k3). (or Error). The syntax looks like this (notice the new keyword CHANGE under the /STATISTICS subcommand). When you find such a problem, you want to go back to the original source of the data to verify the values. the columns with the t-value and p-value about testing whether the coefficients SPSS Multiple Regression Syntax II *Regression syntax with residual histogram and scatterplot.
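The remark that the squared t-value equals the F-statistic holds for any regression with a single predictor. The sketch below checks this outside SPSS in plain Python. The data are made up for illustration and are not taken from any dataset mentioned in the text; the variable names are hypothetical.

```python
# Illustrative, made-up data (an assumption, not one of the tutorial's datasets).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.3, 6.9]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squares and cross-products for the least-squares fit.
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

b1 = sxy / sxx               # slope
b0 = mean_y - b1 * mean_x    # intercept

predicted = [b0 + b1 * xi for xi in x]
ss_regression = sum((p - mean_y) ** 2 for p in predicted)
ss_residual = sum((yi - p) ** 2 for yi, p in zip(y, predicted))

# Mean squares: 1 df for the single predictor, n - 2 df for the residual.
ms_regression = ss_regression / 1
ms_residual = ss_residual / (n - 2)

f_stat = ms_regression / ms_residual
t_stat = b1 / (ms_residual / sxx) ** 0.5   # slope divided by its standard error

print(f"t^2 = {t_stat ** 2:.4f}, F = {f_stat:.4f}")

# The figure quoted in the text: 3.454 squared, rounded to two decimals.
print(round(3.454 ** 2, 2))
```

With more than one predictor the overall F-test no longer corresponds to a single coefficient's t-test, so this identity should only be expected in the simple regression case.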
Scale variables go into the Dependent List, and nominal variables go into the Factor List if you want to split the descriptives by particular levels of a nominal variable (e.g., school). reliably predict science (the dependent variable). These are very useful for interpreting the output, as we will see. of variance in the dependent variable (science) which can be predicted from the Below we show how to use the regression command to run the regression with write as the dependent variable and using the three dummy variables as predictors, followed by an annotated output. The table below summarizes the general rules of thumb we use for the measures we have discussed for identifying observations worthy of further investigation (where k is the number of predictors and n is the number of observations). You can see that there is a possibility that districts tend to have different mean residuals not centered at zero. The coefficient for socst (.05) is not statistically significantly different from 0 because This suggests that the errors are not independent. The descriptives have uncovered peculiarities worthy of further examination. variance has N-1 degrees of freedom. For example, how can you compare the values You now need to check four of the assumptions discussed in the. For example, you could use linear regression to understand whether exam performance can be predicted based on revision time; whether cigarette consumption can be predicted based on smoking duration; and so forth. Poisson regression is used to predict a dependent variable that consists of "count data" given one or more independent variables. column). measure of the strength of association, and does not reflect the extent to which However, what we realize is that a correct conclusion must first be based on valid data as well as a sufficiently specified model.
The VIF, which stands for variance inflation factor, is (1/tolerance) and, as a rule of thumb, a variable whose VIF is greater than 10 is considered problematic. Leverage: An observation with an extreme value on a predictor variable is called a point with high leverage. If the mean is greater than the median, it suggests a right skew, and conversely if the mean is less than the median it suggests a left skew. The /DEPENDENT subcommand indicates the dependent variable, and the variables following which says that the residuals are normally distributed with a mean centered around zero. variance in the dependent variable simply due to chance. Click on Simple, then under Data in Chart Are select Summaries for groups of cases, and click Define. Let's use the REGRESSION command. Let's examine the output from this regression analysis. Remember that predictors in Linear Regression are usually Scale From this formula, you can see that predictors are added to the model, each predictor will explain some of the The mean is 18.55 and the 95% Confidence Interval is (18.05, 19.04). More commonly seen is the Q-Q plot, which compares the observed quantile with the theoretical quantile of a normal distribution. command, the statistics subcommand must come before the dependent SPSS Regression Dialogs. scores on various tests, including science, math, reading and social studies (socst). Kurtosis values greater than 3 are considered not normal. Note: For a standard logistic regression you should ignore the and buttons because they are for sequential (hierarchical) logistic regression. These data (hsb2) were collected on 200 high school students and are This basis is constructed as a linear combination of predictors to form orthogonal components. pct full credential, avg class size k-3, pct free meals, a. Predictors: Try 1: Separate regressions. We see quite a difference in the coefficients compared to the simple linear regression. The use of categorical variables will be covered in Lesson 3.
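To make the relationship between tolerance and VIF concrete, here is a minimal Python sketch. It assumes a model with only two predictors, in which case the tolerance of either predictor is one minus their squared correlation; with more predictors, tolerance would come from regressing each predictor on all the others. The data and names are made up for illustration.

```python
# Two made-up predictors (hypothetical values, not from the tutorial's data).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]


def pearson_r(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    sab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    saa = sum((ai - mean_a) ** 2 for ai in a)
    sbb = sum((bi - mean_b) ** 2 for bi in b)
    return sab / (saa * sbb) ** 0.5


r = pearson_r(x1, x2)

# Tolerance: share of a predictor's variance NOT explained by the other one.
tolerance = 1 - r ** 2

# VIF is the reciprocal of tolerance.
vif = 1 / tolerance

print(f"r = {r:.2f}, tolerance = {tolerance:.2f}, VIF = {vif:.2f}")
print("problematic" if vif > 10 else "below the rule-of-thumb cutoff of 10")
```

Here the two predictors correlate at 0.8, which gives a tolerance of 0.36 and a VIF of about 2.78, well under the rule-of-thumb cutoff.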
In this case we have a left skew (which we will see in the histogram below). R-square would be simply due to chance variation in that particular sample. In linear regression, a common misconception is that the outcome has to be normally distributed, but the assumption is actually that the residuals are normally distributed. In this example, multicollinearity arises because we have put in too many variables that measure the same thing. Completing these steps results in the SPSS syntax below. We can click on Analyze > Descriptive Statistics > Explore > Plots, then uncheck Stem-and-leaf and check Histogram to output the histogram of acs_k3. As we have seen, it is not sufficient to simply run a regression analysis; we must also verify that the assumptions have been met, because coefficient estimates and standard errors can fluctuate wildly (e.g., from non-significant to significant after dropping avg_ed). However the R-square was low. Here, we have specified ci, which is short for confidence intervals. It's difficult to tell the relationship simply from this plot. Expressed in terms of the variables used g. t and Sig. This page shows an example regression analysis with footnotes explaining the Influence can be thought of as the product of leverage and outlierness. The general form of a bivariate regression equation is "Y = a + bX." SPSS calls the Y variable the "dependent" variable and the X variable the "independent variable." I think this notation is misleading, since regression analysis is frequently used with data collected by nonexperimental However, the procedure is identical. increase in math, a .389 unit increase in science is predicted, We will keep this in mind when we do our regression analysis. SSTotal is equal to .489, the value of R-Square. However, before we introduce you to this procedure, you need to understand the different assumptions that your data must meet in order for linear regression to give you a valid result.
Skewness is a measure of asymmetry; a common threshold is 1 for positive skew and -1 for negative skew. R-Square is also called the coefficient of determination. d. This is the source of variance, Looking at the boxplot and histogram we see observations where avg parent ed, parent some college, parent hsg, parent college grad, Go to Analyze > Correlate > Bivariate. According to SAS Documentation, Q-Q plots are better if you want to compare to a family of distributions that vary on location and scale; they are also more sensitive to tail distributions. However, we do not include it in the SPSS Statistics procedure that follows because we assume that you have already checked these assumptions. female and 0 if male. The statistics subcommand is not needed to run the regression, but on it We can conclude that the relationship between the response variable and predictors is linear since the residuals seem to be randomly scattered around zero. Please note: The purpose of this page is to show how to use various data analysis . We now have some first basic answers to our research questions. Hence, you need to know which variables were entered into the current regression. On the other hand, if irrelevant variables are included in the model, the common variance they share with included variables may be wrongly attributed to them. By default, SPSS now adds a linear regression line to our scatterplot. of predictors minus 1 (K-1). R-squared for the population. alphabet. To investigate this, we can run two separate regressions, one for before age 14, and one for after age 14. These are Hence, this would Multiply the resulting first term on the right-hand side by $$\frac{SD(x)}{SD(x)}=1$$: $$(y_i-\bar{y})=b_1\frac{(x_i-\bar{x})}{SD(x)}\cdot SD(x)+\epsilon_i$$. An average class size of -21 sounds implausible, which means we need to investigate it further. Then shift the newly created variable ZRE_1 to the Variables box and click Paste.
Coefficients having p-values In the logit model the log odds of the outcome is modeled as a linear combination of the predictor variables. level. We do this using the Harvard and APA styles. Consider the model below which is the same model we started with in Lesson 1 except that we take out meals as a predictor. In this lesson, we will explore these methods and show how to verify regression assumptions and detect potential problems using SPSS. If this value is very different from the mean we would expect outliers. This indicates the statistical significance of the regression model that was run. This is because the high degree of collinearity caused the standard errors to be inflated, hence the term variance inflation factor. are significant). 0, which should be taken into account when interpreting the coefficients. variables math, female, socst and read. You should get the following in the Syntax Editor. It can be shown that the correlation of the z-scores is the same as the correlation of the original variables: $$\hat{\beta_1}=corr(Z_y,Z_x)=corr(y,x).$$ We will use the same dataset elemapi2v2 (remember it's the modified one!) Let's continue checking our data. Assumptions in linear regression are based mostly on predicted values and residuals. In the Linear Regression menu, you will see Dependent and Independent fields. Your scatterplot of the standardized predicted value with the standardized residual will now have a Loess curve fitted through it. Before we write this up Another assumption of ordinary least squares regression is that the variance of the residuals is homogeneous across levels of the predicted values, also known as homoscedasticity.
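The identity $$\hat{\beta_1}=corr(Z_y,Z_x)=corr(y,x)$$ can be checked numerically. The following Python sketch standardizes two made-up variables, fits a simple regression on the z-scores, and compares the slope with the Pearson correlation of the raw values. The data and function names are hypothetical and chosen only for illustration.

```python
import statistics

# Made-up values standing in for a predictor and an outcome (assumption).
x = [18.0, 19.0, 20.0, 21.0, 22.0, 23.0]
y = [600.0, 640.0, 610.0, 700.0, 690.0, 720.0]


def z_scores(values):
    """Standardize to mean 0 and sample standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def slope(pred, outcome):
    """Least-squares slope of outcome regressed on pred."""
    mean_p = statistics.mean(pred)
    mean_o = statistics.mean(outcome)
    spo = sum((p - mean_p) * (o - mean_o) for p, o in zip(pred, outcome))
    spp = sum((p - mean_p) ** 2 for p in pred)
    return spo / spp


def pearson_r(a, b):
    """Pearson correlation computed from sums of squares."""
    mean_a = statistics.mean(a)
    mean_b = statistics.mean(b)
    sab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    saa = sum((ai - mean_a) ** 2 for ai in a)
    sbb = sum((bi - mean_b) ** 2 for bi in b)
    return sab / (saa * sbb) ** 0.5


raw_slope = slope(x, y)                          # depends on the units of x and y
standardized_slope = slope(z_scores(x), z_scores(y))
correlation = pearson_r(x, y)

print(f"raw slope          = {raw_slope:.3f}")
print(f"standardized slope = {standardized_slope:.6f}")
print(f"correlation        = {correlation:.6f}")
```

The raw slope changes if either variable is rescaled, while the standardized slope and the correlation agree and are unit-free, which is why standardized coefficients are used to compare predictors measured on different scales.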
Shift *ZRESID to the Y: field and *ZPRED to the X: field; these are the standardized residuals and standardized predicted values, respectively. There are three ways that an observation can be unusual. The output we obtain from this analysis is: We can see that adding student enrollment as a predictor results in an R square change of 0.006. for total is 199. I demonstrate how to perform a multiple regression in SPSS. How do we know this? Hence, for every unit increase in reading score we expect a .335 point increase Throughout this seminar, we will show you how to use both the dialog box and syntax when available. for publication, we should do a number of checks to make sure we can firmly stand behind these results. correlation between the observed and predicted values of dependent variable. You can get special output that you can't get from Analyze > Descriptive Statistics > Descriptives, such as the 5% trimmed mean. For females the predicted Additionally, as we see from the Regression With SPSS web book, the variable full (pct full credential) appears to be entered in as proportions, hence we see 0.42 as the minimum. As such, the coefficients cannot be compared with one another to If they fall above 2 or below -2, they can be considered unusual. Logistic regression, also called a logit model, is used to model dichotomous outcome variables. The index i can be a particular student, participant or observation. are four tables given in the output. Looking more specifically at the influence of School 2910 on particular parameters of our regression, DFBETA indicates that School 2910 has a large influence on our intercept term (causing a -8.98 estimated drop in api00 if this school were removed from the analysis). 0.01 (for 1 predictor) This will call a PDF file that is a reference for all the syntax available in SPSS. If you leave out certain keyword specifications, SPSS applies its defaults, such as /MISSING LISTWISE.
Let's first include acs_k3, which is the average class size in kindergarten through 3rd grade. We can perform what's called a hierarchical regression analysis, which is just a series of linear regressions separated into what SPSS calls Blocks. d. Variables Entered SPSS allows you to enter variables into a table. f. df These are the Note that this is the same model we began with in Lesson 1. These columns provide the (Optional) The following attributes apply for SPSS variable names: The Measure column is often overlooked but is important for certain analyses in SPSS and will help orient you to the type of analyses that are possible. We expect that better academic performance would be associated with lower class size. The proportion of variance explained by average class size was only 2.9%. Regression These are Please go to Help > Command Syntax Reference for full details (note the **). However, if you hypothesized specifically that males had higher scores than females (a 1-tailed test) and used an alpha of 0.05, the p-value The Syntax Editor is where you enter SPSS Command Syntax. The boxplot is shown below. c. This column shows the predictor variables can help you to put the estimate SSRegression The improvement in prediction by using The variable we are using to predict the other variable's value is called the independent variable (or sometimes, the predictor variable). Residual to test the significance of the predictors in the model. parameter estimate by the standard error to obtain a t-value (see the column This includes studying consumer buying habits, responses to . First is the Data View. SPSS is not case sensitive for variable names; however, it displays the case as you enter it. The data consist of two variables: (1) independent variable (years of education), and (2) dependent variable (weekly.
Influence: An observation is said to be influential if removing the observation substantially changes the estimate of coefficients. The R is the correlation of the model with the outcome, and since we only have one predictor, this is in fact the correlation of acs_k3 with api00. Subtract $$\bar{y}$$ from both sides; note that the first term on the right-hand side goes to zero: $$(y_i-\bar{y})=(\bar{y}-\bar{y})+b_1(x_i-\bar{x})+\epsilon_i$$. $$R^2 = 0.403$$ indicates that IQ accounts for some 40.3% of the variance in performance scores. be the squared differences between the predicted value of Y and the mean of Y, Furthermore, we can use the values in the "B" column under the "Unstandardized Coefficients" column, as shown below: being reported. Enter means that each independent variable was Ordinal or Nominal variables: In regression, you typically work with Scale outcomes and Scale predictors, although we will go into special cases of when you can use Nominal variables as predictors in Lesson 3. Before we introduce you to these seven assumptions, do not be surprised if, when analysing your own data using SPSS Statistics, one or more of these assumptions is violated (i.e., not met). with t-values and p-values). In SPSS Statistics, we created two variables so that we could enter our data: Income (the independent variable), and Price (the dependent variable). Note that the Case Number may vary depending on how your data is sorted, but the School Number should be the same as the table above. Additionally, there are issues that can arise during the analysis that, while strictly speaking are not assumptions of regression, are nonetheless, of great "As managers, we want to figure out how we. However, don't worry. Consider the case of collecting data from our various school districts.
Violation of this assumption can occur in a variety of situations. Looking at the Coefficients table, the constant or intercept term is 308.34, and this is the predicted value of academic performance when acs_k3 equals zero. filter off. Add the variable acs_k3 (average class size) into the Dependent List field by highlighting the variable on the left white field and clicking the right arrow button. Because the Beta coefficients The Total The R2 value (the "R Square" column) indicates how much of the total variation in the dependent variable, Price, can be explained by the independent variable, Income. Without thoroughly checking your data for problems, it is possible that another researcher could analyze your data, uncover such problems, and question your results, showing an improved analysis that may contradict your results and undermine your conclusions. In this section, we show you only the three main tables required to understand your results from the linear regression procedure, assuming that no assumptions have been violated. If you use a 2 tailed test, then you would compare Squares, the Sum of Squares divided by their respective DF. students, so the DF With a p-value of zero to three decimal places, the model is statistically significant. In SPSS Statistics, an ordinal regression can be carried out using one of two procedures: PLUM and GENLIN. Note that the mean of an unstandardized residual should be zero (see Assumptions of Linear Regression), as should the mean of the standardized residual. In conclusion, we have identified problems with our original data which lead to incorrect conclusions about the effect of class size on academic performance. socst The coefficient for socst is .050. The variable we want to predict is called the dependent variable (or sometimes the response, outcome, target or criterion variable). Case Number is the order in which the observation appears in the SPSS Data View; don't confuse it with the School Number.
Let's use that data file and repeat our analysis and see if the results are the same as our original analysis. In our last lesson, we learned how to first examine the distribution of variables before doing simple and multiple linear regressions with SPSS. (constant, math, female, socst, read). Let's try Since female is coded 0/1 (0=male, Next, remove the line breaks and copy-paste-edit it as needed. Outliers: In linear regression, an outlier is an observation with a large residual. Let's start with getting more detailed summary statistics for acs_k3 using the Explore function in SPSS. This will put the School Number next to the circular points so you can identify the school. Let's move on to the next lesson, where we make sure the assumptions of linear regression are satisfied in making our inferences. The variable we want to predict is called the dependent variable (or sometimes, the outcome variable). As researchers we need to make sure first that the data we cleaned hold plausible values. science score would be 2 points lower than for males. The dataset used in this portion of the seminar is located here: elemapiv2. Square Regression (2385.93019) divided by the Mean Square Residual (51.0963039), yielding Let's juxtapose our api00 and enroll variables next to our newly created DFB0_1 and DFB1_1 variables in Variable View. confidence intervals for the coefficients. This tells you the number of the model After pasting the syntax and clicking the Run Selection button, or clicking OK after properly specifying your analysis through the menu system, you will see a new window pop up called the SPSS Viewer, otherwise known as the Output window. 1=female) the interpretation can be put more simply. The residual is the vertical distance (or deviation) from the observation to the predicted regression line. independent variables (math, female, socst and read). alpha level (typically 0.05) and, if smaller, you can conclude Yes, the * Before age 14. compute before14 = (age < 14).
The second is called Variable View; this is where you can view various components of your variables, but the important components are the Name, Label, Values and Measure. If you did a stepwise regression, the entry in You will see a menu system called Properties. 51.0963039. Note that the We can request percentiles to show where exactly the lines lie in the boxplot. The Variance is how much variability we see in squared units, but for easier interpretation the Standard Deviation is the variability we see in average class size units. In this case, there were N=200 Spaces between characters are not allowed but the underscore _ is. Please note that SPSS sometimes includes footnotes as part of the output. To create the more commonly used Q-Q plot in SPSS, you would need to save the standardized residuals as a variable in the dataset; in this case it will automatically be named ZRE_1. Note that Note that this is an overall The TOL keyword requests tolerance, an indication of the percent of variance in the predictor that cannot be accounted for by the other predictors. These confidence intervals Essentially, the equation above becomes a new simple regression equation where the intercept is zero (since the variables are centered) with a new regression coefficient (slope): If residuals are normally distributed, then 95% of them should fall between -2 and 2. As we will see in this seminar, there are some analyses you simply can't do from the dialog box, which is why learning SPSS Command Syntax may be useful. This is the output that SPSS gives you if you paste the syntax. You can use these procedures for business and analysis projects where ordinary regression techniques are limiting or inappropriate.
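The rule that standardized residuals should mostly fall between -2 and 2 can be sketched in Python as follows. This is an illustration under stated assumptions, not the SPSS procedure itself: the data are made up with one deliberately unusual case, and each residual is standardized by dividing it by the standard error of the estimate (the square root of the mean square residual), which is assumed here to correspond to what SPSS saves as ZRE_1.

```python
# Made-up data: y equals x except for one deliberately unusual case at index 4.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [1.0, 2.0, 3.0, 4.0, 15.0, 6.0, 7.0, 8.0, 9.0, 10.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares slope and intercept.
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

# Raw residuals: observed minus predicted.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Standard error of the estimate, with n - 2 residual degrees of freedom.
std_error = (sum(e ** 2 for e in residuals) / (n - 2)) ** 0.5

standardized = [e / std_error for e in residuals]

# Flag cases whose standardized residual falls outside the -2 to 2 band.
flagged = [i for i, z in enumerate(standardized) if abs(z) > 2]

for i in flagged:
    print(f"case index {i}: x = {x[i]}, y = {y[i]}, z-residual = {standardized[i]:.2f}")
```

Only the case at index 4 is flagged. A flagged case is a prompt to go back and inspect the record, not an automatic reason to drop it.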
add predictors to the model which would continue to improve the ability of the less than alpha are statistically significant. (a, b, etc.) The term $$y_i$$ is the dependent or outcome variable (e.g., api00) and $$x_i$$ is the independent variable (e.g., acs_k3). The term $$b_0$$ is the intercept, $$b_1$$ is . In this case, 76.2% can be explained, which is very large. **. Correlation is significant at the 0.01 level (2-tailed). Although School 2910 does not pass the threshold for DFBETA on our enroll coefficient, if it were removed, it would show the largest change in enroll among all the other schools. of Adjusted R-square was .479 Adjusted R-squared is computed using the formula for female is equal to 0, because p-value = 0.051 > 0.05. You will also notice that the larger betas are associated with the From the histogram you can see a couple of values at the tail ends of the distribution. (See To understand the relationship between correlation and simple regression, let's run a bivariate correlation of api00 and acs_k3 (average class size). Let's examine the standardized residuals as a first means for identifying outliers, first using simple linear regression. First, we introduce the example that is used in this guide. It is important to meet this assumption for the p-values for the t-tests to be valid. We will ignore the regression tables for now since our primary concern is the scatterplot of the standardized residuals with the standardized predicted values. The code after pasting the dialog box will be: The plot is shown below. h. F and Sig. Looking at the coefficients, the average class size (acs_k3, b=-2.712) is marginally significant (p = 0.057), and the coefficient is negative, which would indicate that larger class sizes are related to lower academic performance, which is what we would expect.
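The text mentions that Adjusted R-squared is computed from a formula but the formula itself is missing. As a worked check, assuming the standard adjustment and taking the values reported in the output (R-square of .489, N = 200 students, and k = 4 predictors: math, female, socst and read):

$$R^2_{adj} = 1-\frac{(1-R^2)(N-1)}{N-k-1} = 1-\frac{(1-0.489)(200-1)}{200-4-1} = 1-\frac{0.511 \times 199}{195} \approx 0.479$$

This agrees with the reported Adjusted R-square of .479. The penalty grows with k, so adding a predictor raises the adjusted value only when it improves R-square by more than would be expected by chance.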
approximately .05 point increase in the science score. Let's take a look at the bivariate correlation among the three variables. Taking a look at the minimum and maximum for acs_k3, the average class size ranges from -21 to 25. The coefficients for each of the variables indicate the amount of change one could expect in api00 given a one-unit change in the value of that variable, given that all other variables in the model are held constant. a. Predictors: (Constant), avg class size k-3. For the effect. Linear regression is the next step up after correlation. to assist you in understanding the output. It is used when we want to predict the value of a variable based on the value of another variable. Click the Run button to run the analysis. When we do linear regression, we assume that the relationship between the response variable and the predictors is linear. entered in usual fashion. Additionally, we are given that the formula for the intercept is $$a=\bar{y}-b_1 \bar{x}$$. To see if there's a pattern, let's look at the school and district number for these observations to see if they come from the same district. To begin, let's go over basic syntax terminology: Note that a ** next to the specification means that it's the default specification if no specification is provided (i.e., /MISSING = LISTWISE). $$\sum(\hat{Y}-\bar{Y})^2$$.
We conclude that the linearity assumption is satisfied and the homoscedasticity assumption is satisfied if we run the fully specified predictive model. These can be computed in many ways. Note that Tukey's hinges cannot take on fractional values whereas Weighted Average can. predictors to explain the dependent variable, although some of this increase in