BUS 308 – Week 4 Lecture 2
Interpreting Relationships
Expected Outcomes
After reading this lecture, the student should be able to:
1. Interpret the strength of a correlation
2. Interpret a Correlation Table
3. Interpret a Linear Regression Equation
4. Interpret a Multiple Regression Equation
Overview
As in many detective stories, we will often find that when one thing changes, we see that something else has changed as well. Moving to correlation and regression opens up new insights into our data sets, but still lets us use what we have learned about Excel tools in setting up and generating our results.
The correlation between events is mirrored in data analysis examinations with correlation analysis. This week's focus changes from detecting and evaluating differences to looking at relationships. As students often comment, finding significant differences in gender-based measures does not explain why these differences exist. Correlation, while not always explaining why things happen, gives data detectives great clues on what to examine more closely and helps move us towards understanding why outcomes exist and what impacts them. If we see correlations in the real world, we often will spend time examining what might underlie them, finding out whether they are spurious or causal.
Regression lets us use relationships between and among our variables to predict or explain outcomes based upon inputs, factors we think might be related. In our quest to understand what impacts the compa-ratio and salary outcomes we see, we have often been frustrated due to being basically limited to examining only two variables at a time, when we felt that we needed to include many other factors. Regression, particularly multiple regression, is the tool that allows us to do this.
Linear Correlation
When two things seem to move in a somewhat predictable way, we say they are correlated. This correlation could be direct or positive, where both move in the same direction, or it could be inverse or negative, where when one increases the other decreases. Supply and price in economics are a common example of an inverse (or negative) correlation: the more supply we have of something, the less we typically can charge for it. Demand and price give us a direct (or positive) correlation: the more demand exists for something, the more we can charge for it. Height and weight in young children is another common example of a direct correlation; as one increases, so does the other measure.
Probably the most commonly used correlation is the Pearson Correlation Coefficient, symbolized by r. It measures the strength of the association (the extent to which measures change together) between interval- or ratio-level measures, as well as the direction of the relationship (inverse or direct). Several measures in our company data set could use the Pearson correlation to show relationships: salary and midpoint, salary and years of service, salary and performance rating, etc. The Pearson correlation runs from -1.0 (perfect negative or inverse correlation) through 0 (no correlation) to +1.0 (perfect positive or direct correlation).
A perfect correlation means that if we graphed the values, they would fall exactly on a straight line, either increasing from bottom left to top right (positive) or decreasing from top left to bottom right (negative). The stronger the absolute value (ignoring the sign), the stronger the correlation and the more closely the data points would form a straight line when plotted on a graph. The Excel Fx function CORREL and the Data Analysis function Correlation both produce Pearson correlations.
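For readers who like to check results outside of Excel, here is a minimal Python sketch (using the scipy library) of the same Pearson correlation that CORREL produces. The numbers are invented for illustration; they are not our data set.

    # Pearson correlation, the Python analogue of Excel's =CORREL(range1, range2).
    # The values below are hypothetical, not the course data set.
    from scipy.stats import pearsonr

    midpoint = [23, 31, 40, 48, 57, 67]   # hypothetical pay-grade midpoints
    salary   = [24, 27, 45, 50, 66, 70]   # hypothetical salaries

    r, p_value = pearsonr(midpoint, salary)
    print(f"r = {r:.3f}, two-tail p = {p_value:.4f}")  # r near +1.0 = strong direct relationship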
Question 1
When we have a data set with multiple variables, we would want to see what relationships exist – a detective’s sort of “who works with whom” around the result we are looking for. This question asks for us to identify the statistically significant correlations between the SALARY and the other variables.
Remember, while the assignment asks for correlations with Salary, this example will use correlations with the compa-ratio. The hypothesis testing steps are:
Step 1: Ho: Correlation with compa-ratio is not significant. Ha: Correlation with compa-ratio is significant. (A two-tail test.)
Technically, we should set up the hypothesis testing steps for each of the correlations (which, as we will see shortly, number 15). This is a bit tedious, so we approach the issue of statistical significance as we did with the ANOVA coefficients last week: we write a single hypothesis test and apply it to each of the correlations.
Step 2: Alpha = 0.05
Step 3: Statistical Test and statistic: Pearson’s correlation (r), t, and the correlation t-test to test a correlation
The significance of the Pearson Correlation is tested with our old friend the t-test.
Step 4: Reject the null hypothesis if the p-value < 0.05.
Step 5: Conduct the test.
Below is a screen shot of a correlation table produced by the Analysis ToolPak Correlation tool.
Reading the Table. The table shows correlations only below the diagonal (which holds the 1.00 correlation of each variable with itself); values above the diagonal would simply duplicate those below it. The diagonal is a "pivot point," so to speak. In reading the correlations, we would start with a row such as Age. The correlation of Age and Compa-ratio is 0.195 (rounded), and the correlation of Age and Midpoint is 0.567. Then we get to the diagonal. Instead of continuing horizontally, we start going vertically (down the Age column). So, the correlation of Age with Performance Rating is 0.139, with Service is 0.565, and with Raise is -0.180. All variables, except for the first one (in this case Compa-ratio), are read this way in correlation tables.
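For comparison, a hedged Python sketch: the pandas library builds the same kind of correlation table with its corr() method. The column names mirror the lecture, but the values below are invented.

    # Building a correlation table, similar to the Analysis ToolPak's output.
    # Column names mirror the lecture; the values are invented examples.
    import pandas as pd

    df = pd.DataFrame({
        "Compa-ratio": [1.02, 0.95, 1.10, 0.99, 1.05, 0.97],
        "Age":         [34, 52, 41, 29, 46, 38],
        "Service":     [5, 20, 9, 3, 15, 8],
    })
    print(df.corr())  # pandas prints the full symmetric matrix; Excel's table
                      # shows only the half below the 1.00 diagonal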
Step 6: Conclusion and Interpretation.
To assess the statistical significance of the correlations, we need to calculate a t value for each correlation using the formula: t = r * sqrt(n - 2) / sqrt(1 - r^2), with df = n - 2.
In this formula, n equals the number of data point pairs used in the correlation, and r equals the correlation that we are testing.
The associated p-value for a t-value is found using the Fx function T.DIST.2T(t, df).
In any correlation table, the correlations are all developed with the same number of pairs of data points. (In our case, we have 50 pairs to use.) So, it seems reasonable that each of these correlations would have the same critical value that cuts off our p-value of 0.05. So, once we find a correlation value that is non-significant (that is, we fail to reject the null hypothesis), any correlations smaller than this would also be non-significant.
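To make the formula concrete, here is a small Python sketch of the same correlation t-test. The function below is a hypothetical helper (not a library routine), and scipy's t distribution plays the role of Excel's T.DIST.2T.

    # The correlation t-test from above: t = r*sqrt(n-2)/sqrt(1-r^2), df = n-2.
    # The two-tail p-value, 2*P(T > |t|), matches Excel's T.DIST.2T(t, df).
    import math
    from scipy.stats import t as t_dist

    def corr_t_test(r, n):
        df = n - 2
        t_stat = r * math.sqrt(df) / math.sqrt(1 - r**2)
        p_two_tail = 2 * t_dist.sf(abs(t_stat), df)  # two-tailed p-value
        return t_stat, p_two_tail

    print(corr_t_test(0.50, 50))  # about (4.00, 0.0002) -- see the worked example below
    print(corr_t_test(0.20, 50))  # about (1.41, 0.164): not significant at alpha = 0.05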
For this lecture, please ignore part b. We will discuss this in Lecture 3, as it is a short-cut that depends on your understanding the approach described below.
Part c asks us to use this information and identify the variables significantly correlated to salary in the homework and to compa-ratio for this example.
We need to calculate the t value and its associated p-value to determine significance. Starting with the largest correlation with Compa-ratio, we see this is with Midpoint and equals 0.50 (rounded). So we have:
t = r * sqrt(n-2)/sqrt(1-r^2) = 0.5*sqrt(50-2)/sqrt(1-0.5^2) = 4.0 (letting Excel do the math).
The two-tail p-value is T.DIST.2T(t,df) = T.DIST.2T(4,48) = 0.0002.
The null hypothesis is rejected for the Compa-Ratio and Midpoint correlation.
The next largest Compa-ratio correlation is with Age at 0.20 (rounded). Using the same Excel functions, we get t= 1.41 with a p-value of 0.164 (both rounded). So, we do not reject the null hypothesis for the Compa-ratio and age correlation.
It makes sense that if a correlation of .20 is non-significant, then any smaller correlations would also be non-significant, so our testing is done.
Looking at the output table above, we can say that only Midpoint is significantly correlated to compa-ratio with a correlation of .50 (rounded).
Part d asks for any surprising results/correlations. This will depend upon your table and what you did or did not expect.
Part e asks if this information helps us answer our equal pay question. The compa-ratio correlations do not seem to be helpful, as they do not shed any insight on gender-based issues.
Multiple Correlation
As interesting as linear correlation is, multiple correlation is even more so. It correlates several independent (input) variables with a single dependent (output) variable. For example, it would show the shared variation (multiple R squared, or Multiple Coefficient of Determination) for compa-ratio with the other variables in the data set at the same time rather than in pairs as we did in question 1. While we can generate this value by itself, it is a bit complicated and is rarely found except in conjunction with a multiple regression equation. So, having noted that this exists, let’s move on to multiple regression.
Regression
Regression takes us the next step in the journey. We move from knowing which variables are correlated to finding out which variables can be used to actually predict outcomes or explain the influence of different variables on a result. As we might suspect, linear regression involves a single dependent (outcome) and single independent (input) variable. Linear regression uses at least interval level data for both the dependent and independent variables.
The form of a linear regression equation is:
Y = a + b*X, where Y is the output, X is the input, a is the intercept (the value of Y when X = 0 on a graph), and b is the coefficient (the change in Y for every one-unit change in the value of X).
Earlier, we found that the correlation between raise and performance rating was 0.674 (rounded). While we did not make note of this in our correlation discussion, it was part of the correlation table. This correlation corresponds to a coefficient of determination (CD) of 0.674^2, or 0.45 (rounded), meaning the two variables share about 45% of their variation. This is not a particularly strong relationship, and we would not expect the graph of these values to show much of a straight line. For purposes of understanding linear regression, let's look at a graph showing performance rating as an input (an X variable) predicting raise (Y). An example of a regression equation and its graph is:
(Figure: scatter diagram of Raise (Y) vs Performance Rating (X), with trend line y = 0.0512x + 0.5412 and R-squared = 0.4538.)
This is a scatter diagram produced by Excel; the regression line, equation, and R-squared value have been added. Note that the coefficient of determination (R-squared) is the 45% we found earlier, and that the data points are not all that close to the regression (AKA trend) line. Note the format of the regression equation, Y = 0.5412 + 0.0512X; this is the same as saying Raise = 0.5412 + 0.0512*Performance Rating when we substitute the variable names for the algebraic letters.
Let us look at the equation. Since we know the correlation is significant (0.674 is larger than the 0.50 value we found significant for Compa-ratio and Midpoint), the linear regression equation is significant. The equation says that for every one-point increase in the performance rating (our X variable), the raise (our Y variable) increases, on average, by 0.0512%. If we extended the line to the vertical (Y) axis, it would cross at Y = 0.5412 when X = 0; this is an example where looking at the intercept is not particularly helpful, as no one has a performance rating of 0. This graph does tend to reinforce our earlier comment that raise and performance rating, even
though they have the strongest correlation in the table, are not particularly good at predicting each other's values. We see too much dispersion of the data points around the best-fit regression line through the data points.
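As a cross-check on the chart's trend line, here is a hedged Python sketch using scipy's linregress. The data pairs are invented, so the fitted numbers will only approximate the 0.5412 intercept and 0.0512 slope shown above.

    # Simple linear regression of raise on performance rating.
    # Hypothetical data pairs -- not the course data set.
    from scipy.stats import linregress

    performance = [45, 55, 65, 75, 85, 95]        # hypothetical ratings
    raise_pct   = [2.8, 3.2, 3.9, 4.5, 4.8, 5.6]  # hypothetical raises (%)

    fit = linregress(performance, raise_pct)
    print(f"Raise = {fit.intercept:.4f} + {fit.slope:.4f} * Performance Rating")
    print(f"R-squared = {fit.rvalue**2:.4f}")  # share of raise variation explained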
Most of us are probably not surprised: just as we feel compa-ratio is not determined by a single factor, we know raise is more complicated than the performance rating alone. This is where multiple regression, the use of several factors at once, might be more insightful.
Multiple Regression
Multiple Regression is probably the most powerful tool we will look at in this course. It allows us to examine the impact of multiple inputs (AKA independent variables) on a single output or result (AKA dependent variable). It also allows us to include nominal and ordinal variables in the results when they are used as dummy coded variables.
Multiple regression has an interesting ability that we have not been able to use before. It can use nominal data variables as inputs to our outcomes, rather than using them simply as grouping labels. It does so by assigning either a 0 or a 1 to the variable value, depending upon whether some characteristic exists or not. For example, with degree we essentially are looking to see if a graduate degree has any impact, since everyone in the sample has at least an undergraduate degree. So, we code the existence of a graduate degree with a 1 and its non-existence with a 0. Similarly, with gender we are interested, essentially, in how females are being treated, so we code them 1 (existence of being female). This coding is called dummy coding and involves using only a 0 or 1 in situations where the existence of a factor is considered important. Note that, other than some changes in the value of the coefficients, the outcomes would not differ if the codes were reversed; the significance, or non-significance, of degree or gender would remain the same regardless of the code used. We will comment on this more after we see our results.
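A short Python sketch of dummy coding (the column names are hypothetical): text categories are converted into the 0/1 indicator variables a regression can use.

    # Dummy coding: turn category labels into 0/1 indicator columns.
    import pandas as pd

    df = pd.DataFrame({"gender": ["M", "F", "F", "M"],
                       "degree": ["UG", "GR", "UG", "GR"]})

    df["female"]   = (df["gender"] == "F").astype(int)  # 1 = female, 0 = male
    df["grad_deg"] = (df["degree"] == "GR").astype(int) # 1 = graduate degree
    print(df)
    # Reversing a code (e.g., 1 = male) flips the coefficient's sign in a
    # regression but does not change its significance, as noted above.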
Question 2
Question 2 for this week asks for a regression equation that explains the impact of various variables on our output of interest. Of course, in the homework this is salary, while in our lectures it is the compa-ratio.
Now that the data has been set up, let’s look at our hypothesis testing process for the question, first, of whether or not the regression equation is helpful in explaining what impacts compa-ratio outcomes.
Parts a and b. This part looks at the overall regression.
Step 1: Ho: The regression equation is not significant.
Ha: The regression equation is significant.
Step 2: Alpha = 0.05
Step 3: F stat and ANOVA-Regression, used to test regression significance
Step 4: Decision Rule: Reject the null hypothesis if p-value < 0.05.
Step 5: Conduct the test.
Here is a screen shot of a multiple regression analysis for the question of what factors influence compa-ratio. Note: we will split the discussion of the output into two screen shots.
Step 6: Conclusion and Interpretation.
The first table in the output provides some summary statistics. Two are important for us: the multiple correlation, shown as R, which equals 0.655, a moderate value; and the R Square, or multiple coefficient of determination, showing that about 43% of the variation in compa-ratio values can be explained by the shared variation in the variables used in the analysis.
The second table shows the results of the actual statistical test of the regression. Similar to the ANOVA tables we looked at last week, it has two rows that are used to generate our F statistic (4.51) and the p-value which is labeled “Significance F.”
What is the p-value? 0.0008
Decision: Rej or Not reject the null? Reject the null hypothesis.
Why? The p-value is less than (<) 0.05.
Conclusion about Compa-ratio factors? The input variables are significantly related to compa-ratio outcomes. Some of the compa-ratio outcomes can be explained by the selected variables. We used the phrase “some of” since the equation only explains 43% of the variance, less than half.
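For those working outside Excel, here is a hedged sketch of the same overall test using Python's statsmodels library. The data are simulated around the lecture's coefficients, so the R Square, F, and Significance F will only be in the neighborhood of the 0.43, 4.51, and 0.0008 reported above.

    # Multiple regression and its overall F test, standing in for Excel's
    # Regression tool. Simulated data -- not the course data set.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(seed=1)
    n = 50
    midpoint = rng.uniform(23, 67, n)        # hypothetical grade midpoints
    perf     = rng.uniform(40, 100, n)       # hypothetical performance ratings
    female   = rng.integers(0, 2, n)         # dummy-coded gender
    compa    = (0.954 + 0.003*midpoint - 0.002*perf + 0.056*female
                + rng.normal(0, 0.06, n))    # noise around the lecture's equation

    X = sm.add_constant(np.column_stack([midpoint, perf, female]))
    model = sm.OLS(compa, X).fit()
    print(model.rsquared)                # the R Square from the summary table
    print(model.fvalue, model.f_pvalue)  # the F statistic and "Significance F"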
Part c
Once we reject the null hypothesis, our attention changes to the actual equation, the variables and their corresponding coefficients. The third table provides all the details we need to reach our conclusions.
As with the correlations in question 1, we will use the hypothesis testing process, but will write it only once and use the p-values to make decisions on each of the possible equation variables.
Step 1: Ho: The variable coefficient is not significant (b = 0).
Ha: The variable coefficient is significant (b ≠ 0).
Step 2: Alpha = 0.05
Step 3: T stat and t-test for coefficients
Step 4: Decision Rule: Reject the null hypothesis if p-value < 0.05.
Step 5: Conduct the test. In this case, the test has already been performed and is part of the regression output. Here is a screen shot of the second half of the Regression output.
Step 6: Conclusions and Interpretation
As with the correlations, we will use a single statement of the 6 steps to interpret the outcomes in this part. You are asked to transfer values from the ANOVA-Reg table to a decision summary table. Here is the completed table. Note that the variable names (the X's in our regression equation) come from column L, the t-values are shown in column O, the p-values are in column P, and the coefficients (the b's in our equation) are listed in column M.
The rejection decision is the same as we have been using: looking at the p-values, we reject the null for all p-values less than 0.05. Note the unusual-looking p-value for the intercept: 2.9E-13. This format is called exponential (scientific) notation and is the same as 2.9 * 10^-13. This value equals 0.00000000000029; we move the decimal point 13 places to the left. Any Excel p-value with an exponent of E-03 or smaller will be below our alpha value of 0.05. We do not list the intercept in our table, as it is always included in the equation.
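A one-line sanity check in Python (the same idea works on any calculator): exponential notation is just a compact way to write a very small number.

    # 2.9E-13 is scientific notation for 0.00000000000029 -- far below alpha.
    print(2.9e-13 == 0.00000000000029)  # True
    print(2.9e-13 < 0.05)               # True: reject the null for this p-value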
With our results, we reject the null hypothesis and find that 3 variables are significant factors in determining compa-ratio: Midpoint, Performance Rating, and Gender.
The multiple regression equation is similar to the linear regression example given above, except it has more independent terms: Y = a + b1*X1 + b2*X2 + b3*X3 + …. The b's stand for the coefficients that are multiplied by the value of each variable (represented by the X's).
With these 3 variables and the intercept, the statistically significant regression equation is:
Compa-ratio = 0.954 + 0.003*Midpoint - 0.002*Performance Rating + 0.056*Gender.
So, what does this equation mean? How do we interpret it? The intercept (0.9545) is somewhat of a placeholder; it centers the line in the middle of the data points but has little other meaning for us. The three variables, however, tell us a lot. Changes in each of them impact the compa-ratio outcome independently of the others; it is as if we can consider the other factors being held constant as we examine each factor's impact. So, all other things the same, each dollar increase in midpoint increases the compa-ratio value by 0.0034. This relates to what we found last week: that compa-ratio is not independent of grade. At the same time, and possibly surprisingly, every one-point increase in an employee's performance rating decreases the compa-ratio by 0.0024! Finally, the equation says that gender is an important factor. This factor alone means that the company is violating the Equal Pay Act. But what might be surprising is that for a change from male (coded 0) to female (coded 1), the compa-ratio goes up by 0.0562! Females get a higher compa-ratio (percent of midpoint) than males when all other things are equal, since being female adds 0.056*1 to the compa-ratio while being male adds 0.056*0 (or 0).
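To see the interpretation in action, here is a small Python sketch (the function and its inputs are hypothetical) that predicts the compa-ratio for two employees who differ only in gender, using the rounded coefficients above.

    # Using the fitted equation to isolate the gender effect.
    def predict_compa(midpoint, perf_rating, female):
        # rounded coefficients from the lecture's regression output
        return 0.954 + 0.003*midpoint - 0.002*perf_rating + 0.056*female

    pred_male   = predict_compa(midpoint=48, perf_rating=75, female=0)
    pred_female = predict_compa(midpoint=48, perf_rating=75, female=1)
    print(round(pred_female - pred_male, 3))  # 0.056 -- the gender coefficient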
We did have one hint that this might be the case, when we noticed in week 1 that the female mean compa-ratio was higher than the male value. But then some of the single-factor tests minimized this difference. This is one of multiple regression's greatest strengths: it will show us the impact of a single variable by controlling for, or keeping constant, the impact of all the other variables.
Parts d, e, and f
Gender is a significant element in the compa-ratio, as females (coded 1) get a higher value when all other variables are equal. We see this from the significant positive coefficient on the gender variable. If we had switched the coding and had females coded 0, the sign of the gender coefficient would have changed, causing males (then coded 1) to have part of their predicted compa-ratio reduced due to being male.
Question 3
This answer will depend on what other factors you would like to see.
Question 4
At this point, we have some strong evidence, from the compa-ratio regression equation and the t-test on average compa-ratios, that females get more pay for equal work than males. The company is violating the Equal Pay Act, in favor of women.
Question 5
What you say here will depend on your understanding of regression analysis versus the power of inferential tests that handle only two variables at a time.
Summary
Correlations show the direction and strength of a relationship between two variables and are fairly straightforward to understand. The Pearson correlation and the Spearman rank-order correlation are the two most generally used correlations.
Excel produces the Pearson correlation as a single value or in a table showing correlations among three or more variables. Each cell shows the correlation value for the variables listed in its row and column. The statistical significance of either the Pearson or the Spearman correlation is found using the t statistic. The Spearman rank correlation is not produced directly by Excel.
If we have variables that are related/correlated to an output, such as salary, we can create a regression equation. A regression equation is somewhat like a recipe for your favorite food; it tells how “much” of each ingredient (AKA variables) to add into the mix to get your result.
The regression tables contain a lot of information, and at first glance they can appear a bit overwhelming. However, as with the ANOVA tables, only a few results are important. Start with the first table's values, the Multiple R and R Square (the multiple coefficient of determination). These give us a sense of how well the regression variables explain the outcome. Then go to the next table and look at the Significance F value at the right. This is our p-value and should be less than 0.05 for us to reject the null of no significance. Assuming we rejected the null, the third table gives us our details. The first two columns contain the variable names and their respective coefficients. Each coefficient gets multiplied by its variable, and the products are added (or subtracted, depending on the coefficient's sign) to create the regression equation. A couple of columns over is the p-value column, letting us know which of the variables are significant in the regression. If interested, we can build a confidence interval for each variable using the final columns.
Please ask your instructor if you have any questions about this material.
When you have finished with this lecture, please respond to Discussion Thread 2 for this week with your initial response and responses to others over a couple of days before reading the third lecture for the week.