# QUANT homework

## Introduction

Ideally, the goal is to select explanatory variables that build the simplest regression equation possible while still producing a good estimate of the response variable (parsimony).

This simplifies interpretation of the regression coefficients.

Selecting good variables for use in a regression analysis usually begins with some background research.

To use simple linear regression analysis, there needs to be a linear relation between the response variable and the explanatory variable.

Investigate the linear relations by examining:

1) Correlation: a numerical measure of the strength of the relation between the explanatory and response variables

2) A scatter plot of the response variable against the explanatory variable

A linear dot pattern indicates the explanatory variable likely needs to be included in the regression equation

A non-linear dot pattern suggests a transformation of an explanatory variable may be needed

A horizontal dot pattern suggests the variable should not be used in the regression equation

3) Residual plots

Look for a scattered, cloudy dot pattern with consistent variance at each level of the explanatory variable; this helps confirm that a linear relation likely exists

If a pattern is observed, transformations may need to be applied

## Correlation

Response variable (the variable of interest): typically placed on the y-axis

Explanatory variable: typically placed on the x-axis

Choosing the role of the variables depends on how you think the variables are related

Correlation coefficient (r): quantifies the strength of the linear relation between the two variables

In Excel, use the =CORREL function

Properties of the correlation coefficient:

ranges from -1 to 1

the sign gives the direction of the association

unitless measure

not affected by changes to the center or scale of the variables

depends only on the z-scores

sensitive to outliers

does not imply causation (lurking variables could stand behind the association)

check the scatter plot for straightness even if the value of r is high
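Since r depends only on the z-scores, it can be computed directly from them. A minimal Python sketch (illustrative only; equivalent in result to Excel's =CORREL for equal-length numeric lists):

```python
import math

def correl(x, y):
    """Pearson correlation: average product of paired z-scores (n - 1 denominator)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x) / (n - 1))
    sy = math.sqrt(sum((v - my) ** 2 for v in y) / (n - 1))
    # r is the average product of the paired z-scores
    return sum((a - mx) / sx * (b - my) / sy for a, b in zip(x, y)) / (n - 1)
```

Shifting or rescaling either variable leaves r unchanged, and swapping a positive association for a negative one flips its sign, which matches the properties listed above.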

## Scatterplots

Displays the relation (association) between two quantitative variables

The investigator can see patterns, trends, and outliers

1. Direction of the association

negative or positive

2. Form of the association

nonlinear vs. linear

3. Strength of the association

How discernible is the data pattern or trend?

weak, moderate, strong

## Residuals

Question: What line can be used to model the linear data pattern displayed in the scatterplot?

The equation of the best-fit line is the regression line (model)

The estimates for the response variable made through the regression line are called the predicted values

Residuals

To find a residual, subtract the predicted value (calculated with the regression equation) from the observed value

Negative residual: the regression equation is overestimating

Positive residual: the regression equation is underestimating

The best-fitting line is the one whose sum of squared residuals is smallest
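The least-squares line and its residuals can be sketched in a few lines of Python (hypothetical helpers for illustration; Excel's Data Analysis tool does this internally):

```python
def least_squares(x, y):
    """Slope and intercept minimizing the sum of squared residuals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx

def residuals(x, y, slope, intercept):
    """Observed minus predicted; negative means the line overestimates."""
    return [b - (slope * a + intercept) for a, b in zip(x, y)]
```

For data lying exactly on a line the residuals are all zero; in general, the least-squares residuals sum to zero.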

## Example1A

Question: Is there a linear relation between the response variable (salary) and housing price?

1) Correlation (CORREL function in Excel)

2) Scatter plot

3) Residual plot

1) Correlation: 0.9086846363

A correlation of approximately .91 implies that salary and housing price have a strong, positive, linear relation where salary increases as housing price increases in a systematically predictable way.

3) Residual plot: run the regression in Excel using the Data Analysis toolbar and select Residual Plots.

2) Scatter plot: select the variables and use the Insert > Charts menu.

| Salary ($10,000) | Housing Price ($10,000) |
| --- | --- |
| 1 | 45 |
| 2 | 47 |
| 3 | 57 |
| 4.5 | 59 |
| 5.2 | 60 |
| 8.4 | 65 |
| 11 | 66 |
| 12.4 | 67 |
| 12.5 | 69 |
| 13.1 | 70 |
| 15 | 70 |
| 17 | 72 |
| 22.2 | 73 |
| 22.6 | 74 |
| 25.8 | 76 |

From the scatter plot it can be seen that the relation appears to be positive and strong. However, there also appears to be a slight bend in the dot pattern. Further investigation is needed to determine whether the trend is linear.

There is a pattern (the dot pattern is not scattered) evident in the residual plot. This suggests that some relation beyond a linear one would likely better represent the relation between salary and housing price. Hence a transformation should be performed.

[Scatter plot: Salary against Housing Price ($10,000)]

## Transform1

Goal: make the dot pattern of a scatterplot more nearly linear

Calculate transforms for the explanatory and the response variables

Look at scatterplots of combinations of transformed and/or non-transformed values

Adjusted R-squared can help choose between the candidate linear equation models that result

Non-linear dot patterns in scatterplots and possible transformations to straighten them:

| year | salary (in $10,000) | LN(salary) |
| --- | --- | --- |
| 1980 | 1 | |
| 1982 | 2 | |
| 1990 | 3 | |
| 1990 | 4.5 | |
| 1991 | 5.2 | |
| 1996 | 8.4 | |
| 1997 | 11 | |
| 1998 | 12.4 | |
| 1999 | 12.5 | |
| 1999 | 13.1 | |
| 1999 | 15 | |
| 2001 | 17 | |
| 2001 | 22.2 | |
| 2004 | 22.6 | |
| 2005 | 25.8 | |

Candidate transformed coordinate pairs to try, depending on the shape of the dot pattern: (LN(x), y), (x, y^2), (x, 1/y), (x, SQRT(y)), (x, LN(y)), (LN(x), LN(y))

## Significance

What else to check in the regression output:

The overall significance of the regression equation (in the ANOVA table)

The regression coefficient is the average change in y expected per unit change in the explanatory variable when all other explanatory variables are held constant

How significant are the regression equation and/or the coefficients?

If the significance and/or p-values are less than .05, one or more of the explanatory variables are good predictors of the response variable

Also check R-squared (use adjusted R-squared)

Standard approach we will use (one of the following will be the investigation's conclusion):

If the regression equation is not significant, the model is not a good fit to the data

If the regression model is significant but the coefficient is not, the model provides improved fit over using the expected value of the response variable as the estimated prediction

If the regression model is significant and the coefficient is significant, the model is a good fit for the data and the explanatory variable is contributing significantly towards the quality of prediction
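The standard approach above can be written as a small decision helper (a sketch, not part of any Excel workflow; `model_p` stands for Significance F from the ANOVA table and `coef_p` for the coefficient's p-value, with the usual .05 cutoff):

```python
def interpret(model_p, coef_p, alpha=0.05):
    """Map model and coefficient significance to the investigation conclusion."""
    if model_p >= alpha:
        return "not significant: the model is not a good fit to the data"
    if coef_p >= alpha:
        return "model significant only: improved fit over using the mean of the response"
    return "both significant: good fit, and the explanatory variable contributes to prediction"
```

For example, a regression with Significance F of 0.2 falls in the first case regardless of the coefficient p-value.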

## Example2

Is the regression model with the square-root-transformed response variable (salary) the better model? Yes, according to the analysis (scatter plots, residual plots, adjusted R-squared, and standard error comparisons).

Compare scatter plots: which looks visually straighter? Compare residual plots: which has a cloudy, scattered dot pattern?

| year (actual) | year (Explanatory) | salary (in $10,000) (Response) | SQRT(Salary) |
| --- | --- | --- | --- |
| 1980 | 80 | 1 | 1 |
| 1982 | 82 | 2 | 1.4142135624 |
| 1990 | 90 | 3 | 1.7320508076 |
| 1990 | 90 | 4.5 | 2.1213203436 |
| 1991 | 91 | 5.2 | 2.2803508502 |
| 1996 | 96 | 8.4 | 2.8982753492 |
| 1997 | 97 | 11 | 3.3166247904 |
| 1998 | 98 | 12.4 | 3.5213633723 |
| 1999 | 99 | 12.5 | 3.5355339059 |
| 1999 | 99 | 13.1 | 3.6193922142 |
| 1999 | 99 | 15 | 3.8729833462 |
| 2001 | 101 | 17 | 4.1231056256 |
| 2001 | 101 | 22.2 | 4.7116875958 |
| 2004 | 104 | 22.6 | 4.7539457296 |
| 2005 | 105 | 25.8 | 5.0793700397 |

Compare adjusted R-squared values: which is largest? Compare standard errors: which is smallest?

Both models are significant.

The regression model with the transformed response variable has the largest adjusted R-squared value and the smallest standard error.

Response Variable: Salary

| Regression Statistics | |
| --- | --- |
| Multiple R | 0.9150264296 |
| R Square | 0.8372733668 |
| Adjusted R Square | 0.8247559335 |
| Standard Error | 3.28038926 |
| Observations | 15 |

| ANOVA | df | SS | MS | F | Significance F |
| --- | --- | --- | --- | --- | --- |
| Regression | 1 | 719.7849352684 | 719.7849352684 | 66.8885821383 | 0.0000017547 |
| Residual | 13 | 139.892398065 | 10.7609536973 | | |
| Total | 14 | 859.6773333333 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
| --- | --- | --- | --- | --- | --- | --- |
| Intercept | -80.4842432619 | 11.3048781078 | -7.1194260119 | 0.0000078229 | -104.906947591 | -56.0615389328 |
| year (Explanatory) | 0.9657567381 | 0.1180841892 | 8.1785440109 | 0.0000017547 | 0.710651357 | 1.2208621192 |

Response Variable: Square Root of Salary

| Regression Statistics | |
| --- | --- |
| Multiple R | 0.9668776273 |
| R Square | 0.9348523462 |
| Adjusted R Square | 0.9298409882 |
| Standard Error | 0.3337444496 |
| Observations | 15 |

| ANOVA | df | SS | MS | F | Significance F |
| --- | --- | --- | --- | --- | --- |
| Regression | 1 | 20.7785720521 | 20.7785720521 | 186.5467103404 | 0.0000000044 |
| Residual | 13 | 1.4480096496 | 0.1113853577 | | |
| Total | 14 | 22.2265817017 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
| --- | --- | --- | --- | --- | --- | --- |
| Intercept | -12.4661593835 | 1.1501501874 | -10.8387230813 | 0.0000000701 | -14.9509077987 | -9.9814109684 |
| year (Explanatory) | 0.164087017 | 0.0120138007 | 13.6582103637 | 0.0000000044 | 0.1381327785 | 0.1900412554 |

[Scatter plots: salary (in $10,000) vs. year, and SQRT(Salary) vs. year]
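The comparison can be spot-checked outside Excel: for simple regression, Multiple R equals the correlation between the explanatory and response variables, so the two Multiple R values reported above (about 0.915 and 0.967) should match r(year, salary) and r(year, SQRT(salary)). A Python sketch using the data above:

```python
import math

year = [80, 82, 90, 90, 91, 96, 97, 98, 99, 99, 99, 101, 101, 104, 105]
salary = [1, 2, 3, 4.5, 5.2, 8.4, 11, 12.4, 12.5, 13.1, 15, 17, 22.2, 22.6, 25.8]

def correl(x, y):
    """Pearson correlation via centered sums of squares."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r_raw = correl(year, salary)                           # Multiple R, untransformed model
r_sqrt = correl(year, [math.sqrt(s) for s in salary])  # Multiple R, square-root model
```

The square-root model's larger correlation is one more sign that the transformation straightened the dot pattern.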

## Regression equation after transformation

To undo the LN transformation and determine a regression equation in the original salary units, take the exponential (e) of both sides of the equation.

To undo the LN transformations:

(x, LN(y)): LN(y) = ax + b, so y = e^(ax + b) = e^(ax) · e^b

(LN(x), y): y = a·LN(x) + b (the response is untransformed, so there is nothing to undo)

(LN(x), LN(y)): LN(y) = a·LN(x) + b = LN(x^a) + b, so y = x^a · e^b
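A numeric check of the back-transformations, with hypothetical coefficients a and b (not taken from the worked example):

```python
import math

def predict_from_ln_model(x, a, b):
    """Model fit as LN(y) = a*x + b; undo with y = e^(a*x) * e^b."""
    return math.exp(a * x) * math.exp(b)

def predict_from_loglog_model(x, a, b):
    """Model fit as LN(y) = a*LN(x) + b; undo with y = x^a * e^b."""
    return x ** a * math.exp(b)
```

Note the undone log-log form is a product, x^a times e^b, not a sum.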

## Numerical summaries

Numerical Summaries: Excel Regression Output Table (color coded)

Regression coefficients

y-intercept: the average value of the response variable when all explanatory variables are 0 (the impact of the explanatory variables is not considered)

regression coefficients: the average predicted change in the response variable per unit change in the explanatory variable

Correlation coefficient: strength of the linear relation

Coefficient of determination (r^2), use adjusted R^2

fraction of the variation in the data accounted for by the regression equation

sometimes reported as a percentage

between 0 and 1

SEE (Standard Error of the Estimate)

standard deviation of the residuals

how spread out the observations are from the regression line

Confidence intervals: the range, within a specified level of confidence, in which the population parameter resides

Example output table:

| Regression Statistics | |
| --- | --- |
| Multiple R | 0.9150264296 |
| R Square | 0.8372733668 |
| Adjusted R Square | 0.8247559335 |
| Standard Error | 3.28038926 |
| Observations | 15 |

| ANOVA | df | SS | MS | F | Significance F |
| --- | --- | --- | --- | --- | --- |
| Regression | 1 | 719.7849352684 | 719.7849352684 | 66.8885821383 | 0.0000017547 |
| Residual | 13 | 139.892398065 | 10.7609536973 | | |
| Total | 14 | 859.6773333333 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
| --- | --- | --- | --- | --- | --- | --- |
| Intercept | -80.4842432619 | 11.3048781078 | -7.1194260119 | 0.0000078229 | -104.906947591 | -56.0615389328 |
| year (Explanatory) | 0.9657567381 | 0.1180841892 | 8.1785440109 | 0.0000017547 | 0.710651357 | 1.2208621192 |

## Influential Points

Does the regression equation change significantly when the isolated point is removed? Would the analysis reach a different conclusion?

If yes, then the point is influential.

In the regression output, look at:

significance of the model or coefficients

numerical summaries

Example: with influential point vs. without influential point
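A quick way to test for influence is to refit with and without the suspect point and compare slopes. A sketch on made-up data where four points lie exactly on y = x and a fifth, isolated point sits far off the line:

```python
def slope(x, y):
    """Least-squares slope."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)

x = [1, 2, 3, 4, 10]   # last point is the isolated one
y = [1, 2, 3, 4, 30]

full = slope(x, y)                # slope pulled toward the outlier
reduced = slope(x[:-1], y[:-1])   # slope of the four points on the line
# A large change in slope when the point is removed marks it as influential.
```

Here the slope more than triples when the isolated point is included, so it is clearly influential.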

## Recode

Recode multiple-category variables (3 or more categories) into two-category (dummy) variables.

Gender: male (1), female (0)

Political Party: (0) republican, (1) democrat, (2) libertarian, (3) green

| Student | IQ | Study hours | Gender | Political Party | Test score |
| --- | --- | --- | --- | --- | --- |
| 1 | 110 | 40 | 1 | 0 | 100 |
| 2 | 110 | 40 | 0 | 2 | 95 |
| 3 | 120 | 30 | 1 | 1 | 90 |
| 4 | 110 | 40 | 1 | 3 | 85 |
| 5 | 100 | 20 | 0 | 3 | 80 |
| 6 | 110 | 40 | 1 | 1 | 75 |
| 7 | 90 | 0 | 0 | 0 | 70 |
| 8 | 110 | 40 | 0 | 2 | 65 |
| 9 | 80 | 10 | 1 | 3 | 60 |
| 10 | 80 | 10 | 0 | 1 | 55 |

The dummy variables are 1 when (zero otherwise):

x1: republican

x2: democrat

x3: libertarian

| Student | IQ | Study hours | Gender | x1 | x2 | x3 | Test score |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 110 | 40 | 1 | 1 | 0 | 0 | 100 |
| 2 | 110 | 40 | 0 | 0 | 0 | 1 | 95 |
| 3 | 120 | 30 | 1 | 0 | 1 | 0 | 90 |
| 4 | 110 | 40 | 1 | 0 | 0 | 0 | 85 |
| 5 | 100 | 20 | 0 | 0 | 0 | 0 | 80 |
| 6 | 110 | 40 | 1 | 0 | 1 | 0 | 75 |
| 7 | 90 | 0 | 0 | 1 | 0 | 0 | 70 |
| 8 | 110 | 40 | 0 | 0 | 0 | 1 | 65 |
| 9 | 80 | 10 | 1 | 0 | 0 | 0 | 60 |
| 10 | 80 | 10 | 0 | 0 | 1 | 0 | 55 |
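The recoding can be automated. A sketch in which green is the baseline category (all dummies zero), matching the x1–x3 scheme above:

```python
# Party codes from the table; green is the omitted (baseline) category.
PARTY_CODES = {0: "republican", 1: "democrat", 2: "libertarian", 3: "green"}
DUMMY_LEVELS = ["republican", "democrat", "libertarian"]

def dummy_code(party_code):
    """Return [x1, x2, x3] for a coded Political Party value."""
    name = PARTY_CODES[party_code]
    return [1 if name == lvl else 0 for lvl in DUMMY_LEVELS]
```

For example, `dummy_code(0)` gives [1, 0, 0] (republican, as in student 1) and `dummy_code(3)` gives [0, 0, 0] (green, the baseline, as in student 4). With k categories, only k − 1 dummies are needed.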
