This example is taken from Larry Wasserman’s lecture notes.
The SAT is an exam used for admission to US universities. In 1982, average SAT scores were published with state-by-state breakdowns of performance in the United States. The average scores varied considerably by state, ranging from 790 (South Carolina) to 1088 (Iowa). Two researchers examined compositional and demographic variables to determine to what extent these characteristics were tied to SAT scores.
data=read.table("SATData.txt", header=T, fill=T)
attach(data)
names(data)
## [1] "Rank" "State" "SAT" "Takers" "Income" "Years" "Public" "Expend"
## [9] "Rank.1"
We have a column with the ranks of the 50 states according to average SAT scores, a column with the corresponding names of the states, and then 7 columns corresponding to 7 variables: SAT (mean SAT score), Takers (percentage of eligible students tested), Income (mean household income, in $100s), Years (mean years of science and humanities coursework), Public (percentage of students in public schools), Expend (schooling expenditure per student, in $100s) and Rank.1 (median class ranking percentile of the test takers).
Notice that the states with high average SATs had low percentages of takers. One reason is that these are mostly midwestern states that administer other tests to students bound for college in-state. Only their best students planning to attend college out of state take the SAT exams. As the percentage of takers increases for other states, so does the likelihood that the takers include lower-qualified students.
Research Question: After accounting for the percentage of students who took the test and the median class rank of the test takers (to adjust, somewhat, for the selection bias in the samples from each state), which variables are associated with state SAT scores? After accounting for the percentage of takers and the median class rank of the takers, how do the states rank? Which states perform best for the amount of money they spend?
Exploratory data analysis allows us to look at the variables contained in the data set before beginning any formal analysis. First we examine the variables individually through histograms.
par(mfrow=c(2,4))
hist(SAT,main="Histogram of SAT scores",xlab="Mean SAT score",col=1)
hist(Takers,main="Histogram of Takers",xlab="Percentage of students tested",col=2)
hist(Income,main="Histogram of Income",xlab="Mean Household Income ($100s)",col=3)
hist(Years,main="Histogram of Years",xlab="Mean Years of Science and Humanities",col=4)
hist(Public,main="Public School Percentage",xlab="Percentage of Students in Public Schools",col=5)
hist(Expend,main="Histogram of Expenditures",xlab="Schooling Expenditures per Student ($100s)",col=6)
hist(Rank.1,main="Histogram of Class Rank",xlab="Median Class Ranking Percentile",col=7)
Here we can see the general range of the data, shape (skewed, gapped, symmetric, etc.), as well as any other trends. For example, we note that one state has almost double the per-student schooling expenditure of any other state. We may be interested in determining which state this is:
data[which(Expend==max(Expend)),2]
## [1] "Alaska"
Similarly, we can find the state with the lowest mean household income:
data[which(Income==min(Income)),2]
## [1] "Colorado"
Next we look at the variables together.
par(mfrow = c(1, 1))
pairs(data[,-c(1,2)]) # scatterplot matrix of 'data', ignoring the first two columns (Rank and State)
The scatterplot matrix shows the relationships between the variables at a glance. Generally we are looking for trends here. Does the value of one variable tend to vary with the value of another? If so, is that relationship linear? These questions help us decide whether we should include higher-order (more generally, non-linear) terms in the regression model. For example, Takers should be replaced by log(Takers), since its relationship with SAT appears logarithmic.
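A quick way to check this visually (a small sketch using the variables already attached above) is to plot SAT against Takers on both the raw and the log scale:
par(mfrow=c(1,2))
plot(Takers, SAT, xlab="Percentage of students tested", ylab="Mean SAT score")
plot(log(Takers), SAT, xlab="log(Takers)", ylab="Mean SAT score") # roughly linear on the log scale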
Here we can confirm some of the observations of the problem statement. The scatterplot matrix shows clear relationships between SAT, Takers and Rank.1. Interestingly, we can also pick out Alaska's features, since we know it's the state with the very high Expend value. We can see that Alaska has a rather average SAT score despite its very high level of spending. For now we will leave Alaska in the data set, but a more complete analysis would ask whether to remove outliers and high-influence points. In fact, this data set contains two rather obvious outliers.
Since subtle trends are often difficult to spot in scatterplot matrices, sometimes a correlation matrix can be useful:
round(cor(data[,-c(1,2)]), 2)
## SAT Takers Income Years Public Expend Rank.1
## SAT 1.00 -0.86 0.40 0.33 -0.08 -0.06 0.88
## Takers -0.86 1.00 -0.46 -0.10 0.12 0.28 -0.94
## Income 0.40 -0.46 1.00 0.01 -0.31 0.04 0.38
## Years 0.33 -0.10 0.01 1.00 -0.42 0.05 0.07
## Public -0.08 0.12 -0.31 -0.42 1.00 0.29 0.05
## Expend -0.06 0.28 0.04 0.05 0.29 1.00 -0.26
## Rank.1 0.88 -0.94 0.38 0.07 0.05 -0.26 1.00
By default R prints correlation matrices with many significant digits, so the 'round' command makes the output more easily readable. We note that both the Income and Years variables have moderately strong positive correlations with the response variable (SAT). The respective correlations of 0.40 and 0.33 indicate that higher levels of income and more years of education in sciences and humanities are generally associated with higher mean SAT scores. However, this does not imply causation, and each of these trends may be nullified or even reversed once we account for the other variables in the data set!
A variable such as Years may be of particular interest to researchers. Although neither science nor humanities is directly tested on the SAT, researchers may be interested in whether an increase in the number of years of such classes is associated with a significant increase in SAT scores. This may help them make recommendations to schools as to how to plan their curricula.
We fit the linear model with all predictors:
lm.full = lm(SAT ~ I(log(Takers)) + Income + Years + Public + Expend + Rank.1)
summary(lm.full)
##
## Call:
## lm(formula = SAT ~ I(log(Takers)) + Income + Years + Public +
## Expend + Rank.1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -61.102 -8.190 1.943 15.198 53.827
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 433.08860 289.33692 1.497 0.141742
## I(log(Takers)) -39.49284 16.06791 -2.458 0.018083 *
## Income -0.03983 0.08277 -0.481 0.632846
## Years 16.75113 6.47292 2.588 0.013119 *
## Public -0.13403 0.53666 -0.250 0.803971
## Expend 2.57425 0.71634 3.594 0.000834 ***
## Rank.1 3.95633 2.52045 1.570 0.123817
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.83 on 43 degrees of freedom
## Multiple R-squared: 0.8923, Adjusted R-squared: 0.8772
## F-statistic: 59.35 on 6 and 43 DF, p-value: < 2.2e-16
log(Takers), Years and Expend appear to be statistically significant (given that all the other variables are in the model). For example, even though Rank.1 is highly correlated with SAT, it does not appear to be significant. This can be explained by the very high correlation (-0.94) between Takers and Rank.1.
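One way to quantify this collinearity is through variance inflation factors. A minimal sketch, assuming the car package is installed (it is not used elsewhere in these notes):
library(car) # assumed available; provides vif()
vif(lm.full) # large values for I(log(Takers)) and Rank.1 would reflect their strong correlation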
names(lm.full)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
residuals.sat=lm.full$residuals
qqnorm(residuals.sat)
qqline(residuals.sat)
hist(residuals.sat)
The Q-Q plot indicates that the residuals in our regression model have somewhat heavy tails: on both the negative and the positive side (mainly the negative), the observed (sample) quantiles are larger in absolute value than the theoretical quantiles. This is confirmed by the histogram.
We won’t do anything about this now, but we keep in mind that this may affect our results.
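As a rough numerical complement to the visual check, one could run a Shapiro-Wilk test on the residuals (a sketch; this test is not part of the original analysis and can be sensitive in small samples):
shapiro.test(residuals.sat) # a small p-value suggests departure from normality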
We can also do residual plots to check for heteroscedasticity, correlation in errors and outliers:
plot(lm.full$fitted.values, residuals.sat, xlab="Fitted Values", ylab="Residuals")
It is better to use studentized residuals:
library(MASS) # provides studres()
student.res = studres(lm.full)
plot(lm.full$fitted.values, student.res, xlab="Fitted Values", ylab="Studentized Residuals")
data[which.max(abs(student.res)),2]
## [1] "Alaska"
The residual plot appears to be fine.
What about high leverage points?
leverages = hatvalues(lm.full)
plot(leverages, student.res, xlab="Leverages", ylab="Studentized Residuals")
data[which.max(leverages),2]
## [1] "Colorado"
data[which.max(leverages[-18])+1,2] # removing index 18 shifts all later indices down by 1; which.max(leverages[-18]) falls after 18, so we add 1 back
## [1] "Alaska"
The highest leverage belongs to Colorado, and the second highest to Alaska! Alaska is also (somewhat of) an outlier, so perhaps we should remove it. We won't act on this here, but we should keep it in mind.
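A common rule of thumb flags observations whose leverage exceeds 2p/n, where p is the number of regression coefficients and n the sample size. A minimal sketch using the quantities above:
p = length(coef(lm.full)) # 7 coefficients, including the intercept
n = nrow(data) # 50 states
data[which(leverages > 2*p/n), 2] # states exceeding the 2p/n threshold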
Let’s try dropping all the variables except log(Takers) and Rank.1. We can test whether we should do so using the appropriate F-test.
lm.reduced=lm(SAT~log(Takers)+Rank.1)
anova(lm.full,lm.reduced)
## Analysis of Variance Table
##
## Model 1: SAT ~ I(log(Takers)) + Income + Years + Public + Expend + Rank.1
## Model 2: SAT ~ log(Takers) + Rank.1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 43 26505
## 2 47 45530 -4 -19025 7.7161 8.862e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value of the F-test is small, we reject the reduced model. That is, we should not remove all of the variables Years, Public, Expend and Income, since collectively they contain useful information about the SAT score. We don't know yet exactly which of them are important, but we should not discard them. We can instead apply a subset selection method (see later chapters).
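To see where the anova() numbers come from, we can reproduce the F-statistic directly from the two residual sums of squares (using only quantities computed above):
rss.full = sum(residuals(lm.full)^2) # 26505 in the table above
rss.reduced = sum(residuals(lm.reduced)^2) # 45530
df.full = lm.full$df.residual # 43
df.diff = lm.reduced$df.residual - lm.full$df.residual # 4 dropped parameters
F.stat = ((rss.reduced - rss.full)/df.diff) / (rss.full/df.full) # 7.7161
pf(F.stat, df.diff, df.full, lower.tail=FALSE) # matches the p-value 8.862e-05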
To address the research question of how the states rank after accounting for the percentage of takers and the median class rank, we use our reduced model (lm.reduced above). Instead of ranking by raw SAT score, we can rank the states by how far they fall above or below their fitted regression value, i.e. by the corresponding residuals! If a state's residual \(e_i=y_i-\hat{y}_i\) is positive, its SAT score is higher than expected based on log(Takers) and Rank.1 alone, which suggests that the remaining variables contribute positively to the SAT score (i.e. the state is doing something right). Vice versa if the residual is negative.
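As a quick sanity check of this definition, we can verify it for a single state, say Iowa:
i = which(data$State == "Iowa")
SAT[i] - fitted(lm.reduced)[i] # equals residuals(lm.reduced)[i], about 40.6 (see the table below)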
We sort the states in decreasing order of residual, and present the new ranking together with the original one.
order.vec=order(residuals(lm.reduced),decreasing=TRUE)
states=factor(data[order.vec,2])
newsat=data.frame(State=states,Residual=as.numeric(residuals(lm.reduced)[order.vec]),oldrank=(1:50)[order.vec])
newsat
## State Residual oldrank
## 1 NewHampshire 48.467052 28
## 2 Iowa 40.621942 1
## 3 Connecticut 38.006930 35
## 4 Montana 37.855248 6
## 5 Minnesota 34.009519 7
## 6 Washington 33.614260 19
## 7 Wisconsin 31.211345 10
## 8 Colorado 30.202790 18
## 9 NewYork 29.253874 36
## 10 Kansas 28.854485 4
## 11 Massachusetts 27.068472 41
## 12 Illinois 25.596688 21
## 13 Nebraska 23.582686 5
## 14 Vermont 20.551727 32
## 15 NorthDakota 20.382315 3
## 16 Tennessee 16.366914 13
## 17 Delaware 12.737932 34
## 18 Maryland 12.377678 39
## 19 Virginia 10.993494 40
## 20 Ohio 10.714216 27
## 21 NewMexico 8.325184 14
## 22 RhodeIsland 8.097229 43
## 23 NewJersey 7.607435 44
## 24 SouthDakota 7.141529 2
## 25 Alaska 5.364490 29
## 26 Arizona 5.278973 20
## 27 Missouri 3.837926 23
## 28 Pennsylvania 3.824760 42
## 29 Oregon 2.602455 31
## 30 Maine 2.181054 37
## 31 Michigan -1.037601 24
## 32 Wyoming -2.021042 9
## 33 Utah -3.271303 8
## 34 Idaho -4.981162 15
## 35 Florida -6.279225 38
## 36 California -6.845704 33
## 37 Oklahoma -13.468124 11
## 38 Hawaii -18.584557 47
## 39 Kentucky -23.145671 17
## 40 Indiana -24.697457 46
## 41 Nevada -28.796972 30
## 42 WestVirginia -32.700044 25
## 43 Louisiana -32.998188 22
## 44 Arkansas -34.179002 12
## 45 Alabama -38.154990 26
## 46 Texas -40.532651 45
## 47 Georgia -58.718084 49
## 48 Mississippi -60.336567 16
## 49 NorthCarolina -61.524429 48
## 50 SouthCarolina -94.457830 50
Note how dramatically the rankings shift once we control for the variables Takers and Rank.1. New Hampshire, for example, shifts from 28th in raw score to 1st in residuals. Similar shifts happen in the reverse direction, with Arkansas sliding from 12th to 44th.
We could further analyze the ranks by accounting for such things as expenditures, to get a sense of which states appear to make 'efficient' use of their spending. Try it at home!
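A possible starting point (a sketch, not carried out in these notes): plot the reduced-model residuals against Expend; states sitting far above the cloud achieve high adjusted performance relative to what they spend.
plot(Expend, residuals(lm.reduced), xlab="Expenditure per Student ($100s)", ylab="Reduced-model Residual")
text(Expend, residuals(lm.reduced), labels=data$State, cex=0.5, pos=3) # label each point with its state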