##### ANOVA vs. ANCOVA vs. Interactions ##### working with gravid data. ##### Source: started with my hw3-stats file, but then fiddled. ### read it in, take a look at it. > gravid.data=read.table("/Users/burnett/MMB/MiscResearch/Empirical-S+howto/Example-st515Assigns/DataForAssigns/gravid.txt",header=TRUE) > attach(gravid.data) > names(gravid.data) [1] "cohort" "size" "nembr" > boxplot(size~cohort) ### ### Look at the data: this plot gives the best picture of what's happening: ### nembr seems to differ by cohort, but also seems related to size ### (2 parallel lines). > plot(size,nembr,type="n"); text(size,nembr,labels=cohort) ## Here's where the lines came from: > gravid.regression=lm(nembr~size) > abline(gravid.regression) > gravid.regression=lm(nembr~cohort) > abline(gravid.regression) ### Start with just the simple model. (We've seen this before.) ### Model1 cohort: nembr ~ cohort. > cohort.f=factor(cohort) > aov(nembr~cohort.f) Call: aov(formula = nembr ~ cohort.f) Terms: cohort.f Residuals Sum of Squares 468.0863 1979.1348 Deg. of Freedom 1 93 Residual standard error: 4.613135 Estimated effects may be unbalanced > summary(aov(nembr~cohort.f)) Df Sum Sq Mean Sq F value Pr(>F) cohort.f 1 468.09 468.09 21.995 9.353e-06 *** Residuals 93 1979.13 21.28 --- Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 ### ---------------------------------------------------- ### Model 2: nembr ~ (cohort and size). Main effects only. ### This is the ANCOVA model. > gravid.lm.model2 = lm(nembr ~ cohort.f + size) > summary(gravid.lm.model2) Call: lm(formula = nembr ~ cohort.f + size) Residuals: Min 1Q Median 3Q Max -11.81823 -1.99596 0.03492 2.27726 17.60081 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -9.1307 5.6970 -1.603 0.1124 ##<- ignore cohort.f2 2.1240 0.9668 2.197 0.0305 * ##<- ignore size 6.4794 1.3098 4.947 3.39e-06 *** ##<- ignore --- Signif. codes: 0 Ô***Õ 0.001 Ô**Õ 0.01 Ô*Õ 0.05 Ô.Õ 0.1 Ô Õ 1 Residual standard error: 4.122 on 92 degrees of freedom Multiple R-squared: 0.3612, Adjusted R-squared: 0.3473 F-statistic: 26.01 on 2 and 92 DF, p-value: 1.115e-09 ##this p-value we care about (tells if we trust the model) > summary(aov(gravid.lm.model2)) Df Sum Sq Mean Sq F value Pr(>F) cohort.f 1 468.09 468.09 27.547 9.792e-07 *** size 1 415.82 415.82 24.471 3.388e-06 *** Residuals 92 1563.31 16.99 --- Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 ----------------------------------------------------------------------- ### ### Model 3: nembr ~ (cohort and size). WITH interaction. > gravid.lm.model3 = lm(nembr ~ cohort.f + size + (cohort.f*size)) > summary(gravid.lm.model3) > gravid.lm.model3 = lm(nembr ~ cohort.f + size + (cohort.f*size)) > summary(gravid.lm.model3) Call: lm(formula = nembr ~ cohort.f + size + (cohort.f * size)) Residuals: Min 1Q Median 3Q Max -9.5412 -2.4557 0.2548 2.1780 14.9670 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 10.507 7.727 1.360 0.177275 ##<- ignore cohort.f2 -37.168 11.152 -3.333 0.001245 ** ##<- ignore size 1.939 1.782 1.088 0.279359 ##<- ignore cohort.f2:size 8.738 2.472 3.535 0.000643 *** ##<- ignore --- Signif. codes: 0 Ô***Õ 0.001 Ô**Õ 0.01 Ô*Õ 0.05 Ô.Õ 0.1 Ô Õ 1 Residual standard error: 3.887 on 91 degrees of freedom Multiple R-squared: 0.4383, Adjusted R-squared: 0.4198 F-statistic: 23.67 on 3 and 91 DF, p-value: 2.059e-11 > summary(aov(gravid.lm.model3)) Df Sum Sq Mean Sq F value Pr(>F) cohort.f 1 468.09 468.09 30.989 2.599e-07 *** size 1 415.82 415.82 27.529 1.003e-06 *** cohort.f:size 1 188.75 188.75 12.496 0.0006434 *** Residuals 91 1374.56 15.11 --- Signif. codes: 0 Ô***Õ 0.001 Ô**Õ 0.01 Ô*Õ 0.05 Ô.Õ 0.1 Ô Õ 1 ### ------------------------------------------------- ### which of the above is the best model? ## Answer: Looks like Model 3 is best: everything came out ## significant, which seems to say that it's all highly relevant, ## ie that there are different means for each of these factors ## that we need to account for separately. (Using the most ## complicated model that shows signif, which turns out to be what we did here, ## is always the safest choice too. ## Note: in some cases you have to run an extra F-test to find out the ## answer to this question. (To find out why, take stats-515.)