##### R basics and Anova and 2-sample T
##### working with gravid data:
###
## read it in, do some basic manipulations and graphs
> gravid.data=read.table("data/gravid.txt",header=TRUE)
> names(gravid.data)
[1] "cohort" "size"   "nembr"
> gravid.data[1,]
  cohort size nembr
1      1  4.1    15
> gravid.data[1,"size"]
[1] 4.1
> help(hist)
> cohort2=subset(gravid.data,cohort==2)
> mean(cohort2)   ## note: newer R versions drop mean() on data frames; use colMeans() there
  cohort     size    nembr
 2.00000  4.68250 23.33333
> cohort1=subset(gravid.data,cohort==1)
> mean(cohort1)
   cohort      size     nembr
 1.000000  4.325106 18.893617
> cohort1[1,3]
[1] 15
> cohort1[1,"nembr"]
[1] 15
> hist(cohort1[,"nembr"])
> hist(cohort2[,"nembr"])
> boxplot(gravid.data[,"nembr"] ~ gravid.data[,"cohort"])
###
## Just a few more basic things (variance & graphs)
> var(gravid.data[,"nembr"])
[1] 26.03427
> var(cohort1[,"nembr"])
[1] 10.87974
> var(cohort2[,"nembr"])
[1] 31.46099
> plot(size~cohort,data=gravid.data)

##### Fisher's and chi-sq
### (working with racing data)
## here's how you can type data directly into a matrix:
> matrix(c(9,4,3,9),2)
     [,1] [,2]
[1,]    9    3
[2,]    4    9
> racing=matrix(c(9,4,3,9),2)
## Fisher's exact test
> fisher.test(racing)

	Fisher's Exact Test for Count Data

data:  racing
p-value = 0.04718
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  0.9006803 57.2549701
sample estimates:
odds ratio
  6.180528

## chi-squared on the same data
> chisq.test(racing)

	Pearson's Chi-squared test with Yates' continuity correction

data:  racing
X-squared = 3.2793, df = 1, p-value = 0.07016

####### One-sample T and Wilcoxon
## one-sample T-test
## This example tests whether the subjects' energy intake differs significantly from
## the Recommended Daily Allowance (RDA) of 7725 kJ.
> daily.intake = c(5260,5470,5640,6180,6390,6515,6805,7515,7515,8230,8770)
> hist(daily.intake)   ## let's look at the data
> t.test(daily.intake,mu=7725)

	One Sample t-test

data:  daily.intake
t = -2.8208, df = 10, p-value = 0.01814
alternative hypothesis: true mean is not equal to 7725
95 percent confidence interval:
 5986.348 7520.925
sample estimates:
mean of x
 6753.636

## one-sample Wilcoxon signed-rank on the same data as above
> wilcox.test(daily.intake, mu=7725)

	Wilcoxon signed rank test with continuity correction

data:  daily.intake
V = 8, p-value = 0.0293
alternative hypothesis: true location is not equal to 7725

Warning message:
In wilcox.test.default(daily.intake, mu = 7725) :
  cannot compute exact p-value with ties

######## 2-sample tests
###
## 2-sample t-test
## note: I later tried it with "cohort" converted to a factor and got the
## same answer (also true of aov). That's not because R guessed that cohort
## should be a factor: with exactly two groups, a numeric grouping variable
## and a two-level factor give identical fits. With 3+ groups you must convert.
> t.test(gravid.data[,"nembr"] ~ gravid.data[,"cohort"])

	Welch Two Sample t-test

data:  gravid.data[, "nembr"] by gravid.data[, "cohort"]
t = -4.7143, df = 76.333, p-value = 1.071e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -6.315272 -2.564161
sample estimates:
mean in group 1 mean in group 2
       18.89362        23.33333

## Two-sample Wilcoxon
> wilcox.test(formula=nembr~cohort, data=gravid.data)

	Wilcoxon rank sum test with continuity correction

data:  nembr by cohort
W = 483, p-value = 1.512e-06
alternative hypothesis: true location shift is not equal to 0

Warning message:
In wilcox.test.default(x = c(15L, 15L, 20L, 19L, 20L, 18L, 22L,  :
  cannot compute exact p-value with ties

## Paired t-test
> prepost.data=read.table("data/prepost.txt",header=TRUE)
> attach(prepost.data)
> t.test(pre, post, paired=T)

	Paired t-test

data:  pre and post
t = 11.9414, df = 10, p-value = 3.059e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 1074.072 1566.838
sample estimates:
mean of the differences
               1320.455

## Matched-pairs Wilcoxon
## Note: data is still prepost.data (still "attached")
> wilcox.test(pre, post, paired=T)

	Wilcoxon signed rank test with continuity correction

data:  pre and post
V = 66, p-value = 0.00384
alternative hypothesis: true location shift is not equal to 0

Warning message:
In wilcox.test.default(pre, post, paired = T) :
  cannot compute exact p-value with ties

###### ANOVA: the indep vble can have 2 or more groups/values.
## (The other tests we've seen worked only for 2 groups.)
## One-way ANOVA
## "One way" means investigating one indep vble (eg Treatment), same
## as in the above ex's.
###
## anova: 2 samples (same thing we did the t-test on earlier)
> aov(nembr~cohort,data=gravid.data)
Call:
   aov(formula = nembr ~ cohort, data = gravid.data)

Terms:
                  cohort Residuals
Sum of Squares  468.0863 1979.1348
Deg. of Freedom        1        93

Residual standard error: 4.613135
Estimated effects may be unbalanced
> gravid.aov=aov(nembr~cohort,data=gravid.data)
> summary(gravid.aov)
            Df  Sum Sq Mean Sq F value    Pr(>F)
cohort       1  468.09  468.09  21.995 9.353e-06 ***
Residuals   93 1979.13   21.28
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

## Here is anova again, this time with the factor explicitly set up.
## Got the same answers -- again, only because cohort has just two levels,
## so the 1-df numeric slope and the 2-level factor fit the same model.
> attach(gravid.data)
> cohort.f = factor(cohort)   ### with anova, convert any numeric group # to a factor!!
> gravid.aov2=aov(nembr ~ cohort.f)
> summary(gravid.aov2)
            Df  Sum Sq Mean Sq F value    Pr(>F)
cohort.f     1  468.09  468.09  21.995 9.353e-06 ***
Residuals   93 1979.13   21.28
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
> detach(gravid.data)

### Kruskal-Wallis on the gravid data
## (non-parametric counterpart to ANOVA)
> kruskal.test(nembr~cohort,data=gravid.data)

	Kruskal-Wallis rank sum test

data:  nembr by cohort
Kruskal-Wallis chi-squared = 23.1684, df = 1, p-value = 1.484e-06

#####
### ANOVA "contrast" example on more than 2 treatments, working with fluoride data (6 treatments).
### Note: ANOVA is implemented in part with lm, but don't confuse the two, and don't
### get sloppy by interchanging them when you shouldn't.
## read it in, set up the factor, "attach" for convenience of typing
> fluoride.data=read.table("/Users/burnett/MMB/MiscResearch/Empirical-S+howto/Example-st515Assigns/fluoride.txt",header=TRUE)
> names(fluoride.data)
[1] "tooth"  "cement" "uptake"
> attach(fluoride.data)
> cement.f=factor(cement)
> boxplot(uptake ~ cement.f)   ## looks like cement 5 is different
## So, we need to set up the anova so that cement 5 is treated as the baseline.
## We'll do that by forcing cement 5's "treatment #" to be the "first" (here, force it to be 0).
## (Note: relevel(cement.f, ref="5") does this re-baselining more directly.)
> cement
 [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4
[39] 4 4 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6
> cement2=(cement==5)
> cement3=(cement*(1-cement2))
> cement3
 [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4
[39] 4 4 0 0 0 0 0 0 0 0 0 0 6 6 6 6 6 6 6 6 6 6
> cement.f=factor(cement3)   ## and of course, it needs to be a factor, not a number.
## do the actual ANOVA "contrast"
> fluoride.aov=aov(uptake ~ cement.f)
> summary(fluoride.aov)
            Df   Sum Sq  Mean Sq F value    Pr(>F)
cement.f     5 0.091641 0.018328    7.04 3.954e-05 ***
Residuals   54 0.140586 0.002603
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## ok, we got the right DF, and it's signif, but which ones are actually different from T5?
## lm's output on this data tells us -- it's not REALLY an lm here (we're doing it on a factor),
## and it tells us what we wanted to know without any extra tests!
> summary(lm(uptake ~ cement.f))

Call:
lm(formula = uptake ~ cement.f)

Residuals:
      Min        1Q    Median        3Q       Max
-0.107820 -0.026820 -0.001145  0.024877  0.148660

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.06134    0.01614   3.802 0.000367 ***
cement.f1   -0.11985    0.02282  -5.252 2.61e-06 ***
cement.f2   -0.10352    0.02282  -4.537 3.22e-05 ***
cement.f3   -0.09411    0.02282  -4.124 0.000129 ***
cement.f4   -0.08869    0.02282  -3.887 0.000280 ***
cement.f6   -0.10190    0.02282  -4.466 4.11e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.05102 on 54 degrees of freedom
Multiple R-squared: 0.3946, Adjusted R-squared: 0.3386
F-statistic: 7.04 on 5 and 54 DF,  p-value: 3.954e-05

## clean up by detaching, to avoid confusion with later data:
> detach(fluoride.data)

####### ANOVA for within-subject designs (also works on Latin squares)
### Demo #1 (for a mixed between AND within design)
# The between-groups test indicates that the variable group is significant;
# consequently, in the graph we see that the lines for the two groups are rather far apart.
# The within-subject test indicates that there is not a significant time effect;
# in other words, the groups do not change in pulse over time.
# In the graph:
# the groups have lines that are flat,
# i.e. the slopes of the lines are approximately zero. Also, since the lines are parallel,
# we are not surprised that the interaction between time and group is not significant.
# Do these things:
demo1<-read.table("demo1.csv", header=T, sep=",")
attach(demo1)
par(cex=.6)
interaction.plot(time, factor(group), pulse, ylim=c(5, 20), lty=c(1, 12),
   ylab="mean of pulse", xlab="time", lwd=3, trace.label="group")
demo1.aov <- aov(pulse ~ factor(group)*factor(time) + Error(factor(id)))
summary(demo1.aov)
detach(demo1)

# You get this:
Error: factor(id)
              Df  Sum Sq Mean Sq F value    Pr(>F)
factor(group)  1 155.042 155.042    3721 1.305e-09 ***
Residuals      6   0.250   0.042

Error: Within
                           Df  Sum Sq Mean Sq F value Pr(>F)
factor(time)                2 0.08333 0.04167       1 0.3966
factor(group):factor(time)  2 0.08333 0.04167       1 0.3966
Residuals                  12 0.50000 0.04167

### Demo #2 (for a mixed between AND within design)
# The between-groups test indicates that the variable group is not significant
# (consequently, in the graph we see that the lines for the two groups are rather close together).
# The within-subject test indicates that there is a significant time effect;
# in other words, the groups change in pulse over time.
# In the graph:
# the groups have lines that increase over time.
# Again, the lines are parallel, consistent with the finding that the interaction is not significant.
# Do these things:
demo2<-read.table("demo2.csv", header=T, sep=",")
par(cex=.6)
attach(demo2)
interaction.plot(time, factor(group), pulse, ylim=c(10, 40), lty=c(1, 12),
   lwd=3, ylab="mean of pulse", xlab="time", trace.label="group")
demo2.aov <- aov(pulse ~ factor(group)*factor(time) + Error(factor(id)))
summary(demo2.aov)
detach(demo2)

# You get this:
Error: factor(id)
              Df  Sum Sq Mean Sq F value Pr(>F)
factor(group)  1  15.042  15.042  0.8363 0.3957
Residuals      6 107.917  17.986

Error: Within
                           Df Sum Sq Mean Sq F value    Pr(>F)
factor(time)                2 978.25  489.12 53.6845 1.032e-06 ***
factor(group):factor(time)  2   1.08    0.54  0.0595    0.9426
Residuals                  12 109.33    9.11

### Demo #3 (for a mixed between AND within design)
# The between-groups test indicates that the variable group is significant;
# consequently, in the graph we see that the lines for the two groups are rather far apart.
# The within-subject test indicates that there is a significant time effect;
# in other words, the groups do change over time (both groups' pulse drops over time).
# Moreover, the interaction of time and group is significant,
# which means that the groups are changing over time but in different ways;
# so in the graph the lines will not be parallel.
# In the graph:
# the groups have non-parallel lines that decrease over time and
# get progressively closer together over time.
demo3<-read.table("demo3.csv", header=T, sep=",")
par(cex=.6)
attach(demo3)
interaction.plot(time, factor(group), pulse, ylim=c(10, 60), lty=c(1, 12),
   lwd=3, ylab="mean of pulse", xlab="time", trace.label="group")
demo3.aov <- aov(pulse ~ factor(group)*factor(time) + Error(factor(id)))
summary(demo3.aov)
detach(demo3)

# You get this:
Error: factor(id)
              Df  Sum Sq Mean Sq F value    Pr(>F)
factor(group)  1 2035.04 2035.04  343.15 1.596e-06 ***
Residuals      6   35.58    5.93

Error: Within
                           Df  Sum Sq Mean Sq F value    Pr(>F)
factor(time)                2 2830.33 1415.17 553.761 1.517e-12 ***
factor(group):factor(time)  2  200.33  100.17  39.196 5.474e-06 ***
Residuals                  12   30.67    2.56

####### ####### ####### ####### ####### ####### ####### ####### #######
## Ignore ignore ignore ignore...
## Ignore the following example, it's obsolete -- I think they've made it too complicated,
## and also I think it has a couple of errors in it.
> data.ex3=read.table("data/R.appendix3.data.txt",header=T)   # within-subj data, arranged 1 outcome/line
> data.ex3   # show the data
   Observation Subject Valence Recall
1            1     Jim     Neg     32
2            2     Jim     Neu     15
3            3     Jim     Pos     45
4            4  Victor     Neg     30
5            5  Victor     Neu     13
6            6  Victor     Pos     40
7            7    Faye     Neg     26
8            8    Faye     Neu     12
9            9    Faye     Pos     42
10          10     Ron     Neg     22
11          11     Ron     Neu     10
12          12     Ron     Pos     38
13          13   Jason     Neg     29
14          14   Jason     Neu      8
15          15   Jason     Pos     35
## Because Valence is crossed with the random factor Subject (i.e., every subject sees all
## three types of words), you must specify the error term for Valence, which in this case
## is Subject by Valence. Do this by adding the term Error(Subject/Valence) to the factor
## Valence, as shown:
> aov.ex3 = aov(Recall~Valence+Error(Subject/Valence),data.ex3)
> summary(aov.ex3)

Error: Subject
          Df  Sum Sq Mean Sq F value Pr(>F)
Residuals  4 105.067  26.267

Error: Subject:Valence
          Df  Sum Sq Mean Sq F value    Pr(>F)
Valence    2 2029.73 1014.87  189.11 1.841e-07 ***
Residuals  8   42.93    5.37
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Q: What if we hadn't used the Error term?
## A: We'd get a slightly different answer (so, we'd be cheating a little).
> aov.ex3.whatif = aov(Recall~Valence,data.ex3)
> summary(aov.ex3.whatif)
            Df  Sum Sq Mean Sq F value    Pr(>F)
Valence      2 2029.73 1014.87  82.287 9.852e-08 ***
Residuals   12  148.00   12.33
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> summary(lm(Recall~Valence+Error(Subject/Valence),data.ex3))
Error in eval(expr, envir, enclos) : could not find function "Error"
> summary(lm(Recall~Valence,data.ex3))

Call:
lm(formula = Recall ~ Valence, data = data.ex3)

Residuals:
   Min     1Q Median     3Q    Max
  -5.8   -1.9    0.4    2.1    5.0

####### ####### ####### ####### ####### #######
## End of Ignore ignore ignore ignore...

#####
##### Linear regression
##### For cause/effect relationships between 2 numeric variables
> gravid.data=read.table("data/gravid.txt",header=T)
> attach(gravid.data)
> names(gravid.data)
[1] "cohort" "size"   "nembr"
> summary(lm(size~nembr))

Call:
lm(formula = size ~ nembr)

Residuals:
      Min        1Q    Median        3Q       Max
-0.561610 -0.256746  0.008663  0.209201  0.780011

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.625945   0.134386  26.982  < 2e-16 ***
nembr       0.041621   0.006182   6.732 1.36e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3058 on 93 degrees of freedom
Multiple R-squared: 0.3277, Adjusted R-squared: 0.3204
F-statistic: 45.33 on 1 and 93 DF,  p-value: 1.356e-09

#####
##### Correlation
##### For relationships between 2 numeric variables, in which you make no claim about
##### which is cause and which is effect.
> cor.test(nembr,size)

	Pearson's product-moment correlation

data:  nembr and size
t = 6.7325, df = 93, p-value = 1.356e-09
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4192547 0.6939154
sample estimates:
      cor
0.5724287   # This is the correlation, i.e. the "R" to report.
            # (Note: 0.5724^2 = 0.3277, the "R2" above.)
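## Quick sanity check (a sketch with made-up data, not the gravid data): for a
## simple linear regression, the model's R-squared is exactly the squared
## Pearson correlation, which is why 0.5724^2 matches the 0.3277 above.

```r
# Hypothetical data, chosen only to illustrate the identity R^2 == r^2:
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)

r  <- cor(x, y)                       # Pearson correlation
R2 <- summary(lm(y ~ x))$r.squared    # R-squared from the regression

all.equal(r^2, R2)                    # TRUE
```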
## non-parametric variant
> cor.test(nembr,size,method="spearman")

	Spearman's rank correlation rho

data:  nembr and size
S = 64428.19, p-value = 8.339e-09
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho
0.5490748   # With Spearman's, you report "rho" instead of "R".

Warning message:
In cor.test.default(nembr, size, method = "spearman") :
  Cannot compute exact p-values with ties
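## A related note (a sketch with made-up numbers, not the gravid data):
## Spearman's rho is just Pearson's r computed on the ranks of the data,
## which is also why ties get in the way of an exact p-value. To silence the
## warning, request the normal approximation explicitly:
## cor.test(x, y, method="spearman", exact=FALSE)

```r
# Hypothetical values, chosen only to illustrate the rank identity:
x <- c(5260, 5470, 5640, 6180, 6390, 6515)
y <- c(15, 13, 20, 18, 22, 25)

rho <- cor(x, y, method = "spearman")
rho.by.hand <- cor(rank(x), rank(y))   # Pearson r of the ranks

all.equal(rho, rho.by.hand)            # TRUE
```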