##### R basics and Anova and 2-sample T
##### working with gravid data:
###
## read it in, do some basic manipulations and graphs
> gravid.data=read.table("data/gravid.txt",header=TRUE)
> names(gravid.data)
[1] "cohort" "size"   "nembr"
> gravid.data[1,]
  cohort size nembr
1      1  4.1    15
> gravid.data[1,"size"]
[1] 4.1
> help(hist)
> cohort2=subset(gravid.data,cohort==2)
> mean(cohort2)   ## note: newer R versions drop mean() on data frames; use colMeans() there
  cohort     size    nembr
 2.00000  4.68250 23.33333
> cohort1=subset(gravid.data,cohort==1)
> mean(cohort1)
   cohort      size     nembr
 1.000000  4.325106 18.893617
> cohort1[1,3]
[1] 15
> cohort1[1,"nembr"]
[1] 15
> hist(cohort1[,"nembr"])
> hist(cohort2[,"nembr"])
> boxplot(gravid.data[,"nembr"] ~ gravid.data[,"cohort"])
###
## Just a few more basic things (variance & graphs)
> var(gravid.data[,"nembr"])
[1] 26.03427
> var(cohort1[,"nembr"])
[1] 10.87974
> var(cohort2[,"nembr"])
[1] 31.46099
> plot(size~cohort,data=gravid.data)

##### Fisher's and chi-sq
### (working with racing data)
## here's how you can type data directly into a matrix:
> matrix(c(9,4,3,9),2)
     [,1] [,2]
[1,]    9    3
[2,]    4    9
> racing=matrix(c(9,4,3,9),2)
## Fisher's exact test
> fisher.test(racing)

	Fisher's Exact Test for Count Data

data:  racing
p-value = 0.04718
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  0.9006803 57.2549701
sample estimates:
odds ratio
  6.180528

## chi-squared on the same data
> chisq.test(racing)

	Pearson's Chi-squared test with Yates' continuity correction

data:  racing
X-squared = 3.2793, df = 1, p-value = 0.07016

####### One-sample T and Wilcoxon
## one-sample T-test
## This example tests whether the subjects' energy intake differs significantly from
## the Recommended Daily Allowance (RDA) of 7725 kJ.
> daily.intake = c(5260,5470,5640,6180,6390,6515,6805,7515,7515,8230,8770)
> hist(daily.intake)   ## let's look at the data
> t.test(daily.intake,mu=7725)

	One Sample t-test

data:  daily.intake
t = -2.8208, df = 10, p-value = 0.01814
alternative hypothesis: true mean is not equal to 7725
95 percent confidence interval:
 5986.348 7520.925
sample estimates:
mean of x
 6753.636

## one-sample Wilcoxon signed-rank on the same data as above
> wilcox.test(daily.intake, mu=7725)

	Wilcoxon signed rank test with continuity correction

data:  daily.intake
V = 8, p-value = 0.0293
alternative hypothesis: true location is not equal to 7725

Warning message:
In wilcox.test.default(daily.intake, mu = 7725) :
  cannot compute exact p-value with ties

######## 2-sample tests
###
## 2-sample t-test
## note: I later tried it with "cohort" converted to a factor and got the
## same answer (also true of aov). That's not because R guessed that cohort
## should be a factor: with exactly two groups, a numeric grouping variable
## and a two-level factor give identical fits. With 3+ groups you must convert.
> t.test(gravid.data[,"nembr"] ~ gravid.data[,"cohort"])

	Welch Two Sample t-test

data:  gravid.data[, "nembr"] by gravid.data[, "cohort"]
t = -4.7143, df = 76.333, p-value = 1.071e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -6.315272 -2.564161
sample estimates:
mean in group 1 mean in group 2
       18.89362        23.33333

## Two-sample Wilcoxon
> wilcox.test(formula=nembr~cohort, data=gravid.data)

	Wilcoxon rank sum test with continuity correction

data:  nembr by cohort
W = 483, p-value = 1.512e-06
alternative hypothesis: true location shift is not equal to 0

Warning message:
In wilcox.test.default(x = c(15L, 15L, 20L, 19L, 20L, 18L, 22L,  :
  cannot compute exact p-value with ties

## Paired t-test
> prepost.data=read.table("data/prepost.txt",header=TRUE)
> attach(prepost.data)
> t.test(pre, post, paired=T)

	Paired t-test

data:  pre and post
t = 11.9414, df = 10, p-value = 3.059e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 1074.072 1566.838
sample estimates:
mean of the differences
               1320.455

## Matched-pairs Wilcoxon
## Note: data is still prepost.data (still "attached")
> wilcox.test(pre, post, paired=T)

	Wilcoxon signed rank test with continuity correction

data:  pre and post
V = 66, p-value = 0.00384
alternative hypothesis: true location shift is not equal to 0

Warning message:
In wilcox.test.default(pre, post, paired = T) :
  cannot compute exact p-value with ties

###### ANOVA: the indep vble can have 2 or more groups/values.
## (The other tests we've seen worked only for 2 groups.)
## One-way ANOVA
## "One way" means investigating one indep vble (eg Treatment), same
## as in the above ex's.
###
## anova: 2 samples (same thing we did the t-test on earlier)
> aov(nembr~cohort,data=gravid.data)
Call:
   aov(formula = nembr ~ cohort, data = gravid.data)

Terms:
                  cohort Residuals
Sum of Squares  468.0863 1979.1348
Deg. of Freedom        1        93

Residual standard error: 4.613135
Estimated effects may be unbalanced
> gravid.aov=aov(nembr~cohort,data=gravid.data)
> summary(gravid.aov)
            Df  Sum Sq Mean Sq F value    Pr(>F)
cohort       1  468.09  468.09  21.995 9.353e-06 ***
Residuals   93 1979.13   21.28
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

## Here is anova again, this time with the factor explicitly set up.
## Got the same answers -- again, only because cohort has just two levels,
## so the 1-df numeric slope and the 2-level factor fit the same model.
> attach(gravid.data)
> cohort.f = factor(cohort)   ### with anova, convert any numeric group # to a factor!!
> gravid.aov2=aov(nembr ~ cohort.f)
> summary(gravid.aov2)
            Df  Sum Sq Mean Sq F value    Pr(>F)
cohort.f     1  468.09  468.09  21.995 9.353e-06 ***
Residuals   93 1979.13   21.28
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
> detach(gravid.data)

### Kruskal-Wallis on the gravid data
## (non-parametric counterpart to ANOVA)
> kruskal.test(nembr~cohort,data=gravid.data)

	Kruskal-Wallis rank sum test

data:  nembr by cohort
Kruskal-Wallis chi-squared = 23.1684, df = 1, p-value = 1.484e-06

#####
### ANOVA "contrast" example on more than 2 treatments, working with fluoride data (6 treatments).
### Note: ANOVA is implemented in part with lm, but don't confuse the two, and don't
### get sloppy by interchanging them when you shouldn't.
## read it in, set up the factor, "attach" for convenience of typing
> fluoride.data=read.table("/Users/burnett/MMB/MiscResearch/Empirical-S+howto/Example-st515Assigns/fluoride.txt",header=TRUE)
> names(fluoride.data)
[1] "tooth"  "cement" "uptake"
> attach(fluoride.data)
> cement.f=factor(cement)
> boxplot(uptake ~ cement.f)   ## looks like cement 5 is different
## So, we need to set up the anova so that cement 5 is treated as the baseline.
## We'll do that by forcing cement 5's "treatment #" to be the "first" (here, force it to be 0).
## (Note: relevel(cement.f, ref="5") does this re-baselining more directly.)
> cement
 [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4
[39] 4 4 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6
> cement2=(cement==5)
> cement3=(cement*(1-cement2))
> cement3
 [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4
[39] 4 4 0 0 0 0 0 0 0 0 0 0 6 6 6 6 6 6 6 6 6 6
> cement.f=factor(cement3)   ## and of course, it needs to be a factor, not a number.
## do the actual ANOVA "contrast"
> fluoride.aov=aov(uptake ~ cement.f)
> summary(fluoride.aov)
            Df   Sum Sq  Mean Sq F value    Pr(>F)
cement.f     5 0.091641 0.018328    7.04 3.954e-05 ***
Residuals   54 0.140586 0.002603
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## ok, we got the right DF, and it's signif, but which ones are actually different from T5?
## lm's output on this data tells us -- it's not REALLY an lm here (we're doing it on a factor),
## and it tells us what we wanted to know without any extra tests!
> summary(lm(uptake ~ cement.f))

Call:
lm(formula = uptake ~ cement.f)

Residuals:
      Min        1Q    Median        3Q       Max
-0.107820 -0.026820 -0.001145  0.024877  0.148660

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.06134    0.01614   3.802 0.000367 ***
cement.f1   -0.11985    0.02282  -5.252 2.61e-06 ***
cement.f2   -0.10352    0.02282  -4.537 3.22e-05 ***
cement.f3   -0.09411    0.02282  -4.124 0.000129 ***
cement.f4   -0.08869    0.02282  -3.887 0.000280 ***
cement.f6   -0.10190    0.02282  -4.466 4.11e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.05102 on 54 degrees of freedom
Multiple R-squared: 0.3946, Adjusted R-squared: 0.3386
F-statistic: 7.04 on 5 and 54 DF,  p-value: 3.954e-05

## clean up by detaching, to avoid confusion with later data:
> detach(fluoride.data)

####### ANOVA for within-subject designs (also works on Latin squares)
### Demo #1 (for a mixed between AND within design)
# The between-groups test indicates that the variable group is significant;
# consequently, in the graph we see that the lines for the two groups are rather far apart.
# The within-subject test indicates that there is not a significant time effect;
# in other words, the groups do not change in pulse over time.
# In the graph:
# the groups have lines that are flat,
# i.e. the slopes of the lines are approximately zero. Also, since the lines are parallel,
# we are not surprised that the interaction between time and group is not significant.
# Do these things:
demo1<-read.table("demo1.csv", header=T, sep=",")
attach(demo1)
par(cex=.6)
interaction.plot(time, factor(group), pulse, ylim=c(5, 20), lty=c(1, 12),
   ylab="mean of pulse", xlab="time", lwd=3, trace.label="group")
demo1.aov <- aov(pulse ~ factor(group)*factor(time) + Error(factor(id)))
summary(demo1.aov)
detach(demo1)

# You get this:
Error: factor(id)
              Df  Sum Sq Mean Sq F value    Pr(>F)
factor(group)  1 155.042 155.042    3721 1.305e-09 ***
Residuals      6   0.250   0.042

Error: Within
                           Df  Sum Sq Mean Sq F value Pr(>F)
factor(time)                2 0.08333 0.04167       1 0.3966
factor(group):factor(time)  2 0.08333 0.04167       1 0.3966
Residuals                  12 0.50000 0.04167

### Demo #2 (for a mixed between AND within design)
# The between-groups test indicates that the variable group is not significant
# (consequently, in the graph we see that the lines for the two groups are rather close together).
# The within-subject test indicates that there is a significant time effect;
# in other words, the groups change in pulse over time.
# In the graph:
# the groups have lines that increase over time.
# Again, the lines are parallel, consistent with the finding that the interaction is not significant.
# Do these things:
demo2<-read.table("demo2.csv", header=T, sep=",")
par(cex=.6)
attach(demo2)
interaction.plot(time, factor(group), pulse, ylim=c(10, 40), lty=c(1, 12),
   lwd=3, ylab="mean of pulse", xlab="time", trace.label="group")
demo2.aov <- aov(pulse ~ factor(group)*factor(time) + Error(factor(id)))
summary(demo2.aov)
detach(demo2)

# You get this:
Error: factor(id)
              Df  Sum Sq Mean Sq F value Pr(>F)
factor(group)  1  15.042  15.042  0.8363 0.3957
Residuals      6 107.917  17.986

Error: Within
                           Df Sum Sq Mean Sq F value    Pr(>F)
factor(time)                2 978.25  489.12 53.6845 1.032e-06 ***
factor(group):factor(time)  2   1.08    0.54  0.0595    0.9426
Residuals                  12 109.33    9.11

### Demo #3 (for a mixed between AND within design)
# The between-groups test indicates that the variable group is significant;
# consequently, in the graph we see that the lines for the two groups are rather far apart.
# The within-subject test indicates that there is a significant time effect;
# in other words, the groups do change over time (both groups' pulse drops over time).
# Moreover, the interaction of time and group is significant,
# which means that the groups are changing over time but in different ways;
# so in the graph the lines will not be parallel.
# In the graph:
# the groups have non-parallel lines that decrease over time and
# get progressively closer together over time.
demo3<-read.table("demo3.csv", header=T, sep=",")
par(cex=.6)
attach(demo3)
interaction.plot(time, factor(group), pulse, ylim=c(10, 60), lty=c(1, 12),
   lwd=3, ylab="mean of pulse", xlab="time", trace.label="group")
demo3.aov <- aov(pulse ~ factor(group)*factor(time) + Error(factor(id)))
summary(demo3.aov)
detach(demo3)

# You get this:
Error: factor(id)
              Df  Sum Sq Mean Sq F value    Pr(>F)
factor(group)  1 2035.04 2035.04  343.15 1.596e-06 ***
Residuals      6   35.58    5.93

Error: Within
                           Df  Sum Sq Mean Sq F value    Pr(>F)
factor(time)                2 2830.33 1415.17 553.761 1.517e-12 ***
factor(group):factor(time)  2  200.33  100.17  39.196 5.474e-06 ***
Residuals                  12   30.67    2.56

####### ####### ####### ####### ####### ####### ####### ####### #######
## Ignore ignore ignore ignore...
## Ignore the following example, it's obsolete -- I think they've made it too complicated,
## and also I think it has a couple of errors in it.
> data.ex3=read.table("data/R.appendix3.data.txt",header=T)   # within-subj data, arranged 1 outcome/line
> data.ex3   # show the data
   Observation Subject Valence Recall
1            1     Jim     Neg     32
2            2     Jim     Neu     15
3            3     Jim     Pos     45
4            4  Victor     Neg     30
5            5  Victor     Neu     13
6            6  Victor     Pos     40
7            7    Faye     Neg     26
8            8    Faye     Neu     12
9            9    Faye     Pos     42
10          10     Ron     Neg     22
11          11     Ron     Neu     10
12          12     Ron     Pos     38
13          13   Jason     Neg     29
14          14   Jason     Neu      8
15          15   Jason     Pos     35
## Because Valence is crossed with the random factor Subject (i.e., every subject sees all
## three types of words), you must specify the error term for Valence, which in this case
## is Subject by Valence. Do this by adding the term Error(Subject/Valence) to the factor
## Valence, as shown:
> aov.ex3 = aov(Recall~Valence+Error(Subject/Valence),data.ex3)
> summary(aov.ex3)

Error: Subject
          Df  Sum Sq Mean Sq F value Pr(>F)
Residuals  4 105.067  26.267

Error: Subject:Valence
          Df  Sum Sq Mean Sq F value    Pr(>F)
Valence    2 2029.73 1014.87  189.11 1.841e-07 ***
Residuals  8   42.93    5.37
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Q: What if we hadn't used the Error term?
## A: We'd get a slightly different answer (so, we'd be cheating a little).
> aov.ex3.whatif = aov(Recall~Valence,data.ex3)
> summary(aov.ex3.whatif)
            Df  Sum Sq Mean Sq F value    Pr(>F)
Valence      2 2029.73 1014.87  82.287 9.852e-08 ***
Residuals   12  148.00   12.33
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> summary(lm(Recall~Valence+Error(Subject/Valence),data.ex3))
Error in eval(expr, envir, enclos) : could not find function "Error"
> summary(lm(Recall~Valence,data.ex3))

Call:
lm(formula = Recall ~ Valence, data = data.ex3)

Residuals:
   Min     1Q Median     3Q    Max
  -5.8   -1.9    0.4    2.1    5.0

####### ####### ####### ####### ####### #######
## End of Ignore ignore ignore ignore...

#####
##### Linear regression
##### For cause/effect relationships between 2 numeric variables
> gravid.data=read.table("data/gravid.txt",header=T)
> attach(gravid.data)
> names(gravid.data)
[1] "cohort" "size"   "nembr"
> summary(lm(size~nembr))

Call:
lm(formula = size ~ nembr)

Residuals:
      Min        1Q    Median        3Q       Max
-0.561610 -0.256746  0.008663  0.209201  0.780011

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.625945   0.134386  26.982  < 2e-16 ***
nembr       0.041621   0.006182   6.732 1.36e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3058 on 93 degrees of freedom
Multiple R-squared: 0.3277, Adjusted R-squared: 0.3204
F-statistic: 45.33 on 1 and 93 DF,  p-value: 1.356e-09

#####
##### Correlation
##### For relationships between 2 numeric variables, in which you make no claim about
##### which is cause and which is effect.
> cor.test(nembr,size)

	Pearson's product-moment correlation

data:  nembr and size
t = 6.7325, df = 93, p-value = 1.356e-09
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4192547 0.6939154
sample estimates:
      cor
0.5724287   # This is the correlation, i.e. the "R" to report.
            # (Note: 0.5724^2 = 0.3277, the "R2" above.)
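## Quick sanity check (a sketch with made-up data, not the gravid data): for a
## simple linear regression, the model's R-squared is exactly the squared
## Pearson correlation, which is why 0.5724^2 matches the 0.3277 above.

```r
# Hypothetical data, chosen only to illustrate the identity R^2 == r^2:
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)

r  <- cor(x, y)                       # Pearson correlation
R2 <- summary(lm(y ~ x))$r.squared    # R-squared from the regression

all.equal(r^2, R2)                    # TRUE
```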
## non-parametric variant
> cor.test(nembr,size,method="spearman")

	Spearman's rank correlation rho

data:  nembr and size
S = 64428.19, p-value = 8.339e-09
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho
0.5490748   # With Spearman's, you report "rho" instead of "R".

Warning message:
In cor.test.default(nembr, size, method = "spearman") :
  Cannot compute exact p-values with ties
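## A related note (a sketch with made-up numbers, not the gravid data):
## Spearman's rho is just Pearson's r computed on the ranks of the data,
## which is also why ties get in the way of an exact p-value. To silence the
## warning, request the normal approximation explicitly:
## cor.test(x, y, method="spearman", exact=FALSE)

```r
# Hypothetical values, chosen only to illustrate the rank identity:
x <- c(5260, 5470, 5640, 6180, 6390, 6515)
y <- c(15, 13, 20, 18, 22, 25)

rho <- cor(x, y, method = "spearman")
rho.by.hand <- cor(rank(x), rank(y))   # Pearson r of the ranks

all.equal(rho, rho.by.hand)            # TRUE
```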