LIS 4273 Adv Statistics: Module 10 ANOVA Testing

This week focused on Analysis of Variance (ANOVA), a test for comparing the means of two or more groups. Running a single ANOVA instead of several pairwise t-tests helps keep the Type I error rate from inflating.

To perform the analysis in this post I wrote an R script to speed up the process. The code is included below, and the full script is on my GitHub at the link at the bottom of the page.

The problem posed this week is as follows:

A researcher is interested in the effects of drug against stress reaction. She gives a reaction time test to three different groups of subjects: one group that is under a great deal of stress, one group under a moderate amount of stress, and a third group that is under almost no stress. The subjects of the study were instructed to take the drug test during their next stress episode and to report their stress on a scale of 1 to 10 (10 being most pain).

From the Assignment Page

The data is given in the following table.

High Stress    Moderate Stress    Low Stress
10             8                  4
9              10                 6
8              6                  6
9              7                  4
10             8                  2
8              8                  2
Table 1: Given data from the assignment page. Subjects in each group reported their stress levels after taking a certain drug.

Setting up the sample data in R

The first step is to load the data into the R environment. For this I split the data into a numeric vector of stress scores and a vector of group labels. I make the group vector a factor so that the ANOVA function recognizes the three categories.

# Set up sample data
stress_levels <- c(10, 9, 8, 9, 10, 8, 8, 10, 6, 7, 8, 8, 4, 6, 6, 4, 2, 2)
group <- factor(c("high", "high", "high", "high", "high", "high", "moderate", "moderate", 
           "moderate", "moderate", "moderate", "moderate", "low", "low", "low", 
           "low", "low", "low"))
my_data <- data.frame(group, stress_levels)

The final data frame ends up looking like this:

Group       Stress Level
High        10
High        9
High        8
High        9
High        10
High        8
Moderate    8
Moderate    10
Moderate    6
Moderate    7
Moderate    8
Moderate    8
Low         4
Low         6
Low         6
Low         4
Low         2
Low         2
Table 2: Data inside of the data frame
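
Before running the test it can help to confirm that the data frame has the intended shape and to look at the group means. The following is a minimal sketch, assuming the same values as above; the rep() call is only a shorthand for typing out the group labels, and group_means is a name introduced here for illustration.

```r
# Rebuild the data frame from the post (rep() is shorthand for the label vector)
stress_levels <- c(10, 9, 8, 9, 10, 8, 8, 10, 6, 7, 8, 8, 4, 6, 6, 4, 2, 2)
group <- factor(rep(c("high", "moderate", "low"), each = 6))
my_data <- data.frame(group, stress_levels)

# Confirm that group is a factor with three levels of six rows each
str(my_data)
print(table(my_data$group))

# Mean reported stress level per group
group_means <- tapply(my_data$stress_levels, my_data$group, mean)
print(round(group_means, 2))
```

Note that R orders factor levels alphabetically by default, so the groups print as high, low, moderate.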

Analysis using ANOVA

Now that the data is in a workable format, the ANOVA test can be performed with a single line of code. I then use summary() to extract the information I want from the test. The first argument passed into aov() is a formula describing what I want to test. Here I am telling R to test how group affects the stress levels reported by the test subjects, so the ANOVA tests for a difference between the means of the three groups.

# Run ANOVA test
ANOVA_test <- aov(stress_levels ~ group, data = my_data)
summary(ANOVA_test)

The ANOVA test summary gives the following output in R.

            Df Sum Sq Mean Sq F value   Pr(>F)    
group        2  82.11   41.06   21.36 4.08e-05 ***
Residuals   15  28.83    1.92                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Here we can get into the meat of ANOVA testing. Performing the test in R is very easy, but there is a lot to unpack from this output.

The two rows in the output are group and Residuals. The group row takes its name from the vector containing the categories and describes the variation between the three group means. The Residuals row describes the variation left over within each group, that is, how far individual scores fall from their own group mean. The Residuals row has no F value or p-value because it is not a term being tested; it serves as the error term against which the group effect is compared.

The group row has 2 degrees of freedom because there are 3 sample groups in the between-group analysis; the formula is simply k - 1, where k is the number of sample groups. The Residuals row has 15 degrees of freedom because there are 3 sample groups of 6 samples each. Within-group degrees of freedom add up across the groups: each group contributes 6 - 1 = 5, giving 15 in total. This is the same as N - k, where N is the total number of samples and k is the number of sample groups.
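
The two formulas can be checked directly in R. This is a small sketch using the same data; df_between and df_within are names introduced here for illustration and are not part of the original script.

```r
# Same data as in the post, with the labels built by rep()
stress_levels <- c(10, 9, 8, 9, 10, 8, 8, 10, 6, 7, 8, 8, 4, 6, 6, 4, 2, 2)
group <- factor(rep(c("high", "moderate", "low"), each = 6))

k <- nlevels(group)          # number of sample groups
N <- length(stress_levels)   # total number of samples

df_between <- k - 1   # degrees of freedom for the group row
df_within  <- N - k   # degrees of freedom for the Residuals row
print(c(between = df_between, within = df_within))
```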

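Both rows also report a Sum of Squares, which measures the amount of variation attributed to that row; the higher the number, the more variation. Here the variation between groups (82.11) is almost three times the variation left within groups (28.83), so group membership accounts for most of the total variation in the reported scores. Additional explanatory factors might account for some of the remaining residual variation.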
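
To see where the two Sum of Squares values come from, they can be rebuilt by hand from the group means and the grand mean. The sketch below assumes the same data as above; ss_between and ss_within are illustrative names, and ave() is used to give each observation its own group mean.

```r
# Same data as in the post, with the labels built by rep()
stress_levels <- c(10, 9, 8, 9, 10, 8, 8, 10, 6, 7, 8, 8, 4, 6, 6, 4, 2, 2)
group <- factor(rep(c("high", "moderate", "low"), each = 6))

grand_mean  <- mean(stress_levels)
group_means <- tapply(stress_levels, group, mean)
group_sizes <- tapply(stress_levels, group, length)

# Between groups: squared distance of each group mean from the grand mean,
# weighted by the number of samples in the group
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)

# Within groups: squared distance of each score from its own group mean
ss_within <- sum((stress_levels - ave(stress_levels, group))^2)

print(round(c(between = ss_between, within = ss_within), 2))
```

The two values should match the Sum Sq column of the summary() output, and together they add up to the total variation of the scores around the grand mean.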

The Mean Square is the Sum of Squares divided by its degrees of freedom, which puts the two rows on a comparable scale. The group row Mean Square (41.06) measures the variation between sample groups, and here it shows a large difference between the means of the sample groups. The Residuals row Mean Square (1.92) shows much smaller variation within the sample groups.
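
As a quick check, the Mean Sq column can be reproduced by dividing the Sum Sq column by the Df column. This is a sketch that pulls the table out of the summary() result; anova_table and ms_manual are names introduced here for illustration.

```r
# Same data as in the post, with the labels built by rep()
stress_levels <- c(10, 9, 8, 9, 10, 8, 8, 10, 6, 7, 8, 8, 4, 6, 6, 4, 2, 2)
group <- factor(rep(c("high", "moderate", "low"), each = 6))

# summary() of an aov fit returns a list; its first element is the ANOVA table
anova_table <- summary(aov(stress_levels ~ group))[[1]]

# Mean Square = Sum of Squares / degrees of freedom, one value per row
ms_manual <- anova_table[["Sum Sq"]] / anova_table[["Df"]]
print(round(ms_manual, 2))
print(round(anova_table[["Mean Sq"]], 2))
```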

The F value and p-value columns are only calculated for the group row. The F value is the ratio of the group Mean Square to the Residuals Mean Square, so it compares the variation between groups with the variation within groups. A value close to 1 would indicate no difference between the group means beyond what chance would produce. The value of 21.36 here indicates that the group means differ far more than the scatter within the groups would explain.
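
The F value can be recomputed from the two Mean Square values. For a sense of scale, the sketch below also looks up the 5% critical value of the F distribution with 2 and 15 degrees of freedom using qf(); that comparison is an addition for illustration and was not part of the original script, and f_manual and f_crit are illustrative names.

```r
# Same data as in the post, with the labels built by rep()
stress_levels <- c(10, 9, 8, 9, 10, 8, 8, 10, 6, 7, 8, 8, 4, 6, 6, 4, 2, 2)
group <- factor(rep(c("high", "moderate", "low"), each = 6))

anova_table <- summary(aov(stress_levels ~ group))[[1]]
mean_sq <- anova_table[["Mean Sq"]]

# F = between-group Mean Square / within-group (Residuals) Mean Square
f_manual <- mean_sq[1] / mean_sq[2]

# Value that F would need to exceed for significance at the 0.05 level
f_crit <- qf(0.95, df1 = 2, df2 = 15)

print(round(c(F = f_manual, critical = f_crit), 2))
```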

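The p-value from ANOVA is read the same way as in other tests. At the 0.05 significance level (95% confidence) we need a p-value less than 0.05. In this test the p-value is 4.08e-05, well below 0.001, so the difference between the group means is statistically significant. ANOVA by itself only shows that at least one group mean differs from the others. The group means (9 for high, about 7.8 for moderate, and 4 for low stress) show the direction: the members in lower stress situations rate lower on the scale for their stress.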
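
The p-value in the table is the upper-tail probability of the F distribution at the observed F value. A minimal sketch of that calculation with pf() follows, assuming the same data; p_manual is a name introduced here for illustration.

```r
# Same data as in the post, with the labels built by rep()
stress_levels <- c(10, 9, 8, 9, 10, 8, 8, 10, 6, 7, 8, 8, 4, 6, 6, 4, 2, 2)
group <- factor(rep(c("high", "moderate", "low"), each = 6))

anova_table <- summary(aov(stress_levels ~ group))[[1]]
f_value <- anova_table[["F value"]][1]

# Probability of an F value at least this large if all group means were equal
p_manual <- pf(f_value,
               df1 = anova_table[["Df"]][1],
               df2 = anova_table[["Df"]][2],
               lower.tail = FALSE)

print(p_manual)
print(anova_table[["Pr(>F)"]][1])
```

Both printed values should agree with the Pr(>F) entry in the summary() output.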

Without a control such as a placebo, it cannot be said whether or not the drug has any effect. From this data it can only be judged that those in lower stress conditions report lower stress levels while taking the drug.

Links:

GitHub: https://github.com/SimonLiles/LIS4273AdvStatistics/blob/master/LIS4273Mod10.R