In this post I walk through data analysis using chi-square tests and data visualization techniques. The R code I use is available on my GitHub here, or via the link at the bottom.
The following table is real data collected from a hotel guest satisfaction study. The hotels are located in Saint Petersburg, Florida.
| Choose the same hotel again? | Beachcomber | Windsurfer | Total |
|---|---|---|---|
| Yes | 163 | 154 | 317 |
| No | 64 | 108 | 172 |
| Total | 227 | 262 | 489 |
The first step is to perform a chi-square test, which analyzes categorical data to determine whether the observed counts differ significantly from the expected (baseline) values.
After setting the data up in R, performing the test is straightforward. When I load the data in this script, I omit the totals from Table 1, since they would only get in the way of the chi-square test. Yes, the way the data is kept in the data frame is a little unusual, but it works best for the data visualization I do later. Normally I would set the response types as row names rather than as a separate column.
# Raw data in contingency table
beachcomber <- c(163, 64)
windsurfer <- c(154, 108)
response <- c("yes", "no")
# Put data into a data frame
hotel_survey <- data.frame(response, beachcomber, windsurfer)
# Perform Chi Square Test
chisq.test(hotel_survey[,2:3], correct = TRUE)
The chi-square test in R then gives the following output.
Pearson's Chi-squared test with Yates' continuity correction
data: hotel_survey[, 2:3]
X-squared = 8.4903, df = 1, p-value = 0.00357
Now we can break down this output. X-squared is the test statistic the chi-square test produces. Like other test statistics, it does not mean much on its own, but it does indicate the size of the departure from the expected values: larger chi-square values indicate larger differences.
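To make the statistic concrete, here is a minimal sketch of the by-hand calculation using the counts from Table 1 (the object names here are mine, not from the script above):

```r
# Observed counts from Table 1 (totals omitted)
observed <- matrix(c(163, 64, 154, 108), nrow = 2,
                   dimnames = list(response = c("yes", "no"),
                                   hotel = c("beachcomber", "windsurfer")))
# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
# Yates' continuity correction subtracts 0.5 from each |observed - expected|
x_squared <- sum((abs(observed - expected) - 0.5)^2 / expected)
x_squared  # ~8.4903, matching the chisq.test() output
```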
df is the degrees of freedom for the test, computed as (rows - 1) x (columns - 1). Because the table is 2 x 2, there is only 1 degree of freedom.
The p-value is a direct measure of the statistical significance of the test statistic. One can use either the p-value or the chi-square test statistic (compared against a critical value) to judge significance.
From this output, we can draw the following conclusions. With 1 degree of freedom and an alpha of 0.05, the critical chi-square value is 3.841. Since the test statistic is larger than that, we reject the null hypothesis that the two categories have no difference. Thus, we can conclude that Beachcomber guests do not have the same satisfaction as Windsurfer guests. Equivalently, the p-value of 0.00357 is smaller than 0.05, so the result is statistically significant.
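Both numbers can also be pulled directly in R rather than looked up in a table. A quick sketch (re-creating the contingency table here so the snippet runs on its own):

```r
# Critical value for alpha = 0.05 with 1 degree of freedom
qchisq(0.95, df = 1)  # ~3.841

# Re-run the test and extract the statistic and p-value from the result object
beachcomber <- c(163, 64)
windsurfer <- c(154, 108)
test <- chisq.test(data.frame(beachcomber, windsurfer), correct = TRUE)
test$statistic  # X-squared = 8.4903
test$p.value    # ~0.00357
```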
The next step is to visualize the data. Sometimes the best way to do some initial analysis on data is to visualize it. Also, it looks cool. Here I use ggplot2 to create a bar chart of the data in Table 1 above. ggplot() and geom_bar() are very picky about how the data is fed into them, so you end up having to do transformations such as pivot_longer() (from tidyr) to lengthen the data frame so that the vectors fit.
# Data visualization
library(tidyr)
library(ggplot2)
hotel_columns <- pivot_longer(hotel_survey, cols = c(beachcomber, windsurfer))
ggplot(hotel_columns, aes(x = response)) +
  geom_bar(aes(weight = value), fill = "darkblue") +
  facet_wrap(~name)
Once the code runs successfully, you get a bar chart like this.
Relevant links:
Github: https://github.com/SimonLiles/LIS4273AdvStatistics/blob/master/LIS4273Mod11.R