This week was multivariate analysis. Useful data generally has more than two variables, and to find a useful pattern, many of those variables need to be plotted together in a two-dimensional space. While we could go up to 3D, that is generally not recommended, since occlusion and perspective distortion make values hard to read, and it does not scale as we try to plot data in higher dimensions. Instead we can vary things such as size, color, and shape, and add other annotations to a plot to describe the data.
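As a minimal sketch of that idea, assuming the ggplot2 package is installed, R's built-in iris data can be drawn as a flat scatter plot that still carries four variables: two on the axes, one as color and shape, and one as point size. The particular mappings here are my own illustrative choices, not the only sensible ones.

```r
library(ggplot2)

# x and y carry two variables; color/shape and size carry two more.
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width,
                      color = Species, shape = Species,
                      size = Petal.Length)) +
  geom_point(alpha = 0.6)

print(p)
```

Mapping Species to both color and shape is redundant on purpose, so the groups stay readable even without color.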
To explore multivariate analysis I decided to revisit some data I have worked with before: the famous Iris data set. This time I wanted to experiment with making violin plots. To do so I wrote an R script, which you can find on my GitHub by following the link at the bottom of this post.
At first this plot may seem too simple, as if it could not have multiple dimensions, but it does. A violin plot is like a box plot, except that it also shows the shape of the distribution, allowing you to identify clusters in the data. Adding more violins and grouping variables is how new dimensions are added to this plot. For example, with the Iris data plotted like this, it is easy to see that the petals of each flower are generally smaller than the sepals and that the setosa species is generally smaller than the other two. It is because so many variables are plotted together that new and interesting patterns can be found. The trick is to plot that data in a clear, simple, and concise manner.
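The two patterns mentioned above can also be checked numerically. The following is a rough sketch in base R that compares per-species means; using the mean as the summary is my own assumption, and the plot itself shows full distributions rather than a single number.

```r
# Mean of each measurement per species (base R only, no extra packages).
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(species_means)

# For each species, are the petal means below the matching sepal means?
petals_smaller <- with(species_means,
                       Petal.Length < Sepal.Length & Petal.Width < Sepal.Width)
print(petals_smaller)
```

On the built-in data, setosa should come out smallest on every measurement except sepal width, which is why "generally smaller" is the right way to put it.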
To create the visualization above I used ggplot2; however, I first needed to transform the data so that ggplot would have an easier time plotting so many variables on the x-axis. To do this I combined the four measurement columns into one and made a new name column to identify each measurement. In the end I had 3 columns of data (Species, Name, and value), and each of the 150 flowers had 4 records, for 600 rows in total. The code to do so is as follows:
#Load the tidyr package for pivot_longer() and unite()
library(tidyr)
#Load in data set
my_iris <- iris
#Split each measurement column name into Type (Sepal or Petal) and a
#.Length/.Width value column, giving 4 columns: Species, Type, .Length, .Width
new_iris <- pivot_longer(my_iris, c(Sepal.Length, Petal.Length, Sepal.Width,
                                    Petal.Width),
                         names_to = c("Type", ".value"),
                         names_pattern = "(.....)(.*)")
#Combine length and width measurements into one column
new_iris <- pivot_longer(new_iris, c(.Length, .Width))
#Combine the variable identification columns
new_iris <- unite(new_iris, col = "Name", Type, name, sep = "")
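For comparison, here is a shorter sketch that should arrive at the same three-column shape in a single step, assuming tidyr is installed. It is an alternative I am suggesting, not the code from the script, and `long_iris` is a name I made up for the illustration.

```r
library(tidyr)

# Stack every column except Species into Name/value pairs.
long_iris <- pivot_longer(iris, cols = -Species,
                          names_to = "Name", values_to = "value")

# Each of the 150 flowers contributes 4 rows.
print(dim(long_iris))
print(unique(long_iris$Name))
```

Because the original column names already read as Sepal.Length, Petal.Width, and so on, there is no need to split them apart and unite them again when the only goal is one label per measurement.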
Making the violin and box plots in ggplot2 was fairly simple and only required a bit of tweaking to make the plot aesthetically appealing.
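#Load ggplot2 for plotting
library(ggplot2)
#Make a violin/box plot, one panel per species
ggplot(new_iris, aes(x = Name, y = value, fill = Name)) +
  geom_boxplot(width = 0.15) +
  geom_violin(alpha = 0.5, scale = "width") +
  facet_wrap(~ Species) +
  #Hide the legend and rotate the x-axis labels so they do not overlap
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 90))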
So how do the 5 Principles of Design fit into this? For those who are unfamiliar, the 5 principles are as follows:
- Alignment
- Repetition
- Contrast
- Proximity
- Balance
Upon first inspection these rules seem fairly straightforward; however, by understanding them and using them to their full effect, extraordinary visuals can be made.
With my visualization, aligning the elements is fairly straightforward. All of the violin plots, for example, are parallel to one another and at 90° to the x-axis. This makes the entire visual cleaner and easier to read.
For repetition I repeated the colors for each species of iris. Because each of the four colors is repeated for each species, I create two relationships in the visualization. The first is the three groups of flowers: since no color is repeated within a species, the groups become more obvious. The second relationship is that each flower has the same 4 variables, length and width for the sepals and petals. Repeating the colors this way allows for quick and easy comparisons of the data.
The contrast that I created in the above visualization is the contrast between the data and the background. It seems like a small thing; however, the contrast between foreground and background helps the data pop out and makes it easier to read.
Proximity was used in this visualization by placing measurements of the same species close together. This kind of grouping is logical, and not doing so would only hinder the reader.
Balance was maintained in this visualization by keeping the number and placement of elements even throughout. I chose to forgo a color legend in this plot, despite using different colors, for two reasons. The first had to do with balance: the legend squeezed the plot to one side while leaving large amounts of empty space on the other, which would have made the visual feel lopsided. The second is that the legend would not add anything of value, since the x-axis labels already identify each measurement. Upon removal the visualization looked clearer and gave the data room to breathe.
Overall I am of the opinion that the 5 Principles are good reminders of things to consider when making a visualization. It is too easy to get carried away and make a visualization that communicates a lot of data, but fails to do so effectively.
Links:
GitHub: https://github.com/SimonLiles/LIS4317DataVisualization/blob/main/LIS4317Mod9.R