LIS 4317 Data Visualization: Distribution Analysis

This week's topic was distribution analysis. I made the above figure from the iris dataset in R using ggplot2. You can find the code I used to make the plot on GitHub; the link is at the bottom of this post.

The dataset is the famous iris dataset that ships with base R (in the datasets package). The data were collected by Edgar Anderson in 1935. The measurements are the length and width, in centimeters, of the petals and sepals of iris flowers, with 50 flowers from each of 3 species: setosa, versicolor, and virginica.

My goal in creating this visual was to plot the distribution of the lengths and widths of the sepals and petals together, and then to differentiate between the species. The first step was to get the data into a form that would be easier to plot. The original data set has five variables: Sepal Length, Sepal Width, Petal Length, Petal Width, and Species. Each flower has a single record in the dataset, so the table is fairly compact.

Sepal Length	Sepal Width	Petal Length	Petal Width	Species
5.1	3.5	1.4	0.2	Setosa
4.9	3.0	1.4	0.2	Setosa

Table 1: The original data set.
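
This structure can be checked directly against the built-in data. The sketch below uses only base R. Note that the real column names use dots (Sepal.Length and so on) and the species labels are lowercase, whereas Table 1 shows them in display form.

```r
# Inspect the built-in iris data (base R, datasets package)
data(iris)

dims <- dim(iris)                     # number of rows and columns
per_species <- table(iris$Species)    # number of flowers per species

print(dims)
print(per_species)
head(iris, 2)                         # the two rows shown in Table 1
```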

However, to plot this effectively with ggplot2, I need a different set of four variables: Species, Type, Length, and Width. This new table doubles the number of records so that each flower has one record for its sepal and one for its petal. The new variable, Type, distinguishes between the sepal and the petal of each flower.

Species	Type	Length	Width
Setosa	Sepal	5.1	3.5
Setosa	Petal	1.4	0.2

Table 2: The dataset after being lengthened.

The code to perform this reshaping is a fairly simple single call to pivot_longer() from the tidyr package.

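# Reshape to four columns: Species, Type (Sepal or Petal), and the length and width values
library(tidyr)
my_iris <- iris  # assumption: my_iris is a copy of the built-in iris data

new_iris <- pivot_longer(my_iris, c(Sepal.Length, Petal.Length, Sepal.Width, Petal.Width),
                         names_to = c("Type", ".value", ".value"),
                         names_pattern = "(.....)(.*)(.*)")
# The value columns keep the leading dot (.Length and .Width), as used in the plot code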
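
As an alternative sketch, not the code used for the figure, the same reshape can be written with a single .value placeholder and a pattern that splits each column name at the dot. This assumes the tidyr package is installed, and long_iris is a name chosen here for illustration. The value columns then come out as Length and Width, without a leading dot, so any plotting code built on this version would have to use those names.

```r
library(tidyr)

# Split names such as "Sepal.Length" into Type ("Sepal") and a value column ("Length")
long_iris <- pivot_longer(
  iris,
  cols = -Species,
  names_to = c("Type", ".value"),
  names_pattern = "(.*)\\.(.*)"
)

head(long_iris, 2)   # one Sepal row and one Petal row for the first flower
```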

With this new data frame I can now plot it using ggplot2. This is the code to do it:

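# Make the plot: length on x, width on y, one panel per Type and Species
library(ggplot2)
ggplot(new_iris, aes(.Length, .Width, color = Species)) +
  geom_point(position = "jitter") +      # jitter to reduce overplotting
  facet_grid(Type ~ Species) +           # rows: Sepal or Petal, columns: species
  theme(legend.position = "none") +      # the facet labels already name the species
  xlab("Length in cm") +
  ylab("Width in cm")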

I set the x axis to length and the y axis to width. Color is mapped to species so that the species are easier to tell apart in the figure. The plot is faceted into columns by species and into rows by Sepal or Petal. The data does suffer from overplotting, so jitter was turned on to spread out points that share the same coordinates. While I could have given each facet independent axis scales, I chose not to, so that the shapes and locations of the distributions are easier to compare. For example, I can modify the code above slightly to make the scales differ between facets, and the plot would look like this.

Figure 2: The same plot, except the axes for each facet are independent.
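
One way to get independent axes, sketched here as an assumption since the post does not show the exact change, is the scales argument of the faceting function. The example rebuilds the long table so that it runs on its own (long_iris and p_free are illustrative names) and uses facet_wrap() with scales = "free", which lets every panel pick its own x and y range. With facet_grid(), scales = "free" only frees the scales per row and per column rather than per panel.

```r
library(tidyr)
library(ggplot2)

# Long table with plain Length and Width columns (illustrative reshape)
long_iris <- pivot_longer(iris, cols = -Species,
                          names_to = c("Type", ".value"),
                          names_pattern = "(.*)\\.(.*)")

# Same scatter plot, but each of the six panels gets its own axis ranges
p_free <- ggplot(long_iris, aes(Length, Width, color = Species)) +
  geom_point(position = "jitter") +
  facet_wrap(vars(Type, Species), ncol = 3, scales = "free") +
  theme(legend.position = "none") +
  xlab("Length in cm") +
  ylab("Width in cm")

print(p_free)
```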

Regarding Few’s Recommendations

Few’s recommendations are to

Keep the interval consistent
Select the best interval
Use measures that are resilient to outliers.

In my opinion, Few’s recommendations are good rules of thumb for when working with distributions such as this.
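
To illustrate the third recommendation, the base R sketch below compares how far the mean and the median move for the setosa sepal lengths when one artificial extreme value is appended. The value 50 is made up purely for the demonstration and is not part of the iris data.

```r
# Sepal lengths for one species
x <- iris$Sepal.Length[iris$Species == "setosa"]

# Append a single made-up outlier (hypothetical value, for illustration only)
x_out <- c(x, 50)

# How far each summary moves when the outlier is added
mean_shift   <- mean(x_out) - mean(x)
median_shift <- median(x_out) - median(x)

cat("Shift in mean:  ", mean_shift, "\n")
cat("Shift in median:", median_shift, "\n")
```

The mean is pulled toward the extreme value while the median hardly moves, which is the sense in which the median is the more resilient measure.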

The iris dataset is one of those examples of clean data that does not have many issues. I have worked with other data sets in the past that had many outliers. For example, in my final project for Advanced Statistics, which can be found here, I discussed earthquake data. In the first plot I made comparing magnitude and error, there appeared to be a negative correlation; after removing the outliers, however, it became apparent that there was no relationship between the two variables.

One of the issues I came across when making the plots in this post was the interval, or in the case of the continuous iris data, the best scale. In the end I chose to keep the scale the same across all six facets, because otherwise it became harder to compare the locations of the distributions. If the finer detail within each distribution were important for the analysis, I would use the independent scales instead. This demonstrates the importance of exploring the data, and how a single plot will not tell the full story. Sometimes you need two.

Relevant Links

GitHub: https://github.com/SimonLiles/LIS4317DataVisualization/blob/main/LIS4317Mod7.R