LIS 4370 R Programming: Visualization in R

One of the best parts of R is the ease with which visualizations of data can be created. Beyond the base library there are many packages that make visualization a powerful tool, such as lattice and ggplot2. For a while now I have been learning ggplot2 and I have become quite accustomed to it. While ggplot2 is a powerful visualization package, and is very good at multivariate visualizations, it is not the be-all and end-all of visualization. For example, ggplot2 does not have standard 3D plotting functionality, and creating multiple plots on a single page is not necessarily straightforward. That is why this week I am going to try to broaden my capabilities by using the lattice package.

While ggplot2 was built on the idea of the Grammar of Graphics, lattice was not. Instead, you specify the kind of plot you want with a function, then you give it a formula and your data. With very simple bivariate plots, lattice is like an uglier cousin of ggplot2: its scatter plots are not quite as aesthetically appealing, nor are there built-in theme options like in ggplot2. However, since it uses a formula to generate the plot, it is very easy to specify the x and y axes in a simple formula, create multiple facets, and then put facets in facets. It may be hard to understand through explanation alone, so I will demonstrate the lattice package in action. To demonstrate lattice and visualization in R, I wrote an R script, which you can find on my GitHub here, or follow the link at the bottom of this post.
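To make the formula idea concrete, here is a minimal sketch of the three formula shapes described above. It assumes the built-in mtcars data frame as a stand-in (it is not the data used in this post), and the object names are only for illustration.

```r
library(lattice)

# Stand-in data (assumption): mtcars ships with R and is not the data used in this post
cars <- transform(mtcars,
                  cyl = factor(cyl),
                  am  = factor(am, labels = c("automatic", "manual")))

# y ~ x: a plain scatter plot
p_simple <- xyplot(mpg ~ wt, data = cars)

# y ~ x | g: one facet per level of g
p_facets <- xyplot(mpg ~ wt | cyl, data = cars)

# y ~ x | g1 * g2: one facet for every combination of g1 and g2
p_nested <- xyplot(mpg ~ wt | cyl * am, data = cars)

# lattice plots are objects: typing p_nested at the console,
# or calling print(p_nested), is what actually draws the plot
```

Only the formula changes between the three calls, which is the point: the function name picks the kind of plot, and the formula decides the axes and the facets.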

First I need to get some data. Fortunately there is a ton of prepared data to work with in various R packages. I went with the Highway Accidents data set that comes with the carData package. The data describes sections of highways in the United States along with the rate of accidents per million vehicle miles in 1973. These variables are going to be used a lot in the following plots. I could have given them longer, more descriptive names; however, several of those names would be very long and take up too much space, so I stuck with the original abbreviations.

Variable	Description
rate	1973 accident rate per million vehicle miles
len	Length of the highway segment in miles
adt	Average Daily Traffic Count in Thousands
trks	Truck volume as a percent of the total volume
sigs1	The number of signals per mile of roadway
slim	Speed limit in 1973
shld	Width in feet of the outer shoulder of the roadway
lane	Total number of lanes of traffic
acpt	Number of access points per mile
itg	Number of freeway interchanges per mile
lwid	Lane width in feet
htype	Indicator of type of roadway or source of funding

Table 1: Variables in the Highway1 data frame

Loading the data is very simple. Here is the code I used to do it:

library(carData)

#Import data
highway_data <- Highway1

This is a lot of variables, and I am not quite certain yet how they interact with each other. With data such as this, I would be curious about the correlation between each pair of variables. A visualization that would help with that is a correlogram. I have made correlograms before (you can see them in my post on it here); however, I have not done one in lattice yet. Unfortunately lattice, much like ggplot2, does not have a standard correlogram function, so I will need to improvise. The lattice package does have a function called levelplot(), which creates a tiled plot with different colors. To create the plot I first need to create a correlation matrix, then, with the reshape2 package, melt it into three columns so that it will plot easily. The code is as follows:

library(reshape2)

#Prepare the data: remove the htype column (column 12), which is not numeric
#   and is therefore incompatible with the cor() function
cor_hwy <- cor(highway_data[,-12])
melted_cor_hwy <- melt(cor_hwy)

To create the correlation matrix I use cor(). This function creates a matrix with all the variables listed on both axes, then calculates the correlation coefficient for each pair of variables. In this example I also remove the twelfth column, htype, before generating the matrix, because it is not numeric and cor() cannot calculate correlation coefficients for it.
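Dropping a column by position works, but it depends on the column order never changing. As a hedged alternative, the sketch below picks the numeric columns by type instead. It assumes the built-in iris data frame as a stand-in (four numeric columns and one factor column), and the variable names are only for illustration.

```r
# Stand-in data (assumption): iris ships with R and is not the data used in this post.
# It has four numeric columns and one factor column, Species.
numeric_cols <- vapply(iris, is.numeric, logical(1))

# Keep only the numeric columns, whatever their positions, then correlate them
cor_iris <- cor(iris[, numeric_cols])

# The result is a square matrix with one row and one column per numeric variable
dim(cor_iris)
```

The same two lines should carry over to any data frame that mixes numeric and non-numeric columns.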

I then use the melt() function from the reshape2 package to create a data frame with three columns from the matrix. Two of the columns (Var1 and Var2) identify each cell of the matrix, and the last column (value) holds the value at that cell. Now I have the data prepared to be plotted.
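To see what that long format looks like without installing anything, here is a rough base R analogue of the melt() step. This is a sketch under assumptions: it uses three columns of the built-in mtcars data frame as a stand-in, and as.table() plus as.data.frame() in place of reshape2.

```r
# Stand-in data (assumption): three columns of the built-in mtcars data frame
cor_cars <- cor(mtcars[, c("mpg", "wt", "hp")])

# as.table() followed by as.data.frame() gives one row per cell of the matrix;
# responseName sets the name of the column that holds the cell values
long_cor <- as.data.frame(as.table(cor_cars), responseName = "value")

# A 3 x 3 matrix becomes 9 rows with the columns Var1, Var2 and value
head(long_cor, 3)
```

Each row names a pair of variables and carries their correlation coefficient, which is the shape the plotting formula needs.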

As I mentioned before, to create a correlogram with the lattice package I need the levelplot() function. This function only really needs two arguments to do its job: one specifies the data, and the other is a formula that describes how to draw the plot. For the levelplot() function the formula breaks down as follows: the variables on either side of the * mark specify the axes, and the variable on the left side of the ~ specifies how to color the tiles. These formulas can be quite complicated, and it takes practice and some cheat sheets to get them right. The other arguments I pass are for making the plot more aesthetically pleasing. The at argument sets the break points between the different color levels. The function will normally set the breaks for you; however, for the correlogram I wanted to set them myself so that the plot would be clearer. The argument that follows it, pretty = TRUE, asks lattice to pick tidy break locations when it chooses them itself; since I already supply at, I leave it in only so that the code is more explicit. The final argument I pass is col.regions, which I set to the RColorBrewer diverging red and blue palette. You can set it to whatever color palette you want; I went with this one because these are the traditional correlogram colors.

library(lattice)
library(RColorBrewer)

#Correlogram
levelplot(value ~ Var1 * Var2, melted_cor_hwy, 
          at = c(-1, -0.8, -0.6, -0.4, -0.2, 0, 0.2, 0.4, 0.6, 0.8, 1),
          pretty = TRUE, 
          col.regions = brewer.pal(11, "RdBu"))

After this code runs you will get the following plot.

Figure 1: Correlogram of Highway Accident Data
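The break points and colors do not have to be typed out by hand. As a sketch under assumptions, the version below generates the breaks with seq() and builds a red-white-blue palette with base R's colorRampPalette() as a swapped-in alternative to RColorBrewer. It uses three columns of the built-in mtcars data frame as a stand-in, and the object names are only for illustration.

```r
library(lattice)

# Stand-in data (assumption): correlations of three mtcars columns in long form
cor_cars <- cor(mtcars[, c("mpg", "wt", "hp")])
long_cor <- as.data.frame(as.table(cor_cars), responseName = "value")

# Eleven break points from -1 to 1 in steps of 0.2 define ten color bands
breaks <- seq(-1, 1, by = 0.2)

# colorRampPalette() comes with base R (grDevices); ask for one color per band
band_colors <- colorRampPalette(c("red", "white", "blue"))(length(breaks) - 1)

# Same formula shape as before: value colors the tiles, Var1 and Var2 are the axes
p_cor <- levelplot(value ~ Var1 * Var2, data = long_cor,
                   at = breaks, col.regions = band_colors)

# print(p_cor) draws the plot
```

Matching the number of colors to the number of bands is a design choice here, not a requirement; levelplot() also accepts a longer palette and picks colors from it.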

From this correlogram some interesting patterns become obvious. For example, the number of lanes and the number of highway interchanges both have a strong positive correlation with the average daily traffic count. That is, the more lanes and interchanges a roadway has, the more traffic there will usually be on it. To anyone who has done any highway driving, this would be expected. Another interesting one is that the speed limit and the accident rate seem to have a moderate negative correlation. In other words, lower speed limits go with higher accident rates, and higher speed limits with lower accident rates. This seems to go against the common driving-school teaching that slower is safer, and it could be a pattern to investigate more.

Suppose we wanted to look in more depth at the correlation coefficients of the accident rate variable. Here the correlogram would not be as useful, so instead I will make a dot plot. With the dot plot I will put the categorical variable on the y axis and the numerical variable on the x axis. First I take the first column out of the matrix I made earlier and put it into a new named vector. The dotplot() function operates much the same as the levelplot() function did before, except that, since I am working with a vector, I do not need to specify the data. For the formula only a ~ is needed. On the left side is the y axis, in this case the names of the named vector, and on the right is the x axis variable, here the correlation coefficients. The additional arguments I add in this function call are the limits of the x axis and the x axis label. I use -1.1 and 1.1 as the limits to give padding for plotting the most extreme possible correlation coefficients, -1 and 1.

#Look at correlation values for rate of accidents only
cor_rate_only_hwy <- cor_hwy[,1]
dotplot(names(cor_rate_only_hwy) ~ cor_rate_only_hwy, 
        xlim = c(-1.1,1.1), 
        xlab = "Correlation Coefficient")

After this code runs you will get the following plot.

Figure 2: Dot Plot of Correlation Coefficients for Rate of Accidents per million miles

With this plot it is a little bit easier to compare the actual values next to each other. It also becomes clearer that many of the variables have little or no correlation with the accident rate.
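Comparing values gets easier still if the rows are sorted by their coefficients. The sketch below is one possible way to do that with base R's reorder(). It assumes the built-in mtcars data frame as a stand-in, with mpg playing the role of the accident rate, and the object names are only for illustration.

```r
library(lattice)

# Stand-in data (assumption): correlations of every mtcars column with mpg
cor_mpg <- cor(mtcars)[, "mpg"]

# reorder() turns the names into a factor whose levels follow the coefficients,
# so the dot plot lists the variables from most negative to most positive
ordered_names <- reorder(names(cor_mpg), cor_mpg)

p_sorted <- dotplot(ordered_names ~ cor_mpg,
                    xlim = c(-1.1, 1.1),
                    xlab = "Correlation Coefficient")

# print(p_sorted) draws the plot
```

Without reorder(), the names are treated as a plain categorical variable and appear in alphabetical order.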

Alright, the plot in Figure 2 is good, but what about making the same plot for every variable in the correlation matrix? While one could rewrite the same code for each of the 11 variables, this is not efficient and all the plots would be separate. In the lattice package a single operator in the formula is all it takes to create several facets. This is the | symbol: the formula on its left describes the individual plots, and the rest of the formula on its right splits the data into the different facets. This is one of those examples where ggplot2 is weak compared to lattice. While ggplot2 is amazing at creating beautiful and complex graphics, it can be much wordier than lattice, which takes the name of a function and a simple formula to create a complex multivariate plot.

In the code below I create a dot plot with the same data frame I used for the correlogram, melted_cor_hwy. In the formula I name Var1 as the y axis and value as the x axis, and then with the | operator I create the facets from Var2.

#Dot plot for all variables
dotplot(Var1 ~ value | Var2, melted_cor_hwy,  
        xlim = c(-1.1,1.1), 
        xlab = "Correlation Coefficient")

This code creates the following plot.

Figure 3: Dot Plot of all Variables and their Correlation Coefficients

Many of the same patterns that were seen in the original correlogram reappear in this plot. The real power of a faceted dot plot like this, though, is that it allows quick and easy comparison of many groups at the same time. One pattern that may not have been obvious in Figure 1 is that the lane width variable (look at the facet labeled “lwid”) has very weak correlation with the other variables.
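When the goal is side-by-side comparison, it can help to control how the facets are arranged on the page. The sketch below uses the layout and as.table arguments that lattice's high-level plotting functions accept. It assumes four columns of the built-in mtcars data frame as a stand-in, and the object names are only for illustration.

```r
library(lattice)

# Stand-in data (assumption): correlations of four mtcars columns in long form
cor_cars <- cor(mtcars[, c("mpg", "wt", "hp", "qsec")])
long_cor <- as.data.frame(as.table(cor_cars), responseName = "value")

# layout = c(columns, rows) puts the four facets in a single row;
# as.table = TRUE fills the panels starting from the top left
p_grid <- dotplot(Var1 ~ value | Var2, data = long_cor,
                  layout = c(4, 1), as.table = TRUE,
                  xlim = c(-1.1, 1.1),
                  xlab = "Correlation Coefficient")

# print(p_grid) draws the plot
```

With all facets in one row they share a single y axis, so the same variable sits at the same height in every facet.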

One of the best parts of the R programming language, in my personal opinion, is the number of tools and packages that make it easy to visualize data and find interesting patterns. Both ggplot2 and lattice have their place in data visualization, and when they are used together, there is not much that is not possible. Data visualization problems that cannot be answered by some package in R are few and far between. And for those few problems that do exist, R makes it easy to make your own solution.

Relevant Links:

GitHub: https://github.com/SimonLiles/LIS4370RProgramming/blob/main/LIS4370Mod9.R