LIS 4317 Data Visualization: Correlation Analysis

Correlation and regression analysis are powerful tools for exploring data, and they are especially useful for process improvement. Take the case of a plant that converts ammonia into nitric acid. After the ammonia has been oxidized, it is passed into an absorption tower where the nitric acid is collected. There are four variables to take into account in the operation of the plant. Air flow represents the rate of operation of the plant. Water temperature is the temperature of the water used to cool the stack. Acid concentration is measured inside the stack; in the data I will be using, it is recorded as the percentage minus 50, times 10, so a concentration of 58.9 percent is recorded as 89. Finally, stack loss is 10 times the percentage of the ammonia going into the plant that escapes through the absorption tower. It is an inverse measure of the efficiency of the plant.
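As a quick illustration of the coding described above, here is a minimal sketch that converts the recorded values back into percentages. The helper names decode_acid_conc and decode_stack_loss are my own for this illustration and do not come from any package.

```r
# Hypothetical helpers (names are illustrative, not from any package)

# Acid concentration is stored as (percent - 50) * 10
decode_acid_conc <- function(coded) coded / 10 + 50

# Stack loss is stored as 10 times the percentage of ammonia lost
decode_stack_loss <- function(coded) coded / 10

decode_acid_conc(89)   # 58.9, matching the example above
decode_stack_loss(42)  # a recorded 42 corresponds to 4.2 percent
```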

The data set I will be using is Brownlee’s Stack Loss Plant Data, as published in Brownlee (1965). The variables are as I described above. To create the correlograms and regression plots in this post I wrote an R script using ggplot2, which you can find on my GitHub through the link at the bottom of this post.

The Stack Loss data set is one of many that come with base R, in the datasets package. You can load it with the following code.

#Load stackloss data set
my_stack <- stackloss

The first step in the analysis is to measure the strength of association between the variables using correlation analysis. However, a matrix of numbers is neither very informative nor visually appealing, so we create a correlogram to plot the data.

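#Load ggplot2 for plotting and reshape2 for melt() (assumed source of melt)
library(ggplot2)
library(reshape2)
#Create a correlogram using ggplot2
ggplot(melt(cor(my_stack)), aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  labs(title = "Correlation of Variables involved in Converting Ammonia to Nitric Acid",
       x = "", y = "")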

This code then creates the following plot.

Figure 1: Correlation of different variables involved in converting ammonia into nitric acid.

As the legend in the plot shows, lighter colors indicate a stronger positive relationship. We can infer that there are only positive correlations in this plot because the scale only goes down to about 0.4; had there been negative correlations, the scale would extend below zero.
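The color scale in Figure 1 only spans the range of values actually present, which is why it bottoms out near 0.4. As a sketch of one alternative (not what my script does), the fill scale can be pinned to the full range of possible correlations with scale_fill_gradient2(), so that positive and negative values would get visibly different colors. The object names cor_long and p are just for this example, and I use base R's as.table() here instead of melt() to keep the snippet self-contained.

```r
library(ggplot2)

# Reshape the correlation matrix into long form with base R
cor_long <- as.data.frame(as.table(cor(stackloss)))
names(cor_long) <- c("Var1", "Var2", "value")

# Pin the fill scale to [-1, 1] with 0 as the neutral midpoint
p <- ggplot(cor_long, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(limits = c(-1, 1), midpoint = 0) +
  labs(x = "", y = "")
p
```

With the scale fixed this way, a plot of a different data set can be compared against this one without the colors shifting meaning.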

In Figure 1 we can see some interesting patterns. One is that stack loss has a strong positive correlation with both air flow and water temperature. Meanwhile, the relationship between acid concentration and stack loss is much weaker.
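To back up what the colors suggest, the underlying numbers can be pulled straight from the correlation matrix. This is a small sketch using the built-in stackloss data; loss_cor is just a name chosen for this example.

```r
# Correlation of every variable with stack loss, rounded for readability
loss_cor <- round(cor(stackloss)[, "stack.loss"], 2)

# Drop the trivial correlation of stack loss with itself, then sort
loss_cor <- sort(loss_cor[names(loss_cor) != "stack.loss"], decreasing = TRUE)
loss_cor
```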

This analysis has produced some interesting insights. In the realm of process improvement, we would want to visualize these relationships and model the interaction between the variables. For example, if we wanted to minimize stack loss, how would we create a visualization that shows how to do that?

For this example I will use ggplot2 to create a scatter plot of stack loss against each of the three other variables. We can then use the stat_smooth() function to fit a linear model to each. Creating this as three facets in a single plot requires two steps, though. Before we plot, we need to reshape the data into a long format that is easier to facet, so that instead of the four original columns we have three: stack loss, the name of one of the other three variables, and its value. The code to do all of this is as follows:

#Load tidyr for pivot_longer() and ggplot2 for plotting
library(tidyr)
library(ggplot2)
#Pivot the data set into 3 columns: stack.loss, name, value
new_stack <- pivot_longer(my_stack, c(Air.Flow, Water.Temp, Acid.Conc.))
#Plot the values against stack.loss in 3 separate facets
ggplot(new_stack, aes(value, stack.loss, color = name)) +
  geom_point(position = "jitter") +
  #Overlay a fitted linear model with its confidence band
  stat_smooth(method = "lm", col = "red") +
  facet_wrap(~ name, scales = "free") +
  theme(legend.position = "none")

This will create the following plot:

Figure 2: Effects on stack loss by Acid Concentration, Air Flow, and Water Temperature.
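For anyone curious what the reshaped data behind Figure 2 looks like, here is a small sketch that inspects it. It assumes the tidyr package is installed; long_stack is a name used only for this example.

```r
library(tidyr)

# Same reshaping step as above: one row per (observation, predictor) pair
long_stack <- pivot_longer(stackloss, c(Air.Flow, Water.Temp, Acid.Conc.))

names(long_stack)  # stack.loss, name, value
nrow(long_stack)   # 21 observations x 3 predictors = 63 rows
head(long_stack)
```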

Now the results from the correlation analysis make more sense. In the far-left panel, acid concentration against stack loss looks almost like a blob, while the other two panels show much more obvious linear relationships.
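The red lines in Figure 2 are ordinary least-squares fits, so the same models can be fit directly with lm() to put numbers on what the panels show. This is a sketch rather than part of my original script; the names predictors and fits are my own.

```r
predictors <- c("Air.Flow", "Water.Temp", "Acid.Conc.")

# Fit one simple linear regression of stack loss on each predictor
fits <- sapply(predictors, function(p) {
  model <- lm(reformulate(p, response = "stack.loss"), data = stackloss)
  c(slope = unname(coef(model)[2]), r_squared = summary(model)$r.squared)
})

# One column per predictor; rows hold the slope and R-squared
round(fits, 3)
```

For a simple regression, R-squared is the square of the correlation coefficient, so these values line up with what the correlogram showed.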

These visualizations and analyses create a good base for further, more complicated analysis. From this visualization we can see that if we were concerned with reducing stack loss as much as possible, we would want to reduce air flow and water temperature. The next steps would involve a discussion of goals and requirements; from there we can decide how to model the data, or whether we need to go back and gather more data, before making decisions to optimize the plant.
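As one hedged sketch of what a next modeling step could look like, a multiple regression puts all three predictors into a single model. This goes beyond what my script does, and whether it is the right model depends on the goals and requirements mentioned above; full_model is a name chosen for this example.

```r
# Regress stack loss on all three other variables at once
full_model <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
                 data = stackloss)

# Coefficients, standard errors, and overall fit
summary(full_model)
```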

Regarding Few’s recommendation to include grid lines, I feel that in the case of correlograms and regression analysis it is only really useful when you create a scatter plot. Grids will not mesh well with a series of pie charts or tiles. The correlogram in Figure 1, which was made with ggplot2, has a grid in the background, but the tiles are opaque and take up so much space that you cannot see it. In short, I think Few’s recommendation has its place, but it is not necessary to follow it one hundred percent of the time.

Links:

GitHub: https://github.com/SimonLiles/LIS4317DataVisualization/blob/main/LIS4317Mod8.R