LIS 4370 R Programming: Debugging

Debugging is a critical skill for any programmer. Not many enjoy it; the process often involves lots of intense googling, reading help pages and incomplete documentation, and occasionally the sacrificing of chickens. This week I was given a chunk of code that contains a deliberate bug. You can find a copy of the code used in this post on my GitHub by following the link at the bottom of this post.

The code I was given is as follows.

#Given Code
tukey_multiple <- function(x) { 
  outliers <- array(TRUE,dim=dim(x)) 
  for (j in 1:ncol(x)) 
  { 
    outliers[,j] <- outliers[,j] && tukey.outlier(x[,j]) 
  } 
  outlier.vec <- vector(length=nrow(x)) 
  for (i in 1:nrow(x)) 
  { outlier.vec[i] <- all(outliers[i,]) } return(outlier.vec) }

This appears to be a function definition, so running it at the console should load it into the global environment.

Error: unexpected symbol in:
"  for (i in 1:nrow(x)) 
  { outlier.vec[i] <- all(outliers[i,]) } return"

Or not. The first thing I notice when looking at this code is that the whitespace is not done very nicely, so my first fix is going to be that whitespace. Programming languages vary in how picky they are about whitespace and the placement of things such as brackets. In Java, for example, a curly brace can sit on the same line as the code that uses it or on the following line. In R you generally see the opening curly brace on the same line as the statement it belongs to. R does care about where one expression ends and the next begins, though: the error message stops at return, which sits on the same line as the closing brace of the second for loop with nothing separating the two. R expects a line break or a semicolon there.
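The parser behavior behind that error can be reproduced in isolation. The sketch below is only an illustration, not the assigned code: the function text stored in `bad` and `good` is made up for the example, and `parse(text = ...)` is used so the syntax error can be caught instead of halting the script.

```r
# Hypothetical one-line function: the closing brace of the for loop and
# return() share a line with nothing separating them.
bad <- "f <- function(x) { for (i in 1:3) { x <- x + i } return(x) }"

# parse() only checks syntax; tryCatch() captures the parser's message.
bad_msg <- tryCatch(
  {
    parse(text = bad)
    "parsed without error"
  },
  error = function(e) conditionMessage(e)
)

# Same code with a semicolon between the two expressions.
# A line break in the same position would work equally well.
good <- "f <- function(x) { for (i in 1:3) { x <- x + i }; return(x) }"
eval(parse(text = good))

cat(bad_msg, "\n")
print(f(0))  # 0 + 1 + 2 + 3
```

The message captured in `bad_msg` should contain the same "unexpected symbol" complaint, while the second version parses and runs.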

#Clean up white space, indentations, etc. 
tukey_multiple <- function(x) { 
  outliers <- array(TRUE,dim=dim(x)) 
  for (j in 1:ncol(x)) { 
    outliers[,j] <- outliers[,j] && tukey.outlier(x[,j]) 
  } 
  
  outlier.vec <- vector(length=nrow(x)) 
  
  for (i in 1:nrow(x)) { 
    outlier.vec[i] <- all(outliers[i,]) 
  } 
  
  return(outlier.vec) 
}

Now that the code is cleaned up and readability has been improved, I run the function definition again, and there are no errors. But that does not mean the debugging work is done. As long as there are no syntax errors, the R interpreter will load a function into the global environment when you run its source code. The body of the function is not evaluated until the function is called, so type mismatches and missing objects are not caught when you run it like this. The next step is to run a test object through the function.
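A small sketch shows why a clean load proves so little. The names `uses_missing_helper` and `helper_not_defined_anywhere` are hypothetical and exist only for this illustration.

```r
# Defining the function succeeds: R parses the body but does not evaluate it,
# so the call to an undefined helper goes unnoticed.
uses_missing_helper <- function(x) {
  helper_not_defined_anywhere(x)
}

# The function object now exists in the environment.
is_loaded <- exists("uses_missing_helper", mode = "function")

# The helper it depends on does not.
helper_found <- exists("helper_not_defined_anywhere", mode = "function")

# Only calling the function exposes the problem.
call_msg <- tryCatch(
  uses_missing_helper(1),
  error = function(e) conditionMessage(e)
)

print(is_loaded)
print(helper_found)
cat(call_msg, "\n")
```

Checking with `exists()` before calling is one cheap way to spot a missing dependency, although it only tells you the name is absent, not what the function was supposed to do.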

The object I create to test this function is a matrix of 10 rows by 10 columns, filled with normally distributed random data. I could pass a vector through the function, but I chose a matrix because I see that the function operates by columns and rows. This function may not be intended to be used on vectors, so I am staying with something that is more likely to work.

#Testing tukey_multiple()
#10 x 10 matrix of standard normal draws
X <- matrix(rnorm(100), nrow = 10, ncol = 10)
#Overwrite the first row with 100 in every column
X[1,] <- 100

tukey_multiple(X)

And when this code runs, the following error appears.

Error in tukey.outlier(x[, j]) : could not find function "tukey.outlier"

Looking at the code, there is a call to tukey.outlier(), which is either a call to a function defined somewhere else or a misspelling. After some googling, I could not find the function as spelled in the original code; however, I did find a similar one that uses an underscore in place of the dot.
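Since the real tukey.outlier() is not available, one way to keep testing is to write a stand-in. The version below is purely an assumption on my part: it applies Tukey's fences, flagging values more than 1.5 times the interquartile range below the first quartile or above the third. The argument `k` and the whole body are hypothetical, and the function the original author had in mind may behave differently.

```r
# ASSUMPTION: stand-in for the missing function, not the original author's code.
# Flags values outside Tukey's fences: [Q1 - k * IQR, Q3 + k * IQR].
tukey.outlier <- function(x, k = 1.5) {
  # unname() drops the "25%" / "75%" labels that quantile() attaches.
  q <- unname(quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))
  iqr <- q[2] - q[1]
  (x < q[1] - k * iqr) | (x > q[2] + k * iqr)
}

# Quick check on a vector with one obviously extreme value.
flags <- tukey.outlier(c(1, 2, 3, 4, 100))
print(flags)
```

With a stand-in like this loaded, the test call gets past the missing function and whatever goes wrong next can be examined.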

This is where simply correcting the code begins to get a little hazy. At this point I would need to look at the documentation for this function if it were part of a package, or, if it were part of a script for a data science project, I would need to talk with the original author and figure out their intent. While the function call could be misspelled, it is just as likely that I am missing some code. In addition, just from eyeballing the code I see a potential logic error. The parameter x is used only column by column as input to tukey.outlier(), and each result is combined with a column of boolean TRUE values using &&, which works on single values rather than on whole vectors the way & does. From what I can see this would make the results from the function meaningless or stop it with an error, but I would not know unless I had more information.
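One detail in the first loop that may deserve a closer look is the `&&` operator. In R, `&&` is meant for single TRUE/FALSE values (R 4.3.0 and later raise an error when it is given longer vectors, while older versions used only the first element), whereas `&` works element by element. The sketch below is one possible repair, not a confirmed fix: it swaps `&&` for `&` and relies on a stand-in tukey.outlier() that I am assuming uses the 1.5 times IQR rule. The name `tukey_multiple_fixed` and the small test matrix are made up for the example.

```r
# Element-wise AND compares position by position and keeps the full length.
elementwise <- c(TRUE, FALSE, TRUE) & c(TRUE, TRUE, FALSE)

# ASSUMPTION: stand-in for the missing tukey.outlier(), flagging values
# more than 1.5 * IQR outside the quartiles.
tukey.outlier <- function(x) {
  q <- unname(quantile(x, probs = c(0.25, 0.75)))
  fence <- 1.5 * (q[2] - q[1])
  (x < q[1] - fence) | (x > q[2] + fence)
}

# Same structure as the function in the post, with && swapped for &.
tukey_multiple_fixed <- function(x) {
  outliers <- array(TRUE, dim = dim(x))
  for (j in seq_len(ncol(x))) {
    outliers[, j] <- outliers[, j] & tukey.outlier(x[, j])
  }
  outlier.vec <- vector(length = nrow(x))
  for (i in seq_len(nrow(x))) {
    # A row counts as an outlier only if it is flagged in every column.
    outlier.vec[i] <- all(outliers[i, ])
  }
  outlier.vec
}

# Small deterministic test: row 1 is extreme in every column.
X <- matrix(rep(c(100, 1, 2, 3, 4), times = 3), nrow = 5)
flags <- tukey_multiple_fixed(X)

print(elementwise)
print(flags)
```

Under these assumptions only the first row is flagged, which matches what a row of 100s should produce. Whether that is the behavior the original author intended is still something only they could confirm.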

While debugging can be frustrating, I like it, because in the process I gain a better understanding of the programming language and how the tools work. And with a better understanding of programming, I can find new and exciting solutions.

Links:

GitHub: https://github.com/SimonLiles/LIS4370RProgramming/blob/main/LIS4370Mod11.Rmd