One of the most important aspects of any program is its ability to take in data and, after processing it, save the results. This is generally done through a programming language's input and output libraries. In Java, that is the java.io package. In the R language, we have the readr package, which handles some of the most common ways of storing data, specifically rectangular formats such as comma-delimited (.csv) or tab-delimited files. Base R offers equivalents such as read.csv() and read.table(), which are the functions used in this post. To explore these concepts and functions, I wrote an R package which you can find on my GitHub here, or follow the link at the bottom of this post.
The basic concept behind how R imports data from a file is that the file is a plain-text document, and most of the work of importing it is done through string manipulation. Take the code below, for example:
user_string <- readline("Enter a string:\n")
cat("Your string was:", user_string, "\n")
This code prompts the user for a string and then echoes it back. The output in the console is as follows:
> user_string <- readline("Enter a string:\n")
Enter a string:
Hello World!
> cat("Your string was:", user_string, "\n")
Your string was: Hello World!
The line reading Hello World! is where I typed in the string. This works fine for small amounts of data or for simple commands passed in during the execution of a much larger program. For data entry, however, this alone is slow and tedious.
This is where the scan() function comes in: it reads data in from a file. scan() is the most basic function for reading data, and the format-specific file-reading functions are built on it. It reads the file's contents as text, splits them into fields, and can apply a limited amount of cleaning and type conversion depending on the arguments passed to it. Unless you are working with a custom or niche file format, you will probably use functions such as read.csv() or read.table() instead. These functions convert the input text into a data frame based on the character used to delimit the values in the file.
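To make the difference concrete, here is a minimal sketch that reads the same file both ways. The file contents and the names tmp, raw, and df are made up for illustration and are not part of the original data set.

```r
# Write a tiny comma-delimited file to a temporary location (hypothetical data).
tmp <- tempfile(fileext = ".txt")
writeLines(c("Name,Age", "Ana,20", "Ben,22"), tmp)

# scan() returns one flat character vector of fields; rebuilding the
# rows and columns would be up to us.
raw <- scan(tmp, what = character(), sep = ",", quiet = TRUE)
print(raw)

# read.csv() uses the header row and the delimiter to build a data frame.
df <- read.csv(tmp, header = TRUE)
print(df)
```

With scan() the header and the values all arrive in one vector, which is why the format-specific readers are usually the more convenient choice for rectangular data.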
So, let’s try applying this. I have been given a text file named “Assignment 6 Dataset-1.txt”. It contains a small data set of 20 students in a class, with each student's name, age, gender, and grade. Before writing any R code, I always open the file and manually look through how it is organized. This file is delimited with commas, so I chose the read.csv() function to read it in. While we could use the scan() function, it would require more arguments and more cleaning before other processing could be applied to the data.
# FileName to input is: Assignment 6 Dataset-1
Student <- read.csv("Assignment 6 Dataset-1.txt", header = TRUE)
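The "look at the file first" step can also be done from inside R. The sketch below is only an illustration: it builds a small stand-in file with made-up rows, and the column names (Name, Age, Sex, Grade) are my assumption based on the code used in this post.

```r
# Hypothetical stand-in for the real data file: a few made-up rows with
# assumed column names (Name, Age, Sex, Grade).
path <- tempfile(fileext = ".txt")
writeLines(c("Name,Age,Sex,Grade",
             "Ana,20,Female,90",
             "Ben,22,Male,80"), path)

# Peek at the first lines as raw text to spot the delimiter and header.
first_lines <- readLines(path, n = 2)
print(first_lines)

# Import the file, then check the column names and types R inferred.
Student <- read.csv(path, header = TRUE)
str(Student)
```

Checking the result with str() right after the import is a cheap way to catch a wrong delimiter or a missing header before doing any analysis.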
Now that the file is in a data frame, we can apply data manipulation functions to transform the data by adding columns, or to summarize it into a smaller table. For example, if I wanted to summarize the average grade in the class by gender, I would use the following code, which relies on ddply() from the plyr package.
library(plyr)  # provides ddply() and summarize()
StudentAverage <- ddply(Student, "Sex", summarize, Grade.Average = mean(Grade))
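If plyr is not installed, the same kind of summary can be sketched in base R with aggregate(). This is a swapped-in alternative rather than what the assignment used, and the four students below are made-up values for illustration.

```r
# Made-up stand-in for the Student data frame (hypothetical values).
Student <- data.frame(
  Name  = c("Ana", "Ben", "Cid", "Dee"),
  Sex   = c("Female", "Male", "Male", "Female"),
  Grade = c(90, 80, 70, 100)
)

# Mean of Grade within each level of Sex, using base R only.
StudentAverage <- aggregate(Grade ~ Sex, data = Student, FUN = mean)

# Rename the summary column to match the name used in the ddply() call.
names(StudentAverage)[2] <- "Grade.Average"
print(StudentAverage)
```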
Now we have a two-by-two table with each gender and their average grade. But what if I want to save this data? With something this small I could just copy it by hand into an Excel spreadsheet or into Notepad, but that does not keep the data in a very useful state, and it would not let me do other things within the script, such as transferring the file to a new location on a different server. So, to save a table or data frame, you can use one of the many write functions. The main difference between them is how they delimit the file. Otherwise they generally mirror the formats you would normally read from: CSV, tab-delimited, and so on.
To create a new file you will need to specify the data you want to save, the name of the new file, and, if necessary, the location relative to your working directory. If you do not specify a location, the file is saved to whatever the working directory is set to. For example, if I wanted to save the student averages, I would use the following line of code.
write.table(StudentAverage, "StudentAverage")
This creates a new plain-text document named StudentAverage in my working directory, using the StudentAverage data frame. Note that write.table() separates values with a single space by default. To write a tab-delimited file, pass sep = "\t".
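Here is a minimal sketch of writing a tab-delimited file to an explicit location and reading it back. It passes sep = "\t" explicitly, since write.table() uses a space when no separator is given. The data values, the out_path name, and the use of tempdir() as the target folder are assumptions for illustration.

```r
# Made-up summary table standing in for StudentAverage (hypothetical values).
StudentAverage <- data.frame(
  Sex           = c("Female", "Male"),
  Grade.Average = c(95, 75)
)

# Build the output path explicitly instead of relying on the working directory.
out_path <- file.path(tempdir(), "StudentAverage.txt")

# sep = "\t" makes the file tab-delimited; row.names = FALSE and
# quote = FALSE keep row numbers and quotation marks out of the file.
write.table(StudentAverage, out_path, sep = "\t",
            row.names = FALSE, quote = FALSE)

# Inspect the raw text, then read the file back in as a data frame.
print(readLines(out_path))
back <- read.delim(out_path)
print(back)
```

Reading the file back with read.delim() is a quick check that the delimiter and header were written the way you intended.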
Okay, that is all cool. But now let’s pull a subset of the data. For example, what if we wanted our data to contain only students that have the letter “i” in their name? We could try some complicated data frame selection statements, or we can just use base R's built-in grep family of functions. In this case we would use the grepl() function, which returns a logical vector that is TRUE at each index where the specified pattern matches and FALSE everywhere else. We can pass this as the logical argument to subset(), which returns only the rows where it is TRUE. In all, the code looks like the following:
i_students <- subset(Student, grepl("i", Student$Name, ignore.case = TRUE))
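To see what grepl() hands to subset(), it helps to run it on a plain vector first. The names below are made up for illustration.

```r
# Hypothetical names, including one where the only "i" is upper case.
student_names <- c("Simon", "Alice", "Bob", "IRIS")

# One TRUE/FALSE per name: does it contain "i", ignoring case?
has_i <- grepl("i", student_names, ignore.case = TRUE)
print(has_i)

# Indexing with the logical vector keeps only the matching names.
print(student_names[has_i])
```

subset() applies the same idea to the rows of a data frame: rows where the logical vector is TRUE are kept.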
Now we have a data frame that only has names containing the letter “i”. So what do we do with it? We could do some more processing on the data frame, or maybe some analysis. But how do we save it in one of the most common formats, a comma-delimited file? Simple: use write.csv() the same way we used write.table() above.
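As a rough sketch, saving the subset could look like the following. The rows in i_students here are made up, and the file name and the tempdir() location are assumptions for illustration.

```r
# Made-up stand-in for the i_students subset (hypothetical values).
i_students <- data.frame(
  Name  = c("Simon", "Alice"),
  Sex   = c("Male", "Female"),
  Grade = c(88, 92)
)

# write.csv() always uses a comma as the separator.
# row.names = FALSE keeps R's row numbers out of the file.
csv_path <- file.path(tempdir(), "i_students.csv")
write.csv(i_students, csv_path, row.names = FALSE)

# Read the file back to confirm it round-trips.
check <- read.csv(csv_path)
print(check)
```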
And that is how input and output work in R. It is fairly simple, and it adds a lot of power to an application or script when used effectively. Every data set that you can run through R is stored in some kind of file, so being able to read those files is a must. Understanding how the read functions work will also let you handle niche file formats and other special data.
Links:
GitHub: https://github.com/SimonLiles/LIS4370RProgramming/blob/main/LIS4370Mod8.R