The Problem
As the COVID-19 pandemic has raged, researchers have been working to track the virus and monitor its evolution. The news has carried plenty of discussion of the South African, UK, and other variants of the virus. However, it would be a mistake to assume there is a consistent and equal effort around the globe. Different parts of the world will contribute more or less to the various gene banks for a variety of reasons, from differences in resources to differences in population size. The questions I have are: where are the samples of the virus genes coming from, how are they being collected, and who is submitting the most?
Data
To answer these questions I first need a data set that can support them. One candidate is a data set I have been working with on a different project: the National Center for Biotechnology Information, or NCBI, publishes documents known as Virus Reports, and it happens to have one for COVID-19. The Virus Report describes where and how samples were taken, various descriptive data about the virus, and accessors for things such as the nucleotide sequence. It also contains the location and sample source for every record. In all there are over 20 variables once the data is converted into a table.
Unfortunately the raw data is neither in table format nor in an easy-to-use one. It is stored in JSON Lines format: every line in the file is a complete JSON document, and there are approximately 40 thousand of these records. In addition, the records are not identical; many fields do not exist in every record. Fortunately there is a package that can easily turn this semi-structured data into a table: the protein8k package that I have been developing.
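To make the format concrete, here is a small sketch of my own (not part of the original workflow) that parses a single line with the jsonlite package, just to peek at the nested structure that protein8k will handle for us.
#Peek at one raw record with jsonlite (for illustration only;
#the actual analysis uses protein8k below)
library(jsonlite)
first_line <- readLines("data_report.jsonl", n = 1)
one_record <- fromJSON(first_line)
#The result is a nested list; which fields appear varies by record
str(one_record, max.level = 2)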
Here is the code for loading the protein8k package, which is currently only available from its GitHub page. I then give the file path for the Virus Report data, which is in my working directory. The last line uses two functions from protein8k: fromJSONL() reads the JSONL file and returns a list of records, each parsed into a list, and report_as_dataframe() takes that list and transforms it into a table.
#Install the latest version of protein8k
devtools::install_github("SimonLiles/protein8k", build_vignettes = TRUE)
#Load necessary libraries
library(protein8k)
#String representing location of data in relation to Working Directory
jsonListName <- "data_report.jsonl"
#Get all records from the .jsonl file
#May take a little while to read and parse all 40k records
report_df <- report_as_dataframe(fromJSONL(jsonListName))
The result of this code is a data frame with 24 variables and over 40 thousand records. Because the JSONL file is so large, report_as_dataframe() may take a few minutes to run.
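Assuming the parse succeeded, a quick look at the columns this analysis relies on confirms the table is ready; the names below are the ones used throughout the rest of this post.
#Inspect the columns used in the rest of this analysis
str(report_df[, c("geo_Location", "geo_Region", "isolate_source")])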
Now that I have the data loaded into my environment, I can begin preparing it for a visualization. The visualization I want to use to answer my questions is a sunburst diagram. A sunburst diagram is similar in structure to a donut chart, though besides being less tasty, it has several layers that relate in a hierarchical structure. With this in mind I can build an object in R that represents this hierarchy and its attributes.
Cleaning
To get started I will need to clean my data up a little bit. While the protein8k package does a lot to process the raw data, it is limited by its generalized nature. One example is the geo_Location variable: it holds the name of the submitting country, but many records also have the name of a state or province in the same field. While those could be useful for a follow-up analysis, I want to focus on the larger, global picture. Since the field follows the format country: territory, I can clean it up by simply removing any text that comes after the country name. There is also the isolate_source column, which is stored as a factor, while I need it as character data for the network I will be building later. Finally I will add a column called world; it simply contains the text "world" for every record and will be used to build the root of the hierarchy. The code to do this is as follows.
#Work on a copy of the original data frame
report_clean_df <- report_df
#Drop everything after the country name, e.g. "Country: Province" becomes "Country"
report_clean_df$geo_Location <- gsub(":.*", "", report_df$geo_Location)
#Convert the isolate source from factor to character
report_clean_df$isolate_source <- as.character(report_clean_df$isolate_source)
#Add a constant column to serve as the root of the hierarchy
report_clean_df$world <- "world"
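As a quick check that is not in the original script, the most frequent values of the cleaned field should now be bare country names.
#Ten most common submitting countries after stripping territories
head(sort(table(report_clean_df$geo_Location), decreasing = TRUE), 10)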
Building the Hierarchy
Now the data is ready and I can start building the network for the hierarchy. While many hierarchies are built with an undefined structure, for this visualization I am going to explicitly declare what is contained in each level, so the depth of the hierarchy will be less informative than the connections within it. For the sunburst diagram I want a hierarchy with 4 levels: the root "world", the isolate source (where the sample was taken from), and then the location, which is split into region and locations within that region.
A hierarchy can also be described as a network of entities, so I will need to create a network object in R using the igraph and ggraph libraries. To make this happen, I need a nodes list and an edge list. Making the nodes list is fairly simple: I catalog all the unique values, or names, in the data set, add the attributes I want (in this case the frequency with which each node occurs), and give each node a unique identifier. It is also at this point that I clean up the nodes list. I make the root node, "world", artificially smaller so that it does not overshadow the rest of the node sizes; the world node is only there to connect all of the branches together, so its size and any other attributes associated with it are not very meaningful. I also apply a logarithmic scale so that, when the node sizes are applied to the visualization, differences for smaller entities become more apparent.
#Load libraries
library(ggraph)
library(igraph)
library(tidyverse)
#Make list of nodes with frequencies for sizes
nodes <- report_clean_df %>%
  pivot_longer(cols = c(world, isolate_source, geo_Region, geo_Location),
               values_to = "label") %>%
  count(label, name = "size") %>%
  rowid_to_column("id")
#Shrink the root node so it does not dominate, then log-scale all sizes
nodes$size[nodes$label == "world"] <- 1
nodes$size <- log10(nodes$size)
Next is the edge list, which is a little more difficult because I have 4 columns of data that need to become 2 columns listing where a connection comes from and where it goes to. First I create a list in which each element is a vector of the four columns, in the order listed above, for a single record. This list is made into a matrix, and each pair of neighboring columns is then stacked to get the final edge list. Then I add the ids from the nodes list and clean up the edge list. The code is as follows.
#First create a list with an explicit hierarchy
hierarchy_list <- list()
for(record in 1:nrow(report_clean_df)) {
  branch <- c(report_clean_df$world[record],
              report_clean_df$isolate_source[record],
              report_clean_df$geo_Region[record],
              report_clean_df$geo_Location[record])
  hierarchy_list[[record]] <- branch
}
# create edge list (from in first column / to in the second)
d <- do.call(rbind, hierarchy_list)
edges <- rbind(d[,1:2], d[,2:3], d[,3:4])
edges <- as.data.frame(edges)
#Attach node ids for the "from" and "to" ends of each edge
edges <- edges %>%
  left_join(nodes, by = c("V1" = "label")) %>%
  rename(from = id)
edges <- edges %>%
  left_join(nodes, by = c("V2" = "label")) %>%
  rename(to = id)
#Clean up edges dataframe
edges <- select(edges, from, to)
edges <- group_by(edges, from, to) %>%
  count(name = "weights") %>%
  ungroup()
The final step in building the network is creating the network object itself; for this I use graph_from_data_frame(). The call is fairly straightforward. Now I am ready for plotting.
#Making the network
loc_graph <- graph_from_data_frame(edges, vertices = nodes, directed = TRUE)
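Before plotting, a couple of quick sanity checks (my addition, using standard igraph functions) confirm the network has the expected shape: one vertex per row of the nodes list and one edge per distinct parent-child pair.
#Sanity checks on the network object
gorder(loc_graph) #number of vertices, should equal nrow(nodes)
gsize(loc_graph)  #number of edges, should equal nrow(edges)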
Plotting
I use the function ggraph() to make this visualization. It is built on the ggplot2 package, so the functionality is much the same. For this visualization I used the viridis color scale because it looks nice and highlights the differences in values fairly well.
#Make the plot
ggraph(loc_graph, 'partition', circular = TRUE) +
  geom_node_arc_bar(aes(fill = size, color = depth), size = 0.25) +
  geom_node_text(aes(label = label), size = 2.5) +
  theme_void() +
  theme(legend.position = "right") +
  scale_color_continuous(guide = "none") +
  scale_fill_viridis_c(direction = 1)
The final plot is shown below.
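If you want to keep a copy of the figure, ggsave() from ggplot2 writes the last plot drawn; the file name and dimensions here are my own choices rather than part of the original script.
#Save the most recent plot to disk
ggsave("covid_sunburst.png", width = 10, height = 8, dpi = 300)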
Discussion
The sunburst diagram is a method for part-to-whole analysis. With this visualization, the parts of the hierarchy have been divided up within the whole so as to show their relationships and differences. From it we can see interesting trends, such as that most sequences submitted from Asian and North American countries were collected from feces, while European countries chose to collect from the lungs. We can also see that many of the submitted sequences come from North America, and specifically from the United States. Most smaller countries have submitted fewer sequences, most likely because they have smaller populations and even fewer positive cases to draw from. A good next step for this visualization would be to correct for the number of COVID-19 cases in each country so that countries are on more of the same scale. What I find most interesting, though, is that China, India, and Brazil, countries with large populations and many active COVID-19 cases, are not more prominently featured, considering that India and Brazil sit second and third behind the United States in case counts. This suggests either that few variants are being submitted from these regions, or that there are holes in the operation of tracking COVID-19 variants and that resources to track the evolution of the virus should be deployed there.
My analysis of this data is not complete; there are interesting trends beginning to appear at this stage. The next step would be to combine the nodes list with another data set describing the total number of COVID-19 cases in each country and to compute a new "size" variable that accounts for the proportion of a population that has been infected. Doing this will reduce the bias toward countries with higher populations. For example, the United States will obviously have the highest number of submissions because it has the most cases and appropriate resources. While countries such as Brazil and India may be obvious in their lack of new submissions to the NCBI database, others may not be. Correcting for the proportion of the infected population will help remove some of this bias.
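As a rough sketch of that next step, assume a hypothetical cases_df with one row per country and columns country and total_cases, aggregated from any public case-count data set; nothing below is in the original script. The per-country sizes could then be recomputed as submissions per case before being merged back into the nodes list.
#Hypothetical correction: normalize submissions by case counts
#cases_df (columns: country, total_cases) is assumed, not provided here
adjusted_sizes <- report_clean_df %>%
  count(geo_Location, name = "submissions") %>%
  left_join(cases_df, by = c("geo_Location" = "country")) %>%
  mutate(size = submissions / total_cases) #submissions per confirmed case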
Visualizations such as the one in this post are useful not just for analyzing data directly, but also for analyzing the associated metadata, so one can visualize the quality of the data as well.
Links:
The analysis of this data was done using an R script that I wrote. You can find a copy of the script on my GitHub here: https://github.com/SimonLiles/LIS4317DataVisualization/blob/main/LIS4317FinalProject.R
The data I used can also be downloaded from my GitHub here: https://github.com/SimonLiles/LIS4317DataVisualization/blob/main/data_report.jsonl.zip