LIS 4273 Final Project: Increasing Precision of Earthquake Monitoring Data

Despite being an everyday occurrence in some parts of the world, earthquakes can be highly destructive forces, even more so in coastal areas due to the additional risk of tsunami. While the timing of an earthquake is still unpredictable, the size of a tsunami wave can be predicted as soon as the magnitude of the quake is recorded. Knowing the wave size allows civil agencies to issue the appropriate alerts to the appropriate areas. The magnitude of the earthquake is only one of many factors in calculating the size of a tsunami; however, every one of those factors carries error, and reducing the error in a factor such as magnitude will in turn reduce the error in the calculated tsunami size.

Defining the Problem

For the initial work I will be using a dataset provided by the US Geological Survey. The dataset is a single snapshot of a live feed that is updated every minute with all earthquakes from the last 7 days; I downloaded it on December 7, 2020 at 7:06 PM EST. The dataset and related information can be found via the link at the bottom of this post. For this initial look at the problem I will be investigating the relationship between the number of stations reporting the magnitude and the standard error of the measured magnitude.

The null hypothesis is that there is no relationship between the number of stations reporting magnitude and the standard error of the magnitude. The alternative hypothesis is that more stations reporting magnitude will be associated with a decrease in the standard error of the magnitude. I expect the alternative to hold because, in general, as more samples are taken of an event, the uncertainty around that event decreases.

All R code used can be found on my GitHub via the link at the bottom of this post.

Related Work

The work here is focused on correlation analysis. This relates to work done in Module 8 of LIS 4273, which covered the Pearson and Spearman correlation methods and applied them to a dataset in the corresponding assignment. The post for Module 8 is linked at the bottom of this post.

Solution

Before any analysis is done, the data must be prepared. This dataset has over 3,000 records and 22 variables. The columns of interest are magNst, the number of stations reporting the magnitude, and magError, the standard error of the reported magnitude. Both columns have many NA values that would interfere with the analysis, so those records can be removed.
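To see the scale of the problem before cleaning, the feed can be loaded and the missing values counted. A minimal sketch, assuming the snapshot was saved locally as all_week.csv (the file name is my assumption, not part of the original script):

# Load a local snapshot of the USGS feed (file name assumed)
eqData <- read.csv("all_week.csv")

sum(is.na(eqData$magNst))   # records missing the station count
sum(is.na(eqData$magError)) # records missing the magnitude error
nrow(eqData)                # total records before cleaning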

# Remove NA's for Stations Reporting Magnitude and Magnitude Error
eqData <- eqData[complete.cases(eqData$magNst, eqData$magError),]

This line keeps every record that has a value in both specified columns; if either value is NA, the record is left out of the data frame. With my data the result was a data frame of about 2,000 observations.

That is about all that is needed to prepare the data. Now it is time for the fun stuff. Below is a plot of the data.

Figure 1: Scatter plot of the number of stations reporting magnitude against the standard error of the reported magnitude.
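For reference, a scatter plot like Figure 1 can be produced with base R graphics; a minimal sketch, with axis labels of my own wording:

# Scatter plot of stations reporting magnitude vs. magnitude error
plot(eqData$magNst, eqData$magError,
     xlab = "Stations Reporting Magnitude (magNst)",
     ylab = "Standard Error of Magnitude (magError)")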

It would seem there is a relationship between the number of stations and the error. However, a plot alone is not enough; we need some numbers to describe this relationship.

# Correlation Analysis
cor_test1 <- cor.test(eqData$magNst, eqData$magError)
cor_test1

This code performs a Pearson product-moment correlation test and reports the statistical significance of the result. The output is as follows.

         Pearson's product-moment correlation

 data:  eqData$magNst and eqData$magError
 t = -3.69, df = 2037, p-value = 0.0002301
 alternative hypothesis: true correlation is not equal to 0
 95 percent confidence interval:
  -0.12445611 -0.03821224
 sample estimates:
         cor 
 -0.08148671
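As a quick sanity check, the t statistic in the output can be recomputed from the reported r and degrees of freedom with the standard formula t = r * sqrt(df) / sqrt(1 - r^2):

# Recompute the t statistic from the reported values
r  <- -0.08148671
df <- 2037
r * sqrt(df) / sqrt(1 - r^2) # about -3.69, matching the output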

From this output, two values are important: the correlation coefficient r, which measures the strength of the relationship, and the p-value, which gives the statistical significance. The coefficient is listed at the bottom of the output under cor; here its value is -0.081, so the relationship is negative as hypothesized, but it is not very strong. The p-value is less than 0.001, so the result is statistically significant. For the relationship to be strong enough to consider regression analysis, the coefficient should have a magnitude greater than about 0.2; the closer r is to zero, the weaker the relationship, and one this weak will not be usefully predicted with linear regression.
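Pearson's test also assumes the variables are roughly normally distributed, which may not hold for these measurements. The Spearman method, also covered in Module 8, offers a rank-based alternative; a minimal sketch, noting this test was not part of the original analysis:

# Spearman rank correlation as a nonparametric check
cor_test2 <- cor.test(eqData$magNst, eqData$magError,
                      method = "spearman", exact = FALSE)
cor_test2

With many tied values in magNst, setting exact = FALSE requests the asymptotic p-value rather than an exact one.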

Conclusion

The null hypothesis of no relationship was rejected, since the p-value was well below 0.05, but the correlation coefficient fell within ±0.2 of zero, so the relationship, while statistically significant, is too weak to be of practical use. In other words, reducing the error of reported earthquake magnitudes will not be accomplished simply through a larger number of reporting stations. Further analysis could survey the stations that report magnitude and error for other potential factors; comparing those factors may lead to new insight. For example, one factor could be the brand of seismometer being used.

Links

GitHub: https://github.com/SimonLiles/LIS4273AdvStatistics/blob/master/LIS4273FinalProject.R

USGS Earthquake Data: https://earthquake.usgs.gov/earthquakes/feed/v1.0/csv.php

LIS 4273 Module 8: https://quantknot.com/2020/10/lis-4273-adv-statistics-module-8-assignment/