Mining for YouTube Sailing Channels

Recently my Dad came to me with a question about getting data on the sailing channels on YouTube. The number of active sailing-focused YouTube channels is estimated to be anywhere from 500 to about 1000. The data mining that I will perform in the following sections will only include channels that are currently listed on YouTube. Hidden and deleted channels are beyond the scope of this discussion, and would require resources that are not accessible to the layman.

I have included all of the project files on my GitHub. They can be found here, or by following the link at the bottom of this post. The project files also include the data I mined, in case you want to play with the data as well.

This blog post is a companion to the one my Dad wrote on his blog. While I discuss the data mining and collection techniques here, the analysis and results are explained in significant detail in the other post. You can find that post here, or follow the links at the bottom of the page.

First Collection

For the initial data collection, I started with an easy, albeit crude, method: searching “sailing sv” on YouTube, filtering for only channels, and then copying and pasting the output into an Excel spreadsheet. I would generally recommend starting with a crude collection method like this, if for no other reason than that trying to access an unfamiliar API with little idea of what you want is a recipe for failure.

Cleaning this data was fairly straightforward. It was mostly a matter of identifying the key elements and how they were separated in the spreadsheet. Then I only had to take the numbers, which had been represented as strings, and convert them into numeric variables.
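As a rough illustration of that string-to-number step, here is a minimal Python sketch. The post does not show what the pasted text looked like, so the input format here (values such as "12.3K subscribers" or "1,234 videos") is an assumption, and `parse_count` is a hypothetical helper rather than the code used in the project.

```python
import re

# Multipliers for the abbreviated suffixes assumed to appear in the pasted text.
_SUFFIXES = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}

def parse_count(text):
    """Convert a string such as '12.3K subscribers' to an integer.

    Returns None when the text contains no number at all.
    """
    # A number (optionally with thousands separators or a decimal part),
    # immediately followed by an optional K/M/B suffix.
    match = re.search(r"([\d,]*\.?\d+)([KMB]?)\b", text.strip(), re.IGNORECASE)
    if match is None:
        return None
    number = float(match.group(1).replace(",", ""))
    return round(number * _SUFFIXES[match.group(2).upper()])
```

For example, `parse_count("4.5M subscribers")` gives `4500000`, while a cell with no digits gives `None` so it can be handled separately.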

Getting into the YouTube Data API

The initial data set was good enough to get insight into what data may be useful in further analysis. Now, we can improve the analysis simply by improving the quality of the data. By accessing the YouTube Data API, more data about the channels can be easily obtained. Beyond subscriber and video counts, this API can also provide creation dates, branding information, and more if the channel has that information available.

How the API Works

Accessing the YouTube Data API is fairly simple and works with any language or platform. Requests are written as HTTP requests. Here is an example query for channel data:

https://www.googleapis.com/youtube/v3/channels?key=[YOUR_API_KEY]&id=UCaDdlHvrBqEi8LVWIRpo6sA&part=snippet

This query requests some basic channel data from the channel “Sailing Cadoha”. The query is composed of a base that directs it across the internet to the server that will process the request. The YouTube Data API can process a lot more than data collection, so we have to tell it which type of data we want to list, in this case channel data. After the word channels comes a series of parameters that are separated by &. The first one is the key I generated in Google Cloud’s IAM service.

Next come the parameters specific to this kind of query. The first is the ID of the channel, followed by the parts you want. The above query only asks for the snippet, which provides only the basic data that may appear in a search on the website. Other parts may be requested as well, such as branding settings and other details that contribute to the SEO for the channel. The query is then sent out, and it returns with a page of JSON-formatted data.
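The structure of the example query can be sketched in Python using only the standard library. This is an illustration, not the project's code: `build_channel_url` and `fetch_channel` are hypothetical helper names, and the key is a placeholder you would replace with your own.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base of the query: the endpoint that lists channel resources.
BASE_URL = "https://www.googleapis.com/youtube/v3/channels"

def build_channel_url(api_key, channel_id, parts=("snippet",)):
    """Assemble the request URL from the key, the channel ID, and the parts."""
    params = {"key": api_key, "id": channel_id, "part": ",".join(parts)}
    # safe="," keeps the comma between several parts readable in the URL.
    return f"{BASE_URL}?{urlencode(params, safe=',')}"

def fetch_channel(api_key, channel_id, parts=("snippet",)):
    """Send the request and decode the JSON page that comes back."""
    with urlopen(build_channel_url(api_key, channel_id, parts)) as response:
        return json.load(response)
```

Calling `build_channel_url("YOUR_API_KEY", "UCaDdlHvrBqEi8LVWIRpo6sA")` reproduces the example URL above, and passing `parts=("snippet", "statistics")` asks for more than one part in a single request.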

Getting the Channel IDs

Now here is where I learned some interesting details about how YouTube works. Channels have IDs, titles, and usernames. When you see the channel on YouTube, that name is actually only the title. Each channel is then associated with a username, which may own many channels. The Data API I was working with only allowed data retrieval by username or ID, neither of which I had at this time. Unfortunately, I could not get the username for a channel by searching for its title, but I could get the channel ID. By taking the search results I got from the previous reconnaissance into the data, and searching for the channel ID of each channel individually, I could create a list of channel IDs to pull the complete data.

And now let’s complicate this algorithm further. Each search request on the YouTube Data API costs 100 quota units. Each project on Google Cloud is allowed to use up to 10,000 quota units per day. With a little over 500 channels to search for individually, I would end up using more than 50,000 of these quota units for a single run of this query. There are many ways to get around this: requesting more units, splitting the query up to run over the course of 5 or more days, and so on.
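The quota arithmetic is easy to check with a few lines of Python. The two constants are the figures quoted above; `search_budget` is a hypothetical helper name used only for this illustration.

```python
import math

SEARCH_COST = 100      # quota units per search request, as quoted above
DAILY_QUOTA = 10_000   # quota units per project per day, as quoted above

def search_budget(n_channels):
    """Return (total units needed, number of daily quotas that total spans)."""
    total_units = n_channels * SEARCH_COST
    quotas_needed = math.ceil(total_units / DAILY_QUOTA)
    return total_units, quotas_needed
```

With exactly 500 channels, `search_budget(500)` returns `(50000, 5)`: five days on one project, or five projects on one day. Any retries or refinements of the query push the number higher.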

In the end I chose to create 18 projects on Google Cloud across 3 accounts, and generate YouTube API keys across all of them. This gave me 180,000 quota units to work with and plenty of room to refine the query. To get the multiple keys to work I wrote an algorithm that would test if the key was active. If it received an error, it would move on to the next one. All the channel IDs would be retrieved after about a minute of running the code, and I would burn through 5 or 6 keys.
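A key-rotation loop of this kind might look like the Python sketch below. It is an assumption about the general shape, not the code from the project: `run_with_key_rotation`, `QuotaExhausted`, and the injected `request_fn` are all hypothetical names, and the real request function would need to translate the API's error response into that exception.

```python
class QuotaExhausted(Exception):
    """Raised by a request function when a key can no longer be used."""

def run_with_key_rotation(queries, api_keys, request_fn):
    """Run every query, moving on to the next key whenever one fails.

    request_fn(query, key) performs one request and either returns its
    result or raises QuotaExhausted.
    """
    results = {}
    key_index = 0
    for query in queries:
        while True:
            if key_index >= len(api_keys):
                raise RuntimeError("all API keys are exhausted")
            try:
                results[query] = request_fn(query, api_keys[key_index])
                break  # this query is done; keep using the same key
            except QuotaExhausted:
                key_index += 1  # retry the same query with the next key
    return results
```

Passing the request function in as an argument keeps the rotation logic separate from the HTTP details, so it can be tried out with a fake function before spending any real quota.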

Getting the Channel Data

Now that I had the channel IDs, getting their data would be a fairly trivial matter. Simple data retrieval from a single channel counted as only 1 quota unit, so there was no need to do anything fancy here. The hardest part was selecting the variables I wanted. I chose to pull three sections of data from the API: the “snippet”, “statistics”, and “branding settings”. Not all of the data from these 3 parts would be kept, though. Many channels lacked branding settings, so that part was ultimately discarded during the cleaning phase. In the end I kept the following variables:

| Variable | Type | Description |
|---|---|---|
| Channel ID | String | The ID of the channel. |
| Title | String | The name of the channel as it appears on the channel’s main page. |
| Description | String | A brief description provided by the channel. |
| Join Date | Date | The date the channel was created. |
| View Count | Numeric | The total number of views that the channel has accumulated. |
| Video Count | Numeric | The total number of videos currently published by the channel. |
| Subscriber Count | Numeric | The current number of subscribers. Only recorded to 3 significant figures. |
| Is Subscriber Count Hidden | Boolean | Does the channel have the subscriber count set to hidden? |

Table 1: Selected Variables to be collected.
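To show how the variables in Table 1 map onto the API response, here is a minimal Python sketch that flattens one channel item into a row. The field names (`publishedAt`, `viewCount`, `videoCount`, `subscriberCount`, `hiddenSubscriberCount`) are those of the channel resource in the YouTube Data API; the function name `channel_row` and the row keys are hypothetical choices for this illustration.

```python
def channel_row(item):
    """Flatten one channel resource into the variables listed in Table 1."""
    snippet = item.get("snippet", {})
    stats = item.get("statistics", {})

    def to_int(value):
        # The API returns counts as strings; a missing count stays None.
        return int(value) if value is not None else None

    return {
        "channel_id": item.get("id"),
        "title": snippet.get("title"),
        "description": snippet.get("description"),
        "join_date": snippet.get("publishedAt"),
        "view_count": to_int(stats.get("viewCount")),
        "video_count": to_int(stats.get("videoCount")),
        "subscriber_count": to_int(stats.get("subscriberCount")),
        "subscriber_count_hidden": stats.get("hiddenSubscriberCount"),
    }
```

Using `.get` throughout means a channel that lacks a part, or hides its subscriber count, still produces a complete row with `None` in the missing cells.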

From here the cleaning was fairly simple. All of the data is read in from the JSON as strings, so the individual types are identified from the YouTube Data API documentation and converted accordingly. The Join Date required some additional cleaning. To convert it into a Date/Time format that R would understand, the format has to be defined with a string. However, YouTube records the publishing of a channel or video to the second, or sometimes to the thousandth of a second. This kind of precision is unnecessary when discussing channel creations over the course of a little more than a decade. I decided to cut all of that off so that in the final data set, the precision is only down to the day.
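The cleaning described here was done in R. The same idea can be sketched in Python, assuming the timestamps arrive as ISO 8601 strings such as 2014-03-05T12:34:56Z, with or without fractional seconds. The function name `join_date` and the sample timestamp are illustrative only.

```python
from datetime import date

def join_date(published_at):
    """Keep only the calendar day of an ISO 8601 timestamp string."""
    # The first ten characters are always YYYY-MM-DD, whether or not the
    # timestamp carries fractional seconds afterwards.
    return date.fromisoformat(published_at[:10])
```

Slicing before parsing sidesteps the two different timestamp precisions entirely, so no format string has to account for the optional milliseconds.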

Handling Missed Channels

After the data had been mined and cleaned, I performed a quality check by looking for channels I already knew of in the data set. After combing through the data several times, I found that a few known sailing channels had not been included. I was able to identify 3 channels that I knew should have been included, and at this point I performed the same operations described above to obtain their channel IDs and then their data. I then created a new boolean variable called Omitted, which identifies these missing channels with the value TRUE. I may be missing more channels than the 3 I identified, but more complete data would require significant reworking of the query method described above.

Conclusion

The mining of the data was far from perfect. The method of collecting the channel IDs was far from efficient, and with more work I could have found a better method. The actual data gathered also contains significant bias. Part of the bias comes from the fact that the channel names used throughout the mining process were initially gathered through a web scrape done while signed into YouTube. The YouTube search is known to return personalized results. If I were to do the data collection again, I would at least make sure I was signed out.

To take this project further I would build a dashboard. The dashboard would be supplied with new data daily from a script running on a cloud server. As much as I would like to do that, it is a project for another time.

My Dad performed the analysis of this data, and that can be found on the SV EOTI blog here.

Links

GitHub: https://github.com/SimonLiles/sailing_channels

SV EOTI, YouTube Sailing Channels by the Numbers: https://sveoti.net/?p=8707