Role

ML Engineer

Duration

Autumn 2024

~ 4 months

Skills

Supervised Learning

Python

NLP Techniques

Project Management

Team

Atmaja Patil, Sally Hong, Siqi Deng, Zohreh Ashtarilarki, Khadija Dial

Advisors: Sally Goldman, Anushya Subbiah, Vivian Yang, and our studio TA Helenna Yin

Business Impact

Google is an established player in the entertainment industry. YouTube, as part of Google, aims to deliver content from individual creators to viewers across the globe.

This project is relevant to the entertainment industry, as it can help content creators gain insight into how well their content is performing as well as what measures they can take to improve their metrics. It can also give YouTube more information on how to improve its recommendation algorithm.

Goal

To predict which YouTube videos are likely to trend and go viral using the provided video metadata. We will employ supervised learning techniques, including decision trees, random forests, DNNs, and linear models, to make these predictions.

Timeline

With a little under 4 months, we were tasked with the challenge of building a comprehensive model to predict and understand what influences a YouTube video to trend or go viral. To streamline the process, we developed a plan to stay on track and ensure we accomplished all our objectives punctually.

Resources Leveraged

Data Understanding & Preparation

Our given dataset originates from Kaggle and contains over 268,787 video logs from 2022 to 2024. We had access to various video metadata, such as the published date, the video genre, the tags used within the video, and more. To get a better sense of the data and discover patterns, we started by creating general visualizations.

A heatmap showcasing which day of the week each category of video is posted on. Music-related videos seem to be mostly posted on Fridays, while sports videos seem to be posted on Sundays, as expected. The other video categories don't appear to correlate with any specific day.

One interesting pattern we discussed frequently was how the distribution of views was heavily skewed to the right. We realized later in the project that this would be one of the major roadblocks we would face when it came to accurately predicting view counts in our models.

A similar issue was also seen in the view count over time graph on the right. Views increase at an exponential rate, which also suggests that in our dataset each video's views are recorded cumulatively rather than as discrete daily or periodic counts.

Label Selection

log(view_count + 1) / days_since_publication

This is done to distinguish videos with lots of views that have been published for a longer time from those that were published the same day (counted as 1 day since publication).
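As a rough sketch, the label above could be computed like this (the function and argument names are illustrative, not from the project's actual code):

```python
import numpy as np

# Sketch of the project's label, a "log view velocity": log view count
# normalized by the video's age in days.
def log_view_velocity(view_count, days_since_publication):
    # +1 inside the log avoids log(0) for videos with zero views;
    # same-day uploads are assumed to count as 1 day since publication.
    return np.log(view_count + 1) / days_since_publication

# A week-old video with 1M views vs. a same-day video with 10k views:
week_old = log_view_velocity(1_000_000, 7)
same_day = log_view_velocity(10_000, 1)
```

Without the division, the raw log view count would always rank the older video higher, even when the newer video is accumulating views faster.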

Feature Engineering


To prepare the features for our respective models, we:

  • One-hot encoded the following features: the day of the week the video was published on, the top 100 channels, and the channel category

  • Normalized the numerical features (days since the video was published, length of description, length of title, and the top 30 tags) to handle infinite values

  • Performed text vectorization on the video description and title

  • Created derived features: the average likes, view count, and comment count of a channel

  • Replaced all missing data with average values
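A minimal scikit-learn sketch of this kind of preparation pipeline, using hypothetical column names and toy rows rather than the real dataset:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows; the real dataset has many more fields and rows.
df = pd.DataFrame({
    "publish_day": ["Friday", "Sunday", "Friday"],
    "category": ["Music", "Sports", "Music"],
    "days_since_published": [7.0, 30.0, None],  # missing value to impute
    "description_length": [120.0, 45.0, 300.0],
})

numeric = ["days_since_published", "description_length"]
categorical = ["publish_day", "category"]

prep = ColumnTransformer([
    # missing numeric values -> column mean, then standardize
    ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                      ("scale", StandardScaler())]), numeric),
    # one-hot encode the categorical features
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
], sparse_threshold=0)  # force a dense array output

# 2 scaled numeric columns + 2 publish_day dummies + 2 category dummies
X = prep.fit_transform(df)
```

Wrapping the steps in a single transformer keeps the same imputation means and scaling statistics fit on the training data when transforming validation and test rows.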


Feature Selection

Categories:

  1. People & Blogs
  2. Gaming
  3. Entertainment
  4. Music
  5. Howto & Style
  6. Education
  7. Comedy
  8. Science & Technology
  9. Film & Animation
  10. News & Politics
  11. Sports
  12. Travel & Events
  13. Pets & Animals
  14. Autos & Vehicles
  15. Nonprofits & Activism

Categorical Features:

  • Top 100 channels

  • Channel category

  • Day of the week

Numerical Features:

  • Days since published

  • Squared days since published

  • Square root of days since published

  • Log days since published

Feature Leakage

As we began to train our models, we learned that the use of comments, likes, views, or any other direct engagement metrics would classify as feature leakage, and therefore we should not use these metrics in training our models. This is because it wouldn't make sense to have access to likes and comments before a video starts gaining traction, as these metrics are generated as a result of the video's popularity over time. Using them as features would give the model information it wouldn't realistically have at prediction time, leading to overly optimistic results and poor generalization to unseen data.


Instead, we focused on features that are available at the time of video upload, such as the video's category, description length, presence of certain keywords, and the day and time of publishing. These features ensure that our model remains robust and applicable in real-world scenarios where engagement data is unavailable prior to the video's release.

Time-Based Data Split

Training set: Between Jan 1, 2023 & Dec 31, 2023

Validation set: Between Dec 31, 2023 & Feb 18, 2024

Test set: After Feb 18, 2024
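Assuming a `published_at` datetime column (the name is a placeholder), the chronological split might look like:

```python
import pandas as pd

# Toy rows standing in for the real video logs.
df = pd.DataFrame({
    "published_at": pd.to_datetime(
        ["2023-03-15", "2023-11-02", "2024-01-20", "2024-03-01"]),
    "label": [1.2, 0.8, 2.1, 0.5],
})

# Splitting by time (rather than randomly) prevents the model from
# "seeing the future" during training.
train = df[(df["published_at"] >= "2023-01-01") & (df["published_at"] <= "2023-12-31")]
val = df[(df["published_at"] > "2023-12-31") & (df["published_at"] <= "2024-02-18")]
test = df[df["published_at"] > "2024-02-18"]
```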

Model Comparison

— Trying out various models with the same features to find the most optimal model

Linear Regression Model

Using Ridge (an L2 regularization technique)

I worked on a linear regression model, specifically using Ridge, a regularization technique that penalizes large coefficients to keep the model more robust. This helped the model guard against overfitting.

Once the model was fit, I got a validation RMSE of 0.25 and a test RMSE of 0.273. An ideal RMSE should be closer to zero to reflect higher accuracy, and in this case the linear model did a pretty good job! On the predicted vs. actual log view velocity graph, you can additionally see how accurate the results are. One thing to note, however, is that at the bottom-left and top-right corners of the chart the model begins to struggle to accurately predict view velocity. This is a limitation of linear regression models, as they assume all patterns in the data are linear and struggle to capture more complex relationships.
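A minimal sketch of fitting Ridge and computing RMSE, on synthetic stand-in data rather than the real engineered features:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
w = np.array([0.5, -0.2, 0.1, 0.0, 0.3])  # made-up "true" coefficients

X_train = rng.normal(size=(200, 5))
y_train = X_train @ w + 0.05 * rng.normal(size=200)
X_val = rng.normal(size=(50, 5))
y_val = X_val @ w + 0.05 * rng.normal(size=50)

# alpha sets the strength of the L2 penalty on the coefficients
model = Ridge(alpha=1.0).fit(X_train, y_train)
rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
```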



Regression Decision Tree Model 

In comparison to the linear regression model, the regression decision tree developed by Sally performed better on both the test and validation sets. Additionally, looking at the feature importance bar graph, we can see that days_since_published and its logged, squared, and square root versions are heavily responsible for the results we see.
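The dominance of the time-based features can be illustrated with a small synthetic sketch: when the target depends mostly on video age, a tree concentrates its importance on the age-derived columns (the data and column order here are made up):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
days = rng.uniform(1, 365, size=300)
unrelated = rng.normal(size=300)
# target driven almost entirely by video age
y = np.log(days) + 0.1 * rng.normal(size=300)

# columns: days, days^2, sqrt(days), log(days), unrelated noise
X = np.column_stack([days, days ** 2, np.sqrt(days), np.log(days), unrelated])
tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)
importances = tree.feature_importances_
```

In a sketch like this, nearly all of the importance mass lands on the four age-derived columns, mirroring what we saw in our bar graph.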





Random Forest Model

Zohreh also chose to test a random forest model because of its ability to capture nonlinear relationships and reduce the risk of overfitting. We trained the random forest with 100 estimators. When comparing it to the other models, like linear regression and the decision tree, the random forest outperformed them, delivering the lowest root mean square error on both the validation and test datasets, as shown in the table.
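A sketch of the random forest setup on synthetic nonlinear data (the real model was, of course, trained on the engineered features):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(400, 4))
# a nonlinear target that a plain linear model would struggle with
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=400)

X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

# 100 estimators, matching the configuration described above
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
rmse = mean_squared_error(y_test, forest.predict(X_test)) ** 0.5
```

Averaging many decorrelated trees is what lets the forest fit the curvature here while keeping variance lower than a single deep tree.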






XGBoost Model

As XGBoost is an ensemble model, Atmaja was curious whether it would be more robust than the baseline decision tree. In comparison to all our other models, XGBoost only outperformed the linear regression model on the validation and test sets. This came as a surprise to us and led us to ask: is it worth using a less interpretable model like this over the baseline decision tree?
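To sketch the gradient-boosting idea behind XGBoost without the project's actual code, here is scikit-learn's GradientBoostingRegressor as a stand-in (the project itself used XGBoost), again on synthetic data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
X = rng.uniform(-2, 2, size=(400, 4))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=400)
X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

# each new tree is fit to the residual errors of the ensemble so far,
# unlike a random forest, which averages independently grown trees
boost = GradientBoostingRegressor(n_estimators=100, max_depth=3,
                                  random_state=0).fit(X_train, y_train)
rmse = mean_squared_error(y_test, boost.predict(X_test)) ** 0.5
```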





Neural Network Model 

Four layers:

  • First Dense layer with 128 units and tanh activation

  • Second Dense layer with 64 units and tanh activation

  • Third Dense layer with 32 units and tanh activation

  • Output Dense layer with 1 unit (defaults to linear activation)
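Assuming this was built with Keras-style Dense layers, a plain-NumPy forward-pass sketch of the same architecture looks like this (random weights; the input width of 20 is a placeholder for the real feature count):

```python
import numpy as np

rng = np.random.default_rng(4)
sizes = [20, 128, 64, 32, 1]  # placeholder input width + the four layers above
weights = [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    # hidden layers use tanh; the output layer stays linear for regression
    for W, b in zip(weights[:-1], biases[:-1]):
        x = np.tanh(x @ W + b)
    return x @ weights[-1] + biases[-1]

batch = rng.normal(size=(8, 20))  # a batch of 8 feature vectors
preds = forward(batch)
```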


Siqi also decided to try a neural network model to see if we would be able to uncover more underlying patterns. Siqi started out with a simple neural network with a few layers and progressively added more before it became overfit. Around seven layers, we realized the model was taking hours to train in Colab and that there wasn't much significant improvement in performance, even though we hadn't yet observed overfitting. When comparing the SGD and Adam optimizers, we went with Adam because of its adaptive learning rates. At around 20 epochs, the model reached a fairly consistent root mean square error for the training and validation loss. With more training and deeper models, it could potentially improve, because as of now the training error is still high, which suggests we're far from overfitting the data.

Insights and Key Findings

Our analysis shows that time-based features, the video category, top-performing channels, and the number of tags were the most significant predictors of video engagement. The random forest model achieved the lowest RMSE on both the validation and test sets, showing its robustness and demonstrating better generalization to unseen data. These findings highlight the importance of feature engineering and of tailoring models to specific stages of evaluation.


Throughout this project, we learned a lot going through the whole process, from data exploration all the way to evaluating different models. Starting off with data exploration, we learned about the different types of graphs and how they can communicate different stories about the data. This helped us tremendously in determining which features would be best for our model and the scope of our problem.


Furthermore, we had the opportunity to think deeply about feature leakage and how it might occur in different situations, whether intentional or not. Choosing a label, and thinking through intricacies such as logging a variable before dividing it by another variable versus logging the ratio as a whole, also made a difference in the predictive power of our model. Lastly, we spent most of our time on the project tuning various models and trying a variety of techniques. We started with each of us working individually on different models, which led to slightly different results. However, we were able to align with each other by ensuring that we were using the exact same label and features to better compare the errors.



Where do we venture from here?

As a team we've done a lot, but there's definitely much more we could do to further improve our model performance and continue to explore more complex patterns in the data. Our next steps are to:

  1. Continue to train and fine-tune the models: particularly focusing on the best-performing models, like the neural network and random forest, and also using datasets from other countries.

  2. Look at other countries' datasets: Early on in the project we imported the India dataset to explore how it performs on the models we trained. Would the feature importances look any different? Are different categories more likely to trend?

  3. Explore more advanced NLP techniques: We can also explore more NLP techniques to better understand the benefit of specific video titles or descriptions and how they impact how likely a video is to trend.



Information we could have benefitted from

This dataset was amazing; however, there are a number of things this dataset, and the project as a whole, could have benefited from to better address our goal of predicting whether a video is likely to trend. A great example is co-watch information. Having details on whether someone watched a given video and how likely they are to watch similar videos could've strengthened our models' performance.


We could've also benefited from a true time-series view of creators and their content. Many content creators on YouTube start small and grow much bigger over time. It would be helpful to better understand how they grew and which videos led to their growth. Information like this, relating each of a creator's videos to one another, could really benefit model performance.


We could additionally have benefited from seeing the percentage of people who finished watching a video, or which parts of the video they watched most. It could help us better understand which content is doing well and which videos are actually trending. Are people just watching the first five seconds of a video or really watching the entire thing? We could also look at the length of the video itself: are longer videos doing better, or shorter ones? These are some of the things that, from a bigger-picture perspective, could benefit our overall project goal, but this is all information that wasn't provided in the dataset we had.
