
Role
ML Engineer
Duration
Autumn 2024
~ 4 months
Skills
Supervised Learning
Python
NLP Techniques
Project Management
Team
Atmaja Patil, Sally Hong, Siqi Deng, Zohreh Ashtarilarki, Khadija Dial
Advisors: Sally Goldman, Anushya Subbiah, Vivian Yang, and our studio TA Helenna Yin
Business Impact
Goal
To predict which YouTube videos are likely to trend and go viral using the provided video metadata. We will employ supervised learning techniques, including decision trees, Random Forest, DNNs, and linear models, to make these predictions.
Timeline
With a little under 4 months, we were tasked with the challenge of building a comprehensive model to predict, and understand what influences, whether a YouTube video trends or goes viral. To streamline the process we developed a plan to stay on track and ensure we accomplished all our objectives punctually.

Data Understanding & Preparation
Our dataset originates from Kaggle and contains 268,787 video logs from 2022 to 2024. We had access to various video metadata such as the published date, the video genre, the tags used within the video, and more. To gain a better sense of the data and discover patterns, we started by creating general visualizations.

A heatmap showing which day of the week each category of video tends to be posted on. Music-related videos are mostly posted on Fridays, while sports videos are mostly posted on Sundays, as expected. The other video categories don't show a strong correlation with any particular day.
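A heatmap like this can be built by counting uploads per (category, weekday) pair. As a minimal sketch, assuming hypothetical column names (`category`, `published_at`) that may differ from the actual Kaggle schema:

```python
import pandas as pd

# Tiny stand-in for the real dataset; column names are assumptions.
df = pd.DataFrame({
    "category": ["Music", "Music", "Sports", "Sports", "Comedy"],
    "published_at": pd.to_datetime([
        "2023-06-02", "2023-06-09",   # Fridays
        "2023-06-04", "2023-06-11",   # Sundays
        "2023-06-07",                 # a Wednesday
    ]),
})

# Count uploads per (category, weekday); normalizing by row turns counts
# into each category's share of uploads on that day.
day = df["published_at"].dt.day_name()
heat = pd.crosstab(df["category"], day, normalize="index")
print(heat)
```

Passing the resulting table to a plotting library (e.g. seaborn's `heatmap`) yields the figure described above.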

One interesting pattern we discussed frequently was that the distribution of views was heavily skewed to the right. Later in the project, we realized this would be one of the major roadblocks to accurately predicting view counts with our models.

A similar issue appeared in the view count over time graph on the right. Views increase at an exponential rate, which also suggests that in our dataset each video's views were collected cumulatively rather than as discrete daily or periodic snapshots.
Label Selection

log(view_count + 1) / days_since_publication
This is done to distinguish videos that accumulated many views over a long time from videos that earned the same views on the day they were published (so 1 day since publication).
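Computing the label is a one-liner; the sketch below (column names are assumptions) shows how two videos with identical raw views get very different velocity scores:

```python
import numpy as np
import pandas as pd

# Two videos with the same raw view count but different ages.
df = pd.DataFrame({
    "view_count": [1_000_000, 1_000_000],
    "days_since_publication": [1, 100],
})

# log(view_count + 1) / days_since_publication: the +1 keeps log defined
# for zero-view videos, and dividing by age rewards fast growth.
df["log_view_velocity"] = (
    np.log(df["view_count"] + 1) / df["days_since_publication"]
)
print(df["log_view_velocity"].tolist())
```

The day-old video scores roughly 100x higher than the 100-day-old one, which is exactly the distinction the label is meant to capture.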
Feature Engineering
Feature Selection
Categories:
'People & Blogs'
'Gaming'
'Entertainment'
'Music'
'Howto & Style'
'Education'
'Comedy'
'Science & Technology'
'Film & Animation'
'News & Politics'
'Sports'
'Travel & Events'
'Pets & Animals'
'Autos & Vehicles'
'Nonprofits & Activism'
Categorical Features:
Top 100 channels
Channel Category
Day of the week
Numerical Features:
Days since published
Squared days since published
Square root of days since published
Log days since published
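The numerical features above are all transforms of a single raw column. A minimal sketch (the raw column name is an assumption):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"days_since_published": [1, 7, 30, 365]})

# Polynomial and log transforms of video age let linear models capture
# non-linear relationships between age and view velocity.
df["days_squared"] = df["days_since_published"] ** 2
df["days_sqrt"] = np.sqrt(df["days_since_published"])
df["days_log"] = np.log(df["days_since_published"] + 1)
print(df)
```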
Feature Leakage
As we began to train our models, we learned that the use of comments, likes, views, or any other direct engagement metrics would classify as feature leakage, and therefore we should not use these metrics in training our models. This is because it wouldn't make sense to have access to likes and comments before a video starts gaining traction, as these metrics are generated as a result of the video's popularity over time. Using them as features would give the model information it wouldn't realistically have at prediction time, leading to overly optimistic results and poor generalization to unseen data.
Instead, we focused on features that are available at the time of video upload, such as the video's category, description length, presence of certain keywords, and the day and time of publishing. These features ensure that our model remains robust and applicable in real-world scenarios where engagement data is unavailable prior to the video's release.
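In practice this amounts to dropping the engagement columns before training. A sketch, with hypothetical column names:

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["Music"],
    "day_of_week": ["Friday"],
    "view_count": [5000],   # leaky: exists only after the video is popular
    "likes": [300],         # leaky
    "comment_count": [40],  # leaky
})

# Engagement metrics are a *result* of popularity, not a predictor
# available at upload time, so they are excluded from the features.
LEAKY = ["view_count", "likes", "comment_count"]
X = df.drop(columns=LEAKY)
print(list(X.columns))  # → ['category', 'day_of_week']
```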
Time-Based Data Split
Training set: Between Jan 1, 2023 & Dec 31, 2023
Validation set: Between Dec 31, 2023 & Feb 18, 2024
Test set: After Feb 18, 2024
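The split above can be implemented by filtering on the publish date, so the model is always evaluated on videos published after anything it trained on. A sketch with stand-in data:

```python
import pandas as pd

df = pd.DataFrame({
    "published_at": pd.to_datetime(
        ["2023-03-01", "2023-11-15", "2024-01-20", "2024-03-01"]
    ),
    "label": [0.4, 0.9, 0.2, 0.7],
})

# Chronological split: train on 2023, validate on Jan 1 - Feb 18 2024,
# test on everything after Feb 18 2024.
train = df[df["published_at"] <= "2023-12-31"]
val = df[(df["published_at"] > "2023-12-31")
         & (df["published_at"] <= "2024-02-18")]
test = df[df["published_at"] > "2024-02-18"]
print(len(train), len(val), len(test))  # → 2 1 1
```

A random shuffle would leak future information into training, which is why the split is time-based.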
Model Comparison
I worked on a linear regression model, specifically Ridge regression, a regularization technique that penalizes large coefficients to keep the model more robust. This helped the model avoid overfitting to individual data points.
Once the model was fit, I got a validation RMSE of 0.25 and a test RMSE of 0.273. An ideal RMSE is closer to zero, reflecting higher accuracy, so the linear model did a pretty good job! On the predicted vs. actual log view velocity graph you can additionally see how accurate the results are. Note, however, that in the bottom-left and top-right corners of the chart the model begins to struggle to accurately predict view velocity. This is a limitation of linear regression models: they assume all patterns in the data are linear and struggle to capture more complex relationships.
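Ridge regression adds an L2 penalty `alpha * ||w||^2` to the least-squares objective. As a self-contained sketch on synthetic stand-in data (not our actual features or results), here it is in closed form, which matches scikit-learn's `Ridge` without an intercept:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # stand-in for engineered features
true_w = np.array([0.5, -0.2, 0.1, 0.0, 0.3])
y = X @ true_w + rng.normal(scale=0.1, size=200)

X_tr, y_tr, X_te, y_te = X[:150], y[:150], X[150:], y[150:]

# Closed-form ridge: the alpha * I term shrinks coefficients toward
# zero, trading a little bias for lower variance (less overfitting).
alpha = 1.0
w = np.linalg.solve(X_tr.T @ X_tr + alpha * np.eye(5), X_tr.T @ y_tr)

rmse = np.sqrt(np.mean((X_te @ w - y_te) ** 2))
print(round(rmse, 3))
```

With `alpha = 0` this reduces to ordinary least squares; increasing `alpha` shrinks the coefficients harder.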



Random Forest Model


XGBoost Model



Neural Network Model



As a team we've accomplished a lot, but there's definitely more we could do to further improve our model performance and continue to explore more complex patterns in the data. Our next steps are to:
Continue to train and fine-tune the models: particularly focusing on the best-performing models, like the neural network and random forest, and also training on datasets from other countries.
Look at other countries' datasets: Early in the project we imported the India dataset to explore how it performs on the models we trained. Would the feature importances look different? Are different categories more likely to trend?
Explore more advanced NLP techniques: We can also explore more NLP techniques to better understand how specific video titles or descriptions impact how likely a video is to trend.
