Interestingly, there’s no overlap between the categories. So even though it might take some time for a music clip to get into the trending, it’s more likely to stay there for longer. The same goes for movie trailers and other entertainment content.
So we know that the live-comedy shows get into the trending the fastest and music and entertainment videos stay there the longest. But has it always been the case? To answer this question, we need to create some rolling aggregates. Let’s answer three main questions in this section:
- What is the total number of trending videos per category per month?
- What is the number of new videos per category per month?
- How do the categories compare when it comes to views over time?
Total Number of Monthly Trending Videos per Category
First, let’s look at the total number of videos per category per month. To get this statistic, we need to use .groupby_dynamic()
method that allows us to group by the date column (specified as index_column
) and any other column of choice (specified as by
parameter). The grouping frequency is controlled according to the every
parameter.
trending_monthly_stats = df.groupby_dynamic(
index_column="trending_date", # date column
every="1mo", # can also me 1w, 1d, 1h etc
closed="both", # including starting and end date
by="category_id", # other grouping columns
include_boundaries=True, # showcase the boudanries
).agg(
pl.col("video_id").n_unique().alias("videos_number"),
)print(trending_monthly_stats.sample(3))
You can see the resulting DataFrame above. Very nice property of Polars is that we can output the boundaries to sense check the results. Now, let’s do some plotting to visualise the patterns.
plotting_df = trending_monthly_stats.filter(pl.col("category_id").is_in(top_categories))sns.lineplot(
x=plotting_df["trending_date"],
y=plotting_df["videos_number"],
hue=plotting_df["category_id"],
style=plotting_df["category_id"],
markers=True,
dashes=False,
palette='Set2'
)
plt.title("Total Number of Videos in Trending per Category per Month")
From this plot we can see that Music has the largest share of Trending stating from 2018. This might indicate some strategic shift within YouTube to become the go-to platform for music videos. Entertainment seems to be on the gradual decline together with People & Blogs and Howto & Style categories.
Number of New Monthly Trending Videos per Category
The query is exactly the same, except now we need to provide as index_column
the first the date when a video got into Trending. Would be nice to create a function here, but I’ll leave this as an exercise for a curious reader.
trending_monthly_stats_unique = (
time_to_trending_df.sort("first_day_in_trending")
.groupby_dynamic(
index_column="first_day_in_trending",
every="1mo",
by="category_id",
include_boundaries=True,
)
.agg(pl.col("video_id").n_unique().alias("videos_number"))
)plotting_df = trending_monthly_stats_unique.filter(pl.col("category_id").is_in(top_categories))
sns.lineplot(
x=plotting_df["first_day_in_trending"],
y=plotting_df["videos_number"],
hue=plotting_df["category_id"],
style=plotting_df["category_id"],
markers=True,
dashes=False,
palette='Set2'
)
plt.title(" Number of New Trending Videos per Category per Month")
Here we get an interesting insights — the number of new videos by Entertainment and Music is roughly equal throughout the time. Since Music videos stay in Trending much longer, they are overrepresented in the Trending counts, but when these videos are deduped this pattern disappears.
Running Average of Views per Category
As the last step of this analysis, let’s compare two most popular categories (Music and Entertainment) according to their views over time. To perform this analysis, we’re going to use the 7 day running average statistic to visualise the trends. To calculate this rolling statistic Polars has a handy method called .groupby_rolling()
. Before applying it though, let’s sum up all the views by category_id
and trending_date
and then sort the DataFrame accordingly. This format is required to correctly calculate the rolling statistics.
views_per_category_date = (
df.groupby(["category_id", "trending_date"])
.agg(pl.col("views").sum())
.sort(["category_id", "trending_date"])
)
Once the DataFrame is ready, we can use .groupby_rolling()
method to create the rolling average statistic by specifying 1w
in the period argument and creating an average expression in the .agg()
method.
# Calculate rolling average
views_per_category_date_rolling = views_per_category_date.groupby_rolling(
index_column="trending_date", # Date column
by="category_id", # Grouping column
period="1w" # Rolling length
).agg(
pl.col("views").mean().alias("rolling_weekly_average")
)# Plotting
plotting_df = views_per_category_date_rolling.filter(pl.col("category_id").is_in(['Music', 'Entertainment']))
sns.lineplot(
x=plotting_df["trending_date"],
y=plotting_df["rolling_weekly_average"],
hue=plotting_df["category_id"],
style=plotting_df["category_id"],
markers=True,
dashes=False,
palette='Set2'
)
plt.title("7-day Views Average")
According to the 7-day rolling average views, Music completely dominates the Trending tab and starting from February 2018 the gap between these two categories has increased massively.
After finishing this post and following along the code you should get a much better understanding of advanced aggregate and analytic functions in Polars. In particular, we’ve covered:
- Basics of working with
pl.datetime
.groupby()
aggregations with multiple arguments- The use of
.over()
to create aggregates over a specific group - The use of
.groupby_dynamic()
to generate aggregates over time windows - The use of
.groupby_rolling()
to generate rolling aggregates over period
Armed with this knowledge you should be able to perform almost every analytical task you have at the lightning speed.
You might have felt that some of this analysis felt very ad-hoc and you would be right. The next part is going to address exactly this topic — how to structure and create data processing pipelines. So stay tuned!
Not a Medium Member yet?
This post originally appeared on TechToday.