What is content-based filtering?


Leah Kim

Have you ever watched one YouTube video on, say, ASMR cake eating? You browse some more videos, and when you come back to the main landing page, you notice that all the top videos recommended to you are now cake ASMR videos. You see more videos from the same creator you watched eat the cake, and you see videos from other creators eating a huge variety of cakes in forms you never knew existed. What?? How did YouTube know exactly what you’ve just started to like?

This is the magic of content-based filtering. 

If you enjoy watching YouTube, Netflix, or any content from streaming services, chances are you’re well familiar with the concept of content-based filtering.  

Overview

Recommendation systems provide personalized suggestions by learning about users’ interests. Based on past behaviors, these systems aim to predict future activity. 

There are many types of recommendation systems that serve various business models, but content-based filtering is a classic algorithm that finds new items similar to past purchases.

You can read a general overview of product recommendation engines here to familiarize yourself with recommendation systems, and I also talk about collaborative filtering here.

Content-based filtering considers item features and characteristics to find similar items that you might like. For example, if you recently purchased a notebook on Amazon, you might be recommended more notebooks to add to your shopping cart. 

Basically, content-based filtering looks at what you liked in the past to predict your next purchase. This means that the algorithm doesn’t consider other users; rather, the focus is on the items themselves. Products are compared side by side to find similarities, using the keywords and attributes assigned to them. For example, say you watched the movie Titanic starring Leonardo DiCaprio, and the next recommended movie in the queue is The Wolf of Wall Street, which also stars DiCaprio. Although the two movies are not in the same genre, they are similar in that they both star Leonardo DiCaprio. If the actor is tagged as an item attribute of both movies, the algorithm can predict that you might enjoy the second movie regardless of genre.
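To make the Titanic example concrete, here is a minimal sketch of comparing items by their tags. The tag names, the `MOVIE_TAGS` catalog, and both helper functions are made up for illustration, and counting shared tags is only a simple stand-in for the similarity measures a real system would use.

```python
# Hypothetical attribute tags for illustration; a real catalog supplies its own.
MOVIE_TAGS = {
    "Titanic": {"actor:Leonardo DiCaprio", "genre:romance", "genre:drama"},
    "The Wolf of Wall Street": {"actor:Leonardo DiCaprio", "genre:comedy", "genre:crime"},
    "Toy Story": {"actor:Tom Hanks", "genre:animation", "genre:comedy"},
}


def shared_tags(title_a, title_b):
    """Return the attribute tags two movies have in common."""
    return MOVIE_TAGS[title_a] & MOVIE_TAGS[title_b]


def most_similar(title):
    """Rank the other movies by how many tags they share with `title`."""
    others = [t for t in MOVIE_TAGS if t != title]
    return sorted(others, key=lambda t: len(shared_tags(title, t)), reverse=True)


# The two DiCaprio movies share no genre tag, but they do share the actor tag.
print(shared_tags("Titanic", "The Wolf of Wall Street"))
print(most_similar("Titanic"))
```

Notice that no other user's behavior appears anywhere in the sketch: the ranking comes only from the items' own attributes.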

The Math Involved

Content-based filtering uses similarity algorithms, which are machine learning methods that measure how alike two or more items are. These algorithms predict what users might like by comparing candidate items with items the users have already seen.

The first step is to gather the items a user has used or purchased in the past and build an item profile for each one, which contains data about that item’s features and characteristics. These item profiles, together with signals such as clicks, ratings, wish lists, likes, saved items, and past purchases, are then used to create a user profile that represents what the user likes and prefers. The algorithm only considers your data, so all recommendations are based on your own purchases and preferences.

These profiles are represented as vectors in a high-dimensional space, with each feature corresponding to a dimension. 
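Here is a rough sketch of what "profiles as vectors" can look like. The `FEATURES` vocabulary and the tags are hypothetical, and averaging the liked items into a user profile is just one simple choice I am assuming here, not the only way to build one.

```python
# Hypothetical feature vocabulary: each feature is one dimension of the space.
FEATURES = [
    "genre:romance",
    "genre:drama",
    "genre:comedy",
    "genre:crime",
    "actor:Leonardo DiCaprio",
]


def item_vector(tags):
    """Binary item profile: 1.0 where the item has the feature, else 0.0."""
    return [1.0 if feature in tags else 0.0 for feature in FEATURES]


def user_vector(liked_item_vectors):
    """User profile as the per-dimension average of the liked item profiles."""
    count = len(liked_item_vectors)
    return [sum(column) / count for column in zip(*liked_item_vectors)]


titanic = item_vector({"genre:romance", "genre:drama", "actor:Leonardo DiCaprio"})
wolf = item_vector({"genre:comedy", "genre:crime", "actor:Leonardo DiCaprio"})

# The user liked both movies, so the shared actor dimension gets the most weight.
user = user_vector([titanic, wolf])
print(user)
```

A real catalog would have far more than five dimensions, which is where the "high-dimensional" part comes from.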

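A common similarity measure used for content-based filtering is cosine similarity. It compares two vectors, a user profile and an item profile, by taking the cosine of the angle between them. This is the formula:

$$\cos(\theta) = \frac{u \cdot v}{\|u\| \, \|v\|}$$

Here $u$ is the user profile vector, $v$ is the item profile vector, $u \cdot v$ is their dot product, and $\|u\|$ and $\|v\|$ are their lengths. A value close to 1 means the two vectors point in nearly the same direction.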

The formula is applied to the user profile and each item in the set, producing one similarity score per item. Ultimately, the item with the highest score is recommended to the user.
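The scoring step can be sketched in a few lines of plain Python. The three-dimensional vectors, the catalog, and the `recommend` helper are assumptions made up for this example; production systems typically use a numerical library and many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(user_profile, catalog, top_n=1):
    """Score every item against the user profile and return the best matches."""
    scored = [(name, cosine_similarity(user_profile, vec)) for name, vec in catalog.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Assumed dimensions for this toy example: [romance, comedy, action]
user_profile = [1.0, 1.0, 0.0]
catalog = {
    "rom-com": [1.0, 1.0, 0.0],
    "action comedy": [0.0, 1.0, 1.0],
    "action thriller": [0.0, 0.0, 1.0],
}

for name, score in recommend(user_profile, catalog, top_n=3):
    print(f"{name}: {score:.2f}")
```

Because cosine similarity looks at the angle rather than the length of the vectors, a user with two purchases and a user with two hundred can be scored on the same scale.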

Concepts in linear algebra are very important to understanding the math behind machine-learning algorithms. These are some resources that dive deeper into the math: 

Why content-based filtering is so good: 

The algorithm only cares about you. It only takes into consideration what you have enjoyed or purchased in the past. This means that companies don’t have to worry about insufficient data to make personalized recommendations to their customers. As long as there are a couple of purchases (or even one), it is possible to create good recommendations. 

For example, say you searched for a very specific product such as a cinnamon-roll-flavored lotion from Kansas. Very niche, right? Content-based filtering will recognize your preferences and find similar products you might like but otherwise wouldn’t notice. 

It works for the tiniest of businesses. If you’re familiar with collaborative filtering, you might know that it’s challenging for businesses to generate recommendations if there are too few users or purchases. Because recommendation systems rely on past activity to make decisions, most algorithms use data from multiple users, which doesn’t work well for communities that have just launched and are still working to acquire new users.

However, content-based filtering is fairly easy to get started with. Of course, the algorithm does need some data to work with, but with little input its recommendations are often better than what other recommender systems can generate.

But… content-based filtering isn’t perfect

It’s boring. Content-based filtering is great for finding new but similar items for you. For example, if you love rom-coms, this algorithm will do a fantastic job finding more rom-coms for you. But it’s not the most diverse. For example, if you bought a wooden spoon from Amazon, chances are you won’t need variations of that wooden spoon…unless your spoon accidentally breaks into pieces. Also, circling back to the rom-com example: if you love rom-coms, would you really need a recommendation system to find more rom-coms for you? You have probably already watched the movies that the recommender suggests. It might be better to ask your friends for recommendations. They know you well enough to know what you might like, but may also throw in some surprises that you end up loving, and this is exactly what collaborative systems do.

It’s not always the most accurate. Content-based filtering relies on the attributes and features of the items. These items may not always be tagged correctly, and it takes some level of domain knowledge to ensure accuracy. Have you ever ordered a huge, human-size plush online, only to receive a tiny stuffed animal in the mail instead? The product may have been tagged incorrectly, and content-based filtering isn’t immune to these shortcomings.

Additionally, it might be hard to stay up to date with user interests. These recommendations are based purely on historical activity, so there isn’t enough information to expand into other categories of items.

Conclusion

Content-based filtering is a great recommendation system, especially for smaller businesses that want to incorporate personalization into their customer experience (CX) but have limited data to work with.

However, we’ve seen that content-based filtering may not be the best system for larger businesses that are constantly expanding. It would be extremely hard to keep track of every product and tag each item based on its features. Now that you know the various techniques for making product recommendations, consider: what if you combined these techniques, such as content-based and collaborative filtering, to make a hybrid recommendation system?

Boom. I think your business will find itself a very accurate recommender that can surprise customers with diverse but effective suggestions. 
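One simple way to picture a hybrid is a weighted blend of the two scores. This is only a sketch under my own assumptions: the `hybrid_scores` helper, the `alpha` weight, and all the score values are invented for illustration, and real hybrid systems can combine their models in many other ways.

```python
def hybrid_scores(content_scores, collaborative_scores, alpha=0.5):
    """Blend two score dictionaries; alpha is the weight given to content-based scores.

    Items missing from one source are treated as scoring 0.0 there.
    """
    items = set(content_scores) | set(collaborative_scores)
    return {
        item: alpha * content_scores.get(item, 0.0)
        + (1 - alpha) * collaborative_scores.get(item, 0.0)
        for item in items
    }


# Made-up scores for a rom-com fan: content-based favors more rom-coms,
# while collaborative filtering surfaces a thriller that similar users liked.
content_scores = {"rom-com A": 0.9, "rom-com B": 0.8, "thriller C": 0.1}
collaborative_scores = {"rom-com A": 0.2, "rom-com B": 0.4, "thriller C": 0.9}

for alpha in (0.5, 0.3):
    blended = hybrid_scores(content_scores, collaborative_scores, alpha=alpha)
    best = max(blended, key=blended.get)
    print(f"alpha={alpha}: top pick is {best}")
```

Lowering `alpha` leans more on what similar users enjoyed, which is where the surprising picks come from; raising it keeps recommendations closer to what you already like.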
