VOOZH about

URL: https://thenewstack.io/create-a-movie-recommendation-engine-with-milvus-and-python/

⇱ Create a Movie Recommendation Engine with Milvus and Python - The New Stack


TNS
SUBSCRIBE
Join our community of software engineering leaders and aspirational developers. Always stay in-the-know by getting the most important news and exclusive content delivered fresh to your inbox to learn more about at-scale software development.
REQUIRED
It seems that you've previously unsubscribed from our newsletter in the past. Click the button below to open the re-subscribe form in a new tab. When you're done, simply close that tab and continue with this form to complete your subscription.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.
Welcome and thank you for joining The New Stack community!
Please answer a few simple questions to help us deliver the news and resources you are interested in.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Great to meet you!
Tell us a bit about your job so we can cover the topics you find most relevant.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Welcome!

We’re so glad you’re here. You can expect all the best TNS content to arrive Monday through Friday to keep you on top of the news and at the top of your game.

What’s next?

Check your inbox for a confirmation email where you can adjust your preferences and even join additional groups.

Follow TNS on your favorite social media networks.

Become a TNS follower on LinkedIn.

Check out the latest featured and trending stories while you wait for your first TNS newsletter.

PREV
1 of 2
NEXT
VOXPOP
As a JavaScript developer, what non-React tools do you use most often?
Angular
0%
Astro
0%
Svelte
0%
Vue.js
0%
Other
0%
I only use React
0%
I don't use JavaScript
0%
Thanks for your opinion! Subscribe below to get the final results, published exclusively in our TNS Update newsletter:
NEW! Try Stackie AI
From clobbered drafts to real-time sync
Apr 14th 2026 10:00am, by David Moore
TypeScript 6.0 RC arrives as a bridge to a faster future
Mar 14th 2026 9:00am, by Darryl K. Taft
Mastra empowers web devs to build AI agents in TypeScript
Jan 28th 2026 11:00am, by Loraine Lawson
2023-11-17 09:32:13
Create a Movie Recommendation Engine with Milvus and Python
sponsor-zilliz,sponsored-post-contributed,
Cloud Native Ecosystem / Open Source / Python

Create a Movie Recommendation Engine with Milvus and Python

Use open source tools to help people find movies they might want to watch.
Nov 17th, 2023 9:32am by Gourav Singh Bais
👁 Featued image for: Create a Movie Recommendation Engine with Milvus and Python
Featured image by Tima Miroshnichenko on Pexels.
Zilliz sponsored this post.
Recommender systems or recommendation engines are information-filtering systems that aim to predict and suggest things users might be interested in. These items include products, services and content such as movies, books, music or news articles. There are various types of recommender systems, such as collaborative filtering, content-based filtering, hybrid recommendation systems and vector-based recommendation systems. Vector-based systems use the vector space to find (recommend) the closest items in the database. There are various ways to store these vectors; one of the most efficient ones is using the Milvus open source vector database. This database is highly flexible, fast and reliable and allows for trillion-byte-scale addition, deletion, updating and nearly real-time search of vectors. This article explains how to build a movie recommender with Milvus and Python. This system will use SentenceTransformers to convert the text information to vectors and store these vectors in Milvus. Milvus enables users to search for a movie in the database based on the text information they provide. You can find the code for this tutorial in the Milvus Bootcamp repository on GitHub, and there’s also a Jupyter Notebook.

Setting Up the Environment

For this article, you’ll need the following requirements installed:

Python Requirements

You also need to install a set of libraries that will be needed throughout this tutorial. Install the libraries using PIP:
$ python -m pip install pymilvus pandas sentence_transformers Kaggle

Vectors Data Store (Milvus)

You’ll use the Milvus vector database to store the embeddings that you’ll generate using movie descriptions. The dataset is relatively large, at least for a server running on a personal computer. So you may want to use a Zilliz Cloud instance to store these vectors. If you prefer to stay with a local instance, you can download a docker-compose configuration and run it:
$ wget https://github.com/milvus-io/milvus/releases/download/v2.3.0/milvus-standalone-docker-compose.yml -O docker-compose.yml
$ docker-compose up -d
You now have all the requirements to build your movie recommender system with Milvus.

Data Collection and Preprocessing

For this project, you’ll use the Movies Dataset from Kaggle which contains metadata for 45,000 movies. You can download this dataset directly or use the Kaggle API to download the dataset using Python. To do it with Python, you need to download the kaggle.json file from the profile section of Kaggle.com and put it in a location where the API will find it. Next, set up some environment variables for Kaggle authentication. For this, you can open the Jupyter Notebook and write the following lines of code:
%env KAGGLE_USERNAME=username
%env KAGGLE_KEY=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
%env TOKENIZERS_PARALLELISM=true
Once done, you can use Kaggle’s Python dependency to download the dataset from Kaggle:
# import kaggle dependency
import kaggle

kaggle.api.authenticate()
kaggle.api.dataset_download_files('rounakbanik/the-movies-dataset', path='dataset', unzip=True)
Once the dataset is downloaded, you can use the read_csv() method from pandas to read the dataset:
# import pandas
import pandas as pd

# read csv data
movies=pd.read_csv('dataset/movies_metadata.csv',low_memory=False)
# check shape of data
movies.shape
👁 Dataset shape

Database records and columns

The image shows that you have 45,466 records with 24 columns of metadata. Check all these metadata columns with the following:
# check column names
movies.columns
👁 Dataset columns

Dataset columns

There are a lot of columns you don’t need to create the recommender system. You can filter out the required columns with:
# filter required columns
trimmed_movies = movies[["id", "title", "overview", "release_date", "genres"]]
trimmed_movies.head(5)
👁 Required columns

Required columns

Also, some of the fields in the data are missing, so get rid of those rows to produce a clean dataset:
unclean_movies_dict = trimmed_movies.to_dict('records')
print('{} movies'.format(len(unclean_movies_dict)))
movies_dict = []

for movie in unclean_movies_dict:
 if movie["overview"] == movie["overview"] and movie["release_date"] == movie["release_date"] and movie["genres"] == movie["genres"] and movie["title"] == movie["title"]:
 movies_dict.append(movie)

Connect to Milvus

Now that you have all the required columns, connect to Milvus to start uploading the data. To connect to the Milvus cloud instance, you’ll need the Uniform Resource Identifier (URI) and token, which you can download from your Zilliz Cloud dashboard.
👁 Zilliz dashboard

Zilliz Cloud dashboard

Once you have your URI and API key, you can use the connect() method from PyMilvus to connect to the Milvus server:
# import milvus dependency
from pymilvus import *

# connect to milvus
milvus_uri="YOUR_URI"
token="YOUR_API_TOKEN"
connections.connect("default", uri=milvus_uri, token=token)
print("Connected!")

Generate Embeddings for Movies

Now it’s time to calculate the embeddings for the text data in the movie dataset. First, create a collection object that will store the movie ID and embeddings for the text data. Also, create an index field to make searches more efficient:
COLLECTION_NAME = 'film_vectors'
PARTITION_NAME = 'Movie'

# Here's our record schema
"""
"title": Film title,
"overview": description,
"release_date": film release date,
"genres": film generes,
"embedding": embedding
"""

id = FieldSchema(name='title', dtype=DataType.VARCHAR, max_length=500, is_primary=True)
field = FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, dim=384)

schema = CollectionSchema(fields=[id, field], description="movie recommender: film vectors", enable_dynamic_field=True)

if utility.has_collection(COLLECTION_NAME): # drop the same collection created before
 collection = Collection(COLLECTION_NAME)
 collection.drop()
 
collection = Collection(name=COLLECTION_NAME, schema=schema)
print("Collection created.")

index_params = {
 "index_type": "IVF_FLAT",
 "metric_type": "L2",
 "params": {"nlist": 128},
}

collection.create_index(field_name="embedding", index_params=index_params)
collection.load()

print("Collection indexed!")
Now that you have an indexed collection, create a function for generating the embeddings for the text. Although overview is the primary column used to generate the embeddings, you’ll also use the genre and release data information along with an overview to make the data more logical. To generate the embeddings, use the SentenceTransformer:
from sentence_transformers import SentenceTransformer
import ast

# function to extract the text from genre column 
def build_genres(data):
 genres = data['genres']
 genre_list = ""
 entries= ast.literal_eval(genres)
 genres = ""
 for entry in entries:
 genre_list = genre_list + entry["name"] + ", "
 genres += genre_list
 genres = "".join(genres.rsplit(",", 1))
 return genres

# create an object of SentenceTransformer
transformer = SentenceTransformer('all-MiniLM-L6-v2')

# function to generate embeddings
def embed_movie(data):
 embed = "{} Released on {}. Genres are {}.".format(data["overview"], data["release_date"], build_genres(data)) 
 embeddings = transformer.encode(embed)
 return embeddings
This function uses the build_genres() method to clean the genre column and get the text out of it. Then it creates a SentenceTransformer object to help generate the embeddings from the text. Finally, it uses the encode() method to generate the embeddings using the overview, release_date and genre features.

Send Embeddings to Milvus

Now you can create the embeddings using the embed_movie() method. This dataset is too large to send to Milvus in a single insert statement, but sending data rows one at a time would create unnecessary network traffic and add too much time. So, instead create batches (e.g., 5,000 rows) of data to send to Milvus:
# Loop counter for batching and showing progress
j = 0
batch = []

for movie_dict in movies_dict:
 try:
 movie_dict["embedding"] = embed_movie(movie_dict)
 batch.append(movie_dict)
 j += 1
 if j % 5 == 0:
 print("Embedded {} records".format(j))
 collection.insert(batch)
 print("Batch insert completed")
 batch=[]
 except Exception as e:
 print("Error inserting record {}".format(e))
 pprint(batch)
 break

collection.insert(movie_dict)
print("Final batch completed")
print("Finished with {} embeddings".format(j))
Note: You can play with the batch size to suit your individual needs and preferences. Also, a few movies will fail for IDs that cannot be cast to integers. You could fix this with a schema change or by verifying their format.

Recommend New Movies Using Milvus

Now you can leverage Milvus’ near real-time vector search functionality to get a close match of movies that meet the viewer’s criteria. For this, create two different functions:
  1. embed_search(): You need a transformer to convert the user’s search string to an embedding. This function takes the viewer’s criteria and passes it to the same transformer you used to populate Milvus.
  2. search_for_movies(): This function performs the actual vector search using the other function for support.
# load collection memory before search
collection.load() 

# Set search parameters
topK = 5
SEARCH_PARAM = {
 "metric_type":"L2",
 "params":{"nprobe": 20},
}

# convert search string to embeddings
def embed_search(search_string):
 search_embeddings = transformer.encode(search_string)
 return search_embeddings

# search similar embeddings for user's query
def search_for_movies(search_string):
 user_vector = embed_search(search_string)
 return collection.search([user_vector],"embedding",param=SEARCH_PARAM, limit=topK, expr=None, output_fields=['title', 'overview'])
The above code defines the parameters topK for getting the top five similar vectors, metric_type as L2 (squared Euclidean) that calculates the distance between two vectors and nprobe that indicates the number of cluster units to search. It also implements different functions to get similar vectors from the user’s query (recommendation). Finally, use the search_for_movies() function to recommend movies based on the user’s search string:
from pprint import pprint

search_string = "A comedy from the 1990s set in a hospital. The main characters are in their 20s and are trying to stop a vampire."
results = search_for_movies(search_string)

# check results
for hits in iter(results):
 for hit in hits:
 print(hit.entity.get('title'))
 print(hit.entity.get('overview'))
 print("-------------------------------")
👁 Movie recommendation output

Movie recommendation output

By using Milvus’s vector search feature, the code recommends the top five similar movies based on the user’s query. This is it: You’ve now built your own movie recommender system using Milvus.

Conclusion

After reading this article, you know what a vector-based recommendation system is and how to create a movie recommender system with Milvus. Milvus helps build an efficient and scalable movie recommendation system. Leveraging vector storage and similarity search, Milvus has great potential for enabling personalized recommendations, enhancing user engagement and showcasing the role of advanced vector-based models in modern recommendation systems. You can learn more about it on the Milvus website.
Zilliz is a leading vector database company, offering high-performing and scalable solutions. We’re powered by Milvus, the popular open-source vector database that helps companies from any scale build AI-powered search solutions.
Learn More
TRENDING STORIES
Gourav Bais is an Applied Machine Learning Engineer at ValueMomentum Inc. He is skilled in developing Machine Learning/Deep learning pipelines, retraining systems, and transforming Data Science prototypes to production-grade solutions. He has been working in the same field for the...
Read more from Gourav Singh Bais
Zilliz sponsored this post.
SHARE THIS STORY
TRENDING STORIES
Docker is a sponsor of The New Stack.
TNS owner Insight Partners is an investor in: turing, Uniform, Docker.
SHARE THIS STORY
TRENDING STORIES
TNS DAILY NEWSLETTER Receive a free roundup of the most recent TNS articles in your inbox each day.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.
👁 Image
Milvus Lite, a lightweight version of the open source vectorDB Milvus, installs easily & integrates with 20+ AI tools.