Find It Like Magic: Smarter Text Searches with Vectors!
by Gabriel Vergara
Introduction
When you think about searching for text, you probably imagine SQL queries with LIKE '%keyword%'
or even complex regular expressions. But as soon as you deal with real-world data—where typos, synonyms, and phrasing differences exist—those traditional methods start to fall short. That’s where vector databases come in.
Initially, I discovered vector databases as part of implementing Retrieval-Augmented Generation (RAG) for AI-driven Q&A systems. But along the way, I realized something interesting: vector databases are not just for RAG. They are a powerful tool for improving search itself, even without an AI model answering questions.
Unlike SQL or regex-based searches, which rely on exact text matching, vector databases allow fuzzy, meaning-based searches. This means you can find relevant results even if the search text doesn’t match exactly. Imagine searching for “The Dark Night” and still getting results for “The Dark Knight”—something that would be tricky with traditional methods.
What We’ll Build
To see this in action, we’ll build a similarity search engine using:
- LangChain (for easy vector search integration)
- Ollama (for easy embedding model handling)
- FAISS (a fast, open-source vector database)
- Pandas (for handling and analyzing our dataset)
We’ll work with an IMDB movie dataset, allowing us to search for movies by name or description, even if there are typos or variations in phrasing.
If you want a more in-depth explanation of the theory, don’t hesitate to check this article: From Messy Files to Magic Answers: How RAG Makes AI Smarter (and Life Easier)
By the end, you’ll have a solid understanding of how vector databases work—not just for AI chatbots, but as a powerful standalone search technology. Let’s get started!
Prerequisites
Before diving into the examples, ensure that your development environment is set up with the necessary tools and dependencies. Here’s what you’ll need:
- Ollama: A local instance of Ollama is required for embedding operations. If you don’t already have Ollama installed, you can download it from here. This guide assumes that Ollama is installed and running on your machine.
- Models: Once Ollama is set up, pull the required models.
- all-minilm embedding model: Used to create embeddings for document chunks (check here; the exact pull command is shown right after this list).
- Python Environment:
- Python version: This script has been tested with Python 3.10. Ensure you have a compatible Python version installed.
- Installing Dependencies: Use a Python environment management tool like pipenv to set up the required libraries. Execute the following command in your terminal:
pipenv install langchain langchain-community langchain-ollama faiss-cpu pandas
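To pull the embedding model mentioned above, use the standard Ollama CLI:
ollama pull all-minilm
If you prefer plain pip over pipenv, installing the same packages inside any virtual environment works just as well:
pip install langchain langchain-community langchain-ollama faiss-cpu pandas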
1. Setting Up Our Vector Database
Before diving into the code, let’s set the stage. We’re working with a free IMDB dataset extracted from this Kaggle source, specifically the “animation” movie subset in CSV format.
To transform text-based movie data into vector representations, we’ll use FAISS (a fast vector search library) and Ollama’s all-minilm embedding model (remember to pull the model first, as described in the prerequisites). Since processing large datasets all at once can be demanding, we’ll convert the data in batches for better performance and efficiency.
The full script will be available at the end of this article, but let’s go step by step and dissect the key parts.
1.1. Configuration and Setup
The script starts by defining important parameters:
embedding_model = 'all-minilm' # can also try with 'nomic-embed-text'
movie_csv_file = "./movie_dataset/animation.csv"
vectorstore_path = f'vectorstore_{embedding_model}'
batch_size = 32
- We specify the embedding model (all-minilm) to convert text into vectors.
- The script reads data from the animation movie dataset (animation.csv).
- We define the vectorstore directory (vectorstore_all-minilm), where our FAISS index will be saved.
- The script processes 32 movies at a time to reduce memory usage.
1.2. Removing Old Vector Stores
Before creating a new vector database, we check if an existing one exists and delete it to start fresh:
if os.path.isdir(vectorstore_path):
    shutil.rmtree(vectorstore_path)
This ensures that we always work with a clean FAISS index instead of appending to outdated data.
1.3. Loading and Validating the Dataset
Next, we load the CSV file into a Pandas DataFrame and ensure it contains the necessary columns (movie_name and description):
df = pd.read_csv(movie_csv_file)
if "movie_name" not in df.columns or "description" not in df.columns:
    raise ValueError("CSV file must contain 'movie_name' and 'description' columns.")
If these columns are missing, the script will raise an error. This step is important because vectorization relies on movie names and descriptions to generate embeddings.
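If your copy of the dataset ships with different headers (Kaggle exports vary), a quick rename keeps the rest of the script unchanged. A minimal sketch, assuming hypothetical source columns name and overview (adjust to whatever your file actually uses):
# Hypothetical adapter: 'name' and 'overview' are made-up source headers.
df = df.rename(columns={"name": "movie_name", "overview": "description"})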
1.4. Initializing FAISS and the Embedding Model
vectorstore_embeddings = OllamaEmbeddings(model=embedding_model)
vectorstore_index = None # Placeholder for FAISS index
- We use Ollama’s embedding model to generate text embeddings.
- The FAISS index starts as None because we’ll create it dynamically while processing batches of movies.
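Before indexing anything, you can verify that the embedding side works by embedding a single string directly. A quick optional sanity check, not part of the original script (it assumes Ollama is running and all-minilm has been pulled):
# Embed one query string and inspect the resulting vector.
sample_vector = vectorstore_embeddings.embed_query("The Dark Knight")
print(len(sample_vector))  # all-minilm should yield a 384-dimensional vector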
1.5. Batch Processing for Efficient Indexing
To handle large datasets efficiently, we process movies in batches rather than all at once:
for i in tqdm(range(0, len(df), batch_size), desc="Indexing Progress"):
    batch_df = df.iloc[i:i + batch_size]
    documents = [
        Document(
            page_content=f"Title: {row['movie_name']} | Description: {row['description']}",
            metadata={"movie_name": row["movie_name"], "description": row["description"]}
        )
        for _, row in batch_df.iterrows()
    ]
- We iterate over the dataset in chunks of 32 movies per batch.
- Each movie is converted into a LangChain Document, combining the title and description into a single searchable text entry.
- Metadata (original movie name and description) is stored alongside the text for reference.
1.6. Creating and Updating the FAISS Index
Once we have a batch of movie documents, we either create a new FAISS index or add to an existing one:
if vectorstore_index is None:
    vectorstore_index = FAISS.from_documents(documents, vectorstore_embeddings)
else:
    vectorstore_index.add_documents(documents)
- If this is the first batch, we initialize the FAISS index.
- Otherwise, we incrementally add new documents to the existing index.
- This structure makes it easy to handle datasets of any size without overloading memory.
1.7. Saving the Vector Database
After processing all batches, we save the FAISS index to disk for future use:
vectorstore_index.save_local(vectorstore_path)
This allows us to reuse the indexed data later without needing to reprocess the entire dataset.
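As an optional sanity check (again, a sketch rather than part of the original script), you can reload the saved index right away and confirm that the number of stored vectors matches the number of rows in the dataset:
# Reload the index we just saved and compare vector count to row count.
reloaded = FAISS.load_local(vectorstore_path, vectorstore_embeddings, allow_dangerous_deserialization=True)
print(reloaded.index.ntotal, len(df))  # the two numbers should match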
Wrapping Up
At this point, we have successfully:
- Loaded an IMDB animation movie dataset
- Converted movie names and descriptions into vector embeddings
- Indexed the data in FAISS, a fast and scalable vector database
- Saved the index for later searches
In the next section, we’ll use this vectorized database to perform fast, fuzzy movie searches—finding relevant results even when queries contain typos or phrasing variations.
2. Searching for Movies in the Vector Database
Now that we have built and stored our vectorized movie database, it’s time to put it to work! In this section, we’ll explore how to search for movies using FAISS and LangChain.
This script allows you to enter a movie title or description, and it will return the most similar results from our FAISS index. The key concept here is vector similarity—the closer a movie’s embedding is to the search query, the better the match.
As always, the full script will be available at the end of this article, but let’s go through its most important parts.
2.1. Defining the Search Function
At the heart of this script is the search_movies function, which retrieves similar movies based on a given query:
def search_movies(query, top_k=5):
    results = vectorstore.similarity_search_with_score(query, k=top_k)
    return [
        (doc.metadata["movie_name"], doc.metadata["description"], score)
        for doc, score in results
    ]
How does it work?
- The query (movie name or description) is passed into FAISS.
- FAISS performs a similarity search and returns the top K matches (default: 5).
- Each result includes:
- The movie title
- The movie description
- A similarity score (lower scores mean better matches)
What is the similarity score?
The similarity_search_with_score function measures how close each movie’s vector is to the query. With LangChain’s FAISS wrapper, this is a raw distance (Euclidean/L2 by default), so:
- A score closer to 0 means a better match.
- Higher scores indicate less relevant results.
For example, if you search for "Finding Nimo", it might still return "Finding Nemo" because the vectors are similar, even though the text isn’t an exact match.
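If you would rather work with scores where higher means more similar, LangChain vector stores also expose a relevance-scored variant that normalizes the raw distance into a roughly 0-to-1 range (a sketch of an alternative, not used by the script below; depending on the distance strategy, values can fall slightly outside that range):
# Alternative: relevance scores, where values near 1.0 are strong matches.
def search_movies_by_relevance(query, top_k=5):
    return vectorstore.similarity_search_with_relevance_scores(query, k=top_k)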
2.2. Loading the Vector Database
Before searching, we need to load our FAISS vector store and its embeddings:
embedding_model = 'all-minilm' # can also try with 'nomic-embed-text'
vectorstore_path = f'vectorstore_{embedding_model}'
vectorstore_embeddings = OllamaEmbeddings(model=embedding_model)
vectorstore = FAISS.load_local(vectorstore_path, vectorstore_embeddings, allow_dangerous_deserialization=True)
- We set up the same embedding model (all-minilm) that we used for indexing.
- We define the path to our stored FAISS vector database.
- We reload the vector index using the embeddings so that we can perform searches.
The allow_dangerous_deserialization=True flag is required when loading a local FAISS index, because LangChain stores part of the index with Python’s pickle module, which can execute arbitrary code during deserialization. That’s fine here, since we’re loading a file we generated ourselves; only enable this flag for index files you trust.
2.3. Interactive Search Loop
Now, we enter a loop that allows the user to keep searching for different movies until they decide to exit:
search_another = True
while search_another:
    print('-' * 80)
    movie_search_query = input("Enter a movie/description to search for: ")
    movie_results = search_movies(movie_search_query)
    print("\n---- Similar Movies --------------------\n")
    for name, description, score in movie_results:
        print(f"- Title: {name}\n\t- Score: {score:.4f}\n\t- Description: {description}\n")
    search_another = input("Search again? [Y,n]: ").lower() in ['y', '']
- The script asks for a search query (a movie title or a description).
- It retrieves the top matches using search_movies(query).
- Results are displayed in a readable format:
- Movie Title
- Similarity Score (lower is better)
- Movie Description
- The user is asked if they want to search again or exit the program.
Example Search Output
Here’s a real-world case: there was a movie I wanted to recommend to a friend, but I couldn’t remember its name. It was a comedy about a girl on a family trip that gets interrupted by an AI rebellion. Using that idea as a search argument, let’s say we search for “family trip ai rebellion”. The output might look something like this:
----------------------------------------------------------------------
Enter a movie/description to search for: family trip ai rebellion
---- Similar Movies --------------------
- Title: The Hoovers Conquer the Universe
	- Score: 1.0515
	- Description: An animated family space adventure
- Title: Untitled Animated Feature Film
	- Score: 1.1748
	- Description: A progressive sociopolitical family film.
- Title: Robot Zot
	- Score: 1.1781
	- Description: The story of a wayward scout for an invading alien force, whose course goes hopelessly awry when he lands in the yard of a modern-day, suburban family with problems of their own. Based on the book by Jon Scieszka and David Shannon.
- Title: The Mitchells vs the Machines
	- Score: 1.1836
	- Description: A quirky, dysfunctional family's road trip is upended when they find themselves in the middle of the robot apocalypse and suddenly become humanity's unlikeliest last hope.
The movie was “The Mitchells vs the Machines”, and it appeared in the results. This demonstrates how vector-based search handles variations much better than traditional SQL or regex searches!
Wrapping Up
At this point, we have successfully:
- Loaded the vectorized movie database
- Implemented a similarity search function
- Created an interactive search loop
- Returned relevant movie recommendations based on text similarity
In the next section, we’ll wrap everything up and provide the full code so you can try it yourself!
Full code
Here is the full code for both the vector store feeder and the vector store query script.
index_movie_description.py
import os
import shutil
import pandas as pd
from tqdm import tqdm
from langchain_community.vectorstores import FAISS
from langchain_ollama.embeddings import OllamaEmbeddings
from langchain.schema import Document
if __name__ == '__main__':
    # ---- Configuration --------------------------------------------
    embedding_model = 'all-minilm'  # can also try with 'nomic-embed-text'
    movie_csv_file = "./movie_dataset/animation.csv"
    vectorstore_path = f'vectorstore_{embedding_model}'
    batch_size = 32

    # ---- Vector Store setup ---------------------------------------
    print('-' * 80)
    print(f'> Creating vectorstore: {vectorstore_path} ')

    # Remove previous vectorstore
    if os.path.isdir(vectorstore_path):
        print(f' > Removing existing vectorstore... ')
        shutil.rmtree(vectorstore_path)

    # ---- Grab movie .csv file for the vector store
    print(f' > Processing movies from: {movie_csv_file} ')

    # Ensure column names match expected format
    df = pd.read_csv(movie_csv_file)
    if "movie_name" not in df.columns or "description" not in df.columns:
        raise ValueError("CSV file must contain 'movie_name' and 'description' columns.")

    # Initialize FAISS index & embedding model
    vectorstore_embeddings = OllamaEmbeddings(model=embedding_model)
    vectorstore_index = None  # Placeholder for FAISS index

    # ---- Batch Processing ----
    print(f' > Indexing {len(df)} movies in batches of {batch_size}...\n')
    for i in tqdm(range(0, len(df), batch_size), desc="Indexing Progress"):
        # Extract batch & convert movie rows into LangChain Documents
        # (concatenating movie name and description)
        batch_df = df.iloc[i:i + batch_size]
        documents = [
            Document(
                page_content=f"Title: {row['movie_name']} | Description: {row['description']}",
                metadata={"movie_name": row["movie_name"], "description": row["description"]}
            )
            for _, row in batch_df.iterrows()
        ]

        # Initialize or append to FAISS index
        if vectorstore_index is None:
            vectorstore_index = FAISS.from_documents(documents, vectorstore_embeddings)
        else:
            vectorstore_index.add_documents(documents)

    # ---- Save vector store index
    print(f' > Saving vectorstore: {vectorstore_path}')
    vectorstore_index.save_local(vectorstore_path)
    print('> Vectorstore created!')
search_movie_description.py
from langchain_ollama.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import FAISS
# Query function
def search_movies(query, top_k=5):
    results = vectorstore.similarity_search_with_score(query, k=top_k)
    return [
        (doc.metadata["movie_name"], doc.metadata["description"], score)
        for doc, score in results
    ]

if __name__ == '__main__':
    # ---- Configuration --------------------------------------------
    embedding_model = 'all-minilm'  # can also try with 'nomic-embed-text'
    vectorstore_path = f'vectorstore_{embedding_model}'

    # ---- Load FAISS index -----------------------------------------
    vectorstore_embeddings = OllamaEmbeddings(model=embedding_model)
    vectorstore = FAISS.load_local(vectorstore_path, vectorstore_embeddings, allow_dangerous_deserialization=True)

    # ---- Query vector store ---------------------------------------
    search_another = True
    while search_another:
        print('-' * 80)
        movie_search_query = input("Enter a movie/description to search for: ")
        movie_results = search_movies(movie_search_query)
        print("\n---- Similar Movies --------------------\n")
        for name, description, score in movie_results:
            print(f"- Title: {name}\n\t- Score: {score:.4f}\n\t- Description: {description}\n")
        search_another = input("Search again? [Y,n]: ").lower() in ['y', '']
Conclusion: Smarter Searches, Zero Hassle
We’ve explored how vector databases can revolutionize text-based search, making it smarter, more flexible, and typo-resistant. Instead of relying on old-school SQL LIKE queries or complex regex patterns, we leveraged FAISS, LangChain, and Ollama embeddings to perform meaning-based searches on an IMDB movie dataset.
Beyond Movies: What’s Next?
This approach isn’t just for movies—it can be applied to product searches, document retrieval, customer support chatbots, and any scenario where traditional search struggles with variations in wording.
With vector search, you can build Google-like search experiences without needing a full AI-powered chatbot. And the best part? You don’t need huge infrastructure or deep learning expertise—just smart indexing and a good embedding model.
Want to take it further? Try experimenting with different embedding models, datasets, or even multimodal (text + images) searches.
Now, it’s your turn—give it a try and let me know what you build!
About Me
I’m Gabriel, and I like computers. A lot.
For nearly 30 years, I’ve explored the many facets of technology—as a developer, researcher, sysadmin, security advisor, and now an AI enthusiast. Along the way, I’ve tackled challenges, broken a few things (and fixed them!), and discovered the joy of turning ideas into solutions. My journey has always been guided by curiosity, a love of learning, and a passion for solving problems in creative ways.
See ya around!