See It, Search It: AI-Powered Image Cataloging Made Easy
by Gabriel Vergara
Introduction
Have you ever struggled to find an image in a massive collection, remembering only a vague description like “sunset over a snowy mountain” or “a cat sitting next to a laptop”? What if you could just type a natural-language description and instantly retrieve the most relevant images? That’s exactly what we’re going to build in this article using multimodal AI and a vector database.
What’s a Multimodal Model, and How Is It Different from an LLM?
When we talk about Large Language Models (LLMs), we usually mean AI models trained to understand and generate text. They’re great at answering questions, summarizing documents, or generating creative text. However, they can’t “see” or interpret images. That’s where multimodal models come in.
A multimodal model processes multiple types of input—like text and images—simultaneously. It can understand and describe images in natural language, combining vision and language capabilities in one model. One such model, and the one we'll use in this project, is LLaVA (Large Language and Vision Assistant), described by its authors as a novel end-to-end trained large multimodal model that combines a vision encoder and Vicuna for general-purpose visual and language understanding.
What’s a Vision Encoder? (In Simple Terms)
A vision encoder is the part of a multimodal model that processes images. It converts an image into a numerical representation (embeddings) that a language model can understand. Think of it as translating a picture into a format that an AI can “read” and describe in words.
What Are We Building?
In this guide, we’ll use LLaVA to automatically generate text descriptions for a repository of images. Instead of manually tagging each picture, we’ll let AI analyze and describe them for us. Then, we’ll store these descriptions in a vector database so we can perform similarity searches—meaning, you can find images based on their textual descriptions.
How Are We Doing This?
We’ll write Python code that integrates:
- LangChain for AI model orchestration
- Ollama to run the multimodal model (LLaVA) locally
- FAISS for storing and searching the text representations efficiently
- Pandas to structure the cataloged data
- Pillow to handle image processing
If you want a more in-depth explanation of the theory, don't hesitate to check out this article: From Messy Files to Magic Answers: How RAG Makes AI Smarter (and Life Easier). And if you want to delve deeper into RAG implementations, take a look at this one: Teach Your PDFs to Chat: Build an AI-Powered Assistant That Delivers Answers.
By the end of this article, you’ll have a working system that can catalog images automatically and make them searchable by description. Let’s get started!
Prerequisites
Before diving into the examples, ensure that your development environment is set up with the necessary tools and dependencies. Here’s what you’ll need:
- Ollama: A local instance of Ollama is required for embedding and querying operations. If you don’t already have Ollama installed, you can download it from here. This guide assumes that Ollama is installed and running on your machine.
- Models: Once Ollama is set up, pull the required models (the exact commands are shown right after this list).
- llava:13b: Used to generate the image descriptions (link). If you are low on hardware, you can also try llava:7b (link), which is less accurate but lighter on processing requirements (and faster, too). You will need about 7GB of free storage space for the llava:13b model.
- nomic-embed-text embedding model: Used to create embeddings from the image descriptions (link).
- Python Environment:
- Python version: This script has been tested with Python 3.10. Ensure you have a compatible Python version installed.
- Installing Dependencies: Use a Python environment management tool like pipenv to set up the required libraries. Execute the following command in your terminal:
pipenv install langchain langchain-community langchain-ollama faiss-cpu pillow tqdm pandas
- Sample images: a set of ten to twenty PNG image files, used as the image repository. Other image types can be used as well, but you will need to tweak the scripts a little.
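For reference, pulling both models from a terminal with the Ollama CLI looks like this (the exact tags may change over time, so double-check them in the Ollama model library):
ollama pull llava:13b
ollama pull nomic-embed-text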
With these prerequisites in place, you'll be ready to proceed to the next steps: setting up the image catalog and querying it.
Generating Image Descriptions and Storing Them for Search
Before we dive into the code, here’s the plan: we’ll process a repository of images, use a multimodal AI model to generate text descriptions, and then store those descriptions in a vector database for easy retrieval.
The full code will be available at the end of the article, but here, we’ll break it down into the key steps.
Step 1: Setting Up the Environment
We start by setting up our configurations:
llm_model = 'llava:13b'
embedding_model = 'nomic-embed-text'
vectorstore_path = f'vectorstore_{embedding_model}'
images_repository_path = './images_repository'
images_description_csv = './images_descriptions.csv'
batch_size = 8
This defines the multimodal model, the embedding model, and the paths for storing processed data. The batch_size parameter is used to process image descriptions in small chunks when feeding the vector database (for the sake of this example, 8 was enough; consider using a bigger number for real scenarios).
Step 2: Preparing Image Data
LLaVA requires images in base64 format, so we need to convert them before feeding them into the model. The following function reads an image, encodes it, and returns its base64 representation:
def get_png_as_base64(file_path):
    buffered = BytesIO()
    Image.open(file_path).save(buffered, format="PNG")
    return base64.b64encode(buffered.getvalue()).decode("utf-8")
I encourage you to enhance this function to process other image types!
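As a starting point, here is a minimal sketch of what that could look like. The helper name get_image_as_base64 and the RGB conversion are my own additions, not part of the original script; the idea is simply to let Pillow open any format it supports and re-encode it as PNG before producing the base64 string:
import base64
from io import BytesIO
from PIL import Image

def get_image_as_base64(file_path):
    # Hypothetical variant of get_png_as_base64: accepts any Pillow-readable format.
    buffered = BytesIO()
    image = Image.open(file_path)
    # Convert to RGB so palette or CMYK images can be safely re-encoded as PNG.
    image.convert("RGB").save(buffered, format="PNG")
    return base64.b64encode(buffered.getvalue()).decode("utf-8")
If you go this route, you would also need to widen the glob pattern in the next step (for example, to include *.jpg files).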
Step 3: Extracting Descriptions from Images
Now, we iterate over the images in the repository, send them to LLaVA, and capture the text descriptions:
model = OllamaLLM(model=llm_model)
prompt = 'Describe this image in two or fewer sentences:'
input_files = glob.glob(f'{images_repository_path}/*.png')
dataframe = pd.DataFrame(columns=['image_path', 'description'])
for image_path in input_files:
    image_base64 = get_png_as_base64(image_path)
    response = model.bind(images=[image_base64]).invoke(input=prompt)
    dataframe.loc[len(dataframe)] = [image_path, response]
dataframe.to_csv(images_description_csv, index=False)
Here’s what happens:
- The script grabs all .png files in the image repository.
- Each image is converted to base64 and passed to LLaVA.
- The model generates a short text description, which is stored in a Pandas DataFrame.
- The DataFrame is saved as a CSV file for later reference.
Step 4: Creating the Vector Database
Once we have textual descriptions, we need to convert them into vectors and store them in FAISS for efficient retrieval.
if os.path.isdir(vectorstore_path):
    shutil.rmtree(vectorstore_path)
vectorstore_embeddings = OllamaEmbeddings(model=embedding_model)
vectorstore_index = None
This section initializes FAISS and ensures we start with a fresh vector database by removing any existing one.
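Deleting and rebuilding is the simplest approach for a small repository. As an alternative, if you only add a few images from time to time, you could load the existing index and append to it instead of starting over. A minimal sketch, assuming the same imports and configuration as the full script, where new_documents is a hypothetical list containing only the newly described images:
if os.path.isdir(vectorstore_path):
    # Reuse the existing index instead of deleting it.
    vectorstore_index = FAISS.load_local(
        vectorstore_path, vectorstore_embeddings, allow_dangerous_deserialization=True
    )
    vectorstore_index.add_documents(new_documents)
else:
    vectorstore_index = FAISS.from_documents(new_documents, vectorstore_embeddings)
vectorstore_index.save_local(vectorstore_path)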
Step 5: Indexing Descriptions in FAISS
Since we’re working with potentially large datasets, we process text descriptions in batches:
for i in range(0, len(dataframe), batch_size):
    batch_df = dataframe.iloc[i:i + batch_size]
    documents = [
        Document(
            page_content=row['description'],
            metadata={"image_path": row["image_path"], "description": row["description"]}
        )
        for _, row in batch_df.iterrows()
    ]
    if vectorstore_index is None:
        vectorstore_index = FAISS.from_documents(documents, vectorstore_embeddings)
    else:
        vectorstore_index.add_documents(documents)
This approach helps to:
- Reduce memory usage by processing only a few descriptions at a time.
- Ensure smooth indexing when dealing with thousands of images.
Step 6: Saving the Vector Store
Finally, we store the indexed vector database for future queries:
vectorstore_index.save_local(vectorstore_path)
And that’s it! Now, we have a searchable catalog of image descriptions stored in FAISS, making it easy to find images based on textual queries.
Searching Your Image Catalog Using Natural Language
Now that we have a vectorized image catalog, it’s time to put it to use! In this section, we’ll explore how to perform similarity searches against our FAISS vector database to retrieve the most relevant images based on a text query.
The full code will be available at the end of the article, but here, we’ll break it down into the most important parts.
Step 1: How Similarity Search Works
Instead of using traditional keyword-based searching, we’ll leverage vector similarity search to find images based on meaning, even if the descriptions aren’t an exact match.
FAISS allows us to use a function called similarity_search_with_score, which determines how closely a stored description matches a query. The lower the score, the better the match—a score of 0.0 means a perfect match.
Step 2: Defining the Search Function
We define a function that takes a text query (e.g., “a cat sitting next to a laptop”) and finds the top matching images from the vector database:
def search_images(query, top_k=3):
    results = vectorstore.similarity_search_with_score(query, k=top_k)
    return [
        (doc.metadata["image_path"], doc.metadata["description"], score)
        for doc, score in results
    ]
Here’s how it works:
- The query is compared against stored image descriptions.
- FAISS returns the best-matching images, along with their similarity scores.
- The results include:
- The image path (so you can locate the file).
- The AI-generated description.
- The similarity score (lower = better match).
By default, the function returns three results, but you can adjust top_k to show more or fewer images.
IMPORTANT: keep in mind that the function always returns the (by default) three closest descriptions. If you have a very small set of images, it will still return the three closest descriptions, even if they are not related to your search query. One way to mitigate this is to filter results by score, as sketched below.
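Here is a minimal sketch of such a filter. The helper name search_images_filtered and the threshold value are my own additions, and the right threshold depends on your embedding model and data, so treat it as a starting point to tune:
def search_images_filtered(query, top_k=3, score_threshold=0.8):
    # Keep only results whose distance score is below the threshold (lower = better match).
    results = vectorstore.similarity_search_with_score(query, k=top_k)
    return [
        (doc.metadata["image_path"], doc.metadata["description"], score)
        for doc, score in results
        if score <= score_threshold
    ]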
Step 3: Loading the Vector Database
Before running a search, we need to load the FAISS index and the text embedding model:
from langchain_ollama.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import FAISS
embedding_model = 'nomic-embed-text'
vectorstore_path = f'vectorstore_{embedding_model}'
vectorstore_embeddings = OllamaEmbeddings(model=embedding_model)
vectorstore = FAISS.load_local(vectorstore_path, vectorstore_embeddings, allow_dangerous_deserialization=True)
This ensures that:
- We use the same embedding model that was used during indexing.
- The vectorstore is loaded from disk, so we can perform searches instantly.
Step 4: Running the Search Loop
To make the system interactive, we create a loop where users can enter text queries, get relevant image matches, and decide whether to search again:
search_another = True
while search_another:
    image_search_query = input("Enter an image description to search for: ")
    image_results = search_images(image_search_query)
    for image_path, description, score in image_results:
        print(f"- Image Path: {image_path}\n\t- Score: {score:.4f}\n\t- Description: {description}\n")
    search_another = input("Search again? [Y,n]: ").lower() in ['y', '']
This part:
- Prompts the user to enter a natural-language description of the image they’re looking for.
- Retrieves the closest-matching images using FAISS.
- Displays the image path, similarity score, and AI-generated description.
- Lets the user perform multiple searches without restarting the script.
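As a small optional touch, you could also preview the best match directly from the loop, since Pillow is already a project dependency. This is a minimal sketch of my own, not part of the original script, and it assumes the image_results variable from the loop above:
from PIL import Image

if image_results:
    best_path, best_description, best_score = image_results[0]
    Image.open(best_path).show()  # opens the top match in the system's default image viewer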
Output samples
This is an output sample, just so you can see the results:
Enter an image description to search for: a landscape
---- Images related ----------------------
- Image Path: ./images_repository/Second_Life_Landscape_champ_des_fleurs.png
- Score: 0.4404
- Description: This image features a digital rendering of a natural landscape, possibly from a video game or a 3D model. The scene includes wildflowers, tall grasses, and trees in the background under a clear sky. There is a blank area on the bottom part of the image where there should be more detail but it appears to have been cut off or obscured.
- Image Path: ./images_repository/Semmering_landscape.png
- Score: 0.6829
- Description: This image captures a serene winter scene on a mountain slope. The landscape is covered in snow, and the forest at the base of the slope has snow-covered pine trees. There's a slight mist or fog that adds to the tranquility of the setting.
And these are the images:
[Image: Second_Life_Landscape_champ_des_fleurs.png]
[Image: Semmering_landscape.png]
Full code
This is the full code for both the image catalog creation and the vector store queries.
index_images_description.py
import base64
import glob
import os
import pandas as pd
import shutil
from io import BytesIO
from langchain.schema import Document
from langchain_community.vectorstores import FAISS
from langchain_ollama import OllamaLLM
from langchain_ollama.embeddings import OllamaEmbeddings
from PIL import Image
from tqdm import tqdm


# ---- Functions definition ---------------------------------------------------
def get_dataframe(file_path):
    if os.path.isfile(file_path):
        os.remove(file_path)
    return pd.DataFrame(columns=['image_path', 'description'])


def get_png_as_base64(file_path):
    buffered = BytesIO()
    Image.open(file_path).save(buffered, format="PNG")  # You can change the format if needed
    return base64.b64encode(buffered.getvalue()).decode("utf-8")


# ---- Entry point ------------------------------------------------------------
if __name__ == '__main__':
    # ---- Configuration ------------------------------------------------------
    llm_model = 'llava:13b'
    embedding_model = 'nomic-embed-text'
    vectorstore_path = f'vectorstore_{embedding_model}'
    images_repository_path = './images_repository'
    images_description_csv = './images_descriptions.csv'
    batch_size = 8

    # ---- Model definition ---------------------------------------------------
    model = OllamaLLM(model=llm_model)
    prompt = 'Describe this image in two or fewer sentences:'

    # ---- Grab images from repository ----------------------------------------
    print('-' * 80)
    print(f'> Grabbing PNG images from repository: {images_repository_path}')
    input_files = glob.glob(f'{images_repository_path}/*.png')
    print(f'> Got {len(input_files)} file(s).')

    # ---- Get the image descriptions dataset ----------------------------------
    print(f'> Creating image descriptions dataset: {images_description_csv}')
    dataframe = get_dataframe(images_description_csv)

    # ---- Image processing ---------------------------------------------------
    print('> Updating image descriptions dataset...')
    for i in tqdm(range(0, len(input_files), 1), desc="Image description generation progress"):
        image_path = input_files[i]
        image_base64 = get_png_as_base64(image_path)
        response = model.bind(images=[image_base64]).invoke(input=prompt)
        dataframe.loc[len(dataframe)] = [image_path, response]
    dataframe.to_csv(images_description_csv, index=False)
    print('> ...done!')

    # ---- Vector Store setup -------------------------------------------------
    print('-' * 80)
    print(f'> Creating vectorstore: {vectorstore_path} ')
    if os.path.isdir(vectorstore_path):
        shutil.rmtree(vectorstore_path)

    # Initialize FAISS index & embedding model
    vectorstore_embeddings = OllamaEmbeddings(model=embedding_model)
    vectorstore_index = None  # Placeholder for FAISS index

    # ---- Vector store indexing
    print(f'> Indexing {len(dataframe)} image descriptions...\n')
    for i in tqdm(range(0, len(dataframe), batch_size), desc="Indexing image descriptions progress"):
        batch_df = dataframe.iloc[i:i + batch_size]
        documents = [
            Document(
                page_content=row['description'],
                metadata={"image_path": row["image_path"], "description": row["description"]}
            )
            for _, row in batch_df.iterrows()
        ]
        # Initialize or append to FAISS index
        if vectorstore_index is None:
            vectorstore_index = FAISS.from_documents(documents, vectorstore_embeddings)
        else:
            vectorstore_index.add_documents(documents)

    # ---- Save vector store index
    print(f'> Saving vectorstore: {vectorstore_path}')
    vectorstore_index.save_local(vectorstore_path)
    print('> Vectorstore created!')
search_images_description.py
from langchain_ollama.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import FAISS


# Query function
def search_images(query, top_k=3):
    results = vectorstore.similarity_search_with_score(query, k=top_k)
    return [
        (doc.metadata["image_path"], doc.metadata["description"], score)
        for doc, score in results
    ]


if __name__ == '__main__':
    # ---- Configuration ------------------------------------------------------
    embedding_model = 'nomic-embed-text'
    vectorstore_path = f'vectorstore_{embedding_model}'

    # ---- Load FAISS index ---------------------------------------------------
    vectorstore_embeddings = OllamaEmbeddings(model=embedding_model)
    vectorstore = FAISS.load_local(vectorstore_path, vectorstore_embeddings, allow_dangerous_deserialization=True)

    # ---- Query vector store -------------------------------------------------
    search_another = True
    while search_another:
        print('-' * 80)
        image_search_query = input("Enter an image description to search for: ")
        image_results = search_images(image_search_query)
        print("\n---- Images related ----------------------\n")
        for image_path, description, score in image_results:
            print(f"- Image Path: {image_path}\n\t- Score: {score:.4f}\n\t- Description: {description}\n")
        search_another = input("Search again? [Y,n]: ").lower() in ['y', '']
Wrapping Up: From Pixels to Searchable Insights
In this article, we built a powerful AI-driven image search system using a multimodal model (LLaVA) and a vector database (FAISS). Instead of manually tagging images, we let AI see, describe, and organize them—turning an unstructured image repository into a fully searchable collection.
What We Accomplished
- Generated image descriptions automatically using LLaVA.
- Converted those descriptions into vectors with nomic-embed-text.
- Stored them in FAISS for efficient similarity-based searches.
- Built an interactive search system that retrieves images based on natural language queries.
This workflow shows how multimodal AI and vector databases can change the way we interact with visual data. Whether you're cataloging a massive photo archive, building an AI-powered image search tool, or enhancing media management workflows, this technique offers a scalable, efficient solution.
Next Steps
You can expand on this project by:
- Enhancing search accuracy with better prompts or fine-tuned models.
- Adding metadata filtering (e.g., searching by date or category); a sketch of this idea follows this list.
- Building a web interface for a more user-friendly experience.
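For the metadata filtering idea, recent versions of LangChain's FAISS wrapper accept a filter argument on their search methods, which matches against document metadata (check the exact behavior in your installed langchain_community version). The sketch below assumes you add a hypothetical category field to each Document's metadata at indexing time; neither the field nor the query is part of the scripts above:
# At indexing time, store an extra (hypothetical) metadata field:
# Document(page_content=..., metadata={"image_path": ..., "description": ..., "category": "landscape"})

# At query time, restrict the search to documents whose metadata matches:
results = vectorstore.similarity_search_with_score(
    "a snowy mountain at sunset",
    k=3,
    filter={"category": "landscape"},
)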
Now it’s your turn—how will you use AI to make image search smarter? 🚀
About Me
I’m Gabriel, and I like computers. A lot.
For nearly 30 years, I’ve explored the many facets of technology—as a developer, researcher, sysadmin, security advisor, and now an AI enthusiast. Along the way, I’ve tackled challenges, broken a few things (and fixed them!), and discovered the joy of turning ideas into solutions. My journey has always been guided by curiosity, a love of learning, and a passion for solving problems in creative ways.
See ya around!