{"id":14933,"date":"2025-01-17T14:47:23","date_gmt":"2025-01-17T14:47:23","guid":{"rendered":"https:\/\/temperies.com\/?p=14933"},"modified":"2025-01-17T14:47:24","modified_gmt":"2025-01-17T14:47:24","slug":"teach-your-pdfs-to-chat-build-an-ai-powered-assistant-that-delivers-answers","status":"publish","type":"post","link":"https:\/\/temperies.com\/es\/2025\/01\/17\/teach-your-pdfs-to-chat-build-an-ai-powered-assistant-that-delivers-answers\/","title":{"rendered":"Teach Your PDFs to Chat: Build an AI-Powered Assistant That Delivers Answers"},"content":{"rendered":"<h1>Teach Your PDFs to Chat: Build an AI-Powered Assistant That Delivers Answers<\/h1>\n\n\n\n<p><em>by Gabriel Vergara<\/em><\/p>\n\n\n\n<h2>Introduction<\/h2>\n\n\n\n<p>Turning static PDF documents into an AI-powered, interactive knowledge assistant might sound complex, but with the right tools, it\u2019s surprisingly achievable. In this hands-on guide, we\u2019ll dive straight into the nuts and bolts of building a <strong>Retrieval-Augmented Generation (RAG)<\/strong> pipeline\u2014a system that blends vector-based information retrieval with the power of generative AI.<\/p>\n\n\n\n<p>I will walk you through every step of the process, focusing on the code and tools that make it possible. From setting up a vector store using <strong>FAISS<\/strong> (Facebook AI Similarity Search) to integrating <strong>Langchain<\/strong> for orchestration and querying advanced language models via <strong>Ollama<\/strong>, this article is all about building, experimenting, and creating.<\/p>\n\n\n\n<p>This isn\u2019t just theory\u2014it\u2019s a practical, end-to-end implementation that transforms your PDFs into a queryable AI-driven resource. 
Along the way, you\u2019ll see Python scripts in action and understand how these tools come together to create a flexible, powerful documentation assistant <em>Proof of Concept<\/em>.<\/p>\n\n\n\n<p>If you want a more in-depth theoretical explanation, do not hesitate to check out this article: <a href=\"https:\/\/temperies.com\/es\/2025\/01\/16\/from-messy-files-to-magic-answers-how-rag-makes-ai-smarter-and-life-easier\/\" data-type=\"post\" data-id=\"14927\">From Messy Files to Magic Answers: How RAG Makes AI Smarter (and Life Easier)<\/a><\/p>\n\n\n\n<p>Ready to roll up your sleeves and bring your documentation to life? Let\u2019s dive in!<\/p>\n\n\n\n<h2>Prerequisites<\/h2>\n\n\n\n<p>Before diving into the examples, ensure that your development environment is set up with the necessary tools and dependencies. Here\u2019s what you\u2019ll need:<\/p>\n\n\n\n<ol><li><strong>Ollama<\/strong>: A local instance of Ollama is required for embedding and querying operations. If you don\u2019t already have Ollama installed, you can download it from <a href=\"https:\/\/ollama.com\/download\">here<\/a>. This guide assumes that Ollama is installed and running on your machine.<\/li><li><strong>Models<\/strong>: Once Ollama is set up, pull the required models with the <code>ollama pull<\/code> command.<ul><li><strong>Mistral model<\/strong>: Used for querying during the retrieval-augmented generation process (<a href=\"https:\/\/ollama.com\/library\/mistral\">check here<\/a>).<\/li><li><strong>nomic-embed-text embedding model<\/strong>: Used to create embeddings for document chunks (<a href=\"https:\/\/ollama.com\/library\/nomic-embed-text\">check here<\/a>).<\/li><\/ul><\/li><li><strong>Python Environment<\/strong>:<ul><li>Python version: This script has been tested with Python 3.10. Ensure you have a compatible Python version installed.<\/li><li>Installing Dependencies: Use a Python environment management tool like <code>pipenv<\/code> to set up the required libraries. 
Execute the following command in your terminal:<\/li><\/ul><\/li><\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code>pipenv install langchain langchain-community langchain-ollama faiss-cpu pypdf<\/code><\/pre>\n\n\n\n<ol start=\"4\"><li><strong>Sample documents<\/strong>: a sample of about ten PDF documents, for which I recommend files with <em>easy to grab<\/em> text (PDF files with scanned images will not work as you expect\u2026 I&#8217;ve warned you!).<\/li><\/ol>\n\n\n\n<p>With these prerequisites in place, you\u2019ll be ready to proceed to the next steps in setting up the vector store and querying it.<\/p>\n\n\n\n<h2>Feeding the Vector Store Index<\/h2>\n\n\n\n<p>In this section, we&#8217;ll explore how to populate a vector store with embeddings derived from PDF documents. This process involves configuring the setup, processing the documents, splitting them into manageable chunks, generating embeddings, and saving the vector store for efficient retrieval later (full code at the end of this article).<\/p>\n\n\n\n<h3>Configuration and Setup<\/h3>\n\n\n\n<p>The script begins by defining paths and configurations, including the embedding model to be used and directories for input documents and the vector store.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>embedding_model = 'nomic-embed-text'\ndocument_input_path = '.\/document_tray\/'\nvectorstore_path = f'vectorstore_{embedding_model}'<\/code><\/pre>\n\n\n\n<ul><li><code>embedding_model<\/code>: Specifies the model used to generate embeddings. 
In this example, we use <strong>nomic-embed-text<\/strong>.<\/li><li><code>document_input_path<\/code>: Directory containing the PDF documents to be processed.<\/li><li><code>vectorstore_path<\/code>: Directory where the vector store will be saved.<\/li><\/ul>\n\n\n\n<h3>Vector Store Initialization<\/h3>\n\n\n\n<p>Before creating the vector store, the script removes any existing store with the same name to ensure a fresh start.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Remove previous vectorstore\nif os.path.isdir(vectorstore_path):\n    shutil.rmtree(vectorstore_path)<\/code><\/pre>\n\n\n\n<p><strong>Purpose<\/strong>: Ensures that outdated or conflicting data does not interfere with the new vector store.<\/p>\n\n\n\n<h3>Loading and Splitting PDF Documents<\/h3>\n\n\n\n<p>The script processes all PDF files in the specified input directory, loading their content and splitting them into smaller chunks using a character-based text splitter.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>document_list = &#91;]\n\n# Setup text splitter\ntext_splitter = CharacterTextSplitter(\n    chunk_size=1000, chunk_overlap=30, separator=\"\\n\"\n)\n\n# Process files\nfor file in os.listdir(document_input_path):\n    if file.endswith('.pdf'):\n        pdf_file = os.path.join(document_input_path, file)\n        loader = PyPDFLoader(file_path=pdf_file)\n        docs_chunks = text_splitter.split_documents(loader.load())\n        document_list.extend(docs_chunks)<\/code><\/pre>\n\n\n\n<ul><li><strong>Text Splitting<\/strong>: Chunks are split to a size of 1000 characters with a 30-character overlap to preserve context across chunks.<\/li><li><strong>PyPDFLoader<\/strong>: Reads the content of each PDF file and prepares it for splitting.<\/li><\/ul>\n\n\n\n<h3>Creating and Saving the Vector Store<\/h3>\n\n\n\n<p>After processing the documents, the script generates embeddings for each chunk and saves them in a FAISS vector store.<\/p>\n\n\n\n<pre 
class=\"wp-block-code\"><code>vectorstore_embeddings = OllamaEmbeddings(model=embedding_model)\nvectorstore_index = FAISS.from_documents(document_list, vectorstore_embeddings)\nvectorstore_index.save_local(vectorstore_path)<\/code><\/pre>\n\n\n\n<ul><li><strong>OllamaEmbeddings<\/strong>: Converts document chunks into high-dimensional vectors based on the specified embedding model.<\/li><li><strong>FAISS<\/strong>: Builds the vector store using the embeddings and the processed document chunks.<\/li><li><strong>Saving<\/strong>: The vector store is saved locally for later retrieval.<\/li><\/ul>\n\n\n\n<p>This script effectively automates the process of turning static PDFs into a structured and queryable vector store, forming the foundation for retrieval-augmented generation. With the vector store prepared, the next step is to query it for relevant context during LLM interactions.<\/p>\n\n\n\n<h2>Querying the Vector Store<\/h2>\n\n\n\n<p>This script demonstrates how to load an existing vector store, enhance the LLM\u2019s prompt using retrieval-augmented generation (RAG), and query the model for answers based on the stored context (full code at the end of this article). Here&#8217;s a step-by-step breakdown:<\/p>\n\n\n\n<h3>Configuration and Setup<\/h3>\n\n\n\n<p>The first step is to define the configuration, including the LLM model and embedding model, as well as the path to the pre-created vector store.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>llm_model = 'mistral'\nembedding_model = 'nomic-embed-text'\nvectorstore_path = f'vectorstore_{embedding_model}'<\/code><\/pre>\n\n\n\n<ul><li><code>llm_model<\/code>: Specifies the LLM used for generating responses. 
In this case, the Mistral model is used.<\/li><li><code>embedding_model<\/code>: Defines the embedding model used to generate vector representations.<\/li><li><code>vectorstore_path<\/code>: Points to the directory where the vector store index is saved.<\/li><\/ul>\n\n\n\n<h3>Loading the Vector Store<\/h3>\n\n\n\n<p>The vector store is loaded from the saved directory, enabling efficient retrieval of relevant chunks during queries.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>vectorstore_embeddings = OllamaEmbeddings(model=embedding_model)\nvectorstore_index = FAISS.load_local(\n    vectorstore_path, vectorstore_embeddings, allow_dangerous_deserialization=True\n)<\/code><\/pre>\n\n\n\n<ul><li><strong><code>OllamaEmbeddings<\/code><\/strong>: Re-creates the embedding structure required for interpreting the stored vectors.<\/li><li><strong><code>FAISS.load_local<\/code><\/strong>: Loads the FAISS vector store index and associates it with the embedding model for retrieval. The <code>allow_dangerous_deserialization=True<\/code> flag is required because the index is stored with pickle; only enable it for indexes you created yourself.<\/li><\/ul>\n\n\n\n<h3>Creating the Prompt Template and Retrieval Chain<\/h3>\n\n\n\n<p>A custom prompt template is defined to guide the LLM in generating answers solely based on the retrieved context.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>retrieval_qa_chat_prompt = ChatPromptTemplate.from_messages(&#91;\n    (\"system\", \"Answer any user questions based solely on the context below:\\n\\n&lt;context&gt;\\n{context}\\n&lt;\/context&gt;\"),\n    (\"placeholder\", \"{chat_history}\"),\n    (\"human\", \"{input}\"),\n])\ncombine_docs_chain = create_stuff_documents_chain(\n    ChatOllama(model=llm_model), retrieval_qa_chat_prompt\n)\nretrieval_chain = create_retrieval_chain(\n    vectorstore_index.as_retriever(), combine_docs_chain\n)<\/code><\/pre>\n\n\n\n<ul><li><strong>Prompt Template<\/strong>: Provides instructions to the LLM, including placeholders for context (<code>{context}<\/code>), chat history (<code>{chat_history}<\/code>), and user input 
(<code>{input}<\/code>).<\/li><li><strong>Combining Documents<\/strong>: Uses <code>create_stuff_documents_chain<\/code> to integrate the context and user input for the LLM.<\/li><li><strong>Retrieval Chain<\/strong>: Combines the vector store retriever with the document chain to form a seamless query pipeline.<\/li><\/ul>\n\n\n\n<h3>Querying the LLM with Questions<\/h3>\n\n\n\n<p>The script defines a list of sample queries and processes each query using the retrieval chain.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>query_list = &#91;\n    \"What can you tell me about Google Cloud Vision AI?\",\n    \"What kind of service is provided by Sightengine?\",\n    \"What is a chatbot?\",\n    \"Can you explain what is the 'chain of thought' process?\"\n]\n\nfor query in query_list:\n    print(f'query&gt; {query}')\n    res = retrieval_chain.invoke({'input': query})\n    print(f\"{llm_model}&gt; {res&#91;'answer']}\")<\/code><\/pre>\n\n\n\n<ul><li><strong>Query List<\/strong>: A set of example questions to demonstrate the model&#8217;s ability to retrieve and answer contextually.<\/li><li><strong>Retrieval Chain Invocation<\/strong>: Each query is passed to the <code>retrieval_chain<\/code>, which retrieves relevant context and feeds it to the LLM for a final answer.<\/li><\/ul>\n\n\n\n<h3>Output<\/h3>\n\n\n\n<p>The model answers each query using the retrieved context, keeping its responses grounded in the stored documents.<\/p>\n\n\n\n<h2>Full code<\/h2>\n\n\n\n<p>This is the full code of both the vector store feeder and the vector store query script.<\/p>\n\n\n\n<h3>vectorstore_feed.py<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>import os\nimport shutil\n\nfrom langchain_community.vectorstores import FAISS\nfrom langchain_community.document_loaders import PyPDFLoader\nfrom langchain_ollama.embeddings import OllamaEmbeddings\nfrom langchain_text_splitters import CharacterTextSplitter\n\n\nif __name__ == '__main__':\n    # ---- Configuration 
------------------------------------------------------\n    embedding_model = 'nomic-embed-text'\n    document_input_path = '.\/document_tray\/'\n    vectorstore_path = f'vectorstore_{embedding_model}'\n\n    # ---- Vector Store setup -------------------------------------------------\n    print('-' * 80)\n    print(f'&gt; Creating vectorstore: {vectorstore_path} ')\n    # Remove previous vectorstore\n    if os.path.isdir(vectorstore_path):\n        print('  &gt; Removing existing vectorstore... ')\n        shutil.rmtree(vectorstore_path)\n\n    # ---- Grab each PDF file as a document for the vector store\n    print(f'  &gt; Processing PDF files in: {document_input_path} ')\n    # All documents list\n    document_list = &#91;]\n\n    # Setup text splitter\n    text_splitter = CharacterTextSplitter(\n        chunk_size=1000, chunk_overlap=30, separator=\"\\n\"\n    )\n\n    # Process files\n    for file in os.listdir(document_input_path):\n        if file.endswith('.pdf'):\n            # load the PDF document\n            pdf_file = os.path.join(document_input_path, file)\n            print(f'    &gt; Processing file: {pdf_file} ')\n            loader = PyPDFLoader(file_path=pdf_file)\n            docs_chunks = text_splitter.split_documents(loader.load())\n            document_list.extend(docs_chunks)\n\n    # ---- Create vector store index\n    print(f'  &gt; Saving vectorstore: {vectorstore_path}')\n    vectorstore_embeddings = OllamaEmbeddings(model=embedding_model)\n    vectorstore_index = FAISS.from_documents(document_list, vectorstore_embeddings)\n    vectorstore_index.save_local(vectorstore_path)\n    print('&gt; Vectorstore created!')<\/code><\/pre>\n\n\n\n<h3>vectorstore_query.py<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>from langchain_community.vectorstores import FAISS\nfrom langchain.chains.combine_documents import create_stuff_documents_chain\nfrom langchain.chains.retrieval import create_retrieval_chain\nfrom langchain_ollama.embeddings import 
OllamaEmbeddings\nfrom langchain_ollama.chat_models import ChatOllama\nfrom langchain_core.prompts import ChatPromptTemplate\n\n\nif __name__ == '__main__':\n    # ---- Configuration ------------------------------------------------------\n    llm_model = 'mistral'\n    embedding_model = 'nomic-embed-text'\n    vectorstore_path = f'vectorstore_{embedding_model}'\n\n    # ---- Load vector store index\n    print('-' * 80)\n    print(f'&gt; Loading vectorstore: {vectorstore_path}')\n    vectorstore_embeddings = OllamaEmbeddings(model=embedding_model)\n    vectorstore_index = FAISS.load_local(\n        vectorstore_path, vectorstore_embeddings, allow_dangerous_deserialization=True\n    )\n    print('&gt; Vectorstore loaded!')\n\n    # ---- Improved Prompt ----------------------------------------------------\n    retrieval_qa_chat_prompt = ChatPromptTemplate.from_messages(&#91;\n        (\"system\", \"Answer any user questions based solely on the context below:\\n\\n&lt;context&gt;\\n{context}\\n&lt;\/context&gt;\"),\n        (\"placeholder\", \"{chat_history}\"),\n        (\"human\", \"{input}\"),\n    ])\n    combine_docs_chain = create_stuff_documents_chain(\n        ChatOllama(model=llm_model), retrieval_qa_chat_prompt\n    )\n    retrieval_chain = create_retrieval_chain(\n        vectorstore_index.as_retriever(), combine_docs_chain\n    )\n\n    # ---- Query the LLM model with a set of questions ------------------------\n    query_list = &#91;\n        \"What can you tell me about Google Cloud Vision AI?\",\n        \"What kind of service is provided by Sightengine?\",\n        \"What is a chatbot?\",\n        \"Can you explain what is the 'chain of thought' process?\"\n    ]\n\n    print(f'&gt; Querying model: {llm_model}')\n    for query in query_list:\n        print('-' * 20)\n        print(f'query&gt; {query}')\n        res = retrieval_chain.invoke({'input': query})\n        print(f\"{llm_model}&gt; 
{res&#91;'answer']}\")<\/code><\/pre>\n\n\n\n<h2>Conclusion<\/h2>\n\n\n\n<p>In this article, we walked through building a RAG pipeline from the ground up, leveraging tools like <strong>FAISS<\/strong>, <strong>Langchain<\/strong>, and <strong>Ollama<\/strong> to create an AI-driven documentation assistant. We transformed static PDF files into an interactive, queryable knowledge resource, demonstrating how practical and powerful this approach can be.<\/p>\n\n\n\n<p>One of the greatest strengths of this implementation is its flexibility. You can easily swap out the language models, embedding models, or even expand the pipeline to handle other document types, like Word files, HTML, or emails. With a few tweaks to the code, this system can evolve alongside advancements in AI and adapt to meet your unique needs.<\/p>\n\n\n\n<p>This is just the beginning. Use it as a sandbox to experiment\u2014adjust parameters, test new embeddings, or refine the AI prompts to handle specific queries better. The modularity of the pipeline allows for endless customization, enabling you to tailor it to personal projects or professional applications.<\/p>\n\n\n\n<p>Now it\u2019s your turn. Dive into the code, explore the possibilities, and make it your own. By building on this foundation, you\u2019ll not only simplify how you access information but also create tools that bring value, clarity, and efficiency to the way we interact with data.<\/p>\n\n\n\n<p>Let the innovation begin!<\/p>\n\n\n\n<h2>About Me<\/h2>\n\n\n\n<p><em>I\u2019m Gabriel, and I like computers. A lot.<\/em><\/p>\n\n\n\n<p>For nearly 30 years, I\u2019ve explored the many facets of technology\u2014as a developer, researcher, sysadmin, security advisor, and now an AI enthusiast. Along the way, I\u2019ve tackled challenges, broken a few things (and fixed them!), and discovered the joy of turning ideas into solutions. 
My journey has always been guided by curiosity, a love of learning, and a passion for solving problems in creative ways.<\/p>\n\n\n\n<p>See ya around!<\/p>","protected":false},"excerpt":{"rendered":"<p>Turning static PDF documents into an AI-powered, interactive knowledge assistant might sound complex, but with the right tools, it\u2019s surprisingly achievable. In this hands-on guide, we\u2019ll dive straight into the nuts and bolts of building a Retrieval-Augmented Generation (RAG) pipeline\u2014a system that blends vector-based information retrieval with the power of generative AI.<\/p>","protected":false},"author":4,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[54],"tags":[55,67,65,66,61,64,63,57,62,59,60],"_links":{"self":[{"href":"https:\/\/temperies.com\/es\/wp-json\/wp\/v2\/posts\/14933"}],"collection":[{"href":"https:\/\/temperies.com\/es\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/temperies.com\/es\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/temperies.com\/es\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/temperies.com\/es\/wp-json\/wp\/v2\/comments?post=14933"}],"version-history":[{"count":3,"href":"https:\/\/temperies.com\/es\/wp-json\/wp\/v2\/posts\/14933\/revisions"}],"predecessor-version":[{"id":14936,"href":"https:\/\/temperies.com\/es\/wp-json\/wp\/v2\/posts\/14933\/revisions\/14936"}],"wp:attachment":[{"href":"https:\/\/temperies.com\/es\/wp-json\/wp\/v2\/media?parent=14933"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/temperies.com\/es\/wp-json\/wp\/v2\/categories?post=14933"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/temperies.com\/es\/wp-json\/wp\/v2\/tags?post=14933"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}