

Tutorial: How to chat with your documents
A step-by-step guide to doing Q&A with your data, using LlamaIndex and OpenAI.
Welcome to a new series: Tutorials! With all of the discussion around AI, I have been seeing less AI Engineering content than I would like - actual hands-on writeups of AI coding projects. So we're starting with an example we've discussed in the past: how to chat with your documents. This is quickly becoming the "Hello, World" of LLM applications, so it's a reasonable place to start.
This walkthrough is a bit on the technical side, as we’ll dive into a bunch of programming concepts. Knowledge of Python is expected, but machine learning is NOT a prerequisite.
For this tutorial, we will use LlamaIndex, and by extension OpenAI. LlamaIndex allows you to swap out OpenAI for other models, but for convenience (and because it's the default), we're going to assume OpenAI usage.
Setup
Assuming you already have a Python environment, installation is one line:
pip install llama-index
But, depending on what types of documents you're working with, you may need some additional libraries. In my case, I needed nltk and pypdf to get things fully operational.
pip install nltk pypdf
We'll also need to add an OpenAI API key to our environment. There are a few ways to do this, but the most explicit is declaring it at the top of any Python files where we're using LlamaIndex. By default, LlamaIndex uses OpenAI models for chat and embeddings, though it can be configured to use both local LLMs and local embeddings instead. (And if you ever see ImportError: Could not import llama_cpp, it means you probably forgot to set your OpenAI API key!)
import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_KEY"
Preparing documents
Next, we'll create a data folder with the files we want to work with. In this example, I'm going to download a Markdown version of my previous post, "OpenAI's top competitors".
mkdir data/
cd data/
wget https://gist.githubusercontent.com/charlierguo/e73aa2c74def8b7332555cfc5573512f/raw/0c0c62e08271aa4f5226f785d2621bd40710e5b0/openai.txt
# add your files here
We're using a simple text file here, but LlamaIndex will work with PDFs too! You might need to install the pypdf library mentioned above, but after that, all the code below will work the same.
Building an index
After making the data folder, we need to build the index. It's only two lines, but a lot is happening behind the scenes.
from llama_index import VectorStoreIndex, SimpleDirectoryReader
documents = SimpleDirectoryReader('data').load_data()
index = VectorStoreIndex.from_documents(documents)
First, we're using SimpleDirectoryReader, a class meant to read files from a directory and load them into a list of Documents. Each Document is just a container around a piece of data - a text file, a PDF, a database query, etc. They store text, metadata (custom annotations), and relationships (ties to other documents). Documents are composed of Nodes, which are "chunks" of Documents - text, images, etc.
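To make that concrete, here's roughly what a hand-built Document looks like (normally the data loader constructs these for you). The text and metadata below are just placeholder values, and depending on your LlamaIndex version the metadata field may be called extra_info instead:
from llama_index import Document
# A placeholder Document: raw text plus custom metadata annotations
doc = Document(
    text="Anthropic was founded in 2021 by former OpenAI researchers.",
    metadata={"source": "openai.txt", "topic": "AI companies"},
)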
The SimpleDirectoryReader is a type of "data loader" - LlamaIndex has over 100 different data loaders available, which can be downloaded from LlamaHub. Our data loader is quite basic, but there are custom loaders to get data from third-party sources like Google Docs, Notion, or Wikipedia. We'll come back to these later.
Second, we're using a VectorStoreIndex. To understand what this is doing, we need to dive into the architecture of LlamaIndex a little bit.
At its core, LlamaIndex is meant to build LLM-powered apps (like chatbots, question-answering, or agents) on top of custom data. The way it does this is through RAG (Retrieval Augmented Generation). RAG consists of two stages - indexing and querying. Indexing means converting each document to chunks (or Nodes) for later reference. Querying takes an input (like a question) and searches the index to find the most relevant document chunks. Those chunks are then passed to a response generator, which we'll touch on later.
The most common way to index is with a vector store, which creates an embedding vector for each chunk and saves it. By default, LlamaIndex will use OpenAI's text-embedding-ada-002 model to create embeddings and store the vectors in memory (via a SimpleVectorStore). But for any serious projects, you'll probably want to use a standalone vector database like Chroma, Pinecone, or Weaviate instead.
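If you're curious what's actually being stored, you can generate an embedding directly. Here's a quick sketch using the default OpenAI embedding model (the import path may differ slightly depending on your LlamaIndex version):
from llama_index.embeddings import OpenAIEmbedding
# Each chunk gets converted into a 1536-dimensional vector like this one
embed_model = OpenAIEmbedding()  # defaults to text-embedding-ada-002
vector = embed_model.get_text_embedding("OpenAI's top competitors")
print(len(vector))  # 1536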
But you don't have to create embedding vectors! You can create different kinds of indices entirely: a ListIndex just keeps all of the text chunks in a list, whereas a TreeIndex keeps them as text, but generates a tree for faster searching. The index type pretty heavily depends on the use case you're building for.
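For example, building a ListIndex looks nearly identical to the vector version - it just skips the embedding step (a rough sketch; in newer versions of LlamaIndex this class is named SummaryIndex):
from llama_index import ListIndex
# No embeddings here - chunks are stored in a list and scanned at query time,
# which means every chunk can end up being sent to the LLM
list_index = ListIndex.from_documents(documents)
query_engine = list_index.as_query_engine()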
So to recap: in the two lines of Python that built our index, we're doing the following:
1. Using a SimpleDirectoryReader to load text files from /data.
2. Creating a list of Documents from each text file.
3. Chunking each document into a list of Nodes.
4. Creating an embedding vector for each Node using OpenAI's embeddings model.
5. Holding the embeddings in local memory with a SimpleVectorStore.
Asking a question
With our article ingested and indexed, we can try asking it a question:
query_engine = index.as_query_engine()
response = query_engine.query("What year was Anthropic founded?")
print(response)
If you're using the same data source, you should get an answer like "Anthropic was founded in 2021." Remember, LLMs are non-deterministic, so you may not always get the same answer.
Technically, we've gotten to our "Hello, World" moment. Hooray!
Don't worry; we're still going to push this a bit further. But before we make things more interesting, let's talk about how this works. Again, a lot is happening behind the scenes.
The first thing we're doing is creating a query engine from our index. A query engine comprises two parts: a retriever and a synthesizer. When we ask a question (or "query"), the retriever starts by finding the most relevant document chunks. Then, the synthesizer takes the query, plus the relevant chunks, and uses them to prompt an LLM for an answer.
This is a pretty huge simplification, and a lot of configuration goes into both the retriever and the synthesizer. Let's talk about the retriever first. The retriever fetches the document chunks most relevant to the user's query. We're using a VectorStoreIndex, so the corresponding retriever is the VectorIndexRetriever, but there are others. Depending on the type of index you're using, there are one or more ways of getting data back out of it.
But having the relevant chunks isn't enough - you need to turn them into an actual answer. The synthesizer handles this by using internal prompts to pass the chunks to an LLM. The default for LlamaIndex is ChatGPT, specifically gpt-3.5-turbo, though as I've mentioned, you can swap the model. Here's what those prompts look like - there's one for the first chunk and one for the following chunks:
Context information is below.
---------------------
{context}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query}
Answer:
The original query is as follows: {query}
We have provided an existing answer: {existing_answer}
We have the opportunity to refine the existing answer (only if needed) with some more context below.
------------
{context}
------------
Given the new context, refine the original answer to better answer the query. If the context isn't useful, return the original answer.
Refined Answer:
What we're doing here is passing each relevant chunk to the LLM to shape an answer. With each chunk, we're adding context, and slowly molding the final response. That means if you have a ton of relevant data, you may end up using a lot of API calls - though the default mode is to try and compress the chunks as much as possible.
So index.as_query_engine() is actually doing a ton of heavy lifting. If we wanted to create the query engine without shortcuts, it would look like this instead:
from llama_index import VectorStoreIndex, get_response_synthesizer
from llama_index.retrievers import VectorIndexRetriever
from llama_index.query_engine import RetrieverQueryEngine
# create index
index = VectorStoreIndex.from_documents(documents)
# create retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=2
)
# create synthesizer
response_synthesizer = get_response_synthesizer(
    response_mode="compact"
)
# create query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer
)
# query
response = query_engine.query("What year was Anthropic founded?")
print(response)
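And if you'd rather not use gpt-3.5-turbo, the usual way to swap the model is to pass a ServiceContext when building the index. Here's a rough sketch (newer versions of LlamaIndex replace ServiceContext with a global Settings object):
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.llms import OpenAI
# Use a different OpenAI chat model for response synthesis
service_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-4", temperature=0))
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine()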
Getting more advanced
So far, this has been a pretty limited example. As I said, it's the "Hello, World" of working with LLMs. But there are more exciting use cases built on top of this foundation:
Running queries over Slack messages, Notion documents, emails, or SQL databases.
Having a chatbot help with answers rather than one-and-done questions.
Creating a web application for users to upload documents for querying.
Developing an agent to review new documents and data automatically.
Using Llama 2 and HuggingFace embeddings to run all models locally. No internet needed.
Let's combine the first two use cases and look at building a chatbot that runs on third-party data. In my case, what I want is to be able to chat about my repository of AI notes, which I keep in Obsidian. It's a pretty big library, so I don't think it will all fit into memory. So we'll need an Obsidian-specific data loader, a vector database, and a chatbot interface for the notes.
Luckily, LlamaHub already has an Obsidian-specific data loader, and downloading it is pretty easy.
from llama_index import download_loader
ObsidianReader = download_loader('ObsidianReader')
documents = ObsidianReader('/path/to/vault').load_data()
The vector database is a bit trickier. LlamaIndex offers a lot of choices, but I've decided to go with Pinecone. I'll need to install the library and its dependencies before using it.
pip install pinecone-client transformers
Getting started is fast: I can make a new index with a single line. Note that you must set the dimensions of the vectors you intend to store when making the index. In our case, that's 1536, which we know from OpenAI's documentation on its text-embedding-ada-002 model.
import pinecone
pinecone.init(api_key="YOUR_PINECONE_KEY", environment="gcp-starter")
# dimensions are for text-embedding-ada-002
pinecone.create_index("obsidian", dimension=1536, metric="euclidean")
pinecone_index = pinecone.Index("obsidian")
And once we've created the Pinecone Index, we can use that to create our VectorStoreIndex. It's worth noting that I "only" have a few hundred documents in my Obsidian library, and creating this index takes a few minutes. If you're building a production application, I recommend creating the indices ahead of time.
from llama_index import VectorStoreIndex, SimpleDirectoryReader
from llama_index.storage.storage_context import StorageContext
from llama_index.vector_stores import PineconeVectorStore
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
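And once the vectors are in Pinecone, you don't need to re-ingest the documents on every run. Here's a sketch of reconnecting to the existing index later (assuming the same pinecone_index from above):
from llama_index import VectorStoreIndex
from llama_index.vector_stores import PineconeVectorStore
# Reuse the vectors already stored in Pinecone instead of re-embedding everything
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)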
The last step is to build a ChatEngine (as opposed to a QueryEngine). This is pretty similar to the process we described above, but there are a couple of key differences:
The engine maintains state: it keeps track of the whole conversation (up to the context window limit) and remembers prior questions and answers.
The underlying model is ChatGPT rather than GPT-3. That means there is still a possibility for hallucinations when it comes to generating answers.
chat_engine = index.as_chat_engine()
response = chat_engine.chat("What are model cards?")
# The engine keeps the conversation history, so "them" resolves to "model cards"
next_response = chat_engine.chat("Who created them?")
Final thoughts
After a first pass at building with LlamaIndex, it's clear it's a flexible, useful tool for working with data and LLMs. The data loaders offer an impressive amount of choice for using outside documents, and the library offers ever-growing customizability. We also barely scratched the surface of the different types of query engines and routers.
That said, its strong point seems to be extracting specific wording from text rather than summarizing entire libraries. That's reinforced by the fact that it initially used a GPT-3 model by default, which is a text-completion model rather than a chat model. Another downside is how it handles text chunks: depending on your settings, running queries over an extensive document library could result in a massive number of tokens/API calls being used.
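If token usage is a concern, the main knobs are the chunk size used at indexing time and how many chunks the retriever returns per query. A rough sketch (parameter names vary a bit between LlamaIndex versions):
from llama_index import ServiceContext, VectorStoreIndex
# Smaller chunks plus a lower top-k means fewer tokens sent to the LLM per query
service_context = ServiceContext.from_defaults(chunk_size=512)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine(similarity_top_k=2, response_mode="compact")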
Ultimately, it's one attempt at building a storage/retrieval system for LLMs. Right now, we don't have the AI equivalent of a SQL database - while embeddings and vector databases can perform semantic search, they're not always the most efficient or accurate approach. Other approaches are being tested, which I'm also hoping to explore. For now, any future LlamaIndex projects will likely involve trying to use locally-run models.