👋 Hi, this is Sagar with this week’s edition of the Mindful Matrix newsletter. This is the second edition of my GenAI/LLM learning series.
In this edition, I’ll discuss a key architectural approach known as Retrieval Augmented Generation (RAG), which improves the efficacy of LLMs by leveraging custom data. I’ll then show how to build a simple LLM application that uses the RAG architecture to query your data, in 5 easy steps.
Before we begin, here’s what you’ll get from this article -
Limitations of LLMs and the need for RAG.
What exactly is RAG, and what does a RAG pipeline look like?
Build a simple LLM application using RAG in 5 easy steps.
Common use cases for RAG
Note: For starting points for diving into LLMs/GenAI, please check out this post.
Limitations of LLMs and the need for RAG
LLMs learn language patterns by analyzing vast text datasets, predicting the next word in a sentence based on previous words. However, they face key limitations:
Once trained, LLMs can't incorporate new information beyond their training cutoff, which leads to inaccuracies or hallucinations on new or unseen data.
LLMs are typically trained on general data and struggle with domain-specific queries. Without an organization's specific data, they can only provide broad, generalized responses rather than precise, relevant answers.
This is where Retrieval Augmented Generation (RAG) comes into play.
So What is Retrieval Augmented Generation (RAG)?
RAG is a technique to ground an LLM so that it generates responses to your queries based on a custom knowledge base that you provide.
This is done by retrieving data/documents relevant to a question or task and providing them as context for the LLM.
With the RAG architecture, organizations can deploy any LLM and augment it to return results relevant to their organization by giving it a small amount of their data, without the cost and time of fine-tuning or pretraining the model.
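Conceptually, the flow can be sketched in a few lines of Python. This is only an illustrative sketch; retrieve_relevant_chunks and call_llm are hypothetical placeholders standing in for a retriever and an LLM call, not real library functions.
# Minimal conceptual sketch of the RAG flow (placeholders, not a real implementation)
def answer_with_rag(question, knowledge_base):
    # Retrieve: find the chunks of your data most relevant to the question
    relevant_chunks = retrieve_relevant_chunks(question, knowledge_base)
    # Augment: add the retrieved chunks to the prompt as context
    context = "\n\n".join(relevant_chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    # Generate: the LLM answers grounded in the supplied context
    return call_llm(prompt)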
Here’s a simple RAG pipeline -
Let’s understand the above architecture in detail as we build our LLM application.
Build a simple LLM application in 5 easy steps
There are many ways to implement a RAG system, depending on specific needs and data nuances. We will build a simple RAG application over our custom knowledge base and then query it using an LLM.
I used the LangChain framework, which provides the building blocks for RAG applications.
Initial setup -
I’ve used the Amazon Titan Embeddings text model as the embedding model and the Claude V2 model as the LLM, both through Amazon Bedrock, which requires an AWS account to be set up.
I’ve used an AWS SageMaker notebook instance to run my Python application.
Note: You can also run it locally with the required permissions to invoke external APIs (for example, setting up API keys if you are using OpenAI APIs for the LLM, or setting up the required IAM permissions if using AWS).
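If you do run it locally, one way to make AWS credentials available to boto3 is via the standard environment variables shown below (the values are placeholders, not real credentials; running `aws configure` works too):
import os

# Placeholders only - set your own values, or run `aws configure` instead
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"
# os.environ["AWS_ACCESS_KEY_ID"] = "<your-access-key-id>"
# os.environ["AWS_SECRET_ACCESS_KEY"] = "<your-secret-access-key>"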
Next, install all the required Python modules - langchain, pypdf, boto3, and faiss-cpu.
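In a notebook cell, that looks like this (in a terminal, drop the % and use plain pip install):
# Install the required modules (Jupyter/SageMaker notebook magic)
%pip install langchain pypdf boto3 faiss-cpu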
Step - 1 : Load Document
Data is gathered and subjected to initial preprocessing. For this tutorial, I will be grounding the LLM on my resume (the custom knowledge) in PDF format (it can be any other format; LangChain has more than 80 different document loaders).
## Your directory structure should look like this:
## ├── data
## │   └── my_resume.pdf
## └── rag.ipynb (for your application code)
from langchain.document_loaders import PyPDFLoader
pdf_loader = PyPDFLoader("data/my_resume.pdf")
# Load data from the pdf
pages = pdf_loader.load()
Document loaders deal with the specifics of accessing and converting data from a variety of different formats and sources into a standardized format.
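To sanity-check what the loader produced, you can inspect the returned Document objects (the output will vary with your PDF):
# Each page becomes a Document with page_content (the text) and metadata (source file, page number)
print(len(pages))
print(pages[0].metadata)
print(pages[0].page_content[:200])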
Step - 2 : Document Splitting/Chunking
Before creating embeddings, split large documents into smaller chunks. This also allows the retriever to select the most relevant chunks from the document instead of feeding the entire data to the LLM.
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Split into ~500-character chunks with 50 characters of overlap,
# preferring paragraph breaks, then line breaks
splitter_pdf = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n"],
)
pdf_splits = splitter_pdf.split_documents(pages)
RecursiveCharacterTextSplitter takes a large text and splits it based on a specified chunk size. In the configuration above, it first splits on double newlines and then splits any oversized chunks on single newlines.
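A quick check on the result (the exact chunk count depends on your document):
# Compare page count vs. chunk count and peek at one chunk
print(f"{len(pages)} pages -> {len(pdf_splits)} chunks")
print(pdf_splits[0].page_content)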
Step - 3 : Embedding and indexing with Vector Store
Embeddings take a piece of text and create a numerical representation of it, and an embedding model is used to generate the embeddings. Vector stores and embeddings come after text splitting because we need to store our document chunks in an easily searchable format.
A vector store is a database where you can easily look up similar vectors later on. This becomes useful when we try to find documents that are relevant to a question.
Text with semantically similar content will have similar vectors in embedding space, so we can compare embeddings (vectors) to find texts that are similar.
from langchain.embeddings import BedrockEmbeddings
from langchain.vectorstores import FAISS
import boto3

# Bedrock client (control plane, e.g. for listing available models)
bedrock = boto3.client(
    service_name="bedrock",
    region_name="us-east-1",
    endpoint_url="https://bedrock.us-east-1.amazonaws.com"
)

# Bedrock runtime client that will be used for predictions
bedrock_runtime = boto3.client(service_name="bedrock-runtime")

# Define the Bedrock embeddings model
bedrock_embeddings = BedrockEmbeddings(
    model_id="amazon.titan-embed-text-v1", client=bedrock_runtime
)

# Embed the chunks and index them in the FAISS vector DB
vectordb = FAISS.from_documents(
    pdf_splits,
    bedrock_embeddings,
)
FAISS (Facebook AI Similarity Search) is a library for efficient similarity search and clustering of dense vectors. Here it's used as an in-memory vector DB.
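Before wiring in the LLM, you can verify that retrieval works by querying the vector store directly; similarity_search embeds the query and returns the closest chunks:
# Embed the query and return the 2 most similar chunks from the FAISS index
similar_docs = vectordb.similarity_search("What is Sagar's education?", k=2)
for doc in similar_docs:
    print(doc.page_content[:200])
    print("---")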
Step - 4 : Define Bedrock model for LLM inference
We will initialize the LLM interface via Bedrock and set inference parameters, which are adjustable values that can restrict or guide the model's responses.
from langchain.llms import Bedrock

# Each model has a different set of inference parameters
inference_params = {
    "temperature": 0.0
}

# Define the LangChain module with the selected Bedrock model
bedrock_llm = Bedrock(
    model_id='anthropic.claude-v2', client=bedrock_runtime, model_kwargs=inference_params
)
If I simply query the LLM without RAG:
llm_response = bedrock_llm("What is Sagar's education?")
print(llm_response)
Output » I'm afraid I don't have enough information to know details about someone named Sagar's education. I would need more context to determine that.
Step - 5 : Retrieve, Augment and Generate
Let’s use the pre-processed knowledge data from the steps above and run it through the RAG pipeline.
Retrieve parts of our data that are relevant to a user's query.
Create an embedding of the question, then compare this embedding with all the vectors in the vector store and pick the k most similar.
Augment the context of the prompt/query with retrieved data
We take the k most similar chunks and pass them as context, along with the question, into the LLM.
The LLM generates the response based on the prompt and the retrieved data.
The RetrievalQA chain in the LangChain framework provides an interface that abstracts all these steps.
from langchain.chains import RetrievalQA

# Define the RetrievalQA chain with the Bedrock LLM and the FAISS retriever
qa_chain = RetrievalQA.from_chain_type(
    bedrock_llm,
    retriever=vectordb.as_retriever(search_kwargs={"k": 5}),
)

# Perform retrieval Q&A
qa_response = qa_chain({"query": "What is Sagar's education?"})
print(qa_response["result"])
Output » Based on the resume, Sagar Gandhi earned a B.Tech degree in Computer Science and Engineering from MNIT Jaipur between 2010-2014.
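As an optional tweak, the chain can also return the retrieved chunks alongside the answer, which is handy for checking what the LLM was grounded on. Here's a small variation of the chain above using RetrievalQA's return_source_documents flag:
# Variant of the chain that also returns the retrieved source chunks
qa_chain_with_sources = RetrievalQA.from_chain_type(
    bedrock_llm,
    retriever=vectordb.as_retriever(search_kwargs={"k": 5}),
    return_source_documents=True,
)
response = qa_chain_with_sources({"query": "What is Sagar's education?"})
print(response["result"])
for doc in response["source_documents"]:
    print(doc.metadata)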
Note: I’ll also share the complete code in my GitHub repo.
That’s it, folks! This is a simple LLM application designed to demonstrate RAG concepts.
Common use cases for RAG :
Question and answer chatbots: Incorporating LLMs with chatbots allows them to automatically derive more accurate answers from company documents and knowledge bases.
Search augmentation: Incorporating LLMs with search engines that augment search results with LLM-generated answers can better answer informational queries.
Knowledge engine - ask questions on your data: Company data can be used as context for LLMs, allowing employees to get answers to their questions easily (e.g., HR or compliance documents).
Stay tuned for next week’s edition, where I’ll cover prompt engineering, including techniques such as ReAct and zero-shot prompting, and build another LLM application utilizing these concepts.
I'd also love to hear your suggestions on topics you'd like me to address in future editions of this series.
Interesting reads you don’t want to miss
In case you missed my previous articles in this series…
Unveiling the Revolutionary Architecture behind LLMs - "Attention is all you need"
If you found this useful, please share it with your network and consider subscribing for more such insights.
If you haven’t subscribed or followed me on LinkedIn, I’d love to connect with you. Please share your thoughts, feedback, and ideas, or just say hello!