Motivation
Accessing, understanding, and retrieving information from documents is central to countless processes across a variety of industries. Whether you work in finance or healthcare, run a mom-and-pop carpet store, or are a college student, there are situations where you come across a large document that you need to read in order to answer a question. Enter JITR, a groundbreaking tool that ingests PDF files and leverages large language models (LLMs) to answer user queries about their content. Let's take a look at the magic behind JITR.
What is JITR?
JITR, which stands for Just In Time Retrieval, is one of the latest tools in the DataRobot GenAI Accelerator family designed to process PDF documents, extract their content, and provide accurate answers to user questions and queries. Imagine having a personal assistant that can read and understand any PDF document and then instantly answer any questions you have about it. That's JITR for you.
How does JITR work?
PDF collection: The first step is ingesting the PDF into the JITR system. Here, the tool converts the static content of the PDF into a format that an embedding model can consume: the embedding model converts each sentence of the PDF file into a vector, and together these vectors form a vector database for the input PDF.
Applying the LLM: Once the content is collected, the tool calls the LLM. An LLM is a state-of-the-art AI model trained on massive amounts of text data. It excels at understanding context, identifying meaning, and generating human-like text. JITR uses these models to understand and index PDF content.
Interactive query: Users can then ask questions about the PDF content. The LLM retrieves the relevant information and presents answers in a concise and coherent manner.
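The three steps above can be sketched with a toy example. The "embedding" here is just a bag-of-words vector for illustration purposes; JITR itself uses a sentence-transformer model and a FAISS store, as shown later in this post, and a real LLM would generate the final answer from the retrieved context.

```python
# Toy sketch of the JITR flow: embed chunks, retrieve the closest one,
# and assemble a context-aware prompt for the LLM.
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Bag-of-words 'embedding' -- a stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(chunks: list[str], question: str) -> str:
    """Return the chunk most similar to the question."""
    q = embed(question)
    return max(chunks, key=lambda c: cosine(embed(c), q))

chunks = [
    "Instant noodles are fried to remove moisture.",
    "Frying creates tiny holes that speed up rehydration in hot water.",
    "The packaging lists sodium content per serving.",
]
context = retrieve(chunks, "How does rehydration work?")
prompt = f"Answer using this context:\n{context}\n\nQuestion: How does rehydration work?"
```

The prompt, now grounded in the most relevant chunk, is what gets sent to the LLM.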
Benefits of using JITR
Every organization produces a variety of documents that are created by one department and used by others. Retrieving information from them is often time-consuming for employees and teams. JITR improves staff efficiency by reducing review time for long PDFs and providing immediate, accurate answers to questions. Additionally, JITR can handle all types of PDF content, allowing organizations to drop it into a variety of workflows without worrying about the input documents.
Many organizations may not have the resources and software development expertise to build tools that leverage LLMs in their workflows. JITR allows teams and departments that are not proficient in Python to convert PDF files into an LLM-ready vector database. All you need is an endpoint to send the PDF file to, and you can integrate JITR into an application like Slack (or another messaging tool) or an external portal for your customers. No knowledge of LLMs, natural language processing (NLP), or vector databases is required.
Real-world applications
Given its versatility, JITR can be integrated into almost any workflow. Here are some applications:
Business reports: Professionals can quickly gain insights from long reports, contracts, and white papers. Likewise, you can integrate this tool into your internal processes to allow employees and teams to interact with internal documents.
Customer service: From understanding technical manuals to deep-diving into tutorials, JITR allows customers to interact with manuals and documentation related to their products and tools. This can help increase customer satisfaction and reduce the number of support tickets and escalations.
Research and Development: R&D teams can quickly extract relevant, easy-to-understand information from complex research papers to implement cutting-edge technologies into their products or internal processes.
Guidelines and compliance: Many organizations have guidelines that employees and teams must follow. JITR allows employees to efficiently retrieve the relevant information from those documents.
Legal: JITR can collect legal documents and contracts and answer questions based on the information provided in the input documents.
How to Build a JITR Bot with DataRobot
The workflow for building a JITR Bot is similar to the workflow for deploying an LLM pipeline using DataRobot. The main differences between the two are:
- Vector databases are defined at runtime.
- Logic is required to process encoded PDFs.
In the latter case, you can define a simple function that takes the encoding and writes it back to a temporary PDF file within your deployment.
```python
def base_64_to_file(b64_string, filename: str = "temp.PDF", directory_path: str = "./storage/data") -> str:
    """Decode a base64 string into a PDF file."""
    import codecs
    import os

    if not os.path.exists(directory_path):
        os.makedirs(directory_path)
    file_path = os.path.join(directory_path, filename)
    with open(file_path, "wb") as f:
        f.write(codecs.decode(b64_string, "base64"))
    return file_path
```
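As a quick sanity check, we can round-trip a few bytes through this helper. The snippet below re-declares the helper so it is self-contained, and uses a temporary directory instead of `./storage/data`:

```python
import base64
import codecs
import os
import tempfile

def base_64_to_file(b64_string, filename="temp.PDF", directory_path="./storage/data"):
    """Same helper as above: decode a base64 string into a file on disk."""
    if not os.path.exists(directory_path):
        os.makedirs(directory_path)
    file_path = os.path.join(directory_path, filename)
    with open(file_path, "wb") as f:
        f.write(codecs.decode(b64_string, "base64"))
    return file_path

# Round-trip: encode some bytes, decode them back to a file, and read them.
payload = b"%PDF-1.4 minimal example bytes"
encoded = base64.b64encode(payload)
with tempfile.TemporaryDirectory() as tmp:
    path = base_64_to_file(encoded, directory_path=tmp)
    with open(path, "rb") as f:
        restored = f.read()
# restored is now byte-for-byte identical to payload
```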
Defining this helper function allows us to create a hook. Hooks are just a fancy way of saying a function with a specific name. In our case, we just need to define a hook called `load_model` and another hook called `score_unstructured`. In `load_model` we set the embedding model that will be used to find the most relevant chunks of text, as well as the LLM to ping with context-aware prompts.
```python
def load_model(input_dir):
    """Custom model hook for loading our knowledge base."""
    import os
    import datarobot_drum as drum
    from langchain.chat_models import AzureChatOpenAI
    from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings

    try:
        # Pull credentials from deployment
        key = drum.RuntimeParameters.get("OPENAI_API_KEY")["apiToken"]
    except ValueError:
        # Pull credentials from environment (when running locally)
        key = os.environ.get('OPENAI_API_KEY', '')

    embedding_function = SentenceTransformerEmbeddings(
        model_name="all-MiniLM-L6-v2",
        cache_folder=os.path.join(input_dir, 'storage/deploy/sentencetransformers')
    )
    # The remaining OPENAI_* settings are constants defined elsewhere in the module
    llm = AzureChatOpenAI(
        deployment_name=OPENAI_DEPLOYMENT_NAME,
        openai_api_type=OPENAI_API_TYPE,
        openai_api_base=OPENAI_API_BASE,
        openai_api_version=OPENAI_API_VERSION,
        openai_api_key=key,
        openai_organization=OPENAI_ORGANIZATION,
        model_name=OPENAI_DEPLOYMENT_NAME,
        temperature=0,
        verbose=True
    )
    return llm, embedding_function
```
Okay, now we have an embedding function and an LLM, as well as a helper to decode the base64 payload back into a PDF. Next we will look at the nitty-gritty of the JITR Bot: the hook that builds a vector store at runtime and uses it to query the LLM.
```python
def score_unstructured(model, data, query, **kwargs) -> str:
    """Custom model hook for making completions with our knowledge base.

    When requesting predictions from the deployment, pass a dictionary
    with the following keys:
    - 'question' the question to be passed to the retrieval chain
    - 'document' a base64 encoded document to be loaded into the vector database

    datarobot-user-models (DRUM) handles loading the model and calling
    this function with the appropriate parameters.

    Returns:
    --------
    rv : str
        Json dictionary with keys:
        - 'question' user's original question
        - 'answer' the generated answer to the question
    """
    import json
    import os
    from langchain.chains import ConversationalRetrievalChain
    from langchain.document_loaders import PyPDFLoader
    from langchain.vectorstores.base import VectorStoreRetriever
    from langchain.vectorstores.faiss import FAISS

    llm, embedding_function = model
    DIRECTORY = "./storage/data"
    temp_file_name = "temp.PDF"
    data_dict = json.loads(data)

    # Write encoding to file
    base_64_to_file(data_dict['document'].encode(), filename=temp_file_name, directory_path=DIRECTORY)

    # Load up the file
    loader = PyPDFLoader(os.path.join(DIRECTORY, temp_file_name))
    docs = loader.load_and_split()

    # Remove file when done
    os.remove(os.path.join(DIRECTORY, temp_file_name))

    # Create our vector database
    texts = [doc.page_content for doc in docs]
    metadatas = [doc.metadata for doc in docs]
    db = FAISS.from_texts(texts, embedding_function, metadatas=metadatas)

    # Define our chain
    retriever = VectorStoreRetriever(vectorstore=db)
    chain = ConversationalRetrievalChain.from_llm(llm, retriever=retriever)

    # Run it
    response = chain(inputs={'question': data_dict['question'], 'chat_history': []})
    return json.dumps({"result": response})
```
Once the hooks are defined, all that's left is to deploy the pipeline so that people have an endpoint they can interact with. To some, the process of creating a secure, monitored, queryable endpoint from arbitrary Python code may sound intimidating, or at least like it would take some time to set up. The drx package lets you deploy the JITR Bot with a single function call.
```python
import datarobotx as drx

deployment = drx.deploy(
    "./storage/deploy/",  # Path with embedding model
    name=f"JITR Bot {now}",
    hooks={
        "score_unstructured": score_unstructured,
        "load_model": load_model
    },
    extra_requirements=["pyPDF"],  # Add a package for parsing PDF files
    environment_id="64c964448dd3f0c07f47d040",  # GenAI Dropin Python environment
)
```
How to use JITR
Okay, the hard work is over. Now we can enjoy interacting with our new deployment. Staying in Python, you can leverage the drx package to answer your most pressing questions.
```python
import base64
import io
import requests

# Find a PDF
url = "https://s3.amazonaws.com/datarobot_public_datasets/drx/Instantnoodles.PDF"
resp = requests.get(url).content
encoding = base64.b64encode(io.BytesIO(resp).read())  # encode it

# Interact
response = deployment.predict_unstructured(
    {
        "question": "What does this say about noodle rehydration?",
        "document": encoding.decode(),
    }
)['result']

# Output:
# {'question': 'What does this say about noodle rehydration?',
#  'chat_history': [],
#  'answer': 'The article mentions that during the frying process, many tiny holes are created due to mass transfer, and they serve as channels for water penetration upon rehydration in hot water. The porous structure created during frying facilitates rehydration.'}
```
But more importantly, since this is just an endpoint, you can call it from any language you want. Below is a screenshot of interacting with the deployment directly through Postman. This means we can integrate the JITR Bot into essentially any application by having the application make API calls.
![Integrate JITR Bot into your application - DataRobot](https://www.datarobot.com/wp-content/uploads/2023/12/image-1-1024x556.png)
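As a sketch of what such a raw API call might look like, the snippet below builds a request to DataRobot's unstructured-prediction endpoint. The host, deployment ID, and keys are placeholders; take the exact URL and headers from your deployment's integration snippet in DataRobot rather than this example:

```python
import base64
import json

# All values below are placeholders -- check your deployment's
# "Prediction API" integration snippet for the real ones.
PREDICTION_SERVER = "https://example.datarobot.com"  # hypothetical host
DEPLOYMENT_ID = "abc123"                             # hypothetical deployment ID
url = f"{PREDICTION_SERVER}/predApi/v1.0/deployments/{DEPLOYMENT_ID}/predictionsUnstructured"

headers = {
    "Authorization": "Bearer <API_TOKEN>",  # placeholder token
    "DataRobot-Key": "<DATAROBOT_KEY>",     # placeholder key
    "Content-Type": "application/json",
}

# In practice, read your PDF bytes from disk; inline bytes keep this self-contained.
doc = base64.b64encode(b"%PDF-1.4 example bytes").decode()

payload = json.dumps({
    "question": "What does this say about noodle rehydration?",
    "document": doc,
})

# Send the request once the placeholders are filled in:
# import requests
# answer = requests.post(url, data=payload, headers=headers).json()["result"]
```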
Once included in your application, using JITR is very easy. For example, in the Slackbot application used internally at DataRobot, users can start a conversation about a document simply by uploading the PDF along with their question.
JITR makes it easy for anyone in your organization to create real value through generative AI across the countless touchpoints of your employees’ daily workflow. Check out this video to learn more about JITR.
What you can do to make your JITR bot more powerful
The code shown above is a simple implementation of the JITR Bot that takes an encoded PDF and builds a vector store at runtime. Several additional features that we implemented internally for the JITR Bot were omitted here because they are not central to the core concept, such as the following:
- Context-aware prompting and returning completion tokens
- Answer questions based on multiple documents
- Answer multiple questions at once
- Allow users to provide conversation history
- Use different chains for different types of questions
- Report custom metrics back to your deployment
There's no reason the JITR Bot should only work with PDF files! As long as a document can be encoded and converted back into a text string, we can build more logic into the `score_unstructured` hook to handle any file format a user provides.
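As a sketch of that idea (the loader mapping and helper below are hypothetical illustrations, not part of JITR), the hook could dispatch on file extension before building the vector store:

```python
import os
from pathlib import Path

# Hypothetical mapping from file extension to a loader callable.
# Each loader returns a list of text chunks for the vector store; the PDF
# entry is a stand-in for a real parser such as pypdf.
LOADERS = {
    ".pdf": lambda path: ["<chunks parsed from PDF>"],
    ".txt": lambda path: Path(path).read_text(encoding="utf-8").splitlines(),
    ".md": lambda path: Path(path).read_text(encoding="utf-8").split("\n\n"),
}

def load_any(path: str) -> list[str]:
    """Pick a loader based on the file extension, defaulting to plain text."""
    ext = os.path.splitext(path)[1].lower()
    loader = LOADERS.get(ext, LOADERS[".txt"])
    return loader(path)
```

The chunks returned by `load_any` would then feed into `FAISS.from_texts` exactly as the PDF chunks do today.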
Start using JITR in your workflow
JITR makes it easy to interact with arbitrary PDFs. If you want to give it a try, follow along with the notebook here.