Motivation
Accessing, understanding, and retrieving information from documents is central to countless processes across a variety of industries. Whether you work in finance or healthcare, run a mom-and-pop carpet store, or are a college student, there are situations where you come across a large document that you need to read in order to answer a question. Enter JITR, a groundbreaking tool that ingests PDF files and leverages large language models (LLMs) to answer user queries about their content. Let's take a look at the magic behind JITR.
What is JITR?
JITR, which stands for Just In Time Retrieval, is one of the latest tools in the DataRobot GenAI Accelerator family designed to process PDF documents, extract their content, and provide accurate answers to user questions and queries. Imagine having a personal assistant that can read and understand any PDF document and then instantly answer any questions you have about it. That's JITR for you.
How does JITR work?
PDF collection: The first step is ingesting the PDF into the JITR system. Here, the tool converts the static content of the PDF into a format that an embedding model can consume: the embedding model converts each sentence of the PDF file into a vector, and together these vectors form a vector database for the input PDF.
Applying the LLM: Once the content is collected, the tool calls the LLM. An LLM is a state-of-the-art AI model trained on massive amounts of text data. It excels at understanding context, identifying meaning, and generating human-like text. JITR uses these models to understand and index PDF content.
Interactive query: Users can then ask questions about the PDF content. The LLM retrieves the relevant information and presents answers in a concise and coherent manner.
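The three steps above can be sketched with a toy example. The "embedding" here is just a bag-of-words vector for illustration purposes; JITR itself uses a sentence-transformer model and a FAISS store, as shown later in this post, and a real LLM would generate the final answer from the retrieved context.

```python
# Toy sketch of the JITR flow: embed chunks, retrieve the closest one,
# and assemble a context-aware prompt for the LLM.
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Bag-of-words 'embedding' -- a stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(chunks: list[str], question: str) -> str:
    """Return the chunk most similar to the question."""
    q = embed(question)
    return max(chunks, key=lambda c: cosine(embed(c), q))

chunks = [
    "Instant noodles are fried to remove moisture.",
    "Frying creates tiny holes that speed up rehydration in hot water.",
    "The packaging lists sodium content per serving.",
]
context = retrieve(chunks, "How does rehydration work?")
prompt = f"Answer using this context:\n{context}\n\nQuestion: How does rehydration work?"
```

The prompt, now grounded in the most relevant chunk, is what gets sent to the LLM.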
Benefits of using JITR
Every organization produces a variety of documents that are created by one department and used by others. Retrieving information from them is often time-consuming for employees and teams. JITR improves staff efficiency by reducing review time for long PDFs and providing immediate, accurate answers to questions. Additionally, JITR can handle all types of PDF content, allowing organizations to drop it into a variety of workflows without worrying about the input documents.
Many organizations may not have the resources and software development expertise to build tools that leverage LLMs in their workflows. JITR allows teams and departments that are not proficient in Python to convert PDF files into an LLM-ready vector database. All you need is an endpoint to send the PDF file to, and you can integrate JITR into an application like Slack (or another messaging tool) or an external portal for your customers. No knowledge of LLMs, natural language processing (NLP), or vector databases is required.
Real-world applications
Given its versatility, JITR can be integrated into almost any workflow. Here are some applications:
Business reports: Professionals can quickly gain insights from long reports, contracts, and white papers. Likewise, you can integrate this tool into your internal processes to allow employees and teams to interact with internal documents.
Customer service: From understanding technical manuals to deep-diving into tutorials, JITR allows customers to interact with manuals and documentation related to their products and tools. This can help increase customer satisfaction and reduce the number of support tickets and escalations.
Research and Development: R&D teams can quickly extract relevant, easy-to-understand information from complex research papers to implement cutting-edge technologies into their products or internal processes.
Guidelines and compliance: Many organizations have guidelines that employees and teams must follow. JITR allows employees to efficiently retrieve the relevant information from those documents.
Legal: JITR can collect legal documents and contracts and answer questions based on the information provided in the input documents.
How to Build a JITR Bot with DataRobot
The workflow for building a JITR Bot is similar to the workflow for deploying an LLM pipeline using DataRobot. The main differences between the two are:
- Vector databases are defined at runtime.
- Logic is required to process encoded PDFs.
In the latter case, you can define a simple function that takes the encoding and writes it back to a temporary PDF file within your deployment.
```python
def base_64_to_file(b64_string, filename: str = "temp.PDF", directory_path: str = "./storage/data") -> str:
    """Decode a base64 string into a PDF file."""
    import codecs
    import os

    if not os.path.exists(directory_path):
        os.makedirs(directory_path)
    file_path = os.path.join(directory_path, filename)
    with open(file_path, "wb") as f:
        f.write(codecs.decode(b64_string, "base64"))
    return file_path
```
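As a quick sanity check, we can round-trip a few bytes through this helper. The snippet below re-declares the helper so it is self-contained, and uses a temporary directory instead of `./storage/data`:

```python
import base64
import codecs
import os
import tempfile

def base_64_to_file(b64_string, filename="temp.PDF", directory_path="./storage/data"):
    """Same helper as above: decode a base64 string into a file on disk."""
    if not os.path.exists(directory_path):
        os.makedirs(directory_path)
    file_path = os.path.join(directory_path, filename)
    with open(file_path, "wb") as f:
        f.write(codecs.decode(b64_string, "base64"))
    return file_path

# Round-trip: encode some bytes, decode them back to a file, and read them.
payload = b"%PDF-1.4 minimal example bytes"
encoded = base64.b64encode(payload)
with tempfile.TemporaryDirectory() as tmp:
    path = base_64_to_file(encoded, directory_path=tmp)
    with open(path, "rb") as f:
        restored = f.read()
# restored is now byte-for-byte identical to payload
```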
Defining this helper function allows us to create a hook. Hooks are just a fancy way of saying a function with a specific name. In our case, we just need to define a hook called `load_model` and another hook called `score_unstructured`. In `load_model` we set the embedding model that will be used to find the most relevant chunks of text, as well as the LLM to ping with context-aware prompts.
```python
def load_model(input_dir):
    """Custom model hook for loading our knowledge base."""
    import os
    import datarobot_drum as drum
    from langchain.chat_models import AzureChatOpenAI
    from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings

    try:
        # Pull credentials from deployment
        key = drum.RuntimeParameters.get("OPENAI_API_KEY")["apiToken"]
    except ValueError:
        # Pull credentials from environment (when running locally)
        key = os.environ.get('OPENAI_API_KEY', '')

    embedding_function = SentenceTransformerEmbeddings(
        model_name="all-MiniLM-L6-v2",
        cache_folder=os.path.join(input_dir, 'storage/deploy/sentencetransformers')
    )
    # The remaining OPENAI_* settings are constants defined elsewhere in the module
    llm = AzureChatOpenAI(
        deployment_name=OPENAI_DEPLOYMENT_NAME,
        openai_api_type=OPENAI_API_TYPE,
        openai_api_base=OPENAI_API_BASE,
        openai_api_version=OPENAI_API_VERSION,
        openai_api_key=key,
        openai_organization=OPENAI_ORGANIZATION,
        model_name=OPENAI_DEPLOYMENT_NAME,
        temperature=0,
        verbose=True
    )
    return llm, embedding_function
```
Okay, now we have an embedding function and an LLM, as well as a helper to decode the base64 payload back into a PDF. Next we will look at the nitty-gritty of the JITR Bot: the hook that builds a vector store at runtime and uses it to query the LLM.
```python
def score_unstructured(model, data, query, **kwargs) -> str:
    """Custom model hook for making completions with our knowledge base.

    When requesting predictions from the deployment, pass a dictionary
    with the following keys:
    - 'question' the question to be passed to the retrieval chain
    - 'document' a base64 encoded document to be loaded into the vector database

    datarobot-user-models (DRUM) handles loading the model and calling
    this function with the appropriate parameters.

    Returns:
    --------
    rv : str
        Json dictionary with keys:
        - 'question' user's original question
        - 'answer' the generated answer to the question
    """
    import json
    import os
    from langchain.chains import ConversationalRetrievalChain
    from langchain.document_loaders import PyPDFLoader
    from langchain.vectorstores.base import VectorStoreRetriever
    from langchain.vectorstores.faiss import FAISS

    llm, embedding_function = model
    DIRECTORY = "./storage/data"
    temp_file_name = "temp.PDF"
    data_dict = json.loads(data)

    # Write encoding to file
    base_64_to_file(data_dict['document'].encode(), filename=temp_file_name, directory_path=DIRECTORY)

    # Load up the file
    loader = PyPDFLoader(os.path.join(DIRECTORY, temp_file_name))
    docs = loader.load_and_split()

    # Remove file when done
    os.remove(os.path.join(DIRECTORY, temp_file_name))

    # Create our vector database
    texts = [doc.page_content for doc in docs]
    metadatas = [doc.metadata for doc in docs]
    db = FAISS.from_texts(texts, embedding_function, metadatas=metadatas)

    # Define our chain
    retriever = VectorStoreRetriever(vectorstore=db)
    chain = ConversationalRetrievalChain.from_llm(llm, retriever=retriever)

    # Run it
    response = chain(inputs={'question': data_dict['question'], 'chat_history': []})
    return json.dumps({"result": response})
```
Once the hooks are defined, all that's left is to deploy the pipeline so that people have an endpoint they can interact with. To some, the process of creating a secure, monitored, queryable endpoint from arbitrary Python code may sound intimidating, or at least like it would take some time to set up. The drx package lets you deploy the JITR Bot with a single function call.
```python
import datarobotx as drx

deployment = drx.deploy(
    "./storage/deploy/",  # Path with embedding model
    name=f"JITR Bot {now}",
    hooks={
        "score_unstructured": score_unstructured,
        "load_model": load_model
    },
    extra_requirements=["pyPDF"],  # Add a package for parsing PDF files
    environment_id="64c964448dd3f0c07f47d040",  # GenAI Dropin Python environment
)
```
How to use JITR
Okay, the hard work is over. Now we can enjoy interacting with our new deployment. Staying in Python, you can leverage the drx package to answer your most pressing questions.
```python
import base64
import io
import requests

# Find a PDF
url = "https://s3.amazonaws.com/datarobot_public_datasets/drx/Instantnoodles.PDF"
resp = requests.get(url).content
encoding = base64.b64encode(io.BytesIO(resp).read())  # encode it

# Interact
response = deployment.predict_unstructured(
    {
        "question": "What does this say about noodle rehydration?",
        "document": encoding.decode(),
    }
)['result']

# Output:
# {'question': 'What does this say about noodle rehydration?',
#  'chat_history': [],
#  'answer': 'The article mentions that during the frying process, many tiny holes are created due to mass transfer, and they serve as channels for water penetration upon rehydration in hot water. The porous structure created during frying facilitates rehydration.'}
```
But more importantly, since this is just an endpoint, you can call it from any language you want. Below is a screenshot of interacting with the deployment directly through Postman. This means we can integrate the JITR Bot into essentially any application by having the application make API calls.
![Integrate JITR Bot into your application - DataRobot](https://www.datarobot.com/wp-content/uploads/2023/12/image-1-1024x556.png)
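As a sketch of what such a raw API call might look like, the snippet below builds a request to DataRobot's unstructured-prediction endpoint. The host, deployment ID, and keys are placeholders; take the exact URL and headers from your deployment's integration snippet in DataRobot rather than this example:

```python
import base64
import json

# All values below are placeholders -- check your deployment's
# "Prediction API" integration snippet for the real ones.
PREDICTION_SERVER = "https://example.datarobot.com"  # hypothetical host
DEPLOYMENT_ID = "abc123"                             # hypothetical deployment ID
url = f"{PREDICTION_SERVER}/predApi/v1.0/deployments/{DEPLOYMENT_ID}/predictionsUnstructured"

headers = {
    "Authorization": "Bearer <API_TOKEN>",  # placeholder token
    "DataRobot-Key": "<DATAROBOT_KEY>",     # placeholder key
    "Content-Type": "application/json",
}

# In practice, read your PDF bytes from disk; inline bytes keep this self-contained.
doc = base64.b64encode(b"%PDF-1.4 example bytes").decode()

payload = json.dumps({
    "question": "What does this say about noodle rehydration?",
    "document": doc,
})

# Send the request once the placeholders are filled in:
# import requests
# answer = requests.post(url, data=payload, headers=headers).json()["result"]
```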
Once included in your application, using JITR is very easy. For example, in the Slackbot application used internally at DataRobot, users can start a conversation about a document simply by uploading the PDF along with their question.
JITR makes it easy for anyone in your organization to create real value through generative AI across the countless touchpoints of your employees’ daily workflow. Check out this video to learn more about JITR.
What you can do to make your JITR bot more powerful
The code shown above is a simple implementation of the JITR Bot that takes an encoded PDF and builds a vector store at runtime. Several additional features that we implemented internally for the JITR Bot were omitted here because they are not central to the core concept, such as the following:
- Context-aware prompting and returning completion tokens
- Answer questions based on multiple documents
- Answer multiple questions at once
- Allow users to provide conversation history
- Use different chains for different types of questions
- Report custom metrics back to your deployment
There's no reason the JITR Bot should only work with PDF files! As long as a document can be encoded and converted back into a text string, we can build more logic into the `score_unstructured` hook to handle any file format a user provides.
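As a sketch of that idea (the loader mapping and helper below are hypothetical illustrations, not part of JITR), the hook could dispatch on file extension before building the vector store:

```python
import os
from pathlib import Path

# Hypothetical mapping from file extension to a loader callable.
# Each loader returns a list of text chunks for the vector store; the PDF
# entry is a stand-in for a real parser such as pypdf.
LOADERS = {
    ".pdf": lambda path: ["<chunks parsed from PDF>"],
    ".txt": lambda path: Path(path).read_text(encoding="utf-8").splitlines(),
    ".md": lambda path: Path(path).read_text(encoding="utf-8").split("\n\n"),
}

def load_any(path: str) -> list[str]:
    """Pick a loader based on the file extension, defaulting to plain text."""
    ext = os.path.splitext(path)[1].lower()
    loader = LOADERS.get(ext, LOADERS[".txt"])
    return loader(path)
```

The chunks returned by `load_any` would then feed into `FAISS.from_texts` exactly as the PDF chunks do today.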
Start using JITR in your workflow
JITR makes it easy to interact with arbitrary PDFs. If you want to give it a try, follow along with the notebook here.