This repository contains two Jupyter notebooks that form a pipeline for document ingestion, chunking, and language model interaction. These notebooks allow you to convert documents (like PDFs), split them into smaller chunks, and send those chunks to a large language model (such as Mistral-7B Instruct) for processing or text generation.
- injest-splitter.ipynb: A notebook designed to send chunks of text to a large language model for completion or generation tasks. It uses the Mistral-7B Instruct model via an API.
- injest-local.ipynb: A notebook that ingests documents, splits them into chunks, and prepares them for further processing by the language model.
- Python 3.10+
- Jupyter Notebook
- Libraries (installed in the notebooks): docling, quackling, llama-index, semantic-router, semantic-chunkers, rich
- Clone the repository:

  ```bash
  git clone <repository_url>
  cd <repository_folder>
  ```

- Install the required libraries: run the first few code cells in either notebook to install the required dependencies automatically via `%pip`.
This notebook (injest-local.ipynb) is used to ingest and split a document into chunks:

- Load your document (PDF) in the variable `source`.
- The `DocumentConverter` and `DoclingPDFReader` will process the document into a chunked format.
- The chunking process is handled by `RollingWindowSplitter` and `StatisticalChunker`, which store the chunks in the `splits` and `chunks` variables.

You can modify the chunking parameters, including `min_split_tokens` and `max_split_tokens`, to fit your needs.
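To make the effect of these parameters concrete, here is a simplified, stdlib-only sketch of rolling-window chunking. The parameter names mirror the notebook's settings, but the whitespace tokenization and merge rule are illustrative assumptions, not the semantic-router implementation:

```python
def rolling_window_split(text, min_split_tokens=100, max_split_tokens=300):
    """Split text into chunks of at most max_split_tokens whitespace tokens.

    Illustrative only: real splitters use a proper tokenizer and
    semantic boundaries rather than fixed windows.
    """
    tokens = text.split()  # crude whitespace "tokens"
    chunks, start = [], 0
    while start < len(tokens):
        end = min(start + max_split_tokens, len(tokens))
        # Fold a trailing fragment shorter than min_split_tokens into
        # the final chunk instead of emitting it on its own.
        if end < len(tokens) and len(tokens) - end < min_split_tokens:
            end = len(tokens)
        chunks.append(" ".join(tokens[start:end]))
        start = end
    return chunks
```

Raising `max_split_tokens` produces fewer, longer chunks (more context per model call); raising `min_split_tokens` prevents tiny trailing chunks.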
If you want to save the chunked output, you can write it to a file. Here's an example of how to save the chunks to a JSON file:

```python
import json

with open('chunked_data.json', 'w') as f:
    json.dump([chunk.to_dict() for chunk in chunks], f)
```

Once the document is chunked, use the injest-splitter.ipynb notebook to send those chunks to a large language model.
- Load the chunked data (either by running the chunking notebook first or loading a previously saved file).
- The notebook will use the OpenLLM library to interact with the model and stream responses for each chunk.

You can modify the `max_tokens` and `timeout` settings to control the model's output length and response time.
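The exact request format depends on the client the notebook uses, but conceptually each chunk becomes one completion request. A hedged sketch of assembling such a request for an OpenAI-style endpoint (the field names, prompt template, and model string here are assumptions, so adjust them to match the API behind `LLM_URL`):

```python
import json

def build_completion_payload(chunk_text, max_tokens=512,
                             model="mistral-7b-instruct"):
    # Assumed OpenAI-style payload shape; not guaranteed to match
    # the notebook's actual client.
    return {
        "model": model,
        "prompt": f"Summarize the following passage:\n\n{chunk_text}",
        "max_tokens": max_tokens,   # caps the output length
        "stream": True,             # stream tokens back as generated
    }

payload = build_completion_payload("Example chunk text.", max_tokens=256)
body = json.dumps(payload)  # what would be POSTed to LLM_URL
```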
- Step 1: Run `injest-local.ipynb` to ingest and chunk the document.
- Step 2: Save the chunked data as a file (optional).
- Step 3: Run `injest-splitter.ipynb` to send the chunks to the language model and receive the responses.
Both notebooks require environment variables to connect to the language model API:
- `API_KEY`: Your API key for accessing the model.
- `LLM_URL`: The base URL for the language model API.
You can load these environment variables using a .env file or by setting them directly in the notebook.
Example .env file:
```
API_KEY=your-api-key
LLM_URL=your-llm-url
```
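Reading these variables in a notebook cell can be sketched as follows (this assumes the variables were exported in the shell or loaded from the .env file beforehand, e.g. with python-dotenv's `load_dotenv()`):

```python
import os

def load_llm_config():
    """Return (api_key, llm_url), failing fast if either is unset."""
    api_key = os.environ.get("API_KEY")
    llm_url = os.environ.get("LLM_URL")
    if not api_key or not llm_url:
        raise RuntimeError("Set API_KEY and LLM_URL before running the notebooks")
    return api_key, llm_url
```

Failing fast here gives a clearer error than a connection failure deep inside a model call.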
- Adjust the chunking parameters in `injest-local.ipynb` for your document.
- Modify the prompt generation and response handling in `injest-splitter.ipynb` to fit your needs.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.