This repository contains two Jupyter notebooks that form a pipeline for document ingestion, chunking, and language model interaction. These notebooks allow you to convert documents (like PDFs), split them into smaller chunks, and send those chunks to a large language model (such as Mistral-7B Instruct) for processing or text generation.
- injest-splitter.ipynb: A notebook designed to send chunks of text to a large language model for completion or generation tasks. It uses the Mistral-7B Instruct model via an API.
- injest-local.ipynb: A notebook that ingests documents, splits them into chunks, and prepares them for further processing by the language model.
- Python 3.10+
- Jupyter Notebook
- Libraries (installed in the notebooks): docling, quackling, llama-index, semantic-router, semantic-chunkers, rich
- Clone the repository:

  ```bash
  git clone <repository_url>
  cd <repository_folder>
  ```

- Install the required libraries: run the first few code cells in either notebook to install the required dependencies automatically via `%pip`.
This notebook (injest-local.ipynb) is used to ingest and split a document into chunks:

- Load your document (PDF) in the variable `source`.
- The `DocumentConverter` and `DoclingPDFReader` will process the document into a chunked format.
- The chunking process is handled by `RollingWindowSplitter` and `StatisticalChunker`, which store the chunks in the `splits` and `chunks` variables.

You can modify the chunking parameters, including `min_split_tokens` and `max_split_tokens`, to fit your needs.
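To make the effect of these parameters concrete, here is a simplified, stdlib-only sketch of rolling-window chunking. The parameter names mirror the notebook's settings, but the whitespace tokenization and merge rule are illustrative assumptions, not the semantic-router implementation:

```python
def rolling_window_split(text, min_split_tokens=100, max_split_tokens=300):
    """Split text into chunks of at most max_split_tokens whitespace tokens.

    Illustrative only: real splitters use a proper tokenizer and
    semantic boundaries rather than fixed windows.
    """
    tokens = text.split()  # crude whitespace "tokens"
    chunks, start = [], 0
    while start < len(tokens):
        end = min(start + max_split_tokens, len(tokens))
        # Fold a trailing fragment shorter than min_split_tokens into
        # the final chunk instead of emitting it on its own.
        if end < len(tokens) and len(tokens) - end < min_split_tokens:
            end = len(tokens)
        chunks.append(" ".join(tokens[start:end]))
        start = end
    return chunks
```

Raising `max_split_tokens` produces fewer, longer chunks (more context per model call); raising `min_split_tokens` prevents tiny trailing chunks.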
If you want to save the chunked output, you can write it to a file. Here's an example of how to save the chunks to a JSON file:

```python
import json

with open('chunked_data.json', 'w') as f:
    json.dump([chunk.to_dict() for chunk in chunks], f)
```

Once the document is chunked, use the injest-splitter.ipynb notebook to send those chunks to a large language model.
- Load the chunked data (either by running the chunking notebook first or loading a previously saved file).
- The notebook will use the OpenLLM library to interact with the model and stream responses for each chunk.

You can modify the `max_tokens` and `timeout` settings to control the model's output length and response time.
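The exact request format depends on the client the notebook uses, but conceptually each chunk becomes one completion request. A hedged sketch of assembling such a request for an OpenAI-style endpoint (the field names, prompt template, and model string here are assumptions, so adjust them to match the API behind `LLM_URL`):

```python
import json

def build_completion_payload(chunk_text, max_tokens=512,
                             model="mistral-7b-instruct"):
    # Assumed OpenAI-style payload shape; not guaranteed to match
    # the notebook's actual client.
    return {
        "model": model,
        "prompt": f"Summarize the following passage:\n\n{chunk_text}",
        "max_tokens": max_tokens,   # caps the output length
        "stream": True,             # stream tokens back as generated
    }

payload = build_completion_payload("Example chunk text.", max_tokens=256)
body = json.dumps(payload)  # what would be POSTed to LLM_URL
```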
- Step 1: Run `injest-local.ipynb` to ingest and chunk the document.
- Step 2: Save the chunked data as a file (optional).
- Step 3: Run `injest-splitter.ipynb` to send the chunks to the language model and receive the responses.
Both notebooks require environment variables to connect to the language model API:
- `API_KEY`: Your API key for accessing the model.
- `LLM_URL`: The base URL for the language model API.
You can load these environment variables using a .env file or by setting them directly in the notebook.
Example .env file:
```
API_KEY=your-api-key
LLM_URL=your-llm-url
```
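Reading these variables in a notebook cell can be sketched as follows (this assumes the variables were exported in the shell or loaded from the .env file beforehand, e.g. with python-dotenv's `load_dotenv()`):

```python
import os

def load_llm_config():
    """Return (api_key, llm_url), failing fast if either is unset."""
    api_key = os.environ.get("API_KEY")
    llm_url = os.environ.get("LLM_URL")
    if not api_key or not llm_url:
        raise RuntimeError("Set API_KEY and LLM_URL before running the notebooks")
    return api_key, llm_url
```

Failing fast here gives a clearer error than a connection failure deep inside a model call.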
- Adjust the chunking parameters in `injest-local.ipynb` for your document.
- Modify the prompt generation and response handling in `injest-splitter.ipynb` to fit your needs.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.