Docling
Docling parses PDF, DOCX, PPTX, HTML, and other formats into a rich, unified representation (including document layout, tables, etc.), making them ready for generative AI workflows like RAG.
This integration provides Docling's capabilities via the DoclingLoader document loader.
Overview
The presented DoclingLoader component enables you to:
- use various document types in your LLM applications with ease and speed, and
- leverage Docling's rich format for advanced, document-native grounding.
DoclingLoader supports two different export modes:
- ExportType.DOC_CHUNKS (default): if you want to have each input document chunked and to then capture each individual chunk as a separate LangChain Document downstream, or
- ExportType.MARKDOWN: if you want to capture each input document as a separate LangChain Document
The example allows exploring both modes via the parameter EXPORT_TYPE; depending on the value set, the example pipeline is then set up accordingly.
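The mode switch described above can be sketched with a stdlib enum standing in for langchain_docling's ExportType (the class below is a local stand-in for illustration, not the real import):

```python
from enum import Enum

class ExportType(str, Enum):  # stand-in mirroring langchain_docling.loader.ExportType
    DOC_CHUNKS = "doc_chunks"
    MARKDOWN = "markdown"

EXPORT_TYPE = ExportType.DOC_CHUNKS

# Downstream, the pipeline branches on the chosen mode:
if EXPORT_TYPE == ExportType.DOC_CHUNKS:
    strategy = "use Docling chunks directly as LangChain Documents"
elif EXPORT_TYPE == ExportType.MARKDOWN:
    strategy = "split exported Markdown by headers downstream"
```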
Setup
%pip install -qU langchain-docling
Note: you may need to restart the kernel to use updated packages.
For best conversion speed, use GPU acceleration whenever available; e.g. if running on Colab, use a GPU-enabled runtime.
Initialization
Basic initialization looks as follows:
from langchain_docling import DoclingLoader
FILE_PATH = "https://arxiv.org/pdf/2408.09869"
loader = DoclingLoader(file_path=FILE_PATH)
For advanced usage, DoclingLoader has the following parameters:
- file_path: source as single str (URL or local file) or iterable thereof
- converter (optional): any specific Docling converter instance to use
- convert_kwargs (optional): any specific kwargs for conversion execution
- export_type (optional): export mode to use: ExportType.DOC_CHUNKS (default) or ExportType.MARKDOWN
- md_export_kwargs (optional): any specific Markdown export kwargs (for Markdown mode)
- chunker (optional): any specific Docling chunker instance to use (for doc-chunk mode)
- meta_extractor (optional): any specific metadata extractor to use
Load
docs = loader.load()
Token indices sequence length is longer than the specified maximum sequence length for this model (1041 > 512). Running this sequence through the model will result in indexing errors
Note: a message saying "Token indices sequence length is longer than the specified maximum sequence length..." can be ignored in this case; more details here.
Inspecting some sample docs:
for d in docs[:3]:
    print(f"- {d.page_content=}")
- d.page_content='arXiv:2408.09869v5 [cs.CL] 9 Dec 2024'
- d.page_content='Docling Technical Report\nVersion 1.0\nChristoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar\nAI4K Group, IBM Research R¨uschlikon, Switzerland'
- d.page_content='Abstract\nThis technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.'
Lazy Load
Documents can also be loaded in a lazy fashion:
doc_iter = loader.lazy_load()
for doc in doc_iter:
pass # you can operate on `doc` here
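Because lazy_load returns an iterator, stdlib tools such as itertools.islice let you cap how many documents are materialized at once. A minimal sketch, with a stand-in generator in place of the actual loader (fake_lazy_load is made up for illustration):

```python
from itertools import islice

def fake_lazy_load():
    # Stand-in for loader.lazy_load(), which yields one Document at a time.
    for i in range(1000):
        yield f"doc-{i}"

# Consume only the first three items; the rest are never produced.
first_three = list(islice(fake_lazy_load(), 3))
```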
End-to-end Example
import os
# https://github.com/huggingface/transformers/issues/5486:
os.environ["TOKENIZERS_PARALLELISM"] = "false"
- The following example pipeline uses HuggingFace's Inference API; for increased LLM quota, a token can be provided via the env var HF_TOKEN.
- Dependencies for this pipeline can be installed as shown below (--no-warn-conflicts meant for Colab's pre-populated Python env; feel free to remove for stricter usage):
%pip install -q --progress-bar off --no-warn-conflicts langchain-core langchain-huggingface langchain_milvus langchain python-dotenv
Note: you may need to restart the kernel to use updated packages.
Defining the pipeline parameters:
from pathlib import Path
from tempfile import mkdtemp
from dotenv import load_dotenv
from langchain_core.prompts import PromptTemplate
from langchain_docling.loader import ExportType
def _get_env_from_colab_or_os(key):
    try:
        from google.colab import userdata

        try:
            return userdata.get(key)
        except userdata.SecretNotFoundError:
            pass
    except ImportError:
        pass
    return os.getenv(key)
load_dotenv()
HF_TOKEN = _get_env_from_colab_or_os("HF_TOKEN")
FILE_PATH = ["https://arxiv.org/pdf/2408.09869"] # Docling Technical Report
EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
GEN_MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"
EXPORT_TYPE = ExportType.DOC_CHUNKS
QUESTION = "Which are the main AI models in Docling?"
PROMPT = PromptTemplate.from_template(
    "Context information is below.\n---------------------\n{context}\n---------------------\nGiven the context information and not prior knowledge, answer the query.\nQuery: {input}\nAnswer:\n",
)
TOP_K = 3
MILVUS_URI = str(Path(mkdtemp()) / "docling.db")
Now we can instantiate our loader and load documents:
from docling.chunking import HybridChunker
from langchain_docling import DoclingLoader
loader = DoclingLoader(
    file_path=FILE_PATH,
    export_type=EXPORT_TYPE,
    chunker=HybridChunker(tokenizer=EMBED_MODEL_ID),
)
docs = loader.load()
Token indices sequence length is longer than the specified maximum sequence length for this model (1041 > 512). Running this sequence through the model will result in indexing errors
Determining the splits:
if EXPORT_TYPE == ExportType.DOC_CHUNKS:
    splits = docs
elif EXPORT_TYPE == ExportType.MARKDOWN:
    from langchain_text_splitters import MarkdownHeaderTextSplitter

    splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[
            ("#", "Header_1"),
            ("##", "Header_2"),
            ("###", "Header_3"),
        ],
    )
    splits = [split for doc in docs for split in splitter.split_text(doc.page_content)]
else:
    raise ValueError(f"Unexpected export type: {EXPORT_TYPE}")
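For intuition, the MARKDOWN branch essentially groups text under its most recent heading. A rough, pure-Python stand-in for that behavior (this is a simplified sketch, not the actual MarkdownHeaderTextSplitter implementation):

```python
def split_by_headers(markdown_text):
    # Group non-heading lines under the most recent '#'-style heading.
    sections, current_header, current_lines = [], None, []
    for line in markdown_text.splitlines():
        if line.startswith("#"):
            if current_lines:
                sections.append((current_header, "\n".join(current_lines).strip()))
            current_header, current_lines = line.lstrip("# ").strip(), []
        else:
            current_lines.append(line)
    if current_lines:
        sections.append((current_header, "\n".join(current_lines).strip()))
    return sections

sections = split_by_headers("# Title\nIntro text.\n## Details\nMore text.")
```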
Inspecting some sample splits:
for d in splits[:3]:
    print(f"- {d.page_content=}")
print("...")
- d.page_content='arXiv:2408.09869v5 [cs.CL] 9 Dec 2024'
- d.page_content='Docling Technical Report\nVersion 1.0\nChristoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar\nAI4K Group, IBM Research R¨uschlikon, Switzerland'
- d.page_content='Abstract\nThis technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.'
...