Parsing and Chunking

This section covers the data ingestion process: importing files, parsing them into documents, and chunking them for vectorization and storage in the vector store.

Document Management

Documents represent files that have been imported and processed into the vector store. Each document belongs to a collection and is broken down into searchable chunks.

Importing Files

The API accepts multiple file types: JSON, PDF, Markdown, and HTML.

Standard File Import

curl -X POST http://localhost:8000/v1/documents \
  -H "Authorization: Bearer <api_key>" \
  -F "file=@/path/to/document.pdf" \
  -F "collection=1" \
  -F "chunker=RecursiveCharacterTextSplitter" \
  -F "chunk_size=2048" \
  -F "chunk_overlap=0"
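
The same form fields can be assembled and sanity-checked in Python before sending. This is a minimal sketch, not part of the API: the helper name `build_import_form` is hypothetical, and the rule that `chunk_overlap` must be smaller than `chunk_size` is an assumption that this page does not confirm.

```python
from typing import Dict

# Chunker names as listed under "Available Chunkers".
KNOWN_CHUNKERS = {"NoSplitter", "RecursiveCharacterTextSplitter"}


def build_import_form(
    collection: int,
    chunker: str = "RecursiveCharacterTextSplitter",
    chunk_size: int = 2048,
    chunk_overlap: int = 0,
) -> Dict[str, str]:
    """Return the non-file form fields as strings, mirroring the -F flags above."""
    if chunker not in KNOWN_CHUNKERS:
        raise ValueError(f"unknown chunker: {chunker}")
    # Assumption (not stated by the API docs): overlap must stay below the chunk size.
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    return {
        "collection": str(collection),
        "chunker": chunker,
        "chunk_size": str(chunk_size),
        "chunk_overlap": str(chunk_overlap),
    }


form = build_import_form(collection=1)
```

With the third-party `requests` library, a dictionary like this would typically be passed as `data=form` alongside `files={"file": open(path, "rb")}` and the `Authorization` header.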

JSON Format

The JSON format is suited to bulk imports. Unlike the other file types, which each produce one document, a JSON file is split into multiple documents, one per entry in the top-level array:

[
  {
    "text": "Content of the first document",
    "title": "Document 1"
  },
  {
    "text": "Content of the second document",
    "title": "Document 2",
    "metadata": {
      "author": "John Doe",
      "date": "2024-01-01"
    }
  }
]

Info: Metadata is optional and only available for JSON files. It will be returned along with the chunk during search operations.
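
A bulk import file in this shape can be generated from existing records. The sketch below uses only the standard library; the helper name `make_entry` is hypothetical, and the field names (`text`, `title`, `metadata`) are taken from the example above.

```python
import json
from typing import Any, Dict, Optional


def make_entry(
    text: str, title: str, metadata: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Build one entry of the bulk import array; metadata is omitted when empty."""
    entry: Dict[str, Any] = {"text": text, "title": title}
    if metadata:
        entry["metadata"] = metadata
    return entry


entries = [
    make_entry("Content of the first document", "Document 1"),
    make_entry(
        "Content of the second document",
        "Document 2",
        {"author": "John Doe", "date": "2024-01-01"},
    ),
]

# Serialize the whole list as one top-level JSON array.
payload = json.dumps(entries, indent=2, ensure_ascii=False)
```

The resulting string can be written to a `.json` file and uploaded with the standard file import request.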

Chunking Strategy

The chunking strategy is configurable via parameters. Chunking breaks down documents into smaller pieces that can be efficiently vectorized and searched.

Available Chunkers

| Chunker | Description |
| --- | --- |
| NoSplitter | The file is considered as a single chunk |
| RecursiveCharacterTextSplitter | Splits text recursively by different separators (see the LangChain documentation) |

Available Parameters

  • chunker: Chunker type (default: RecursiveCharacterTextSplitter)
  • chunk_size: Size of chunks (default: 2048)
  • chunk_overlap: Overlap between chunks (default: 0)
  • chunk_min_size: Minimum chunk size (default: 0)
  • separators: List of separators (default: ["\n\n", "\n", ". ", " "])
  • preset_separators: Preset language-specific separators (e.g., markdown, python)
  • is_separator_regex: Whether separators are regex patterns (default: false)
  • metadata: Additional metadata as JSON string
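
To make the roles of `chunk_size` and `separators` concrete, here is a simplified sketch of recursive splitting. It is not the LangChain implementation or this API's code: it measures size in characters (an assumption), ignores `chunk_overlap`, `chunk_min_size`, and regex separators, and drops the separator at chunk boundaries.

```python
from typing import List, Sequence

DEFAULT_SEPARATORS = ("\n\n", "\n", ". ", " ")


def recursive_split(
    text: str, chunk_size: int, separators: Sequence[str] = DEFAULT_SEPARATORS
) -> List[str]:
    """Split text into chunks of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits: keep merging
            continue
        if current:
            chunks.append(current)  # flush what has been merged so far
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


chunks = recursive_split("aaa bbb ccc\n\nddd eee", chunk_size=7)
```

The coarsest separator is tried first, so paragraph boundaries are preferred over line breaks, sentences, and words. Finer separators are used only for pieces that still exceed the size limit.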

Chunk Management

Chunks are the smallest units in the vector store, representing portions of text from documents. Each chunk is vectorized and can be retrieved during search operations.

curl -X GET "http://localhost:8000/v1/chunks/1?offset=0&limit=10" \
  -H "Authorization: Bearer <api_key>"
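
The `offset` and `limit` query parameters suggest page-by-page retrieval. The following sketch walks through all pages under two assumptions that this page does not confirm: the caller-supplied `fetch` function returns the chunks of one page as a list, and a page shorter than `limit` marks the end. The names `chunk_page_url`, `iter_chunks`, and `resource_id` are hypothetical; the source does not say what the `1` in the path identifies.

```python
from typing import Callable, Iterator, List
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000/v1/chunks"


def chunk_page_url(resource_id: int, offset: int, limit: int) -> str:
    """Build the URL for one page, matching the curl example above."""
    query = urlencode({"offset": offset, "limit": limit})
    return f"{BASE_URL}/{resource_id}?{query}"


def iter_chunks(
    resource_id: int, fetch: Callable[[str], List[dict]], limit: int = 10
) -> Iterator[dict]:
    """Yield chunks page by page until a short page is returned."""
    offset = 0
    while True:
        page = fetch(chunk_page_url(resource_id, offset, limit))
        yield from page
        if len(page) < limit:  # assumption: a short page means no more data
            break
        offset += limit


first_url = chunk_page_url(1, 0, 10)
```

Passing `fetch` as a parameter keeps the HTTP client, authentication header, and response parsing outside the pagination loop, so the loop can be exercised with a stub.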

Next Steps