Parsing and Chunking

This section covers the data ingestion process: importing files, parsing them into documents, and chunking them for vectorization and storage in the vector store.

Document Management

Documents represent files that have been imported and processed into the vector store. Each document belongs to a collection and is broken down into searchable chunks.

Importing Files

The API accepts multiple file types: JSON, PDF, Markdown, and HTML.

Standard File Import

Import document:

curl -X POST http://localhost:8000/v1/documents \
  -H "Authorization: Bearer <api_key>" \
  -F "file=@/path/to/document.pdf" \
  -F "collection=1" \
  -F "chunker=RecursiveCharacterTextSplitter" \
  -F "chunk_size=2048" \
  -F "chunk_overlap=0"

Get documents:

curl -X GET "http://localhost:8000/v1/documents?collection=1&offset=0&limit=10" \
  -H "Authorization: Bearer <api_key>"

Get document by ID:

curl -X GET http://localhost:8000/v1/documents/1 \
  -H "Authorization: Bearer <api_key>"

Delete document:

curl -X DELETE http://localhost:8000/v1/documents/1 \
  -H "Authorization: Bearer <api_key>"
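The `offset` and `limit` query parameters suggest that listings are paginated. The following is a minimal Python sketch of walking through all documents of a collection page by page. The helper names (`documents_url`, `fetch_json`, `iter_documents`) are hypothetical, and the assumption that the response is a JSON object holding the items under a `data` key is not confirmed by this page; adjust it to the actual response shape.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:8000/v1"


def documents_url(collection, offset=0, limit=10):
    """Build the listing URL used in the 'Get documents' curl example."""
    query = urlencode({"collection": collection, "offset": offset, "limit": limit})
    return f"{BASE_URL}/documents?{query}"


def fetch_json(url, api_key):
    """Send an authenticated GET request and decode the JSON body."""
    request = Request(url, headers={"Authorization": f"Bearer {api_key}"})
    with urlopen(request) as response:
        return json.load(response)


def iter_documents(collection, api_key, limit=10, fetch=fetch_json):
    """Yield documents one page at a time until a short page is returned."""
    offset = 0
    while True:
        page = fetch(documents_url(collection, offset, limit), api_key)
        # ASSUMPTION: items are listed under a "data" key in the response.
        items = page.get("data", [])
        yield from items
        if len(items) < limit:
            break  # fewer items than requested: this was the last page
        offset += limit
```

Calling `list(iter_documents(1, "<api_key>"))` would collect every document of collection 1. The `fetch` argument is injectable so the paging logic can be exercised without a running server.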

JSON Format

The JSON format is suited to bulk imports. Unlike the other file types, a JSON file is split into multiple documents, one per entry in the array:

[
  {
    "text": "Content of the first document",
    "title": "Document 1"
  },
  {
    "text": "Content of the second document",
    "title": "Document 2",
    "metadata": {
      "author": "John Doe",
      "date": "2024-01-01"
    }
  }
]

Note: Metadata is optional and only available for JSON files. It is returned along with the chunk during search operations.
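A bulk import file in this layout can be generated from existing records. The sketch below is one way to do it in Python; the function name `build_bulk_import` and the tuple layout of the input records are illustrative assumptions, while the `text`, `title`, and `metadata` keys come from the example above.

```python
import json


def build_bulk_import(records):
    """Convert (title, text, metadata-or-None) tuples into the JSON array layout."""
    entries = []
    for title, text, metadata in records:
        entry = {"text": text, "title": title}
        if metadata:  # metadata is optional, so the key is omitted when absent
            entry["metadata"] = metadata
        entries.append(entry)
    return entries


records = [
    ("Document 1", "Content of the first document", None),
    (
        "Document 2",
        "Content of the second document",
        {"author": "John Doe", "date": "2024-01-01"},
    ),
]

payload = build_bulk_import(records)
serialized = json.dumps(payload, ensure_ascii=False, indent=2)
print(serialized)
```

Writing `serialized` to a `.json` file gives a file that can be passed to the import endpoint in place of the PDF in the curl example.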

Chunking Strategy

The chunking strategy is configurable via parameters. Chunking breaks down documents into smaller pieces that can be efficiently vectorized and searched.

Available Chunkers

Chunker	Description
`NoSplitter`	The file is treated as a single chunk
`RecursiveCharacterTextSplitter`	Splits text recursively, trying each separator in turn (see the LangChain documentation)

Available Parameters

chunker: Chunker type (default: RecursiveCharacterTextSplitter)
chunk_size: Size of chunks (default: 2048)
chunk_overlap: Overlap between chunks (default: 0)
chunk_min_size: Minimum chunk size (default: 0)
separators: List of separators (default: ["\n\n", "\n", ". ", " "])
preset_separators: Preset language-specific separators (e.g., markdown, python)
is_separator_regex: Whether separators are regex patterns (default: false)
metadata: Additional metadata, passed as a JSON string
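To give an intuition for how `chunk_size` and `separators` interact, here is a simplified Python sketch of recursive character splitting. It is not the LangChain implementation and not the code used by the API: it ignores `chunk_overlap`, `chunk_min_size`, and `is_separator_regex`, drops the separator at chunk boundaries, and assumes that `chunk_size` is measured in characters. The function name `recursive_split` is hypothetical.

```python
def recursive_split(text, chunk_size, separators):
    """Split text into chunks of at most chunk_size characters.

    Parts are cut on the first separator and merged greedily; any part that
    is still too long is split again with the remaining, finer separators.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        if not part:
            continue
        if len(part) > chunk_size:
            # Flush what has been merged so far, then recurse on the long part.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(part, chunk_size, finer))
            continue
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # still fits: keep merging
        else:
            chunks.append(current)
            current = part
    if current:
        chunks.append(current)
    return chunks


default_separators = ["\n\n", "\n", ". ", " "]
sample = "aaaa bbbb\n\ncccc dddd eeee"
chunks = recursive_split(sample, chunk_size=10, separators=default_separators)
print(chunks)
```

In this sample the paragraph break is tried first; the second paragraph is longer than 10 characters, so it is split again on spaces, and every resulting chunk stays within `chunk_size`.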

Chunk Management

Chunks are the smallest units in the vector store, representing portions of text from documents. Each chunk is vectorized and can be retrieved during search operations.

Get chunks:

curl -X GET "http://localhost:8000/v1/chunks/1?offset=0&limit=10" \
  -H "Authorization: Bearer <api_key>"

Get chunk by ID:

curl -X GET http://localhost:8000/v1/chunks/1/123 \
  -H "Authorization: Bearer <api_key>"

Next Steps

Learn how to search through your documents: RAG Search
Learn about collection management: Collections
