Parsing and Chunking
This section covers the data ingestion process: importing files, parsing them into documents, and chunking them for vectorization and storage in the vector store.
Document Management
Documents represent files that have been imported and processed into the vector store. Each document belongs to a collection and is broken down into searchable chunks.
Importing Files
The API accepts multiple file types: JSON, PDF, Markdown, and HTML.
Standard File Import
Import document:
curl -X POST http://localhost:8000/v1/documents \
  -H "Authorization: Bearer <api_key>" \
  -F "file=@/path/to/document.pdf" \
  -F "collection=1" \
  -F "chunker=RecursiveCharacterTextSplitter" \
  -F "chunk_size=2048" \
  -F "chunk_overlap=0"

Get documents:
curl -X GET "http://localhost:8000/v1/documents?collection=1&offset=0&limit=10" \
  -H "Authorization: Bearer <api_key>"

Get document by ID:
curl -X GET http://localhost:8000/v1/documents/1 \
  -H "Authorization: Bearer <api_key>"

Delete document:
curl -X DELETE http://localhost:8000/v1/documents/1 \
  -H "Authorization: Bearer <api_key>"
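The `offset` and `limit` query parameters of the listing request suggest offset-based pagination. The following is a minimal Python sketch of walking every page of a collection. `fetch_page` is a hypothetical callable standing in for the authenticated HTTP call, and the stop condition (a page shorter than `limit`) is an assumption, since the response format is not described here.

```python
from typing import Callable, Iterator, List
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000/v1/documents"  # taken from the curl examples


def documents_url(collection: int, offset: int, limit: int) -> str:
    """Build the URL used by the 'Get documents' request."""
    query = urlencode({"collection": collection, "offset": offset, "limit": limit})
    return f"{BASE_URL}?{query}"


def iter_documents(
    fetch_page: Callable[[str], List[dict]],
    collection: int,
    limit: int = 10,
) -> Iterator[dict]:
    """Yield every document of a collection, one page at a time.

    fetch_page is hypothetical: it receives a URL and returns the list of
    documents found on that page. Assumption: a page shorter than `limit`
    (including an empty page) means there is nothing left to fetch.
    """
    offset = 0
    while True:
        page = fetch_page(documents_url(collection, offset, limit))
        yield from page
        if len(page) < limit:
            break
        offset += limit
```

Injecting `fetch_page` keeps the paging logic independent of any HTTP library and of the exact response envelope, which would need to be unwrapped inside that callable.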
JSON Format
The JSON format is suited to bulk imports. Unlike the other file types, a JSON file is split into multiple documents, one per item of the top-level array:
[
  {
    "text": "Content of the first document",
    "title": "Document 1"
  },
  {
    "text": "Content of the second document",
    "title": "Document 2",
    "metadata": {
      "author": "John Doe",
      "date": "2024-01-01"
    }
  }
]
Metadata is optional and only available for JSON files. It will be returned along with the chunk during search operations.
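A bulk import file in this shape can be generated from existing records. The Python sketch below builds the JSON array shown above; the helper names `make_entry` and `build_import_payload` are hypothetical, and treating both `text` and `title` as required strings is an assumption, since only the optional status of `metadata` is stated here.

```python
import json
from typing import Iterable, Optional


def make_entry(text: str, title: str, metadata: Optional[dict] = None) -> dict:
    """Build one array item; `metadata` is omitted when not provided."""
    entry = {"text": text, "title": title}
    if metadata:
        entry["metadata"] = metadata
    return entry


def build_import_payload(entries: Iterable[dict]) -> str:
    """Serialize entries as a JSON array after a basic shape check.

    Assumption: `text` and `title` must both be strings.
    """
    items = list(entries)
    for index, item in enumerate(items):
        for key in ("text", "title"):
            if not isinstance(item.get(key), str):
                raise ValueError(f"entry {index}: '{key}' must be a string")
    return json.dumps(items, indent=2, ensure_ascii=False)


payload = build_import_payload([
    make_entry("Content of the first document", "Document 1"),
    make_entry(
        "Content of the second document",
        "Document 2",
        {"author": "John Doe", "date": "2024-01-01"},
    ),
])
```

The resulting string can be written to a `.json` file and sent through the standard import request in place of the PDF.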
Chunking Strategy
Chunking breaks documents down into smaller pieces that can be efficiently vectorized and searched. The strategy is chosen per import through the request parameters listed below.
Available Chunkers
| Chunker | Description |
|---|---|
| NoSplitter | The file is considered as a single chunk |
| RecursiveCharacterTextSplitter | Splits text recursively by different separators (see the LangChain documentation) |
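To make the idea of recursive splitting concrete, here is a simplified Python sketch. It is an illustration only, not LangChain's implementation: it ignores overlap, drops the separator at chunk boundaries, and the function name `recursive_split` is hypothetical. The text is split on the first separator, neighbouring pieces are merged while they fit within `chunk_size`, and any piece that is still too long is split again with the next separator.

```python
from typing import List, Sequence

DEFAULT_SEPARATORS = ("\n\n", "\n", ". ", " ")


def recursive_split(
    text: str,
    chunk_size: int,
    separators: Sequence[str] = DEFAULT_SEPARATORS,
) -> List[str]:
    """Split text into chunks of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left to try: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, remaining = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        # Separator not present: try the next, finer one.
        return recursive_split(text, chunk_size, remaining)

    chunks: List[str] = []
    current = ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging while the chunk still fits
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # This piece alone is too long: split it with finer separators.
            chunks.extend(recursive_split(part, chunk_size, remaining))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph.\n\n"
    "Second paragraph is longer. It has two sentences.\n\n"
    "Third."
)
pieces = recursive_split(sample, chunk_size=30)
```

With this sample, the first and third paragraphs fit as they are, while the second exceeds 30 characters and is split further on `". "`.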
Available Parameters
- chunker: Chunker type (default: RecursiveCharacterTextSplitter)
- chunk_size: Size of chunks (default: 2048)
- chunk_overlap: Overlap between chunks (default: 0)
- chunk_min_size: Minimum chunk size (default: 0)
- separators: List of separators (default: ["\n\n", "\n", ". ", " "])
- preset_separators: Preset language-specific separators (e.g., markdown, python)
- is_separator_regex: Whether separators are regex patterns (default: false)
- metadata: Additional metadata as JSON string
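The interplay of `chunk_size`, `chunk_overlap`, and `chunk_min_size` can be illustrated with a fixed-width sliding window. This Python sketch is not how the service chunks text (the real splitter works on separators), and two points are assumptions: sizes are counted in characters, and `chunk_min_size` discards chunks shorter than the given value. The function name `window_chunks` is hypothetical.

```python
from typing import List


def window_chunks(
    text: str,
    chunk_size: int = 2048,
    chunk_overlap: int = 0,
    chunk_min_size: int = 0,
) -> List[str]:
    """Cut text into fixed-width chunks that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # distance between chunk starts
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        # Assumption: chunks shorter than chunk_min_size are dropped.
        if len(chunk) >= chunk_min_size:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # this chunk already reached the end of the text
    return chunks


overlapping = window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
filtered = window_chunks("abcdefghijk", chunk_size=4, chunk_min_size=4)
```

In the first call each chunk repeats the last character of the previous one; in the second, the trailing three-character remainder falls below the minimum and is left out.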
Chunk Management
Chunks are the smallest units in the vector store, representing portions of text from documents. Each chunk is vectorized and can be retrieved during search operations.
Get chunks:
curl -X GET "http://localhost:8000/v1/chunks/1?offset=0&limit=10" \
  -H "Authorization: Bearer <api_key>"

Get chunk by ID:
curl -X GET http://localhost:8000/v1/chunks/1/123 \
  -H "Authorization: Bearer <api_key>"
Next Steps
- Learn how to search through your documents: RAG Search
- Learn about collection management: Collections