Data pipelines

Pipelines are automated workflows that transform documents into searchable knowledge bases for AI agents. They monitor file storage locations, process documents when changes occur, and maintain vector databases that agents query for information.

Document processing workflow

Agents cannot query raw documents directly. PDFs and Word files must first be converted to text, split into chunks, and encoded as vector embeddings, which make semantic search possible. Pipelines handle this transformation automatically.
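As a rough illustration of the chunking and embedding steps, the sketch below splits already-extracted text into overlapping word windows and maps each chunk to a vector. The names `chunk_text`, `toy_embed`, and `cosine`, the chunk sizes, and the hashing-based embedding are all assumptions made for this example. A real pipeline would call an embedding model; the hashing trick only stands in for one so the example runs with the standard library alone.

```python
import hashlib
import math


def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of `chunk_size` words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


def toy_embed(chunk: str, dims: int = 64) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed bag of words."""
    vec = [0.0] * dims
    for word in chunk.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    # Normalize to unit length so a dot product equals cosine similarity.
    return [v / norm for v in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    text = "Pipelines convert documents to text, split them, and embed each chunk."
    for chunk in chunk_text(text, chunk_size=6, overlap=2):
        print(chunk, "->", len(toy_embed(chunk)), "dimensions")
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from either side, at the cost of storing some words twice.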

The diagram shows the complete flow from document ingestion through to agent queries. Each stage transforms the data to make it searchable and retrievable.

Automatic synchronization

Pipelines monitor data sources for changes. When a document is added, modified, or deleted, the pipeline processes the change and updates the knowledge base. This keeps agent responses current without manual intervention.
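One common way to detect such changes is to compare content hashes between two scans of the source. The sketch below assumes a local directory as the data source; `snapshot` and `diff_snapshots` are hypothetical helper names, and the pipeline's actual detection mechanism is not described here.

```python
import hashlib
from pathlib import Path


def snapshot(root: str) -> dict[str, str]:
    """Map each file under `root` (by relative path) to a SHA-256 of its bytes."""
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(Path(root).rglob("*"))
        if path.is_file()
    }


def diff_snapshots(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Classify paths as added, modified, or deleted between two snapshots."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "modified": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
        "deleted": sorted(old.keys() - new.keys()),
    }


if __name__ == "__main__":
    before = {"guide.pdf": "aaa", "faq.docx": "bbb"}
    after = {"faq.docx": "ccc", "notes.txt": "ddd"}
    print(diff_snapshots(before, after))
```

Each category maps to a different knowledge-base update: added and modified documents are reprocessed, while deleted documents have their vectors removed.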

Orchestration with Dagster

Dagster orchestrates pipeline execution, handling scheduling, retries, and logging. Each processing step is tracked, creating an audit trail from document ingestion through to storage. You can review pipeline runs to troubleshoot issues, verify document processing, and monitor data quality.