Data pipelines
Pipelines are automated workflows that transform documents into searchable knowledge bases for AI agents. They monitor file storage locations, process documents when changes occur, and maintain vector databases that agents query for information.
Document processing workflow
Agents cannot query raw documents directly. PDFs and Word files must first be converted to text, then split into manageable chunks, and finally transformed into vector embeddings that enable semantic search. Pipelines handle this transformation automatically.
The diagram shows the complete flow from document ingestion through to agent queries. Each stage transforms the data to make it searchable and retrievable.
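To make the chunk, embed, and query stages concrete, here is a minimal sketch in plain Python. It is not the pipeline's actual implementation: the function names (chunk_text, embed, build_index, search) are hypothetical, the word-based chunk sizes are arbitrary, and the "embedding" is a toy hashed bag-of-words vector standing in for a real embedding model. Text extraction from PDFs or Word files is assumed to have already happened.

```python
import hashlib
import math


def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into chunks of `chunk_size` words that overlap by `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text: str, dims: int = 64) -> list[float]:
    """Toy stand-in for an embedding model: a hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dims
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_index(chunks: list[str]) -> list[tuple[str, list[float]]]:
    """Pair each chunk with its vector; a real pipeline would write to a vector database."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def search(query: str, index: list[tuple[str, list[float]]], top_k: int = 1) -> list[str]:
    """Return the `top_k` chunks with the highest cosine similarity to the query."""
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]
```

Overlapping chunks are one common way to avoid cutting a sentence's context at a chunk boundary. A production pipeline would replace embed with a trained embedding model, since the hashed vectors here only capture shared words, not meaning.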
Automatic synchronization
Pipelines monitor data sources for changes. When a document is added, modified, or deleted, the pipeline processes the change and updates the knowledge base. This keeps agent responses current without manual intervention.
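One way to detect added, modified, and deleted documents is to compare content fingerprints between two scans of the data source. The sketch below illustrates that idea under stated assumptions: the names (fingerprint, snapshot, diff_snapshots) are hypothetical, files are passed in as an in-memory mapping of path to bytes, and nothing here reflects how the product actually tracks changes.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 hex digest that changes whenever the content changes."""
    return hashlib.sha256(content).hexdigest()


def snapshot(files: dict[str, bytes]) -> dict[str, str]:
    """Map each path to the fingerprint of its current content."""
    return {path: fingerprint(content) for path, content in files.items()}


def diff_snapshots(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Classify paths as added, modified, or deleted between two snapshots."""
    added = sorted(p for p in current if p not in previous)
    deleted = sorted(p for p in previous if p not in current)
    modified = sorted(
        p for p in current if p in previous and current[p] != previous[p]
    )
    return {"added": added, "modified": modified, "deleted": deleted}


# Example: one file edited, one removed, one new.
before = snapshot({"a.pdf": b"v1", "b.docx": b"v1"})
after = snapshot({"a.pdf": b"v2", "c.pdf": b"v1"})
changes = diff_snapshots(before, after)
# changes == {"added": ["c.pdf"], "modified": ["a.pdf"], "deleted": ["b.docx"]}
```

Each category maps to a different knowledge base update: added documents are processed and indexed, modified documents are reprocessed so their old chunks are replaced, and deleted documents have their chunks removed.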
Orchestration with Dagster
Dagster orchestrates pipeline execution, handling scheduling, retries, and logging. Each processing step is tracked, creating an audit trail from document ingestion through to storage. You can review pipeline runs to troubleshoot issues, verify document processing, and monitor data quality.
