Pre-configured Source Templates

This directory contains pre-configured templates for common cloud storage sources. Each template provides:

Environment variable template (.env.template) - Required rclone configuration
Pipeline example (pipeline.py) - Ready-to-use Dagster pipeline definition
Setup guide (README.md) - Step-by-step configuration instructions

Available Sources

Microsoft 365

SharePoint - SharePoint Online document libraries
OneDrive - Personal and Business OneDrive

Cloud Storage

S3 - AWS S3, MinIO, or any S3-compatible storage
Azure Blob - Azure Blob Storage
Google Drive - Google Workspace and personal Drive

Infrastructure

SFTP - Secure file transfer (legacy systems, on-prem)
Local FS - Mounted network shares (NFS, SMB, Azure Files)

Usage

Choose your source (e.g., sharepoint/)
Copy .env.template variables to your .env file
Follow the README.md setup instructions to get credentials
Copy the pipeline.py example to your playground/ or custom pipeline location
Customize patterns, bucket names, and schedules as needed

Environment Variable Convention

All rclone source configuration uses the RCLONE_ prefix to avoid conflicts with other tools (AWS SDK, Azure CLI, etc.):

RCLONE_{SOURCE}_{OPTION}=value

Examples:

bash

# Azure Blob
RCLONE_AZUREBLOB_NAME=azureblob
RCLONE_AZUREBLOB_TYPE=azureblob
RCLONE_AZUREBLOB_ACCOUNT=mystorageaccount
RCLONE_AZUREBLOB_KEY=your-access-key

# S3 (won't conflict with AWS_* or S3_* vars used by boto3)
RCLONE_S3_NAME=s3
RCLONE_S3_TYPE=s3
RCLONE_S3_ACCESS_KEY_ID=AKIA...
RCLONE_S3_SECRET_ACCESS_KEY=your-secret

Required variables for all sources:

RCLONE_{SOURCE}_NAME - Remote name used in rclone
RCLONE_{SOURCE}_TYPE - Backend type (onedrive, drive, s3, azureblob, sftp, local)

All other variables are passed directly to rclone as backend-specific options.
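To make the convention concrete, here is a minimal sketch of how variables following the RCLONE_{SOURCE}_{OPTION} pattern could be grouped per source. The helper name collect_source_options and the sample values are hypothetical. The loading code these templates actually use is not shown on this page and may differ.

```python
from typing import Dict, Mapping


def collect_source_options(env: Mapping[str, str], source: str) -> Dict[str, str]:
    """Group RCLONE_{SOURCE}_{OPTION} variables for one source (hypothetical helper)."""
    prefix = f"RCLONE_{source.upper()}_"
    options = {
        key[len(prefix):].lower(): value
        for key, value in env.items()
        if key.startswith(prefix)
    }
    # NAME and TYPE are required for every source.
    missing = {"name", "type"} - options.keys()
    if missing:
        raise ValueError(f"Missing required variables for {source}: {sorted(missing)}")
    return options


# Sample environment with placeholder values. AWS_ACCESS_KEY_ID belongs to
# another tool and is ignored because it lacks the RCLONE_ prefix.
example_env = {
    "RCLONE_S3_NAME": "s3",
    "RCLONE_S3_TYPE": "s3",
    "RCLONE_S3_ACCESS_KEY_ID": "example-key-id",
    "RCLONE_S3_SECRET_ACCESS_KEY": "example-secret",
    "AWS_ACCESS_KEY_ID": "used-by-another-tool",
}

s3_options = collect_source_options(example_env, "s3")
print(s3_options)
```

In practice you would pass os.environ instead of the sample dictionary. The sketch takes the source name explicitly so that option names containing underscores (such as ACCESS_KEY_ID) are not split incorrectly.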

Understanding Namespaces and Directory Structure

Important: The datalake_directory_name parameter determines the namespace used in the downstream RAG pipeline (vector store). This affects how your data is organized and how it can be searched.

How it works

Source (SharePoint/S3/etc.) → Data Lake (container/directory/) → Vector Store (namespace)

datalake_container_name: The S3 bucket/container where files are stored
datalake_directory_name (optional): The target directory in the data lake = namespace in vector store

Namespace implications for RAG

The data-lake-to-vector-store pipeline creates one namespace per directory. When querying, you can:

Search within a specific namespace (scoped results)
Search across multiple namespaces (broader results)

Decision: Single vs. Multiple Namespaces

Option 1: Single namespace (specify datalake_directory_name):

python

defs = default_rclone_to_datalake_definitions(
    datalake_container_name="myproject",
    datalake_directory_name="all-docs",  # Everything goes into one namespace
    ...
)

All synced files end up in myproject/all-docs/... → namespace "all-docs".

Use when: All documents should be searchable as one knowledge base.

Option 2: Preserve source structure (omit datalake_directory_name):

python

defs = default_rclone_to_datalake_definitions(
    datalake_container_name="myproject",
    # datalake_directory_name not set - source folders become namespaces
    ...
)

Source folder structure is mirrored in the data lake. Each top-level source folder becomes its own namespace.

Example: If SharePoint has:

/HR/policies.pdf
/HR/handbook.pdf
/Engineering/specs.pdf
/Engineering/docs/guide.pdf

Result in data lake:

myproject/HR/policies.pdf         → namespace "HR"
myproject/HR/handbook.pdf         → namespace "HR"
myproject/Engineering/specs.pdf   → namespace "Engineering"
myproject/Engineering/docs/guide.pdf → namespace "Engineering"

Use when: Your source is already organized into logical groupings that should be separate namespaces.
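The mapping described above can be modeled in a few lines. This is only a sketch of the documented behavior, not the pipeline's own code, and the function name namespace_for is hypothetical.

```python
from pathlib import PurePosixPath
from typing import Optional


def namespace_for(source_path: str, datalake_directory_name: Optional[str] = None) -> Optional[str]:
    """Model which namespace a synced file would land in (hypothetical helper)."""
    if datalake_directory_name:
        # Option 1: everything is wrapped in the configured directory.
        return datalake_directory_name
    parts = PurePosixPath(source_path.lstrip("/")).parts
    # Option 2: the top-level source folder is the namespace.
    # A path with a single part is a file at the source root and has no folder.
    return parts[0] if len(parts) > 1 else None


for path in ["/HR/policies.pdf", "/Engineering/specs.pdf", "/Engineering/docs/guide.pdf"]:
    print(path, "->", namespace_for(path))
```

Nested folders such as /Engineering/docs/ do not create additional namespaces in this model. Only the top-level folder counts, which matches the example listing above.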

Option 3: Multiple pipelines (explicit control):

python

# Pipeline 1: HR documents
defs_hr = default_rclone_to_datalake_definitions(
    datalake_container_name="myproject",
    datalake_directory_name="hr-docs",
    source_remote="sharepoint:/HR/",
    ...
)

# Pipeline 2: Engineering documents
defs_eng = default_rclone_to_datalake_definitions(
    datalake_container_name="myproject",
    datalake_directory_name="eng-docs",
    source_remote="sharepoint:/Engineering/",
    ...
)

Use when: You need fine-grained control, different sync schedules, or want to rename namespaces.

Important: Root-level files

When datalake_directory_name is not set and your source has files at the root level (not in any folder), those files will not be processed by the RAG pipeline. Only files within directories get a namespace and are indexed.

Ensure your source structure places all files within folders, or specify a datalake_directory_name to wrap everything in a single namespace.
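A quick pre-flight check can flag root-level files before a sync. The sketch below assumes you already have a list of source file paths (for example from a listing of the remote). The helper name find_root_level_files and the sample paths are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def find_root_level_files(source_paths: Iterable[str]) -> List[str]:
    """Return file paths that sit directly at the source root (hypothetical helper)."""
    return [
        path
        for path in source_paths
        if len(PurePosixPath(path.lstrip("/")).parts) == 1
    ]


# Sample listing of file paths (illustrative values only).
listing = [
    "/HR/policies.pdf",
    "/README.pdf",
    "/Engineering/docs/guide.pdf",
    "notes.txt",
]

skipped = find_root_level_files(listing)
if skipped:
    print("Root-level files that would not get a namespace:", skipped)
```

If the check reports any files, either move them into a folder at the source or set datalake_directory_name so that everything is wrapped in a single namespace.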

Creating Custom Sources

For sources not listed here, you can configure any of the 70+ rclone-supported backends. See the rclone documentation for available providers.
