LLM proxy

The LLM proxy (LiteLLM) provides a centralized gateway to language model providers. It abstracts vendor-specific APIs behind an OpenAI-compatible interface, allowing the platform to work with multiple AI providers without changing code.

Configuration

Models are configured in the LiteLLM configuration file. Each model entry specifies the provider, API endpoint, authentication, and capabilities.

Example model configuration:

yaml

model_list:
  # Cloud model (Swiss LLM Cloud)
  - model_name: text-generation/gemma-4-31B-it
    litellm_params:
      model: openai/google/gemma-4-31B-it
      api_base: os.environ/SWISS_LLM_CLOUD_API_BASE_URL
      api_key: os.environ/SWISS_LLM_CLOUD_API_KEY
      drop_params: true
    model_info:
      mode: chat
      supports_function_calling: true
      input_cost_per_token: 0.0000002
      output_cost_per_token: 0.0000008

  # Local GPU model (vLLM)
  - model_name: text-generation/Qwen3-VL-30B-A3B-Instruct-FP8
    litellm_params:
      model: openai/qwen3-vl-30b
      api_base: http://vllm:8000/v1
      api_key: os.environ/LOCAL_LLM_TOKEN
      drop_params: true
    model_info:
      mode: chat
      supports_function_calling: true
      supports_vision: true
      input_cost_per_token: 0
      output_cost_per_token: 0

The model_name identifies the model in agent configurations using the real canonical model name. The litellm_params section contains provider-specific connection details. The model_info section specifies capabilities and per-token pricing for cost tracking through Langfuse.

Core functions

Unified interface: LiteLLM provides an OpenAI-compatible API that works with Swiss LLM Cloud, locally hosted vLLM models, and other providers. Platform code uses the same interface regardless of which model handles the request.

Request routing: The proxy routes requests based on configured strategy. Current configuration uses "usage-based-routing-v2" which distributes load across available models.

Cost tracking: Usage tracking captures token consumption per request. Cost per token is configured for each model, allowing the platform to calculate and display costs per conversation. See Cost control for details on cost tracking and optimization.

PII protection: Presidio integration (when enabled) scans requests for personally identifiable information before sending them to external providers. See Data Anonymization for details.

Retry policies: The configuration specifies retry counts for timeout errors, rate limit errors, and internal server errors.

Introduction: The Swiss AI Hub Vision

Why Swiss AI Hub

Quick Start: Your First 30 Minutes

Platform Architecture

Deployment Guide

Monitoring & Alerting

Identity Provider Setup

Microsoft Entra ID

Agents

Data pipelines

Knowledge management

Chat Interface

Access Management

Auditing & Observability

Language models

Memory

Multi-tenancy

Slack & Teams Integrations

API

Security

Compliance and regulations

Quick Start

Building Agents

Building Pipelines

Building Processes

Advanced SDK Topics

Features

Contributing

Using AI to Contribute

Certification

API Reference

Troubleshooting

Glossary

Pipeline

Sources

LLM proxy

Configuration

Core functions

Monitoring & Alerting

Identity Provider Setup

Microsoft Entra ID

Sources

LLM proxy ​

Configuration ​

Core functions ​

LLM proxy

Configuration

Core functions