Ollama Distribution¶
Ollama is a user-friendly platform for running large language models locally. The LlamaStack Kubernetes operator provides built-in support for Ollama through a pre-configured distribution.
Overview¶
Ollama offers several advantages:
- Ease of Use: Simple model management and deployment
- Local Execution: Run models entirely on your infrastructure
- Model Library: Access to a curated collection of popular models
- Resource Efficiency: Optimized for various hardware configurations
- API Compatibility: OpenAI-compatible API endpoints
Pre-Built Ollama Distribution¶
The operator includes one pre-configured Ollama distribution:
ollama¶
- Image: docker.io/llamastack/distribution-ollama:latest
- Purpose: Standard Ollama deployment
- Requirements: CPU or GPU resources depending on model
- Use Case: General-purpose local LLM inference
Quick Start with Ollama¶
1. Create a LlamaStackDistribution¶
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: my-ollama-distribution
  namespace: default
spec:
  replicas: 1
  server:
    distribution:
      name: "ollama"  # Use supported distribution
    containerSpec:
      port: 8321
      resources:
        requests:
          memory: "8Gi"
          cpu: "4"
        limits:
          memory: "16Gi"
          cpu: "8"
      env:
        - name: INFERENCE_MODEL
          value: "llama3.2:1b"
    storage:
      size: "20Gi"
      mountPath: "/.llama"
2. Deploy the Distribution¶
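Apply the manifest with kubectl. The filename is an assumption; use whatever name you saved the manifest from step 1 under:
# Apply the LlamaStackDistribution manifest from step 1
kubectl apply -f my-ollama-distribution.yaml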
3. Verify Deployment¶
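A minimal set of checks, reusing the resource name from step 1 and the label selector used in the Monitoring section; the CRD's short names and your namespace may differ:
# Check the custom resource created in step 1
kubectl get llamastackdistribution my-ollama-distribution -n default
# Check the pods created by the operator
kubectl get pods -l app=llama-stack -n default
# Watch the server logs while the model is pulled
kubectl logs -l app=llama-stack -n default -f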
Configuration Options¶
Container Specification¶
The containerSpec section allows you to configure the container:
spec:
  server:
    containerSpec:
      name: "llama-stack"  # Optional, defaults to "llama-stack"
      port: 8321           # Optional, defaults to 8321
      resources:
        requests:
          memory: "8Gi"
          cpu: "4"
        limits:
          memory: "16Gi"
          cpu: "8"
      env:
        - name: INFERENCE_MODEL
          value: "llama3.2:1b"
        - name: OLLAMA_HOST
          value: "0.0.0.0:11434"
        - name: OLLAMA_ORIGINS
          value: "*"
Environment Variables¶
Configure Ollama behavior through environment variables:
env:
  - name: INFERENCE_MODEL
    value: "llama2:7b"
  - name: OLLAMA_HOST
    value: "0.0.0.0:11434"
  - name: OLLAMA_ORIGINS
    value: "*"
  - name: OLLAMA_NUM_PARALLEL
    value: "4"
  - name: OLLAMA_MAX_LOADED_MODELS
    value: "3"
Popular Models¶
You can specify different models using the INFERENCE_MODEL environment variable:
# Llama 2 variants
- name: INFERENCE_MODEL
  value: "llama2:7b"      # 7B parameter model
# value: "llama2:13b"     # 13B parameter model
# value: "llama2:70b"     # 70B parameter model
# Code-focused models
# value: "codellama:7b"   # Code generation
# value: "codellama:13b"  # Larger code model
# Chat-optimized models
# value: "llama2:7b-chat"
# value: "llama2:13b-chat"
# Other popular models
# value: "mistral:7b"     # Mistral 7B
# value: "neural-chat:7b" # Intel's neural chat
# value: "orca-mini:3b"   # Smaller, efficient model
Resource Requirements¶
GPU Support¶
For GPU acceleration:
resources:
  requests:
    nvidia.com/gpu: "1"
    memory: "8Gi"
    cpu: "2"
  limits:
    nvidia.com/gpu: "1"
    memory: "16Gi"
    cpu: "4"
env:
  - name: INFERENCE_MODEL
    value: "llama2:7b"
  - name: OLLAMA_GPU_LAYERS
    value: "35"  # Number of layers to run on GPU
Storage Configuration¶
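The storage section requests a persistent volume for model data. A minimal sketch reusing the fields shown in the Quick Start; the size is an assumption to adjust for the models you plan to pull:
spec:
  server:
    storage:
      size: "50Gi"          # Room for model downloads and cache
      mountPath: "/.llama"  # Where the volume is mounted inside the container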
Advanced Configuration¶
Custom Model Management with Pod Overrides¶
spec:
  server:
    podOverrides:
      volumes:
        - name: ollama-models
          persistentVolumeClaim:
            claimName: ollama-models-pvc
      volumeMounts:
        - name: ollama-models
          mountPath: /root/.ollama
    containerSpec:
      env:
        - name: INFERENCE_MODEL
          value: "llama3.2:1b"
        - name: OLLAMA_MODELS
          value: "/root/.ollama/models"
Multiple Model Setup¶
spec:
  server:
    containerSpec:
      env:
        - name: INFERENCE_MODEL
          value: "llama3.2:1b"  # Primary model
        - name: OLLAMA_MAX_LOADED_MODELS
          value: "3"
        - name: ADDITIONAL_MODELS
          value: "codellama:7b,mistral:7b"  # Additional models to pull
      resources:
        requests:
          memory: "24Gi"
          cpu: "8"
        limits:
          memory: "48Gi"
          cpu: "16"
Scaling with Multiple Replicas¶
spec:
  replicas: 2
  server:
    distribution:
      name: "ollama"
    containerSpec:
      resources:
        requests:
          memory: "8Gi"
          cpu: "4"
        limits:
          memory: "16Gi"
          cpu: "8"
Using Ollama with the Kubernetes Operator¶
The LlamaStack Kubernetes operator supports Ollama in two ways:
1. Pre-Built Distribution (Recommended)¶
Use the pre-built, maintained distribution with the distribution.name field:
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: ollama-distribution
  namespace: default
spec:
  replicas: 1
  server:
    distribution:
      name: "ollama"  # Supported distribution
    containerSpec:
      port: 8321
      resources:
        requests:
          memory: "8Gi"
          cpu: "4"
        limits:
          memory: "16Gi"
          cpu: "8"
      env:
        - name: INFERENCE_MODEL
          value: "llama3.2:1b"
        - name: OLLAMA_URL
          value: "http://ollama-server-service.ollama-dist.svc.cluster.local:11434"
    storage:
      size: "20Gi"
With GPU Support¶
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: ollama-gpu-distribution
  namespace: default
spec:
  replicas: 1
  server:
    distribution:
      name: "ollama"  # Supported distribution
    containerSpec:
      resources:
        requests:
          nvidia.com/gpu: "1"
          memory: "16Gi"
          cpu: "8"
        limits:
          nvidia.com/gpu: "1"
          memory: "32Gi"
          cpu: "16"
      env:
        - name: INFERENCE_MODEL
          value: "llama2:7b"
        - name: OLLAMA_GPU_LAYERS
          value: "35"
        - name: OLLAMA_NUM_PARALLEL
          value: "4"
    storage:
      size: "50Gi"
2. Bring Your Own (BYO) Custom Images¶
Use custom-built distributions with the distribution.image field:
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: custom-ollama-distribution
  namespace: default
spec:
  replicas: 1
  server:
    distribution:
      image: "my-registry.com/custom-ollama:v1.0.0"  # Custom image
    containerSpec:
      resources:
        requests:
          memory: "16Gi"
          cpu: "8"
        limits:
          memory: "32Gi"
          cpu: "16"
      env:
        - name: INFERENCE_MODEL
          value: "custom-model:latest"
        - name: CUSTOM_OLLAMA_SETTING
          value: "optimized"
    storage:
      size: "100Gi"
Building Custom Ollama Distributions¶
Step 1: Build with LlamaStack CLI¶
Option A: From Template¶
# Install LlamaStack CLI
pip install llama-stack
# Build from Ollama template
llama stack build --template ollama --image-type container --image-name my-ollama-dist
Option B: Custom Configuration¶
Create custom-ollama-build.yaml:
name: custom-ollama
distribution_spec:
  description: Custom Ollama distribution with pre-loaded models
  providers:
    inference: remote::ollama
    memory: inline::faiss
    safety: inline::llama-guard
    agents: inline::meta-reference
    telemetry: inline::meta-reference
image_name: custom-ollama
image_type: container
Build the distribution:
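The exact flags depend on your llama-stack version; with recent releases the build configuration is passed via --config:
llama stack build --config custom-ollama-build.yaml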
Step 2: Enhance with Custom Dockerfile¶
Create Dockerfile.enhanced:
FROM distribution-custom-ollama:dev
# Install additional tools
RUN apt-get update && apt-get install -y \
    curl \
    jq \
    htop \
    && rm -rf /var/lib/apt/lists/*
# Pre-pull popular models
RUN ollama pull llama3.2:1b && \
    ollama pull llama3.2:3b && \
    ollama pull codellama:7b && \
    ollama pull mistral:7b
# Add custom model management scripts
COPY scripts/model-manager.sh /usr/local/bin/model-manager
COPY scripts/health-check.sh /usr/local/bin/health-check
RUN chmod +x /usr/local/bin/model-manager /usr/local/bin/health-check
# Add custom Ollama configuration
COPY ollama-config.json /etc/ollama/config.json
# Set optimized environment variables
ENV OLLAMA_HOST=0.0.0.0:11434
ENV OLLAMA_ORIGINS=*
ENV OLLAMA_NUM_PARALLEL=4
ENV OLLAMA_MAX_LOADED_MODELS=3
ENV OLLAMA_KEEP_ALIVE=5m
# Add health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
  CMD health-check
EXPOSE 8321 11434
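Build and push the enhanced image so the operator can pull it. The registry and tag below match the manifest in Step 3 and are otherwise placeholders:
docker build -f Dockerfile.enhanced -t my-registry.com/enhanced-ollama:v1.0.0 .
docker push my-registry.com/enhanced-ollama:v1.0.0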
Step 3: Deploy with Operator¶
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: enhanced-ollama-dist
  namespace: production
spec:
  replicas: 2
  server:
    distribution:
      image: "my-registry.com/enhanced-ollama:v1.0.0"
    containerSpec:
      resources:
        requests:
          memory: "16Gi"
          cpu: "8"
          nvidia.com/gpu: "1"
        limits:
          memory: "32Gi"
          cpu: "16"
          nvidia.com/gpu: "1"
      env:
        - name: INFERENCE_MODEL
          value: "llama3.2:3b"
        - name: OLLAMA_NUM_PARALLEL
          value: "4"
        - name: OLLAMA_MAX_LOADED_MODELS
          value: "2"
        - name: CUSTOM_OPTIMIZATION
          value: "enabled"
    storage:
      size: "200Gi"
    podOverrides:
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: shared-model-cache
      volumeMounts:
        - name: model-cache
          mountPath: /shared-models
Comparison: Pre-Built vs BYO¶
| Aspect | Pre-Built Distribution | BYO Custom Images | 
|---|---|---|
| Setup Complexity | Simple - just specify name | Complex - build and maintain images | 
| Maintenance | Maintained by LlamaStack team | You maintain the images | 
| Model Management | Runtime model pulling | Pre-loaded models possible | 
| Customization | Limited to environment variables | Full control over Ollama configuration | 
| Security | Vetted by maintainers | You control security scanning and updates | 
| Performance | Standard Ollama setup | Custom optimizations possible | 
| Support | Community and official support | Self-supported | 
| Updates | Automatic with operator updates | Manual image rebuilds required | 
When to Use Pre-Built Distribution¶
- Quick deployment and standard use cases
- Production environments where stability is key
- Dynamic model management (pull models at runtime)
- Teams without container expertise
- Standard Ollama configurations
When to Use BYO Custom Images¶
- Pre-loaded models for faster startup
- Custom Ollama configurations or patches
- Additional tools and utilities
- Compliance requirements for image provenance
- Integration with existing model management systems
- Custom model formats or converters
Model Management¶
Accessing the Ollama Container¶
# Connect to running Ollama pod
kubectl exec -it <ollama-pod> -- bash
# Pull models
ollama pull llama2:7b
# List available models
ollama list
# Remove unused models
ollama rm old-model:tag
Model Information¶
# Show model details
kubectl exec -it <ollama-pod> -- ollama show llama2:7b
# Check model size and parameters
kubectl exec -it <ollama-pod> -- ollama show llama2:7b --modelfile
API Usage¶
REST API¶
The distribution exposes OpenAI-compatible endpoints on the LlamaStack server port:
# Generate completion
curl -X POST http://ollama-service:8321/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2:7b",
    "prompt": "Why is the sky blue?",
    "max_tokens": 100
  }'
# Chat completion
curl -X POST http://ollama-service:8321/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2:7b",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
Python Client¶
import requests
# Generate text
response = requests.post(
    "http://ollama-service:8321/v1/completions",
    json={
        "model": "llama2:7b",
        "prompt": "Explain quantum computing",
        "max_tokens": 200
    }
)
print(response.json())
Monitoring and Troubleshooting¶
Health Checks¶
# Check pod status
kubectl get pods -l app=llama-stack
# View logs
kubectl logs -l app=llama-stack
# Test API endpoint
kubectl port-forward svc/my-ollama-distribution-service 8321:8321
curl http://localhost:8321/v1/health
Performance Monitoring¶
# Monitor resource usage
kubectl top pods -l app=llama-stack
# Check model loading status
kubectl exec -it <ollama-pod> -- ollama ps
Common Issues¶
- Model Download Failures
    - Check internet connectivity
    - Verify sufficient storage space
    - Ensure proper permissions
- Out of Memory
    - Use smaller models (3b, 7b instead of 13b, 70b)
    - Increase memory limits
    - Reduce concurrent requests
- Slow Performance
    - Enable GPU acceleration
    - Use faster storage for model cache
    - Optimize model selection for hardware
Best Practices¶
Resource Planning¶
- Memory: Allocate 2-4x the model size in RAM (see the worked example after this list)
- Storage: Plan for model downloads and cache
- CPU: More cores improve concurrent request handling
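As a rough worked example, a 4-bit quantized 7B model is about 4 GB on disk, so the 2-4x guideline lands at roughly 8-16Gi of RAM; the values below are illustrative, not measured:
resources:
  requests:
    memory: "8Gi"   # ~2x the ~4 GB on-disk size of a 4-bit 7B model
  limits:
    memory: "16Gi"  # ~4x, leaving headroom for KV cache and concurrent requests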
Model Selection¶
# For development/testing
env:
  - name: INFERENCE_MODEL
    value: "orca-mini:3b"    # Fast, lightweight
# For general use
env:
  - name: INFERENCE_MODEL
    value: "llama2:7b"       # Good balance of quality/performance
# For high-quality responses
env:
  - name: INFERENCE_MODEL
    value: "llama2:13b"      # Better quality, more resources
# For code generation
env:
  - name: INFERENCE_MODEL
    value: "codellama:7b"    # Specialized for coding tasks
Security Considerations¶
- Use private registries for custom images
- Implement network policies for API access (a sketch follows this list)
- Secure model storage with appropriate permissions
- Monitor API usage and implement rate limiting
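A minimal NetworkPolicy sketch restricting ingress to the server port. The pod label matches the selector used in the Monitoring section; the client label is a hypothetical placeholder to adapt:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ollama-api-access
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: llama-stack            # Pods created for the distribution
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              llama-client: "true"  # Hypothetical label for allowed clients
      ports:
        - protocol: TCP
          port: 8321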
Examples¶
Production Setup¶
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: production-ollama
  namespace: llama-production
spec:
  replicas: 2
  server:
    distribution:
      name: "ollama"
    containerSpec:
      resources:
        requests:
          memory: "16Gi"
          cpu: "8"
          nvidia.com/gpu: "1"
        limits:
          memory: "32Gi"
          cpu: "16"
          nvidia.com/gpu: "1"
      env:
        - name: INFERENCE_MODEL
          value: "llama3.2:1b"
        - name: OLLAMA_NUM_PARALLEL
          value: "4"
        - name: OLLAMA_MAX_LOADED_MODELS
          value: "2"
    storage:
      size: "100Gi"
Development Setup¶
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: dev-ollama
  namespace: development
spec:
  replicas: 1
  server:
    distribution:
      name: "ollama"
    containerSpec:
      resources:
        requests:
          memory: "4Gi"
          cpu: "2"
        limits:
          memory: "8Gi"
          cpu: "4"
      env:
        - name: INFERENCE_MODEL
          value: "orca-mini:3b"
    storage:
      size: "20Gi"
API Reference¶
For complete API documentation, see:
- API Reference
- Configuration Reference