Ollama Distribution

Ollama is a user-friendly platform for running large language models locally. The LlamaStack Kubernetes operator provides built-in support for Ollama through a pre-configured distribution.

Overview

Ollama offers several advantages:

  • Ease of Use: Simple model management and deployment
  • Local Execution: Run models entirely on your infrastructure
  • Model Library: Access to a curated collection of popular models
  • Resource Efficiency: Optimized for various hardware configurations
  • API Compatibility: OpenAI-compatible API endpoints

Pre-Built Ollama Distribution

The operator includes one pre-configured Ollama distribution:

ollama

  • Image: docker.io/llamastack/distribution-ollama:latest
  • Purpose: Standard Ollama deployment
  • Requirements: CPU or GPU resources depending on model
  • Use Case: General-purpose local LLM inference

Quick Start with Ollama

1. Create a LlamaStackDistribution

apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: my-ollama-distribution
  namespace: default
spec:
  replicas: 1
  server:
    distribution:
      name: "ollama"  # Use supported distribution
    containerSpec:
      port: 8321
      resources:
        requests:
          memory: "8Gi"
          cpu: "4"
        limits:
          memory: "16Gi"
          cpu: "8"
      env:
        - name: INFERENCE_MODEL
          value: "llama3.2:1b"
    storage:
      size: "20Gi"
      mountPath: "/.llama"

2. Deploy the Distribution

kubectl apply -f ollama-distribution.yaml

3. Verify Deployment

kubectl get llamastackdistribution my-ollama-distribution
kubectl get pods -l app=llama-stack
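
Before sending traffic, confirm the server is actually ready. The label selector and resource name below match the quick-start example; adjust them if yours differ:

# Wait for the pod to become Ready (times out after 5 minutes)
kubectl wait --for=condition=Ready pod -l app=llama-stack --timeout=300s

# Inspect the distribution's status and recent events
kubectl describe llamastackdistribution my-ollama-distribution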

Configuration Options

Container Specification

The containerSpec section allows you to configure the container:

spec:
  server:
    containerSpec:
      name: "llama-stack"  # Optional, defaults to "llama-stack"
      port: 8321           # Optional, defaults to 8321
      resources:
        requests:
          memory: "8Gi"
          cpu: "4"
        limits:
          memory: "16Gi"
          cpu: "8"
      env:
        - name: INFERENCE_MODEL
          value: "llama3.2:1b"
        - name: OLLAMA_HOST
          value: "0.0.0.0:11434"
        - name: OLLAMA_ORIGINS
          value: "*"

Environment Variables

Configure Ollama behavior through environment variables:

env:
  - name: INFERENCE_MODEL
    value: "llama2:7b"
  - name: OLLAMA_HOST
    value: "0.0.0.0:11434"
  - name: OLLAMA_ORIGINS
    value: "*"
  - name: OLLAMA_NUM_PARALLEL
    value: "4"
  - name: OLLAMA_MAX_LOADED_MODELS
    value: "3"

You can specify different models using the INFERENCE_MODEL environment variable:

# Llama 2 variants
- name: INFERENCE_MODEL
  value: "llama2:7b"      # 7B parameter model
# value: "llama2:13b"     # 13B parameter model
# value: "llama2:70b"     # 70B parameter model

# Code-focused models
# value: "codellama:7b"   # Code generation
# value: "codellama:13b"  # Larger code model

# Chat-optimized models
# value: "llama2:7b-chat"
# value: "llama2:13b-chat"

# Other popular models
# value: "mistral:7b"     # Mistral 7B
# value: "neural-chat:7b" # Intel's neural chat
# value: "orca-mini:3b"   # Smaller, efficient model

Resource Requirements

Size requests and limits to the models you plan to run; as a rule of thumb, allocate roughly 2-4x the model size in RAM (see Best Practices below):

resources:
  requests:
    memory: "8Gi"
    cpu: "4"
  limits:
    memory: "16Gi"
    cpu: "8"

GPU Support

For GPU acceleration:

resources:
  requests:
    nvidia.com/gpu: "1"
    memory: "8Gi"
    cpu: "2"
  limits:
    nvidia.com/gpu: "1"
    memory: "16Gi"
    cpu: "4"
env:
  - name: INFERENCE_MODEL
    value: "llama2:7b"
  - name: OLLAMA_GPU_LAYERS
    value: "35"  # Number of layers to run on GPU

Storage Configuration

storage:
  size: "20Gi"
  mountPath: "/.llama"  # Optional, defaults to "/.llama"

Advanced Configuration

Custom Model Management with Pod Overrides

spec:
  server:
    podOverrides:
      volumes:
        - name: ollama-models
          persistentVolumeClaim:
            claimName: ollama-models-pvc
      volumeMounts:
        - name: ollama-models
          mountPath: /root/.ollama
    containerSpec:
      env:
        - name: INFERENCE_MODEL
          value: "llama3.2:1b"
        - name: OLLAMA_MODELS
          value: "/root/.ollama/models"

Multiple Model Setup

spec:
  server:
    containerSpec:
      env:
        - name: INFERENCE_MODEL
          value: "llama3.2:1b"  # Primary model
        - name: OLLAMA_MAX_LOADED_MODELS
          value: "3"
        - name: ADDITIONAL_MODELS
          value: "codellama:7b,mistral:7b"  # Additional models to pull
      resources:
        requests:
          memory: "24Gi"
          cpu: "8"
        limits:
          memory: "48Gi"
          cpu: "16"

Scaling with Multiple Replicas

spec:
  replicas: 2
  server:
    distribution:
      name: "ollama"
    containerSpec:
      resources:
        requests:
          memory: "8Gi"
          cpu: "4"
        limits:
          memory: "16Gi"
          cpu: "8"

Using Ollama with the Kubernetes Operator

The LlamaStack Kubernetes operator supports Ollama in two ways:

1. Supported Pre-Built Distribution

Use the pre-built, maintained distribution with the distribution.name field:

apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: ollama-distribution
  namespace: default
spec:
  replicas: 1
  server:
    distribution:
      name: "ollama"  # Supported distribution
    containerSpec:
      port: 8321
      resources:
        requests:
          memory: "8Gi"
          cpu: "4"
        limits:
          memory: "16Gi"
          cpu: "8"
      env:
        - name: INFERENCE_MODEL
          value: "llama3.2:1b"
        - name: OLLAMA_URL
          value: "http://ollama-server-service.ollama-dist.svc.cluster.local:11434"
    storage:
      size: "20Gi"

With GPU Support

apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: ollama-gpu-distribution
  namespace: default
spec:
  replicas: 1
  server:
    distribution:
      name: "ollama"  # Supported distribution
    containerSpec:
      resources:
        requests:
          nvidia.com/gpu: "1"
          memory: "16Gi"
          cpu: "8"
        limits:
          nvidia.com/gpu: "1"
          memory: "32Gi"
          cpu: "16"
      env:
        - name: INFERENCE_MODEL
          value: "llama2:7b"
        - name: OLLAMA_GPU_LAYERS
          value: "35"
        - name: OLLAMA_NUM_PARALLEL
          value: "4"
    storage:
      size: "50Gi"

2. Bring Your Own (BYO) Custom Images

Use custom-built distributions with the distribution.image field:

apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: custom-ollama-distribution
  namespace: default
spec:
  replicas: 1
  server:
    distribution:
      image: "my-registry.com/custom-ollama:v1.0.0"  # Custom image
    containerSpec:
      resources:
        requests:
          memory: "16Gi"
          cpu: "8"
        limits:
          memory: "32Gi"
          cpu: "16"
      env:
        - name: INFERENCE_MODEL
          value: "custom-model:latest"
        - name: CUSTOM_OLLAMA_SETTING
          value: "optimized"
    storage:
      size: "100Gi"

Building Custom Ollama Distributions

Step 1: Build with LlamaStack CLI

Option A: From Template

# Install LlamaStack CLI
pip install llama-stack

# Build from Ollama template
llama stack build --template ollama --image-type container --image-name my-ollama-dist

Option B: Custom Configuration

Create custom-ollama-build.yaml:

name: custom-ollama
distribution_spec:
  description: Custom Ollama distribution with pre-loaded models
  providers:
    inference: remote::ollama
    memory: inline::faiss
    safety: inline::llama-guard
    agents: inline::meta-reference
    telemetry: inline::meta-reference
image_name: custom-ollama
image_type: container

Build the distribution:

llama stack build --config custom-ollama-build.yaml

Step 2: Enhance with Custom Dockerfile

Create Dockerfile.enhanced:

FROM distribution-custom-ollama:dev

# Install additional tools
RUN apt-get update && apt-get install -y \
    curl \
    jq \
    htop \
    && rm -rf /var/lib/apt/lists/*

# Pre-pull popular models ("ollama pull" needs a running server,
# so start one in the background for this build step)
RUN ollama serve & sleep 5 && \
    ollama pull llama3.2:1b && \
    ollama pull llama3.2:3b && \
    ollama pull codellama:7b && \
    ollama pull mistral:7b

# Add custom model management scripts
COPY scripts/model-manager.sh /usr/local/bin/model-manager
COPY scripts/health-check.sh /usr/local/bin/health-check
RUN chmod +x /usr/local/bin/model-manager /usr/local/bin/health-check

# Add custom Ollama configuration
COPY ollama-config.json /etc/ollama/config.json

# Set optimized environment variables
ENV OLLAMA_HOST=0.0.0.0:11434
ENV OLLAMA_ORIGINS=*
ENV OLLAMA_NUM_PARALLEL=4
ENV OLLAMA_MAX_LOADED_MODELS=3
ENV OLLAMA_KEEP_ALIVE=5m

# Add health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
  CMD health-check

EXPOSE 8321 11434

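
The Dockerfile copies a scripts/health-check.sh that is not shown above; a minimal sketch, assuming it only needs to probe the local Ollama API:

#!/bin/sh
# scripts/health-check.sh: succeed only if the local Ollama API responds
curl -sf http://localhost:11434/api/tags > /dev/null || exit 1

Then build the image and push it to the registry referenced in the next step:

# Build the enhanced image and push it to your registry
docker build -f Dockerfile.enhanced -t my-registry.com/enhanced-ollama:v1.0.0 .
docker push my-registry.com/enhanced-ollama:v1.0.0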
Step 3: Deploy with Operator

apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: enhanced-ollama-dist
  namespace: production
spec:
  replicas: 2
  server:
    distribution:
      image: "my-registry.com/enhanced-ollama:v1.0.0"
    containerSpec:
      resources:
        requests:
          memory: "16Gi"
          cpu: "8"
          nvidia.com/gpu: "1"
        limits:
          memory: "32Gi"
          cpu: "16"
          nvidia.com/gpu: "1"
      env:
        - name: INFERENCE_MODEL
          value: "llama3.2:3b"
        - name: OLLAMA_NUM_PARALLEL
          value: "4"
        - name: OLLAMA_MAX_LOADED_MODELS
          value: "2"
        - name: CUSTOM_OPTIMIZATION
          value: "enabled"
    storage:
      size: "200Gi"
    podOverrides:
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: shared-model-cache
      volumeMounts:
        - name: model-cache
          mountPath: /shared-models

Comparison: Pre-Built vs BYO

Aspect           | Pre-Built Distribution            | BYO Custom Images
-----------------|-----------------------------------|-------------------------------------------
Setup Complexity | Simple: just specify the name     | Complex: build and maintain images
Maintenance      | Maintained by the LlamaStack team | You maintain the images
Model Management | Runtime model pulling             | Pre-loaded models possible
Customization    | Limited to environment variables  | Full control over Ollama configuration
Security         | Vetted by maintainers             | You control security scanning and updates
Performance      | Standard Ollama setup             | Custom optimizations possible
Support          | Community and official support    | Self-supported
Updates          | Automatic with operator updates   | Manual image rebuilds required

When to Use Pre-Built Distribution

  • Quick deployment and standard use cases
  • Production environments where stability is key
  • Dynamic model management (pull models at runtime)
  • Teams without container expertise
  • Standard Ollama configurations

When to Use BYO Custom Images

  • Pre-loaded models for faster startup
  • Custom Ollama configurations or patches
  • Additional tools and utilities
  • Compliance requirements for image provenance
  • Integration with existing model management systems
  • Custom model formats or converters

Model Management

Accessing the Ollama Container

# Connect to running Ollama pod
kubectl exec -it <ollama-pod> -- bash

# Pull models
ollama pull llama2:7b

# List available models
ollama list

# Remove unused models
ollama rm old-model:tag

Model Information

# Show model details
kubectl exec -it <ollama-pod> -- ollama show llama2:7b

# Check model size and parameters
kubectl exec -it <ollama-pod> -- ollama show llama2:7b --modelfile

API Usage

REST API

Ollama provides OpenAI-compatible endpoints:

# Generate completion
curl -X POST http://ollama-service:8321/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2:7b",
    "prompt": "Why is the sky blue?",
    "max_tokens": 100
  }'

# Chat completion
curl -X POST http://ollama-service:8321/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2:7b",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'

Python Client

import requests

# Generate text via the OpenAI-compatible completions endpoint
response = requests.post(
    "http://ollama-service:8321/v1/completions",
    json={
        "model": "llama2:7b",
        "prompt": "Explain quantum computing",
        "max_tokens": 200,
    },
    timeout=60,
)
response.raise_for_status()

print(response.json())

Monitoring and Troubleshooting

Health Checks

# Check pod status
kubectl get pods -l app=llama-stack

# View logs
kubectl logs -l app=llama-stack

# Test API endpoint
kubectl port-forward svc/my-ollama-distribution-service 8321:8321
curl http://localhost:8321/v1/health

Performance Monitoring

# Monitor resource usage
kubectl top pods -l app=llama-stack

# Check model loading status
kubectl exec -it <ollama-pod> -- ollama ps

Common Issues

  1. Model Download Failures
     • Check internet connectivity
     • Verify sufficient storage space
     • Ensure proper permissions

  2. Out of Memory
     • Use smaller models (3b or 7b instead of 13b or 70b)
     • Increase memory limits
     • Reduce concurrent requests

  3. Slow Performance
     • Enable GPU acceleration
     • Use faster storage for the model cache
     • Optimize model selection for the hardware
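
A few commands help pinpoint which of these you are hitting; pod names are placeholders:

# Look for OOMKilled terminations and other recent events
kubectl describe pod <ollama-pod> | grep -A 5 "Last State"
kubectl get events --sort-by=.lastTimestamp

# Check free space on the model volume inside the container
kubectl exec -it <ollama-pod> -- df -h /.llama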

Best Practices

Resource Planning

  • Memory: Allocate 2-4x model size in RAM
  • Storage: Plan for model downloads and cache
  • CPU: More cores improve concurrent request handling

Model Selection

# For development/testing
env:
  - name: INFERENCE_MODEL
    value: "orca-mini:3b"    # Fast, lightweight

# For general use
env:
  - name: INFERENCE_MODEL
    value: "llama2:7b"       # Good balance of quality/performance

# For high-quality responses
env:
  - name: INFERENCE_MODEL
    value: "llama2:13b"      # Better quality, more resources

# For code generation
env:
  - name: INFERENCE_MODEL
    value: "codellama:7b"    # Specialized for coding tasks

Security Considerations

  • Use private registries for custom images
  • Implement network policies for API access (see the sketch below)
  • Secure model storage with appropriate permissions
  • Monitor API usage and implement rate limiting
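
As a sketch of the network-policy point above, the following only admits traffic to the server port from pods labeled as clients; the labels and namespace are assumptions to adapt:

kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ollama-api-access
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: llama-stack
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: llm-client
      ports:
        - protocol: TCP
          port: 8321
EOF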

Examples

Production Setup

apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: production-ollama
  namespace: llama-production
spec:
  replicas: 2
  server:
    distribution:
      name: "ollama"
    containerSpec:
      resources:
        requests:
          memory: "16Gi"
          cpu: "8"
          nvidia.com/gpu: "1"
        limits:
          memory: "32Gi"
          cpu: "16"
          nvidia.com/gpu: "1"
      env:
        - name: INFERENCE_MODEL
          value: "llama3.2:1b"
        - name: OLLAMA_NUM_PARALLEL
          value: "4"
        - name: OLLAMA_MAX_LOADED_MODELS
          value: "2"
    storage:
      size: "100Gi"

Development Setup

apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: dev-ollama
  namespace: development
spec:
  replicas: 1
  server:
    distribution:
      name: "ollama"
    containerSpec:
      resources:
        requests:
          memory: "4Gi"
          cpu: "2"
        limits:
          memory: "8Gi"
          cpu: "4"
      env:
        - name: INFERENCE_MODEL
          value: "orca-mini:3b"
    storage:
      size: "20Gi"

API Reference

For complete API documentation, see:

  • API Reference
  • Configuration Reference

Next Steps