vLLM Distribution¶
vLLM is a high-performance inference engine optimized for large language models. The LlamaStack Kubernetes operator provides built-in support for vLLM through pre-configured distributions.
Overview¶
vLLM offers excellent performance characteristics:
- High Throughput: Optimized for serving multiple concurrent requests
- Memory Efficiency: Advanced memory management and attention mechanisms
- GPU Acceleration: Native CUDA support for NVIDIA GPUs
- Model Compatibility: Supports a wide range of popular model architectures
Pre-Built vLLM Distributions¶
The operator includes two pre-built vLLM distributions:
vllm-gpu (Self-Hosted)¶
- Image: docker.io/llamastack/distribution-vllm-gpu:latest
- Purpose: GPU-accelerated vLLM inference with local model serving
- Requirements: NVIDIA GPU with CUDA support
- Infrastructure: You provide GPU infrastructure
- Use Case: High-performance inference for production workloads
remote-vllm (External Connection)¶
- Image: docker.io/llamastack/distribution-remote-vllm:latest
- Purpose: Connect to external vLLM server
- Requirements: Access to external vLLM endpoint
- Infrastructure: External vLLM server required
- Use Case: Using existing vLLM deployments or managed services
Quick Start with vLLM¶
1. Create a LlamaStackDistribution¶
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: my-vllm-distribution
  namespace: default
spec:
  replicas: 1
  server:
    distribution:
      name: "vllm-gpu"  # Use supported distribution
    containerSpec:
      port: 8321
      resources:
        requests:
          nvidia.com/gpu: "1"
          memory: "16Gi"
          cpu: "4"
        limits:
          nvidia.com/gpu: "1"
          memory: "32Gi"
          cpu: "8"
      env:
        - name: INFERENCE_MODEL
          value: "meta-llama/Llama-2-7b-chat-hf"
    storage:
      size: "50Gi"
      mountPath: "/.llama"
2. Deploy the Distribution¶
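Apply the manifest with kubectl (assuming you saved the example above as my-vllm-distribution.yaml):
# Apply the LlamaStackDistribution manifest
kubectl apply -f my-vllm-distribution.yaml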
3. Verify Deployment¶
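Check that the custom resource and its pods come up. The plural resource name below is assumed; adjust if your operator version registers the CRD under a different name:
# Check the distribution resource
kubectl get llamastackdistributions -n default
# Check the pods created by the operator
kubectl get pods -l app=llama-stack -n default
# Follow the server logs
kubectl logs -l app=llama-stack -n default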
Configuration Options¶
Container Specification¶
The containerSpec section allows you to configure the container:
spec:
  server:
    containerSpec:
      name: "llama-stack"  # Optional, defaults to "llama-stack"
      port: 8321           # Optional, defaults to 8321
      resources:
        requests:
          nvidia.com/gpu: "1"
          memory: "16Gi"
          cpu: "4"
        limits:
          nvidia.com/gpu: "1"
          memory: "32Gi"
          cpu: "8"
      env:
        - name: INFERENCE_MODEL
          value: "meta-llama/Llama-2-7b-chat-hf"
        - name: VLLM_GPU_MEMORY_UTILIZATION
          value: "0.9"
        - name: VLLM_MAX_SEQ_LEN
          value: "4096"
Environment Variables¶
Configure vLLM behavior through environment variables:
env:
  - name: INFERENCE_MODEL
    value: "meta-llama/Llama-2-7b-chat-hf"
  - name: VLLM_GPU_MEMORY_UTILIZATION
    value: "0.9"
  - name: VLLM_MAX_SEQ_LEN
    value: "4096"
  - name: VLLM_MAX_BATCH_SIZE
    value: "32"
  - name: VLLM_TENSOR_PARALLEL_SIZE
    value: "1"
Resource Requirements¶
resources:
  requests:
    nvidia.com/gpu: "1"
    memory: "16Gi"
    cpu: "4"
  limits:
    nvidia.com/gpu: "1"
    memory: "32Gi"
    cpu: "8"
Storage Configuration¶
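The storage section provisions persistent storage for model weights and cache data. A minimal example, reusing the fields from the Quick Start manifest:
spec:
  server:
    storage:
      size: "50Gi"          # Size of the persistent volume claim
      mountPath: "/.llama"  # Mount path inside the container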
Advanced Configuration¶
Multi-GPU Setup¶
For larger models requiring multiple GPUs:
spec:
  server:
    containerSpec:
      resources:
        requests:
          nvidia.com/gpu: "4"
          memory: "64Gi"
          cpu: "16"
        limits:
          nvidia.com/gpu: "4"
          memory: "128Gi"
          cpu: "32"
      env:
        - name: INFERENCE_MODEL
          value: "meta-llama/Llama-2-70b-chat-hf"
        - name: VLLM_TENSOR_PARALLEL_SIZE
          value: "4"
Custom Volumes with Pod Overrides¶
spec:
  server:
    podOverrides:
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
      volumeMounts:
        - name: model-cache
          mountPath: /models
    containerSpec:
      env:
        - name: INFERENCE_MODEL
          value: "/models/custom-llama-model"
Scaling with Multiple Replicas¶
spec:
  replicas: 3
  server:
    distribution:
      name: "vllm-gpu"
    containerSpec:
      resources:
        requests:
          nvidia.com/gpu: "1"
          memory: "16Gi"
        limits:
          nvidia.com/gpu: "1"
          memory: "32Gi"
Using vLLM with the Kubernetes Operator¶
The LlamaStack Kubernetes operator supports vLLM in two ways:
1. Pre-Built Distributions (Recommended)¶
Use pre-built, maintained distributions with the distribution.name field:
vllm-gpu Distribution¶
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: vllm-gpu-distribution
  namespace: default
spec:
  replicas: 1
  server:
    distribution:
      name: "vllm-gpu"  # Supported distribution
    containerSpec:
      resources:
        requests:
          nvidia.com/gpu: "1"
          memory: "16Gi"
          cpu: "4"
        limits:
          nvidia.com/gpu: "1"
          memory: "32Gi"
          cpu: "8"
      env:
        - name: INFERENCE_MODEL
          value: "meta-llama/Llama-2-7b-chat-hf"
        - name: VLLM_GPU_MEMORY_UTILIZATION
          value: "0.9"
    storage:
      size: "50Gi"
remote-vllm Distribution¶
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: remote-vllm-distribution
  namespace: default
spec:
  replicas: 1
  server:
    distribution:
      name: "remote-vllm"  # Supported distribution
    containerSpec:
      resources:
        requests:
          memory: "4Gi"
          cpu: "2"
        limits:
          memory: "8Gi"
          cpu: "4"
      env:
        - name: INFERENCE_MODEL
          value: "meta-llama/Llama-2-7b-chat-hf"
        - name: VLLM_URL
          value: "http://external-vllm-service:8000"
2. Bring Your Own (BYO) Custom Images¶
Use custom-built distributions with the distribution.image field:
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: custom-vllm-distribution
  namespace: default
spec:
  replicas: 1
  server:
    distribution:
      image: "my-registry.com/custom-vllm:v1.0.0"  # Custom image
    containerSpec:
      resources:
        requests:
          nvidia.com/gpu: "2"
          memory: "32Gi"
          cpu: "8"
        limits:
          nvidia.com/gpu: "2"
          memory: "64Gi"
          cpu: "16"
      env:
        - name: INFERENCE_MODEL
          value: "my-custom-model"
        - name: CUSTOM_VLLM_SETTING
          value: "optimized"
    storage:
      size: "100Gi"
Building Custom vLLM Distributions¶
Step 1: Build with LlamaStack CLI¶
Option A: From Template¶
# Install LlamaStack CLI
pip install llama-stack
# Build from vLLM template
llama stack build --template vllm-gpu --image-type container --image-name my-vllm-dist
Option B: Custom Configuration¶
Create custom-vllm-build.yaml:
name: custom-vllm
distribution_spec:
  description: Custom vLLM distribution with optimizations
  providers:
    inference: inline::vllm
    memory: inline::faiss
    safety: inline::llama-guard
    agents: inline::meta-reference
    telemetry: inline::meta-reference
image_name: custom-vllm
image_type: container
Build the distribution:
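One possible invocation, assuming the configuration above is saved as custom-vllm-build.yaml (check llama stack build --help for the exact flags in your CLI version):
# Build a container image from the custom configuration
llama stack build --config custom-vllm-build.yaml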
Step 2: Enhance with Custom Dockerfile¶
Create Dockerfile.enhanced:
FROM distribution-custom-vllm:dev
# Install additional dependencies
RUN pip install \
    flash-attn \
    custom-optimization-lib \
    monitoring-tools
# Add custom configurations
COPY vllm-config.json /app/config.json
COPY custom-models/ /app/models/
# Set optimization environment variables
ENV VLLM_USE_FLASH_ATTN=1
ENV VLLM_OPTIMIZATION_LEVEL=high
ENV CUSTOM_GPU_SETTINGS=enabled
# Add health check script
COPY health-check.sh /app/health-check.sh
RUN chmod +x /app/health-check.sh
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
  CMD /app/health-check.sh
EXPOSE 8321
Build the enhanced image:
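For example, tagging the image with the name used in the next step:
# Build the enhanced image from the custom Dockerfile
docker build -f Dockerfile.enhanced -t my-registry.com/enhanced-vllm:v1.0.0 .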
Step 3: Push to Registry¶
# Tag for your registry
docker tag my-registry.com/enhanced-vllm:v1.0.0 my-registry.com/enhanced-vllm:latest
# Push to registry
docker push my-registry.com/enhanced-vllm:v1.0.0
docker push my-registry.com/enhanced-vllm:latest
Step 4: Deploy with Operator¶
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: enhanced-vllm-dist
  namespace: production
spec:
  replicas: 2
  server:
    distribution:
      image: "my-registry.com/enhanced-vllm:v1.0.0"
    containerSpec:
      resources:
        requests:
          nvidia.com/gpu: "2"
          memory: "32Gi"
          cpu: "8"
        limits:
          nvidia.com/gpu: "2"
          memory: "64Gi"
          cpu: "16"
      env:
        - name: INFERENCE_MODEL
          value: "meta-llama/Llama-2-13b-chat-hf"
        - name: VLLM_TENSOR_PARALLEL_SIZE
          value: "2"
        - name: VLLM_GPU_MEMORY_UTILIZATION
          value: "0.85"
        - name: CUSTOM_OPTIMIZATION
          value: "enabled"
    storage:
      size: "200Gi"
    podOverrides:
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: shared-model-cache
      volumeMounts:
        - name: model-cache
          mountPath: /shared-models
Comparison: Pre-Built vs BYO¶
| Aspect | Pre-Built Distributions | BYO Custom Images | 
|---|---|---|
| Setup Complexity | Simple - just specify name | Complex - build and maintain images | 
| Maintenance | Maintained by LlamaStack team | You maintain the images | 
| Customization | Limited to environment variables | Full control over dependencies and configuration | 
| Security | Vetted by maintainers | You control security scanning and updates | 
| Performance | Standard optimizations | Custom optimizations possible | 
| Support | Community and official support | Self-supported | 
| Updates | Automatic with operator updates | Manual image rebuilds required | 
When to Use Pre-Built Distributions¶
- Quick deployment and standard use cases
- Production environments where stability is key
- Limited customization requirements
- Teams without container expertise
When to Use BYO Custom Images¶
- Specialized models or inference engines
- Custom optimizations for specific hardware
- Additional dependencies not in standard images
- Compliance requirements for image provenance
- Integration with existing infrastructure
Monitoring and Troubleshooting¶
Health Checks¶
The vLLM distribution includes built-in health checks:
# Check pod status
kubectl get pods -l app=llama-stack
# View logs
kubectl logs -l app=llama-stack
# Check service endpoints
kubectl get svc my-vllm-distribution-service
Performance Monitoring¶
# Monitor GPU utilization
kubectl exec -it <vllm-pod> -- nvidia-smi
# Check memory usage
kubectl top pods -l app=llama-stack
Common Issues¶
- GPU Not Available
  - Ensure the NVIDIA device plugin is installed
  - Verify GPU resources in node capacity (see the commands after this list)
- Out of Memory
  - Reduce VLLM_GPU_MEMORY_UTILIZATION
  - Increase memory limits
  - Use smaller models
- Model Loading Failures
  - Check model path and permissions
  - Verify sufficient storage space
  - Check environment variable values
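To confirm GPUs are schedulable, two quick checks (the device plugin label below assumes the standard NVIDIA DaemonSet; adjust to your installation):
# Confirm the NVIDIA device plugin pods are running
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds
# Check the GPU capacity advertised by a node
kubectl describe node <node-name> | grep -i "nvidia.com/gpu"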
Best Practices¶
Resource Planning¶
- GPU Memory: Ensure sufficient VRAM for the model plus batch processing (see the sizing note after this list)
- CPU: Allocate adequate CPU for preprocessing and coordination
- Storage: Use fast storage (NVMe SSD) for model loading
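As a rough rule of thumb for the VRAM guideline above: fp16 weights take about 2 bytes per parameter, so a 7B-parameter model needs roughly 14 GB for weights alone, before the KV cache and batching overhead that VLLM_GPU_MEMORY_UTILIZATION reserves headroom for. Larger models such as the 70B example need tensor parallelism across multiple GPUs.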
Environment Variable Guidelines¶
- Use INFERENCE_MODEL to specify the model to load
- Set VLLM_GPU_MEMORY_UTILIZATION to control GPU memory usage (0.8-0.9 recommended)
- Configure VLLM_MAX_SEQ_LEN based on your use case requirements
- Use VLLM_TENSOR_PARALLEL_SIZE for multi-GPU setups
Security¶
- Use private registries for custom images (see the example after this list)
- Implement proper RBAC for distribution management
- Secure model storage with appropriate access controls
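A minimal sketch of wiring a private registry: create a pull secret and attach it to the service account the distribution pods run under. The secret name regcred is hypothetical, and the default service account is assumed; adjust if your operator creates a dedicated one:
# Create a registry pull secret in the target namespace
kubectl create secret docker-registry regcred \
  --docker-server=my-registry.com \
  --docker-username=<username> \
  --docker-password=<password> \
  -n production
# Attach it to the service account used by the distribution pods
kubectl patch serviceaccount default -n production \
  -p '{"imagePullSecrets": [{"name": "regcred"}]}'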
Examples¶
Production Setup¶
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: production-vllm
  namespace: llama-production
spec:
  replicas: 2
  server:
    distribution:
      name: "vllm-gpu"
    containerSpec:
      resources:
        requests:
          nvidia.com/gpu: "2"
          memory: "32Gi"
          cpu: "8"
        limits:
          nvidia.com/gpu: "2"
          memory: "64Gi"
          cpu: "16"
      env:
        - name: INFERENCE_MODEL
          value: "meta-llama/Llama-2-13b-chat-hf"
        - name: VLLM_TENSOR_PARALLEL_SIZE
          value: "2"
        - name: VLLM_GPU_MEMORY_UTILIZATION
          value: "0.85"
    storage:
      size: "100Gi"
Development Setup¶
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: dev-vllm
  namespace: development
spec:
  replicas: 1
  server:
    distribution:
      name: "vllm-gpu"
    containerSpec:
      resources:
        requests:
          nvidia.com/gpu: "1"
          memory: "8Gi"
          cpu: "2"
        limits:
          nvidia.com/gpu: "1"
          memory: "16Gi"
          cpu: "4"
      env:
        - name: INFERENCE_MODEL
          value: "microsoft/DialoGPT-small"
    storage:
      size: "20Gi"
API Reference¶
For complete API documentation, see:
- API Reference
- Configuration Reference