# Scaling

Learn how to scale your LlamaStack distributions for optimal performance and cost efficiency.
## Scaling Overview

LlamaStack supports both horizontal and vertical scaling:

- **Horizontal Scaling**: Add more replicas
- **Vertical Scaling**: Increase resources per replica
- **Auto Scaling**: Automatic scaling based on metrics
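Before changing anything, it can help to check the current state of a distribution; a quick sketch, assuming one named `my-llamastack`:

```shell
# Show the current desired replica count of the distribution
kubectl get llamastackdistribution my-llamastack \
  -o jsonpath='{.spec.replicas}'

# List all distributions in the namespace
kubectl get llamastackdistribution
```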
## Horizontal Scaling

### Manual Scaling

Scale replicas manually:

```bash
# Scale to 3 replicas (custom resources require a merge patch)
kubectl patch llamastackdistribution my-llamastack \
  --type=merge -p '{"spec":{"replicas":3}}'

# Or edit the resource directly
kubectl edit llamastackdistribution my-llamastack
```
### Declarative Scaling

Update your YAML configuration:

```yaml
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: scaled-llamastack
spec:
  image: llamastack/llamastack:latest
  replicas: 5  # Scale to 5 replicas
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
```
## Vertical Scaling

### Resource Adjustment

Increase CPU and memory:

```yaml
spec:
  resources:
    requests:
      cpu: "2"       # Increased from 1
      memory: "4Gi"  # Increased from 2Gi
    limits:
      cpu: "4"       # Increased from 2
      memory: "8Gi"  # Increased from 4Gi
```
### GPU Scaling

Add GPU resources:
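The original example appears to be missing here; a minimal sketch, assuming the NVIDIA device plugin is installed on the cluster (GPUs are requested under `limits` as `nvidia.com/gpu` and cannot be overcommitted):

```yaml
spec:
  resources:
    requests:
      cpu: "2"
      memory: "4Gi"
    limits:
      # One GPU per replica; for extended resources like GPUs,
      # requests and limits must be equal, so a limit alone suffices.
      nvidia.com/gpu: "1"
      cpu: "4"
      memory: "8Gi"
```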
## Auto Scaling

### Horizontal Pod Autoscaler (HPA)

Create an HPA for automatic scaling (the target resource must expose the `scale` subresource for the HPA to act on it):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llamastack-hpa
spec:
  scaleTargetRef:
    apiVersion: llamastack.io/v1alpha1
    kind: LlamaStackDistribution
    name: my-llamastack
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```
### Vertical Pod Autoscaler (VPA)

Enable automatic resource adjustment:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: llamastack-vpa
spec:
  targetRef:
    apiVersion: llamastack.io/v1alpha1
    kind: LlamaStackDistribution
    name: my-llamastack
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: llamastack
        maxAllowed:
          cpu: "4"
          memory: "8Gi"
        minAllowed:
          cpu: "100m"
          memory: "128Mi"
```
## Performance Considerations

### Load Balancing

Configure load balancing for multiple replicas:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: llamastack-service
spec:
  selector:
    app: my-llamastack
  ports:
    - port: 8080
      targetPort: 8080
  type: LoadBalancer
  sessionAffinity: None  # Round-robin
```
### Resource Requests vs Limits

Best practices for resource configuration:

```yaml
spec:
  resources:
    requests:
      cpu: "1"       # Guaranteed resources
      memory: "2Gi"
    limits:
      cpu: "2"       # Maximum allowed (2x requests)
      memory: "4Gi"  # Maximum allowed (2x requests)
```
## Monitoring Scaling

### Scaling Metrics

Monitor key scaling metrics:

```bash
# Check HPA status
kubectl get hpa

# Check resource usage
kubectl top pods -l app=my-llamastack

# Check scaling events
kubectl describe hpa llamastack-hpa
```
### Custom Metrics

Scale based on custom metrics (this requires a metrics adapter, such as prometheus-adapter, to expose them through the custom metrics API):

```yaml
metrics:
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
```
## Scaling Strategies

### Blue-Green Scaling

Deploy a new version alongside the old one:

```yaml
# Blue deployment (current)
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: llamastack-blue
spec:
  image: llamastack/llamastack:v1.0
  replicas: 3
---
# Green deployment (new)
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: llamastack-green
spec:
  image: llamastack/llamastack:v1.1
  replicas: 3
```
### Canary Scaling

Gradual rollout with traffic splitting. When both deployments sit behind the same Service, traffic splits roughly in proportion to replica count:

```yaml
# Main deployment (90% of traffic)
spec:
  replicas: 9
  version: "stable"
---
# Canary deployment (10% of traffic)
spec:
  replicas: 1
  version: "canary"
```
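A sketch of the shared Service that produces this split, assuming both deployments label their pods with a common `app: my-llamastack` key (the label names here are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: llamastack-canary-split
spec:
  selector:
    app: my-llamastack  # matches pods from both the stable and canary deployments
  ports:
    - port: 8080
      targetPort: 8080
```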
## Cost Optimization

### Spot Instances

Use spot instances for cost savings:

```yaml
spec:
  nodeSelector:
    node-type: "spot"
  tolerations:
    - key: "spot"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
```
### Scheduled Scaling

Scale down during off-hours:

```yaml
# CronJob for scaling down
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-down-llamastack
spec:
  schedule: "0 18 * * *"  # 6 PM daily
  jobTemplate:
    spec:
      template:
        spec:
          # The pod needs a ServiceAccount with RBAC permission to
          # patch LlamaStackDistribution resources.
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl
              command:
                - kubectl
                - patch
                - llamastackdistribution
                - my-llamastack
                - --type=merge
                - -p
                - '{"spec":{"replicas":1}}'
```
## Troubleshooting Scaling

### Common Issues

**Pods Not Scaling:**

```bash
# Check HPA conditions
kubectl describe hpa llamastack-hpa

# Check resource metrics
kubectl top nodes
kubectl top pods
```
**Resource Constraints:**
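The original example appears to be missing here; a minimal sketch of how one might check for replicas stuck Pending on insufficient node resources (the `app=my-llamastack` label is illustrative):

```shell
# Find pods stuck in Pending, often a sign of insufficient CPU/memory on nodes
kubectl get pods -l app=my-llamastack --field-selector=status.phase=Pending

# Inspect scheduling events for the failure reason (e.g. "Insufficient cpu")
kubectl describe pods -l app=my-llamastack

# Compare node allocatable capacity against what is already committed
kubectl describe nodes | grep -A 5 "Allocated resources"
```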
**Scaling Too Aggressive:**

```bash
# Adjust HPA behavior: add a stabilization window to slow scale-up
kubectl patch hpa llamastack-hpa \
  -p '{"spec":{"behavior":{"scaleUp":{"stabilizationWindowSeconds":300}}}}'
```