Monitoring¶
Set up comprehensive monitoring for your LlamaStack distributions.
Monitoring Overview¶
Monitor your LlamaStack deployments with:
- Metrics: Performance and resource usage
- Logs: Application and system logs
- Alerts: Proactive issue detection
- Dashboards: Visual monitoring
Metrics Collection¶
Prometheus Setup¶
Deploy Prometheus for metrics collection:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
    - job_name: 'llamastack'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: llamastack
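If you are not using the Prometheus Operator, a minimal Deployment along the lines of the sketch below can run Prometheus with this ConfigMap. The image tag and the prometheus ServiceAccount (which needs RBAC permission to list and watch pods for kubernetes_sd_configs) are assumptions for illustration:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus  # assumed ServiceAccount with get/list/watch on pods
      containers:
      - name: prometheus
        image: prom/prometheus:v2.53.0  # assumed tag; pin your own
        args:
        - --config.file=/etc/prometheus/prometheus.yml
        ports:
        - containerPort: 9090
        volumeMounts:
        - name: config
          mountPath: /etc/prometheus
      volumes:
      - name: config
        configMap:
          name: prometheus-config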
ServiceMonitor¶
If you run the Prometheus Operator, create a ServiceMonitor so LlamaStack endpoints are discovered automatically:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: llamastack-monitor
spec:
  selector:
    matchLabels:
      app: llamastack
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
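The ServiceMonitor matches Services, not pods, so the LlamaStack Service must carry the app: llamastack label and expose a port named metrics. A sketch, assuming the container serves metrics on port 8080 (the same port used by the probes later on this page):
apiVersion: v1
kind: Service
metadata:
  name: llamastack
  labels:
    app: llamastack      # matched by spec.selector.matchLabels above
spec:
  selector:
    app: llamastack
  ports:
  - name: metrics        # must match the endpoint port name in the ServiceMonitor
    port: 8080           # assumed application/metrics port; adjust to your distribution
    targetPort: 8080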
Key Metrics¶
Application Metrics¶
Monitor LlamaStack-specific metrics:
# Custom metrics exposed by LlamaStack
llamastack_requests_total
llamastack_request_duration_seconds
llamastack_active_connections
llamastack_model_load_time_seconds
llamastack_inference_latency_seconds
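If the same PromQL keeps reappearing in dashboards and alerts, recording rules can precompute it. A minimal sketch using the metric names above; the rule names are illustrative:
groups:
- name: llamastack.recording
  rules:
  - record: llamastack:requests:rate5m
    expr: sum(rate(llamastack_requests_total[5m])) by (instance)
  - record: llamastack:request_duration_seconds:p95_5m
    expr: histogram_quantile(0.95, sum(rate(llamastack_request_duration_seconds_bucket[5m])) by (le, instance))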
Resource Metrics¶
Track resource usage:
# CPU and Memory
container_cpu_usage_seconds_total
container_memory_usage_bytes
container_memory_working_set_bytes
# Network
container_network_receive_bytes_total
container_network_transmit_bytes_total
# Storage
kubelet_volume_stats_used_bytes
kubelet_volume_stats_capacity_bytes
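The container_* series come from cAdvisor on each kubelet and the kubelet_volume_stats_* series from the kubelet's own /metrics endpoint, so the pod scrape job above does not collect them. Bundled stacks such as kube-prometheus-stack scrape them by default; otherwise an extra scrape job along these lines is needed (a similar job against /metrics covers the volume stats). The in-cluster TLS shortcut shown is for illustration, not a hardened setup:
scrape_configs:
- job_name: 'kubelet-cadvisor'
  scheme: https
  metrics_path: /metrics/cadvisor
  kubernetes_sd_configs:
  - role: node
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true  # convenience only; prefer verified kubelet certificates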
Logging¶
Centralized Logging¶
Set up log aggregation with Fluentd:
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
data:
  fluent.conf: |
    <source>
      @type tail
      path /var/log/containers/*llamastack*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kubernetes.*
      <parse>
        @type json
      </parse>
    </source>
    <match kubernetes.**>
      @type elasticsearch
      host elasticsearch.logging.svc.cluster.local
      port 9200
      index_name llamastack-logs
    </match>
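Fluentd has to run on every node to tail container logs, so it is usually deployed as a DaemonSet that mounts both the host log directory and the ConfigMap above. A minimal sketch; the image tag is an assumption, and note that the JSON parser above matches Docker-style logs, while containerd/CRI nodes need the CRI parser instead:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1.16-debian-elasticsearch8-1  # assumed tag; pin your own
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: config
          mountPath: /fluentd/etc/fluent.conf
          subPath: fluent.conf
      volumes:
      - name: varlog
        hostPath:
          path: /var/log  # on Docker-based nodes also mount /var/lib/docker/containers
      - name: config
        configMap:
          name: fluentd-config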
Log Levels¶
Configure log level and format through environment variables on the LlamaStack container:
spec:
  containers:
  - name: llamastack
    env:
    - name: LOG_LEVEL
      value: "info"  # debug, info, warn, or error
    - name: LOG_FORMAT
      value: "json"  # json or text
Dashboards¶
Grafana Dashboard¶
Create a comprehensive dashboard:
{
  "dashboard": {
    "title": "LlamaStack Monitoring",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(llamastack_requests_total[5m])",
            "legendFormat": "{{instance}}"
          }
        ]
      },
      {
        "title": "Response Time",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(llamastack_request_duration_seconds_bucket[5m]))",
            "legendFormat": "95th percentile"
          }
        ]
      },
      {
        "title": "Resource Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(container_cpu_usage_seconds_total[5m])",
            "legendFormat": "CPU"
          },
          {
            "expr": "container_memory_usage_bytes",
            "legendFormat": "Memory"
          }
        ]
      }
    ]
  }
}
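How this JSON reaches Grafana depends on your installation. If Grafana runs with the dashboard sidecar (as in the kube-prometheus-stack Helm chart), storing the JSON in a labelled ConfigMap is enough; the label key below is the sidecar's default and may be customised in your setup:
apiVersion: v1
kind: ConfigMap
metadata:
  name: llamastack-dashboard
  labels:
    grafana_dashboard: "1"  # default sidecar discovery label
data:
  llamastack.json: |
    { ...the dashboard JSON above... }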
Alerting¶
Prometheus Alerts¶
Define critical alerts:
groups:
- name: llamastack.rules
  rules:
  - alert: LlamaStackDown
    expr: up{job="llamastack"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "LlamaStack instance is down"
      description: "LlamaStack instance {{ $labels.instance }} has been down for more than 1 minute."
  - alert: HighErrorRate
    expr: rate(llamastack_requests_total{status=~"5.."}[5m]) > 0.1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High error rate detected"
      description: "Error rate is {{ $value }} errors per second."
  - alert: HighLatency
    expr: histogram_quantile(0.95, rate(llamastack_request_duration_seconds_bucket[5m])) > 2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High latency detected"
      description: "95th percentile latency is {{ $value }} seconds."
  - alert: HighMemoryUsage
    expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage"
      description: "Memory usage is above 90%."
AlertManager Configuration¶
Configure alert routing:
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alerts@example.com'
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'email-admin'
receivers:
- name: 'email-admin'
  email_configs:
  - to: 'admin@example.com'
    subject: 'LlamaStack Alert: {{ .GroupLabels.alertname }}'
    body: |
      {{ range .Alerts }}
      Alert: {{ .Annotations.summary }}
      Description: {{ .Annotations.description }}
      {{ end }}
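Because the alerts above carry severity labels, it is usually worth routing on them so critical alerts page someone while warnings stay on email. A sketch; the receiver names are placeholders for receivers you define yourself:
route:
  receiver: 'email-admin'     # default receiver
  group_by: ['alertname']
  routes:
  - matchers:
    - severity="critical"
    receiver: 'oncall-pager'  # placeholder, e.g. a PagerDuty or webhook receiver
  - matchers:
    - severity="warning"
    receiver: 'email-admin'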
Health Checks¶
Liveness Probe¶
Configure liveness probes:
spec:
  containers:
  - name: llamastack
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3
Readiness Probe¶
Configure readiness probes:
spec:
  containers:
  - name: llamastack
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
      timeoutSeconds: 3
      failureThreshold: 3
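Model loading can take far longer than the 30-second liveness delay above. Rather than inflating initialDelaySeconds, a startup probe holds off the liveness and readiness checks until the server first answers; the path and port below mirror the liveness probe and are assumptions about your image:
spec:
  containers:
  - name: llamastack
    startupProbe:
      httpGet:
        path: /health
        port: 8080
      periodSeconds: 10
      failureThreshold: 60  # allow up to ~10 minutes for model loading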
Performance Monitoring¶
Custom Metrics¶
Expose custom application metrics; adding a status label to the request counter gives the error-rate alert above something to match:
# Example Python code for custom metrics
import time
from prometheus_client import Counter, Histogram, Gauge, start_http_server
REQUEST_COUNT = Counter('llamastack_requests_total', 'Total requests', ['method', 'endpoint', 'status'])
REQUEST_LATENCY = Histogram('llamastack_request_duration_seconds', 'Request latency')
ACTIVE_CONNECTIONS = Gauge('llamastack_active_connections', 'Active connections')
# Serve the metrics for Prometheus to scrape (skip if your web framework already exposes /metrics)
start_http_server(8000)
# In your request-handling code
start = time.time()
ACTIVE_CONNECTIONS.inc()   # connection opened
# ... handle the request ...
REQUEST_COUNT.labels(method='POST', endpoint='/inference', status='200').inc()
REQUEST_LATENCY.observe(time.time() - start)
ACTIVE_CONNECTIONS.dec()   # connection closed
Distributed Tracing¶
Set up distributed tracing with Jaeger:
apiVersion: v1
kind: ConfigMap
metadata:
  name: jaeger-config
data:
  config.yaml: |
    jaeger:
      endpoint: "http://jaeger-collector:14268/api/traces"
      service_name: "llamastack"
      sampler:
        type: "probabilistic"
        param: 0.1
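How these settings reach the application depends on the tracing client. If the server is instrumented with OpenTelemetry, the standard SDK environment variables can carry the same endpoint and sampling settings; the collector address below (Jaeger's OTLP/HTTP port) is an assumption consistent with the config above:
spec:
  containers:
  - name: llamastack
    env:
    - name: OTEL_SERVICE_NAME
      value: "llamastack"
    - name: OTEL_EXPORTER_OTLP_ENDPOINT
      value: "http://jaeger-collector:4318"  # assumed OTLP/HTTP port on the Jaeger collector
    - name: OTEL_TRACES_SAMPLER
      value: "traceidratio"
    - name: OTEL_TRACES_SAMPLER_ARG
      value: "0.1"  # sample 10% of traces, matching the probabilistic sampler above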
Monitoring Best Practices¶
Resource Monitoring¶
Monitor these key resources:
# CPU usage
kubectl top pods -l app=llamastack
# Memory usage
kubectl top pods -l app=llamastack --containers
# Storage usage
kubectl exec -it <pod> -- df -h
# Network usage
kubectl exec -it <pod> -- netstat -i
Log Analysis¶
Analyze logs for issues:
# Check error logs
kubectl logs -l app=llamastack | grep ERROR
# Check recent logs
kubectl logs -l app=llamastack --since=1h
# Follow logs in real-time
kubectl logs -f -l app=llamastack
Troubleshooting Monitoring¶
Common Issues¶
Metrics Not Appearing:
# Check ServiceMonitor
kubectl get servicemonitor
# Check Prometheus targets
kubectl port-forward svc/prometheus 9090:9090
# Visit http://localhost:9090/targets
High Resource Usage:
# Check resource limits
kubectl describe pod <pod-name>
# Check node resources
kubectl describe node <node-name>
Alert Fatigue:
# Review alert thresholds
kubectl get prometheusrule
# Check alert history
kubectl logs -l app=alertmanager