# Text Generation Inference (TGI) Distribution
> **Distribution Availability**
>
> The TGI distribution container image may not be currently maintained or available. Verify that the image exists at `docker.io/llamastack/distribution-tgi:latest` before using this distribution. For production use, consider the `ollama` or `vllm` distributions, which are actively maintained.
The TGI distribution integrates with Hugging Face's Text Generation Inference (TGI) server, providing high-performance inference for large language models with optimized serving capabilities.
## Overview
Text Generation Inference (TGI) is Hugging Face's solution for deploying and serving Large Language Models. The TGI distribution:
- Connects to TGI servers for optimized model inference
- Supports streaming responses for real-time applications
- Provides high throughput with batching and optimization
- Works with Hugging Face models from the Hub
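
For reference, TGI itself exposes a plain HTTP generation API, including a streaming endpoint. A minimal sketch of a streaming request against the TGI service used in the examples below (the service URL is an example, and parameters may vary by TGI version):

```bash
# Stream generated tokens as server-sent events (-N disables curl's output buffering)
curl -N http://tgi-server-service:80/generate_stream \
  -H "Content-Type: application/json" \
  -d '{"inputs": "What is Kubernetes?", "parameters": {"max_new_tokens": 64}}'
```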
## Distribution Details
| Property | Value |
|---|---|
| Distribution Name | `tgi` |
| Image | `docker.io/llamastack/distribution-tgi:latest` |
| Use Case | Hugging Face TGI server integration |
| Requirements | TGI server endpoint |
| Recommended For | High-performance inference, Hugging Face ecosystem |
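
As noted in the availability warning above, it is worth confirming that the image tag actually exists before deploying. A minimal check, assuming a recent Docker CLI is available locally:

```bash
# Exits non-zero if the tag is missing from the registry
docker manifest inspect docker.io/llamastack/distribution-tgi:latest > /dev/null && echo "image available"
```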
## Prerequisites

### 1. TGI Server Setup

You need a running TGI server. You can deploy one using:

#### Option A: Deploy TGI in Kubernetes
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi-server
  template:
    metadata:
      labels:
        app: tgi-server
    spec:
      containers:
        - name: tgi
          image: ghcr.io/huggingface/text-generation-inference:latest
          ports:
            - containerPort: 80
          env:
            - name: MODEL_ID
              value: "microsoft/DialoGPT-medium"
            - name: PORT
              value: "80"
          resources:
            requests:
              memory: "4Gi"
              cpu: "2"
            limits:
              memory: "8Gi"
              cpu: "4"
---
apiVersion: v1
kind: Service
metadata:
  name: tgi-server-service
spec:
  selector:
    app: tgi-server
  ports:
    - port: 80
      targetPort: 80
```
#### Option B: External TGI Server

Use an existing TGI deployment (cloud or on-premises).
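
Whichever option you choose, it helps to confirm the endpoint responds before wiring it into the distribution. A quick check (replace the URL with your TGI endpoint):

```bash
# TGI returns 200 from /health once the model is loaded
curl -f http://tgi-server-service:80/health
```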
## Quick Start

### 1. Create TGI Distribution
```yaml
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: my-tgi-llamastack
  namespace: default
spec:
  replicas: 1
  server:
    distribution:
      name: "tgi"
    containerSpec:
      port: 8321
      resources:
        requests:
          memory: "1Gi"
          cpu: "500m"
        limits:
          memory: "2Gi"
          cpu: "1"
      env:
        - name: TGI_URL
          value: "http://tgi-server-service:80"
        - name: MODEL_ID
          value: "microsoft/DialoGPT-medium"
    storage:
      size: "10Gi"
```
### 2. Deploy the Distribution
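
Save the manifest above to a file and apply it with `kubectl` (the filename here is just an example):

```bash
kubectl apply -f my-tgi-llamastack.yaml
```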
### 3. Verify Deployment
```bash
# Check the distribution status
kubectl get llamastackdistribution my-tgi-llamastack

# Check the pods
kubectl get pods -l app=llama-stack

# Check the logs for TGI connection errors
kubectl logs -l app=llama-stack
```
## Configuration Options

### Environment Variables

Configure TGI connection and behavior:
```yaml
env:
  - name: TGI_URL
    value: "http://tgi-server-service:80"
  - name: MODEL_ID
    value: "microsoft/DialoGPT-medium"
  - name: TGI_TIMEOUT
    value: "30"          # Request timeout in seconds
  - name: TGI_MAX_TOKENS
    value: "512"
  - name: TGI_TEMPERATURE
    value: "0.7"
  - name: TGI_TOP_P
    value: "0.9"
  - name: LOG_LEVEL
    value: "INFO"
```
### TGI Server Configuration

Common TGI server models and configurations:

#### Small Models (Development)
```yaml
env:
  - name: TGI_URL
    value: "http://tgi-server:80"
  - name: MODEL_ID
    value: "microsoft/DialoGPT-small"   # ~117M parameters
  - name: TGI_MAX_TOKENS
    value: "256"
```
#### Medium Models (Production)
```yaml
env:
  - name: TGI_URL
    value: "http://tgi-server:80"
  - name: MODEL_ID
    value: "microsoft/DialoGPT-medium"  # ~345M parameters
  - name: TGI_MAX_TOKENS
    value: "512"
```
#### Large Models (High Performance)
```yaml
env:
  - name: TGI_URL
    value: "http://tgi-server:80"
  - name: MODEL_ID
    value: "microsoft/DialoGPT-large"   # ~762M parameters
  - name: TGI_MAX_TOKENS
    value: "1024"
```
### Resource Requirements
#### Lightweight Setup
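
A sketch of resource values for a lightweight setup, in line with the development use case later in this document (adjust to your model and workload):

```yaml
resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "1Gi"
    cpu: "500m"
```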
#### Standard Setup
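
A sketch of a standard setup, matching the values used in the Quick Start example above:

```yaml
resources:
  requests:
    memory: "1Gi"
    cpu: "500m"
  limits:
    memory: "2Gi"
    cpu: "1"
```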
#### High-Performance Setup
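
A sketch of a high-performance setup, mirroring the high-throughput use case later in this document:

```yaml
resources:
  requests:
    memory: "4Gi"
    cpu: "2"
  limits:
    memory: "8Gi"
    cpu: "4"
```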
## Advanced Configuration

### Multiple TGI Servers

Connect to multiple TGI servers for load balancing:
```yaml
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: multi-tgi-llamastack
spec:
  replicas: 2
  server:
    distribution:
      name: "tgi"
    containerSpec:
      env:
        - name: TGI_URLS
          value: "http://tgi-server-1:80,http://tgi-server-2:80"
        - name: TGI_LOAD_BALANCE
          value: "round_robin"   # round_robin, random, least_connections
```
### TGI with GPU Support

For GPU-accelerated TGI servers:
```yaml
# TGI Server with GPU
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-gpu-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi-gpu-server
  template:
    metadata:
      labels:
        app: tgi-gpu-server
    spec:
      containers:
        - name: tgi
          image: ghcr.io/huggingface/text-generation-inference:latest
          env:
            - name: MODEL_ID
              value: "meta-llama/Llama-2-7b-chat-hf"
            - name: CUDA_VISIBLE_DEVICES
              value: "0"
          resources:
            requests:
              nvidia.com/gpu: "1"
              memory: "16Gi"
            limits:
              nvidia.com/gpu: "1"
              memory: "32Gi"
```
### Custom TGI Configuration

Use ConfigMaps for complex TGI configurations:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tgi-config
data:
  tgi-settings.json: |
    {
      "max_concurrent_requests": 128,
      "max_best_of": 2,
      "max_stop_sequences": 4,
      "max_input_length": 1024,
      "max_total_tokens": 2048,
      "waiting_served_ratio": 1.2,
      "max_batch_prefill_tokens": 4096,
      "max_batch_total_tokens": 8192
    }
---
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: custom-tgi-llamastack
spec:
  server:
    distribution:
      name: "tgi"
    containerSpec:
      env:
        - name: TGI_CONFIG_FILE
          value: "/config/tgi-settings.json"
    podOverrides:
      volumes:
        - name: tgi-config
          configMap:
            name: tgi-config
      volumeMounts:
        - name: tgi-config
          mountPath: /config
```
## Use Cases

### 1. Development Environment
```yaml
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: dev-tgi
  namespace: development
spec:
  replicas: 1
  server:
    distribution:
      name: "tgi"
    containerSpec:
      resources:
        requests:
          memory: "512Mi"
          cpu: "250m"
        limits:
          memory: "1Gi"
          cpu: "500m"
      env:
        - name: TGI_URL
          value: "http://tgi-dev-server:80"
        - name: MODEL_ID
          value: "microsoft/DialoGPT-small"
        - name: LOG_LEVEL
          value: "DEBUG"
    storage:
      size: "5Gi"
```
### 2. Production Deployment
```yaml
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: production-tgi
  namespace: production
spec:
  replicas: 3
  server:
    distribution:
      name: "tgi"
    containerSpec:
      resources:
        requests:
          memory: "2Gi"
          cpu: "1"
        limits:
          memory: "4Gi"
          cpu: "2"
      env:
        - name: TGI_URL
          value: "http://tgi-prod-server:80"
        - name: MODEL_ID
          value: "meta-llama/Llama-2-7b-chat-hf"
        - name: TGI_MAX_TOKENS
          value: "1024"
        - name: TGI_TEMPERATURE
          value: "0.7"
        - name: ENABLE_TELEMETRY
          value: "true"
    storage:
      size: "50Gi"
```
### 3. High-Throughput Setup
```yaml
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: high-throughput-tgi
  namespace: production
spec:
  replicas: 5
  server:
    distribution:
      name: "tgi"
    containerSpec:
      resources:
        requests:
          memory: "4Gi"
          cpu: "2"
        limits:
          memory: "8Gi"
          cpu: "4"
      env:
        - name: TGI_URLS
          value: "http://tgi-1:80,http://tgi-2:80,http://tgi-3:80"
        - name: TGI_LOAD_BALANCE
          value: "least_connections"
        - name: TGI_TIMEOUT
          value: "60"
        - name: TGI_MAX_CONCURRENT_REQUESTS
          value: "256"
```
## Monitoring and Troubleshooting

### Health Checks
```bash
# Check distribution status
kubectl get llamastackdistribution

# Check TGI connectivity
kubectl logs -l app=llama-stack | grep -i tgi

# Test TGI server directly
kubectl exec -it <pod-name> -- curl http://tgi-server:80/health
```
### Performance Monitoring
```bash
# Monitor resource usage
kubectl top pods -l app=llama-stack

# Check TGI server metrics
kubectl exec -it <tgi-pod> -- curl http://localhost:80/metrics

# Monitor request latency
kubectl logs -l app=llama-stack | grep -i "response_time"
```
### Common Issues

- **TGI Server Unreachable**
  - Verify the `TGI_URL` value and that the TGI service name resolves from the LlamaStack pod
  - Test the endpoint directly, e.g. `curl http://tgi-server:80/health` (see Health Checks above)
- **Model Loading Failures**
  - Verify the model ID exists on the Hugging Face Hub
  - Check that the TGI server has sufficient resources
  - Ensure the model is compatible with TGI
- **Timeout Issues**
  - Increase the `TGI_TIMEOUT` value (see the sketch after this list)
  - Check TGI server performance
  - Monitor network latency
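
To adjust a setting such as `TGI_TIMEOUT` on a running distribution, you can edit the custom resource in place; a minimal sketch, assuming the resource name from the Quick Start example:

```bash
# Opens the LlamaStackDistribution in your editor; update the TGI_TIMEOUT entry under containerSpec.env
kubectl edit llamastackdistribution my-tgi-llamastack
```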
## Best Practices

### Performance Optimization
- Use appropriate batch sizes for your workload
- Configure TGI server with optimal parameters
- Monitor and tune timeout values
- Use multiple TGI servers for high availability
### Resource Management
- Size TGI servers based on model requirements
- Monitor GPU utilization if using GPU acceleration
- Scale LlamaStack replicas based on request volume
- Use resource requests and limits
### Security
- Secure TGI server endpoints with authentication
- Use network policies to restrict access (see the sketch after this list)
- Monitor API usage and implement rate limiting
- Keep TGI server images updated
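
A minimal sketch of such a network policy, assuming the TGI server pods are labeled `app: tgi-server` (as in the Prerequisites example) and the LlamaStack pods are labeled `app: llama-stack`:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-tgi-access
spec:
  # Applies to the TGI server pods
  podSelector:
    matchLabels:
      app: tgi-server
  policyTypes:
    - Ingress
  ingress:
    # Allow ingress only from LlamaStack pods on the TGI port
    - from:
        - podSelector:
            matchLabels:
              app: llama-stack
      ports:
        - protocol: TCP
          port: 80
```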
### Model Management
- Version your models and TGI configurations
- Test model changes in development first
- Monitor model performance and accuracy
- Have rollback procedures for model updates
## Next Steps
## API Reference

For complete API documentation, see:

- API Reference
- Configuration Reference