Together AI Distribution

Distribution Availability

The Together distribution container image may not be actively maintained or available. Verify that the image exists at docker.io/llamastack/distribution-together:latest before using this distribution (one way to check is shown below). For production use, consider the ollama or vllm distributions, which are actively maintained.
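
If you have Docker available locally, a quick way to check is docker manifest inspect (available in reasonably recent Docker CLIs), which fails when the tag cannot be found:

docker manifest inspect docker.io/llamastack/distribution-together:latest \
  && echo "image available" || echo "image not found"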

The Together distribution integrates with Together AI's inference platform, providing access to a wide variety of open-source models through their optimized API service.

Overview

Together AI offers fast, scalable inference for open-source language models. The Together distribution:

  • Connects to Together AI API for model inference
  • Supports multiple open-source models (Llama, Mistral, CodeLlama, etc.)
  • Provides high-performance inference with optimized serving
  • Offers cost-effective scaling with pay-per-use pricing

Distribution Details

Property           Value
-----------------  -------------------------------------------------
Distribution Name  together
Image              docker.io/llamastack/distribution-together:latest
Use Case           Together AI API integration
Requirements       Together AI API key
Recommended For    Open-source models, cost-effective inference

Prerequisites

1. Together AI Account

  • Sign up at together.ai
  • Get your API key from the dashboard (you can verify it with the curl check below)
  • Choose your preferred models
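
Before wiring the key into Kubernetes, you can sanity-check it directly against the Together AI API (the same models endpoint used in the troubleshooting commands later in this guide):

# Export your key first: export TOGETHER_API_KEY=your-api-key-here
curl -s -H "Authorization: Bearer $TOGETHER_API_KEY" \
  https://api.together.xyz/v1/models | head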

2. API Key Setup

Create a Kubernetes secret with your Together AI API key:

kubectl create secret generic together-api-key \
  --from-literal=TOGETHER_API_KEY=your-api-key-here
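
To confirm the secret landed with the expected key name, decode it back out (treat the output as sensitive):

kubectl get secret together-api-key \
  -o jsonpath='{.data.TOGETHER_API_KEY}' | base64 -d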

Quick Start

1. Create Together Distribution

Save the following manifest as together-distribution.yaml:

apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: my-together-llamastack
  namespace: default
spec:
  replicas: 1
  server:
    distribution:
      name: "together"
    containerSpec:
      port: 8321
      resources:
        requests:
          memory: "1Gi"
          cpu: "500m"
        limits:
          memory: "2Gi"
          cpu: "1"
      env:
        - name: TOGETHER_API_KEY
          valueFrom:
            secretKeyRef:
              name: together-api-key
              key: TOGETHER_API_KEY
        - name: TOGETHER_MODEL
          value: "meta-llama/Llama-2-7b-chat-hf"
    storage:
      size: "10Gi"

2. Deploy the Distribution

kubectl apply -f together-distribution.yaml

3. Verify Deployment

# Check the distribution status
kubectl get llamastackdistribution my-together-llamastack

# Check the pods
kubectl get pods -l app=llama-stack

# Check logs for Together AI connectivity
kubectl logs -l app=llama-stack
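
Once the pod is Running, a port-forward lets you probe the server directly. This is a minimal check that assumes the distribution serves the standard Llama Stack routes on port 8321 (the containerSpec port above); adjust the paths if your server version differs:

# Forward the server port to your machine
kubectl port-forward pod/<pod-name> 8321:8321 &

# Probe the server (assumes standard Llama Stack routes)
curl -s http://localhost:8321/v1/health
curl -s http://localhost:8321/v1/models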

Configuration Options

Supported Models

Together AI supports many popular open-source models:

Meta Llama Models

env:
  - name: TOGETHER_MODEL
    value: "meta-llama/Llama-2-7b-chat-hf"
  # value: "meta-llama/Llama-2-13b-chat-hf"
  # value: "meta-llama/Llama-2-70b-chat-hf"
  # value: "meta-llama/CodeLlama-7b-Instruct-hf"
  # value: "meta-llama/CodeLlama-13b-Instruct-hf"

Mistral Models

env:
  - name: TOGETHER_MODEL
    value: "mistralai/Mistral-7B-Instruct-v0.1"
  # value: "mistralai/Mixtral-8x7B-Instruct-v0.1"

Other Open-Source Models

env:
  - name: TOGETHER_MODEL
    value: "togethercomputer/RedPajama-INCITE-7B-Chat"
  # value: "NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO"
  # value: "teknium/OpenHermes-2.5-Mistral-7B"

Environment Variables

Configure Together AI connection and model parameters:

env:
  - name: TOGETHER_API_KEY
    valueFrom:
      secretKeyRef:
        name: together-api-key
        key: TOGETHER_API_KEY
  - name: TOGETHER_MODEL
    value: "meta-llama/Llama-2-7b-chat-hf"
  - name: TOGETHER_MAX_TOKENS
    value: "512"
  - name: TOGETHER_TEMPERATURE
    value: "0.7"
  - name: TOGETHER_TOP_P
    value: "0.9"
  - name: TOGETHER_TOP_K
    value: "50"
  - name: TOGETHER_REPETITION_PENALTY
    value: "1.0"
  - name: TOGETHER_TIMEOUT
    value: "30"  # Request timeout in seconds
  - name: LOG_LEVEL
    value: "INFO"

Resource Requirements

Development Setup

resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "1Gi"
    cpu: "500m"

Production Setup

resources:
  requests:
    memory: "1Gi"
    cpu: "500m"
  limits:
    memory: "2Gi"
    cpu: "1"

High-Throughput Setup

resources:
  requests:
    memory: "2Gi"
    cpu: "1"
  limits:
    memory: "4Gi"
    cpu: "2"

Advanced Configuration

Multiple Models

Deploy different distributions for different models:

# Llama 2 7B for general chat
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: together-llama2-7b
spec:
  server:
    distribution:
      name: "together"
    containerSpec:
      env:
        - name: TOGETHER_MODEL
          value: "meta-llama/Llama-2-7b-chat-hf"
---
# CodeLlama for code generation
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: together-codellama
spec:
  server:
    distribution:
      name: "together"
    containerSpec:
      env:
        - name: TOGETHER_MODEL
          value: "meta-llama/CodeLlama-7b-Instruct-hf"

Production Configuration

apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: production-together
  namespace: production
spec:
  replicas: 3
  server:
    distribution:
      name: "together"
    containerSpec:
      resources:
        requests:
          memory: "2Gi"
          cpu: "1"
        limits:
          memory: "4Gi"
          cpu: "2"
      env:
        - name: TOGETHER_API_KEY
          valueFrom:
            secretKeyRef:
              name: together-api-key
              key: TOGETHER_API_KEY
        - name: TOGETHER_MODEL
          value: "meta-llama/Llama-2-13b-chat-hf"
        - name: TOGETHER_MAX_TOKENS
          value: "1024"
        - name: TOGETHER_TEMPERATURE
          value: "0.7"
        - name: TOGETHER_TIMEOUT
          value: "60"
        - name: LOG_LEVEL
          value: "WARNING"
        - name: ENABLE_TELEMETRY
          value: "true"
    storage:
      size: "20Gi"

Custom Configuration with ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: together-config
data:
  together-settings.json: |
    {
      "default_model": "meta-llama/Llama-2-7b-chat-hf",
      "max_tokens": 512,
      "temperature": 0.7,
      "top_p": 0.9,
      "top_k": 50,
      "repetition_penalty": 1.0,
      "stop_sequences": ["</s>", "[INST]", "[/INST]"],
      "retry_config": {
        "max_retries": 3,
        "backoff_factor": 2,
        "max_backoff": 60
      }
    }
---
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: custom-together
spec:
  server:
    distribution:
      name: "together"
    containerSpec:
      env:
        - name: TOGETHER_CONFIG_FILE
          value: "/config/together-settings.json"
    podOverrides:
      volumes:
        - name: together-config
          configMap:
            name: together-config
      volumeMounts:
        - name: together-config
          mountPath: /config

Use Cases

1. Development and Prototyping

apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: dev-together
  namespace: development
spec:
  replicas: 1
  server:
    distribution:
      name: "together"
    containerSpec:
      resources:
        requests:
          memory: "512Mi"
          cpu: "250m"
        limits:
          memory: "1Gi"
          cpu: "500m"
      env:
        - name: TOGETHER_API_KEY
          valueFrom:
            secretKeyRef:
              name: together-api-key
              key: TOGETHER_API_KEY
        - name: TOGETHER_MODEL
          value: "meta-llama/Llama-2-7b-chat-hf"
        - name: TOGETHER_MAX_TOKENS
          value: "256"
        - name: LOG_LEVEL
          value: "DEBUG"
    storage:
      size: "5Gi"

2. Code Generation Service

apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: code-generation-together
  namespace: default
spec:
  replicas: 2
  server:
    distribution:
      name: "together"
    containerSpec:
      resources:
        requests:
          memory: "1Gi"
          cpu: "500m"
        limits:
          memory: "2Gi"
          cpu: "1"
      env:
        - name: TOGETHER_API_KEY
          valueFrom:
            secretKeyRef:
              name: together-api-key
              key: TOGETHER_API_KEY
        - name: TOGETHER_MODEL
          value: "meta-llama/CodeLlama-13b-Instruct-hf"
        - name: TOGETHER_MAX_TOKENS
          value: "2048"
        - name: TOGETHER_TEMPERATURE
          value: "0.1"  # Lower temperature for code
    storage:
      size: "15Gi"

3. High-Volume Production

apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: high-volume-together
  namespace: production
spec:
  replicas: 5
  server:
    distribution:
      name: "together"
    containerSpec:
      resources:
        requests:
          memory: "2Gi"
          cpu: "1"
        limits:
          memory: "4Gi"
          cpu: "2"
      env:
        - name: TOGETHER_API_KEY
          valueFrom:
            secretKeyRef:
              name: together-api-key
              key: TOGETHER_API_KEY
        - name: TOGETHER_MODEL
          value: "meta-llama/Llama-2-70b-chat-hf"
        - name: TOGETHER_MAX_TOKENS
          value: "1024"
        - name: TOGETHER_TIMEOUT
          value: "120"
        - name: ENABLE_TELEMETRY
          value: "true"
    storage:
      size: "50Gi"

Monitoring and Troubleshooting

Health Checks

# Check distribution status
kubectl get llamastackdistribution

# Check API connectivity
kubectl logs -l app=llama-stack | grep -i together

# Test the API key from inside the pod (single quotes so the
# variable expands in the pod's shell, not your local one)
kubectl exec -it <pod-name> -- sh -c \
  'curl -s -H "Authorization: Bearer $TOGETHER_API_KEY" https://api.together.xyz/v1/models'

Performance Monitoring

# Monitor resource usage
kubectl top pods -l app=llama-stack

# Check API response times
kubectl logs -l app=llama-stack | grep -i "response_time"

# Monitor API usage
kubectl logs -l app=llama-stack | grep -i "api_usage"

Common Issues

  1. Invalid API Key

    # Verify the API key exists in the secret
    kubectl get secret together-api-key -o yaml

    # Confirm the key is present in the pod environment
    kubectl exec -it <pod-name> -- env | grep TOGETHER_API_KEY

  2. Model Not Available

    • Check that the model exists in the Together AI catalog
    • Verify the model name's spelling and format
    • Some models may have usage restrictions

  3. Rate Limiting

    • Monitor API usage and limits
    • Implement request queuing
    • Consider upgrading your Together AI plan

  4. Timeout Issues

    • Increase the TOGETHER_TIMEOUT value
    • Check network connectivity
    • Monitor Together AI service status

Best Practices

Cost Optimization

  • Choose appropriate models for your use case
  • Monitor token usage and optimize prompts
  • Use smaller models for development/testing
  • Implement caching for repeated requests
  • Set up usage alerts and budgets

Performance

  • Scale replicas based on request volume (see the HPA sketch after this list)
  • Use connection pooling and keep-alive
  • Implement request batching where possible
  • Monitor and optimize timeout values
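
As a starting point for replica scaling, a HorizontalPodAutoscaler can target the Deployment the operator creates. The Deployment name below is an assumption (shown as the CR name); confirm the real name with kubectl get deployments before applying:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: together-llamastack-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-together-llamastack  # assumption: Deployment named after the CR
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70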

Security

  • Store API keys in Kubernetes Secrets
  • Use least-privilege access controls
  • Monitor API usage for anomalies
  • Rotate API keys regularly (a rotation sketch follows this list)
  • Implement rate limiting and request validation
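
A sketch of a key rotation without downtime: update the secret in place, then restart the pods so they pick up the new value. The Deployment name is an assumption, as in the scaling sketch above:

# Overwrite the secret with the rotated key
kubectl create secret generic together-api-key \
  --from-literal=TOGETHER_API_KEY=new-api-key-here \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart pods so they read the new value (assumed Deployment name)
kubectl rollout restart deployment my-together-llamastack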

Reliability

  • Implement retry logic with exponential backoff
  • Use multiple replicas for high availability
  • Monitor Together AI service status
  • Have fallback mechanisms for service outages

Cost Management

Usage Monitoring

env:
  - name: ENABLE_USAGE_TRACKING
    value: "true"
  - name: USAGE_LOG_LEVEL
    value: "INFO"
  - name: COST_ALERT_THRESHOLD
    value: "100"  # Alert when daily cost exceeds $100

Budget Controls

  • Set up billing alerts in Together AI dashboard
  • Implement request quotas per user/application
  • Monitor token usage patterns
  • Use smaller models for non-critical workloads

Next Steps

API Reference

For complete API documentation, see:

  • API Reference
  • Configuration Reference
  • Together AI API Documentation