Together AI – The AI Acceleration Cloud - Fast Inference, Fine-Tuning & Training

Run and fine-tune generative AI models with simple APIs and scalable GPU clusters. Train & deploy at scale on The AI Acceleration Cloud.

Created Aug 30, 2025

Updated May 31, 2026

What it is

Together AI is a cloud platform providing GPU-accelerated infrastructure and tools for developing and deploying generative AI models. It is designed for AI researchers, developers, and enterprises building applications with large language models (LLMs) and other generative AI technologies. The platform enables users to run inference, fine-tune models, and train custom models from scratch using open-source and specialized models.

Main Features

Model Platform

Serverless Inference API: For inference on open-source models with OpenAI-compatible APIs.
Dedicated Endpoints: Deploy models on custom, single-tenant hardware.
Fine-Tuning: Tools for LoRA and full fine-tuning of models on custom data.
Evaluations: Capabilities to measure and benchmark model quality.
Together Chat: A chat application for interacting with open-source AI models.

Code Execution

Code Sandbox: Environments for building and testing AI development projects.
Code Interpreter: Functionality to execute code generated by LLMs.

GPU Cloud

Instant Clusters: Self-service provisioning of clusters with up to 64 NVIDIA GPUs.
Reserved Clusters: Larger clusters ranging from 64 to 1,000 NVIDIA GPUs.
Frontier AI Factory: Massive-scale clusters from 1,000 to over 100,000 NVIDIA GPUs.
Global Data Centers: GPU resources available in over 25 cities worldwide.
Slurm Integration: Support for the Slurm workload manager for cluster orchestration.

Supported Hardware

Access to NVIDIA's latest GPUs, including the GB200 NVL72, HGX B200, H200, and H100.

How it works

Running Model Inference

Users can select from over 200 open-source models for tasks like chat, image generation, code, audio, and embeddings. They interact with these models through a serverless API or deploy them on dedicated endpoints for consistent, high-performance inference without rate limits.

Customizing Models via Fine-Tuning

Developers provide their own dataset in a compatible format (e.g., JSONL) and use the Fine-Tuning API to adapt a base model (e.g., Llama 2) to their specific task. The process allows control over hyperparameters like learning rate, batch size, and number of epochs.

Training Large-Scale Models

For organizations training custom models from scratch, Together AI provides scalable GPU clusters. Users can reserve clusters of various sizes, utilize high-speed InfiniBand interconnects, and manage workloads with Slurm or Kubernetes to accelerate large model training.

Key Points

The platform emphasizes open-source AI, providing an alternative to closed models and helping users avoid vendor lock-in.
It boasts performance optimizations, claiming inference speeds 4x faster than vLLM and costs 11x lower than GPT-4o for some models.
The infrastructure is SOC 2 Type 2 and HIPAA compliant, catering to enterprise security requirements.
Together AI's own research team contributes to core AI advancements, such as the FlashAttention optimization and the RedPajama dataset, which are integrated into the platform.
It offers a high degree of deployment flexibility, allowing models to be run on Together's cloud or within a customer's own virtual private cloud (VPC).

Additional Details

Pricing: Offered on a pay-per-use basis for serverless inference (per token) and fine-tuning, plus hourly rates for reserved GPU clusters, starting at $1.75/hour.
Availability: Global GPU cloud services are available in 25+ data center locations.
Requirements: Access is provided through APIs and a web interface; users need an account to utilize the services.
Notable Models: The platform hosts a wide range of state-of-the-art models, including DeepSeek-V3, Llama 3.3 70B, Mistral Small 3, and Whisper Large v3.

Quick Actions

Visit Website

Table of Contents

Run open-source machine learning models with a cloud API