Segment Anything with Concepts
A unified foundation model for promptable segmentation in images and videos. Detect, segment, and track objects using text or visual prompts such as points, boxes, and masks.
What is SAM 3?
SAM 3 is a unified foundation model for promptable segmentation in images and videos. It can detect, segment, and track objects using text or visual prompts such as points, boxes, and masks.
Open-Vocabulary Segmentation
Exhaustively segment all instances of an open-vocabulary concept specified by a short text phrase or exemplar images.
Image & Video Support
Works on both static images and video streams with consistent tracking across frames using a unified architecture.
Promptable Interface
Use text, points, boxes, or masks as input prompts. The model handles a vastly larger set of open-vocabulary prompts than prior work.
Interactive Refinement
Refine segmentation results interactively with additional point or box prompts for precise control.
Batched Inference
Process multiple images efficiently with batched inference support for production workloads.
SAM 3 Agent
Handle complex text prompts with agent-based reasoning for multi-step segmentation tasks.
Latest Updates
Recent releases and improvements to SAM 3.
SAM 3.1 Object Multiplex Released
SAM 3.1 introduces a shared-memory approach for joint multi-object tracking that is significantly faster without sacrificing accuracy. A new suite of improved model checkpoints (denoted SAM 3.1) has been released on Hugging Face. See the release notes for full details. To use the new SAM 3.1 checkpoints, pull the latest code from the repository and reinstall.
From Prompt to Segmentation in Four Steps
SAM 3 turns natural language descriptions or visual cues into precise segmentation masks with minimal code.
Load the Model
Build the SAM 3 image model or video predictor with a single function call: build_sam3_image_model() for images or build_sam3_video_predictor() for videos.
Prepare Input
Load your image (JPEG/PNG) or video (JPEG folder or MP4). The processor handles preprocessing and tensor preparation automatically.
Prompt with Text
Use short text phrases like “a dog” or “a player in white” as prompts. The presence token improves discrimination between closely related concepts.
Get Results
Receive masks, bounding boxes, and confidence scores for every instance. Results are ready for visualization or downstream processing.
Quick Start
Get up and running with SAM 3 in minutes. These examples show image and video inference with text prompts.
```python
import torch
from PIL import Image

from sam3.model_builder import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

# Load the model
model = build_sam3_image_model()
processor = Sam3Processor(model)

# Load an image
image = Image.open("input.jpg")
inference_state = processor.set_image(image)

# Prompt the model with text
output = processor.set_text_prompt(state=inference_state, prompt="a dog")
masks, boxes, scores = output["masks"], output["boxes"], output["scores"]
```
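The masks, boxes, and scores returned for a prompt are often filtered by confidence before visualization. A minimal sketch of that step, assuming the three outputs are parallel sequences with one entry per instance and that scores are plain floats (tensor outputs would need converting first, for example with `.tolist()`). The `filter_by_score` helper and the 0.5 threshold are hypothetical and not part of SAM 3:

```python
from typing import Any, List, Sequence, Tuple


def filter_by_score(
    masks: Sequence[Any],
    boxes: Sequence[Any],
    scores: Sequence[float],
    threshold: float = 0.5,  # hypothetical default, tune per use case
) -> Tuple[List[Any], List[Any], List[float]]:
    """Keep only the instances whose confidence score meets the threshold."""
    if not (len(masks) == len(boxes) == len(scores)):
        raise ValueError("masks, boxes, and scores must have the same length")
    kept = [i for i, score in enumerate(scores) if score >= threshold]
    return (
        [masks[i] for i in kept],
        [boxes[i] for i in kept],
        [scores[i] for i in kept],
    )


# Toy stand-ins for model output: three detected instances.
masks = ["mask_a", "mask_b", "mask_c"]
boxes = [(0, 0, 10, 10), (5, 5, 20, 20), (1, 2, 3, 4)]
scores = [0.92, 0.31, 0.58]

kept_masks, kept_boxes, kept_scores = filter_by_score(masks, boxes, scores)
print(kept_masks)
```

Keeping the three sequences aligned by index means a mask can never be paired with the wrong box or score after filtering.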
```python
from sam3.model_builder import build_sam3_video_predictor

video_predictor = build_sam3_video_predictor()
video_path = "video.mp4"

# Start a session
response = video_predictor.handle_request(
    request={
        "type": "start_session",
        "resource_path": video_path,
    }
)

# Add a text prompt on frame 0
response = video_predictor.handle_request(
    request={
        "type": "add_prompt",
        "session_id": response["session_id"],
        "frame_index": 0,
        "text": "a car",
    }
)
output = response["outputs"]
```
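The video example follows a fixed pattern: start a session, then reuse its `session_id` when adding a prompt. A minimal sketch that wraps this pattern, assuming only the request and response keys shown in the example. `prompt_video_with_text` and the `FakePredictor` stand-in are hypothetical and exist only so the flow can be exercised without model weights:

```python
from typing import Any, Dict, List


def prompt_video_with_text(
    predictor: Any, video_path: str, text: str, frame_index: int = 0
) -> Any:
    """Start a session, add one text prompt, and return the prompt outputs."""
    session = predictor.handle_request(
        request={"type": "start_session", "resource_path": video_path}
    )
    # Keep the session ID in its own variable so later responses cannot shadow it.
    session_id = session["session_id"]
    response = predictor.handle_request(
        request={
            "type": "add_prompt",
            "session_id": session_id,
            "frame_index": frame_index,
            "text": text,
        }
    )
    return response["outputs"]


class FakePredictor:
    """Hypothetical stand-in that records requests instead of running a model."""

    def __init__(self) -> None:
        self.requests: List[Dict[str, Any]] = []

    def handle_request(self, request: Dict[str, Any]) -> Dict[str, Any]:
        self.requests.append(request)
        if request["type"] == "start_session":
            return {"session_id": "session-0"}
        return {"outputs": {"prompt": request["text"]}}


fake = FakePredictor()
outputs = prompt_video_with_text(fake, "video.mp4", "a car")
print(outputs)
```

With a real predictor, the equivalent call would be `prompt_video_with_text(video_predictor, video_path, "a car")`.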
Model Architecture
SAM 3 consists of a detector and a tracker that share a vision encoder. It has 848M parameters.
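The parameter count gives a rough lower bound on the memory needed just to hold the weights. A back-of-the-envelope sketch, assuming 4 bytes per parameter in float32 and 2 bytes in float16 or bfloat16. It ignores activations, tracking state, and framework overhead, and the precision of the released checkpoints is not stated here:

```python
NUM_PARAMS = 848_000_000  # parameter count stated above

# Standard storage sizes per element for common floating-point formats.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Return the weight storage size in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(NUM_PARAMS, dtype):.2f} GB")
```

Under these assumptions the weights alone take roughly 3.4 GB in float32 and 1.7 GB in half precision.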
Image Results
SAM 3 achieves 75-80% of human performance on the SA-Co benchmark, which contains 270K unique concepts — over 50 times more than existing benchmarks.
| Model | Instance Segmentation | Box Detection | | | | | |
|---|---|---|---|---|---|---|---|
| | LVIS cgF1 | SA-Co/Gold AP | LVIS cgF1 | COCO AP | SA-Co/Gold AP | COCO APo | SA-Co/Gold cgF1 |
| Human | — | — | 72.8 | — | — | — | 74.0 |
| OWLv2* | 29.3 | 43.4 | 24.6 | 30.2 | 45.5 | 46.1 | 23.9 |
| DINO-X | — | 38.5 | 21.3 | — | 52.4 | 56.0 | — |
| Gemini 2.5 | 13.4 | — | 13.0 | 16.1 | — | — | — |
| SAM 3 | 37.2 | 48.5 | 54.1 | 40.6 | 53.6 | 56.4 | 55.7 |
* Partially trained on LVIS. APo refers to COCO-O accuracy.
Video Results
Performance across video segmentation benchmarks: SA-V, YT-Temporal-1B, SmartGlasses, LVVIS, and BURST.
| Model | SA-V cgF1 | SA-V pHOTA | YT-Temporal-1B cgF1 | SmartGlasses pHOTA | LVVIS cgF1 | LVVIS pHOTA | BURST mAP | BURST HOTA |
|---|---|---|---|---|---|---|---|---|
| Human | 53.1 | 70.5 | 71.2 | 78.4 | 58.5 | 72.3 | — | — |
| SAM 3 | 30.3 | 58.0 | 50.8 | 69.9 | 36.4 | 63.6 | 36.3 | 44.5 |
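One way to read the video table is as a fraction of human performance per metric. A short sketch that computes this ratio from the two rows above, skipping the BURST columns because no human score is reported for them. The helper name is hypothetical:

```python
from typing import Dict, Optional

# Scores copied from the video results table; None marks a missing entry.
HUMAN: Dict[str, Optional[float]] = {
    "SA-V cgF1": 53.1,
    "SA-V pHOTA": 70.5,
    "YT-Temporal-1B cgF1": 71.2,
    "SmartGlasses pHOTA": 78.4,
    "LVVIS cgF1": 58.5,
    "LVVIS pHOTA": 72.3,
    "BURST mAP": None,
    "BURST HOTA": None,
}
SAM3: Dict[str, Optional[float]] = {
    "SA-V cgF1": 30.3,
    "SA-V pHOTA": 58.0,
    "YT-Temporal-1B cgF1": 50.8,
    "SmartGlasses pHOTA": 69.9,
    "LVVIS cgF1": 36.4,
    "LVVIS pHOTA": 63.6,
    "BURST mAP": 36.3,
    "BURST HOTA": 44.5,
}


def fraction_of_human(
    model: Dict[str, Optional[float]], human: Dict[str, Optional[float]]
) -> Dict[str, float]:
    """Return model/human ratios for metrics where both scores are available."""
    ratios = {}
    for metric, human_score in human.items():
        model_score = model.get(metric)
        if human_score is None or model_score is None:
            continue  # no basis for comparison on this metric
        ratios[metric] = model_score / human_score
    return ratios


ratios = fraction_of_human(SAM3, HUMAN)
for metric, ratio in ratios.items():
    print(f"{metric}: {ratio:.1%} of human")
```

On these numbers the ratio ranges from about 57% (SA-V cgF1) to about 89% (SmartGlasses pHOTA), with the pHOTA columns consistently closer to human performance than the cgF1 columns.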
SA-Co Dataset
We release two image benchmarks (SA-Co/Gold and SA-Co/Silver) and one video benchmark (SA-Co/VEval). The datasets contain images or videos with annotated noun phrases. Each pairing of an image or video with a noun phrase is annotated with instance masks and a unique ID for each object matching the phrase.
Phrases with no matching objects (negative prompts) have no masks and are shown in red in the figure. See the linked READMEs for details on how to download the datasets and run evaluations on them.
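The annotation layout described above can be pictured as one record per pairing of an image or video with a noun phrase. A minimal sketch, assuming a hypothetical in-memory schema. The class and field names are illustrative and are not the released file format, which is described in the dataset READMEs:

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Instance:
    object_id: int  # unique ID of one object matching the phrase
    mask: Any  # instance mask; the encoding is left open in this sketch


@dataclass
class PhraseAnnotation:
    media_id: str  # hypothetical identifier of the image or video
    noun_phrase: str
    instances: List[Instance] = field(default_factory=list)

    @property
    def is_negative(self) -> bool:
        """A negative prompt has no matching objects and therefore no masks."""
        return len(self.instances) == 0


positive = PhraseAnnotation(
    media_id="img_0001",
    noun_phrase="a dog",
    instances=[
        Instance(object_id=1, mask="mask_1"),
        Instance(object_id=2, mask="mask_2"),
    ],
)
negative = PhraseAnnotation(media_id="img_0001", noun_phrase="a zebra")

print(positive.is_negative, negative.is_negative)
```

Representing a negative prompt as a record with an empty instance list, rather than omitting it, keeps the phrases a model should reject visible to an evaluation loop.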
Frequently Asked Questions
Quick answers to the most common questions about SAM 3.