Meta Superintelligence Labs

Segment Anything with Concepts

A unified foundation model for promptable segmentation in images and videos. Detect, segment, and track objects using text or visual prompts such as points, boxes, and masks.

Nicolas Carion*, Laura Gustafson*, Yuan-Ting Hu*, Shoubhik Debnath*, Ronghang Hu*, Didac Suris*, Chaitanya Ryali*, Kalyan Vasudev Alwala*, Haitham Khedr*, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Radle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu°, Tsung-Han Wu°, Yu Zhou°, Liliane Momeni°, Rishi Hazra°, Shuangrui Ding°, Sagar Vaze°, Francois Porcher°, Feng Li°, Siyuan Li°, Aishwarya Kamath°, Ho Kei Cheng°, Piotr Dollar†, Nikhila Ravi†, Kate Saenko†, Pengchuan Zhang†, Christoph Feichtenhofer†

* core contributor, ° intern, † project lead — order is random within groups

What is SAM 3?

SAM 3 is a unified foundation model for promptable segmentation in images and videos. It can detect, segment, and track objects using text or visual prompts such as points, boxes, and masks.

Open-Vocabulary Segmentation

Exhaustively segment all instances of an open-vocabulary concept specified by a short text phrase or exemplar images.

Image & Video Support

Works on both static images and video streams with consistent tracking across frames using a unified architecture.

Promptable Interface

Use text, points, boxes, or masks as input prompts. The model handles a vastly larger set of open-vocabulary prompts than prior work.

Interactive Refinement

Refine segmentation results interactively with additional point or box prompts for precise control.

Batched Inference

Process multiple images efficiently with batched inference support for production workloads.

SAM 3 Agent

Handle complex text prompts with agent-based reasoning for multi-step segmentation tasks.

Latest Updates

Recent releases and improvements to SAM 3.

March 27, 2026

SAM 3.1 Object Multiplex Released

SAM 3.1 introduces a shared-memory approach for joint multi-object tracking that is significantly faster without sacrificing accuracy. A new suite of improved model checkpoints (denoted SAM 3.1) has been released on Hugging Face. See the release notes for full details. To use the new SAM 3.1 checkpoints, pull the latest code from the repository and reinstall.

From Prompt to Segmentation in Four Steps

SAM 3 turns natural language descriptions or visual cues into precise segmentation masks with minimal code.

01

Load the Model

Build the SAM 3 image model or video predictor with a single function call. Choose between build_sam3_image_model() and build_sam3_video_predictor().

02

Prepare Input

Load your image (JPEG/PNG) or video (JPEG folder or MP4). The processor handles preprocessing and tensor preparation automatically.

03

Prompt with Text

Use short text phrases like “a dog” or “a player in white” as prompts. The presence token improves discrimination between closely related concepts.

04

Get Results

Receive masks, bounding boxes, and confidence scores for every instance. Results are ready for visualization or downstream processing.
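
As a rough illustration of the downstream processing mentioned in step 04, the sketch below keeps only instances above a confidence threshold and computes the pixel area of each remaining mask. It uses NumPy stand-in arrays rather than real model output. The array layouts (masks as N x H x W booleans, boxes as N x 4, scores as N), the helper name filter_instances, and the 0.5 threshold are all assumptions made for this example, not part of the SAM 3 API; real outputs may need converting to NumPy first.

```python
import numpy as np


def filter_instances(masks, boxes, scores, min_score=0.5):
    """Keep only instances whose confidence score is at least min_score.

    Hypothetical helper: assumes masks is (N, H, W), boxes is (N, 4),
    and scores is (N,), all indexed by instance.
    """
    scores = np.asarray(scores)
    keep = scores >= min_score  # boolean selector over instances
    return np.asarray(masks)[keep], np.asarray(boxes)[keep], scores[keep]


# Stand-in data: three instances on a 4x4 image (not real model output).
masks = np.zeros((3, 4, 4), dtype=bool)
masks[0, :2, :2] = True
masks[1, 2:, 2:] = True
masks[2, 1:3, 1:3] = True
boxes = np.array([[0, 0, 2, 2], [2, 2, 4, 4], [1, 1, 3, 3]], dtype=float)
scores = np.array([0.92, 0.31, 0.77])

kept_masks, kept_boxes, kept_scores = filter_instances(
    masks, boxes, scores, min_score=0.5
)

# Pixel area of each kept mask, e.g. for dropping tiny detections later.
areas = kept_masks.reshape(len(kept_masks), -1).sum(axis=1)
print(len(kept_scores), "instances kept; areas:", areas.tolist())
```

The threshold is a tuning choice: a higher value trades recall for fewer false positives.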

Quick Start

Get up and running with SAM 3 in minutes. These examples show image and video inference with text prompts.

Image segmentation with text prompts
import torch
from PIL import Image
from sam3.model_builder import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

# Load the model
model = build_sam3_image_model()
processor = Sam3Processor(model)

# Load an image
image = Image.open("input.jpg")
inference_state = processor.set_image(image)

# Prompt the model with text
output = processor.set_text_prompt(
    state=inference_state,
    prompt="a dog"
)

masks, boxes, scores = output["masks"], output["boxes"], output["scores"]

Video segmentation with text prompts
from sam3.model_builder import build_sam3_video_predictor

video_predictor = build_sam3_video_predictor()
video_path = "video.mp4"

# Start a session
response = video_predictor.handle_request(
    request={
        "type": "start_session",
        "resource_path": video_path
    }
)

# Add a text prompt on frame 0
response = video_predictor.handle_request(
    request={
        "type": "add_prompt",
        "session_id": response["session_id"],
        "frame_index": 0,
        "text": "a car"
    }
)

output = response["outputs"]
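
The video example above reuses the response variable for both calls, so the session ID is lost once the second response overwrites the first. One way to keep things tidy is to build the request dictionaries with small helpers and store the session ID explicitly. The sketch below uses only the request keys shown above; the helper names start_session_request and add_text_prompt_request are hypothetical and are not part of the sam3 package.

```python
def start_session_request(resource_path):
    """Build a start_session request for a video file or JPEG folder."""
    return {"type": "start_session", "resource_path": resource_path}


def add_text_prompt_request(session_id, text, frame_index=0):
    """Build an add_prompt request carrying a text prompt for one frame."""
    return {
        "type": "add_prompt",
        "session_id": session_id,
        "frame_index": frame_index,
        "text": text,
    }


# Building the requests needs no model, so they can be inspected or logged.
start_request = start_session_request("video.mp4")
prompt_request = add_text_prompt_request("example-session-id", "a car")
print(start_request)
print(prompt_request)

# With the predictor from the example above, usage would look like:
#   session_id = video_predictor.handle_request(request=start_request)["session_id"]
#   response = video_predictor.handle_request(
#       request=add_text_prompt_request(session_id, "a car")
#   )
```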

Model Architecture

SAM 3 consists of a detector and a tracker that share a vision encoder. It has 848M parameters.

848M: total parameters
DETR-based: detector conditioned on text, geometry, and image exemplars
SAM 2: tracker inherits the SAM 2 transformer encoder-decoder architecture
Presence token: improves discrimination between closely related text prompts

Image Results

SAM 3 achieves 75-80% of human performance on the SA-Co benchmark, which contains 270K unique concepts — over 50 times more than existing benchmarks.

Model Instance Segmentation Box Detection
LVIS cgF1 SA-Co/Gold AP LVIS cgF1 COCO AP SA-Co/Gold AP COCO APo SA-Co/Gold cgF1
Human 72.8 74.0
OWLv2* 29.3 43.4 24.6 30.2 45.5 46.1 23.9
DINO-X 38.5 21.3 52.4 56.0
Gemini 2.5 13.4 13.0 16.1
SAM 3 37.2 48.5 54.1 40.6 53.6 56.4 55.7

* Partially trained on LVIS. APo refers to COCO-O accuracy.

Video Results

Performance across video segmentation benchmarks including SA-V, YouTube, and BURST.

Model SA-V cgF1 SA-V pHOTA YT-Temporal-1B cgF1 SmartGlasses pHOTA LVVIS cgF1 LVVIS pHOTA BURST mAP BURST HOTA
Human 53.1 70.5 71.2 78.4 58.5 72.3
SAM 3 30.3 58.0 50.8 69.9 36.4 63.6 36.3 44.5

SA-Co Dataset

We release 2 image benchmarks (SA-Co/Gold and SA-Co/Silver) and a video benchmark (SA-Co/VEval). The datasets contain images or videos with annotated noun phrases. Each image/video and noun phrase pair is annotated with instance masks and unique IDs of each object matching the phrase.

Phrases that have no matching objects (negative prompts) have no masks, shown in red font in the figure. See the linked READMEs for more details on how to download and run evaluations on the datasets.
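
To make the annotation structure described above concrete, here is a minimal sketch of how one image/phrase pair could be represented in memory. The class and field names (PhraseAnnotation, InstanceAnnotation, media_id, object_id) and the nested-list mask representation are illustrative assumptions, not the actual SA-Co file format; the linked READMEs remain the reference for the real schema.

```python
from dataclasses import dataclass, field


@dataclass
class InstanceAnnotation:
    """One object matching the phrase (hypothetical schema)."""
    object_id: int  # unique ID of the object within this image or video
    mask: list      # stand-in binary mask as nested lists of 0/1


@dataclass
class PhraseAnnotation:
    """All annotations for one (image or video, noun phrase) pair."""
    media_id: str
    noun_phrase: str
    instances: list = field(default_factory=list)

    @property
    def is_negative(self):
        # A negative prompt is a phrase with no matching objects, so no masks.
        return len(self.instances) == 0


positive = PhraseAnnotation(
    media_id="img_0001",
    noun_phrase="a dog",
    instances=[
        InstanceAnnotation(object_id=1, mask=[[0, 1], [1, 1]]),
        InstanceAnnotation(object_id=2, mask=[[1, 0], [0, 0]]),
    ],
)
negative = PhraseAnnotation(media_id="img_0001", noun_phrase="a giraffe")

for annotation in (positive, negative):
    kind = "negative" if annotation.is_negative else "positive"
    print(annotation.noun_phrase, kind, len(annotation.instances))
```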

Frequently Asked Questions

Quick answers to the most common questions about SAM 3.

SAM 3 (Segment Anything Model 3) is a unified foundation model from Meta Superintelligence Labs for promptable segmentation in images and videos. It can detect, segment, and track objects using text or visual prompts such as points, boxes, and masks.
SAM 3 introduces exhaustive open-vocabulary segmentation, a presence token for better text prompt discrimination, and a decoupled detector-tracker design.
SAM 3 requires Python 3.12+, PyTorch 2.7+, and a CUDA-compatible GPU with CUDA 12.6+. For faster inference, you can optionally install flash-attn-3.
SAM 3 has 848M parameters. It consists of a DETR-based detector and a tracker that share a vision encoder.
The code is available on GitHub at facebookresearch/sam3 under the SAM License. Checkpoints are available on Hugging Face after requesting access.
SA-Co contains 270K unique concepts — over 50x more than existing benchmarks. SAM 3 achieves 75-80% of human performance on it.
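
The version requirements listed above (Python 3.12+, PyTorch 2.7+, CUDA 12.6+) can be checked before installing with a short script along these lines. This is a sketch, not an official installer check: the function names are made up for this example, and torch.version.cuda reports the CUDA version PyTorch was built against, which is used here as an approximation of the CUDA requirement.

```python
import re
import sys

# Minimum versions taken from the requirements stated above.
MIN_PYTHON = (3, 12)
MIN_TORCH = (2, 7)
MIN_CUDA = (12, 6)


def parse_version(text):
    """Extract (major, minor) from strings such as '2.7.1+cu126' or '12.6'."""
    match = re.match(r"(\d+)\.(\d+)", str(text))
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return int(match.group(1)), int(match.group(2))


def check_environment():
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python 3.12 or newer is required")
    try:
        import torch
    except ImportError:
        problems.append("PyTorch is not installed")
        return problems
    if parse_version(torch.__version__) < MIN_TORCH:
        problems.append("PyTorch 2.7 or newer is required")
    if not torch.cuda.is_available():
        problems.append("no CUDA-capable GPU is visible to PyTorch")
    elif torch.version.cuda is None or parse_version(torch.version.cuda) < MIN_CUDA:
        # torch.version.cuda is the CUDA version of the PyTorch build.
        problems.append("a PyTorch build with CUDA 12.6 or newer is required")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    for issue in issues:
        print("WARNING:", issue)
    if not issues:
        print("Environment meets the stated requirements.")
```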