Meta Superintelligence Labs

Segment Anything with Concepts

A unified foundation model for promptable segmentation in images and videos. Detect, segment, and track objects using text or visual prompts such as points, boxes, and masks.

Nicolas Carion*, Laura Gustafson*, Yuan-Ting Hu*, Shoubhik Debnath*, Ronghang Hu*, Didac Suris*, Chaitanya Ryali*, Kalyan Vasudev Alwala*, Haitham Khedr*, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Radle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu°, Tsung-Han Wu°, Yu Zhou°, Liliane Momeni°, Rishi Hazra°, Shuangrui Ding°, Sagar Vaze°, Francois Porcher°, Feng Li°, Siyuan Li°, Aishwarya Kamath°, Ho Kei Cheng°, Piotr Dollar†, Nikhila Ravi†, Kate Saenko†, Pengchuan Zhang†, Christoph Feichtenhofer†

* core contributor, ° intern, † project lead — order is random within groups

Overview

What is SAM 3?

SAM 3 is a unified foundation model for promptable segmentation in images and videos. It can detect, segment, and track objects using text or visual prompts such as points, boxes, and masks.

Open-Vocabulary Segmentation

Exhaustively segment all instances of an open-vocabulary concept specified by a short text phrase or exemplar images.

Image & Video Support

Works on both static images and video streams with consistent tracking across frames using a unified architecture.

Promptable Interface

Use text, points, boxes, or masks as input prompts. The model handles a vastly larger set of open-vocabulary prompts than prior work.

Interactive Refinement

Refine segmentation results interactively with additional point or box prompts for precise control.

Batched Inference

Process multiple images efficiently with batched inference support for production workloads.

SAM 3 Agent

Handle complex text prompts with agent-based reasoning for multi-step segmentation tasks.

Updates

Latest Updates

Recent releases and improvements to SAM 3.

March 27, 2026

SAM 3.1 Object Multiplex Released

SAM 3.1 introduces a shared-memory approach for joint multi-object tracking that is significantly faster without sacrificing accuracy. A new suite of improved model checkpoints (denoted SAM 3.1) has been released on Hugging Face. See the release notes for full details. To use the new SAM 3.1 checkpoints, pull the latest code from the repository and reinstall.

How It Works

From Prompt to Segmentation in Four Steps

SAM 3 turns natural language descriptions or visual cues into precise segmentation masks with minimal code.

Load the Model

Build the SAM 3 image model or video predictor with a single function call: build_sam3_image_model() for images, or build_sam3_video_predictor() for video.

Prepare Input

Load your image (JPEG/PNG) or video (JPEG folder or MP4). The processor handles preprocessing and tensor preparation automatically.

Prompt with Text

Use short text phrases like “a dog” or “a player in white” as prompts. The presence token improves discrimination between closely related concepts.

Get Results

Receive masks, bounding boxes, and confidence scores for every instance. Results are ready for visualization or downstream processing.

Code

Quick Start

Get up and running with SAM 3 in minutes. These examples show image and video inference with text prompts.

            
Image segmentation with text prompts

import torch
from PIL import Image
from sam3.model_builder import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

# Load the model
model = build_sam3_image_model()
processor = Sam3Processor(model)

# Load an image
image = Image.open("input.jpg")
inference_state = processor.set_image(image)

# Prompt the model with text
output = processor.set_text_prompt(
    state=inference_state,
    prompt="a dog"
)

masks, boxes, scores = output["masks"], output["boxes"], output["scores"]
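
The three outputs can be filtered before visualization or downstream processing. The sketch below is a minimal example, assuming `masks`, `boxes`, and `scores` are index-aligned sequences with one entry per detected instance. The helper name `filter_instances` and the 0.5 threshold are illustrative choices, not part of the SAM 3 API, and synthetic values stand in for real model output so the snippet runs on its own.

```python
def filter_instances(masks, boxes, scores, min_score=0.5):
    """Keep instances scoring at least min_score, most confident first."""
    ranked = sorted(
        (i for i in range(len(scores)) if float(scores[i]) >= min_score),
        key=lambda i: float(scores[i]),
        reverse=True,
    )
    return (
        [masks[i] for i in ranked],
        [boxes[i] for i in ranked],
        [float(scores[i]) for i in ranked],
    )


# Synthetic stand-ins for output["masks"], output["boxes"], output["scores"].
# Real outputs are likely tensors; their exact shapes are not assumed here.
masks = [[[1, 0], [0, 0]], [[0, 1], [0, 0]], [[0, 0], [1, 1]]]
boxes = [[0, 0, 1, 1], [1, 0, 2, 1], [0, 1, 2, 2]]
scores = [0.91, 0.32, 0.77]

kept_masks, kept_boxes, kept_scores = filter_instances(masks, boxes, scores)
print(f"kept {len(kept_scores)} of {len(scores)} instances")
```

Because the helper only indexes by integer and calls `float()` on each score, it should also accept array or tensor inputs, though that has not been checked against actual SAM 3 output.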
          

            
Video segmentation with text prompts

from sam3.model_builder import build_sam3_video_predictor

video_predictor = build_sam3_video_predictor()
video_path = "video.mp4"

# Start a session
response = video_predictor.handle_request(
    request={
        "type": "start_session",
        "resource_path": video_path
    }
)

# Add a text prompt on frame 0
response = video_predictor.handle_request(
    request={
        "type": "add_prompt",
        "session_id": response["session_id"],
        "frame_index": 0,
        "text": "a car"
    }
)

output = response["outputs"]
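
The two requests above can be wrapped in a small helper so the session ID is not lost when `response` is reassigned. This is a sketch only: it uses just the request fields shown above, and `segment_video_with_text` and `StubPredictor` are hypothetical names introduced here. The stub records requests and returns placeholder values so the example runs without the model. It does not reflect the real response contents.

```python
def segment_video_with_text(predictor, video_path, text, frame_index=0):
    """Start a session, add one text prompt, and return (session_id, outputs)."""
    session = predictor.handle_request(
        request={"type": "start_session", "resource_path": video_path}
    )
    session_id = session["session_id"]
    response = predictor.handle_request(
        request={
            "type": "add_prompt",
            "session_id": session_id,
            "frame_index": frame_index,
            "text": text,
        }
    )
    return session_id, response["outputs"]


class StubPredictor:
    """Hypothetical stand-in that records requests; it performs no segmentation."""

    def __init__(self):
        self.requests = []

    def handle_request(self, request):
        self.requests.append(request)
        if request["type"] == "start_session":
            return {"session_id": "demo-session"}
        # Placeholder payload; the real "outputs" structure is not assumed here.
        return {"outputs": {"frame_index": request["frame_index"]}}


stub = StubPredictor()
session_id, outputs = segment_video_with_text(stub, "video.mp4", "a car")
print(session_id, [r["type"] for r in stub.requests])
```

With the real predictor, `stub` would be replaced by the object returned from build_sam3_video_predictor().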
          

Architecture

Model Architecture

SAM 3 consists of a detector and a tracker that share a vision encoder. It has 848M parameters.

848M

Total Parameters

DETR-based

Detector conditioned on text, geometry, and image exemplars

SAM 2

Tracker inherits the SAM 2 transformer encoder-decoder architecture

Presence Token

Improves discrimination between closely related text prompts
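
One way to picture the presence token is as a single image-level decision that gates every per-object score. The sketch below assumes the final score of each candidate is the product of a global presence probability and that candidate's own match probability. This page does not specify the mechanism, so treat the formula, the function names, and the logit values as illustrative assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_scores(presence_logit, query_logits):
    """Scale each per-query match score by one image-level presence probability."""
    presence = sigmoid(presence_logit)
    return [presence * sigmoid(q) for q in query_logits]


# Same per-query evidence, two different global presence decisions (assumed values).
query_logits = [2.0, 0.5, -1.0]
present = gated_scores(presence_logit=3.0, query_logits=query_logits)
absent = gated_scores(presence_logit=-3.0, query_logits=query_logits)

print([round(s, 2) for s in present])
print([round(s, 2) for s in absent])
```

Under this assumption, a low presence probability suppresses all candidates at once, which is how a look-alike object could be rejected for a closely related but absent concept.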

Benchmarks

Image Results

SAM 3 achieves 75-80% of human performance on the SA-Co benchmark, which contains 270K unique concepts — over 50 times more than existing benchmarks.

Model	Instance Segmentation			Box Detection
	LVIS cgF1	SA-Co/Gold AP	LVIS cgF1	COCO AP	SA-Co/Gold AP	COCO APo	SA-Co/Gold cgF1
Human	—	—	72.8	—	—	—	74.0
OWLv2*	29.3	43.4	24.6	30.2	45.5	46.1	23.9
DINO-X	—	38.5	21.3	—	52.4	56.0	—
Gemini 2.5	13.4	—	13.0	16.1	—	—	—
SAM 3	37.2	48.5	54.1	40.6	53.6	56.4	55.7

* Partially trained on LVIS. APo refers to COCO-O accuracy.

Video Results

Performance on video segmentation benchmarks: SA-V, YT-Temporal-1B, SmartGlasses, LVVIS, and BURST.

Model	SA-V cgF1	SA-V pHOTA	YT-Temporal-1B cgF1	SmartGlasses pHOTA	LVVIS cgF1	LVVIS pHOTA	BURST mAP	BURST HOTA
Human	53.1	70.5	71.2	78.4	58.5	72.3	—	—
SAM 3	30.3	58.0	50.8	69.9	36.4	63.6	36.3	44.5
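
To put the human rows in context, the gap can be expressed as a percentage of the human score. The snippet below is a small arithmetic sketch using values copied from the tables above, restricted to columns that report a human reference and whose label is unambiguous. The dictionary keys are labels chosen here for readability.

```python
# (human, sam3) pairs copied from the image and video results tables.
table_scores = {
    "image SA-Co/Gold cgF1 (box detection)": (74.0, 55.7),
    "video SA-V cgF1": (53.1, 30.3),
    "video SA-V pHOTA": (70.5, 58.0),
    "video YT-Temporal-1B cgF1": (71.2, 50.8),
    "video SmartGlasses pHOTA": (78.4, 69.9),
    "video LVVIS cgF1": (58.5, 36.4),
    "video LVVIS pHOTA": (72.3, 63.6),
}


def percent_of_human(human, model):
    """Model score expressed as a percentage of the human score."""
    return 100.0 * model / human


ratios = {name: percent_of_human(h, m) for name, (h, m) in table_scores.items()}
for name, pct in ratios.items():
    print(f"{name:40s} {pct:5.1f}% of human")
```

On these numbers, the video cgF1 columns trail the human reference by a wider margin than the pHOTA columns do.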

Data

SA-Co Dataset

We release 2 image benchmarks (SA-Co/Gold and SA-Co/Silver) and a video benchmark (SA-Co/VEval). The datasets contain images or videos with annotated noun phrases. Each image/video and noun phrase pair is annotated with instance masks and unique IDs of each object matching the phrase.

Phrases that have no matching objects (negative prompts) have no masks, shown in red font in the figure. See the linked READMEs for more details on how to download and run evaluations on the datasets.
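
The annotation structure described above can be modeled as a phrase paired with zero or more instances. The following is a hypothetical in-memory representation for illustration only. The class and field names are assumptions and do not reflect the released file format, which is documented in the linked READMEs.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    object_id: int  # unique ID of one object matching the phrase
    mask: list      # placeholder for the instance mask (encoding not specified here)


@dataclass
class PhraseAnnotation:
    media_id: str     # hypothetical identifier for the image or video
    noun_phrase: str
    instances: list = field(default_factory=list)

    @property
    def is_negative(self):
        """A negative prompt is a phrase with no matching objects, hence no masks."""
        return not self.instances


positive = PhraseAnnotation(
    "img_001",
    "a dog",
    [Instance(1, [[0, 1], [1, 1]]), Instance(2, [[1, 0], [0, 0]])],
)
negative = PhraseAnnotation("img_001", "a giraffe")

print(positive.is_negative, negative.is_negative)
```

Modeling negatives as an empty instance list keeps positive and negative pairs in one type, so an evaluation loop can treat "predicted nothing" and "annotated nothing" uniformly.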

Hugging Face: SA-Co/Gold, SA-Co/Silver, SA-Co/VEval

Roboflow: SA-Co/Gold, SA-Co/Silver, SA-Co/VEval

FAQ

Frequently Asked Questions

Quick answers to the most common questions about SAM 3.

SAM 3 (Segment Anything Model 3) is a unified foundation model from Meta Superintelligence Labs for promptable segmentation in images and videos. It can detect, segment, and track objects using text or visual prompts such as points, boxes, and masks.

SAM 3 introduces exhaustive open-vocabulary segmentation, a presence token for better text prompt discrimination, and a decoupled detector-tracker design.

You need Python 3.12+, PyTorch 2.7+, and a CUDA-compatible GPU with CUDA 12.6+. Optionally, install flash-attn-3 for faster inference.
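
These requirements can be checked before installing. The script below is a minimal sketch, assuming the CUDA version PyTorch was built against (`torch.version.cuda`) is the relevant one to compare. The helper names `version_tuple` and `check_environment` are introduced here and are not part of SAM 3.

```python
import sys


def version_tuple(version, parts=2):
    """Parse a version string such as '2.7.1+cu126' into (2, 7)."""
    numbers = []
    for piece in str(version).split("+")[0].split(".")[:parts]:
        digits = "".join(ch for ch in piece if ch.isdigit())
        numbers.append(int(digits) if digits else 0)
    return tuple(numbers)


def check_environment():
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if sys.version_info[:2] < (3, 12):
        found = f"{sys.version_info.major}.{sys.version_info.minor}"
        problems.append(f"Python 3.12+ required, found {found}")
    try:
        import torch
    except ImportError:
        return problems + ["PyTorch is not installed"]
    if version_tuple(torch.__version__) < (2, 7):
        problems.append(f"PyTorch 2.7+ required, found {torch.__version__}")
    if not torch.cuda.is_available():
        problems.append("no CUDA-capable GPU is visible to PyTorch")
    elif torch.version.cuda is None or version_tuple(torch.version.cuda) < (12, 6):
        problems.append(f"CUDA 12.6+ required, PyTorch built with {torch.version.cuda}")
    return problems


for problem in check_environment():
    print("WARNING:", problem)
```

The check for flash-attn-3 is omitted because it is optional.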

SAM 3 has 848M parameters. It consists of a DETR-based detector and a tracker that share a vision encoder.

SAM 3 is available on GitHub at facebookresearch/sam3 under the SAM License. Checkpoints are available on Hugging Face after requesting access.

SA-Co contains 270K unique concepts — over 50x more than existing benchmarks. SAM 3 achieves 75-80% of human performance on it.
