Falcon Perception presents a clear bet: a single early-fusion Transformer that processes image and text in the same stack to do grounding and open-vocabulary segmentation. Sounds like a risky simplification? Yes, but the results and design show why that simplicity can win on speed, clarity and scalability.
Central design
The idea is straightforward and elegant. Instead of the classic vision-encoder + fusion-decoder recipe, Falcon Perception uses an autoregressive Transformer that consumes a unified sequence of image patches, text tokens and task tokens from the very first layer.
Image and text share the same parameter space.
It uses a hybrid attention mask to respect the distinct structure of both domains.
The hybrid mask is key: image tokens attend bidirectionally among themselves to build global visual context. Text and task tokens follow causal attention with respect to everything before them — including the visual tokens — to allow autoregressive generation of answers and lists of instances.
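The hybrid mask described above is easy to sketch; this is an illustrative NumPy version for a sequence laid out as [image | text/task], not the released implementation:

```python
import numpy as np

def hybrid_mask(n_image: int, n_text: int) -> np.ndarray:
    """Boolean attention mask; mask[i, j] is True when token i may attend to j.

    Image tokens attend bidirectionally among themselves; text/task tokens
    attend causally to everything before them, including all image tokens.
    """
    n = n_image + n_text
    mask = np.zeros((n, n), dtype=bool)
    # Image block: full bidirectional attention.
    mask[:n_image, :n_image] = True
    # Text/task block: causal over the whole prefix (image + earlier text).
    for i in range(n_image, n):
        mask[i, : i + 1] = True
    return mask

m = hybrid_mask(n_image=4, n_text=3)
assert m[0, 3]       # an image token sees a *later* image token
assert not m[0, 4]   # image tokens do not attend to text
assert m[5, 2]       # a text token sees all image tokens
assert not m[4, 5]   # a text token cannot see a future text token
```

In a real stack this boolean mask would be handed to the attention kernel (e.g. as a block mask); the point is that a single matrix encodes both regimes.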
Output interface: Chain-of-Perception
How do you represent a variable number of objects without exploding latency? Falcon Perception uses a compact, structured interface with three steps per instance: <coord> → <size> → <seg>.
<coord> predicts the object center and resolves which instance the model is referring to.
<size> estimates the spatial extent.
<seg> is an embedding that, via dot product with upsampled features, produces a high-resolution mask.
This order reduces ambiguity and turns the mask stage into a geometry-conditioned refinement. You don't need to decode polygons token-by-token, which keeps costs reasonable.
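The per-instance interface can be pictured as a tiny decode loop. Here `step_fn` is a hypothetical stand-in for one autoregressive step of the model; the data layout is an assumption for illustration:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Instance:
    coord: tuple       # (cx, cy) normalized center: resolves *which* object
    size: tuple        # (w, h) normalized spatial extent
    seg: np.ndarray    # embedding; dot product with upsampled features -> mask

def decode_instance(step_fn) -> Instance:
    """One Chain-of-Perception round: <coord> -> <size> -> <seg>."""
    cx, cy = step_fn("coord")   # center first: disambiguates the instance
    w, h = step_fn("size")      # then extent: geometry-conditioned context
    seg = step_fn("seg")        # finally a single embedding, not a polygon
    return Instance((cx, cy), (w, h), seg)

# Dummy "model": returns fixed values per token type, just to run the loop.
def fake_step(token_type):
    return {"coord": (0.5, 0.5),
            "size": (0.2, 0.3),
            "seg": np.zeros(8)}[token_type]

inst = decode_instance(fake_step)
```

Because `<seg>` is one embedding rather than a token-by-token polygon, the per-instance cost stays constant regardless of mask complexity.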
Heads and coordinate encoding
Coordinates and sizes are decoded with a head that uses Fourier features: a random Gaussian projection followed by sines and cosines to overcome spectral bias and allow precise continuous localization.
The decoded coordinates are re-injected into the sequence as conditioning for later steps.
The segmentation head does a dot product between the hidden state of <seg> and the content-dependent upsampled features.
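A minimal sketch of both heads, assuming a Tancik-style random Fourier feature map; the projection scale, dimensions and dummy tensors are placeholders, not the released head:

```python
import numpy as np

rng = np.random.default_rng(0)

def fourier_features(x: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Random Fourier features: project coordinates through a fixed Gaussian
    matrix B, then take sin/cos. This counters the spectral bias of MLPs on
    low-dimensional continuous inputs, enabling precise localization."""
    proj = 2 * np.pi * x @ B.T
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

def mask_logits(seg_embed: np.ndarray, up_feats: np.ndarray) -> np.ndarray:
    """Segmentation head: dot product of the <seg> hidden state with
    per-pixel upsampled features gives per-pixel mask logits."""
    # up_feats: (H, W, D), seg_embed: (D,) -> logits: (H, W)
    return up_feats @ seg_embed

B = rng.normal(scale=10.0, size=(16, 2))  # Gaussian projection; scale is a free hyperparameter
coords = np.array([[0.25, 0.75]])         # normalized (x, y)
feats = fourier_features(coords, B)       # shape (1, 32): 16 sines + 16 cosines

up = rng.normal(size=(8, 8, 4))           # dummy upsampled HR features
logits = mask_logits(rng.normal(size=4), up)  # shape (8, 8)
```

The same Fourier map can be reused to encode decoded coordinates before re-injecting them into the sequence as conditioning.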
Data, preprocessing and training curriculum
They built a multi-stage pipeline to get balanced coverage and difficulty:
Hierarchical clustering with DINOv3 embeddings for uniform coverage of concepts.
Dense listing driven by VLMs to generate per-image descriptions, labeled by PBench difficulty.
Negative mining to counter hallucinations.
Assembly consensus among SAM 3, Qwen3-VL-30B and Moondream3 for automatic acceptance; disagreements go to human annotators.
Strict 1:1 ratio of positives to negatives to prioritize presence calibration.
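One plausible reading of the consensus step is pairwise mask agreement: accept automatically when every pair of teachers agrees, otherwise escalate. The IoU criterion and the 0.8 threshold are assumptions here (masks shown as pixel-index sets):

```python
def pairwise_iou(a: set, b: set) -> float:
    """IoU between two masks represented as sets of pixel indices."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def auto_accept(masks: list, thresh: float = 0.8) -> bool:
    """Accept a sample automatically when every pair of teacher masks agrees
    above `thresh`; otherwise route it to human annotators.
    Illustrative: the exact consensus rule and threshold are assumptions."""
    for i in range(len(masks)):
        for j in range(i + 1, len(masks)):
            if pairwise_iou(masks[i], masks[j]) < thresh:
                return False
    return True

sam3 = {1, 2, 3, 4}
qwen = {1, 2, 3, 4, 5}
moon = {1, 2, 3}
disagree = auto_accept([sam3, qwen, moon])  # sam3 vs moon: IoU 0.75 < 0.8
```

The appeal of a rule like this is that human effort concentrates exactly on the samples where strong models disagree.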
Main training phases:
In-Context Listing (450 GT): learns to list scene inventories with full causal attention between queries. This builds a global understanding of the scene.
Task Alignment (225 GT): queries are not visible to each other, simulating independent prompts; training focuses on presence classification and localization.
Long-Context Finetuning (10 GT): expands context to handle dense scenes without losing earlier capabilities.
They also use multi-teacher distillation for visual initialization: DINOv3 (ViT-H) for local features and SigLIP2 for language alignment. Distillation yields 74.25% zero-shot on ImageNet-1k and 85.11% linear-probe mIoU on Pascal VOC.
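A hedged sketch of what multi-teacher feature distillation with the Gram-feature regularization mentioned in the ablations could look like; the actual loss functions are not specified in the source, so both objectives here are common stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_distill_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    """1 - cosine similarity per token, averaged: a common feature-distillation
    objective (an assumption; not necessarily what was used for DINOv3/SigLIP2)."""
    s = student / np.linalg.norm(student, axis=-1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

def gram_reg(student: np.ndarray, init: np.ndarray) -> float:
    """Gram-matrix regularizer: keep pairwise token similarities close to
    those at initialization, to limit drift during distillation (sketch)."""
    return float(np.mean((student @ student.T - init @ init.T) ** 2))

tokens = rng.normal(size=(5, 8))            # dummy per-token features
assert cosine_distill_loss(tokens, tokens) < 1e-9   # identical features: ~0 loss
assert gram_reg(tokens, tokens) == 0.0              # no drift at init
```

In a multi-teacher setup the total loss would be a weighted sum of one such term per teacher plus the regularizer.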
PBench: diagnostic benchmark
PBench segments evaluation by dominant capabilities, avoiding a single opaque score. Levels and examples:
| Level | Capability | Example prompt |
|-------|------------|----------------|
| L0 | Simple objects | car |
| L1 | Attributes and subtypes | red car |
| L2 | OCR-guided | Diet Coke bottle |
| L3 | Spatial understanding | car on the left |
| L4 | Relations and interactions | person holding umbrella |
| Dense | Very crowded scenes | hundreds of instances |
Each sample targets a dominant capability, which lets you profile strengths and weaknesses and decide whether to invest in data, curriculum or extra components.
Ablations and design choices
Practical outcomes from specific decisions:
Muon optimizer for specialized heads vs AdamW: +4.8 points in SA-Co detection.
Rasterized instance ordering vs random: +10 points in SA-Co.
Gram-feature regularization to avoid distillation drift: +1.5 points in segmentation.
Global loss normalization to correct biases when packing variable sequences in FSDP.
These choices show that small training and ordering tweaks can be as important as the core architecture.
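The packing bias behind global loss normalization is easy to see numerically: averaging per sequence over-weights tokens from short sequences, while global normalization weighs every token equally. A toy sketch (in FSDP the token count would additionally be all-reduced across ranks):

```python
import numpy as np

def per_seq_mean(losses: list) -> float:
    """Biased: each *sequence* weighs equally, so tokens in short
    sequences count disproportionately."""
    return float(np.mean([l.mean() for l in losses]))

def global_norm(losses: list) -> float:
    """Global normalization: sum all token losses, divide by the total
    token count, so each *token* weighs equally."""
    total = sum(float(l.sum()) for l in losses)
    count = sum(l.size for l in losses)
    return total / count

short = np.array([4.0])   # 1-token sequence with high loss
long = np.ones(9)         # 9-token sequence with loss 1.0 per token

per_seq_mean([short, long])   # 2.5: the short sequence dominates
global_norm([short, long])    # 1.3: every token weighs equally
```

When variable-length sequences are packed into fixed batches, this difference compounds across steps and shows up as the bias the authors correct for.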
Results and comparison with SAM 3
On the SA-Co open-vocabulary segmentation benchmark, Falcon Perception (0.6B parameters) reaches Macro-F1 68.0 versus SAM 3's 62.3. The biggest gaps appear on compositional prompts: OCR, spatiality and relations.
Metric-wise:
Macro-F1: 68.0 vs 62.3 (SAM 3)
Presence / calibration (MCC): 0.64 vs 0.82 (SAM 3); this is Falcon's main area for improvement.
Breakdown by capability (gains over SAM 3):
| Capability | SAM 3 | Falcon Perception | Gap |
|------------|-------|-------------------|-----|
| L0 simple | 64.3 | 65.1 | +0.8 |
| L1 attributes | 54.4 | 63.6 | +9.2 |
| L2 OCR-guided | 24.6 | 38.0 | +13.4 |
| L3 spatial | 31.6 | 53.5 | +21.9 |
| L4 relations | 33.3 | 49.1 | +15.8 |
| Dense | 58.4 | 72.6 | +14.2 |
Falcon shines especially when the task demands compositionality: reading text on objects, resolving spatial relations, or scaling to hundreds of instances.
Falcon OCR: compact, fast OCR
They also release Falcon OCR, a 0.3B-parameter model designed from scratch for OCR with the same early-fusion idea and hybrid mask. Trained on a curated mix for document parsing, formulas and tables.
Key results:
olmOCR: 80.3% overall, leading on Multi-Column (87.1%) and Tables (90.3%).
OmniDocBench: 88.64, ahead of several larger and/or proprietary models.
The practical advantage is throughput: at 0.3B parameters it's 3x smaller than many ~0.9B solutions, enabling higher deployment performance. In their measurement with vLLM and an A100-80GB they reach 5,825 tok/s and 2.9 img/s in Layout + OCR mode.
Inference infrastructure
To serve these workloads and the caching they require, they implemented a stack on top of PyTorch FlexAttention with key features:
Paged KV cache with virtual page tables to avoid wasted memory from padding.
Continuous batching to insert new sequences mid-generation.
CUDA graph capture for the decode loop.
Background tokenization and an HR feature cache to skip costly upsampling when an image is reused.
On H100, typical latencies: ~100 ms prefill, ~200 ms upsampling (0 ms if cached), ~50 ms decode for a few objects. Values depend on resolution and sequence length.
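A minimal sketch of such an HR feature cache, assuming an LRU policy keyed by a hash of the image bytes; the real stack's cache policy and capacity are not public:

```python
import hashlib
from collections import OrderedDict

class HRFeatureCache:
    """LRU cache for upsampled high-resolution features, so a reused image
    skips the costly (~200 ms) upsampling step entirely."""

    def __init__(self, capacity: int = 32):
        self.capacity = capacity
        self._store = OrderedDict()

    @staticmethod
    def _key(image_bytes: bytes) -> str:
        return hashlib.sha256(image_bytes).hexdigest()

    def get_or_compute(self, image_bytes: bytes, compute_fn):
        k = self._key(image_bytes)
        if k in self._store:
            self._store.move_to_end(k)       # cache hit: upsampling skipped
            return self._store[k]
        feats = compute_fn(image_bytes)      # cache miss: pay upsampling cost
        self._store[k] = feats
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least-recently-used entry
        return feats

calls = []
cache = HRFeatureCache(capacity=2)

def upsample(b: bytes):
    calls.append(b)          # record every real upsampling call
    return len(b)            # dummy "features"

cache.get_or_compute(b"img1", upsample)
cache.get_or_compute(b"img1", upsample)   # second call is a hit: no recompute
```

In production the same keying idea extends naturally to multi-query workloads: many prompts against one image pay the upsampling cost once.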
Limitations and next steps
Falcon Perception shows that a single early-fusion backbone can compete with and outperform more complex pipelines in many abilities, but it's not perfect:
Presence calibration remains the main weak point (MCC 0.64).
Some advantages rely on a very careful data pipeline and curriculum.
Scaling context and mixing more textual and visual data are natural paths to improve.
What if you need more parallel queries or ultra-low latencies in production? The design doesn't block scaling paths: you can increase context, data and parameters while keeping the stack simple.
Falcon Perception is an invitation to rethink the classic multimodal perception architecture. Instead of assembling more modules, the lesson is that with the right attention mask, a compact interface and a well-thought-out data curriculum, a single Transformer can solve grounding, OCR and segmentation competitively.