Falcon Perception presents a clear bet: a single early-fusion Transformer that processes image and text in the same stack to do grounding and open-vocabulary segmentation. Sounds like a risky simplification? Yes, but the results and design show why that simplicity can win on speed, clarity and scalability.
Central design
The idea is straightforward and elegant. Instead of the classic vision-encoder + fusion-decoder recipe, Falcon Perception uses an autoregressive Transformer that consumes a unified sequence of image patches, text tokens and task tokens from the very first layer.
Image and text share the same parameter space.
It uses a hybrid attention mask to respect the distinct structure of both domains.
The hybrid mask is key: image tokens attend bidirectionally among themselves to build global visual context. Text and task tokens follow causal attention with respect to everything before them — including the visual tokens — to allow autoregressive generation of answers and lists of instances.
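The hybrid mask described above is easy to sketch; this is an illustrative NumPy version for a sequence laid out as [image | text/task], not the released implementation:

```python
import numpy as np

def hybrid_mask(n_image: int, n_text: int) -> np.ndarray:
    """Boolean attention mask; mask[i, j] is True when token i may attend to j.

    Image tokens attend bidirectionally among themselves; text/task tokens
    attend causally to everything before them, including all image tokens.
    """
    n = n_image + n_text
    mask = np.zeros((n, n), dtype=bool)
    # Image block: full bidirectional attention.
    mask[:n_image, :n_image] = True
    # Text/task block: causal over the whole prefix (image + earlier text).
    for i in range(n_image, n):
        mask[i, : i + 1] = True
    return mask

m = hybrid_mask(n_image=4, n_text=3)
assert m[0, 3]       # an image token sees a *later* image token
assert not m[0, 4]   # image tokens do not attend to text
assert m[5, 2]       # a text token sees all image tokens
assert not m[4, 5]   # a text token cannot see a future text token
```

In a real stack this boolean mask would be handed to the attention kernel (e.g. as a block mask); the point is that a single matrix encodes both regimes.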
Output interface: Chain-of-Perception
How do you represent a variable number of objects without exploding latency? Falcon Perception uses a compact, structured interface with three steps per instance: <coord> → <size> → <seg>.
<coord> predicts the object center and resolves which instance the model is referring to.
<size> estimates the spatial extent.
<seg> is an embedding that, via dot product with upsampled features, produces a high-resolution mask.
This order reduces ambiguity and turns the mask stage into a geometry-conditioned refinement. You don't need to decode polygons token-by-token, which keeps costs reasonable.
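The per-instance interface can be pictured as a tiny decode loop. Here `step_fn` is a hypothetical stand-in for one autoregressive step of the model; the data layout is an assumption for illustration:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Instance:
    coord: tuple       # (cx, cy) normalized center: resolves *which* object
    size: tuple        # (w, h) normalized spatial extent
    seg: np.ndarray    # embedding; dot product with upsampled features -> mask

def decode_instance(step_fn) -> Instance:
    """One Chain-of-Perception round: <coord> -> <size> -> <seg>."""
    cx, cy = step_fn("coord")   # center first: disambiguates the instance
    w, h = step_fn("size")      # then extent: geometry-conditioned context
    seg = step_fn("seg")        # finally a single embedding, not a polygon
    return Instance((cx, cy), (w, h), seg)

# Dummy "model": returns fixed values per token type, just to run the loop.
def fake_step(token_type):
    return {"coord": (0.5, 0.5),
            "size": (0.2, 0.3),
            "seg": np.zeros(8)}[token_type]

inst = decode_instance(fake_step)
```

Because `<seg>` is one embedding rather than a token-by-token polygon, the per-instance cost stays constant regardless of mask complexity.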
Heads and coordinate encoding
Coordinates and sizes are decoded with a head that uses Fourier features: a random Gaussian projection followed by sines and cosines to overcome spectral bias and allow precise continuous localization.
The decoded coordinates are re-injected into the sequence as conditioning for later steps.
The segmentation head does a dot product between the hidden state of <seg> and the content-dependent upsampled features.
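A minimal sketch of both heads, assuming a Tancik-style random Fourier feature map; the projection scale, dimensions and dummy tensors are placeholders, not the released head:

```python
import numpy as np

rng = np.random.default_rng(0)

def fourier_features(x: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Random Fourier features: project coordinates through a fixed Gaussian
    matrix B, then take sin/cos. This counters the spectral bias of MLPs on
    low-dimensional continuous inputs, enabling precise localization."""
    proj = 2 * np.pi * x @ B.T
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

def mask_logits(seg_embed: np.ndarray, up_feats: np.ndarray) -> np.ndarray:
    """Segmentation head: dot product of the <seg> hidden state with
    per-pixel upsampled features gives per-pixel mask logits."""
    # up_feats: (H, W, D), seg_embed: (D,) -> logits: (H, W)
    return up_feats @ seg_embed

B = rng.normal(scale=10.0, size=(16, 2))  # Gaussian projection; scale is a free hyperparameter
coords = np.array([[0.25, 0.75]])         # normalized (x, y)
feats = fourier_features(coords, B)       # shape (1, 32): 16 sines + 16 cosines

up = rng.normal(size=(8, 8, 4))           # dummy upsampled HR features
logits = mask_logits(rng.normal(size=4), up)  # shape (8, 8)
```

The same Fourier map can be reused to encode decoded coordinates before re-injecting them into the sequence as conditioning.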
Data, preprocessing and training curriculum
They built a multi-stage pipeline to get balanced coverage and difficulty:
Hierarchical clustering with DINOv3 embeddings for uniform coverage of concepts.
Dense listing driven by VLMs to generate per-image descriptions, labeled by PBench difficulty.
Negative mining to counter hallucinations.
Assembly consensus among SAM 3, Qwen3-VL-30B and Moondream3 for automatic acceptance; disagreements go to human annotators.
Strict 1:1 ratio of positives to negatives to prioritize presence calibration.
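One plausible reading of the consensus step is pairwise mask agreement: accept automatically when every pair of teachers agrees, otherwise escalate. The IoU criterion and the 0.8 threshold are assumptions here (masks shown as pixel-index sets):

```python
def pairwise_iou(a: set, b: set) -> float:
    """IoU between two masks represented as sets of pixel indices."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def auto_accept(masks: list, thresh: float = 0.8) -> bool:
    """Accept a sample automatically when every pair of teacher masks agrees
    above `thresh`; otherwise route it to human annotators.
    Illustrative: the exact consensus rule and threshold are assumptions."""
    for i in range(len(masks)):
        for j in range(i + 1, len(masks)):
            if pairwise_iou(masks[i], masks[j]) < thresh:
                return False
    return True

sam3 = {1, 2, 3, 4}
qwen = {1, 2, 3, 4, 5}
moon = {1, 2, 3}
disagree = auto_accept([sam3, qwen, moon])  # sam3 vs moon: IoU 0.75 < 0.8
```

The appeal of a rule like this is that human effort concentrates exactly on the samples where strong models disagree.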
Main training phases:
In-Context Listing (450 GT): learns to list scene inventories with full causal attention between queries. This builds a global understanding of the scene.
Task Alignment (225 GT): queries are not visible to each other, simulating independent prompts; training focuses on presence classification and localization.
Long-Context Finetuning (10 GT): expands context to handle dense scenes without losing earlier capabilities.
They also use multi-teacher distillation for visual initialization: DINOv3 (ViT-H) for local features and SigLIP2 for language alignment. Distillation yields 74.25% zero-shot on ImageNet-1k and 85.11% linear-probe mIoU on Pascal VOC.
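A hedged sketch of what multi-teacher feature distillation with the Gram-feature regularization mentioned in the ablations could look like; the actual loss functions are not specified in the source, so both objectives here are common stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_distill_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    """1 - cosine similarity per token, averaged: a common feature-distillation
    objective (an assumption; not necessarily what was used for DINOv3/SigLIP2)."""
    s = student / np.linalg.norm(student, axis=-1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

def gram_reg(student: np.ndarray, init: np.ndarray) -> float:
    """Gram-matrix regularizer: keep pairwise token similarities close to
    those at initialization, to limit drift during distillation (sketch)."""
    return float(np.mean((student @ student.T - init @ init.T) ** 2))

tokens = rng.normal(size=(5, 8))            # dummy per-token features
assert cosine_distill_loss(tokens, tokens) < 1e-9   # identical features: ~0 loss
assert gram_reg(tokens, tokens) == 0.0              # no drift at init
```

In a multi-teacher setup the total loss would be a weighted sum of one such term per teacher plus the regularizer.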
PBench: diagnostic benchmark
PBench segments evaluation by dominant capabilities, avoiding a single opaque score. Levels and examples:
| Level | Capability | Example prompt |
|-------|------------|----------------|
| L0 | Simple objects | car |
| L1 | Attributes and subtypes | red car |
| L2 | OCR-guided | Diet Coke bottle |
| L3 | Spatial understanding | car on the left |
| L4 | Relations and interactions | person holding umbrella |
| Dense | Very crowded scenes | hundreds of instances |
Each sample targets a dominant capability, which lets you profile strengths and weaknesses and decide whether to invest in data, curriculum or extra components.
Ablations and design choices
Practical outcomes from specific decisions:
Muon optimizer for specialized heads vs AdamW: +4.8 points in SA-Co detection.
Rasterized instance ordering vs random: +10 points in SA-Co.
Gram-feature regularization to avoid distillation drift: +1.5 points in segmentation.
Global loss normalization to correct biases when packing variable sequences in FSDP.
These choices show that small training and ordering tweaks can be as important as the core architecture.
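The packing bias behind global loss normalization is easy to see numerically: averaging per sequence over-weights tokens from short sequences, while global normalization weighs every token equally. A toy sketch (in FSDP the token count would additionally be all-reduced across ranks):

```python
import numpy as np

def per_seq_mean(losses: list) -> float:
    """Biased: each *sequence* weighs equally, so tokens in short
    sequences count disproportionately."""
    return float(np.mean([l.mean() for l in losses]))

def global_norm(losses: list) -> float:
    """Global normalization: sum all token losses, divide by the total
    token count, so each *token* weighs equally."""
    total = sum(float(l.sum()) for l in losses)
    count = sum(l.size for l in losses)
    return total / count

short = np.array([4.0])   # 1-token sequence with high loss
long = np.ones(9)         # 9-token sequence with loss 1.0 per token

per_seq_mean([short, long])   # 2.5: the short sequence dominates
global_norm([short, long])    # 1.3: every token weighs equally
```

When variable-length sequences are packed into fixed batches, this difference compounds across steps and shows up as the bias the authors correct for.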
Results and comparison with SAM 3
On the SA-Co open-vocabulary segmentation benchmark, Falcon Perception (0.6B parameters) reaches Macro-F1 68.0 versus SAM 3's 62.3. The biggest gaps appear on compositional prompts: OCR, spatiality and relations.
Metric-wise:
Macro-F1: 68.0 vs 62.3 (SAM 3)
Presence / calibration (MCC): 0.64 vs 0.82 (SAM 3); this is Falcon's main area for improvement.
Breakdown by capability (gains over SAM 3):
| Capability | SAM 3 | Falcon Perception | Gap |
|------------|-------|-------------------|-----|
| L0 simple | 64.3 | 65.1 | +0.8 |
| L1 attributes | 54.4 | 63.6 | +9.2 |
| L2 OCR-guided | 24.6 | 38.0 | +13.4 |
| L3 spatial | 31.6 | 53.5 | +21.9 |
| L4 relations | 33.3 | 49.1 | +15.8 |
| Dense | 58.4 | 72.6 | +14.2 |
Falcon shines especially when the task demands compositionality: reading text on objects, resolving spatial relations, or scaling to hundreds of instances.
Falcon OCR: compact, fast OCR
They also release Falcon OCR, a 0.3B-parameter model designed from scratch for OCR with the same early-fusion idea and hybrid mask. Trained on a curated mix for document parsing, formulas and tables.
Key results:
olmOCR: 80.3% overall, leading on Multi-Column (87.1%) and Tables (90.3%).
OmniDocBench: 88.64, ahead of several larger and/or proprietary models.
The practical advantage is throughput: at 0.3B parameters it's 3x smaller than many ~0.9B solutions, enabling higher deployment performance. In their measurement with vLLM and an A100-80GB they reach 5,825 tok/s and 2.9 img/s in Layout + OCR mode.
Inference infrastructure
To serve these workloads and the caching they require, they implemented a stack on top of PyTorch FlexAttention with key features:
Paged KV cache with virtual page tables to avoid wasted memory from padding.
Continuous batching to insert new sequences mid-generation.
CUDA graph capture for the decode loop.
Background tokenization and an HR feature cache to skip costly upsampling when an image is reused.
On H100, typical latencies: ~100 ms prefill, ~200 ms upsampling (0 ms if cached), ~50 ms decode for a few objects. Values depend on resolution and sequence length.
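A minimal sketch of such an HR feature cache, assuming an LRU policy keyed by a hash of the image bytes; the real stack's cache policy and capacity are not public:

```python
import hashlib
from collections import OrderedDict

class HRFeatureCache:
    """LRU cache for upsampled high-resolution features, so a reused image
    skips the costly (~200 ms) upsampling step entirely."""

    def __init__(self, capacity: int = 32):
        self.capacity = capacity
        self._store = OrderedDict()

    @staticmethod
    def _key(image_bytes: bytes) -> str:
        return hashlib.sha256(image_bytes).hexdigest()

    def get_or_compute(self, image_bytes: bytes, compute_fn):
        k = self._key(image_bytes)
        if k in self._store:
            self._store.move_to_end(k)       # cache hit: upsampling skipped
            return self._store[k]
        feats = compute_fn(image_bytes)      # cache miss: pay upsampling cost
        self._store[k] = feats
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least-recently-used entry
        return feats

calls = []
cache = HRFeatureCache(capacity=2)

def upsample(b: bytes):
    calls.append(b)          # record every real upsampling call
    return len(b)            # dummy "features"

cache.get_or_compute(b"img1", upsample)
cache.get_or_compute(b"img1", upsample)   # second call is a hit: no recompute
```

In production the same keying idea extends naturally to multi-query workloads: many prompts against one image pay the upsampling cost once.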
Limitations and next steps
Falcon Perception shows that a single early-fusion backbone can compete with and outperform more complex pipelines in many abilities, but it's not perfect:
Presence calibration remains the main weak point (MCC 0.64).
Some advantages rely on a very careful data pipeline and curriculum.
Scaling context and mixing more textual and visual data are natural paths to improve.
What if you need more parallel queries or ultra-low latencies in production? The design doesn't block scaling paths: you can increase context, data and parameters while keeping the stack simple.
Falcon Perception is an invitation to rethink the classic multimodal perception architecture. Instead of assembling more modules, the lesson is that with the right attention mask, a compact interface and a well-thought-out data curriculum, a single Transformer can solve grounding, OCR and segmentation competitively.