WildDet3D: open 3D detection from a single image
Imagine taking a photo of a street, tapping the image, and a system telling you not only 'what' objects are there but exactly where they are in the world: distance, size, and orientation. Sound like sci‑fi? That's exactly what WildDet3D offers: an open model that performs monocular 3D detection from a single image and accepts multiple ways of asking for what you want.
What is WildDet3D and why it matters
WildDet3D predicts 3D bounding boxes in metric coordinates from a single RGB image. It can take queries by category name (for example, 'bench'), by point (you touched the object), or by a 2D box (you give a prior detection and it lifts it to 3D). Why does that matter? Because many real applications need to know where things are in space: autonomous vehicles in construction zones, robots in warehouses, AR apps placing directions on the street.
Also, WildDet3D doesn't require a specific camera type: it accepts phone photos, wide‑angle action cameras, or robotic streams. And when extra geometric signals are available (sparse depth, LiDAR, ToF), it incorporates them to refine its predictions.
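The three query modes can be pictured as a small tagged union. This is a hypothetical sketch of how a single entry point might dispatch on them; the names (`TextPrompt`, `PointPrompt`, `BoxPrompt`, `describe`) are illustrative, not WildDet3D's actual API.

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical prompt types mirroring the three query modes the article
# describes; the real WildDet3D interface may differ.

@dataclass
class TextPrompt:
    category: str            # e.g. "bench"

@dataclass
class PointPrompt:
    x: float                 # pixel coordinates of a tap on the image
    y: float

@dataclass
class BoxPrompt:
    x1: float                # a prior 2D detection to lift to 3D
    y1: float
    x2: float
    y2: float

Prompt = Union[TextPrompt, PointPrompt, BoxPrompt]

def describe(prompt: Prompt) -> str:
    """Dispatch on prompt type, as a unified entry point might."""
    if isinstance(prompt, TextPrompt):
        return f"find all '{prompt.category}' instances"
    if isinstance(prompt, PointPrompt):
        return f"find the object at pixel ({prompt.x}, {prompt.y})"
    return "lift the given 2D box to a 3D box"

print(describe(TextPrompt("bench")))  # find all 'bench' instances
```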
Architecture: simple in blocks, powerful in results
The design combines three components that run in parallel and fuse together:
A 2D detector based on the SAM3 backbone that accepts the three prompt types (text, point, box).
A geometry backend with a frozen DINOv2 encoder and a trainable depth decoder that produces per‑pixel features carrying geometric information.
A 3D detection head that fuses the 2D detections with the depth features via cross-attention to output 3D boxes with position, dimensions, and orientation.
A key detail: the geometry backend is modular. That means you can swap the depth model without rewriting the whole architecture. The decoder also uses a 'ray-aware' representation that embeds camera geometry via spherical harmonic encodings of ray directions, avoiding the need for a separate camera calibration branch.
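The 'ray-aware' idea can be sketched with numpy: back-project each pixel through the intrinsics to a unit ray direction, then encode it with low-order real spherical harmonics. This is a minimal sketch assuming a standard pinhole model and degree-1 harmonics; the paper's actual encoding order and normalization may differ.

```python
import numpy as np

def ray_directions(K: np.ndarray, h: int, w: int) -> np.ndarray:
    """Unit ray direction per pixel, from a 3x3 pinhole intrinsics matrix K."""
    u, v = np.meshgrid(np.arange(w) + 0.5, np.arange(h) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)        # (h, w, 3)
    rays = pix @ np.linalg.inv(K).T                          # back-project
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)

def sh_encode(dirs: np.ndarray) -> np.ndarray:
    """Real spherical harmonics up to degree 1 (4 channels per pixel)."""
    x, y, z = dirs[..., 0], dirs[..., 1], dirs[..., 2]
    return np.stack([
        0.28209479 * np.ones_like(x),   # Y_0^0 (constant)
        0.48860251 * y,                 # Y_1^-1
        0.48860251 * z,                 # Y_1^0
        0.48860251 * x,                 # Y_1^1
    ], axis=-1)

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
enc = sh_encode(ray_directions(K, 480, 640))
print(enc.shape)  # (480, 640, 4)
```

Because the encoding depends only on K and the image size, the same network consumes photos from any camera without a dedicated calibration branch.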
When sparse depth is available at inference (LiDAR, RGB‑D, stereo), it's integrated without changing the overall pipeline, improving localization.
Practical point: modularity makes experiments easier. If you already have a better depth decoder, plug it in and improve accuracy without redoing the detector.
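The fusion step described above can be illustrated with a toy numpy cross-attention, where detection queries attend over per-pixel depth features before a head regresses the 3D box parameters. All shapes, the 7-parameter box layout, and the linear head here are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Each detection query attends over the per-pixel depth features."""
    d = queries.shape[-1]
    attn = softmax(queries @ keys.T / np.sqrt(d))   # (n_det, n_pix)
    return attn @ values                             # (n_det, d)

n_det, n_pix, d = 5, 1024, 64
det_queries = rng.standard_normal((n_det, d))   # from the 2D detector
depth_feats = rng.standard_normal((n_pix, d))   # from the geometry backend
fused = cross_attention(det_queries, depth_feats, depth_feats)

# A hypothetical linear head regressing (x, y, z, w, h, l, yaw) per box:
W_head = rng.standard_normal((d, 7))
boxes3d = fused @ W_head
print(boxes3d.shape)  # (5, 7)
```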
The data behind it: WildDet3D-Data
It's not just the model; they release WildDet3D-Data: over 1 million images with 3.7 million verified 3D annotations, covering more than 13,000 categories and a core of 100k images annotated by humans. How did they build it? They generated 3D candidates from 2D datasets (COCO, LVIS, Objects365, V3Det) using five complementary methods, refined and filtered them, and validated with VLMs plus human selection. That variety is what lets the model generalize beyond fixed taxonomies.
Performance and zero-shot transfer (yes, it really works)
They evaluated on several fronts:
Omni3D (6 datasets, 50 categories): 34.2 AP with text prompts (a 5.8 point improvement over 3D-MOOD), and 36.4 AP with an oracle box, training for only 12 epochs versus the 80–120 of prior methods.
With sparse depth at test time: it goes up to 41.6 AP (text) and 45.8 AP (oracle), with large gains indoors.
To test generalization beyond Omni3D:
Argoverse 2 (driving): 40.3 ODS vs 23.8 previously.
ScanNet (indoor): 48.9 ODS, a 17.4 point gain.
Improvements are larger on novel categories not seen during training: for example, WildDet3D reaches 38.6 ODS on new categories in Argoverse 2 versus 14.8 for the previous best.
On Stereo4D (benchmark with real stereo depth) it showed 7.5 AP without depth; with depth it rises to 27.7 AP in oracle box mode.
On the WildDet3D-Bench (700+ categories): trained only on Omni3D it reaches 6.8 AP in text mode (vs 2.3 baseline). With the full data it climbs to 22.6 AP, and with ground‑truth depth it hits 41.6 AP. The jump on rare categories is huge: 47.4 AP vs 2.4 for the baseline.
In short: better building blocks (SAM3, DINOv2) + diverse data = real generalization, with less training.
Practical applications, limitations, and next steps
Immediate applications:
Real‑time AR (the team released an iOS app that uses camera and LiDAR to overlay 3D boxes).
Warehouse robots estimating package size and orientation.
Zero‑shot 3D tracking: if a tracker produces 2D boxes, WildDet3D lifts them to 3D frame by frame.
Spatial support for wearables (smart glasses) for persistent environmental awareness.
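The core of lifting a 2D box to 3D is back-projecting it through the camera intrinsics once a metric depth is known. This minimal sketch recovers only the metric position of the box center under a pinhole model; the actual model also regresses dimensions and orientation, which this deliberately omits.

```python
import numpy as np

def lift_box_center(box2d, depth_m, K):
    """Back-project the center of a 2D box to a 3D point in camera space.

    box2d: (x1, y1, x2, y2) in pixels; depth_m: metric depth at the
    center; K: 3x3 pinhole intrinsics. A full lifter would also regress
    size and yaw -- this only recovers the metric position.
    """
    x1, y1, x2, y2 = box2d
    u, v = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    X = (u - cx) / fx * depth_m
    Y = (v - cy) / fy * depth_m
    return np.array([X, Y, depth_m])

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
# A box centered at the principal point, 8 m away:
p = lift_box_center((300.0, 220.0, 340.0, 260.0), depth_m=8.0, K=K)
print(p)  # [0. 0. 8.]
```

Applied per frame to a tracker's 2D boxes, this is the geometric backbone of the zero-shot 3D tracking use case above.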
Limitations to consider:
The full model still needs server‑side compute for top performance; optimization is required for true on‑device real‑time.
Final quality keeps improving with real depth signals; monocular is impressive but doesn't always match dedicated sensors.
As always, real‑world deployments must consider data biases and safety in critical scenarios (vehicles, human‑robot interaction).
Reasonable next steps: optimize latency for the edge, improve energy efficiency, and explore integrations with VLMs for spatially aware conversational interfaces.
The paper and release include the model, dataset, interactive demo, and open evaluation materials. That makes reproducibility easier and lets the community iterate on the work.
The practical question is: what will you build with a model that can see the world in 3D from a single image? Some will improve AR; others, build more useful robots; someone may invent an app we can't even imagine yet.