Model Details
Model Description
FoundationStereo is a large foundation model for stereo depth estimation that achieves strong zero-shot generalization without per-domain fine-tuning. It was trained on a large-scale (1M image pairs), high-fidelity synthetic dataset with high diversity and photorealism.
Developed by:
Shared by:
Model type: Computer Vision
License:
Resources for more information:
Training Details
Training Data
The model was trained on a large-scale synthetic dataset consisting of 1M stereo image pairs, covering structured indoor/outdoor scenes as well as more randomized scenes with challenging flying objects and greater geometry and texture diversity.
Testing Details
Metrics
These results showcase the performance of FoundationStereo on two popular public benchmarks, Middlebury and ETH3D. The results are taken from the respective benchmark leaderboard pages.
| Evaluation Dataset | AvgErr | Bad1.0 | Bad2.0 |
|--------------------|--------|--------|--------|
| ETH3D              | 0.09   | 0.26   | 0.08   |
| Middlebury         | 0.78   | 4.39   | 1.84   |
For the evaluation, the following three popular metrics are considered:
AvgErr: average disparity error (lower is better).
Bad1.0: percentage of pixels with a disparity error larger than 1 pixel (lower is better).
Bad2.0: percentage of pixels with a disparity error larger than 2 pixels (lower is better).
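The three metrics above are straightforward to compute from a predicted and a ground-truth disparity map. A minimal NumPy sketch (the helper name and interface are illustrative, not part of the model card):

```python
import numpy as np

def disparity_metrics(pred, gt, valid=None):
    """Compute AvgErr, Bad1.0 and Bad2.0 between predicted and
    ground-truth disparity maps (illustrative helper)."""
    if valid is None:
        valid = np.isfinite(gt)          # evaluate only where ground truth exists
    err = np.abs(pred[valid] - gt[valid])
    return {
        "AvgErr": err.mean(),                  # mean absolute disparity error
        "Bad1.0": 100.0 * (err > 1.0).mean(),  # % of pixels off by more than 1 px
        "Bad2.0": 100.0 * (err > 2.0).mean(),  # % of pixels off by more than 2 px
    }
```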
Technical Specifications
Input/Output Details
Input:
Name: left
Info: NCHW, BGR normalized image
Name: right
Info: NCHW, BGR normalized image
Output:
Name: disp
Info: NCHW, stereo disparity map
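A minimal sketch of producing the normalized, NCHW BGR input tensors described above. The exact normalization constants are not stated on this page, so the ImageNet-style mean/std below are an assumption; verify them against the actual model export before use:

```python
import numpy as np

def preprocess(bgr, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)):
    """Turn an HWC uint8 BGR frame into an NCHW float32 tensor.
    mean/std are assumed (ImageNet statistics), not confirmed by the model card."""
    x = bgr.astype(np.float32) / 255.0                      # scale to [0, 1]
    x = (x - np.array(mean, np.float32)) / np.array(std, np.float32)
    x = np.transpose(x, (2, 0, 1))                          # HWC -> CHW
    return x[None, ...]                                     # add batch dim -> NCHW

# left_t  = preprocess(left_bgr)   # "left" input
# right_t = preprocess(right_bgr)  # "right" input
```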
Model Architecture
FoundationStereo couples a lightweight EdgeNeXt-S “side-tuning” CNN to a frozen DepthAnything V2 ViT so that pixel-sharp local cues are blended with rich monocular depth priors; it then forms a hybrid 4-D cost volume that keeps both group-wise correlations and those fused features. An Attentive Hybrid Cost Filter denoises the volume, after which a soft-argmin gives an initial disparity map that three coarse-to-fine ConvGRU iterations polish into crisp, detail-preserving depth.
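The soft-argmin step mentioned above turns the filtered cost volume into a sub-pixel disparity estimate by taking a softmax-weighted expectation over the disparity hypotheses. An illustrative NumPy sketch of that operation alone:

```python
import numpy as np

def soft_argmin(cost, axis=0):
    """Soft-argmin over the disparity axis of a cost volume: softmax over
    negated costs weights each disparity hypothesis, giving a differentiable,
    sub-pixel expected disparity (illustrative sketch, not the actual model code)."""
    c = -cost
    c = c - c.max(axis=axis, keepdims=True)        # shift for numerical stability
    p = np.exp(c)
    p /= p.sum(axis=axis, keepdims=True)           # softmax over disparity hypotheses
    d = np.arange(cost.shape[axis], dtype=np.float32)
    shape = [1] * cost.ndim
    shape[axis] = -1
    return (p * d.reshape(shape)).sum(axis=axis)   # expected disparity per pixel
```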