Model Details
Model Description
FoundationStereo is a large foundation model for stereo depth estimation that achieves strong zero-shot generalization without per-domain fine-tuning. It was trained on a large-scale (1M image pairs), high-fidelity synthetic dataset with high diversity and photorealism.
Developed by:
Shared by:
Model type: Computer Vision
License:
Resources for more information:
Training Details
Training Data
The model was trained on a large-scale synthetic dataset consisting of 1M stereo image pairs, covering structured indoor/outdoor scenes as well as more randomized scenes with challenging flying objects and greater geometry and texture diversity.
Testing Details
Metrics
These results showcase the performance of FoundationStereo on two popular public benchmarks, Middlebury and ETH3D. The results are taken from the respective benchmark leaderboard pages.
| Evaluation Dataset | AvgErr | Bad1.0 | Bad2.0 |
|--------------------|--------|--------|--------|
| ETH3D              | 0.09   | 0.26   | 0.08   |
| Middlebury         | 0.78   | 4.39   | 1.84   |
For the evaluation, the following three popular metrics are considered:
AvgErr: average disparity error (lower is better).
Bad1.0: percentage of pixels with a disparity error larger than 1 pixel (lower is better).
Bad2.0: percentage of pixels with a disparity error larger than 2 pixels (lower is better).
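The three metrics above are straightforward to compute from a predicted and a ground-truth disparity map. A minimal NumPy sketch (the helper name and interface are illustrative, not part of the model card):

```python
import numpy as np

def disparity_metrics(pred, gt, valid=None):
    """Compute AvgErr, Bad1.0 and Bad2.0 between predicted and
    ground-truth disparity maps (illustrative helper)."""
    if valid is None:
        valid = np.isfinite(gt)          # evaluate only where ground truth exists
    err = np.abs(pred[valid] - gt[valid])
    return {
        "AvgErr": err.mean(),                  # mean absolute disparity error
        "Bad1.0": 100.0 * (err > 1.0).mean(),  # % of pixels off by more than 1 px
        "Bad2.0": 100.0 * (err > 2.0).mean(),  # % of pixels off by more than 2 px
    }
```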
Technical Specifications
Input/Output Details
Input:
Name: left
Info: NCHW, BGR normalized image
Name: right
Info: NCHW, BGR normalized image
Output:
Name: disp
Info: NCHW, stereo disparity map
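A minimal sketch of producing the normalized, NCHW BGR input tensors described above. The exact normalization constants are not stated on this page, so the ImageNet-style mean/std below are an assumption; verify them against the actual model export before use:

```python
import numpy as np

def preprocess(bgr, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)):
    """Turn an HWC uint8 BGR frame into an NCHW float32 tensor.
    mean/std are assumed (ImageNet statistics), not confirmed by the model card."""
    x = bgr.astype(np.float32) / 255.0                      # scale to [0, 1]
    x = (x - np.array(mean, np.float32)) / np.array(std, np.float32)
    x = np.transpose(x, (2, 0, 1))                          # HWC -> CHW
    return x[None, ...]                                     # add batch dim -> NCHW

# left_t  = preprocess(left_bgr)   # "left" input
# right_t = preprocess(right_bgr)  # "right" input
```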
Model Architecture
FoundationStereo couples a lightweight EdgeNeXt-S “side-tuning” CNN to a frozen DepthAnything V2 ViT so that pixel-sharp local cues are blended with rich monocular depth priors; it then forms a hybrid 4-D cost volume that keeps both group-wise correlations and those fused features. An Attentive Hybrid Cost Filter denoises the volume, after which a soft-argmin gives an initial disparity map that three coarse-to-fine ConvGRU iterations polish into crisp, detail-preserving depth.
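The soft-argmin step mentioned above turns the filtered cost volume into a sub-pixel disparity estimate by taking a softmax-weighted expectation over the disparity hypotheses. An illustrative NumPy sketch of that operation alone:

```python
import numpy as np

def soft_argmin(cost, axis=0):
    """Soft-argmin over the disparity axis of a cost volume: softmax over
    negated costs weights each disparity hypothesis, giving a differentiable,
    sub-pixel expected disparity (illustrative sketch, not the actual model code)."""
    c = -cost
    c = c - c.max(axis=axis, keepdims=True)        # shift for numerical stability
    p = np.exp(c)
    p /= p.sum(axis=axis, keepdims=True)           # softmax over disparity hypotheses
    d = np.arange(cost.shape[axis], dtype=np.float32)
    shape = [1] * cost.ndim
    shape[axis] = -1
    return (p * d.reshape(shape)).sum(axis=axis)   # expected disparity per pixel
```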