DINOv3 Backbone
    Model Details
    Model Description
    DINOv3 is a family of vision foundation models trained using self-supervised learning (SSL), which produces high-quality dense features and achieves outstanding performance across diverse vision tasks without fine-tuning.
    • Developed by: Facebook AI Research (FAIR)
    • Model type: Computer vision - Vision Foundation Model
    • License: Not Defined
    Training Details
    Training Data
    DINOv3 models are trained on LVD-1689M (a curated web dataset with 1.7 billion images) for general-purpose visual representations. Some variants are also trained on SAT-493M (493 million satellite images) for satellite imagery applications.
    Testing Details
    Metrics
    DINOv3 models are evaluated on various downstream tasks, including:
    • Image Classification
    • Object Detection
    • Semantic Segmentation
    • Instance Segmentation
    • Depth Estimation (Monocular)
    • Video Classification
    • Video Tracking
    As an indication, we detail some results on a set of global and dense benchmarks: classification (IN-ReaL, IN-R, ObjectNet), retrieval (Oxford-H), segmentation (ADE20k), depth (NYU), tracking (DAVIS at 960px), and keypoint matching (NAVI, SPair).
    | Model | IN-ReaL | IN-R | Obj. | OX-H | ADE20k | NYU↓ | DAVIS | NAVI | SPair |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | DINOv3-S | 87.0 | 60.4 | 50.9 | 49.5 | 47.0 | 0.403 | 72.7 | 56.3 | 50.4 |
    | DINOv3-S+ | 88.0 | 68.8 | 54.6 | 50.0 | 48.8 | 0.399 | 75.5 | 57.3 | 55.2 |
    | DINOv3-B | 89.3 | 76.7 | 64.1 | 55.6 | 51.8 | 0.354 | 78.2 | 59.4 | 57.0 |
    | DINOv3-L | 90.3 | 86.7 | 71.2 | 65.8 | 53.8 | 0.337 | 79.4 | 63.7 | 62.7 |
    | DINOv3-H+ | 90.3 | 90.0 | 78.6 | 64.5 | 54.8 | 0.352 | 79.3 | 63.3 | 56.3 |
    For more comprehensive evaluation results, please refer to the DINOv3 paper and the official repository.
    Technical Specifications
    Input/Output Details
    • Input:
      • Name: image
      • Info: NCHW BGR 0-255 image.
    • Output:
      • Name: embeddings
      • Info: Dense visual features suitable for various downstream tasks
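    To make the input contract concrete, below is a minimal host-side preprocessing sketch. The file name is a placeholder, and the 384x288 resolution is taken from the vit-small variant listed under Throughput; in a DepthAI pipeline the camera node produces this input for you.
    import cv2
    import numpy as np

    # Illustrative only: prepare an NCHW BGR 0-255 tensor for the
    # dinov3-backbone:vit-small-384x288 variant (width 384, height 288).
    frame = cv2.imread("frame.jpg")                # hypothetical file; OpenCV loads BGR
    frame = cv2.resize(frame, (384, 288))          # (width, height)
    tensor = frame.transpose(2, 0, 1)[np.newaxis]  # HWC -> NCHW
    print(tensor.shape, tensor.dtype)              # (1, 3, 288, 384) uint8, values 0-255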
    Model Architecture
    The DINOv3 architecture varies by model family:
    Vision Transformer (ViT) variants
    • ViT-S (21M parameters): patch size 16, embedding dimension 384, 4 register tokens, 6 heads, MLP FFN, RoPE
    • ViT-S+ (29M parameters): patch size 16, embedding dimension 384, 4 register tokens, 6 heads, SwiGLU FFN, RoPE
    • ViT-B (86M parameters): patch size 16, embedding dimension 768, 4 register tokens, 12 heads, MLP FFN, RoPE
    • ViT-L (300M parameters): patch size 16, embedding dimension 1024, 4 register tokens, 16 heads, MLP FFN, RoPE
    • ViT-H+ (840M parameters): patch size 16, embedding dimension 1280, 4 register tokens, 20 heads, SwiGLU FFN, RoPE
    • ViT-7B (6716M parameters): patch size 16, embedding dimension 4096, 4 register tokens, 32 heads, SwiGLU FFN, RoPE
    ConvNeXt variants
    • ConvNeXt Tiny (29M parameters)
    • ConvNeXt Small (50M parameters)
    • ConvNeXt Base (89M parameters)
    • ConvNeXt Large (198M parameters)
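    Since all ViT variants use patch size 16, the spatial size of the dense output is simply the input size divided by 16. A quick sanity check against the vit-small shapes listed under Throughput below:
    # Dense feature grid follows from the ViT patch size (16).
    height, width, patch = 288, 384, 16   # dinov3-backbone:vit-small-384x288 input
    grid_h, grid_w = height // patch, width // patch
    print(grid_h, grid_w)                 # 18 24 -> matches output shape [1, 384, 18, 24]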
    Throughput
    Model variant: dinov3-backbone:convnext-small-640x480
    • Input shape: [1, 3, 480, 640] • Output shape: [1, 768, 15, 20]
    • Params (M): 49.45 • GFLOPs: 54.29
    | Platform | Precision | Throughput (infs/sec) | Power Consumption (W) |
    | --- | --- | --- | --- |
    | RVC4 | FP16 | 39.85 | 7.89 |
    Model variant: dinov3-backbone:convnext-base-512x384
    • Input shape: [1, 3, 384, 512] • Output shape: [1, 1024, 12, 16]
    • Params (M): 87.56 • GFLOPs: 61.13
    | Platform | Precision | Throughput (infs/sec) | Power Consumption (W) |
    | --- | --- | --- | --- |
    | RVC4 | FP16 | 40.03 | 8.37 |
    Model variant: dinov3-backbone:vit-small-384x288
    • Input shape: [1, 3, 288, 384] • Output shape: [1, 384, 18, 24]
    • Params (M): 28.74 • GFLOPs: 15.11
    | Platform | Precision | Throughput (infs/sec) | Power Consumption (W) |
    | --- | --- | --- | --- |
    | RVC4 | FP16 | 27.33 | 4.17 |
    Model variant: dinov3-backbone:vit-base-256x192
    • Input shape: [1, 3, 192, 256] • Output shape: [1, 768, 12, 16]
    • Params (M): 85.66 • GFLOPs: 17.87
    | Platform | Precision | Throughput (infs/sec) | Power Consumption (W) |
    | --- | --- | --- | --- |
    | RVC4 | FP16 | 51.62 | 5.51 |
    • Benchmarked using 2 threads (and the DSP runtime in balanced mode for RVC4).
    • Parameters and FLOPs are approximate estimates.
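    Since the reported values are estimates, one way to obtain an exact parameter count is to sum the initializer sizes in the downloaded ONNX file. A sketch; the file path is a placeholder:
    import onnx
    import numpy as np

    # Count the parameters stored as initializers in an ONNX export (path is hypothetical)
    model = onnx.load("dinov3-backbone.onnx")
    n_params = sum(int(np.prod(init.dims)) for init in model.graph.initializer)
    print(f"Params (M): {n_params / 1e6:.2f}")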
    Utilization
    Models converted for RVC platforms can be used for inference on OAK devices. DepthAI pipelines define the information flow linking the device, the inference model, and the output parser (as defined in the model head(s)). Below, we present the most important steps for using this particular model. Please consult the docs for more information.
    Install the DepthAI v3 and depthai-nodes libraries:
    pip install depthai
    pip install depthai-nodes
    
    Define model:
    import depthai as dai
    from depthai_nodes import ParsingNeuralNetwork

    # Select the DINOv3 variant to download from the model zoo
    model_description = dai.NNModelDescription(
        "luxonis/dinov3-backbone:vit-small-384x288"
    )

    # Attach the network to a camera node in the pipeline
    nn = pipeline.create(ParsingNeuralNetwork).build(
        <CameraNode>, model_description
    )
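    The snippet above assumes an existing pipeline and camera node; a minimal sketch of one way to create them in DepthAI v3 (using CAM_A as the color camera socket is an assumption about your device):
    import depthai as dai

    pipeline = dai.Pipeline()
    # A possible <CameraNode>: the device's color camera (socket assumed to be CAM_A)
    camera = pipeline.create(dai.node.Camera).build(dai.CameraBoardSocket.CAM_A)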
    
    Inspect model head(s):
    • EmbeddingsParser that outputs dai.NNData containing the output embeddings of the model.
    Get parsed output(s):
    # Host-side queue for the parser output (assumes the v3 createOutputQueue API)
    parser_output_queue = nn.out.createOutputQueue()
    pipeline.start()
    while pipeline.isRunning():
        parser_output: dai.NNData = parser_output_queue.get()
        embeddings = parser_output.getTensor("embeddings")  # model embeddings
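    The raw embeddings can be compared directly, e.g. with cosine similarity for re-identification; a minimal sketch with hypothetical embedding arrays:
    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        # Flatten the dense feature maps into vectors before comparing
        a = a.ravel().astype(np.float32)
        b = b.ravel().astype(np.float32)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # e.g. score = cosine_similarity(embeddings_a, embeddings_b)  # hypothetical arrays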
    
    Example
    You can check out the example and run it with:
    python3 main.py \
        -det luxonis/scrfd-face-detection:10g-640x640 \
        -rec luxonis/dinov3-backbone:vit-small-384x288
    
    License: Not Defined
    Downloads: 6000
    Tasks: Image Embedding
    Model Types: ONNX
    Model Variants
    Four model variants are available for RVC4, each created 6 months ago; see the Throughput section above for the benchmarked variants and their specifications.