DINOv3 Backbone
    Model Details
    Model Description
    DINOv3 is a family of vision foundation models trained using self-supervised learning (SSL), which produces high-quality dense features and achieves outstanding performance across diverse vision tasks without fine-tuning.
    • Developed by: Facebook AI Research (FAIR)
    • Model type: Computer vision - Vision Foundation Model
    • License: Not Defined
    Training Details
    Training Data
    DINOv3 models are trained on LVD-1689M (a curated web dataset with 1.7 billion images) for general-purpose visual representations. Some variants are also trained on SAT-493M (493 million satellite images) for satellite imagery applications.
    Testing Details
    Metrics
    DINOv3 models are evaluated on various downstream tasks, including:
    • Image Classification
    • Object Detection
    • Semantic Segmentation
    • Instance Segmentation
    • Depth Estimation (Monocular)
    • Video Classification
    • Video Tracking
    As an indication, we detail some results on a set of global and dense benchmarks: classification (IN-ReaL, IN-R, ObjectNet), retrieval (Oxford-H), segmentation (ADE20k), depth (NYU), tracking (DAVIS at 960px), and keypoint matching (NAVI, SPair).
    | Model | IN-ReaL | IN-R | Obj. | OX-H | ADE20k | NYU↓ | DAVIS | NAVI | SPair |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | DINOv3-S | 87.0 | 60.4 | 50.9 | 49.5 | 47.0 | 0.403 | 72.7 | 56.3 | 50.4 |
    | DINOv3-S+ | 88.0 | 68.8 | 54.6 | 50.0 | 48.8 | 0.399 | 75.5 | 57.3 | 55.2 |
    | DINOv3-B | 89.3 | 76.7 | 64.1 | 55.6 | 51.8 | 0.354 | 78.2 | 59.4 | 57.0 |
    | DINOv3-L | 90.3 | 86.7 | 71.2 | 65.8 | 53.8 | 0.337 | 79.4 | 63.7 | 62.7 |
    | DINOv3-H+ | 90.3 | 90.0 | 78.6 | 64.5 | 54.8 | 0.352 | 79.3 | 63.3 | 56.3 |
    For more comprehensive evaluation results, please refer to the DINOv3 paper and the official repository.
    Technical Specifications
    Input/Output Details
    • Input:
      • Name: image
      • Info: NCHW BGR 0-255 image.
    • Output:
      • Name: embeddings
      • Info: Dense visual features suitable for various downstream tasks
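    To make the input contract concrete, below is a minimal host-side preprocessing sketch. The file name is a placeholder, and the 384x288 resolution is taken from the vit-small variant listed under Throughput; in a DepthAI pipeline the camera node produces this input for you.
    import cv2
    import numpy as np

    # Illustrative only: prepare an NCHW BGR 0-255 tensor for the
    # dinov3-backbone:vit-small-384x288 variant (width 384, height 288).
    frame = cv2.imread("frame.jpg")                # hypothetical file; OpenCV loads BGR
    frame = cv2.resize(frame, (384, 288))          # (width, height)
    tensor = frame.transpose(2, 0, 1)[np.newaxis]  # HWC -> NCHW
    print(tensor.shape, tensor.dtype)              # (1, 3, 288, 384) uint8, values 0-255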
    Model Architecture
    The DINOv3 architecture varies by model family:
    Vision Transformer (ViT) variants
    • ViT-S (21M parameters): patch size 16, embedding dimension 384, 4 register tokens, 6 heads, MLP FFN, RoPE
    • ViT-S+ (29M parameters): patch size 16, embedding dimension 384, 4 register tokens, 6 heads, SwiGLU FFN, RoPE
    • ViT-B (86M parameters): patch size 16, embedding dimension 768, 4 register tokens, 12 heads, MLP FFN, RoPE
    • ViT-L (300M parameters): patch size 16, embedding dimension 1024, 4 register tokens, 16 heads, MLP FFN, RoPE
    • ViT-H+ (840M parameters): patch size 16, embedding dimension 1280, 4 register tokens, 20 heads, SwiGLU FFN, RoPE
    • ViT-7B (6716M parameters): patch size 16, embedding dimension 4096, 4 register tokens, 32 heads, SwiGLU FFN, RoPE
    ConvNeXt variants
    • ConvNeXt Tiny (29M parameters)
    • ConvNeXt Small (50M parameters)
    • ConvNeXt Base (89M parameters)
    • ConvNeXt Large (198M parameters)
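    Since all ViT variants use patch size 16, the spatial size of the dense output is simply the input size divided by 16. A quick sanity check against the vit-small shapes listed under Throughput below:
    # Dense feature grid follows from the ViT patch size (16).
    height, width, patch = 288, 384, 16   # dinov3-backbone:vit-small-384x288 input
    grid_h, grid_w = height // patch, width // patch
    print(grid_h, grid_w)                 # 18 24 -> matches output shape [1, 384, 18, 24]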
    Throughput
    Model variant: dinov3-backbone:convnext-small-640x480
    • Input shape: [1, 3, 480, 640] • Output shape: [1, 768, 15, 20]
    • Params (M): 49.45 • GFLOPs: 54.29
    | Platform | Precision | Throughput (infs/sec) | Power Consumption (W) |
    | --- | --- | --- | --- |
    | RVC4 | FP16 | 39.85 | 7.89 |
    Model variant: dinov3-backbone:convnext-base-512x384
    • Input shape: [1, 3, 384, 512] • Output shape: [1, 1024, 12, 16]
    • Params (M): 87.56 • GFLOPs: 61.13
    | Platform | Precision | Throughput (infs/sec) | Power Consumption (W) |
    | --- | --- | --- | --- |
    | RVC4 | FP16 | 40.03 | 8.37 |
    Model variant: dinov3-backbone:vit-small-384x288
    • Input shape: [1, 3, 288, 384] • Output shape: [1, 384, 18, 24]
    • Params (M): 28.74 • GFLOPs: 15.11
    | Platform | Precision | Throughput (infs/sec) | Power Consumption (W) |
    | --- | --- | --- | --- |
    | RVC4 | FP16 | 27.33 | 4.17 |
    Model variant: dinov3-backbone:vit-base-256x192
    • Input shape: [1, 3, 192, 256] • Output shape: [1, 768, 12, 16]
    • Params (M): 85.66 • GFLOPs: 17.87
    | Platform | Precision | Throughput (infs/sec) | Power Consumption (W) |
    | --- | --- | --- | --- |
    | RVC4 | FP16 | 51.62 | 5.51 |
    • Benchmarked using 2 threads (and the DSP runtime in balanced mode for RVC4).
    • Parameters and FLOPs are approximate estimates.
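    Since the reported values are estimates, one way to obtain an exact parameter count is to sum the initializer sizes in the downloaded ONNX file. A sketch; the file path is a placeholder:
    import onnx
    import numpy as np

    # Count the parameters stored as initializers in an ONNX export (path is hypothetical)
    model = onnx.load("dinov3-backbone.onnx")
    n_params = sum(int(np.prod(init.dims)) for init in model.graph.initializer)
    print(f"Params (M): {n_params / 1e6:.2f}")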
    Utilization
    Models converted for RVC platforms can be used for inference on OAK devices. DepthAI pipelines define the information flow linking the device, the inference model, and the output parser (as defined in the model head(s)). Below, we present the most important steps for using this particular model. Please consult the docs for more information.
    Install the DepthAI v3 and depthai-nodes libraries:
    pip install depthai
    pip install depthai-nodes
    
    Define model:
    import depthai as dai
    from depthai_nodes import ParsingNeuralNetwork

    # Select the DINOv3 variant to download from the model zoo
    model_description = dai.NNModelDescription(
        "luxonis/dinov3-backbone:vit-small-384x288"
    )

    # Attach the network to a camera node in the pipeline
    nn = pipeline.create(ParsingNeuralNetwork).build(
        <CameraNode>, model_description
    )
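    The snippet above assumes an existing pipeline and camera node; a minimal sketch of one way to create them in DepthAI v3 (using CAM_A as the color camera socket is an assumption about your device):
    import depthai as dai

    pipeline = dai.Pipeline()
    # A possible <CameraNode>: the device's color camera (socket assumed to be CAM_A)
    camera = pipeline.create(dai.node.Camera).build(dai.CameraBoardSocket.CAM_A)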
    
    Inspect model head(s):
    • EmbeddingsParser that outputs dai.NNData containing the output embeddings of the model.
    Get parsed output(s):
    # Host-side queue for the parser output (assumes the v3 createOutputQueue API)
    parser_output_queue = nn.out.createOutputQueue()
    pipeline.start()
    while pipeline.isRunning():
        parser_output: dai.NNData = parser_output_queue.get()
        embeddings = parser_output.getTensor("embeddings")  # model embeddings
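    The raw embeddings can be compared directly, e.g. with cosine similarity for re-identification; a minimal sketch with hypothetical embedding arrays:
    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        # Flatten the dense feature maps into vectors before comparing
        a = a.ravel().astype(np.float32)
        b = b.ravel().astype(np.float32)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # e.g. score = cosine_similarity(embeddings_a, embeddings_b)  # hypothetical arrays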
    
    Example
    You can check out the example and run it with:
    python3 main.py \
        -det luxonis/scrfd-face-detection:10g-640x640 \
        -rec luxonis/dinov3-backbone:vit-small-384x288
    
    License: Not Defined
    Downloads: 6000
    Tasks: Image Embedding
    Model Types: ONNX
    Model Variants
    Four model variants are available for RVC4, each created 6 months ago; see the Throughput section above for the benchmarked variants and their specifications.