Luxonis
    Model Details
    Model Description
    YOLOE (“Real‑Time Seeing Anything”) is a unified, open‑vocabulary object detection and segmentation model. It supports text prompts, visual prompts, and prompt‑free inference, with no additional inference or transfer overhead compared to closed‑set YOLOs.
    • Developed by: Ao Wang*, Lihao Liu*, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding
    • Shared by:
    • Model type: Computer Vision (open‑vocabulary detection & segmentation)
    • License: GNU Affero General Public License v3.0
    • Resources for more information:
    To enhance versatility, YOLOE accepts images together with prompt embeddings, either text (from MobileCLIP‑BLT) or visual, and also offers a prompt‑free mode, enabling real‑time “seeing anything.”
    Training Details
    Training Data
    Training data includes Objects365v1 and GoldG (GQA and Flickr30k).
    Annotations are preprocessed to include segmentation masks.
    Testing Details
    Metrics
    Results of fixed‑AP evaluation on the LVIS minival set (640×640 input). FPS measured on Nvidia T4 (TensorRT) and iPhone 12 (CoreML).
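    | Model | Pre‑train Data | Size | Prompt | Params | Training Data | Time | FPS (T4 / CoreML) | AP (T / V) | APr (T / V) | APc (T / V) | APf (T / V) |
    |---|---|---|---|---|---|---|---|---|---|---|---|
    | YOLOE‑v8‑L | OG (O365 + GoldG) | 640 | Text / Visual | 45 M / 50 M | OG | 22.5 h | 102.5 / 27.2 | 35.9 / 34.2 | 33.2 / 33.2 | 34.8 / 34.6 | 37.3 / 34.1 |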
    OG denotes Objects365v1 + GoldG.
    Technical Specifications
    Input/Output Details
    • Input:
      • images: NCHW BGR un‑normalized image
      • texts: text embeddings (MobileCLIP‑BLT) for text prompts, visual prompt features for visual prompts; none for prompt‑free
    • Outputs:
      • Name: output1_yolov6r2
        • Info: Unprocessed output of the first channel
      • Name: output2_yolov6r2
        • Info: Unprocessed output of the second channel
      • Name: output3_yolov6r2
        • Info: Unprocessed output of the third channel
      • Name: output1_masks
        • Info: Mask coefficients
      • Name: output2_masks
        • Info: Mask coefficients
      • Name: output3_masks
        • Info: Mask coefficients
      • Name: protos_output
        • Info: Mask prototypes
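As a rough illustration of the input layout described above, the following sketch converts an HWC BGR frame into an NCHW tensor and packs prompt embeddings into a fixed-size tensor. It assumes the frame is already 640×640, that inputs are float32, and that unused class slots are zero-padded. The [1, 80, 512] shape comes from the input shapes listed under Throughput; the helper names `to_nchw` and `pack_embeddings` are hypothetical, and the random arrays stand in for a real frame and real MobileCLIP embeddings.

```python
import numpy as np

def to_nchw(frame_bgr: np.ndarray) -> np.ndarray:
    """Convert an HxWx3 BGR frame to a 1x3xHxW tensor without normalization."""
    if frame_bgr.ndim != 3 or frame_bgr.shape[2] != 3:
        raise ValueError("expected an HxWx3 image")
    chw = np.transpose(frame_bgr, (2, 0, 1))   # HWC -> CHW, channel order stays BGR
    return chw[np.newaxis].astype(np.float32)  # add batch dim; values stay in 0..255

def pack_embeddings(embeddings: np.ndarray, max_classes: int = 80) -> np.ndarray:
    """Zero-pad K prompt embeddings (K x D) into a 1 x max_classes x D tensor."""
    k, dim = embeddings.shape
    if k > max_classes:
        raise ValueError("more prompts than class slots")
    packed = np.zeros((1, max_classes, dim), dtype=np.float32)
    packed[0, :k] = embeddings                 # assumption: unused slots stay zero
    return packed

# Placeholder data: a random frame and three random 512-dimensional "embeddings".
frame = np.random.randint(0, 256, size=(640, 640, 3), dtype=np.uint8)
images = to_nchw(frame)
texts = pack_embeddings(np.random.rand(3, 512).astype(np.float32))
print(images.shape, texts.shape)
```

Resizing or letterboxing to 640×640 is deliberately left out; in a DepthAI pipeline that step would normally happen before the frame reaches the network.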
    Model Architecture
    • Backbone: YOLOv8 CSP backbone
    • Neck: PAN (Path Aggregation Network)
    • Head: detection & segmentation head augmented with:
      • RepRTA (Re‑parameterizable Region‑Text Alignment) for text prompts
      • SAVPE (Semantic‑Activated Visual Prompt Encoder) for visual prompts
      • LRPC (Lazy Region‑Prompt Contrast) for prompt‑free
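The prompt-based modes share one idea: class scores come from comparing per-region embeddings with prompt embeddings rather than from a fixed classifier. The sketch below illustrates only that generic contrast step, using cosine similarity. It is not YOLOE's actual head (RepRTA, SAVPE and LRPC are not implemented here), and the function name, embedding width and synthetic inputs are illustrative assumptions.

```python
import numpy as np

def contrast_scores(region_emb: np.ndarray, prompt_emb: np.ndarray) -> np.ndarray:
    """Cosine similarity between N region embeddings and K prompt embeddings.

    region_emb: (N, D), prompt_emb: (K, D). Returns an (N, K) score matrix.
    """
    r = region_emb / np.linalg.norm(region_emb, axis=1, keepdims=True)
    p = prompt_emb / np.linalg.norm(prompt_emb, axis=1, keepdims=True)
    return r @ p.T

rng = np.random.default_rng(0)
prompts = rng.normal(size=(3, 512))                       # three synthetic prompts
# Two synthetic regions that lie close to prompts 2 and 0 respectively.
regions = prompts[[2, 0]] + 0.01 * rng.normal(size=(2, 512))
scores = contrast_scores(regions, prompts)
best_prompt = scores.argmax(axis=1)                       # index of best match per region
print(best_prompt)
```

Because the vocabulary lives in the prompt embeddings, changing the class list only means swapping the `prompt_emb` matrix, which is what makes the approach open-vocabulary.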
    Throughput
    Model variant: YOLOE-v8-L
    • Input shapes: [[1, 3, 640, 640], [1, 80, 512]]
    • Output shapes: [[1, 80, 80, 85], [1, 40, 40, 85], [1, 20, 20, 85], [1, 32, 80, 80], [1, 32, 40, 40], [1, 32, 20, 20], [1, 32, 160, 160]]
    • Params (M): 47.890
    • GFLOPs: 114.541
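The mask outputs follow the familiar coefficients-plus-prototypes layout: each grid cell carries 32 coefficients, and a mask is a weighted sum of the 32 prototype maps. Below is a minimal sketch of that step, assuming the common YOLACT-style recipe (linear combination, sigmoid, then a 0.5 threshold). The function `assemble_mask` and the direct grid-cell indexing are illustrative assumptions, random arrays stand in for real network outputs, and the on-device parser may work differently.

```python
import numpy as np

def assemble_mask(coeff_map, protos, row, col, threshold=0.5):
    """Build one instance mask from a grid cell's coefficients.

    coeff_map: (1, 32, H, W) mask coefficients for one output scale.
    protos:    (1, 32, Hp, Wp) mask prototypes.
    Returns a boolean (Hp, Wp) mask.
    """
    coeffs = coeff_map[0, :, row, col]                # (32,)
    logits = np.tensordot(coeffs, protos[0], axes=1)  # weighted sum -> (Hp, Wp)
    probs = 1.0 / (1.0 + np.exp(-logits))             # sigmoid
    return probs > threshold

rng = np.random.default_rng(0)
coeff_map = rng.normal(size=(1, 32, 80, 80)).astype(np.float32)  # like output1_masks
protos = rng.normal(size=(1, 32, 160, 160)).astype(np.float32)   # like protos_output
mask = assemble_mask(coeff_map, protos, row=10, col=20)
print(mask.shape, mask.dtype)
```

In practice the mask would also be cropped to the detection's bounding box and upsampled to the input resolution; both steps are omitted here.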
    | Platform | Precision | Throughput (infs/sec) | Power Consumption (W) |
    |---|---|---|---|
    | RVC4 | INT8 | 84.13 | 9.10 |
    * Benchmarked with , using 2 threads (and the DSP runtime in balanced mode for RVC4).
    * Parameters and FLOPs are obtained from the package.
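Throughput and power figures convert directly into average time and energy per inference. The short calculation below reads the RVC4 row as 84.13 inferences per second at 9.10 W. It assumes the reported throughput is sustained; since the benchmark uses 2 threads, the latency of a single frame may be higher than this average.

```python
throughput_ips = 84.13  # inferences per second (RVC4, INT8)
power_w = 9.10          # reported power consumption in watts

ms_per_inference = 1000.0 / throughput_ips        # average time per inference
joules_per_inference = power_w / throughput_ips   # W / (inf/s) = J per inference

print(f"{ms_per_inference:.2f} ms, {joules_per_inference:.3f} J per inference")
```

This works out to roughly 11.9 ms and 0.11 J per inference.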
    Quantization
    The RVC4 version of the model was quantized using a custom dataset.
    Utilization
    Models converted for RVC Platforms can be used for inference on OAK devices. DepthAI pipelines are used to define the information flow linking the device, inference model, and the output parser (as defined in model head(s)). Below, we present the most crucial utilization steps for the particular model. Please consult the docs for more information.
    Example
    You can quickly run the model using our example.
    This example demonstrates an advanced use of a custom frontend. On the DepthAI backend, it runs either the YOLO-World (default) or YOLOE model on-device, with configurable class labels and confidence threshold — both controllable via the frontend. The frontend, built using the @luxonis/depthai-viewer-common package, displays a real-time video stream with detections.
    To try it out, select the YOLOE model via the app arguments and run:
    oakctl connect <DEVICE_IP>
    oakctl app run .
    
    YOLOE-v8-L
    YOLOE: A highly efficient, open-vocabulary model for unified object detection and segmentation, offering versatility and performance.
    License
    GNU Affero General Public License v3.0
    Commercial use
    Tasks
    Object Detection
    Model Types
    ONNX
    Model Variants
    Available for: RVC4 (created about 1 month ago)