Model Details
Model Description
YOLOE (“Real‑Time Seeing Anything”) is a unified, open‑vocabulary object detection and segmentation model that supports text prompts, visual prompts, and prompt‑free inference, with zero inference and transfer overhead compared to closed‑set YOLOs.
Developed by: Ao Wang*, Lihao Liu*, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding
Shared by:
Model type: Computer Vision (open‑vocabulary detection & segmentation)
License:
Resources for more information:
To enhance versatility, YOLOE accepts images together with prompt embeddings, either text (from MobileCLIP‑BLT) or visual, and also offers a prompt‑free mode, enabling real‑time “seeing anything.”
Training Details
Training Data
Training data includes Objects365v1 and GoldG (GQA and Flickr30k).
Annotations are preprocessed to include segmentation masks.
Testing Details
Metrics
Results of fixed‑AP evaluation on the LVIS minival set (640×640 input). FPS measured on Nvidia T4 (TensorRT) and iPhone 12 (CoreML).
| Model | Pre‑train Data | Size | Prompt | Params | Training Data | Time | FPS (T4 / CoreML) | AP (T / V) | APr (T / V) | APc (T / V) | APf (T / V) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| YOLOE‑v8‑L | OG (O365 + GoldG) | 640 | Text / Visual | 45 M / 50 M | OG | 22.5 h | 102.5 / 27.2 | 35.9 / 34.2 | 33.2 / 33.2 | 34.8 / 34.6 | 37.3 / 34.1 |
OG denotes Objects365v1 + GoldG.
Technical Specifications
Input/Output Details
Input:
images: NCHW BGR un‑normalized image
texts: text embeddings (MobileCLIP‑BLT) for text prompts, visual prompt features for visual prompts; none for prompt‑free
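As a rough illustration of the image input layout listed above, the following sketch converts an OpenCV-style HWC BGR frame into an NCHW tensor without mean/std normalization. The function name `to_nchw_bgr`, the 640×640 frame size (borrowed from the evaluation setup), and the float32 dtype are assumptions made for illustration, not details confirmed by this model card.

```python
import numpy as np

def to_nchw_bgr(frame_hwc_bgr: np.ndarray) -> np.ndarray:
    """Convert an HxWx3 BGR uint8 frame to a 1x3xHxW float32 tensor.

    Pixel values stay in the 0-255 range (no mean/std normalization)
    and the channel order stays BGR.
    """
    if frame_hwc_bgr.ndim != 3 or frame_hwc_bgr.shape[2] != 3:
        raise ValueError("expected an HxWx3 image")
    chw = np.transpose(frame_hwc_bgr, (2, 0, 1))  # HWC -> CHW
    return chw[np.newaxis].astype(np.float32)     # add batch dimension N=1

# Hypothetical 640x640 frame standing in for a camera image.
frame = np.random.randint(0, 256, size=(640, 640, 3), dtype=np.uint8)
tensor = to_nchw_bgr(frame)
```

If the source frame is RGB rather than BGR, the channels would need to be reversed before this step.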
Outputs:
Name: output1_yolov6r2
Info: Unprocessed output of the first channel
Name: output2_yolov6r2
Info: Unprocessed output of the second channel
Name: output3_yolov6r2
Info: Unprocessed output of the third channel
Name: output1_masks
Info: Mask coefficients
Name: output2_masks
Info: Mask coefficients
Name: output3_masks
Info: Mask coefficients
Name: protos_output
Info: Mask prototypes
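In YOLO-style segmentation heads, instance masks are commonly obtained by linearly combining the shared mask prototypes with each detection's mask coefficients and then applying a sigmoid. The sketch below assumes that convention applies to these outputs. The function name `assemble_masks`, the shapes (32 prototypes at 160×160), and the 0.5 threshold are hypothetical, and in practice this decoding is handled by the output parser.

```python
import numpy as np

def assemble_masks(coeffs: np.ndarray, protos: np.ndarray,
                   threshold: float = 0.5) -> np.ndarray:
    """Combine per-detection coefficients with shared prototypes.

    coeffs: (num_detections, num_protos)
    protos: (num_protos, H, W)
    returns: boolean masks of shape (num_detections, H, W)
    """
    num_protos, h, w = protos.shape
    logits = coeffs @ protos.reshape(num_protos, h * w)  # (num_detections, H*W)
    probs = 1.0 / (1.0 + np.exp(-logits))                # sigmoid
    return probs.reshape(-1, h, w) > threshold

# Toy data with assumed shapes; real values come from the model outputs.
rng = np.random.default_rng(0)
coeffs = rng.standard_normal((3, 32))
protos = rng.standard_normal((32, 160, 160))
masks = assemble_masks(coeffs, protos)
```

The resulting masks are at prototype resolution and would still need to be cropped to each bounding box and resized to the input image size.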
Model Architecture
Backbone: YOLOv8 CSP backbone
Neck: PAN (Path Aggregation Network)
Head: detection & segmentation head augmented with:
RepRTA (Re‑parameterizable Region‑Text Alignment) for text prompts
SAVPE (Semantic‑Activated Visual Prompt Encoder) for visual prompts
LRPC (Lazy Region‑Prompt Contrast) for prompt‑free
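Open-vocabulary heads of this kind generally score each candidate region by comparing its embedding against the prompt embeddings. The sketch below illustrates only that general idea, assuming cosine similarity followed by an argmax over prompts. It does not reproduce RepRTA, SAVPE, or LRPC, and every name and toy vector in it is hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def classify_regions(region_embeddings, prompt_embeddings, labels):
    """Assign each region the label of its most similar prompt embedding."""
    prompts = [l2_normalize(p) for p in prompt_embeddings]
    results = []
    for region in region_embeddings:
        r = l2_normalize(region)
        # Cosine similarity between the region and every prompt.
        scores = [sum(a * b for a, b in zip(r, p)) for p in prompts]
        best = max(range(len(scores)), key=scores.__getitem__)
        results.append((labels[best], scores[best]))
    return results

# Toy 3-dimensional embeddings; real ones come from the text or visual encoder.
labels = ["person", "dog"]
prompt_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
region_embeddings = [[0.9, 0.1, 0.0], [0.2, 2.0, 0.1]]
predictions = classify_regions(region_embeddings, prompt_embeddings, labels)
```

Because the label set is just a list of prompt embeddings, changing the vocabulary means swapping that list, with no change to the scoring code.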
* Benchmarked with , using 2 threads (and the DSP runtime in balanced mode for RVC4).
* Parameters and FLOPs are obtained from the package.
Quantization
The RVC4 version of the model was quantized using a custom dataset.
Utilization
Models converted for RVC Platforms can be used for inference on OAK devices.
DepthAI pipelines are used to define the information flow linking the device, inference model, and the output parser (as defined in model head(s)).
Below, we present the most crucial utilization steps for the particular model.
Please consult the docs for more information.
Example
You can quickly run the model using our example.
This example demonstrates an advanced use of a custom frontend. On the DepthAI backend, it runs either the YOLO-World (default) or YOLOE model on-device, with configurable class labels and a confidence threshold, both controllable via the frontend.
The frontend, built using the @luxonis/depthai-viewer-common package, displays a real-time video stream with detections.
To try it out, select the YOLOE model using the app arguments and run:
oakctl connect <DEVICE_IP>
oakctl app run .