YOLOE: A highly efficient, open-vocabulary model for unified object detection and segmentation, offering versatility and performance.
License: GNU Affero General Public License v3.0 (commercial use)
Downloads: 2982
Tasks: Object Detection
Model Types: ONNX
Model Variants
| Name | Version | Available For | Created At |
| --- | --- | --- | --- |
|  |  | RVC4 | About 1 month ago |
|  |  | RVC4 | About 1 month ago |
|  |  | RVC4 | About 1 month ago |
|  |  | RVC4 | 3 months ago |
Model Details
Model Description
YOLOE (“Real‑Time Seeing Anything”) is a unified, open‑vocabulary object detection and segmentation model that supports text prompts, visual prompts, and prompt‑free inference with zero inference or transferring overhead compared to closed‑set YOLOs.
Developed by: Ao Wang*, Lihao Liu*, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding
Shared by:
Model type: Computer Vision (open‑vocabulary detection & segmentation)
License: GNU Affero General Public License v3.0
Resources for more information:
To enhance versatility, YOLOE accepts both images and prompt embeddings, either text embeddings (from MobileCLIP‑BLT) or visual prompt embeddings, and also supports a prompt‑free mode, enabling real‑time “seeing anything.”
Training Details
Training Data
Training data includes the Objects365v1 and GoldG (GQA and Flickr30k) datasets.
Annotations are preprocessed to include segmentation masks.
Testing Details
Metrics
Results of fixed‑AP evaluation on the LVIS minival set (640×640 input). FPS is measured on an NVIDIA T4 (TensorRT) and an iPhone 12 (CoreML).
| Model | Pre‑train Data | Size | Prompt | Params | Training Data | Time | FPS (T4 / CoreML) | AP (T / V) | APr (T / V) | APc (T / V) | APf (T / V) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| YOLOE‑v8‑L | OG (O365 + GoldG) | 640 | Text / Visual | 45 M / 50 M | OG | 22.5 h | 102.5 / 27.2 | 35.9 / 34.2 | 33.2 / 33.2 | 34.8 / 34.6 | 37.3 / 34.1 |
OG denotes Objects365v1 + GoldG.
Technical Specifications
Input/Output Details
Inputs:
Model variant: yoloe-v8-l:640x640
images: NCHW BGR un‑normalized image
texts: text embeddings (MobileCLIP‑BLT) for text prompts
Model variant: yoloe-v8-l:640x640-visual-prompt
images: NCHW BGR un‑normalized image
texts: visual prompt features for visual prompts (YOLOE SAVPE module outputs)
Model variant: yoloe-v8-l:640x640-text-visual-prompt
images: NCHW BGR un‑normalized image
texts: text embeddings (MobileCLIP‑BLT) for text prompts
image_prompts: visual prompt features for visual prompts (YOLOE SAVPE module outputs)
Outputs:
Name: output1_yolov6r2
Info: Unprocessed output of the first channel
Name: output2_yolov6r2
Info: Unprocessed output of the second channel
Name: output3_yolov6r2
Info: Unprocessed output of the third channel
Name: output1_masks
Info: Mask coefficients
Name: output2_masks
Info: Mask coefficients
Name: output3_masks
Info: Mask coefficients
Name: protos_output
Info: Mask prototypes
Outputs are the same for all model variants.
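To make the expected tensor layout concrete, below is a minimal preprocessing sketch for the text-prompt variant, using plain OpenCV/NumPy. The file name, prompt count, and the 512-dimensional MobileCLIP‑BLT text embedding size are assumptions for illustration; verify the exact shapes against the variant's I/O definition.

```python
import cv2
import numpy as np

# Load a frame; OpenCV returns HWC BGR uint8, which matches the expected
# un-normalized BGR input.
frame = cv2.imread("example.jpg")
frame = cv2.resize(frame, (640, 640))                 # 640x640 variants
images = frame.transpose(2, 0, 1)[np.newaxis, ...]    # NCHW, still un-normalized

# Placeholder for text-prompt embeddings from the MobileCLIP-BLT text encoder.
# The prompt count and embedding size (512) are assumptions; check the
# variant's I/O definition for the actual shape.
num_prompts = 3
texts = np.zeros((1, num_prompts, 512), dtype=np.float32)
```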
Model Architecture
Backbone: YOLOv8 CSP backbone
Neck: PAN (Path Aggregation Network)
Head: detection & segmentation head augmented with:
RepRTA (Re‑parameterizable Region‑Text Alignment) for text prompts
SAVPE (Semantic‑Activated Visual Prompt Encoder) for visual prompts
LRPC (Lazy Region‑Prompt Contrast) for prompt‑free
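As a conceptual illustration only (not the actual YOLOE implementation), open-vocabulary classification can be thought of as contrasting region embeddings produced by the head with the prompt embeddings. A minimal NumPy sketch of that contrast step:

```python
import numpy as np

def prompt_contrast(region_embeddings: np.ndarray, prompt_embeddings: np.ndarray) -> np.ndarray:
    """Score each candidate region against each prompt via cosine similarity.

    Conceptual sketch only; the real YOLOE head uses RepRTA (text prompts),
    SAVPE (visual prompts), and LRPC (prompt-free) rather than this bare contrast.
    """
    r = region_embeddings / np.linalg.norm(region_embeddings, axis=-1, keepdims=True)
    p = prompt_embeddings / np.linalg.norm(prompt_embeddings, axis=-1, keepdims=True)
    return r @ p.T  # shape: (num_regions, num_prompts)
```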
Model variant: yoloe-v8-l:640x640-text-visual-prompt
| Platform | Precision | Throughput (infs/sec) | Power Consumption (W) |
| --- | --- | --- | --- |
| RVC4 | FP16 | 27.57 | N/A |
Model variant: yoloe-v8-l:640x480-text-visual-prompt
| Platform | Precision | Throughput (infs/sec) | Power Consumption (W) |
| --- | --- | --- | --- |
| RVC4 | FP16 | 35.82 | N/A |
* Benchmarked with , using 2 threads (and the DSP runtime in balanced mode for RVC4).
* Parameters and FLOPs are obtained from the package.
Quantization
The RVC4 version of the model was quantized using a custom dataset.
Utilization
Models converted for RVC Platforms can be used for inference on OAK devices.
DepthAI pipelines are used to define the information flow linking the device, inference model, and the output parser (as defined in model head(s)).
Below, we present the most important utilization steps for this particular model.
Please consult the docs for more information.
Install the DepthAI v3 and depthai-nodes libraries:
```
pip install depthai
pip install depthai-nodes
```
Inspect model head(s):
YOLOExtendedParser, which outputs an ImgDetectionsExtended message (detected objects with instance segmentation masks).
Get parsed output(s):
```python
while pipeline.isRunning():
    detection_parser_output: ImgDetectionsExtended = detection_parser_output_queue.get()
```
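For context, here is a minimal end-to-end sketch of the pipeline wiring that produces the queue used above. It assumes the standard DepthAI v3 and depthai-nodes API (dai.Pipeline, dai.node.Camera, ParsingNeuralNetwork) and a hypothetical HubAI model slug "luxonis/yoloe-v8-l:640x640"; import paths and the slug may differ for your depthai-nodes version, and prompt configuration (class labels / embeddings) is handled as in the example app described below.

```python
import depthai as dai
from depthai_nodes import ImgDetectionsExtended  # import path may differ between versions
from depthai_nodes.node import ParsingNeuralNetwork

# Hypothetical model slug; check the model page for the exact identifier.
MODEL_SLUG = "luxonis/yoloe-v8-l:640x640"

with dai.Pipeline() as pipeline:
    # On-device camera as the frame source.
    camera = pipeline.create(dai.node.Camera).build()

    # ParsingNeuralNetwork bundles the model with its parser
    # (YOLOExtendedParser, as defined in the model head).
    nn = pipeline.create(ParsingNeuralNetwork).build(camera, MODEL_SLUG)
    detection_parser_output_queue = nn.out.createOutputQueue()

    pipeline.start()
    while pipeline.isRunning():
        detection_parser_output: ImgDetectionsExtended = detection_parser_output_queue.get()
        for det in detection_parser_output.detections:
            print(det.label, det.confidence)
```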
Example
You can quickly run the model using our example.
This example demonstrates an advanced use of a custom frontend. On the DepthAI backend, it runs either the YOLO-World or YOLOE model on-device, with configurable class labels and confidence threshold — both controllable via the frontend.
The frontend, built using the @luxonis/depthai-viewer-common package, displays a real-time video stream with detections.
To try it out, select the YOLOE model using the app arguments and run: