Model Details

Model Description

YOLOE (“Real‑Time Seeing Anything”) is a unified, open‑vocabulary object detection and segmentation model. It supports text prompts, visual prompts, and prompt‑free inference, with zero inference and transfer overhead compared to closed‑set YOLOs.

Developed by: Ao Wang*, Lihao Liu*, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding
Shared by:
Model type: Computer Vision (open‑vocabulary detection & segmentation)
License:
Resources for more information:

To enhance versatility, YOLOE accepts images together with prompt embeddings, either text embeddings (from MobileCLIP‑BLT) or visual prompt features. It also offers a prompt‑free mode, enabling real‑time “seeing anything.”

Training Details

Training Data

Training data includes the Objects365v1 and GoldG (GQA and Flickr30k) datasets.

Annotations are preprocessed to include segmentation masks.

Testing Details

Metrics

The table below reports fixed‑AP results on the LVIS minival set (640×640 input). FPS is measured on an NVIDIA T4 (TensorRT) and an iPhone 12 (CoreML). T / V denotes text prompt / visual prompt.

Model	Pre‑train Data	Size	Prompt	Params	Training Data	Time	FPS (T4 / CoreML)	AP (T / V)	AP_r (T / V)	AP_c (T / V)	AP_f (T / V)
YOLOE‑v8‑L	OG (O365 + GoldG)	640	Text / Visual	45 M / 50 M	OG	22.5 h	102.5 / 27.2	35.9 / 34.2	33.2 / 33.2	34.8 / 34.6	37.3 / 34.1

OG denotes Objects365v1 + GoldG.

Technical Specifications

Input/Output Details

Input:

Model variant: yoloe-v8-l:640x640

images: NCHW BGR un‑normalized image
texts: text embeddings (MobileCLIP‑BLT) for text prompts

Model variant: yoloe-v8-l:640x640-visual-prompt

images: NCHW BGR un‑normalized image
texts: visual prompt features for visual prompts (YOLOE SAVPE module outputs)

Model variant: yoloe-v8-l:640x640-text-visual-prompt

images: NCHW BGR un‑normalized image
texts: text embeddings (MobileCLIP‑BLT) for text prompts
image_prompts: visual prompt features for visual prompts (YOLOE SAVPE module outputs)
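
The "NCHW BGR un‑normalized" layout above can be illustrated with a minimal sketch. It assumes the frame arrives as a height × width × channel array in BGR order (as OpenCV produces) and only reorders the axes, leaving raw pixel values untouched. The helper name `hwc_to_nchw` is hypothetical, and nested lists are used only to keep the example dependency‑free. In practice an array library would do the same transpose.

```python
def hwc_to_nchw(image_hwc):
    """Reorder one H x W x C image (nested lists) into a 1 x C x H x W batch."""
    height = len(image_hwc)
    width = len(image_hwc[0])
    channels = len(image_hwc[0][0])
    chw = [
        [[image_hwc[y][x][c] for x in range(width)] for y in range(height)]
        for c in range(channels)
    ]
    return [chw]  # leading batch dimension N = 1


# Tiny 2 x 2 BGR image with raw pixel values (no mean/scale normalization applied).
bgr_image = [
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [10, 20, 30]],
]

batch = hwc_to_nchw(bgr_image)
blue_plane, green_plane, red_plane = batch[0]
print(blue_plane)  # [[255, 0], [0, 10]]
print(red_plane)   # [[0, 0], [255, 30]]
```

For the real model the same reordering would yield a [1, 3, 640, 640] tensor, matching the first input shape listed under Throughput.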

Outputs:
- Name: output1_yolov6r2
  - Info: Unprocessed output of the first channel
- Name: output2_yolov6r2
  - Info: Unprocessed output of the second channel
- Name: output3_yolov6r2
  - Info: Unprocessed output of the third channel
- Name: output1_masks
  - Info: Mask coefficients
- Name: output2_masks
  - Info: Mask coefficients
- Name: output3_masks
  - Info: Mask coefficients
- Name: protos_output
  - Info: Mask prototypes

Outputs are the same for all model variants.
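
The mask coefficients and mask prototypes listed above are typically combined on the host by the output parser. The sketch below shows the common YOLACT‑style assembly used by YOLOv8‑style segmentation heads: a per‑pixel linear combination of prototypes weighted by one detection's coefficients, followed by a sigmoid and a threshold. It is an assumption that this model follows that convention exactly. The function name `assemble_mask`, the 0.5 threshold, and the tiny 2‑prototype example are illustrative only (the real model uses 32 prototypes at 160×160).

```python
import math


def assemble_mask(coefficients, prototypes, threshold=0.5):
    """Combine prototype maps with one detection's coefficients into a binary mask."""
    height = len(prototypes[0])
    width = len(prototypes[0][0])
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            # Weighted sum over all prototype maps at this pixel.
            logit = sum(c * p[y][x] for c, p in zip(coefficients, prototypes))
            probability = 1.0 / (1.0 + math.exp(-logit))
            row.append(1 if probability > threshold else 0)
        mask.append(row)
    return mask


# Two hypothetical 2 x 2 prototypes and one detection's two coefficients.
coefficients = [1.0, -1.0]
prototypes = [
    [[2.0, 0.0], [0.0, 1.0]],
    [[0.0, 3.0], [0.0, 0.5]],
]

mask = assemble_mask(coefficients, prototypes)
print(mask)  # [[1, 0], [0, 1]]
```

A full pipeline would usually also crop each mask to its detection's bounding box and upscale it to the input resolution. Those steps are omitted here.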

Model Architecture

Backbone: YOLOv8 CSP backbone
Neck: PAN (Path Aggregation Network)
Head: detection & segmentation head augmented with:
- RepRTA (Re‑parameterizable Region‑Text Alignment) for text prompts
- SAVPE (Semantic‑Activated Visual Prompt Encoder) for visual prompts
- LRPC (Lazy Region‑Prompt Contrast) for prompt‑free inference

Throughput

Model variant: yoloe-v8-l:640x640

• Input shapes: [[1, 3, 640, 640], [1, 80, 512]]

• Output shapes: [[1, 80, 80, 85], [1, 40, 40, 85], [1, 20, 20, 85], [1, 32, 80, 80], [1, 32, 40, 40], [1, 32, 20, 20], [1, 32, 160, 160]]

• Params (M): 47.890 • GFLOPs: 114.541
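
The spatial sizes in the output shapes above are consistent with three detection scales at strides 8, 16, and 32, plus mask prototypes at stride 4. These stride values are inferred from the listed shapes (640 / 80 = 8, and so on) rather than stated in this card, so treat the sketch below as a sanity check under that assumption. The function name `feature_map_sizes` is hypothetical.

```python
def feature_map_sizes(height, width, strides=(8, 16, 32), proto_stride=4):
    """Return per-scale grid sizes and the prototype map size for a given input."""
    heads = [(height // stride, width // stride) for stride in strides]
    protos = (height // proto_stride, width // proto_stride)
    return heads, protos


heads, protos = feature_map_sizes(640, 640)
print(heads)   # [(80, 80), (40, 40), (20, 20)]
print(protos)  # (160, 160)
```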

Platform	Precision	Throughput (infs/sec)	Power Consumption (W)
RVC4	INT8	84.13	9.10
RVC4	FP16	29.97	N/A
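
As a rough reading of the table above, the INT8 speedup over FP16 and the energy per inference can be derived from the reported figures. The energy estimate assumes the listed power is the average draw over the benchmark run, which this card does not state explicitly.

```python
# Figures copied from the yoloe-v8-l:640x640 throughput table.
int8_throughput = 84.13  # inferences per second
fp16_throughput = 29.97  # inferences per second
int8_power_watts = 9.10

speedup = int8_throughput / fp16_throughput
joules_per_inference = int8_power_watts / int8_throughput

print(f"INT8 speedup over FP16: {speedup:.2f}x")                  # 2.81x
print(f"INT8 energy per inference: {joules_per_inference:.3f} J")  # 0.108 J
```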

Model variant: yoloe-v8-l:640x640-visual-prompt

Platform	Precision	Throughput (infs/sec)	Power Consumption (W)
RVC4	INT8	84.77	N/A
RVC4	FP16	28.51	N/A

Model variant: yoloe-v8-l:640x640-text-visual-prompt

Platform	Precision	Throughput (infs/sec)	Power Consumption (W)
RVC4	FP16	27.57	N/A

Model variant: yoloe-v8-l:640x480-text-visual-prompt

Platform	Precision	Throughput (infs/sec)	Power Consumption (W)
RVC4	FP16	35.82	N/A

* Benchmarked with , using 2 threads (and the DSP runtime in balanced mode for RVC4).

* Parameters and FLOPs are obtained from the package.

Quantization

The RVC4 version of the model was quantized using a custom dataset.

Utilization

Models converted for RVC Platforms can be used for inference on OAK devices. DepthAI pipelines are used to define the information flow linking the device, inference model, and the output parser (as defined in model head(s)). Below, we present the most crucial utilization steps for the particular model. Please consult the docs for more information.

Install DAIv3 and depthai-nodes libraries:

pip install depthai
pip install depthai-nodes

Inspect model head(s):

YOLOExtendedParser, which outputs an ImgDetectionsExtended message (detected objects with instance segmentation masks).

Get parsed output(s):

while pipeline.isRunning():
    detection_parser_output: ImgDetectionsExtended = detection_parser_output_queue.get()

Example

You can quickly run the model using our example.

This example demonstrates an advanced use of a custom frontend. On the DepthAI backend, it runs either the YOLO-World or YOLOE model on-device, with configurable class labels and a confidence threshold, both controllable via the frontend. The frontend, built using the @luxonis/depthai-viewer-common package, displays a real-time video stream with detections.

To try it out, select the YOLOE model using the app arguments and run:

oakctl connect <DEVICE_IP>
oakctl app run .

YOLOE: A highly efficient, open-vocabulary model for unified object detection and segmentation, offering versatility and performance.
License	GNU Affero General Public License v3.0 Commercial use
Tasks	Object Detection
Model Types	ONNX
