Model Details
Model Description
The Whisper Tiny EN model is a lightweight version of the Whisper ASR (Automatic Speech Recognition) family, designed specifically for English speech recognition. It is part of a series of nine Whisper models of different sizes and capabilities, and it excels in environments with limited computational resources while maintaining robust performance across diverse audio scenarios.
Developed by: OpenAI
Shared by:
Model type: Sequence-to-sequence ASR
License:
Resources for more information:
Training Details
Training Data
Training Datasets:
Diverse English speech datasets curated by OpenAI. The models are trained on 680,000 hours of audio and the corresponding transcripts collected from the internet. 65% of this data (or 438,000 hours) represents English-language audio and matched English transcripts, roughly 18% (or 126,000 hours) represents non-English audio and English transcripts, while the final 17% (or 117,000 hours) represents non-English audio and the corresponding transcript. This non-English data represents 98 different languages.
Evaluation Datasets:
Short-form English-only datasets such as LibriSpeech, TED-LIUM3, and CommonVoice (see the results table below).
Long-form English-only datasets.
Multilingual datasets.
More information on the evaluation datasets can be found in the Whisper paper and the official Whisper repository.
Testing Details
Metrics
Performance metrics for Whisper models include Word Error Rate (WER)* as the primary measure, evaluated on several standard benchmarks. Below are results for the Whisper Tiny EN model on English transcription using greedy decoding (detailed results can be found in Appendix D of the Whisper paper).
| Evaluation Dataset | WER (%) |
| --- | --- |
| LibriSpeech.test-clean | 5.6 |
| LibriSpeech.test-other | 14.6 |
| TED-LIUM3 | 6.0 |
| CommonVoice5.1 | 26.3 |
*WER (Word Error Rate): Measures the ratio of errors in a transcript to the total words spoken (lower is better).
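For readers who want to reproduce the metric on their own transcripts, below is a minimal, self-contained sketch of a WER computation based on word-level edit distance; the function name and the example sentences are illustrative and not taken from the Whisper evaluation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


# One substitution against a six-word reference -> WER = 1/6, about 16.7%.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```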
Input/Output Details
Encoder
Input:
Name: audio
Info: NTF, log-Mel spectrogram
Outputs:
Name: k_cache_cross
Info: Encoder output to be used by the Decoder
Name: v_cache_cross
Info: Encoder output to be used by the Decoder
Decoder
Inputs:
Name: k_cache_cross
Info: Encoder Output 1
Name: v_cache_cross
Info: Encoder Output 2
Name: k_cache_self
Info: Fed back from Output 2 (k_cache) of the Decoder itself.
Name: v_cache_self
Info: Fed back from Output 3 (v_cache) of the Decoder itself.
Name: x
Info: Previously generated token.
Name: index
Info: Used to select the correct positional embedding.
Outputs:
Name: logits
Info: Decoder logits used to generate the next token.
Name: k_cache
Info: Fed back to Decoder Input 3 (k_cache_self).
Name: v_cache
Info: Fed back to Decoder Input 4 (v_cache_self).
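To make the cache wiring above concrete, here is a hedged host-side sketch of a greedy decoding loop using onnxruntime and the tensor names from this table. The file names, tensor shapes, dtypes, and token handling are assumptions for illustration; the actual exports and the official example may differ.

```python
import numpy as np
import onnxruntime as ort

# Placeholder file names -- substitute the ONNX files exported for your target.
encoder = ort.InferenceSession("whisper_tiny_en_encoder.onnx")
decoder = ort.InferenceSession("whisper_tiny_en_decoder.onnx")


def greedy_decode(mel, start_token, end_token, cache_shape, max_tokens=224):
    """Greedy decoding wired exactly as the inputs/outputs above describe.

    mel:         log-Mel spectrogram of one 30-second audio chunk ('audio' input)
    cache_shape: shape of the decoder self-attention caches (depends on the export)
    """
    # The Encoder runs once per chunk and produces the cross-attention caches.
    k_cross, v_cross = encoder.run(["k_cache_cross", "v_cache_cross"], {"audio": mel})

    # Self-attention caches start at zero and are fed back on every step.
    k_self = np.zeros(cache_shape, dtype=np.float32)
    v_self = np.zeros(cache_shape, dtype=np.float32)

    tokens = [start_token]
    for index in range(max_tokens):
        logits, k_self, v_self = decoder.run(
            ["logits", "k_cache", "v_cache"],
            {
                "k_cache_cross": k_cross,                       # Encoder output 1
                "v_cache_cross": v_cross,                       # Encoder output 2
                "k_cache_self": k_self,                         # fed back from k_cache
                "v_cache_self": v_self,                         # fed back from v_cache
                "x": np.array([[tokens[-1]]], dtype=np.int32),  # previously generated token
                "index": np.array([index], dtype=np.int32),     # positional-embedding index
            },
        )
        next_token = int(np.argmax(logits[0, -1]))
        if next_token == end_token:
            break
        tokens.append(next_token)
    return tokens
```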
Model Architecture
The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer. Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.
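As a rough illustration of this preprocessing step, the sketch below computes a Whisper-style log-Mel spectrogram for a single 30-second chunk with librosa. The parameter values (16 kHz sample rate, 400-sample FFT, 160-sample hop, 80 Mel bins) and the normalization follow the public Whisper reference implementation and should be treated as assumptions rather than part of this model card.

```python
import numpy as np
import librosa


def log_mel_spectrogram(path: str) -> np.ndarray:
    # Load as 16 kHz mono, then pad or trim to exactly one 30-second chunk.
    audio, sr = librosa.load(path, sr=16000, mono=True)
    chunk = np.zeros(30 * sr, dtype=np.float32)
    chunk[: min(len(audio), len(chunk))] = audio[: len(chunk)]

    # 80-bin Mel spectrogram with a 25 ms window (n_fft=400) and 10 ms hop.
    mel = librosa.feature.melspectrogram(
        y=chunk, sr=sr, n_fft=400, hop_length=160, n_mels=80, power=2.0
    )

    # Log compression, dynamic-range clamping, and scaling as in the reference code.
    log_mel = np.log10(np.maximum(mel, 1e-10))
    log_mel = np.maximum(log_mel, log_mel.max() - 8.0)
    return ((log_mel + 4.0) / 4.0).astype(np.float32)
```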
Navigate to the Compile tab and locate the jobs for the Whisper-Tiny-En Encoder and Decoder.
Open the respective job tabs, go to the Intermediate Assets section, and download the ONNX model files for each component.
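Once downloaded, you can sanity-check the exported files against the input/output names documented above; the file names below are placeholders for whatever the compile jobs produce.

```python
import onnxruntime as ort

# Placeholder file names -- use the files downloaded from the Intermediate Assets section.
for path in ("whisper_tiny_en_encoder.onnx", "whisper_tiny_en_decoder.onnx"):
    session = ort.InferenceSession(path)
    print(path)
    print("  inputs: ", [i.name for i in session.get_inputs()])
    print("  outputs:", [o.name for o in session.get_outputs()])
```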
Utilization
Models converted for RVC Platforms can be used for inference on OAK devices.
DepthAI pipelines are used to define the flow of information between the device and the inference model.
Install the DepthAI v3 library:
pip install depthai
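A quick way to confirm the installation picked up a DepthAI 3.x release (a minimal check, nothing model-specific):

```python
import depthai as dai

# Should print a 3.x version for DepthAI V3.
print(dai.__version__)
```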
Example
You can quickly run the model using our example.
It automatically downloads the model(s), creates a DepthAI pipeline, runs inference, and displays the results using our DepthAI visualizer tool.