Model Details
Model Description
The Whisper Tiny EN model is a lightweight version of the Whisper ASR (Automatic Speech Recognition) family, designed specifically for English speech recognition. It is part of a series of nine Whisper models of different sizes and capabilities, and it excels in environments with limited computational resources while maintaining robust performance across diverse audio scenarios.
Developed by: OpenAI
Shared by:
Model type: Sequence-to-sequence ASR
License:
Resources for more information:
Training Details
Training Data
Training Datasets:
Diverse English speech datasets curated by OpenAI. The models are trained on 680,000 hours of audio and the corresponding transcripts collected from the internet. 65% of this data (or 438,000 hours) represents English-language audio and matched English transcripts, roughly 18% (or 126,000 hours) represents non-English audio and English transcripts, while the final 17% (or 117,000 hours) represents non-English audio and the corresponding transcript. This non-English data represents 98 different languages.
Evaluation Datasets:
Short-form English-only datasets such as LibriSpeech, TED-LIUM3, and CommonVoice (see the results table below).
Long-form English-only datasets.
Multilingual datasets.
More information on the evaluation datasets can be found in the Whisper paper and the official Whisper repository.
Testing Details
Metrics
Performance metrics for Whisper models include Word Error Rate (WER)* as the primary measure, evaluated on several standard benchmarks. Below are results for the Whisper Tiny EN model on English transcription using greedy decoding (detailed results can be found in Appendix D of the Whisper paper).
| Evaluation Dataset | WER (%) |
| --- | --- |
| LibriSpeech.test-clean | 5.6 |
| LibriSpeech.test-other | 14.6 |
| TED-LIUM3 | 6.0 |
| CommonVoice5.1 | 26.3 |
*WER (Word Error Rate): Measures the ratio of errors in a transcript to the total words spoken (lower is better).
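For readers who want to reproduce the metric on their own transcripts, below is a minimal, self-contained sketch of a WER computation based on word-level edit distance; the function name and the example sentences are illustrative and not taken from the Whisper evaluation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


# One substitution against a six-word reference -> WER = 1/6, about 16.7%.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```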
Input/Output Details
Encoder
Input:
Name: audio
Info: NTF, log-Mel spectrogram
Outputs:
Name: k_cache_cross
Info: Encoder output to be used by the Decoder
Name: v_cache_cross
Info: Encoder output to be used by the Decoder
Decoder
Inputs:
Name: k_cache_cross
Info: Encoder Output 1
Name: v_cache_cross
Info: Encoder Output 2
Name: k_cache_self
Info: Fed back from Output 2 (k_cache) of the Decoder itself.
Name: v_cache_self
Info: Fed back from Output 3 (v_cache) of the Decoder itself.
Name: x
Info: Previously generated token.
Name: index
Info: Used to select the correct positional embedding.
Outputs:
Name: logits
Info: Decoder logits used to generate the next token.
Name: k_cache
Info: Fed back to Decoder Input 3 (k_cache_self).
Name: v_cache
Info: Fed back to Decoder Input 4 (v_cache_self).
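To make the cache wiring above concrete, here is a hedged host-side sketch of a greedy decoding loop using onnxruntime and the tensor names from this table. The file names, tensor shapes, dtypes, and token handling are assumptions for illustration; the actual exports and the official example may differ.

```python
import numpy as np
import onnxruntime as ort

# Placeholder file names -- substitute the ONNX files exported for your target.
encoder = ort.InferenceSession("whisper_tiny_en_encoder.onnx")
decoder = ort.InferenceSession("whisper_tiny_en_decoder.onnx")


def greedy_decode(mel, start_token, end_token, cache_shape, max_tokens=224):
    """Greedy decoding wired exactly as the inputs/outputs above describe.

    mel:         log-Mel spectrogram of one 30-second audio chunk ('audio' input)
    cache_shape: shape of the decoder self-attention caches (depends on the export)
    """
    # The Encoder runs once per chunk and produces the cross-attention caches.
    k_cross, v_cross = encoder.run(["k_cache_cross", "v_cache_cross"], {"audio": mel})

    # Self-attention caches start at zero and are fed back on every step.
    k_self = np.zeros(cache_shape, dtype=np.float32)
    v_self = np.zeros(cache_shape, dtype=np.float32)

    tokens = [start_token]
    for index in range(max_tokens):
        logits, k_self, v_self = decoder.run(
            ["logits", "k_cache", "v_cache"],
            {
                "k_cache_cross": k_cross,                       # Encoder output 1
                "v_cache_cross": v_cross,                       # Encoder output 2
                "k_cache_self": k_self,                         # fed back from k_cache
                "v_cache_self": v_self,                         # fed back from v_cache
                "x": np.array([[tokens[-1]]], dtype=np.int32),  # previously generated token
                "index": np.array([index], dtype=np.int32),     # positional-embedding index
            },
        )
        next_token = int(np.argmax(logits[0, -1]))
        if next_token == end_token:
            break
        tokens.append(next_token)
    return tokens
```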
Model Architecture
The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer. Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.
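As a rough illustration of this preprocessing step, the sketch below computes a Whisper-style log-Mel spectrogram for a single 30-second chunk with librosa. The parameter values (16 kHz sample rate, 400-sample FFT, 160-sample hop, 80 Mel bins) and the normalization follow the public Whisper reference implementation and should be treated as assumptions rather than part of this model card.

```python
import numpy as np
import librosa


def log_mel_spectrogram(path: str) -> np.ndarray:
    # Load as 16 kHz mono, then pad or trim to exactly one 30-second chunk.
    audio, sr = librosa.load(path, sr=16000, mono=True)
    chunk = np.zeros(30 * sr, dtype=np.float32)
    chunk[: min(len(audio), len(chunk))] = audio[: len(chunk)]

    # 80-bin Mel spectrogram with a 25 ms window (n_fft=400) and 10 ms hop.
    mel = librosa.feature.melspectrogram(
        y=chunk, sr=sr, n_fft=400, hop_length=160, n_mels=80, power=2.0
    )

    # Log compression, dynamic-range clamping, and scaling as in the reference code.
    log_mel = np.log10(np.maximum(mel, 1e-10))
    log_mel = np.maximum(log_mel, log_mel.max() - 8.0)
    return ((log_mel + 4.0) / 4.0).astype(np.float32)
```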
Navigate to the Compile tab and locate the jobs for the Whisper-Tiny-En Encoder and Decoder.
Open the respective job tabs, go to the Intermediate Assets section, and download the ONNX model files for each component.
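Once downloaded, you can sanity-check the exported files against the input/output names documented above; the file names below are placeholders for whatever the compile jobs produce.

```python
import onnxruntime as ort

# Placeholder file names -- use the files downloaded from the Intermediate Assets section.
for path in ("whisper_tiny_en_encoder.onnx", "whisper_tiny_en_decoder.onnx"):
    session = ort.InferenceSession(path)
    print(path)
    print("  inputs: ", [i.name for i in session.get_inputs()])
    print("  outputs:", [o.name for o in session.get_outputs()])
```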
Utilization
Models converted for RVC Platforms can be used for inference on OAK devices.
DepthAI pipelines are used to define the flow of information between the device and the inference model.
Install the DepthAI v3 library:
pip install depthai
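A quick way to confirm the installation picked up a DepthAI 3.x release (a minimal check, nothing model-specific):

```python
import depthai as dai

# Should print a 3.x version for DepthAI V3.
print(dai.__version__)
```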
Example
You can quickly run the model using our example.
It automatically downloads the model(s), creates a DepthAI pipeline, runs inference, and displays the results using our DepthAI visualizer tool.