	---
	license: apache-2.0
	language:
	- eu
	library_name: nemo
	datasets:
	- mozilla-foundation/common_voice_18_0
	- gttsehu/basque_parliament_1
	- openslr/openslr
	- HiTZ/composite_corpus_eu_v2.1
	- PRIVATE/EITB
	metrics:
	- wer
	pipeline_tag: automatic-speech-recognition
	tags:
	- automatic-speech-recognition
	- speech
	- audio
	- Transducer
	- Conformer
	- NeMo
	- pytorch
	- Transformer
	model-index:
	- name: stt_eu_conformer_transducer_large_v2
	  results:
	  - task:
	      type: Automatic Speech Recognition
	      name: speech-recognition
	    dataset:
	      name: Mozilla Common Voice 18.0
	      type: mozilla-foundation/common_voice_18_0
	      config: eu
	      split: test
	      args:
	        language: eu
	    metrics:
	    - name: Test WER
	      type: wer
	      value: 2.5
	  - task:
	      type: Automatic Speech Recognition
	      name: speech-recognition
	    dataset:
	      name: Basque Parliament
	      type: gttsehu/basque_parliament_1
	      config: eu
	      split: test
	      args:
	        language: eu
	    metrics:
	    - name: Test WER
	      type: wer
	      value: 3.78
	  - task:
	      type: Automatic Speech Recognition
	      name: speech-recognition
	    dataset:
	      name: OpenSLR
	      type: HiTZ/composite_corpus_eu_v2.1
	      config: eu
	      split: test_oslr
	      args:
	        language: eu
	    metrics:
	    - name: Test WER
	      type: wer
	      value: 11.87
	  - task:
	      type: Automatic Speech Recognition
	      name: speech-recognition
	    dataset:
	      name: EITB
	      type: private
	      config: eu
	      split: test
	      args:
	        language: eu
	    metrics:
	    - name: Test WER
	      type: wer
	      value: 9.17
	  - task:
	      type: Automatic Speech Recognition
	      name: speech-recognition
	    dataset:
	      name: Mozilla Common Voice 18.0
	      type: HiTZ/composite_corpus_eu_v2.1
	      config: eu
	      split: dev_cv
	      args:
	        language: eu
	    metrics:
	    - name: Dev WER
	      type: wer
	      value: 2.28
	  - task:
	      type: Automatic Speech Recognition
	      name: speech-recognition
	    dataset:
	      name: Basque Parliament
	      type: HiTZ/composite_corpus_eu_v2.1
	      config: eu
	      split: dev_parl
	      args:
	        language: eu
	    metrics:
	    - name: Dev WER
	      type: wer
	      value: 4.2
	  - task:
	      type: Automatic Speech Recognition
	      name: speech-recognition
	    dataset:
	      name: OpenSLR
	      type: HiTZ/composite_corpus_eu_v2.1
	      config: eu
	      split: dev_oslr
	      args:
	        language: eu
	    metrics:
	    - name: Dev WER
	      type: wer
	      value: 11.65
	  - task:
	      type: Automatic Speech Recognition
	      name: speech-recognition
	    dataset:
	      name: EITB
	      type: private
	      config: eu
	      split: validation
	      args:
	        language: eu
	    metrics:
	    - name: Dev WER
	      type: wer
	      value: 12.81
	---

	# HiTZ/Aholab's Basque Speech-to-Text model Conformer-Transducer v2
	## Model Description

	<style>
	img {
	display: inline;
	}
	</style>

	| [![Model architecture](https://img.shields.io/badge/Model_Arch-Conformer--Transducer-lightgrey#model-badge)](#model-architecture)
	| [![Model size](https://img.shields.io/badge/Params-119M-lightgrey#model-badge)](#model-architecture)
	| [![Language](https://img.shields.io/badge/Language-eu-lightgrey#model-badge)](#datasets)

	This model transcribes speech into the lowercase Basque alphabet, including spaces, and was trained on a composite dataset comprising 771.73 hours of Basque speech. The model was fine-tuned from the pre-trained Spanish [stt_es_conformer_transducer_large](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_es_conformer_transducer_large) model using the [NVIDIA NeMo](https://github.com/NVIDIA/NeMo) toolkit. It is an autoregressive "large" variant of Conformer, with around 119 million parameters.
	See the [model architecture](#model-architecture) section and [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#conformer-transducer) for complete architecture details.

	## Usage
	To train, fine-tune, or experiment with the model, you need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it after you have installed the latest version of PyTorch.

	```bash
	pip install "nemo_toolkit[all]"
	```

	### Transcribing using Python
	Clone the repository to download the model:

	```bash
	git clone https://huggingface.co/HiTZ/stt_eu_conformer_transducer_large_v2
	```

	In the example below, `NEMO_MODEL_FILEPATH` is the path to the downloaded `stt_eu_conformer_transducer_large_v2.nemo` file.

	```python
	import nemo.collections.asr as nemo_asr

	# Path to the downloaded .nemo file (assumed location inside the cloned repository; adjust as needed)
	NEMO_MODEL_FILEPATH = "stt_eu_conformer_transducer_large_v2/stt_eu_conformer_transducer_large_v2.nemo"

	# Load the model
	asr_model = nemo_asr.models.EncDecRNNTBPEModel.restore_from(NEMO_MODEL_FILEPATH)

	# List the audio files to transcribe (placeholder names; replace with your own paths)
	audio = ["audio_1.wav", "audio_2.wav"]

	# Set batch_size to whatever suits your hardware
	batch_size = 8

	# Transcribe the audio files
	transcriptions = asr_model.transcribe(audio=audio, batch_size=batch_size)

	# Print the transcriptions
	print(transcriptions)
	```
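
The return type of `transcribe` has changed across NeMo releases. Depending on the installed version, each item may be a plain string or a hypothesis object exposing a `text` attribute, and older transducer models may return a tuple of (best hypotheses, all hypotheses). The helper below is a minimal sketch for normalizing these cases into a list of strings; `to_texts` is a hypothetical name and not part of NeMo.

```python
from typing import Any, List


def to_texts(result: Any) -> List[str]:
    """Normalize the output of `transcribe` into a list of plain strings."""
    # Older transducer models may return (best_hypotheses, all_hypotheses).
    if isinstance(result, tuple):
        result = result[0]
    texts = []
    for item in result:
        # Items are either plain strings or objects with a `text` attribute.
        texts.append(item if isinstance(item, str) else item.text)
    return texts
```

With the variables from the example above, `for path, text in zip(audio, to_texts(transcriptions)): print(path, text)` prints one transcription per input file.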

	## Input
	This model accepts 16 kHz (16000 Hz) mono-channel audio (WAV files) as input.
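
A quick way to check a file against this requirement before transcribing is sketched below, using only Python's standard `wave` module. The helper name `is_16k_mono_wav` is hypothetical, and the `wave` module only reads uncompressed PCM WAV files, so other formats would need converting first.

```python
import wave


def is_16k_mono_wav(path: str) -> bool:
    """Return True if `path` is a PCM WAV file with one channel sampled at 16000 Hz."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnchannels() == 1 and wav_file.getframerate() == 16000
```

Files that fail the check can be resampled and downmixed with any audio tool before being passed to the model.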

	## Output
	This model provides transcribed speech as a string for a given audio sample.

	## Model Architecture
	The Conformer-Transducer model is an autoregressive variant of the Conformer model [1] for Automatic Speech Recognition that uses Transducer loss and decoding instead of CTC loss. More details on this model are available here: [Conformer-Transducer Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#conformer-transducer).
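
To make "Transducer decoding" more concrete, here is a toy sketch of greedy transducer decoding in plain Python. It is an illustration only, not NeMo's implementation: `joint` and `predictor_step` are hypothetical stand-ins for the joint and prediction networks, and `BLANK` and `max_symbols_per_frame` are assumed values chosen for the example.

```python
from typing import Callable, List, Sequence

BLANK = 0  # assumed index of the blank symbol


def greedy_transducer_decode(
    encoder_frames: Sequence,
    joint: Callable[[object, object], Sequence[float]],
    predictor_step: Callable[[object, int], object],
    initial_state: object,
    max_symbols_per_frame: int = 10,
) -> List[int]:
    """Greedy decoding: emit the best-scoring token at each step, advance on blank."""
    tokens: List[int] = []
    state = initial_state
    for frame in encoder_frames:
        for _ in range(max_symbols_per_frame):
            scores = joint(frame, state)
            best = max(range(len(scores)), key=scores.__getitem__)
            if best == BLANK:
                break  # blank: move on to the next acoustic frame
            tokens.append(best)
            # Autoregressive part: the predictor state depends on emitted tokens.
            state = predictor_step(state, best)
    return tokens
```

The inner loop is what makes the model autoregressive: each emitted token updates the predictor state that conditions the next prediction, whereas CTC predicts each frame independently of previously emitted tokens.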

	## Training
	### Data preparation
	This model was trained on a composite dataset comprising 771.73 hours of Basque speech, which contains:
	- The 675.98 hours of the train split of the [Composite Corpus EU v2.1](https://huggingface.co/datasets/HiTZ/composite_corpus_eu_v2.1), which is composed of three main datasets:
	  - [Mozilla Common Voice 18.0](https://commonvoice.mozilla.org/eu/datasets)
	  - [Basque Parliament EU](https://huggingface.co/datasets/gttsehu/basque_parliament_1)
	  - [OpenSLR EU](https://huggingface.co/datasets/openslr/openslr#slr76-crowdsourced-high-quality-basque-speech-data-set)
	- A 95.75-hour dataset from EITB programs.
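
As a quick sanity check on these figures, the snippet below adds up the two components listed above and computes each one's share of the total. The dictionary keys are just labels chosen for this example.

```python
# Hours per component, taken from the list above.
hours = {
    "Composite Corpus EU v2.1 (train split)": 675.98,
    "EITB programs": 95.75,
}

total_hours = round(sum(hours.values()), 2)
shares = {name: round(100 * h / total_hours, 2) for name, h in hours.items()}

print(total_hours)  # 771.73
for name, share in shares.items():
    print(f"{name}: {share}%")
```

The publicly available composite corpus therefore accounts for roughly 88% of the training audio, and the EITB data for roughly 12%.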

	### Training procedure
	This model was trained starting from the pre-trained Spanish model [stt_es_conformer_transducer_large](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_es_conformer_transducer_large) for several hundred epochs on multiple GPUs, using the NeMo toolkit [3].
	The tokenizer for this model is a SentencePiece [2] tokenizer built from the text transcripts of the composite train dataset with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py), with a total of 512 Basque language tokens.
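
The tokenizer for this model was built with the linked NeMo script. As a rough illustration of what building a 512-token SentencePiece Unigram tokenizer involves, the sketch below calls the `sentencepiece` Python package directly. The helper names and file layout are hypothetical, and the options actually used for this model are not stated here, so treat the values other than the vocabulary size and model type as assumptions.

```python
def tokenizer_training_args(text_file: str, model_prefix: str) -> dict:
    """Collect SentencePiece trainer options for a 512-token Unigram model."""
    return {
        "input": text_file,            # one transcript per line (assumed layout)
        "model_prefix": model_prefix,  # writes <prefix>.model and <prefix>.vocab
        "vocab_size": 512,
        "model_type": "unigram",
    }


def train_tokenizer(text_file: str, model_prefix: str) -> None:
    """Train the tokenizer; requires the `sentencepiece` package."""
    # Imported lazily so the options above can be inspected without the package.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**tokenizer_training_args(text_file, model_prefix))
```

Calling `train_tokenizer("train_transcripts.txt", "tokenizer_eu")` (both names are placeholders) would write the tokenizer model and vocabulary files next to the given prefix.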

	## Performance
	The performance of the model is reported as Word Error Rate (WER, %) with greedy decoding in the following table.

	| Tokenizer | Vocabulary Size | MCV 18.0 Test | Basque Parliament Test | OpenSLR Test | EITB Test | MCV 18.0 Dev | Basque Parliament Dev | OpenSLR Dev | EITB Dev | Train Dataset |
	|-----------------------|-----------------|---------------|------------------------|--------------|-----------|--------------|-----------------------|-------------|----------|------------------------------|
	| SentencePiece Unigram | 512 | 2.50 | 3.78 | 11.87 | 9.17 | 2.28 | 4.20 | 11.65 | 12.81 | Composite Dataset (771.73 h) |
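
For readers unfamiliar with the metric, WER is the number of word-level substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The function below is a minimal pure-Python sketch of that standard definition. It is not the evaluation code behind the table above, and any text normalization applied before scoring is not specified here.

```python
from typing import List


def word_error_rate(references: List[str], hypotheses: List[str]) -> float:
    """Corpus-level WER in percent: total word edits / total reference words."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        r, h = ref.split(), hyp.split()
        # Levenshtein distance over words, keeping only the previous row.
        prev = list(range(len(h) + 1))
        for i, r_word in enumerate(r, start=1):
            curr = [i]
            for j, h_word in enumerate(h, start=1):
                cost = 0 if r_word == h_word else 1
                curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
            prev = curr
        total_edits += prev[-1]
        total_words += len(r)
    # Assumes at least one non-empty reference.
    return 100.0 * total_edits / total_words
```

For example, a reference of four words transcribed with one substitution and one deletion gives a WER of 50%.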

	## Limitations
	Since this model was trained mostly on publicly available speech datasets, its performance might degrade for speech that includes technical terms or vernacular that the model has not been trained on. The model might also perform worse for accented speech.

	# Additional Information
	## Author
	HiTZ Basque Center for Language Technology - Aholab Signal Processing Laboratory, University of the Basque Country UPV/EHU.

	## Licensing Information
	Copyright (c) 2025 HiTZ Basque Center for Language Technology - Aholab Signal Processing Laboratory, University of the Basque Country UPV/EHU.

	Licensed under the Apache License, Version 2.0 (the "License");
	you may not use this file except in compliance with the License.
	You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing, software
	distributed under the License is distributed on an "AS IS" BASIS,
	WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
	See the License for the specific language governing permissions and
	limitations under the License.

	## Funding
	This project, with reference 2022/TL22/00215335, has been partially funded by the Ministerio de Transformación Digital and by the Plan de Recuperación, Transformación y Resiliencia – Funded by the European Union – NextGenerationEU [ILENIA](https://proyectoilenia.es/) and by the project [IkerGaitu](https://www.hitz.eus/iker-gaitu/) funded by the Basque Government.
	This model was trained at [Hyperion](https://scc.dipc.org/docs/systems/hyperion/overview/), one of the high-performance computing (HPC) systems hosted by the DIPC Supercomputing Center.

	## References
	- [1] [Conformer: Convolution-augmented Transformer for Speech Recognition](https://arxiv.org/abs/2005.08100)
	- [2] [Google Sentencepiece Tokenizer](https://github.com/google/sentencepiece)
	- [3] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)

	## Disclaimer
	<details>
	<summary>Click to expand</summary>
	The models published in this repository are intended for general-purpose use and are available to third parties. These models may contain biases and/or other undesirable distortions.

	When third parties deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models), or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.

	In no event shall the owner and creator of the models (HiTZ Basque Center for Language Technology - Aholab Signal Processing Laboratory, University of the Basque Country UPV/EHU) be liable for any results arising from the use made by third parties of these models.