This repository facilitates constructing a Persian ASR (Automatic Speech Recognition) dataset by finding utterance alignments within large audio files. It uses the CTC-segmentation algorithm (Ludwig Kürzinger et al., 2020) and a Persian ASR model trained with Connectionist Temporal Classification (CTC) to find the most probable alignment between a transcript and its speech recording. The ASR model consists of an XLSR-53 representation learning model (Alexis Conneau et al., 2020), pre-trained on 53 languages using a contrastive self-supervised objective, and a linear layer trained on labeled Persian speech data. Since the XLSR-53 model has
pip install -r requirements.txt
pip install gdown
gdown 1JO_UmvZC-yDWxOfZl3TThpqGI1IQAfDW
- Create a corresponding transcript (one sentence per line) for each audio file.
- Create a CSV file that contains relative paths to audio files and their transcripts. Sample `metadata.csv`:
audio_path,transcript_path
audios/1.mp3,transcripts/1.txt
audios/2.mp3,transcripts/2.txt
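For the transcript step, a long passage can be broken into one sentence per line before alignment. The following is a minimal sketch, assuming sentences end with `.`, `!`, `?` or the Persian question mark `؟`. The names `split_sentences` and `write_transcript` are hypothetical and not part of this repository.

```python
import re

# Whitespace that follows sentence-ending punctuation:
# ASCII marks plus the Persian/Arabic question mark.
_SENTENCE_END = re.compile(r"(?<=[.!?؟])\s+")


def split_sentences(text: str) -> list[str]:
    """Split raw text into sentences, keeping the end punctuation."""
    parts = _SENTENCE_END.split(text.strip())
    return [p.strip() for p in parts if p.strip()]


def write_transcript(text: str, out_path: str) -> None:
    """Write one sentence per line, UTF-8 encoded."""
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(split_sentences(text)) + "\n")
```

Punctuation-based splitting is only a heuristic; abbreviations and numbers containing periods would need manual review.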
There is a `sample_input` directory inside the repository that contains an example.
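If the audio files and transcripts share file stems (as in `audios/1.mp3` and `transcripts/1.txt`), the metadata file can be generated rather than written by hand. This is a sketch under that assumption; `build_metadata` and its default directory names are illustrative, not a script shipped with the repository.

```python
import csv
from pathlib import Path


def build_metadata(root: str, audio_dir: str = "audios",
                   transcript_dir: str = "transcripts",
                   out_name: str = "metadata.csv") -> list[tuple[str, str]]:
    """Pair each .mp3 with the .txt sharing its stem and write the CSV."""
    root_path = Path(root)
    rows = []
    for audio in sorted((root_path / audio_dir).glob("*.mp3")):
        transcript = root_path / transcript_dir / f"{audio.stem}.txt"
        if not transcript.exists():
            continue  # skip audio files that have no transcript yet
        # Store paths relative to the root, with forward slashes.
        rows.append((audio.relative_to(root_path).as_posix(),
                     transcript.relative_to(root_path).as_posix()))
    with open(root_path / out_name, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["audio_path", "transcript_path"])
        writer.writerows(rows)
    return rows
```

Calling `build_metadata(".")` from the directory that holds `audios/` and `transcripts/` writes `metadata.csv` next to them and returns the paired rows.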
python segment.py \
--metadata metadata.csv \
--output_dir output \
--device cuda
Run `python segment.py -h` for more information about the arguments.
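When several metadata files need to be processed, the command can be driven from Python. The sketch below only uses the three flags shown above and assumes it is run from the repository root; the helper names are hypothetical.

```python
import subprocess
import sys


def segment_command(metadata: str, output_dir: str, device: str = "cuda") -> list[str]:
    """Build the argument list for segment.py using the documented flags."""
    return [sys.executable, "segment.py",
            "--metadata", metadata,
            "--output_dir", output_dir,
            "--device", device]


def run_segmentation(metadata: str, output_dir: str, device: str = "cuda") -> None:
    """Run segment.py and raise CalledProcessError if it exits non-zero."""
    subprocess.run(segment_command(metadata, output_dir, device), check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.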
To fine-tune XLSR on Persian, clone the repository and install the requirements as explained above. Next, follow these steps:
Prepare `train.csv` and `validation.csv` files containing two columns, `path` (name of an audio file) and `sentence` (corresponding transcript).
The `train/normalizer.py` script cleans the text in CSV files and saves the results with `_clean` appended to the name of the input files.
python train/normalizer.py --csv_path train.csv --delimiter "," # Generates train_clean.csv
python train/normalizer.py --csv_path validation.csv --delimiter "," # Generates validation_clean.csv
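To give an idea of what text cleaning for Persian can involve, here is an illustrative sketch. It is not the logic of `train/normalizer.py`, whose exact rules are not described here; it assumes a few common steps (mapping Arabic yeh and kaf to their Persian forms, dropping diacritics and tatweel, collapsing whitespace) and mirrors the `_clean` naming shown in the commands above. Both function names are hypothetical.

```python
import re
from pathlib import Path

# Arabic code points commonly replaced by their Persian counterparts:
# yeh (U+064A) and alef maksura (U+0649) -> Persian yeh (U+06CC),
# kaf (U+0643) -> Persian keheh (U+06A9).
_CHAR_MAP = str.maketrans({"\u064a": "\u06cc", "\u0649": "\u06cc", "\u0643": "\u06a9"})
# Arabic diacritics (U+064B..U+0652) and tatweel (U+0640).
_DIACRITICS = re.compile("[\u064b-\u0652\u0640]")


def normalize_text(text: str) -> str:
    """Apply character mapping, strip diacritics, collapse whitespace."""
    text = text.translate(_CHAR_MAP)
    text = _DIACRITICS.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_csv_path(csv_path: str) -> str:
    """Return the input path with `_clean` appended before the extension."""
    p = Path(csv_path)
    return str(p.with_name(f"{p.stem}_clean{p.suffix}"))
```

For example, `clean_csv_path("train.csv")` gives `train_clean.csv`, matching the file name noted in the comments above.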
python train/train.py \
--train_csv train_clean.csv \
--valid_csv validation_clean.csv \
--wav_dir path_to_wavs_dir
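Before starting a long training run, it can help to confirm that every file listed in the CSV is actually present. The sketch below assumes that each `path` value is resolved relative to the directory passed as `--wav_dir` and that the CSV is comma-delimited; `missing_wavs` is a hypothetical helper, not part of the training script.

```python
import csv
from pathlib import Path


def missing_wavs(csv_path: str, wav_dir: str) -> list[str]:
    """Return `path` values from the CSV that have no file under wav_dir."""
    missing = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Assumption: `path` is a file name relative to wav_dir.
            if not (Path(wav_dir) / row["path"]).is_file():
                missing.append(row["path"])
    return missing
```

An empty list from `missing_wavs("train_clean.csv", "path_to_wavs_dir")` means every listed audio file was found.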