Persian CTC-Segmentation

This repository facilitates constructing a Persian ASR (Automatic Speech Recognition) dataset by finding utterance alignments within large audio files. It uses CTC-segmentation algorithm (Ludwig Kürzinger et al., 2020) and a Persian ASR model trained with Connectionist Temporal Classification (CTC) to find the most probable alignment between a pair of text and speech. The ASR model includes a XLS-R representation learning model (Alexis Conneau et al., 2020) pre-trained on 53 languages using a contrastive self-supervised objective, and a linear layer which was trained on labeled Persian speech data. Since the XLS-R model has $O(n^2)$ space/time complexity, a chunking mechanism with overlapping chunks is used to reduce the spece/time complexity of the inference while keeping a proper context in the center of windows. This mechanism is specially essential when dealing with large audio files (few hours of speech).

Installation & Usage

1. Install The Requirements

pip install -r requirements.txt

2. Download The Model Weights

pip install gdown
gdown 1JO_UmvZC-yDWxOfZl3TThpqGI1IQAfDW

3. Prepare Data

Create a corrosponding transcript (one sentence per line) for each audio file.
Create a csv file that contains relative paths to audio files and their transcripts. Sample metadata.csv:

audio_path,transcript_path
audios/1.mp3,transcripts/1.txt
audios/2.mp3,transcripts/2.txt

There is a sample_input directory inside the repository that contains an example.

4. Run The Algorithm

python segment.py \
  --metadata metadata.csv \
  --output_dir output \
  --device cuda

Run python segment.py -h for more information about the arguments.

Fine-tune XLSR on Persian

To fine-tune XLSR on Persian, clone the repository and install the requirements as explained above. Next, follow these steps:

1. Prepare Train and Validation CSVs

Prepare train.csv and validation.csv files containing two columns, path (name of an audio file) and sentence (corresponding transcript).

2. Preprocess The Text

The train/normalizer.py script cleans the text in CSV files and saves the results with a _clean appended to the name of the input files.

python train/normalizer.py --csv_path train.csv --delimiter ","         # Generates train_clean.csv
python train/normalizer.py --csv_path validation.csv --delimiter ","    # Generates validation_clean.csv

3. Training The Model

python train/train.py \
  --train_csv train_clean.csv \
  --valid_csv validation_clean.csv \
  --wav_dir path_to_wavs_dir

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
images		images
sample_input		sample_input
train		train
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
config.json		config.json
preprocessor_config.json		preprocessor_config.json
requirements.txt		requirements.txt
segment.py		segment.py
utils.py		utils.py
vocab.json		vocab.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

images

images

sample_input

sample_input

train

train

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

config.json

config.json

preprocessor_config.json

preprocessor_config.json

requirements.txt

requirements.txt

segment.py

segment.py

utils.py

utils.py

vocab.json

vocab.json

Repository files navigation

Persian CTC-Segmentation

Installation & Usage

1. Install The Requirements

2. Download The Model Weights

3. Prepare Data

4. Run The Algorithm

Fine-tune XLSR on Persian

1. Prepare Train and Validation CSVs

2. Preprocess The Text

3. Training The Model

About

Languages

License

alifarrokh/persian-ctc-segmentation

Folders and files

Latest commit

History

Repository files navigation

Persian CTC-Segmentation

Installation & Usage

1. Install The Requirements

2. Download The Model Weights

3. Prepare Data

4. Run The Algorithm

Fine-tune XLSR on Persian

1. Prepare Train and Validation CSVs

2. Preprocess The Text

3. Training The Model

About

Topics

Resources

License

Stars

Watchers

Forks

Languages