This project fine-tunes the facebook/wav2vec2-base model from Hugging Face. We gathered the recordings for the dataset ourselves. To preprocess the audio, we applied noise reduction, removed non-speech parts, dehummed the recordings, segmented each recording, and resampled the audio to 16 kHz. We then transcribed the segments and cleaned the transcriptions. We also ran a hyperparameter search to find the best hyperparameters. The resulting model reaches a word error rate (WER) of 26% and a loss of 0.57.
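
For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between the reference transcription and the model output, divided by the number of reference words. The sketch below is only an illustration and not the project's actual code: the function names `normalize_transcript` and `word_error_rate` are hypothetical, and the cleaning rules (lowercasing, keeping only the letters a to z, apostrophes, and spaces) are an assumption that suits English text and may differ from the cleaning used for this dataset.

```python
import re


def normalize_transcript(text: str) -> str:
    """Clean a transcription (assumed rules, English-only):
    lowercase, replace anything that is not a-z, apostrophe, or space
    with a space, then collapse repeated whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z' ]+", " ", text)
    return " ".join(text.split())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = normalize_transcript("The cat sat on the mat.")
    hypothesis = normalize_transcript("the cat sit on mat")
    print(f"WER: {word_error_rate(reference, hypothesis):.2%}")
```

Applying the same normalization to both the reference and the model output before scoring keeps casing or punctuation differences from being counted as word errors. An established library such as `jiwer` can be used in place of the hand-written edit distance.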

Link to Google Colab