This repository contains the code for my Bachelor thesis, "Performance of Objective Speech Quality Metrics on Languages Beyond Validation Data: A Study of Turkish and Korean", completed at TU Delft (2025).
Abstract:
This study investigates the performance of two objective speech quality metrics, Perceptual Evaluation of Speech Quality (PESQ) and Virtual Speech Quality Objective Listener (ViSQOL), in predicting human-rated speech quality scores, which are essential for telecommunication systems' Quality of Experience (QoE). These metrics have been validated using a limited number of languages due to the insufficiency of labeled data with human-rated scores. This research focuses on the applicability of PESQ and ViSQOL in Turkish and Korean, two languages that were not part of the validation data for calibrating these metrics. The experiment used English as the baseline language for comparison, and the results showed that Turkish samples had higher average ViSQOL scores, with the difference being statistically significant compared to the English samples. Furthermore, Turkish male speakers had the highest correlation between PESQ and ViSQOL scores, and ViSQOL rated speech higher than PESQ, especially under babble noise degradations. Future research should focus on extending this study by exploring biases across additional metrics and languages, while also constructing a dataset with labeled subjective scores for more languages to improve the calibration of these metrics.
Below are the key visualizations from the experiment:
Metric | Comparison | KS-statistic | p-value |
---|---|---|---|
PESQ | English vs Korean | 0.17 | 0.61 |
PESQ | English vs Turkish | 0.20 | 0.44 |
PESQ | Korean vs Turkish | 0.21 | 0.29 |
ViSQOL | English vs Korean | 0.14 | 0.79 |
ViSQOL | English vs Turkish | 0.33 | 0.02 |
ViSQOL | Korean vs Turkish | 0.23 | 0.18 |

Metric | Comparison | KS-statistic | p-value |
---|---|---|---|
PESQ | Blue vs Pink Noise | 0.12 | 0.93 |
PESQ | Blue vs Babble Noise | 0.17 | 0.61 |
PESQ | Pink vs Babble Noise | 0.19 | 0.44 |
ViSQOL | Blue vs Pink Noise | 0.10 | 0.99 |
ViSQOL | Blue vs Babble Noise | 0.31 | 0.04 |
ViSQOL | Pink vs Babble Noise | 0.29 | 0.06 |

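
The KS-statistics in the two tables above compare score distributions between pairs of groups. As a rough illustration of what that statistic measures, the sketch below computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs) in plain Python. The function name `ks_statistic` and the example scores are made up for illustration; they are not taken from this repository or from the thesis data, and the repository's own analysis may use a library routine instead.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max |ECDF_a(x) - ECDF_b(x)| over all x."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")

    largest_gap = 0.0
    # The ECDFs only change at observed values, so checking those is enough.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical quality scores for two groups (illustrative values only).
group_one = [2.1, 2.8, 3.0, 3.4, 3.9]
group_two = [2.5, 3.2, 3.6, 4.0, 4.2]
statistic = ks_statistic(group_one, group_two)
```

If SciPy is installed, `scipy.stats.ks_2samp(group_one, group_two)` returns this statistic together with a p-value.
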
Metric | Overall | Non-Turkish Male | Turkish Male | Diff |
---|---|---|---|---|
MAD | 0.71 | 0.73 | 0.62 | -0.11 |
RMSD | 0.89 | 0.91 | 0.77 | -0.13 |
Mean difference | -0.62 | -0.65 | -0.47 | 0.18 |
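
The agreement measures in the table above can be sketched as follows. This is a minimal example under stated assumptions: it reads MAD as mean absolute difference and RMSD as root mean squared difference between paired PESQ and ViSQOL scores, and it takes differences as PESQ minus ViSQOL (which would match the negative mean difference and the observation that ViSQOL rated speech higher than PESQ). The function name `agreement_metrics` and the example scores are hypothetical and do not come from this repository.

```python
import math


def agreement_metrics(pesq_scores, visqol_scores):
    """Summarise how far paired PESQ and ViSQOL scores are from each other."""
    if len(pesq_scores) != len(visqol_scores) or not pesq_scores:
        raise ValueError("score lists must be non-empty and of equal length")

    # Assumed sign convention: PESQ minus ViSQOL.
    diffs = [p - v for p, v in zip(pesq_scores, visqol_scores)]
    n = len(diffs)
    return {
        "mad": sum(abs(d) for d in diffs) / n,             # mean absolute difference
        "rmsd": math.sqrt(sum(d * d for d in diffs) / n),  # root mean squared difference
        "mean_difference": sum(diffs) / n,                 # signed average gap
    }


# Illustrative scores only, not thesis data.
example = agreement_metrics([1.0, 2.0, 3.0], [2.0, 2.0, 5.0])
```

Under these assumptions, the "Diff" column would correspond to computing the same metrics on two subsets of samples and subtracting one result from the other.
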
These plots provide insights into the relationships between different metrics, score distributions, noise type effects, and how SNR affects PESQ and ViSQOL scores.
To install the required packages, follow these steps:
- Install MATLAB (R2024a or later).
- Install Python 3.
- Install the required Python dependencies by running `pip install -r requirements.txt`.
- Install Audio Toolbox in MATLAB, as it is required for audio processing.
To run the experiment, execute the following command:
python main.py
To plot your results and perform statistical analysis, execute the following command:
python results_extractor.py
Note that plotting requires an `analysis_results.json` file in the same format as the one included in this repository; `main.py` does not generate this file automatically. To create it, run `fix_txt_files_to_json` from `file_manager.py` on each of the individual per-language results txt files. Then extract the `results` component and save it to `analysis_results.json`.
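
The final extraction step could look roughly like the sketch below. It assumes that the output of `fix_txt_files_to_json` is a JSON file whose top level is an object with a `results` key; that layout is an assumption, so check it against the `analysis_results.json` shipped in this repository. The helper name `extract_results` and the file name `english_results.json` are hypothetical and are not part of `file_manager.py`.

```python
import json
from pathlib import Path


def extract_results(converted_path, output_path="analysis_results.json"):
    """Copy the top-level "results" entry of a converted file to output_path."""
    data = json.loads(Path(converted_path).read_text(encoding="utf-8"))

    # Assumption: the converted file stores the scores under a "results" key.
    if "results" not in data:
        raise KeyError(f'no "results" entry found in {converted_path}')

    results = data["results"]
    Path(output_path).write_text(json.dumps(results, indent=2), encoding="utf-8")
    return results


# Example call (hypothetical input file name):
# extract_results("english_results.json")
```

How the per-language results should be combined into a single file is not covered here; follow the structure of the included `analysis_results.json`.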
Please cite this repository if it was useful for your research:
@article{javipeloza2025speechqualitybias,
title={Performance of Objective Speech Quality Metrics on Languages Beyond Validation Data: A Study of Turkish and Korean},
author={Javier Perez Lopez},
year={2025},
school={Delft University of Technology},
type={Bachelor Thesis},
url = {https://resolver.tudelft.nl/uuid:b07c65cb-a633-4a63-8a7b-95d8ec4b8914},
}