A CNN-BASED APPROACH FOR SPEAKER IDENTIFICATION USING MFCC FEATURES IN NOISY AND REAL-WORLD ENVIRONMENTS
Abstract
Speech extraction and recognition have become essential components of modern intelligent systems, especially in applications that require accurate speaker identification under real-world conditions. This study presents a deep learning-based approach to speaker identification using Convolutional Neural Networks (CNNs) trained on Mel-Frequency Cepstral Coefficients (MFCCs). The research combines locally collected speech data with an external benchmark dataset (Mikhailava et al., 2022) to evaluate model performance under varying acoustic conditions. A total of 630 audio samples were collected from 21 participants across diverse environments, including both clean and noisy recordings. On the local dataset, the proposed model achieved a training accuracy exceeding 95%, a validation accuracy of approximately 73%, and a test accuracy of 75.82%. Evaluation on the benchmark dataset produced a test accuracy of 100%, indicating strong model learning under controlled conditions. The results show that while high accuracy can be achieved with clean data, performance declines in noisy real-world environments because of variability in speech patterns, recording quality, and background interference. The study demonstrates that CNN-based models can effectively support speaker identification tasks, and it highlights the need for improved generalization, larger datasets, and better noise-handling techniques before practical deployment.
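As an illustration of the MFCC features mentioned above, the following is a minimal sketch of a textbook MFCC computation in pure Python (framing, Hamming window, power spectrum, mel filterbank, log, DCT). It is not the authors' implementation: the function names and every parameter value (sample rate, frame length, hop, number of filters and coefficients) are assumptions chosen only to keep the example small, and a practical system would normally rely on an optimized audio library instead of a naive DFT.

```python
import math

def hz_to_mel(f):
    # Common mel-scale formula (one of several conventions in use).
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    # Naive DFT power spectrum for bins 0..N/2 (slow, but dependency-free).
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        spec.append((re * re + im * im) / n)
    return spec

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters whose edges are evenly spaced on the mel scale.
    mel_max = hz_to_mel(sample_rate / 2)
    edges = [mel_to_hz(mel_max * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mfcc(signal, sample_rate=8000, frame_len=64, hop=32, n_filters=12, n_coeffs=8):
    # All default values here are illustrative assumptions, not from the study.
    window = [0.54 - 0.46 * math.cos(2 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        spec = power_spectrum(frame)
        # Log mel energies; the small constant avoids log(0).
        log_e = [math.log(sum(w * p for w, p in zip(row, spec)) + 1e-10)
                 for row in bank]
        # Unnormalized DCT-II of the log energies gives the cepstral coefficients.
        coeffs = [sum(e * math.cos(math.pi * c * (m + 0.5) / n_filters)
                      for m, e in enumerate(log_e))
                  for c in range(n_coeffs)]
        features.append(coeffs)
    return features

# Example: a short 440 Hz tone yields one row of coefficients per frame.
tone = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(256)]
features = mfcc(tone)
```

The resulting frames-by-coefficients matrix is the kind of two-dimensional input a CNN can treat like an image, which is one common reason MFCCs are paired with convolutional models.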
License
Copyright (c) 2026 Science World Journal

This work is licensed under a Creative Commons Attribution 4.0 International License.