
(IJACSA) International Journal of Advanced Computer Science and Applications,

Vol. 14, No. 12, 2023

Speech Recognition Models for Holy Quran Recitation Based on Modern Approaches and Tajweed Rules: A Comprehensive Overview

Sumayya Al-Fadhli1, Hajar Al-Harbi2, Asma Cherif3

Department of Computer Science, King Abdulaziz University, Jeddah, Saudi Arabia1,2
Department of Computer Science-Adham University College, Umm Al-Qura University, Makkah, Saudi Arabia1
Department of Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia3
Center of Excellence in Smart Environment Research, King Abdulaziz University, Jeddah, Saudi Arabia3

Abstract—Speech is considered the most natural way to communicate with people. The purpose of speech recognition technology is to allow machines to recognize and understand human speech, enabling them to take action based on the spoken words. Speech recognition is especially useful in educational fields, as it can provide powerful automatic correction for language learning purposes. In the context of learning the Quran, it is essential for every Muslim to recite it correctly. Traditionally, this involves an expert qari who listens to the student's recitation, identifies any mistakes, and provides appropriate corrections. While effective, this method is time-consuming. To address this challenge, apps that help students fix their recitation of the Holy Quran are becoming increasingly popular. However, these apps require a robust and error-free speech recognition model. While recent advancements in speech recognition have produced highly accurate results for written and spoken Arabic and non-Arabic speech recognition, the field of Holy Quran speech recognition is still in its early stages. Therefore, this paper aims to provide a comprehensive literature review of the existing research in the field of Holy Quran speech recognition. Its goal is to identify the limitations of current works, determine future research directions, and highlight important research in the fields of spoken and written languages.

Keywords—Speech recognition; acoustic models; language model; neural network; deep learning; quran recitation

I. INTRODUCTION

Speech is the most natural way to communicate with people [7]. Designing a machine that mimics human behavior, including speaking naturally and responding correctly to spoken language, has puzzled engineers and scientists for centuries [51]. Automatic speech recognition (ASR) refers to the computational process of transforming acoustic speech signals into written words or other linguistic units through dedicated algorithms [28], [7]. The goal of ASR is to enable machines to interpret and respond to spoken language [4]. ASR involves the capability of a machine to accurately recognize speech, convert it into text, and take appropriate actions based on human instructions [7]. In particular, speech recognition is useful in educational fields, as it allows for the building of powerful automatic correctors for language learning purposes. For instance, [41] built a model for English pronunciation learning for Chinese learners.

Researchers have made significant contributions to speech processing in various languages spoken worldwide. Arabic, which has approximately 420 million speakers [47], comprises three classes. The primary class taught in schools is Modern Standard Arabic (MSA), which adheres to the grammatical rules of the Arabic language. The second class is Arabic Dialect (AD), which represents the everyday spoken language of native Arabic speakers, varying across countries and regions. The third class is Classical Arabic (CA), the language used in the Holy Quran, which has been renowned globally for centuries. CA is known for its extensive grammar and vocabulary, as well as its unique recitation guidelines [15], [24].

Recently, the use of speech recognition in the Quranic recitation field has emerged as an important research direction. Indeed, there are more than two billion Muslims in the world [2]. Muslims generally strive to learn the precise recitation of CA and adhere to certain rules known as Tajweed in order to recite the Holy Quran accurately. Learning these rules is very important for all Muslims to master the recitation of the Holy Quran [15]. Consequently, building accurate Holy Quran Speech Recognition (HQSR) models represents a significant research outcome for all Muslims.

Teaching the correct recitation of the Quran is essential for every Muslim. Learning Quran recitation usually depends on an expert, also known as a qari, who listens to the student's recitation, determines recitation mistakes, and instructs the student with the appropriate correction. This way of learning is very effective, but it is time-consuming because the teacher needs to correct the errors of every student independently. For this reason, apps that help students fix their recitation of the Holy Quran are beneficial and essential, but these apps need a robust and error-free speech recognition model. Despite several research studies in this area, researchers have not yet achieved an optimal solution for recognizing speech in the Holy Quran. Though recent models have been applied to written and spoken Arabic and non-Arabic speech recognition and produced highly accurate results, Quran speech recognition is still in its early stages. Therefore, this paper aims to propose a comprehensive literature review of the works in the field of Holy Quran speech recognition and to shed light on some important research in the field of spoken and written languages.

The main motivation for our research is as follows:

1) Though the Holy Quran represents an essential book for all Muslims, current models for Holy Quran speech recognition have low accuracy or do not cover all chapters (i.e., they rely on small datasets).
2) Some people find it difficult to attend Quran learning courses or to retrieve their memorization in front of a teacher; many individuals struggle due to fear. Thus, building a professional app for Quran learning is important to help them retain the Quran at home.
3) Quran memorization requires a continuous review process, which is time-consuming. Thus, it is hard for Quran teachers to listen to and validate long recitations for many students.
4) Some people prefer reciting what they memorize, especially in night prayer (i.e., without reading from the Mushaf, so as not to lose their submission in prayer). However, they can easily make mistakes. An automatic corrector can assist Muslims in their prayers.
5) Some non-Arabic countries, mainly those with a minority of Muslims, do not have enough qualified teachers to teach the Holy Quran.

However, research in the field of HQSR is still in its early stages. Indeed, recognizing individual words is easy, but the challenge is recognizing continuous recitation [7] and detecting erroneous recitation and violations of tajweed rules. In the realm of speech recognition systems (see Fig. 1), various factors, such as speaker dependency, vocabulary size, and noisy environments, can significantly impact performance. Recognition performance increases with a limited vocabulary and reciter-dependent conditions, while with a broad vocabulary and reciter-independent scenarios, performance can decrease significantly [7]. Besides, most research developments focus on one or a few chapters or a few tajweed rules. Also, existing works in HQSR suffer from the lack of large datasets. Finally, current works use traditional techniques and do not investigate end-to-end learning.

The critical objective of this research is to use machine learning for Holy Quran recitation. Our main contribution is a thorough literature review that identifies the most important issues needing further investigation in the field of Holy Quran speech recognition. Moreover, our study summarizes some important and informative papers in the Arabic and non-Arabic language fields as well as recent papers in the HQSR field, and provides a taxonomy for speech recognition and HQSR.

Various machine learning algorithms could be used in speech recognition, including Dynamic Time Warping (DTW), Hidden Markov Models (HMM), and Artificial Neural Networks (ANN) [51].

In the context of Arabic ASR, many algorithms have been used, such as recurrent neural networks (RNN), long short-term memory (LSTM), which is a particular case of RNN, and connectionist temporal classification (CTC) [4].

The remainder of this paper is structured as follows: Section II discusses some important research in written and spoken language speech recognition. Section III discusses recent papers in the field of Holy Quran Speech Recognition. Next, Section IV discusses some of the research directions in the field of Holy Quran Speech Recognition. Finally, Section V concludes the paper.

II. SPEECH RECOGNITION FOR WRITTEN & SPOKEN LANGUAGES

In this section, we highlight some speech recognition solutions that produced impressive results in Arabic and non-Arabic languages. These solutions may be categorized as traditional speech recognition (either with a deep learning architecture or without deep learning) or as end-to-end-based speech recognition solutions.

A. Traditional Models

Fig. 2 shows that traditional ASR systems are made up of three separate parts: the acoustic model, the pronunciation model, and the language model [54]. The Acoustic Model (AM) assesses the likelihood of acoustic units such as phonemes, graphemes, or sub-word units [13]. In contrast, the Language Model (LM) evaluates the likelihood of word sequences. By integrating linguistic knowledge derived from extensive text collections, language models improve the precision of acoustic models. These models use the acquired syntactic and semantic rules to re-evaluate the hypotheses generated by the acoustic model. The mapping of phoneme sequences to words is done by the Pronunciation Dictionary (PD), which aligns the phonetic transcriptions produced by the AM with the unprocessed text used in language models. The three components are trained individually and then merged into a search graph by utilizing finite-state transducers (FSTs). Feature Extraction (FE) takes the input speech, produces the essential features, and sends these features to the decoder. Following that, the decoder produces lattices, which are then evaluated and ordered to generate the desired sequences of words.
The acoustic model can be modeled using HMMs [32] and Gaussian Mixture Models (GMMs) [61]. It is worth noting that recent ASR models have replaced the GMMs in the acoustic model with deep neural networks (DNNs) [30]. These are referred to as hybrid HMM-DNN models and are widely used as competitive ASR models. Also, some research replaced GMMs with Bidirectional Long Short-Term Memory (BLSTM) [50], while other studies replaced the HMM with another classification method, such as Support Vector Machines (SVM) [12], [36], Linear Discriminant Analysis (LDA) combined with Quadratic Discriminant Analysis (QDA) [33], Convolutional Neural Networks (CNNs) and SVM [40], or Hidden Semi-Markov Models (HSMM) [34].

Several studies have addressed traditional speech recognition for Arabic and non-Arabic languages.

Arabic Language. In their study, [39] introduced a novel approach that combines three distinct training systems for speech recognition, including four-gram language model re-scoring, system combination with minimum Bayes risk decoding, and lattice-free maximum mutual information. They achieved significant progress, with a word error rate of 42.25% on the Multi-Genre Broadcast (MGB-3) Arabic development set. They obtained this result by using a 4-gram re-scoring strategy for a chain BLSTM system, which outperformed a DNN system that had a word error rate of 65.44%.
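To make the n-best re-scoring step concrete, the following is a minimal sketch of how first-pass hypotheses are re-ranked with a higher-order LM. This is our illustration under stated assumptions, not the authors' code; the score-combination weights are illustrative.

```python
# Re-rank first-pass hypotheses by interpolating acoustic and 4-gram
# language-model log-probabilities (a standard rescoring recipe).
from dataclasses import dataclass

@dataclass
class Hypothesis:
    words: list          # decoded word sequence
    am_logprob: float    # acoustic score from the first pass
    lm_logprob: float    # log-probability under the 4-gram LM

def rescore(nbest, lm_weight=0.7, word_penalty=0.0):
    """Return the n-best list re-ranked by the combined score."""
    def combined(h):
        return h.am_logprob + lm_weight * h.lm_logprob \
               + word_penalty * len(h.words)
    return sorted(nbest, key=combined, reverse=True)

# Usage: best = rescore(first_pass_nbest)[0]
```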

Fig. 1. Speech recognition taxonomy.

In [60], the authors presented a comprehensive framework for Arabic speech recognition. To turn sequences of Mel Frequency Cepstral Coefficients (MFCC) and Filter Bank (FB) features into fixed-size vectors, they used recurrent LSTM or GRU architectures. They then fed these vectors into a multi-layer perceptron (MLP) network to perform classification and recognition tasks. The researchers evaluated their system using two different databases: one for spoken-digit recognition and another for spoken TV commands. However, a limitation of their work is the absence of datasets that incorporate recorded speech signals in noisy, realistic environments.

In [12], the authors presented a speech recognition system for the Arabic language. The system aimed to evaluate three feature extraction algorithms: MFCC, Power Normalized Cepstral Coefficients (PNCC), and the Modified Group Delay Function (ModGDF). They performed the classification process using an SVM. The results indicated that PNCC was the most effective algorithm, while ModGDF achieved moderate accuracy. PNCC and ModGDF outperformed MFCC in terms of precision: PNCC achieved an accuracy rate of 93% to 97%, ModGDF achieved 90%, and MFCC achieved 88%.
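As a concrete illustration of this MFCC-plus-SVM recipe, here is a minimal sketch assuming librosa and scikit-learn; the file names and labels are hypothetical placeholders, not the datasets used in these studies.

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def mfcc_features(path, n_mfcc=13):
    # Load audio, compute MFCCs, and average over time to get a
    # fixed-size utterance-level feature vector.
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

train_files = ["digit0.wav", "digit1.wav"]   # hypothetical files
train_labels = [0, 1]                        # hypothetical labels

X = np.stack([mfcc_features(f) for f in train_files])
clf = SVC(kernel="rbf").fit(X, train_labels)
print(clf.predict([mfcc_features("test.wav")]))  # hypothetical test file
```

Averaging MFCC frames into one vector is the simplest way to feed variable-length audio to a frame-agnostic classifier such as an SVM; sequence models like the LSTM/GRU encoders of [60] avoid this information loss.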
Non-Arabic languages. The authors of [42] presented the design of Kaldi, a speech recognition toolkit that is freely available and open-source. The highly permissive Apache License v2.0, under which Kaldi is released, enables extensive usage. Kaldi provides a robust speech recognition system that utilizes finite-state transducers and is built on the OpenFst library. The toolkit provides comprehensive documentation and scripts that make it easier to build complete recognition systems. Kaldi is coded in C++, and its core library offers a range of functionalities, including phonetic-context modeling, acoustic modeling using subspace Gaussian mixture models (SGMM), standard Gaussian mixture models, and linear and affine transforms.

Fig. 2. Traditional ASR Pipeline ([54]).

A speech-learning system for the English language was developed and implemented by [41]. It utilized a speech recognition technique based on HMMs to decode speech using the Viterbi algorithm and to determine the recognition score through posterior probability. The system achieved an average recognition rate of 94%. Its purpose was to help English learners assess pronunciation accuracy during verbal practice and identify different types of errors. By engaging in systematic practice with this system, users can significantly enhance their listening and speaking skills. The system provides real-time feedback on oral pronunciation accuracy and error-correction reports, and it allows for repeated practice to facilitate effective training.
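The Viterbi decoding that [41] relies on can be illustrated with a generic discrete-observation implementation; this is our sketch of the textbook recursion, not the system's code.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely HMM state path for a discrete observation sequence.

    obs: observation indices, pi: initial state probs (N,),
    A: transition matrix (N, N), B: emission matrix (N, M).
    """
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))           # best log-score ending in state j
    psi = np.zeros((T, N), dtype=int)  # backpointers
    logA, logB = np.log(A), np.log(B)
    delta[0] = np.log(pi) + logB[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA   # scores[i, j]
        psi[t] = scores.argmax(axis=0)          # best predecessor of j
        delta[t] = scores.max(axis=0) + logB[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1], delta[-1].max()
```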
Table I summarizes traditional speech recognition techniques for Arabic and non-Arabic languages.

TABLE I. SUMMARY OF TRADITIONAL SPEECH RECOGNITION TECHNIQUES IN WRITTEN AND SPOKEN LANGUAGES

Ref | Lang. | Main idea | FE | Classification | LM | AM
[12] | Arabic | Study three feature extraction methods (MFCC, PNCC, and ModGDF) for the development of an ASR system in Arabic. | MFCC, PNCC, and ModGDF | SVM | - | -
[60] | Arabic | Present an approach based on RNNs to process variable-length sequences of MFCC, FB, and delta-delta features of the different spoken digits/commands. | MFCC (static and dynamic features) and the FB coefficients | LSTM and a neural network (MultiLayer Perceptron: MLP) classifier | - | BiLSTM model
[39] | Arabic | Improve hybrid ASR for MGB & Al-Jazeera speech data. | MFCC features | Hybrid ASR using TDNN-LSTM & Bi-directional prioritized grid LSTM (BPGLSTM) | n-gram LM | Hybrid ASR using TDNN-LSTM & BPGLSTM
[42] | Non-Arabic | Build a new open-source toolkit for conventional speech recognition from scratch, called the Kaldi toolkit. | MFCC features | GMM-DNN-HMM | bigram | DNN-HMM
[41] | Non-Arabic | Build a lightweight speech recognition system using GMM-HMM to help Chinese speakers learn English via HMM-based speech recognition. | MFCC features | HMM-based | n-gram | GMM-HMM

B. End-to-End-based Speech Recognition Models

The purpose of an end-to-end (E2E) system is to directly transform a series of acoustic features into a corresponding series of graphemes or words. This approach greatly simplifies traditional speech recognition methods by eliminating the need for manually labeled alignment information in the neural network. Instead, the E2E system automatically learns language and pronunciation information, as depicted in Fig. 3.

Fig. 3. End-to-End ASR Pipeline ([54]).

End-to-end speech recognition systems typically rely on an encoder-decoder framework. According to studies [17], [13], this architecture takes an audio file as input and processes it through a series of convolution layers to generate a condensed vector. The decoder then uses this vector to generate a character sequence. Researchers can use different objective functions, such as CTC [23], ASG [20], LF-MMI [25], sequence-to-sequence [18], transduction [44], and differentiable decoding [19], to optimize the end-to-end ASR [13]. Researchers have also explored different neural network architectures, including ResNet [27], TDS [26], and Transformer [52]. Additionally, integrating an external language model has been shown to improve the overall performance of the system.
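As one concrete example of these objectives, the CTC criterion [23] scores all alignments between encoder outputs and a transcript. A minimal PyTorch sketch follows; all tensors are random placeholders standing in for real encoder outputs and reference transcripts.

```python
import torch
import torch.nn as nn

# Toy dimensions: T acoustic frames, batch of B utterances, vocabulary
# of 30 graphemes plus the CTC blank symbol at index 0.
T, B, V = 50, 2, 31
log_probs = torch.randn(T, B, V, requires_grad=True).log_softmax(2)
targets = torch.randint(1, V, (B, 12))             # reference grapheme ids
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 12, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)     # marginalizes over all valid alignments
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()               # gradients would flow back into the encoder
```

Because CTC needs no frame-level labels, it is what lets E2E systems dispense with the forced alignments that traditional pipelines require.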
Recently, some research has been suggested for both Arabic and non-Arabic speech recognition using end-to-end models. In what follows, we discuss and summarize these solutions.

Arabic Language. In [9], the researchers introduced the first comprehensive approach to building an Arabic speech-to-text transcription system. They utilized lexicon-free RNNs and the CTC objective function to achieve this. The system consisted of three main components: a BDRNN acoustic model, a language model, and a character-based decoder. Unlike word-level decoders, their decoder did not rely on a lexicon during the transcription process. The RNN acoustic and language model successfully distinguished between characters with the same accent but different writing styles. The researchers evaluated the model using a 1200-hour corpus of Aljazeera multi-genre broadcast programs, resulting in a 12.03% word error rate for non-overlapped speech. It is important to note that deep learning techniques were only used in the feature extraction phase.

In [16], the authors proposed a robust diacritized ASR system using both traditional ASR and end-to-end ASR techniques. They trained and tested their models on the Standard Arabic Single Speaker Corpus (SASSC) with diacritized text data, using MFCCs and FB for feature extraction. The traditional ASR system incorporated a total of eight models, comprising four GMM models, two SGMM models, and two DNN models. They constructed these models using the KALDI toolkit and performed language modeling using CMUCLMTK. The best achieved word error rate (WER) among these models was 33.72%, using DNN-MPE. Additionally, the authors proposed an end-to-end approach for diacritized Arabic ASR, employing joint CTC-attention and CNN-LSTM attention methods. The CNN-LSTM with attention method outperformed the others, achieving a character error rate (CER) of 5.66% and a WER of 28.48%. This method resulted in a significant reduction in WER compared to both the traditional ASR and the joint CTC-attention method, by 5.24% and 2.62%, respectively.

The researchers of [31] conducted a comprehensive comparison of the Arabic language and its dialects using different ASR approaches. They collected a new evaluation set comprising news reports, conversational speech, and various datasets to ensure unbiased analysis. They extensively analyzed the errors and compared the ASR system's performance with that of expert linguists and native speakers. While the machine ASR system showed better performance than the native speaker, there was still an average WER gap of 3.5% compared to expert linguists in raw Arabic transcription. The proposed end-to-end transformer model outperformed prior state-of-the-art systems on the MGB2, MGB3, and MGB5 datasets, achieving new state-of-the-art performances of 12.5%, 27.5%, and 33.8%, respectively.

Non-Arabic Languages. Researchers introduced ESPnet, a novel open-source platform for end-to-end speech processing, in [55]. ESPnet leverages dynamic neural network toolkits, namely Chainer and PyTorch, as the primary deep learning engine. This platform simplifies the training and recognition processes of the entire ASR pipeline. ESPnet uses the same feature extraction/format, data processing, and recipe style as the Kaldi ASR toolkit, giving researchers a complete way to test speech recognition and other speech processing techniques. The test results show that ESPnet performs well on ASR and is about as efficient as the most advanced HMM/DNN systems that use traditional setups. Notably, ESPnet has made significant advancements, including the incorporation of multi-GPU functionality (up to 5 GPUs). In just 26 hours, ESPnet completed training on 581 hours of the CSJ task.

In [25], the authors described a simple HMM-based end-to-end method for ASR and evaluated it on well-known large-vocabulary speech recognition tasks, specifically the Switchboard and Wall Street Journal (WSJ) corpora. The authors trained the acoustic model used in this approach without the need for initial alignments, prior training, pre-estimation, or transition training, making it entirely neural except for the decoding/LM part. The proposed method surpassed other end-to-end methods in similar setups, particularly when dealing with small databases. By employing a comprehensive biphone modeling approach, the researchers achieved results almost comparable to regular LF-MMI training.

In [18], the researchers introduced a new attention-based model called Listen, Attend, and Spell (LAS) for sequence-to-sequence speech recognition. LAS combines the acoustic, pronunciation, and language model components of regular ASR systems into a single neural network, so there is no need for a separate dictionary or text normalization. The researchers compared LAS with a hybrid HMM-LSTM system and found that LAS achieved a WER of 5.6%, outperforming the hybrid system's WER of 6.7%. In a dictation task, LAS achieved a WER of 4.1%, while the hybrid system achieved a WER of 5%.

The study [29] introduced a new strategy to address multilingual ASR, specifically in the context of code-switching speech. The researchers employed three techniques to achieve this. First, they decoded the speech by utilizing a global language model constructed from multilingual text. Their system used a multigraph approach along with weighted finite-state transducers (WFST), which let them switch between languages while decoding by using a closure operation. The output of this process was a bilingual or multilingual text based on the input audio. Secondly, they employed a robust transformer system for speech decoding. Among the techniques used, they found that WFST decoding was particularly suitable for inter-sentential code-switching datasets.

Table II summarizes end-to-end speech recognition techniques for Arabic and non-Arabic languages.

In the following, we compare the end-to-end architectures mentioned in the previous works with the baseline traditional techniques on the same datasets, as indicated in the previously reviewed studies. As shown in Table III, the end-to-end architecture outperforms the hybrid architecture in all the studies mentioned in the table, except [25].
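The WER figures compared in Table III are the Levenshtein (edit) distance between reference and hypothesis word sequences, normalized by the reference length. A standard textbook implementation, shown here only for illustration:

```python
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / max(len(ref), 1)

# e.g. wer("bismillah alrahman alraheem", "bismillah alrahman") == 1/3
```

CER is computed identically over characters instead of words, which is why diacritized Arabic systems such as [16] often report both.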
III. HOLY QURAN SPEECH RECOGNITION (HQSR)

This section summarizes the recent studies that concern HQSR. It also presents the techniques used and the performance of the proposed solutions. Moreover, it determines the gaps and limitations in the current research on HQSR. We classified the current studies of HQSR into three categories: template-based speech recognition, traditional-based speech recognition, and other HQSR studies.

A. Template Based Speech Recognition

This section summarizes the papers that follow template-based speech recognition, an older style of speech recognition that relies only on feature extraction, classification, and matching techniques (i.e., it does not use acoustic, lexical, or language models).
This section summarizes the recent studies that concern system that assists in verifying the appropriate Imaalah Check-
the HQSR. It also presents the used techniques and the perfor- ing rule for Warsh recitation. This system utilizes an auto-

TABLE II. SUMMARY OF END-TO-END BASED SPEECH RECOGNITION IN WRITTEN AND SPOKEN LANGUAGES

Ref | Lang. | Main idea | FE | E2E DL | LM | AM
[29] | Arabic & Non-Arabic | A new strategy for multilingual ASR; implementation of three strategies to identify code-switching speech. | MFCCs and FB | Transformer-based E2E architecture | n-gram | TDNN & Transformer-based E2E architecture
[31] | Arabic | A thorough examination to compare the E2E transformer ASR, the modular HMM-DNN ASR, and HSR. | MFCCs and Mel-spectrogram | E2E transformer with hybrid (CTC+Attention) | LSTM and transformer-based language model (TLM) | Combination of a TDNN with LSTM layers
[16] | Arabic | Build a robust diacritized Arabic ASR. | MFCCs and the log Mel-scale filter bank energies | Joint CTC-attention and CNN-LSTM with attention | Built by KALDI using the CMUCLMTK tool, based on 3-grams and a trained RNN-LM | ESPnet and Espresso
[9] | Arabic | E2E model for an Arabic speech-to-text transcription system using lexicon-free RNNs and the CTC objective function, based on the Stanford CTC source code. | FB | BDRNNs | n-gram | TDNN-LSTM & BDRNNs
[25] | Non-Arabic | E2E training of the AM using the LF-MMI objective function in the context of HMMs. | MFCC | TDNN-LSTM | n-gram & RNN | E2E LF-MMI
[18] | Non-Arabic | Improving the performance of LAS, a novel technique in ASR research. | FB | LAS ASR | 5-gram | LAS ASR
[55] | Non-Arabic | Proposed a purely E2E open-source speech recognition framework called the ESPnet toolkit. | MFCC & FB | E2E SR framework | RNN LM | CTC objective function

TABLE III. END-TO-END BASED SPEECH RECOGNITION PERFORMANCE COMPARED TO BASELINE TRADITIONAL TECHNIQUES

Ref. | Lang. | Dataset | Baseline Traditional Model | Baseline WER/CER | End-to-End Architecture | E2E WER/CER
[29] | Arabic & non-Arabic | MGB2 & English TEDLIUM-3 & ESCWA Corpus | Hybrid ASR | 9.8% | E2E-Transformer | 8.29%
[31] | Arabic | MGB2 and a Hidden Test (HT) | HMM-DNN | HT: 15.9%, MGB2: 15.8% | E2E-T (CTC + Attention) | HT: 12.6%, MGB2: 12.5%
[16] | Arabic | Standard Arabic Single Speaker Corpus (SASSC) | Kaldi toolkit using DNN, MPE, and SGMM | 33.72% | CNN-LSTM with attention using the Espresso toolkit | 28.48%
[9] | Arabic | 8 hours of Aljazeera corpus; 1200 hours of TV Aljazeera corpus | TDNN-LSTM-BLSTM | 14.7% | BDRNN with CTC objective function | 12.03%
[55] | non-Arabic | Corpus of Spontaneous Japanese (CSJ) | HMM/DNN (Kaldi nnet1) | eval1: 9.0%, eval2: 7.2%, eval3: 9.6% | ESPnet (i.e., VGG2-BLSTM, char-RNNLM, and joint decoding) | eval1: 8.7%, eval2: 6.2%, eval3: 6.9%
[25] | non-Arabic | Switchboard and WSJ | Regular LF-MMI | Switchboard: 9.1%, WSJ: 2.8% | E2E-LF-MMI | Switchboard: 9.6%, WSJ: 3.0%
[18] | non-Arabic | 12,500-hour training set consisting of 15 million English utterances | Hybrid HMM-LSTM | 6.7% (dictation task: 5%) | LAS end-to-end model | 5.6% (dictation task: 4.1%)


Researchers developed a system in [10] that identifies the Ahkam Al-Tajweed in a specific audio recording of Quranic recitation. The study focused on eight rules: "Edgam Meem" (one rule), "Ekhfaa Meem" (one rule), "Ahkam Lam" in the 'Allah' term (two rules), and "Edgam Noon" (four rules). The classification problem involved 16 classes, covering the entire Holy Quran for verses that contained the eight rules. The system utilized various feature extraction techniques, including traditional methods like MFCC and LPC as well as newer methods like CDBN. Classifiers such as SVM and RF were employed, with the best accuracy of 96.4% achieved using SVM for classification and features extracted through MFCC, WPD, HMM-SPL, and CDBN.

In [59], the authors developed a speech recognition system that can accurately differentiate between different types of Madd (elongated tone) and the Qira'at (methods of recitation) related to Madd. The system utilized MFCC as a feature extraction technique and HMM as a classification method. The focus of the study was on two specific types of Madd: the greater connective prolongation and the exchange prolongation rules for Hafss and Warsh. They collected a total of sixty data samples for analysis. The results showed that the accuracy of identifying the exchange prolongation rule was 60% for Warsh and 50% for Hafss. Additionally, the accuracy for identifying the greater connective prolongation rule was 40% for Warsh and 70% for Hafss.
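A minimal sketch of the MFCC-plus-HMM recipe that [59] and similar rule-recognition systems describe, assuming the hmmlearn package: one HMM is fit per rule class, and a test utterance is assigned to the class whose model scores it highest. The data mapping is a hypothetical placeholder.

```python
import numpy as np
from hmmlearn import hmm   # assumption: hmmlearn is available

# mfcc_sequences: dict mapping each rule label (e.g. "madd_exchange",
# "madd_connective") to a list of (T_i, 13) MFCC arrays -- placeholders.
def train_rule_models(mfcc_sequences, n_states=5):
    models = {}
    for label, seqs in mfcc_sequences.items():
        X = np.concatenate(seqs)             # stacked frames
        lengths = [len(s) for s in seqs]     # per-utterance lengths
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag")
        models[label] = m.fit(X, lengths)
    return models

def classify(models, utterance_mfcc):
    # Pick the rule whose HMM gives the highest log-likelihood.
    return max(models, key=lambda lbl: models[lbl].score(utterance_mfcc))
```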
The researchers constructed an acoustic model using the
Researchers developed an automated self-learning system in [33] to support the traditional method of teaching and learning the Quran. The system aimed to classify the characteristics of Quranic letters. The study collected audio data from 30 participants, including 19 males and 11 females. The participants recited each sukoon alphabet once without repetition. The system used the sukoon alphabet from the Quran to provide a description of the Makhraj (point of articulation) and Sifaat (characteristics) of each letter. The study successfully identified and classified the characteristic features of the alphabet, specifically in terms of leaning (Al-Inhiraf) and repetition (Al-Takrir). The results showed that using QDA with all 19 features achieved the highest accuracy, with 82.1% for leaning (Al-Inhiraf) and 95.8% for repetition (Al-Takrir) characteristics.

In [11], the researchers aimed at creating a comprehensive system that accurately recognizes and determines the correct pronunciation of different Tajweed rules in audio. To achieve this, they employed 70 filter banks as the feature extraction technique and utilized SVM as the classification method. The study focused on four specific rules, namely Ekhfaa Meem, Edgham Meem, Takhfeef Lam, and Tarqeeq Lam. The study utilized a dataset of roughly 80 records per rule, comprising a total of 657 recordings encompassing both correct and incorrect recitations for each rule. They tested the models in the system against 30% of the recorded data and achieved a validation accuracy of 99%.

[38] developed a recognition model to identify the "Qira'ah" from the corresponding Holy Quran acoustic wave. The study utilized MFCC as the feature extraction technique and SVM as the classification method. With a dataset of 258 wave files covering 10 "Qira'ah" and including various reciters, the SVM achieved an accuracy of approximately 96%, while the accuracy of an ANN was 62%.

In another study, [45] proposed a system that automates the process of checking Tajweed for children who are learning the Quran. The system used the MFCC algorithm to extract features from the input speech signal and the HMM algorithm to compare children's recitation with the recitation stored in the database. However, this project focused solely on Surah Al-Fatihah and did not provide any testing results.

Table IV summarizes the previous studies of template-based speech recognition in the HQSR field.

B. Traditional Based Speech Recognition

This section summarizes the papers that follow traditional speech recognition as described in Section II-A.

The researchers in the study conducted by [34] aimed to develop a precise Arabic recognizer for educational purposes. They implemented an HSMM model with the primary objective of improving the durational behavior of the traditional HMM model. To achieve this, they utilized a corpus consisting of recordings from 10 reciters, totaling over 487 minutes of speech. They meticulously segmented the corpus at three levels (phoneme, allophone, and word) with precise time boundaries. They obtained the recordings by reciting the Holy Quran, covering all the essential Arabic sounds. As a result of their work, the recognition accuracy saw an improvement of approximately 1.5%.

The researchers of [21] constructed an acoustic model using the Carnegie Mellon University (CMU) Sphinx trainer. The CMU Sphinx trainer utilized recordings from 39 different reciters and 49 chapters (suras) to build a robust framework for continuous speech recognition. The acoustic model achieved an impressive WER of approximately 15%, showcasing its accuracy and effectiveness.

In their research, [50] utilized data from everyayah.com, a website that provides open-access Quran recitations by numerous professional reciters [1]. They adopted a deep learning approach to train an acoustic model for Quranic speech recognition. The study focused on 13 different reciters and concluded that the hybrid HMM-BLSTM method outperformed the HMM-GMM method in terms of speech recognition accuracy. The baseline models (HMM-GMM) achieved an average WER of 18.39%. In contrast, the acoustic model using hybrid HMM-BLSTM achieved significantly better results, with an average WER of 4.63% in the same testing scenario.

In [49], the researchers utilized the KALDI toolkit to create and assess a speaker-independent continuous speech recognizer specifically designed for Holy Quran recitations. The researchers successfully developed a large-vocabulary system capable of recognizing and analyzing Quranic recitations. They used 32 recitations of Chapter 20 (Sūrat Taha), according to Hafs from the A'asim narration. The most effective experimental configuration involved utilizing Time Delay Neural Networks (TDNN) with a sub-sampling technique. This setup achieved a WER ranging from 0.27% to 6.31% and a sentence error rate (SER) ranging from 0.4% to 17.39%.

TABLE IV. SUMMARY OF HQSR TEMPLATE-BASED SPEECH RECOGNITION

Ref# | Main Idea | Dataset | FE | ML algo. | Pros | Cons
[45] | Automated Tajweed checking system for children. | Surah Al-Fatihah; ten respondents' recitations for testing purposes. One audio of correct recitation is used for comparison with the respondents' audio. | MFCC | HMM | - | Very small dataset & no testing result
[46] | Differentiate between reliable and unethical Qur'anic reciters. | Seven well-known Qur'anic reciters gathered into a dataset; each reciter recited the Quran's surahs for eighty minutes on an audio file. | MFCC | CNN | The proposed system stages are well-organized and easily understandable. | -
[38] | A recognition model for the "Qira'ah" from the corresponding Holy Quran acoustic wave. | The corpus contains 258 wave files labeled based on the "Qira'ah" (10 "Qira'ah" considered). | MFCC | SVM | Good accuracy (96.12%) | -
[11] | A system for recognizing/correcting the different rules of Tajweed in an audio recording. | Almost 80 records for each rule name and type; a total of 657 recordings of 4 different rules. | FBs | SVM | Good data collection process. | Considers only 4 rules
[33] | Feature identification and classification of the alphabet (ro) in leaning (Al-Inhiraf) and repetition (Al-Takrir) characteristics. | 30 reciters (19 males and 11 females). | PSD & MFCC | LDA & QDA | Used multiple feature extraction and classification methods | Did not specify the used dataset
[59] | Recognize, identify, and highlight discrepancies between two specific types of Madd rules: the greater connective prolongation and the exchange prolongation rules. The system focuses on verses that contain both rules and aims to point out the mismatches and differences between the rules for the Hafss and Warsh recitation styles. | Reciters' database selected from the Internet (60 data samples). | MFCC | HMM | - | Few data samples
[10] | A system that determines which Tajweed rule is used in a specific audio recording of a Quranic recitation (8 Tajweed rules). | 3,071 audio files collected from ten different expert reciters (5 males and 5 females). Each file contains a recording of one of the 8 rules considered (in either the correct or the incorrect usage). | Traditional (MFCC, LPC, WPD, HMM-SPL) & non-traditional (CDBN) | KNN, SVM, ANN, RF, multiclass classifier, bagging | Use of multiple feature extraction algorithms. | -
[57] | A system for distinguishing, recognizing, and correcting the pronunciation of Tajweed rules for the Warsh narration type. | 15 speech samples: 5 verses recited by 3 Warsh reciters. | MFCC | HMM | - | Few data samples
[58] | A system for recognizing, identifying, and pointing out mismatches in the Iqlab rules for the verses containing the rules. | 6 verses recited by 4 reciters with the Qira'at of Warsh; hence, a total of 24 speech samples. | MFCC | HMM | - | Few data samples
[14] | Differentiate between short and long vowels in Arabic. This distinction is crucial as it plays a significant role in altering the meaning of words. | - | MFCCs, delta coefficients, delta-delta coefficients, and the cepstral pseudo-energy | HMM | Good accuracy | -
[37] | A model to identify errors in Quranic audio files and subsequently distinguish incorrect recitation from correct recitation. | Ten expert Qari, each of whom recited Surat Al-Nass ten times correctly and ten times with mistakes. | MFCC | HMM and DTW | - | Covers only one sura
[6] | A system implemented using the ASR technique. | Al-Nass: 20 utterances recited by only one speaker, with and without errors in recitation. | MFCC | ANN | - | Covers only one sura

The researchers of [3] used MFCC for feature extraction. They adjusted these features using the minimum phone error (MPE) criterion as a discriminative model, utilized a deep neural network (DNN) to construct the acoustic model, and introduced an n-gram LM. The dataset utilized for training and assessing the proposed model comprises 10 hours of .wav recitations by 60 reciters. The experimental results demonstrated that the proposed DNN model attained a remarkably low CER of 4.09% and a WER of 8.46%.

Table V summarizes the previous studies of traditional-based speech recognition in the HQSR field.

C. Other HQSR Studies

This section summarizes papers that follow other HQSR techniques, such as using the Google Speech API, Genetic Algorithms (GA), and MFCC.

The authors in [56] used MFCC to detect and recognize sounds for the simple Idhar Tajweed rule, without providing any testing results for this study.

Researchers proposed a solution in [22] to facilitate the memorization and learning of the Holy Quran. They employed the Fisher-Yates Shuffle algorithm to randomize the letters of the Quran, aiding in the memorization of verses. In addition, they employed the Jaro-Winkler algorithm for text matching and utilized the Google Speech API for speech recognition. The study focused on data from Juz 30. The achieved accuracy was approximately 91%, with an average matching time of 1.9 ms. However, the study revealed that it was still not possible to distinguish in detail certain Arabic letters with similar pronunciations in Quranic verses.
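The matching step that such memorization aids describe can be sketched as follows, assuming the jellyfish package for the Jaro-Winkler similarity and a hypothetical speech_to_text function standing in for the Google Speech API; the acceptance threshold is an illustrative choice.

```python
import jellyfish  # assumption: provides jaro_winkler_similarity

def check_recitation(reference_verse: str, transcript: str,
                     threshold: float = 0.9) -> bool:
    """Accept the recitation if the ASR transcript is close enough
    to the reference verse text."""
    score = jellyfish.jaro_winkler_similarity(reference_verse, transcript)
    return score >= threshold

# transcript would come from a speech-to-text service (hypothetical):
#   transcript = speech_to_text("recitation.wav")
#   ok = check_recitation(verse_text, transcript)
```

Because the comparison happens on text, this design inherits the ASR engine's confusions: letters with similar pronunciations that the API transcribes identically cannot be distinguished, which is exactly the limitation [22] reports.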
In [8], the authors produce a new speech segmentation algorithm for the Arabic language. Developing robust algorithms to accurately segment speech signals into fundamental units, rather than just frames, is a crucial preprocessing step in speech recognition systems. They focus on the precise segmentation of Quran recitation using multiple features (entropy, zero crossings, and energy) and a GA-based optimization scheme. The testing results demonstrate a significant enhancement in segmentation performance, with an approximate 20% improvement compared to conventional segmentation techniques based on a single feature.
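The three cues that [8] fuses can be computed per frame as in the following generic sketch (our illustration, not the paper's implementation); frame and hop sizes are illustrative values for 16 kHz audio.

```python
import numpy as np

def frame_features(signal, frame_len=400, hop=160):
    """Per-frame short-time energy, zero-crossing count, and spectral
    entropy: the three segmentation cues described above."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energy = float(np.sum(frame ** 2))
        zero_crossings = int(np.sum(np.abs(np.diff(np.sign(frame))) > 0))
        spectrum = np.abs(np.fft.rfft(frame)) ** 2
        p = spectrum / (spectrum.sum() + 1e-12)   # normalized power
        entropy = float(-np.sum(p * np.log2(p + 1e-12)))
        feats.append((energy, zero_crossings, entropy))
    return np.array(feats)
```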

In [5], the authors implement an Android-based application called TeBook and provide a method for the assessment of Holy Quran recitation without the involvement of a third party, taking advantage of speech recognition and an online Holy Quran search engine. No testing results are presented for this study. The limitation of this application is its reliance on multiple online services, which renders it unusable if the services are down.

The authors in [35] suggested a brand-new method called Samee'a to make it easier to memorize any form of literature, including speeches, poetry, and the entire Holy Qur'an. Samee'a utilizes the Jaro-Winkler Distance technique to calculate the degree of similarity between the original and transformed texts, and employs the Google Cloud Speech Recognition API to translate Arabic speech to text. Seventy gathered files, ranging in length from twelve to four hundred words, together with a few chapters from the Holy Qur'an, were used to test the system. For the 70 files, the average similarity was 83.33%, while for the chosen chapters of the Holy Qur'an, it was 69%. Preprocessing operations on the text files and the Holy Qur'an improved these results to 91.33% and 95.66%, respectively.

In [48], the authors focus on the digital transformation of Quranic voice signals and the identification of Tajweed-based recitation faults in Harakaat as the primary research objective. They investigated how to process Quranic Recitation Speech Signals (QRSS) in the best digital format possible, using Al-Quran syllables and a design for feature extraction. The objective was to identify similarities or differences in recitation (based on Al-Quran syllables) between experts and students. They employ the DTW approach together with the Short-Time Fourier Transform (STFT) to quantify the Harakaat of QRSS syllable features. The research presents a method that utilizes human-guided threshold classification to assess Harakaat, focusing on the syllables of the Qur'an. The categorization performance achieved for Harakaat exceeds 80% in both the training and testing phases.

Table VI summarizes the previously discussed studies of the other techniques used in the HQSR field.

D. HQSR Taxonomy

This section classifies the previously mentioned works of HQSR based on feature extraction methods and classification techniques. It is worth noting that most works use MFCCs as the feature extraction technique and HMMs as the classifier (e.g., [37], [14], [58], [57], [59], [45]). Fig. 4 illustrates the techniques of feature extraction and classification used in current HQSR research.

Researchers improved an Arabic recognizer by incorporating an HSMM instead of the traditional HMM [34]. Another approach, mentioned in [6], replaced the HMM with an ANN. Similarly, Nahar et al. [38] opted for SVM instead of HMM, and in [46] they use CNNs. Furthermore, researchers also explored various feature extraction techniques. For instance, [11] utilized FB for feature extraction and SVM for classification. [33] also used different methods, such as formant analysis, Power Spectral Density (PSD), and MFCC, along with LDA and QDA for classification.

In [10], two categories of feature extraction techniques were employed: traditional and non-traditional. The traditional approach involved the utilization of MFCC, LPC, multi-signal WPD, and HMM-SPL. As for the non-traditional type, they use CDBN. They use K-Nearest Neighbors (KNN), SVM, ANN, Random Forest (RF), multiclass classifiers, and bagging for classification. [50] used MFCC for feature extraction, BLSTM as one of the deep learning topologies, and combined it with HMM as a hybrid system. The entire speech recognition system was built using the Kaldi toolkit [43], covering feature extraction, acoustic modeling, and model testing. [49] used the deep learning approach in the KALDI toolkit to design, develop, and evaluate an ASR engine for Holy Quran recitations. Their best experimental setup was achieved using TDNN with a sub-sampling technique.

TABLE V. SUMMARY OF HQSR TRADITIONAL-BASED SPEECH RECOGNITION

Ref# | Main Idea | Dataset | FE | AM | LM | Pros | Cons
[49] | Create a speech recognition engine that is speaker-independent and capable of handling continuous speech. Additionally, a written corpus that accurately represents the script of the Holy Quran is developed, and a phonetic dictionary for Holy Quran recitations is constructed to aid the recognition process. | 32 recitations of Sūrat Taha according to Hafs from the A'asim narration. | MFCC | KALDI toolkit to train the acoustic model (traditional and DNN approaches) | n-gram | Best of the current HQSR research | Used only one sura
[50] | The acoustic model for Quran speech recognition was trained using a deep learning approach. In addition, the model was built to analyze the effect of Quran recitation styles (Maqam) on speech recognition. | The dataset is from everyayah.com [1]. | MFCC | Hybrid HMM-BLSTM | 3-gram | First work to use BLSTM in HQSR | No preprocessing method to eliminate noise and echo; the used dataset is not specified
[21] | Used CMU Sphinx, a robust framework for speaker-independent continuous speech recognition, to train accurate acoustic models. | 49 chapters were used: from chapter 067 to chapter 114, in addition to chapter 001 and the supplication that is recited before the Holy Quran (Isti'adah). | CMU Sphinx framework | CMU Sphinx framework | - | Achieved a WER of around 15% for the trained acoustic model. | -
[34] | Presented the results of an enhanced Arabic recognizer by implementing an HSMM model instead of the standard one utilized in the baseline recognizer. | The Arabic database utilized consists of 5935 waveform files for 10 reciters. | MFCC | HSMM | flat LM | Achieved an enhancement of around 1.5% in accuracy | -
[3] | Suggested the traditional method for recognizing Qur'an verses using a dataset of Qur'an verses. | A total duration of 10 hours of MP3 recordings containing recitations of Qur'an verses by 60 reciters. | MFCC | DNN | n-gram | Well-structured and clear article. | -

TABLE VI. SUMMARY OF OTHER HQSR STUDIES

Ref# | Main Idea | Dataset | Used Algorithms | Pros | Cons
[5] | Allow learners to learn how to memorize without the constraints of being in a fixed place and outside the classroom. | Everyayah.com recitation audio, Surah.my translation | Alfanous JOS2 API, Android Speech Recognition | - | Relied heavily on online services and did not specify the used dataset
[8] | A novel speech segmentation algorithm for the Arabic language with a focus on the accurate segmentation of Quran recitation. Starting with a set of initial segmentations, three basic speech features are used: zero crossings, entropy, and energy. | The comprehensive KACST dataset with manually labelled Quran syllable structures. | Feature fusion and Genetic Algorithms | First segmentation work on Quran recitation. | -
[22] | A solution to memorize and learn the Holy Quran easily. | Juz 30 | Fisher-Yates Shuffle, Jaro-Winkler | - | Relies on the Google Speech API, which is not trained on Quran verses
[56] | Emphasize Idhar, which has a distinct and unambiguous pronunciation. The chosen hijaiyah letters comprised six possibilities, with only nun sukun and tanwin, making them effortless to identify. | - | MFCC, FFT | - | Does not specify the dataset, gives no testing results, and the content is poorly organized.
[35] | Introduces a novel system called Samee'a, which aims to enhance the process of memorizing various types of texts, including poems, speeches, and the Holy Qur'an. | Testing utilized a dataset of 70 files, with word counts ranging from 12 to 400, including selected chapters from the Holy Qur'an. | Google Cloud Speech Recognition API and the Jaro-Winkler Distance algorithm | A comprehensive and informative paper | -
[48] | Focuses on the digital transformation of Quranic voice signals and the identification of Tajweed-based recitation faults in Harakaat as its main research objective. | - | Dynamic Time Warping (DTW) | - | Does not specify the dataset


Fig. 4. HQSR taxonomy of used techniques.

In [21], the CMU Sphinx trainer [53] was employed to train the acoustic model specifically for the Holy Quran. In a similar vein, a study by [22] utilized the Jaro-Winkler algorithm for text matching and relied on the Google Speech API to establish a framework for speech recognition.

The solution of [5] uses Android speech recognition and depends heavily on third-party online services. In [8], the authors developed a robust hybrid speech segmentation system based on multiple features (entropy, zero crossings, and energy) and a GA-based optimization scheme to obtain accurate segment units specially adapted for Quran recitation.

IV. DISCUSSION AND FUTURE RESEARCH DIRECTIONS

Recent research in the field of HQSR has suggested numerous works. We provide in Table VII a comparative analysis of current research works based on the dataset characteristics and the suggested methodology:

• Dataset characteristics:
1) #verses: the total number of verses used in the study: L (between 1-100), M (between 101-200), H (greater than 200), and N (number of verses not determined in the paper).
2) #sura: the total number of suras used in the study.
3) #reciters: the number of reciters who participated in the study.

• Proposed methodology:
1) DL-based: whether the study used deep learning at any stage.
2) LM: whether the study used a language model in its solution.
3) AM: whether the study used an acoustic model in its solution.
4) Reciter Independent: whether the output model of the study is reciter-independent.
5) Speaker Adaptation: whether the study used any speaker adaptation techniques.

As we can see in Table VII, there are significant gaps in the current work on HQSR. First, most of the works in HQSR follow template-based speech recognition, which is an old style of speech recognition. This style extracts features from raw audio and feeds these features into a classifier to classify and match against stored templates, without using acoustic, lexical (pronunciation), or language models [58].

(pronunciation), or language models [58]. Examples of these more difficulty for the model when recognizing the
works are ([45], [38], [11], [33], [59], [57], [58], [14], [6], and recitation.
[37]) as shown in Table VII. Few works suggest the use of deep
learning. However, they still rely on an old-style design. For • The length of prolongation (Madd) varies when recit-
instance, [10] used a deep learning architecture with the old ing the Quran. In Hafs An Asim narration, reciters can
style (i.e., it didn’t use acoustic, lexical, and language models recite some types of the madd with 2, 4, or 5 Harakat.
but deep learning in the feature extraction phase only). In • Recitation of the Holy Quran must follow the rules of
addition, the authors in [5], [22] employed the Google Speech “tajweed“ and correctly pronounce Makhraj (point of
API for their solution. However, this approach had limitations articulations) and the Sifaat (characteristics) of each
as the API was unable to accurately differentiate between the alphabet.
seven letters that share similar pronunciations in the verses of
the Quran.
TABLE VII. C OMPARING HQSR S OLUTIONS
Second, only a few studies follow traditional speech recog-
nition, as explained in Fig. 2, either with a deep learning Ref# Dataset Methodology
architecture like [49], [50], [3] or without, such as [34], [21]. #verses #sura #reciters DL- LM AM Reciter Speaker
based Inde- Adap-
It is worth noting that more work investigating deep learning pen- tation
architecture should be conducted to improve the accuracy of dent
Arabic speech recognition in general and the Holy Quran in [45] L 1 1
[49] M 1 32 ✓ ✓ ✓ ✓ ✓
particular. [38] H
[11] H
Third, we can observe in Table VII that the used data set is [33] N 30
too small for most of the work (number of verses, suras, and [5] N
[50] N 13 ✓ ✓ ✓ ✓
reciters). while an extensive dataset helps produce a robust and [22] H
generalized speech recognition system. [59] L
[10] H 10 ✓
Finally, no research on HQSR used end-to-end deep learn- [57] L
[58] L
ing architecture, while this architecture shows outstanding [56] N
results with Arabic and non-Arabic languages, as previously [21] H 49 39 ✓ ✓ ✓
[34] H 10 ✓ ✓
discussed in Section II-B (see Table III, which presents a [14] H 3 4
comparison between some end-to-end based speech recogni- [37] L 10
tion architecture and traditional techniques in Arabic and non- [6] L 1 1
[46] H 7 ✓
Arabic languages). [3] L 60 ✓ ✓ ✓
To sum up, many challenges still need to be considered in future work. Indeed, in speech recognition, recognizing individual words is easy, but the challenge is recognizing continuous speech [7]. Multiple conditions, including speaker dependency, vocabulary size, and noisy environments, can affect the performance of speech recognition systems. Recognition performance increases under limited-vocabulary, speaker-dependent conditions, while under broad-vocabulary, speaker-independent scenarios, performance can decrease significantly [7]. Moreover, Arabic is a morphologically complex language that contains a high degree of affixation and derivation, resulting in a massive increase in word forms [31]. Furthermore, speech recognition of the Holy Quran has additional difficulties compared with written and spoken languages for the following reasons:

• Lack of a comprehensive dataset that contains recitations of women, children, and native and non-native Arabic speakers, with both the correct and incorrect recitation of the Holy Quran.

• Mistakes are not acceptable when reading the Quran because an error in reciting only one letter may change the meaning.

• The diversity of narrations in reading the Qur'an makes it difficult for the model to recognize different narrations.

• The diversity of Maqam in Quran recitation, such as Bayat, Ajam, Nahawand, Hijaz, Rast, Sika, etc., adds more difficulty for the model when recognizing the recitation.

• The length of prolongation (Madd) varies when reciting the Quran. In the Hafs An Asim narration, reciters can recite some types of the Madd with 2, 4, or 5 Harakat (see the duration sketch after this list).

• Recitation of the Holy Quran must follow the rules of "Tajweed" and correctly pronounce the Makhraj (point of articulation) and the Sifaat (characteristics) of each letter.
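Because Madd is a duration phenomenon, one plausible way to handle the variability noted above is to normalize measured vowel durations by the reciter's tempo before classifying them. The toy sketch below assumes a forced aligner has already produced the vowel duration and that the tempo (seconds per Haraka) was estimated from ordinary short vowels in the same recitation; all names and values are illustrative, not taken from any surveyed system.

    def madd_harakat(duration_s, haraka_s):
        """Snap a prolonged vowel to the 2, 4, or 5 Harakat of Hafs An Asim.

        duration_s: vowel duration from a forced alignment, in seconds.
        haraka_s:   the reciter's tempo, i.e., seconds per single Haraka.
        """
        ratio = duration_s / haraka_s
        return min((2, 4, 5), key=lambda h: abs(h - ratio))

    # A 0.9 s vowel at a tempo of 0.22 s per Haraka is closest to 4 Harakat.
    print(madd_harakat(0.9, 0.22))  # -> 4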
V. CONCLUSION

This paper surveys Holy Quran Speech Recognition (HQSR) works. It summarizes some studies of speech recognition in written and spoken languages and the most recent work in the HQSR field. It provides a general taxonomy of speech recognition and a specific one dedicated to HQSR studies that illustrates the techniques of feature extraction and classification used in current HQSR research. We compared the current solutions and clarified the limitations of the current studies. The main challenges of the HQSR field are the lack of a comprehensive dataset, minimizing mistakes, which are not acceptable when reading the Quran, the diversity of narrations, the diversity of Maqam in Quran recitation, and the diversity of prolongation (Madd) length when reciting the Quran. The field of HQSR still needs substantial work to improve the current speech recognition models of the Holy Quran by using better techniques that already show good results with written and spoken languages but have not been used with HQSR yet.

REFERENCES

[1] Every Ayah, http://www.everyayah.com/, 2022.
[2] Muslim population by country 2021, https://worldpopulationreview.com/country-rankings/muslim-population-by-country, 2021.
[3] Hamzah A Alsayadi and Mohammed Hadwan. Automatic speech recognition for qur'an verses using traditional technique. Journal of Artificial Intelligence and Metaheuristics (JAIM), 2022.
[4] Abdelaziz A Abdelhamid, Hamzah A Alsayadi, Islam Hegazy, and Zaki T Fayed. End-to-end arabic speech recognition: A review. 2020.
[5] Mohd Hafiz Bin Abdullah, Zalilah Abd Aziz, Rose Hafsah Abd Rauf, Noratikah Shamsudin, and Rosmah Abd Latiff. Tebook a mobile holy quran memorization tool. In 2019 2nd International Conference on Computer Applications & Information Security (ICCAIS), pages 1–6. IEEE, 2019.
[6] Bushra Abro, Asma Batool Naqvi, and Ayyaz Hussain. Qur'an recognition for the purpose of memorisation using speech recognition technique. In 2012 15th International Multitopic Conference (INMIC), pages 30–34. IEEE, 2012.
[7] Ahmed Hamdi Abo Absa. Self-Learning Techniques for Arabic Speech Segmentation and Recognition. Thesis, 2018.
[8] Ahmed Hamdi Abo Absa, Mohamed Deriche, Moustafa Elshafei-Ahmed, Yahya Mohamed Elhadj, and Biing-Hwang Juang. A hybrid unsupervised segmentation algorithm for arabic speech using feature fusion and a genetic algorithm (july 2018). IEEE Access, 6:43157–43169, 2018.
[9] Abdelrahman Ahmed, Yasser Hifny, Khaled Shaalan, and Sergio Toral. End-to-end lexicon free arabic speech recognition using recurrent neural networks. Computational Linguistics, Speech And Image Processing For Arabic Language, pages 231–248, 2019.
[10] Mahmoud Al-Ayyoub, Nour Alhuda Damer, and Ismail Hmeidi. Using deep learning for automatically determining correct application of basic quranic recitation rules. Int. Arab J. Inf. Technol., 15(3A):620–625, 2018.
[11] Ali M Alagrami and Maged M Eljazzar. Smartajweed automatic recognition of arabic quranic recitation rules. arXiv preprint arXiv:2101.04200, 2020.
[12] Abdulmalik A Alasadi, TH Aldhayni, Ratnadeep R Deshmukh, Ahmed H Alahmadi, and Ali Saleh Alshebami. Efficient feature extraction algorithms to develop an arabic speech recognition system. Engineering, Technology & Applied Science Research, 10(2):5547–5553, 2020.
[13] Hanan Aldarmaki, Asad Ullah, and Nazar Zaki. Unsupervised automatic speech recognition: A review. arXiv preprint arXiv:2106.04897, 2021.
[14] Yousef A Alotaibi, Mohammed Sidi Yakoub, Ali Meftah, and Sid-Ahmed Selouani. Duration modeling in automatic recited speech recognition. In 2016 39th International Conference on Telecommunications and Signal Processing (TSP), pages 323–326. IEEE, 2016.
[15] Fatimah Alqadheeb, Amna Asif, and Hafiz Farooq Ahmad. Correct pronunciation detection for classical arabic phonemes using deep learning. In 2021 International Conference of Women in Data Science at Taif University (WiDSTaif), pages 1–6. IEEE, 2021.
[16] Hamzah A Alsayadi, Abdelaziz A Abdelhamid, Islam Hegazy, and Zaki T Fayed. Arabic speech recognition using end-to-end deep learning. IET Signal Processing, 2021.
[17] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, and Guoliang Chen. Deep speech 2: End-to-end speech recognition in english and mandarin. In International conference on machine learning, pages 173–182. PMLR, 2016.
[18] Chung-Cheng Chiu, Tara N Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J Weiss, Kanishka Rao, and Ekaterina Gonina. State-of-the-art speech recognition with sequence-to-sequence models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4774–4778. IEEE, 2018.
[19] Ronan Collobert, Awni Hannun, and Gabriel Synnaeve. A fully differentiable beam search decoder. In International Conference on Machine Learning, pages 1341–1350. PMLR, 2019.
[20] Ronan Collobert, Christian Puhrsch, and Gabriel Synnaeve. Wav2letter: an end-to-end convnet-based speech recognition system. arXiv preprint arXiv:1609.03193, 2016.
[21] Mohamed Yassine El Amrani, MM Hafizur Rahman, Mohamed Ridza Wahiddin, and Asadullah Shah. Towards an accurate speaker-independent holy quran acoustic model. In 2017 4th IEEE International Conference on Engineering Technologies and Applied Sciences (ICETAS), pages 1–4. IEEE, 2017.
[22] YA Gerhana, AR Atmadja, DS Maylawati, A Rahman, K Nufus, H Qodim, and MA Ramdhani. Computer speech recognition to text for recite holy quran. In IOP Conference Series: Materials Science and Engineering, volume 434, page 012044. IOP Publishing, 2018.
[23] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, pages 369–376, 2006.
[24] Imane Guellil, Houda Saâdane, Faical Azouaou, Billel Gueni, and Damien Nouvel. Arabic natural language processing: An overview. Journal of King Saud University-Computer and Information Sciences, 33(5):497–507, 2021.
[25] Hossein Hadian, Hossein Sameti, Daniel Povey, and Sanjeev Khudanpur. End-to-end speech recognition using lattice-free mmi. In Interspeech, pages 12–16, 2018.
[26] Awni Hannun, Ann Lee, Qiantong Xu, and Ronan Collobert. Sequence-to-sequence speech recognition with time-depth separable convolutions. arXiv preprint arXiv:1904.02619, 2019.
[27] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[28] Xiaodong He and Li Deng. Discriminative learning for speech recognition: theory and practice. Synthesis Lectures on Speech and Audio Processing, 4(1):1–112, 2008.
[29] Ahmed Ali, Shammur Absar Chowdhury, Amir Hussein, and Yasser Hifny. Arabic code-switching speech recognition using monolingual data. Proc. Interspeech 2021, pages 3475–3479, 2021.
[30] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, and Tara N Sainath. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal processing magazine, 29(6):82–97, 2012.
[31] Amir Hussein, Shinji Watanabe, and Ahmed Ali. Arabic speech recognition by end-to-end, modular systems and human. arXiv preprint arXiv:2101.08454, 2021.
[32] Biing Hwang Juang and Laurence R Rabiner. Hidden markov models for speech recognition. Technometrics, 33(3):251–272, 1991.
[33] Safiah Khairuddin, Salmiah Ahmad, Abdul Halim Embong, Nik Nur Wahidah Nik Hashim, and Surul Shahbuddin Hassan. Features identification and classification of alphabet (ro) in leaning (al-inhiraf) and repetition (al-takrir) characteristics. In 2019 IEEE International Conference on Automatic Control and Intelligent Systems (I2CACIS), pages 295–299. IEEE, 2019.
[34] Mohamed OM Khelifa, Mostafa Belkasmi, Yousfi Abdellah, and Yahya OM ElHadj. An accurate hsmm-based system for arabic phonemes recognition. In 2017 Ninth International Conference on Advanced Computational Intelligence (ICACI), pages 211–216. IEEE, 2017.
[35] Souad Larabi-Marie-Sainte, Betool S. Alnamlah, Norah F. Alkassim, and Sara Y. Alshathry. A new framework for arabic recitation using speech recognition and the jaro winkler algorithm. Kuwait Journal of Science, 49, 2022.
[36] Lina Marlina, Cipto Wardoyo, WS Mada Sanjaya, Dyah Anggraeni, Sinta Fatmala Dewi, Akhmad Roziqin, and Sri Maryanti. Makhraj recognition of hijaiyah letter for children based on mel-frequency cepstrum coefficients (mfcc) and support vector machines (svm) method. In 2018 International Conference on Information and Communications Technology (ICOIACT), pages 935–940. IEEE, 2018.
[37] Ammar Mohammed, Mohd Shahrizal Sunar, and Md Sah Hj Salam. Quranic verses verification using speech recognition techniques. Jurnal Teknologi, 73(2), 2015.
[38] Khalid MO Nahar, M Ra'ed, A Moy'awiah, and M Malek. An efficient holy quran recitation recognizer based on svm learning model. Jordanian Journal of Computers and Information Technology (JJCIT), 6(04), 2020.

[39] Maryam Najafian, Wei-Ning Hsu, Ahmed Ali, and James Glass. Automatic speech recognition of arabic multi-genre broadcast media. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 353–359. IEEE, 2017.
[40] Maryam Najafian, Sameer Khurana, Suwon Shan, Ahmed Ali, and James Glass. Exploiting convolutional neural networks for phonotactic based dialect identification. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5174–5178. IEEE, 2018.
[41] Lv Ping. English speech recognition method based on hmm technology. In 2021 International Conference on Intelligent Transportation, Big Data & Smart City (ICITBS), pages 646–649. IEEE, 2021.
[42] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, and Petr Schwarz. The kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society, 2011.
[43] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, and Petr Schwarz. The kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society, 2011.
[44] Rohit Prabhavalkar, Kanishka Rao, Tara N Sainath, Bo Li, Leif Johnson, and Navdeep Jaitly. A comparison of sequence-to-sequence models for speech recognition. In Interspeech, pages 939–943, 2017.
[45] Munirah Ab Rahman, Izatul Anis Azwa Kassim, Tasiransurini Ab Rahman, and Siti Zarina Mohd Muji. Development of automated tajweed checking system for children in learning quran. Evolution in Electrical and Electronic Engineering, 2(1), 2021.
[46] Ghassan Samara, Essam Al-Daoud, Nael Swerki, and Dalia Alzu'bi. The recognition of holy qur'an reciters using the mfccs' technique and deep learning. Advances in Multimedia, 2023, 2023.
[47] Benjamin Elisha Sawe. Arabic speaking countries, Jul 2018.
[48] Noraimi Shafie, Azizul Azizan, Mohamad Zulkefli Adam, Hafiza Abas, Yusnaidi Md Yusof, and Nor Azurati Ahmad. Dynamic time warping features extraction design for quranic syllable-based harakaat assessment. International Journal of Advanced Computer Science and Applications, 13, 2022.
[49] Imad K Tantawi, Mohammad AM Abushariah, and Bassam H Hammo. A deep learning approach for automatic speech recognition of the holy qur'an recitations. International Journal of Speech Technology, pages 1–16, 2021.
[50] Faza Thirafi and Dessi Puji Lestari. Hybrid hmm-blstm-based acoustic modeling for automatic speech recognition on quran recitation. In 2018 International Conference on Asian Language Processing (IALP), pages 203–208. IEEE, 2018.
[51] Pahini A Trivedi. Introduction to various algorithms of speech recognition: Hidden markov model, dynamic time warping and artificial neural networks. International Journal of Engineering Development and Research, 2(4):3590–3596, 2014.
[52] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
[53] Willie Walker, Paul Lamere, Philip Kwok, Bhiksha Raj, Rita Singh, Evandro Gouvea, Peter Wolf, and Joe Woelfel. Sphinx-4: A flexible open source framework for speech recognition. 2004.
[54] Song Wang and Guanyu Li. Overview of end-to-end speech recognition. In Journal of Physics: Conference Series, volume 1187, page 052068. IOP Publishing, 2019.
[55] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, and Nanxin Chen. Espnet: End-to-end speech processing toolkit. arXiv preprint arXiv:1804.00015, 2018.
[56] Efy Yosrita and Abdul Haris. Identify the accuracy of the recitation of al-quran reading verses with the science of tajwid with mel-frequency ceptral coefficients method. In 2017 International Symposium on Electronics and Smart Devices (ISESD), pages 179–183. IEEE, 2017.
[57] Bilal Yousfi and Akram M Zeki. Holy qur'an speech recognition system imaalah checking rule for warsh recitation. In 2017 IEEE 13th international colloquium on signal processing & its applications (CSPA), pages 258–263. IEEE, 2017.
[58] Bilal Yousfi, Akram M Zeki, and Aminah Haji. Isolated iqlab checking rules based on speech recognition system. In 2017 8th International Conference on Information Technology (ICIT), pages 619–624. IEEE, 2017.
[59] Bilal Yousfi, Akram M Zeki, and Aminah Haji. Holy qur'an speech recognition system distinguishing the type of prolongation. Sukkur IBA Journal of Computing and Mathematical Sciences, 2(1):36–43, 2018.
[60] Naima Zerari, Samir Abdelhamid, Hassen Bouzgou, and Christian Raymond. Bidirectional deep architecture for arabic speech recognition. Open Computer Science, 9(1):92–102, 2019.
[61] Yaxin Zhang, Mike Alder, and Roberto Togneri. Using gaussian mixture modeling in speech recognition. In Proceedings of ICASSP'94, IEEE International Conference on Acoustics, Speech and Signal Processing, volume 1, pages I/613–I/616. IEEE, 1994.
