Discriminatively trained continuous Hindi speech recognition using integrated acoustic features and recurrent neural network language modeling

Abstract: This paper implements a continuous Hindi Automatic Speech Recognition (ASR) system using the proposed integrated feature vector with Recurrent Neural Network (RNN) based Language Modeling (LM). The proposed system also implements speaker adaptation using Maximum-Likelihood Linear Regression (MLLR) and Constrained Maximum-Likelihood Linear Regression (C-MLLR). The system is discriminatively trained by the Maximum Mutual Information (MMI) and Minimum Phone Error (MPE) techniques with 256 Gaussian mixtures per Hidden Markov Model (HMM) state. The baseline system has been trained using a phonetically rich Hindi dataset. The results show that discriminative training enhances the baseline system performance by up to 3%. A further improvement of ~7% has been recorded by applying RNN LM. The proposed Hindi ASR system shows significant performance improvement over other current state-of-the-art techniques.


Introduction
ASR is the process of taking a speech utterance and converting it into the closest possible text sequence. ASR has many application areas, including dictation, program control, dialog systems, audio indexing, speech-to-speech translation, and query-based information retrieval systems such as weather or travel information systems. With the growing demand for end-user focused applications such as voice search and voice interaction with mobile devices and home entertainment systems, robust speech recognition that works under real-world noise and other acoustically distorting conditions is in demand [29]. In implementing an ASR system, obstacles arise from variability in speaking style and environmental noise. The acoustic environments in which ASR must operate are far more difficult and varied than in the past [13]. Despite several technological advancements in ASR systems, there is still a large gap in accuracy and speed compared to human performance [2]. The main objective behind developing an ASR system is to convert a speech utterance into a text sequence, independent of the speaker and the surrounding environment.

Mel-Frequency Cepstral Coefficient (MFCC)
Let X(n) be the input speech signal; frames are blocked and smoothed by applying a Hamming window W(n). Feature extraction through MFCC mainly involves the following five steps.

i) After performing the pre-emphasis step over the speech signal, short-time Fourier analysis is done using the Hamming window [28]. Pre-emphasis is generally performed to amplify the energy at higher frequencies [10]. The power spectrum is estimated as

$$P(k) = \frac{1}{N}\left|\sum_{n=0}^{N-1} X(n)\,W(n)\,e^{-j2\pi kn/N}\right|^{2},$$

where N corresponds to the Hamming window length.

ii) The power spectrum is then passed through the Mel-scale triangular filter bank, and the energy of each filter is obtained as

$$E_m = \sum_{k} P(k)\,H_m(k), \quad m = 1, 2, \ldots, M,$$

where M denotes the number of triangular filters and $H_m(k)$ is the response of the m-th filter.

iii) The Discrete Cosine Transform (DCT) is applied to the log filterbank energies to get the MFCCs ($c_j$):

$$c_j = \sum_{m=1}^{M} \log_{10}(E_m)\cos\!\left(\frac{j\,(m+0.5)\,\pi}{M}\right), \quad j = 1, 2, \ldots, L.$$

iv) The normalized frame energy is appended, producing a 13-dimensional standard feature vector.

v) More features are obtained by applying the first and second derivatives:

$$\Delta c_i = \frac{\sum_{t=1}^{T} t\left(c_i^{(t)} - c_i^{(-t)}\right)}{2\sum_{t=1}^{T} t^{2}},$$

where t is the time offset, and $c_i^{(t)}$ and $c_i^{(-t)}$ represent the t-th following and preceding cepstral coefficients in time, respectively.
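As a concrete illustration of these five steps, the following is a minimal NumPy sketch of MFCC extraction. It is not the exact front-end used in the paper (which was implemented in MATLAB): the frame size, hop, FFT length, filter count, and the energy convention in step iv) are assumed typical values.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs=16000, frame_len=400, hop=160, nfft=512, n_filt=26, n_ceps=13):
    # i) pre-emphasis to amplify high-frequency energy
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1]).astype(float)
    # framing and Hamming windowing, then short-time power spectrum
    n_frames = 1 + (len(sig) - frame_len) // hop
    frames = np.stack([sig[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / frame_len
    # ii) Mel-scale triangular filterbank energies E_m
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filt + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filt, nfft // 2 + 1))
    for m in range(1, n_filt + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    energies = np.maximum(power @ fbank.T, np.finfo(float).eps)
    # iii) DCT of the log filterbank energies -> cepstral coefficients c_j
    ceps = dct(np.log10(energies), type=2, axis=1, norm='ortho')[:, :n_ceps]
    # iv) replace c_0 with the log frame energy (one common convention)
    ceps[:, 0] = np.log(np.maximum(np.sum(frames ** 2, axis=1), np.finfo(float).eps))
    return ceps  # (n_frames, 13); deltas/delta-deltas would extend this to 39

feats = mfcc(np.random.randn(16000))  # one second of dummy 16 kHz audio
```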

Gammatone-Frequency Cepstral Coefficient (GFCC)
The MFCC [12] features give promising results in a clean environment, but in a noisy environment the performance of MFCC decreases. The GFCC [43,44] features work well in a noisy environment, as their model is based on the human auditory system. The GFCC features are determined by the Equivalent Rectangular Bandwidth (ERB) scale and a set of gammatone filterbanks. We found GFCC features more robust in comparison to MFCC and PLP features [15]. The initial operations of GFCC and MFCC are similar. The output of the Fourier transform is passed through the gammatone filterbank, where the filter with center frequency f is defined as

$$g(t) = a\,t^{\,n-1}\,e^{-2\pi b t}\cos(2\pi f t + \phi),$$

where a denotes a constant, $\phi$ represents the phase of the filter, and n denotes the order of the filter. The filterbank bandwidth factor b is given by

$$b = 1.019\,\mathrm{ERB}(f), \qquad \mathrm{ERB}(f) = 24.7\left(\frac{4.37\,f}{1000} + 1\right).$$

After that, DCT is performed to get the uncorrelated cepstral features.
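A minimal sketch of a gammatone front-end follows. The 64-channel filterbank, 10 ms hop, and cube-root compression are common choices in the GFCC literature rather than settings stated in the paper, and time-domain convolution stands in for the paper's Fourier-domain filtering.

```python
import numpy as np
from scipy.fftpack import dct

def erb(f):
    # Equivalent Rectangular Bandwidth (Hz) at centre frequency f
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def gammatone_ir(fc, fs=16000, order=4, phase=0.0, dur=0.064):
    # g(t) = a * t^(n-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t + phi), b = 1.019 ERB(fc)
    t = np.arange(int(dur * fs)) / fs
    b = 1.019 * erb(fc)
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t + phase)
    return g / (np.max(np.abs(g)) + 1e-12)

def erb_space(lo=50.0, hi=8000.0, num=64):
    # centre frequencies equally spaced on the ERB-rate scale
    e = np.linspace(21.4 * np.log10(0.00437 * lo + 1),
                    21.4 * np.log10(0.00437 * hi + 1), num)
    return (10.0 ** (e / 21.4) - 1.0) / 0.00437

def gfcc(signal, fs=16000, n_ceps=13):
    # filter with the gammatone bank, decimate to a 10 ms frame rate,
    # apply cube-root loudness compression, then decorrelate with DCT
    outs = np.stack([np.convolve(signal, gammatone_ir(fc, fs), mode='same')
                     for fc in erb_space()])
    frames = np.abs(outs[:, ::160]) ** (1.0 / 3.0)
    return dct(frames.T, type=2, axis=1, norm='ortho')[:, :n_ceps]

feats = gfcc(np.random.randn(16000))   # (n_frames, 13) on dummy audio
```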

Wavelet packet based ERB Cepstral features (WERBC)
The wavelet transform effectively performs time-frequency analysis for non-stationary or quasi-stationary signals [4]. Wavelet packets (WP) [3,26] have shown their importance in signal representation schemes such as speech analysis [9]. With their broad coverage of time-frequency characteristics, WPs outperform standard MFCC features for the speech recognition task. A fair amount of research on Hindi speech recognition using wavelet transformation has been done in [9,16,21,38,39,45]. The WERBC feature extraction technique was proposed in 2014 [4]. The process of converting the speech signal into WERBC features via the admissible wavelet packet transform is shown in Figure 1. A frame size of 25 ms with 70% overlap is used to extract WERBC features.
After applying the Hamming window, the entire frequency band is decomposed using 3-level WP decomposition, giving 8 subbands of 1 kHz each. WP decomposition is then applied again, as shown in Figure 1, to obtain 24 frequency subbands.
The first 0-500 Hz band is divided into 8 subbands of 62.5 Hz each. This division finely resolves the 0-500 Hz band, which contains a large share of the signal energy. Next, the 500-1000 Hz band is further decomposed with 2-level WP decomposition into 4 subbands of 125 Hz each. Then, 4 subbands of 250 Hz are obtained by applying 2-level WP decomposition to the 1-2 kHz band. Similarly, 4 subbands of 500 Hz and 4 subbands of 1 kHz are obtained subsequently. Equal-loudness weighting and log compression are applied to these 24 subbands to get 24 coefficients. Finally, DCT is applied to get 12 cepstral features, and delta and acceleration coefficients are appended to obtain more features.
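To make the 24-subband tree concrete, here is a small sketch using the PyWavelets WaveletPacket API. The wavelet ('db8') and the node-path mapping are illustrative assumptions: pywt enumerates nodes in natural order, which does not exactly match frequency order on the detail branches, and the equal-loudness weighting step is omitted here.

```python
import numpy as np
import pywt
from itertools import product
from scipy.fftpack import dct

def werbc_subband_paths():
    # 24-band tree from the paper, in pywt's natural node ordering:
    # 8 x 62.5 Hz, 4 x 125 Hz, 4 x 250 Hz, 4 x 500 Hz, 4 x 1 kHz
    paths  = ['aaaa' + ''.join(p) for p in product('ad', repeat=3)]
    paths += ['aaad' + ''.join(p) for p in product('ad', repeat=2)]
    paths += ['aad'  + ''.join(p) for p in product('ad', repeat=2)]
    paths += ['ad'   + ''.join(p) for p in product('ad', repeat=2)]
    paths += ['d'    + ''.join(p) for p in product('ad', repeat=2)]
    return paths  # 8 + 4 + 4 + 4 + 4 = 24 node paths

def werbc(frame, wavelet='db8', n_ceps=12):
    wp = pywt.WaveletPacket(data=frame, wavelet=wavelet,
                            mode='symmetric', maxlevel=7)
    # log subband energies of the 24 nodes, then DCT -> 12 cepstra
    e = np.array([np.sum(wp[p].data ** 2) + 1e-10
                  for p in werbc_subband_paths()])
    return dct(np.log(e), type=2, norm='ortho')[:n_ceps]

feat = werbc(np.random.randn(400))   # one 25 ms frame at 16 kHz
```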

Discriminative techniques
HMMs are statistical models of speech production [36] whose parameters are optimized by the BW algorithm. Most popular ASR systems are based on statistical acoustic modeling. In the past few years, discriminative techniques have received more attention, as they further optimize the HMM parameters to achieve higher accuracy [14,37]. In conventional GMM-HMM based acoustic modeling, HMM parameters are estimated via the MLE technique [18]. The MLE technique can produce an accurate system that is quickly trained using the BW algorithm [24]. MLE is unbeatable if the observations come from a known Gaussian family distribution, the training data is unlimited, and the true language model is known a priori. Unfortunately, these assumptions do not hold for speech. Discriminative techniques instead optimize the model in a way that penalizes parameters responsible for confusion between correct and incorrect predictions [48].

In this work, the baseline acoustic modeling is based on HMMs whose states are represented by Gaussian mixtures, and a discriminative technique is applied on top to optimize the HMM parameters using the E-BW algorithm [19]. In this paper, lattice-based discriminative training is applied using the HMMIRest tool of the HTK 3.5 toolkit, and these lattices are generated with a weak language model (e.g., bi-gram) to improve the generalization capability of the discriminative technique. To make discriminative training more efficient, phone-marked lattices are used by HMMIRest. For discriminative training, HTK uses more than one expected hypothesis for each speech utterance. In this paper, the MMI and MPE discriminative techniques, supported by the HMMIRest tool, are applied to the integrated feature set.

Speaker adaptation is also applied in acoustic modeling. Speaker adaptation is the process of modifying the acoustic model parameters using a small amount of speech data from a specific user, so that the resultant model is able to recognize the speech of that speaker. Generally, these techniques are applied to a well-trained SI model set to capture the characteristics of a new speaker [47]. In many situations, if a large, well-defined SI model is used, the baseline SI performance can be quite high; hence, the error-rate gain from speaker adaptation may be smaller than with a simpler model. Speaker adaptation techniques can be grouped into two families: i) linear transformation-based adaptation and ii) Maximum a Posteriori (MAP) adaptation [52]. In this work, we only explored the linear transformation-based adaptation techniques. These techniques estimate a linear transformation from the adaptation data to modify the HMM parameters.
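To illustrate what the linear transforms do, here is a minimal NumPy sketch. The dimensions and the near-identity transform are made up for illustration; in practice, A and b are estimated from the adaptation data by maximum likelihood (in HTK, via the HERest adaptation machinery), typically with one transform per regression class.

```python
import numpy as np

def mllr_adapt_means(means, A, b):
    # MLLR mean transform: mu_hat = A @ mu + b, shared by all Gaussians
    # in one regression class (a single global class in this sketch)
    return means @ A.T + b

def cmllr_transform_features(feats, A, b):
    # C-MLLR (constrained MLLR) instead applies the transform to the
    # feature vectors, leaving the model untouched: x_hat = A @ x + b
    return feats @ A.T + b

# toy example with made-up dimensions and a near-identity transform
means = np.random.randn(256, 39)                  # 256 Gaussian means, 39-dim
A = np.eye(39) + 0.01 * np.random.randn(39, 39)   # stand-in for the estimated transform
b = 0.01 * np.random.randn(39)
adapted_means = mllr_adapt_means(means, A, b)
```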

Maximum Mutual Information (MMI)
The main motive behind discriminative training is to estimate the HMM parameters so as to boost the accuracy of the ASR system. The MMI objective function, which maximizes the mutual information over the set of R training observations, is defined as

$$\mathcal{F}_{\mathrm{MMI}}(\lambda) = \sum_{r=1}^{R} \log \frac{p_{\lambda}(O_r \mid H_r^{\mathrm{ref}})\,P(H_r^{\mathrm{ref}})}{\sum_{H} p_{\lambda}(O_r \mid H)\,P(H)},$$

where $P(H_r^{\mathrm{ref}})$ denotes the word-sequence probability given by the LM and $H_r^{\mathrm{ref}}$ is the HMM corresponding to the reference word transcription. The denominator sums over every possible word sequence. To increase the objective function, the numerator term should be high and the denominator term should be low: MMI tries to make the correct hypothesis more probable and the incorrect hypotheses less probable at the same time [48].
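As a toy numerical illustration of this criterion (not HTK's lattice-based implementation), the sketch below evaluates the MMI objective for one utterance whose lattice contains three hypotheses with made-up joint acoustic-plus-LM log scores; the objective rises as the reference score grows relative to the competitors.

```python
import numpy as np
from scipy.special import logsumexp

def mmi_objective(log_num, log_dens):
    # F_MMI = sum_r [ log p(O_r|H_ref)P(H_ref) - log sum_H p(O_r|H)P(H) ]
    # log_num[r]  : joint log score of the reference hypothesis
    # log_dens[r] : joint log scores of all competing hypotheses (incl. reference)
    return sum(n - logsumexp(d) for n, d in zip(log_num, log_dens))

print(mmi_objective([-10.0], [np.array([-10.0, -12.5, -11.3])]))
```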

Minimum Phone Error (MPE)
In MPE, we try to maximize phone-level accuracy rather than word accuracy. MPE training is based on minimum Bayes' risk training. The objective function is defined as

$$\mathcal{F}_{\mathrm{MPE}}(\lambda) = \sum_{r=1}^{R} \frac{\sum_{H} p_{\lambda}(O_r \mid H)\,P(H)\,A(H, H_r^{\mathrm{ref}})}{\sum_{H'} p_{\lambda}(O_r \mid H')\,P(H')},$$

where $A(H, H_r^{\mathrm{ref}})$ denotes the phone accuracy of hypothesis H against the reference. In HMMIRest, the MPE criterion is expressed through a numerator acoustic model $\mathcal{M}^{\mathrm{num}}_r$ and a denominator acoustic model $\mathcal{M}^{\mathrm{den}}_r$ for each utterance r. The main difference between MMI and MPE lies in how the denominator and numerator lattices are computed. Parameter estimation is based on the E-BW algorithm. The loss function is measured by the Levenshtein edit distance between the phone sequences of the reference and the hypothesis. The MPE technique has been found to improve accuracy, as it weights word transcriptions by their corresponding phone accuracy [18].
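The Levenshtein distance used as the MPE loss can be computed with the standard dynamic program; a minimal sketch over toy phone sequences:

```python
def levenshtein(ref, hyp):
    # edit distance between reference and hypothesis phone sequences,
    # i.e., the minimum number of substitutions/insertions/deletions
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,        # deletion
                                   d[j - 1] + 1,    # insertion
                                   prev + (r != h)) # substitution / match
    return d[-1]

print(levenshtein("k a m a l".split(), "k a m l".split()))  # -> 1
```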

Recurrent Neural Network (RNN) Language Modeling (LM)
Most state-of-the-art LMs used in LVCSR systems are based on RNNs [30,51]. The usefulness of RNN LMs for the LVCSR task has been reported in various works [31-33]. The traditional RNN is a three-layer architecture, as shown in Figure 2. The first layer, known as the input layer, contains a full history vector formed by concatenating $h_{i-2}$ and $X_{i-1}$ as input to the hidden layer. For an empty history, it is initialized to a vector of all ones. The RNN LM encodes the full, non-truncated history $h_{1}^{i-1} = [X_{i-1}, \ldots, X_{1}]$ for the current word $X_i$. The current word $X_i$ is predicted using the 1-of-k encoding of the most recent preceding word $X_{i-1}$ and the history context $h_{i-2}$. The information received at the hidden layer is compressed using the sigmoid activation function and is also fed back to the input layer. An Out-of-Vocabulary (OOV) node is added at the input layer to cover words that are not present in the recognition dictionary. In the third layer, the softmax activation function is applied to produce normalized RNN LM probabilities [52]. The output of this layer is also fed back into the input layer as the remaining history, to compute the LM probability of the next word.
Training and decoding are computationally expensive in RNN LM, and the major computation is done at the output layer. To reduce the computational cost at the output layer, one more node, Out-of-Shortlist (OOS), is used: the output vocabulary is restricted to a shortlist of the most frequent words, and the remaining words are covered by the OOS node. An extension of the back-propagation algorithm, Backpropagation Through Time (BPTT), is used to train the RNN LM [40]. In BPTT training, the output error is backpropagated in time for a specific number of time steps; this paper uses a history length of 3-8 words. In our work, we use a one-million-word Hindi text corpus to train the RNN LM. The texts were collected from various sources: the EMILLE corpus, magazines, newspapers, web text, and newsletters. A corpus of one million words is considered medium-sized for a conventional n-gram LM, but it is reasonably large for an RNN LM. The computational cost at the output layer is further reduced by computing the probability of all out-of-shortlist words together through the single OOS node.

[Figure 2: Three-layer RNN LM architecture: input layer (1-of-k word encoding plus recurrent history vector), hidden layer, and output layer.]
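The forward computation described above can be sketched in a few lines of NumPy. This is a generic Elman-style RNN LM step, not the CUED-RNNLM implementation; the toy vocabulary size is an assumption, while the 500-unit hidden layer and the all-ones initial history come from the paper.

```python
import numpy as np

def rnnlm_step(word_idx, h_prev, U, W, V):
    # input layer = 1-of-k encoding of previous word + recurrent history;
    # U[:, word_idx] selects the embedding column for that word
    x = U[:, word_idx] + W @ h_prev
    h = 1.0 / (1.0 + np.exp(-x))              # sigmoid-compressed hidden state
    z = V @ h                                 # output layer activations
    p = np.exp(z - z.max()); p /= p.sum()     # softmax over shortlist (+ OOS node)
    return p, h

# toy dimensions: 10k-word shortlist (incl. OOV/OOS nodes), 500 hidden units
V_size, H = 10000, 500
U  = 0.1 * np.random.randn(H, V_size)
W  = 0.1 * np.random.randn(H, H)
Vo = 0.1 * np.random.randn(V_size, H)
p, h = rnnlm_step(42, np.ones(H), U, W, Vo)   # empty history = all-ones vector
```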

Proposed Architecture
The proposed architecture is divided into two parts. The first part describes the process of feature extraction and the integration of different feature sets; in this part, feature vectors are generated using the GFCC, MFCC, PLP, and WERBC feature extraction techniques. The second part of the architecture covers the discriminative training of the feature vector proposed in the first part.

Proposed integrated feature set
The idea of acoustic feature combination was initially proposed by Hermansky in 1994 [25]. In that work, PLP features were combined with RASTA features to improve the performance of the targeted ASR system. In the process of speech recognition, feature extraction plays a vital role in achieving high accuracy. Recently, many papers have demonstrated superiority over previous systems by achieving high performance gains [1,15,23,42,53,54]. In the last few years, GFCC [5] and wavelet-based techniques [45] have become more popular as they work well in noisy environments. This property has attracted many researchers to combine these features with other features to improve the accuracy of ASR systems [1,15,23,42,54].
In the proposed work, the sequential combination of MFCC, GFCC, and WERBC features is performed, as shown in Figure 3. The dimension of the feature vector is then reduced by applying Heteroscedastic Linear Discriminant Analysis (HLDA) [27]; HLDA reduces the feature dimension by about 25%, which helps to reduce the computational load of the ASR system. HLDA is a method of projecting a high-dimensional acoustic representation into a lower-dimensional space [46]. The lower-dimensional representation reduces the number of parameters used to train the acoustic model, thereby significantly reducing the computational load.

Discriminative training
In the second part of the proposed architecture, the integrated feature set is trained using discriminative training approaches; the MMI and MPE techniques are used in the proposed work. The acoustic modeling is done with the HTK 3.5 beta toolkit developed by Cambridge University. To apply a discriminative technique, a cross-word triphone set of HMMs is initially trained using MLE. A weak language model (e.g., bi-gram LM) is the next requirement: it is used to create the lattices consumed by MMI and MPE during training.
Two sets of "phone-marked" lattices are required for discriminative training, known as the denominator lattice and the numerator lattice. HDecode is used to create the denominator lattice, and the numerator lattice is created by the HLRescore tool. The numerator lattice includes the language model log probabilities, and the denominator lattice encodes the confusable hypotheses [36,48]. The initial word lattices are further processed to create the phone-marked lattices, which are used by the HMMIRest tool to discriminatively train the HMMs. Parameter estimation is done by the E-BW algorithm. In this work, 4 iterations each of MMI and MPE are performed to train the acoustic model.
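For completeness, the E-BW mean update has the following standard form (this is the textbook formulation of extended Baum-Welch for MMI/MPE training, not a detail stated in the paper):

$$\hat{\mu}_{jm} = \frac{\theta^{\mathrm{num}}_{jm}(\mathbf{O}) - \theta^{\mathrm{den}}_{jm}(\mathbf{O}) + D_{jm}\,\mu_{jm}}{\gamma^{\mathrm{num}}_{jm} - \gamma^{\mathrm{den}}_{jm} + D_{jm}},$$

where $\theta_{jm}(\mathbf{O})$ are occupancy-weighted sums of observations accumulated from the numerator and denominator lattices, $\gamma_{jm}$ are the corresponding occupancies for Gaussian m of state j, and $D_{jm}$ is a per-Gaussian smoothing constant chosen large enough to keep the updated variances positive.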

Hindi Speech Corpus
Hindi is the fourth most natively spoken language in the world; according to Ethnologue, almost 260 million people in India use it. After Chinese and Spanish, English is in third place with 335 million speakers. Unlike Hindi, these languages all have well-developed ASR systems with standard datasets. One major challenge for Hindi speech recognition is the scarcity of Hindi speech datasets and text corpora. In this work, a well-annotated, phonetically rich Hindi dataset developed by TIFR, Mumbai [41] is used. This dataset contains 100 speakers, each uttering 10 sentences, of which two sentences are common to everyone. These two common sentences cover all phonemes of the Hindi language, and the other eight sentences also cover most Hindi phones. The recording was done with two microphones in a quiet room at a 16 kHz sampling frequency. For training, 80 of the 100 speakers are randomly selected, of which 55 are male and 25 are female. The remaining 20 speakers are used for testing.

Simulation details and experiment results
The acoustic modeling is done with the new version of the HTK toolkit, 3.5. For front-end feature extraction and combination, MATLAB R2015a has been used. The training and evaluation of the RNN LM have been carried out using the CUED-RNNLM toolkit [8]. The speech database is divided into two parts, training and evaluation: 80 of the 100 speakers are randomly selected for training, and the remaining 20 speakers are left for testing. The training data is further divided into three sets. Set 1 contains 30 of the 80 speakers, who speak Hindi frequently and belong to the northern part of India. Likewise, Set 2 contains 30 Hindi speakers, drawn from the remaining 50, who belong to the southern region of India. Set 3 contains the remaining 20 speakers together with a mixture of speakers from Sets 1 and 2. Like the training set, the testing set is also divided into three parts: Set 1 includes only the 12 male speakers, Set 2 contains the 8 female speakers, and Set 3 contains all 20 speakers for evaluation purposes.

Performance analysis of multiple feature combination
In this experiment, we continue the work started in [15]. The baseline GMM-HMM system contains 256 Gaussian mixtures per HMM state with triphone-based acoustic modeling. A comparative analysis of the various multiple-feature combination techniques on the baseline system is shown in Table 3. The standard feature set size without integration is 39 in this experiment. In the case of integrated acoustic features, the dimensionality of the feature vector is reduced by HLDA. To get the integrated MF-GFCC feature set, the first four MFCC features are chosen and integrated with 13 GFCC features, making a set of 17 features. After taking the first and second derivatives, the feature vector size becomes 51; these 51 MF-GFCC features are reduced to 39 by applying HLDA. The same procedure is applied to get the MFCC+GFCC+WERBC features, in which the first four features are taken from MFCC and combined with the 13 GFCC and 13 WERBC features (i.e., 30 static features) before taking the first and second derivatives. All other feature combinations are formed in the same way. The results clearly show that the combination of MFCC+GFCC+WERBC with the HLDA transformation outperforms all other feature combinations. Training set 3 with testing set 3, where male and female speakers come from both the north and south regions of India, gives the best results in comparison to the other combinations. The proposed feature combination shows a 9% relative improvement over the MFCC-based ASR system. The bi-gram LM was used in this experiment.
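The integration pipeline just described (4 MFCCs + 13 GFCCs giving 17 statics, 51 after derivatives, 39 after HLDA) can be sketched as follows. The delta window width and the random stand-in for the HLDA matrix are illustrative assumptions; a real HLDA transform is estimated by maximum likelihood from class statistics during acoustic model training.

```python
import numpy as np

def deltas(feats, width=2):
    # standard regression-based delta coefficients over +/- width frames
    denom = 2 * sum(t * t for t in range(1, width + 1))
    padded = np.pad(feats, ((width, width), (0, 0)), mode='edge')
    T = len(feats)
    return sum(t * (padded[width + t:T + width + t] -
                    padded[width - t:T + width - t])
               for t in range(1, width + 1)) / denom

# placeholder per-frame streams standing in for real MFCC/GFCC front-ends
T = 100
mfcc_feats = np.random.randn(T, 13)
gfcc_feats = np.random.randn(T, 13)

# first 4 MFCCs + 13 GFCCs -> 17 static features per frame
static = np.hstack([mfcc_feats[:, :4], gfcc_feats])                  # (T, 17)
full = np.hstack([static, deltas(static), deltas(deltas(static))])   # (T, 51)

# HLDA projection to 39 dimensions; a random matrix stands in for the
# ML-estimated HLDA transform
hlda_transform = np.random.randn(39, 51)
reduced = full @ hlda_transform.T                                    # (T, 39)
```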

System combination
For a detailed study of the performance of the proposed integrated feature set with different back-end configuration settings, several systems are proposed in this experiment. The best three integrated feature sets from the previous experiment are chosen for discriminative training with speaker adaptation. Speaker adaptation can significantly reduce the WER of an SI-trained system [52]. It has been observed in previous work [1] that MLLR helps to achieve a low WER as the vocabulary size increases. In this experiment, we propose a series of system configurations and choose the best combination of front-end and back-end processing. The proposed baseline system is tested with and without speaker adaptive training (SAT) to measure the performance gain. The naming convention for the proposed series of systems uses capital letters to indicate the type of feature combination, the speaker adaptation (MLLR, C-MLLR), and the discriminative training criterion. For example, PG-M MMI indicates the PLP+GFCC feature set with MLLR adaptation modeled by the MMI discriminative technique. In this experiment, we build 15 systems that differ in feature extraction technique, model type, discriminative training, number of iterations, and number of transforms for acoustic model adaptation. The acronyms used for the different systems aid the discussion that follows.

Performance evaluation of different systems
The choice of front-end feature combination has a tremendous impact on ASR performance. From the previous experiment, we choose the three best feature combinations and evaluate them over a number of different parameters. In this section, the performance of all 15 proposed systems described in the previous experiment is evaluated. Here again, training set 3, which contains a mixture of north and south Indian dialects, gives the maximum accuracy with test set 3. Discriminative techniques help to optimize the HMM parameters, which leads to the performance gain; in all experiments, it was observed that they improve the generalization capability of the ASR system. The MGW-C MPE system gives the best performance of 80.36%, which is ~3% more than the baseline configuration. The MPE discriminative technique performs slightly better than the MMI technique in all experiments. Speaker adaptation also helps to maintain a low WER.

Experiment with language modeling
Based on the performance evaluation in the previous section, the best four systems were selected for this experiment. The performance of the proposed Hindi ASR system is further improved using RNN LM. The RNN architecture is suitable for variable-length inputs, and RNN LM is well suited to modeling sequential data: it can learn long-term contextual information within the text. Applying RNN LM increases the computational load of the system but gives a significant performance gain. To implement RNN LM, the CUED-RNNLM toolkit [8] has been used. The RNN LM experiment uses 500 hidden units and N-best lattice rescoring. The performance is further improved, up to 87.96%, using RNN LM. One million words of text transcription from various sources are used to train the language model. One more observation recorded in this experiment is that the performance of the ASR system increases only up to the tri-gram LM.
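A minimal sketch of N-best rescoring with an interpolated RNN/n-gram LM is shown below. The interpolation weight, LM scale factor, and all hypothesis scores are made-up illustrative values; the actual experiments used lattice/N-best rescoring via the CUED-RNNLM tools.

```python
import numpy as np

def rescore_nbest(hyps, lam=0.5, lm_scale=12.0):
    # hyps: list of (words, acoustic_logp, ngram_logp, rnn_logp) tuples.
    # Linearly interpolate the two LM probabilities, then combine with
    # the acoustic score using the usual LM scale factor.
    def score(h):
        words, am, ng, rnn = h
        lm = np.log(lam * np.exp(rnn) + (1 - lam) * np.exp(ng))
        return am + lm_scale * lm
    return max(hyps, key=score)

best = rescore_nbest([
    (["namaste", "duniya"],  -250.0, -8.1, -7.4),   # made-up scores
    (["namaste", "duniyaa"], -249.0, -9.0, -8.8),
])
print(best[0])
```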

Conclusion
A novel integrated feature combination of MF-GFCC with WERBC features is discriminatively trained with RNN language modeling to improve the performance of the Hindi ASR system. For speaker adaptation, the MLLR and C-MLLR techniques have been applied, and their corresponding improvements have been recorded. The performance of the proposed Hindi ASR system has been evaluated over a number of different parameters. The results show that the MFCC+GFCC+WERBC features are the most robust and give the maximum accuracy with MPE discriminative training; the MPE technique shows a 1% relative improvement over the MMI technique. This work can be further extended by applying these feature combinations to DNN-based acoustic modeling, and various optimization techniques can also be tested with the proposed feature set.