Numeral recognition is one among the most vital problems in pattern recognition. Its numerous applications like reading postal zip code, passport number, employee code, bank cheque processing and video gaming etc. To the best of our knowledge, little work has been done in Marathi language as compared with those for other Indian and non-Indian languages. This paper has discussed a novel method for recognition of isolated Marathi numerals. It introduces a Marathi database and isolated numeral recognition system using Mel-Frequency Cepstral Coefficient (MFCC) and Distance Time Warping (DTW) as attributes. The precision of the pre-recorded samples is higher than that of the real-time testing samples. We have also seen that the accuracy of the speaker dependent samples is higher than that of the speaker independent samples. Another method called HMM that statistically models the words is also presented. Experimentally, it is observed that recognition precision is higher for HMM compared with DTW, but the training process in DTW is very simple and fast, as compared to the Hidden Markov Model (HMM). The time required for recognition of numerals using HMM is more as compared to DTW, as it has to go through the many states, iterations and many more mathematical modeling, so DTW is preferred for the real-time applications.

Keywords: Hidden Markov Model (HMM), Mel-Frequency Cepstral Coefficient (MFCC), Distance Time Warping (DTW).

1. Introduction

Speech recognition systems are used in different fields in our daily life. Due to the rapid advancement in this field all over the world we can see many systems and devices with voice input 1. Speech Synthesis and Speech Recognition combinely form a speech interface. A speech synthesizer transforms text into speech, so it can read out the textual contents from the screen. Speech recognizer had the ability to find the spoken words and transform it into text. We require such software’s to be available for Indian languages.

Speech recognition in computer domain involves many steps with issues attached with them. The steps needed to make computers perform speech recognition are: Voice recording, word boundary detection, feature extraction, and recognition by using knowledge models.

2. Problem Definition

The aim of the paper is to build a speech recognition tool for Marathi language, which is an isolated word speech recognition devices that uses Mel-Frequency Cepstral Coefficient (MFCC) for Feature Extraction and Distance Time Warping (DTW) for Feature Matching or to compare the test patterns.

3. Marathi Numeral Recognition using MFCC and DTW Features

The popularly used cepstrum based techniques to check the pattern to find their similarity are the MFCC and DTW. The MATLAB is utilized for the implementation of MFCC and DTW attributes.

FEATURE EXTRACTION (MFCC)

The MFCCs are used for feature extraction. The efficiency of this phase is important for the next phase since it affects its behavior 2. In MFCC feature extraction, the magnitude spectrum of windowed speech frame was filtered by using a triangular Mel filter bank have twenty Mel filters. From a group of twenty Mel-scaled log filter bank outputs, MFCC feature vector that consists of thirteen MFCC and the corresponding delta and acceleration coefficients (total thirty nine coefficients) is extracted from every frame.

The widespread use of the MFCCs is because of its low computational complexity and higher performance for ASR in the clean matched conditions. Performance of MFCC degrades drastically in presence of noise and degradation is directly proportional to signal-to noise ratio (SNR). The recognition accuracy for MFCC attribute is taken into account because it mimics the human ear perception 2.

The complete procedure of the MFCC is shown in Fig. 3.1. As shown in Fig.3.1, MFCC consists of seven computational steps. Every step has its function and approaches as mentioned in brief as follows.

Step 1: Pre–emphasis

This method can increase the energy of signal at higher frequency. It permits the passing of every speech signal through a 1st order FIR filter which emphasizes higher frequencies. The 1st order FIR filter equation utilized is

Yn = xn-0.95 x n-1 …………………………………………………………..… (1)

Step 2: Framing.

Every speech signal is split into frames of thirty six ms (milliseconds) and most of spectral characteristics stay the constant in this period, with 50 % of overlapping.

Step 3: Windowing

To eliminate edge effects, every frame is formed with hamming window that

works better than other windows. The hamming window is represented by

W(n)=0.54-0.46 cos?(2?n/(N-1)) Where, 0?n?N-1……..….. (2)

Step 4: Fast Fourier Transformation (FFT)

The FFT is employed to get log magnitude spectrum to estimate MFCC. We have utilized 1024 point to obtain higher frequency resolution.

Step 5: Mel Filter Bank Processing

The twenty Mel triangular filters are designed with 50% overlapping. From every filter the spectrum are included to obtain one coefficient each, hence we have considered the first thirteen coefficients as our attributes. These frequencies are converted to Mel scale utilizing conversion formula.

F(Mel)=2595*log101+f700………………………….…… (3)

We have taken into account only 13 MFCC coefficients due to the fact it gives higher recognition accuracy than other coefficients.

Step 6: Discrete Cosine Transformation (DCT)

The DCT of every Mel frequency Ceptral are used for de-correlation and energy

compaction is called as MFCC. The group of coefficient are called MFCC Acoustic Vectors. So, every input speech signal is converted into a sequence of MFCC Acoustic Vector from which reference templates are obtained.

Step 7: Delta Energy and Delta Spectrum

The attributes associated to the variation in cepstral features over time are represented by thirteen delta features (12 cepstral features and one energy feature), and 13 double delta or acceleration attributes. Each of the 13 delta features gives the variation between frames, while each of the 13 double delta attributes gives the variation between frames in the corresponding delta features. In similar way, all the total 39 MFCC feature are estimated for each frame which has feature vector. The Mel filter bank created is shown in Fig.3.2.

The operating process of the MFCC coefficient extraction is:

Pre-emphasis of the speech signal, frame, adding window, then use the

FFT to get the frequency information.

2. Pass the signal through the Mel frequency coordinate triangle filter sets to

match the human hearing techniques and the human hearing sensibility to

variant speech spectrum.

Estimate the logarithm value of the signal after the Mel filters to get the

logarithmic spectrum.

4. Obtain the discrete cosine transform to the signal and get the MFCC

coefficients.

Mel-frequency wrapping

According to psychophysical studies, human perception of the frequency content of sounds follows a subjectively defined nonlinear scale called the Mel scale .The

speech signal consists of tones with different frequencies F or each tone with an actual frequency measured in Hz, a subjective pitch is measured on the ‘Mel’ scale. The mel-frequency scale is a linear spacing below 1000Hz and above 1000Hz is a logarithmic spacing 3.

4. Features Matching (DTW)

Overview

In this form of speech recognition method, the test data is transformed to templates. The recognition method then includes the matching the incoming speech with stored templates. The template with the lowest distance measure from the input pattern is the known word. The best choice (lowest distance measure) is based upon dynamic programming. This is called a Dynamic Time Warping (DTW) word recognizer 3.

To understand the concept of DTW, we require to know this parameters,

Features: the information in every signal has to be exhibited in some fashion.

Distances: some type of metric has be utilized so as to obtain a match path.

Since the feature, vectors may probably have multiple elements, a ways of calculating the local distance is needed. The distance measure between two feature vectors is estimated by the Euclidean distance metric. The Euclidean distance between two points P = (p1, p2…pn) and Q = (q1, q2…qn), is expressed as,

?(?_(i=1)^n??(pi-qi)2?)……………………………… (4)

DTW Algorithm

Speech is a time-dependent technique. So, the articulations of the same word will have variant durations, and articulations of the same word with the same duration will differ in the middle, as the different parts of the words are being spoken at variant rates. In order to obtain a global distance between two speech patterns (given as a sequence of vectors) a time alignment should be performed.

DTW algorithm is relies on Dynamic Programming methods. This method is for estimating similarity between two time series that may vary in time or speed. This method is also utilized to obtain the optimal alignment between two times series if one time series may be “warped” non-linearly by stretching or shrinking it along its time axis. The warping between two time series can then be utilized to obtain the corresponding regions between the two time series or to find the similarity between the two time series. Fig. 4.1 shows the example of how one times series is ‘warped’ to another 4.

In Fig. 4.1, every vertical line joins a point in one time series to its respectively similar point in the different time series. The lines have similar values on the y-axis, while they have been departed so the vertical lines between them can be seen more easily. When both of the time series in Fig. 4.1 were identical, all of the lines would be straight vertical lines as no warping would be required to ‘line up’ the two time series. The warp path distance is a measure of the variation between the two time series after they have been warped collectively, which is estimated by the sum of the distances between every pair of points joined by the vertical lines in Fig. 4.1. Hence, two time series that are same except for localized stretching of the time axis will have zero DTW distances. The aim of DTW is to compare two dynamic patterns and calculate its similarity by finding a minimum distance between them.

Continuity condition: – The path forwards one step at a time. Both i and j can only increase by 1 on every step along the path.

Boundary condition: – The path begins at the bottom left and stops at the top right.

Adjustment window condition: – A good path is the one that wander away from the diagonal. The distance that the path is allowed to wander is the window length r.

Slope constraint condition: -The path should not be very steep or shallow. This restraints very short sequences matching very long ones. The condition is given as a ratio n/m where m is the number of steps in the x direction and n is the number in the y direction.

Finding for the best path that matches two time-series signals is the main task for many researchers, due to its importance in these applications. Dynamic Time-Warping (DTW) is one of the main techniques to accomplish this task, especially in speech recognition systems to deal with different speaking speeds. DTW is a cost minimization matching method, in which a test signal is stretched or compressed as per a reference template. Dynamic time warping (DTW) is such a typical method for a template based approach matching for speech recognition and also DTW stretches and compresses different parts of utterance in order to obtain alignment that gives the best possible match between template and utterance on frame by frame manner. The template with nearest match defined in a manner chosen as recognized word 6.

The DTW is mainly used in the small-scale embedded-speech recognition systems like those embedded in cell phones. The reason for this is due to the simplicity of the hardware implementation of the DTW, hat makes it suitable for different mobile devices. Also, the training procedure in DTW is very simple and fast, as compared with the HMM and ANN. DTW has been used in various areas, like speech recognition, data mining, and movement recognition 3.

5. Hidden Markov Model Training and Recognition

The Hidden Markov Models (HMMs) are widely used statistical tools in recognition system. It covers from isolated speech recognition to very large vocabulary unconstrained continuous speech recognition and speaker identification fields. Therefore most of the current speech recognitions are conducted based on Hidden Markov Model (HMMs)7.

The Hidden Markov Model is a statistical model where the system is modelled as Markov process which can be represented as a state machine with unknown parameter through it (Ting, 2007). The main important process is to determine the unknown parameter from the observable parameter. After determining the parameter, it then used to perform further analysis. One of the examples of the Hidden Markov Model is normally use in pattern application. For example, Hidden Markov Model in pattern application is speech, signature, handwriting, gesture recognition and bioinformatics and genomics in medical.

The hidden Markov Model is given by the formula, ?=(?, A, B).

? = initial state distribution vector.

A = State transition probability matrix.

B = continuous observation probability density function matrix.