In speech recognition systems, hidden Markov models (HMMs) play an important role: they make it possible to find the phonemes that best fit the input signals. The acoustic model of a phoneme is divided into several parts: the beginning, a number of middle sections that varies with the phoneme's length, and the end. The input signals are compared with these stored models, and the Viterbi algorithm is used to search for the most probable combination.
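The Viterbi search described above can be sketched with a toy HMM. The states, transition probabilities, and observation symbols below are purely illustrative (a real recognizer works on acoustic feature vectors, not coarse labels), but the dynamic-programming structure is the standard Viterbi algorithm:

```python
# Minimal Viterbi decoding sketch for a toy phoneme HMM.
# All states, probabilities, and observations are invented for illustration.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations."""
    # V[t][s] = probability of the best path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    path = {s: [s] for s in states}
    for t in range(1, len(obs)):
        V.append({})
        new_path = {}
        for s in states:
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            V[t][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(states, key=lambda s: V[-1][s])
    return path[best]

# Toy model: one phoneme split into beginning, middle, and end sub-states,
# as in the acoustic model described in the text.
states = ["begin", "middle", "end"]
start_p = {"begin": 1.0, "middle": 0.0, "end": 0.0}
trans_p = {
    "begin":  {"begin": 0.4, "middle": 0.6, "end": 0.0},
    "middle": {"begin": 0.0, "middle": 0.5, "end": 0.5},
    "end":    {"begin": 0.0, "middle": 0.0, "end": 1.0},
}
# Emission probabilities for two coarse feature classes.
emit_p = {
    "begin":  {"low": 0.8, "high": 0.2},
    "middle": {"low": 0.3, "high": 0.7},
    "end":    {"low": 0.6, "high": 0.4},
}

print(viterbi(["low", "high", "high", "low"], states, start_p, trans_p, emit_p))
# → ['begin', 'middle', 'end', 'end']
```

The left-to-right transition structure (no way back from "end" to "begin") mirrors how a phoneme is traversed in time.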
As the computing power of modern PCs has increased significantly, continuous speech, consisting of several words and the transitions between them, can now be recognized with larger hidden Markov models.
Alternatively, attempts have been made to use neural networks for the acoustic model. With time delay neural networks (TDNNs), changes in the frequency spectrum over the course of time can be used for recognition. Their development did produce positive results, but was ultimately abandoned in favor of HMMs.
There is also a hybrid approach, in which the data obtained from pre-processing are classified by a neural network and the output of the network is used as parameters for the hidden Markov models.
This has the advantage that data from shortly before and shortly after the moment currently being processed can also be used, without increasing the complexity of the HMMs. In addition, the classification of the data and the context-sensitive composition (formation of meaningful words/phrases) can be separated from each other.
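The hybrid idea can be sketched as follows. Neighboring frames are stacked into a context window and fed to a classifier, and the classifier's posteriors are converted to HMM emission scores by dividing by the state priors (the common "scaled likelihood" trick). The one-layer classifier, its weights, and the feature frames below are stand-ins, not a real trained network:

```python
import math

def context_window(frames, t, width=1):
    """Stack frames t-width..t+width into one vector (edge frames repeated),
    so temporal context is handled by the classifier rather than by
    more complex HMMs."""
    padded = [frames[min(max(i, 0), len(frames) - 1)]
              for i in range(t - width, t + width + 1)]
    return [x for f in padded for x in f]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def posteriors(window, weights):
    """Stand-in for a trained network: one linear layer plus softmax.
    A real hybrid system would run a full forward pass here."""
    return softmax([sum(w * x for w, x in zip(row, window)) for row in weights])

def emission_scores(post, priors):
    """Scaled-likelihood trick: p(x|state) taken proportional to
    p(state|x) / p(state), so network outputs can feed the HMM."""
    return [p / q for p, q in zip(post, priors)]

# Two-dimensional feature frames and two phoneme states, purely illustrative.
frames = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1]]
weights = [[ 1.0, -1.0,  1.0, -1.0,  1.0, -1.0],   # state 0
           [-1.0,  1.0, -1.0,  1.0, -1.0,  1.0]]   # state 1
priors = [0.5, 0.5]

for t in range(len(frames)):
    post = posteriors(context_window(frames, t), weights)
    print(t, emission_scores(post, priors))
```

The division of labor matches the text: the network handles frame classification (including temporal context), while the HMM handles the composition of those classifications into words.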
The language model then attempts to determine the probability of certain word combinations and thereby to exclude improbable or incorrect hypotheses. Either a grammar model using formal grammars or a statistical model based on N-grams can be used for this.
A bigram or trigram stores the probability of occurrence of combinations of two or three words, respectively. These statistics are obtained from large text corpora. Each hypothesis produced by the speech recognizer is then checked and, if necessary, discarded if its probability is too low. In this way even homophones, i.e., different words with identical pronunciation, can be distinguished.
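A minimal bigram model makes this concrete. The tiny corpus and the two hypothesis sentences below are invented for illustration; real systems estimate counts from much larger corpora and use smoothing rather than assigning zero to unseen combinations:

```python
# Bigram sketch: estimate word-combination probabilities from a tiny corpus
# and score two acoustically similar hypotheses. Corpus is invented.
from collections import Counter

corpus = (
    "we recognize speech . we recognize speech well . "
    "they wreck a nice beach ."
).split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(sentence):
    """P(w1..wn) approximated by the product of P(w_i | w_{i-1})."""
    p = 1.0
    for a, b in zip(sentence, sentence[1:]):
        p *= bigrams[(a, b)] / unigrams[a]
    return p

# Two near-homophone hypotheses; the language model prefers the one
# whose word combinations actually occur in the corpus.
h1 = "we recognize speech".split()
h2 = "we wreck a nice beach".split()
print(bigram_prob(h1), bigram_prob(h2))
# → 1.0 0.0
```

The second hypothesis scores zero only because the pair "we wreck" never appears in this toy corpus; this is exactly the data-sparsity problem that makes trigrams, and even more so longer N-grams, demand much larger text databases.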
With trigrams, theoretically more accurate estimates of the occurrence probabilities of word combinations are possible than with bigrams. However, the text databases from which the trigrams are extracted must be significantly larger than for bigrams, since all permissible three-word combinations must occur in them in statistically significant numbers (i.e., each substantially more than once).
Combinations of four or more words were long not used because, in general, no text databases can be found that contain all such word combinations in sufficient numbers. An exception is Dragon, which since version 12 also uses pentagrams, which further improves recognition accuracy in this system noticeably.