Speech Recognition, Technologies and Applications
improve recognition firstly by restricting vocabulary size, and secondly by improving
signal-to-noise ratio. The former task, constraining vocabulary size, is the role of the
constructed grammar in VI systems.
It is well known that vocabulary restrictions can lead to recognition improvements whether
these are domain based (Chevalier et al., 1995) or simply involve search-size restriction
(Kamm et al., 1994). Similarly the quality of captured speech obviously affects recognition
accuracy (Sun et al., 2004). Real-time response is also a desirable characteristic in many
cases.
Fig. 3. Effect of vocabulary size and SNR on word recognition by humans, after data
obtained in (Miller et al., 1951).
In fact, the three aspects of performance (recognition speed, memory resource
requirements, and recognition accuracy) are in mutual conflict, since it is relatively easy to
improve recognition speed and reduce memory requirements at the expense of
accuracy (Ravishankar, 1996). The task in designing a vocal response system is thus to
restrict vocabulary size as much as practicable at each point in a conversation. However, in
order to determine how much the vocabulary should be restricted, it is useful to relate
vocabulary size to recognition accuracy at a given noise level.
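As a minimal sketch of restricting the vocabulary at each point in a conversation, the following Python fragment keeps a separate word list per dialogue state and discards recogniser hypotheses outside the active list. The state names, word lists, and the `(word, score)` hypothesis format are hypothetical, chosen only for illustration; a real system would normally constrain the recogniser's search itself rather than filter its output afterwards.

```python
# Hypothetical dialogue states and word lists, for illustration only.
ACTIVE_VOCABULARY = {
    "top_level": {"lights", "heating", "television", "help"},
    "lights": {"on", "off", "dim", "brighten", "back"},
    "heating": {"warmer", "cooler", "off", "back"},
}


def active_words(state):
    """Return the set of words that may be recognised in this dialogue state."""
    return ACTIVE_VOCABULARY[state]


def restrict_hypotheses(state, hypotheses):
    """Keep only (word, score) hypotheses whose word is allowed in this state.

    This post-filtering is a simplification of grammar-based restriction:
    it assumes the recogniser has already produced scored word hypotheses.
    """
    allowed = active_words(state)
    return [(word, score) for (word, score) in hypotheses if word in allowed]


if __name__ == "__main__":
    # Made-up hypotheses: "heating" is not valid once inside the "lights" state.
    candidates = [("on", 0.6), ("heating", 0.3), ("off", 0.1)]
    print(restrict_hypotheses("lights", candidates))
```

The smaller the active set in each state, the fewer confusable alternatives the recogniser has to discriminate between at that point in the dialogue.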
Automatic speech recognition (ASR) systems often use domain-specific and application-specific
customisations to improve performance, but vocabulary size is important in any generic
ASR system regardless of the techniques used for its implementation.
Some systems have been designed from the ground up to allow examination of the
effects of vocabulary restrictions, such as the Bellcore system (Kamm et al., 1994), which
provided comparative performance figures against vocabulary size: it supported a very large
but variable vocabulary of up to 1.5 million individual names. Recognition accuracy
decreased linearly with logarithmic increase in directory size (Kamm et al., 1994).
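A linear decrease in accuracy against the logarithm of vocabulary size can be checked on measured data with an ordinary least-squares line fit. The sketch below assumes base-10 logarithms and accuracy in percent; the function name is hypothetical, and the sample values are synthetic placeholders that lie exactly on a line. They are not the figures reported by Kamm et al.

```python
import math


def fit_accuracy_vs_log_vocab(vocab_sizes, accuracies):
    """Least-squares fit of accuracy = slope * log10(V) + intercept.

    Returns (slope, intercept). A negative slope means accuracy falls
    as the vocabulary grows.
    """
    if len(vocab_sizes) != len(accuracies) or len(vocab_sizes) < 2:
        raise ValueError("need at least two matching (V, accuracy) pairs")
    xs = [math.log10(v) for v in vocab_sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accuracies) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Synthetic placeholder data, NOT measurements from any cited system.
    sizes = [10, 100, 1000, 10000]
    accuracy_percent = [95.0, 90.0, 85.0, 80.0]
    slope, intercept = fit_accuracy_vs_log_vocab(sizes, accuracy_percent)
    print(f"slope per decade: {slope:.2f}, intercept: {intercept:.2f}")
```

The fitted slope is the change in accuracy per tenfold increase in vocabulary size, which gives a convenient single number for comparing systems or noise conditions.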
Speech Recognition for Smart Homes
Fig. 4. Plot of speech recognition accuracy results showing the linear decrease in recognition
accuracy with logarithmic increase in vocabulary size in the presence of various levels of
SNR (McLoughlin, 2009).
To obtain a metric capable of characterising voice recognition performance, we can conjecture
that, in the absence of noise and distortion, recognition by human beings describes an
approximate upper limit on machine recognition capabilities: the human brain and hearing
system are undoubtedly closely matched to the human speech production
apparatus. In addition, healthy humans grow up from infancy with an in-built feedback
loop to match the two.
While digital signal processing systems may well perform better at handling additive noise
and distortion than the human brain, to date computers have not demonstrated better
recognition accuracy in the real world than humans. As an upper limit it is thus instructive
to consider results such as those of Miller et al. (1951), in which human
recognition accuracy was measured against word vocabulary size at various noise levels.
The graph of figure 3 plots several of Miller’s tabulated results (Kryter, 1995), showing
percentage recognition accuracy over an SNR range of -18 to +18 dB, with the
results fitted to a sigmoid curve that tapers off at approximately 100% and 0% accuracy
at either extreme of SNR. Note that the centre region of each line is straight so that,
irrespective of the vocabulary, a logarithmic relationship exists between SNR and
recognition accuracy. Excluding the sigmoid endpoints and plotting recognition accuracy
against the logarithm of vocabulary size, as in figure 4, clarifies this relationship
(McLoughlin, 2009).
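The shape described here, a sigmoid in SNR with a nearly straight centre region, can be illustrated with a logistic function. This is only a sketch: the source does not state which sigmoid form was fitted, and the `midpoint_db` and `slope` parameters below are hypothetical values rather than fitted results.

```python
import math


def sigmoid_accuracy(snr_db, midpoint_db, slope):
    """Percentage accuracy modelled as a logistic function of SNR in dB.

    midpoint_db is the SNR at which accuracy is 50%; slope controls how
    quickly accuracy rises around that point. Both are assumed parameters.
    """
    return 100.0 / (1.0 + math.exp(-slope * (snr_db - midpoint_db)))


def centre_line(snr_db, midpoint_db, slope):
    """Tangent to the logistic curve at its midpoint (the straight centre region)."""
    return 50.0 + 25.0 * slope * (snr_db - midpoint_db)


if __name__ == "__main__":
    # Hypothetical parameters; a larger vocabulary would shift midpoint_db upwards.
    for snr in range(-18, 19, 6):
        curve = sigmoid_accuracy(snr, midpoint_db=0.0, slope=0.2)
        line = centre_line(snr, midpoint_db=0.0, slope=0.2)
        print(f"{snr:+3d} dB  sigmoid={curve:6.2f}%  tangent={line:6.2f}%")
```

Near the midpoint the curve and its tangent agree closely, which is the straight region the text refers to; towards either extreme the sigmoid flattens while the tangent continues past 0% and 100%.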
Considering that the published evidence discussed above for both human and computer
recognition of speech shows a similar relationship, we state that in the presence of moderate
levels of SNR, recognition accuracy (A) reduces in line with logarithmic increase in
vocabulary size (V), related by some system-dependent scaling factor which we will represent
by γ:

$$A^{-1} = \gamma \log(V) \tag{1}$$
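Equation (1) can be rearranged to estimate γ from a single accuracy measurement, to predict accuracy at another vocabulary size, or to find the largest vocabulary that meets a target accuracy. The sketch below makes several assumptions not stated in the text: accuracy A is expressed as a fraction in (0, 1], the logarithm is natural (a different base only rescales γ), and all function names and numeric values are hypothetical.

```python
import math


def gamma_from_measurement(accuracy, vocab_size):
    """Solve A^-1 = gamma * log(V) for gamma, given one (A, V) measurement."""
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be a fraction in (0, 1]")
    if vocab_size <= 1:
        raise ValueError("vocab_size must exceed 1 so that log(V) > 0")
    return 1.0 / (accuracy * math.log(vocab_size))


def predicted_accuracy(gamma, vocab_size):
    """Accuracy predicted by equation (1): A = 1 / (gamma * log(V))."""
    return 1.0 / (gamma * math.log(vocab_size))


def max_vocab_size(gamma, target_accuracy):
    """Largest V for which equation (1) still predicts the target accuracy."""
    return math.exp(1.0 / (gamma * target_accuracy))


if __name__ == "__main__":
    # Hypothetical measurement: 90% accuracy with a 1000-word vocabulary.
    gamma = gamma_from_measurement(0.9, 1000)
    print(f"gamma = {gamma:.4f}")
    print(f"predicted accuracy at V = 5000: {predicted_accuracy(gamma, 5000):.3f}")
    print(f"max V for 95% accuracy: {max_vocab_size(gamma, 0.95):.0f}")
```

Since the relationship is stated only for moderate SNR and away from the sigmoid endpoints, predictions for very small vocabularies (where the formula can exceed 1) should not be trusted.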