Vocabulary size and performance




Date: 15.05.2024
5. Vocabulary size and performance 
The ability of a system to recognise captured speech, to cater for intra- and inter-speaker 
variability, and the processing time allowable for recognising utterances are three main 
usability issues related to VI systems. Other issues include training requirements, 
robustness, linguistic flexibility and dialogue interaction.
Many factors in ASR for VI can be controlled. For example, the variability of speech is mostly 
confined to a limited set of uses, and linguistic flexibility can, and should, be constrained 
through appropriate grammar design (which is the focus of section 6). The ability to 
accurately recognise captured speech that has been constrained in these ways 
will then depend primarily upon vocabulary size and signal-to-noise ratio (SNR). Thus we can 


 Speech 
Recognition, 
Technologies and Applications 
484 
improve recognition first by restricting vocabulary size, and second by improving the 
signal-to-noise ratio. The former task, constraining vocabulary size, is the role of the 
constructed grammar in VI systems. 
It is well known that vocabulary restrictions can lead to recognition improvements, whether 
these are domain-based (Chevalier et al., 1995) or simply involve search-size restriction 
(Kamm et al., 1994). Similarly, the quality of captured speech affects recognition 
accuracy (Sun et al., 2004). Real-time response is also a desirable characteristic in many 
cases. 
Fig. 3. Effect of vocabulary size and SNR on word recognition by humans, after data 
from Miller et al. (1951). 
In practice, the three aspects of performance (recognition speed, memory resource 
requirements, and recognition accuracy) are in mutual conflict, since it is relatively easy to 
improve recognition speed and reduce memory requirements at the expense of 
accuracy (Ravishankar, 1996). The task in designing a vocal response system is thus to 
restrict vocabulary size as much as practicable at each point in a conversation. However, in 
order to determine how much the vocabulary should be restricted, it is useful to relate 
vocabulary size to recognition accuracy at a given noise level. 
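The idea of restricting vocabulary at each point in a conversation can be illustrated with a minimal sketch. The dialogue states and word lists below are hypothetical, invented purely for illustration, and are not taken from any particular VI system; the point is only that the set of words the recogniser must discriminate between at any moment can be much smaller than the total vocabulary of the interface.

```python
# Hypothetical dialogue states and word lists, for illustration only.
ACTIVE_VOCABULARY = {
    "idle": {"computer"},
    "awaiting_command": {"lights", "heating", "television", "cancel"},
    "awaiting_lights_action": {"on", "off", "dim", "cancel"},
}


def active_words(state):
    """Return the words the recogniser should accept in the given state."""
    return ACTIVE_VOCABULARY[state]


def total_vocabulary():
    """Return every word used anywhere in the dialogue."""
    return set().union(*ACTIVE_VOCABULARY.values())


if __name__ == "__main__":
    print("total vocabulary size:", len(total_vocabulary()))
    for state in ACTIVE_VOCABULARY:
        print(state, "->", len(active_words(state)), "active words")
```

In this toy dialogue the recogniser never has to choose between more than four words at once, although the interface as a whole uses eight.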
Automatic speech recognition systems often use domain-specific and application-specific 
customisations to improve performance, but vocabulary size is important in any generic 
ASR system, regardless of the techniques used for its implementation. 
Some systems have been designed from the ground up to allow examination of the 
effects of vocabulary restrictions. One example is the Bellcore system (Kamm et al., 1994), 
which provided comparative performance figures against vocabulary size: it had a very large 
but variable vocabulary of up to 1.5 million individual names. Recognition accuracy 
decreased linearly with logarithmic increase in directory size (Kamm et al., 1994). 
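A linear decrease in accuracy with logarithmic increase in vocabulary size can be written as A = a - b log10(V). The sketch below fits such a line by ordinary least squares. The data points are synthetic, generated from a made-up line purely to show the procedure; they are not the Bellcore figures, and the function name `fit_log_linear` is an assumption introduced for this example.

```python
import math


def fit_log_linear(vocab_sizes, accuracies):
    """Least-squares fit of accuracy = intercept + slope * log10(V)."""
    xs = [math.log10(v) for v in vocab_sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accuracies) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


if __name__ == "__main__":
    # Synthetic points lying on the made-up line A = 98 - 4 * log10(V).
    sizes = [10, 100, 1000, 10000]
    accuracy_percent = [94.0, 90.0, 86.0, 82.0]
    intercept, slope = fit_log_linear(sizes, accuracy_percent)
    print(f"intercept = {intercept:.2f}, slope per decade = {slope:.2f}")
```

The fitted slope is the change in accuracy, in percentage points, for each tenfold increase in vocabulary size, which is a convenient single number for comparing systems or noise conditions.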


Speech Recognition for Smart Homes 
485 
Fig. 4. Plot of speech recognition accuracy results showing the linear decrease in recognition 
accuracy with logarithmic increase in vocabulary size in the presence of various levels of 
SNR (McLoughlin, 2009). 
To obtain a metric capable of characterising voice recognition performance, we can conjecture 
that, in the absence of noise and distortion, recognition by human beings describes an 
approximate upper limit on machine recognition capabilities: the human brain and hearing 
system are undoubtedly closely matched to the human speech production 
apparatus. In addition, healthy humans grow up from infancy with an in-built feedback 
loop to match the two.
While digital signal processing systems may well perform better at handling additive noise 
and distortion than the human brain, to date computers have not demonstrated better 
recognition accuracy in the real world than humans. As an upper limit it is thus instructive 
to consider results such as those of Miller et al. (1951), in which human 
recognition accuracy was measured against word vocabulary size at various noise levels.
The graph of figure 3 plots several of Miller’s tabulated results (Kryter, 1995), showing 
percentage recognition accuracy over an SNR range of -18 to +18 dB, with 
results fitted to a sigmoid curve tapering off at approximately 100% accuracy and 0% accuracy 
at either extreme of SNR. Note that the centre region of each line is straight so that, 
irrespective of vocabulary size, a logarithmic relationship exists between SNR and 
recognition accuracy. Excluding the sigmoid endpoints and plotting recognition accuracy 
against the logarithm of vocabulary size, as in figure 4, clarifies this relationship 
(McLoughlin, 2009). 
Considering that the published evidence discussed above for both human and computer 
recognition of speech shows a similar relationship, we state that in the presence of moderate 
levels of SNR, recognition accuracy (A) reduces in line with logarithmic increase in 
vocabulary size (V), related by some system-dependent scaling factor, which we will represent 
by γ:

$$A^{-1} \propto \gamma \log(V) \tag{1}$$
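As a rough illustration, assume equation (1) is read as accuracy being inversely proportional to γ log(V). On a single system the scaling factor γ then cancels when two vocabulary sizes are compared, as does the base of the logarithm, so the predicted ratio of accuracies depends only on the two sizes. The function name `relative_accuracy` is introduced here for the example. The relation is stated only for moderate SNR and away from the saturated ends of the curves, so the result should be treated as indicative and not as a guaranteed gain.

```python
import math


def relative_accuracy(v_old, v_new):
    """Predicted ratio A(v_new) / A(v_old), assuming A is proportional
    to 1 / (gamma * log(V)) with the same gamma for both vocabularies."""
    if v_old <= 1 or v_new <= 1:
        raise ValueError("vocabulary sizes must exceed 1 so that log(V) > 0")
    return math.log(v_old) / math.log(v_new)


if __name__ == "__main__":
    # Restricting the active vocabulary from 10,000 words to 100 words.
    print(relative_accuracy(10_000, 100))
```

Under this reading, the benefit of restricting vocabulary depends on the ratio of the logarithms of the two sizes, so a given restriction pays off most when the resulting vocabulary is small.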
