8. Sphinx as an ASR for smart homes 
Among the many automatic speech recognizers available for different applications with various 
features, the open-source Sphinx recognizer is an excellent example of a flexible modern 
speech recognition system. Sphinx, originally developed at Carnegie Mellon University in 
the USA, provides and integrates several capabilities that allow it to be adapted for a wide 
range of speech recognition applications.
At one extreme, it can be used for single-word recognition; at the other, it can be expanded 
to large vocabularies containing tens of thousands of words. In terms of resource 
constraints, it can run on anything from a tiny embedded system (PocketSphinx) to a large 
and powerful server (which could run the Java-language version, Sphinx-4). Sphinx is 
regularly updated and evaluated within the speech recognition research field. 
Sphinx, in common with most current ASR implementations, relies upon Hidden Markov 
Modelling to match speech features to stored patterns (Lee, 1989). It is highly configurable 
and flexible: the features used can be selected as required. 
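To illustrate the underlying idea – matching a sequence of observed features against stored models by finding the most likely hidden state path – the following is a minimal, self-contained Viterbi sketch over a toy two-state HMM. The states, probabilities and quantised "features" are invented for the example and are far simpler than Sphinx's real acoustic models:

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Find the most likely hidden-state path for an observation
    sequence, working in log-space to avoid numerical underflow."""
    # best[t][s] = best log-probability of any path ending in state s at time t
    best = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda x: x[1],
            )
            best[t][s] = score + log_emit[s][obs[t]]
            back[t][s] = prev
    # Trace the best final state back to the start of the utterance.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Toy two-state model: does a quantised energy feature come from
# silence ("sil") or speech ("sp")?
states = ["sil", "sp"]
log_start = {"sil": math.log(0.8), "sp": math.log(0.2)}
log_trans = {"sil": {"sil": math.log(0.7), "sp": math.log(0.3)},
             "sp":  {"sil": math.log(0.2), "sp": math.log(0.8)}}
log_emit = {"sil": {"low": math.log(0.9), "high": math.log(0.1)},
            "sp":  {"low": math.log(0.2), "high": math.log(0.8)}}

print(viterbi(["low", "high", "high", "low"], states, log_start, log_trans, log_emit))
# → ['sil', 'sp', 'sp', 'sil']
```

A real recognizer applies the same dynamic-programming principle, but over continuous-density emission models and HMM state graphs compiled from the lexicon and language model.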
Sphinx2, the decoding engine for Sphinx II, can be a good choice for smart home services, 
provided appropriate model files and databases are used. These fall into 
three categories: 
a. A pronunciation lexicon/dictionary defining the words of current interest, with a phonemic 
pronunciation for each. 
b. Acoustic models based on Hidden Markov Models (HMM) for base phones and 
triphones. Sphinx2 uses both semi-continuous and continuous-density acoustic models, 
which are typically generated by the Sphinx acoustic model trainer. 
c. A predetermined language model in one of two flavours: either a finite 
state graph (FSG) or an N-gram model (where N is either two or three). 
Apart from ordinary words, noise or filler words can be specified for a particular application 
by placing them in a corresponding dictionary. The N-gram language model additionally 
includes begin-sentence and end-sentence symbols, denoted <s> and </s>, normally 
representing silence. These can be used in continuous speech applications for quiet homes, 
but may need to be augmented with predetermined start/stop attention phrases.
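To make the three knowledge sources concrete, the sketch below writes out toy versions of them as plain text: a pronunciation dictionary, a filler dictionary, and a minimal ARPA-format bigram language model with the <s> and </s> sentence-boundary symbols. All file names, words and probabilities are invented for illustration and do not come from a trained model:

```python
# Main dictionary: each word of interest with a phonemic pronunciation.
main_dict = """\
lights L AY T S
on AA N
off AO F
"""

# Filler dictionary: sentence boundaries and noise/filler "words"
# mapped to silence/noise phones.
filler_dict = """\
<s> SIL
</s> SIL
++NOISE++ +NOISE+
"""

# Minimal ARPA-format bigram model; <s> and </s> mark sentence
# boundaries, normally aligned with silence. The log10 probabilities
# here are illustrative, not trained estimates.
bigram_lm = """\
\\data\\
ngram 1=5
ngram 2=4

\\1-grams:
-0.7 <s> -0.3
-0.7 </s>
-0.7 lights -0.3
-0.9 on -0.3
-0.9 off -0.3

\\2-grams:
-0.3 <s> lights
-0.4 lights on
-0.4 lights off
-0.3 on </s>

\\end\\
"""

for name, text in [("home.dic", main_dict),
                   ("home.filler", filler_dict),
                   ("home.arpa", bigram_lm)]:
    with open(name, "w") as f:
        f.write(text)

# Sanity check: every word the language model can emit must have a
# pronunciation in one of the two dictionaries.
lexicon = {line.split()[0] for line in (main_dict + filler_dict).splitlines()}
lm_words = {"<s>", "</s>", "lights", "on", "off"}
print(sorted(lm_words - lexicon))  # → [] (no missing pronunciations)
```

A real deployment would of course use dictionaries and a language model trained for its own domain; the point is only how the three resources fit together.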
The core speech decoder operates on finite-length segments of speech, or utterances, one 
utterance at a time. An utterance can be up to one minute long, but in practice most 
applications handle sentences or phrases which are much shorter than this. For real-time 
use, processing must be continuous, with a response latency that is not excessive. Response 
delays of a second or more may well lead to user annoyance. 
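One common way to obtain such finite utterances from a continuous input is simple energy-based endpointing. The sketch below is an illustrative version: the thresholds, frame sizes and the one-minute cap are parameters chosen for the example, not the recognizer's actual front end:

```python
# Illustrative energy-based endpointer: chops a continuous stream of
# per-frame energy values into finite utterances for the decoder.

FRAME_MS = 10          # one energy value per 10 ms frame
MAX_UTT_FRAMES = 6000  # cap utterances at one minute
TRAIL_SIL_FRAMES = 30  # 300 ms of trailing silence ends an utterance
SPEECH_THRESH = 0.5    # energy above this counts as speech

def segment(energies):
    """Return (start_frame, end_frame) spans of detected utterances."""
    utts, start, silent = [], None, 0
    for i, e in enumerate(energies):
        if e >= SPEECH_THRESH:
            if start is None:
                start = i          # utterance begins at first speech frame
            silent = 0
        elif start is not None:
            silent += 1
            if silent >= TRAIL_SIL_FRAMES:
                utts.append((start, i - silent + 1))
                start, silent = None, 0
        if start is not None and i - start + 1 >= MAX_UTT_FRAMES:
            utts.append((start, i + 1))   # force a cut at the length cap
            start, silent = None, 0
    if start is not None:
        utts.append((start, len(energies)))
    return utts

# 500 ms of speech, 400 ms of silence, then 200 ms more speech.
stream = [0.9] * 50 + [0.1] * 40 + [0.8] * 20
print(segment(stream))  # → [(0, 50), (90, 110)]
```

Since each utterance is closed 300 ms after the speaker stops, the decoder can respond well within the one-second annoyance threshold mentioned above, provided decoding itself keeps up with real time.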
As mentioned in section 3, smart home services are a mixture of small-vocabulary (command-
and-control) and large-vocabulary (e-mail dictation and similar) continuous 
speech systems; thus, we need an ASR which supports both modes. For 
comparison, the concept is similar to (Nakano et al., 2006), in which the Honda ASIMO 
humanoid robot has two dialogue strategies: a) task-oriented dialogues which utilize the 
outputs of a small-vocabulary speech recognizer, and b) non-task-oriented dialogues which 
utilize the outputs of a large-vocabulary speech recognizer.
The major difference between Sphinx and this approach occurs during the implementation 
phase, where ASIMO deploys two different ASR engines (Julian for the small vocabulary and 
Julius for the large one). This differs from the authors' Sphinx-based system, which proposes a 
single recognition engine that not only caters for the needs of both tasks, but has a 
continuously variable vocabulary instead of two extremes as in the ASIMO case. This 
therefore allows a continuum of dialogue complexities to suit the changing needs of the 
vocal human-computer interaction. The particular vocabulary in use at any one time would 
depend upon the current position in the grammar syntax tree. 
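The idea of a vocabulary that follows the current position in the grammar syntax tree can be sketched as follows; the tree, node names and word lists are invented for illustration:

```python
# Single-engine, variable-vocabulary dialogue sketch: each node in a
# grammar syntax tree carries its own active vocabulary, so one
# recognizer can be narrowed for command-and-control or widened
# towards dictation as the dialogue descends the tree.

class GrammarNode:
    def __init__(self, name, vocabulary, children=()):
        self.name = name
        self.vocabulary = set(vocabulary)
        self.children = {c.name: c for c in children}

    def active_vocabulary(self):
        # The recognizer is restricted to this node's own words plus
        # the names of the child nodes reachable from it.
        return self.vocabulary | set(self.children)

dictate = GrammarNode("dictate", {"<large-vocabulary-mode>"})
lights = GrammarNode("lights", {"on", "off", "dim"})
root = GrammarNode("root", {"cancel"}, children=[lights, dictate])

node = root
for word in ["lights", "on"]:
    if word in node.children:
        node = node.children[word]       # descend the syntax tree
    elif word in node.active_vocabulary():
        print(f"action at {node.name}: {word}")

print(sorted(node.active_vocabulary()))  # → ['dim', 'off', 'on']
```

After hearing "lights", the active vocabulary shrinks to the three device commands; a node like the hypothetical "dictate" branch would instead trigger a reconfiguration of the same engine towards its large-vocabulary mode.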
A notable choice for the embedded applications needed in smart homes is the embedded 
version of Sphinx II, called PocketSphinx. Sphinx II was the baseline system 
for creating PocketSphinx because it was faster than the other recognizers then available 
in the Sphinx family (Huggins-Daines et al., 2006). The developers claim PocketSphinx is 
able to address several technical challenges in deploying speech applications on 
embedded devices. These challenges include the computational requirements of continuous 
speech recognition for a medium to large vocabulary, and the need to minimize the size 
and power consumption of embedded devices, which imposes further restrictions on 
their capabilities (Huggins-Daines et al., 2006).
PocketSphinx also defines a four-layer framework – the frame layer, 
Gaussian mixture model (GMM) layer, Gaussian layer, and component layer – which allows 
different speed-up techniques to be categorised straightforwardly according to the layer(s) 
within which they operate. 
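The layer terminology can be illustrated with a toy diagonal-covariance GMM scorer, in which a speed-up (a shortlist of Gaussians, loosely in the spirit of Gaussian selection) is applied at the Gaussian layer. All numbers and the shortlist rule are invented for the sketch and are not PocketSphinx's actual implementation:

```python
import math

# Toy mapping of the four evaluation layers: frame layer (which frames
# are scored), GMM layer (which mixtures are scored), Gaussian layer
# (which Gaussians within a mixture are scored), component layer
# (per-dimension arithmetic inside one Gaussian).

def log_gauss(x, mean, var):
    # Component layer: per-dimension diagonal-covariance log-density.
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def score_gmm(x, gmm, shortlist=2):
    # Gaussian layer speed-up: evaluate only the `shortlist` Gaussians
    # whose means are closest to the frame, not every mixture component.
    ranked = sorted(gmm, key=lambda g: sum((xi - m) ** 2
                                           for xi, m in zip(x, g["mean"])))
    scores = [math.log(g["weight"]) + log_gauss(x, g["mean"], g["var"])
              for g in ranked[:shortlist]]
    m = max(scores)   # log-sum-exp over the shortlist
    return m + math.log(sum(math.exp(s - m) for s in scores))

gmm = [
    {"weight": 0.5, "mean": [0.0, 0.0], "var": [1.0, 1.0]},
    {"weight": 0.3, "mean": [2.0, 2.0], "var": [1.0, 1.0]},
    {"weight": 0.2, "mean": [5.0, 5.0], "var": [1.0, 1.0]},
]

# Frame layer: a full decoder might reuse scores across adjacent frames;
# here we simply score one frame against one mixture (the GMM layer
# would decide which mixtures to score at all).
frame = [1.9, 2.1]
print(round(score_gmm(frame, gmm), 3))
```

Skipping frames, skipping whole mixtures, shortlisting Gaussians, and simplifying the per-dimension arithmetic are then four independent speed-up families, one per layer.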
