most people to use a remote control to set the timer on their video recorder to record forthcoming broadcasts. In addition, as devices decrease in size and average users increase in age, manual manipulation becomes more difficult. From a system architecture point of view, embedded speech recognition is increasingly regarded as a simple approach to user interfacing. Adoption in the embedded sphere contrasts with the more sluggish adoption of larger distributed system approaches (Tan & Varga, 2008). However, there is a price to be paid for such architectural simplicity: complex speech recognition algorithms must run on under-resourced consumer devices. This forces the development of special techniques to cope with the limited computing speed and memory of such systems.
Resource scarcity limits the range of feasible applications; on the other hand, it forces algorithm designers to optimise their techniques in order to guarantee sufficient recognition performance even in adverse conditions, on limited platforms, and under significant memory constraints (Tan & Varga, 2008). Of course, ongoing advances in semiconductor technologies mean that such constraints will naturally become less significant over time.
Indeed, increased computing resources coupled with more sophisticated software methods may be expected to narrow the performance differential between embedded and server-based recognition applications: the boundary between the applications these two approaches can realise will blur, allowing advanced features such as natural language understanding, rather than only simple command-and-control, to become possible in an embedded context. At that point there will no longer be significant technological barriers to using embedded systems to create a smart VI-enabled home.
At present, however, embedded devices typically have relatively slow memory access and scarce system resources, so it is necessary to employ a fast and lightweight speech recognition engine in such contexts. Several such embedded ASR systems have been introduced in (Hataoka et al., 2002), (Levy et al., 2004), and (Phadke et al., 2004) for sophisticated human-computer interfaces within car information systems, cellular phones, and interaction devices for physically handicapped persons (and other embedded applications), respectively.
It is also possible to perform speech recognition in smart homes by utilising a centralised server, connected to a set of microphones and loudspeakers scattered throughout the house, which performs all the processing. This requires significantly greater communications bandwidth than a distributed system (since there may be arrays of several microphones in each location, each with 16-bit sample depth and perhaps a 20 kHz sampling rate) and introduces communications delays, but it allows the ASR engine to operate on a faster computer with fewer memory constraints.
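As a rough illustration of the bandwidth argument, the short C sketch below computes the raw audio data rate implied by the figures above; the four-microphone array size is an illustrative assumption, not a figure from the text.

#include <stdio.h>

int main(void) {
    /* Figures from the text: 16-bit samples at a 20 kHz sampling rate. */
    const int bits_per_sample = 16;
    const int sample_rate_hz  = 20000;
    /* Assumed for illustration only: a four-microphone array per location. */
    const int mics_per_array  = 4;

    /* Raw (uncompressed) data rate for one array, in bits per second. */
    long bps = (long)bits_per_sample * sample_rate_hz * mics_per_array;

    printf("One microphone: %d kbit/s\n",
           bits_per_sample * sample_rate_hz / 1000);
    printf("One array:      %ld kbit/s (%.2f Mbit/s)\n",
           bps / 1000, bps / 1e6);
    return 0;
}

Even without compression, a single microphone streams 320 kbit/s and a four-element array approaches 1.3 Mbit/s; a distributed design avoids this traffic entirely by processing the audio locally.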
As the capabilities of embedded systems continue to improve, the argument for a centralised solution will weaken. We therefore confine the discussion here to a set of distributed embedded systems scattered throughout a smart home, each capable of performing speech recognition and VI. Low-bandwidth communication between such devices to allow co-operative ASR (or CPU cycle-sharing) is an ongoing research theme of the authors, but it does not affect the basic conclusions at this stage.
In the next section, the open-source Sphinx engine is described as a reasonable choice among existing ASR systems for smart-home services. We explain why Sphinx is suitable as a VI core in smart homes by examining its capabilities in an embedded speech recognition context.
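As a foretaste of that discussion, the sketch below shows how an embedded device might decode a single pre-recorded utterance with PocketSphinx, the resource-light member of the Sphinx family. It is a minimal sketch, not a definitive implementation: the function names follow the classic PocketSphinx C API (the 5prealpha-era interface), and the model paths, the input file name, and the assumption of 16 kHz, 16-bit mono raw PCM input are placeholders to be adapted to the target device.

#include <stdio.h>
#include <pocketsphinx.h>

int main(void) {
    /* Placeholder model paths: substitute the acoustic model, language
       model, and dictionary installed on the target device. */
    cmd_ln_t *config = cmd_ln_init(NULL, ps_args(), TRUE,
        "-hmm",  "/usr/share/pocketsphinx/model/en-us/en-us",
        "-lm",   "/usr/share/pocketsphinx/model/en-us/en-us.lm.bin",
        "-dict", "/usr/share/pocketsphinx/model/en-us/cmudict-en-us.dict",
        NULL);
    if (config == NULL) return 1;

    ps_decoder_t *ps = ps_init(config);
    if (ps == NULL) return 1;

    /* A pre-recorded utterance, assumed to be 16 kHz 16-bit mono raw PCM
       to match the default acoustic model. */
    FILE *fh = fopen("utterance.raw", "rb");
    if (fh == NULL) return 1;

    int16 buf[512];
    size_t nread;
    ps_start_utt(ps);
    while ((nread = fread(buf, sizeof(int16), 512, fh)) > 0)
        ps_process_raw(ps, buf, nread, FALSE, FALSE);
    ps_end_utt(ps);

    int32 score;
    const char *hyp = ps_get_hyp(ps, &score);
    printf("Recognised: %s\n", hyp ? hyp : "(nothing)");

    fclose(fh);
    ps_free(ps);
    cmd_ln_free_r(config);
    return 0;
}

The same decoder structure works with buffers taken directly from a microphone rather than a file, which is how a distributed smart-home node would use it in practice.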