[Figure 2: block diagram. Blocks: Microphone; User Interface; Multiplexer; Local info & Commands; Request Matching; Request Formatting; Request Phrasing; Query Generation; Information Extraction; Refine; Results Presentation. Information sources: Semantic Web; Local DB (Wikipedia); Web. Labelled flows: User Request; Results.]
Fig. 2. Overall structure of the WWW vocal query access system.
Speech Recognition, Technologies and Applications
The semantic web, currently being promoted and researched by Tim Berners-Lee and others
(see Wikipedia 2008), goes a long way towards providing a solution: it divorces the
graphical/textual nature of web pages from their information content. In the semantic web,
pages are based around information, which can then be marked up and displayed graphically
if required. For smart home services that benefit from vocal interaction with the semantic
web, the same information could instead be marked up and presented vocally, where the
nature of the information warrants a vocal response (or the user requires one).
There are three alternative methods of VI relating to the WWW resource:
• Semantic web pages (currently few in number), with information extracted, converted to
speech either as specified in the page or according to local preferences, and then
presented vocally.
• HTML web pages, with information extracted, refined and then presented vocally.
• Vocally marked-up web pages, presented vocally.
Figure 2 shows the overall structure proposed by the authors for vocal access to the WWW.
On the left is the core vocal response system handling information transfer to and from the
user. A user interface and multiplexer allow different forms of information to be combined
together. Local information and commands relate to system operation: asking the computer
to repeat itself, take and replay messages, give the time, update status, increase volume and
so on. For the current discussion, it is the ASR aspects of the VI system which are most
interesting:
User requests are formatted into queries, which are then phrased as required and issued
simultaneously to the web, the semantic web and a local Wikipedia database. The semantic
web is preferred, followed by Wikipedia and then the WWW.
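The dispatch step described above can be illustrated with a minimal sketch. It assumes three hypothetical source functions (`query_semantic_web`, `query_wikipedia`, `query_www`) that are simple stubs, not real search back ends; only the simultaneous issue of the query and the stated preference order (semantic web, then Wikipedia, then WWW) are taken from the text.

```python
from concurrent.futures import ThreadPoolExecutor

# Preference order stated in the text: semantic web, then Wikipedia, then WWW.
PREFERENCE = ["semantic_web", "wikipedia", "www"]


def query_semantic_web(query):
    # Hypothetical stub: a real system would query semantic web sources here.
    return []


def query_wikipedia(query):
    # Hypothetical stub standing in for the local Wikipedia database.
    return [f"Wikipedia article about {query}"]


def query_www(query):
    # Hypothetical stub standing in for a general web search.
    return [f"Web page mentioning {query}"]


SOURCES = {
    "semantic_web": query_semantic_web,
    "wikipedia": query_wikipedia,
    "www": query_www,
}


def dispatch(query):
    """Issue the query to all sources at once and return (source, results)
    for the most preferred source that produced any results."""
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in SOURCES.items()}
        results = {name: future.result() for name, future in futures.items()}
    for name in PREFERENCE:
        if results[name]:
            return name, results[name]
    return None, []
```

With these stubs the semantic web returns nothing, so `dispatch("tea")` falls back to the Wikipedia result. A fuller system would probably not wait for the slowest source before answering; this sketch waits for all three only to keep the preference logic easy to follow.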
WWW responses can then be refined by the local Wikipedia database. For example, too
many unrelated hits in Wikipedia indicate that the query may need adjusting.
Refinement may also involve asking the user to choose between several options, or may
simply require rephrasing the question presented to the information sources. Since the
database is local, search time is almost instantaneous, so a request to refine the query
can be put to the user, if required, before the WWW search has completed.
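As a rough illustration of the refinement trigger, the sketch below flags a query for adjustment when the local database returns too many unrelated hits. The relatedness test (a hit title contains at least one query term) and the threshold `max_unrelated` are assumptions made for this example; the text does not specify how relatedness is measured.

```python
def count_unrelated(query_terms, hit_titles):
    """Count hits whose title contains none of the query terms.

    Substring matching on titles is an assumed, deliberately crude
    stand-in for whatever relatedness measure a real system would use.
    """
    terms = [term.lower() for term in query_terms]
    return sum(
        1
        for title in hit_titles
        if not any(term in title.lower() for term in terms)
    )


def needs_refinement(query_terms, hit_titles, max_unrelated=3):
    """Return True when the number of unrelated local hits exceeds the
    (hypothetical) threshold, suggesting the query should be adjusted."""
    return count_unrelated(query_terms, hit_titles) > max_unrelated
```

For instance, a query for "mercury" that returns three Mercury-related titles plus four unrelated ones would exceed the default threshold and could prompt the system to ask the user which sense was intended.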
Finally, results are obtained as either Wikipedia information, web pages or semantic
information. These are analysed, formatted, and presented to the user. Depending on the
context, information type and amount, the answer is given vocally, graphically or
textually. A query cache and learning system (not shown) can be used to improve query
processing and matching based on the results of previous queries.
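The query cache mentioned above is not detailed in the text. One simple possibility, shown purely as a sketch, is to store results under a normalised form of the query so that repeated requests skip the search entirely. The class name `QueryCache`, the normalisation rule and the `fetch` callback are all assumptions for illustration, and no learning component is modelled.

```python
class QueryCache:
    """Hypothetical cache keyed on a normalised query string."""

    def __init__(self):
        self._store = {}
        self.hits = 0  # number of lookups answered from the cache

    @staticmethod
    def normalise(query):
        # Assumed normalisation: lower-case and collapse whitespace.
        return " ".join(query.lower().split())

    def lookup(self, query, fetch):
        """Return cached results for the query, calling fetch(query)
        only when no earlier equivalent query has been stored."""
        key = self.normalise(query)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        result = fetch(query)
        self._store[key] = result
        return result
```

Here `fetch` stands for whatever function actually performs the search; two spoken requests that differ only in capitalisation or spacing would share one cache entry.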
4.3 Dictation
Dictation involves the automatic translation of speech into written form, and is
differentiated from other speech recognition functions mostly because user input does not
need to be interpreted (although doing so may well aid recognition accuracy), and usually
there is little or no dialogue between user and machine.
Dictation systems imply large vocabularies and, in some cases, include an additional
specialist vocabulary for the application in question (McTear, 2004). Such
domain-specific systems can achieve increased accuracy.
Speech Recognition for Smart Homes