Friday, July 23, 2010

Speech Recognition

To simulate an interpersonal scenario with a virtual human convincingly, you need speech recognition (in real life we speak to each other; we do not click buttons or type text). For this reason, I have been closely following the evolution of the speech recognition industry for some time now.

During the MGUIDE project I successfully integrated speech recognition into one of my prototypes. I used Microsoft Speech Recognition Engine 6.1 (SAPI 5.1) with dictation grammars, which I developed in pure XML using the Chant GrammarKit. The grammars look like this:

<RULE name="Q1" TOPLEVEL="ACTIVE">
  <l>
    <P><RULEREF NAME="want_phrases"/>to begin</P>
    <P><RULEREF NAME="want_phrases"/>to start</P>
    <P><RULEREF NAME="want_phrases"/>to start immediately</P>
  </l>
  <opt>the tour </opt>
  <opt>the tour ?then</opt>
</RULE>
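
To make it easier to see which utterances a rule like Q1 accepts, here is a minimal Python sketch that expands it into plain phrases. It is not SAPI code and not part of the prototype. The contents of want_phrases are not shown above, so the two entries used here are placeholders, and the two <opt> lines are read literally as two optional parts that follow one another.

```python
from itertools import product

# Hypothetical contents of the want_phrases rule (not shown in the post).
WANT_PHRASES = ["I want", "I would like"]

# The three <P> alternatives inside the <l> list of rule Q1.
ENDINGS = ["to begin", "to start", "to start immediately"]

# <opt>the tour </opt>: either nothing or "the tour".
OPT_ONE = ["", "the tour"]

# <opt>the tour ?then</opt>: nothing, "the tour", or "the tour then"
# (the leading "?" marks "then" as an optional word).
OPT_TWO = ["", "the tour", "the tour then"]


def normalize(text):
    """Lower-case and collapse whitespace so comparisons ignore formatting."""
    return " ".join(text.lower().split())


def expand_q1():
    """Return every utterance accepted by this reading of rule Q1."""
    phrases = set()
    for parts in product(WANT_PHRASES, ENDINGS, OPT_ONE, OPT_TWO):
        phrases.add(normalize(" ".join(parts)))
    return phrases


def in_grammar(utterance):
    """True if the utterance is one of the expanded phrases."""
    return normalize(utterance) in expand_q1()


if __name__ == "__main__":
    print(len(expand_q1()), "accepted phrases")
    print(in_grammar("I want to start the tour"))
    print(in_grammar("Tell my husband I'll be late"))
```

A real engine matches audio against the grammar rather than comparing strings, so this only illustrates how small the set of accepted phrases is.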

I also voice-enabled the control of my system's interface, so if you say “Pause” the virtual guide pauses its presentation. I briefly tested both modes with one participant in the lab. In dictation mode, with just a couple of minutes of training, Microsoft’s engine performed with 100% accuracy within the constraints of the grammar. For completely unknown input, the engine performed with less than 40% accuracy. In CnC mode, the engine worked with 100% accuracy without any training. SAPI 5.4 in Windows 7 reportedly offers much better recognition rates in both dictation and CnC modes. I haven’t tried SAPI 5.4 yet, but it is in my plans for the future. I think that true speaker-independent (i.e., without training) recognition in indoor environments is only 5 years away, at least for the English language.
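
The CnC idea can be sketched in a few lines of Python: the recognizer returns a short piece of text, and the application looks it up in a fixed table of commands. This is only an illustration, not the prototype's code. The VirtualGuide class, its methods, and the "resume" command are assumptions; only "Pause" comes from the description above.

```python
class VirtualGuide:
    """Hypothetical stand-in for the guide's presentation controls."""

    def __init__(self):
        self.state = "playing"

    def pause(self):
        self.state = "paused"

    def resume(self):
        self.state = "playing"


def build_commands(guide):
    """Map each spoken command (assumed vocabulary) to an action."""
    return {
        "pause": guide.pause,
        "resume": guide.resume,
    }


def handle_utterance(guide, recognized_text):
    """Run the matching command; return False for anything outside the vocabulary."""
    action = build_commands(guide).get(recognized_text.strip().lower())
    if action is None:
        return False
    action()
    return True


if __name__ == "__main__":
    guide = VirtualGuide()
    handle_utterance(guide, "Pause")
    print(guide.state)
```

Because the vocabulary is a handful of fixed commands, anything else is simply ignored, which is one plausible reason this mode needs no training.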

In mobile environments, Siri appears to be the only solution out there that realises the idea of a virtual assistant on the go using speech recognition. Siri uses dynamic grammar recognition, similar to my approach. If you say something within the constraints of the grammar, recognition accuracy reaches 100%. However, as with my prototype, if you say something outside the grammar files, the recognition results can be really funny.

Statement to Siri: Tell my husband I’ll be late

Reply: Tell my Husband Ovulate (he he he)

[Figure: ASR]

Source: http://siri.com/v2/assets/Web3.0Jan2010.pdf
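
The accuracy figures in this post are given as percentages without saying how they were measured. One common measure is word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the recognized text into what was actually said, divided by the number of words spoken. The sketch below is a plain Python illustration of that measure, not the method used in my tests or by Siri.

```python
def tokenize(text):
    """Lower-case the text and split it into words."""
    return text.lower().split()


def word_edit_distance(reference, hypothesis):
    """Minimum number of word substitutions, deletions, and insertions."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    # previous[j] holds the distance between the reference words seen so far
    # (minus the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # reference word was dropped
                current[j - 1] + 1,      # extra word was inserted
                previous[j - 1] + cost,  # words match or were substituted
            ))
        previous = current
    return previous[-1]


def word_error_rate(reference, hypothesis):
    """Word edit distance divided by the number of reference words."""
    ref_len = len(tokenize(reference))
    if ref_len == 0:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(reference, hypothesis) / ref_len


if __name__ == "__main__":
    said = "Tell my husband I'll be late"
    heard = "Tell my Husband Ovulate"
    print(word_error_rate(said, heard))
```

For the example above, three of the six spoken words are lost (one substitution and two deletions), so the WER is 0.5.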

Terminology:

Dictation Speech Recognition: The type of speech recognition in which the computer tries to transcribe what you say into text.

Command and Control mode (CnC): The type of speech recognition used to control applications with spoken commands.
