Monday, May 31, 2010

Sample Portfolio 2

Several more prototypes can be constructed and evaluated around the virtual guide agent with minimal effort. However, I think it's time to present some of the architectural designs on which the system was based:

Architecture-related software: The architecture diagrams below were made mainly with MindJet MindManager.

Hub-and-Spoke Style Dialogues:

Prototype 1 features a Finite State Machine (FSM) dialogue manager capable of dynamically displaying questions based on the user’s selection and the current context. The questions cover a very broad range of the questions and clarifications a user might ask after the presentation of a location. This FSM was used in most systems except Prototype 4, where the user could ask questions using natural dialogue. The flow diagram below gives an idea of how dialogue branches within the system.

Please note that only the branching for questions 1 and 4 is shown in the diagram below. In addition, although not shown, the final hubs all terminate in the “Dialogue Terminate” state.

Short explanation:

The user listens to an introductory script from the guide agent and can then either select to listen to the presentation about a location or have a look around. Once the presentation is complete, a hub of questions is loaded, from which the user can select what to ask. After hearing the guide’s answer, the user can either return to the main hub (i.e., by selecting “I want to ask something else”), from which they can ask another question, or enter a deeper hub with more options to choose from. The user can either exhaust all of the questions available to them and proceed to the next location, in which case the dialogue enters the “Dialogue Terminate” state, or simply tap the Next button on the system’s interface to proceed.
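To make the hub-and-spoke branching above more concrete, here is a minimal sketch of such a dialogue FSM in Python. The state and option names are illustrative placeholders rather than the actual MGUIDE states, and the real dialogue manager of course holds many more hubs and questions.

# Minimal hub-and-spoke dialogue FSM sketch. The state and option names are
# illustrative placeholders, not the actual MGUIDE dialogue states.

class DialogueFSM:
    def __init__(self, hubs, start="Introduction"):
        self.hubs = hubs          # state -> list of (option text, next state)
        self.state = start

    def options(self):
        # Options the interface should display in the current state.
        return [text for text, _ in self.hubs.get(self.state, [])]

    def select(self, option_text):
        # Move to the hub associated with the selected option.
        for text, next_state in self.hubs.get(self.state, []):
            if text == option_text:
                self.state = next_state
                return self.state
        raise ValueError("Option not available in state " + self.state)

hubs = {
    "Introduction":  [("Listen to the presentation", "Presentation"),
                      ("Have a look around", "LookAround")],
    "Presentation":  [("Question 1", "Hub_Question1"),
                      ("Next location", "DialogueTerminate")],
    "Hub_Question1": [("I want to ask something else", "Presentation"),
                      ("Tell me more about that", "Hub_Question1_Deeper")],
    "Hub_Question1_Deeper": [("I want to ask something else", "Presentation"),
                             ("Next location", "DialogueTerminate")],
}

fsm = DialogueFSM(hubs)
fsm.select("Listen to the presentation")    # state becomes "Presentation"
fsm.select("Question 1")                    # state becomes "Hub_Question1"

Each hub is just a mapping from the current state to the options the interface displays; selecting an option moves the FSM to the next hub until the “Dialogue Terminate” state is reached.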

System Architecture:
 
This is one of my first attempts to design Talos – an authoring toolkit for virtual guide development and research. The architecture is far from complete, but it serves as an excellent basis for designing the final architecture of the toolkit. Unfortunately, I cannot provide any explanation of the toolkit or its modules, as I still have to publish a paper on it. However, the main idea of Talos is to provide a cheap solution for researchers wishing to create virtual guide prototypes, as well as for content writers (e.g., guide book writers) wishing their content to be presented through a truly multi-modal medium.
 
Software Structure:

1) This is my first attempt to implement a simple UI that fully or partially automates the development of virtual humans. This prototype has many interesting features that I implemented during my M.Phil. years (2003-2006), such as a Cyc query builder aimed at automating the transformation of questions into a format suitable for insertion into CyN AIML scripts (see the sketch after this list).

2) The structure of a system I built during my M.Phil. years (2003-2006).
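Neither the UI nor the Cyc integration can be shown here yet, but the core idea behind the query builder in point 1 (turning a question/answer pair into an AIML category that a CyN-style interpreter can load) can be sketched as follows. The helper name is hypothetical, and anything Cyc-specific in the real CyN scripts is left out; only standard AIML elements are used.

# Illustrative sketch: build an AIML <category> from a question/answer pair.
# The element names follow standard AIML; anything Cyc-specific in the real
# CyN scripts is not represented here.
from xml.sax.saxutils import escape

def build_category(question, answer):
    pattern = escape(question.strip("?!. ").upper())   # AIML patterns are conventionally upper-case
    template = escape(answer)
    return ("<category>\n"
            "  <pattern>" + pattern + "</pattern>\n"
            "  <template>" + template + "</template>\n"
            "</category>")

print(build_category("What is a pergola?",
                     "A pergola is a garden structure that supports climbing plants."))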

 

 

 

Saturday, May 29, 2010

Sample Portfolio

Information architecture is the “art and science of organizing and labelling websites, intranets, online communities and software to support usability” [1]. MGUIDE takes information architecture to the next level by suggesting human modalities (gestures, facial expressions, emotion recognition, real-time adaptation to user information, etc.) as an alternative method for supporting this goal. I have done extensive experimental work on trying different human modalities and methods for ensuring effective information architecture. Technology in the avatar domain is now so advanced that we can simulate various human-to-human communication scenarios in order to ensure that our message gets across effectively. One such scenario is, for example, the use of humorous or serious messages to attract attention when someone is not paying attention to what we are saying. The list of these scenarios is endless, and so are the experiments that can be performed to gather evidence on their application in the virtual world. I truly believe that the future of the Web is 3D and, hence, that avatar-based applications will be the next “big thing”. No matter what we do, no matter how well we design a web site, it cannot compare with a humanised interface for information retrieval and processing. Human gesturing, natural language processing, emotion recognition, and other human modalities, if combined together correctly, can create the most powerful paradigm for our proliferated information age.

Hence, MGUIDE is not related to the “traditional” web page model. It leans towards Web 2.0 applications, i.e., applications with high interactivity. The architectural designs you will find related to MGUIDE involve complex software blueprints, dialogue hubs and branching, and so on. However, this doesn’t mean that I can’t do information architecture for web sites and internet applications, or that I haven’t done it in the past. Actually, I find it relatively easy compared to the complexity of the work I did in MGUIDE. Back in the old days – I have been working in the new media/education industry for almost 10 years – we used to do information architecture on the fly. I have done it for large companies in Greece and internationally, on both multimedia productions and web sites. Back then I was responsible for a web site from its “birth” until its full deployment, which means that I learned information architecture the old-school way: by designing, developing and deploying large multimedia projects and web sites. Below you will find a sample of this work.

Web Sites: Companies include:

Elnet Site, Metrolife – Emporiki, Marketing Lead and others

 

CD/DVD ROM

Companies include: Centric multimedia, Grecotel, Metrolife – Emporiki and others

 

References

[1] http://iainstitute.org/documents/learn/

Prototype 5

This prototype is similar to the others, but it:

1) Features only brief descriptions of four locations in the castle of Monemvasia. This content has been carefully crafted to ensure that it is short (but not too short) in terms of length, analytical, and, most importantly, simple to understand.

2) The Loquendo Kate text-to-speech voice speaks too fast, even in its low-speed mode. To fix this, I have introduced a 1-second delay between each sentence of the description. This makes the presentation flow more naturally and the text easier to understand (see the sketch at the end of this post).

3) Features two types of virtual guides:

a) One virtual guide that uses full gestures and happy facial expressions. This guide always looks at the user.

b) Another virtual guide that uses no gestures and whose face always appears serious. This guide looks away from the user most of the time.

The screenshots below demonstrate both guides:

guides
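The pacing fix described in point 2 boils down to splitting the description into sentences and inserting a short pause before moving on to the next one. Below is a minimal sketch of that idea; speak() is a stand-in for the actual Loquendo text-to-speech call, not the real interface.

# Sketch of the sentence-level pacing described in point 2 above.
# speak() is a placeholder for the real text-to-speech engine call.
import re
import time

def speak(sentence):
    print("TTS:", sentence)        # placeholder for the actual TTS engine

def present(description, pause_seconds=1.0):
    # Naive sentence split on ., ! and ? followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', description.strip())
    for sentence in sentences:
        speak(sentence)
        time.sleep(pause_seconds)  # the 1-second delay that slows the presentation down

present("Monemvasia is a fortified town. Its castle overlooks the sea.")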

Friday, May 28, 2010

Prototype 2

Prototype 2 (and Prototype 1) were evaluated in Greece. The idea here was to build a system that offers information of variable complexity about the castle and, most importantly, allows the visitor to navigate freely within it.

A number of solutions were considered in order to allow free navigation:

a) Real-time landmark recognition. This feature is shown below (move to slider 10 1.58). The idea here is to train the system to recognize a particular location (simply by taking a photo of it) and allow the user to initiate a presentation about that location by pointing the device’s camera at it. This would be extremely neat to have in the current Prototype 2 implementation, but the technology is probably still too expensive to purchase.

b) QR-CODE technology. This feature is shown below (Greek language only). The idea here is to tag each location in the castle with a QR-CODE (similar to a product bar code, but able to hold textual as well as numerical information) and allow the user to initiate a presentation about a location simply by photographing its QR-CODE. A more advanced version of the technology allows functionality similar to real-time landmark recognition (i.e., real-time video QR-CODE recognition). It would be interesting to design a study comparing the performance and usability of the two technologies. At the moment, apart from cost (QR-CODE is far cheaper), I see no core differences between them.
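Functionally, the QR-CODE option reduces to decoding a location identifier from the photographed tag and starting the matching presentation. The sketch below only illustrates that mapping: decode_qr() is a placeholder for the commercial PartiTek decoder used in the prototype, and the location ids and presentation names are made up.

# Sketch of the QR-CODE navigation idea: each tag encodes a location id and
# decoding it starts the corresponding presentation.
# decode_qr() is a placeholder for the actual decoder component.

LOCATIONS = {
    "LOC01": "Main Gate presentation",       # illustrative ids and titles
    "LOC02": "Upper Town presentation",
}

def decode_qr(photo_bytes):
    # Placeholder: the prototype used PartiTek's PtQRDecode component here.
    raise NotImplementedError

def on_photo_taken(photo_bytes):
    location_id = decode_qr(photo_bytes)
    presentation = LOCATIONS.get(location_id)
    if presentation is None:
        return "Unknown tag - please try photographing the code again."
    return "Starting: " + presentation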

Wednesday, May 26, 2010

Prototype 4 (Technical updates)

After a series of tests with the Antelope tagger, I found that it still has a lot of bugs. Hence, I decided to use the Stanford parser as a tagger in order to perform shallow parsing on the returned VPF trigger. It seems to be working OK, though it is much slower than the Antelope tagger.

I also replaced the flat textual database with an XML one, which provides a more structured way to access its contents. For example:

<sentances id="2">
  <text>Can we begin the tour please</text>
  <predicates2>begin</predicates2>
  <predicates2>please</predicates2>
  <Deep_Syntax name="Subject">We</Deep_Syntax>
  <Deep_Syntax name="Subject">Tour</Deep_Syntax>
</sentances>
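A few lines of code are enough to read entries like the one above back into memory and pick the entry whose predicates overlap most with those extracted from the user's input. The snippet below is only a sketch of that lookup; it assumes the entries are wrapped in a single root element in a file called sentences.xml, and it keeps the element names exactly as they appear in the example.

# Sketch: load the XML sentence database shown above and find the entry whose
# predicates overlap most with those extracted from the user's input.
import xml.etree.ElementTree as ET

def load_entries(path="sentences.xml"):
    entries = []
    for node in ET.parse(path).getroot().iter("sentances"):   # element name as in the example
        entries.append({
            "id": node.get("id"),
            "text": node.findtext("text"),
            "predicates": {p.text.lower() for p in node.findall("predicates2")},
        })
    return entries

def best_match(input_predicates, entries):
    # Return the entry sharing the most predicates with the parsed input.
    input_predicates = {p.lower() for p in input_predicates}
    return max(entries, key=lambda e: len(e["predicates"] & input_predicates))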

A pilot with three participants showed that the search algorithm works effectively. In addition, the following design alterations were suggested:

1) Break down the presentations each user has to listen to (and ask questions about) into parts. That way, question asking can be focused on specific parts rather than on the presentation as a whole.

2) Create a castle window that helps users visualize the presented information. I used detailed 3D panoramic pictures from each location in the castle to help them achieve this goal (see screenshot).

Prototype4_new

3) Include stop and pause buttons on the interface. Participants can pause the text-to-speech synthesis or simply stop it if they find it annoying.

Prototype4_new2

4) Search Google and Wikipedia using natural language for words and terms that are unknown to the user. This feature was simulated in the current system.

This system will be evaluated with 15 participants at Middlesex University.

Future Work:

1) Extend the current algorithm to include semantics. Such a system would match the user’s input against existing triggers in the DB at the semantic level.

2) Replace the XML database with an SQL one. Although the XML DB works fine, it has to load all phrases and parses into RAM. With a static MySQL database I can avoid that.

3) In the current implementation, if the system fails to match an input with a trigger in the DB, it returns "I am not sure if I understand your question, please rephrase". Parsing of the input, though, is usually successful. The idea here is to allow the system to actually learn from the questions previous users have asked. For example, if the first user asks "What is a pergola?" and the system doesn't have an answer, it should be able to dynamically update its databases and return an answer by searching the WWW. That way, the more questions users ask, the more "intelligent" the system becomes.
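The learning idea in point 3 could be sketched along the following lines. search_web() and the dictionary-based storage are purely hypothetical placeholders; the real system would need an actual WWW/Wikipedia lookup plus an update of the VPF triggers or the XML database.

# Sketch of the "learn from unanswered questions" idea in point 3.
# search_web() is a hypothetical placeholder for a real web/Wikipedia lookup.

FALLBACK = "I am not sure if I understand your question, please rephrase."

def answer(question, database):
    if question in database:
        return database[question]
    learned = search_web(question)          # hypothetical lookup
    if learned:
        database[question] = learned        # remember it for the next user
        return learned
    return FALLBACK

def search_web(question):
    return None                             # placeholder: no lookup implemented here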

Monday, May 24, 2010

Videos – Prototype 4 (2nd Demo)

The video below demonstrates the two processing layers I constructed on top of the VPF for better language understanding. An early version of the algorithm can be found here. Please note that the few minutes of delay at the beginning of the video are due to the initialization of the tagger.

    This system uses shallow parsing and deep syntactic processing to match the user’s input with the database. In particular, the following steps are taken to find the database phrase closest to the input:

    Stage 1: Shallow Parsing

  • Replace contractions

[didn't, 'll, 're, lets, let's, 've, 'm, won't, 'd, 's, n't] with [did not, will, are, let us, let us, have, am, will not, would, is, not]

  • Remove unnecessary words and POS

[ok, yes, no, hmm, yeah, uh, huh, to, Um, Oh, Alas, Oh, Eh, er, uh, uh huh, um, well]

[Article, Preposition, Conjunction, Determiner, Modal, Interjection, Numeral, Punctuation]

  • Tag the user’s input with its Part of Speech (POS).

  • Tag the VPF match of the user’s input with its Part of Speech (POS)

  • Filter both input and VPF match, based on a list of the global keywords returned by the VPF Web Service.

  • Compare what is left for POS and values. For example, for my question “Does the castle has any other gates?” only the (gates castle) keywords are returned.

  • If the comparison is successful, allow the output (i.e., a script containing all the synchronized animations, speech, etc.) of the VPF service to be executed by the system.

  • If the comparison fails, pass the input to Stage 2 for deep syntactic processing.
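Putting the Stage 1 steps above together, a stripped-down sketch of the shallow comparison might look like the following. The POS tagging and POS-based removal are omitted for brevity; the filter against the VPF global keyword list stands in for them, and the word lists are shortened versions of the ones given above.

# Stage 1 sketch: normalise contractions, drop filler words, keep only words
# that appear in the VPF global keyword list, then compare what is left.
# POS tagging/removal is omitted; the keyword filter stands in for it here.

CONTRACTIONS = {"didn't": "did not", "'ll": " will", "'re": " are", "let's": "let us",
                "'ve": " have", "'m": " am", "won't": "will not", "n't": " not"}
FILLERS = {"ok", "yes", "no", "hmm", "yeah", "uh", "huh", "um", "oh", "eh", "er", "well"}

def filtered_keywords(sentence, global_keywords):
    text = sentence.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    words = {w.strip("?!.,") for w in text.split()}
    return {w for w in words if w not in FILLERS and w in global_keywords}

def stage_one_match(user_input, vpf_trigger, global_keywords):
    # Compare the filtered keyword sets of the input and the VPF trigger.
    left = filtered_keywords(user_input, global_keywords)
    right = filtered_keywords(vpf_trigger, global_keywords)
    return len(left) > 0 and left == right

global_keywords = {"castle", "gates", "walls", "tour"}
stage_one_match("Does the castle has any other gates?",
                "does the castle have other gates", global_keywords)   # -> True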

    Stage 2: Deep Syntactic Processing

  • Fully parse the user’s input and extract its predicates and Deep Dependencies (e.g., Subject, DirectObject, etc.). For example, my phrase “I would like a brief description about all walls!” fails in the first stage of processing and parses as:

like(Subject: I, DirectObject: description, SpaceComplement: about walls)

  • If it is a single-predicate sentence, conduct 10 similarity tests to check for similarities between the parsed input and the pre-parsed sentences in the database.

  • If a match is found, query the VPF with the match.

  • Return the output (i.e., a script containing all the synchronized animations, speech, etc) and execute it in the system.

  • If it is a double-predicate sentence, conduct 9 similarity tests to check for similarities between the parsed input and the pre-parsed sentences in the database.

  • If a match is found, query the VPF with the match.

  • Return the output (i.e., a script containing all the synchronized animations, speech, etc) and execute it in the system.

  • If this stage fails, ask the user to rephrase or to move on to another question.
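A compressed sketch of the Stage 2 matching follows: the parses are represented as plain dictionaries, and a simple count of matching predicate and dependency fillers stands in for the actual battery of 9/10 similarity tests.

# Stage 2 sketch: score a parsed input against pre-parsed database sentences by
# counting matching predicates and deep-dependency fillers. This is a stand-in
# for the 9/10 similarity tests, not the actual MGUIDE test battery.

def similarity(parsed_input, parsed_db_sentence):
    score = 0
    if parsed_input["predicate"] == parsed_db_sentence["predicate"]:
        score += 2                               # a predicate match weighs most
    for role, value in parsed_input["dependencies"].items():
        if parsed_db_sentence["dependencies"].get(role) == value:
            score += 1
    return score

def stage_two_match(parsed_input, database, threshold=2):
    best = max(database, key=lambda entry: similarity(parsed_input, entry))
    return best if similarity(parsed_input, best) >= threshold else None

parsed_input = {"predicate": "like",
                "dependencies": {"Subject": "i", "DirectObject": "description",
                                 "SpaceComplement": "about walls"}}
database = [{"id": 7, "predicate": "like",
             "dependencies": {"Subject": "i", "DirectObject": "description"}}]
stage_two_match(parsed_input, database)          # -> the entry with id 7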

I also experimented with semantic parsing, but the Antelope parser is still experimental. I am planning to add a third stage for full semantic processing to the current algorithm once Antelope’s parser has fully matured.

Tuesday, May 11, 2010

Videos – Prototype 4 (Original Version)

This is a demo of the original version of Prototype 4. The system was initially constructed with the aim of comparing a natural method of communication with the guide against a menu of predefined phrases. This system also uses speech recognition (not shown in this demo) with dynamic grammars. However, because of the time needed to evaluate the system per participant (more than two hours), we decided to drop it and replace it with a simpler system (shown in Screenshot 1).

This demo doesn’t use AIML but rather accesses the VPF service (http://vpf.cise.ufl.edu/VirtualPeopleFactory/) through an API provided by its creator, Brent Rossen. VPF uses an approach similar to AIML, but the patterns don’t have to be said verbatim to match. This means that if the system has a trigger (i.e., a question) that is even remotely similar to what you said, it will match it and return its associated speech (i.e., the answer). This is one step further towards language understanding without any linguistic processing, but of course it has its limitations, which I will discuss in another post.

Screenshot 1: The simpler system that replaced Prototype 4.

Language

Friday, May 7, 2010

Videos – Prototype 1 (Part 2 Greek Only)

This is the second part of the video for Prototype 1. The system has just completed the presentation for location 1 and asks the user to move on to location 2. Notice how the agent reacts at the beginning of the presentation (she cannot see the user through the camera). The reaction is similar to the one you would expect from a real guide if she saw that someone in the group was not listening to what she was saying.

Videos – Prototype 1 (Part 1 Greek Only)

This is the first part of the two videos for Prototype 1. The guide’s task is to provide navigation instructions on predetermined routes, as well as personalised information for specific locations in the castle of Monemvasia.

When the system loads, the user is given a choice between three information scenarios: Architecture, History and Biographical. The user can also customize the appearance of the agent and other system settings, but this is not shown in the video. During a presentation, the agent can utilize information from the Face-Detection module and react if, for example, the user is standing too far away from the camera. Finally, notice the use of an FSM (Finite State Machine) in the construction of the dialogues. With the proper authoring tool, such dialogues are extremely easy to make and can cover a whole range of dialogue phenomena.
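As a rough illustration of how the Face-Detection information can drive the agent’s reactions: the state names below follow the Face Detection control described in the MGUIDE Components post (right, left, far_away, normal, close), while the reaction lines and the is_presenting flag are made up for the sketch.

# Sketch: map states reported by the face-detection module to agent reactions.
# The state names follow the Face Detection control described later in this
# blog; the reaction texts are only illustrative.

REACTIONS = {
    "far_away": "Could you come a little closer? I can barely see you.",
    "left":     "I am over here - please look at the screen.",
    "right":    "I am over here - please look at the screen.",
    "close":    "You are a bit too close to the camera.",
}

def react(face_state, is_presenting):
    # Only interrupt the presentation when the user clearly is not attending.
    if is_presenting and face_state in REACTIONS:
        return REACTIONS[face_state]
    return None

react("far_away", is_presenting=True)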

Another idea I experimented with for a while was the use of emotional responses as a method of guiding how a presentation about a location evolves. For example, if the user is too bored by the provided information, the agent can try to either provide alternative information or speed up the pace. However, it is impossible to implement such an approach in the existing script-based systems. Possibly a KB is needed to dynamically create the contents of each presentation, but how the agent augments the story with non-verbal behaviour is an open question.

Thursday, May 6, 2010

Videos – Prototype 3 (Clothes)

This is the same prototype system (Prototype 3) but with a clothed avatar. The clothing is completely dynamic and very demanding on hardware resources; it won't even open on the UMPC. To run the cloth animations smoothly (along with the avatar's) you need a quad-core system. In my opinion, the avatar looks more realistic with the clothes on than without. The clothes are available free of charge at http://virtual-guide-systems.blogspot.com/2009/09/current-developments-haptek-clothing.html

Videos – Prototype 3

Below is a link to a video from the third prototype of the MGUIDE project. The guide’s task is to provide navigation instructions based on photographs of landmarks shown in her background. The only drawback of the video is the loss of Simon's lip-syncing and smooth animations due to the limitations of the screen-capture software, which had trouble handling everything at the same time. I will use a video camcorder to show a user in action once I begin the evaluation of the system.

On the UMPC device I get 35 FPS, which is impressive if you think about the hardware limitations. The only way to get the FPS that high is to have the face-detection module of the system on at all times. Although this should have the opposite effect (i.e., decrease the FPS), for some strange reason it boosts the engine’s FPS to the maximum. Without face detection I get 4-10 FPS. I am not sure why this happens, but I am glad it does.

 

Monday, May 3, 2010

MGUIDE Components

Below are a number of components I used in the development phase of MGUIDE. I am giving them away for free; all you have to do is email me at virtual.guide.systems at googlemail.com.

1) AIML Control (.dll): This control can be added to any authoring environment (e.g., Flash, Director, etc.) as long as you have the latest .NET Framework installed. The control can handle Unicode characters and has full JavaScript support.

2) WebCam Capture (.dll). A simple control that allows you to add webcam support to any project.

3) Face Detection (.dll). A control that allows you to add face detection to any project. It uses Intel’s OpenCV library for face detection and a .NET wrapper from Mr. Oshikiri. The control can detect the states right, left, far_away, normal and close. It currently uses OpenCV 1.0; if you want to upgrade to the latest OpenCV 2.0, you will need the latest .NET wrapper from here.

4) QRCode (.dll). The control allows you to add QR-CODE recognition into any project. It uses two commercial components from a company called PartiTek.

a) PtImageRW.dll

b) PtQRDecode.dll

You will need to purchase these components if you want the control to work.

5) Subtitles (.dll). A control that allows you to add speech subtitles to any project. The control uses a number of commercial components from Chant.

a) Chant.Shared.dll

b) Chant.SpeechKit.dll

c) DNSpeechKit.dll

d) NSpeechKitLib.dll

and an XML file as a subtitles feed. You will need to purchase Chant SpeechKit if you want this component to work.

Finally, don't forget my free offering of Haptek clothing at: http://virtual-guide-systems.blogspot.com/2009/09/current-developments-haptek-clothing.html

 

MGUIDE Project Investors and Supporters

I wish to thank the following for supporting my work:

 

Individuals:

1) Ms Maria Tsouros – Graphic Design
2) Dr Katerina Theodoridou
3) Ms Fotini Singel – Proof-reading of Greek texts
4) Ms Arxontoula Tsakou – Recording of video clips from the Castle of Monemvasia

Universities:

2) Mr Mark Chavez, Nanyang Technological University
3) Middlesex University, School of Engineering and Information Sciences.

 

Companies:

1)  Haptek (3D Avatar Engine)
image
 
2) Chant (Interfaces for speech recognition and text-to-speech generation)
image
 
3)  Loquendo (Text to Speech Engine)
image
 
4) Proxeme (Natural Language Processing Platform)
image
 
5) Panoramic Applications
image
 
6)  PartiTek  (QR-Code Recognition and generation)
image