
RESEARCH PRODUCT

An Emotional Talking Head for a Humoristic Chatbot

Agnese Augello, Salvatore Gaglio, Orazio Gambino, Giovanni Pilato, Roberto Pirrone, Vincenzo Cannella

subject

Settore ING-INF/05 - Sistemi Di Elaborazione Delle Informazioni; Computer science; talking head; chatbot; emotional; AIML; Speech synthesis; Animation; computer.software_genre; Cyberware; ASCII; Chatbot; Intelligent agent; Human–computer interaction; Graphics (computer); Natural language; ComputingMethodologies_COMPUTERGRAPHICS

description

Interest in enhancing the interface usability of applications and entertainment platforms has increased in recent years. Research in human-computer interaction on conversational agents, also known as chatbots, and on natural language dialogue systems equipped with audio-video interfaces has grown as well. One of the most pursued goals is to enhance the realism of interaction with such systems. For this reason, they are provided with engaging interfaces based on human-like avatars capable of adapting their behavior to the content of the conversation. Such agents can interact vocally with users by means of Automatic Speech Recognition (ASR) and Text To Speech (TTS) systems; moreover, they can change their "emotions" according to the sentences entered by the user. In this framework, the visual aspect of interaction also plays a key role, leading to systems capable of synchronizing speech with an animated face model. Systems of this kind are called Talking Heads.

Several implementations of talking heads are reported in the literature. Facial movements are simulated by rational free-form deformation in the 3D talking head developed in Kalra et al. (2006). A Cyberware scanner is used to acquire the surface of a human face in Lee et al. (1995); the surface is then converted into a triangle mesh by means of image analysis techniques aimed at finding local minima and maxima of reflectance. In Waters et al. (1994) the DECface system is presented. In this work, the animation of a wireframe face model is synchronized with an audio stream provided by a TTS system. An input ASCII text is converted into a phonetic transcription, and a speech synthesizer generates an audio stream. The audio server is queried to determine the phoneme currently being played, and the shape of the mouth is computed from the trajectories of the main vertices. In this way, the audio samples are synchronized with the graphics. A nonlinear function controls the translation of the polygonal vertices so as to simulate mouth movements. Synchronization is achieved by calculating the deformation length of the mouth, based on the duration of a group of audio samples. BEAT (Behavior Expression Animation Toolkit), an intelligent agent with human characteristics controlled by an input text, is presented in Cassell et al. (2001). A talking head for the Web with a client-server architecture is described in Ostermann et al. (2000). The client application comprises the browser, the TTS engine, and the animation renderer. A 15
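To make the phoneme-driven synchronization scheme described above more concrete, the following minimal sketch shows how a mouth-opening parameter could be derived from the phoneme currently being played, with a nonlinear easing profile controlling the vertex translation. It is not the DECface implementation: the phoneme set, the viseme table, the smoothstep easing, and all names are illustrative assumptions; in a real system the phoneme symbols and their timings would be reported by the TTS engine.

```python
# Minimal sketch (not DECface) of phoneme-driven mouth animation
# synchronized to an audio timeline. Viseme table, phoneme set, and
# easing function are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Phoneme:
    symbol: str      # phonetic symbol as reported by the TTS engine
    start: float     # start time in seconds within the audio stream
    duration: float  # duration in seconds

# Hypothetical viseme table: target mouth opening (0 = closed, 1 = fully open).
VISEME_OPENING = {
    "a": 1.0, "o": 0.8, "e": 0.6, "i": 0.4, "u": 0.5,
    "m": 0.0, "b": 0.0, "p": 0.0, "sil": 0.0,
}

def ease(t: float) -> float:
    """Nonlinear (smoothstep) profile controlling the vertex translation,
    in the spirit of the nonlinear function mentioned for DECface."""
    return t * t * (3.0 - 2.0 * t)

def mouth_opening(phonemes: list, audio_time: float) -> float:
    """Return the interpolated mouth opening at the current audio time,
    so that the graphics stay synchronized with the audio being played."""
    for ph in phonemes:
        if ph.start <= audio_time < ph.start + ph.duration:
            target = VISEME_OPENING.get(ph.symbol, 0.3)
            # Fraction of the phoneme already elapsed drives the deformation.
            progress = (audio_time - ph.start) / ph.duration
            return target * ease(progress)
    return 0.0  # silence: mouth closed

if __name__ == "__main__":
    # Toy phoneme timeline as a TTS engine might report it (times in seconds).
    timeline = [Phoneme("sil", 0.0, 0.1), Phoneme("a", 0.1, 0.2),
                Phoneme("m", 0.3, 0.1), Phoneme("o", 0.4, 0.2)]
    for t in (0.05, 0.15, 0.25, 0.35, 0.5):
        print(f"t={t:.2f}s  mouth opening={mouth_opening(timeline, t):.2f}")
```

In practice, the scalar opening value would drive the displacement of the mouth vertices of the face mesh; querying the audio server for the currently playing phoneme at each frame is what keeps the deformation aligned with the audio samples.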

https://publications.cnr.it/doc/208119