Multimodal Communication Analysis

From TEILab

Written by Dr. Francis Quek

Multimodal Analysis

The majority of our work in Computer Vision and Multimedia Systems falls under the purview of Multimodal Analysis, where human tracking and computational analysis provide instrumental access to multimodal communicative behavior. This research relates to the overall philosophy of embodied interaction in HCI in that human language and interaction are multimodal: gesture and speech are inseparable parts of a whole. In fact, it is this earlier and ongoing body of research in human multimodal communicative behavior that directed our attention to a more fundamental understanding of human embodiment.

The fundamental insight that drives our instrumental analysis of multimodal communicative behavior is that we are embodied beings. When we speak, our heads, eyes, bodies, arms, hands, and faces are brought into the service of communication. A common thread that flows through modern gesture research is that spontaneous gesture and speech are inseparable parts of the same whole. While gestures are brought into the service of communication, this is not their sole purpose. In fact, gestures are performed not so much for the hearer as for the speaker (this is why we gesture while on the phone). Gesture reveals how we use the resources of the body-space to organize our thoughts, keep context, index our ideas, and situate and shape the mental imagery out of which our talk flows. Our capacity for spatial memory, situated attention, and motor activity fuels these embodied resources. Our approach to understanding multimodality in language is illustrated in Figure 1: mental imagery and spatial structuring participate in the pulse-by-pulse conceptual construction of discourse (language at the super-segmental level of units of ideas rather than specific syntactic units). This imagery and spatial structuring present themselves in body behavior that is temporally situated with speech at the micro-level (gesture-speech synchrony reveals this tight causal relationship). The units of cohesion are the specific imagistic features that carry the unit of thought. We have demonstrated segmentation of speech into idea units using the features of hand use, oscillatory action, hand motion symmetries, and spatial loci. Hence the ‘multi’ part of multimodal language is entirely in the eye of the beholder: for the speaker, the genesis is single, whereas from the perspective of the viewer or sensing technology, there appear to be several signals (audio, visual, and tactile channels).

Several ongoing and recently concluded projects lie along this research trajectory. These include:

  • Agent-based human tracking, where parts of the human body are modeled using active agents. These agents forage in the video feature space to find candidates for the objects being tracked (each candidate is associated with an agent). These autonomous software agents are able to form coalitions that, as a group, represent a consistent set of body parts that constitute the human being tracked.
  • The KABAAM project, which investigates the learning of structural temporal event models that describe meaningful human behavior. We employ a genetic programming learning model that reasons about temporal events. Our approach supports joint discovery between human experts in behavioral studies and the computational process.
  • MacVisSTA, a multimedia interaction and visualization system that enables analysis of multimodal communicative behavior across multiple video and audio channels.
  • Vision GPU, where we employ CUDA-based GPU computation to implement our Vector Coherence Mapping (VCM) algorithm. We were able to implement a real-time version of the VCM algorithm, which presented a set of significant challenges in mapping the algorithm to the GPU computing architecture.
  • Meeting Analysis, where we investigated multimodal behavioral cues to understand real discourse among meeting participants.
  • Mirror Track, where we investigate the tracking of fingers hovering over and touching a large display device. To enable this tracking, we use the reflective property of display screens when the view angle is at a low azimuth (exceeding the critical angle for refraction).
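
The Mirror Track item relies on display screens becoming mirror-like when viewed at a shallow angle. As a rough illustration only, and not the project's own method, the sketch below evaluates the standard Fresnel equations for unpolarized light at a flat air-to-glass interface. The function name `fresnel_reflectance` and the refractive index of 1.5 for the screen surface are assumptions made for this example; real display surfaces (coatings, polarizers) will behave differently.

```python
import math


def fresnel_reflectance(theta_i_deg, n1=1.0, n2=1.5):
    """Unpolarized Fresnel reflectance at a flat interface from medium n1 into n2.

    theta_i_deg is the angle of incidence measured from the surface normal,
    so a camera looking almost along the screen corresponds to values near 90.
    n2 = 1.5 is an assumed, glass-like refractive index (hypothetical value).
    """
    theta_i = math.radians(theta_i_deg)
    sin_t = n1 / n2 * math.sin(theta_i)  # Snell's law
    if sin_t >= 1.0:
        # Total internal reflection; only reachable when n1 > n2.
        return 1.0
    cos_i = math.cos(theta_i)
    cos_t = math.sqrt(1.0 - sin_t * sin_t)
    # Reflectance for s- and p-polarized light, then their average.
    r_s = ((n1 * cos_i - n2 * cos_t) / (n1 * cos_i + n2 * cos_t)) ** 2
    r_p = ((n1 * cos_t - n2 * cos_i) / (n1 * cos_t + n2 * cos_i)) ** 2
    return 0.5 * (r_s + r_p)


# Reflectance grows sharply as the view direction approaches the screen plane.
for angle in (0, 45, 70, 85, 89):
    print(f"{angle:2d} deg from normal: R = {fresnel_reflectance(angle):.3f}")
```

Under these assumptions the surface reflects only a few percent of the light when viewed head-on but most of it at grazing angles, which is consistent with placing a camera nearly in the plane of the display so that a fingertip and its reflection are both visible.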