|
-
|
Bohus, D., Andrist, S., Bao, Y., Horvitz, E., Paradiso, A., (2024) -
"Is This It?": Towards Ecologically Valid Benchmarks for Situated Collaboration, to appear in Proceedings of International Conference on Multimodal Interaction (ICMI Companion '24), November 4--8, 2024, San Jose, Costa Rica. [abs]
|
|
|
We report initial work towards constructing ecologically valid
benchmarks to assess the capabilities of large multimodal models
for engaging in situated collaboration. In contrast to existing
benchmarks, in which question-answer pairs are generated post
hoc over preexisting or synthetic datasets via templates, human
annotators, or large language models (LLMs), we propose and investigate
an interactive system-driven approach, where the questions
are generated by users in context, during their interactions with
an end-to-end situated AI system. We illustrate how the questions
that arise are different in form and content from questions typically
found in existing embodied question answering (EQA) benchmarks
and discuss new real-world challenge problems brought to the fore.
|
|
|
|
-
|
Stiber, M., Bohus, D., Andrist, S., (2024) -
"Uh, This One?": Leveraging Behavioral Signals for Detecting Confusion During Physical Tasks, to appear in Proceedings of International Conference on Multimodal Interaction, 2024, San Jose, Costa Rica. [abs]
|
|
|
A longstanding goal in the AI and HCI research communities is
building intelligent assistants to help people with physical tasks. To
be effective in this, AI assistants must be aware of not only the physical
environment, but also the human user and their cognitive states.
In this paper, we specifically consider the detection of confusion,
which we operationalize as the moments when a user is “stuck” and
needs assistance. We explore how behavioral features such as gaze,
head pose, and hand movements differ between periods of confusion
vs. non-confusion. We present various modeling approaches
for detecting confusion that combine behavioral features, length of
time, instructional text embeddings, and egocentric video. Although
deep networks (e.g., V-JEPA) trained on full video streams perform
well in distinguishing confusion from non-confusion, simpler models
leveraging lighter-weight behavioral features exhibit similarly
high performance, even when generalizing to unseen tasks.
|
|
|
|
-
|
Bohus, D., Andrist, S., Saw, N., Paradiso, A., Chakraborty, I., Rad, M., (2024) -
SIGMA: An Open-Source Interactive System for Mixed-Reality Task Assistance Research -- Extended Abstract, in Proceedings of 2024 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW), Orlando, FL. [abs]
[github]
[blog post]
|
|
|
We introduce an open-source system called Sigma (short for “Situated Interactive Guidance, Monitoring, and Assistance”) as a
platform for conducting research on task-assistive agents in mixed-reality scenarios. The system leverages the sensing and
rendering affordances of a head-mounted mixed reality device in conjunction with large language and vision models to guide
users step by step through procedural tasks. By open-sourcing the system, we aim to lower the barrier to entry, accelerate
research in this space, and chart a path towards community-driven end-to-end evaluation of large language, vision, and
multimodal models in the context of real-world interactive applications.
|
|
|
|
-
|
Bohus, D., Andrist, S., Saw, N., Paradiso, A., Chakraborty, I., Rad, M., (2024) -
SIGMA: An Open-Source Interactive System for Mixed-Reality Task Assistance Research, Technical Report [abs]
[github]
[blog post]
|
|
|
We introduce an open-source system called SIGMA (short for "Situated Interactive
Guidance, Monitoring, and Assistance") as a platform for conducting research on task-assistive
agents in mixed-reality scenarios. The system leverages the sensing and rendering affordances of
a head-mounted mixed-reality device in conjunction with large language and vision models to guide
users step by step through procedural tasks. We present the system's core capabilities, discuss
its overall design and implementation, and outline directions for future research enabled by the
system. SIGMA is easily extensible and provides a useful basis for future research at the intersection
of mixed reality and AI. By open-sourcing an end-to-end implementation, we aim to lower the barrier to
entry, accelerate research in this space, and chart a path towards community-driven end-to-end evaluation
of large language, vision, and multimodal models in the context of real-world interactive applications.
|
|
|
|
|
-
|
Wang, X., Kwon, T., Rad, M., Pan, B., Chakraborty, I., Andrist, S., Bohus, D., Feniello, A., Tekin, B., Frujeri, F.V., Joshi, N., Pollefeys, M., (2023) -
HoloAssist: an Egocentric Human Interaction Dataset for Interactive AI Assistants in the Real World, in Proceedings of ICCV'2023, Paris, France. [abs]
[dataset website]
[blog post]
|
|
|
Building an interactive AI assistant that can perceive, reason,
and collaborate with humans in the real world has been a long-standing pursuit in the AI community.
This work is part of a broader research effort to develop intelligent agents that can interactively guide humans through
performing tasks in the physical world. As a first step in this direction, we introduce HoloAssist, a large-scale egocentric
human interaction dataset, where two people collaboratively complete physical manipulation tasks. The task performer
executes the task while wearing a mixed-reality headset that captures seven synchronized data streams. The task
instructor watches the performer’s egocentric video in real time and guides them verbally. By augmenting the data with action
and conversational annotations and observing the rich behaviors of various participants, we present key insights into
how human assistants correct mistakes, intervene in the task completion procedure, and ground their instructions to the
environment. HoloAssist spans 166 hours of data captured by 350 unique instructor-performer pairs. Furthermore, we
construct and present benchmarks on mistake detection, intervention type prediction, and hand forecasting, along with
detailed analysis. We expect HoloAssist will provide an important resource for building AI assistants that can fluidly
collaborate with humans in the real world. Data can be downloaded at https://holoassist.github.io/.
|
|
|
|
|
-
|
Bohus, D., Andrist, S., Feniello, A., Saw, N., Horvitz, E., (2022) -
Continual Learning about Objects in the Wild: An Interactive Approach, in Proceedings of ICMI'2022, Bengaluru (Bangalore), India. [abs]
|
|
|
We introduce a mixed-reality, interactive approach for continually learning to recognize an open-ended set of objects in a user’s surrounding environment. The proposed approach leverages the multimodal sensing, interaction, and rendering affordances of a mixed-reality headset, and enables users to label nearby objects via speech, gaze, and gestures. Image views of each labeled object are automatically captured from varying viewpoints over time, as the user goes about their everyday tasks. The labels provided by the user can be propagated forward and backwards in time and paired with the collected views to update an object recognition model, in order to continually adapt it to the user’s specific objects and environment. We review key challenges for the proposed interactive continual learning approach, present details of an end-to-end system implementation, and report on results and lessons learned from an initial, exploratory case study using the system.
|
|
|
|
-
|
Andrist, S., Bohus, D., Feniello, A., Saw, N., (2022) -
Developing Mixed Reality Applications with Platform for Situated Intelligence, 2022 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW), pp. 48-50, doi: 10.1109/VRW55335.2022.00018.
|
|
|
|
-
|
Bohus, D., Andrist, S., Feniello, A., Saw, N., Jalobeanu, M., Sweeney, P., Thompson, A.L., and Horvitz, E., (2021) -
Platform for Situated Intelligence, Microsoft Research Technical Report MSR-TR-2021-2, March, 2021.
|
|
|
|
|
|
-
|
Zhi Tan, X., Andrist, S., Bohus, D., and Horvitz, E., (2020) -
Now, Over Here: Leveraging Extended Attentional Capabilities in Human-Robot Interaction, late breaking report, in Companion of the 2020 ACM/IEEE International Conference on Human-Robot Interaction, Cambridge, UK.
|
|
|
|
|
|
|
|
|
-
|
Andrist, S., Bohus, D., Kamar, E., and Horvitz, E., (2017) -
What Went Wrong and Why? Diagnosing Situated Interaction Failures in the Wild
, in Proceedings of ICSR'2017, Tsukuba, Japan.
[abs]
|
|
|
Effective situated interaction hinges on the well-coordinated operation of a set of competencies,
including computer vision, speech recognition, and natural language, as well as higher-level inferences
about turn taking and engagement. Systems often rely on a set of hand-coded and machine-learned components
organized into several sensing and decision-making pipelines. Given their complexity and inter-dependencies, developing and debugging
such systems can be challenging. "In-the-wild" deployments outside of controlled lab conditions bring further
challenges due to unanticipated phenomena, including unexpected interactions such as playful engagements. We present
a methodology for assessing performance, identifying problems, and diagnosing the root causes and influences of
different types of failures on the overall performance of a situated interaction system functioning in the wild.
We apply the methodology to a dataset of interactions collected with a robot deployed in a public space inside an
office building. The analyses identify and characterize multiple types of failures, their causes, and their relationship
to overall performance. We employ models that predict overall interaction quality from various combinations of failures.
Finally, we discuss lessons learned with such a diagnostic methodology for improving situated systems deployed in the wild.
|
|
|
|
-
|
Bohus, D., Andrist, S., and Jalobeanu, M., (2017) -
Rapid Development of Multimodal Interactive Systems: A Demonstration of Platform for Situated Intelligence
, in Proceedings of ICMI'2017, Glasgow, Scotland.
[abs] | ICMI'17 best demonstration award
|
|
|
We demonstrate an open, extensible platform for developing and studying multimodal, integrative-AI systems. The
platform provides a time-aware, stream-based programming model for parallel coordinated computation, a set of
tools for data visualization, processing, and learning, and an ecosystem of pluggable AI components. The
demonstration will showcase three applications built on this platform and highlight how the platform can
significantly accelerate development and research in multimodal interactive systems.
|
|
|
|
-
|
Bohus, D., Andrist, S., and Horvitz, E., (2017) -
A Study in Scene Shaping: Adjusting F-formations in the Wild
, in AAAI Fall Symposium 2017, Arlington, VA
[abs]
|
|
|
We study the automated shaping of F-formations in the proximity of a stationary robot that has been deployed
to provide directions within a building. We introduce the notion of active scene shaping where suboptimal spatial
configurations are detected, and desired shifts in the locations of participants and bystanders are communicated
with natural language and gestures. We conduct an initial in-the-wild study with the proposed methods, and we
report results, lessons learned, and future directions of research.
|
|
|
|
|
-
|
Andrist, S., Bohus, D., Mutlu, B., and Schlangen, D., (2016) -
Turn-Taking and Coordination in Human-Machine Interaction
, in AI Magazine, Winter Issue, vol. 37, no. 4
[abs]
|
|
|
This issue of AI Magazine brings together a collection of articles on challenges, mechanisms, and research progress in turn-taking and coordination between humans and machines. The contributing authors work in interrelated fields of spoken dialog systems, intelligent virtual agents, human-computer interaction, human-robot interaction, and semiautonomous collaborative systems and explore core concepts in coordinating speech and actions with virtual agents, robots, and other autonomous systems. Several of the contributors participated in the AAAI Spring Symposium on Turn-Taking and Coordination in Human-Machine Interaction, held in March 2015, and several articles in this issue are extensions of work presented at that symposium. The articles in the collection address key modeling, methodological, and computational challenges in achieving effective coordination with machines, propose solutions that overcome these challenges under sensory, cognitive, and resource restrictions, and illustrate how such solutions can facilitate coordination across diverse and challenging domains. The contributions highlight turn-taking and coordination in human-machine interaction as an emerging and evolving research area with important implications for future applications of AI.
|
|
|
|
|
-
|
Yu, Z., Bohus, D., and Horvitz, E., (2015) -
Incremental Coordination: Attention-Centric Speech Production in a Physically Situated Conversational Agent
, in SigDIAL'2015,
Prague, Czech Republic [abs]
|
|
|
Inspired by studies of human-human conversations, we present methods for incrementally coordinating speech production with listeners' visual foci of attention. We introduce a model that considers the demands and availability of listeners' attention at the onset and throughout the production of system utterances, and that incrementally coordinates speech synthesis with the listener's gaze. We present an implementation and deployment of the model in a physically situated dialog system and discuss lessons learned.
|
|
|
|
-
|
Andrist, S., Bohus, D., Yu, Z., and Horvitz, E., (2015) -
Are You Messing with Me? Querying about the Sincerity of Interactions in the Open World
, late breaking report, in HRI'2015,
Christchurch, New Zealand [abs]
|
|
|
When interacting with robots deployed in the open world, people may often attempt to engage with them in a playful manner or test their competencies. Such engagements are often associated with language and behaviors that fall outside of designed task capabilities and can lead to interaction failures. Detecting when users are driven by play and curiosity can help a robot to understand why some interactions are breaking down, respond more appropriately by conveying its capabilities to its users, and enhance perceptions of its situational awareness and social intelligence. We have been studying the intentions of everyday users in their engagement with a long-lived robot system that provides directions within an office building. We report on a pilot field-study exploring the use of direct queries to elicit the sincerity of user requests, in terms of their actual need for directions. We discuss early results from this initial study and frame research directions and design implications for robots deployed in the wild.
|
|
|
|
|
-
|
Bohus, D., Horvitz, E., (2014) - Managing Human-Robot Engagement with Forecasts and ... um ... Hesitations, in Proceedings of ICMI'2014,
Istanbul, Turkey [abs]
|
|
|
We explore methods for managing conversational engagement in
open-world, physically situated dialog systems. We investigate a
self-supervised methodology for constructing forecasting models
that aim to anticipate when participants are about to terminate their
interactions with a situated system. We study how these models can
be leveraged to guide a disengagement policy that uses linguistic
hesitation actions, such as filled and non-filled pauses, when
uncertainty about the continuation of engagement arises. The
hesitations allow for additional time for sensing and inference, and
convey the system’s uncertainty. We report results from a study of
the proposed approach.
|
|
|
|
-
|
Pejsa, T., Bohus, D., Cohen, M., Saw, C.W., Mahoney, J., Horvitz, E. (2014) -
Natural Communication about Uncertainties in Situated Interaction, in ICMI'2014,
Istanbul, Turkey [abs] [supplemental video]
|
|
|
Physically situated, multimodal interactive systems must often grapple with uncertainties about properties of the world, people, and their intentions and actions. We present methods for estimating and communicating about different uncertainties in situated interaction, leveraging the affordances of an embodied conversational agent. The approach harnesses a representation that captures both the magnitude and the sources of uncertainty, and a set of policies that select and coordinate the production of nonverbal and verbal behaviors to communicate the system’s uncertainties to conversational participants. The methods are designed to enlist participants’ help in a natural manner to resolve uncertainties arising during interactions. We report on a preliminary implementation of the proposed methods in a deployed system and illustrate the functionality with a trace from a sample interaction.
|
|
|
|
-
|
Mitchell, M., Bohus, D., Kamar, E., (2014) -
Crowdsourcing Language Generation Templates for Dialogue Systems
, in INLG'2014,
Philadelphia, PA, USA [abs]
|
|
|
We explore the use of crowdsourcing to
generate natural language in spoken dialogue
systems. We introduce a methodology
to elicit novel templates from the
crowd based on a dialogue seed corpus,
and investigate the effect that the amount
of surrounding dialogue context has on the
generation task. Evaluation is performed
both with a crowd and with a system developer
to assess the naturalness and suitability
of the elicited phrases. Results indicate
that the crowd is able to provide reasonable
and diverse templates within this
methodology. More work is necessary before
elicited templates can be automatically
plugged into the system.
|
|
|
|
-
|
Bohus, D., Saw, C.W., Horvitz, E., (2014) -
Directions Robot: In-the-Wild Experiences and Lessons Learned
, in AAMAS'2014,
Paris, France [abs]
|
|
|
We introduce Directions Robot, a system we have fielded for
studying open-world human-robot interaction. The system brings
together models for situated spoken language interaction with
directions-generation and a gesturing humanoid robot. We describe
the perceptual, interaction, and output generation competencies of
this system. We then discuss experiences and lessons drawn from
data collected in an initial in-the-wild deployment, and highlight
several challenges with managing engagement, providing
directions, and handling out-of-domain queries that arise in open-world,
multiparty settings.
|
|
|
|
|
-
|
Rosenthal, S., Bohus, D., Kamar, E., Horvitz, E., (2013) -
Look versus Leap: Computing Value of Information with High-Dimensional Streaming Evidence
, in IJCAI'2013,
Beijing, China [abs]
|
|
|
A key decision facing autonomous systems with access
to streams of sensory data is whether to act
based on current evidence or to wait for additional
information that might enhance the utility of taking
an action. Computing the value of information
is particularly difficult with streaming high-dimensional
sensory evidence. We describe a belief
projection approach to reasoning about information
value in these settings, using models for inferring
future beliefs over states given streaming evidence.
These belief projection models can be learned from
data or constructed via direct assessment of parameters
and they fit naturally in modular, hierarchical
state inference architectures. We describe principles
of using belief projection and present results
drawn from an implementation of the methodology
within a conversational system.
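
As a rough, hypothetical illustration of the look-versus-leap trade-off described above (not code from the paper), the sketch below compares the expected utility of acting on the current belief against the projected utility of acting after waiting for more evidence; the utility matrix, the projection model project_belief, and the waiting cost are all assumed placeholders.

```python
import numpy as np

def expected_utility(belief, utilities):
    # utilities[action, state]: utility of taking `action` when the true state is `state`;
    # belief: probability vector over states; act greedily under the current belief
    return float(np.max(utilities @ belief))

def value_of_waiting(belief, utilities, project_belief, wait_cost):
    """Positive when the projected gain from additional evidence outweighs the cost of waiting."""
    act_now = expected_utility(belief, utilities)
    # project_belief(belief) -> [(probability, projected_future_belief), ...]
    # stands in for a learned belief projection model over streaming evidence
    act_later = sum(p * expected_utility(b, utilities) for p, b in project_belief(belief))
    return act_later - wait_cost - act_now
```

Under this toy formulation, a system would wait ("look") when the returned value is positive and act ("leap") otherwise.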
|
|
|
|
-
|
Rosenthal, S., Skaff, S., Veloso, M., Bohus, D., Horvitz, E., (2013) -
Execution Memory for Grounding and Coordination
, in HRI'2013,
Tokyo, Japan [abs]
|
|
|
As robots are introduced into human environments
for long periods of time, human owners and collaborators will
expect them to remember shared events that occur during execution.
Beyond naturalness of having memories about recent and
longer-term engagements with people, such execution memories
can be important in tasks that persist over time by allowing
robots to ground their dialog and to refer efficiently to previous
events. In this work, we define execution memory as the capability
of saving interaction event information and recalling it for later
use. We divide the problem into four parts: salience filtering of
sensor evidence and saving to short term memory, archiving from
short to long term memory and caching from long to short term
memory, and recalling memories for use in state inference and
policy execution. We then provide examples of how execution
memory can be used to enhance user experience with robots.
|
|
|
|
-
|
Metallinou, A., Bohus, D., Williams, J.D., (2013) -
Discriminative state tracking for spoken dialog systems
, in ACL'2013,
Sofia, Bulgaria [abs]
|
|
|
In spoken dialog systems, statistical state
tracking aims to improve robustness to speech
recognition errors by tracking a posterior distribution
over hidden dialog states. Current
approaches based on generative or discriminative
models have different but important shortcomings
that limit their accuracy. In this paper
we discuss these limitations and introduce
a new approach for discriminative state tracking
that overcomes them by leveraging the
problem structure. An offline evaluation with
dialog data collected from real users shows
improvements in both state tracking accuracy
and the quality of the posterior probabilities.
Features that encode speech recognition error
patterns are particularly helpful, and training
requires relatively few dialogs.
|
|
|
|
-
|
Lasecki, W.S., Kamar, E., Bohus, D. (2013) -
Conversations in the Crowd: Collecting Data for Task-Oriented Dialog Learning
, in HCOMP'2013,
Palm Springs, CA, USA [abs]
|
|
|
A major challenge in developing dialog systems
is obtaining realistic data to train the systems
for specific domains. We study the opportunity
for using crowdsourcing methods to collect dialog
datasets. Specifically, we introduce ChatCollect, a
system that allows researchers to collect conversations focused around definable tasks from pairs of
workers in the crowd. We demonstrate that varied and in-depth dialogs can be collected using
this system, then discuss ongoing work on creating a crowd-powered system for parsing semantic
frames. We then discuss research opportunities in
using this approach to train and improve automated dialog systems in the future.
|
|
|
|
-
|
Loomis-Thompson, A., Bohus, D. (2013) -
A Framework for Multimodal Data Collection, Visualization, Annotation and Learning
, in ICMI'2013,
Sydney, Australia [abs]
|
|
|
The development and iterative refinement of inference models for multimodal systems can be challenging and time intensive. We present a framework for multimodal data collection, visualization, annotation, and learning that enables system developers to build models using various machine learning techniques, and quickly iterate through cycles of development, deployment and refinement.
|
|
|
|
|
-
|
Wang, W.Y., Bohus, D., Kamar, E., Horvitz, E. (2012) -
Crowdsourcing the Acquisition
of Natural Language Corpora: Methods and Observations
, in SLT'2012,
Miami, USA [abs]
|
|
|
We study the opportunity for using crowdsourcing methods to acquire language corpora for use in natural language processing systems. Specifically, we empirically investigate three methods for eliciting natural language sentences that correspond to a given semantic form. The methods convey frame semantics to crowd workers by means of sentences, scenarios, and list-based descriptions. We discuss various performance measures of the crowdsourcing process, and analyze the semantic correctness, naturalness, and biases of the collected language. We highlight research challenges and directions in applying these methods to acquire corpora for natural language processing applications.
|
|
|
|
-
|
Vinyals, O., Bohus, D., Caruana, R. (2012) -
Learning Speaker, Addressee and
Overlap Detection Models from Multimodal Streams
, in ICMI'2012,
Santa Monica, USA [abs]
|
|
|
A key challenge in developing conversational systems is fusing streams of information
provided by different sensors to make inferences about the behaviors and goals of
people. Such systems can leverage visual and audio information collected through
cameras and microphone arrays, including the location of various people, their focus
of attention, body pose, the sound source direction, prosody, and speech recognition
results. In this paper, we explore discriminative learning techniques for making
accurate inferences on the problems of speaker, addressee and overlap detection
in multiparty human-computer dialog. The focus is on finding ways to leverage within-
and across-signal temporal patterns and to construct representations from the raw
streams in an automated manner that are informative for the inference problem. We
present a novel extension to traditional decision trees which allows them to incorporate
and model temporal signals. We contrast these methods with more traditional approaches
where a human expert manually engineers relevant temporal features. The proposed
approach performs well even with relatively small amounts of training data, which
is of practical importance as designing features that are task dependent is time
consuming and not always possible.
|
|
|
|
-
|
Bohus, D., Kamar, E., Horvitz, E. (2012) -
Towards Situated Collaboration
, in NAACL Workshop on Future Directions
and Challenges in Spoken Dialog Systems: Tools and Data [abs]
|
|
|
We outline a set of key challenges for dialog management in physically situated
interactive systems, and propose a core shift in perspective that places spoken
dialog in the context of the larger collaborative challenge of managing parallel,
coordinated actions in the open world.
|
|
|
|
|
|
- |
Bohus, D., Horvitz, E. (2011) - Decisions about Turns in Multiparty Conversation: From Perception to Action, in ICMI-2011, Alicante, Spain [abs]
|
|
|
We present a decision-theoretic approach for guiding turn taking in a spoken dialog system operating in multiparty settings. The proposed methodology couples inferences about multiparty conversational dynamics with assessed costs of different outcomes, to guide turn-taking decisions. Beyond considering uncertainties about outcomes arising from evidential reasoning about the state of a conversation, we endow the system with awareness and methods for handling uncertainties stemming from computational delays in its own perception and production. We illustrate via sample cases how the proposed approach makes decisions, and we investigate the behaviors of the proposed methods via a retrospective analysis on logs collected in a multiparty interaction study.
|
|
|
|
- |
Bohus, D., Horvitz, E. (2011) - Multiparty Turn Taking in Situated Dialog: Study, Lessons, and Directions, in SIGdial-2011, Portland, OR [abs] [Supplemental materials and videos]
|
|
|
We report on an empirical study of a multiparty turn taking model for physically situated spoken dialog systems. We discuss subjective and objective performance measures that show how the model, supported with a basic set of sensory competencies and turn-taking policies, can enable interactions with multiple participants in a collaborative task setting. The analysis we conduct brings to the fore several phenomena and frames challenges for managing multiparty turn taking in physically situated interaction.
|
|
|
|
|
|
|
- |
Bohus, D., Horvitz, E. (2010) - On the Challenges and Opportunities of Physically Situated Dialog, in AAAI Fall Symposium on Dialog with Robots, Arlington, VA [abs]
|
|
|
We outline several challenges and opportunities for building physically situated systems that can interact in open, dynamic, and relatively unconstrained environments. We review a platform and recent progress on developing computational methods for situated, multiparty, open-world dialog, and highlight the value of representations of the physical surroundings and of harnessing the broader situational context when managing communicative processes such as engagement, turn-taking, language understanding, and dialog management. Finally, we outline an open-world learning challenge that spans these different levels.
|
|
|
|
- |
Bohus, D., Horvitz, E. (2010) - Facilitating Multiparty Dialog with Gaze, Gesture and Speech, in ICMI'10, Beijing, China [abs] [Supplemental materials and videos]
|
|
|
We study how synchronized gaze, gesture and speech rendered by an embodied conversational agent can influence the flow of conversations in multiparty settings. We review a computational framework for turn taking that provides the foundation for tracking and communicating intentions to hold, release, or take control of the conversational floor. We then present details of the implementation of the approach in an embodied conversational agent and describe experiments with the system in a shared task setting. Finally, we discuss results showing how the verbal and non-verbal cues used by the avatar can shape the dynamics of multiparty conversation.
|
|
|
|
|
|
- |
Bohus, D., Horvitz, E., (2010) - Computational Models for Multiparty Turn-Taking, Microsoft Technical Report MSR-TR-2010-115 [abs] [Supplemental materials and videos]
|
|
|
We describe a computational framework for modeling and managing turn-taking in open-world spoken dialog systems. We present a representation and methodology for tracking the conversational dynamics in multiparty interactions, making floor control decisions, and rendering these decisions into appropriate behaviors. We show how the approach enables an embodied conversational agent to participate in multiparty interactions, and to handle a diversity of natural turn-taking phenomena, including multiparty floor management, barge-ins, restarts, and continuations. Finally, we discuss results and lessons learned from experiments.
|
|
|
|
|
|
|
- |
Bohus, D., Horvitz, E. (2009) - Dialog in the Open World: Platform and Applications, in Proceedings of ICMI'09, Boston, MA [abs] | ICMI'09 outstanding paper award | ICMI'19 Ten-Year Technical Impact Award Runner-up
|
|
|
We review key challenges of developing spoken dialog systems that can engage in interaction with one or multiple participants in open, relatively unconstrained environments. We outline a set of core competencies for open-world dialog, and we describe three prototype systems in this space. The systems harness a common underlying conversational framework which integrates an array of predictive models and component technologies, including speech recognition, head and pose tracking, probabilistic models for scene analysis, multiparty engagement and turn taking, and inferences about user long-term goals and activities. We discuss the current models and showcase their function by means of a sample recorded interaction, and we review results from an observational study of open-world, multiparty dialog in the wild.
|
|
|
|
- |
Bohus, D., Horvitz, E. (2009) - Learning to Predict Engagement with a Spoken Dialog System in Open-World Settings, in Proceedings of SIGdial'09, London, UK [abs] [note]
|
|
|
We describe a machine learning approach that allows an open-world spoken dialog system to learn to predict engagement intentions in situ, from interaction. The proposed approach does not require any developer supervision, and leverages spatiotemporal and attentional features automatically extracted from a visual analysis of people coming into the proximity of the system to produce models that are attuned to the characteristics of the environment the system is placed in. Experimental results indicate that a system using the proposed approach can learn to recognize engagement intentions at low false positive rates (e.g. 2-4%) up to 3-4 seconds prior to the actual moment of engagement.
|
|
|
|
Subsequent experiments with the machine learning infrastructure used in this work have revealed a small defect in the model construction and evaluation. The maximum entropy model was trained in a stepwise fashion, where at each step the next best feature was added to the model; stopping was based on a BIC criterion. During this stepwise model building process, the scoring of features was done by assessing performance on the entire dataset (including train + development folds), instead of exclusively on the train folds. Nevertheless, once a feature to be added to a model was selected, the model was trained exclusively on the training folds, i.e. the corresponding feature weight in the max-ent model was determined based only on the training data, and the evaluation was done on the held-out development fold. Subsequent experiments with a correct setup (where the feature scoring is done only by looking at the training folds) on several problems show that this bug does not significantly affect results. While with a correct setup the numbers reported might differ by small amounts, we believe the general results we have reported in this paper stand.
|
|
|
|
- |
Bohus, D., Horvitz, E. (2009) - Models for Multiparty Engagement in Open-World Dialog, in Proceedings of SIGdial'09, London, UK [abs] | SIGdial'09 best paper award
|
|
|
We present computational models that allow spoken dialog systems to handle multi-participant engagement in open, dynamic environments, where multiple people may enter and leave conversations, and interact with the system and with others in a natural manner. The models for managing the engagement process include components for (1) sensing the engagement state, actions and intentions of multiple agents in the scene, (2) making engagement decisions (i.e. whom to engage with, and when) and (3) rendering these decisions in a set of coordinated low-level behaviors in an embodied conversational agent. We review results from a study of interactions "in the wild" with a system that implements such a model.
|
|
|
|
- |
Bohus, D., Horvitz, E. (2009) - Open-World Dialog: Challenges, Directions, and Prototype, in Proceedings of IJCAI'2009 Workshop on Knowledge and Reasoning in Practical Dialogue Systems, Pasadena, CA [abs] [video]
|
|
|
We present an investigation of open-world dialog, centering on systems that can perform conversational dialog in an open-world context, where multiple people with different needs, goals, and long-term plans may enter, interact, and leave an environment. We outline and discuss a set of challenges and core competencies required for supporting the kind of fluid multiparty interaction that people expect when conversing and collaborating with other people. Then, we focus as a concrete example on the challenges faced by receptionists who field requests at the entries to corporate buildings. We review the subtleties and difficulties of creating an automated receptionist that can work with people on solving their needs with the ease and etiquette expected from a human receptionist. Finally, we review details of the construction and operation of a working prototype.
|
|
|
|
- |
Li, X., Nguyen, P., Zweig, G., Bohus, D. (2009) - Leveraging Multiple Query Logs to Improve Language Models for Spoken Query Recognition, in Proceedings of ICASSP'09, Taipei, Taiwan [abs]
|
|
|
A voice search system requires a speech interface that can correctly
recognize spoken queries uttered by users. The recognition performance
strongly relies on a robust language model. In this work, we
present the use of multiple data sources, with the focus on query logs,
in improving ASR language models for a voice search application.
Our contributions are threefold: (1) the use of text queries from
web search and mobile search in language modeling; (2) the use of
web click data to predict query forms from business listing forms;
and (3) the use of voice query logs in creating a positive feedback
loop. Experiments show that by leveraging these resources, we can
achieve recognition performance comparable to, or even better than,
that of a previously deployed system where a large amount of spoken
query transcripts are used in language modeling.
|
|
|
|
|
- |
Bohus, D., Zweig, G., Nguyen, P., Li, X. (2008) - Joint N-Best Rescoring for Repeated Utterances in Spoken Dialog Systems, in Proceedings of SLT'08, Goa, India [abs]
|
|
|
Due to speech recognition errors, repetitions are a frequent phenomenon in spoken dialog systems. In previous work we have proposed a joint decoding model that can leverage structural relationships between repeated utterances for improving recognition performance. In this paper we extend this work in two directions. First, we propose a direct, classification-based model for the same task. The new model can leverage features that were fundamentally hard to capture in the previous framework (e.g. spellings, false-starts, etc.) and leads to an additional performance improvement. Second, we show how both models can be used to perform a combined rescoring of two n-best lists that are part of a repetition pair.
|
|
|
|
- |
Zweig, G., Bohus, D., Li, X., Nguyen, P. (2008) - Structured Models for Joint Decoding of Repeated Utterances, in Proceedings of InterSpeech'08, Brisbane, Australia [abs]
|
|
|
Due to speech recognition errors, repetition can be a frequent occurrence in voice-search applications. While a proper treatment of this phenomenon requires the joint modeling of two or more utterances simultaneously, currently deployed systems typically treat the utterances independently. In this paper, we analyze the structure of repetitions and find that in at least one commercial directory assistance application, repetitions follow simple structural transformations more than 70% of the time. We present preliminary results that suggest that significant gains are possible by explicitly modeling this structure in a joint decoding process.
|
|
|
|
- |
Bohus, D., Li, X., Nguyen, P., and Zweig, G. (2008) - Learning N-Best Correction Models from Implicit User Feedback in a Multi-modal Local Search Application, in Proceedings of SIGdial'08, Columbus, OH [abs]
|
|
|
We describe a novel n-best correction model that can leverage implicit user feedback (in the form of clicks) to improve performance in a multi-modal speech-search application. The proposed model works in two stages. First, the n-best list generated by the speech recognizer is expanded with additional candidates, based on confusability information captured via user click statistics. In the second stage, this expanded list is rescored and pruned to produce a more accurate and compact n-best list. Results indicate that the proposed n-best correction model leads to significant improvements over the existing baseline, as well as other traditional n-best rescoring approaches.
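
As a loose, hypothetical sketch of the two-stage correction described above (the click-derived confusability table and the rescoring function are placeholders, not the deployed model's components), the snippet expands an n-best list with click-confusable alternatives and then rescores and prunes it:

```python
def correct_nbest(nbest, click_confusions, rescore, k=5):
    """Expand, rescore, and prune an n-best list (illustrative only).

    nbest: list of (hypothesis, recognizer_score) pairs
    click_confusions: dict mapping a hypothesis to [(alternative, confusion_weight), ...],
        hypothetically estimated from user click statistics
    rescore: callable (hypothesis, score) -> new_score, a placeholder second-stage model
    """
    # Stage 1: add click-confusable alternatives to the candidate pool
    expanded = dict(nbest)
    for hyp, score in nbest:
        for alt, weight in click_confusions.get(hyp, []):
            expanded[alt] = max(expanded.get(alt, 0.0), score * weight)
    # Stage 2: rescore the expanded pool and keep the top k candidates
    rescored = [(hyp, rescore(hyp, score)) for hyp, score in expanded.items()]
    return sorted(rescored, key=lambda item: item[1], reverse=True)[:k]
```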
|
|
|
|
- |
Bohus, D., Rudnicky, A. (2008) - The RavenClaw dialog management framework: architecture and systems, in Computer Speech and Language, DOI:10.1016/j.csl.2008.10.001 [abs]
|
|
|
In this paper, we describe RavenClaw, a plan-based, task-independent dialog management framework. RavenClaw isolates the domain-specific aspects of the dialog control logic from domain-independent conversational skills, and in the process facilitates rapid development of mixed-initiative systems operating in complex, task-oriented domains. System developers can focus exclusively on describing the dialog task control logic, while a large number of domain-independent conversational skills such as error handling, timing and turn-taking are transparently supported and enforced by the RavenClaw dialog engine. To date, RavenClaw has been used to construct and deploy a large number of systems, spanning different domains and interaction styles, such as information access, guidance through procedures, command-and-control, medical diagnosis, etc. The framework has easily adapted to all of these domains, indicating a high degree of versatility and scalability.
|
|
|
|
|
- |
Bohus, D. (2007) - Error Awareness and Recovery in Conversational Spoken Language Interfaces, Ph.D. Dissertation, CS-07-124, Carnegie Mellon University, Pittsburgh, PA [abs] [note]
|
|
|
One of the most important and persistent problems in the development of conversational spoken language interfaces is their lack of robustness when confronted with understanding-errors. Most of these errors stem from limitations in current speech recognition technology, and, as a result, appear across all domains and interaction types. There are two approaches towards increased robustness: prevent the errors from happening, or recover from them through conversation, by interacting with the users.
In this dissertation we have engaged in a research program centered on the second approach. We argue that three capabilities are needed in order to seamlessly and efficiently recover from errors: (1) systems must be able to detect the errors, preferably as soon as they happen, (2) systems must be equipped with a rich repertoire of error recovery strategies that can be used to set the conversation back on track, and (3) systems must know how to choose optimally between different recovery strategies at run-time, i.e. they must have good error recovery policies. This work makes a number of contributions in each of these areas.
|
|
|
|
Subsequent experiments with the machine learning infrastructure that has been used in parts of this work have revealed a small defect in the procedure for logistic regression model construction and evaluation. In places where such models were constructed in a stepwise model building process (chapters 6 and 7), the scoring of features was done by assessing performance on the entire dataset (including train + development folds), instead of exclusively on the train folds. Nevertheless, once a feature to be added to a model was selected, the model was trained exclusively on the training folds, i.e. the corresponding feature weight in the max-ent model was determined based only on the training data, and the evaluation was done on the held-out development fold. Subsequent experiments with a correct setup (where the feature scoring is done only by looking at the training folds) on several problems show that this bug does not significantly affect results. While with a correct setup the numbers reported in cross-validation might differ by small amounts, we believe the results reported in this work stand.
|
|
|
|
- |
Bohus, D., and Rudnicky, A. (2007) - Implicitly-supervised learning in spoken language interfaces: an application to the confidence annotation problem, in Proceedings of SIGdial 2007, Antwerp, Belgium [abs] [note]
|
|
|
In this paper we propose the use of a novel learning paradigm in spoken language interfaces – implicitly-supervised learning. The central idea is to extract a supervision signal online, directly from the user, from certain patterns that occur naturally in the conversation. The approach eliminates the need for developer supervision and facilitates online learning and adaptation. As a first step towards better understanding its properties, advantages and limitations, we have applied the proposed approach to the problem of confidence annotation. Experimental results indicate that we can attain performance similar to that of a fully supervised model, without any manual labeling. In effect, the system learns from its own experiences with the users.
|
|
|
|
Subsequent experiments with the machine learning infrastructure used in this work have revealed a small defect in the model construction and evaluation. During the stepwise model building process, the scoring of features was done by assessing performance on the entire dataset (including train + development folds), instead of exclusively on the train folds. Nevertheless, once a feature to be added to a model was selected, the model was trained exclusively on the training folds, i.e. the corresponding feature weight in the max-ent model was determined based only on the training data, and the evaluation was done on the held-out development fold. Subsequent experiments with a correct setup (where the feature scoring is done only by looking at the training folds) on several problems show that this bug does not significantly affect results. While with a correct setup the numbers reported might differ by small amounts, we believe the general results we have reported in this paper stand.
|
|
|
|
- |
Ai, H., Raux, A., Bohus, D., Eskenazi, M., and Litman, D. (2007) - Comparing Spoken Dialog Corpora Collected with Recruited Subjects versus Real Users, in Proceedings of SIGdial 2007, Antwerp, Belgium [abs]
|
|
|
Empirical spoken dialog research often involves
the collection and analysis of a dialog
corpus. However, it is not well understood
whether and how a corpus of dialogs collected
using recruited subjects differs from
a corpus of dialogs obtained from real users.
In this paper we use Let’s Go Lab, a platform
for experimenting with a deployed spoken
dialog bus information system, to address
this question. Our first corpus is collected
by recruiting subjects to call Let’s Go
in a standard laboratory setting, while our
second corpus consists of calls from real
users calling Let’s Go during its operating
hours. We quantitatively characterize the
two collected corpora using previously proposed
measures from the spoken dialog literature,
then discuss the statistically significant
similarities and differences between the
two corpora with respect to these measures.
For example, we find that recruited subjects
talk more and speak faster, while real users
ask for more help and more frequently interrupt
the system. In contrast, we find no
difference with respect to dialog structure.
|
|
|
|
- |
Bohus, D., Raux, A., Harris, T., Eskenazi, M., and Rudnicky, A. (2007) - Olympus: an open-source framework for conversational spoken language interface research, in HLT-NAACL 2007 workshop on Bridging the Gap: Academic and Industrial Research in Dialog Technology, Rochester, NY [abs]
|
|
|
We introduce Olympus, a freely available framework for research in conversational interfaces. Olympus’ open, transparent, flexible, modular and scalable nature facilitates the development of large-scale, real-world systems, and enables research leading to technological and scientific advances in conversational spoken language interfaces. In this paper, we describe the overall architecture, several systems spanning different domains, and a number of current research efforts supported by Olympus.
|
|
|
|
- |
Bohus, D., Grau, S., Huggins-Daines, D., Keri, V., Krishna, G., Kumar, R., Raux, A., and Tomko, S. (2007) - Conquest - an Open-Source Dialog System for Conferences, in Proceedings of HLT-NAACL 2007, Rochester, NY [abs]
|
|
|
We describe ConQuest, an open-source, reusable spoken dialog system that provides technical program information during conferences. The system uses a transparent, modular and open infrastructure, and aims to enable applied research in spoken language interfaces. The conference domain is a good platform for applied research since it permits periodical redeployments and evaluations with a real user-base. In this paper, we describe the system’s functionality and overall architecture, and discuss two initial deployments.
|
|
|
|
- |
Tetreault, J., and Bohus, D., (2007) - Estimating the Reliability of MDP Policies: a Confidence Interval Approach, in HLT-NAACL 2007, Rochester, NY [abs]
|
|
|
Data sparsity is one of the major issues that NLP researchers always wrestle with. That is, does one have enough data to make reliable conclusions in an experiment? Using Reinforcement Learning to improve a spoken dialogue system is
no exception. Past approaches in this area have simply assumed that there was enough collected data to derive reliable dialog control policies or used thousands of user simulations to overcome the sparsity issue. In this paper we present a methodology for numerically constructing confidence bounds on the expected reward for a constructed policy, and use these bounds to better estimate the reliability of that policy. We apply this methodology to a prior
experiment of using MDPs to predict the best features to include in a model of the dialogue state. Our results show that policies developed in the prior work were not as reliable as previously determined, but the overall ranking of features remains the same.
|
|
|
|
|
- |
Bohus, D., Langner, B., Raux, A., Black, A., Eskenazi, M., and Rudnicky, A. (2006) - Online Supervised Learning of Non-understanding Recovery Policies, in SLT-2006, Palm Beach, Aruba [abs]
|
|
|
Spoken dialog systems typically use a limited number of non-understanding
recovery strategies and simple heuristic policies to
engage them (e.g. first ask user to repeat, then give help, then
transfer to an operator). We propose a supervised, online method
for learning a non-understanding recovery policy over a large set
of recovery strategies. The approach consists of two steps: first, we
construct runtime estimates for the likelihood of success of each
recovery strategy, and then we use these estimates to construct a
policy. An experiment with a publicly available spoken dialog
system shows that the learned policy produced a 12.5% relative
improvement in the non-understanding recovery rate.
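
As a simplified, hypothetical sketch of the two-step approach summarized above (the paper learns per-situation success estimates from features; here simple smoothed counts stand in for them), the snippet below maintains running success estimates for each recovery strategy and greedily engages the most promising one, with occasional exploration so that the estimates keep improving online:

```python
import random
from collections import defaultdict

class RecoveryPolicy:
    """Online policy over non-understanding recovery strategies (illustrative only)."""

    def __init__(self, strategies):
        self.strategies = list(strategies)
        self.attempts = defaultdict(int)
        self.successes = defaultdict(int)

    def estimate(self, strategy):
        # Smoothed runtime estimate of the strategy's likelihood of success
        return (self.successes[strategy] + 1) / (self.attempts[strategy] + 2)

    def choose(self, explore=0.1):
        # Mostly engage the strategy with the highest estimated success likelihood,
        # but explore occasionally so all strategies keep receiving data
        if random.random() < explore:
            return random.choice(self.strategies)
        return max(self.strategies, key=self.estimate)

    def update(self, strategy, recovered):
        self.attempts[strategy] += 1
        self.successes[strategy] += int(recovered)
```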
|
|
|
|
- |
Bohus, D., and Rudnicky, A. (2006) - A K Hypotheses + Other Belief Updating Model, in AAAI Workshop on Statistical and Empirical Approaches to Spoken Dialogue Systems, 2006, Boston, MA [abs] [note]
|
|
|
Spoken dialog systems typically rely on recognition confidence
scores to guard against potential misunderstandings.
While confidence scores can provide an initial assessment
for the reliability of the information obtained from the user,
ideally systems should leverage information that is available
in subsequent user responses to update and improve the accuracy
of their beliefs. We present a machine-learning
based solution for this problem. We use a compressed representation
of beliefs that tracks up to k hypotheses for each
concept at any given time. We train a generalized linear
model to perform the updates. Experimental results show
that the proposed approach significantly outperforms heuristic
rules used for this task in current systems. Furthermore, a
user study with a mixed-initiative spoken dialog system
shows that the approach leads to significant gains in task
success and in the efficiency of the interaction, across a
wide range of recognition error-rates.
|
|
|
|
Subsequent experiments with the machine learning infrastructure used in this work have revealed a small defect in the model construction and evaluation. During the stepwise model building process, the scoring of features was done by assessing performance on the entire dataset (including train + development folds), instead of exclusively on the train folds. Nevertheless, once a feature to be added to a model was selected, the model was trained exclusively on the training folds, i.e. the corresponding feature weight in the max-ent model was determined based only on the training data, and the evaluation was done on the held-out development fold. Subsequent experiments with a correct setup (where the feature scoring is done only by looking at the training folds) on several problems show that this bug does not significantly affect results. While with a correct setup the numbers reported in cross-validation might differ by small amounts, we believe the general results we have reported in this paper stand.
|
|
|
|
- |
Raux, A., Bohus, D., Langner, B., Black, A., and Eskenazi, M. (2006) - Doing Research in a Deployed Spoken Dialog System: One Year of Let's Go! Public Experience, in Interspeech-2006, Pittsburgh, PA [abs]
|
|
|
This paper describes our work with Let’s Go, a telephone-based
bus schedule information system that has been in use by
the Pittsburgh population since March 2005. Results from
several studies show that while task success correlates
strongly with speech recognition accuracy, other aspects of
dialogue such as turn-taking, the set of error recovery strategies,
and the initiative style also significantly impact system
performance and user behavior.
|
|
|
|
|
- |
Bohus, D., and Rudnicky, A. (2005) -
Constructing Accurate Beliefs in Spoken Dialog Systems
, in ASRU-2005, San Juan, Puerto Rico [abs] [note]
|
|
|
We propose a novel approach for constructing more accurate
beliefs over concept values in spoken dialog systems by
integrating information across multiple turns in the conversation.
In particular, we focus our attention on updating the confidence
score of the top hypothesis for a concept, in light of subsequent
user responses to system confirmation actions. Our data-driven
approach bridges previous work in confidence annotation and
correction detection, providing a unified framework for belief
updating. The approach significantly outperforms heuristic rules
currently used in most spoken dialog systems.
|
|
|
|
Subsequent experiments with the machine learning infrastructure used in this work have revealed a small defect in the model construction and evaluation. During the stepwise model building process, the scoring of features was done by assessing performance on the entire dataset (including train + development folds), instead of exclusively on the train folds. Nevertheless, once a feature to be added to a model was selected, the model was trained exclusively on the training folds, i.e. the corresponding feature weight in the max-ent model was determined based only on the training data, and the evaluation was done on the held-out development fold. Subsequent experiments with a correct setup (where the feature scoring is done only by looking at the training folds) on several problems show that this bug does not significantly affect results. While with a correct setup the numbers reported in cross-validation might differ by small amounts, we believe the general results we have reported in this paper stand.
|
|
|
|
- |
Bohus, D., and Rudnicky, A. (2005) - Error Handling in the RavenClaw dialog management architecture, in HLT-EMNLP-2005, Vancouver, CA [abs]
|
|
|
We describe the error handling architecture
underlying the RavenClaw dialog
management framework. The architecture
provides a robust basis for current and future
research in error detection and recovery.
Several objectives were pursued in its
development: task-independence, ease-of-use,
adaptability and scalability. We describe
the key aspects of architectural design
which confer these properties, and
discuss the deployment of this architecture
in a number of spoken dialog systems
spanning several domains and interaction
types. Finally, we outline current research
projects supported by this architecture.
|
|
|
|
- |
Bohus, D., and Rudnicky, A. (2005) - Sorry, I Didn't Catch That! - An Investigation of Non-understanding Errors and Recovery Strategies, in SIGdial-2005, Lisbon, Portugal [abs] [sigdial book chapter]
|
|
|
We present results from an extensive empirical analysis of non-understanding
errors and ten non-understanding recovery strategies, based on a corpus of
dialogs collected with a spoken dialog system that handles conference room
reservations. More specifically, the issues we investigate are: what are the
main sources of non-understanding errors? What is the impact of these errors on
global performance? How do various strategies for recovery from non-
understandings compare to each other? What are the relationships between these
strategies and subsequent user response types, and which response types are more
likely to lead to successful recovery? Can dialog performance be improved by
using a smarter policy for engaging the non-understanding recovery strategies?
If so, can we learn such a policy from data? Whenever available, we compare and
contrast our results with other studies in the literature. Finally, we summarize
the lessons learned and present our plans for future work inspired by this
analysis.
|
|
|
|
- |
Bohus, D., and Rudnicky, A. (2005) - A Principled Approach for Rejection Threshold Optimization in Spoken Dialog Systems, in Interspeech-2005, Lisbon, Portugal [abs]
|
|
|
A common design pattern in spoken dialog systems is to reject
an input when the recognition confidence score falls below a
preset rejection threshold. However, this introduces a
potentially non-optimal tradeoff between various types of
errors such as misunderstandings and false rejections. In this
paper, we propose a data-driven method for determining the
relative costs of these errors, and then use these costs to
optimize state-specific rejection thresholds. We illustrate the
use of this approach with data from a spoken dialog system
that handles conference room reservations. The results
obtained confirm our intuitions about the costs of the errors,
and are consistent with anecdotal evidence gathered throughout
the use of the system.
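
As a simple, hypothetical illustration of the cost-based optimization described above (the relative costs and the labeled confidence scores are placeholders, not values from the paper), the sketch below picks the rejection threshold that minimizes the combined cost of false rejections and misunderstandings for a given dialog state:

```python
import numpy as np

def best_rejection_threshold(confidences, is_correct,
                             cost_false_reject=1.0, cost_misunderstanding=3.0):
    """Pick the threshold minimizing total cost on labeled data (illustrative only).

    confidences: recognition confidence scores observed in one dialog state
    is_correct: whether the corresponding recognition result was actually correct
    """
    confidences = np.asarray(confidences, dtype=float)
    is_correct = np.asarray(is_correct, dtype=bool)
    candidates = np.unique(np.concatenate(([0.0, 1.0], confidences)))
    costs = []
    for threshold in candidates:
        rejected = confidences < threshold
        false_rejects = np.sum(rejected & is_correct)          # correct inputs rejected
        misunderstandings = np.sum(~rejected & ~is_correct)    # incorrect inputs accepted
        costs.append(cost_false_reject * false_rejects
                     + cost_misunderstanding * misunderstandings)
    return float(candidates[int(np.argmin(costs))])
```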
|
|
|
|
- |
Raux, A., Langner, B., Bohus, D., Black, A., and Eskenazi, M. (2005) - Let's Go Public! Taking a Spoken Dialog System to the Real World, in Interspeech-2005, Lisbon, Portugal [abs]
|
|
|
In this paper, we describe how a research spoken dialog system
was made available to the general public. The Let’s Go Public
spoken dialog system provides bus schedule information to the
Pittsburgh population during off-peak times. This paper describes
the changes necessary to make the system usable for the general
public and presents analysis of the calls and strategies we have
used to ensure high performance.
|
|
|
|
|
- |
Bohus, D. (2004) - Error Awareness and Recovery in Task-Oriented Spoken Dialogue Systems, Ph.D Thesis Proposal, Carnegie Mellon University, Pittsburgh, PA [abs]
|
|
|
A persistent and important problem in spoken language interfaces is their lack of robustness when faced with understanding errors. The problem is present across all domains and interaction types, and stems primarily from the unreliability of the speech recognition process. I propose to alleviate this problem by (1) endowing spoken dialogue systems with better error awareness, (2) constructing a richer repertoire of error recovery strategies, and (3) developing a practical data-driven approach for making error handling decisions. The proposed work will address questions and make contributions in each of these three areas. For the first part, I propose to develop a belief updating mechanism that integrates confidence annotation and correction detection into a unified framework, and allows spoken dialogue systems to continuously track the reliability of the information they use. For the second part, I propose to implement and investigate an extended set of error recovery strategies addressing common problems in human-computer dialogue. Finally, I plan to bring these two capabilities together in a scalable reinforcement-learning based approach for making error handling decisions in task-oriented spoken dialogue systems.
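A toy illustration of the belief updating idea (not the data-driven mechanism proposed in the thesis): the initial confidence score is treated as a prior probability that a concept value is correct and updated with the likelihood of the observed user response; the likelihood values below are made up for illustration.

def update_belief(prior_correct, p_response_given_correct, p_response_given_incorrect):
    # Bayesian update of the system's belief that a concept value is correct,
    # given how likely the observed user response is under each hypothesis.
    numerator = prior_correct * p_response_given_correct
    denominator = numerator + (1.0 - prior_correct) * p_response_given_incorrect
    return numerator / denominator

# An initial confidence of 0.6, followed by a response that looks like a
# confirmation (more likely if the value is correct), rises to ~0.86.
print(update_belief(0.6, p_response_given_correct=0.8, p_response_given_incorrect=0.2))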
|
|
|
|
- |
Bohus, D., and Rudnicky, A. (2004) - Task-Independent Conversational Strategies in the RavenClaw Dialogue Management Framework, unpublished manuscript [abs]
|
|
|
We present the implementation of task-independent conversational strategies in the RavenClaw dialogue management framework. The proposed approach decouples the implementation and the control of these strategies from the actual system task, and brings forth several advantages: it increases the consistency in the interaction style, while at the same time it lessens the development and testing efforts by allowing for the easy reuse of these strategies across different systems. We plan to illustrate the repertoire of task-independent conversational strategies in the RavenClaw dialogue management framework by giving a live demonstration of RoomLine, a spoken dialogue system for conference room reservation and scheduling.
|
|
|
|
- |
Aist, G., Bohus, D., Boven, B., Campana, E., Early, S., Phan, S. (2004) - Initial Development of a Voice-Activated Astronaut Assistant for Procedural Tasks: From Need to Concept to Prototype, in Journal of Interactive Instruction Development, Volume 16, Nr. 3, Winter 2004, pp 32-36
|
|
|
|
- |
Bohus, D., and Rudnicky, A. (2003) - RavenClaw: Dialog Management Using Hierarchical Task Decomposition and an Expectation Agenda, in Eurospeech-2003, Geneva, Switzerland [abs]
|
|
|
We describe RavenClaw, a new dialog management framework developed as a successor to the Agenda architecture used in the CMU Communicator. RavenClaw introduces a clear separation between task and discourse behavior specification, and allows rapid development of dialog management components for spoken dialog systems operating in complex, goal-oriented domains. The system development effort is focused entirely on the specification of the dialog task, while a rich set of domain-independent conversational behaviors are transparently generated by the dialog engine. To date, RavenClaw has been applied to five different domains allowing us to draw some preliminary conclusions as to the generality of the approach. We briefly describe our experience in developing these systems.
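A toy illustration, in Python rather than the actual C++ framework, of the two ideas named in the title: a hierarchical task tree of dialog agents, and an expectation agenda assembled from the agent currently in focus up to the root, so that user input is matched against the most specific expectations first. The agent and concept names are invented for the example.

from dataclasses import dataclass, field

@dataclass
class DialogAgent:
    name: str
    expects: list = field(default_factory=list)     # concept names this agent can bind
    children: list = field(default_factory=list)
    parent: "DialogAgent" = None

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child

def expectation_agenda(focused_agent):
    # Collect expectations level by level, from the focused agent up to the root.
    agenda, node = [], focused_agent
    while node is not None:
        agenda.append((node.name, node.expects))
        node = node.parent
    return agenda

# Illustrative task tree for a room reservation task.
root = DialogAgent("RoomLine")
get_query = root.add(DialogAgent("GetQuery"))
get_date = get_query.add(DialogAgent("GetDate", expects=["date"]))
get_time = get_query.add(DialogAgent("GetTime", expects=["start_time", "end_time"]))

print(expectation_agenda(get_date))
# [('GetDate', ['date']), ('GetQuery', []), ('RoomLine', [])]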
|
|
|
|
- |
Aist, G., Dowding, J., Hockey, B.A., Rayner, M., Hieronymus, J., Bohus, D., Boven, B., Blaylock, N., Campana, E., Early, S., Gorrell, G., and Phan, S. (2003) - Talking through procedures: An intelligent Space Station procedure assistant, in Demo Session at EACL-2003, Budapest, Hungary [abs]
|
|
|
We present a prototype system aimed at
providing spoken dialogue support for
complex procedures aboard the International
Space Station. The system allows
navigation one line at a time or in larger
steps. Other user functions include issuing
spoken corrections, requesting images
and diagrams, recording voice notes and
spoken alarms, and controlling audio volume.
|
|
|
|
|
- |
Bohus, D., and Rudnicky, A. (2002) - LARRI: A Language-Based Maintenance and Repair Assistant, in IDS-2002, Kloster Irsee, Germany [abs]
|
|
|
LARRI (Language-based Agent for Retrieval of Repair Information) is a dialog-based system for support of maintenance and repair domains, characterized by large amounts of documentation and by procedural information. LARRI is based on an architecture developed by Carnegie Mellon University for the DARPA Communicator program and is integrated with a wearable computer system developed by the Wearable Computing group at Carnegie Mellon University.
LARRI adapts a dialog-management architecture developed and optimized for a telephone-based problem solving task (travel planning), and applies it to a very different domain -- aircraft maintenance. The system was taken on a field trial on two occasions where it was used by professional aircraft mechanics. We found that our architecture, AGENDA, extended readily to a multi-modal and multi-media framework. At the same time we found that assumptions that were reasonable in a services domain turn out to be inappropriate for a maintenance domain. Apart from the need to manage integration between input modes and output modalities, we found that the system needed to support multiple categories of tasks and that a different balance between user and system goals was required. A significant problem in the maintenance domain is the need to assimilate and make available for language processing appropriate domain information.
|
|
|
|
- |
Bohus, D., and Rudnicky, A. (2002) - Integrating Multiple Knowledge Sources for Utterance-Level Confidence Annotation in the CMU Communicator Spoken Dialog System, Technical Report CS-190, Carnegie Mellon University, Pittsburgh, PA [abs]
|
|
|
In recent years, automated speech recognition has been the main driver behind
the advent of spoken language interfaces, but at the same time a severe limiting
factor in the development of these systems. We believe that increased robustness
in the face of recognition errors can be achieved by making the systems aware of
their own misunderstandings, and employing appropriate recovery techniques when
breakdowns in interaction occur. In this paper we address the first problem: the
development of an utterance-level confidence annotator for a spoken dialog
system. After a brief introduction to the CMU Communicator spoken dialog system
(which provided the target platform for the developed annotator), we cast the
confidence annotation problem as a machine learning classification task, and
focus on selecting relevant features and on empirically identifying the best
classification techniques for this task. The results indicate that significant
reductions in classification error rate can be obtained using several different
classifiers. Furthermore, we propose a data driven approach to assessing the
impact of the errors committed by the confidence annotator on dialog
performance, with a view to optimally fine-tuning the annotator. Several models
were constructed, and the resulting error costs were in accordance with our
intuition. We found, surprisingly, that, at least for a mixed-initiative spoken
dialog system such as the CMU Communicator, these errors trade off equally over a
wide operating characteristic range.
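A minimal sketch of the classification framing, assuming one row of decoder/parser/dialog-level features per utterance and a binary label indicating whether the utterance was misunderstood; the feature names are illustrative, not the feature set from the report.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

FEATURE_NAMES = ["acoustic_score", "lm_score", "parse_coverage",
                 "uncovered_words", "turn_number"]   # illustrative only

def train_confidence_annotator(X, y):
    # X: utterance features (one row per utterance); y: 1 if misunderstood, else 0.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    classification_error = 1.0 - model.score(X_test, y_test)
    return model, classification_error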
|
|
|
|
|
- |
Bohus, D., and Rudnicky, A. (2001) - Modeling the Cost of Misunderstandings in the CMU Communicator Dialog System, in ASRU-2001, Madonna di Campiglio, Italy [abs]
|
|
|
We describe a data-driven approach that allows us to quantify the costs of various types of errors made by the utterance-level confidence annotator in the Carnegie Mellon Communicator system. Knowing these costs, we can determine the optimal tradeoff point between these errors, and tune the confidence annotator accordingly. We describe several models, based on concept transmission efficiency. The models fit our data quite well and the relative costs of errors are in accordance with our intuition. We also find, surprisingly, that for a mixed-initiative system such as the CMU Communicator, false positive and false negative errors trade off equally over a wide operating range.
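A hedged sketch of the data-driven idea, assuming per-session counts of false-positive and false-negative confidence errors and a per-session efficiency measure; the regression coefficients then act as relative error costs. The exact efficiency measure and variable names are illustrative.

import numpy as np
from sklearn.linear_model import LinearRegression

def estimate_error_costs(fp_counts, fn_counts, efficiency):
    # Regress session efficiency (e.g., concepts correctly transmitted per turn)
    # on the two error counts; a more negative coefficient means a costlier error.
    X = np.column_stack([fp_counts, fn_counts])
    regression = LinearRegression().fit(X, efficiency)
    cost_false_positive, cost_false_negative = -regression.coef_
    return cost_false_positive, cost_false_negative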
|
|
|
|
- |
Carpenter, P., Jin, C., Wilson, D., Zhang, R., Bohus, D., and Rudnicky, A. (2001) - Is This Conversation on Track?, in Eurospeech-2001, Aalborg, Denmark [abs]
|
|
|
Confidence annotation allows a spoken dialog system to accurately assess the likelihood of misunderstanding at the utterance level and to avoid breakdowns in interaction. We describe experiments that assess the utility of features from the decoder, parser and dialog levels of processing. We also investigate the effectiveness of various classifiers, including Bayesian Networks, Neural Networks, SVMs, Decision Trees, AdaBoost and Naive Bayes, to combine this information into an utterance-level confidence metric. We found that a combination of a subset of the features considered produced promising results with several of the classification algorithms considered, e.g., our Bayesian Network classifier produced a 45.7% relative reduction in confidence assessment error and a 29.6% reduction relative to a handcrafted rule.
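A minimal sketch of the classifier comparison, assuming a numpy feature matrix with one row per utterance and a binary misunderstanding label; the classifier settings are scikit-learn defaults rather than the configurations used in the paper.

from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier

def compare_classifiers(X, y, n_folds=10):
    # Cross-validate several of the classifier families named above on the
    # same utterance-level features and report their error rates.
    candidates = {
        "naive_bayes": GaussianNB(),
        "decision_tree": DecisionTreeClassifier(),
        "svm": SVC(),
        "adaboost": AdaBoostClassifier(),
    }
    return {name: 1.0 - cross_val_score(clf, X, y, cv=n_folds).mean()
            for name, clf in candidates.items()}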
|
|
|
|
|
|
|
|
|
- |
Bohus, D., and Boldea, M. (2000) - A Web-based Text Corpora Development System, in LREC-2000, Athens, Greece [abs]
|
|
|
One of the most important starting points for any NLP endeavor is the construction of text corpora of appropriate size and quality. This paper presents a web-based text corpora development system that focuses both on the size and the quality of these corpora. The quantitative problem is solved by using the Internet as a practically limitless resource of texts. To ensure a certain quality, we enrich the text with relevant information to make it fit for further use, by resolving in an integrated manner the problems of diacritic character restoration, lexical ambiguity resolution, and morphosyntactic annotation. Although at this moment it is targeted at texts in Romanian, a number of mechanisms have been provided that allow it to be easily adapted to other languages.
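A toy sketch of one of the subproblems mentioned above, diacritic restoration: each de-accented word is mapped back to its most frequent diacritic form in a lexicon built from trusted text. The tiny lexicon and example phrase are illustrative; the actual system integrates this step with lexical disambiguation and morphosyntactic annotation.

import unicodedata
from collections import Counter, defaultdict

def strip_diacritics(word):
    # Remove combining marks so that, e.g., "română" maps to "romana".
    return "".join(c for c in unicodedata.normalize("NFD", word)
                   if unicodedata.category(c) != "Mn")

def build_lexicon(trusted_text):
    # Count the diacritic forms observed for each de-accented word and keep
    # the most frequent one as the restoration target.
    counts = defaultdict(Counter)
    for word in trusted_text.lower().split():
        counts[strip_diacritics(word)][word] += 1
    return {bare: forms.most_common(1)[0][0] for bare, forms in counts.items()}

def restore_diacritics(text, lexicon):
    return " ".join(lexicon.get(word, word) for word in text.lower().split())

lexicon = build_lexicon("națiunea română își păstrează limba")   # illustrative corpus
print(restore_diacritics("natiunea romana isi pastreaza limba", lexicon))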
|
|
|
|
|
|