Human Intelligence Needs Artificial Intelligence
Daniel S. Weld
Mausam
Peng Dai
Abstract
Crowdsourcing platforms, such as Amazon Mechanical Turk,
have enabled the construction of scalable applications for
tasks ranging from product categorization and photo tagging
to audio transcription and translation. These vertical applications are typically realized with complex, self-managing
workflows that guarantee quality results. But constructing
such workflows is challenging, with a huge number of alternative decisions for the designer to consider.
We argue the thesis that artificial intelligence methods can greatly simplify the process of creating and managing complex crowdsourced workflows. We present the design of CLOWDER, which uses machine learning to continually refine models of worker performance and task difficulty. Using these models, CLOWDER applies decision-theoretic optimization to 1) choose between alternative workflows, 2) optimize parameters for a workflow, 3) create personalized interfaces for individual workers, and 4) dynamically control the workflow. Preliminary experience suggests that these optimized workflows are significantly more economical (and return higher-quality output) than those generated by humans.
Introduction
Crowd-sourcing marketplaces, such as Amazon Mechanical Turk, have the potential to allow rapid construction of
complex applications which mix human computation with
AI and other automated techniques. Example applications
already span the range from product categorization [2], photo tagging [24], and business listing verification [16] to audio/video transcription [17; 23], proofreading [19], and translation [20].
In order to guarantee quality results from potentially
error-prone workers, most applications use complex, self-managing workflows with independent production and review stages. For example, iterative improvement [14] and find-fix-verify workflows [1] are popular patterns. But devising these patterns and adapting them to a new task is both complex and time-consuming. Existing development environments, e.g., TurKit [14], simplify important issues, such as control flow and debugging, but many challenges remain.
For example, in order to craft an effective application, the
designer must:
Choose between alternative workflows for accomplishing the task. For example, given the task of transcribing
an MP3 file, one could ask a worker to do the transcription, or first use speech recognition and then ask workers to find and fix errors. Depending on the accuracy
and costs associated with these primitive steps, one or the
other workflow may be preferable.
Optimize the parameters for a selected workflow. Suppose one has selected the workflow that uses a single worker to directly transcribe the file; before one can start execution, one must determine the values of continuous parameters, such as the price, the length of the audio file, etc. If the audio track is cut into snippets that are too long, then transcription speed may fall, since workers often prefer short jobs. But if the audio track is cut into many short files, then accuracy may fall because of lost context for the human workers. A computer can methodically try different parameter values to find the best setting.
Create tuned interfaces for the expected workers. The
precise wording, layout, and even color of an interface can dramatically affect user performance. One can use Fitts' Law or other cost models to automatically design effective interfaces [7]. Comprehensive A/B testing of alternative designs, automated by computer, is also essential [12].
Control execution of the final workflow. Some decisions, for example, the number of cycles in an iterative improvement workflow and the number of voters used for verification, cannot be optimally determined a priori. Instead, decision-theoretic methods, which incorporate a model of worker accuracy, can dramatically improve on naive strategies such as majority vote [3], as the sketch below illustrates.
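To make the last point concrete, the following minimal sketch (our own illustration under simplified assumptions, not CLOWDER's actual implementation) aggregates binary ballots in two ways: by majority vote and by a Bayesian posterior that weights each ballot with an assumed per-worker accuracy.

```python
# Sketch: decision-theoretic aggregation of binary ballots, contrasted with
# majority vote.  Worker accuracies are assumed to be given here (in practice
# they would come from learned worker models); all names are illustrative.

def majority_vote(ballots):
    """ballots: list of (worker_id, answer) pairs, with answer in {0, 1}."""
    yes = sum(answer for _, answer in ballots)
    return 1 if yes * 2 > len(ballots) else 0

def posterior_true(ballots, accuracy, prior=0.5):
    """P(true answer = 1 | ballots), treating ballots as independent given the
    true answer, where worker w is correct with probability accuracy[w]."""
    p_if_1, p_if_0 = prior, 1.0 - prior
    for worker, answer in ballots:
        a = accuracy[worker]
        p_if_1 *= a if answer == 1 else (1.0 - a)
        p_if_0 *= a if answer == 0 else (1.0 - a)
    return p_if_1 / (p_if_1 + p_if_0)

if __name__ == "__main__":
    accuracy = {"w1": 0.95, "w2": 0.55, "w3": 0.55}       # hypothetical workers
    ballots = [("w1", 1), ("w2", 0), ("w3", 0)]
    print(majority_vote(ballots))              # 0: the two unreliable workers win
    print(posterior_true(ballots, accuracy))   # ~0.93: the reliable worker dominates
```

Even in this toy case, the accuracy-weighted posterior overrules a misleading majority; the same idea underlies model-based control of voting steps.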
Our long-term goal is to prove the value of AI methods
on these problems and to build intelligent tools that facilitate the rapid construction of effective crowd-sourced
workflows. Our first system, TURKONTROL [3; 4], used a partially observable Markov decision process (POMDP) to perform decision-theoretic optimization of iterative, crowd-sourced workflows. This paper presents the design of our second system, CLOWDER,¹ which we are just starting to implement. We start by summarizing the high-level architecture of CLOWDER; subsequent sections detail the AI reasoning behind its components.

¹ It is said that nothing is as difficult as herding cats, but maybe decision theory is up to the task? A clowder is a group of cats.
[Figure: The CLOWDER architecture, showing its main components: a decision-theoretic (DT) planner, an HTN library of workflows, task models, user models, a learner, and a renderer that produces the rendered jobs posted to the worker marketplace.]
on the task (people not good at writing English descriptions could still be potent audio transcribers), we can seed
their task-specific quality parameters based on their average parameters from similar prior tasks.
Comprehensive Decision-Theoretic Control. A workflow involves several choices, including pricing, bonuses, the number of iterations or voters, and interface layout. Our previous work, TURKONTROL, optimized a subset of these factors for a specific type of workflow. CLOWDER will extend TURKONTROL by supporting a large number of workflows and optimizing all of these choices.
We now discuss each of these components in detail.
Poor-quality workers present a major challenge for crowd-sourced applications. Although early studies concluded
that the majority of workers on Mechanical Turk are diligent [22], more recent investigations suggest a plethora of
spam workers. Moreover, the error rates are quite high for
open-ended tasks like improving an artifact or fixing grammatical errors [1].
Ipeirotis [9] has suggested several important improvements to the Mechanical Turk marketplace platform, one of
which is a better reputation system for evaluating workers.
He argues that payment should be separated from evaluation, employers should be allowed to rate workers, and the
platform should provide more visibility into a worker's history. Worker quality should be reported as a function of job type, in addition to aggregate measures. By surfacing only limited information, such as acceptance percentage and number of completed HITs, Mechanical Turk makes it easy for spam workers to pose as responsible ones by rank boosting [8;
6]. Yet even if Mechanical Turk is slow to improve its
platform, alternative marketplaces, such as eLance, guru,
oDesk, and vWorker, are doing so.
But even if Ipeirotis's improved reputation system is
widely adopted, the best requesters will still overlay
their own models and perform proprietary reasoning about
worker quality. In a crowd-sourced environment, the specific workflow employed (along with algorithms to control
it) is likely to represent a large part of a requester's competitive advantage. The more an employer knows about the
detailed strengths and weaknesses of a worker, the better the
employer can apply the worker to appropriate jobs within
that workflow. Thus, knowledge about a worker provides
a proprietary advantage to an employer and is unlikely to
be fully shared. Just as today's physically-based organizations spend considerable resources on monitoring employee
performance, we expect crowd-sourced worker modeling to
be an area of ongoing innovation. TURKONTROL devised a novel approach to worker modeling, which CLOWDER extends.
Learning a Model of Simple Tasks: Let us focus on the
simplest tasks first: predicting a worker's behavior when
answering a binary question. The learning problem is to
estimate the probability of a worker x answering a binary
ballot question correctly. While prior work has assumed all
Similar ideas apply to learning other, more complex models. In CLOWDER we propose to learn a wide variety of worker models; we list a few below, organized by job type. We anticipate that existing models from other tasks will aid in seeding the worker models for a new task, and that these can be continually updated as we gain more information about a worker on the task at hand (a minimal sketch of this seeding idea follows the list).
Discrete alternatives. Workers may be asked to choose
between more than two discrete alternatives. A simple extension of our ballot model suffices for this.
Find jobs. The Soylent word processor popularized
a crowd-sourcing design pattern called Find-Fix-Verify, which splits complex crowd intelligence tasks into a series of generation and review stages that use independent agreement and voting to produce reliable results [1]. Find jobs typically present a worker with a sequence of data, e.g., a textual passage, and ask the worker to identify flaw locations in that sequence, e.g., the locations of grammatical errors. Since only a few locations
can be returned, we can learn models of these jobs with
an extension of the discrete alternatives framework.
Improvement jobs. This class of job (also known as a
Fix job) requires the worker to revise some piece of
content, perhaps by fixing a grammatical error or by extending a written description of a picture. We can employ
and extend the curve-fitting ideas in [4] to learn such models.
Content creation jobs. This class of job would be used
when initializing an iterative improvement workflow or in
the first step of a transcription task. It can be modeled as
a degenerate case of an improvement job.
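As a concrete, intentionally simplified illustration of the seeding idea mentioned above, the sketch below models a worker's ballot accuracy as a Beta-Bernoulli estimate seeded from the worker's average accuracy on similar prior tasks. CLOWDER's actual models are richer (e.g., they condition on question difficulty), and all names and numbers here are illustrative assumptions.

```python
# Sketch: per-worker accuracy on binary ballot questions, modeled with a
# Beta-Bernoulli estimate.  A new task's model is seeded from the worker's
# average accuracy on similar prior tasks, then updated as labeled (or
# inferred) outcomes arrive.  Illustrative only.

class BallotAccuracyModel:
    def __init__(self, prior_accuracy=0.75, prior_strength=4.0):
        # Equivalent to having seen `prior_strength` pseudo-ballots whose
        # empirical accuracy was `prior_accuracy`.
        self.alpha = prior_accuracy * prior_strength          # pseudo-correct
        self.beta = (1.0 - prior_accuracy) * prior_strength   # pseudo-incorrect

    @classmethod
    def seeded_from(cls, prior_task_accuracies, strength=4.0):
        """Seed a new task's model from accuracies on similar prior tasks."""
        mean = sum(prior_task_accuracies) / len(prior_task_accuracies)
        return cls(prior_accuracy=mean, prior_strength=strength)

    def update(self, correct):
        """Record one ballot outcome (True if the worker answered correctly)."""
        if correct:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def accuracy(self):
        """Posterior mean estimate of the worker's accuracy."""
        return self.alpha / (self.alpha + self.beta)

# Example: a worker averaged 0.80 on two related tasks; after one mistake on
# the new task, the estimate moves from 0.80 down to 0.64.
model = BallotAccuracyModel.seeded_from([0.78, 0.82])
model.update(correct=False)
print(round(model.accuracy, 3))
```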
[Figure: Decision points in two controlled workflows. In an iterative-improvement workflow, the agent decides whether further improvement or ballot voting is needed, generates improvement and ballot jobs, and updates its posteriors before submitting the better of the two artifacts. In a Find-Fix-Verify workflow, the agent generates Find HITs, updates its posterior over each flaw, decides whether to find more flaws, picks flaws to fix via Fix HITs, and generates Verify HITs until no more verification is needed.]
For the case of iterative-improvement workflows, a simple k-step lookahead greedy search performed remarkably well; however, more sophisticated methods may be necessary as we increase the number of decision points made by the agent. We will investigate a variety of strategies, including discretization and the Monte Carlo methods pioneered in UCT [11]. Our prior experience with approximate and optimal MDP and POMDP algorithms (e.g., [3; 13]) will help in scaling to these larger problems.
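As one plausible instantiation of such a greedy lookahead (not the agent's actual policy), the sketch below decides whether to request another ballot on a binary question by comparing the utility of stopping now with the expected utility of asking for up to k more ballots. The worker accuracy, reward, and ballot cost are illustrative assumptions.

```python
# Sketch: depth-limited (k-step) lookahead for deciding whether to request
# another ballot or to stop and submit the currently more probable answer.
# Assumes one generic worker accuracy; utilities are illustrative.

def bayes_update(p, vote, accuracy):
    """Posterior P(answer = 1) after observing one vote."""
    like_1 = accuracy if vote == 1 else 1.0 - accuracy
    like_0 = (1.0 - accuracy) if vote == 1 else accuracy
    return p * like_1 / (p * like_1 + (1.0 - p) * like_0)

def lookahead_value(p, depth, accuracy=0.8, reward=1.0, ballot_cost=0.05):
    """Expected utility of acting optimally for up to `depth` more ballots."""
    stop_value = reward * max(p, 1.0 - p)        # submit the likelier answer
    if depth == 0:
        return stop_value
    p_vote_1 = p * accuracy + (1.0 - p) * (1.0 - accuracy)
    ask_value = -ballot_cost + (
        p_vote_1 * lookahead_value(bayes_update(p, 1, accuracy), depth - 1,
                                   accuracy, reward, ballot_cost)
        + (1.0 - p_vote_1) * lookahead_value(bayes_update(p, 0, accuracy), depth - 1,
                                             accuracy, reward, ballot_cost))
    return max(stop_value, ask_value)

def should_request_ballot(p, k=2):
    """Greedy k-step lookahead: ask for another ballot only if it helps."""
    return lookahead_value(p, k) > lookahead_value(p, 0)

print(should_request_ballot(0.55))   # True: uncertain belief, another ballot pays off
print(should_request_ballot(0.95))   # False: confident belief, stop and submit
```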
Pricing Jobs: There are several ways to compute the best
price for a job. Once a concrete interface has been selected,
it is easy to measure the time required to complete a job;
multiplying by an expected hourly rate produces a price.
But money is not a worker's only motivation; the intellectual challenge of a job and even the attractiveness of a UI can reduce the necessary wage [25].
Mason and Watts [15] showed that increasing the payment for a task on Mechanical Turk increased the quantity of tasks performed by a worker but not their quality. So, if the task comes with a deadline, the relationship between price and task completion rate could determine the pay. Moreover, there are other methods for improving worker performance, and CLOWDER may explore these as well.
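A minimal sketch of such a pricing rule follows: the base price pays a target hourly wage for the measured completion time, scaled up when a deadline requires higher throughput than currently observed. The target wage, the linear scaling, and all numbers are illustrative assumptions, not CLOWDER's actual pricing policy.

```python
# Sketch: pricing from measured completion time plus a deadline adjustment.
# Higher pay is assumed to buy quantity/speed rather than quality, so the
# price is raised only when throughput must rise to meet a deadline.

def price_per_task(seconds_per_task, hourly_rate=6.0,
                   required_tasks_per_hour=None, observed_tasks_per_hour=None):
    base = hourly_rate * seconds_per_task / 3600.0
    if required_tasks_per_hour and observed_tasks_per_hour:
        # Pay proportionally more when completion rate must increase.
        shortfall = required_tasks_per_hour / observed_tasks_per_hour
        base *= max(1.0, shortfall)
    return round(base, 3)

print(price_per_task(90))                                  # 0.15: 90-second task at $6/hour
print(price_per_task(90, required_tasks_per_hour=200,
                     observed_tasks_per_hour=120))         # 0.25: deadline pressure
```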
Awarding Bonuses: Paying an optional bonus for higher-quality (or more rapidly completed) work is a tried-and-tested way to motivate better submissions from workers. For an automated agent, the decisions are (1) when to pay a bonus, (2) under what quality conditions a bonus should be paid, and (3) what magnitude of bonus should be paid. Intuitively, if we had an expectation of the total cost of a job and ended up saving some of that money, a fraction of the agent's savings could be used to reward the workers who did well on this task.
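The savings-based scheme just described might look like the following sketch, in which a fixed fraction of any budget savings is split among workers whose quality exceeds a threshold, in proportion to their quality. The fraction, threshold, and quality scores are illustrative assumptions.

```python
# Sketch: bonus allocation from budget savings.  If a job finishes under its
# expected cost, part of the savings is distributed among high-quality workers.

def allocate_bonuses(expected_cost, actual_cost, worker_quality,
                     savings_fraction=0.5, quality_threshold=0.8):
    pool = max(0.0, expected_cost - actual_cost) * savings_fraction
    eligible = {w: q for w, q in worker_quality.items() if q >= quality_threshold}
    total_quality = sum(eligible.values())
    if pool == 0.0 or total_quality == 0.0:
        return {}
    # Split the pool in proportion to each eligible worker's quality score.
    return {w: round(pool * q / total_quality, 2) for w, q in eligible.items()}

print(allocate_bonuses(5.00, 3.80, {"w1": 0.95, "w2": 0.85, "w3": 0.60}))
# {'w1': 0.32, 'w2': 0.28} -- w3 falls below the quality threshold
```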
Other Parameters:
Related Work
Shahaf and Horvitz [21] also use an HTN-style decomposition algorithm to find a coalition of workers, each with
different skill sets, to solve a task. Our workflow selection
ideas are inspired by their work.
Other researchers have studied modeling worker competencies. Whitehill et al. [27] also model problem difficulty,
though they use the unsupervised EM algorithm. Welinder et al. [26] add annotation bias and multi-dimensional parameters. Kern et al. [10] also predict whether to
elicit new answers. Donmez et al. study worker accuracies
as a function of time [5].
Recently, Zhang et al. [28] argue that all aspects of the workflow, from design through control, can themselves be crowd-sourced.
Conclusions
Amazon Mechanical Turk's tagline, "Artificial Artificial Intelligence," emphasizes the ability of the crowd to perform many tasks commonly attributed to AI systems.
In this paper we argue that AI techniques are essential for managing, controlling, executing, and evaluating the
tasks performed on crowd-sourcing platforms.
We outline the design of our system, CLOWDER, which (1) uses machine learning to continually refine models of worker performance and task difficulty, (2) can optimize the parameters and interfaces in a workflow to achieve the best quality-cost-completion-time trade-off, (3) can dynamically control a workflow to react to and anticipate the effects of better or worse workers finishing a task, and (4) can select among the multiple possible workflows for a task based on automatic evaluation of the different optimized workflows.
CLOWDER combines core ideas from different subfields of artificial intelligence, such as decision-theoretic analysis, model-based planning and execution, machine learning, and constraint optimization, to solve the multitude of subproblems that arise in the design. The implementation of the
system is in progress.
We believe that a mixed-initiative system that combines the power of artificial intelligence with that of artificial artificial intelligence has the potential to revolutionize business processes. We already see several innovative crowd-sourcing applications; we can easily anticipate many more, as CLOWDER reduces the requester skill and computational overhead required to field an application.
Acknowledgments
This work was supported by the WRF / TJ Cable Professorship, Office of Naval Research grant N00014-06-1-0147,
and National Science Foundation grants IIS 1016713 and IIS
1016465.
References
[1] M. Bernstein, G. Little, R. Miller, B. Hartmann, M. Ackerman, D. Karger, D. Crowell, and K. Panovich. Soylent: A word processor with a crowd inside. In UIST, 2010.
[2] http://crowdflower.com/solutions/prod_cat/index.html.
[3] P. Dai, Mausam, and D. S. Weld. Decision-theoretic control of crowd-sourced workflows. In AAAI, 2010.
[4] P. Dai, Mausam, and D. S. Weld. Artificial intelligence for artificial artificial intelligence. In AAAI, 2011.
[5]
[9] P. Ipeirotis. Plea to Amazon: Fix Mechanical Turk! http://behind-the-enemy-lines.blogspot.com/2010/10/plea-to-amazon-fix-mechanical-turk.html.