NRC-TR-2009-001
Crowd Translator: On Building Localized Speech Recognizers
through Micropayments
Jonathan Ledlie, Billy Odero, Einat Minkov, Imre Kiss, Joseph Polifroni
Nokia Research Center Cambridge, US
http://research.nokia.com
June 18, 2009
Abstract:
We present a method to expand the number of languages covered by simple speech recognizers. Enabling speech recognition in users’ primary languages greatly extends the
types of mobile-phone-based applications available to people in developing regions. We describe how we expand language corpora through user-supplied speech contributions, how
we quickly evaluate each contribution, and how we pay contributors for their work.
Index Terms:
Speech Recognition, Crowd-Sourcing, Self-Verification, Low-Resource Languages
1 Introduction
Speech is a key modality for mobile devices and applications, particularly for low-literate users in developing regions. Even for simple tasks, speech-based interaction requires a speech recognizer trained on the target phrases in the user’s language. Because creating a speech recognizer for a new language is
an expensive and time-consuming task, recognizers exist for only the most popular languages. Many of
the users who would most benefit from a speech-based interface are often forced to speak in a secondary
language, such as English, or use an alternative modality. A fundamental component of any speech
interface is reliable data with which to build acoustic and language models; these data are a critical prerequisite to deploying speech interfaces in the developing world.
In order to ensure reliable and robust performance in a speaker-independent task, large amounts of
spoken data must be collected from as broad a range of people as possible. We demonstrate a system,
Crowd Translator (CX), that gathers speech data from people through their mobile phones to create a
high-quality speech recognizer. After we automatically validate each set of contributions, we pay into
the contributor’s mobile phone bank account.
CX aims to make it easy and cheap to develop simple speech recognizers in many more languages
than are currently available. This paper describes how we acquire speech data in new languages and
how we automatically verify and pay the people who contribute to each language’s corpus of recognized
phrases.
This paper makes the following contributions:
• We show how user-generated content can be subject to self-verification, quickly classifying it as
useful or not.
• We demonstrate a prototype that automatically screens out invalid user data, suggesting that CX
can be used to create simple speech recognizers for local languages at significantly lower cost than
previous methods.
1.1 How Crowd Translator Works
We start with a target corpus of text phrases chosen to suit a set of simple telephone-based applications. A single person with an unmarked accent, our “voice talent”, records each phrase to
be used as an audio prompt. This same voice talent also reads a short introduction that instructs people
on how to use the system. When subjects call the system, they hear the instructions, which ask them
to repeat each audio prompt after they hear it.
Each contributor, or worker, is recruited from among native speakers of the target language. Immediately after each worker’s contribution, or session, CX automatically determines if the recordings
the worker provided are valid. If the session is valid, the worker is sent a payment, either to a mobile
bank account or to a mobile minutes account, depending on what is available in the particular country.
After sufficiently many users have contributed sessions for the same target vocabulary, the resulting corpus is used to train a speech recognizer. Note that we are not building a phoneme-aware model of each
language. Instead, we are collecting many samples of each utterance, which can directly be used for
statistical matching. Figure 1 illustrates this process.
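For illustration, the following Python sketch outlines one possible session loop under this workflow. It is not our implementation: the play, record, verify_session, and credit_account helpers, the prompt objects, and the payment amount are hypothetical placeholders for the telephony, verification (Section 3), and payment components.

```python
import random

def run_session(prompts, worker, play, record, verify_session, credit_account,
                num_duplicates=5, payment=1.0):
    """Illustrative session loop: play each audio prompt, record the worker's
    imitation, repeat a few prompts to enable intra-session agreement, then
    verify the session and, if it passes, credit the worker's account."""
    # Duplicate a small, random subset of prompts for the agreement test.
    duplicates = random.sample(prompts, min(num_duplicates, len(prompts)))
    schedule = list(prompts) + duplicates
    random.shuffle(schedule)

    recordings = []
    for prompt in schedule:
        play(prompt)                           # canonical "gold standard" recording
        recordings.append((prompt, record()))  # worker mimics the prompt

    # First pass: quick automatic validity check (Section 3).
    if verify_session(recordings, duplicates):
        credit_account(worker, payment)        # e.g., mobile money or airtime credit
    return recordings
```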
2 Background

2.1 Crowdsourcing Data
Amazon’s Mechanical Turk is a “marketplace for work” [1]. Through its web interface, users can submit
short tasks they want other people to complete. These tasks are typically ones that humans are good
at but machines are not: for example, determining whether a video is suitable for children. Mechanical
Turk takes care of allocating tasks and routing payments to workers, but leaves determining the validity
of each worker’s results to the user who submitted the task. Crowd Translator could be thought of
as a specialized instance of Mechanical Turk’s marketplace, one which builds in verification. While
currently focused on collecting and automatically verifying speech data, CX could expand to include
other speech-driven tasks, such as sentence simplification and annotation of spoken utterances.
[Figure 1: four-panel diagram of the corpus-building pipeline.]
Figure 1: Crowd Translator Overview: (a) In a lab setting, a single native speaker translates target utterances and records canonical, “gold standard” phrases, e.g., “gari” (gari_g). (b) A worker, who speaks the target language natively, calls the Crowd Translator phone number and provides input, mimicking each prompt. Each set of utterances provided by one worker in one phone call is called a session. (c) CX quickly determines the validity of the session using a new technique, testing for intra-session agreement. The worker is paid within a few seconds of completing the task if it contains a sufficient fraction of valid input. (d) Lastly, we add the user’s input to the overall speech corpus for the target language. Slower automatic post-processing further filters the corpus.
Like Mechanical Turk, txteagle is an outsourcing marketplace [7]. Unlike Mechanical Turk, workers
are drawn from developing regions and sent tasks via SMS. Current tasks focus on text translation,
although image recognition tasks may be supported in the future (using MMS).
txteagle’s trials, in Kenya, Rwanda, Nigeria, and the Dominican Republic, have met with widely varying success depending on levels of trust. In Nigeria, for example, where SMS-based scams are common, workers have been hesitant to use txteagle because they are unsure they will really be paid upon successfully completing a task. As we expand CX beyond a prototype, we will likely encounter similar trust issues.
While txteagle and Crowd Translator employ some similar techniques to determine worker and result
credibility, txteagle does not use sessions to establish user credibility. In addition, users are paid per
individual task, not per set of tasks. Because establishing the correctness of a task may occur hours or
days after a worker has provided input, trust may be greater in a system like CX that provides immediate
feedback.
2.2 Data Annotation
Machine learning researchers quickly realized that Mechanical Turk was a new resource for collecting
large quantities of human-labeled data. An evaluation of a machine learning algorithm typically requires
training and testing data, where both sets of data have been labeled, or annotated, by a human. For
example, after humans assign categories (labels) to large numbers of videos, researchers can test their
algorithms to see if this association can be made automatically.
Machine learning researchers also quickly realized that labels generated with Mechanical Turk were
different from “expert” annotations. Snow et al. asked Mechanical Turk’s workers to assign emotions to news headlines — for example, assigning “surprised” to the phrase “Outcry at North Korea Nuclear Test” [19] — and found the results were noisy. Snow et al. and Kittur et al. found that Turk’s human annotators sometimes cheat or make mistakes, producing incorrect results when compared to an expert [11].
To correct for these errors, researchers used spot checks, cross-validation, and worker filtering: (a) spot
checks compare the purported result against a known truth, or “gold standard;” (b) cross-validation,
also called inter-annotator agreement, compares the results from different users for the same task [18]; and (c) worker filtering assigns tasks to people who have performed the same types of tasks well in the
past [6]. Combined, these approaches facilitate verification, the process by which utterances are vetted
and transcriptions authenticated. Interestingly, these methods for verifying human input are essentially
the same as those developed a decade earlier to catch cheaters in volunteer computing projects, such as
SETI@Home [2].
Without human intervention, these techniques greatly improve the reliability of the labeled data,
sometimes leading to expert level performance [18]. However, they come at the cost of redundancy,
reducing throughput and increasing cost per task.
Another powerful technique is to measure individual consistency. Kruijff-Korbayová et al. examined
intra-annotator agreement, where the same person was presented with the same question two or more
times [12]. If the person gave the same answer, he or she was considered more reliable; if not, less so.
The results were verified by a human expert, an especially tedious task with speech data.
Crowd Translator uses a new type of intra-annotator agreement to measure input validity. Instead of
spreading redundant queries over months (as in [12]), we ask workers the same question within the same
session. This is feasible because our questions (voice prompts) are very brief relative to the length of
the session. With multiple data points gathered in the same short period, we can estimate the validity
of the user input immediately after each session (or even during the session). This allows us to provide
the worker with immediate feedback on his or her work. This technique of prompting users with a small
number of redundant, very short tasks appears to generalize to verifying other forms of user-provided
data.
2.3 Speech Data Collection
Large-scale telephone speech data collection has been used for more than a decade to bootstrap development of new domains [4]. Broad-based recruitment efforts have been effective in targeting large numbers
of subjects in a short period of time [5, 9]. These efforts involved non-spontaneous speech, with subjects
either reading from prepared prompt sheets or responding to scripted or situational prompts.
Non-spontaneous data collection makes verification easier, but can have an unnatural priming effect.
A human verifier, or annotator, can be given the anticipated transcription which, in the majority of cases,
is what the user has actually said. Annotators need some level of training, and typically both hardware and software are required for playing, displaying, and manipulating annotations, all of which add greatly to the cost of collection.
In contrast to these highly manual approaches, Crowd Translator provides a method for collecting
and verifying large amounts of speech data in “low resource” languages, i.e., languages for which there
are few, if any, existing corpora, either written or spoken. Other research has examined the issue of
speech-to-speech translation in low resource languages [10, 15]. In these cases, some existing data were
available, as well as resources for transcribing parts of it. This differs from CX in that we are concerned
with the rapid development of simple applications to be used in cases where neither corpora nor resources
for annotation are available.
A different approach to large-scale telephone speech data collection is GOOG411 [8]. GOOG411 is a
public telephone directory service, similar to 411 in the U.S. In contrast to most directory services, the
user speaks the name of the desired listing — no human operator is ever involved — and Google pays
for the call, not the user. While Google loses money on this service directly, the recorded utterances,
in aggregate, improve speech recognition for its other applications and services. Both GOOG411 and
Crowd Translator pay users for their time, although GOOG411 does so indirectly.
GOOG411 collects spontaneous speech; users can ask for any listing. This type of speech often
more closely matches normal speech than mimicked, non-spontaneous speech, but it is much harder to
annotate because there is no anticipated transcription. Because GOOG411 is single-shot, there is little
opportunity to automatically decide if a user has provided useful input for a speech recognizer. Thus,
screening out poor input may be harder. In addition, with GOOG411, it is more difficult to infer if the
result presented to the user is correct: even if the user connects to the top result, this may not have been
what the user originally asked for.
Our phrase-based recognizer is in contrast to the phoneme-based recognizers that have come to
dominate the field in the past fifteen years. Phoneme-based recognizers are, in many respects, superior
to non-phoneme ones. Because they match only subcomponents of words, they are easier to expand
to new words, they can be easier to tune to new dialects, and they use less memory. Unfortunately,
they are extremely expensive to build: so expensive that only 28 languages are available from the
market leader [16]. Phrase-based recognizers have been shown to provide very accurate results (≈ 100%)
for speaker-independent matching for small vocabularies [13]. Interestingly, because this approach was
dropped in favor of phoneme-based recognizers, it remains an open problem how this approach
scales to hundreds or thousands of phrases. But because our target applications have only a few dozen
possibilities at each user prompt and because acquiring new phrases is inexpensive, full-phrase recognition
appears sufficient.
3 Verifying User Input
Crowd Translator uses a two-pass model to construct a speech corpus. The first pass, illustrated in Figure 2, is a verification step whose goal is to determine, once a working session is over, whether the worker is likely to have supplied a sufficient fraction of valid input and should be paid. Invalid inputs correspond to words or phrases that differ from what was requested. This may occur when workers attempt to receive a reward without performing the task as instructed, as well as due to poor line quality or workers’ misinterpretation of the spoken prompt. We are interested in a model where workers are rewarded according to their effort. A good model should discard low-quality sessions, but it should tolerate moderately noisy inputs in order to establish worker trust.
The goal of a second pass is to further eliminate irrelevant samples from the set of approved sessions,
such that the quality of the speech recognizer trained on the accepted samples is high. The focus of this paper is on approaches to first-pass validation; we leave the theoretical and empirical study of the second pass for future work.
We next describe two approaches to evaluating session validity: (a) intra-session agreement and (b)
comparing against gold standard samples. The results of an empirical evaluation of the two approaches
are given in the following section.
[Figure 2: timeline of a 30-minute worker session with prompts A, B, C, D, E, A′, …, D′, Z; steps shown are (1) session, (2) verification (first pass) against a PDF of similarity scores (e.g., 0.86, 0.93, …), (3) crediting of the worker’s account, and (4) a payment message sent to the mobile operator.]
Figure 2: During a session, each worker is presented with a small number of redundant utterances, e.g., A and A′. After each session, the verification process can quickly assign an overall validity to the
session. If the session passes verification, the user’s mobile payment account is credited, by sending a
signed message to the mobile operator. Within a few seconds, the user sees his or her account has been
credited, increasing trust in the system.
[Figure 3: probability density of the similarity of two utterances (x-axis, 0.0 to 1.0) for matching and non-matching pairs.]
Figure 3: When workers provide different content than is expected, these invalid inputs will match the
“disagreement” distribution, not the “agreement” one. The figure shows two distributions: “agreement”
is the similarity metric when users have said the same word twice; “disagreement” is when they have said
different words. This same method of testing for the similarity of a few user-supplied values can be
applied to other contexts where users supply the bulk of the system’s data. The data come from five
people whose inputs were verified and then artificially scrambled to create mis-matches.
3.1 Intra-Session Agreement
We suggest intra-session agreement as a method to rapidly determine the likely validity of a session of
user data. The intra-session agreement measure is aimed at discovering whether a user was consistent
throughout the session. We assume that if the user did not follow, or only partially followed, the instructions, overall consistency will be low.
The three steps to testing intra-session agreement are:
1. Make a small fraction of the user’s queries redundant.
2. Measure the similarity between each redundant query pair.
3. Compare the distribution of the resulting similarity scores against a known distribution of truly similar pairs of the same type of data.
For step (2), we compute the similarity between utterance pairs using an acoustic similarity measure.
However, this model is general: different similarity measures can be used according to the subject
domain. For example, in a sentiment labeling scenario, the user-supplied labels could be compared for
semantic similarity (“surprised” and “shocked”). In an indoor localization scenario, it could be used to
test the quality of user-supplied radio-frequency fingerprints [20].
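We do not prescribe a particular acoustic similarity measure here; as one hedged example of how such a measure could be computed, the sketch below extracts MFCC features from two utterances and aligns them with dynamic time warping, mapping the alignment cost to a score in (0, 1]. The sampling rate, feature dimensionality, and cost-to-similarity mapping are illustrative assumptions.

```python
import numpy as np
import librosa
from scipy.spatial.distance import cdist

def mfcc_features(path, n_mfcc=13):
    """Load an utterance and return its MFCC frames (frames x coefficients)."""
    y, sr = librosa.load(path, sr=8000)   # telephone-bandwidth audio
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def dtw_cost(a, b):
    """Dynamic-time-warping alignment cost between two feature sequences."""
    dist = cdist(a, b, metric="euclidean")
    acc = np.full((len(a) + 1, len(b) + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    # Normalize by path length so long utterances are not penalized.
    return acc[len(a), len(b)] / (len(a) + len(b))

def utterance_similarity(path_a, path_b):
    """Map the DTW cost to a similarity score in (0, 1]; higher is more alike."""
    cost = dtw_cost(mfcc_features(path_a), mfcc_features(path_b))
    return 1.0 / (1.0 + cost)
```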
Figure 3 shows the distribution we used for step (3). The comparison can be performed simply by
contrasting the median of the output distribution against a threshold, set according to the reference
distribution. An alternative, more robust method is to apply a standard statistical test, such as Student’s t-test, which indicates whether the output sample is likely to have been generated from the reference distribution.
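For concreteness, both decision rules can be sketched as below, assuming a reference set of similarity scores from known-matching pairs (the “agreement” distribution of Figure 3); the threshold and significance level are illustrative assumptions rather than deployed settings.

```python
import numpy as np
from scipy import stats

def valid_by_median(session_scores, threshold=0.6):
    """Accept if the median duplicate-pair similarity clears a threshold chosen
    from the reference agreement distribution (illustrative value)."""
    return float(np.median(session_scores)) >= threshold

def valid_by_ttest(session_scores, reference_agreement_scores, alpha=0.05):
    """Accept if a two-sample t-test cannot reject the hypothesis that the
    session's scores were drawn from the reference agreement distribution."""
    _, p_value = stats.ttest_ind(session_scores, reference_agreement_scores,
                                 equal_var=False)
    return p_value >= alpha
```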
The outcome of the intra-session agreement test is that the user-supplied content is determined to be
likely valid or likely invalid. This approach does not catch low-quality or erroneous samples as long as the user is consistent. In particular, a malicious user can defeat this type of test relatively easily, but this problem can be overcome by combining intra-session agreement testing with other approaches.
In addition to determining whether a specific session was valid, we consider a user’s credibility score
as in Sarmenta [17]. We let a user’s credibility vary over time with an exponentially-weighted moving
average, such that the user’s recent historical information can be integrated into each session evaluation.
We then send a session on to the second pass when the overall credibility score is above a threshold. This
allows users to be paid even when they have provided invalid data in the past, while CX remains cautious about such users.
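A minimal sketch of this credibility update follows; the smoothing weight and acceptance threshold are illustrative assumptions.

```python
def update_credibility(previous, session_score, weight=0.3):
    """Exponentially-weighted moving average over per-session validity scores
    (e.g., the fraction of duplicate pairs judged similar)."""
    return weight * session_score + (1.0 - weight) * previous

def forward_to_second_pass(credibility, threshold=0.7):
    """Send the worker's sessions on for slower, more precise filtering only
    while overall credibility stays above a threshold (illustrative value)."""
    return credibility >= threshold
```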
3.2 Gold Standard Comparisons
Another approach is to compare a sample of user annotations with gold standard data, generated by
experts. In our framework, a sample of the utterances provided in the session may be compared against
gold standard pronunciation. This forms a tighter control over the session’s quality, as beyond consistency, the produced utterances are evaluated directly against a reference acoustic signal. However, we
conjecture that this approach may be suboptimal for the speech domain, as it makes a strong assumption
that there is high similarity between the user’s pronunciation and the gold standard. This assumption
may be false if the worker uses a particular dialect that is not represented by the gold standard. While
it is in our interest to discard irrelevant samples, valid outliers due to dialects are important for system
coverage. Snow et al. and others have also based tests on comparison to a gold standard [19].
As we build up utterances that come from users with high credibility, we can construct an acoustical
model of each word that is based on multiple inputs from both experts and credible users. Such a model forms a broader standard that can be used as in the previous approach, but with higher coverage.
To make this validation step quick, we would only pick a small random sample of the user’s utterances.
If both this test and the intra-session agreement test succeed, the session will be accepted. We plan to
implement this combined approach in the future, once the size of our user base scales up.
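A rough sketch of this planned combination is shown below: a small random sample of the session’s utterances is spot-checked against the gold-standard (or broader reference) recordings, and the session is accepted only if both tests pass. The sample size, similarity threshold, and helper names are assumptions for illustration.

```python
import random

def gold_standard_check(recordings, gold, similarity, sample_size=4, min_sim=0.5):
    """Spot-check a small random sample of (word, utterance) pairs against the
    reference recording of each word; require a majority to be similar enough."""
    sample = random.sample(recordings, min(sample_size, len(recordings)))
    passed = sum(similarity(utt, gold[word]) >= min_sim for word, utt in sample)
    return passed > len(sample) // 2

def accept_session(recordings, duplicate_scores, gold, similarity, intra_session_test):
    """Accept only if both the intra-session agreement test and the
    gold-standard spot check succeed."""
    return (intra_session_test(duplicate_scores)
            and gold_standard_check(recordings, gold, similarity))
```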
4 Prototype
We implemented a prototype of Crowd Translator and used it to collect a small sample of English and
Swahili utterances in Kenya. Our target corpus of a few hundred words uses the vocabulary of a speech-based classifieds application we are deploying in East Africa [14]. We recruited fifteen local workers overall, and each working session comprised 110 prompts. For evaluation purposes, half of the prompts in each session were duplicates; in the experiments, however, as well as in a real deployment, only a small number of duplicates is used.
Because of low levels of literacy among our target population, we do not use prompt lists, even if
they could be sent via SMS. Our subjects hear audio prompts and are asked to repeat what they hear.
While we realize the priming effect this may have on the data we collect, we made the choice to use audio
prompts based on the average literacy level of our intended subjects and the nature of the data being
collected (i.e., the words represent abstract concepts such as “repeat” or “next” and cannot be rendered
through pictograms).
Workers used mobile phones to call into the prototype, which was built using Asterisk [3]. While a
programmable API to credit accounts exists, we have not yet linked our back-end to it. (This, however,
does not appear to be a significant technical hurdle.)
We used the data collected using the prototype to evaluate intra-session agreement and learned that
several aspects of CX need to be altered before the next phase of the project. Overall, based on manual
validation, the data contained 1229 valid utterances and 421 invalid ones. Invalid data occurred because
[Figure 4: histogram of worker utterances (0% to 12% of utterances) versus similarity to the gold standard (0.0 to 1.0), shown separately for manually labeled valid and invalid utterances.]
Figure 4: Screening out invalid data is difficult if only comparing user-supplied input to a known gold
standard. When we manually labeled data as valid or invalid and compared it to the gold standard
utterance, we found that much of the invalid data had a high similarity score. Using a threshold to
determine input validity using this method seems unlikely to give solid results.
users misheard the prompts, did not know when to speak (before or after a beep), or simply said nothing.
We therefore propose adding a training phase to the prototype to improve user success rates.
4.1 Experiments
Our main goal in the experiments is to evaluate automatic methods for validating user inputs. We
first evaluate the common approach of comparing the input utterances against gold standard samples.
Figure 4 shows that there may be little acoustic resemblance between two valid samples of the same
word. This confirms our conjecture that gold standard samples may not account for the variety of
personal pronunciations and accents in the speech domain (see Section 3). While an expanded method
— where the sample is compared to a set comprising the gold standard as well as other presumed valid
samples — may improve validation results, comparing only to the gold standard is not recommended for
bootstrapping a corpus in these settings.
The second measure that we evaluate in our experiments is intra-session agreement. To examine it, we used our acoustical analyzer to compute a similarity score for 16 duplicate pairs per session. To estimate a session’s validity, we sampled from this distribution and either (a)
compared the sample’s median to a threshold or (b) determined whether the sample matched the known
valid intra-session distribution from Figure 3.
Figure 5 displays a receiver operating characteristic (ROC) curve, showing the false positive and false negative rates as we vary the acceptance threshold (for the median comparison) and the ratio of valid-to-invalid distributions (for the t-test comparison). In the figure, a curve closer to the origin is preferable. That
is, Figure 5 confirms that a statistical comparison of the sample distributions performs better than
considering the sample’s median.
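The ROC points for the median rule, for example, can be traced by sweeping the acceptance threshold and comparing each automatic decision against the manual labels; the sketch below shows that sweep (the t-test curve is produced analogously by varying the sampled valid-to-invalid ratio). The helper names are assumptions for illustration.

```python
import numpy as np

def roc_points(sessions, manual_labels, thresholds):
    """sessions: per-session lists of duplicate-pair similarity scores;
    manual_labels: True for sessions judged valid by hand.
    Returns (false positive rate, false negative rate) per threshold."""
    n_valid = sum(manual_labels)
    n_invalid = len(manual_labels) - n_valid
    points = []
    for t in thresholds:
        accepted = [np.median(scores) >= t for scores in sessions]
        fp = sum(a and not lab for a, lab in zip(accepted, manual_labels))
        fn = sum((not a) and lab for a, lab in zip(accepted, manual_labels))
        points.append((fp / max(n_invalid, 1), fn / max(n_valid, 1)))
    return points

# Example: roc_points(scores_per_session, labels, np.linspace(0.0, 1.0, 21))
```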
Several considerations may guide the selection of the specific threshold or valid-to-invalid ratio to be
applied. In particular, we consider trust as a key element of Crowd Translator. We would like to ensure
that people who have supplied it with primarily valid data will be rewarded. A lack of trust will prevent
the users from recommending Crowd Translator to others. Thus, we would prefer a low false negative
rate (i.e., having only a few high-quality sessions rejected) to a low false positive one (i.e., allowing lower
accuracy among the set of accepted sessions), when automatically determining session validity. After
letting some invalid data through the first pass, we can clean it more carefully using slower, more precise
techniques, lowering the false positive rate with the second pass.
By varying the valid-to-invalid ratio, we can model the effect of a user whose utterances are mostly
but not exclusively valid. In our experiments, we assumed a session was valid if 80% of its utterances
were valid. We compare the result of our testing mechanism against the correct decisions, based on the
manual evaluation of the samples. Overall, we observed that, by choosing a moderately lenient threshold,
the automatic evaluation can yield a low (< 5%) false negative rate per session, while limiting the ratio
[Figure 5: ROC curves for the t-test and median methods; x-axis: false positive rate (0.0 to 0.5), y-axis: false negative rate (0.00 to 0.50).]
Figure 5: Comparing to a known distribution of valid utterances using a simple statistical test, such as
the t-test, gives moderately better performance than a simple median threshold. In particular, we do not
want high false negative rates because this would lessen trust in the system. The figure shows the false
positive and false negative rates as we vary the median threshold and the t-test distributions sampled
from, respectively.
of invalid sessions accepted in the first pass to between 25% and 35%. This result should satisfy the trust concerns of workers; a second pass that further filters noisy data may be required for training high-quality speech recognizers.
4.2 Discussion
In addition to including a training phase, we believe that three other changes should improve our prototype. First, instead of a single voice model, we should have at least one male and one female model.
While this may not affect intra-session agreement, we found that it does improve second pass verification in preliminary tests. Broadening the selection of voice talents further — including non-standard
accents, for example — should also reduce the priming effects on the collected corpus. Second, once a
“starter” corpus of valid utterances for a phrase has been collected, new utterances of this phrase can
be cross-validated against them. This is broader than simply comparing against the gold standard. Like
intra-session agreement, this test can be done before paying the user. By rejecting sessions where a
significant fraction of the utterances fail this test, we can avoid accepting sessions where, for example,
the user repeats exactly the same utterance in response to each prompt (a sketch of this check appears below). Third, some volunteers were extremely reluctant to be recorded and required lengthy reassurance that no one outside of
our research group would listen to their recordings (“I don’t like the sound of my voice” said an early
participant from Tanzania). In addition to developing trust in payment, participants must develop trust
that their recordings will be kept private.
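The “starter corpus” cross-validation mentioned above might look roughly like the following sketch, where each new utterance must resemble at least one previously accepted utterance of the same word; the thresholds and helper names are illustrative assumptions.

```python
def cross_validate_session(recordings, starter_corpus, similarity,
                           min_sim=0.5, max_fail_fraction=0.2):
    """recordings: list of (word, utterance); starter_corpus: maps each word to
    its previously accepted utterances. Reject the session when too many
    utterances fail to match anything in the starter corpus."""
    failures = 0
    for word, utterance in recordings:
        pool = starter_corpus.get(word, [])
        if pool and max(similarity(utterance, prev) for prev in pool) < min_sim:
            failures += 1
    return failures <= max_fail_fraction * len(recordings)
```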
5 Conclusion
We presented a method for expanding the number of languages covered by speech recognizers,
enabling new applications in developing regions. We focused on a method for automatically validating
sessions of user-generated content, a new form of intra-annotator agreement. Self-validating content is
useful in contexts where large corpora of human-generated data are required. Instead of having a small
number of people painstakingly generate corpora, we showed how many people could be used to build
them.
A promising direction to pursue in terms of user verification and data quality is to apply learning
techniques such as clustering to find trends in the data. Suppose that inter-user similarity is evaluated,
using duplicate samples across multiple users; given user similarity scores, users may be clustered into
cohesive groups (for example, using methods such as agglomerative clustering). It is reasonable to expect that the groups formed will represent different dialects. The association of a worker with a particular dialect profile
is very valuable, as it may enable control over the distribution of samples selected to train the speech
recognizer. In addition, users who are not found to be tightly related to any of the groups formed by clustering can be flagged as low-quality (noisy) workers.
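As a sketch of this direction, workers could be grouped by average-linkage agglomerative clustering over a pairwise inter-worker similarity matrix (converted to distances for scipy); the distance cutoff and the small-cluster rule for flagging noisy workers are assumptions, not results.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def dialect_groups(worker_similarity, distance_cutoff=0.5):
    """worker_similarity: symmetric matrix of inter-worker similarities in [0, 1],
    e.g., averaged over duplicate prompts shared across workers."""
    distance = 1.0 - np.asarray(worker_similarity, dtype=float)
    np.fill_diagonal(distance, 0.0)
    tree = linkage(squareform(distance, checks=False), method="average")
    return fcluster(tree, t=distance_cutoff, criterion="distance")

def likely_noisy_workers(labels, min_group_size=2):
    """Workers left in very small clusters are flagged as low-quality (noisy)."""
    counts = {g: int(np.sum(labels == g)) for g in set(labels)}
    return [i for i, g in enumerate(labels) if counts[g] < min_group_size]
```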
In the future, we aim to deploy Crowd Translator in several countries in East Africa. We hope to
create recognizers for at least five new languages in the next year and to make them available for phone-accessible and on-device applications.
Acknowledgements
We thank: Elishibah Msengeti for being our voice model; Nokia Research Center Africa in Nairobi,
Kenya for testing the CX prototype; and Prof. Tavneet Kaur Suri for orchestrating the data collection
process in the field.
References
[1] Amazon Mechanical Turk. http://mturk.com.
[2] D. P. Anderson, J. Cobb, E. Korpela, M. Lebofsky, and D. Werthimer. SETI@home. Communications
of the ACM, 45(11):56–61, 2002.
[3] Asterisk. http://asterisk.org.
[4] J. Bernstein, K. Taussig, et al. MACROPHONE: An American English Telephone Speech Corpus
for the Polyphone Project. In ICASSP, Apr. 1994.
[5] R. Cole, M. Fanty, et al. Telephone speech corpus development at CSLU. In ICSLP, 1994.
[6] P. Donmez, J. G. Carbonell, and J. Schneider. Efficiently Learning the Accuracy of Labeling Sources
for Selective Sampling. In KDD, 2009.
[7] N. Eagle. txteagle: Mobile Crowdsourcing. In HCII, July 2009.
[8] GOOG-411. http://www.google.com/goog411.
[9] E. Hurley, J. Polifroni, and J. Glass. Telephone data collection using the world wide web. In ICSLP,
1996.
[10] A. Kathol, K. Precoda, D. Vergyri, W. Wang, and S. Riehemann. Speech Translation for Low-Resource Languages: The Case of Pashto. In Interspeech, 2005.
[11] A. Kittur, E. H. Chi, and B. Suh. Crowdsourcing user studies with Mechanical Turk. In CHI, Apr.
2008.
[12] I. Kruijff-Korbayová, K. Chvátalová, and O. Postolache. Annotation Guidelines for Czech-English
Word Alignment. In LREC, 2006.
[13] K. Laurila and P. Haavisto. Name dialing: How useful is it? In ICASSP, 2000.
[14] J. Ledlie, N. Eagle, M. Tierney, M. Adler, H. Hansen, and J. Hicks. Mosoko: a Mobile Marketplace
for Developing Regions. In DIS, Feb. 2008.
[15] S. Narayanan et al. Speech Recognition Engineering Issues in Speech to Speech Translation System
Design for Low Resource Languages and Domains. In ICASSP, 2006.
[16] Nuance: OpenSpeech Recognizer. http://nuance.com.
[17] L. Sarmenta. Sabotage-Tolerance Mechanisms for Volunteer Computing Systems. In CCGRID, May
2001.
[18] V. S. Sheng, F. Provost, et al. Get another label? Improving data quality and data mining using
multiple, noisy labelers. In KDD, Aug. 2008.
[19] R. Snow, B. O’Connor, D. Jurafsky, and A. Y. Ng. Cheap and Fast - But is it Good? Evaluating
Non-Expert Annotations for Natural Language Tasks. In EMNLP, Oct. 2008.
[20] S. Teller, J. Battat, B. Charrow, D. Curtis, R. Ryan, J. Ledlie, and J. Hicks. Organic Indoor
Location Discovery. Tech. Report CSAIL TR-2008-075, MIT, Dec. 2008.