Pseudo AI Bias
CREATE for STEM Institute, Michigan State University.
Preface
Pseudo Artificial Intelligence bias (PAIB) is broadly present in the literature. It can result in unnecessary AI fear in society, exacerbate the enduring inequities and disparities in accessing and sharing the benefits of AI applications, and waste the social capital invested in AI research. This study systematically reviews publications in the literature to present three types of PAIBs: those arising from misunderstandings of bias, from pseudo-mechanical bias, and from over-expectations of AI capability. To address PAIBs, we recommend certifying users for AI applications to mitigate AI fears, providing customized user guidance for AI applications, and developing systematic approaches to monitor bias.
Pseudo Artificial Intelligence bias (PAIB) can result in unnecessary AI fear in society,
exacerbate the enduring inequities and disparities in accessing and sharing the benefits of AI
applications, and waste social capital invested in AI research. As AI bias increasingly draws
attention in every sector of society 1, concerns about the attribution of bias in AI continue
to increase. AI predictions may systematically deviate from the ground truth, generating results
either in favor of or against certain groups. Such effects can be detrimental and result in social
consequences that have drawn concerns about the use of AI since it has begun to be broadly
applied in everyday life 2,3. Although AI biases may result from deficient design, skewed training data, confounding of the algorithmic models, or limits of the algorithms' computational capacity 4, some attributed biases are not genuine AI biases at all. It can be even more complicated if attributions of biases to AI are artificially crafted, which has been broadly documented in the literature. This may be particularly true if understudied “biases” are arbitrarily attributed to AI only because users misunderstand the nature of bias, misapply the algorithms, or hold over-expectations of AI predictions. This review presents three types of PAIBs identified in the
literature and discusses the potential impacts on society and probable solutions.
Misunderstanding of Bias
Many biases attributed to AI result from users’ misunderstanding of bias. Bias is broadly referred to in both social science and natural science, with notable commonalities. We reviewed the definitions
of bias in the literature in both areas (see Supplementary Table S1) and found three critical
properties: (a) deviation – bias measures the deviation between observations and ground truth
(i.e., error); (b) systematic – bias refers to systematic error instead of random error; and (c)
tendency -- bias is a tendency to favor or disfavor some ideas or entities over others. These three
properties characterize the idea of AI bias and are fundamental to uncovering PAIBs.
Error versus bias. Errors can be a bias only if they happen systematically. In scientific research, errors are unavoidable in observation, measurement, prediction, classification, etc. Therefore, instead of attempting to eliminate errors, researchers put effort
into reducing errors using various methods, such as refining measurement tools, averaging
observations, or improving prediction algorithms. In contrast, the elimination of bias can occur
once its root is identified and correct solutions are implemented 5. Therefore, bias is more
troublesome to humans than errors are. It is critical to correctly identify bias and eliminate it.
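To make the distinction concrete, the short sketch below (in Python, using synthetic numbers rather than data from any cited study) contrasts the two ideas: the mean signed deviation of predictions from the ground truth captures systematic bias, while the spread of the deviations captures random error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic ground truth and predictions for two groups (illustrative numbers only).
truth_a = rng.normal(50, 5, 1000)
truth_b = rng.normal(50, 5, 1000)
pred_a = truth_a + rng.normal(0, 2, 1000)   # random error only: deviations center on zero
pred_b = truth_b + rng.normal(3, 2, 1000)   # random error plus a systematic +3 shift

def deviation_profile(pred, truth):
    """Return the mean signed deviation (bias) and its spread (random error)."""
    dev = pred - truth
    return dev.mean(), dev.std()

for name, pred, truth in [("group A", pred_a, truth_a), ("group B", pred_b, truth_b)]:
    bias, noise = deviation_profile(pred, truth)
    print(f"{name}: mean deviation {bias:+.2f}, spread {noise:.2f}")
# group A: mean deviation near 0  -> error, but no evidence of bias
# group B: mean deviation near +3 -> systematic deviation, i.e., bias
```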
Examples in the literature are broadly cited as evidence of AI bias and of AI's harmfulness. However, some of them may be examples of errors rather than bias. The most widely cited is the example of humans being mistaken for gorillas 6. In a 2015 Twitter post, Jacky Alcine shared a screenshot in which a face-recognition algorithm tagged him and another African American as gorillas. This case shows evidence of a serious problem with the algorithm's accuracy.
However, although we find this example deplorable, numerous authors who cited this case as “AI bias” did not provide sufficient evidence that this terrible misidentification resulted from inherent AI bias -- a single case is insufficient to establish that the errors happen “systematically” for a specific group. Other examples include face recognition software labeling a Taiwanese American as “blinking” 7 and an automated passport system not being able to detect whether an individual from Asia had his eyes open for a passport picture 8. These cases reported AI errors caused by human programming of the algorithms that need to be addressed, rather than AI bias, because there was no evidence that the errors happened systematically to specific groups. As mentioned above, we see these errors as deplorable, and we need to work hard to ensure these errors do not happen; but in the world of programming, “bugs” happen, and we need to make sure that public claims of bias are made with evidence. Currently,
society invests trillions of dollars in technology development every year, while underrepresented groups have been found less likely to benefit from these innovations because of fear of these technologies that resulted from human error and from bias engrained in the technology. The unfairness could be even worse if such extended AI fear falls disproportionately on underrepresented groups. Therefore, researchers should use these mislabeled alerts to further examine potential bias due to programming and to increase the accuracy of the algorithms.
Favoring percentage versus ground truth. Amazon's AI recruiting tool is a broadly cited case for evidence of "AI bias." According to Dastin 9, this tool favored male applicants over female applicants, leading many to believe that the software was biased against women; the tool was eventually disbanded by Amazon. Although this case is widely cited to demonstrate that AI is biased against females, research has seldom referred to the prediction “deviations” from the ground truth -- the
outcomes based on recruiting criteria. A higher percentage favoring male candidates than
females can hardly support a valid conclusion about AI bias before we know the actual numbers
of applicants by gender and the false prediction cases. This PAIB results from a lack of
information-- the gender breakdown of Amazon’s technical workforce, according to Dastin, was
not even disclosed at that time. Therefore, it is problematic to claim AI bias until more data become available and a deep investigation of gender parity can verify whether this case reflects a PAIB.
This type of PAIB harms people’s trust in AI. The higher predicted favoring percentage toward males is hard to interpret as discrimination before we know the “deviations.” As argued by Howard and
Borenstein 10, favoring a group might also be justified if the evidence supports the facts. For
example, charging teenagers more for car insurance seems “discriminating” but arguably
justified given that evidence shows a higher risk of accidents for teenage drivers. A higher
charge for teenage drivers may prevent unnecessary risky driving by teenagers. A key to
determining AI bias is examining the ground truth and comparing it with the predicted outcomes.
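A hedged numerical sketch can make this point concrete. The counts below are entirely hypothetical (they are not Amazon's figures); they show that the applicant numbers by gender and the false-prediction cases relative to the ground truth are what a bias claim must examine, whereas the raw number of favorable outcomes alone is not informative.

```python
# Hypothetical counts, for illustration only.
applicants = {"male": 800, "female": 200}
qualified  = {"male": 400, "female": 100}   # ground truth per the recruiting criteria
selected   = {"male": 380, "female": 95}    # algorithm's positive predictions
selected_and_qualified = {"male": 360, "female": 90}

for g in applicants:
    selection_rate = selected[g] / applicants[g]
    # Deviation from ground truth: qualified applicants the algorithm missed,
    # and unqualified applicants it selected.
    false_negative_rate = 1 - selected_and_qualified[g] / qualified[g]
    false_positive_rate = (selected[g] - selected_and_qualified[g]) / (applicants[g] - qualified[g])
    print(f"{g}: selected {selection_rate:.0%} of applicants, "
          f"FNR {false_negative_rate:.0%}, FPR {false_positive_rate:.0%}")

# Far more men than women receive offers in absolute terms simply because far more applied;
# the per-group error rates against the ground truth are identical in this toy example.
```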
Fairness versus bias. Accuracy is the primary measure for algorithm predictions, while bias is
the extent to which the results are systematically and purposely distorted. Therefore, computer
scientists pursue high accuracy and avoid bias. Fairness, however, is beyond measurement. It is a
social connotation incorporating factors such as access to social capital and the purpose of
fairness 11. A broadly cited example is Equality and Equity (see Fig. 1). When asked which
picture represents the best practice of fairness, an educator might point to the right picture because every child gets to watch the game. Interestingly, when an Asian baseball player was shown this picture, he said he was never given equal opportunity to play in the game. Should athletes be granted equal opportunity? It is hard to argue that equity is closer to fairness than equality in every case. With such a complex societal concept, it is difficult to demand that an algorithm achieve fairness goals.
Fig. 1 Fairness: equality (left) and equity (right).
Nevertheless, our research shows that AI was held fully accountable for fairness in some cases.
One example is the Correctional Offender Management Profiling for Alternative Sanctions
(COMPAS), an algorithm used to predict the risk of recidivism, the tendency of a convicted
criminal to re-offend. According to a report 12, COMPAS achieved equal prediction accuracy for
both black and white defendants. That is, a risk score that is correct in equal proportion for both
groups. However, the report also suggests that black defendants were twice as likely to be
incorrectly labeled as higher risk than white defendants. Conversely, white defendants labeled as low risk were more likely to be charged with new offenses than black
defendants with similar scores. According to research from four independent groups13, to satisfy
both accuracy and “fairness” is mathematically impossible given that Blacks were re-arrested at
a much higher rate than Whites. However, because the purpose of the risk score is to help the
judge make decisions in court without considering the defendants' race, COMPAS’s performance
satisfied the cornerstone criteria 13. This case illustrates how AI algorithms can be overly tasked with fairness, which could increase suspicion of AI applications in societal issues. COMPAS raised fairness concerns, but the predicted scores were equally accurate for both ethnic groups. Confusing bias with fairness is detrimental to the broad use of AI in solving problems. “Statistics and machine learning are not all-powerful.14” It is the users’ responsibility to clearly distinguish between bias and fairness when evaluating AI applications.
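A small worked example (hypothetical counts chosen only to mirror the logic of the impossibility result, not the actual COMPAS data) shows how a score can be equally well calibrated for two groups yet yield unequal false-positive rates whenever the underlying re-arrest rates differ.

```python
def rates(tp, fp, tn, fn):
    """Calibration (precision of the 'high risk' label) and false-positive rate."""
    ppv = tp / (tp + fp)          # of those labeled high risk, the fraction who re-offend
    fpr = fp / (fp + tn)          # of those who do NOT re-offend, the fraction labeled high risk
    return ppv, fpr

# Hypothetical groups with different base rates of re-offense (60% vs 30%),
# scored by a classifier that is equally well calibrated for both.
group_1 = dict(tp=360, fp=120, tn=280, fn=240)   # base rate 60%
group_2 = dict(tp=180, fp=60,  tn=640, fn=120)   # base rate 30%

for name, g in [("group 1", group_1), ("group 2", group_2)]:
    ppv, fpr = rates(**g)
    print(f"{name}: PPV {ppv:.0%}, FPR {fpr:.0%}")
# Both groups: PPV = 75% (equal calibration), but FPR = 30% vs about 9%.
# With unequal base rates, equal calibration forces unequal error rates.
```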
Pseudo-mechanical Bias
In many cases, bias is created because of users’ ignorance of critical features of the data or inappropriate operations in the algorithm application process, rather than the capacity of the algorithms. Because this type of bias is not due to the algorithm mechanism, we refer to it as pseudo-mechanical bias. It is feasible to avoid pseudo-mechanical bias as long as users appropriately apply the AI algorithms.
Emergent error. Emergent error bias results from the out-of-sample input. It is widely assumed
that AI algorithms can best predict samples close to the training data. In reality, this is not always
the case, as ideal training data are challenging to collect. In this sense, users must be aware of the
limitations of AI algorithms to avoid emergent bias. This type of error frequently appears in
clinical science, where it is attributed to distributional shift 15. That is, prior experience may not be adequate for new situations because of “out of sample” input, no matter how experienced the “doctors” are. Algorithms are especially weak in dealing with distributional shifts. Therefore, it is critical for users of AI systems to be aware of these limitations and to apply AI within them.
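One practical safeguard is sketched below under simplifying assumptions: a crude per-feature z-score check against the training distribution flags out-of-sample inputs before a prediction is trusted. Real clinical deployments would use more principled out-of-distribution detection; the function names here are illustrative and not taken from any particular library.

```python
import numpy as np

def fit_reference(train_X):
    """Store the per-feature mean and std of the training data as a reference distribution."""
    return train_X.mean(axis=0), train_X.std(axis=0) + 1e-9

def is_out_of_sample(x, mean, std, z_threshold=4.0):
    """Crude check: any feature more than `z_threshold` standard deviations from the training mean."""
    z = np.abs((x - mean) / std)
    return bool((z > z_threshold).any())

rng = np.random.default_rng(1)
train_X = rng.normal(0, 1, size=(500, 4))   # stand-in for the training cohort
mean, std = fit_reference(train_X)

in_dist = rng.normal(0, 1, size=4)          # resembles the training data
shifted = rng.normal(6, 1, size=4)          # distributional shift

for name, x in [("in-distribution input", in_dist), ("shifted input", shifted)]:
    if is_out_of_sample(x, mean, std):
        print(f"{name}: flag for human review before using the prediction")
    else:
        print(f"{name}: prediction can be used")
```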
Distinct types of data for training and testing. Currently, supervised machine learning requires
using the same type of data for training and testing. Cheuk argues that algorithms would fail to
respond to questions of one language if trained using a different language 16. In the example she
raised, an AI trained in English was asked by a person of Chinese heritage, “Have you eaten?” in Chinese (i.e., 你吃了吗). The author suspected that the algorithm would not be able to respond to this question as the Chinese speaker expected. Instead, the computer might respond literally, “No, I haven’t eaten,” because the computer might interpret the question as the asker caring
about its diet. However, in Chinese culture, the question “Have you eaten?” is equivalent to “How are you?” in English. The example is valid, but it is unrealistic for an English-speaking Chinese person to ask the AI questions and expect the answers to follow Chinese
culture. In other words, it is the users’ responsibility to be aware that the algorithm was trained in
English and expect the response to follow the same conventions in English. These types of
situations, in which the algorithm is incapable of interpreting the input as intended, could lead to perceived bias because the AI was never trained with that type of data.
Distinct feature distribution among groups. In a recent study, Larrazabal and colleagues17
examined the gender imbalance of algorithm capacity to diagnose disease based on imaging
datasets. They found that for some diseases like Pneumothorax, the algorithms consistently
performed better on males than females regardless of how researchers tried to improve the
training set. This unintended error was created because patients have unique biological features
that prevent algorithms from performing equally based on the gender of the patient. For
instance, females’ breasts occluded the imaging of the organs affected by the disease when using x-ray technology to collect the imaging data of the diseased organs, resulting in poorer
performance of the algorithms when identifying the disease. If users apply the algorithms without being aware of such distinct feature distributions among groups, PAIBs may result.
Disconnected interpretation of labels. If users interpret the predictions differently from the information encapsulated in the training data, it could generate PAIBs. For example, Obermeyer and colleagues19 examined the most extensive commercial
medical risk-prediction system, which is applied to more than 200 million people in the US each
year. The system was trained using patients’ expenses and was supposed to predict the ideal
solutions for high-risk patients. The authors found the “ideal solutions” to be discriminatory -- Black
patients with the same (predicted) risk scores as White patients tend to be sicker, resulting in
unequal solutions provided to patients of color. In this case, the “predicted risk score” was
generated based on past medical expenses, which is a combination of both the risk and the
affordability of medical service. If doctors interpret this score purely as an indicator of risk when providing medical care, it would likely put certain groups’ lives at risk (e.g., Black patients, who might generate lower medical expenses). Attributing this bias to the AI is problematic because the perceived bias resulted from doctors’ misinterpretation of the risk score -- disregarding the fact that the scores
reflect not only the severity of illness but also the affordability of the medical service provided to
the patients. Obermeyer and colleagues19 attributed the medical bias to label choice, fundamentally reflecting a disconnect between the interpretation of the predictions (i.e., the risk of illness) and the information encapsulated in the training samples (i.e., both the risk of illness and the affordability of medical care). They also provided empirical evidence that altering the labels substantially reduced the disparity.
Fig. 2. Pseudo-mechanical bias. ① indicates emergent bias, where testing data include samples not included in the training data. ② indicates distinct types of data for training and testing (e.g., Chinese vs. English). ③ indicates distinct samples for training and testing; samples within the dashed box have the illness, and gray indicates that the illness was predicted by the algorithm. ④ indicates disconnected interpretations of labels.
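The label-choice problem can be illustrated with a small synthetic simulation (the numbers below are invented for illustration and are not the data of Obermeyer and colleagues): when cost is used as the training label, a group that generates less expense at the same level of illness ends up both under-selected and sicker at any given score.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

# Synthetic illustration of label choice: two groups with identical distributions of true
# illness severity, but group 1 generates less cost per unit of illness (lower affordability).
group = rng.integers(0, 2, n)
severity = rng.gamma(2.0, 1.0, n)                 # true health need, same for both groups
spend_factor = np.where(group == 0, 1.0, 0.6)
cost = severity * spend_factor + rng.normal(0, 0.1, n)

# Using cost as the "risk score" (standing in for a model trained on the cost label):
# patients flagged for extra care are those with the highest scores.
threshold = np.quantile(cost, 0.9)
flagged = cost >= threshold
for g in (0, 1):
    mean_severity = severity[flagged & (group == g)].mean()
    share = (flagged & (group == g)).sum() / flagged.sum()
    print(f"group {g}: {share:.0%} of flagged patients, mean severity {mean_severity:.2f}")
# Group 1 is both under-represented among flagged patients and sicker when flagged.
# Training on severity itself (an alternative label) removes both gaps in this simulation.
```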
Over Expectation
In many cases, over-expectations can result in PAIBs. For example, Levin 20 reported that in a
beauty contest judged by Beauty.AI, 44 winners were nearly all white, with a limited number of
Asian and dark-skinned individuals. Many point to this case as evidence of AI bias. What makes it problematic is that the criteria for beauty are perceptual and vary across people -- some care more about color, while others care more about symmetry. It is challenging even for humans to reach a consensus criterion. How could we expect AI to identify beauty winners who satisfy every human individual? AI reflects the judgments it was trained on, instead of being some intelligence that can overcome diversity dilemmas. In other words, Beauty.AI learned from human-labeled examples and thus can only be expected to perform as what was learned. The bias does not rest in the AI algorithms but in the humans who developed the algorithm. It may be unreasonable to expect the algorithm to perform in a way that satisfies all individuals’ perceptions of beauty.
This review by no means argues that AI biases are not worthy of being addressed. Indeed, we contend that understanding the mechanisms and potential biases of AI is essential. As the dual use of AI is increasingly recognized in the field, researchers have the responsibility to examine the ethics of AI before it becomes mainstream in our society. In this review, our main concern focused on
whether AI biases are correctly identified and appropriately attributed to AI. We focus this
review on PAIBs as they are societally consequential. The public needs to realize that PAIBs
result from users’ or developers’ misunderstanding, mis-operations, or over-expectations, instead
of the fault of the AI. Our position is that humans create algorithms and train machines. The
biases reflect human perspectives and human errors. Realizing the many kinds of PAIBs and
how humans are at the center of them could help limit societal fear of AI.
Human fear of AI emerged earlier and faster than the progress in AI research because of numerous science fiction stories that shaped human thinking about AI. The history of technophobia can be traced back to the 1920s, when the play Rossum’s Universal Robots presented scenes of robots rising against humans 18. However, such fear did not draw much attention until the recent flood of AI developments, which refreshed humans’ fear of AI gleaned from these science fiction stories.
Cave and Dihal analyzed 300 documents and identified four types of AI fears: inhumanity,
obsolescence, alienation, and uprising 21. Because many AI fears are rooted in artificial scenes
presented in science fiction books or movies 22, it is unsurprising that part of the fears might be
artificial. A survey of more than 1,000 UK citizens reveals that the most common perceptions of AI evoke more concern than excitement 23.
The social consequences can be even worse if research amplifies the factors that yield such fears --
factors such as PAIBs. The users and developers of AI need to take responsibility to avoid
PAIBs by clearly understanding the limitations of machine algorithms and the mechanisms of AI
biases. Researchers, for their part, have the responsibility to guide the use of AI applications
and avoid extending PAIBs in academic work. Researchers need to put effort into examining and
solving true AI biases instead of PAIBs. To deal with PAIBs, we recommend the following:
Certify users for AI applications to mitigate AI fears. PAIBs resulted from users’ unfamiliarity with AI mechanics, as well as from misunderstandings of bias, a concept shaped by both measurement and social connotations. To mitigate AI fears, users need professional education. It may be worthwhile to provide certifications to AI users such as doctors, judges, etc. 24. Such professional education could reduce the PAIBs that arise from misunderstanding and misuse.
Provide customized user guidance for AI applications. Some PAIBs developed due to the
diverse configurations of the AI algorithms and the complex conditions of applying AI. For
example, some algorithms might perform differently for a certain group under given conditions. Unless users know of these constraints, they are not likely to employ the algorithms appropriately for their intended purposes. Customized user guidance can make such constraints explicit.
Develop systematic approaches to monitor bias. The best approach to clarify PAIBs will stem
from the development of tools and criteria to systematically monitor bias in algorithms. Such an
approach could ease the public worry about AI applications and prevent biases in the first place.
Aligned with this goal, the Center for Data Science and Public Policy created Aequitas, an open-source toolkit to audit machine learning algorithms for bias 25. The toolkit can audit
two types of biases: 1) biased actions or interventions that are not allocated in a way that is
representative of the population and 2) biased outcomes through actions or interventions that
result from the system being wrong about certain groups of people.
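The sketch below is not the Aequitas API itself; it is a minimal pandas illustration, under the stated assumptions about column names, of the two kinds of audits described above: whether interventions are allocated in proportion to each group's share of the population, and whether the system's errors fall disproportionately on particular groups.

```python
import pandas as pd

# Toy audit data: one row per individual, with the model's decision ("score"),
# the observed outcome ("label_value"), and a protected attribute ("group").
df = pd.DataFrame({
    "group":       ["a"] * 6 + ["b"] * 4,
    "score":       [1, 1, 1, 0, 0, 0, 1, 0, 0, 0],   # 1 = action/intervention taken
    "label_value": [1, 1, 0, 0, 1, 0, 1, 1, 0, 0],   # 1 = outcome actually occurred
})

rows = []
for name, g in df.groupby("group"):
    negatives = g[g["label_value"] == 0]
    positives = g[g["label_value"] == 1]
    rows.append({
        "group": name,
        # 1) Are interventions allocated in proportion to the group's share of the population?
        "share_of_population": len(g) / len(df),
        "share_of_interventions": g["score"].sum() / df["score"].sum(),
        # 2) Are the system's errors concentrated in this group?
        "false_positive_rate": (negatives["score"] == 1).mean(),
        "false_negative_rate": (positives["score"] == 0).mean(),
    })

print(pd.DataFrame(rows).round(2))
```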
AI has the potential to serve as a viable tool and partner for many professions, including
medicine and education. However, AI will never reach its potential unless researchers eliminate
sources of bias and the public comes to know that many biases result from human errors in
operating algorithms and in misinterpreting the results of AI. Although dealing with AI bias is critical and should draw substantial attention, overselling PAIBs can be detrimental, particularly to those who have fewer opportunities to learn about AI, and may exacerbate the enduring inequities and disparities in accessing and sharing the benefits of AI applications.
References
1 Zhai, X., Krajcik, J. & Pellegrino, J. On the validity of machine learning-based Next
Generation Science Assessments: A validity inferential network. Journal of Science
Education and Technology 30, 298-312, doi:10.1007/s10956-020-09879-9 (2021).
2 Nazaretsky, T., Cukurova, M. & Alexandron, G. An Instrument for Measuring Teachers’
Trust in AI-Based Educational Technology. (2021).
3 Kapur, S. Reducing racial bias in AI models for clinical use requires a top-down
intervention. Nature Machine Intelligence 3, 460-460 (2021).
4 Zou, J. & Schiebinger, L. AI can be sexist and racist—it’s time to make it fair. Nature
(2018).
5 McNutt, M. Implicit bias. Science 352, 1035-1035, doi:10.1126/science.aag1695
(2016).
6 Pulliam-Moore, C. Google photos identified black people as ‘gorillas,’ but racist
software isn’t new, <https://fusion.tv/story/159736/google-photos-identified-black-
people-as-gorillas-but-racist-software-isnt-new/amp/> (2015).
7 Rose, A. in Time (2010).
8 Griffiths, J. New Zealand passport robot thinks this Asian man’s eyes are closed (2016).
9 Dastin, J. Amazon scraps secret AI recruiting tool that showed bias against women. San
Fransico, CA: Reuters. Retrieved on October 9, 2018 (2018).
10 Howard, A. & Borenstein, J. The ugly truth about ourselves and our robot creations: the
problem of bias and social inequity. Science and engineering ethics 24, 1521-1536
(2018).
11 Teresi, J. A. & Jones, R. N. in APA handbook of testing and assessment in psychology,
Vol. 1: Test theory and testing and assessment in industrial and organizational
psychology. Ch. 008, 139-164 (2013).
12 Angwin, J. & Larson, J. Bias in criminal risk scores is mathematically inevitable,
researchers say. ProPublica, December 30 (2016).
13 Feller, A., Pierson, E., Corbett-Davies, S. & Goel, S. A computer program used for bail
and sentencing decisions was labeled biased against blacks. It’s actually not that clear.
The Washington Post 17 (2016).
14 Kusner, M. J. & Loftus, J. R. (Nature Publishing Group, 2020).
15 Challen, R. et al. Artificial intelligence, bias and clinical safety. BMJ Quality & Safety
28, 231-237 (2019).
16 Cheuk, T. Can AI be racist? Color‐evasiveness in the application of machine learning to
science assessments. Science Education 105, 825-836 (2021).
17 Larrazabal, A. J., Nieto, N., Peterson, V., Milone, D. H. & Ferrante, E. Gender imbalance
in medical imaging datasets produces biased classifiers for computer-aided diagnosis.
Proceedings of the National Academy of Sciences 117, 12592-12594 (2020).
18 McClure, P. K. “You’re fired,” says the robot: The rise of automation in the workplace,
technophobes, and fears of unemployment. Social Science Computer Review 36, 139-156
(2018).
19 Obermeyer, Z., Powers, B., Vogeli, C. & Mullainathan, S. Dissecting racial bias in an
algorithm used to manage the health of populations. Science 366, 447-453 (2019).
20 Levin, S. A beauty contest was judged by AI and the robots didn't like dark skin,
<https://www.theguardian.com/technology/2016/sep/08/artificial-intelligence-beauty-
contest-doesnt-like-black-people> (2016).
21 Cave, S. & Dihal, K. Hopes and fears for intelligent machines in fiction and reality.
Nature Machine Intelligence 1, 74-78 (2019).
22 Bohannon, J. (American Association for the Advancement of Science, 2015).
23 Cave, S., Coughlan, K. & Dihal, K. in Proceedings of the 2019 AAAI/ACM Conference
on AI, Ethics, and Society. 331-337.
24 Gent, E. AI: Fears of'playing God'. Engineering & Technology 10, 76-79 (2015).
25 Saleiro, P., Rodolfa, K. T. & Ghani, R. in Proceedings of the 26th ACM SIGKDD
International Conference on Knowledge Discovery & Data Mining. 3513-3514.
Funding: This study was partially funded by the National Science Foundation (NSF) (Awards #2101104 and #2100964) and the National Academy of Education/Spencer Foundation. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funders.
Supplementary Materials:
Table S1 Definitions of bias in authoritative literature of social science and natural science

Oxford English Dictionary: Tendency to favor or dislike a person or thing, especially as a result of a preconceived opinion; partiality, prejudice. (Statistics) Distortion of a statistical result arising from the method of sampling, measurement, analysis, etc.; an instance of this.

Dictionary of Cognitive Science: A systematic tendency to take irrelevant factors into account while ignoring relevant factors (p. 352). Algorithms make use of additional knowledge (i.e., learning biases; p. 228).

The Science Dictionary: In common terms, bias is a preference for or against one idea, thing, or person. In scientific research, bias is described as any systematic deviation between the results of a study and the “truth.”