Sources
Language Use
This project purposefully avoids the term "Artificial Intelligence", since there is nothing intelligent about these systems. I prefer the terms "machine learning" or "statistics on steroids", but I've settled on "algorithms" here.
Machine Learning models
Almost all the machine learning models used were downloaded "pre-trained" from open source projects I found on GitHub. This was done to make a point: we often say we should improve biased and/or error-prone machine learning models, but the reality is that most organisations don't train their own models. Instead, they use third parties that supply machine learning services, and it's in the interest of these parties to keep their systems as generic and "one size fits all" as possible. And then you have the parties who just implement whatever they can get their hands on, and hope nobody asks difficult questions.
The beauty scoring model was found on GitHub (this or this one). The models to predict age, gender and facial expression/emotion are part of FaceApiJS, which forms the backbone of this project. Do note that its developer doesn't fully disclose which photos the models were trained on. Also, FaceApiJS is bad at detecting "Asian guys".
I actually trained the BMI prediction algorithm myself because I couldn't find any existing models that were small enough to use online. I downloaded all the BMI prediction projects I could find on GitHub, and was astonished to find some of them came with vast troves of photographs. I felt dirty using them, but also felt that revealing what was going on was more important. I've documented some of the dodgy things I discovered along the way in this blogpost.
Videos, screenshots and other visual material
The videos were mainly constructed out of public domain source material from pexels.com and Pixabay.com. To the photographers who so kindly shared their work: thank you all for your wonderful generosity!
Other direct sources were used under the artistic, journalistic and educational copyright exception.
Specific sources
Attractiveness
- TikTok (front page, recorded with a screen recorder). Their practice of showing content from beautiful people is explained here. While I couldn't confirm whether TikTok does this algorithmically, Tinder actually has been on the record about this practice.
- This academic research was the source of the beauty judgement interface. Another source was the SCUT-FBP dataset.
Age
- Innovatrix offers surveillance systems that analyse the demographics of store visitors. The picture was downloaded from their website.
- Tinder (front page, screenshot).
Gender
- Check out this article if you want to understand why an algorithm that tries to sort people into just two categories can upset some people.
Body Mass Index (BMI)
- The BMI prediction project that was created by researchers who work at Google can be found here (not anymore, they deleted it!). Although all signs point to this project being a part of Google's practice, I sent an email to the makers to verify this, and got a quick response that it was a personal project. I then changed the video to say "researchers who work at Google" instead of "Google's health lab in India".
- For more juicy details I refer you to the in-depth blogpost mentioned earlier.
Life expectancy
- There are a lot of projects on GitHub that explore this idea, some of them as part of (Kaggle) contests set up by the insurance industry. After exploring how these worked, I created my own wishy-washy implementation. There is no machine learning involved here, it's just a lookup in a table of life expectancies per country, the average BMI in that country, and then a calculation on how BMI might affect that. There is very little merit in this calculation, but then again that doesn't seem to be a requirement for calling yourself a data scientist.
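A minimal sketch of what such a table lookup could look like. Everything here is an assumption made for illustration: the country codes, the figures in the table, the `YEARS_PER_BMI_POINT` penalty and the function name are hypothetical placeholders, not the table or formula this project actually uses.

```typescript
// Hypothetical sketch: every number below is an illustrative placeholder,
// not the project's real table or formula.
interface CountryStats {
  lifeExpectancy: number; // average life expectancy in years
  averageBmi: number; // average BMI in that country
}

const COUNTRY_STATS: Record<string, CountryStats> = {
  NL: { lifeExpectancy: 82, averageBmi: 25 },
  US: { lifeExpectancy: 78, averageBmi: 29 },
};

// Assumed penalty: years subtracted per BMI point away from the national average.
const YEARS_PER_BMI_POINT = 0.5;

function estimateLifeExpectancy(country: string, bmi: number): number | undefined {
  const stats = COUNTRY_STATS[country];
  if (stats === undefined) {
    return undefined; // country is not in the table
  }
  // Step 1: look up the country's baseline. Step 2: adjust it for BMI.
  const deviation = Math.abs(bmi - stats.averageBmi);
  return stats.lifeExpectancy - deviation * YEARS_PER_BMI_POINT;
}
```

With these placeholder values, a BMI equal to the national average simply returns the table value, which shows how little the "prediction" adds beyond the lookup itself.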
Closer / Face print
- Deepcamai.com is no longer online, perhaps it was removed after the scandal around Rite Aid. It can still be found using the Internet Archive's Wayback Machine. The Chinese parent company Deepcam has a Chinese website. There is also an Australian website. Wait, it turns out the US company has rebranded itself to PDActive. The website is virtually identical to the old deepcamai.com website.
- The visual effect that shows how the video feed is first turned into a mathematical representation based on contrast is built using HOG descriptor. The code was modified to work in the browser.
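To give an idea of what a contrast-based representation means, here is a simplified sketch of the core step of a histogram of oriented gradients (HOG): for one cell of a grayscale image, measure the brightness change at each pixel and add its strength to a bin for the direction of that change. This is not the code of the library used in this project. The names `GrayImage`, `pixelAt` and `cellHistogram` are hypothetical, and block normalisation, which a full HOG implementation includes, is left out.

```typescript
// Simplified HOG sketch (assumption: not the library's actual code).
type GrayImage = number[][]; // rows of pixel intensities, e.g. 0..255

// Read a pixel, clamping coordinates so borders reuse the nearest pixel.
function pixelAt(img: GrayImage, x: number, y: number): number {
  const yy = Math.max(0, Math.min(img.length - 1, y));
  const xx = Math.max(0, Math.min(img[yy].length - 1, x));
  return img[yy][xx];
}

// Histogram of gradient orientations for one square cell of the image.
function cellHistogram(
  img: GrayImage,
  x0: number,
  y0: number,
  cellSize: number,
  bins: number = 9
): number[] {
  const hist: number[] = [];
  for (let i = 0; i < bins; i++) hist.push(0);

  for (let y = y0; y < y0 + cellSize; y++) {
    for (let x = x0; x < x0 + cellSize; x++) {
      // Contrast with the neighbouring pixels, horizontally and vertically.
      const gx = pixelAt(img, x + 1, y) - pixelAt(img, x - 1, y);
      const gy = pixelAt(img, x, y + 1) - pixelAt(img, x, y - 1);
      const magnitude = Math.sqrt(gx * gx + gy * gy);

      // Unsigned orientation in degrees, folded into [0, 180).
      let angle = (Math.atan2(gy, gx) * 180) / Math.PI;
      if (angle < 0) angle += 180;
      if (angle >= 180) angle -= 180;

      const bin = Math.min(bins - 1, Math.floor(angle / (180 / bins)));
      hist[bin] += magnitude; // stronger contrast counts for more
    }
  }
  return hist;
}
```

Calling `cellHistogram` on an image that is dark on the left and bright on the right puts all the weight in the first bin, because every brightness change runs horizontally. A face becomes a grid of such histograms, which is all the later steps get to see.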
Emotions
Surfing behaviour
- Visual Website Optimizer - Video source
- HotJar. Learn more about the screen recording feature here.
- Mouse movement was recorded using a small script called Wix client recorder (MIT license). As always, this runs in your own browser - the recording never leaves your computer, and will be gone as soon as you close the browser window.
Tip: you can protect yourself from these practices by installing browser addons such as uBlock Origin and uMatrix. Also check out Privacy Badger, HTTPS Everywhere, and Decentraleyes.
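To illustrate what a recording that never leaves your computer amounts to, here is a minimal sketch of an in-memory mouse recorder. It is not the API of the Wix client recorder script mentioned above. The class `PointerTrail`, its methods and the sample limit are hypothetical names chosen for this example.

```typescript
// Hypothetical in-memory recorder: samples live only in this object,
// so they disappear when the page (and with it the object) is gone.
interface PointerSample {
  x: number; // horizontal position in pixels
  y: number; // vertical position in pixels
  t: number; // timestamp in milliseconds
}

class PointerTrail {
  private samples: PointerSample[] = [];
  private maxSamples: number;

  constructor(maxSamples: number = 1000) {
    this.maxSamples = maxSamples;
  }

  // Store one position, dropping the oldest sample once the limit is reached.
  record(x: number, y: number, t: number): void {
    this.samples.push({ x, y, t });
    if (this.samples.length > this.maxSamples) {
      this.samples.shift();
    }
  }

  // Return a copy so that a replay cannot alter the recording.
  replay(): PointerSample[] {
    return this.samples.slice();
  }

  clear(): void {
    this.samples = [];
  }
}

// In a browser this could be wired up as follows (left as a comment so the
// sketch also runs outside a browser):
//   const trail = new PointerTrail();
//   document.addEventListener("mousemove", (e) =>
//     trail.record(e.clientX, e.clientY, performance.now()));
```

Nothing in this sketch sends data anywhere. Commercial session recorders differ at exactly that point: they upload the samples to a server.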
Conclusion
- Jon Penney researched how Wikipedia was used after the Snowden revelations, and noticed that pages about sensitive topics such as terrorism were visited less. Later research strengthened the notion that this was caused by self-censorship.
- You may also enjoy the follow-up game I made: AreYouYou.eu.
EU funded
This project has received funding from the European Union’s Horizon 2020 research and innovation programme, under grant agreement No 786641.