100 Billion Data Rows Per Second
Analytics dashboard in the offices of Vogue UK. Screenshot from the video "Alexa Chung Uncovers Fashion Industry Secrets - Full Series One | Future of Fashion | British Vogue," published on YouTube.com on October 27, 2015, http://www.youtube.com/watch?v=Bi2nc_xxnvs.
"Culture today is infecting everything with sameness. Film, radio and magazines form a system... Interested parties like to explain the culture industry in technological terms. Its millions of participants, they argue, demand reproduction processes that inevitably lead to the use of standard processes to meet the same needs at countless locations... In reality, the cycle of manipulation and retroactive need is unifying the system ever more tightly." Theodor Adorno and Max Horkheimer, "The Culture Industry: Enlightenment as Mass Deception," in Dialectic of Enlightenment, 1944, http://web.stanford.edu/dept/DLCL/files/pdf/adorno_culture_industry.pdf.
Facebook 2015 stats: Photo uploads total 300 million per day; 968
million people log onto Facebook daily; 50% of 18-24 year-olds go on
Facebook when they wake up. Source: "The Top 20 Valuable Facebook Statistics," October 2015, https://zephoria.com/top-15-valuable-facebook-statistics/.
The companies that sell cultural goods and services via websites or apps (for example, Amazon, Apple, Spotify, Netflix), organize and make searchable information and knowledge (Google, Baidu, Yandex), provide recommendations (Yelp, TripAdvisor), enable social communication and information sharing (Facebook, QQ, WeChat, WhatsApp, Twitter, etc.) and media sharing (Instagram, Pinterest, YouTube, iQiyi) all rely on computational analysis of massive media data sets and data streams. This data includes the following:
(Note about terminology: I use the term "data sets" to refer to static or historical data organized in databases prior to automatic analysis. The term "historical" in industrial data analytics applications means everything that is more than a few seconds, or sometimes even fractions of a second, in the past. "Data streams" refers to data that arrives in real time and is analyzed continuously using platforms such as Spark Streaming and Storm. In both cases, collected data is also stored using platforms such as Cassandra, HBase, and MongoDB. So far, digital humanities and computational social sciences have only been analyzing historical static datasets; meanwhile, industry has been increasingly using real-time analysis of data streams that are larger and require the special platforms mentioned above.)
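A minimal sketch of the batch/stream distinction, in plain Python rather than an industrial platform (the values and the loop standing in for a real-time feed are invented for illustration): the batch function scans a stored dataset once, while the streaming object updates a running result as each new value arrives.

# Toy contrast between batch and streaming analysis. Industrial systems
# would use platforms such as Spark Streaming or Storm; this sketch only
# illustrates the two modes of computation.

def batch_average(dataset):
    """Batch mode: the full 'historical' dataset is available up front."""
    return sum(dataset) / len(dataset)

class StreamingAverage:
    """Streaming mode: update a running result as each value arrives."""
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count  # always reflects the latest data

# Batch: analyze data collected earlier.
print(batch_average([120, 85, 230, 97]))

# Streaming: analyze each value the moment it arrives.
agg = StreamingAverage()
for value in [120, 85, 230, 97]:  # stand-in for a real-time feed
    print(agg.update(value))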
For example, to make its search service possible, Google continuously analyzes the full content and markup of billions of web pages. It looks at every page on the web its spiders can reach - its text, layout, fonts used, images and so on - using over 200 signals in total. (Web search was the first massive instantiation of media analytics.) To be able to recommend music, streaming services such as Spotify and Deezer analyze the characteristics of millions of songs. For example, The Echo Nest, which powers many online music services, has used its algorithms to analyze 36,774,820 songs by 3,230,888 artists. Email spam detection relies on the analysis of the texts of numerous emails. Amazon analyzes the purchases of millions of its customers to recommend books. Netflix analyzes the choices of millions of subscribers to recommend films and TV shows. It also analyzes information on all its offerings to create over 70,000 genre categories. Contextual advertising systems such as AdSense analyze the content of web pages and automatically select the relevant ads to show. Video game companies capture the gaming actions of millions of players and use this to optimize game design. YouTube scans posted videos to see if a new video matches some item in its database of millions of copyrighted videos. Facebook's algorithm analyzes all updates by every friend of every user to automatically select which ones to show in the user's feed (if you are using the default Top Stories option). And it does this for all posts of its 1.6 billion users. (According to one estimate, in 2014 Facebook was processing 600 TB of new data per day.) Other examples of the use of media analytics in industry include automatic translation (Google, Skype) and recommendations for people to follow or add to your friends list (Twitter, Facebook).
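As one hedged illustration of how such content-based recommendation can work (the feature names and values here are hypothetical, not the actual features used by Spotify, Deezer or The Echo Nest): each song is reduced to a numeric feature vector, and songs whose vectors lie closest to one a listener already likes are recommended.

# Minimal content-based recommendation sketch (hypothetical features).
import math

# Each song described by invented audio features: (tempo, energy, valence).
songs = {
    "song_a": (0.82, 0.90, 0.70),
    "song_b": (0.30, 0.20, 0.40),
    "song_c": (0.78, 0.85, 0.65),
}

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def recommend(liked_song, catalog, n=1):
    """Rank all other songs by similarity to a song the user liked."""
    target = catalog[liked_song]
    ranked = sorted(
        (title for title in catalog if title != liked_song),
        key=lambda title: cosine_similarity(target, catalog[title]),
        reverse=True,
    )
    return ranked[:n]

print(recommend("song_a", songs))  # song_c is closest in feature space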
The development of the algorithms and software systems that make this data collection, analysis and subsequent actions possible is carried out by researchers in a number of academic fields including data science, machine learning, data mining, computer vision, music information retrieval, computational linguistics, natural language processing, and computer science in general. Most of these fields began to develop in the 1950s, with the key concept of "information retrieval" introduced in 1950. The newest term, "data science," became popular after 2010. It refers to professionals who know contemporary algorithms and methods for data analysis (described by the overlapping umbrella terms of data mining, machine learning, and AI) as well as classical statistics, and who can implement the gathering, analysis, reporting and storage of big data using current technologies, such as the platforms I referenced above.
People outside the industry may be surprised to learn that many key parts of media analytics technologies are open source. To speed up the progress of research, most top companies regularly share many parts of their code. For example, on November 9, 2015 Google open-sourced TensorFlow, its data and media analysis system that powers many of its services. Other companies such as Facebook and Microsoft have also open-sourced their software systems for organizing massive datasets (Cassandra and Hive are two popular systems that originated at Facebook, and they are now used by numerous commercial and non-profit organizations). The reverse is also true: the data from the community mapping project openstreetmap.org (with over two million members) is used by many commercial companies including Microsoft and craigslist in their applications. The most popular programming language used for media analytics research today is the free R, which is constantly being extended by researchers from universities, labs, and companies.
Media analytics is the new stage of media technology that impacts the everyday cultural experiences of significant percentages of the populations in dozens of countries who use the Internet and computing devices. (For figures about the use of the Internet and social media in the USA by different demographic groups, see the latest Pew Research Center Internet & Tech reports.)
To be fair, we should note that one part of media analytics - the practices of gathering and algorithmic analysis of user interaction data - has received significant attention. However, almost all discussions of this have been only in relation to political and social issues such as privacy, surveillance, access rights, discrimination, fairness, biases, etc., as opposed to the history and theory of technological media.
In contrast, the second key part - the practices of algorithmic analysis of all types of online media content by the industry - has received very little attention. One likely reason for this absence is that many journalists and academics in social sciences and media studies are interested mainly in the social and political effects and uses of media, as opposed to the technical details beneath its surface. While media analytics technologies and concepts are widely discussed in computer and data sciences, in business publications, in conferences and trade shows, and in leading science journals, and are taught to millions of students worldwide in computer science and data science classes, they are not discussed in either the popular press or by academics outside of technology and science fields.
This lack of systematic knowledge on the part of many academics and journalists who write about digital cultures - about the details of the computational processes that drive web services, apps, desktop applications, video games, search, image detection, voice recognition, recommendation systems, behavioral advertising, and so on, as well as contemporary software engineering and the field of data science in general - often prevents them, in my view, from seeing the full picture. (Understanding many of these details does require knowledge of computer science, and today very few people in the academic humanities, social sciences or journalism have this background.) This is why many academics and journalists have recently adopted the single term "algorithm" (or "algorithmic") to refer to the sum total of many very different computational processes and data infrastructures, of which algorithms are only one part.
In particular, people often use this term to refer to systems that use supervised machine learning and therefore are not algorithmic in the accepted meaning of this concept. As Ian Bogost correctly noted in The Atlantic (01/15/2015), "Concepts like 'algorithm' have become sloppy shorthands, slang terms for the act of mistaking multipart complex systems for simple, singular ones." (In the same article Bogost incorrectly describes me as somebody who focuses on algorithms, while in reality I have been advocating the study of software, the term I use to refer to such multipart complex systems. See my book Software Takes Command, 2013.) For example, while many presentations at the innovative conferences on "Governing Algorithms" in 2013 and "Algorithms and Accountability" in 2015 organized by the NYU Law Institute made interesting and important arguments, some of the presentations used the term "algorithms" too broadly.
Only if we consider the two parts of media analytics together - analysis of user interaction data and analysis of cultural content - does the magnitude of the shift that took place between 1995 and 2010 become fully apparent. This is why I am proposing that we should think of media analytics as the new condition of the culture industry and also as a new stage in media history. Because its use is now so central to the industry as a whole, and because it affects all cultural activities mediated by the web and apps, we need to start thinking beyond any particular instances.
To reiterate this point: the algorithmic analysis of cultural data and algorithmic decision-making is not only at work in a few highly visible areas such as Google Search and the Facebook News Feed. Media analytics practices and technologies are employed in most platforms and services where people share, purchase, and interact with cultural products and with each other. They are used by companies to automatically select what, how, and when content will be shown on these platforms to each user, including updates from their friends and recommended content. And perhaps most importantly, they are built into many apps and web services used not only by companies and non-profits but also by millions of individuals who now participate in the culture industry not only as consumers but also as content and opinion creators. (George Ritzer and Nathan Jurgenson call this combination of consumption and production "prosumer capitalism.") For example, Google Analytics for websites and blogs, and the analytics dashboards provided by Facebook, Twitter and other major social networks, are used by millions to fine-tune their content and posting strategies.
Both parts of media analytics are historically new. At the time when Adorno and Horkheimer were writing their book, interpersonal and group interactions were not part of the culture industry. But today they too have become industrialized - influenced in part by algorithms deciding what content, updates and information from the people in your networks to show you. These interactions are also industrialized in a different sense: the interfaces and tools of social networks and messaging apps are designed with input from UI (user interface) scientists and designers who test endless possibilities to ensure that every UI element such as buttons and menus is optimized and engineered to achieve maximum results.
The computational analysis part is also very recent in terms of its use by the culture industry. The idea of retrieving computer-encoded text in response to a query, and the first computer technologies that could perform it, were introduced already in the 1940s. At a conference held in 1948, Holmstrom described a machine called the Univac capable of searching for text references associated with a subject code; the code and text were stored on a magnetic steel tape (Sanderson and Croft, "The History of Information Retrieval Research"). Calvin Mooers coined the term "information retrieval" in his Master's thesis at MIT and published his definition of the term in 1950 ("finding information whose location or very existence is a priori unknown"; quoted in Eugene Garfield, "A Tribute To Calvin N. Mooers, A Pioneer Of Information Retrieval," 1997). While the earliest systems only used subject and author codes, in the late 1950s IBM computer scientist Hans Peter Luhn introduced full-text processing, which I identify as the real start of media analytics. In the 1980s, the first search engines applied information retrieval technology to files on the internet. After the World Wide Web started to grow, new search engines for websites were created. The first well-known engine that searched the text of web sites was WebCrawler, launched in 1994. In the second half of the 1990s, many search engines including Yahoo!, Magellan, Lycos, Infoseek, Excite and AltaVista continued the analysis of web text. And in the 2000s, the massive analysis of other types of online media including images, video and songs also started. For example, in early 2016 the image search service TinEye had indexed over 14 billion web images (https://www.tineye.com/faq#count, retrieved 2/21/2016).
The tools of media analytics are different: they automate the analysis of 1) billions of pieces of media content available online, and 2) data from trillions of interactions between users and software services and apps. For example, Google analyzes the content of images on the web, and when you enter a search term, the system shows all or only some images (depending on your selection in the SafeSearch option). And if this is desired, these tools also make possible automatic actions based on this analysis - for example, automatic ad placement.
So what is now being automated is no longer the creation of individual media items but the presentation of all web content and the retrieval of relevant content. This includes selection and filtering (what to show), promotion (advertising of content), and discovery (search, recommendations). Another growing application is how to show: for example, the popular news portal Mashable, which currently has 6.73 million followers on Twitter (https://twitter.com/mashable, 02/21/2016), automatically adjusts the placement of content pieces based on real-time analysis of users' interactions with this content. Yet another application is what to create: for example, in 2015 New York Times writers started to use an in-house application that recommends topics to cover (for other examples, see Shelley Podolny, "If an Algorithm Wrote This, How Would You Even Know?" 03/08/2015, and Celeste Lecompte, "Automation in the Newsroom," 09/01/2015).
In the case of web giants such as Google and Facebook, their technical and talent resources for data analysis, and their access to data about the use of their services by hundreds of millions of people daily, give them significant advantages. This allows these companies to analyze user interactions and act on them in ways that are quantitatively different from an individual user or a business using Google Analytics or Facebook analytics on their own accounts, or using any of the social media dashboards - but qualitatively, in terms of concepts and most of the technologies, it is exactly the same. One key difference between giants such as Google, Facebook, Baidu and eBay and smaller companies is that the former have top scientists developing their machine learning systems (i.e., the modern form of AI) that analyze and make decisions based on billions of data points captured in near real time. Another difference is the fact that Google and Facebook dominate online search and advertising in many countries, and therefore they have a disproportionate effect on the discovery of new content and information by hundreds of millions of people.
So media analytics is big, and it is used throughout the culture industry. But still, why do I call it a "stage" as opposed to just one among other trends of the contemporary culture industry? Because in some industries, media analytics is used to algorithmically process and act on every cultural artifact. For example, digital music services that use media analytics accounted for 70% of music revenues in the U.S. as of September 2014. Media analytics is also used to analyze and act on every user interaction on platforms used by the majority of younger people in dozens of countries (i.e., Facebook, Baidu, Tumblr, Instagram, etc.). It's the new logic of how media works internally and how it functions in society. In short, it is crucial both practically and theoretically. Any future discussion of media, media theory or communication has to start with this situation.
(Of course, I am not saying that nothing else has happened after 1993 with media technologies. I can list many other important developments, such as the move from hierarchical organization of information to search, the rise of social media, the integration of geolocation information, mobile computing, the integration of cameras and web browsing into phones, and the switch to supervised machine learning across media analytics applications and other areas of data analysis after 2010.)
The companies that are the key players in big media data processing are all only 10-15 years old: Google, Baidu, VK, Amazon, eBay, Facebook, Instagram, etc. They developed in the Web era, as opposed to the older 20th century culture industry players such as movie studios or book publishers. These older players were, and continue to be, the producers of professional content. The newer players act as interfaces between people and this professional content, as well as user-generated content. The older players are gradually moving towards the adoption of analytics, but key decisions (for example, publishing a particular book) are still made by individuals following their instincts. In contrast, the new players from the beginning built their business on computational media analytics.
1. The analysis part is always fully automated. The results of analysis can be used to drive actions, but this is not required. The action part is also fully automated, and it can be generated in response to user inputs, or without them. Google search offers an example of a system where the actions depend on previous analysis and user inputs. Google continuously indexes all web pages, including dynamically generated content and the content of apps it can access. This is the analysis part. When a user enters input into the search interface using text, image or voice, Google systems return results drawn from the index. This is the action in response to user input.
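A toy sketch of this analysis/action split (purely illustrative; real search engines use far more signals and structure): the "analysis" step builds an inverted index from documents ahead of any query, and the "action" step answers a user's query from that prepared index.

# Toy analysis/action split for search (illustrative only).
from collections import defaultdict

# Analysis (offline, continuous): index documents before any query arrives.
documents = {
    1: "media analytics and the culture industry",
    2: "information retrieval and web search",
}

index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.split():
        index[word].add(doc_id)

# Action (online): respond to a user's query using the prepared index.
def search(query):
    word_sets = [index[word] for word in query.split()]
    return set.intersection(*word_sets) if word_sets else set()

print(search("web search"))  # -> {2}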
Use of social media monitoring tools today is an example of analysis typically not
connected to automatic actions. I can use Buffer, Hootsuite, Sprout Social, Piwik,
and dozens of other free or paid tools to analyze user engagement with my own
websites and social media accounts, or social media activity in general related to
any topic, in many languages, and across dozens of global social networks. After I
discover some patterns that I want to change, I may adjust my strategy of
posting to Twitter, Facebook or Instagram, but these adjustments would not
happen automatically.
1.1. Analysis of media content. Examples include the content of web pages and apps analyzed by search engines; the analysis of photos and their metadata to detect faces or enable categorization by places and content, performed by photo apps and photo sharing services; and YouTube's analysis of newly shared videos to compare them to its database of copyrighted videos and detect copies.
1.3. Analysis of users' interactions with other users of a given service. For example, on Facebook I can start following a particular user; add this user (with her/his permission) to my friends list; write a message; and also "poke," "report," and "block." All these behaviors are recorded and analyzed by Facebook and used in some of its systems that drive certain automatic actions, such as deciding which new items to show to each user.
Users' inputs and settings are combined with the results of content and interaction analysis to determine the actions. The analysis may combine previous interaction data from the particular user with data for all other users - such as the purchasing history of all Amazon customers. Other information can also be used to determine actions. For instance, real-time algorithmic auctions that involve thousands of ads determine which ads will be shown on the user's page at a particular moment.
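A hedged sketch of that auction logic (the bids are invented; real ad exchanges run much richer generalized second-price auctions that also weigh predicted click-through rates and quality scores): each candidate ad submits a bid, the highest bidder wins the slot, and it pays the runner-up's price.

# Toy second-price ad auction (hypothetical bids).

def run_auction(bids):
    """bids: dict of advertiser -> bid in dollars. Returns (winner, price)."""
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    winner, _ = ranked[0]
    # Second-price rule: the winner pays the next-highest bid.
    price = ranked[1][1] if len(ranked) > 1 else ranked[0][1]
    return winner, price

bids = {"ad_a": 2.50, "ad_b": 1.75, "ad_c": 3.10}
winner, price = run_auction(bids)
print(winner, price)  # ad_c wins and pays the second-highest bid: 2.50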
2.2. Automatic actions not controlled by explicit user inputs. These are actions that depend on the analysis of user interaction activity but do not require the user to choose anything explicitly. In other words, a user "votes" with all her previous actions. The automatic filtering of Google email into "Important" and "Everything else" is a good example of this type of action. Most of the automatic actions we encounter in our interactions with web services and apps today can be partly controlled by us - however, not every user is willing to spend the time to understand and change the default settings for every service (for example, https://www.facebook.com/settings).
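One minimal sketch of how such implicit "voting" can work (the scoring rule and weights are invented for illustration; Gmail's actual classifier is far more complex and learned from data): past interactions with each sender are counted, and new messages are routed based on that history rather than on any explicit user choice.

# Toy 'importance' filter driven by implicit user behavior
# (invented scoring rule; real systems use learned classifiers).

# How often the user previously opened / replied to each sender.
interaction_history = {
    "colleague@work.com": {"opened": 40, "replied": 22},
    "newsletter@shop.com": {"opened": 2, "replied": 0},
}

def is_important(sender, history, threshold=5.0):
    past = history.get(sender, {"opened": 0, "replied": 0})
    score = past["opened"] * 0.2 + past["replied"] * 1.0  # replies weigh more
    return score >= threshold

for sender in interaction_history:
    label = "Important" if is_important(sender, interaction_history) else "Everything else"
    print(sender, "->", label)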
Finally, we can also divide automatic actions into two types, depending on whether they are arrived at in a deterministic or a non-deterministic way:
The overall result is another new condition of media: what we are shown and recommended every time is not completely determined by us or by system designers. This shift from the strictly deterministic technologies and practices of the culture industry in the 20th century to non-deterministic technologies in the first decade of the 21st century is another important aspect of media analytics. What was strictly the realm of experimental arts - the use of indeterminacy by John Cage, or of stochastic processes by Iannis Xenakis, to create and/or perform compositions - has now, in a way, been adopted by the culture industry as a way to deal with the new massive scale of available content. But of course, the goal and the method are now rather different: not to create a possibly uncomfortable and shocking aesthetic experience but to expose a person to more of the existing content that fits with the person's existing taste, as manifested in her/his previous choices. However, we should keep in mind that industry recommendation systems can also be used to expand your taste and knowledge, if you gradually keep moving further from your initial selections - and certainly the web's hyperlinking structure, Wikipedia, open access publications and all other kinds of web content can be used to do this.
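A small sketch of the deterministic/non-deterministic contrast (toy scores and an invented exploration rate, not any production system's logic): the deterministic version always returns the same top-scored items for the same input, while the stochastic version occasionally samples outside them - one simple way a system can trade off between matching existing taste and introducing variety.

# Deterministic vs. non-deterministic selection (toy example).
import random

scores = {"item_a": 0.92, "item_b": 0.80, "item_c": 0.41, "item_d": 0.15}

def pick_deterministic(scores, n=2):
    """Always the same output for the same input: the top-n items."""
    return sorted(scores, key=scores.get, reverse=True)[:n]

def pick_stochastic(scores, n=2, explore=0.25):
    """With probability `explore`, swap one top item for a random other one."""
    picks = pick_deterministic(scores, n)
    if random.random() < explore:
        others = [k for k in scores if k not in picks]
        picks[-1] = random.choice(others)
    return picks

print(pick_deterministic(scores))  # always ['item_a', 'item_b']
print(pick_stochastic(scores))     # usually the same, sometimes varied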
One thing I should add to my outline above is another important use of collected interaction data that also makes the new media analytics stage different. The data on users' interactions with a web service, an app or a device is also often used to make automatic design adjustments to this web service, app or device. It is also used to create more cognitive automation, allowing the system to anticipate what users need at any given location and time, and deliver the information best tailored to this location, moment, user profile, and type of activity. The term "context-aware" is often used to describe computer systems that can react to location, time, identity, and activity. The Google Now assistant is a good example of such context-aware computing.
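A minimal sketch of context-aware selection (the rules below are hand-written and invented; real assistants such as Google Now learn such mappings from data rather than hard-coding them): the system keys its response on location, time, and activity.

# Toy context-aware information delivery (invented rules, for
# illustration only; real assistants learn these mappings from data).
from dataclasses import dataclass

@dataclass
class Context:
    location: str   # e.g. "home", "office", "transit"
    hour: int       # 0-23
    activity: str   # e.g. "commuting", "working"

def suggest(ctx):
    if ctx.activity == "commuting":
        return "traffic and transit updates for your route"
    if ctx.location == "office" and 9 <= ctx.hour <= 17:
        return "your next meeting and related documents"
    if ctx.hour >= 21:
        return "tomorrow's weather and first appointment"
    return "top stories matching your interests"

print(suggest(Context(location="transit", hour=8, activity="commuting")))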
At this point you, the reader, may get impatient and wonder when I will deliver what critics and media theorists are supposed to deliver when they talk about contemporary life and in particular the use of technologies: a critique of what I am describing. Where is the word "critical" in my text? Why am I not invoking "capitalism," "commodity," "fetishism," or "resistance"? Why am I not talking about the hidden agendas and biases of data technologies, or the end of privacy? Where is my moralistic judgment?
None of this is coming. Why? Because, in contrast to what media critics like to tell you, I believe that computing and data analysis technologies are neutral. They don't come with built-in social and economic ideologies and effects, and they are hardly the tools of capitalism, profit making, or oppression. Exactly the same analytics algorithms (linear regression, k-means cluster analysis, Principal Component Analysis, and so on) and massive data processing technologies (Cassandra, MongoDB, etc.) are used to analyze people's behavior in social networks, to look for a cure for cancer, to look for potential terrorists, to select the ads that appear in your YouTube video, to study the human microbiome, to motivate people to live healthy lifestyles, to get more people to vote for a particular candidate during presidential elections (think of the use of analytics in the Obama 2008 and 2012 campaigns), to suggest to New York Times editors which stories they should publish, to generate automatic news layouts on Buzzfeed, etc. Media analytics benefits not only big companies but also many millions of small businesses, freelancers and non-profits. The same algorithms and data gathering, storage and analysis technologies are used by companies and government agencies in the USA, UK, Russia, Brazil, China, and dozens of other countries for thousands of different applications. They are used to control and to liberate, to create new knowledge and to limit what we know, to help find love and to encourage us to consume more, to spy on us and to help us escape surveillance, to organize protests and to track them. In other words, their use is so varied that any claim that they are "tools of capitalism" is simply ungrounded (unless you also want to claim that arithmetic, calculus, rhetoric, electricity, space flight, and every other human technology ever invented are all "tools of capitalism").
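To make this point concrete, a brief hedged illustration (synthetic data; assumes the numpy and scikit-learn libraries are installed): the very same k-means routine clusters whatever numeric vectors it is given, whether they describe songs, patients, voters, or user behavior - the algorithm itself carries no domain.

# The same k-means algorithm applied to two unrelated domains
# (synthetic data; requires numpy and scikit-learn).
import numpy as np
from sklearn.cluster import KMeans

# Domain 1: hypothetical song features (tempo, energy).
songs = np.array([[0.9, 0.8], [0.85, 0.9], [0.2, 0.1], [0.15, 0.2]])

# Domain 2: hypothetical user engagement (posts/day, minutes online).
users = np.array([[12, 240], [10, 200], [1, 15], [2, 20]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
print(kmeans.fit_predict(songs))  # two clusters of songs
print(kmeans.fit_predict(users))  # same algorithm, different domain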
This does not mean that the adoption of large-scale data processing and analysis across the culture industry does not significantly change it. Nor does it mean that it is now any less of an industry, in the sense of having distinct forms of organization and standardization (such as likes, favorites, line graphs showing the numbers of people engaging with your content, or maps showing the countries where these people are located). On the contrary: some marketing and advertising techniques, the ways companies engage customers online, and also some cultural products are new, and in the last few years they all came to rely on large-scale media analytics.
Many of the cultural (as opposed to economic, social, and political) effects of these developments have not yet been systematically studied empirically by either industry or academic researchers. For example, we now know many things about the language used by conservative and liberal Twitter users in the U.S., or about political polarization on the same platform. But we don't know anything about the differences in the types of content shared on Instagram in thousands of cities worldwide, or the evolution of cultural topics in hundreds of millions of blogs over the last ten years. The industry does extract some of this information and uses it in its search and recommendation services, but it doesn't publish this information itself. We should also keep in mind that industry is typically interested in the analysis of current trends in relation to particular content and user activities (for example, all social media mentions of a particular brand), as opposed to the historical or large-scale cross-cultural analysis that is of interest to academics.
However, one thing is clear to me. The same data analysis methods that are used in the culture industry to select and standardize content and communication can also be used to quantitatively research and theorize the cultural effects of media analytics. (In our lab we have been using such methods to analyze visual content such as millions of Instagram images, but not yet large interaction data.) Such analysis will gradually emerge, and we already can give it a name: computational media studies.
In 2005, when industrial media analytics was just emerging, I introduced the term "cultural analytics" to refer to the use of computational methods to explore massive cultural datasets, including user-generated content, in a humanities context. Since then, researchers have published many interesting studies that apply these methods to the analysis of literature, music, art, historical newspaper content, and social networks including Facebook, Twitter, Flickr and Instagram. (For an overview, see Manovich, "The Science of Culture? Social Computing, Digital Humanities, and Cultural Analytics," 2015.) However, since computational analysis of content or user interaction data has not yet been used in media and communication studies, the term "computational media studies" can be useful to motivate this research.
The term "culture industry" that is used in the title of this text was introduced by Adorno and Horkheimer in their 1944 book Dialectic of Enlightenment. The book was written in Los Angeles when the Hollywood studio system was in its classical, i.e., most integrated, period. There were eight major film conglomerates; five of them (Fox, Paramount, RKO, Warner Brothers, and Loew's) had their own production studios, distribution divisions, theatre chains, and their own directors and actors. According to some film theorists, the films produced by these studios during this period also had a very consistent style and narrative construction (see David Bordwell, Janet Staiger, and Kristin Thompson, The Classical Hollywood Cinema: Film Style and Mode of Production to 1960, published in 1985). Regardless of whether Adorno and Horkheimer had already fully formed their ideas before arriving in Los Angeles as emigrants from Germany, the tone of the book and its particular statements - such as the famous "culture today is infecting everything with sameness" - seem to fit the Hollywood classical era particularly well.
How does the new computational base (i.e., media analytics) affect both the products the culture industry creates and what consumers get to see and choose? For example, do the computational recommendation systems used today by Amazon, YouTube, Netflix, Spotify, Apple iTunes Radio, Google Play and others help people choose apps, books, videos, movies, or songs more widely (i.e., the "long tail" effect), or do they, on the contrary, guide them towards "top lists"? What about the recommendation systems used by Twitter and Facebook to recommend to us who to follow and which groups to join? Or consider the interfaces and tools of popular media capture and sharing apps, such as Instagram, with its standard set of filters and adjustment controls appearing in a particular order on your phone. Does this lead to a homogenization of image styles, with the same few filters dominating over the rest (currently 24 in total)?
The authors propose an algorithm that can find unpopular images (i.e., images that have been seen by only a small proportion of users) that are equal in aesthetic quality to popular images. Implementing such an algorithm would allow more creators to find audiences for their works. Such research exemplifies the potential of computational media studies to go beyond generating descriptions and critique of cultural situations by offering constructive solutions that can change these situations.
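A schematic sketch of that idea (the data, fields and thresholds are invented; this is not the cited authors' actual algorithm): given per-image view counts and an aesthetic score from some model, select images whose scores match those of the popular set but whose view counts are low.

# Schematic sketch: surface under-seen images whose (hypothetical)
# aesthetic scores match those of popular images. Invented data and
# thresholds; not the cited authors' actual algorithm.

images = [
    {"id": "img1", "views": 1_200_000, "aesthetic": 0.91},
    {"id": "img2", "views": 350,       "aesthetic": 0.90},
    {"id": "img3", "views": 900_000,   "aesthetic": 0.88},
    {"id": "img4", "views": 410,       "aesthetic": 0.35},
]

POPULAR_VIEWS = 100_000  # invented popularity cutoff

popular_scores = [im["aesthetic"] for im in images if im["views"] >= POPULAR_VIEWS]
quality_bar = min(popular_scores)  # aesthetic level the popular images reach

hidden_gems = [
    im["id"]
    for im in images
    if im["views"] < POPULAR_VIEWS and im["aesthetic"] >= quality_bar
]
print(hidden_gems)  # -> ['img2']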