Classifying YouTube Channels: A Practical System
Prepared by Sai Ram Prasad Reddy S (1ms10cs098)
For 8th semester seminar
Introduction
Having a look at their contents, you would probably say within a few
seconds that Machinima is a YouTube channel about video games and
CNN is one about news. But how can this be determined algorithmically? And
how can it be done at the scale of the YouTube corpus?
YouTube is placing the concept of channel at the core of its strategy to
develop content and audience.
A channel can be viewed as a living set of videos which share a
common property: they are from the same person or organization, they
are about the same topic, they are related to the same event, etc. A
channel might be created by a content creator (a person who uploads
original videos to the site) or generated by a curator (a person who
recommends videos on the site or even an algorithm).
A channel has a live feed of events, in which videos may be published,
and users may subscribe to them. Channels are engaging creators and
curators, by gathering an audience for them. Channels are engaging
users, by recommending them videos about things they like.
Introduction
A sample YouTube Channel.
Introduction: Why classify channels?
With channels playing such a central role, enabling their discovery
becomes more and more crucial. Discovery features for videos can be
extended to channels, yielding channel search, channel
recommendations and related channels. These features are very
powerful for users who have a watch history, or who know precisely
what they are looking for.
In order to engage users who are new to the site, or who do not know
exactly what they want to watch, developers at YouTube wanted to
provide a catalog of user channels. For every thematic category of
content available on YouTube (e.g. music, sports, news or even sub-
categories like rock music, tennis, politics), the catalog should provide a
list of interesting channels. For building this, channels must be
classified into a taxonomy of thematic categories.
This feature is known as the channels browser, because it allows users
to browse a catalog of channels and to subscribe to those they are
interested in. It is a major feature, as it allows users to discover new
channels they may be interested in, and also acts as a showcase for
the site.
Manual Classification
The channels browser was launched on
YouTube with a dozen categories, and a
manually curated list of channels for each of
these categories, with a focus on the United
States.
This approach was fine as a starting point, but
it does not scale.
Manual Classification - Problems
It limits the channels browser to some hundreds or thousands of
well known channels (e.g. channels from partners or celebrities), as
it is not possible to manually classify millions of channels.
The channels browser should be a living selection of channels. New
channels are created every day, others are closed or abandoned.
The audience, the interest or even the theme of every channel
evolves. So the list cannot be developed once and for all, but has to be
continuously maintained.
YouTube is officially launched in more than 50 countries. This
requires developing and maintaining as many versions of the
catalog.
The list of categories, initially limited to 10, should be extended to
many more in order to match user interests as closely as possible.
Alternative Approach
An alternative option would have been to ask creators and curators
to classify their own channels, i.e. specify in which category they
want their channel to appear.
Disadvantages:
It requires an additional action from the channel owners, which
might be painful or error prone if the list of categories is large.
It does not work for algorithmically generated channels, and it
also requires some back-fill work for all existing channels.
It almost prevents changing the taxonomy from time to time (or
limits how it can be changed), because it would require every
creator or curator to update the classifications.
Current Approach
Generate the channels browser algorithmically,
i.e. by developing a program able to classify
channels within the taxonomy without human
intervention.
Towards the solution
Algorithmic classification of video channels is a difficult problem,
because it requires understanding what the channel and its videos
are about, with a limited quantity of algorithmically exploitable
information.
Much work on video content analysis is available in the
literature. But such algorithms are not suited to discovering what
videos are really about, i.e. which users will be interested in
watching them. For instance, a video may contain a lot of kites,
or its soundtrack may have many occurrences of the word kite,
while the video is neither really about kites nor interesting for
users who care about this topic.
These algorithms might also be hard to scale to a video corpus of
the size of YouTube.
An efficient solution!
Ignore the image and audio contents, and rely on the meta-data
associated with the videos and the user channels.
These are mainly the title, the description and the keywords entered
as free text by the users when they create channels and upload
videos.
Relying on text allows applying known techniques like semantic
entity extraction and disambiguation, but with new challenges as the
amount of textual information available for a channel or a video is
usually sparse and short, which requires introducing some
additional refinements.
Steps in the classification process
Mapping videos to semantic entities.
Mapping semantic entities to taxonomic
categories.
Mapping channels to taxonomic categories (by
combining both previous steps).
Metrics
The precision of a classification algorithm is the fraction
of classification results produced by the algorithm which
are relevant.
The recall of a classification algorithm is the fraction of
classifiable items which are actually classified by the
algorithm. The coverage is the fraction of all items
which are classified by the algorithm. As it is easier to
measure, we decided to use coverage as a proxy for
recall.
Subscription rate.
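The difference between precision, recall and coverage can be illustrated with a minimal Python sketch. The function names and the toy data are assumptions made for illustration; they are not taken from the system described here.

```python
def precision(results, relevant_results):
    """Fraction of produced (item, category) results that were judged relevant."""
    if not results:
        return 0.0
    return len(results & relevant_results) / len(results)


def recall(classified_items, classifiable_items):
    """Fraction of classifiable items that actually received a classification."""
    if not classifiable_items:
        return 0.0
    return len(classified_items & classifiable_items) / len(classifiable_items)


def coverage(classified_items, all_items):
    """Fraction of all items that received a classification."""
    if not all_items:
        return 0.0
    return len(classified_items & all_items) / len(all_items)


# Toy data (hypothetical): five channels, one of which ("e") fits no category.
all_items = {"a", "b", "c", "d", "e"}
classifiable_items = {"a", "b", "c", "d"}
results = {("a", "/music"), ("b", "/sports"), ("c", "/news")}
relevant_results = {("a", "/music"), ("b", "/sports")}
classified_items = {item for item, _ in results}

p = precision(results, relevant_results)          # 2 of 3 results are relevant
r = recall(classified_items, classifiable_items)  # 3 of 4 classifiable items
c = coverage(classified_items, all_items)         # 3 of 5 items overall
```

Coverage needs no judgment about which items are classifiable at all, which is why it is the easier of the two to measure.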
Annotating Videos: Semantic Entities
Freebase is a knowledge base maintained by a
community supported by Google, which aims at
gathering as much of the world's knowledge as
possible. It is organized around entities (a.k.a. topics,
the nodes of the graph), which are connected together
by properties (the edges of the graph).
By entity, one means any concrete or abstract
concept that people may designate. Persons, places,
objects, artworks are entities. Abstract concepts like
interview, mathematics or even happiness are also
entities. Every entity has types, which define the
properties it may have.
Freebase is used as the source of entities for
annotating videos, and for creating algorithmic
channels.
Annotating Videos: Entity Names
In textual documents, humans use names to
designate entities. The mapping from names
to entities is N-to-N. In a given language, every
entity may have several names (e.g. the City
of Paris may be designated as Paris, Capital
of France or City of Lights). On the other hand,
a single name may designate different entities
depending on the context (e.g. Jaguar may
designate an animal or a car manufacturer).
Annotating Videos: Entity Names
A mapping from names to entities has been built by analyzing
Google Search logs, and, in particular, by analyzing the web
queries people are using to get to the Wikipedia article for a
given entity. After inverting it, we get a table mapping every
name to a list of entities with probabilities. For instance, this
table maps the name Jaguar to the entity Jaguar car with a
probability of around 45 % and to the entity Jaguar animal
with a probability of around 35 %.
This table is then enriched by contextual support between
entities, by analyzing the links between the entities (or their
underlying Wikipedia pages). For instance, having the entity
Savannah in the context increases the probability that the
name Jaguar designates the animal and not the car. This
refinement plays a crucial role for disambiguating
annotations.
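The disambiguation idea can be sketched in Python as follows. The two prior probabilities for Jaguar come from the text above; the support table, its values and the scoring formula (prior multiplied by one plus the contextual support) are assumptions chosen only to make the example concrete.

```python
# Name -> {entity: prior probability}, as obtained from the inverted query table.
NAME_TABLE = {
    "jaguar": {"Jaguar car": 0.45, "Jaguar animal": 0.35},
}

# Hypothetical contextual support between entities (e.g. derived from links).
SUPPORT = {
    ("Jaguar animal", "Savannah"): 1.0,
    ("Jaguar car", "Formula One"): 1.0,
}


def disambiguate(name, context_entities):
    """Return candidate entities for a name, rescored using the context."""
    candidates = NAME_TABLE.get(name.lower(), {})
    scores = {}
    for entity, prior in candidates.items():
        boost = sum(SUPPORT.get((entity, ctx), 0.0) for ctx in context_entities)
        scores[entity] = prior * (1.0 + boost)  # assumed combination rule
    total = sum(scores.values())
    if total == 0:
        return {}
    # Normalise over the listed candidates only.
    return {entity: score / total for entity, score in scores.items()}


without_context = disambiguate("Jaguar", [])
with_context = disambiguate("Jaguar", ["Savannah"])
best_without = max(without_context, key=without_context.get)
best_with = max(with_context, key=with_context.get)
```

In this toy setting the car wins when there is no context, while the presence of Savannah tips the decision towards the animal.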
Annotation process
In the first step, the video is converted into a
textual document by gathering the meta-data
associated with it (like the title, the description and
the keywords entered by the user).
In the second step, the textual document is
annotated with semantic entities using techniques
known as entity extraction and disambiguation:
All fragments of the document which may designate
an entity are identified. Their relative positions are
taken into account in order to disambiguate them.
A ranking of all mentioned entities is produced,
reflecting how "topical" (i.e. important) they are for the
document.
Three additional signals for annotation
The top search queries which lead to effective
watches of the videos.
The videos which have been watched in the
same user sessions. Watchers tend to watch
thematically consistent videos in the same
session.
The upload history of every uploader, i.e. the
annotations of the videos which have been
previously uploaded. Uploaders tend to
upload thematically consistent videos over
time.
Metrics
Precision. The precision measures how relevant the
generated annotations for a given video are
with regard to the video content. We defined three
levels of relationship between a video and an entity:
Central. The entity is one of the key entities to describe
what the video is about. For instance, Lady Gaga would be
a central entity for the clip of Bad Romance.
Relevant. The video is about the entity, but this is not one
of the key entities to describe what the video is about. For
instance, Pop music would be a relevant entity for the clip
of Bad Romance.
Off-topic. The entity is not related to the video (i.e. the
annotation is wrong). For instance, Sport would be an
off-topic entity for a music video.
Metrics
Coverage. The coverage is measured by
computing the fraction of videos which are
annotated by at least one entity. It is relevant
to weight this measurement by the respective
audience of the videos, i.e. their number of
views. We measured the coverage by running
the algorithm on all YouTube videos in
languages we support.
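The effect of weighting coverage by audience can be shown with a small Python sketch. The record layout (`views`, `entities`) and the numbers are hypothetical.

```python
def coverage(videos):
    """Unweighted coverage: fraction of videos with at least one entity."""
    if not videos:
        return 0.0
    return sum(1 for v in videos if v["entities"]) / len(videos)


def view_weighted_coverage(videos):
    """Coverage where every video counts in proportion to its number of views."""
    total_views = sum(v["views"] for v in videos)
    if total_views == 0:
        return 0.0
    covered_views = sum(v["views"] for v in videos if v["entities"])
    return covered_views / total_views


# Toy corpus (hypothetical): one popular annotated video, two unannotated ones.
videos = [
    {"views": 900, "entities": ["Lady Gaga"]},
    {"views": 90, "entities": []},
    {"views": 10, "entities": []},
]

plain = coverage(videos)                   # 1 of 3 videos
weighted = view_weighted_coverage(videos)  # 900 of 1000 views
```

The weighted figure reflects how often a viewer lands on an annotated video, rather than how many videos in the corpus are annotated.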
Metrics
The table summarizes the results.
Table: Metrics for video annotation
Classifying Entities
Taxonomy: Various taxonomies have been
developed in order to classify corpora of
written, audio or video content. Libraries have
long used hierarchical taxonomies such as the
Library of Congress System. Similarly, in the
early age of the web, portal sites like
dmoz.org presented a hierarchical
classification of web sites. More recently,
Wikipedia provides several hierarchical
taxonomies in its Contents pages. Google also
developed taxonomies for its advertisement
products. Many other examples exist.
Need for a separate taxonomy
These taxonomies are not very well suited for
classifying YouTube videos and channels. The
distribution of YouTube videos and channels is
quite different from that of a library or an encyclopedia.
For instance, Mathematics appears as a first-level
node in some Wikipedia taxonomies, while it does
not deserve such importance for YouTube
content and audience.
So, we decided to develop a specific taxonomy for
our purpose; an excerpt of it is shown in the table
on the next slide.
A sample taxonomy structure
Our Taxonomy
The complete tree has around 300 nodes, with
a depth of 2 to 4 depending on the branches.
However, it is worth noting that the approach
described in this paper is not particularly linked
to this taxonomy.
The taxonomy is defined as a directed acyclic
graph, with a single root, whose nodes are
called categories. One says A is a sub-
category of B if there is an edge from B to A in
the graph.
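One simple way to represent such a taxonomy is an adjacency mapping from each category to its sub-categories. The Python sketch below is illustrative only; the category paths other than /music, /sports and /music/pop are assumed names.

```python
# Edges go from a category to its sub-categories; "/" is the single root.
TAXONOMY = {
    "/": ["/music", "/sports"],
    "/music": ["/music/pop", "/music/rock"],
    "/sports": ["/sports/tennis"],
    "/music/pop": [],
    "/music/rock": [],
    "/sports/tennis": [],
}


def is_sub_category(a, b, taxonomy=TAXONOMY):
    """True if there is an edge from b to a, i.e. a is a sub-category of b."""
    return a in taxonomy.get(b, [])


def descendants(category, taxonomy=TAXONOMY):
    """All categories reachable from `category` by following edges."""
    seen = set()
    stack = list(taxonomy.get(category, []))
    while stack:
        node = stack.pop()
        if node not in seen:  # a DAG node may be reachable by several paths
            seen.add(node)
            stack.extend(taxonomy.get(node, []))
    return seen
```

Here `is_sub_category` follows the definition literally (a direct edge), while `descendants` gives the whole sub-tree below a category.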
Features
Our classification algorithm works by extracting
features about entities, and using some simple
machine learning. A feature is basically a piece of
information about the entity selected in a given
space. We use several feature spaces,
corresponding to several sources of information
we combine.
For every feature space, we developed a mapping
from features to categories, which models the
probability that an entity having a certain feature
should be categorized in a given category of the
taxonomy. We call this mapping a model.
Models to classify entities
Entity types: Every entity of Freebase has one or several
types. Types provide useful information about what entities
are. For instance, a musical artist (type /music/artist) is likely
to be classified under the /music category, while an Olympics
athlete (type /Olympics/olympic_athlete) is likely to be
classified under the /sports category.
In Freebase, the types for a given entity are split into two
sets:
The notable types, which are the types for which the entity is
widely known.
The other types, which are other types of lower importance.
Following some quality analysis, we decided to use only the
notable types, as the other types tend to introduce some
noise.
Models to classify entities
Freebase properties. Some of the properties
which connect entities together in Freebase are
particularly useful for classifying entities. For
instance, in Freebase, every music artist is
connected to entities representing music genres.
We extract some of this information as features,
mainly in the music, film, gaming and sport areas,
and use them as input for the classification
algorithm. These features can be weighted with
data coming from Freebase (e.g. fraction of music
records of an artist which fall into a given music
genre).
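As an illustration of such weighting, the Python sketch below turns the genres of an artist's records into weighted features. The input layout and the `genre:` feature prefix are assumptions; the real extraction from Freebase is not shown in the source.

```python
from collections import Counter


def genre_feature_weights(record_genres):
    """Weight each genre by the fraction of the artist's records that carry it.

    `record_genres` holds one list of genre names per record.
    """
    n_records = len(record_genres)
    if n_records == 0:
        return {}
    counts = Counter()
    for genres in record_genres:
        counts.update(set(genres))  # count each genre at most once per record
    return {f"genre:{genre}": count / n_records for genre, count in counts.items()}


# Toy discography (hypothetical): four records.
records = [["pop"], ["pop", "dance"], ["pop"], ["rock"]]
features = genre_feature_weights(records)
```

With this toy data, `genre:pop` gets weight 0.75 while `genre:dance` and `genre:rock` each get 0.25.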
Models to classify entities
Ads-related categories. Several taxonomies
for web documents have already been
developed at Google for advertising purposes,
together with a text classifier.
As these taxonomies and the classifier were already developed and
well proven, the classifier was of particular interest
as input for our classification algorithm. Hence,
we passed the English description of every
entity from Freebase through this classifier, and
used its output as input features for our
classifier with a dedicated model.
Models to classify entities
Portal pages. Wikipedia portals are pages
intended to serve as "main pages" for specific
topics or areas. As of mid-2012, there are around
2000 portals (gathering all languages), like portals
about food, anime and manga or insects.
These portals are a very interesting source of
information for classifying entities, as they cover a
wide range of topics which can be mapped to our
taxonomy. We extracted the references from entities
to portal pages from Freebase, and built a model
mapping portals to categories.
Classification Algorithm
The classification algorithm works on one entity at a
time, i.e. it can be defined as a function from entities
to categories. For a given entity, it proceeds as
follows:
1. It collects all the features for the entity from the different
sources. Every feature is associated with the entity with a
numeric weight.
2. It maps the features to actual categories by using the
models. The weight of a category for the entity is
computed as a combination of the weight of the feature
for the entity and the weight of the category for the
feature. If several features map one entity to the same
category, weights are summed up. Weights are also
propagated up towards the root of the taxonomy (e.g. the weight
of the /music category is the sum of the weights of the
features directly associated with this category, and the
weights of its sub-categories).
Classification Algorithm
3. Last, the taxonomy is traversed from the root towards the leaves
by selecting at every node the sub-category which has
the highest weight. If the ratio between the highest and
second-highest weighted sub-categories of a given
category is lower than a specific threshold, the traversal
is stopped. This leads to a classification to a non-leaf
category, which makes sense in particular for broad
topics.
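The three steps can be sketched in Python as below. This is a minimal sketch under stated assumptions: the "combination" of weights is taken to be a product, the threshold value 1.5 is arbitrary, and the taxonomy, model contents and feature names are toy data. It treats the taxonomy as a tree for simplicity.

```python
from collections import defaultdict

CHILDREN = {
    "/": ["/music", "/sports"],
    "/music": ["/music/pop", "/music/rock"],
    "/sports": [],
    "/music/pop": [],
    "/music/rock": [],
}

# Model: feature -> {category: weight of the category for that feature}.
MODEL = {
    "type:/music/artist": {"/music": 0.9},
    "genre:pop": {"/music/pop": 0.8},
    "genre:rock": {"/music/rock": 0.7},
}


def direct_weights(entity_features, model):
    """Step 2a: map weighted features to categories and sum per category."""
    weights = defaultdict(float)
    for feature, feature_weight in entity_features.items():
        for category, category_weight in model.get(feature, {}).items():
            weights[category] += feature_weight * category_weight  # assumed product
    return weights


def propagate(category, direct, children, totals):
    """Step 2b: a category's total is its direct weight plus its sub-categories'."""
    total = direct.get(category, 0.0)
    for child in children[category]:
        total += propagate(child, direct, children, totals)
    totals[category] = total
    return total


def classify(entity_features, model, children, root="/", min_ratio=1.5):
    """Step 3: descend from the root while one sub-category clearly dominates."""
    totals = {}
    propagate(root, direct_weights(entity_features, model), children, totals)
    node = root
    while children[node]:
        ranked = sorted(children[node], key=lambda c: totals[c], reverse=True)
        best = ranked[0]
        if totals[best] <= 0.0:
            break  # no evidence below this node
        if len(ranked) > 1 and totals[ranked[1]] > 0.0:
            if totals[best] / totals[ranked[1]] < min_ratio:
                break  # ambiguous: stay at the broader category
        node = best
    return node


clear = classify({"type:/music/artist": 1.0, "genre:pop": 0.9, "genre:rock": 0.1},
                 MODEL, CHILDREN)
broad = classify({"type:/music/artist": 1.0, "genre:pop": 0.5, "genre:rock": 0.5},
                 MODEL, CHILDREN)
```

In the first call pop clearly dominates rock, so the traversal reaches /music/pop; in the second the two genres are too close, so it stops at /music. An entity with no known feature stays at the root, which a caller could treat as unclassified.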
Developing the models
The development of the models is a key step
in the development of the classifier, as it directly
affects the quality of the obtained
classification, both in terms of precision and
coverage.
Developing the models
We developed a first version of models for the
different feature spaces as follows:
We mapped every category of the taxonomy to one entity
of Freebase (or, in a few particular cases, two or three).
For instance we mapped the category /music/pop to the
Freebase entity Pop Music.
For each category, we analyzed the features of the
associated entities, and we determined which of these
features were representative of it, i.e. present for it and its
sub-categories, but not for other categories in the
taxonomy. For instance, this would recognize that the
Freebase type /music/artist is representative of the
category /music, because it is present only in this sub-tree
of the taxonomy.
We performed a manual curation of this result, in order to correct
a few obvious issues.
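The selection of representative features can be sketched in Python as a set difference between the features seen inside a sub-tree and those seen outside it. The seed data below is a toy example with assumed feature names, not the actual Freebase content.

```python
CHILDREN = {
    "/": ["/music", "/sports"],
    "/music": ["/music/pop"],
    "/music/pop": [],
    "/sports": [],
}

# Category -> features of the Freebase entities mapped to it (toy data).
SEED_FEATURES = {
    "/music": {"type:/music/artist", "type:/common/topic"},
    "/music/pop": {"type:/music/artist", "type:/music/genre", "type:/common/topic"},
    "/sports": {"type:/sports/sport", "type:/common/topic"},
}


def subtree(category, children):
    """The category itself plus everything below it."""
    nodes = {category}
    for child in children.get(category, []):
        nodes |= subtree(child, children)
    return nodes


def representative_features(category, children, seed_features):
    """Features present in the category's sub-tree and nowhere else."""
    inside = subtree(category, children)
    inside_features = set()
    outside_features = set()
    for cat, features in seed_features.items():
        if cat in inside:
            inside_features |= features
        else:
            outside_features |= features
    return inside_features - outside_features


music_features = representative_features("/music", CHILDREN, SEED_FEATURES)
sports_features = representative_features("/sports", CHILDREN, SEED_FEATURES)
```

A feature shared by every branch, such as the generic topic type in this toy data, is discarded because it does not discriminate between categories.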
Improving the models
In order to improve the quality of the models,
we developed a trainer. The trainer works as
follows:
It takes as input a set of entities, and an initial
version of the models.
It runs the classification algorithm with the initial
version of the model, on the given set of entities,
and determines for every entity a classification in
the taxonomy.
It computes, for every feature, the probability that
an entity having this feature belongs to a given
category.
Improving the models
This process allows computing a new version
of the models.
It can be executed iteratively (replacing initial
version and new version by version N and
version N+1). The benefit of this approach
is that it allows improving the
quality of the model for one feature space
using the classification obtained from other
feature spaces. As not all entities have
features in all spaces, the overall coverage of
the classification is improved.
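One training iteration can be sketched in Python as follows. To stay short, the sketch swaps in a flat stand-in classifier (highest summed weight) instead of the taxonomy traversal, and all entity and feature names are hypothetical.

```python
from collections import Counter, defaultdict


def classify(features, model):
    """Flat stand-in classifier: category with the highest summed weight."""
    scores = defaultdict(float)
    for feature in features:
        for category, weight in model.get(feature, {}).items():
            scores[category] += weight
    return max(scores, key=scores.get) if scores else None


def train_once(entities, model):
    """Classify with the current model, then re-estimate P(category | feature)."""
    counts = defaultdict(Counter)
    for features in entities.values():
        category = classify(features, model)
        if category is None:
            continue  # unclassified entities contribute nothing
        for feature in features:
            counts[feature][category] += 1
    return {
        feature: {cat: n / sum(by_cat.values()) for cat, n in by_cat.items()}
        for feature, by_cat in counts.items()
    }


# Toy data: two feature spaces, entity types and portal references.
entities = {
    "e1": {"type:artist", "portal:music"},
    "e2": {"type:artist"},
    "e3": {"portal:music"},
}
model_v0 = {"type:artist": {"/music": 1.0}}  # initial model knows types only

before = classify(entities["e3"], model_v0)
model_v1 = train_once(entities, model_v0)
after = classify(entities["e3"], model_v1)
```

Entity e3 has no feature known to the initial model, so it is unclassified at first. Because e1 carries both features, the trainer learns that the portal feature also points to /music, and e3 becomes classifiable with the new model.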
Classifying user generated YouTube channels
User-generated channels are the most widely
known type of channels on YouTube. These
are the channels generated by a human
creator (who uploaded a set of videos) or a
human curator (who selected a set of videos
uploaded by one or several creators). From a
data point of view, such a channel consists of
textual meta-data (similar to that of a video), and a
list of videos.
Classifying user generated YouTube channels - algorithm
Algorithm. The classification algorithm for user channels
works as follows.
First, the textual meta-data of the channel is annotated using
the same process as for videos. It produces a set of relevant
entities for the user channel. The algorithm then considers
the categories of these entities, as determined by the
classifier, and computes their distribution.
Second, the videos of the channel are themselves annotated
with entities, and mapped to categories. Then, for every
category, the algorithm computes the fraction of the videos of
the channel which are annotated by an entity of this category.
The calculation of the fraction is weighted by the relative
number of views of the videos (in the last 30 days, to have a
current view of the channel contents), and by the weight of
the supporting entity for the video.
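The second part can be sketched in Python as below. The record layout, the use of a product to combine view share and entity weight, and the choice of keeping only the strongest entity per category for each video are assumptions made for illustration.

```python
from collections import defaultdict


def channel_category_weights(videos, entity_category):
    """View-weighted fraction of a channel's videos falling in each category."""
    total_views = sum(v["views_30d"] for v in videos)
    weights = defaultdict(float)
    if total_views == 0:
        return {}
    for video in videos:
        share = video["views_30d"] / total_views
        # Keep the strongest supporting entity per category for this video.
        best = {}
        for entity, entity_weight in video["entities"].items():
            category = entity_category.get(entity)
            if category is not None:
                best[category] = max(best.get(category, 0.0), entity_weight)
        for category, entity_weight in best.items():
            weights[category] += share * entity_weight
    return dict(weights)


# Toy channel (hypothetical views over the last 30 days and entity weights).
videos = [
    {"views_30d": 800, "entities": {"Lady Gaga": 1.0}},
    {"views_30d": 200, "entities": {"Tennis": 0.5}},
]
entity_category = {"Lady Gaga": "/music/pop", "Tennis": "/sports/tennis"}

weights = channel_category_weights(videos, entity_category)
```

The popular music video dominates the result, so the channel's weight concentrates on /music/pop even though one of its two videos is about tennis.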
Classifying user generated YouTube channels - algorithm result
These two parts associate weights with every category for the
given channel. A top-down traversal of the taxonomy is then
performed, as for entities, in order to select the most relevant
category for the channel. In addition, the algorithm includes two
criteria for improving precision:
Only classifications which are supported by both sources of
information (annotation of the channel meta-data and categories
derived from video annotations) are considered.
Only classifications which are consistent with the most prominent
video categories chosen by the users for the videos of the channel
are considered.
Another benefit of this algorithm is that it allows
assessing the thematic cohesiveness of a user channel. If a user
channel contains videos about unrelated topics, the first and second
steps will output weights spread across different sub-trees of the
taxonomy, and this can be handled in the third step, either to
generate a multiple classification, or to exclude the channel (if
non-cohesive channels are undesirable for the application).
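The two-source criterion and the notion of cohesiveness can be sketched in Python as follows. The source does not give a formula for cohesiveness; the measure used here (share of the weight held by the dominant top-level branch) and the comparison at top-level granularity are assumptions.

```python
from collections import defaultdict


def top_level(category):
    """Top-level branch of a category path, e.g. /music/pop -> /music."""
    return "/" + category.strip("/").split("/")[0]


def branch_weights(weights):
    """Sum the positive category weights per top-level branch."""
    branches = defaultdict(float)
    for category, weight in weights.items():
        if weight > 0.0:
            branches[top_level(category)] += weight
    return dict(branches)


def supported_branches(metadata_weights, video_weights):
    """Branches backed by both the channel meta-data and the video annotations."""
    return set(branch_weights(metadata_weights)) & set(branch_weights(video_weights))


def cohesiveness(weights):
    """Share of the total weight held by the dominant branch (assumed measure)."""
    branches = branch_weights(weights)
    total = sum(branches.values())
    return max(branches.values()) / total if total else 0.0


# Toy weights (hypothetical) from the two parts of the algorithm.
metadata_weights = {"/music": 0.7, "/news": 0.3}
video_weights = {"/music/pop": 0.8, "/sports/tennis": 0.1}
mixed_weights = {"/music/pop": 0.4, "/sports/tennis": 0.4, "/news": 0.2}

supported = supported_branches(metadata_weights, video_weights)
focused_score = cohesiveness(video_weights)
mixed_score = cohesiveness(mixed_weights)
```

A channel with a low score could then be given several classifications or left out of the catalog, depending on the application.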
Classifying algorithmic channels
Entity-centered channels. In 2012, YouTube launched
YouTube collections, which are video channels algorithmically
generated by YouTube, and internally known as algorithmic
channels.
An algorithmic channel is created for every entity of interest
for YouTube. The channel's videos are those annotated by
the entity (see the annotation process above), and algorithms are used to
generate the feed of videos and other aspects of the channel.
As there is a 1-to-1 mapping from algorithmic channels to
entities, classifying algorithmic channels is straightforward
using the outcome of the classifier presented. Every
algorithmic channel is assigned to the category of its
underlying entity.
Classifying algorithmic channels
Channels from blogs. YouTube also generates
channels from videos embedded in blogs. Each of
these channels contains all the YouTube videos which
are included in the posts of a given blog. These
channels are particularly powerful for newsy topics,
and for providing context about videos. The algorithm
we described can be applied to blogs, replacing the
channel meta-data by the blog meta-data. Quality
metrics are similar. In the future, we plan to replace
the annotation of the meta-data by a more precise set
of entities obtained by passing the whole blog posts
through the annotation algorithm. Similarly, blog posts
could be used as context information for annotating
videos.
Conclusion
This seminar described a complete framework for
classifying channels in YouTube, from the definition of
the product to its actual implementation. Alongside the
technical description, we explained our practical
approach for developing and evaluating such a
product.
This system is currently running daily on the whole
YouTube corpus, and serving several user facing
applications. It has been evaluated in terms of
precision and coverage, as well as by its performance
after launch.
To the best of our knowledge, this represents the first
large scale application of thematic video content
classification on an Internet site.