AI & ML Week-3
In the Python numpy module, we have many aggregate functions to work with single-dimensional or multi-dimensional arrays.
The numpy aggregate functions include sum, min, max, mean, average, product, median, standard deviation, and variance, to name a few.
First, we have to import NumPy with import numpy as np. To make a NumPy array, you can just use the np.array() function. The aggregate functions are shown below.
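A minimal sketch of the common aggregates (the array values are illustrative):

import numpy as np

a = np.array([10, 20, 30, 40])
print(np.sum(a))      # 100
print(np.min(a))      # 10
print(np.max(a))      # 40
print(np.mean(a))     # 25.0
print(np.average(a))  # 25.0
print(np.prod(a))     # 240000
print(np.median(a))   # 25.0
print(np.std(a))      # 11.18 (approximately)
print(np.var(a))      # 125.0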
Vectorized Operations
Processing such a large amount of data in Python can be slow compared to other
languages like C/C++. This is where vectorization comes into play.
Vectorized array operations will be faster than their pure Python equivalents
Vectorized sum
# vectorized sum
print(np.sum(np.arange(4)))
OUTPUT: 6
# timing it against a pure-Python sum (assumed comparison; the original target was not preserved)
a = list(range(4))
%timeit sum(a)
%timeit np.sum(np.arange(4))
Vectorized multiplication
# vectorized product
a = np.array([4, 5, 1])
print(np.prod(a))
%timeit np.prod(a)
# pure-Python equivalent
total = 1
for x in [4, 5, 1]:
    total = total * x
print(total)
OUTPUT: 20
Map
The map() function returns a map object (an iterator) of the results after applying the
given function to each item of a given iterable (list, tuple, etc.).
Filter
The filter() function returns an iterator where the items are filtered through a function
that tests whether each item is accepted or not.
Reduce
The reduce(fun, seq) function is used to apply a particular function, passed in its
argument, to all of the list elements in the sequence passed along. This function
is defined in the “functools” module.
Lambda
What is the syntax of a lambda function (or lambda operator)? lambda arguments:
expression
Think of lambdas as one-line methods without a name. They work practically the same as
any other method in Python, for example:
def add(x, y):
    return x + y
# the equivalent as a lambda:
add = lambda x, y: x + y
Lambdas differ from normal Python methods because they can have only one expression,
can't contain any statements, and the lambda expression itself evaluates to a function object.
map
The map() function iterates through all items in the given iterable and executes the
function we passed as an argument on each of them.
map(function, iterable(s))
We can pass as many iterable objects as we want after passing the function we want to use:
# Without using lambdas
def test(s):
    return s[0] == "A"

# illustrative input list (assumed; the original list was not preserved)
names = ["Anna", "Bob", "Carl", "Alex", "Dora"]
ob = map(test, names)
print(list(ob))
output is
[True, False, False, True, False]
Filter
The filter() function takes a function object and an iterable and creates an iterator of the
accepted items (which we typically convert to a list). The syntax is:
filter(function, iterable(s))
As the name suggests, filter() keeps only the elements that satisfy a certain condition,
i.e. those for which the function we passed returns True.
# filter with the same test function and names list as above
filter_ob = filter(test, names)
print(list(filter_ob))
output is
['Anna', 'Alex']
reduce() Function
reduce() works differently than map() and filter(). It does not return a new list based on
the function and iterable we've passed. Instead, it returns a single value.
reduce() isn't a built-in function anymore, and it can be found in the functools module.
The syntax is:
reduce(function, sequence[, initial])
reduce() works by calling the function we passed for the first two items in the sequence.
The result returned by the function is used in another call to function alongside with the
next (third in this case), element.
This process repeats until we've gone through all the elements in the sequence.
The optional argument initial, when present, is used at the beginning of this "loop" with
the first element in the first call to function. In a way, the initial value acts as a 0th
element, placed before the first one, when provided.
We start with a list [2, 4, 7, 3] and pass the add(x, y) function to reduce() alongside this
list, without an initial value.
reduce() calls add(2, 4), and add() returns 6.
reduce() calls add(6, 7) (the result of the previous call to add() and the next element in the list
as parameters), and add() returns 13.
reduce() calls add(13, 3), and add() returns 16.
Since no more elements are left in the sequence, reduce() returns 16.
from functools import reduce
def add(x, y):
    return x + y
nums = [2, 4, 7, 3]
print(reduce(add, nums))
OUTPUT: 16
Again, this could be written using lambdas:
from functools import reduce
nums = [2, 4, 7, 3]
print(reduce(lambda x, y: x + y, nums))
OUTPUT: 16
Example to run all the functions together:
from functools import reduce
lst = [4, 2, 0, 5, 1, 6, 3]
print(list(map(lambda num: num**2, lst)))
print(list(filter(lambda num: num > 2, lst)))
print(reduce(lambda x, y: x + y, lst))
OUTPUT:
[16, 4, 0, 25, 1, 36, 9]
[4, 5, 6, 3]
21
What is pandas?
Pandas is an open-source library that is built on top of the NumPy library.
It provides various data structures and operations for manipulating numerical data
and time series.
It provides ready-to-use, high-performance data structures and data analysis tools.
The pandas module runs on top of NumPy and is popularly used for data science and
data analytics.
It supports flexible reshaping and pivoting of data sets.
A DataFrame is a data structure that organizes data into a 2-dimensional table of
rows and columns, much like a spreadsheet
The DataFrame is common across many different languages, such as R, Scala, and
others.
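A minimal sketch of creating a DataFrame (the column names and values are illustrative and reused in the examples below):

import pandas as pd

df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meena", "Kiran"],
    "dept": ["CS", "CS", "EC", "EC"],
    "marks": [82, 75, 91, 68],
})
print(df)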
Data aggregation and grouping allows us to create summaries for display or analysis, for
example, when calculating average values or creating a table of counts or sums.
Aggregation function
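A hedged sketch of aggregation, assuming the df defined above; agg() applies one or more functions to a column:

# aggregate several functions over the marks column
print(df["marks"].agg(["sum", "min", "max", "mean"]))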
Grouping is used to group data using some criteria from our dataset. It follows the
split-apply-combine strategy.
Ex1: using groupby functions in dataframe to find first and last element in a group
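A minimal sketch, again assuming the df above; first() and last() return the first and last row of each group:

grouped = df.groupby("dept")
print(grouped.first())   # first row in each dept group
print(grouped.last())    # last row in each dept group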
Pivot
pivot() is used for reshaping a dataframe without applying aggregation. It requires that
the chosen index/column pairs contain no duplicate values, while pivot_table() handles
duplicates by aggregating them.
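A hedged sketch of pivot() on a small illustrative frame:

import pandas as pd

long_df = pd.DataFrame({
    "date": ["d1", "d1", "d2", "d2"],
    "city": ["A", "B", "A", "B"],
    "temp": [30, 25, 32, 27],
})
# each (date, city) pair is unique, so pivot() works without aggregation
wide_df = long_df.pivot(index="date", columns="city", values="temp")
print(wide_df)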
Melt
The melt() function enables us to reshape and elongate data frames in a user-defined
manner. It organizes the data values in a long data frame format.
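A minimal sketch, melting the wide frame from the pivot example back into long format:

melted = wide_df.reset_index().melt(id_vars="date", var_name="city", value_name="temp")
print(melted)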
Lambda Function
A lambda function is a small anonymous function. A lambda function can take any
number of arguments, but can only have one expression.
Syntax: lambda arguments: expression
For example (an addition lambda, consistent with the outputs shown):
add = lambda x, y: x + y
print(add(2, 4))
Output: 6
print(add(6, 7))
Output: 13
There are many parameters in the pd.date_range() method, but only the most used
parameters are explained here.
start: Starting date. It is the left bound for generating dates.
end: Ending date. It is the upper bound for generating dates.
periods: Number of periods to generate.
freq: It is used to generate dates on the basis of a frequency like "D" (daily) or "M" (month end).
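A minimal sketch (the dates are illustrative):

import pandas as pd

# six consecutive days starting 2023-01-01
idx = pd.date_range(start="2023-01-01", periods=6, freq="D")
print(idx)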
Shift operation
It is a common operation to shift time series data. We may need to make a comparison
between lagged or lead features. In our data frame, we can create a new feature that
contains the temperature of the previous day.
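A hedged sketch, assuming a small frame of daily temperatures indexed by the dates generated above:

temps = pd.DataFrame({"temp": [20, 22, 21, 23, 24, 22]}, index=idx)
# previous day's temperature as a new lagged feature
temps["temp_prev_day"] = temps["temp"].shift(1)
print(temps)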
Resample
Another common operation performed on time series data is resampling. It involves
changing the frequency of the periods. For instance, we may be interested in the weekly
temperature data rather than daily measurements.
The resample function creates groups (or bins) of a specified interval. Then, we can apply
aggregation functions to the groups to calculate the value at the resampled frequency.
Let’s calculate the average weekly temperatures. The first step is to resample the data to
week level. Then, we will apply the mean function to calculate the average.
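A minimal sketch, continuing with the temps frame above:

# group the daily series into weekly bins, then average each bin
weekly_mean = temps["temp"].resample("W").mean()
print(weekly_mean)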
Data visualization can make your data speak! There is no doubt that when information is
represented in the form of a picture like a graph or a chart, it can provide a much better
understanding of the data: meaningful, effective, and aesthetically pleasing.
The key skill of a Data Scientist is to tell a compelling story after finding useful patterns
and information from data.
The plots and graphs can provide a clear description of the data. The Visuals can help
support any claims you make based on the Data at hand.
They can be understood by any non-technical personnel, which is the major advantage
they offer. While doing so, they let us convey the most information while staying very
compact.
Efficiency
Clarity
Accuracy
Folium: Folium is a powerful Python library that helps you create several types of Leaflet
maps.
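A hedged sketch (the coordinates are illustrative; folium writes the map to an HTML file):

import folium

# map centered on a latitude/longitude pair
m = folium.Map(location=[12.97, 77.59], zoom_start=10)
m.save("map.html")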
Why Visualization
The importance of data visualization is simple: it helps people see, interact with,
and better understand data.
Whether simple or complex, the right visualization can bring everyone on the same
page, regardless of their level of expertise.
It’s hard to think of a professional industry that doesn’t benefit from making data
more understandable.
A data visualization first and foremost has to accurately convey the data. It must
not mislead or distort.
We will go through a set of most widely used representations below and how we can use
them in the most effective manner.
Ugly: A figure that has aesthetic problems but otherwise is clear and informative.
Bad: A figure that has problems related to perception; it may be unclear, confusing,
overly complicated, or deceiving.
Amounts: Numerical values shown for some set of categories, usually done with bar charts.
If there are two or more sets of categories for which we want to show amounts, we can
group or stack the bars.
We can also map the categories onto the x and y axes and show amounts by color, via a
heatmap
Distributions
Ex: Histogram:
We can visualize a table of counts per age bin by drawing filled rectangles whose heights
correspond to the counts and whose widths correspond to the width of the age bins.
Proportions
x–y relationships
Scatterplots are the standard visualization when we want to show one quantitative
variable relative to another. If we have three quantitative variables, we can map one onto
the dot size, creating a variant of the scatterplot called a bubble chart.
Uncertainty
Error bars are meant to indicate the range of likely values for some estimate or
measurement.
They extend horizontally and/or vertically from some reference point representing the
estimate or measurement
Bar chart
A bar chart is used when we want to compare metric values across different subgroups of
the data.
If we have a greater number of groups, a bar chart is preferred over a column chart.
Column charts are mostly used when we need to compare a single category of data
between individual sub-items, for example, when comparing revenue between regions
Line chart
A line chart is used for the representation of continuous data points. This visual can be
effectively utilized when we want to understand the trend across time.
Line charts are typically used to show the overall trend of a certain topic
Ex: the overall price movement of a stock, people's interest in a topic, or the unemployment
rate over the years.
Line charts are great to show trends in data by plotting data points connected with a line.
In matplotlib, you can plot a line chart using pyplot's plot() function. The following is the
syntax to plot a line chart:
plt.plot(x_values, y_values)
Example
We have data on the number of employees of a company XYZ year on year, and
want to plot it on a line chart using matplotlib.
import matplotlib.pyplot as plt
# number of employees of XYZ
emp_count = [3, 20, 50, 200, 350, 400]
year = [2014, 2015, 2016, 2017, 2018, 2019]
plt.plot(year, emp_count)
plt.show()
Pie chart
Pie charts can be used to identify proportions of the different components in a given
whole. Pie charts are used to present categorical data in a format that highlights how each
data point contributes to a whole, that is 100%.
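A minimal sketch (the category labels and shares are illustrative):

import matplotlib.pyplot as plt

shares = [45, 30, 15, 10]
labels = ["A", "B", "C", "D"]
plt.pie(shares, labels=labels, autopct="%1.0f%%")  # show each share as a percentage
plt.show()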
A histogram is a graph that shows the frequency of numerical data using rectangles.
Example: Say you ask for the height of 250 people; you might end up with a histogram
like this:
For simplicity we use NumPy to randomly generate an array with 250 values, where the
values will concentrate around 170, and the standard deviation is 10
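A sketch matching that description:

import numpy as np
import matplotlib.pyplot as plt

# 250 values concentrated around 170 with standard deviation 10
x = np.random.normal(170, 10, 250)
plt.hist(x)
plt.show()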
Scatter plots
Scatter plots can be leveraged to identify relationships between two variables. It can be
effectively used in circumstances where the dependent variable can have multiple values
for the independent variable.
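A minimal sketch with two illustrative, loosely related variables:

import numpy as np
import matplotlib.pyplot as plt

heights = np.random.normal(170, 10, 100)
weights = heights * 0.5 + np.random.normal(0, 5, 100)  # noisy dependence on height
plt.scatter(heights, weights)
plt.show()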
With pictures, maps and graphs, the human mind has an easier time processing and
understanding any given data.
Data visualization plays a significant role in the representation of both small and
large data sets, but it is especially useful when we have large data sets, in which it
is impossible to see all of our data, let alone process and understand it manually.
Data visualization is part art and part science. The challenge is to get the art right
without getting the science wrong and vice versa.
Data visualization first has to accurately convey the data. It must not mislead or
distort. If one number is twice as large as another, but in the visualization they
look to be about the same, then the visualization is wrong.
At the same time, a data visualization should be aesthetically pleasing. Good visual
presentations tend to enhance the message of the visualization.
To provide a simple visual guideline of which examples should be emulated and which
should be avoided, I am clearly labeling problematic figures as “ugly”, “bad”, or “wrong”
(Figure 1.1):
ugly—A figure that has aesthetic problems but otherwise is clear and informative.
bad—A figure that has problems related to perception; it may be unclear, confusing,
overly complicated, or deceiving.
We are not explicitly labeling good figures. Any figure that isn’t clearly labeled as
flawed should be assumed to be at least acceptable.
It is a figure that is informative, looks appealing, and could be printed as is. Note
that among the good figures, there will still be differences in quality, and some
good figures will be better than others.
We generally provide our rationale for specific ratings, but some are a matter
of taste. In general, the "ugly" rating is more subjective than the "bad" or "wrong"
rating.
Moreover, the boundary between “ugly” and “bad” is somewhat fluid. Sometimes
poor design choices can interfere with human perception to the point where a “bad”
rating is more appropriate than an “ugly” rating.
In any case, I encourage you to develop your own eye and to critically evaluate my
choices.
In today's world, a lot of data is generated on a daily basis, and analyzing this data
for trends and patterns can be difficult if the data is in its raw format.
To overcome this, data visualization comes into play. Data visualization provides a
good, organized pictorial representation of the data, which makes it easier to
understand, observe, and analyze. In this tutorial, we will discuss how to visualize data
using Python.
Data visualization is the discipline of trying to understand data by placing it in a
visual context so that patterns, trends, and correlations that might not otherwise be
detected can be exposed.
Python offers multiple great graphing libraries packed with lots of different
features. Whether you want to create interactive or highly customized plots, Python
has an excellent library for you.
Matplotlib
Matplotlib is the most basic library for visualizing data graphically. It includes
many of the graphs that we can think of.
Just because it is basic does not mean that it is not powerful, many of the other data
visualization libraries we are going to talk about are based on it.
Matplotlib's charts are made up of two main components: the axes (the lines that
delimit the area of the chart) and the figure (where we draw the axes, titles, and
things that come out of the area of the axes).
If you are working with Python from the terminal or a script, after defining the
graph with the functions we have written above use plt.show().
If you’re working from jupyter notebook, add %matplotlib inline to the beginning
of the file and run it before making the chart.
We can make multiple graphics in one figure. This goes very well for comparing
charts or for sharing data from several types of charts easily with a single image.
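A minimal sketch of multiple charts in one figure using plt.subplots (the data is illustrative):

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2)   # one row, two columns of axes
ax1.plot([1, 2, 3], [1, 4, 9])         # line chart on the left axes
ax2.bar(["a", "b", "c"], [3, 1, 2])    # bar chart on the right axes
plt.show()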
Seaborn
Seaborn is a dataset-oriented library for making statistical representations in
Python. It is developed atop matplotlib to create different visualizations.
It is integrated with pandas data structures. The library internally performs the
required mapping and aggregation to create informative visuals. It is recommended
to use a Jupyter/IPython interface in matplotlib mode.
Seaborn is a library based on Matplotlib. What it gives us are nicer graphics and
functions to make complex types of graphics with just one line of code.
We import the library and initialize the style of the graphics with sns.set(), without
this command the graphics would still have the same style as Matplotlib.
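A hedged sketch; load_dataset() fetches one of seaborn's small example datasets:

import seaborn as sns
import matplotlib.pyplot as plt

sns.set()                               # apply the seaborn style
tips = sns.load_dataset("tips")         # example dataset bundled with seaborn
sns.scatterplot(data=tips, x="total_bill", y="tip")
plt.show()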
Bokeh
Bokeh is an interactive visualization library for modern web browsers. It is suitable
for large or streaming data assets and can be used to develop interactive plots and
dashboards.
There is a wide array of intuitive graphs in the library which can be leveraged to
develop solutions.
It works closely with PyData tools. The library is well-suited for creating
customized visuals according to required use-cases.
The visuals can also be made interactive to serve a what-if scenario model. All the
codes are open source and available on GitHub.
Bokeh is a library that allows you to generate interactive graphics. We can export
them to an HTML document that we can share with anyone who has a web
browser. It is a very useful library when we want to explore things in the graphics
and be able to zoom in and move around the graphic.
Or when we want to share them and give the possibility to explore the data to
another person.
We start by importing the library and defining the file in which we will save the
graph.
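A minimal sketch that saves an interactive line chart to an HTML file (the data is illustrative):

from bokeh.plotting import figure, output_file, show

output_file("line.html")                 # file in which the graph is saved
p = figure(title="Simple line example")
p.line([1, 2, 3, 4], [4, 7, 2, 5])
show(p)                                  # opens the HTML document in a browser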
Altair
Altair is a declarative statistical visualization library for Python.
Altair's API is user-friendly and consistent and is built atop the Vega-Lite JSON
specification. Declarative means that while creating any visuals, we need
to define the links between the data columns and the channels (x-axis, y-axis, size,
color).
With the help of Altair, it is possible to create informative visuals with minimal
code. Altair holds a declarative grammar of both visualization and interaction.
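A hedged sketch of the declarative style (the data frame is illustrative):

import altair as alt
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [4, 7, 2, 5]})
# link the data columns to the x and y channels
chart = alt.Chart(df).mark_point().encode(x="x", y="y")
chart.save("chart.html")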
Plotly
plotly.py is an interactive, open-source, high-level, declarative, and browser-based
visualization library for Python.
It holds an array of useful visualization, which includes scientific charts, 3D
graphs, statistical charts, financial charts among others.
Plotly graphs can be viewed in Jupyter notebooks, standalone HTML files, or
hosted online. Plotly library provides options for interaction and editing.
The robust API works perfectly in both local and web browser mode.
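A minimal sketch using the high-level plotly.express interface (the data is illustrative):

import plotly.express as px

fig = px.scatter(x=[1, 2, 3, 4], y=[4, 7, 2, 5])
fig.show()   # renders in a notebook or opens in a browser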
Ggplot
ggplot is a Python implementation of the grammar of graphics.
The Grammar of Graphics refers to the mapping of data to aesthetic attributes
(colour, shape, size) and geometric objects (points, lines, bars).
The basic building blocks according to the grammar of graphics are data, geom
(geometric objects), stats (statistical transformations), scale, coordinate system, and
facet.
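The original ggplot package is no longer actively maintained; a hedged sketch of the same grammar using the plotnine package (the data is illustrative):

import pandas as pd
from plotnine import ggplot, aes, geom_point

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [4, 7, 2, 5]})
# data + aesthetic mapping + geometric object
plot = ggplot(df, aes(x="x", y="y")) + geom_point()
plot.save("plot.png")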
Directory of visualizations
This provides a quick visual overview of the various plots and charts that are commonly
used to visualize data. It is meant both to serve as a table of contents, in case you are
looking for a particular visualization whose name you may not know, and as a source of
inspiration, if you need to find alternatives to the figures you routinely make.
Amounts
The most common approach to visualizing amounts (i.e., numerical values shown
for some set of categories) is using bars, either vertically or horizontally arranged.
However, instead of using bars, we can also place dots at the location where the
corresponding bar would end.
If there are two or more sets of categories for which we want to show amounts, we can
group or stack the bars. We can also map the categories onto the x and y axes and show
amounts by color, via a heatmap.
Distributions
Histograms and density plots provide the most intuitive visualizations of a distribution,
but both require arbitrary parameter choices and can be misleading. Cumulative densities
and quantile-quantile (q-q) plots always represent the data faithfully but can be more
difficult to interpret.
Boxplots, violins, strip charts, and sina plots are useful when we want to visualize many
distributions at once and/or if we are primarily interested in overall shifts among the
distributions.
Stacked histograms can be confusing and are best avoided. Ridgeline plots can be a useful
alternative to violin plots and are often useful when visualizing very large numbers of
distributions or changes in distributions over time.
Proportions
Proportions can be visualized as pie charts, side-by-side bars, or stacked bars, and as in
the case for amounts, bars can be arranged either vertically or horizontally.
Pie charts emphasize that the individual parts add up to a whole and highlight simple
fractions.
However, the individual pieces are more easily compared in side-by-side bars. Stacked
bars look awkward for a single set of proportions, but can be useful when comparing
multiple sets of proportions (see below).
Grouped bars work well as long as the number of conditions compared is moderate, and
stacked bars can work for large numbers of conditions. Stacked densities are appropriate
when the proportions change along a continuous variable.
When proportions are specified according to multiple grouping variables, then mosaic
plots, treemaps, or parallel sets are useful visualization approaches.
Mosaic plots assume that every level of one grouping variable can be combined with
every level of another grouping variable, whereas treemaps do not make such an
assumption.
Treemaps work well even if the subdivisions of one group are entirely distinct from the
subdivisions of another.
Parallel sets work better than either mosaic plots or treemaps when there are more than
two grouping variables.
x–y relationships
If we have three quantitative variables, we can map one onto the dot size, creating a
variant of the scatterplot called a bubble chart.
For paired data, where the variables along the x and the y axes are measured in the
same units, it is generally helpful to add a line indicating x = y.
Paired data can also be shown as a slope graph of paired points connected by straight
lines.
For large numbers of points, regular scatterplots can become uninformative due to
overplotting.
In this case, contour lines, 2D bins, or hex bins may provide an alternative.
When we want to visualize more than two quantities, on the other hand, we may choose
to plot correlation coefficients in the form of a correlogram instead of the underlying raw
data
When the x axis represents time or a strictly increasing quantity such as a treatment dose,
we commonly draw line graphs.
Geospatial data
A map takes coordinates on the globe and projects them onto a flat surface, such that
shapes and distances on the globe are approximately represented by shapes and distances
in the 2D representation.
In addition, we can show data values in different regions by coloring those regions in
the map according to the data.
Such a map is called a choropleth. In some cases, it may be helpful to distort the
different regions according to some other quantity (e.g., population number) or simplify
each region into a square. Such visualizations are called cartograms.
Uncertainty
Error bars are meant to indicate the range of likely values for some estimate or
measurement.
They extend horizontally and/or vertically from some reference point representing the
estimate or measurement.
Reference points can be shown in various ways, such as by dots or by bars. Graded
error bars show multiple ranges at the same time, where each range corresponds to a
different degree of confidence.
They are in effect multiple error bars with different line thicknesses plotted on top of
each other.
To achieve a more detailed visualization than is possible with error bars or graded error
bars, we can visualize the actual confidence or posterior distributions.
Confidence strips provide a clear visual sense of uncertainty but are difficult to read
accurately.
Eyes and half-eyes combine error bars with approaches to visualize distributions (violins
and ridgelines, respectively), and thus show both precise ranges for some confidence
levels and the overall uncertainty distribution.
For smooth line graphs, the equivalent of an error bar is a confidence band. It shows a
range of values the line might pass through at a given confidence level.
As in the case of error bars, we can draw graded confidence bands that show multiple
confidence levels at once.
We can also show individual fitted draws in lieu of or in addition to the confidence bands.
To make any sort of data visualization, we need to define position scales, which
determine where in a graphic different data values are located.
We cannot visualize data without placing different data points at different
locations, even if we just arrange them next to each other along a line.
For regular 2d visualizations, two numbers are required to uniquely specify a point,
and therefore we need two position scales.
These two scales are usually but not necessarily the x and y axis of the plot. We
also have to specify the relative geometric arrangement of these scales.
Conventionally, the x axis runs horizontally and the y axis vertically, but we could
choose other arrangements.
For example, we could have the y axis run at an acute angle relative to the x axis,
or we could have one axis run in a circle and the other run radially.
The combination of a set of position scales and their relative geometric
arrangement is called a coordinate system.
Cartesian coordinates
The most widely used coordinate system for data visualization is the 2d Cartesian
coordinate system, where each location is uniquely specified by an x and a y value.
The x and y axes run orthogonally to each other, and data values are placed in an
even spacing along both axes (Figure 3.1).
The two axes are continuous position scales, and they can represent both positive
and negative real numbers.
To fully specify the coordinate system, we need to specify the range of numbers
each axis covers.
In Figure 3.1, the x axis runs from -2.2 to 3.2 and the y axis runs from -2.2 to 2.2.
Any data values between these axis limits are placed at the respective location in
the plot.
Any data values outside the axis limits are discarded.
Standard Cartesian coordinate system. The horizontal axis is conventionally called x and
the vertical axis y.
The two axes form a grid with equidistant spacing. Here, both the x and the y grid lines
are separated by units of one.
The point (2, 1) is located two x units to the right and one y unit above the origin (0, 0).
The point (-1, -1) is located one x unit to the left and one y unit below the origin.
Nonlinear axes
In a Cartesian coordinate system, the grid lines along an axis are spaced evenly both in
data units and in the resulting visualization.
We refer to the position scales in these coordinate systems as linear. While linear scales
generally provide an accurate representation of the data, there are scenarios where
nonlinear scales are preferred.
In a nonlinear scale, even spacing in data units corresponds to uneven spacing in the
visualization, or conversely even spacing in the visualization corresponds to uneven
spacing in data units.
The most commonly used nonlinear scale is the logarithmic scale or log scale for short.
Log scales are linear in multiplication, such that a unit step on the scale corresponds to
multiplication with a fixed value.
To create a log scale, we need to log-transform the data values while exponentiating the
numbers that are shown along the axis grid lines.
This process is demonstrated in Figure, which shows the numbers 1, 3.16, 10, 31.6, and
100 placed on linear and log scales.
The numbers 3.16 and 31.6 may seem a strange choice, but they were chosen because
they are exactly half-way between 1 and 10 and between 10 and 100 on a log scale.
We can see this by observing that 10^0.5 = √10 ≈ 3.16 and, equivalently,
3.16 × 3.16 ≈ 10. Similarly, 10^1.5 = 10 × 10^0.5 ≈ 31.6.
Relationship between linear and logarithmic scales. The dots correspond to data values 1,
3.16, 10, 31.6, 100, which are evenly-spaced numbers on a logarithmic scale.
We can display these data points on a linear scale, we can log-transform them and then
show on a linear scale, or we can show them on a logarithmic scale.
Importantly, the correct axis title for a logarithmic scale is the name of the variable
shown, not the logarithm of that variable.
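A minimal sketch of the three displays (linear scale, log-transformed values on a linear scale, and a logarithmic scale) using matplotlib:

import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 3.16, 10, 31.6, 100])
fig, (ax1, ax2, ax3) = plt.subplots(3, 1)
ax1.scatter(x, np.zeros_like(x))            # linear scale: uneven spacing
ax2.scatter(np.log10(x), np.zeros_like(x))  # log-transformed data, linear scale
ax3.scatter(x, np.zeros_like(x))
ax3.set_xscale("log")                       # logarithmic scale: even spacing
plt.show()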
All coordinate systems we have encountered so far used two straight axes positioned at a
right angle to each other, even if the axes themselves established a non-linear mapping
from data values to positions.
There are other coordinate systems, however, where the axes themselves are curved. In
particular, in the polar coordinate system, we specify positions via an angle and a radial
distance from the origin, and therefore the angle axis is circular.
Relationship between Cartesian and polar coordinates. (a) Three data points shown in a
Cartesian coordinate system.
(b) The same three data points shown in a polar coordinate system. We have taken the x
coordinates from part (a) and used them as angular coordinates and the y coordinates
from part (a) and used them as radial coordinates.
The circular axis runs from 0 to 4 in this example, and therefore x = 0 and x = 4 are the
same locations in this coordinate system.
Whenever we visualize data, we take data values and convert them in a systematic and
logical way into the visual elements that make up the final graphic.
Even though there are many different types of data visualizations, and on first glance a
scatterplot, a pie chart, and a heatmap don’t seem to have much in common, all these
visualizations can be described with a common language that captures how data values
are turned into blobs of ink on paper or colored pixels on a screen.
The key insight is the following: all data visualizations map data values into quantifiable
features of the resulting graphic. We refer to these features as aesthetics.
Aesthetics describe every aspect of a given graphical element. A few examples are
provided in Figure. A critical component of every graphical element is of course its
position, which describes where the element is located.
In standard 2D graphics, we describe positions by an x and y value, but other
coordinate systems and one- or three-dimensional visualizations are possible. Next,
all graphical elements have a shape, a size, and a color.
Even if we are preparing a black-and-white drawing, graphical elements need to
have a color to be visible: for example, black if the background is white or white if
the background is black.
Finally, to the extent we are using lines to visualize data, these lines may have
different widths or dash–dot patterns.
Beyond the examples shown in Figure, there are many other aesthetics we may
encounter in a data visualization.
For example, if we want to display text, we may have to specify font family, font
face, and font size, and if graphical objects overlap, we may have to specify
whether they are partially transparent.
Commonly used aesthetics in data visualization: position, shape, size, color, line
width, line type. Some of these aesthetics can represent both continuous and
discrete data (position, size, line width, color), while others can usually only
represent discrete data (shape, line type).
All aesthetics fall into one of two groups: those that can represent continuous data
and those that cannot.
Continuous data values are values for which arbitrarily fine intermediates exist. For
example, time duration is a continuous value.
Between any two durations, say 50 seconds and 51 seconds, there are arbitrarily
many intermediates, such as 50.5 seconds, 50.51 seconds, 50.50001 seconds, and
so on.
By contrast, number of persons in a room is a discrete value. A room can hold 5
persons or 6, but not 5.5.
For the examples in Figure, position, size, color, and line width can represent
continuous data, but shape and line type can usually only represent discrete data.
Next we’ll consider the types of data we may want to represent in our visualization.
You may think of data as numbers, but numerical values are only two out of
several types of data we may encounter.
In addition to continuous and discrete numerical values, data can come in the form
of discrete categories, in the form of dates or times, and as text (Table).
When data is numerical we also call it quantitative and when it is categorical we
call it qualitative.
Variables holding qualitative data are factors, and the different categories are
called levels.
The levels of a factor are most commonly without order (as in the example of dog,
cat, fish in Table), but factors can also be ordered, when there is an intrinsic order
among the levels of the factor (as in the example of good, fair, poor in Table).
Text: e.g., "The quick brown fox jumps over the lazy dog." Appropriate scale: none, or
discrete. Description: free-form text; can be treated as categorical if needed.
Scales link data values to aesthetics. Here, the numbers 1 through 4 have been mapped
onto a position scale, a shape scale, and a color scale. For each scale, each number
corresponds to a unique position, shape, or color, and vice versa.