AI & ML Week-3


Artificial Intelligence and Machine Learning Code:20CS51I

Explore Numpy module Array Aggregation Functions

In the Python numpy module, we have many aggregate functions to work with a single-
dimensional or multi-dimensional array.

The Python numpy aggregate functions include sum, min, max, mean, average, product, median, standard deviation, and variance, to name a few.

First, import NumPy with import numpy as np. To create a NumPy array, you can use the np.array() function. The aggregate functions are listed below, with a short example after the list.

1. np.sum(m): Used to find out the sum of the given array.


2. np.prod(m): Used to find out the product(multiplication) of the values of m.
3. np.mean(m): It returns the mean of the input array m.
4. np.std(m): It returns the standard deviation of the given input array m.
5. np.var(m): Used to find out the variance of the data given in the form of array m.
6. np.min(m): It returns the minimum value among the elements of the given array m.
7. np.max(m): It returns the maximum value among the elements of the given array
m.
8. np.argmin(m): It returns the index of the minimum value among the elements of
the array m.
9. np.argmax(m): It returns the index of the maximum value among the elements of
the array m
10. np.median(m): It returns the median of the elements of the array m
11. np.average(m): It returns the average of the elements of the input array m.
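A minimal sketch of these functions on a small array (the values here are illustrative):

import numpy as np

m = np.array([3, 1, 4, 1, 5, 9, 2, 6])

print(np.sum(m))                   # 31
print(np.prod(m))                  # 6480
print(np.mean(m))                  # 3.875
print(np.std(m))                   # standard deviation
print(np.var(m))                   # variance
print(np.min(m), np.max(m))        # 1 9
print(np.argmin(m), np.argmax(m))  # 1 5 (indices of min and max)
print(np.median(m))                # 3.5
print(np.average(m))               # 3.875 (equal weights by default)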


Vectorized Operations

Processing a large amount of data in Python can be slow compared to other languages like C/C++. This is where vectorization comes into play.

Vectorization is a technique of implementing array operations without using for loops. Instead, we use functions defined by various modules that are highly optimized, which reduces the running and execution time of the code.

Vectorized array operations will be faster than their pure Python equivalents.

Vectorized sum

# importing the modules (%timeit is an IPython/Jupyter magic, so run this in a notebook)
import numpy as np

# vectorized sum
print(np.sum(np.arange(4)))
print("Time taken by vectorized sum : ", end="")
%timeit np.sum(np.arange(4))

# iterative sum, wrapped in a function so that %timeit measures the loop
# itself (timing a bare variable would only time a name lookup)
def iterative_sum(n):
    total = 0
    for item in range(0, n):
        total += item
    return total

print("\n" + str(iterative_sum(4)))
print("Time taken by iterative sum : ", end="")
%timeit iterative_sum(4)

Output: 6, followed by the measured times (which vary from machine to machine).

Search Educations Page 3


Artificial Intelligence and Machine Learning Code:20CS51I

Use Map, Filter, Reduce and Lambda Functions with NumPy

Map
map() function returns a map object (which is an iterator) of the results after applying the
given function to each item of a given iterable (list, tuple etc.)

Filter
The filter() function returns an iterator were the items are filtered through a function to
test if the item is accepted or not

Reduce
The reduce (fun, seq) function is used to apply a particular function passed in its
argument to all of the list elements mentioned in the sequence passed along. This function
is defined in “functools” module.

Lambda
What is the syntax of a lambda function (or lambda operator)? lambda arguments:
expression
Think of lambdas as one-line methods without a name.
They work practically the same as any other method in Python, for example:

def add(x,y):

return x + y Can be translated to:


lambda x, y: x + y

Lambdas differ from normal Python methods because they can have only one expression,
can't contain any statements and their return type is a function object

map
The map() function iterates through all items in the given iterable and executes the
function we passed as an argument on each of them.

The syntax is:

map(function, iterable(s))

We can pass as many iterable objects as we want after passing the function we want to
use: # Without using lambdas
def test(s):
return s[0] == "A"

fruit = ["Apple", "Banana", "Pear", "Apricot", "Orange"] ob = map(test, fruit)

Search Educations Page 4


Artificial Intelligence and Machine Learning Code:20CS51I

print(list(ob))
output is

[True, False, False, True, False]

Filter
filter() Function takes a function object and an iterable and

Vectorized Operations
Processing such a large amount of data in python can be slow as compared to other
languages like C/C++. This is where vectorization comes into play.

Vectorization is a technique of implementing array operations without using for loops.


Instead, we use functions defined by various modules which are highly optimized that
reduces the running and execution time of code

Vectorized array operations will be faster than their pure Python equivalents

Vectorized sum

# importing the modules import numpy as np import timeit

# vectorized sum print(np.sum(np.arange(4)))

print("Time taken by vectorized sum : ",end= "")

%timeit np.sum(np.arange(4))

# iterative sum total = 0

for item in range(0, 4): total += item

a = total print("\n" + str(a))

print("Time taken by iterative sum : ",end= "")

%timeit a

OUTPUT

Search Educations Page 5


Artificial Intelligence and Machine Learning Code:20CS51I

Vectorized multiplication

# importing the modules (%timeit is an IPython/Jupyter magic)
import numpy as np

# vectorized product
a = np.array([4, 5, 1])
print(np.prod(a))
print("Time taken by vectorized product : ", end="")
%timeit np.prod(a)

# iterative product, wrapped in a function so %timeit measures the loop
def iterative_product(values):
    total = 1
    for item in values:
        total = total * item
    return total

print(iterative_product(a))
print("Time taken by iterative multiplication : ", end="")
%timeit iterative_product(a)

Output: 20, followed by the measured times.

Use Map, Filter, Reduce and Lambda Functions with NumPy

Map
map() function returns a map object (which is an iterator) of the results after applying the
given function to each item of a given iterable (list, tuple etc.)

Filter
The filter() function returns an iterator where the items are filtered through a function to test if the item is accepted or not.

Reduce
The reduce(fun, seq) function is used to apply a particular function passed in its argument to all of the list elements mentioned in the sequence passed along. This function is defined in the functools module.

Lambda
What is the syntax of a lambda function (or lambda operator)?

lambda arguments: expression

Think of lambdas as one-line methods without a name. They work practically the same as
any other method in Python, for example:

def add(x, y):
    return x + y

can be translated to:

lambda x, y: x + y

Lambdas differ from normal Python methods because they can have only one expression and can't contain any statements, and they evaluate to a function object.

map
The map() function iterates through all items in the given iterable and executes the
function we passed as an argument on each of them.

The syntax is:

map(function, iterable(s))

We can pass as many iterable objects as we want after passing the function we want to use:

# Without using lambdas
def test(s):
    return s[0] == "A"

fruit = ["Apple", "Banana", "Pear", "Apricot", "Orange"]
ob = map(test, fruit)
print(list(ob))

The output is:
[True, False, False, True, False]

Filter
The filter() function takes a function object and an iterable and creates a new iterator. The syntax is:

filter(function, iterable(s))

As the name suggests, filter() keeps only the elements that satisfy a certain condition, i.e. those for which the function we passed returns True.

# Without using lambdas
def starts_B(s):
    return s[0] == "B"

fruit = ["Apple", "Banana", "Pear", "Apricot", "Orange"]

# the same test written with a lambda
filter_ob = filter(lambda s: s[0] == "B", fruit)
print(list(filter_ob))

The output is:

['Banana']

reduce() Function

reduce() works differently than map() and filter(). It does not return a new list based on
the function and iterable we've passed. Instead, it returns a single value.

reduce() isn't a built-in function anymore, and it can be found in the functools module.
The syntax is:
reduce(function, sequence[, initial])

reduce() works by calling the function we passed for the first two items in the sequence. The result returned by the function is used in another call to the function, alongside the next (third in this case) element.

This process repeats until we've gone through all the elements in the sequence.

The optional argument initial is used, when present, at the beginning of this "loop" with the first element in the first call to the function. In a way, when provided, the initial element acts as a 0th element that comes before the first one.

We start with a list [2, 4, 7, 3] and pass the add(x, y) function to reduce() alongside this
list, without an initial value

reduce() calls add(2, 4), and add() returns 6

reduce() calls add(6, 7) (result of the previous call to add() and the next element in the list
as parameters), and add() returns 13

reduce() calls add(13, 3), and add() returns 16

Since no more elements are left in the sequence, reduce() returns 16.

from functools import reduce

def add(x, y):
    return x + y

nums = [2, 4, 7, 3]   # avoid naming this "list", which would shadow the built-in
print(reduce(add, nums))

Running this code would yield:

Output: 16

Again, this could be written using lambdas:

from functools import reduce

nums = [2, 4, 7, 3]
print(reduce(lambda x, y: x + y, nums))

Output: 16
Example to run all the functions together:

from functools import reduce

lst = [4, 2, 0, 5, 1, 6, 3]
print(list(map(lambda num: num**2, lst)))
print(list(filter(lambda num: num > 2, lst)))
print(reduce(lambda x, y: x + y, lst))

Output:

[16, 4, 0, 25, 1, 36, 9]
[4, 5, 6, 3]
21


Explore Pandas module

What is pandas?
 Pandas is an open-source library that is built on top of the NumPy library.
 It provides various data structures and operations for manipulating numerical data and time series.
 It provides ready-to-use, high-performance data structures and data analysis tools.
 Pandas runs on top of NumPy and is popularly used for data science and data analytics.
 It supports flexible reshaping and pivoting of data sets.
 A DataFrame is a data structure that organizes data into a 2-dimensional table of rows and columns, much like a spreadsheet.
 The DataFrame concept is common across many languages, such as R and Scala.

Aggregation and Grouping

Data aggregation and grouping allows us to create summaries for display or analysis, for
example, when calculating average values or creating a table of counts or sums.

Aggregation function

 sum(): Compute sum of column values
 min(): Compute min of column values
 max(): Compute max of column values
 mean(): Compute mean of column
 size(): Compute column sizes
 describe(): Generates descriptive statistics
 first(): Compute first of group values
 last(): Compute last of group values
 count(): Compute count of column values
 std(): Standard deviation of column


A simple example of using aggregation functions on a DataFrame (the original code and output were shown as a screenshot):
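A minimal sketch under illustrative data (the column names and values here are assumptions, not the original screenshot's):

import pandas as pd

df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "marks": [88, 92, 79, 85]
})

print(df["marks"].sum())    # 344
print(df["marks"].min())    # 79
print(df["marks"].max())    # 92
print(df["marks"].mean())   # 86.0
print(df["marks"].std())    # standard deviation
print(df["marks"].count())  # 4
print(df.describe())        # descriptive statistics for the numeric column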

Grouping using Pandas

Grouping is used to group data using some criteria from our dataset. It follows the split-apply-combine strategy.

Function Description:

 sum() :Compute sum of column values


 min() :Compute min of column values
 max() :Compute max of column values
 count() :Compute count of column values


Ex1: using groupby functions in dataframe to find first and last element in a group
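A minimal sketch with assumed column names and values (the original example was a screenshot):

import pandas as pd

df = pd.DataFrame({
    "team": ["X", "X", "Y", "Y", "Y"],
    "player": ["p1", "p2", "p3", "p4", "p5"],
    "score": [10, 15, 7, 9, 12]
})

grouped = df.groupby("team")
print(grouped.first())   # first value in each group
print(grouped.last())    # last value in each group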


Ex2: using groupby functions in dataframe to get count
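Again a sketch with assumed data:

import pandas as pd

df = pd.DataFrame({
    "team": ["X", "X", "Y", "Y", "Y"],
    "score": [10, 15, 7, 9, 12]
})

# number of non-null values per column, per group
print(df.groupby("team").count())   # X -> 2, Y -> 3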

Pivot and Melt function

Pivot
pivot() is used for pivoting (reshaping) a DataFrame without applying aggregation. The index/column combination used must not contain duplicate values.

Melt
The melt() function enables us to reshape and elongate data frames in a user-defined manner. It organizes the data values in a long data frame format.

Creating and reshaping a DataFrame with the pivot and melt functions (the original code was shown as a screenshot):
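A minimal sketch with assumed data:

import pandas as pd

df = pd.DataFrame({
    "day": ["Mon", "Mon", "Tue", "Tue"],
    "city": ["Delhi", "Mumbai", "Delhi", "Mumbai"],
    "temp": [30, 33, 31, 34]
})

# pivot: long -> wide; each (day, city) pair must be unique
wide = df.pivot(index="day", columns="city", values="temp")
print(wide)

# melt: wide -> long again
long_again = wide.reset_index().melt(id_vars="day", value_name="temp")
print(long_again)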


Map, Filter and Reduce, Lambda functions

map(), filter(), and reduce() were defined above; this section focuses on lambda functions and on applying all of these to DataFrames.

Lambda Function
A lambda function is a small anonymous function. A lambda function can take any
number of arguments, but can only have one expression.

Syntax

lambda arguments : expression


Add 1 to argument a, and return the result:

x = lambda a: a + 1
print(x(5))

Output: 6

Search Educations Page 14


Artificial Intelligence and Machine Learning Code:20CS51I

Sum arguments a, b, and c and return the result:

x = lambda a, b, c: a + b + c
print(x(5, 6, 2))

Output: 13

Using map, filter, and reduce with a DataFrame (the original code was shown as a screenshot):
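A minimal sketch with assumed data:

import pandas as pd
from functools import reduce

df = pd.DataFrame({"marks": [40, 65, 90, 72]})

# map: apply a function to every value of a column
df["grace"] = df["marks"].map(lambda m: m + 5)

# filter: keep only the rows satisfying a condition
passed = df[df["marks"] >= 60]

# reduce: fold a column down to a single value
total = reduce(lambda x, y: x + y, df["marks"])

print(df)
print(passed)
print(total)   # 267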

Search Educations Page 15


Artificial Intelligence and Machine Learning Code:20CS51I

Time series using Pandas


Time series data consists of data points attached to sequential time stamps. Daily sales,
hourly temperature values, and second-level measurements in a chemical process are
some examples of time series data.

Syntax of the Pandas date_range

pandas.date_range(start=None, end=None, periods=None, freq=None, tz=None)

The method has many parameters, but only the most commonly used ones are explained here.

start: Starting date; the left bound for generating dates.
end: Ending date; the upper bound for generating dates.
periods: Number of periods to generate.
freq: Frequency used to generate the dates, e.g. "D" (daily) or "M" (monthly).
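A short sketch (the dates are illustrative):

import pandas as pd

# seven consecutive daily timestamps
idx = pd.date_range(start="2022-01-01", periods=7, freq="D")
print(idx)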

Shift operation

It is a common operation to shift time series data. We may need to make a comparison
between lagged or lead features. In our data frame, we can create a new feature that
contains the temperature of the previous day.
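A minimal sketch with assumed temperature values:

import pandas as pd

df = pd.DataFrame(
    {"temp": [20, 22, 21, 23, 24]},
    index=pd.date_range("2022-01-01", periods=5, freq="D")
)

# the previous day's temperature as a new lagged feature
df["temp_prev_day"] = df["temp"].shift(1)
print(df)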


Resample
Another common operation performed on time series data is resampling. It involves changing the frequency of the periods. For instance, we may be interested in the weekly temperature data rather than daily measurements.

The resample function creates groups (or bins) of a specified interval. Then, we can apply aggregation functions to the groups to calculate the value based on the resampled frequency.

Let’s calculate the average weekly temperatures. The first step is to resample the data to
week level. Then, we will apply the mean function to calculate the average.
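A sketch continuing the assumed daily temperature series:

import pandas as pd

temps = pd.Series(
    [20, 22, 21, 23, 24, 25, 23, 22, 21, 20, 19, 22, 23, 24],
    index=pd.date_range("2022-01-01", periods=14, freq="D")
)

# resample to week level, then average each week
weekly_avg = temps.resample("W").mean()
print(weekly_avg)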


Data visualization using python

Data visualization refers to representing information in the form of visuals

Data visualization can make your data speak! There is no doubt that when information is represented in the form of a picture like a graph or a chart, it can provide a much better understanding of the data: meaningful, effective, and aesthetically pleasing.

The key skill of a Data Scientist is to tell a compelling story after finding useful patterns
and information from data.

The plots and graphs can provide a clear description of the data. The Visuals can help
support any claims you make based on the Data at hand.

They can be understood by non-technical personnel, which is a major advantage. At the same time, they convey a great deal of information in a very compact form.

Data visualization offers:

 Efficiency
 Clarity
 Accuracy

List of Python data visualization tools


Matplotlib: a comprehensive library for creating static, animated, and interactive visualizations in Python.

Seaborn: a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

Folium: a powerful Python library that helps you create several types of Leaflet maps.

Plotly: an interactive, open-source plotting library that supports over 40 unique chart types covering a wide range of statistical, financial, geographic, scientific, and 3-dimensional use-cases.

Why Visualization

 The importance of data visualization is simple: it helps people see, interact with, and better understand data.
 Whether simple or complex, the right visualization can bring everyone onto the same page, regardless of their level of expertise.
 It's hard to think of a professional industry that doesn't benefit from making data more understandable.
 A data visualization first and foremost has to accurately convey the data. It must not mislead or distort.

How to use the right visualization?

To extract the required information from the different visuals we create, it is essential that we use the correct representation based on the type of data and the questions that we are trying to answer.

We will go through a set of the most widely used representations below and how we can use them in the most effective manner.

Ugly, Bad, and Wrong Figures

Ugly: a figure that has aesthetic problems but otherwise is clear and informative.

Bad: a figure that has problems related to perception; it may be unclear, confusing, overly complicated, or deceiving.

Wrong: a figure that has problems related to mathematics; it is objectively incorrect.


Amount, Distribution, Proportion, X-Y Relationships and Uncertainty

Amount : Numerical values shown for some set of categories usually done by bar charts

If there are two or more sets of categories for which we want to show amounts, we can
group or stack the bars.

We can also map the categories onto the x and y axes and show amounts by color, via a
heatmap

Distributions

Distribution plots show how a particular variable is distributed in a dataset.

Example (histogram): the number of passengers with known age on the Titanic. We can visualize this table by drawing filled rectangles whose heights correspond to the counts and whose widths correspond to the width of the age bins.

Proportions

Proportions can be visualized as pie charts or side-by-side bars. Examples include regional differences in happiness, economic indicators or crime, and demographic differences in voting patterns.


x–y relationships

Scatterplots are used when we want to show one quantitative variable relative to another. If we have three quantitative variables, we can map one onto the dot size, creating a variant of the scatterplot called a bubble chart.

Uncertainty

Error bars are meant to indicate the range of likely values for some estimate or
measurement.

They extend horizontally and/or vertically from some reference point representing the
estimate or measurement


Coordinate Systems and Axes:

In geometry, a coordinate system is a system that uses one or more numbers, or coordinates, to uniquely determine the position of points.

Example: the simplest coordinate system is the identification of points on a line with real numbers using the number line.

Bar chart
A bar chart is used when we want to compare metric values across different subgroups of
the data.

If we have a greater number of groups, a bar chart is preferred over a column chart.
Column charts are mostly used when we need to compare a single category of data
between individual sub-items, for example, when comparing revenue between regions


Line chart
A line chart is used for the representation of continuous data points. This visual can be
effectively utilized when we want to understand the trend across time.

Line charts are typically used to show the overall trend of a certain topic

Examples: the overall price movement of a stock, people's interest in a topic, or the unemployment rate over the years.


How to plot a line chart in Matplotlib?

Line charts are great to show trends in data by plotting data points connected with a line.
In matplotlib, you can plot a line chart using pyplot’s plot() function. The following is the
syntax to plot a line chart:

import matplotlib.pyplot as plt
plt.plot(x_values, y_values)


Here, x_values are the values to be plotted on the x-axis and y_values are the values to be
plotted on the y-axis.

Example

Plot a line chart with default parameters

We have year-on-year data on the number of employees of a company XYZ and want to plot it on a line chart using matplotlib.

import matplotlib.pyplot as plt

# number of employees of XYZ
emp_count = [3, 20, 50, 200, 350, 400]
year = [2014, 2015, 2016, 2017, 2018, 2019]

# plot a line chart
plt.plot(year, emp_count)
plt.show()


Customize the formatting of a line chart
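The original showed this as a screenshot; the sketch below reuses the employee data and demonstrates a few common formatting options:

import matplotlib.pyplot as plt

emp_count = [3, 20, 50, 200, 350, 400]
year = [2014, 2015, 2016, 2017, 2018, 2019]

# customize color, line style, markers, axis labels, and title
plt.plot(year, emp_count, color="green", linestyle="--", marker="o")
plt.xlabel("Year")
plt.ylabel("Number of employees")
plt.title("Employee count of XYZ")
plt.grid(True)
plt.show()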


Line chart with multiple lines on the same scale
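Again a sketch in place of the original screenshot; the second company's numbers are made up:

import matplotlib.pyplot as plt

year = [2014, 2015, 2016, 2017, 2018, 2019]
xyz_count = [3, 20, 50, 200, 350, 400]
abc_count = [10, 30, 60, 150, 250, 380]   # hypothetical second company

plt.plot(year, xyz_count, label="XYZ")
plt.plot(year, abc_count, label="ABC")
plt.xlabel("Year")
plt.ylabel("Number of employees")
plt.legend()
plt.show()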


Pie chart
Pie charts can be used to identify proportions of the different components in a given
whole. Pie charts are used to present categorical data in a format that highlights how each
data point contributes to a whole, that is 100%.
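A minimal pie chart sketch with illustrative categories:

import matplotlib.pyplot as plt

labels = ["Python", "Java", "C++", "Other"]
share = [45, 25, 15, 15]   # percentages of a whole (100%)

plt.pie(share, labels=labels, autopct="%1.1f%%")
plt.show()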

Plotting Histogram using the Matplotlib

It is a graph showing the number of observations within each given interval.

A histogram is a graph that shows the frequency of numerical data using rectangles.

Example: say you ask for the height of 250 people; you might end up with a histogram like:

 2 people from 140 to 145 cm
 5 people from 145 to 150 cm
 15 people from 151 to 156 cm

For simplicity we use NumPy to randomly generate an array with 250 values, where the
values will concentrate around 170, and the standard deviation is 10
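A sketch of that idea, matching the description in the text:

import numpy as np
import matplotlib.pyplot as plt

# 250 values concentrated around 170 with standard deviation 10
heights = np.random.normal(170, 10, 250)

plt.hist(heights)
plt.xlabel("Height (cm)")
plt.ylabel("Frequency")
plt.show()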


Scatter plots

Scatter plots can be leveraged to identify relationships between two variables. It can be
effectively used in circumstances where the dependent variable can have multiple values
for the independent variable.
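A minimal scatter plot sketch with assumed data:

import matplotlib.pyplot as plt

# illustrative data: hours studied vs. exam score
hours = [1, 2, 3, 4, 5, 6, 7, 8]
score = [35, 45, 50, 58, 62, 70, 74, 85]

plt.scatter(hours, score)
plt.xlabel("Hours studied")
plt.ylabel("Exam score")
plt.show()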

Saving Plot as image
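The original showed a screenshot; a minimal sketch using matplotlib's savefig (the file name is arbitrary):

import matplotlib.pyplot as plt

plt.plot([1, 2, 3], [2, 4, 1])
# save the current figure to a PNG file before showing it
plt.savefig("my_plot.png", dpi=150, bbox_inches="tight")
plt.show()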


What is Data Visualization?


Data visualization is a field in data analysis that deals with visual representation of
data. It graphically plots data and is an effective way to communicate inferences
from data.
Using data visualization, we can get a visual summary of our data.

With pictures, maps and graphs, the human mind has an easier time processing and
understanding any given data.

Data visualization plays a significant role in the representation of both small and
large data sets, but it is especially useful when we have large data sets, in which it
is impossible to see all of our data, let alone process and understand it manually.

Data visualization is part art and part science. The challenge is to get the art right
without getting the science wrong and vice versa.

Data visualization first has to accurately convey the data. It must not mislead or
distort. If one number is twice as large as another is, but in the visualization they
look to be about the same, then the visualization is wrong.

At the same time, a data visualization should be aesthetically pleasing. Good visual
presentations tend to enhance the message of the visualization.

If a figure contains jarring colors, imbalanced visual elements, or other features that distract, then the viewer will find it harder to inspect the figure and interpret it correctly.

Ugly, bad, and wrong figures


We frequently show different versions of the same figures, some as examples of how to make a good visualization and some as examples of how not to.

To provide a simple visual guideline of which examples should be emulated and which should be avoided, problematic figures are clearly labeled as "ugly", "bad", or "wrong" (Figure 1.1):

ugly—A figure that has aesthetic problems but otherwise is clear and informative.

bad—A figure that has problems related to perception; it may be unclear, confusing,
overly complicated, or deceiving.


wrong—A figure that has problems related to mathematics; it is objectively incorrect.

Figure 1.1: Examples of ugly, bad, and wrong figures.

(a) A bar plot showing three values (A = 3, B = 5, and C = 4). This is a reasonable visualization with no major flaws.

(b) An ugly version of part (a). While the plot is technically correct, it is not aesthetically pleasing: the colors are too bright and not useful, the background grid is too prominent, and the text is displayed in three different fonts and three different sizes.

(c) A bad version of part (a). Each bar is shown with its own y-axis scale. Because the scales don't align, the figure is misleading; one can easily get the impression that the three values are closer together than they actually are.

(d) A wrong version of part (a). Without an explicit y-axis scale, the numbers represented by the bars cannot be ascertained. The bars appear to be of lengths 1, 3, and 2, even though the values displayed are meant to be 3, 5, and 4.


We are not explicitly labeling good figures. Any figure that isn’t clearly labeled as
flawed should be assumed to be at least acceptable.
It is a figure that is informative, looks appealing, and could be printed as is. Note
that among the good figures, there will still be differences in quality, and some
good figures will be better than others.
We generally provide our rationale for specific ratings, but some are a matter of taste. In general, the "ugly" rating is more subjective than the "bad" or "wrong" rating.
Moreover, the boundary between “ugly” and “bad” is somewhat fluid. Sometimes
poor design choices can interfere with human perception to the point where a “bad”
rating is more appropriate than an “ugly” rating.
In any case, I encourage you to develop your own eye and to critically evaluate my
choices.
In today's world, a lot of data is generated on a daily basis, and analyzing it for trends and patterns can be difficult when it is in its raw format. Data visualization overcomes this: it provides a good, organized pictorial representation of the data, which makes it easier to understand, observe, and analyze. In this tutorial, we will discuss how to visualize data using Python.
Data visualization is the discipline of trying to understand data by placing it in a
visual context so that patterns, trends, and correlations that might not otherwise be
detected can be exposed.
Python offers multiple great graphing libraries packed with lots of different
features. Whether you want to create interactive or highly customized plots, Python
has an excellent library for you.

Useful packages for visualizations in python

Matplotlib

Matplotlib is a visualization library in Python for 2D plots of arrays. Matplotlib is


written in Python and makes use of the NumPy library.
It can be used in Python and IPython shells, Jupyter notebook, and web application
servers.
Matplotlib comes with a wide variety of plots like line, bar, scatter, histogram, etc., which can help us deep-dive into understanding trends, patterns, and correlations. It was introduced by John Hunter in 2002.


Matplotlib is the most basic library for visualizing data graphically. It includes
many of the graphs that we can think of.
Just because it is basic does not mean that it is not powerful, many of the other data
visualization libraries we are going to talk about are based on it.
Matplotlib's charts are made up of two main components: the axes (the lines that delimit the area of the chart) and the figure (where we draw the axes, titles, and things that come out of the area of the axes).
If you are working with Python from the terminal or a script, after defining the
graph with the functions we have written above use plt.show().
If you’re working from jupyter notebook, add %matplotlib inline to the beginning
of the file and run it before making the chart.
We can make multiple graphics in one figure. This goes very well for comparing
charts or for sharing data from several types of charts easily with a single image.
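A small sketch of multiple charts in one figure using subplots:

import matplotlib.pyplot as plt

# two charts side by side in a single figure
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot([1, 2, 3], [1, 4, 9])
ax1.set_title("Line")
ax2.bar(["A", "B", "C"], [3, 5, 4])
ax2.set_title("Bar")
plt.show()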

Seaborn
Seaborn is a dataset-oriented library for making statistical representations in
Python. It is developed atop matplotlib and to create different visualizations.

It is integrated with pandas data structures. The library internally performs the required mapping and aggregation to create informative visuals. It is recommended to use a Jupyter/IPython interface in matplotlib mode.

Seaborn is a library based on Matplotlib. What it gives us are nicer graphics and
functions to make complex types of graphics with just one line of code.

We import the library and initialize the style of the graphics with sns.set(), without
this command the graphics would still have the same style as Matplotlib.
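A minimal Seaborn sketch using one of its built-in example datasets:

import seaborn as sns
import matplotlib.pyplot as plt

sns.set()   # initialize seaborn's default style

tips = sns.load_dataset("tips")   # built-in example dataset
sns.scatterplot(x="total_bill", y="tip", data=tips)
plt.show()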

Bokeh
Bokeh is an interactive visualization library for modern web browsers. It is suitable
for large or streaming data assets and can be used to develop interactive plots and
dashboards.
There is a wide array of intuitive graphs in the library which can be leveraged to
develop solutions.
It works closely with PyData tools. The library is well-suited for creating
customized visuals according to required use-cases.


The visuals can also be made interactive to serve a what-if scenario model. All the
codes are open source and available on GitHub.
Bokeh is a library that allows you to generate interactive graphics. We can export
them to an HTML document that we can share with anyone who has a web
browser. It is a very useful
library when we are interested in looking for things in the graphics and we want to
be able to zoom in and move around the graphic.
Or when we want to share them and give the possibility to explore the data to
another person.
We start by importing the library and defining the file in which we will save the
graph.
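A minimal Bokeh sketch (the file name and data are illustrative):

from bokeh.plotting import figure, output_file, show

# define the HTML file in which the graph will be saved
output_file("my_plot.html")

p = figure(title="Illustrative line chart")
p.line([1, 2, 3, 4], [3, 1, 4, 2])
show(p)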

Altair
Altair is a declarative statistical visualization library for Python.
Altair’s API is user- friendly and consistent and built atop Vega-Lite JSON
specification. Declarative library indicates that while creating any visuals, we need
to define the links between the data columns to the channels (x-axis, y-axis, size,
color).
With the help of Altair, it is possible to create informative visuals with minimal
code. Altair holds a declarative grammar of both visualization and interaction.
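A minimal Altair sketch showing the declarative channel mapping:

import altair as alt
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [3, 1, 4, 2]})

# declare which data columns map to the x and y channels
chart = alt.Chart(df).mark_line().encode(x="x", y="y")
chart.save("chart.html")   # or display directly in a notebook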

Plotly
plotly.py is an interactive, open-source, high-level, declarative, and browser-based
visualization library for Python.
It holds an array of useful visualization, which includes scientific charts, 3D
graphs, statistical charts, financial charts among others.
Plotly graphs can be viewed in Jupyter notebooks, standalone HTML files, or
hosted online. Plotly library provides options for interaction and editing.
The robust API works perfectly in both local and web browser mode.
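A minimal Plotly sketch using plotly.express and a built-in dataset:

import plotly.express as px

df = px.data.iris()   # built-in example dataset
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species")
fig.show()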

Ggplot
ggplot is a Python implementation of the grammar of graphics.
The Grammar of Graphics refers to the mapping of data to aesthetic attributes
(colour, shape, size) and geometric objects (points, lines, bars).
The basic building blocks according to the grammar of graphics are data, geom
(geometric objects), stats (statistical transformations), scale, coordinate system, and
facet.


Using ggplot in Python allows you to develop informative visualizations incrementally, understanding the nuances of the data first, and then tuning the components to improve the visual representations.

Directory of visualizations

This provides a quick visual overview of the various plots and charts that are commonly
used to visualize data. It is meant both to serve as a table of contents, in case you are
looking for a particular visualization whose name you may not know, and as a source of
inspiration, if you need to find alternatives to the figures you routinely make.

Amounts

The most common approach to visualizing amounts (i.e., numerical values shown
for some set of categories) is using bars, either vertically or horizontally arranged.
However, instead of using bars, we can also place dots at the location where the
corresponding bar would end.


If there are two or more sets of categories for which we want to show amounts, we can group or stack the bars. We can also map the categories onto the x and y axes and show amounts by color, via a heatmap.

Distributions

Histograms and density plots provide the most intuitive visualizations of a distribution,
but both require arbitrary parameter choices and can be misleading. Cumulative densities
and quantile-quantile (q-q) plots always represent the data faithfully but can be more
difficult to interpret.

Boxplots, violins, strip charts, and sina plots are useful when we want to visualize many
distributions at once and/or if we are primarily interested in overall shifts among the
distributions.

Stacked histograms and overlapping densities allow a more in-depth comparison of a smaller number of distributions, though stacked histograms can be difficult to interpret and are best avoided. Ridgeline plots can be a useful alternative to violin plots and are often useful when visualizing very large numbers of distributions or changes in distributions over time.

Proportions

Proportions can be visualized as pie charts, side-by-side bars, or stacked bars, and as in
the case for amounts, bars can be arranged either vertically or horizontally.

Pie charts emphasize that the individual parts add up to a whole and highlight simple
fractions.

However, the individual pieces are more easily compared in side-by-side bars. Stacked
bars look awkward for a single set of proportions, but can be useful when comparing
multiple sets of proportions (see below).

When visualizing multiple sets of proportions or changes in proportions across conditions, pie charts tend to be space-inefficient and often obscure relationships.

Grouped bars work well as long as the number of conditions compared is moderate, and
stacked bars can work for large numbers of conditions. Stacked densities are appropriate
when the proportions change along a continuous variable.


When proportions are specified according to multiple grouping variables, then mosaic
plots, treemaps, or parallel sets are useful visualization approaches.

Mosaic plots assume that every level of one grouping variable can be combined with
every level of another grouping variable, whereas treemaps do not make such an
assumption.

Treemaps work well even if the subdivisions of one group are entirely distinct from the
subdivisions of another.

Parallel sets work better than either mosaic plots or treemaps when there are more than
two grouping variables.

x–y relationships

Scatterplots represent the archetypical visualization when we want to show one quantitative variable relative to another.

If we have three quantitative variables, we can map one onto the dot size, creating a variant of the scatterplot called a bubble chart.

For paired data, where the variables along the x and the y axes are measured in the
same units, it is generally helpful to add a line indicating x = y .

Paired data can also be shown as a slope graph of paired points connected by straight
lines.


For large numbers of points, regular scatterplots can become uninformative due to
overplotting.

In this case, contour lines, 2D bins, or hex bins may provide an alternative.

When we want to visualize more than two quantities, on the other hand, we may choose
to plot correlation coefficients in the form of a correlogram instead of the underlying raw
data

When the x axis represents time or a strictly increasing quantity such as a treatment dose,
we commonly draw line graphs.

If we have a temporal sequence of two response variables, we can draw a connected scatterplot, where we first plot the two response variables in a scatterplot and then connect dots corresponding to adjacent time points.

We can use smooth lines to represent trends in a larger dataset.

Geospatial data


The primary mode of showing geospatial data is in the form of a map.

A map takes coordinates on the globe and projects them onto a flat surface, such that
shapes and distances on the globe are approximately represented by shapes and distances
in the 2D representation.

In addition, we can show data values in different regions by coloring those regions in
the map according to the data.

Such a map is called a choropleth. In some cases, it may be helpful to distort the
different regions according to some other quantity (e.g., population number) or simplify
each region into a square. Such visualizations are called cartograms.

Uncertainty

Error bars are meant to indicate the range of likely values for some estimate or
measurement.

They extend horizontally and/or vertically from some reference point representing the
estimate or measurement.

Reference points can be shown in various ways, such as by dots or by bars. Graded
error bars show multiple ranges at the same time, where each range corresponds to a
different degree of confidence.

They are in effect multiple error bars with different line thicknesses plotted on top of
each other.


To achieve a more detailed visualization than is possible with error bars or graded error
bars, we can visualize the actual confidence or posterior distributions.

Confidence strips provide a clear visual sense of uncertainty but are difficult to read
accurately.

Eyes and half-eyes combine error bars with approaches to visualize distributions (violins
and ridgelines, respectively), and thus show both precise ranges for some confidence
levels and the overall uncertainty distribution.

A quantile dot plot can serve as an alternative visualization of an uncertainty distribution.


By showing the distribution in discrete units, the quantile dot plot is not as precise but can
be easier to read than the continuous distribution shown by a violin or ridgeline plot.

For smooth line graphs, the equivalent of an error bar is a confidence band. It shows a
range of values the line might pass through at a given confidence level.

As in the case of error bars, we can draw graded confidence bands that show multiple
confidence levels at once.

We can also show individual fitted draws in lieu of or in addition to the confidence bands.


Coordinate systems and axes

To make any sort of data visualization, we need to define position scales, which
determine where in a graphic different data values are located.
We cannot visualize data without placing different data points at different
locations, even if we just arrange them next to each other along a line.
For regular 2d visualizations, two numbers are required to uniquely specify a point,
and therefore we need two position scales.
These two scales are usually but not necessarily the x and y axis of the plot. We
also have to specify the relative geometric arrangement of these scales.
Conventionally, the x axis runs horizontally and the y axis vertically, but we could
choose other arrangements.
For example, we could have the y axis run at an acute angle relative to the x axis,
or we could have one axis run in a circle and the other run radially.
The combination of a set of position scales and their relative geometric
arrangement is called a coordinate system.

Cartesian coordinates
The most widely used coordinate system for data visualization is the 2d Cartesian
coordinate system, where each location is uniquely specified by an x and a y value.
The x and y axes run orthogonally to each other, and data values are placed in an
even spacing along both axes (Figure 3.1).
The two axes are continuous position scales, and they can represent both positive
and negative real numbers.
To fully specify the coordinate system, we need to specify the range of numbers
each axis covers.
In Figure 3.1, the x axis runs from -2.2 to 3.2 and the y axis runs from -2.2 to 2.2.
Any data values between these axis limits are placed at the respective location in
the plot.
Any data values outside the axis limits are discarded.


Standard Cartesian coordinate system. The horizontal axis is conventionally called x and the vertical axis y.

The two axes form a grid with equidistant spacing. Here, both the x and the y grid lines
are separated by units of one.

The point (2, 1) is located two x units to the right and one y unit above the origin (0, 0).

The point (-1, -1) is located one x unit to the left and one y unit below the origin.

Nonlinear axes

In a Cartesian coordinate system, the grid lines along an axis are spaced evenly both in
data units and in the resulting visualization.

We refer to the position scales in these coordinate systems as linear. While linear scales
generally provide an accurate representation of the data, there are scenarios where
nonlinear scales are preferred.


In a nonlinear scale, even spacing in data units corresponds to uneven spacing in the
visualization, or conversely even spacing in the visualization corresponds to uneven
spacing in data units.

The most commonly used nonlinear scale is the logarithmic scale or log scale for short.
Log scales are linear in multiplication, such that a unit step on the scale corresponds to
multiplication with a fixed value.

To create a log scale, we need to log-transform the data values while exponentiating the
numbers that are shown along the axis grid lines.

This process is demonstrated in Figure, which shows the numbers 1, 3.16, 10, 31.6, and
100 placed on linear and log scales.

The numbers 3.16 and 31.6 may seem a strange choice, but they were chosen because
they are exactly half-way between 1 and 10 and between 10 and 100 on a log scale.

We can see this by observing that 10^0.5 = √10 ≈ 3.16, and equivalently 3.16 × 3.16 ≈ 10. Similarly, 10^1.5 = 10 × 10^0.5 ≈ 31.6.


Relationship between linear and logarithmic scales. The dots correspond to data values 1,
3.16, 10, 31.6, 100, which are evenly-spaced numbers on a logarithmic scale.

We can display these data points on a linear scale, we can log-transform them and then
show on a linear scale, or we can show them on a logarithmic scale.

Importantly, the correct axis title for a logarithmic scale is the name of the variable
shown, not the logarithm of that variable.

Coordinate systems with curved axes

All coordinate systems we have encountered so far used two straight axes positioned at a
right angle to each other, even if the axes themselves established a non-linear mapping
from data values to positions.

There are other coordinate systems, however, where the axes themselves are curved. In
particular, in the polar coordinate system, we specify positions via an angle and a radial
distance from the origin, and therefore the angle axis is circular.

Relationship between Cartesian and polar coordinates. (a) Three data points shown in a
Cartesian coordinate system.

(b) The same three data points shown in a polar coordinate system. We have taken the x
coordinates from part (a) and used them as angular coordinates and the y coordinates
from part (a) and used them as radial coordinates.


The circular axis runs from 0 to 4 in this example, and therefore x = 0 and x = 4 are the
same locations in this coordinate system.

Visualizing Categorical Data

Whenever we visualize data, we take data values and convert them in a systematic and
logical way into the visual elements that make up the final graphic.

Even though there are many different types of data visualizations, and on first glance a
scatterplot, a pie chart, and a heatmap don’t seem to have much in common, all these
visualizations can be described with a common language that captures how data values
are turned into blobs of ink on paper or colored pixels on a screen.

The key insight is the following: all data visualizations map data values into quantifiable
features of the resulting graphic. We refer to these features as aesthetics.

Aesthetics and Types of Data

Aesthetics describe every aspect of a given graphical element. A few examples are
provided in Figure. A critical component of every graphical element is of course its
position, which describes where the element is located.
In standard 2D graphics, we describe positions by an x and y value, but other
coordinate systems and one- or three-dimensional visualizations are possible. Next,
all graphical elements have a shape, a size, and a color.
Even if we are preparing a black-and-white drawing, graphical elements need to
have a color to be visible: for example, black if the background is white or white if
the background is black.
Finally, to the extent we are using lines to visualize data, these lines may have
different widths or dash–dot patterns.
Beyond the examples shown in Figure, there are many other aesthetics we may
encounter in a data visualization.
For example, if we want to display text, we may have to specify font family, font
face, and font size, and if graphical objects overlap, we may have to specify
whether they are partially transparent.


Commonly used aesthetics in data visualization: position, shape, size, color, line
width, line type. Some of these aesthetics can represent both continuous and
discrete data (position, size, line width, color), while others can usually only
represent discrete data (shape, line type).
All aesthetics fall into one of two groups: those that can represent continuous data
and those that cannot.
Continuous data values are values for which arbitrarily fine intermediates exist. For
example, time duration is a continuous value.
Between any two durations, say 50 seconds and 51 seconds, there are arbitrarily
many intermediates, such as 50.5 seconds, 50.51 seconds, 50.50001 seconds, and
so on.
By contrast, number of persons in a room is a discrete value. A room can hold 5
persons or 6, but not 5.5.
For the examples in Figure, position, size, color, and line width can represent
continuous data, but shape and line type can usually only represent discrete data.
Next we’ll consider the types of data we may want to represent in our visualization.
You may think of data as numbers, but numerical values are only two out of
several types of data we may encounter.
In addition to continuous and discrete numerical values, data can come in the form
of discrete categories, in the form of dates or times, and as text (Table).
When data is numerical we also call it quantitative and when it is categorical we
call it qualitative.
Variables holding qualitative data are factors, and the different categories are
called levels.


The levels of a factor are most commonly without order (as in the example of dog,
cat, fish in Table), but factors can also be ordered, when there is an intrinsic order
among the levels of the factor (as in the example of good, fair, poor in Table).

Table: Types of variables encountered in typical data visualization scenarios.

Quantitative/numerical continuous
  Examples: 1.3, 5.7, 83, 1.5 × 10^-2
  Appropriate scale: continuous
  Description: Arbitrary numerical values. These can be integers, rational numbers, or real numbers.

Quantitative/numerical discrete
  Examples: 1, 2, 3, 4
  Appropriate scale: discrete
  Description: Numbers in discrete units. These are most commonly but not necessarily integers. For example, the numbers 0.5, 1.0, 1.5 could also be treated as discrete if intermediate values cannot exist in the given dataset.

Qualitative/categorical unordered
  Examples: dog, cat, fish
  Appropriate scale: discrete
  Description: Categories without order. These are discrete and unique categories that have no inherent order. These variables are also called factors.

Qualitative/categorical ordered
  Examples: good, fair, poor
  Appropriate scale: discrete
  Description: Categories with order. These are discrete and unique categories with an order. For example, "fair" always lies between "good" and "poor." These variables are also called ordered factors.

Date or time
  Examples: Jan. 5 2018, 8:03am
  Appropriate scale: continuous or discrete
  Description: Specific days and/or times. Also generic dates, such as July 4 or Dec. 25 (without year).

Text
  Examples: The quick brown fox jumps over the lazy dog.
  Appropriate scale: none, or discrete
  Description: Free-form text. Can be treated as categorical if needed.

Scales Map Data Values onto Aesthetics


To map data values onto aesthetics, we need to specify which data values
correspond to which specific aesthetics values.
For example, if our graphic has an x axis, then we need to specify which data
values fall onto particular positions along this axis.
Similarly, we may need to specify which data values are represented by particular
shapes or colors.
This mapping between data values and aesthetics values is created via scales. A
scale defines a unique mapping between data and aesthetics.
Importantly, a scale must be one-to-one, such that for each specific data value there
is exactly one aesthetics value and vice versa.
If a scale isn’t one-to-one, then the data visualization becomes ambiguous.

Scales link data values to aesthetics. Here, the numbers 1 through 4 have been mapped
onto a position scale, a shape scale, and a color scale. For each scale, each number
corresponds to a unique position, shape, or color, and vice versa.
