Data Science
ENGINEERING COLLEGE
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
REGULATION 2021
II YEAR - III SEMESTER
COURSE OBJECTIVES:
• To understand the data science fundamentals and process.
• To learn to describe the data for the data science process.
• To learn to describe the relationship between data.
• To utilize the Python libraries for Data Wrangling.
• To present and interpret data using visualization libraries in Python
UNIT I INTRODUCTION
Data Science: Benefits and uses – facets of data - Data Science Process: Overview – Defining
research goals – Retrieving data – Data preparation - Exploratory Data analysis – build the
model– presenting findings and building applications - Data Mining - Data Warehousing – Basic
Statistical descriptions of Data
Big Data:
Big data is a blanket term for any collection of data sets so large or complex that it becomes
difficult to process them using traditional data management techniques such as, for example,
the RDBMS.
I. Data Science:
• Data science involves using methods to analyze massive amounts of data and extract
the knowledge it contains.
• The characteristics of big data are often referred to as the three Vs:
o Volume—How much data is there?
o Variety—How diverse are different types of data?
o Velocity—At what speed is new data generated?
• Fourth V:
• Veracity: How accurate is the data?
• Data science is an evolutionary extension of statistics capable of dealing with the
massive amounts of data produced today.
• What sets a data scientist apart from a statistician is the ability to work with big data and
experience in machine learning, computing, and algorithm building. Typical tools include
Hadoop, Pig, Spark, R, Python, and Java, among others.
• Data science and big data are used almost everywhere in both commercial and non-
commercial settings.
• Commercial companies in almost every industry use data science and big data to
gain insights into their customers, processes, staff, competition, and products.
• Many companies use data science to offer customers a better user experience.
o Eg: Google AdSense, which collects data from internet users so relevant
commercial messages can be matched to the person browsing the internet
o MaxPoint - example of real-time personalized advertising.
• Human resource professionals:
o people analytics and text mining to screen candidates,
o monitor the mood of employees, and
o study informal networks among coworkers
• Financial institutions use data science:
o to predict stock markets, determine the risk of lending money, and
o learn how to attract new clients for their services
• Governmental organizations:
o internal data scientists to discover valuable information,
o share their data with the public
o Eg: Data.gov is but one example; it’s the home of the US Government’s open
data.
o Some organizations have collected 5 billion data records from widespread
applications such as Google Maps, Angry Birds, email, and text messages, among
many other data sources.
• Nongovernmental organizations:
o World Wildlife Fund (WWF), for instance, employs data scientists to increase
the effectiveness of their fundraising efforts.
o Eg: DataKind is one such data scientist group that devotes its time to the
benefit of mankind.
• Universities:
o Use data science in their research but also to enhance the study experience of
their students.
o massive open online courses (MOOC) produces a lot of data, which allows
universities to study how this type of learning can complement traditional
classes.
o Eg: Coursera, Udacity, and edX
Structured data:
• Structured data is data that depends on a data model and resides in a fixed field within a record.
• It is easy to store structured data in tables within databases or Excel files and to query it with Structured Query Language (SQL).
Unstructured data:
• Unstructured data is data that isn’t easy to fit into a data model
• The content is context-specific or varying.
• Eg: E-mail
• Email contains structured elements such as the sender, title, and body text
• Eg: It’s a challenge to find the number of people who have written an email
complaint about a specific employee because so many ways exist to refer to a
person.
• Another complication is the thousands of different languages and dialects.
Natural language:
• A human-written email is also a perfect example of natural language data.
• Natural language is a special type of unstructured data;
• It’s challenging to process because it requires knowledge of specific data science
techniques and linguistics.
• Topics in NLP: entity recognition, topic recognition, summarization, text
completion, and sentiment analysis.
• Human language is ambiguous in nature.
Machine-generated data:
• Machine-generated data is information that’s automatically created by a computer,
process, application, or other machines without human intervention.
• Machine-generated data is becoming a major data resource.
• Eg: Wikibon has forecast that the market value of the industrial Internet will be
approximately $540 billion in 2020.
• International Data Corporation has estimated there will be 26 times more
connected things than people in 2020.
• This network is commonly referred to as the internet of things.
• Examples of machine data are web server logs, call detail records, network event
logs, and telemetry.
Graph-based or network data:
• “Graph” in this case points to mathematical graph theory. In graph theory, a graph is a mathematical structure to model pair-wise relationships between objects.
• Graph or network data is, in short, data that focuses on the relationship or adjacency of objects.
• The graph structures use nodes, edges, and properties to represent and store graphical data.
• Graph-based data is a natural way to represent social networks, and its structure
allows you to calculate the shortest path between two people.
• Graph-based data can be found on many social media websites.
• Eg: LinkedIn, Twitter, movie interests on Netflix
• Graph databases are used to store graph-based data and are queried with specialized query languages such as SPARQL.
Streaming data:
• The data flows into the system when an event happens instead of being loaded into
a data store in a batch.
• Examples are the “What’s trending” on Twitter, live sporting or music events, and the stock market.
The Data Science Process:
• A structured data science approach helps you maximize your chances of success in a data science project at the lowest cost.
• The first step of this process is setting a research goal.
• The main purpose here is to make sure all the stakeholders understand the what, how, and why of the project.
• Document the result in a project charter.
• The second step is retrieving data: assess the relevance and quality of the data that’s readily available within the company.
• Company data - data can be stored in official data repositories such as databases,
data marts, data warehouses, and data lakes maintained by a team of IT
professionals.
• Data mart: A data mart is a subset of the data warehouse and will be serving a
specific business unit.
• Data lakes: Data lakes contain data in its natural or raw format.
• Challenge: As companies grow, their data becomes scattered around many places.
• Knowledge of the data may be dispersed as people change positions and leave the
company.
• Chinese Walls: These policies translate into physical and digital barriers called
Chinese walls. These “walls” are mandatory and well-regulated for customer data.
Don’t be afraid to shop around:
The model needs the data in a specific format, so data transformation is a necessary step.
It’s a good habit to correct data errors as early on in the process as possible.
Cleansing data:
Data cleansing is a subprocess of the data science process.
It focuses on removing errors in the data.
Then the data becomes a true and consistent representation of the processes.
Types of errors:
Interpretation error - a person’s age is greater than 300 years
Inconsistencies - this class of errors includes putting “Female” in one table and “F” in another when
they represent the same thing.
DATA ENTRY ERRORS:
• Data collection and data entry are error-prone processes.
• Errors can arise from human sloppiness, whereas others are due to machine or
hardware failure.
• Eg: transmission errors
REDUNDANT WHITESPACE:
• Whitespaces tend to be hard to detect but cause errors like other redundant
characters.
• Eg: a mismatch of keys such as “FR ” – “FR”
• Fixing redundant whitespace - in Python you can use the strip() function to remove
leading and trailing spaces.
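A minimal Python sketch of this fix, assuming a hypothetical pandas column named country_code:

```python
import pandas as pd

# Hypothetical data: the same country code appears with and without extra spaces.
df = pd.DataFrame({"country_code": ["FR ", "FR", " DE", "DE"]})

# str.strip() removes leading and trailing whitespace, so "FR " and "FR" match again.
df["country_code"] = df["country_code"].str.strip()

# Plain Python strings work the same way.
print("FR ".strip() == "FR")  # True
```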
OUTLIERS
• An outlier is an observation that seems to be distant from other observations.
• The normal distribution, or Gaussian distribution, is the most common distribution
in natural sciences.
The high values in the bottom graph can point to outliers when assuming a normal
distribution.
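As a rough sketch (not from the text), one common way to flag such outliers under a normality assumption is to mark values more than three standard deviations from the mean; the ages below are made up for illustration:

```python
import numpy as np

# Hypothetical observations with one suspicious value (an age of 310 years).
ages = np.array([23, 31, 28, 45, 52, 39, 27, 41, 36, 30,
                 25, 47, 33, 29, 44, 38, 26, 35, 40, 310])

mean, std = ages.mean(), ages.std()
z_scores = (ages - mean) / std          # distance from the mean in standard deviations

outliers = ages[np.abs(z_scores) > 3]   # the usual "3-sigma" rule of thumb
print(outliers)                          # -> [310]
```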
JOINING TABLES
• Joining tables allows you to combine the information of one observation found in one table with the information that you find in another table.
• Tables are joined on keys - variables that represent the same object in both tables, such as a date, a country name, or a client number.
• When these keys also uniquely define the records in the table, they are called primary keys.
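A small pandas sketch of a join, using hypothetical clients and orders tables keyed on an assumed client_id column:

```python
import pandas as pd

# Hypothetical tables that share the key column "client_id" (assumed to be a
# primary key in the clients table, i.e. it uniquely identifies each record there).
clients = pd.DataFrame({"client_id": [1, 2, 3], "name": ["Ann", "Bob", "Cid"]})
orders = pd.DataFrame({"client_id": [1, 1, 3], "amount": [250, 80, 120]})

# Joining combines the information of an observation in one table with the
# information found in the other table, matched on the key.
joined = orders.merge(clients, on="client_id", how="left")
print(joined)
```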
APPENDING TABLES
• Appending or stacking tables is effectively adding observations from one table to
another table.
• With brushing and linking we combine and link different graphs and tables or
views so changes in one graph are automatically transferred to the other graphs.
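A matching pandas sketch of appending (stacking), using two hypothetical tables with the same columns:

```python
import pandas as pd

# Hypothetical tables with the same columns but observations from different months.
jan = pd.DataFrame({"client_id": [1, 2], "amount": [250, 80]})
feb = pd.DataFrame({"client_id": [1, 3], "amount": [60, 120]})

# Appending (stacking) adds the observations of one table below the other.
stacked = pd.concat([jan, feb], ignore_index=True)
print(stacked)  # four rows, same columns
```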
Pareto Diagram:
• A Pareto diagram is a combination of a bar chart with values sorted in descending order and a cumulative percentage line, often used during exploratory data analysis.
Build the Models:
• With clean data in place and a good understanding of the content, we’re ready to
build models with the goal of making better predictions, classifying objects, or
gaining an understanding of the system that we’re modeling.
• The techniques we’ll use now are borrowed from the field of machine learning,
data mining, and/or statistics.
We need to select the variables we want to include in our model and a modeling
technique.
We'll need to consider model performance and whether our project meets all the
requirements to use the model, as well as other factors:
■ Must the model be moved to a production environment and, if so, would it be easy to
implement?
■ How difficult is the maintenance on the model: how long will it remain relevant if left
untouched?
■ Does the model need to be easy to explain?
Model execution:
• Most programming languages, such as Python, already have libraries such as
StatsModels or Scikit-learn.
• These packages use several of the most popular techniques.
• Coding a model is a nontrivial task in most cases, so having these libraries available
can speed up the process.
Predictor significance—Coefficients are great, but sometimes not enough evidence exists
to show that the influence is there. This is what the p-value indicates: a p-value of 0.05
means that, if the predictor actually had no influence, an effect at least this large would
appear by chance only 5% of the time.
Model diagnostics and model comparison
Working with a holdout sample helps you pick the best-performing model.
A holdout sample is a part of the data you leave out of the model building so it can be used
to evaluate the model afterward. The principle here is simple: the model should work on
unseen data.
Mean square error is a simple measure: check for every prediction how far it was from the
truth, square this error, and add up the error of every prediction.
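A hedged scikit-learn sketch of the holdout idea, using synthetic data invented for illustration: fit on a training split, then compute the mean square error on the unseen holdout split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic data purely for illustration: y depends linearly on X plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

# Hold out 30% of the data; the model never sees it during fitting.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# Mean square error on the holdout sample: how far each prediction is from the
# truth, squared and averaged over all holdout observations.
mse = mean_squared_error(y_test, model.predict(X_test))
print(mse)
```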
X. Data Mining:
Many people treat data mining as a synonym for another popularly used term, knowledge
discovery from data, or KDD, while others view data mining as merely an essential step in
the process of knowledge discovery. The knowledge discovery process is shown in Figure
1.4 as an iterative sequence of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the
database)
4. Data transformation (where data are transformed and consolidated into forms
appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract data
patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge
based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation
techniques are used to present mined knowledge to users)
XI. Data Warehouses
• A data warehouse is a repository of information collected from multiple sources,
stored under a unified schema, and usually residing at a single site.
• Data warehouses are constructed via a process of data cleaning, data integration,
data transformation, data loading, and periodic data refreshing.
• To facilitate decision making, the data in a data warehouse are organized around
major subjects (e.g., customer, item, supplier, and activity).
• The data are stored to provide information from a historical perspective, such as in
the past 6 to 12 months, and are typically summarized.
• For example, rather than storing the details of each sales transaction, the data
warehouse may store a summary of the transactions per item type for each store or,
summarized to a higher level, for each sales region.
• A data warehouse is usually modeled by a multidimensional data structure, called a
data cube, in which each dimension corresponds to an attribute or a set of attributes
in the schema, and each cell stores the value of some aggregate measure such as
count.
XII. Basic Statistical Descriptions of Data:
• Suppose that we have some attribute X, like salary, which has been recorded for a
set of objects.
• Let x1, x2, ..., xN be the set of N observed values or observations for X.
• Here, these values may also be referred to as the data set (for X).
• Measures of central tendency include the mean, median, mode, and midrange.
• The most common and effective numeric measure of the “center” of a set of data is
the (arithmetic) mean.
• Let x1, x2, ..., xN be a set of N values or observations, such as for some numeric
attribute X, like salary.
• The mean of this set of values is x̄ = (x1 + x2 + ... + xN) / N.
• Sometimes, each value xi in a set may be associated with a weight wi for i = 1, ..., N.
• The weights reflect the significance, importance, or occurrence frequency attached
to their respective values. In this case, we can compute the weighted mean as
x̄ = (w1x1 + w2x2 + ... + wNxN) / (w1 + w2 + ... + wN). This is called the weighted
arithmetic mean or the weighted average.
• To offset the effect caused by a small number of extreme values, we can instead use
the trimmed mean,which is the mean obtained after chopping off values at the high
and low extremes.
• For example, we can sort the values observed for salary and remove the top and
bottom 2% before computing the mean.
• We should avoid trimming too large a portion (such as 20%) at both ends, as this can
result in the loss of valuable information.
• For skewed data, a better measure of the center of data is the median, which is the
middle value in a set of ordered data values.
• It is the value that separates the higher half of a data set from the lower half.
Median:
• The median generally applies to numeric data; however, we may extend the concept
to ordinal data.
• Suppose that a given data set of N values for an attribute X is sorted in increasing
order.
• If N is odd, then the median is the middle value of the ordered set. If N is even, then
the median is not unique; it is the two middlemost values and any value in between.
By convention, the median is then taken as the average of the two middlemost values.
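A short NumPy/SciPy sketch of these measures of central tendency, using made-up salary values (the weights passed to the weighted mean are arbitrary):

```python
import numpy as np
from scipy import stats

# Hypothetical salary observations (in thousands) with one extreme value.
salaries = np.array([30, 31, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

print(np.mean(salaries))                                # arithmetic mean
print(np.average(salaries, weights=np.arange(1, 13)))   # weighted arithmetic mean (made-up weights)
print(stats.trim_mean(salaries, 0.1))                   # mean after trimming 10% at each end
print(np.median(salaries))                              # middle value of the ordered data
```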
Mode:
• The mode is the value that occurs most frequently in the set; it can be determined for qualitative as well as quantitative attributes.
Mid Range:
• The midrange can also be used to assess the central tendency of a numeric data set.
• It is the average of the largest and smallest values in the set.
• This measure is easy to compute using the SQL aggregate functions, max() and
min().
• In a unimodal frequency curve with perfect symmetric data distribution, the mean,
median, and mode are all at the same center value.
• Data in most real applications are not symmetric.
• They may instead be either positively skewed, where the mode occurs at a value
that is smaller than the median, or negatively skewed, where the mode occurs at a
value greater than the median.
• Let x1, x2, ..., xN be a set of observations for some numeric attribute, X. The range of
the set is the difference between the largest (max()) and smallest (min()) values.
• Suppose that the data for attribute X are sorted in increasing numeric order.
• Imagine that we can pick certain data points so as to split the data distribution into
equal-size consecutive sets, as in Figure 2.2.
• These data points are called quantiles.
• Quantiles are points taken at regular intervals of a data distribution, dividing it into
essentially equal-size consecutive sets.
• The 2-quantile is the data point dividing the lower and upper halves of the data
distribution.
• It corresponds to the median.
• The 4-quantiles are the three data points that split the data distribution into four
equal parts; each part represents one-fourth of the data distribution. They are more
commonly referred to as quartiles.
• The 100-quantiles are more commonly referred to as percentiles; they divide the
data distribution into 100 equal-sized consecutive sets.
• The median, quartiles, and percentiles are the most widely used forms of quantiles.
• The distance between the first and third quartiles is a simple measure of spread that
gives the range covered by the middle half of the data. This distance is called the
interquartile range (IQR) and is defined as
IQR = Q3 − Q1.
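Continuing the same made-up salary values used earlier, the quartiles and the interquartile range can be computed with NumPy:

```python
import numpy as np

# The same hypothetical salary data as above (in thousands).
salaries = np.array([30, 31, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

q1, median, q3 = np.percentile(salaries, [25, 50, 75])  # the three quartiles
iqr = q3 - q1                                           # spread of the middle half of the data
print(q1, median, q3, iqr)
```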
The standard deviation, σ, of the observations is the square root of the variance, σ², where the variance is σ² = (1/N) Σ (xi − x̄)².
Graphic Displays of Basic Statistical Descriptions of Data:
Quantile Plot
A quantile plot is a simple and effective way to have a first look at a univariate
data distribution. First, it displays all of the data for the given attribute.
Second, it plots quantile information.
Quantile–Quantile Plot
A quantile–quantile plot, or q-q plot, graphs the quantiles of one univariate distribution
against the corresponding quantiles of another.
It is a powerful visualization tool in that it allows the user to view whether there is a shift
in going from one distribution to another.
Histograms
Histograms (or frequency histograms) are at least a century old and are widely used.
“Histos” means pole or mast, and “gram” means chart, so a histogram is a chart of poles.
Plotting histograms is a graphical method for summarizing the distribution of a given
attribute, X.
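A matplotlib/SciPy sketch of these displays, drawn for synthetic, normally distributed values invented for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Synthetic attribute values purely for illustration.
x = np.random.default_rng(1).normal(loc=50, scale=10, size=500)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: summarizes the distribution of the attribute.
ax1.hist(x, bins=20, edgecolor="black")
ax1.set_title("Histogram")

# q-q plot against the normal distribution: points near the line suggest normality.
stats.probplot(x, dist="norm", plot=ax2)
ax2.set_title("Quantile-quantile plot")

plt.tight_layout()
plt.show()
```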
Types of Data - Types of Variables -Describing Data with Tables and Graphs –Describing
Data with Averages - Describing Variability - Normal Distributions and Standard (z)
Scores
WHAT IS STATISTICS?
Statistics exists because of the prevalence of variability in the real world.
Descriptive Statistics:
• In its simplest form, known as descriptive statistics, statistics provides us with
tools—tables, graphs, averages, ranges, correlations—for organizing and summarizing
the inevitable variability in collections of actual observations or scores.
• Eg: a tabular listing ranked from most to least; a graph showing the annual change
in global temperature during the last 30 years.
Inferential Statistics:
• Statistics also provides tools—a variety of tests and estimates—for generalizing
beyond collections of actual observations.
• This more advanced area is known as inferential statistics.
• Eg: An assertion about the relationship between job satisfaction and overall
happiness
Data
Quantitative Data:
• The weights reported by 53 male students in Table 1.1 are quantitative data, since any
single observation, such as 160 lbs, represents an amount of weight.
Ranked Data
• Data are ranked when any single observation represents relative standing within the group, such as an ordering from 1 to 15 of the values in the list.
Qualitative Data
The Y and N replies of students in Table 1.2 are qualitative data, since any single
observation is a letter that represents a class of replies.
Approximate Numbers
• In theory, values for continuous variables can be carried out infinitely far.
• Eg: Someone’s weight, in pounds, might be 140.01438, and so on, to infinity!
• Practical considerations require that values for continuous variables be rounded off.
• Whenever values are rounded off, as is always the case with actual values for
continuous variables, the resulting numbers are approximate, never exact.
• For example, the weights of the students are reported to the nearest pound.
• A student whose weight is listed as 150 lbs could actually weigh between 149.5 and
150.5 lbs.
Independent and Dependent Variables
• Most studies raise questions about the presence or absence of a relationship
between two (or more) variables.
• Eg: A psychologist might wish to investigate whether couples who undergo special
training in “active listening” tend to have fewer communication breakdowns than do
couples who undergo no special training.
• An experiment is a study in which the investigator decides who receives the special
treatment; the variable manipulated by the investigator is called the independent variable.
Dependent Variable
• When a variable is believed to have been influenced by the independent variable, it is
called a dependent variable.
• In an experimental setting, the dependent variable is measured, counted, or
recorded by the investigator.
• Unlike the independent variable, the dependent variable isn’t manipulated by the
investigator.
• Instead, it represents an outcome: the data produced by the experiment.
• Eg: To test whether training influences communication, the psychologist counts the
number of communication breakdowns between each couple.
Observational Studies
• Instead of undertaking an experiment, an investigator might simply observe the
relation between two variables. For example, a sociologist might collect paired
measures of poverty level and crime rate for each individual in some group.
• Such studies are often referred to as observational studies.
• An observational study focuses on detecting relationships between variables not
manipulated by the investigator, and it yields less clear-cut conclusions about cause-
effect relationships than does an experiment.
Confounding Variable
• Whenever groups differ not just because of the independent variable but also because
some uncontrolled variable co-varies with the independent variable, any conclusion
about a cause-effect relationship is suspect.
• A difference between groups might be due not to the independent variable but to a
confounding variable.
• For instance, couples willing to devote extra effort to special training might already
possess a deeper commitment that co-varies with more active-listening skills.
• An uncontrolled variable that compromises the interpretation of a study is known as a
confounding variable.
Problems:
III. DESCRIBING DATA WITH TABLES AND GRAPHS:
• To organize the weights of the male statistics students listed in Table 1.1, first arrange
a column of consecutive numbers, beginning with the lightest weight (133) at the bottom
and ending with the heaviest weight (245) at the top.
• Place a short vertical stroke or tally next to a number each time its value appears in the
original set of data; once this process has been completed, substitute for each tally
count a number indicating the frequency (f) of occurrence of each weight.
• When observations are sorted into classes of single values, as in Table 2.1, the result
is referred to as a frequency distribution for ungrouped data.
• The frequency distribution shown in Table 2.1 is only partially displayed because there
are more than 100 possible values between the largest and smallest observations.
Grouped Data
• When observations are sorted into classes of more than one value, as in Table 2.2, the
result is referred to as a frequency distribution for grouped data.
• Data are grouped into class intervals with 10 possible values each.
• The bottom class includes the smallest observation (133), and the top class includes
the largest observation (245).
• The distance between bottom and top is occupied by an orderly series of classes.
• The frequency (f) column shows the frequency of observations in each class and, at
the bottom, the total number of observations in all classes.
Essential:
1. Each observation should be included in one, and only one, class.
Example: 130–139, 140–149, 150–159, etc.
2. List all classes, even those with zero frequencies.
Example: Listed in Table 2.2 is the class 210–219 and its frequency of zero.
3. All classes should have equal intervals.
Example: 130–139, 140–149, 150–159, etc. It would be incorrect to use 130–139, 140–159,
etc., in which the class intervals are unequal.
Optional:
4. All classes should have both an upper boundary and a lower boundary.
Example: 240–249. Less preferred would be 240–above, in which no maximum value can
be assigned to observations in this class.
5. Select the class interval from convenient numbers, such as 1, 2, 3, . . . 10, particularly 5
and 10 or multiples of 5 and 10.
Example: 130–139, 140–149, in which the class interval of 10 is a convenient number.
6. The lower boundary of each class interval should be a multiple of the class interval.
Example: 130–139, 140–149, in which the lower boundaries of 130, 140, are multiples of 10,
the class interval.
7. Aim for a total of approximately 10 classes. Example:
The distribution in Table 2.2 uses 12 classes.
CONSTRUCTING FREQUENCY DISTRIBUTIONS
1. Find the range
2. Find the class interval required to span the range by dividing the range by the desired
number of classes
3. Round off to the nearest convenient interval
4. Determine where the lowest class should begin.
5. Determine where the lowest class should end.
6. Working upward, list as many equivalent classes as are required to include the largest
observation.
7. Indicate with a tally the class in which each observation falls.
8. Replace the tally count for each class with a number—the frequency (f )—and showthe
total of all frequencies.
9. Supply headings for both columns and a title for the table.
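A pandas sketch of grouping observations into class intervals of width 10; the weights below are made up, since Table 1.1 is not reproduced here:

```python
import pandas as pd

# Hypothetical weights (lbs); real data would come from Table 1.1.
weights = pd.Series([133, 144, 150, 152, 155, 157, 160, 160, 165, 170,
                     172, 175, 180, 185, 190, 195, 205, 215, 230, 245])

# Class intervals of width 10, beginning at 130 (a multiple of the class interval).
bins = list(range(130, 260, 10))
freq = pd.cut(weights, bins=bins, right=False).value_counts().sort_index()

print(freq)          # frequency (f) of each class: [130, 140), [140, 150), ...
print(freq.sum())    # total number of observations
```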
Problems:
OUTLIERS
• One or more very extreme scores in a distribution are called outliers.
• Ex: A GPA of 0.06, an IQ of 170, summer wages of $62,000
Problem:
RELATIVE FREQUENCY DISTRIBUTIONS
Percentages or Proportions?
• A proportion always varies between 0 and 1, whereas a percentage always varies
between 0 percent and 100 percent.
• To convert the relative frequencies in Table 2.5 from proportions to percentages,
multiply each proportion by 100; that is, move the decimal point two places to the right.
Problem:
Cumulative Percentages
• If relative standing within a distribution is particularly important, then cumulative
frequencies are converted to cumulative percentages.
Percentile Ranks
• When used to describe the relative position of any score within its parent
distribution, cumulative percentages are referred to as percentile ranks.
• The percentile rank of a score indicates the percentage of scores in the entire
distribution with similar or smaller values than that score.
Problem:
FREQUENCY DISTRIBUTIONS FOR QUALITATIVE (NOMINAL) DATA
• When, among a set of observations, any single observation is a word, letter, or
numerical code, the data are qualitative.
• Determine the frequency with which observations occupy each class, and report
these frequencies.
• This frequency distribution reveals that Yes replies are approximately twice as
prevalent as No replies.
• When inspecting a distribution for the first time, train yourself to look at the entire
table, not just the distribution.
• Read the title, column headings, and any footnotes.
• Where do the data come from? Is a source cited? Next, focus on the form of the
frequency distribution.
• Apply the same scrutiny when interpreting distributions constructed by someone else.
GRAPHS
• Data can be described clearly and concisely with the aid of a well-constructed
frequency distribution, and often even more vividly with graphs.
GRAPHS FOR QUANTITATIVE DATA
Histograms
Frequency Polygon
• An important variation on a histogram is the frequency polygon, or line graph.
• Frequency polygons may be constructed directly from frequency distributions.
• However, we will follow the step-by-step transformation of a histogram into a
frequency polygon.
STEM AND LEAF DISPLAYS
• Stem and leaf displays are ideal for summarizing distributions, such as that for
weight data, without destroying the identities of individual observations.
Constructing a Display
• The leftmost panel of Table 2.9 re-creates the weights of the 53 male statistics
students listed in Table 1.1.
• To construct the stem and leaf display for these data, when counting by tens, the
weights range from the 130s to the 240s.
• Arrange a column of numbers, the stems, beginning with 13 (representing the
130s) and ending with 24 (representing the 240s).
• Draw a vertical line to separate the stems, which represent multiples of 10, from the
space to be occupied by the leaves, which represent multiples of 1.
• Next, enter each raw score into the stem and leaf display.
Interpretation
• The weight data have been sorted by the stems. All weights in the 130s are listed
together; all of those in the 140s are listed together, and so on.
• A glance at the stem and leaf display in Table 2.9 shows essentially the same pattern
of weights depicted by the frequency distribution in Table 2.2 and the histogram.
Selection of Stems
• Stem values are not limited to units of 10.
• Depending on the data, you might identify the stem with one or more leading digits that
culminates in some variation on a stem value of 10, such as 1, 100, 1000, or even .1,
.01, .001, and so on.
• Stem and leaf displays represent statistical bargains.
Problem:
TYPICAL SHAPES
• Whether expressed as a histogram, a frequency polygon, or a stem and leaf display,
an important characteristic of a frequency distribution is its shape.
Normal
• Any distribution that approximates the normal shape in panel A of Figure 2.3 can be
analyzed with the aid of the normal curve.
• The familiar bell-shaped silhouette of the normal curve can be superimposed on many
frequency distributions, Eg: uninterrupted gestation periods of human fetuses, scores
on standardized tests, and even the popping times of individual kernels in a batch of
popcorn.
Bimodal
• Any distribution that approximates the bimodal shape in panel B of Figure 2.3 reflects
the coexistence of two different types of observations in the same distribution.
• Eg: The distribution of the ages of residents in a neighborhood consisting largely of
either new parents or their infants has a bimodal shape.
Positively Skewed
• The two remaining shapes in Figure 2.3 are lopsided.
• A lopsided distribution caused by a few extreme observations in the positive
direction, as in panel C of Figure 2.3, is a positively skewed distribution.
• Eg: most family incomes under $200,000 and relatively few family incomes spanning
a wide range of values above $200,000.
Negatively Skewed
• A lopsided distribution caused by a few extreme observations in the negative
direction as in panel D of Figure 2.3, is a negatively skewed distribution.
• Eg: Most retirement ages at 60 years or older and relatively few retirement ages
spanning the wide range of ages younger than 60.
A GRAPH FOR QUALITATIVE (NOMINAL) DATA
• The equal segments along the horizontal axis are allocated to the different words or
classes that appear in the frequency distribution for qualitative data.
• Likewise, equal segments along the vertical axis reflect increases in frequency.
• The body of the bar graph consists of a series of bars whose heights reflect the
frequencies for the various words or classes.
• A person’s answer to the question “Do you have a Facebook profile?” is either Yes or
No, not some impossible intermediate value, such as 40 percent Yes and 60 percent
No.
MISLEADING GRAPHS
• Graphs can be constructed in an unscrupulous manner to support a particular point
of view.
• For example, to imply that comparatively many students responded Yes to the
Facebook profile question, an unscrupulous person might resort to various tricks.
• The width of the Yes bar is more than three times that of the No bar, thus violating the
custom that bars be equal in width.
• The lower end of the frequency scale is omitted, thus violating the custom that the
entire scale be reproduced, beginning with zero.
• The height of the vertical axis is several times the width of the horizontal axis, thus
violating the custom, heretofore unmentioned, that the vertical axis be approximately
as tall as the horizontal axis is wide.
Problem:
IV. DESCRIBING DATA WITH AVERAGES:
MODE
• The mode reflects the value of the most frequently occurring score.
Progress Check *3.1 Determine the mode for the following retirement ages: 60, 63, 45,
63, 65, 70, 55, 63, 60, 65, 63.
mode = 63
Progress Check *3.2 The owner of a new car conducts six gas mileage tests and obtains
the following results, expressed in miles per gallon: 26.3, 28.7, 27.4, 26.6, 27.4, 26.9. Find
the mode for these data.
mode = 27.4
MEDIAN
• The median reflects the middle value when observations are ordered from least to
most.
• The median splits a set of ordered observations into two equal parts, the upper and
lower halves.
• In other words, the median has a percentile rank of 50, since observations with
equal or smaller values constitute 50 percent of the entire distribution.
• To find the median, scores always must be ordered from least to most
• When the total number of scores is odd, as in the lower left-hand panel of Table 3.2,
there is a single middle-ranked score, and the value of the median equals the value of
this score.
• When the total number of scores is even, as in the lower right-hand panel of Table 3.2,
the value of the median equals a value midway between the values of the two
middlemost scores.
• In either case, the value of the median always reflects the value of middle-ranked
scores, not the position of these scores among the set of ordered scores.
• The median term of office can be found in the same way for the 20 presidents in Table 3.1.
Problems:
Progress Check *3.3 Find the median for the following retirement ages: 60, 63, 45, 63,
65, 70, 55, 63, 60, 65, 63.
median = 63
Progress Check *3.4 Find the median for the following gas mileage tests: 26.3, 28.7, 27.4,
26.6, 27.4, 26.9.
median = 27.15
MEAN
• The mean is the most common average, one you have calculated many times.
• The mean is found by adding all scores and then dividing by the number of scores:
• Mean = sum of all scores / number of scores.
• To find the mean term for the 20 presidents, add all 20 terms in Table 3.1 (4 + . . .
+ 4 + 8) to obtain a sum of 112 years, and then divide this sum by 20, the number
of presidents, to obtain a mean of 5.60 years.
Sample or Population?
• Statisticians distinguish between two types of means—the population mean and the
sample mean—depending on whether the data are viewed as a population (a
complete set of scores) or as a sample (a subset of scores).
Problems:
Progress Check *3.5 Find the mean for the following retirement ages: 60, 63, 45, 63, 65,
70, 55, 63, 60, 65, 63.
Progress Check *3.6 Find the mean for the following gas mileage tests: 26.3, 28.7, 27.4,
26.6, 27.4, 26.9.
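The progress-check answers can be verified with Python's statistics module:

```python
import statistics

ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]   # Progress Checks 3.1/3.3/3.5
mpg = [26.3, 28.7, 27.4, 26.6, 27.4, 26.9]            # Progress Checks 3.2/3.4/3.6

print(statistics.mode(ages), statistics.median(ages), statistics.mean(ages))
# 63  63  61.09 (approximately)
print(statistics.mode(mpg), statistics.median(mpg), statistics.mean(mpg))
# 27.4  27.15  27.22 (approximately)
```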
WHICH AVERAGE?
If Distribution Is Not Skewed
• When a distribution of scores is not too skewed, the values of the mode, median, and
mean are similar, and any of them can be used to describe the central tendency of the
distribution.
If Distribution Is Skewed
• When extreme scores cause a distribution to be skewed, as for the infant death rates
for selected countries listed in Table 3.4, the values of the three averages can differ.
• The mean is the single most preferred average for quantitative data.
• An average can refer to the mode, median, or mean—or even the geometric mean or
the harmonic mean.
• Conventional usage prescribes that average usually signifies mean, and this
connotation is often reinforced by the context.
• For instance, grade point average is virtually synonymous with mean grade point.
• But when the data are qualitative, your choice among averages is restricted.
• The mode always can be used with qualitative data.
Inappropriate Averages
• It would not be appropriate to report a median for unordered qualitative data with
nominal measurement, such as the ancestries of Americans.
Problem:
V. DESCRIBING VARIABILITY:
• In Figure 4.1, each of the three frequency distributions consists of seven scores with
the same mean (10) but with different variabilities.
• Before reading on, rank the three distributions from least to most variable.
• Distribution A has the least variability, distribution B has intermediate variability,
and distribution C has the most variability.
• For distribution A with the least (zero) variability, all seven scores have the same value
(10).
• For distribution B with intermediate variability, the values of scores vary slightly (one
9 and one 11), and for distribution C with most variability, they vary even more (one
7, two 9s, two 11s, and one 13).
Importance of Variability
• Variability assumes a key role in an analysis of research results.
• Eg: A researcher might ask: Does fitness training improve, on average, the scores of
depressed patients on a mental-wellness test?
• To answer this question, depressed patients are randomly assigned to two groups,
fitness training is given to one group, and wellness scores are obtained for both
groups.
• Figure 4.2 shows the outcomes for two fictitious experiments, each with the same
mean difference of 2, but with the two groups in experiment B having less variability
than the two groups in experiment C.
• Notice that groups B and C in Figure 4.2 are the same as their counterparts in Figure
4.1.
• Although the new group B* retains exactly the same (intermediate) variability as
group B, each of its seven scores and its mean have been shifted 2 units to the right.
• Likewise, although the new group C* retains exactly the same (most) variability as group
C, each of its seven scores and its mean have been shifted 2 units to the right.
• Consequently, the crucial mean difference of 2 (from 12 − 10 = 2) is the same for both
experiments.
RANGE
• The range is the difference between the largest and smallest scores.
• In Figure 4.1, distribution A, the least variable, has the smallest range of 0 (from 10 to
10); distribution B, the moderately variable, has an intermediate range of 2 (from 11
to 9); and distribution C, the most variable, has the
• largest range of 6 (from 13 to 7), in agreement with our intuitive judgments about
differences in variability.
Disadvantages of Range
• The range has several shortcomings.
• First, since its value depends on only two scores—the largest and the smallest—it
fails to use the information provided by the remaining scores.
• Second, the value of the range tends to increase with increases in the total number of
scores.
VARIANCE
• Variance and Standard Deviation are the two important measurements in
statistics.
• Variance is a measure of how data points vary from the mean.
• The standard deviation is the measure of the distribution of statistical data.
• For each of the three distributions in Figure 4.1, the face values of the seven original
scores have been re-expressed as deviation scores from their mean of 10.
• For example, in distribution C, one score coincides with the mean of 10, four scores
(two 9s and two 11s) deviate 1 unit from the mean, and two scores (one 7 and one
13) deviate 3 units from the mean, yielding a set of seven deviation scores: one 0,
two –1s, two 1s, one –3, and one 3.
• The sum of all negative deviations always counterbalances the sum of all positive
deviations, regardless of the amount of variability in the group.
• A measure of variability, known as the mean absolute deviation (or m.a.d.), can be
salvaged by summing all absolute deviations from the mean, that is, by ignoring
negative signs.
• Before calculating the variance (a type of mean), negative signs must be eliminated
from deviation scores. Squaring each deviation generates a set of squared deviation
scores, all of which are positive.
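A NumPy sketch using the seven scores of distribution C (7, 9, 9, 10, 11, 11, 13) to reproduce the deviation scores, variance, and standard deviation quoted in the text:

```python
import numpy as np

# Distribution C from Figure 4.1: seven scores with a mean of 10.
c = np.array([7, 9, 9, 10, 11, 11, 13])

deviations = c - c.mean()           # one 0, two -1s, two 1s, one -3, one 3
print(deviations.sum())             # 0: negative and positive deviations cancel out
print(np.mean(np.abs(deviations)))  # mean absolute deviation
print(c.var())                      # variance: mean of the squared deviations (about 3.14)
print(c.std())                      # standard deviation: square root of the variance (about 1.77)
```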
STANDARD DEVIATION
• The standard deviation is the square root of the mean of all squared deviations from the
mean, that is,
• standard deviation = √(sum of all squared deviation scores / number of scores).
• The standard deviation is a rough measure of the average amount by which scores
deviate on either side of their mean.
Majority of Scores within One Standard Deviation
For most frequency distributions, a majority of all scores are within one standard
deviation on either side of the mean.
• This is illustrated in Figure 4.3, where the lowercase letter s represents the standard deviation.
• As suggested in the top panel of Figure 4.3, if the distribution of IQ scores for a class
of fourth graders has a mean (X̄) of 105 and a standard deviation (s) of 15, a majority
of their IQ scores should be within one standard deviation on either side of the mean,
that is, between 90 and 120.
• For most frequency distributions, a small minority of all scores deviate more than two
standard deviations on either side of the mean.
• For instance, among the seven deviations in distribution C, none deviates more than
two standard deviations (2 × 1.77 = 3.54) on either side of the mean.
Answers to Progress Check *4.3:
(a) False. Relatively few students will score exactly one standard deviation from the
mean.
(b) False. Students will score both within and beyond one standard deviation from the
mean.
(c) True
(d) True
(e) False. See (b).
(f) True
STANDARD DEVIATION
Sum of Squares (SS)
• Calculating the standard deviation requires that we obtain first a value for the
variance.
• However, calculating the variance requires, in turn, that we obtain the sum of the
squared deviation scores.
• The sum of squared deviation scores symbolized by SS, merits special attention
because it’s a major component in calculations for the variance, as well as many other
statistical measures.
Sum of Squares Formula for Population: SS = Σ(X − μ)²
Standard Deviation for Population, σ: σ = √(SS/N), where the population variance is σ² = SS/N.
If μ Is Unknown
• It would be most efficient if, as above, we could use a random sample of n deviations
expressed around the population mean, X − μ, to estimate variability in the population.
• But this is usually impossible because, in fact, the population mean is unknown.
• Therefore, we must substitute the known sample mean, X̄, for the unknown
population mean, μ, and we must use a random sample of n deviations expressed
around their own sample mean, X − X̄, to estimate variability in the population.
• Although there are n = 5 deviations in the sample, only n − 1 = 4 of these deviations are
free to vary because the sum of the n = 5 deviations from their own sample mean
always equals zero.
VI. NORMAL DISTRIBUTIONS AND STANDARD (z) SCORES:
In Figure 5.2, the idealized normal curve has been superimposed on the original
distribution for 3091 men.
Interpreting the Shaded Area
• The total area under the normal curve in Figure 5.2 can be identified with all FBI
applicants.
• Viewed relative to the total area, the shaded area represents the proportion of
applicants who will be eligible because they are shorter than exactly 66 inches.
• Every normal curve can be interpreted in exactly the same way once any distance from
the mean is expressed in standard deviation units.
• For example, .68, or 68 percent, of the total area under a normal curve—any normal
curve—is within one standard deviation above and below the mean, and only .05, or
5 percent, of the total area is more than two standard deviations above and below the
mean.
z SCORES
• A z score is a unit-free, standardized score that, regardless of the original units of
measurement, indicates how many standard deviations a score is above or below the
mean of its distribution.
• To obtain a z score, express any original score, whether measured in inches,
milliseconds, dollars, IQ points, etc., as a deviation from its mean:
• z = (X − μ) / σ
• where X is the original score and μ and σ are the mean and the standard deviation,
respectively, for the normal distribution of the original scores.
• Since identical units of measurement appear in both the numerator and denominator
of the ratio for z, the original units of measurement cancel each other and the z score
emerges as a unit-free or standardized number, often referred to as a standard score.
A z score consists of two parts:
1. a positive or negative sign indicating whether it’s above or below the mean; and
2. a number indicating the size of its deviation from the mean in standard deviation
units.
• A z score of 2.00 always signifies that the original score is exactly two standard
deviations above its mean.
• Similarly, a z score of –1.27 signifies that the original score is exactly 1.27 standard
deviations below its mean.
• A z score of 0 signifies that the original score coincides with the mean.
Converting to z Scores
• To answer the question about eligible FBI applicants, replace X with 66 (the maximum
permissible height), μ with 69 (the mean height), and σ with 3 (the standard deviation
of heights) and solve for z as follows:
• z = (66 − 69) / 3 = −3 / 3 = −1.00
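The same conversion as a one-line Python check:

```python
# z score for the FBI example: X = 66 (maximum permissible height),
# mu = 69 (mean height), sigma = 3 (standard deviation of heights).
X, mu, sigma = 66, 69, 3
z = (X - mu) / sigma
print(z)  # -1.0: the cutoff is exactly one standard deviation below the mean
```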
STANDARD NORMAL CURVE
• If the original distribution approximates a normal curve, then the shift to standard or z
scores will always produce a new distribution that approximates the standard normal
curve.
• The standard normal curve always has a mean of 0 and a standard deviation of 1.
To verify that the mean of a standard normal distribution equals 0, replace X in the
z score formula with μ, the mean of any normal distribution, and then solve for z:
z = (μ − μ) / σ = 0.
• Likewise, to verify that the standard deviation of the standard normal distribution
equals 1, replace X in the z score formula with μ + 1σ, the value corresponding to one
standard deviation above the mean for any (nonstandard) normal distribution, and
then solve for z: z = (μ + 1σ − μ) / σ = σ / σ = 1.
• Although there is an infinite number of different normal curves, each with its own mean
and standard deviation, there is only one standard normal curve, with a mean of 0 and
a standard deviation of 1.
• Converting all original observations into z scores leaves the normal shape intact but
not the units of measurement.
• Shaded observations of 66 inches, 1080 hours, and 90 IQ points all reappear as a z
score of –1.00.
I. Correlation:
AN INTUITIVE APPROACH
Positive Relationship
• Trends among pairs of scores can be detected most easily by constructing a list of
paired scores in which the scores along one variable are arranged from largest to
smallest.
• In panel A of Table 6.2, the five pairs of scores are arranged from the largest (13) to
the smallest (1) number of cards sent.
• This table reveals a pronounced tendency for pairs of scores to occupy similar relative
positions in their respective distributions.
• For example, John sent relatively few cards (1) and received relatively few cards (6),
whereas Doris sent relatively many cards (13) and received relatively many cards (14).
• We can conclude, therefore, that the two variables are related.
• Insofar as relatively low values are paired with relatively low values, and relatively
high values are paired with relatively high values, the relationship is positive.
Negative Relationship
• Although John sent relatively few cards (1), he received relatively many (18).
• From this pattern, we can conclude that the two variables are related.
• This relationship implies that “You get the opposite of what you give.”
• Insofar as relatively low values are paired with relatively high values, and relatively
high values are paired with relatively low values, the relationship is negative.
Little or No Relationship
• No regularity is apparent among the pairs of scores in panel C.
• For instance, although both Andrea and John sent relatively few cards (5 and 1,
respectively), Andrea received relatively few cards (6) and John received relatively
many cards (14).
• We can conclude that little, if any, relationship exists between the two variables.
• Two variables are positively related if pairs of scores tend to occupy similar relative
positions (high with high and low with low) in their respective distributions.
• They are negatively related if pairs of scores tend to occupy dissimilar relative
positions (high with low and vice versa) in their respective distributions.
II. SCATTERPLOTS
• A scatterplot is a graph containing a cluster of dots that represents all pairs of scores.
Construction
• To construct a scatterplot, as in Figure 6.1, scale each of the two variables along the
horizontal (X) and vertical (Y) axes, and use each pair of scores to locate a dot within
the scatterplot.
• For example, the pair of numbers for Mike, 7 and 12, define points along the X and Y
axes, respectively.
• Using these points to anchor lines perpendicular to each axis, locate Mike’s dot where
the two lines intersect.
Positive, Negative, or Little or No Relationship?
• A dot cluster that has a slope from the lower left to the upper right, as in panel A of
Figure 6.2, reflects a positive relationship.
• Small values of one variable are paired with small values of the other variable, and
large values are paired with large values.
• In panel A, short people tend to be light, and tall people tend to be heavy.
• On the other hand, a dot cluster that has a slope from the upper left to the lower
right, as in panel B of Figure 6.2, reflects a negative relationship.
• Small values of one variable tend to be paired with large values of the other
variable, and vice versa.
• A dot cluster that lacks any apparent slope, as in panel C of Figure 6.2, reflects little or
no relationship.
• Small values of one variable are just as likely to be paired with small, medium, or
large values of the other variable.
Curvilinear Relationship
• A dot cluster that approximates a straight line reflects a linear relationship.
• Sometimes a dot cluster approximates a bent or curved line, as in Figure 6.4, and
therefore reflects a curvilinear relationship.
• Eg: physical strength, as measured by the force of a person’s handgrip, is less for
children, more for adults, and then less again for older people.
III. A CORRELATION COEFFICIENT
FOR QUANTITATIVE DATA : r
• A correlation coefficient is a number between –1 and 1 that describes the
relationship between pairs of variables.
• The type of correlation coefficient designated as r describes the linear relationship
between pairs of variables for quantitative data.
Key Properties of r
• Named in honor of the British scientist Karl Pearson, the Pearson correlation
coefficient, r, can equal any value between –1.00 and +1.00.
• Furthermore, the following two properties apply:
• The sign of r indicates the type of linear relationship, whether positive or negative.
• The numerical value of r, without regard to sign, indicates the strength of the
linear relationship.
Sign of r
• A number with a plus sign (or no sign) indicates a positive relationship, and a
number with a minus sign indicates a negative relationship.
Numerical Value of r
• The more closely a value of r approaches either –1.00 or +1.00, the stronger the
relationship.
• The more closely the value of r approaches 0, the weaker the relationship.
• r = –.90 indicates a stronger relationship than does an r of –.70, and
• r = –.70 indicates a stronger relationship than does an r of .50.
Interpretation of r
• Located along a scale from –1.00 to +1.00, the value of r supplies information
about the direction of a linear relationship—whether positive or negative—and,
generally, information about the relative strength of a linear relationship—whether
relatively weak because r is in the vicinity of 0, or relatively strong because r deviates
from 0 in the direction of either +1.00 or –1.00.
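A SciPy sketch of computing r for hypothetical paired height and weight scores (the numbers are made up):

```python
import numpy as np
from scipy import stats

# Hypothetical paired scores (heights in inches, weights in lbs).
height = np.array([63, 66, 68, 69, 70, 72, 74, 75])
weight = np.array([125, 140, 155, 150, 165, 180, 190, 205])

r, p_value = stats.pearsonr(height, weight)
print(r)                                   # close to +1: a strong positive linear relationship
print(np.corrcoef(height, weight)[0, 1])   # the same value via NumPy
```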
Range Restrictions
• The value of the correlation coefficient declines whenever the range of possible X or Y
scores is restricted.
• For example, Figure 6.5 shows a dot cluster with an obvious slope, represented by an r
of .70, for the positive relationship between height and weight for all college students.
• If the range of heights along Y is restricted to students who stand over 6 feet 2 inches
(or 74 inches) tall, the abbreviated dot cluster loses its slope because of the weights
among tall students.
• Therefore, as depicted in Figure 6.5, the value of r drops to .10.
• Sometimes it’s impossible to avoid a range restriction.
• For example, some colleges only admit students with SAT test scores above some
minimum value.
Caution
• We have to be careful when interpreting the actual numerical value of r.
• An r of .70 for height and weight doesn’t signify that the strength of this relationship
equals either .70 or 70 percent of the strength of a perfect relationship.
• The value of r can’t be interpreted as a proportion or percentage of some perfect
relationship.
Verbal Descriptions
• When interpreting a new r, you’ll find it helpful to translate the numerical value of r
into a verbal description of the relationship.
• An r of .70 for the height and weight of college students could be translated into
“Taller students tend to weigh more”;
• An r of –.42 for time spent taking an exam and the subsequent exam score could be
translated into “Students who take less time tend to make higher scores”; and
• An r in the neighborhood of 0 for shoe size and IQ could be translated into “Little, if
any, relationship exists between shoe size and IQ.”
• A correlation analysis of the exchange of greeting cards by five friends for the
most recent holiday season suggests a strong positive relationship between
cards sent and cards received.
• When informed of these results, another friend, Emma, who enjoys receiving
greeting cards, asks you to predict how many cards she will receive during the
next holiday season, assuming that she plans to send 11 cards.
• All five dots contribute to the more precise prediction, illustrated in Figure 7.2,
that Emma will receive 15.20 cards.
• The solid line in Figure 7.2 is designated as the regression line; it guides the string
of arrows, beginning at 11, toward the predicted value of 15.20.
• If all five dots had defined a single straight line, placement of the regression line
would have been simple; merely let it pass through all dots.
Predictive Errors
• Figure 7.3 illustrates the predictive errors that would have occurred if the regression
line had been used to predict the number of cards received by the five friends.
• Solid dots reflect the actual number of cards received, and open dots, always located
along the regression line, reflect the predicted number of cards received.
• The largest predictive error, shown as a broken vertical line, occurs for Steve, who sent
9 cards.
• Although he actually received 18 cards, he should have received slightly fewer than 14
cards, according to the regression line.
• The smallest predictive error, none at all, occurs for Mike, who sent 7 cards.
• He actually received the 12 cards that he should have received, according to the
regression line.
The smaller the total for all predictive errors in Figure 7.3, the more favorable will be
the prognosis for our predictions.
The regression line should be placed in a position that minimizes the total predictive error,
that is, that minimizes the total of the vertical discrepancies between the solid and open
dots shown in Figure 7.3.
VII. LEAST SQUARES REGRESSION LINE
• To avoid the arithmetic standoff of zero always produced by adding positive and
negative predictive errors, the placement of the regression line minimizes the total
squared predictive error.
• When located like this, the regression line is often referred to as the least
squares regression line.
Key Property
• Once numbers have been assigned to b and a, as just described, the least squares
regression equation emerges as a working equation with a most desirable property:
• It automatically minimizes the total of all squared predictive errors for known Y
scores in the original correlation analysis.
Solving for Y′
• In its present form, the regression equation can be used to predict the number of
cards that Emma will receive, assuming that she plans to send 11 cards.
• Simply substitute 11 for X and solve for the value of Y′ as follows:
Y′ = .80(11) + 6.40 = 8.80 + 6.40 = 15.20.
• Even when no cards are sent (X = 0), we predict a return of 6.40 cards because of the
value of a.
• Also, notice that sending each additional card translates into an increment of
only .80 in the predicted return because of the value of b.
• Whenever b has a value less than 1.00, increments in the predicted return will
lag—by an amount equal to the value of b, that is, .80 in the present case—
behind increments in cards sent.
• If the value of b had been greater than 1.00, then increments in the predicted
return would have exceeded increments in cards sent.
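A tiny Python sketch of the least squares prediction, using the slope (b = .80) and intercept (a = 6.40) quoted in the text:

```python
# Least squares regression line for the greeting-card example:
# Y' = bX + a with b = .80 (slope) and a = 6.40 (intercept).
b, a = 0.80, 6.40

def predict_cards_received(cards_sent):
    return b * cards_sent + a

print(predict_cards_received(11))  # 15.20 cards predicted for Emma
print(predict_cards_received(0))   # 6.40 cards even when no cards are sent
```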
A Limitation
• Emma might survey these predicted card returns before committing herself to a
particular card investment. There is no evidence of a simple cause-effect
relationship between cards sent and cards received.
ASSUMPTIONS
Linearity
• Use of the regression equation requires that the underlying relationship be linear.
Homoscedasticity
• Use of the standard error of estimate, sy|x, assumes that except for chance, the dots
in the original scatterplot will be dispersed equally about all segments of the
regression line.
• This assumption is violated when the scatterplot reveals a dramatically different type
of dot cluster, such as that shown in Figure 7.4.
• The standard error of estimate for the data in Figure 7.4 should be used cautiously,since
its value overestimates the variability of dots about the lower half of the regression
line and underestimates the variability of dots about the upper half of the regression
line.
INTERPRETATION OF r²
• Pretend that we know the Y scores (cards received), but not the corresponding X
scores (cards sent), for each of the five friends.
• Lacking information about the relationship between X and Y scores, we could not
construct a regression equation and use it to generate a customized prediction, Y′, for
each friend.
• We mount a primitive predictive effort by always predicting the mean, Ȳ, for each of
the five friends’ Y scores.
• The repetitive prediction of Ȳ for each of the Y scores of all five friends will supply us
with a frame of reference against which to evaluate our customary predictive effort
based on the correlation between cards sent (X) and cards received (Y).
Predictive Errors
Panel A of Figure 7.5 shows the predictive errors for all five friends when the mean for
all five friends, Ȳ, of 12 (shown as the mean line) is always used to predict each of their
five Y scores.
Panel B shows the corresponding predictive errors for all five friends when a series of
different Y′ values, obtained from the least squares equation (shown as the least
squares line), is used to predict each of their five Y scores.
Panel A of Figure 7.5 shows the error for John when the mean for all five friends, Ȳ, of 12
is used to predict his Y score of 6.
Shown as a broken vertical line, the error of −6 for John (from Y − Ȳ = 6 − 12 = −6)
indicates that Ȳ overestimates John’s Y score by 6 cards. Panel B shows a smaller error
of −1.20 for John when a Y′ value of 7.20 is used to predict the same Y score of 6.
This Y’ value of 7.20 is obtained from the least squares equation, where the number of
cards sent by John, 1, has been substituted for X.
SSy measures the total variability of Y scores that occurs after only primitive
predictions based on Ȳ are made, while SSy|x measures the residual variability of Y
scores that remains after customized least squares predictions are made.
The error variability of 28.8 for the least squares predictions is much smaller than the
error variability of 80 for the repetitive prediction of Y, confirming the greater accuracy
of the least squares predictions
apparent in Figure 7.5.
To obtain an SS measure of the actual gain in accuracy due to the least squares
predictions, subtract the residual variability from the total variability, that is, subtract
SSy|x from SSy, and express the gain as a proportion of the total variability:
r² = (SSy − SSy|x) / SSy = (80 − 28.8) / 80 = 51.2 / 80 = .64
This result, .64 or 64 percent, represents the proportion or percent gain in predictive
accuracy when the repetitive prediction of Ȳ is replaced by a series of customized Y′
predictions based on the least squares equation.
r² Does Not Apply to Individual Scores:
• The total variability of all Y scores—as measured by SSY—can be reduced by 64
percent when each Y score is replaced by its corresponding predicted Y’ score and then
expressed as a squared deviation from the mean of all observed scores.
• Thus, the 64 percent represents a reduction in the total variability for the five Y scores
when they are replaced by a succession of predicted scores, given the least squares
equation and various values of X.
Small Values of r²
• When transposed from r to r², Cohen’s guidelines state that a value of r² in the
vicinity of .01, .09, or .25 reflects a weak, moderate, or strong relationship,
respectively.
Multiple Regression Equations
• A multiple regression equation takes the general form Y′ = a + b1X1 + b2X2 + b3X3,
where Y′ represents predicted college GPA and X1, X2, and X3 refer to high
school GPA, IQ score, and SAT score, respectively.
• By capitalizing on the combined predictive power of several predictor variables,
these multiple regression equations supply more accurate predictions for Y′ than
could be obtained from a simple regression equation.
Regression Toward the Mean
• Suppose the five students with the top scores on a first test owe their high scores
partly to ability and partly to good luck.
• On the second test, even though the scores of these five students continue to
reflect an above-average permanent component, some of their scores will suffer
because of less good luck or even bad luck.
• The net effect is that the scores of at least some of the original five top students will
drop below the top five scores—that is, regress back toward the mean—on the
second exam.
Basics of Numpy arrays –aggregations –computations on arrays –comparisons, masks, boolean logic –
fancy indexing – structured arrays – Data manipulation with Pandas – data indexing and selection –
operating on data – missing data – Hierarchical indexing – combining datasets – aggregation and
grouping – pivot tables
1. NumPy array manipulation covers how to access data and subarrays, and how to split, reshape, and join arrays.
2. Attributes:
Each array has attributes ndim (the number of dimensions), shape (the size of each dimension), and size (the
total size of the array):
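For illustration, a short snippet (not part of the original notes; the array values are random) that prints these attributes:
import numpy as np
x3 = np.random.randint(10, size=(3, 4, 5))   # a three-dimensional array
print(x3.ndim)    # 3
print(x3.shape)   # (3, 4, 5)
print(x3.size)    # 60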
3. Array Indexing: Accessing Single Elements
In a one-dimensional array, you can access the ith value (counting from zero) by specifying the desired index in
square brackets, just as with Python lists:
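A brief illustrative example (values chosen arbitrarily):
import numpy as np
x = np.array([5, 0, 3, 3, 7, 9])
print(x[0])    # 5, the first element
print(x[4])    # 7
print(x[-1])   # 9, negative indices count from the end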
4. Array Slicing: Accessing Subarrays
We can also use them to access subarrays with the slice notation, marked by the colon (:) character.
The NumPy slicing syntax follows that of the standard Python list; to access a slice of an array x, use this:
x[start:stop:step]
If any of these are unspecified, they default to the values start=0, stop=size of dimension, step=1.
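A short sketch of these defaults in action:
import numpy as np
x = np.arange(10)      # [0 1 2 3 4 5 6 7 8 9]
print(x[:5])           # first five elements: [0 1 2 3 4]
print(x[5:])           # elements from index 5: [5 6 7 8 9]
print(x[4:7])          # middle subarray: [4 5 6]
print(x[::2])          # every other element: [0 2 4 6 8]
print(x[::-1])         # all elements, reversed: [9 8 7 6 5 4 3 2 1 0]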
Multidimensional subarrays
Multidimensional slices work in the same way, with multiple slices separated by commas.
For example:
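An illustrative snippet (the 2D array values are arbitrary):
import numpy as np
x2 = np.array([[12, 5, 2, 4],
               [ 7, 6, 8, 8],
               [ 1, 6, 7, 7]])
print(x2[:2, :3])      # first two rows, first three columns
print(x2[:3, ::2])     # all rows, every other column
print(x2[::-1, ::-1])  # rows and columns reversed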
5. Accessing array rows and columns
NumPy array slicing differs from Python list slicing: list slices are copies, whereas
NumPy array slices return views into the original array data.
Consider our two-dimensional array from before:
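A minimal sketch showing the view behaviour (values chosen for illustration): modifying a slice also changes the original array.
import numpy as np
x2 = np.array([[12, 5, 2, 4],
               [ 7, 6, 8, 8],
               [ 1, 6, 7, 7]])
x2_sub = x2[:2, :2]   # a 2x2 subarray: this is a view, not a copy
x2_sub[0, 0] = 99     # modify the view
print(x2[0, 0])       # 99 -- the original array has changed too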
7. Creating copies of arrays
Despite the nice features of array views, it is sometimes useful to instead explicitly copy the data within an
array or a subarray.
This can be most easily done with the copy() method:
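A brief illustrative example:
import numpy as np
x2 = np.array([[12, 5, 2, 4],
               [ 7, 6, 8, 8]])
x2_sub_copy = x2[:2, :2].copy()   # an independent copy of the subarray
x2_sub_copy[0, 0] = 42            # modify only the copy
print(x2[0, 0])                   # still 12 -- the original is untouched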
8. Reshaping of Arrays
The most flexible way of doing this is with the reshape() method.
For example, if we want to put the numbers 1 through 9 in a 3X3 grid, we can do the following:
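A short snippet doing exactly that:
import numpy as np
grid = np.arange(1, 10).reshape((3, 3))
print(grid)
# [[1 2 3]
#  [4 5 6]
#  [7 8 9]]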
Another common reshaping pattern is the conversion of a one-dimensional array into a two-dimensional row or
column matrix.
We can do this with the reshape method, or more easily by making use of the newaxis keyword within a slice
operation:
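For instance (illustrative array):
import numpy as np
x = np.array([1, 2, 3])
print(x.reshape((1, 3)))   # row vector via reshape
print(x[np.newaxis, :])    # row vector via newaxis
print(x.reshape((3, 1)))   # column vector via reshape
print(x[:, np.newaxis])    # column vector via newaxis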
It’s also possible to combine multiple arrays into one, and to conversely split a single array into multiple arrays.
Concatenation of arrays
Concatenation, or joining of two arrays in NumPy, is primarily accomplished through the routines
np.concatenate, np.vstack, and np.hstack. np.concatenate takes a tuple or list of arrays as its first argument
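A short sketch of these routines (arrays chosen for illustration):
import numpy as np
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
print(np.concatenate([x, y]))           # [1 2 3 3 2 1]
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
print(np.vstack([x, grid]))             # stack vertically (rows)
print(np.hstack([grid, [[99], [99]]]))  # stack horizontally (columns)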
Splitting of arrays
The opposite of concatenation is splitting, which is implemented by the functions np.split, np.hsplit, and
np.vsplit. For each of these, we can pass a list of indices giving the split points:
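An illustrative example of split points:
import numpy as np
x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3 = np.split(x, [3, 5])   # split before index 3 and before index 5
print(x1, x2, x3)                  # [1 2 3] [99 99] [3 2 1]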
II. Aggregations: Min, Max, and Everything in Between
3. Multidimensional aggregates
Other aggregation functions
Example: What Is the Average Height of US Presidents?
Aggregates available in NumPy can be extremely useful for summarizing a set of values.
As a simple example, let’s consider the heights of all US presidents.
This data is available in the file president_heights.csv, which is a simple comma-separated list of
labels and values:
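A minimal sketch of this example, assuming the file president_heights.csv (with a height(cm) column) is available in the working directory:
import numpy as np
import pandas as pd
data = pd.read_csv('president_heights.csv')   # file assumed to be present
heights = np.array(data['height(cm)'])        # column name assumed from the dataset
print("Mean height:       ", heights.mean())
print("Standard deviation:", heights.std())
print("Minimum height:    ", heights.min())
print("Maximum height:    ", heights.max())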
III. Computation on Arrays: Broadcasting
Another means of vectorizing operations is to use NumPy’s broadcasting functionality. Broadcasting is simply
a set of rules for applying binary ufuncs (addition, subtraction, multiplication, etc.) on
arrays of different sizes.
Introducing Broadcasting
Recall that for arrays of the same size, binary operations are performed on an element-by-element basis:
Broadcasting allows these types of binary operations to be performed on arrays of different sizes—for example,
we can just as easily add a scalar (think of it as a zero dimensional array) to an array:
Rules of Broadcasting
Broadcasting in NumPy follows a strict set of rules to determine the interaction between the two arrays:
• Rule 1: If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is
padded with ones on its leading (left) side.
• Rule 2: If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that
dimension is stretched to match the other shape.
• Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
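A brief sketch of these rules in action (arrays chosen for illustration):
import numpy as np
a = np.arange(3)                  # shape (3,)
M = np.ones((3, 3))               # shape (3, 3)
print(M + a)                      # a is broadcast across every row of M
b = np.arange(3)[:, np.newaxis]   # shape (3, 1)
print(a + b)                      # both arrays are broadcast to shape (3, 3)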
IV. Comparisons, Masks, and Boolean Logic
This section covers the use of Boolean masks to examine and manipulate values within NumPy arrays. Masking
comes up when you want to extract, modify, count, or otherwise manipulate values in an array based on some
criterion: for example, you might wish to count all values greater than a certain value, or perhaps remove all
outliers that are above some threshold.
In NumPy, Boolean masking is often the most efficient way to accomplish these types of tasks.
One approach to this would be to answer these questions by hand: loop through the data, incrementing a
counter each time we see values in some desired range.
For reasons discussed throughout this chapter, such an approach is very inefficient, both from the standpoint of
time writing code and time computing the result.
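In contrast, a vectorized sketch of the same counting with Boolean masks (random data used purely for illustration):
import numpy as np
rng = np.random.RandomState(0)
x = rng.randint(10, size=(3, 4))
print(x < 6)                    # Boolean array
print(np.count_nonzero(x < 6))  # how many values are less than 6?
print(np.sum(x < 6, axis=1))    # count of such values in each row
print(x[x < 6])                 # masking: pull out the values that satisfy the condition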
This section demonstrates the use of NumPy’s structured arrays and record arrays, which provide efficient
storage for compound, heterogeneous data.
Imagine that we have several categories of data on a number of people (say, name, age, and weight), and we’d
like to store these values for use in a Python program. It would be possible to store these in three separate
arrays:
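A compound dtype lets us store all three categories in a single structured array instead; a minimal sketch (names and values are illustrative):
import numpy as np
name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]
data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data['name'])                     # all names
print(data[data['age'] < 30]['name'])   # names of people under 30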
Series as dictionary
Like a dictionary, the Series object provides a mapping from a collection of keys to a collection of values:
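For instance (the population figures are purely illustrative values):
import pandas as pd
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127})
print(population['California'])   # dictionary-style item access
print('Texas' in population)      # key membership, just as with a dict
print(population.keys())          # the index acts like the dict keys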
DataFrame as a dictionary
Operating on Data in Pandas:
Pandas : for unary operations like negation and trigonometric functions, the ufuncs will preserve index and
column labels in the output, and for binary operations such as addition and multiplication, Pandas will
automatically align indices when passing the objects to the ufunc.
Ufuncs: Operations Between DataFrame and Series
When you are performing operations between a DataFrame and a Series, the index and column alignment is
similarly maintained.
Operations between a DataFrame and a Series are similar to operations between a two-dimensional and one-
dimensional NumPy array.
Subtraction between a two-dimensional array and one of its rows, for example, is applied row-wise.
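A small sketch of this row-wise alignment (values are random, purely for illustration):
import numpy as np
import pandas as pd
rng = np.random.RandomState(42)
df = pd.DataFrame(rng.randint(10, size=(3, 4)), columns=list('QRST'))
print(df - df.iloc[0])               # subtract the first row from every row (row-wise)
print(df.subtract(df['R'], axis=0))  # operate column-wise instead by specifying axis=0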
Handling Missing Data
A fact of life is that real-world data is rarely clean and homogeneous. In particular, many interesting
datasets will have some amount of data missing.
In the masking approach, the mask might be an entirely separate Boolean array, or it may involve appropriation
of one bit in the data representation to locally indicate the null status of a value.
In the sentinel approach, the sentinel value could be some data-specific convention. Eg: IEEE floating-point
specification.
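Pandas itself relies on sentinels (NaN and None) to mark missing data; a brief illustration:
import numpy as np
import pandas as pd
data = pd.Series([1, np.nan, 'hello', None])
print(data.isnull())   # Boolean mask of missing entries
print(data.dropna())   # drop missing values
print(data.fillna(0))  # fill missing values with a constant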
Hierarchical indexing (also known as multi-indexing) - to incorporate multiple index levels within a
single index. In this way, higher-dimensional data can be compactly represented within the familiar one-
dimensional Series and two-dimensional DataFrame objects.
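A small sketch with a two-level (state, year) index (the numbers are illustrative):
import pandas as pd
index = pd.MultiIndex.from_tuples([('California', 2000), ('California', 2010),
                                   ('Texas', 2000), ('Texas', 2010)])
pop = pd.Series([33871648, 37253956, 20851820, 25145561], index=index)
print(pop['California'])   # partial indexing on the first level
print(pop[:, 2010])        # all states, year 2010 only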
An essential piece of analysis of large data is efficient summarization: computing aggregations like sum(),
mean(), median(), min(), and max()
A canonical example of this split-apply-combine operation, where the “apply” is a summation aggregation, is
illustrated in Figure 3-1.
Figure 3-1 makes clear what the GroupBy accomplishes:
• The split step involves breaking up and grouping a DataFrame depending on the
value of the specified key.
• The apply step involves computing some function, usually an aggregate, transformation,
or filtering, within the individual groups.
• The combine step merges the results of these operations into an output array.
Here it’s important to realize that the intermediate splits do not need to be explicitly instantiated.
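A minimal sketch of the whole split-apply-combine pipeline on a toy DataFrame (keys and values invented for illustration):
import pandas as pd
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [1, 2, 3, 4, 5, 6]})
print(df.groupby('key').sum())   # split by 'key', apply sum within each group, combine the results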
The GroupBy object
The GroupBy object is a very flexible abstraction.
Column indexing. The GroupBy object supports column indexing in the same way as
the DataFrame, and returns a modified GroupBy object. For example:
In[14]: planets.groupby('method')
Out[14]: <pandas.core.groupby.DataFrameGroupBy object at 0x1172727b8>
In[15]: planets.groupby('method')['orbital_period']
Out[15]: <pandas.core.groupby.SeriesGroupBy object at 0x117272da0>
Iteration over groups. The GroupBy object supports direct iteration over the groups,
returning each group as a Series or DataFrame:
In[17]: for (method, group) in planets.groupby('method'):
print("{0:30s} shape={1}".format(method, group.shape))
Dispatch methods. Through some Python class magic, any method not explicitly
implemented by the GroupBy object will be passed through and called on the groups,
whether they are DataFrame or Series objects. For example, you can use the
describe() method of DataFrames to perform a set of aggregations that describe each
group in the data:
In[18]: planets.groupby('method')['year'].describe().unstack()
X. Pivot Tables
We have seen how the GroupBy abstraction lets us explore relationships within a dataset.
A pivot table is a similar operation that is commonly seen in spreadsheets and other programs that operate on
tabular data.
The pivot table takes simple columnwise data as input, and groups the entries into a two-dimensional table that
provides a multidimensional summarization of the data.
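As a brief illustration, assuming Seaborn’s bundled Titanic dataset is available via sns.load_dataset:
import seaborn as sns
titanic = sns.load_dataset('titanic')
# survival rate (mean of 'survived') grouped by sex (rows) and passenger class (columns)
print(titanic.pivot_table('survived', index='sex', columns='class'))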
UNIT V
DATA VISUALIZATION
Importing Matplotlib – Line plots – Scatter plots – visualizing errors – density and contour plots –
Histograms – legends – colors – subplots – text and annotation – customization – three dimensional
plotting - Geographic Data with Basemap - Visualization with Seaborn.
Short assignment
linestyle='-'    # solid
linestyle='--'   # dashed
linestyle='-.'   # dashdot
linestyle=':'    # dotted
• linestyle and color codes can be combined into a single nonkeyword argument to the plt.plot()
function
plt.plot(x, x + 0, '-g')    # solid green
plt.plot(x, x + 1, '--c')   # dashed cyan
plt.plot(x, x + 2, '-.k')   # dashdot black
plt.plot(x, x + 3, ':r');   # dotted red
Axes Limits
• The most basic way to adjust axis limits is to use the plt.xlim() and plt.ylim() methods
Example
plt.xlim(10, 0)
plt.ylim(1.2, -1.2);
• The plt.axis() method allows you to set the x and y limits with a single call, by passing a list that specifies
[xmin, xmax, ymin, ymax]
plt.axis([-1, 11, -1.5, 1.5]);
• Aspect ratio equal is used to represent one unit in x is equal to one unit in y. plt.axis('equal')
Labeling Plots
The labeling of plots includes titles, axis labels, and simple legends.
Title  - plt.title()
Label  - plt.xlabel(), plt.ylabel()
Legend - plt.legend()
Example programs
Line color
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure()
ax = plt.axes()
x = np.linspace(0, 10, 1000)
ax.plot(x, np.sin(x));
plt.plot(x, np.sin(x - 0), color='blue')           # specify color by name
plt.plot(x, np.sin(x - 1), color='g')              # short color code (rgbcmyk)
plt.plot(x, np.sin(x - 2), color='0.75')           # grayscale between 0 and 1
plt.plot(x, np.sin(x - 3), color='#FFDD44')        # hex code (RRGGBB from 00 to FF)
plt.plot(x, np.sin(x - 4), color=(1.0, 0.2, 0.3))  # RGB tuple, values 0 to 1
plt.plot(x, np.sin(x - 5), color='chartreuse');    # all HTML color names supported
Line style
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure()
ax = plt.axes()
x = np.linspace(0, 10, 1000)
plt.plot(x, x + 0, linestyle='solid')
plt.plot(x, x + 1, linestyle='dashed')
plt.plot(x, x + 2, linestyle='dashdot')
plt.plot(x, x + 3, linestyle='dotted');
# For short, you can use the following codes:
plt.plot(x, x + 4, linestyle='-')   # solid
plt.plot(x, x + 5, linestyle='--')  # dashed
plt.plot(x, x + 6, linestyle='-.')  # dashdot
plt.plot(x, x + 7, linestyle=':');  # dotted
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure()
ax = plt.axes()
x = np.linspace(0, 10, 1000)
plt.xlim(-1, 11)
plt.ylim(-1.5, 1.5);
plt.plot(x, np.sin(x), '-g', label='sin(x)')
plt.plot(x, np.cos(x), ':b', label='cos(x)')
plt.title("A Sine Curve")
plt.xlabel("x")
plt.ylabel("sin(x)");
plt.legend();
Example
plt.plot(x, y, 'o', color='black');
• The third argument in the function call is a character that represents the type of symbol used for the plotting.
Just as you can specify options such as '-' and '--' to control the line style, the marker style has its own set of
short string codes.
Example
• Various marker symbols can be specified, such as ['o', '.', ',', 'x', '+', 'v', '^', '<', '>', 's', 'd']
plt.plot(x, y, '-ok');
Example
plt.plot(x, y, '-p', color='gray',
markersize=15, linewidth=4,
markerfacecolor='white',
markeredgecolor='gray',
markeredgewidth=2)
plt.ylim(-1.2, 1.2);
Diverging
['PiYG', 'PRGn', 'BrBG', 'PuOr', 'RdGy', 'RdBu', 'RdYlBu', 'RdYlGn', 'Spectral',
'coolwarm', 'bwr', 'seismic']
Qualitative
['Pastel1', 'Pastel2', 'Paired', 'Accent', 'Dark2', 'Set1', 'Set2', 'Set3',
'tab10', 'tab20', 'tab20b', 'tab20c']
Miscellaneous
['flag', 'prism', 'ocean', 'gist_earth', 'terrain', 'gist_stern', 'gnuplot',
'gnuplot2', 'CMRmap', 'cubehelix', 'brg', 'hsv', 'gist_rainbow', 'rainbow',
'jet', 'nipy_spectral', 'gist_ncar']
Example programs.
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 10, 20)
y = np.sin(x)
plt.plot(x, y, '-o',
         color='gray',
         markersize=15,
         linewidth=4,
         markerfacecolor='yellow',
         markeredgecolor='red',
         markeredgewidth=4)
plt.ylim(-1.5, 1.5);
Visualizing Errors
For any scientific measurement, accurate accounting for errors is nearly as important as, if not more important
than, accurate reporting of the number itself. For example, imagine that I am using some astrophysical
observations to estimate the Hubble Constant, the local measurement of the expansion rate of the Universe.
In visualization of data and results, showing these errors effectively can make a plot convey much more
complete information.
Types of errors
• Basic Errorbars
• Continuous Errors
Basic Errorbars
A basic errorbar can be created with a single Matplotlib function call.
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import numpy as np
x = np.linspace(0, 10, 50)
dy = 0.8
y = np.sin(x) + dy * np.random.randn(50)
plt.errorbar(x, y, yerr=dy, fmt='.k');
• Here the fmt is a format code controlling the appearance of lines and points, and has the same syntax as
the shorthand used in plt.plot().
• In addition to these basic options, the errorbar function has many options to fine-tune the outputs.
Using these additional options you can easily customize the aesthetics of your errorbar plot.
Continuous Errors
• In some situations it is desirable to show errorbars on continuous quantities. Though Matplotlib does not
have a built-in convenience routine for this type of application, it’s relatively easy to combine primitives like
plt.plot and plt.fill_between for a useful result.
• Here we’ll perform a simple Gaussian process regression (GPR), using the Scikit-Learn API. This is a method
of fitting a very flexible nonparametric function to data with a continuous measure of the uncertainty.
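A minimal sketch of a continuous error band drawn with plt.fill_between (the GPR fit itself is omitted here; the curve and the band half-width dy are assumed values for illustration):
import numpy as np
import matplotlib.pyplot as plt
xfit = np.linspace(0, 10, 1000)
yfit = np.sin(xfit)        # stand-in for a fitted curve
dy = 0.2                   # assumed uncertainty band half-width
plt.plot(xfit, yfit, '-', color='gray')
plt.fill_between(xfit, yfit - dy, yfit + dy,
                 color='lightgray', alpha=0.5)   # shaded error region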
Density and Contour Plots
• Notice that by default when a single color is used, negative values are represented by dashed lines,
and positive values by solid lines.
• Alternatively, you can color-code the lines by specifying a colormap with the cmap argument.
• We’ll also specify that we want more lines to be drawn: 20 equally spaced intervals within the data range.
plt.contour(X, Y, Z, 20, cmap='RdGy');
• One potential issue with this plot is that it is a bit “splotchy.” That is, the color steps are discrete rather
than continuous, which is not always what is desired.
• You could remedy this by setting the number of contours to a very high number, but this results in a
rather inefficient plot: Matplotlib must render a new polygon for each step in the level.
• A better way to handle this is to use the plt.imshow() function, which interprets a two-dimensional grid
of data as an image.
Example Program
import numpy as np
import matplotlib.pyplot as plt
def f(x, y):
    return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
plt.imshow(Z, extent=[0, 5, 0, 5],   # extent matches the data range
           origin='lower', cmap='RdGy')
plt.colorbar()
Histograms
• A histogram is a simple plot used to represent a large data set. It is a graph showing frequency
distributions, that is, the number of observations within each given interval.
Parameters
• plt.hist() is used to plot a histogram. The hist() function will use an array of numbers to create a
histogram; the array is sent into the function as an argument.
• bins - A histogram displays numerical data by grouping data into "bins" of equal width. Each bin is plotted
as a bar whose height corresponds to how many data points are in that bin. Bins are also sometimes called
"intervals", "classes", or "buckets".
• normed - If True, the histogram counts are normalized to form a probability density, so that the area
under the histogram sums to 1 (in newer Matplotlib versions this parameter is named density).
• x - (n,) array or sequence of (n,) arrays Input values, this takes either a single array or a sequence of arrays
which are not required to be of the same length.
• histtype - {'bar', 'barstacked', 'step', 'stepfilled'}, optional. The type of histogram to draw.
• 'bar' is a traditional bar-type histogram. If multiple data are given, the bars are arranged side by side.
• 'barstacked' is a bar-type histogram where multiple data are stacked on top of each other.
• 'step' generates a lineplot that is by default unfilled.
• 'stepfilled' generates a lineplot that is by default filled. Default is 'bar'.
• align - {'left', 'mid', 'right'}, optional. Controls how the bars are aligned relative to the bin edges.
Default is 'mid'.
• label - str or None, optional. Default is None
Other parameter
• **kwargs - Patch properties, it allows us to pass a
variable number of keyword arguments to a
python function. ** denotes this type of function.
Example
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-white')
data = np.random.randn(1000)
plt.hist(data);
The hist() function has many options to tune both the calculation and the display; here’s an example of a
more customized histogram.
plt.hist(data, bins=30, alpha=0.5,histtype='stepfilled', color='steelblue',edgecolor='none');
The plt.hist docstring has more information on other customization options available. I find this combination
of histtype='stepfilled' along with some transparency alpha to be very useful when comparing histograms of
several distributions
x1 = np.random.normal(0, 0.8, 1000)
x2 = np.random.normal(-2, 1, 1000)
x3 = np.random.normal(3, 2, 1000)
kwargs = dict(histtype='stepfilled', alpha=0.3, bins=40)
plt.hist(x1, **kwargs)
plt.hist(x2, **kwargs)
plt.hist(x3, **kwargs);
Example
mean = [0, 0]
cov = [[1, 1], [1, 2]]
x, y = np.random.multivariate_normal(mean, cov, 1000).T
plt.hist2d(x, y, bins=30, cmap='Blues')
cb = plt.colorbar()
cb.set_label('counts in bin')
Legends
Plot legends give meaning to a visualization, assigning labels to the various plot elements. We previously saw
how to create a simple legend; here we’ll take a look at customizing the placement and aesthetics of the legend
in Matplotlib.
plt.plot(x, np.sin(x), '-b', label='Sine')
plt.plot(x, np.cos(x), '--r', label='Cosine')
plt.legend();
Number of columns - We can use the ncol command to specify the number of columns in the legend.
ax.legend(frameon=False, loc='lower center', ncol=2)
fig
We can use a rounded box (fancybox) or add a shadow, change the transparency (alpha value) of the frame, or
change the padding around the text.
ax.legend(fancybox=True, framealpha=1, shadow=True, borderpad=1)
fig
Multiple legends
It is only possible to create a single legend for the entire plot. If you try to create a second legend using
plt.legend() or ax.legend(), it will simply override the first one. We can work around this by creating a
new legend artist from scratch, and then using the lower-level ax.add_artist() method to manually add the
second artist to the plot.
Example
import matplotlib.pyplot as plt
plt.style.use('classic')
import numpy as np
# create a figure and axes so that ax.legend() has something to act on
fig, ax = plt.subplots()
x = np.linspace(0, 10, 1000)
ax.plot(x, np.sin(x), '-b', label='Sine')
ax.plot(x, np.cos(x), '--r', label='Cosine')
ax.legend(loc='lower center', frameon=True, shadow=True, borderpad=1, fancybox=True)
fig
Color Bars
In Matplotlib, a color bar is a separate axes that can provide a key for the meaning of colors in a plot.
For continuous labels based on the color of points, lines, or regions, a labeled color bar can be a great tool.
The simplest colorbar can be created with the plt.colorbar() function.
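For example (I here is an illustrative 2D array, defined so that the later snippets which reuse I also have a concrete value):
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 10, 1000)
I = np.sin(x) * np.cos(x[:, np.newaxis])   # a 2D array to display
plt.imshow(I)
plt.colorbar();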
Customizing Colorbars
Choosing color map.
We can specify the colormap using the cmap argument to the plotting function that is creating the
visualization. Broadly, there are three different categories of colormaps:
• Sequential colormaps - These consist of one continuous sequence of colors (e.g., binary or viridis).
• Divergent colormaps - These usually contain two distinct colors, which show positive and negative
deviations from a mean (e.g., RdBu or PuOr).
• Qualitative colormaps - These mix colors with no particular sequence (e.g., rainbow or jet).
12
Color limits and extensions
• Matplotlib allows for a large range of colorbar customization. The colorbar itself is simply an instance of
plt.Axes, so all of the axes and tick formatting tricks we’ve learned are applicable.
• We can narrow the color limits and indicate the out-of-bounds values with a triangular arrow at the top
and bottom by setting the extend property.
plt.subplot(1, 2, 2)
plt.imshow(I, cmap='RdBu')
plt.colorbar(extend='both')
plt.clim(-1, 1);
Discrete colorbars
Colormaps are by default continuous, but sometimes you’d like to
represent discrete values. The easiest way to do this is to use the
plt.cm.get_cmap() function, and pass the name of a suitable colormap
along with the number of desired bins.
plt.imshow(I, cmap=plt.cm.get_cmap('Blues', 6))
plt.colorbar()
plt.clim(-1, 1);
Subplots
• Matplotlib has the concept of subplots: groups of smaller axes that can exist together within a single figure.
• These subplots might be insets, grids of plots, or other more complicated layouts.
• We’ll explore four routines for creating subplots in Matplotlib.
• plt.axes: Subplots by Hand
• plt.subplot: Simple Grids of Subplots
• plt.subplots: The Whole Grid in One Go
• plt.GridSpec: More Complicated Arrangements
For example, we might create an inset axes at the top-right corner of another axes by setting the x and y
position to 0.65 (that is, starting at 65% of the width and 65% of the height of the figure) and the x and y
extents to 0.2 (that is, the size of the axes is 20% of the width and 20% of the height of the figure).
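A short sketch of that inset:
import matplotlib.pyplot as plt
ax1 = plt.axes()                         # standard axes
ax2 = plt.axes([0.65, 0.65, 0.2, 0.2])   # inset axes: [left, bottom, width, height]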
For example, a gridspec for a grid of two rows and three columns with some specified width and height
space looks like this:
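An illustrative sketch (spacing values chosen arbitrarily):
import matplotlib.pyplot as plt
grid = plt.GridSpec(2, 3, wspace=0.4, hspace=0.3)
plt.subplot(grid[0, 0])    # top-left cell
plt.subplot(grid[0, 1:])   # top row, spanning the last two columns
plt.subplot(grid[1, :2])   # bottom row, first two columns
plt.subplot(grid[1, 2]);   # bottom-right cell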
• Text annotation can be done manually with the plt.text/ax.text command, which will place text at a
particular x/y value.
• The ax.text method takes an x position, a y position, a string, and then optional keywords specifying the
color, size, style, alignment, and other properties of the text. Here we used ha='right' and ha='center', where
ha is short for horizontal alignment.
Example
import matplotlib.pyplot as plt
import matplotlib as mpl
plt.style.use('seaborn-whitegrid')
import numpy as np
import pandas as pd
fig, ax = plt.subplots(facecolor='lightgray')
ax.axis([0, 10, 0, 10])
# transform=ax.transData is the default, but we'll specify it anyway
ax.text(1, 5, ". Data: (1, 5)", transform=ax.transData)
ax.text(0.5, 0.1, ". Axes: (0.5, 0.1)", transform=ax.transAxes)
ax.text(0.2, 0.2, ". Figure: (0.2, 0.2)", transform=fig.transFigure);
Note that by default, the text is aligned above and to the left of the specified coordinates; here the “.” at the
beginning of each string will approximately mark the given coordinate location.
The transData coordinates give the usual data coordinates associated with the x- and y-axis labels. The
transAxes coordinates give the location from the bottom-left corner of the axes (here the white box) as a
fraction of the axes size.
The transFigure coordinates are similar, but specify the position from the bottom left of the figure (here the
gray box) as a fraction of the figure size.
Notice now that if we change the axes limits, it is only the transData coordinates that will be affected, while the
others remain stationary.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
ax = plt.axes(projection='3d')
# Data for a three-dimensional line
zline = np.linspace(0, 15, 1000)
xline = np.sin(zline)
yline = np.cos(zline)
ax.plot3D(xline, yline, zline, 'gray')
# Data for three-dimensional scattered points
zdata = 15 * np.random.random(100)
xdata = np.sin(zdata) + 0.1 * np.random.randn(100)
ydata = np.cos(zdata) + 0.1 * np.random.randn(100)
ax.scatter3D(xdata, ydata, zdata, c=zdata, cmap='Greens');
plt.show()
Notice that by default, the scatter points have their transparency adjusted to give a sense of depth on the page.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
def f(x, y):   # surface to plot (assumed here; the original omitted the grid definition)
    return np.sin(np.sqrt(x ** 2 + y ** 2))
x = np.linspace(-6, 6, 30)
y = np.linspace(-6, 6, 30)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.plot_wireframe(X, Y, Z, color='black')
ax.set_title('wireframe');
plt.show()
• Adding a colormap to the filled polygons can aid perception of the topology of the surface being visualized
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
def f(x, y):   # surface to plot (assumed here; the original omitted the grid definition)
    return np.sin(np.sqrt(x ** 2 + y ** 2))
x = np.linspace(-6, 6, 30)
y = np.linspace(-6, 6, 30)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
ax = plt.axes(projection='3d')
ax.plot_surface(X, Y, Z, rstride=1, cstride=1,
                cmap='viridis', edgecolor='none')
ax.set_title('surface')
plt.show()
Surface Triangulations
• For some applications, the evenly sampled grids required by
the preceding routines are overly restrictive and
inconvenient.
• In these situations, the triangulation-based plots can be very useful.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
def f(x, y):   # surface to sample (assumed here; the original omitted its definition)
    return np.sin(np.sqrt(x ** 2 + y ** 2))
theta = 2 * np.pi * np.random.random(1000)
r = 6 * np.random.random(1000)
x = np.ravel(r * np.sin(theta))
y = np.ravel(r * np.cos(theta))
z = f(x, y)
ax = plt.axes(projection='3d')
ax.scatter(x, y, z, c=z, cmap='viridis', linewidth=0.5)
• We’ll use an etopo image (which shows topographical features both on land and under the ocean) as
the map background.
Program to display a particular area of the map with latitude and longitude lines
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
from itertools import chain
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None,
width=8E6, height=8E6,
lat_0=45, lon_0=-100,)
m.etopo(scale=0.5, alpha=0.5)
def draw_map(m, scale=0.2):
    # draw a shaded-relief image
    m.shadedrelief(scale=scale)
    # lats and longs are returned as a dictionary
    lats = m.drawparallels(np.linspace(-90, 90, 13))
    lons = m.drawmeridians(np.linspace(-180, 180, 13))
    # keys contain the plt.Line2D instances
    lat_lines = chain(*(tup[1][0] for tup in lats.items()))
    lon_lines = chain(*(tup[1][0] for tup in lons.items()))
    all_lines = chain(lat_lines, lon_lines)
    # cycle through these lines and set the desired style
    for line in all_lines:
        line.set(linestyle='-', alpha=0.3, color='r')
Map Projections
The Basemap package implements several dozen such projections, all referenced by a short format code. Here
we’ll briefly demonstrate some of the more common ones.
• Cylindrical projections
• Pseudo-cylindrical projections
• Perspective projections
• Conic projections
Cylindrical projection
• The simplest of map projections are cylindrical projections, in which lines of constant latitude and
longitude are mapped to horizontal and vertical lines, respectively.
• This type of mapping represents equatorial regions quite well, but results in extreme distortions near
the poles.
• The spacing of latitude lines varies between different cylindrical projections, leading to different
conservation properties, and different distortion near the poles.
• Other cylindrical projections are the Mercator (projection='merc') and the cylindrical equal-area
(projection='cea') projections.
• The additional arguments to Basemap for this view specify the latitude (lat) and longitude (lon) of
the lower-left corner (llcrnr) and upper-right corner (urcrnr) for the desired map, in units of degrees.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize=(8, 6), edgecolor='w')
m = Basemap(projection='cyl', resolution=None,
llcrnrlat=-90, urcrnrlat=90,
llcrnrlon=-180, urcrnrlon=180, )
draw_map(m)
Pseudo-cylindrical projections
• Pseudo-cylindrical projections relax the requirement that meridians (lines of constant longitude)
remain vertical; this can give better properties near the poles of the projection.
• The Mollweide projection (projection='moll') is one common example of this, in which all meridians
are elliptical arcs.
• It is constructed so as to preserve area across the map: though there are distortions near the poles,
the area of small patches reflects the true area.
• Other pseudo-cylindrical projections are the sinusoidal (projection='sinu') and Robinson
(projection='robin') projections.
• The extra arguments to Basemap here refer to the central latitude (lat_0) and longitude (lon_0)
for the desired map.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize=(8, 6), edgecolor='w')
m = Basemap(projection='moll', resolution=None,
lat_0=0, lon_0=0)
draw_map(m)
Perspective projections
• Perspective projections are constructed using a particular choice of perspective point, similar to if you
photographed the Earth from a particular point in space (a point which, for some projections, technically
lies within the Earth!).
• One common example is the orthographic projection (projection='ortho'), which shows one side of the globe
as seen from a viewer at a very long distance.
• Thus, it can show only half the globe at a time.
• Other perspective-based projections include the
gnomonic projection (projection='gnom') and
stereographic projection (projection='stere').
• These are often the most useful for showing small
portions of the map.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None,
lat_0=50, lon_0=0)
draw_map(m);
Conic projections
• A conic projection projects the map onto a single cone, which is then unrolled.
• This can lead to very good local properties, but regions far from the focus point of the cone may
become very distorted.
• One example of this is the Lambert conformal conic projection (projection='lcc').
• It projects the map onto a cone arranged in such a way that two standard parallels (specified in Basemap by
lat_1 and lat_2) have well-represented distances, with scale decreasing between them and increasing
outsideof them.
• Other useful conic projections are the equidistant conic (projection='eqdc') and the Albers equal-area
(projection='aea') projection
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None,
lon_0=0, lat_0=50, lat_1=45, lat_2=55, width=1.6E7, height=1.2E7)
draw_map(m)
Drawing a Map Background
The Basemap package contains a range of useful functions for drawing borders of physical features like
continents, oceans, lakes, and rivers, as well as political boundaries such as countries and US states and counties.
The following are some of the available drawing functions that you may wish to explore using IPython’s
help features:
• Political boundaries
drawcountries() - Draw country boundaries
drawstates()    - Draw US state boundaries
drawcounties()  - Draw US county boundaries
• Map features
drawgreatcircle() - Draw a great circle between two points
drawparallels()   - Draw lines of constant latitude
drawmeridians()   - Draw lines of constant longitude
drawmapscale()    - Draw a linear scale on the map
• Whole-globe images
bluemarble()   - Project NASA’s blue marble image onto the map
shadedrelief() - Project a shaded relief image onto the map
etopo()        - Draw an etopo relief image onto the map
warpimage()    - Project a user-provided image onto the map
Pair plots
When you generalize joint plots to datasets of larger dimensions, you end up with pair plots. This is very useful
for exploring correlations between multidimensional data, when you’d like to plot all pairs of values against
each other.
We’ll demo this with the Iris dataset, which lists measurements of petals and sepals of three iris species:
import seaborn as sns
iris = sns.load_dataset("iris")
sns.pairplot(iris, hue='species', size=2.5);
Faceted histograms
• Sometimes the best way to view data is via histograms of subsets. Seaborn’s FacetGrid makes this
extremely simple.
• We’ll take a look at some data that shows the amount that restaurant staff receive in tips based on
various indicator data (see the sketch below).
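A minimal sketch using Seaborn’s bundled tips dataset (the tip_pct column is derived here purely for illustration):
import seaborn as sns
import matplotlib.pyplot as plt
tips = sns.load_dataset('tips')    # restaurant tips dataset, assumed available
tips['tip_pct'] = 100 * tips['tip'] / tips['total_bill']
grid = sns.FacetGrid(tips, row='sex', col='time', margin_titles=True)
grid.map(plt.hist, 'tip_pct', bins=15);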
Factor plots
Factor plots can be useful for this kind of visualization as well. This allows you to
view the distribution of a parameter within bins defined by any other parameter.
Joint distributions
Similar to the pair plot we saw earlier, we can use sns.jointplot to show the joint
distribution between different datasets, along with the associated marginal distributions.
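For instance, with the same illustrative tips dataset:
import seaborn as sns
tips = sns.load_dataset('tips')
sns.jointplot(x='total_bill', y='tip', data=tips, kind='hex');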
Bar plots
Time series can be plotted with sns.factorplot.
Question Bank
UNIT-1
1. Define bigdata.
Big data is a term for any collection of data sets so large or complex that it becomes difficult to
process them using traditional data management techniques
• The characteristics of big data are often referred to as the three Vs:
o Volume—How much data is there?
o Variety—How diverse are different types of data?
o Velocity—At what speed is new data generated?
• Fourth V: Veracity: How accurate is the data?
• Company data - data can be stored in official data repositories such as databases, data marts,
data warehouses, and data lakes
• Data mart: A data mart is a subset of the data warehouse and will be serving a specific
business unit.
• Data lakes: Data lakes contain data in its natural or raw format.
With brushing and linking we combine and link different graphs and tables or views so changes in
one graph are automatically transferred to the other graphs.
Variance and standard deviation are measures of data dispersion. variance is a measure of
dispersion that takes into account the spread of all data points in a data set. The standard deviation,
is simply the square root of the variance.
Quantile Plot
A quantile plot is a simple and effective way to have a first look at a univariate
data distribution.
Quantile–Quantile Plot
A quantile–quantile plot, or q-q plot, graphs the quantiles of one univariate distribution against the
corresponding quantiles of another.
Histograms
“Histos” means pole or mast, and “gram” means chart, so a histogram is a chart of poles. Plotting
histograms is a graphical method for summarizing the distribution of a given attribute, X.
PART B:
Descriptive Statistics:
• Descriptive statistics provides us with tools—tables, graphs, averages, ranges, correlations—
for organizing and summarizing the inevitable variability in collections of actual observations
or scores.
• Eg: A tabular listing, ranked from most to least, A graph showing the annual change in global
temperature during the last 30 years
Inferential Statistics:
• Statistics also provides tools—a variety of tests and estimates—for generalizing beyond
collections of actual observations.
• This more advanced area is known as inferential statistics.
• Eg: An assertion about the relationship between job satisfaction and overall happiness
7. Define outliers.
OUTLIERS
• One or more very extreme scores in a distribution are called outliers.
• Ex: A GPA of 0.06, an IQ of 170, summer wages of $62,000
• Ideal for summarizing distributions, such as that for weight data, without destroying the
identities of individual observations.
In statistics, a misleading graph, also known as a distorted graph, is a graph that misrepresents data,
constituting a misuse of statistics and with the result that an incorrect conclusion may be derived from
it.
• Variance and Standard Deviation are the two important measurements in statistics.
• Variance is a measure of how data points vary from the mean.
• The standard deviation is the measure of the distribution of statistical data.
• The standard deviation is the square root of the mean of all squared deviations from the mean,
that is, SD = √(Σ(X − mean)² / N).
PART B:
1. Types of data
2. Types of variables
3. Explain Frequency distribution. (or) How will the data be described with tables and graphs?
4. Explain mean, median and mode? (or) How will the data be described with averages?
5. Explain about data variability with example. (or) Standard Deviation and Variance
6. Explain normal curve and z-score.
PROBLEMS :
Correlation is a statistical measure that indicates the extent to which two or more variables fluctuate
in relation to each other. A positive correlation indicates the extent to which those variables increase
or decrease in parallel; a negative correlation indicates the extent to which one variable increases as
the other decreases.
3. Define scatterplot.
A scatterplot is a graph containing a cluster of dots that represents all pairs of scores.
6. Define r value.
A correlation coefficient is a number between –1 and 1 that describes the relationship between pairs
of variables.
The type of correlation coefficient, designated as r, that describes the linear relationship between
pairs of variables for quantitative data.
A number with a plus sign (or no sign) indicates a positive relationship, and a number with a minus
sign indicates a negative relationship.
8. Define regression.
Regression is defined as a statistical method that helps us to analyze and understand the relationship
between two or more variables of interest.
In statistics, regression toward the mean is that if one sample of a random variable is extreme, the
next sampling of the same random variable is likely to be closer to its mean.
PART B:
PROBLEMS:
1. Correlation – r value
2. Regression
3. Standard error of estimate
4. Interpretation of r2
UNIT-4
1. What is NumPy? Why should we use it?
NumPy (also called Numerical Python) is a highly flexible, optimized, open-source package meant
for array processing. It provides tools for delivering high-end performance while dealing with N-
dimensional powerful array objects.
• One-Dimensional array
import numpy as np
arr = [1, 2, 3, 4]
numpy_arr = np.array(arr)
• Two-Dimensional array
import numpy as np
arr = [[1,2,3,4],[4,5,6,7]]
numpy_arr = np.array(arr)
• Three-Dimensional array
import numpy as np
arr = [[[1,2,3,4],[4,5,6,7],[7,8,9,10]]]
numpy_arr = np.array(arr)
Using the np.array() function, we can create NumPy arrays of any dimensions.
Syntax         Description
array.shape    Dimensions (rows, columns)
len(array)     Length of array
array.ndim     Number of array dimensions
array.dtype    Data type
Array Manipulation
Adding or Removing Elements
Operator Description
np.append(a,b) Append items to array
6. Define broadcasting.
Broadcasting is simply a set of rules for applying binary ufuncs (addition, subtraction, multiplication,
etc.) on arrays of different sizes.
They also provide broadcasting and additional methods like reduce, accumulate etc. that are very
helpful for computation.
Eg: import numpy as np
x = [1, 2, 3, 4]
y = [4, 5, 6, 7]
z = np.add(x, y)
print(z)
Eg: import numpy as np
a = np.array([51, 92, 14, 71, 60, 20, 82, 86, 74, 74])
print(a)
Selection :
Selecting a single row using .loc[] (the older .ix[] indexer is deprecated)
In order to select a single row, we put a single row label in the .loc[] function.
In order to select all rows and some columns, we use single colon [:] to select all of rows and for
columns we make a list of integer then pass to a .iloc[] function.
import pandas as pd
data = pd.read_csv("employees.csv")
bool_series = pd.isnull(data["Gender"])
data[bool_series]
df_ind3.sort_index()
print(df_ind3.head(10))
Aggregation in pandas provides various functions that perform a mathematical or logical operation
on our dataset and returns a summary of that function. Aggregation can be used to get a summary of
columns in our dataset like getting sum, minimum, maximum, etc. from a particular column of our
dataset.
Aggregation :
Function Description:
• sum() :Compute sum of column values
• min() :Compute min of column values
• max() :Compute max of column values
• mean() :Compute mean of column
• size() :Compute column sizes
• describe() :Generates descriptive statistics
• first() :Compute first of group values
• last() :Compute last of group values
• count() :Compute count of column values
• std() :Standard deviation of column
• var() :Compute variance of column
• sem() :Standard error of the mean of column
18. Grouping :
Grouping is used to group data using some criteria from our dataset. It is used as split-apply-combine
strategy.
dataset.groupby(['cut', 'color']).agg('min')
PART B:
1. NUMPY aggregation
2. Comparision, mask, Boolean logic
3. Fancy indexing
4. structured array
5. Indexing, selection in pandas
6. Missing data – pandas
7. Hierarchical indexing
8. Combining dataset in pandas
9. Aggregation and grouping in pandas
10. Pivot tables
Questions may come scenario based, eg: Counting Rainy Days – for comparison , mask and Boolean
expression
UNIT-5
1. Define matplotlib.
Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in
Python.
import matplotlib.pyplot as plt
x = [5, 2, 9, 4, 7]
y = [10, 5, 8, 4, 2]
plt.plot(x, y)
plt.show()
import matplotlib.pyplot as plt
fig = plt.figure()
ax = plt.axes()
x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
y = [i ** 2 for i in x]   # placeholder values; the original omitted y (any list of the same length works)
plt.scatter(x, y, c="blue")
Continuous Errors
In some situations it is desirable to show errorbars on continuous quantities
6. Histograms
A histogram is a graphical representation of the distribution of data given by the user. Its
appearance is similar to Bar-Graph except it is continuous.
The towers or bars of a histogram are called bins. The height of each bin shows how many values
from that data fall into that range.
Divergent colormaps
These usually contain two distinct colors, which show positive and negative deviations
from a mean (e.g., RdBu or PuOr).
Qualitative colormaps
These mix colors with no particular sequence (e.g., rainbow or jet).
8. Define subplots.
Sometimes it is helpful to compare different views of data side by side. To this end,
Matplotlib has the concept of subplots: groups of smaller axes that can exist together
within a single figure.
9. Define annotation.
The annotate() function in pyplot module of matplotlib library is used to annotate the point xy with
text s.
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface
for drawing attractive and informative statistical graphics.
One common type of visualization in data science is that of geographic data. Matplotlib's main tool
for this type of visualization is the Basemap toolkit, which is one of several Matplotlib toolkits
which lives under the mpl_toolkits namespace.
plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=-100)
m.bluemarble(scale=0.5);
13. How will you plot multiple dimensions in a graph? (or) 3D plot
In order to plot 3D figures use matplotlib, we need to import the mplot3d toolkit, which adds the
simple 3D plotting capabilities to matplotlib.
import numpy as np
from mpl_toolkits import mplot3d
import matplotlib.pyplot as plt
plt.style.use('seaborn-poster')
Once we imported the mplot3d toolkit, we could create 3D axes and add data to the axes. create a
3D axes.
Kernel density estimation (KDE) is in some senses an algorithm which takes the mixture-of-
Gaussians idea to its logical extreme: it uses a mixture consisting of one Gaussian component per
point, resulting in an essentially non-parametric estimator of density.
The lowest level of subplots is plt.subplot(), which creates a single subplot within a grid.
for i in range(1, 7):
    plt.subplot(2, 3, i)
    plt.text(0.5, 0.5, str((2, 3, i)),
             fontsize=18, ha='center')
PART B:
1. Line plots
2. SCATTER PLOTS
3. DENSITY AND CONTOUR PLOTS
4. BASIC AND CONTINUOUS ERRORS
5. 3D PLOTTING
6. GEOGRAPHICAL DATA WITH BASEMAP
7. SEABORN
8. HISTOGRAM (WITH ALL OPERATIONS)