Data Science
ENGINEERING COLLEGE
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
REGULATION 2021
II YEAR - III SEMESTER
COURSE OBJECTIVES:
• To understand the data science fundamentals and process.
• To learn to describe the data for the data science process.
• To learn to describe the relationship between data.
• To utilize the Python libraries for Data Wrangling.
• To present and interpret data using visualization libraries in Python
UNIT I INTRODUCTION
Data Science: Benefits and uses – facets of data - Data Science Process: Overview – Defining
research goals – Retrieving data – Data preparation - Exploratory Data analysis – build the
model– presenting findings and building applications - Data Mining - Data Warehousing – Basic
Statistical descriptions of Data
Big Data:
Big data is a blanket term for any collection of data sets so large or complex that it becomes
difficult to process them using traditional data management techniques such as, for example,
the RDBMS.
I. Data Science:
• Data science involves using methods to analyze massive amounts of data and extract
the knowledge it contains.
• The characteristics of big data are often referred to as the three Vs:
o Volume—How much data is there?
o Variety—How diverse are different types of data?
o Velocity—At what speed is new data generated?
• Fourth V:
• Veracity: How accurate is the data?
• Data science is an evolutionary extension of statistics capable of dealing with the
massive amounts of data produced today.
• What sets a data scientist apart from a statistician is the ability to work with big data and
experience in machine learning, computing, and algorithm building. Typical tools include
Hadoop, Pig, Spark, R, Python, and Java, among others.
• Data science and big data are used almost everywhere in both commercial and non-
commercial settings.
• Commercial companies in almost every industry use data science and big data to
gain insights into their customers, processes, staff, competition, and products.
• Many companies use data science to offer customers a better user experience.
o Eg: Google AdSense, which collects data from internet users so relevant
commercial messages can be matched to the person browsing the internet
o MaxPoint - example of real-time personalized advertising.
• Human resource professionals:
o people analytics and text mining to screen candidates,
o monitor the mood of employees, and
o study informal networks among coworkers
• Financial institutions use data science:
o to predict stock markets, determine the risk of lending money, and
o learn how to attract new clients for their services
• Governmental organizations:
o internal data scientists to discover valuable information,
o share their data with the public
o Eg: Data.gov is but one example; it’s the home of the US Government’s open
data.
o Some organizations have collected 5 billion data records from widespread
applications such as Google Maps, Angry Birds, email, and text messages, among
many other data sources.
• Nongovernmental organizations:
o World Wildlife Fund (WWF), for instance, employs data scientists to increase
the effectiveness of their fundraising efforts.
o Eg: DataKind is one such data scientist group that devotes its time to the
benefit of mankind.
• Universities:
o Use data science in their research but also to enhance the study experience of
their students.
o massive open online courses (MOOC) produces a lot of data, which allows
universities to study how this type of learning can complement traditional
classes.
o Eg: Coursera, Udacity, and edX
Structured data:
• Structured data is data that depends on a data model and resides in a fixed field within a record.
• It is easy to store structured data in tables within databases or Excel files and to query it with Structured Query Language (SQL).
Unstructured data:
• Unstructured data is data that isn’t easy to fit into a data model
• The content is context-specific or varying.
• Eg: E-mail
• Email contains structured elements such as the sender, title, and body text
• Eg: It’s a challenge to find the number of people who have written an email
complaint about a specific employee because so many ways exist to refer to a
person.
• Another complication is the thousands of different languages and dialects.
Natural language:
• A human-written email is also a perfect example of natural language data.
• Natural language is a special type of unstructured data;
• It’s challenging to process because it requires knowledge of specific data science
techniques and linguistics.
• Topics in NLP: entity recognition, topic recognition, summarization, text
completion, and sentiment analysis.
• Human language is ambiguous in nature.
Machine-generated data:
• Machine-generated data is information that’s automatically created by a computer,
process, application, or other machines without human intervention.
• Machine-generated data is becoming a major data resource.
• Eg: Wikibon has forecast that the market value of the industrial Internet will be
approximately $540 billion in 2020.
• International Data Corporation has estimated there will be 26 times more
connected things than people in 2020.
• This network is commonly referred to as the internet of things.
• Examples of machine data are web server logs, call detail records, network event
logs, and telemetry.
Graph-based or network data:
• “Graph” in this case points to mathematical graph theory. In graph theory, a graph is a mathematical structure to model pair-wise relationships between objects.
• Graph or network data is, in short, data that focuses on the relationship or adjacency of objects.
• The graph structures use nodes, edges, and properties to represent and store graphical data.
• Graph-based data is a natural way to represent social networks, and its structure
allows you to calculate the shortest path between two people.
• Graph-based data can be found on many social media websites.
• Eg: LinkedIn, Twitter, movie interests on Netflix
• Graph databases are used to store graph-based data and are queried with specialized query languages such as SPARQL.
Streaming data:
• The data flows into the system when an event happens instead of being loaded into
a data store in a batch.
• Examples are the “What’s trending” on Twitter, live sporting or music events, and the stock market.
The Data Science Process:
• A structured data science approach helps you maximize your chances of success in a data science project at the lowest cost.
• The first step of this process is setting a research goal.
• The main purpose here is to make sure all the stakeholders understand the what, how, and why of the project.
• Document the result in a project charter.
• The second step is retrieving data: assess the relevance and quality of the data that’s readily available within the company.
• Company data - data can be stored in official data repositories such as databases,
data marts, data warehouses, and data lakes maintained by a team of IT
professionals.
• Data mart: A data mart is a subset of the data warehouse and will be serving a
specific business unit.
• Data lakes: Data lakes contain data in its natural or raw format.
• Challenge: As companies grow, their data becomes scattered around many places.
• Knowledge of the data may be dispersed as people change positions and leave the
company.
• Chinese Walls: These policies translate into physical and digital barriers called
Chinese walls. These “walls” are mandatory and well-regulated for customer data.
Don’t be afraid to shop around:
The model needs the data in a specific format, so data transformation is a necessary step.
It’s a good habit to correct data errors as early on in the process as possible.
Cleansing data:
Data cleansing is a subprocess of the data science process.
It focuses on removing errors in the data.
Then the data becomes a true and consistent representation of the processes.
Types of errors:
Interpretation error - a person’s age is greater than 300 years
Inconsistencies - this class of errors includes putting “Female” in one table and “F” in another when
they represent the same thing.
DATA ENTRY ERRORS:
• Data collection and data entry are error-prone processes.
• Errors can arise from human sloppiness, whereas others are due to machine or
hardware failure.
• Eg: transmission errors
REDUNDANT WHITESPACE:
• Whitespaces tend to be hard to detect but cause errors like other redundant
characters.
• Eg: a mismatch of keys such as “FR ” – “FR”
• Fixing redundant whitespace - in Python you can use the strip() function to remove
leading and trailing spaces.
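A minimal Python sketch of this fix, assuming a hypothetical pandas column named country_code:

```python
import pandas as pd

# Hypothetical data: the same country code appears with and without extra spaces.
df = pd.DataFrame({"country_code": ["FR ", "FR", " DE", "DE"]})

# str.strip() removes leading and trailing whitespace, so "FR " and "FR" match again.
df["country_code"] = df["country_code"].str.strip()

# Plain Python strings work the same way.
print("FR ".strip() == "FR")  # True
```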
OUTLIERS
• An outlier is an observation that seems to be distant from other observations.
• The normal distribution, or Gaussian distribution, is the most common distribution
in natural sciences.
The high values in the bottom graph can point to outliers when assuming a normal
distribution.
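As a rough sketch (not from the text), one common way to flag such outliers under a normality assumption is to mark values more than three standard deviations from the mean; the ages below are made up for illustration:

```python
import numpy as np

# Hypothetical observations with one suspicious value (an age of 310 years).
ages = np.array([23, 31, 28, 45, 52, 39, 27, 41, 36, 30,
                 25, 47, 33, 29, 44, 38, 26, 35, 40, 310])

mean, std = ages.mean(), ages.std()
z_scores = (ages - mean) / std          # distance from the mean in standard deviations

outliers = ages[np.abs(z_scores) > 3]   # the usual "3-sigma" rule of thumb
print(outliers)                          # -> [310]
```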
JOINING TABLES
• Joining tables allows you to combine the information of one observation found in one table with the information that you find in another table.
• Tables are joined on keys - variables that represent the same object in both tables, such as a date, a country name, or a client number.
• When these keys also uniquely define the records in the table, they are called primary keys.
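A small pandas sketch of a join, using hypothetical clients and orders tables keyed on an assumed client_id column:

```python
import pandas as pd

# Hypothetical tables that share the key column "client_id" (assumed to be a
# primary key in the clients table, i.e. it uniquely identifies each record there).
clients = pd.DataFrame({"client_id": [1, 2, 3], "name": ["Ann", "Bob", "Cid"]})
orders = pd.DataFrame({"client_id": [1, 1, 3], "amount": [250, 80, 120]})

# Joining combines the information of an observation in one table with the
# information found in the other table, matched on the key.
joined = orders.merge(clients, on="client_id", how="left")
print(joined)
```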
APPENDING TABLES
• Appending or stacking tables is effectively adding observations from one table to
another table.
• With brushing and linking we combine and link different graphs and tables or
views so changes in one graph are automatically transferred to the other graphs.
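A matching pandas sketch of appending (stacking), using two hypothetical tables with the same columns:

```python
import pandas as pd

# Hypothetical tables with the same columns but observations from different months.
jan = pd.DataFrame({"client_id": [1, 2], "amount": [250, 80]})
feb = pd.DataFrame({"client_id": [1, 3], "amount": [60, 120]})

# Appending (stacking) adds the observations of one table below the other.
stacked = pd.concat([jan, feb], ignore_index=True)
print(stacked)  # four rows, same columns
```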
Pareto Diagram:
• A Pareto diagram is a combination of a bar chart with values sorted in descending order and a cumulative percentage line, often used during exploratory data analysis.
Build the Models:
• With clean data in place and a good understanding of the content, we’re ready to
build models with the goal of making better predictions, classifying objects, or
gaining an understanding of the system that we’re modeling.
• The techniques we’ll use now are borrowed from the field of machine learning,
data mining, and/or statistics.
We need to select the variables we want to include in our model and a modeling
technique.
We'll need to consider model performance and whether our project meets all the
requirements to use the model, as well as other factors:
■ Must the model be moved to a production environment and, if so, would it be easy to
implement?
■ How difficult is the maintenance on the model: how long will it remain relevant if left
untouched?
■ Does the model need to be easy to explain?
Model execution:
• Most programming languages, such as Python, already have libraries such as
StatsModels or Scikit-learn.
• These packages use several of the most popular techniques.
• Coding a model is a nontrivial task in most cases, so having these libraries available
can speed up the process.
Predictor significance—Coefficients are great, but sometimes not enough evidence exists
to show that the influence is there. This is what the p-value indicates: a p-value of 0.05
means that, if the predictor actually had no influence, an effect at least this large would
appear by chance only 5% of the time.
Model diagnostics and model comparison
Working with a holdout sample helps you pick the best-performing model.
A holdout sample is a part of the data you leave out of the model building so it can be used
to evaluate the model afterward. The principle here is simple: the model should work on
unseen data.
Mean square error is a simple measure: check for every prediction how far it was from the
truth, square this error, and add up the error of every prediction.
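A hedged scikit-learn sketch of the holdout idea, using synthetic data invented for illustration: fit on a training split, then compute the mean square error on the unseen holdout split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic data purely for illustration: y depends linearly on X plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

# Hold out 30% of the data; the model never sees it during fitting.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# Mean square error on the holdout sample: how far each prediction is from the
# truth, squared and averaged over all holdout observations.
mse = mean_squared_error(y_test, model.predict(X_test))
print(mse)
```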
X. Data Mining:
Many people treat data mining as a synonym for another popularly used term, knowledge
discovery from data, or KDD, while others view data mining as merely an essential step in
the process of knowledge discovery. The knowledge discovery process is shown in Figure
1.4 as an iterative sequence of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the
database)
4. Data transformation (where data are transformed and consolidated into forms
appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract data
patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge
based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation
techniques are used to present mined knowledge to users)
XI. Data Warehouses
• A data warehouse is a repository of information collected from multiple sources,
stored under a unified schema, and usually residing at a single site.
• Data warehouses are constructed via a process of data cleaning, data integration,
data transformation, data loading, and periodic data refreshing.
• To facilitate decision making, the data in a data warehouse are organized around
major subjects (e.g., customer, item, supplier, and activity).
• The data are stored to provide information from a historical perspective, such as in
the past 6 to 12 months, and are typically summarized.
• For example, rather than storing the details of each sales transaction, the data
warehouse may store a summary of the transactions per item type for each store or,
summarized to a higher level, for each sales region.
• A data warehouse is usually modeled by a multidimensional data structure, called a
data cube, in which each dimension corresponds to an attribute or a set of attributes
in the schema, and each cell stores the value of some aggregate measure such as
count.
XII. Basic Statistical Descriptions of Data:
• Suppose that we have some attribute X, like salary, which has been recorded for a
set of objects.
• Let x1, x2, ..., xN be the set of N observed values or observations for X.
• Here, these values may also be referred to as the data set (for X).
• Measures of central tendency include the mean, median, mode, and midrange.
• The most common and effective numeric measure of the “center” of a set of data is
the (arithmetic) mean.
• Let x1, x2, ..., xN be a set of N values or observations, such as for some numeric
attribute X, like salary.
• The mean of this set of values is x̄ = (x1 + x2 + ... + xN) / N.
• Sometimes, each value xi in a set may be associated with a weight wi for i = 1, ..., N.
• The weights reflect the significance, importance, or occurrence frequency attached
to their respective values. In this case, we can compute the weighted mean as
x̄ = (w1x1 + w2x2 + ... + wNxN) / (w1 + w2 + ... + wN). This is called the weighted
arithmetic mean or the weighted average.
• To offset the effect caused by a small number of extreme values, we can instead use
the trimmed mean,which is the mean obtained after chopping off values at the high
and low extremes.
• For example, we can sort the values observed for salary and remove the top and
bottom 2% before computing the mean.
• We should avoid trimming too large a portion (such as 20%) at both ends, as this can
result in the loss of valuable information.
• For skewed data, a better measure of the center of data is the median, which is the
middle value in a set of ordered data values.
• It is the value that separates the higher half of a data set from the lower half.
Median:
• The median generally applies to numeric data; however, we may extend the concept
to ordinal data.
• Suppose that a given data set of N values for an attribute X is sorted in increasing
order.
• If N is odd, then the median is the middle value of the ordered set. If N is even, then
the median is not unique; it is the two middlemost values and any value in between.
By convention, the median is then taken as the average of the two middlemost values.
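A short NumPy/SciPy sketch of these measures of central tendency, using made-up salary values (the weights passed to the weighted mean are arbitrary):

```python
import numpy as np
from scipy import stats

# Hypothetical salary observations (in thousands) with one extreme value.
salaries = np.array([30, 31, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

print(np.mean(salaries))                                # arithmetic mean
print(np.average(salaries, weights=np.arange(1, 13)))   # weighted arithmetic mean (made-up weights)
print(stats.trim_mean(salaries, 0.1))                   # mean after trimming 10% at each end
print(np.median(salaries))                              # middle value of the ordered data
```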
Mode:
• The mode is the value that occurs most frequently in the set; it can be determined for qualitative as well as quantitative attributes.
Mid Range:
• The midrange can also be used to assess the central tendency of a numeric data set.
• It is the average of the largest and smallest values in the set.
• This measure is easy to compute using the SQL aggregate functions, max() and
min().
• In a unimodal frequency curve with perfect symmetric data distribution, the mean,
median, and mode are all at the same center value.
• Data in most real applications are not symmetric.
• They may instead be either positively skewed, where the mode occurs at a value
that is smaller than the median, or negatively skewed, where the mode occurs at a
value greater than the median.
• Let x1, x2, ..., xN be a set of observations for some numeric attribute, X. The range of
the set is the difference between the largest (max()) and smallest (min()) values.
• Suppose that the data for attribute X are sorted in increasing numeric order.
• Imagine that we can pick certain data points so as to split the data distribution into
equal-size consecutive sets, as in Figure 2.2.
• These data points are called quantiles.
• Quantiles are points taken at regular intervals of a data distribution, dividing it into
essentially equal-size consecutive sets.
• The 2-quantile is the data point dividing the lower and upper halves of the data
distribution.
• It corresponds to the median.
• The 4-quantiles are the three data points that split the data distribution into four
equal parts; each part represents one-fourth of the data distribution. They are more
commonly referred to as quartiles.
• The 100-quantiles are more commonly referred to as percentiles; they divide the
data distribution into 100 equal-sized consecutive sets.
• The median, quartiles, and percentiles are the most widely used forms of quantiles.
• The distance between the first and third quartiles is a simple measure of spread that
gives the range covered by the middle half of the data. This distance is called the
interquartile range (IQR) and is defined as
IQR = Q3 − Q1.
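Continuing the same made-up salary values used earlier, the quartiles and the interquartile range can be computed with NumPy:

```python
import numpy as np

# The same hypothetical salary data as above (in thousands).
salaries = np.array([30, 31, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

q1, median, q3 = np.percentile(salaries, [25, 50, 75])  # the three quartiles
iqr = q3 - q1                                           # spread of the middle half of the data
print(q1, median, q3, iqr)
```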
The standard deviation, σ, of the observations is the square root of the variance, σ², where the variance is σ² = (1/N) Σ (xi − x̄)².
Graphic Displays of Basic Statistical Descriptions of Data:
Quantile Plot
A quantile plot is a simple and effective way to have a first look at a univariate
data distribution. First, it displays all of the data for the given attribute.
Second, it plots quantile information.
Quantile–Quantile Plot
A quantile–quantile plot, or q-q plot, graphs the quantiles of one univariate distribution
against the corresponding quantiles of another.
It is a powerful visualization tool in that it allows the user to view whether there is a shift
in going from one distribution to another.
Histograms
Histograms (or frequency histograms) are at least a century old and are widely used.
“Histos” means pole or mast, and “gram” means chart, so a histogram is a chart of poles.
Plotting histograms is a graphical method for summarizing the distribution of a given
attribute, X.
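A matplotlib/SciPy sketch of these displays, drawn for synthetic, normally distributed values invented for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Synthetic attribute values purely for illustration.
x = np.random.default_rng(1).normal(loc=50, scale=10, size=500)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: summarizes the distribution of the attribute.
ax1.hist(x, bins=20, edgecolor="black")
ax1.set_title("Histogram")

# q-q plot against the normal distribution: points near the line suggest normality.
stats.probplot(x, dist="norm", plot=ax2)
ax2.set_title("Quantile-quantile plot")

plt.tight_layout()
plt.show()
```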
Types of Data - Types of Variables -Describing Data with Tables and Graphs –Describing
Data with Averages - Describing Variability - Normal Distributions and Standard (z)
Scores
WHAT IS STATISTICS?
Statistics exists because of the prevalence of variability in the real world.
Descriptive Statistics:
• In its simplest form, known as descriptive statistics, statistics provides us with
tools—tables, graphs, averages, ranges, correlations—for organizing and summarizing
the inevitable variability in collections of actual observations or scores.
• Eg: a tabular listing ranked from most to least; a graph showing the annual change
in global temperature during the last 30 years.
Inferential Statistics:
• Statistics also provides tools—a variety of tests and estimates—for generalizing
beyond collections of actual observations.
• This more advanced area is known as inferential statistics.
• Eg: An assertion about the relationship between job satisfaction and overall
happiness
Data
Quantitative Data:
• The weights reported by 53 male students in Table 1.1 are quantitative data, since any
single observation, such as 160 lbs, represents an amount of weight.
Ranked Data
• Data are ranked when any single observation represents relative standing within the group, such as an ordering from 1 to 15 of the values in the list.
Qualitative Data
The Y and N replies of students in Table 1.2 are qualitative data, since any single
observation is a letter that represents a class of replies.
Approximate Numbers
• In theory, values for continuous variables can be carried out infinitely far.
• Eg: Someone’s weight, in pounds, might be 140.01438, and so on, to infinity!
• Practical considerations require that values for continuous variables be rounded off.
• Whenever values are rounded off, as is always the case with actual values for
continuous variables, the resulting numbers are approximate, never exact.
• For example, the weights of the students are reported to the nearest pound.
• A student whose weight is listed as 150 lbs could actually weigh between 149.5 and
150.5 lbs.
Independent and Dependent Variables
• Most studies raise questions about the presence or absence of a relationship
between two (or more) variables.
• Eg: A psychologist might wish to investigate whether couples who undergo special
training in “active listening” tend to have fewer communication breakdowns than do
couples who undergo no special training.
• An experiment is a study in which the investigator decides who receives the special
treatment; the variable manipulated by the investigator is called the independent variable.
Dependent Variable
• When a variable is believed to have been influenced by the independent variable, it is
called a dependent variable.
• In an experimental setting, the dependent variable is measured, counted, or
recorded by the investigator.
• Unlike the independent variable, the dependent variable isn’t manipulated by the
investigator.
• Instead, it represents an outcome: the data produced by the experiment.
• Eg: To test whether training influences communication, the psychologist counts the
number of communication breakdowns between each couple.
Observational Studies
• Instead of undertaking an experiment, an investigator might simply observe the
relation between two variables. For example, a sociologist might collect paired
measures of poverty level and crime rate for each individual in some group.
• Such studies are often referred to as observational studies.
• An observational study focuses on detecting relationships between variables not
manipulated by the investigator, and it yields less clear-cut conclusions about cause-
effect relationships than does an experiment.
Confounding Variable
• Whenever groups differ not just because of the independent variable but also because
some uncontrolled variable co-varies with the independent variable, any conclusion
about a cause-effect relationship is suspect.
• A difference between groups might be due not to the independent variable but to a
confounding variable.
• For instance, couples willing to devote extra effort to special training might already
possess a deeper commitment that co-varies with more active-listening skills.
• An uncontrolled variable that compromises the interpretation of a study is known as a
confounding variable.
Problems:
III. DESCRIBING DATA WITH TABLES AND GRAPHS:
• To organize the weights of the male statistics students listed in Table 1.1, first arrange
a column of consecutive numbers, beginning with the lightest weight (133) at the bottom
and ending with the heaviest weight (245) at the top.
• Place a short vertical stroke or tally next to a number each time its value appears in the
original set of data; once this process has been completed, substitute for each tally
count a number indicating the frequency (f) of occurrence of each weight.
• When observations are sorted into classes of single values, as in Table 2.1, the result
is referred to as a frequency distribution for ungrouped data.
• The frequency distribution shown in Table 2.1 is only partially displayed because there
are more than 100 possible values between the largest and smallest observations.
Grouped Data
• When observations are sorted into classes of more than one value, as in Table 2.2, the
result is referred to as a frequency distribution for grouped data.
• Data are grouped into class intervals with 10 possible values each.
• The bottom class includes the smallest observation (133), and the top class includes
the largest observation (245).
• The distance between bottom and top is occupied by an orderly series of classes.
• The frequency (f) column shows the frequency of observations in each class and, at
the bottom, the total number of observations in all classes.
Essential:
1. Each observation should be included in one, and only one, class.
Example: 130–139, 140–149, 150–159, etc.
2. List all classes, even those with zero frequencies.
Example: Listed in Table 2.2 is the class 210–219 and its frequency of zero.
3. All classes should have equal intervals.
Example: 130–139, 140–149, 150–159, etc. It would be incorrect to use 130–139, 140–159,
etc., in which the class intervals are unequal.
Optional:
4. All classes should have both an upper boundary and a lower boundary.
Example: 240–249. Less preferred would be 240–above, in which no maximum value can
be assigned to observations in this class.
5. Select the class interval from convenient numbers, such as 1, 2, 3, . . . 10, particularly 5
and 10 or multiples of 5 and 10.
Example: 130–139, 140–149, in which the class interval of 10 is a convenient number.
6. The lower boundary of each class interval should be a multiple of the class interval.
Example: 130–139, 140–149, in which the lower boundaries of 130, 140, are multiples of 10,
the class interval.
7. Aim for a total of approximately 10 classes. Example:
The distribution in Table 2.2 uses 12 classes.
CONSTRUCTING FREQUENCY DISTRIBUTIONS
1. Find the range
2. Find the class interval required to span the range by dividing the range by the desired
number of classes
3. Round off to the nearest convenient interval
4. Determine where the lowest class should begin.
5. Determine where the lowest class should end.
6. Working upward, list as many equivalent classes as are required to include the largest
observation.
7. Indicate with a tally the class in which each observation falls.
8. Replace the tally count for each class with a number—the frequency (f )—and showthe
total of all frequencies.
9. Supply headings for both columns and a title for the table.
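A pandas sketch of grouping observations into class intervals of width 10; the weights below are made up, since Table 1.1 is not reproduced here:

```python
import pandas as pd

# Hypothetical weights (lbs); real data would come from Table 1.1.
weights = pd.Series([133, 144, 150, 152, 155, 157, 160, 160, 165, 170,
                     172, 175, 180, 185, 190, 195, 205, 215, 230, 245])

# Class intervals of width 10, beginning at 130 (a multiple of the class interval).
bins = list(range(130, 260, 10))
freq = pd.cut(weights, bins=bins, right=False).value_counts().sort_index()

print(freq)          # frequency (f) of each class: [130, 140), [140, 150), ...
print(freq.sum())    # total number of observations
```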
Problems:
OUTLIERS
• One or more very extreme scores in a distribution are called outliers.
• Ex: A GPA of 0.06, an IQ of 170, summer wages of $62,000
Problem:
RELATIVE FREQUENCY DISTRIBUTIONS
Percentages or Proportions?
• A proportion always varies between 0 and 1, whereas a percentage always varies
between 0 percent and 100 percent.
• To convert the relative frequencies in Table 2.5 from proportions to percentages,
multiply each proportion by 100; that is, move the decimal point two places to the right.
Problem:
Cumulative Percentages
• If relative standing within a distribution is particularly important, then cumulative
frequencies are converted to cumulative percentages.
Percentile Ranks
• When used to describe the relative position of any score within its parent
distribution, cumulative percentages are referred to as percentile ranks.
• The percentile rank of a score indicates the percentage of scores in the entire
distribution with similar or smaller values than that score.
Problem:
FREQUENCY DISTRIBUTIONS FOR QUALITATIVE (NOMINAL) DATA
• When, among a set of observations, any single observation is a word, letter, or
numerical code, the data are qualitative.
• Determine the frequency with which observations occupy each class, and report
these frequencies.
• This frequency distribution reveals that Yes replies are approximately twice as
prevalent as No replies.
• When inspecting a distribution for the first time, train yourself to look at the entire
table, not just the distribution.
• Read the title, column headings, and any footnotes.
• Where do the data come from? Is a source cited? Next, focus on the form of the
frequency distribution.
• Apply the same scrutiny when interpreting distributions constructed by someone else.
GRAPHS
• Data can be described clearly and concisely with the aid of a well-constructed
frequency distribution, and often even more vividly with graphs.
GRAPHS FOR QUANTITATIVE DATA
Histograms
Frequency Polygon
• An important variation on a histogram is the frequency polygon, or line graph.
• Frequency polygons may be constructed directly from frequency distributions.
• However, we will follow the step-by-step transformation of a histogram into a
frequency polygon.
STEM AND LEAF DISPLAYS
• Stem and leaf displays are ideal for summarizing distributions, such as that for
weight data, without destroying the identities of individual observations.
Constructing a Display
• The leftmost panel of Table 2.9 re-creates the weights of the 53 male statistics
students listed in Table 1.1.
• To construct the stem and leaf display for these data, when counting by tens, the
weights range from the 130s to the 240s.
• Arrange a column of numbers, the stems, beginning with 13 (representing the
130s) and ending with 24 (representing the 240s).
• Draw a vertical line to separate the stems, which represent multiples of 10, from the
space to be occupied by the leaves, which represent multiples of 1.
• Next, enter each raw score into the stem and leaf display.
Interpretation
• The weight data have been sorted by the stems. All weights in the 130s are listed
together; all of those in the 140s are listed together, and so on.
• A glance at the stem and leaf display in Table 2.9 shows essentially the same pattern
of weights depicted by the frequency distribution in Table 2.2 and the histogram.
Selection of Stems
• Stem values are not limited to units of 10.
• Depending on the data, you might identify the stem with one or more leading digits that
culminates in some variation on a stem value of 10, such as 1, 100, 1000, or even .1,
.01, .001, and so on.
• Stem and leaf displays represent statistical bargains.
Problem:
TYPICAL SHAPES
• Whether expressed as a histogram, a frequency polygon, or a stem and leaf display,
an important characteristic of a frequency distribution is its shape.
Normal
• Any distribution that approximates the normal shape in panel A of Figure 2.3 can be
analyzed with the aid of the normal curve.
• The familiar bell-shaped silhouette of the normal curve can be superimposed on many
frequency distributions, Eg: uninterrupted gestation periods of human fetuses, scores
on standardized tests, and even the popping times of individual kernels in a batch of
popcorn.
Bimodal
• Any distribution that approximates the bimodal shape in panel B of Figure 2.3 reflects
the coexistence of two different types of observations in the same distribution.
• Eg: The distribution of the ages of residents in a neighborhood consisting largely of
either new parents or their infants has a bimodal shape.
Positively Skewed
• The two remaining shapes in Figure 2.3 are lopsided.
• A lopsided distribution caused by a few extreme observations in the positive
direction, as in panel C of Figure 2.3, is a positively skewed distribution.
• Eg: most family incomes under $200,000 and relatively few family incomes spanning
a wide range of values above $200,000.
Negatively Skewed
• A lopsided distribution caused by a few extreme observations in the negative
direction as in panel D of Figure 2.3, is a negatively skewed distribution.
• Eg: Most retirement ages at 60 years or older and relatively few retirement ages
spanning the wide range of ages younger than 60.
A GRAPH FOR QUALITATIVE (NOMINAL) DATA
• The equal segments along the horizontal axis are allocated to the different words or
classes that appear in the frequency distribution for qualitative data.
• Likewise, equal segments along the vertical axis reflect increases in frequency.
• The body of the bar graph consists of a series of bars whose heights reflect the
frequencies for the various words or classes.
• A person’s answer to the question “Do you have a Facebook profile?” is either Yes or
No, not some impossible intermediate value, such as 40 percent Yes and 60 percent
No.
MISLEADING GRAPHS
• Graphs can be constructed in an unscrupulous manner to support a particular point
of view.
• For example, to imply that comparatively many students responded Yes to the
Facebook profile question, an unscrupulous person might resort to various tricks.
• The width of the Yes bar is more than three times that of the No bar, thus violating the
custom that bars be equal in width.
• The lower end of the frequency scale is omitted, thus violating the custom that the
entire scale be reproduced, beginning with zero.
• The height of the vertical axis is several times the width of the horizontal axis, thus
violating the custom, heretofore unmentioned, that the vertical axis be approximately
as tall as the horizontal axis is wide.
Problem:
IV. DESCRIBING DATA WITH AVERAGES:
MODE
• The mode reflects the value of the most frequently occurring score.
Progress Check *3.1 Determine the mode for the following retirement ages: 60, 63, 45,
63, 65, 70, 55, 63, 60, 65, 63.
mode = 63
Progress Check *3.2 The owner of a new car conducts six gas mileage tests and obtains
the following results, expressed in miles per gallon: 26.3, 28.7, 27.4, 26.6, 27.4, 26.9. Find
the mode for these data.
mode = 27.4
MEDIAN
• The median reflects the middle value when observations are ordered from least to
most.
• The median splits a set of ordered observations into two equal parts, the upper and
lower halves.
• In other words, the median has a percentile rank of 50, since observations with
equal or smaller values constitute 50 percent of the entire distribution.
• To find the median, scores always must be ordered from least to most
• When the total number of scores is odd, as in the lower left-hand panel of Table 3.2,
there is a single middle-ranked score, and the value of the median equals the value of
this score.
• When the total number of scores is even, as in the lower right-hand panel of Table 3.2,
the value of the median equals a value midway between the values of the two
middlemost scores.
• In either case, the value of the median always reflects the value of middle-ranked
scores, not the position of these scores among the set of ordered scores.
• The median term of office can be found in the same way for the 20 presidents in Table 3.1.
Problems:
Progress Check *3.3 Find the median for the following retirement ages: 60, 63, 45, 63,
65, 70, 55, 63, 60, 65, 63.
median = 63
Progress Check *3.4 Find the median for the following gas mileage tests: 26.3, 28.7, 27.4,
26.6, 27.4, 26.9.
median = 27.15
MEAN
• The mean is the most common average, one you have calculated many times.
• The mean is found by adding all scores and then dividing by the number of scores:
• Mean = sum of all scores / number of scores.
• To find the mean term for the 20 presidents, add all 20 terms in Table 3.1 (4 + . . .
+ 4 + 8) to obtain a sum of 112 years, and then divide this sum by 20, the number
of presidents, to obtain a mean of 5.60 years.
Sample or Population?
• Statisticians distinguish between two types of means—the population mean and the
sample mean—depending on whether the data are viewed as a population (a
complete set of scores) or as a sample (a subset of scores).
Problems:
Progress Check *3.5 Find the mean for the following retirement ages: 60, 63, 45, 63, 65,
70, 55, 63, 60, 65, 63.
Progress Check *3.6 Find the mean for the following gas mileage tests: 26.3, 28.7, 27.4,
26.6, 27.4, 26.9.
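The progress-check answers can be verified with Python's statistics module:

```python
import statistics

ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]   # Progress Checks 3.1/3.3/3.5
mpg = [26.3, 28.7, 27.4, 26.6, 27.4, 26.9]            # Progress Checks 3.2/3.4/3.6

print(statistics.mode(ages), statistics.median(ages), statistics.mean(ages))
# 63  63  61.09 (approximately)
print(statistics.mode(mpg), statistics.median(mpg), statistics.mean(mpg))
# 27.4  27.15  27.22 (approximately)
```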
WHICH AVERAGE?
If Distribution Is Not Skewed
• When a distribution of scores is not too skewed, the values of the mode, median, and
mean are similar, and any of them can be used to describe the central tendency of the
distribution.
If Distribution Is Skewed
• When extreme scores cause a distribution to be skewed, as for the infant death rates
for selected countries listed in Table 3.4, the values of the three averages can differ.
• The mean is the single most preferred average for quantitative data.
• An average can refer to the mode, median, or mean—or even the geometric mean or
the harmonic mean.
• Conventional usage prescribes that average usually signifies mean, and this
connotation is often reinforced by the context.
• For instance, grade point average is virtually synonymous with mean grade point.
• But when the data are qualitative, your choice among averages is restricted.
• The mode always can be used with qualitative data.
Inappropriate Averages
• It would not be appropriate to report a median for unordered qualitative data with
nominal measurement, such as the ancestries of Americans.
Problem:
V. DESCRIBING VARIABILITY:
• In Figure 4.1, each of the three frequency distributions consists of seven scores with
the same mean (10) but with different variabilities.
• Before reading on, rank the three distributions from least to most variable.
• Distribution A has the least variability, distribution B has intermediate variability,
and distribution C has the most variability.
• For distribution A with the least (zero) variability, all seven scores have the same value
(10).
• For distribution B with intermediate variability, the values of scores vary slightly (one
9 and one 11), and for distribution C with most variability, they vary even more (one
7, two 9s, two 11s, and one 13).
Importance of Variability
• Variability assumes a key role in an analysis of research results.
• Eg: A researcher might ask: Does fitness training improve, on average, the scores of
depressed patients on a mental-wellness test?
• To answer this question, depressed patients are randomly assigned to two groups,
fitness training is given to one group, and wellness scores are obtained for both
groups.
• Figure 4.2 shows the outcomes for two fictitious experiments, each with the same
mean difference of 2, but with the two groups in experiment B having less variability
than the two groups in experiment C.
• Notice that groups B and C in Figure 4.2 are the same as their counterparts in Figure
4.1.
• Although the new group B* retains exactly the same (intermediate) variability as
group B, each of its seven scores and its mean have been shifted 2 units to the right.
• Likewise, although the new group C* retains exactly the same (most) variability as group
C, each of its seven scores and its mean have been shifted 2 units to the right.
• Consequently, the crucial mean difference of 2 (from 12 − 10 = 2) is the same for both
experiments.
RANGE
• The range is the difference between the largest and smallest scores.
• In Figure 4.1, distribution A, the least variable, has the smallest range of 0 (from 10 to
10); distribution B, the moderately variable, has an intermediate range of 2 (from 11
to 9); and distribution C, the most variable, has the
• largest range of 6 (from 13 to 7), in agreement with our intuitive judgments about
differences in variability.
Disadvantages of Range
• The range has several shortcomings.
• First, since its value depends on only two scores—the largest and the smallest—it
fails to use the information provided by the remaining scores.
• Second, the value of the range tends to increase with increases in the total number of
scores.
VARIANCE
• Variance and Standard Deviation are the two important measurements in
statistics.
• Variance is a measure of how data points vary from the mean.
• The standard deviation is the measure of the distribution of statistical data.
• For each of the three distributions in Figure 4.1, the face values of the seven original
scores have been re-expressed as deviation scores from their mean of 10.
• For example, in distribution C, one score coincides with the mean of 10, four scores
(two 9s and two 11s) deviate 1 unit from the mean, and two scores (one 7 and one
13) deviate 3 units from the mean, yielding a set of seven deviation scores: one 0,
two –1s, two 1s, one –3, and one 3.
• The sum of all negative deviations always counterbalances the sum of all positive
deviations, regardless of the amount of variability in the group.
• A measure of variability, known as the mean absolute deviation (or m.a.d.), can be
salvaged by summing all absolute deviations from the mean, that is, by ignoring
negative signs.
• Before calculating the variance (a type of mean), negative signs must be eliminated
from deviation scores. Squaring each deviation generates a set of squared deviation
scores, all of which are positive.
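A NumPy sketch using the seven scores of distribution C (7, 9, 9, 10, 11, 11, 13) to reproduce the deviation scores, variance, and standard deviation quoted in the text:

```python
import numpy as np

# Distribution C from Figure 4.1: seven scores with a mean of 10.
c = np.array([7, 9, 9, 10, 11, 11, 13])

deviations = c - c.mean()           # one 0, two -1s, two 1s, one -3, one 3
print(deviations.sum())             # 0: negative and positive deviations cancel out
print(np.mean(np.abs(deviations)))  # mean absolute deviation
print(c.var())                      # variance: mean of the squared deviations (about 3.14)
print(c.std())                      # standard deviation: square root of the variance (about 1.77)
```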
STANDARD DEVIATION
• The standard deviation is the square root of the mean of all squared deviations from the
mean, that is,
• standard deviation = √(sum of all squared deviation scores / number of scores).
• The standard deviation is a rough measure of the average amount by which scores
deviate on either side of their mean.
Majority of Scores within One Standard Deviation
For most frequency distributions, a majority of all scores are within one standard
deviation on either side of the mean.
• This is illustrated in Figure 4.3, where the lowercase letter s represents the standard deviation.
• As suggested in the top panel of Figure 4.3, if the distribution of IQ scores for a class
of fourth graders has a mean (X̄) of 105 and a standard deviation (s) of 15, a majority
of their IQ scores should be within one standard deviation on either side of the mean,
that is, between 90 and 120.
• For most frequency distributions, a small minority of all scores deviate more than two
standard deviations on either side of the mean.
• For instance, among the seven deviations in distribution C, none deviates more than
two standard deviations (2 × 1.77 = 3.54) on either side of the mean.
Answers to Progress Check *4.3:
(a) False. Relatively few students will score exactly one standard deviation from the
mean.
(b) False. Students will score both within and beyond one standard deviation from the
mean.
(c) True
(d) True
(e) False. See (b).
(f) True
STANDARD DEVIATION
Sum of Squares (SS)
• Calculating the standard deviation requires that we obtain first a value for the
variance.
• However, calculating the variance requires, in turn, that we obtain the sum of the
squared deviation scores.
• The sum of squared deviation scores symbolized by SS, merits special attention
because it’s a major component in calculations for the variance, as well as many other
statistical measures.
Sum of Squares Formula for Population: SS = Σ(X − μ)²
Standard Deviation for Population, σ: σ = √(SS/N), where the population variance is σ² = SS/N.
If μ Is Unknown
• It would be most efficient if, as above, we could use a random sample of n deviations
expressed around the population mean, X − μ, to estimate variability in the population.
• But this is usually impossible because, in fact, the population mean is unknown.
• Therefore, we must substitute the known sample mean, X̄, for the unknown
population mean, μ, and we must use a random sample of n deviations expressed
around their own sample mean, X − X̄, to estimate variability in the population.
• Although there are n = 5 deviations in the sample, only n − 1 = 4 of these deviations are
free to vary because the sum of the n = 5 deviations from their own sample mean
always equals zero.
VI. NORMAL DISTRIBUTIONS AND STANDARD (z) SCORES:
In Figure 5.2, the idealized normal curve has been superimposed on the original
distribution for 3091 men.
Interpreting the Shaded Area
• The total area under the normal curve in Figure 5.2 can be identified with all FBI
applicants.
• Viewed relative to the total area, the shaded area represents the proportion of
applicants who will be eligible because they are shorter than exactly 66 inches.
• Every normal curve can be interpreted in exactly the same way once any distance from
the mean is expressed in standard deviation units.
• For example, .68, or 68 percent, of the total area under a normal curve—any normal
curve—is within one standard deviation above and below the mean, and only .05, or
5 percent, of the total area is more than two standard deviations above and below the
mean.
z SCORES
• A z score is a unit-free, standardized score that, regardless of the original units of
measurement, indicates how many standard deviations a score is above or below the
mean of its distribution.
• To obtain a z score, express any original score, whether measured in inches,
milliseconds, dollars, IQ points, etc., as a deviation from its mean:
• z = (X − μ) / σ
• where X is the original score and μ and σ are the mean and the standard deviation,
respectively, for the normal distribution of the original scores.
• Since identical units of measurement appear in both the numerator and denominator
of the ratio for z, the original units of measurement cancel each other and the z score
emerges as a unit-free or standardized number, often referred to as a standard score.
A z score consists of two parts:
1. a positive or negative sign indicating whether it’s above or below the mean; and
2. a number indicating the size of its deviation from the mean in standard deviation
units.
• A z score of 2.00 always signifies that the original score is exactly two standard
deviations above its mean.
• Similarly, a z score of –1.27 signifies that the original score is exactly 1.27 standard
deviations below its mean.
• A z score of 0 signifies that the original score coincides with the mean.
Converting to z Scores
• To answer the question about eligible FBI applicants, replace X with 66 (the maximum
permissible height), μ with 69 (the mean height), and σ with 3 (the standard deviation
of heights) and solve for z as follows:
• z = (66 − 69) / 3 = −3 / 3 = −1.00
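The same conversion as a one-line Python check:

```python
# z score for the FBI example: X = 66 (maximum permissible height),
# mu = 69 (mean height), sigma = 3 (standard deviation of heights).
X, mu, sigma = 66, 69, 3
z = (X - mu) / sigma
print(z)  # -1.0: the cutoff is exactly one standard deviation below the mean
```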
STANDARD NORMAL CURVE
• If the original distribution approximates a normal curve, then the shift to standard or z
scores will always produce a new distribution that approximates the standard normal
curve.
• The standard normal curve always has a mean of 0 and a standard deviation of 1.
To verify that the mean of a standard normal distribution equals 0, replace X in the
z score formula with μ, the mean of any normal distribution, and then solve for z:
z = (μ − μ) / σ = 0.
• Likewise, to verify that the standard deviation of the standard normal distribution
equals 1, replace X in the z score formula with μ + 1σ, the value corresponding to one
standard deviation above the mean for any (nonstandard) normal distribution, and
then solve for z: z = (μ + 1σ − μ) / σ = σ / σ = 1.
• Although there is an infinite number of different normal curves, each with its own mean
and standard deviation, there is only one standard normal curve, with a mean of 0 and
a standard deviation of 1.
• Converting all original observations into z scores leaves the normal shape intact but
not the units of measurement.
• Shaded observations of 66 inches, 1080 hours, and 90 IQ points all reappear as a z
score of –1.00.
I. Correlation:
AN INTUITIVE APPROACH
Positive Relationship
• Trends among pairs of scores can be detected most easily by constructing a list of
paired scores in which the scores along one variable are arranged from largest to
smallest.
• In panel A of Table 6.2, the five pairs of scores are arranged from the largest (13) to
the smallest (1) number of cards sent.
• This table reveals a pronounced tendency for pairs of scores to occupy similar relative
positions in their respective distributions.
• For example, John sent relatively few cards (1) and received relatively few cards (6),
whereas Doris sent relatively many cards (13) and received relatively many cards (14).
• We can conclude, therefore, that the two variables are related.
• Insofar as relatively low values are paired with relatively low values, and relatively
high values are paired with relatively high values, the relationship is positive.
Negative Relationship
• Although John sent relatively few cards (1), he received relatively many (18).
• From this pattern, we can conclude that the two variables are related.
• This relationship implies that “You get the opposite of what you give.”
• Insofar as relatively low values are paired with relatively high values, and relatively
high values are paired with relatively low values, the relationship is negative.
Little or No Relationship
• No regularity is apparent among the pairs of scores in panel C.
• For instance, although both Andrea and John sent relatively few cards (5 and 1,
respectively), Andrea received relatively few cards (6) and John received relatively
many cards (14).
• We can conclude that little, if any, relationship exists between the two variables.
• Two variables are positively related if pairs of scores tend to occupy similar relative
positions (high with high and low with low) in their respective distributions.
• They are negatively related if pairs of scores tend to occupy dissimilar relative
positions (high with low and vice versa) in their respective distributions.
II. SCATTERPLOTS
• A scatterplot is a graph containing a cluster of dots that represents all pairs of scores.
Construction
• To construct a scatterplot, as in Figure 6.1, scale each of the two variables along the
horizontal (X) and vertical (Y) axes, and use each pair of scores to locate a dot within
the scatterplot.
• For example, the pair of numbers for Mike, 7 and 12, define points along the X and Y
axes, respectively.
• Using these points to anchor lines perpendicular to each axis, locate Mike’s dot where
the two lines intersect.
Positive, Negative, or Little or No Relationship?
• A dot cluster that has a slope from the lower left to the upper right, as in panel A of
Figure 6.2, reflects a positive relationship.
• Small values of one variable are paired with small values of the other variable, and
large values are paired with large values.
• In panel A, short people tend to be light, and tall people tend to be heavy.
• On the other hand, a dot cluster that has a slope from the upper left to the lower
right, as in panel B of Figure 6.2, reflects a negative relationship.
• Small values of one variable tend to be paired with large values of the other
variable, and vice versa.
• A dot cluster that lacks any apparent slope, as in panel C of Figure 6.2, reflects little or
no relationship.
• Small values of one variable are just as likely to be paired with small, medium, or
large values of the other variable.
Curvilinear Relationship
• A dot cluster that approximates a straight line reflects a linear relationship.
• Sometimes a dot cluster approximates a bent or curved line, as in Figure 6.4, and
therefore reflects a curvilinear relationship.
• Eg: physical strength, as measured by the force of a person’s handgrip, is less for
children, more for adults, and then less again for older people.
III. A CORRELATION COEFFICIENT
FOR QUANTITATIVE DATA : r
• A correlation coefficient is a number between –1 and 1 that describes the
relationship between pairs of variables.
• The type of correlation coefficient designated as r describes the linear relationship
between pairs of variables for quantitative data.
Key Properties of r
• Named in honor of the British scientist Karl Pearson, the Pearson correlation
coefficient, r, can equal any value between –1.00 and +1.00.
• Furthermore, the following two properties apply:
• The sign of r indicates the type of linear relationship, whether positive or negative.
• The numerical value of r, without regard to sign, indicates the strength of the
linear relationship.
Sign of r
• A number with a plus sign (or no sign) indicates a positive relationship, and a
number with a minus sign indicates a negative relationship.
Numerical Value of r
• The more closely a value of r approaches either –1.00 or +1.00, the stronger the
relationship.
• The more closely the value of r approaches 0, the weaker the relationship.
• r = –.90 indicates a stronger relationship than does an r of –.70, and
• r = –.70 indicates a stronger relationship than does an r of .50.
Interpretation of r
• Located along a scale from –1.00 to +1.00, the value of r supplies information
about the direction of a linear relationship—whether positive or negative—and,
generally, information about the relative strength of a linear relationship—whether
relatively weak because r is in the vicinity of 0, or relatively strong because r deviates
from 0 in the direction of either +1.00 or –1.00.
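A SciPy sketch of computing r for hypothetical paired height and weight scores (the numbers are made up):

```python
import numpy as np
from scipy import stats

# Hypothetical paired scores (heights in inches, weights in lbs).
height = np.array([63, 66, 68, 69, 70, 72, 74, 75])
weight = np.array([125, 140, 155, 150, 165, 180, 190, 205])

r, p_value = stats.pearsonr(height, weight)
print(r)                                   # close to +1: a strong positive linear relationship
print(np.corrcoef(height, weight)[0, 1])   # the same value via NumPy
```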
Range Restrictions
• The value of the correlation coefficient declines whenever the range of possible X or Y
scores is restricted.
• For example, Figure 6.5 shows a dot cluster with an obvious slope, represented by an r
of .70, for the positive relationship between height and weight for all college students.
• If the range of heights along Y is restricted to students who stand over 6 feet 2 inches
(or 74 inches) tall, the abbreviated dot cluster loses its slope because of the weights
among tall students.
• Therefore, as depicted in Figure 6.5, the value of r drops to .10.
• Sometimes it’s impossible to avoid a range restriction.
• For example, some colleges only admit students with SAT test scores above some
minimum value.
Caution
• We have to be careful when interpreting the actual numerical value of r.
• An r of .70 for height and weight doesn’t signify that the strength of this relationship
equals either .70 or 70 percent of the strength of a perfect relationship.
• The value of r can’t be interpreted as a proportion or percentage of some perfect
relationship.
Verbal Descriptions
• When interpreting a new r, you’ll find it helpful to translate the numerical value of r
into a verbal description of the relationship.
• An r of .70 for the height and weight of college students could be translated into
“Taller students tend to weigh more”;
• An r of –.42 for time spent taking an exam and the subsequent exam score could be
translated into “Students who take less time tend to make higher scores”; and
• An r in the neighborhood of 0 for shoe size and IQ could be translated into “Little, if
any, relationship exists between shoe size and IQ.”
• A correlation analysis of the exchange of greeting cards by five friends for the
most recent holiday season suggests a strong positive relationship between
cards sent and cards received.
• When informed of these results, another friend, Emma, who enjoys receiving
greeting cards, asks you to predict how many cards she will receive during the
next holiday season, assuming that she plans to send 11 cards.
• All five dots contribute to the more precise prediction, illustrated in Figure 7.2,
that Emma will receive 15.20 cards.
• The solid line in Figure 7.2 is designated as the regression line; it guides the string
of arrows, beginning at 11, toward the predicted value of 15.20.
• If all five dots had defined a single straight line, placement of the regression line
would have been simple; merely let it pass through all dots.
Predictive Errors
• Figure 7.3 illustrates the predictive errors that would have occurred if the regression
line had been used to predict the number of cards received by the five friends.
• Solid dots reflect the actual number of cards received, and open dots, always located
along the regression line, reflect the predicted number of cards received.
• The largest predictive error, shown as a broken vertical line, occurs for Steve, who sent
9 cards.
• Although he actually received 18 cards, he should have received slightly fewer than 14
cards, according to the regression line.
• The smallest predictive error, none at all, occurs for Mike, who sent 7 cards.
• He actually received the 12 cards that he should have received, according to the
regression line.
The smaller the total for all predictive errors in Figure 7.3, the more favorable will be
the prognosis for our predictions.
The regression line should be placed in a position that minimizes the total predictive error,
that is, that minimizes the total of the vertical discrepancies between the solid and open
dots shown in Figure 7.3.
VII. LEAST SQUARES REGRESSION LINE
• To avoid the arithmetic standoff of zero always produced by adding positive and
negative predictive errors, the placement of the regression line minimizes the total
squared predictive error.
• When located like this, the regression line is often referred to as the least
squares regression line.
Key Property
• Once numbers have been assigned to b and a, as just described, the least squares
regression equation emerges as a working equation with a most desirable property:
• It automatically minimizes the total of all squared predictive errors for known Y
scores in the original correlation analysis.
Solving for Y′
• In its present form, the regression equation can be used to predict the number of
cards that Emma will receive, assuming that she plans to send 11 cards.
• Simply substitute 11 for X and solve for the value of Y′ as follows:
Y′ = .80(11) + 6.40 = 8.80 + 6.40 = 15.20.
• Even when no cards are sent (X = 0), we predict a return of 6.40 cards because of the
value of a.
• Also, notice that sending each additional card translates into an increment of
only .80 in the predicted return because of the value of b.
• Whenever b has a value less than 1.00, increments in the predicted return will
lag—by an amount equal to the value of b, that is, .80 in the present case—
behind increments in cards sent.
• If the value of b had been greater than 1.00, then increments in the predicted
return would have exceeded increments in cards sent.
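A tiny Python sketch of the least squares prediction, using the slope (b = .80) and intercept (a = 6.40) quoted in the text:

```python
# Least squares regression line for the greeting-card example:
# Y' = bX + a with b = .80 (slope) and a = 6.40 (intercept).
b, a = 0.80, 6.40

def predict_cards_received(cards_sent):
    return b * cards_sent + a

print(predict_cards_received(11))  # 15.20 cards predicted for Emma
print(predict_cards_received(0))   # 6.40 cards even when no cards are sent
```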
A Limitation
• Emma might survey these predicted card returns before committing herself to a
particular card investment. There is no evidence of a simple cause-effect
relationship between cards sent and cards received.
ASSUMPTIONS
Linearity
• Use of the regression equation requires that the underlying relationship be linear.
Homoscedasticity
• Use of the standard error of estimate, sy|x, assumes that except for chance, the dots
in the original scatterplot will be dispersed equally about all segments of the
regression line.
• This assumption is violated when the scatterplot reveals a dramatically different type
of dot cluster, such as that shown in Figure 7.4.
• The standard error of estimate for the data in Figure 7.4 should be used cautiously,since
its value overestimates the variability of dots about the lower half of the regression
line and underestimates the variability of dots about the upper half of the regression
line.
INTERPRETATION OF r²
• Pretend that we know the Y scores (cards received), but not the corresponding X
scores (cards sent), for each of the five friends.
• Lacking information about the relationship between X and Y scores, we could not
construct a regression equation and use it to generate a customized prediction, Y′, for
each friend.
• We mount a primitive predictive effort by always predicting the mean, Ȳ, for each of
the five friends’ Y scores.
• The repetitive prediction of Ȳ for each of the Y scores of all five friends will supply us
with a frame of reference against which to evaluate our customary predictive effort
based on the correlation between cards sent (X) and cards received (Y).
Predictive Errors
Panel A of Figure 7.5 shows the predictive errors for all five friends when the mean for
all five friends, Ȳ, of 12 (shown as the mean line) is always used to predict each of their
five Y scores.
Panel B shows the corresponding predictive errors for all five friends when a series of
different Y′ values, obtained from the least squares equation (shown as the least
squares line), is used to predict each of their five Y scores.
Panel A of Figure 7.5 shows the error for John when the mean for all five friends, Ȳ, of 12
is used to predict his Y score of 6.
Shown as a broken vertical line, the error of −6 for John (from Y − Ȳ = 6 − 12 = −6)
indicates that Ȳ overestimates John’s Y score by 6 cards. Panel B shows a smaller error
of −1.20 for John when a Y′ value of 7.20 is used to predict the same Y score of 6.
This Y’ value of 7.20 is obtained from the least squares equation, where the number of
cards sent by John, 1, has been substituted for X.
SSy measures the total variability of Y scores that occurs after only primitive
predictions based on Ȳ are made, while SSy|x measures the residual variability of Y
scores that remains after customized least squares predictions are made.
The error variability of 28.8 for the least squares predictions is much smaller than the
error variability of 80 for the repetitive prediction of Y, confirming the greater accuracy
of the least squares predictions
apparent in Figure 7.5.
To obtain an SS measure of the actual gain in accuracy due to the least squares
predictions, subtract the residual variability from the total variability, that is, subtract
SSy|x from SSy, and express the gain as a proportion of the total variability:
r² = (SSy − SSy|x) / SSy = (80 − 28.8) / 80 = 51.2 / 80 = .64
This result, .64 or 64 percent, represents the proportion or percent gain in predictive
accuracy when the repetitive prediction of Ȳ is replaced by a series of customized Y′
predictions based on the least squares equation.
r² Does Not Apply to Individual Scores:
• The total variability of all Y scores—as measured by SSY—can be reduced by 64
percent when each Y score is replaced by its corresponding predicted Y’ score and then
expressed as a squared deviation from the mean of all observed scores.
• Thus, the 64 percent represents a reduction in the total variability for the five Y scores
when they are replaced by a succession of predicted scores, given the least squares
equation and various values of X.
Small Values of r²
• When transposed from r to r², Cohen’s guidelines state that a value of r² in the
vicinity of .01, .09, or .25 reflects a weak, moderate, or strong relationship,
respectively.
Multiple Regression Equations
• A multiple regression equation takes the general form Y′ = a + b1X1 + b2X2 + b3X3,
where Y′ represents predicted college GPA and X1, X2, and X3 refer to high
school GPA, IQ score, and SAT score, respectively.
• By capitalizing on the combined predictive power of several predictor variables,
these multiple regression equations supply more accurate predictions for Y′ than
could be obtained from a simple regression equation.
Regression Toward the Mean
• Suppose the five students with the top scores on a first test owe their high scores
partly to ability and partly to good luck.
• On the second test, even though the scores of these five students continue to
reflect an above-average permanent component, some of their scores will suffer
because of less good luck or even bad luck.
• The net effect is that the scores of at least some of the original five top students will
drop below the top five scores—that is, regress back toward the mean—on the
second exam.
Basics of Numpy arrays –aggregations –computations on arrays –comparisons, masks, boolean logic –
fancy indexing – structured arrays – Data manipulation with Pandas – data indexing and selection –
operating on data – missing data – Hierarchical indexing – combining datasets – aggregation and
grouping – pivot tables
1. NumPy array manipulation covers how to access data and subarrays, and how to split, reshape, and join arrays.
2. Attributes:
Each array has attributes ndim (the number of dimensions), shape (the size of each dimension), and size (the
total size of the array):
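For illustration, a short snippet (not part of the original notes; the array values are random) that prints these attributes:
import numpy as np
x3 = np.random.randint(10, size=(3, 4, 5))   # a three-dimensional array
print(x3.ndim)    # 3
print(x3.shape)   # (3, 4, 5)
print(x3.size)    # 60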
3. Array Indexing: Accessing Single Elements
In a one-dimensional array, you can access the ith value (counting from zero) by specifying the desired index in
square brackets, just as with Python lists:
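A brief illustrative example (values chosen arbitrarily):
import numpy as np
x = np.array([5, 0, 3, 3, 7, 9])
print(x[0])    # 5, the first element
print(x[4])    # 7
print(x[-1])   # 9, negative indices count from the end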
4. Array Slicing: Accessing Subarrays
We can also use them to access subarrays with the slice notation, marked by the colon (:) character.
The NumPy slicing syntax follows that of the standard Python list; to access a slice of an array x, use this:
x[start:stop:step]
If any of these are unspecified, they default to the values start=0, stop=size of dimension, step=1.
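A short sketch of these defaults in action:
import numpy as np
x = np.arange(10)      # [0 1 2 3 4 5 6 7 8 9]
print(x[:5])           # first five elements: [0 1 2 3 4]
print(x[5:])           # elements from index 5: [5 6 7 8 9]
print(x[4:7])          # middle subarray: [4 5 6]
print(x[::2])          # every other element: [0 2 4 6 8]
print(x[::-1])         # all elements, reversed: [9 8 7 6 5 4 3 2 1 0]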
Multidimensional subarrays
Multidimensional slices work in the same way, with multiple slices separated by commas.
For example:
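An illustrative snippet (the 2D array values are arbitrary):
import numpy as np
x2 = np.array([[12, 5, 2, 4],
               [ 7, 6, 8, 8],
               [ 1, 6, 7, 7]])
print(x2[:2, :3])      # first two rows, first three columns
print(x2[:3, ::2])     # all rows, every other column
print(x2[::-1, ::-1])  # rows and columns reversed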
5. Accessing array rows and columns
NumPy array slicing differs from Python list slicing: list slices are copies, whereas
NumPy array slices return views into the original array data.
Consider our two-dimensional array from before:
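A minimal sketch showing the view behaviour (values chosen for illustration): modifying a slice also changes the original array.
import numpy as np
x2 = np.array([[12, 5, 2, 4],
               [ 7, 6, 8, 8],
               [ 1, 6, 7, 7]])
x2_sub = x2[:2, :2]   # a 2x2 subarray: this is a view, not a copy
x2_sub[0, 0] = 99     # modify the view
print(x2[0, 0])       # 99 -- the original array has changed too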
7. Creating copies of arrays
Despite the nice features of array views, it is sometimes useful to instead explicitly copy the data within an
array or a subarray.
This can be most easily done with the copy() method:
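A brief illustrative example:
import numpy as np
x2 = np.array([[12, 5, 2, 4],
               [ 7, 6, 8, 8]])
x2_sub_copy = x2[:2, :2].copy()   # an independent copy of the subarray
x2_sub_copy[0, 0] = 42            # modify only the copy
print(x2[0, 0])                   # still 12 -- the original is untouched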
8. Reshaping of Arrays
The most flexible way of doing this is with the reshape() method.
For example, if we want to put the numbers 1 through 9 in a 3X3 grid, we can do the following:
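A short snippet doing exactly that:
import numpy as np
grid = np.arange(1, 10).reshape((3, 3))
print(grid)
# [[1 2 3]
#  [4 5 6]
#  [7 8 9]]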
Another common reshaping pattern is the conversion of a one-dimensional array into a two-dimensional row or
column matrix.
We can do this with the reshape method, or more easily by making use of the newaxis keyword within a slice
operation:
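For instance (illustrative array):
import numpy as np
x = np.array([1, 2, 3])
print(x.reshape((1, 3)))   # row vector via reshape
print(x[np.newaxis, :])    # row vector via newaxis
print(x.reshape((3, 1)))   # column vector via reshape
print(x[:, np.newaxis])    # column vector via newaxis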
It’s also possible to combine multiple arrays into one, and to conversely split a single array into multiple arrays.
Concatenation of arrays
Concatenation, or joining of two arrays in NumPy, is primarily accomplished through the routines
np.concatenate, np.vstack, and np.hstack. np.concatenate takes a tuple or list of arrays as its first argument
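A short sketch of these routines (arrays chosen for illustration):
import numpy as np
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
print(np.concatenate([x, y]))           # [1 2 3 3 2 1]
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
print(np.vstack([x, grid]))             # stack vertically (rows)
print(np.hstack([grid, [[99], [99]]]))  # stack horizontally (columns)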
Splitting of arrays
The opposite of concatenation is splitting, which is implemented by the functions np.split, np.hsplit, and
np.vsplit. For each of these, we can pass a list of indices giving the split points:
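An illustrative example of split points:
import numpy as np
x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3 = np.split(x, [3, 5])   # split before index 3 and before index 5
print(x1, x2, x3)                  # [1 2 3] [99 99] [3 2 1]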
II. Aggregations: Min, Max, and Everything in Between
3. Multidimensional aggregates
Other aggregation functions
Example: What Is the Average Height of US Presidents?
Aggregates available in NumPy can be extremely useful for summarizing a set of values.
As a simple example, let’s consider the heights of all US presidents.
This data is available in the file president_heights.csv, which is a simple comma-separated list of
labels and values:
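A minimal sketch of this example, assuming the file president_heights.csv (with a height(cm) column) is available in the working directory:
import numpy as np
import pandas as pd
data = pd.read_csv('president_heights.csv')   # file assumed to be present
heights = np.array(data['height(cm)'])        # column name assumed from the dataset
print("Mean height:       ", heights.mean())
print("Standard deviation:", heights.std())
print("Minimum height:    ", heights.min())
print("Maximum height:    ", heights.max())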
III. Computation on Arrays: Broadcasting
Another means of vectorizing operations is to use NumPy’s broadcasting functionality. Broadcasting is simply
a set of rules for applying binary ufuncs (addition, subtraction, multiplication, etc.) on
arrays of different sizes.
Introducing Broadcasting
Recall that for arrays of the same size, binary operations are performed on an element-by-element basis:
Broadcasting allows these types of binary operations to be performed on arrays of different sizes—for example,
we can just as easily add a scalar (think of it as a zero dimensional array) to an array:
Rules of Broadcasting
Broadcasting in NumPy follows a strict set of rules to determine the interaction between the two arrays:
• Rule 1: If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is
padded with ones on its leading (left) side.
• Rule 2: If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that
dimension is stretched to match the other shape.
• Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
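A brief sketch of these rules in action (arrays chosen for illustration):
import numpy as np
a = np.arange(3)                  # shape (3,)
M = np.ones((3, 3))               # shape (3, 3)
print(M + a)                      # a is broadcast across every row of M
b = np.arange(3)[:, np.newaxis]   # shape (3, 1)
print(a + b)                      # both arrays are broadcast to shape (3, 3)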
IV. Comparisons, Masks, and Boolean Logic
This section covers the use of Boolean masks to examine and manipulate values within NumPy arrays. Masking
comes up when you want to extract, modify, count, or otherwise manipulate values in an array based on some
criterion: for example, you might wish to count all values greater than a certain value, or perhaps remove all
outliers that are above some threshold.
In NumPy, Boolean masking is often the most efficient way to accomplish these types of tasks.
One approach to this would be to answer these questions by hand: loop through the data, incrementing a
counter each time we see values in some desired range.
For reasons discussed throughout this chapter, such an approach is very inefficient, both from the standpoint of
time writing code and time computing the result.
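In contrast, a vectorized sketch of the same counting with Boolean masks (random data used purely for illustration):
import numpy as np
rng = np.random.RandomState(0)
x = rng.randint(10, size=(3, 4))
print(x < 6)                    # Boolean array
print(np.count_nonzero(x < 6))  # how many values are less than 6?
print(np.sum(x < 6, axis=1))    # count of such values in each row
print(x[x < 6])                 # masking: pull out the values that satisfy the condition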
This section demonstrates the use of NumPy’s structured arrays and record arrays, which provide efficient
storage for compound, heterogeneous data.
Imagine that we have several categories of data on a number of people (say, name, age, and weight), and we’d
like to store these values for use in a Python program. It would be possible to store these in three separate
arrays:
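A compound dtype lets us store all three categories in a single structured array instead; a minimal sketch (names and values are illustrative):
import numpy as np
name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]
data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data['name'])                     # all names
print(data[data['age'] < 30]['name'])   # names of people under 30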
Series as dictionary
Like a dictionary, the Series object provides a mapping from a collection of keys to a collection of values:
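For instance (the population figures are purely illustrative values):
import pandas as pd
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127})
print(population['California'])   # dictionary-style item access
print('Texas' in population)      # key membership, just as with a dict
print(population.keys())          # the index acts like the dict keys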
DataFrame as a dictionary
Operating on Data in Pandas:
Pandas : for unary operations like negation and trigonometric functions, the ufuncs will preserve index and
column labels in the output, and for binary operations such as addition and multiplication, Pandas will
automatically align indices when passing the objects to the ufunc.
Ufuncs: Operations Between DataFrame and Series
When you are performing operations between a DataFrame and a Series, the index and column alignment is
similarly maintained.
Operations between a DataFrame and a Series are similar to operations between a two-dimensional and one-
dimensional NumPy array.
Subtraction between a two-dimensional array and one of its rows, for example, is applied row-wise.
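A small sketch of this row-wise alignment (values are random, purely for illustration):
import numpy as np
import pandas as pd
rng = np.random.RandomState(42)
df = pd.DataFrame(rng.randint(10, size=(3, 4)), columns=list('QRST'))
print(df - df.iloc[0])               # subtract the first row from every row (row-wise)
print(df.subtract(df['R'], axis=0))  # operate column-wise instead by specifying axis=0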
Handling Missing Data
A fact of life is that real-world data is rarely clean and homogeneous. In particular, many interesting
datasets will have some amount of data missing.
In the masking approach, the mask might be an entirely separate Boolean array, or it may involve appropriation
of one bit in the data representation to locally indicate the null status of a value.
In the sentinel approach, the sentinel value could be some data-specific convention. Eg: IEEE floating-point
specification.
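Pandas itself relies on sentinels (NaN and None) to mark missing data; a brief illustration:
import numpy as np
import pandas as pd
data = pd.Series([1, np.nan, 'hello', None])
print(data.isnull())   # Boolean mask of missing entries
print(data.dropna())   # drop missing values
print(data.fillna(0))  # fill missing values with a constant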
Hierarchical indexing (also known as multi-indexing) - to incorporate multiple index levels within a
single index. In this way, higher-dimensional data can be compactly represented within the familiar one-
dimensional Series and two-dimensional DataFrame objects.
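A small sketch with a two-level (state, year) index (the numbers are illustrative):
import pandas as pd
index = pd.MultiIndex.from_tuples([('California', 2000), ('California', 2010),
                                   ('Texas', 2000), ('Texas', 2010)])
pop = pd.Series([33871648, 37253956, 20851820, 25145561], index=index)
print(pop['California'])   # partial indexing on the first level
print(pop[:, 2010])        # all states, year 2010 only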
An essential piece of analysis of large data is efficient summarization: computing aggregations like sum(),
mean(), median(), min(), and max()
A canonical example of this split-apply-combine operation, where the “apply” is a summation aggregation, is
illustrated in Figure 3-1.
Figure 3-1 makes clear what the GroupBy accomplishes:
• The split step involves breaking up and grouping a DataFrame depending on the
value of the specified key.
• The apply step involves computing some function, usually an aggregate, transformation,
or filtering, within the individual groups.
• The combine step merges the results of these operations into an output array.
Here it’s important to realize that the intermediate splits do not need to be explicitly instantiated.
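A minimal sketch of the whole split-apply-combine pipeline on a toy DataFrame (keys and values invented for illustration):
import pandas as pd
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [1, 2, 3, 4, 5, 6]})
print(df.groupby('key').sum())   # split by 'key', apply sum within each group, combine the results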
The GroupBy object
The GroupBy object is a very flexible abstraction.
Column indexing. The GroupBy object supports column indexing in the same way as
the DataFrame, and returns a modified GroupBy object. For example:
In[14]: planets.groupby('method')
Out[14]: <pandas.core.groupby.DataFrameGroupBy object at 0x1172727b8>
In[15]: planets.groupby('method')['orbital_period']
Out[15]: <pandas.core.groupby.SeriesGroupBy object at 0x117272da0>
Iteration over groups. The GroupBy object supports direct iteration over the groups,
returning each group as a Series or DataFrame:
In[17]: for (method, group) in planets.groupby('method'):
print("{0:30s} shape={1}".format(method, group.shape))
Dispatch methods. Through some Python class magic, any method not explicitly
implemented by the GroupBy object will be passed through and called on the groups,
whether they are DataFrame or Series objects. For example, you can use the
describe() method of DataFrames to perform a set of aggregations that describe each
group in the data:
In[18]: planets.groupby('method')['year'].describe().unstack()
X. Pivot Tables
We have seen how the GroupBy abstraction lets us explore relationships within a dataset.
A pivot table is a similar operation that is commonly seen in spreadsheets and other programs that operate on
tabular data.
The pivot table takes simple columnwise data as input, and groups the entries into a two-dimensional table that
provides a multidimensional summarization of the data.
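As a brief illustration, assuming Seaborn’s bundled Titanic dataset is available via sns.load_dataset:
import seaborn as sns
titanic = sns.load_dataset('titanic')
# survival rate (mean of 'survived') grouped by sex (rows) and passenger class (columns)
print(titanic.pivot_table('survived', index='sex', columns='class'))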
UNIT V
DATA VISUALIZATION
Importing Matplotlib – Line plots – Scatter plots – visualizing errors – density and contour plots –
Histograms – legends – colors – subplots – text and annotation – customization – three dimensional
plotting - Geographic Data with Basemap - Visualization with Seaborn.
Short assignment
linestyle='-'    # solid
linestyle='--'   # dashed
linestyle='-.'   # dashdot
linestyle=':'    # dotted
• linestyle and color codes can be combined into a single nonkeyword argument to the plt.plot()
function
plt.plot(x, x + 0, '-g')    # solid green
plt.plot(x, x + 1, '--c')   # dashed cyan
plt.plot(x, x + 2, '-.k')   # dashdot black
plt.plot(x, x + 3, ':r');   # dotted red
Axes Limits
• The most basic way to adjust axis limits is to use the plt.xlim() and plt.ylim() methods
Example
plt.xlim(10, 0)
plt.ylim(1.2, -1.2);
• The plt.axis() method allows you to set the x and y limits with a single call, by passing a list that specifies
[xmin, xmax, ymin, ymax]
plt.axis([-1, 11, -1.5, 1.5]);
• Aspect ratio equal is used to represent one unit in x is equal to one unit in y. plt.axis('equal')
Labeling Plots
The labeling of plots includes titles, axis labels, and simple legends.
Title  - plt.title()
Label  - plt.xlabel(), plt.ylabel()
Legend - plt.legend()
Example programs
Line color
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure()
ax = plt.axes()
x = np.linspace(0, 10, 1000)
ax.plot(x, np.sin(x));
plt.plot(x, np.sin(x - 0), color='blue')           # specify color by name
plt.plot(x, np.sin(x - 1), color='g')              # short color code (rgbcmyk)
plt.plot(x, np.sin(x - 2), color='0.75')           # grayscale between 0 and 1
plt.plot(x, np.sin(x - 3), color='#FFDD44')        # hex code (RRGGBB from 00 to FF)
plt.plot(x, np.sin(x - 4), color=(1.0, 0.2, 0.3))  # RGB tuple, values 0 to 1
plt.plot(x, np.sin(x - 5), color='chartreuse');    # all HTML color names supported
Line style
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure()
ax = plt.axes()
x = np.linspace(0, 10, 1000)
plt.plot(x, x + 0, linestyle='solid')
plt.plot(x, x + 1, linestyle='dashed')
plt.plot(x, x + 2, linestyle='dashdot')
plt.plot(x, x + 3, linestyle='dotted');
# For short, you can use the following codes:
plt.plot(x, x + 4, linestyle='-')   # solid
plt.plot(x, x + 5, linestyle='--')  # dashed
plt.plot(x, x + 6, linestyle='-.')  # dashdot
plt.plot(x, x + 7, linestyle=':');  # dotted
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure()
ax = plt.axes()
x = np.linspace(0, 10, 1000)
plt.xlim(-1, 11)
plt.ylim(-1.5, 1.5);
plt.plot(x, np.sin(x), '-g', label='sin(x)')
plt.plot(x, np.cos(x), ':b', label='cos(x)')
plt.title("A Sine Curve")
plt.xlabel("x")
plt.ylabel("sin(x)");
plt.legend();
Example
plt.plot(x, y, 'o', color='black');
• The third argument in the function call is a character that represents the type of symbol used for the plotting.
Just as you can specify options such as '-' and '--' to control the line style, the marker style has its own set of
short string codes.
Example
• Various marker symbols can be specified, such as ['o', '.', ',', 'x', '+', 'v', '^', '<', '>', 's', 'd']
plt.plot(x, y, '-ok');
Example
plt.plot(x, y, '-p', color='gray',
markersize=15, linewidth=4,
markerfacecolor='white',
markeredgecolor='gray',
markeredgewidth=2)
plt.ylim(-1.2, 1.2);
Diverging
['PiYG', 'PRGn', 'BrBG', 'PuOr', 'RdGy', 'RdBu', 'RdYlBu', 'RdYlGn', 'Spectral',
'coolwarm', 'bwr', 'seismic']
Qualitative
['Pastel1', 'Pastel2', 'Paired', 'Accent', 'Dark2', 'Set1', 'Set2', 'Set3',
'tab10', 'tab20', 'tab20b', 'tab20c']
Miscellaneous
['flag', 'prism', 'ocean', 'gist_earth', 'terrain', 'gist_stern', 'gnuplot',
'gnuplot2', 'CMRmap', 'cubehelix', 'brg', 'hsv', 'gist_rainbow', 'rainbow',
'jet', 'nipy_spectral', 'gist_ncar']
Example programs.
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 10, 20)
y = np.sin(x)
plt.plot(x, y, '-o',
         color='gray',
         markersize=15,
         linewidth=4,
         markerfacecolor='yellow',
         markeredgecolor='red',
         markeredgewidth=4)
plt.ylim(-1.5, 1.5);
Visualizing Errors
For any scientific measurement, accurate accounting for errors is nearly as important as, if not more important
than, accurate reporting of the number itself. For example, imagine that I am using some astrophysical
observations to estimate the Hubble Constant, the local measurement of the expansion rate of the Universe.
In visualization of data and results, showing these errors effectively can make a plot convey much more
complete information.
Types of errors
• Basic Errorbars
• Continuous Errors
Basic Errorbars
A basic errorbar can be created with a single Matplotlib function call.
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import numpy as np
x = np.linspace(0, 10, 50)
dy = 0.8
y = np.sin(x) + dy * np.random.randn(50)
plt.errorbar(x, y, yerr=dy, fmt='.k');
• Here the fmt is a format code controlling the appearance of lines and points, and has the same syntax as
the shorthand used in plt.plot().
• In addition to these basic options, the errorbar function has many options to fine-tune the outputs.
Using these additional options you can easily customize the aesthetics of your errorbar plot.
Continuous Errors
• In some situations it is desirable to show errorbars on continuous quantities. Though Matplotlib does not
have a built-in convenience routine for this type of application, it’s relatively easy to combine primitives like
plt.plot and plt.fill_between for a useful result.
• Here we’ll perform a simple Gaussian process regression (GPR), using the Scikit-Learn API. This is a method
of fitting a very flexible nonparametric function to data with a continuous measure of the uncertainty.
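A minimal sketch of a continuous error band drawn with plt.fill_between (the GPR fit itself is omitted here; the curve and the band half-width dy are assumed values for illustration):
import numpy as np
import matplotlib.pyplot as plt
xfit = np.linspace(0, 10, 1000)
yfit = np.sin(xfit)        # stand-in for a fitted curve
dy = 0.2                   # assumed uncertainty band half-width
plt.plot(xfit, yfit, '-', color='gray')
plt.fill_between(xfit, yfit - dy, yfit + dy,
                 color='lightgray', alpha=0.5)   # shaded error region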
Density and Contour Plots
• Notice that by default when a single color is used, negative values are represented by dashed lines,
and positive values by solid lines.
• Alternatively, you can color-code the lines by specifying a colormap with the cmap argument.
• We’ll also specify that we want more lines to be drawn: 20 equally spaced intervals within the data range.
plt.contour(X, Y, Z, 20, cmap='RdGy');
• One potential issue with this plot is that it is a bit “splotchy.” That is, the color steps are discrete rather
than continuous, which is not always what is desired.
• You could remedy this by setting the number of contours to a very high number, but this results in a
rather inefficient plot: Matplotlib must render a new polygon for each step in the level.
• A better way to handle this is to use the plt.imshow() function, which interprets a two-dimensional grid
of data as an image.
Example Program
import numpy as np
import matplotlib.pyplot as plt
def f(x, y):
    return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
plt.imshow(Z, extent=[0, 5, 0, 5],   # extent matches the data range
           origin='lower', cmap='RdGy')
plt.colorbar()
Histograms
• A histogram is a simple plot used to represent a large data set. It is a graph showing frequency
distributions, that is, the number of observations within each given interval.
Parameters
• plt.hist() is used to plot a histogram. The hist() function will use an array of numbers to create a
histogram; the array is sent into the function as an argument.
• bins - A histogram displays numerical data by grouping data into "bins" of equal width. Each bin is plotted
as a bar whose height corresponds to how many data points are in that bin. Bins are also sometimes called
"intervals", "classes", or "buckets".
• normed - If True, the histogram counts are normalized to form a probability density, so that the area
under the histogram sums to 1 (in newer Matplotlib versions this parameter is named density).
• x - (n,) array or sequence of (n,) arrays Input values, this takes either a single array or a sequence of arrays
which are not required to be of the same length.
• histtype - {'bar', 'barstacked', 'step', 'stepfilled'}, optional. The type of histogram to draw.
• 'bar' is a traditional bar-type histogram. If multiple data are given, the bars are arranged side by side.
• 'barstacked' is a bar-type histogram where multiple data are stacked on top of each other.
• 'step' generates a lineplot that is by default unfilled.
• 'stepfilled' generates a lineplot that is by default filled. Default is 'bar'.
• align - {'left', 'mid', 'right'}, optional. Controls how the bars are aligned relative to the bin edges.
Default is 'mid'.
• label - str or None, optional. Default is None
Other parameter
• **kwargs - Patch properties, it allows us to pass a
variable number of keyword arguments to a
python function. ** denotes this type of function.
Example
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-white')
data = np.random.randn(1000)
plt.hist(data);
The hist() function has many options to tune both the calculation and the display; here’s an example of a
more customized histogram.
plt.hist(data, bins=30, alpha=0.5,histtype='stepfilled', color='steelblue',edgecolor='none');
The plt.hist docstring has more information on other customization options available. I find this combination
of histtype='stepfilled' along with some transparency alpha to be very useful when comparing histograms of
several distributions
x1 = np.random.normal(0, 0.8, 1000)
x2 = np.random.normal(-2, 1, 1000)
x3 = np.random.normal(3, 2, 1000)
kwargs = dict(histtype='stepfilled', alpha=0.3, bins=40)
plt.hist(x1, **kwargs)
plt.hist(x2, **kwargs)
plt.hist(x3, **kwargs);
Example
mean = [0, 0]
cov = [[1, 1], [1, 2]]
x, y = np.random.multivariate_normal(mean, cov, 1000).T
plt.hist2d(x, y, bins=30, cmap='Blues')
cb = plt.colorbar()
cb.set_label('counts in bin')
Legends
Plot legends give meaning to a visualization, assigning labels to the various plot elements. We previously saw
how to create a simple legend; here we’ll take a look at customizing the placement and aesthetics of the legend
in Matplotlib.
plt.plot(x, np.sin(x), '-b', label='Sine')
plt.plot(x, np.cos(x), '--r', label='Cosine')
plt.legend();
Number of columns - We can use the ncol command to specify the number of columns in the legend.
ax.legend(frameon=False, loc='lower center', ncol=2)
fig
We can use a rounded box (fancybox) or add a shadow, change the transparency (alpha value) of the frame, or
change the padding around the text.
ax.legend(fancybox=True, framealpha=1, shadow=True, borderpad=1)
fig
Multiple legends
It is only possible to create a single legend for the entire plot. If you try to create a second legend using
plt.legend() or ax.legend(), it will simply override the first one. We can work around this by creating a
new legend artist from scratch, and then using the lower-level ax.add_artist() method to manually add the
second artist to the plot.
Example
import matplotlib.pyplot as plt
plt.style.use('classic')
import numpy as np
# create a figure and axes so that ax.legend() has something to act on
fig, ax = plt.subplots()
x = np.linspace(0, 10, 1000)
ax.plot(x, np.sin(x), '-b', label='Sine')
ax.plot(x, np.cos(x), '--r', label='Cosine')
ax.legend(loc='lower center', frameon=True, shadow=True, borderpad=1, fancybox=True)
fig
Color Bars
In Matplotlib, a color bar is a separate axes that can provide a key for the meaning of colors in a plot.
For continuous labels based on the color of points, lines, or regions, a labeled color bar can be a great tool.
The simplest colorbar can be created with the plt.colorbar() function.
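For example (I here is an illustrative 2D array, defined so that the later snippets which reuse I also have a concrete value):
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 10, 1000)
I = np.sin(x) * np.cos(x[:, np.newaxis])   # a 2D array to display
plt.imshow(I)
plt.colorbar();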
Customizing Colorbars
Choosing color map.
We can specify the colormap using the cmap argument to the plotting function that is creating the
visualization. Broadly, there are three different categories of colormaps:
• Sequential colormaps - These consist of one continuous sequence of colors (e.g., binary or viridis).
• Divergent colormaps - These usually contain two distinct colors, which show positive and negative
deviations from a mean (e.g., RdBu or PuOr).
• Qualitative colormaps - These mix colors with no particular sequence (e.g., rainbow or jet).
12
Color limits and extensions
• Matplotlib allows for a large range of colorbar customization. The colorbar itself is simply an instance of
plt.Axes, so all of the axes and tick formatting tricks we’ve learned are applicable.
• We can narrow the color limits and indicate the out-of-bounds values with a triangular arrow at the top
and bottom by setting the extend property.
plt.subplot(1, 2, 2)
plt.imshow(I, cmap='RdBu')
plt.colorbar(extend='both')
plt.clim(-1, 1);
Discrete colorbars
Colormaps are by default continuous, but sometimes you’d like to
represent discrete values. The easiest way to do this is to use the
plt.cm.get_cmap() function, and pass the name of a suitable colormap
along with the number of desired bins.
plt.imshow(I, cmap=plt.cm.get_cmap('Blues', 6))
plt.colorbar()
plt.clim(-1, 1);
Subplots
• Matplotlib has the concept of subplots: groups of smaller axes that can exist together within a single figure.
• These subplots might be insets, grids of plots, or other more complicated layouts.
• We’ll explore four routines for creating subplots in Matplotlib.
• plt.axes: Subplots by Hand
• plt.subplot: Simple Grids of Subplots
• plt.subplots: The Whole Grid in One Go
• plt.GridSpec: More Complicated Arrangements
For example, we might create an inset axes at the top-right corner of another axes by setting the x and y
position to 0.65 (that is, starting at 65% of the width and 65% of the height of the figure) and the x and y
extents to 0.2 (that is, the size of the axes is 20% of the width and 20% of the height of the figure).
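A short sketch of that inset:
import matplotlib.pyplot as plt
ax1 = plt.axes()                         # standard axes
ax2 = plt.axes([0.65, 0.65, 0.2, 0.2])   # inset axes: [left, bottom, width, height]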
For example, a gridspec for a grid of two rows and three columns with some specified width and height
space looks like this:
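An illustrative sketch (spacing values chosen arbitrarily):
import matplotlib.pyplot as plt
grid = plt.GridSpec(2, 3, wspace=0.4, hspace=0.3)
plt.subplot(grid[0, 0])    # top-left cell
plt.subplot(grid[0, 1:])   # top row, spanning the last two columns
plt.subplot(grid[1, :2])   # bottom row, first two columns
plt.subplot(grid[1, 2]);   # bottom-right cell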
• Text annotation can be done manually with the plt.text/ax.text command, which will place text at a
particular x/y value.
• The ax.text method takes an x position, a y position, a string, and then optional keywords specifying the
color, size, style, alignment, and other properties of the text. Here we used ha='right' and ha='center', where
ha is short for horizontal alignment.
Example
import matplotlib.pyplot as plt
import matplotlib as mpl
plt.style.use('seaborn-whitegrid')
import numpy as np
import pandas as pd
fig, ax = plt.subplots(facecolor='lightgray')
ax.axis([0, 10, 0, 10])
# transform=ax.transData is the default, but we'll specify it anyway
ax.text(1, 5, ". Data: (1, 5)", transform=ax.transData)
ax.text(0.5, 0.1, ". Axes: (0.5, 0.1)", transform=ax.transAxes)
ax.text(0.2, 0.2, ". Figure: (0.2, 0.2)", transform=fig.transFigure);
Note that by default, the text is aligned above and to the left of the specified coordinates; here the “.” at the
beginning of each string will approximately mark the given coordinate location.
The transData coordinates give the usual data coordinates associated with the x- and y-axis labels. The
transAxes coordinates give the location from the bottom-left corner of the axes (here the white box) as a
fraction of the axes size.
The transFigure coordinates are similar, but specify the position from the bottom left of the figure (here the
gray box) as a fraction of the figure size.
Notice now that if we change the axes limits, it is only the transData coordinates that will be affected, while the
others remain stationary.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
ax = plt.axes(projection='3d')
# Data for a three-dimensional line
zline = np.linspace(0, 15, 1000)
xline = np.sin(zline)
yline = np.cos(zline)
ax.plot3D(xline, yline, zline, 'gray')
# Data for three-dimensional scattered points
zdata = 15 * np.random.random(100)
xdata = np.sin(zdata) + 0.1 * np.random.randn(100)
ydata = np.cos(zdata) + 0.1 * np.random.randn(100)
ax.scatter3D(xdata, ydata, zdata, c=zdata, cmap='Greens');
plt.show()
Notice that by default, the scatter points have their transparency adjusted to give a sense of depth on the page.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
def f(x, y):   # surface to plot (assumed here; the original omitted the grid definition)
    return np.sin(np.sqrt(x ** 2 + y ** 2))
x = np.linspace(-6, 6, 30)
y = np.linspace(-6, 6, 30)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.plot_wireframe(X, Y, Z, color='black')
ax.set_title('wireframe');
plt.show()
• Adding a colormap to the filled polygons can aid perception of the topology of the surface being visualized
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
def f(x, y):   # surface to plot (assumed here; the original omitted the grid definition)
    return np.sin(np.sqrt(x ** 2 + y ** 2))
x = np.linspace(-6, 6, 30)
y = np.linspace(-6, 6, 30)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
ax = plt.axes(projection='3d')
ax.plot_surface(X, Y, Z, rstride=1, cstride=1,
                cmap='viridis', edgecolor='none')
ax.set_title('surface')
plt.show()
Surface Triangulations
• For some applications, the evenly sampled grids required by
the preceding routines are overly restrictive and
inconvenient.
• In these situations, the triangulation-based plots can be very useful.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
def f(x, y):   # surface to sample (assumed here; the original omitted its definition)
    return np.sin(np.sqrt(x ** 2 + y ** 2))
theta = 2 * np.pi * np.random.random(1000)
r = 6 * np.random.random(1000)
x = np.ravel(r * np.sin(theta))
y = np.ravel(r * np.cos(theta))
z = f(x, y)
ax = plt.axes(projection='3d')
ax.scatter(x, y, z, c=z, cmap='viridis', linewidth=0.5)
• We’ll use an etopo image (which shows topographical features both on land and under the ocean) as
the map background.
Program to display a particular area of the map with latitude and longitude lines
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
from itertools import chain
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None,
width=8E6, height=8E6,
lat_0=45, lon_0=-100,)
m.etopo(scale=0.5, alpha=0.5)
def draw_map(m, scale=0.2):
    # draw a shaded-relief image
    m.shadedrelief(scale=scale)
    # lats and longs are returned as a dictionary
    lats = m.drawparallels(np.linspace(-90, 90, 13))
    lons = m.drawmeridians(np.linspace(-180, 180, 13))
    # keys contain the plt.Line2D instances
    lat_lines = chain(*(tup[1][0] for tup in lats.items()))
    lon_lines = chain(*(tup[1][0] for tup in lons.items()))
    all_lines = chain(lat_lines, lon_lines)
    # cycle through these lines and set the desired style
    for line in all_lines:
        line.set(linestyle='-', alpha=0.3, color='r')
Map Projections
The Basemap package implements several dozen such projections, all referenced by a short format code. Here
we’ll briefly demonstrate some of the more common ones.
• Cylindrical projections
• Pseudo-cylindrical projections
• Perspective projections
• Conic projections
Cylindrical projection
• The simplest of map projections are cylindrical projections, in which lines of constant latitude and
longitude are mapped to horizontal and vertical lines, respectively.
• This type of mapping represents equatorial regions quite well, but results in extreme distortions near
the poles.
• The spacing of latitude lines varies between different cylindrical projections, leading to different
conservation properties, and different distortion near the poles.
• Other cylindrical projections are the Mercator (projection='merc') and the cylindrical equal-area
(projection='cea') projections.
• The additional arguments to Basemap for this view specify the latitude (lat) and longitude (lon) of
the lower-left corner (llcrnr) and upper-right corner (urcrnr) for the desired map, in units of degrees.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize=(8, 6), edgecolor='w')
m = Basemap(projection='cyl', resolution=None,
llcrnrlat=-90, urcrnrlat=90,
llcrnrlon=-180, urcrnrlon=180, )
draw_map(m)
Pseudo-cylindrical projections
• Pseudo-cylindrical projections relax the requirement that meridians (lines of constant longitude)
remain vertical; this can give better properties near the poles of the projection.
• The Mollweide projection (projection='moll') is one common example of this, in which all meridians
are elliptical arcs.
• It is constructed so as to preserve area across the map: though there are distortions near the poles,
the area of small patches reflects the true area.
• Other pseudo-cylindrical projections are the sinusoidal (projection='sinu') and Robinson
(projection='robin') projections.
• The extra arguments to Basemap here refer to the central latitude (lat_0) and longitude (lon_0)
for the desired map.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize=(8, 6), edgecolor='w')
m = Basemap(projection='moll', resolution=None,
lat_0=0, lon_0=0)
draw_map(m)
Perspective projections
• Perspective projections are constructed using a particular choice of perspective point, similar to if you
photographed the Earth from a particular point in space (a point which, for some projections, technically
lies within the Earth!).
• One common example is the orthographic projection (projection='ortho'), which shows one side of the globe
as seen from a viewer at a very long distance.
• Thus, it can show only half the globe at a time.
• Other perspective-based projections include the
gnomonic projection (projection='gnom') and
stereographic projection (projection='stere').
• These are often the most useful for showing small
portions of the map.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None,
lat_0=50, lon_0=0)
draw_map(m);
Conic projections
• A conic projection projects the map onto a single cone, which is then unrolled.
• This can lead to very good local properties, but regions far from the focus point of the cone may
become very distorted.
• One example of this is the Lambert conformal conic projection (projection='lcc').
• It projects the map onto a cone arranged in such a way that two standard parallels (specified in Basemap by
lat_1 and lat_2) have well-represented distances, with scale decreasing between them and increasing
outsideof them.
• Other useful conic projections are the equidistant conic (projection='eqdc') and the Albers equal-area
(projection='aea') projection
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None,
lon_0=0, lat_0=50, lat_1=45, lat_2=55, width=1.6E7, height=1.2E7)
draw_map(m)
Drawing a Map Background
The Basemap package contains a range of useful functions for drawing borders of physical features like
continents, oceans, lakes, and rivers, as well as political boundaries such as countries and US states and counties.
The following are some of the available drawing functions that you may wish to explore using IPython’s
help features:
• Political boundaries
drawcountries() - Draw country boundaries
drawstates()    - Draw US state boundaries
drawcounties()  - Draw US county boundaries
• Map features
drawgreatcircle() - Draw a great circle between two points
drawparallels()   - Draw lines of constant latitude
drawmeridians()   - Draw lines of constant longitude
drawmapscale()    - Draw a linear scale on the map
• Whole-globe images
bluemarble()   - Project NASA’s blue marble image onto the map
shadedrelief() - Project a shaded relief image onto the map
etopo()        - Draw an etopo relief image onto the map
warpimage()    - Project a user-provided image onto the map
Pair plots
When you generalize joint plots to datasets of larger dimensions, you end up with pair plots. This is very useful
for exploring correlations between multidimensional data, when you’d like to plot all pairs of values against
each other.
We’ll demo this with the Iris dataset, which lists measurements of petals and sepals of three iris species:
import seaborn as sns
iris = sns.load_dataset("iris")
sns.pairplot(iris, hue='species', size=2.5);
Faceted histograms
• Sometimes the best way to view data is via histograms of subsets. Seaborn’s FacetGrid makes this
extremely simple.
• We’ll take a look at some data that shows the amount that restaurant staff receive in tips based on
various indicator data (see the sketch below).
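A minimal sketch using Seaborn’s bundled tips dataset (the tip_pct column is derived here purely for illustration):
import seaborn as sns
import matplotlib.pyplot as plt
tips = sns.load_dataset('tips')    # restaurant tips dataset, assumed available
tips['tip_pct'] = 100 * tips['tip'] / tips['total_bill']
grid = sns.FacetGrid(tips, row='sex', col='time', margin_titles=True)
grid.map(plt.hist, 'tip_pct', bins=15);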
Factor plots
Factor plots can be useful for this kind of visualization as well. This allows you to
view the distribution of a parameter within bins defined by any other parameter.
Joint distributions
Similar to the pair plot we saw earlier, we can use sns.jointplot to show the joint
distribution between different datasets, along with the associated marginal distributions.
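For instance, with the same illustrative tips dataset:
import seaborn as sns
tips = sns.load_dataset('tips')
sns.jointplot(x='total_bill', y='tip', data=tips, kind='hex');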
Bar plots
Time series can be plotted with sns.factorplot.
Question Bank
UNIT-1
1. Define bigdata.
Big data is a term for any collection of data sets so large or complex that it becomes difficult to
process them using traditional data management techniques
• The characteristics of big data are often referred to as the three Vs:
o Volume—How much data is there?
o Variety—How diverse are different types of data?
o Velocity—At what speed is new data generated?
• Fourth V: Veracity: How accurate is the data?
• Company data - data can be stored in official data repositories such as databases, data marts,
data warehouses, and data lakes
• Data mart: A data mart is a subset of the data warehouse and will be serving a specific
business unit.
• Data lakes: Data lakes contain data in its natural or raw format.
With brushing and linking we combine and link different graphs and tables or views so changes in
one graph are automatically transferred to the other graphs.
Variance and standard deviation are measures of data dispersion. variance is a measure of
dispersion that takes into account the spread of all data points in a data set. The standard deviation,
is simply the square root of the variance.
Quantile Plot
A quantile plot is a simple and effective way to have a first look at a univariate
data distribution.
Quantile–Quantile Plot
A quantile–quantile plot, or q-q plot, graphs the quantiles of one univariate distribution against the
corresponding quantiles of another.
Histograms
“Histos” means pole or mast, and “gram” means chart, so a histogram is a chart of poles. Plotting
histograms is a graphical method for summarizing the distribution of a given attribute, X.
PART B:
Descriptive Statistics:
• Descriptive statistics provides us with tools—tables, graphs, averages, ranges, correlations—
for organizing and summarizing the inevitable variability in collections of actual observations
or scores.
• Eg: A tabular listing, ranked from most to least, A graph showing the annual change in global
temperature during the last 30 years
Inferential Statistics:
• Statistics also provides tools—a variety of tests and estimates—for generalizing beyond
collections of actual observations.
• This more advanced area is known as inferential statistics.
• Eg: An assertion about the relationship between job satisfaction and overall happiness
7. Define outliers.
OUTLIERS
• One or more very extreme scores in a distribution are called outliers.
• Ex: A GPA of 0.06, an IQ of 170, summer wages of $62,000
• Ideal for summarizing distributions, such as that for weight data, without destroying the
identities of individual observations.
In statistics, a misleading graph, also known as a distorted graph, is a graph that misrepresents data,
constituting a misuse of statistics and with the result that an incorrect conclusion may be derived from
it.
• Variance and Standard Deviation are the two important measurements in statistics.
• Variance is a measure of how data points vary from the mean.
• The standard deviation is the measure of the distribution of statistical data.
• The standard deviation is the square root of the mean of all squared deviations from the mean,
that is, SD = √(Σ(X − mean)² / N).
PART B:
1. Types of data
2. Types of variables
3. Explain Frequency distribution. (or) How will the data be described with tables and graphs?
4. Explain mean, median and mode? (or) How will the data be described with averages?
5. Explain about data variability with example. (or) Standard Deviation and Variance
6. Explain normal curve and z-score.
PROBLEMS :
Correlation is a statistical measure that indicates the extent to which two or more variables fluctuate
in relation to each other. A positive correlation indicates the extent to which those variables increase
or decrease in parallel; a negative correlation indicates the extent to which one variable increases as
the other decreases.
3. Define scatterplot.
A scatterplot is a graph containing a cluster of dots that represents all pairs of scores.
6. Define r value.
A correlation coefficient is a number between –1 and 1 that describes the relationship between pairs
of variables.
The type of correlation coefficient, designated as r, that describes the linear relationship between
pairs of variables for quantitative data.
A number with a plus sign (or no sign) indicates a positive relationship, and a number with a minus
sign indicates a negative relationship.
8. Define regression.
Regression is defined as a statistical method that helps us to analyze and understand the relationship
between two or more variables of interest.
In statistics, regression toward the mean is that if one sample of a random variable is extreme, the
next sampling of the same random variable is likely to be closer to its mean.
PART B:
PROBLEMS:
1. Correlation – r value
2. Regression
3. Standard error of estimate
4. Interpretation of r2
UNIT-4
1. What is NumPy? Why should we use it?
NumPy (also called Numerical Python) is a highly flexible, optimized, open-source package meant
for array processing. It provides tools for delivering high-end performance while dealing with N-
dimensional powerful array objects.
• One-Dimensional array
import numpy as np
arr = [1, 2, 3, 4]
numpy_arr = np.array(arr)
• Two-Dimensional array
import numpy as np
arr = [[1,2,3,4],[4,5,6,7]]
numpy_arr = np.array(arr)
• Three-Dimensional array
import numpy as np
arr = [[[1,2,3,4],[4,5,6,7],[7,8,9,10]]]
numpy_arr = np.array(arr)
Using the np.array() function, we can create NumPy arrays of any dimensions.
Syntax         Description
array.shape    Dimensions (rows, columns)
len(array)     Length of array
array.ndim     Number of array dimensions
array.dtype    Data type
Array Manipulation
Adding or Removing Elements
Operator Description
np.append(a,b) Append items to array
6. Define broadcasting.
Broadcasting is simply a set of rules for applying binary ufuncs (addition, subtraction, multiplication,
etc.) on arrays of different sizes.
They also provide broadcasting and additional methods like reduce, accumulate etc. that are very
helpful for computation.
Eg: import numpy as np
x = [1, 2, 3, 4]
y = [4, 5, 6, 7]
z = np.add(x, y)
print(z)
Eg: import numpy as np
a = np.array([51, 92, 14, 71, 60, 20, 82, 86, 74, 74])
print(a)
Selection :
Selecting a single row using .loc[] (the older .ix[] indexer is deprecated)
In order to select a single row, we put a single row label in the .loc[] function.
In order to select all rows and some columns, we use single colon [:] to select all of rows and for
columns we make a list of integer then pass to a .iloc[] function.
import pandas as pd
data = pd.read_csv("employees.csv")
bool_series = pd.isnull(data["Gender"])
data[bool_series]
df_ind3.sort_index()
print(df_ind3.head(10))
Aggregation in pandas provides various functions that perform a mathematical or logical operation
on our dataset and returns a summary of that function. Aggregation can be used to get a summary of
columns in our dataset like getting sum, minimum, maximum, etc. from a particular column of our
dataset.
Aggregation :
Function Description:
• sum() :Compute sum of column values
• min() :Compute min of column values
• max() :Compute max of column values
• mean() :Compute mean of column
• size() :Compute column sizes
• describe() :Generates descriptive statistics
• first() :Compute first of group values
• last() :Compute last of group values
• count() :Compute count of column values
• std() :Standard deviation of column
• var() :Compute variance of column
• sem() :Standard error of the mean of column
18. Grouping :
Grouping is used to group data using some criteria from our dataset. It is used as split-apply-combine
strategy.
dataset.groupby(['cut', 'color']).agg('min')
PART B:
1. NUMPY aggregation
2. Comparision, mask, Boolean logic
3. Fancy indexing
4. structured array
5. Indexing, selection in pandas
6. Missing data – pandas
7. Hierarchical indexing
8. Combining dataset in pandas
9. Aggregation and grouping in pandas
10. Pivot tables
Questions may come scenario based, eg: Counting Rainy Days – for comparison , mask and Boolean
expression
UNIT-5
1. Define matplotlib.
Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in
Python.
import matplotlib.pyplot as plt
x = [5, 2, 9, 4, 7]
y = [10, 5, 8, 4, 2]
plt.plot(x, y)
plt.show()
import matplotlib.pyplot as plt
fig = plt.figure()
ax = plt.axes()
x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
y = [i ** 2 for i in x]   # placeholder values; the original omitted y (any list of the same length works)
plt.scatter(x, y, c="blue")
Continuous Errors
In some situations it is desirable to show errorbars on continuous quantities
6. Histograms
A histogram is a graphical representation of the distribution of data given by the user. Its
appearance is similar to Bar-Graph except it is continuous.
The towers or bars of a histogram are called bins. The height of each bin shows how many values
from that data fall into that range.
Divergent colormaps
These usually contain two distinct colors, which show positive and negative deviations
from a mean (e.g., RdBu or PuOr).
Qualitative colormaps
These mix colors with no particular sequence (e.g., rainbow or jet).
8. Define subplots.
Sometimes it is helpful to compare different views of data side by side. To this end,
Matplotlib has the concept of subplots: groups of smaller axes that can exist together
within a single figure.
9. Define annotation.
The annotate() function in pyplot module of matplotlib library is used to annotate the point xy with
text s.
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface
for drawing attractive and informative statistical graphics.
One common type of visualization in data science is that of geographic data. Matplotlib's main tool
for this type of visualization is the Basemap toolkit, which is one of several Matplotlib toolkits
which lives under the mpl_toolkits namespace.
plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=-100)
m.bluemarble(scale=0.5);
13. How will you plot multiple dimensions in a graph? (or) 3D plot
In order to plot 3D figures use matplotlib, we need to import the mplot3d toolkit, which adds the
simple 3D plotting capabilities to matplotlib.
import numpy as np
from mpl_toolkits import mplot3d
import matplotlib.pyplot as plt
plt.style.use('seaborn-poster')
Once we imported the mplot3d toolkit, we could create 3D axes and add data to the axes. create a
3D axes.
Kernel density estimation (KDE) is in some senses an algorithm which takes the mixture-of-
Gaussians idea to its logical extreme: it uses a mixture consisting of one Gaussian component per
point, resulting in an essentially non-parametric estimator of density.
The lowest level of subplots is plt.subplot(), which creates a single subplot within a grid.
for i in range(1, 7):
    plt.subplot(2, 3, i)
    plt.text(0.5, 0.5, str((2, 3, i)),
             fontsize=18, ha='center')
PART B:
1. Line plots
2. SCATTER PLOTS
3. DENSITY AND CONTOUR PLOTS
4. BASIC AND CONTINUOUS ERRORS
5. 3D PLOTTING
6. GEOGRAPHICAL DATA WITH BASEMAP
7. SEABORN
8. HISTOGRAM (WITH ALL OPERATIONS)