IDS Mid 1 Notes


Big Data Hype

1.Volume, Velocity, Variety: The core idea behind Big Data is handling vast amounts of
data that are generated rapidly and come in different formats.
2.Advanced Analytics: The promise of Big Data is often tied to advanced analytics,
including predictive analytics, real-time analytics, and machine learning. The ability to
derive actionable insights from large datasets has been a major selling point.
3.Infrastructure and Technology: Technologies like Hadoop, Spark, and cloud computing
have become synonymous with Big Data. The hype often revolves around the ability of
these technologies to store, process, and analyze large datasets efficiently.
4.Industry Applications: Big Data is frequently promoted as a transformative force across
various industries, from healthcare and finance to retail and transportation. The ability to
optimize operations, enhance customer experiences, and discover new business opportunities is a commonly highlighted benefit.
5.Challenges and Concerns: Despite the hype, there are challenges, such as data privacy
and security, the need for specialized skills, and the potential for data misinterpretation.
The costs and complexities associated with implementing Big Data solutions can also be
significant.
Data Science Hype
1.Interdisciplinary Field: Data Science encompasses statistics, computer science,
domain expertise, and more. The hype often centers on the ability of Data
Scientists to tackle complex problems by combining these disciplines.
2.Machine Learning and AI: A major driver of the Data Science hype is the growth
of machine learning and artificial intelligence. The idea that algorithms can learn
from data and make predictions or decisions has captivated the public and
businesses alike.
3.Job Market and Salaries: The demand for Data Scientists has led to high salaries
and a perception that it is a lucrative career path. This has contributed to the
hype, with many people entering the field in search of high-paying jobs.
4.Business Impact: Data Science is often seen as a key to competitive advantage.
Businesses are keen to leverage data to improve decision-making, understand
customer behavior, and streamline operations.
5.Education and Training: The hype has led to a rapid increase in educational programs, online courses, and boot camps aimed at training the next generation of Data Scientists.
Datafication refers to the transformation of various aspects of life, business
processes, and human activities into data that can be quantified, analyzed,
and used for decision-making. It involves converting diverse forms of
information into digital data, enabling organizations to analyze and utilize it
for various purposes, such as improving services, understanding behavior,
and predicting trends.
Key Aspects of Datafication:
1.Digitization: Converting analog information (like physical documents,
spoken words, or physical activities) into digital form. This is a foundational
step in datafication, as it allows data to be stored, processed, and analyzed
using digital technologies.
2.Data Collection: Gathering data from various sources, such as sensors,
social media, transaction records, GPS, and more. Modern technology
enables the collection of vast amounts of data in real time.
3.Data Storage and Management: Storing the collected data in databases, data warehouses, or data lakes. Proper management of this data, including ensuring its quality and accessibility, is crucial for effective analysis.
4.Data Analysis: Using various data science techniques, such as machine learning, statistical analysis, and data mining, to extract insights from the data. This analysis can reveal patterns, correlations, and trends that can inform decision-making.
5.Data-Driven Decision Making: Leveraging the insights gained from data analysis to make informed decisions. Data-driven approaches can optimize business processes, enhance customer experiences, and improve operational efficiency.
Applications of Datafication:
1.Business and Marketing: Companies use datafication to understand
customer behavior, personalize marketing efforts, optimize supply chains,
and improve product offerings.
2.Healthcare: Data from medical records, wearable devices, and patient
monitoring systems can be analyzed to improve diagnoses, treatments,
and patient outcomes.
3.Smart Cities: Data from sensors and connected devices in urban
environments can optimize traffic flow, manage utilities, and enhance
public safety.
4.Finance: Financial institutions use datafication to detect fraud, assess
credit risk, and provide personalized financial services.
5.Education: Data on student performance and behavior can be used to
tailor educational content, improve learning outcomes, and streamline
administrative processes.
The Current Landscape (with a Little History)
Drew Conway's Venn Diagram is a popular conceptual model used to describe
the essential skills required for data science. It consists of three overlapping
circles, each representing a different domain:
1.Mathematics and Statistics Knowledge: This circle represents the
foundational understanding of statistical methods, mathematical theories,
and techniques crucial for data analysis and interpretation.
2.Substantive Expertise: This area pertains to domain-specific knowledge or
expertise. It includes understanding the field or industry in which data
science is being applied, such as finance, healthcare, marketing, etc.
3.Hacking Skills: This circle refers to the technical ability to manipulate and
work with data, including programming skills, software engineering, and
familiarity with tools like Python, R, SQL, and other data-related
technologies.
(Figure: Drew Conway's Venn diagram of data science skills)
Statistical Inference
Statistical inference is the cornerstone of data science. It's the process of
drawing conclusions about a population based on a sample of data. While we
often have access to large datasets, it's rarely feasible to analyze the entire
population.
Key Concepts
• Population: The entire group we're interested in studying.
• Sample: A subset of the population used for analysis.
• Parameter: A numerical characteristic of the population (e.g., population
mean, population standard deviation).
• Statistic: A numerical characteristic of a sample (e.g., sample mean, sample
standard deviation).
• Inference: The process of drawing conclusions about population parameters
based on sample statistics.
Suppose your population was all emails sent last year by employees at a huge corporation,
BigCorp.
Then a single observation could be a list of things: the sender’s name, the list of recipients,
date sent, text of email, number of characters in the email, number of sentences in the
email, number of verbs in the email, and the length of time until first reply
In the BigCorp email example, you could make a list of all the employees and select 1/10th
of those people at random and take all the email they ever sent, and that would be your
sample.
Alternatively, you could sample 1/10th of all email sent each day at random, and that
would be your sample. Both these methods are reasonable, and both methods yield the
same sample size. But if you took them and counted how many email messages each
person sent, and used that to estimate the underlying distribution of emails sent by all individuals at BigCorp, you might get entirely different answers.
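The point that the two sampling schemes can suggest very different per-person distributions can be sketched in R. Everything below (the counts, and the approximation of email-level sampling as a 10% thinning of each person's count) is invented purely for illustration, not BigCorp data:

# Hypothetical sketch of the two BigCorp sampling schemes
set.seed(1)
n_employees <- 1000
# Simulate a highly skewed "emails sent per person" distribution for one year
emails_per_person <- rpois(n_employees, lambda = rexp(n_employees, rate = 1 / 200))

# Scheme 1: sample 1/10th of the employees and keep all of their email
scheme1 <- sample(emails_per_person, size = n_employees / 10)

# Scheme 2: sample 1/10th of all individual emails at random; each person's
# observed count is then roughly a 10% thinning of their true count
scheme2 <- rbinom(n_employees, size = emails_per_person, prob = 0.1)

summary(scheme1)  # per-person counts on the original scale
summary(scheme2)  # per-person counts shrunk by the email-level sampling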
Statistical modeling in data science involves using mathematical models to
represent, analyze, and predict data. It's a core component of data science,
providing the tools and techniques to understand data, identify relationships,
and make data-driven decisions.

Linear regression is a fundamental statistical technique used in data science to model the relationship between a dependent variable (also known as the target or response variable) and one or more independent variables (also known as predictors or features). The primary goal of linear regression is to predict the value of the dependent variable based on the values of the independent variables.
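As a minimal illustration in R (using the built-in mtcars data set, which is not part of these notes), a simple linear regression can be fit with lm():

# Simple linear regression: predict mpg (dependent) from wt (independent)
model <- lm(mpg ~ wt, data = mtcars)
summary(model)    # coefficients, R-squared, residual standard error

# Predict mpg for a hypothetical car weighing 3,000 lbs (wt is in 1000s of lbs)
predict(model, newdata = data.frame(wt = 3))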
Types of Linear Regression: simple linear regression (a single independent variable) and multiple linear regression (two or more independent variables).
Probability Distribution:
Underfitting
• Definition: Underfitting occurs when a model is too simple to capture the
underlying patterns in the data. This happens when the model has high bias
and cannot adequately learn the relationship between input and output
variables.
• Example: Suppose you are trying to predict house prices based on a single
feature, like the size of the house (in square feet). If you use a linear
regression model (a straight line) to fit the data, but the relationship between
size and price is actually more complex (perhaps non-linear), your model may
fail to capture the true trend, leading to poor predictions.
• Symptoms: Both training and validation errors are high.
• Solution: Increase the model complexity, add more features, or use a more
powerful model that can capture the non-linear relationship.
Overfitting
• Definition: Overfitting occurs when a model is too complex and learns not only the underlying pattern but also the noise in the training data. This leads to high variance, where the model performs well on training data but poorly on unseen data.
• Example: Now, consider that you use a very complex model, like a deep
neural network, with many layers and parameters to predict house prices.
This model might fit the training data very well, capturing every small
fluctuation. However, these fluctuations might be due to random noise or
outliers rather than true underlying trends. When you apply this model to
new data, it fails to generalize and gives poor predictions.
• Symptoms: Training error is low, but validation error is high.
• Solution: Reduce model complexity
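A rough R sketch of the two failure modes, using simulated house-price data (all numbers invented): a straight-line fit tends to show similarly high training and validation error, while an overly flexible polynomial fit tends to show low training error but higher validation error.

# Simulate a non-linear size-price relationship and compare two fits
set.seed(42)
size  <- runif(100, 500, 3000)                        # house size (sq ft), invented
price <- 50 + 0.10 * size + 0.00005 * size^2 + rnorm(100, sd = 40)
houses <- data.frame(size, price)

train <- houses[1:70, ]
test  <- houses[71:100, ]

fit_simple  <- lm(price ~ size, data = train)             # likely underfits
fit_complex <- lm(price ~ poly(size, 15), data = train)   # likely overfits

rmse <- function(fit, data) sqrt(mean((data$price - predict(fit, newdata = data))^2))

rmse(fit_simple, train);  rmse(fit_simple, test)    # both errors comparable (high bias)
rmse(fit_complex, train); rmse(fit_complex, test)   # train error low, test error typically higher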
Data Science - R
By Mr K Kishore Kumar, Assistant Professor, Department of Data Science
What is Data Science?
Data science is an interdisciplinary field that involves extracting
valuable insights and knowledge from data using scientific methods,
processes, algorithms, and systems.
It combines elements from statistics, mathematics, computer science,
and domain expertise to analyze large and complex datasets
Core Components of Data Science:
• Data Collection: Gathering relevant data from various sources.
• Data Cleaning: Preparing data for analysis by handling missing
values, inconsistencies, and outliers.
• Data Exploration: Discovering patterns and trends within the data
through visualization and summary statistics.
• Data Modeling: Building statistical or machine learning models to
predict outcomes or understand relationships.
• Data Evaluation: Assessing the accuracy and reliability of models.
• Data Communication: Presenting findings and insights in a clear and
understandable manner.
Applications of Data Science:
• Data science has a wide range of applications across industries, including:
• Business: Customer segmentation, fraud detection, marketing
optimization.
• Healthcare: Disease prediction, drug discovery, personalized medicine.
• Finance: Risk assessment, algorithmic trading, fraud prevention.
• Marketing: Customer behavior analysis, recommendation systems,
targeted advertising.
What is R?
R is a powerful, open-source programming language and environment
primarily used for statistical computing and data analysis.
It's become a cornerstone for data scientists, statisticians, and researchers due to its flexibility, vast ecosystem of packages, and strong community support.
Why R?
• Open-source: Free to use and distribute.
• Comprehensive: Offers a wide range of statistical and graphical
techniques.
• Flexible: Can handle large datasets and complex analyses.
• Extensible: Thousands of packages available for specialized tasks.
• Active Community: Strong support and continuous development.
• R was started by professors Ross Ihaka and Robert Gentleman as a programming language to teach introductory statistics at the University of Auckland. The language was inspired by the S programming language, with most S programs able to run unaltered in R. It was also inspired by Scheme's lexical scoping, which allows for local variables.
Getting Started
1.Download R: Visit the Comprehensive R Archive Network (CRAN) at
https://cran.r-project.org/ to download and install R for your operating
system.
2.RStudio: While optional, RStudio is a popular Integrated Development
Environment (IDE) that provides a user-friendly interface for R
programming. It's highly recommended for beginners.
Basic Concepts
• R Console: This is where you
interact with R by typing
commands.
• Objects: R stores data in objects.
These can be numbers, text,
logical values, or more complex
structures.
• Functions: Built-in or user-defined procedures to perform specific tasks.
• Packages: Collections of
functions and data for specific
purposes.
Basic Syntax:

x <- 5 # Assigns the value 5 to the object x

result <- 3 + 4 * 2 # Performs the calculation


Entering Input
At the R prompt we type expressions. The <- symbol is the assignment
operator.
> x <- 1
> print(x)
[1] 1
> x
[1] 1
> msg <- "hello"
The grammar of the language determines whether an expression is complete
or not.
x <- ## Incomplete expression
R Objects
R has five basic or “atomic” classes of objects:
• character
• numeric (real numbers)
• integer
• complex
• logical (TRUE/FALSE)
• The most basic type of R object is a vector. Empty vectors can be created with the vector() function. There is really only one rule about vectors in R, which is that a vector can only contain objects of the same class.
Numbers in R are generally treated as numeric objects (i.e. double precision
real numbers). This means that even if you see a number like “1” or “2” in R,
which you might think of as integers, they are likely represented behind the
scenes as numeric objects (so something like “1.00” or “2.00”).
This isn’t important most of the time…except when it is. If you explicitly
want an integer, you need to specify the L suffix.
So entering 1 in R gives you a numeric object; entering 1L explicitly gives
you an integer object.
There is also a special number Inf which represents infinity. This allows us to
represent entities like 1 / 0. This way, Inf can be used in ordinary
calculations; e.g. 1 / Inf is 0.
The value NaN represents an undefined value (“not a number”)
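A few console lines (standard R behavior, not taken from the original slides) illustrate the atomic classes and the points about 1 vs. 1L, Inf, and NaN:

# Atomic classes and coercion at the R console
class(1)          # "numeric" (double precision by default)
class(1L)         # "integer" (the L suffix requests an integer)
class(TRUE)       # "logical"
class("hello")    # "character"
class(2 + 3i)     # "complex"

vector("numeric", length = 3)   # empty numeric vector: 0 0 0
c(1, "a", TRUE)                 # one class per vector: all elements coerced to character

1 / 0     # Inf
1 / Inf   # 0
0 / 0     # NaN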
Types of Data
By Mr K Kishore Kumar, Assistant Professor, Data Science Department, CMR Technical Campus
• Data sets are made up of data objects. A data object represents an entity. In
a sales database, the objects could be customers, store items, or sales, for
instance. In a medical database, the objects may be patients. In a university
database, the objects could be students, professors, and courses.
• Data objects are typically described by attributes. Data objects can also be
referred to as samples, examples, instances, data points, or objects. If the
data objects are stored in a database, they are data tuples.
What is an Attribute?
• An attribute is a data field, representing a characteristic or feature of a data
object. The nouns attribute, dimension, feature, and variable are often
used interchangeably in literature. The term dimension is commonly used
in data warehousing.
• Machine learning literature tends to use the term feature, while
statisticians prefer the term variable. Data mining and database
professionals commonly use the term attribute. We use the term attribute
here as well. Attributes describing a customer object can include, for
example, customer ID, name, and address.
• Observed values for a given attribute are known as observations. A set of
attributes used to describe a given object is called an attribute vector (or
feature vector ). The distribution of data involving one attribute (or
variable) is called univariate. A bivariate distribution involves two
attributes, and so on.
The type of an attribute is determined by the set of possible values the attribute can have.
Attributes can be nominal, binary, ordinal, or numeric
Nominal means “relating to names.” The values of a nominal attribute are symbols or
names of things.
Example:
Nominal attributes. Suppose that Hair color and Marital status are two attributes
describing person objects. In our application, possible values for Hair color are black,
brown, blond, red, grey, and white. Marital status can take on the values single, married,
divorced, and widowed. Both Hair color and Marital status are nominal attributes.
Occupation is another example, with the values teacher, dentist, programmer, farmer, and
so on
A binary attribute is a nominal attribute with only two categories or states: 0
or 1, where 0 typically means that the attribute is absent, and 1 means that
it is present. Binary attributes are referred to as Boolean if the two states
correspond to true and false.

Example 2.2 Binary attributes. Given the attribute Smoker describing a patient object, 1 indicates that the patient smokes, while 0 indicates that the
patient does not. Similarly, suppose the patient undergoes a medical test
that has two possible outcomes. The attribute Medical test is binary, where a
value of 1 means the result of the test for the patient is positive, while 0
means the result is negative.
An ordinal attribute is an attribute whose possible values have a meaningful order or
ranking among them, but the magnitude between successive values is not known
Example 2.3 Ordinal attributes. Suppose that Drink size corresponds to the size of drinks available at a fast food restaurant. This ordinal attribute has three possible values – small, medium, and large. The values have a meaningful sequence (which corresponds to increasing drink size); however, we cannot tell from the values how much bigger, say, a large is than a medium.
Other examples of ordinal attributes include Grade (e.g., A+, A, A−, B+, and so on) and
Professional rank. Professional ranks can be enumerated in a sequential order, such as
assistant, associate, and full for professors, and private, private first class, specialist,
corporal, sergeant for army ranks. Ordinal attributes are useful for registering subjective
assessments of qualities that cannot be measured objectively. Hence, ordinal attributes
are often used in surveys for ratings. In one survey, participants were asked to rate how
satisfied they were as customers. Customer satisfaction had the following ordinal
categories: 0: very dissatisfied, 1: somewhat dissatisfied, 2: neutral, 3: satisfied, and 4:
very satisfied.
A numeric attribute is quantitative, that is, it is a measurable quantity, represented in
integer or real values. Numeric attributes can be interval-scaled or ratio-scaled
Interval-scaled attributes are measured on a scale of equal-sized units. The values of
interval-scaled attributes have order and can be positive, 0, or negative. Thus, in addition to
providing a ranking of values, such attributes allow us to compare and quantify the
difference between values.
Example:
Interval-scaled attributes. Temperature is an interval-scaled attribute. Suppose that we
have the outdoor temperature value for a number of different days, where each day is an
object. By ordering the values, we obtain a ranking of the objects with respect to
temperature. In addition, we can quantify the difference between values. For example, a temperature of 20°C is 5 degrees higher than a temperature of 15°C. Calendar dates are another example. For instance, the years 2002 and 2010 are 8 years apart.
• A ratio-scaled attribute is a numeric attribute with an inherent zero-point.
That is, if a measurement is ratio-scaled, we can speak of a value as being a
multiple (or ratio) of another value. In addition, the values are ordered, and
we can also compute the difference between values. The mean, median,
and mode can be computed as well.
• Example 2.5 Ratio-scaled attributes. Unlike temperatures in Celsius and Fahrenheit, the Kelvin (K) temperature scale has what is considered a true zero-point (0 K = −273.15°C): it is the point at which the particles that comprise matter have zero kinetic energy. Other examples of ratio-
scaled attributes include Count attributes such as Years of experience
(where the objects are employees, for example) and Number of words
(where the objects are documents). Additional examples include attributes
to measure weight, height, latitude and longitude coordinates (e.g., when
clustering houses), and monetary quantities (e.g., you are 100 times richer
with $100 than with $1).
Discrete Versus Continuous Attributes
• Classification algorithms developed from the field of machine learning
often talk of attributes as being either discrete or continuous. Each type
may be processed differently. A discrete attribute has a finite or countably
infinite set of values, which may or may not be represented as integers.
The attributes Hair color, Smoker, Medical test, and Drink size each have a
finite number of values, and so are discrete.
• Note that discrete attributes may have numeric values, such as 0 and 1 for
binary attributes, or, the values 0 to 110 for the attribute Age. An attribute
is countably infinite if the set of possible values is infinite, but the values
can be put in a one-to-one correspondence with natural numbers. For
example, the attribute customer ID is countably infinite. The number of
customers can grow to infinity, but in reality, the actual set of values is
countable (where the values can be put in one-to-one correspondence
with the set of integers). Zip codes are another example.
• If an attribute is not discrete, it is continuous. The terms “numeric
attribute” and “continuous attribute” are often used interchangeably in
the literature. (This can be confusing because, in the classical sense,
continuous values are real numbers, whereas numeric values can be either
integers or real numbers.) In practice, real values are represented using a
finite number of digits. Continuous attributes are typically represented as
floating-point variables.
Basic Statistical Descriptions of Data
Measuring the Central Tendency: Mean, Median, Mode
We look at various ways to measure the central tendency of data. Suppose that we have some attribute X, like salary, which has been recorded for a set of objects. Let x1, x2, . . . , xN be the set of N observed values or observations for X. If we were to plot the observations for salary, where would most of the values fall?
Measures of central tendency include the mean, median, mode, and midrange
Median:
In probability and statistics, the median generally applies to numeric data,
however, we may extend the concept to ordinal data. Suppose that a given
data set of N values for an attribute X is sorted in increasing order. If N is odd, then the median is the middle value of the ordered set. If N is even, then the median is not unique; by convention, it is taken to be the average of the two middlemost values.
Let us find the median with an example. Consider the salary data (in $K): 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110. The data are already sorted in increasing order. There is an even number of observations (N = 12); therefore, the median is not unique. It can be any value within the two middlemost values of 52 and 56 (that is, the 6th and 7th values in the list). By convention, we assign the average of the two middlemost values as the median. That is, (52 + 56)/2 = 108/2 = 54. Thus, the median is $54K.
• The median is expensive to compute when we have a large number of
observations. Assume that data are grouped in intervals according to
their xi data values and that the frequency of each interval is known. For
example, employees may be grouped according to their annual salary in
intervals such as 10–20K, 20–30K, and so on.
• Let the interval that contains the median frequency be the median
interval. We can approximate the median of the entire data set (e.g., the
median salary) by interpolation using the formula:
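The interpolation formula itself did not come through in these notes; the version commonly given for grouped data (these notes appear to follow Han, Kamber, and Pei's text) is approximately:

median ≈ L1 + ((N/2 − (Σ freq)l) / freq_median) × width

where L1 is the lower boundary of the median interval, N is the total number of data values, (Σ freq)l is the sum of the frequencies of all intervals lower than the median interval, freq_median is the frequency of the median interval, and width is the width of the median interval.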
The mode for a set of data is the value that occurs most frequently in the
set. Therefore, it can be determined for qualitative and quantitative
attributes. It is possible for the greatest frequency to correspond to several
different values, which results in more than one mode. Data sets with one,
two, or three modes are respectively called unimodal, bimodal, and
trimodal
Measuring the Dispersion of Data:
• Range, Quartiles, and the Interquartile Range (IQR):
Let x1, x2, . . . , xN be a set of observations for some numeric attribute, X. The range of the set is the difference between the largest (max()) and smallest (min()) values. Suppose that the data for attribute X are sorted in increasing numeric order.
Imagine that we can pick certain data points so as to split the data distribution into equal-sized
consecutive sets, called quartiles
• Quantiles are points taken at regular intervals of a data distribution, dividing it into
essentially equal-sized consecutive sets.
• The kth q-quantile for a given data distribution is the value x such that at most k/q of
the data values are less than x and at most (q − k)/q of the data values are more than x,
where k is an integer such that 0 < k < q. There are q − 1 q-quantiles.
The quartiles give an indication of the center, spread, and shape of a distribution. The
first quartile, denoted by Q1, is the 25th percentile. The third quartile, denoted by Q3, is
the 75th percentile. The second quartile is the 50th percentile. As the median, it gives
the center of the data distribution. The distance between the first and third quartiles is a
simple measure of spread that gives the range covered by the middle half of the data.
This distance is called the interquartile range (IQR) and is defined as IQR = Q3 − Q1.
• Example 2.10 Interquartile range.
• The quartiles are the three values that split the sorted data set into four equal parts. The data of the earlier median example contain 12 observations (30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110), already sorted in increasing order. Thus, the quartiles for this data are the 3rd, 6th, and 9th values, respectively, in the sorted list. Therefore, Q1 = $47K and Q3 = $63K.
• Thus, the interquartile range is IQR = 63 − 47 = $16K. (Note that the 6th
value is a median, $52K, although this data set has two medians since
the number of data values is even.)
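A minimal R sketch of the same calculation, using the salary data from the example above:

# Quartiles and IQR for the salary data (values in $K)
salary <- c(30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110)

quantile(salary, probs = c(0.25, 0.5, 0.75))   # Q1, median, Q3
IQR(salary)                                    # Q3 - Q1

# Note: R's default quantile algorithm (type 7) interpolates, so its values
# can differ slightly from the "3rd, 6th, and 9th value" convention above.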
Five-Number Summary, Boxplots, and Outliers
Note that no single numeric measure of spread, such as the IQR, is very useful for describing skewed distributions. Have a look at symmetric and skewed data distributions: in a symmetric distribution, the median splits the data into equal-size halves; this does not occur for skewed distributions. Hence, it is more informative to also provide the two quartiles Q1 and Q3, along with the median.
A common rule of thumb for identifying suspected outliers is to single out
values falling at least 1.5 × IQR above the third quartile or below the first
quartile
• Because Q1, the median, and Q3 together contain no information about
the endpoints (e.g., tails) of the data, a fuller summary of the shape of a
distribution can be obtained by providing the lowest and highest data
values as well. This is known as the five-number summary.
• The five-number summary of a distribution consists of the median, the
quartiles Q1 and Q3, and the smallest and largest individual observations,
written in the order of Minimum, Q1, Median, Q3, Maximum.
• Boxplots are a popular way of visualizing a distribution.
A boxplot incorporates the five-number summary as follows:
Typically, the ends of the box are at the quartiles, so that the box length is
the interquartile range, IQR.
The median is marked by a line within the box.
Two lines (called whiskers) outside the box extend to the smallest
(Minimum) and largest (Maximum) observations.
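A minimal R sketch (same invented salary data as above) showing the five-number summary and a boxplot:

# Five-number summary (Minimum, Q1, Median, Q3, Maximum) and a boxplot
salary <- c(30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110)

fivenum(salary)                        # Tukey's five-number summary
boxplot(salary, main = "Salary ($K)")  # points beyond 1.5 * IQR are plotted individually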
Figure 2.3 shows boxplots for unit
price data for items sold at four
branches of All Electronics during a
given time period.
For branch 1, we see that the
median price of items sold is $80,
Q1 is $60, Q3 is $100. Notice that
two outlying observations for this
branch were plotted individually, as
their values of 175 and 202 are
more than 1.5 times the IQR here
of 40.
Variance and Standard Deviation
Variance and standard deviation are measures of data dispersion. They indicate how
spread out a data distribution is. A low standard deviation means that the data
observations tend to be very close to the mean, while high standard deviation indicates
that the data are spread out over a large range of values.
The basic properties of the standard deviation, σ, as a measure of spread
are
• σ measures spread about the mean and should be considered only when
the mean is chosen as the measure of center.
• σ = 0 only when there is no spread, that is, when all observations have
the same value. Otherwise σ > 0
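The formulas themselves are not reproduced in these notes; for N observations x1, x2, . . . , xN with mean x̄, the population variance and standard deviation are usually written as:

σ² = (1/N) × Σ (xi − x̄)²    and    σ = √(σ²)

Note that R's var() and sd() functions compute the sample versions, which divide by N − 1 instead of N.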
Graphic Displays of Basic Statistical
Descriptions of Data
A quantile plot is a simple and effective way to have a first look at a univariate data distribution. First, it displays all of the data for the given attribute. Second, it plots quantile information.
Let xi , for i = 1 to N, be the data sorted in increasing order so that x1 is the smallest
observation and xN is the largest for some ordinal or numeric attribute X.
Each observation, xi, is paired with a percentage, fi, which indicates that approximately fi × 100% of the data are below the value xi. Here fi = (i − 0.5)/N. These numbers increase in equal steps of 1/N, ranging from 1/(2N) (which is slightly above zero) to 1 − 1/(2N) (which is slightly below one). On a quantile plot, xi is graphed against fi.
For example, given the quantile plots of sales data for two different time
periods, we can compare their Q1, median, Q3, and other fi values at a glance.
• A quantile-quantile plot, or q-q plot, graphs the quantiles of one univariate
distribution against the corresponding quantiles of another.
• It is a powerful visualization tool in that it allows the user to view whether
there is a shift in going from one distribution to another.
• Suppose that we have two sets of observations for the attribute or variable
unit price, taken from two different branch locations.
• Let x1, . . . , xN be the data from the first branch, and y1, . . . , yM be the data
from the second, where each data set is sorted in increasing order.
• If M = N (i.e., the number of points in each set is the same), then we simply
plot yi against xi , where yi and xi are both (i − 0.5)/N quantiles of their
respective data sets
• If M < N (i.e., the second branch has fewer observations than the first), there
can be only M points on the q-q plot. Here, yi is the (i − 0.5)/M quantile of
the y data, which is plotted against the (i − 0.5)/M quantile of the x data.
This computation typically involves interpolation.
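A minimal q-q plot sketch in R; the two branch data sets below are invented purely for illustration:

# Q-Q plot of unit prices from two branches (invented data, M < N)
set.seed(3)
branch1 <- runif(50, 40, 120)   # N = 50 observations
branch2 <- runif(35, 45, 130)   # M = 35 observations

qqplot(branch1, branch2,
       xlab = "Branch 1 quantiles", ylab = "Branch 2 quantiles")
abline(0, 1, lty = 2)   # if both distributions matched, points would hug y = x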
• Histograms “Histog” means pole and “gram” means chart, so a histogram
is a chart of poles.
• Plotting histograms is a graphical method for summarizing the distribution of
a given attribute, X.
• If X is nominal, such as item type, then a pole or vertical bar is drawn for
each known value of X. The height of the bar indicates the frequency (i.e.,
count) of that X value. The resulting graph is more commonly known as a bar
chart.
• If X is numeric, the term histogram is preferred. The range of values for X
is partitioned into disjoint consecutive subranges.
• The subranges, referred to as buckets, are disjoint subsets of the data
distribution for X.
• The range of a bucket is known as the width. Typically, the buckets are
equal-width.
• For example, a price attribute with a value range of $1 to $200 (rounded
up to the nearest dollar) can be partitioned into subranges 1 to 20, 21 to
40, 41 to 60, and so on. For each subrange, a bar is drawn whose height
represents the total count of items observed within the subrange
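A short R sketch of the equal-width bucketing described above, with invented prices:

# Histogram of a numeric price attribute with equal-width buckets of $20
price <- round(runif(200, 1, 200))     # invented prices between $1 and $200
hist(price,
     breaks = seq(0, 200, by = 20),    # subranges 1-20, 21-40, ..., 181-200
     main = "Unit price distribution",
     xlab = "Price ($)")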
• A scatter plot is one of the most effective graphical methods for
determining if there appears to be a relationship, pattern, or trend
between two numeric attributes.
• The scatter plot is a useful method for providing a first look at bivariate
data to see clusters of points and outliers, or to explore the possibility of
correlation relationships.
• Two attributes, X, and Y , are correlated if one attribute implies the other.
Correlations can be positive, negative, or null (uncorrelated). Figure 2.8
shows examples of positive and negative correlations between two
attributes.
• If the pattern of plotted points slopes from lower left to upper right, this
means that the values of X increase as the values of Y increase, which
suggests a positive correlation (Figure 2.8a)).
• If the pattern of plotted points slopes from upper left to lower right, then
the values of X increase as the values of Y decrease, suggesting a negative
correlation (Figure 2.8b)). A line of best fit can be drawn in order to study
the correlation between the variables.
• Statistical tests for correlation are covered in the material on data integration. Figure 2.9 shows three cases for which there is no correlation relationship between the two attributes in each of the given data sets. Scatter plots can also be extended to n attributes, resulting in a scatter-plot matrix.
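A small R sketch of a scatter plot with a line of best fit and a Pearson correlation; the data are simulated, not taken from the figures referenced above:

# Scatter plot and correlation between two numeric attributes (invented data)
set.seed(7)
x <- rnorm(100)
y <- 2 * x + rnorm(100)            # positively correlated with x by construction

plot(x, y, main = "Scatter plot of X versus Y")
abline(lm(y ~ x), col = "blue")    # line of best fit
cor(x, y)                          # Pearson correlation, close to +1 here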
Lists in R Programming
• In R programming, a list is a data structure that can store different types of elements, such as numbers, strings, vectors, or even other lists. Lists in R are particularly useful when you want to bundle objects of different types together.
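A minimal sketch of creating and accessing a list (the names and values are invented):

# A list bundling objects of different types
my_list <- list(
  name   = "Alice",
  scores = c(85, 92, 78),
  passed = TRUE
)

my_list$scores       # access an element by name with $
my_list[["name"]]    # access by name with [[ ]]
my_list[[2]]         # access by position
length(my_list)      # number of elements: 3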
Data Frames in R Programming
A Data Frame in R is a two-dimensional data structure used for
storing tabular data. It is similar to a table in a database or a
spreadsheet in Excel, where each column contains values of one
variable, and each row contains one set of values from each column.

Key Characteristics of Data Frames:


Heterogeneous Data: A data frame can store different types of data
(numeric, character, factor, etc.) in different columns.
Named Columns: Each column in a data frame has a name that you
can use to refer to that column.
Flexible Indexing: You can access data by both row and column
indices.
Creating a Data Frame
A Data Frame can be created using the data.frame() function in R.
# Creating a Data Frame in R
df <- data.frame(
Name = c("John", "Alice", "Peter", "Rachel"),
Age = c(25, 30, 28, 22),
Gender = c("Male", "Female", "Male", "Female")
)
print(df)
Accessing Data Frame Elements
You can access elements of a data frame using indexing or the $
operator.
Accessing Columns:
# Access the 'Name' column
df$Name
# Or using single-bracket indexing by name (returns a one-column data frame)
df["Name"]
# Access the first row
df[1, ]

# Access the element at the 2nd row, 1st column (Alice's Name)
df[2, 1]
Subsetting Data Frames
Subsetting refers to extracting a portion of the Data Frame, either
rows, columns, or both.
Subsetting by Rows: You can select rows based on specific criteria,
such as filtering by a column value.
# Subset rows where Gender is 'Female'
df_female <- df[df$Gender == "Female", ]
print(df_female)
Subsetting by Columns:
You can select specific columns by name or index.
# Subsetting columns 'Name' and 'Age'
df_subset <- df[, c("Name", "Age")]
print(df_subset)
Combining Rows and Columns: You can subset both rows and
columns at the same time.
# Select the 'Name' column for the first 3 rows
df_subset <- df[1:3, "Name"]
print(df_subset)
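One behavior worth noting (standard R semantics, not stated in the notes): selecting a single column this way simplifies the result to a plain vector; adding drop = FALSE keeps it as a data frame.

# df[1:3, "Name"] drops to a vector; drop = FALSE preserves the data frame shape
df[1:3, "Name", drop = FALSE]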

Adding and Removing Columns


Adding Columns: You can add a new column to the data frame by
assigning values to a new column name.
# Adding a new column 'Salary'
df$Salary <- c(50000, 60000, 55000, 45000)
print(df)
Removing Columns: You can remove a column by setting it to
NULL.
# Remove the 'Salary' column
df$Salary <- NULL
print(df)
Sorting Data Frames
You can sort a Data Frame by a column using the order() function.
Sorting by a Single Column:
# Sort by 'Age' in ascending order
df_sorted <- df[order(df$Age), ]
print(df_sorted)
Sorting by Multiple Columns:
# Sort by 'Gender' and then by 'Age'
df_sorted <- df[order(df$Gender, df$Age), ]
print(df_sorted)

Adding and Removing Rows


Adding Rows: You can add rows using the rbind() function.
# Adding a new row
new_row <- data.frame(Name = "Michael", Age = 35, Gender = "Male")
df <- rbind(df, new_row)
print(df)
Removing Rows: You can remove rows by subsetting and
excluding unwanted rows.
# Remove the second row (Alice)
df <- df[-2, ]
print(df)
