Agenda:
1) Assign Homework #1 (Due Wednesday 6/30)
2) Lecture over more of Chapter 2
Homework Assignment:
Homework #1 is due Wednesday 6/30
*Please e-mail a single file and make sure your name is on the first
page and in the body of the email. Also, the file name should say
“homework1” and include your name.
Introduction to Data Mining
by
Tan, Steinbach, Kumar
Chapter 2: Data
What is Data? Attributes
Object is also known as record, point, case, sample,
entity, instance, or observation
Experimental vs. Observational Data
(Important but not in book)
Experimental data are data collected by someone who exercised
strict control over all attributes.
Examples:
-Distance from cell phone tower vs. childhood cancer
Types of Attributes (P. 25):
Note: division is meaningful only for ratio attributes (e.g., percent increase).
Types of Attributes:
Some examples:
–Nominal
Examples: ID numbers, eye color, zip codes
–Ordinal
Examples: rankings (e.g., taste of potato chips on a scale
from 1-10), grades, height in {tall, medium, short}
–Interval
Examples: calendar dates, temperatures in Celsius or
Fahrenheit, GRE score
–Ratio
Examples: temperature in Kelvin, length, time, counts
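As a rough illustration of which operations make sense for each attribute type, the sketch below encodes one made-up example of each in R. The variable names and values are hypothetical and are not taken from the course data.

```r
# Nominal: values are only labels, so equality is the only meaningful test.
eye_color <- factor(c("brown", "blue", "brown"))
same_color <- eye_color[1] == eye_color[3]

# Ordinal: an ordered factor also supports < and >.
height <- factor(c("short", "tall", "medium"),
                 levels = c("short", "medium", "tall"), ordered = TRUE)
shorter <- height[1] < height[2]

# Interval: differences are meaningful, ratios are not
# (20 degrees Celsius is not "twice as hot" as 10 degrees Celsius).
temp_c <- c(10, 20)
diff_c <- temp_c[2] - temp_c[1]

# Ratio: a true zero point makes division meaningful.
temp_k <- temp_c + 273.15
ratio_k <- temp_k[2] / temp_k[1]

print(c(same_color, shorter))
print(diff_c)
print(ratio_k)
```

Storing nominal and ordinal attributes as factors lets R refuse arithmetic that would not be meaningful for them.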
Properties of Attribute Values
Discrete vs. Continuous (P. 28)
Discrete Attribute
–Has only a finite or countably infinite set of values
–Examples: zip codes, counts, or the set of words in a collection
of documents
–Note: binary attributes are a special case of discrete attributes
which have only 2 values
Continuous Attribute
–Has real numbers as attribute values
–Can compute as accurately as instruments allow
–Examples: temperature, height, or weight
–Practically, real values can only be measured and represented
using a finite number of digits
–Continuous attributes are typically represented as
floating-point variables
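The points about finite digits and floating-point storage can be seen directly in R. This is a minimal sketch; the variable names and values are made up for illustration.

```r
n_phones <- 3L        # the L suffix asks for integer storage (a discrete count)
body_temp <- 98.6     # stored as a double (a continuous measurement)

print(is.integer(n_phones))
print(is.double(body_temp))

# Real values are held with a finite number of binary digits, so exact
# equality tests on computed doubles can fail.
exact_equal <- (0.1 + 0.2) == 0.3
near_equal <- isTRUE(all.equal(0.1 + 0.2, 0.3))  # compares within a tolerance
print(c(exact_equal, near_equal))
```

In practice this is why comparisons of continuous attributes are usually made within a tolerance rather than with `==`.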
In class exercise #2:
Classify the following attributes as discrete or continuous. Also
classify them as qualitative (nominal or ordinal) or quantitative
(interval or ratio). Some cases may have more than one
interpretation, so briefly indicate your reasoning if you think
there may be some ambiguity.
a) Number of telephones in your house
b) Size of French Fries (Medium or Large or X-Large)
c) Ownership of a cell phone
d) Number of local phone calls you made in a month
e) Length of longest phone call
f) Length of your foot
g) Price of your textbook
h) Zip code
i) Temperature in degrees Fahrenheit
j) Temperature in degrees Celsius
k) Temperature in Kelvin
2009 UCSD Data Mining Competition Dataset
Read it into R
> getwd()
> setwd("C:/Documents And Settings/rajan/Desktop/")
> data<-read.csv("features.csv", header=TRUE)
Types of Data in R
For example, the state in the third column of features.csv is a
factor
> data[1:10,3]
[1] CA CA CA NJ CA CA FL CA IA CA
53 Levels: AE AK AL AP AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY LA MA MD ME MI MN MO MS MT NC ... WY
> is.factor(data[,3])
[1] TRUE
> data[,3]+10
[1] NA NA NA NA NA NA NA NA …
Warning message:
+ not meaningful for factors …
The fourth column seems like some version of the zip code. It
should be a factor (categorical) not numeric, but R doesn’t
know this.
> is.factor(data[,4])
[1] FALSE
> as.factor(data[1:10,4])
[1] 925 925 928 77 945 940 331 945 503 913
Levels: 77 331 503 913 925 928 940 945
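Note that `as.factor(data[1:10,4])` only prints a converted copy; the column in `data` stays numeric. A minimal sketch of making the conversion stick is shown here on a toy data frame, since features.csv is not reproduced in these notes. The data frame `toy` and the column name `zip3` are hypothetical.

```r
# Toy stand-in for features.csv; the name zip3 is an assumption for illustration.
toy <- data.frame(state = c("CA", "NJ", "CA"), zip3 = c(925, 77, 945))

print(is.factor(toy$zip3))        # numeric when first created

# Assign the converted column back into the data frame.
toy$zip3 <- as.factor(toy$zip3)

print(is.factor(toy$zip3))
print(levels(toy$zip3))           # each distinct code is now a category
print(table(toy$zip3))            # counts per category
```

Once the column is a factor, R treats the codes as labels rather than as numbers to be averaged or added.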
Working with Data in R
Creating Data:
> aa<-c(1,10,12)
> aa
[1] 1 10 12
> aa+10
[1] 11 20 22
> length(aa)
[1] 3
> bb<-c(2,6,79)
> my_data_set<-data.frame(attributeA=aa,attributeB=bb)
> my_data_set
attributeA attributeB
1 1 2
2 10 6
3 12 79
Indexing Data:
> my_data_set[,1]
[1] 1 10 12
> my_data_set[1,]
attributeA attributeB
1 1 2
> my_data_set[3,2]
[1] 79
> my_data_set[1:2,]
attributeA attributeB
1 1 2
2 10 6
> my_data_set[c(1,3),]
attributeA attributeB
1 1 2
3 12 79
Arithmetic:
> aa/bb
[1] 0.5000000 1.6666667 0.1518987
Summary Statistics:
> mean(my_data_set[,1])
[1] 7.666667
> median(my_data_set[,1])
[1] 10
> sqrt(var(my_data_set[,1]))
[1] 5.859465
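The expression `sqrt(var(...))` is the sample standard deviation, which R also provides as `sd()`. The short sketch below checks that the two agree on the same three values; the vector name `x` is only for illustration.

```r
x <- c(1, 10, 12)            # same values as attributeA above

sd_manual <- sqrt(var(x))    # square root of the sample variance
sd_builtin <- sd(x)          # built-in sample standard deviation

print(c(sd_manual, sd_builtin))
print(summary(x))            # minimum, quartiles, median, mean, maximum
```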
Writing Data:
> write.csv(my_data_set,"my_data_set_file.csv")
Help!:
> ?write.csv
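One detail worth knowing: by default `write.csv` also writes the row names, so reading the file back gives an extra first column. The sketch below shows the round trip with and without row names. It writes to a temporary file (an assumption made so the example does not touch your working directory).

```r
my_data_set <- data.frame(attributeA = c(1, 10, 12),
                          attributeB = c(2, 6, 79))
out_file <- tempfile(fileext = ".csv")   # temporary path, for illustration only

# Default behaviour: row names are written as an extra first column.
write.csv(my_data_set, out_file)
with_row_names <- read.csv(out_file)
print(names(with_row_names))

# row.names = FALSE writes only the attribute columns.
write.csv(my_data_set, out_file, row.names = FALSE)
without_row_names <- read.csv(out_file)
print(names(without_row_names))
```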
Working with Data in Excel
Reading in Data:
Deleting a Column:
(right click)
Arithmetic:
Sampling
The simple random sample is the most common and basic
type of sample.
In a simple random sample, every item has the same
probability of inclusion, and every sample of the fixed size has
the same probability of selection.
It is the standard “names out of a hat” method.
Sampling in Excel:
The function rand() is useful: fill a new column with =RAND() in every row.
Then sort on this column and take the first n rows, where n is the
desired sample size.
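The same sort-on-a-random-column idea can be sketched in R. This is only an illustration on a made-up data frame; the names `toy`, `key`, and `my_sample` are hypothetical, and `runif()` plays the role of Excel's rand().

```r
set.seed(1)                                  # fixed seed so the run is repeatable
toy <- data.frame(id = 1:20, amount = seq(5, 100, by = 5))
n <- 5                                       # desired sample size

toy$key <- runif(nrow(toy))                  # one random number per row
my_sample <- toy[order(toy$key), ][1:n, ]    # sort on the key, keep the first n rows

print(my_sample)
```

Because each row receives its own random key, no row can be chosen twice, so this gives a simple random sample without replacement.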
In class exercise #3:
Explain how to use R to draw a sample of 10 observations with
replacement from the first quantitative attribute in the data set
http://sites.google.com/site/stats202/data/features.csv
Answer:
> sam<-sample(seq(1,nrow(data)),10,replace=T)
> my_sample<-data$amount[sam]
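The answer samples row indices and then looks up the values. A shorter alternative is to sample the column directly, which draws from the same values. The sketch below compares the two on a small made-up column, since features.csv is not reproduced here; `toy` and its values are hypothetical.

```r
set.seed(202)
toy <- data.frame(amount = c(12.5, 40, 7.25, 99, 18, 63.1))

# Approach used in the answer: sample row indices, then index the column.
idx <- sample(seq_len(nrow(toy)), 10, replace = TRUE)
via_index <- toy$amount[idx]

# Alternative: sample the column values directly.
direct <- sample(toy$amount, 10, replace = TRUE)

print(via_index)
print(direct)
```

With only 6 rows and 10 draws, repeated values are unavoidable, which is exactly what sampling with replacement allows.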
In class exercise #4:
If you do the sampling in the previous exercise repeatedly, roughly
how far is the mean of the sample from the mean of the whole
column on average?
> real_mean<-mean(data$amount)
> store_diff<-rep(0,10000)
>
> for (k in 1:10000){
+ sam<-sample(seq(1,nrow(data)),10,replace=T)
+ my_sample<-data$amount[sam]
+ store_diff[k]<-abs(mean(my_sample)-real_mean)
+ }
> mean(store_diff)
[1] 3.59541
In class exercise #5:
If you change the sample size from 10 to 100, how does your answer
to the previous question change?
> real_mean<-mean(data$amount)
> store_diff<-rep(0,10000)
>
> for (k in 1:10000){
+ sam<-sample(seq(1,nrow(data)),100,replace=T)
+ my_sample<-data$amount[sam]
+ store_diff[k]<-abs(mean(my_sample)-real_mean)
+ }
> mean(store_diff)
[1] 1.128120
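The square root sampling relationship:
When you take samples, the typical difference between a sample
statistic and its value on the entire data set shrinks in proportion
to one over the square root of the sample size, for many statistics
such as the mean. Above, going from samples of 10 to samples of 100
cut the average difference from about 3.60 to about 1.13, a factor
of roughly sqrt(10), or about 3.16.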
Note: It is only the sizes of the samples that matter, and not
the size of the whole data set.
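The relationship can be checked with a small simulation. The sketch below uses a synthetic column drawn from a normal distribution rather than the course data, so the individual numbers will differ from the ones above; only the ratio is of interest. The helper `mean_abs_diff` and the population parameters are assumptions made for this illustration.

```r
set.seed(202)
population <- rnorm(50000, mean = 50, sd = 12)   # synthetic stand-in column
real_mean <- mean(population)

# Average absolute gap between a sample mean and the full-column mean,
# estimated over many repeated samples of size n drawn with replacement.
mean_abs_diff <- function(n, reps = 10000) {
  diffs <- replicate(reps,
                     abs(mean(sample(population, n, replace = TRUE)) - real_mean))
  mean(diffs)
}

d10 <- mean_abs_diff(10)
d100 <- mean_abs_diff(100)
ratio <- d10 / d100

print(c(d10 = d10, d100 = d100, ratio = ratio, sqrt10 = sqrt(10)))
```

If the relationship holds, multiplying the sample size by 10 should shrink the gap by a factor close to sqrt(10), so `ratio` should land near 3.16.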
Sampling
Sampling can be tricky or ineffective when the data has a
more complex structure than simply independent observations.
For example, here is a “sample” of words from a song. Most
of the information is lost.