Summarizing Data


Summarizing data

1/27/13 5:22 PM

Summarizing data
Jeffrey Leek, Assistant Professor of Biostatistics
Johns Hopkins Bloomberg School of Public Health

file:///Users/jtleek/Dropbox/Jeff/teaching/2013/coursera/week2/007summarizingData/index.html#1

Why summarize?
Data sets are often too large to inspect in full
The first step in an analysis is to find problems with the data
When you compute these summaries, you should be looking for
- Missing values
- Values outside of expected ranges
- Values that seem to be in the wrong units
- Mislabeled variables/columns
- Variables that are the wrong class
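
A minimal sketch of how some of these checks might look in R. The data frame `toy` and its values are invented for illustration and are not the lecture's data.

```r
# Toy data frame invented for illustration (not the lecture's data)
toy <- data.frame(
  depth_km = c(10.2, NA, 35.0, -5.0),   # one missing value, one negative depth
  nst      = c("12", "8", "15", "20"),  # counts stored as text: wrong class
  stringsAsFactors = FALSE
)

# Missing values per column
naCounts <- colSums(is.na(toy))

# Values outside the expected range (ignoring missing values)
anyNegativeDepth <- any(toy$depth_km < 0, na.rm = TRUE)

# Class of each column
colClasses <- sapply(toy, class)

naCounts
anyNegativeDepth
colClasses
```

Each check is a one-liner, so it is cheap to run all of them before any modeling.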

Earthquake data

https://explore.data.gov/Geography-and-Environment/Worldwide-M1-Earthquakes-Past-7Days/7tag-iwnu
Earthquake data
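# Create the data directory if it does not exist yet
if (!file.exists("./data")) dir.create("./data")
# Download the earthquake catalog
fileUrl <- "http://earthquake.usgs.gov/earthquakes/catalogs/eqs7day-M1.txt"
download.file(fileUrl, destfile="./data/earthquakeData.csv", method="curl")
# Record when the data were downloaded
dateDownloaded <- date()
dateDownloaded

[1] "Sun Jan 27 00:23:22 2013"

eData <- read.csv("./data/earthquakeData.csv")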

Looking at data - the whole thing


eData

   Src     Eqid Version                              Datetime
1   nc 71929481       1 Sunday, January 27, 2013 05:03:01 UTC
2   ci 15278017       0 Sunday, January 27, 2013 04:59:04 UTC
3   ak 10645573       1 Sunday, January 27, 2013 04:55:09 UTC
4   nc 71929476       0 Sunday, January 27, 2013 04:51:48 UTC
5   nn 00401016       9 Sunday, January 27, 2013 04:45:19 UTC
6   ak 10645564       1 Sunday, January 27, 2013 04:16:45 UTC
7   hv 60459531       2 Sunday, January 27, 2013 04:15:57 UTC
8   ak 10645555       1 Sunday, January 27, 2013 04:14:35 UTC
9   ci 15278009       0 Sunday, January 27, 2013 04:07:44 UTC
10  us c000ewb3       7 Sunday, January 27, 2013 04:05:42 UTC
11  ci 15278001       0 Sunday, January 27, 2013 03:54:27 UTC
12  hv 60459521       1 Sunday, January 27, 2013 03:50:13 UTC
13  hv 60459516       2 Sunday, January 27, 2013 03:43:56 UTC
14  ak 10645533       1 Sunday, January 27, 2013 03:25:17 UTC
15  ak 10645528       1 Sunday, January 27, 2013 03:18:17 UTC
16  us c000ewax       6 Sunday, January 27, 2013 03:17:57 UTC
17  ci 15277993       0 Sunday, January 27, 2013 02:47:04 UTC

Looking at data - dim(),names(),nrow(),ncol()


dim(eData)

[1] 1057   10

names(eData)

 [1] "Src"       "Eqid"      "Version"   "Datetime"  "Lat"
 [6] "Lon"       "Magnitude" "Depth"     "NST"       "Region"

nrow(eData)

[1] 1057
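
The slide title also lists ncol(), which is not shown above. A small sketch of how the four functions relate, using a three-row data frame built from values that appear in the printout (the name `toy` is made up for this example):

```r
# Three rows taken from the lecture printout; 'toy' is a made-up name
toy <- data.frame(Src = c("nc", "ci", "ak"),
                  Lat = c(38.83, 36.04, 65.23))

dims <- dim(toy)   # c(number of rows, number of columns)
dims
nrow(toy)          # same as dim(toy)[1]
ncol(toy)          # same as dim(toy)[2]
names(toy)         # one name per column
```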

Looking at the data - quantile(),summary()


quantile(eData$Lat)

    0%    25%    50%    75%   100%
-61.30  35.56  38.77  52.58  67.66

summary(eData)

      Src            Eqid          Version
 ak     :330   00400150:   1   2      :379
 nc     :247   00400153:   1   0      :195
 ci     :145   00400155:   1   1      :168
 nn     : 92   00400156:   1   9      : 97
 us     : 89   00400157:   1   3      : 82
 pr     : 40   00400159:   1   4      : 43
 (Other):114   (Other) :1051   (Other): 93
                                  Datetime         Lat
 Monday, January 21, 2013 11:00:00 UTC:   2   Min.   :-61.3
 Friday, January 25, 2013 00:06:25 UTC:   1   1st Qu.: 35.6
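
A short sketch of two quantile() arguments that are useful in practice: probs for custom quantiles and na.rm for missing values. The vector holds the first ten latitudes printed elsewhere in the lecture. The extra NA is added here only for illustration.

```r
# First ten latitudes from the lecture; the trailing NA is added for illustration
lat <- c(38.83, 36.04, 65.23, 39.56, 37.26, 62.10, 19.41, 63.51, 32.91, -5.17)
latWithNA <- c(lat, NA)

# Custom quantiles via probs
q <- quantile(lat, probs = c(0.05, 0.5, 0.95))

# quantile() stops with an error on NA unless na.rm = TRUE
qNoNA <- quantile(latWithNA, probs = 0.5, na.rm = TRUE)

q
qNoNA
summary(latWithNA)   # summary() reports the number of NA values separately
```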

Looking at data - class()


class(eData)

[1] "data.frame"
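
class() on a data frame reports only the container class, so the class of each column has to be checked separately. A minimal sketch on a made-up data frame (`toy` and its NST values are hypothetical). Note that read.csv() defaulted to stringsAsFactors = TRUE before R 4.0.0, so on a newer R the text columns would likely come back as "character" rather than "factor".

```r
# Made-up data frame; stringsAsFactors = TRUE mimics the pre-4.0.0 read.csv() default
toy <- data.frame(Src = c("nc", "ci"),
                  Lat = c(38.83, 36.04),
                  NST = c(12L, 8L),        # hypothetical station counts
                  stringsAsFactors = TRUE)

class(toy)                        # class of the container
colClasses <- sapply(toy, class)  # class of each column
colClasses

# Convert a factor column to character when it should be plain text
toy$Src <- as.character(toy$Src)
class(toy$Src)
```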

sapply(eData[1,],class)

      Src      Eqid   Version  Datetime       Lat       Lon Magnitude
 "factor"  "factor"  "factor"  "factor" "numeric" "numeric" "numeric"
    Depth       NST    Region
"numeric" "integer"  "factor"

Looking at data - unique(),length(),table()


unique(eData$Src)

[1] nc ci ak nn hv us pr uw nm mb uu
Levels: ak ci hv mb nc nm nn pr us uu uw

length(unique(eData$Src))

[1] 11

table(eData$Src)

 ak  ci  hv  mb  nc  nm  nn  pr  us  uu  uw
330 145  29  10 247   2  92  40  89  40  33
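
A sketch of two common follow-ups to table(): sorting the counts and converting them to proportions. The vector holds the Src values of the first ten rows of the printout, so the counts here describe only those ten rows.

```r
# Src values of the first ten rows shown in the lecture printout
src <- c("nc", "ci", "ak", "nc", "nn", "ak", "hv", "ak", "ci", "us")

counts <- table(src)
sortedCounts <- sort(counts, decreasing = TRUE)  # most frequent source first
props <- prop.table(counts)                      # counts as proportions

sortedCounts
props
length(unique(src))   # number of distinct sources
```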
Looking at data - table()


table(eData$Src,eData$Version)

       0   1   2   3   4   5   6   7   8   9   A   B   D   E
  ak   0  93 211  26   0   0   0   0   0   0   0   0   0   0
  ci  64   0  67   7   3   3   1   0   0   0   0   0   0   0
  hv   0  14  11   0   2   2   0   0   0   0   0   0   0   0
  mb   0   0  10   0   0   0   0   0   0   0   0   0   0   0
  nc  91  46  51  37  10   4   3   1   1   1   1   1   0   0
  nm   0   0   0   0   0   0   0   0   0   0   2   0   0   0
  nn   0   0   0   0   0   0   0   0   0  92   0   0   0   0
  pr  40   0   0   0   0   0   0   0   0   0   0   0   0   0
  us   0   0   2   0  14  13  24  13  11   4   4   2   1   1
  uu   0   0  15   6  14   3   2   0   0   0   0   0   0   0
  uw   0  15  12   6   0   0   0   0   0   0   0   0   0   0
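
A sketch of how a two-way table relates to the one-way counts: summing across the columns recovers table() of the row variable. The vectors hold Src and Version for the first ten rows of the printout.

```r
# Src and Version for the first ten rows shown in the lecture printout
src     <- c("nc", "ci", "ak", "nc", "nn", "ak", "hv", "ak", "ci", "us")
version <- c("1",  "0",  "1",  "0",  "9",  "1",  "2",  "1",  "0",  "7")

cross <- table(src, version)   # rows: src, columns: version
cross

rowTotals <- rowSums(cross)    # totals per source
colTotals <- colSums(cross)    # totals per version
rowTotals
colTotals
```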

Looking at data - any(), all()


eData$Lat[1:10]

[1] 38.83 36.04 65.23 39.56 37.26 62.10 19.41 63.51 32.91 -5.17

eData$Lat[1:10] > 40

 [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE

any(eData$Lat[1:10] > 40)

[1] TRUE

Looking at data - all()


eData$Lat[1:10] > 40

 [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE

all(eData$Lat[1:10] > 40)

[1] FALSE
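
A sketch that puts any() and all() next to two related summaries of a logical vector, sum() and which(), using the ten latitudes printed above. The last line shows how a missing value can make the answer NA.

```r
# The ten latitudes printed in the lecture
lat <- c(38.83, 36.04, 65.23, 39.56, 37.26, 62.10, 19.41, 63.51, 32.91, -5.17)
north <- lat > 40

anyNorth   <- any(north)    # is at least one value above 40?
allNorth   <- all(north)    # are all values above 40?
countNorth <- sum(north)    # TRUE counts as 1, so this is how many
whereNorth <- which(north)  # positions of the TRUE values

anyNorth
allNorth
countNorth
whereNorth

any(c(FALSE, NA))           # NA: the missing value could have been TRUE
```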

Looking at subsets - &


eData[eData$Lat > 0 & eData$Lon > 0,c("Lat","Lon")]

       Lat    Lon
51   5.486 127.05
56  39.749  77.30
58  38.295  46.81
110 34.571  24.10
129 51.130 179.35
134  9.438 126.10
146 38.426  73.36
153 49.728 155.69
155 43.337  18.77
160 29.379 132.20
175 44.280  10.53
193 31.763  50.95
239  4.998  95.96
325 53.564 142.75
348 38.608  73.49
359 27.771  56.41
385 49.825  87.60

Looking at subsets - |
eData[eData$Lat > 0 | eData$Lon > 0,c("Lat","Lon")]

       Lat     Lon
1  38.8292 -122.81
2  36.0403 -117.35
3  65.2271 -149.51
4  39.5573 -121.99
5  37.2587 -114.07
6  62.1046 -150.70
7  19.4065 -155.26
8  63.5132 -150.83
9  32.9112 -116.25
10 -5.1704  102.94
11 35.5633 -118.53
12 19.2960 -155.38
13 19.9262 -155.54
14 62.1638 -149.58
15 63.2917 -149.24
16 34.2925 -106.71
17 33.6293 -116.69
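
A small side-by-side sketch of & (both conditions must hold) and | (at least one must hold). The five rows are copied from the two printouts above. The name `coords` is made up for this example.

```r
# Five (Lat, Lon) pairs copied from the lecture printouts
coords <- data.frame(
  Lat = c(38.8292, 36.0403, -5.1704, 5.486, 39.749),
  Lon = c(-122.81, -117.35, 102.94, 127.05, 77.30)
)

# Rows where both Lat and Lon are positive
both <- coords[coords$Lat > 0 & coords$Lon > 0, c("Lat", "Lon")]

# Rows where at least one of Lat and Lon is positive
either <- coords[coords$Lat > 0 | coords$Lon > 0, c("Lat", "Lon")]

both
either
```

The row with Lat -5.1704 and Lon 102.94 fails the & condition but passes the | condition, which is why it appears in the second lecture printout and not in the first.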

Peer review experiment data


Data on submissions/reviews in an experiment

http://www.plosone.org/article/info:doi/10.1371/journal.pone.0026895

Peer review data


fileUrl1 <- "https://dl.dropbox.com/u/7710864/data/reviews-apr29.csv"
fileUrl2 <- "https://dl.dropbox.com/u/7710864/data/solutions-apr29.csv"
download.file(fileUrl1,destfile="./data/reviews.csv",method="curl")
download.file(fileUrl2,destfile="./data/solutions.csv",method="curl")
reviews <- read.csv("./data/reviews.csv")
solutions <- read.csv("./data/solutions.csv")
head(reviews,2)

  id solution_id reviewer_id      start       stop time_left accept
1  1           3          27 1304095698 1304095758      1754      1
2  2           4          22 1304095188 1304095206      2306      1

head(solutions,2)

  id problem_id subject_id      start       stop time_left answer
1  1        156         29 1304095119 1304095169      2343      B
2  2        269         25 1304095119 1304095183      2329      C
Find if there are missing values - is.na()


is.na(reviews$time_left[1:10])

 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE

sum(is.na(reviews$time_left))

[1] 84

table(is.na(reviews$time_left))

FALSE  TRUE
  115    84
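
A sketch of the same missing-value checks on a short vector, plus mean(is.na()) for the fraction missing. The first two values come from the head(reviews,2) printout. The remaining values, including the NAs, are made up for illustration.

```r
# First two values from head(reviews, 2); the rest are made up for illustration
timeLeft <- c(1754, 2306, NA, 2192, NA)

missing <- is.na(timeLeft)      # TRUE where the value is missing
nMissing <- sum(missing)        # number of missing values
fracMissing <- mean(missing)    # fraction of missing values
whichMissing <- which(missing)  # positions of the missing values

nMissing
fracMissing
whichMissing
table(missing)
```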
Important table()/NA issue


table(c(0,1,2,3,NA,3,3,2,2,3))

0 1 2 3
1 1 3 4

table(c(0,1,2,3,NA,3,3,2,2,3),useNA="ifany")

   0    1    2    3 <NA>
   1    1    3    4    1
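
A sketch of a quick consistency check for this issue: by default the counts from table() add up to the number of non-missing values only, and with useNA they add up to the full length. The vector is the one used above. useNA="always" is shown as an alternative that reports an NA count even when it is zero.

```r
# Same vector as in the lecture example
x <- c(0, 1, 2, 3, NA, 3, 3, 2, 2, 3)

tabDefault <- table(x)                    # silently drops the NA
tabIfAny   <- table(x, useNA = "ifany")   # adds an NA column only if NAs exist
tabAlways  <- table(c(1, 2), useNA = "always")  # NA column even with no NAs

sum(tabDefault)   # number of non-missing values
sum(tabIfAny)     # full length of x
tabAlways
```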

Summarizing columns/rows - rowSums(),rowMeans(),colSums(),colMeans()


Important parameters: x, na.rm
colSums(reviews)

         id solution_id reviewer_id       start        stop
      19900       19929        5064          NA          NA
  time_left      accept
         NA          NA

Summarizing columns/rows - rowSums(),rowMeans(),colSums(),colMeans()


colMeans(reviews,na.rm=TRUE)

         id solution_id reviewer_id       start        stop
  1.000e+02   1.001e+02   2.545e+01   1.304e+09   1.304e+09
  time_left      accept
  1.114e+03   6.435e-01

rowMeans(reviews,na.rm=TRUE)

 [1] 3.726e+08 3.726e+08 3.726e+08 3.726e+08 3.726e+08 3.726e+08
 [7] 3.726e+08 1.300e+01 3.726e+08 3.726e+08 3.726e+08 3.726e+08
[13] 3.726e+08 3.726e+08 3.726e+08 3.726e+08 1.967e+01 3.726e+08
[19] 3.726e+08 1.933e+01 3.726e+08 3.726e+08 3.726e+08 2.433e+01
[25] 2.367e+01 2.367e+01 3.726e+08 3.726e+08 3.726e+08 3.726e+08
[31] 3.726e+08 3.726e+08 3.726e+08 3.726e+08 3.133e+01 3.726e+08
[37] 3.267e+01 3.726e+08 3.400e+01 3.726e+08 3.200e+01 3.726e+08
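
A sketch of the effect of na.rm on a small data frame. The two non-missing time_left values come from head(reviews,2). The data frame `toy`, its missing values, and its other columns are made up for illustration. The last call restricts rowMeans() to chosen columns, since a row mean over every column mixes different units (in the output above the start/stop timestamps dominate the result).

```r
# Made-up data frame; 1754 and 2306 are time_left values from head(reviews, 2)
toy <- data.frame(id        = 1:3,
                  time_left = c(1754, NA, 2306),
                  accept    = c(1, 1, NA))

sumsWithNA <- colSums(toy)                # NA for any column with a missing value
sumsNoNA   <- colSums(toy, na.rm = TRUE)  # missing values dropped before summing
meansNoNA  <- colMeans(toy, na.rm = TRUE)

# Row means over selected columns only
rowAvg <- rowMeans(toy[, c("time_left", "accept")], na.rm = TRUE)

sumsWithNA
sumsNoNA
meansNoNA
rowAvg
```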
