Summarizing Data
1/27/13 5:22 PM
Jeffrey Leek, Assistant Professor of Biostatistics
Johns Hopkins Bloomberg School of Public Health
Why summarize?
Data are often too big to look at in their entirety
The first step in an analysis is to find problems in the data
When you compute these summaries, you should be looking for:
- Missing values
- Values outside of expected ranges
- Values that seem to be in the wrong units
- Mislabeled variables/columns
- Variables that are the wrong class
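The checks listed above can be sketched in a few lines of R. The data frame `toy` and its columns are made up for illustration and are not part of the lecture data; the only outside fact used is that a latitude must lie between -90 and 90.

```r
# Hypothetical toy data: one missing value, one impossible latitude,
# and a numeric column that was read in as character
toy <- data.frame(
  lat   = c(38.8, 36.0, NA, 120.5),
  depth = c("2.1", "10.4", "0.5", "33.0"),
  stringsAsFactors = FALSE
)

# Missing values: count the NAs in each column
naCounts <- colSums(is.na(toy))

# Values outside the expected range: latitude must lie in [-90, 90]
badLat <- which(toy$lat < -90 | toy$lat > 90)

# Variables that are the wrong class: depth should be numeric
classes <- sapply(toy, class)
toy$depth <- as.numeric(toy$depth)

naCounts
badLat
classes
```

`which()` is used for the range check because it skips the comparison that evaluates to NA, so only the position of the impossible value is returned.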
Earthquake data
https://explore.data.gov/Geography-and-Environment/Worldwide-M1-Earthquakes-Past-7Days/7tag-iwnu
Earthquake data
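# Download the past-7-days M1+ earthquake catalog and record when it was fetched
fileUrl <- "http://earthquake.usgs.gov/earthquakes/catalogs/eqs7day-M1.txt"
download.file(fileUrl,destfile="./data/earthquakeData.csv",method="curl")
dateDownloaded <- date()
dateDownloaded
# Read the downloaded file into the data frame used on the following slides
eData <- read.csv("./data/earthquakeData.csv")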
   Src     Eqid Version                              Datetime
 1  nc 71929481       1 Sunday, January 27, 2013 05:03:01 UTC
 2  ci 15278017       0 Sunday, January 27, 2013 04:59:04 UTC
 3  ak 10645573       1 Sunday, January 27, 2013 04:55:09 UTC
 4  nc 71929476       0 Sunday, January 27, 2013 04:51:48 UTC
 5  nn 00401016       9 Sunday, January 27, 2013 04:45:19 UTC
 6  ak 10645564       1 Sunday, January 27, 2013 04:16:45 UTC
 7  hv 60459531       2 Sunday, January 27, 2013 04:15:57 UTC
 8  ak 10645555       1 Sunday, January 27, 2013 04:14:35 UTC
 9  ci 15278009       0 Sunday, January 27, 2013 04:07:44 UTC
10  us c000ewb3       7 Sunday, January 27, 2013 04:05:42 UTC
11  ci 15278001       0 Sunday, January 27, 2013 03:54:27 UTC
12  hv 60459521       1 Sunday, January 27, 2013 03:50:13 UTC
13  hv 60459516       2 Sunday, January 27, 2013 03:43:56 UTC
14  ak 10645533       1 Sunday, January 27, 2013 03:25:17 UTC
15  ak 10645528       1 Sunday, January 27, 2013 03:18:17 UTC
16  us c000ewax       6 Sunday, January 27, 2013 03:17:57 UTC
17  ci 15277993       0 Sunday, January 27, 2013 02:47:04 UTC
dim(eData)
[1] 1057   10
names(eData)
 [1] "Src"       "Eqid"      "Version"   "Datetime"  "Lat"
 [6] "Lon"       "Magnitude" "Depth"     "NST"       "Region"
nrow(eData)
[1] 1057
quantile(eData$Lat)
    0%    25%    50%    75%   100%
-61.30  35.56  38.77  52.58  67.66
summary(eData)
      Src            Eqid         Version
 ak     :330   00400150:   1   2      :379
 nc     :247   00400153:   1   0      :195
 ci     :145   00400155:   1   1      :168
 nn     : 92   00400156:   1   9      : 97
 us     : 89   00400157:   1   3      : 82
 pr     : 40   00400159:   1   4      : 43
 (Other):114   (Other) :1051   (Other): 93
                                  Datetime          Lat
 Monday, January 21, 2013 11:00:00 UTC:   2   Min.   :-61.3
 Friday, January 25, 2013 00:06:25 UTC:   1   1st Qu.: 35.6
[output truncated]
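The five-number output above is the default behaviour of `quantile()`. A minimal sketch of two of its arguments, `probs` and `na.rm`, using a made-up vector `lat` rather than the earthquake data:

```r
# Made-up latitudes, including one missing value
lat <- c(-61.3, 35.6, 38.8, 52.6, 67.7, NA)

# quantile() stops with an error when NAs are present unless na.rm=TRUE
q <- quantile(lat, na.rm = TRUE)

# probs selects specific quantiles, here the 10th and 90th percentiles
tails <- quantile(lat, probs = c(0.1, 0.9), na.rm = TRUE)

q
tails
```

Asking for the tails directly is a quick way to spot values outside the expected range without printing the whole column.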
class(eData)
[1] "data.frame"
sapply(eData[1,],class)
      Src      Eqid   Version  Datetime       Lat       Lon Magnitude
 "factor"  "factor"  "factor"  "factor" "numeric" "numeric" "numeric"
    Depth       NST    Region
"numeric" "integer"  "factor"
unique(eData$Src)
 [1] nc ci ak nn hv us pr uw nm mb uu
Levels: ak ci hv mb nc nm nn pr us uu uw
length(unique(eData$Src))
[1] 11
table(eData$Src)
 ak  ci  hv  mb  nc  nm  nn  pr  us  uu  uw
330 145  29  10 247   2  92  40  89  40  33
table(eData$Src,eData$Version)
       0   1   2   3   4   5   6   7   8   9   A   B   D   E
  ak   0  93 211  26   0   0   0   0   0   0   0   0   0   0
  ci  64   0  67   7   3   3   1   0   0   0   0   0   0   0
  hv   0  14  11   0   2   2   0   0   0   0   0   0   0   0
  mb   0   0  10   0   0   0   0   0   0   0   0   0   0   0
  nc  91  46  51  37  10   4   3   1   1   1   1   1   0   0
  nm   0   0   0   0   0   0   0   0   0   0   2   0   0   0
  nn   0   0   0   0   0   0   0   0   0  92   0   0   0   0
  pr  40   0   0   0   0   0   0   0   0   0   0   0   0   0
  us   0   0   2   0  14  13  24  13  11   4   4   2   1   1
  uu   0   0  15   6  14   3   2   0   0   0   0   0   0   0
  uw   0  15  12   6   0   0   0   0   0   0   0   0   0   0
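A two-way table is easier to read with totals and within-row proportions. The sketch below uses `addmargins()` and `prop.table()` from base R on made-up vectors `src` and `version`; it is an illustration, not part of the original slides.

```r
# Made-up source and version labels
src     <- c("ak", "ak", "ci", "nc", "nc", "nc")
version <- c("1",  "2",  "0",  "0",  "1",  "2")

# Cross-tabulate the two variables
tab <- table(src, version)

# Add row and column totals
withTotals <- addmargins(tab)

# Share of each version within a source (each row sums to 1)
rowShares <- prop.table(tab, margin = 1)

withTotals
rowShares
```

The totals give a quick consistency check: the grand total should equal the number of rows in the data.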
eData$Lat[1:10]
 [1] 38.83 36.04 65.23 39.56 37.26 62.10 19.41 63.51 32.91 -5.17
eData$Lat[1:10] > 40
 [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE
any(eData$Lat[1:10] > 40)
[1] TRUE
eData$Lat[1:10] > 40
 [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE
all(eData$Lat[1:10] > 40)
[1] FALSE
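A logical comparison can be collapsed in several ways. The sketch below reuses the ten latitudes printed above as a plain vector `lat`, so it runs without the earthquake file; the variable names are chosen for illustration.

```r
# First ten latitudes as printed on the slide
lat <- c(38.83, 36.04, 65.23, 39.56, 37.26, 62.10, 19.41, 63.51, 32.91, -5.17)

north <- lat > 40

any(north)    # TRUE: at least one value exceeds 40
all(north)    # FALSE: not every value exceeds 40
sum(north)    # 3: number of values above 40
which(north)  # 3 6 8: positions of those values

# With missing data the result can be NA unless na.rm=TRUE
all(c(TRUE, NA))                # NA
all(c(TRUE, NA), na.rm = TRUE)  # TRUE
```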
Looking at subsets - &
eData[eData$Lat > 0 & eData$Lon > 0,c("Lat","Lon")]
       Lat    Lon
51   5.486 127.05
56  39.749  77.30
58  38.295  46.81
110 34.571  24.10
129 51.130 179.35
134  9.438 126.10
146 38.426  73.36
153 49.728 155.69
155 43.337  18.77
160 29.379 132.20
175 44.280  10.53
193 31.763  50.95
239  4.998  95.96
325 53.564 142.75
348 38.608  73.49
359 27.771  56.41
385 49.825  87.60
Looking at subsets - |
eData[eData$Lat > 0 | eData$Lon > 0,c("Lat","Lon")]
       Lat     Lon
1  38.8292 -122.81
2  36.0403 -117.35
3  65.2271 -149.51
4  39.5573 -121.99
5  37.2587 -114.07
6  62.1046 -150.70
7  19.4065 -155.26
8  63.5132 -150.83
9  32.9112 -116.25
10 -5.1704  102.94
11 35.5633 -118.53
12 19.2960 -155.38
13 19.9262 -155.54
14 62.1638 -149.58
15 63.2917 -149.24
16 34.2925 -106.71
17 33.6293 -116.69
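One thing to watch when subsetting with `&` and `|` is missing data: a condition that evaluates to NA yields a row of NAs rather than dropping the row. A minimal sketch with a made-up data frame `coords` (not the earthquake data), using `which()` as one common workaround:

```r
# Made-up coordinates; the third latitude is missing
coords <- data.frame(
  Lat = c(38.8, -5.2, NA, 5.5),
  Lon = c(-122.8, -102.9, 127.1, 127.0)
)

# & keeps rows where both conditions hold, | where at least one holds
both   <- coords[coords$Lat > 0 & coords$Lon > 0, ]
either <- coords[coords$Lat > 0 | coords$Lon > 0, ]

# An NA in the logical index produces a row of NAs;
# wrapping the condition in which() drops those rows
bothClean <- coords[which(coords$Lat > 0 & coords$Lon > 0), ]

nrow(both)       # 2: one real match plus one all-NA row
nrow(either)     # 3: NA | TRUE is TRUE, so row 3 is kept as is
nrow(bothClean)  # 1: only the real match
```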
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0026895
head(reviews,2)
  id solution_id reviewer_id      start       stop time_left accept
1  1           3          27 1304095698 1304095758      1754      1
2  2           4          22 1304095188 1304095206      2306      1
head(solutions,2)
  id problem_id subject_id      start       stop time_left answer
1  1        156         29 1304095119 1304095169      2343      B
2  2        269         25 1304095119 1304095183      2329      C
sum(is.na(reviews$time_left))
[1] 84
table(is.na(reviews$time_left))
FALSE  TRUE
  115    84
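Because `is.na()` returns a logical vector, the usual summaries apply to it directly. The sketch below uses a made-up vector `timeLeft` and data frame `toy` as stand-ins for a column with missing values; `complete.cases()` is a base R function that flags rows with no NAs.

```r
# Hypothetical stand-in for a column with missing values
timeLeft <- c(1754, 2306, NA, 1203, NA)

nMissing    <- sum(is.na(timeLeft))    # how many values are missing
fracMissing <- mean(is.na(timeLeft))   # what fraction is missing
whereNA     <- which(is.na(timeLeft))  # positions of the missing values

# Rows of a data frame that have no missing values in any column
toy <- data.frame(time_left = timeLeft, accept = c(1, 1, NA, 0, 1))
complete <- complete.cases(toy)

nMissing
fracMissing
whereNA
complete
```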
table(c(0,1,2,3,NA,3,3,2,2,3))
0 1 2 3
1 1 3 4
table(c(0,1,2,3,NA,3,3,2,2,3),useNA="ifany")
   0    1    2    3 <NA>
   1    1    3    4    1
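The `useNA` argument of `table()` accepts "no" (the default), "ifany", and "always". A short sketch of the difference, using the vector from the slide plus a made-up vector `noNA` that has no missing values:

```r
x    <- c(0, 1, 2, 3, NA, 3, 3, 2, 2, 3)
noNA <- c(0, 1, 2, 3)

# The default silently drops missing values
sum(table(x))                          # 9, although length(x) is 10

# "ifany" adds an <NA> column only when NAs occur
sum(table(x, useNA = "ifany"))         # 10
length(table(noNA, useNA = "ifany"))   # 4

# "always" adds the column even when its count is zero
length(table(noNA, useNA = "always"))  # 5
```

Comparing `sum(table(x))` with `length(x)` is a cheap way to notice that values were dropped.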
colSums(reviews)
         id solution_id reviewer_id       start        stop
      19900       19929        5064          NA          NA
  time_left      accept
         NA          NA
colMeans(reviews,na.rm=TRUE)
         id solution_id reviewer_id       start        stop
  1.000e+02   1.001e+02   2.545e+01   1.304e+09   1.304e+09
  time_left      accept
  1.114e+03   6.435e-01
rowMeans(reviews,na.rm=TRUE)
 [1] 3.726e+08 3.726e+08 3.726e+08 3.726e+08 3.726e+08 3.726e+08
 [7] 3.726e+08 1.300e+01 3.726e+08 3.726e+08 3.726e+08 3.726e+08
[13] 3.726e+08 3.726e+08 3.726e+08 3.726e+08 1.967e+01 3.726e+08
[19] 3.726e+08 1.933e+01 3.726e+08 3.726e+08 3.726e+08 2.433e+01
[25] 2.367e+01 2.367e+01 3.726e+08 3.726e+08 3.726e+08 3.726e+08
[31] 3.726e+08 3.726e+08 3.726e+08 3.726e+08 3.133e+01 3.726e+08
[37] 3.267e+01 3.726e+08 3.400e+01 3.726e+08 3.200e+01 3.726e+08
[output truncated]
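In the row means above, most values are about 3.726e+08 while a few fall between 13 and 34. A plausible reading, not stated on the slide, is that rows with missing timestamps average only their small id columns once `na.rm=TRUE` drops the NAs. A minimal sketch with a made-up data frame `toy` shows the mechanism:

```r
# Hypothetical miniature data frame: row 2 is missing its timestamp and outcome
toy <- data.frame(
  id     = c(1, 2, 3),
  start  = c(1304095698, NA, 1304095188),
  accept = c(1, NA, 1)
)

sums      <- colSums(toy)                 # NA for every column containing an NA
sumsClean <- colSums(toy, na.rm = TRUE)   # sums over the observed values only
rowAvg    <- rowMeans(toy, na.rm = TRUE)  # row 2 averages only its id

sums
sumsClean
rowAvg
```

Row 2 ends up on a completely different scale from rows 1 and 3, which is the same pattern as in the printed output. Averaging across columns measured in different units is rarely meaningful, so row summaries like this are mainly useful for spotting such rows.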