NB 7
• Pandas informational commands, such as unique, describe, head, tail, min, max, loc, iloc, etc.
• Notes on JOINS with sketches.
• Lambda function to format the year: 4 digit -> 2 digit
• Using apply() to calculate prevalence and to canonicalize a table
• Function that determines if two tables are equal (including permutations)
• Data frame melting
• Data frame casting
• Function to separate and unite a column
• TB dataset: data cleaning and formatting
Tidy data and the Pandas module
This notebook accompanies Topic 7, which is about "tidying data," or cleaning up tabular data for analysis purposes. It also introduces one of the most
important Python modules for data analysis: Pandas! (not the bear)
Exercise 0 (ungraded). Run the code cell below to download the data. (This code will check if each dataset has already been downloaded and, if so,
avoid re-downloading it.)
import os
import requests

datasets = ['table1.csv', 'table2.csv', 'table3.csv', 'table4a.csv',
            'table4b.csv', 'iris.csv', 'who.csv', 'who2_soln.csv', 'who3_soln.csv']
for file in datasets:
    if not os.path.exists(file):
        url = 'https://cse6040.gatech.edu/datasets/{}'.format(file)
        print("Downloading: {} ...".format(url))
        r = requests.get(url)
        with open(file, 'w', encoding=r.encoding) as f:
            f.write(r.text)
    print("'{}' is ready!".format(file))
'table4b.csv' is ready!
'who2_soln.csv' is ready!
'who.csv' is ready!
'table4a.csv' is ready!
'iris.csv' is ready!
'table1.csv' is ready!
'table3.csv' is ready!
'who3_soln.csv' is ready!
'table2.csv' is ready!
You may find it helpful to also refer to the original materials on which this lab is based:
Representation 1:

         Pregnant  Not pregnant
Male            0             5
Female          1             4

Representation 2:

Gender  Pregnant  Count
Male    Yes           0
Male    No            5
Female  Yes           1
Female  No            4
These are two entirely equivalent ways of representing the same data. However, each may be suited to a particular task.
For instance, Representation 1 is a typical input format for statistical routines that implement Pearson's χ²-test, which can check for independence of
factors. (Are gender and pregnancy status independent?) By contrast, Representation 2 might be better suited to regression. (Can you predict the count
from gender and pregnancy status?)
While Representation 1 has its uses (http://simplystatistics.org/2016/02/17/non-tidy-data/), Wickham argues that Representation 2 is often the
more general way to supply data to a wide variety of statistical analysis and visualization tasks. He refers to Representation 2 as tidy and Representation 1 as
untidy or messy.
The term "messy" is, as Wickham states, not intended to be perjorative since "messy" representations may be exactly the right ones fo
particular analysis tasks, as noted above.
Definition: Tidy datasets. More specifically, Wickham defines a tidy data set as one that can be organized into a 2-D table such that
1. each column represents a variable;
2. each row represents an observation; and
3. each cell holds a single value.
Here is a visual schematic of this definition, taken from another source (http://r4ds.had.co.nz/images/tidy-1.png):
This definition appeals to a statistician's intuitive idea of the data he or she wishes to analyze. It is also consistent with tasks that seek to establish a
relationship between some response (output) variable and one or more independent variables.
A computer scientist with a machine learning outlook might refer to columns as features and rows as data points, especially when all values
are numerical (ordinal or continuous).
Definition: Tibbles. Here's one more bit of terminology: if a table is tidy, we will call it a tidy table, or tibble, for short.
In this part of this notebook, let's look at how Pandas works and can help us store Tidy data.
You may find this introduction to the Pandas module's data structures useful for reference:
https://pandas.pydata.org/pandas-docs/stable/dsintro.html
Consider the famous Iris data set (https://en.wikipedia.org/wiki/Iris_flower_data_set). It consists of 50 samples from each of three species of Iris (Iris setosa,
Iris virginica, and Iris versicolor). Four features were measured from each sample: the lengths and the widths of the sepals
(https://en.wikipedia.org/wiki/Sepal) and petals (https://en.wikipedia.org/wiki/Petal).
The following code uses Pandas to read and represent this data in a Pandas data frame object, stored in a variable named irises.
import pandas as pd
from IPython.display import display

irises = pd.read_csv('iris.csv')
print("=== Iris data set: {} rows x {} columns. ===".format(irises.shape[0], irises.shape[1]))
display(irises.head())
In a Pandas data frame, every column has a name (stored as a string) and all values within the column must have the same primitive type. This property makes
columns different from, for instance, Python lists.
In addition, every row has a special column, called the data frame's index. (Try printing irises.index.) Any particular index value serves as a name for its
row; these index values are usually integers but can be more complex types, like tuples.
In [3]: print(irises.index)
Separate from the index values (row names), you can also refer to rows by their integer offset from the top, where the first row has an offset of 0 and the last
row has an offset of n-1 if the data frame has n rows. You'll see that in action in Exercise 1, below.
Exercise 1 (ungraded). Run the following commands to understand what each one does. If it's not obvious, try reading the Pandas documentation
(http://pandas.pydata.org/) or going online to get more information.
irises.describe()
irises['sepal length'].head()
irises[["sepal length", "petal width"]].head()
irises.iloc[5:10]
irises[irises["sepal length"] > 5.0]
irises["sepal length"].max()
irises['species'].unique()
irises.sort_values(by="sepal length", ascending=False).head(1)
irises.sort_values(by="sepal length", ascending=False).iloc[5:10]
irises.sort_values(by="sepal length", ascending=False).loc[5:10]
irises['x'] = 3.14
irises.rename(columns={'species': 'type'})
del irises['x']
irises['x'] = 3.14
print("\n=== `irises['x'] = 3.14`: Creates a new column (variable) named ‘x’ and sets all values
= 3.14 ===\n\n{}".format(irises.head()))
del irises['x']
print("\n=== `del irises['x']`: Removes a column ===\n\n{}".format(irises.head()))
=== `irises['sepal length'].head()`: Dumps the first few rows of a given column ===
0 5.1
1 4.9
2 4.7
3 4.6
4 5.0
Name: sepal length, dtype: float64
=== `irises[["sepal length", "petal width"]].head()`: Dumps the first few rows of several specif
s ===
=== `irises.iloc[5:10]`: Selects rows at a certain integer offset and range ===
=== `irises[irises["sepal length"] > 5.0]`: Selects the subset of rows satisfying some condition
here sepal length is strictly more than 5) ===
sepal length sepal width petal length petal width species
0 5.1 3.5 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
10 5.4 3.7 1.5 0.2 Iris-setosa
14 5.8 4.0 1.2 0.2 Iris-setosa
15 5.7 4.4 1.5 0.4 Iris-setosa
16 5.4 3.9 1.3 0.4 Iris-setosa
17 5.1 3.5 1.4 0.3 Iris-setosa
18 5.7 3.8 1.7 0.3 Iris-setosa
19 5.1 3.8 1.5 0.3 Iris-setosa
20 5.4 3.4 1.7 0.2 Iris-setosa
21 5.1 3.7 1.5 0.4 Iris-setosa
23 5.1 3.3 1.7 0.5 Iris-setosa
27 5.2 3.5 1.5 0.2 Iris-setosa
28 5.2 3.4 1.4 0.2 Iris-setosa
31 5.4 3.4 1.5 0.4 Iris-setosa
32 5.2 4.1 1.5 0.1 Iris-setosa
33 5.5 4.2 1.4 0.2 Iris-setosa
36 5.5 3.5 1.3 0.2 Iris-setosa
39 5.1 3.4 1.5 0.2 Iris-setosa
44 5.1 3.8 1.9 0.4 Iris-setosa
46 5.1 3.8 1.6 0.2 Iris-setosa
48 5.3 3.7 1.5 0.2 Iris-setosa
50 7.0 3.2 4.7 1.4 Iris-versicolor
51 6.4 3.2 4.5 1.5 Iris-versicolor
52 6.9 3.1 4.9 1.5 Iris-versicolor
53 5.5 2.3 4.0 1.3 Iris-versicolor
54 6.5 2.8 4.6 1.5 Iris-versicolor
55 5.7 2.8 4.5 1.3 Iris-versicolor
56 6.3 3.3 4.7 1.6 Iris-versicolor
58 6.6 2.9 4.6 1.3 Iris-versicolor
.. ... ... ... ... ...
120 6.9 3.2 5.7 2.3 Iris-virginica
121 5.6 2.8 4.9 2.0 Iris-virginica
122 7.7 2.8 6.7 2.0 Iris-virginica
123 6.3 2.7 4.9 1.8 Iris-virginica
124 6.7 3.3 5.7 2.1 Iris-virginica
125 7.2 3.2 6.0 1.8 Iris-virginica
126 6.2 2.8 4.8 1.8 Iris-virginica
127 6.1 3.0 4.9 1.8 Iris-virginica
128 6.4 2.8 5.6 2.1 Iris-virginica
129 7.2 3.0 5.8 1.6 Iris-virginica
130 7.4 2.8 6.1 1.9 Iris-virginica
131 7.9 3.8 6.4 2.0 Iris-virginica
132 6.4 2.8 5.6 2.2 Iris-virginica
133 6.3 2.8 5.1 1.5 Iris-virginica
134 6.1 2.6 5.6 1.4 Iris-virginica
135 7.7 3.0 6.1 2.3 Iris-virginica
136 6.3 3.4 5.6 2.4 Iris-virginica
137 6.4 3.1 5.5 1.8 Iris-virginica
138 6.0 3.0 4.8 1.8 Iris-virginica
139 6.9 3.1 5.4 2.1 Iris-virginica
140 6.7 3.1 5.6 2.4 Iris-virginica
141 6.9 3.1 5.1 2.3 Iris-virginica
142 5.8 2.7 5.1 1.9 Iris-virginica
143 6.8 3.2 5.9 2.3 Iris-virginica
144 6.7 3.3 5.7 2.5 Iris-virginica
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica
=== `irises["sepal length"].max()`: Returns the largest value of a given column ===
7.9
`i i " " i ` i
=== `irises.sort_values(by="sepal length", ascending=False).loc[5:10]`: Returns the observations
the one whose row ID is 5 and the one that is 10, in order of sepal-length, 5 and 10 are inclusi
=== `irises['x'] = 3.14`: Creates a new column (variable) named 'x' and sets all values in the column to 3.14 ===
       x
0   3.14
1   3.14
2   3.14
3   3.14
4   3.14
..   ...
145 3.14
146 3.14
147 3.14
148 3.14
149 3.14
In Pandas, you can perform this merge using the .merge() function (http://pandas.pydata.org/pandas-
docs/stable/generated/pandas.DataFrame.merge.html):
In this call, the on= parameter specifies the list of column names to use to align or "match" the two tables, A and B. By default, merge() will only keep
rows from A and B where all keys match between the two tables.
=== A ===
=== B ===
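The merge cell and the full contents of A and B are not visible in this extract. Given that the merged result C, used later in the two-digit year example, has the columns country, year, cases, and population, the call presumably resembled the following sketch:

C = A.merge(B, on=['country', 'year'])
print("\n=== C = merge(A, B) ===")
display(C)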
Joins. This default behavior of keeping only rows that match both input frames is an example of what relational database systems call an inner-
operation. But there are several other types of joins.
Inner-join (A, B) (default): Keep only rows of A and B where the on-keys match in both.
Outer-join (A, B): Keep all rows of both frames, but merge rows when the on-keys match. For non-matches, fill in missing values with not-a-number
(NaN) values.
Left-join (A, B): Keep all rows of A. Only merge rows of B whose on-keys match A.
Right-join (A, B): Keep all rows of B. Only merge rows of A whose on-keys match B.
You can select the type of join using merge's how=... parameter, which takes the (string) values 'inner', 'outer', 'left', and 'right'. Here are some examples of these
types of joins.
with StringIO("""x,y,w
hug,-1,e
smug,-2,e
rug,-3,e
tug,-4,e
bug,1,e""") as fp:
E = pd.read_csv(fp)
print("\n=== E ===")
display(E)
=== D ===

     x  y  z
0  bug  1  d
1  rug  2  d
2  lug  3  d
3  mug  4  d
=== E ===

      x  y  w
0   hug -1  e
1  smug -2  e
2   rug -3  e
3   tug -4  e
4   bug  1  e
=== Outer-join (D, E) ===

      x  y    z    w
0   bug  1    d    e
1   rug  2    d  NaN
2   lug  3    d  NaN
3   mug  4    d  NaN
4   hug -1  NaN    e
5  smug -2  NaN    e
6   rug -3  NaN    e
7   tug -4  NaN    e

=== Left-join (D, E) ===

     x  y  z    w
0  bug  1  d    e
1  rug  2  d  NaN
2  lug  3  d  NaN
3  mug  4  d  NaN

=== Right-join (D, E) ===

      x  y    z  w
0   bug  1    d  e
1   hug -1  NaN  e
2  smug -2  NaN  e
3   rug -3  NaN  e
4   tug -4  NaN  e
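The cells that produced these outputs are not shown in this extract. A minimal sketch that reproduces them, assuming the frames D and E defined above and that the merge keys are the shared columns x and y:

for how in ['outer', 'left', 'right']:
    print("\n=== {}-join (D, E) ===".format(how.capitalize()))
    display(D.merge(E, on=['x', 'y'], how=how))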
For instance, suppose we wish to convert the year column in C into an abbreviated two-digit form. The following code will do it:
In [9]: display(C)
In [10]: G = C.copy()  # Without .copy(), modifying G would also modify the original data frame C
G['year'] = G['year'].apply(lambda x: "'{:02d}".format(x % 100))
display(G)
Here, x % 100 keeps just the last two digits of the year (the remainder, e.g., 1999 % 100 == 99), and the format string prepends an apostrophe:

       country year  cases  population
0  Afghanistan  '99    745    19987071
Exercise 2 (2 points). Suppose you wish to compute the prevalence, which is the ratio of cases to the population.
However, for this exercise, try to figure out how to use apply() to do it instead. To figure that out, you'll need to consult the documentation for apply()
(http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html) or go online to find some hints.
Implement your solution in a function, calc_prevalence(G), which given G returns a new copy H that has a column named 'prevalence' holding the
correctly computed prevalence values.
Note 0. The emphasis on "new copy" is there to remind you that your function should not modify the input dataframe, G.
Note 1. Although there is the easy solution above, the purpose of this exercise is to force you to learn more about how apply() works
that you can "apply" it in more settings in the future.
def calc_prevalence(G):
    def calc_ratio(row):  # prevalence of one observation (row)
        return row['cases'] / row['population']
    H = G.copy()
    H['prevalence'] = H.apply(calc_ratio, axis=1)
    return H
### END SOLUTION
G_copy = G.copy()
H = calc_prevalence(G)
display(H) # Displayed `H` should have a 'prevalence' column
print("Prevalance values seem correct. But did you use `apply()?` Let's see...")
print("\n(Passed!)")
Prevalence values seem correct. But did you use `apply()`? Let's see...
You used `apply()`. You may have even used it as intended.
(Passed!)
Exercise 3 (3 points). Write a function, canonicalize_tibble(X), that, given a tibble X, returns a new copy Y of X in canonical order. We say Y is in
canonical order if it has the following properties.
1. The variables appear in sorted order by name, ascending from left to right.
2. The rows appear in lexicographically sorted order by variable, ascending from top to bottom.
3. The row labels (Y.index) go from 0 to n-1, where n is the number of observations.
For example, here is a tibble:

     c  a  b
2  hat  x  1
0  rat  y  4
3  cat  x  2
1  bat  x  2

and here is its canonical version:

   a  b    c
0  x  1  hat
1  x  2  bat
2  x  2  cat
3  y  4  rat
A partial solution appears below, which ensures that Property 1 above holds. Complete the solution to ensure Properties 2 and 3 hold. Feel free to consult
the Pandas API (http://pandas.pydata.org/pandas-docs/stable/api.html).
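The partial solution cell itself is not legible in this extract. A minimal sketch of a complete solution, using only standard Python and Pandas calls:

def canonicalize_tibble(X):
    # Property 1: the variables (columns) appear in sorted order by name
    var_names = sorted(X.columns)
    Y = X[var_names].copy()
    # Property 2: the rows appear lexicographically sorted by the variables
    Y.sort_values(by=var_names, inplace=True)
    # Property 3: the row labels go from 0 to n-1
    Y.reset_index(drop=True, inplace=True)
    return Y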
# Test input
canonical_in_csv = """,c,a,b
2,hat,x,1
0,rat,y,4
3,cat,x,2
1,bat,x,2"""
with StringIO(canonical_in_csv) as fp:
    canonical_in = pd.read_csv(fp, index_col=0)
canonical_out = canonicalize_tibble(canonical_in)
print("=== Your computed solution ===")
display(canonical_out)
print("")
print ("\n(Passed.)")
=== Input ===

     c  a  b
2  hat  x  1
0  rat  y  4
3  cat  x  2
1  bat  x  2

=== Your computed solution ===

   a  b    c
0  x  1  hat
1  x  2  bat
2  x  2  cat
3  y  4  rat

=== Expected solution ===

   a  b    c
0  x  1  hat
1  x  2  bat
2  x  2  cat
3  y  4  rat
(Passed.)
Exercise 4 (1 point). Write a function, tibbles_are_equivalent(A, B), to determine if two tibbles, A and B, are equivalent. "Equivalent" means A
and B have identical variables and observations, up to permutations. If A and B are equivalent, then the function should return True. Otherwise, it should
return False.

The last condition, "up to permutations," means that the variables and observations might not appear in the table in the same order. For example, the
following two tibbles are equivalent:
a  b    c
x  1  hat
y  2  cat
z  3  bat
w  4  rat

b    c  a
2  cat  y
3  bat  z
1  hat  x
4  rat  w
By contrast, the following table would not be equivalent to either of the above tibbles.
a b c
2 y cat
3 z bat
1 x hat
4 w rat
Note: Unlike Pandas data frames, tibbles conceptually do not have row labels. So you should ignore row labels.
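One natural implementation (a sketch, not necessarily the official solution) reuses canonicalize_tibble() from Exercise 3: two tibbles are equivalent exactly when their canonical forms are identical.

def tibbles_are_equivalent(A, B):
    # Canonicalization erases differences in row order, column order,
    # and row labels, so equivalent tibbles canonicalize identically
    return canonicalize_tibble(A).equals(canonicalize_tibble(B))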
obs_ind_orig = list(range(A.shape[0]))
var_names = list(A.columns)
obs_ind = obs_ind_orig.copy()
while obs_ind == obs_ind_orig:
random.shuffle(obs_ind)
B = A[var_names].copy()
B = B.iloc[obs_ind]
assert tibbles_are_equivalent(A, B)
assert not tibbles_are_equivalent(A, C)
assert not tibbles_are_equivalent(B, C)
print ("\n(Passed.)")
=== Tibble A ===

a b c
0 x 1 hat
1 y 2 cat
2 z 3 bat
3 w 4 rat
=== Tibble B == A ===
c a b
0 hat x 1
2 bat z 3
3 rat w 4
1 cat y 2
=== Tibble C (not equivalent to A) ===

c a b
0 x 1 hat
1 y 2 cat
2 z 3 bat
3 w 4 rat
(Passed.)
Melting
First, values of a variable sometimes appear as column names. Table 4a is an example. To tidy up, you want to turn those columns into rows:
Because this operation takes columns into rows, making a "fat" table more tall and skinny, it is sometimes called melting.
1. Extract the column values into a new variable. In this case, columns "1999" and "2000" of table4 need to become the values of the variable
"year".
2. Convert the values associated with the column values into a new variable as well. In this case, the values formerly in columns "1999" and "2000"
become the values of the "cases" variable.
In the context of a melt, let's also refer to "year" as the new key variable and "cases" as the new value variable.
A melt operation, melt(df, col_vals, key, value), needs the following inputs:
df: the input data frame, e.g., table4 in the example above;
col_vals: a list of the column names that will serve as values of the new key variable (columns "1999" and "2000" in the example);
key: the name of the new key variable, e.g., year in the example above;
value: the name of the new column to hold the values, e.g., cases in the example above.
You may need to refer to the Pandas documentation to figure out how to create and manipulate tables. The bits related to indexing
(http://pandas.pydata.org/pandas-docs/stable/indexing.html) and merging (http://pandas.pydata.org/pandas-docs/stable/merging.html) may
be especially helpful.
In [18]: def melt(df, col_vals, key, value):
    assert type(df) is pd.DataFrame
    ### BEGIN SOLUTION
    # Variables to keep unchanged: all columns *not* in col_vals
    # (e.g., 'country' in the example above)
    keep_vars = df.columns.difference(col_vals)
    melted_sections = []
    for c in col_vals:  # iterate over the value columns, e.g., '1999', then '2000'
        melted_c = df[keep_vars].copy()  # copy of the columns we want to keep
        melted_c[key] = c                # the new key variable (e.g., 'year') gets the column name
        melted_c[value] = df[c]          # the new value variable (e.g., 'cases') gets that column's values
        melted_sections.append(melted_c)
    # Stack the per-column sections and re-index the rows from 0 to n-1
    melted = pd.concat(melted_sections)
    melted.reset_index(drop=True, inplace=True)
    return melted
    ### END SOLUTION
table4a = pd.read_csv('table4a.csv')
print("\n=== table4a ===")
display(table4a)
table4b = pd.read_csv('table4b.csv')
print("\n=== table4b ===")
display(table4b)
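The cell that builds m_4, checked against table1 below, is not shown in this extract. A plausible reconstruction (the intermediate names m_4a and m_4b are assumptions) melts both tables and merges the results on country and year:

m_4a = melt(table4a, col_vals=['1999', '2000'], key='year', value='cases')
m_4b = melt(table4b, col_vals=['1999', '2000'], key='year', value='population')
m_4 = pd.merge(m_4a, m_4b, on=['country', 'year'])
m_4['year'] = m_4['year'].apply(int)  # melted years are strings (column names); table1 stores integers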
table1 = pd.read_csv('table1.csv')
print ("=== table1 (target solution) ===")
display(table1)
assert tibbles_are_equivalent(table1, m_4)
print ("\n(Passed.)")
(Passed.)
Casting
The second most common issue is that an observation might be split across multiple rows. Table 2 is an example. To tidy up, you want to merge those
rows.
Because this operation is the moral opposite of melting, and "rebuilds" observations from parts, it is sometimes called casting.
Melting and casting are Wickham's terms from his original paper on tidying data (http://www.jstatsoft.org/v59/i10/paper). In his more recent
writing, on which this tutorial is based (http://r4ds.had.co.nz/tidy-data.html), he refers to melting as gathering. Similarly, the term "casting"
comes from Wickham's original paper, whereas his more recent summaries use the term spreading.
The signature of a cast is similar to that of melt. However, you only need to know the key, which is the column of the input table containing new variable names,
and the value, which is the column containing the corresponding values.
Exercise 6 (4 points). Implement a function to cast a data frame into a tibble, given a key column containing new variable names and a value column
containing the corresponding cells.
We've given you a partial solution that
verifies that the given key and value columns are actual columns of the input data frame;
computes the list of columns, fixed_vars, that should remain unchanged; and
initializes an empty tibble.
Observe that we are asking your cast() to accept an optional parameter, join_how, that may take the values 'outer' or 'inner' (with 'outer' the
default). Why do you need such a parameter?
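The solution cell for cast() is illegible in this extract, apart from its final return tibble. Here is one plausible implementation, consistent with the scaffold described above (a sketch, assuming pandas is imported as pd, not the instructor's exact code). As for why join_how matters: with an outer join, observations missing a value for some new variable are kept and padded with NaNs, whereas an inner join silently drops them.

def cast(df, key, value, join_how='outer'):
    """Casts the input data frame into a tibble, given the key column
    containing new variable names and the value column containing the
    corresponding cells."""
    assert type(df) is pd.DataFrame
    assert key in df.columns and value in df.columns
    assert join_how in ['outer', 'inner']
    fixed_vars = df.columns.difference([key, value])
    # This sketch seeds the result with the first group (rather than merging
    # into the empty starter tibble) so an inner join is not trivially empty
    tibble = None
    for v in df[key].unique():  # each distinct key is a new variable
        df_v = df[df[key] == v].copy()
        del df_v[key]
        df_v = df_v.rename(columns={value: v})
        if tibble is None:
            tibble = df_v
        else:
            tibble = tibble.merge(df_v, on=list(fixed_vars), how=join_how)
    return tibble

For table2 below, the call would presumably be cast(table2, 'type', 'count').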
table2 = pd.read_csv('table2.csv')
print('=== table2 ===')
display(table2)
(Passed.)
Separating variables
Consider the following table.
       country  year           rate
0  Afghanistan  1999   745/19987071
1  Afghanistan  2000  2666/20595360
In this table, the rate variable combines what had previously been the cases and population data (cases in the numerator, population in the
denominator). This example is an instance in which we need to separate a column into two variables.
How will the separation happen? The caller should provide a function, splitter(x), that, given a value, returns a list containing its components. Note
that the partial solution below defines a default splitter, which uses the regular expression (\d+\.?\d+) to find all integer or floating-point values in the
input string x.
In [23]: import re

def default_splitter(text):
    """Searches the given string for all integer and floating-point
    values, returning them as a list _of strings_.
    """
    return re.findall('(\d+\.?\d+)', text)

def separate(df, key, into, splitter=default_splitter):
    """Given a data frame, separates one of its columns, the key,
    into new variables named by the list `into`."""
    assert type(df) is pd.DataFrame
    assert key in df.columns
    # Hint: http://stackoverflow.com/questions/16236684/apply-pandas-function-to-column-to-create-multiple-new-columns
    ### BEGIN SOLUTION
    def apply_splitter(text):
        # Return the split fields as a Series whose index supplies
        # the new column names (taken from `into`)
        fields = splitter(text)
        return pd.Series({into[i]: f for i, f in enumerate(fields)})

    fixed_vars = df.columns.difference([key])     # variables that won't change
    tibble = df[fixed_vars].copy()                # copy the fixed part of the frame
    tibble_extra = df[key].apply(apply_splitter)  # one new column per split field
    return pd.concat([tibble, tibble_extra], axis=1)
    ### END SOLUTION
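The cell that applies separate() is not shown in this extract; given the checks below, it presumably resembles:

table3 = pd.read_csv('table3.csv')
tibble3 = separate(table3, key='rate', into=['cases', 'population'])
display(tibble3)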
tibble3['cases'] = tibble3['cases'].apply(int)
tibble3['population'] = tibble3['population'].apply(int)
assert tibbles_are_equivalent(tibble3, table1)
print("\n(Passed.)")
(Passed.)
Exercise 7 (2 points). Implement the inverse of separate, which is unite. This function should take a data frame (df), the set of columns to combine
(cols), the name of the new column (new_var), and a function that takes the subset of the cols variables from a single observation and combines them
into a single value. It should return a new data frame with the cols columns replaced by the combined new_var column.
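A minimal sketch of unite, assuming the signature described above (the parameter name combine is an assumption):

def unite(df, cols, new_var, combine):
    assert type(df) is pd.DataFrame
    # Keep every variable except the ones being combined
    fixed_vars = df.columns.difference(cols)
    table = df[fixed_vars].copy()
    # Apply `combine` to the `cols` values of each observation (row)
    table[new_var] = df[cols].apply(combine, axis=1)
    return table

For instance, unite(tibble3, ['cases', 'population'], 'rate', lambda row: '{}/{}'.format(row['cases'], row['population'])) would invert the separation performed above.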
print("\n(Passed.)")
(Passed.)
country iso2 iso3 year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544 new_sp_m4554 new
101 Algeria DZ DZA 2013 NaN NaN NaN NaN NaN NaN
770 Bermuda BM BMU 2002 NaN NaN NaN NaN NaN NaN
951 Brazil BR BRA 2009 328.0 4621.0 6399.0 5291.0 5058.0 284
2108 Egypt EG EGY 2006 54.0 542.0 728.0 563.0 587.0 340
6280 Thailand TH THA 1984 NaN NaN NaN NaN NaN NaN
5 rows × 60 columns
The data set has 7,240 rows and 60 columns. Here is how to decode the columns.
Columns 'country', 'iso2', and 'iso3' are different ways to designate the country and are redundant, meaning you only really need to keep one
of them.
Column 'year' is the year of the report and is a natural variable.
Among columns 'new_sp_m014' through 'newrel_f65', the 'new...' prefix indicates that the column's values count new cases of TB. In this
particular data set, all the data are for new cases.
The short codes, rel, ep, sn, and sp describe the type of TB case. They stand for relapse, extrapulmonary, pulmonary not detectable by a
pulmonary smear test ("smear negative"), and pulmonary detectable by such a test ("smear positive"), respectively.
The codes 'm' and 'f' indicate the gender (male and female, respectively).
The trailing numeric code indicates the age group: 014 is 0-14 years of age, 1524 for 15-24 years, 2534 for 25-34 years, etc., and 65 stands for 65
years or older.
In other words, it looks like you are likely to want to treat all the columns as values of multiple variables!
Exercise 8 (3 points). As a first step, start with who_raw and create a new data frame, who2, with the following properties:
All the 'new...' columns of who_raw become values of a single variable, case_type. Store the counts associated with each case_type value
as a new variable called 'count'.
Remove the iso2 and iso3 columns, since they are redundant with country (which you should keep!).
Keep the year column as a variable.
Remove all not-a-number (NaN) counts. Hint: You can test for a NaN using Python's math.isnan() (https://docs.python.org/3/library/math.html).
Convert the counts to integers. (Because of the presence of NaNs, the counts will otherwise be treated as floating-point values, which is
undesirable since you do not expect to see non-integer counts.)
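Only the NaN-removal step of the solution survives below. The preceding steps might look like this sketch, reusing melt() from Exercise 5 (the exact solution cell is not shown):

from math import isnan

# Melt every 'new...' column into a (case_type, count) pair of variables
col_vals = list(who_raw.columns.difference(['country', 'iso2', 'iso3', 'year']))
who2 = melt(who_raw, col_vals=col_vals, key='case_type', value='count')
# Drop the redundant country designations, keeping 'country'
del who2['iso2']
del who2['iso3']

After removing the NaNs (next cell), the counts can be converted with who2['count'] = who2['count'].apply(int).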
# Remove NaNs
who2 = who2[who2['count'].apply(lambda x: not isnan(x))]
# Check it
assert tibbles_are_equivalent(who2, who2_soln)
print ("\n(Passed.)")
(Passed.)
Exercise 9 (5 points). Starting from your who2 data frame, create a new tibble, who3, for which each 'case_type' key value is split into three new
variables:
'type', to hold the TB type, having possible values of rel, ep, sn, and sp;
'gender', to hold the gender as a string having possible values of female and male; and
'age_group', to hold the age group as a string having possible values of 0-14, 15-24, 25-34, 35-44, 45-54, 55-64, and 65+.
The input data file is large enough that your solution might take a minute to run. But if it appears to be taking much more than that, you may
want to revisit your approach.
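The opening of the solution's who_splitter() function is cut off in this extract. A plausible reconstruction (the regular expression here is an assumption, not necessarily the instructor's) extracts the type, gender, and age fields from a case_type value such as 'new_sp_m014' or 'newrel_f65':

def who_splitter(text):
    # Hypothetical: 'new' with an optional underscore, then the TB type,
    # the gender letter, and the age-group digits
    m = re.match(r'new_?(rel|ep|sn|sp)_([mf])(\d+)$', text)
    fields = list(m.groups())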
    if fields[1] == 'f':
        fields[1] = 'female'
    elif fields[1] == 'm':
        fields[1] = 'male'
    if fields[2] == '014':
        fields[2] = '0-14'
    elif fields[2] == '65':
        fields[2] = '65+'
    elif len(fields[2]) == 4 and fields[2].isdigit():
        # The first two digits are the lower bound of the age group and
        # the last two are the upper bound, e.g., '1524' -> '15-24'
        fields[2] = fields[2][0:2] + '-' + fields[2][2:4]
    return fields
who3 = separate(who2,
                key='case_type',
                into=['type', 'gender', 'age_group'],
                splitter=who_splitter)

# Save this as the reference solution (master notebook only)
#who3.to_csv('who3_soln.csv', index=False)
### END SOLUTION
who3_soln = pd.read_csv('who3_soln.csv')
print("\n=== First few rows of the instructor's solution ===")
display(who3_soln.head())