
Below is the setup I needed before the part where I try to use seaborn's barplot.

import matplotlib.pyplot as plt 
import seaborn as sns 
import pandas as pd 
import statsmodels.api as sm 
import numpy as np

da = pd.read_csv("nhanes_2015_2016.csv")

da["DMDMARTL"] = da.DMDMARTL.fillna("Missing")
da["DMDMARTLdescript"] = da.DMDMARTL.replace({1: "Married", 2: "Widowed", 3: "Divorced", 4: "Separated",
                                              5: "Never married", 6: "Living with partner",
                                              77: "Refused", 99: "Don't know"})

da["RIAGENDRx"] = da.RIAGENDR.replace({1: "Male", 2: "Female"})

da["agegrp"] = pd.cut(da.RIDAGEYR, [10, 20, 30, 40, 50, 60, 70, 80])

I pieced together bits of code here and there and arrived at what I have below.

y = "prop"
dx = da.loc[~da.RIAGENDRx.isin(["Male"]), :]
plt.figure(figsize=(12, 5))
prop_df = (dx["agegrp"]
       .groupby(dx["DMDMARTLdescript"])
       .value_counts(normalize=True)
       .rename(y)
       .reset_index())
sns.barplot(x="agegrp", y=y, hue="DMDMARTLdescript", data=prop_df)

Running the code above produces the following plot:

[Image: the resulting bar plot]

I have the following issues with the plot it generates:

  1. Although I asked for the proportions to be normalized (`normalize=True`), the plot makes it fairly obvious that the bars in each age group sum to more than 1.

  2. The age groups are ordered along the x axis in a somewhat arbitrary way. I am not sure how to put them in numerical order.

(The CSV file is publicly available here: github link.)

  • Concerning (1.): the normalization is done per DMDMARTLdescript value, i.e. all "Divorced" cases sum up to 1.
    Commented Jan 20, 2019 at 21:35
  • Please print output of print(prop_df.groupby(['DMDMARTLdescript']).sum()) and check. And please provide actual sample as we do not have your .csv file. See How to make good reproducible pandas examples.
    – Parfait
    Commented Jan 20, 2019 at 22:46
  • @ImportanceOfBeingErnest thank you for your input. So I thought each age group is a divorced case and the sum of the bars in each age group would amount to 1. But I see that the red bar in [10,20] alone is already 1.
    – Blackwidow
    Commented Jan 21, 2019 at 13:40
  • Yes, because the red bar is the only case of "Missing", so sum("Missing") is indeed 1.
    Commented Jan 21, 2019 at 13:41
  • @ImportanceOfBeingErnest I ran print(prop_df.groupby(['DMDMARTLdescript']).sum()) and I see what you mean. Is there a way I can make the sum of the bars in each age group normalized instead?
    – Blackwidow
    Commented Jan 21, 2019 at 13:45
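
One possible way to get proportions that sum to 1 within each age group, sketched on a small made-up frame rather than the NHANES file. The `toy` data, the `status` column name, and the bin edges are assumptions for illustration only; `pd.crosstab` with `normalize="index"` is swapped in here in place of the `groupby(...).value_counts(normalize=True)` chain from the question.

```python
import pandas as pd

# Hypothetical toy data standing in for RIDAGEYR and DMDMARTLdescript.
toy = pd.DataFrame({
    "age": [25, 25, 25, 35, 35, 35, 35, 45],
    "status": ["Married", "Married", "Never married",
               "Married", "Divorced", "Divorced", "Never married",
               "Divorced"],
})
toy["agegrp"] = pd.cut(toy["age"], [20, 30, 40, 50])

# Rows are age groups, columns are statuses; normalize="index"
# divides each row by its total, so every age group sums to 1.
props = pd.crosstab(toy["agegrp"], toy["status"], normalize="index")

# Long form with one row per (agegrp, status) pair, the shape that
# barplot(x="agegrp", y="prop", hue="status", data=prop_df) expects.
prop_df = props.reset_index().melt(
    id_vars="agegrp", var_name="status", value_name="prop"
)

# Each age group's proportions should now add up to 1.
print(prop_df.groupby("agegrp", observed=True)["prop"].sum())
```

On the ordering issue, seaborn's barplot accepts an `order=` argument listing the x categories explicitly, so something like `order=list(props.index)` may be enough to force the numerical order if the default is not what you want.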
