Below is what I needed to do to get to the part where I attempt to use seaborn's barplot.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import statsmodels.api as sm
import numpy as np
da = pd.read_csv("nhanes_2015_2016.csv")
da["DMDMARTL"] = da.DMDMARTL.fillna("Missing")
da["DMDMARTLdescript"] = da.DMDMARTL.replace({1: "Married", 2: "Widowed", 3: "Divorced", 4: "Separated", 5: "Never married",
6: "Living with partner", 77: "Refused", 99: "Don't know"})
da["RIAGENDRx"] = da.RIAGENDR.replace({1: "Male", 2: "Female"})
da["agegrp"] = pd.cut(da.RIDAGEYR, [10, 20, 30, 40, 50, 60, 70, 80])
I pieced together bits of code here and there and arrived at what I have below.
y = "prop"
dx = da.loc[~da.RIAGENDRx.isin(["Male"]), :]
plt.figure(figsize=(12, 5))
prop_df = (dx["agegrp"]
.groupby(dx["DMDMARTLdescript"])
.value_counts(normalize=True)
.rename(y)
.reset_index())
sns.barplot(x="agegrp", y=y, hue="DMDMARTLdescript", data=prop_df)
The result of running the code above is the following
I have the following issues with the plot it generates.
Although I asked for the proportions to be normalized (`normalize=True`), the image makes it fairly obvious that the bars in each age group sum to more than 1.
The age groups are ordered along the x axis in a somewhat arbitrary way. I am not sure how to put them in numerical order.
(The CSV file is publicly available here: github link.)
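A minimal sketch of why the bars can exceed 1 within an age group, using a hypothetical toy frame rather than the NHANES file (the column name `status` and the string age labels are assumptions made for illustration). Grouping by marital status and then calling `value_counts(normalize=True)` normalizes within each marital status, so the proportions sum to 1 per status, not per age group.

```python
import pandas as pd

# Hypothetical toy data standing in for the NHANES columns.
toy = pd.DataFrame({
    "agegrp": ["20-30", "20-30", "20-30", "20-30", "30-40", "30-40"],
    "status": ["Married", "Married", "Married", "Divorced", "Married", "Divorced"],
})

# Same shape of pipeline as in the question: group by status, count age groups.
prop = (toy["agegrp"]
        .groupby(toy["status"])
        .value_counts(normalize=True)
        .rename("prop")
        .reset_index())

# Proportions sum to 1 within each status ...
per_status = prop.groupby("status")["prop"].sum()
# ... but not within each age group.
per_agegrp = prop.groupby("agegrp")["prop"].sum()

print(per_status)
print(per_agegrp)
```

Here the "20-30" bars add up to more than 1 and the "30-40" bars to less than 1, which matches the symptom described in the first issue.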
Run `print(prop_df.groupby(['DMDMARTLdescript']).sum())` and check. Also, please provide an actual sample, as we do not have your .csv file. See How to make good reproducible pandas examples.
The sum for "Missing" is indeed 1.
I ran `print(prop_df.groupby(['DMDMARTLdescript']).sum())` and I see what you mean. Is there a way to normalize so that the bars within each age group sum to 1 instead?
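One possible approach, sketched on randomly generated stand-in data rather than the real file: count rows per (age group, marital status) pair, then divide each count by the total for its age group. The helper `plot_props`, the `n` and `agegrp_label` columns, and the toy frame are all hypothetical names introduced for this illustration. Passing the interval labels to `order=` in their bin order is one way to get the x axis in numerical order.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the NHANES columns used in the question.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "RIDAGEYR": rng.integers(18, 81, size=500),
    "DMDMARTLdescript": rng.choice(["Married", "Widowed", "Divorced"], size=500),
})
toy["agegrp"] = pd.cut(toy.RIDAGEYR, [10, 20, 30, 40, 50, 60, 70, 80])

# Count rows for each (age group, marital status) combination.
counts = (toy.groupby(["agegrp", "DMDMARTLdescript"], observed=True)
             .size()
             .rename("n")
             .reset_index())

# Normalize within each age group so the bars of one group sum to 1.
group_totals = counts.groupby("agegrp", observed=True)["n"].transform("sum")
counts["prop"] = counts["n"] / group_totals

# String labels for the x axis, listed in the numerical order of the bins.
order = [str(iv) for iv in toy["agegrp"].cat.categories]
counts["agegrp_label"] = [str(iv) for iv in counts["agegrp"]]


def plot_props(df, order):
    """Draw the grouped bar chart; the imports are local so that the
    data preparation above runs without the plotting libraries."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    plt.figure(figsize=(12, 5))
    return sns.barplot(x="agegrp_label", y="prop", hue="DMDMARTLdescript",
                       data=df, order=order)
```

Calling `plot_props(counts, order)` should then draw the chart. Applied to the real data, the same steps would start from `dx` in place of `toy`.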