Part 2: Transforming Categorical Data


11/3/21, 1:14 PM z - Jupyter Notebook

In [1]: import numpy as np
import pandas as pd
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt

In [2]: import subprocess
import sys

# Invoke pip via "python -m pip" rather than the deprecated pip.main() entry point
subprocess.check_call([sys.executable, "-m", "pip", "install", "seaborn"])

Requirement already satisfied: seaborn in /srv/conda/envs/notebook/lib/python3.6/site-packages (0.11.2)

Requirement already satisfied: numpy>=1.15 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from seaborn) (1.19.5)

Requirement already satisfied: scipy>=1.0 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from seaborn) (1.5.3)

Requirement already satisfied: pandas>=0.23 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from seaborn) (1.1.5)

Requirement already satisfied: matplotlib>=2.2 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from seaborn) (3.3.4)

Requirement already satisfied: kiwisolver>=1.0.1 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from matplotlib>=2.2->seaborn) (1.3.1)

Requirement already satisfied: pillow>=6.2.0 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from matplotlib>=2.2->seaborn) (8.3.2)

Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from matplotlib>=2.2->seaborn) (2.4.7)

Requirement already satisfied: python-dateutil>=2.1 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from matplotlib>=2.2->seaborn) (2.8.2)

Requirement already satisfied: cycler>=0.10 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from matplotlib>=2.2->seaborn) (0.11.0)

Requirement already satisfied: pytz>=2017.2 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from pandas>=0.23->seaborn) (2021.3)

Requirement already satisfied: six>=1.5 in /srv/conda/envs/notebook/lib/python3.6/site-packages (from python-dateutil>=2.1->matplotlib>=2.2->seaborn) (1.16.0)

Out[2]: 0

PART 2: TRANSFORMING CATEGORICAL DATA


In this part, you will practice how to:

Transform categorical data

Dataset 2
The dataset used in this part is population census data. It contains 48,842 rows and 15 features.

In [3]: from sklearn.preprocessing import OrdinalEncoder
import matplotlib.pyplot as plt
%matplotlib inline
from scipy.stats import ttest_ind, ttest_rel
from scipy import stats

In [4]: data = pd.read_csv("https://gitlab.com/andreass.bayu/file-directory/-/raw/main/adult.csv", na_values="?")
print(f"Number of rows: {data.shape[0]}, number of features: {data.shape[1]}")

Number of rows: 48842, number of features: 15

https://hub.gke2.mybinder.org/user/ipython-ipython-in-depth-hc0ua6ma/notebooks/binder/z.ipynb# 1/10

In [5]: data.head(10)

Out[5]:
age workclass fnlwgt education educational-num marital-status occupation relationship race gender capital-gain capital-loss hours-per-week native-country income

0 25 Private 226802 11th 7 Never-married Machine-op-inspct Own-child Black Male 0 0 40 United-States <=50K

1 38 Private 89814 HS-grad 9 Married-civ-spouse Farming-fishing Husband White Male 0 0 50 United-States <=50K

2 28 Local-gov 336951 Assoc-acdm 12 Married-civ-spouse Protective-serv Husband White Male 0 0 40 United-States >50K

3 44 Private 160323 Some-college 10 Married-civ-spouse Machine-op-inspct Husband Black Male 7688 0 40 United-States >50K

4 18 NaN 103497 Some-college 10 Never-married NaN Own-child White Female 0 0 30 United-States <=50K

5 34 Private 198693 10th 6 Never-married Other-service Not-in-family White Male 0 0 30 United-States <=50K

6 29 NaN 227026 HS-grad 9 Never-married NaN Unmarried Black Male 0 0 40 United-States <=50K

7 63 Self-emp-not-inc 104626 Prof-school 15 Married-civ-spouse Prof-specialty Husband White Male 3103 0 32 United-States >50K

8 24 Private 369667 Some-college 10 Never-married Other-service Unmarried White Female 0 0 40 United-States <=50K

9 55 Private 104996 7th-8th 4 Married-civ-spouse Craft-repair Husband White Male 0 0 10 United-States <=50K

In [6]: ## check whether the dataset contains missing (NA) values
C = (data.dtypes == 'object')
CategoricalVariables = list(C[C].index)

Integer = (data.dtypes == 'int64')
Float = (data.dtypes == 'float64')
NumericVariables = list(Integer[Integer].index) + list(Float[Float].index)

# data.size is rows * columns; it replaces the deprecated np.product(data.shape)
Missing_Percentage = data.isnull().sum().sum() / data.size * 100
print("The percentage of missing entries before cleaning: " + str(round(Missing_Percentage,5)) + " %")

The percentage of missing entries before cleaning: 0.88244 %
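
The single overall figure does not show which columns hold the missing values. The following is a minimal sketch of a per-column view, assuming a small made-up frame: the `toy` DataFrame and its values are illustrative only and are not taken from the census data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the census table
toy = pd.DataFrame({
    "age": [25, 38, 18, 29],
    "workclass": ["Private", "Private", np.nan, np.nan],
    "occupation": ["Sales", np.nan, np.nan, "Craft-repair"],
})

# Share of missing cells over the whole table, in percent
overall_pct = toy.isnull().sum().sum() / toy.size * 100

# Share of missing cells per column, in percent
# (the mean of a boolean column is the fraction of True values)
per_column_pct = toy.isnull().mean() * 100

print(round(overall_pct, 2))
print(per_column_pct)
```

On the real data, the same two expressions applied to `data` would show whether the missing entries are concentrated in a few columns, such as the `workclass` and `occupation` values shown as NaN in Out[5].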

In [7]: ## show the full list of features
list(data.columns)

Out[7]: ['age',

'workclass',

'fnlwgt',

'education',

'educational-num',

'marital-status',

'occupation',

'relationship',

'race',

'gender',

'capital-gain',

'capital-loss',

'hours-per-week',

'native-country',

'income']


In [8]: data.dtypes

Out[8]: age int64

workclass object

fnlwgt int64

education object

educational-num int64

marital-status object

occupation object

relationship object

race object

gender object

capital-gain int64

capital-loss int64

hours-per-week int64

native-country object

income object

dtype: object

In [9]: data.education.unique()

Out[9]: array(['11th', 'HS-grad', 'Assoc-acdm', 'Some-college', '10th',

'Prof-school', '7th-8th', 'Bachelors', 'Masters', 'Doctorate',

'5th-6th', 'Assoc-voc', '9th', '12th', '1st-4th', 'Preschool'],

dtype=object)

In [10]: ## rename the hours-per-week column to hoursPerWeek
dataRename = data.rename(columns={'hours-per-week': 'hoursPerWeek'})

In [11]: dataRename.head(5)

Out[11]:
age workclass fnlwgt education educational-num marital-status occupation relationship race gender capital-gain capital-loss hoursPerWeek native-country income

0 25 Private 226802 11th 7 Never-married Machine-op-inspct Own-child Black Male 0 0 40 United-States <=50K

1 38 Private 89814 HS-grad 9 Married-civ-spouse Farming-fishing Husband White Male 0 0 50 United-States <=50K

2 28 Local-gov 336951 Assoc-acdm 12 Married-civ-spouse Protective-serv Husband White Male 0 0 40 United-States >50K

3 44 Private 160323 Some-college 10 Married-civ-spouse Machine-op-inspct Husband Black Male 7688 0 40 United-States >50K

4 18 NaN 103497 Some-college 10 Never-married NaN Own-child White Female 0 0 30 United-States <=50K

In [12]: ## Encode the race column as integer codes with cat.codes (codes start at 0)
dataRename["race"] = dataRename["race"].astype('category')
dataRename["race_encoded"] = dataRename["race"].cat.codes
dataRename.head()

Out[12]:
age workclass fnlwgt education educational-num marital-status occupation relationship race gender capital-gain capital-loss hoursPerWeek native-country income race_encoded

0 25 Private 226802 11th 7 Never-married Machine-op-inspct Own-child Black Male 0 0 40 United-States <=50K 2

1 38 Private 89814 HS-grad 9 Married-civ-spouse Farming-fishing Husband White Male 0 0 50 United-States <=50K 4

2 28 Local-gov 336951 Assoc-acdm 12 Married-civ-spouse Protective-serv Husband White Male 0 0 40 United-States >50K 4

3 44 Private 160323 Some-college 10 Married-civ-spouse Machine-op-inspct Husband Black Male 7688 0 40 United-States >50K 2

4 18 NaN 103497 Some-college 10 Never-married NaN Own-child White Female 0 0 30 United-States <=50K 4


In [13]: dataRename.race.unique()

Out[13]: ['Black', 'White', 'Asian-Pac-Islander', 'Other', 'Amer-Indian-Eskimo']

Categories (5, object): ['Black', 'White', 'Asian-Pac-Islander', 'Other', 'Amer-Indian-Eskimo']
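
The numbering produced by `cat.codes` is worth checking before filtering on it. The sketch below uses a hand-made `race` Series (an assumption for illustration, not the census data) with the same five labels. It shows that categories inferred from strings are sorted alphabetically and numbered from 0, which agrees with Out[12], where Black is 2 and White is 4.

```python
import pandas as pd

# Hypothetical sample that uses the five race labels from the dataset
race = pd.Series(
    ["Black", "White", "Asian-Pac-Islander", "Other", "Amer-Indian-Eskimo", "White"]
).astype("category")

# Categories inferred from strings are sorted; each code is the category's position
code_to_label = dict(enumerate(race.cat.categories))
print(code_to_label)

codes = race.cat.codes
print(codes.min(), codes.max())
```

With five categories the codes run from 0 to 4, so a filter such as `race_encoded == 5` matches no rows, while Amer-Indian-Eskimo is found under code 0.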

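In [14]: # cat.codes numbers the sorted categories from 0:
# 0 Amer-Indian-Eskimo, 1 Asian-Pac-Islander, 2 Black, 3 Other, 4 White
race1 = dataRename[dataRename["race_encoded"] == 1]
race2 = dataRename[dataRename["race_encoded"] == 2]
race3 = dataRename[dataRename["race_encoded"] == 3]
race4 = dataRename[dataRename["race_encoded"] == 4]
race5 = dataRename[dataRename["race_encoded"] == 0]  # Amer-Indian-Eskimo is code 0; no row has code 5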

In [15]: race1.dtypes

Out[15]: age int64

workclass object

fnlwgt int64

education object

educational-num int64

marital-status object

occupation object

relationship object

race category

gender object

capital-gain int64

capital-loss int64

hoursPerWeek int64

native-country object

income object

race_encoded int8

dtype: object

In [16]: race1.head(10)

Out[16]:
age workclass fnlwgt education educational-num marital-status occupation relationship race gender capital-gain capital-loss hoursPerWeek native-country income race_encoded

19 40 Private 85019 Doctorate 16 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 45 NaN >50K 1

141 18 Private 262118 Some-college 10 Never-married Adm-clerical Own-child Asian-Pac-Islander Female 0 0 22 Germany <=50K 1

220 34 Private 162312 Bachelors 13 Married-civ-spouse Adm-clerical Husband Asian-Pac-Islander Male 0 0 40 Philippines <=50K 1

221 25 Private 77698 HS-grad 9 Never-married Machine-op-inspct Not-in-family Asian-Pac-Islander Female 0 0 40 Philippines <=50K 1

232 55 Private 119751 Masters 14 Never-married Exec-managerial Unmarried Asian-Pac-Islander Female 0 0 50 Thailand <=50K 1

309 51 Self-emp-not-inc 136708 HS-grad 9 Married-civ-spouse Sales Husband Asian-Pac-Islander Male 3103 0 84 Vietnam <=50K 1

376 28 Private 302903 Bachelors 13 Married-civ-spouse Prof-specialty Wife Asian-Pac-Islander Female 0 1485 40 United-States <=50K 1

377 24 Private 154835 HS-grad 9 Never-married Exec-managerial Own-child Asian-Pac-Islander Female 0 0 40 South <=50K 1

395 37 Private 79586 HS-grad 9 Separated Machine-op-inspct Own-child Asian-Pac-Islander Male 0 0 60 United-States <=50K 1

396 45 Private 355781 Bachelors 13 Married-civ-spouse Exec-managerial Husband Asian-Pac-Islander Male 0 0 45 Japan >50K 1


In [17]: # cat.codes runs from 0 to 4, so Amer-Indian-Eskimo is code 0, not 5
race_titles = [
    (1, "Race1: Asian-Pac-Islander"),
    (2, "Race2: Black"),
    (3, "Race3: Other"),
    (4, "Race4: White"),
    (0, "Race5: Amer-Indian-Eskimo"),
]
for code, title in race_titles:
    # The result of plt.hist is not assigned, so the race1..race5 DataFrames are not overwritten
    plt.hist(dataRename.loc[dataRename["race_encoded"] == code, "hoursPerWeek"])
    plt.title(title)
    plt.show()


CONCLUSIONS
For race1 (Asian-Pac-Islander), the tallest histogram bar for hoursPerWeek is around 40 hours per week, with more than 800 people.
For race2 (Black), the tallest bar is around 40 hours per week, with more than 3000 people.
For race3 (Other), the tallest bar is around 40 hours per week, with about 250 people.
For race4 (White), the tallest bar is around 40 hours per week, with more than 20000 people.
For race5 (Amer-Indian-Eskimo), no count is reported. cat.codes numbers the five categories 0 to 4, so no row has race_encoded equal to 5 and a filter on that code selects nothing. This does not mean the group has no respondents: Out[13] lists Amer-Indian-Eskimo (code 0) among the values present.
Across the four groups with reported counts, the most common working time is about 40 hours per week, so a typical respondent works roughly 40 hours per week.
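
A histogram reading can be cross-checked numerically. The sketch below computes the most frequent `hoursPerWeek` value and the group size for each race; it assumes a small made-up frame (`toy` and its values are hypothetical, not the census data).

```python
import pandas as pd

# Hypothetical toy data; the real notebook would use dataRename instead
toy = pd.DataFrame({
    "race": ["White", "White", "White", "Black", "Black", "Black", "Other"],
    "hoursPerWeek": [40, 40, 50, 40, 40, 35, 40],
})

grouped = toy.groupby("race")["hoursPerWeek"]

# mode() can return several values on ties; iloc[0] takes the smallest one
modal_hours = grouped.agg(lambda s: s.mode().iloc[0])
group_sizes = grouped.size()

print(modal_hours)
print(group_sizes)
```

Grouping on the label column rather than on hand-typed integer codes also avoids depending on how the codes are numbered.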

CATEGORIES IN THE WORKCLASS COLUMN


In [18]: data.workclass.unique()

Out[18]: array(['Private', 'Local-gov', nan, 'Self-emp-not-inc', 'Federal-gov',

'State-gov', 'Self-emp-inc', 'Without-pay', 'Never-worked'],

dtype=object)
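
Out[18] lists nine entries, but one of them is `nan`, which marks a missing value rather than a category. The sketch below counts the categories with and without the missing marker; the `workclass` Series is typed in by hand from Out[18] for illustration and is not the full column.

```python
import numpy as np
import pandas as pd

# The distinct values shown in Out[18], entered by hand for illustration
workclass = pd.Series([
    "Private", "Local-gov", np.nan, "Self-emp-not-inc", "Federal-gov",
    "State-gov", "Self-emp-inc", "Without-pay", "Never-worked",
])

n_categories = workclass.nunique()                # NaN is excluded by default
n_with_missing = workclass.nunique(dropna=False)  # counts NaN as one extra value

print(n_categories, n_with_missing)
```

On the real column, `data.workclass.value_counts(dropna=False)` would additionally show how many rows fall into each category.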

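In [19]: # cat.codes numbers the sorted categories from 0:
# 0 Amer-Indian-Eskimo, 1 Asian-Pac-Islander, 2 Black, 3 Other, 4 White
race1 = dataRename[dataRename["race_encoded"] == 1]
race2 = dataRename[dataRename["race_encoded"] == 2]
race3 = dataRename[dataRename["race_encoded"] == 3]
race4 = dataRename[dataRename["race_encoded"] == 4]
race5 = dataRename[dataRename["race_encoded"] == 0]  # Amer-Indian-Eskimo is code 0; no row has code 5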


In [20]: plt.figure(figsize=(7,7))
# Percentages are relative to all race4 rows, not only those with age > 70
total = float(len(race4))

ax = sns.countplot(x="income", data=race4[race4["age"]>70])
for p in ax.patches:
    height = p.get_height()
    # Write the percentage just above the centre of each bar
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}'.format((height/total)*100),
            ha="center")
plt.show()
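
The labels on the bars depend on the denominator. The sketch below contrasts dividing by the size of the whole group with dividing by the size of the filtered subset; the `toy` frame and its values are hypothetical and only serve to illustrate the difference.

```python
import pandas as pd

# Hypothetical toy data; the real notebook would use race4 instead
toy = pd.DataFrame({
    "age": [72, 75, 80, 71, 30, 45, 50, 65],
    "income": ["<=50K", "<=50K", ">50K", "<=50K", "<=50K", ">50K", "<=50K", ">50K"],
})

older = toy[toy["age"] > 70]
counts = older["income"].value_counts()

# Share of the whole group (the denominator used in the plotting cell)
pct_of_all = counts / len(toy) * 100

# Share within the filtered subset only
pct_of_older = counts / len(older) * 100

print(pct_of_all)
print(pct_of_older)
```

Which denominator is appropriate depends on the question: the first answers "what share of the whole group is over 70 with this income", the second "what share of those over 70 has this income".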

BOXPLOT OF AGE BY INCOME


In [22]: fig = plt.figure()
sns.boxplot(x="income", y="age", data=race1)
plt.show()
# Note: the statistics below are computed on the full dataset (data), not only on race1
print("mean :"'\n', data[['income', 'age']].groupby(['income'], as_index=False).mean().sort_values(by='age', ascending=False))
print("median :"'\n', data[['income', 'age']].groupby(['income'], as_index=False).median().sort_values(by='age', ascending=False))

mean :

income age

1 >50K 44.275178

0 <=50K 36.872184

median :

income age

1 >50K 43

0 <=50K 34

CONCLUSIONS
Over the full dataset, the mean age of people in the >50K income group (row 1 of the output) is 44.3 years.
The mean age of people in the <=50K income group (row 0 of the output) is 36.9 years.
The median age of the >50K income group is 43 years,
and the median age of the <=50K income group is 34 years.
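
Summary statistics for one subgroup can differ from those of the full table. The sketch below, which assumes a small made-up frame (`toy` and its values are hypothetical), computes the mean and median age per income group once for a single encoded race and once for all rows.

```python
import pandas as pd

# Hypothetical toy data; the real notebook would use dataRename / race1
toy = pd.DataFrame({
    "race_encoded": [1, 1, 1, 1, 4, 4],
    "income": [">50K", "<=50K", "<=50K", ">50K", ">50K", "<=50K"],
    "age": [50, 30, 34, 40, 60, 20],
})

# Statistics restricted to the rows with race_encoded == 1
subset = toy[toy["race_encoded"] == 1]
stats_subset = subset.groupby("income")["age"].agg(["mean", "median"])

# Statistics over every row
stats_all = toy.groupby("income")["age"].agg(["mean", "median"])

print(stats_subset)
print(stats_all)
```

To describe the race1 boxplot with matching numbers, the same aggregation would be applied to `race1` rather than to `data`.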

Lab instructions for students of Industrial Engineering, Mechanical Engineering, Agrotechnology, FTSP, and the social sciences and humanities (Soshum) programs
Rename the hours-per-week column to hoursPerWeek.
Analyze the histogram of the hoursPerWeek column for each of race1, race2, race3, race4 and race5. What can you conclude?
How many categories does the workclass column contain? What are they?
Explain the boxplot of income and age obtained for the race1 data.
