Numpy Merged


NumPy Arrays and Vectorized Computation

0.0.1 NUMPY MODULE:


NumPy, short for Numerical Python, is a fundamental library for numerical computing in Python.
It provides powerful data structures, primarily the ndarray (n-dimensional array), which enables
efficient storage and manipulation of large datasets. With its support for multi-dimensional arrays,
NumPy allows users to perform complex mathematical operations with ease. One of the key features
of NumPy is its ability to perform element-wise operations on arrays, which is significantly faster
than using traditional Python lists. This efficiency stems from its implementation in C, allowing for
lower-level optimizations. NumPy also includes a comprehensive set of mathematical functions that
can operate on arrays, including linear algebra, Fourier transforms, and random number generation.
In addition to its array capabilities, NumPy provides tools for integrating with other languages, such
as C and Fortran, making it a versatile choice for performance-critical applications. It serves as the
backbone for many other scientific computing libraries, including SciPy, pandas, and Matplotlib,
establishing itself as an essential component of the scientific Python ecosystem. NumPy’s array
operations are broadcastable, meaning that arrays of different shapes can still be used together in
calculations, making it easier to handle data of varying dimensions. This flexibility is particularly
useful in data analysis and machine learning tasks.
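
As a quick illustration of broadcasting (a minimal sketch, not from the original notebook), a 1-D array combines with a 2-D array as long as the trailing dimensions match:

import numpy as np
a = np.arange(6).reshape(2, 3)   # shape (2, 3)
b = np.array([10, 20, 30])       # shape (3,) is broadcast across both rows
print(a + b)                     # [[10 21 32]
                                 #  [13 24 35]]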

1 1. NumPy Arrays from Python Data Structures, Intrinsic NumPy Objects and Random Functions

2 1.1 Arrays from Python data structures

[1]: import numpy as np
a=np.array([1,2,3])
a

[1]: array([1, 2, 3])

[2]: type(a)

[2]: numpy.ndarray

[3]: a.dtype

[3]: dtype('int32')

[4] :
a.ndim

[4]: 1

[5]: a.size

[5]: 3

[6] :
a.shape

[6]: (3,)

[7] :

[1, 2, 3, 4, 5]

[8] : #Creation of ndarrays using array() method


#2-D array
import numpy as np
x=[1,2,3]
y=[3,4,5]
z=np.array((x,y))
print(z)

[[1 2 3]
[3 4 5]]

[9]: #1-D array from a list
m=[1,2,3]
c=np.array(m)
print(c)

[1 2 3]

[10]: #2-D array from a nested list
d=np.array([[1,2,3,4,5],[6,7,8,9,1]])
print(d)

[[1 2 3 4 5]
[6 7 8 9 1]]

[11]: #a set is stored as a single object, not element-wise
np.array({1,2,3,4,5})

[11]: array({1, 2, 3, 4, 5}, dtype=object)

[12] :

{1, 2, 3, 4}

[13] : #dictionary
import numpy as np
dict={'a':1,'b':2,'c':3}
z=np.array(list(dict.items()))
print(z)
a=np.array(list(dict.keys()))
print(a)

[['a' '1']
['b' '2']
['c' '3']]
['a' 'b' 'c']

3 1.2 Intrinsic NumPy Objects


Intrinsic NumPy objects are fundamental data structures provided by the NumPy library, which
are optimized for numerical computations and provide efficient operations on large datasets.
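
Besides the constructors shown below, np.linspace is another common intrinsic creator; a small sketch comparing it with arange:

import numpy as np
print(np.arange(0, 1, 0.25))   # step-based spacing: [0.   0.25 0.5  0.75]
print(np.linspace(0, 1, 5))    # count-based spacing: [0.   0.25 0.5  0.75 1.  ]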

[14]: a=np.array(np.arange(9))
print(a)

[0 1 2 3 4 5 6 7 8]

[15]: a=np.zeros(3)
print(a)

[0. 0. 0.]

[16]: b=np.zeros([3,3])
print(b)

[[0. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]

[17]: x=np.array([[1,3,7],[2,5,9]])
x

[17]: array([[1, 3, 7],
[2, 5, 9]])

[18] : d=np.zeros_like(x)
d

[18] : array([[0, 0, 0],


[0, 0, 0]])

[19]: a=np.ones(4)
print(a)

[1. 1. 1. 1.]

[20]: b=np.ones([3,3])
print(b)

[[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]]

[21]: c=np.eye(3)
print(c)

[[1. 0. 0.]
[0. 1. 0.]
[0. 0. 1.]]

[22]: c=np.eye(3,k=1)
print(c)

[[0. 1. 0.]
[0. 0. 1.]
[0. 0. 0.]]

[23]: print(np.identity(3))

[[1. 0. 0.]
[0. 1. 0.]
[0. 0. 1.]]

[24]: print(np.full((2,2),7))

[[7 7]
[7 7]]

[25]: np.arange(15)

[25]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])

[26]: x=np.arange(6,dtype=int)
np.ones_like(x)

[26]: array([1, 1, 1, 1, 1, 1])

[27]: np.zeros_like(x)

[27]: array([0, 0, 0, 0, 0, 0])

[28] : d=np.full_like(x,0.1,dtype=np.double)
d

[28] : array([0.1, 0.1, 0.1, 0.1, 0.1, 0.1])

[29]: print(np.full((2,3),0.1))

[[0.1 0.1 0.1]
[0.1 0.1 0.1]]

[30] : np.empty((2, 3, 2))

[30]: array([[[1.05337787e-311, 2.86558075e-322],


[0.00000000e+000, 0.00000000e+000],
[1.10343781e-312, 1.31370903e-076]],

[[5.20093491e-090, 5.69847262e-066],

[5.51292779e+169, 4.85649086e-033],
[6.48224659e+170, 5.82471487e+257]]])

[31]: #empty_like()
a=([1,2,3],[4,5,6])
np.empty_like(a)

[31]: array([[1730487296, 496, 0],


[ 0, 131074, 168442489]])

[32]: #using diag() method


np.diag([1,2,3,4])

[32]: array([[1, 0, 0, 0],


[0, 2, 0, 0],
[0, 0, 3, 0],
[0, 0, 0, 4]])

[33]: x=np.array([1,2,3])
y=np.array([4,5,6])
x,y=np.meshgrid(x,y)
print(x)
print(y)

[[1 2 3]
[1 2 3]
[1 2 3]]
[[4 4 4]
[5 5 5]
[6 6 6]]

3.1 1.3 Random Functions


The random functions in NumPy are essential for simulations, statistical sampling, and generating
synthetic data. They help facilitate various operations in scientific computing, machine learning,
and data analysis.
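
For reproducible results, the generator can be seeded; a minimal sketch using NumPy's default_rng (not part of the original cells):

import numpy as np
rng = np.random.default_rng(42)        # seeded Generator: same stream every run
print(rng.integers(0, 100, size=5))    # reproducible random integers
print(rng.normal(size=3))              # reproducible samples from N(0, 1)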

[34]: from numpy import random

x = random.randint(100)
print(x)

15
[35]: y=np.random.bytes(7)
print(y)
z=np.random.choice(['true','false'],size=(2,3))
print(z)

b"t'\n\x16\x14QB"
[['true' 'false' 'false']
['true' 'false' 'true']]

[36]: x = random.rand(1) + random.rand(1)*1j
print(x)
print(x.real)
print(x.imag)

[0.08421058+0.69654499j]
[0.08421058]
[0.69654499]

[37]:
x = random.rand(1,5) + random.rand(1,5)*1j
print (x)

[[0.29653563+0.94629414j 0.56539718+0.58965768j 0.83340819+0.82456817j


0.16209606+0.15309722j 0.92519953+0.01018444j]]

[38]:
np.random.random(size=(2,2))+1j*np.random.random(size=(2,2))

[38] : array([[0.90898124+0.87349692j, 0.64895681+0.87327894j],


[0.7544518 +0.122983j , 0.4716534 +0.77610277j]])

[39] :
np.random.permutation(5)

[39] : array([0, 3, 4, 1, 2])

[40]: a=np.arange(5)
b=np.random.choice(a,size=5,p=[0.1,0.2,0.3,0.2,0.2])
print(b)

[4 3 0 3 4]

[41] :

[41]: 3

[42]: print(np.random.randn(1,10))

[[ 0.08009351 1.04758386 -0.15977457 0.60779634 0.12686552 -2.29032851
-0.53667358 -0.69266066 1.42867051 -0.34056088]]

[43]: a=np.array(['apple','bananaa','cherry'])
b=np.random.choice(a)
print(b)

bananaa

[44]: np.random.shuffle(a)
print(a)

['cherry' 'apple' 'bananaa']

4 2. Manipulation of NumPy Arrays

5 2.1 Indexing
Indexing in NumPy refers to accessing individual elements or groups of elements within an array.
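
Beyond single positions, NumPy also supports fancy (list-of-indices) and boolean indexing; a short sketch:

import numpy as np
arr = np.arange(10, 20)
print(arr[0], arr[-1])       # single elements: 10 19
print(arr[[0, 3, 5]])        # fancy indexing: [10 13 15]
print(arr[arr % 2 == 0])     # boolean mask:   [10 12 14 16 18]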

[45]: import numpy as np
x = [[1, 2, 3], [4, 5, 6]]
y = [[7, 8, 9], [10, 11, 12]]

[46]: arr3d = np.array([x, y])

[47]: arr3d[0]

[47]: array([[1, 2, 3],
[4, 5, 6]])

[48]: old_values = arr3d[0].copy()
arr3d[0] = 42
print(arr3d)

[[[42 42 42]
[42 42 42]]

[[ 7 8 9]
[10 11 12]]]

[49] : import numpy as np

[50] : import numpy as np

[51] : import numpy as np

10

[52] : import numpy as np

[53] : import numpy as np

10

6 2.2 Slicing
Slicing in NumPy refers to the process of selecting a specific subset of elements from an array. It
allows you to create a new view of the original data without copying it, which can be very efficient
in terms of memory usage.
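
Because a slice is a view, writing through it modifies the original array; call .copy() when that is not wanted. A minimal sketch:

import numpy as np
arr = np.arange(5)
view = arr[1:4]          # a view: shares memory with arr
view[0] = 99
print(arr)               # [ 0 99  2  3  4] -- the original changed
safe = arr[1:4].copy()   # an independent copy
safe[0] = -1
print(arr)               # unchanged this time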

[54]: arr=np.array([5,6,7,8,9])
print(arr[1:3])

[6 7]

[55]: arr=np.array([5,6,7,3,6,8,9])
print(arr[1:])

[6 7 3 6 8 9]

[56]: arr=np.array([1,2,3,4,5,8,9])
print(arr[1:])

[2 3 4 5 8 9]

[57]: arr=np.array([5,6,7,8,9])
print(arr[:3])

[5 6 7]

[58]: print(arr[2:4])

[7 8]

[59]: arr=np.array([5,6,7,8,9])
print(arr[0:3])

[5 6 7]

[60]: arr=np.array([5,6,7,8,9])
print(arr[:3])

[5 6 7]

[61]: print(arr[-3:-1])

[7 8]

[62]: arr=np.array([5,6,7,8,4,5,6,7,9])
print(arr[1:4:2])

[6 8]

[63]: arr=np.array([5,6,7,8,4,5,6,7,9])
print(arr[:4:-1])

[9 7 6 5]

[64]: import numpy as np

arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])

print(arr[1, 1:4])

[7 8 9]
[65]: import numpy as np

arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])

print(arr[0:2, 2])

[3 8]
[66]: import numpy as np

arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])

print(arr[0:2, 1:4])

[[2 3 4]
[7 8 9]]

[67]: b="Hello, World!"
print(b[2:5])

llo

[68]: b = "Hello, World!"
print(b[:5])

Hello

[69]: b = "Hello, World!"
print(b[2:])

llo, World!

7 2.3 Re-Shaping
Reshaping in NumPy is the process of changing the shape (i.e., dimensions) of an existing array
without altering the data. This is particularly useful when you need to transform an array to fit a
certain shape for further operations, such as machine learning or data processing tasks.
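
One dimension may be given as -1, letting NumPy infer it from the array's size; a quick sketch:

import numpy as np
arr = np.arange(12)
print(arr.reshape(3, -1))    # NumPy infers the second dimension as 4
print(arr.reshape(-1))       # back to 1-D, equivalent to ravel()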

[70]: import numpy as np

(2, 4)

[71]: import numpy as np

arr = np.array([1, 2, 3, 4], ndmin=5)
print(arr)
print('shape of array :', arr.shape)

[[[[[1 2 3 4]]]]]
shape of array : (1, 1, 1, 1, 4)

[72]: import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

arr1= arr.reshape(4, 3)

print(arr1)

[[ 1 2 3]
[ 4 5 6]
[ 7 8 9]
[10 11 12]]

[73]: import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
newarr = arr.reshape(2, 2, 3)
print(newarr)

[[[ 1 2 3]
[ 4 5 6]]

[[ 7 8 9]
[10 11 12]]]
[74]: a=np.arange(8).reshape(4,2)
print(a)

[[0 1]
[2 3]
[4 5]
[6 7]]

[75]: a=np.arange(12).reshape(4,3)
print(a)

[[ 0 1 2]
[ 3 4 5]
[ 6 7 8]
[ 9 10 11]]

8 2.4 Joining Arrays


Joining arrays in NumPy is a way of combining two or more arrays into a single array. There are
several ways to join arrays, depending on the desired result and the shape of the input arrays.
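
For reference alongside the axis=1 example below, concatenating along axis=0 stacks rows, and the arrays only need to agree on the other axis; a small sketch:

import numpy as np
a = np.array([[1, 2], [3, 4]])   # shape (2, 2)
b = np.array([[5, 6]])           # shape (1, 2): same column count
print(np.concatenate((a, b), axis=0))   # [[1 2] [3 4] [5 6]]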

[76]:
a1=np.arange(6).reshape(3,2)
a2=np.arange(6).reshape(3,2)
print(np.concatenate((a1,a2),axis=1))

[[0 1 0 1]
[2 3 2 3]
[4 5 4 5]]

[77]: a = np.array([[1,2],[3,4]])
b = np.array([[5,6],[7,8]])
c = np.stack((a,b))
print(c)

[[[1 2]
[3 4]]

[[5 6]
[7 8]]]

[78]: print(np.stack((a,b),axis=0))

[[[1 2]
[3 4]]

[[5 6]
[7 8]]]

[79]: print(np.stack((a,b),axis=1))

[[[1 2]
[5 6]]

[[3 4]
[7 8]]]

[80]:
ch = np.hstack((a,b))
print(ch)

[[1 2 5 6]
[3 4 7 8]]

[81]:
ch = np.vstack((a,b))
print(ch)

[[1 2]
[3 4]
[5 6]
[7 8]]

9 2.5 Splitting
Splitting in NumPy involves dividing an array into multiple sub-arrays. This can be useful when
you need to partition data for different processing purposes or when dealing with chunks of data
in a structured way.
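
np.split requires the division to be exact; np.array_split relaxes this and allows uneven pieces. A minimal sketch:

import numpy as np
a = np.arange(7)
print(np.array_split(a, 3))   # [array([0, 1, 2]), array([3, 4]), array([5, 6])]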

[82]: import numpy as np


a = np.arange(9)
print(a)

[0 1 2 3 4 5 6 7 8]

[83]: b = np.split(a,3)
print(b)

[array([0, 1, 2]), array([3, 4, 5]), array([6, 7, 8])]

[84]: a = np.arange(12).reshape(4,3)
b=np.hsplit(a,3)
print(b)

[array([[0],
[3],
[6],
[9]]), array([[ 1],
[ 4],
[ 7],
[10]]), array([[ 2],
[ 5],
[ 8],
[11]])]

[85]: print(np.vsplit(a,2))

[array([[0, 1, 2],
[3, 4, 5]]), array([[ 6, 7, 8],
[ 9, 10, 11]])]

10 3. Computation on NumPy Arrays Using Universal Functions

11 3.1 Unary Universal Functions


Unary Universal Functions (also known as unary ufuncs) in NumPy are mathematical functions
that operate on a single input array element-wise. These functions apply a specific mathematical
operation to each element of an array independently, resulting in an output array of the same shape.
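
Other common unary ufuncs follow the same element-wise pattern; a short sketch with np.abs and np.square:

import numpy as np
arr = np.array([-3.0, -1.5, 2.0])
print(np.abs(arr))      # [3.  1.5 2. ]
print(np.square(arr))   # [9.   2.25 4.  ]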

[86]: arr = np.arange(10)
print(arr)

[0 1 2 3 4 5 6 7 8 9]

[87]: np.sqrt(arr)

[87]: array([0. , 1. , 1.41421356, 1.73205081, 2. ,
2.23606798, 2.44948974, 2.64575131, 2.82842712, 3. ])

[88]: np.exp(arr)

[88]: array([1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01,
5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03,
2.98095799e+03, 8.10308393e+03])

[89]: arr.min()

[89]: 0

[90]: arr.max()

[90]: 9

[91]: arr.mean()

[91]: 4.5

[92] :

[0 1 2 3 4 5 6 7 8 9]

[93]: arr=np.arange(0,-5,-0.5)
print(np.abs(arr))

[0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]

12 3.2 Binary Universal Functions


Binary Universal Functions (also known as binary ufuncs) operate on two input arrays element-wise.
These functions require two arrays (or one array and one scalar) and perform a mathematical
operation between corresponding elements.
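
Note the difference between the binary ufunc np.maximum (element-wise, two inputs) and the reduction np.max (one input); a quick sketch:

import numpy as np
x = np.array([1, 5, 2])
y = np.array([4, 3, 6])
print(np.maximum(x, y))   # element-wise: [4 5 6]
print(np.max(x))          # reduction over one array: 5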

[94] : x = np.random.randn(8)
y = np.random.randn(8)
print(x)

[ 1.11097262 -0.26995231 0.0060993 1.04398907 -1.82141342 0.00998652


0.08274781 0.82046885]

[95]: print(y)

[-0.05342373 0.10817525 -0.4610533 0.5755554 -0.66695438 0.25344274
1.40395846 -0.87447163]

[96]: np.maximum(x, y)

[96]: array([ 1.11097262, 0.10817525, 0.0060993 , 1.04398907, -0.66695438,
0.25344274, 1.40395846, 0.82046885])

[97] : arr = np.random.randn(7) * 5


remainder, whole_part = np.modf(arr)
print(remainder)

[-0.98958028 0.75318997 0.47148313 0.96309562 -0.84443205 0.60019609


0.41412946]

[98]: print(whole_part)

[-6. 7. 2. 1. -2. 5. 3.]

[99]: import numpy as np


a = np.arange(9).reshape(3,3)
b = np.array([[10,10,10],[10,10,10],[10,10,10]])
print(np.add(a,b))

[[10 11 12]
[13 14 15]
[16 17 18]]

[100]: np.subtract(a,b)

[100]: array([[-10, -9, -8],
[ -7, -6, -5],
[ -4, -3, -2]])

[101]: np.multiply(a,b)

[101]: array([[ 0, 10, 20],
[30, 40, 50],
[60, 70, 80]])

[102]: np.divide(a,b)

[102]: array([[0. , 0.1, 0.2],
[0.3, 0.4, 0.5],
[0.6, 0.7, 0.8]])

[103] : import numpy as np
a = np.array([10,100,1000])
np.power(a,2)

[103] : array([ 100, 10000, 1000000], dtype=int32)

13 4. Compute Statistical and Mathematical Methods and Comparison Operations on rows/columns

13.1 4.1 Mathematical and Statistical methods on NumPy Arrays
NumPy provides a variety of mathematical and statistical methods to perform operations on arrays.

[104]: a = np.array([[3,7,5],[8,4,3],[2,4,9]])
a

[104]: array([[3, 7, 5],
[8, 4, 3],
[2, 4, 9]])

[105]: a.sum()

[105]: 45

[106]: import numpy as np
a = np.array([[30,40,70],[80,20,10],[50,90,60]])
np.percentile(a,90)

[106]: 82.0

[107] : arr = np.random.randn(5, 4)

[108]: arr.mean()

[108]: -0.14756616582071838

[109]: arr.mean(axis=1)

[109]: array([-0.93641711, 0.12758996, -0.44993246, 0.13099294, 0.38993583])

[110]: np.median(arr)

[110]: -0.28413298907449897

[111]: arr.std()

[111]: 0.9329450218698545

[112]: arr.var()

[112]: 0.8703864138317433

[113]: arr.sum(axis=0)

[113]: array([ 0.68253865, -2.88096912, -2.108008 , 1.35511515])

[114]: arr=np.arange(8)
print(arr.cumsum())

[ 0 1 3 6 10 15 21 28]

[115]: arr=np.arange(9).reshape(3,3)
print(arr.cumsum(axis=0))

[[ 0 1 2]
[ 3 5 7]
[ 9 12 15]]

[116]: print(arr.cumprod(axis=1))

[[ 0 0 0]
[ 3 12 60]
[ 6 42 336]]

13.2 4.2 Comparison Operations


Comparison operations in NumPy allow element-wise comparison between arrays or with scalars.
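
Since the original cells in this section did not survive intact, here is a minimal sketch of the typical element-wise and whole-array comparisons:

import numpy as np
a = np.array([1, 2, 3, 4])
b = np.array([4, 2, 2, 4])
print(a == b)                 # [False  True False  True]
print(a > 2)                  # [False False  True  True]
print(np.array_equal(a, b))   # whole-array comparison: False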

[117] :

True

[118] :

[119] :

[False True False True]

[120] :

False

[121] :

[False True True True]

[122] :

True

[123] :

[ True False False False]

[124] :

[ True False True False]

14 5. Computation on NumPy Arrays using Sorting, Unique and Set Operations

14.1 5.1 Sorting
Sorting helps to arrange elements of an array in a particular order.
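
Related to sorting, np.argsort returns the indices that would sort an array; a small sketch:

import numpy as np
a = np.array([30, 10, 20])
order = np.argsort(a)   # [1 2 0]
print(a[order])         # [10 20 30]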

[125]: import numpy as np


a = np.array([[3,7],[9,1]])
print(a)

[[3 7]
[9 1]]

[126]: np.sort(a)

[126]: array([[3, 7],
[1, 9]])

[127]: np.sort(a, axis=0)

[127]: array([[3, 1],
[9, 7]])

[128]: np.sort(a, axis=1)

[128]: array([[3, 7],
[1, 9]])

[129]: arr = np.random.randn(5, 3)
print(arr)

[[-0.92147727 -0.67857177 -0.04478315]
[-0.30378745 -0.95433394 -1.83418572]
[-0.48103436 -0.55413111 1.28233061]
[ 0.76260305 1.30994277 0.32818117]
[ 1.87598839 -0.35057108 0.47603584]]

[130]: arr.sort(1)
print(arr)

[[-0.92147727 -0.67857177 -0.04478315]
[-1.83418572 -0.95433394 -0.30378745]
[-0.55413111 -0.48103436 1.28233061]
[ 0.32818117 0.76260305 1.30994277]
[-0.35057108 0.47603584 1.87598839]]

14.2 5.2 Unique Operation


NumPy provides functions that find the sorted, unique values in an array.
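
np.unique can also report how often each value occurs via return_counts; a minimal sketch:

import numpy as np
a = np.array([1, 2, 2, 3, 3, 3])
values, counts = np.unique(a, return_counts=True)
print(values)   # [1 2 3]
print(counts)   # [1 2 3]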

[131]: names = np.array(['Bob','Joe','Will','Bob','Will','Joe','Joe'])
print(np.unique(names))

['Bob' 'Joe' 'Will']

[132]: sorted(set(names))

[132]: ['Bob', 'Joe', 'Will']

[133]: ints = np.array([3,3,3,2,2,1,1,4,4])
print(np.unique(ints))

[1 2 3 4]

14.3 5.3 Set Operations

[134]: import numpy as np
values = np.array([6,0,0,3,2,5,6])
print(np.in1d(values,[2,3,6]))

[ True False False True True False True]

[135]: x = np.array([1,2,3,4])
y = np.array([3,4,5,6])

[136]: print(np.union1d(x,y))

[1 2 3 4 5 6]

[137]: print(np.intersect1d(x,y))

[3 4]

[138]: print(np.setdiff1d(x,y))

[1 2]

[139]: print(np.setxor1d(x,y))

[1 2 5 6]

15 6. Load an image file and do crop and flip operations using NumPy indexing

To load and manipulate images with NumPy, you can use the Pillow (PIL) library to load an image
and convert it into a NumPy array.

[8]: from PIL import Image


img=Image.open("img.jpg")
img.format

[8] : 'JPEG'

[9]: import numpy as np
a=np.array(img)
print(a)
[[[242 242 242]
[242 242 242]
[242 242 242]

[195 195 195]
[195 195 195]
[195 195 195]]

[[242 242 242]


[242 242 242]
[242 242 242]

[195 195 195]
[195 195 195]
[195 195 195]]

[[242 242 242]


[242 242 242]
[242 242 242]

[195 195 195]
[195 195 195]
[195 195 195]]

[[208 208 208]


[208 208 208]
[206 206 206]

[163 163 163]
[163 163 163]
[163 163 163]]

[[208 208 208]


[207 207 207]
[206 206 206]

[164 164 164]
[164 164 164]
[164 164 164]]

[[207 207 207]


[207 207 207]
[205 205 205]

[165 165 165]
[165 165 165]

[165 165 165]]]

[10]: display(img)
[11] : crop_img=a[100:900,100:900,:]
img_out=Image.fromarray(crop_img)
img_out

[11]:

[12]: flipped_img=a[::-1,:,:]   # vertical flip via reversed row indexing (assumed axis)
display(Image.fromarray(flipped_img))
Data Manipulation with Pandas

1 1. Create pandas Series from Python List, NumPy Arrays and Dictionary

2 1.1 Pandas Series From Python List

[1]: import pandas as pd


import numpy as np
data=[4,7,-5,3]
s=pd.Series(data)
print(s)

0 4
1 7
2 -5
3 3
dtype: int64

[2]: # import pandas lib. as pd


import pandas as pd

# create Pandas Series with defined indexes


x = pd.Series([10, 20, 30, 40, 50], index =['a', 'b', 'c', 'd', 'e'])

# print the Series


print(x)

a 10
b 20
c 30
d 40
e 50
dtype: int64

[3]: import pandas as pd

lst = ['G', 'h', 'i', 'j',
'k', 'l', 'm']
ind = [10, 20, 30, 40, 50, 60, 70]

# create Pandas Series with defined indexes


x = pd.Series(lst, index = ind)

# print the Series


print(x)

10 G
20 h
30 i
40 j
50 k
60 l
70 m
dtype: object

3 1.2 Pandas Series From Numpy arrays

[4] : import pandas as pd


import numpy as np

# numpy array
data = np.array(['a', 'b', 'c', 'd', 'e'])

# creating series
s = pd.Series(data)
print(s)

0 a
1 b
2 c
3 d
4 e
dtype: object

[5] : # importing Pandas & numpy


import pandas as pd
import numpy as np

# numpy array
data = np.array(['a', 'b', 'c', 'd', 'e'])

# creating series
s = pd.Series(data, index =[1000, 1001, 1002, 1003, 1004])
print(s)

1000 a
1001 b
1002 c
1003 d
1004 e
dtype: object

[6]: numpy_array = np.array([1.0, 2.8, 3.0, 2.0, 9.0, 4.2])

s = pd.Series(numpy_array, index=list('abcdef'))
print("Output Series:")
print(s)

Output Series:
a 1.0
b 2.8
c 3.0
d 2.0
e 9.0
f 4.2
dtype: float64

4 1.3 Pandas Series From Dictionary

[7] : import pandas as pd

# create a dictionary
dictionary = {'D': 10, 'B': 20, 'C': 30}

# create a series
series = pd.Series(dictionary)

print(series)

D 10
B 20
C 30
dtype: int64

[8] : # import the pandas lib as pd


import pandas as pd

# create a dictionary
dictionary = {'A': 50, 'B': 10, 'C': 80}

# create a series
series = pd.Series(dictionary, index=['B','C','A'])
print(series)
B 10
C 80
A 50
dtype: int64

[9] : import pandas as pd

# create a dictionary
dictionary = {'A': 50, 'B': 10, 'C': 80}

# create a series
series = pd.Series(dictionary, index=['B', 'C', 'D', 'A'])

print(series)

B 10.0
C 80.0
D NaN
A 50.0
dtype: float64

4.1 2. Data Manipulation with Pandas Series


4.2 2.1 Indexing

[10] : import pandas as pd


import numpy as np

# creating simple array


data = np.array(['s','p','a','n','d','a','n','a'])
ser = pd.Series(data,index=[10,11,12,13,14,15,16,17])
print(ser[16])

[11] : import pandas as pd

Date = ['1/1/2018', '2/1/2018', '3/1/2018', '4/1/2018']


Index_name = ['Day 1', 'Day 2', 'Day 3', 'Day 4']
sr = pd.Series(data = Date,
index = Index_name )
print(sr)

Day 1 1/1/2018
Day 2 2/1/2018

Day 3 3/1/2018
Day 4 4/1/2018
dtype: object

[12]: print(sr['Day 1'])

1/1/2018

[13]: import numpy as np


import pandas as pd
s = pd.Series(np.arange(5.), index=['a','b','c','d','e'])
print(s)

a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64

4.3 2.2 Selecting

[24]: import numpy as np


import pandas as pd
s = pd.Series(np.arange(5.), index=['a','b','c','d','e'])
print(s)

a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64

[19]: s['b']

[19]: 1.0

[26]: s[['b','a','d']]

[26]: b 1.0
a 0.0
d 3.0
dtype: float64

[27]: s['b':'e']

[27]: b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64

[20]: s[1]

[20]: 1.0

[21]: s[2:4]

[21]: c 2.0
d 3.0
dtype: float64

[23]: s[[1,3]]

[23]: b 1.0
d 3.0
dtype: float64

[28]: print(s[::2])

a 0.0
c 2.0
e 4.0
dtype: float64

4.4 2.3 Filtering

[4]: import numpy as np


import pandas as pd
s = pd.Series(np.arange(5.), index=['a','b','c','d','e'])
print(s)

a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64

[32]: s[s<1]

[32]: a 0.0
dtype: float64

[36]: s[s>2]

[36]: b 5.0
d 3.0
e 4.0
dtype: float64

[35]: s[s!=2]

[35]: a 0.0
b 5.0
d 3.0
e 4.0
dtype: float64

[38]: s[(s>2)&(s<5)]

[38]: d 3.0
e 4.0
dtype: float64

[33]:

[33]: b 5.0
c 2.0
dtype: float64

[7]:

b True
dtype: bool

[42]:

[42]: c 2.0
e 4.0
dtype: float64

4.5 2.4 Arithmetic Operations

[8]: import pandas as pd


series1 = pd.Series([1, 2, 3, 4, 5])
series2 = pd.Series([6, 7, 8, 9, 10])

[3]: series3 = series1 + series2


print(series3)

0 7
1 9
2 11
3 13
4 15
dtype: int64

[4]: series3 = series1 - series2


print(series3)

0 -5
1 -5
2 -5
3 -5
4 -5
dtype: int64

[5]: series3 = series1 *series2


print(series3)

0 6
1 14
2 24
3 36
4 50
dtype: int64

[6]: series3 = series1 /series2


print(series3)

0 0.166667
1 0.285714
2 0.375000
3 0.444444
4 0.500000
dtype: float64

[9]: series3 = series1 %series2


print(series3)

0 1
1 2
2 3
3 4
4 5
dtype: int64

4.6 2.5 Ranking

[10]: import pandas as pd


s=pd.Series([121,211,153,214,115,116,237,118,219,120])
s.rank(ascending=True)

[10]: 0 5.0
1 7.0
2 6.0
3 8.0
4 1.0
5 2.0
6 10.0
7 3.0
8 9.0
9 4.0
dtype: float64

[49]: s.rank(ascending=False)

[49]: 0 6.0
1 4.0
2 5.0
3 3.0
4 10.0
5 9.0
6 1.0
7 8.0
8 2.0
9 7.0
dtype: float64

[11]:

[11]: 0 5.0
1 7.0
2 6.0
3 8.0
4 1.0
5 2.0
6 10.0
7 3.0
8 9.0
9 4.0
dtype: float64

[12]:

[12]: 0 5.0
1 7.0
2 6.0
3 8.0
4 1.0
5 2.0
6 10.0
7 3.0
8 9.0
9 4.0
dtype: float64

[50]:

[50]: 0 5.0
1 7.0
2 6.0
3 8.0
4 1.0
5 2.0
6 10.0
7 3.0
8 9.0
9 4.0
dtype: float64

4.7 2.6 Sorting

[52]: import pandas as pd


sr = pd.Series([19.5, 16.8, 22.78, 20.124, 18.1002])
print(sr)

0 19.5000
1 16.8000
2 22.7800
3 20.1240
4 18.1002
dtype: float64

[8]: sr.sort_values(ascending = False)

[8]: 2 22.7800
3 20.1240
0 19.5000
4 18.1002
1 16.8000
dtype: float64

[53]: sr.sort_values(ascending = True)

[53]: 1 16.8000
4 18.1002
0 19.5000
3 20.1240
2 22.7800
dtype: float64

[55]: sr.sort_index()

[55]: 0 19.5000
1 16.8000
2 22.7800
3 20.1240
4 18.1002
dtype: float64

[58]: print(sr.sort_values())

1 16.8000
4 18.1002
0 19.5000
3 20.1240
2 22.7800
dtype: float64

4.8 2.7 Checking null values

[40]: s=pd.Series({'ohio':35000,'teyas':71000,'oregon':16000,'utah':5000})
print(s)
states=['california','ohio','Texas','oregon']
x=pd.Series(s,index=states)
print(x)

ohio 35000
teyas 71000
oregon 16000
utah 5000
dtype: int64
california NaN
ohio 35000.0
Texas NaN
oregon 16000.0
dtype: float64

[42]: x.isnull()

[42]: california True
ohio False
Texas True
oregon False
dtype: bool

[44]: x.notnull()

[44]: california False
ohio True
Texas False
oregon True
dtype: bool

4.9 2.8 Concatenation


[19]: series1 = pd.Series([1, 2, 3])
series2 = pd.Series(['A', 'B', 'C'])

[65]: print(pd.concat([series1, series2]))

0 1
1 2
2 3
0 A
1 B
2 C
dtype: object

[66]: df = pd.concat([series1, series2], axis=1)
print(df)

0 1
0 1 A
1 2 B
2 3 C

[67]: s3 = pd.concat([series1, series2], axis=0)
print(s3)

0 1
1 2
2 3
0 A
1 B
2 C
dtype: object

[21]: print(pd.concat([series1, series2], ignore_index=True))

0 1
1 2
2 3
3 A
4 B
5 C
dtype: object

[22]: print(series1.append(series2))

0 1
1 2
2 3
0 A
1 B
2 C
dtype: object

[69]: print(pd.concat([series1, series2], keys=['series1', 'series2']))

series1 0 1
1 2
2 3
series2 0 A
1 B
2 C
dtype: object

4.10 3. Creating DataFrames from List and Dictionary


4.11 3.1 From List
[16]: data = [1, 2, 3, 4, 5]

df = pd.DataFrame(data, columns=['Numbers'])
print(df)

Numbers
0 1
1 2
2 3
3 4
4 5

[70]: import pandas as pd
nme = ["aparna", "pankaj", "sudhir", "Geeku"]
deg = ["MBA", "BCA", "M.Tech", "MBA"]
scr = [90, 40, 80, 98]
dict = {'name': nme, 'degree': deg, 'score': scr}
df = pd.DataFrame(dict)
print(df)

name degree score


0 aparna MBA 90
1 pankaj BCA 40
2 sudhir M.Tech 80
3 Geeku MBA 98

[38]: import pandas as pd


data = [['G', 10], ['h', 15], ['i', 20]]

df = pd.DataFrame(data, columns = ['Name', 'Age'])
print(df)

Name Age
0 G 10
1 h 15
2 i 20

4.12 3.2 From Dictionary

[39]: df = pd.DataFrame({'a': [4, 5, 6], 'b': [7, 8, 9], 'c': [10, 11, 12]}, index=[1, 2, 3])
print(df)

a b c
1 4 7 10
2 5 8 11
3 6 9 12

[13]: df = pd.DataFrame({'state':['AP','AP','AP','TS','TS','TS'],'year':
[2000,2001,2002,2000,2001,2002],'pop':[1.5,1.7,3.6,2.4,2.9,3.2]})
print(df)

state year pop


0 AP 2000 1.5
1 AP 2001 1.7
2 AP 2002 3.6
3 TS 2000 2.4
4 TS 2001 2.9
5 TS 2002 3.2

[14] :

a b
n v
d 1 4 7
2 5 8
e 2 6 9

[71]: df = pd.DataFrame(np.arange(9.).reshape((3,3)), index=['a','c','d'], columns=['ap','ts','tn'])
df.reindex(['a','b','c','d'])

[71]: ap ts tn
a 0.0 1.0 2.0
b NaN NaN NaN
c 3.0 4.0 5.0
d 6.0 7.0 8.0

4.13 4. Import various file formats to pandas DataFrames and perform the following
4.14 4.1 Importing file

[10]: import pandas as pd


data=pd.read_csv('bird.csv')
data

[10]: id huml humw ulnal ulnaw feml femw tibl tibw tarl tarw \
0 0 80.78 6.68 72.01 4.88 41.81 3.70 5.50 4.03 38.70 3.84
1 1 88.91 6.63 80.53 5.59 47.04 4.30 80.22 4.51 41.50 4.01
2 2 79.97 6.37 69.26 5.28 43.07 3.90 75.35 4.04 38.31 3.34
3 3 77.65 5.70 65.76 4.77 40.04 3.52 69.17 3.40 35.78 3.41
4 4 62.80 4.84 52.09 3.73 33.95 2.72 56.27 2.96 31.88 3.13
.. … … … … … … … … … … …
415 415 17.96 1.63 19.25 1.33 18.36 1.54 31.25 1.33 21.99 1.15
416 416 19.21 1.64 20.76 1.49 19.24 1.45 33.21 1.28 23.60 1.15
417 417 18.79 1.63 19.83 1.53 20.96 1.43 34.45 1.41 22.86 1.21
418 418 20.38 1.78 22.53 1.50 21.35 1.48 36.09 1.53 25.98 1.24
419 419 17.89 1.44 19.26 1.10 17.62 1.34 29.81 1.24 21.69 1.05

type
0 SW
1 SW
2 SW
3 SW

4 SW
.. …
415 SO
416 SO
417 SO
418 SO
419 SO

[420 rows x 12 columns]

4.15 4.2 display top and bottom five rows

[15]: data.head()

[15] : id huml humw ulnal ulnaw feml femw tibl tibw tarl tarw type
0 0 80.78 6.68 72.01 4.88 41.81 3.70 5.50 4.03 38.70 3.84 SW
1 1 88.91 6.63 80.53 5.59 47.04 4.30 80.22 4.51 41.50 4.01 SW
2 2 79.97 6.37 69.26 5.28 43.07 3.90 75.35 4.04 38.31 3.34 SW
3 3 77.65 5.70 65.76 4.77 40.04 3.52 69.17 3.40 35.78 3.41 SW
4 4 62.80 4.84 52.09 3.73 33.95 2.72 56.27 2.96 31.88 3.13 SW

[16]: data.tail()

[16]: id huml humw ulnal ulnaw feml femw tibl tibw tarl tarw \
415 415 17.96 1.63 19.25 1.33 18.36 1.54 31.25 1.33 21.99 1.15
416 416 19.21 1.64 20.76 1.49 19.24 1.45 33.21 1.28 23.60 1.15
417 417 18.79 1.63 19.83 1.53 20.96 1.43 34.45 1.41 22.86 1.21
418 418 20.38 1.78 22.53 1.50 21.35 1.48 36.09 1.53 25.98 1.24
419 419 17.89 1.44 19.26 1.10 17.62 1.34 29.81 1.24 21.69 1.05

type
415 SO
416 SO
417 SO
418 SO
419 SO

4.16 4.3 Get shape,data type,null values,index and column details

[17] : data.shape

[17]: (420, 12)

[18]: data.dtypes
[18] : id int64
huml float64
humw float64
ulnal float64
ulnaw float64
feml float64
femw float64
tibl float64
tibw float64
tarl float64
tarw float64
type object
dtype: object

[19]: data.isnull().sum()

[19]: id 0
huml 1
humw 1
ulnal 3
ulnaw 2
feml 2
femw 1
tibl 2
tibw 1
tarl 1
tarw 1
type 0
dtype: int64

[20] : data.columns

[20]: Index(['id', 'huml', 'humw', 'ulnal', 'ulnaw', 'feml', 'femw', 'tibl', 'tibw',
'tarl', 'tarw', 'type'],
dtype='object')

[21]: data.index

[21]: RangeIndex(start=0, stop=420, step=1)

4.17 4.4 Select/Delete the records rows/columns based on conditions

[24]:

[24]: id huml humw ulnal ulnaw feml femw tibl tibw tarl tarw \
0 0 80.78 6.68 72.01 4.88 41.81 3.70 5.50 4.03 38.70 3.84
1 1 88.91 6.63 80.53 5.59 47.04 4.30 80.22 4.51 41.50 4.01

2 2 79.97 6.37 69.26 5.28 43.07 3.90 75.35 4.04 38.31 3.34
3 3 77.65 5.70 65.76 4.77 40.04 3.52 69.17 3.40 35.78 3.41
4 4 62.80 4.84 52.09 3.73 33.95 2.72 56.27 2.96 31.88 3.13
.. … … … … … … … … … … …
415 415 17.96 1.63 19.25 1.33 18.36 1.54 31.25 1.33 21.99 1.15
416 416 19.21 1.64 20.76 1.49 19.24 1.45 33.21 1.28 23.60 1.15
417 417 18.79 1.63 19.83 1.53 20.96 1.43 34.45 1.41 22.86 1.21
418 418 20.38 1.78 22.53 1.50 21.35 1.48 36.09 1.53 25.98 1.24
419 419 17.89 1.44 19.26 1.10 17.62 1.34 29.81 1.24 21.69 1.05

type
0 SW
1 SW
2 SW
3 SW
4 SW
.. …
415 SO
416 SO
417 SO
418 SO
419 SO

[419 rows x 12 columns]

[25]:

[25]: id huml humw ulnal ulnaw feml femw tibl tibw tarl tarw \
1 1 88.91 6.63 80.53 5.59 47.04 4.30 80.22 4.51 41.50 4.01
2 2 79.97 6.37 69.26 5.28 43.07 3.90 75.35 4.04 38.31 3.34
4 4 62.80 4.84 52.09 3.73 33.95 2.72 56.27 2.96 31.88 3.13
5 5 61.92 4.78 50.46 3.47 49.52 4.41 56.95 2.73 29.07 2.83
6 6 79.73 5.94 67.39 4.50 42.07 3.41 71.26 3.56 37.22 3.64
.. … … … … … … … … … … …
415 415 17.96 1.63 19.25 1.33 18.36 1.54 31.25 1.33 21.99 1.15
416 416 19.21 1.64 20.76 1.49 19.24 1.45 33.21 1.28 23.60 1.15
417 417 18.79 1.63 19.83 1.53 20.96 1.43 34.45 1.41 22.86 1.21
418 418 20.38 1.78 22.53 1.50 21.35 1.48 36.09 1.53 25.98 1.24
419 419 17.89 1.44 19.26 1.10 17.62 1.34 29.81 1.24 21.69 1.05

type
1 SW
2 SW
4 SW
5 SW
6 SW
.. …

415 SO
416 SO
417 SO
418 SO
419 SO

[418 rows x 12 columns]

[27]: data[data['huml'].isnull()]

[27] : id huml humw ulnal ulnaw feml femw tibl tibw tarl tarw type
342 342 NaN NaN NaN NaN 32.54 2.65 55.06 2.81 38.94 2.25 SO

[28]: data.loc[6,'ulnal']

[28]: 67.39

[29]: data.loc[11:15, ['huml','humw']]

[29] : huml humw


11 186.00 9.83
12 172.00 8.44
13 148.91 6.78
14 149.19 6.98
15 140.59 6.59

4.18 4.5 Sorting and Ranking operations in DataFrame

[30] : data

[30]: id huml humw ulnal ulnaw feml femw tibl tibw tarl tarw \
0 0 80.78 6.68 72.01 4.88 41.81 3.70 5.50 4.03 38.70 3.84
1 1 88.91 6.63 80.53 5.59 47.04 4.30 80.22 4.51 41.50 4.01
2 2 79.97 6.37 69.26 5.28 43.07 3.90 75.35 4.04 38.31 3.34
3 3 77.65 5.70 65.76 4.77 40.04 3.52 69.17 3.40 35.78 3.41
4 4 62.80 4.84 52.09 3.73 33.95 2.72 56.27 2.96 31.88 3.13
.. … … … … … … … … … … …
415 415 17.96 1.63 19.25 1.33 18.36 1.54 31.25 1.33 21.99 1.15
416 416 19.21 1.64 20.76 1.49 19.24 1.45 33.21 1.28 23.60 1.15
417 417 18.79 1.63 19.83 1.53 20.96 1.43 34.45 1.41 22.86 1.21
418 418 20.38 1.78 22.53 1.50 21.35 1.48 36.09 1.53 25.98 1.24
419 419 17.89 1.44 19.26 1.10 17.62 1.34 29.81 1.24 21.69 1.05

type
0 SW
1 SW
2 SW

3 SW
4 SW
.. …
415 SO
416 SO
417 SO
418 SO
419 SO

[420 rows x 12 columns]

[31]: data.sort_index(ascending=False)

[31]: id huml humw ulnal ulnaw feml femw tibl tibw tarl tarw \
419 419 17.89 1.44 19.26 1.10 17.62 1.34 29.81 1.24 21.69 1.05
418 418 20.38 1.78 22.53 1.50 21.35 1.48 36.09 1.53 25.98 1.24
417 417 18.79 1.63 19.83 1.53 20.96 1.43 34.45 1.41 22.86 1.21
416 416 19.21 1.64 20.76 1.49 19.24 1.45 33.21 1.28 23.60 1.15
415 415 17.96 1.63 19.25 1.33 18.36 1.54 31.25 1.33 21.99 1.15
.. … … … … … … … … … … …
4 4 62.80 4.84 52.09 3.73 33.95 2.72 56.27 2.96 31.88 3.13
3 3 77.65 5.70 65.76 4.77 40.04 3.52 69.17 3.40 35.78 3.41
2 2 79.97 6.37 69.26 5.28 43.07 3.90 75.35 4.04 38.31 3.34
1 1 88.91 6.63 80.53 5.59 47.04 4.30 80.22 4.51 41.50 4.01
0 0 80.78 6.68 72.01 4.88 41.81 3.70 5.50 4.03 38.70 3.84

type
419 SO
418 SO
417 SO
416 SO
415 SO
.. …
4 SW
3 SW
2 SW
1 SW
0 SW

[420 rows x 12 columns]

[32]: data.sort_values(by='ulnaw').head(6)

[32] : id huml humw ulnal ulnaw feml femw tibl tibw tarl tarw \
369 369 13.48 1.27 16.00 1.00 12.67 1.10 23.12 0.88 16.34 0.89
413 413 12.95 1.16 14.09 1.03 13.03 1.03 22.13 0.96 15.19 1.02
395 395 15.62 1.28 18.52 1.06 15.75 1.17 28.63 1.03 21.39 0.88

367 367 13.31 1.17 16.47 1.06 12.32 0.93 22.47 0.95 15.97 0.75
414 414 13.63 1.16 15.22 1.06 13.75 0.99 23.13 0.96 15.62 1.01
376 376 13.52 1.28 17.88 1.07 15.10 1.05 25.14 1.23 17.81 0.69

type
369 SO
413 SO
395 SO
367 SO
414 SO
376 SO

[33]: data.sort_values(by=['ulnaw','tarl']).head(6)

[33]: id huml humw ulnal ulnaw feml femw tibl tibw tarl tarw \
369 369 13.48 1.27 16.00 1.00 12.67 1.10 23.12 0.88 16.34 0.89
413 413 12.95 1.16 14.09 1.03 13.03 1.03 22.13 0.96 15.19 1.02
414 414 13.63 1.16 15.22 1.06 13.75 0.99 23.13 0.96 15.62 1.01
367 367 13.31 1.17 16.47 1.06 12.32 0.93 22.47 0.95 15.97 0.75
395 395 15.62 1.28 18.52 1.06 15.75 1.17 28.63 1.03 21.39 0.88
376 376 13.52 1.28 17.88 1.07 15.10 1.05 25.14 1.23 17.81 0.69

type
369 SO
413 SO
414 SO
367 SO
395 SO
376 SO

[34]: data.rank().head(10)

[34]: id huml humw ulnal ulnaw feml femw tibl tibw tarl tarw \
0 1.0 289.0 344.0 275.0 325.5 289.0 295.0 1.0 302.0 272.0 328.0
1 2.0 308.0 343.0 284.0 343.0 312.0 320.0 308.0 327.5 285.0 333.0
2 3.0 286.0 336.0 268.0 334.0 295.0 303.5 292.0 303.5 271.0 305.5
3 4.0 284.0 308.0 255.0 313.5 279.0 288.0 270.0 272.5 247.0 310.5
4 5.0 248.0 281.0 227.5 258.0 224.0 225.5 231.0 250.0 211.0 294.0
5 6.0 246.0 275.0 223.0 242.0 326.0 322.0 234.0 234.0 181.0 268.5
6 7.0 285.0 321.0 262.0 304.0 292.0 282.5 279.0 280.5 259.0 320.0
7 8.0 304.0 306.0 278.0 306.0 300.0 299.0 296.0 295.5 266.0 324.0
8 9.0 362.0 370.0 354.0 362.0 365.0 356.5 363.5 359.0 352.0 346.0
9 10.0 387.0 399.0 381.5 383.0 382.0 398.0 382.0 397.0 392.0 377.0

type
0 274.5
1 274.5

2 274.5
3 274.5
4 274.5
5 274.5
6 274.5
7 274.5
8 274.5
9 274.5

[35]: data.rank().head(2)

[35]: id huml humw ulnal ulnaw feml femw tibl tibw tarl tarw \
0 1.0 289.0 344.0 275.0 325.5 289.0 295.0 1.0 302.0 272.0 328.0
1 2.0 308.0 343.0 284.0 343.0 312.0 320.0 308.0 327.5 285.0 333.0

type
0 274.5
1 274.5

[15]:

[15]: PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket \
0 891.0 617.0 246.0 783.0 289.0 497.0 179.0 552.5 220.0
1 890.0 171.5 783.5 701.0 734.5 183.0 179.0 552.5 112.0
2 889.0 171.5 246.0 538.0 734.5 404.5 587.5 552.5 17.0
3 888.0 171.5 783.5 619.0 734.5 226.5 179.0 552.5 824.5
4 887.0 617.0 246.0 876.0 289.0 226.5 587.5 552.5 283.0

Fare Cabin Embarked


0 815.0 NaN 322.5
1 103.0 94.0 805.5
2 659.5 NaN 322.5
3 144.0 134.5 322.5
4 628.0 NaN 322.5

4.18.1 4.6 Statistical operations

[23]: import pandas as pd


data=pd.read_csv('gym-track.csv')
data

[23]: Age Gender Weight (kg) Height (m) Max_BPM Avg_BPM Resting_BPM \
0 56 Male 88.3 1.71 180 157 60
1 46 Female 74.9 1.53 179 151 66
2 32 Female 68.1 1.66 167 122 54
3 25 Male 53.2 1.70 190 164 56
4 38 Male 46.1 1.79 188 158 68

.. … … … … … … …
968 24 Male 87.1 1.74 187 158 67
969 25 Male 66.6 1.61 184 166 56
970 59 Female 60.4 1.76 194 120 53
971 32 Male 126.4 1.83 198 146 62
972 46 Male 88.7 1.63 166 146 66

Session_Duration (hours) Calories_Burned Workout_Type Fat_Percentage \


0 1.69 1313.0 Yoga 12.6
1 1.30 883.0 HIIT 33.9
2 1.11 677.0 Cardio 33.4
3 0.59 532.0 Strength 28.8
4 0.64 556.0 Strength 29.2
.. … … … …
968 1.57 1364.0 Strength 10.0
969 1.38 1260.0 Strength 25.0
970 1.72 929.0 Cardio 18.8
971 1.10 883.0 HIIT 28.2
972 0.75 542.0 Strength 28.8

Water_Intake (liters) Workout_Frequency (days/week) Experience_Level \


0 3.5 4 3
1 2.1 4 2
2 2.3 4 2
3 2.1 3 1
4 2.8 3 1
.. … … …
968 3.5 4 3
969 3.0 2 1
970 2.7 5 3
971 2.1 3 2
972 3.5 2 1

BMI
0 30.20
1 32.00
2 24.71
3 18.41
4 14.39
.. …
968 28.77
969 25.69
970 19.50
971 37.74
972 33.38

[973 rows x 15 columns]

[25]: data['Age'].mean()

[25]: 38.68345323741007

[28]: data['Age'].median()

[28]: 40.0

[29]: data['Age'].std()

[29]: 12.180927866987108

[30]: data['Age'].sum()

[30]: 37639

[31]: data['Age'].var()

[31]: 148.37500370074312

4.18.2 4.7 Count and Uniqueness of given Categorical values

[35]: data.count()

[35]: Age 973


Gender 973
Weight (kg) 973
Height (m) 973
Max_BPM 973
Avg_BPM 973
Resting_BPM 973
Session_Duration (hours) 973
Calories_Burned 973
Workout_Type 973
Fat_Percentage 973
Water_Intake (liters) 973
Workout_Frequency (days/week) 973
Experience_Level 973
BMI 973
dtype: int64

Data cleaning and preparation

a) Handling missing data by detecting, dropping and replacing/filling
missing values

import pandas as pd
import numpy as np

Student performance
Import any csv file to pandas data frame and perform the following
# Load the CSV file into a Pandas DataFrame
df = pd.read_csv('student.csv')
df

Hours_Studied Attendance Parental_Involvement


Access_to_Resources \
0 23 84 Low
High
1 19 64 Low
Medium
2 24 98 Medium
Medium
3 29 89 Low
Medium
4 19 92 Medium
Medium
... ... ... ... .
..
6602 25 69 High
Medium
6603 23 76 High
Medium
6604 20 90 Medium
Low
6605 10 86 High
High
6606 15 67 Medium
Low

Extracurricular_Activities Sleep_Hours Previous_Scores \


0 No 7 73
1 No 8 59
2 Yes 7 91
3 Yes 8 98
4 Yes 6 65
... ... ... ...
6602 No 7 76

6603 No 8 81
6604 Yes 6 65
6605 Yes 6 91
6606 Yes 9 94

Motivation_Level Internet_Access Tutoring_Sessions Family_Income


\
0 Low Yes 0 Low

1 Low Yes 2 Medium

2 Medium Yes 2 Medium

3 Medium Yes 1 Medium

4 Medium Yes 3 Medium

... ... ... ... ...

6602 Medium Yes 1 High

6603 Medium Yes 3 Low

6604 Low Yes 3 Low

6605 High Yes 2 Low

6606 Medium Yes 0 Medium

Teacher_Quality School_Type Peer_Influence Physical_Activity \


0 Medium Public Positive 3
1 Medium Public Negative 4
2 Medium Public Neutral 4
3 Medium Public Negative 4
4 High Public Neutral 4
... ... ... ... ...
6602 Medium Public Positive 2
6603 High Public Positive 2
6604 Medium Public Negative 2
6605 Medium Private Positive 3
6606 Medium Public Positive 4

Learning_Disabilities Parental_Education_Level Distance_from_Home


\
0 No High School Near

1 No College Moderate

2 No Postgraduate Near

3 No High School Moderate

4 No College Near

... ... ... ...

6602 No High School Near

6603 No High School Near

6604 No Postgraduate Near

6605 No High School Far

6606 No Postgraduate Near

Gender Exam_Score
0 Male 67
1 Female 61
2 Male 74
3 Male 71
4 Female 70
... ... ...
6602 Female 68
6603 Female 69
6604 Female 68
6605 Female 68
6606 Male 64

[6607 rows x 20 columns]

# Display the first few rows of the DataFrame to understand the data
print("Original DataFrame:")
print(df.head())

Original DataFrame:
Hours_Studied Attendance Parental_Involvement Access_to_Resources
\
0 23 84 Low High

1 19 64 Low Medium

2 24 98 Medium Medium

3 29 89 Low Medium

4 19 92 Medium Medium

Extracurricular_Activities Sleep_Hours Previous_Scores

Motivation_Level \
0 No 7 73
Low
1 No 8 59
Low
2 Yes 7 91
Medium
3 Yes 8 98
Medium
4 Yes 6 65
Medium

Internet_Access Tutoring_Sessions Family_Income Teacher_Quality \


0 Yes 0 Low Medium
1 Yes 2 Medium Medium
2 Yes 2 Medium Medium
3 Yes 1 Medium Medium
4 Yes 3 Medium High

School_Type Peer_Influence Physical_Activity Learning_Disabilities


\
0 Public Positive 3 No

1 Public Negative 4 No

2 Public Neutral 4 No

3 Public Negative 4 No

4 Public Neutral 4 No

Parental_Education_Level Distance_from_Home Gender Exam_Score


0 High School Near Male 67
1 College Moderate Female 61
2 Postgraduate Near Male 74
3 High School Moderate Male 71
4 College Near Female 70

# 1. Detect missing data


missing_data = df.isnull()
print("\nMissing Data:")
print(missing_data.head(10))

Missing Data:
Hours_Studied Attendance Parental_Involvement
Access_to_Resources \
0 False False False
False
1 False False False

False
2 False False False
False
3 False False False
False
4 False False False
False
5 False False False
False
6 False False False
False
7 False False False
False
8 False False False
False
9 False False False
False

Extracurricular_Activities Sleep_Hours Previous_Scores


Motivation_Level \
0 False False False
False
1 False False False
False
2 False False False
False
3 False False False
False
4 False False False
False
5 False False False
False
6 False False False
False
7 False False False
False
8 False False False
False
9 False False False
False

Internet_Access Tutoring_Sessions Family_Income Teacher_Quality


\
0 False False False False

1 False False False False

2 False False False False

3 False False False False

4 False False False False

5 False False False False

6 False False False False

7 False False False False

8 False False False False

9 False False False False

School_Type Peer_Influence Physical_Activity


Learning_Disabilities \
0 False False False
False
1 False False False
False
2 False False False
False
3 False False False
False
4 False False False
False
5 False False False
False
6 False False False
False
7 False False False
False
8 False False False
False
9 False False False
False

Parental_Education_Level Distance_from_Home Gender Exam_Score


0 False False False False
1 False False False False
2 False False False False
3 False False False False
4 False False False False
5 False False False False
6 False False False False
7 False False False False
8 False False False False
9 False False False False

# No of null values
n=df.isnull().sum()
n

Hours_Studied 0
Attendance 0
Parental_Involvement 0
Access_to_Resources 0
Extracurricular_Activities 0
Sleep_Hours 0
Previous_Scores 0
Motivation_Level 0
Internet_Access 0
Tutoring_Sessions 0
Family_Income 0
Teacher_Quality 78
School_Type 0
Peer_Influence 0
Physical_Activity 0
Learning_Disabilities 0
Parental_Education_Level 90
Distance_from_Home 67
Gender 0
Exam_Score 0
dtype: int64

# 2. Drop rows with missing values


df_dropna = df.dropna()
print("\nDataFrame after dropping rows with missing values:")
print(df_dropna.head(10))

DataFrame after dropping rows with missing values:


Hours_Studied Attendance Parental_Involvement Access_to_Resources
\
0 23 84 Low High

1 19 64 Low Medium

2 24 98 Medium Medium

3 29 89 Low Medium

4 19 92 Medium Medium

5 19 88 Medium Medium

6 29 84 Medium Low

7 25 78 Low High

8 17 94 Medium High

9 23 98 Medium Medium

Extracurricular_Activities Sleep_Hours Previous_Scores


Motivation_Level \
0 No 7 73
Low
1 No 8 59
Low
2 Yes 7 91
Medium
3 Yes 8 98
Medium
4 Yes 6 65
Medium
5 Yes 8 89
Medium
6 Yes 7 68
Low
7 Yes 6 50
Medium
8 No 6 80
High
9 Yes 8 71
Medium

Internet_Access Tutoring_Sessions Family_Income Teacher_Quality \


0 Yes 0 Low Medium
1 Yes 2 Medium Medium
2 Yes 2 Medium Medium
3 Yes 1 Medium Medium
4 Yes 3 Medium High
5 Yes 3 Medium Medium
6 Yes 1 Low Medium
7 Yes 1 High High
8 Yes 0 Medium Low
9 Yes 0 High High

School_Type Peer_Influence Physical_Activity Learning_Disabilities


\
0 Public Positive 3 No

1 Public Negative 4 No

2 Public Neutral 4 No

3 Public Negative 4 No

4 Public Neutral 4 No

5 Public Positive 3 No

6 Private Neutral 2 No

7 Public Negative 2 No

8 Private Neutral 1 No

9 Public Positive 5 No

Parental_Education_Level Distance_from_Home Gender Exam_Score


0 High School Near Male 67
1 College Moderate Female 61
2 Postgraduate Near Male 74
3 High School Moderate Male 71
4 College Near Female 70
5 Postgraduate Near Male 71
6 High School Moderate Male 67
7 High School Far Male 66
8 College Near Male 69
9 High School Moderate Male 72

# 3. Fill missing values with a specific value (e.g., mean, median, or
# custom value)
# Fill missing values in the 'Attendance' column with the mean value of
# that column
mean_chas = df['Attendance'].mean()
df_fillna = df.fillna({'Attendance': mean_chas})
print("\nDataFrame after filling missing values:")
print(df_fillna.head(10))

DataFrame after filling missing values:


Hours_Studied Attendance Parental_Involvement Access_to_Resources
\
0 23 84 Low High

1 19 64 Low Medium

2 24 98 Medium Medium

3 29 89 Low Medium

4 19 92 Medium Medium

5 19 88 Medium Medium

6 29 84 Medium Low

7 25 78 Low High

8 17 94 Medium High

9 23 98 Medium Medium

Extracurricular_Activities Sleep_Hours Previous_Scores


Motivation_Level \
0 No 7 73
Low
1 No 8 59
Low
2 Yes 7 91
Medium
3 Yes 8 98
Medium
4 Yes 6 65
Medium
5 Yes 8 89
Medium
6 Yes 7 68
Low
7 Yes 6 50
Medium
8 No 6 80
High
9 Yes 8 71
Medium

Internet_Access Tutoring_Sessions Family_Income Teacher_Quality \


0 Yes 0 Low Medium
1 Yes 2 Medium Medium
2 Yes 2 Medium Medium
3 Yes 1 Medium Medium
4 Yes 3 Medium High
5 Yes 3 Medium Medium
6 Yes 1 Low Medium
7 Yes 1 High High
8 Yes 0 Medium Low
9 Yes 0 High High

School_Type Peer_Influence Physical_Activity Learning_Disabilities


\
0 Public Positive 3 No

1 Public Negative 4 No

2 Public Neutral 4 No

3 Public Negative 4 No

4 Public Neutral 4 No

5 Public Positive 3 No

6 Private Neutral 2 No

7 Public Negative 2 No

8 Private Neutral 1 No

9 Public Positive 5 No

Parental_Education_Level Distance_from_Home Gender Exam_Score


0 High School Near Male 67
1 College Moderate Female 61
2 Postgraduate Near Male 74
3 High School Moderate Male 71
4 College Near Female 70
5 Postgraduate Near Male 71
6 High School Moderate Male 67
7 High School Far Male 66
8 College Near Male 69
9 High School Moderate Male 72

# 4. Replace missing values conditionally


# For example, replace missing values in 'Hours_Studied' with 'Unknown'
df_replace = df.fillna({'Hours_Studied': 'Unknown'})
print("\nDataFrame after replacing missing values:")
print(df_replace.head())

DataFrame after replacing missing values:


Hours_Studied Attendance Parental_Involvement Access_to_Resources
\
0 23 84 Low High

1 19 64 Low Medium

2 24 98 Medium Medium

3 29 89 Low Medium

4 19 92 Medium Medium

Extracurricular_Activities Sleep_Hours Previous_Scores


Motivation_Level \

0 No 7 73
Low
1 No 8 59
Low
2 Yes 7 91
Medium
3 Yes 8 98
Medium
4 Yes 6 65
Medium

Internet_Access Tutoring_Sessions Family_Income Teacher_Quality \


0 Yes 0 Low Medium
1 Yes 2 Medium Medium
2 Yes 2 Medium Medium
3 Yes 1 Medium Medium
4 Yes 3 Medium High

School_Type Peer_Influence Physical_Activity Learning_Disabilities


\
0 Public Positive 3 No

1 Public Negative 4 No

2 Public Neutral 4 No

3 Public Negative 4 No

4 Public Neutral 4 No

Parental_Education_Level Distance_from_Home Gender Exam_Score


0 High School Near Male 67
1 College Moderate Female 61
2 Postgraduate Near Male 74
3 High School Moderate Male 71
4 College Near Female 70

b) Transform data using apply() and map() methods


# Load the CSV file into a Pandas DataFrame
# Replace 'data.csv' with the actual file path if needed
df=pd.read_csv('student.csv')

# Display the first few rows of the DataFrame to understand the data
print("Original DataFrame:")
print(df.head())

Original DataFrame:
Hours_Studied Attendance Parental_Involvement Access_to_Resources

\
0 23 84 Low High

1 19 64 Low Medium

2 24 98 Medium Medium

3 29 89 Low Medium

4 19 92 Medium Medium

Extracurricular_Activities Sleep_Hours Previous_Scores


Motivation_Level \
0 No 7 73
Low
1 No 8 59
Low
2 Yes 7 91
Medium
3 Yes 8 98
Medium
4 Yes 6 65
Medium

Internet_Access Tutoring_Sessions Family_Income Teacher_Quality \


0 Yes 0 Low Medium
1 Yes 2 Medium Medium
2 Yes 2 Medium Medium
3 Yes 1 Medium Medium
4 Yes 3 Medium High

School_Type Peer_Influence Physical_Activity Learning_Disabilities


\
0 Public Positive 3 No

1 Public Negative 4 No

2 Public Neutral 4 No

3 Public Negative 4 No

4 Public Neutral 4 No

Parental_Education_Level Distance_from_Home Gender Exam_Score


0 High School Near Male 67
1 College Moderate Female 61
2 Postgraduate Near Male 74
3 High School Moderate Male 71
4 College Near Female 70

# Assume 'Previous_Scores' is the column we want to transform

# 1. Transform using apply() method


# Square the 'Previous_Scores' values and store the result in 'Sleep_Hours'
df['Sleep_Hours'] = df['Previous_Scores'].apply(lambda x: x ** 2)
df

Hours_Studied Attendance Parental_Involvement


Access_to_Resources \
0 23 84 Low
High
1 19 64 Low
Medium
2 24 98 Medium
Medium
3 29 89 Low
Medium
4 19 92 Medium
Medium
... ... ... ... .
..
6602 25 69 High
Medium
6603 23 76 High
Medium
6604 20 90 Medium
Low
6605 10 86 High
High
6606 15 67 Medium
Low

Extracurricular_Activities Sleep_Hours Previous_Scores \


0 No 5329 73
1 No 3481 59
2 Yes 8281 91
3 Yes 9604 98
4 Yes 4225 65
... ... ... ...
6602 No 5776 76
6603 No 6561 81
6604 Yes 4225 65
6605 Yes 8281 91
6606 Yes 8836 94

Motivation_Level Internet_Access Tutoring_Sessions Family_Income


\
0 Low Yes 0 Low

1 Low Yes 2 Medium

2 Medium Yes 2 Medium

3 Medium Yes 1 Medium

4 Medium Yes 3 Medium

... ... ... ... ...

6602 Medium Yes 1 High

6603 Medium Yes 3 Low

6604 Low Yes 3 Low

6605 High Yes 2 Low

6606 Medium Yes 0 Medium

Teacher_Quality School_Type Peer_Influence Physical_Activity \


0 Medium Public Positive 3
1 Medium Public Negative 4
2 Medium Public Neutral 4
3 Medium Public Negative 4
4 High Public Neutral 4
... ... ... ... ...
6602 Medium Public Positive 2
6603 High Public Positive 2
6604 Medium Public Negative 2
6605 Medium Private Positive 3
6606 Medium Public Positive 4

Learning_Disabilities Parental_Education_Level Distance_from_Home


\
0 No High School Near

1 No College Moderate

2 No Postgraduate Near

3 No High School Moderate

4 No College Near

... ... ... ...

6602 No High School Near

6603 No High School Near

6604 No Postgraduate Near

6605 No High School Far

6606 No Postgraduate Near

Gender Exam_Score
0 Male 67
1 Female 61
2 Male 74
3 Male 71
4 Female 70
... ... ...
6602 Female 68
6603 Female 69
6604 Female 68
6605 Female 68
6606 Male 64

[6607 rows x 20 columns]

# 2. Transform using map() method


# Map 'Previous_Scores' through a category dictionary; scores without a
# matching key become NaN, which is why the output below shows NaN
Age_category_map = {0: 'Low', 1: 'Medium', 2: 'High'}
df['Sleep_Hours'] = df['Previous_Scores'].map(Age_category_map)

# Display the transformed DataFrame


print("\nDataFrame after transformation:")
print(df.head())

DataFrame after transformation:


Hours_Studied Attendance Parental_Involvement Access_to_Resources
\
0 23 84 Low High

1 19 64 Low Medium

2 24 98 Medium Medium

3 29 89 Low Medium

4 19 92 Medium Medium

Extracurricular_Activities Sleep_Hours Previous_Scores


Motivation_Level \
0 No NaN 73
Low
1 No NaN 59

Low
2 Yes NaN 91
Medium
3 Yes NaN 98
Medium
4 Yes NaN 65
Medium

Internet_Access Tutoring_Sessions Family_Income Teacher_Quality \


0 Yes 0 Low Medium
1 Yes 2 Medium Medium
2 Yes 2 Medium Medium
3 Yes 1 Medium Medium
4 Yes 3 Medium High

School_Type Peer_Influence Physical_Activity Learning_Disabilities


\
0 Public Positive 3 No

1 Public Negative 4 No

2 Public Neutral 4 No

3 Public Negative 4 No

4 Public Neutral 4 No

Parental_Education_Level Distance_from_Home Gender Exam_Score


0 High School Near Male 67
1 College Moderate Female 61
2 Postgraduate Near Male 74
3 High School Moderate Male 71
4 College Near Female 70

c) Detect and filter outliers


# Load the CSV file into a Pandas DataFrame
# Replace 'data.csv' with the actual file path if needed
df = pd.read_csv('titanic.csv')
df
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
.. ... ... ...

886 887 0 2
887 888 1 1
888 889 0 3
889 890 1 1
890 891 0 3

Name Sex Age


SibSp \
0 Braund, Mr. Owen Harris male 22.0
1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0
1
2 Heikkinen, Miss. Laina female 26.0
0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0
1
4 Allen, Mr. William Henry male 35.0
0
.. ... ... ...
...
886 Montvila, Rev. Juozas male 27.0
0
887 Graham, Miss. Margaret Edith female 19.0
0
888 Johnston, Miss. Catherine Helen "Carrie" female NaN
1
889 Behr, Mr. Karl Howell male 26.0
0
890 Dooley, Mr. Patrick male 32.0
0

Parch Ticket Fare Cabin Embarked


0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
.. ... ... ... ... ...
886 0 211536 13.0000 NaN S
887 0 112053 30.0000 B42 S
888 2 W./C. 6607 23.4500 NaN S
889 0 111369 30.0000 C148 C
890 0 370376 7.7500 NaN Q

[891 rows x 12 columns]

# Display the first few rows of the DataFrame to understand the data
print("Original DataFrame:")
print(df.head())

Original DataFrame:
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3

Name Sex Age


SibSp \
0 Braund, Mr. Owen Harris male 22.0
1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0
1
2 Heikkinen, Miss. Laina female 26.0
0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0
1
4 Allen, Mr. William Henry male 35.0
0

Parch Ticket Fare Cabin Embarked


0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S

# Select the column to analyze for outliers (replace 'Value' with the
actual column name)
column_name = 'Fare'

# Calculate the z-scores for the selected column


z_scores = np.abs((df[column_name] - df[column_name].mean()) /
df[column_name].std())
z_scores.head(10)
0 0.502163
1 0.786404
2 0.488580
3 0.420494
4 0.486064
5 0.477848
6 0.395591
7 0.223957

8 0.424018
9 0.042931
Name: Fare, dtype: float64

# Define a threshold for outliers (e.g., z-score greater than 3)


z_score_threshold = 3

# Filter the DataFrame to keep rows without outliers


filtered_df = df[z_scores <= z_score_threshold]

# Display the DataFrame after filtering outliers


print("\nDataFrame after filtering outliers:")
print(filtered_df.head())

DataFrame after filtering outliers:


PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3

Name Sex Age


SibSp \
0 Braund, Mr. Owen Harris male 22.0
1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0
1
2 Heikkinen, Miss. Laina female 26.0
0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0
1
4 Allen, Mr. William Henry male 35.0
0

Parch Ticket Fare Cabin Embarked


0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S

d) Perform vectorized string operations on pandas Series
# Load the CSV file into a Pandas DataFrame
df = pd.read_csv('titanic.csv')
df

PassengerId Survived Pclass \


0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
.. ... ... ...
886 887 0 2
887 888 1 1
888 889 0 3
889 890 1 1
890 891 0 3

Name Sex Age


SibSp \
0 Braund, Mr. Owen Harris male 22.0
1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0
1
2 Heikkinen, Miss. Laina female 26.0
0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0
1
4 Allen, Mr. William Henry male 35.0
0
.. ... ... ...
...
886 Montvila, Rev. Juozas male 27.0
0
887 Graham, Miss. Margaret Edith female 19.0
0
888 Johnston, Miss. Catherine Helen "Carrie" female NaN
1
889 Behr, Mr. Karl Howell male 26.0
0
890 Dooley, Mr. Patrick male 32.0
0

Parch Ticket Fare Cabin Embarked


0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S

.. ... ... ... ... ...
886 0 211536 13.0000 NaN S
887 0 112053 30.0000 B42 S
888 2 W./C. 6607 23.4500 NaN S
889 0 111369 30.0000 C148 C
890 0 370376 7.7500 NaN Q

[891 rows x 12 columns]

# Assuming 'Name' is the column containing strings

# Convert all names to uppercase


df['Name']= df['Sex'].str.upper()
df

PassengerId Survived Pclass Name Sex Age SibSp Parch


\
0 1 0 3 MALE male 22.0 1 0

1 2 1 1 FEMALE female 38.0 1 0

2 3 1 3 FEMALE female 26.0 0 0

3 4 1 1 FEMALE female 35.0 1 0

4 5 0 3 MALE male 35.0 0 0

.. ... ... ... ... ... ... ... ...

886 887 0 2 MALE male 27.0 0 0

887 888 1 1 FEMALE female 19.0 0 0

888 889 0 3 FEMALE female NaN 1 2

889 890 1 1 MALE male 26.0 0 0

890 891 0 3 MALE male 32.0 0 0

Ticket Fare Cabin Embarked


0 A/5 21171 7.2500 NaN S
1 PC 17599 71.2833 C85 C
2 STON/O2. 3101282 7.9250 NaN S
3 113803 53.1000 C123 S
4 373450 8.0500 NaN S
.. ... ... ... ...
886 211536 13.0000 NaN S
887 112053 30.0000 B42 S
888 W./C. 6607 23.4500 NaN S
889 111369 30.0000 C148 C

890 370376 7.7500 NaN Q

[891 rows x 12 columns]

# Calculate the length of each name


df['Name'] = df['Sex'].str.len()
df

PassengerId Survived Pclass Name Sex Age SibSp


Parch \
0 1 0 3 4 male 22.0 1 0

1 2 1 1 6 female 38.0 1 0

2 3 1 3 6 female 26.0 0 0

3 4 1 1 6 female 35.0 1 0

4 5 0 3 4 male 35.0 0 0

.. ... ... ... ... ... ... ... ...

886 887 0 2 4 male 27.0 0 0

887 888 1 1 6 female 19.0 0 0

888 889 0 3 6 female NaN 1 2

889 890 1 1 4 male 26.0 0 0

890 891 0 3 4 male 32.0 0 0

Ticket Fare Cabin Embarked


0 A/5 21171 7.2500 NaN S
1 PC 17599 71.2833 C85 C
2 STON/O2. 3101282 7.9250 NaN S
3 113803 53.1000 C123 S
4 373450 8.0500 NaN S
.. ... ... ... ...
886 211536 13.0000 NaN S
887 112053 30.0000 B42 S
888 W./C. 6607 23.4500 NaN S
889 111369 30.0000 C148 C
890 370376 7.7500 NaN Q

[891 rows x 12 columns]

# Split the names based on a delimiter (e.g., space) and create a new
column for the first part of the name

df['Name'] = df['Sex'].str.split(' ').str[0]
df
PassengerId Survived Pclass Name Sex Age SibSp Parch
\
0 1 0 3 male male 22.0 1 0

1 2 1 1 female female 38.0 1 0

2 3 1 3 female female 26.0 0 0

3 4 1 1 female female 35.0 1 0

4 5 0 3 male male 35.0 0 0

.. ... ... ... ... ... ... ... ...

886 887 0 2 male male 27.0 0 0

887 888 1 1 female female 19.0 0 0

888 889 0 3 female female NaN 1 2

889 890 1 1 male male 26.0 0 0

890 891 0 3 male male 32.0 0 0

Ticket Fare Cabin Embarked


0 A/5 21171 7.2500 NaN S
1 PC 17599 71.2833 C85 C
2 STON/O2. 3101282 7.9250 NaN S
3 113803 53.1000 C123 S
4 373450 8.0500 NaN S
.. ... ... ... ...
886 211536 13.0000 NaN S
887 112053 30.0000 B42 S
888 W./C. 6607 23.4500 NaN S
889 111369 30.0000 C148 C
890 370376 7.7500 NaN Q

[891 rows x 12 columns]

# Display the transformed DataFrame


print("DataFrame after performing vectorized string operations:")
print(df.head())

DataFrame after performing vectorized string operations:


PassengerId Survived Pclass Name Sex Age SibSp
Parch \
0 1 0 3 male male 22.0 1 0

1 2 1 1 female female 38.0 1 0

2 3 1 3 female female 26.0 0 0

3 4 1 1 female female 35.0 1 0

4 5 0 3 male male 35.0 0 0

Ticket Fare Cabin Embarked


0 A/5 21171 7.2500 NaN S
1 PC 17599 71.2833 C85 C
2 STON/O2. 3101282 7.9250 NaN S
3 113803 53.1000 C123 S
4 373450 8.0500 NaN S

Data Wrangling

0.0.1 1. Concatenate / Join / Merge/ Reshape DataFrames.


Used to concatenate two or more DataFrame objects. By setting axis=0 it concatenates vertically
(rows), and by setting axis=1 it concatenates horizontally (columns).

[3]: import pandas as pd


# Create the first DataFrame (df1) with columns 'X' and 'Y' and two rows
df1 = pd.DataFrame({'X': ['X0', 'X1'],
'Y': ['Y0', 'Y1']})
# Create the second DataFrame (df2) with columns 'X' and 'Y' and two rows
df2 = pd.DataFrame({'X': ['X2', 'X3'],
'Y': ['Y2', 'Y3']})
# Concatenate df1 and df2 vertically (axis=0) to stack rows
# This combines the two DataFrames by adding the rows of df2 below the rows of df1

result = pd.concat([df1, df2], axis=0)


df1

[3]: X Y
0 X0 Y0
1 X1 Y1

[4] : df2

[4] : X Y
0 X2 Y2
1 X3 Y3

[5]: result

[5]: X Y
0 X0 Y0
1 X1 Y1
0 X2 Y2
1 X3 Y3

0.0.2 MERGE
Used to merge two data frames based on a key column, similar to SQL joins. Options include
how=’inner’, how=’outer’, how=’left’, and how=’right’ for different types of joins.

[8]: import pandas as pd


# Create DataFrame 1
df1 = pd.DataFrame({'key': ['x', 'y', 'z'], 'value1': [1, 2, 3]})
# Create DataFrame 2
df2 = pd.DataFrame({'key': ['y', 'z', 'a'], 'value2': [4, 5, 6]})
# Merge DataFrames on 'key' column using inner join
result = pd.merge(df1, df2, on='key', how='inner')
df1

[8] : key value1


0 x 1
1 y 2
2 z 3

[9] : df2

[9] : key value2


0 y 4
1 z 5
2 a 6

[10]: result

[10] : key value1 value2


0 y 2 4
1 z 3 5

[11] : import pandas as pd


# Create DataFrame 1
# Create DataFrame 1
df1 = pd.DataFrame({'key': ['x', 'y', 'z'], 'value1': [1, 2, 3]})
# Create DataFrame 2
df2 = pd.DataFrame({'key': ['y', 'z', 'a'], 'value2': [4, 5, 6]})
# Merge DataFrames on 'key' column using outer join
result = pd.merge(df1, df2, on="key", how='outer')
df1

[11] : key value1


0 x 1
1 y 2
2 z 3

[12] : df2

[12] : key value2
0 y 4
1 z 5
2 a 6

[13]: result

[13]: key value1 value2


0 x 1.0 NaN
1 y 2.0 4.0
2 z 3.0 5.0
3 a NaN 6.0

0.0.3 JOIN
A join is a way to combine data from two or more tables (or DataFrames) based on a common
column, known as the join key.

[18]: df1 = pd.DataFrame({"x": ["x0", "x1", "x2"], "y": ["y0", "y1", "y2"]},
index=["j0", "j1", "j2"]) # Create DataFrame 2
df2 = pd.DataFrame({"z": ["z0", "z2", "z3"], "a": ["a0", "a2", "a3"]},
index=["K0", "K2", "K3"])
# Print DataFrame 1
print(df1)
# Print DataFrame 2
print(df2)
# Join DataFrames 1 and 2 on index (default)
df3 = df1.join(df2)
print(df3)

x y
j0 x0 y0
j1 x1 y1
j2 x2 y2
z a
K0 z0 a0
K2 z2 a2
K3 z3 a3
x y z a
j0 x0 y0 NaN NaN
j1 x1 y1 NaN NaN
j2 x2 y2 NaN NaN

0.0.4 INNER JOIN


Returns rows with matching keys in both DataFrames.

[21]: #inner join
# Create DataFrame 1
df1 = pd.DataFrame({"x": ["x0", "x1", "x2"], "y": ["y0", "y1", "y2"]},
index=["j0", "j1", "j2"]) # Create DataFrame 2
df2 = pd.DataFrame({"x": ["x0", "x1", "x3"],"z": ["z0", "z2", "z3"],
"a": ["a0", "a2", "a3"]},
index=["K0", "K2", "K3"])
df4 = df1.merge(df2,on="x", how='inner')
print(df4)

x y z a
0 x0 y0 z0 a0
1 x1 y1 z2 a2

0.0.5 FULL OUTER JOIN


Returns all rows from both DataFrames.
[22]:
df5 = df1.merge(df2,on="x", how='outer')
print(df5)

x y z a
0 x0 y0 z0 a0
1 x1 y1 z2 a2
2 x2 y2 NaN NaN
3 x3 NaN z3 a3

0.0.6 LEFT OUTER JOIN


Returns all rows from the left DataFrame and matching rows from the right DataFrame.
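
A minimal sketch using the df1 and df2 defined above (this cell did not survive the export; how='left' keeps every row of df1 and fills unmatched right-side columns with NaN):

df6 = df1.merge(df2, on="x", how='left')
print(df6)

x y z a
0 x0 y0 z0 a0
1 x1 y1 z2 a2
2 x2 y2 NaN NaN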

0.0.7 RIGHT OUTER JOIN


Returns all rows from the right DataFrame and matching rows from the left DataFrame.

[25]:
df7 = df1.merge(df2,on="x",how='right')
print(df7)

x y z a
0 x0 y0 z0 a0
1 x1 y1 z2 a2
2 x3 NaN z3 a3

0.0.8 RESHAPE
Reshaping functions like pivot and melt are used to transform the layout of data frames.
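
The cells below demonstrate concat and unstack; for pivot and melt themselves, a minimal sketch:

import pandas as pd
df = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2], 'y': [3, 4]})
long = pd.melt(df, id_vars='key')   # wide -> long: columns key, variable, value
print(long)
wide = long.pivot(index='key', columns='variable', values='value')   # long -> wide
print(wide)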

[30]: import pandas as pd
# Create Series 1
s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
# Create Series 2
s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
# Concatenate Series into DataFrame
df = pd.concat([s1, s2], keys=['one', 'two'])
print(df)

one a 0
b 1
c 2
d 3
two c 4
d 5
e 6
dtype: int64

[31]: print(df.unstack())

a b c d e
one 0.0 1.0 2.0 3.0 NaN
two NaN NaN 4.0 5.0 6.0

