Pandas & NumPy


Pandas Questions

In [ ]:

"""
What are the different types of data structures in Pandas?

Series:
A one-dimensional array-like structure with homogeneous data, meaning
data of different types cannot be part of the same Series. It can hold
any single data type such as integers, floats, or strings. Its values
are mutable (they can be changed), but the size of the Series is
immutable (it cannot be changed). Using the Series constructor, we can
easily convert a list, tuple, or dictionary into a Series. A Series
cannot contain multiple columns.

DataFrame:
A two-dimensional array-like structure with heterogeneous data. It can
contain data of different types, aligned in a tabular manner, i.e. in
rows and columns; the indexes with respect to these are called the row
index and the column index respectively. Both the size and the values
of a DataFrame are mutable. The columns can be of heterogeneous types
like int and bool. A DataFrame can also be viewed as a dictionary of
Series.

"""

In [ ]:
"""
What are the significant features of the pandas Library?

Fast and efficient DataFrame object with default and customized indexing.
High-performance merging and joining of data.
Data alignment and integrated handling of missing data.
Label-based slicing, indexing, and subsetting of large data sets.
Reshaping and pivoting of data sets.
Tools for loading data into in-memory data objects from different file formats.
Columns from a data structure can be deleted or inserted.
Group by data for aggregation and transformations.
Time Series functionality.
"""

In [ ]:
"""
What are the different ways to create a Series in pandas?

1. creating an empty series
2. creating a series from an array
3. creating a series from an array with a custom index
4. creating a series from a dictionary
5. creating a series from a list
6. creating a series from a scalar value
"""
# 1. creating an empty series

import pandas as pd
ser = pd.Series(dtype=object)  # dtype given explicitly to avoid a FutureWarning
print(ser)

# 2. creating a series from an array

import numpy as np
data = np.array(['s', 'c', 'a', 'l', 'a', 'r'])
ser = pd.Series(data)
print(ser)

# 3. creating a series from an array with a custom index

data = np.array(['s', 'c', 'a', 'l', 'a', 'r'])
ser = pd.Series(data, index=[10, 11, 12, 13, 14, 15])
print(ser)

# 4. creating a series from a dictionary

# renamed from "dict" to avoid shadowing the built-in
d = {'A': 101,
     'B': 202,
     'C': 303}
ser = pd.Series(d)
print(ser)

# 5. creating a series from a list

# renamed from "list" to avoid shadowing the built-in
letters = ['s', 'c', 'a', 'l', 'a', 'r']
ser = pd.Series(letters)
print(ser)

# 6. creating a series from a scalar value

# the scalar value is broadcast to every index label
ser = pd.Series(10, index=[0, 1, 2, 3, 4, 5])
print(ser)

Series([], dtype: object)


0 s
1 c
2 a
3 l
4 a
5 r
dtype: object
10 s
11 c
12 a
13 l
14 a
15 r
dtype: object
A 101
B 202
C 303
dtype: int64
0 s
1 c
2 a
3 l
4 a
5 r
dtype: object
0 10
1 10
2 10
3 10
4 10
5 10
dtype: int64

In [ ]:

"""
What are the different ways to create a DataFrame?

1. creating an empty DataFrame
2. creating a DataFrame from a list
3. creating a DataFrame from a list of lists
4. creating a DataFrame from a dictionary (and from a list of dictionaries)
5. creating a DataFrame from Series
"""
# 1.create an empty dataframe
import pandas as pd
df = pd.DataFrame()
print(df)

# 2. creating a DataFrame from a list


data = [110,202,303,404,550,650]
df = pd.DataFrame(data, columns=['Amounts'])
print(df)

# 3. creating a DataFrame from a list of lists


data = [['mark', 20], ['zack', 16], ['ron', 24]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)

# 4. creating a DataFrame from a dictionary

data = {'Name': ['Max', 'Lara', 'Koke', 'muller'],
        'Age': [10, 31, 91, 48]}
df = pd.DataFrame(data)
print(df)

# ... and from a list of dictionaries
data = [{'aa': 1, 'bs': 2, 'cd': 3},
        {'aa': 10, 'bs': 20, 'cd': 30}]
df = pd.DataFrame(data)
print(df)

# 5. creating a DataFrame from Series


d = pd.Series([10, 20, 30, 40])
df = pd.DataFrame(d)
print(df)

d = {'one': pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd']),
     'two': pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df)

Empty DataFrame
Columns: []
Index: []
Amounts
0 110
1 202
2 303
3 404
4 550
5 650
Name Age
0 mark 20
1 zack 16
2 ron 24
Name Age
0 Max 10
1 Lara 31
2 Koke 91
3 muller 48
aa bs cd
0 1 2 3
1 10 20 30
0
0 10
1 20
2 30
3 40
one two
a 10 10
b 20 20
c 30 30
d 40 40

In [ ]:
"""
How can we create a copy of the series in Pandas?
"""

letters = ['s', 'c', 'a', 'l', 'a', 'r']
ser = pd.Series(letters)
print(ser)
# deep=True (the default) copies the data as well, so changes to the copy
# do not affect the original
ser.copy(deep=True)

0 s
1 c
2 a
3 l
4 a
5 r
dtype: object
Out[ ]:
0 s
1 c
2 a
3 l
4 a
5 r
dtype: object

In [ ]:
"""
Categorical data in pandas:

Categorical data is a discrete set of values for a particular outcome and
has a fixed range. The data in a category need not be numerical; it can be
textual in nature. Examples are gender, social class, blood type, country
affiliation, and observation time. A short sketch follows this cell.
"""

In [ ]:
"""
What is MultiIndexing in Pandas?
MultiIndexing in Python, particularly within libraries like Pandas,
is a method of handling and organizing data with multiple levels of
indexing. It allows you to work with data that has more than one
key to index.
"""
import pandas as pd

index = pd.MultiIndex.from_product([['A', 'B'], ['a', 'b']],
                                   names=['Group', 'Subgroup'])
data = {'Values': [1, 2, 3, 4]}
df = pd.DataFrame(data, index=index)

print(df)

Values
Group Subgroup
A a 1
b 2
B a 3
b 4

In [ ]:
"""
Convert a DataFrame to a NumPy array:
A pandas DataFrame can be converted to a NumPy array using the values
attribute, which returns a NumPy array representation of the DataFrame's
data.
"""
import pandas as pd
import numpy as np

data = {
'A': [1, 2, 3],
'B': [4, 5, 6],
'C': [7, 8, 9]
}
df = pd.DataFrame(data)
numpy_array = df.values
print(numpy_array)

[[1 4 7]
[2 5 8]
[3 6 9]]
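
In [ ]:
"""
A small follow-up sketch: recent pandas versions recommend the to_numpy()
method over the values attribute; both return the same array here.
"""
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(df.to_numpy())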

In [ ]:
"""
How to convert a DataFrame to an Excel file in pandas?
"""

import pandas as pd
data = {
'A': [1, 2, 3],
'B': [4, 5, 6],
'C': [7, 8, 9]
}
df = pd.DataFrame(data)

file_path = 'data.xlsx'

# index=False skips writing the row index to the file
df.to_excel(file_path, index=False)
df.to_csv('data.csv', index=False)  # the same data written as CSV

In [ ]:
"""
Timedelta in python ?
A Timedelta can represent differences in time at various
resolutions (days, hours, minutes, seconds, milliseconds,
microseconds, and nanoseconds). You can create a Timedelta
object by subtracting two dates or times, or by using the
pd.Timedelta() constructor.

"""

import pandas as pd
td = pd.Timedelta(days=5, hours=5, minutes=5, seconds=5,
                  milliseconds=5, microseconds=5, nanoseconds=5)
print(td)

5 days 05:05:05.005005005

In [ ]:
# Timedelta arithmetic with Timestamps
start_date = pd.Timestamp('2022-01-01')
end_date = pd.Timestamp('2022-01-10')

duration = end_date - start_date


print(duration)

new_date = start_date + pd.Timedelta(days=5)


print(new_date)

9 days 00:00:00
2022-01-06 00:00:00

In [ ]:
"""
Is iterating over a Pandas Dataframe a good practice?
If not what are the important conditions to keep in mind before iterating?

Ideally, iterating over pandas DataFrames is definitely not the best


practice and one should only consider doing so when it is absolutely
necessary and no other function is applicable. The iteration process
through DataFrames is very inefficient. Pandas provide a lot of functions
using which an operation can be executed without iterating through the
dataframe. There are certain conditions that need to be checked before

Before attempting to iterate through pandas objects, we must first


ensure that none of the below-stated conditions aligns with our use
case:

Applying a function to rows:

A common use case of iteration is when it comes to applying a function


to every row, which is designed to work only one row at a time and
cannot be applied on the full DataFrame or Series. In such cases,
it’s always recommended to use apply() method instead of iterating
through the pandas object.

Iterative manipulations:

In case we need to perform iterative manipulations and at the same


time performance is a major area of concern, then we have alternatives
like numba and cython.

Printing a DataFrame:

If we want to print out a DataFrame then instead of iterating through


the whole DataFrame we can simply use DataFrame.to_string() method
in order to render the DataFrame to a console-friendly tabular output.

Vectorisation over iteration:

It is always preferred to choose vectorization over iteration as


pandas come with a rich set of built-in methods whose performance
is highly optimized and super efficient.

"""

In [ ]:
"""
how to iterate over rows in pandas dataframe
"""
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Method 1: Using iterrows()
print("Using iterrows():")
for index, row in df.iterrows():
    print(f"Row {index}: Sum = {row['A'] + row['B']}")

# Method 2: Using apply() with axis=1
print("\nUsing apply() with axis=1:")
def sum_row(row):
    return row['A'] + row['B']
df['Sum'] = df.apply(sum_row, axis=1)
print(df)

# Method 3: Using a traditional loop with iloc[]
print("\nUsing a traditional loop with iloc[]:")
for i in range(len(df)):
    print(f"Row {i}: Sum = {df.iloc[i]['A'] + df.iloc[i]['B']}")
Using iterrows():
Row 0: Sum = 5
Row 1: Sum = 7
Row 2: Sum = 9

Using apply() with axis=1:


A B Sum
0 1 4 5
1 2 5 7
2 3 6 9

Using a traditional loop with iloc[]:


Row 0: Sum = 5
Row 1: Sum = 7
Row 2: Sum = 9

In [1]:

import pandas as pd

data = {'A': [1, 2, None, 4, 5],
        'B': [None, 2, 3, None, 5],
        'C': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)

# Check for missing values
missing_values = df.isna()  # or df.isnull()
print("Missing values before handling:\n", missing_values)

# Fill missing values with the mean of the respective column
df_filled = df.fillna(df.mean())
print("\nDataFrame after filling missing values with the mean:\n", df_filled)

Missing values before handling:


A B C
0 False True False
1 False False False
2 True False False
3 False True False
4 False False False

DataFrame after filling missing values with the mean:


A B C
0 1.0 3.333333 1
1 2.0 2.000000 2
2 3.0 3.000000 3
3 4.0 3.333333 4
4 5.0 5.000000 5

In [2]:
"""
Interpolating Along Columns or Rows:
The interpolate() function in pandas allows you to interpolate along either
the rows or the columns of a DataFrame, depending on the axis parameter.
This flexibility allows you to handle different data structures effectively.

"""

import pandas as pd

data = {'A': [1, 2, None, 4, 5],
        'B': [None, 2, 3, None, 5],
        'C': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)

# Check for missing values
missing_values = df.isna()  # or df.isnull()
print("Missing values before handling:\n", missing_values)

# Option 1: Drop rows with missing values
df_dropped = df.dropna()
print("\nDataFrame after dropping rows with missing values:\n", df_dropped)

# Option 2: Fill missing values with a specific value
df_filled_value = df.fillna(0)
print("\nDataFrame after filling missing values with a specific value:\n", df_filled_value)

# Option 3: Fill missing values with the column mean
df_filled_mean = df.fillna(df.mean())
print("\nDataFrame after filling missing values with the mean:\n", df_filled_mean)

# Option 4: Interpolate missing values
# (note: the leading NaN in column 'B' stays NaN, since there is no
# earlier value to interpolate from)
df_interpolated = df.interpolate()
print("\nDataFrame after interpolating missing values:\n", df_interpolated)

Missing values before handling:


A B C
0 False True False
1 False False False
2 True False False
3 False True False
4 False False False

DataFrame after dropping rows with missing values:


A B C
1 2.0 2.0 2
4 5.0 5.0 5

DataFrame after filling missing values with a specific value:


A B C
0 1.0 0.0 1
1 2.0 2.0 2
2 0.0 3.0 3
3 4.0 0.0 4
4 5.0 5.0 5

DataFrame after filling missing values with the mean:


A B C
0 1.0 3.333333 1
1 2.0 2.000000 2
2 3.0 3.000000 3
3 4.0 3.333333 4
4 5.0 5.000000 5

DataFrame after interpolating missing values:


A B C
0 1.0 NaN 1
1 2.0 2.0 2
2 3.0 3.0 3
3 4.0 4.0 4
4 5.0 5.0 5
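
In [ ]:
"""
A small sketch of the axis parameter described above: interpolating along
rows (axis=1) fills each row using its neighbouring columns (the quarterly
column names are illustrative):
"""
import pandas as pd

df = pd.DataFrame({'Q1': [1.0, 10.0],
                   'Q2': [None, 20.0],
                   'Q3': [3.0, None],
                   'Q4': [4.0, 40.0]})
print(df.interpolate(axis=1))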

In [3]:
import pandas as pd

# Create a sample DataFrame


data = {'A': [1, 2, 3, 4, 5],
'B': [6, 7, 8, 9, 10]}
df = pd.DataFrame(data)

# Filter rows where values in column 'A' are greater than 3


filtered_df = df[df['A'] > 3]
print(filtered_df)

A B
3 4 9
4 5 10

In [4]:
"""
The groupby() function in pandas is used to split the data into groups
based on some criteria. After splitting, the function applies a function
to each group independently and then combines the results back into a
DataFrame.
"""

import pandas as pd

data = {'Category': ['A', 'B', 'A', 'B', 'A', 'B', 'C'],
        'Value': [10, 20, 30, 40, 50, 3, 2]}
df = pd.DataFrame(data)

grouped_df = df.groupby('Category').mean()
print(grouped_df)

Value
Category
A 30.0
B 21.0
C 2.0

In [5]:
"""
Method Chaining in Pandas:
Method chaining in pandas involves calling multiple methods on a DataFrame
or Series object sequentially in a single line, which allows for more concise
and readable code.
"""

import pandas as pd

data = {'A': [1, 2, 3, 4, 5],
        'B': [6, 7, 8, 9, 10]}
df = pd.DataFrame(data)

result = df[df['A'] > 2].sort_values(by='B', ascending=False).reset_index(drop=True)


print(result)

A B
0 5 10
1 4 9
2 3 8

In [ ]:
"""
The pivot_table() function in pandas is used to create a spreadsheet-style
pivot table as a DataFrame.
It allows users to summarize and aggregate data from a DataFrame according
to one or more keys.
"""

import pandas as pd

data = {'Date': ['2022-01-01', '2022-01-01', '2022-01-02', '2022-01-02', '2022-01-03'],
        'Category': ['A', 'B', 'A', 'B', 'A'],
        'Value': [10, 20, 30, 40, 50]}

df = pd.DataFrame(data)

pivot_table = df.pivot_table(index='Date', columns='Category', values='Value',
                             aggfunc='sum')
print(pivot_table)

Category A B
Date
2022-01-01 10.0 20.0
2022-01-02 30.0 40.0
2022-01-03 50.0 NaN

In [6]:
"""
Handling duplicate rows in a DataFrame in pandas:

Duplicate rows can be handled using methods such as drop_duplicates() to
remove duplicate rows, or by aggregating duplicate rows using grouping and
aggregation functions.
"""

import pandas as pd

data = {'A': [1, 2, 3, 1, 2],
        'B': [4, 5, 6, 4, 5]}
df = pd.DataFrame(data)

df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)

data = {'A': [1, 2, 1, 2],
        'B': [4, 5, 4, 5],
        'C': [10, 20, 30, 40]}
df = pd.DataFrame(data)

# Aggregate duplicate rows by summing values in column 'C'


aggregated_df = df.groupby(['A', 'B']).agg({'C': 'sum'}).reset_index()
print(aggregated_df)

A B
0 1 4
1 2 5
2 3 6
A B C
0 1 4 40
1 2 5 60

In [ ]:
"""
Descriptive Statistics:
mean(): Computes the mean of the values.
median(): Computes the median of the values.
mode(): Computes the mode of the values.
std(): Computes the standard deviation of the values.
var(): Computes the variance of the values.

Quantiles and Percentiles:


quantile(q): Computes the qth quantile of the values.
percentile(q): Computes the qth percentile of the values.

Summary Statistics:
describe(): Generates descriptive statistics summary of the DataFrame.

Correlation and Covariance:


corr(): Computes the pairwise correlation of columns.
cov(): Computes the pairwise covariance of columns.

Aggregation Functions:
sum(): Computes the sum of values.
count(): Computes the count of non-null values.
min(): Computes the minimum value.
max(): Computes the maximum value.

Unique Values and Value Counts:


unique(): Returns unique values in the object.
value_counts(): Returns counts of unique values.

Skewness and Kurtosis:


skew(): Computes the skewness of the values.
kurt(): Computes the kurtosis of the values.
Categorical Statistics:
groupby(): Group DataFrame using a mapper or by a Series of columns.
agg(): Aggregate using one or more operations over the specified axis.

"""
"""

Skewness:
Skewness measures the asymmetry of the distribution of values around the
mean of the data. A distribution is symmetric if it looks the same on both
sides of the mean; skewness quantifies the extent to which a distribution
differs from this symmetry. It can be positive, negative, or zero.

Positive skewness: The distribution has a longer right tail. The majority
of the data points are concentrated on the left side of the mean, and the
tail extends towards the right.

Negative skewness: The distribution has a longer left tail. The majority
of the data points are concentrated on the right side of the mean, and the
tail extends towards the left.

Zero skewness: The distribution is perfectly symmetric around the mean.

Kurtosis:
Kurtosis measures the "tailedness" of the distribution of values. It
indicates how sharply the data points are concentrated around the mean and
how heavy the tails are compared to a normal distribution.

Positive kurtosis (leptokurtic): The distribution has fatter tails and a
sharper peak than the normal distribution. It indicates more extreme
values than would be expected under a normal distribution.

Negative kurtosis (platykurtic): The distribution has thinner tails and a
flatter peak than the normal distribution. It indicates fewer extreme
values than would be expected under a normal distribution.

Mesokurtic: The distribution has kurtosis equal to that of the normal
distribution.
"""

In [7]:

import pandas as pd

# Sample DataFrame
data = {'A': [1, 2, 3, 4, 5],
'B': [5, 4, 3, 2, 1]}
df = pd.DataFrame(data)

# Descriptive statistics
print("Mean:", df.mean())
print("Median:", df.median())
print("Standard Deviation:", df.std())
print("Summary Statistics:\n ", df.describe())

# Correlation and Covariance


print("Correlation:\n ", df.corr())
print("Covariance:\n ", df.cov())

# Aggregation functions
print("Sum:", df.sum())
print("Count:", df.count())
print("Minimum:", df.min())
print("Maximum:", df.max())

# Unique values and value counts


print("Unique values in column A:", df['A'].unique())
print("Value counts in column B:\n ", df['B'].value_counts())

# Skewness and Kurtosis


print("Skewness:", df.skew())
print("Kurtosis:", df.kurt())
# Groupby and Aggregation
print("Groupby mean:\n ", df.groupby('B').mean())

Mean: A 3.0
B 3.0
dtype: float64
Median: A 3.0
B 3.0
dtype: float64
Standard Deviation: A 1.581139
B 1.581139
dtype: float64
Summary Statistics:
A B
count 5.000000 5.000000
mean 3.000000 3.000000
std 1.581139 1.581139
min 1.000000 1.000000
25% 2.000000 2.000000
50% 3.000000 3.000000
75% 4.000000 4.000000
max 5.000000 5.000000
Correlation:
A B
A 1.0 -1.0
B -1.0 1.0
Covariance:
A B
A 2.5 -2.5
B -2.5 2.5
Sum: A 15
B 15
dtype: int64
Count: A 5
B 5
dtype: int64
Minimum: A 1
B 1
dtype: int64
Maximum: A 5
B 5
dtype: int64
Unique values in column A: [1 2 3 4 5]
Value counts in column B:
B
5 1
4 1
3 1
2 1
1 1
Name: count, dtype: int64
Skewness: A 0.0
B 0.0
dtype: float64
Kurtosis: A -1.2
B -1.2
dtype: float64
Groupby mean:
A
B
1 5.0
2 4.0
3 3.0
4 2.0
5 1.0

In [ ]:

import pandas as pd

# Reading data from different sources (assumes these sample files and the
# URL exist; the names are illustrative)
df_csv = pd.read_csv('sample.csv')
df_excel = pd.read_excel('sample.xlsx')
df_json = pd.read_json('sample.json')

url = 'https://example.com/sample.csv'
df_url = pd.read_csv(url)

print("CSV File:")
print(df_csv.head())
print("\n Excel File:")
print(df_excel.head())
print("\n JSON File:")
print(df_json.head())
print("\n Data from URL:")
print(df_url.head())

In [8]:

"""
loc vs iloc
"""
import pandas as pd

data = {
'A': [1, 2, 3],
'B': [4, 5, 6],
'C': [7, 8, 9]
}

df = pd.DataFrame(data, index=['X', 'Y', 'Z'])

print("Using loc:")
print("Single row 'Y':")
print(df.loc['Y'])

print("\n Single value at 'Y', 'B':")


print(df.loc['Y', 'B'])

print("\n Slice of rows and columns:")


print(df.loc['X':'Y','A':'B'])

print("\n Single column 'C':")


print(df.loc[:,'C'])

print("First row:")
print(df.iloc[0])

print("\n Single value at first row, second column:")


print(df.iloc[0, 1])

print("\n Slice of rows and columns:")


print(df.iloc[0:2, 0:2])

print("\n Single column at index 2:")


print(df.iloc[:, 2])

Using loc:
Single row 'Y':
A 2
B 5
C 8
Name: Y, dtype: int64

Single value at 'Y', 'B':


5

Slice of rows and columns:


A B
X 1 4
Y 2 5

Single column 'C':


X 7
Y 8
Z 9
Name: C, dtype: int64
First row:
A 1
B 4
C 7
Name: X, dtype: int64

Single value at first row, second column:


4

Slice of rows and columns:


A B
X 1 4
Y 2 5

Single column at index 2:


X 7
Y 8
Z 9
Name: C, dtype: int64

In [12]:
"""
How to drop a row and a column in pandas:
"""
import pandas as pd

data = {'A': [1, 2, 3],
        'B': [4, 5, 6],
        'C': [7, 8, 9]}
df = pd.DataFrame(data)

# Drop the row with index label 1 from the DataFrame
df = df.drop(1)
print(df)

# Drop the 'B' column from the DataFrame in place
df.drop('B', axis=1, inplace=True)
# Alternatively, drop the row with index label 1 in place:
# df.drop(1, inplace=True)
A B C
0 1 4 7
2 3 6 9

In [13]:

"""
How to count the frequency of unique values:
"""
import pandas as pd

data = {'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'A', 'C', 'B']}

df = pd.DataFrame(data)
frequency_count = df['Category'].value_counts()

print(frequency_count)

""" for reset the index """

frequency_count_df = frequency_count.reset_index()
frequency_count_df.columns = ['Category', 'Frequency']
print(frequency_count_df)

""" rename the index on the dataframe """

frequency_count_df = frequency_count_df.rename_axis('Index_Name')

print(frequency_count_df)

Category
A 4
B 3
C 2
Name: count, dtype: int64
Category Frequency
0 A 4
1 B 3
2 C 2
Category Frequency
Index_Name
0 A 4
1 B 3
2 C 2

In [ ]:
"""
Find the row for which the value of a specific column is min or max:
"""
import pandas as pd

data = {'A': [10, 20, 15, 25],
        'B': [30, 25, 20, 35],
        'C': [5, 10, 15, 20]}
df = pd.DataFrame(data)

max_row_A = df['A'].idxmax()

min_row_B = df['B'].idxmin()

print("Row with maximum value in column 'A':")


print(df.loc[max_row_A])

print("\n Row with minimum value in column 'B':")


print(df.loc[min_row_B])

Row with maximum value in column 'A':


A 25
B 35
C 20
Name: 3, dtype: int64

Row with minimum value in column 'B':


A 15
B 20
C 15
Name: 2, dtype: int64

In [14]:
"""
groupby():
The groupby() function is used to split the DataFrame into groups based on
some criteria. It creates a GroupBy object that contains information about
how the DataFrame is split. You typically follow groupby() with an
aggregation function to perform some operation on each group.

aggregate() (or agg()):
The aggregate() function is used to apply one or more aggregation
functions to the data in each group. It allows you to perform custom
aggregations or apply multiple aggregation functions simultaneously. You
can use built-in aggregation functions (e.g., sum(), mean(), max()) or
define custom aggregation functions.
"""
import pandas as pd

# Create a sample DataFrame


data = {'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'A', 'C', 'B'],
        'Value': [10, 20, 15, 25, 30, 20, 10, 35, 40]}
df = pd.DataFrame(data)

# Group the DataFrame by 'Category'


grouped_df = df.groupby('Category')

# Apply aggregation functions to each group


agg_result = grouped_df.agg({
    'Value': ['sum', 'mean', 'max', 'min', 'count']
})
print(agg_result)

Value
sum mean max min count
Category
A 55 13.75 20 10 4
B 90 30.00 40 20 3
C 60 30.00 35 25 2

In [ ]:

"""
String Operations:
Pandas provides a set of string functions for working with string data,
accessed through the .str accessor. The following are a few operations on
string data (a short sketch follows this cell):
lower(): Converts strings in the Series/Index to lowercase.
upper(): Converts strings in the Series/Index to uppercase.
strip(): Removes leading/trailing whitespace (including newlines) from
every string in the Series/Index.
islower(): Returns True if all characters in each string are lowercase,
otherwise False.
isupper(): Returns True if all characters in each string are uppercase,
otherwise False.
split(' '): Splits each string according to a pattern.
cat(sep=' '): Concatenates the Series/Index items with a given separator.
contains(pattern): Returns True if the substring is present in the
element, otherwise False.
replace(a, b): Substitutes the value b for the value a.
startswith(pattern): Returns True for elements that begin with the pattern.
endswith(pattern): Returns True for elements that end with the pattern.
find(pattern): Returns the position of the first occurrence of the pattern.
findall(pattern): Returns a list of all occurrences of the pattern.
swapcase(): Swaps lower/upper case.

Null values:
A null/missing value can appear when no data is supplied for an item.
There may be no values in the respective columns, commonly represented as
NaN. Pandas provides several useful functions for identifying, deleting,
and changing null values in DataFrames:
isnull(): Returns True wherever a value is null.
notnull(): The inverse of isnull(), returning True for non-null values.
dropna(): Evaluates and removes null values from rows and columns.
fillna(): Enables users to substitute other values for the NaN values.
replace(): A powerful function that can take a regex, dictionary, string,
series, and more.
interpolate(): A useful function for filling null values in a Series or
DataFrame.

Row and column selection: We can retrieve any row or column of the
DataFrame by specifying the names of the rows and columns. A single row or
column selected from the DataFrame is one-dimensional and is regarded as a
Series.

Filter data: By using boolean logic on a DataFrame, we can filter the data.

Count values: Using the value_counts() method, we can count the
occurrences of each distinct value.
"""

In [15]:

"""
apply():

The apply() method is used to apply a function along an axis of the DataFrame or Series.
It can be used with both DataFrame and Series objects.
When applied to a DataFrame, apply() allows you to apply a function along the rows or
columns (specified by the axis parameter).
When applied to a Series, apply() allows you to apply a function element-wise to each
element in the Series.

applymap():

The applymap() method is a DataFrame method and is used to apply a function to every
element of the DataFrame.
It applies the function to each element independently, irrespective of rows or columns.
applymap() is particularly useful when you want to apply an element-wise operation to
every cell in a DataFrame.

map():

The map() method is a Series method and is used to substitute each value in a Series with
another value.
It's primarily used for mapping values from one domain to another or for substituting
specific values with other values.
map() is not applicable to DataFrames directly, only to Series.

Here's a brief comparison:

apply() is used for applying a function along the rows or columns of a DataFrame
or element-wise to a Series.
applymap() is used specifically for applying a function to every element of a DataFrame.
map() is used for substituting each value in a Series with another value.

"""

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Using apply() on DataFrame to calculate the sum of each column


print(df.apply(sum, axis=0))

# Using applymap() to square every element in the DataFrame
# (applymap() is deprecated in recent pandas in favour of DataFrame.map(),
# as the FutureWarning in the output notes)
print(df.applymap(lambda x: x ** 2))

# Sample Series
s = pd.Series(['cat', 'dog', 'bird'])

# Using map() to substitute values in the Series


print(s.map({'cat': 'feline', 'dog': 'canine', 'bird': 'avian'}))

A 6
B 15
dtype: int64
A B
0 1 16
1 4 25
2 9 36
0 feline
1 canine
2 avian
dtype: object

<ipython-input-15-8dc710cab47b>:46: FutureWarning: DataFrame.applymap has
been deprecated. Use DataFrame.map instead.
  print(df.applymap(lambda x: x ** 2))

In [16]:

"""
merge():
The merge() function in Pandas is used to merge two DataFrames based on
the values of the specified columns.It is similar to SQL join operations.
It can perform inner, outer, left, and right joins.

join():
The join() method in Pandas is used to combine columns of two potentially
differently-indexed DataFrames into a single result DataFrame.
It uses indexes to join DataFrames.

concat():
The concat() function in pandas is used to concatenate two or more
DataFrames along rows or columns. It does not perform any joins or merges
based on column values.

"""

import pandas as pd

# Creating DataFrame df1


df1 = pd.DataFrame({
'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']
})

# Creating DataFrame df2


df2 = pd.DataFrame({
'A': ['A3', 'A4', 'A5'],
'B': ['B3', 'B4', 'B5']
})

# Using merge() to perform an inner join on the 'A' column
# (df1 and df2 share no values in 'A', so the result below is empty)
merge_result = pd.merge(df1, df2, on='A', how='inner')
print("Merge Result (Inner Join):")
print(merge_result)

# Creating DataFrame df1 with a new index


df1 = pd.DataFrame({
'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']
}, index=['X', 'Y', 'Z'])

# Creating DataFrame df2 with the same index as df1


df2 = pd.DataFrame({
'C': ['C0', 'C1', 'C2'],
'D': ['D0', 'D1', 'D2']
}, index=['X', 'Y', 'Z'])

# Using join() to perform a left join based on the index


join_result = df1.join(df2)
print("\n Join Result (Left Join):")
print(join_result)

# Using concat() to concatenate the DataFrames along rows
concat_result = pd.concat([df1, df2])
print("\n Concatenate Result (Along Rows):")
print(concat_result)

Merge Result (Inner Join):


Empty DataFrame
Columns: [A, B_x, B_y]
Index: []

Join Result (Left Join):


A B C D
X A0 B0 C0 D0
Y A1 B1 C1 D1
Z A2 B2 C2 D2

Concatenate Result (Along Rows):


A B C D
X A0 B0 NaN NaN
Y A1 B1 NaN NaN
Z A2 B2 NaN NaN
X NaN NaN C0 D0
Y NaN NaN C1 D1
Z NaN NaN C2 D2

In [ ]:
"""
How do you optimize performance while working with large datasets in pandas?

Load less data: While reading data using pd.read_csv(), choose only the
columns you need with the "usecols" parameter to avoid loading unnecessary
data. Specifying the "chunksize" parameter additionally splits the data
into chunks that are processed sequentially.

Avoid loops: Loops and iterations are expensive, especially when working
with large datasets. Instead, opt for vectorized operations, as they are
applied to an entire column at once, making them faster than row-wise
iteration.

Use data aggregation: Try aggregating data before performing statistical
operations, because operations on aggregated data are more efficient than
on the entire dataset.

Use the right data types: The default data types in pandas are not memory
efficient. For example, integer values take the default dtype of int64,
but if your values fit in int32, adjusting the dtype to int32 can reduce
memory usage.

Parallel processing: Dask provides a pandas-like API for working with
large datasets. It uses multiple processes on your system to execute
different data tasks in parallel.

A hedged sketch of the first two tips follows this cell.
"""

In [ ]:
"""
sort values based on columns :

"""

import pandas as pd

# Create a sample DataFrame


data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 20, 30],
        'Score': [80, 90, 75]}
df = pd.DataFrame(data)

# Sort the DataFrame by the 'Age' column in ascending order


sorted_df = df.sort_values(by='Age')
print(sorted_df)

Name Age Score


1 Bob 20 90
0 Alice 25 80
2 Charlie 30 75

In [17]:
"""
different ways to filter the values in pandas:
"""
import pandas as pd

# Create a sample DataFrame


data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
        'Age': [25, 30, 35, 40, None],
        'Category': ['A', 'B', 'A', 'C', 'B'],
        'Score': [80, 90, 75, 85, 95]}
df = pd.DataFrame(data)

# Filter data using boolean indexing


bool_filtered_df = df[df['Age'] > 30]

# Filter data using query method


query_filtered_df = df.query('Age > 30')

# Filter data using loc method


loc_filtered_df = df.loc[df['Age'] > 30]

# Filter data using isin method


isin_filtered_df = df[df['Category'].isin(['A', 'B'])]

# Filter data using isna method


na_filtered_df = df[df['Age'].isna()]

# Filter data using string methods


string_filtered_df = df[df['Name'].str.contains('a', case=False)]

# Groupby and filter


grouped_filtered_df = df.groupby('Category').filter(lambda x: x['Score'].mean() > 80)

print("Boolean Indexing:")
print(bool_filtered_df)

print("\n Query Method:")


print(query_filtered_df)

print("\n Loc Method:")


print(loc_filtered_df)

print("\n Isin Method:")


print(isin_filtered_df)

print("\n Isna Method:")


print(na_filtered_df)

print("\n String Methods:")


print(string_filtered_df)

print("\n Groupby and Filter:")


print(grouped_filtered_df)

Boolean Indexing:
Name Age Category Score
2 Charlie 35.0 A 75
3 David 40.0 C 85

Query Method:
Name Age Category Score
2 Charlie 35.0 A 75
3 David 40.0 C 85

Loc Method:
Name Age Category Score
2 Charlie 35.0 A 75
3 David 40.0 C 85

Isin Method:
Name Age Category Score
0 Alice 25.0 A 80
1 Bob 30.0 B 90
2 Charlie 35.0 A 75
4 Emma NaN B 95

Isna Method:
Name Age Category Score
4 Emma NaN B 95

String Methods:
Name Age Category Score
0 Alice 25.0 A 80
2 Charlie 35.0 A 75
3 David 40.0 C 85
4 Emma NaN B 95

Groupby and Filter:


Name Age Category Score
1 Bob 30.0 B 90
3 David 40.0 C 85
4 Emma NaN B 95

In [ ]:
"""
How do you handle null or missing values in pandas?

You can use any of the following three methods to handle missing values in pandas:

dropna() – the function removes the missing rows or columns from the DataFrame.
fillna() – fill nulls with a specific value using this function.
interpolate() – this method fills the missing values with computed interpolation values.

The interpolation technique can be linear, polynomial, spline, time, etc.


"""

In [ ]:

"""
Difference between fillna() and interpolate() methods

fillna():
fillna() fills the missing values with the given constant.
Plus, you can give forward-filling or backward-filling inputs to its ‘method’ parameter.

interpolate():
By default, this function fills the missing or NaN values with the linear interpolated
values. However, you can customize the interpolation technique to polynomial, time,
index, spline, etc., using its ‘method’ parameter.
The interpolation method is highly suitable for time series data, whereas fillna
is a more generic approach.
"""

In [ ]:
"""
What is Resampling?
Resampling is used to change the frequency at which time series data is reported.
Imagine you have monthly time series data and want to convert it into weekly
data or yearly, this is where resampling is used.
Converting monthly to weekly or daily data is upsampling, where
interpolation techniques are used to increase the frequency. Converting
monthly to yearly data is termed downsampling, where data aggregation
techniques are applied. A small sketch follows this cell.
"""

In [18]:
"""
How do you perform one-hot encoding using pandas?
"""
import pandas as pd

data = {'Name': ['John', 'Cateline', 'Matt', 'Oliver'],
        'ID': [1, 22, 23, 36]}

df = pd.DataFrame(data)

new_df = pd.get_dummies(df.Name)
new_df.head()

Out[18]:

   Cateline   John   Matt  Oliver
0     False   True  False   False
1      True  False  False   False
2     False  False   True   False
3     False  False  False    True

In [19]:
import pandas as pd
EmpData=pd.DataFrame({'Name': ['ram','ravi','sham','sita','gita'],
'id': [101,102,103,104,105],
'Gender': ['M','M','M','F','F'],
'Age': [21,25,24,28,25]
})

print(EmpData)
# Replacing values in data globally for all the columns
# Wherever you find values, replace them: M-->Male, and 21-->22
EmpDataReplaced=EmpData.replace(to_replace={'M':'Male', 21:30}, inplace=False)
EmpDataReplaced

Name id Gender Age


0 ram 101 M 21
1 ravi 102 M 25
2 sham 103 M 24
3 sita 104 F 28
4 gita 105 F 25
Out[19]:

   Name   id Gender  Age
0   ram  101   Male   30
1  ravi  102   Male   25
2  sham  103   Male   24
3  sita  104      F   28
4  gita  105      F   25

In [20]:
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
        'Age': [25, 30, 35, 40, None],
        'Category': ['A', 'B', 'A', 'C', 'B'],
        'Score': [80, 90, 75, 85, 95]}
df = pd.DataFrame(data)

df['Category'] = df['Category'].replace({'A': 'Category_A', 'B': 'Category_B'})

print("DataFrame after replacing values in 'Category' column:")


print(df)

DataFrame after replacing values in 'Category' column:


Name Age Category Score
0 Alice 25.0 Category_A 80
1 Bob 30.0 Category_B 90
2 Charlie 35.0 Category_A 75
3 David 40.0 C 85
4 Emma NaN Category_B 95

In [21]:
df.loc[(df['Age'] >= 30) & (df['Age'] < 40), 'Age'] = 35

print("\n DataFrame after replacing a range of values in 'Age' column:")


print(df)

DataFrame after replacing a range of values in 'Age' column:


Name Age Category Score
0 Alice 25.0 Category_A 80
1 Bob 35.0 Category_B 90
2 Charlie 35.0 Category_A 75
3 David 40.0 C 85
4 Emma NaN Category_B 95

NumPy Questions
In [22]:
"""
What is the main data structure in NumPy?

The main data structure in NumPy is the ndarray, short for n-dimensional
array. It is a powerful data structure that allows for efficient storage
and manipulation of arrays containing homogeneous data (data of the same
type).

Here are some key characteristics of ndarrays:

Homogeneous data: All elements in a NumPy ndarray must be of the same data
type, unlike Python lists, which can contain elements of different types.

Fixed size: The size of a NumPy array is fixed upon creation, meaning you
cannot resize it like a Python list.

Efficient computation: NumPy arrays are implemented in C, which allows for
efficient computation and vectorized operations.

Multi-dimensional: NumPy arrays can have any number of dimensions. A
one-dimensional array is like a list, a two-dimensional array is like a
matrix, and so on.

Indexing and slicing: Similar to Python lists, you can access elements of
a NumPy array using indexing and slicing.
"""
import numpy as np
arr = np.array([1, 2, 3, 4, 5])

print("NumPy Array:")
print(arr)
print("Type of array:", type(arr))
print("Data type of elements:", arr.dtype)
print("Shape of array:", arr.shape)

NumPy Array:
[1 2 3 4 5]
Type of array: <class 'numpy.ndarray'>
Data type of elements: int64
Shape of array: (5,)

In [ ]:

"""
NumPy is a fundamental library in data science and machine learning.

Efficient array operations: NumPy provides high-performance
multidimensional array objects (ndarray) and tools for working with these
arrays. This allows for efficient storage and manipulation of large
datasets, making it essential for handling data in data science tasks.

Vectorized operations: NumPy supports vectorized operations, allowing
mathematical and logical operations to be performed on entire arrays
without explicit looping. This makes code concise, readable, and
computationally efficient.

Broadcasting: NumPy's broadcasting capability allows arrays of different
shapes to be combined in arithmetic operations. This simplifies code and
makes it easier to work with arrays of different dimensions, which is
common in data science tasks.

Random number generation: NumPy provides functions for generating random
numbers and sampling from various probability distributions. This is
useful for simulating data, bootstrapping, and conducting statistical
experiments in data science.

Integration with other libraries: NumPy is a foundational library in the
Python ecosystem and is extensively used by other libraries and frameworks
in data science, such as Pandas, SciPy, Matplotlib, and scikit-learn.
These libraries often accept NumPy arrays as input or return NumPy arrays
as output.

Interfacing with low-level languages: NumPy is implemented in C and
Fortran, which makes it efficient for numerical computations. It also
provides interfaces to libraries written in these languages, enabling
seamless integration with existing computational libraries and frameworks.

Linear algebra operations: NumPy provides a rich set of functions for
linear algebra operations, including matrix multiplication, eigenvalue
decomposition, singular value decomposition, and solving linear systems of
equations. These operations are fundamental in many data science
applications, such as machine learning and optimization.
"""

In [ ]:
"""
There are several reasons why NumPy is an important library in Python:

Efficient operations on arrays and matrices: NumPy is designed to be
efficient for numerical computing. It provides functions and methods for
performing operations on large arrays and matrices of data that are much
faster than using Python's built-in data structures. NumPy provides
efficient, vectorized operations on arrays and matrices, which can be much
faster than looping over the elements of the array and performing the
operation manually. (A rough timing sketch follows this cell.)

Large collection of mathematical functions: NumPy provides a large
collection of mathematical functions that can be applied to arrays and
matrices, such as trigonometric, exponential, and linear algebra
functions. This can save a lot of time and effort compared to implementing
these functions yourself.

Interoperability with other libraries: NumPy is integrated with many other
popular Python libraries, such as Pandas (for data analysis) and
Matplotlib (for data visualization), and is designed to work seamlessly
with them. This allows you to use NumPy arrays in these libraries and take
advantage of their functionality.

Widely used in scientific computing: NumPy is widely used in the
scientific computing and data science communities, often in conjunction
with other libraries such as Pandas and SciPy. Since NumPy is an essential
library for scientific computing in Python, it is widely used in machine
learning, data science, and other fields that require efficient operations
on large arrays of numerical data.

Support for large datasets: NumPy is designed to handle large datasets
efficiently, allowing you to work with datasets that may not fit in memory
using other data structures.

Easy to use: NumPy provides a simple and intuitive interface for working
with numerical data in Python. Its syntax is similar to Python's built-in
data types, and it integrates well with other libraries, such as
Matplotlib for visualization.

Support for high-level mathematical functions: NumPy supports a wide range
of mathematical functions, such as trigonometric functions, logarithms,
and exponential functions. These are implemented in a highly efficient
manner, making it easy to perform complex mathematical operations.

Support for array broadcasting: NumPy's support for array broadcasting
allows you to perform arithmetic operations on arrays of different sizes,
making it easy to work with arrays of different shapes and dimensions.

Flexibility: NumPy arrays can store data of any type and can be easily
resized or reshaped to fit the needs of your application.
"""

In [ ]:
"""
Why is NumPy preferred over MATLAB and Octave?

It provides powerful functions for performing complex mathematical
operations on multi-dimensional matrices and arrays. Operations on NumPy
ndarrays are significantly faster (commonly cited as up to 50x) compared
to operations on native lists using loops. This efficiency is very useful
when the arrays have millions of elements.
It provides indexing syntax to easily access portions of data in a large
array.
It provides built-in functions that make it easy to perform operations
related to linear algebra and statistics.
It takes only a few lines of code to achieve complex computations using
NumPy.
"""

In [ ]:
"""
How are NumPy arrays better than Python's lists?

Python lists support storing heterogeneous data types, whereas NumPy
arrays store data of a single type.

NumPy provides extra functional capabilities that make operating on its
arrays easier, which makes NumPy arrays advantageous in comparison to
Python lists, as those functions cannot operate on heterogeneous data.

NumPy arrays use much less memory per element than lists of Python
objects, since the values are stored in a single contiguous block rather
than as separate objects tracked individually by Python. This results in
less memory wastage. (A small sketch follows this cell.)

NumPy arrays support multi-dimensional arrays.

NumPy provides various powerful and efficient functions for complex
computations on the arrays.

NumPy also provides a range of functions for bitwise operations.
"""

In [25]:
"""
what are the different types of data types in numpy?
"""
import numpy as np

arr_int32 = np.array([1, 2, 3], dtype=np.int32)
arr_float64 = np.array([1.0, 2.0, 3.0], dtype=np.float64)
arr_complex128 = np.array([1 + 2j, 3 + 4j], dtype=np.complex128)
arr_bool = np.array([True, False, True], dtype=bool)
arr_string = np.array(['apple', 'banana', 'cherry'], dtype=str)

# Printing arrays and their data types


print("Integer Array (int32):", arr_int32)
print("Float Array (float64):", arr_float64)
print("Complex Array (complex128):", arr_complex128)
print("Boolean Array (bool):", arr_bool)
print("String Array (str):", arr_string)

Integer Array (int32): [1 2 3]


Float Array (float64): [1. 2. 3.]
Complex Array (complex128): [1.+2.j 3.+4.j]
Boolean Array (bool): [ True False True]
String Array (str): ['apple' 'banana' 'cherry']

In [ ]:
"""
Here are a few examples of situations where NumPy might be useful:

Scientific computing: NumPy provides a number of functions and features
that are useful for scientific computing tasks, such as numerical
integration, linear algebra, and random number generation.

Data analysis: NumPy is often used as a foundation for other libraries
used for data analysis, such as Pandas and SciPy. It provides functions
for reading and writing data to and from files, as well as functions for
performing statistical analysis and manipulating data.

Machine learning: NumPy is frequently used in machine learning tasks, such
as preparing data, creating training and testing sets, and implementing
algorithms. It provides a number of functions that are useful for these
tasks, such as matrix multiplication and element-wise operations.

Image processing: NumPy is often used for image processing tasks, such as
resizing and cropping images, as well as applying filters and
transformations. It provides functions for working with arrays of pixel
values, which can be used to represent images.

Data visualization: NumPy can be used to create data visualizations such
as histograms, scatter plots, and line plots. It provides functions for
generating data to be plotted as well as functions for creating plots
using Matplotlib or other visualization libraries.

Data manipulation: NumPy provides functions for efficiently manipulating
large arrays of data, such as selecting specific elements or subarrays,
sorting, and reshaping.

Optimization: NumPy provides functions for minimizing or maximizing
objective functions, such as numpy.argmin and numpy.argmax, which can be
used to find the optimal parameters for a given model.

Signal processing: NumPy provides functions for performing tasks such as
filtering, convolution, and correlation, which are commonly used in signal
processing.

Text processing: NumPy can be used to encode and decode text data for use
in natural language processing tasks.

Financial modeling: NumPy can be used to perform financial modeling tasks,
such as calculating returns, risk, and portfolio optimization.

Simulation: NumPy can be used to generate random numbers and perform
simulations, such as Monte Carlo simulations.

Computer vision: NumPy can be used to process and manipulate image and
video data for use in computer vision tasks.
"""

In [26]:
"""
"""
Difference between the mean() and average in numpy :

np.mean():

Calculates the arithmetic mean of the elements in the array.


By default, it computes the simple average, treating all elements equally.
Offers additional options like specifying axis along which the mean is computed, data typ
es,
and where the result should be placed.
Does not inherently support weighted averages.

np.average():

Computes the weighted average of the elements in the array if the weights parameter is sp
ecified.
Allows for the elements to contribute unequally to the final average based on their weigh
ts.
Useful when you want to give different importance to different elements in the array.
"""
import numpy as np

# Example data
data = np.array([1, 2, 3, 4, 5])

# Calculating mean using np.mean()


mean_simple = np.mean(data)
print("Simple Mean (np.mean()):", mean_simple)

# Calculating a weighted average using np.average()
weights = np.array([0.1, 0.2, 0.3, 0.2, 0.2])
weighted_average = np.average(data, weights=weights)
print("Weighted Average (np.average()):", weighted_average)

Simple Mean (np.mean()): 3.0


Weighted Average (np.average()): 3.2

In [27]:
"""
How do you count the frequency of a given positive value appearing in the NumPy array?
"""

import numpy as np

arr = np.array([1, 2, 3, 4, 5, 2, 3, 4, 2, 1])

value_to_count = 2

frequency = np.count_nonzero(arr == value_to_count)

print("Frequency of", value_to_count, "in the array:", frequency)

Frequency of 2 in the array: 3

In [28]:

"""
How is arr[:, 0] different from arr[:, [0]]?
"""
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

result = arr[:, 0]

print("arr[:, 0]:\n ", result)


print("Shape of result:", result.shape)

arr[:, 0]:
[1 4 7]
Shape of result: (3,)

In [29]:
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

result = arr[:, [0]]

print("arr[:, [0]]:\n ", result)


print("Shape of result:", result.shape)

arr[:, [0]]:
[[1]
[4]
[7]]
Shape of result: (3, 1)

In [30]:
"""
Vectorization in NumPy refers to the ability to apply operations
element-wise on entire arrays, which is more efficient than using
traditional Python loops. It leverages optimized C and Fortran code under
the hood to execute these operations efficiently.
"""

import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])

result = a + b

print("Result of element-wise addition:", result)

Result of element-wise addition: [ 6 8 10 12]

In [31]:
"""
How to convert a DataFrame into a NumPy array?
"""
import pandas as pd
import numpy as np

# Create a sample DataFrame


data = {
'A': [1, 2, 3],
'B': [4, 5, 6],
'C': [7, 8, 9]
}
df = pd.DataFrame(data)

numpy_array = df.values

print("NumPy Array:")
print(numpy_array)

NumPy Array:
[[1 4 7]
[2 5 8]
[3 6 9]]

In [ ]:
"""
How is vectorization related to broadcasting in NumPy?
Vectorization delegates NumPy operations internally to optimized C
functions, resulting in faster Python code. Broadcasting refers to the
rules that allow NumPy to perform arithmetic on arrays of different sizes
or shapes: it solves the problem of mismatched shapes by virtually
replicating the smaller array along the larger one so that both arrays
have compatible shapes for NumPy operations. Broadcasting therefore lets
vectorized operations work on arrays of different dimensions. A minimal
sketch follows this cell.
"""

In [32]:
"""
Write a program to repeat each of the elements five times for a given array.
"""
import numpy as np

given_array = np.array([1, 2, 3, 4, 5])


repeated_array = np.repeat(given_array, 5)

print("Original Array:", given_array)


print("Array with each element repeated five times:", repeated_array)

Original Array: [1 2 3 4 5]
Array with each element repeated five times: [1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5]

In [33]:
""" how to add zero at border in numpy """

import numpy as np

existing_array = np.array([[1, 2, 3],
                           [4, 5, 6],
                           [7, 8, 9]])

rows, cols = existing_array.shape

# Create a new array of zeros with expanded dimensions
# (np.pad(existing_array, 1) is a one-line alternative)
new_array = np.zeros((rows + 2, cols + 2), dtype=existing_array.dtype)

# Assign the existing array to the center of the new array


new_array[1:-1, 1:-1] = existing_array

print("Existing Array:")
print(existing_array)

print("\n New Array with Zeros Border:")


print(new_array)

Existing Array:
[[1 2 3]
[4 5 6]
[7 8 9]]

New Array with Zeros Border:


[[0 0 0 0 0]
[0 1 2 3 0]
[0 4 5 6 0]
[0 7 8 9 0]
[0 0 0 0 0]]

In [34]:
"""
how to split array into different parts in numpy ?
"""
import numpy as np

# Create a sample array


arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Split the array into 3 sub-arrays


sub_arrays = np.array_split(arr, 3)

print("Original Array:", arr)


print("Sub-Arrays:", sub_arrays)

Original Array: [1 2 3 4 5 6 7 8 9]
Sub-Arrays: [array([1, 2, 3]), array([4, 5, 6]), array([7, 8, 9])]

In [35]:
"""
How to reshape and resize an array in NumPy?
"""
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])

# np.resize repeats the data as needed to fill the new shape
resized_arr = np.resize(arr, (3, 4))

print("Resized Array:")
print(resized_arr)

# np.reshape requires the new shape to hold exactly the same number of elements
reshaped_arr = np.reshape(arr, (3, 2))

print("\n Reshaped Array:")


print(reshaped_arr)

Resized Array:
[[1 2 3 4]
[5 6 1 2]
[3 4 5 6]]

Reshaped Array:
[[1 2]
[3 4]
[5 6]]

In [36]:

import numpy as np

# Create a sample array


arr = np.array([3, 1, 2, 5, 4])

# Sort the array


sorted_arr = np.sort(arr)

print("Original Array:", arr)


print("Sorted Array:", sorted_arr)

Original Array: [3 1 2 5 4]
Sorted Array: [1 2 3 4 5]

In [37]:
import numpy as np

data_dict = {'a': 1, 'b': 2, 'c': 3, 'd': 4}

values = list(data_dict.values())

numpy_array = np.array(values)

print("NumPy Array from Dictionary Values:", numpy_array)

NumPy Array from Dictionary Values: [1 2 3 4]

In [38]:
"""
Arrays (numpy.ndarray):
Arrays can have any number of dimensions (1D, 2D, 3D, etc.).
Arrays are the fundamental data structure in NumPy.
Arrays support element-wise operations.
Arrays are more flexible and commonly used in numerical computing and data
analysis.
You can create arrays with the np.array() function.

Matrices (numpy.matrix):
Matrices are a subclass of arrays and always have exactly two dimensions
(rows and columns).
Matrices support matrix multiplication with the * operator.
Matrices have some additional attributes like I for computing the inverse
and T for the transpose.
Matrices can be less flexible than arrays, especially for operations
beyond linear algebra.
You can create matrices with the np.matrix() function, although np.matrix
is discouraged in modern NumPy; regular ndarrays with the @ operator are
recommended instead.
"""

import numpy as np

array_a = np.array([[1, 2], [3, 4]])


matrix_b = np.matrix([[1, 2], [3, 4]])

print("Array:")
print(array_a)
print("Type of array:", type(array_a))

print("\n Matrix:")
print(matrix_b)
print("Type of matrix:", type(matrix_b))

Array:
[[1 2]
[3 4]]
Type of array: <class 'numpy.ndarray'>

Matrix:
[[1 2]
[3 4]]
Type of matrix: <class 'numpy.matrix'>
