45B AIML Practical 1.1


Name of Student: Ahmed Mobin Ahmed Shaikh

Roll Number: 45 Lab Practical Number: 1.1

Title of Lab Assignment: Numpy, Pandas Implementation and exercises.

DOP: 23-01-24 DOS: 30-01-24

CO Mapped: CO1
PO Mapped: PO1, PO2, PO3, PSO1, PSO2
Signature:
NumPy Arrays | Numpy Arange | Numpy Linspace | Numpy Rand | Numpy Reshape | Numpy Shape
Notebook Link:
https://colab.research.google.com/drive/1re8EJQ0Q4PfEqIojf6L1G-IHlSE0OvDn#scrollTo=vBm7sbc6ed8C

NumPy Arrays:
NumPy is a powerful library for numerical computing in Python. One
of its key features is the NumPy array, a multidimensional array of
elements, usually of the same type. NumPy arrays are more efficient
than Python lists for numerical operations because they are
implemented in C and allow for vectorized operations.

Creating NumPy Arrays:

You can create NumPy arrays in various ways:

python
import numpy as np

# Creating an array from a list
arr_list = [1, 2, 3, 4, 5]
np_array_from_list = np.array(arr_list)

# Creating an array using np.arange
arr_range = np.arange(0, 10, 2)  # Creates an array from 0 to 10 (exclusive) with step 2

# Creating an array using np.linspace
arr_linspace = np.linspace(0, 1, 5)  # Creates an array of 5 evenly spaced values between 0 and 1

# Creating an array of random values using np.random
arr_random = np.random.rand(3, 3)  # Creates a 3x3 array of random values between 0 and 1

NumPy arange:

np.arange is a function that returns an array with regularly spaced values within a given interval. It is similar to the Python range function but returns a NumPy array.

python
arr_arange = np.arange(start, stop, step)

NumPy linspace:

np.linspace returns an array of evenly spaced values over a specified range. Unlike np.arange, it includes both the start and stop values by default, and you specify the number of elements you want.

python
arr_linspace = np.linspace(start, stop, num)

NumPy random:

np.random module provides functions for generating random data. Some commonly used functions are rand, randn, randint, random, and shuffle.

python
arr_random = np.random.rand(3, 3)  # Generates a 3x3 array of random values between 0 and 1
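For completeness, here is a small sketch of the other functions mentioned above (the exact values differ from run to run, since they are random):

python
# Samples from the standard normal distribution (mean 0, variance 1)
arr_randn = np.random.randn(4)

# Random integers from 0 (inclusive) to 10 (exclusive)
arr_randint = np.random.randint(0, 10, size=5)

# Random floats in the half-open interval [0.0, 1.0)
arr_random_floats = np.random.random(3)

# Shuffle an existing array in place
arr_to_shuffle = np.arange(5)
np.random.shuffle(arr_to_shuffle)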

NumPy reshape:

np.reshape is used to change the shape of an array. It allows you to reorganize the elements of an array into a new shape without changing their values.

python
arr_reshape = np.reshape(original_array, new_shape)
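As a concrete illustration, a short sketch that reshapes a 1D array of six elements into a 2x3 array (the total number of elements must stay the same):

python
original_array = np.arange(6)                     # array([0, 1, 2, 3, 4, 5])
arr_reshape = np.reshape(original_array, (2, 3))
# array([[0, 1, 2],
#        [3, 4, 5]])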

NumPy shape:

np.shape is a function that returns the shape of an array (the same information is also available as the array's .shape attribute). The shape is a tuple representing the dimensions of the array.

python
shape_tuple = np.shape(arr)
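For example, for the 3x3 random array created earlier, the shape is reported as a tuple of two dimensions:

python
arr = np.random.rand(3, 3)
print(np.shape(arr))  # (3, 3)
print(arr.shape)      # (3, 3) - equivalent attribute access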
NumPy Indexing and Selection | Fancy Indexing | Matrices in Python | Numpy in Machine Learning
Notebook Link:
https://colab.research.google.com/drive/1yVclsVQ8AmiqcT2euXUbn4MJTb5zxUjX

NumPy Indexing and Selection:


NumPy provides powerful indexing and selection mechanisms for
accessing elements or subsets of elements in arrays.

Basic Indexing:
python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

# Accessing elements using index
element_at_index_2 = arr[2]  # Returns 3

# Slicing to get a subset
subset = arr[1:4]  # Returns array([2, 3, 4])

Multidimensional Array Indexing:


python
arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Accessing elements in a 2D array
element_row2_col1 = arr_2d[1, 0]  # Returns 4

# Slicing a 2D array
subset_2d = arr_2d[:2, 1:] # Returns array([[2, 3], [5, 6]])

Fancy Indexing:
Fancy indexing allows you to use arrays of indices to access
multiple elements at once.
python
indices = np.array([0, 2, 1])
selected_elements = arr[indices] # Returns array([1, 3, 2])
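Fancy indexing also works on multidimensional arrays. A small sketch, reusing a 3x3 array like the one shown in the previous section, selects whole rows by their indices:

python
arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
row_indices = np.array([0, 2])
selected_rows = arr_2d[row_indices]  # Returns array([[1, 2, 3], [7, 8, 9]])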

Matrices in Python:

In NumPy, matrices are represented using the np.array class with two dimensions. Matrices can be created by passing nested lists or using the np.matrix class (whose use is no longer recommended).

python
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

NumPy provides functions for matrix operations such as multiplication (np.dot or the @ operator), inversion (np.linalg.inv), and determinant (np.linalg.det).
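A brief sketch of these operations, using a small invertible 2x2 matrix for illustration (the 3x3 example above is singular, so it has no inverse):

python
m = np.array([[1, 2], [3, 4]])

product = np.dot(m, m)           # Matrix multiplication
product_alt = m @ m              # Equivalent, using the @ operator
determinant = np.linalg.det(m)   # -2.0
inverse = np.linalg.inv(m)       # array([[-2. ,  1. ], [ 1.5, -0.5]])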

NumPy in Machine Learning:


NumPy is extensively used in machine learning for its efficient
array operations. Here are some key aspects:

1. Data Representation:

● In machine learning, datasets are often represented as NumPy arrays.
● Features of a dataset can be stored in a 2D array, where rows
represent samples and columns represent features.

2. Vectorization:

● NumPy allows vectorized operations, making it possible to perform operations on entire arrays without using explicit loops. This significantly improves computational efficiency (see the sketch after this list).

3. Linear Algebra:

● Linear algebra operations, such as matrix multiplication, are fundamental in machine learning algorithms. NumPy provides efficient implementations of these operations.

4. Random Number Generation:

● NumPy's random module is used for generating random numbers, which is often crucial in machine learning for tasks like data shuffling or initialization of weights in neural networks.
5. Indexing and Selection:

● Efficient indexing and selection using NumPy are essential for manipulating and accessing data in machine learning applications.
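As an example of point 2 above, a minimal sketch comparing an explicit Python loop with the equivalent vectorized expression (the arrays here are made up purely for illustration):

python
import numpy as np

features = np.random.rand(1000)
weights = np.random.rand(1000)

# Explicit Python loop
total = 0.0
for f, w in zip(features, weights):
    total += f * w

# Vectorized equivalent: one call, no Python-level loop
total_vectorized = np.dot(features, weights)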
Numpy Operations | Numpy Arithmetic Operations | Numpy Universal Array Functions
Notebook Link:
https://colab.research.google.com/drive/19RdU0QOltzqPPscWfkqkqMb8GEG7qttl

NumPy Operations:
NumPy provides a wide range of operations that can be performed on
arrays. These operations include arithmetic operations, statistical
operations, linear algebra operations, and more.

Arithmetic Operations:

NumPy allows you to perform element-wise arithmetic operations on arrays.

python
import numpy as np

arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

# Addition
result_addition = arr1 + arr2 # [5, 7, 9]

# Subtraction
result_subtraction = arr1 - arr2 # [-3, -3, -3]

# Multiplication
result_multiplication = arr1 * arr2 # [4, 10, 18]

# Division
result_division = arr1 / arr2 # [0.25, 0.4, 0.5]

# Element-wise power
result_power = arr1 ** 2 # [1, 4, 9]

Universal Array Functions (ufuncs):

NumPy also provides Universal Functions (ufuncs), which are functions that operate element-wise on arrays. These functions are highly optimized and can operate on arrays of any size and shape.

python
# Square root
result_sqrt = np.sqrt(arr1) # [1.0, 1.414, 1.732]
# Exponential
result_exp = np.exp(arr1) # [2.718, 7.389, 20.085]

# Trigonometric functions
result_sin = np.sin(arr1) # [0.841, 0.909, 0.141]
result_cos = np.cos(arr1) # [0.540, -0.416, -0.990]

Aggregation Functions:

NumPy provides functions for aggregating values in an array, such as sum, mean, median, min, max, etc.

python
# Sum
total_sum = np.sum(arr1) # 6

# Mean
mean_value = np.mean(arr1) # 2.0

# Minimum and maximum
min_value = np.min(arr1)  # 1
max_value = np.max(arr1)  # 3

Linear Algebra Operations:

NumPy has a comprehensive set of functions for linear algebra operations.

python
matrix_a = np.array([[1, 2], [3, 4]])
matrix_b = np.array([[5, 6], [7, 8]])

# Matrix multiplication
result_matrix_multiply = np.dot(matrix_a, matrix_b)

# Matrix determinant
matrix_det = np.linalg.det(matrix_a)

# Matrix inverse
matrix_inv = np.linalg.inv(matrix_a)
Pandas in Python | Series in Pandas | Pandas Series to Dataframe | Pandas Series to List
Notebook Link:
https://colab.research.google.com/drive/1y5rHaTKhKSwmkx-4jZbRQ7CcL471JlQd

Pandas in Python:

Pandas is a popular open-source data manipulation and analysis library for Python. It provides two primary data structures: Series and DataFrame. These structures are built on top of NumPy arrays, offering more functionality and flexibility for data manipulation.

Series in Pandas:

A Series is a one-dimensional labeled array in Pandas. It is capable of holding any data type, and each element in the Series has a label called an index.
Creating a Pandas Series:
python
import pandas as pd

# Creating a Series from a list
data_list = [1, 2, 3, 4, 5]
series_from_list = pd.Series(data_list)

# Creating a Series with a custom index
custom_index_series = pd.Series(data_list, index=['a', 'b', 'c', 'd', 'e'])

Pandas Series to DataFrame:


A DataFrame is a two-dimensional labeled data structure
with columns that can be of different data types. You can
convert a Pandas Series to a DataFrame using the
pd.DataFrame() constructor.

python
# Creating a DataFrame from a Series
df_from_series = pd.DataFrame(series_from_list, columns=['Column_Name'])

# Creating a DataFrame from multiple Series
series1 = pd.Series([1, 2, 3])
series2 = pd.Series(['a', 'b', 'c'])
df_from_multiple_series = pd.DataFrame({'Column1': series1, 'Column2': series2})

Pandas Series to List:

You can convert a Pandas Series to a Python list using the tolist() method.

python
# Converting a Pandas Series to a list
list_from_series = series_from_list.tolist()
DataFrames in Pandas:

A DataFrame is a two-dimensional labeled data structure in Pandas, resembling a table or a spreadsheet with rows and columns. It is one of the most widely used data structures for data manipulation and analysis in Python.
Creating a DataFrame:

There are several ways to create a DataFrame in Pandas:

1. From a Dictionary of Lists:

python
import pandas as pd

data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'San Francisco', 'Los Angeles']
}

df = pd.DataFrame(data)

2. From a List of Lists:

python
data_list = [
['Alice', 25, 'New York'],
['Bob', 30, 'San Francisco'],
['Charlie', 35, 'Los Angeles']
]

df = pd.DataFrame(data_list, columns=['Name', 'Age', 'City'])

3. From a NumPy Array:

python
import numpy as np

data_array = np.array([
['Alice', 25, 'New York'],
['Bob', 30, 'San Francisco'],
['Charlie', 35, 'Los Angeles']
])

df = pd.DataFrame(data_array, columns=['Name', 'Age', 'City'])

4. From a CSV File:

python
df = pd.read_csv('example.csv')

Essential DataFrame Operations:

Once you have a DataFrame, you can perform various operations on it:
1. Viewing Data:
python
# Display the first few rows
df.head()

# Display the last few rows
df.tail()

# Display basic statistics
df.describe()

2. Selecting Data:
python
# Selecting a column
name_column = df['Name']
# Selecting multiple columns
selected_columns = df[['Name', 'Age']]

3. Filtering Data:
python
# Filtering based on a condition
filtered_df = df[df['Age'] > 30]

4. Adding and Removing Columns:


python
# Adding a new column
df['Salary'] = [50000, 60000, 70000]

# Removing a column
df = df.drop('Salary', axis=1)

5. Handling Missing Data:


python
# Check for missing values
df.isnull()

# Drop rows with missing values
df = df.dropna()

# Fill missing values with a specific value
df = df.fillna(0)

6. Grouping and Aggregation:


python
# Group by a column and calculate the mean of the numeric columns
grouped_df = df.groupby('City').mean(numeric_only=True)

7. Merging DataFrames:
python
# Merge two DataFrames
merged_df = pd.merge(df1, df2, on='Key_Column')

8. Writing to CSV:
python
df.to_csv('output.csv', index=False)
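Here index=False simply prevents the DataFrame's row index from being written to the file as an extra column; omit it if you want the index preserved in the CSV.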
Handling Missing Data:
1. Dropping Missing Values:

Use dropna() to remove rows or columns with missing values.

python
# Drop rows with any missing values
df_no_missing_rows = df.dropna()

# Drop columns with any missing values
df_no_missing_cols = df.dropna(axis=1)

2. Filling Missing Values:

Use fillna() to fill missing values with a specified value or a calculated value (mean, median, etc.).

python
# Fill missing values with a specific value (e.g., 0)
df_filled_zero = df.fillna(0)

# Fill missing values with the mean of each numeric column
df_filled_mean = df.fillna(df.mean(numeric_only=True))

3. Interpolation:

Use interpolate() to fill missing values by interpolating between existing values.

python
# Interpolate missing values linearly
df_interpolated = df.interpolate()

4. Forward or Backward Fill:

Use ffill (forward fill) or bfill (backward fill) to fill missing values with the previous or next valid value.

python
# Forward fill missing values
df_ffill = df.ffill()
# Backward fill missing values
df_bfill = df.bfill()

5. Handling Missing Values in Time Series:

For time series data, you might want to fill missing values using methods like forward-fill or backward-fill with specific limits.

python
# Forward fill with a limit of 1 (fills at most one consecutive missing value)
df_ffill_limit = df.ffill(limit=1)
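To tie these methods together, a minimal sketch on a small made-up DataFrame with missing values (the column names and values are illustrative only):

python
import pandas as pd
import numpy as np

df = pd.DataFrame({'day': [1, 2, 3, 4],
                   'temp': [20.0, np.nan, np.nan, 23.0]})

print(df.dropna())        # Keeps only days 1 and 4
print(df.fillna(0))       # Missing temperatures become 0
print(df.interpolate())   # Missing temperatures become 21.0 and 22.0
print(df.ffill(limit=1))  # Only the first gap after day 1 is filled with 20.0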
Pandas Operations:

Pandas provides a rich set of operations for data manipulation. Here, we'll discuss some common operations: GroupBy, Merge, Joins, and Concatenation.
1. GroupBy with Pandas:

The GroupBy operation involves splitting the data based on some criteria, applying a function to each group independently, and then combining the results.

python
import pandas as pd

# Creating a DataFrame
data = {'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
'Value': [10, 20, 15, 25, 12, 18]}
df = pd.DataFrame(data)

# Grouping by 'Category' and calculating the mean for each group
grouped_df = df.groupby('Category').mean()

2. Merge with Pandas:

Merging is a way to combine two DataFrames based on a common column.

python
# Creating two DataFrames
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Age': [25, 30, 35]})

# Merging DataFrames based on the 'ID' column
merged_df = pd.merge(df1, df2, on='ID')

3. Joins with Pandas:

Joins are similar to merges but allow you to specify the type of join (inner, outer, left, or right).

python
# Performing a left join
left_join_df = pd.merge(df1, df2, on='ID', how='left')
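Using the same df1 and df2 from the merge example, a quick sketch of how the how parameter changes the result (rows without a match get NaN in the columns coming from the other DataFrame):

python
inner_df = pd.merge(df1, df2, on='ID', how='inner')  # IDs 2 and 3 only
outer_df = pd.merge(df1, df2, on='ID', how='outer')  # IDs 1, 2, 3 and 4
right_df = pd.merge(df1, df2, on='ID', how='right')  # IDs 2, 3 and 4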

4. Concatenation:

Concatenation combines two DataFrames along a particular axis (either rows or columns).

python
# Concatenating two DataFrames vertically (along rows)
concatenated_df = pd.concat([df1, df2], axis=0)

# Concatenating two DataFrames horizontally (along columns)
concatenated_df_horizontal = pd.concat([df1, df2], axis=1)

Summary of Operations:
1. GroupBy:

● Use groupby() to group data based on specific criteria.
● Apply aggregation functions like mean(), sum(), etc.,
on grouped data.

2. Merge:

● Use merge() to combine two DataFrames based on a common column.
● Specify the 'on' parameter as the column for merging.

3. Joins:

● Joins are a type of merge.
● Use the how parameter to specify the type of join (inner, outer, left, right).

4. Concatenation:

● Use concat() to combine DataFrames along a specified axis.
● Specify the axis (0 for rows, 1 for columns).
Exploratory Data Analysis - 1
Exploratory Data Analysis (EDA) is a crucial step in the
data analysis process. It involves exploring and
understanding the main characteristics of a dataset
before applying more advanced statistical modeling. Here,
I'll detail various aspects of EDA:

1. Understanding the Data:


- Load the Data:

● Import necessary libraries (e.g., Pandas, NumPy) and load your dataset into a DataFrame.

python
import pandas as pd

# Load the dataset
df = pd.read_csv('your_dataset.csv')

- Initial Inspection:

● Use methods like head(), info(), and describe() to get an initial overview of the dataset.

python
# Display the first few rows
print(df.head())

# Get general information about the dataset
print(df.info())

# Get summary statistics
print(df.describe())

2. Dealing with Missing Values:


- Identify Missing Values:

● Check for missing values using isnull().

python
# Check for missing values
print(df.isnull().sum())

- Handling Missing Values:

● Decide on a strategy for handling missing values (removing, imputing, etc.).

python
# Drop rows with missing values
df = df.dropna()

# Impute missing values with the column mean
df['column_name'] = df['column_name'].fillna(df['column_name'].mean())

3. Exploratory Visualization:
- Univariate Analysis:

● Visualize the distribution of individual variables.

python
import matplotlib.pyplot as plt
import seaborn as sns

# Histogram for a numeric variable
plt.hist(df['numeric_column'], bins=20)
plt.show()

# Bar chart for a categorical variable
sns.countplot(x='category_column', data=df)
plt.show()
- Bivariate Analysis:

● Explore relationships between pairs of variables.

python
# Scatter plot for two numeric variables
plt.scatter(df['numeric_column1'], df['numeric_column2'])
plt.xlabel('Numeric Column 1')
plt.ylabel('Numeric Column 2')
plt.show()

# Boxplot for a numeric variable across categories
sns.boxplot(x='category_column', y='numeric_column', data=df)
plt.show()

- Correlation Analysis:

● Understand the correlation between numeric variables.

python
# Correlation matrix (numeric columns only)
correlation_matrix = df.corr(numeric_only=True)

# Heatmap for the correlation matrix
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()

4. Feature Engineering:
- Creating New Features:

● Derive new features that might be more informative.

python
# Create a new feature
df['new_feature'] = df['numeric_column1'] * df['numeric_column2']

- Transforming Features:

● Apply transformations to existing features (e.g., log transformation).

python
import numpy as np

# Log transformation
df['log_numeric_column'] = np.log(df['numeric_column'])
5. Statistical Testing:
- Hypothesis Testing:

● Conduct statistical tests to validate hypotheses.

python
from scipy.stats import ttest_ind

# Perform t-test for two groups
group1 = df[df['condition'] == 'A']['numeric_column']
group2 = df[df['condition'] == 'B']['numeric_column']
t_stat, p_value = ttest_ind(group1, group2)
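A short follow-up, assuming the conventional 0.05 significance level, to interpret the result:

python
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.3f}")
if p_value < 0.05:
    print("The difference between groups A and B is statistically significant.")
else:
    print("No statistically significant difference was found.")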
Colab Link:
https://colab.research.google.com/drive/19RdU0QOltzqPPscWfkqkqMb8GEG7qttl#scrollTo=MSAho0QKo-2F
