Data Science 1-5
Viewing data in Python is crucial when we need to inspect its contents. Pandas offers functions like
head() and tail() to view the first or last few rows of a dataset, making it easy to get a quick glimpse
of the data.
Explanation:
head(): Displays the first few rows (default: 5 rows).
tail(): Displays the last few rows (default: 5; an argument such as tail(3) returns the last 3 rows).
Manipulating data involves transforming, modifying, or cleaning the dataset to suit your needs.
Some common tasks include renaming columns, filtering data, adding new columns, or removing
rows/columns.
Explanation:
Renaming columns: Makes the dataset more readable or aligns it with a specific standard.
Filtering data: Extracts rows that meet certain conditions.
Code:
1. Describing Data
import pandas as pd
# Sample data
data = {
'Age': [25, 30, 35, 40, 45, 50, 55],
'Salary': [30000, 40000, 50000, 60000, 70000, 80000, 90000],
'Experience': [2, 4, 6, 8, 10, 12, 14]
}
df = pd.DataFrame(data)
# Describing the data
summary = df.describe()
print(summary)
Output:
2. Viewing Data
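The original listing is not preserved; a minimal sketch using the same sample data as the describe() example above:

```python
import pandas as pd

# Same sample data as in the describe() example
data = {
    'Age': [25, 30, 35, 40, 45, 50, 55],
    'Salary': [30000, 40000, 50000, 60000, 70000, 80000, 90000],
    'Experience': [2, 4, 6, 8, 10, 12, 14]
}
df = pd.DataFrame(data)

# First 5 rows (the default)
print(df.head())

# Last 3 rows
print(df.tail(3))
```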
Output:
3. Manipulating Data
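The original listing is not preserved; a minimal sketch of the manipulation tasks described above (renaming, filtering, adding, and removing columns), reusing the same sample data:

```python
import pandas as pd

data = {
    'Age': [25, 30, 35, 40, 45, 50, 55],
    'Salary': [30000, 40000, 50000, 60000, 70000, 80000, 90000],
    'Experience': [2, 4, 6, 8, 10, 12, 14]
}
df = pd.DataFrame(data)

# Renaming a column
df = df.rename(columns={'Salary': 'Annual_Salary'})

# Filtering rows that meet a condition
filtered = df[df['Age'] > 40]
print(filtered)

# Adding a new column derived from existing ones
df['Salary_per_Year_Exp'] = df['Annual_Salary'] / df['Experience']

# Removing a column
df = df.drop(columns=['Experience'])
print(df.head())
```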
Output:
PROGRAM 2
Aim: To plot the probability distribution curve.
Theory:
A probability distribution describes how the values of a random variable are distributed. It shows
the probability of different outcomes. In data science, visualizing these distributions can help in
understanding the underlying patterns in the data.
Normal distribution: A bell-shaped curve where data tends to be around a central value.
Exponential distribution: Describes the time between events in a Poisson process.
Uniform distribution: All outcomes are equally likely.
The most commonly used distribution in data science is the normal distribution. We can visualize
the distribution of data using the probability density function (PDF) or a histogram. Python provides
libraries like Matplotlib and Seaborn to help plot these curves.
Code:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Sample 1,000 points from a standard normal distribution
data = np.random.normal(loc=0, scale=1, size=1000)
# Histogram with a kernel density estimate (the probability distribution curve)
sns.histplot(data, kde=True, stat='density')
plt.title('Normal Distribution')
plt.show()
Output:
PROGRAM 3
Aim: To perform the Chi-Square test on various data sets.
Theory:
The Chi-Square (χ²) test is a statistical method to determine if there is a significant association
between two categorical variables. The test compares the observed frequencies in each category to
the frequencies expected if there is no association between the variables.
The test statistic is:
χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ
Where:
Oᵢ = observed frequency
Eᵢ = expected frequency
The Chi-Square test requires:
The sample data to be categorical.
Expected frequencies to be large enough (at least 5 in each cell).
Python’s SciPy library provides the chi2_contingency() function for the Chi-Square Test
of Independence.
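As a quick check of the formula, the expected frequencies can be computed by hand from row and column totals; a short sketch with an illustrative 2×2 table:

```python
import numpy as np

# Illustrative 2x2 contingency table of observed frequencies
O = np.array([[20, 25],
              [30, 25]])

row_totals = O.sum(axis=1, keepdims=True)   # column vector of row sums
col_totals = O.sum(axis=0, keepdims=True)   # row vector of column sums
grand_total = O.sum()

# E_ij = (row total * column total) / grand total
E = row_totals * col_totals / grand_total

# Chi-square statistic: sum over cells of (O - E)^2 / E
chi2 = ((O - E) ** 2 / E).sum()
print(E)
print(chi2)
```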
Code:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
# Sample contingency table of observed frequencies (preference by gender)
observed = pd.DataFrame({'Male': [20, 30], 'Female': [25, 25]},
                        index=['Likes', 'Dislikes'])
# Chi-Square Test of Independence
chi2, p_value, dof, expected = chi2_contingency(observed)
# Output results
print(f"Chi-Square Statistic: {chi2}")
print(f"P-value: {p_value}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies:")
print(expected)
Output:
PROGRAM 4
Aim: To use python as a programming tool for the analysis of data structures.
Theory:
Data structures are fundamental components in programming, providing ways to organize and
store data efficiently. In Python, commonly used data structures include:
Lists: A dynamic array that can hold elements of different data types.
Dictionaries: A collection of key-value pairs.
Sets: A collection of unique elements.
Tuples: Immutable sequences, similar to lists but unchangeable.
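A brief sketch of the four built-in structures listed above:

```python
# List: ordered, mutable, mixed types allowed
fruits = ['apple', 'banana', 'cherry']
fruits.append('date')

# Dictionary: key-value pairs with fast lookup
ages = {'Alice': 25, 'Bob': 30}
ages['Carol'] = 35

# Set: unordered collection of unique elements
unique_ids = {1, 2, 2, 3}   # duplicates collapse away

# Tuple: immutable sequence
point = (4, 5)

print(fruits, ages, unique_ids, point)
```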
Python also supports more complex data structures, such as Stacks, Queues, Linked Lists, and
Trees. Each data structure has its unique properties and performance trade-offs.
By analyzing data structures, we explore their performance characteristics, such as time complexity
(speed) and space complexity (memory usage) when performing common operations like insertion,
deletion, searching, and traversal.
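Stacks and queues, for example, can be built on top of Python's built-ins; collections.deque gives O(1) operations at both ends (a minimal sketch):

```python
from collections import deque

# Stack (LIFO): push and pop at the same end, both O(1)
stack = deque()
stack.append(1)
stack.append(2)
top = stack.pop()          # removes and returns the last element pushed

# Queue (FIFO): enqueue at one end, dequeue at the other, both O(1)
queue = deque()
queue.append('a')
queue.append('b')
first = queue.popleft()    # removes and returns the earliest element added

print(top, first)
```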
1. List
Lists in Python are dynamic arrays. They allow operations such as:
Appending: Adding an element to the end of the list.
Inserting: Adding an element at a specific position.
Deleting: Removing an element by value or index.
Accessing: Retrieving elements using indices.
The time complexities of these operations are generally:
Append: O(1)
Insert/Delete (at a specific index): O(n), since the remaining elements need to be shifted.
Accessing by index: O(1)
2. Dictionary
Dictionaries are implemented as hash tables, where each key maps to a value. The most important
operations are:
Insertion: Adding a key-value pair, O(1) on average.
Searching: Looking up a value by key, O(1) on average.
Deletion: Removing a key-value pair, O(1) on average.
Code:
1. Lists
import time
# Create a list of 1 million integers
data = list(range(1000000))
# 1. Appending an element
start = time.time()
data.append(1000001)
end = time.time()
print(f"Time taken to append: {end - start} seconds")
# 2. Inserting an element at the beginning (index 0)
start = time.time()
data.insert(0, -1)
end = time.time()
print(f"Time taken to insert at the beginning: {end - start} seconds")
# 3. Deleting an element from the middle
start = time.time()
del data[len(data) // 2]
end = time.time()
print(f"Time taken to delete from the middle: {end - start} seconds")
# 4. Accessing an element
start = time.time()
element = data[500000]
end = time.time()
print(f"Time taken to access an element: {end - start} seconds")
Output:
Code:
2. Dictionary
import time
# Creating a large dictionary
data_dict = {i: i*2 for i in range(1000000)}
# 1. Inserting a key-value pair
start = time.time()
data_dict[1000000] = 2000000
end = time.time()
print(f"Time taken to insert: {end - start} seconds")
# 2. Searching for a key
start = time.time()
value = data_dict.get(500000)
end = time.time()
print(f"Time taken to search: {end - start} seconds")
# 3. Deleting a key-value pair
start = time.time()
del data_dict[500000]
end = time.time()
print(f"Time taken to delete: {end - start} seconds")
Output:
PROGRAM 5
Aim: To perform various operations such as data storage, analysis and visualization.
Theory:
Python is a versatile tool for handling data storage, data analysis, and data visualization. Each of
these tasks involves different libraries:
Data storage: Storing data in files (CSV, Excel, JSON) or databases.
Data analysis: Analyzing data using libraries like Pandas and NumPy, which provide functions to
manipulate and derive insights from data.
Data visualization: Presenting data graphically using libraries like Matplotlib and Seaborn, which
allow you to create plots, charts, and graphs.
Data storage involves saving data in a structured format such as CSV, Excel, JSON, or databases.
Python's Pandas library can read and write to these formats with ease. Storing data is crucial for
preserving results, analysis, or records for further use.
Explanation:
Pandas DataFrame: A 2D data structure used to hold tabular data.
to_csv(): Stores the data into a CSV file. The index=False option excludes the index from the file.
Data analysis involves exploring and manipulating data to uncover insights, trends, and patterns.
Pandas and NumPy are commonly used for these tasks. Pandas provides functions for summarizing,
filtering, aggregating, and manipulating data efficiently.
Explanation:
Loading data: The read_csv() function loads data from a CSV file into a Pandas DataFrame.
head(): Displays the first few rows of the DataFrame.
describe(): Provides descriptive statistics for numerical columns (like mean, count, std).
Filtering: Conditions like df['Age'] > 22 are used to filter rows based on criteria.
Data visualization is crucial for communicating data-driven insights in a clear and compelling way.
Libraries like Matplotlib and Seaborn provide tools for plotting different types of graphs (e.g., line
plots, bar charts, histograms, scatter plots, etc.).
Explanation:
countplot(): Displays the count of occurrences for categorical data (Grades).
histplot(): Plots the distribution of numerical data (Ages), showing how frequently each value
occurs. The kde=True option overlays a kernel density estimate to smooth the distribution curve.
Code:
1. DATA STORAGE
import pandas as pd
# Sample data (the original table is not preserved; values are illustrative)
data = {
'Name': ['Alice', 'Bob', 'Carol', 'David'],
'Age': [21, 23, 22, 24],
'Grade': ['A', 'B', 'A', 'C']
}
# Create a DataFrame
df = pd.DataFrame(data)
# Store the data in a CSV file; index=False excludes the row index
df.to_csv('students.csv', index=False)
print(df)
Output:
Code:
2. DATA ANALYSIS
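The original listing is not preserved; a minimal sketch matching the explanation above (read_csv, head, describe, and the Age filter; 'students.csv' and its columns are assumed to come from the storage step):

```python
import pandas as pd

# Create a small CSV to analyze ('students.csv' and its columns are assumed)
pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol', 'David'],
    'Age': [21, 23, 22, 24],
    'Grade': ['A', 'B', 'A', 'C']
}).to_csv('students.csv', index=False)

# Load the CSV file into a DataFrame
df = pd.read_csv('students.csv')

print(df.head())        # first few rows
print(df.describe())    # count, mean, std, min, quartiles, max

# Filtering: rows where Age is greater than 22
older = df[df['Age'] > 22]
print(older)
```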
Output:
Code:
3. DATA VISUALIZATION
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(8, 5))
sns.countplot(x='Grade', data=df, palette='Set2')
plt.title('Count of Students by Grade')
plt.xlabel('Grade')
plt.ylabel('Count')
plt.show()
plt.figure(figsize=(8, 5))
sns.histplot(df['Age'], bins=5, kde=True, color='blue')
plt.title('Distribution of Students\' Ages')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
Output: