Data Science 1-5
Viewing data in Python is crucial when we need to inspect its contents. Pandas offers functions like
head() and tail() to view the first or last few rows of a dataset, making it easy to get a quick glimpse
of the data.
Explanation:
head(): Displays the first few rows (default: 5 rows).
tail(): Displays the last few rows (default: 5; an argument such as tail(3) returns the last 3 rows).
Manipulating data involves transforming, modifying, or cleaning the dataset to suit your needs.
Some common tasks include renaming columns, filtering data, adding new columns, or removing
rows/columns.
Explanation:
Renaming columns: Makes the dataset more readable or aligns it with a specific standard.
Filtering data: Extracts rows that meet certain conditions.
Code:
1. Describing Data
import pandas as pd
# Sample data
data = {
'Age': [25, 30, 35, 40, 45, 50, 55],
'Salary': [30000, 40000, 50000, 60000, 70000, 80000, 90000],
'Experience': [2, 4, 6, 8, 10, 12, 14]
}
df = pd.DataFrame(data)
# Describing the data
summary = df.describe()
print(summary)
Output:
2. Viewing Data
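The original listing is not preserved; a minimal sketch using the same sample data as the describe() example above:

```python
import pandas as pd

# Same sample data as in the describe() example
data = {
    'Age': [25, 30, 35, 40, 45, 50, 55],
    'Salary': [30000, 40000, 50000, 60000, 70000, 80000, 90000],
    'Experience': [2, 4, 6, 8, 10, 12, 14]
}
df = pd.DataFrame(data)

# First 5 rows (the default)
print(df.head())

# Last 3 rows
print(df.tail(3))
```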
Output:
3. Manipulating Data
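The original listing is not preserved; a minimal sketch of the manipulation tasks described above (renaming, filtering, adding, and removing columns), reusing the same sample data:

```python
import pandas as pd

data = {
    'Age': [25, 30, 35, 40, 45, 50, 55],
    'Salary': [30000, 40000, 50000, 60000, 70000, 80000, 90000],
    'Experience': [2, 4, 6, 8, 10, 12, 14]
}
df = pd.DataFrame(data)

# Renaming a column
df = df.rename(columns={'Salary': 'Annual_Salary'})

# Filtering rows that meet a condition
filtered = df[df['Age'] > 40]
print(filtered)

# Adding a new column derived from existing ones
df['Salary_per_Year_Exp'] = df['Annual_Salary'] / df['Experience']

# Removing a column
df = df.drop(columns=['Experience'])
print(df.head())
```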
Output:
PROGRAM 2
Aim: To plot the probability distribution curve.
Theory:
A probability distribution describes how the values of a random variable are distributed. It shows
the probability of different outcomes. In data science, visualizing these distributions can help in
understanding the underlying patterns in the data.
Normal distribution: A bell-shaped curve where data tends to be around a central value.
Exponential distribution: Describes the time between events in a Poisson process.
Uniform distribution: All outcomes are equally likely.
The most commonly used distribution in data science is the normal distribution. We can visualize
the distribution of data using the probability density function (PDF) or a histogram. Python provides
libraries like Matplotlib and Seaborn to help plot these curves.
Code:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Sample 1,000 points from a standard normal distribution
data = np.random.normal(loc=0, scale=1, size=1000)
# Histogram with a kernel density estimate (the probability distribution curve)
sns.histplot(data, kde=True, stat='density')
plt.title('Normal Distribution')
plt.show()
Output:
PROGRAM 3
Aim: To perform the Chi-Square test on various data sets.
Theory:
The Chi-Square (χ²) test is a statistical method to determine if there is a significant association
between two categorical variables. The test compares the observed frequencies in each category to
the frequencies expected if there is no association between the variables.
The test statistic is:
χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ
Where:
Oᵢ = observed frequency
Eᵢ = expected frequency
The Chi-Square test requires:
The sample data to be categorical.
Expected frequencies to be large enough (at least 5 in each cell).
Python’s SciPy library provides the chi2_contingency() function for the Chi-Square Test
of Independence.
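As a quick check of the formula, the expected frequencies can be computed by hand from row and column totals; a short sketch with an illustrative 2×2 table:

```python
import numpy as np

# Illustrative 2x2 contingency table of observed frequencies
O = np.array([[20, 25],
              [30, 25]])

row_totals = O.sum(axis=1, keepdims=True)   # column vector of row sums
col_totals = O.sum(axis=0, keepdims=True)   # row vector of column sums
grand_total = O.sum()

# E_ij = (row total * column total) / grand total
E = row_totals * col_totals / grand_total

# Chi-square statistic: sum over cells of (O - E)^2 / E
chi2 = ((O - E) ** 2 / E).sum()
print(E)
print(chi2)
```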
Code:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
# Sample contingency table of observed frequencies (preference by gender)
observed = pd.DataFrame({'Male': [20, 30], 'Female': [25, 25]},
                        index=['Likes', 'Dislikes'])
# Chi-Square Test of Independence
chi2, p_value, dof, expected = chi2_contingency(observed)
# Output results
print(f"Chi-Square Statistic: {chi2}")
print(f"P-value: {p_value}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies:")
print(expected)
Output:
PROGRAM 4
Aim: To use python as a programming tool for the analysis of data structures.
Theory:
Data structures are fundamental components in programming, providing ways to organize and
store data efficiently. In Python, commonly used data structures include:
Lists: A dynamic array that can hold elements of different data types.
Dictionaries: A collection of key-value pairs.
Sets: A collection of unique elements.
Tuples: Immutable sequences, similar to lists but unchangeable.
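A brief sketch of the four built-in structures listed above:

```python
# List: ordered, mutable, mixed types allowed
fruits = ['apple', 'banana', 'cherry']
fruits.append('date')

# Dictionary: key-value pairs with fast lookup
ages = {'Alice': 25, 'Bob': 30}
ages['Carol'] = 35

# Set: unordered collection of unique elements
unique_ids = {1, 2, 2, 3}   # duplicates collapse away

# Tuple: immutable sequence
point = (4, 5)

print(fruits, ages, unique_ids, point)
```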
Python also supports more complex data structures, such as Stacks, Queues, Linked Lists, and
Trees. Each data structure has its unique properties and performance trade-offs.
By analyzing data structures, we explore their performance characteristics, such as time complexity
(speed) and space complexity (memory usage) when performing common operations like insertion,
deletion, searching, and traversal.
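Stacks and queues, for example, can be built on top of Python's built-ins; collections.deque gives O(1) operations at both ends (a minimal sketch):

```python
from collections import deque

# Stack (LIFO): push and pop at the same end, both O(1)
stack = deque()
stack.append(1)
stack.append(2)
top = stack.pop()          # removes and returns the last element pushed

# Queue (FIFO): enqueue at one end, dequeue at the other, both O(1)
queue = deque()
queue.append('a')
queue.append('b')
first = queue.popleft()    # removes and returns the earliest element added

print(top, first)
```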
1. List
Lists in Python are dynamic arrays. They allow operations such as:
Appending: Adding an element to the end of the list.
Inserting: Adding an element at a specific position.
Deleting: Removing an element by value or index.
Accessing: Retrieving elements using indices.
The time complexities of these operations are generally:
Append: O(1)
Insert/Delete (at a specific index): O(n), since the remaining elements need to be shifted.
Accessing by index: O(1)
2. Dictionary
Dictionaries are implemented as hash tables, where each key maps to a value. The most important
operations are:
Insertion: Adding a key-value pair, O(1) on average.
Searching: Looking up a value by key, O(1) on average.
Deletion: Removing a key-value pair, O(1) on average.
Code:
1. Lists
import time
# Create a list of 1 million integers
data = list(range(1000000))
# 1. Appending an element
start = time.time()
data.append(1000001)
end = time.time()
print(f"Time taken to append: {end - start} seconds")
# 2. Inserting an element at the beginning (index 0)
start = time.time()
data.insert(0, -1)
end = time.time()
print(f"Time taken to insert at the beginning: {end - start} seconds")
# 3. Deleting an element from the middle
start = time.time()
del data[len(data) // 2]
end = time.time()
print(f"Time taken to delete from the middle: {end - start} seconds")
# 4. Accessing an element
start = time.time()
element = data[500000]
end = time.time()
print(f"Time taken to access an element: {end - start} seconds")
Output:
Code:
2. Dictionary
import time
# Creating a large dictionary
data_dict = {i: i*2 for i in range(1000000)}
# 1. Inserting a key-value pair
start = time.time()
data_dict[1000000] = 2000000
end = time.time()
print(f"Time taken to insert: {end - start} seconds")
# 2. Searching for a key
start = time.time()
value = data_dict.get(500000)
end = time.time()
print(f"Time taken to search: {end - start} seconds")
# 3. Deleting a key-value pair
start = time.time()
del data_dict[500000]
end = time.time()
print(f"Time taken to delete: {end - start} seconds")
Output:
PROGRAM 5
Aim: To perform various operations such as data storage, analysis and visualization.
Theory:
Python is a versatile tool for handling data storage, data analysis, and data visualization. Each of
these tasks involves different libraries:
Data storage: Storing data in files (CSV, Excel, JSON) or databases.
Data analysis: Analyzing data using libraries like Pandas and NumPy, which provide functions to
manipulate and derive insights from data.
Data visualization: Presenting data graphically using libraries like Matplotlib and Seaborn, which
allow you to create plots, charts, and graphs.
Data storage involves saving data in a structured format such as CSV, Excel, JSON, or databases.
Python's Pandas library can read and write to these formats with ease. Storing data is crucial for
preserving results, analysis, or records for further use.
Explanation:
Pandas DataFrame: A 2D data structure used to hold tabular data.
to_csv(): Stores the data into a CSV file. The index=False option excludes the index from the file.
Data analysis involves exploring and manipulating data to uncover insights, trends, and patterns.
Pandas and NumPy are commonly used for these tasks. Pandas provides functions for summarizing,
filtering, aggregating, and manipulating data efficiently.
Explanation:
Loading data: The read_csv() function loads data from a CSV file into a Pandas DataFrame.
head(): Displays the first few rows of the DataFrame.
describe(): Provides descriptive statistics for numerical columns (like mean, count, std).
Filtering: Conditions like df['Age'] > 22 are used to filter rows based on criteria.
Data visualization is crucial for communicating data-driven insights in a clear and compelling way.
Libraries like Matplotlib and Seaborn provide tools for plotting different types of graphs (e.g., line
plots, bar charts, histograms, scatter plots, etc.).
Explanation:
countplot(): Displays the count of occurrences for categorical data (Grades).
histplot(): Plots the distribution of numerical data (Ages), showing how frequently each value
occurs. The kde=True option overlays a kernel density estimate to smooth the distribution curve.
Code:
1. DATA STORAGE
import pandas as pd
# Sample data (the original table is not preserved; values are illustrative)
data = {
'Name': ['Alice', 'Bob', 'Carol', 'David'],
'Age': [21, 23, 22, 24],
'Grade': ['A', 'B', 'A', 'C']
}
# Create a DataFrame
df = pd.DataFrame(data)
# Store the data in a CSV file; index=False excludes the row index
df.to_csv('students.csv', index=False)
print(df)
Output:
Code:
2. DATA ANALYSIS
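The original listing is not preserved; a minimal sketch matching the explanation above (read_csv, head, describe, and the Age filter; 'students.csv' and its columns are assumed to come from the storage step):

```python
import pandas as pd

# Create a small CSV to analyze ('students.csv' and its columns are assumed)
pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol', 'David'],
    'Age': [21, 23, 22, 24],
    'Grade': ['A', 'B', 'A', 'C']
}).to_csv('students.csv', index=False)

# Load the CSV file into a DataFrame
df = pd.read_csv('students.csv')

print(df.head())        # first few rows
print(df.describe())    # count, mean, std, min, quartiles, max

# Filtering: rows where Age is greater than 22
older = df[df['Age'] > 22]
print(older)
```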
Output:
Code:
3. DATA VISUALIZATION
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(8, 5))
sns.countplot(x='Grade', data=df, palette='Set2')
plt.title('Count of Students by Grade')
plt.xlabel('Grade')
plt.ylabel('Count')
plt.show()
plt.figure(figsize=(8, 5))
sns.histplot(df['Age'], bins=5, kde=True, color='blue')
plt.title('Distribution of Students\' Ages')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
Output: