unit-3(FODS)
NumPy is a fundamental package in Python for scientific computing, providing support for
arrays, matrices, and a large collection of mathematical functions to operate on these data
structures. It's especially useful for numerical data manipulation and is commonly used in fields
like data science, machine learning, engineering, and physics due to its efficiency and simplicity
in handling large datasets.
Key Features of NumPy:
• N-dimensional array object (ndarray) stored in contiguous memory
• Fast, vectorized element-wise operations without explicit Python loops
• Broadcasting rules for combining arrays of different shapes
• Built-in routines for linear algebra, statistics, and random number generation
Basic Usage
To start using NumPy, you need to import it:
import numpy as np
1. Creating Arrays
• 1D array:
arr = np.array([1, 2, 3])
• 2D array (Matrix):
matrix = np.array([[1, 2], [3, 4]])
2. Array Operations
• Arithmetic operations:
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
sum_arr = arr1 + arr2  # Element-wise addition: [5, 7, 9]
• Mathematical functions:
sin_values = np.sin(arr)  # Applies sine to each element in arr
• Matrix multiplication:
product = matrix @ matrix  # Or: np.dot(matrix, matrix)
• Reshaping:
reshaped = np.arange(6).reshape(2, 3)  # Six elements become a 2x3 matrix
• Slicing:
sub_arr = arr[0:2]  # First two elements
• Statistical functions:
mean_val = np.mean(arr)  # Also np.sum, np.max, np.std, and more
• Logical Operations:
bool_arr = arr > 2  # Returns an array of True/False for each condition
NumPy is efficient because it uses contiguous memory blocks (similar to C arrays), so it can
perform operations much faster than Python lists. The concise syntax also makes it easier to read,
write, and maintain code for handling numerical data.
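As a small illustration of this speed and concision (exact timings vary by machine), compare squaring ten thousand numbers with a list comprehension versus a single vectorized NumPy expression:

```python
import numpy as np

# Square ten thousand numbers two ways.
# The list comprehension loops in the Python interpreter;
# the NumPy expression runs as one vectorized operation in C.
nums = list(range(10_000))
squares_list = [n * n for n in nums]

arr = np.arange(10_000)
squares_arr = arr ** 2  # no explicit Python loop

# Both give identical values; the NumPy version is typically far faster.
print(bool((squares_arr == np.array(squares_list)).all()))  # True
```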
Pandas
Pandas is a popular Python library for data manipulation and analysis, built on top of NumPy. It
provides data structures and functions to efficiently handle large datasets, making it ideal for data
cleaning, preparation, and exploration. The main data structures in Pandas are Series and
DataFrame, which are well-suited for structured data.
Key Features of Pandas:
1. Series and DataFrame: Labeled one- and two-dimensional data structures.
2. Handling Missing Data: Detect, fill, or drop missing values easily.
3. Grouping and Aggregation: Summarize data by groups with groupby.
4. Data Import and Export: Read from and write to various formats like CSV, Excel,
SQL, and more.
5. Powerful Indexing: Allows you to filter, select, and manipulate subsets of data easily.
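A brief sketch of this kind of selection and filtering (the data values and column names are illustrative):

```python
import pandas as pd

# Illustrative data; the column names are assumptions for this sketch
df = pd.DataFrame({'Name': ['A', 'B', 'C'], 'Age': [25, 30, 35]})

first_row = df.loc[0]    # select by index label
last_row = df.iloc[-1]   # select by integer position

# Boolean indexing keeps only rows matching a condition
over_28 = df[df['Age'] > 28]
print(over_28['Name'].tolist())  # ['B', 'C']
```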
Installation
pip install pandas
Then import it:
import pandas as pd
1. Creating a DataFrame
A DataFrame is like a table (rows and columns). You can create one manually using a dictionary
or import data from a file.
data = {
    'Name': ['A', 'B', 'C'],
    'Age': [25, 30, 35],
    'City': ['Delhi', 'Mumbai', 'Delhi']
}
df = pd.DataFrame(data)
print(df)
2. Reading and Writing Files
df = pd.read_csv('data.csv')
df.to_csv('output.csv', index=False)
3. Basic DataFrame Operations
• Inspecting Data:
print(df.head())  # First five rows
print(df.info())  # Column types and non-null counts
• Selecting Columns:
ages = df['Age']  # Select the "Age" column
• Filtering Rows:
print(df[df['Age'] > 25])  # Rows where Age is greater than 25
• Checking for Missing Values:
print(df.isnull())
• Grouping and Aggregation:
grouped = df.groupby('City')['Age'].mean()
print(grouped)
Pandas is a powerful data manipulation library in Python, primarily used for data cleaning,
transformation, and analysis. It provides two main data structures:
• Series: a one-dimensional labeled array.
• DataFrame: a two-dimensional labeled table of rows and columns.
1. Importing Pandas
import pandas as pd
2. Creating a DataFrame
You can create a DataFrame from a dictionary, list, or by reading data from a file (e.g., CSV).
From a Dictionary:
data = {
    'Name': ['A', 'B', 'C'],
    'Age': [25, 30, 35],
    'City': ['Delhi', 'Mumbai', 'Delhi']
}
df = pd.DataFrame(data)
print(df)
From a CSV File:
df = pd.read_csv('data.csv')
3. Accessing Data
You can access columns, rows, or individual values within a DataFrame.
Accessing Columns:
print(df['Name'])
print(df[['Name', 'Age']])
Accessing Rows:
print(df.iloc[0])  # First row, selected by position
print(df.loc[0])   # Row with index label 0
4. Adding and Removing Columns
df['Salary'] = [50000, 60000, 70000]  # Add a new column
print(df)
df = df.drop(columns=['Salary'])  # Remove the column again
print(df)
5. Filtering Rows
print(df[df['Age'] > 25])  # Rows where Age is greater than 25
6. Grouping and Aggregation
age_by_city = df.groupby('City')['Age'].mean()
print(age_by_city)
7. Saving Data
You can save your DataFrame to various file formats.
# Save to CSV
df.to_csv('output.csv', index=False)
# Save to Excel
df.to_excel('output.xlsx', index=False)
Matplotlib
Matplotlib is a popular Python library for data visualization, allowing you to create static,
interactive, and animated plots. It's commonly used with Pandas for plotting DataFrames and
provides a range of chart types like line charts, bar charts, scatter plots, histograms, and more.
2. Line Plot
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
# Create a line plot
plt.plot(x, y, label='Line')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Line Plot Example')
plt.legend()
plt.show()
3. Bar Plot
# Sample data
categories = ['A', 'B', 'C', 'D']
values = [5, 7, 3, 9]
# Create a bar plot
plt.bar(categories, values, color='skyblue')
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Bar Plot Example')
plt.show()
4. Scatter Plot
# Sample data
x = [1, 2, 3, 4, 5]
y = [5, 4, 3, 2, 1]
# Create a scatter plot
plt.scatter(x, y, color='purple')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()
5. Histogram
# Sample data
import numpy as np
data = np.random.randn(1000)  # 1000 samples from a standard normal
# Create a histogram
plt.hist(data, bins=30, edgecolor='black')
plt.xlabel('Data values')
plt.ylabel('Frequency')
plt.title('Histogram Example')
plt.show()
SciPy
SciPy is a scientific computing library in Python that builds on NumPy, providing advanced
mathematical functions for optimization, integration, interpolation, eigenvalue problems,
statistics, and more. It's particularly useful for data science, machine learning, and engineering
applications.
1. Getting Started
pip install scipy
import numpy as np
2. Optimization
SciPy’s optimize module provides functions for minimizing (or maximizing) objective functions,
as well as root-finding algorithms.
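For example, a minimal sketch that minimizes the convex function f(x) = (x - 3)^2 + 1, whose minimum sits at x = 3 (the starting point is arbitrary):

```python
from scipy.optimize import minimize

# Minimize f(x) = (x - 3)^2 + 1; the minimum is f(3) = 1
def f(x):
    return (x[0] - 3) ** 2 + 1

result = minimize(f, x0=[0.0])  # the starting point is arbitrary
print(result.x)    # approximately [3.]
print(result.fun)  # approximately 1.0
```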
3. Integration
The integrate module provides functions for integrating functions, including definite and
indefinite integrals.
Example: Definite Integral
from scipy.integrate import quad
def f(x):
    return x**2
result, error = quad(f, 0, 1)  # quad returns both the result and an estimate of the error
print("Integral result:", result)  # 0.3333... (the exact value is 1/3)
4. Interpolation
SciPy’s interpolate module can be used for interpolating between data points, which is helpful
for filling in missing data or creating smooth curves.
Suppose we have data points (x, y) = {(0, 0), (1, 2), (2, 3)}. We want to find the interpolated value at x = 1.5.
from scipy.interpolate import interp1d
x = np.array([0, 1, 2])
y = np.array([0, 2, 3])
f = interp1d(x, y, kind='linear')
print(f(1.5))  # 2.5, halfway between (1, 2) and (2, 3)
5. Statistics
SciPy’s stats module provides a wide range of statistical functions, including probability
distributions, statistical tests, and summary statistics.
Let's generate some data and calculate summary statistics, as well as work with a normal
distribution.
data = np.random.normal(loc=0, scale=1, size=1000)  # 1000 samples from N(0, 1)
mean = np.mean(data)
std_dev = np.std(data)
print("Mean:", mean)
print("Standard deviation:", std_dev)
6. Linear Algebra
NumPy and SciPy can solve systems of linear equations. Consider the system:
2x + y = 5
x + 3y = 7
We can represent this system as Ax = b, where:
• A = [[2, 1], [1, 3]]
• b = [5, 7]
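This system can then be solved directly with NumPy's linear-algebra routines:

```python
import numpy as np

# Solve Ax = b for the system 2x + y = 5, x + 3y = 7
A = np.array([[2, 1],
              [1, 3]])
b = np.array([5, 7])

solution = np.linalg.solve(A, b)
print(solution)  # [1.6 1.8], i.e. x = 1.6, y = 1.8
```

Substituting back: 2(1.6) + 1.8 = 5 and 1.6 + 3(1.8) = 7, so the solution checks out.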
Data Processing
Data processing involves several steps to transform raw data into a usable format, typically for
analysis, modeling, or visualization. The key stages include data collection, cleaning,
transformation, and aggregation. Libraries like Pandas, NumPy, SciPy, and Scikit-Learn in
Python offer robust tools to make these tasks efficient and manageable.
1. Data Collection
This involves gathering data from various sources, such as CSV files, databases, or APIs.
import pandas as pd
# Load data from a CSV file
df = pd.read_csv('data.csv')
print(df.head()) # Display the first 5 rows of the dataset
2. Data Cleaning
Data cleaning includes handling missing values, removing duplicates, and correcting data types
to ensure consistency and accuracy.
# Check for missing values in each column
print(df.isnull().sum())
# Fill missing values with mean for numerical columns
df['column_name'].fillna(df['column_name'].mean(), inplace=True)
# Alternatively, drop rows that still contain missing values
df.dropna(inplace=True)
# Remove duplicate rows
df.drop_duplicates(inplace=True)
3. Data Transformation
This step involves modifying the data structure, scaling features, encoding categorical data, or
creating new features.
# One-hot encode a categorical column
df = pd.get_dummies(df, columns=['category_column'])
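Scaling numeric features is another common transformation; a small sketch using scikit-learn's StandardScaler (the column names here are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric columns; names are illustrative
df = pd.DataFrame({'height': [150, 160, 170, 180],
                   'weight': [50, 60, 70, 80]})

scaler = StandardScaler()          # rescale each column to mean 0, std 1
scaled = scaler.fit_transform(df)  # returns a NumPy array
print(scaled.mean(axis=0))         # approximately [0. 0.]
```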
4. Data Aggregation
Aggregation summarizes data by groups to reveal higher-level patterns.
# Group rows and compute the mean per group
grouped = df.groupby('group_column')['numeric_column'].mean()
print(grouped)
# Multiple aggregations
summary = df.groupby('group_column').agg({
'numeric_column': ['mean', 'sum', 'max'],
'another_column': 'count'
})
print(summary)
5. Data Filtering and Selection
This involves extracting a subset of data that meets specific conditions, which is useful for
isolating relevant data or working on a smaller dataset.
Example: Filtering Data
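A minimal sketch using a boolean mask (column names and values are illustrative):

```python
import pandas as pd

# Illustrative data; column names and values are assumptions
df = pd.DataFrame({'Name': ['A', 'B', 'C'],
                   'Age': [25, 35, 45]})

# Boolean mask: keep only rows where Age exceeds 30
filtered = df[df['Age'] > 30]
print(filtered['Name'].tolist())  # ['B', 'C']
```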
6. Data Integration
Combining data from multiple sources, like merging or joining tables, is common in data
processing.
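A small sketch of an inner join with pd.merge (the tables and key column are illustrative):

```python
import pandas as pd

# Two hypothetical tables that share an 'id' key column
customers = pd.DataFrame({'id': [1, 2, 3], 'name': ['A', 'B', 'C']})
orders = pd.DataFrame({'id': [1, 2], 'amount': [100, 250]})

# An inner join keeps only the ids present in both tables
merged = pd.merge(customers, orders, on='id', how='inner')
print(merged.shape)  # (2, 3)
```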
7. Data Exporting
After processing, the cleaned and transformed data can be saved for analysis or modeling.
Example: Exporting Data to CSV
df.to_csv('output.csv', index=False)
If you're processing similar datasets frequently, creating functions can make this more efficient.
from sklearn.preprocessing import StandardScaler
def process_data(df):
    # Fill missing values with column means (numeric columns only)
    df.fillna(df.mean(numeric_only=True), inplace=True)
    # Scale features to zero mean and unit variance
    scaler = StandardScaler()
    df[df.columns] = scaler.fit_transform(df)
    return df
processed_df = process_data(df)
Data visualization
Data visualization is the graphical representation of data, helping us see patterns, trends, and
outliers that may not be immediately apparent in raw data. Libraries like Matplotlib, Seaborn,
and Plotly are popular in Python for creating visualizations.
Here are some common data visualization techniques along with examples.
1. Line Plot
Line plots are great for showing trends over time or continuous data.
import matplotlib.pyplot as plt
import numpy as np
# Data
x = np.linspace(0, 10, 100)
y = np.sin(x)
# Create a line plot
plt.plot(x, y, label='sin(x)')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Line Plot Example')
plt.legend()
plt.show()
2. Bar Plot
Bar plots compare values across discrete categories.
# Data
categories = ['A', 'B', 'C', 'D']
values = [5, 7, 3, 9]
plt.bar(categories, values, color='skyblue')
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Bar Plot Example')
plt.show()
3. Histogram
Histograms are used to show the distribution of a dataset, which helps to understand the
frequency of different ranges of data.
# Generate data
data = np.random.randn(1000)
# Create a histogram
plt.hist(data, bins=30, edgecolor='black')
plt.xlabel('Data values')
plt.ylabel('Frequency')
plt.title('Histogram Example')
plt.show()
4. Scatter Plot
Scatter plots are great for visualizing the relationship between two variables.
Example: Scatter Plot with Matplotlib
import numpy as np
import matplotlib.pyplot as plt
# Data
x = np.random.rand(50)
y = np.random.rand(50)
# Create a scatter plot
plt.scatter(x, y, color='purple')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()
5. Box Plot
Box plots display the distribution of data based on five summary statistics: minimum, first
quartile, median, third quartile, and maximum. They’re helpful for identifying outliers and
understanding the spread of the data.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Generate data
data = np.random.randn(100)
# Create a box plot
sns.boxplot(data=data, color='lightblue')
plt.title('Box Plot Example')
plt.show()
6. Heatmap
Heatmaps display data in a matrix format where values are represented by color. They’re often
used for correlation matrices and showing the density of data points.
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
# Correlation matrix of random data
data = np.random.rand(10, 5)
corr = np.corrcoef(data, rowvar=False)
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Heatmap Example')
plt.show()
7. Pair Plot
A pair plot (or scatter plot matrix) visualizes the pairwise relationships between features in a
dataset, making it useful for identifying correlations and patterns.
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Load the built-in iris dataset
data = sns.load_dataset("iris")
# Create pair plot
sns.pairplot(data, hue="species")
plt.show()
8. Pie Chart
Pie charts are good for showing parts of a whole but are generally recommended only when you
have a small number of categories.
Example: Pie Chart with Matplotlib
import matplotlib.pyplot as plt
# Data
labels = ['A', 'B', 'C', 'D']
sizes = [30, 25, 25, 20]
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.title('Pie Chart Example')
plt.show()
9. Interactive Plots with Plotly
Plotly produces interactive charts with zooming and hover tooltips.
import plotly.express as px
import pandas as pd
# Sample data
df = pd.DataFrame({
    'x': range(10),
    'y': [v ** 2 for v in range(10)]
})
fig = px.line(df, x='x', y='y', title='Interactive Line Plot')
fig.show()
10. Advanced Visualizations with Seaborn and Matplotlib
import seaborn as sns
import matplotlib.pyplot as plt
# Load sample data
tips = sns.load_dataset("tips")
# Create a violin plot with customizations
sns.violinplot(x='day', y='total_bill', data=tips, palette='muted')
plt.title('Violin Plot of Total Bill by Day')
plt.show()
11. Saving Plots
Use plt.savefig() to write a figure to a file.
plt.title("Example Plot")
plt.savefig('plot.png', dpi=300)  # Call savefig before plt.show()
plt.show()
1. Matplotlib (Python)
• Description: A foundational Python library for creating static, animated, and interactive
visualizations. It’s highly customizable but may require additional setup to produce
polished graphics.
• Best For: Simple plots like line charts, bar charts, and scatter plots; often used in
combination with other Python libraries.
• Pros: Extremely flexible, supports customizations, integrates well with other Python
libraries.
• Cons: Steeper learning curve, less attractive default visuals.
• Example:
import matplotlib.pyplot as plt
plt.plot([1, 2, 3], [4, 5, 6])
plt.title("Example Plot")
plt.show()
2. Seaborn (Python)
• Description: Built on top of Matplotlib, Seaborn provides an easier interface and creates
aesthetically pleasing statistical visualizations with minimal code.
• Best For: Statistical plots, including box plots, violin plots, pair plots, and heatmaps.
• Pros: Beautiful and complex visualizations with minimal code, excellent for statistical
data.
• Example:
import seaborn as sns
tips = sns.load_dataset("tips")
sns.boxplot(x='day', y='total_bill', data=tips)
3. Plotly (Python)
• Description: A Python library for interactive, publication-quality charts that render in the browser.
• Cons: Larger file sizes for interactive plots, higher learning curve for advanced
customization.
• Example:
import plotly.express as px
df = px.data.iris()
fig = px.scatter(df, x='sepal_width', y='sepal_length', color='species')
fig.show()
4. Tableau (Commercial)
• Description: A powerful BI (Business Intelligence) tool that allows users to create
interactive dashboards and complex visualizations with a drag-and-drop interface.
5. Power BI (Commercial)
• Description: A Microsoft product for business analytics, Power BI allows you to create
interactive reports and dashboards with a strong focus on data connectivity.
• Best For: Business reporting, integration with Microsoft products (Excel, Azure).
• Pros: Intuitive interface, strong data transformation tools, integrates with Microsoft
Office suite.
• Use Case: Great for enterprise settings where users are already within a Microsoft
ecosystem.
6. D3.js (JavaScript)
• Description: A JavaScript library that allows you to create complex, customized
visualizations for the web.
• Best For: Custom web-based visualizations that require full control over layout and
interactivity.
• Example: Interactive charts on websites, especially those that require unique layouts or
transitions.
7. ggplot2 (R)
• Description: An R package implementing the "grammar of graphics," where plots are built by layering components.
• Best For: Statistical and exploratory data analysis, especially for users comfortable with
R.
• Pros: Concise syntax, high-quality statistical visualizations, integrates well within the R
ecosystem.
• Cons: Limited to R, and may require extensions for interactivity.
• Example:
library(ggplot2)
ggplot(data = iris, aes(x = Sepal.Width, y = Sepal.Length, color = Species)) +
geom_point()
8. Excel (Spreadsheet Software)
• Description: The most widely used spreadsheet tool, Microsoft Excel includes basic
charting and visualization features, suitable for simple, quick visualizations.
• Best For: Simple and quick visualizations, especially in business settings where Excel is
already in use.
• Pros: User-friendly, widely accessible, strong for basic plots and charts.
• Use Case: Basic charts, quick insights, or ad hoc data analysis in business environments.
9. Google Data Studio / Looker Studio (Web-Based)
• Description: A free Google tool for building shareable, web-based dashboards.
• Use Case: Dashboarding for Google Analytics, ad campaigns, or any web-based data
source.
10. Highcharts (JavaScript)
• Description: A commercial JavaScript charting library for creating interactive charts for
web applications.
11. Bokeh (Python)
• Pros: Highly interactive, integrates well with Jupyter Notebooks and web apps.
• Cons: Limited 3D plotting capabilities, some chart types need additional coding.
Each tool has its strengths depending on the complexity of the visualization, interactivity needs,
and your familiarity with programming. For complex custom visualizations, D3.js or Bokeh are
powerful options, while Tableau and Power BI are excellent for business-focused dashboards
and quick insights.
Specialized data visualization tools are designed for specific fields or types of data, offering
tailored capabilities beyond general-purpose visualization. Here are some specialized tools and
libraries used across fields like geospatial analysis, network visualization, genomic data, and
time series analysis.
1. Geospatial Visualization
Tools and libraries designed to visualize geographic data, such as maps and spatial relationships.
• Leaflet (JavaScript/Python)
o Description: A lightweight library for interactive web maps; the Folium package exposes it from Python.
o Best For: Creating custom interactive maps that can display geospatial data (e.g.,
markers, polygons).
o Use Case: Displaying data on maps, adding interactive features like zooming and
popups.
o Example:
import folium
# Create an interactive map centered on given coordinates
m = folium.Map(location=[28.6, 77.2], zoom_start=4)
m.save('map.html')  # Writes an interactive HTML map
• GeoPandas (Python)
o Description: Extends Pandas DataFrames with geometry columns for working with spatial data.
o Use Case: Visualizing geographic boundaries and performing spatial joins and
calculations.
o Example:
import geopandas as gpd
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
world.plot()
• Kepler.gl (Web-Based)
o Best For: Fast rendering of large geospatial datasets with 3D map visualizations.
o Use Case: Data with millions of records, 3D maps for urban planning, logistics,
or fleet management.
2. Network Visualization
Tools and libraries for visualizing relationships, social networks, or any type of graph or network
structure.
• Gephi (Desktop)
o Description: An open-source tool for network visualization and analysis.
o Best For: Social network analysis, graph theory research, and visualizing
relationships.
o Use Case: Social network diagrams, displaying connections between nodes.
• NetworkX (Python)
o Description: A Python library for the creation, manipulation, and study of
complex networks of nodes and edges.
o Best For: Building network graphs with Python, especially for analysis rather
than visualization.
o Example:
import networkx as nx
import matplotlib.pyplot as plt
# Build a small graph with a few edges
G = nx.Graph()
G.add_edges_from([(1, 2), (2, 3), (3, 1)])
nx.draw(G, with_labels=True)
plt.show()
• Cytoscape (Desktop)
o Description: Software for visualizing molecular interaction networks and
biological pathways.
o Best For: Biomedical research, but also used for any type of network data
visualization.
3. Genomic and Biological Data Visualization
Specialized for representing biological sequences, genomic structures, and molecular pathways.
• Bioconductor (R)
o Description: A collection of R packages for analyzing and visualizing biological
data.
4. Time Series and Financial Data Visualization
Tools and libraries focused on analyzing and visualizing sequential and financial data.
• Plotly (Python)
o Best For: Interactive time series and financial charts, such as candlestick plots.
o Example (df is assumed to hold Date, Open, High, Low, and Close columns):
import plotly.graph_objects as go
fig = go.Figure(data=[go.Candlestick(x=df['Date'],
open=df['Open'], high=df['High'],
low=df['Low'], close=df['Close'])])
fig.show()
• TA-Lib (Python)
o Best For: Calculating financial indicators like moving averages, RSI, and
Bollinger Bands.
5. Real-Time and Dashboard Visualization
• Grafana (Web-Based)
o Best For: Real-time dashboards for server metrics, IoT data, or live analytics.
o Use Case: System monitoring, DevOps dashboards, IoT data visualization in real
time.
• Redash (Web-Based)
o Best For: Lightweight dashboards, especially for web apps with SQL-based data
sources.
o Use Case: Business analytics with large datasets, visualizations that require SQL
queries.
6. Text and NLP Visualization
These tools visualize word embeddings, word relationships, and document similarities.
• WordCloud (Python)
o Description: A Python package for generating word clouds from text.
o Use Case: Analyzing word frequency, creating word clouds from document data.
o Example:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
text = "data science python visualization data analysis"  # Sample text
wordcloud = WordCloud().generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
• Embedding Projector (TensorBoard, Web-Based)
o Use Case: Inspecting how words are related in vector space, identifying clusters
or similar words.
Seaborn doesn’t directly support plotting maps like some dedicated geospatial libraries (e.g.,
Folium, GeoPandas, or Basemap). However, you can use Seaborn along with Matplotlib and
GeoPandas to visualize maps in a way that’s aesthetically consistent with Seaborn’s style.
Here's how you can create and plot maps using Seaborn in combination with GeoPandas and
Matplotlib:
GeoPandas provides built-in sample datasets like world maps, which you can load and plot.
Seaborn can be used to enhance the aesthetics of the plot.
import seaborn as sns
import matplotlib.pyplot as plt
import geopandas as gpd
# Apply Seaborn styling, then plot the built-in world dataset
sns.set(style="whitegrid")
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
plt.figure(figsize=(12, 8))
ax = plt.gca()
world.plot(ax=ax, color='lightgray', edgecolor='white')
plt.title("World Map", fontsize=16)
plt.show()
Example: Mapping Data with Seaborn Colors
If you want to add data on top of your map (e.g., visualizing population by country), you can
color the map using Seaborn color palettes.
# Use Seaborn color palette for map coloring based on a data column
plt.figure(figsize=(12, 8))
ax = plt.gca()
world.plot(column='pop_est', ax=ax, cmap='viridis', legend=True,
           legend_kwds={'label': "Population Estimate",
                        'orientation': "horizontal"})
sns.despine(left=True, bottom=True)
plt.show()
In this example, the map is colored by the 'pop_est' column using the viridis colormap, and the legend is drawn horizontally below the map.
To highlight specific countries on a map with Seaborn colors, you can filter and style the
GeoDataFrame accordingly.
plt.figure(figsize=(12, 8))
ax = plt.gca()
# Base map in a neutral color, then the highlighted countries on top
world.plot(ax=ax, color='lightgray', edgecolor='white')
highlighted = world[world['name'].isin(['Japan', 'Canada'])]
highlighted.plot(ax=ax, color=sns.color_palette("deep")[2])
sns.despine(left=True, bottom=True)
plt.title("Highlighted Countries: Japan and Canada", fontsize=16)
plt.show()