unit-3(FODS)
NumPy is a fundamental package in Python for scientific computing, providing support for
arrays, matrices, and a large collection of mathematical functions to operate on these data
structures. It's especially useful for numerical data manipulation and is commonly used in fields
like data science, machine learning, engineering, and physics due to its efficiency and simplicity
in handling large datasets.
Key Features of NumPy:
• N-dimensional array object (ndarray) stored in contiguous memory
• Fast, vectorized element-wise operations without explicit Python loops
• Broadcasting rules for combining arrays of different shapes
• Built-in routines for linear algebra, statistics, and random number generation
Basic Usage
To start using NumPy, you need to import it:
import numpy as np
1. Creating Arrays
• 1D array:
arr = np.array([1, 2, 3])
• 2D array (Matrix):
matrix = np.array([[1, 2], [3, 4]])
2. Array Operations
• Arithmetic operations:
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
sum_arr = arr1 + arr2  # Element-wise addition: [5, 7, 9]
• Mathematical functions:
sin_values = np.sin(arr)  # Applies sine to each element in arr
• Matrix multiplication:
product = matrix @ matrix  # Or: np.dot(matrix, matrix)
• Reshaping:
reshaped = np.arange(6).reshape(2, 3)  # Six elements become a 2x3 matrix
• Slicing:
sub_arr = arr[0:2]  # First two elements
• Statistical functions:
mean_val = np.mean(arr)  # Also np.sum, np.max, np.std, and more
• Logical Operations:
bool_arr = arr > 2  # Returns an array of True/False for each condition
NumPy is efficient because it uses contiguous memory blocks (similar to C arrays), so it can
perform operations much faster than Python lists. The concise syntax also makes it easier to read,
write, and maintain code for handling numerical data.
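As a small illustration of this speed and concision (exact timings vary by machine), compare squaring ten thousand numbers with a list comprehension versus a single vectorized NumPy expression:

```python
import numpy as np

# Square ten thousand numbers two ways.
# The list comprehension loops in the Python interpreter;
# the NumPy expression runs as one vectorized operation in C.
nums = list(range(10_000))
squares_list = [n * n for n in nums]

arr = np.arange(10_000)
squares_arr = arr ** 2  # no explicit Python loop

# Both give identical values; the NumPy version is typically far faster.
print(bool((squares_arr == np.array(squares_list)).all()))  # True
```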
Pandas
Pandas is a popular Python library for data manipulation and analysis, built on top of NumPy. It
provides data structures and functions to efficiently handle large datasets, making it ideal for data
cleaning, preparation, and exploration. The main data structures in Pandas are Series and
DataFrame, which are well-suited for structured data.
Key Features of Pandas:
1. Series and DataFrame: Labeled one- and two-dimensional data structures.
2. Handling Missing Data: Detect, fill, or drop missing values easily.
3. Grouping and Aggregation: Summarize data by groups with groupby.
4. Data Import and Export: Read from and write to various formats like CSV, Excel,
SQL, and more.
5. Powerful Indexing: Allows you to filter, select, and manipulate subsets of data easily.
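A brief sketch of this kind of selection and filtering (the data values and column names are illustrative):

```python
import pandas as pd

# Illustrative data; the column names are assumptions for this sketch
df = pd.DataFrame({'Name': ['A', 'B', 'C'], 'Age': [25, 30, 35]})

first_row = df.loc[0]    # select by index label
last_row = df.iloc[-1]   # select by integer position

# Boolean indexing keeps only rows matching a condition
over_28 = df[df['Age'] > 28]
print(over_28['Name'].tolist())  # ['B', 'C']
```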
Installation
pip install pandas
Then import it:
import pandas as pd
1. Creating a DataFrame
A DataFrame is like a table (rows and columns). You can create one manually using a dictionary
or import data from a file.
data = {
    'Name': ['A', 'B', 'C'],
    'Age': [25, 30, 35],
    'City': ['Delhi', 'Mumbai', 'Delhi']
}
df = pd.DataFrame(data)
print(df)
2. Reading and Writing Files
df = pd.read_csv('data.csv')
df.to_csv('output.csv', index=False)
3. Basic DataFrame Operations
• Inspecting Data:
print(df.head())  # First five rows
print(df.info())  # Column types and non-null counts
• Selecting Columns:
ages = df['Age']  # Select the "Age" column
• Filtering Rows:
print(df[df['Age'] > 25])  # Rows where Age is greater than 25
• Checking for Missing Values:
print(df.isnull())
• Grouping and Aggregation:
grouped = df.groupby('City')['Age'].mean()
print(grouped)
Pandas is a powerful data manipulation library in Python, primarily used for data cleaning,
transformation, and analysis. It provides two main data structures:
• Series: a one-dimensional labeled array.
• DataFrame: a two-dimensional labeled table of rows and columns.
1. Importing Pandas
import pandas as pd
2. Creating a DataFrame
You can create a DataFrame from a dictionary, list, or by reading data from a file (e.g., CSV).
From a Dictionary:
data = {
    'Name': ['A', 'B', 'C'],
    'Age': [25, 30, 35],
    'City': ['Delhi', 'Mumbai', 'Delhi']
}
df = pd.DataFrame(data)
print(df)
From a CSV File:
df = pd.read_csv('data.csv')
3. Accessing Data
You can access columns, rows, or individual values within a DataFrame.
Accessing Columns:
print(df['Name'])
print(df[['Name', 'Age']])
Accessing Rows:
print(df.iloc[0])  # First row, selected by position
print(df.loc[0])   # Row with index label 0
4. Adding and Removing Columns
df['Salary'] = [50000, 60000, 70000]  # Add a new column
print(df)
df = df.drop(columns=['Salary'])  # Remove the column again
print(df)
5. Filtering Rows
print(df[df['Age'] > 25])  # Rows where Age is greater than 25
6. Grouping and Aggregation
age_by_city = df.groupby('City')['Age'].mean()
print(age_by_city)
7. Saving Data
You can save your DataFrame to various file formats.
# Save to CSV
df.to_csv('output.csv', index=False)
# Save to Excel
df.to_excel('output.xlsx', index=False)
Matplotlib
Matplotlib is a popular Python library for data visualization, allowing you to create static,
interactive, and animated plots. It's commonly used with Pandas for plotting DataFrames and
provides a range of chart types like line charts, bar charts, scatter plots, histograms, and more.
2. Line Plot
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
# Create a line plot
plt.plot(x, y, label='Line')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Line Plot Example')
plt.legend()
plt.show()
3. Bar Plot
# Sample data
categories = ['A', 'B', 'C', 'D']
values = [5, 7, 3, 9]
# Create a bar plot
plt.bar(categories, values, color='skyblue')
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Bar Plot Example')
plt.show()
4. Scatter Plot
# Sample data
x = [1, 2, 3, 4, 5]
y = [5, 4, 3, 2, 1]
# Create a scatter plot
plt.scatter(x, y, color='purple')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()
5. Histogram
# Sample data
import numpy as np
data = np.random.randn(1000)  # 1000 samples from a standard normal
# Create a histogram
plt.hist(data, bins=30, edgecolor='black')
plt.xlabel('Data values')
plt.ylabel('Frequency')
plt.title('Histogram Example')
plt.show()
SciPy
SciPy is a scientific computing library in Python that builds on NumPy, providing advanced
mathematical functions for optimization, integration, interpolation, eigenvalue problems,
statistics, and more. It's particularly useful for data science, machine learning, and engineering
applications.
1. Getting Started
pip install scipy
import numpy as np
2. Optimization
SciPy’s optimize module provides functions for minimizing (or maximizing) objective functions,
as well as root-finding algorithms.
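For example, a minimal sketch that minimizes the convex function f(x) = (x - 3)^2 + 1, whose minimum sits at x = 3 (the starting point is arbitrary):

```python
from scipy.optimize import minimize

# Minimize f(x) = (x - 3)^2 + 1; the minimum is f(3) = 1
def f(x):
    return (x[0] - 3) ** 2 + 1

result = minimize(f, x0=[0.0])  # the starting point is arbitrary
print(result.x)    # approximately [3.]
print(result.fun)  # approximately 1.0
```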
3. Integration
The integrate module provides functions for integrating functions, including definite and
indefinite integrals.
Example: Definite Integral
from scipy.integrate import quad
def f(x):
    return x**2
result, error = quad(f, 0, 1)  # quad returns both the result and an estimate of the error
print("Integral result:", result)  # 0.3333... (the exact value is 1/3)
4. Interpolation
SciPy’s interpolate module can be used for interpolating between data points, which is helpful
for filling in missing data or creating smooth curves.
Suppose we have data points (x, y) = {(0, 0), (1, 2), (2, 3)}. We want to find the interpolated value at x = 1.5.
from scipy.interpolate import interp1d
x = np.array([0, 1, 2])
y = np.array([0, 2, 3])
f = interp1d(x, y, kind='linear')
print(f(1.5))  # 2.5, halfway between (1, 2) and (2, 3)
5. Statistics
SciPy’s stats module provides a wide range of statistical functions, including probability
distributions, statistical tests, and summary statistics.
Let's generate some data and calculate summary statistics, as well as work with a normal
distribution.
data = np.random.normal(loc=0, scale=1, size=1000)  # 1000 samples from N(0, 1)
mean = np.mean(data)
std_dev = np.std(data)
print("Mean:", mean)
print("Standard deviation:", std_dev)
6. Linear Algebra
NumPy and SciPy can solve systems of linear equations. Consider the system:
2x + y = 5
x + 3y = 7
We can represent this system as Ax = b, where:
• A = [[2, 1], [1, 3]]
• b = [5, 7]
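This system can then be solved directly with NumPy's linear-algebra routines:

```python
import numpy as np

# Solve Ax = b for the system 2x + y = 5, x + 3y = 7
A = np.array([[2, 1],
              [1, 3]])
b = np.array([5, 7])

solution = np.linalg.solve(A, b)
print(solution)  # [1.6 1.8], i.e. x = 1.6, y = 1.8
```

Substituting back: 2(1.6) + 1.8 = 5 and 1.6 + 3(1.8) = 7, so the solution checks out.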
Data Processing
Data processing involves several steps to transform raw data into a usable format, typically for
analysis, modeling, or visualization. The key stages include data collection, cleaning,
transformation, and aggregation. Libraries like Pandas, NumPy, SciPy, and Scikit-Learn in
Python offer robust tools to make these tasks efficient and manageable.
1. Data Collection
This involves gathering data from various sources, such as CSV files, databases, or APIs.
import pandas as pd
# Load data from a CSV file
df = pd.read_csv('data.csv')
print(df.head()) # Display the first 5 rows of the dataset
2. Data Cleaning
Data cleaning includes handling missing values, removing duplicates, and correcting data types
to ensure consistency and accuracy.
# Check for missing values in each column
print(df.isnull().sum())
# Fill missing values with mean for numerical columns
df['column_name'].fillna(df['column_name'].mean(), inplace=True)
# Alternatively, drop rows that still contain missing values
df.dropna(inplace=True)
# Remove duplicate rows
df.drop_duplicates(inplace=True)
3. Data Transformation
This step involves modifying the data structure, scaling features, encoding categorical data, or
creating new features.
# One-hot encode a categorical column
df = pd.get_dummies(df, columns=['category_column'])
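Scaling numeric features is another common transformation; a small sketch using scikit-learn's StandardScaler (the column names here are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric columns; names are illustrative
df = pd.DataFrame({'height': [150, 160, 170, 180],
                   'weight': [50, 60, 70, 80]})

scaler = StandardScaler()          # rescale each column to mean 0, std 1
scaled = scaler.fit_transform(df)  # returns a NumPy array
print(scaled.mean(axis=0))         # approximately [0. 0.]
```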
4. Data Aggregation
Aggregation summarizes data by groups to reveal higher-level patterns.
# Group rows and compute the mean per group
grouped = df.groupby('group_column')['numeric_column'].mean()
print(grouped)
# Multiple aggregations
summary = df.groupby('group_column').agg({
'numeric_column': ['mean', 'sum', 'max'],
'another_column': 'count'
})
print(summary)
5. Data Filtering and Selection
This involves extracting a subset of data that meets specific conditions, which is useful for
isolating relevant data or working on a smaller dataset.
Example: Filtering Data
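A minimal sketch using a boolean mask (column names and values are illustrative):

```python
import pandas as pd

# Illustrative data; column names and values are assumptions
df = pd.DataFrame({'Name': ['A', 'B', 'C'],
                   'Age': [25, 35, 45]})

# Boolean mask: keep only rows where Age exceeds 30
filtered = df[df['Age'] > 30]
print(filtered['Name'].tolist())  # ['B', 'C']
```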
6. Data Integration
Combining data from multiple sources, like merging or joining tables, is common in data
processing.
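A small sketch of an inner join with pd.merge (the tables and key column are illustrative):

```python
import pandas as pd

# Two hypothetical tables that share an 'id' key column
customers = pd.DataFrame({'id': [1, 2, 3], 'name': ['A', 'B', 'C']})
orders = pd.DataFrame({'id': [1, 2], 'amount': [100, 250]})

# An inner join keeps only the ids present in both tables
merged = pd.merge(customers, orders, on='id', how='inner')
print(merged.shape)  # (2, 3)
```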
7. Data Exporting
After processing, the cleaned and transformed data can be saved for analysis or modeling.
Example: Exporting Data to CSV
df.to_csv('output.csv', index=False)
If you're processing similar datasets frequently, creating functions can make this more efficient.
from sklearn.preprocessing import StandardScaler
def process_data(df):
    # Fill missing values with column means (numeric columns only)
    df.fillna(df.mean(numeric_only=True), inplace=True)
    # Scale features to zero mean and unit variance
    scaler = StandardScaler()
    df[df.columns] = scaler.fit_transform(df)
    return df
processed_df = process_data(df)
Data visualization
Data visualization is the graphical representation of data, helping us see patterns, trends, and
outliers that may not be immediately apparent in raw data. Libraries like Matplotlib, Seaborn,
and Plotly are popular in Python for creating visualizations.
Here are some common data visualization techniques along with examples.
1. Line Plot
Line plots are great for showing trends over time or continuous data.
import matplotlib.pyplot as plt
import numpy as np
# Data
x = np.linspace(0, 10, 100)
y = np.sin(x)
# Create a line plot
plt.plot(x, y, label='sin(x)')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Line Plot Example')
plt.legend()
plt.show()
2. Bar Plot
Bar plots compare values across discrete categories.
# Data
categories = ['A', 'B', 'C', 'D']
values = [5, 7, 3, 9]
plt.bar(categories, values, color='skyblue')
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Bar Plot Example')
plt.show()
3. Histogram
Histograms are used to show the distribution of a dataset, which helps to understand the
frequency of different ranges of data.
# Generate data
data = np.random.randn(1000)
# Create a histogram
plt.hist(data, bins=30, edgecolor='black')
plt.xlabel('Data values')
plt.ylabel('Frequency')
plt.title('Histogram Example')
plt.show()
4. Scatter Plot
Scatter plots are great for visualizing the relationship between two variables.
Example: Scatter Plot with Matplotlib
import numpy as np
import matplotlib.pyplot as plt
# Data
x = np.random.rand(50)
y = np.random.rand(50)
# Create a scatter plot
plt.scatter(x, y, color='purple')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()
5. Box Plot
Box plots display the distribution of data based on five summary statistics: minimum, first
quartile, median, third quartile, and maximum. They’re helpful for identifying outliers and
understanding the spread of the data.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Generate data
data = np.random.randn(100)
# Create a box plot
sns.boxplot(data=data, color='lightblue')
plt.title('Box Plot Example')
plt.show()
6. Heatmap
Heatmaps display data in a matrix format where values are represented by color. They’re often
used for correlation matrices and showing the density of data points.
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
# Correlation matrix of random data
data = np.random.rand(10, 5)
corr = np.corrcoef(data, rowvar=False)
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Heatmap Example')
plt.show()
7. Pair Plot
A pair plot (or scatter plot matrix) visualizes the pairwise relationships between features in a
dataset, making it useful for identifying correlations and patterns.
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Load the built-in iris dataset
data = sns.load_dataset("iris")
# Create pair plot
sns.pairplot(data, hue="species")
plt.show()
8. Pie Chart
Pie charts are good for showing parts of a whole but are generally recommended only when you
have a small number of categories.
Example: Pie Chart with Matplotlib
import matplotlib.pyplot as plt
# Data
labels = ['A', 'B', 'C', 'D']
sizes = [30, 25, 25, 20]
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.title('Pie Chart Example')
plt.show()
9. Interactive Plots with Plotly
Plotly produces interactive charts with zooming and hover tooltips.
import plotly.express as px
import pandas as pd
# Sample data
df = pd.DataFrame({
    'x': range(10),
    'y': [v ** 2 for v in range(10)]
})
fig = px.line(df, x='x', y='y', title='Interactive Line Plot')
fig.show()
10. Advanced Visualizations with Seaborn and Matplotlib
import seaborn as sns
import matplotlib.pyplot as plt
# Load sample data
tips = sns.load_dataset("tips")
# Create a violin plot with customizations
sns.violinplot(x='day', y='total_bill', data=tips, palette='muted')
plt.title('Violin Plot of Total Bill by Day')
plt.show()
11. Saving Plots
Use plt.savefig() to write a figure to a file.
plt.title("Example Plot")
plt.savefig('plot.png', dpi=300)  # Call savefig before plt.show()
plt.show()
1. Matplotlib (Python)
• Description: A foundational Python library for creating static, animated, and interactive
visualizations. It’s highly customizable but may require additional setup to produce
polished graphics.
• Best For: Simple plots like line charts, bar charts, and scatter plots; often used in
combination with other Python libraries.
• Pros: Extremely flexible, supports customizations, integrates well with other Python
libraries.
• Cons: Steeper learning curve, less attractive default visuals.
• Example:
import matplotlib.pyplot as plt
plt.plot([1, 2, 3], [4, 5, 6])
plt.title("Example Plot")
plt.show()
2. Seaborn (Python)
• Description: Built on top of Matplotlib, Seaborn provides an easier interface and creates
aesthetically pleasing statistical visualizations with minimal code.
• Best For: Statistical plots, including box plots, violin plots, pair plots, and heatmaps.
• Pros: Beautiful and complex visualizations with minimal code, excellent for statistical
data.
• Example:
import seaborn as sns
tips = sns.load_dataset("tips")
sns.boxplot(x='day', y='total_bill', data=tips)
3. Plotly (Python)
• Description: A Python library for interactive, publication-quality charts that render in the browser.
• Cons: Larger file sizes for interactive plots, higher learning curve for advanced
customization.
• Example:
import plotly.express as px
df = px.data.iris()
fig = px.scatter(df, x='sepal_width', y='sepal_length', color='species')
fig.show()
4. Tableau (Commercial)
• Description: A powerful BI (Business Intelligence) tool that allows users to create
interactive dashboards and complex visualizations with a drag-and-drop interface.
5. Power BI (Commercial)
• Description: A Microsoft product for business analytics, Power BI allows you to create
interactive reports and dashboards with a strong focus on data connectivity.
• Best For: Business reporting, integration with Microsoft products (Excel, Azure).
• Pros: Intuitive interface, strong data transformation tools, integrates with Microsoft
Office suite.
• Use Case: Great for enterprise settings where users are already within a Microsoft
ecosystem.
6. D3.js (JavaScript)
• Description: A JavaScript library that allows you to create complex, customized
visualizations for the web.
• Best For: Custom web-based visualizations that require full control over layout and
interactivity.
• Example: Interactive charts on websites, especially those that require unique layouts or
transitions.
7. ggplot2 (R)
• Description: An R package implementing the "grammar of graphics," where plots are built by layering components.
• Best For: Statistical and exploratory data analysis, especially for users comfortable with
R.
• Pros: Concise syntax, high-quality statistical visualizations, integrates well within the R
ecosystem.
• Cons: Limited to R, and may require extensions for interactivity.
• Example:
library(ggplot2)
ggplot(data = iris, aes(x = Sepal.Width, y = Sepal.Length, color = Species)) +
geom_point()
8. Excel (Spreadsheet Software)
• Description: The most widely used spreadsheet tool, Microsoft Excel includes basic
charting and visualization features, suitable for simple, quick visualizations.
• Best For: Simple and quick visualizations, especially in business settings where Excel is
already in use.
• Pros: User-friendly, widely accessible, strong for basic plots and charts.
• Use Case: Basic charts, quick insights, or ad hoc data analysis in business environments.
9. Google Data Studio / Looker Studio (Web-Based)
• Description: A free Google tool for building shareable, web-based dashboards.
• Use Case: Dashboarding for Google Analytics, ad campaigns, or any web-based data
source.
10. Highcharts (JavaScript)
• Description: A commercial JavaScript charting library for creating interactive charts for
web applications.
11. Bokeh (Python)
• Pros: Highly interactive, integrates well with Jupyter Notebooks and web apps.
• Cons: Limited 3D plotting capabilities, some chart types need additional coding.
Each tool has its strengths depending on the complexity of the visualization, interactivity needs,
and your familiarity with programming. For complex custom visualizations, D3.js or Bokeh are
powerful options, while Tableau and Power BI are excellent for business-focused dashboards
and quick insights.
Specialized data visualization tools are designed for specific fields or types of data, offering
tailored capabilities beyond general-purpose visualization. Here are some specialized tools and
libraries used across fields like geospatial analysis, network visualization, genomic data, and
time series analysis.
1. Geospatial Visualization
Tools and libraries designed to visualize geographic data, such as maps and spatial relationships.
• Leaflet (JavaScript/Python)
o Description: A lightweight library for interactive web maps; the Folium package exposes it from Python.
o Best For: Creating custom interactive maps that can display geospatial data (e.g.,
markers, polygons).
o Use Case: Displaying data on maps, adding interactive features like zooming and
popups.
o Example:
import folium
# Create an interactive map centered on given coordinates
m = folium.Map(location=[28.6, 77.2], zoom_start=4)
m.save('map.html')  # Writes an interactive HTML map
• GeoPandas (Python)
o Description: Extends Pandas DataFrames with geometry columns for working with spatial data.
o Use Case: Visualizing geographic boundaries and performing spatial joins and
calculations.
o Example:
import geopandas as gpd
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
world.plot()
• Kepler.gl (Web-Based)
o Best For: Fast rendering of large geospatial datasets with 3D map visualizations.
o Use Case: Data with millions of records, 3D maps for urban planning, logistics,
or fleet management.
2. Network Visualization
Tools and libraries for visualizing relationships, social networks, or any type of graph or network
structure.
• Gephi (Desktop)
o Description: An open-source tool for network visualization and analysis.
o Best For: Social network analysis, graph theory research, and visualizing
relationships.
o Use Case: Social network diagrams, displaying connections between nodes.
• NetworkX (Python)
o Description: A Python library for the creation, manipulation, and study of
complex networks of nodes and edges.
o Best For: Building network graphs with Python, especially for analysis rather
than visualization.
o Example:
import networkx as nx
import matplotlib.pyplot as plt
# Build a small graph with a few edges
G = nx.Graph()
G.add_edges_from([(1, 2), (2, 3), (3, 1)])
nx.draw(G, with_labels=True)
plt.show()
• Cytoscape (Desktop)
o Description: Software for visualizing molecular interaction networks and
biological pathways.
o Best For: Biomedical research, but also used for any type of network data
visualization.
3. Genomic and Biological Data Visualization
Specialized for representing biological sequences, genomic structures, and molecular pathways.
• Bioconductor (R)
o Description: A collection of R packages for analyzing and visualizing biological
data.
4. Time Series and Financial Data Visualization
Tools and libraries focused on analyzing and visualizing sequential and financial data.
• Plotly (Python)
o Best For: Interactive time series and financial charts, such as candlestick plots.
o Example (df is assumed to hold Date, Open, High, Low, and Close columns):
import plotly.graph_objects as go
fig = go.Figure(data=[go.Candlestick(x=df['Date'],
open=df['Open'], high=df['High'],
low=df['Low'], close=df['Close'])])
fig.show()
• TA-Lib (Python)
o Best For: Calculating financial indicators like moving averages, RSI, and
Bollinger Bands.
5. Real-Time and Dashboard Visualization
• Grafana (Web-Based)
o Best For: Real-time dashboards for server metrics, IoT data, or live analytics.
o Use Case: System monitoring, DevOps dashboards, IoT data visualization in real
time.
• Redash (Web-Based)
o Best For: Lightweight dashboards, especially for web apps with SQL-based data
sources.
o Use Case: Business analytics with large datasets, visualizations that require SQL
queries.
6. Text and NLP Visualization
These tools visualize word embeddings, word relationships, and document similarities.
• WordCloud (Python)
o Description: A Python package for generating word clouds from text.
o Use Case: Analyzing word frequency, creating word clouds from document data.
o Example:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
text = "data science python visualization data analysis"  # Sample text
wordcloud = WordCloud().generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
• Embedding Projector (TensorBoard, Web-Based)
o Use Case: Inspecting how words are related in vector space, identifying clusters
or similar words.
Seaborn doesn’t directly support plotting maps like some dedicated geospatial libraries (e.g.,
Folium, GeoPandas, or Basemap). However, you can use Seaborn along with Matplotlib and
GeoPandas to visualize maps in a way that’s aesthetically consistent with Seaborn’s style.
Here's how you can create and plot maps using Seaborn in combination with GeoPandas and
Matplotlib:
GeoPandas provides built-in sample datasets like world maps, which you can load and plot.
Seaborn can be used to enhance the aesthetics of the plot.
import seaborn as sns
import matplotlib.pyplot as plt
import geopandas as gpd
# Apply Seaborn styling, then plot the built-in world dataset
sns.set(style="whitegrid")
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
plt.figure(figsize=(12, 8))
ax = plt.gca()
world.plot(ax=ax, color='lightgray', edgecolor='white')
plt.title("World Map", fontsize=16)
plt.show()
Example: Mapping Data with Seaborn Colors
If you want to add data on top of your map (e.g., visualizing population by country), you can
color the map using Seaborn color palettes.
# Use Seaborn color palette for map coloring based on a data column
plt.figure(figsize=(12, 8))
ax = plt.gca()
world.plot(column='pop_est', ax=ax, cmap='viridis', legend=True,
           legend_kwds={'label': "Population Estimate",
                        'orientation': "horizontal"})
sns.despine(left=True, bottom=True)
plt.show()
In this example, the map is colored by the 'pop_est' column using the viridis colormap, and the legend is drawn horizontally below the map.
To highlight specific countries on a map with Seaborn colors, you can filter and style the
GeoDataFrame accordingly.
plt.figure(figsize=(12, 8))
ax = plt.gca()
# Base map in a neutral color, then the highlighted countries on top
world.plot(ax=ax, color='lightgray', edgecolor='white')
highlighted = world[world['name'].isin(['Japan', 'Canada'])]
highlighted.plot(ax=ax, color=sns.color_palette("deep")[2])
sns.despine(left=True, bottom=True)
plt.title("Highlighted Countries: Japan and Canada", fontsize=16)
plt.show()