Pandas CheatSheet


PANDAS

CHEATSHEET
A Beginner's Guide

@apexiq.ai
Introduction
What is Pandas?
Pandas is a free, open-source library for Python that makes it easy to work
with data. It provides two main data structures: Series (like a list) and
DataFrame (like a table or spreadsheet). With Pandas, you can easily
organize, analyze, and manipulate data.

Why use Pandas?


- User-Friendly: It has a simple and clear syntax, making it easy to learn
  and use.
- Data Handling: You can easily read and write data in different formats,
  like CSV or Excel.
- Data Manipulation: It offers powerful tools to filter, group, and reshape
  your data quickly.
- Integration: Pandas works well with other libraries like Matplotlib for
  plotting graphs and Scikit-learn for machine learning.

Installation
To install Pandas, open your terminal or command prompt and type:

pip install pandas

If you're using Anaconda, you can install it by typing:

conda install pandas

(The leading ! is only needed when running these commands from inside a
Jupyter notebook cell.)

1. Loading Data
Loading data is the first step in any data analysis workflow. Pandas provides
several functions to read data from various file formats.

Import:
Import Pandas library:
import pandas as pd

Load CSV File:


df = pd.read_csv('file.csv')

Load Excel File:


df = pd.read_excel('file.xlsx')
or
df = pd.read_excel('file.xlsx', sheet_name='Sheet1')

Load JSON File:


df = pd.read_json('file.json')
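The loaders above all expect a file on disk. As a self-contained sketch, `pd.read_csv` also accepts any file-like object, so a small in-memory CSV (hypothetical data, standing in for 'file.csv') can illustrate the call:

```python
import io
import pandas as pd

# A small CSV held in memory stands in for 'file.csv', so the example is
# self-contained; pd.read_csv accepts any file-like object.
csv_text = "name,score\nAda,90\nGrace,85\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (2, 2)
print(list(df.columns))  # ['name', 'score']
```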

2. Viewing Data
After loading the data, it’s important to inspect it to understand its structure
and content. Pandas provides several methods for this.

View First N Rows:


df.head(n=5)

View Last N Rows:


df.tail(n=5)

Random Sample of Rows:


df.sample(n=5)

Summary of DataFrame:

Display information about the DataFrame (data types, non-null counts)

df.info()

Display descriptive statistics for numerical columns (count, mean, std, min,
max)

df.describe()
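Putting the inspection methods together on a small hypothetical DataFrame (the column names and values here are made up for illustration):

```python
import pandas as pd

# Hypothetical data, used only to exercise the inspection methods above.
df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [10.0, 20.0, 30.0, 40.0, 50.0]})

first_two = df.head(2)  # first 2 rows
last_two = df.tail(2)   # last 2 rows
stats = df.describe()   # count, mean, std, min, quartiles, max per numeric column
df.info()               # prints dtypes and non-null counts to stdout
```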

3. Selecting Data
Selecting specific data from a DataFrame is crucial for analysis. Pandas allows
you to select columns and rows easily.

Select Column by Name:


Access a single column by name

df['column_name']

Access multiple columns by names (returns a DataFrame)

df[['col1', 'col2']]

Select Rows by Index:


Access the first row by integer index (position)

df.iloc[0]

Access a row by its index label with .loc (with the default integer index,
.loc[0] happens to return the same row as .iloc[0])

df.loc[0]

Select Rows with Conditions:

Filter rows based on condition (e.g., column_name > value)

filtered_df = df[df['column_name'] > value]
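The selection patterns above, combined on one small hypothetical DataFrame:

```python
import pandas as pd

# Hypothetical DataFrame to illustrate the selection patterns above.
df = pd.DataFrame({"city": ["Oslo", "Lima", "Pune"], "temp": [5, 22, 31]})

col = df["city"]             # single column -> Series
pair = df[["city", "temp"]]  # list of names -> DataFrame
first_row = df.iloc[0]       # by position
also_first = df.loc[0]       # by label (default integer index, so same row)
warm = df[df["temp"] > 20]   # boolean filtering keeps Lima and Pune
```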

4. Modifying Data
Modifying data in a DataFrame is essential for preparing your dataset for
analysis.

Add New Column:


Create a new column that is double the values of an existing column

df['new_column'] = df['existing_column'] * 2

Rename Columns:
Rename a specific column

df.rename(columns={'old_name': 'new_name'}, inplace=True)

Drop Columns:

Drop specified column(s)

df.drop(columns=['column_to_drop'], inplace=True)
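The three operations above in sequence, here using the copy-returning form instead of inplace=True (column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"price": [10, 20]})

df["double"] = df["price"] * 2             # add a derived column
df = df.rename(columns={"price": "cost"})  # rename; returns a new DataFrame
df = df.drop(columns=["double"])           # drop the derived column again
```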

5. Handling Missing Values
Dealing with missing values is crucial to ensure the integrity of your analysis.

Check for Missing Values:


Count of missing values in each column

df.isnull().sum()

Drop Rows with Missing Values:


Drop any row with NaN values

df.dropna(inplace=True)

Drop rows where a specific column is NaN

df.dropna(subset=['column_name'], inplace=True)

Fill Missing Values:


Fill NaN with a specified value (e.g., zero)

df.fillna(value=0, inplace=True)

Forward fill to propagate the last valid observation forward
(fillna(method='ffill') is deprecated in recent pandas versions; use ffill)

df.ffill(inplace=True)
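The missing-value tools side by side, using hypothetical data with one NaN per column and the copy-returning forms so the original stays intact:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, 2.0, 2.0]})

missing_per_col = df.isnull().sum()  # x: 1 missing, y: 1 missing
filled = df.fillna(0)                # copy with NaN replaced by 0
ffilled = df.ffill()                 # forward fill (newer spelling of method='ffill')
```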

6. Removing Duplicates
Duplicate rows can distort counts and statistics, so removing them is an
important cleaning step.

Remove Duplicate Rows:

Remove duplicate rows based on all columns

df.drop_duplicates(inplace=True)

Remove duplicates based on specific column(s)

df.drop_duplicates(subset=['col1'], inplace=True)
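Both forms on a small hypothetical DataFrame where the first two rows are identical:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2], "val": ["a", "a", "b"]})

deduped = df.drop_duplicates()             # rows 0 and 1 are identical -> one kept
by_id = df.drop_duplicates(subset=["id"])  # keep the first row per id
```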

7. Sorting Data
Sorting data is essential for analysis and presentation. You can sort your
DataFrame by one or more columns.

Sort by One Column:


Sort in ascending order

df.sort_values(by='column_name', ascending=True, inplace=True)

Sort in descending order

df.sort_values(by='column_name', ascending=False, inplace=True)

Sort by Multiple Columns:


Sort by col1 ascending and col2 descending

df.sort_values(by=['col1', 'col2'], ascending=[True, False], inplace=True)
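A worked multi-column sort on hypothetical data, using the copy-returning form:

```python
import pandas as pd

df = pd.DataFrame({"grp": ["b", "a", "a"], "val": [1, 3, 2]})

# grp ascending, then val descending within each grp
out = df.sort_values(by=["grp", "val"], ascending=[True, False])
```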

8. Grouping and Aggregating Data
Grouping data allows you to perform operations on subsets of your data.

Group By One Column:


Group data by specified column(s)

grouped = df.groupby('column_name')

Aggregate Functions on Grouped Data:


Sum of grouped values in a specific column

grouped['value_column'].sum()

Mean of grouped values in a specific column

grouped['value_column'].mean()

Multiple aggregations

agg_df = grouped.agg({'value_column': ['sum', 'mean'], 'another_column': 'count'})
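A complete group-and-aggregate pass on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({"team": ["x", "x", "y"], "points": [1, 2, 10]})

grouped = df.groupby("team")
sums = grouped["points"].sum()                     # x -> 3, y -> 10
agg_df = grouped.agg({"points": ["sum", "mean"]})  # two statistics at once
```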

9. Merging and Joining DataFrames
Combining multiple DataFrames is often necessary when working with related
datasets.

Merge Two DataFrames:


Merge on key column(s)

merged_df = pd.merge(df1, df2, on='key_column')

Outer Join Two DataFrames:


Outer join to include all records from both DataFrames

merged_outer = pd.merge(df1, df2, how='outer', on='key_column')

Concatenate Two DataFrames:


Concatenate along rows (axis=0)

concat_df = pd.concat([df1, df2], axis=0)

Concatenate along columns (axis=1)

concat_cols_df = pd.concat([df1, df2], axis=1)
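The three combining operations on two small hypothetical DataFrames sharing a key column:

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2], "l": ["a", "b"]})
right = pd.DataFrame({"key": [2, 3], "r": ["c", "d"]})

inner = pd.merge(left, right, on="key")               # only key 2 matches
outer = pd.merge(left, right, how="outer", on="key")  # keys 1, 2 and 3
rows = pd.concat([left, left], axis=0)                # stack rows: 4 rows total
```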

10. Applying Functions
You can apply custom functions to your DataFrame or Series to manipulate or
transform data.

Using apply() on DataFrame:


Apply a function to each element in a column

df['new_col'] = df['existing_col'].apply(lambda x: x + 1)

Using apply() on Series:


Square each element in the Series
s = pd.Series([1, 2, 3])
s_squared = s.apply(lambda x: x**2)

Using map() for Element-wise Operations:


Map values based on a dictionary

df['new_col'] = df['existing_col'].map({1: 'A', 2: 'B'})

11. String Methods
Pandas provides string methods that allow you to perform vectorized string
operations on Series.

Converting Strings to Lowercase:


Convert all strings in the column to lowercase

df['string_column'] = df['string_column'].str.lower()

Checking for Substrings:


Check if 'text' is in each string

df['contains_text'] = df['string_column'].str.contains('text')

Replacing Substrings:
Replace 'old' with 'new' in strings

df['string_column'] = df['string_column'].str.replace('old', 'new')
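The three string methods on a hypothetical Series (note that in pandas 2.0 and later, str.replace treats plain strings literally by default; pass regex=True for patterns):

```python
import pandas as pd

s = pd.Series(["Old Town", "NEW CITY"])

lower = s.str.lower()                  # 'old town', 'new city'
has_old = s.str.contains("Old")        # True, False (case-sensitive)
swapped = s.str.replace("Old", "New")  # literal replacement by default
```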

12. Advanced Data Manipulation
Advanced data manipulation techniques allow for more complex
transformations and reshaping of your DataFrame.

Melt Function:
The melt() function is used to transform wide-format data into long-format
data.

df_melted = pd.melt(df, id_vars=['id'], value_vars=['col1', 'col2'])

Pivot Function:
The pivot() function reshapes the DataFrame by specifying index, columns,
and values.

df_pivot = df.pivot(index='date', columns='category', values='value')

Stack and Unstack:


Stack: Convert columns into rows (long format).

stacked_df = df.stack()

Unstack: Convert rows back into columns (wide format).

unstacked_df = stacked_df.unstack()
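A round trip through melt and pivot on a small hypothetical wide-format table shows the two functions are inverses:

```python
import pandas as pd

wide = pd.DataFrame({"id": [1, 2], "col1": [10, 20], "col2": [30, 40]})

long = pd.melt(wide, id_vars=["id"], value_vars=["col1", "col2"])  # 4 rows
back = long.pivot(index="id", columns="variable", values="value")  # wide again
```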

13. Creating and Using Pivot Tables
Pivot tables allow you to summarize data in a flexible way.

Creating a Pivot Table:

Create a pivot table with specified values, index, columns, and aggregation
function

pivot_table = df.pivot_table(values='value', index='index_col', columns='column_col', aggfunc='sum')

Pivot Table with Multiple Aggregations:


Create a pivot table with multiple aggregation functions (sum and mean)

pivot_table_multi = df.pivot_table(values='value', index='index_col', aggfunc=['sum', 'mean'])
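Both pivot-table forms on hypothetical sales-style data (string aggregation names avoid needing a numpy import):

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["a", "a", "b"],
    "region": ["n", "s", "n"],
    "value": [1, 2, 5],
})

# one aggregation, spread across region columns
pt = df.pivot_table(values="value", index="category",
                    columns="region", aggfunc="sum")

# two aggregations at once -> MultiIndex columns like ('sum', 'value')
multi = df.pivot_table(values="value", index="category", aggfunc=["sum", "mean"])
```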

14. Working with Categorical Data
Pandas provides support for categorical data, which can improve performance
and memory usage.

Convert Column to Categorical:


Convert a column to categorical type

df['category_column'] = df['category_column'].astype('category')

Get Categories and Their Codes:


Get unique categories

categories = df['category_column'].cat.categories

Get integer codes for categories

codes = df['category_column'].cat.codes

Using Categorical Data for Grouping:


Group by categorical column and count occurrences

grouped = df.groupby('category_column').size()
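The categorical workflow end to end on hypothetical data (observed=True limits the groupby to categories that actually appear, which recent pandas versions recommend for categorical keys):

```python
import pandas as pd

df = pd.DataFrame({"size": ["s", "m", "s"]})
df["size"] = df["size"].astype("category")

cats = df["size"].cat.categories                   # Index(['m', 's'])
codes = df["size"].cat.codes                       # 1, 0, 1
counts = df.groupby("size", observed=True).size()  # m -> 1, s -> 2
```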

15. Handling Date and Time Data
Pandas provides powerful tools for working with date and time data, making it
easy to manipulate and analyze time series.

Convert Strings to Datetime:


Convert to datetime format

df['date_column'] = pd.to_datetime(df['date_column'])

Extracting Date Components:


Extract year

df['year'] = df['date_column'].dt.year

Extract month

df['month'] = df['date_column'].dt.month

Extract day

df['day'] = df['date_column'].dt.day

Setting a Date Column as Index:


Set date_column as the index

df.set_index('date_column', inplace=True)
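The full datetime workflow on two hypothetical date strings:

```python
import pandas as pd

df = pd.DataFrame({"date": ["2024-01-15", "2024-02-20"], "sales": [100, 200]})
df["date"] = pd.to_datetime(df["date"])  # strings -> datetime64 column

df["year"] = df["date"].dt.year    # 2024, 2024
df["month"] = df["date"].dt.month  # 1, 2
df = df.set_index("date")          # datetime index enables time-series slicing
```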

THANK YOU!