Python Pandas

Python Pandas Introduction
Pandas is an open-source library that provides high-performance data manipulation
and analysis tools for Python. The name Pandas is derived from "panel data", an
econometrics term for multidimensional data sets. Wes McKinney began developing
it in 2008.
Data analysis requires a lot of processing, such as restructuring, cleaning, or merging.
Several tools are available for fast data processing, such as NumPy, SciPy, Cython,
and Pandas. We prefer Pandas because working with it is fast, simple, and more
expressive than the other tools.
Pandas is built on top of the NumPy package, which means NumPy is required to use
Pandas.
Before Pandas, Python was well suited to data preparation but provided only limited
support for data analysis. Pandas filled this gap. It supports the five typical steps of
data processing and analysis, regardless of the origin of the data: load, manipulate,
prepare, model, and analyze.
Key Features of Pandas

o It provides a fast and efficient DataFrame object with default and customized
indexing.
o It reshapes and pivots data sets.
o It groups data for aggregations and transformations.
o It handles data alignment and integrated handling of missing data.
o It provides time series functionality.
o It processes data sets in different formats, such as matrix data, heterogeneous
tabular data, and time series.
o It handles many operations on data sets, such as subsetting, slicing, filtering,
grouping, re-ordering, and re-shaping.
o It integrates with other libraries such as SciPy and scikit-learn.
o It provides fast performance, and performance-critical parts can be sped up
further with Cython.
Benefits of Pandas
The benefits of Pandas over other tools and languages are as follows:
o Data Representation: It represents data in a form that is well suited to data
analysis through its DataFrame and Series objects.
o Clear code: The clear API of Pandas lets you focus on the core part of the code,
so the resulting code is clear and concise.
Python Pandas Data Structure

Pandas provides two data structures for processing data, Series and DataFrame,
which are discussed below:
1) Series
A Series is a one-dimensional labeled array that can store various data types. The
row labels of a Series are called the index. We can easily convert a list, tuple, or
dictionary into a Series using the Series() constructor. A Series cannot contain
multiple columns. Its main parameter is:
data: It can be any list, dictionary, or scalar value.
Creating Series from Array:
Before creating a Series from an array, we first import the numpy module and then
use its array() function.
import pandas as pd
import numpy as np
info = np.array(['P','a','n','d','a','s'])
a = pd.Series(info)
print(a)
Output
0 P
1 a
2 n
3 d
4 a
5 s
dtype: object
Explanation: In this code, we first import the pandas and numpy libraries with the
pd and np aliases. Then we create a variable named "info" that holds an array of
values. We pass the info variable to the Series constructor and assign the result to
the variable "a". Finally, the Series is printed by calling print(a).
Python Pandas DataFrame

The DataFrame is a widely used pandas data structure: a two-dimensional array with
labeled axes (rows and columns). It is a standard way to store data and has two
indexes, a row index and a column index. It has the following properties:
o The columns can hold heterogeneous types such as int, bool, and so on.
o It can be seen as a dictionary of Series in which both the rows and the columns
are indexed. The column labels are referred to as "columns" and the row labels
as "index".
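The "dictionary of Series" view can be made concrete with a small sketch. The column names and row labels below are made up for illustration; they do not come from any dataset in this tutorial.

```python
import pandas as pd

# Two Series that share the same (hypothetical) row labels.
ages = pd.Series([25, 32], index=["r1", "r2"])
is_member = pd.Series([True, False], index=["r1", "r2"])

# Each dictionary key becomes a column; the Series index becomes the row index.
df = pd.DataFrame({"age": ages, "is_member": is_member})

print(df.index.tolist())    # row labels
print(df.columns.tolist())  # column labels
print(df.dtypes)            # each column keeps its own dtype
print(type(df["age"]))      # selecting one column gives back a Series
```

Each column keeps its own dtype, which is what makes heterogeneous columns possible.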
Create a DataFrame using List:
We can easily create a DataFrame in Pandas from a list.
import pandas as pd
# a list of strings
x = ['Python', 'Pandas']

# Calling DataFrame constructor on list
df = pd.DataFrame(x)
print(df)
Output
0
0 Python
1 Pandas
Python Pandas Series

A Pandas Series is a one-dimensional labeled array that can store various data types.
We can easily convert a list, tuple, or dictionary into a Series using the Series()
constructor. The row labels of a Series are called the index. A Series cannot contain
multiple columns. It has the following parameters:
o data: It can be any list, dictionary, or scalar value.
o index: The index labels must be hashable, and the index must be the same length
as data. Labels do not have to be unique. If we do not pass an index, a default
integer index 0, 1, ..., n-1 (like np.arange(n)) is used.
o dtype: It refers to the data type of the Series.
o copy: It controls whether the input data is copied.
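As a minimal sketch of how these parameters fit together, the example below passes data, index, and dtype explicitly and compares the result with the default index. The values and labels are arbitrary.

```python
import pandas as pd

# Explicit labels and an explicit dtype: the integers are stored as floats.
s = pd.Series(data=[10, 20, 30], index=["a", "b", "c"], dtype="float64")
print(s)

# Without an index, pandas falls back to the default integer index 0..n-1.
default = pd.Series([10, 20, 30])
print(default.index)
```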
Creating a Series:
We can create a Series in two ways:
1. Create an empty Series
2. Create a Series using inputs.
Create an Empty Series:

We can easily create an empty Series in Pandas, which means it will not have any values.
The syntax used for creating an empty Series is:
<series object> = pandas.Series()
The example below creates an empty Series object with no values. The dtype is passed
explicitly as float64 because the default dtype of an empty Series differs between
pandas versions.
Example
import pandas as pd
x = pd.Series(dtype='float64')
print (x)
Output
Series([], dtype: float64)
Creating a Series using inputs:

We can create Series by using various inputs:
o Array
o Dict
o Scalar value
Creating Series from Array:
Before creating a Series from an array, we first import the numpy module and then
use its array() function. If the data is an ndarray, any index passed must have the
same length as the data.
If we do not pass an index, a default index of range(n) is used, where n is the
length of the array, i.e., [0, 1, 2, ..., len(array) - 1].
Example
import pandas as pd
import numpy as np
info = np.array(['P','a','n','d','a','s'])
a = pd.Series(info)
print(a)
Output
0 P
1 a
2 n
3 d
4 a
5 s
dtype: object
Create a Series from dict
We can also create a Series from a dict. If a dictionary is passed as input and no
index is specified, the dictionary keys are used to construct the index. Recent
pandas versions keep the keys in insertion order; older versions sorted them.
If an index is passed, the values corresponding to the labels in the index are
extracted from the dictionary.
#import the pandas library
import pandas as pd
import numpy as np
info = {'x' : 0., 'y' : 1., 'z' : 2.}
a = pd.Series(info)
print (a)
Output
x 0.0
y 1.0
z 2.0
dtype: float64
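The case where an index is passed together with a dictionary can be sketched as follows. The label 'w' is a made-up key that is deliberately absent from the dictionary.

```python
import pandas as pd

info = {"x": 0.0, "y": 1.0, "z": 2.0}

# Only the labels listed in index are kept, in the order given.
# 'w' is not a key of the dictionary, so its value becomes NaN,
# and 'x' is dropped because it is not in the index.
b = pd.Series(info, index=["y", "z", "w"])
print(b)
```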
Create a Series using Scalar:
If the data is a scalar value, an index should be provided. The scalar value is
repeated to match the length of the index.
#import pandas library
import pandas as pd
import numpy as np
x = pd.Series(4, index=[0, 1, 2, 3])
print (x)
Output
0 4
1 4
2 4
3 4
dtype: int64
Accessing data from series with Position:

Once you create a Series object, you can access its index, its data, and individual
elements.
The data in a Series can be accessed by position, similar to an ndarray.
import pandas as pd
x = pd.Series([1,2,3],index = ['a','b','c'])
#retrieve the first element by position
print (x.iloc[0])
Output
1
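Besides positional access, elements can also be reached through their labels. The sketch below contrasts the two styles on the same small Series; the variable names are illustrative only.

```python
import pandas as pd

x = pd.Series([1, 2, 3], index=["a", "b", "c"])

first = x.iloc[0]             # by position
by_label = x.loc["b"]         # by label
first_two = x.iloc[:2]        # positional slice, end position excluded
label_slice = x.loc["a":"b"]  # label slice, end label included

print(first, by_label)
print(first_two.tolist(), label_slice.tolist())
```

Note the difference in slicing: a positional slice excludes its end position, while a label slice includes its end label.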
Series object attributes

A Series attribute is any piece of information about the Series object, such as its
size or datatype. Below are some of the attributes you can use to get information
about a Series object:
Attributes        Description
Series.index      Defines the index of the Series.
Series.shape      It returns a tuple of the shape of the data.
Series.dtype      It returns the data type of the data.
Series.size       It returns the number of elements in the data.
Series.empty      It returns True if the Series object is empty, otherwise False.
Series.hasnans    It returns True if there are any NaN values, otherwise False.
Series.nbytes     It returns the number of bytes in the data.
Series.ndim       It returns the number of dimensions in the data.
Series.itemsize   It returns the size in bytes of the datatype of each item. This
                  attribute was removed in pandas 1.0; use Series.dtype.itemsize instead.
Retrieving Index array and data array of a series object

We can retrieve the index array and data array of an existing Series object by using the
attributes index and values.
import numpy as np
import pandas as pd
x=pd.Series(data=[2,4,6,8])
y=pd.Series(data=[11.2,18.6,22.5], index=['a','b','c'])
print(x.index)
print(x.values)
print(y.index)
print(y.values)
Output
RangeIndex(start=0, stop=4, step=1)
[2 4 6 8]
Index(['a', 'b', 'c'], dtype='object')
[11.2 18.6 22.5]
Retrieving Types (dtype) and Size of Type (itemsize)

You can use the dtype attribute of a Series object, as <objectname>.dtype, to retrieve
the data type of its elements. The itemsize of that dtype shows the number of bytes
allocated to each data item.
import numpy as np
import pandas as pd
a=pd.Series(data=[1,2,3,4])
b=pd.Series(data=[4.9,8.2,5.6],
            index=['x','y','z'])
print(a.dtype)
# Series.itemsize was removed in pandas 1.0; read it from the dtype instead
print(a.dtype.itemsize)
print(b.dtype)
print(b.dtype.itemsize)
Output
int64
8
float64
8
Retrieving Shape
The shape of a Series object gives the total number of elements, including missing or
empty values (NaN).
import numpy as np
import pandas as pd
a=pd.Series(data=[1,2,3,4])
b=pd.Series(data=[4.9,8.2,5.6],index=['x','y','z'])
print(a.shape)
print(b.shape)
Output
(4,)
(3,)
Retrieving Dimension, Size and Number of bytes:
import numpy as np
import pandas as pd
a=pd.Series(data=[1,2,3,4])
b=pd.Series(data=[4.9,8.2,5.6],
            index=['x','y','z'])
print(a.ndim, b.ndim)
print(a.size, b.size)
print(a.nbytes, b.nbytes)
Output
1 1
4 3
32 24
Checking Emptiness and Presence of NaNs

To check whether a Series object is empty, you can use the empty attribute. Similarly,
to check whether a Series object contains any NaN values, you can use
the hasnans attribute.
Example
import numpy as np
import pandas as pd
a=pd.Series(data=[1,2,3,np.nan])
b=pd.Series(data=[4.9,8.2,5.6],index=['x','y','z'])
# pass a dtype explicitly; the default for an empty Series varies by version
c=pd.Series(dtype='float64')
print(a.empty,b.empty,c.empty)
print(a.hasnans,b.hasnans,c.hasnans)
# len() counts every element, count() skips NaN values
print(len(a),len(b))
print(a.count( ),b.count( ))
Output
False False True
True False False
4 3
3 3
Series Functions
There are some functions used with Series, which are as follows:
Functions                      Description
Pandas Series.map()            Substitutes each value in a Series using a function, a
                               dictionary, or another Series.
Pandas Series.std()            Calculates the standard deviation of the values in the Series.
Pandas Series.to_frame()       Converts the Series object to a DataFrame.
Pandas Series.value_counts()   Returns a Series that contains counts of unique values.
Pandas Series.map()
The main task of map() is to substitute each value in a Series with another value.
The mapping can be given as a function, a dictionary, or another Series. When a
Series is passed, the values of the caller are looked up in the index of that Series.
Values that are not found in a dictionary or Series are converted to NaN.
Syntax
Series.map(arg, na_action=None)
Parameters
o arg: function, dict, or Series.
It refers to the mapping correspondence.
o na_action: {None, 'ignore'}, default None. If 'ignore', NaN values are propagated
without being passed to the mapping correspondence.
Returns
It returns a Pandas Series with the same index as the caller.
Example
import pandas as pd
import numpy as np
a = pd.Series(['Java', 'C', 'C++', np.nan])
a.map({'Java': 'Core'})
Output
0 Core
1 NaN
2 NaN
3 NaN
dtype: object
Example2
import pandas as pd
import numpy as np
a = pd.Series(['Java', 'C', 'C++', np.nan])
# without na_action, NaN is passed to the function and formatted as text
a.map('I like {}'.format)
Output
0 I like Java
1 I like C
2 I like C++
3 I like nan
dtype: object
Example3
import pandas as pd
import numpy as np
a = pd.Series(['Java', 'C', 'C++', np.nan])
# with na_action='ignore', NaN is left untouched
a.map('I like {}'.format, na_action='ignore')
Output
0 I like Java
1 I like C
2 I like C++
3 NaN
dtype: object
Pandas Series.std()
The Pandas std() function calculates the standard deviation of a given set of numbers:
a Series, or the columns or rows of a DataFrame. It is available directly as a method
on Series and DataFrame objects, so no extra package needs to be imported.
The standard deviation is normalized by N-1 by default, which can be changed using
the ddof argument.
Syntax:
Series.std(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)
Parameters:
o axis: {index (0), columns (1)}
o skipna: It excludes all the NA/null values. If an entire row/column is NA, the
result will be NA.
o level: If the axis is a MultiIndex (hierarchical), it counts along a particular level,
collapsing into a scalar.
o ddof: Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N
represents the number of elements.
o numeric_only: boolean, default value None
It includes only float, int, and boolean columns. If it is None, it will attempt to use
everything, and then use only numeric data.
It is not implemented for a Series.
Returns:
It returns a scalar for a Series, or a Series or DataFrame if the level is specified.
Example1:
import pandas as pd
import numpy as np
# calculate standard deviation
# note: np.std divides by N (ddof=0) by default, unlike pandas std()
print(np.std([4,7,2,1,6,3]))
print(np.std([6,9,15,2,-17,15,4]))
Output
2.1147629234082532
10.077252622027656
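Because the NumPy example above uses the population formula, it can help to see the same numbers through Series.std(). This is a small sketch of the effect of ddof and reuses the first list from the example.

```python
import numpy as np
import pandas as pd

values = [4, 7, 2, 1, 6, 3]
s = pd.Series(values)

sample_std = s.std()            # pandas default: ddof=1, divides by N - 1
population_std = s.std(ddof=0)  # divides by N, same as np.std's default

print(sample_std)
print(population_std)
print(np.std(values))
```

With ddof=0 the pandas result agrees with np.std; the default pandas value is slightly larger because it divides by N - 1.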
Example2:
import pandas as pd
import numpy as np

#Create a DataFrame
info = {
'Name':['Parker','Smith','John','William'],
'sub1_Marks':[52,38,42,37],
'sub2_Marks':[41,35,29,36]}
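data = pd.DataFrame(info)
data
# standard deviation of the numeric columns; the 'Name' column holds
# strings, so it is excluded with numeric_only=True
data.std(numeric_only=True)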
Output
sub1_Marks 6.849574
sub2_Marks 4.924429
dtype: float64
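The two remaining functions from the table, to_frame() and value_counts(), can be sketched in the same way. The Series below and its name "language" are made up for illustration.

```python
import pandas as pd

langs = pd.Series(["Java", "C", "Java", "Python", "Java"], name="language")

# to_frame() turns the Series into a one-column DataFrame;
# the Series name becomes the column label.
frame = langs.to_frame()
print(frame)

# value_counts() counts how often each distinct value occurs,
# sorted from most to least frequent.
counts = langs.value_counts()
print(counts)
```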
Pandas Index
A Pandas Index is a vital tool for selecting particular rows and columns of data
from a DataFrame. Its task is to organize the data and to provide fast access to it.
Selecting data through the index is also called subset selection.
When a DataFrame is displayed, the index values are shown in bold font, and each
individual value of the index is called a label.
If we want to compare data access time with and without indexing, we can use
%%timeit to measure the time required for various access operations.
We can also think of an index as an address through which any data can be accessed
across the Series or DataFrame. A DataFrame is a combination of three different
components: the index, the columns, and the data.
Axis and axes

An axis refers to one dimension of the data: the rows (axis 0) or the columns
(axis 1). The axes are the collection of these two dimensions.
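The axis numbering matters most when calling a reduction. The following sketch uses a small made-up DataFrame to show which dimension each axis value collapses.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

col_totals = df.sum(axis=0)  # collapses the rows: one total per column
row_totals = df.sum(axis=1)  # collapses the columns: one total per row

print(col_totals)
print(row_totals)
print(df.axes)  # list holding the row index and the column index
```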
Creating index
First, we take a csv file that contains some data to use for indexing.
# importing pandas package
import pandas as pd
data = pd.read_csv("aa.csv")
data
Output:
Name Hire Date Salary Leaves Remaining

0 John Idle 03/15/14 50000.0 10
1 Smith Gilliam 06/01/15 65000.0 8
2 Parker Chapman 05/12/14 45000.0 10
3 Jones Palin 11/01/13 70000.0 3
4 Terry Gilliam 08/12/14 48000.0 7
5 Michael Palin 05/23/13 66000.0 8
Example1
import pandas as pd
# making data frame from csv file
info = pd.read_csv("aa.csv", index_col ="Name")
# retrieving multiple columns by indexing operator
a = info[["Hire Date", "Salary"]]
print(a)
Output:
                Hire Date   Salary
Name
John Idle        03/15/14  50000.0
Smith Gilliam    06/01/15  65000.0
Parker Chapman   05/12/14  45000.0
Jones Palin      11/01/13  70000.0
Terry Gilliam    08/12/14  48000.0
Michael Palin    05/23/13  66000.0
Example2:
import pandas as pd

# making data frame from csv file
info = pd.read_csv("aa.csv", index_col ="Name")

# retrieving a single column by indexing operator returns a Series
a = info["Salary"]
print(a)
Output:
Name
John Idle         50000.0
Smith Gilliam     65000.0
Parker Chapman    45000.0
Jones Palin       70000.0
Terry Gilliam     48000.0
Michael Palin     66000.0
Name: Salary, dtype: float64
Set index
The set_index() method sets the DataFrame index using existing columns. The new
index can replace the existing index or expand it.
Besides column names, it also accepts arrays such as a list, Series, or Index to use
as the index of the DataFrame.
import pandas as pd
info = pd.DataFrame({'Name': ['Parker', 'Terry', 'Smith', 'William'],
                     'Year': [2011, 2009, 2014, 2010],
                     'Leaves': [10, 15, 9, 4]})
info
# each call returns a new DataFrame; info itself is left unchanged
info.set_index('Name')
info.set_index(['Year', 'Name'])
info.set_index([pd.Index([1, 2, 3, 4]), 'Year'])
a = pd.Series([1, 2, 3, 4])
info.set_index([a, a**2])
Output:
Name Year Leaves
1 1 Parker 2011 10
2 4 Terry 2009 15
3 9 Smith 2014 9
4 16 William 2010 4
Multiple Index
We can also have multiple indexes in the data.
Example1:
import pandas as pd
import numpy as np
pd.MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
codes=[[0, -1, 1, 2, 3, 4]])
Output:
MultiIndex(levels=[[nan, None, NaT, 128, 2]],

codes=[[0, -1, 1, 2, 3, 4]])
Reset index
We can also reset the index using the reset_index() method, which moves the current
index into a column and restores the default integer index.
Example:
import pandas as pd
import numpy as np
info = pd.DataFrame([('William', 'C'),
                     ('Smith', 'Java'),
                     ('Parker', 'Python'),
                     ('Phill', np.nan)],
                    index=[1, 2, 3, 4],
                    columns=('name', 'Language'))
info
info.reset_index()
Output:
index name Language
0 1 William C
1 2 Smith Java
2 3 Parker Python
3 4 Phill NaN
Multiple Index
Multiple indexing (MultiIndex) is essential for data analysis and manipulation,
especially when working with higher-dimensional data. It makes it possible to store
and manipulate data with an arbitrary number of dimensions in lower-dimensional data
structures like Series and DataFrame.
It is the hierarchical analogue of the standard Index object, which stores the axis
labels in pandas objects. It can also be seen as an array of tuples where each tuple
is unique. It can be created from a list of arrays, an array of tuples, or a crossed
set of iterables.
Example:
arrays = [['it', 'it', 'of', 'of', 'for', 'for', 'then', 'then'],
['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
tuples
Output:
[('it', 'one'),
('it', 'two'),
('of', 'one'),
('of', 'two'),
('for', 'one'),
('for', 'two'),
('then', 'one'),
('then', 'two')]
Example2:
import pandas as pd
arrays = [['it', 'it', 'of', 'of', 'for', 'for', 'then', 'then'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
index
Output:
MultiIndex([('it', 'one'),
('it', 'two'),
('of', 'one'),
('of', 'two'),
('for', 'one'),
('for', 'two'),
('then', 'one'),
('then', 'two')],
names=['first', 'second'])
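The "crossed set of iterables" mentioned above corresponds to MultiIndex.from_product. The sketch below builds a smaller index that way and uses it to select data from a Series; the values 1 to 4 are placeholders.

```python
import pandas as pd

# from_product crosses every value of the first list with every value
# of the second, giving 2 x 2 = 4 index entries.
index = pd.MultiIndex.from_product(
    [["it", "of"], ["one", "two"]], names=["first", "second"]
)
s = pd.Series([1, 2, 3, 4], index=index)

inner = s.loc["of"]            # all rows whose first level is 'of'
single = s.loc[("it", "two")]  # one element addressed by both levels

print(s)
print(inner)
print(single)
```

Selecting on the outer level alone returns a Series indexed by the remaining level, which is how a MultiIndex represents higher-dimensional data in a one-dimensional structure.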
Reindex
The main task of reindex() is to conform a DataFrame to a new index with optional
filling logic, placing NA/NaN in locations that have no value in the previous index.
It returns a new object unless the new index is equivalent to the current one and
copy=False.
Reindexing changes the row and column labels of a DataFrame. We can reindex single
or multiple rows by using the reindex() method. Labels in the new index that are not
present in the DataFrame are assigned NaN by default.
Syntax:
DataFrame.reindex(labels=None, index=None, columns=None, axis=None, method=None,
copy=True, level=None, fill_value=nan, limit=None, tolerance=None)
Parameters:
labels: An optional parameter giving the new labels or index to conform the axis
specified by 'axis' to.
index, columns: Optional parameters giving the new labels for the rows and columns.
Preferably pass an Index object to avoid duplicating data.
axis: An optional parameter for the axis to target; it can be either the axis name
or its number.
method: An optional parameter for the method used to fill the holes in the
reindexed DataFrame. It can only be applied to a DataFrame or Series with a
monotonically increasing/decreasing index.
None: The default; gaps are not filled.
pad / ffill: Propagates the last valid observation forward to the next valid
observation.
backfill / bfill: Uses the next valid observation to fill the gap.
nearest: Uses the nearest valid observation to fill the gap.
copy: Its default value is True, which returns a new object even if the passed
indexes are the same.
level: It is used to broadcast across a level, matching Index values on the passed
MultiIndex level.
fill_value: Its default value is np.nan. It is the value used for the missing entries
that reindexing introduces; existing NaN values in the data are not changed.
limit: The maximum number of consecutive elements to forward or backward fill.
tolerance: An optional parameter giving the maximum distance between original and
new labels for inexact matches. At the matching locations, the index values must
satisfy abs(index[indexer] - target) <= tolerance.
Returns :
It returns reindexed DataFrame.
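The method and fill_value parameters are easiest to compare side by side. The following sketch assumes a small made-up DataFrame with a monotonically increasing integer index, which the method parameter requires.

```python
import pandas as pd

# Hypothetical prices observed only at index 1, 3 and 5.
prices = pd.DataFrame({"price": [10.0, 12.0, 15.0]}, index=[1, 3, 5])
new_index = [1, 2, 3, 4, 5]

plain = prices.reindex(new_index)                        # gaps become NaN
filled = prices.reindex(new_index, method="ffill")       # carry last value forward
zeroed = prices.reindex(new_index, fill_value=0.0)       # constant for new labels

print(plain)
print(filled)
print(zeroed)
```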
Example 1:
The example below shows how the reindex() function reindexes a DataFrame. Labels in
the new index that have no corresponding records in the DataFrame are assigned NaN
by default.
Note: We can use fill_value for filling the missing values.

import pandas as pd

# Create dataframe
info = pd.DataFrame({"P":[4, 7, 1, 8, 9],
                     "Q":[6, 8, 10, 15, 11],
                     "R":[17, 13, 12, 16, 14],
                     "S":[15, 19, 7, 21, 9]},
                    index =["Parker", "William", "Smith", "Terry", "Phill"])

# Print dataframe
info
Output:
         P   Q   R   S
Parker   4   6  17  15
William  7   8  13  19
Smith    1  10  12   7
Terry    8  15  16  21
Phill    9  11  14   9
Now, we can use the DataFrame.reindex() function to reindex the DataFrame.
# reindexing with new index values
info.reindex(["A", "B", "C", "D", "E"])
Output:
P Q R S
A NaN NaN NaN NaN
B NaN NaN NaN NaN
C NaN NaN NaN NaN
D NaN NaN NaN NaN
E NaN NaN NaN NaN
Notice that the new index labels are populated with NaN values. We can fill in the
missing values using the fill_value parameter.
# filling the missing values by 100
info.reindex(["A", "B", "C", "D", "E"], fill_value =100)
Output:
P Q R S
A 100 100 100 100
B 100 100 100 100
C 100 100 100 100
D 100 100 100 100
E 100 100 100 100
Example 2:
This example shows how the reindex() function reindexes the column axis.
# importing pandas as pd
import pandas as pd

# Creating the first dataframe
info1 = pd.DataFrame({"A":[1, 5, 3, 4, 2],
                      "B":[3, 2, 4, 3, 4],
                      "C":[2, 2, 7, 3, 4],
                      "D":[4, 3, 6, 12, 7]})
# reindexing the column axis with
# old and new index values
info1.reindex(columns =["A", "B", "D", "E"])
Output:
   A  B   D   E
0  1  3   4 NaN
1  5  2   3 NaN
2  3  4   6 NaN
3  4  3  12 NaN
4  2  4   7 NaN
Notice that the new column E contains NaN values after reindexing, and column C is
dropped because it is not in the new list. We can use the fill_value argument to
replace the NaN values.
# reindex the columns
# fill the missing values by 37
info1.reindex(columns =["A", "B", "D", "E"], fill_value =37)
Output:
   A  B   D   E
0  1  3   4  37
1  5  2   3  37
2  3  4   6  37
3  4  3  12  37
4  2  4   7  37
Reset Index
The reset_index() method resets the index of a DataFrame to the default integer
index. If the DataFrame has a MultiIndex, this method can remove one or more
levels.
Syntax:
DataFrame.reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill='')
Parameters:
level : Refers to int, str, tuple, or list, default value None
It removes only the given levels from the index. By default, all levels are removed.
drop : Refers to Boolean value, default value False
If True, the old index is discarded instead of being inserted as a column; the index
is simply reset to the default integer index.
inplace : Refers to Boolean value, default value False
If True, the DataFrame is modified in place and no new object is created.
col_level : Refers to int or str, default value 0
If the columns have multiple levels, it determines which level the labels are
inserted into.
col_fill : Refers to an object, default value ''
If the columns have multiple levels, it determines how the other levels are named.
Example1:
import pandas as pd
import numpy as np
info = pd.DataFrame([('William', 'C'),
                     ('Smith', 'Java'),
                     ('Parker', 'Python'),
                     ('Phill', np.nan)],
                    index=[1, 2, 3, 4],
                    columns=('name', 'Language'))
info
info.reset_index()
Output:
index name Language
0 1 William C
1 2 Smith Java
2 3 Parker Python
3 4 Phill NaN
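The drop and level parameters are not exercised by the example above. A minimal sketch of both follows; it assumes a small DataFrame whose index is named "name", which is an illustrative choice rather than part of the original example.

```python
import pandas as pd

info = pd.DataFrame(
    {"Language": ["C", "Java", "Python"]},
    index=pd.Index(["William", "Smith", "Parker"], name="name"),
)

kept = info.reset_index()              # old index becomes the 'name' column
dropped = info.reset_index(drop=True)  # old index is discarded

# Build a two-level index (name, Language), then remove only one level.
multi = info.set_index("Language", append=True)
partial = multi.reset_index(level="Language")

print(kept)
print(dropped)
print(partial)
```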
Pandas Time Series

Time series data is an important source of information for strategy in many
businesses. From the conventional finance industry to the education industry, much
of the data that is collected contains detailed information about time.
Time series forecasting is the machine learning task of predicting future values from
time series data.
Pandas has extensive capabilities and features for working with time series data in
all domains. Using the NumPy datetime64 and timedelta64 dtypes, Pandas has
consolidated features from other Python libraries such as scikits.timeseries and has
created a tremendous amount of new functionality for manipulating time series data.
For example, pandas supports parsing time series information from various sources
and formats.
Importing Packages and Data

Before starting, you have to import some packages: numpy, pandas, matplotlib, and
seaborn.
You can display plots inline in the Jupyter Notebook by adding %matplotlib inline to
the code, and you can switch to Seaborn defaults by using sns.set():
# import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()
Date and time
Pandas provides a number of functionalities for dates, times, deltas, and
timespans, which are mainly used in data science applications.
Native dates and times:

Python's native date and time types reside in the datetime module. We can also
perform many useful operations on dates and times by using the dateutil module,
and we can parse dates from a variety of string formats:
Example1:
import pandas as pd
# Create the dates with frequency
info = pd.date_range('5/4/2013', periods = 8, freq ='S')
info
Output:
DatetimeIndex(['2013-05-04 00:00:00', '2013-05-04 00:00:01',

'2013-05-04 00:00:02', '2013-05-04 00:00:03',
'2013-05-04 00:00:04', '2013-05-04 00:00:05',
'2013-05-04 00:00:06', '2013-05-04 00:00:07'],
dtype='datetime64[ns]', freq='S')
Example2:
import pandas as pd
# Create the Timestamp
p = pd.Timestamp('2018-12-12 06:25:18')
# Create the DateOffset
do = pd.tseries.offsets.DateOffset(n = 2)
# Print the Timestamp
print(p)
# Print the DateOffset
print(do)
Output:
2018-12-12 06:25:18
<2 * DateOffsets>
Pandas Datetime
Pandas provides features for working with time series data in all domains. Using the
NumPy datetime64 and timedelta64 dtypes, it consolidates a large number of features
from other Python libraries such as scikits.timeseries and adds new functionality
for manipulating time series data.
The time series tools are most useful for data science applications and work well
with other packages used in Python.
Example1:
import pandas as pd
# Create the dates with frequency
info = pd.date_range('5/4/2013', periods = 8, freq ='S')
info
Output:
DatetimeIndex(['2013-05-04 00:00:00', '2013-05-04 00:00:01',
'2013-05-04 00:00:02', '2013-05-04 00:00:03',
'2013-05-04 00:00:04', '2013-05-04 00:00:05',
'2013-05-04 00:00:06', '2013-05-04 00:00:07'],
dtype='datetime64[ns]', freq='S')
Example2:
info = pd.DataFrame({'year': [2014, 2012],
'month': [5, 7],
'day': [20, 17]})
pd.to_datetime(info)
Output:
0 2014-05-20
1 2012-07-17
dtype: datetime64[ns]
You can pass errors='ignore' to get the original input back, without raising an
exception, when a date cannot be converted to a timestamp. This option is deprecated
in recent pandas releases.
If you pass errors='coerce', a value that cannot be converted, such as an
out-of-bounds date, is forced to NaT.
import pandas as pd
# 1800-07-06 lies within the supported Timestamp range, so it converts normally
pd.to_datetime('18000706', format='%Y%m%d', errors='coerce')
Output:
Timestamp('1800-07-06 00:00:00')
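To see errors='coerce' actually produce NaT, the input needs a value that cannot be parsed. The sketch below uses a made-up Series with one invalid entry and contrasts it with the default behavior, which raises an exception.

```python
import pandas as pd

# Hypothetical raw column with one entry that is not a date.
raw = pd.Series(["2014-05-20", "not a date", "2012-07-17"])

# errors='coerce' turns the unparseable entry into NaT.
parsed = pd.to_datetime(raw, errors="coerce")
print(parsed)

# With the default errors='raise', the same input raises an exception.
raised = False
try:
    pd.to_datetime(raw)
except ValueError:
    raised = True
print(raised)
```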
Example3:
import pandas as pd
dmy = pd.date_range('2017-06-04', periods=5, freq='S')
dmy
Output:
DatetimeIndex(['2017-06-04 00:00:00',
'2017-06-04 00:00:01',
'2017-06-04 00:00:02',
'2017-06-04 00:00:03',
'2017-06-04 00:00:04'],
dtype='datetime64[ns]', freq='S')
Example4:
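import pandas as pd
dmy = pd.date_range('2017-06-04', periods=5, freq='S')
# attach the UTC time zone to the naive timestamps
dmy = dmy.tz_localize('UTC')
dmy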
Output:
DatetimeIndex(['2017-06-04 00:00:00+00:00', '2017-06-04 00:00:01+00:00',

'2017-06-04 00:00:02+00:00',
'2017-06-04 00:00:03+00:00',
'2017-06-04 00:00:04+00:00'],
dtype='datetime64[ns, UTC]', freq='S')
Pandas Time Offset

The time series tools are most useful for data science applications and work well
with other packages used in Python. A time offset performs operations on times, such
as adding and subtracting.
An offset specifies a set of dates that conform to the DateOffset. We can create
DateOffsets to move dates forward or backward to valid dates.
If a date is not valid for the offset, we can use the rollback and rollforward methods
to roll the date to the nearest valid date before or after it. The pseudo-code of
adding a time offset is as follows:
Syntax:
class pandas.tseries.offsets.DateOffset(n=1, normalize=False, **kwds)
def __add__(date):
    date = rollback(date)  # does nothing if the date is valid
    return date + <n number of periods>
When we create a date offset for a negative number of periods, the date is rolled
forward first:
def __add__(date):
    date = rollforward(date)  # does nothing if the date is valid
    return date + <n number of periods>
Parameters:
n: Refers to int, default value is 1.
It is the number of time periods that the offset represents.
normalize: Refers to a boolean value, default value False.
If True, the result of adding the offset is rounded down to the previous midnight.
**kwds
These are optional keyword parameters that add to or replace the offset value.
The parameters used for adding to the offset are as follows:
o years
o months
o weeks
o days
o hours
o minutes
o seconds
o microseconds
o nanoseconds
The parameters used for replacing the offset value are as follows:
o year
o month
o day
o weekday
o hour
o minute
o second
o microsecond
o nanosecond
Example:
import pandas as pd
p = pd.Timestamp('2018-12-12 06:25:18')
# Create the DateOffset
do = pd.tseries.offsets.DateOffset(n = 2)
# Print the Timestamp
print(p)
# Print the DateOffset
print(do)
Output:
2018-12-12 06:25:18
<2 * DateOffsets>
Example2:
import pandas as pd
p = pd.Timestamp('2018-12-12 06:25:18')
# Create the DateOffset (with no keywords, n counts days)
do = pd.tseries.offsets.DateOffset(n = 2)
# Add the dateoffset to given timestamp
new_timestamp = p + do
# Print updated timestamp
print(new_timestamp)
Output:
2018-12-14 06:25:18
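The difference between the plural keywords (which add to the date) and the singular keywords (which replace a field), as well as the rollforward and rollback methods, can be sketched as follows. MonthEnd is used here only as one example of an offset for which most dates are not "valid"; the variable names are illustrative.

```python
import pandas as pd

p = pd.Timestamp("2018-12-12 06:25:18")

added = p + pd.DateOffset(months=1)  # plural: add one month
replaced = p + pd.DateOffset(day=1)  # singular: set the day of month to 1

# 2018-12-12 is not a month end, so it can be rolled to a valid date.
month_end = pd.offsets.MonthEnd()
d = pd.Timestamp("2018-12-12")
rolled_fwd = month_end.rollforward(d)  # next month end on or after d
rolled_back = month_end.rollback(d)    # last month end on or before d

print(added, replaced)
print(rolled_fwd, rolled_back)
```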
Pandas Time Periods

A time period represents a span of time, e.g., a day, month, quarter, or year. The
Period class also allows us to convert a period from one frequency to another.
Generating periods and frequency conversion

We can generate a period using the Period class with frequency 'M'. If we call
asfreq with the 'start' option, the resulting date is the first day of the month
('01'), whereas with the 'end' option it is the last day ('31' for January).
Example:
import pandas as pd
# '2014' with a monthly frequency is interpreted as January 2014
x = pd.Period('2014', freq='M')
x.asfreq('D', 'start')
Output:
Period('2014-01-01', 'D')
Example:
import pandas as pd
x = pd.Period('2014', freq='M')
x.asfreq('D', 'end')
Output:
Period('2014-01-31', 'D')
Period arithmetic
Period arithmetic performs arithmetic operations on periods. All operations are
carried out in units of the period's 'freq'.
import pandas as pd
x = pd.Period('2014', freq='Q')
x
Output:
Period('2014Q1', 'Q-DEC')
Example:
import pandas as pd
x = pd.Period('2014', freq='Q')
# adding 1 moves forward by one quarter
x + 1
Output:
Period('2014Q2', 'Q-DEC')
Creating period range

We can create a range of periods by using the period_range() function. The example
below uses the business-day frequency 'B'.
import pandas as pd
p = pd.period_range('2012', '2017', freq='B')
p
Output:
PeriodIndex(['2012-01-02', '2012-01-03', '2012-01-04', '2012-01-05',
'2012-01-06', '2012-01-09', '2012-01-10', '2012-01-11',
'2012-01-12', '2012-01-13',
...
'2016-12-20', '2016-12-21', '2016-12-22', '2016-12-23',
'2016-12-26', '2016-12-27', '2016-12-28', '2016-12-29',
'2016-12-30', '2017-01-02'],
dtype='period[B]', length=1306, freq='B')
Converting string-dates to period

If we want to convert string dates to periods, we first need to convert the strings
to the datetime format, and then we can convert the dates into periods.
import pandas as pd
# dates as string
p = ['2012-06-05', '2011-07-09', '2012-04-06']
# convert string to date format
x = pd.to_datetime(p)
x
Output:
DatetimeIndex(['2012-06-05', '2011-07-09', '2012-04-06'],

dtype='datetime64[ns]', freq=None)
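The second step, turning the dates into periods, can be sketched with the 'to_period' method of the DatetimeIndex. The choice of monthly and daily target frequencies here is only an illustration.

```python
import pandas as pd

# The same string dates as above
p = ['2012-06-05', '2011-07-09', '2012-04-06']

# Step 1: strings -> datetime values
x = pd.to_datetime(p)

# Step 2: datetime values -> periods at the chosen frequency
monthly = x.to_period('M')
daily = x.to_period('D')

print([str(v) for v in monthly])  # ['2012-06', '2011-07', '2012-04']
print([str(v) for v in daily])    # ['2012-06-05', '2011-07-09', '2012-04-06']
```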
Convert periods to timestamps

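To convert periods back to timestamps, we can use the 'to_timestamp' method.
import pandas as pd
# prd is assumed here to be a daily PeriodIndex built from the dates shown in the output
prd = pd.PeriodIndex(['2017-04-02', '2016-04-06', '2016-05-08'], freq='D')
prd.to_timestamp()
Output:
DatetimeIndex(['2017-04-02', '2016-04-06', '2016-05-08'],
dtype='datetime64[ns]', freq=None)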
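The same method exists on a single Period. As a hedged sketch: for a period that spans more than one instant, 'to_timestamp' returns the start of the span by default, and 'to_period' converts it back.

```python
import pandas as pd

# A monthly period (value chosen only for illustration)
month = pd.Period('2017-04', freq='M')

# By default the timestamp marks the start of the period
start_ts = month.to_timestamp()
print(start_ts)  # 2017-04-01 00:00:00

# Round trip: the timestamp converts back to the same monthly period
back = start_ts.to_period('M')
print(back == month)  # True
```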
Convert string to date

Analyzing datasets with dates and times can be a tedious task. Months have different
lengths, weekdays and weekends are distributed unevenly, and leap years and time zones
all have to be considered according to our context. For this reason, Python provides a
dedicated data type for dates and times called datetime.
However, in many datasets, strings are used to represent the dates. So, in this topic,
you'll learn how to convert date strings to the datetime format and see how this
powerful set of tools helps to work effectively with complicated time series data.
The challenge is that date strings can be expressed in many ways. For example,
'Wednesday, June 6, 2018' can also be written as '6/6/18' or '06-06-2018'. All these
formats define the same date, but the code needed to convert each of them is
slightly different.
from datetime import datetime

# Define dates as strings
dmy_str1 = 'Saturday, July 14, 2018'
dmy_str2 = '14/7/17'
dmy_str3 = '14-07-2017'

# Convert the strings to datetime objects; each format string must match its input
dmy_dt1 = datetime.strptime(dmy_str1, '%A, %B %d, %Y')  # weekday name, month name, day, year
dmy_dt2 = datetime.strptime(dmy_str2, '%d/%m/%y')       # day/month/two-digit year
dmy_dt3 = datetime.strptime(dmy_str3, '%d-%m-%Y')       # day-month-four-digit year

# Print the converted dates
print(dmy_dt1)
print(dmy_dt2)
print(dmy_dt3)
Output:
2018-07-14 00:00:00
2017-07-14 00:00:00
2017-07-14 00:00:00
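When the format string does not match the input, 'strptime' raises a ValueError. The sketch below wraps the call in a small helper, parse_date, which is a hypothetical name introduced only for this illustration, to show the difference between a matching and a non-matching format.

```python
from datetime import datetime


def parse_date(text, fmt):
    """Return a datetime, or None when text does not match fmt (hypothetical helper)."""
    try:
        return datetime.strptime(text, fmt)
    except ValueError:
        # strptime signals a format mismatch with ValueError
        return None


# Day first, then month: matches '14/7/17'
ok = parse_date('14/7/17', '%d/%m/%y')

# Month first: fails, because 14 is not a valid month
bad = parse_date('14/7/17', '%m/%d/%y')

print(ok)   # 2017-07-14 00:00:00
print(bad)  # None
```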
Converting the date string column

This section shows how to convert a whole column of date strings in a dataset to the
datetime format.
From here on, we work with a DataFrame called eth that contains historical data on
ether, a cryptocurrency whose blockchain is produced by the Ethereum platform. The
dataset consists of the following columns:
o date: The date of the observation, recorded daily at 00:00 UTC.
o txVolume: An unadjusted measure of the total value, in US dollars, of outputs on
the blockchain.
o txCount: The number of transactions performed on the public blockchain.
o marketCap: The unit price in US dollars multiplied by the number of units in
circulation.
o price: The opening price in US dollars at 00:00 UTC.
o generatedCoins: The number of new coins.
o exchangeVolume: The actual volume, measured in US dollars, at exchanges like GDAX
and Bitfinex.
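A minimal sketch of such a column conversion follows. The real eth dataset is not reproduced here, so the frame below is a tiny stand-in with made-up placeholder rows, and the 'YYYY-MM-DD' string format of the date column is an assumption.

```python
import pandas as pd

# Hypothetical stand-in for the eth DataFrame; the rows are placeholders, not real data
eth = pd.DataFrame({
    'date': ['2018-06-04', '2018-06-05', '2018-06-06'],
    'txCount': [1, 2, 3],
})

# Convert the whole column of date strings in one call.
# The format argument is optional; it states the assumed layout explicitly.
eth['date'] = pd.to_datetime(eth['date'], format='%Y-%m-%d')

# The column now holds datetime values, so date components are available via .dt
print(eth['date'].dt.day.tolist())  # [4, 5, 6]
```

Converting the column once up front means later steps such as filtering by date or resampling can rely on real datetime values rather than strings.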
