
Python Pandas


Python Pandas Introduction

Pandas is an open-source library that provides high-performance data manipulation in
Python. The name Pandas is derived from "Panel Data", an econometrics term for
multidimensional structured data sets. It is used for data analysis in Python and was
developed by Wes McKinney in 2008.

Data analysis requires lots of processing, such as restructuring, cleaning, or merging.
Several tools are available for fast data processing, such as NumPy, SciPy, Cython, and
Pandas. We prefer Pandas because working with Pandas is fast, simple, and more
expressive than the other tools.


Pandas is built on top of the NumPy package, which means NumPy is required for Pandas
to operate.

Before Pandas, Python was capable of data preparation, but it provided only limited
support for data analysis. So Pandas came into the picture and enhanced the
capabilities of data analysis. It can perform the five significant steps required for
processing and analyzing data, irrespective of the origin of the data: load,
manipulate, prepare, model, and analyze.

Key Features of Pandas


o It has a fast and efficient DataFrame object with default and customized
indexing.
o Used for reshaping and pivoting of data sets.
o Supports grouping data for aggregations and transformations.
o It is used for data alignment and integration of missing data.
o Provides Time Series functionality.
o Processes a variety of data sets in different formats: matrix data, tabular
heterogeneous data, and time series.
o Handles multiple operations on data sets such as subsetting, slicing, filtering,
groupBy, re-ordering, and re-shaping.
o It integrates with other libraries such as SciPy and scikit-learn.
o Provides fast performance, and if you want to speed it up even more, you can use
Cython.

Benefits of Pandas
The benefits of Pandas over other tools are as follows:

o Data Representation: It represents the data in a form that is suited for data
analysis through its DataFrame and Series.
o Clear code: The clear API of Pandas allows you to focus on the core part of
the code, so it provides clear and concise code for the user.

Python Pandas Data Structure


Pandas provides two data structures for processing data,
i.e., Series and DataFrame, which are discussed below:

1) Series
It is defined as a one-dimensional array that is capable of storing various data types.
The row labels of a Series are called the index. We can easily convert a list, tuple,
or dictionary into a Series using the Series() method. A Series cannot contain multiple
columns. It has one parameter:

Data: It can be any list, dictionary, or scalar value.

Creating Series from Array:

Before creating a Series, we first have to import the numpy module and then use its
array() function in the program.

import pandas as pd  
import numpy as np  
info = np.array(['P','a','n','d','a','s'])  
a = pd.Series(info)  
print(a)  

Output

0 P
1 a
2 n
3 d
4 a
5 s
dtype: object

Explanation: In this code, we first imported the pandas and numpy libraries with the
pd and np aliases. Then we created a variable named "info" that consists of an array
of some values, passed it to the Series() method, and assigned the result to the
variable "a". Finally, the Series is printed by calling print(a).

Python Pandas DataFrame


It is a widely used data structure of pandas and works with a two-dimensional array with
labeled axes (rows and columns). DataFrame is defined as a standard way to store data
and has two different indexes, i.e., row index and column index. It consists of the
following properties:

o The columns can be heterogeneous types like int, bool, and so on.
o It can be seen as a dictionary of Series structure where both the rows and
columns are indexed. It is denoted as "columns" in case of columns and "index"
in case of rows.

Create a DataFrame using List:

We can easily create a DataFrame in Pandas using a list.

import pandas as pd  
# a list of strings  
x = ['Python', 'Pandas']  
  
# Calling DataFrame constructor on list  
df = pd.DataFrame(x)  
print(df)  
Output

0
0 Python
1 Pandas
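A DataFrame can also be created from a dictionary of lists, where each key becomes a column label. A minimal sketch (the column names and values here are illustrative):

```python
import pandas as pd

# each dict key becomes a column label; all lists must have the same length
data = {'Language': ['Python', 'Pandas'], 'Rank': [1, 2]}
df = pd.DataFrame(data)
print(df)
```

This form is often more convenient than a plain list because the column labels are set in the same step.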

Python Pandas Series


The Pandas Series can be defined as a one-dimensional array that is capable of storing
various data types. We can easily convert a list, tuple, or dictionary into a Series
using the Series() method. The row labels of a Series are called the index. A Series
cannot contain multiple columns. It has the following parameters:

o data: It can be any list, dictionary, or scalar value.


o index: The values of the index should be unique and hashable, and the index must be of
the same length as data. If we do not pass any index, the default np.arange(n) will be used.
o dtype: It refers to the data type of series.
o copy: It is used for copying the data.
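The parameters above can be combined in a single call. A minimal sketch showing data, index, and dtype together (the labels and values are illustrative):

```python
import pandas as pd

# data, index, and dtype passed explicitly
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'], dtype='float64')
print(s)
```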

Creating a Series:
We can create a Series in two ways:

1. Create an empty Series


2. Create a Series using inputs.

Create an Empty Series:


We can easily create an empty series in Pandas which means it will not have any value.

The syntax that is used for creating an Empty Series:

1. <series object> = pandas.Series()  

The below example creates an empty Series-type object that has no values and the
default datatype, i.e., float64.
Example

import pandas as pd  
x = pd.Series()  
print (x)  

Output

Series([], dtype: float64)

Creating a Series using inputs:


We can create Series by using various inputs:

o Array
o Dict
o Scalar value

Creating Series from Array:

Before creating a Series, we first have to import the numpy module and then use its
array() function in the program. If the data is an ndarray, then the passed index must
be of the same length.

If we do not pass an index, then by default the index range(n) is used, where n is the
length of the array, i.e., [0, 1, 2, ..., len(array)-1].

Example

import pandas as pd  
import numpy as np  
info = np.array(['P','a','n','d','a','s'])  
a = pd.Series(info)  
print(a)   

Output

0 P
1 a
2 n
3 d
4 a
5 s
dtype: object

Create a Series from dict

We can also create a Series from a dict. If a dictionary object is passed as input and
the index is not specified, then the dictionary keys are taken in insertion order to
construct the index (older versions of pandas sorted the keys).

If an index is passed, then the values corresponding to the labels in the index will be
extracted from the dictionary.

#import the pandas library   
import pandas as pd  
import numpy as np  
info = {'x' : 0., 'y' : 1., 'z' : 2.}  
a = pd.Series(info)  
print (a)  

Output

x 0.0
y 1.0
z 2.0
dtype: float64
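To illustrate the index behavior described above, here is a small sketch: passing an explicit index to a dict-backed Series extracts the matching labels, and labels missing from the dict become NaN (the label 'w' is illustrative):

```python
import pandas as pd
import numpy as np

info = {'x': 0., 'y': 1., 'z': 2.}
# 'y' and 'z' are extracted from the dict; 'w' is absent, so it becomes NaN
a = pd.Series(info, index=['y', 'z', 'w'])
print(a)
```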

Create a Series using Scalar:

If we take a scalar value, then the index must be provided. The scalar value will be
repeated to match the length of the index.

#import pandas library   
import pandas as pd  
import numpy as np  
x = pd.Series(4, index=[0, 1, 2, 3])  
print (x)  

Output

0 4
1 4
2 4
3 4
dtype: int64

Accessing data from series with Position:


Once you create the Series type object, you can access its indexes, data, and even
individual elements.

The data in the Series can be accessed similar to that in the ndarray.

import pandas as pd  
x = pd.Series([1,2,3],index = ['a','b','c'])  
#retrieve the first element  
print (x[0])  

Output

1
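Beyond positional access, elements can also be retrieved by label, or by explicit position with iloc. A minimal sketch (the labels are illustrative):

```python
import pandas as pd

x = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
# access by label
print(x['b'])
# access by position explicitly via iloc (avoids ambiguity with label lookups)
print(x.iloc[0])
# slice the first two elements by position
print(x.iloc[:2])
```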
Series object attributes


A Series attribute is defined as any information related to the Series object, such as
its size, datatype, etc. Below are some of the attributes that you can use to get
information about a Series object:

Attributes Description

Series.index Defines the index of the Series.

Series.shape It returns a tuple of shape of the data.

Series.dtype It returns the data type of the data.

Series.size It returns the size of the data.

Series.empty It returns True if the Series object is empty, otherwise False.

Series.hasnans It returns True if there are any NaN values, otherwise False.

Series.nbytes It returns the number of bytes in the data.

Series.ndim It returns the number of dimensions in the data.

Series.itemsize It returns the size of the datatype of item.

Retrieving Index array and data array of a series object


We can retrieve the index array and data array of an existing Series object by using the
attributes index and values.

import numpy as np   
import pandas as pd   
x=pd.Series(data=[2,4,6,8])   
y=pd.Series(data=[11.2,18.6,22.5], index=['a','b','c'])   
print(x.index)   
print(x.values)   
print(y.index)   
print(y.values)  

Output

RangeIndex(start=0, stop=4, step=1)


[2 4 6 8]
Index(['a', 'b', 'c'], dtype='object')
[11.2 18.6 22.5]

Retrieving Types (dtype) and Size of Type (itemsize)


You can use the dtype attribute of a Series object, as <objectname>.dtype, to retrieve
the data type of the individual elements of the Series, and you can use the itemsize
attribute to show the number of bytes allocated to each data item.

import numpy as np   
import pandas as pd   
a=pd.Series(data=[1,2,3,4])   
b=pd.Series(data=[4.9,8.2,5.6],   
index=['x','y','z'])   
print(a.dtype)   
print(a.itemsize)    
print(b.dtype)   
print(b.itemsize)  

Output

int64
8
float64
8

Retrieving Shape
The shape of the Series object defines the total number of elements, including missing
or empty values (NaN).

import numpy as np   
import pandas as pd   
a=pd.Series(data=[1,2,3,4])   
b=pd.Series(data=[4.9,8.2,5.6],index=['x','y','z'])   
print(a.shape)   
print(b.shape)  

Output

(4,)
(3,)

Retrieving Dimension, Size and Number of bytes:

import numpy as np   
import pandas as pd   
a=pd.Series(data=[1,2,3,4])   
b=pd.Series(data=[4.9,8.2,5.6],  
index=['x','y','z'])   
print(a.ndim, b.ndim)   
print(a.size, b.size)   
print(a.nbytes, b.nbytes)  

Output

1 1
4 3
32 24

Checking Emptiness and Presence of NaNs


To check whether a Series object is empty, you can use the empty attribute. Similarly,
to check whether a Series object contains NaN values, you can use the hasnans
attribute. Note that len() counts all elements including NaNs, while count() returns
only the number of non-NaN values.

Example

import numpy as np   
import pandas as pd   
a=pd.Series(data=[1,2,3,np.NaN])   
b=pd.Series(data=[4.9,8.2,5.6],index=['x','y','z'])   
c=pd.Series()   
print(a.empty,b.empty,c.empty)   
print(a.hasnans,b.hasnans,c.hasnans)   
print(len(a),len(b))   
print(a.count( ),b.count( ))  

Output

False False True


True False False
4 3
3 3
Series Functions
There are some functions used in Series which are as follows:

Functions Description

Pandas Series.map() Maps the values of a Series according to an input correspondence (a function, dict, or Series).

Pandas Series.std() Calculate the standard deviation of the given set of numbers, DataFrame,
column, and rows.

Pandas Series.to_frame() Convert the series object to the dataframe.

Pandas Series.value_counts() Returns a Series that contains counts of unique values.

Pandas Series.map()
The main task of map() is used to map the values from two series that have a common
column. To map the two Series, the last column of the first Series should be the same as
the index column of the second series, and the values should be unique.

Syntax

1. Series.map(arg, na_action=None)  

Parameters

o arg: function, dict, or Series.


It refers to the mapping correspondence.
o na_action: {None, 'ignore'}, default value None. If 'ignore', NaN values are
propagated without being passed to the mapping correspondence.
Returns
It returns the Pandas Series with the same index as a caller.

Example

import pandas as pd  
import numpy as np  
a = pd.Series(['Java', 'C', 'C++', np.nan])  
a.map({'Java': 'Core'})  

Output

0 Core
1 NaN
2 NaN
3 NaN
dtype: object

Example2

import pandas as pd  
import numpy as np  
a = pd.Series(['Java', 'C', 'C++', np.nan])  
a.map('I like {}'.format)  

Output

0 I like Java
1 I like C
2 I like C++
3 I like nan
dtype: object

Example3

import pandas as pd  
import numpy as np  
a = pd.Series(['Java', 'C', 'C++', np.nan])  
a.map({'Java': 'Core'})  
a.map('I like {}'.format)  
a.map('I like {}'.format, na_action='ignore')  

Output

0 I like Java
1 I like C
2 I like C++
3 NaN
dtype: object
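Since arg may also be a plain function, here is a small sketch combining a function argument with na_action='ignore' (the lambda is illustrative):

```python
import pandas as pd
import numpy as np

a = pd.Series(['Java', 'C', 'C++', np.nan])
# a function argument is applied element-wise; the NaN entry is skipped
result = a.map(lambda s: s.lower(), na_action='ignore')
print(result)
```

Without na_action='ignore', the lambda would receive the NaN value itself and raise an error here, since float has no lower() method.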

Pandas Series.std()
The Pandas std() is defined as a function for calculating the standard deviation of a
given set of numbers, a DataFrame, or its columns and rows.

The standard deviation is normalized by N-1 by default and can be changed using
the ddof argument.

Syntax:

1. Series.std(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)  

Parameters:

o axis: {index (0), columns (1)}


o skipna: It excludes all NA/null values. If an entire row/column is NA, the result
will be NA.
o level: It counts along a particular level, collapsing into a scalar if the axis is a
MultiIndex (hierarchical).
o ddof: Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N
represents the number of elements.
o numeric_only: boolean, default value None
It includes only float, int, and boolean columns. If None, it will attempt to use all
columns, so use only numeric data.
It is not implemented for a Series.
Returns:
It returns Series or DataFrame if the level is specified.

Example1:

import pandas as pd  
# calculate standard deviation (note: np.std normalizes by N, i.e., ddof=0)  
import numpy as np   
print(np.std([4,7,2,1,6,3]))  
print(np.std([6,9,15,2,-17,15,4]))  

Output

2.1147629234082532
10.077252622027656

Example2:

import pandas as pd  
import numpy as np  
   
#Create a DataFrame  
info = {  
    'Name':['Parker','Smith','John','William'],  
   'sub1_Marks':[52,38,42,37],  
   'sub2_Marks':[41,35,29,36]}   
data = pd.DataFrame(info)  
data  
# standard deviation of the dataframe  
data.std()  

Output

sub1_Marks 6.849574
sub2_Marks 4.924429
dtype: float64
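To illustrate the N-1 normalization mentioned above, a minimal sketch comparing the default ddof=1 with ddof=0 (the population standard deviation, which is what np.std computes by default):

```python
import pandas as pd

s = pd.Series([4, 7, 2, 1, 6, 3])
# default: sample standard deviation, normalized by N-1
print(s.std())
# population standard deviation, normalized by N (matches np.std)
print(s.std(ddof=0))
```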
Pandas Index
Pandas Index is defined as a vital tool that selects particular rows and columns of data
from a DataFrame. Its task is to organize the data and to provide fast access to data.
It can also be called Subset Selection.

An individual value of the index is called a label.

If we want to compare the data access time with and without indexing, we can use
%%timeit to compare the time required for various access operations.

We can also define an index like an address through which any data can be accessed
across the Series or DataFrame. A DataFrame is a combination of three different
components, the index, columns, and the data.

Axis and axes


An axis is a common term that refers to rows and columns, whereas axes are the
collection of these rows and columns.

Creating index
First, we have to take a CSV file that consists of some data used for indexing.

# importing pandas package   
import pandas as pd     
data = pd.read_csv("aa.csv")  
data  

Output:

Name Hire Date Salary Leaves Remaining


0 John Idle 03/15/14 50000.0 10
1 Smith Gilliam 06/01/15 65000.0 8
2 Parker Chapman 05/12/14 45000.0 10
3 Jones Palin 11/01/13 70000.0 3
4 Terry Gilliam 08/12/14 48000.0 7
5 Michael Palin 05/23/13 66000.0 8

Example1
# importing pandas package   
import pandas as pd     
# making data frame from csv file   
info = pd.read_csv("aa.csv", index_col ="Name")    
# retrieving multiple columns by indexing operator   
a = info[["Hire Date", "Salary"]]    
print(a)  

Output:

                Hire Date   Salary
Name
John Idle        03/15/14  50000.0
Smith Gilliam    06/01/15  65000.0
Parker Chapman   05/12/14  45000.0
Jones Palin      11/01/13  70000.0
Terry Gilliam    08/12/14  48000.0
Michael Palin    05/23/13  66000.0

Example2:

# importing pandas package   
import pandas as pd   
    
# making data frame from csv file   
info = pd.read_csv("aa.csv", index_col ="Name")   
    
# retrieving columns by indexing operator   
a = info["Salary"]   
print(a)   

Output:

Name
John Idle         50000.0
Smith Gilliam     65000.0
Parker Chapman    45000.0
Jones Palin       70000.0
Terry Gilliam     48000.0
Michael Palin     66000.0
Name: Salary, dtype: float64

Set index
The 'set_index' method is used to set the DataFrame index using existing columns. The
new index can replace the existing index or expand it.

It sets a list, Series, or DataFrame as the index of the DataFrame.

info = pd.DataFrame({'Name': ['Parker', 'Terry', 'Smith', 'William'],  
'Year': [2011, 2009, 2014, 2010],  
'Leaves': [10, 15, 9, 4]})  
info  
info.set_index('Name')  
info.set_index(['Year', 'Name'])  
info.set_index([pd.Index([1, 2, 3, 4]), 'Year'])  
a = pd.Series([1, 2, 3, 4])  
info.set_index([a, a**2])  

Output:

Name Year Leaves


1 1 Parker 2011 10
2 4 Terry 2009 15
3 9 Smith 2014 9
4 16 William 2010 4
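set_index also accepts drop and append keyword arguments; a minimal sketch of what they do, reusing the frame above (as the author understands these parameters):

```python
import pandas as pd

info = pd.DataFrame({'Name': ['Parker', 'Terry', 'Smith', 'William'],
                     'Year': [2011, 2009, 2014, 2010],
                     'Leaves': [10, 15, 9, 4]})
# drop=False keeps the column in the frame while also using it as the index
kept = info.set_index('Name', drop=False)
# append=True adds a new level instead of replacing the existing index
stacked = info.set_index('Year', append=True)
print(kept.index.name, list(kept.columns))
print(stacked.index.names)
```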

Multiple Index
We can also have multiple indexes in the data.

Example1:

import pandas as pd  
import numpy as np  
pd.MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],   
codes=[[0, -1, 1, 2, 3, 4]])  

Output:

MultiIndex(levels=[[nan, None, NaT, 128, 2]],


codes=[[0, -1, 1, 2, 3, 4]])

Reset index
We can also reset the index using the 'reset_index' command, which moves the current
index back into an ordinary column.

Example:

import pandas as pd  
import numpy as np  
info = pd.DataFrame([('William', 'C'),  
('Smith', 'Java'),  
('Parker', 'Python'),  
('Phill', np.nan)],  
index=[1, 2, 3, 4],  
columns=('name', 'Language'))  
info  
info.reset_index()  

Output:

index name Language


0 1 William C
1 2 Smith Java
2 3 Parker Python
3 4 Phill NaN

Multiple Index
Multiple indexing (hierarchical indexing) is essential for data analysis and
manipulation, especially when working with higher-dimensional data. It enables us to
store and manipulate data with an arbitrary number of dimensions in lower-dimensional
data structures like Series and DataFrame.

It is the hierarchical analogue of the standard Index object, which is used to store
the axis labels in pandas objects. It can also be defined as an array of tuples where
each tuple is unique. It can be created from a list of arrays, an array of tuples, or
a crossed set of iterables.

Example:

arrays = [['it', 'it', 'of', 'of', 'for', 'for', 'then', 'then'],  
['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]  
tuples = list(zip(*arrays))  
tuples  
Output:

[('it', 'one'),
('it', 'two'),
('of', 'one'),
('of', 'two'),
('for', 'one'),
('for', 'two'),
('then', 'one'),
('then', 'two')]

Example2:

arrays = [['it', 'it', 'of', 'of', 'for', 'for', 'then', 'then'],  
['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]  
tuples = list(zip(*arrays))  
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])  
index  

Output:

MultiIndex([(  'it', 'one'),
            (  'it', 'two'),
            (  'of', 'one'),
            (  'of', 'two'),
            ( 'for', 'one'),
            ( 'for', 'two'),
            ('then', 'one'),
            ('then', 'two')],
           names=['first', 'second'])
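As mentioned above, a MultiIndex can also be created from a crossed set of iterables; a minimal sketch using MultiIndex.from_product (the labels are illustrative):

```python
import pandas as pd

# the Cartesian product of the two lists becomes the index tuples
index = pd.MultiIndex.from_product([['it', 'of'], ['one', 'two']],
                                   names=['first', 'second'])
print(index)
```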

Example3:

import pandas as pd  
import numpy as np  
pd.MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],   
codes=[[0, -1, 1, 2, 3, 4]])  

Output:

MultiIndex(levels=[[nan, None, NaT, 128, 2]],


codes=[[0, -1, 1, 2, 3, 4]])
Reindex
The main task of the Pandas reindex is to conform a DataFrame to a new index with
optional filling logic, placing NA/NaN in locations that have no value in the previous
index. It returns a new object unless the new index is equivalent to the current one
and the value of copy is False.

Reindexing is used to change the index of the rows and columns of the DataFrame. We
can reindex single or multiple rows by using the reindex() method. Values in the new
index that are not present in the DataFrame are assigned NaN by default.

Syntax:

1. DataFrame.reindex(labels=None, index=None, columns=None, axis=None, method=None, copy
=True, level=None, fill_value=nan, limit=None, tolerance=None)  

Parameters:
labels: It is an optional parameter that refers to the new labels or the index to conform
to the axis that is specified by the 'axis'.

index, columns : It is also an optional parameter that refers to the new labels or the
index. It generally prefers an index object for avoiding the duplicate data.

axis : It is also an optional parameter that targets the axis and can be either the axis
name or the numbers.

method: It is also an optional parameter that is to be used for filling the holes in the
reindexed DataFrame. It can only be applied to the DataFrame or Series with a
monotonically increasing/decreasing order.

None: It is the default value; it does not fill the gaps.

pad / ffill: It propagates the last valid observation forward to the next valid
observation.

backfill / bfill: To fill the gap, it uses the next valid observation.

nearest: To fill the gap, it uses the nearest valid observation.


copy: Its default value is True and returns a new object as a boolean value, even if the
passed indexes are the same.

level : It is used to broadcast across the level, and match index values on the passed
MultiIndex level.

fill_value : Its default value is np.NaN; it is used to fill existing missing (NaN)
values, and any new element needed for successful DataFrame alignment is filled with
this value before computation.

limit : It defines the maximum number of consecutive elements to be forward- or
backward-filled.

tolerance : It is also an optional parameter that determines the maximum distance
between original and new labels for inexact matches. At the matching locations, the
values of the index must satisfy the equation abs(index[indexer] - target) <=
tolerance.

Returns :
It returns reindexed DataFrame.

Example 1:
The below example shows the working of the reindex() function to reindex the dataframe.
Values in the new index that have no corresponding records in the DataFrame are
assigned NaN.

Note: We can use fill_value for filling the missing values.


import pandas as pd  
  
# Create dataframe  
info = pd.DataFrame({"P":[4, 7, 1, 8, 9],   
                   "Q":[6, 8, 10, 15, 11],   
                   "R":[17, 13, 12, 16, 14],   
                   "S":[15, 19, 7, 21, 9]},   
                   index =["Parker", "William", "Smith", "Terry", "Phill"])   
  
# Print dataframe  
info  

Output:

         P   Q   R   S
Parker   4   6  17  15
William  7   8  13  19
Smith    1  10  12   7
Terry    8  15  16  21
Phill    9  11  14   9

Now, we can use the dataframe.reindex() function to reindex the dataframe.

1. # reindexing with new index values   
2. info.reindex(["A", "B", "C", "D", "E"])  

Output:

P Q R S
A NaN NaN NaN NaN
B NaN NaN NaN NaN
C NaN NaN NaN NaN
D NaN NaN NaN NaN
E NaN NaN NaN NaN

Notice that the new indexes are populated with NaN values. We can fill in the missing
values using the fill_value parameter.

1. # filling the missing values by 100   
2. info.reindex(["A", "B", "C", "D", "E"], fill_value =100)  

Output:

P Q R S
A 100 100 100 100
B 100 100 100 100
C 100 100 100 100
D 100 100 100 100
E 100 100 100 100
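The filling methods described earlier (pad/ffill, backfill/bfill, nearest) require a monotonically ordered index; a minimal sketch of method='ffill' on a small numeric index (the values are illustrative):

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=[0, 2, 4])
# method='ffill' propagates the last valid observation into the holes
filled = s.reindex(range(5), method='ffill')
print(filled)
```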

Example 2:

This example shows the working of reindex() function to reindex the column axis.

# importing pandas as pd  
import pandas as pd  
    
# reindexing the column axis of the dataframe   
# from Example 1 with new column labels   
info.reindex(columns =["A", "B", "D", "E"])  

Output:

A B D E
Parker NaN NaN NaN NaN
William NaN NaN NaN NaN
Smith NaN NaN NaN NaN
Terry NaN NaN NaN NaN
Phill NaN NaN NaN NaN

Notice that NaN values are present in the new columns after reindexing; we can pass the
fill_value argument to the function to fill those NaN values.

1. # reindex the columns   
2. # fill the missing values by 37   
3. info.reindex(columns =["A", "B", "D", "E"], fill_value =37)  

Output:

A B D E
Parker 37 37 37 37
William 37 37 37 37
Smith 37 37 37 37
Terry 37 37 37 37
Phill 37 37 37 37

Reset Index
The reset index operation of the DataFrame is used to reset the index by using the
'reset_index' command. If the DataFrame has a MultiIndex, this method can remove one or
more levels.
Syntax:

1. DataFrame.reset_index(self, level=None, drop=False, inplace=False, col_level=0, col_fill='')  

Parameters:

level : Refers to int, str, tuple, or list, default value None

It is used to remove the given levels from the index and also removes all levels by
default.

drop : Refers to Boolean value, default value False

It resets the index to the default integer index.

inplace : Refers to Boolean value, default value False

It is used to modify the DataFrame in place and does not require to create a new object.

col_level : Refers to int or str, default value 0

It determines the level into which the labels are inserted if the columns have
multiple levels.

col_fill : Refers to an object, default value ''

It determines how the other levels are named if the columns have multiple levels.

Example1:

import pandas as pd  
import numpy as np  
info = pd.DataFrame([('William', 'C'),  
('Smith', 'Java'),  
('Parker', 'Python'),  
('Phill', np.nan)],  
index=[1, 2, 3, 4],  
columns=('name', 'Language'))  
info  
info.reset_index()  

Output:

index name Language


0 1 William C
1 2 Smith Java
2 3 Parker Python
3 4 Phill NaN
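The drop parameter described above can be sketched as follows (the data is illustrative): with drop=True, the old index is discarded rather than inserted as a column.

```python
import pandas as pd

info = pd.DataFrame([('William', 'C'), ('Smith', 'Java')],
                    index=[1, 2], columns=('name', 'Language'))
# drop=True discards the old index instead of inserting it as an 'index' column
print(info.reset_index(drop=True))
```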

Pandas Time Series


Time series data is an important source of information for forming strategy in various
businesses. From the conventional finance industry to the education industry, such data
carries a lot of detail about time.

Time series forecasting is machine learning modeling that deals with time series data
to predict future values.

Pandas has extensive capabilities and features for working with time series data in all
domains. Using the NumPy datetime64 and timedelta64 dtypes, Pandas has consolidated
features from other Python libraries like scikits.timeseries and created a tremendous
amount of new functionality for manipulating time series data.

For example, pandas support to parse the time-series information from various sources
and formats.

Importing Packages and Data


Before starting, you have to import the packages that will be used: numpy, pandas,
matplotlib, and seaborn.

You can display the plots inline in a Jupyter Notebook by adding %matplotlib inline to
the code, and you can switch to Seaborn defaults by using sns.set():

# import packages  
import numpy as np  
import pandas as pd  
import matplotlib.pyplot as plt  
import seaborn as sns  
%matplotlib inline  
sns.set()  
Date and time
Pandas provides a number of functionalities for dates, times, deltas, and timespans,
which are mainly used in data science applications.

Native dates and times:


Python's native date and time types reside in the built-in datetime module. We can also
perform many useful operations on dates and times by using the dateutil package, which
can parse dates from a variety of string formats:

Example1:

import pandas as pd     
# Create the dates with frequency     
info = pd.date_range('5/4/2013', periods = 8, freq ='S')     
info  

Output:

DatetimeIndex(['2013-05-04 00:00:00', '2013-05-04 00:00:01',


'2013-05-04 00:00:02', '2013-05-04 00:00:03',
'2013-05-04 00:00:04', '2013-05-04 00:00:05',
'2013-05-04 00:00:06', '2013-05-04 00:00:07'],
dtype='datetime64[ns]', freq='S')

Example2:

import pandas as pd   
# Create the Timestamp   
p = pd.Timestamp('2018-12-12 06:25:18')   
# Create the DateOffset   
do = pd.tseries.offsets.DateOffset(n = 2)   
# Print the Timestamp   
print(p)   
# Print the DateOffset   
print(do)  

Output:
2018-12-12 06:25:18
<2 * DateOffsets>

Pandas Datetime
Pandas provides features to work with time-series data for all domains. It also
consolidates a large number of features from other Python libraries like
scikits.timeseries by using the NumPy datetime64 and timedelta64 dtypes, and it
provides new functionality for manipulating time series data.

The time series tools are most useful for data science applications and deals with other
packages used in Python.

Example1:

import pandas as pd     
# Create the dates with frequency     
info = pd.date_range('5/4/2013', periods = 8, freq ='S')     
info  

Output:

DatetimeIndex(['2013-05-04 00:00:00', '2013-05-04 00:00:01',


'2013-05-04 00:00:02', '2013-05-04 00:00:03',
'2013-05-04 00:00:04', '2013-05-04 00:00:05',
'2013-05-04 00:00:06', '2013-05-04 00:00:07'],
dtype='datetime64[ns]', freq='S')

Example2:

import pandas as pd  
info = pd.DataFrame({'year': [2014, 2012],  
'month': [5, 7],  
'day': [20, 17]})  
pd.to_datetime(info)  

Output:

0   2014-05-20
1   2012-07-17
dtype: datetime64[ns]

You can pass errors='ignore' if the input cannot be parsed as a date; it will return
the original input without raising any exception.
If you pass errors='coerce', unparseable or out-of-bounds dates are forced to NaT.

import pandas as pd  
pd.to_datetime('18000706', format='%Y%m%d', errors='coerce')  

Output:

Timestamp('1800-07-06 00:00:00')

Example3:

import pandas as pd  
dmy = pd.date_range('2017-06-04', periods=5, freq='S')  
dmy  

Output:

DatetimeIndex(['2017-06-04 00:00:00',
'2017-06-04 00:00:01',
'2017-06-04 00:00:02',
'2017-06-04 00:00:03',
'2017-06-04 00:00:04'],
dtype='datetime64[ns]', freq='S')

Example4:

import pandas as pd  
dmy = pd.date_range('2017-06-04', periods=5, freq='S')  
dmy = dmy.tz_localize('UTC')  
dmy  

Output:

DatetimeIndex(['2017-06-04 00:00:00+00:00', '2017-06-04 00:00:01+00:00',


'2017-06-04 00:00:02+00:00',
'2017-06-04 00:00:03+00:00',
'2017-06-04 00:00:04+00:00'],
dtype='datetime64[ns, UTC]', freq='S')
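Once localized, timestamps can be converted to another time zone with tz_convert; a minimal sketch (the target time zone is illustrative):

```python
import pandas as pd

dmy = pd.date_range('2017-06-04', periods=3, freq='S').tz_localize('UTC')
# tz_convert changes the display time zone without changing the instant in time
print(dmy.tz_convert('US/Eastern'))
```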


Pandas Time Offset


The time series tools are most useful for data science applications and work alongside
other packages used in Python. A time offset performs various operations on times,
i.e., adding and subtracting.

The offset specifies a set of dates that conform to the DateOffset. We can create the
DateOffsets to move the dates forward to valid dates.

If the date is not valid, we can use the rollback and rollforward methods to roll the
date to its nearest valid date before or after the given date. The pseudo-code of time
offsets is as follows:

Syntax:

1. class pandas.tseries.offsets.DateOffset(n=1, normalize=False, **kwds)  

def __add__(date):
    date = rollback(date)  # does nothing if the date is already a valid offset date
    return date + <n number of periods>

When we create a date offset for a negative number of periods, the date will be rolled
forward.

Parameters:
n: Refers to int, default value is 1.

It is the number of time periods that represents the offsets.


normalize: Refers to a boolean value, default value False.

**kwds

It is an optional parameter that adds or replaces the offset value.

The parameters used for adding to the offset are as follows:

o years
o months
o weeks
o days
o hours
o minutes
o seconds
o microseconds
o nanoseconds

The parameters used for replacing the offset value are as follows:

o year
o month
o day
o weekday
o hour
o minute
o second
o microsecond
o nanosecond

Example:

import pandas as pd   
# Create the Timestamp   
p = pd.Timestamp('2018-12-12 06:25:18')   
# Create the DateOffset   
do = pd.tseries.offsets.DateOffset(n = 2)   
# Print the Timestamp   
print(p)   
# Print the DateOffset   
print(do)  

Output:

2018-12-12 06:25:18
<2 * DateOffsets>

Example2:

import pandas as pd     
# Create the Timestamp   
p = pd.Timestamp('2018-12-12 06:25:18')     
# Create the DateOffset   
do = pd.tseries.offsets.DateOffset(n = 2)     
# Add the dateoffset to given timestamp   
new_timestamp = p + do   
# Print updated timestamp   
print(new_timestamp)  

Output:

2018-12-14 06:25:18
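The plural and singular keyword parameters listed earlier behave differently; a minimal sketch (the offsets are chosen for illustration): plural keywords add to the timestamp, while singular keywords replace the corresponding field.

```python
import pandas as pd

p = pd.Timestamp('2018-12-12 06:25:18')
# plural keywords (months, days, ...) add to the timestamp
add = p + pd.tseries.offsets.DateOffset(months=2, days=1)
# singular keywords (month, day, ...) replace that field instead
replaced = p + pd.tseries.offsets.DateOffset(month=1)
print(add)
print(replaced)
```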

Pandas Time Periods


Time Periods represent spans of time, e.g., days, years, quarters, or months. The
Period class allows us to convert between frequencies and periods.

Generating periods and frequency conversion


We can generate a period by using the 'Period' command with frequency 'M'. If we use
the 'asfreq' operation with the 'start' option, the date prints '01', whereas with the
'end' option it prints '31'.
Example:

import pandas as pd
x = pd.Period('2014', freq='M')
x.asfreq('D', 'start')

Output:

Period('2014-01-01', 'D')

Example:

import pandas as pd
x = pd.Period('2014', freq='M')
x.asfreq('D', 'end')

Output:

Period('2014-01-31', 'D')

Period arithmetic
Period arithmetic is used to perform various arithmetic operations on periods. All the
operations are performed on the basis of 'freq'.

import pandas as pd  
x = pd.Period('2014', freq='Q')   
x  

Output:

Period('2014Q1', 'Q-DEC')

Example:

import pandas as pd
x = pd.Period('2014', freq='Q')
x + 1

Output:
Period('2014Q2', 'Q-DEC')
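Beyond adding integers, two periods of the same frequency can also be subtracted. In recent pandas versions the result is an offset whose n attribute gives the number of periods between them (a sketch; the printed representation may differ across pandas versions):

```python
import pandas as pd

q1 = pd.Period('2014Q1', freq='Q')
q3 = pd.Period('2014Q3', freq='Q')

# Subtracting periods of the same frequency yields an offset
diff = q3 - q1
print(diff.n)  # the two periods are 2 quarters apart
```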

Creating period range


We can create the range of period by using the 'period_range' command.

import pandas as pd
p = pd.period_range('2012', '2017', freq='A')
p

Output:

PeriodIndex(['2012', '2013', '2014', '2015', '2016', '2017'],
            dtype='period[A-DEC]', freq='A-DEC')

Converting string-dates to period


If we want to convert string-dates to periods, first we need to convert the strings to
date format, and then we can convert the dates into periods.

import pandas as pd
# dates as string
p = ['2012-06-05', '2011-07-09', '2012-04-06']
# convert string to date format
x = pd.to_datetime(p)
x

Output:

DatetimeIndex(['2012-06-05', '2011-07-09', '2012-04-06'],
              dtype='datetime64[ns]', freq=None)
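The final step, converting the resulting DatetimeIndex to periods, can then be done with the to_period method (daily frequency is used here as an example):

```python
import pandas as pd

# strings converted to a DatetimeIndex, as in the example above
x = pd.to_datetime(['2012-06-05', '2011-07-09', '2012-04-06'])

# to_period converts the DatetimeIndex into a PeriodIndex
prd = x.to_period('D')
print(prd)
```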

Convert periods to timestamps


If we want to convert periods back to timestamps, we can simply do it by using the
'to_timestamp' command.

import pandas as pd
# a PeriodIndex with daily frequency
prd = pd.PeriodIndex(['2017-04-02', '2016-04-06', '2016-05-08'], freq='D')
prd.to_timestamp()

Output:

DatetimeIndex(['2017-04-02', '2016-04-06', '2016-05-08'],
              dtype='datetime64[ns]', freq=None)

Convert string to date


Analyzing datasets with dates and times can be a tedious task: the different lengths of
months, the distribution of weekdays and weekends, leap years, and time zones are all
things that need to be considered depending on our context. For this reason, Python
defines a data type especially for dates and times, called datetime.

However, in many datasets, Strings are used to represent the dates. So, in this topic,
you'll learn about converting date strings to the datetime format and see how these
powerful set of tools helps to work effectively with complicated time series data.

The challenge in this scenario is how the date strings are expressed. For example,
'Wednesday, June 6, 2018' can also be written as '6/6/18' or '06-06-2018'. All these
formats define the same date, but the code to convert each of them is slightly
different.

from datetime import datetime

# Define dates as strings
dmy_str1 = 'Wednesday, July 14, 2018'
dmy_str2 = '14/7/17'
dmy_str3 = '14-07-2017'

# Define dates as datetime objects
dmy_dt1 = datetime.strptime(dmy_str1, '%A, %B %d, %Y')
dmy_dt2 = datetime.strptime(dmy_str2, '%d/%m/%y')
dmy_dt3 = datetime.strptime(dmy_str3, '%d-%m-%Y')

# Print the converted dates
print(dmy_dt1)
print(dmy_dt2)
print(dmy_dt3)

Output:

2018-07-14 00:00:00
2017-07-14 00:00:00
2017-07-14 00:00:00

Converting the date string column


This conversion shows how to convert whole column of date strings from the dataset to
datetime format.

From now on, you will work with a DataFrame called eth that contains historical data on
ether, a cryptocurrency whose blockchain is generated by the Ethereum platform. The
dataset consists of the following columns:

o date: Defines the actual date, daily at 00:00 UTC.
o txVolume: Refers to an unadjusted measure of the total value, in US dollars, of
outputs on the blockchain.
o txCount: Defines the number of transactions performed on the public blockchain.
o marketCap: Refers to the unit price in US dollars multiplied by the number of units in
circulation.
o price: Refers to the opening price in US dollars at 00:00 UTC.
o generatedCoins: Refers to the number of newly generated coins.
o exchangeVolume: Refers to the actual volume, measured in US dollars, at exchanges
like GDAX and Bitfinex.
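Converting a whole column of date strings works the same way as converting a list. Since the eth dataset itself is not included here, the following sketch uses a tiny stand-in DataFrame with made-up values:

```python
import pandas as pd

# A tiny stand-in for the 'eth' dataset; the price values are hypothetical
eth = pd.DataFrame({
    'date':  ['2017-01-01', '2017-01-02', '2017-01-03'],
    'price': [8.17, 8.38, 9.73],
})

# to_datetime converts the entire column of strings in one call;
# an explicit format avoids ambiguity between day-first and month-first dates
eth['date'] = pd.to_datetime(eth['date'], format='%Y-%m-%d')
print(eth['date'].dtype)  # datetime64[ns]
```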
