Python Pandas
Before Pandas, Python was capable of data preparation, but it provided only limited
support for data analysis. Pandas came into the picture and enhanced the
capabilities of data analysis. It can perform the five significant steps required for processing
and analysis of data irrespective of the origin of the data, i.e., load, manipulate,
prepare, model, and analyze.
Benefits of Pandas
The benefits of Pandas over other tools and languages are as follows:
o Data Representation: It represents the data in a form that is suited for data
analysis through its DataFrame and Series.
o Clear code: The clear API of Pandas allows you to focus on the core part of
the code, so it leads to clear and concise code for the user.
1) Series
It is defined as a one-dimensional array that is capable of storing various data types. The
row labels of a Series are called the index. We can easily convert a list, tuple, or
dictionary into a Series using the Series() constructor. A Series cannot contain multiple columns.
Before creating a Series from an array, we first have to import the numpy module and then use
the array() function in the program.
import pandas as pd
import numpy as np
info = np.array(['P','a','n','d','a','s'])
a = pd.Series(info)
print(a)
Output
0 P
1 a
2 n
3 d
4 a
5 s
dtype: object
2) DataFrame
A DataFrame is a two-dimensional, tabular data structure with labelled rows and columns.
o The columns can be heterogeneous types like int, bool, and so on.
o It can be seen as a dictionary of Series structures where both the rows and
columns are indexed. The column labels are denoted as "columns" and the row labels
as "index".
import pandas as pd
# a list of strings
x = ['Python', 'Pandas']
# Calling DataFrame constructor on list
df = pd.DataFrame(x)
print(df)
Output
0
0 Python
1 Pandas
Creating a Series:
We can create a Series in two ways:
<series object> = pandas.Series()
The below example creates an empty Series object that has no values. Its default
datatype was float64 in older pandas releases and is object from pandas 2.0 onwards.
Example
import pandas as pd
x = pd.Series()
print (x)
Output
o Array
o Dict
o Scalar value
Before creating a Series from an array, we first have to import the numpy module and then use
the array() function in the program. If the data is an ndarray, then the passed index must be of
the same length.
If we do not pass an index, then by default an index of range(n) is used, where n
is the length of the array, i.e., [0, 1, 2, ..., len(array) - 1].
Example
import pandas as pd
import numpy as np
info = np.array(['P','a','n','d','a','s'])
a = pd.Series(info)
print(a)
Output
0 P
1 a
2 n
3 d
4 a
5 s
dtype: object
We can also create a Series from a dict. If a dictionary object is passed as
input and the index is not specified, then the dictionary keys are used to construct
the index. Recent versions of pandas keep the keys in insertion order, whereas older
versions sorted them.
If an index is passed, then the values corresponding to the labels in the index are
extracted from the dictionary.
#import the pandas library
import pandas as pd
import numpy as np
info = {'x' : 0., 'y' : 1., 'z' : 2.}
a = pd.Series(info)
print (a)
Output
x 0.0
y 1.0
z 2.0
dtype: float64
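The effect of passing an explicit index together with a dictionary can be sketched as follows. This is a minimal illustration; the label 'w' is a hypothetical label that is deliberately absent from the dictionary, so its value should come out as NaN.

```python
import pandas as pd

info = {'x': 0.0, 'y': 1.0, 'z': 2.0}

# Request the labels in a custom order; 'w' is not a key of the dictionary
a = pd.Series(info, index=['y', 'x', 'w'])
print(a)
```

Only the labels listed in the index are kept, in the order given, and 'z' is dropped because it was not requested.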
If we take the scalar values, then the index must be provided. The scalar value will be
repeated for matching the length of the index.
#import pandas library
import pandas as pd
import numpy as np
x = pd.Series(4, index=[0, 1, 2, 3])
print (x)
Output
0 4
1 4
2 4
3 4
dtype: int64
The data in the Series can be accessed by position, similar to an ndarray. With a
non-integer index, use iloc for positional access.
import pandas as pd
x = pd.Series([1,2,3],index = ['a','b','c'])
#retrieve the first element by position
print (x.iloc[0])
Output
Attributes Description
import numpy as np
import pandas as pd
x=pd.Series(data=[2,4,6,8])
y=pd.Series(data=[11.2,18.6,22.5], index=['a','b','c'])
print(x.index)
print(x.values)
print(y.index)
print(y.values)
Output
import numpy as np
import pandas as pd
a=pd.Series(data=[1,2,3,4])
b=pd.Series(data=[4.9,8.2,5.6],
index=['x','y','z'])
print(a.dtype)
# Series.itemsize was removed in pandas 1.0; read it from the dtype instead
print(a.dtype.itemsize)
print(b.dtype)
print(b.dtype.itemsize)
Output
int64
8
float64
8
Retrieving Shape
The shape of a Series object gives the total number of elements, including missing or
empty values (NaN).
import numpy as np
import pandas as pd
a=pd.Series(data=[1,2,3,4])
b=pd.Series(data=[4.9,8.2,5.6],index=['x','y','z'])
print(a.shape)
print(b.shape)
Output
(4,)
(3,)
import numpy as np
import pandas as pd
a=pd.Series(data=[1,2,3,4])
b=pd.Series(data=[4.9,8.2,5.6],
index=['x','y','z'])
print(a.ndim, b.ndim)
print(a.size, b.size)
print(a.nbytes, b.nbytes)
Output
1 1
4 3
32 24
Example
import numpy as np
import pandas as pd
a=pd.Series(data=[1,2,3,np.nan])
b=pd.Series(data=[4.9,8.2,5.6],index=['x','y','z'])
c=pd.Series()
print(a.empty,b.empty,c.empty)
print(a.hasnans,b.hasnans,c.hasnans)
print(len(a),len(b))
print(a.count( ),b.count( ))
Output
Functions              Description
Pandas Series.map()    Substitute each value of a Series using a dictionary, another Series, or a function.
Pandas Series.std()    Calculate the standard deviation of the given set of numbers, DataFrame, column, and rows.
Pandas Series.map()
The main task of map() is to substitute each value of a Series with another value, which
can come from a dictionary, another Series, or a function. When mapping with a second
Series, the values of the first Series are looked up in the index of the second Series,
so that index should be unique.
Syntax
Series.map(arg, na_action=None)
Parameters
Example
import pandas as pd
import numpy as np
a = pd.Series(['Java', 'C', 'C++', np.nan])
a.map({'Java': 'Core'})
Output
0 Core
1 NaN
2 NaN
3 NaN
dtype: object
Example2
import pandas as pd
import numpy as np
a = pd.Series(['Java', 'C', 'C++', np.nan])
a.map('I like {}'.format)
Output
0 I like Java
1 I like C
2 I like C++
3 I like nan
dtype: object
Example3
import pandas as pd
import numpy as np
a = pd.Series(['Java', 'C', 'C++', np.nan])
a.map({'Java': 'Core'})
a.map('I like {}'.format)
a.map('I like {}'.format, na_action='ignore')
Output
0 I like Java
1 I like C
2 I like C++
3 NaN
dtype: object
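Mapping with a second Series, as described at the start of this section, can be sketched as below. The lookup Series and its values are hypothetical and only serve the illustration: its index holds the keys and its values hold the replacements.

```python
import pandas as pd

languages = pd.Series(['Java', 'C', 'Java'])

# Hypothetical lookup table: index = keys to match, values = replacements
lookup = pd.Series(['Core', 'Procedural'], index=['Java', 'C'])

# Each value of `languages` is looked up in the index of `lookup`
mapped = languages.map(lookup)
print(mapped)
```

Values that do not appear in the index of the lookup Series would become NaN, just as with a dictionary.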
Pandas Series.std()
The Pandas std() function calculates the standard deviation of a given set of numbers,
a DataFrame, a column, or rows. It is a method of Series and DataFrame, so no
additional package has to be imported to use it.
The standard deviation is normalized by N-1 by default and can be changed using
the ddof argument.
Syntax:
Series.std(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)
Parameters:
Example1:
import pandas as pd
# calculate the population standard deviation with NumPy (np.std uses ddof=0)
import numpy as np
print(np.std([4,7,2,1,6,3]))
print(np.std([6,9,15,2,-17,15,4]))
Output
2.1147629234082532
10.077252622027656
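Since np.std divides by N while Series.std divides by N-1 by default, the two give different results for the same numbers. The following sketch, reusing the first list from the example above, shows how the ddof argument reconciles them.

```python
import numpy as np
import pandas as pd

values = [4, 7, 2, 1, 6, 3]
s = pd.Series(values)

sample_std = s.std()             # pandas default: ddof=1, divide by N-1
population_std = s.std(ddof=0)   # divide by N, matching np.std's default

print(sample_std)
print(population_std)
print(np.std(values))
```

The sample standard deviation is always at least as large as the population one, because the divisor is smaller.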
Example2:
import pandas as pd
import numpy as np
#Create a DataFrame
info = {
'Name':['Parker','Smith','John','William'],
'sub1_Marks':[52,38,42,37],
'sub2_Marks':[41,35,29,36]}
data = pd.DataFrame(info)
data
# standard deviation of the numeric columns of the dataframe
data.std(numeric_only=True)
Output
sub1_Marks 6.849574
sub2_Marks 4.924429
dtype: float64
Pandas Index
The Pandas Index is a vital tool for selecting particular rows and columns of data
from a DataFrame. Its task is to organize the data and to provide fast access to it.
This kind of selection is also called Subset Selection.
The values are in bold font in the index, and the individual value of the index is called
a label.
If we want to compare the data access time with and without indexing, we can use the
%%timeit magic to time various access operations.
We can also define an index like an address through which any data can be accessed
across the Series or DataFrame. A DataFrame is a combination of three different
components, the index, columns, and the data.
Creating index
First, we have to take a CSV file that contains some data to be used for indexing.
# importing pandas package
import pandas as pd
data = pd.read_csv("aa.csv")
data
Output:
Example1
# importing pandas package
import pandas as pd
# making data frame from csv file
info = pd.read_csv("aa.csv", index_col ="Name")
# retrieving multiple columns by indexing operator
a = info[["Hire Date", "Salary"]]
print(a)
Output:
Example2:
# importing pandas package
import pandas as pd
# making data frame from csv file
info = pd.read_csv("aa.csv", index_col ="Name")
# retrieving columns by indexing operator
a = info["Salary"]
print(a)
Output:
Name Salary
0 John Idle 50000.0
1 Smith Gilliam 65000.0
2 Parker Chapman 45000.0
3 Jones Palin 70000.0
4 Terry Gilliam 48000.0
5 Michael Palin 66000.0
Set index
The 'set_index' is used to set the DataFrame index using existing columns. An index can
replace the existing index and can also expand the existing index.
import pandas as pd
info = pd.DataFrame({'Name': ['Parker', 'Terry', 'Smith', 'William'],
'Year': [2011, 2009, 2014, 2010],
'Leaves': [10, 15, 9, 4]})
info
info.set_index('Name')
# column labels are case-sensitive, so the column must be referred to as 'Year'
info.set_index(['Year', 'Name'])
info.set_index([pd.Index([1, 2, 3, 4]), 'Year'])
a = pd.Series([1, 2, 3, 4])
info.set_index([a, a**2])
Output:
Multiple Index
We can also have multiple indexes in the data.
Example1:
import pandas as pd
import numpy as np
pd.MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
codes=[[0, -1, 1, 2, 3, 4]])
Output:
Reset index
We can also reset the index using the 'reset_index' command. Let's look at the 'info'
DataFrame below.
Example:
import pandas as pd
import numpy as np
info = pd.DataFrame([('William', 'C'),
('Smith', 'Java'),
('Parker', 'Python'),
('Phill', np.nan)],
index=[1, 2, 3, 4],
columns=('name', 'Language'))
info
info.reset_index()
Output:
Multiple Index
Multiple indexing is essential for data analysis and manipulation, especially when
working with higher dimensional data. It also makes it possible to store and
manipulate data with an arbitrary number of dimensions in lower
dimensional data structures like Series and DataFrame.
It is the hierarchical analogue of the standard index object which is used to store the
axis labels in pandas objects. It can also be defined as an array of tuples where each
tuple is unique. It can be created from a list of arrays, an array of tuples, or a crossed
set of iterables.
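The three ways of creating a MultiIndex mentioned above can be sketched side by side. The labels are a shortened version of the ones used in the examples of this section, and the level names 'first' and 'second' are just illustrative.

```python
import pandas as pd

arrays = [['it', 'it', 'of', 'of'],
          ['one', 'two', 'one', 'two']]
names = ['first', 'second']

# 1. From a list of arrays (one array per level)
from_arrays = pd.MultiIndex.from_arrays(arrays, names=names)

# 2. From an array of tuples (one tuple per row)
from_tuples = pd.MultiIndex.from_tuples(list(zip(*arrays)), names=names)

# 3. From a crossed set of iterables (Cartesian product of the levels)
from_product = pd.MultiIndex.from_product([['it', 'of'], ['one', 'two']], names=names)

print(from_arrays.equals(from_tuples))
print(from_arrays.equals(from_product))
```

All three constructors produce the same index here; from_product is the most compact choice when every combination of the level values is wanted.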
Example:
arrays = [['it', 'it', 'of', 'of', 'for', 'for', 'then', 'then'],
['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
tuples
Output:
[('it', 'one'),
('it', 'two'),
('of', 'one'),
('of', 'two'),
('for', 'one'),
('for', 'two'),
('then', 'one'),
('then', 'two')]
Example2:
import pandas as pd
arrays = [['it', 'it', 'of', 'of', 'for', 'for', 'then', 'then'],
['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
index
Output:
MultiIndex([(  'it', 'one'),
            (  'it', 'two'),
            (  'of', 'one'),
            (  'of', 'two'),
            ( 'for', 'one'),
            ( 'for', 'two'),
            ('then', 'one'),
            ('then', 'two')],
           names=['first', 'second'])
Example3:
import pandas as pd
import numpy as np
pd.MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
codes=[[0, -1, 1, 2, 3, 4]])
Output:
Reindexing is used to change the index of the rows and columns of the DataFrame. We
can reindex single or multiple rows by using the reindex() method. Labels in the new
index that are not present in the DataFrame are assigned NaN by default.
Syntax:
DataFrame.reindex(labels=None, index=None, columns=None, axis=None, method=None, copy=True, level=None, fill_value=nan, limit=None, tolerance=None)
Parameters:
labels: It is an optional parameter that refers to the new labels or the index to conform
to the axis that is specified by the 'axis'.
index, columns : It is also an optional parameter that refers to the new labels or the
index. It generally prefers an index object for avoiding the duplicate data.
axis : It is also an optional parameter that targets the axis and can be either the axis
name or the numbers.
method: It is also an optional parameter that is used for filling the holes in the
reindexed DataFrame. It can only be applied to a DataFrame or Series with a
monotonically increasing/decreasing index.
o pad / ffill: It propagates the last valid observation forward to the next valid
observation.
o backfill / bfill: It uses the next valid observation to fill the gap.
level : It is used to broadcast across a level, matching index values on the passed
MultiIndex level.
fill_value : It is the value used for the missing entries that reindexing introduces. Its
default value is np.nan. NaN values that already exist in the data are not affected
by it.
limit : It defines the maximum number of consecutive elements to forward
or backward fill.
tolerance : It is also an optional parameter that determines the maximum distance
between original and new labels for inexact matches. At the matching locations, the
values of the index should satisfy the equation abs(index[indexer] - target) <=
tolerance.
Returns :
It returns reindexed DataFrame.
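The method and fill_value parameters described above can be compared in a short sketch. The DataFrame, its column name and its values are hypothetical; the index is kept monotonically increasing so that method='ffill' is allowed.

```python
import pandas as pd

# Hypothetical data with a monotonically increasing integer index
df = pd.DataFrame({'value': [10, 20, 30]}, index=[0, 2, 4])
new_index = [0, 1, 2, 3, 4]

plain = df.reindex(new_index)                   # labels 1 and 3 become NaN
filled = df.reindex(new_index, method='ffill')  # labels 1 and 3 copy the previous row
const = df.reindex(new_index, fill_value=0)     # labels 1 and 3 get the constant 0

print(plain)
print(filled)
print(const)
```

In all three cases the original DataFrame is left unchanged, because reindex() returns a new object.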
Example 1:
The below example shows the working of the reindex() function to reindex the DataFrame.
Labels in the new index that do not have corresponding records in the DataFrame are
assigned NaN by default.
Output:
A B D E
Parker NaN NaN NaN NaN
William NaN NaN NaN NaN
Smith NaN NaN NaN NaN
Terry NaN NaN NaN NaN
Phill NaN NaN NaN NaN
# reindexing with new index values
info.reindex(["A", "B", "C", "D", "E"])
Output:
P Q R S
A NaN NaN NaN NaN
B NaN NaN NaN NaN
C NaN NaN NaN NaN
D NaN NaN NaN NaN
E NaN NaN NaN NaN
Notice that the new indexes are populated with NaN values. We can fill in the missing
values using the fill_value parameter.
# filling the missing values by 100
info.reindex(["A", "B", "C", "D", "E"], fill_value =100)
Output:
P Q R S
A 100 100 100 100
B 100 100 100 100
C 100 100 100 100
D 100 100 100 100
E 100 100 100 100
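The reindex calls in this example operate on a DataFrame named info whose construction is not shown. A self-contained sketch is given below; the row and column labels follow the outputs in this section, but the cell values are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical cell values; only the labels are taken from the outputs above
info = pd.DataFrame(
    {'P': [1, 2, 3, 4, 5],
     'Q': [6, 7, 8, 9, 10],
     'R': [11, 12, 13, 14, 15],
     'S': [16, 17, 18, 19, 20]},
    index=['Parker', 'William', 'Smith', 'Terry', 'Phill'])

# None of the new labels exist, so every cell is NaN
new_rows = info.reindex(['A', 'B', 'C', 'D', 'E'])

# Same call, but the missing cells are filled with 100
filled = info.reindex(['A', 'B', 'C', 'D', 'E'], fill_value=100)

# Mixing an existing label with a new one keeps the existing row's data
mixed = info.reindex(['Parker', 'A'])

print(new_rows)
print(filled)
print(mixed)
```

The last call shows that reindexing only produces NaN for labels that are actually new; rows that already exist keep their values.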
Example 2:
This example shows the working of reindex() function to reindex the column axis.
# importing pandas as pd
import pandas as pd
# Creating the first dataframe
info1 = pd.DataFrame({"A":[1, 5, 3, 4, 2],
"B":[3, 2, 4, 3, 4],
"C":[2, 2, 7, 3, 4],
"D":[4, 3, 6, 12, 7]})
# reindexing the column axis with
# old and new index values
# (the output below corresponds to info, the DataFrame from Example 1)
info.reindex(columns =["A", "B", "D", "E"])
Output:
A B D E
Parker NaN NaN NaN NaN
William NaN NaN NaN NaN
Smith NaN NaN NaN NaN
Terry NaN NaN NaN NaN
Phill NaN NaN NaN NaN
Notice that NaN values are present in the new columns after reindexing. We can use the
fill_value argument of the function to replace the NaN values.
# reindex the columns
# fill the missing values by 37
info.reindex(columns =["A", "B", "D", "E"], fill_value =37)
Output:
A B D E
Parker 37 37 37 37
William 37 37 37 37
Smith 37 37 37 37
Terry 37 37 37 37
Phill 37 37 37 37
Reset Index
The reset_index() method resets the index of the DataFrame to the default integer
index. If the DataFrame has a MultiIndex, this method can remove one or more
levels.
Syntax:
DataFrame.reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill='')
Parameters:
level: It is used to remove the given levels from the index; it removes all levels by
default.
inplace: It is used to modify the DataFrame in place, without creating a new object.
col_level: It determines the level into which the labels are inserted if the columns have multiple levels.
col_fill: It determines how the other levels are named if the columns have multiple levels.
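A short sketch of how the level and drop arguments of reset_index() change the result is given below. The DataFrame is a shortened, hypothetical version of the one used in the example of this section.

```python
import pandas as pd

info = pd.DataFrame({'name': ['William', 'Smith'],
                     'Language': ['C', 'Java']},
                    index=[1, 2])

# Default: the old index is kept as a new column named 'index'
kept = info.reset_index()

# drop=True: the old index is discarded instead of becoming a column
dropped = info.reset_index(drop=True)

# With a MultiIndex, level selects which level is moved back into the columns
multi = info.set_index(['name', 'Language'])
partial = multi.reset_index(level='Language')

print(kept)
print(dropped)
print(partial)
```

In the last case only the 'Language' level is removed from the index, while 'name' remains as the row index.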
Example1:
import pandas as pd
import numpy as np
info = pd.DataFrame([('William', 'C'),
('Smith', 'Java'),
('Parker', 'Python'),
('Phill', np.nan)],
index=[1, 2, 3, 4],
columns=('name', 'Language'))
info
info.reset_index()
Output:
Time series forecasting is the machine learning modeling that deals with time series
data to predict future values.
Pandas has extensive capabilities and features for working with time series data
in all domains. Using the NumPy datetime64 and timedelta64 dtypes,
Pandas has consolidated different features from other Python libraries
like scikits.timeseries and has created a tremendous amount of new functionality for
manipulating time series data.
For example, Pandas can parse time series information from various sources
and formats.
# import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()
Date and time
Pandas provides a number of functionalities for dates, times, deltas, and
timespans. They are mainly used in data science applications.
Example1:
import pandas as pd
# Create the dates with frequency
info = pd.date_range('5/4/2013', periods = 8, freq ='S')
info
Output:
Example2:
import pandas as pd
# Create the Timestamp
p = pd.Timestamp('2018-12-12 06:25:18')
# Create the DateOffset
do = pd.tseries.offsets.DateOffset(n = 2)
# Print the Timestamp
print(p)
# Print the DateOffset
print(do)
Output:
2018-12-12 06:25:18
<2 * DateOffsets>
Pandas Datetime
Pandas provides features to work with time series data for all domains. Using the
NumPy datetime64 and timedelta64 dtypes, it consolidates a large number of features
from other Python libraries like scikits.timeseries. It also provides
new functionality for manipulating time series data.
The time series tools are most useful for data science applications and work together
with other packages used in Python.
Example1:
import pandas as pd
# Create the dates with frequency
info = pd.date_range('5/4/2013', periods = 8, freq ='S')
info
Output:
Example2:
import pandas as pd
info = pd.DataFrame({'year': [2014, 2012],
'month': [5, 7],
'day': [20, 17]})
pd.to_datetime(info)
Output:
0 2014-05-20
1 2012-07-17
dtype: datetime64[ns]
You can pass errors='ignore' if the date cannot be converted to a timestamp. It will
return the original input without raising any exception. Note that errors='ignore' is
deprecated as of pandas 2.2.
If you pass errors='coerce', it will force an out-of-bounds or unparseable date to NaT.
import pandas as pd
# 1800-07-06 lies within the valid Timestamp range, so it converts normally
pd.to_datetime('18000706', format='%Y%m%d', errors='coerce')
Output:
Timestamp('1800-07-06 00:00:00')
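The effect of errors='coerce' is easier to see when one of the inputs cannot be parsed at all. The sketch below uses a hypothetical list with one valid date string and one invalid string.

```python
import pandas as pd

# Hypothetical input: the second entry is not a date
raw = ['2018-06-06', 'not a date']

# Unparseable entries are replaced by NaT instead of raising an exception
parsed = pd.to_datetime(raw, errors='coerce')
print(parsed)
```

Without errors='coerce', the same call would raise an exception because of the second entry.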
Example3:
import pandas as pd
dmy = pd.date_range('2017-06-04', periods=5, freq='S')
dmy
Output:
DatetimeIndex(['2017-06-04 00:00:00',
'2017-06-04 00:00:01',
'2017-06-04 00:00:02',
'2017-06-04 00:00:03',
'2017-06-04 00:00:04'],
dtype='datetime64[ns]', freq='S')
Example4:
import pandas as pd
dmy = dmy.tz_localize('UTC')
dmy
Output:
Example5:
import pandas as pd
dmy = pd.date_range('2017-06-04', periods=5, freq='S')
dmy
Output:
The offset specifies a set of dates that conform to the DateOffset. We can create
DateOffsets to move dates forward to valid dates.
If the date is not valid, we can use the rollback and rollforward methods to roll the
date to its nearest valid date before or after it. The pseudo-code of time offsets
is as follows:
Syntax:
class pandas.tseries.offsets.DateOffset(n=1, normalize=False, **kwds)
def __add__(date):
    date = rollback(date)  # does nothing if the date is valid
    return date + <n number of periods>
When we create a date offset for a negative number of periods, the date is first rolled
forward:
def __add__(date):
    date = rollforward(date)  # does nothing if the date is valid
    return date + <n number of periods>
Parameters:
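n: Refers to int, default value is 1.
**kwds: Temporal parameters that add to or replace the offset value. The plural names
(years through nanoseconds) are added to the date, while the singular names (year
through nanosecond) replace the corresponding field of the date.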
o years
o months
o weeks
o days
o hours
o minutes
o seconds
o microseconds
o nanoseconds
o year
o month
o day
o weekday
o hour
o minute
o second
o microsecond
o nanosecond
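The difference between the plural and the singular keyword arguments can be sketched as follows, using the same timestamp as the examples in this section. pd.DateOffset is the top-level alias of pandas.tseries.offsets.DateOffset.

```python
import pandas as pd

p = pd.Timestamp('2018-12-12 06:25:18')

# Plural keywords are added to the timestamp
shifted = p + pd.DateOffset(months=1, days=2)

# Singular keywords replace the corresponding fields of the timestamp
replaced = p + pd.DateOffset(day=1, hour=0)

print(shifted)
print(replaced)
```

Here the first offset moves the date forward by one month and two days, while the second one sets the day to 1 and the hour to 0 and leaves the other fields untouched.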
Example:
import pandas as pd
# Create the Timestamp
p = pd.Timestamp('2018-12-12 06:25:18')
# Create the DateOffset
do = pd.tseries.offsets.DateOffset(n = 2)
# Print the Timestamp
print(p)
# Print the DateOffset
print(do)
Output:
2018-12-12 06:25:18
<2 * DateOffsets>
Example2:
import pandas as pd
# Create the Timestamp
p = pd.Timestamp('2018-12-12 06:25:18')
# Create the DateOffset
do = pd.tseries.offsets.DateOffset(n = 2)
# Add the dateoffset to given timestamp
new_timestamp = p + do
# Print updated timestamp
print(new_timestamp)
Output:
2018-12-14 06:25:18
import pandas as pd
x = pd.Period('2014', freq='M')
x.asfreq('D', 'start')
Output:
Period('2014-01-01', 'D')
Example:
import pandas as pd
x = pd.Period('2014', freq='M')
x.asfreq('D', 'end')
Output:
Period('2014-01-31', 'D')
Period arithmetic
Period arithmetic is used to perform various arithmetic operations on periods. All the
operations are performed on the basis of 'freq'.
import pandas as pd
x = pd.Period('2014', freq='Q')
x
Output:
Period('2014Q1', 'Q-DEC')
Example:
import pandas as pd
x = pd.Period('2014', freq='Q')
x + 1
Output:
Period('2014Q2', 'Q-DEC')
import pandas as pd
p = pd.period_range('2012', '2017', freq='A')
p
Output:
# dates as string
p = ['2012-06-05', '2011-07-09', '2012-04-06']
# convert string to date format
x = pd.to_datetime(p)
x
Output:
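import pandas as pd
# prd is not defined in the original; it is assumed here to be a PeriodIndex
prd = pd.period_range('2012', '2017', freq='Y')
prd
prd.to_timestamp()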
Output:
However, in many datasets, strings are used to represent the dates. So, in this topic,
you'll learn about converting date strings to the datetime format and see how this
powerful set of tools helps you work effectively with complicated time series data.
The challenge behind this scenario is how the date strings are expressed. For example,
'Wednesday, June 6, 2018' can also be shown as '6/6/18' and '06-06-2018'. All these
formats define the same date, but the code required to convert each of them is
slightly different.
from datetime import datetime
# Define dates as the strings
dmy_str1 = 'Saturday, July 14, 2018'
dmy_str2 = '14/7/17'
dmy_str3 = '14-07-2017'
# Define dates as the datetime objects (day comes before month in the last two strings)
dmy_dt1 = datetime.strptime(dmy_str1, '%A, %B %d, %Y')
dmy_dt2 = datetime.strptime(dmy_str2, '%d/%m/%y')
dmy_dt3 = datetime.strptime(dmy_str3, '%d-%m-%Y')
#Print the converted dates
print(dmy_dt1)
print(dmy_dt2)
print(dmy_dt3)
Output:
2018-07-14 00:00:00
2017-07-14 00:00:00
2017-07-14 00:00:00
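The same conversions can also be done with pandas itself. The following is a minimal sketch that passes an explicit format string to pd.to_datetime for each of the three notations; the date strings are illustrative and use the day-before-month order.

```python
import pandas as pd

# Each notation needs its own format string
d1 = pd.to_datetime('Saturday, July 14, 2018', format='%A, %B %d, %Y')
d2 = pd.to_datetime('14/7/17', format='%d/%m/%y')
d3 = pd.to_datetime('14-07-2017', format='%d-%m-%Y')

print(d1)
print(d2)
print(d3)
```

Unlike datetime.strptime, pd.to_datetime returns pandas Timestamp objects, which fit directly into a Series or a DatetimeIndex.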
From now on, you will work with a DataFrame called eth that contains
historical data on ether, a cryptocurrency whose blockchain is produced by the
Ethereum platform. The dataset consists of the following columns: