Manipulating and Analyzing Data With Pandas
Céline Comte
Nokia Bell Labs France & Télécom ParisTech
(https://github.com/pandas-dev/pandas/blob/master/LICENSE)
© 2019 Nokia Public
Introduction
NumPy
• “NumPy arrays have a fixed size at creation, unlike Python lists (which can
grow dynamically). Changing the size of an ndarray will create a new array and
delete the original.”
• “The elements in a NumPy array are all required to be of the same data type,
and thus will be the same size in memory. The exception: one can have arrays
of (Python, including NumPy) objects, thereby allowing for arrays of different
sized elements.”
• Advantage of this rigidity: (usually) contiguous block of memory
→ Faster code
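The two quoted properties can be checked directly. The sketch below is illustrative only; the variable names and values are assumptions, not taken from the slides.

```python
import numpy as np

# A Python list grows in place...
values = [3.0, 20.0]
values.append(21.0)

# ...whereas "growing" an ndarray builds a new array and leaves the old one alone.
a = np.array([3.0, 20.0])
b = np.append(a, 21.0)
print(a.shape, b.shape)         # (2,) (3,)

# All elements share one dtype, so the integer is upcast to float here.
mixed = np.array([1, 2.5])
print(mixed.dtype)              # float64

# The exception: an object array holds references to arbitrary Python objects.
objs = np.array([1, "panda", 2.5], dtype=object)
print(objs.dtype)               # object

# A freshly created array occupies one contiguous block of memory.
print(a.flags["C_CONTIGUOUS"])  # True
```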
[Figure: the 3 × 2 array a; rows 0 to 2 run along axis 0, columns 0 and 1 along axis 1]
> a[:2, :] # returns a view of the array
array([[  3.,   0.],
       [ 20., 230.]])
> a.shape
(3, 2)
# axis 0 is of length 3, axis 1 is of length 2
> a.dtype
dtype('float64')
# data-type, specifies how to interpret each item
# (inferred from data if unspecified)
> a.itemsize
8 # the size of each element of the array,
# in bytes (8 bytes x 8 bits = 64 bits)
array([[  0.,   0.],
       [ 20., 230.],
       [ 21., 275.]])
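As a minimal sketch of what "returns a view" means in practice, the example below rebuilds the array from the figure and writes through a slice. The name b is an assumption introduced for illustration.

```python
import numpy as np

a = np.array([[3., 0.], [20., 230.], [21., 275.]])

b = a[:2, :]                   # basic slicing returns a view, not a copy
print(np.shares_memory(a, b))  # True

b[0, 0] = 0.                   # writing through the view also changes a
print(a[0, 0])                 # 0.0

# Shape, element type, bytes per element, and total bytes of the array.
print(a.shape, a.dtype, a.itemsize, a.nbytes)  # (3, 2) float64 8 48
```

Calling a[:2, :].copy() instead would give an independent array that can be modified without touching a.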
The reshape method returns a new array with the modified shape (a view on the
same data when possible) and leaves the original unchanged, whereas the
resize method modifies the array itself:
> b = a.reshape(2,3)
> b
array([[  3.,   0.,  20.],
       [230.,  21., 275.]])
> a
array([[  3.,   0.],
       [ 20., 230.],
       [ 21., 275.]])
> a.resize(2,3)
> a
array([[  3.,   0.,  20.],
       [230.,  21., 275.]])
> np.resize(a, (2,3))
array([[  3.,   0.,  20.],
       [230.,  21., 275.]])
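The three variants can be compared side by side. This is a sketch under the assumption that a is the 3 × 2 array from the figure; the names b, c, and d are introduced here for illustration.

```python
import numpy as np

a = np.array([[3., 0.], [20., 230.], [21., 275.]])

# reshape: returns a reshaped array, a keeps its shape.
b = a.reshape(2, 3)
print(a.shape, b.shape)        # (3, 2) (2, 3)
print(np.shares_memory(a, b))  # True: here b is a view on a's data

# np.resize (function): returns a new array, a is untouched.
c = np.resize(a, (2, 3))
print(np.shares_memory(a, c))  # False

# ndarray.resize (method): changes the array itself and returns None.
d = a.copy()                   # work on a copy so that a stays intact
d.resize(2, 3)
print(d.shape)                 # (2, 3)
```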
> a.sum(axis=0) # sum along axis 0
array([ 44., 505.])
• Taking the maximum of an array
> a.max() # and, similarly, a.max(axis=0)
275.0
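A short sketch of how the axis argument selects the direction of a reduction, using the same 3 × 2 array. The variable names are illustrative.

```python
import numpy as np

a = np.array([[3., 0.], [20., 230.], [21., 275.]])

col_sums = a.sum(axis=0)  # collapse axis 0: one value per column
row_sums = a.sum(axis=1)  # collapse axis 1: one value per row
print(col_sums)           # [ 44. 505.]
print(row_sums)           # [  3. 250. 296.]

print(a.max())            # 275.0, maximum over all elements
print(a.max(axis=0))      # [ 21. 275.], maximum of each column
```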
Here, a is a 1-dimensional array with tuple elements.
> a[0]
(3, 0.)
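One way to obtain such an array is a NumPy structured array. The sketch below assumes that this is what the slide used; the field names "age" and "weight" are hypothetical.

```python
import numpy as np

# Hypothetical field names; each record pairs an integer with a float.
dt = np.dtype([("age", np.int64), ("weight", np.float64)])
a = np.array([(3, 0.), (20, 230.), (21, 275.)], dtype=dt)

print(a.shape)      # (3,): one dimension, three records
print(a[0])         # (3, 0.)
print(a["age"])     # [ 3 20 21]
print(a["weight"])  # [  0. 230. 275.]
```

Each field keeps its own type, but the axes still carry no labels, which is the gap pandas fills.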
− DataFrames: 2-dimensional
• Give a semantic meaning to the axes
− Columns ≃ Variables
− Rows ≃ Observations

            Age  Weight
Bei Bei       3
Mei Xiang    20    230.
Tian Tian    21    275.
• Other functionalities:
− Missing data: Identified by NaN (np.nan).
− Mutability: Add and remove columns in a DataFrame
− Data alignment: Combine data based on the indices
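The three functionalities above can be sketched on the example table. The code assumes that Bei Bei's missing weight is stored as NaN; the column "Heavy" and the Series extra are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Age": [3, 20, 21], "Weight": [np.nan, 230., 275.]},  # NaN is an assumption
    index=["Bei Bei", "Mei Xiang", "Tian Tian"],
)

# Missing data: NaN marks the absent value.
print(df["Weight"].isna().sum())  # 1

# Mutability: add a column, then remove it again.
df["Heavy"] = df["Weight"] > 250
del df["Heavy"]

# Data alignment: values are matched by index label, not by position.
extra = pd.Series({"Tian Tian": 1., "Bei Bei": 2.})
total = df["Age"] + extra
print(total["Bei Bei"], total["Tian Tian"])  # 5.0 22.0
print(total["Mei Xiang"])                    # nan, no matching label in extra
```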
• Like ndarrays, the length of a Series cannot be
modified after definition.

            Weight (pounds)
Mei Xiang              230.
Tian Tian              275.
• Index: Can be of any hashable type.
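A minimal sketch of a labeled Series, using the weights from the figure. The second Series, with mixed label types, is a made-up example of the "any hashable type" point.

```python
import pandas as pd

weight = pd.Series(
    [230., 275.],
    index=["Mei Xiang", "Tian Tian"],
    name="Weight (pounds)",
)
print(weight["Tian Tian"])  # 275.0, lookup by label
print(weight.iloc[0])       # 230.0, lookup by position

# Labels only need to be hashable; strings and integers can be mixed.
s = pd.Series([1, 2], index=["BB", 2019])
print(s.loc[2019])          # 2
```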
• Mutability: Columns can have different
dtypes and can be added and removed,
but they have a fixed size.
(https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)
• “You can treat a DataFrame semantically like
a dict of like-indexed Series objects. Getting,
setting, and deleting columns works with the
same syntax as the analogous dict operations”.
(https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html)
[Figure: the DataFrame split (↓) into its two columns, the Series Age (BB 3, MX 20, TT 21) and the Series Weight (MX 230., TT 275.)]
• In particular: access by key, del, pop.
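The dict-like operations named above can be sketched as follows. The short labels match the figure; the column "Months" is a hypothetical addition, and the NaN weight for BB is an assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Age": [3, 20, 21], "Weight": [np.nan, 230., 275.]},  # NaN is an assumption
    index=["BB", "MX", "TT"],
)

age = df["Age"]                 # access by key returns a Series
print(type(age).__name__)       # Series
print("Age" in df)              # True, membership tests the column labels

df["Months"] = df["Age"] * 12   # setting a key adds a column
del df["Months"]                # del removes it

weight = df.pop("Weight")       # pop removes the column and returns it
print(list(df.columns))         # ['Age']
print(weight["TT"])             # 275.0
```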
> df.dtypes # returns a Series
Age         int64
Weight    float64
dtype: object
6
> df.columns
Index(['Age', 'Weight'], dtype='object')
> df.index
Index(['Bei Bei', 'Mei Xiang', 'Tian Tian'],
dtype='object')
Mei Xiang    20
Tian Tian    21
Name: Age, dtype: int64
> df.sum(axis=0)
Age        44.0
Weight    505.0
dtype: float64
> df.sum(axis=1)
Bei Bei 3.0
Mei Xiang 250.0
Tian Tian 296.0
dtype: float64
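The attributes and reductions shown in these sessions can be reproduced with the sketch below. It assumes that Bei Bei's weight is NaN, which is consistent with the row sum of 3.0 shown for Bei Bei.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Age": [3, 20, 21], "Weight": [np.nan, 230., 275.]},  # NaN is an assumption
    index=["Bei Bei", "Mei Xiang", "Tian Tian"],
)

print(df.dtypes)          # one dtype per column: int64 and float64
print(df.shape, df.size)  # (3, 2) 6

col_sums = df.sum(axis=0)  # one value per column
row_sums = df.sum(axis=1)  # one value per row; NaN is skipped by default
print(col_sums["Age"], col_sums["Weight"])  # 44.0 505.0
print(row_sums["Bei Bei"])                  # 3.0

# With skipna=False the missing weight propagates into the row sum.
strict = df.sum(axis=1, skipna=False)
print(strict["Bei Bei"])                    # nan
```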
• The organization and most of the examples of this section come from
the official tutorial 10 minutes to pandas.
(http://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html)
> df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')
> a = df.to_numpy() # may return a view; a copy is not guaranteed
> a
array([[-2.35406005, -0.31282731,  0.19482154,  1.14387112],
       [ 1.70706975, -0.78209048,  0.06241179, -0.00753477],
       [-0.21252435,  0.06799263,  1.03563884, -0.67680038],
       [ 0.65801543, -0.39368803,  0.5654252 , -1.32672643],
       [-1.30699305, -0.06174394,  0.09464223, -0.97696831]])
> a[0,0] = 0
> df['A'].iloc[0] # here a shared its memory with df, so df changed too
0.0
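Whether to_numpy returns a view depends on the pandas version and on the column dtypes, so code should not rely on either behavior. The sketch below, with made-up random data, only reports what the installed version does and shows that copy=True yields an independent array.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((5, 4)), columns=list("ABCD"))

a = df.to_numpy()          # default copy=False: a view or a copy, version dependent
print(a.flags.writeable)   # recent pandas versions may return a read-only view

b = df.to_numpy(copy=True)  # always an independent, writable array
original = df.iloc[0, 0]
b[0, 0] = 0.0               # does not touch the DataFrame
print(df.iloc[0, 0] == original)  # True
```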
> df.fillna(value=0)
                  A         B         C         D
2013-01-01  0.00000  1.171513  0.000000  0.298407
2013-01-02  0.00000  0.893041  2.136786  0.000000
2013-01-03  0.00000  0.030041  0.131783  0.000000
2013-01-04  0.46075  0.000000  0.000000  0.000000
2013-01-05  0.00000  0.953238  0.778675  1.109996
> df.to_csv('foo.csv')
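A small round trip illustrating fillna and to_csv. The data are made up, and an in-memory buffer stands in for 'foo.csv' so that the sketch writes no file.

```python
import io

import numpy as np
import pandas as pd

dates = pd.date_range("2013-01-01", periods=3)
df = pd.DataFrame(
    {"A": [np.nan, 1.5, np.nan], "B": [0.5, np.nan, 2.0]},
    index=dates,
)

filled = df.fillna(value=0)       # returns a new DataFrame, df keeps its NaN
print(df.isna().sum().sum())      # 3
print(filled.isna().sum().sum())  # 0

buf = io.StringIO()
filled.to_csv(buf)                # to_csv accepts a path or a file-like object
buf.seek(0)
back = pd.read_csv(buf, index_col=0, parse_dates=True)
print(back["A"].tolist())         # [0.0, 1.5, 0.0]
```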
Other examples:
sub            Subtracts another Series or DataFrame (broadcasting)
apply          Applies a function such as np.cumsum
value_counts   Counts the number of occurrences of each value
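The three methods listed above, sketched on a small made-up DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1., 2., 3.], "B": [10., 20., 30.]})

# sub: subtract the first row from every row (broadcast along the rows).
shifted = df.sub(df.iloc[0], axis="columns")
print(shifted["B"].tolist())   # [0.0, 10.0, 20.0]

# apply: run a function on each column, here a cumulative sum.
running = df.apply(np.cumsum)
print(running["A"].tolist())   # [1.0, 3.0, 6.0]

# value_counts: how often each distinct value occurs.
counts = pd.Series(["a", "b", "a", "a"]).value_counts()
print(counts["a"], counts["b"])  # 3 1
```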
Concatenate:
concat   General-purpose function
         Concatenates Series or DataFrames along columns or rows
append   Appends rows to a DataFrame
         Equivalent to concat along axis 0
         (deprecated in pandas 1.4 and removed in pandas 2.0; use concat instead)
Join / Merge:
merge    Database-like join operations
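A minimal sketch of concat and merge. The two small frames and their "key" column are hypothetical; they reuse the panda abbreviations from earlier figures.

```python
import pandas as pd

left = pd.DataFrame({"key": ["BB", "MX"], "Age": [3, 20]})
right = pd.DataFrame({"key": ["MX", "TT"], "Weight": [230., 275.]})

# concat along axis 0 stacks rows; this also covers what append used to do.
rows = pd.concat([left, left], ignore_index=True)
print(rows.shape)             # (4, 2)

# merge joins on a key column, like a database join.
inner = pd.merge(left, right, on="key")               # keys present in both
outer = pd.merge(left, right, on="key", how="outer")  # keys present in either
print(inner["key"].tolist())  # ['MX']
print(len(outer))             # 3
```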
> ts = pd.Series(np.random.randn(1000),
index=pd.date_range('1/1/2000', periods=1000))
> ts.head(3)
2000-01-01 0.310037
2000-01-02 1.747102
2000-01-03 -2.121889
Freq: D, dtype: float64
> ts = ts.cumsum()
> ts.plot()
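The same time series can be rebuilt reproducibly with a seeded generator. The seed and the name walk are assumptions, and the plot call is left as a comment because it requires matplotlib.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so that the run is reproducible
ts = pd.Series(
    rng.standard_normal(1000),
    index=pd.date_range("2000-01-01", periods=1000),
)

walk = ts.cumsum()              # running total: a random walk
print(walk.index[0], walk.index.freqstr)       # 2000-01-01 00:00:00 D
print(np.isclose(walk.iloc[-1], ts.sum()))     # True

# walk.plot()  # draws the walk; needs matplotlib installed
```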