DS UNIT-VI
The datetime type from the datetime module is the workhorse for date and time handling:
In [10]: from datetime import datetime
In [11]: now = datetime.now()
In [12]: now
Out[12]: datetime.datetime(2017, 9, 25, 14, 5, 52, 72973)
datetime stores both the date and time down to the microsecond. timedelta represents the temporal difference between two datetime objects:
In [14]: delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)
In [15]: delta
Out[15]: datetime.timedelta(926, 56700)
In [16]: delta.days
Out[16]: 926
In [17]: delta.seconds
Out[17]: 56700
You can add (or subtract) a timedelta or multiple thereof to a datetime object to yield a new
shifted object:
In [18]: from datetime import timedelta
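For example, a minimal sketch of the shifting just described, using only the standard library (the printed values follow directly from the arithmetic):

from datetime import datetime, timedelta

start = datetime(2011, 1, 7)
print(start + timedelta(12))      # 2011-01-19 00:00:00
print(start - 2 * timedelta(12))  # 2010-12-14 00:00:00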
The table below summarizes the data types in the datetime module:

Type Description
date Stores a calendar date (year, month, day) using the Gregorian calendar
time Stores a time of day as hours, minutes, seconds, and microseconds
datetime Stores both a date and a time
timedelta Represents the difference between two datetime values (as days, seconds, and microseconds)
tzinfo Base type for storing time zone information
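A brief sketch exercising these types from the standard library:

from datetime import date, time, datetime, timedelta

d = date(2011, 1, 7)              # calendar date
t = time(8, 15)                   # time of day
dt = datetime(2011, 1, 7, 8, 15)  # date and time combined
td = dt - datetime(2011, 1, 1)    # timedelta: 6 days, 8:15:00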
You can format datetime objects as strings using str or the strftime method, passing a format specification:
In [22]: stamp = datetime(2011, 1, 3)
In [23]: str(stamp)
Out[23]: '2011-01-03 00:00:00'
In [24]: stamp.strftime('%Y-%m-%d')
Out[24]: '2011-01-03'
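The inverse operation, turning a string into a datetime, is datetime.strptime with the same format codes:

from datetime import datetime

value = datetime.strptime('2011-01-03', '%Y-%m-%d')
print(value)  # 2011-01-03 00:00:00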
Python datetime
Python has a module named datetime to work with dates and times.
It provides a variety of classes for representing and manipulating dates and times, as well as
for formatting and parsing dates and times in a variety of formats.
Here, we have imported the datetime module using the import datetime statement.
One of the classes defined in the datetime module is the datetime class.
We then used the now() method to create a datetime object containing the current local date
and time.
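A minimal version of the code this passage describes:

import datetime

now = datetime.datetime.now()  # current local date and time
print(now)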
import datetime

current_date = datetime.date.today()
print(current_date)
Output
2022-12-27
In the above example, we have used the today() method defined in the date class to get a date object containing the current local date.
from datetime import date

today = date.today()
print("Current month:", today.month)
print("Current day:", today.day)

Output:
Current month: 12
Current day: 27
Example 4: Print hour, minute, second and microsecond
Once we create the time object, we can easily print its attributes such as hour, minute, etc.
from datetime import time
a = time(11, 34, 56)
print("Hour =", a.hour)
print("Minute =", a.minute)
print("Second =", a.second)
print("Microsecond =", a.microsecond)
Output:
Hour = 11
Minute = 34
Second = 56
Microsecond = 0
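The pandas examples that follow assume a Series indexed by datetime objects. A sketch of the setup (the index dates match the output below; the random values differ on each run):

import numpy as np
import pandas as pd
from datetime import datetime

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = pd.Series(np.random.randn(6), index=dates)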
In [42]: ts
Out[42]:
2011-01-02 -0.204708
2011-01-05 0.478943
2011-01-07 -0.519439
2011-01-08 -0.555730
2011-01-10 1.965781
2011-01-12 1.393406
dtype: float64
Under the hood, these datetime objects have been put in a DatetimeIndex:
In [43]: ts.index
Out[43]:
DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08',
'2011-01-10', '2011-01-12'],
dtype='datetime64[ns]', freq=None)
Like other Series, arithmetic operations between differently indexed time series automatically align on the dates:
In [44]: ts + ts[::2]
Out[44]:
2011-01-02 -0.409415
2011-01-05 NaN
2011-01-07 -1.038877
2011-01-08 NaN
2011-01-10 3.931561
2011-01-12 NaN
dtype: float64
pandas stores timestamps using NumPy's datetime64 data type at nanosecond resolution:
In [45]: ts.index.dtype
Out[45]: dtype('<M8[ns]')
Scalar values from a DatetimeIndex are pandas Timestamp objects:
In [46]: stamp = ts.index[0]
In [47]: stamp
Out[47]: Timestamp('2011-01-02 00:00:00')
A Timestamp can be substituted anywhere you would use a datetime object. Additionally, it can store frequency information (if any) and understands how to do time zone conversions and other kinds of manipulations. More on both of these things later.
Time series behaves like any other pandas Series when you index and select data based on label:
In [48]: stamp = ts.index[2]
In [49]: ts[stamp]
Out[49]: -0.51943871505673811
As a convenience, you can also pass a string that is interpretable as a date:
In [51]: ts['20110110']
Out[51]: 1.9657805725027142
For longer time series, a year or only a year and month can be passed to easily select slices of
data:
In [52]: longer_ts = pd.Series(np.random.randn(1000),
....: index=pd.date_range('1/1/2000', periods=1000))
In [53]: longer_ts
Out[53]:
2000-01-01 0.092908
2000-01-02 0.281746
2000-01-03 0.769023
2000-01-04 1.246435
2000-01-05 1.007189
2000-01-06 -1.296221
2000-01-07 0.274992
2000-01-08 0.228913
2000-01-09 1.352917
2000-01-10 0.886429
...
2002-09-17 -0.139298
2002-09-18 -1.159926
2002-09-19 0.618965
2002-09-20 1.373890
2002-09-21 -0.983505
2002-09-22 0.930944
2002-09-23 -0.811676
2002-09-24 -1.830156
2002-09-25 -0.138730
2002-09-26 0.334088
Freq: D, Length: 1000, dtype: float64
In [54]: longer_ts['2001']
Out[54]:
2001-01-01 1.599534
2001-01-02 0.474071
2001-01-03 0.151326
2001-01-04 -0.542173
2001-01-05 -0.475496
2001-01-06 0.106403
2001-01-07 -1.308228
2001-01-08 2.173185
2001-01-09 0.564561
2001-01-10 -0.190481
...
2001-12-22 0.000369
2001-12-23 0.900885
2001-12-24 -0.454869
2001-12-25 -0.864547
2001-12-26 1.129120
2001-12-27 0.057874
2001-12-28 -0.433739
2001-12-29 0.092698
2001-12-30 -1.397820
2001-12-31 1.457823
Freq: D, Length: 365, dtype: float64
Here, the string '2001' is interpreted as a year and selects that time period. This also works if
you specify the month:
In [55]: longer_ts['2001-05']
Out[55]:
2001-05-01 -0.622547
2001-05-02 0.936289
2001-05-03 0.750018
2001-05-04 -0.056715
2001-05-05 2.300675
2001-05-06 0.569497
2001-05-07 1.489410
2001-05-08 1.264250
2001-05-09 -0.761837
2001-05-10 -0.331617
...
2001-05-22 0.503699
2001-05-23 -1.387874
2001-05-24 0.204851
2001-05-25 0.603705
2001-05-26 0.545680
2001-05-27 0.235477
2001-05-28 0.111835
2001-05-29 -1.251504
2001-05-30 -2.949343
2001-05-31 0.634634
Freq: D, Length: 31, dtype: float64
Slicing with datetime objects works as well:
In [56]: ts[datetime(2011, 1, 7):]
Out[56]:
2011-01-07 -0.519439
2011-01-08 -0.555730
2011-01-10 1.965781
2011-01-12 1.393406
dtype: float64
Because most time series data is ordered chronologically, you can slice with timestamps not contained in a time series to perform a range query:
In [57]: ts
Out[57]:
2011-01-02 -0.204708
2011-01-05 0.478943
2011-01-07 -0.519439
2011-01-08 -0.555730
2011-01-10 1.965781
2011-01-12 1.393406
dtype: float64
In [58]: ts['1/6/2011':'1/11/2011']
Out[58]:
2011-01-07 -0.519439
2011-01-08 -0.555730
2011-01-10 1.965781
dtype: float64
As before, you can pass either a string date, datetime, or timestamp. Remember that slicing in this manner produces views on the source time series, like slicing NumPy arrays. This means that no data is copied, and modifications on the slice will be reflected in the original data.
There is an equivalent instance method, truncate, that slices a Series between two dates:
In [59]: ts.truncate(after='1/9/2011')
Out[59]:
2011-01-02 -0.204708
2011-01-05 0.478943
2011-01-07 -0.519439
2011-01-08 -0.555730
dtype: float64
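truncate similarly accepts a before argument to drop observations before a date; a small sketch with the same ts:

ts.truncate(before='1/6/2011')  # keeps 2011-01-07 through 2011-01-12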
All of this holds true for DataFrame as well, indexing on its rows:
In [60]: dates = pd.date_range('1/1/2000', periods=100, freq='W-WED')
In [61]: long_df = pd.DataFrame(np.random.randn(100, 4),
   ....:                        index=dates,
   ....:                        columns=['Colorado', 'Texas',
   ....:                                 'New York', 'Ohio'])
In [62]: long_df.loc['5-2001']
Out[62]:
Colorado Texas New York Ohio
2001-05-02 -0.006045 0.490094 -0.277186 -0.707213
2001-05-09 -0.560107 2.735527 0.927335 1.513906
2001-05-16 0.538600 1.273768 0.667876 -0.969206
2001-05-23 1.676091 -0.817649 0.050188 1.951312
2001-05-30 3.260383 0.963301 1.201206 -1.852001
In some applications, there may be multiple data observations falling on a particular timestamp:
In [63]: dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000',
   ....:                           '1/2/2000', '1/3/2000'])
In [64]: dup_ts = pd.Series(np.arange(5), index=dates)
In [65]: dup_ts
Out[65]:
2000-01-01 0
2000-01-02 1
2000-01-02 2
2000-01-02 3
2000-01-03 4
dtype: int64
We can tell that the index is not unique by checking its is_unique property:
In [66]: dup_ts.index.is_unique
Out[66]: False
Indexing into this time series will now either produce scalar values or slices depending on whether a timestamp is duplicated:
In [67]: dup_ts['1/3/2000'] # not duplicated
Out[67]: 4
In [68]: dup_ts['1/2/2000'] # duplicated
Out[68]:
2000-01-02 1
2000-01-02 2
2000-01-02 3
dtype: int64
Suppose you wanted to aggregate the data having non-unique timestamps. One way to do this is
to use groupby and pass level=0:
In [69]: grouped = dup_ts.groupby(level=0)
In [70]: grouped.mean()
Out[70]:
2000-01-01 0
2000-01-02 2
2000-01-03 4
dtype: int64
In [71]: grouped.count()
Out[71]:
2000-01-01 1
2000-01-02 3
2000-01-03 1
dtype: int64
The resample method converts a time series to one of fixed frequency; here the string 'D' is interpreted as daily frequency:
In [73]: resampler = ts.resample('D')
pandas.date_range generates a DatetimeIndex with an indicated length according to a particular frequency:
In [74]: index = pd.date_range('2012-04-01', '2012-06-01')
In [75]: index
Out[75]:
DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
'2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
'2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
'2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
'2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20',
'2012-04-21', '2012-04-22', '2012-04-23', '2012-04-24',
'2012-04-25', '2012-04-26', '2012-04-27', '2012-04-28',
'2012-04-29', '2012-04-30', '2012-05-01', '2012-05-02',
'2012-05-03', '2012-05-04', '2012-05-05', '2012-05-06',
'2012-05-07', '2012-05-08', '2012-05-09', '2012-05-10',
'2012-05-11', '2012-05-12', '2012-05-13', '2012-05-14',
'2012-05-15', '2012-05-16', '2012-05-17', '2012-05-18',
'2012-05-19', '2012-05-20', '2012-05-21', '2012-05-22',
'2012-05-23', '2012-05-24', '2012-05-25', '2012-05-26',
'2012-05-27', '2012-05-28', '2012-05-29', '2012-05-30',
'2012-05-31', '2012-06-01'],
dtype='datetime64[ns]', freq='D')
By default, date_range generates daily timestamps. If you pass only a start or end date, you
must pass a number of periods to generate:
In [76]: pd.date_range(start='2012-04-01', periods=20)
Out[76]:
DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
'2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
'2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
'2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
'2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20'],
dtype='datetime64[ns]', freq='D')
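Symmetrically, passing only an end date with a period count generates the dates leading up to it (date_range accepts end plus periods):

import pandas as pd

pd.date_range(end='2012-06-01', periods=20)
# DatetimeIndex(['2012-05-13', ..., '2012-06-01'], dtype='datetime64[ns]', freq='D')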
The start and end dates define strict boundaries for the generated date index. For example, if you wanted a date index containing the last business day of each month, you would pass the 'BM' frequency (business end of month; see the table of frequencies below), and only dates falling on or inside the date interval will be included:
In [78]: pd.date_range('2000-01-01', '2000-12-01', freq='BM')
Out[78]:
DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-28',
'2000-05-31', '2000-06-30', '2000-07-31', '2000-08-31',
'2000-09-29', '2000-10-31', '2000-11-30'],
dtype='datetime64[ns]', freq='BM')
Table: Base time series frequencies (not comprehensive)
Alias Offset type Description
D Day Calendar daily
B BusinessDay Business daily
H Hour Hourly
T or min Minute Minutely
S Second Secondly
L or ms Milli Millisecond (1/1,000 of 1 second)
U Micro Microsecond (1/1,000,000 of 1 second)
M MonthEnd Last calendar day of month
BM BusinessMonthEnd Last business day (weekday) of month
MS MonthBegin First calendar day of month
BMS BusinessMonthBegin First weekday of month
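For instance, to contrast the month-end and month-start aliases from the table (a small sketch; the dates are deterministic):

import pandas as pd

print(pd.date_range('2000-01-01', periods=3, freq='M'))
# ['2000-01-31', '2000-02-29', '2000-03-31']  (month ends)
print(pd.date_range('2000-01-01', periods=3, freq='MS'))
# ['2000-01-01', '2000-02-01', '2000-03-01']  (month starts)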
date_range by default preserves the time (if any) of the start or end timestamp:
In [79]: pd.date_range('2012-05-02 12:56:31', periods=5)
Out[79]:
DatetimeIndex(['2012-05-02 12:56:31', '2012-05-03 12:56:31',
'2012-05-04 12:56:31', '2012-05-05 12:56:31',
'2012-05-06 12:56:31'],
dtype='datetime64[ns]', freq='D')
Sometimes you will have start or end dates with time information but want to generate a set of timestamps normalized to midnight as a convention. To do this, there is a normalize option:
In [80]: pd.date_range('2012-05-02 12:56:31', periods=5, normalize=True)
Out[80]:
DatetimeIndex(['2012-05-02', '2012-05-03', '2012-05-04', '2012-05-05',
'2012-05-06'],
dtype='datetime64[ns]', freq='D')
Frequencies in pandas are composed of a base frequency and a multiplier. For each base frequency, there is a date offset object; hourly frequency, for instance, is represented by the Hour class:
In [81]: from pandas.tseries.offsets import Hour, Minute
In [82]: hour = Hour()
In most applications, you would never need to explicitly create one of these objects, instead using a string alias like 'H' or '4H'. Putting an integer before the base frequency creates a multiple:
In [86]: pd.date_range('2000-01-01', '2000-01-03 23:59', freq='4h')
Out[86]:
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 04:00:00',
'2000-01-01 08:00:00', '2000-01-01 12:00:00',
'2000-01-01 16:00:00', '2000-01-01 20:00:00',
'2000-01-02 00:00:00', '2000-01-02 04:00:00',
'2000-01-02 08:00:00', '2000-01-02 12:00:00',
'2000-01-02 16:00:00', '2000-01-02 20:00:00',
'2000-01-03 00:00:00', '2000-01-03 04:00:00',
'2000-01-03 08:00:00', '2000-01-03 12:00:00',
'2000-01-03 16:00:00', '2000-01-03 20:00:00'],
dtype='datetime64[ns]', freq='4H')
Many offsets can be combined together by addition:
In [87]: Hour(2) + Minute(30)
Out[87]: <150 * Minutes>
Similarly, you can pass frequency strings, like '1h30min', that will effectively be parsed to the
same expression:
In [88]: pd.date_range('2000-01-01', periods=10, freq='1h30min')
Out[88]:
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 01:30:00',
'2000-01-01 03:00:00', '2000-01-01 04:30:00',
'2000-01-01 06:00:00', '2000-01-01 07:30:00',
'2000-01-01 09:00:00', '2000-01-01 10:30:00',
'2000-01-01 12:00:00', '2000-01-01 13:30:00'],
dtype='datetime64[ns]', freq='90T')
Shifting refers to moving data backward and forward through time. Both Series and DataFrame have a shift method for doing naive shifts, leaving the index unmodified:
In [91]: ts = pd.Series(np.random.randn(4),
   ....:                index=pd.date_range('1/1/2000', periods=4, freq='M'))
In [92]: ts
Out[92]:
2000-01-31 -0.066748
2000-02-29 0.838639
2000-03-31 -0.117388
2000-04-30 -0.517795
Freq: M, dtype: float64
In [93]: ts.shift(2)
Out[93]:
2000-01-31 NaN
2000-02-29 NaN
2000-03-31 -0.066748
2000-04-30 0.838639
Freq: M, dtype: float64
In [94]: ts.shift(-2)
Out[94]:
2000-01-31 -0.117388
2000-02-29 -0.517795
2000-03-31 NaN
2000-04-30 NaN
Freq: M, dtype: float64
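Two common shift patterns, sketched under standard pandas semantics using the monthly ts above:

returns = ts / ts.shift(1) - 1   # percent change from one period to the next
shifted = ts.shift(2, freq='M')  # shifts the timestamps forward two month-ends
                                 # instead of the values, so no data is discarded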
A creative use of date offsets is to group a time series by its dates rolled forward to the next month end with MonthEnd:
In [104]: from pandas.tseries.offsets import MonthEnd
In [105]: offset = MonthEnd()
In [106]: ts = pd.Series(np.random.randn(20),
   .....:                index=pd.date_range('1/15/2000', periods=20, freq='4d'))
In [107]: ts
Out[107]:
2000-01-15 -0.116696
2000-01-19 2.389645
2000-01-23 -0.932454
2000-01-27 -0.229331
2000-01-31 -1.140330
2000-02-04 0.439920
2000-02-08 -0.823758
2000-02-12 -0.520930
2000-02-16 0.350282
2000-02-20 0.204395
2000-02-24 0.133445
2000-02-28 0.327905
2000-03-03 0.072153
2000-03-07 0.131678
2000-03-11 -1.297459
2000-03-15 0.997747
2000-03-19 0.870955
2000-03-23 -0.991253
2000-03-27 0.151699
2000-03-31 1.266151
In [108]: ts.groupby(offset.rollforward).mean()
Out[108]:
2000-01-31 -0.005833
2000-02-29 0.015894
2000-03-31 0.150209
dtype: float64
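An easier and faster way to accomplish the same grouping is to resample (this equivalence is standard pandas behavior):

ts.resample('M').mean()  # same monthly means as grouping by offset.rollforward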
In Python, time zone information comes from the third-party pytz library, which exposes the Olson time zone database. Methods in pandas will accept either time zone names or these objects.
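A sketch of obtaining such an object from pytz (the library pandas wraps for time zones):

import pytz

tz = pytz.timezone('America/New_York')
print(tz)  # <DstTzInfo 'America/New_York' ...>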
By default, time series in pandas are time zone naive. Consider the following:
In [114]: rng = pd.date_range('3/9/2012 9:30', periods=6, freq='D')
In [115]: ts = pd.Series(np.random.randn(len(rng)), index=rng)
In [116]: ts
Out[116]:
2012-03-09 09:30:00 -0.202469
2012-03-10 09:30:00 0.050718
2012-03-11 09:30:00 0.639869
2012-03-12 09:30:00 0.597594
2012-03-13 09:30:00 -0.797246
2012-03-14 09:30:00 0.472879
Freq: D, dtype: float64
Date ranges can be generated with a time zone set:
In [118]: pd.date_range('3/9/2012 9:30', periods=10, freq='D', tz='UTC')
Out[118]:
DatetimeIndex(['2012-03-09 09:30:00+00:00', '2012-03-10 09:30:00+00:00',
'2012-03-11 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
'2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00',
'2012-03-15 09:30:00+00:00', '2012-03-16 09:30:00+00:00',
'2012-03-17 09:30:00+00:00', '2012-03-18 09:30:00+00:00'],
dtype='datetime64[ns, UTC]', freq='D')
Conversion from naive to localized is handled by the tz_localize method:
In [119]: ts
Out[119]:
2012-03-09 09:30:00 -0.202469
2012-03-10 09:30:00 0.050718
2012-03-11 09:30:00 0.639869
2012-03-12 09:30:00 0.597594
2012-03-13 09:30:00 -0.797246
2012-03-14 09:30:00 0.472879
Freq: D, dtype: float64
In [120]: ts_utc = ts.tz_localize('UTC')
In [121]: ts_utc
Out[121]:
2012-03-09 09:30:00+00:00 -0.202469
2012-03-10 09:30:00+00:00 0.050718
2012-03-11 09:30:00+00:00 0.639869
2012-03-12 09:30:00+00:00 0.597594
2012-03-13 09:30:00+00:00 -0.797246
2012-03-14 09:30:00+00:00 0.472879
Freq: D, dtype: float64
In [122]: ts_utc.index
Out[122]:
DatetimeIndex(['2012-03-09 09:30:00+00:00', '2012-03-10 09:30:00+00:00',
'2012-03-11 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
'2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00'],
dtype='datetime64[ns, UTC]', freq='D')
Once a time series has been localized to a particular time zone, it can be converted to another
time zone with tz_convert:
In [123]: ts_utc.tz_convert('America/New_York')
Out[123]:
2012-03-09 04:30:00-05:00 -0.202469
2012-03-10 04:30:00-05:00 0.050718
2012-03-11 05:30:00-04:00 0.639869
2012-03-12 05:30:00-04:00 0.597594
2012-03-13 05:30:00-04:00 -0.797246
2012-03-14 05:30:00-04:00 0.472879
Freq: D, dtype: float64
In the case of the preceding time series, which straddles a DST transition in the America/New_York time zone, we could localize to EST and convert to, say, UTC or Berlin time:
In [124]: ts_eastern = ts.tz_localize('America/New_York')
In [125]: ts_eastern.tz_convert('UTC')
Out[125]:
2012-03-09 14:30:00+00:00 -0.202469
2012-03-10 14:30:00+00:00 0.050718
2012-03-11 13:30:00+00:00 0.639869
2012-03-12 13:30:00+00:00 0.597594
2012-03-13 13:30:00+00:00 -0.797246
2012-03-14 13:30:00+00:00 0.472879
Freq: D, dtype: float64
In [126]: ts_eastern.tz_convert('Europe/Berlin')
Out[126]:
2012-03-09 15:30:00+01:00 -0.202469
2012-03-10 15:30:00+01:00 0.050718
2012-03-11 14:30:00+01:00 0.639869
2012-03-12 14:30:00+01:00 0.597594
2012-03-13 14:30:00+01:00 -0.797246
2012-03-14 14:30:00+01:00 0.472879
Freq: D, dtype: float64
tz_localize and tz_convert are also instance methods on DatetimeIndex:
In [127]: ts.index.tz_localize('Asia/Shanghai')
Out[127]:
DatetimeIndex(['2012-03-09 09:30:00+08:00', '2012-03-10 09:30:00+08:00',
'2012-03-11 09:30:00+08:00', '2012-03-12 09:30:00+08:00',
'2012-03-13 09:30:00+08:00', '2012-03-14 09:30:00+08:00'],
dtype='datetime64[ns, Asia/Shanghai]', freq='D')
You can also pass a time zone when creating the Timestamp:
In [131]: stamp_moscow = pd.Timestamp('2011-03-12 04:00', tz='Europe/Moscow')
In [132]: stamp_moscow
Out[132]: Timestamp('2011-03-12 04:00:00+0300', tz='Europe/Moscow')
Time zone–aware Timestamp objects internally store a UTC timestamp value as nanoseconds since the Unix epoch (January 1, 1970); this UTC value is invariant between time zone conversions:
In [128]: stamp = pd.Timestamp('2011-03-12 04:00')
In [129]: stamp_utc = stamp.tz_localize('utc')
In [133]: stamp_utc.value
Out[133]: 1299902400000000000
In [134]: stamp_utc.tz_convert('America/New_York').value
Out[134]: 1299902400000000000
When performing time arithmetic using pandas's DateOffset objects, pandas respects daylight saving time transitions where possible. Here we construct timestamps that occur right before DST transitions (forward and backward). First, 30 minutes before transitioning to DST:
In [135]: from pandas.tseries.offsets import Hour
In [136]: stamp = pd.Timestamp('2012-03-12 01:30', tz='US/Eastern')
In [137]: stamp
Out[137]: Timestamp('2012-03-12 01:30:00-0400', tz='US/Eastern')
In [138]: stamp + Hour()
Out[138]: Timestamp('2012-03-12 02:30:00-0400', tz='US/Eastern')
If two time series with different time zones are combined, the result will be UTC; since the timestamps are stored under the hood in UTC, no conversion is required. Consider:
In [142]: rng = pd.date_range('3/7/2012 9:30', periods=10, freq='B')
In [143]: ts = pd.Series(np.random.randn(len(rng)), index=rng)
In [144]: ts
Out[144]:
2012-03-07 09:30:00 0.522356
2012-03-08 09:30:00 -0.546348
2012-03-09 09:30:00 -0.733537
2012-03-12 09:30:00 1.302736
2012-03-13 09:30:00 0.022199
2012-03-14 09:30:00 0.364287
2012-03-15 09:30:00 -0.922839
2012-03-16 09:30:00 0.312656
2012-03-19 09:30:00 -1.128497
2012-03-20 09:30:00 -0.333488
Freq: B, dtype: float64
In [145]: ts1 = ts[:7].tz_localize('Europe/London')
In [146]: ts2 = ts1[2:].tz_convert('Europe/Moscow')
In [147]: result = ts1 + ts2
In [148]: result.index
Out[148]:
DatetimeIndex(['2012-03-07 09:30:00+00:00', '2012-03-08 09:30:00+00:00',
'2012-03-09 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
'2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00',
'2012-03-15 09:30:00+00:00'],
dtype='datetime64[ns, UTC]', freq='B')
Periods represent timespans, like days, months, quarters, or years. The Period class represents this data type, requiring a string or integer and a frequency:
In [149]: p = pd.Period(2007, freq='A-DEC')
In [150]: p
Out[150]: Period('2007', 'A-DEC')
In this case, the Period object represents the full timespan from January 1, 2007, to December
31, 2007, inclusive. Conveniently, adding and subtracting integers from periods has the effect of
shifting by their frequency:
In [151]: p + 5
Out[151]: Period('2012', 'A-DEC')
In [152]: p - 2
Out[152]: Period('2005', 'A-DEC')
If two periods have the same frequency, their difference is the number of units between them:
In [153]: pd.Period('2014', freq='A-DEC') - p
Out[153]: 7
Regular ranges of periods can be constructed with the period_range function:
In [154]: rng = pd.period_range('2000-01-01', '2000-06-30', freq='M')
In [155]: rng
Out[155]: PeriodIndex(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05',
'2000-06'], dtype='period[M]', freq='M')
The PeriodIndex class stores a sequence of periods and can serve as an axis index in any pandas
data structure:
In [156]: pd.Series(np.random.randn(6), index=rng)
Out[156]:
2000-01 -0.514551
2000-02 -0.559782
2000-03 -0.783408
2000-04 -1.797685
2000-05 -0.172670
2000-06 0.680215
Freq: M, dtype: float64
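If needed, such a period-indexed series can be converted back to a timestamp index with the standard to_timestamp method (a sketch using the rng above):

ts_p = pd.Series(np.random.randn(6), index=rng)
ts_p.to_timestamp(how='end')  # timestamps at the end of each monthly period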
If you have an array of strings, you can also use the PeriodIndex class:
In [157]: values = ['2001Q3', '2002Q2', '2003Q1']
In [158]: index = pd.PeriodIndex(values, freq='Q-DEC')
In [159]: index
Out[159]: PeriodIndex(['2001Q3', '2002Q2', '2003Q1'], dtype='period[Q-DEC]', freq
='Q-DEC')
Periods and PeriodIndex objects can be converted to another frequency with their asfreq method. As an example, suppose we had an annual period and wanted to convert it into a monthly period either at the start or end of the year:
In [160]: p = pd.Period('2007', freq='A-DEC')
In [161]: p
Out[161]: Period('2007', 'A-DEC')
In [162]: p.asfreq('M', how='start')
Out[162]: Period('2007-01', 'M')
In [163]: p.asfreq('M', how='end')
Out[163]: Period('2007-12', 'M')
You can think of Period('2007', 'A-DEC') as a sort of cursor pointing to a span of time, subdivided by monthly periods (see Figure 11-1 for an illustration). For a fiscal year ending on a month other than December, the corresponding monthly subperiods are different:
In [164]: p = pd.Period('2007', freq='A-JUN')
In [165]: p
Out[165]: Period('2007', 'A-JUN')
In [166]: p.asfreq('M', 'start')
Out[166]: Period('2006-07', 'M')
In [167]: p.asfreq('M', 'end')
Out[167]: Period('2007-06', 'M')
When you are converting from high to low frequency, pandas determines the superperiod depending on where the subperiod "belongs." For example, in A-JUN frequency, the month Aug-2007 is actually part of the 2008 period:
In [168]: p = pd.Period('Aug-2007', 'M')
In [169]: p.asfreq('A-JUN')
Out[169]: Period('2008', 'A-JUN')
resample has a similar API to groupby: you call resample to group the data, then call an aggregation function:
In [208]: rng = pd.date_range('2000-01-01', periods=100, freq='D')
In [209]: ts = pd.Series(np.random.randn(len(rng)), index=rng)
In [210]: ts
Out[210]:
2000-01-01 0.631634
2000-01-02 -1.594313
2000-01-03 -1.519937
2000-01-04 1.108752
2000-01-05 1.255853
2000-01-06 -0.024330
2000-01-07 -2.047939
2000-01-08 -0.272657
2000-01-09 -1.692615
2000-01-10 1.423830
...
2000-03-31 -0.007852
2000-04-01 -1.638806
2000-04-02 1.401227
2000-04-03 1.758539
2000-04-04 0.628932
2000-04-05 -0.423776
2000-04-06 0.789740
2000-04-07 0.937568
2000-04-08 -2.253294
2000-04-09 -1.772919
Freq: D, Length: 100, dtype: float64
In [211]: ts.resample('M').mean()
Out[211]:
2000-01-31 -0.165893
2000-02-29 0.078606
2000-03-31 0.223811
2000-04-30 -0.063643
Freq: M, dtype: float64
Passing kind='period' labels the result with periods rather than timestamps:
In [212]: ts.resample('M', kind='period').mean()
Out[212]:
2000-01 -0.165893
2000-02 0.078606
2000-03 0.223811
2000-04 -0.063643
Freq: M, dtype: float64
resample is a flexible and high-performance method that can be used to process very large time
series.
Downsampling:
Aggregating data to a regular, lower frequency is a common time series task. The data you're aggregating doesn't need to have a fixed frequency; the desired frequency defines bin edges that are used to slice the time series into pieces to aggregate. For example, to convert to monthly, 'M' or 'BM', you need to chop up the data into one-month intervals. Each interval is said to be half-open; a data point can only belong to one interval, and the union of the intervals must make up the whole time frame. There are a couple of things to think about when using resample to downsample data: which side of each interval is closed, and whether to label each aggregated bin with the start or the end of the interval.
To illustrate, here is some one-minute frequency data:
In [213]: rng = pd.date_range('2000-01-01', periods=12, freq='T')
In [214]: ts = pd.Series(np.arange(12), index=rng)
In [215]: ts
Out[215]:
2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 00:07:00 7
2000-01-01 00:08:00 8
2000-01-01 00:09:00 9
2000-01-01 00:10:00 10
2000-01-01 00:11:00 11
Freq: T, dtype: int64
Suppose you wanted to aggregate this data into five-minute chunks or bars by taking the sum of
each group:
In [216]: ts.resample('5min').sum()
Out[216]:
2000-01-01 00:00:00 10
2000-01-01 00:05:00 35
2000-01-01 00:10:00 21
Freq: 5T, dtype: int64
The frequency you pass defines bin edges in five-minute increments. By default, the left bin edge is inclusive, so the 00:00 value is included in the 00:00 to 00:05 interval. Passing closed='right' changes the interval to be closed on the right:
In [217]: ts.resample('5min', closed='right').sum()
Out[217]:
1999-12-31 23:55:00 0
2000-01-01 00:00:00 15
2000-01-01 00:05:00 40
2000-01-01 00:10:00 11
Freq: 5T, dtype: int64
The resulting time series is labeled by the timestamps from the left side of each bin. By
passing label='right' you can label them with the right bin edge:
In [218]: ts.resample('5min', closed='right', label='right').sum()
Out[218]:
2000-01-01 00:00:00 0
2000-01-01 00:05:00 15
2000-01-01 00:10:00 40
2000-01-01 00:15:00 11
Freq: 5T, dtype: int64
Upsampling and Interpolation:
When converting from a lower frequency to a higher frequency, no aggregation is needed. Consider a DataFrame with some weekly data:
In [221]: frame = pd.DataFrame(np.random.randn(2, 4),
   .....:                      index=pd.date_range('1/1/2000', periods=2,
   .....:                                          freq='W-WED'),
   .....:                      columns=['Colorado', 'Texas', 'New York', 'Ohio'])
In [222]: frame
Out[222]:
Colorado Texas New York Ohio
2000-01-05 -0.896431 0.677263 0.036503 0.087102
2000-01-12 -0.046662 0.927238 0.482284 -0.867130
When you are using an aggregation function with this data, there is only one value per group, and missing values result in the gaps. We use the asfreq method to convert to the higher frequency without any aggregation:
In [223]: df_daily = frame.resample('D').asfreq()
In [224]: df_daily
Out[224]:
Colorado Texas New York Ohio
2000-01-05 -0.896431 0.677263 0.036503 0.087102
2000-01-06 NaN NaN NaN NaN
2000-01-07 NaN NaN NaN NaN
2000-01-08 NaN NaN NaN NaN
2000-01-09 NaN NaN NaN NaN
2000-01-10 NaN NaN NaN NaN
2000-01-11 NaN NaN NaN NaN
2000-01-12 -0.046662 0.927238 0.482284 -0.867130
Suppose you wanted to fill forward each weekly value on the non-Wednesdays. The same filling
or interpolation methods available in the fillna and reindex methods are available for
resampling:
In [225]: frame.resample('D').ffill()
Out[225]:
Colorado Texas New York Ohio
2000-01-05 -0.896431 0.677263 0.036503 0.087102
2000-01-06 -0.896431 0.677263 0.036503 0.087102
2000-01-07 -0.896431 0.677263 0.036503 0.087102
2000-01-08 -0.896431 0.677263 0.036503 0.087102
2000-01-09 -0.896431 0.677263 0.036503 0.087102
2000-01-10 -0.896431 0.677263 0.036503 0.087102
2000-01-11 -0.896431 0.677263 0.036503 0.087102
2000-01-12 -0.046662 0.927238 0.482284 -0.867130
You can similarly choose to only fill a certain number of periods forward to limit how far to
continue using an observed value:
In [226]: frame.resample('D').ffill(limit=2)
Out[226]:
Colorado Texas New York Ohio
2000-01-05 -0.896431 0.677263 0.036503 0.087102
2000-01-06 -0.896431 0.677263 0.036503 0.087102
2000-01-07 -0.896431 0.677263 0.036503 0.087102
2000-01-08 NaN NaN NaN NaN
2000-01-09 NaN NaN NaN NaN
2000-01-10 NaN NaN NaN NaN
2000-01-11 NaN NaN NaN NaN
2000-01-12 -0.046662 0.927238 0.482284 -0.867130
Notably, the new date index need not overlap with the old one at all:
In [227]: frame.resample('W-THU').ffill()
Out[227]:
Colorado Texas New York Ohio
2000-01-06 -0.896431 0.677263 0.036503 0.087102
2000-01-13 -0.046662 0.927238 0.482284 -0.867130
Resampling with Periods:
Resampling data indexed by periods is similar to resampling with timestamps:
In [228]: frame = pd.DataFrame(np.random.randn(24, 4),
   .....:                      index=pd.period_range('1-2000', '12-2001',
   .....:                                            freq='M'),
   .....:                      columns=['Colorado', 'Texas', 'New York', 'Ohio'])
In [229]: frame[:5]
Out[229]:
Colorado Texas New York Ohio
2000-01 0.493841 -0.155434 1.397286 1.507055
2000-02 -1.179442 0.443171 1.395676 -0.529658
2000-03 0.787358 0.248845 0.743239 1.267746
2000-04 1.302395 -0.272154 -0.051532 -0.467740
2000-05 -1.040816 0.426419 0.312945 -1.115689
In [230]: annual_frame = frame.resample('A-DEC').mean()
In [231]: annual_frame
Out[231]:
Colorado Texas New York Ohio
2000 0.556703 0.016631 0.111873 -0.027445
2001 0.046303 0.163344 0.251503 -0.157276
Moving Window Functions:
I now introduce the rolling operator, which behaves similarly to resample and groupby. It can be called on a Series or DataFrame along with a window (expressed as a number of periods; see Figure 11-4 for the plot created). Here, close_px is a DataFrame of daily stock closing prices resampled to business-day frequency:
In [238]: close_px.AAPL.plot()
In [239]: close_px.AAPL.rolling(250).mean().plot()
Figure 11-4: Apple price with 250-day moving average
The expression rolling(250) is similar in behavior to groupby, but instead of grouping, it creates an object that enables grouping over a 250-day sliding window. So here we have the 250-day moving window average of Apple's stock price.
By default, rolling functions require all of the values in the window to be non-NA; passing min_periods relaxes this, so a statistic can be computed from the first few observations at the start of the series:
In [241]: appl_std250 = close_px.AAPL.rolling(250, min_periods=10).std()
In [242]: appl_std250[5:12]
Out[242]:
2003-01-09 NaN
2003-01-10 NaN
2003-01-13 NaN
2003-01-14 NaN
2003-01-15 0.077496
2003-01-16 0.074760
2003-01-17 0.112368
Freq: B, Name: AAPL, dtype: float64
In [243]: appl_std250.plot()
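A related sketch: an expanding window starts at the beginning of the time series and grows, rather than sliding (standard pandas expanding, using the same close_px data):

expanding_mean = close_px.AAPL.expanding().mean()
expanding_mean.plot()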