Week 10 Resample Hourly Data
Import Modules
In [1]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Read data
The CSV file is assumed to exist in the same working directory as this notebook. Specify
the file name below.
In [3]: data_file = 'Net_generation_from_nuclear_for_United_States_Lower_48_(region)
Read the data from the CSV file. We must skip the first 5 rows and provide our own
column names.
In [4]: nuclear = pd.read_csv( data_file, skiprows=5, names=['timestamp', 'generatio
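The effect of skiprows and names can be checked on a small in-memory file. This is a minimal sketch: the header lines, the timestamp strings, and the name demo are hypothetical stand-ins, not the contents of the real CSV file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the downloaded file: 5 header rows, then data rows.
raw_text = "\n".join([
    "title line",
    "region line",
    "source line",
    "units line",
    "original,column names",
    "2021-03-19 04:00,85915",
    "2021-03-19 03:00,85091",
])

# skiprows=5 drops the header rows; names supplies our own column names.
demo = pd.read_csv(io.StringIO(raw_text), skiprows=5, names=['timestamp', 'generation'])
print(demo)
```

Because the original header row is among the skipped rows, the names argument is what gives the columns their labels.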
In [5]: nuclear.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23808 entries, 0 to 23807
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 timestamp 23808 non-null object
1 generation 23808 non-null int64
dtypes: int64(1), object(1)
memory usage: 372.1+ KB
The timestamp column looks different from other examples because it includes the
hour!
In [6]: nuclear
Reorganize data
We must convert the timestamp from a string to a date time object. The
pd.to_datetime() function is usually quite good at guessing the format.
In [6]: pd.to_datetime( nuclear.timestamp )
You may get warnings when executing the conversion. Hourly data can sometimes
require time zone information. The utc argument controls how time zones are
handled. By default utc=False . For this example, including utc=True does not
change the date or hour, though older versions of Pandas may need utc=True to
run properly. As shown below, the date and hour are the same as the default
result shown above. However, the UTC offset is also displayed because the UTC
option is specified.
In [7]: pd.to_datetime( nuclear.timestamp, utc=True )
Out[7]: 0 2021-03-19 04:00:00+00:00
1 2021-03-19 03:00:00+00:00
2 2021-03-19 02:00:00+00:00
3 2021-03-19 01:00:00+00:00
4 2021-03-19 00:00:00+00:00
...
23803 2018-07-01 09:00:00+00:00
23804 2018-07-01 08:00:00+00:00
23805 2018-07-01 07:00:00+00:00
23806 2018-07-01 06:00:00+00:00
23807 2018-07-01 05:00:00+00:00
Name: timestamp, Length: 23808, dtype: datetime64[ns, UTC]
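The difference between the two conversions can be seen on a tiny example. This is a minimal sketch, assuming timestamp strings without an explicit offset. The strings and the names stamps, naive, and aware are illustrative only.

```python
import pandas as pd

# Hypothetical timestamp strings; the real file's format may differ.
stamps = pd.Series(['2021-03-19 04:00', '2021-03-19 03:00'], name='timestamp')

naive = pd.to_datetime(stamps)            # default utc=False: no time zone attached
aware = pd.to_datetime(stamps, utc=True)  # values are labeled as UTC

print(naive.dt.tz)
print(aware.dt.tz)

# The calendar date and hour agree; only the time zone label differs.
print((aware.dt.tz_localize(None) == naive).all())
```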
Let's assign the converted date time object to the date_dt column.
In [8]: nuclear['date_dt'] = pd.to_datetime( nuclear.timestamp, utc=True )
In [9]: nuclear
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23808 entries, 0 to 23807
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 timestamp 23808 non-null object
1 generation 23808 non-null int64
2 date_dt 23808 non-null datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](1), int64(1), object(1)
memory usage: 558.1+ KB
In [12]: nuclear
In [14]: nuclear
In [16]: nuclear
In [18]: nuclear
In [20]: nuclear
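The code for the cells above did not survive this export. The plots that follow group by date time attributes such as the_year , so here is a minimal sketch of one way to derive such columns with the .dt accessor. Only the_year appears in the original. The other column names and the two sample rows are assumptions.

```python
import pandas as pd

# Two sample rows standing in for the full hourly data set.
demo = pd.DataFrame({
    'date_dt': pd.to_datetime(['2018-07-01 05:00', '2021-03-19 04:00'], utc=True),
    'generation': [58363, 85915],
})

# Each .dt attribute extracts one component of the date time object.
demo['the_year'] = demo.date_dt.dt.year
demo['the_month'] = demo.date_dt.dt.month              # hypothetical column name
demo['the_hour'] = demo.date_dt.dt.hour                # hypothetical column name
demo['the_day_of_week'] = demo.date_dt.dt.dayofweek    # Monday=0, Sunday=6

print(demo)
```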
Visualizations
Although our previous example focused on time series specific visuals, we can still
use standard plots to explore time series data.
We can explore the distribution of the value, generation , grouped by the newly
created date time attributes.
In [21]: sns.catplot(data = nuclear, x='the_year', y='generation', kind='box', aspect
plt.show()
The violin plot shows the shape of the distribution in addition to the important summary
stats.
In [22]: sns.catplot(data = nuclear, x='the_year', y='generation', kind='violin', asp
plt.show()
Our previous examples worked with MONTHLY data. Let's convert our HOURLY, or high
frequency, data to the MONTHLY START sampling frequency. Before calling the
.resample() method, let's review grouping with .groupby().aggregate() .
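As a reminder of the grouping approach, here is a minimal sketch that sums hourly values per year and month. The four sample rows and the names the_month , total_generation , and monthly_totals are assumptions made for illustration.

```python
import pandas as pd

# Four hourly rows that straddle a month boundary.
demo = pd.DataFrame({
    'date_dt': pd.to_datetime(
        ['2018-07-31 22:00', '2018-07-31 23:00',
         '2018-08-01 00:00', '2018-08-01 01:00'], utc=True),
    'generation': [10, 20, 30, 40],
})
demo['the_year'] = demo.date_dt.dt.year
demo['the_month'] = demo.date_dt.dt.month

# Named aggregation: one output column holding the sum per (year, month) group.
monthly_totals = (
    demo.groupby(['the_year', 'the_month'])
        .aggregate(total_generation=('generation', 'sum'))
        .reset_index()
)
print(monthly_totals)
```

Grouping this way requires building the year and month columns first. Resampling reaches the same totals directly from a date time index.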
In [30]: my_series
Out[30]: 0 85915
1 85091
2 85414
3 85628
4 85781
...
23803 81733
23804 81700
23805 75650
23806 75818
23807 58363
Name: generation, Length: 23808, dtype: int64
In [31]: my_series.index
In [33]: my_series.index
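The .resample() method needs a date time index, and the resampled output further down is indexed by date_dt . The cell that made that change is not visible in this export, so the following is only a sketch of one way to do it. The sample values and the name demo_series are assumptions.

```python
import pandas as pd

# Sample rows in descending time order, like the raw file.
demo = pd.DataFrame({
    'date_dt': pd.to_datetime(
        ['2018-07-01 07:00', '2018-07-01 06:00', '2018-07-01 05:00'], utc=True),
    'generation': [75650, 75818, 58363],
})

# Move the date time column into the index, then sort chronologically.
demo_series = demo.set_index('date_dt')['generation'].sort_index()

print(demo_series.index)
print(isinstance(demo_series.index, pd.DatetimeIndex))
```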
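Resample to the MONTHLY END frequency, summing ALL values within each month. The
"M" alias labels each month by its last day. Pandas 2.2 deprecated "M" in favor of
the equivalent "ME" alias.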
In [34]: ready_series = my_series.copy().resample("M").sum()
In [35]: ready_series
Out[35]: date_dt
2018-07-31 00:00:00+00:00 51584844
2018-08-31 00:00:00+00:00 49120222
2018-09-30 00:00:00+00:00 53522993
2018-10-31 00:00:00+00:00 61530792
2018-11-30 00:00:00+00:00 65513045
2018-12-31 00:00:00+00:00 74614079
2019-01-31 00:00:00+00:00 76917712
2019-02-28 00:00:00+00:00 67581408
2019-03-31 00:00:00+00:00 68366739
2019-04-30 00:00:00+00:00 62864882
2019-05-31 00:00:00+00:00 69872088
2019-06-30 00:00:00+00:00 71835222
2019-07-31 00:00:00+00:00 75485578
2019-08-31 00:00:00+00:00 74914172
2019-09-30 00:00:00+00:00 69033350
2019-10-31 00:00:00+00:00 64232759
2019-11-30 00:00:00+00:00 66277431
2019-12-31 00:00:00+00:00 73554316
2020-01-31 00:00:00+00:00 71315906
2020-02-29 00:00:00+00:00 63381879
2020-03-31 00:00:00+00:00 62970581
2020-04-30 00:00:00+00:00 59340276
2020-05-31 00:00:00+00:00 64463536
2020-06-30 00:00:00+00:00 67451902
2020-07-31 00:00:00+00:00 69636785
2020-08-31 00:00:00+00:00 69271076
2020-09-30 00:00:00+00:00 65977276
2020-10-31 00:00:00+00:00 59644949
2020-11-30 00:00:00+00:00 61784698
2020-12-31 00:00:00+00:00 70140098
2021-01-31 00:00:00+00:00 72048088
2021-02-28 00:00:00+00:00 63280414
2021-03-31 00:00:00+00:00 38627244
Freq: M, Name: generation, dtype: int64
In [36]: ready_series.size
Out[36]: 33
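The text earlier mentions the MONTHLY START frequency, while the result above is labeled by month end. The sketch below uses the "MS" alias, which labels each month by its first day instead. The monthly sums are the same either way, and only the labels differ. The sample values and names are assumptions.

```python
import pandas as pd

# Four hourly values that straddle a month boundary.
demo_series = pd.Series(
    [10, 20, 30, 40],
    index=pd.to_datetime(
        ['2018-07-31 22:00', '2018-07-31 23:00',
         '2018-08-01 00:00', '2018-08-01 01:00'], utc=True),
    name='generation',
)

# "MS" = month start: each monthly bin is labeled with the first day of the month.
month_start = demo_series.resample('MS').sum()
print(month_start)
```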
The time series visualization for the resampled monthly frequency data.
In [37]: ready_series.plot( figsize=(12, 5) )
plt.show()
If we wanted WEEKLY data, we would still need to identify the summary stat and the
sampling frequency. Here we use the total, or SUMMED, value per week.
In [38]: weekly_series = my_series.copy().resample('W').sum()
In [39]: weekly_series
Out[39]: date_dt
2018-07-01 00:00:00+00:00 1512691
2018-07-08 00:00:00+00:00 14569442
2018-07-15 00:00:00+00:00 10907296
2018-07-22 00:00:00+00:00 10438491
2018-07-29 00:00:00+00:00 10976691
...
2021-02-21 00:00:00+00:00 15795919
2021-02-28 00:00:00+00:00 15273776
2021-03-07 00:00:00+00:00 15024074
2021-03-14 00:00:00+00:00 14914081
2021-03-21 00:00:00+00:00 8689089
Freq: W-SUN, Name: generation, Length: 143, dtype: int64
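The first weekly total above is much smaller than the others. The "W" alias defaults to weeks ending on Sunday (W-SUN), and the data starts on Sunday 2018-07-01, so the first bin holds only that one partial day. A minimal sketch with made-up values of 1 per observation shows how observations fall into the weekly bins.

```python
import pandas as pd

# Five observations around two Sunday boundaries; each has value 1 so the
# weekly sum counts how many observations land in each bin.
idx = pd.to_datetime(
    ['2018-07-01 05:00',   # Sunday, first observation
     '2018-07-01 23:00',   # same Sunday
     '2018-07-02 00:00',   # Monday, start of the next week
     '2018-07-08 23:00',   # the following Sunday
     '2018-07-09 00:00'],  # Monday again
    utc=True)
counts = pd.Series([1, 1, 1, 1, 1], index=idx, name='generation')

# Each bin is labeled by the Sunday that ends the week.
weekly = counts.resample('W').sum()
print(weekly)
```

The same reasoning applies to the last weekly bin and to the final month, which are also only partially covered by the data.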
Summary
This report showed how to organize high frequency hourly data for monthly time series
exploration.