Pandas time series data manipulation is a must have skill for any data analyst/engineer. More than 70% of the world’s structured data is time series data. And pandas library in python provides powerful functions/APIs for time series data manipulation. So let’s learn the basics of data wrangling using pandas time series APIs.
For our time series data analysis task, we have taken temperature data of “Mountain View, CA” with datetime from 2018-10-25 to 2018-10-31. You can download it from this link.
At the end of this post, you will learn:
- Creating pandas time series dataframe
- Load time series CSV file
- Get summary statistics
- Data Selection from pandas time series dataframe
- Data Slicing with pandas time series dataframe
- Compute Aggregation on pandas time series dataframe
- Plot your time series data
Creating pandas time series dataframe
We can manually create time series using pandas Series. Let’s try to create random hourly data points from 2018-10-25 to 2018-10-31.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 | # Create Series of hourly data points between 2018-11-01 and 2018-11-15 date_range_series = pd.date_range(start='2018-11-01', end='2018-11-19', freq='H') print(date_range_series.size) date_range_series # Output : # # 433 # # DatetimeIndex(['2018-11-01 00:00:00', '2018-11-01 01:00:00', # '2018-11-01 02:00:00', '2018-11-01 03:00:00', # '2018-11-01 04:00:00', '2018-11-01 05:00:00', # '2018-11-01 06:00:00', '2018-11-01 07:00:00', # '2018-11-01 08:00:00', '2018-11-01 09:00:00', # ... # '2018-11-18 15:00:00', '2018-11-18 16:00:00', # '2018-11-18 17:00:00', '2018-11-18 18:00:00', # '2018-11-18 19:00:00', '2018-11-18 20:00:00', # '2018-11-18 21:00:00', '2018-11-18 22:00:00', # '2018-11-18 23:00:00', '2018-11-19 00:00:00'], # dtype='datetime64[ns]', length=433, freq='H') # Create random temperature data data = np.random.randint(8,20,size=(len(date_range_series))) weather_df = pd.DataFrame(index=date_range_series, data=data, columns=['temperature']) weather_df.head() |
temperature | |
---|---|
2018-11-01 00:00:00 | 15 |
2018-11-01 01:00:00 | 10 |
2018-11-01 02:00:00 | 17 |
2018-11-01 03:00:00 | 16 |
2018-11-01 04:00:00 | 13 |
Load time series CSV file
Another way is to read time series CSV data using Series.from_csv method which is deprecated method in python 3.6.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | from pandas import Series ## Loading Series from CSV file ## Date time will be in String format date_range_series = Series.from_csv('data/weather-data.csv', header=0) ## Format date time while loading CSV file date_range_series = Series.from_csv('data/weather-data.csv', header=0, parse_dates=[0]) date_range_series.head() # Output : # # datetime # 2018-11-01 00:56:00 15.6 # 2018-11-01 01:00:00 15.0 # 2018-11-01 01:40:00 15.0 # 2018-11-01 02:00:00 16.7 # 2018-11-01 02:40:00 16.7 # Name: temperature, dtype: float64 |
We can also read dataframe using pandas.read_csv and replace index with the datetime column.
1 2 3 4 | weather_df = pd.read_csv('data/weather-data.csv') weather_df['datetime'] = pd.to_datetime(weather_df['datetime']) weather_df = weather_df.set_index('datetime') weather_df.head() |
temperature | |
---|---|
datetime | |
2018-10-25 03:40:00 | 15.0 |
2018-10-25 04:04:00 | 15.0 |
2018-10-25 04:40:00 | 15.0 |
2018-10-25 05:40:00 | 14.0 |
2018-10-25 06:04:00 | 14.0 |
Calculate summary statistics
We can get summary statistics using describe() methods on DataFrame as well as on Series.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | from pandas import Series date_range_series = Series.from_csv('data/weather-data.csv', header=0) print(date_range_series.describe()) # Output : # # count 278.000000 # mean 17.770504 # std 4.377023 # min 10.600000 # 25% 14.000000 # 50% 17.000000 # 75% 20.900000 # max 28.000000 # Name: temperature, dtype: float64 |
Data Selection from pandas time series dataframe
Let’s say we have weather dataframe with datetime index named ‘datetime’, we can simply select data of any year, month and day by specifying it as an index.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 | ## Select all values on 31 Oct weather_df = pd.read_csv('data/weather-data.csv') weather_df['datetime'] = pd.to_datetime(weather_df['datetime']) weather_df = weather_df.set_index('datetime') print(weather_df.shape) ## DataFrame with all records of 2018 df_2018 = weather_df['2018'] print(df_2018.shape) # Data will be same ## DataFrame with all records of Oct 2018 df_oct_2018 = weather_df['2018-10'] print(df_oct_2018.shape) # Data will be same ## DataFrame with all records of 31 October 2018 df_31_oct_2018 = weather_df['2018-10-31'] print(df_31_oct_2018.shape) # Data will be same df_31_oct_2018.head() # Output : # # (278, 1) # (278, 1) # (278, 1) # (40, 1) |
temperature | |
---|---|
datetime | |
2018-10-31 00:40:00 | 12.8 |
2018-10-31 01:00:00 | 12.2 |
2018-10-31 01:40:00 | 12.2 |
2018-10-31 02:00:00 | 12.0 |
2018-10-31 02:40:00 | 12.0 |
Data slicing with time series dataframe
Get temperature data from 30 Oct 2018 4PM to 6 PM
1 2 3 4 | weather_df = pd.read_csv('data/weather-data.csv') weather_df['datetime'] = pd.to_datetime(weather_df['datetime']) weather_df = weather_df.set_index('datetime') weather_df[datetime(2018,10,30,16):datetime(2018,10,30,18)] |
temperature | |
---|---|
datetime | |
2018-10-30 16:47:00 | 25.0 |
2018-10-30 16:56:00 | 25.6 |
2018-10-30 17:59:00 | 23.3 |
Get temperature data from 28 Oct 2018 to 28 Oct 2018
1 | weather_df['2018-10-28':'2018-10-30'].head() |
temperature | |
---|---|
datetime | |
2018-10-28 00:40:00 | 17.0 |
2018-10-28 01:08:00 | 17.0 |
2018-10-28 01:40:00 | 17.0 |
2018-10-28 02:00:00 | 17.0 |
2018-10-28 02:40:00 | 17.0 |
Aggregation on time series dataframe
From your time series dataframe you can always change the granularity to higher level and aggregate. Here resample our existing dataframe to daily level min, max and average temperature values.
1 2 3 4 5 6 7 | # Changing aggregation level to daily uusing resample methof mean_temp = weather_df.resample('D').mean().rename(columns={'temperature':'avg'}) min_temp = weather_df.resample('D').min().rename(columns={'temperature':'min'}) max_temp = weather_df.resample('D').max().rename(columns={'temperature':'max'}) daily_weather_df = pd.concat([min_temp, max_temp, mean_temp], join='inner', axis=1).reset_index() daily_weather_df |
datetime | min | max | avg | |
---|---|---|---|---|
0 | 2018-10-25 | 14.0 | 27.0 | 18.411765 |
1 | 2018-10-26 | 13.0 | 28.0 | 18.216216 |
2 | 2018-10-27 | 13.0 | 28.0 | 18.550000 |
3 | 2018-10-28 | 16.0 | 24.0 | 18.767442 |
4 | 2018-10-29 | 13.0 | 23.3 | 16.819048 |
5 | 2018-10-30 | 11.7 | 25.6 | 16.397619 |
6 | 2018-10-31 | 10.6 | 26.7 | 17.402500 |
Plot your time series data
While data exploration, plotting time series data is a critical part to check the trend and seasonality. Following is an example of plotting the two days temperature data.
1 2 3 4 5 6 | weather_df = pd.read_csv('data/weather-data.csv') weather_df['datetime'] = pd.to_datetime(weather_df['datetime']) weather_df = weather_df.set_index('datetime') hourly_series = weather_df['2018-10-28':'2018-10-30'].resample('H').mean()['temperature'] pyplot.plot(hourly_series) pyplot.show() |
In this post, we learned to play with time series data using pandas library.
In case you have any queries regarding time series data analysis in Python, reach out to me. You can ask your questions in the comments section below and stay tuned…!!
Leave a Reply