Time Series
Time Series
Time Series
In [13]: df = pd.read_csv('sunspots.csv')
print(df.head())
df.info()
We can see from info, Date column is stored as object i.e. string data type . Date column must be converted into
datatime format which makes it easier for working with date and time data.There is an unnecessary column
named 'Unnamed :0' which has to be removed.
df['Date']=pd.to_datetime(df['Date']) <-- can be used to convert a column to into datetime data type column
pd.drop can be used for dropping the unnnecesary column
Here, I am using 'usecols' argument inside pd.read_csv for selecting only required column.
'parse_date', & 'date_parser' arguments for converting Date column into datetime data type.
inside 'parse_data, we have to pass the column to be conveted into datetime, here, it is 'Date' column.
'dateparse' function below is requied which is basically converting any argument passed to it into datetime
data type . This is given to 'data_parser' inside pd.read_csv.
check the documentation of pd.read_csv, there are more than 15 arguments, which can be used to perform
many operations while importing the data itself.
file:///C:/Users/hP/Downloads/Sunspot_Time_Series.html 1/31
10/25/2020 Sunspot_Time_Series_Data
Out[7]:
Date Monthly Mean Total Sunspot Number
0 1749-01-31 96.7
1 1749-02-28 104.3
2 1749-03-31 116.7
3 1749-04-30 92.8
4 1749-05-31 141.7
In [8]: df.info() ## Checking the info again : data type of Date column --> has convet
ed into datetime
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3235 entries, 0 to 3234
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 3235 non-null datetime64[ns]
1 Monthly Mean Total Sunspot Number 3235 non-null float64
dtypes: datetime64[ns](1), float64(1)
memory usage: 50.7 KB
file:///C:/Users/hP/Downloads/Sunspot_Time_Series.html 2/31
10/25/2020 Sunspot_Time_Series_Data
In [10]: df_non_index['Month']=df_non_index.Date.dt.month
df_non_index.head()
Out[10]:
Date Monthly Mean Total Sunspot Number Month
0 1749-01-31 96.7 1
1 1749-02-28 104.3 2
2 1749-03-31 116.7 3
3 1749-04-30 92.8 4
4 1749-05-31 141.7 5
The following code is extracting the each year of the decade, for example in string '1749' last character i.e.
(3rd positional) is year 9 of that decade, which has been extracded and kept in another column named
'nth_year.
for '1748' it wil be year 8.
But for '1750' it will be year '0' which has to be 10. Thus .replace('0','10') is applied and finally converted
back into intger by type casting.
Out[11]:
Date Monthly Mean Total Sunspot Number Month nth_year
0 1749-01-31 96.7 1 9
1 1749-02-28 104.3 2 9
2 1749-03-31 116.7 3 9
3 1749-04-30 92.8 4 9
4 1749-05-31 141.7 5 9
file:///C:/Users/hP/Downloads/Sunspot_Time_Series.html 3/31
10/25/2020 Sunspot_Time_Series_Data
file:///C:/Users/hP/Downloads/Sunspot_Time_Series.html 4/31
10/25/2020 Sunspot_Time_Series_Data
In [13]: df = df.set_index('Date')
df.head()
Out[13]:
Monthly Mean Total Sunspot Number
Date
1749-01-31 96.7
1749-02-28 104.3
1749-03-31 116.7
1749-04-30 92.8
1749-05-31 141.7
In [14]: df.tail()
Out[14]:
Monthly Mean Total Sunspot Number
Date
2018-03-31 2.5
2018-04-30 8.9
2018-05-31 13.2
2018-06-30 15.9
2018-07-31 1.6
Out[15]: <AxesSubplot:xlabel='Date'>
file:///C:/Users/hP/Downloads/Sunspot_Time_Series.html 5/31
10/25/2020 Sunspot_Time_Series_Data
The data is too to large to see it in a one graph, there are 3235 monthly entries from date 1749-01-31 to
2018-07-31.
One way to slice the data and visualise any particular time zone.
Plotly express provide slider and button to select particular time zone.
Checking both:
Out[16]: <AxesSubplot:xlabel='Date'>
Out[17]: <AxesSubplot:xlabel='Date'>
file:///C:/Users/hP/Downloads/Sunspot_Time_Series.html 6/31
10/25/2020 Sunspot_Time_Series_Data
plotly express
Mean_Sunspot_Slider
hly Mean Total Sunspot Number
400
300
200
100
file:///C:/Users/hP/Downloads/Sunspot_Time_Series.html 7/31
10/25/2020 Sunspot_Time_Series_Data
fig.update_xaxes(
rangeslider_visible=True,
rangeselector=dict(
buttons=list([
dict(count=10, label="10y", step="year", stepmode="backward"),
dict(count=20, label="20y", step="year", stepmode="backward"),
dict(count=30, label="30y", step="year", stepmode="backward"),
dict(count=40, label="40y", step="year", stepmode="backward"),
dict(count=50, label="50y", step="year", stepmode="backward"),
dict(step="all")
])
)
)
fig.show()
Mean_Sunspot_Slider
400
300
200
100
file:///C:/Users/hP/Downloads/Sunspot_Time_Series.html 8/31
10/25/2020 Sunspot_Time_Series_Data
In [ ]: df_non_index.head()
Out[ ]:
Date Monthly Mean Total Sunspot Number Month nth_year
0 1749-01-31 96.7 1 9
1 1749-02-28 104.3 2 9
2 1749-03-31 116.7 3 9
3 1749-04-30 92.8 4 9
4 1749-05-31 141.7 5 9
In above graph we can see the pattern is repeating after 11 year approx, choose the time to match any two
reapeated pattern
x=np.arange(1,len(df_11_1996['Date'])+1)
file:///C:/Users/hP/Downloads/Sunspot_Time_Series.html 9/31
10/25/2020 Sunspot_Time_Series_Data
Lag plot
It helps to understand the autocorrelation lag, visualizing for few, normally lag greater than 4 is not useful.
As we increase the lag time, the correlation is decresing.
the data is correlated with its recet time lag upt 4/5 time lag.
In [ ]: fig=plt.figure(figsize=(18,6))
fig.subplots_adjust(hspace=0.4, wspace=0.2)
ax1=fig.add_subplot(2,2,1)
pd.plotting.lag_plot(df['Monthly Mean Total Sunspot Number'],lag=1)
plt.title('Lag_1')
ax2=fig.add_subplot(2,2,2)
pd.plotting.lag_plot(df['Monthly Mean Total Sunspot Number'],lag=3)
plt.title('Lag_3')
ax3=fig.add_subplot(2,2,3)
pd.plotting.lag_plot(df['Monthly Mean Total Sunspot Number'],lag=6)
plt.title('Lag_6')
ax3=fig.add_subplot(2,2,4)
pd.plotting.lag_plot(df['Monthly Mean Total Sunspot Number'],lag=24)
plt.title('Lag_24')
plt.show()
file:///C:/Users/hP/Downloads/Sunspot_Time_Series.html 10/31
10/25/2020 Sunspot_Time_Series_Data
In [ ]: fig=plt.figure(figsize=(18,6))
fig.subplots_adjust(hspace=0.4, wspace=0.2)
ax1=fig.add_subplot(1,2,1)
df['Monthly Mean Total Sunspot Number'].hist()
plt.title('Histogram')
ax2=fig.add_subplot(1,2,2)
df['Monthly Mean Total Sunspot Number'].plot(kind='density')# kernel density p
lot
plt.title('KDE')
In [ ]:
A TS is said to be stationary if its statistical properties such as mean, variance remain constant over time. But
why is it important? Most of the TS models work on the assumption that the TS is stationary. Intuitively, we can
state that if a TS has a particular behaviour over time, there is a very high probability that it will follow the same
in the future. Also, the theories related to stationary series are more mature and easier to implement as
compared to non-stationary series.
Stationarity is defined using very strict criterion. However, for practical purposes we can assume the series to be
stationary if it has constant statistical properties over time, ie. the following:
https://www.statsmodels.org/stable/examples/notebooks/generated/stationarity_detrending_adf_kpss.html
(https://www.statsmodels.org/stable/examples/notebooks/generated/stationarity_detrending_adf_kpss.html)
file:///C:/Users/hP/Downloads/Sunspot_Time_Series.html 11/31
10/25/2020 Sunspot_Time_Series_Data
file:///C:/Users/hP/Downloads/Sunspot_Time_Series.html 12/31
10/25/2020 Sunspot_Time_Series_Data
/usr/local/lib/python3.6/dist-packages/statsmodels/tsa/stattools.py:1687: Int
erpolationWarning:
if p < 0.05 :
print('Series is not Stationary')
else:
print('Series is Stationary')
There are many ways to model a time series in order to make predictions.Few are
discussed here:
Different Moving Averages
Exponential Smoothing
ARIMA
SARIMA
file:///C:/Users/hP/Downloads/Sunspot_Time_Series.html 13/31
10/25/2020 Sunspot_Time_Series_Data
Rolling Statistics:
We can plot the moving average or moving variance and see if it varies with time. By moving average/variance I
mean that at any instant ‘t’, we’ll take the average/variance of the last year, i.e. last 12 months. But again this is
more of a visual technique :
file:///C:/Users/hP/Downloads/Sunspot_Time_Series.html 14/31
10/25/2020 Sunspot_Time_Series_Data
https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average
(https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average) #### * Luckly there is a
function for this in pandas.
file:///C:/Users/hP/Downloads/Sunspot_Time_Series.html 15/31
10/25/2020 Sunspot_Time_Series_Data
file:///C:/Users/hP/Downloads/Sunspot_Time_Series.html 16/31
10/25/2020 Sunspot_Time_Series_Data
In [ ]: df_with_diff_avg=df[:200].copy()
df_with_diff_avg['Rolling mean']=df['Monthly Mean Total Sunspot Number'][:200]
.rolling(3).mean()
df_with_diff_avg['W_M_A']= df['Monthly Mean Total Sunspot Number'][:200].rolli
ng(window=3).apply(wma(np.array([0.5,1,1.5])))
df_with_diff_avg['E_W_A']= df['Monthly Mean Total Sunspot Number'][:200].ewm(s
pan=3, adjust=False, min_periods=0).mean()
df_with_diff_avg['E_S_M_A']= df['Monthly Mean Total Sunspot Number'][:200].ewm
(alpha=0.7, adjust=False, min_periods=3).mean()
print(df_with_diff_avg.head())
#df_with_diff_avg.set_index('Date', inplace=True)
df_with_diff_avg.plot()
[5 rows x 5 columns]
In [ ]: df_with_diff_avg.dropna(inplace=True)
file:///C:/Users/hP/Downloads/Sunspot_Time_Series.html 17/31
10/25/2020 Sunspot_Time_Series_Data
In [ ]: df_with_diff_avg.head()
Out[ ]:
Monthly Mean Total Sunspot Number Rolling mean W_M_A E_W_A E_S_M_A
Date
In [ ]: def RMSE_CAL(df):
Rolling_Mean_RMSE=np.sqrt(np.sum((df.iloc[:,0]-df.iloc[:,1])**2))
W_M_A_RMSE=np.sqrt(np.sum((df.iloc[:,0]-df.iloc[:,2])**2))
E_W_A_RMSE=np.sqrt(np.sum((df.iloc[:,0]-df.iloc[:,3])**2))
E_S_M_A_RMSE=np.sqrt(np.sum((df.iloc[:,0]-df.iloc[:,4])**2))
return {"Rolling_Mean_RMSE":Rolling_Mean_RMSE,"W_M_A_RMSE":W_M_A_RMSE,"E_W_A
_RMSE":E_W_A_RMSE,"E_S_M_A_RMSE":E_S_M_A_RMSE}
RMSE_CAL(df_with_diff_avg)
file:///C:/Users/hP/Downloads/Sunspot_Time_Series.html 18/31
10/25/2020 Sunspot_Time_Series_Data
Systematic: Components of the time series that have consistency or reocurrence and can be described and
modeled as level,trend, seasonality.
Non-Systematic: Components of the time series that cannot be directly modeled is noise/residual.
file:///C:/Users/hP/Downloads/Sunspot_Time_Series.html 19/31
10/25/2020 Sunspot_Time_Series_Data
In [ ]: # Additive decomposition
from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(df['Monthly Mean Total Sunspot Number'], model="ad
ditive",freq=11*12) # Data Trend is repeated after every 11 year,freq=11*12
result.plot()
plt.show()
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:3: FutureWarnin
g:
checking the definition of decompostion for additive nature time series data
y(t) = Trend + Seasonality + Noise
In [ ]: total_sum=result.trend+result.seasonal+result.resid
total_sum[:100] # compare this result with original Sunspot data
Out[ ]: Date
1749-01-31 NaN
1749-02-28 NaN
1749-03-31 NaN
1749-04-30 NaN
1749-05-31 NaN
...
1756-12-31 15.7
1757-01-31 23.5
1757-02-28 35.3
1757-03-31 43.7
1757-04-30 50.0
Length: 100, dtype: float64
file:///C:/Users/hP/Downloads/Sunspot_Time_Series.html 20/31
10/25/2020 Sunspot_Time_Series_Data
Out[ ]: Date
1749-01-31 96.7
1749-02-28 104.3
1749-03-31 116.7
1749-04-30 92.8
1749-05-31 141.7
...
1756-12-31 15.7
1757-01-31 23.5
1757-02-28 35.3
1757-03-31 43.7
1757-04-30 50.0
Name: Monthly Mean Total Sunspot Number, Length: 100, dtype: float64
Detrended Data :
Since our data is additive in nature we are going to subtract the trend from observed value and get the
detrended data:
In [ ]: pd.DataFrame(result.observed-result.trend).plot()
file:///C:/Users/hP/Downloads/Sunspot_Time_Series.html 21/31
10/25/2020 Sunspot_Time_Series_Data
5. Autocorrelation plot
We can assume the distribution of each variable fits a Gaussian (bell curve) distribution. If this is the case,
we can use the Pearson’s correlation coefficient to summarize the correlation between the variables.
The Pearson’s correlation coefficient is a number between -1 and 1 that describes a negative or positive
correlation respectively. A value of zero indicates no correlation.
We can calculate the correlation for time series observations with observations with previous time steps,
called lags. Because the correlation of the time series observations is calculated with values of the same
series at previous times, this is called a serial correlation, or an autocorrelation.
A plot of the autocorrelation of a time series by lag is called the AutoCorrelation Function, or the acronym
ACF. This plot is sometimes called a correlogram or an autocorrelation plot.
This helps us to find if current value depends on previous values. In the plot you can observe that current
value is dependent on previous 120-130 values. This can be around 10/11 years as it is monthly data.
Out[ ]: Date
1749-12-31 134.875000
1750-12-31 139.000000
1751-12-31 79.441667
1752-12-31 79.666667
1753-12-31 51.125000
...
2014-12-31 113.608333
2015-12-31 69.783333
2016-12-31 39.825000
2017-12-31 21.816667
2018-12-31 8.514286
Freq: A-DEC, Name: Monthly Mean Total Sunspot Number, Length: 270, dtype: flo
at64
file:///C:/Users/hP/Downloads/Sunspot_Time_Series.html 22/31
10/25/2020 Sunspot_Time_Series_Data
# Draw Plot
plot_acf(df['Monthly Mean Total Sunspot Number'].tolist(), lags=20, ax=axes[0
])
plot_pacf(df['Monthly Mean Total Sunspot Number'].tolist(), lags=20, ax=axes[1
])
Out[ ]:
file:///C:/Users/hP/Downloads/Sunspot_Time_Series.html 23/31
10/25/2020 Sunspot_Time_Series_Data
Finally, we add the final component: seasonality S(P, D, Q, s), where s is simply
the season’s length. Furthermore, this component requires the parameters P and
Q which are the same as p and q, but for the seasonal component. Finally, D is
the order of seasonal integration representing the number of differences required
to remove seasonality from the series.
Combining all, we get the SARIMA(p, d, q)(P, D, Q, s) model. The main takeaway is: before modelling with
SARIMA, we must apply transformations to our time series to remove seasonality and any non-stationary
behaviors.
In [ ]:
https://alkaline-ml.com/pmdarima/tips_and_tricks.html (https://alkaline-
ml.com/pmdarima/tips_and_tricks.html)
Auto ARIMA
file:///C:/Users/hP/Downloads/Sunspot_Time_Series.html 24/31
10/25/2020 Sunspot_Time_Series_Data
In [ ]: import pmdarima as pm
file:///C:/Users/hP/Downloads/Sunspot_Time_Series.html 25/31
10/25/2020 Sunspot_Time_Series_Data
file:///C:/Users/hP/Downloads/Sunspot_Time_Series.html 26/31
10/25/2020 Sunspot_Time_Series_Data
In [ ]: model.summary()
Out[ ]:
SARIMAX Results
- 3235
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
file:///C:/Users/hP/Downloads/Sunspot_Time_Series.html 27/31
10/25/2020 Sunspot_Time_Series_Data
In [ ]: df.head()
Out[ ]:
Date Monthly Mean Total Sunspot Number
0 1749-01-31 96.7
1 1749-02-28 104.3
2 1749-03-31 116.7
3 1749-04-30 92.8
4 1749-05-31 141.7
In [ ]: train=df[(df.Date.dt.year<1958)]
test=df[(df.Date.dt.year>=1958)]
Out[ ]: 0 False
1 False
2 False
3 False
4 False
...
3230 False
3231 False
3232 False
3233 False
3234 False
Name: Date, Length: 3235, dtype: bool
In [ ]: forecast=model.predict(n_periods=n, return_conf_int=True)
file:///C:/Users/hP/Downloads/Sunspot_Time_Series.html 28/31
10/25/2020 Sunspot_Time_Series_Data
The result may not seem accurate but, note that time series forecasting is not
reasonable for many time step ahead. It may be valid only for 1,2 or few more
time step ahead in future.
From AutoArima, we have already got the parameters--> p,d,q.Using it directly on ARIMA model
file:///C:/Users/hP/Downloads/Sunspot_Time_Series.html 29/31
10/25/2020 Sunspot_Time_Series_Data
In [ ]: predictions = []
lower_list = []
upper_list = []
for t in range(len(test)):
model = ARIMA(history, order=(4,0,1))
model_fit = model.fit(disp=0)
output = model_fit.forecast() # The number of time step ahead prediction o
ut of sample from the end of the sample. Default it is 1
yhat = output[0] #
lower = output[2][0][0] # lower bound for 95 % confidence interval for pre
dicted value
upper = output[2][0][1] # upper bound for 95 % confidence interval for pre
dicted value
predictions.append(yhat) # appending predicted value in prediction
lower_list.append(lower)
upper_list.append(upper)
obs = test[t] # taking t_th data from test as 'obs' and appendin
g it in 'history' list which is used for training
history.append(obs)
#print('predicted=%f, expected=%f' % (yhat, obs))
In [ ]: # plot
plt.plot(test,color='black',label='test set')
plt.plot(lower_list,color='red',label='lower bound of 95 % confidence interva
l')
plt.plot(upper_list,color='green',label='upper bound of 95% confidenc interva
l')
plt.plot(predictions,color='orange',label='predicted for test set')
plt.legend()
plt.show()
file:///C:/Users/hP/Downloads/Sunspot_Time_Series.html 30/31
10/25/2020 Sunspot_Time_Series_Data
https://www.linkedin.com/in/ramendra-kumar-57334478/
(https://www.linkedin.com/in/ramendra-kumar-57334478/)
HAPPY LEARNING!!!
file:///C:/Users/hP/Downloads/Sunspot_Time_Series.html 31/31