Developing a Toolkit in Python to Test the Effects of Changepoints and Seasonality on Time Series
Author: Ashwin K. Avula
Peer Reviewer: Sarah Tang
Professional Reviewer: Steven Larsen
Introduction
Fluctuations in the stock market can profoundly influence individual consumers and economies. A collapse in share prices has the potential to generate widespread economic disruption. Being able to analyze and predict the market is a determining factor in ensuring a stable economy [1].
Most events that occur in the future are influenced by past events. In order to predict these future events, time series are used to understand patterns that have evolved over time [14]. A time series is simply a series of data points ordered and graphed chronologically at equally spaced time intervals. Time series are used to display a set of data in order to understand the underlying patterns that produced the observed values [15]. Figure 1, a stock market trend, depicts this concept of a time series.
Time series analysis has become increasingly important in a plethora of fields including medicine, business, meteorology, and entertainment; being able to predict future events in these fields could potentially boost economic growth [1].
One goal of time series analysis is the detection of changepoints. Change detection, or changepoint detection (CPD), identifies abrupt variations in time series data. Change detection is vital in time series analysis; the nature and degree of known changes can be used as prior information in forecasting and prediction.
In addition to changepoints, seasonality analysis is necessary in identifying and measuring seasonal variations within a system to aid forecasting. Seasonality is the presence of variations that occur at specific intervals within a certain time series. This effect is caused by various factors including weather, holidays, and politics. Understanding the level of seasonality present in a time series is vital to prepare for the temporary effects caused by increases or decreases in the time series.
This study discusses the effects of changepoint sensitivity and seasonality values on the accuracy of time series forecasting. Using Python, an open-source and user-friendly programming language, a toolkit was developed to analyze and forecast any time series using trained regression models [2]. By experimenting with different changepoint sensitivity and seasonality values, more accurate predictions can be made for any time series.
Materials & Dependencies
In order to develop this toolkit, several environments and programming libraries were implemented for their unique functionalities:
- Python 3.6.5:
- High-level, general purpose, open-source language allowing for the effective integration of systems and repositories [2].
- Anaconda:
- Open-source distribution for scientific computing, data science, and predictive analytics [3]
- Spyder:
- Powerful open-source scientific programming environment for advanced editing, code analysis, debugging, and data exploration [4]
- Jupyter:
- Open-source web application that allows for the creation of live code, real-time visualizations, and narrative text [5]
- Python Libraries:
- Quandl 3.3.0:
- Library imported for financial, economic, and alternative datasets [6]
- Matplotlib 2.2.2:
- Library imported for 2D and 3D plotting of data [7]
- Numpy 1.14.4:
- Library imported for scientific and mathematical computing [8]
- Fbprophet 0.3.1:
- Library imported for forecasting procedures
- Provides completely automated regression functions for forecasting [9]
- Pandas 0.23.0:
- Library imported for high-performance data-analysis tools [10]
During the development of this toolkit, all programs and commands were created within Anaconda's programming environments in order to most efficiently import and use the listed Python libraries, as sketched below.
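As a point of reference, a minimal set of imports for the environment described above might look like the following sketch; it is illustrative only and does not reflect the toolkit's exact module layout.

```python
# Minimal import sketch for the dependencies listed above (illustrative only).
import quandl                    # financial, economic, and alternative datasets
import numpy as np               # scientific and mathematical computing
import pandas as pd              # high-performance data-analysis tools
import matplotlib.pyplot as plt  # 2D plotting of time series data
from fbprophet import Prophet    # automated additive regression forecasting
```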
Methods
Regression Theory
Given a time series {y_i, x_i1, ... , x_ip} for i = 1, ..., n, where y_i represents the outcome of the predictors x_i1, ..., x_ip for each of the n data points, this toolkit produces an additive regression model of the form

y_i = Σ_j f_j(x_ij) + Σ_j g_j(x_ij) + ε_i,

where ε_i denotes the error term.
In this model, f_j(x_ij) represents the non-linear and linear smoothing functions fit from the data that produce the general long-term trend. In this toolkit, saturating growth models and piecewise linear models are incorporated to lay out the non-periodic changes in the values of the time series [11].
In addition, g_j(x_ij) represents the Fourier series that corresponds to seasonality and cyclical patterns in the time series. In this toolkit, Fourier partial sums are incorporated, where the number of terms in the partial sum determines the frequency of a pattern within the time series.
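For reference, a Fourier partial sum of order K with period P takes the standard form shown below, as used in the additive forecasting approach of [11]; the symbols K, P, a_k, and b_k are assumed notation rather than the toolkit's own variable names.

```latex
% Fourier partial sum of order K with period P (standard form; notation assumed)
s(t) = \sum_{k=1}^{K} \left[ a_k \cos\!\left( \frac{2\pi k t}{P} \right)
                           + b_k \sin\!\left( \frac{2\pi k t}{P} \right) \right]
```

Increasing the number of terms K allows the seasonal component to capture finer, more rapidly varying patterns.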
The implementation of an additive nonparametric regression model allows for the use of several linear and nonlinear functions of time. Because of this, models generated by this toolkit can effectively accommodate new sources of cyclical patterns and even fluctuations in a time series.
Toolkit Functionality
Training
Before training, the toolkit splits the given time series into training and validation sets. Typically, most of the data is used for training and a smaller portion is reserved for validation [11]. In this study, the time series data was split 80% and 20% for training and validation, respectively. By the Pareto Principle, 80/20 is a naturally observed ratio; thus, this split has become widely applied in optimization efforts, especially in the field of machine learning [12].
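A minimal sketch of this 80/20 split is shown below, assuming the series is held in a pandas DataFrame with Prophet-style "ds" (date) and "y" (value) columns; the column names and helper name are assumptions for illustration.

```python
import pandas as pd

def split_train_validation(df, train_fraction=0.8):
    """Split a chronologically ordered time series into training and validation sets."""
    df = df.sort_values("ds").reset_index(drop=True)  # ensure chronological order
    cutoff = int(len(df) * train_fraction)            # index of the 80/20 boundary
    train = df.iloc[:cutoff]                          # first 80% of observations
    validation = df.iloc[cutoff:]                     # most recent 20% held out
    return train, validation
```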
In order to determine the optimal model for an input time series, this toolkit produces several distinct models, each with its own changepoint sensitivity and seasonality values. During training, each model is given the first 80% of the time series data and is then asked to forecast the remaining 20% of the known data; validation metrics computed on this held-out portion determine the most accurate model. By iterating through different changepoint sensitivity values, the toolkit determines the best combination of smoothing functions for the given time series based on how each model underfits or overfits during training. Likewise, by iterating through many different seasonality values, the toolkit determines the best Fourier partial sum for the time series based on accuracy. After training, the toolkit returns a model using the optimal changepoint and seasonality values learned throughout the training process.
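The training loop described above can be sketched as a simple grid search over Prophet's changepoint_prior_scale (changepoint sensitivity) and the Fourier order of the yearly seasonality; the candidate grids and the use of mean absolute error as the validation metric are assumptions rather than the toolkit's documented choices.

```python
import numpy as np
from fbprophet import Prophet

def select_best_model(train, validation,
                      changepoint_scales=(0.01, 0.05, 0.1, 0.5),
                      fourier_orders=(5, 10, 20)):
    """Fit one model per hyperparameter combination and keep the most accurate one."""
    best = None
    for cps in changepoint_scales:
        for order in fourier_orders:
            m = Prophet(changepoint_prior_scale=cps,  # changepoint sensitivity
                        yearly_seasonality=order)     # Fourier order of yearly seasonality
            m.fit(train)
            forecast = m.predict(validation[["ds"]])  # forecast the held-out 20%
            mae = np.mean(np.abs(forecast["yhat"].values - validation["y"].values))
            if best is None or mae < best[0]:
                best = (mae, cps, order)
    return best  # (validation MAE, best changepoint scale, best Fourier order)
```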
Forecasting
In order to forecast, the toolkit re-trains the model with 100% of the time series data; the model then forecasts 30 days into the future. However, the biggest source of uncertainty in the model is the potential for future trend changes due to confounding factors [16]. To address this, the toolkit generates an uncertainty interval based on a normal distribution using the Fbprophet library. Even though future forecasts may not be perfectly accurate, the true time series values are expected to reside within the generated uncertainty interval. In general, a wider uncertainty interval indicates a less stable and less accurate model, as the model is not as confident in its predictions. While the model is fully capable of extrapolating and forecasting beyond 30 days, predictions tend to become more inaccurate, analogous to the growing inaccuracy of an Euler approximation.
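A minimal sketch of this forecasting step is given below, assuming the tuned hyperparameters from training are available; interval_width controls the width of Prophet's uncertainty interval (its default is 0.80), and the 95% value used here is an illustrative assumption.

```python
from fbprophet import Prophet

def forecast_30_days(full_df, changepoint_scale, fourier_order):
    """Re-train on 100% of the data and forecast 30 days ahead with an uncertainty interval."""
    m = Prophet(changepoint_prior_scale=changepoint_scale,
                yearly_seasonality=fourier_order,
                interval_width=0.95)              # width of the uncertainty interval
    m.fit(full_df)
    future = m.make_future_dataframe(periods=30)  # extend 30 days past the last observation
    forecast = m.predict(future)
    # yhat is the point prediction; yhat_lower and yhat_upper bound the uncertainty interval.
    return m, forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]]
```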
Testing & Results
To demonstrate the proficiency of the toolkit in analysis and projection, a test time series in the form of a stock trend was imported using the Quandl data repository [6]. Figure 1 depicts this imported time series that will be used for testing.
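As an illustration of this import step, a stock series can be pulled from Quandl and reshaped into the "ds"/"y" layout used downstream; the dataset code "WIKI/TSLA" and the choice of the adjusted closing price are assumptions for the example, not necessarily the series shown in Figure 1, and some Quandl datasets require an API key.

```python
import quandl
import matplotlib.pyplot as plt

# quandl.ApiConfig.api_key = "YOUR_KEY"          # may be required for some datasets
raw = quandl.get("WIKI/TSLA")                    # hypothetical dataset code for illustration
df = raw.reset_index()[["Date", "Adj. Close"]]   # keep the date and adjusted closing price
df.columns = ["ds", "y"]                         # Prophet-style column names

plt.plot(df["ds"], df["y"])
plt.xlabel("Date")
plt.ylabel("Adjusted closing price (USD)")
plt.title("Imported stock trend (cf. Figure 1)")
plt.show()
```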
Once the data visualization step is complete, the toolkit begins creating several models based on different combinations of changepoint sensitivity and seasonality settings that ultimately alter the way the model learns the trends and patterns in the time series. After the toolkit generates the model instances, it begins to train every model combination to determine which model best fits the given data. Below, Figure 2 exhibits an example trained model for the imported stock trend.
For each model, the toolkit determines respective accuracy values as well as an uncertainty interval. By comparing these validation metrics for each model combination tested, the most accurate model can be determined. Below, Figures 3 & 4 depict the toolkit's comparison of all generated models after training to determine the optimal model for the input time series.
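One way to express this comparison is to score each candidate model on both an accuracy metric and the average width of its uncertainty interval over the validation period; the use of mean absolute percentage error below is an assumption rather than the toolkit's documented metric.

```python
import numpy as np

def validation_metrics(forecast, validation):
    """Score a model's validation forecast against the held-out observations."""
    y_true = validation["y"].values
    y_pred = forecast["yhat"].values
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100   # accuracy metric (%)
    avg_interval = np.mean(forecast["yhat_upper"].values
                           - forecast["yhat_lower"].values)    # average uncertainty width
    return mape, avg_interval
```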
From this, the toolkit uses the optimal model to further analyze the time series data. In doing this, users gain a better understanding of how well the model performs before it generates any forecasts for the future of the time series. As stated in the Toolkit Functionality section, the selected optimal model then undergoes re-training using 100% of the time series data for forecasting. Exhibited below in Figure 5, the toolkit produces a plot of major changepoints in the time series. By recognizing approximately when these changepoints occur, the model can adapt during the re-training phase and alter the trend equation accordingly.
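A plot such as Figure 5 can be sketched with matplotlib using the fitted Prophet model's changepoints attribute, which lists the candidate changepoint dates; the styling below is illustrative, and Figure 5 may show only the most significant changepoints.

```python
import matplotlib.pyplot as plt

def plot_changepoints(model, forecast, df):
    """Overlay the model's changepoint dates on the observed series and fitted values."""
    fig, ax = plt.subplots()
    ax.plot(df["ds"], df["y"], "k.", label="observed")
    ax.plot(forecast["ds"], forecast["yhat"], label="fitted model")
    for cp in model.changepoints:                 # candidate changepoint dates
        ax.axvline(cp, color="r", linestyle="--", alpha=0.4)
    ax.legend()
    plt.show()
```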
Figure 6 depicts the final model generated by the toolkit. This visualization contains the original time series data and compares it to the generated best fit model to show how precisely the model follows the observed data. In addition to this, the figure contains the model’s narrow uncertainty interval as explained before. Finally, using this model, the toolkit generates predictions for 30 days into the future, and determines whether the time series trend will increase or decrease within this 30-day period. In the field of finance, knowing when a stock will increase or decrease in the future could help businesses make better investment decisions [1]. Figures 7 & 8 demonstrate the toolkit’s ability to create informative predictions.
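The increase/decrease verdict over the 30-day horizon can be sketched by comparing the forecast at the end of the horizon with the fitted value at the last observed date; this simple comparison is an assumption about how such a verdict might be computed, not the toolkit's documented rule.

```python
def trend_direction(forecast, horizon=30):
    """Report whether the forecast rises or falls over the final `horizon` days."""
    # With make_future_dataframe(periods=horizon), the last `horizon` rows are the future.
    start_of_window = forecast["yhat"].iloc[-horizon - 1]  # fitted value at last observed date
    end_of_window = forecast["yhat"].iloc[-1]              # prediction 30 days into the future
    return "increase" if end_of_window > start_of_window else "decrease"
```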
Conclusions
The objective of this study was to develop a fully functional toolkit in Python capable of time series manipulation, analysis, and forecasting. As demonstrated by the example time series analyzed in Figures 1-8, the toolkit was able to obtain, plot, analyze, train on, and forecast a given time series. Below, Figure 9 depicts common regression models and their lack of proficiency in training and testing. Common regression methods tend to generate either desensitized models with high bias or models that are oversensitive to noise in the data. Figure 9 exhibits these issues, as many of the models generate either an inaccurate trend or a trend with a significant uncertainty interval. By testing a plethora of hyperparameter combinations during training for any given time series, this toolkit is able to combat the underfitting and overfitting issues that commonly arise when using regression algorithms. While predicting a time series such as the stock market is challenging, this toolkit is able to undertake the task.
Future Endeavors
In order to make this toolkit more accessible to the general public, it would be beneficial to develop a graphical user interface for a more user-friendly experience. In addition to this, access to more recent and even live data would be beneficial in generating more accurate predictions and trend analyses.
References
- Pettinger, Tejvan. “How Does the Stock Market Affect the Economy?” Economics Help, www.economicshelp.org/blog/221/stock-market/how-does-the-stock-market-effect-the-economy-2/.
- “Welcome to Python.org.” Python.org, www.python.org/.
- “Home.” Anaconda, Anaconda, Inc., www.anaconda.com/.
- Team, Spyder. “Spyder Website.” Spyder Website, www.spyder-ide.org/.
- “Project Jupyter.” Project Jupyter, jupyter.org/.
- “Quandl Data Repository.” Quandl.com, www.quandl.com/.
- “Overview.” Matplotlib: Python Plotting – Matplotlib 2.2.3 Documentation, matplotlib.org/contents.html.
- “NumPy.” NumPy – NumPy, www.numpy.org/.
- “Quick Start.” Prophet Documentation, facebook.github.io/prophet/docs.
- “Python Data Analysis Library.” Pandas: Powerful Python Data Analysis Toolkit – Pandas 0.23.4 Documentation, pandas.pydata.org/.
- Minewiskan. “Training and Testing Data Sets.” Microsoft Docs, docs.microsoft.com/en-us/analysis-services/data-mining/training-and-testing-data-sets?view=asallproducts-allversions.
- Shrutiparna. “Is There a Rule-of-Thumb for How to Divide a Dataset into Training and Validation Sets?” Intellipaat Community, 2 June 2019, intellipaat.com/community/323/is-there-a-rule-of-thumb-for-how-to-divide-a-dataset-into-training-and-validation-sets.
- Taylor, Sean J., and Benjamin Letham. “Forecasting at Scale.” PeerJ Preprints 5:e3190v2, 2017, https://doi.org/10.7287/peerj.preprints.3190v2.
- Farrelly, Colleen M. “Dimensionality Reduction Ensembles.” ArXiv.org, 11 Oct. 2017, arxiv.org/abs/1710.04484.
- “6.4.1. Definitions, Applications and Techniques.” NIST/SEMATECH e-Handbook of Statistical Methods, www.itl.nist.gov/div898/handbook/pmc/section4/pmc41.htm.
- “Uncertainty Intervals.” Nextjournal, nextjournal.com/clojurework/uncertainty-intervals.