Rodrguez MachineLearningbasedPrediction 2022
Time Series
Author(s): José-Víctor Rodríguez, Ignacio Rodríguez-Rodríguez and Wai Lok Woo
Source: Publications of the Astronomical Society of the Pacific, 2022 December, Vol. 134,
No. 1042 (2022 December), pp. 1-7
Published by: Astronomical Society of the Pacific
Stable URL: https://www.jstor.org/stable/10.2307/27303191
Abstract
The study of solar activity holds special importance since the changes in our star’s behavior affect both the Earth’s
atmosphere and the conditions of the interplanetary environment. They can interfere with air navigation, space
flight, satellites, radar, high-frequency communications, and overhead power lines, and can even negatively
influence human health. We present here a machine learning-based prediction of the evolution of the current
sunspot cycle (solar cycle 25). First, we analyze the Fourier Transform of the total time series (from 1749 to 2022)
to find periodicities with which to lag this series and then add attributes (predictors) to the forecasting models to
obtain the most accurate result possible. Consequently, we build a trained model of the series considering different
starting points (from 1749 to 1940, with 1 yr steps), applying Random Forests, Support Vector Machines, Gaussian
Processes, and Linear Regression. We find that the model with the lowest error in the test phase (cycle 24) arises
with Random Forest and with 1915 as the start year of the time series (yielding a Root Mean Squared Error of 9.59
sunspots). Finally, for cycle 25 this model predicts that the maximum number of sunspots (90) will occur in 2025
March.
Unified Astronomy Thesaurus concepts: Sunspots (1653); Solar activity (1475); Time series analysis (1916);
Random Forests (1935); Linear regression (1945); Gaussian Processes regression (1930); Support vector
machine (1936)
1. Introduction
Changes in solar activity affect both the conditions of the interplanetary environment and the Earth’s atmosphere
(Hiremath 2006; Pulkkinen 2007; Hathaway 2015; Sagir et al. 2015; Kim et al. 2018), such that an increase in our
star’s intensity may not only impact aircraft navigation, space flight, satellites, radar, high-frequency
communications, and overhead power lines (Lybekk et al. 2012; Lewandowski 2015), but could also be harmful to
humans (Azcárate et al. 2016; Qu 2016). Therefore, predicting solar activity is an area of great research interest
to anticipate the potential impact of the Sun’s intensity on space technology and life on Earth in general.
In this sense, the number of sunspots (SSN) (dark areas that appear on the solar disk) is one of the most important
and simplest indicators to measure the Sun’s activity (Usoskin 2017), not least because it correlates with several
other phenomena, such as solar flares (Liu et al. 2008). As the SSN is a directly observable parameter, there are
public records on it spanning 1749 to the present day. A simple observation of SSN behavior over this period
(almost 300 yr) clearly reveals a fundamental cyclical pattern of solar activity that repeats approximately every
11 yr (we are currently in cycle number 25). However, the variations that these cycles undergo in both amplitude
and duration (Cameron & Schüssler 2017), pointing to the existence of other concurrent cycles of different periods
as well as additional underlying phenomena, make the accurate prediction of the SSN evolution of the current and
future solar cycles an ongoing research challenge. In fact, it should be mentioned that solar cycles may not
actually form a multi-periodic system at all, but rather constitute a weakly chaotic system that is unlike a
periodic, multi-periodic, or quasi-periodic dynamical system (Carbonell et al. 1994; Letellier et al. 2006;
Hanslmeier & Brajša 2010). In any case, while numerous papers, using different approaches, have attempted to
predict the SSN evolution of the current solar cycle (Han & Yin 2019; Labonville et al. 2019; Kakad et al. 2020;
Kitiashvili 2020; McIntosh et al. 2020), the great disparity in the results of these and other works (Nandy 2021)
makes it necessary to continue the search for alternative and innovative methods for the most accurate forecasting
of solar activity.
A recent approach to the prediction of time series, such as SSN, is the use of machine learning (ML) techniques.
The currently available computational power, together with the
Figure 1. 13 month smoothed monthly total SSN between 1749 and 2022.
Figure 2. Fourier Transform (periodogram) of the total time series as a function of the period (years/cycle).
Figure 3. Five main periods of the time series periodogram: (a) 5.45, (b) 8.52, (c) 10.91, (d) 15.15, and (e) 54.53 yr.
Figure 4. Average RMSE values in the test stage for the four algorithms considered as a function of the year of the
time series start.
Figure 5. Example of the training stage with the RF algorithm and time series from 1915.
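The periodogram analysis referred to in the captions of Figures 2 and 3 can be illustrated with a minimal sketch.
It assumes NumPy and a regularly sampled series; the function name dominant_periods, its parameters, and the
synthetic two-sinusoid input are hypothetical choices made for illustration and are not taken from the paper. The
synthetic series is not real sunspot data.

```python
import numpy as np

def dominant_periods(series, samples_per_year=12, n_peaks=5):
    """Return the periods (in years) of the n_peaks strongest periodogram peaks."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()                                  # remove the constant offset
    power = np.abs(np.fft.rfft(x)) ** 2               # unnormalized periodogram
    freqs = np.fft.rfftfreq(x.size, d=1.0 / samples_per_year)  # cycles per year
    power, freqs = power[1:], freqs[1:]               # drop the zero-frequency bin
    # A bin is a peak when it exceeds both of its neighbours.
    interior = (power[1:-1] > power[:-2]) & (power[1:-1] > power[2:])
    peak_idx = np.flatnonzero(np.r_[False, interior, False])
    # Keep the n_peaks peaks with the largest power, strongest first.
    strongest = peak_idx[np.argsort(power[peak_idx])[::-1][:n_peaks]]
    return 1.0 / freqs[strongest]

# Synthetic monthly series (NOT sunspot data): 264 yr with 11 yr and 5.5 yr components.
t_years = np.arange(264 * 12) / 12.0
synthetic = 80 * np.sin(2 * np.pi * t_years / 11.0) + 20 * np.sin(2 * np.pi * t_years / 5.5)
print(dominant_periods(synthetic, n_peaks=2))
```

On this synthetic input the two recovered periods should lie close to 11 and 5.5 yr, because both components
complete a whole number of cycles within the record. With real data, spectral leakage and the finite record length
limit how precisely long periods can be resolved.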
Finally, using Radial Basis Function Kernels (RBF) (Blomqvist et al. 2020) and other comparative strategies, GP can
ensure overall consistency and offer unrestricted basis functions. GP generally extracts discernible reactions from
a set of training data points (function values), subsequently modeling them as multivariate Gaussian random
variables (Seeger 2004), making it non-parametric. It assumes that the function data values have a prior
distribution, ensuring the smooth operation of the function. If the vectors being compared are close regarding their
separation and sensitivity, the function values correlate closely, with divergence producing decay. Thus, we may
make assumptions to estimate the distribution of the unpredicted function data, applying basic probability
manipulation.
3. Results and Discussion
Figure 4 presents the average RMSE obtained for the test data with the different algorithms while considering, in
each case, a different time series start year (with 1 yr steps). The initialization parameters assumed for the
models are the following:
1. RF:
(a) Bag size (percentage of training set size): 100.
(b) Number of threads: 1.
(c) Number of trees: 100.
(d) Maximum depth of trees: unlimited.
(e) Number of random features: 0.
(f) Seed: 1.
2. LR:
(a) Ridge parameter: 10^-8.
3. GP:
(a) Level of Gaussian Noise: 1.0.
(b) Seed: 1.
(c) Polynomial kernel: size of the cache = 250,007; exponent = 1.0.
4. SVM:
(a) C parameter: 1.0.
(b) Polynomial kernel: size of cache = 250,007; exponent = 1.0.
(c) Sequential minimal optimization (SMO) with epsilon for round-off error = 10^-12; epsilon parameter for epsilon
insensitive loss function = 0.001; tolerance = 0.001; seed = 1.
The purpose of this analysis is two-fold: first, to determine which model is the best fit and, second, to determine
from which start year the series provides the most adequate quantity (and quality) of information to obtain a more
accurate prediction.
Clearly, the minimum error is reached when considering the time series start in 1915 and with the RF model (an
average RMSE of 9.59 sunspots); however, for that same year, the GP and LR algorithms are also close, with SVM being
slightly more distant. Therefore, it is worth noting that, in this case, considering 1915 as the starting year of
the time series represents the optimal compromise between, on the one hand, having a sufficiently large number of
records to perform a good prediction and, on the other hand, the fact that the more recent these data are, the
better they reflect the current behavior of the changing solar activity and, consequently, the more accurate the
forecast will be. It is also worth noting how the RMSE seems to have a certain oscillating behavior (with ups and
downs common to the four models) with increasing time series start year. This fact could be explained by the
increasing and decreasing influence of the different considered periods as the progressive cutting of the
historical data eliminates elements that are more or less relevant in shaping such periods.
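The start-year sweep described above can be sketched as follows. This is a minimal illustration under stated
assumptions, not the authors' implementation: it assumes NumPy, builds lagged copies of the series as predictors,
and substitutes a closed-form ridge-regularized linear regression (with the 10^-8 ridge value listed for LR) for the
RF model that gave the lowest error. The function names, the lag values, and the synthetic series are all
hypothetical.

```python
import numpy as np

def lagged_design(series, lags):
    """Use series[t - lag], for each lag, as predictor columns for the target series[t]."""
    x = np.asarray(series, dtype=float)
    max_lag = max(lags)
    X = np.column_stack([x[max_lag - lag : x.size - lag] for lag in lags])
    y = x[max_lag:]
    return X, y

def ridge_fit(X, y, ridge=1e-8):
    """Closed-form ridge regression; the intercept (first coefficient) is not penalized."""
    A = np.column_stack([np.ones(len(X)), X])
    penalty = ridge * np.eye(A.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(A.T @ A + penalty, A.T @ y)

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def sweep_start(series, lags, n_test, starts):
    """Map each candidate start index to the RMSE on the last n_test targets."""
    scores = {}
    for start in starts:
        X, y = lagged_design(series[start:], lags)
        weights = ridge_fit(X[:-n_test], y[:-n_test])      # train on the earlier part
        pred = np.column_stack([np.ones(n_test), X[-n_test:]]) @ weights
        scores[start] = rmse(y[-n_test:], pred)            # score on the held-out tail
    return scores

# Synthetic monthly series (NOT sunspot data): a noisy 11 yr (132 month) cycle.
rng = np.random.default_rng(0)
months = np.arange(3000)
synthetic = 50 + 40 * np.sin(2 * np.pi * months / 132) + rng.normal(0, 5, months.size)
scores = sweep_start(synthetic, lags=[12, 132], n_test=132, starts=range(0, 1201, 120))
best_start = min(scores, key=scores.get)
print(best_start, scores[best_start])
```

Holding the test window fixed while moving the start index mirrors the trade-off discussed above: an earlier start
supplies more training records, whereas a later start keeps only the more recent behavior.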
Therefore, with the best method (RF) and the optimal starting point of the time series (1915) identified, the
results obtained for this algorithm and data range are shown in the following figures. Figure 5 depicts a comparison
of the modeled and real data in a time slot of the training stage (since 1969).
As can be seen, the agreement is excellent. On the other hand, Figure 6 shows the model prediction for the test
stage (part of cycle 24 and the beginning of cycle 25) compared with the real known data.
Figure 6. Comparison of the real data and predicted data for the test stage (cycle 24 and beginning of 25) with the
RF algorithm.
Again, a good fit is observed in both the behavior of the curve (period) and the amplitude, yielding the
aforementioned average RMSE of 9.59 sunspots. Finally, in Figure 7, based on the model obtained, a prediction of the
SSN evolution for cycle 25 and the beginning of cycle 26 (until 2033) is presented.
Figure 7. Predicted data with the obtained RF model for cycle 25 and the beginning of 26.
4. Conclusions
This paper presented the ML-based prediction of the SSN evolution of cycle 25. For this purpose, a Fourier Transform
analysis of the total time series (from 1749 to 2022) was first performed to identify the five most intense
periodicities. The series was then lagged corresponding to the positions of these periods, meaning the resulting
sequences were considered as additional attributes (predictors) to be included in the modeling algorithms used,
i.e., RF, SVM, GP and LR. The study was also carried out considering different starting points of the time series
(from 1749 to 1940, with 1 yr steps) to determine which start year could provide the most adequate quantity (and
quality) of information to obtain the minimum error and perform a more accurate prediction. The obtained results
show that the model with the lowest error in the test stage (cycle 24) is RF, together with the use of 1915 as the
time series start (yielding an average RMSE of 9.59 sunspots). Finally, this model predicted that the maximum number
of sunspots in cycle 25 will occur in March 2025, with a total of 90.
ORCID iDs
José-Víctor Rodríguez https://orcid.org/0000-0002-3298-6439
Ignacio Rodríguez-Rodríguez https://orcid.org/0000-0002-0118-3406
Wai Lok Woo https://orcid.org/0000-0002-8698-7605
References
Azcárate, T., Mendoza, B., & Levi, J. R. 2016, AdSpR, 58, 2116
Blomqvist, K., Kaski, S., & Heinonen, M. 2020, Proceedings of the Mining Data for Financial Applications (Ghent) (Berlin: Springer), 582
Cameron, R. H., & Schüssler, M. 2017, ApJ, 843, 111
Carbonell, M., Oliver, R., & Ballester, J. L. 1994, A&A, 290, 983
Covas, E., Peixinho, N., & Fernandes, J. 2019, SoPh, 294, 24
Dang, Y., Chen, Z., Li, H., & Shu, H. 2022, Appl. Artif. Intell., 36, 1
Dani, T., & Sulistiani, S. 2019, JPhCS, 1231, 012022
Faloutsos, C., Gasthaus, J., Januschowski, T., & Wang, Y. 2018, Proc. VLDB Endow, 11, 2102
Fierrez, J., Morales, A., Vera-Rodriguez, R., & Camacho, D. 2018, Inf. Fusion, 44, 57
Han, Y. B., & Yin, Z. Q. 2019, SoPh, 294, 107
Hanslmeier, A., & Brajša, R. 2010, A&A, 509, A5
Hathaway, D. H. 2015, LRSP, 12, 4
Hiremath, K. M. 2006, JApA, 27, 367
Kakad, B., Kumar, R., & Kakad, A. 2020, SoPh, 295, 88
Kalekar, P. S. 2004, Time Series Forecasting Using Holt-Winters Exponential Smoothing, Kanwal Rekhi School of Information Technology, Powai, Mumbai, 04329008
Kim, K. B., Kim, J. H., & Chang, H. Y. 2018, JASS, 35, 151
Kitiashvili, I. N. 2020, ApJ, 890, 36
Kuhn, M., & Johnson, K. 2013, Applied Predictive Modeling (1st edn.; New York: Springer)
Labonville, F., Charbonneau, P., & Lemerle, A. 2019, SoPh, 294, 82
Letellier, C., Aguirre, L. A., Maquet, J., & Gilmore, R. 2006, A&A, 449, 379
Lewandowski, K. 2015, J. Polish Safety and Reliability Association, 6, 91
Liaw, A., & Wiener, M. 2002, R News, 2, 18
Liu, C., Deng, N., Liu, Y., et al. 2008, ApJ, 622, 722
Lybekk, B., Pedersen, A., Haaland, S., et al. 2012, JGRA, 117, A1
McIntosh, S. W., Chapman, S., Leamon, R. J., Egeland, R., & Watkins, N. W. 2020, SoPh, 295, 163
Nandy, D. 2021, SoPh, 296, 54
Novakovic, J., Strbac, P., & Bulatovic, D. 2011, J. Oper. Res, 21, 119
Okoh, D. I., Seemala, G. K., Rabiu, A. B., et al. 2018, SpWea, 16, 1424
Oshiro, T. M., Perez, P. S., & Baranauskas, J. A. 2012, How Many Trees in A Random Forest? In International Workshop on Machine Learning and Data Mining in Pattern Recognition (Berlin: Springer), 154
Pala, Z., & Atici, R. 2019, SoPh, 294, 1
Pulkkinen, T. 2007, LRSP, 4, 1
Qu, J. 2016, Reviews in Medical Virology, 26, 309
Sagir, S., Karatay, S., Atici, R., Yesil, A., & Ozcan, O. 2015, AdSpR, 55, 106
Schölkopf, B., & Smola, A. J. 2003, A Short Introduction to Learning with Kernels. In Advanced Lectures on Machine Learning (Berlin: Springer), 41
Seeger, M. 2004, IJNS, 14, 69
Shmueli, G., & Lichtendahl, K. C., Jr. 2016, Practical Time Series Forecasting with R: A Hands-on Guide (Green Cove Springs, FL: Axelrod Schnall Publishers)
Usoskin, I. G. 2017, LRSP, 14, 1
Vapnik, V. 2013, The Nature of Statistical Learning Theory (Berlin: Springer)