Weather Forecasting Using Incremental K-Means Clustering

The document proposes a methodology for weather forecasting in West Bengal using incremental K-means clustering of an air pollution database. The database contains daily measurements of carbon dioxide, respirable particulate matter, sulfur dioxide, and nitrogen oxides from 2009-2010. The methodology first performs typical K-means clustering on the main database to develop weather categories based on cluster mean values. Then incremental K-means is used to group new data points into existing clusters, allowing prediction of those data points' weather. The accuracy of this incremental forecasting approach is evaluated.

Weather Forecasting using Incremental K-means Clustering

SANJAY CHAKRABORTY
National Institute of Technology (NIT) Raipur, CG, India.
email: [email protected]

Prof. N.K. NAGWANI
National Institute of Technology (NIT) Raipur, CG, India.
email: [email protected]

LOPAMUDRA DEY
University of Kalyani, Kalyani, W.B., India
email: [email protected]

Abstract – Clustering is a powerful tool that has been used in several forecasting tasks, such as time series forecasting, real-time storm detection and flood forecasting. In this paper, a generic methodology for weather forecasting is proposed with the help of the incremental K-means clustering algorithm. Weather forecasting plays an important role in day-to-day applications. The forecasting in this paper is based on the incremental air pollution database of West Bengal for the years 2009 and 2010. Typical K-means clustering is first applied to the main air pollution database, and a list of weather categories is developed from the maximum mean values of the clusters. When new data arrive, incremental K-means is used to group them into those clusters whose weather category has already been defined. This builds up a strategy to predict the weather of the upcoming days. The forecasting database is based entirely on the weather of West Bengal, and the methodology is developed to mitigate the impacts of air pollution and to launch focused modeling computations for the prediction and forecasting of weather events. The accuracy of this approach is also measured.

Keywords – Clustering, Forecasting, Incremental, K-means.

I. INTRODUCTION

Forecasting is very important for the prediction of future events. Science and computer technology together have made significant advances over the past several years, and using those technologies together with a few past patterns increases the ability to predict the future. Weather forecasting depends directly on the characteristics of the particulate matter present in the air.

$$\text{Weather forecasting} \propto \max(\text{effects of } NO_2 + SO_2 + CO_2 + RPM + \dots) \qquad (1)$$

This paper presents a methodology for forecasting the weather of 'West Bengal' through clustering. The methodology uses an air pollution database, which is described in a later section. The paper is organized into six sections. Section two covers previous work in the same direction, with a brief background history. The methodology is explained in detail in section three. The proposed model is explained in section four, which describes the stages of weather forecasting using incremental K-means clustering. Section five presents the simulation results, where the proposed technique is applied to the test dataset and the results are captured. Section six gives the conclusion and future scope of the proposed work.

II. RELATED WORK

Several approaches have been used for weather prediction. In some cases advanced numerical analysis has been used, but in most situations clustering techniques are used for different types of prediction, whether weather prediction or natural disaster prediction. All of this research helps to protect the world from destructive natural events. Weather forecasting can also be done using artificial neural networks [9].

This paper is based on the incremental approach to the K-means clustering algorithm, which has already been developed and discussed [1][2]. Based on the concepts of that incremental algorithm, weather prediction for 'West Bengal' is carried out in this paper. There also exist several approaches that provide modifications of this algorithm [3][4][5][6]. One approach is a case study of time series forecasting through clustering, which proposes a generic methodology for time series forecasting. That methodology first searches for useful patterns in the form of curves and then facilitates forecasting through linear regression by matching each time series to be predicted to its closest pattern. The approach is applied to the KDD Cup 2003 dataset [7]. Some work has been done on real-time storm detection through data mining. That approach presents a model and algorithms for bridging the gap between the physical environment and the cyberinfrastructure framework by means of an events-processing approach to responding to anomalous behavior, together with sophisticated data mining algorithms that apply classification techniques to the detection of severe storm patterns. These ideas have been implemented in the LEAD-CI prototype [8]. Another approach presents the data mining activity that was employed to mine weather data. The self-organizing data mining
approach employed is the enhanced Group Method of Data Handling (e-GMDH). The weather data used for the data mining research include daily temperature, daily pressure and monthly rainfall. Experimental results indicate that this approach is a useful data mining technique for forecasting weather data [10].

III. METHODOLOGY

This analysis is based on air pollution data collected from the "West Bengal Air Pollution Control Board"; the URL is "http://www.wbpcb.gov.in/html/airqualitynxt.php". The database consists of four air-pollution attributes: Carbon dioxide (CO2), Respirable particulate matter (RPM), Sulphur dioxide (SO2) and Oxides of Nitrogen (NOx). The air pollution data of each day are collected and stored as a record in an .arff (Attribute-Relation File Format) file. The detailed database format is shown in 'Table I'.

Table I. Original air-pollution database

Date        CO2   RPM   SO2   NOX
1/1/2009     85   183    12    95
2/1/2009     95   289    14   125
3/1/2009    112   221    10   101
4/1/2009    114   191    11    97
5/1/2009    100   175    11   101
6/1/2009     78   149     7    93
...         ...   ...   ...   ...
1/2/2009    120   197    10   105
2/2/2009    115   151    10    85
3/2/2009     98   154     8    96
4/2/2009     90   195     8    93
...         ...   ...   ...   ...

The above database is a dynamic database in which data are updated frequently. The main approach of this paper is to first apply the K-means clustering algorithm [Chakraborty and Nagwani, 2011] to the original database (assuming an initial number of clusters) and then compute the mean of each cluster from its air pollution attributes. A list of weather categories is then developed from the maximum mean value of each cluster. When new data (data of upcoming days) are inserted into the old database, the incremental K-means clustering algorithm is applied. Following that algorithm, the minimum distance between the new data and the existing cluster means is computed, which readily identifies the cluster to which the new data belong; their weather category is then taken from that cluster's weather category. Finally, the accuracy of this method is measured and discussed.

A. Effects of air-pollution data on weather

The experimental database of this paper consists of four air pollutants (CO2, RPM, SO2, NOX). These four play very important roles in weather and climate change. They not only affect the climate but can also be harmful to humans and plants. These air pollutants are emitted directly from several different sources, such as ash from a volcanic eruption, gas from motor vehicle exhaust, releases from various industrial processes, high-temperature combustion and so on.

'Fig. 1' shows how the air pollution data affect human health, plant health and the environment [11].

Figure 1. Effects of air pollution data on humans, plants and environment

Their effects on climate change are listed below.

• CO2: A colourless, odourless, non-toxic greenhouse gas associated with ocean acidification, emitted from sources such as combustion, cement production and respiration. It is one of the main pollutants causing 'global warming'. Due to the 'greenhouse effect', the temperature of the environment increases and, most importantly, seasonal change is caused by the increase of CO2 [11].

• SO2: SO2 is produced by volcanoes and in various industrial processes. Since coal and petroleum often contain sulphur compounds, their combustion generates sulphur dioxide. Further oxidation of SO2, usually in the presence of a catalyst such as NO2, forms H2SO4, and thus acid rain. SO2 also forms smog (smoke + fog), which creates visibility problems [11][12][13].

• NOX: Nitrogen dioxide is emitted from high-temperature combustion. It can be seen as the
brown haze dome above or plume downwind of cities. NO2 is one of the most prominent air pollutants. Just like CO2, NO2 is responsible for increasing temperature, and it also creates smog [11][12][13].

• RPM: Particulates, alternatively referred to as particulate matter (PM) or fine particles, are tiny particles of solid or liquid suspended in a gas. They create dust, smoke, fumes, mist, fog, aerosols, fly ash and so on. Increased levels of fine particles in the air are linked to health hazards such as heart disease, altered lung function and lung cancer [11][12][13].

B. Mathematical Explanation

Suppose there is a set of air pollution data consisting of 15 values, where each value represents the data of one day. The set is shown in the table below.

Table II. Sample air pollution database

Air pollutant data
CO2   RPM   SO2   NOX
 82    14    12    24
 72    56    28     8
 36     2    48     5
  7     -    94    62

Here typical K-means is applied to the initial data and incremental K-means [1] is applied to the incremental (newly arriving) data. Assume that the number of clusters is initially 4 (K=4), that the initial means of the four clusters are C1=8, C2=56, C3=28, C4=72, and that the database contains no noisy data.

First Iteration:
Apply typical K-means to the above data using the Manhattan distance metric (|Ai − Aj|). For the first data value, 12 (SO2):
C1 = |8 − 12| = 4 (minimum)
C2 = |56 − 12| = 44
C3 = |28 − 12| = 16
C4 = |72 − 12| = 60
So, 12 ∈ C1.
Applying the same technique to all of the data gives:

Cluster                       #items   Mean
C1 = {12, 8, 5, 14, 7, 2}       6       8
C2 = {56, 48, 62}               3      55.33
C3 = {28, 24, 36}               3      29.33
C4 = {72, 82, 94}               3      82.66

Second Iteration:
Clustering is performed again based on the newly generated means:
12 ∈ C1   48 ∈ C2    5 ∈ C1   62 ∈ C2
24 ∈ C3    8 ∈ C1   14 ∈ C1    7 ∈ C1
82 ∈ C4   72 ∈ C4   28 ∈ C3    2 ∈ C1
36 ∈ C3   94 ∈ C4   56 ∈ C2

The result of the second iteration is the same as that of the first. From these four clusters, the effect of each cluster on the weather can be assessed.

• From the resulting data of cluster 1 (C1), the effect of RPM (14 is the maximum) is greater than that of the other pollutants. So, according to their effects on weather (discussed above), the weather on those particular days was smoggy, with a lot of dust and fly ash.

• As per the nature of cluster 2 (C2), the weather on those particular days was hot, dry and smoggy due to the effect of NOx.

• As per the nature of cluster 3 (C3), the weather on those particular days was hot, smoggy and humid due to the effect of CO2 ('greenhouse effect').

• As per the nature of cluster 4 (C4), the weather on those particular days was hot and smoggy, with a chance of acid rain due to the effect of SO2.

Now suppose some new data are inserted into the existing database, that is, air pollution data of some upcoming days, such as 49 (NOx), 78 (SO2) and 20 (CO2). Here the incremental K-means clustering algorithm [1] can be applied. According to this algorithm, the new data are clustered directly by computing the distance between each new data value and the means of the existing clusters. There is no need to run the whole algorithm again. The expected result is:

i. C1 = |8 − 49| = 41
   C2 = |55.33 − 49| = 6.33 (minimum)
   C3 = |29.33 − 49| = 19.67
   C4 = |82.66 − 49| = 33.66
Therefore, 49 (NOx) ∈ C2.
So the value 49 follows the features of cluster 2, which indicates that the weather of the next day will be hot and smoggy.

ii. C1 = |8 − 78| = 70
    C2 = |55.33 − 78| = 22.67
    C3 = |29.33 − 78| = 48.67
    C4 = |82.66 − 78| = 4.66 (minimum)
Therefore, 78 (SO2) ∈ C4.
So the value 78 follows the features of cluster 4, which indicates that the weather of that day will be hot and smoggy, with a chance of acid rain.

iii. C1 = |8 − 20| = 12
     C2 = |55.33 − 20| = 35.33
     C3 = |29.33 − 20| = 9.33 (minimum)
     C4 = |82.66 − 20| = 62.66
Therefore, 20 (CO2) ∈ C3.
So the value 20 follows the features of cluster 3, which indicates that the weather of that day will be hot, smoggy and humid due to the effect of CO2.
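The worked example of Section III.B can be reproduced with a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' Java/Weka implementation: the helper names `nearest` and `kmeans_1d` are illustrative, every value is treated as a one-dimensional point as in the text, and the cluster means are left unchanged when new values are assigned. Note that the mean of C4 is 248/3, which rounds to 82.67 (the text truncates it to 82.66).

```python
# Sample values from Table II, flattened (the missing RPM entry is skipped).
values = [82, 14, 12, 24, 72, 56, 28, 8, 36, 2, 48, 5, 7, 94, 62]
initial_means = [8.0, 56.0, 28.0, 72.0]  # C1..C4 as assumed in the text


def nearest(value, means):
    """Index of the cluster mean closest to value (1-D Manhattan distance)."""
    return min(range(len(means)), key=lambda i: abs(means[i] - value))


def kmeans_1d(data, means, max_iter=100):
    """Plain K-means: reassign values and recompute means until stable."""
    means = list(means)
    for _ in range(max_iter):
        clusters = [[] for _ in means]
        for v in data:
            clusters[nearest(v, means)].append(v)
        # Keep the previous mean if a cluster happens to be empty.
        new_means = [sum(c) / len(c) if c else m
                     for c, m in zip(clusters, means)]
        if new_means == means:
            break
        means = new_means
    return clusters, means


clusters, means = kmeans_1d(values, initial_means)

# Incremental step (simplified assumption): each new value is compared
# only with the existing means; the full algorithm is not rerun.
new_points = {"NOx": 49, "SO2": 78, "CO2": 20}
assigned = {name: nearest(v, means) + 1 for name, v in new_points.items()}

for i, (c, m) in enumerate(zip(clusters, means), start=1):
    print(f"C{i}: {sorted(c)} mean={m:.2f}")
print(assigned)
```

Running the script yields the same four clusters as the first and second iterations above, and assigns 49 to C2, 78 to C4 and 20 to C3.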
IV. PROPOSED MODEL

[Figure 2 flowchart: Air-pollution database → Typical K-means clustering → Generate clusters → Select weather category based on the maximum value of the air pollutants in each cluster → New data → Incremental K-means clustering → New data are clustered into the existing clusters using incremental K-means clustering, and weather prediction is performed based on the previously calculated nature of those clusters → Measure the accuracy]

Figure 2. Proposed model of weather forecasting using incremental K-means clustering

As 'Fig. 2' shows, typical K-means clustering is first applied to the air pollution database, and the weather category of each cluster is defined from the maximum value of its air pollutants. When new data are inserted into the existing database, they are clustered directly into those existing clusters whose weather category has already been decided. This insertion is done using the incremental K-means clustering algorithm. The weather category of the new data can therefore be derived from the weather category of the clusters to which they belong.

V. SIMULATION RESULT

The simulation is based entirely on the data of the years 2009 and 2010, and it calculates the accuracy of the approach of this paper. The experiment is carried out with Java and the Weka software on a 2.26 GHz Core i3 computer with 4 GB of memory, running Windows 7 Home Basic. The accuracy of any method can be measured by comparing the actual values with the values produced by the new method.

$$\text{Accuracy} = \frac{\text{Number of matched records}}{\text{Total number of records}} \times 100 \qquad (2)$$

The result of applying typical K-means to the air-pollution database (which initially contains the data of the first 8 months of the year 2009) is shown below.

Table III. Means of initial clusters

clusterid   clustCO2mean   clustRPMmean   clustSOmean   clustNOmean
cluster0    221.376238     110.366337     10.128713      92.415842
cluster1    112.600000     118.562500      8.425000      72.187500
cluster2     39.458824      36.176471      6.158824      41.523529
cluster3     65.196721      75.983607      7.704918      57.04918
cluster4    225.943182     145.022727     12.034091     107.10227

Based on these means, five clusters are produced, and the nature of each cluster depends on which attribute mean is the maximum. For example, in the first cluster the CO2 mean is the maximum, so if the data of an upcoming day are inserted into the first cluster, the weather of that day is hot, smoggy and humid due to the effect of CO2.

Table IV. Weather category according to cluster data

Cluster Number   Cluster Maximum data   Weather Category
Cluster0         221.376238             hot, smoggy and humid
Cluster1         118.562500             dusty, fly ash, smoggy, fog, mist
Cluster2          41.523529             hot, dry and smoggy
Cluster3          75.983607             dusty, fly ash, smoggy, fog, mist
Cluster4         225.943182             hot, smoggy and humid

Suppose the data of the 10 months from September 2009 to June 2010 are as shown in 'Table V'.

Table V. Pollution data of the year of 2009 (September) and 2010

Date        CO2   RPM   SO2   NOX
1/9/2009     66    27     5    31
2/9/2009     27    83     5    36
3/9/2009     88    30     5    35
4/9/2009     98    29     5    35
5/9/2009     74    28     5    33
...         ...   ...   ...   ...
28/9/2009   116    43     6    52
29/9/2009   125    53     6    60
30/9/2009   188   100     7    67
...         ...   ...   ...   ...
1/1/2010    200   150    12   107
2/1/2010    220   160    13   110
...         ...   ...   ...   ...
1/3/2010    260   170    14   105
2/3/2010    270   175    14   112
...         ...   ...   ...   ...
1/6/2010    190   145    16   120
2/6/2010    200   155    12   118
...         ...   ...   ...   ...

Now the new data are inserted into the existing database and incremental K-means clustering is used to cluster them. After clustering by the incremental algorithm, each new record is inserted into one of the existing clusters whose weather category has already been defined in 'Table IV'. The forecasting results from September 2009 to June 2010 are shown below (note that the database contains all the data of the West Bengal pollution board for the years 2009 and 2010). The 'Euclidean metric' is used for the calculation. The distances computed for the first three new records are given below.

i. Cluster0 = 186.7885
   Cluster1 = 110.7402
   Cluster2 = 29.8850 (minimum)
   Cluster3 = 55.5580
   Cluster4 = 212.9605
So the incremental data of the first day (1/9/2009) should be inserted into 'Cluster2'; it follows the same nature as 'Cluster2'.

ii. Cluster0 = 204.29954
    Cluster1 = 99.5656
    Cluster2 = 48.78
    Cluster3 = 44.2561 (minimum)
    Cluster4 = 220.295
So the incremental data of the second day (2/9/2009) should be inserted into 'Cluster3'; it follows the same nature as 'Cluster3'.

iii. Cluster0 = 166.0447
     Cluster1 = 99.2124
     Cluster2 = 49.3790 (minimum)
     Cluster3 = 55.9282
     Cluster4 = 193.666
So the incremental data of the third day (3/9/2009) should be inserted into 'Cluster2'; it follows the same nature as 'Cluster2'. The other data of the 10 months (from September 2009 to June 2010) are calculated in the same way.

Table VI. Weather forecasting from September 2009 to June 2010

Date        New data inserted into   Weather Category
1/9/2009    Cluster2                 hot, dry and smoggy
2/9/2009    Cluster3                 dusty, fly ash, smoggy, fog, mist
3/9/2009    Cluster2                 hot, dry and smoggy
4/9/2009    Cluster2                 hot, dry and smoggy
...         ...                      ...
28/9/2009   Cluster3                 dusty, fly ash, smoggy, fog, mist
29/9/2009   Cluster3                 dusty, fly ash, smoggy, fog, mist
30/9/2009   Cluster4                 hot, smoggy and humid
...         ...                      ...

Now the accuracy of the above technique can be measured:

$$\text{Accuracy} = \frac{\text{Number of matched records}}{\text{Total number of records}} \times 100 = \frac{250}{300} \times 100 \cong 83.3\%$$

So, the accuracy of this method is 83.3%.

VI. CONCLUSION AND FUTURE SCOPE

In this paper, a new technique is established to predict the weather of upcoming days with the help of the incremental K-means clustering algorithm. This technique is suitable for dynamic databases in which the climate data change frequently. The accuracy of the technique is also calculated.
In future, other incremental clustering algorithms can be used to predict the weather and compared with each other to determine which of them provides better accuracy.

ACKNOWLEDGMENT

Special thanks to Dr. S. Verma and Dr. T. S. Sinha from the National Institute of Technology and DIMAT Raipur, whose comments improved the presentation of this article.

REFERENCES

[1] Chakraborty, S. and Nagwani, N.K., "Analysis and study of Incremental K-Means clustering algorithm", Communication in Computer and Information Science. 1, International Conference in High Performance Architecture and Grid Computing (Springer, Germany), Vol. 169, part 2, 2011, pp. 338-341.

[2] Chakraborty, S. and Nagwani, N.K., "Performance evaluation of incremental K-means clustering algorithm", IFRSA International Journal of Data Warehousing and Mining (IIJDWM), Vol. 1, 2011, pp. 54-59.

[3] Mumtaz, K. and Duraiswamy, K., "A Novel Density based improved k-means Clustering Algorithm – Dbkmeans", IJCSE, Vol. 2, No. 02, 2010, pp. 213-218.

[4] Kanungo, T. and Mount, D.M., "An Efficient k-Means Clustering Algorithm: Analysis and Implementation", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 7, 2002.

[5] Ordonez, C. and Omiecinski, E., "An Efficient Disk-Based K-Means Clustering for Relational Databases", IEEE Transactions on Knowledge and Data Engineering, Vol. 16, 2004.

[6] Aristidis L., Nikos V., Jakob J. V., "The global k-means clustering algorithm", The Journal of the Pattern Recognition Society, Pattern Recognition 36, 2002, pp. 451-461.

[7] Vipul K., Vamsidhar T., Kamalakar K., "Time Series Forecasting through Clustering - A Case Study", In Proceedings of COMAD'2005, pp. 183-191.

[8] Xiang L., Beth P., Nithya V., Rahul R., Sara G. and Helen C., "Real-Time Storm Detection and Weather Forecast Activation through Data Mining and Events Processing", Vol. 1, No. 2, May 2008, pp. 49-57.

[9] Dr. Santhosh B. S. and Kadar Shereef I., "An Efficient Weather Forecasting System using Artificial Neural Network", International Journal of Environmental Science and Development, Vol. 1, No. 4, October 2010.

[10] Godfrey C. O., Peter B., Sitaram G., Visagaperuman R., Viti B. A. A., "Self-organizing Data Mining for Weather Forecasting", IADIS European Conference Data Mining, ISBN: 978-972-8924-40-9, 2007.

[11] "Air Pollution", Wikipedia free encyclopaedia, http://en.wikipedia.org/wiki/Air_pollution#mw-head.

[12] "Air pollution and Climate change", published by Science for Environment Policy, European Commission, issue 24, November 2010.

[13] Hans E., Markus A., Jelle G. van M., "The ETC-ACC framework for base-line scenario development in the context of integrated assessment for air pollution and climate change", European Topic Centre on Air and Climate Change (ETC/ACC), 2002.
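As a supplementary illustration of the procedure in Section V, the sketch below assigns the first three records of Table V to the nearest cluster of Table III using the Euclidean metric, derives the weather category from the dominant pollutant as in Table IV, and evaluates Equation (2). It is a hedged sketch rather than the authors' Java/Weka code: the function names `category`, `nearest_cluster` and `accuracy` are illustrative, and the cluster means are assumed to stay fixed while new records are assigned, which may differ from the incremental algorithm of [1].

```python
import math

# Cluster means from Table III, in the order (CO2, RPM, SO2, NOx).
CLUSTER_MEANS = {
    "cluster0": (221.376238, 110.366337, 10.128713, 92.415842),
    "cluster1": (112.600000, 118.562500, 8.425000, 72.187500),
    "cluster2": (39.458824, 36.176471, 6.158824, 41.523529),
    "cluster3": (65.196721, 75.983607, 7.704918, 57.04918),
    "cluster4": (225.943182, 145.022727, 12.034091, 107.10227),
}
POLLUTANTS = ("CO2", "RPM", "SO2", "NOx")

# Categories as listed in Table IV; no cluster in Table III is dominated
# by SO2, so that case is deliberately left out of this sketch.
CATEGORY_BY_POLLUTANT = {
    "CO2": "hot, smoggy and humid",
    "RPM": "dusty, fly ash, smoggy, fog, mist",
    "NOx": "hot, dry and smoggy",
}


def category(cluster_id):
    """Weather category of a cluster, taken from its largest attribute mean."""
    means = CLUSTER_MEANS[cluster_id]
    dominant = POLLUTANTS[means.index(max(means))]
    return CATEGORY_BY_POLLUTANT[dominant]


def nearest_cluster(record):
    """Cluster whose mean has the smallest Euclidean distance to the record."""
    return min(CLUSTER_MEANS,
               key=lambda c: math.dist(record, CLUSTER_MEANS[c]))


def accuracy(matched, total):
    """Equation (2): percentage of matched records."""
    return matched / total * 100


# First three records of Table V.
first_days = {
    "1/9/2009": (66, 27, 5, 31),
    "2/9/2009": (27, 83, 5, 36),
    "3/9/2009": (88, 30, 5, 35),
}
forecast = {}
for day, record in first_days.items():
    cluster = nearest_cluster(record)
    forecast[day] = (cluster, category(cluster))

for day, (cluster, weather) in forecast.items():
    print(day, cluster, weather)
print(f"accuracy = {accuracy(250, 300):.1f}%")
```

With the Table III means, the three records fall into cluster2, cluster3 and cluster2 respectively, which agrees with the first rows of Table VI.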
