Classification algorithms

This study presents a novel iterative kappa architecture for processing and analyzing real-time data from cryptocurrency transactions and social media, specifically focusing on Bitcoin and Ethereum. Utilizing a k-means clustering approach, the architecture categorizes data into typical and atypical groups, revealing correlations between Twitter activity and cryptocurrency transactions. The findings indicate that approximately 14% of the data reflects extraordinary behaviors, highlighting the architecture's flexibility and effectiveness in managing large datasets.

algorithms

Article
Real-Time Big Data Architecture for Processing Cryptocurrency and Social Media Data: A Clustering Approach Based on k-Means
Adrian Barradas *,† , Acela Tejeda-Gil † and Rosa-María Cantón-Croda †

Graduate School of Engineering, UPAEP-University, Puebla 72410, Mexico; [email protected] (A.T.-G.);
[email protected] (R.-M.C.-C.)
* Correspondence: [email protected]
† These authors contributed equally to this work.

Abstract: Cryptocurrencies have recently emerged as financial assets that allow their users to execute
transactions in a decentralized manner. Their popularity has led to the generation of huge amounts
of data, specifically on social media networks such as Twitter. In this study, we propose an iterative
kappa architecture that collects, processes, and temporarily stores data regarding transactions and
tweets of two of the major cryptocurrencies according to their market capitalization: Bitcoin (BTC)
and Ethereum (ETH). We applied a k-means clustering approach to group data according to their
principal characteristics. Data are categorized into three groups: BTC typical data, ETH typical
data, and BTC and ETH atypical data. Findings show that activity on Twitter correlates with activity
regarding the transactions of cryptocurrencies. It was also found that around 14% of data relate to
extraordinary behaviors regarding cryptocurrencies. These data contain higher transaction volumes
of both cryptocurrencies, and about 9.5% more social media publications in comparison with the
rest of the data. The main advantages of the proposed architecture are its flexibility and its ability to
relate data from various datasets.

Keywords: kappa architecture; iterative data processing; document-oriented No-SQL database;
Bitcoin; Ethereum; Twitter

Citation: Barradas, A.; Tejeda-Gil, A.; Cantón-Croda, R.-M. Real-Time Big Data Architecture for
Processing Cryptocurrency and Social Media Data: A Clustering Approach Based on k-Means.
Algorithms 2022, 15, 140. https://doi.org/10.3390/a15050140

Academic Editors: Christos Makris and Andreas Kanavos
Received: 16 March 2022
Accepted: 7 April 2022
Published: 22 April 2022

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and
institutional affiliations.

Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution (CC BY)
license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction
During the past few years, the use of digital currencies has emerged as a novel
manner of executing financial transactions [1]. A digital currency works the same way
a real currency does, with the particularity that it is not issued by a central bank; thus
it is a decentralized currency [2]. Digital currencies are generated using a cryptographic
algorithm called blockchain, which employs mathematical encryption methods to create
and verify a continuously growing data structure. Therefore, blockchain protects data by
transforming it into an unreadable format, which can only be decrypted employing the
corresponding decryption algorithm. Blockchain transactions flow through a computer
network without the need for intermediaries as the algorithm links users directly [1]. That
kind of network is known as a cryptocurrency network as it enables the establishment of
decentralized peer-to-peer data exchange [3].
In terms of trading volume, Bitcoin is currently the most popular cryptocurrency; it
allows electronic cash transactions directly from one partner to another without going
through a financial institution [4]. Diverse studies serve as evidence that Bitcoin has been
strangely volatile since its establishment. Its volatile nature has brought into vogue its
use among speculators [5]. Although its use until now has been mostly for speculation,
since at least 2010, numerous intermediaries have begun to transact with Bitcoin [6]. It has
been reported that the market capitalization of the one hundred largest cryptocurrencies
exceeded the equivalent of USD 2.65 trillion by November 2021; nevertheless, according
to CoinMarketCap, Bitcoin is the largest cryptocurrency, with a market capitalization
that surpasses the USD 1.1 trillion mark, while Ethereum stands as the second-largest
cryptocurrency with a market capitalization equivalent to USD 543 billion [7]. Both Bitcoin
and Ethereum use the same principles of blockchain technology; nevertheless, while
Bitcoin’s purpose is limited to functioning as a digital currency, Ethereum is designed to
be a general-purpose programmable blockchain, which can manage the transactions of a
digital currency, but also any kind of data expressible as a key-value tuple [8]. This gives
Ethereum the advantage of being suitable for other decentralized applications; however,
this study focuses only on its use as a digital currency.
Cryptocurrencies became a trend due to their popularity on social media. In
that context, one of the main sources of information about cryptocurrencies is Twitter. It
allows users to share their thoughts and opinions regarding cryptocurrencies; therefore,
it is, among other social networks, a medium that boosts the cryptocurrency world [9,10].
According to data from BitInfoCharts, the number of daily tweets related to Bitcoin during
2021 fluctuated between 30,540 and 363,566; the latter corresponds to around 0.072 percent
of the average daily tweets worldwide [11]. This evidences the wide use of Twitter as an
information medium for cryptocurrencies [12,13]. It is worth mentioning that Twitter is
considered a leading social media platform and a rich source of real-time information [14].
On the other hand, during the same period, daily Bitcoin transactions averaged 332,355 [15].
In that context, a large amount of data is generated every day; i.e., around 136 tweets and
230 transactions per minute. Big data refers to large and complex datasets which require
advanced data storage, management, and analysis technologies [3]. One of the sources
of big data is social media which has an increasing number of users [16] that integrate
their background and daily activities into the networks. This fact contributes to the rapid
generation of gigantic datasets [17]. As data are generated rapidly, it is meaningful to
obtain information and insights in real time to react appropriately to events and trends
surrounding large volumes of data. In this case, it concerns the analysis of social media
posts and cryptocurrency transactions [18].
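The per-minute rates quoted above can be reproduced with a short calculation. This is only a sketch: the text does not state how the tweet rate was derived, so taking the midpoint of the reported daily range is an assumption, as is truncating (rather than rounding) the results.

```python
MINUTES_PER_DAY = 24 * 60

# Daily figures quoted in the text for 2021.
daily_btc_transactions = 332_355
daily_tweets_min, daily_tweets_max = 30_540, 363_566

transactions_per_minute = daily_btc_transactions / MINUTES_PER_DAY

# Assumption: the tweet rate is taken at the midpoint of the reported range.
daily_tweets_midpoint = (daily_tweets_min + daily_tweets_max) / 2
tweets_per_minute = daily_tweets_midpoint / MINUTES_PER_DAY

# Truncate to whole events per minute.
print(int(tweets_per_minute), "tweets per minute")
print(int(transactions_per_minute), "transactions per minute")
```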
Given the popularity of cryptocurrencies, there is a vast number of recent studies and
projects focused on analyzing data from social media and cryptocurrencies in real time
utilizing novel data processing tools and methodologies. Mohapatra et al. [19] proposed
a distributed architectural design for handling large volumes of data from Twitter and
Bitcoin transactions in real time to predict price fluctuations; by means of a combined
machine learning and lexicon approach, they determined the sentiments of the tweets and
related them with the price of Bitcoin to predict the next minute’s price. Bandi [20] utilized
a lambda architecture to process and visualize real-time data regarding cryptocurrencies’
prices. On the other hand, Horvat et al. [21] proposed an architecture for real-time
cryptocurrency data processing and analysis based on the lambda architectural approach
to obtain insights through the relation of different data sources such as social media,
cryptocurrencies, and the stock market. A kappa architecture was proposed by Bandi and
Hurtado [18] to process real-time data from Twitter to visualize analytics, such as trends
and tweet volume. In addition, a relation between tweets and cryptocurrencies’ prices was
studied by Abraham et al. [22] as a way to predict the direction of the price variation, from
which it was found that the volume of tweets is more significant than their sentiments. It
was also found by Park and Lee [23] that the volume of tweets correlates with Bitcoin prices.
Garcia et al. [24] found that an increase in Bitcoin’s price led to a higher number of tweets
which again would drive the price further up [25]. Some other studies focused only on
the relation between tweets and cryptocurrencies, treating as secondary the methods
involved in the management and processing of data. Aharon et al. [26] found that there is
a causal relationship between the uncertainty associated with sentiments in social media
and cryptocurrency returns. In addition, we have found evidence of works that aim to
identify behavioral patterns regarding cryptocurrencies by means of clustering algorithms.
Baek et al. [27] applied a k-means clustering approach to identify suspicious transactions of
Ethereum. Aspembitova et al. [28] identified four types of cryptocurrency users through
the application of k-means clustering and support vector machines (SVMs) on Bitcoin and
Ethereum transactional data. Fang et al. [29] used k-means to classify positive and negative
publications from Twitter related to Bitcoin.
The previous research serves as a reference and basis for our study; although similar
approaches have been proposed, to the best of our knowledge there is no evidence of
related papers that utilize an iterative kappa architecture to process, relate and manage data
from Twitter and cryptocurrency markets in real time. In this context, this study proposes
the application of a novel kappa architecture, derived from the lambda architecture, for
processing and analyzing real-time data from Twitter and the cryptocurrency market. It
integrates a temporary batch step which allows the relation of data from different data
sources in a specific time span. The proposed architecture focuses on the processing of data
in real time while looking for insights and patterns regarding the number of tweets, their
sentiment, and the number, type, and volume of cryptocurrency transactions. Data are
collected through application programming interfaces (APIs) and streamed to be processed
and stored in a document-oriented No-SQL database (MongoDB™). Afterward, data are
related with the purpose of finding meaningful patterns.
The present work aims to demonstrate the use and benefits of the proposed architecture
as a choice for relating data from cryptocurrencies and social media while identifying
patterns in real time; for that purpose, data from a defined period of time are used.
This paper is organized as follows: Section 2 describes in detail the materials and
methods used for the study’s development. Section 3 presents the results obtained by
processing and relating data using the proposed kappa architecture. Finally, Section 4
summarizes the main findings and future works for this study.

2. Materials and Methods


This study is developed by following an approach based on the kappa architecture
for big data as shown in Figure 1. The kappa architecture was first introduced by Kreps in
2014 [30]. It derives from the lambda architecture, which is considered one of the industry’s
best practices for scalable real-time big data processing [21]. Lambda architecture consists
of three layers: batch layer, speed layer, and serving layer. The batch layer processes data
and stores them to query precomputed data on demand instead of querying them on the
fly. The speed layer processes data in real time to compensate for the high latency of updates
in the batch layer. Thus, data are processed in a parallel manner in both layers. Finally,
the serving layer stores the views from the previous two layers [31]. Kappa architecture is
similar to the lambda architecture, with the difference that it does not include a batch layer;
therefore, it processes data only in real time [30]. In this context, the main characteristics
of the kappa architecture are its simplicity and its flexibility in comparison with other big
data architectures [32]; thus it is suitable for online processing of data flows [33].

Figure 1. Proposed kappa architecture. Source: compiled by authors with data from [30]. “Apache
Kafka”, and “Apache Spark” are trademarks of the Apache Software Foundation. “TWITTER, TWEET,
RETWEET and the Twitter Bird logo are trademarks of Twitter Inc. or its affiliates”.

The proposed architecture consists of a real-time streaming layer that receives and
processes new incoming data and a serving layer that stores data in MongoDB™ to be dis-
played or queried on demand. At the streaming layer, the processing is executed by means
of Apache Kafka™ and Apache Spark™, which help to process data in a distributed
manner and consequently faster than non-distributed approaches [34]. At
the serving layer of the kappa architecture, the processed, modeled, and evaluated data
coming from the real-time streaming layer are finally loaded into a database management
system (DBMS), i.e., MongoDB™. In this case, as we handle huge volumes of unstructured
data from Twitter, a document-oriented No-SQL database is better suited than a traditional
relational database due to its advantages regarding the horizontal scalability and the
storage of unstructured data.
The kappa architecture that we present is iterative. In the first iteration, single datasets
from Twitter and CryptoCompare are processed and transformed in order to be related;
thereafter, a second iteration is executed to classify the related datasets and obtain insights.
In that context, data are collected as they are generated and then streamed, transformed,
and stored in MongoDB™, from which datasets are queried. In this case, MongoDB™ serves
as a temporary batch store that holds data from the last 120 s with the purpose of relating them, by considering
a time span of one minute and therefore obtaining one record for each minute. Finally, the
queried dataset is returned to the streaming layer to be processed by means of a machine
learning approach; in this case, k-means clustering. K-means clustering is one of the
most popular algorithms for unsupervised machine learning. It groups data with similar
characteristics under a determined number of clusters while separating them according
to their dissimilarities [35]. Clustering is defined as a method for finding homogeneous
groups of data points in a dataset; in that sense, it allows the recognition of patterns in
data [36].
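The clustering step described above can be illustrated with a minimal pure-Python sketch of the standard k-means iteration (assign each point to its nearest centroid, then move each centroid to the mean of its members). It is not the implementation used in the study, which runs on Apache Spark™; the function names and the deterministic farthest-first seeding are assumptions made here so that the example is reproducible.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_seeds(points, k):
    """Deterministic seeding (an assumption, chosen for reproducibility)."""
    seeds = [points[0]]
    while len(seeds) < k:
        # Pick the point whose nearest existing seed is farthest away.
        seeds.append(
            max(points, key=lambda p: min(squared_distance(p, s) for s in seeds))
        )
    return seeds


def k_means(points, k, max_iterations=100):
    """Return (labels, centroids) after iterating until the centroids stop moving."""
    centroids = farthest_first_seeds(points, k)
    labels = []
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


# Made-up two-dimensional points forming three well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
labels, centroids = k_means(points, k=3)
print(labels)
```

In practice the attributes would be standardized first, since k-means relies on Euclidean distance and is sensitive to differences in scale.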
For this study, data related to the two largest cryptocurrencies, according to their
market capitalization, were collected, i.e., Bitcoin (BTC) and Ethereum (ETH) [7]. Data
mining for the corresponding tweets was done considering publications made in English.
Parameters for the k-means clustering approach were calculated for data collected on
14 January 2022 corresponding to a period of 8 h from 06:59:00 (UTC-6) to 16:59:00 (UTC-6).
The algorithms for the proposed architecture were executed on a single computer; nevertheless,
the architecture is suitable for execution on a computer cluster, which distributes the computational
requirements among the computers that form it.
Figure 2 shows a representation of the steps involved in the development of the study.
First, data mining is executed in real time by means of public APIs [37,38] that enable
the retrieval of the latest available raw data from Twitter and CryptoCompare. Collected
data are then streamed and immediately transformed. Datasets are cleaned by deleting
unneeded variables, and the remaining ones are transformed in order to be correctly processed
and related. Additionally, a standard notation for the data is defined, and derived attributes
are calculated when needed. Thereafter, data pass to the serving layer, where they are stored
in MongoDB™ and then queried to relate the corresponding datasets according to their
most relevant attributes. The queried and related data are then returned to the real-time
streaming layer, at which a k-means clustering approach is executed to categorize data in
groups according to their characteristics. In that sense, data flow in a second iteration in
parallel through the architecture with the purpose of obtaining more information from
them in real time.

Figure 2. Process diagram for the proposed kappa architecture. Source: compiled by authors.

3. Results
Data for this study were obtained from two different sources (Twitter and CryptoCompare)
in the form of a JSON real-time stream, by means of an API [37,38]. To query data, a
set of keywords was given, corresponding to the names and symbols of the cryptocurrencies,
i.e., Bitcoin (BTC) and Ethereum (ETH). As shown in Table 1, data collected from
Twitter contain several attributes related to each tweet such as id, timestamp, and text, but
also attributes related to the user such as user mentions, number of followers, and location,
among others. On the other hand, data from CryptoCompare contain transaction-inherent
attributes, i.e., timestamp [TS], market [M], symbol [FSYM], price [P], and volume [Q].

Table 1. Attributes of each raw dataset obtained.

Twitter:
created at: ‘Fri Jan 14 07:00:00 +0000 2022’,
id: 61e173da2e853f6c8c8c92ff,
id str: ‘148197437765466521’,
text: ‘RT @Saki5786: @WatcherGuru A big transformation is on the way! The TIME HAS COME for #CryptoIslandDAO!NOW is the best time to start thi. . . ’,
truncated: True,
entities:
hashtags: [],
followers: [],
user mentions: [],
urls: [
url: ”,
display url: ‘twitter.com/i/web/status/1. . . ’,
location: []],
metadata:
iso language code: ‘en’,
result type: ‘recent’,
href=“https://mobile.twitter.com, accessed on 14 January 2022”
rel=“nofollow”>Twitter Web App,

CryptoCompare:
date:“2022-01-14 07:00:00”
TYPE:“0”
M:“Coinbase”
FSYM:“BTC”
TSYM:“USD”
F:“2”
ID:“263883436”
TS:“1642165200”
Q:“0.00059115”
P:“42070.6406”
TOTAL:“24.8704”
RTS:“1642165200”
TSNS:“7000000000”
RTSNS:“392000000”

Data streams feed their corresponding topic (Twitter and Crypto) in Apache Kafka™.
Data streaming is executed in a parallel manner, and in that way they can be processed
simultaneously. Then, data processing is sequentially carried out in Apache Spark™, which
allows the computation tasks to be divided between various processors forming a cluster.
Data from Twitter in Table 1 contain fields related to the user that, for the purposes of
this study, are not representative. Only the following attributes were kept: timestamp, id,
and text. In the case of data obtained from CryptoCompare, none of their attributes were
discarded as they contain representative information regarding each trade. At this point,
data are transformed into a binary object which can be managed by Apache Kafka™. Text
data from each tweet are processed in the real-time streaming layer by means of Spark NLP,
a library for natural language processing that is one of the most widely used NLP
libraries [39,40]. The attribute text is split into sentences and, for each one, sentiment analysis
is executed to identify whether it is positive or negative. Thus, a new attribute, sentiment, is
generated for each tweet. On the other hand, data on the Apache Kafka™ topic Crypto are
transformed to have the same notation as data from the topic Twitter, so they can be related.
Finally, data are immediately uploaded to the corresponding collection in the database
hosted at MongoDB™.
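As an illustration of bringing a CryptoCompare record into a common notation, the following sketch converts the sample trade from Table 1 into a record keyed by minute and currency. The output field names (timestamp, currency, price, volume) and the function name are hypothetical; the paper does not list the exact notation used. The UTC-6 offset is taken from the collection period stated in the text.

```python
from datetime import datetime, timedelta, timezone

# Collection times in the text are given in UTC-6.
LOCAL_TZ = timezone(timedelta(hours=-6))


def normalize_trade(raw):
    """Map a raw trade (string values, as in Table 1) to a typed record."""
    moment = datetime.fromtimestamp(int(raw["TS"]), tz=LOCAL_TZ)
    return {
        # Truncate to the minute so trades and tweets share the same key.
        "timestamp": moment.replace(second=0, microsecond=0),
        "currency": raw["FSYM"],
        "price": float(raw["P"]),
        "volume": float(raw["Q"]),
    }


# Subset of the sample CryptoCompare record shown in Table 1.
sample = {"FSYM": "BTC", "TS": "1642165200", "Q": "0.00059115", "P": "42070.6406"}
trade = normalize_trade(sample)
print(trade["timestamp"].strftime("%Y-%m-%d %H:%M:%S"), trade["currency"])
```

The epoch value in the TS field resolves to the same local time as the date field of the sample record.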
By following the process presented in Figure 2, a new dataset that relates the individual
data from topics Twitter and Crypto is queried from the database. Attributes timestamp
and currency are defined as keys to establish a relationship that allows generating a new
dataset containing facts regarding the transactions. Considering the speculative nature of
cryptocurrencies dominated by short-term investors [25], data are analyzed on a time basis
of minutes; thereafter, new attributes are calculated: number of tweets, accumulated sentiment,
transaction volume, average currency price, and number of transactions. The obtained dataset,
as shown in Table 2, is then sent to a new topic (Query) in Apache Kafka™ to be streamed
to Apache Spark™ and thus processed in a second iteration.
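The relation of the two datasets on the keys timestamp and currency can be sketched as a per-minute aggregation. This is a simplified illustration in plain Python, not the MongoDB™ query used in the study; the input field names (minute, currency, sentiment, volume, price) are assumptions, and buy and sell transactions are not separated here, although Table 2 reports them separately.

```python
from collections import defaultdict


def relate_per_minute(tweets, trades):
    """Aggregate tweets and trades into one record per (minute, currency)."""
    rows = defaultdict(
        lambda: {"tweets": 0, "sentiment": 0, "volume": 0.0, "price_sum": 0.0, "transactions": 0}
    )
    for tweet in tweets:
        row = rows[(tweet["minute"], tweet["currency"])]
        row["tweets"] += 1
        row["sentiment"] += tweet["sentiment"]  # accumulated sentiment
    for trade in trades:
        row = rows[(trade["minute"], trade["currency"])]
        row["volume"] += trade["volume"]
        row["price_sum"] += trade["price"]
        row["transactions"] += 1

    related = {}
    for key, row in rows.items():
        count = row["transactions"]
        related[key] = {
            "tweets": row["tweets"],
            "sentiment": row["sentiment"],
            "volume": row["volume"],
            # Average price is undefined for a minute without trades.
            "avg_price": row["price_sum"] / count if count else None,
            "transactions": count,
        }
    return related


# Made-up records for a single minute.
tweets = [
    {"minute": "07:00", "currency": "BTC", "sentiment": -1},
    {"minute": "07:00", "currency": "BTC", "sentiment": 1},
    {"minute": "07:00", "currency": "ETH", "sentiment": -1},
]
trades = [
    {"minute": "07:00", "currency": "BTC", "volume": 0.5, "price": 42000.0},
    {"minute": "07:00", "currency": "BTC", "volume": 1.5, "price": 42100.0},
]
related = relate_per_minute(tweets, trades)
print(related[("07:00", "BTC")])
```

Separating buy and sell figures would only require adding the transaction side to the aggregation, assuming that the side of each trade is available in the normalized record.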

Table 2. Relation between Twitter and CryptoCompare datasets on a time basis of minutes.

Timestamp                      Symb.  Tweets  Sent.  Sell Vol.  Sell Avg.  Sell No.  Buy Vol.  Buy Avg.  Buy No.
14 January 2022 T07:00:00.00   BTC    693     −165   1.77       42.1 *     101       3.98      42.1 *    160
14 January 2022 T07:00:00.00   ETH    878     −352   63.19      3.21 *     182       23.5      3.21 *    160
14 January 2022 T07:01:00.00   BTC    618     −124   4.9        42.0 *     154       5.11      42.0 *    213
14 January 2022 T07:01:00.00   ETH    809     −238   24.7       3.21 *     176       24.4      3.21 *    155
14 January 2022 T07:02:00.00   BTC    620     −160   0.38       42.0 *     95        2.06      42.0 *    165
14 January 2022 T07:02:00.00   ETH    815     −272   76.2       3.21 *     135       66.7      3.21 *    135
* Expressed in thousands.

With the purpose of demonstrating the application of the proposed algorithm, we col-
lected data for a period of 8 h, from 06:59:00 (UTC-6) to 16:59:00 (UTC-6) of 14 January 2022.
This corresponds to 248,313 tweets, 73,506 sell transactions, and 114,493 buy transactions
of both cryptocurrencies. Figures 3 and 4 show a graphical representation of the behavior
of the collected data regarding the cryptocurrencies Bitcoin (BTC) and Ethereum (ETH),
respectively.

Figure 3. Graphical representation of data related to cryptocurrency Bitcoin (BTC). Source: compiled
by authors.

Figure 4. Graphical representation of data related to cryptocurrency Ethereum (ETH). Source: com-
piled by authors.

It is noticeable that in the case of Bitcoin (BTC), as the price increases, the sentiment
does too. A similar behavior is seen when the price remains steady, thus having a stable
sentiment range. On the other hand, buy and sell transactions seem to behave according to the
change in price, and this means that an increase or decrease in price is related to a larger
or smaller number of buy and sell transactions, respectively; nevertheless, this behavior
appears to happen only when there is an abrupt change in price. It is worth mentioning
that the number of tweets and transactions tends to lower values as the day goes by. This
may indicate that the vast majority of activities regarding cryptocurrencies are carried
out during normal working hours. Moreover, in the case of Ethereum (ETH), its behavior
is similar to that of Bitcoin (BTC). As shown in Figure 4, there is a relation between the
number of tweets, the sentiment around them, and price, but only when the price change
is abrupt. When the price remains steady, the rest of the variables seem to behave in the
same manner. In this case, it can also be seen that during the final minutes of the graph, the
sentiment does not affect the price, which tends to remain largely unchanged. Finally,
as in the previous graph, the number of tweets and transactions tends to decrease as the
day passes by.
To determine whether there is a correlation between variables, a Pearson correlation
analysis was executed. For this purpose, data were standardized to let all the attributes
be expressed in the same terms, so they can be correctly related. Table 3 presents a
correlation matrix for the corresponding variables of the dataset, from which only the
statistically significant values (p-value < 0.05) were considered. It was found that there is
a positive correlation between the number of tweets and the sell and buy volumes (0.34,
0.43). Additionally, there is a positive correlation between the sentiment and the buy and
sell prices of the cryptocurrencies (0.30, 0.30), while the correlation between the latter and
the number of tweets is negative (−0.69, −0.69). In addition, a correlation between volume,
avg. price, and number of transactions of both buy and sell transactions was found, which
was expected because of their mutually dependent nature. This approach complements the
findings from Figures 3 and 4.
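The standardization and correlation steps can be sketched as follows: each attribute is converted to z-scores, and the Pearson coefficient is the mean of the products of the paired z-scores. This is a minimal illustration with made-up values; the significance test (p-value) applied in the study is not reproduced here, and the function names are assumptions.

```python
from math import sqrt


def standardize(values):
    """Convert values to z-scores (zero mean, unit population standard deviation)."""
    mean = sum(values) / len(values)
    std = sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def pearson(x, y):
    """Pearson correlation as the mean product of paired z-scores."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)


# Made-up per-minute values, for illustration only.
tweets_per_minute = [600, 650, 700, 720, 800]
buy_volume = [2.0, 2.4, 2.3, 3.1, 3.5]
print(round(pearson(tweets_per_minute, buy_volume), 2))
```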
Before returning data to the streaming layer for the execution of the k-means clustering
approach, an optimal number of clusters is defined by means of the silhouette method, which
measures compactness and separation of data [41]. Compactness refers to the similarity
between each data point and its own cluster, while separation refers to its dissimilarity
from the other clusters [42]. In this case, the optimal number of clusters is determined according to the
collected data; therefore, a silhouette coefficient was calculated for an arbitrary range of
clusters, from k = 3 to k = 9. As our data cover two cryptocurrencies, we excluded a
k-value of 2 with the purpose of grouping data beyond their cryptocurrency symbol. The
silhouette coefficient ranges between −1 and 1, with 1 being the value that denotes that
clusters are apart from each other, and data points belonging to them are close to their
centroid, while −1 denotes that data points are grouped in the wrong clusters and that their
centers are not well separated [43]. In that context, the higher the value of the coefficient the
better the behavior of the clusters. We selected the optimal number of clusters according
to these criteria. Figure 5 shows the calculated values of the silhouette coefficient for the
clusters between the defined range. The highest coefficient is obtained by grouping data in
3 clusters; therefore, this is the number that we consider for k.
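The silhouette coefficient described above can be computed directly from its definition. For each point, a is the mean distance to the other points of its own cluster and b is the lowest mean distance to the points of any other cluster; the point's score is (b − a) / max(a, b), and the coefficient is the average over all points. The sketch below is a plain-Python illustration with made-up points, not the routine used in the study, and it assumes at least two clusters.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def silhouette_coefficient(points, labels):
    """Mean silhouette score over all points; requires at least two clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-point clusters
            continue
        # Compactness: mean distance to the other members of the same cluster.
        a = sum(dist(point, other) for other in own) / (len(own) - 1)
        # Separation: mean distance to the nearest other cluster.
        b = min(
            sum(dist(point, other) for other in members) / len(members)
            for key, members in clusters.items()
            if key != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Made-up points: two tight, well-separated groups.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_coefficient(points, [0, 0, 1, 1])  # groups respected
bad = silhouette_coefficient(points, [0, 1, 0, 1])   # groups mixed
print(round(good, 3), round(bad, 3))
```

Selecting k then amounts to running the clustering for each candidate value (here, k = 3 to k = 9) and keeping the value with the highest coefficient.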

Table 3. Pearson correlation matrix.

            Tweets  Sent   Sell Vol.  Sell Avg.  Sell No.  Buy Vol.  Buy Avg.  Buy No.
symb.       -       -      -          -          -         -         -         -
tweets      1       -      -          -          -         -         -         -
sent        *       1      -          -          -         -         -         -
sell vol.   0.34    −0.07  1          -          -         -         -         -
sell avg.   −0.69   0.30   −0.39      1          -         -         -         -
sell no.    *       0.09   0.34       0.12       1         -         -         -
buy vol.    0.43    −0.10  0.52       −0.49      0.24      1         -         -
buy avg.    −0.69   0.30   −0.39      -          0.12      −0.49     1         -
buy no.     *       0.04   0.19       0.09       *         0.39      -         1
* Omitted: p-value ≥ 0.05.

Figure 5. Silhouette coefficient related to the number of clusters. Source: compiled by authors.

Now that the optimal number of clusters is selected, data are modeled at the streaming
layer in a second iteration. Thereafter, it was found that data are grouped according to their
symbol in the first and second clusters; nevertheless, the third cluster concentrates data
from both cryptocurrencies whose numbers of buy and sell transactions are significantly
higher in comparison with the rest of the data; in consequence, the volume of bought and
sold cryptocurrencies is also higher. In those cases, on average, the sentiment tends to
be more positive and the number of tweets higher. Table 4 shows the average values of
the grouped data, which indicate that cluster 3 groups data related to an increase in the
activity over cryptocurrencies. In that sense, and in relation to findings from graphs in
Figures 3 and 4, we consider that clusters 1 and 2 contain data corresponding to a steady
behavior of the cryptocurrencies while cluster 3 corresponds to data whose behavior is
more volatile.

Table 4. Average values separated by cluster.

Cluster  Symb.  Avg. Tweets  Avg. Sent.  Avg. Sell Vol.  Avg. Sell Price  Avg. Sell No.  Avg. Buy Vol.  Avg. Buy Price  Avg. Buy No.  % Data
1        BTC    515          −82         4.7             42.8 *           145            5.65           42.8 *          223.4         42%
2        ETH    720          −103        31.0            3.27 *           125            39.24          3.27 *          189.9         38%
3        BTC    564          −72         16.4            43.0 *           332            33.25          43.0 *          603.7         14%
         ETH    789          −77         87.8            3.28 *           194            105.24         3.28 *          290.0
* Expressed in thousands.

4. Discussion
Results show that the proposed iterative kappa architecture is useful for processing
data and for determining patterns in real time. From the correlation analysis, it was found
that there is a relation between the activity in social networks, i.e., Twitter, and the behavior
of cryptocurrency markets. This evidences a positive correlation between the number of
tweets and the buy and sell volumes of the cryptocurrencies. The findings support previous
studies [19,22–24], in which it was found that the number of tweets and sentiment were
positively correlated with cryptocurrencies’ transaction volumes and prices. In addition,
by means of the k-means clustering approach, it was found that some data lie outside the
common trends regarding transaction volumes of the cryptocurrencies. We have identified
the outliers by grouping data in three clusters; two of them correspond to a steady behavior
of the cryptocurrencies, while the third gathers data related to unusual transaction volumes.
Thus, this latter group is useful for identifying anomalous behaviors in the market which
are characterized mainly by a higher volume of tweets with a more positive sentiment, and
higher transaction volumes.
From the executed k-means clustering approach, we have found that around 14%
of data fall in the third cluster. In that cluster, on average, Bitcoin (BTC) was sold and
bought around 128% and 170% more times than in cluster 1, while for Ethereum (ETH), the
percentages were 54% and 52%, respectively, in comparison with cluster 2, thus resulting
in higher transaction volumes. In both cases, the number of tweets was around 9.5%
higher than in the first two clusters. Additionally, the sentiment of the tweets shows
higher values (12% for Bitcoin (BTC) and 25% for Ethereum (ETH)) in the third cluster.
These findings suggest that a more positive sentiment toward
cryptocurrencies is associated with increased activity in the market, which is consistent with the correlation
found between the number of tweets and the buy and sell volumes.
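The percentages in this paragraph follow from the averages in Table 4 as relative increases of cluster 3 over the corresponding steady cluster. The sketch below recomputes them; the helper name is an assumption, and small differences from the reported figures are to be expected because Table 4 lists rounded averages.

```python
def percent_increase(value, baseline):
    """Relative change of value over baseline, in percent."""
    return (value - baseline) / abs(baseline) * 100


# Averages copied from Table 4: (tweets, sentiment, sell count, buy count).
steady = {"BTC": (515, -82, 145, 223.4), "ETH": (720, -103, 125, 189.9)}
volatile = {"BTC": (564, -72, 332, 603.7), "ETH": (789, -77, 194, 290.0)}
names = ("tweets", "sentiment", "sell no.", "buy no.")

changes = {
    symbol: {
        name: percent_increase(v, s)
        for name, v, s in zip(names, volatile[symbol], steady[symbol])
    }
    for symbol in steady
}

for symbol, row in changes.items():
    print(symbol, {name: round(value, 1) for name, value in row.items()})
```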
The proposed architecture may be mistaken for a lambda architecture because
both have a batch step; nevertheless, they accomplish different tasks and thus serve different
purposes. While the lambda architecture contains an extra batch layer that receives data
simultaneously with the streaming layer, our proposed variant of the kappa architecture
applies a batch step inside the existing serving layer to temporarily store processed data.
In that sense and in comparison with the simple kappa architecture, our proposal has the
advantage of being able to relate various datasets in the second iteration by considering
a different time span than the one selected for data streaming at the first iteration. It is a
flexible architecture, which offers an alternative solution for real-time data processing and
modeling from the perspective of traditional techniques, i.e., relational databases [44].
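The two iterations described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system built in this study: an in-memory SQLite database stands in for the relational store in the serving layer, and the table names, window sizes, and sample events are hypothetical.

```python
import sqlite3
from collections import defaultdict

FIRST_WINDOW = 60     # seconds; window used while streaming (assumed)
SECOND_WINDOW = 3600  # seconds; wider span for the second iteration (assumed)


def first_iteration(db, table, events):
    """Aggregate one stream per short window and store it (the batch step)."""
    totals = defaultdict(float)
    for timestamp, value in events:
        totals[timestamp // FIRST_WINDOW * FIRST_WINDOW] += value
    db.executemany(
        f"INSERT INTO {table} (window_start, total) VALUES (?, ?)",
        totals.items(),
    )


def second_iteration(db):
    """Relate the stored datasets over a wider time span."""
    query = """
        SELECT t.span, t.tweets, v.volume
        FROM (SELECT window_start / :w AS span, SUM(total) AS tweets
              FROM tweets GROUP BY window_start / :w) AS t
        JOIN (SELECT window_start / :w AS span, SUM(total) AS volume
              FROM trades GROUP BY window_start / :w) AS v
          ON t.span = v.span
        ORDER BY t.span
    """
    return db.execute(query, {"w": SECOND_WINDOW}).fetchall()


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tweets (window_start INTEGER, total REAL)")
db.execute("CREATE TABLE trades (window_start INTEGER, total REAL)")

# Hypothetical processed stream records: (epoch_seconds, value).
tweet_events = [(5, 1), (42, 1), (75, 1), (3610, 1), (3700, 1)]  # one per tweet
trade_events = [(10, 2.0), (65, 1.5), (3650, 4.0)]               # traded volume

first_iteration(db, "tweets", tweet_events)
first_iteration(db, "trades", trade_events)

rows = second_iteration(db)
print(rows)  # one (hour index, tweet count, traded volume) row per hour
```

Because the per-minute aggregates are kept in relational tables, the second pass can regroup and join them with ordinary SQL at any coarser time span without re-reading the streams.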
The application of our proposal is not limited to the execution of a k-means clustering
approach. Other unsupervised machine learning algorithms may be explored, such as
hierarchical cluster analysis (HCA) or fuzzy C-means clustering, which could help find different
patterns regarding the behavior of cryptocurrencies. In addition, supervised machine
learning algorithms may be supported. Some other studies proposed a similar application
of the kappa architecture to process and model data in real time [33,45]; nevertheless,
our proposal differs in the way data are processed. None of the previous studies found
combined an iterative approach with a batch step involving a database management system
and machine learning processing together. The iterative kappa architecture proposed in
this study contributes to expanding the alternatives for real-time data processing with
machine learning techniques. Even though this study considers only data from Twitter for a
limited period of time and in a specific language, in future work, data from other social
networks, e.g., Reddit and Telegram [14], over a longer period and in other languages can be
evaluated. In addition, other machine learning algorithms may be explored within the architecture
in order to widen the knowledge regarding the data. The integration of new
data sources in order to analyze the architecture from a multidimensional approach also
remains open for further studies. Finally, a higher volume of data and more attributes may
be considered in order to identify whether other variables correlate with specific trends
in the cryptocurrency market.

Author Contributions: Methodology, A.B.; Supervision, R.-M.C.-C.; Writing—review and editing,
A.B. and A.T.-G. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding. The APC was funded by UPAEP-University.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Restrictions apply to the availability of these data. Data were obtained
in real time from Twitter and CryptoCompare and are available at https://twitter.com (accessed
on 14 January 2022) and https://www.cryptocompare.com (accessed on 14 January 2022) with the
permission of Twitter and CryptoCompare.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Peters, G.; Panayi, E.; Chapelle, A. Trends in Cryptocurrencies and Blockchain Technologies: A Monetary Theory and Regulation
Perspective. J. Financ. Perspect. 2017, 3, 1–46.
2. de Albuquerque, B.S.; de Castro Callado, M. Understanding Bitcoins: Facts and Questions. Rev. Bras. Econ. 2015, 69, 3–16.
[CrossRef]
3. Hassani, H.; Huang, X.; Silva, E.S. Fusing Big Data, Blockchain, and Cryptocurrency. In Fusing Big Data, Blockchain and
Cryptocurrency: Their Individual and Combined Importance in the Digital Economy; Hassani, H., Huang, X., Silva, E.S., Eds.; Springer
International Publishing: Cham, Switzerland, 2019; pp. 99–117. [CrossRef]
4. Shen, D.; Urquhart, A.; Wang, P. Does Twitter Predict Bitcoin? Econ. Lett. 2019, 174, 118–122. [CrossRef]
5. Mallikarjuna, B.; Ramana, T.; Kallam, S.; Patan, R.; Manikandan, R. Visualizing Bitcoin Using Big Data Mempool Visualization,
Visualization, Peer Visualization, Attack Visual Analysis, High-Resolution Visualization of Bitcoin Systems, Effectiveness. In
Blockchain, Big Data and Machine Learning, 1st ed.; CRC Press: Boca Raton, FL, USA, 2020; pp. 155–176. [CrossRef]
6. Harwick, C. Cryptocurrency and the Problem of Intermediation. Independ. Rev. 2016, 20, 569–588.
7. CoinMarketCap. Bitcoin. Available online: https://coinmarketcap.com/currencies/bitcoin/ (accessed on 28 December 2021).
8. Antonopoulos, A.M.; Wood, G. Mastering Ethereum: Building Smart Contracts and DApps; O’Reilly Media, Inc.: Sebastopol, CA,
USA, 2018.
9. Nizzoli, L.; Tardelli, S.; Avvenuti, M.; Cresci, S.; Tesconi, M.; Ferrara, E. Charting the Landscape of Online Cryptocurrency
Manipulation. IEEE Access 2020, 8, 113230–113245. [CrossRef]
10. Tandon, C.; Revankar, S.; Palivela, H.; Parihar, S.S. How Can We Predict the Impact of the Social Media Messages on the Value of
Cryptocurrency? Insights from Big Data Analytics. Int. J. Inf. Manag. Data Insights 2021, 1, 100035. [CrossRef]
11. Bitcoin Tweets Chart. Available online: https://bitinfocharts.com/comparison/bitcoin-tweets.html (accessed on 28 December
2021).
12. Internet Live Stats. Twitter Usage Statistics. Available online: https://www.internetlivestats.com/twitter-statistics/ (accessed on
28 December 2021).
13. Sayce, D. The Number of Tweets per Day in 2020. 2019. Available online: https://www.dsayce.com/social-media/tweets-day/
(accessed on 28 December 2021).
14. Rothman, T. Trading the Dream: Does Social Media Affect Investors Activity—The Story of Twitter, Telegram and Reddit. Int. J.
Financ. Res. 2019, 10, 147–152. [CrossRef]
15. Nasdaq Data Link. Bitcoin Number of Transactions. 2021. Available online: https://data.nasdaq.com (accessed on 28
December 2021).
16. Campbell, S. Twitter Statistics 2022: How Many People Use Twitter? 2021. Available online: //thesmallbusinessblog.net/twitter-statistics/
(accessed on 29 December 2021).
17. Ghani, N.A.; Hamid, S.; Targio Hashem, I.A.; Ahmed, E. Social Media Big Data Analytics: A Survey. Comput. Hum. Behav. 2019,
101, 417–428. [CrossRef]
18. Bandi, A.; Hurtado, J.A. Big Data Streaming Architecture for Edge Computing Using Kafka and Rockset. In Proceedings of the
2021 5th International Conference on Computing Methodologies and Communication (ICCMC), Erode, India, 8–10 April 2021;
pp. 323–329. [CrossRef]
19. Mohapatra, S.; Ahmed, N.; Alencar, P. KryptoOracle: A Real-Time Cryptocurrency Price Prediction Platform Using Twitter
Sentiments. In Proceedings of the 2019 IEEE International Conference on Big Data (Big Data) Los Angeles, CA, USA, 9–12
December 2019; pp. 5544–5551. [CrossRef]
20. Bandi, A. Data Streaming Architecture for Visualizing Cryptocurrency Temporal Data. In Computer Networks, Big Data and IoT;
Pandian, A., Fernando, X., Islam, S.M.S., Eds.; Springer: Singapore, 2021; Volume 66, pp. 651–661. [CrossRef]
21. Horvat, N.; Ivkovic, V.; Todorovic, N.; Ivančević, V.; Gajić, D.; Lukovic, I. Big Data Architecture for Cryptocurrency Real-time
Data Processing. In Proceedings of the ICIST 2020 Proceedings, Information Society of Serbia—ISOS, Belgrade, Serbia, 8–11
March 2020; pp. 150–155.
22. Abraham, J.; Higdon, D.; Nelson, J.; Ibarra, J. Cryptocurrency Price Prediction Using Tweet Volumes and Sentiment Analysis.
SMU Data Sci. Rev. 2018, 1, 1–21.
23. Park, H.W.; Lee, Y. How Are Twitter Activities Related to Top Cryptocurrencies’ Performance? Evidence from Social Media
Network and Sentiment Analysis. Drustvena Istrazivanja 2019, 28, 435–460. [CrossRef]
24. Garcia, D.; Tessone, C.J.; Mavrodiev, P.; Perony, N. The Digital Traces of Bubbles: Feedback Cycles between Socio-Economic
Signals in the Bitcoin Economy. J. R. Soc. Interface 2014, 11, 20140623. [CrossRef] [PubMed]
25. Kjærland, F.; Meland, M.; Oust, A.; Øyen, V. How Can Bitcoin Price Fluctuations Be Explained? Int. J. Econ. Financ. Issues 2018,
8, 323–332.
26. Aharon, D.Y.; Demir, E.; Lau, C.K.M.; Zaremba, A. Twitter-Based Uncertainty and Cryptocurrency Returns; SSRN Scholarly Paper ID
3735435; Social Science Research Network: Rochester, NY, USA, 2020. [CrossRef]
27. Baek, H.; Oh, J.; Kim, C.Y.; Lee, K. A Model for Detecting Cryptocurrency Transactions with Discernible Purpose. In Proceedings
of the 2019 Eleventh International Conference on Ubiquitous and Future Networks (ICUFN), Zagreb, Croatia, 2–5 July 2019;
pp. 713–717. [CrossRef]
28. Aspembitova, A.T.; Feng, L.; Chew, L.Y. Behavioral Structure of Users in Cryptocurrency Market. PLoS ONE 2021, 16, e0242600.
[CrossRef] [PubMed]
29. Fang, J.; Chiu, D.K.W.; Ho, K.K.W. Exploring Cryptocurrency Sentiments with Clustering Text Mining on Social Media. In
Intelligent Analytics with Advanced Multi-Industry Applications; Sun, Z., Ed.; IGI Global: Hershey, PA, USA, 2021; pp. 157–171.
[CrossRef]
30. Kreps, J. Questioning the Lambda Architecture. 2014. Available online: https://www.oreilly.com/radar/questioning-the-lambda-architecture/
(accessed on 28 December 2021).
31. Marz, N.; Warren, J. Lambda Architecture. In Big Data: Principles and Best Practices of Scalable Real-Time Data Systems; Manning
Publications: Westhampton, NY, USA, 2015; p. 328.
32. Domínguez, J. De Lambda a Kappa: Evolución de las Arquitecturas Big Data. 2018. Available online:
https://www.paradigmadigital.com/techbiz/de-lambda-a-kappa-evolucion-de-las-arquitecturas-big-data/ (accessed on 29 December 2021).
33. Nkamla Penka, J.B.; Mahmoudi, S.; Debauche, O. A New Kappa Architecture for IoT Data Management in Smart Farming.
Procedia Comput. Sci. 2021, 191, 17–24. [CrossRef]
34. ProjectPro. How Data Partitioning in Spark Helps Achieve More Parallelism? 2021. Available online:
https://www.projectpro.io/article/how-data-partitioning-in-spark-helps-achieve-more-parallelism/297 (accessed on 29 December 2021).
35. Sinaga, K.P.; Yang, M.S. Unsupervised K-Means Clustering Algorithm. IEEE Access 2020, 8, 80716–80727. [CrossRef]
36. Likas, A.; Vlassis, N.; Verbeek, J. The Global K-Means Clustering Algorithm. Patt. Recognit. 2003, 36, 451–461. [CrossRef]
37. Cryptocompare. Cryptocurrency API, Historical & Real-Time Market Data. Available online: https://min-api.cryptocompare.com
(accessed on 14 January 2022).
38. Roesslein, J. Tweepy. Available online: https://www.tweepy.org/ (accessed on 4 January 2022).
39. Kuilboer, J.P.; Stull, T. Text Analytics and Big Data in the Financial Domain. In Proceedings of the 2021 16th Iberian Conference
on Information Systems and Technologies (CISTI), Chaves, Portugal, 23–26 June 2021; pp. 1–4.
40. John Snow Labs. Spark NLP. Available online: https://nlp.johnsnowlabs.com/ (accessed on 4 January 2022).
41. Lengyel, A.; Botta-Dukát, Z. Silhouette Width Using Generalized Mean—A Flexible Method for Assessing Clustering Efficiency.
Ecol. Evol. 2019, 9, 13231–13243. [CrossRef] [PubMed]
42. Yuan, C.; Yang, H. Research on K-Value Selection Method of K-Means Clustering Algorithm. J 2019, 2, 226–235. [CrossRef]
43. Hmwe, T.T.; Thein, N.Y.T.; Cho, K.M. Improving Clustering Quality Using Silhouette Score. J. Comput. Appl. Res. 2020, 1, 58–62.
44. IBM Cloud Education. What Is Data Modeling? 2020. Available online: https://www.ibm.com/cloud/learn/data-modeling (accessed on
20 January 2022).
45. Zschörnig, T.; Wehlitz, R.; Franczyk, B. A Personal Analytics Platform for the Internet of Things—Implementing Kappa
Architecture with Microservice-based Stream Processing. In Proceedings of the 19th International Conference on Enterprise
Information Systems, Porto, Portugal, 26–29 April 2017; SCITEPRESS—Science and Technology Publications: Porto, Portugal,
2017; pp. 733–738. [CrossRef]
