International Journal of Electrical and Computer Engineering (IJECE)
Vol. 9, No. 4, August 2019, pp. 2620~2626
ISSN: 2088-8708, DOI: 10.11591/ijece.v9i4.pp2620-2626
Journal homepage: http://iaescore.com/journals/index.php/IJECE
Online data preprocessing: a case study approach
Mohammed Zuhair Al-Taie1, Seifedine Kadry2, Joel Pinho Lucas3
1 UTM Big Data Centre, Universiti Teknologi Malaysia, Malaysia
2 Department of Mathematics and Computer Science, Faculty of Science, Beirut Arab University, Lebanon
3 Tail Target, Brazil
Article Info
Article history:
Received Jan 28, 2018
Revised Aug 11, 2018
Accepted Mar 4, 2019
ABSTRACT
Alongside web search and e-mail, social networking is now one of the three
most popular uses of the Internet. Every day, a tremendous number of people
write articles and share photos, videos, and links at a scope and scale
never imagined before. However, because social network data are huge and
come from heterogeneous sources, the data are highly susceptible to
inconsistency, redundancy, noise, and loss. For data scientists, preparing the
data and getting it into a standard format is critical because the quality of
data is going to directly affect the performance of mining algorithms that are
going to be applied next. Low-quality data will certainly limit the analysis
and lower the quality of mining results. To this end, the goal of this study is
to provide an overview of the different phases involved in data
preprocessing, with a focus on social network data. As a case study, we will
show how we applied preprocessing to the data that we collected for the
Malaysian Flight MH370 that disappeared in 2014.
Keywords:
Data preprocessing
Data science
Flight MH370
Social networks
Copyright © 2019 Institute of Advanced Engineering and Science.
All rights reserved.
Corresponding Author:
Seifedine Kadry,
Department of Mathematics and Computer Science,
Beirut Arab University,
Lebanon.
Email: s.kadry@bau.edu.lb
1. INTRODUCTION
Online social networks, or social media websites, are the recent form of traditional social networks.
They provide the mechanisms that allow people to understand, interact, engage, and collaborate with each
other. Platforms such as Facebook, YouTube, LinkedIn, and Twitter are now heavily used for both
leisure and business. They allow users to share text, audio, video, hyperlinks, and photographs with
others, and therefore large volumes of their social activities are stored. Facebook, for instance, stores more
than 30 billion new pieces of content every month, produced by more than one billion active users.
The analysis of data that come from different social media sources (for example, Facebook, Twitter,
and YouTube) is critical for effective decision making [1]. It can be used to identify trends, develop business
opportunities, predict customer behavior and market shifts, investigate crime, and manage natural disaster
risk.
Data preprocessing consumes most of the time and implementation effort and can be more critical
than the machine-learning algorithm itself. If any of the data preprocessing phases is not handled correctly,
machine-learning algorithms may not run or may give misleading results [2]. Given the significance
of data preprocessing to data science [3], this study discusses data preprocessing and how we applied it to
Flight MH370 social data.
The rest of the study is organized as follows: Section 2 discusses what is meant by social networks
and what techniques are used to collect online social data. Section 3 gives an overview of the four tasks
involved in data preprocessing. Section 4 describes how we applied data preprocessing to a dataset that we
collected describing the Malaysian Flight MH370 social structure. Section 5 concludes the study.
2. ONLINE SOCIAL DATA COLLECTION
Typically, social networks are theoretical models for analyzing and visualizing relationships
between actors [4]. A tie between two actors represents some relationship such as kinship, affection,
enmity, exchange of favours or loans, club membership, or event attendance. Ties can also vary in
intensity: some ties are stronger than others. It is also possible to have more than one tie between two
individuals in one network. For example, a group of students in a college may be connected to each other
through friendship, joint courses, club memberships, etc.
In the past, social data were challenging to collect and hard to come by, and many people in the field
limited themselves to using only small amounts of data. However, things have changed with the advent of
online social websites such as Twitter, Facebook, and Flickr that generate much more data than anyone
would expect.
Beyond the traditional methods of collecting social data, online data can be collected using
Application Programming Interfaces (APIs), web crawlers, online surveys, and deployed applications.
However, collecting online social data is not always straightforward as it faces some problems [5] such as
dealing with unstructured and heterogeneous data, dynamic networks, processing power, and data storage.
In some cases, taking measurements on all actors in the relevant actor set is not possible.
A sample of actors is then taken from the complete set, and inference about the
population is made later based on the sample [6]. Sampled data that can be viewed as
representative of the broader population are called a probability sample.
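As a rough illustration of drawing such a sample, the following minimal Python sketch takes a simple random sample (one kind of probability sample) from an actor set. The actor names, the sample size, and the fixed seed are all assumptions made for the example and do not come from the study.

```python
import random


def sample_actors(actors, k, seed=0):
    """Draw a simple random sample of k actors without replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return rng.sample(list(actors), k)


# Hypothetical actor set; in practice this would be the collected user list.
actors = [f"actor_{i}" for i in range(1000)]
sample = sample_actors(actors, 50)
```

Every actor has the same chance of being selected, which is what allows inference about the whole population from the sample.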
3. DATA PREPROCESSING
In contrast to organizations, where both the data and the hierarchy of knowledge are well-organized,
online social data are rich with user-generated annotations and free-style writing. Given that people on
online social platforms are entirely free to write what they want [7], data quality in such environments ranges
from valuable content to advertisements and rubbish [8].
Four steps are typically involved in data preprocessing: data cleaning, data integration, data
transformation, and data reduction. We will address each step in detail and show how it can be used for
knowledge discovery. These processes are not isolated from each other but instead are integral components
of data preparation. If any of these steps is not performed as planned, data mining algorithms may not run
or may give unexpected results. We may safely say that having good and robust data is more
important than having an efficient algorithm that is applied to a large quantity of poor data.
3.1. Data cleaning
Data cleaning (also known as data cleansing or scrubbing) aims at removing noise, filling out
missing values, and fixing inconsistencies in data. Dirty data in the database can be the result of wrong data
entry, update, or transmission [9]. This phase also includes identification and removal of outliers [10].
Performing data cleaning for social network data is not always straightforward as it can face several
issues [5]. For example, it requires a thorough understanding of the data. Images, video, audio, failed
HTTP records, HTML tags, and white spaces should be removed. Spam detection should also be pursued.
Machine learning methods have been commonly used to perform data cleaning in social data [11].
For example, expectation maximization with Gaussian mixture models is used to manage missing data,
while Bayesian models are used for cleaning data. In the blogosphere domain, [12] performed data cleaning by
trimming white spaces, punctuation, and stop words from blog posts. [13] proposed the removal of inactive
users from the social network. In the trust-based information propagation model they built, the authors
isolated inactive users from other users because they never or rarely sent messages and had little influence
on information propagation. To deal with noise in Twitter data, the authors in [14] removed questions,
URLs, special characters, and retweets before applying sentiment analysis. Also targeting Twitter,
[15] proposed several methods to clean data. They used regular expression tools
to remove unnecessary tags, an English lexicon to filter out words in languages other than English, and an
external source to eliminate stop words. They aimed to discover people's opinions by using topic models,
sentiment analysis, and geolocation information. In email communications, data cleaning was
applied by [16] to remove non-essential information from the corpus, such as message ID, message
timestamp, sender and receiver information, as well as all messages with invalid email addresses.
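The kinds of cleaning steps surveyed above (dropping retweets, stripping HTML tags, URLs, special characters, and stop words) can be sketched with Python's standard `re` module. This is only an illustrative sketch and not the procedure of any cited study: the regular expressions, the tiny stop-word list, and the sample posts are assumptions.

```python
import re

# Tiny illustrative stop-word list (an assumption; real pipelines use a fuller lexicon).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "on", "for"}

URL_RE = re.compile(r"https?://\S+")        # web links
TAG_RE = re.compile(r"<[^>]+>")             # HTML tags
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")   # special characters (after lowercasing)


def is_retweet(text):
    """Treat posts starting with 'RT @' as retweets (a common convention)."""
    return text.lstrip().lower().startswith("rt @")


def clean_post(text):
    """Strip tags, URLs, special characters, extra white space, and stop words."""
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = text.lower()
    text = NON_ALNUM_RE.sub(" ", text)
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)


# Hypothetical posts used only to exercise the functions.
posts = [
    "RT @someone: Breaking news!",
    "<b>Search</b> for the plane continues: https://example.com/story #MH370",
]
cleaned = [clean_post(p) for p in posts if not is_retweet(p)]
```

Here the retweet is discarded and the second post is reduced to its content-bearing tokens.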
3.2. Data integration
Inconsistency and redundancy in social data are very likely because users have different
perspectives and behave differently [13]. Data integration aims at combining data from several
sources into coherent data storage [10]. This task is not simple [9] as it requires achieving a match between
different schema types. Incomplete or inefficient data integration can cause inconsistency and redundancy,
while a proper implementation will undoubtedly enhance accuracy and speed up later processes. Techniques
of data integration include entity identification, tuple deduplication, redundancy and correlation analysis, as
well as data value conflict detection and resolution.
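A minimal sketch of three of these techniques (entity identification, tuple deduplication, and data value conflict detection) is given here in plain Python. The record layout, the name-normalization rule, and the "first value wins" policy are assumptions chosen for brevity; a real integration step would need a resolution policy suited to the sources.

```python
def normalize_key(name):
    """Entity identification: map spelling variants of a name to one key."""
    return " ".join(name.lower().split())


def integrate(*sources):
    """Merge records that refer to the same entity and log value conflicts."""
    merged = {}     # key -> merged record (deduplicated)
    conflicts = []  # (key, field, kept_value, conflicting_value)
    for source in sources:
        for record in source:
            key = normalize_key(record["name"])
            target = merged.setdefault(key, {})
            for field, value in record.items():
                if field == "name":
                    target.setdefault("name", key)
                elif field not in target:
                    target[field] = value       # new information: keep it
                elif target[field] != value:
                    # Conflict detected; the first value is kept and the
                    # disagreement is recorded for later resolution.
                    conflicts.append((key, field, target[field], value))
    return list(merged.values()), conflicts


# Hypothetical records from two sources describing overlapping people.
source_a = [{"name": "Jane  Doe", "age": 30}, {"name": "John Roe", "age": 41}]
source_b = [{"name": "jane doe", "age": 31, "city": "Springfield"}]
records, conflicts = integrate(source_a, source_b)
```

The two spellings of the same name collapse into one record, the extra attribute from the second source is retained, and the disagreement on age is reported instead of being silently overwritten.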
3.3. Data transformation
Data transformation aims at converting data into a usable and understandable format. The goal is to
have more efficient data mining operations [10]. Strategies for data transformation include data smoothing,
feature construction, normalization, discretization, generalization, as well as concept hierarchy generation of
nominal data. These subtasks require human supervision and are highly dependent on the data being
preprocessed [17]. However, data transformation for social data faces a number of challenges [5] including:
a. Data having different formats. This is because the social data extracted from online websites can be both
categorical and continuous. A proper conversion between the two types, in this case, is required.
b. Application dependency, which means that each data management tool has its uses and can deal with only
particular types of data. For example, some tools (e.g., Pajek and UCINET) are standalone applications,
while NodeXL is a utility toolkit. Some packages are provided with programming utilities (e.g., JUNG
and SNAP-GAUSS) whereas others provide a graphical user interface that requires no scripting utilities.
Finally, some tools are free software (e.g., STRUCTURE and StOCNET) while others have commercial
licenses, such as InFlow and SocioMetric LinkAlyzer.
c. Data privacy: users' personal information can be exposed during data transformation. One of the
remedies is to use data anonymization.
Several studies have addressed this issue. For example, regarding data transformation of an email
corpus, [16] proposed the transformation of data into a hierarchical structure, which can be achieved by
grouping the object-based relationships of people into a hierarchy. Objects in this unified structure represent
emails or threads, while each person can perform different roles, such as mail sender, mail receiver, thread
creator, or thread participant. The next step of social data preprocessing would include flattening the hierarchy
of objects into one selected level, followed by multi-layering of the social network (see [16]
for more details). The authors in [12] performed stemming on blog posts to transform all the
terms into their morphological roots with the help of the stemmer portal. In a different study, [15] applied
regular expression techniques to normalize some words to their standard form. They also applied Porter
stemming to stem words into a unified form.
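Two of the strategies listed at the start of this section, normalization and discretization, can be illustrated with a short Python sketch. The sample ages, bin edges, and group labels are assumptions made for the example; discretization is also one way of converting a continuous attribute into a categorical one.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale numeric values linearly into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal, so there is no spread to rescale.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def discretize(values, edges, labels):
    """Assign each value the label of the first bin whose upper edge exceeds it.

    `labels` must have one more entry than `edges`; values at or above the
    last edge receive the final label.
    """
    result = []
    for v in values:
        for edge, label in zip(edges, labels):
            if v < edge:
                result.append(label)
                break
        else:
            result.append(labels[-1])
    return result


# Illustrative ages and age groups (assumed, not taken from the dataset).
ages = [2, 14, 36, 45, 65]
scaled = min_max_normalize(ages)
groups = discretize(ages, edges=[13, 20, 60],
                    labels=["child", "teen", "adult", "senior"])
```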
3.4. Data reduction
Data reduction aims at reducing the size of data while keeping information loss at a minimum [10].
Techniques of data reduction include:
a. Dimensionality reduction: reducing the number of attributes or random variables by using techniques
such as principal component analysis, wavelet transformation, attribute subset selection and attribute
creation. Text data extracted from online social networks can be extremely sparse and
high-dimensional [18]. High dimensionality typically means that there are 16 or more data attributes. In such
spaces, the nature of geometrical objects and the concept of proximity are not the same as in two
or three dimensions. Data cleaning, which aims at removing undesired content, can be considered one
method for reducing data size.
b. Numerosity reduction: achieving smaller data representations by using special techniques such as
parametric models (such as regression and log-linear) and nonparametric models (such as sampling,
histograms, clustering, and data cube aggregation).
c. Data compression: compressing data to obtain a reduced copy of the original data. Resultant data can be
either lossless or lossy, depending on whether there was a loss of information during data compression.
Both dimensionality reduction and numerosity reduction can be applied to social data to have a
smaller data volume while at the same time producing the same (or near the same) mining results.
For example, to provide smaller representations of the original data, [13] performed dimensionality reduction
on a Twitter corpus to remove useless keywords from old messages, and used numerosity reduction to
replace users' messages with the fewest possible keywords that represent the main types and significant
sections of the messages. In a different study, [19] applied attribute reduction to improve the performance of
spam detectors on Twitter data by computing correlation coefficients between attributes [20-22].
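One plausible way to use correlation coefficients for attribute reduction is to drop any attribute that is almost perfectly correlated with one already kept. The sketch below shows this idea in plain Python; it is an assumption about how such a filter could look, not the exact procedure of [19], and the attribute names, values, and the 0.95 threshold are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant attribute carries no correlation information
    return cov / (spread_x * spread_y)


def drop_correlated(columns, threshold=0.95):
    """Keep an attribute only if it is not strongly correlated with a kept one."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical account attributes; "followers_x2" duplicates "followers".
columns = {
    "followers": [10, 20, 30, 40, 50],
    "followers_x2": [20, 40, 60, 80, 100],
    "links_per_post": [3, 1, 4, 1, 5],
}
kept = drop_correlated(columns)
```

The redundant attribute is removed while the weakly correlated one survives, so later mining steps work on fewer dimensions with little loss of information.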
4. CASE STUDY: FLIGHT MH370
This section discusses how we applied preprocessing to Flight MH370 social data.
After preprocessing, we used the resultant data to examine the flight community structure, discover types
of social relationships, reveal the truth behind some of the unusual events, and study people's coping behavior
(adaptation patterns) during disaster time [23].
The Boeing 777-200 ER, which carried 239 individuals, left Kuala Lumpur International Airport
(KLIA) at 12:41 am. It was supposed to land at Beijing Capital International Airport six hours later.
However, the Boeing disappeared from radar only 40 minutes after take-off. The passengers were
from China, Malaysia, Indonesia, Australia, France, India, United States, New Zealand, Canada, Ukraine,
Russia, Taiwan, Italy, Netherlands, and Austria. A group of employees from the American-based Freescale
Semiconductor Corporation was on a visit to China to take a technical course for one month. A group of
Chinese artists was on the way back from a visit to a cultural exhibition in the Malaysian capital, themed
Chinese Dream: Red and Green Painting. Some passengers were traveling from China to Malaysia to attend a
Buddhist religious ceremony. The airplane also carried a group of workers who worked in Singapore, a group
of tourists who were on a trip to Nepal, a number of families, as well as passengers just making a stopover.
The crew of the airplane comprised 12 Malaysian citizens (2 pilots and 10 attendants).
4.1. MH370 social data collection
As we said earlier in “online social data collection,” collecting social data is not straightforward and
requires much effort, in particular because such data are unstructured and come from various sources, in
addition to other issues such as storage, processing, and the dynamic nature of the network.
The dataset that we collected came from tens of different online news websites (such as The New
Indian Express, YAHOO! News Malaysia, the Economic Times, CNN.com, the Daily Express, and India
Today). Data gathering was done over March and April 2014, covering the events from when the airplane
disappeared until late April, when most of the international efforts to locate the missing flight stopped
without achieving their goals.
Online news media addressed this event from several angles: how the airplane disappeared, the
background of the passengers, how relatives were reacting to the news, and the efforts made by the
international community to locate the missing airplane.
4.2. MH370 social data preparation
As we said earlier in this study, in order to efficiently mine and analyze collected data, data
preprocessing must be implemented as a series of sequential steps. First: data cleaning, which aims at
eliminating irrelevant and redundant content. Second: data integration, which aims at integrating data from
different sources into a single, uniform data store. Third: data transformation, which aims at
converting the resultant data into a format that is easy to analyze. Finally: data reduction, which aims at
having a smaller (both in volume and dimension), concise, and clean representation of the data that will
undergo analysis later.
4.2.1. MH370 social data cleaning
The first step of Flight MH370 data preprocessing was to check and evaluate all the attributes of the
original data. As we do in other data cleaning processes, we prioritize data quality over data scale.
It is preferable to reduce data volume to have more representative information, rather than poor and
inconsistent data.
In the case of Flight MH370, where data were collected from various online sources, passenger
demographic information was mostly inconsistent, ambiguous or missing. Accordingly, we preferred to
remove all demographic attributes except “Age.” As will be described in subsequent sections, we used ages
to differentiate users having the same name and to help explore groups and connections.
Using the Pajek software, it was possible to visually detect and correct some outliers and
inconsistencies related to passenger age. We probed the data sources to identify
inconsistencies and, by comparing sources and relying on the most common value among them,
we could smooth out such noisy data.
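The idea of relying on the most common value among several sources can be sketched in Python as a simple majority vote. The passenger keys and the reported ages are hypothetical, and ties are resolved here in favour of the value seen first, which is an assumption of this sketch.

```python
from collections import Counter


def most_common_value(reports):
    """Return the value reported most often, ignoring missing (None) reports."""
    values = [v for v in reports if v is not None]
    if not values:
        return None  # nothing was reported, so the value stays missing
    return Counter(values).most_common(1)[0][0]


# Hypothetical age reports gathered from four different news sources.
age_reports = {
    "passenger_a": [36, 36, 63, None],  # one source has a likely typo (63)
    "passenger_b": [None, None, None, None],
}
resolved = {name: most_common_value(r) for name, r in age_reports.items()}
```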
In this phase, we removed unwanted and irrelevant information from the data. For a study on social
networks, we were concerned only with the information that would help reveal relations and connections
between passengers. It was a time-consuming task since online social data tend to have noise, outliers,
missing values, and other inconsistencies. Data scientists can also use other tools for
this purpose, such as R and Python packages for data manipulation, such as Pandas and NumPy.
4.2.2. MH370 social data integration
We collected our data from different social media websites over two months. Because the data were
mostly unstructured text, we had to run some data manipulation scripts to process the text. The goal was to
extract a simpler and more unified dataset.
In this phase, we detected and resolved data value conflicts in passenger data, such as the use of
different representations and different scales (e.g., metric vs. imperial units). Afterwards, we
reformulated and unified the data such that we stored passenger information in more recognizable formats.
All the information about the flight passengers and their families that we were able to integrate is shown in
Figure 1.
Figure 1. Raw data of Flight MH370
In Figure 1 above, the data that describe names, nationalities, and ages were expanded from the MH370
Passenger Manifest, released by Malaysia Airlines on the same day as the disaster. Column “Connected-to”
represents the people that a particular passenger is connected to. This includes family members, co-workers,
and friends. The data in that column were collected based on statements from family members, news
reports, or officials. However, officials were accused of giving little information about the accident.
Column “Alternative Name” has the names that we used later for analysis instead of the list of
names provided by the MH370 Passenger Manifest which, in turn, are shown in column “Passenger Name.”
Each name in column “Alternative Name” begins with the passenger’s first name followed by the family
name (names with two parts were also considered). We also removed unnecessary word capitalizations and
suffixed the second name of each passenger with his/her age to avoid name confusions (for instance, the
name Zhang Yan (36) is also the name of another passenger, Zhang Yan (45)). For simplicity, we preferred to
use only the passenger’s first and second name followed by age.
Column “Symbol” has the labels that we gave to the passengers. For example, P1, P2, P3 refer,
respectively, to the first three passengers Ambre Wattrelos (14), An Wenlan (65) and Andrew Nari (49) in the
table. These symbols were used to label nodes when we built the graphical representations. Column
“Background” contains information about the passengers collected over two months. It describes the career,
marital status, family members, and friendships of each one.
4.2.3. MH370 social data transformation
One of the main tasks when we built our database was the transformation of
passenger names. We used a simple heuristic for this purpose. The task included reviewing the table row
by row, transforming each name into a more proper structure (i.e., alternative name), and giving it a proper
identifier (i.e., label).
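The row-by-row heuristic can be approximated by a few lines of Python. The sketch assumes that raw names arrive in upper case with the given name first, which may not match the actual manifest layout, and the helper names `alternative_name` and `assign_labels` are hypothetical. The three passengers and ages are the ones mentioned in the description of Figure 1.

```python
def alternative_name(raw_name, age):
    """Tidy capitalization and append the age to disambiguate namesakes."""
    parts = raw_name.split()  # also collapses repeated white space
    tidy = " ".join(part.capitalize() for part in parts)
    return f"{tidy} ({age})"


def assign_labels(names, prefix="P"):
    """Give each name a short identifier (P1, P2, ...) in table order."""
    return {name: f"{prefix}{i}" for i, name in enumerate(names, start=1)}


# Assumed raw layout: upper-case name plus age, one row per passenger.
rows = [("AMBRE WATTRELOS", 14), ("AN WENLAN", 65), ("ANDREW NARI", 49)]
names = [alternative_name(raw, age) for raw, age in rows]
labels = assign_labels(names)
```

Appending the age keeps namesakes apart: two passengers called Zhang Yan become "Zhang Yan (36)" and "Zhang Yan (45)" and so receive different labels.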
Another data transformation step was adding extra nodes that serve particular
purposes. For example, in addition to the original 239 nodes, we added a central node
called “Flight” and tied passenger nodes to it to show that all the passengers were present at the same event
and were part of the flight network structure. Another node, “Singapore,” was added to connect together the
Chinese workers who worked in Singapore. Three of those workers were confirmed to be friends and to know
each other, and therefore they were directly connected to each other and to the central node of the group.
The resulting network is a small undirected network that has no loops and in which every two passengers are
connected.
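A minimal Python sketch of this construction follows: passenger nodes are tied to hub nodes, self-loops are rejected, and a breadth-first search confirms that every node can reach every other. The four passenger labels and the choice of which ones belong to the "Singapore" group are assumptions; the sketch does not reproduce the real dataset.

```python
from collections import deque


def add_edge(graph, u, v):
    """Add an undirected edge between u and v, refusing self-loops."""
    if u == v:
        return
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)


def is_connected(graph):
    """Check with breadth-first search that all nodes are mutually reachable."""
    if not graph:
        return True
    start = next(iter(graph))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) == len(graph)


graph = {}
passengers = ["P1", "P2", "P3", "P4"]   # hypothetical labels
for p in passengers:
    add_edge(graph, "Flight", p)        # hub node shared by all passengers
for worker in ["P3", "P4"]:
    add_edge(graph, "Singapore", worker)  # hub node for one group
add_edge(graph, "P3", "P4")             # friends are also linked directly
```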
Since our data preprocessing mostly involved text manipulation, we did not need to perform
(except for feature construction) other types of data transformation. Nevertheless, and depending on the
mining purpose, other methods are also used for data transformation. These include data aggregation
(summarizing numeric attributes based on their concept hierarchy), data generalization (analogous to
aggregation but for nominal attributes), data normalization (scaling numeric attribute values to fall within
specified ranges), and others.
4.2.4. MH370 social data reduction
As we mentioned earlier in the data cleaning section, data preprocessing should prioritize data
quality over data scale. Accordingly, the data reduction that we performed on Flight MH370 data also gave up
some data volume in exchange for more representative and consistent information. It is important to
highlight that the operations performed during data cleaning have a different purpose from those performed
in data reduction. While the former aims at eliminating noisy and inconsistent data, the latter aims at reducing
data representation and making it simpler and more useful, particularly for visual analysis.
The table discussed earlier contains some raw data that we did not need during the analysis phase.
Therefore, we extracted only the data that show the relationship between the people who joined the flight.
We compiled the resultant data into a new Microsoft Excel file as shown in Figure 2.
Figure 2. Sample data from Flight MH370 dataset
The dataset is available from the first author’s GitHub repository
(https://github.com/mohammed-taie/Flight-MH370). The dataset has 241 vertices and 1563 edges, and contains three columns:
1. Connected-to: individuals that a particular passenger is connected to
2. Names: names that were used during data analysis
3. Labels: labels that we gave to the passengers
It was essential to transform the dataset into the “.net” format before embarking on data analysis and
data visualization. The resulting .net file has two parts:
1. Vertices to denote flight passengers. For data visualization, all the vertices were given the same shape and
the same size, even though they may have a different number of connections. A few additional vertices
were also added to achieve particular benefits.
2. Edges to connect the members of the same group and to connect the different groups. For a better data
visualization, edge labels and weights were eliminated. Labels and weights are sometimes used to show
the strength (among other things) of a connection between two nodes.
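To illustrate the two-part structure, the sketch below writes a tiny network in the plain Pajek `.net` layout: a `*Vertices` section with numbered, quoted labels followed by an `*Edges` section of vertex-number pairs. The helper `to_pajek` and the three-node network are hypothetical, and edge weights and labels are omitted, in line with the choice described above.

```python
def to_pajek(vertices, edges):
    """Serialize an undirected network as Pajek .net text.

    `vertices` is a list of labels; `edges` is a list of (label, label) pairs.
    Pajek numbers vertices from 1, so labels are mapped to their positions.
    """
    index = {name: i for i, name in enumerate(vertices, start=1)}
    lines = [f"*Vertices {len(vertices)}"]
    lines += [f'{i} "{name}"' for name, i in index.items()]
    lines.append("*Edges")
    lines += [f"{index[u]} {index[v]}" for u, v in edges]
    return "\n".join(lines) + "\n"


# Hypothetical three-node network used only to show the layout.
net_text = to_pajek(
    ["Flight", "P1", "P2"],
    [("Flight", "P1"), ("Flight", "P2"), ("P1", "P2")],
)
```

Writing `net_text` to a file with the `.net` extension should produce input that Pajek can load for visualization.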
Depending on how big the data are, other methods can also be applied for data reduction such as
clustering (grouping sample values in clusters based on a similarity measure) and sampling (taking a sample
set of actors from the complete set).
Connecting the nodes helped us spot three large community structures [20]: the “Artists Group,”
consisting of 29 members, the “Freescale Semiconductor” group, comprising 15 members, and the “Aircraft
Crew,” consisting of 12 members. Other smaller groups include six-person and five-person families from
China, a four-person family from Malaysia, a French family, an Australian family, and others. The analysis
also showed 8 nodes (two pilots, five toddlers, and one physically challenged elderly woman) that are not
directly connected to node “Flight.”
5. CONCLUSION
The goal of this study is to understand data preprocessing in the light of Flight MH370 social data.
Data preprocessing is a critical part of data science projects, mainly because real-world data often have noise,
outliers, missing values, and other inconsistencies. The steps should be carried out efficiently, supported by
human experience, and applied more than once. It is common for the preprocessing steps to be repeated
if the results of data mining are significantly different from what is expected.
Handling online social data (regarding data collection and data preprocessing) is a hard task and can
become even more complicated if the data that we are concerned with come from online news pages that are
commonly known to have problems in credibility and accuracy (except for some cases). Among the problems
that we encountered when we performed MH370 social data preprocessing were missing data,
a lack of uniformity, and the absence of some critical information. One of the reasons that the available
information was not enough is that officials abstained from revealing much information about the accident
for fear of triggering political issues between Malaysia and some other countries. Regarding accuracy, some
information that was included in the MH370 Passenger Manifest (released by Malaysia Airlines) was not
accurate.
REFERENCES
[1] S. Kadry and M. Z. Al-Taie, “Social network analysis: An introduction with an extensive implementation to a
large-scale online network using Pajek,” Bentham Science Publishers, 2014.
[2] M. S. Brown, “Data mining for dummies,” John Wiley & Sons, 2014.
[3] M. Z. Al-Taie, et al., “Successful Data Science Projects: Lessons Learned from Kaggle Competition,” Kurdistan
Journal of Applied Research, vol/issue: 2(3), 2017.
[4] S. Wasserman and K. Faust, “Social network analysis: Methods and applications,” Cambridge university press,
vol. 8, 1994.
[5] P. Gupta and V. Bhatnagar, “Data preprocessing for dynamic social network analysis,” Data Mining in Dynamic
Social Networks and Fuzzy Systems, pp. 25-39, 2013.
[6] M. Al-Taie and S. Kadry, “Python for Graph and Network Analysis,” Springer, pp. 1-184, 2017.
[7] M. Zuber, “A survey of data mining techniques for social network analysis,” International Journal of Research in
Computer Engineering & Electronics, vol/issue: 3(6), 2014.
[8] J. Bian, et al., “Finding the right facts in the crowd: factoid question answering over social media,” Proceedings of
the 17th international conference on World Wide Web, 2008.
[9] S. García, et al., “Data preprocessing in data mining,” Springer, 2015.
[10] J. Han, et al., “Data mining: concepts and techniques,” Elsevier, 2011.
[11] S. S. De and S. Dehuri, “Machine Learning for Auspicious Social Network Mining,” Social Networking, Springer,
pp. 45-83, 2014.
[12] N. Agarwal, et al., “Clustering of blog sites using collective wisdom,” Computational Social Network Analysis,
Springer, pp. 107-134, 2010.
[13] L. Wenxue and G. Sun, “A trust-based information propagation model in online social networks,” vol/issue: 8(8),
pp. 1767, 2013.
[14] I. Hemalatha, et al., “Preprocessing the informal text for efficient sentiment analysis,” International Journal of
Emerging Trends & Technology in Computer Science (IJETTCS), vol/issue: 1(2), pp. 58-61, 2012.
[15] T. H. Wen, et al., “Analysis of combining Topic model, Sentiment, Geolocation information approaches on Social
Network.”
[16] P. Kazienko, et al., “A generic model for multidimensional temporal social network,” Communications in
Computer and Information Science, CCIS, vol. 171, pp. 1-14, 2011.
[17] M. Al-Taie and S. Kadry, “Applying Social Network Analysis to Analyze a Web-Based Community,” arXiv
preprint arXiv:1212.6050, 2012.
[18] C. C. Aggarwal and C. K. Reddy, “Data clustering: algorithms and applications,” CRC press, 2013.
[19] M. Klassen, “Twitter data preprocessing for spam detection,” FUTURE COMPUTING 2013, The Fifth
International Conference on Future Computational Technologies and Applications, Citeseer, 2013.
[20] B. N. Octaviana and J. Abraham, “Tolerance for Emotional Internet Infidelity and Its Correlate with Relationship
Flourishing,” International Journal of Electrical and Computer Engineering (IJECE), vol/issue: 8(5), pp. 3158-
3168, 2018.
[21] C. Virmani, et al., “Clustering in Aggregated User Profiles across Multiple Social Networks,” International Journal
of Electrical and Computer Engineering (IJECE), vol/issue: 7(6), pp. 3692-3699, 2017.
[22] Y. L. S. Rani, et al., “Extensive Analysis on Generation and Consensus Mechanisms of Clustering Ensemble:
A Survey,” International Journal of Electrical and Computer Engineering (IJECE), vol/issue: 8(4), pp. 2351-2357.
[23] M. Z. Al-Taie, et al., “Flight MH370 community structure,” Int. J. Advance. Soft Comput. Appl, vol/issue: 6(2),
2014.

rushikeshnavghare94
 
Level 1-Safety.pptx Presentation of Electrical Safety
Level 1-Safety.pptx Presentation of Electrical SafetyLevel 1-Safety.pptx Presentation of Electrical Safety
Level 1-Safety.pptx Presentation of Electrical Safety
JoseAlbertoCariasDel
 
The Gaussian Process Modeling Module in UQLab
The Gaussian Process Modeling Module in UQLabThe Gaussian Process Modeling Module in UQLab
The Gaussian Process Modeling Module in UQLab
Journal of Soft Computing in Civil Engineering
 
Process Parameter Optimization for Minimizing Springback in Cold Drawing Proc...
Process Parameter Optimization for Minimizing Springback in Cold Drawing Proc...Process Parameter Optimization for Minimizing Springback in Cold Drawing Proc...
Process Parameter Optimization for Minimizing Springback in Cold Drawing Proc...
Journal of Soft Computing in Civil Engineering
 
Artificial Intelligence (AI) basics.pptx
Artificial Intelligence (AI) basics.pptxArtificial Intelligence (AI) basics.pptx
Artificial Intelligence (AI) basics.pptx
aditichinar
 
Data Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptxData Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptx
RushaliDeshmukh2
 
Compiler Design_Lexical Analysis phase.pptx
Compiler Design_Lexical Analysis phase.pptxCompiler Design_Lexical Analysis phase.pptx
Compiler Design_Lexical Analysis phase.pptx
RushaliDeshmukh2
 
Fort night presentation new0903 pdf.pdf.
Fort night presentation new0903 pdf.pdf.Fort night presentation new0903 pdf.pdf.
Fort night presentation new0903 pdf.pdf.
anuragmk56
 
π0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalizationπ0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalization
NABLAS株式会社
 
"Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G...
"Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G..."Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G...
"Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G...
Infopitaara
 
Degree_of_Automation.pdf for Instrumentation and industrial specialist
Degree_of_Automation.pdf for  Instrumentation  and industrial specialistDegree_of_Automation.pdf for  Instrumentation  and industrial specialist
Degree_of_Automation.pdf for Instrumentation and industrial specialist
shreyabhosale19
 
IntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdfIntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdf
Luiz Carneiro
 
some basics electrical and electronics knowledge
some basics electrical and electronics knowledgesome basics electrical and electronics knowledge
some basics electrical and electronics knowledge
nguyentrungdo88
 
Introduction to FLUID MECHANICS & KINEMATICS
Introduction to FLUID MECHANICS &  KINEMATICSIntroduction to FLUID MECHANICS &  KINEMATICS
Introduction to FLUID MECHANICS & KINEMATICS
narayanaswamygdas
 
15th International Conference on Computer Science, Engineering and Applicatio...
15th International Conference on Computer Science, Engineering and Applicatio...15th International Conference on Computer Science, Engineering and Applicatio...
15th International Conference on Computer Science, Engineering and Applicatio...
IJCSES Journal
 
QA/QC Manager (Quality management Expert)
QA/QC Manager (Quality management Expert)QA/QC Manager (Quality management Expert)
QA/QC Manager (Quality management Expert)
rccbatchplant
 
Lidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptx
Lidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptxLidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptx
Lidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptx
RishavKumar530754
 
Avnet Silica's PCIM 2025 Highlights Flyer
Avnet Silica's PCIM 2025 Highlights FlyerAvnet Silica's PCIM 2025 Highlights Flyer
Avnet Silica's PCIM 2025 Highlights Flyer
WillDavies22
 
Raish Khanji GTU 8th sem Internship Report.pdf
Raish Khanji GTU 8th sem Internship Report.pdfRaish Khanji GTU 8th sem Internship Report.pdf
Raish Khanji GTU 8th sem Internship Report.pdf
RaishKhanji
 
Smart Storage Solutions.pptx for production engineering
Smart Storage Solutions.pptx for production engineeringSmart Storage Solutions.pptx for production engineering
Smart Storage Solutions.pptx for production engineering
rushikeshnavghare94
 
Level 1-Safety.pptx Presentation of Electrical Safety
Level 1-Safety.pptx Presentation of Electrical SafetyLevel 1-Safety.pptx Presentation of Electrical Safety
Level 1-Safety.pptx Presentation of Electrical Safety
JoseAlbertoCariasDel
 
Artificial Intelligence (AI) basics.pptx
Artificial Intelligence (AI) basics.pptxArtificial Intelligence (AI) basics.pptx
Artificial Intelligence (AI) basics.pptx
aditichinar
 
Data Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptxData Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptx
RushaliDeshmukh2
 
Compiler Design_Lexical Analysis phase.pptx
Compiler Design_Lexical Analysis phase.pptxCompiler Design_Lexical Analysis phase.pptx
Compiler Design_Lexical Analysis phase.pptx
RushaliDeshmukh2
 
Fort night presentation new0903 pdf.pdf.
Fort night presentation new0903 pdf.pdf.Fort night presentation new0903 pdf.pdf.
Fort night presentation new0903 pdf.pdf.
anuragmk56
 
π0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalizationπ0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalization
NABLAS株式会社
 
"Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G...
"Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G..."Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G...
"Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G...
Infopitaara
 

Online Data Preprocessing: A Case Study Approach

1. INTRODUCTION

Online social networks, or social media websites, are the recent form of traditional social networks. They provide mechanisms that allow people to comprehend, interact, engage, and collaborate with each other. Platforms such as Facebook, YouTube, LinkedIn, and Twitter are now used heavily, both for fun and for business. They allow users to share text, audio, video, hyperlinks, and photographs with others, and large volumes of these social activities are therefore stored. Facebook, for instance, stores more than 30 billion new pieces of content every month, produced by more than one billion active users.

The analysis of data that come from different social media sources (for example, Facebook, Twitter, and YouTube) is critical for effective decision making [1]. It can be used to identify trends, develop business opportunities, predict customer behavior and market shifts, support crime investigation, and manage natural disaster risk.

Data preprocessing consumes most of the time and implementation effort in a project and can be more critical than the machine-learning algorithm itself. If any of the data preprocessing phases is not handled correctly, machine-learning algorithms will not run or can give misleading results [2]. Given the significance of data preprocessing to data science [3], this study discusses data preprocessing and how we applied it to Flight MH370 social data.

The rest of the study is organized as follows: Section 2 discusses what is meant by social networks and what techniques are used to collect online social data. Section 3 gives an overview of the four tasks involved in data preprocessing. Section 4 describes how we applied data preprocessing to a dataset that we collected describing the Malaysian Flight MH370 social structure. Section 5 concludes the study.
2. ONLINE SOCIAL DATA COLLECTION

Typically, social networks are theoretical models for analyzing and visualizing relationships between actors [4]. A tie between two actors represents some relationship such as kinship, affection, enmity, exchange of favours or loans, club membership, or attendance at the same event. Ties can also vary in intensity: some ties are stronger than others. It is also possible to have more than one tie between two individuals in one network. For example, a group of students in a college may be connected to each other through friendship, joint courses, club memberships, etc.

In the past, social data were hard to come by and challenging to collect, and many people in the field limited themselves to small amounts of data. Things have changed with the advent of online social websites such as Twitter, Facebook, and Flickr, which generate much more data than anyone would expect. Beyond the traditional methods of collecting social data, online data can be collected using Application Programming Interfaces (APIs), web crawlers, online surveys, and deployed applications. However, collecting online social data is not always straightforward, as it faces problems [5] such as unstructured and heterogeneous data, dynamic networks, and limits on processing power and data storage.

In some cases, taking measurements on all actors in the relevant actor set is not possible. A sample of actors is then taken from the complete set, and inference about the population is made later based on the sample [6]. Sampled data in this case, which can be viewed as representative of the broader population, are called a probability sample.

3. DATA PREPROCESSING

In contrast to organizations, where both the data and the hierarchy of knowledge are well-organized, online social data are rich with user-generated annotations and free-style writing. Given that people on online social platforms are entirely free to write what they want [7], data quality in such environments ranges from valuable content to commercials and rubbish [8].

Four steps are typically involved in data preprocessing: data cleaning, data integration, data transformation, and data reduction. We address each step in turn and show how it can be used for knowledge discovery. These processes are not isolated from each other; they are integral components of data preparation. If any of these steps is not performed as planned, data mining algorithms will not run or will probably give unexpected results. We may safely say that having good and robust data is more important than having an efficient algorithm that is applied to a large quantity of poor data.

3.1. Data cleaning

Data cleaning (also known as data cleansing or scrubbing) aims at removing noise, filling in missing values, and fixing inconsistencies in data. Dirty data in a database can be the result of wrong data entry, update, or transmission [9]. This phase also includes the identification and removal of outliers [10].

Performing data cleaning for social network data is not always straightforward, as it can face several issues [5]. For example, it requires an understanding of the data. Images, video, audio, failed HTTP records, HTML tags, and white spaces should be removed. Spam detection should also be pursued. Machine learning methods have been commonly used to perform data cleaning in social data [11]. For example, expectation maximization with Gaussian mixture models is used to manage missing data, while Bayesian models are used for cleaning data.
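
The removal of HTML tags and surplus white space mentioned above can be illustrated with a minimal Python sketch. It is not the procedure used in any of the cited studies: the function names and the regular expressions are assumptions chosen for illustration, and a real crawl would likely need a proper HTML parser and a spam filter on top of this.

```python
import html
import re

# Illustrative patterns only (assumptions, not rules taken from the paper).
TAG_RE = re.compile(r"<[^>]+>")        # HTML tags such as <p> or <a href="...">
URL_RE = re.compile(r"https?://\S+")   # bare http/https links
SPACE_RE = re.compile(r"\s+")          # runs of spaces, tabs, and newlines


def clean_post(raw: str) -> str:
    """Return a post with tags, URLs, and extra white space removed."""
    text = TAG_RE.sub(" ", raw)        # drop markup first
    text = html.unescape(text)         # turn entities such as &amp; into characters
    text = URL_RE.sub(" ", text)       # drop links left in the visible text
    return SPACE_RE.sub(" ", text).strip()


def drop_empty_and_duplicates(posts):
    """Clean every post, then drop empty results and exact repeats (order kept)."""
    seen = set()
    kept = []
    for raw in posts:
        cleaned = clean_post(raw)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            kept.append(cleaned)
    return kept
```

Cleaning before deduplication is a deliberate choice in this sketch: two posts that differ only in markup or spacing collapse to the same string and are counted once.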
In the blogosphere domain, [12] performed data cleaning by trimming white spaces, punctuation, and stop words from blog posts. [13] proposed the removal of inactive users from the social network. In the trust-based information propagation model they built, the authors isolated inactive users from other users because they never or rarely sent messages and had little influence on information propagation. To deal with noise in Twitter data, the authors in [14] removed questions, URLs, special characters, and retweets before applying sentiment analysis. Also targeting Twitter, [15] proposed a number of methods to clean data. They used regular expressions to remove unnecessary tags, an English lexicon to filter out words in languages other than English, and an external source to eliminate stop words. They aimed to discover people’s opinions by using topic models, sentiment analysis, and geolocation information. For email communications, [16] applied data cleaning to remove non-essential information from the corpus, such as message ID, message timestamp, and sender and receiver information, as well as all messages with invalid email addresses.

3.2. Data integration

Inconsistency and redundancy in social data are very likely because users have different perspectives and behave differently [13]. Data integration aims at combining data from several sources into a coherent data store [10]. This task is not simple [9], as it requires matching different schema types. Incomplete or inefficient data integration can cause inconsistency and redundancy, while a proper implementation can enhance accuracy and speed up later processes. Techniques of data integration include entity identification, tuple deduplication, redundancy and correlation analysis, and data value conflict detection and resolution.

3.3. Data transformation

Data transformation aims at converting data into a usable and understandable format. The goal is to make data mining operations more efficient [10]. Strategies for data transformation include data smoothing, feature construction, normalization, discretization, generalization, and concept hierarchy generation for nominal data. These subtasks require human supervision and are highly dependent on the data being preprocessed [17]. However, data transformation for social data faces a number of challenges [5], including:

a. Data having different formats. Social data extracted from online websites can be both categorical and continuous, so a proper conversion between the two types is required.
b. Application dependency: each data management tool has its own uses and can deal with only particular types of data. For example, some tools (e.g., Pajek and UCINET) are standalone applications, while NodeXL is a utility toolkit. Some packages come with programming utilities (e.g., JUNG and SNAP-GAUSS), whereas others provide a graphical user interface that requires no scripting. Finally, some tools are free software (e.g., STRUCTURE and StOCNET), while others have commercial licenses, such as InFlow and SocioMetric LinkAlyzer.
c. Data privacy: users’ personal information can be exposed during data transformation. One remedy is data anonymization.

Several studies have addressed data transformation for social data.
For example, regarding the transformation of an email corpus, [16] proposed transforming the data into a hierarchical structure by grouping the object-based relationships of people into a hierarchy. Objects in this unified structure represent emails or threads, while each one can take different roles, such as mail sender, mail receiver, thread creator, or thread participant. The next step of social data preprocessing would include flattening the hierarchy of objects into one selected level, followed by multi-layering of the social network (see [16] for more details). The authors in [12] performed stemming on blog posts to transform all terms into their morphological roots with the help of the stemmer portal. In a different study, [15] applied regular expressions to normalize some words to their standard form. They also applied Porter stemming to reduce words to a unified form.

3.4. Data reduction

Data reduction aims at reducing the size of data while keeping information loss at a minimum [10]. Techniques of data reduction include:

a. Dimensionality reduction: reducing the number of attributes or random variables by using techniques such as principal component analysis, wavelet transformation, attribute subset selection, and attribute creation. Text data extracted from online social networks can be extremely sparse and high-dimensional [18]. High dimensionality typically means that there are 16 or more data attributes. In such data, the nature of geometrical objects and the concept of proximity are not the same as in two or three dimensions. Data cleaning, which aims at removing undesired content, can be considered one method for reducing data size.
b. Numerosity reduction: achieving smaller data representations by using parametric models (such as regression and log-linear models) and nonparametric models (such as sampling, histograms, clustering, and data cube aggregation).
c. Data compression: compressing data to obtain a reduced copy of the original data. The compression is either lossless or lossy, depending on whether information is lost in the process.

Both dimensionality reduction and numerosity reduction can be applied to social data to obtain a smaller data volume while producing the same (or nearly the same) mining results. For example, to provide smaller representations of the original data, [13] performed dimensionality reduction on a Twitter corpus to remove useless keywords from old messages, and used numerosity techniques to replace users’ messages with the fewest possible keywords that represent the main types and significant sections of the messages. In a different study, [19] applied attribute reduction to improve the performance of spam detectors on Twitter data by computing correlation coefficients between attributes [20-22].

4. CASE STUDY: FLIGHT MH370

This section discusses how we applied preprocessing to Flight MH370 social data. After preprocessing, we used the resultant data to examine the flight community structure, discover types of social relationships, reveal the truth behind some of the unusual events, and study people’s coping behavior (adaptation patterns) during a disaster [23].
The Boeing 777-200ER, which carried 239 individuals, left Kuala Lumpur International Airport (KLIA) at 12:41 am. It was supposed to land at Beijing Capital International Airport six hours later. However, the aircraft disappeared from radar only 40 minutes after take-off. The passengers were from China, Malaysia, Indonesia, Australia, France, India, United States, New Zealand, Canada, Ukraine, Russia, Taiwan, Italy, Netherlands, and Austria. A group of employees from the American-based Freescale Semiconductor Corporation was on a visit to China to take a one-month technical course. A group of Chinese artists was on the way back from a visit to a cultural exhibition in the Malaysian capital, themed Chinese Dream: Red and Green Painting. Some passengers were travelling from China to Malaysia to attend a Buddhist religious ceremony. The airplane also carried a group of workers who worked in Singapore, a group of tourists who were on a trip to Nepal, a number of families, and passengers just making a stopover. The crew consisted of 12 Malaysian citizens (2 pilots and 10 attendants).

4.1. MH370 social data collection

As noted in Section 2, collecting social data is not straightforward and requires much effort, particularly because such data are unstructured and come from various sources, in addition to issues such as storage, processing, and the dynamic nature of the network. The dataset that we collected came from tens of different online news websites (such as The New Indian Express, Yahoo! News Malaysia, The Economic Times, CNN.com, the Daily Express, and India Today).
Data gathering was done over March and April 2014, covering the events from when the airplane disappeared until late April, when most of the international efforts to locate the missing flight stopped without achieving their goals. Online news media addressed this event from several angles: how the airplane disappeared, the background of the passengers, how relatives were interacting with the news, and the efforts made by the international community to locate the missing airplane.

4.2. MH370 social data preparation

As noted earlier in this study, in order to mine and analyze collected data efficiently, data preprocessing must be implemented as a series of sequential steps. First, data cleaning, which aims at eliminating irrelevant and redundant content. Second, data integration, which aims at integrating data from different sources into a single, uniform data store. Third, data transformation, which aims at converting the resultant data into a format that is easy to analyze. Finally, data reduction, which aims at a smaller (in both volume and dimension), concise, and clean representation of the data that will undergo analysis later.

4.2.1. MH370 social data cleaning

The first step of Flight MH370 data preprocessing was to check and evaluate all the attributes of the original data. As in other data cleaning processes, we prioritized data quality over data scale. It is preferable to reduce data volume and keep more representative information than to keep poor and inconsistent data. In the case of Flight MH370, where data were collected from various online sources, passenger demographic information was mostly inconsistent, ambiguous, or missing. Accordingly, we preferred to remove all demographic attributes except “Age.” As described in subsequent sections, we used ages to differentiate passengers having the same name and to help explore groups and connections.
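
To see why “Age” was worth keeping, the following Python sketch checks which names occur more than once and then builds a key from name and age. It is only an illustration under stated assumptions: the function names are hypothetical, the three records are taken from the examples quoted in this study, and the sketch assumes that no two passengers share both name and age.

```python
from collections import Counter

# Records as (name, age) pairs; the sample uses names quoted in this study.
passengers = [
    ("Zhang Yan", 36),
    ("Zhang Yan", 45),
    ("An Wenlan", 65),
]


def shared_names(records):
    """Return the names that appear more than once, in sorted order."""
    counts = Counter(name for name, _age in records)
    return sorted(name for name, n in counts.items() if n > 1)


def name_age_keys(records):
    """Build 'Name (age)' keys, which stay distinct when only the name repeats."""
    return [f"{name} ({age})" for name, age in records]


ambiguous = shared_names(passengers)   # names that cannot identify a person alone
keys = name_age_keys(passengers)       # name plus age as the identifier
```

If two records ever agreed on both name and age, a further attribute or an explicit label would be needed; the sketch does not handle that case.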
Using the Pajek software, we could visually detect and correct some outliers and inconsistencies related to passenger age. We probed some data sources to identify inconsistencies and then smoothed out such noisy data by consulting other sources and relying on the value most common among them. In this phase, we removed unwanted and irrelevant information from the data. For a study on social networks, we were concerned only with the information that would help reveal relations and connections between passengers. It was a time-consuming task, since online social data tend to have noise, outliers, missing values, and other inconsistencies. Data scientists can also use other tools for this purpose, such as R and the Python data manipulation packages Pandas and NumPy.

4.2.2. MH370 social data integration

We collected our data from different social media websites over two months. Because the data were mostly unstructured text, we had to run some data manipulation scripts to process the text. The goal was to extract simpler and more unified data. In this phase, we detected and resolved data value conflicts in passenger data, such as the use of different representations and different scales (e.g., metric vs. British units). Afterwards, we reformulated and unified the data so that passenger information was stored in more recognizable formats. All the information about the flight passengers and their families that we were able to integrate is shown in Figure 1.

Figure 1. Raw data of Flight MH370

In Figure 1, the data that describe names, nationalities, and ages were expanded from the MH370 Passenger Manifest, released by Malaysia Airlines on the day of the disaster. Column “Connected-to” lists the people that a particular passenger is connected to. This includes family members, co-workers, and friends. The data in that column were collected based on statements from family members, journal reports, or officials. However, officials were accused of giving little information about the accident. Column “Alternative Name” has the names that we used later for analysis instead of the names provided by the MH370 Passenger Manifest, which are shown in column “Passenger Name.” Each name in column “Alternative Name” begins with the passenger’s first name followed by the family name (names with two parts were also considered). We also removed unnecessary word capitalizations and suffixed the second name of each passenger with his/her age to avoid name confusion (for instance, the name Zhang Yan (36) is also the name of another passenger, Zhang Yan (45)). For simplicity, we preferred to use only the passenger’s first and second name followed by age. Column “Symbol” has the labels that we gave to the passengers. For example, P1, P2, and P3 refer, respectively, to the first three passengers in the table: Ambre Wattrelos (14), An Wenlan (65), and Andrew Nari (49). These symbols were used to label nodes when we built the graphical representations. Column “Background” contains information gathered over two months about the passengers.
It tells about the career, marital status, family members, and friendships of each one.

4.2.3. MH370 social data transformation

One task that was present throughout the building of our database was the transformation of passenger names. We used a simple heuristic for this purpose. The task included reviewing the table row by row, transforming each name into a more proper structure (i.e., alternative name), and giving it a proper identifier (i.e., label).

Another transformation was adding extra nodes that serve particular purposes. For example, in addition to the original 239 nodes, passenger nodes were tied to a central node, called “Flight,” to show that all the passengers were present at the same event and were part of the flight network structure. Another node, “Singapore,” was added to connect the Chinese workers who worked in Singapore. Three of those workers were confirmed to be friends and to know each other, and therefore they were directly connected to each other and to the central node of the group. The resulting network is a small undirected network that has no loops and in which every two passengers are connected.

Since our data preprocessing mostly involved text manipulation, we did not need to perform other types of data transformation (except for feature construction). Nevertheless, depending on the mining purpose, other methods are also used for data transformation. These include data aggregation (summarizing numeric attributes based on their concept hierarchy), data generalization (analogous to aggregation but for nominal attributes), data normalization (scaling numeric attribute values to fall within specified ranges), and others.

4.2.4. MH370 social data reduction

As mentioned in the data cleaning section, data preprocessing should prioritize data quality over data scale. Accordingly, the data reduction that we performed on Flight MH370 data also gave up some data volume in exchange for more representative and consistent information. It is important to highlight that the operations performed during data cleaning have a different purpose from those performed in data reduction. While the former aims at eliminating noisy and inconsistent data, the latter aims at reducing the data representation and making it simpler and more useful, particularly for visual analysis.

The table discussed earlier contains some raw data that we did not need during the analysis phase. Therefore, we extracted only the data that show the relationships between the people on the flight. We compiled the resultant data into a new Microsoft Excel file, as shown in Figure 2.

Figure 2. Sample data from Flight MH370 dataset

The dataset is available from the first author’s GitHub repository (https://github.com/mohammed-taie/Flight-MH370). The dataset has 241 vertices and 1563 edges, and contains three columns:

1. Connected-to: individuals that a particular passenger is connected to
2. Names: names that were used during data analysis
3. Labels: labels that we gave to the passengers

It was essential to transform the dataset into the “.net” format before embarking on data analysis and data visualization.
The resulting .net file has two parts:
1. Vertices, which denote the flight passengers. For data visualization, all the vertices were given the same shape and the same size, even though they may have different numbers of connections. A few additional vertices were also added to serve particular purposes.
2. Edges, which connect the members of the same group and connect the different groups. For better data visualization, edge labels and weights were eliminated. Labels and weights are sometimes used to show the strength (among other things) of a connection between two nodes.

Depending on how big the data are, other methods can also be applied for data reduction, such as clustering (grouping sample values into clusters based on a similarity measure) and sampling (taking a sample set of actors from the complete set).

Connecting the nodes helped us spot three large community structures [20]: the “Artists Group,” consisting of 29 members; the “Freescale Semiconductor” group, comprising 15 members; and the “Aircraft Crew,” consisting of 12 members. Other smaller groups include six-person and five-person families from China, a four-person family from Malaysia, a French family, an Australian family, and others. The analysis also showed 8 nodes (two pilots, five toddlers, and one physically challenged elderly woman) that are not directly connected to the node “Flight.”
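The two-part layout described above matches the Pajek .net file format, in which a `*Vertices` section lists numbered vertex labels and an `*Edges` section lists undirected ties as pairs of vertex numbers. The sketch below is a minimal illustration under assumed data, not the authors' conversion procedure: the function names `to_pajek` and `not_tied_to`, the placeholder labels, and the sample edge list are hypothetical. Weights and edge labels are left out, mirroring the reduction described above, and the second helper lists the vertices that have no direct tie to a given hub node.

```python
def to_pajek(vertices, edges):
    """Serialize an undirected, unweighted network as Pajek .net text."""
    # Pajek numbers vertices from 1; map each label to its number.
    index = {label: i for i, label in enumerate(vertices, start=1)}
    lines = [f"*Vertices {len(vertices)}"]
    lines += [f'{i} "{label}"' for label, i in index.items()]
    lines.append("*Edges")
    # No weight or label column: each edge is just a pair of vertex numbers.
    lines += [f"{index[a]} {index[b]}" for a, b in edges]
    return "\n".join(lines) + "\n"


def not_tied_to(hub, vertices, edges):
    """Return vertices (other than the hub) with no direct edge to the hub."""
    neighbours = {a if b == hub else b for a, b in edges if hub in (a, b)}
    return [v for v in vertices if v != hub and v not in neighbours]


# Hypothetical sample network; labels are placeholders, not real passengers.
vertices = ["Flight", "P1", "P2", "P3"]
edges = [("Flight", "P1"), ("Flight", "P2"), ("P1", "P2"), ("P1", "P3")]

net_text = to_pajek(vertices, edges)
print(net_text)
print(not_tied_to("Flight", vertices, edges))
```

In this toy network only the placeholder vertex that reaches the hub through another passenger is reported by `not_tied_to`, which is the same kind of check that would surface nodes lacking a direct tie to “Flight.”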
5. CONCLUSION
The goal of this study was to understand data preprocessing in the light of Flight MH370 social data. Data preprocessing is a critical part of data science projects, mainly because real-world data often have noise, outliers, missing values, and other inconsistencies. The steps should be introduced efficiently, supported by human experience, and applied more than once. It is common for the preprocessing steps to be repeated if the results of data mining are significantly different from what is expected.

Handling online social data (regarding data collection and data preprocessing) is a hard task and can become even more complicated if the data come from online news pages, which are commonly known to have problems with credibility and accuracy (except for some cases). Among the problems that we encountered when we preprocessed the MH370 social data were missing data, a lack of uniformity, and the absence of some critical information. One of the reasons that the available information was not enough is that officials abstained from revealing much information about the accident for fear of triggering political issues between Malaysia and some other countries. Regarding accuracy, some information that was included in the MH370 Passenger Manifest (released by Malaysia Airlines) was not accurate.

REFERENCES
[1] S. Kadry and M. Z. Al-Taie, “Social network analysis: An introduction with an extensive implementation to a large-scale online network using Pajek,” Bentham Science Publishers, 2014.
[2] M. S. Brown, “Data mining for dummies,” John Wiley & Sons, 2014.
[3] M. Z. Al-Taie, et al., “Successful Data Science Projects: Lessons Learned from Kaggle Competition,” Kurdistan Journal of Applied Research, vol/issue: 2(3), 2017.
[4] S. Wasserman and K. Faust, “Social network analysis: Methods and applications,” Cambridge University Press, vol. 8, 1994.
[5] P. Gupta and V. Bhatnagar, “Data preprocessing for dynamic social network analysis,” Data Mining in Dynamic Social Networks and Fuzzy Systems, pp. 25-39, 2013.
[6] M. Al-Taie and S. Kadry, “Python for Graph and Network Analysis,” Springer, pp. 1-184, 2017.
[7] M. Zuber, “A survey of data mining techniques for social network analysis,” International Journal of Research in Computer Engineering & Electronics, vol/issue: 3(6), 2014.
[8] J. Bian, et al., “Finding the right facts in the crowd: factoid question answering over social media,” Proceedings of the 17th international conference on World Wide Web, 2008.
[9] S. García, et al., “Data preprocessing in data mining,” Springer, 2015.
[10] J. Han, et al., “Data mining: concepts and techniques,” Elsevier, 2011.
[11] S. S. De and S. Dehuri, “Machine Learning for Auspicious Social Network Mining,” Social Networking, Springer, pp. 45-83, 2014.
[12] N. Agarwal, et al., “Clustering of blog sites using collective wisdom,” Computational Social Network Analysis, Springer, pp. 107-134, 2010.
[13] L. Wenxue and G. Sun, “A trust-based information propagation model in online social networks,” vol/issue: 8(8), pp. 1767, 2013.
[14] I. Hemalatha, et al., “Preprocessing the informal text for efficient sentiment analysis,” International Journal of Emerging Trends & Technology in Computer Science (IJETTCS), vol/issue: 1(2), pp. 58-61, 2012.
[15] T. H. Wen, et al., “Analysis of combining Topic model, Sentiment, Geolocation information approaches on Social Network.”
[16] P. Kazienko, et al., “A generic model for multidimensional temporal social network,” Communications in Computer and Information Science, CCIS, vol. 171, pp. 1-14, 2011.
[17] M. Al-Taie and S. Kadry, “Applying Social Network Analysis to Analyze a Web-Based Community,” arXiv preprint arXiv:1212.6050, 2012.
[18] C. C. Aggarwal and C. K. Reddy, “Data clustering: algorithms and applications,” CRC Press, 2013.
[19] M. Klassen, “Twitter data preprocessing for spam detection,” FUTURE COMPUTING 2013, The Fifth International Conference on Future Computational Technologies and Applications, Citeseer, 2013.
[20] B. N. Octaviana and J. Abraham, “Tolerance for Emotional Internet Infidelity and Its Correlate with Relationship Flourishing,” International Journal of Electrical and Computer Engineering (IJECE), vol/issue: 8(5), pp. 3158-3168, 2018.
[21] C. Virmani, et al., “Clustering in Aggregated User Profiles across Multiple Social Networks,” International Journal of Electrical and Computer Engineering (IJECE), vol/issue: 7(6), pp. 3692-3699, 2017.
[22] Y. L. S. Rani, et al., “Extensive Analysis on Generation and Consensus Mechanisms of Clustering Ensemble: A Survey,” International Journal of Electrical and Computer Engineering (IJECE), vol/issue: 8(4), pp. 2351-2357.
[23] M. Z. Al-Taie, et al., “Flight MH370 community structure,” Int. J. Advance. Soft Comput. Appl, vol/issue: 6(2), 2014.