Heterogeneous Data Handling in Big Data Analysis
2. Heterogeneous Data
• Heterogeneous data are data with high variability of data types and formats. They are possibly ambiguous and of low quality due to missing values, high data redundancy, and untruthfulness.
3. Why Data from Sources Are Heterogeneous
• First, because of the variety of data acquisition devices, the acquired data also differ in type, which introduces heterogeneity.
• Second, the data are large-scale: acquisition equipment is massive and distributed, and not only the currently acquired data but also the historical data within a certain time frame must be stored.
• Third, there is a strong correlation between time and space.
• Fourth, effective data account for only a small portion of the big data; a great quantity of noise may be collected during acquisition.
4. Types of Data Heterogeneity
• Syntactic heterogeneity occurs when two data sources are not expressed in the same language.
• Conceptual heterogeneity, also known as semantic heterogeneity or logical mismatch, denotes differences in modelling the same domain of interest.
• Terminological heterogeneity stands for variations in names when referring to the same entities from different data sources.
• Semiotic heterogeneity, also known as pragmatic heterogeneity, stands for different interpretations of entities by people.
5. Data representation can be described at four levels
• Level 1 is diverse raw data with different types and from different
sources.
• Level 2 is called ‘unified representation’. Heterogeneous data need to be unified; this layer converts individual attributes into information in terms of ‘what-when-where’ (see the sketch after this list).
• Level 3 is aggregation. Aggregation aids easy visualization and provides intuitive querying.
• Level 4 is called ‘situation detection and representation’. The final step
in situation detection is a classification operation that uses domain
knowledge to assign an appropriate class to each cell.
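As a rough illustration of the Level 2 ‘unified representation’ step referenced above, the sketch below maps two hypothetical raw records, a sensor CSV row and an event JSON object with made-up field names, onto a common what-when-where schema; it is a toy under those assumptions, not a prescribed format.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Observation:
    """Unified 'what-when-where' record (a Level 2 style representation)."""
    what: str        # the measured attribute or event
    when: datetime   # timestamp, normalised to a single type
    where: str       # location identifier

def unify(record, source):
    """Map a source-specific raw record onto the unified representation.
    The field names ('temp', 'ts', 'site', ...) are hypothetical examples."""
    if source == "sensor_csv":
        return Observation(what=f"temperature={record['temp']}",
                           when=datetime.strptime(record["ts"], "%Y-%m-%d %H:%M:%S"),
                           where=record["site"])
    if source == "event_json":
        return Observation(what=record["event"],
                           when=datetime.fromisoformat(record["timestamp"]),
                           where=record["location"])
    raise ValueError(f"unknown source: {source}")

# Two heterogeneous raw records mapped onto one schema.
print(unify({"temp": "21.5", "ts": "2024-05-01 10:00:00", "site": "lab-1"}, "sensor_csv"))
print(unify({"event": "door_open", "timestamp": "2024-05-01T10:05:00", "location": "lab-1"}, "event_json"))
```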
6. Data Processing Methods for Heterogeneous Data
• Data Cleaning
• Data Integration
• Data Reduction and Normalisation
7. Data Cleaning
Data cleaning is a process to identify incomplete, inaccurate, or unreasonable data, and then to modify or delete such data to improve data quality.
For example, the multisource and multimodal nature of healthcare data
results in high complexity and noise problems.
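A minimal pandas sketch of such cleaning on a hypothetical vitals table; the column names and the 20–250 bpm plausibility range are illustrative assumptions, not taken from the slides.

```python
import pandas as pd
import numpy as np

# Hypothetical patient-vitals table with typical quality problems:
# a duplicate record, a missing value, and an implausible reading.
df = pd.DataFrame({
    "patient_id": [1, 2, 2, 3, 4],
    "heart_rate": [72, 88, 88, np.nan, 400],   # 400 bpm is unreasonable
})

df = df.drop_duplicates()                                   # remove duplicate records
mask = df["heart_rate"].between(20, 250) | df["heart_rate"].isna()
df = df[mask]                                               # drop unreasonable values
df["heart_rate"] = df["heart_rate"].fillna(df["heart_rate"].median())  # impute missing values
print(df)
```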
8. Data Cleaning
• A database may also contain irrelevant attributes.
• Therefore, relevance analysis in the form of correlation analysis and attribute subset selection can be used to detect attributes that do not contribute to the classification or prediction task (see the sketch after this list).
• PCA can also be used.
• Data cleaning can be performed to detect and remove redundancies that may have resulted from data integration.
• The removal of redundant data is often regarded as a kind of data cleaning as well as data reduction.
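A sketch of relevance analysis on synthetic data, assuming a made-up class label and using scikit-learn's SelectKBest for attribute subset selection; the variable names and k=2 are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age":   rng.normal(50, 10, n),
    "bmi":   rng.normal(25, 4, n),
    "noise": rng.normal(0, 1, n),          # attribute that should not contribute
})
y = (X["age"] + 2 * X["bmi"] + rng.normal(0, 5, n) > 105).astype(int)  # hypothetical class label

# Correlation analysis: how strongly each attribute relates to the label.
print(X.corrwith(y).round(2))

# Attribute subset selection: keep the two attributes most related to the label.
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print("selected attributes:", list(X.columns[selector.get_support()]))
```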
9. Data Integration
• In the case of data integration or aggregation, datasets are matched and merged on the basis of shared variables and attributes (see the sketch after this list).
• Advanced data processing and analysis techniques allow structured and unstructured data to be mixed to elicit new insights; however, this requires “clean” data.
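A minimal pandas sketch of matching and merging two hypothetical sources on a shared attribute (the table and column names are assumptions):

```python
import pandas as pd

# Two hypothetical sources describing the same patients, sharing patient_id.
clinical = pd.DataFrame({"patient_id": [1, 2, 3], "diagnosis": ["A", "B", "A"]})
billing  = pd.DataFrame({"patient_id": [2, 3, 4], "cost": [120.0, 80.0, 200.0]})

# Match and merge on the shared attribute; an outer join keeps unmatched
# records visible so the resulting gaps can be handled during cleaning.
merged = clinical.merge(billing, on="patient_id", how="outer")
print(merged)
```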
10. Data Integration and Its Challenges
• Data integration tools are evolving towards the unification of structured and unstructured data.
• It is often required to structure unstructured data and merge heterogeneous information sources and types into a unified data layer.
• Challenge: one reason is that unique identifiers linking the records of two different datasets often do not exist, so determining which data should be merged may not be clear at the outset (see the sketch below).
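When no shared identifier exists, records can only be linked approximately. The sketch below uses Python's standard-library difflib for simple string similarity; the organisation names and the 0.6 similarity threshold are illustrative assumptions, and real record linkage typically needs more robust matching.

```python
import difflib

source_a = ["Acme Corporation", "Globex Ltd", "Initech"]
source_b = ["ACME Corp.", "Globex Limited", "Umbrella Inc."]

def best_match(name, candidates, threshold=0.6):
    """Return the most similar candidate, or None if nothing is close enough."""
    scored = [(difflib.SequenceMatcher(None, name.lower(), c.lower()).ratio(), c)
              for c in candidates]
    score, match = max(scored)
    return match if score >= threshold else None

for name in source_a:
    print(name, "->", best_match(name, source_b))   # Initech finds no counterpart
```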
11. Approaches to Integrating Unstructured and Structured Data
• Natural language processing pipelines: NLP can be applied directly to projects that demand dealing with unstructured data.
• Entity recognition and linking: extracting structured information from unstructured data is a fundamental step and can be addressed with information extraction techniques (see the sketch after this list).
• Use of open data to integrate structured and unstructured data: entities in open datasets can be used to identify named entities (people, organizations, places), which can then be used to categorize and organize text content.
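As one possible entity-recognition sketch (the slides do not prescribe a tool), the example below uses spaCy's small English model to extract (entity, type) pairs from free text; the sentence and the expected labels are illustrative.

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = "Barack Obama visited Paris in 2015 to meet officials from UNESCO."
doc = nlp(text)

# Structured (entity, type) pairs extracted from unstructured text; these can
# then be linked to identifiers in open datasets to organise the documents.
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('Barack Obama', 'PERSON'), ('Paris', 'GPE'), ('2015', 'DATE'), ('UNESCO', 'ORG')]
```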
12. Dimension Reduction and Data Normalization
There are several reasons to reduce the dimensionality of the data:
• First, high dimensional data impose computational challenges.
• Second, high dimensionality might lead to poor generalization abilities
of the learning algorithm.
• Finally, dimensionality reduction can be used for finding meaningful structure in the data.
13. Finding and Removing Redundancy
• Check the correlation matrix obtained by correlation analysis (a sketch follows after this list).
• Factor analysis is a method for dimensionality reduction.
• Factor Analysis can be used to reduce the number of variables and
detect the structure in the relationships among variables. Therefore,
Factor Analysis is often used as a structure detection or data
reduction method.
• PCA is useful when there is data on a large number of variables and
possibly there is some redundancy in those variables.
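A small sketch of both ideas on synthetic, partly redundant variables: inspect the correlation matrix, then fit scikit-learn's FactorAnalysis to compress five observed variables onto two latent factors; the loading matrix and noise level are invented for illustration.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
n = 300

# Five observed variables driven by two latent factors, i.e. partly redundant.
factors = rng.normal(size=(n, 2))
loadings = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.0],
                     [0.0, 1.0], [0.1, 0.9]])
X = factors @ loadings.T + 0.2 * rng.normal(size=(n, 5))

# Correlation matrix: large off-diagonal values reveal redundant variables.
print(np.round(np.corrcoef(X, rowvar=False), 2))

# Factor analysis as structure detection / data reduction: 5 variables -> 2 factors.
fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(X)
print(scores.shape)                  # (300, 2)
print(np.round(fa.components_, 2))   # estimated loadings
```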
14. Several ways in which PCA can help
• Pre-processing: With PCA one can also whiten the representation,
which rebalances the weights of the data to give better performance
in some cases.
• Modeling: PCA learns a representation that is sometimes used as an
entire model, e.g., a prior distribution for new data.
• Compression: PCA can be used to compress data, by replacing data
with its low-dimensional representation.
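A brief scikit-learn sketch of the pre-processing (whitening) and compression uses on synthetic correlated data; keeping a single component is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])  # correlated features

# Pre-processing: whitening rebalances the variance of the components.
white = PCA(n_components=2, whiten=True).fit_transform(X)
print(np.round(np.cov(white, rowvar=False), 2))     # approximately the identity matrix

# Compression: keep only the leading component, then reconstruct.
pca = PCA(n_components=1).fit(X)
compressed = pca.transform(X)                       # low-dimensional representation
reconstructed = pca.inverse_transform(compressed)   # approximate original data
print("reconstruction error:", round(float(np.mean((X - reconstructed) ** 2)), 3))
```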
16. Paradoxes of Big Data
• The identity paradox: big data seeks to identify, but it also threatens identity.
• The transparency paradox: small data inputs are aggregated, invisibly, into large datasets. Big data promises to use these data to make the world more transparent, yet its own collection is invisible.
• The power paradox: big data sensors and big data pools are predominantly in the hands of powerful intermediary institutions, not ordinary people.
17. Elements of a Big Data Analytics Solution
• Data loading: software has to be developed to load data from multiple and varied data sources. The system needs to deal with corrupted records and provide monitoring services (a minimal loading sketch follows after this list).
• Data parsing: most data sources provide data in a certain format that needs to be parsed into the Hadoop system.
• Data analytics: the solution needs to support rapid iterations so that data can be properly analyzed.
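As referenced in the data-loading bullet above, here is a minimal loading sketch in plain Python: it parses newline-delimited JSON, skips corrupted records, and logs counts for monitoring. The record format is a made-up example and is not tied to any particular Hadoop tooling.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("loader")

def load_records(lines):
    """Parse newline-delimited JSON, skipping corrupted records and
    reporting counts so that the load can be monitored."""
    good, bad = [], 0
    for lineno, line in enumerate(lines, start=1):
        try:
            good.append(json.loads(line))
        except json.JSONDecodeError:
            bad += 1
            log.warning("skipping corrupted record on line %d", lineno)
    log.info("loaded %d records, skipped %d", len(good), bad)
    return good

# Example input with one corrupted record.
raw = ['{"id": 1, "v": 10}', '{"id": 2, "v": ???}', '{"id": 3, "v": 7}']
records = load_records(raw)
```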
18. Big Data Analytics
• descriptive analytics — involving the description and summarization
of knowledge patterns;
• predictive analytics — forecasting and statistical modelling to
determine future possibilities; and
• prescriptive analytics — helping analysts in decision-making by
determining actions and assessing their impacts.
19. Big Data Tools
• There are some Big Data tools, such as:
• Hive
• Splunk
• Tableau
• Talend
• RapidMiner
• MarkLogic
20. Big Data Compute Platform Strategies
• Internal compute cluster. For long-term storage of unique or sensitive
data, it often makes sense to create and maintain an Apache Hadoop
cluster within the internal network of an organization.
• External compute cluster. There is a trend across the IT industry to
outsource elements of infrastructure to ‘utility computing’ service
providers.
• Hybrid compute cluster. A common hybrid option is to provision external compute resources on demand for Big Data analysis tasks and to maintain a modest internal compute cluster for long-term data storage.
21. Outlier Detection
• The statistical approach
• The density-based local outlier approach (Local Outlier Factor)
• The distance-based approach (clustering)
• The deviation-based approach (deep learning based)
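A short sketch of the first two approaches on synthetic 2-D data: a z-score rule for the statistical approach and scikit-learn's LocalOutlierFactor for the density-based approach; the 3-sigma cut-off and n_neighbors=20 are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),   # normal observations
               [[8.0, 8.0]]])                     # an obvious outlier

# Statistical approach: flag points more than 3 standard deviations from the mean.
z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
print("z-score outliers:", np.where((z > 3).any(axis=1))[0])

# Density-based approach: Local Outlier Factor marks outliers with -1.
labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X)
print("LOF outliers:", np.where(labels == -1)[0])
```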
23. Future Requirements for Big Data Technologies
• Handle the growth of the Internet: as more users come online, Big Data technologies will need to handle larger volumes of data.
• Real-time processing: Big Data processing was initially carried out in batches of historical data; in recent years, stream processing systems such as Apache Storm have been developed.
• Process complex data types: data such as graph data, and possibly other more complicated types, will need to be processed.
24. Future Requirements…
• Efficient indexing: indexing is fundamental to the online lookup of data and is therefore essential in managing large collections of documents and their associated metadata.
• Dynamic orchestration of services in multi-server and cloud contexts: most platforms today are not suitable for the cloud, and keeping data consistent between different data stores is challenging.
• Concurrent data processing: being able to process large quantities of data concurrently is very useful.
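A minimal sketch of concurrent data processing with Python's concurrent.futures: a dataset is split into independent chunks that are cleaned in parallel; the chunking and the per-chunk function are placeholders for real processing steps.

```python
from concurrent.futures import ProcessPoolExecutor

def clean_chunk(chunk):
    """Stand-in for any per-chunk cleaning or parsing step."""
    return [x * 2 for x in chunk if x is not None]

if __name__ == "__main__":
    chunks = [[1, 2, None], [3, 4], [None, 5, 6]]
    # Process independent chunks of a large dataset in parallel.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(clean_chunk, chunks))
    print(results)   # [[2, 4], [6, 8], [10, 12]]
```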