Heterogeneous Data in Big
Data
Heterogeneous Data
• Heterogeneous data are any data with high variability of data types
and formats. They are possibly ambiguous and low quality due to
missing values, high data redundancy, and untruthfulness.
Why Data from Source is Heterogeneous
• First, because a variety of data acquisition devices are used, the
acquired data also differ in type.
• Second, the data are large-scale. Large amounts of acquisition equipment
are deployed and distributed, and both the currently acquired data and
the historical data within a certain time frame must be stored.
• Third, there is a strong correlation between time and space.
• Fourth, effective data accounts for only a small portion of the big
data. A great deal of noise may be collected during acquisition.
Types of data heterogeneity
• Syntactic heterogeneity occurs when two data sources are not
expressed in the same language.
• Conceptual heterogeneity, also known as semantic heterogeneity or
logical mismatch, denotes the differences in modelling the same
domain of interest.
• Terminological heterogeneity stands for variations in names when
referring to the same entities from different data sources.
• Semiotic heterogeneity, also known as pragmatic heterogeneity,
stands for different interpretations of the same entities by different people.
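As a rough illustration of the first and third types, the sketch below reads two hypothetical sources that differ in format (JSON versus CSV, a syntactic difference) and in field names (a terminological difference), and maps both onto one vocabulary. The sample data and the FIELD_SYNONYMS table are assumptions made up for this example.

```python
import csv
import io
import json

# Two hypothetical sources describing the same kind of entity.
json_source = '[{"patient_id": "p1", "dob": "1990-01-02"}]'
csv_source = "PatientID,BirthDate\np2,1985-07-30\n"

# Assumed synonym table: maps each source's field names to one vocabulary.
FIELD_SYNONYMS = {
    "patient_id": "id",
    "PatientID": "id",
    "dob": "birth_date",
    "BirthDate": "birth_date",
}

def normalise(record):
    """Rename fields to the shared vocabulary; unknown fields are kept as-is."""
    return {FIELD_SYNONYMS.get(key, key): value for key, value in record.items()}

# Syntactic step: each format needs its own parser.
records = [normalise(r) for r in json.loads(json_source)]
records += [normalise(r) for r in csv.DictReader(io.StringIO(csv_source))]
```

Conceptual and semiotic heterogeneity are not addressed by a renaming table like this one; they usually need domain modelling and human agreement.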
Data representation can be described at four levels
• Level 1 is diverse raw data with different types and from different
sources.
• Level 2 is called ‘unified representation’. Heterogeneous data needs to
be unified. This layer converts individual attributes into information in
terms of ‘what-when-where’.
• Level 3 is aggregation. Aggregation aids easy visualization and
provides an intuitive query.
• Level 4 is called ‘situation detection and representation’. The final step
in situation detection is a classification operation that uses domain
knowledge to assign an appropriate class to each cell.
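The four levels can be sketched as a small pipeline. Everything below is an assumption made for illustration: the two raw sources, the grid cell size, and the classification rule standing in for domain knowledge.

```python
from collections import defaultdict

# Level 1: raw records from two hypothetical sources with different shapes.
raw_sensor = [
    {"temp_c": 31.0, "ts": 10, "lat": 1.2, "lon": 3.7},
    {"temp_c": 35.0, "ts": 11, "lat": 1.4, "lon": 3.1},
]
raw_reports = [("flood", 12, (5.5, 8.1))]

def unify(sensor_rows, report_rows):
    """Level 2: convert every record to a (what, when, where) triple."""
    unified = []
    for row in sensor_rows:
        unified.append((("temp_c", row["temp_c"]), row["ts"], (row["lat"], row["lon"])))
    for what, when, where in report_rows:
        unified.append(((what, 1.0), when, where))
    return unified

def aggregate(unified, cell_size=5.0):
    """Level 3: group observations by spatial grid cell."""
    cells = defaultdict(list)
    for (name, value), _when, (lat, lon) in unified:
        cell = (int(lat // cell_size), int(lon // cell_size))
        cells[cell].append((name, value))
    return dict(cells)

def classify(cells, hot_threshold=30.0):
    """Level 4: assign a class to each cell using a simple, assumed rule."""
    labels = {}
    for cell, values in cells.items():
        temps = [v for n, v in values if n == "temp_c"]
        if any(n == "flood" for n, _ in values):
            labels[cell] = "flood"
        elif temps and sum(temps) / len(temps) > hot_threshold:
            labels[cell] = "hot"
        else:
            labels[cell] = "normal"
    return labels

situation = classify(aggregate(unify(raw_sensor, raw_reports)))
# situation == {(0, 0): "hot", (1, 1): "flood"}
```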
Data Processing Methods for Heterogeneous
Data
• Data Cleaning
• Data Integration
• Data Reduction and Normalisation
Data Cleaning
Data cleaning is a process that identifies incomplete, inaccurate or
unreasonable data, and then modifies or deletes such data to
improve data quality.
For example, the multisource and multimodal nature of healthcare data
results in high complexity and noise problems.
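A minimal sketch of the identify-then-modify-or-delete idea, assuming a list of body temperature readings in which None marks a missing value. The plausible range and the choice of median imputation are assumptions for this example, not a recommended clinical rule.

```python
from statistics import median

def clean(readings, low=30.0, high=45.0):
    """Impute missing readings with the median and drop unreasonable ones."""
    valid = [r for r in readings if r is not None and low <= r <= high]
    if not valid:
        return []  # nothing trustworthy to impute from
    fill = median(valid)
    cleaned = []
    for r in readings:
        if r is None:
            cleaned.append(fill)   # modify: fill in an incomplete value
        elif low <= r <= high:
            cleaned.append(r)      # keep: value is plausible
        # otherwise delete: the value is outside the plausible range
    return cleaned

result = clean([36.6, None, 37.1, 370.0, 36.9])
# result == [36.6, 36.9, 37.1, 36.9]
```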
Data Cleaning
• A database may also contain irrelevant attributes.
• Therefore, relevance analysis in the form of correlation analysis and
attribute subset selection can be used to detect attributes that do not
contribute to the classification or prediction task.
• PCA can also be used.
• Data cleaning can be performed to detect and remove redundancies
that may have resulted from data integration.
• The removal of redundant data is often regarded as a kind of data
cleaning as well as data reduction.
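One simple form of redundancy after integration is the same record arriving from two sources with small formatting differences. The sketch below assumes records are dictionaries and that a caller-chosen set of key fields identifies a record; both are assumptions for illustration.

```python
def drop_duplicates(records, key_fields):
    """Keep the first record for each normalised key, preserving input order."""
    seen = set()
    unique = []
    for rec in records:
        # Normalise case and surrounding whitespace before comparing.
        key = tuple(str(rec.get(f, "")).strip().lower() for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

merged = [
    {"name": "Ada Lovelace", "city": "London"},
    {"name": "ada lovelace ", "city": "LONDON"},
    {"name": "Alan Turing", "city": "London"},
]
deduplicated = drop_duplicates(merged, ["name", "city"])
```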
Data Integration
• In the case of data integration or aggregation, datasets are matched
and merged on the basis of shared variables and attributes.
• Advanced data processing and analysis techniques make it possible to mix
structured and unstructured data to elicit new insights.
However, this requires “clean” data.
Data integration & Challenge
• Data integration tools are evolving towards the unification of
structured and unstructured data
• It is often required to structure unstructured data and merge
heterogeneous information sources and types into a unified data layer
• Challenge: Unique identifiers shared between the records of two
different datasets often do not exist, so determining which data
should be merged may not be clear at the outset.
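When no shared identifier exists, one common workaround is approximate matching on a descriptive field. The sketch below uses difflib.SequenceMatcher from the Python standard library; the "name" field, the sample records, and the 0.85 threshold are assumptions chosen for this example.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity ratio between two strings, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def link_records(left, right, threshold=0.85):
    """Pair each left record with its most similar right record, if close enough."""
    if not right:
        return []
    links = []
    for l in left:
        best = max(right, key=lambda r: similarity(l["name"], r["name"]))
        if similarity(l["name"], best["name"]) >= threshold:
            links.append((l["name"], best["name"]))
    return links

left = [{"name": "Jon Smith"}, {"name": "Maria Garcia"}]
right = [{"name": "John Smith"}, {"name": "Wei Chen"}]
links = link_records(left, right)
```

The pairwise comparison is quadratic in the number of records, so larger datasets usually need a blocking step first. Any threshold trades false matches against missed matches.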
Approaches of Integration for unstructured
and structured Data
• Natural language processing pipelines: Natural language
processing (NLP) can be applied directly to projects that deal with
unstructured data.
• Entity recognition and linking: Extracting structured information from
unstructured data is a fundamental step, which can be addressed with
information extraction techniques.
• Use of open data to integrate structured & unstructured data: Entities
in open datasets can be used to identify named entities (people,
organizations, places), which can be used to categorize and organize
text contents.
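The third approach can be sketched as a dictionary lookup. The GAZETTEER below is a hypothetical stand-in for entities drawn from an open dataset, and exact string matching is a simplification of real entity recognition and linking.

```python
import re

# Hypothetical entity list standing in for an open dataset.
GAZETTEER = {
    "Ada Lovelace": "person",
    "London": "place",
    "Royal Society": "organization",
}

def tag_entities(text, gazetteer=GAZETTEER):
    """Return (entity, type) pairs in the order they appear in the text."""
    found = []
    for name, kind in gazetteer.items():
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text):
            found.append((match.start(), name, kind))
    return [(name, kind) for _pos, name, kind in sorted(found)]

sentence = "Ada Lovelace presented her notes to the Royal Society in London."
entities = tag_entities(sentence)
```

The resulting tags could then serve as structured attributes for categorizing the text or joining it to structured records.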
Dimension Reduction and Data Normalization
There are several reasons to reduce the dimensionality of the data:
• First, high dimensional data impose computational challenges.
• Second, high dimensionality might lead to poor generalization abilities
of the learning algorithm.
• Finally, dimensionality reduction can be used for finding meaningful
structure of the data
Finding Redundancy and Removal
• Check the correlation matrix obtained by correlation analysis.
• Factor analysis is a method for dimensionality reduction.
• Factor Analysis can be used to reduce the number of variables and
detect the structure in the relationships among variables. Therefore,
Factor Analysis is often used as a structure detection or data
reduction method.
• PCA is useful when there is data on a large number of variables and
possibly there is some redundancy in those variables.
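A minimal sketch of the correlation check, assuming the data is held as named numeric columns of equal length with no constant column. The 0.95 threshold and the sample columns are assumptions for this example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # assumes neither column is constant

def redundant_pairs(columns, threshold=0.95):
    """List column pairs whose absolute correlation reaches the threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(columns[a], columns[b])) >= threshold:
                pairs.append((a, b))
    return pairs

columns = {
    "height_cm": [150, 160, 170, 180],
    "height_in": [59.1, 63.0, 66.9, 70.9],
    "age": [30, 22, 41, 35],
}
pairs = redundant_pairs(columns)
```

One column from each reported pair is a candidate for removal. Which one to drop is a judgment call that depends on the task.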
Several ways in which PCA can help
• Pre-processing: With PCA one can also whiten the representation,
which rebalances the weights of the data to give better performance
in some cases.
• Modeling: PCA learns a representation that is sometimes used as an
entire model, e.g., a prior distribution for new data.
• Compression: PCA can be used to compress data, by replacing data
with its low-dimensional representation.
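The compression and whitening uses can be sketched for the leading principal component only. The code below is a simplified illustration that finds that component by power iteration on the sample covariance matrix; a numerical library would normally be used instead, and the fixed start vector is assumed not to be orthogonal to the leading direction.

```python
from math import sqrt

def first_principal_component(rows, iterations=200):
    """Return (mean, direction, variance) of the leading principal component."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0 / sqrt(d)] * d  # fixed start vector (an assumption, see lead-in)
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))  # assumes the data is not constant
        v = [x / norm for x in w]
    variance = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return mean, v, variance

def project(row, mean, v, variance=None):
    """Compress a row to one score; pass variance to whiten the score."""
    z = sum((x - m) * vi for x, m, vi in zip(row, mean, v))
    return z / sqrt(variance) if variance is not None else z

def reconstruct(z, mean, v):
    """Map an unwhitened score back to the original space."""
    return [m + z * vi for m, vi in zip(mean, v)]

rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
mean, direction, variance = first_principal_component(rows)
scores = [project(r, mean, direction) for r in rows]
whitened = [project(r, mean, direction, variance) for r in rows]
restored = [reconstruct(z, mean, direction) for z in scores]
```

Because the sample rows lie exactly on a line, one score per row reconstructs them with no loss. Real data would lose whatever variance the discarded components carried.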
Big Data gaps & Challenges
Paradox of Big Data
• Identity Paradox: Big data seeks to identify, but it also threatens
identity.
• Transparency paradox: Small data inputs are aggregated to
produce large datasets, and this collection happens invisibly. Big data
promises to use this data to make the world more transparent, but its
own collection is invisible.
• The power paradox — Big data sensors and big data pools are
predominantly in the hands of powerful intermediary institutions, not
ordinary people.
Solution of big data analytics
• Data loading: Software has to be developed to load data from
multiple, varied data sources. The system needs to deal with
corrupted records and to provide monitoring services.
• Data parsing: Most data sources provide data in a certain format
that needs to be parsed into the Hadoop system.
• Data analytics: A big data analytics solution needs to support
rapid iterations in order for data to be properly analyzed.
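The loading step's handling of corrupted records can be sketched as follows, assuming the input arrives as JSON Lines (one JSON object per line). The input format and the shape of the returned error list are assumptions for illustration; they are not tied to any particular platform.

```python
import json

def load_records(lines):
    """Parse JSON lines; return (records, corrupted) so failures can be monitored."""
    records, corrupted = [], []
    for line_no, line in enumerate(lines, start=1):
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as err:
            # Keep the line number and reason instead of aborting the whole load.
            corrupted.append((line_no, str(err)))
    return records, corrupted

lines = ['{"id": 1}', '{"id": 2', '{"id": 3}']
records, corrupted = load_records(lines)
```

The count of corrupted lines is the kind of figure a monitoring service could track over time.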
Big Data Analytics
• descriptive analytics — involving the description and summarization
of knowledge patterns;
• predictive analytics — forecasting and statistical modelling to
determine future possibilities; and
• prescriptive analytics — helping analysts in decision-making by
determining actions and assessing their impacts.
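The three kinds of analytics can be contrasted on a toy series. The sales figures, the straight-line forecast, and the reorder rule below are all assumptions made up for this sketch.

```python
from statistics import mean

sales = [100.0, 110.0, 120.0, 130.0]  # hypothetical monthly sales

def describe(values):
    """Descriptive: summarize what has happened."""
    return {"mean": mean(values), "min": min(values), "max": max(values)}

def forecast_next(values):
    """Predictive: fit a least-squares line over the indices and extend it one step."""
    n = len(values)
    xs = range(n)
    mx, my = mean(xs), mean(values)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, values))
             / sum((x - mx) ** 2 for x in xs))
    return my + slope * (n - mx)

def recommend(predicted, stock):
    """Prescriptive: turn the forecast into an action using an assumed rule."""
    return "reorder" if predicted > stock else "hold"

summary = describe(sales)
predicted = forecast_next(sales)
action = recommend(predicted, stock=125.0)
```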
Big Data tools
• Big Data tools include:
• Hive
• Splunk
• Tableau
• Talend
• RapidMiner
• MarkLogic
Big Data compute platform strategies:
• Internal compute cluster. For long-term storage of unique or sensitive
data, it often makes sense to create and maintain an Apache Hadoop
cluster within the internal network of an organization.
• External compute cluster. There is a trend across the IT industry to
outsource elements of infrastructure to ‘utility computing’ service
providers.
• Hybrid compute cluster. A common hybrid option is to provision
external compute cluster resources using services for on-demand Big
Data analysis tasks and create a modest internal compute cluster for
long-term data storage.
Outlier detection
• The statistical approach
• The density-based local outlier approach (Local Outlier Factor)
• The distance-based approach (Clustering)
• The deviation-based approach (Deep Learning Based)
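Two of these approaches can be sketched on one-dimensional data. The z-score rule stands in for the statistical approach and a neighbour-count rule for the distance-based approach; the thresholds, radius, and sample values are assumptions chosen for this example.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Statistical approach: flag values far from the mean in standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

def distance_outliers(values, radius=3.0, min_neighbours=2):
    """Distance-based approach: flag values with too few neighbours within radius."""
    flagged = []
    for i, v in enumerate(values):
        neighbours = sum(
            1 for j, w in enumerate(values) if j != i and abs(v - w) <= radius
        )
        if neighbours < min_neighbours:
            flagged.append(v)
    return flagged

data = [10, 11, 9, 10, 12, 10, 50]
by_zscore = zscore_outliers(data)
by_distance = distance_outliers(data)
```

A large outlier inflates the standard deviation it is measured against, so the z-score threshold here is set lower than the often-quoted value of 3.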
Traditional Data Mining and Machine Learning, Deep Learning and Big
Data Analytics
Future Requirement for Big Data Technologies
• Handle the growth of the Internet: As more users come online, Big
Data technologies will need to handle larger volumes of data.
• Real-time processing: Big Data processing was initially carried out in
batches of historical data. In recent years, stream processing systems
such as Apache Storm have been developed.
• Process complex data types: Data such as graph data and possibly
other types of more complicated
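The difference between batch and stream processing in the real-time item above can be illustrated with a running mean that updates as each value arrives, without waiting for the full dataset. This is a plain Python generator used as a sketch; it does not use the API of Apache Storm or any other streaming system.

```python
def running_mean(stream):
    """Yield the mean of all values seen so far, one result per incoming value."""
    count, total = 0, 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count

def batch_mean(values):
    """Batch counterpart: needs the complete list before it can answer."""
    return sum(values) / len(values)

incoming = [2, 4, 6]
streamed = list(running_mean(incoming))
```

The last streamed value equals the batch result, but the streaming version holds only a count and a total in memory.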
Future Requirement…
• Efficient indexing: Indexing is fundamental to the online lookup of
data and is therefore essential in managing large collections of
documents and their associated metadata.
• Dynamic orchestration of services in multi-server and cloud contexts:
Most platforms today are not suitable for the cloud, and keeping
data consistent between different data stores is challenging.
• Concurrent data processing: Being able to process large quantities
of data concurrently is very useful.
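Concurrent processing can be sketched with the standard library's concurrent.futures module: the input is split into chunks, each chunk is handled by a worker, and the partial results are combined. The word-count task, chunk size, and worker count are assumptions for this example.

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk):
    """Count the words in one chunk of lines."""
    return sum(len(line.split()) for line in chunk)

def concurrent_word_count(lines, workers=4, chunk_size=2):
    """Split lines into chunks, count each chunk in a worker, then sum the results."""
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_words, chunks))

total = concurrent_word_count(["a b", "c", "d e f", "g"])
```

In CPython, threads mainly help with I/O-bound work. For CPU-bound tasks, ProcessPoolExecutor from the same module is the usual alternative.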