Heterogeneous Data Handling in Big Data Analysis
2. Heterogeneous Data
• Heterogeneous data are data with high variability of data types and formats. They are possibly ambiguous and of low quality due to missing values, high data redundancy, and untruthfulness.
3. Why Data from Sources Are Heterogeneous
• First, because of the variety of data acquisition devices, the acquired data also differ in type, which introduces heterogeneity.
• Second, the data are large-scale: acquisition equipment is massive and distributed, and not only the currently acquired data but also the historical data within a certain time frame must be stored.
• Third, there is a strong correlation between time and space.
• Fourth, effective data account for only a small portion of the big data; a great quantity of noise may be collected during acquisition.
4. Types of Data Heterogeneity
• Syntactic heterogeneity occurs when two data sources are not expressed in the same language.
• Conceptual heterogeneity, also known as semantic heterogeneity or logical mismatch, denotes differences in modelling the same domain of interest.
• Terminological heterogeneity stands for variations in names when referring to the same entities from different data sources.
• Semiotic heterogeneity, also known as pragmatic heterogeneity, stands for different interpretations of entities by people.
5. Data representation can be described at four levels
• Level 1 is diverse raw data with different types and from different
sources.
• Level 2 is called ‘unified representation’. Heterogeneous data need to be unified; this layer converts individual attributes into information in terms of ‘what-when-where’ (see the sketch after this list).
• Level 3 is aggregation. Aggregation aids easy visualization and provides intuitive querying.
• Level 4 is called ‘situation detection and representation’. The final step
in situation detection is a classification operation that uses domain
knowledge to assign an appropriate class to each cell.
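As a rough illustration of the Level 2 ‘unified representation’ step referenced above, the sketch below maps two hypothetical raw records, a sensor CSV row and an event JSON object with made-up field names, onto a common what-when-where schema; it is a toy under those assumptions, not a prescribed format.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Observation:
    """Unified 'what-when-where' record (a Level 2 style representation)."""
    what: str        # the measured attribute or event
    when: datetime   # timestamp, normalised to a single type
    where: str       # location identifier

def unify(record, source):
    """Map a source-specific raw record onto the unified representation.
    The field names ('temp', 'ts', 'site', ...) are hypothetical examples."""
    if source == "sensor_csv":
        return Observation(what=f"temperature={record['temp']}",
                           when=datetime.strptime(record["ts"], "%Y-%m-%d %H:%M:%S"),
                           where=record["site"])
    if source == "event_json":
        return Observation(what=record["event"],
                           when=datetime.fromisoformat(record["timestamp"]),
                           where=record["location"])
    raise ValueError(f"unknown source: {source}")

# Two heterogeneous raw records mapped onto one schema.
print(unify({"temp": "21.5", "ts": "2024-05-01 10:00:00", "site": "lab-1"}, "sensor_csv"))
print(unify({"event": "door_open", "timestamp": "2024-05-01T10:05:00", "location": "lab-1"}, "event_json"))
```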
6. Data Processing Methods for Heterogeneous Data
• Data Cleaning
• Data Integration
• Data Reduction and Normalisation
7. Data Cleaning
Data cleaning is a process to identify incomplete, inaccurate, or unreasonable data, and then to modify or delete such data to improve data quality.
For example, the multisource and multimodal nature of healthcare data
results in high complexity and noise problems.
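A minimal pandas sketch of such cleaning on a hypothetical vitals table; the column names and the 20–250 bpm plausibility range are illustrative assumptions, not taken from the slides.

```python
import pandas as pd
import numpy as np

# Hypothetical patient-vitals table with typical quality problems:
# a duplicate record, a missing value, and an implausible reading.
df = pd.DataFrame({
    "patient_id": [1, 2, 2, 3, 4],
    "heart_rate": [72, 88, 88, np.nan, 400],   # 400 bpm is unreasonable
})

df = df.drop_duplicates()                                   # remove duplicate records
mask = df["heart_rate"].between(20, 250) | df["heart_rate"].isna()
df = df[mask]                                               # drop unreasonable values
df["heart_rate"] = df["heart_rate"].fillna(df["heart_rate"].median())  # impute missing values
print(df)
```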
8. Data Cleaning
• A database may also contain irrelevant attributes.
• Therefore, relevance analysis in the form of correlation analysis and attribute subset selection can be used to detect attributes that do not contribute to the classification or prediction task (see the sketch after this list).
• PCA can also be used.
• Data cleaning can be performed to detect and remove redundancies that may have resulted from data integration.
• The removal of redundant data is often regarded as a kind of data cleaning as well as data reduction.
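A sketch of relevance analysis on synthetic data, assuming a made-up class label and using scikit-learn's SelectKBest for attribute subset selection; the variable names and k=2 are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age":   rng.normal(50, 10, n),
    "bmi":   rng.normal(25, 4, n),
    "noise": rng.normal(0, 1, n),          # attribute that should not contribute
})
y = (X["age"] + 2 * X["bmi"] + rng.normal(0, 5, n) > 105).astype(int)  # hypothetical class label

# Correlation analysis: how strongly each attribute relates to the label.
print(X.corrwith(y).round(2))

# Attribute subset selection: keep the two attributes most related to the label.
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print("selected attributes:", list(X.columns[selector.get_support()]))
```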
9. Data Integration
• In the case of data integration or aggregation, datasets are matched and merged on the basis of shared variables and attributes (see the sketch after this list).
• Advanced data processing and analysis techniques allow structured and unstructured data to be mixed to elicit new insights; however, this requires “clean” data.
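A minimal pandas sketch of matching and merging two hypothetical sources on a shared attribute (the table and column names are assumptions):

```python
import pandas as pd

# Two hypothetical sources describing the same patients, sharing patient_id.
clinical = pd.DataFrame({"patient_id": [1, 2, 3], "diagnosis": ["A", "B", "A"]})
billing  = pd.DataFrame({"patient_id": [2, 3, 4], "cost": [120.0, 80.0, 200.0]})

# Match and merge on the shared attribute; an outer join keeps unmatched
# records visible so the resulting gaps can be handled during cleaning.
merged = clinical.merge(billing, on="patient_id", how="outer")
print(merged)
```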
10. Data Integration and Its Challenges
• Data integration tools are evolving towards the unification of structured and unstructured data.
• It is often required to structure unstructured data and merge heterogeneous information sources and types into a unified data layer.
• Challenge: one reason is that unique identifiers linking the records of two different datasets often do not exist, so determining which data should be merged may not be clear at the outset (see the sketch below).
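When no shared identifier exists, records can only be linked approximately. The sketch below uses Python's standard-library difflib for simple string similarity; the organisation names and the 0.6 similarity threshold are illustrative assumptions, and real record linkage typically needs more robust matching.

```python
import difflib

source_a = ["Acme Corporation", "Globex Ltd", "Initech"]
source_b = ["ACME Corp.", "Globex Limited", "Umbrella Inc."]

def best_match(name, candidates, threshold=0.6):
    """Return the most similar candidate, or None if nothing is close enough."""
    scored = [(difflib.SequenceMatcher(None, name.lower(), c.lower()).ratio(), c)
              for c in candidates]
    score, match = max(scored)
    return match if score >= threshold else None

for name in source_a:
    print(name, "->", best_match(name, source_b))   # Initech finds no counterpart
```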
11. Approaches to Integrating Unstructured and Structured Data
• Natural language processing pipelines: NLP can be applied directly to projects that demand dealing with unstructured data.
• Entity recognition and linking: extracting structured information from unstructured data is a fundamental step and can be addressed with information extraction techniques (see the sketch after this list).
• Use of open data to integrate structured and unstructured data: entities in open datasets can be used to identify named entities (people, organizations, places), which can then be used to categorize and organize text content.
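As one possible entity-recognition sketch (the slides do not prescribe a tool), the example below uses spaCy's small English model to extract (entity, type) pairs from free text; the sentence and the expected labels are illustrative.

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = "Barack Obama visited Paris in 2015 to meet officials from UNESCO."
doc = nlp(text)

# Structured (entity, type) pairs extracted from unstructured text; these can
# then be linked to identifiers in open datasets to organise the documents.
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('Barack Obama', 'PERSON'), ('Paris', 'GPE'), ('2015', 'DATE'), ('UNESCO', 'ORG')]
```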
12. Dimension Reduction and Data Normalization
There are several reasons to reduce the dimensionality of the data:
• First, high dimensional data impose computational challenges.
• Second, high dimensionality might lead to poor generalization abilities
of the learning algorithm.
• Finally, dimensionality reduction can be used for finding meaningful structure in the data.
13. Finding and Removing Redundancy
• Check the correlation matrix obtained by correlation analysis (a sketch follows after this list).
• Factor analysis is a method for dimensionality reduction.
• Factor Analysis can be used to reduce the number of variables and
detect the structure in the relationships among variables. Therefore,
Factor Analysis is often used as a structure detection or data
reduction method.
• PCA is useful when there is data on a large number of variables and
possibly there is some redundancy in those variables.
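A small sketch of both ideas on synthetic, partly redundant variables: inspect the correlation matrix, then fit scikit-learn's FactorAnalysis to compress five observed variables onto two latent factors; the loading matrix and noise level are invented for illustration.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
n = 300

# Five observed variables driven by two latent factors, i.e. partly redundant.
factors = rng.normal(size=(n, 2))
loadings = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.0],
                     [0.0, 1.0], [0.1, 0.9]])
X = factors @ loadings.T + 0.2 * rng.normal(size=(n, 5))

# Correlation matrix: large off-diagonal values reveal redundant variables.
print(np.round(np.corrcoef(X, rowvar=False), 2))

# Factor analysis as structure detection / data reduction: 5 variables -> 2 factors.
fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(X)
print(scores.shape)                  # (300, 2)
print(np.round(fa.components_, 2))   # estimated loadings
```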
14. Several ways in which PCA can help
• Pre-processing: With PCA one can also whiten the representation,
which rebalances the weights of the data to give better performance
in some cases.
• Modeling: PCA learns a representation that is sometimes used as an
entire model, e.g., a prior distribution for new data.
• Compression: PCA can be used to compress data, by replacing data
with its low-dimensional representation.
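A brief scikit-learn sketch of the pre-processing (whitening) and compression uses on synthetic correlated data; keeping a single component is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])  # correlated features

# Pre-processing: whitening rebalances the variance of the components.
white = PCA(n_components=2, whiten=True).fit_transform(X)
print(np.round(np.cov(white, rowvar=False), 2))     # approximately the identity matrix

# Compression: keep only the leading component, then reconstruct.
pca = PCA(n_components=1).fit(X)
compressed = pca.transform(X)                       # low-dimensional representation
reconstructed = pca.inverse_transform(compressed)   # approximate original data
print("reconstruction error:", round(float(np.mean((X - reconstructed) ** 2)), 3))
```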
16. Paradoxes of Big Data
• The identity paradox: big data seeks to identify, but it also threatens identity.
• The transparency paradox: small data inputs are aggregated, invisibly, into large datasets. Big data promises to use these data to make the world more transparent, yet its own collection is invisible.
• The power paradox: big data sensors and big data pools are predominantly in the hands of powerful intermediary institutions, not ordinary people.
17. Elements of a Big Data Analytics Solution
• Data loading: software has to be developed to load data from multiple and varied data sources. The system needs to deal with corrupted records and provide monitoring services (a minimal loading sketch follows after this list).
• Data parsing: most data sources provide data in a certain format that needs to be parsed into the Hadoop system.
• Data analytics: the solution needs to support rapid iterations so that data can be properly analyzed.
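As referenced in the data-loading bullet above, here is a minimal loading sketch in plain Python: it parses newline-delimited JSON, skips corrupted records, and logs counts for monitoring. The record format is a made-up example and is not tied to any particular Hadoop tooling.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("loader")

def load_records(lines):
    """Parse newline-delimited JSON, skipping corrupted records and
    reporting counts so that the load can be monitored."""
    good, bad = [], 0
    for lineno, line in enumerate(lines, start=1):
        try:
            good.append(json.loads(line))
        except json.JSONDecodeError:
            bad += 1
            log.warning("skipping corrupted record on line %d", lineno)
    log.info("loaded %d records, skipped %d", len(good), bad)
    return good

# Example input with one corrupted record.
raw = ['{"id": 1, "v": 10}', '{"id": 2, "v": ???}', '{"id": 3, "v": 7}']
records = load_records(raw)
```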
18. Big Data Analytics
• descriptive analytics — involving the description and summarization
of knowledge patterns;
• predictive analytics — forecasting and statistical modelling to
determine future possibilities; and
• prescriptive analytics — helping analysts in decision-making by
determining actions and assessing their impacts.
19. Big Data Tools
• There are some Big Data tools, such as:
• Hive
• Splunk
• Tableau
• Talend
• RapidMiner
• MarkLogic
20. Big Data Compute Platform Strategies
• Internal compute cluster. For long-term storage of unique or sensitive
data, it often makes sense to create and maintain an Apache Hadoop
cluster within the internal network of an organization.
• External compute cluster. There is a trend across the IT industry to
outsource elements of infrastructure to ‘utility computing’ service
providers.
• Hybrid compute cluster. A common hybrid option is to provision external compute resources on demand for Big Data analysis tasks and to maintain a modest internal compute cluster for long-term data storage.
21. Outlier Detection
• The statistical approach
• The density-based local outlier approach (Local Outlier Factor)
• The distance-based approach (clustering)
• The deviation-based approach (deep learning based)
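A short sketch of the first two approaches on synthetic 2-D data: a z-score rule for the statistical approach and scikit-learn's LocalOutlierFactor for the density-based approach; the 3-sigma cut-off and n_neighbors=20 are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),   # normal observations
               [[8.0, 8.0]]])                     # an obvious outlier

# Statistical approach: flag points more than 3 standard deviations from the mean.
z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
print("z-score outliers:", np.where((z > 3).any(axis=1))[0])

# Density-based approach: Local Outlier Factor marks outliers with -1.
labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X)
print("LOF outliers:", np.where(labels == -1)[0])
```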
23. Future Requirements for Big Data Technologies
• Handle the growth of the Internet: as more users come online, Big Data technologies will need to handle larger volumes of data.
• Real-time processing: Big Data processing was initially carried out in batches of historical data; in recent years, stream processing systems such as Apache Storm have been developed.
• Process complex data types: data such as graph data, and possibly other more complicated types, will need to be processed.
24. Future Requirements…
• Efficient indexing: indexing is fundamental to the online lookup of data and is therefore essential in managing large collections of documents and their associated metadata.
• Dynamic orchestration of services in multi-server and cloud contexts: most platforms today are not suitable for the cloud, and keeping data consistent between different data stores is challenging.
• Concurrent data processing: being able to process large quantities of data concurrently is very useful.
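A minimal sketch of concurrent data processing with Python's concurrent.futures: a dataset is split into independent chunks that are cleaned in parallel; the chunking and the per-chunk function are placeholders for real processing steps.

```python
from concurrent.futures import ProcessPoolExecutor

def clean_chunk(chunk):
    """Stand-in for any per-chunk cleaning or parsing step."""
    return [x * 2 for x in chunk if x is not None]

if __name__ == "__main__":
    chunks = [[1, 2, None], [3, 4], [None, 5, 6]]
    # Process independent chunks of a large dataset in parallel.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(clean_chunk, chunks))
    print(results)   # [[2, 4], [6, 8], [10, 12]]
```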