Big Data Module 1
of view, the velocity of data translates into the amount of time it takes for the data
to be processed once it enters the enterprise’s perimeter. Coping with the fast inflow
of data requires the enterprise to design highly elastic and available data processing
solutions and corresponding data storage capabilities.
Variety
Data variety refers to the multiple formats and types of data that need to be
supported by Big Data solutions. Data variety brings challenges for enterprises in
terms of data integration, transformation, processing, and storage.
Veracity
Veracity refers to the quality or fidelity of data. Data that enters Big Data
environments needs to be assessed for quality, which can lead to data processing
activities to resolve invalid data and remove noise. In relation to veracity, data can
be part of the signal or noise of a dataset.
Noise is data that cannot be converted into information and thus has no value,
whereas signals have value and lead to meaningful information. Data with a high
signal-to-noise ratio has more veracity than data with a lower ratio. Data that is
acquired in a controlled manner (for example, via online customer registration)
usually contains less noise. Data acquired via uncontrolled sources (such as blog
postings) contains more noise.
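As a rough, illustrative sketch (not from the original text), the Python snippet below assumes a small set of customer records and treats entries with missing or malformed fields as noise, so the fraction of valid records serves as a crude signal-to-noise indicator:

# Illustrative sketch: estimating how much of a dataset is "signal" vs. "noise".
# The record layout and the validity rules are assumptions for this example.
import re

records = [
    {"email": "ana@example.com", "age": "34"},
    {"email": "not-an-email",    "age": "29"},  # malformed email -> noise
    {"email": "raj@example.com", "age": ""},    # missing age -> noise
]

def is_signal(rec):
    """A record counts as signal only if every field passes a basic validity check."""
    valid_email = re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", rec.get("email", "")) is not None
    valid_age = rec.get("age", "").isdigit()
    return valid_email and valid_age

signal = [r for r in records if is_signal(r)]
ratio = len(signal) / len(records)
print(f"signal-to-total ratio: {ratio:.2f}")  # higher ratio -> higher veracity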
Value
Value is defined as the usefulness of data for an enterprise. The value
characteristic is intuitively related to the veracity characteristic in that the higher the
data fidelity, the more value it holds for the business. Value is also dependent on
how long it takes to process the data because analytics results have a shelf-life. For
instance, a 20-minute delayed stock quote has no value for making a stock trade.
Classification/Nature of Data
Data can be classified based on its nature, as structured, semi-structured, and
unstructured data.
Structured Data
Structured data conforms to a defined schema or data model and is typically stored in tables. Structured data enables the following (illustrated in the sketch after this list):
● Data insert, delete, update, and append
● Indexing to enable faster data retrieval
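To make these operations concrete, here is a minimal sketch using Python’s built-in sqlite3 module; the table and column names are invented for illustration, and it shows insert, update, delete, and index creation on structured, tabular data:

# Minimal sketch of structured-data operations (insert, update, delete, index)
# using Python's built-in sqlite3 module. Table and column names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cur.execute("INSERT INTO customers (name, city) VALUES (?, ?)", ("Ana", "Pune"))
cur.execute("UPDATE customers SET city = ? WHERE name = ?", ("Mumbai", "Ana"))
cur.execute("DELETE FROM customers WHERE name = ?", ("Ana",))

# Indexing enables faster retrieval on frequently queried columns.
cur.execute("CREATE INDEX idx_customers_city ON customers (city)")
conn.commit()
conn.close()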
One disruptive facet of big data management is the use of a wide range of innovative data management tools and frameworks whose designs are dedicated to supporting operational and analytical processing. NoSQL (not only SQL) frameworks differentiate themselves from traditional relational database management systems and are largely designed to meet the performance demands of big data applications, such as managing large amounts of data with quick response times. There is a variety of NoSQL approaches, such as hierarchical object representation (for example JSON, XML, and BSON) and the concept of key-value storage. At the same time, the wide range of NoSQL tools and developers, and the current status of the market, create uncertainty in data management.
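As a simple illustration of the two NoSQL styles mentioned above, the Python sketch below (purely illustrative and not tied to any particular NoSQL product; the record and key names are invented) contrasts a hierarchical JSON document with a flat key-value representation of the same data:

# Contrast of two NoSQL-style representations of the same record.
# The data and key names are invented for illustration.
import json

# Hierarchical (document) representation, as stored in JSON/BSON document stores.
order_doc = {
    "order_id": 1001,
    "customer": {"name": "Ana", "city": "Pune"},
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
}
print(json.dumps(order_doc, indent=2))

# Key-value representation: an opaque value looked up by a single key,
# as in a key-value store.
kv_store = {"order:1001": json.dumps(order_doc)}
print(kv_store["order:1001"])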
It is difficult to follow technology media and analyst coverage without being bombarded with content touting the value of big data analytics and the corresponding reliance on a wide range of disruptive technologies. The new tools in this sector range from traditional relational database tools with alternative data layouts designed to maximize access speed while reducing the storage footprint, to NoSQL data management frameworks, in-memory analytics, and the broad Hadoop ecosystem. The reality, however, is that there is a shortage of skills in the market for big data technologies. The typical expert has gained experience through tool implementation and its use as a programming model, rather than through the broader aspects of big data management.
It might be obvious that the intent of big data management involves analyzing and processing large amounts of data. Many people have raised expectations about analyzing huge data sets on a big data platform, yet they may not be aware of the complexity behind the transmission, access, and delivery of data and information from a wide range of sources, and then of loading this data into a big data platform. The intricate aspects of data transmission, access, and loading are only part of the challenge; the requirement to handle transformation and extraction is not limited to conventional relational data sets.
Once you import data into a big data platform, you may also realize that data copies migrated from a wide range of sources at different rates and on different schedules can rapidly get out of synchronization with the originating systems. This implies that data coming from one source may be out of date compared with data coming from another source, and it raises questions about the commonality of data definitions, concepts, metadata, and the like. As in traditional data management and data warehousing, the sequence of data extraction, transformation, and migration gives rise to situations in which there is a risk of data becoming unsynchronized.
The most practical use cases for big data involve data availability: augmenting existing data storage while allowing end users to access the data through business intelligence tools for data discovery. These business intelligence tools must be able to connect to the different big data platforms and provide transparency to data consumers, eliminating the need for custom coding. At the same time, as the number of data consumers grows, the demands on performance and scalability can be expected to grow as well.
6. Miscellaneous Challenges:
Other challenges may occur while integrating big data. These include data integration, skill availability, solution cost, the volume of data, the rate of data transformation, and the veracity and validity of data. Merging data that differs in source or structure, and doing so at a reasonable cost and within a reasonable time, is a challenge in itself. It is also a challenge to process large amounts of data at a reasonable speed so that information is available to data consumers when they need it. In addition, data sets must be validated as they are transferred from one source to another or delivered to consumers.
Intelligent Data Analysis (IDA) discloses hidden facts that were not previously known and provides potentially important information or facts from large quantities of data. It also helps in decision making. IDA helps to obtain useful information, necessary data, and interesting models from the large amount of data available online in order to make the right choices.
IDA, in general, includes three stages: (1) preparation of data; (2) data mining; (3) data validation and explanation. The preparation of data involves selecting the required data from the relevant data source and incorporating it into a data set that can be used for data mining. The main goal of intelligent data analysis is to obtain knowledge.
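A minimal sketch of the three IDA stages is given below, assuming pandas and scikit-learn are available; the file name, column names, and model choice are assumptions made for illustration:

# Sketch of the three IDA stages: (1) data preparation, (2) data mining,
# (3) validation/explanation. File name, columns, and model choice are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Preparation: select the required columns and drop incomplete rows.
df = pd.read_csv("customers.csv").dropna(subset=["age", "income", "churned"])
X, y = df[["age", "income"]], df["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 2. Mining: fit a simple model to discover patterns in the prepared data.
model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# 3. Validation and explanation: check accuracy on held-out data.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))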
1. Collect Data
Raw or unstructured data that is too diverse or complex for a warehouse may
be assigned metadata and stored in a data lake.
2. Process Data
Once data is collected and stored, it must be organized properly to get accurate
results on analytical queries, especially when it’s large and unstructured. Available
data is growing exponentially, making data processing a challenge for organizations.
One processing option is batch processing, which looks at large data blocks
over time. Batch processing is useful when there is a longer turnaround time between
collecting and analyzing data.
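A minimal batch-processing sketch with pandas is shown below; the file name, column name, and chunk size are assumptions, and the point is that the data is read and aggregated in fixed-size batches rather than loaded all at once:

# Batch processing sketch: aggregate a large CSV in fixed-size chunks
# instead of loading it all into memory. File and column names are assumptions.
import pandas as pd

total_sales = 0.0
row_count = 0

# Each iteration processes one batch (chunk) of 100,000 rows.
for chunk in pd.read_csv("sales_log.csv", chunksize=100_000):
    total_sales += chunk["amount"].sum()
    row_count += len(chunk)

print(f"rows processed: {row_count}, total sales: {total_sales:.2f}")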
3. Clean Data
Data requires scrubbing to improve data quality and get stronger results; all
data must be formatted correctly, and any duplicative or irrelevant data must be
eliminated or accounted for. Dirty data can obscure and mislead, creating flawed
insights.
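A small cleaning sketch with pandas (the DataFrame contents and column names are invented) shows the scrubbing steps mentioned above: consistent formatting, duplicate removal, and handling of missing values:

# Data cleaning sketch: normalize formats, drop duplicates, and handle missing values.
# The DataFrame contents and column names are invented for illustration.
import pandas as pd

df = pd.DataFrame({
    "name":  ["Ana ", "ana", "Raj", None],
    "email": ["ANA@X.COM", "ana@x.com", "raj@x.com", "raj@x.com"],
})

df["name"] = df["name"].str.strip().str.title()   # consistent formatting
df["email"] = df["email"].str.lower()
df = df.drop_duplicates(subset=["email"])         # remove duplicative rows
df = df.dropna(subset=["name"])                   # drop rows missing required fields
print(df)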
4. Analyze Data
Getting big data into a usable state takes time. Once it’s ready, advanced
analytics processes can turn big data into big insights. Some of these big data
analysis methods include:
● Data mining sorts through large datasets to identify patterns and relationships (a toy sketch follows this list).
● Deep learning imitates human learning patterns by using artificial intelligence and machine learning to layer algorithms and find patterns in the most complex and abstract data.
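As a toy example of the data mining idea in the first bullet, the sketch below (the data points and the choice of two clusters are assumptions) groups records into patterns using k-means clustering from scikit-learn:

# Toy data mining example: discovering groupings (patterns) in a dataset with k-means.
# The data points and the choice of two clusters are assumptions for illustration.
import numpy as np
from sklearn.cluster import KMeans

# Two visually obvious groups of (age, annual_spend) points.
X = np.array([[22, 300], [25, 350], [24, 320],
              [51, 2200], [48, 2100], [53, 2300]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster labels:", kmeans.labels_)
print("cluster centers:", kmeans.cluster_centers_)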
Analysis vs. Reporting
Following are the five major differences between Analysis and Reporting:
1. Purpose
Reporting has helped companies monitor their data since before the digital technology boom. Organizations have long depended on the information it brings to their business, as reporting extracts that information and makes it easier to understand.
Analysis interprets data at a deeper level. While reporting can link cross-channel data, provide comparisons, and make information easier to understand (think of dashboards, charts, and graphs, which are reporting tools and not analysis outputs), analysis interprets this information and provides recommendations on actions.
2. Tasks
3. Outputs
4. Delivery
Analysis requires a more custom approach, with human minds doing the superior reasoning and analytical thinking needed to extract insights, and the technical skills to provide efficient steps towards accomplishing a specific goal. This is why data analysts and data scientists are in demand these days, as organizations depend on them to come up with recommendations that leaders or business executives can use to make decisions about their businesses.
5. Value
Reporting itself is just numbers. Without drawing insights and getting reports
aligned with your organization’s big picture, you can’t make decisions based on
reports alone.
Data analysis is the most powerful tool to bring into your business. Employing
the powers of analysis can be comparable to finding gold in your reports, which
allows your business to increase profits and further develop.
● NoSQL databases are non-relational data management systems that do not require a fixed schema, making them a great option for big, raw, unstructured data. NoSQL stands for “not only SQL,” and these databases can handle a variety of data models.
● Spark is an open source cluster computing framework that uses implicit data parallelism and fault tolerance to provide an interface for programming entire clusters (a minimal sketch follows).
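A minimal PySpark sketch is shown below, assuming the pyspark package is installed; the data and column names are invented, and the example simply runs a small distributed aggregation in local mode:

# Minimal Spark sketch: a small distributed aggregation with PySpark.
# Assumes `pip install pyspark`; data and column names are invented.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").master("local[*]").getOrCreate()

df = spark.createDataFrame(
    [("electronics", 120.0), ("books", 15.5), ("electronics", 80.0)],
    ["category", "amount"],
)
df.groupBy("category").sum("amount").show()

spark.stop()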
Re-Sampling
The problem with the sampling process is that we only have a single estimate
of the population parameter, with little idea of the variability or uncertainty in the
estimate. One way to address this is by estimating the population parameter multiple
times from our data sample. This is called resampling.
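A short bootstrap resampling sketch with NumPy follows (the sample values are invented): the sample is resampled with replacement many times to obtain a distribution of mean estimates rather than a single point estimate:

# Bootstrap resampling sketch: estimate the sample mean many times by
# resampling with replacement. The sample values are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
sample = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.3, 4.4])

boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(1000)
])

print("point estimate:", sample.mean())
print("bootstrap 95% interval:", np.percentile(boot_means, [2.5, 97.5]))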
Statistical Inference
Prediction Error
Predictive analytical processes use new and historical data to forecast activity,
behaviour, and trends. A prediction error is the failure of some expected event to
occur. When a prediction fails, analysts can examine the predictions and the failures and decide on methods to overcome such errors in the future. Applying that type of knowledge can inform decisions and improve the quality of future predictions.
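To make prediction error concrete, the sketch below (with invented actual and predicted values) computes two common error measures, mean absolute error and mean squared error, with NumPy:

# Measuring prediction error: compare predicted values against actual outcomes.
# The actual and predicted values are invented for illustration.
import numpy as np

actual    = np.array([100, 150, 130, 170])
predicted = np.array([110, 140, 150, 160])

errors = predicted - actual
mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error

print(f"MAE: {mae:.1f}, MSE: {mse:.1f}")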