Unit-1-Part1-Big Data Analytics and Tools
Data → Information → Insight
Data available in the real world can take many forms, but at a broad level all of these forms fall under the classification of data based on its structure.
Insert/Update/Delete: All data manipulation operations are easy with structured data (see the sketch below).
Scalability: These databases can be scaled up or down according to the data size.
Transaction Processing: Transactions in an RDBMS are safe and secure because it follows the ACID properties.
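As a minimal sketch of these points (the `employees` table and its columns are hypothetical, not from the notes), the example below uses Python's built-in sqlite3 module to run insert, update, and delete operations on structured data inside a single ACID transaction.

```python
import sqlite3

# Minimal sketch: structured data in a relational table (hypothetical schema).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")

try:
    # Insert, update, and delete are simple, well-defined operations on structured data.
    cur.execute("INSERT INTO employees (name, salary) VALUES (?, ?)", ("Asha", 50000.0))
    cur.execute("UPDATE employees SET salary = salary * 1.10 WHERE name = ?", ("Asha",))
    cur.execute("DELETE FROM employees WHERE salary < ?", (10000.0,))
    conn.commit()          # the whole transaction succeeds atomically (ACID)
except sqlite3.Error:
    conn.rollback()        # or it fails as a unit, leaving the table consistent
finally:
    conn.close()
```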
Semi-Structured Data:
We say that data is in semi-structured format if it does not conform to any data model (an example of a data model is the RDBMS), but it still has some structure.
For example, XML files do not conform to any data model, but they do have some structure.
Another example is a C program: it does not conform to any data model, but it has its own structure.
Though it does not conform to a data model, it does have metadata that describes the data inside it.
But this metadata is not sufficient to describe the original data.
Like structured data, semi-structured data contributes only about 10% of real-world data.
Usually, semi-structured data is in the form of tags and attributes, as in XML and HTML.
These tags try to create a hierarchy of records to establish relationships between data.
There exists no separate schema.
-> Sources of semi-structured data:
The biggest sources of semi-structured data are (a short example follows this list):
1. XML files
2. JSON files
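As a small illustration (the sample records below are invented, not from the notes), Python's standard library can read both formats directly; the structure lives in the tags, attributes, and keys rather than in a separate, enforced schema.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical semi-structured records: tags/attributes carry the structure,
# but there is no separate schema that the data must follow.
xml_doc = '<person id="1"><name>Asha</name><city>Hyderabad</city></person>'
json_doc = '{"id": 2, "name": "Ravi", "skills": ["Hadoop", "Spark"]}'

person = ET.fromstring(xml_doc)
print(person.get("id"), person.find("name").text)   # attributes and nested tags

record = json.loads(json_doc)
print(record["name"], record["skills"])              # nested keys and lists
```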
Unstructured Data:
Data in unstructured format neither conforms to any data model nor has any fixed format, and it is not easily understood by a computer.
Unlike structured and semi-structured data, it contributes almost 80-90% of the data in an organization and in the real world.
Ex: PDFs, JPEGs, PNGs, MP3s, MP4s, PPTs, DOCX files, etc.
We have multiple issues with this unstructured form of data.
-> Issues with unstructured data:
The main issue with unstructured data is that a computer cannot understand it.
Another issue is that a file with an extension such as .txt or .pdf may actually contain data in a structured format, but because of its extension we might assume the data is unstructured and miss out on valuable insight.
Another issue is that the data might have some structure that is not properly defined; in this case, even though the data has some structure, it falls into the category of unstructured data.
Another issue is that the data might be highly structured, but the structure is undeclared and not in any accepted format.
We know that computers cannot understand this unstructured form of data. So what can we do about it?
-> Dealing with Unstructured Data:
To make a computer understand unstructured data, we must convert it into a structured format.
There are multiple methods for doing so. Some of them are listed below (a small sketch follows the list):
1. Data Mining:
   a. Association Rule Mining
   b. Regression Analysis
2. Text Mining and Analytics
3. Natural Language Processing (NLP)
4. POS (Parts of Speech) tagging
etc.
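As one small, hedged example of the text-mining idea (the sample review text and output field names are made up for illustration), the sketch below turns a piece of free-form text into a structured record of token counts that a program can then work with.

```python
import re
from collections import Counter

# Hypothetical unstructured input: free-form text with no data model.
review = "The delivery was late, but the product quality was great. Great value!"

# Simple text-mining step: tokenize and count words to get a structured view.
tokens = re.findall(r"[a-z']+", review.lower())
word_counts = Counter(tokens)

# Structured output: a record (dict) with well-defined fields.
structured_record = {
    "num_tokens": len(tokens),
    "top_words": word_counts.most_common(3),
    "mentions_quality": "quality" in word_counts,
}
print(structured_record)
```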
2. Big Data: Definition
Big Data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision-making.
(or)
Big data refers to extremely large and diverse collections of structured, unstructured, and semi-structured
data that continues to grow exponentially over time. These datasets are so huge and complex in volume,
velocity, and variety, that traditional data management systems cannot store, process, and analyze them.
3. Characteristics of Big Data:
Big data definitions may vary slightly, but it will always be described in terms of volume, velocity, and
variety. These big data characteristics are often referred to as the “3 Vs of big data” and were first defined by
Gartner in 2001.
1. Volume: - As the term implies, big data analytics entails handling and analyzing vast amounts of data.
To effectively work with such massive datasets, specialized tools and infrastructure are necessary for
capturing, storing, managing, cleaning, transforming, analyzing, and reporting the data.
2. Velocity: - Velocity denotes the speed at which data is generated. To keep up with the rapid generation
of data, systems for processing and analyzing data must possess sufficient capacity to handle the
influx of data and deliver timely, actionable insights.
3. Variety: - Variety refers to the diversity of data types and sources. Data can manifest in various forms,
originate from diverse sources, and exist in structured or unstructured formats. Understanding the types
of data and their sources, as well as the interrelationships within the datasets, is vital for generating
meaningful insights from big data.
4. Variability: - Big data often contains noisy and incomplete data points, which can obscure valuable
insights. Addressing this variability typically involves data cleaning and validation processes to ensure
data quality.
5. Veracity: - Veracity pertains to the accuracy and authenticity of the data. Data must undergo validation
to ensure that it accurately represents essential business functions and that any data manipulation,
modeling, and analysis does not compromise the data's accuracy.
6. Value: - A successful big data analytics strategy must generate value. The insights derived from the
analysis should provide meaningful guidance for improving operations, enhancing customer service,
or creating other forms of value. An integral part of developing a big data analytics strategy is
distinguishing between data that can contribute value and data that cannot.
7. Visualization: - Visualization plays a vital role in data analytics, as it involves presenting the analyzed
data in a visually comprehensible manner. When planning data visualization, it is essential to consider
the end user and the decisions the visualizations aim to support. Well-executed data visualization
facilitates swift and well-informed decision-making.
[Figure: source systems (ERP, CRM, Legacy) feed a Data Warehouse, which supports Reporting/Dashboards, OLAP, and Ad hoc querying.]
Diagnostic Analytics:
Another common type of analytics is diagnostic analytics, which helps explain why things happened the
way they did.
It’s a more complex version of descriptive analytics, extending beyond what happened to why it happened.
Diagnostic analytics identifies trends or patterns in the past and then goes a step further to explain why
those trends occurred the way they did.
Diagnostic analytics applies data to figure out why something happened so you can develop better
strategies without so much trial and error.
Examples of diagnostic analytics include:
o Why did year-over-year sales go up?
o Why did a certain product perform above expectations?
o Why did we lose customers in Q3?
The main flaw of diagnostic analytics is that, because it focuses on past occurrences, it is limited in its
ability to provide actionable observations about the future.
Understanding the causal relationships and sequences may be enough for some businesses, but it may not
provide sufficient answers for others.
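As a hedged illustration of this drill-down style of analysis (the DataFrame columns `quarter`, `region`, and `sales` are hypothetical), the sketch below uses pandas to break an overall quarter-over-quarter sales change down by region to see where the drop actually came from.

```python
import pandas as pd

# Hypothetical quarterly sales data (not from the notes).
df = pd.DataFrame({
    "quarter": ["Q2", "Q2", "Q2", "Q3", "Q3", "Q3"],
    "region":  ["North", "South", "West", "North", "South", "West"],
    "sales":   [120, 150, 90, 125, 110, 95],
})

# Descriptive step: what happened overall?
totals = df.groupby("quarter")["sales"].sum()
print(totals)  # Q2 vs. Q3 totals

# Diagnostic step: why did it happen? Drill down by region.
by_region = df.pivot_table(index="region", columns="quarter", values="sales", aggfunc="sum")
by_region["change"] = by_region["Q3"] - by_region["Q2"]
print(by_region.sort_values("change"))  # the South region explains most of the drop
```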
Predictive analytics
Predictive analytics is what it sounds like — it aims to predict likely outcomes and make educated
forecasts using historical data.
Predictive analytics extends trends into the future to see possible outcomes.
This is a more complex version of data analytics because it uses probabilities for predictions instead of
simply interpreting existing facts.
Statistical modeling or machine learning are commonly used with predictive analytics.
A business is in a better position to set realistic goals and avoid risks if they use data to create a list of
likely outcomes.
Predictive analytics can keep your team or the company as a whole aligned on the same strategic vision.
Examples of predictive analytics include:
o Ecommerce businesses that use a customer’s browsing and purchasing history to make product
recommendations.
o Financial organizations that need help determining whether a customer is likely to pay their credit
card bill on time.
o Marketers who analyze data to determine the likelihood that new customers will respond favorably
to a given campaign or product offering.
The primary challenge with predictive analytics is that the insights it generates are limited to the data.
First, that means that smaller or incomplete data sets will not yield predictions as accurate as larger data
sets might.
Additionally, the challenge of predictive analytics being restricted to the data simply means that even the
best algorithms with the biggest data sets can’t weigh intangible or distinctly human factors.
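To make the idea of extending a historical trend into the future concrete, here is a minimal sketch using scikit-learn's linear regression on made-up monthly sales figures; the numbers and variable names are illustrative only, and real predictive work would involve far richer data and validation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical historical data: month index vs. sales.
months = np.array([[1], [2], [3], [4], [5], [6]])
sales = np.array([100, 108, 118, 125, 134, 140])

# Fit a simple statistical model to the past...
model = LinearRegression()
model.fit(months, sales)

# ...and extend the trend into the future (months 7 and 8).
future = np.array([[7], [8]])
forecast = model.predict(future)
print(forecast)  # predicted sales, only as reliable as the historical data behind them
```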
Prescriptive analytics
Prescriptive analytics uses the data from a variety of sources — including statistics, machine learning, and
data mining — to identify possible future outcomes and show the best option.
Prescriptive analytics is the most advanced of the three types because it provides actionable insights
instead of raw data.
This methodology is how you determine what should happen, not just what could happen.
Using prescriptive analytics enables you to not only envision future outcomes, but to understand why
they will happen.
Prescriptive analytics also can predict the effect of future decisions, including the ripple effects those
decisions can have on different parts of the business. And it does this in whatever order the decisions may
occur.
Prescriptive analytics is a complex process that involves many variables and tools like algorithms,
machine learning, and big data.
Examples of prescriptive analytics include:
o Calculating client risk in the insurance industry to determine what plans and rates an account
should be offered.
o Discovering what features to include in a new product to ensure its success in the market, possibly
by analyzing data like customer surveys and market research to identify what features are most
desirable for customers and prospects.
o Identifying tactics to optimize patient care in healthcare, like assessing the risk for developing
specific health problems in the future and targeting treatment decisions to reduce those risks.
The most common issue with prescriptive analytics is that it requires a lot of data to produce useful
results, but a large amount of data isn’t always available. This type of analytics could easily become
inaccessible for most.
Though the use of machine learning dramatically reduces the possibility of human error, an additional
downside is that prescriptive analytics often relies on machine learning algorithms, which cannot always
account for intangible or external variables.
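As a toy, hedged sketch of the prescriptive step (the candidate actions, costs, and predicted conversions below are invented for illustration), the code scores a few possible decisions against a simple predicted-outcome model and recommends the best option.

```python
# Hypothetical candidate actions with assumed costs and predicted conversions
# (in practice these predictions would come from a predictive model).
actions = [
    {"name": "email_campaign",  "cost": 5_000,  "predicted_conversions": 300, "value_per_conversion": 40},
    {"name": "discount_offer",  "cost": 20_000, "predicted_conversions": 800, "value_per_conversion": 35},
    {"name": "loyalty_program", "cost": 12_000, "predicted_conversions": 500, "value_per_conversion": 45},
]

def expected_profit(action):
    # Prescriptive step: turn predictions into a comparable score for each decision.
    return action["predicted_conversions"] * action["value_per_conversion"] - action["cost"]

best = max(actions, key=expected_profit)
for a in sorted(actions, key=expected_profit, reverse=True):
    print(f'{a["name"]:>16}: expected profit = {expected_profit(a):,}')
print("Recommended action:", best["name"])
```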
1.11 Challenges:
1. Scale
2. Security
3. Schema
4. Continuous Availability
5. Consistency
6. Partition tolerance
7. Data Quality
1.12 Terminologies Used in Big Data Environments:
1. In-Memory Analytics
2. In-Database Processing
3. Symmetric Multiprocessor System
4. Massively Parallel Processing
5. Parallel and Distributed Systems (differences)
6. Shared-Nothing Architecture
7. CAP Theorem