Unit 1 - BD - Introduction To Big Data
Strictly for internal circulation (within KIIT) and reference only. Not for outside circulation without permission
“The world is one big data platform.” - Andrew McAfee, co-director of the
MIT Initiative on the Digital Economy, and the associate director of the
Center for Digital Business at the MIT Sloan School of Management.
“Errors using inadequate data are much less than those using no data at
all.” - Charles Babbage, inventor and mathematician.
“The most valuable commodity I know of is information.” - Gordon
Gekko, fictional character in the 1987 film Wall Street and its 2010
sequel Wall Street: Money Never Sleeps, played by Michael Douglas.
“Big data will replace the need for 80% of all doctors” - Vinod Khosla,
Indian-born American engineer and businessman.
“Thanks to big data, machines can now be programmed to the next thing
right. But only humans can do the next right thing.” - Dov Seidman,
American author, attorney, columnist and businessman
Motivating Quotes cont’d
“With data collection, ‘the sooner the better’ is always the best answer.” -
Marissa Mayer, former president and CEO of Yahoo!
“Data is a precious thing and will last longer than the systems
themselves.” - Tim Berners-Lee, inventor of the World Wide Web.
“Numbers have an important story to tell. They rely on you to give them
a voice.” - Stephen Few, Information Technology innovator, teacher, and
consultant.
“When we have all data online it will be great for humanity. It is a
prerequisite to solving many problems that humankind faces.” - Robert Cailliau,
Belgian informatics engineer and computer scientist who, together with
Tim Berners-Lee, developed the World Wide Web.
Importance of the Course
Diagram: the relevance of the course spans business, data science, Big Data analytics, real-time processing, the job market, and usability.
To answer the question of why you should learn Big Data, let’s start with what
industry leaders say about it:
Gartner – Big Data is the new Oil.
IDC – Its market will grow seven times faster than the overall IT market.
IBM – It is not just a technology – it’s a business strategy for capitalizing on
information resources.
IBM – Big Data is the biggest buzz word because technology makes it
possible to analyze all the available data.
McKinsey – There will be a shortage of 1.5 million Big Data professionals by
the mid-2030s.
Industries today are searching for new and better ways to maintain their
position and be prepared for the future. According to experts, Big Data
analytics provides leaders with a path to capture the insights and ideas needed
to stay ahead of tough competition.
Course Objective
Exploring the Big Data Stack, Data Sources Layer, Ingestion Layer, Storage Layer,
Physical Infrastructure Layer, Platform Management Layer, Security Layer,
Monitoring Layer, Analytics Engine, Visualization Layer, Big Data Applications,
Virtualization.
Textbook
Big Data, Black Book, DT Editorial Services, Dreamtech Press, 2016
Reference Books
Seema Acharya and Subhashini Chellappan (Infosys Limited), Big Data and Analytics,
Wiley India Private Limited, 1st Edition, 2015.
EMC Education Services (Ed.), Discovering, Analyzing, Visualizing and Presenting Data,
Wiley, 2014.
Stephan Kudyba and Thomas H. Davenport, Big Data, Mining, and Analytics:
Components of Strategic Decision Making, CRC Press, Taylor & Francis Group, 2014.
Norman Matloff, The Art of R Programming, No Starch Press, Inc., 2011.
Judith Hurwitz et al., Big Data For Dummies, Wiley, 2013.
Glenn J. Myatt, Making Sense of Data, John Wiley & Sons, 2007.
Pete Warden, Big Data Glossary, O’Reilly, 2011.
Grading:
Internal assessment – 30 marks
1 group critical thinking (class test) = 5 X 1 = 5 marks
2 group assignments = 5 X 2 = 10 marks
1 individual class note = 5 X 1 = 5 marks
1 group quiz = 5 X 1 = 5 marks
1 individual class participation = 5 X 1 = 5 marks
Data
Human-readable refers to information that only humans can interpret and study,
such as an image or the meaning of a block of text. If it requires a person to
interpret it, that information is human-readable.
Machine-readable refers to information that computer programs can process. A
program is a set of instructions for manipulating data. Such data can be
automatically read and processed by a computer, such as CSV, JSON, XML, etc.
Non-digital material (for example, printed or hand-written documents) is by its non-
digital nature not machine-readable. But even digital material need not be machine-
readable. For example, a PDF document containing tables of data is definitely digital
but not machine-readable, because a computer would struggle to access the tabular
information, even though it is very human-readable. The equivalent tables in a format
such as a spreadsheet would be machine-readable. As another example, scans
(photographs) of text are not machine-readable (but are human-readable!), whereas
the equivalent text in a format such as a simple ASCII text file is machine-readable
and processable.
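As a minimal illustration (an assumed Python snippet with invented sample records, not taken from the slides), a few lines of code can read CSV and JSON directly because those formats are machine-readable:

# Minimal sketch: the same kind of records in two machine-readable formats.
# The file contents below are made-up illustrations.
import csv
import io
import json

csv_text = "name,city,orders\nAsha,Bhubaneswar,12\nRavi,Pune,7\n"
json_text = '[{"name": "Asha", "city": "Bhubaneswar", "orders": 12}]'

# A program can read the CSV fields directly because the format is structured.
for row in csv.DictReader(io.StringIO(csv_text)):
    print(row["name"], row["orders"])

# JSON is likewise parsed into native data structures in one call.
records = json.loads(json_text)
print(records[0]["city"])

A scanned image of the same table, by contrast, would need OCR before a program could use it.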
Structured data is defined as data that has a defined, repeating pattern, and this
pattern makes it easier for any program to sort, read, and process the data.
This data is in an organized form (e.g., in rows and columns) and can be easily
used by a computer program.
Relationships exist between entities of the data.
Structured data:
Organizes data in a pre-defined format
Is stored in tabular form
Is data that resides in fixed fields within a record or file
Is formatted data that has entities and their attributes mapped
Is used to query and report against predetermined data types
Sources of structured data: relational databases, multidimensional databases, legacy databases, and flat files.
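A small hedged sketch of structured data in practice, using an assumed SQLite table with invented rows, showing fixed fields that can be queried and reported against:

# Minimal sketch of structured data: fixed fields in rows and columns,
# queryable with SQL. Table name and values are illustrative assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customer (id, name, city) VALUES (?, ?, ?)",
    [(1, "Asha", "Bhubaneswar"), (2, "Ravi", "Pune")],
)

# Because every record has the same pre-defined fields, reporting queries are easy.
for row in conn.execute("SELECT city, COUNT(*) FROM customer GROUP BY city"):
    print(row)
conn.close()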
Semi-structured data
Semi-structured data has an inconsistent structure and is self-describing
(label/value pairs); other schema information is blended with the data values.
Sources of semi-structured data: web data in the form of cookies, XML, JSON, and
other markup languages.
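A brief illustrative sketch (the JSON records below are invented) of semi-structured data: each record is self-describing through label/value pairs, but the structure is not identical from record to record:

# Minimal sketch of semi-structured data: self-describing label/value pairs
# whose structure varies between records. Records are invented examples.
import json

raw = """
[
  {"user": "asha", "clicks": 12, "device": {"os": "Android"}},
  {"user": "ravi", "clicks": 7, "referrer": "newsletter"}
]
"""
for record in json.loads(raw):
    # Each record carries its own labels, so missing fields must be handled explicitly.
    print(record["user"], record.get("referrer", "unknown"), record.get("device", {}))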
Unstructured data is a set of data that might or might not have any logical or
repeating patterns and is not recognized in a pre-defined manner.
About 80 percent of enterprise data consists of unstructured content.
Unstructured data:
Typically consists of metadata, i.e., additional information related to the data.
Comprises inconsistent data, such as data obtained from files, social
media websites, satellites, etc.
Consists of data in different formats such as e-mails, text, audio, video, or
images.
Sources of unstructured data: body of e-mails, chats and text messages, text both internal and external to the organization, mobile data, social media data, and images, audio, and video.
Challenges associated with Unstructured data
Working with unstructured data poses certain challenges, which are as follows:
Identifying the unstructured data that can be processed
Sorting, organizing, and arranging unstructured data in different sets and
formats
Combining and linking unstructured data in a more structured format to derive
any logical conclusions out of the available information
Costing in terms of storage space and the human resources needed to deal with the
exponential growth of unstructured data
Data Analysis of Unstructured Data
The complexity of unstructured data lies within the language that created it. Human
language is quite different from the language used by machines, which prefer
structured information. Unstructured data analysis refers to the process of
analyzing data objects that do not follow a predefined data model and/or are
unorganized. It is the analysis of any data that is stored over time within an
organizational data repository without any intent for its orchestration, pattern, or
categorization.
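As a rough, assumed example of unstructured-data analysis (the social-media posts below are invented), a short script can impose just enough structure on free text to surface recurring topics:

# Minimal sketch of unstructured-data analysis: extracting hashtags and
# frequent terms from free text. Posts are invented for illustration.
import re
from collections import Counter

posts = [
    "Loving the new phone! #gadgets #happy",
    "Battery drains too fast :( #gadgets #fail",
    "Camera is great, battery not so much",
]

hashtags = Counter(tag.lower() for post in posts for tag in re.findall(r"#\w+", post))
words = Counter(word.lower() for post in posts for word in re.findall(r"[a-zA-Z]+", post))

print(hashtags.most_common(3))   # recurring topics
print(words.most_common(5))      # crude view of common themes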
Think of the following: structured data, semi-structured data, and unstructured data together make up Big Data.
The main challenge for the traditional approach is that computing systems struggle to
manage ‘Big Data’ because of the immense speed and volume at which it is generated.
Some of the challenges are:
The traditional approach cannot work on unstructured data efficiently
The traditional approach is built on top of the relational data model; relationships
between the subjects of interest are created inside the system and the
analysis is done based on them. This approach is not adequate for big data
The traditional approach is batch-oriented and needs to wait for nightly ETL
(extract, transform and load) and transformation jobs to complete before
the required insight is obtained
Traditional data management, warehousing, and analysis systems fail to
analyze this type of data. Due to its complexity, big data is processed with
parallelism. Parallelism in a traditional system is achieved through costly
hardware like MPP (Massively Parallel Processing) systems
Inadequate support for aggregated summaries of data
Process challenges
Capturing Data
Aligning data from different sources
Transforming data into a suitable form for data analysis
Modeling data (mathematically, through simulation)
Management Challenges:
Security
Privacy
Governance
Ethical issues
Elements of Big Data
In most big data circles, these are called the four V’s: volume, variety, velocity, and veracity.
(One might consider a fifth V, value.)
Volume - refers to the incredible amounts of data generated each second from social media,
cell phones, cars, credit cards, M2M sensors, photographs, video, etc. The amounts of data
have become so large that they can no longer be stored and analyzed using traditional
database technology. Distributed systems are therefore used, where parts of the data are
stored in different locations and brought together by software.
Variety - refers to the different types of data that digital systems now use. Data today looks
very different from data of the past. New and innovative big data technology now allows
structured and unstructured data to be harvested, stored, and used simultaneously.
Velocity - refers to the speed at which vast amounts of data are being generated, collected,
and analyzed. Every second of every day, data is increasing. Not only must it be analyzed,
but the speed of transmission and access to the data must also remain near-instantaneous to
allow for real-time access. Big data technology allows data to be analyzed while it is being
generated, without ever putting it into databases.
Veracity - is the quality or trustworthiness of the data. Just how accurate is all this data?
For example, think about all the Twitter posts with hash tags, abbreviations, typos, etc., and
the reliability and accuracy of all that content.
Elements of Big Data cont’d
Value - refers to the ability to transform a tsunami of data into business value. Having
endless amounts of data is one thing, but unless it can be turned into value it is useless.
(Refer to the Appendix for data volumes.)
Approach and explanation:
Descriptive: What’s happening in my business?
• Comprehensive, accurate and historical data
• Effective visualisation
Diagnostic: Why is it happening?
• Ability to drill-down to the root-cause
• Ability to isolate all confounding information
Predictive: What’s likely to happen?
• Decisions are automated using algorithms and technology
• Historical patterns are being used to predict specific outcomes using algorithms
Prescriptive: What do I need to do?
• Recommended actions and strategies based on champion/challenger strategy outcomes
• Applying advanced analytical algorithms to make specific recommendations
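A small illustrative sketch (the monthly sales figures are invented) contrasting descriptive and predictive analytics; a prescriptive step would then recommend an action based on the forecast:

# Minimal sketch contrasting descriptive and predictive analytics on an
# invented monthly-sales series.
sales = [100, 110, 125, 140, 150, 165]          # months 0..5, illustrative numbers

# Descriptive: what is happening? Summarise history.
print("average sales:", sum(sales) / len(sales))

# Predictive: what is likely to happen? Fit a simple linear trend by least squares.
n = len(sales)
xs = list(range(n))
x_mean, y_mean = sum(xs) / n, sum(sales) / n
slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, sales))
         / sum((x - x_mean) ** 2 for x in xs))
intercept = y_mean - slope * x_mean
print("forecast for month 6:", intercept + slope * 6)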
Mapping of Big Data’s Vs to Analytics Focus
Historical data can be quite large. There might be a need to process huge amounts of data many
times a day as it gets updated continuously; therefore volume is mapped to history. Variety is
pervasive: input data, insights, and decisions can span a variety of forms, hence it is mapped to all
three. High-velocity data might have to be processed to help real-time decision making, and it plays
across descriptive, predictive, and prescriptive analytics when they deal with present data.
Predictive and prescriptive analytics create data about the future. That data is uncertain by nature,
and its veracity is in doubt. Therefore veracity is mapped to prescriptive and predictive analytics
when they deal with the future.
Evolution of Analytics Scalability
It goes without saying that the world of big data requires new levels of scalability. As the
amount of data organizations process continues to increase, the same old methods for
handling data just won’t work anymore. Organizations that don’t update their
technologies to provide a higher level of scalability will quite simply choke on big data.
Luckily, there are multiple technologies available that address different aspects of the
process of taming big data and making use of it in analytic processes.
Diagram - Traditional Analytics Architecture: data is extracted from multiple databases (Database 1, Database 2, Database 3, ..., Database n) into a separate analytic server for processing.
In an in-database environment, the processing stays in the database where the data
has been consolidated. The user’s machine just submits the request; it doesn’t do
heavy lifting.
Diagram - An MPP (Massively Parallel Processing) system breaks the job into pieces: a one-terabyte table is split into 100-gigabyte chunks, and different sets of CPU and disk process the chunks concurrently, a parallel process rather than a single-threaded process.
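A minimal sketch of the MPP idea using Python's multiprocessing module; the data and chunk size are assumptions chosen only for illustration:

# Minimal sketch of the MPP idea: split a large job into chunks and let
# separate workers process them concurrently, then combine partial results.
from multiprocessing import Pool

def partial_sum(chunk):
    # Each worker handles only its own chunk (shared-nothing style).
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))                     # stand-in for a large table
    chunks = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]
    with Pool(processes=4) as pool:
        partials = pool.map(partial_sum, chunks)      # run chunks in parallel
    print(sum(partials))                              # combine partial results

A real MPP appliance does the same splitting across many nodes, each with its own CPU, memory, and disk.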
Analysis vs. Reporting
Reporting - The process of organizing data into informational summaries in
order to monitor how different areas of a business are performing.
Analysis: The process of exploring data and reports in order to extract
meaningful insights, which can be used to better understand and improve
business performance.
Difference b/w Reporting and Analysis:
Reporting translates raw data into information. Analysis transforms data
and information into insights.
Reporting helps companies monitor their online business and be alerted
when data falls outside expected ranges. Good reporting should raise
questions about the business from its end users. The goal of analysis is to
answer questions by interpreting the data at a deeper level and providing
actionable recommendations.
In summary, reporting shows you what is happening while analysis focuses
on explaining why it is happening and what you can do about it.
Big Data Analytics isn’t: only used by huge online companies; a “one-size-fits-all” traditional RDBMS built on shared disk and memory; or meant to replace the data warehouse.
1. Obtaining executive sponsorships for investments in big data and its related
activities such as training etc.
2. Getting the business units to share information across organizational silos.
3. Finding the right skills that can manage large amounts of structured, semi-
structured, and unstructured data and create insights from it.
4. Determining the approach to scale rapidly and elastically. In other words,
the need to address the storage and processing of large volume, velocity and
variety of big data.
5. Deciding whether to use structured or unstructured, internal or external
data to make business decisions.
6. Determining what to do with the insights created from big data.
7. Choosing the optimal way to report findings and analysis of big data for the
presentations to make the most sense.
In-Memory Analytics: Data access from non-volatile storage such as hard disk
is a slow process. The more the data is required to be fetched from hard disk or
secondary storage, the slower the process gets. The problem can be addressed
using in-memory analytics. All the relevant data is stored in RAM or primary
storage thus eliminating the need to access the data from hard disk. The
advantage is faster access, rapid deployment, better insights, and minimal IT
involvement. In-memory analytics makes everything instantly available due to the
lower cost of RAM or flash memory, and data can be stored and processed at
lightning speed.
In-Database Processing: Also called In-Database analytics. It works by
fusing data warehouses with analytical systems. Typically the data from various
enterprise Online Transaction Processing (OLTP) systems after cleaning up (de-
duplication, scrubbing etc.) through the process of ETL is stored in the
Enterprise Data Warehouse or data marts. The huge datasets are then exported
to analytical programs for complex and extensive computations.
Note: Refer to Appendix for further details on OLTP and ETL.
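A minimal sketch of in-database processing, using an assumed SQLite table with invented rows: the aggregation is pushed into the database engine so that only the small result set leaves it, instead of exporting every row to an external analytic program:

# Minimal sketch of in-database processing: the grouping and summing run
# inside the database, not in the client. Schema and data are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 120.0), ("east", 80.0), ("west", 200.0)])

# The heavy lifting happens in the database; the client only receives totals.
for region, total in conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(region, total)
conn.close()

In a real deployment the same idea applies to analytic functions running inside an MPP data warehouse rather than SQLite.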
Key terminologies used in Big Data cont’d
Source: https://en.wikipedia.org/wiki/Symmetric_multiprocessing
Diagram - shared memory (SM) and shared disk (SD) architectures: multiple processors (P1, P2, P3) serve users over a network, sharing either a common memory (SM) or a common disk (SD).
In a shared nothing (SN) architecture, neither memory nor disk is shared among
multiple processors.
Advantages:
Fault Isolation: provides the benefit of isolating faults. A fault in a single
machine or node is contained and confined to that node exclusively and
exposed only through messages.
Scalability: If the disk is a shared resource, synchronization is needed to
maintain a consistent shared state, which means that different nodes have to
take turns accessing the critical data. This imposes a limit on how many nodes
can be added to a distributed shared-disk system, thus compromising
scalability.
CAP Theorem: In the past, when we wanted to store more data or increase our
processing power, the common option was to scale vertically (get more
powerful machines) or further optimize the existing code base. However, with
the advances in parallel processing and distributed systems, it is more common
to expand horizontally, or have more machines to do the same task in parallel.
However, in order to effectively pick the tool of choice, like Spark, Hadoop, Kafka,
Zookeeper, and Storm in the Apache project, a basic idea of the CAP theorem is
necessary. The CAP theorem is also called Brewer’s theorem. It states that a
distributed computing environment can only provide 2 of the 3 guarantees:
Consistency, Availability and Partition Tolerance – one must be sacrificed.
Consistency implies that every read fetches the last write
Availability implies that reads and writes always succeed. In other words,
each non-failing node will return a response in a reasonable amount of time
Partition Tolerance implies that the system will continue to function when a
network partition occurs
Assume two nodes, S1 and S2, that both start with value v0, with the network
between them partitioned. The client requests that v1 be written to S1. Since the
system is available, S1 must respond. Since the network is partitioned, however, S1
cannot replicate its data to S2. This phase of execution is called α1.
(Diagram: after the write completes, S1 holds v1 while S2 still holds v0.)
Next, the client issues a read request to S2. Again, since the system is available,
S2 must respond, and since the network is partitioned, S2 cannot update its
value from S1. It returns v0. This phase of execution is called α2.
(Diagram: S1 holds v1, but S2 returns v0 to the client.)
S2 returns v0 to the client after the client had already written v1 to S1. This is
inconsistent.
We assumed a consistent, available, partition tolerant system existed, but we
just showed that there exists an execution for any such system in which the
system acts inconsistently. Thus, no such system exists.
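A toy simulation of the execution above (illustrative Python only, not a real distributed system) showing how remaining available during a partition leads to a stale, inconsistent read:

# Toy simulation: two nodes that cannot replicate during a partition will
# serve stale reads if they remain available. Purely illustrative.
class Node:
    def __init__(self, value):
        self.value = value

s1, s2 = Node("v0"), Node("v0")
partitioned = True                      # network partition between S1 and S2

# Phase a1: client writes v1 to S1; S1 must respond (availability),
# but it cannot replicate to S2 because of the partition.
s1.value = "v1"
if not partitioned:
    s2.value = s1.value

# Phase a2: client reads from S2; S2 must respond (availability) and
# returns its local, now stale, value.
print("read from S2:", s2.value)        # prints v0, not v1 -> inconsistent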
Big Data analysis differs from traditional data analysis primarily due to the
volume, velocity, and variety characteristics of the data being processed.
To address the distinct requirements for performing analysis on Big Data,
a step-by-step methodology is needed to organize the activities and tasks
involved with acquiring, processing, analyzing and repurposing data.
From a Big Data adoption and planning perspective, it is important that in
addition to the lifecycle, consideration be made for issues of training,
education, tooling and staffing of a data analytics team.
The Big Data analytics lifecycle can be divided into the following nine stages:
1. Business Case Evaluation
2. Data Identification
3. Data Acquisition & Filtering
4. Data Extraction
5. Data Validation & Cleansing
6. Data Aggregation & Representation
7. Data Analysis
8. Data Visualization
9. Utilization of Analysis Results
The Data Analysis stage of the Big Data Lifecycle stage is dedicated to
carrying out the actual analysis task.
It runs the code or algorithm that makes the calculations that will lead to
the actual result.
Data Analysis can be simple or really complex, depending on the required
analysis type.
In this stage the ‘actual value’ of the Big Data project will be generated. If all
previous stages have been executed carefully, the results will be factual and
correct.
Depending on the type of analytic result required, this stage can be as
simple as querying a dataset to compute an aggregation for comparison.
On the other hand, it can be as challenging as combining data mining and
complex statistical analysis techniques to discover patterns and anomalies
or to generate a statistical or mathematical model to depict relationships
between variables.
The ability to analyze massive amounts of data and find useful insights
carries little value if the only ones that can interpret the results are the
analysts.
The data visualization stage is dedicated to using data visualization
techniques and tools to graphically communicate the analysis results for
effective interpretation by business users.
Business users need to be able to understand the results to obtain value
from the analysis and subsequently have the ability to provide feedback.
The results of completing the data visualization stage provide users with
the ability to perform visual analysis, allowing for the discovery of answers
to questions that users have not yet even formulated.
The same results may be presented in a number of different ways, which
can influence the interpretation of the results. Consequently, it is important
to use the most suitable visualization technique by keeping the business
domain in context.
After the data analysis has been performed and the results have been
presented, the final step of the Big Data lifecycle is to use the results
in practice.
The Utilization of Analysis Results stage is dedicated to determining how
and where the processed data can be further utilized to leverage the
results of the Big Data project.
Depending on the nature of the analysis problems being addressed,
it is possible for the analysis results to produce “models” that
encapsulate new insights and understandings about the nature of
the patterns and relationships that exist within the data that was
analyzed.
A model may look like a mathematical equation or a set of rules.
Models can be used to improve business process logic and
application system logic, and they can form the basis of a new system
or software program.
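A hedged sketch of what such a model might look like inside application logic; the coefficients, thresholds, and function names below are invented purely for illustration:

# Minimal sketch of an analysis result packaged as a "model": a hand-written
# equation plus a business rule that application logic could reuse.
def churn_score(months_inactive, support_tickets):
    # A model as a mathematical equation (coefficients are assumptions).
    return 0.1 * months_inactive + 0.05 * support_tickets

def retention_action(score):
    # A model as a set of rules feeding business-process logic.
    if score > 0.5:
        return "offer discount"
    if score > 0.2:
        return "send reminder email"
    return "no action"

print(retention_action(churn_score(months_inactive=4, support_tickets=3)))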
Home Assignments
Social Media: A social media marketing company wants to expand its business. They
want to find the websites which have low-ranked web pages. You have been tasked to
find the low-rated links based on user comments, likes, etc.
Retail: A retail company wants to enhance their customer experience by analysing the
customer reviews for different products, so that they can inform the corresponding
vendors and manufacturers about product defects and shortcomings. You have been
tasked to analyse the complaints filed under each product and the total number of
complaints filed based on geography, type of product, etc. You also have to figure out
the complaints which have had no timely response.
Tourism: A new company in the travel domain wants to start their business
efficiently, i.e., high profit for low TCO. They want to analyse and find the most
frequent and popular tourism destinations for their business. You have been tasked to
analyse the top tourism destinations that people frequently travel to, and the top
locations from where most tourism trips start. They also want you to analyse and find
the destinations with costly tourism packages.
Appendix
Data Mining: Data mining is the process of looking for hidden, valid, and
potentially useful patterns in huge data sets. Data Mining is all about
discovering unsuspected/previously unknown relationships amongst the
data. It is a multi-disciplinary skill that uses machine learning, statistics,
AI and database technology.
Natural Language Processing (NLP): NLP gives the machines the ability
to read, understand and derive meaning from human languages.
Text Analytics (TA): TA is the process of extracting meaning out of text.
For example, this can be analyzing text written by customers in a
customer survey, with the focus on finding common themes and trends.
The idea is to be able to examine the customer feedback to inform the
business on taking strategic action, in order to improve customer
experience.
Noisy text analytics: It is a process of information extraction whose goal
is to automatically extract structured or semi-structured information from
noisy unstructured text data.
Appendix cont…
ETL: ETL is short for extract, transform, load, three database functions that are
combined into one tool to pull data out of one database and place it into another
database.
Extract is the process of reading data from a database. In this stage, the data is
collected, often from multiple and different types of sources.
Transform is the process of converting the extracted data from its previous form
into the form it needs to be in so that it can be placed into another database.
Transformation occurs by using rules or lookup tables or by combining the data
with other data.
Load is the process of writing the data into the target database.
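A minimal end-to-end ETL sketch (assumed Python code with made-up source data and a made-up lookup table), showing the three steps in order:

# Minimal ETL sketch: extract records from a CSV string, transform them
# (type conversion, trimming, lookup), and load them into SQLite.
import csv
import io
import sqlite3

source_csv = "id,amount,country\n1, 100 ,in\n2, 250 ,us\n"
country_lookup = {"in": "India", "us": "United States"}        # transformation table

# Extract: read data from the source.
rows = list(csv.DictReader(io.StringIO(source_csv)))

# Transform: convert types, trim whitespace, apply the lookup table.
clean = [(int(r["id"]), float(r["amount"].strip()), country_lookup[r["country"]])
         for r in rows]

# Load: write the transformed rows into the target database.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE fact_sales (id INTEGER, amount REAL, country TEXT)")
target.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", clean)
print(target.execute("SELECT * FROM fact_sales").fetchall())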
1. You are planning the marketing strategy for a new product in your
company. Identify and list some limitations of structured data
related to this work.
2. In what ways does analyzing Big Data help organizations prevent
fraud?
3. Discuss the techniques of parallel computing.
4. Discuss the features of cloud computing that can be used to handle
Big Data.
5. Discuss similarities and differences between ELT and ETL.
6. It is impossible for a web service to provide the following three
guarantees at the same time: consistency, availability, and
partition tolerance. Justify this with a suitable explanation.
7. Hotel Booking: are we double-booking the same room? Justify this
statement with CAP theorem.
10. Consider an online bookstore OLTP model with the entities and attributes
as follows.
Publisher (PUBLISHER_ID, NAME)
Subject (SUBJECT_ID, NAME)
Author (AUTHOR_ID, NAME)
Publication (PUBLICATION_ID, SUBJECT_ID (FK), AUTHOR_ID (FK), TITLE)
Edition (PUBLISHER_ID (FK), PUBLICATION_ID (FK), PRINT_DATE, PAGES, PRICE, FORMAT)
Review (REVIEW_ID, PUBLICATION_ID (FK), REVIEW_DATE, TEXT)
Draw the equivalent OLAP conceptual, logical, and physical data model.