Big Data Answers

What is big data? Discuss the various types of big data.

BIG DATA:

1. Data is defined as the quantities, characters, or symbols on which operations are performed by a computer.

2. Data may be stored and transmitted in the form of electrical signals and recorded on magnetic,
optical, or mechanical recording media.

3. Big Data is also data, but of a huge size.

4. Big Data is a term used to describe a collection of data that is huge in size and yet growing
exponentially with time.

5. In short such data is so large and complex that none of the traditional data management tools are
able to store it or process it efficiently

I) Structured:

1. Any data that can be stored, accessed and processed in the form of fixed format is termed as a
Structured Data.

2. It accounts for about 20% of the total existing data and is used the most in programming and
computer-related activities.

3. There are two sources of structured data: machines and humans.

4. All the data received from sensors, weblogs, and financial systems is classified as machine-generated data.

5. These include medical devices, GPS data, data of usage statistics captured by servers and
applications.

6. Human-generated structured data mainly includes all the data a human inputs into a computer, such as their name and other personal details.

7. When a person clicks a link on the internet, or even makes a move in a game, data is created.

8. Example: An 'Employee' table in a database is an example of Structured Data.
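To make the fixed-format idea concrete, here is a minimal, self-contained sketch using Python's built-in sqlite3 module; the 'Employee' schema and values are made up for illustration, not taken from the notes above:

```python
import sqlite3

# A sketch of structured data: every row follows the same fixed schema.
# The table and column names here are hypothetical, chosen only for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE Employee (
        emp_id  INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        dept    TEXT,
        salary  REAL
    )
""")

cur.executemany(
    "INSERT INTO Employee (emp_id, name, dept, salary) VALUES (?, ?, ?, ?)",
    [(1, "Asha", "Finance", 52000.0), (2, "Ravi", "IT", 61000.0)],
)

# Because the format is fixed, the data can be queried directly.
for row in cur.execute("SELECT name, dept FROM Employee WHERE salary > 55000"):
    print(row)   # ('Ravi', 'IT')

conn.close()
```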

II) Unstructured:

1. Any data with an unknown form or structure is classified as unstructured data.

2. The rest of the data created, about 80% of the total, accounts for unstructured big data.

3. Unstructured data is also classified, based on its source, into machine-generated and human-generated data.

4. Machine-generated data accounts for all the satellite images, the scientific data from various
experiments and radar data captured by various facets of technology.

5. Human-generated unstructured data is found in abundance across the internet since it includes
social media data, mobile data, and website content.

6. This means that the pictures we upload to Facebook or Instagram, the videos we watch on YouTube, and even the text messages we send all contribute to the gigantic heap that is unstructured data.

7. Examples of unstructured data include text, video, audio, mobile activity, social media activity,
satellite imagery, surveillance imagery etc.

III) Semi-Structured:

1. Semi-structured data is information that does not reside in an RDBMS.

2. Information that is not in the traditional database format as structured data, but contains some
organizational properties which make it easier to process, are included in semi-structured data.

3. It may be organized in a tree pattern, which is easier to analyze in some cases.

4. Examples of semi-structured data include XML documents and NoSQL databases, e.g. personal data stored in an XML file.
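As a small illustration of this last example, the following sketch parses a hypothetical XML record using Python's standard library; the tags and values are invented for the example. The record carries its own organizational tags but no fixed relational schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical semi-structured record: tags give it some organization,
# but fields can vary from record to record (no fixed relational schema).
record = """
<person>
    <name>Asha Rao</name>
    <email>asha@example.com</email>
    <skills>
        <skill>Python</skill>
        <skill>Hadoop</skill>
    </skills>
</person>
"""

root = ET.fromstring(record)
print(root.findtext("name"))                             # Asha Rao
print([s.text for s in root.findall("./skills/skill")])  # ['Python', 'Hadoop']
```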

List and describe any six applications of big data.

I) Healthcare & Public Health Industry:

1. Big Data has already started to create a huge difference in the healthcare sector.

2. With the help of predictive analytics, medical professionals and HCPs are now able to provide personalized healthcare services to individual patients.

3. For example, entire DNA strings can now be decoded in minutes.

4. Apart from that, fitness wearables, telemedicine, and remote monitoring – all powered by Big Data and AI – are helping change lives for the better.

II) Academia

1. Big Data is also helping enhance education today.


2. Education is no longer limited to the physical bounds of the classroom – there are numerous online educational courses to learn from.

3. Academic institutions are investing in digital courses powered by Big Data technologies to aid the all-round development of budding learners.

III) Banking

1. The banking sector relies on Big Data for fraud detection.

2. Big Data tools can efficiently detect fraudulent acts in real time, such as misuse of credit/debit cards, archival of inspection tracks, faulty alteration of customer stats, etc.

IV) Manufacturing

1. According to the TCS Global Trend Study, the most significant benefit of Big Data in manufacturing is improving supply strategies and product quality.

2. In the manufacturing sector, Big Data helps create a transparent infrastructure, thereby predicting uncertainties and incompetencies that can affect the business adversely.

V) IT

1. As some of the largest users of Big Data, IT companies around the world are using it to optimize their functioning, enhance employee productivity, and minimize risks in business operations.

2. By combining Big Data technologies with ML and AI, the IT sector is continually powering
innovation to find solutions even for the most complex of problems.

What is data analysis? Discuss various methods of data analysis.

Data analysis is a technique that typically involves multiple activities such as gathering, cleaning, and
organizing the data. These processes, which usually involve data analysis software, are necessary to
prepare the data for business purposes. Data analysis is also known as data analytics, described as
the science of analyzing raw data to draw informed conclusions based on the data.

Data analysis methods and techniques are useful for finding insights in data, such as metrics, facts,
and figures. The two primary methods for data analysis are qualitative data analysis techniques and
quantitative data analysis techniques. These data analysis techniques can be used independently or
in combination with the other to help business leaders and decision-makers acquire business
insights from different data types.
Quantitative data analysis

Quantitative data analysis involves working with numerical variables — including statistics,
percentages, calculations, measurements, and other data — as the nature of quantitative data is
numerical. Quantitative data analysis techniques typically include working with algorithms,
mathematical analysis tools, and software to manipulate data and uncover insights that reveal the
business value.

For example, a financial data analyst can change one or more variables on a company’s Excel balance
sheet to project their employer’s future financial performance. Quantitative data analysis can also
be used to assess market data to help a company set a competitive price for its new product.
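As a rough illustration of the idea (the column names and figures below are invented for the example, not taken from any real balance sheet), a quantitative analysis might compute summary statistics and growth rates over numerical records, for instance with pandas:

```python
import pandas as pd

# Hypothetical monthly sales figures; purely illustrative numbers.
sales = pd.DataFrame({
    "month":   ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "revenue": [120_000, 135_500, 128_750, 150_200, 149_900, 162_300],
})

# Basic quantitative measures: totals, averages, growth.
print("Total revenue:", sales["revenue"].sum())
print("Mean revenue :", sales["revenue"].mean())
print("Month-on-month growth (%):")
print((sales["revenue"].pct_change() * 100).round(2))
```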

Qualitative data analysis

Qualitative data describes information that is typically non-numerical. The qualitative data analysis approach involves working with unique identifiers, such as labels and properties, and with categorical variables rather than numerical measurements. A data analyst may use firsthand or participant observation approaches, conduct interviews, run focus groups, or review documents and artifacts in qualitative data analysis.
Qualitative data analysis can be used in various business processes. For example, qualitative data
analysis techniques are often part of the software development process. Software testers record
bugs — ranging from functional errors to spelling mistakes — to determine bug severity on a
predetermined scale: from critical to low. When collected, this data provides information that can
help improve the final product.
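Continuing the bug-tracking example, here is a minimal sketch (with made-up bug records) of how such categorical, qualitative data might be summarized by grouping and counting rather than by numerical calculation:

```python
from collections import Counter

# Hypothetical bug reports recorded by testers; severity is a categorical label.
bugs = [
    {"id": 101, "summary": "Login button misaligned", "severity": "low"},
    {"id": 102, "summary": "Crash on file upload",    "severity": "critical"},
    {"id": 103, "summary": "Typo on checkout page",   "severity": "low"},
    {"id": 104, "summary": "Report totals incorrect", "severity": "high"},
]

# Qualitative analysis here means grouping and counting by category,
# not computing numerical statistics on the text itself.
severity_counts = Counter(bug["severity"] for bug in bugs)
print(severity_counts)   # Counter({'low': 2, 'critical': 1, 'high': 1})
```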

Discuss any six modern analytical tools.

R programming (programming language for statistical computing)

Tableau Public (data visualization tool)

Python (multipurpose programming language)

SAS (programming environment and language for data manipulation)

Apache Spark (large-scale data processing engine, often used with Hadoop)

Excel (spreadsheet-based analytic tool)

RapidMiner (data science platform)

KNIME (open-source data analytics platform)

QlikView (business intelligence and data visualization tool)

Discuss the characteristics of big data.

I) Variety:

1. Variety of Big Data refers to structured, unstructured, and semi-structured data that is gathered from multiple sources.

2. The type and nature of the data vary greatly.

3. During earlier days, spreadsheets and databases were the only sources of data considered by
most of the applications.

4. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. are
also being considered in the analysis applications.

II) Velocity:

1. The term velocity refers to the speed of generation of data.

2. Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc.

3. The flow of data is massive and continuous.

4. The speed of data accumulation also plays a role in determining whether the data is categorized as big data or normal data.

5. As can be seen from figure 1.2, at first, mainframes were used and fewer people used computers.

6. Then came the client/server model, and more and more computers came into use.

7. After this, the web applications came into the picture and started increasing over the Internet.

III) Volume:

1. The name Big Data itself is related to a size which is enormous.

2. Size of data plays a very crucial role in determining value out of data.

3. Also, whether particular data can actually be considered Big Data or not depends upon the volume of the data.

4. Hence, 'Volume' is one characteristic which needs to be considered while dealing with Big Data.

5. This refers to the data that is tremendously large.


6. As shown in figure 1.3 below, the volume of data is rising exponentially.

7. In 2016, the data created was only 8 ZB, and it was expected that, by 2020, the data would rise to 40 ZB, which is extremely large.

IV) Veracity:

1. The data captured is not always in a consistent format.

2. Data captured can vary greatly.

3. Veracity means the trustworthiness and quality of data.

4. It is necessary that the veracity of the data is maintained.

5. For example, think about Facebook posts, with hashtags, abbreviations, images, videos, etc.,
which make them unreliable and hamper the quality of their content.

6. Collecting loads and loads of data is of no use if the quality and trustworthiness of the data is not
up to the mark.

List and explain the drawbacks of big data.

➨Traditional storage can cost a lot of money when used to store big data.


➨Lots of big data is unstructured.
➨Big data analysis violates principles of privacy.
➨It can be used for manipulation of customer records.
➨It may increase social stratification.
➨Big data analysis is not useful in the short run. Data needs to be analyzed over a longer duration to leverage its benefits.
➨Big data analysis results are misleading sometimes.
➨Rapid updates in big data can cause a mismatch with real figures.

Discuss the various processes used to prepare data for analysis.

Data preparation is the process of cleaning and transforming raw data prior to processing and analysis. It is an important step prior to processing and often involves reformatting data, making corrections to data, and combining data sets to enrich the data.
Data preparation is often a lengthy undertaking for data professionals or business users, but it is
essential as a prerequisite to put data in context in order to turn it into insights and eliminate bias
resulting from poor data quality.

For example, the data preparation process usually includes standardizing data formats, enriching
source data, and/or removing outliers.

1. Gather data

The data preparation process begins with finding the right data. This can come from an existing data
catalog or can be added ad-hoc.

2. Discover and assess data

After collecting the data, it is important to discover each dataset. This step is about getting to know
the data and understanding what has to be done before the data becomes useful in a particular
context.

Discovery can be a big task in itself.

3. Cleanse and validate data

Cleaning up the data is traditionally the most time consuming part of the data preparation process,
but it’s crucial for removing faulty data and filling in gaps. Important tasks here include:

 Removing extraneous data and outliers.


 Filling in missing values.
 Conforming data to a standardized pattern.
 Masking private or sensitive data entries.

Once data has been cleansed, it must be validated by testing for errors in the data preparation process up to this point. Oftentimes, an error in the system will become apparent during this step and will need to be resolved before moving forward.
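As a rough sketch of what these cleansing and validation tasks can look like in practice (the column names, records, and thresholds are hypothetical), using pandas:

```python
import pandas as pd

# Hypothetical raw customer records with typical quality problems.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age":         [34, None, None, 29, 640],       # missing value and an outlier
    "country":     ["IN", "in", "in", "US", "IN"],  # inconsistent formatting
})

clean = (
    raw.drop_duplicates(subset="customer_id")               # remove duplicate records
       .assign(country=lambda d: d["country"].str.upper())  # conform to a standard pattern
)
clean = clean[clean["age"].between(0, 120) | clean["age"].isna()].copy()  # drop extreme outliers
clean["age"] = clean["age"].fillna(clean["age"].median())                 # fill missing values

# Simple validation: fail fast if obviously bad data slipped through.
assert clean["customer_id"].is_unique
assert clean["age"].between(0, 120).all()
print(clean)
```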

4. Transform and enrich data

Transforming data is the process of updating the format or value entries in order to reach a well-
defined outcome, or to make the data more easily understood by a wider audience. Enriching data
refers to adding and connecting data with other related information to provide deeper insights.

5. Store data

Once prepared, the data can be stored or channeled into a third party application—such as a
business intelligence tool—clearing the way for processing and analysis to take place.
Big Data Analytics Life Cycle

The Big Data Analytics life cycle is divided into nine phases, namely:
1. Business Case/Problem Definition
2. Data Identification
3. Data Acquisition and filtration
4. Data Extraction
5. Data Munging(Validation and Cleaning)
6. Data Aggregation & Representation(Storage)
7. Exploratory Data Analysis
8. Data Visualization(Preparation for Modeling and Assessment)
9. Utilization of analysis results

 Phase I Business Problem Definition –


In this stage, the team learns about the business domain, which presents the motivation and goals for carrying out the analysis. Here, the problem is identified, and assumptions are made about how much potential gain a company will make after carrying out the analysis. Important activities in this step include framing the business problem as an analytics challenge that can be addressed in subsequent phases. It helps the decision-makers understand the business resources that will need to be utilized, thereby determining the underlying budget required to carry out the project.
Moreover, it can be determined whether the problem identified is a Big Data problem or not, based on the business requirements in the business case. To qualify as a Big Data problem, the business case should be directly related to one (or more) of the characteristics of volume, velocity, or variety.

 Phase II Data Identification –


Once the business case is identified, it's time to find the appropriate datasets to work with. In this stage, analysis is done to see what other companies have done for a similar case. Depending on the business case and the scope of analysis of the project being addressed, the sources of datasets can be either external or internal to the company. In the case of internal datasets, the data can be collected from internal sources, such as feedback forms or existing software. On the other hand, for external datasets, the list includes datasets from third-party providers.

 Phase III Data Acquisition and filtration –


Once the source of data is identified, it is time to gather the data from those sources. This kind of data is mostly unstructured. It is then subjected to filtration, such as removal of corrupt or irrelevant data, which is of no use to the analysis objective. Here, corrupt data means data that may have missing records, or records that include incompatible data types.
After filtration, a copy of the filtered data is stored and compressed, as it can be of use in the future for some other analysis.

 Phase IV Data Extraction –


Now the data is filtered, but there is still a possibility that some of the entries are incompatible. To rectify this issue, a separate phase is created, known as the data extraction phase. In this phase, the entries that do not match the underlying scope of the analysis are extracted and transformed into a compatible form.

 Phase V Data Munging –


As mentioned in Phase III, the data is collected from various sources, which results in the data being unstructured. There is a possibility that the data has constraints that are unsuitable, which can lead to false results. Hence, there is a need to clean and validate the data.
This includes removing any invalid data and establishing complex validation rules. There are many ways to validate and clean the data. For example, a dataset might contain a few rows with null entries. If a similar dataset is present, then those entries are copied from that dataset; otherwise those rows are dropped.

 Phase VI Data Aggregation & Representation –


The data is now cleansed and validated against certain rules set by the enterprise. But the data might be spread across multiple datasets, and it is not advisable to work with multiple datasets. Hence, the datasets are joined together. For example, if there are two datasets, namely that of a Student Academic section and a Student Personal Details section, then both can be joined via a common field, i.e. the roll number, as sketched below.
This phase calls for intensive operations since the amount of data can be very large. Automation can be brought in so that these steps are executed without any human intervention.
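A minimal sketch of such a join (the column names and records are made up for the example), using pandas:

```python
import pandas as pd

# Hypothetical datasets to be aggregated; 'roll_no' is the common field.
academic = pd.DataFrame({
    "roll_no": [101, 102, 103],
    "grade":   ["A", "B", "A"],
})
personal = pd.DataFrame({
    "roll_no": [101, 102, 103],
    "name":    ["Asha", "Ravi", "Meera"],
    "city":    ["Pune", "Delhi", "Mumbai"],
})

# Join the two datasets on the common field into a single representation.
students = academic.merge(personal, on="roll_no", how="inner")
print(students)
```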

 Phase VII Exploratory Data Analysis –


Here comes the actual step, the analysis task. Depending on the nature of the Big Data problem, the analysis is carried out. Data analysis can be classified into confirmatory analysis and exploratory analysis. In confirmatory analysis, an assumed cause of a phenomenon is stated beforehand; the assumption is called the hypothesis. The data is analyzed to confirm or reject the hypothesis.
This kind of analysis provides definitive answers to specific questions and confirms whether an assumption was true or not. In exploratory analysis, the data is explored to obtain information about why a phenomenon occurred. This kind of analysis doesn't provide definitive answers; instead, it supports the discovery of patterns.

 Phase VIII Data Visualization –


Now we have the answer to some questions, using the information from the data in the
datasets. But these answers are still in a form that can’t be presented to business users. A sort
of representation is required to obtain value or some conclusion from the analysis. Hence,
various tools are used to visualize the data in graphic form, which can easily be interpreted by
business users.
Visualization is said to influence the interpretation of the results. Moreover, it allows the users
to discover answers to questions that are yet to be formulated.

 Phase IX Utilization of analysis results –


The analysis is done, the results are visualized, now it’s time for the business users to make
decisions to utilize the results. The results can be used for optimization, to refine the business
process. It can also be used as an input for the systems to enhance performance.
The block diagram of the life cycle is given below:
Difference between Data Science and Business Intelligence

1. Concept: Data Science is a field that uses mathematics, statistics, and various other tools to discover hidden patterns in the data. Business Intelligence is basically a set of technologies, applications, and processes used by enterprises for business data analysis.

2. Focus: Data Science focuses on the future. Business Intelligence focuses on the past and present.

3. Data: Data Science deals with both structured and unstructured data. Business Intelligence mainly deals only with structured data.

4. Flexibility: Data Science is much more flexible, as data sources can be added as per requirement. Business Intelligence is less flexible, as data sources need to be pre-planned.

5. Method: Data Science makes use of the scientific method. Business Intelligence makes use of the analytic method.

6. Complexity: Data Science has a higher complexity in comparison to Business Intelligence. Business Intelligence is much simpler when compared to Data Science.

7. Expertise: Data Science expertise lies with the data scientist. Business Intelligence expertise lies with the business user.

8. Questions: Data Science deals with the questions "what will happen" and "what if". Business Intelligence deals with the question "what happened".

9. Tools: Data Science tools include SAS, BigML, MATLAB, Excel, etc. Business Intelligence tools include InsightSquared Sales Analytics, Klipfolio, ThoughtSpot, Cyfe, TIBCO Spotfire, etc.

Explain the various steps of the MapReduce functional programming model.

Drivers of big data

1. The digitization of society;


2. The plummeting of technology costs;
3. Connectivity through cloud computing;
4. Increased knowledge about data science;
5. Social media applications;
6. The upcoming Internet-of-Things (IoT).

Explain Hadoop Architecture with neat diagram

HADOOP:

1. Hadoop is an open-source software programming framework for storing a large amount of data and performing computation.

2. Its framework is based on Java programming, with some native code in C and shell scripts.

3. Hadoop is developed by the Apache Software Foundation, and its co-creators are Doug Cutting and Mike Cafarella.

4. The Hadoop framework application works in an environment that provides distributed storage and computation across clusters of computers.

5. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

FEATURES:

1. Low Cost

2. High Computing Power

3. Scalability

4. Huge & Flexible Storage

5. Fault Tolerance & Data Protection

HADOOP ARCHITECTURE:

1. Figure 1.4 shows the architecture of Hadoop.

2. At its core, Hadoop has two major layers namely:

a. Processing/Computation layer (MapReduce),

and

b. Storage layer (Hadoop Distributed File System).


MapReduce:

1. MapReduce is a parallel programming model for writing distributed applications.

2. It is used for efficient processing of large amounts of data (multi-terabyte data-sets), on large clusters
(thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

3. The MapReduce program runs on Hadoop which is an Apache open-source framework.
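To illustrate the programming model itself (this is not Hadoop's actual Java API), here is a minimal single-machine Python sketch of a word-count job expressed as map, shuffle, and reduce steps:

```python
from collections import defaultdict

# A tiny, single-machine sketch of the MapReduce model (word count).
# Real Hadoop MapReduce runs these phases in parallel across a cluster of nodes.

def map_phase(lines):
    """Map step: emit (key, value) pairs -- here, (word, 1) for every word."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def shuffle(pairs):
    """Shuffle/sort step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce step: combine the values for each key -- here, sum the counts."""
    return {key: sum(values) for key, values in groups.items()}

if __name__ == "__main__":
    documents = ["big data is big", "hadoop processes big data"]
    counts = reduce_phase(shuffle(map_phase(documents)))
    print(counts)   # {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}
```

In Hadoop, the framework handles the shuffle/sort and distribution automatically; the programmer supplies only the map and reduce logic.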

Hadoop Distributed File System:

1. The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS).

2. It provides a distributed file system that is designed to run on commodity hardware.

3. It has many similarities with existing distributed file systems.

4. However, the differences from other distributed file systems are significant.

5. It is highly fault-tolerant and is designed to be deployed on low-cost hardware.

6. It provides high throughput access to application data and is suitable for applications having large
datasets.

Hadoop framework also includes the following two modules:

1. Hadoop Common: These are Java libraries and utilities required by other Hadoop modules.
2. Hadoop YARN: This is a framework for job scheduling and cluster resource management.
