Big Data Answers
BIG DATA: 1. Data is defined as the quantities, characters, or symbols on which operations are performed by a computer.
2. Data may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
3. Big Data is a term used to describe a collection of data that is huge in size and yet growing exponentially with time.
4. In short, such data is so large and complex that none of the traditional data management tools are able to store it or process it efficiently.
I) Structured:
1. Any data that can be stored, accessed, and processed in the form of a fixed format is termed Structured Data.
2. It accounts for about 20% of the total existing data and is used the most in programming and computer-related activities.
3. Structured data is classified based on its source into machine-generated and human-generated data.
4. All the data received from sensors, weblogs, and financial systems is classified under machine-generated data.
5. These include medical devices, GPS data, and data on usage statistics captured by servers and applications.
6. Human-generated structured data mainly includes all the data a human inputs into a computer, such as their name and other personal details.
7. When a person clicks a link on the internet, or even makes a move in a game, data is created.
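The fixed format of structured data can be illustrated with a small relational table. Below is a minimal sketch using Python's built-in sqlite3 module; the table and column names are hypothetical, chosen only for illustration:

```python
import sqlite3

# Structured data fits a fixed schema: every row has the same typed columns.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
cur.executemany(
    "INSERT INTO people (id, name, age) VALUES (?, ?, ?)",
    [(1, "Asha", 34), (2, "Ravi", 41)],
)

# Because the format is fixed, the data can be queried directly.
for row in cur.execute("SELECT name, age FROM people WHERE age > 35"):
    print(row)  # ('Ravi', 41)
conn.close()
```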
II) Unstructured:
1. Any data with an unknown form or structure is classified as unstructured data.
2. The rest of the data created, about 80% of the total, accounts for unstructured big data.
3. Unstructured data is also classified based on its source, into machine-generated or human-
generated.
4. Machine-generated data accounts for all the satellite images, the scientific data from various
experiments and radar data captured by various facets of technology.
5. Human-generated unstructured data is found in abundance across the internet since it includes
social media data, mobile data, and website content.
6. This means that the pictures we upload to Facebook or Instagram, the videos we watch on YouTube, and even the text messages we send all contribute to the gigantic heap that is unstructured data.
7. Examples of unstructured data include text, video, audio, mobile activity, social media activity, satellite imagery, surveillance imagery, etc.
III) Semi-Structured:
1. Information that is not in the traditional database format like structured data, but contains some organizational properties that make it easier to process, is included in semi-structured data.
2. Examples of semi-structured data include XML documents and NoSQL databases; for instance, personal data stored in an XML file.
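A short sketch of that XML example: the document below (a hypothetical fragment of personal data, not from the notes) has no rigid schema, but its tags give it organizational properties a program can exploit.

```python
import xml.etree.ElementTree as ET

# Hypothetical personal data stored as XML: tags act as self-describing labels.
doc = """
<people>
  <person>
    <name>Asha</name>
    <age>34</age>
  </person>
  <person>
    <name>Ravi</name>
  </person>  <!-- fields may be missing: no rigid schema -->
</people>
"""

root = ET.fromstring(doc)
for person in root.findall("person"):
    name = person.findtext("name")
    age = person.findtext("age", default="unknown")  # tolerate missing fields
    print(name, age)
```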
APPLICATIONS OF BIG DATA:
I) Healthcare
1. Big Data has already started to create a huge difference in the healthcare sector.
2. With the help of predictive analytics, medical professionals and HCPs are now able to provide personalized healthcare services to individual patients.
3. Apart from that, fitness wearables, telemedicine, and remote monitoring, all powered by Big Data and AI, are helping change lives for the better.
II) Academia
1. Academic institutions are investing in digital courses powered by Big Data technologies to aid the all-round development of budding learners.
III) Banking
1. Big Data tools can efficiently detect fraudulent acts in real time, such as the misuse of credit/debit cards, archival of inspection tracks, faulty alteration of customer stats, etc.
IV) Manufacturing
1. According to a TCS Global Trend Study, the most significant benefit of Big Data in manufacturing is improving supply strategies and product quality.
2. In the manufacturing sector, Big Data helps create a transparent infrastructure, thereby predicting the uncertainties and incompetencies that can affect the business adversely.
V) IT
1. As some of the largest users of Big Data, IT companies around the world are using it to optimize their functioning, enhance employee productivity, and minimize risks in business operations.
2. By combining Big Data technologies with ML and AI, the IT sector is continually powering
innovation to find solutions even for the most complex of problems.
DATA ANALYSIS:
Data analysis is a technique that typically involves multiple activities such as gathering, cleaning, and organizing the data. These processes, which usually include data analysis software, are necessary to prepare the data for business purposes. Data analysis is also known as data analytics, described as the science of analyzing raw data to draw informed conclusions based on the data.
Data analysis methods and techniques are useful for finding insights in data, such as metrics, facts, and figures. The two primary methods for data analysis are qualitative data analysis techniques and quantitative data analysis techniques. These techniques can be used independently or in combination with each other to help business leaders and decision-makers acquire business insights from different data types.
Quantitative data analysis
Quantitative data analysis involves working with numerical variables — including statistics,
percentages, calculations, measurements, and other data — as the nature of quantitative data is
numerical. Quantitative data analysis techniques typically include working with algorithms, mathematical analysis tools, and software to manipulate data and uncover insights that reveal business value.
For example, a financial data analyst can change one or more variables on a company’s Excel balance
sheet to project their employer’s future financial performance. Quantitative data analysis can also
be used to assess market data to help a company set a competitive price for its new product.
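As a minimal sketch of that idea in Python, here is a projection driven by a single changeable variable; the growth rates, margin, and figures are entirely hypothetical, not drawn from any real balance sheet:

```python
# Project future revenue by varying one assumption, much as an analyst
# would change a cell in a spreadsheet. All figures are hypothetical.
revenue = 1_200_000.0      # current annual revenue
operating_margin = 0.18    # assumed constant operating margin

def project(revenue, growth_rate, years):
    """Compound revenue forward and print projected operating profit."""
    for year in range(1, years + 1):
        revenue *= 1 + growth_rate
        print(f"year {year}: revenue={revenue:,.0f} "
              f"profit={revenue * operating_margin:,.0f}")

# Changing one variable (the growth rate) yields different projections.
project(revenue, growth_rate=0.05, years=3)
project(revenue, growth_rate=0.10, years=3)
```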
Qualitative data analysis
Qualitative data describes information that is typically nonnumerical. The qualitative data analysis approach involves working with unique identifiers, such as labels and properties, and categorical variables. A data analyst may use firsthand or participant observation approaches, conduct interviews, run focus groups, or review documents and artifacts in qualitative data analysis.
Qualitative data analysis can be used in various business processes. For example, qualitative data
analysis techniques are often part of the software development process. Software testers record
bugs — ranging from functional errors to spelling mistakes — to determine bug severity on a
predetermined scale: from critical to low. When collected, this data provides information that can
help improve the final product.
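A small sketch of that testing workflow, assuming a hypothetical list of bug reports labeled on the critical-to-low scale described above:

```python
from collections import Counter

# Hypothetical bug reports recorded by testers: a label (categorical data)
# plus a severity on a predetermined scale.
bugs = [
    {"summary": "crash on save", "severity": "critical"},
    {"summary": "typo on login page", "severity": "low"},
    {"summary": "report totals wrong", "severity": "high"},
    {"summary": "button misaligned", "severity": "low"},
]

# Tallying the categories is a simple qualitative-analysis step: it shows
# where quality problems cluster before release.
by_severity = Counter(bug["severity"] for bug in bugs)
for severity in ("critical", "high", "medium", "low"):
    print(f"{severity:>8}: {by_severity[severity]}")
```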
Examples of data analysis tools include KNIME and QlikView.
Discuss the characteristics of big data?
I) Variety:
1. Variety of Big Data refers to structured, unstructured, and semi-structured data that is gathered from multiple sources.
2. In earlier days, spreadsheets and databases were the only sources of data considered by most applications.
3. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also being considered in analysis applications.
II) Velocity:
1. Big Data Velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc.
2. The speed of data accumulation also plays a role in determining whether the data is categorized as big data or normal data.
3. At first, mainframes were used, and fewer people used computers.
4. Then came the client/server model, and more and more computers came into use.
5. After this, web applications came into the picture and kept increasing over the Internet.
III) Volume:
1. The size of data plays a very crucial role in determining its value.
2. Whether particular data can actually be considered Big Data or not depends upon the volume of data.
3. Hence, 'Volume' is one characteristic which needs to be considered while dealing with Big Data.
4. In 2016, the data created was only 8 ZB, and it is expected that, by 2020, the data would rise to 40 ZB, which is extremely large.
IV) Veracity:
1. Veracity refers to the quality and trustworthiness of the data, i.e., the inconsistencies and uncertainty it may contain.
2. For example, think about Facebook posts, with hashtags, abbreviations, images, videos, etc., which make them unreliable and hamper the quality of their content.
3. Collecting loads and loads of data is of no use if the quality and trustworthiness of the data are not up to the mark.
DATA PREPARATION:
Data preparation is the process of cleaning and transforming raw data prior to processing and analysis. It is an important step before processing and often involves reformatting data, making corrections to data, and combining data sets to enrich the data.
Data preparation is often a lengthy undertaking for data professionals or business users, but it is essential as a prerequisite to put data in context, turn it into insights, and eliminate bias resulting from poor data quality.
For example, the data preparation process usually includes standardizing data formats, enriching source data, and/or removing outliers.
1. Gather data
The data preparation process begins with finding the right data. This can come from an existing data
catalog or can be added ad-hoc.
2. Discover and assess data
After collecting the data, it is important to discover each dataset. This step is about getting to know the data and understanding what has to be done before the data becomes useful in a particular context.
3. Cleanse and validate data
Cleaning up the data is traditionally the most time-consuming part of the data preparation process, but it's crucial for removing faulty data and filling in gaps; typical tasks include removing outliers, filling in missing values, and standardizing formats.
Once data has been cleansed, it must be validated by testing for errors in the data preparation process up to this point. Oftentimes, an error in the system will become apparent during this step and will need to be resolved before moving forward.
4. Transform and enrich data
Transforming data is the process of updating the format or value entries in order to reach a well-defined outcome, or to make the data more easily understood by a wider audience. Enriching data refers to adding and connecting data with other related information to provide deeper insights.
5. Store data
Once prepared, the data can be stored or channeled into a third party application—such as a
business intelligence tool—clearing the way for processing and analysis to take place.
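A compact sketch of the cleanse, validate, transform, and store steps using pandas; the column names, example records, and cleaning rules are illustrative assumptions, not a fixed recipe:

```python
import pandas as pd

# Hypothetical raw order data with the usual problems: duplicates,
# missing values, and inconsistent formats.
raw = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "amount":   [100.0, None, None, 250.0],
    "date":     ["2023-01-05", "2023-01-07", "2023-01-07", "2023-02-10"],
    "country":  ["us", "US", "US", "in"],
})

df = raw.drop_duplicates(subset="order_id")                 # cleanse: remove duplicates
df["amount"] = df["amount"].fillna(df["amount"].median())   # cleanse: fill gaps
df["date"] = pd.to_datetime(df["date"])                     # transform: proper date type
df["country"] = df["country"].str.upper()                   # transform: standardize codes

# Validate: fail fast if an error slipped through the steps above.
assert df["order_id"].is_unique and df["amount"].notna().all()

df.to_csv("prepared_orders.csv", index=False)               # store for analysis/BI tools
```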
Big Data Analytics Life Cycle
The Big Data Analytics life cycle is divided into nine phases, named as:
1. Business Case/Problem Definition
2. Data Identification
3. Data Acquisition and Filtration
4. Data Extraction
5. Data Munging (Validation and Cleaning)
6. Data Aggregation & Representation (Storage)
7. Exploratory Data Analysis
8. Data Visualization (Preparation for Modeling and Assessment)
9. Utilization of Analysis Results
Difference between Data Science and Business Intelligence
HADOOP:
1. Hadoop is an open-source software programming framework for storing a large amount of data and performing computation.
2. Its framework is based on Java programming, with some native code in C and shell scripts.
3. Hadoop is developed by the Apache Software Foundation, and its co-founders are Doug Cutting and Mike Cafarella.
4. The Hadoop framework application works in an environment that provides distributed storage and computation across clusters of computers.
5. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
FEATURES:
1. Low Cost
2. Scalability
3. Fault Tolerance
MAPREDUCE:
1. It is used for efficient processing of large amounts of data (multi-terabyte data sets) on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
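To illustrate the MapReduce model, here is the classic word-count pair of scripts as they might be run with Hadoop Streaming; this is a sketch, and the file names are arbitrary:

```python
#!/usr/bin/env python3
# mapper.py - emit "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py - sum the counts for each word. Hadoop delivers mapper output
# sorted by key, so equal words arrive as one contiguous run.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    line = line.rstrip("\n")
    if not line:
        continue
    word, count = line.rsplit("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

On a typical installation these could be launched with something like `hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input <in> -output <out>`; the exact name and path of the streaming jar vary by Hadoop version.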
HDFS:
1. The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS).
2. It has many similarities with existing distributed file systems; however, the differences from other distributed file systems are significant.
3. It provides high-throughput access to application data and is suitable for applications having large datasets.
HADOOP MODULES:
1. Hadoop Common: These are the Java libraries and utilities required by other Hadoop modules.
2. Hadoop YARN: This is a framework for job scheduling and cluster resource management.