BDA Assignment L9
18131A05K4
Voonna Manideep
1. Explain Big Data. What are the characteristics of big data?
A) Big Data: Big Data is a collection of data that is huge in volume and keeps growing
exponentially with time. Its size and complexity are so large that no traditional data
management tool can store or process it efficiently.
Examples: New York Stock Exchange, Social Media
Types of Big Data:
1. Structured: Any data that can be stored, accessed and processed in a fixed format is
termed 'structured' data.
Example: An employee table in a data base.
2. Unstructured: Any data with unknown form or the structure is classified as
unstructured data.
Example: Output generated by Google search.
3. Semi-structured: Semi-structured data can contain both the forms of data.
Example: Personal data stored in an XML file
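As a small illustration of semi-structured data, the hypothetical XML snippet below holds personal data whose tags give partial structure even though the fields vary from record to record (a sketch using Python's standard library; the names and fields are made up):

```python
import xml.etree.ElementTree as ET

# Hypothetical semi-structured personal data: tags give some structure,
# but each <person> record may carry different fields.
xml_data = """
<people>
  <person><name>Asha</name><age>29</age></person>
  <person><name>Ravi</name><city>Pune</city></person>
</people>
"""

root = ET.fromstring(xml_data)
# Turn each record into a dict of whatever fields it happens to have.
records = [{child.tag: child.text for child in person} for person in root]
print(records)
```

Because the schema is only partial, the code reads whichever tags are present instead of assuming a fixed column layout, which is exactly what distinguishes semi-structured from structured data.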
Characteristics of Big Data:
• Volume
• Variety
• Velocity
• Variability
(i) Volume – The name Big Data itself relates to an enormous size. The size of data plays
a crucial role in determining its value, and whether particular data can actually be
considered Big Data depends on its volume. Hence, 'Volume' is one characteristic which
needs to be considered while dealing with Big Data solutions.
(ii) Variety – Variety refers to heterogeneous sources and the nature of data, both structured and
unstructured. During earlier days, spreadsheets and databases were the only sources of
data considered by most of the applications. Nowadays, data in the form of emails,
photos, videos, monitoring devices, PDFs, audio, etc. are also being considered in the
analysis applications. This variety of unstructured data poses certain issues for storage,
mining and analyzing data.
(iii) Velocity – The term 'velocity' refers to the speed at which data is generated. How
fast the data is generated and processed to meet demand determines the real potential of
the data.
Big Data Velocity deals with the speed at which data flows in from sources like business
processes, application logs, networks, and social media sites, sensors, Mobile devices,
etc. The flow of data is massive and continuous.
(iv) Variability – This refers to the inconsistency which can be shown by the data at
times, thus hampering the process of being able to handle and manage the data
effectively.
2. What are the challenges of big data?
A) There are 6 major challenges of big data.
They are:
1. Lack of proper understanding of big data:
Companies fail in their Big Data initiatives due to insufficient understanding.
Employees may not know what data is, its storage, processing, importance, and
sources. Data professionals may know what is going on, but others may not have a
clear picture.
For example, if employees do not understand the importance of data storage, they
might not keep the backup of sensitive data. They might not use databases properly
for storage. As a result, when this important data is required, it cannot be retrieved
easily.
2. Data growth issues:
One of the most pressing challenges of Big Data is storing all these huge sets of data
properly. The amount of data being stored in data centers and databases of companies
is increasing rapidly. As these data sets grow exponentially with time, it gets
extremely difficult to handle.
Most of the data is unstructured and comes from documents, videos, audio files, text files
and other sources. This means it cannot be stored in traditional relational databases.
3. Confusion while big data tool selection:
Companies often get confused while selecting the best tool for Big Data analysis and
storage. Is HBase or Cassandra the best technology for data storage? Is Hadoop
MapReduce good enough or will Spark be a better option for data analytics and
storage?
These questions bother companies, and sometimes they are unable to find the answers.
They end up making poor decisions and selecting an inappropriate technology. As a
result, money, time, effort and work hours are wasted.
4. Lack of data professionals:
To run these modern technologies and Big Data tools, companies need skilled data
professionals. These professionals will include data scientists, data analysts and data
engineers who are experienced in working with the tools and making sense out of
huge data sets.
Companies face a problem of lack of Big Data professionals. This is because data
handling tools have evolved rapidly, but in most cases, the professionals have not.
Actionable steps need to be taken in order to bridge this gap.
5. Securing data:
Securing these huge sets of data is one of the daunting challenges of Big Data. Often
companies are so busy in understanding, storing and analyzing their data sets that
they push data security for later stages. But, this is not a smart move as unprotected
data repositories can become breeding grounds for malicious hackers.
6. Integrating data from a variety of sources:
Data in an organization comes from a variety of sources, such as social media pages,
ERP applications, customer logs, financial reports, e-mails, presentations and reports
created by employees. Combining all this data to prepare reports is a challenging task.
This is an area often neglected by firms. But, data integration is crucial for analysis,
reporting and business intelligence, so it has to be perfect.
i. It’s managed
ii. It’s on-demand
iii. It’s public or private
Types of Cloud Computing: There are 3 types of cloud computing
i. Infrastructure as a Service (IaaS): means you're buying access to raw computing
hardware over the Internet, such as servers or storage. Since you buy only what you
need and pay as you go, this is often referred to as utility computing. Ordinary web
hosting is a simple example of IaaS: you pay a monthly subscription or a
per-megabyte/gigabyte fee to have a hosting company serve up files for your website
from their servers.
ii. Software as a Service (SaaS): means you use a complete application running on
someone else's system. Web-based email and Google Documents are perhaps the
best-known examples. Zoho is another well-known SaaS provider offering a
variety of office applications online.
iii. Platform as a Service (PaaS): means you develop applications using Web-based
tools so they run on systems software and hardware provided by another
company. So, for example, you might develop your own ecommerce website but
have the whole thing, including the shopping cart, checkout, and payment
mechanism running on a merchant's server. App Cloud and the Google App
Engine are examples of PaaS.
Advantages of Cloud Computing:
1) Cost Savings
2) Security
3) Flexibility
4) Mobility
5) Insight
6) Increased Collaboration
7) Quality Control
8) Disaster Recovery
9) Loss Prevention
10) Automatic Software Updates
11) Competitive Edge
12) Sustainability
Disadvantages of Cloud Computing:
1) Network Connection Dependency
2) Limited Features
3) Loss of Control
4) Security
5) Technical issues
IoT data: Data generated by sensors, satellites, computer log files, cameras, genetic
sequencing machines, space telescopes, probes, and other devices.
Corporations: Data generated by transaction and administrative systems, credit cards,
financial systems, accounting, e-commerce, sales, medical records, research, and
more.
They are:
a. Data Cleaning
b. Data Integration
c. Data Transformation
d. Data Reduction
Data Cleaning: Data cleaning methods aim to fill in missing values, smooth out noise
while identifying outliers, and fix data discrepancies. Unclean data can confuse the
mining process and the model. Therefore, running the data through various Data
Cleaning/Cleansing methods is an important Data Preprocessing step.
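The cleaning steps above can be sketched in Python with made-up sensor readings; the mean-based fill and the 1.5-standard-deviation outlier rule are illustrative choices, not fixed rules:

```python
import statistics

# Hypothetical sensor readings; None marks a missing value,
# and 250.0 is an obvious noise spike.
readings = [10.0, 11.0, None, 9.5, 250.0, 10.5]

# Fill missing values with the mean of the observed values.
observed = [r for r in readings if r is not None]
mean = statistics.mean(observed)
filled = [r if r is not None else mean for r in readings]

# Flag outliers as points more than 1.5 standard deviations
# from the mean (an illustrative threshold for this tiny sample).
stdev = statistics.stdev(observed)
outliers = [r for r in filled if abs(r - mean) > 1.5 * stdev]
print(filled, outliers)
```

In practice the fill value (mean, median, or a model-based estimate) and the outlier rule are chosen per dataset; this sketch only shows the shape of the step.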
Data Integration: This step combines data from multiple sources into a coherent data
store for the analysis task. The sources may include multiple databases and data
warehouses, whose metadata (that is, data about data) helps in avoiding integration
errors.
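A minimal sketch of integration, assuming two hypothetical sources (a CRM and a billing system) that describe the same customers by id:

```python
# Two hypothetical sources describing the same customers, keyed by id.
crm = {1: {"name": "Asha"}, 2: {"name": "Ravi"}}
billing = {1: {"balance": 120.0}, 2: {"balance": 0.0}}

# Integrate into one coherent store by combining the fields
# recorded for each customer id across both sources.
integrated = {
    cid: {**crm.get(cid, {}), **billing.get(cid, {})}
    for cid in set(crm) | set(billing)
}
print(integrated)
```

Real integration also has to reconcile conflicting field names and values across sources, which is where the metadata mentioned above comes in.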
Data Transformation: This stage is used to convert the data into a format that can be
used in the mining process, for example by:
• Smoothing: Smoothing works to remove the noise from the data. Such
techniques include binning, clustering, and regression.
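Smoothing by binning can be sketched as follows: sort the values, split them into equal-depth bins, and replace each value by its bin's mean (the numbers are illustrative):

```python
# Smoothing by bin means: sort, split into equal-depth bins of 4,
# then replace every value by the mean of its bin.
values = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bin_size = 4

smoothed = []
for i in range(0, len(values), bin_size):
    bin_values = values[i:i + bin_size]
    bin_mean = sum(bin_values) / len(bin_values)
    smoothed.extend([bin_mean] * len(bin_values))
print(smoothed)
```

Replacing values by bin means flattens small fluctuations inside each bin while keeping the overall trend, which is the point of smoothing.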
Data Reduction: Data mining deals with large amounts of data, and analysis becomes
more difficult as the volume grows. Data reduction techniques address this: their goal is
to improve storage efficiency while lowering the cost of storing and analyzing data.
There are 2 types of data reduction. They are Dimensionality Reduction and Numerosity
Reduction.
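Both kinds of reduction can be sketched crudely in Python: dropping near-constant columns stands in for dimensionality reduction, and randomly sampling rows is a simple numerosity reduction (the dataset and the 0.1 variance threshold are made up for illustration):

```python
import random
import statistics

# Hypothetical dataset: rows of (feature1, feature2, feature3).
# feature2 and feature3 are nearly constant across rows.
rows = [(1.0, 5.0, 0.01), (2.0, 4.9, 0.01), (3.0, 5.1, 0.02), (4.0, 5.0, 0.01)]

# Dimensionality reduction (crude sketch): drop near-constant columns,
# since they carry little information for the analysis.
cols = list(zip(*rows))
keep = [i for i, col in enumerate(cols) if statistics.pvariance(col) > 0.1]
reduced = [tuple(row[i] for i in keep) for row in rows]

# Numerosity reduction: keep only a random sample of the rows.
random.seed(0)
sample = random.sample(reduced, 2)
print(keep, reduced, len(sample))
```

Real dimensionality reduction typically uses techniques such as PCA rather than a variance cutoff, but the effect is the same: fewer columns and fewer rows to store and analyze.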