
BIG DATA ASSIGNMENT -1

18131A05K4
Voonna Manideep
1. Explain Big Data. What are the characteristics of big data?

A) Big Data: Big Data is a collection of data that is huge in volume and keeps growing
exponentially with time. Its size and complexity are so large that no traditional data
management tool can store or process it efficiently.
Examples: New York Stock Exchange, Social Media
Types of Big Data:
1. Structured: Any data that can be stored, accessed and processed in a fixed format is
termed 'structured' data.
Example: An employee table in a database.
2. Unstructured: Any data whose form or structure is unknown is classified as
unstructured data.
Example: The output generated by a Google search.
3. Semi-structured: Semi-structured data can contain both forms of data.
Example: Personal data stored in an XML file
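To make the semi-structured case concrete, the short sketch below (the record contents and field names are illustrative assumptions, not from the source) parses a small XML record in Python: the tags describe each field, but there is no fixed table schema behind them.

# Minimal sketch: reading a hypothetical semi-structured personal record.
# The XML content and field names are assumptions made for illustration.
import xml.etree.ElementTree as ET

record = """<person>
  <name>A. Kumar</name>
  <email>a.kumar@example.com</email>
  <phone type="mobile">+91-0000000000</phone>
</person>"""

root = ET.fromstring(record)
# The tags are self-describing, but different records may carry different
# fields, which is why such data is called semi-structured.
print(root.findtext("name"), root.findtext("email"))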

Characteristics of Big Data:

Big data can be described by the following characteristics:

• Volume
• Variety
• Velocity
• Variability

(i) Volume – The name Big Data itself refers to an enormous size. The size of data plays a
very crucial role in determining the value that can be drawn from it, and whether a
particular dataset can actually be considered Big Data also depends on its volume. Hence,
'Volume' is one characteristic which needs to be considered while dealing with Big Data
solutions.

(ii) Variety – The next aspect of Big Data is its variety.

Variety refers to heterogeneous sources and the nature of data, both structured and
unstructured. In earlier days, spreadsheets and databases were the only sources of data
considered by most applications. Nowadays, data in the form of emails, photos, videos,
monitoring devices, PDFs, audio, etc. is also being considered by analysis applications.
This variety of unstructured data poses certain issues for storing, mining and analyzing
data.

(iii) Velocity – The term 'velocity' refers to the speed at which data is generated. How fast
the data is generated and processed to meet demand determines the real potential of the
data.

Big Data velocity deals with the speed at which data flows in from sources like business
processes, application logs, networks, social media sites, sensors, mobile devices, etc.
The flow of data is massive and continuous.

(iv) Variability – This refers to the inconsistency that the data can show at times, which
hampers the process of handling and managing the data effectively.
2. What are the challenges of big data?
A) There are 6 major challenges of big data.
They are:
1. Lack of proper understanding of big data:
Companies fail in their Big Data initiatives due to insufficient understanding.
Employees may not know what data is, how it is stored and processed, why it is
important, or where it comes from. Data professionals may know what is going on, but
others may not have a clear picture.
For example, if employees do not understand the importance of data storage, they
might not keep backups of sensitive data or use databases properly for storage. As a
result, when this important data is required, it cannot be retrieved easily.
2. Data growth issues:
One of the most pressing challenges of Big Data is storing all these huge sets of data
properly. The amount of data being stored in companies' data centers and databases is
increasing rapidly, and as these data sets grow exponentially with time, they become
extremely difficult to handle.
Most of the data is unstructured and comes from documents, videos, audio, text files
and other sources, which means it cannot simply be kept in traditional databases.
3. Confusion while big data tool selection:
Companies often get confused while selecting the best tool for Big Data analysis and
storage. Is HBase or Cassandra the best technology for data storage? Is Hadoop
MapReduce good enough or will Spark be a better option for data analytics and
storage?
These questions bother companies, and sometimes they are unable to find the answers.
They end up making poor decisions and selecting an inappropriate technology. As a
result, money, time, effort and work hours are wasted.
4. Lack of data professionals:
To run these modern technologies and Big Data tools, companies need skilled data
professionals. These professionals include data scientists, data analysts and data
engineers who are experienced in working with the tools and making sense out of huge
data sets.
Companies face a problem of lack of Big Data professionals. This is because data
handling tools have evolved rapidly, but in most cases, the professionals have not.
Actionable steps need to be taken in order to bridge this gap.
5. Securing data:
Securing these huge sets of data is one of the daunting challenges of Big Data. Companies
are often so busy understanding, storing and analyzing their data sets that they push
data security to later stages. This is not a smart move, as unprotected data repositories
can become breeding grounds for malicious hackers.
6. Integrating data from a variety of sources:
Data in an organization comes from a variety of sources, such as social media pages,
ERP applications, customer logs, financial reports, e-mails, presentations and reports
created by employees. Combining all this data to prepare reports is a challenging task.
This is an area often neglected by firms, but data integration is crucial for analysis,
reporting and business intelligence, so it has to be perfect.

3. Distinguish between IOT and Big Data


A)

• IoT is a global system of interrelated computing devices that are able to sense, collect,
and exchange data over the Internet, whereas Big Data refers to massive volumes of data,
generated from a variety of sources, that are too large to process using traditional
techniques.
• IoT collects, analyzes and processes data streams in real time, without any delay, so
that control decisions can be made in an effective manner. In Big Data, the data streams
are not processed in real time; there is a delay between when the data is collected and
when it is processed.
• The concept of IoT is to interconnect devices to create a smart environment, making
machines smart enough to bypass human effort. The concept of Big Data is to find
insights in new and emerging types of data and content that lead to better decisions and
strategic business moves.
• IoT mainly involves analyzing machine-generated data, such as data from sensors in
home appliances, whereas Big Data deals with human-generated data such as social
media usage, photos and videos, etc.

4. What are the cloud computing preliminaries? Explain in detail.
A) Cloud Computing: Cloud computing is the delivery of on-demand computing services,
from applications to storage and processing power, typically over the Internet and on a
pay-as-you-go basis.
Features of Cloud Computing:

i. It’s managed
ii. It’s on-demand
iii. It’s public or private
Types of Cloud Computing: There are 3 types of cloud computing
i. Infrastructure as a Service (IaaS): means you're buying access to raw computing
hardware over the Net, such as servers or storage. Since you buy only what you need
and pay as you go, this is often referred to as utility computing. Ordinary web
hosting is a simple example of IaaS: you pay a monthly subscription or a per-
megabyte/gigabyte fee to have a hosting company serve up files for your website
from their servers.
ii. Software as a Service (SaaS): means you use a complete application running on
someone else's system. Web-based email and Google Documents are perhaps the
best-known examples. Zoho is another well-known SaaS provider offering a
variety of office applications online.
iii. Platform as a Service (PaaS): means you develop applications using Web-based
tools so they run on systems software and hardware provided by another
company. So, for example, you might develop your own ecommerce website but
have the whole thing, including the shopping cart, checkout, and payment
mechanism running on a merchant's server. App Cloud and the Google App
Engine are examples of PaaS.
Advantages of Cloud Computing:
1) Cost Savings
2) Security
3) Flexibility
4) Mobility
5) Insight
6) Increased Collaboration
7) Quality Control
8) Disaster Recovery
9) Loss Prevention
10) Automatic Software Updates
11) Competitive Edge
12) Sustainability
Disadvantages of Cloud Computing:
1) Network Connection Dependency
2) Limited Features
3) Loss of Control
4) Security
5) Technical issues

5. What is big data generation?


A) Big data generation means the generation of data in large quantities from various
sources. Big Data is generated from three sources: people, machines, and corporations.
Internet data: generated from social networks (Facebook, Twitter, Instagram, LinkedIn),
emails, the Internet, documents, blogs, among others.

IoT data: generated from sensors, satellites, computer log files, cameras, genetic
sequencing machines, space telescopes, probes, among others.
Corporate data: generated from transaction and administrative systems, credit cards,
financial systems, accounting, e-commerce, sales, medical records, research, among
others.

6. What is big data acquisition and data collection?


A) Data Acquisition:
Data acquisition has been understood as the process of gathering, filtering, and cleaning
data before the data is put in a data warehouse or any other storage solution. The
acquisition of big data is most commonly governed by four of the Vs: volume, velocity,
variety, and value. Most data acquisition scenarios assume high-volume, high-velocity,
high-variety, but low-value data, making it important to have adaptable and time-efficient
gathering, filtering, and cleaning algorithms that ensure that only the high-value
fragments of the data are actually processed by the data-warehouse analysis.
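As a small illustration of this idea, the sketch below (the record layout and the "value" threshold are assumptions made for the example) filters a high-volume stream so that only high-value, well-formed records are passed on for storage and analysis.

# Illustrative acquisition sketch: gather, clean and filter a record stream so
# that only high-value fragments reach the warehouse. The field names and the
# amount threshold are assumptions made for this example.
def acquire(records, min_amount=100.0):
    for record in records:
        # Cleaning: drop malformed records that lack required fields.
        if "id" not in record or "amount" not in record:
            continue
        # Filtering: keep only the high-value fragments of the stream.
        if record["amount"] >= min_amount:
            yield record

stream = [
    {"id": 1, "amount": 250.0},
    {"id": 2, "amount": 12.5},
    {"amount": 999.0},  # malformed: missing id
]
print(list(acquire(stream)))  # only the first record survives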
Data Collection:
During data collection, the researchers must identify the data types, the sources of data,
and what methods are being used.
There are 2 methods of Data Collection:
i. Primary: As the name implies, this is original, first-hand data collected by the
data researchers. This process is the initial information gathering step, performed
before anyone carries out any further or related research. Primary data results are
highly accurate provided the researcher collects the information. However, there’s
a downside, as first-hand research is potentially time-consuming and expensive.
ii. Secondary: Secondary data is second-hand data collected by other parties that has
already undergone statistical analysis. This data is either information that
the researcher has tasked other people to collect or information the researcher has
looked up. Simply put, it’s second-hand information. Although it’s easier and
cheaper to obtain than primary information, secondary information raises
concerns regarding accuracy and authenticity. Quantitative data makes up a
majority of secondary data.

7. What are data preprocessing techniques?


A) Data preprocessing involves 4 techniques:

They are:

a. Data Cleaning

b. Data Integration

c. Data Transformation

d. Data Reduction

Data Cleaning: Data cleaning methods aim to fill in missing values, smooth out noise
while identifying outliers, and correct discrepancies in the data. Unclean data can
confuse both the analysis and the model, so running the data through various data
cleaning/cleansing methods is an important data preprocessing step.
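A minimal sketch of these cleaning steps in Python, assuming the pandas library is available; the column names, values and the outlier rule are made up for illustration.

# Data-cleaning sketch: fill missing values and drop an obvious outlier.
# The columns and the age threshold are illustrative assumptions.
import pandas as pd

df = pd.DataFrame({"age": [23, None, 29, 31, 250],
                   "city": ["A", "B", None, "B", "A"]})

df["age"] = df["age"].fillna(df["age"].median())  # fill missing numeric values
df["city"] = df["city"].fillna("unknown")         # fill missing categorical values
df = df[df["age"] < 120]                          # discard an implausible outlier
print(df)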

Data Integration: This is a data analysis task that combines data from multiple sources
into a coherent data store. These sources may include multiple databases. Databases and
data warehouses have metadata (data about data), which helps in avoiding errors.
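A small sketch of such integration with pandas; the two source tables and the shared cust_id key are assumptions made only for the example.

# Integration sketch: combine records from two hypothetical sources on a
# shared key so they form one coherent table.
import pandas as pd

crm = pd.DataFrame({"cust_id": [1, 2], "name": ["Asha", "Ravi"]})
erp = pd.DataFrame({"cust_id": [1, 2], "total_orders": [5, 3]})

combined = crm.merge(erp, on="cust_id", how="inner")
print(combined)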

Data Transformation: This stage is used to convert the data into a format that can be
used in the mining process. This is done in the following ways:

1. Normalization: It is done to scale the data values into a specified range, such as
-1.0 to 1.0 or 0.0 to 1.0 (see the sketch after this list).

2. Concept Hierarchy Generation: Using concept hierarchies, low-level or primitive/raw
data is replaced with higher-level concepts in data generalization. For example, a
categorical attribute such as street can be generalized to higher-level notions such as
city and country. Similarly, numeric values of an attribute like age can be mapped to
higher-level concepts such as young, middle-aged, or senior.

3. Smoothing: Smoothing works to remove the noise from the data. Such
techniques include binning, clustering, and regression.

4. Aggregation: Aggregation is the process of applying summary or aggregation
operations on data. Daily sales data, for example, might be combined to calculate
monthly and annual totals.
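As referenced above, here is a brief sketch of min-max normalization and a simple aggregation using pandas; the sales figures and column names are illustrative assumptions.

# Transformation sketch: min-max normalization into [0, 1] and aggregation of
# daily sales into monthly totals. The data is made up for illustration.
import pandas as pd

daily = pd.DataFrame({"month": ["Jan", "Jan", "Feb", "Feb"],
                      "sales": [120.0, 80.0, 200.0, 160.0]})

# Normalization: x' = (x - min) / (max - min) scales values into [0, 1].
s = daily["sales"]
daily["sales_norm"] = (s - s.min()) / (s.max() - s.min())

# Aggregation: combine daily figures into monthly totals.
monthly = daily.groupby("month")["sales"].sum()
print(daily)
print(monthly)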

Data Reduction: Data mining is a methodology for dealing with huge amounts of data,
and when the volume of data is large, analysis becomes more difficult. To overcome this,
we employ data reduction techniques, whose goal is to improve storage efficiency while
lowering data storage and analysis costs. There are two types of data reduction:
Dimensionality Reduction and Numerosity Reduction.
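As a small illustration of dimensionality reduction, the sketch below uses PCA from scikit-learn (an assumption about available tooling, with a toy dataset) to project three features down to two components.

# Dimensionality-reduction sketch: PCA projects 3 features onto 2 components.
# scikit-learn and the toy data are assumptions made for illustration only.
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.2],
              [2.2, 2.9, 0.3],
              [1.9, 2.2, 0.8]])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)  # (4, 2): same rows, fewer features to store and analyze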
