BDA
Q. Explain the characteristics of big data analytics. What are the five Vs
of big data? (diagram)
Volume: Volume refers to the huge amounts of data generated daily from many sources, such as business
processes, machines, social media platforms, networks, human interactions, etc.
Eg: Facebook generates approximately a billion messages a day, records the "Like" button being
pressed about 4.5 billion times, and receives more than 350 million new posts each
day.
Variety: Big Data can be structured, unstructured, or semi-structured, and it is collected
from many different sources. In the past, data was collected only
from databases and spreadsheets; these days data arrives in a wide array of
forms: PDFs, emails, audio, social media posts, photos, videos, etc.
The data is categorized as below:
Structured data: It is in tabular form and is stored in a relational
database management system.
Semi-structured data: The schema is not rigidly defined; the data carries tags or markers
instead, e.g., JSON, XML, CSV, TSV, and email.
Unstructured data: Data with no predefined structure, such as free-text files, log files,
audio files, and image files.
Quasi-structured data: Textual data with inconsistent formats, which can be structured only
with effort, time, and the help of some tools.
Example: web server logs, i.e., a log file created and maintained by a server
that contains a list of its activities.
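As a rough illustration of these categories, the short Python sketch below shows the same kind of record in structured (tabular), semi-structured (tagged JSON), and unstructured (free text) form; the sample values are hypothetical.

import csv
import io
import json

# Structured: tabular rows with a fixed schema, as stored in an RDBMS or a CSV sheet.
structured = io.StringIO("id,name,city\n1,Asha,Pune\n2,Ravi,Mumbai\n")
rows = list(csv.DictReader(structured))

# Semi-structured: JSON carries its own tags, but the schema is not rigidly enforced;
# the second record has an extra field that the first one lacks.
semi_structured = json.loads(
    '[{"id": 1, "name": "Asha", "city": "Pune"},'
    ' {"id": 2, "name": "Ravi", "city": "Mumbai", "email": "ravi@example.com"}]'
)

# Unstructured: free text (or audio/image bytes) with no schema at all.
unstructured = "Ravi from Mumbai liked the new post and commented: great update!"

print(rows[0]["name"])               # field access by schema
print(semi_structured[1]["email"])   # field access by tag
print("Mumbai" in unstructured)      # only keyword search is possible directly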
Veracity: Veracity refers to how reliable and trustworthy the data is, and to the ability
to filter, clean, and manage data of uncertain quality efficiently.
For example: Facebook posts with hashtags, which are noisy and of uncertain accuracy.
Value: It is not just any data that we process or store; it is valuable and reliable data
that we store, process, and analyze.
Velocity: Big data velocity deals with the speed at which data flows in from sources
such as application logs, business processes, networks, social media sites, sensors,
mobile devices, etc.
*Q. What is the Hadoop ecosystem? What are its main components
(HDFS, YARN and MapReduce)?
Data Ingestion: Data ingestion is the first layer of a Big Data architecture. Data
generated from various sources such as social media, sensors, IoT devices, and
SaaS platforms needs to be collected and brought into a single warehouse or database.
There are three types of data ingestion techniques.
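A minimal batch-ingestion sketch in Python is given below; the source directories and landing-zone path are hypothetical, and a real pipeline would typically land the files in HDFS or a cloud store instead of a local folder.

import shutil
from pathlib import Path

# Hypothetical directories where different systems drop their exports.
SOURCES = [Path("exports/social_media"), Path("exports/sensors"), Path("exports/saas")]
LANDING_ZONE = Path("warehouse/raw")   # single collection point for downstream processing

def ingest_batch():
    """Copy every new file from each source into the landing zone (batch ingestion)."""
    LANDING_ZONE.mkdir(parents=True, exist_ok=True)
    for source in SOURCES:
        for file in source.glob("*"):
            target = LANDING_ZONE / f"{source.name}__{file.name}"
            if not target.exists():        # skip files that were already ingested
                shutil.copy2(file, target)

if __name__ == "__main__":
    ingest_batch()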
HDFS: HDFS stands for Hadoop Distributed File System, and it is designed to run on
commodity servers. A typical HDFS architecture consists of a NameNode and several
DataNodes. A node can be thought of as a single computer, a collection of nodes
constitutes a cluster, and each cluster can contain thousands of nodes. When HDFS
receives data, it splits the data file into small blocks, typically 64 MB or 128 MB;
the block size depends on the system configuration. While partitioning and replicating
the data, HDFS follows a principle called rack awareness. A rack is a collection of 40-
50 DataNodes.
The entire process can be summed up in the picture below.
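The arithmetic behind block splitting is simple; the Python sketch below assumes a 128 MB block size and a replication factor of 3 (both are configurable in a real cluster, and older clusters often used 64 MB blocks).

import math

BLOCK_SIZE_MB = 128   # configurable block size; 64 MB on older clusters
REPLICATION = 3       # common default replication factor

def split_into_blocks(file_size_mb: float):
    """Return the number of HDFS blocks and the total storage used for one file."""
    num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # The last block may be smaller than BLOCK_SIZE_MB; only actual bytes are stored.
    total_storage_mb = file_size_mb * REPLICATION
    return num_blocks, total_storage_mb

# Example: a 500 MB file becomes 4 blocks (3 x 128 MB + 1 x 116 MB) and, with
# 3 replicas spread across racks, occupies about 1500 MB on the cluster.
print(split_into_blocks(500))   # (4, 1500)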
Pig: Yahoo developed Apache Pig to analyse large amounts of data. MapReduce can do
this too, but one fundamental problem with MapReduce is that it takes a lot of
code to perform the intended job; this is the primary reason Pig was developed.
It has two significant components: Pig Latin and the Pig engine.
Spark: One of the critical concerns with MapReduce is that it runs a job as a sequential
multi-step process, reading data from the cluster and writing results back to the nodes
at every step. MapReduce jobs therefore have high latency, making them inefficient for
real-time analytics. Spark avoids this by keeping intermediate results in memory across
the cluster, so multi-step jobs run much faster.
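A minimal PySpark sketch of this idea is shown below; it assumes a local Spark installation, and the input path is a placeholder (on a cluster it would normally be an HDFS path).

from pyspark.sql import SparkSession

# Start a local Spark session (on a cluster this would run under YARN or another manager).
spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("data/sample.txt")   # placeholder input file

# Chained transformations stay in memory; nothing executes until an action is called.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b)
               .cache())                 # keep the intermediate RDD in memory for reuse

print(counts.take(10))                   # action: triggers the whole job
spark.stop()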
Hive: Hive is a data warehousing tool designed to work with voluminous data, and it
works on top of HDFS and MapReduce. The Hive query language (HiveQL) is similar to SQL,
making it user-friendly. Hive queries are internally converted into MapReduce or
Spark jobs, which run on Hadoop's distributed cluster of nodes.
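As a sketch of how such a query is submitted from code, the snippet below assumes a running HiveServer2 and the third-party PyHive package; the host, database, table, and column names are placeholders.

from pyhive import hive   # third-party package: pip install pyhive

# Connect to HiveServer2 (host, port, and database are placeholders).
conn = hive.Connection(host="hive-server.example.com", port=10000, database="default")
cursor = conn.cursor()

# HiveQL looks like SQL; Hive compiles it into MapReduce/Tez/Spark jobs behind the scenes.
cursor.execute(
    "SELECT country, COUNT(*) AS orders FROM sales "
    "GROUP BY country ORDER BY orders DESC LIMIT 10"
)
for country, orders in cursor.fetchall():
    print(country, orders)

cursor.close()
conn.close()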
Impala: Apache Impala is an open-source data warehouse tool for querying high-volume
data. Syntactically it is like HiveQL but provides highly optimised, faster queries
than Hive. Unlike Hive, it does not depend on MapReduce; instead, it has its own engine,
which stores intermediate results in memory, thus providing faster query execution. It
can easily be integrated with HDFS, HBase and Amazon S3. As Impala is similar to
SQL, the learning curve is not very steep.
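The same query against Impala looks almost identical, which is why the learning curve is shallow; this sketch assumes the third-party impyla package and an Impala daemon on its default port 21050, with placeholder host and table names.

from impala.dbapi import connect   # third-party package: pip install impyla

# Connect to an Impala daemon (the host is a placeholder; 21050 is Impala's default port).
conn = connect(host="impala-daemon.example.com", port=21050)
cursor = conn.cursor()

# Same SQL-style query as with Hive, but executed by Impala's own in-memory engine
# instead of being compiled into MapReduce jobs.
cursor.execute("SELECT country, COUNT(*) AS orders FROM sales GROUP BY country LIMIT 10")
for row in cursor.fetchall():
    print(row)

cursor.close()
conn.close()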
Zookeeper: Apache ZooKeeper is another essential member of the Hadoop family,
responsible for cross-node synchronisation and coordination. Hadoop applications
may need cross-cluster services, and deploying ZooKeeper takes care of this need.
Applications create a znode within ZooKeeper and can synchronise their
tasks across the distributed cluster by updating their status in that znode.
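A small sketch of the znode idea using the third-party kazoo client is given below; the ZooKeeper address, paths, and payload are placeholders.

from kazoo.client import KazooClient   # third-party package: pip install kazoo

# Connect to a ZooKeeper ensemble (address is a placeholder).
zk = KazooClient(hosts="zookeeper.example.com:2181")
zk.start()

# Each application instance registers an ephemeral, sequential znode under a shared path;
# the znode disappears automatically if the instance dies, so peers can track liveness.
zk.ensure_path("/app/workers")
zk.create("/app/workers/worker-", b"status=ready", ephemeral=True, sequence=True)

# Other instances read the registered workers to coordinate their tasks.
print(zk.get_children("/app/workers"))

zk.stop()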
Q. What are the issues and challenges of big data?
1. Sharing and Accessing Data:
Data sets from external sources are often inaccessible.
Sharing data can pose substantial challenges.
2. Privacy and Security:
Most organizations are unable to perform regular security checks because of the
large amount of data being generated, even though security checks and monitoring
should be performed in real time to be most beneficial.
Some organizations collect information about people in order to add value to their
business, deriving insights into people's lives that those people are unaware of.
3. Analytical Challenges:
Big data raises some major analytical questions, such as: how do we deal with a
problem when the data volume gets too large?
Or how do we identify the important data points?
4. Technical challenges:
Quality of data:
o Collecting and storing large amounts of data comes at a cost, yet big companies,
business leaders and IT leaders always want large data storage.
o For better results and conclusions, big data should focus on storing quality
data rather than irrelevant data.
Fault tolerance:
o Fault-tolerant computing is extremely hard, involving intricate
algorithms.
o Newer technologies such as cloud computing and big data are designed so that
whenever a failure occurs, the damage stays within an acceptable threshold and
the whole task does not have to begin from scratch.
Scalability:
o Big data projects can grow and evolve rapidly. The scalability issues of Big
Data have led towards cloud computing.
o This raises challenges such as how to run and execute various jobs so that the
goal of each workload can be achieved cost-effectively.
*Q. What are the Hadoop Commands?
Compression vs. serialization in Hadoop:
Purpose:
o Compression: reduce data size for storage and I/O.
o Serialization: encode data into a format for storage and transmission.
Common Formats:
o Compression: Gzip, Bzip2, Snappy, LZO.
o Serialization: Writable, Avro, Protobuf, Thrift.
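A small illustration of the two ideas using only the Python standard library is given below; gzip stands in for the Hadoop compression codecs, and JSON stands in for the serialization formats (Writable, Avro, Protobuf, Thrift) that a real Hadoop job would use.

import gzip
import json

records = [{"id": i, "event": "click", "user": f"user{i}"} for i in range(1000)]

# Serialization: encode in-memory objects into bytes for storage or transmission.
serialized = json.dumps(records).encode("utf-8")

# Compression: shrink the serialized bytes to save storage space and I/O.
compressed = gzip.compress(serialized)
print(len(serialized), len(compressed))   # the compressed payload is much smaller

# Reading the data back reverses both steps: decompress, then deserialize.
restored = json.loads(gzip.decompress(compressed).decode("utf-8"))
assert restored == records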