Ite06 Big Data Analytics-Qbank
Question bank
MODULE-1- BIG DATA AND HADOOP FRAMEWORK
2 MARKS
9. What is BASE?
Basically Available, Soft state, Eventual consistency (BASE) is a data system design
philosophy that, in a distributed environment, prioritizes availability over
consistency of operations.
BASE may be explained in contrast to another design philosophy -
Atomicity, Consistency, Isolation, and Durability (ACID). The ACID model promotes
consistency over availability, whereas BASE promotes availability over consistency.
16. What is Hadoop?
Hadoop is an open-source framework for the storage and processing of big
data in a distributed manner. It is a widely used solution for handling big data challenges.
16 MARKS
1. What is data serialization? With proper examples, discuss and differentiate
structured, unstructured and semi-structured data. Make a note on how the type of
data affects data serialization.
2. How can Big Data Analytics be useful in the development of smart cities?
3. What is big data analytics? Explain the four V's of Big Data. Briefly discuss
applications of big data. Explain the advantages and disadvantages of big data
analytics.
4. Explain the evolution of Big Data. What are the challenges of Big Data?
5. Define Big Data Analytics. What are the various types of analytics?
6. Explain the terminology of Big Data
7. What is HDFS? Explain the HDFS Architecture with a neat diagram.
8. Explain Building blocks of Hadoop (Namenode, Datanode, Secondary
Namenode, JobTracker, TaskTracker).
9. Write a MapReduce program for the WordCount problem.
10. What is MapReduce? Explain in detail the different phases in MapReduce. (or)
Explain MapReduce anatomy.
11. Explain Hadoop Ecosystem in detail.
12. Discuss Hadoop YARN in detail with failures in classic MapReduce.
13. Explain the features of HDFS. Discuss the design and concepts of the Hadoop
distributed file system in detail.
14. List the various configuration files used in Hadoop installation. What is the use of
mapred-site.xml?
MODULE-2
2 MARKS
3. Define MongoDB.
MongoDB is a document database that provides high performance, high availability,
and easy scalability. It is a cross-platform, document-oriented database system
classified as a NoSQL database that bridges the gap between key-value stores and
traditional RDBMS systems.
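For illustration, a minimal sketch in the mongo shell of how a document might be stored and queried (the students collection and its fields are hypothetical, not from the original answer):
> db.students.insertOne({ rno: 1, sname: "Asha", percent: 87.5 })
> db.students.find({ percent: { $gt: 80 } })
Each record is a self-describing JSON-like document rather than a row in a fixed schema.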
8. Define Sharding.
Sharding is akin to horizontal scaling. A large dataset is divided and
distributed over multiple servers, or shards. Each shard is an independent database, and
collectively the shards constitute a single logical database.
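As a hedged sketch, enabling sharding in the mongo shell might look like this (the school database, students collection, and rno shard key are assumed for illustration):
> sh.enableSharding("school")
> sh.shardCollection("school.students", { rno: "hashed" })
The hashed shard key spreads documents evenly across the shards.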
24. Create a keyspace by the name “chp” and describe all the keyspaces.
i. Create a keyspace by the name "chp"
cqlsh> create keyspace chp with replication={'class' : 'SimpleStrategy',
'replication_factor' : 1};
ii. To describe all the existing keyspaces.
cqlsh> describe keyspaces;
system_schema system stud system_traces
system_auth chp system_distributed stud_b
25. Write the syntax for CRUD (Create, Read, Update and Delete) operations.
1. Create: To create a column family or table in a keyspace
Syntax:
CREATE TABLE tablename(
column1_name datatype PRIMARY KEY,
column2_name datatype,
column3_name datatype…
);
i. Connect to the keyspace "chp"
cqlsh> use chp;
ii. Create a table/column family "stud_info"
cqlsh:chp> create table stud_info(rno int Primary Key,sname text, doj
timestamp, percent double);
iii. Display the structure of the table stud_info
cqlsh:chp> describe stud_info;
CREATE TABLE chp.stud_info (
rno int PRIMARY KEY,
doj timestamp,
percent double,
sname text)
iv. Display all the tables in the keyspace chp
cqlsh:chp> describe tables;
stud_info
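The steps above cover only the Create side; hedged examples of inserting a row and of the Read, Update and Delete operations on the same stud_info table (sample values are illustrative) could look like:
cqlsh:chp> insert into stud_info(rno, sname, doj, percent) values (1, 'Asha', '2023-06-01', 87.5);
cqlsh:chp> select * from stud_info where rno = 1;
cqlsh:chp> update stud_info set percent = 90.0 where rno = 1;
cqlsh:chp> delete from stud_info where rno = 1;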
16 MARKS
1. Define MongoDB. Write the differences between RDBMS and MongoDB. Explain
MongoDB advantages.
2. Explain MongoDB query language in detail.
3. Explain Mapper-Reducer classes with examples.
4. Define the following terms with examples. a) Partitioner b) Combiner c) Shuffling d)
Sorting
5. What is the role of compression in Hadoop? Explain.
6. Explain different data types used in MongoDB
7. Explain the following in brief with respect to MongoDB: (1) Collections and documents,
(2) Indexing and retrieval, (3) Data aggregation.
8. Explain scaling in MongoDB.
9. Explain CRUD operations in MongoDB.
10. The requirement specification of a blog application in a social network is as follows: Every
post has a unique title, description and url. Every post can have one or more tags. Every
post has the name of its publisher and the total number of likes. The requirement
specification for a meeting dashboard application in an organization is as follows:
Any member of an organization can host a meeting and send invitations to other
members within the organization. Invitees can accept or reject the meeting with a proper
reason. Every meeting has a title, timestamp and place/location associated with it. Every
meeting has predefined agendas and documents associated with it. The meeting discussion
concludes with identifying tasks to accomplish. Every task has a title, priority, deadline and
a note associated with it. A task can be assigned to any attendee of the meeting. For this set of
requirements, design a MongoDB schema.
MODULE-3- HADOOP ECOSYSTEM
2 MARKS
1. What is Hive?
Hive is a data warehouse infrastructure tool to process structured data in Hadoop.
It resides on top of Hadoop to summarize Big Data, and makes querying and
analyzing easy.
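For example, a small HiveQL session might look like this (the stud table, its columns, and the file path are assumed for illustration):
hive> create table stud(rno int, sname string, percent double)
row format delimited fields terminated by ',';
hive> load data local inpath '/tmp/stud.csv' into table stud;
hive> select sname from stud where percent > 80;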
4. What is bucketing?
The values in a column are hashed into a number of buckets defined by the
user. Bucketing is a way to avoid too many partitions or nested partitions while still
ensuring optimized query output.
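A hedged HiveQL sketch of bucketing (table name and columns are illustrative; on older Hive versions the hive.enforce.bucketing property must be set, while newer versions enforce it automatically):
hive> set hive.enforce.bucketing = true;
hive> create table stud_bucketed(rno int, sname string)
clustered by (rno) into 4 buckets;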
Hive data types are classified as:
Column Types
Literals
Null Values
Complex Types
7. What is Apache Pig?
Apache Pig is a high-level data flow platform for executing MapReduce programs
on Hadoop. The language used for Pig is Pig Latin.
Features
It provides an engine for executing data flows.
It provides a language called "Pig Latin" to express data flows.
It allows users to develop their own functions.
It allows nested data types such as tuple, bag, and map, as illustrated
below.
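For instance, a Pig Latin schema using these nested types might be declared as follows (the file name and fields are illustrative):
grunt> A = LOAD 'students.txt' AS (name:chararray, scores:bag{t:tuple(s:int)}, info:map[]);
grunt> DUMP A;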
Pig execution modes (invocation sketched below):
o Interactive Mode - In this mode, Pig is executed in the Grunt shell. To invoke the
Grunt shell, run the pig command. Once inside the Grunt shell, we can enter
Pig Latin statements and commands interactively at the command line.
o Batch Mode - In this mode, we run a script file having a .pig extension. These
files contain Pig Latin commands.
o Embedded Mode - In this mode, we can define our own functions, called UDFs
(User Defined Functions), using programming languages such as Java and Python.
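Invoking the first two modes from the command line could look like this (the script name is illustrative):
$ pig                 # opens the Grunt shell (interactive mode)
$ pig wordcount.pig   # runs a .pig script file (batch mode)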
10. Load the data into bag named "lines" and Arrange the output according to
count in descending order using the commands.
grunt> lines = LOAD '/user/Desktop/data.txt' AS (line: chararray);
grunt>OrderCnt = ORDER countletter BY $1 DESC;
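The relation countletter used in the ORDER statement is not defined in the answer above; the intermediate steps of a typical word-count flow that would produce it are sketched below (an assumption, not part of the original answer):
grunt> tokens = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS token;
grunt> grouped = GROUP tokens BY token;
grunt> countletter = FOREACH grouped GENERATE group, COUNT(tokens) AS cnt;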
Kafka Streams is a tool used within the Kafka environment that helps do more
than just write and read records to and from a Kafka topic.
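A minimal Kafka Streams sketch in Java, assuming the topics input-topic and output-topic exist and a broker runs on localhost:9092 (all names are illustrative):
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class PassThroughDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "passthrough-demo"); // assumed application id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Topology: read every record from one topic and write it, unchanged, to another.
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic").to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}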
20. What is Spark Streaming?
21. State some key differences between Kafka and Spark Streaming:
29. List out the two main abstractions of the Apache Spark Architecture.
Apache Spark Architecture is based on two main abstractions-
Resilient Distributed Datasets (RDD)
Directed Acyclic Graph (DAG)
30. What is RDD?
RDDs are the building blocks of any Spark application. RDD stands for:
Resilient: fault tolerant and capable of rebuilding data on failure
Distributed: data is distributed among the multiple nodes in a cluster
Dataset: a collection of partitioned data with values
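A small spark-shell (Scala) sketch of these three properties, with made-up sample data:
scala> val nums = sc.parallelize(1 to 4)  // Dataset of values, Distributed across partitions
scala> val doubled = nums.map(_ * 2)      // transformation tracked in the lineage, so data is Resilient to failure
scala> doubled.collect()
res0: Array[Int] = Array(2, 4, 6, 8)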
16 MARKS