
ITE06 BIG DATA ANALYTICS

Question bank
MODULE-1- BIG DATA AND HADOOP FRAMEWORK
2 MARKS

1. State the characteristics of data.


1. Composition: The composition of data deals with the structure of data, i.e., the sources of data, the granularity, and the types and nature of data, as to whether it is static or real-time streaming.
2. Condition: The condition of data deals with the state of data, i.e., "Can one use this data as is for analysis?" or "Does it require cleaning for further enhancement and enrichment?"
3. Context: The context of data deals with "Where has this data been generated?", "Why was this data generated?", "How sensitive is this data?", and "What are the events associated with this data?".

2. Explain digital data.


The data that is stored using specific machine language systems, which can be interpreted by various technologies, is called digital data, e.g., audio, video, or text information.

3. What are the types of Digital Data?


1. Structured Data
2. Semi-Structured Data
3. Unstructured Data

4. What do you mean by big data?


Big Data is a collection of data that is huge in volume, yet growing exponentially with time. It is data of such large size and complexity that none of the traditional data management tools can store or process it efficiently.

5. How is a traditional BI environment different from a Big Data environment?


6. Define Big Data Analytics.
Big Data Analytics is the process of examining big data to uncover patterns, unearth
trends, and find unknown correlations and other useful information to make faster and
better decisions.
A few top analytics tools are MS Excel, SAS, IBM SPSS Modeler, R analytics, Statistica, World Programming Systems (WPS), and Weka. The open-source analytics tools among these are R analytics and Weka.

7. List the classifications of Analytics.


There are basically two schools of thought:
1. Those that classify analytics into basic, operational, advanced and monetized.
2. Those that classify analytics into analytics 1.0, analytics 2.0 and analytics 3.0.

8. What are the advantages of Big Data Analytics?


● Business Transformation
● Competitive Advantage
● Innovation
● Lower Costs
● Improved Customer Service
● Increased Security

9. What are the terminologies of Big Data?


a. In-Memory Analytics
b. In-Database processing
c. Symmetric Multiprocessor system
d. Massively parallel processing
e. Shared nothing architecture
f. CAP Theorem

10. What is BASE?
Basically Available, Soft State, Eventual Consistency (BASE) is a data system design philosophy that, in a distributed environment, gives importance to availability over consistency of operations.
BASE may be explained in contrast to another design philosophy -
Atomicity, Consistency, Isolation, and Durability (ACID). The ACID model promotes
consistency over availability, whereas BASE promotes availability over consistency.

11. What is NoSQL? What is the need for NoSQL?


NoSQL Database is a non-relational Data Management System that does not require a fixed schema. NoSQL stands for "Not only SQL" or "Not SQL". NoSQL database technology stores information in JSON documents instead of the columns and rows used by relational databases. NoSQL databases are widely used in real-time web applications and big data, because their main advantages are high scalability and high availability.

12. List the Features of NoSQL.


1. NoSQL databases are non-relational
2. Distributed
3. No Support for ACID properties
4. No fixed table schema

13. Write the Characteristics of NewSQL.


● SQL interface for application interaction
● ACID support for transactions
● An architecture that provides higher per-node performance vis-a-vis traditional RDBMS solutions
● Scale-out, shared-nothing architecture
● Non-locking concurrency control mechanism so that real-time reads will not conflict with writes

14. Differentiate SQL, NoSQL, and NewSQL.


15. List some real-time applications of NoSQL in Big Data Analytics:
● HBase for Hadoop, a popular NoSQL database, is used extensively by Facebook for its messaging infrastructure.
● HBase is used by Twitter for generating, storing, logging, and monitoring data around people search.
● HBase is used by the discovery engine StumbleUpon for data analytics and storage.
● MongoDB is another NoSQL database, used by CERN, the European nuclear research organization, for collecting data from the huge particle collider, the "Large Hadron Collider".
● LinkedIn, Orbitz, and Concur use the Couchbase NoSQL database for various data processing and monitoring tasks.

16. Give the differences between Hadoop and RDBMS.

17. What is Hadoop?
Hadoop is an open-source framework that is meant for the storage and processing of big data in a distributed manner. It is the best solution for handling big data challenges.

18. What is HDFS?


HDFS (Hadoop Distributed File System) is the storage component of Apache Hadoop. It is a distributed file system that stores very large files as blocks spread across a cluster of commodity machines, replicating each block across multiple DataNodes for fault tolerance, and it is designed for high-throughput, write-once/read-many access to big data.

19. What is MapReduce?


MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). Reduce takes the output from a Map as input and combines those data tuples into a smaller set of tuples.
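
The two phases can be simulated with a minimal, framework-free Python sketch of word count (the input lines are hypothetical; a real job would run on a Hadoop cluster):

# Map and Reduce phases simulated in plain Python for word count
from itertools import groupby

lines = ["deer bear river", "car car river"]

# Map: emit a (word, 1) tuple for every word
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle/sort: group the intermediate pairs by key
mapped.sort(key=lambda kv: kv[0])

# Reduce: sum the values for each key
for word, group in groupby(mapped, key=lambda kv: kv[0]):
    print(word, sum(count for _, count in group))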

20. What is YARN?


YARN is an Apache Hadoop technology and stands for Yet Another Resource
Negotiator. YARN is a large-scale, distributed operating system for big data
applications. YARN is a software rewrite that decouples MapReduce's resource management and scheduling capabilities from the data processing component.

21. Name some big data technologies.


● Hadoop
● Spark
● Hive
● HBase
● MongoDB
● ZooKeeper

22. What are the types of NoSQL databases?


1. Document databases: They store data in documents similar to JSON (JavaScript Object Notation) objects; each document contains pairs of fields and values. The values can typically be of a variety of types, including strings, numbers, Booleans, arrays, or objects (see the example below).
2. Key-value databases: These are simpler databases where each item contains keys and values.
3. Wide-column stores or column-family data stores: These store data in tables, rows, and dynamic columns.
4. Graph databases: These store data in nodes and edges. Nodes typically store information about entities, while edges store information about the relationships between the nodes.
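
For example, a single record in a document database might look like the following (a hypothetical document; field names are illustrative):

{ "_id": 1, "name": "Asha", "skills": ["Hadoop", "Spark"], "active": true }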

23. Justify the coexistence of Big Data and the data warehouse.


● Data warehouses continue with the standard workload from legacy operational systems, storing historical data to provision traditional BI reporting and analytics.
● Hadoop cannot be ignored either, since different types of data can be analyzed with it.
● Hence, both work together; neither can be thrown out.

24. Relate Big Data in terms of analytics.


Big Data Analytics is the process of examining big data to uncover patterns, unearth
trends, and find unknown correlations and other useful information to make faster
and better decisions.

25. List some of the top analytics tools.


MS Excel, SAS, IBM SPSS Modeler, R analytics, Statistica, World Programming
Systems (WPS), and Weka.
The open source analytics tools are: R analytics and Weka.
26. Components of HDFS
There are two (and a half) types of machines in an HDFS cluster:
● NameNode: the heart of an HDFS filesystem; it maintains and manages the file system metadata, e.g., what blocks make up a file, and on which DataNodes those blocks are stored.
● DataNode: where HDFS stores the actual data; there are usually quite a few of these.

16 MARKS
1. What is data serialization? With proper examples discuss and differentiate
structured, unstructured and semi-structured data. Make a note on how type of
data affects data serialization.
2. How Big Data Analytics can be useful in the development of smart cities.
3. What is big data analytics? Explain the four 'V's of Big Data. Briefly discuss applications of big data. Explain the advantages and disadvantages of big data analytics.
4. Explain the evolution of Big Data. What are the challenges of Big Data?
5. Define Big Data Analytics. What are the various types of analytics?
6. Explain the terminology of Big Data
7. What is HDFS? Explain the HDFS Architecture with a neat diagram.
8. Explain Building blocks of Hadoop (Namenode, Datanode, Secondary
Namenode, JobTracker, TaskTracker).
9. Write a MapReduce program for WordCount problem
10. What is MapReduce? Explain in detail the different phases in MapReduce. (or)
Explain MapReduce anatomy.
11. Explain Hadoop Ecosystem in detail.
12. Discuss Hadoop YARN in detail with failures in classic MapReduce.
13. Explain features of HDFS. Discuss the design of Hadoop distributed file system
and concept in detail.
14. List various configuration files used in Hadoop installation. What is the use of mapred-site.xml?
MODULE-2

2 MARKS

1. What is NoSQL Database?


NoSQL Database is used to refer to a non-SQL or non-relational database. It provides a mechanism for the storage and retrieval of data other than the tabular relations model used in relational databases. A NoSQL database doesn't use tables for storing data. It is generally used for big data and real-time web applications.

2. State the advantages of NoSQL


o It supports query language.
o It provides fast performance.
o It provides horizontal scalability.

3. Define MongoDB.
MongoDB is a document database that provides high performance, high availability,
and easy scalability. It is a cross-platform document-oriented database system
classified as a NoSQL database, that bridges the gap between key-value and
traditional RDBMS systems.

4. Why MongoDB is so popular?


MongoDB is a NoSQL product and is getting enormously popular in the developer
community. This is because MongoDB blends seamlessly with programming
languages like JavaScript, Ruby and Python; this seamless blending conveys high
coding velocity. This feature, along with its simplicity, has made MongoDB very popular in a short span of time.

5. Why Use MongoDB?


● Document-oriented storage: data is stored in the form of JSON-style documents
● Index on any attribute
● Replication and high availability
● Auto-sharding
● Rich queries
● Fast in-place updates
● Professional support by MongoDB

6. Where to Use MongoDB?


● Big Data
● Content Management and Delivery
● Mobile and Social Infrastructure
● User Data Management
● Data Hub

7. State the purpose of Partitioners.


Partitioners are responsible for dividing up the intermediate key space and assigning
intermediate key-value pairs to reducers. In other words, the partitioner specifies the
task to which an intermediate key-value pair must be copied.

8. Define Sharding.
Sharding is akin to horizontal scaling. It means that the large dataset is divided and distributed over multiple servers or shards. Each shard is an independent database, and collectively they constitute a logical database.
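
As an illustration, sharding a collection in the MongoDB shell might look like the following (a hedged sketch; the school database, students collection, and rno shard key are hypothetical, and a running sharded cluster is assumed):

mongos> sh.enableSharding("school")                        // enable sharding for a database
mongos> sh.shardCollection("school.students", { rno: 1 })  // shard the collection on rno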

9. List the advantages of sharding.


Sharding reduces the amount of data that each shard needs to store and manage.
Sharding reduces the number of operations that each shard handles.

10. Write about CRUD operations in MongoDB.


Create Operations - Create or insert operations add new documents to a collection. If the collection does not currently exist, insert operations will create the collection.
Read Operations - Read operations retrieve documents from a collection, i.e., query a collection for documents.
Update Operations - Update operations modify existing documents in a collection.
Delete Operations - Delete operations remove documents from a collection.
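
A minimal illustration of the four operations in the mongo shell (the students collection and its fields are hypothetical):

db.students.insertOne({ rno: 1, sname: "Asha", percent: 82.5 })  // Create
db.students.find({ percent: { $gt: 80 } })                       // Read
db.students.updateOne({ rno: 1 }, { $set: { percent: 85.0 } })   // Update
db.students.deleteOne({ rno: 1 })                                // Delete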

11. Write a short note on the history of Cassandra.


Apache Cassandra was born at Facebook. After Facebook open-sourced the code in 2008, Cassandra became an Apache Incubator project in 2009 and subsequently became a top-level Apache project in 2010.
It is built on Amazon's Dynamo and Google's BigTable.
Cassandra is used extensively by Twitter, Netflix, Cisco, Adobe, eBay, and Rackspace.

12. List out the Features of Cassandra.


1. Open source
2. Distributed
3. No Single Point of Failure
4. Column Oriented
5. Availability
6. Scalability
7. Peer-to-Peer Network

13. Write some advantages of Cassandra.


● Since data can be replicated to several nodes, Cassandra is fault tolerant.
● Cassandra can handle a large set of data.
● Cassandra provides high scalability.

14. Define commit log.


It is a mechanism that is used to recover data in case the database crashes.
Every operation that is carried out is saved in the commit log. Using this, data can be recovered.

15. Define composite key.


Composite keys include the row key and the column name. They are used to define a column family with a concatenation of data of different types.

16. Define SSTable.


SSTable stands for Sorted String Table. It is an immutable data file to which Cassandra periodically writes (flushes) memtables.

17. What is Memtable?


Memtable is an in-memory/write-back cache space containing content in key and column format. In a memtable, data is sorted by key, and each column family has a distinct memtable that retrieves column data via key. It stores writes until it is full, and is then flushed to disk.

18. How the SSTable is different from other relational tables?


SSTables do not allow any further addition or removal of data items once written. For each SSTable, Cassandra creates three separate files: a partition index, a partition summary, and a bloom filter.

19. What is data replication in Cassandra?


Data replication is the electronic copying of data from a database on one computer or server to a database on another, so that all users can share the same level of information. Cassandra stores replicas on multiple nodes to ensure reliability and fault tolerance. The replication strategy decides the nodes where replicas are placed.

20. What is CQL? List the data types in CQL.


The Cassandra Query Language (CQL) offers a model similar to SQL. The data is stored in tables containing rows of columns. CQL data types include int, bigint, float, double, boolean, text/varchar, timestamp, uuid, and blob, along with the collection types list, set, and map.

21. Define Keyspaces. Explain various operations on Keyspaces with suitable examples.


● A keyspace is a container to hold application data. It is comparable to a relational database.
● It is used to group column families (tables) together.
● Typically, a cluster has one keyspace per application.
● Replication is controlled on a per-keyspace basis, so data that has different replication requirements should reside in different keyspaces.
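
A minimal set of keyspace operations in cqlsh (the keyspace name demo is hypothetical):

cqlsh> CREATE KEYSPACE demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
cqlsh> USE demo;
cqlsh> ALTER KEYSPACE demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
cqlsh> DROP KEYSPACE demo;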

22. What is the need for Collections? Explain the different collections in Cassandra.


Cassandra provides collection types as a way to group and store data together in a column, i.e., to store multiple values in a column, such as storing multiple mobile numbers for a user. They are used to store or denormalize a small amount of data.
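
A sketch of the three collection types (set, list, map) in cqlsh, using a hypothetical contacts table:

cqlsh:chp> CREATE TABLE contacts (
               name text PRIMARY KEY,
               phones set<text>,          -- set: unique, unordered values
               emails list<text>,         -- list: ordered, duplicates allowed
               addresses map<text, text>  -- map: key/value pairs
           );
cqlsh:chp> INSERT INTO contacts (name, phones) VALUES ('Asha', {'9840000000', '9840000001'});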

23. Answer the following queries.


i. Create a table "earnings" with columns sid, cid, corder, title, and coordinator, with the primary key as sid and corder.
cqlsh> use chp;
cqlsh:chp> create table earnings(sid int, cid int, corder int, title text, coordinator text, Primary Key(sid, corder));

ii. Retrieve data from the table with coordinator as chp.


cqlsh:chp> select * from earnings where coordinator = 'chp' allow filtering;
sid | corder | cid | coordinator | title
-----+--------+-----+-------------+-----------
101 | 1001 | 1 | chp | Cassandra
101 | 1003 | 3 | chp | Hadoop

24. Create a keyspace by the name "chp" and describe all the keyspaces.
i. Create a keyspace by the name "chp"
cqlsh> create keyspace chp with replication={'class' : 'SimpleStrategy',
'replication_factor' : 1};
ii. To describe all the existing keyspaces.
cqlsh> describe keyspaces;
system_schema system stud system_traces
system_auth chp system_distributed stud_b

25. Write the syntax for CRUD (Create, Read, Update, and Delete) operations:
1. Create: to create a column family or table in a keyspace.
Syntax:
CREATE TABLE tablename(
column1_name datatype PRIMARY KEY,
column2_name datatype,
column3_name datatype ...
);
i. Connect to the keyspace "chp"
cqlsh> use chp;
ii. Create a table/column family "stud_info"
cqlsh:chp> create table stud_info(rno int Primary Key,sname text, doj
timestamp, percent double);
iii. Display the structure of the table stud_info
cqlsh:chp> describe stud_info;
CREATE TABLE chp.stud_info (
rno int PRIMARY KEY,
doj timestamp,
percent double,
sname text)
iv. Display all the tables in the keyspace chp
cqlsh:chp> describe tables;
stud_info

16 MARKS
1. Define MongoDB. Write the differences between RDBMS and MongoDB. Explain
MongoDB advantages.
2. Explain MongoDB query language in detail.
3. Explain Mapper-Reducer classes with examples.
4. Define the following terms with examples. a) Partitioner b) Combiner c) Shuffling d)
Sorting
5. What is the role of compression in Hadoop? Explain.
6. Explain different data types used in MongoDB
7. Explain the following in brief with respect to MongoDB: (1) Collections and documents,
(2) Indexing and retrieval, (3) Data aggregation.
8. Explain scaling in MongoDB.
9. Explain CRUD operations in MongoDB.
10. Requirement specification of a blog application in social networking is as follows: every post has a unique title, description, and URL; every post can have one or more tags; every post has the name of its publisher and the total number of likes. Requirement specification for a meeting dashboard application in an organization is as follows:
Any member in an organization can host a meeting and send invitations to other
members within an organization. Invitees can accept or reject the meeting with proper
reason. Every meeting has the title, timestamp and place/location associated. Every
meeting has predefined agendas and documents associated. Meeting discussion
concludes with identifying tasks to accomplish. Every task has title, priority, deadline and
note associated with it. A task can be assigned to any attendee of the meeting. For this set of
requirements, design a MongoDB schema.
MODULE-3- HADOOP ECOSYSTEM

2 MARKS

1. What is Hive?
Hive is a data warehouse infrastructure tool to process structured data in Hadoop.
It resides on top of Hadoop to summarize Big Data, and makes querying and
analyzing easy.

2. What are the different types of tables available in Hive?


There are two types: managed tables and external tables. In a managed table, both the data and the schema are under the control of Hive, but in an external table only the schema is under the control of Hive.

3. What is a generic UDF in hive?


It is a UDF which is created using a Java program to serve some specific need not covered by the existing functions in Hive. It can detect the type of input argument programmatically and provide an appropriate response.

4. What is bucketing?
The values in a column are hashed into a number of buckets defined by the user. It is a way to avoid too many partitions or nested partitions while ensuring optimized query output.
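
A minimal HiveQL sketch of bucketing (the table and column names are hypothetical):

hive> CREATE TABLE stud_bucketed (rno INT, sname STRING)
      CLUSTERED BY (rno) INTO 4 BUCKETS;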

5. List the Features of Hive.


● It stores schema in a database and processed data into HDFS.
● It is designed for OLAP.
● It provides an SQL-type language for querying called HiveQL or HQL.
● It is familiar, fast, scalable, and extensible.

6. Give the classifications of Hive data types.

Hive data types are classified into four types, given as follows:

● Column Types
● Literals
● Null Values
● Complex Types
7. What is Apache Pig?

Apache Pig is a high-level data flow platform for executing MapReduce programs
of Hadoop. The language used for Pig is Pig Latin.

Features
It provides an engine for executing data flows.
It provides a language called "Pig Latin" to express data flows.
It allows users to develop their own functions.

8. Differences between Apache MapReduce and Apache Pig.

● Apache MapReduce is a low-level data processing tool, whereas Apache Pig is a high-level data flow tool.
● In MapReduce, it is required to develop complex programs using Java or Python, while in Pig it is not required to develop complex programs.
● It is difficult to perform data operations in MapReduce, whereas Pig provides built-in operators to perform data operations like union, sorting, and ordering.
● MapReduce doesn't allow nested data types, while Pig provides nested data types like tuple, bag, and map.

9. What are the modes of executing Apache Pig?


Apache Pig executes in two modes: Local Mode and MapReduce Mode.
Ways to execute Pig Program

o Interactive Mode - In this mode, Pig is executed in the Grunt shell. To invoke the
Grunt shell, run the pig command. Once the Grunt mode executes, we can
provide Pig Latin statements and commands interactively at the command line.
o Batch Mode - In this mode, we can run a script file having a .pig extension. These
files contain Pig Latin commands.
o Embedded Mode - In this mode, we can define our own functions. These
functions can be called as UDF (User Defined Functions). Here, we use
programming languages like Java and Python.
10. Load the data into a bag named "lines" and arrange the output according to count in descending order using the commands.
grunt> lines = LOAD '/user/Desktop/data.txt' AS (line: chararray);
grunt> OrderCnt = ORDER countletter BY $1 DESC;
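
These two commands assume an intermediate relation countletter holding (word, count) pairs; a minimal sketch of the steps that would produce it:

grunt> words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grunt> grouped = GROUP words BY word;
grunt> countletter = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;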

11. What is JasperReport?


JasperReports is an open-source Java reporting engine. JasperReports is a Java class library, and it is meant for those Java developers who need to add reporting capabilities to their applications.

12. Give the life cycle of Jasper Report.

● Designing the report (creating the JRXML report design, manually or using iReportDesigner)
● Compiling the report
● Executing the report (filling data into the report)
● Exporting the report to the desired format

13. Write about Spark SQL.


● Spark SQL was added to Spark in version 1.0.
● Spark SQL is Spark's package for working with structured data. It allows querying data via SQL as well as the Apache Hive variant of SQL, called the Hive Query Language (HQL), and it supports many sources of data, including Hive tables and JSON.
● Spark SQL allows developers to intermix SQL queries with the programmatic data manipulations supported by RDDs in Python, Java, and Scala, all within a single application, thus combining SQL with complex analytics.
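
A minimal PySpark sketch of intermixing SQL with programmatic access (the input file and its columns are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sqlDemo").getOrCreate()
df = spark.read.json("/home/hadoop/Desktop/people.json")  # hypothetical input file
df.createOrReplaceTempView("people")                      # register the DataFrame as a SQL view
spark.sql("SELECT name FROM people WHERE age > 21").show()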

14. Elucidate about Spark Streaming.


● Spark Streaming is a Spark component that enables the processing of live streams of data, e.g., log files generated by production web servers.
● Spark Streaming was designed to provide the same degree of fault tolerance, throughput, and scalability as Spark Core.

15. Write a program for word count in Spark.


Wordcount.py
# Read the input file, split each line on commas, pair each word with 1,
# and sum the counts per word; the result is saved as text files.
import pyspark

if 'sc' not in globals():
    sc = pyspark.SparkContext()

text_file = sc.textFile("/home/hadoop/Desktop/dept.txt")
counts = (text_file.flatMap(lambda line: line.split(","))
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("/home/hadoop/Desktop/word")

16. What are Apache Spark Workloads?

The Spark framework includes:

● Spark Core as the foundation for the platform
● Spark SQL for interactive queries
● Spark Streaming for real-time analytics
● Spark MLlib for machine learning
● Spark GraphX for graph processing

17. Comment on MLlib for Spark.

Spark includes MLlib, a library of algorithms to do machine learning on data at scale.


Machine Learning models can be trained by data scientists with R or Python on any
Hadoop data source, saved using MLlib, and imported into a Java or Scala-based
pipeline. Spark was designed for fast, interactive computation that runs in memory,
enabling machine learning to run quickly. The algorithms include the ability to do
classification, regression, clustering, collaborative filtering, and pattern mining.

18. Comment on GraphX for Spark

Spark GraphX is a distributed graph processing framework built on top of Spark.


GraphX provides ETL, exploratory analysis, and iterative graph computation to enable users to interactively build and transform a graph data structure at scale. It comes with a highly flexible API and a selection of distributed graph algorithms.

19. What Is Apache Kafka and Kafka Streams API?

Kafka is an event stream processing system focused on high throughput and low-latency delivery of data in real time.

Kafka Streams is a tool used within the Kafka environment that helps do more than just write and read records on a Kafka topic.
20. What Is Spark Streaming?

Apache Spark Streaming is a distributed processing engine built on top of the Apache Spark framework. It enables real-time processing of data streams, allowing developers to analyze and manipulate data as it is being generated, rather than having to wait for the data to be stored in a database or file system.
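
A minimal DStream sketch using PySpark Streaming (assumes a text source on localhost:9999, e.g. started with nc -lk 9999; names are illustrative):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streamDemo")
ssc = StreamingContext(sc, 1)                    # 1-second micro-batches
lines = ssc.socketTextStream("localhost", 9999)  # hypothetical live text source
counts = (lines.flatMap(lambda l: l.split(" "))
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                  # print each batch's word counts
ssc.start()
ssc.awaitTermination()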

21. State some key differences between Kafka and Spark Streaming:

● Processing model: Spark Streaming provides a high-level API for processing data streams using Spark's parallel processing engine, while Kafka provides a distributed messaging system for handling real-time data streams.
● Data storage: Spark Streaming stores data in memory or on disk, depending on the configuration, while Kafka stores data in distributed, fault-tolerant, and scalable log files called topics.
● APIs: Spark Streaming provides APIs for several programming languages, including Scala, Java, Python, and R. Kafka provides APIs in several programming languages, including Java, Scala, Python, and .NET.
● Use cases: Spark Streaming is suitable for data processing use cases that involve complex analytics, machine learning, and graph processing. Kafka is suitable for real-time data streaming use cases, such as clickstream analysis, fraud detection, and real-time analytics.
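
To make the contrast concrete, writing a record to a Kafka topic might look like this (a hedged sketch using the kafka-python client; the broker address and topic name are hypothetical):

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clickstream", b"user=42 page=/home")  # append a record to the topic
producer.flush()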

22. What is a decision tree?


Decision tree learning is a method for approximating discrete-valued target
functions, in which the learned function is represented by a decision tree.
A decision tree is a tree where each node represents a feature (attribute), each link (branch) represents a decision (rule), and each leaf represents an outcome (categorical or continuous value).
A decision tree or a classification tree is a tree in which each internal node is
labeled with an input feature. The arcs coming from a node labeled with a
feature are labeled with each of the possible values of the feature.
23. What is regression?
Regression is a method to determine the statistical relationship between a
dependent variable and one or more independent variables.

24. Explain linear and non-linear regression models.


In linear regression models, the dependence of the response on the regressors is
defined by a linear function, which makes their statistical analysis mathematically
tractable. On the other hand, in nonlinear regression models, this dependence is
defined by a nonlinear function, hence the mathematical difficulty in their
analysis.
25. What is regression analysis used for?
Regression analysis is a form of predictive modelling technique which investigates the relationship between a dependent (target) variable and one or more independent variables (predictors). This technique is used for forecasting, time-series modelling, and finding the causal-effect relationship between variables.
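
A minimal forecasting sketch with scikit-learn (assumed available; the data points are hypothetical):

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4]])    # independent variable, e.g. time period
y = np.array([30, 35, 41, 44])        # dependent (target) variable, e.g. sales

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # fitted slope and intercept
print(model.predict([[5]]))           # forecast for the next period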

26. List two properties of logistic regression.


1. The dependent variable in logistic regression follows Bernoulli Distribution.
2. Estimation is done through maximum likelihood.

27. What is the goal of logistic regression?


The goal of logistic regression is to correctly predict the category of outcome for
individual cases using the most parsimonious model. To accomplish this goal, a
model is created that includes all predictor variables that are useful in predicting
the response variable.
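
Both properties from the previous question can be seen in a small scikit-learn sketch (assumed available; the data are hypothetical):

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6]])  # single predictor
y = np.array([0, 0, 0, 1, 1, 1])              # Bernoulli-distributed outcome

clf = LogisticRegression().fit(X, y)          # estimated via maximum likelihood
print(clf.predict([[2.5], [4.5]]))            # predicted categories
print(clf.predict_proba([[4.5]]))             # class membership probabilities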

28. What is Spark? Explain features of Spark.


Apache Spark is an open-source, distributed processing system used for big data analytics. It utilizes in-memory caching and optimized query execution for fast analytic queries against data of any size.
Features of Spark:
Swift Processing
Dynamic in Nature
In-Memory Computation in Spark
Reusability
Spark Fault Tolerance
Real-Time Stream Processing

29. List out the two main abstractions of the Apache Spark Architecture.
Apache Spark Architecture is based on two main abstractions:
● Resilient Distributed Datasets (RDD)
● Directed Acyclic Graph (DAG)
30. What is RDD?
RDDs are the building blocks of any Spark application. RDDs Stands for:
● Resilient: fault tolerant and capable of rebuilding data on failure
● Distributed: data is distributed among the multiple nodes in a cluster
● Dataset: a collection of partitioned data with values
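
A minimal RDD sketch in PySpark (assumes a running SparkContext sc, e.g. from the pyspark shell):

rdd = sc.parallelize([1, 2, 3, 4], 2)  # dataset distributed across 2 partitions
doubled = rdd.map(lambda x: x * 2)     # transformation (lazy; rebuilt on failure)
print(doubled.collect())               # action -> [2, 4, 6, 8]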

16 MARKS

1. Explain working of Hive with proper steps and Architecture diagram.


2. What is ZooKeeper? List its benefits. Differentiate: Apache Pig vs. MapReduce.
3. What do you mean by HiveQL Data Definition Language? Explain any three
HiveQL DDL command with its syntax and example.
4. Explain Spark components in detail. Also list the features of spark.
5. What are the problems related to Map Reduce data storage? How Apache
Spark solves it using Resilient Distributed Dataset? Explain RDDs in detail.
6. What is Apache Spark? What are the advantages of using Apache Spark over
Hadoop?
7. Explain in brief four major libraries of Apache Spark.
8. Describe about the anatomy of Pig with its execution modes.
9. Show the connection procedure of Apache Spark to MongoDB and Cassandra.
10. Explain the Stream Data Model and its Architecture.
