Big Data Course Agenda

Hadoop is an open source framework for distributed processing and storage of big data across clusters of servers. It manages structured and unstructured data and supports analytics like predictive modeling. Hadoop uses HDFS for rapid data access and fault tolerance. Sqoop transfers data between Hadoop and relational databases. Hive provides SQL queries on Hadoop data. Spark is a distributed processing system for fast queries on large data using in-memory caching. Spark SQL unifies SQL queries and complex analytics. Scala is a functional language that runs on the JVM and is used for Spark development.

BIG DATA

Hadoop

Hadoop is an open source distributed processing framework that manages data processing and storage for big data applications in scalable clusters of computer servers. It's at the center of an ecosystem of big data technologies that are primarily used to support advanced analytics initiatives, including predictive analytics, data mining and machine learning. Hadoop systems can handle various forms of structured and unstructured data, giving users more flexibility for collecting, processing, analyzing and managing data than relational databases and data warehouses provide.

Hadoop & Big Data

Hadoop runs on commodity servers and can scale up to support thousands of hardware nodes. The Hadoop Distributed File System (HDFS) is designed to provide rapid data access across the nodes in a cluster, plus fault-tolerant capabilities so applications can continue to run if individual nodes fail. Those features helped Hadoop become a foundational data management platform for big data analytics uses after it emerged in the mid-2000s.
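As a concrete illustration, here is a minimal sketch of reading a file from HDFS through Hadoop's FileSystem API, called from Scala. The NameNode URI and file path are placeholders, not values from this document.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import scala.io.Source

    object HdfsReadExample {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()
        // Placeholder NameNode address; a real cluster's URI goes here.
        conf.set("fs.defaultFS", "hdfs://namenode:8020")

        val fs = FileSystem.get(conf)
        val path = new Path("/data/events/part-00000") // hypothetical file

        // HDFS replicates blocks across nodes, so reads keep working
        // even if an individual DataNode fails.
        val in = fs.open(path)
        try Source.fromInputStream(in).getLines().take(5).foreach(println)
        finally in.close()
      }
    }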
Sqoop

Enterprises that use Hadoop often find it necessary to transfer some of their data from traditional relational database management systems (RDBMSs) to the Hadoop ecosystem.

Sqoop, an integral part of Hadoop, can perform this transfer in an automated fashion. Moreover, the data imported into Hadoop can be transformed with MapReduce before being exported back to the RDBMS. Sqoop can also generate Java classes for programmatically interacting with imported data.

Sqoop uses a connector-based architecture that allows it to use plugins for connecting with external databases.
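Sqoop is usually driven from the command line, but Sqoop 1 also exposes a Java entry point (org.apache.sqoop.Sqoop.runTool) that can be called from Scala. A minimal sketch, assuming a hypothetical MySQL database and orders table; all connection details are illustrative only:

    import org.apache.sqoop.Sqoop

    object SqoopImportExample {
      def main(args: Array[String]): Unit = {
        // Equivalent to: sqoop import --connect jdbc:mysql://dbhost/sales
        //   --table orders --username etl --target-dir /data/orders
        val sqoopArgs = Array(
          "import",
          "--connect", "jdbc:mysql://dbhost/sales", // hypothetical RDBMS
          "--table", "orders",
          "--username", "etl",
          "--target-dir", "/data/orders"            // HDFS destination
        )
        val exitCode = Sqoop.runTool(sqoopArgs)
        println(s"Sqoop import finished with exit code $exitCode")
      }
    }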

Hive

Apache Hive is open source data warehouse software for reading, writing and managing large data sets stored directly in the Apache Hadoop Distributed File System (HDFS) or in other data storage systems such as Apache HBase. Hive enables SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements for data query and analysis. It is designed to make MapReduce programming easier because you don't have to know and write lengthy Java code. Instead, you can write queries more simply in HQL, and Hive can then create the map and reduce functions.
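One common way to run HQL from an application is through the HiveServer2 JDBC driver. A minimal sketch from Scala, assuming the hive-jdbc driver is on the classpath and a hypothetical orders table exists; host, port and credentials are placeholders:

    import java.sql.DriverManager

    object HiveQueryExample {
      def main(args: Array[String]): Unit = {
        // Placeholder HiveServer2 URL: jdbc:hive2://<host>:<port>/<database>
        val conn = DriverManager.getConnection(
          "jdbc:hive2://hiveserver:10000/default", "hive", "")
        try {
          val stmt = conn.createStatement()
          // HQL reads like standard SQL; Hive compiles it into the
          // underlying map and reduce jobs for you.
          val rs = stmt.executeQuery(
            "SELECT product, SUM(amount) FROM orders GROUP BY product")
          while (rs.next())
            println(s"${rs.getString(1)}\t${rs.getDouble(2)}")
        } finally conn.close()
      }
    }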
Apache Spark

Apache Spark is an open-source, distributed processing system used for big data workloads. It utilizes in-memory caching and optimized query execution for fast analytic queries against data of any size. It provides development APIs in Java, Scala, Python and R, and supports code reuse across multiple workloads: batch processing, interactive queries, real-time analytics, machine learning, and graph processing. You'll find it used by organizations across many industries, including FINRA, Yelp, Zillow, DataXu, Urban Institute, and CrowdStrike. Apache Spark has become one of the most popular big data distributed processing frameworks, with 365,000 meetup members in 2017.
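To make the in-memory caching point concrete, here is a minimal Scala sketch; the input path is hypothetical, and local[*] is used only so the example runs on a single machine:

    import org.apache.spark.sql.SparkSession

    object SparkCacheExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("cache-demo")
          .master("local[*]") // a cluster URL would go here in production
          .getOrCreate()

        val events = spark.read.textFile("/data/events.log") // hypothetical path

        // cache() keeps the dataset in executor memory, so the second
        // action below reads from RAM instead of rescanning the file.
        events.cache()

        val total  = events.count()
        val errors = events.filter(_.contains("ERROR")).count()
        println(s"$errors of $total lines contain ERROR")

        spark.stop()
      }
    }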

Spark SQL / Scala

Spark SQL brings native support for SQL to Spark and streamlines the process of querying data stored both in RDDs (Spark's distributed datasets) and in external sources. Spark SQL conveniently blurs the lines between RDDs and relational tables. Unifying these powerful abstractions makes it easy for developers to intermix SQL commands querying external data with complex analytics, all within a single application. Concretely, Spark SQL allows developers to:

- Import relational data from Parquet files and Hive tables
- Run SQL queries over imported data and existing RDDs
- Easily write RDDs out to Hive tables or Parquet files
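A minimal sketch of those three steps in Scala, using the DataFrame API that Spark SQL layers over RDDs; the Parquet paths and the orders schema are hypothetical:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.desc

    object SparkSqlExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("spark-sql-demo")
          .master("local[*]")
          .getOrCreate()

        // Import relational data from a (hypothetical) Parquet file.
        val orders = spark.read.parquet("/data/orders.parquet")
        orders.createOrReplaceTempView("orders")

        // Run a SQL query, then intermix it with further analytics
        // expressed through the DataFrame API.
        val top = spark
          .sql("SELECT product, SUM(amount) AS total FROM orders GROUP BY product")
          .orderBy(desc("total"))
          .limit(10)

        // Write the result back out as Parquet.
        top.write.mode("overwrite").parquet("/data/top_products.parquet")

        spark.stop()
      }
    }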

Scala is a hybrid functional and object-oriented programming language that runs on the JVM (Java Virtual Machine). The name is an acronym for Scalable Language. It is designed for concurrency, expressiveness, and scalability.
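A small sketch of the language's flavor: an immutable case class for the object-oriented side, higher-order functions for the functional side, and a Future for concurrency.

    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration._
    import scala.concurrent.ExecutionContext.Implicits.global

    object ScalaFlavor {
      // Object-oriented: an immutable case class models a record.
      case class Order(product: String, amount: Double)

      def main(args: Array[String]): Unit = {
        val orders = List(Order("book", 12.5), Order("pen", 1.2), Order("book", 7.5))

        // Functional: higher-order functions transform the collection.
        val totals = orders.groupBy(_.product)
          .map { case (p, os) => p -> os.map(_.amount).sum }
        println(totals) // Map(book -> 20.0, pen -> 1.2)

        // Concurrency: a Future runs work on a thread pool.
        val grandTotal = Future(orders.map(_.amount).sum)
        println(Await.result(grandTotal, 5.seconds))
      }
    }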
