Spark SQL
• Spark SQL is a Spark module for structured data processing.
• Spark SQL is a component on top of Spark Core that introduces a new data
abstraction called SchemaRDD
Spark SQL
• Spark SQL was first released in Spark 1.0 (May 2014).
• It was initially committed by Michael Armbrust and Reynold Xin from Databricks.
• Spark introduces a programming module for structured data processing called
Spark SQL.
• It provides a programming abstraction called DataFrame and can act as a
distributed SQL query engine.
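As a minimal sketch of both ideas (Scala, assuming a spark-shell session where spark is
the predefined SparkSession; the file path and column names are illustrative), a
DataFrame can be built from a file and then queried through the SQL engine:
val people = spark.read.json("/data/people.json")          // hypothetical input file
people.createOrReplaceTempView("people")                   // expose the DataFrame to SQL
spark.sql("SELECT name FROM people WHERE age > 21").show()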
Drawbacks of Hive
• Hive cannot resume processing: if execution fails in the middle of a workflow, it
cannot resume from the point where it stopped.
• Encrypted databases cannot be dropped in cascade when the trash is enabled; doing so
leads to an execution error. To drop such a database, users have to use the PURGE
option.
• Ad-hoc queries are executed as MapReduce jobs launched by Hive; even when analyzing
a medium-sized database, this adds noticeable latency.
• Hive does not support update or delete operations.
• Subquery support is limited.
These drawbacks are the reasons Spark SQL was developed.
Challenges and Solutions
Challenges
• Perform ETL to and from various (semi- or unstructured) data sources.
• Perform advanced analytics (e.g. machine learning, graph processing) that are
hard to express in relational systems.
Solutions
• A DataFrame API that can perform relational operations on both external data
sources and Spark’s built-in RDDs.
• A highly extensible optimizer, Catalyst, that uses features of Scala to add
composable rules, control code generation, and define extension points (see the
sketch below).
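A minimal sketch of that extensibility (Scala, assuming a spark-shell session where
spark is the predefined SparkSession): a custom optimizer rule can be registered
through Spark's experimental methods. The rule below is a deliberate no-op placeholder;
a real rule would pattern-match on plan nodes and return a rewritten plan.
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Placeholder rule: returns the plan unchanged; a real rule would rewrite it.
object NoOpRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

// Register the rule so Catalyst applies it during logical optimization.
spark.experimental.extraOptimizations = Seq(NoOpRule)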
Spark SQL Architecture
• Language API
Spark is compatible with different languages, and so is Spark SQL.
It is supported through these language APIs: Python, Scala, Java, and HiveQL.
• Schema RDD
Spark Core is designed around a special data structure called the RDD.
Spark SQL, in turn, works on schemas, tables, and records,
so a Schema RDD can be used as a temporary table.
This Schema RDD is what we now call a DataFrame.
• Data Sources
Usually the data sources for Spark Core are text files, Avro files, etc. However, the
data sources for Spark SQL are different:
Parquet files, JSON documents, Hive tables, and the Cassandra database.
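For example (a sketch assuming a spark-shell session and hypothetical file and table
names), the same read API covers several of these sources:
val parquetDF = spark.read.parquet("hdfs:///data/events.parquet")
val jsonDF    = spark.read.json("hdfs:///data/events.json")
val hiveDF    = spark.table("default.events")   // requires Hive support to be enabled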
Features of Spark SQL
1. Integrated:
• Seamlessly mix SQL queries with Spark programs.
• Spark SQL lets you query structured data as a distributed dataset (RDD) in Spark,
with integrated APIs in Python, Scala and Java.
• This tight integration makes it easy to run SQL queries alongside complex analytic
algorithms.
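A short sketch of this mix (spark-shell assumed; the input path, view name, and columns
are illustrative): a SQL query and programmatic DataFrame operations on the same data.
val logs = spark.read.json("/data/logs.json")               // hypothetical input
logs.createOrReplaceTempView("logs")                        // make it queryable by SQL
val errors = spark.sql("SELECT level, msg FROM logs WHERE level = 'ERROR'")
errors.groupBy("level").count().show()                      // continue with DataFrame ops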
2. Unified Data Access:
• Load and query data from a variety of sources.
• Schema-RDDs provide a single interface for efficiently working with structured
data, including Apache Hive tables, Parquet files, and JSON files.
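As a sketch of that single interface (spark-shell assumed, paths illustrative), the
source format is just a string passed to the same reader and writer:
val sales = spark.read.format("parquet").load("/data/sales.parquet")
val users = spark.read.format("json").load("/data/users.json")
sales.write.format("json").save("/data/sales_as_json")      // same API for output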
3. Hive Compatibility:
• Run unmodified Hive queries on existing warehouses.
• Spark SQL reuses the Hive frontend and MetaStore, giving you full compatibility
with existing Hive data, queries, and UDFs.
• Simply install it alongside Hive.
SELECT COUNT(*) FROM hiveTable WHERE hive_udf(data)
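A sketch of how that query could be run from Spark (Scala; hiveTable and hive_udf are
assumed to already exist in the Hive metastore):
import org.apache.spark.sql.SparkSession

// enableHiveSupport() makes Spark SQL reuse the existing Hive metastore.
val spark = SparkSession.builder()
  .appName("hive-compat")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SELECT COUNT(*) FROM hiveTable WHERE hive_udf(data)").show()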
4. Standard Connectivity:
• Connect through JDBC or ODBC.
• Spark SQL includes a server mode with industry-standard JDBC and ODBC
connectivity.
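As a sketch (assuming the Thrift JDBC/ODBC server shipped with Spark has been started
and listens on its default port 10000, the Hive JDBC driver is on the classpath, and
the table name and credentials are illustrative), any JDBC client can query it:
import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000", "user", "")
val rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM hiveTable")
while (rs.next()) println(rs.getLong(1))
conn.close()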
5. Scalability:
• Use the same engine for both interactive and long queries.
• Spark SQL takes advantage of the RDD model to support mid-query fault
tolerance, letting it scale to large jobs too.
• Do not worry about using a different engine for historical data.
DataFrame and Dataset
• A distributed collection of data, which is organized into named columns.
• Conceptually, it is equivalent to relational tables with good optimization techniques.
• A DataFrame can be constructed from an array of different sources such as Hive
tables, Structured Data files, external databases, or existing RDDs.
• This API was designed for modern Big Data and data science applications taking
inspiration from DataFrame in R Programming and Pandas in Python.
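One of those construction paths, sketched below (spark-shell assumed; the case class
and sample rows are made up), is building a DataFrame from an existing RDD:
case class Person(name: String, age: Int)
import spark.implicits._

// Build a DataFrame from an existing RDD of case-class records.
val rdd = spark.sparkContext.parallelize(Seq(Person("Ann", 30), Person("Bob", 25)))
val peopleDF = rdd.toDF()
peopleDF.printSchema()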
DataFrame
• Data is organized into named columns, like a table in a relational database
Dataset: a distributed collection of data
• A new interface added in Spark 1.6
• static-typing and runtime type-safety
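A small sketch of the typed interface (spark-shell assumed; the case class and rows are
made up): fields are accessed as ordinary Scala fields rather than by column-name
strings.
case class Person(name: String, age: Int)
import spark.implicits._

val ds = Seq(Person("Ann", 30), Person("Bob", 16)).toDS()   // Dataset[Person]
// p.age is checked by the compiler, unlike a string-based column lookup.
val adults = ds.filter(p => p.age >= 18)
adults.show()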
Features of DataFrame
• Ability to process data ranging from kilobytes to petabytes, on anything from a
single-node cluster to a large cluster.
• Supports different data formats (Avro, CSV, Elasticsearch, and Cassandra) and
storage systems (HDFS, Hive tables, MySQL, etc.).
• State-of-the-art optimization and code generation through the Spark SQL Catalyst
optimizer (a tree transformation framework); see the sketch after this list.
• Can be easily integrated with all Big Data tools and frameworks via Spark-Core.
• Provides APIs for Python, Java, Scala, and R programming.
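The sketch referenced above (spark-shell assumed; the input path is illustrative) shows
Catalyst at work: explain(true) prints the parsed, analyzed, and optimized logical
plans plus the generated physical plan.
import spark.implicits._

val people = spark.read.json("/data/people.json")           // hypothetical input
people.filter($"age" > 21).select("name").explain(true)     // inspect Catalyst's plans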
What’s the catch in Spark SQL???
• Create and Run Spark Programs Faster:
1. Write less code.
2. Read less data.
3. Let the optimizer do the hard work.
• RDD vs. DataFrame:
Less code.
Faster implementation.
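A small illustration of the "less code" point (spark-shell assumed; the data is inline
so the sketch is self-contained): the same per-name average computed with the RDD API
and with the DataFrame API.
import spark.implicits._

val pairs = Seq(("ann", 10), ("bob", 20), ("ann", 30))

// RDD version: manual (sum, count) aggregation, then a second pass to divide.
val rddAvg = spark.sparkContext.parallelize(pairs)
  .mapValues(v => (v, 1))
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
  .mapValues { case (sum, cnt) => sum.toDouble / cnt }

// DataFrame version: one declarative line that Catalyst can also optimize.
val dfAvg = pairs.toDF("name", "value").groupBy("name").avg("value")

rddAvg.collect().foreach(println)
dfAvg.show()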
Plan Optimization and Execution