Spark SQL
• Spark SQL is a Spark module for structured data processing.
• Spark SQL is a component on top of Spark Core that introduces a new data
abstraction called SchemaRDD
Spark SQL
• Spark SQL was first released in Spark 1.0 (May 2014).
• It was initially committed by Michael Armbrust and Reynold Xin from Databricks.
• Spark introduces a programming module for structured data processing called
Spark SQL.
• It provides a programming abstraction called DataFrame and can act as a
distributed SQL query engine.
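As a minimal sketch of both ideas (Scala, assuming a spark-shell session where spark is
the predefined SparkSession; the file path and column names are illustrative), a
DataFrame can be built from a file and then queried through the SQL engine:
val people = spark.read.json("/data/people.json")          // hypothetical input file
people.createOrReplaceTempView("people")                   // expose the DataFrame to SQL
spark.sql("SELECT name FROM people WHERE age > 21").show()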
Drawbacks of Hive
• Hive cannot resume processing: if execution fails in the middle of a workflow, it
cannot resume from the point where it stopped.
• Encrypted databases cannot be dropped in cascade when the trash is enabled; doing so
leads to an execution error. To drop such a database, users have to use the PURGE
option.
• Ad-hoc queries are executed as MapReduce jobs launched by Hive; even when analyzing
a medium-sized database, this adds noticeable latency.
• Hive does not support update or delete operations.
• Subquery support is limited.
These drawbacks are the reasons Spark SQL was developed.
Challenges and Solutions
Challenges
• Perform ETL to and from various (semi- or unstructured) data sources.
• Perform advanced analytics (e.g. machine learning, graph processing) that are
hard to express in relational systems.
Solutions
• A DataFrame API that can perform relational operations on both external data
sources and Spark’s built-in RDDs.
• A highly extensible optimizer, Catalyst, that uses features of Scala to add
composable rules, control code generation, and define extension points (see the
sketch below).
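A minimal sketch of that extensibility (Scala, assuming a spark-shell session where
spark is the predefined SparkSession): a custom optimizer rule can be registered
through Spark's experimental methods. The rule below is a deliberate no-op placeholder;
a real rule would pattern-match on plan nodes and return a rewritten plan.
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Placeholder rule: returns the plan unchanged; a real rule would rewrite it.
object NoOpRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

// Register the rule so Catalyst applies it during logical optimization.
spark.experimental.extraOptimizations = Seq(NoOpRule)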
Spark SQL Architecture
• Language API
Spark is compatible with different languages, and so is Spark SQL.
It is supported through these language APIs: Python, Scala, Java, and HiveQL.
• Schema RDD
Spark Core is designed around a special data structure called the RDD.
Spark SQL, in turn, works on schemas, tables, and records,
so a Schema RDD can be used as a temporary table.
This Schema RDD is what we now call a DataFrame.
• Data Sources
Usually the data sources for Spark Core are text files, Avro files, etc. However, the
data sources for Spark SQL are different:
Parquet files, JSON documents, Hive tables, and the Cassandra database.
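For example (a sketch assuming a spark-shell session and hypothetical file and table
names), the same read API covers several of these sources:
val parquetDF = spark.read.parquet("hdfs:///data/events.parquet")
val jsonDF    = spark.read.json("hdfs:///data/events.json")
val hiveDF    = spark.table("default.events")   // requires Hive support to be enabled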
Features of Spark SQL
1. Integrated:
• Seamlessly mix SQL queries with Spark programs.
• Spark SQL lets you query structured data as a distributed dataset (RDD) in Spark,
with integrated APIs in Python, Scala and Java.
• This tight integration makes it easy to run SQL queries alongside complex analytic
algorithms.
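A short sketch of this mix (spark-shell assumed; the input path, view name, and columns
are illustrative): a SQL query and programmatic DataFrame operations on the same data.
val logs = spark.read.json("/data/logs.json")               // hypothetical input
logs.createOrReplaceTempView("logs")                        // make it queryable by SQL
val errors = spark.sql("SELECT level, msg FROM logs WHERE level = 'ERROR'")
errors.groupBy("level").count().show()                      // continue with DataFrame ops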
2. Unified Data Access:
• Load and query data from a variety of sources.
• Schema-RDDs provide a single interface for efficiently working with structured
data, including Apache Hive tables, Parquet files, and JSON files.
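As a sketch of that single interface (spark-shell assumed, paths illustrative), the
source format is just a string passed to the same reader and writer:
val sales = spark.read.format("parquet").load("/data/sales.parquet")
val users = spark.read.format("json").load("/data/users.json")
sales.write.format("json").save("/data/sales_as_json")      // same API for output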
3. Hive Compatibility:
• Run unmodified Hive queries on existing warehouses.
• Spark SQL reuses the Hive frontend and MetaStore, giving you full compatibility
with existing Hive data, queries, and UDFs.
• Simply install it alongside Hive.
SELECT COUNT(*) FROM hiveTable WHERE hive_udf(data)
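A sketch of how that query could be run from Spark (Scala; hiveTable and hive_udf are
assumed to already exist in the Hive metastore):
import org.apache.spark.sql.SparkSession

// enableHiveSupport() makes Spark SQL reuse the existing Hive metastore.
val spark = SparkSession.builder()
  .appName("hive-compat")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SELECT COUNT(*) FROM hiveTable WHERE hive_udf(data)").show()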
4. Standard Connectivity:
• Connect through JDBC or ODBC.
• Spark SQL includes a server mode with industry-standard JDBC and ODBC
connectivity.
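As a sketch (assuming the Thrift JDBC/ODBC server shipped with Spark has been started
and listens on its default port 10000, the Hive JDBC driver is on the classpath, and
the table name and credentials are illustrative), any JDBC client can query it:
import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000", "user", "")
val rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM hiveTable")
while (rs.next()) println(rs.getLong(1))
conn.close()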
5. Scalability:
• Use the same engine for both interactive and long queries.
• Spark SQL takes advantage of the RDD model to support mid-query fault
tolerance, letting it scale to large jobs too.
• Do not worry about using a different engine for historical data.
DataFrame and Dataset
• A distributed collection of data, which is organized into named columns.
• Conceptually, it is equivalent to relational tables with good optimization techniques.
• A DataFrame can be constructed from an array of different sources such as Hive
tables, Structured Data files, external databases, or existing RDDs.
• This API was designed for modern Big Data and data science applications taking
inspiration from DataFrame in R Programming and Pandas in Python.
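One of those construction paths, sketched below (spark-shell assumed; the case class
and sample rows are made up), is building a DataFrame from an existing RDD:
case class Person(name: String, age: Int)
import spark.implicits._

// Build a DataFrame from an existing RDD of case-class records.
val rdd = spark.sparkContext.parallelize(Seq(Person("Ann", 30), Person("Bob", 25)))
val peopleDF = rdd.toDF()
peopleDF.printSchema()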
DataFrame
• Data is organized into named columns, like a table in a relational database
Dataset: a distributed collection of data
• A new interface added in Spark 1.6
• static-typing and runtime type-safety
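A small sketch of the typed interface (spark-shell assumed; the case class and rows are
made up): fields are accessed as ordinary Scala fields rather than by column-name
strings.
case class Person(name: String, age: Int)
import spark.implicits._

val ds = Seq(Person("Ann", 30), Person("Bob", 16)).toDS()   // Dataset[Person]
// p.age is checked by the compiler, unlike a string-based column lookup.
val adults = ds.filter(p => p.age >= 18)
adults.show()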
Features of DataFrame
• Ability to process data ranging from kilobytes to petabytes, on anything from a
single-node cluster to a large cluster.
• Supports different data formats (Avro, CSV, Elasticsearch, and Cassandra) and
storage systems (HDFS, Hive tables, MySQL, etc.).
• State-of-the-art optimization and code generation through the Spark SQL Catalyst
optimizer (a tree transformation framework); see the sketch after this list.
• Can be easily integrated with all Big Data tools and frameworks via Spark-Core.
• Provides APIs for Python, Java, Scala, and R programming.
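The sketch referenced above (spark-shell assumed; the input path is illustrative) shows
Catalyst at work: explain(true) prints the parsed, analyzed, and optimized logical
plans plus the generated physical plan.
import spark.implicits._

val people = spark.read.json("/data/people.json")           // hypothetical input
people.filter($"age" > 21).select("name").explain(true)     // inspect Catalyst's plans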
What’s the catch in Spark SQL???
• Create and Run Spark Programs Faster:
1. Write less code.
2. Read less data.
3. Let the optimizer do the hard work.
• RDD vs. DataFrame:
Less code.
Faster implementation.
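A small illustration of the "less code" point (spark-shell assumed; the data is inline
so the sketch is self-contained): the same per-name average computed with the RDD API
and with the DataFrame API.
import spark.implicits._

val pairs = Seq(("ann", 10), ("bob", 20), ("ann", 30))

// RDD version: manual (sum, count) aggregation, then a second pass to divide.
val rddAvg = spark.sparkContext.parallelize(pairs)
  .mapValues(v => (v, 1))
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
  .mapValues { case (sum, cnt) => sum.toDouble / cnt }

// DataFrame version: one declarative line that Catalyst can also optimize.
val dfAvg = pairs.toDF("name", "value").groupBy("name").avg("value")

rddAvg.collect().foreach(println)
dfAvg.show()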
Plan Optimization and Execution