DAT202.2x Lab03
Overview
In this lab, you will provision an HDInsight Spark cluster. You will then use the Spark cluster to explore
data interactively.
Note: To set up the required environment for the lab, follow the instructions in the Setup document for
this course. Specifically, you must have signed up for an Azure subscription.
Note: The Microsoft Azure portal is continually improved in response to customer feedback. The steps in
this exercise reflect the user interface of the Microsoft Azure portal at the time of writing, but may not
match the latest design of the portal exactly.
Note: As soon as an HDInsight cluster is running, the credit in your Azure subscription will start to be
charged. Free-trial subscriptions include a limited amount of credit that you can spend over a
period of 30 days, which should be enough to complete the labs in this course as long as clusters are
deleted when not in use. If you decide not to complete this lab, follow the instructions in the Clean Up
procedure at the end of the lab to delete your cluster to avoid using your Azure credit unnecessarily.
ssh sshuser@your_cluster_name-ssh.azurehdinsight.net
2. Open a new terminal session, and paste the ssh command, specifying your SSH user name (not
the cluster login username).
3. If you are prompted to connect even though the certificate can’t be verified, enter yes.
4. When prompted, enter the password for the SSH username.
Note: If you have previously connected to a cluster with the same name, the certificate for the older
cluster will still be stored and a connection may be denied because the new certificate does not
match the stored certificate. You can delete the old certificate by using the ssh-keygen command,
specifying the path of your certificate file (-f) and the host record to be removed (-R) - for example:
ssh-keygen -f "/home/usr/.ssh/known_hosts" -R clstr-ssh.azurehdinsight.net
2. Ignore any warnings about a new version of conda being available, and when prompted to
proceed (after a minute or so), enter y.
3. Wait for the command to finish and the prompt to be displayed.
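Note: the following steps assume that the pyspark shell is running and that an RDD named txt has been created from the sample text file, in the same way as the Scala example later in this exercise - for example:
txt = sc.textFile("/example/data/gutenberg/outlineofscience.txt")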
4. Enter the following command to count the number of lines of text in the text file. This triggers
an action, which may take a few seconds to complete before displaying the resulting count.
txt.count()
5. Enter the following command to view the first line in the text file.
txt.first()
6. Enter the following command to create a new RDD named filtTxt that filters the txt RDD so that
only lines containing the word “Leonardo” are included.
filtTxt = txt.filter(lambda txt: "Leonardo" in txt)
7. Enter the following command to count the number of rows in the filtTxt RDD.
filtTxt.count()
8. Enter the following command to display the contents of the filtTxt RDD.
filtTxt.collect()
9. Enter the following command to exit the Python shell, and then press ENTER to return to the
command line.
quit()
2. When the Scala Spark shell has started, note that the Spark context is automatically available as
sc and the SparkSession as spark.
3. Enter the following command to create an RDD named txt from the sample outlineofscience.txt
text file provided by default with all HDInsight clusters.
val txt = sc.textFile("/example/data/gutenberg/outlineofscience.txt")
4. Enter the following command to count the number of lines of text in the text file. This triggers
an action, which may take a few seconds to complete before displaying the long integer result.
txt.count()
5. Enter the following command to view the first line in the text file.
txt.first()
6. Enter the following command to create a new RDD named filtTxt that filters the txt RDD so that
only lines containing the word “science” are included.
val filtTxt = txt.filter(txt => txt.contains("science"))
7. Enter the following command to count the number of rows in the filtTxt RDD.
filtTxt.count()
8. Enter the following command to display the contents of the filtTxt RDD.
filtTxt.collect()
9. Enter the following command to exit the Scala shell, and then press ENTER to return to the
command line.
:quit
1. In the SSH console window, enter the following command to start the nano text editor and
create a file named WordCount.py. If you are prompted to create a new file, click Yes.
nano WordCount.py
2. In the Nano text editor, enter the following Python code. You can copy and paste this from Python
WordCount.txt in the Lab03 folder where you extracted the lab files for this course.
# import and initialize Spark context
from pyspark import SparkConf, SparkContext
cnf = SparkConf().setMaster("local").setAppName("WordCount")
sc = SparkContext(conf = cnf)
# read the sample text file and count word occurrences
# (these lines mirror the Scala WordCount application later in this lab)
txt = sc.textFile("wasb:///example/data/gutenberg/outlineofscience.txt")
words = txt.flatMap(lambda line: line.split(" "))
counts = words.map(lambda word: (word, 1)).reduceByKey(lambda x, y: x + y)
# store results
counts.saveAsTextFile("/wordcount_output")
3. Exit Nano (press CTRL + X) and save WordCount.py (press Y and ENTER when prompted).
4. In the SSH console, enter the following command to submit the WordCount.py Python script to
Spark.
spark-submit WordCount.py
5. Wait for the application to finish, and then in the command line, enter the following command to
view the output files that have been generated.
hdfs dfs -ls /wordcount_output
6. Enter the following command to view the contents of the part-00000 file, which contains the word
counts.
hdfs dfs -cat /wordcount_output/part-00000
Note: If you want to re-run the WordCount.py script, you must first delete the
wordcount_output folder by running the following command:
hdfs dfs -rm -r /wordcount_output
7. Minimize the SSH console (you will use it again later in this lab).
1. In the SSH console, enter the following commands to add the sbt repository and its signing key, and
install the sbt package. You can copy and paste these commands from Install SBT.txt in the Lab03
folder where you extracted the lab files for this course.
echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a
/etc/apt/sources.list.d/sbt.list
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv
2EE0EA64E40A89B84B2DF73499E82A75642AC823
sudo apt-get update
sudo apt-get install sbt
2. Enter the following command to verify the sbt version and download the libraries needed for the
tool to work correctly (ignore any errors):
sbt -version
3. Enter the following commands to clear the console, create a new folder for your application, and
change the current folder to the new folder you created:
clear
mkdir wordcount
cd wordcount
4. Enter the following command to start the nano text editor and create a new sbt project file named
build.sbt.
nano build.sbt
5. Enter the following configuration, being careful to leave one empty line between each setting. You can
copy and paste this text from Build_sbt.txt in the Lab03 folder where you extracted the lab files for
this course.
name := "Word Count"
version := "1.0"
scalaVersion := "2.11.7"
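Note: the complete build definition is provided in Build_sbt.txt. If sbt cannot resolve the Spark classes
used by the application, the build file also needs a dependency on Spark core; assuming the Spark 2.1.0 /
Scala 2.11 versions used elsewhere in this lab, the extra setting would look similar to this:
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0" % "provided"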
6. Exit Nano (press CTRL + X) and save build.sbt (press Y and ENTER when prompted).
7. Enter the following command to create a new Scala application file named wordcount.scala.
nano wordcount.scala
8. Add the following code to the wordcount.scala file. You can copy and paste this from Scala
WordCount.txt in the Lab03 folder where you extracted the lab files for this course.
package edx.course
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
object WordCountApplication{
def main(args: Array[String]){
val cnf = new SparkConf().setAppName("WordCount")
val sc = new SparkContext(cnf)
val txt =
sc.textFile("wasb:///example/data/gutenberg/outlineofscience.txt")
val words = txt.flatMap(line => line.split(" "))
val counts = words.map(word => (word, 1)).reduceByKey((x, y) => x + y)
counts.saveAsTextFile("/output_wordcount_application")
}
}
9. Exit Nano (press CTRL + X) and save wordcount.scala (press Y and ENTER when prompted).
10. Enter the following commands to compile the source code and package the application in a jar file.
sbt compile
sbt package
These commands compile the code and package the application in a jar file with the path
/wordcount/target/scala-2.11/word-count_2.11-1.0.jar.
11. Enter the following command on a single line to submit the application.
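Based on the package, class, and jar path above, the command takes a form similar to this (shown here
as an assumption, entered on a single line):
spark-submit --class edx.course.WordCountApplication target/scala-2.11/word-count_2.11-1.0.jar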
13. Enter the following command to view the contents of the part-00000 file, which contains the word
counts.
hdfs dfs -cat /output_wordcount_application/part-00000
Note: If you want to re-run the WordCount.scala script, you must first delete the output folder by
running the following command:
hdfs dfs -rm -r /output_wordcount_application
Tip: It’s worth spending some time exploring these sample notebooks, as they contain useful
examples that will help you learn more about running Python or Scala code in Spark.
Create a Folder
1. Return to the Home folder in the Jupyter web page, then on the New drop-down menu, click
Folder. This creates a folder named Untitled Folder.
2. Select the checkbox for the Untitled Folder folder, and then above the list of folders, click
Rename. Then rename the directory to Labs.
3. Open the Labs folder, and verify that it contains no notebooks.
Create a Notebook
In this procedure, you will create a notebook using your choice of Python or Scala (or if you prefer, you
can use both).
4. With the cursor still in the cell, on the toolbar click the run cell, select below button. As the code
runs, the circle icon next to Python 2 at the top right of the page changes to a filled (busy) circle,
and then returns to an empty circle when the code has finished running.
5. When the code has finished running, view the output under the cell, which shows the first line in
the text file.
6. In the new cell under the first cell, enter the following code to count the elements in the txt
RDD.
txt.count()
4. With the cursor still in the cell, on the toolbar click the run cell, select below button. As the code
runs, the circle icon next to Spark at the top right of the page changes to a filled (busy) circle,
and then returns to an empty circle when the code has finished running.
5. When the code has finished running, view the output under the cell, which shows the first line in
the text file.
6. In the new cell under the output from the first cell, enter the following code to count the
elements in the txt RDD:
txt.count()
Note: Spark 2.0 introduces the new SparkSession object, which unifies the SqlContext and HiveContext
objects used in previous versions of Spark. The SqlContext and HiveContext objects are still available for
backward-compatibility, but the SparkSession object is the preferred way to work with Spark SQL.
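For example, where code written for earlier Spark versions constructed a SqlContext from the Spark
context before running SQL queries, a Spark 2.0 notebook can use the pre-created spark session directly
(a minimal illustration, not one of the lab steps):
# Spark 1.x: sqlContext = SQLContext(sc); sqlContext.sql("SHOW TABLES")
# Spark 2.0+: the pre-created SparkSession (spark) handles both SQL and Hive access
spark.sql("SHOW TABLES").show()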
In this exercise, you can use your choice of Python or Scala – just follow the instructions in the
appropriate section below.
csv.printSchema()
8. In the new empty cell, enter the following code to use the spark.read.csv method to
automatically infer the schema from the header row of column names and the data the file
contains:
building_csv =
spark.read.csv('wasb:///HdiSamples/HdiSamples/SensorSampleData/building
/building.csv', header=True, inferSchema=True)
building_csv.printSchema()
This technique of inferring the schema makes it easy to read structured data files into a
DataFrame containing multiple columns. However, it incurs a performance overhead; and in
some cases you may want to have specific control over column names or data types.
Consider the following data in a file named HVAC.csv, which you will load in the next step.
Date,Time,TargetTemp,ActualTemp,System,SystemAge,BuildingID
6/1/13,0:00:01,66,58,13,20,4
6/2/13,1:00:01,69,68,3,20,17
6/3/13,2:00:01,70,73,17,20,18
6/4/13,3:00:01,67,63,2,23,15
6/5/13,4:00:01,68,74,16,9,3
6/6/13,5:00:01,67,56,13,28,4
13. In the new empty cell, enter the following code to define a schema named schma, and load the
DataFrame using the spark.read.csv method:
from pyspark.sql.types import *
schma = StructType([
StructField("Date", StringType(), False),
StructField("Time", StringType(), False),
StructField("TargetTemp", IntegerType(), False),
StructField("ActualTemp", StringType(), False),
StructField("System", IntegerType(), False),
StructField("SystemAge", IntegerType(), False),
StructField("BuildingID", IntegerType(), False),
])
hvac_csv =
spark.read.csv('wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVA
C.csv', schema=schma, header=True)
hvac_csv.printSchema()
17. Run the cell, selecting a new cell beneath to view the contents of the DataFrame.
2. Run the cell, selecting a new cell beneath, and view the results.
3. In the empty cell at the bottom of the notebook, enter the following code to create a new
DataFrame named hvac_data by selecting columns in the hvac_csv DataFrame and filtering it to
include only rows where the ActualTemp is higher than the TargetTemp (note that you need to
import the Spark SQL functions library to access the col function, which is used to identify filter
parameters as columns):
from pyspark.sql.functions import *
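# A sketch of the select and filter described in this step, based on the
# equivalent Scala code later in this lab:
hvac_data = hvac_csv.select(col("BuildingID"), col("ActualTemp"), col("TargetTemp")).filter(col("ActualTemp") > col("TargetTemp"))
hvac_data.show()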
4. Run the cell, selecting a new cell beneath, and view the results.
5. In the empty cell at the bottom of the notebook, enter the following code to join the hvac_data
DataFrame to the building_data DataFrame:
hot_buildings = building_data.join(hvac_data, "BuildingID")
hot_buildings.show()
6. Run the cell, selecting a new cell beneath, and view the results.
4. When the code has finished running, view the output returned from the query, which is shown
as the default output format of Table.
5. Click Bar to see the results visualized as a bar chart, and specify the following encoding settings:
a. X: HVACproduct
b. Y: AvgError
c. Func: -
Temporary tables only exist within the current session. However, you can save a DataFrame as a
persisted table that can be accessed by other processes using Spark SQL.
6. In the empty cell, enter the following code to save the building_csv DataFrame as a persisted
table named building so it can be accessed by other processes:
building_csv.write.saveAsTable("building")
1. In the empty cell at the bottom of the notebook, enter the following code, which creates a
DataFrame based on a Hive query against the standard hivesampletable in your cluster, and
then displays its contents:
calls = spark.sql("""SELECT devicemodel, COUNT(*) AS calls
FROM hivesampletable
GROUP BY devicemodel
ORDER BY calls DESC """)
calls.show()
csv.printSchema
4. On the Cell menu, click Run Cells and Select Below (or click the run cell, select below button on
the toolbar) to run the cell, selecting a new cell beneath.
5. When the code has finished running, view the output returned, which describes the schema of
the DataFrame. Note that the file content has been loaded into a DataFrame with a single
column named value.
6. In the new empty cell, enter the following code to display the contents of the DataFrame:
csv.show(truncate = false)
The building.csv file contains details of buildings and their HVAC systems, as shown in the
following extract, so you need to define a schema that reflects this structure and use it to load
the data into a suitable DataFrame.
BuildingID,BuildingMgr,BuildingAge,HVACproduct,Country
1,M1,25,AC1000,USA
2,M2,27,FN39TG,France
3,M3,28,JDNS77,Brazil
4,M4,17,GG1919,Finland
8. In the new empty cell, enter the following code to use the spark.read.csv method to
automatically infer the schema from the header row of column names and the data the file
contains:
val building_csv =
spark.read.option("inferSchema","true").option("header","true").csv("wa
sb:///HdiSamples/HdiSamples/SensorSampleData/building/building.csv")
building_csv.printSchema
Note that the DataFrame shows the data in multiple columns, which are named based on the
header row in the source file.
This technique of inferring the schema makes it easy to read structured data files into a
DataFrame containing multiple columns. However, it incurs a performance overhead; and in
some cases you may want to have specific control over column names or data types.
Consider the following data in a file named HVAC.csv, which you will load in the next step.
Date,Time,TargetTemp,ActualTemp,System,SystemAge,BuildingID
6/1/13,0:00:01,66,58,13,20,4
6/2/13,1:00:01,69,68,3,20,17
6/3/13,2:00:01,70,73,17,20,18
6/4/13,3:00:01,67,63,2,23,15
6/5/13,4:00:01,68,74,16,9,3
6/6/13,5:00:01,67,56,13,28,4
13. In the new empty cell, enter the following code to define a schema named schma, and load the
DataFrame using the spark.read.csv method:
import org.apache.spark.sql.types._;
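A sketch of the schema definition and load for this step, mirroring the Python version earlier in this lab
(the exact column types are an assumption):
val schma = StructType(Seq(
StructField("Date", StringType, false),
StructField("Time", StringType, false),
StructField("TargetTemp", IntegerType, false),
StructField("ActualTemp", IntegerType, false),
StructField("System", IntegerType, false),
StructField("SystemAge", IntegerType, false),
StructField("BuildingID", IntegerType, false)))
val hvac_csv = spark.read.schema(schma).option("header","true").csv("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")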
An alternative (and generally preferred) way to define a schema when using Scala is to create a
case class, and use it to determine the schema for the DataFrame.
16. In the new empty cell, enter the following code to define a case class named HvacReading, and
then use it to derive the schema for the DataFrame:
import org.apache.spark.sql.Encoders
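A sketch of the case class and schema derivation for this step, based on the columns in HVAC.csv (the
field types are assumptions):
case class HvacReading(Date: String, Time: String, TargetTemp: Int, ActualTemp: Int, System: Int, SystemAge: Int, BuildingID: Int)
val hvacSchema = Encoders.product[HvacReading].schema
val hvac_csv = spark.read.schema(hvacSchema).option("header","true").csv("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")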
hvac_csv.printSchema
19. Run the cell, selecting a new cell beneath to view the contents of the DataFrame.
Use DataFrame Methods
1. In the empty cell at the bottom of the notebook, enter the following code to create a new
DataFrame named building_data by selecting columns in the building_csv DataFrame:
val building_data = building_csv.select($"BuildingID", $"BuildingAge",
$"HVACproduct")
building_data.show()
2. Run the cell, selecting a new cell beneath, and view the results.
3. In the empty cell at the bottom of the notebook, enter the following code to create a new
DataFrame named hvac_data by selecting columns in the hvac_csv DataFrame and filtering it to
include only rows where the ActualTemp is higher than the TargetTemp:
var hvac_data = hvac_csv.select($"BuildingID", $"ActualTemp",
$"TargetTemp").filter($"ActualTemp" > $"TargetTemp")
hvac_data.show()
4. Run the cell, selecting a new cell beneath, and view the results.
5. In the empty cell at the bottom of the notebook, enter the following code to join the hvac_data
DataFrame to the building_data DataFrame:
var hot_buildings = building_data.join(hvac_data, "BuildingID")
hot_buildings.show()
6. Run the cell, selecting a new cell beneath, and view the results.
4. When the code has finished running, view the output returned from the query, which is shown
as the default output format of Table.
5. Click Bar to see the results visualized as a bar chart, and specify the following encoding settings:
a. X: HVACproduct
b. Y: AvgError
c. Func: -
Temporary tables only exist within the current session. However, you can save a DataFrame as a
persisted table that can be accessed by other processes using Spark SQL.
6. In the empty cell, enter the following code to save the hvac_csv DataFrame as a persisted table
named hvac so it can be accessed by other processes:
hvac_csv.write.saveAsTable("hvac")
1. In the empty cell at the bottom of the notebook, enter the following code, which creates a
DataFrame based on a Hive query against the standard hivesampletable in your cluster, and
then displays its contents:
val calls = spark.sql("""SELECT devicemodel, COUNT(*) AS calls
FROM hivesampletable
GROUP BY devicemodel
ORDER BY calls DESC """)
calls.show()
1. In the SSH console for your cluster, enter the following command to create a folder named
stream in the shared blob storage for the cluster:
hdfs dfs -mkdir /stream
2. Enter the following command to start the Nano text editor and create a file named text1 on the
local file system of the cluster head node.
nano text1
3. In Nano, enter the following text.
the boy stood on the burning deck
4. Exit Nano (press CTRL + X) and save text1 (press Y and ENTER when prompted).
5. Enter the following command to restart the Nano text editor and create a file named text2.
nano text2
7. Exit Nano (press CTRL + X) and save text2 (press Y and ENTER when prompted).
from pyspark.streaming import StreamingContext
# Create a StreamingContext with a 1-second batch interval
ssc = StreamingContext(sc, 1)
ssc.checkpoint("wasb:///chkpnt")
ssc.start()
Note: This code creates a Spark Streaming context and uses its textFileStream method to create
a stream that reads text files as they are added to the /stream folder. Any new data in the folder
is read into an RDD, and the text is split into words, which are counted within a sliding window
that includes the last 60 seconds of data at 10 second intervals. The pprint method is then used
to write the counted words from each batch to the console.
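A sketch of the stream-processing lines summarized in this note (these lines belong before the ssc.start()
call; the folder path, window and slide durations, and variable names are assumptions based on the note):
words = ssc.textFileStream("/stream").flatMap(lambda line: line.split(" "))
wordCounts = words.map(lambda word: (word, 1)).reduceByKeyAndWindow(lambda a, b: a + b, lambda a, b: a - b, 60, 10)
wordCounts.pprint()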
4. On the Cell menu, click Run Cells and Select Below (or click the run cell, select below button on
the toolbar) to run the cell, selecting a new cell beneath.
5. Wait for the kernel to return to the idle status.
ssc.start()
Note: This code creates a Spark Streaming context and uses its textFileStream method to create
a stream that reads text files as they are added to the /stream folder. Any new data in the folder
is read into an RDD, and the text is split into words, which are counted within a sliding window
that includes the last 60 seconds of data at 10 second intervals. The print method is then used
to write the counted words from each batch to the console.
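A Scala sketch of the streaming logic summarized in this note (the durations, paths, and names are
assumptions based on the note):
import org.apache.spark.streaming._
val ssc = new StreamingContext(sc, Seconds(1))
ssc.checkpoint("wasb:///chkpnt")
val words = ssc.textFileStream("/stream").flatMap(line => line.split(" "))
val wordCounts = words.map(word => (word, 1)).reduceByKeyAndWindow((a: Int, b: Int) => a + b, (a: Int, b: Int) => a - b, Seconds(60), Seconds(10))
wordCounts.print()
ssc.start()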
4. On the Cell menu, click Run Cells and Select Below (or click the run cell, select below button on
the toolbar) to run the cell, selecting a new cell beneath.
5. Wait for the kernel to return to the idle status.
1. In the SSH console, enter the following command to upload a copy of text1 to the stream folder:
hdfs dfs -put text1 /stream/text1_1
2. Wait a few seconds, and then enter the following command to upload a second copy of text1 to
the stream folder:
hdfs dfs -put text1 /stream/text1_2
3. Wait a few more seconds, and then enter the following command to upload a copy of text2 to
the stream folder:
hdfs dfs -put text2 /stream/text2_1
4. Continue uploading copies of the two files to the stream folder over the next minute or so. It
doesn’t matter how many copies of each file you upload - the goal is simply to generate data in
the folder that will be captured by the streaming program.
1. In the Jupyter notebook where you ran the code to start the streaming process, in the empty cell
at the bottom, add the following code to stop the streaming context:
ssc.stop()
2. Review the output generated by the code. Every ten seconds or so, the stream processing
operations should have run, and the time should be displayed with the count of each word within the
previous minute, similar to this:
-------------------------
Time: 2016-03-01 12:00:00
-------------------------
(u'tiger', 2)
(u'stood', 3)
(u'boy', 3)
(u'on', 3)
(u'the', 6)
(u'bright', 1)
(u'burning', 4)
(u'deck', 3)
3. On the File menu click Close and Halt. If prompted, confirm that you want to close the tab.
1. In the SSH console for your cluster, enter the following command to create a folder named
structstream in the shared blob storage for the cluster:
hdfs dfs -mkdir /structstream
2. Enter the following command to start the Nano text editor and create a file named devdata.txt
on the local file system of the cluster head node.
nano devdata.txt
3. In Nano, enter the following text (you can copy and paste this text from devdata.txt in the
Lab03 folder where you extracted the lab files for this course):
{"device":"Dev1","status":"ok"}
{"device":"Dev2","status":"ok"}
{"device":"Dev1","status":"ok"}
{"device":"Dev1","status":"ok"}
{"device":"Dev2","status":"error"}
{"device":"Dev1","status":"ok"}
{"device":"Dev1","status":"error"}
{"device":"Dev2","status":"ok"}
{"device":"Dev2","status":"error"}
{"device":"Dev1","status":"ok"}
4. Exit Nano (press CTRL + X) and save devdata.txt (press Y and ENTER when prompted).
from pyspark.sql.types import *
inputPath = "wasb:///structstream/"
jsonSchema = StructType([
StructField("device", StringType(), False),
StructField("status", StringType(), False)
])
fileDF =
spark.readStream.schema(jsonSchema).option("maxFilesPerTrigger",
1).json(inputPath)
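# A sketch of the aggregation that defines countDF, based on the equivalent
# Scala code later in this exercise:
countDF = fileDF.filter("status == 'error'").groupBy("device").count()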
query =
countDF.writeStream.format("memory").queryName("counts").outputMode("co
mplete").start()
Note: This code defines a query object that writes the output to an in-memory table named
counts. This technique is useful for testing a structured streaming program, but a production
solution would typically write the data to a file or database.
4. On the Cell menu, click Run Cells and Select Below (or click the run cell, select below button on
the toolbar) to run the cell, selecting a new cell beneath.
5. Wait for the kernel to return to the idle status.
val fileDF =
spark.readStream.schema(jsonSchema).option("maxFilesPerTrigger",
1).json(inputPath)
val countDF =
fileDF.filter("status == 'error'").groupBy($"device").count()
val query =
countDF.writeStream.format("memory").queryName("counts").outputMode("co
mplete").start()
Note: This code defines a query object that writes the output to an in-memory table named
counts. This technique is useful for testing a structured streaming program, but a production
solution would typically write the data to a file or database.
4. On the Cell menu, click Run Cells and Select Below (or click the run cell, select below button on
the toolbar) to run the cell, selecting a new cell beneath.
5. Wait for the kernel to return to the idle status.
1. In the SSH console, enter the following command to upload a copy of devdata.txt to the
structstream folder (the copy is saved as a file named 1):
hdfs dfs -put devdata.txt /structstream/1
2. Wait a few seconds, and then enter the following command to upload a second copy of
devdata.txt to the structstream folder (the copy is saved as a file named 2)
hdfs dfs -put devdata.txt /structstream/2
1. In the Jupyter notebook where you ran the code to start the streaming process, in the empty cell
at the bottom, add the following code to query the counts in-memory table containing the
running count of device errors:
%%sql
select * from counts
5. Return to the Jupyter notebook, and re-run the cell containing your SQL query to verify that the
running count of errors is increasing as new data is added to the folder.
6. In the empty cell at the bottom, add the following code to stop the streaming query:
query.stop()
7. On the File menu click Close and Halt. If prompted, confirm that you want to close the tab.
8. Close the web browser and the SSH console.
Using Spark Structured Streaming with Azure Event Hubs
Azure Event Hubs provides reliable real-time message ingestion for streaming solutions. In this exercise,
you will use Azure Event Hubs to ingest simulated device readings, which you will then process using
Spark Structured Streaming.
1. In the Microsoft Azure portal, in the Hub Menu, click New. Then in the Internet of Things menu,
click Event Hubs.
2. In the Create namespace blade, enter the following settings, and then click Create:
• Name: Enter a unique name (and make a note of it!)
• Pricing tier: Basic
• Subscription: Select your Azure subscription
• Resource Group: Select the resource group containing your HDInsight cluster
• Location: Select any available region
• Pin to dashboard: Not selected
3. In the Azure portal, view Notifications to verify that deployment has started. Then wait for the
namespace to be deployed (this can take a few minutes).
4. In the Azure portal, browse to the namespace you created.
5. In the blade for your namespace, click Add Event Hub.
6. In the Create Event Hub blade, enter the following settings (other options may be shown as
unavailable) and click Create:
• Name: Enter a unique name (and make a note of it!)
• Partition Count: 2
7. In the Azure portal, wait for the notification that the event hub has been created.
8. In the blade for your namespace, select the event hub you just created.
9. In the blade for your event hub, click Shared access policies.
10. On the Shared access policies blade for your event hub, click Add. Then add a shared access
policy with the following settings:
• Policy name: DeviceAccess
• Claim: Select Send and Listen
11. Wait for the shared access policy to be created, then select it, and note that primary and
secondary keys and connection strings have been created for it. Copy the primary connection
string to the clipboard - you will use it to connect to your event hub from a simulated client
device in the next procedure.
4. Enter the following command to install the Azure Event Hubs package:
npm install azure-event-hubs
5. Use a text editor to edit the eventclient.js file in the eventclient folder.
6. Modify the script to set the connStr variable to reflect your shared access policy connection
string, as shown here:
var EventHubClient = require('azure-event-hubs').Client;
9. Observe the script running as it submits simulated events. Then leave it running and continue to
the next procedure.
Tip: Copy and paste the code used in this procedure from EventHub Spark Shell.txt in the Lab03 folder
where you extracted the lab files for this course.
1. Open an SSH connection to your Spark cluster (you can use PuTTY on Windows, or the Bash
console on Mac OS X or Linux).
2. Enter the following command to start the Spark shell and load the spark-streaming-eventhubs
package:
spark-shell --packages "com.microsoft.azure:spark-streaming-
eventhubs_2.11:2.1.0"
3. Wait for the Scala prompt to appear, and then enter the following code to define the connection
parameters for your event hub, substituting the appropriate policy key, namespace name, and
event hub name for your event hub:
val eventhubParameters = Map[String, String] (
"eventhubs.policyname" -> "DeviceAccess",
"eventhubs.policykey" -> "<POLICY_KEY>",
"eventhubs.namespace" -> "<EVENT_HUB_NAMESPACE>",
"eventhubs.name" -> "<EVENT_HUB>",
"eventhubs.partition.count" -> "2",
"eventhubs.consumergroup" -> "$Default",
"eventhubs.progressTrackingDir" -> "/eventhubs/progress",
"eventhubs.sql.containsProperties" -> "true"
)
4. Enter the following code to read data from the event hub:
val inputStream = spark.readStream.
format("eventhubs").
options(eventhubParameters).
load()
5. Enter the following code to define a schema for the JSON message:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val jsonSchema = new StructType().
add("device", StringType).
add("reading", StringType)
6. Enter the following code to retrieve the enqueued time and JSON message from the input
stream:
val events = inputStream.select($"enqueuedTime".cast("Timestamp").
alias("enqueuedTime"),from_json($"body".cast("String"), jsonSchema).
alias("sensorReading"))
7. Enter the following code to extract the device name and reading from the JSON message:
val eventdetails =
events.select($"enqueuedTime",$"sensorReading.device".
alias("device"), $"sensorReading.reading".cast("Float").
alias("reading"))
8. Enter the following code to aggregate the average reading for each device over a 1-minute
tumbling window:
val eventAvgs = eventdetails.
withWatermark("enqueuedTime", "10 seconds").
groupBy(
window($"enqueuedTime", "1 minutes"),
$"device"
).avg("reading").
select($"window.start", $"window.end", $"device", $"avg(reading)")
9. Enter the following code to start the query and write the output stream in CSV format:
eventAvgs.writeStream.format("csv").
option("checkpointLocation", "/checkpoint").
option("path", "/streamoutput").
outputMode("append").
start().awaitTermination()
10. Leave the code running for a minute or so. After some time, the console should indicate
progress for each stage that is processed.
Tip: Copy and paste the code used in this procedure from EventHub Notebook.txt in the Lab03 folder
where you extracted the lab files for this course.
1. Open the Jupyter Notebooks dashboard for your Spark cluster and create a new PySpark3
notebook.
2. Enter the following code into the first cell in the notebook, substituting the name of the blob
container and Azure Storage account for your cluster:
from pyspark.sql.types import *
from pyspark.sql.functions import *
devSchema = StructType([
StructField("WindowStart", TimestampType(), False),
StructField("WindowEnd", TimestampType(), False),
StructField("Device", StringType(), False),
StructField("AvgReading", FloatType(), False)
])
devData =
spark.read.csv('wasb://<CONTAINER>@<STORAGE_ACCT>.blob.core.windows.net
/streamoutput/',
schema=devSchema, header=False)
devData.createOrReplaceTempView("devicereadings")
devData.show()
3. Run the first cell. After a while, the first 20 rows of data from the CSV output should be
displayed.
4. In the second cell, enter the following code to query the temporary table you created in the
previous step:
%%sql
SELECT * FROM devicereadings
ORDER BY WindowEnd
5. Run the second cell, and view the table that is displayed.
6. Change the output visualization to a line chart and view the average of AvgReading by
WindowEnd.
7. Change the output visualization to a pie chart and view the average of AvgReading by Device.
8. In the SSH console window where the Scala Spark Structured Streaming query is running, press
CTRL+C to end the query.
9. In the console window where the eventclient Node.js script is running, press CTRL+C to end the
script.
Clean Up
Now that you have finished using Spark, you can delete your cluster, its associated storage account, and
any other Azure resources you have used in this lab. This ensures that you avoid being charged for
resources when you are not using them. If you are using a trial Azure subscription that includes a limited
free credit value, deleting resources maximizes your credit and helps to prevent using it all before the
free trial period has ended.