Hadoop for Windows Succinctly
Dave Vickers
Foreword by Daniel Jebaraj
Copyright © 2019 by Syncfusion, Inc.
If you obtained this book from any other source, please register and download a free copy from
www.syncfusion.com.
The authors and copyright holders provide absolutely no warranty for any information provided.
The authors and copyright holders shall not be liable for any claim, damages, or any other
liability arising from, out of, or in connection with the information in this book.
Please do not use this book if the listed terms are unacceptable.
Table of Contents
Acknowledgements
Introduction
Summary
Summary
Chapter 3 Programming Enterprise Hadoop in Windows
Executing data-warehousing tasks using Hive Query Language over MapReduce
Utilizing Apache Pig and using Sqoop with external data sources
Summary
Chapter 4 Hadoop Integration and Business Intelligence (BI) Tools in Windows
Hadoop Integration in Windows and SQL Server 2019 CTP 2.0
Three features from Hadoop in Linux I'd like to see more of
Using Azure Data Studio with large datasets in Windows Hadoop
Conclusion
The Story Behind the Succinctly Series
of Books
Free forever
Syncfusion will be working to produce books on several topics. The books will always be free.
Any updates we publish will also be free.
We sincerely hope you enjoy reading this book and that it helps you better understand the topic
of study. Thank you for reading.
About the Author
After graduating from the largest school of Architecture and Town Planning in Europe, Dave
Vickers began his professional career at the Acer Engineering Group in London.
The absence of turnkey interfaces between content providers and network operators led Dave
to create DavincksOne. Its aim is to develop solutions that meet the 5G Development and
Validation Platform standards for global, industry-specific networks, in particular the 5G-XCast
Media and Entertainment vertical and Object-Oriented Broadcasting solutions.
Acknowledgements
I would like to thank my father, whose ethic of hard work and self-sacrifice made my education
possible, and my mother, who has supported me in everything I do, and with whom I have shared
so much. I dedicate this book to my younger sister Natasha, whose achievements in her young
life are an inspiration, and to our beloved Leonora, gone but not forgotten, and loved by all. To
all those generous enough to share their knowledge with me over the years, thank you.
Introduction
Hadoop is a collection of utilities that work together to enable distributed storage and processing
of very large datasets. Since its inception, it has almost exclusively been associated with Linux
operating systems. An example of this is the number of text books and publications focusing on
Hadoop for Linux. Conversely, the number of text books focusing on Hadoop for Windows is
almost non-existent.
It's important at this stage to be clear about what I mean when I say “Hadoop for Windows.”
Hadoop for Windows refers to Hadoop running directly on the Microsoft Windows operating
system, with native support for Windows provided.
There are many online examples of people recommending other options to run Hadoop
for Windows. They seem unaware that Hadoop can be installed directly onto Windows, so it’s
not something they consider. The aim of this book is firstly to make people aware that Hadoop
runs perfectly on Windows. The subsequent aim is to guide the reader in the installation and
usage of Hadoop on the Windows platform.
Although this book is about Hadoop for Windows, it would not be credible to omit referring to
Linux where it’s pertinent to do so. By comparing the two environments as operating systems for
Hadoop, we may discover the reasons behind the popularity of Linux. That said, there are some
fairly obvious reasons why Hadoop has been so heavily associated with Linux. Hadoop is open
source, so an open source operating system such as Linux was always a natural pairing; both
are also available free of charge. As important as the role of Linux has been, the role of
Microsoft is equally important. Microsoft deployed Hadoop in the cloud via HDInsight on
Microsoft Azure.
HDInsight was the result of Hortonworks collaborating with Microsoft, and was based upon the
Hortonworks Hadoop distribution. A desktop emulator version was available, and Hortonworks
released its own Hadoop distribution for Windows, which is now archived. Since then,
Hortonworks has promoted its Hadoop Sandbox for Windows that only runs on a virtual
machine. The end of HDInsight for Windows finally came in July 2018, leaving only HDInsight
for Linux. Naturally, it raised eyebrows that Microsoft ended HDInsight for Windows, and various
questions were raised, including: Why was Hadoop for Windows named HDInsight? If you
asked IT professionals if they knew Microsoft had released Hadoop for Windows, how many
would know? The reality is that Microsoft has never released a multi-node version of Hadoop for
on-premises usage.
The optimum solution may have been to offer the same HDInsight solution on premises that
was offered in the cloud. There have been numerous products that haven’t done as well as they
could have, as they were primarily cloud-based. IBM Watson Analytics springs to mind—it’s an
intelligent piece of software, but unavailable on premises, so it lost on-premises sales.
An on-premises setup puts you in control, but in Azure cloud you can only use Ranger, Kafka,
Interactive Query, and Spark on HDInsight for Linux. You can't use them in the retired HDInsight
for Windows, nor can you create or resize Windows clusters. Despite this, Microsoft feels that its
Hadoop deployment has an edge over competitors by using Azure Storage to store data,
instead of on-premises storage or nodes.
Figure 1: Cluster creation on HDInsight for Windows was retired in July 2018
These factors together may give the impression that running Hadoop for Windows has been a
challenge. All the more reason for letting users know they can easily run Hadoop on a
supported Windows platform, just like Linux users run Hadoop on supported Linux platforms.
Table 1: General requirements for Hadoop for Windows and Linux
Architecture type: AMD64, Intel x86, x86_64 (Windows); AMD64, Intel x86, x86_64 (Linux)
Table 2 shows that it’s not just Hadoop, but also the Windows operating system, that is creating
more demanding requirements. Windows Server also has a cost, whereas Ubuntu Server is free
of charge.
Table 2: Windows Server 2016 vs. Linux Ubuntu Server 16.04 LTS minimum requirements
These factors will clearly influence people’s choices when it comes to acquisition of a big data
solution. But in a corporate environment you would need internal or external support for your
solution, and this has a cost. The yearly costs for Ubuntu Advantage support from
Canonical highlight the cost of using Linux Server in such an environment.
Table 3: 2018 Ubuntu Advantage Server support – Yearly costs per node in U.S. dollars
Using Linux Ubuntu Desktop in a similar supported environment also has a cost, so essentially,
Linux in a corporate environment is not free.
Table 4: 2018 Ubuntu Desktop support – Yearly costs per node in US dollars
Finally, Table 5 shows that the zero-cost Ubuntu desktop actually has higher system
requirements than Windows 10. This is not surprising, as Windows 10 is now an older system
with older basic requirements. Newer releases of Linux, such as Red Hat Enterprise
Workstation, offer unlimited RAM, the ability to use a second CPU socket, and four virtualized
guests.
Table 5: Windows 10 and Linux Ubuntu 16.04 Desktop recommended minimum requirements
RAM: 2 GB (Windows 10); 2 GB (Ubuntu 16.04 Desktop)
By its very nature, the computer industry does not stand still. The long-held belief that Windows
is the operating system of choice for most users still holds true. At the same time, Red Hat, Inc.
became the first open-source provider to reach revenues of more than a billion dollars. This fact
cannot be overstated; the same can be said for Red Hat Enterprise Linux Server, which many
corporate users prefer over Windows Server.
Microsoft is aware of these challenges; the reason it's possible to run Apache Hadoop for
Windows is because Microsoft made significant changes to the Windows operating system.
Official Apache Hadoop releases have included native support for Windows since Hadoop 2.2.
This has opened the door for Windows-based Hadoop and Windows-based Hadoop vendors.
With so many data applications running on Windows, it’s important that they’re able to connect
to Hadoop as efficiently as possible. The optimum solution is for Hadoop to be available within
the same Windows environment as the applications themselves. Microsoft Power BI, for
example, can connect to 80 data sources, including Apache Hadoop File (HDFS) and other big
data solutions, without ODBC for Hive. I know of no other business intelligence application with
the ability to connect to so many systems.
The current version of Apache Hadoop is 2.9.2, and native support has been provided for
Windows since Apache Hadoop 2.2. Apache Hadoop 3.1.1 is also a popular release of the
software and good resources include Hadoop Wiki and the Apache Hadoop download page.
Windows users accustomed to automated installers may find these resources for installing Hadoop
challenging. In the wider context, even Linux users can have problems installing multi-node
Hadoop. This has led to companies taking Apache Hadoop and bundling it with other tools
to create user-friendly Hadoop installers.
For the best experience, you require the solution to be 100 percent Apache Hadoop-compliant.
Apache Hadoop is free and open source, and by installing it yourself, you will learn more than
you will by using an automated installer. Once you get past the installation, the functions and
commands you use in Apache Hadoop are the same commands you use in other vendors’
bundled Hadoop distributions, provided they are Apache Hadoop-compliant.
Microsoft HDInsight
HDInsight for Windows was the Microsoft Hadoop offering based on the Hortonworks Hadoop
platform. As mentioned previously, it was available via Microsoft Azure cloud, and on premises
via HDInsight Desktop Emulator. It was retired in July 2018 in favor of HDInsight for Linux. You
will notice in the following screenshot that the latest HDInsight Emulator was a 2014 release, so
as of 2018 it was four years old. The age of the last release may give some indication of
Microsoft’s commitment to on-premises HDInsight.
This means the only way to access a Hadoop distribution from Microsoft is cloud-based
HDInsight for Linux. This can be a cheaper option than on-premises Hadoop, and it’s supported
by Canonical, which supports HDInsight 3.4 onward on Ubuntu Server. Ubuntu Server is now a
certified guest system for HDInsight access.
Microsoft gave several reasons for HDInsight now being supported only on Linux.
If Microsoft makes HDInsight open source, I can see those reasons benefiting
HDInsight over time. But placing a non-open source project into an open source community is a
different proposition. Remember, there are companies with much bigger on-premises Hadoop
installations than your typical Microsoft Azure customer. They may be more interested in other
Microsoft big-data innovations with on-premises options.
In November 2018, the public preview of Microsoft Azure Service Fabric Mesh became
available. It’s an upgrade of Azure Service Fabric that allows you to build applications on shared
nodes that scale as the need arises. Many elements of Microsoft Azure already run on Service
Fabric, which means that Hadoop is not the only big-data option for Microsoft. It's
straightforward to install a one- or five-node Azure Service Fabric system, thanks to the time
and resources Microsoft invested in it. If you choose to install it, ensure that there are no
warnings or errors, and that the Cluster Health State shows OK, as highlighted in Figure 4.
Hortonworks
Hortonworks on-premises Hadoop took the form of an offline installer for Windows. As
previously mentioned, the product is now archived, and Hortonworks promotes its Sandbox
running on VMware or VirtualBox in Windows.
Figure 5: Hortonworks HDP v2.3, a Hadoop installer for Windows
Hortonworks Sandbox is able to bypass issues associated with installing Hadoop on premises.
That said, in my experience Hortonworks HDP installer works perfectly on Windows, so why is
the Sandbox needed? Sure, it’s not the most intuitive installer, but don’t let this detract from the
positives. I found it very fast and very thorough, and it provided smoke tests to ensure that Pig,
Hive, and the Hadoop Ecosystem were working perfectly.
Figure 6: Post Installation Smoke Test for Pig on Hortonworks HDP v2.3 for Windows
While the HDP installer provides a classic Hadoop installation from the command line, the
interactive Hortonworks Sandbox interface isn’t available in Windows. Would the more
interactive Sandbox environment running directly on Windows have been a better version of
Hadoop to build for Windows? If Windows is anything, it’s interactive!
Where Hortonworks tried to produce on premises Hadoop for Windows, Syncfusion succeeded.
The Syncfusion Big Data Platform runs perfectly on Windows and can be installed in much the
same way as any other Windows software. It’s slightly more complex for a multi-node installation, but
that is partly the nature of Hadoop. Importantly, with each new release of the platform, there has
been significant progress.
Some users of the Hortonworks HDP platform raised issues, such as having to set up nodes
manually. With Syncfusion you don’t have to—it’s made adding nodes and creating clusters
straightforward, while adding seamless integration with Active Directory. This gives the user a
choice of on- and off-premises deployment options, and as an offline installer for Hadoop,
Syncfusion Big Data Platform is unsurpassed. The online installations from major Hadoop
vendors for Linux are impressive, but installing offline isn’t always so easy. Syncfusion Big Data
platform puts you in control of where, when, and how you want to install Hadoop. It’s interesting
that the best offline Hadoop installer on any platform is for a Windows platform.
Chapter 1 Installing Hadoop for Windows
• Hadoop 2.0 or later: You can download the Hadoop binary file hadoop-2.9.2.tar.gz
from here.
• Microsoft Windows: Windows 7, 8, 10, and Windows Server 2008 and above.
• Additional prerequisites: You’ll also need a text editor, such as Notepad or Notepad
++, for writing short amounts of code, and Winutils 3.1, which you can download from
GitHub.
Figure 9: Default Java installation path
Ensure that you see the screen informing you that Java has been successfully installed.
Go to Control Panel > System and Security > System, click Advanced System Settings,
and then click the Environment Variables button. Whether creating a new environment
variable for JAVA_HOME or editing an existing one, you must replace the Program Files text
with a form Hadoop can interpret. On Windows 8, to create a Hadoop-compatible JAVA_HOME
value, enter Progra~1 instead of Program Files when entering the Java location in the Variable
value field. On Windows 10 and Windows Server, avoid folder names that contain spaces.
Please ensure that you add JAVA_HOME to the Path variable in System Variables. In this
instance, that is done by adding %JAVA_HOME%\bin to the Path variable value field, separated
from the other entries by semicolons. Use the java -version command from a command prompt
to ensure that Java is installed and running correctly.
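If you prefer the command line, the same variable can be set from an elevated command prompt. This is only a sketch: the JDK folder name below is an example and must be changed to match your own installation.

setx JAVA_HOME "C:\Progra~1\Java\jdk1.8.0_201" /M
rem Once %JAVA_HOME%\bin is on the Path, confirm Java is reachable:
java -version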
2. Using an application such as 7-Zip File Manager, extract the Hadoop binary file hadoop-
2.9.2.tar.gz from this website to a directory of your choice, or directly to
C:\hadoop\hadoop-2.9.2. If you choose to extract the files to a directory of your choice,
you then have to copy the extracted files to C:\hadoop. You may find it more
convenient to extract them directly to C:\hadoop, which will then have an extracted
folder in it called hadoop-2.9.2, so you end up with the path C:\hadoop\hadoop-2.9.2.
3. You can now create a HADOOP_HOME variable, similar to how we created JAVA_HOME
previously, by going back to Control Panel > System and Security > System, clicking
Advanced System Settings, and then clicking the Environment Variables button. Create
the Hadoop home by adding the system variable name HADOOP_HOME, with the system
variable value being the folder that we extracted the Hadoop binary to, which was
C:\hadoop\hadoop-2.9.2.
We must add the HADOOP_HOME bin folder to the Path variable in System variables. In this
instance, that is done by adding %HADOOP_HOME%\bin to the Variable value field, separated
from the other entries by semicolons.
In addition, we must add a second HADOOP_HOME entry to the Path variable for the Hadoop
folder called sbin. This is done by adding %HADOOP_HOME%\sbin to the Variable value field,
again separated by semicolons. You should now have Hadoop and Java homes, and two
Hadoop path entries.
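To confirm the variables are in place, open a new command prompt and check them. The values assume the paths used above; hadoop version may not run cleanly until the winutils bin folder is swapped in later in this chapter.

echo %JAVA_HOME%
echo %HADOOP_HOME%
rem Should print the Java and C:\hadoop\hadoop-2.9.2 locations set above
hadoop version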
Figure 15: Java and Hadoop homes
The resource page I mentioned previously is an official Apache resource that will assist us in
finishing the installation. The area of the site we need first is “Section 3.1. Example HDFS
Configuration,” which states:
“Before you can start the Hadoop Daemons you will need to make a few edits to
configuration files. The configuration file templates will all be found in
c:\deploy\etc\hadoop, assuming your installation directory is c:\deploy.”
Code Listing 1: The core-site.xml file format
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://0.0.0.0:19000</value>
</property>
</configuration>
We need to replace the name and value elements shown in the core-site.xml template on the
Apache wiki page with values that reflect the installation we are carrying out. The values we
require are shown in the following code listing.
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
We need to do the same for the hdfs-site.xml file template, and the new values we require are
in the following code listing.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///C:/Hadoop/hadoop-2.9.2/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///C:/Hadoop/hadoop-2.9.2/datanode</value>
</property>
</configuration>
The preceding values make up Code Listing 3. Note that the Hadoop configuration files use
forward slashes instead of backward slashes in file paths, even on Windows systems.
Next, we need to edit the mapred-site.xml configuration file; the values required are shown in
the following code listing.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
We also need to edit the yarn-site.xml configuration file; the values required are provided in the
following code listing.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.auxservices.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
3. Make a copy of the bin folder at C:\hadoop\hadoop-2.9.2\bin, and then delete the
folder you made the copy from.
4. Copy the bin folder you extracted from the apache-hadoop-3.1.0-winutils-master file to
C:\hadoop\hadoop-2.9.2\; it replaces the bin folder you deleted.
Now we must follow the instructions in section 3.4 of the Hadoop Wiki page, “3.4. Format the
FileSystem.” This is done by executing the following command (with administrator privileges)
from a command shell:
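With %HADOOP_HOME%\bin on the Path, the standard HDFS format command can be run without any prefix:

hdfs namenode -format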
You start HDFS daemons by running the following code from the command prompt.
start-dfs.cmd
You start YARN daemons and run a YARN job by running the following code.
Code Listing 8: Start YARN daemons and run a YARN job command
start-yarn.cmd
You should now see the Hadoop namenode and datanode successfully started.
In addition, you will see the YARN resourcemanager and YARN nodemanager successfully
started.
Figure 18: Yarn resourcemanager and Yarn nodemanager successfully started
The finished Hadoop installation directory is shown in the following image. I have added a folder
called datastore, into which I have placed two text files called ukhousetransactions.txt and
ukhousetransactions2.txt.
The next piece of code, hadoop fs -ls /, confirms the directory has been created by listing
the contents of the HDFS root.
C:\Windows\system32>hadoop fs -ls /
Found 1 items
drwxr-xr-x - Dave supergroup 0 2018-12-11 13:52 /bigdata
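For reference, a directory such as the /bigdata folder shown in the listing above would have been created beforehand with a command of this form:

hadoop fs -mkdir /bigdata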
Try to copy the second file, ukhousetransactions2.txt, from memory, using the same command
you used for the first file. If you’re new to Hadoop, you’ll get used to the command line much
more quickly if you memorize the basic commands.
Code Listing 11: Copying Text Files to the Hadoop Distributed File System (HDFS)
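A command of the following form copies a file from the local datastore folder into the /bigdata directory; the exact listing used in the book may have differed slightly. The second file, ukhousetransactions2.txt, is copied in exactly the same way.

C:\hadoop\hadoop-2.9.2\datastore>hadoop fs -copyFromLocal ukhousetransactions.txt /bigdata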
We can now list the files we have copied with the following command.
C:\hadoop\hadoop-2.9.2\datastore>hadoop fs -ls -R /
Figure 21: HDFS connector for Power BI
Now that we have Hadoop installed, we can revisit this by selecting the Hadoop File (HDFS)
connection in Power BI and clicking Connect. The aim is to connect to HDFS without a Hive
ODBC driver.
Enter localhost as the name of the server that HDFS is installed on, then click OK.
Figure 23: Connecting to HDFS from Power BI in Windows
The first time you get past the preceding screen, you may see a screen similar to the one shown
in Figure 24, asking for your preferred method of security access. This can also happen if you
enter something other than localhost on the preceding screen, such as localhost:9000. Enter
only localhost, then click OK, as shown in the previous screen. If a screen similar to Figure 24
still appears in response to entering localhost, simply click Connect.
The files we copied to HDFS are now accessible in Power BI. Click Load to continue.
Figure 25: Loading files from HDFS into Power BI without Hive ODBC
Now all the tools in Power BI can be used on the data and queries run against it.
In addition, we have the ability to combine files in Power BI for multiple-file loading from HDFS.
Figure 27: Combining multiple files from HDFS in Power BI
We also have the ability to remove duplicates from the UK house transactions files we loaded
into HDFS.
Figure 28: Removing duplicates from the house transactions files in Power BI
Now we can create dashboards from the HDFS data; the integration is such that we can
automatically convert numeric text to numeric values. This allows us to ask math-based
questions using free text, a function seen previously in the online BI tool IBM Watson Analytics.
Figure 29: Creating a Dashboard from HDFS data in Microsoft Power BI
I started typing the question "How many England and Wales Aug" and, before I could finish the
sentence, it automatically calculated 105K. This is a rounded version of the correct answer of
104,590, as shown next. It also automatically offered the option to select data for the other months.
Figure 30: Power BI calculating a math answer from a free text question
Everything we required for this exercise runs in the same Windows environment. As good as
this is, how would you recommend it to peers? HDInsight is now cloud-only and Linux-only, and
Hortonworks HDP for Windows is either archived or runs in a VM, leaving a manual Apache
Hadoop installation or the Syncfusion Big Data Platform.
Summary
Despite learning about the Linux dominance of Hadoop and questions surrounding the ability of
Windows to run Hadoop, we were able to install Apache Hadoop 2.9.2 on Windows 8 quite
easily. We did this with no changes to Windows, other than installing the required software. We
then created a directory within HDFS to store files and accessed them with ease from Power BI.
This was an advantage over having to access Hadoop on Linux from an external Windows
system running BI tools. Using the command prompt in Windows felt no different than using
Hadoop in Linux. Memory management was good, with usage seldom going above 4 GB of RAM,
and there was no deterioration in system performance at any time. You can freely download
Hadoop and a free version of Power BI, and you probably already have Windows. This is the
strength of Hadoop for Windows: the lack of disruption. Everything else you need already runs
on Windows. It's all already there, so it makes sense to invite Hadoop into Windows rather than
move to Linux.
Chapter 2 Enterprise Hadoop for Windows
Three vendors have released software that falls into this category: Apache, Hortonworks, and
Syncfusion. We’ve already done an Apache installation, and Hortonworks is in archive, so
Syncfusion and Microsoft remain. Why do I include Microsoft? It’s because their enterprise-level
changes to Windows Server made multi-node Hadoop installations possible. They have gone
even further with their flagship enterprise database product, SQL Server 2019. Though not yet
released, the technology preview of SQL Server 2019 has taken Hadoop integration in Windows
to new levels. We will look more closely at that product in Chapter 4. All this progression would
not have been possible without Apache themselves, as can be seen at Apache.org.
The next two figures highlight this, with the first discussing a patch for running Hadoop for
Windows without Cygwin. This was an important milestone in terms of Hadoop being able to run
on Windows. On the left-hand side in Figure 31, there are numerous other issues that have
been raised and logged.
Figure 32 shows an enhancement to support Hadoop natively on Windows Server and Windows
Azure environments. This highlights an expectation to run Hadoop natively on Windows, the
same way Hadoop runs on Linux. Managing user expectations is vital in this area.
Figure 32: Enhancement to support Hadoop for Windows Server and Windows Azure at Apache.org
We'll be using the Hadoop distribution from Syncfusion, which is available from Syncfusion.com.
Before installation, there are a few things we need to be aware of. Windows and Linux are very
different environments, and the same application will not necessarily behave identically in
each. If the version of Spark your customer uses is more recent than the one you're
demonstrating, be aware of that in advance. If there are features of the latest version of Spark
that your client depends on, it may be an issue for you. For these reasons, Hadoop distributors
need to update Hadoop ecosystem components within reasonable timescales.
There are key features that Linux developers will expect to see in Hadoop for Windows. Often
these issues are sorted out by looking at the feature sets of the Hadoop distribution in question.
Impala, for example, is thought to be the fastest SQL-type engine for Hadoop, but it only runs on
Linux. A bigger problem is when the issue is not the feature set of the Hadoop distribution, but
the operating system itself. In Linux you have control groups (cgroups), which aren't present in
Windows Server, nor is there an equivalent. I will discuss cgroups in Chapter 3 in the section
about memory management and Hadoop performance in Windows. When you talk to Linux
users about potentially using Hadoop in Windows, you should demonstrate awareness of these
matters. While Hyper-V and virtual machines can be used to allocate resources in Windows,
they're just not the same as cgroups in Linux.
Network setup and installation
Before we set up a production cluster, we need to understand the network we’re going to install
it on. Sometimes you hear complaints that Hadoop is slow or doesn’t meet expectations. Often,
it’s because the network it’s installed on isn’t the optimum network for Hadoop. A positive of
Microsoft Azure is that Microsoft gives you all the computing power you need to run Hadoop.
This enables companies to analyze a hundred terabytes of data or more. If you're fortunate
enough to be able to build your own data network, build the fastest network you possibly can. If
you have access to physical servers, use those instead of virtual servers—you’ll notice the
power of a production cluster on a more powerful network.
There is a price premium for these gains, but they can be negated by cost savings per terabyte.
The faster a system can analyze data, the less time you spend running the cluster and its
associated electricity, CPU, and cooling costs. This is partly how HDInsight works; you pay for
what you use, and can provision or decommission clusters when you’re not using them. You
can do this yourself on-premises, but you’d have setup costs that you don’t have on Azure.
If you’re dealing with hundreds of terabytes, a good strategic investment may be a 10–100 Gbps
switch; this gives a wider coverage of network speeds without having to change switches.
Figure 34: Cisco Nexus 7700 Switch - 10, 40, and 100 Gbps
Your server adapter should be of a speed commensurate to that of your network. A 40-Gbps
adapter for your server is optimum; your PC’s network adapter should be at least 1 Gbps.
Figure 35: QLogic 40 Gbps Ethernet Adapter
The following cables can manage up to 40 Gbps, with the most economical solution being to buy
bulk Cat 8 cabling and fit the RJ45 plugs yourself.
Figure 36: BAKTOONS Cat 8, 40 Gbps RJ45 (left) and Cat 8 bulk cable 25/40 Gbps (right)
Active Namenode Server: CPU 2-4 Octa-core+; 96 GB RAM; Hard Drive 2 × 1 TB; Network 10 Gbps
Standby Namenode Server: CPU 2-4 Octa-core+; 96 GB RAM; Hard Drive 2 × 1 TB; Network 10 Gbps
Datanode 1: CPU 2-4 Octa-core+; 64 GB RAM; Hard Drive 4 × 1 TB SAS, JBOD (16 × 1 TB); Network 10 Gbps
Datanode 2 (if needed): CPU 2-4 Octa-core+; 64 GB RAM; Hard Drive 4 × 1 TB SAS, JBOD (16 × 1 TB); Network 10 Gbps
Cluster Manager: CPU 2-4 Octa-core+; 32 GB RAM; Hard Drive 2 × 1 TB; Network 10 Gbps
If you can’t access servers with the RAM listed in the preceding table, use nodes with at least
32 GB of RAM. You won’t be able to handle very large amounts of data, but it will certainly work.
We now have the server roles in our cluster defined; I’d recommend four physical servers over
virtual ones. The pricing for Windows Server is listed, but if you already have Windows Server
licenses, you can use those.
A minimum of eight core licenses must be purchased per processor (and 16 per server), and
every core in the server has to be licensed. A one-CPU quad-core server is therefore no cheaper
to license than a two-CPU quad-core server, because of these minimums.
The reason I wouldn’t recommend one physical server to host three or four virtual ones is that if
it shuts down due to CPU overheating, it will shut down all running virtual servers with it. This
leaves even the Hadoop standby node unavailable, and your Hadoop cluster becomes useless.
You need to work out how mission critical the data and operations on your servers are going to
be. Put yourself in the situation of something having gone wrong, and ask yourself what
decisions you'd make. You may wish to use solid state drives (SSDs) or high-RPM hard disks.
Of the two disk types, SSDs have more efficient energy use. While this is not a book about
computer networking, if you hire someone to build a network for you, check that what you’ve
specified is delivered. If you’ve paid for high-quality network components, find a way to check
those components, and make sure inferior ones aren’t used in parts of the network you can’t
see or that are underground.
You can sign up for a free Syncfusion account to download the files from Syncfusion. For
businesses with a turnover of less than £1,000,000, the Syncfusion Big Data software is free.
For businesses with higher turnover, prices of around £4,000 per developer license are
available. Free trials are available that are totally unrestricted.
Figure 39: Choosing the functional level of the New Forest and Root Domain
An in-depth look at Active Directory is outside the scope of this book, but a competent Windows
Administrator should be able to assist in setting it up. The servers in the Hadoop cluster must be
part of the same domain; they should be joined to the domain via Active Directory, as shown in
Figure 40. If they are not, DNS and reverse DNS validation can fail, and the Hadoop installation
won’t proceed. Just having computers on the same physical network is not enough.
Figure 40: Adding machines to our domain using Active Directory Administration Center
Point each machine you want to join the domain to the IP address of the DNS machine. If you
don't, your attempts to join computers to the domain may fail with “DNS Server cannot be found” errors.
Install the Syncfusion Big Data Agent by running the downloaded Syncfusion Big Data Agent
v3.2.0.20 file. Run the file as an administrator, and choose to install it to the default directory.
The following installation screen appears before you see the final “Installed successfully”
screen.
Figure 42: Installing Syncfusion Big Data Agent
After installing Syncfusion Big Data Agent, check that the Big Data Agent is running in Windows
Server Services. Remember that it must be installed and running on all machines in the cluster.
Figure 43: Syncfusion Big Data Agent running in Windows Server Services
Now run the Syncfusion Big Data Cluster Manager v3.2.0.20 file downloaded from Syncfusion.
Install it as an administrator on the machine defined as the Cluster Manager. You install it in the
same way you install any Windows software—this is the beauty of Syncfusion. Install it to its
default location by simply following the instructions. After installation, start the Syncfusion Big
Data platform from the standard Windows program menu. Once it's started, click the Launch
Manager button under Cluster Manager, as highlighted in Figure 44.
Figure 44: Syncfusion Big Data Platform screen
You will see the Syncfusion Cluster Manager interface, which opens in a web browser, as
shown in the following figure.
Log in to the Cluster Manager with the default admin username and password, and click the
green Login button shown in Figure 45.
Creating a multi-node Hadoop cluster in Windows
Once you're logged in, you'll see the screen in the following figure. On the right-hand side, you'll
see the Create button.
Figure 47: Closer view of the create cluster button in Cluster Manager
Choose Normal Cluster from the three options displayed, then click Next, as shown in the
following figure.
On the next screen, choose Manual Mode and click Next. Provide a name for the cluster and
leave the replication value at 3. You then need to provide IP or host name information to identify
and assign the following nodes.
Figure 49: Adding cluster name and IP Address of the Active name node
Figure 50: Add the IP Address of the Standby Name Node, leave the replication at 3
After adding the IP addresses of the servers, click the Next button on the top-right side of the
screen. The Import option is for importing multiple host names or IP addresses from a single-
column CSV file. To add additional data nodes (if needed), you'd click Add Node.
If after clicking Next, you see server clock-time errors (as shown in Figure 53), ensure that
server clock times are within 20 seconds of one another. Synchronize the times of the servers in
the cluster, then click Next again.
Figure 53: Server clock-time errors
The Cluster Manager resolves the proper host name and verifies that reverse DNS works. The
Validation column displays Success and a green dot. Now, click Next on the top-right side of
the screen.
The cluster should finish installing in 10–15 minutes, or quicker on a fast network.
Upon completion, ensure that all the elements in the following screen are checked.
If not, go to the far right-hand side of the white bar representing each node. You will see three
gray dots, as shown in Figure 57. Click the three dots, then click Start Services for each element
that is not running, to confirm that all services can run. On a powerful machine on a fast network,
you'll hopefully have no issues running all the services shown in Figure 56. On machines with
less RAM, or where there are system or network bottlenecks, you may find you can't run all the
services listed. Take care not to stop a node: when you go to start it again, there is also an
option to remove the node. To prevent accidental removals, confirmation messages appear and
ask you to confirm any deletion.
Cluster maintenance and management
Hadoop cluster maintenance is often done by Hadoop administrators. While respecting this, be
aware that clusters can develop serious problems, and you want to prevent them before they
happen. To assist us with this, we need to know the health of our clusters at all times.
The Syncfusion Cluster Manager aids us in this by displaying the status of the nodes at all
times. Figure 58 shows the four status levels that nodes can be classified as:
• Active: The active node is denoted by a green circle. This is why the name and data
nodes have green circles by them in the IP Address and HBase columns.
• Dead: A dead node is denoted by a red circle, and while this is negative, it’s also helpful
to know before installation. You will recall the dead nodes denoted by red circles when
the server clock times failed to synchronize. This allowed us to fix the problem by
synchronizing the clock times so the installation could proceed. At that point, the nodes
turned green.
• Standby: The standby node is denoted by a gray circle, and is shown with a gray circle
in the IP Address and HBase columns. It is correct for the standby node to be displaying
a gray circle, as it is on standby.
Whether it’s a local or enterprise installation, it’s imperative that you investigate why your cluster
is unhealthy. You should be able to do this in any distribution of Hadoop, not just an enterprise
one that does it automatically. If this isn't possible on all Windows machines running Hadoop,
you can't realistically use it. To briefly test this, start the Hadoop installation we did in Chapter 1,
and use the following command.
Code Listing 13: Checking the status of the HDFS
hadoop fsck /
The output in the following image shows a healthy status in fairly detailed form. It is worth
taking a look at what those outputs mean, as they apply to any Hadoop installation.
• Mis-replicated blocks: These are blocks that have failed to replicate in line with your
replication policy. You need to correct this manually, depending on the error you
discover.
• Corrupt blocks: This is self-explanatory, and reflects corrupt blocks. HDFS can correct
this on its own, provided at least one replica of the block is still intact.
If our cluster shows errors, one of the first things we can do is to seek more detail. This is
achieved by using the following command to show the status of individual files.
Code Listing 14: Checking the status of files in HDFS
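The command, referenced again just below, takes this form:

hadoop fsck / -files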
This allows us to see the actual files that are involved in the blocks concerned so that if there
were any issues, you would know which files you have to take action on. In the worst-case
scenario, the ukhousetransactions.txt files we loaded into HDFS would need to be deleted using
the -delete command. You can then replace them the same way we put them there in the first
place.
The hadoop fsck / -files command is also available in the Syncfusion Big Data Platform,
and is included in the final piece of software we need to install. We will use the Syncfusion Big
Data Studio v3.2.0.20 file downloaded from Syncfusion, and install it on the same machine as
the Cluster Manager. You could install it on another machine on your network, but I'm doing it
this way, as I want to show you something.
You install the software the same way you install any Windows software. Accept its default
install location and follow the on-screen instructions until completion. Now, start the Big Data
Studio from the Windows program menu, and you'll see the Syncfusion Big Data Platform
screen, as shown in Figure 61. You'll notice that under Syncfusion Big Data Studio, the Launch
Studio button has replaced the Download button. This is because the Cluster Manager and Big
Data Studio are now recognized on the same server.
Figure 61: Cluster Manager and Big Data Studio on the same server
You are now inside the Syncfusion Big Data Studio where HDFS, Hive, and others are visible
from the displayed tabs. If you click the Hadoop tab, you'll see a selection of samples. The
Hadoop/DFS/Attributes folder has the fsck sample, as shown in Figure 62. If you go to the Big
Data Platform main screen, you can click the Command Shell link shown in Figure 63, which
you can use to launch a command prompt from the Big Data Studio.
Figure 63: Launching the Command shell from the Big Data Platform main screen
From the command prompt, you can access Hadoop commands in the normal way. The Syncfusion
Big Data Platform gives you the flexibility to work within the more interactive Windows
environment, or to use the more traditional command-line environment.
Figure 65: Adding an existing cluster in Cluster Manager
You now have two clusters, which are visible in the following figure: the multi-node cluster
Davinicks, and the local development cluster Hadoop4windows. If you click a cluster under the
Cluster Name column, you can access a screen with more cluster details. Click
Hadoop4windows.
Figure 67: Switching between clusters in Cluster Manager
If you click the menu item called Monitoring, you can see the cluster Active Namenode and
Active Datanode details. The green dots highlighted in Figure 68 denote that the nodes are
active and healthy. The name of the cluster you're working with is shown on the top-left side of
the screen.
Figure 69: Hadoop Services Monitoring screen
The next figure takes a closer look at the Hadoop Services Monitoring screen for the Davinicks
cluster. It shows there is rather more going on than on the local development cluster. On the
left-hand side under NODES, you can see the Active Name Node highlighted in yellow, the
Standby Name Node highlighted in green, and the Data Node in blue. The abbreviations NN,
JN, and ZK stand for Name Node, Journal Node, and ZooKeeper, respectively. You can click on
each node to see the Hadoop services for each one.
This web-style navigation allows you to see useful information on many pre-defined cluster
elements. These include network usage, garbage collection, JVM Metrics, and useful
information about how much RAM and disk space are available. You will also see the load on
the CPU, IP addresses, and general machine configurations. It is useful that under Machine
Configurations, there are three CPU information elements as highlighted: the System CPU
Load, Process CPU Load, and Process CPU time.
Check all the boxes for each cluster, as shown in Figure 73, and set the alert Frequency to
minutes. Now, click Save in the top-right corner.
If you do not open or view the alert messages in Cluster Manager, you will see the number of
alerts shown in white on a little red square. This is useful for letting you know there are errors as
soon as you enter Cluster Manager.
Figure 78: Remove dead nodes and Start/Stop all nodes
These facilities are useful when a node fails; they allow you to replace the failed node using the
same method of node creation. This involves simply entering the IP address and node type of
the node you wish to create.
To further manage and maintain clusters, we need to start putting them to work to see how they
perform. To achieve this, we need to start ingesting data into Hadoop, which is covered in the
next chapter.
Summary
In Chapter 2, we dealt with the network, environment, and server specifications for deploying
multi-node Hadoop installations. We also covered Windows Server licensing and touched upon
the setup of Active Directory. We created a Hadoop cluster using Windows Server machines
and the Syncfusion distribution of Hadoop. We then ensured all components of Hadoop and the
Hadoop ecosystem were installed and running without fault.
After installing Hadoop, we compared the cluster-management tools used in Apache Hadoop
against those available in the Syncfusion Hadoop distribution. We established the availability of
the command line, giving users choices for working within Hadoop for Windows. Cluster
creation, swapping between clusters, and starting, stopping, and removing clusters was also
covered. We highlighted facilities for monitoring clusters, with reference to the ability of
Windows Server to monitor activity on each Hadoop node. We also set up cluster alerts and
examined the methods used to start, stop, and remove cluster nodes.
Chapter 3 Programming Enterprise Hadoop
in Windows
Let's say the size of the data query results after processing is 4 MB. Let's look at what is going
on behind the scenes to achieve that. It's those behind-the-scenes factors that will affect
performance, even if your final data query results or output are small.
We know that networks don't quite reach their stated maximum speed, but let's assume each
node consistently moves its share of the data at 40 MB per second on a 1-Gbps network.
Table 8: Hadoop options for processing 4 GB of data
Number of nodes | Network speed in MB per second | Size of data to process in MB | Process time per node in seconds | Size of data processed per node in MB | Total process time across all nodes in seconds | Size of final returned data in MB
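Working through the arithmetic behind the table (assuming roughly 40 MB per second of usable throughput per node, as above):

4 GB ≈ 4,000 MB of data to process
4 nodes: 4,000 / 4 = 1,000 MB per node, and 1,000 / 40 = 25 seconds
8 nodes: 4,000 / 8 = 500 MB per node, and 500 / 40 = 12.5 seconds
12 nodes: 4,000 / 12 ≈ 333 MB per node, and 333 / 40 ≈ 8.3 seconds
16 nodes: 4,000 / 16 = 250 MB per node, and 250 / 40 ≈ 6.3 seconds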
Table 8 shows that while your data can be processed faster, the time gains diminish the more
nodes you add. The processing time drops from 25 seconds to 12.5 seconds when eight nodes
are used instead of four. However, when 16 nodes are used instead of 12, data processing time
only decreases by just over two seconds. Also, the data must be transported from more nodes
to the node displaying the data, which adds to processing time. This leads to diminishing
returns, to the point where adding more nodes becomes ineffective.
Another issue that’s just as important is how Hadoop stores data in a cluster—it stores them in
blocks. Often the blocks are 128 MB or 64 MB, so a 4-GB file stored in 128-MB blocks is stored
across 32 blocks. With each file in Hadoop being replicated three times, you then go from 32 to
96 blocks.
A further complication is that a block can only hold data from one file. Therefore, if your blocks
are 64 or 128 MB, files much smaller than the block size are highly inefficient: a 1-MB file still
occupies its own block, and the metadata for every block must be held in the namenode's RAM.
Luckily, you can alter block sizes for
individual files to tackle this issue. There are also other file formats you can use, such as Avro
and Parquet, that can greatly compress the size of your files for use in Hadoop. The benefits
can be great, with query times well over ten or twenty times as fast as queries on the
uncompressed file.
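For example, a block size can be set per file at upload time. This is a sketch with an illustrative file name; the value is in bytes (64 MB here), and the -D generic option is passed to the file system shell:

hadoop fs -D dfs.blocksize=67108864 -put smallfile.csv /bigdata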
Once you get a feel for how Hadoop stores data, you can begin to estimate how much disk
capacity and computing power your Hadoop project will require. While the metadata for each
data block uses only a tiny amount of RAM, what happens to your RAM when you have
hundreds, or even thousands, of files? The inevitable happens, and that small amount of
memory is multiplied by thousands. Suddenly your RAM is compromised, and you feel the
impact of file uploads on performance and memory management.
I would recommend actually carrying out calculations before considering the ingestion of large
numbers of files. This lets you draw a line between the system resources the Windows servers
in the cluster need for themselves, and the resources they give up to running Hadoop. In Windows,
the monitoring of individual Windows servers can't be ignored. They are Windows servers in their
own right, in addition to being part of a cluster. Where you have more demanding requirements,
you can add more RAM to avoid encroaching the base RAM Windows Server needs. You may
conclude that big data systems are in fact better at handling big data than small data, which
should be no surprise. The problem is that by using compression too much, you can end up
creating the very thing you don't want, which is too many small files.
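As a rough illustration of such a calculation, using the commonly quoted approximation of about 150 bytes of namenode heap per file object and per block object (a rule of thumb, not an exact figure):

1,000,000 small files, each small enough to fit in a single block
Namenode objects: about 1,000,000 file objects + 1,000,000 block objects = 2,000,000
Namenode heap: about 2,000,000 × 150 bytes ≈ 300 MB, purely for metadata
Merging the same data into a few thousand 128-MB files would cut this to a few megabytes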
While this isn't something that's done too often, I think it is important to isolate what is actually
happening resource- and performance-wise when Hadoop is running in Windows. This informs
answers to questions such as: Should Hadoop be the only application on each server? I take a
particular interest in this, as Windows does not have the control groups (cgroups) feature that is
available to Linux users. Cgroups can control and prioritize network, memory, CPU, and disk I/O
usage. Cgroups also control which devices can be accessed and are in most major Linux
distributions, including Ubuntu. Further features include limiting processes to individual CPU
cores, setting memory limits, and blocking I/O resources. It is a feature that Windows Server
does not have, and that Linux developers would notice. Without these features, we need to take
care to monitor resource usage for Hadoop in Windows.
Let's start by looking at the resources used by the Big Data Platform within Windows Server.
You could do this by setting up a virtual machine in Windows Server and allocating as little as
12 GB, though I'd feel safer at 16 GB, as a minimum. The reason for this is a simple one: it's the
same reason that Linux and Microsoft list the minimum requirements for their operating systems
as low as possible. If you list the minimum requirements as too high, you'll put off some
customers and diminish the distribution or sale of your product. The trick is to be realistic and
state recommended minimum requirements, as we'll do now. If you launch the Syncfusion Big
Data Platform on a virtual or physical server with a realistic minimum of 16 GB of RAM, you will
hopefully see the Resource Monitor in Figure 79. It shows Windows Server at only 22 percent
CPU usage and 35 percent memory usage; this includes resources used running Windows
Server.
Figure 79: Syncfusion Big Data distribution running in Windows Server 2016
Note that 22 percent of CPU usage is from processes, with 13 percent CPU usage from
services, so there is low resource usage.
At this point we have around 6 GB of memory in use by Windows Server and the Big Data
Platform combined, with close to 3 GB on standby and 7 GB free. Nearly 10 GB is available, which
again is stress-free computing. I am using i7 2.9 GHz CPUs, which are fast and robust
processors.
I mentioned the IMDB data we'd be using to demonstrate data ingestion into the Syncfusion
distribution of Hadoop. I am purposely using the .tsv file format for the IMDB data, along with
the .gz compressed file format.
Subsets of IMDB data are available for personal and non-commercial use. You can hold local
copies of this data, and it is subject to terms and conditions.
Download the zipped .gz files, then unzip the .tsv files using a tool like 7-Zip. The files include:
Title.ratings.tsv.gz – contains the IMDB rating and votes information for titles.
We will also download UK Land Registry data, which approaches 4 GB in file size. We
require the 3.7-GB .csv file.
These two resources provide exhaustive information, whereas this section of the book lists only
the essentials. Those of you who know SQL may find similarities between Hive data
types and SQL data types; the same can be said for Hive Query Language and SQL.
Numeric types:
• TINYINT (1-byte signed integer, from -128 to 127)
• SMALLINT (2-byte signed integer, from -32,768 to 32,767)
• INT/INTEGER (4-byte signed integer, from -2,147,483,648 to 2,147,483,647)
• BIGINT (8-byte signed integer, from -9,223,372,036,854,775,808 to
9,223,372,036,854,775,807)
• FLOAT (4-byte single precision floating point number)
• DOUBLE (8-byte double precision floating point number)
• DOUBLE PRECISION (alias for DOUBLE, only available starting with Hive 2.2.0)
• DECIMAL (introduced in Hive 0.11.0 with a precision of 38 digits; Hive 0.13.0 introduced
user-definable precision and scale)
• NUMERIC (same as DECIMAL, starting with Hive 3.0.0)
Date/time types:
• TIMESTAMP (Only available starting with Hive 0.8.0)
• DATE (Only available starting with Hive 0.12.0)
• INTERVAL (Only available starting with Hive 1.2.0)
String types:
• STRING
• VARCHAR (Only available starting with Hive 0.12.0)
• CHAR (Only available starting with Hive 0.13.0)
Misc types:
• BOOLEAN
• BINARY (Only available starting with Hive 0.8.0)
Complex types:
• arrays: ARRAY<data_type>
• maps: MAP<primitive_type, data_type>
• structs: STRUCT<col_name : data_type [COMMENT col_comment], ...>
• union: UNIONTYPE<data_type, data_type, ...>
Hive DDL (Data Definition Language) provides support for the following:
• Create database
• Describe database
• Alter database
• Drop database
• Create table
• Truncate table
• Repair table
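As a brief, purely illustrative taste of that DDL (the demo database and sample table are made-up names, not objects used later in the book):

CREATE DATABASE IF NOT EXISTS demo;
DESCRIBE DATABASE demo;
CREATE TABLE demo.sample (id INT, name STRING, rating DECIMAL(3,1), added DATE);
DROP TABLE demo.sample;
DROP DATABASE demo;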
DDL functions also apply to the components of databases such as views, indexes, schemas,
and functions. The best way of detailing or describing the terms and functions shown previously
is to actually use them. To achieve this in Hadoop for Windows, we need to prepare our
environment for data ingestion.
Figure 83: Add Cluster to Syncfusion Big Data Studio
We need to enter the IP address or host name of the active name node of our cluster. Enter
the host server name or IP address as requested. In Figure 83, the full computer name
(including the domain element) is given, and it connects to the cluster, as seen in Figure 84.
Figure 84: Connecting to Hadoop cluster
The cluster is shown underneath the localhost cluster. To start receiving data, click New to add
a new folder, enter a name for the folder, and click Create.
Figure 86: Starting services on the local cluster
This means you can switch between clusters by simply connecting or disconnecting.
We are now going to do our first data upload to Hadoop. We are going to upload the 3.7-GB
.csv file of house prices paid transactions mentioned earlier in the chapter. If you can't get ahold
of the file, just follow along, as we'll be using files of a much smaller size in the next section. In
Big Data Studio, choose the HDFS menu item. Click New to create a new folder, and call it
ukproperty. It will appear to the right of the text HDFS Root, which is shown in Figure 88.
If you click ukproperty, you will access the folder. Once there, you can create another folder
called 2018Update. You will now notice that the 2018Update folder has been added to the right
of ukproperty, and if you click 2018Update, you will access that folder.
Next to the New folder button on the menu, click Upload and choose the radio button to select
File. Use the button highlighted in the red rectangle in Figure 88 to select the property
transactions file we are going to upload, called pp-complete.csv. Once you find and select the
file, click OK, and you will see the following screen again. Now, click Upload.
Figure 88: Selecting file for HDFS upload
On the bottom, right-hand corner of the screen, you will see a box that shows a progress bar of
the file uploading. When the file has been uploaded, it shows Completed.
The file should not take long to upload on a fast system, but you could be waiting a few minutes
on a slower one. Hadoop for Windows benefits from fast, interactive facilities that are
unavailable in many Windows applications. For example, I could not even begin to load a file
this size into Excel. In QlikView, you'd be using all sorts of built-in compression (QVDs) to load
the file, and your system would certainly feel it. In a relational database, you'd wait a while for the upload, and then wait again for queries to run before you could view any output. In Hadoop for Windows, this file is trivial on a powerful system. Simply double-click the uploaded file pp-complete1.csv, or right-click it and click View. The 4-GB file opens instantly in the HDFS File Viewer, which allows you to go straight to any one of the 31,573 pages that make up the data file.
Figure 90: Instant viewing of 4-GB data file in HDFS File Viewer
For these reasons, Hadoop for Windows is very useful for storing large files—you don't have to
wait even a second to see what is contained in the large file you're viewing. Imagine quickly
double-clicking on a 4-GB CSV file in Windows, and locking up your machine and whatever
application tried to open it. In Hadoop for Windows you can have all your files, folders, and
directories neatly ordered and instantly available to view. We will be using this 4-GB file a bit
later in the book when we undertake some tasks of greater complexity.
We have ingested the data into HDFS, and we can store it and view it instantly in Windows. This is fine, but there is also a need to manipulate data once it's in HDFS. This is where the Hive data warehouse comes in, along with other tools in the Hadoop ecosystem. We need to be able to use Hive in Windows to the same standard as Linux users use Hive in Linux.
Code Listing 15
Enter the code from Code Listing 15 in the Hive editor window, and then click Execute. The
Hive editor window and console window are shown in Figure 91.
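As a rough sketch rather than the exact listing, the statement creates a table and loads the compressed IMDB title.akas file into it. The column names and the file path below are assumptions; note that no field delimiter is specified, so Hive falls back to its default.

-- Sketch only: columns assumed from the public IMDB title.akas file.
CREATE TABLE titleaka (
  titleId STRING,
  ordering INT,
  title STRING,
  region STRING,
  language STRING
);
LOAD DATA INPATH '/Hadoop4win/title.akas.tsv.gz' INTO TABLE titleaka;
SELECT * FROM titleaka LIMIT 25;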
We can remove the first row of data by using the code in Code Listing 16, and then clicking
Execute.
Code Listing 16: Removing first row of data from table
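One common way to deal with a header row in a Hive text table, shown here as a sketch rather than the exact listing (the table name titleaka is assumed), is to mark the first line as a header so that Hive skips it:

ALTER TABLE titleaka SET TBLPROPERTIES ('skip.header.line.count' = '1');

Alternatively, the offending row can be filtered out with an INSERT OVERWRITE statement that excludes it.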
You then see a database called default and a database table called titleaka, as seen in Figure
94. It's the table we created in Hive that's now visible in Spark SQL. Right-click the titleaka
table and click the option to Select Top 500 Rows. In the Spark SQL window, we can clearly
see all the data was wrongly ingested into the titleId field; this is shown in Figure 95. The data
looked correct in Hive, but it clearly isn't—imagine trying to use the table in a join with another
table. As Hadoop can ingest so many different formats of data, there's a potential for errors in
processing such a range of file formats.
Figure 96: Uploaded compressed files in HDFS with title.ratings changed to title.rating
Enter the following code in the Hive editor window to create a table called titleepisode from the
compressed title.episode.tsv.gz file.
LOAD DATA INPATH '/Hadoop4win/title.episode.tsv.gz' INTO TABLE titleepisode;
SELECT * FROM titleepisode LIMIT 25;
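The LOAD statement assumes the titleepisode table already exists. A minimal sketch of such a definition, with column names taken from the public IMDB title.episode description and the tab delimiter added as an assumption:

CREATE TABLE titleepisode (
  tconst STRING,
  parentTconst STRING,
  seasonNumber INT,
  episodeNumber INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';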
Now, do the same to create a table called titlerating with the following code.
Code Listing 18: Code for creating second table from compressed file in Hive
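As a sketch rather than the exact listing (the column names are assumed from the public IMDB ratings file, and the path assumes the renamed title.rating file sits in the same HDFS folder as before):

CREATE TABLE titlerating (
  tconst STRING,
  averageRating DOUBLE,
  numVotes INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA INPATH '/Hadoop4win/title.rating.tsv.gz' INTO TABLE titlerating;
SELECT * FROM titlerating LIMIT 25;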
The next figure shows the output of the final line of the preceding code, select * from titlerating LIMIT 25, which selects the first 25 rows from the titlerating table we just created from the compressed file. This line is a good starting point for explaining the concept of MapReduce.
Let's join the tables after switching to Tez mode in Hive, which helps to speed up queries. Tez is
not as fast as Impala by any means, but can make the difference between a query executing
and a query failing.
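A sketch of what that looks like in the Hive editor; the join key, tconst, is an assumption based on the IMDB data.

-- Switch the execution engine for this session from MapReduce to Tez.
SET hive.execution.engine=tez;

-- Join the episode and rating tables on the assumed key.
SELECT e.parentTconst, e.seasonNumber, e.episodeNumber, r.averageRating
FROM titleepisode e
JOIN titlerating r ON r.tconst = e.tconst
LIMIT 25;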
After running this code, you'll notice a long delay before your results are returned, as the map
and reduce actions are carried out, as shown next.
Figure 100: Query results from joining titlerating and titleepisode tables using TEZ and MapReduce
Code Listing 20: Code for grouping data in Pig
After you enter the code in the editor, you will see the job output generated in the console. An important line in that output is the URL to track the job, as highlighted next.
Simply click the link shown, and you'll see an image similar to the one in Figure 102. The figure shows the same job ID as in the URL, 1546500940384_0021. It shows that the job succeeded, along with the start time, date, and elapsed time. It also shows the map, shuffle, merge, and reduce times.
Figure 102: Tracking Pig MapReduce job
The requested result is also shown in the results window, which groups the individual start years
of the films in the database.
Figure 103: Returned results of Pig query showing the start years of films in the IMDB.
Some people prefer Hive, and others prefer Pig; it can depend on exactly what you're doing. Other people prefer to export their data out of Hadoop and work with relational database systems, which can be much faster than Hadoop when joining tables, for example. There's no real difference between using Pig in Windows and using Pig with Hadoop in Linux. It's therefore a good time to move on to Sqoop, which can import and export data to and from Hadoop. This is a key feature of Hadoop, and it's available in Hadoop for Windows.
Sqoop
Because of recent advances in integrating Hadoop in Windows, this section nearly failed to
make the book. It would be wrong to exclude it, though, since so many people use it in Hadoop
to import and export data. To access Sqoop, simply click Sqoop on the Big
Data Studio menu. You'll notice in the following figure that the JDBC connectors are not
installed. If you are able to go online, click the checkbox to select the Microsoft SQL Server
JDBC Connector, and the highlighted Install button will become enabled. Click Install if you
choose to install the driver in this fashion, or click the link highlighted in Figure 104 to get the
instructions to install the connector jars manually.
As I am working with the SQL Server 2019 v-next CTP 2.0 technology preview, I needed to download a connector that works with it. Whether you download sqljdbc_6.0.8112.200_enu, sqljdbc_7.0.0.0_enu, or another version depends on the version of SQL Server you are using; check the Microsoft.com website to determine the correct one. Once you extract the files, take the jar file (in my case, sqljdbc42) and place it in the Sqoop\lib folder. In a multi-node cluster environment, place that file on each node where Sqoop is installed. The directory you place it in should be:
<InstalledDirectory>\Syncfusion\HadoopNode\<Version>\BigDataSDK\SDK\Sqoop\lib
If you are working on a local development cluster, the directory you should place it in is:
<InstalledDirectory>\Syncfusion\BigData\<Version>\BigDataSDK\SDK\Sqoop\lib
Figure 105: Connector Jar in the Sqoop\lib folder.
After you put the file in the relevant folder, close the Big Data Studio, and then open it again.
The JDBC connector for SQL Server is now installed, as shown in Figure 106.
Figure 107: Creating a connection to SQL Server in Sqoop.
After saving the connection, click Add Job and name the job CurrencyImport. Choose the connection SqoopMov02 from the drop-down box, as shown in Figure 108.
Since we're doing an import, click the Import radio button, then click Next. Now we will enter
the details of the database and the table in SQL Server that we wish to import data from, as
shown next. The database in SQL Server is called INTERN28, and the table is called
Currencies. Click Next to continue.
Figure 111: Accepted Sqoop job
The status of the job then changes from Accepted to Succeeded.
Figure 114: Customers csv file that is installed with the software
We'll add a job as we did with the data import, but this time we choose the Export option. We'll
use the same SqoopMov02 connection as we used for the import, as shown in the following
figure. Now, click Next.
We’ll use the button highlighted in Figure 116 to choose the folder we wish to export data from;
here, it's Data/Customers/. Click Next.
Figure 116: Specifying folder to export data from
We now enter the name of the database we want to export the data to, and the name of the
table we want to export the data to, and click Save & Run.
Figure 118: Sqoop export job succeeded
You should now also see the data exported from Sqoop in SQL Server, as displayed in the
following figure. The customers.csv file data has arrived in the INTERN28 database table called
Table_1.
Figure 119: Exported table from Syncfusion Hadoop distribution arrived in SQL Server
While Sqoop can import and export data, I can't say it's the smoothest or most efficient tool I've ever seen. This has nothing to do with Windows or Linux; I've just never found it an impressive tool. Fortunately, Microsoft is making real progress at integrating Hadoop with Windows, and this is providing alternatives. We will look at this in more depth in Chapter 4.
Summary
We started by presenting the capabilities of Windows Server as a perfect partner for Hadoop. Its ability to utilize 24 terabytes of RAM and 256 processors allows it to scale to any task that Hadoop can throw at it. We then used Amdahl's Law to examine how Hadoop transports data across a network, and the mechanisms Hadoop uses to store and access data on disk.
We examined the system resources required by Hadoop in Windows Server, before looking at
Hive data types and the Hive data manipulation language. We prepared Hadoop to ingest data
as we began to upload files, create tables, and manipulate data in the Hadoop ecosystem. We
manipulated compressed data files and executed table joins in Hive before carrying out jobs in
Pig and setting up connections in Sqoop. This included loading external SQL Server drivers and
focusing on the role of MapReduce in querying data in Hive and Pig.
We concluded by creating both import and export jobs between SQL Server and Sqoop after
setting up Sqoop connections for data transfer. I think we can safely say that Hadoop and its
ecosystem run perfectly on Windows Server—we are not missing out on anything in Windows
that is available in Linux.
When we go one step further—to connect to and report from Hadoop—we see the advantages
of the Windows environment over Linux. If you had asked me about this even a short while ago,
I would not have agreed. Luckily, the release of the SQL Server v-next CTP 2.0 Technology
Preview has changed all that.
Chapter 4 Hadoop Integration and Business Intelligence (BI) Tools in Windows
PolyBase, which is at the heart of the preceding developments, supports Hortonworks Data
Platform (HDP) and Cloudera Distributed Hadoop (CDH). I have tested the PolyBase
functionality with the Syncfusion Hadoop distribution and can confirm it works with the
Syncfusion platform in Windows. Please bear in mind that SQL Server 2019 v-next is a
technology preview, so if you choose to follow along in this section, you do so at your own risk.
What do these developments mean for ETL and tools like Sqoop? Why spend time importing
data if you don't need to? Microsoft has gone further with Hadoop integration in Windows and
introduced big data clusters. Steps to install and configure them may seem complex at first, but
you can quickly get used to them.
Figure 120: Accessing SQL Server Big Data Clusters using Azure Data Studio
Integrating Hadoop into Windows has implications for Windows BI users connecting to Hadoop
data in Linux. Can the Hadoop integration used in SQL Server be taken advantage of by third-
party BI tools in Windows? If the answer is “yes,” it provides an alternative for Windows users
connecting Windows BI tools to Hadoop in Linux. For those who think SQL Server is not
significant to Hadoop or Linux users, I'll address this directly.
The most successful SQL Server is SQL Server 2017 for Linux—it had over 7,000,000
downloads between October 2017 and September 2018.
Let's have a look at SQL Server 2019 v-next CTP 2; please bear in mind that it's a technology
preview. You can download the .ISO file from www.microsoft.com. When installing, make sure
you choose the highlighted items shown in Figure 121. Ensure that the PolyBase Query
Service for External Data and the Integration Services options are selected to install.
Please remember that, unlike older versions of SQL Server, SSMS (SQL Server Management
Studio) is not installed during the installation—you have to install it separately yourself. SSMS
v17.5 is available free from Microsoft, but includes no database engine.
We'll be connecting SQL Server 2019 with the Syncfusion Hadoop distribution we used
previously. You don't need to install the machine learning options selected in the following
figure; that was simply my preference. If you do choose to install them, you have to download
them separately from a link provided during installation. Like SSMS, they are not included in the
SQL Server 2019 installer, and the combined .cab files are quite large. After installation is
complete, create a database in SQL Server called Hadoop4windows.
Code Listing 21: Enabling PolyBase for Syncfusion Hadoop distribution in SQL Server 2019
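As a sketch, enabling PolyBase and Hadoop connectivity uses sp_configure; the 'hadoop connectivity' value shown below is an assumption, and the correct value depends on your Hadoop distribution.

-- Enable the PolyBase feature.
EXEC sp_configure 'polybase enabled', 1;
RECONFIGURE;

-- Set the Hadoop connectivity option to match your distribution;
-- a SQL Server restart is required after changing it.
EXEC sp_configure 'hadoop connectivity', 7;
RECONFIGURE;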
Imagine you have a Hadoop currency file that's updated as different jurisdictions trade in more
and more currencies. You don't have to import it anymore; you can run it live in SQL Server. To
achieve this, we'll upload worldcurrency.txt to a folder we create in Hadoop called ukproperty.
Code Listing 22: Code to set up live connection to HDFS from SQL Server 2019
USE Hadoop4windows
GO
-- External data source pointing at the Hadoop name node. The host name
-- and port below are placeholders; use your cluster's own values.
CREATE EXTERNAL DATA SOURCE hadoop_4_windows
WITH (TYPE = HADOOP,
      LOCATION = 'hdfs://<namenode-host>:8020');
GO
-- File format for the comma-delimited worldcurrency.txt file.
CREATE EXTERNAL FILE FORMAT TextFileFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ',',
                      USE_TYPE_DEFAULT = TRUE));
GO
-- External table over the /ukproperty/ folder in HDFS. The column names
-- and types are assumptions for illustration.
CREATE EXTERNAL TABLE dbo.worldcurrency
(country VARCHAR(100), currency VARCHAR(100), code VARCHAR(10))
WITH (LOCATION='/ukproperty/',
      DATA_SOURCE = hadoop_4_windows,
      FILE_FORMAT = TextFileFormat
);
Once your code has run, you should see the highlighted changes in SQL Server Object
Explorer. Refresh or restart your server if you don't see them initially.
Figure 123: PolyBase and external resources enabled in SQL Server 2019
Note that you can see the dbo.worldcurrency table containing data from the
worldcurrency.txt file in Hadoop under the External Tables section shown in Figure 123. It's a
live connection, and you can query it live as you would any other table in SQL Server 2019.
Right-click the table in SQL Server and select the built-in Select Top 1000 Rows query to see
the output in the next figure. Because SQL Server 2019 CTP 2.0 can read the items inside HDFS directly, you don't even have to create a table in Hadoop. With this way of working, you don't have to export updates to SQL Server; you just read the data live. Of course, you can still keep storing historic copies in Hadoop.
Use it like any SQL Server table; it's as fast as other tables when involved in joins, as opposed
to the inherently slow joins in Hadoop.
Figure 124: Running live query against file in HDFS from SQL Server 2019 v-next CTP 2.0
You can also see the PolyBase group of objects in Object Explorer in SQL Server. If you right-click the Scale-out Group and click the option Configure PolyBase Scale-out Group, you'll see what's shown in Figure 125. This shows the PolyBase scale-out cluster instance head node, which is displaying Scale-out cluster server ready. You need to be running SQL Server Enterprise to designate a head node. You then set up compute nodes as desired to create a cluster.
For this to work, you have to designate an account you created in Active Directory as the
account to run the PolyBase engine and data movement services. This is done during
installation and is shown in Figure 127. You also need to install SQL Server 2019 as a compute
node, and not a head node, as this will open the firewall for connectivity.
Figure 127: Designating accounts for PolyBase services in SQL Server 2019 installation
The installation will not continue unless you provide your network credentials, which you set up
in Active Directory. You must use this account when installing SQL Server on each node in the
cluster. The next figure shows an acceptable account setup for the PolyBase engine and data
movement services.
Figure 128: Set up of accounts for PolyBase services in SQL Server 2019
The previous steps are important because the service instance I've just added to the scale-out
group is now showing as compatible, as seen in Figure 129.
Figure 129: An imported server instance compatible with PolyBase scale-out groups
Had we not completed the steps outlined during installation, we would see a red icon denoting
incompatibility, and the process would end here. You'd then have to reinstall the node correctly.
We can click the arrow to add the server instance to the scale-out cluster. As you can see in
Figure 131, it's been added as a compute (worker) node, hence the different symbol.
Figure 132: The allure of Microsoft Azure, Azure Data Studio connecting to HDFS via SQL Server 2019
There could be those who take another view, namely that setting up SQL Server with PolyBase scale-out groups involves more setup time than Hadoop itself. In Windows there's also an upfront server cost, so does this setup end up twice as expensive and complex? Remember,
SQL Server is not free, just as Windows is not free. Here's a statement from a well-known
Windows-based organization, summarizing why they don't use Hadoop for Windows.
"While possible, Hadoop for Windows would bring a lot of complexity."
It takes longer to set up Microsoft big data clusters than Syncfusion Hadoop clusters or certain Linux Hadoop clusters. To be objective, we must remember that it's only a technology preview at this stage. The tiny footprint and simplicity of Azure Data Studio is another route that Microsoft is pursuing. My hope, and I think the optimal solution, is for the two approaches to meet in the middle.
The BI tools we'll be looking at are:
• QlikView
• Tableau
• Power BI
• Azure Data Studio
• Arcadia
• Elasticsearch & Kibana
• Syncfusion Dashboard Designer
QlikView: qlikview.com
QlikView has, at one time or another, been the best-selling and most popular BI tool. It has Server and Desktop editions and, as far as big data is concerned, can connect to Hive from Windows via ODBC. There is also a Spark ODBC driver, and the software can run on a group of servers to handle larger datasets.
It is recommended that QlikView be the only application running on an individual server, due to
its heavy resource usage of RAM in particular. This is partly because it is an in-memory tool that
relies heavily on data compression, native QVDs, and data aggregation. Direct Discovery within
QlikView allows you to connect live to an external data source, but you can only work with a
reduced feature set when using Direct Discovery.
Tableau: tableau.com
Tableau has managed to take sizeable chunks of the BI market, and is arguably the second-
biggest player behind QlikView in terms of purpose-built BI tools. Tableau comes in desktop and
server editions but, unlike QlikView, is available for Linux. This is a more recent development,
providing advantages like lower operating-system costs. Tableau connects to Hive via ODBC
drivers, and is an in-memory application. It provides live connections to external data sources
and has a wide range of data connectors, including Spark SQL. Tableau has excellent data
grouping and classification tools that are part of a strong in-built feature set.
As we've already seen, Power BI can directly connect to the HDFS in Windows. It therefore gives deeper access to Hadoop than the Hive data warehouse does. It has benefitted from Hadoop
integration across the Microsoft BI stack and wider Windows environment. A few years ago, I
didn't use Power BI at all. Now it's indispensable, due to its native big data connectivity and tight
Windows integration.
Azure Data Studio seems like a sleek and lightweight design on the surface. Dig a little deeper,
and there's a lot of power that can be tightly integrated with all Microsoft big data and database
innovations. Azure Data Studio does this in part by using extensions that can be added to the
application. You cannot access Microsoft big data clusters without extensions, for example.
Importantly, Microsoft sees big data as a whole concept, not just Hadoop. In addition, Azure
Data Studio is cross-platform: it works on the Windows, Linux, and Mac platforms. Azure Data
Studio has something of a companion product in Visual Studio Code. Visual Studio Code is a
lightweight take on Visual Studio, and is similar in appearance to Azure Data Studio.
Visual Studio Code has given the Windows command prompt a contemporary feel and a new lease on life. It can be enhanced with extensions in the same way Azure Data Studio can. Visual Studio Code, shown in Figure 133, looks almost identical to Azure Data Studio. Extensions for both pieces of software can be installed from the internet or downloaded for offline installation. Click the three red dots, as shown in the following figure, to open the extensions menu. From here, click Install from VSIX, the file format used to package extensions.
Arcadia: arcadiadata.com
Arcadia needs no help when connecting to Hadoop—there's no need to load ODBC drivers, as
Arcadia connects to Hadoop natively. Arcadia also connects directly to the fast Impala query
engine and runs on Windows and Mac. It's a BI tool built for big data, embracing data
granularity, controlling data volumes, and accelerating processes. Join creation is automated,
and connections to RDBMS and big data sources are supported. For some, it may be the only BI tool they need, because it meets the scale requirements of any BI task. It can also directly access Hadoop to run queries at very impressive speeds. Using Arcadia is like being in Hadoop
itself. Add to that every visualization tool you'll ever need, and you've got Arcadia.
While not originally designed to work with Hadoop, Elasticsearch can now use the HDFS as a
snapshot repository. Support for the HDFS is provided via an HDFS snapshot/restore plugin.
Elasticsearch and Kibana can, of course, search data in Hadoop and run just as smoothly on
Windows as in Linux. Sure, the texts for these products are written around Linux, but this belies
their integration with Windows. Elasticsearch 6.6.0 can be installed as a Windows service, and
Kibana 6.6.0 has new features that allow the easy import of data files. The real strength of these
tools is the ingestion and display of real-time and near real-time data. Their footprint is
comparatively small, and installation is simple and speedy. You can also very tightly manage
the resources allocated to nodes and clusters within your installation.
It's important for Windows products to compare well against their Linux counterparts. Sadly, the
absence of Impala on the Windows platform is currently a problem. In a project unrelated to this
publication, I have been in contact with the Impala developers on this subject.
Outside of Microsoft big data clusters, what tools are there for fast queries involving joins? What
if you don't want to use SQL Server—what other choices are out there? All too often, the other
choices involve using Linux products, even if it means learning a new system.
Cloudera CDH is a Hadoop release you'd struggle to create in Windows, as key functionality
isn't available. If Syncfusion could put Impala in their Big Data Platform they would; currently,
however, it isn't possible. I haven't mentioned it until now, but the exercises I've been doing in
Windows are exercises I've duplicated in Hadoop on Linux. We'll see some of this work later on
in the chapter. I did this because it's important to identify features that may benefit Hadoop in
Windows. You'll recall that I mentioned cgroups and updating ecosystem elements like Hive, as
new versions become available.
Cloudera and Hortonworks have features you'd be pleased to see in Windows solutions.
Hadoop for Windows should take more advantage of the interactive nature of Windows. This is
because the best Linux Hadoop distributions are more interactive than they used to be. The
older, rather "wooden" Linux Hadoop is being replaced by sleeker, more modern designs.
Microsoft has observed this and given Azure Data Studio a contemporary feel that you can
change at will. You'll see some of this as we go further into this chapter.
Connecting BI tools to Hadoop in Windows
QlikView
A key benchmark for BI tools in Windows is the ability to connect live to Hadoop. QlikView connects live to data sources with Direct Discovery, but you lose certain functionality, and Direct Discovery also requires different load-script code. Some other BI tools, like Tableau, achieve a live connection by just clicking a button. The following functionality is unavailable when using Direct Discovery:
• Advanced calculations
• Calculated dimensions
• Comparative Analysis (Alternate State) on QlikView objects using Direct Discovery fields
• Direct Discovery fields are not supported on Global Search
• Binary load from a QlikView application with Direct Discovery table
• Section access and data reduction
• Loop and Reduce
• Table naming in script does not apply to the Direct table
• The use of * after DIRECT SELECT on a load script (DIRECT SELECT *)
To see if we can access the HDFS live without restrictions, we'll connect to SQL Server via Edit
Script. Click Connect and enter the login details for SQL Server, as shown in Figure 134.
Click Test Connection before proceeding. You should see the Test connection succeeded message displayed, which tells you you've successfully made a connection to SQL Server. The connection to SQL Server 2019 allows direct access to Hadoop via the external table we created, called worldcurrency. The Preview facility displays the data, as shown in the following figure.
Figure 135: Hadoop file being read live in QlikView via SQL Server 2019
The load script is then executed to load the data into the application, as shown next.
Figure 136: Script is run within QlikView to load Hadoop data into the application
Though we were able to connect to Hadoop from SQL Server, loading the script means we can't work with live data unless we use the limited Direct Discovery. The way QlikView works means you essentially work from extracts; you would need to reload the data to pick up any changes. While we can create dashboards, as shown in Figure 137, we don't have the live connection to Hadoop we desire, or could have. As with so many BI tools not built for big data, we're importing data because of the tool's limits. Direct Discovery compensates for the fact that QlikView wasn't built for big data analysis, but sadly, the unavailability of key functionality is counterproductive. The restrictions of Direct Discovery listed previously are also not the only ones: you cannot use pivot charts and mini charts, for example, and there are performance-tuning issues that must be dealt with at the data source.
We could load the data into QVD files for fast load times and high data compression. You'd still
be loading extracts though, albeit much faster and with functionality unavailable in Direct
Discovery. Organizations I've worked at with multiple QlikView servers never considered using it
for live connections.
Figure 137: Maybe it's live, maybe it's not: QlikView visualization from loaded Hadoop data extract
Tableau
Let's see if Tableau fares any better than QlikView when attempting to connect live to the
HDFS. Connecting to SQL Server from Tableau is a straightforward task. You can't help but
notice the large number of other connectors available.
After connecting to SQL Server from the preceding screen, you should see what’s shown in
Figure 139. It is very different from what we saw in QlikView at the same stage.
The first thing we notice is the "Live" connection, indicated at the top-right side of the screen.
Tableau has also identified the live worldcurrency table from the HDFS file data. Tableau
achieves this by sending dynamic SQL to the source system rather than importing it.
Advantages to this approach include less storage space, as you avoid duplicating source data in
the BI system.
Figure 139: Live connection to SQL Server from Tableau connecting directly to HDFS
Live connection does not mean a reduced feature set; without being connected to the internet,
you can map data from the countries listed in the Hadoop data file.
Figure 140: Features are not restricted when using live connection in Tableau
Tableau has passed the test with flying colors—there was no discernible difference in
performance, and you'd never guess you were connecting to raw Hadoop. So, the work from
Microsoft to integrate Hadoop into SQL Server can benefit third-party BI tools.
Figure 141: Tableau dashboard created from live connection to HDFS
Tableau perhaps reflects both sides of the story: the ability to connect live to Hadoop within
Windows, and now, producing Tableau for Linux. Tableau Server for Linux has integrated well
with SQL Server for Linux, which is making a big impression itself. Perhaps the lines are
becoming blurred, and the future is not as black and white as Windows or Linux. Microsoft's efforts to build the WSL (Windows Subsystem for Linux) are perhaps the strongest indicator of this. The ability to access both environments from within one is a noble aspiration, though the reality may be more difficult to achieve.
Power BI
We've already seen Power BI in Chapter 1, so we know what it can do. That said, I can confirm
that Power BI can take advantage of the live SQL Server connection to HDFS by using Direct
Query. Direct Query allows you to access data directly from your chosen data source. It's
enabled by clicking the DirectQuery button highlighted in Figure 142. This eliminates the need
to import data by giving you a live connection to interrogate larger data volumes.
Figure 142: Power BI live connection to SQL Server to read HDFS files live
Power BI can also access the Impala high-speed query engine in Linux. It's interesting to note
that there is no Hive connector in Power BI. Microsoft has confronted the problem of slow query
speed with joins in Hive, and gone straight for Impala. If you really want to use just Hive, you
can use an ODBC connection.
Figure 143: Power BI live connection to Impala via the direct query mechanism
Figure 144: Azure Data Studio with extensions installed
Azure Data Studio can connect to our live HDFS connection in the most versatile way. This is
achieved by using the SQL Server 2019 extension highlighted in Figure 144, and deployed as
shown in Figure 145.
Figure 145: Deployed Azure Data Studio extension for SQL Server 2019
One of the features within the extension is the ability to create external tables. Does it reveal the
thinking of Microsoft? Why should you have to be in a system to create a table? Shouldn't there
be an ability not just to exchange data, but to design the components that hold it?
Figure 146: Create external table in Azure Data Studio
Security issues have been overcome by the creation of a database master key to secure the
credentials used by an external data source.
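A sketch of that pattern with placeholder values; the credential name, identity, and secret below are assumptions for illustration.

-- The master key protects database scoped credentials.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'Str0ng!Passw0rd123';

-- Credential used when connecting to a secured Hadoop cluster.
CREATE DATABASE SCOPED CREDENTIAL HadoopCredential
WITH IDENTITY = 'hadoopuser', SECRET = 'hadooppassword';

The credential can then be referenced via the CREDENTIAL option when the external data source is created.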
Figure 148: Azure Data Studio Server Dashboard
In order to connect to a server, you need to click the server symbol, as highlighted in Figure
149. You will find it on the top, left-hand side of the screen.
Figure 149: Making a connection to SQL Server 2019 in Azure Data Studio
You are then taken to the login screen, as shown in the following figure.
Figure 150: Azure Data Studio login
The two fields highlighted show the server and the individual database login fields. We then enter the server name and port number, after making sure the server is configured to accept remote connections. This includes allocating a port number for SQL Server to accept connections on. If you don't do this, you may only be able to connect to a local server. After you
log in, you see the server dashboard screen we saw earlier, plus more details on the left-hand
side. You see additional folders, including server objects and endpoints, as shown in the
following figure.
When you're in Azure Data Studio, you can change the look and feel of the app completely. You
can also access the live Hadoop data connection we created, as shown in Figure 152.
Figure 153: Fast execution of Hadoop queries involving joins as fast as Impala
In the preceding figure, you can see the additional table among those highlighted, called worldinformation. The tables are followed by the word external, as they are live connections to Hadoop. A query joining that table to our worldcurrency table executed in 00:00:01.252. In any version of Hive in any Hadoop release, this would take at least a few minutes. This is one of the reasons why some people prefer what Microsoft is doing with big data and Hadoop. I'm able to do all this from an 80-MB app that is around 400 MB when
installed, plus extensions. It's a lot of bang for your buck, or it would be if it wasn't free of charge
on Mac, Linux, and Windows. The fun doesn't stop here: you can create dashboards from your
live Hadoop queries. Charts are created "on the fly" and shown on the Chart tab, as highlighted
in Figure 154. You can make as many charts as you wish to populate your dashboards.
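A join of the kind timed above might look like the following sketch; the columns of worldinformation and the join key are assumptions.

SELECT c.country, c.currency, i.population
FROM dbo.worldcurrency AS c
JOIN dbo.worldinformation AS i
  ON i.country = c.country;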
Figure 154: Live charts based on the results of high speed Hadoop queries with joins
The power of the app is such that you can change the visualization types in an instant on
screen. You don't have to go to a separate area or separate mode of operation—they are
instant, on-screen changes.
Figure 155: Instant changes of visualizations using Azure Data Studio
Arcadia Data
Arcadia comes with everything you need to connect to Hadoop contained within it.
After installing Arcadia, click the Start button shown in Figure 156. You see a message
displaying Server is running, at which point you click Go. At this point you must register the
product, or you cannot proceed. You then create a Hive connection in Arcadia, as shown next.
Figure 157: Hive connection in Arcadia, requires hostname, username and password to Hive
With your connection created, the Connection Explorer locates the databases in Hadoop and
lists all the tables in them.
The connection to Hive is a live one, and Arcadia is an interactive experience. You drag and
drop the fields from your tables into dimensions and measures to create visualizations. This is
shown in Figure 159, where the fields highlighted in the red rectangle are dragged into the fields
highlighted in the green rectangle.
At first glance, Arcadia is very capable of connecting to live Hadoop instances and creating
visualizations. We have looked at Arcadia 2.4, but when we revisit it, we'll look at Arcadia 5.0.
We'll see if progress has been made by this tool designed to work with big data.
Elasticsearch and Kibana
While it's known that Elasticsearch now uses the HDFS as a snapshot repository, it’s less well-
known that Kibana is being developed to ingest data commonly found in BI tools.
Many tools that were once the domain of Linux are now established on the Windows platform.
Kibana and Elasticsearch are welcome additions to that group of products. The Windows
installer may have made their installation painless, but launching both tools is still done via the
command line. In Elasticsearch 6.6, the startup is automated by clicking the icon that appears
after installation. You also have the option of installing it as a service with this release. You
always run Elasticsearch before Kibana. After launching Elasticsearch, you'll see the following
screen.
Figure 162: Kibana Server started and now active in Windows
The browser address to access Kibana is highlighted in red in Figure 162. The link shown is
http://localhost:5601. After accessing the link, you'll see the screen shown in Figure 163. You're
immediately invited to ingest data logs, operating system metrics, and much more. This is
perhaps the Kibana and Elasticsearch sweet spot: ingesting data generated by the hour,
minute, or second.
You can use the experimental feature to import data outside of the JSON format. These are
perhaps the first steps in not just utilizing Hadoop for snapshots, but for "front-loading" Hadoop
data for BI purposes. The experimental Import feature is shown on the following screen.
Figure 164: Front loading data into Kibana, the experimental Import data feature
This allows you to easily create outputs from a wider range of custom data sources in Windows.
Figure 165: Kibana dashboard using data from experimental Import data tool
The Syncfusion Dashboard Designer is made by the same company that produced the
Syncfusion Big Data Platform. We'll look at Syncfusion Dashboard Designer in the final chapter,
when we see how BI tools perform with larger data loads. Only the 4-GB Land Registry file we
downloaded earlier will be used in the final section. First, let's look at elements of Hadoop in
Linux that I'd like to see in Windows.
Three features from Hadoop in Linux I'd like to see more of
While this is a book about Hadoop for Windows, we should be aware of Hadoop on Linux
developments. In Cloudera (CDH), Impala is the default query editor engine. When you use
Query Editor you immediately access Impala to write queries. Figure 166 shows the same query
with joins that we ran earlier; it joins the Ratings table and the Episodes table. You could run
this query in any version of Hive or Pig, and it would take a good few minutes. Our figure
highlights the 4.31 seconds it took in Impala. The joined tables presented no problems and
near-relational database speeds were achieved.
Figure 167: Basic visualization facilities in Cloudera CDH
The Hortonworks Linux platform has a function that would benefit any Hadoop distribution:
automatic creation or removal of the first row as a header for data. The ability to define
delimiters would also grace any Hadoop distribution.
Figure 168: Automatic creation or removal of column header and defining of delimiters
Chapter 5 When Data Scales, Does BI Fail?
We need to see how BI tools for Hadoop fare when faced with a larger data load in Windows. For this reason, we'll write the code that defines the .csv file format and the external Land Registry table for the 4-GB file in SQL Server 2019.
Code Listing 23: Code for creating csv file format and Land Registry table in SQL Server 2019
-- External file format for the comma-delimited Land Registry .csv file.
CREATE EXTERNAL FILE FORMAT CSVformat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ',',
                      DATE_FORMAT = 'MM/dd/yyyy',
                      STRING_DELIMITER = '"',
                      USE_TYPE_DEFAULT = TRUE));
GO
-- External table over the Land Registry file in HDFS. The table name and
-- column list here are assumptions for illustration; the full price-paid
-- file contains more columns than are shown.
CREATE EXTERNAL TABLE dbo.landregistry
(transactionid VARCHAR(50), price INT, transferdate DATE, postcode VARCHAR(10))
WITH (LOCATION='/ukproperty/',
      DATA_SOURCE = hadoop_4_windows,
      FILE_FORMAT = CSVformat
);
For this code to work, please make sure the 4-GB Land Registry file is in a folder called
ukproperty. To give the BI tools we're going to use a fair chance, we'll also create the table in
Hive.
Code Listing 24: Code to create Land Registry table in Hive
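As a sketch rather than the exact listing, an external Hive table over the same HDFS folder might look like this; the column list is trimmed and assumed.

-- Sketch only: external Hive table over the Land Registry .csv file.
CREATE EXTERNAL TABLE landregistry (
  transactionid STRING,
  price INT,
  transferdate STRING,
  postcode STRING,
  propertytype STRING,
  town STRING,
  county STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/ukproperty/';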
The tools we are going to use are: Tableau, Azure Data Studio, Arcadia Data, and Syncfusion
Dashboard Designer.
We already know the limitations of QlikView Direct Discovery, and with Azure Data Studio and
SQL Server, Microsoft products are already represented.
Figure 169: Tableau connecting live to 4-GB Land Registry file in Hive
The Tableau live connection was immediate and fast, automatically previewing 10,000 rows in a
preview window with ease. I certainly don't need a data preview for anything that big, but it was
impressive to see. I was able to use any visualization I wished to without adverse effects on
performance. Overall, it's an easy pass for Tableau, though it would be good to see one or two
new visualizations. As you work with larger amounts of data, standard visualizations don't
always provide the best presentation choices. The Tableau visualization is shown in the
following figure.
Figure 170: Tableau visualization using Land Registry data
Using Azure Data Studio with large datasets in Windows Hadoop
Azure Data Studio pulled back the data in almost effortless fashion. The highlight was the easy
use of IntelliSense on a file that big. A query with a where clause, as shown in Figure 171, works
with IntelliSense. There is no delay in predicting the next word you type, or identifying words in
the system. As before, there is no connection via Hive, but directly to HDFS from SQL Server. I
can't find fault with the performance of this tool, and the application is compatible with the
Syncfusion platform. As good as Azure Data Studio is, it's only as solid and reliable as the
Hadoop system it's connected to.
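A filtered query of the kind described, as a sketch; the table and column names are assumptions carried over from the earlier external table definition.

SELECT postcode, price, transferdate
FROM dbo.landregistry
WHERE price > 1000000;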
Figure 171: Azure Data Studio making a fast connection to Land Registry data file in Hadoop
Perhaps because of the small size of the app, I was expecting some kind of decrease or
degradation of performance. That didn't happen though, and it's clear that Microsoft has laid the
foundations for much smoother big data experiences. The word I would probably choose to
describe the performance of this tool overall is “effortless.” Windows Server never felt like it was
under any strain at all.
Arcadia has a number of features that enable you to sustain long sessions with large amounts
of data. The following figure shows the "Clear result cache" facility, something you don't see
often in BI tools.
In order to analyze data and create visuals, you create datasets from your tables. This just
involves clicking New dataset in the same row that your table is shown in. This is portrayed in
the following figure.
Figure 175: Creating a new dataset from a table
You can now create a visual from your dataset by clicking the New Visual link.
Figure 177: Choosing dimensions and measures for our visualizations
We can filter the field data and store the choice to create segments.
Using Syncfusion Dashboard Designer with large datasets in Windows Hadoop
When you start the Syncfusion Dashboard Designer, you see what is effectively a blank canvas.
In Figure 181, the icon highlighted in red is the button for creating a data source. To connect to
SQL Server 2019, you simply fill out the form shown in the following figure, then click Connect.
Figure 182: Connection to SQL Server 2019
If the tables you want to work with are in Hive, you can use a Hive connection or a faster Spark
SQL connection. This will pick up the tables in the "default" database when you log in.
You instantly connect to a blank canvas, and on the left side, you see the tables within SQL
Server. You follow the onscreen message to drag and drop tables to create a virtual table.
Figure 184: Invitation to drag and drop tables to create a virtual table
When you drag your table to the canvas, you can add columns as you wish. The application
does not overload the table with data from the data source.
Figure 186: Filtering records you do or don't wish to select
You can change column types, rename columns, and carry out aggregation functions on the fly.
Figure 188: Bar and Column graph icons on the Dashboard tab
You create graphs by dragging Measures and Dimension fields to values, columns, and row
fields. If large amounts of data are being loaded, a warning appears on screen, inviting you to
filter the data. This warning is shown in the following figure.
Figure 190: Filtering the records in your dataset
If you wish, you can use the Rank feature; this example ranks the top 10 average house prices.
We can now create visualizations with our filtered dataset. Dashboard speed is maintained by
preventing unnecessary data from overwhelming the dashboard.
Figure 192: Creating visualizations from our filtered data
We can put together several visualizations to create dashboards; this includes the pivot table visualization shown in Figure 193. I often avoid pivot table visualizations, as many dashboard creation tools can't manage the data behind them. You have seen throughout the use of this tool that it has been designed from the ground up to manage large amounts of data. That said, Syncfusion is the designer of the Big Data Platform, so the excellent performance of the tool here isn't surprising.
Figure 193: Syncfusion Dashboard Designer Dashboard including pivot table visualization
Conclusion
At the start of this book, we were standing in the shadows of Hadoop on Linux. We were aware that Hadoop for Windows was seen as a novelty in some circles, a fledgling aspiration perhaps. Now, at the end of the book, would you say you feel the same? Why didn't we notice the resources Microsoft spent integrating Hadoop into its Windows real estate? A cynic might say Microsoft is cleverly "blurring the lines" by constantly releasing free software to Windows, Mac, and Linux users and normalizing cross-platform software. It counters the notion that Hadoop belongs to Linux, or that SQL Server belongs to Windows. The 7,000,000 downloads of SQL Server for Linux further reinforce that.
A year ago, my first thought for a new tablet would have been an iPad. But I've recently taken
delivery of a Microsoft 2-in-1 cellular tablet, on which I was able to install Azure Data Studio and
connect to Hadoop. The introduction of a portable, mobile experience to Hadoop is built into Microsoft's strategy. Microsoft is constantly releasing new builds of these apps, so it is
certainly committed to them. Why should I have to use a heavy server or PC to access
Hadoop—who says things have to stay the same? Why can't I access it using an app with a tiny
80-MB install file? These innovations provide attractive new options for accessing Hadoop, and
they are available on Windows. Arcadia Data has released its software on the Windows
platform, and the INSTANT version is only 85 MB and allows fast connections to Hadoop.
SQL Server does not "belong" to Windows anymore, and Hadoop does not "belong" to Linux.
The door has already been opened—you only need to walk through it!