MongoDB & Spark
Table of Contents
Introduction
Conclusion
We Can Help
Introduction
We live in a world of big data. But it isn't only the data
itself that is valuable; it is the insight it can generate. That
insight can help designers better predict new products that
customers will love. It can help manufacturing companies
model failures in critical components before costly recalls.
It can help financial institutions detect fraud. It can help
retailers better forecast supply and demand. Marketing
departments can deliver more relevant recommendations
to their customers. The list goes on.
How quickly an organization can unlock and act on that
insight is becoming a major source of competitive
advantage. Collecting data in operational systems and then
relying on nightly batch ETL (Extract Transform Load)
processes to update the Enterprise Data Warehouse
(EDW) is no longer sufficient. Speed-to-insight is critical,
and so analytics against live operational data to drive
real-time action is fast becoming a necessity, enabled by
technologies like MongoDB and Apache Spark.
Data Aggregation
The MongoDB Aggregation Framework is similar in
concept to the SQL GROUP BY statement, enabling users
to generate aggregations of values returned by the query
(e.g., count, minimum, maximum, average, intersections)
that can be used to power reports, dashboards and
visualizations.
Using the Aggregation Framework, documents in a
MongoDB collection (analogous to a table in a relational
database) pass through an aggregation pipeline, where
they are processed by operators. Expressions produce
output documents based on calculations performed on the
input documents. The accumulator expressions used in the
$group operator maintain state (e.g., totals, minimums,
maximums, averages) as documents progress through the pipeline.
The aggregation pipeline enables multi-step data
enrichment and transformation to be performed directly in
the database through a simple declarative interface,
enabling processes such as lightweight ETL to run within
MongoDB.
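As an illustrative sketch (the collection, field names, and values here are hypothetical, not from the source), a pipeline that filters orders by status and then groups them by customer could be expressed as below. The pipeline document is exactly what would be passed to `collection.aggregate(pipeline)` via a MongoDB driver; the plain-Python function simulates what the `$match` and `$group` stages compute so the example is self-contained:

```python
# Hypothetical documents; in practice these would live in a MongoDB
# collection and be processed server-side by the aggregation pipeline.
orders = [
    {"cust_id": "A123", "status": "shipped", "amount": 500},
    {"cust_id": "A123", "status": "shipped", "amount": 250},
    {"cust_id": "B456", "status": "pending", "amount": 200},
    {"cust_id": "B456", "status": "shipped", "amount": 300},
]

# The aggregation pipeline as it would be passed to
# collection.aggregate(pipeline) with a MongoDB driver.
pipeline = [
    {"$match": {"status": "shipped"}},
    {"$group": {"_id": "$cust_id", "total": {"$sum": "$amount"}}},
]

def run_pipeline(docs):
    """Plain-Python simulation of the two stages above."""
    matched = [d for d in docs if d["status"] == "shipped"]   # $match
    totals = {}                                               # $group / $sum
    for d in matched:
        totals[d["cust_id"]] = totals.get(d["cust_id"], 0) + d["amount"]
    return [{"_id": cid, "total": t} for cid, t in sorted(totals.items())]

print(run_pipeline(orders))
# → [{'_id': 'A123', 'total': 750}, {'_id': 'B456', 'total': 300}]
```

The `$sum` accumulator here is the state-maintaining expression described above: it carries a running total per group as documents flow through the pipeline.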
Processing Paradigm
Many programming languages can use their own
MongoDB drivers to execute queries against the database,
returning results to the application where additional
analytics can be run using standard machine learning and
statistics libraries. For example, a developer could use the
MongoDB Python or R drivers to query the database,
loading the result sets into the application tier for additional
processing.
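A minimal sketch of this pattern, with a hypothetical result set standing in for what a driver query such as `collection.find(...)` would return (so the example runs without a database), followed by app-tier analysis with Python's standard statistics library:

```python
import statistics

# Result set as a hypothetical MongoDB query might return it, e.g.
#   docs = list(collection.find({"sensor": "s1"}, {"reading": 1}))
# A literal list is used here so the sketch is self-contained.
docs = [
    {"_id": 1, "reading": 21.5},
    {"_id": 2, "reading": 22.0},
    {"_id": 3, "reading": 20.5},
    {"_id": 4, "reading": 23.0},
]

# Application-tier analytics on the loaded result set.
readings = [d["reading"] for d in docs]
mean = statistics.mean(readings)
spread = statistics.pstdev(readings)

print(mean)  # 21.75
```

In a real deployment the same result set could just as easily be fed into NumPy, pandas, or R equivalents; the point is that the driver returns plain documents the application tier can process with any library.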
However, this starts to become more complex when an
analytical job in the application needs to be distributed
across multiple threads and nodes. While MongoDB can
service thousands of connections in parallel, the application
would need to partition the data, distribute the processing
across the cluster, and then merge results. Spark makes
this kind of distributed processing easier and faster to
develop. MongoDB exposes operational data to Spark's
distributed processing layer to provide fast, real-time
analysis. Combining Spark queries with MongoDB indexes
allows data to be filtered, avoiding full collection scans and
delivering low-latency responsiveness with minimal
hardware and database overhead.
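The manual partition-process-merge bookkeeping described above can be sketched in plain Python (the data and the per-partition work are hypothetical stand-ins); Spark's value is that its RDDs and DataFrames handle this distribution, scheduling, and result merging automatically:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical documents already fetched from MongoDB.
docs = [{"value": v} for v in range(1, 101)]

def partition(data, n):
    """Split data into n roughly equal chunks."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

def process(chunk):
    """Per-partition work: sum the values (a stand-in for real analytics)."""
    return sum(d["value"] for d in chunk)

# Manually distribute the chunks across workers, then merge the
# partial results -- the steps Spark performs for you at cluster scale.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process, partition(docs, 4)))

total = sum(partials)
print(total)  # 5050
```

Even in this toy form, the three distinct responsibilities — partitioning, distributing the processing, and merging results — are visible; at cluster scale they also require data locality, retries, and fault tolerance, which is the machinery Spark provides.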
Skills Re-Use
With libraries for SQL, machine learning, and more,
combined with support for Java, Scala, and Python,
developers can leverage existing skills and best practices
to build sophisticated analytics workflows on top of
MongoDB.
[Table: feature comparison of Spark connectors for MongoDB — SQL,
DataFrames, Streaming, Machine Learning, Python support, MongoDB
support, and HDFS support — including the Stratio Spark-MongoDB
Connector; individual cell values not recoverable]
Figure 2: Using MongoDB Replica Sets to Isolate Analytics from Operational Workloads
Hadoop Integration
Like MongoDB, Hadoop is seeing growing adoption across
industry and government, becoming an adjunct to and in
Figure 3: Integrating MongoDB & Spark with BI Tools
Conclusion
MongoDB natively provides a rich analytics framework
within the database. Multiple connectors are also available
to integrate Spark with MongoDB to enrich analytics
capabilities by enabling analysts to apply libraries for
machine learning, streaming and SQL to MongoDB data.
Together, MongoDB and Apache Spark are already
enabling developers and data scientists to turn analytics
into real-time action.
We Can Help
We are the MongoDB experts. Over 2,000 organizations
rely on our commercial products, including startups and
more than a third of the Fortune 100. We offer software
and services to make your life easier:
MongoDB Enterprise Advanced is the best way to run
MongoDB in your data center. It's a finely-tuned package
of advanced software, support, certifications, and other
services designed for the way you do business.
MongoDB Cloud Manager is the easiest way to run
MongoDB in the cloud. It makes MongoDB the system you
worry about the least and like managing the most.
Production Support helps keep your system up and
running and gives you peace of mind. MongoDB engineers
help you with production issues and any aspect of your
project.
Development Support helps you get up and running quickly.
It gives you a complete package of software and services
for the early stages of your project.
Resources
For more information, please visit mongodb.com or contact
us at [email protected].
Case Studies (mongodb.com/customers)
Presentations (mongodb.com/presentations)
Free Online Training (university.mongodb.com)
Webinars and Events (mongodb.com/events)
Documentation (docs.mongodb.org)
MongoDB Enterprise Download (mongodb.com/download)
New York Palo Alto Washington, D.C. London Dublin Barcelona Sydney Tel Aviv
US 866-237-8815 INTL +1-650-440-4474 [email protected]
© 2015 MongoDB, Inc. All rights reserved.