0% found this document useful (0 votes)

206 views

Kafka and Mongodb

Uploaded by

Mani Yangkatisal

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

206 views

Kafka and Mongodb

Uploaded by

Mani Yangkatisal

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 15

A MongoDB White Paper

Data Streaming with Apache Kafka &

MongoDB
November 2017
Table of Contents
Introduction 1

Apache Kafka 1
How Kafka Works & What it Provides 1

Kafka Use Cases 3

Implementation Recommendations for Kafka 5

Kafka Limitations 5

Alternate Messaging Systems 6

Related Technologies 6

A Diversion – Graphically Build Data Streams Using

Node-RED 7

Operationalizing the Data Lake with MongoDB 9

MongoDB As a Kafka Producer – Change Streams 10

MongoDB As a Kafka Consumer 11

MongoDB and Kafka in the Real World 11

Recent Additions to Kafka 12

We Can Help 11

Resources 12
Introduction

In today's data landscape, no single system can provide all modern data ingestion and routing pipelines. A new
of the required perspectives to deliver real insight. Deriving generation of technologies is needed to consume and
the full meaning from data requires mixing huge volumes exploit today's data sources. This paper digs into these
of information from many sources. technologies (Kafka in particular) and how they're used.
The paper also examines where MongoDB fits into the
At the same time, we're impatient to get answers instantly;
data streaming landscape and includes a deep dive into
if the time to insight exceeds 10s of milliseconds then the
how it integrates with Kafka.
value is lost – applications such as high frequency trading,
fraud detection, and recommendation engines can't afford
to wait. This often means analyzing the inflow of data Apache Kafka
before it even makes it to the database of record. Add in
zero tolerance for data loss and the challenge gets even
more daunting. How Kafka Works & What it Provides
As the number, variety, and velocity of data sources grow, Kafka provides a flexible, scalable, and reliable method to
new architectures and technologies are needed. distribute streams of event data from one or more
pr
producers
oducers to one or more consumers
consumers. Examples of
Apache Kafka and data streams are focused on ingesting
events (or messages
messages) include:
the massive flow of data from multiple fire-hoses and then
routing it to the systems that need it – filtering, • A periodic sensor reading such as the current
aggregating, and analyzing en-route. temperature

Enterprise messaging systems are far from new, but even • A user adding an item to the shopping cart in an online
frameworks from the last 10 years (such as ActiveMQ and store
RabbitMQ) are not always up to the job of managing • A Tweet being sent with a specific hashtag

1
• A log entry generated for each click in a web application In Kafka, topics are further divided into partitions to
support scale out. As each message is produced, the
Streams of Kafka events are organized into topics
topics. A
producer determines the correct partition for a message
producer chooses a topic to send a given event to and
(depending on its topic and message key), and sends it to
consumers select which topics they pull events from. For
an appropriate br
broker
oker for that partition. In this way, the
example, a financial application could pull NYSE stock
processing and storage for a topic can be linearly scaled
trades from one topic, and company financial
across many brokers. Similarly, an application may scale
announcements from another in order to look for trading
out by using many consumers for a given topic, with each
opportunities.
pulling events from a discrete set of partitions.
The consumers typically receive streamed events in near
A consumer receives events from a topic's partition in the
real-time, but Kafka actually persists data internally,
order that they were added by the provider, but the order is
allowing consumers to disconnect for hours and then catch
not guaranteed between different partitions. While this
up once they’re back online. In fact, an individual consumer
appears troublesome, it can be mitigated by controlling
can request the full stream of events from the first event
how events are assigned to partitions. By default, the
stored on the broker. Administrators can choose to keep
mapping of events to partitions is random but a
the full history for a topic, or can configure an appropriate
partitioning key can be defined such that 'related' events
deletion policy. With this system, it's possible for different
are placed in the same partitions. In our financial
consumers to be processing different chunks of the
application example, the stock symbol could be used as the
sequence of events at any given time. Each event in a topic
partitioning key so that all events for the same company
is assigned an offset to identify its position within the
are written to the same partition – if the application then
stream. This offset is unique to the event and never
has just one consumer pulling from that partition then you
changes.
have a guarantee that all trades for that stock are
processed in the correct order.

Multiple consumers can be combined to form a consumer

gr
group
oup. Each consumer within the group processes a
subset of events from a specific topic – at any instant,
messages from one partition can only be received by one
consumer within the group. Different applications (e.g.,
fraud detection and product recommendation) can be
represented by their own consumer groups.

High A
Availability
vailability can be implemented using multiple
Kafka brokers for each topic partition. For each partition
there is a single broker that acts as the leader in addition
to one or more followers
followers. The leader handles all reads and
Figur
Figuree 1: Kafka Topics & Partitions
writes for the topic partition and the followers each
replic
eplicate
ate from the leader. Should the leader fail, one of the
The ability to reliably replay events that have already been
followers is automatically promoted to be the new leader.
received, in the exact order in which they were received,
Typically, each broker acts as the leader for some partitions
provides a number of benefits such as:
and as a follower for others. This replication approach
• Newly added consumers can catch up on everything prevents the loss of data when a broker fails and increases
that's happened Kafka's availability,

• When the code for an application is updated, the

application can process the full stream of events again,
applying the new logic to each one

2
Kafka Use Cases

Log Aggr
Aggregation
egation: This has traditionally involved the
collection of physical log files from multiple servers so that
they can be stored in a central location – for example,
HDFS. Analyzing the data would then be performed by a
periodic batch job. More modern architectures use Kafka to
combine real-time log feeds from multiple sources so that
the data can be constantly monitored and analyzed –
reducing the interval between an event being logged and
its consequences being understood and acted upon.

Website Activity T Trac

racking
king: This was the original use case
for Kafka. Site activity such as pages visited or adverts
Figur
Figure
e 2: Kafka Replication rendered are captured into Kafka topics – one topic per
data type. Those topics can then be consumed by multiple
An event is considered committed when all replicas for functions such as monitoring, real-time analysis, or
that topic partition have applied it. A producer has the archiving for offline analysis. Insights from the data are
option of blocking until the write has been committed or then stored in an operational database such as MongoDB
continuing right away. Consumers only ever read where they can be analyzed alongside data from other
committed events. sources.

Each Kafka broker stores all of the data for its topic Event Sour
Sourcing
cing: Rather than maintaining and storing the
partitions on disk in order to provide persistence. Because latest application state, event sourcing relies on storing all
the data is immutable and append-only, each broker is able of the changes to the state (e.g., [x=150, x++, x+=12,
to handle a large number of writes, and the cost of reading x-=2]) in the original order so that they can be replayed to
the most recent message remains constant as the volume recreate the final state. This pattern is often used in
of data stored grows. In spite of Kafka's ability to work financial applications. Kafka is well suited to this design
efficiently with large data volumes, it is often desirable to approach as it can store arbitrarily long sequences of
remove old or obsolete data; Kafka users can choose events, and quickly and efficiently provide them in the
between two different algorithms for managing space: correct sequence to an application.
retention policies and log compaction:
Micr
Microservices
oservices: Microservice architectures break up
• Retention policy
policy: You can choose, on a per-topic basis, services into small, discrete functions (typically isolated
to delete log files after they reach a certain age, or the within containers) which can only communicate with each
number of bytes in the topic exceeds a certain size other through well defined, network-based APIs. Examples
• Log compaction
compaction: Log Compaction is an optional Kafka include eCommerce applications where the service behind
feature that reduces the storage size for keyed data the 'Buy Now' button must communicate with the inventory
(e.g., change logs for data stored in a database). Log service. A typical application contains a large number of
compaction allows you to retain only the most recent microservices and containers. Kafka provides the means
message with a given key, and delete older messages for containers to pass messages to each other – multiple
with the same key. This can be useful for operations like containers publishing and subscribing to the same topics
database updates, where you only care about the most so that each container has the data it needs. Since Kafka
recent message. persists the sequences of messages, when a container is
rescheduled, it is able to catch up on everything it has
missed; when a new container is added (e.g., to scale out) it
can bootstrap itself by requesting prior event data.

3
Figur
Figure
e 3: Stream Processing with Kafka & MongoDB

More information on Microservices can be found in the Streaming, Apache Flink, and Kafka Streams are used to
Microservices: The Evolution of Building Modern process events as they pass through, while interesting
Applications white paper as well as in Enabling events and results are written to a database like MongoDB
Microservices: Containers & Orchestration Explained. where they're used for analysis and operational decisions.
This is the pattern used for the Indian smart housing
Str
Stream
eam Pr
Processing
ocessing: stream processing involves filtering,
project described later – where MongoDB stores
manipulating, triggering actions, and deriving insights from
aggregated energy sensor data which is used for billing
the data stream as it passes through a series of functions.
and energy performance benchmarking between
Kafka passes the event messages between the processing
properties.
functions, merging and forking the data as required.
Technologies such as Apache Storm, Samza, Spark

Figur
Figure
e 4: Lambda Architecture

4
Lambda ArArcchitectur
hitecturee: Applications following the Lambda Some data may still need to be kept in silos due to security
Architecture (Figure 4) augment data produced by stream concerns, but it's worth asking the question about which
processing on recent events (the Speed Layer) with views topics can be made available to all internal users. Kafka 0.9
built from batch processing jobs run on the full, historical supports secure authentication and per-topic ACLs, which
data set (the Batch Layer). allow administrators to control access to individual topics.

The Lambda Architecture is coming under increasing J SON is a suitable common data format for
scrutiny due to the operational complexity of managing two standardization, and it's also worth considering Apac
Apache
he
distributed systems which must implement the same logic. Avr
vro
o, which allows schemas to be enforced on any given
A more contemporary, scalable, and less complex solution topic. Avro also has the advantage that it's more compact
is described below in Operationalizing the Data Lake with than JSON. It's very simple to map between JSON and
MongoDB. Avro, which helps when integrating with MongoDB.

Internet of T
Things
hings (IoT)
(IoT): IoT applications must cope with Each topic should use a common schema, but it's also
massive numbers of events being generated by a multitude important to recognize that schemas evolve over time;
of devices. Kafka plays a vital role in providing the fan-in consider including a schema version identifier in each
and real-time collection of all of that sensor data. A message.
common use case is telematics, where diagnostics from a
Kafka is extremely good at ingesting and storing massive
vehicle's sensors must be received and processed back at
numbers of events in real-time; this makes it a great buffer
base.
to smooth out bursty traffic. An example could be the
Once captured in Kafka topics, the data can be processed results produced at the end of a customer classification
in multiple ways, including stream processing or Lambda Hadoop batch job, which must then be propagated into
architectures. It is also likely to be stored in an operational personalization and marketing applications.
database such as MongoDB, where it can be combined
As in any distributed system, poor network performance is
with other stored data to perform real-time analytics and
the enemy of low latency and high throughput. Where
support operational applications such as triggering
possible, keep processing local and replicate data between
personalized offers.
Kafka clusters in different data centers.

Implementation Kafka Limitations

Recommendations for Kafka By default, the replication of messages from the leader to
the followers is asynchronous, which can result in lost
Great value can be derived from combining multiple messages. In many applications, that is a worthwhile trade
sources of data – some combinations might be obvious off (e.g., for log aggregation) as the performance benefits
when setting up the Kafka architecture, but others only outweigh the potential loss of a small number of events.
come to mind later when an application developer or When greater safety is required, a topic can be configured
analyst sees everything that's available. For this reason, it's to require 1 or more followers to acknowledge receipt of
recommended to design on the side of openness and the event before it's committed.
extensibility:
Writes are immutable – once an event has been written to
• Build a minimal number of Kafka clusters. a Kafka topic, it cannot be removed or changed, and at
some point, it will be received and acted upon by all
• Make your Kafka topics accessible to all.
subscribed consumers. In many cases, it may be possible
• Use common data formats between topics to reduce to undo the downstream effects of an event by creating a
barriers in parsing and combining multiple data sources. new, compensating event – e.g. if you want to undo an
event to reduce a balance by $100 then a second event

5
can be written that increases the same balance by $100. Apac
Apache
he Storm
Storm: Provides batch, distributed processing of
There are other scenarios where it's harder to roll back the streaming data. Storm is often used to implement Stream
clock; consider a rogue temperature sensor event from a Processing, where it can act as both a Kafka consumer of
data center that triggers cutting power and a halon dump. raw data and a producer of derived results.

Kafka is designed for high performance writes and Apac

Apachehe Her
Heron
on: The successor to Apache Storm, produced
sequential reads but if random access is a requirement by Twitter after their Storm deployment ran into issues
then the data should be be persisted and read from a when scaled to thousands of nodes. It is more resource
database such as MongoDB. efficient and maintains API compatibility with Storm, while
delivering an order of magnitude greater performance.

Alternate Messaging Systems Apac

Apache he Spark
Spark: Performs data analytics using a cluster of
nodes. Spark is used for many batch workloads that would
previously have been run with Apache Hadoop – Spark's
The idea of moving data between systems is nothing new ability to store the data in memory rather than on disk
and frameworks have been available for decades. Two of results in much faster execution. More information can be
the most popular frameworks are described here. found in Apache Spark and MongoDB – Turning Analytics
into Real-Time Action.
RabbitMQ
RabbitMQ: Implements the Advanced Message Queuing
Protocol (AMQP) to act as a message broker. RabbitMQ Apac
Apachehe Spark Str
Streaming
eaming: Adds the ability to perform
focuses on the delivery of messages to consumers with streaming analytics, reusing the same application analytics
complex routing and per-message delivery guarantees. code used in Spark batch jobs. As with Storm it can act as
Kafka is more focused on ingesting and persisting massive both a Kafka consumer of raw data and producer of
streams of updates and ensuring that they're delivered to derived results – it may be preferred if you already have
consumers in the correct sequence (for a given partition). Spark batch code that can be reused with Spark
Streaming.
Apac
Apachehe ActiveMQ
ActiveMQ: A Java based message broker which
include a Java Message Service (JMS) client. As with Kafk
Kafkaa Str
Streams
eams: A new library included with Apache Kafka
RabbitMQ, the advantage offered by Kafka is the ingesting 0.10 that allows you to build a modern stream processing
and persisting of massive streams of data. system with your existing Kafka Cluster. Kafka Streams
makes it easy to join, filter, or aggregate data from streams
using a high level API.
Related Technologies
Apac
Apache
he Apex
Apex: Enables batch and streaming analytics on
Hadoop – allowing the same application Java code to be
A number of technologies complement Kafka; some of the
used for both. Through Apache Apex Malhar
Malhar, reference
more notable ones are described here.
data can be pulled in from other sources and results can
Kafk
Kafkaa Connect
Connect: A standard framework to reliably stream be pushed to those same data stores – including
data between Kafka and other systems. It makes it simple MongoDB. Malhar is also able to act as a Kafka consumer,
to implement connectors that move large sets of data into allowing Apex to analyze streams of topic data.
and out of Kafka. Kafka Connect can run in either
Apac
Apachehe Flume
Flume: A distributed, highly available service for
standalone or distributed modes – providing scalability.
aggregating large volumes of log data, based on streaming
Apac
Apache
he Zookeeper
Zookeeper: Provides distributed configuration, data flows. Kafka is more general purpose and has a
synchronization, and naming for large distributed systems. couple of advantages over Flume:
Kafka uses Zookeeper to share configuration information
• Kafka consumers pull data and so cannot be
across a cluster.
overwhelmed by the data stream.

6
• Kafka includes replication so events are not lost if a
A Diversion – Graphically Build
single cluster node is lost.
Data Streams Using Node-RED
Apac
Apachehe Flink
Flink: A framework for distributed big data
analytics using a distributed data flow engine. Any data
storage is external to Flink; e.g., in MongoDB or HDFS. Node-RED provides a simple, graphical way to build data
Flink often consumes its incoming data from Kafka. pipelines – referred to as flows
flows. There are community
provided nodes which act as connectors for various
Apac
Apache he A
Avr
vro
o: A data serialization system, very often used devices and APIs. At the time of writing, there are almost
for event messages in Kafka. The schema used is stored 500 nodes available covering everything from Nest
with the messages and so serialization and deserialization thermostats and Arduino boards, to Slack and Google
is straight-forward and efficient. Avro schemas are defined Hangouts, to MongoDB. Flows are built by linking together
in JSON, making it simple to use with languages, libraries, nodes and by adding custom JavaScript code. Node-RED
and databases such as MongoDB designed to work with is a perfect tool to get small IoT projects up and running
JSON. quickly or to run at the edge of the network to collect
sensor data.

Installing Node-Red together with the node for MongoDB

and then starting the server is as simple as:

sudo npm install -g node-red

sudo npm install -g node-red-node-mongodb
node-red

Once running, everything else is setup by browsing

http://127.0.0.1:1880/. Building a new flow is
intuitive:

• Drag nodes onto the canvas.

• Double click to configure a node.

• Drag lines between nodes to connect them.

• Add snippets of JavaScript to manipulate the data as it

passes through a function node.

Figur
Figuree 5: Example Node-RED Flow

The flow shown in Figure 5 implements a weather

publishing system:

• The timestamp node triggers the flow every five

minutes.

7
• The Fetch Weather node makes a web API call to To recreate this flow, simply import this definition.
retrieve the current weather and forecast data for our
location; that data is then used in two ways:

◦ The summary of the current weather is extracted by

the Prepare Tweet JavaScript node and then the
Tweet node tweets out that summary.

◦ The Data Cleanup JavaScript node extracts the

current weather and reformats the timestamp before
sending it to the Record Weather node which writes
it to a MongoDB collection.

The Javascript for Prepare Tweet is:

var weather = JSON.parse(msg.payload);

msg = {};
payload=weather.currently.summary;
msg.payload=payload;
return msg;

and for Data Cleanup:

var weather = JSON.parse(msg.payload);

msg = {};
payload = {};
payload.weatherNow=weather.currently;
payload.weatherNow.time=new Date(
weather.currently.time * 1000);
msg.payload=payload;
return msg;

After executing the flow, the Tweet gets sent and the data
is stored in MongoDB:

db.weather.findOne()
{
"_id" : ObjectId("571e2e9865177b7a1f40ddb5"),
"weatherNow" : {
"time" : ISODate("2016-04-25T14:49:58Z"),
"summary" : "Mostly Cloudy",
"icon" : "partly-cloudy-day",
"nearestStormDistance" : 0,
"precipIntensity" : 0.0033,
"precipIntensityError" : 0.0021,
"precipProbability" : 0.21,
"precipType" : "rain",
"temperature" : 51.51,
"apparentTemperature" : 51.51,
"dewPoint" : 39.57,
"humidity" : 0.64,
"windSpeed" : 13.41,
"windBearing" : 314,
"visibility" : 10,
"cloudCover" : 0.7,
"pressure" : 1007.32,
"ozone" : 387.18
}
}

8
Operationalizing the Data Lake • MongoDB exposes these models to the operational
processes, serving queries and updates against them
with MongoDB with real-time responsiveness.

• The distributed processing libraries can re-compute

The traditional Enterprise Data Warehouse (EDW) is analytics models, against data stored in either HDFS or
straining under the load, overwhelmed by the sheer volume MongoDB, continuously flowing updates from the
and variety of data pouring into the business, and being operational database to analytics views.
able to store it in a cost-efficient way. As a result many
The details and motivation for this solution may be found in
organizations have turned to Hadoop as a centralized
the white paper Unlocking Operational Intelligence from
repository for this new data, creating what many call a data
the Data Lake.
lake.

Figure 6 presents a design pattern for integrating

MongoDB with a data lake.

• Data streams are ingested into one or more Kafka

topics, which route all raw data into HDFS. Processed
events that drive real-time actions, such as personalizing
an offer to a user browsing a product page, or alarms
for vehicle telemetry, are routed to MongoDB for
immediate consumption by operational applications.

• Distributed processing libraries such as Spark or

MapReduce jobs materialize batch views from the raw
data stored in the Hadoop data lake.

Figur
Figure
e 6: Design pattern for operationalizing the data lake

9
MongoDB As a Kafka Producer An example use of change streams using the mongo shell:

– Change Streams var myCursor = db.simples.aggregate(

[{$changeStream:{}}])

db.simples.update({name:'Billy'},
When using any database as a producer, it's necessary to {$set:{score:50}})
capture all database changes so that they can be written to
myCursor.forEach(printjson);
Kafka. With MongoDB, you can do this using change {
str
streams
eams (new in MongoDB 3.6). The change stream API "_id" : {
"clusterTime" : {
enables applications to register for real-time notifications
"ts" : Timestamp(1505741442, 1)
of inserts, updates, or deletions in the database. Change },
streams allow applications to instantly view, filter, and act "uuid" : UUID("9bcd6be7-9c7f-4ee5-\
a278-659f8ce13d1a"),
on changes to data as they occur. "documentKey" : {
"_id" : ObjectId("59bfc4a33497\
You open a change stream against a collection, and it then bef0711d9346")
tracks all data changes within that collection – you can }
},
access the results using regular cursor operations. You "operationType" : "update",
create a change stream using the aggregation pipeline, "ns" : {
"db" : "clusterdb",
where $changeStream is the first stage. You may limit
"coll" : "simples"
which changes are included using the operationType },
option (inserts, deletes, replacements, or updates – or any "documentKey" : {
"_id" : ObjectId("59bfc4a33497bef0\
combination). $match, $project, $addFields, 711d9346")
$replaceRoot, and $redact can be used in subsequent },
"updateDescription" : {
stages to filter further and massage the notifications. "updatedFields" : {
"score" : 50
},
"removedFields" : [ ]
}
}

The MongoDB drivers abstract away the use of the

aggregation pipeline using the watch and on commands.
For example (using JavaScript/Node.js):

var changeStream = collection.watch([

$match: {operationType: 'insert'
}]);

changeStream.on('change', function(change) {
// This is where you can write the
// contents of `data` to the Kafka topic
console.log(change);
});

When implementing the consumer, you should copy the

required fields from the document and write them to your
Kafka topic.

10
MongoDB As a Kafka Consumer Josh Softwar
Software e: Part of a project near Mumbai, India that
will house more than 100,000 people in affordable smart
homes. Data from millions of sensors, is pushed to Kafka
In order to use MongoDB as a Kafka consumer, the and then processed in Spark before the results are written
received events must be converted into BSON documents to MongoDB. MongoDB Connector for Hadoop is used to
before they are stored in the database. It's a simple and connect the operational and analytical data sets. More
automatable process to convert the received JSON or Avro details on this project and its use of Kafka, MongoDB and
message payload into a Java object and apply any required Spark can be found in this blog post.
business logic on that object before encoding it as a
BSON document which is then written to MongoDB.

This blog post shows a worked Java example of a

MongoDB Kafka consumer.
Compar
Comparethemarket.com
ethemarket.com: One of the UK’s leading price
comparison providers, and one of the country’s best known
MongoDB and Kafka in the Real household brands. Comparethemarket.com uses MongoDB

World as the default operational database across its

microservices architecture. Its online comparison systems
need to collect customer details efficiently and then
securely send them to a number of different providers.
Once the insurers' systems respond,
Comparethemarket.com can aggregate and display prices
for consumers. At the same time, MongoDB generates
real-time analytics to personalize the customer experience
Man AHAHLL: The Man Group is one of the largest hedge across the company's web and mobile properties.
fund investors in the world. AHL is a subsidiary, focused on
systems trading. The data for up to 150,000 ticks per As Comparethemarket.com transitioned to microservices,
second is received from multiple financial sources and the data warehousing and analytics stack were also
written to Kafka. Kafka provides both the consolidation and modernized. While each microservice uses its own
buffering of the events before they are stored in MongoDB, MongoDB database, the company needs to maintain
where the data can be analyzed. Details on Man AHL's use synchronization between services, so every application
case can be found in this presentation. event is written to a Kafka topic. Event processing runs
against the topic to identify relevant events that can then
trigger specific actions – for example customizing
customer questions, firing off emails, presenting new offers
and more. Relevant events are written to MongoDB,
enabling the user experience to be personalized in real
St
State
ate: State is an intelligent opinion network connecting time as customers interact with the service.
people with similar beliefs who want to join forces and
Learn more about these use cases together with more
make waves. User and opinion data is written to MongoDB
details on using Kafka with MongoDB by watching the
and then the oplog is tailed so that all changes are written
Data Streaming with Apache Kafka & MongoDB webinar
to topics in Kafka where they are consumed by the user
replay.
recommendation engine. Details on State's use of
MongoDB and Kafka can be found in this presentation.

11
Recent Additions to Kafka of advanced software, support, certifications, and other
services designed for the way you do business.

MongoDB Atlas is a database as a service for MongoDB,

Kafka 0.11 was released in June 2017 with these key
letting you focus on apps instead of ops. With MongoDB
features:
Atlas, you only pay for what you use with a convenient
• Exactly once writing of messages (previously at least hourly billing model. With the click of a button, you can
once) scale up and down when you need to, with no downtime,
• Transactional writing of multiple messages across full security, and high performance.
multiple topics MongoDB Stitch is a backend as a service (BaaS), giving
• AdminClient API to allow programmatic administration developers full access to MongoDB, declarative read/write
of Apache Kafka controls, and integration with their choice of services.

• Support for record headers (to be used by the MongoDB Cloud Manager is a cloud-based tool that helps
application or middleware) you manage MongoDB on your own infrastructure. With
• Request rate quotas, limiting the rate that consumers automated provisioning, fine-grained monitoring, and
can retrieve messages (existing functionality can be continuous backups, you get a full management suite that
used to limit the volume of data retrieved) reduces operational overhead, while maintaining full control
over your databases.
• Improved resiliency
MongoDB Professional helps you manage your
Kafka 0.10 was released in May 2016 with these key
deployment and keep it running smoothly. It includes
features:
support from MongoDB engineers, as well as access to
• Kafka Streams is a Java library for building distributed MongoDB Cloud Manager.
stream processing apps using Kafka; in other words
Development Support helps you get up and running quickly.
enabling applications to transform input Kafka topics
It gives you a complete package of software and services
into output Kafka topics and/or invoke external services
for the early stages of your project.
or write to a database. Kafka Streams is intended to
distinguish itself from the more analytics-focused MongoDB Consulting packages get you to production
frameworks such as Spark, Storm, Flink, and Samza by faster, help you tune performance in production, help you
targeting core application functionality. scale, and free you up to focus on your next release.
• Performance enhancements for compressed streams.
MongoDB Training helps you become a MongoDB expert,
• Rack-aware replica assignment can be used to ensure from design to operating mission-critical systems at scale.
that a leader and its followers are run in different racks, Whether you're a developer, DBA, or architect, we can
improving availability. make you better at MongoDB.

We Can Help
We are the MongoDB experts. Over 4,300 organizations
rely on our commercial products, including startups and
more than half of the Fortune 100. We offer software and
services to make your life easier:

MongoDB Enterprise Advanced is the best way to run

MongoDB in your data center. It's a finely-tuned package

12
Resources

For more information, please visit mongodb.com or contact

us at [email protected].

Case Studies (mongodb.com/customers)

Presentations (mongodb.com/presentations)
Free Online Training (university.mongodb.com)
Webinars and Events (mongodb.com/events)
Documentation (docs.mongodb.com)
MongoDB Enterprise Download (mongodb.com/download)
MongoDB Atlas database as a service for MongoDB
(mongodb.com/cloud)
MongoDB Stitch backend as a service (mongodb.com/
cloud/stitch)

US 866-237-8815 • INTL +1-650-440-4474 • [email protected]

Java 17 Backend Development: Design backend systems using Spring Boot, Docker, Kafka, Eureka, Redis, and Tomcat
From Everand
Java 17 Backend Development: Design backend systems using Spring Boot, Docker, Kafka, Eureka, Redis, and Tomcat
Elara Drevyn
No ratings yet
Creating A Powershell Script As Per A Scenario of Your Choice
No ratings yet
Creating A Powershell Script As Per A Scenario of Your Choice
3 pages
Apache Kafka
No ratings yet
Apache Kafka
9 pages
Installation First Map Icon Markers: Regular Icons: Default and Simple
No ratings yet
Installation First Map Icon Markers: Regular Icons: Default and Simple
1 page
ACM - E2E - Commingled - Storage and Load - Out
No ratings yet
ACM - E2E - Commingled - Storage and Load - Out
20 pages
Apache Kafka
No ratings yet
Apache Kafka
130 pages
Documentation
No ratings yet
Documentation
105 pages
Configuring Kafka For High Throughput
No ratings yet
Configuring Kafka For High Throughput
11 pages
Kafka Sparkstreaming
No ratings yet
Kafka Sparkstreaming
75 pages
Mastering Kafka Streams: From Basics to Expert Proficiency
From Everand
Mastering Kafka Streams: From Basics to Expert Proficiency
William Smith
No ratings yet
Mastering Apache Cassandra - Second Edition
From Everand
Mastering Apache Cassandra - Second Edition
Nishant Neeraj
No ratings yet
Practical OneOps
From Everand
Practical OneOps
Nilesh Nimkar
No ratings yet
A Visual Introduction To Apache Kafka PDF
No ratings yet
A Visual Introduction To Apache Kafka PDF
84 pages
Cloudurable Kafka Tutorial v1 PDF
No ratings yet
Cloudurable Kafka Tutorial v1 PDF
79 pages
Hbase PDF
No ratings yet
Hbase PDF
8 pages
Apache Kafka
No ratings yet
Apache Kafka
17 pages
SHIVA KUMARA - JavaArchitect
No ratings yet
SHIVA KUMARA - JavaArchitect
9 pages
System Design
No ratings yet
System Design
18 pages
Cassandra Quick Guide
No ratings yet
Cassandra Quick Guide
60 pages
Chapter 1 - Introduction To KAFKA: Objectives
No ratings yet
Chapter 1 - Introduction To KAFKA: Objectives
17 pages
Hibernate
No ratings yet
Hibernate
161 pages
Multithreading in Java (Unit 4)
100% (1)
Multithreading in Java (Unit 4)
19 pages
1 Apache Zookeeper
No ratings yet
1 Apache Zookeeper
7 pages
12 Microservices Design Patterns 1696645895
No ratings yet
12 Microservices Design Patterns 1696645895
14 pages
2 Kafka Eventstorming
No ratings yet
2 Kafka Eventstorming
104 pages
Hibernate Interview Question
No ratings yet
Hibernate Interview Question
119 pages
Spring Cloud
No ratings yet
Spring Cloud
44 pages
Java 8 Pluralsight
No ratings yet
Java 8 Pluralsight
2 pages
Soft Eng Interview Prep
No ratings yet
Soft Eng Interview Prep
96 pages
Design Patterns
No ratings yet
Design Patterns
13 pages
Kafka Cheat Sheets
No ratings yet
Kafka Cheat Sheets
1 page
Prepared By: Manoj Kumar Joshi & Vikas Sawhney
No ratings yet
Prepared By: Manoj Kumar Joshi & Vikas Sawhney
47 pages
Java Web Services Interview Questions and Answers: Overview: Integration Styles?
No ratings yet
Java Web Services Interview Questions and Answers: Overview: Integration Styles?
25 pages
Spring Cloud Sleuth
No ratings yet
Spring Cloud Sleuth
31 pages
Microservices Interview Questions
No ratings yet
Microservices Interview Questions
5 pages
Apache Kafka Description
No ratings yet
Apache Kafka Description
36 pages
Hibernate Interview Questions
No ratings yet
Hibernate Interview Questions
13 pages
Cassandra Interview Questions Answers
No ratings yet
Cassandra Interview Questions Answers
10 pages
Java and Caching: by Martin Nad
100% (1)
Java and Caching: by Martin Nad
20 pages
Top Answers To Kafka Interview Questions
No ratings yet
Top Answers To Kafka Interview Questions
3 pages
Hadoop Interview Questions - HDFS
No ratings yet
Hadoop Interview Questions - HDFS
19 pages
Spring Transactions
No ratings yet
Spring Transactions
22 pages
5+ Java - Interview
No ratings yet
5+ Java - Interview
5 pages
Hadoop Ecosystem PDF
No ratings yet
Hadoop Ecosystem PDF
55 pages
Microservices Architecture
No ratings yet
Microservices Architecture
60 pages
Zookeeper
100% (1)
Zookeeper
42 pages
Pattern Saga
No ratings yet
Pattern Saga
5 pages
Microservices
No ratings yet
Microservices
3 pages
Apache Kafka 101
No ratings yet
Apache Kafka 101
25 pages
New Features in JDK 8: Ivan St. Ivanov Dmitry Alexandrov Martin Toshev
No ratings yet
New Features in JDK 8: Ivan St. Ivanov Dmitry Alexandrov Martin Toshev
58 pages
Apache Kafka
No ratings yet
Apache Kafka
6 pages
Kafka Producer Internals: Find Answers On The Fly, or Master Something New. Subscribe Today
No ratings yet
Kafka Producer Internals: Find Answers On The Fly, or Master Something New. Subscribe Today
1 page
Java IO Tutorial
No ratings yet
Java IO Tutorial
60 pages
Micro Services
No ratings yet
Micro Services
92 pages
Docker - Part1
No ratings yet
Docker - Part1
3 pages
CQRS y Objectives
No ratings yet
CQRS y Objectives
5 pages
Microservice Patterns
No ratings yet
Microservice Patterns
8 pages
Spring ORM
No ratings yet
Spring ORM
55 pages
AWS-Storage Services V2
No ratings yet
AWS-Storage Services V2
25 pages
Spring Data JPA Notes
No ratings yet
Spring Data JPA Notes
12 pages
MicroService - Introduction
100% (1)
MicroService - Introduction
45 pages
Spring Boot Annotations
No ratings yet
Spring Boot Annotations
12 pages
Dzone Com Articles JVM Architecture Explained
No ratings yet
Dzone Com Articles JVM Architecture Explained
8 pages
Commercial Shipping Catalogue
No ratings yet
Commercial Shipping Catalogue
87 pages
Advanced .Net Course Contents
No ratings yet
Advanced .Net Course Contents
4 pages
TAS Srs
No ratings yet
TAS Srs
8 pages
OpenText Content Server CE 21.2 - Installation Guide English (LLESCOR210200-IGD-EN-01) 22
No ratings yet
OpenText Content Server CE 21.2 - Installation Guide English (LLESCOR210200-IGD-EN-01) 22
160 pages
DAO2ADO
No ratings yet
DAO2ADO
20 pages
How To Configure PPPoE On TP-LINK Modem (Orange Page)
No ratings yet
How To Configure PPPoE On TP-LINK Modem (Orange Page)
8 pages
Big Data Analytics: Seema Acharya Subhashini Chellappan
100% (1)
Big Data Analytics: Seema Acharya Subhashini Chellappan
47 pages
Question One: 40 Marks: Jomo Kenyatta University of Agriculture and Technology University Examinations 2016/2017
No ratings yet
Question One: 40 Marks: Jomo Kenyatta University of Agriculture and Technology University Examinations 2016/2017
3 pages
Insurance For Computers
No ratings yet
Insurance For Computers
6 pages
cm-ict-7-curriculum-map
No ratings yet
cm-ict-7-curriculum-map
8 pages
DBSM Report 3
No ratings yet
DBSM Report 3
21 pages
Howto Setup Mini ISP Using Mikrotik As PPPoE Server + DMASOFTLAB Radius Manager Scratch Card Billing System+ Linux Transparent
100% (1)
Howto Setup Mini ISP Using Mikrotik As PPPoE Server + DMASOFTLAB Radius Manager Scratch Card Billing System+ Linux Transparent
48 pages
CorrectedJuniperJN0 105ExamQuestions2024 PassGuarantee
No ratings yet
CorrectedJuniperJN0 105ExamQuestions2024 PassGuarantee
5 pages
Real-Time Campus University Bus Tracking Mobile Application: July 2018
No ratings yet
Real-Time Campus University Bus Tracking Mobile Application: July 2018
7 pages
PS SSAS UserGuide AlertAdvanced DSASL PSv1
No ratings yet
PS SSAS UserGuide AlertAdvanced DSASL PSv1
35 pages
Bandwidth Monitoring & Measurement (Tools and Services)
No ratings yet
Bandwidth Monitoring & Measurement (Tools and Services)
40 pages
Client 2024 01 14 Log
No ratings yet
Client 2024 01 14 Log
4 pages
Windows: Macos: CONTRASEÑA: WWW - Descarga.xyz
No ratings yet
Windows: Macos: CONTRASEÑA: WWW - Descarga.xyz
3 pages
Mahesh CV
No ratings yet
Mahesh CV
3 pages
Liferay Features Detailed
No ratings yet
Liferay Features Detailed
16 pages
CP311-4 Visual Lisp
No ratings yet
CP311-4 Visual Lisp
34 pages
IT Infra Weekly Report 13 Dec 24
No ratings yet
IT Infra Weekly Report 13 Dec 24
11 pages
Avaya Knowledge - AES Missing User Management Menu Item With Admin Account OAM and CRITICAL TSAPI tsviAuthenticateSession Failed Error Code 6.
No ratings yet
Avaya Knowledge - AES Missing User Management Menu Item With Admin Account OAM and CRITICAL TSAPI tsviAuthenticateSession Failed Error Code 6.
4 pages
BCA-141-20 BCA-CS-141-20 C Programming
No ratings yet
BCA-141-20 BCA-CS-141-20 C Programming
2 pages
Intertask Communication in Rtos
No ratings yet
Intertask Communication in Rtos
12 pages
Netbeans - Shortcuts PDF
No ratings yet
Netbeans - Shortcuts PDF
2 pages
1COR063
No ratings yet
1COR063
4 pages

Kafka and Mongodb

Uploaded by

Kafka and Mongodb

Uploaded by

A MongoDB White Paper

Data Streaming with Apache Kafka &

Kafka Use Cases 3

Implementation Recommendations for Kafka 5

Alternate Messaging Systems 6

A Diversion – Graphically Build Data Streams Using

Operationalizing the Data Lake with MongoDB 9

MongoDB As a Kafka Producer – Change Streams 10

MongoDB As a Kafka Consumer 11

MongoDB and Kafka in the Real World 11

Recent Additions to Kafka 12

Multiple consumers can be combined to form a consumer

• When the code for an application is updated, the

Website Activity T Trac

Implementation Kafka Limitations

Kafka is designed for high performance writes and Apac

Alternate Messaging Systems Apac

Installing Node-Red together with the node for MongoDB

sudo npm install -g node-red

Once running, everything else is setup by browsing

• Drag nodes onto the canvas.

• Double click to configure a node.

• Drag lines between nodes to connect them.

• Add snippets of JavaScript to manipulate the data as it

The flow shown in Figure 5 implements a weather

• The timestamp node triggers the flow every five

◦ The summary of the current weather is extracted by

◦ The Data Cleanup JavaScript node extracts the

The Javascript for Prepare Tweet is:

var weather = JSON.parse(msg.payload);

and for Data Cleanup:

var weather = JSON.parse(msg.payload);

• The distributed processing libraries can re-compute

Figure 6 presents a design pattern for integrating

• Data streams are ingested into one or more Kafka

• Distributed processing libraries such as Spark or

– Change Streams var myCursor = db.simples.aggregate(

The MongoDB drivers abstract away the use of the

var changeStream = collection.watch([

When implementing the consumer, you should copy the

This blog post shows a worked Java example of a

World as the default operational database across its

MongoDB Atlas is a database as a service for MongoDB,

MongoDB Enterprise Advanced is the best way to run

For more information, please visit mongodb.com or contact

Case Studies (mongodb.com/customers)

US 866-237-8815 • INTL +1-650-440-4474 • [email protected]

You might also like