Ep. 115 Exploring Kafka with Kris Jenkins and Rob Walters

This is a podcast episode titled, Ep. 115 Exploring Kafka with Kris Jenkins and Rob Walters. The summary for this episode is: Today on the show we're talking about streaming data and streaming applications with Apache Kafka. We're joined by Kris Jenkins, Developer Advocate from Confluent, and Rob Walters, Product Manager at MongoDB, who will discuss how you can leverage this technology to your benefit and use it in your applications. Kafka is traditionally used for building real time streaming data pipelines and real time streaming applications. It began its life in 2010 at LinkedIn and made its way to the public open-source space through a relationship with Apache, the Apache Foundation, in 2011. Since then, the use of Kafka has grown massively and it's estimated that approximately 30% of all Fortune 500 companies are already using Kafka in one way or another. A great example for why you might want to use Kafka would be perhaps capturing all of the user activity that happens on your website. As users visit your website, they're interacting with links on the page and scrolling up and down. This is potentially large volumes of data. You may want to store this to understand how users are interacting with your website in real time. Kafka will aid in this process by ingesting and storing all of this activity data while serving up reads for applications on the other side. 
Conversation highlights include:<ul><li>[03:38] What is Kafka?</li><li>[05:29] At the heart of every database</li><li>[08:03] The difference between Kafka and a database</li><li>[09:03] What Kafka's architecture looks like</li><li>[12:03] Kafka as a data backbone of system architecture</li><li>[14:06] MongoDB and Kafka working together</li><li>[15:40] What are "Topics" in Kafka?</li><li>[17:53] Change stream events</li><li>[19:58] Kafka's history</li><li>[22:07] MongoDB Connector, and Kafka via Confluent Cloud</li><li>[25:53] Popular use cases using Kafka and MongoDB</li><li>[27:48] Kafka and stream processing with games and event data</li><li>[29:13] KSQL and processing against the stream of data</li><li>[30:59] developer.confluent.io, a place to learn everything about Kafka</li></ul>

Key Takeaways

Transcript

Episode Overview

02:08 MIN

Kris Jenkins Introduction

00:15 MIN

Rob Walters Introduction

00:26 MIN

What is Kafka?

01:38 MIN

At the heart of every database

02:11 MIN

The difference between Kafka and a database

00:53 MIN

What Kafka's architecture looks like

00:54 MIN

Ordering guarantees

01:05 MIN

What adding data to the queue looks like from a metadata perspective

00:45 MIN

Kafka as a data backbone of system architecture

02:02 MIN

MongoDB and Kafka working together

01:32 MIN

What are topics

02:12 MIN

Change streams

01:03 MIN

Kafka history

02:00 MIN

MongoDB Connector, and Kafka via Confluent Cloud

03:45 MIN

Popular use cases using Kafka and MongoDB

01:52 MIN

Kafka and stream processing with games and event data

01:22 MIN

KSQL and processing against the stream of data

01:30 MIN

developer.confluent.io, a place to learn everything about Kafka

02:47 MIN

Michael Lynn: Hey, folks. Welcome to the show. My name is Michael Lynn, Developer Advocate at MongoDB and host of this, The MongoDB Podcast. Today on the show, we're talking about streaming data and streaming applications with Apache Kafka. Now, if you're not familiar, we're going to explain all about Kafka and we're going to talk to a couple of experts to help us understand how and why we might want to implement Kafka as part of our application stack. Now, Kafka is traditionally used for building real time streaming data pipelines and real time streaming applications. A data pipeline simply reliably processes and moves data from one system to another and a streaming application similarly is an application that consumes potentially large volumes of data. A great example for why you might want to use Kafka would be perhaps capturing all of the user activity that happens on your website. As users visit your website, they're interacting with links on the page and scrolling up and down. This is potentially large volumes of data. You may want to store this to understand how users are interacting with your website in real time. Now, Kafka will aid in this process by ingesting and storing all of this activity data while serving up reads for applications on the other side. Kafka began its life in 2010 at LinkedIn and made its way to the public open-source space through a relationship with Apache, the Apache Foundation, and that was in 2011. Since then, the use of Kafka has grown massively and it's estimated that approximately 30% of all Fortune 500 companies are already using Kafka in one way or another. Today on the show, Kris Jenkins, Developer Advocate from Confluent, and Rob Walters, Product Manager at MongoDB. We're going to talk all about streaming data, streaming applications, Kafka and how you can leverage this technology to your benefit and use it in your applications. Stay tuned. Well, welcome to the show. Today, we have Kris Jenkins of Confluent.
Welcome, Kris, it's great to have you on the show. Do you want to introduce yourself to the audience?

Kris Jenkins: Yeah, absolutely. Nice to be on the show. My name's Kris Jenkins, I'm a Developer Advocate. I've been a developer advocate for a year which means I'm still figuring out what that means. But I spend my days building things with Kafka, talking about Kafka, talking to people who are using it and just figuring out what this world is.

Michael Lynn: And we also have Rob Walters from MongoDB. And Rob, your title has changed. Are you in product management?

Rob Walters: Yes. Yeah, I'm Rob Walters. I'm a Product Manager responsible for our Kafka and Spark connectors. And so my mission really is to make it easy for our users to leverage MongoDB within their Apache Kafka environment. So I look at the connector from a features and functionality perspective, I take feedback. I also build a lot of tutorials, blog posts and webinars and so forth really to add value and make it easier for developers to use these two technologies together.

Michael Lynn: Well welcome to the show both of you. Thank you so much for taking the time to talk with me. I'm wondering if someone might want to take a stab at explaining what Kafka is for the beginner. Kris, maybe you want to take a swing?

Kris Jenkins: Yeah, that sounds like a question I should be able to field. So Kafka is an event system, right? It's a system for recording, processing and sending on events. So if you're used to event-driven architectures, that's probably going to be a fairly obvious idea. It's the persistence layer of an event-driven architecture. If you're not, probably the easiest way to think about it is to start by thinking about a queue and get more sophisticated in the thinking from there. So something happens, you want to record it, save it to disk and process it later somewhere else. Now multiply that by a million things happening a second, they're streaming in like a fire hose, like a floodgate. First job is to persist them, second job is to process them, third job is to send them to their destination. And the third one's particularly interesting because you might want to send it to multiple destinations, right? So to give you a concrete example, let's say you've got... Let's pick something fun. You've got telemetry data coming in from a massive online game, right? People are playing the game, you've got 100,000 users playing your game, recording scores and login events and logout events and all kinds of different data just streaming in like a fire hose from all these people. And you want to save it, you know that perhaps your billing team needs a subset of that data to be processed quickly. You know that the analytics team wants to process the same set of data in a different way. And you end up dealing with the fire hose of events coming from the real world in real time in different, interesting ways for different departments. That is exactly where Kafka shines.

Michael Lynn: Now in my naivete, I'm not really super familiar with Kafka. I've used it in the past very briefly for a couple of different projects. But when I first approached streaming and queuing, my initial thought was around this appears to be filling a gap in the database technology space. Why can't databases simply handle the volume of data that we're trying to deal with by implementing streams?

Kris Jenkins: Well, there are two things to that. One is you think about how databases deal with the underlying storage of events at a high scale, right? One of the first things a traditional database will do before it's done any work with the data that's coming in is it will write it to a transaction log, partly as its replication mechanism, partly as its recovery mechanism, right? The first thing a database wants to do is store what's actually happened to its replication log and then deal with it. And that's terrifically efficient because replication logs, essentially being an append-only file, are incredibly cheap to write to. So at the heart of every database, especially if it's replicated, you've probably got an append-only log serving as the master record of what actually happened before it was processed. So take that piece out, make it the center of your database, make it the center of your data solution let's say and you've got Kafka. And then the question is what do you build on top of that high throughput, high reliability, cheap to write to, cheap to replicate log?

Rob Walters: And I think one of the other things that's interesting about Kafka is the fact that it doesn't really know what data's in it, it's just binary data. So you can take data and move it really fast from a partition perspective, because it doesn't have to know that it's JSON or it's text or a string or something like that. And so that's another valuable piece of architecture, I think, that Kafka has built in.

Kris Jenkins: And you can build more intelligence on the data on top of that. But fundamentally, the base is a binary key value store that's exceptional at a never ending stream of events.

Michael Lynn: Interesting. So as the events occur, I'm sending them to the queue and they're available for last in, first out, or first in, first out, or-

Kris Jenkins: First in, first out. Yeah.

Michael Lynn: Okay.

Kris Jenkins: You would process them in the order they arrive in subject to some sharding.

Michael Lynn: Is there any index capability? And I guess the question I have is what are the similarities and differences between a database and a stream in Kafka?

Kris Jenkins: That's a good question. I think the starting point is it starts to look more like a traditional database when you start thinking about state, right? So I have a stream of customer orders that come in let's say, and I record every order that comes in. And then eventually, I say, "Hey, I'd like to know how much people have bought by country." So I'd like a process that looks at every new order coming in, looks at the country it's come from and adds that to a group by country total, right? And it's when you start doing things that roll up a stateful running total or running state, running balance perhaps, that's when Kafka starts to look more like a traditional database.
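
As a rough illustration of the running total by country that Kris describes, here is a minimal Python sketch. It uses a plain in-memory list in place of a Kafka topic, and the field names `country` and `amount` are assumptions made for the example, not anything from the episode.

```python
from collections import defaultdict


def running_totals_by_country(orders):
    """Yield a snapshot of the per-country totals after each incoming order."""
    totals = defaultdict(float)  # the "state" that is rolled up over the stream
    for order in orders:
        totals[order["country"]] += order["amount"]
        yield dict(totals)


# Hypothetical orders standing in for events arriving on a stream.
orders = [
    {"country": "UK", "amount": 10.0},
    {"country": "US", "amount": 5.0},
    {"country": "UK", "amount": 2.5},
]

snapshots = list(running_totals_by_country(orders))
print(snapshots[-1])
```

Each new order updates the state instead of triggering a fresh query over all the data, which is the point at which a stream starts to behave like a database table.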

Michael Lynn: Okay, that makes sense. And from a technology perspective, Kafka is open-source. What does the architecture look like?

Kris Jenkins: We've actually got a really great course on that by one of the guys that started the original code base. Under the hood, think of an append-only file as your logical starting point, and then two slightly more sophisticated things are going on on top of that. The first is sharding. So you could partition by key. So let's say all your keys are a UUID for your users. You might partition into 16 partitions by the first character of their UUID and that would be an obvious benefit for scaling, right? You've horizontally sharded by the first character of a hex string. Now, you've got potentially 16 times as much capacity and you've lost ordering guarantees over the whole thing, but you've kept ordering guarantees by user key which is usually what matters.

Michael Lynn: Explain to me what ordering guarantees are.

Kris Jenkins: So all this fire hose flood of data's coming in, you need to think a bit about the meaningful order of events, right? So if you recorded all those events just as they come over the network and stuck them on an append-only log, then you would have a total global ordering, but you would be limited to one log. You might want to partition that out, right, and say, "I need a bit more capacity. I'm going to partition that out, not by all my users, that's too large, let's partition those by the first hex character." So that gives me 16 partitions. And so you will lose total global ordering, but you'll still have ordering within users, right? Each user will have all their events coming in order, which is probably fine. In most use cases, you're more interested in, say, the order by user than the global total order. So you can partition that way. And you've got some options around how you partition to keep the ordering that matters whilst not tying your hands on scalability.
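
The partitioning scheme Kris outlines can be sketched in a few lines of Python. This is an illustration only: it models each partition as an in-memory list, the two user IDs are made-up UUIDs, and it uses the "first hex character" rule from the conversation rather than Kafka's own default partitioner, which hashes the key bytes.

```python
NUM_PARTITIONS = 16

# Hypothetical user keys, chosen so they land in different partitions.
ALICE = "a3bb189e-8bf9-3888-9912-ace4e6543002"
BOB = "1b4e28ba-2fa1-11d2-883f-0016d3cca427"


def partition_for(user_id: str) -> int:
    """Map a UUID string to one of 16 partitions via its first hex character."""
    return int(user_id[0], 16)


# One append-only list per partition.
partitions = [[] for _ in range(NUM_PARTITIONS)]

# Events arrive interleaved across users, in global arrival order.
incoming = [
    (ALICE, "login"),
    (BOB, "login"),
    (ALICE, "score:10"),
    (BOB, "score:7"),
    (ALICE, "logout"),
]
for user_id, event in incoming:
    partitions[partition_for(user_id)].append((user_id, event))


def events_for(user_id: str) -> list:
    """Read one user's events back from the single partition that holds them."""
    return [e for uid, e in partitions[partition_for(user_id)] if uid == user_id]


print(events_for(ALICE))
print(events_for(BOB))
```

Once the events are split up, nothing records whether Bob's login came before or after Alice's score, but each user's own events still read back in the order they arrived.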

Michael Lynn: So that sounds a lot like the way that MongoDB scales in terms of partitioning and sharding. So that's similar in terms of the architecture. I'm curious about when I add an item to the queue, what does that look like from a metadata perspective? There's going to be the data associated with whatever event is, but can I add additional elements of data in a key value pair?

Kris Jenkins: Certainly, yeah. Firstly, the value can be any structured data you like. Obviously at the base level, it's just binary, so you can have more structure in there. But then there are header fields where you can add interesting things like this value is actually a password or one of the sub-fields within it is a password. So we need to track that for security purposes for instance and you might put that in the metadata field. And then you've got partition level metadata which is how replicated do we want this thing to be? Are we replicating every event that's recorded to three machines for resiliency for instance?

Michael Lynn: Okay. So what are the applications that... I mean obviously, if something is a high volume of throughput, but let's talk about the use cases where people will want to start considering a queuing system like Kafka.

Kris Jenkins: I'm going to pick you up on that. Queuing system isn't entirely fair. It's a great use case, but what it is is an event system. And if I can just dwell on the distinction for a second, I think traditionally, you might use a queuing system for stuff where you want to capture it, wait until it's ready to process, process it and throw it away.

Michael Lynn: Okay.

Kris Jenkins: Often, events in a queue are transient. They are used up when they're consumed. Whereas in Kafka, you would probably keep the events around for much longer, long after they've been processed, perhaps indefinitely, for two reasons. One is you might want to process them again in a different way. And the second reason is that events are usually quite cheap to store and expensive to acquire. Your user interactions, they only happen once. Why throw that data away now that disk is so cheap? So you often capture the events in perpetuity, perhaps for the future use case you haven't thought of.
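
To make the queue-versus-log distinction concrete, here is a small Python sketch. The `EventLog` and `Consumer` classes are hypothetical names invented for this example and are not part of any Kafka client library. The idea they model is that reading never deletes anything, and each consumer simply remembers its own position in the log.

```python
class EventLog:
    """Append-only log. Reading never removes events."""

    def __init__(self):
        self._events = []

    def append(self, event) -> int:
        self._events.append(event)
        return len(self._events) - 1  # offset of the event just written

    def read_from(self, offset: int) -> list:
        return self._events[offset:]


class Consumer:
    """Tracks its own offset, independently of every other consumer."""

    def __init__(self, log: EventLog):
        self.log = log
        self.offset = 0

    def poll(self) -> list:
        batch = self.log.read_from(self.offset)
        self.offset += len(batch)
        return batch

    def rewind(self, offset: int = 0) -> None:
        self.offset = offset  # reprocess from an earlier point


log = EventLog()
for event in ["login", "score:10", "logout"]:
    log.append(event)

billing = Consumer(log)
analytics = Consumer(log)

first_pass = billing.poll()       # billing sees every event
analytics_pass = analytics.poll() # analytics sees the same events
nothing_new = billing.poll()      # billing is caught up, so this is empty
billing.rewind()
replay = billing.poll()           # the same events, processed again
print(first_pass, analytics_pass, nothing_new, replay)
```

In a traditional queue the second consumer would find the events already gone. Here both departments read the full history, and either can replay it later.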

Rob Walters: In another use case from a MongoDB perspective, we tend to focus a lot on single view applications or mainframe offloading as key use cases where MongoDB shines, and Kafka is sometimes used in these architectures because it's a great way to aggregate data from heterogeneous data sources, and if it flows through Kafka at really high rates, then you can do transforms and then put that into MongoDB as a single view in that case.

Kris Jenkins: Yeah, that's a very common use case that you would use Kafka as a way of funneling different data sources into a unified stream of events and then funneling out perhaps into Mongo. And maybe you're funneling into Mongo for operational purposes and an analytics database for the marketing team for instance, and then you've got Kafka as the data backbone of that architecture.

Michael Lynn: So speaking of MongoDB, Rob, maybe talk a little bit about how Kafka and MongoDB work together.

Rob Walters: Sure. So there are a couple different services as part of Kafka and Kafka Connect is one service. And that service's job really is to connect these heterogeneous data sources and have them interact with Kafka. So rather than having an application written that connects to a Kafka topic directly and then worrying about failover errors and that sort of thing, Kafka Connect's job really is to handle all that. So as a connector, we provide a connector that talks directly with the Kafka Connect API. And so we will set up MongoDB or the user can set up MongoDB either as a source, so it's taking data, reading data in from MongoDB and writing it into a Kafka topic, or as a sink, which takes data from a Kafka topic and writes it into MongoDB. When it's used as a source, what we're effectively doing under the covers is we're creating a change stream on that database or collection, whatever the user specifies. So as data is inserted, updated, deleted from that collection, those events are being created as change stream events which are funneling through to a Kafka topic. So that's a common use case there. And then of course as a sink, you can subscribe to a topic and write that out to a collection. So Kafka Connect is a way to do that, and there's hundreds or maybe even thousands of connectors that are available online to download and use.
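
For readers who want to see what the source and sink roles look like in practice, here is a hedged sketch of two connector definitions, built as Python dictionaries and printed as the JSON that a self-managed Kafka Connect worker accepts. The connector class names and property keys follow the MongoDB Kafka Connector documentation as best I recall it, so check them against the version you run. The connection URI, the database and collection names, the `demo` prefix and the connector names are all placeholders made up for this example.

```python
import json

MONGO_URI = "mongodb://localhost:27017"  # placeholder, not a real deployment

# Source: watch shop.orders in MongoDB and publish changes to a Kafka topic.
source_connector = {
    "name": "mongo-orders-source",  # hypothetical connector name
    "config": {
        "connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
        "connection.uri": MONGO_URI,
        "database": "shop",
        "collection": "orders",
        "topic.prefix": "demo",
    },
}

# Sink: read a Kafka topic and write each record into a MongoDB collection.
# The topic name assumes the source publishes to <prefix>.<database>.<collection>.
sink_connector = {
    "name": "mongo-orders-sink",  # hypothetical connector name
    "config": {
        "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
        "connection.uri": MONGO_URI,
        "topics": "demo.shop.orders",
        "database": "shop_copy",
        "collection": "orders",
    },
}

source_payload = json.dumps(source_connector, indent=2)
sink_payload = json.dumps(sink_connector, indent=2)
print(source_payload)
print(sink_payload)
```

On a self-managed cluster, each payload would typically be sent to the Kafka Connect REST API to create the connector. Managed offerings expose the same settings through their own interfaces.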

Michael Lynn: Great. And you mentioned a couple of key terms there I think from the MongoDB side as well as from the Kafka side, you're using the term "topic" in Kafka. Kris, tell folks what a topic is.

Kris Jenkins: Right. So a topic, it's that logical stream of events, it's all the key value pairs for the thing in question. So user orders might be a topic. But then if you want to get the whole architecture picture in your head, beneath that you might split the topic actually out into 16 physical files. Above it, you might add some type information. So rather than just being binary key values, you might say the key is a string and the value is a JSON blob with this format. And we would use the term "stream" for that. The thing above the raw binary topic that has types is a stream.

Michael Lynn: Okay. So we've got streams, topics and then on the-

Kris Jenkins: I wish I had a diagram I could give you there. Hopefully you can visualize that. Rob, go.

Rob Walters: Yeah. One thing I just wanted to point out now that we're talking about strings and some data types and things, I mean as you know, MongoDB, we work in JSON documents and before we were talking about Kafka with respect to how it stores data or how it looks at data which is just binary. And so this bridging of a JSON document versus binary, how is that done? And that's actually defined through what are called converters. And so by default, I think it's the string converter if you don't specify one, where basically it just takes whatever JSON document you have, converts it into a string and pushes that onto the Kafka topic. But you can use other converters like JSON or Avro or Protobuf and so forth. And so that's how the documents move and how the sink knows that what came from the source really is a JSON document. So that's just a side note there as we're talking about strings and so forth.

Kris Jenkins: You might want to do something like, I don't know, you want to take some user data out of Mongo and give it to the marketing department, but you don't want to give them their password or the name or their age, you just want to give them some change data, right? So you might use their connector there to pull out a subset of things and just stick those into Kafka.

Michael Lynn: And Rob, before we move on, you mentioned change streams, that's a critical component in the MongoDB architecture. Do you want to explain that for folks that might not be aware?

Rob Walters: Yeah. So change streams is MongoDB's implementation of a change data capture type of functionality. So it opens a cursor on whatever the user specifies through the pipeline. So that could be all database changes or a specific collection or a filtered collection. So if you only want events where the inventory was greater than a hundred for example, you want to trigger on that certain event, that will produce a change stream event. And inside that event, you'll have the document that was changed. You'll have a whole bunch of other metadata such as the time it was changed, the operation, whether it was an insert, a delete, or an update and so forth. And so that's a feature that came out I believe in MongoDB 3.4 that has been iterated on. And we have some exciting stuff coming out in 6.0 that will be announced at MongoDB World around change streams as well. So stay tuned for that.
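
A filtered change stream like the one Rob mentions might look roughly like this in Python. This is a sketch under assumptions: the helper names `inventory_pipeline`, `describe` and `watch_inventory` are invented for the example, the `inventory` field is taken from Rob's scenario, and `collection` is expected to be a PyMongo collection on a replica set or Atlas cluster. Nothing here connects to a database until `watch_inventory` is actually called.

```python
def inventory_pipeline(threshold: int) -> list:
    """Aggregation pipeline that keeps only changes whose resulting
    document has an inventory above the threshold."""
    return [{"$match": {"fullDocument.inventory": {"$gt": threshold}}}]


def describe(change: dict) -> dict:
    """Pull out the parts of a change event discussed in the episode."""
    return {
        "operation": change["operationType"],    # insert, update, delete, ...
        "time": change.get("clusterTime"),       # when the change happened
        "document": change.get("fullDocument"),  # the changed document
    }


def watch_inventory(collection, threshold: int = 100):
    """Yield a summary for every matching change on `collection`.

    `collection` is assumed to be a pymongo Collection. Asking for
    full_document="updateLookup" makes update events carry the whole
    document, so the $match on fullDocument can see the inventory field.
    """
    pipeline = inventory_pipeline(threshold)
    with collection.watch(pipeline, full_document="updateLookup") as stream:
        for change in stream:
            yield describe(change)


# Usage, assuming a reachable deployment (left commented so nothing connects):
# from pymongo import MongoClient
# products = MongoClient("mongodb://localhost:27017")["shop"]["products"]
# for summary in watch_inventory(products):
#     print(summary)
```

Delete events carry no full document, so this particular filter silently drops them. A real pipeline would need an extra condition if deletes matter.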

Michael Lynn: Yeah, that's coming up right around the corner. Speaking of MongoDB World, it's coming June 7th through the 9th to New York City. It's in the Javits Center. And if you're interested in joining us, tickets are on sale now. You can use the code "PODCAST," P-O-D-C-A-S-T, to get 25% off of your tickets. And if you use that code, you can get some additional podcast swag as you arrive at the conference. So once again, June 7th through the 9th at the Javits Center, use the code "PODCAST" at mongodb.com/world. Rob, did you have something else?

Rob Walters: Yes. Yeah. I just had a question for Kris. So in a Kafka architecture, one of the things that Kafka has, for those that really start digging down, deploying it and so forth, is ZooKeeper. And I think one of the most exciting things that's happening is the removal of ZooKeeper. And I was just wondering if Kris could explain to our listeners a little bit about that and what change is being made.

Kris Jenkins: Yeah, absolutely. As long as we understand we're all developer friends here, you're not going to ask me to commit to a timeline.

Rob Walters: Yeah.

Kris Jenkins: Okay. So quick bit of history. So Kafka comes out around I think about 2011 or 2012, something like that. And back then, the right choice for distributed consensus if you didn't want to build it from scratch was ZooKeeper. It's what everyone used at the time, it's a good product. So that's why historically for the things like leader election, Kafka has ZooKeeper helping to manage failover and partition leaders and that kind of thing. But that is an extra moving piece, it's an extra thing to deploy and maintain and care for. It would be nice if you didn't need that in the system. So a lot of work has been going into Kafka recently to get rid of ZooKeeper and have its own fault tolerant leader election mechanism built into Kafka. It uses the Raft protocol. If you fancy a bit of computer science, you can go and look up the Raft paper. And it's called KRaft, our implementation, because it's Kafka Raft, great name. And a lot of work's been going on for that. There was a really good talk about it at Kafka Summit London '22, which is coming up online on YouTube soon if you want to find out about it. But I'll give you the precis which is it's pretty much ready, it's gone into beta. You can use it now but we say don't use it in production. And the reason for that mostly is there are some management tools around it that are still being built. So you could use it, but you'll probably not have the best time until the whole maintenance picture has been built out. Sometime soon, you'll find that Kafka uses KRaft natively. There's no ZooKeeper. And it uses Kafka topics to manage the metadata for leader election, which as a final aside, I found quite satisfying when I realized that when they needed a new system for managing the events of who's the leader, of course they use Kafka and it's dogfooding. I always find it satisfying when a tool uses itself, it's a mark of confidence.

Rob Walters: Yeah. And it's very exciting. I'm looking forward to being able to deploy without ZooKeeper, it just cleans up my Docker Compose.

Kris Jenkins: Yeah.

Michael Lynn: So you mentioned how MongoDB and Kafka work together. Both of these software products are available. They're open, you can download them and install them on a server and run them on your own. But MongoDB has launched a pretty successful service in the cloud. It's called MongoDB Atlas. I'm curious, Kris, Confluent. Talk to us about what Confluent the company does and how does it relate to the Kafka project from Apache?

Kris Jenkins: Well, similarly. So Confluent was long ago the company spinoff to the open-source project. A few of the co-founders of the Apache Kafka project said there's going to be need for support, paid maintenance, all those things you'd expect from an open-source project that's being seriously used in enterprise. And so that's how Confluent were founded. And as the internet and internet businesses have matured, they've moved to a cloud-based version of Kafka called Confluent Cloud. Probably similar aims to Atlas, it's a great tool, but it's even better when someone else manages it for you. And you've got a team of absolute native experts dealing with the infrastructure. That's what Confluent Cloud does. I'm tempted to give people the podcast code they can sign up with because turnabout's fair play, but I'll avoid it.

Michael Lynn: Oh, please do.

Kris Jenkins: Okay. If you go to confluent.cloud and sign up with the code "PODCAST100," because we have our own podcast, you can get $100 of free credit. And it's managed Kafka in the cloud. We've got plenty of the original and ongoing committers to Kafka working on it. We regularly give back code to the open-source project and just work on it, pay for people to work on it full-time. Occasionally, we do things like we realize one of our customers is going to need a new feature. We build that feature. And then once it's ready and matured, we start pushing it back to the open-source project. So we are independent, definitely connected, definitely trying to make the open-source project thrive and support the people that need commercial support.

Rob Walters: Yeah. And one thing I just wanted to add to the Confluent Cloud is that we as MongoDB have a great partnership with Confluent and we've been working with our development team on the connectors in the Confluent Cloud. So the MongoDB Atlas source and sink is our connector that is hosted in the Confluent Cloud. So it makes it really easy to stand up all that Kafka Connect stuff and Kafka topics, you don't have to worry about any of that stuff. ZooKeeper, and hopefully not ZooKeeper anymore, but just with a click of a button, you have direct connection to an Atlas cluster. You can read data from it and then put it in the Confluent Cloud and go from there. So that's something that we came out with about a year ago or so. We were one of the first connectors in the Confluent Cloud, and so we've just been iterating on that. And I think they're getting the version 1.7 which is our latest release of the connector.

Michael Lynn: Rob, where does that configuration take place to configure the connector? Is that from the Kafka Cloud side?

Rob Walters: The Confluent Cloud, yeah. Yeah, I think they have a command line, is it ccloud or something, that you can use to dynamically or programmatically create these connections. But yeah, so basically almost every option is in there. Obviously with Confluent Cloud, you can't upload your own JAR file. So for those that are really familiar with connectors, if you wanted to create your own post processor or write strategy in MongoDB, you would write Java code, compile it into a JAR file and then upload that. It's not possible in the Confluent Cloud. So some of the more advanced cases you might have to just do some research on, but for the 80% case for sure, you can use the Confluent Cloud.

Michael Lynn: So we talked about some of the use cases that are popular, where people start to think about Kafka and MongoDB together. Rob, what else are we missing? What other use cases are very popular?

Rob Walters: Yeah. We see a lot of customers using the combination of Kafka and MongoDB for time series use cases like IoT. And most recently, we added some features in our MongoDB Kafka Connector to make it really easy, literally checking a box or defining a configuration parameter, to send that data directly to a time series collection in MongoDB. So time series collections are a new collection type that we introduced in MongoDB 5.0 that really optimizes the storage of IoT data, or time series data for that matter, in MongoDB. And to give you an example, when we originally created MongoDB, it was a very denormalized way of dealing with data. So one document doesn't really represent one data point, one document might represent a customer for example with subdocuments and arrays and so forth. And what we found in the IoT use case is that a lot of customers would create one document per data point because they were used to the relational world. So here's a temperature sensor, I'm going to create a new document. This is a timestamp, my temperature is whatever, 20 degrees centigrade or something. But that from a MongoDB standpoint is not the best way to store data because you have an index entry on every single document. And so when you have very large IoT data, obviously that would arbitrarily increase your index size. So we created time series collections to really optimally store that data in a columnar format, under the covers. So now, you can still store one document per data point, but under the covers, we're storing that optimally and we're getting really incredible savings from storage and performance and so forth. And so all that happens for you, all you have to do is just say, "Yeah, store it to a time series collection," and then the connector takes care of that for you. So long winded answer to your short question.

Michael Lynn: Okay. So time series and IoT specific use case is popular. Kris, anything we're missing in that space? Any additional use cases that you wanted to mention?

Kris Jenkins: I think one thing that's really nice that you can do on top of Kafka is stream processing, it's worth spending some time with. So if we take my favorite example of the month like games, just because they're high volume and have lots of different use cases, right? So say you've got all that event data coming in from people's games that I mentioned earlier, and you would like to put some statistics in your Mongo database, right? So that's the central thing that we present to the user. Kafka's dealing with the fire hose of game data. And what you'd really like to do is just roll up that fire hose and just say, "Oh, the user played five games in the past hour." And then once that's rolled up, spit that data out into Mongo. That would be a perfect use case where you've got Kafka, a bit of stream processing in Kafka to summarize or massage the data and then going across a connector into Mongo. And oh, it's worth adding into that you've got a few options for how to do that stream processing. So you could write it as a Java process or we've got KSQL which is a SQL-like interface for that data processing. So those are pretty good options for how to define the transformations that you need to shape.
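
The "games played in the past hour" rollup can be sketched in plain Python. This is not KSQL or Kafka Streams, just an illustration of the idea. It assumes each event carries a `user` name and a `ts` timestamp in seconds (both made-up field names), and it uses fixed one-hour buckets rather than a sliding "past hour".

```python
from collections import Counter

WINDOW_SECONDS = 3600  # one-hour tumbling windows


def games_per_user_per_hour(events) -> Counter:
    """Count game events per (user, hour window) from a stream of events."""
    counts = Counter()
    for event in events:
        # Round the timestamp down to the start of its one-hour window.
        window_start = event["ts"] - event["ts"] % WINDOW_SECONDS
        counts[(event["user"], window_start)] += 1
    return counts


# Hypothetical game events standing in for the fire hose.
events = [
    {"user": "ann", "ts": 10},
    {"user": "ann", "ts": 1800},
    {"user": "bob", "ts": 3599},
    {"user": "ann", "ts": 3600},
]

counts = games_per_user_per_hour(events)

# The summarized rows are what would travel across a sink connector.
rollups = [
    {"user": user, "window_start": start, "games": games}
    for (user, start), games in sorted(counts.items())
]
print(rollups)
```

Four raw events shrink to three summary rows here. At game scale the reduction is far larger, which is why the rollup happens in the stream before anything is written to the database.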

Michael Lynn: That's interesting. So the KSQL, is that processing against the actual stream of data?

Kris Jenkins: Yeah. Yeah, and the interesting thing about it is if you're used to SQL in a relational database, you normally issue a query and it rolls over what data is currently in the database, gives you an answer and stops. Whereas in a stream processing database, you might run an insert select statement that will insert all the existing rows that it's selected, but then it keeps living in the background looking for new rows, new events coming into the database and processes those later. So it almost becomes like your SQL statement becomes a living transformation of new events.

Michael Lynn: Almost like a trigger.

Kris Jenkins: Yeah, kind of. You can see it like a trigger, except you can just define it in straight SQL. There's less programming and more manipulation, I would say.

Rob Walters: Yeah. And with regard to programming, that brings up another point. So we've had these single message transforms or SMTs for a while in Kafka, but there's only a limited amount of them. And if you want to do anything special, you write your own Java code into a JAR file and upload it and all that. But I think KSQL makes transformations a lot easier in that case. So you don't have to go through that. Plus SMTs from a performance perspective are not fantastic. So I think the adoption of KSQL you'll see is probably increased in these cases.

Kris Jenkins: Yeah, and any SQL expert will tell you it's amazing what you can get done with an SQL query.

Michael Lynn: Indeed. Well this has been a great discussion. I want to make sure we get folks that are interested after hearing this information, get them some information about where they can go. Kris, where do folks go to get more information about Confluent Cloud, about Kafka?

Kris Jenkins: The number one place is a thing we've been working on a lot in our department, which is developer.confluent.io, which is our site to teach you everything we can about Kafka. So it's got everything from getting started guides if you're brand new to the idea, it's got videos on how it's built on the internals if you want to dive really deep, it's got architectural recipes. So you can learn how to start thinking in event streaming models, which is often, I find, the most interesting thing. The tools and techniques we grasp pretty quickly. But thinking in a different architecture is an incredibly useful brain exercise, and it's great to have some content that hand holds you through that new way of thinking. So I'm particularly proud of those bits. And also we have a podcast for which I am the host. If you search for the Confluent Podcast on your podcast app of choice, I will happily fill you in with more details about how it's being used in the real world. And with a bit of luck, we'll have someone from Mongo on our show soon.

Michael Lynn: Fantastic. I think we can work on that. Rob?

Rob Walters: Yeah, definitely. And one thing I wanted to mention, since we're talking about next steps and calls to action and so forth, is I will be at MongoDB World under the Builders Fest giving a Kafka demonstration as well as a tutorial. So if you bring your laptop or whatever, you can spin up a Docker image that we'll have there and you can actually go through and use our connector. But if you can't make it, these tutorials are going to be online by World in our MongoDB Kafka documentation. So today, there's two tutorials there, but we're actually adding a whole bunch more, more around the introductory. So creating a source, creating a sink, some of the more basic operations just to get you up to speed on using these two technologies together.

Michael Lynn: And is there a documentation source you want to send folks to read more about the connectors?

Rob Walters: Yeah. You can start by just using MongoDB Kafka documentation, if you just put a Google search for that or Bing or whatever your search engine is, that will bring you to our MongoDB documentation section. And there'll be tutorials on the left hand side that you can go through and start working away. And like I said, right around MongoDB World, we're going to be updating that. We're going to be adding a whole bunch of more introductory ones. So if you're hearing this podcast after June, you'll see them in there.

Michael Lynn: Awesome. Kris, Rob, thank you both so much for joining us. Is there anything else you'd like to share, Kris, before we wrap up?

Kris Jenkins: Oh, God, I work with a lot of Californians, so sharing too much can be dangerous. I'm just going to repeat: developer.confluent.io will tell you a lot of what you need to know about Kafka.

Michael Lynn: Thank you, Kris. Rob, anything else?

Rob Walters: Nope. Just check out the tutorial documentation. If you're coming to MongoDB World, definitely meet us on Builders Fest Day and then other than that, we'll look forward to seeing everybody there.

Kris Jenkins: Thanks once again.

Michael Lynn: Thanks once again to Kris Jenkins, Confluent, make sure you check out Kris's podcast. It's called the Streaming Audio Podcast from Confluent, talks all about Kafka. Learn more. You can continue your journey of discovery around streaming data with the Confluent podcast. And as Rob mentioned, you can check out more information about MongoDB's Kafka Connector by visiting the MongoDB Kafka documentation. There's links in the show notes to all of these resources. I hope you enjoyed the show. If you did, I'd love to get a review from you. If you could visit Apple Podcasts, give us a five star review. Let us know what you thought. Leave a comment. Thanks everybody. Have a great day.

DESCRIPTION

Today on the show, we're talking about streaming data and streaming applications with Apache Kafka. Now, if you're not familiar, we're going to explain all about Kafka and we're going to talk to a couple of experts to help us understand how and why we might want to implement Kafka as part of our application stack.

Kafka is traditionally used for building real time streaming data pipelines and real time streaming applications. A data pipeline simply reliably processes and moves data from one system to another and a streaming application similarly is an application that consumes potentially large volumes of data. A great example for why you might want to use Kafka would be perhaps capturing all of the user activity that happens on your website. As users visit your website, they're interacting with links on the page and scrolling up and down. This is potentially large volumes of data. You may want to store this to understand how users are interacting with your website in real time. Kafka will aid in this process by ingesting and storing all of this activity data while serving up reads for applications on the other side.

Kafka began its life in 2010 at LinkedIn and made its way to the public open-source space through a relationship with Apache, the Apache Foundation, and that was in 2011. Since then, the use of Kafka has grown massively and it's estimated that approximately 30% of all Fortune 500 companies are already using Kafka in one way or another.

Today on the show, we're joined by Kris Jenkins, Developer Advocate from Confluent, and Rob Walters, Product Manager at MongoDB. We're going to talk all about streaming data, streaming applications, Kafka and how you can leverage this technology to your benefit and use it in your applications.

Today's Host

Shane McAllister

Lead, Developer Advocacy

Today's Guests

Rob Walters

Senior Product Manager, Connector and Things at MongoDB


Kris Jenkins

Senior Developer Advocate at Confluent

In 2003 I co-founded BullionVault, an online service for trading gold and silver bullion. From a technical standpoint BullionVault is a stock exchange dedicated to precious metals. The Java-based system included the realtime exchange itself, plus trade settlement, double-entry accounting, and account-management facilities. After 10 years as CTO I decided to move on and return to being a full-time developer. Since then I've been a contractor for a wide variety of companies in the startup and/or finance space, ranging from fledgling retail platforms to blockchain companies and large banks. I've been told by many people that I'm good at explaining technology to non-technical people; at pulling a great product out of a loosely-formed requirement; and building software to a high degree of polish. Outside of contracting I have run several regular coding events in London, with a mix of hackathons and educational events. I am a regular speaker at technical conferences and an occasional writer of technical articles and open-source software.
