Ep. 176 Revolutionizing Data Transformation at Dataworkz with Nikhil Smotra
Michael Lin: Welcome everybody. My name is Michael Lin and I'm a developer advocate at MongoDB. We're live from MongoDB.local 2023, right here in New York City. We've got a full day of coverage, live interviews all day, and I'm so delighted to have you join us. We're live on LinkedIn Live and YouTube Live throughout the entire day, and we're going to be sitting down for a full roster of insightful interviews with distinguished partners, developers, and visionary founders. I'm so delighted to welcome my first guest for the day. His name is Nikhil Smotra. He's the Chief Technology Officer and co-founder of Dataworkz. Nikhil, welcome to the show. It's great to have you on the podcast.
Nikhil Smotra: Thank you, Michael. [inaudible]
Michael Lin: Are you feeling the energy?
Nikhil Smotra: Oh, yeah. It is an exciting show you guys have put up.
Michael Lin: It really is.
Nikhil Smotra: Excited to be a part of the show.
Michael Lin: Fantastic. So you are the co-founder and Chief Technology Officer of Dataworkz. Tell the folks a little bit about Dataworkz.
Nikhil Smotra: Yeah, so Dataworkz is all about Customer 360. Building a robust Customer 360 view is crucial for the success of any business, but building such a view is difficult. We at Dataworkz are on a mission to simplify the creation of Customer 360, to enable organizations to tailor their products, services, marketing and sales efforts to meet their customers' demands and expectations.
Michael Lin: 360 degrees of customers and their data. This is a mission that's been in place for quite some time for many companies, but what is Dataworkz doing differently?
Nikhil Smotra: So at Dataworkz, we've taken an all-in-one approach: combining data wherever it is, running complex transformations on it, and applying AI functions, all on a single platform. The product is designed to be used by both technical and non-technical users. Technical users can get into the weeds and write their own SQL, and non-technical users can use the abstraction, and we use a bunch of massively parallel processing systems behind the scenes. The way we are different is we are actually trying to redefine the whole problem of creating a single view of your customers using large language models. What we've done differently is we've started with the tougher problem of describing data transformation in plain English: we leverage a large language model to understand the semantics, convert that plain English into code, say for Spark, and execute it in the cloud.
Michael Lin: Oh wow. So now large language models and AI, it's all the rage lately. You must've been ahead of the curve. When did Dataworkz start working on these challenges?
Nikhil Smotra: We actually have been working on these challenges for the last two years, and in the last seven or eight months, LLMs have been the rage, and we thought that we could actually go ahead and use LLMs to make the whole process of building a Customer 360 seamless and intelligent.
Michael Lin: So the challenge, and I wrestled with this challenge through the '90s into the 2000s, the challenge around a Customer 360 is it's rarely one stream of data from one source. There are many different systems producing data. How does Dataworkz simplify that whole entire process?
Nikhil Smotra: Okay, so if you are trying to put together a Customer 360 and you want to have full ownership of the data, there is a very, very high probability that your data integration project has become a software integration nightmare. The reason I say that is because it all starts with getting data from the source systems into a data lake or a data warehouse. Once the data is there, you want to hire a bunch of data engineers or use a data preparation tool to transform the data, hire a data scientist, and then use a reverse ETL tool to push the data back to the systems which are used by the business. None of the business users are going to come and say, "Hey, I'm looking at your CSV data," and be happy with that. They actually want the data to be pushed back to Salesforce or whichever systems they use on a regular basis. By the time you get any outcomes, you've already spent six months and half a million dollars. So we looked at the problem and we said, Customer 360 is a difficult problem, but there has to be an easy fix to it. That is where Dataworkz comes in. We've actually combined data transformations and AI into a single experience. In a nutshell, I would say we have an easy button for creating a Customer 360. It should not take people months. It should be done in minutes at best, or in a couple of hours. So we actually believe in that mantra and that's why we built Dataworkz, and we are leveraging large language models to make that whole experience for our customers seamless and more intelligent.
Michael Lin: Fantastic. I love the concept of an easy button for Customer 360. And for me it's about eliminating or reducing the amount of technical debt that you incur as you build the solution. So partnering with a company like Dataworkz sounds like that's the right approach.
Nikhil Smotra: Yeah, that is true.
Michael Lin: So you're here at the MongoDB.local conference. Let's talk about how MongoDB and Dataworkz are working together.
Nikhil Smotra: I'll actually tie that to a use case. So let's assume that you have a business that has its customer data sitting in some MongoDB collection and they want to go ahead and gain transformative insights on the data using large language models. There are three key components of the architecture they would need to accomplish that task. The first one would be pre-processing the data to get it into a shape or format where it can be fed to the large language model. This is where the conversational data preparation from Dataworkz comes into play: you just write plain English to define data transformation tasks and we do all the heavy lifting for you. The second part of the architecture is creating chunks and embeddings. Before the data can be fed to a large language model, you have to break up the data into chunks and store it. We believe that this is something which is done seamlessly by Dataworkz, and we are actually using MongoDB as an operational store to store the chunks. When it comes to embeddings, we are actually using a vector database like Pinecone. The third and the most important piece is privately hosted LLMs: Dataworkz provides a service to host a commercially available large language model like Dolly. One of the reasons why this is a really important thing is because many times customers are really worried about the security and the privacy of their proprietary data, and they do not want to use public LLMs like OpenAI's.
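As an aside on the "chunks" step described above, here is a minimal sketch in Python of splitting text into overlapping chunks and wrapping each one in a document that could be kept in an operational store. It is an illustration only, not Dataworkz's implementation: the function names, the `chunk_size` and `overlap` defaults, and the document fields (`source_id`, `chunk_index`, `text`) are all assumptions made for this example.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap slightly.

    The overlap keeps some shared context between neighbouring chunks.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


def to_chunk_documents(source_id: str, text: str, **kwargs) -> list[dict]:
    """Wrap each chunk in a plain dict (hypothetical field names)."""
    return [
        {"source_id": source_id, "chunk_index": i, "text": chunk}
        for i, chunk in enumerate(chunk_text(text, **kwargs))
    ]


docs = to_chunk_documents("customer-42-notes", "a" * 450)
for doc in docs:
    print(doc["chunk_index"], len(doc["text"]))
```

In a setup like the one described, dictionaries of this shape could be written to a MongoDB collection, with an embedding for each chunk stored separately in a vector database; those calls are left out here to keep the sketch self-contained.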
Michael Lin: Yeah, I see that as a major accomplishment. Having the ability to essentially build your own large language model and enrich that with data specific to your own use case. Yeah. Was that the plan from the very beginning?
Nikhil Smotra: No, I think we've evolved as we've seen how we can actually improve the experience for our customers. With Dataworkz, we can actually go ahead and leverage or harness the power of LLMs on customer data or any data sitting in a MongoDB collection in a matter of minutes and without writing a single line of code.
Michael Lin: So that I understand this, when a customer comes to you and leverages the Dataworkz platform, there's a custom instantiation of a large language model. Is it custom to their specific set of data?
Nikhil Smotra: Yes. So we offer privately hosted LLMs, which have not been customized, and we can go ahead and customize an LLM to meet your specific requirements. If you're trying to build an intelligent question-answering system and you want to train the model using your own specific data, we can actually get that up and running in a couple of hours. If you want to go ahead and use a publicly available LLM, either for summarization or for question answering, we can actually host that privately for you. So we give our customers all the options, and we go with whatever they are comfortable with.
Michael Lin: I want to talk about the surface area for the developer within Dataworkz. I understand it's a flexible platform, but I'm curious about alignment in the LLM space and adding points along the journey for human evaluation of the context data. Is that something that Dataworkz enables?
Nikhil Smotra: We do that, and when we are training LLMs, or as we call it, fine-tuning LLMs, we do not just go ahead and feed it the labeled data and then expect that everything will work fine. We go ahead and make sure we validate whether the LLM is working for the use cases we've trained it for, and then go from there.
Michael Lin: Yeah, fantastic. If you're just joining us on the stream, we're live from the MongoDB.local conference here in the Javits Center in New York City. We'll be having a day-long series of interviews with founders, members of the community, visionaries in the data space, and Nikhil, you're one of those. Talk a little bit about your background.
Nikhil Smotra: So my background is primarily data engineering and analytics. Prior to Dataworkz, I was head of data engineering for one of the largest call centers in the world, and prior to that I was working for Lockheed Martin research and development, building lots of crazy stuff related to data, which I'm not supposed to talk about.
Michael Lin: Okay. We won't mention that. So it's all about the data.
Nikhil Smotra: It's all about the data and how do you manage it and how do you let the business use it without getting into the complexities of setting up an AI stack.
Michael Lin: Yeah, yeah. So data, and we're serving the software developer community here at MongoDB.local. Did you spend time as a developer?
Nikhil Smotra: Oh, yes.
Michael Lin: Yeah. And what kind of things did you build as a developer?
Nikhil Smotra: I was building applications, I was building products, so I have been a full stack developer all my life. I've used MongoDB and I love what MongoDB has to offer, especially when it comes to building a Customer 360. One of the biggest challenges is how do you go ahead and combine data from different sources and different formats? The schemas are evolving. It's extremely difficult with the typical relational model. That is where the JSON data model and the flexible schema from MongoDB shine, because with a flexible schema, you do not really have to worry about some changes in the source system breaking either your Customer 360 or any of the downstream applications which are dependent on a Customer 360. A great case for this: for most of our customers so far, one of the key sources for building a Customer 360 is Salesforce. And with Salesforce, we've seen with many customers, there are new fields which are being added to the Salesforce instance. Marketing doesn't know what's happening. They're running a new campaign and they go in and make a request to the administrators: please add these new fields, delete some other fields. There's a regime change, they want to change how Salesforce was working. So all these things, they actually can break your Customer 360 view. But with a flexible JSON schema like MongoDB's, it is one of the most ideal destinations for building a Customer 360.
Michael Lin: And I'm just thinking about some of the challenges that I faced as a developer solving the Customer 360 challenge. You often in a large enterprise have many organizations solving very similar problems across the customer space. Does Dataworkz have an impact, I would think, from the large language model, from the AI perspective, rationalizing similar customer fields across organizations?
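To make the flexible-schema point concrete, here is a small Python sketch of a customer view assembled as a JSON-style document, where each source keeps its own fields under its own key. This is only an illustration under assumed names (`merge_customer_view`, the `sources` layout, and the sample fields are hypothetical), not how Dataworkz or any particular MongoDB application models the data.

```python
import json


def merge_customer_view(customer_id: str, sources: dict[str, dict]) -> dict:
    """Build one document per customer, nesting each source's record as-is.

    Because nothing here names individual source fields, a field added or
    removed upstream simply flows through without a code or schema change.
    """
    view = {"customer_id": customer_id, "sources": {}}
    for source_name, record in sources.items():
        view["sources"][source_name] = dict(record)  # shallow copy
    return view


# First version of the Salesforce record.
v1 = merge_customer_view(
    "c-001",
    {"salesforce": {"name": "Acme", "tier": "gold"},
     "support": {"open_tickets": 2}},
)

# Later, marketing adds a campaign field and drops "tier".
v2 = merge_customer_view(
    "c-001",
    {"salesforce": {"name": "Acme", "campaign_code": "SPRING"},
     "support": {"open_tickets": 2}},
)

print(json.dumps(v2, indent=2))
```

Both `v1` and `v2` are valid documents of the same collection-style shape, which is the property the flexible schema provides; a fixed relational table would need a migration for the same change.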
Nikhil Smotra: Oh, definitely. And the funny thing is, Michael, that you asked this question, because we are leveraging large language models to do the classification and figure out similar data sets. Before large language models became the rage, we actually had built our own models to do that, but now we are leveraging a large language model. So enterprises have lots of data, which is owned by different departments within the organization, and one of the biggest challenges is how do you identify when different people are working on solving the same problem, and how do you actually cut down on that spending? So within Dataworkz, we actually believe that collaboration is a key piece, and there are two pieces to collaboration. The first one is visibility, and the second one is transparency. So let me talk about visibility. What do we mean by visibility? Within Dataworkz, we have a concept called activity streams. As you are working with data, activity streams help other people on the teams understand what is happening. Let's assume that the marketing department is going ahead and creating a new dataset. A notification would be sent to all the stakeholders that somebody has created this new dataset, and you can actually start looking at whether this is useful for you or not. As somebody is changing the schema, an alert would be sent to all the stakeholders. As new fields are being added within Salesforce, we can actually go ahead and automatically detect those things, change the schema accordingly, and send out an alert to all the stakeholders. On the transparency side, with Dataworkz, lineage is something which is available out of the box, and what that does is it actually creates a shared understanding of the origins of the data and what steps were taken to create that data. It gives all the producers as well as the consumers of the data visibility into what's happening. Combining these two features together, we are able to reduce the friction between the data consumers and the data producers.
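The schema-change alerts described above can be pictured with a short Python sketch: compare the fields already known for a dataset with the fields on an incoming record, and turn the differences into notification messages. This is a simplified illustration, not the Dataworkz activity stream; the function names and message wording are assumptions made for the example.

```python
def detect_schema_changes(known_fields: set[str], record: dict) -> dict[str, list[str]]:
    """Return the fields that appeared or disappeared relative to known_fields."""
    incoming = set(record)
    return {
        "added": sorted(incoming - known_fields),
        "removed": sorted(known_fields - incoming),
    }


def build_alerts(dataset: str, changes: dict[str, list[str]]) -> list[str]:
    """Format one human-readable alert per changed field (hypothetical wording)."""
    alerts = [f"{dataset}: new field '{name}' detected" for name in changes["added"]]
    alerts += [f"{dataset}: field '{name}' no longer present" for name in changes["removed"]]
    return alerts


known = {"name", "email", "tier"}
incoming_record = {"name": "Acme", "email": "ops@acme.test", "campaign_code": "SPRING"}

changes = detect_schema_changes(known, incoming_record)
for alert in build_alerts("salesforce_accounts", changes):
    print(alert)
```

A real pipeline would also have to decide who counts as a stakeholder and how to deliver the message; the sketch only covers the detection step.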
Michael Lin: Fantastic. Really powerful set of capabilities. Well, I'm so glad you joined me today. We've got a day-long list of interviews with visionaries like Nikhil, so stay tuned. I understand you were upstairs talking with the startups. There was a startups talk earlier. Tell me a little bit about that.
Nikhil Smotra: Oh, yeah, so I was excited to meet Mark, and he was talking about all the things you should not be doing when you are a startup. There's lots of activity happening in the large language model and generative AI space, and we also spoke about how Dataworkz would help MongoDB customers unleash the power of their proprietary data by using large language models, building them quickly without ever worrying about the data leaving the confines of their secure environment.
Michael Lin: What else do people need to know about Dataworkz?
Nikhil Smotra: I'll just talk about some of the key features which are available with Dataworkz. The first one is the conversational data preparation. As I said, we actually expect our customers to go ahead and define data transformation problems in plain English. We leverage a large language model to understand the semantics and convert those semantics into appropriate code for a massively parallel processing system, and then you view the results and create complex pipelines. Another feature available within Dataworkz is data flows, aka AI-enabled workflows. It's a series of steps that you can arrange or combine to create really complex data transformations, right from creating embeddings for vectors to going ahead and creating really complicated JSON schemas. These data flows can be shared and reused once they're built, and if somebody else on a team wants to extend them, they can actually extend them. So what we are seeing with our customers is that they are increasingly using these AI-enabled data flows to create complex data pipelines without writing a single line of code. Recently one of our customers moved their 300-step pipeline from Alteryx to Dataworkz, and we were able to cut down the time to run the complicated data pipeline from 7 days to approximately 115 minutes.
Michael Lin: That's incredible.
Nikhil Smotra: Yeah. Another feature is privately hosted LLMs for MongoDB data, and this gives us the capability to go ahead and build question-and-answer systems, intelligent summarizations, and intelligent insights on MongoDB without worrying about the data leaving the customer environment. And last but not least is how we are using large language models to go ahead and classify the data as we work with data, and also generate descriptions of it so that everybody within the organization is aware of what the data means. So we are an all-in-one solution for data transformation and AI, and we actually have inbuilt capabilities for data discovery, cataloging, lineage, and monitoring if you have anomalies at the data level. We are not talking about anomalies at the infrastructure level because [inaudible]-
Michael Lin: In the activity stream, yeah. No?
Nikhil Smotra: Not just the activity stream. Let's assume you're actually sending us data. Every day we get 10 million rows, and one day we actually get, say, 9.5 million rows. We have an inbuilt, patent-pending anomaly detection system, so we would be able to recognize all those things and surface them pretty quickly, so that you're not being reactive, you know, finding out after 10 days that, hey, there's a problem with the data when the execs are seeing the dashboards.
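The row-count scenario above (roughly 10 million rows a day, then 9.5 million) can be illustrated with a simple z-score check in Python. To be clear, this is a generic textbook technique swapped in for illustration; the Dataworkz anomaly detection system is described as patent pending and its method is not given in this conversation. The function name, the threshold of 3 standard deviations, and the sample history are assumptions.

```python
from statistics import mean, stdev


def is_row_count_anomaly(history: list[int], today: int, z_threshold: float = 3.0) -> bool:
    """Flag today's row count if it sits far from the historical mean.

    Distance is measured in sample standard deviations (a z-score).
    """
    if len(history) < 2:
        raise ValueError("need at least two historical counts")
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly stable history: any deviation at all is unusual.
        return today != baseline
    return abs(today - baseline) / spread > z_threshold


# Hypothetical daily row counts hovering around 10 million.
history = [10_000_000, 10_020_000, 9_990_000, 10_010_000, 9_980_000]

print(is_row_count_anomaly(history, 9_500_000))   # a day with 9.5 million rows
print(is_row_count_anomaly(history, 10_005_000))  # a typical day
```

Checking each day's load as it arrives is what allows a drop like this to be surfaced immediately rather than discovered later on a dashboard.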
Michael Lin: Well, I want to tell folks if they want to learn more about Dataworkz, there's a link there, MDB.link/dataworkz. Nikhil, it's been great chatting with you. Thanks so much for coming.
Nikhil Smotra: You guys have put up a great show. We just came aboard as a partner recently, and we are super excited to work with you guys and, you know, get this technology into the hands of numerous MongoDB customers.
Michael Lin: Yeah. Fantastic. Well, thanks once again.
DESCRIPTION
This week on the MongoDB Podcast, we interview Nikhil Smotra at .local Live NYC. Nikhil explains how Dataworkz combines AI, data integration, and transformation to simplify customer profiling. He also discusses how Dataworkz pairs MongoDB's flexible data model with large language models for seamless data handling.