Ep. 112 Dr. Sasha Fedorova of MongoDB Labs
Michael Lynn: Welcome to the show. My name is Michael Lynn. I'll be your host today, and this is the MongoDB podcast. Today on the show: Dr. Sasha Fedorova. She's a professor in the Electrical and Computer Engineering Department at UBC, the University of British Columbia. She's got a Ph.D. from Harvard and she spends time working with MongoDB Labs, focusing her research on how MongoDB and WiredTiger, the storage engine, can better use newer types of hardware. We discuss that today in detail, and we go into a discussion of non-volatile memory and some newer types of storage. Stay tuned for that. Hey, did you know that MongoDB World is June 7th through the 9th in New York City? Tickets are still on sale and you can get 25% off using the code podcast, P-O-D-C-A-S-T. Visit mongodb.com/world to get your tickets. Don't miss this one. Ray Kurzweil, author, inventor, entrepreneur, and futurist, is going to be a featured keynote speaker on Thursday, June 9th. Visit mongodb.com/world. Don't forget, use the code podcast to get 25% off as well as some podcast swag. Dr. Sasha Fedorova, welcome to the show. It's great to have you on the podcast. Maybe start by telling folks who you are and what you do.
Dr. Sasha Fedorova: I'm a professor at UBC and my work focuses on systems. So what do we mean by systems in my context, because it's a very overloaded word? Basically, it's the software that sits between the hardware and the applications, and makes it easier for applications to use that hardware, which comes in all shapes and sizes, in one common way, while also leveraging its performance to the best extent possible, hopefully. So, that's what I do. That's what my research focuses on. And I teach classes, obviously, around the same topics. Now, why am I consulting at MongoDB? How did I get here? That started quite a while ago, around 2013, when I went on my first sabbatical. I was looking for something interesting to do. Like it often happens in life, somehow the information gets out and this opportunity just landed in my lap. I knew some people at WiredTiger, which was not part of MongoDB yet at the time. And so, I joined them and I started working on various small projects on improving performance, just looking for opportunities to make this or that faster, and I can talk about details later. Then when WiredTiger got acquired by MongoDB, thankfully I got acquired with them and MongoDB decided that they still wanted to keep me. And so, I'm still here. My role has expanded a little bit. I still do some performance hacking, but more recently I also started looking at adoption of new technologies. For example, if there is a new storage technology like Optane non-volatile memory, which came out in 2019, can we take advantage of it in the storage engine to make it more efficient, more performant, easier to use? At that point, my role turned from somebody who just reads up and learns about the technologies to an actual engineer who sits down and prototypes the software that's needed to integrate this technology into the storage engine, and then measures performance, evaluates it, looks at pros and cons, and writes about it.
Michael Lynn: Fascinating. I want to get deeper into the work you've done with newer technologies. But before we go there, I'm curious how you got interested in systems. This is your area of specialization. How did you get interested in systems?
Dr. Sasha Fedorova: I don't really have a good answer to that. I got into computer science by accident. I was studying economics in college and that was in the late 90s. The dot com boom was about to happen or was happening. I just figured I need to learn something about computers just to be on the same page with what's happening in the world. I took a computer science class and I just totally fell in love with it. I was just so fascinated and excited about the process itself because it felt so real that you could sit down and write a program, and it works and it does something. I just found this fascinating, specifically the aspect of programming. Then, so I decided that I had to make this my major even though it was already junior year, very late to be deciding on your major. But I stayed up all night planning out course schedule to make that happen, and I was able to make that happen. Then one thing led to another. The first company that I went to work for didn't have really sort of exciting and challenging opportunities for young people. And so, I said, okay, I'm going to go to grad school. This is where the tough stuff happens. Sure enough, I got more of it than I was maybe bargaining for. Then, why systems? It's hard to tell. It's like explaining to somebody why do you like one thing and not the other. I don't know. I just found the idea of hacking complicated systems just very personally compelling for the reasons that I can't explain, and that's why.
Michael Lynn: I think people in tech in general have a natural curiosity. I think that tends to draw you to certain topics and certain areas. There's an interesting Venn overlay between economics and computers, technology and performance specifically. Would you agree?
Dr. Sasha Fedorova: Yeah, there certainly is. There are many ways in which they can be explored in computer science. For example, one of my good friends is exploring game-theoretic approaches, which certainly have their place in computer science and in algorithms. My work didn't go in that direction. That being said, being a consultant at the company sort of forces me to answer questions that ultimately have to do with economics. For example, take the non-volatile memory that I recently explored to see how we can use it in WiredTiger. One of the questions that I asked is: is it economically more attractive relative to other alternatives, say, relative to using the existing memory that we have?
Michael Lynn: For the listeners that may not be familiar with the technology, can you explain what non-volatile memory is?
Dr. Sasha Fedorova: Nonvolatile memory is basically a physical storage fabric where you can write your data and it retains the data, even if your system crashes or if you power it off.
Michael Lynn: So, like a hard disk?
Dr. Sasha Fedorova: Hard disk refers to how you can package the storage fabric. Inside the hard disk, you have basically a bunch of chips. If you open up an SSD, not the old-style hard drive that has a bunch of spinning platters, but an SSD, all you have is a bunch of chips inside. And so, inside these chips is a particular storage fabric, and the SSDs that are prevalent these days have flash memory. The non-volatile memory product from Intel and Micron, which is called Optane and was released in 2019, is a new storage fabric. It's a new type of persistent memory that doesn't lose your data if your system crashes or if you power it off. You can actually package it two ways. You can take those memory chips and put them inside a regular SSD. And so, Intel has this SSD product with their Optane memory. But you can also package this memory as DIMMs, dual inline memory modules, that you can then put into the same slot where you put your regular DRAM. So you can take that persistent storage fabric and have it as an SSD, or have it just like regular memory that sits next to DRAM. Essentially you can have your storage device sitting right next to DRAM, and you can use it in two ways. You can use it as storage as you would use a regular disk, or you can use it as memory. There are various interesting aspects around these use cases. So, the first question that you ask about this technology is: okay, it's a non-volatile DIMM that sits right next to your DRAM, but how does it perform? Is it as fast as DRAM? The answer here is very interesting. Look at latency first. Latency is, if I need to read a hundred bytes of data from memory, how long does it take if I'm reading from regular DRAM versus NVRAM, non-volatile RAM? The difference is about a factor of two or three. Reading a hundred bytes is only about two to three times slower.
And that's really, really good because if you are reading your data off of an SSD, then that takes at least an order of magnitude longer than with DRAM. So, that's latency, which matters if you only need a small amount of data. What about throughput? Throughput is if you want to basically suck in large volumes of data continuously, and here we're talking about not how many milliseconds or nanoseconds it takes, but how many bytes per second I can read or write. If you're trying to read your data sequentially, which is usually the easiest use case for all kinds of storage technologies, sequential access is usually the fastest. Then, from one non-volatile DIMM, you're going to get a throughput of about six gigabytes per second. If you have six DIMMs, it's going to be about 36 gigabytes per second. That's very comparable to regular DRAM. With DRAM, it depends how many channels you have from your CPU to memory. For example, if you have one channel, you're able to get a throughput of 19 gigabytes per second with your DRAM. The number of channels you have in the system depends on how expensive your system is. If you want to have eight channels, it's going to be a very expensive system. If you have two, it's going to be not very expensive. So the read throughput is comparable, but the write throughput of Optane NVRAM is its Achilles heel. It is quite low. On the system that I have, a single NVDIMM, non-volatile DIMM, gives me less than a gigabyte per second. Other researchers have measured around one gigabyte per second per single DIMM, probably with a different memory version. By comparison, if you take the Optane SSD, so the same non-volatile memory but in an SSD, you get about two gigabytes per second of write throughput.
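Editor's note: the throughput arithmetic quoted above can be written out as a small sketch. The per-DIMM and per-channel figures are the approximate numbers from the conversation; the assumption that throughput scales linearly with the number of DIMMs or channels, and the function name, are illustrative only and not a measurement.

```python
# Approximate figures quoted in the interview, in gigabytes per second.
NVDIMM_SEQ_READ_GBPS = 6   # sequential read from one Optane DIMM
DRAM_CHANNEL_GBPS = 19     # one DRAM channel


def aggregate_gbps(per_unit_gbps, units):
    """Ideal aggregate throughput, ASSUMING perfect linear scaling."""
    return per_unit_gbps * units


# Six non-volatile DIMMs versus two DRAM channels (a less expensive system).
print("6 NVDIMMs:", aggregate_gbps(NVDIMM_SEQ_READ_GBPS, 6), "GB/s")
print("2 DRAM channels:", aggregate_gbps(DRAM_CHANNEL_GBPS, 2), "GB/s")
```

Real systems rarely scale perfectly, so these values are best read as upper bounds for a rough comparison.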
Michael Lynn: Why the difference?
Dr. Sasha Fedorova: My guess is that there's just more parallelism inside the SSD because it's a separate card that sits on a separate bus and you can just put more chips in there. So that would be my guess, but I could be wrong about that.
Michael Lynn: Do you see this technology being adopted by some of the more popular cloud providers like AWS and Azure?
Dr. Sasha Fedorova: Not yet. When this technology was announced in 2019, Google announced that they would offer it as part of their SAP HANA offering, which is a large in-memory analytics engine. But I have not seen specific offerings since. Perhaps part of the reason is that people don't yet know where exactly it fits in a storage stack and what's the right way to use it. And so, part of the reason why I wanted to do the experiment that I did with non-volatile memory in the storage engine is to answer these questions: How do you use it? What is it useful for? Is it worth it in terms of performance and economics, the cost?
Michael Lynn: What types of changes need to take place in the storage engine subsystem in order to take advantage of this technology?
Dr. Sasha Fedorova: I can tell you about a specific use case that I explored. But first, in general, let's talk about how you can use this non-volatile memory. I said earlier that you can use it either as storage or as memory. You can put it into your system and then the operating system can make it look just like a regular disk drive. With that, you don't have to make any changes to use it. You use it just like a regular device.
Michael Lynn: I want to pause. So when you're preparing your system in order for it to appear as a hard drive, obviously you need to format it and create a file system that lives on that. You would do that in the same conventional way?
Dr. Sasha Fedorova: Yeah.
Michael Lynn: Okay, gotcha.
Dr. Sasha Fedorova: Well, there are a couple of extra commands you type before that, but then it looks just like a regular block device and you can put a file system on it and use it the way that you would use a regular SSD. So, you can do that. It's not the use case that I explored in detail, but in my opinion, it is a bit questionable in terms of economics because you basically get a very small and very expensive SSD with a write throughput that's not very good.
Michael Lynn: Although better than conventional, than most SSDs? Or no, it wouldn't be, would it?
Dr. Sasha Fedorova: Than most, right. Better than the conventional NAND flash SSDs, yes, absolutely. My guess is that probably this will not be the dominant use case of this technology. Now, the other opportunity is to use it as memory. Here we have two options. The first option is that you can use this as memory, but keeping in mind that it's persistent. You write your data structure to memory, but then if you crash and then you reboot, potentially your data structure is still there. That is a very challenging use case because the applications are not designed to deal with memory that survives crashes. That's really hard. That's a whole can of worms. I mean, imagine you started writing a change to your data and it involves writing 50 bytes to effect this change. Let's say you wrote 25 bytes and then your system crashed. It comes back up and you have your data with only 25 bytes written and the other 25 containing garbage, and your application doesn't know how to interpret this data. It's not designed to deal with corrupt data. Because in the past, if your system crashed, none of it would survive. So, that use case is actually quite challenging. It requires lots of new software infrastructure to make it work, and there's a lot of research going on in the community. Now, there is another way to use NVRAM as memory and that is to completely ignore the fact that it's persistent. So just use it as volatile memory. Now, why the heck would you want to do this? The reason is that NVRAM is denser and cheaper than regular DRAM. What do I mean by denser? It means that I can put more total memory into my physical system, my physical server. If I have a two-CPU system, a two-socket system, I can put up to 12 terabytes of NVRAM. Or up to six terabytes. I think it's up to six terabytes. So basically, it's six DIMMs per CPU and up to 512 gigs in a DIMM, so I think it's basically six terabytes.
So, I can put six terabytes of NVRAM packed into my system, and putting that much DRAM into a single system is very, very expensive, and maybe not even possible because the individual DIMMs are just not that dense. So I can pack in more bytes, and this memory will be cheaper per byte than DRAM.
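Editor's note: the torn-write scenario described above (a 50-byte change where only 25 bytes land before the crash) can be illustrated with a small simulation. This is a sketch only: a `bytearray` stands in for the persistent region, and the CRC32 header used to detect the torn record is an assumption added for illustration, not something described in the conversation or taken from WiredTiger.

```python
import zlib

RECORD_SIZE = 50   # bytes in one logical update, as in the example above
HEADER_SIZE = 4    # hypothetical CRC32 header stored in front of the payload


def make_record(payload):
    """Prefix the payload with its CRC32 so a reader can validate it later."""
    assert len(payload) == RECORD_SIZE
    return zlib.crc32(payload).to_bytes(HEADER_SIZE, "little") + payload


def write_with_crash(region, data, crash_after):
    """Copy data into region, stopping after crash_after bytes (simulated crash)."""
    n = min(crash_after, len(data))
    region[:n] = data[:n]


def is_valid(region):
    """Return True if the stored checksum matches the stored payload."""
    stored = int.from_bytes(region[:HEADER_SIZE], "little")
    payload = bytes(region[HEADER_SIZE:HEADER_SIZE + RECORD_SIZE])
    return stored == zlib.crc32(payload)


# The "persistent" region initially holds a complete old record.
region = bytearray(make_record(b"A" * RECORD_SIZE))

# Start overwriting it with a new record, but crash halfway through the payload.
write_with_crash(region, make_record(b"B" * RECORD_SIZE), HEADER_SIZE + 25)

# After "reboot", the region mixes 25 new bytes with 25 stale bytes.
print("record valid after crash:", is_valid(region))
```

Detecting the torn record is only half of the problem; recovering a consistent state needs logging or similar machinery, which is the extra software infrastructure mentioned above.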
Michael Lynn: Why is it cheaper? Is it just constructed differently and uses cheaper materials?
Dr. Sasha Fedorova: Yeah, it's just a different technology.
Michael Lynn: I want to touch on the application. Now, it would seem to me, obviously in the database space, that more memory is better, especially when you've got a large working set for your data. I'm wondering if this would've been a solution for MMAP, or rather for a memory-mapped database, where you have this amount of storage or amount of memory. I'm wondering if that would've been a solution as opposed to rewriting the memory map structures.
Dr. Sasha Fedorova: Absolutely. I mean, if you have an in-memory storage engine and you want to address customers who have a larger working set, instead of telling them to go get a different storage engine, you could say, "Well, hey, just get more memory and it's cheaper now." So this is one use case. But even for a system that is not necessarily in-memory only, having more memory is better because the access latency is just smaller than having to go to disk. And so, imagine that you have a budget that you can spend on extra memory and you have a choice. Do I buy more DRAM or do I buy more NVRAM? For that same money, you can get more NVRAM, about three times more NVRAM than you can get DRAM. It depends on the price, on your discount and your vendors, but roughly speaking, that's the ballpark. But then, your performance will also be a bit slower. Maybe not always, but in most cases it will be. So the question that you're asking is, okay, is it worthwhile for me, given my fixed dollar budget, and memory is the most expensive component in data centers. That's what I hear from people who run data centers. So given my fixed budget, do I get more DRAM or do I get more NVRAM? This is why the use case is: let's forget about persistence and all that complicated software that's needed. Let's just treat it as volatile, because then you're writing the software just like you would write it for DRAM and you don't need anything extra. Is it worthwhile to use it in that way? That's exactly the use case that I explored in MongoDB and the prototype that I built.
Michael Lynn: Now, when can we expect to see larger scale adoption of this type of technology? I mean, even with just with MongoDB and WiredTiger, have we seen changes made to the WiredTiger storage engine in support of this technology?
Dr. Sasha Fedorova: Yes. As part of my work, I implemented the changes to WiredTiger to use NVRAM. What I built was an NVRAM cache. It allows WiredTiger to allocate a chunk of space in NVRAM. Then, when it reads a block of data from disk, it also puts this block of data into the NVRAM cache so that later, if it needs to read this block again, it checks whether the block is in the NVRAM cache. If it is there, it gets it from the cache and it doesn't need to pay the latency of going to disk. That's the extension to WiredTiger that I built. It has been merged into the develop branch, so it's there and whoever wants to use it can use it. Whether or not it'll get adopted, my guess is as good as anybody else's. My guess is that if there are customers with a specific need for more memory and a limited budget, and they adopt it and they like it, and enough people do, then we could see big cloud providers making this available. Whether or not it'll happen, I can't predict the future.
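Editor's note: the read path described above (check the cache first, fall back to disk on a miss, then keep a copy in the cache) can be sketched roughly as follows. This is a toy model, not WiredTiger code: the class and parameter names are hypothetical, a Python dict stands in for the NVRAM region, and eviction is left out.

```python
class BlockCache:
    """Toy read-through block cache; a dict stands in for the NVRAM region."""

    def __init__(self, read_from_disk):
        self._read_from_disk = read_from_disk  # slow path, supplied by the caller
        self._blocks = {}                      # block id -> cached bytes
        self.hits = 0
        self.misses = 0

    def read(self, block_id):
        # Fast path: the block was cached by an earlier read.
        if block_id in self._blocks:
            self.hits += 1
            return self._blocks[block_id]
        # Slow path: pay the disk latency once, then keep a copy in the cache.
        self.misses += 1
        data = self._read_from_disk(block_id)
        self._blocks[block_id] = data
        return data


disk_reads = []


def fake_disk_read(block_id):
    """Stand-in for a real disk read; records which blocks were fetched."""
    disk_reads.append(block_id)
    return b"block-%d" % block_id


cache = BlockCache(fake_disk_read)
cache.read(7)   # miss: goes to "disk"
cache.read(7)   # hit: served from the cache
print("disk reads:", disk_reads, "hits:", cache.hits, "misses:", cache.misses)
```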
Michael Lynn: How will this surface in the configuration of MongoDB? Is it switches set at compile or is it configurations at a run time?
Dr. Sasha Fedorova: Well, you do need to enable one option at compilation time, and then you need to enable it at run time.
Michael Lynn: Fantastic.
Dr. Sasha Fedorova: If you want, I can also talk about some interesting technical aspects of building the cache on NVRAM, because it has to do with the specifics of its performance.
Michael Lynn: Oh yeah, definitely.
Dr. Sasha Fedorova: I mentioned earlier that write throughput is an Achilles heel of this technology. And so, write throughput can be slow. That's fine. But the very interesting thing that I found with this technology was that, if you're just reading your data, your performance is pretty good. We talked about read throughput being very good, comparable to DRAM. But as soon as you have a single concurrent writer, a thread that writes into the memory while other threads are reading from it, the performance of readers drops quite significantly. The presence of writers drastically affects the performance of readers. This is not a completely new phenomenon. This happens to some extent on any storage device, including DRAM. But the extent to which this happens on NVRAM is just much, much higher. For example, the presence of a single writer can drop your read performance to half of what it is with no writers. If you have, say, eight writers and eight readers on a system with 16 CPUs, so they're not competing for CPU or anything, then your writers will experience combined throughput that's 90% lower than if there were no readers. Your readers will experience throughput that's 90% lower than if there were no writers present. So the impact that the writers have on readers is just much, much more drastic on this technology. And that occurs if they're accessing the same DIMM, not if the readers are reading from one DIMM and the writers are writing to another. If the readers and writers are accessing the same DIMM, this is what would happen. This consideration was not widely known before, but it turned out to be very critical for cache design. Because when you build a cache, there is a trade-off. If you have a cache, you're doing two things. First of all, you are admitting new data into the cache. You're reading those blocks from disk and you're deciding, oh, okay, I'm going to cache this block in my NVRAM, so you're writing those new blocks into your cache.
You're admitting new data. And you need to admit new data in order for the cache to be effective, because if you only have the old data that nobody cares about anymore, that's not very useful. You have to purge the old data, you have to admit new, useful data. The second thing that you're doing is you are retrieving from the cache the data that's already there. So if I have the data that I need and it's in the cache, great, I want to retrieve it. And so, retrieving the data from the cache is obviously reading, but admitting the data in the cache is writing. And so, this is where we might potentially have a problem, because if your admission, putting new data in the cache, your writing new data in the cache is too eager, then the rate of your retrieval of the existing data will suffer. It'll just take a very, very long time. And so, what I discovered is that in order for the NVRAM cache to perform well and to be effective, you have to be very careful about how eagerly you admit new data. You have to throttle the admission rate, and that is a more or less new consideration for this type of technology. Because it wasn't very relevant for other caches that people have built in the past. But for anybody building caches on NVRAM, this will be very, very important consideration for performance.
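Editor's note: one way to picture throttling the admission rate is a token bucket that caps how many bytes per second may be written into the cache. This is a minimal sketch under stated assumptions: the token-bucket policy, the class name, and the parameters are illustrative choices and are not taken from the WiredTiger implementation discussed here. The clock is passed in explicitly so the behavior is easy to follow.

```python
class AdmissionThrottle:
    """Token bucket limiting how many bytes per second are admitted to a cache."""

    def __init__(self, bytes_per_second, burst_bytes):
        self.rate = bytes_per_second   # long-run admission budget
        self.capacity = burst_bytes    # maximum tokens that can accumulate
        self.tokens = burst_bytes      # start with a full bucket
        self.last = 0.0                # time of the previous call, in seconds

    def try_admit(self, nbytes, now):
        """Return True if nbytes may be written into the cache at time `now`."""
        elapsed = now - self.last
        self.last = now
        # Refill tokens for the elapsed time, never exceeding the burst size.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        # Over budget: skip admission so cache readers are not slowed by writes.
        return False


throttle = AdmissionThrottle(bytes_per_second=100, burst_bytes=100)
first = throttle.try_admit(60, now=0.0)    # fits within the initial burst
second = throttle.try_admit(60, now=0.0)   # only 40 tokens remain
third = throttle.try_admit(60, now=1.0)    # one second later, bucket refilled
print(first, second, third)
```

A block that is refused admission is still returned to the caller from disk; it just is not written into the cache, which keeps write traffic from slowing down reads of data already cached.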
Michael Lynn: So it would seem to me that with a greater capacity, you may want to give consideration to warming your cache prior to an exercise.
Dr. Sasha Fedorova: Absolutely. This is where we can also make our cache a bit smarter and actually take advantage of the fact that the memory is persistent. You can add a bit more software support and say, "Okay, well, I have NVRAM, but now I can actually take advantage of its persistence. If my application crashes and restarts, I don't have to populate my cache from scratch, but I can use the data that was already there prior to the crash." This is not a feature that we have in our cache yet, but it will be important going forward. If you have six terabytes of NVRAM sitting on your system, it takes a long time to populate six terabytes of data, especially given that the write throughput is low and we want to limit it. So making the cache persistent so that it survives crashes will be very important and very beneficial for performance.
Michael Lynn: So that would seem to be a really valuable attribute of this type of storage. If it's so costly to warm your cache and populate your cache, then the ability to persist is massively important.
Dr. Sasha Fedorova: Absolutely. You should join our team. You have great ideas.
Michael Lynn: Well, it would seem to be just common sense. I love learning more about the time and effort and expertise we dedicate to ensuring that the storage engine at the heart of MongoDB stays relevant, stays efficient. What else is on the roadmap? What else are you looking at?
Dr. Sasha Fedorova: Broadly, there's a whole bunch of interesting technologies, both in the hardware and the software space, that are on my radar. There are new kinds of SSDs that are called zoned devices, where you only write to those SSDs sequentially. So that puts a burden onto the operating system and the application. Why would you want to do that? Because then it simplifies the job of the SSD in terms of how to organize the data internally. Usually, what conventional SSDs do is you throw a bunch of data into them, and at some point, for reasons that I won't get into today, they basically have to reorganize the data. They have to read it from one place and write it to another, and that's called garbage collection. That can introduce some unpredictability into the performance of your IO device, because if I'm reading from the device and the device is also doing some internal reorganization, reading and writing the data, then my reads are going to suffer, and my writes are going to be slower. Zoned devices put some burden onto you in terms of how you use the device, but then they don't have this extra housekeeping to do, so your performance can stay more predictable. So that's one type of technology that's on my radar. Then, there is a bunch of questions that I'm asking about how to make the IO path faster. But perhaps one of the aspects that I will look at in the nearest future again has to do with caching. WiredTiger is now being re-architected to adopt a tiered storage architecture. If MongoDB is running in multiple replicas and each replica has its own instance of the storage engine underneath, you would now be able to configure MongoDB to have its data stored in an object store such as Amazon S3. That would make it easier for the replicas to share data.
And so, WiredTiger is being extended to support that, but now the natural question that comes up is: okay, accessing data from S3 takes longer than accessing it from a local SSD. How do we solve this problem?
Michael Lynn: Priorities.
Dr. Sasha Fedorova: A cache. We need a cache. I wrote a cache for WiredTiger. And so, I think my next job will be to figure out what is the role of caching in the tiered storage architecture. Can we use the cache code that I wrote to make it a more general kind of cache? What would be the performance advantages, the trade offs, the pros and cons, the engineering effort involved, and all of that. I think that's the most imminent task on my roadmap.
Michael Lynn: Oh, that's fantastic. Like I said, I think it's wonderful that we're dedicating this much effort and care toward looking at how we can continue to improve with the changes that are taking place in the hardware landscape and the operating system landscape. Fascinating discussion. We are at about time, but I'm curious what you do when you're not working on systems. What do you do in your personal life?
Dr. Sasha Fedorova: Depends on the season. It's winter now, so I do quite a bit of skiing.
Michael Lynn: Where in the world are you?
Dr. Sasha Fedorova: I'm in Vancouver, B.C., British Columbia. Last week, my family and I went skiing in Alberta, at Lake Louise. I do mostly cross-country skiing, because I like torture. In the summer, it's more biking, camping, being at the beach, windsurfing. I also do some ballet.
Michael Lynn: Oh, really?
Dr. Sasha Fedorova: And I cook a lot. I love to cook. I really want to be a good chef so I cook every day and all kinds of things. Recently, I got into smoking, so I smoke various things on my grill. Today I'm going to smoke fish. I also bake bread.
Michael Lynn: Oh, beautiful.
Dr. Sasha Fedorova: I bake bread from scratch. I have a home grain mill. And so, I buy grain from a local farm in B.C. I make bread from just this grain, water and salt. So I don't use any commercial flour, any commercial yeast. I just grind this grain. I grow my own yeast, and make my own flour, and bake this bread. My family and I have switched to eating only this kind of bread.
Michael Lynn: Ah, that sounds so, so good. Do I smell a sunset career perhaps as a chef?
Dr. Sasha Fedorova: No, that's too hard. You have to wake up too early. Being in systems, you can have any schedule you want.
Michael Lynn: True.
Dr. Sasha Fedorova: I think I'm going to stick to that.
Michael Lynn: Well, Dr. Fedorova, thank you so much for taking the time to talk with me. Is there anything else that you'd like to share with the audience before we wrap up?
Dr. Sasha Fedorova: Thank you for having me. I mean, stay tuned for new posts in the MongoDB Engineering Journal because I'll be writing about the things that I talked about today.
Michael Lynn: Fantastic. I'll include links in the show notes. So if you're curious and you want to learn more, check out the show notes. Thank you once again.
Dr. Sasha Fedorova: Thank you.
Michael Lynn: Thanks for listening. If you want to learn more about storage technology and the research that Sasha is working on, you can visit the engineering journal at engineering.mongodb.com. If you enjoyed the show today, I would love to get some feedback. Apple Podcasts, Spotify, leave a comment, leave a rating, greatly appreciate that. Visit mongodb.com/world to get your tickets. Use the code podcast, P-O-D-C-A-S-T, to get 25% off and some really cool podcast swag. Thanks everybody. Have a great day.
At MongoDB, we're constantly searching for ways to ensure that the database is as efficient as possible. This means exploring all sorts of new technologies to stay at the cutting edge. Dr. Sasha Fedorova works as a consultant in MongoDB Labs, exploring ways that the database, and specifically the storage engine, can leverage new storage and memory technologies. On today's show, Michael chats in depth with Dr. Fedorova about her research into these areas, and they cover DRAM, DIMMs, and Intel's Optane memory technology, which offers vastly lower latencies than conventional SSDs.
Highlights of the conversation include:
- Dr. Sasha Fedorova introduces herself and her work
- How Sasha became interested in systems
- Non-volatile memory and reducing latency
- Adopting non-volatile memory, and changes that need to occur to take advantage of the technology
- The application of non-volatile memory, and new solutions it can offer
- Larger scale adoption and effects with MongoDB
- Write throughput, and technical aspects of building a cache on NVRAM
- What's next on the roadmap
- What Sasha does outside of working on systems