Ep. 101 Debunking the Myth of Going Schemaless

This is a podcast episode titled "Ep. 101 Debunking the Myth of Going Schemaless." The summary for this episode is: Developers everywhere have embraced document databases. But in many cases, it's for the wrong reasons. Take the current hype around going schemaless: it's almost too easy to store arbitrary JSON or XML in a document database. But if you expect to filter, modify and retrieve it efficiently, you may be setting yourself up for disappointment. On today's episode, we're talking with John Page, Distinguished Engineer at MongoDB (https://www.linkedin.com/in/john-page-448b981/), about the concepts and myths surrounding the schemaless nature of document databases and MongoDB.

John's Blog: https://medium.com/@johnlpage

John's Article on The New Stack: https://thenewstack.io/debunking-the-myth-of-going-schemaless/

MongoDB is coming to the Game Developers Conference (https://gdconf.com) March 21st thru March 25th. Stop by the MongoDB booth to share your story with the Podcast. Mention the word "podcast" to receive a special SWAG Surprise.

MongoDB World is coming to New York City June 7th - 9th. Visit https://mongodb.com/world-2022 for more details. Be sure to use the code "PODCAST" for an additional 25% discount.
Episode Overview: Schema
00:52 MIN
John's background and what he does for MongoDB
01:12 MIN
John's love of education and wildlife, and how his passion projects materialize
04:33 MIN
How John translates passion projects into MongoDB projects
02:44 MIN
Mike and John discuss curiosity and a love of learning, wanting to know how things work
01:28 MIN
Employment after university, diving into programming languages
04:47 MIN
John talks about his favorite role at MongoDB between presales, professional services, and developer relations
01:12 MIN
Dissecting schema and schemaless databases, and use cases of each
02:56 MIN
Where schema doesn't matter, and places where you might not want rigidity with your data
03:06 MIN
Mike asks whether John sees challenges around MongoDB being schemaless when promoting it to developers and clients
05:25 MIN
What schema enforcement looks like from a technology perspective in MongoDB
03:06 MIN
How does a developer implement schema enforcement, and what happens when a document fails to meet the criteria?
02:19 MIN
John's advice to folks coming from a legacy relational world to MongoDB in terms of whose responsibility schema is
03:48 MIN
Differences between Realm Mobile Database and MongoDB
05:09 MIN
Resource advantages and efficiencies of schemaless design with MongoDB
09:34 MIN

Mike Lynn: If you've been working with data for some time, you're familiar with the concept of a schema. It enables you to define the structures and the types of data that you'll store in the database. Now with document databases, there's also a concept of going schemaless. Developers everywhere have embraced the document database for this reason, but in many cases, it's for the wrong reasons. It's almost too easy to store arbitrary data. Whether it be from JSON or XML, storing it in a document database is seamless and easy. But if you expect to filter, modify and retrieve that data efficiently, you may be in for some surprises. On today's show, John Page, a distinguished engineer at MongoDB, is going to talk about the myth of going schemaless. Stay tuned.

Mike Lynn: Hey, it's GDC, the Game Developers Conference. It's coming to San Francisco, March 21st through the 25th and MongoDB is taking part. There'll be a MongoDB booth. The MongoDB Podcast will be on site. Swing by the booth and mention the word podcast to receive a special gift. The Game Developers Conference brings the game development community together to exchange ideas, solve problems, and shape the future of the industry across five days of education, inspiration, and networking. To get more information, visit gdconf.com. And remember to swing by the MongoDB booth, mention the word podcast to receive a special gift. Hope to see you there.

Mike Lynn: Well, John, welcome to the show. It's great to have you on the podcast.

John Page: Hi, it's nice to be here. Second time I've done a podcast with you, first full one though.

Mike Lynn: That's right. We did something at .local in London, right?

John Page: Yeah. You called me for a few comments on the show, I think.

Mike Lynn: I'm good for that. Well, awesome. It's great to get you back, and we can take our time with this one. We're going to talk a lot about schema and the myths associated with MongoDB being a schemaless database. But before we get there, John, can you introduce yourself to the audience? Let folks know who you are and what you do.

John Page: So these days, I like to think of myself as a social media influencer, but that's probably pushing it a little bit. So I've done a bunch of things at MongoDB over the years. I've been with MongoDB for eight and a half years, and I've done presales, which is really telling people about MongoDB and getting them enthusiastic about it. I've done professional services, which is telling people about MongoDB and getting them enthusiastic about it. And now, I'm in developer relations where, guess what, my job is to tell people about MongoDB and try and get them enthusiastic about it. The only thing that's changed through those, I guess, has been the channels. When you think of presales, it's going out to see somebody in a presales capacity who might be curious. Professional services, they're already using it, but they might not know enough to be really loving it. And so it's helping that. Now I'm trying to reach a much broader audience by building things with MongoDB and telling people about what I'm doing, and then I'm trying to go out to social media and things like that. So mostly, I build stuff, particularly with new features and potentially lesser used features, and I just try and spread the knowledge. Learning and teaching every day, that's my motto. Learn something and teach something every day.

Mike Lynn: That's a great motto. I love that. So you've done a lot at MongoDB and you always seem to be working on these really cool projects. And I'm curious, where do you get your ideas for projects? And maybe talk a little bit about some of the recent ones that you've done.

John Page: That's a good question. So firstly, my first love, I think, is education. I want people to know things. I love learning things myself and I want to teach things. So I follow a fair bit of educational tech and what's going on there. And I also have an interest, both through myself and through my daughters, in nature and wildlife, and things like that. So I'm always looking for educational and for wildlife things to do. And we have dogs and fish as well. We're that kind of outdoorsy family. So I guess, I've had quite a few ideas come to me over the years, possibly when I'm just kicking back and watching TV and I think, "Hey, wouldn't it be great if we had X or we had Y?" My major passion is actually nothing to do with databases. I have a robot tank, which I built. So you imagine this thing, it's about six inches, 15 centimeters long. And on the front of it, right down to the ground, is a camera, but a very close up camera. And so by driving this thing around the garden and projecting it onto a big screen, it's like shrinking myself to being half an inch tall. I get to see bugs walking past and all the details of them, like I was driving around on safari and a lion walked past. That's my real, I won't say side hustle, because it's never intended to make money, but it's a kind of passion project. And then building that over the last six years has led me to all kinds of knowledge about wireless protocols and about robotics, and about a lot more mechanical stuff than I used to do. So that led me a long way into the world of electronics and mechanics and things, because at heart I'm a software person or by career, I'm a software person. So I've built that. During the pandemic, we had lockdown here, and I'd been cycling, and then the weather got really bad, so I built myself a smart trainer crossed with a game controller. So I wanted to be able to cycle, but I didn't... I'm not a competitive person, it's something, I think, about my upbringing. 
But I'm really not inherently competitive. I actually, I'm sure this will put off many of your listeners, I detest sport. I have done a little bit of motor sport, but mostly I cannot be bothered with sport, which is ironic living in Scotland where football is everything and in this house it's referred to as the F word. So I was cycling, but my idea of cycling isn't like, "How fast can I do this?" It's, "Hey, can I cycle through the countryside and what animals can I see, and how can I just enjoy it?" And my bike was very much a sit up, very straight, look around kind of thing, not crouched down over the handlebars. So then when you look at, how do you adapt that to doing it indoors? When you look at things like Zwift and Peloton, they're all about, "Get your head down, how fast can you go. Go, go." And it's like, "No, I don't want to do that. I want to explore." So I ended up building a bike trainer, very like the Peloton thing, but where you could steer, you could pedal and it would also change the resistance on the rear wheel, but all driven by a computer game so I could cycle around computer games initially. So driving games, I'd pick things with big open worlds, but you could even pick games that are set in different things like Red Dead Redemption, which is in the Wild West. Instead of riding a horse, you just cycle around it, and Skyrim and things like that. So some of these ideas come to me, some are just adaptations. And of course, those two things, that little mini buggy that drives around the garden and the cycling, will play together. So now I can shrink myself to an inch tall and cycle around my garden for an hour. People say to you, "Well, why not use VR?" And it's like, "Well, I could use VR, but actually it's just nicer having a big projection screen in front of me."

Mike Lynn: And really, isn't reality so much more interesting than virtual reality?

John Page: Well, there's a bit of both. So the cycling around the garden and the reality stuff, I absolutely love that. But even playing the games, I'm choosing games that have scenery. So I'll be cycling around the Pacific Northwest and a herd of deer walk out on the road. And you know yourself because you saw this, it's surprisingly realistic in places.

Mike Lynn: Yeah.

John Page: And then we have MongoDB projects. So I'm always looking for interesting things to do with the product. And especially, if I can include the product in what I'm doing, "Hey, I can do it in work hours." That's work. No matter how much fun it is, it's work. So we recently had a thing in the UK where the cost of domestic fuel in the UK is capped by the government. So they have a limit on how much the energy companies can charge for electricity and gas, and those are normally pretty high. The companies will generally do you a fixed price deal, which is less than the maximum. But all over the world, the price of fuel is rocketing and this was killing off energy companies in the UK. And so the government, every six months, reviews that. And at the last review, they put it up by 54%. So what's actually literally happening is that people's bills for April are 54% higher than they were for March, and that's heartbreaking for many. Now, I'm fortunate, I can manage that. But it made me take a step back and look at my own bills and think, "Those are high. We're being flagrant with energy. And as somebody who cares a little bit about the world and the environment, I probably shouldn't do that." And so I looked around. I sit in a room full of educational tech I've bought over the years, partly to try and encourage my own kids to do things, partly to learn it, partly because I've done stuff with neighborhood groups and things. And I thought, "How can I use all these bits and pieces to measure and manage my own energy usage?" And so, I did that. First, I built something which reads my electricity meters: it photographs them, it OCRs that information, it puts it into MongoDB and it draws nice graphs and charts, and then I blog about it. And then having done that, I said, "All right, well, probably the biggest cost is heating." Because Scotland is not a warm country. 
And so I grabbed a bunch of what's called BBC micro:bits, which are little standalone computers with a five by five pixel screen and a couple of buttons that you can program. And they're just for teaching the principles of programming, but they have a thermometer built in and they have Bluetooth built in. So I plugged one in in half a dozen rooms in my house, and built something to gather the temperatures and graph and monitor all of that. And then something to check if those temperatures were above or below what I would consider a reasonable temperature and, hey presto, my house is way too warm. So turn things down. And just as of this morning, I checked and to my absolute astonishment, I'm saving myself 150 pounds, like $200 a month already, which is mind blowing.

Mike Lynn: Yeah.

John Page: Well, I've started on electricity, but I'm still working on cutting that.

Mike Lynn: Yeah. Necessity is the mother of invention. I'm curious though, were you always this inquisitive? Did you always have this curious mind?

John Page: Curious can mean a couple of different things there, so you have to have... I'm sure people refer to me as a curious child. But I've always had a love of learning. I've always been a reader. I was having this conversation recently with my wife that as a young teen or even preteen, I would not be found in the library. I'd literally be in the paper library, reading books. I'd read fiction, but I'd read lots and lots of nonfiction as well. For those listening to this wondering what I'm talking about here, I'm of an age before the internet. And therefore, if you wanted to learn stuff, you went to the library and you just grabbed a book off the shelf and learned a new topic, back in the days when Wikipedia was 12 feet long. I used to read lots of periodicals. So I'd say I've always been curious. I've always been interested in things, I've always wanted to know how they work. Since my very, very youngest memories, I've wanted to be variously an inventor, or a toymaker or a magician, those are... Where I saw being a magician as a subset of inventor because tricks are things you create. So I fell into computer science by accident. I did train as a geologist, but that's only because I wanted to understand computing from the point of, first, I have to find silicon, then I can work my way up from there.

Mike Lynn: Yeah. And so, out of university, where do you find yourself in terms of employment?

John Page: In my last year of university, I was looking around on Usenet, which again, for those who've been around a very long time, it's... I'm trying to think what the equivalent is. What would you say the equivalent of Usenet is these days? It's a combination of like Reddit and Stack Overflow. Reddit actually is probably the closest thing, but this thing was absolutely text only. And I looked on the scot.jobs channel and there was an advert there: "Wanted: Unix GUI programmer for up-and-coming intelligence company." Now this was 1995 and I was like, "Intelligence? No idea what they really mean by that, but I'll go along." And so that was my first and only job interview, and I got the job and it was working for a company called Memex, who were a Scottish company who'd been around for 20 years at that point. And amazingly, they were very like MongoDB. They were a document database company who built their own database on top of, at that point, about 14 different Unix platforms. Although, I'd say, it was very much closed source. And so they had very few, if any, people using it apart from the US military. The US military used it as an underlying development tool. But they built their own GUIs on top of it as well, both generic GUIs like MongoDB Compass and also specific line of business applications for intelligence management. So that's basically, the story behind intelligence is somebody tells you something and then you have to assess, do I trust this person? Does this person say that this is reliable intelligence? Because you could have somebody that is very trustworthy, but they say, "I heard this third hand." And then a third category is, how much risk is there to this person or other people, if this gets in the wrong hands? Which is more than just security, it's like safety risk assessment. And so you get this raw intelligence and then you build from that a much more... That's almost an unstructured thing, it's a bunch of text. It's literally a text report. 
And then from that, you manually build structured information. So if it mentions a person, you create a record for that person or you link it to that person. So from your unstructured intelligence, you're building up this structured model of intelligence as well. And then the third part is analyzing that structured intelligence and going, "Hey, we've now had 10 different reports about this guy. That's probably something." And so we'll synthesize an intelligence report, which basically says, "We think this person's a problem. This is why, this is the evidence and this is what we need to do to ameliorate that problem." I did that in that company. I went from being a Unix GUI developer and then I worked on our first web front-end. And at that point, we were writing a web front-end and we thought, "We're probably going to need a framework for this." So we started from the ground up. And we were writing the web front-end and backend in C++ because JavaScript didn't exist at this point, just to give you an idea of how far back we're going. But also, we didn't want to have a server that had to spin up each time, so we had a persistent server in C++ and then a tiny little binary in C that called it to get the next page, and things like that. And we built it as an extensible framework because even back then we understood those concepts. And then I moved into Windows and I built Windows GUI text mining tools, graphical text mining tools. My boss at the time was super encouraging of, "Try and do some cool blue-sky stuff, because we want to be the amazing software of the year 2000, not just straight old database software." And I did that for 18 years. Ultimately, we ended up selling that company to SAS. I was a director in SAS. And MongoDB then headhunted me to come to do presales for them. But that was loads of time, and obviously all through that was document databases, but also RDBMS, and moving data between RDBMS and document databases.

Mike Lynn: Yeah. And primarily C++.

John Page: Well, no, from a development perspective there, it was C++, Visual Basic, C#, Perl, PHP, a bunch of different technologies. 18 years is a long time, you get to get reasonably good at a bunch of things. I did a lot of data transformation latterly. I worked extensively with a Middle Eastern government to bring a bunch of systems, up to and including the immigration system, all under one searchable unified system. And that was all Perl. So I ended up doing masses of Perl data manipulation.

Mike Lynn: Yeah. So you end up at MongoDB in presales, you've eventually gone through professional services and now in developer relations, what's your favorite role?

John Page: I'd like to claim that all three are really the same role. It's just as the company evolves, the person paying me to do it changed. When I started in presales, presales people did a little bit of professional services and they also went out and did conferences and they also spoke about MongoDB. In professional services, the customer was paying, which was nice. And the audience size would be smaller, but really it would still be to go out and talk to people about what their problems were and educate them. I have to say that my current role is nice, because it's less structured. There's more freedom in my current role to do what I want. In presales, the customers were chosen for me. In professional services, the customers were chosen for me. Whereas in my current role, I just have a lot more influence over what I do and what I work on. I think it gives me the ability to just provide more value to MongoDB and more value to our customers because I can target a broader base and maybe do things that are slightly unusual.

Mike Lynn: Well, yes, definitely. And I see that. And so coming to the topic that we wanted to discuss today, you've written a great article and it's about schemaless. Do you want to give an overview of the article?

John Page: Yeah. There's a concept effectively of schemaless databases. And people talk about schemaless databases as being some polar opposite to our traditional, I don't know, schemafull relational databases. I think they're used, or they tend to be used, for different use cases. But MongoDB isn't necessarily schemaless. I'd argue that there are very few true cases where a database, in the sense of a set of data, is schemaless. How much schema there is, and how that schema is enforced by the underlying storage technology, is a different question. So you can have technologies that insist that before you give it any data, you tell it in great detail what that data will look like. You can have technologies which say, "Just give me your poor and hungry. Sorry, give me your data, and I will absorb it and then we'll deal with it later." And MongoDB as a technology will do either. We can strictly enforce a bunch of guidelines upfront, or we can let you put some data in and then work out some guidelines, or we can trust you just to manage at the client end what your data looks like. The important thing is that if your data truly has no schema, hypothetically every piece of data in your database could have a different set of fields. It's not going to be really very usable. And so you have to accept that there's at least some schema and think about applying schema to what you're doing. Or if you're thinking of your MongoDB use case as, "Well, my data has a schema, therefore I shouldn't use MongoDB," you're probably missing some of the point.

Mike Lynn: Yeah. As I think about it, schema is really a definition of the types and structure of data that lives in a database. And you mentioned it from a relational database perspective, a legacy relational database technology will force you to define the types and structures prior to storing the data. And as you mentioned, MongoDB doesn't do that. There are few use cases that I can think of... I'm not even sure I can think of one where, I mean, unless you've got an application that is accepting data and immediately storing it without concern for the key value pairs and the structure of the data, I can't imagine would be, well, number one, secure. But number two, the reporting, what are you doing with this data if you don't truly understand what types and structure the data is? So I'm going to ask you, I mean, can you think of an application where schema really doesn't matter?

John Page: I think where schema doesn't matter, yes and no. So when we talk about MongoDB, we talk about having dynamic schema, which effectively says, "If you want, we'll accept anything you give to us." And there are three use cases for that, three places where you may not want to rigidly tell it upfront what your data is. The first one is just to do with evolution of software. If I have version one of my software and then I want to have version two of my software, which has slightly different schema, having a database which is flexible in terms of schema means I can store the two at the same time. The next case is the one that really is where people are using MongoDB in a schemaless manner. It's not entirely wrong, but it's only a very small part of what you could do. So imagine you are an organization that has predefined objects that you pass around your organization for different bits of software. So somewhere along the line, a data architect has defined an XML or a JSON model to describe a customer or a trade, or something like that. And that definition is a schema in its own right. It may be defined at your organization level, it may be something that's even defined beyond your organization. But it's not something that you, as an application developer, have direct influence on. If one day you're getting things from a downstream system that looked like this and then the next day it changes, the nice thing about MongoDB is you can say, "Look, just take this downstream data, which actually I don't really know what's in it, and store it in a way that lets me access individual parts." Now I'll almost certainly want to add some metadata to that to say, where did it come from, and when did it come from there? And that metadata in MongoDB is definitely going to be a schema. But we can have a partial schema where we say, "Okay, I must have this field, this field and this field. 
So I understand where this data came from and what date I got it and what date I should delete it." But the rest of it could be just any set of fields. And it may be that when that data comes in as just any set of fields, the correct answer is, "Well, we'll store it as a binary BLOB." If the data comes in in XML, it may not be worth converting to JSON, it may be worth just storing as a BLOB of XML or a string. Or it may be that we do want to be able to do ad hoc queries against it, and so we will store that arbitrary data. And at some point in the future, somebody may say, "Look, I know that these documents have this field and so I'm going to write a query against that." So there is this concept of storing an object as a payload. And the nice thing about MongoDB is that we can do that slightly better than simply storing them as an opaque BLOB of data or as a string, we can store them with some structure.
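
To make the partial-schema idea above concrete, here is a minimal sketch in Python. It is an illustration only, not code from the episode or the article: the field names (`source`, `received_at`, `delete_after`, `payload`) and the helper names are assumptions chosen to mirror the three metadata questions John lists (where the data came from, when it arrived, and when to delete it). The payload itself is left as an arbitrary nested structure.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Optional

# Hypothetical metadata fields: the part of the document that does have a schema.
REQUIRED_METADATA = ("source", "received_at", "delete_after")


def wrap_payload(
    payload: Dict[str, Any],
    source: str,
    retain_days: int = 90,
    now: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Wrap arbitrary incoming data in an envelope with fixed metadata fields."""
    received = now or datetime.now(timezone.utc)
    return {
        "source": source,
        "received_at": received,
        "delete_after": received + timedelta(days=retain_days),
        # Stored as a structured sub-document rather than an opaque string,
        # so individual parts stay addressable later.
        "payload": payload,
    }


def has_required_metadata(doc: Dict[str, Any]) -> bool:
    """Check only the envelope; the payload may contain any set of fields."""
    return all(key in doc for key in REQUIRED_METADATA)
```

Because the payload is kept as a sub-document, a later ad hoc query can reach into it with MongoDB's dot notation, for example a filter such as `{"payload.customer.country": "GB"}` (the `customer.country` path is hypothetical).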

Mike Lynn: So the article we're talking about is titled, Debunking the Myth of Going Schemaless. It's on The New Stack, at thenewstack.io. I'll include links in the show notes, if you want to check out that article. So I like what you said about the metadata, and that's where I think the value of MongoDB comes in. Regardless of the bulk of the data that you're storing, you have the ability to have this flexible framework for storing key value pairs in MongoDB. I mean, we have said as a company, "MongoDB is schemaless." We've used that language. Do you see challenges around the dichotomy that we're promoting MongoDB as a schemaless database? And then in the same sentence, it's like this Heisenberg experiment. I mean, the schema exists, maybe not until you look at it. Do you see challenges around that from a language perspective?

John Page: I think there's a little bit of overloading of the term schemaless, which causes us some problems there. So the first part of that is that MongoDB as an underlying database technology has flexible schema. You do not need to define the schema upfront, which some people will refer to as schemaless: if the database is not enforcing a schema, I can put data into it with no schema, and therefore I can conceivably create a completely schemaless database. But a schemaless database is like a paperless toilet, it's entirely possible, but not entirely desirable. So we do have this thing where we've used the term schemaless to describe the underlying database, but we actually stopped using that officially quite some time ago. That said, the world also uses the term schemaless database, for better or for worse, again, to refer to a database that may or may not have an available schema, or may or may not enforce a schema. And we live in a world where, when somebody searches for schemaless database, we'd very much like them to look at us. Maybe they are one of those few use cases who, for some reason or other, want a mostly schemaless database. Maybe they do have just a lot of arbitrary JSON or similar data, plus a few metadata fields. So we still have information that says MongoDB is a schemaless database. I think the language I like to use, the way I like to talk about MongoDB, is MongoDB is a document database. And that's a very orthogonal concept to being a schemaless database. So a document database can be schemaless and a schemaless database could be not a document database. I mentioned a previous company I worked for built a document database. It also had a very rigidly enforced schema. It was very like MongoDB, but you had to tell it upfront what those documents were going to be and what those key and value pairs were going to be. MongoDB lets you do that. 
MongoDB will let you say, "These are the keys, these are the values, these are the ranges for these values, these are the co-dependencies. This value must never be more than three of this." It can do all of those things, whether you choose to use them or not. But that's a different concept to MongoDB being a document database and as a document database, being something that you can use to build very efficient schemas for transaction processing.
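
The kinds of rules John lists here (required keys, value types, ranges, and co-dependencies between fields) can be expressed with MongoDB's collection-level document validation. The following Python sketch is an illustration under stated assumptions: the `orders` collection and all field names are hypothetical, and `db` is assumed to be a pymongo `Database` object. It combines a `$jsonSchema` block for keys, types and ranges with an `$expr` rule for a cross-field constraint.

```python
from typing import Any, Dict


def build_order_validator() -> Dict[str, Any]:
    """Build a validator document: JSON Schema rules plus one cross-field rule."""
    return {
        "$and": [
            {
                "$jsonSchema": {
                    "bsonType": "object",
                    # "These are the keys": fields every document must have.
                    "required": ["customer_id", "status", "ordered_qty", "shipped_qty"],
                    "properties": {
                        # "These are the values": types and allowed values.
                        "customer_id": {"bsonType": "string"},
                        "status": {"enum": ["open", "shipped", "cancelled"]},
                        # "These are the ranges for these values".
                        "ordered_qty": {"bsonType": ["int", "long"], "minimum": 1, "maximum": 1000},
                        "shipped_qty": {"bsonType": ["int", "long"], "minimum": 0},
                    },
                }
            },
            # A co-dependency: one field constrained by another field's value.
            {"$expr": {"$lte": ["$shipped_qty", "$ordered_qty"]}},
        ]
    }


def create_orders_collection(db: Any) -> None:
    """Create the (hypothetical) orders collection with the validator attached.

    `db` is assumed to be a pymongo Database; nothing here connects to a server.
    """
    db.create_collection("orders", validator=build_order_validator())
```

Fields not named in the schema are still accepted in this sketch, which keeps the rest of the document flexible while the listed fields are enforced.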

Mike Lynn: I think schema on read might actually be a better phrase. I mean, the schema does exist once the data is stored.

John Page: Well, the schema also exists in your client side. Regardless of how much schema is enforced or not enforced by your database, if your client application just sends arbitrary data to the database and you have a schema, you're going to get things rejected. There's no point in having a schema on the server side and then not having a schema understood and enforced on the client end, because you're just going to get errors. So ultimately, pretty much all software is going to have schema. It's going to have layers of object and code that define these are the fields that my application understands. And so schema on read is an interesting concept. Schema on read is almost an idea of, we just throw arbitrary things in the database because they come from some downstream system and then we have to figure that out on read time. That's a really inefficient, if useful, way of doing it. So that's the data lake world where it's like, "Look, we're going to store this data just now as cheaply as possible, both in terms of processing and in terms of storage. And if we ever need to use it, well, we'll deal with figuring out what it is then." Whereas I think that's possible with MongoDB, but MongoDB is more about the... The primary responsibility for schema enforcement is the application's. At the database end, we'll make that optional for you. Because to be honest, if the database doesn't enforce the schema for you, it gives you more flexibility in your application. For example, you can say, "Here are the five fields that I need, but whatever is in this field could be an object of any type." So I'm not sure I agree with schema on read. I think it's devolving the majority of schema responsibility to the client application. And then, as a corollary to that, you need to keep control of that. So if you have one application using your data and ideally that's got one class, one library, one bit of code that interacts with the database and crystallizes that schema, that's great. 
If on the other hand, I'm allowing lots of different people to access my data, then maybe I do need to turn on schema enforcement. Maybe I need to make MongoDB less schemaless because I'm allowing many, many people to access it, but we aren't giving that flexibility.
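
As a rough illustration of "one class, one bit of code that crystallizes the schema" on the client side, here is a small Python sketch. The `Reading` class and its fields are hypothetical and are not taken from John's projects. It fixes the fields the application understands while leaving one field (`extras`) open to an object of any shape, as in the "five fields plus anything" example above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Reading:
    """The single place where this application's document shape is defined."""

    room: str
    celsius: float
    # Deliberately unconstrained: whatever is in this field can be any object.
    extras: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if not isinstance(self.room, str) or not self.room:
            raise ValueError("room must be a non-empty string")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(self.celsius, bool) or not isinstance(self.celsius, (int, float)):
            raise ValueError("celsius must be a number")

    def to_document(self) -> Dict[str, Any]:
        """Produce the dict that would be handed to the database driver."""
        return {"room": self.room, "celsius": self.celsius, "extras": self.extras}

    @classmethod
    def from_document(cls, doc: Dict[str, Any]) -> "Reading":
        """Rebuild a Reading from a stored document, rejecting missing fields."""
        try:
            return cls(room=doc["room"], celsius=doc["celsius"], extras=doc.get("extras", {}))
        except KeyError as missing:
            raise ValueError(f"document is missing required field {missing}") from None
```

If every read and write goes through a class like this, the database can stay permissive. Once several independent applications write to the same collection, that guarantee is gone, which is the situation where turning on server-side enforcement starts to make sense.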

Mike Lynn: Yeah. And I love that flexibility. It comes down to, you can make decisions about the schemaless or schemafull nature of your application based on the use case and maybe the number of users and the disparity of the use cases that you understand about your application. But I want to go back and I want to touch on... We've talked about the free for all nature. I can just write an application that dumps the data to MongoDB. Maybe I include some structured metadata, but what about the other end? Now you mentioned that we do make available schema enforcement. What does that look like from a technology perspective in MongoDB?

John Page: So MongoDB schema enforcement, I recently had cause to do more work with this than I expected to. So it's a newish topic for me. I was never convinced as to the value of schema enforcement on the basis that most of the time there's one application and a limited amount of code hitting it. So MongoDB can set up what we call document validation. So we can basically set up a rule in the server that says, "Every time a new version of this document or a new document is given to the database, at its simplest, it must match this query." So it uses the same query language as everything else in MongoDB, but I could do something that says something like, "Every document that's added to the person collection must have a date of birth field that falls between this and this." So I'm basically saying, I must have a date of birth field and it must be a date between these two ranges, which then makes that a mandatory field and it makes it a type date and it means it falls in here. And I can also just explicitly say, "This field must be a date or this field must be one of these values, or this..." I can do the things with arrays. I can say, "This array must have no more than five things in it." So it's actually incredibly powerful. And it spills out into the whole aggregation expression language as well, which is a Turing-complete language. So I can then go beyond that and do things like, "This field must be less than this one, plus this one." Which you can do in SQL with a WHERE clause, but I'm not aware of any other NoSQL technology that gives you that capability. And in my own case, I was actually creating... I wanted records that I could allow a developer to edit, but only edit certain fields in. And MongoDB security says, "If you can edit a record, you can edit the whole record."
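
The kinds of rules described above (a mandatory date within a range, a value from a fixed set, an array size limit, and one field compared against the sum of two others) could be written as a validator document along the following lines. This is a sketch only: the collection name `person` comes from the conversation, but every field name, date bound, and allowed value is a hypothetical choice for illustration.

```python
from datetime import datetime

NUMERIC = ["int", "long", "double", "decimal"]

# Hypothetical field names and bounds, chosen only to illustrate the rule types.
person_validator = {
    # Structural rules: mandatory fields, types, allowed values, array size.
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["name", "date_of_birth", "spent", "budget", "bonus"],
        "properties": {
            "name": {"bsonType": "string"},
            "date_of_birth": {"bsonType": "date"},
            "status": {"enum": ["active", "suspended", "closed"]},
            "phone_numbers": {"bsonType": "array", "maxItems": 5},
            "spent": {"bsonType": NUMERIC},
            "budget": {"bsonType": NUMERIC},
            "bonus": {"bsonType": NUMERIC},
        },
    },
    # Ordinary query operators: the date must fall inside a range.
    "date_of_birth": {"$gte": datetime(1900, 1, 1), "$lt": datetime(2022, 1, 1)},
    # Aggregation expression: "this field must be less than this one plus this one".
    "$expr": {"$lt": ["$spent", {"$add": ["$budget", "$bonus"]}]},
}

# With PyMongo and a running server, the rule could be attached like this
# (not executed here; `db` would be a pymongo Database object):
#   db.create_collection("person", validator=person_validator)
#   db.command("collMod", "person", validator=person_validator)  # existing collection
```

With the default `validationAction` of `"error"`, the server rejects an insert or update that does not match the validator; setting it to `"warn"` lets the write through and logs the violation instead.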
So I ended up building a document validator that just basically said, "This field, this field and this field, if you hash them together with this secret value, then they must equal this field." Which allowed me to create records and sets of fields that a developer couldn't change. A developer can change anything else, but if they change any one of those fields, then that computation is not going to work anymore. So it goes all the way from the simplest thing, to say, "It must have these fields, or it must have these fields of these types," up to any amount of business logic applied to the things you're putting in the database.
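
The episode does not say how the hash rule was expressed inside the validator, so the following is only a plain-Python sketch of the check itself: a keyed hash over a few protected fields is stored in a separate field, and any edit to a protected field makes the stored value stop matching. The secret, the field names, and the use of HMAC-SHA256 are all assumptions for illustration.

```python
import hashlib
import hmac

SECRET = b"example-secret"  # hypothetical; a real secret would not live in source code
PROTECTED_FIELDS = ("course_id", "max_score", "issued_to")  # hypothetical field names


def compute_seal(doc: dict) -> str:
    """Hash the protected fields together with the secret value."""
    message = "|".join(str(doc[name]) for name in PROTECTED_FIELDS).encode("utf-8")
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def seal_is_valid(doc: dict) -> bool:
    """True while the stored seal still matches the protected fields."""
    return hmac.compare_digest(doc.get("seal", ""), compute_seal(doc))


record = {"course_id": "M101", "max_score": 100, "issued_to": "dev-42", "notes": ""}
record["seal"] = compute_seal(record)

record["notes"] = "free to edit"   # unprotected field: the seal still matches
still_valid = seal_is_valid(record)

record["max_score"] = 1000         # protected field: the seal no longer matches
tampered_valid = seal_is_valid(record)
```

Here `still_valid` is true and `tampered_valid` is false, which mirrors the behaviour John describes: anything else can change, but touching a protected field breaks the computation.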

Mike Lynn: But then we get ourselves into the same quandary where we dealt with this in the relational world, where the schema now lives separate from the code that manipulates the data itself. How does a developer implement schema enforcement and what happens when a document fails to meet the criteria?

John Page: I think in the general case, server side schema enforcement is not a good thing to do. It's there, it's possible. And if you're really uncomfortable with the idea that the schema enforcement lives in your application code, then you can enforce it. It's only really of use where you do not have control over developers' access to your database. In a good world, you have one access layer for the data, sorry, one write access layer for the data. Reading isn't really an issue, but you have one write access layer, whether that's some API, microservice, class, library, whatever it is, and that's how you interact with the data. And in a perfect world, you've got a nice business data access layer that actually understands even versions of your schema. So when I do get payment details, if it's an old payment details record, it will return that in one form, if it's a new payment details record, it'll still work and I don't have to upgrade things. Really, I would say that schema enforcement isn't the database's job. It should be done in the client end code, but there are always exceptions to that rule. And the exceptions are, where you have a large number of development teams hitting the data, potentially when you're putting out new versions of code and you've got a risk that perhaps something isn't doing what it's supposed to do. So as part of your testing cycle, you might even turn on schema enforcement just as part of testing. So you don't run it in live, but you do run it in pre-production to make sure everything is running the way you expect. Or in my particular case as a very unusual use case, where I wanted my end users to be MongoDB developers because I was building an educational experience for MongoDB developers. And so I wanted them to be able to use a Mongo Shell or Compass and do whatever they liked while still having certain constraints enforced upon them.
But it was genuinely, I want to constrain people who are allowed to use any programming technique they like.
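
A data access layer that "understands even versions of your schema" might look something like the sketch below. The `schema_version` field, the two document shapes, and the function name are hypothetical; the point is only that callers always receive one shape, whichever version happens to be stored.

```python
from typing import Any, Dict


def read_payment_details(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Return payment details in the current shape, whatever version was stored."""
    version = doc.get("schema_version", 1)  # assume unversioned documents are version 1
    if version == 1:
        # Hypothetical old shape: flat card fields on the document itself.
        return {
            "method": "card",
            "card": {"last4": doc["card_last4"], "expiry": doc["card_expiry"]},
        }
    if version == 2:
        # Hypothetical current shape: already nested under "payment".
        return doc["payment"]
    raise ValueError(f"unsupported schema_version: {version}")


old_record = {"card_last4": "4242", "card_expiry": "12/25"}
new_record = {
    "schema_version": 2,
    "payment": {"method": "card", "card": {"last4": "4242", "expiry": "12/25"}},
}
```

Both `read_payment_details(old_record)` and `read_payment_details(new_record)` return the same dictionary, so old documents do not have to be upgraded before new code can read them.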

Mike Lynn: Now, I cut my teeth in an environment where separation of duties was very enforced, I'll say. In the world of finance, there were towers and really strict lines between responsibilities. And one of the arguments that I heard early on was that schema should be the responsibility of the database administrator. What do you say to folks coming from the legacy relational world to MongoDB that say, " Well, you're going to end up with a Wild West if you've got developers controlling what the database looks like from a structure perspective."

John Page: I think with that one, firstly, the origins of the DBA and the schema design and the idea that a DBA or architect or somebody would determine the schema, and then the schema would be given to the developers to build against. So I think that notion originates from a world where it's the mid to late 1970s, and somebody has come up with a great idea that in any given organization, there should be one and only one copy of the data. There will be one database and that's it. And no data point will be duplicated because that way, there's never any ambiguity if we only have one version of a fact, and not only that, but we can save money on the storage if we only have one version of a fact. And we can't trust developers to read and write files because they won't follow these rules. And so the birth of the relational database. And so if you have one database for an organization, then the first rule is that the design of that database must not really be geared to any particular application. It must be entirely generic and driven by the domain of the data. We go through first, second, third, and sixth normal form taking this data and basically saying, " Look, how do we organize this in a way that isn't geared up to any particular access pattern, mode or use case?" And that's carried over, that's even still done today. Even though, nowadays the idea that any organization would have, " Yeah, we have one database and every one of our applications forever will run on it." Nowadays when we have an application, our first thought is, " So I'll need a database for this then." And databases in the real world are very much tailored towards the applications they run on. So those lovely architect designed, perfect, normalized schemas don't survive first contact with the enemy and you find string fields with comma separated text in them, and you find people doing all sorts of things. 
And you'll find databases that have a table that says customer number one, customer number two, customer number three, customer cell one, all kinds of design compromises that had been made because the original perfect design didn't work when it came to be developed. So I think nowadays there are people working in development who have a better understanding of what that data structure needs to look like. I'm not saying it's every developer, there are developers who expect to just work in a very rigid framework, but I think there are a lot of developers now who have many of the data design skills needed to actually work out, what should my backend schema look like? And I think we, as a company, encourage that idea that getting the perfect schema needs to be a top down process. We need to understand both the nonfunctional requirements, the data requirements, the performance requirements, but also it needs to be driven by the user interface requirements. If the principal requirement 90% of the time is to retrieve a data form that looks like X, then let's make sure that the back end supports doing that efficiently.

Mike Lynn: I like that. Now we're coming up on time, but I would be remiss if I didn't bring up the topic of mobile and our Realm Mobile Database. Do you work much with our Realm Mobile solution?

John Page: I have done in the past. So I worked with the Realm Mobile Database shortly after MongoDB acquired it and I understand a lot of the concepts behind it, but I really haven't done much with it recently.

Mike Lynn: Yeah. So there's this dichotomy, the Realm Mobile Database, there are so many things in common between the MongoDB server based database and the concepts around the Realm Mobile Database. But one glaring difference is in that the Realm Mobile Database does require that you strictly enforce a schema. You need to create the types and structure prior to writing data to the database, and I was curious if you had thoughts on that. Actually, I mean, I know you've worked a little bit with it, but what are your thoughts on the differences between Realm Mobile and MongoDB?

John Page: That was one of the first questions I asked when I started learning about Realm was, it was there but not there. And I think it's to do with the expectations of mobile developers and very much the expectations of mobile libraries and mobile projects. So in the web world, people work with JavaScript, which is a very happy dynamic language, which will readily accept different data shapes and the same applies in a lot of server technologies. In the mobile world, things initially were all quite rigid. I mean, mobile phones, ultimately and interestingly, mobile phones remind me of the development work I did in the late 1990s. I was writing Windows desktop applications and some Linux desktop applications. And the whole mobile programming stack has so much in common with that, from how the GUIs work to how the code works. But one thing that you were always thinking about when you were writing on Windows or when you were writing on Linux was, how is my memory usage? How can I be efficient about this? How can I limit my resource usage? And that's compounded for mobile applications by battery usage. So even though mobile applications have much more storage than they used to have and much more RAM and much faster CPUs, there's still a finite amount of battery to play with on a mobile device. And that then influences the software stack and how things work. And of course, it's evolved from devices that had very little memory, relatively speaking, very slow CPUs, relatively speaking, and very, very little battery. And Realm itself has come through that whole cycle, because it's not a new product by any means. Realm is designed for efficiency. Fine, that makes sense. But also all of the other libraries in mobile technologies are designed for efficiency. And so none of them are really designed to work with data structures.
If I give you an example, lots of the GUI components, rather than you have to explicitly update a GUI component and say, "You'll add this to my list, add this to my list, add this to my list." The components you see on the screen are bound to underlying components. So that as those underlying components change, then the GUI changes. So as a developer, you just need to add something to a list or an array and it updates in the GUI, but none of them are designed around the idea of dynamic. So the idea that you can add something to a list, that's fine. But if you were to add a new field to an object, none of the GUIs on mobile technology have any idea what to do with that. So one of the reasons why Realm is the way it is and why at this point, we've not made it more dynamic, more MongoDB like, more flexible is simply because mobile components wouldn't understand what to do with that. Mobile developers, therefore, really have not a great deal of use for that. One of the first things I did when I got my hands on Realm was build a tiny little micro Mongo library on it that had collections and documents and databases and behaved like MongoDB on the device, and behind the scenes used Realm as a transport protocol to move data to and from the server. So I could literally write some mobile code that looked and smelled like MongoDB code. As a MongoDB developer, it'd be great. But I just stopped after I saw a POC of that project, because it was like, even if I had the MongoDB driver API on the mobile device, in most use cases, it just doesn't make any sense.

Mike Lynn: Yeah. It's like even in the server world early on, the physical constraints of the servers drove the constraints of the database itself. And it sounds like we're still in that space with mobile. So it sounds like the constraints on the device drive the flexibility or the degree to which we can be flexible in the database. And I do want to ask about, there are some resource advantages, some advantages in the way that we deal with resources and when we constrain. Do you want to talk about the efficiencies or lack thereof associated with the schemaless design with MongoDB?

John Page: So the thing about a schemaless design. So if we're talking about having a lot of fields that aren't particularly defined upfront, and what you defined as schema on read, as in we've got this data, we've stored it, but then we need to... we can't optimize for anything we do. There really is no optimization you can do for a schemaless design. If you don't know how you're going to use your data, you can't even build viable indexes for it. And you're really, ultimately, building a key value store where you're saying, "Look, I'm going to retrieve by this single dimension and you're going to get back a big BLOB of something. Otherwise, there's very little I can do." I think the point I'd like to make is that that concept of schemaless is almost completely at odds with the idea of a document database, which is a very, very structured concept. MongoDB has schemaless as a capability, but it's also a document database. And a document database, dare I say, correctly used, although that's very much an opinion, is a highly structured thing, in the same way that an object oriented language is a structured thing. The advantage of document databases is that you can design much richer server side objects, which colocate data, which explicitly join data, which make it faster and more efficient to retrieve it because I don't need to join things to get other. It makes it easier to build a distributed system, because now the things I need to update are all on the same server. I can do an atomic change to change one thing. I, ideally, don't need to worry about using ACID transactions. I can do ACID transactions or multi-document ACID transactions in MongoDB. But if I do that, then I've got multiple steps to my modification. I have to tell the server, "Right, okay, I'm going to start." And then I have to change one thing and then I have to change another thing, or any number of things. And then I have to say, "This is over. Hey, I've finished."
And between that first thing being changed and me finishing, all of those things are in an indeterminate state. As soon as I change something inside a transaction, nobody else can really rely on its value. They can't change it. Even reading it is questionable, because it's in a state of it's neither happened nor hasn't happened. Very much transactions result in, I think you said this earlier, locking those records. And the length of time that they're in that indeterminate state and therefore effectively locked for anything else to use, is huge in computing terms. So any software system, which relies on me changing multiple things and not as part of a single atomic send to the server, "Do this now." And doubly so if those things are over a network because compared to in memory time, network time is forever, that whole thing causes contention. Let me give you a great example. Let's say I want to rip through my database and for the sake of argument, I want to just set a new field to five and I've got a million records. So I start and I kick it off and it's just setting this new field to five, for whatever reason. I'm basically locking my whole database as I ripple through the database. Now nobody else can update that database until I complete my task or roll it back, because they're all relying on my number five change being there. And that can cause enormous contention and locking. And it's one of the fundamental problems with traditional relational databases in terms of scale. Why do relational databases have problems with scale? One, trying to get them across multiple machines, but two, transactions are fundamentally evil. They're a necessary evil in places, but they're fundamentally evil. So a document database can give you a much better chance of co-locating all the data you might need to change together so that a single operation can come in, grab the record, lock it, change it, unlock it. And that's microseconds and it's not got any contention with anyone else.
So you've got schemaless on one side and people think of document databases as meaning the same thing as schemaless, but it's not. Document databases can have schema with or without enforcement. You could technically also have a schemaless tabular database if you really wanted to. Some of the wide column databases do that. But the key thing with MongoDB is don't mistake it for something that's only to be used for schemaless data. Think about it as being where you want to be a bit more structured than your relational database allows you to be, because by adding more structure, you can literally halve the cost of development. So I guess almost my final point on this, for the majority of workloads, I would say that MongoDB uses 50% or less of the compute resource of a relational database. And if that's not a money saving item for you, and remember nowadays, we don't buy a big server upfront and not worry about it. We pay as we go, we pay for the hardware we use by the hour. So you could literally look at halving your cost. But the second part is because it's got this document, this object nature to it, it's actually much easier for a developer to build and achieve things with. And so you cut money on the developer costs as well as cut money on the hardware costs. So document models ultimately just cost less money.
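
To make the "single operation can come in, grab the record, change it" point concrete, here is a hedged sketch of a balance change and its ledger entry living in one document, so both can be modified by one update rather than a multi-step transaction. The account and ledger field names are invented for this example; the function only builds the filter and update documents and does not talk to a server.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Tuple


def build_withdrawal(account_id: str, amount: int) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Build the filter and update for a withdrawal that touches one document."""
    # Only match the account if it holds enough money.
    filter_doc = {"_id": account_id, "balance": {"$gte": amount}}
    # Decrement the balance and append the ledger entry in the same write.
    update_doc = {
        "$inc": {"balance": -amount},
        "$push": {
            "ledger": {
                "type": "withdrawal",
                "amount": amount,
                "at": datetime.now(timezone.utc),
            }
        },
    }
    return filter_doc, update_doc


filter_doc, update_doc = build_withdrawal("acct-1", 25)

# With PyMongo and a running server this would be sent as one call
# (not executed here; `accounts` would be a pymongo Collection):
#   result = accounts.update_one(filter_doc, update_doc)
#   result.modified_count == 0 would mean no account matched the filter.
```

Because a write to a single document is atomic in MongoDB, no other client sees the balance changed without the ledger entry, and there is no start/commit window during which the record sits in an indeterminate state.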

Mike Lynn: That's a bold statement, John, 50%.

John Page: At least.

Mike Lynn: Can you tell me, what's the rationale behind 50%?

John Page: I think it's just a lot of years of understanding of how relational databases work. All those years ago in MongoDB, when I did my interview and I was explaining the internals of an Oracle transaction block, because I'd been working with Oracle and hitting some limits on how many open transactions I could have on a given set of data at a given time. So I'd like to think that I have a pretty solid understanding of RDBMS and of document databases and how they work. But the kitchen logic version of this is that document databases just have to do way less work. A join is an expensive thing to do. A transaction is a painful and expensive thing to do, although that's more for locking and contention reasons. The fact that document databases encourage denormalization and so again, cut that join capability. So the capability is definitely there and in my experience, it just needs far, far less hardware to achieve more. And I've heard people say, " When should I move to a document database? When do I know my problem is big enough that I need the scale, that a document database gives me?" Which I think is the wrong way of looking at it because with the exception of some very extreme use cases, a relational database will scale to pretty much any size problem if you throw enough money and hardware at it. But if you just start from the premise of, for any given problem, I'm going to need half as much hardware if I use MongoDB compared to using RDBMS of your choice. The answer is, when should I move to MongoDB, whenever I want to pay less money.

Mike Lynn: That's a great rule of thumb. And I guess if I were to try and restate that, so what you're saying is in the relational world, you are bound by the requirement to create a schema and you are bound if you're doing it right, if you're doing it the correct way to create multiple tables to describe your data. So you'll almost invariably have multiple tables containing different parts of the data that you'll manipulate with your application. So customer order, for example, customers in one table orders in another, and to fetch that data, you're going to have to do a join. So it's going to be multiple reads. Whereas in MongoDB, I can have a customer order collection where the order is embedded in the customer. That is one document fetch, which gets me the two parts of that data, the customer as well as the order. Is that essentially what you're saying in terms of the decisions?

John Page: Essentially, yes. One thing you said there was if you're doing it right. But actually with the relational database, you don't have that choice because you don't have a way of embedding one in the other. You can't even do it badly with a relational database and get that benefit. Choosing what to embed inside what is a design decision, and that's very much coming back to this notion of, it's not schemaless. You need to think about your schema. Do you embed the order inside the customer or not, or do you embed the customer inside the order? Either is possible. So those are some decisions you have to make. But ultimately, yes, when you read or write the database, less work needs to be done.
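
The customer and order example from this exchange can be sketched in plain Python, with in-memory lists standing in for tables and a dictionary standing in for a collection. All names and values are hypothetical; the sketch only contrasts assembling the result from two places with fetching one document that already has the shape the application wants.

```python
# "Relational" layout: two tables that must be joined on customer_id.
customers = [{"customer_id": 1, "name": "Ada"}]
orders = [
    {"order_id": 10, "customer_id": 1, "total": 40},
    {"order_id": 11, "customer_id": 1, "total": 15},
]


def fetch_joined(customer_id: int) -> dict:
    """Read the customer row, then scan the orders table and stitch them together."""
    customer = next(c for c in customers if c["customer_id"] == customer_id)
    matching = [
        {"order_id": o["order_id"], "total": o["total"]}
        for o in orders
        if o["customer_id"] == customer_id
    ]
    return {"customer_id": customer_id, "name": customer["name"], "orders": matching}


# "Document" layout: the orders are embedded in the customer document.
customer_documents = {
    1: {
        "customer_id": 1,
        "name": "Ada",
        "orders": [{"order_id": 10, "total": 40}, {"order_id": 11, "total": 15}],
    }
}


def fetch_embedded(customer_id: int) -> dict:
    """One lookup returns the customer and its orders together."""
    return customer_documents[customer_id]
```

Both functions return the same data, but the embedded version does a single lookup. Embedding the customer inside each order instead would be the other design John mentions, and which one is better depends on how the application reads the data.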

Mike Lynn: Perfect way to wrap up. John, anything else you'd like to share with the audience?

John Page: I don't think so. Feel free to plug my blog. That'll be good. So I'm medium.com/pageagainstthemachine.

Mike Lynn: Awesome. I'll include links there to that as well as links to the article on debunking the myth of schemaless. John, it's been a great discussion. Thank you so much for taking the time to talk with me.

John Page: Thank you very much.

Mike Lynn: Thanks for listening, and thanks once again to John Page. Be sure to remember, GDConf, the Game Developers Conference is coming to San Francisco. If you're planning to be there, be sure you add to your calendar a reminder to visit the MongoDB booth. Mention the word podcast to receive a special gift. That's the Game Developers Conference, March 21st through the 25th in San Francisco at the Moscone Center. Hope to see you there.

DESCRIPTION

Developers everywhere have embraced document databases. But in many cases, it’s for the wrong reasons. Take the current hype around going schemaless — it’s almost too easy to store arbitrary JSON or XML in a document database. But if you expect to filter, modify and retrieve it efficiently, you may be setting yourself up for disappointment. On today's episode, we're talking with John Page, Distinguished Engineer at MongoDB about the concepts and myths surrounding the schemaless nature of document databases and MongoDB. 

John's Blog: https://medium.com/@johnlpage

John's Article on The New Stack: https://thenewstack.io/debunking-the-myth-of-going-schemaless/

MongoDB is coming to the Game Developers Conference March 21st thru March 25th. Stop by the MongoDB booth to share your story with the Podcast. Mention the word podcast to receive a special SWAG Surprise. 

MongoDB World is coming to New York City June 7th - 9th. Visit https://mongodb.com/world-2022 for more details. Be sure to use the code "PODCAST" for an additional 25% discount off of the early bird special pricing.

Today's Host

Michael Lynn

Principal Developer Advocate

Today's Guests

John Page

Distinguished Engineer, MongoDB