Ep. 171: The Impact of Dataset Imbalance on Machine Learning and AI
Snehal Bhatia: Hey, I'm Snehal, and you're listening to the MongoDB Podcast. I'm here today to talk about the impact of dataset imbalance on machine learning and AI algorithms. So, stay tuned.
Shane McAllister: Welcome to the MongoDB Podcast. I'm Shane McAllister, and as ever, whether you're a regular subscriber or a brand new listener, we're glad to have you tune in and join us. In this episode, we're joined once again by Snehal Bhatia. Snehal is a solutions architect here at MongoDB. She previously joined us in an episode to talk about designing environmentally sustainable data architectures. In this episode, Snehal turns her attention to an area that she's very familiar with, an area she did her own thesis in, but one that has only now come to the mainstream with the continued widespread adoption of machine learning and artificial intelligence. In this episode, we deep dive into the impact of dataset imbalance on machine learning and AI. Speaking of deep dives, our MongoDB .local series of events has already started and is coming to 30 cities globally this year. Visit mongodb.com/events to learn more about where and when our .local events are happening, and perhaps join us in person at a MongoDB .local event near you. With that, let's get on with the show. On today's show, it's fair to say that we have a friend of the MongoDB Podcast, Snehal Bhatia, a solutions architect here at MongoDB. Welcome back, Snehal. How are you?
Snehal Bhatia: I'm good, thanks. Thanks for having me again.
Shane McAllister: It's great to have you again. You joined us, I think it was episode 136, which was back in November. So, it's been a number of months, but you joined us then to talk about designing environmentally sustainable architectures. For our listeners, if they want to go and listen to that episode, we will link to it in the show notes, so you can jump back to it on whatever podcast app you use. In that show, you were talking to us all about the environmental cost of data and data architecture. I remember one statistic in that, which was that the IT industry's share of global emissions was equivalent to that of the whole airline industry, the aviation industry. That took me by surprise, and you mentioned that we're at a tipping point, and there was a climate clock, et cetera. So, I think there were an awful lot of lessons there to show developers and IT professionals how they can think about their environmental footprint. The environment is top of mind for a lot of people most of the time, but in the last number of months, we've seen AI pretty much everywhere. So, you're back to talk to us about that, about the impact of dataset imbalance on machine learning and AI.
Snehal Bhatia: Absolutely. Yeah. I know you can't see it, but I'm wearing the MongoDB Podcast t-shirt as well.
Shane McAllister: I'd love to see it. You're looking for a second one then, for your second appearance. We'd have to have an alternative; unfortunately, we've only one type. But with a second one, when one is in the wash, you can have the other available.
Snehal Bhatia: Yeah, yeah, I'll be a brand ambassador. But yeah, absolutely, and thanks for mentioning the previous episode as well. That's still a topic that remains top of mind and continues to evolve, even in this context. But what I'm here to talk about today has more to do with this whole wave of AI, artificial intelligence, machine learning, LLMs or large language models, that has taken the tech world by storm. That's not to say any of it is new. It's more that it's now become day-to-day tech, something that every developer wants to incorporate into their apps. So, I believe it's important, especially as MongoDB is releasing AI-related functionality as well, that we talk about the implications and underlying issues that might impact such things.
Shane McAllister: At our .local event in New York, which is our flagship .local event (we are running them this year in 30 cities across the globe), we made a lot of AI announcements. We announced an AI innovators program, we announced a partnership with Google Cloud AI, and we announced Vector Search. Obviously, Vector Search for us is the linchpin that enables these AI efforts. Vector Search allows us to search images, use natural language processing, and enhance what we can do for machine learning. For those unfamiliar with Vector Search, each item or data point is also represented by a vector, which corresponds to specific features or characteristics of the item. The vectors are embedded alongside the data, which allows you to search for those items much more easily. So, similar images, et cetera, can all now be searched, whereas beforehand it was really hard to do that. Now, with Vector Search on MongoDB, you can store and search vector data within MongoDB using the same document model that we've always had. This area of AI, as I said in the intro, is super new. But you've been in this space a while, because you did your thesis in this area, Snehal. Tell us a little bit about that.
Snehal Bhatia: That's right. I do feel a bit old now, talking about my university days in the past tense, but this is an area I focused on during my academic years. I did my master's thesis on the use of GANs, or generative adversarial networks, for addressing dataset imbalance in image datasets. Prior to that, I'd worked on a couple of other mini projects trying to understand whether more data always means better machine learning algorithms or not. I also worked in research labs where we were trying to get surprising results back from recommender systems rather than just the expected ones. So, for example, if I've got my friends on my social media network, then my Netflix should not just recommend things to me based on what I watch, but also based on what my friends watch, and whether we have a link in the common directors we like, things like that. I had been working in this space quite a bit before I joined MongoDB, and I still keep up to date with it. So, it's something that's been going on in the academic community for many years now.
Shane McAllister: AI has really come to the fore recently. We see it associated with pretty much everything. Most of the tech companies have an AI story to tell now, and there's a huge number of startups in the AI space. We have actors worrying about their roles being replaced by AI. We have developers worried about their roles being replaced by AI. We have search and prompt engineers as jobs now. So, the complexity and reach of what AI is getting into is enormous. The one thing about it is that it's all underpinned by data, and obviously, the complexity of that data increases as we bring in more AI and more machine learning. You're here today to talk about the effect that has on your data and your data structures. One of the pointers that you gave me when we were prepping for this was dataset imbalance. That was a new one for me. We're always talking about just getting the data, storing the data, capturing the data, keeping it up to date, archiving it off as needed, and still being able to query across it. Imbalance, I suppose, was something in my world I wasn't considering. Talk to us a little bit about the consequences of dataset imbalance on machine learning algorithms.
Snehal Bhatia: Absolutely. When we talk about data science algorithms, and that's encapsulating everything, we are usually dealing with really large datasets: billions, even trillions of data points. It's bound to happen that not all of that data is going to be of perfect quality. So, there are things like noise in the dataset, missing values, and a lack of standardization in the way data is represented, and these are all things that we are very conscious of when we're designing and training our algorithms. That's why we have the pre-processing step, which is a big part of dataset labeling and similar work before we actually start the training. But one thing that very often gets overlooked, and it's easy to skip as well, is addressing the fact that there might be a skew in the way the data is distributed, an imbalance in the types of data points being represented. If we talk about classification algorithms, which look at an image or a data point and put it in one category or another, then we can call this a class imbalance. This can occur for many reasons. In some datasets, just by the nature of the problem, the natural frequency of occurrence of certain types of data is lower than others. Take disease detection: maybe there's a rare disease that we're trying to detect. Even if you have a dataset of millions of people, maybe only 1% of them actually have that disease. If we just go by that natural dataset, that's not going to be very effective, because the model is overwhelmingly trained on cases where the disease is absent. Similarly, when you build a dataset for something like weather prediction, maybe you forget to take into account all the different areas. There are just many ways in which imbalance, arising from the natural occurrence of data, can creep into your training data. And sometimes there are issues beyond this.
If there's an issue in the way data has been collected and stored, especially if you're looking at sensor-related data collection, maybe sensors fail at certain points, or maybe the network between the collection solution and the storage solution fails. Because we're collecting so much data, we're not highly concerned about one or two data points failing, but if they are critical to capturing the anomalies, the unlikely data, then that's something that can often be overlooked.
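A first diagnostic step for the kind of skew Snehal describes can be as simple as counting how often each class appears in the training labels. A minimal pure-Python sketch, where the "healthy"/"diseased" labels and the 1% figure are just illustrative numbers taken from the conversation, not a real dataset:

```python
from collections import Counter

def class_distribution(labels):
    """Return each class's share of the dataset, so skew is easy to spot."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}

# illustrative dataset: 99% healthy, 1% diseased
labels = ["healthy"] * 990 + ["diseased"] * 10
print(class_distribution(labels))  # {'healthy': 0.99, 'diseased': 0.01}
```

A 99/1 split like this is a strong hint that accuracy alone will be a misleading metric, which comes up later in the conversation.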
Shane McAllister: I suppose as the complexity of the dataset increases, this only gets amplified more and more. I think we like to collect all this data; there's an old adage that data never lies. But as you say, if we're not looking at this data, if we're not looking at the bias of this data, then it could be biased towards, as you mentioned, the majority. When we apply machine learning and AI to that, it skews things further, correct?
Snehal Bhatia: Yeah, absolutely. In general, what research has been trying to show, and of course there's no one conclusive fact we can state, because it varies from use case to use case, is that as the complexity of the dataset increases, the impact of the imbalance also increases. If we go back to the weather prediction use case I was talking about, a couple of years ago it would have just been based on historical trends of what the weather has been like in an area. But now, because of global warming, it's often hard to predict things based on historical trends. So, such datasets are also enriched with data collected from physical sensors placed around, just to get more real-time weather input, and maybe with sentiment analysis carried out on social media platforms to see how people are feeling about weather conditions in a particular location. These applications often have to operate in real time. It's hard to determine whether devices will fail, and maybe no one is talking about the weather at a given moment, so you can't always rely on sentiment analysis in real time either. So, there's all this real-time data collection enriching the existing datasets, which makes it really hard to sit back and think about how to make the dataset balanced.
Shane McAllister: I suppose there's probably a cautionary tale in some of this as well, the more we rely on AI and ML to look at our datasets. Unless we're keeping a watchful eye on the outcomes, policing them in many respects, automated systems can go astray. I think I talked to somebody before on the podcast about how most stock trading is now automated, and I would imagine the algorithms and predictions around that. If you have increasing complexity of a dataset in that space, then you're putting real money at risk as well. Give us an example of some other real-life scenarios where data imbalance could have a real effect. For example, as you were saying earlier, we might skip a data point or two because we think they're outliers. You mentioned cancer studies, where you hope that it's a very small fraction of patients that exhibit those symptoms, but in fact, that's exactly what you're looking for. On the flip side, you're saying, "We want to monitor all of the weather stations," but we see this massive anomaly; let's not ignore that. How does that work out? Are there tools that can be used in machine learning algorithms to tease this out and manage these outliers?
Snehal Bhatia: So I think if we go back to how exactly some of the machine learning algorithms are impacted: if you look at one of the very simple ones, like logistic regression, what it's doing is simply trying to find a boundary between two different types of data that exist in the dataset. This boundary gets pushed in favor of one class or the other based on the formula behind it. What's likely to happen is that the boundary will shift towards the majority class in the dataset. So, when you give a new input to this algorithm, it will likely produce a result that corresponds with how the majority of the dataset behaves, right? You mentioned the cancer detection use case, or let's say fraud detection in transactions, or detecting nuclear leaks, things like that. These are rare occurrences, thankfully.
Shane McAllister: Thankfully, yes.
Snehal Bhatia: So when you think about this, there are multiple kinds of metrics we might use to evaluate the efficiency of such algorithms, and accuracy is a big one. Accuracy is essentially defined as the ratio of correct predictions to the total number of predictions. That means that if you use accuracy on a highly imbalanced dataset, you can actually end up with misleading interpretations. Going back to the cancer detection use case: if you're working with 100 patients in a hospital, of which 99 are healthy and only 1 has cancer, then a classifier trained on this dataset may label the cancer patient as healthy and still have an accuracy of 99%. That's a very high accuracy to have. If you extrapolate the scenario to a dataset of 1 million samples, it can lead to 10,000 cancer-affected patients being labeled as cancer free, which really highlights the gravity of the situation. On the flip side, if we're too cautious, what can happen is that too many people are detected as having a disease. That means too many tests and a waste of medical resources, which we all know how important they are. So, it does have real-life impact. The reason we don't see it so much now is that in areas like cancer detection or tsunami detection, people might actually be very conscious of the fact that these are anomalies. So, the algorithms might be trained accordingly, and we still have a lot of human input going into this as well. There's probably not a case today where, if an algorithm decided whether a patient had cancer or not, a doctor wouldn't take a second look.
That's not going to happen today. But as technology progresses and our confidence in these AI solutions increases, it will probably come to a point where we take the word of the AI solution to be correct without even going back and having a second look. The thing is, this seemed like a distant problem up until a year ago, but with the pace at which it's progressing, it's definitely something we need to start thinking about now.
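To make the numbers from the 99-healthy, 1-cancer example concrete, here's a small pure-Python sketch. The "classifier" here is a deliberately degenerate one that always predicts the majority class; it still scores 99% accuracy while missing the one case that matters, which is exactly the misleading interpretation Snehal describes:

```python
# 100 patients: 99 healthy, 1 with cancer (the minority class)
actual = ["healthy"] * 99 + ["cancer"]

# a degenerate classifier that always predicts the majority class
predicted = ["healthy"] * 100

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print(accuracy)  # 0.99 -- looks excellent, yet the one cancer case is missed

# recall on the minority class exposes the failure
true_pos = sum(a == p == "cancer" for a, p in zip(actual, predicted))
recall = true_pos / actual.count("cancer")
print(recall)  # 0.0
```

Scaled up to 1 million samples with the same 1% prevalence, that same 99% accuracy hides 10,000 missed cases.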
Shane McAllister: As I said at the beginning, you see it an awful lot in the media now. When will the AI take over? You see clips from The Terminator movies being surfaced in a lot of articles as well. That was a great explanation of logistic regression. I know in Vector Search, we use approximate nearest neighbor search. So, for MongoDB Vector Search, when looking for these anomalies, there's a nearest neighbor logic underlying it as well.
Snehal Bhatia: Yeah, absolutely. So, vector search is based on the kNN algorithm, or the k-nearest neighbors algorithm, just like you mentioned. That's an algorithm used for classification problems, just like logistic regression is, but also in many other use cases, such as stock price prediction, recommender systems, or predicting the likelihood of a loan being approved based on your credit score. What that algorithm does, essentially, is find the majority class among a point's nearest neighbors. So, if you have data points that are related to each other, it's going to look at what a point's neighbors are like, consider a number of neighboring or similar data points, and then make predictions based on that. If you think about things like stock prediction, it's not just putting things in one category or another; it's actually forecasting as well. The impact of an imbalanced dataset here really depends on how big a neighboring set we're considering and how much similar data we have in that neighboring set. Again, the sensitivity to imbalanced data can be considerably less if the value of k in kNN is small, so if only a small number of neighbors is considered for classification. But if the value of k is large, it becomes more sensitive, because it's more likely that the nearest neighbors of a sample will belong to the majority class. This is an example where it's not just about the dataset anymore; it's also about how we design the algorithms. It's about being conscious of the balance between understanding our dataset and making sure we're putting all the thought we need into the algorithm during training, to make sure we get the right results.
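The sensitivity to k that Snehal describes can be sketched with a toy nearest-neighbor classifier. The one-dimensional data points and the "rare"/"common" labels below are invented purely for illustration (MongoDB's Vector Search uses its own approximate nearest neighbor implementation, not this sketch): with a small k the minority cluster wins the vote, while a large k drags the prediction toward the majority class.

```python
from collections import Counter

def knn_predict(points, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # sort training points by distance to the query, keep the k closest
    nearest = sorted(points, key=lambda p: abs(p[0] - query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# 3 minority "rare" points clustered near 0, 10 majority "common" points near 1+
data = [(0.0, "rare"), (0.1, "rare"), (0.2, "rare")] + \
       [(1.0 + 0.1 * i, "common") for i in range(10)]

print(knn_predict(data, 0.05, k=3))  # rare   -- small k respects the local cluster
print(knn_predict(data, 0.05, k=9))  # common -- large k pulls in majority votes
```

With k=9, the vote is 3 "rare" against 6 "common", so the query sitting right inside the minority cluster is still labeled with the majority class.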
Shane McAllister: So essentially, Snehal, the size of your dataset is really important as well. In AI and machine learning, we hear about the large language models, obviously. When they're large, they're enormous, as is the quantity of data being consumed. I know we've a number of projects underway at MongoDB in that space, particularly to help people code better using a model that's been trained on MongoDB, essentially, adding to all the languages that we support. So, in all of this, depending on how it's managed and what algorithms are applied, there could obviously be a bias. We've seen this reported before, where the bias of data used to train face recognition on human faces, et cetera, has backfired. Tell us a little bit about those scenarios.
Snehal Bhatia: So I think if we go back to why we're really discussing this issue, it's not just because we want our algorithms to be more performant or more accurate. Of course, we want all those things, but the reason this is really relevant right now is because we are starting to see all these technologies creep into our everyday lives. This is where it starts impacting society and individuals. You talked about bias in face recognition systems. Tech companies have typically emerged from, let's say, the Americas. If you are using easily available datasets from there, it's very possible some ethnic minorities from other countries are not fully represented in them. We saw cases where a phone wouldn't unlock when a person from a minority group was trying to look at it, or where it would unlock for the wrong person among the faces of minority individuals, which is just one of the problems. The bigger problem is that the existing human or systemic biases we have right now can actually get amplified to a much, much wider scale, because a technology isn't developed just for a particular region; it's propagated globally. Beyond these inconveniences of your phone not unlocking or the voice detection algorithm not recognizing vernacular accents, think about something like a loan approval service based on credit rating: you can get unfair treatment, you can get denied services and resources. If these kinds of systems are used in criminal justice, you can be subject to discrimination and wrongful punishments or wrongful decisions. Then, in even more vital spaces such as healthcare, we talked about the disease detection algorithm, where you can have either too few people detected with the disease or too many. Both are harmful, either to the individual or to the wider pool of medical resources.
Autonomous vehicles are trained on datasets of road occurrences and things like that. But if cars are designed to be shipped globally, then the dataset should reflect the driving conditions and behaviors of different countries as well. There are very interesting studies on how different countries perceive what is ethical in driving, and how, based on their culture and societal structures, they perceive who should be punished when something goes wrong. In developed countries, it's mostly observed that people believe the person making the mistake should be punished. But in developing countries, there are so many other factors: you might make a mistake not through any fault of your own, but because no one else around you was following the laws, which I have personally seen firsthand growing up in New Delhi. So just straight out saying the person who made the mistake in a road accident should be punished is wrong, because there's so much context behind it, and it's very easy to overlook these minor differences and the differences in conditions globally.
Shane McAllister: We could certainly do a whole podcast on AI and machine learning for autonomous vehicles. We've all seen those dilemmas of the group of school kids versus the old lady crossing the road: what does the autonomous car do? I'm a mad car fan, as anybody who knows me will know. But when we see the autonomous vehicle videos, particularly those emanating from the US, where most of that work has gone on so far, they're big wide roads with really good weather conditions, proper turns, right-hand and left-hand crossroads, those things. I live in Ireland. Our roads are pretty appalling, and our weather can be pretty hideous at times as well. So, I've yet to see a good video of a car driving an unmarked road with rain coming in sideways. I look forward to that someday, whenever they get to it. That's been a really good discussion so far on the causes of the imbalance, the potential risks, and the real impact, all of those examples that you've set out, Snehal. We tend to look at technology in the abstract, oh, it doesn't affect me, but these, as you say, are decisions that machines and computers and programs and software are making on our behalf. Probably the most familiar for many might be getting rejected for a loan or something like that: the computer says no. We're all used to those forms that you fill in. What has come out of all of this in terms of ways of addressing the imbalance problem? Are there a number of tools and methods that, if done properly and applied correctly, can level-set this again?
Snehal Bhatia: So I think the first step is being aware that the problem exists and wanting to do something about it from the get-go. But then, if you think about what some of the ways are, one of the very intuitive ones is just to level the dataset, right? Resample the data until you find equality in the data points, of course in cases where that makes sense. For example, you can do oversampling, which is just repeating random data points from the underrepresented, minority classes until a balance is achieved. It's easy to do, and it can work well in some situations. If you think about an image dataset, you don't have to just repeat an image: you can rotate it, you can blur it, you can stretch it out. They're still slightly different samples, but you are augmenting a class that was in the minority before.
Shane McAllister: I see.
Snehal Bhatia: Of course, there's a possibility that it can cause what we call overfitting in machine learning algorithms, which means the model becomes too focused on a certain kind of data, because it sees it occurring so many times. So, this repetition has to be done smartly, so that the repeated data brings some variety into the mix and the model doesn't just learn to parrot back the same thing over and over again. It might not always be possible to just pick random data points and modify them. I talked about images, but in text and language, it may be very challenging to repeat something with a variation while making sure everything else is still preserved. That's where generative networks themselves can be used for synthetic data generation. I alluded to my thesis previously; what I was trying to do there was use GANs, or generative adversarial networks, to generate fake but realistic data points that match the minority class. It gets to a point where the network is generating very real-looking data points; you can't tell one from the other. That's a really good way of augmenting your dataset to make sure it's balanced. So that's increasing the amount of data, which is very intuitive. Another way of doing this is undersampling, which is actually reducing the representation of the majority class. That feels really counterintuitive, because we're deleting data. No one wants to delete data.
Shane McAllister: It does. Yeah.
Snehal Bhatia: We all want more data.
Shane McAllister: That's never a good thing. Yeah.
Snehal Bhatia: Yeah. I don't think that's the first thing that pops up, but in some cases, if done very carefully, it can actually be a preferable approach, because, as we talked about earlier, more data doesn't always mean a better model, especially if you're thinking about cases where we're retraining existing models to fit our specific dataset, which is what is actually happening. If you think about all these AI products out there in the market today, we pick them up and retrain them on our own dataset. It doesn't require an extensive amount of retraining; we just need to contextualize it. So, in those cases, undersampling may be easier than oversampling, of course while making sure we don't discard valuable information or accidentally reduce the diversity of the training data in the process. So, increasing or decreasing the dataset to make sure it's balanced is one way of doing it. Another way is to look at it from an algorithmic perspective. Instead of altering the data distribution or the dataset, you can make adjustments to the learning process in a way that increases the importance of the minority class. You can shift the decision threshold to reduce the bias towards the majority classes, or assign weighting in the training process so that underrepresented classes are given more weight. Such modifications, as we discussed in the case of the k-nearest neighbors algorithm as well, really require domain knowledge and problem-specific expertise. You need to know what it is you're solving for. You can't just be a machine learning engineer looking at it as a purely technical problem. You have to have an idea of what the problem is, what the context is, what the social context is, what factors we can monitor and how. That, or trial and error. So, again, it requires a bit of thought to go behind it.
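The random oversampling discussed above can be sketched in a few lines of pure Python. The samples and labels here are invented for illustration, and a real project would more likely reach for a library such as imbalanced-learn, but the core idea is just duplicating randomly chosen minority-class points until the classes are level:

```python
import random

random.seed(42)  # fixed seed only so the example is repeatable

def oversample(samples, labels, minority):
    """Randomly duplicate minority-class samples until classes are balanced."""
    majority_count = sum(1 for l in labels if l != minority)
    minority_indices = [i for i, l in enumerate(labels) if l == minority]
    out_samples, out_labels = list(samples), list(labels)
    while out_labels.count(minority) < majority_count:
        i = random.choice(minority_indices)
        out_samples.append(samples[i])
        out_labels.append(labels[i])
    return out_samples, out_labels

# illustrative: 6 majority readings, 2 minority readings
samples = [10, 11, 12, 13, 14, 15, 99, 98]
labels = ["common"] * 6 + ["rare"] * 2
balanced_samples, balanced_labels = oversample(samples, labels, "rare")
print(balanced_labels.count("rare"), balanced_labels.count("common"))  # 6 6
```

This is the crude "repeat as-is" variant; as noted in the conversation, adding variation (rotations and blurs for images, synthetic GAN samples for harder domains) helps avoid the overfitting that plain repetition invites.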
Shane McAllister: As we've said all along, with the speed at which all of this is happening and the speed at which these new methods are coming out, is it nearly chicken and egg? In other words, we need a lot of data to figure out whether our data needs to be oversampled or undersampled, and we need to apply the algorithm to that data to figure out whether it's the algorithm that needs to change. You mentioned trial and error there. There is a bit of this going on all of the time, I suppose, to figure out the best approach. How would you measure the efficiency of an algorithm? What metrics would you bring forward to figure out whether we're doing the right thing?
Snehal Bhatia: Yes, that's an interesting one. We discussed, in the context of the disease detection algorithm, the use of accuracy, which is the ratio of correct predictions to total predictions. This is the most commonly used metric for machine learning algorithms of any kind, but we saw how it could give you misleading interpretations. There's another issue too: if the training dataset is imbalanced but the testing dataset is not, because we usually split the dataset into train and test sets, the training is done on an imbalanced dataset but the testing shows a different kind of result. We might then conclude that even if we change or improve the accuracy measure, the discrimination that occurs doesn't really change. So being conscious about splitting the whole dataset in a consistent way is also important. Accuracy is one metric, but it's not enough by itself. You need to think about other measures as well, such as precision, which measures, out of all the positive predictions the model made, how many were actually correct. Hand in hand with that goes recall, the true positive rate, which measures, out of all the actually positive cases, how many the model correctly identified. So, you see the prediction and then you go back and check what the actual label was. Combining these two gives another metric called the F1 score, because, for multiple reasons, precision and recall alone are not enough. This might be very boring to someone who's not actually training models, but it's just another way of saying we need to make sure we're measuring the right metrics. Accuracy by itself is not enough, and that's often the main one people tend to measure.
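The four metrics just described can be computed side by side in a short pure-Python sketch. The ten actual/predicted labels below are an invented example (with "pos" standing for the minority class, such as a disease case), chosen so that accuracy looks healthier than precision, recall, and F1 do:

```python
def metrics(actual, predicted, positive):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # true positives
    fp = sum(a != positive and p == positive for a, p in pairs)  # false positives
    fn = sum(a == positive and p != positive for a, p in pairs)  # false negatives
    accuracy = sum(a == p for a, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# invented example: 3 actual positives, 3 positive predictions, 2 of them right
actual    = ["pos", "pos", "pos", "neg", "neg", "neg", "neg", "neg", "neg", "neg"]
predicted = ["pos", "pos", "neg", "pos", "neg", "neg", "neg", "neg", "neg", "neg"]
print(metrics(actual, predicted, "pos"))  # accuracy, precision, recall, F1
```

Here accuracy comes out at 0.8 while precision, recall, and F1 all sit at two thirds, which is exactly the gap Snehal warns about: on imbalanced data, the class-aware metrics tell you what accuracy hides.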
Shane McAllister: We tend to talk about AI and ML as abstracting the humans away from the process, but it seems to me that there's still a very valid place for almost train-the-trainer, or train-the-machine, oversight: keeping an eye on things to make sure everything is going well. That's probably a very stacked process, layers upon layers, as things move on. What else can we do? Usually, we see these stories come out about AI and ML when they affect people. In essence, if a computer system breaks down and a few websites are offline, or the banking system goes down for a couple of hours, it's the fact that people couldn't do what they wanted to do. So, when it comes to AI and ML, particularly for human-facing technology, things that we interact with, what else can we do to make sure the process is robust, and that everything is as balanced and as functional as possible, as we remove more and more of these decisions and this monitoring from the human stack and put it over to the tech stack, as it were?
Snehal Bhatia: Yeah, so I think that's a wider question. It goes beyond the issue of whether the dataset is balanced or not, which was the theme so far. As you mentioned, this is more of a human-level question. A few of the techniques that have been suggested when we talk about the ethical implications of technology are, for example, white box algorithms: anyone and everyone, even non-technical people, should be able to understand exactly how the decision making is happening. This has been a particular problem in the world of AI, because we don't understand it. Even as designers of these algorithms, we don't understand exactly how these neural networks are assigning weights or how they're making predictions. So by default, it is a black box algorithm even for someone who's very intimate with it. One of the suggested techniques is explainability in AI. Any decisions that algorithms make that can affect people should be explainable; their workings, their understandability, and their fairness should be exposed in whatever capacity it is possible to do so. But of course, if we think about it, these projects are usually made by companies, and they might not want to expose their IP. They might not want to share that much information about how they're advancing. Another consideration is that exposing this much detail might render these systems vulnerable to social engineering or hacking, so appropriate measures need to be taken. An alternative approach is carrying out validation on these tools, which is black box validation. If designing transparent systems is not possible, then the testing of these systems should be done thoroughly and transparently, which means you test the algorithm in every possible or thinkable situation and you make those tests public.
So if, for example, I'm looking at those published tests and I go, "Wait, my situation was not tested here," I can feed that back, and that democratizes the testing of it in a way. Then of course, we need more of a code of ethics for developers and designers, so that we can ensure that personal bias or unintentional systemic bias doesn't creep into these systems. I always believe this, and I see it said very often: technology advances way faster than laws, legal systems, and frameworks do. So definitely, we need a lot more regulation as well, from governments and from legal bodies, and it needs to keep pace with the evolution of the technology. Ultimately, it's about spreading education and awareness. If every person, regardless of what they do, is going to be affected by technology, then it's only fair that every person, regardless of their educational background or profession, should understand what that technology does.
Shane McAllister: I think, again, like some of the other topics in our conversation, this is an episode of its own. As you said, companies want to have a technical advantage. They want to have a competitive advantage, and that's not always exposed. We've seen that across the board with many companies. The code of ethics, again, depending on where the technology originated, can very much vary. I know particularly in Europe, we see a lot of clamping down on some of the major tech companies through European regulations that don't apply to them in the US or in Asia and other regions. So, this issue of a code of ethics for companies, and you mentioned it in particular with respect to developers and designers, I think that's an intrinsic code of ethics most people might have themselves, but when it comes to companies, it's something different. Above all, it's probably the last point you mentioned there, Snehal, that resonated with me the most: education and awareness. We cannot say, "The machine says no." We cannot say, "The black box just did X." Too many times we rely on that as almost a first line of defense when something doesn't go correctly. So, education and awareness as we go forward is key. From my particular perspective as a developer, one of the things I mentioned in the intro was AI creating your code for you. Who's policing that code? Who's debugging that code? Who's making sure it's done correctly? In much the same way, we don't know a lot about AI at the moment, and the speed at which it's happening is scary to some, but I think this will open up opportunities going forward. It's the same way that social media created a whole new industry with its own opportunities. Yes, it has its own faults. Before mobile, before apps, there were still developers developing things; now we just have different devices on which to consume this.
I think the key, as I said, is that education and awareness piece. It's very easy to look at the stories we see about ML and AI mostly in the negative, because that's what sells newspapers and drives clicks. But going back to one of your examples, if we can use AI for early detection of the onset of cancer that could be missed by the human eye, that in itself is a huge bonus to humanity as a whole. I think there is a tendency to think that AI will take over everything. What are your own thoughts on that? Is there always a machine learning or AI solution for all of the problems we're looking at these days?
Snehal Bhatia: Yeah, I think that's the overarching question here. We need to stop and ask, "Do we really need an LLM for this? Do we really need every single shopping website's chatbot to be powered by ChatGPT? Do we really need to introduce LLMs into, let's say, historical analysis that's pretty easily done with the tools we have right now?" I'm not saying that because I'm against it. Of course, ultimately we want to personalize, we want to improve the experiences, but we have to really think about the impact it has on society and on people, as we discussed, but also on the environment. The MIT Technology Review reported that training just one AI model can emit the equivalent carbon dioxide of the lifetime emissions of an average American car. Five times that, actually.
Shane McAllister: Oh, wow.
Snehal Bhatia: That's five times the lifetime emissions of a car, just for training one model. And think about how many models are being trained at this very second as we speak. Also, because these are very demanding algorithms, the hardware can be rendered obsolete more quickly. Of course, there's specialized hardware as well, but not all of us have access to it. Things haven't yet evolved to the point where every single data center is ready to train machine learning algorithms and do that level of computation, and it becomes really hard to deal with the resulting waste management. So the environment is one consideration, and society is another. Do we really need to introduce it right now? That's a question we need to ask for every single use case. In some places, yes, of course, it makes total sense, but where it doesn't, the wider implications need to be weighed.
Shane McAllister: So do we need some AI police? Do we need somebody to say, "Should you apply this?" Touching on the environmental side certainly brings us back to your original podcast with us as well. Obviously, as these models progress, there's more storage required, there's more compute required. We all think of this almost as a throwaway. We're used to asking our voice assistants for the answer to something, but we fail to see the chain that initiates in the background, and the potential environmental footprint there. I'm based in Ireland, and I know we have an ongoing debate because we have a huge number of data centers based in Ireland. They are consuming nearly 30% of the country's power generation as a whole, because all the major providers are there. It is a concern as we increase the amount of data required for these large language models, for true AI and machine learning to really benefit and come into their own. There is that unseen impact. Just like the black box effect you talked about with the software and the algorithms, we have a black box effect here too: there's a big building somewhere that's a data center, it does things, and we need to be concerned about that. This was an area I was not so familiar with prior to jumping on this conversation with you. I am intrigued now. Everything you've gone through here has been so well explained, and I do appreciate that, Snehal. Any last comments for our audience if they're interested in this or want to learn more? Are there any sites you use to keep abreast of all the changes in this space?
Snehal Bhatia: Really, all over the place. I tend to rely a little bit more on research papers than on blog posts and articles, because like you mentioned, they're always going, "Oh, AI will end the world." I don't believe that. I just believe we need to be more conscious about it. I don't have a specific source to recommend, but what I would recommend is looking not at news articles, blog posts, and opinionated pieces, unless they're from people you trust, but rather at actual research metrics and research papers, things that have been proven out and tested.
Shane McAllister: Excellent. If we all took that view and didn't go to the socials and the media outlets, we would probably be much better informed generally. I know for me, if I hear radio shows or TV programs talking about development and computing and they bring on these experts, as my wife says, I always end up shouting at the television or the radio. Look, hey, this is a podcast, so it's somewhat a sound bite as well. This is not a deep dive into this area, but it is important to go to the source, because there's too much out there, particularly now with the speed of the advancements in AI since ChatGPT entered the world properly and publicly, I suppose, last November or so. Ever since then, it's all AI, AI, AI. My youngest used it to do his homework one time, and I was quite impressed as a technical parent, but also quite annoyed that he skipped doing his homework properly and asked ChatGPT to do it for him. So look, we'll have to watch this space. That's possibly another episode in itself: the education of our children going forward when every answer is just a click away, where does that leave us? But a lot of food for thought in this conversation. Snehal, thank you so much for so clearly explaining the impact of dataset imbalance and how we can assess it in our algorithms and our data. Above all, we are very much still at the beginning of this space, and the tools and the knowledge are still evolving at a rapid rate, so we need to keep an eye on this. Snehal Bhatia, thank you so much for joining us on the MongoDB Podcast.
Snehal Bhatia: Thank you for having me.
Shane McAllister: A fascinating conversation as ever with Snehal there. I certainly learned a huge amount about the ever-advancing area of machine learning and AI. It seems that we're only getting started in our understanding of the ramifications of offloading decision-making processes to automated systems. Thanks to Snehal for enlightening us. As ever, if you enjoy our podcast, don't forget to subscribe and leave a review wherever you get your podcasts. We really do appreciate it. So, from me, Shane McAllister, and the rest of the podcast team here at MongoDB, until next time. Take care and thanks for listening.