Marlos C. Machado is a Fellow in Residence at the Alberta Machine Intelligence Institute (Amii), an adjunct professor at the University of Alberta, and an Amii fellow, where he also holds a Canada CIFAR AI Chair. Marlos’s research mostly focuses on the problem of reinforcement learning. He received his B.Sc. and M.Sc. from UFMG, in Brazil, and his Ph.D. from the University of Alberta, where he popularized the idea of temporally-extended exploration through options.
He was a researcher at DeepMind from 2021 to 2023 and at Google Brain from 2019 to 2021, during which time he made major contributions to reinforcement learning, in particular the application of deep reinforcement learning to control Loon’s stratospheric balloons. Marlos’s work has been published in the leading conferences and journals in AI, including Nature, JMLR, JAIR, NeurIPS, ICML, ICLR, and AAAI. His research has also been featured in popular media such as BBC, Bloomberg TV, The Verge, and Wired.
Your primary focus has been on reinforcement learning. What draws you to this type of machine learning?
What I like about reinforcement learning is this concept, it's a very natural way, in my opinion, of learning, that is you learn by interaction. It feels that it's how we learn as humans, in a sense. I don't like to anthropomorphize AI, but it's just like it's this intuitive way of you'll try things out, some things feel good, some things feel bad, and you learn to do the things that make you feel better. One of the things that I am fascinated about reinforcement learning is the fact that because you actually interact with the world, you are this agent that we talk about, it's trying things in the world and the agent can come up with a hypothesis, and test that hypothesis.
The reason this matters is because it allows discovery of new behavior. For example, one of the most famous examples is AlphaGo, the move 37 that they talk about in the documentary, which is this move that people say was creativity. It was something that was never seen before, it left us all flabbergasted. It's not anywhere, it was just by interacting with the world, you get to discover those things. You get this ability to discover, like one of the projects that I worked on was flying visible balloons in the stratosphere, and we saw very similar things as well.
We saw behavior emerging that left everyone impressed and like we never thought about that, but it's brilliant. I think that reinforcement learning is uniquely situated to allow us to discover this type of behavior because you're interacting, because in a sense, one of the really difficult things is counterfactuals, like what would have happened if I had done that instead of what I did? This is a super difficult problem in general, but in a lot of settings in machine learning studies, there is nothing you can do about it. In reinforcement learning you can ask, “What would have happened if I had done that?” I might as well try next time that I'm experiencing this. I think that this interactive aspect of it, I really like it.
Of course I am not going to be hypocritical, I think that a lot of the cool applications that came with it made it quite interesting. Like going back decades and decades ago, even when we talk about the early examples of big success of reinforcement learning, this all made it to me very attractive.
What was your favorite historical application?
I think that there are two very famous ones, one is the flying helicopter that they did at Stanford with reinforcement learning, and another one is TD-Gammon, which is this backgammon player that became a world champion. This was back in the '90s, and so during my PhD, I made sure that I did an internship at IBM with Gerald Tesauro, and Gerald Tesauro was the guy leading the TD-Gammon project, so it was like this is really cool. It's funny because when I started doing reinforcement learning, it's not that I was fully aware of what it was. When I was applying to grad school, I remember I went to a lot of websites of professors because I wanted to do machine learning, like very generally, and I was reading the description of the research of everyone, and I was like, “Oh, this is interesting.” When I look back, without knowing the field, I chose all the famous professors in reinforcement learning, not because they were famous, but because the description of their research was appealing to me. I was like, “Oh, this website is really nice, I want to work with this guy and this guy and this woman,” so in a sense it was-
Like you found them organically.
Exactly, so when I look back I was saying like, “Oh, these are the people that I applied to work with a long time ago,” or these are the papers that before I actually knew what I was doing, I was reading the description in someone else's paper, I was like, “Oh, this is something that I should read,” it consistently got back to reinforcement learning.
While at Google Brain, you worked on autonomous navigation of stratospheric balloons. Why was this a good use case for providing internet access to difficult to reach areas?
That I'm not an expert on, this is the pitch that Loon, which was the subsidiary of Alphabet, was working on. When going through the way we provide internet to a lot of people in the world, it's that you build an antenna, like say build an antenna in Edmonton, and this antenna, it allows you to serve internet to a region of let's say five, six kilometers of radius. If you put an antenna in downtown New York, you are serving millions of people, but now imagine that you're trying to serve internet to a tribe in the Amazon rainforest. Maybe you have 50 people in the tribe, the economic cost of putting an antenna there, it makes it really hard, not to mention even accessing that region.
Economically speaking, it doesn't make sense to make a big infrastructure investment in a difficult to reach region which is so sparsely populated. The idea of balloons was just like, “But what if we could build an antenna that was really tall? What if we could build an antenna that is 20 kilometers tall?” Of course we don't know how to build that antenna, but we could put a balloon there, and then the balloon would be able to serve a region with a radius that is 10 times bigger, or if you talk about area, then it's a 100 times bigger area of internet. If you put it there, let's say in the middle of the forest or in the middle of the jungle, then maybe you can serve several tribes that otherwise would require a single antenna for each one of them.
Serving internet access to these hard to reach regions was one of the motivations. I remember that Loon's motto was not to provide internet to the next billion people, it was to provide internet to the last billion people, which was extremely ambitious in a sense. It's not the next billion, but it's just like the hardest billion people to reach.
What were the navigation issues that you were trying to solve?
The way these balloons work is that they are not propelled, just like the way people navigate hot air balloons is that you either go up or down and you find the windstream that is blowing you in a specific direction, then you ride that wind, and then it's like, “Oh, I don't want to go there anymore,” maybe then you go up or you go down and you find a different one and so on. This is what it does as well with those balloons. It’s not a hot air balloon, it's a fixed volume balloon that's flying in the stratosphere.
All it can do in a sense from navigational perspective is to go up, to go down, or stay where it is, and then it must find winds that are going to let it go where it wants to be. In that sense, this is how we would navigate, and there are so many challenges, actually. The first one is that, talking about formulation first, you want to be in a region, serve the internet, but you also want to make sure these balloons are solar powered, that you retain power. There's this multi-objective optimization problem, to not only make sure that I'm in the region that I want to be, but that I'm also being power efficient in a way, so this is the first thing.
This was the problem itself, but then when you look at the details, you don't know what the winds look like, you know what the winds look like where you are, but you don't know what the winds look like 500 meters above you. You have what we call in AI partial observability, so you don't have that data. You can have forecasts, and there are papers written about this, but the forecasts often can be up to 90 degrees wrong. It's a really difficult problem in the sense of how you deal with this partial observability, it's an extremely high dimensional problem because we're talking about hundreds of different layers of wind, and then you have to consider the speed of the wind, the bearing of the wind, and, the way we modeled it, how confident we are in that forecast, the uncertainty.
This just makes the problem very hard to reckon with. One of the things that we struggled with the most in that project is that after everything was done and so on, it was just like how can we convey how hard this problem is? Because it's hard to wrap our minds around it, because it's not a thing that you see on the screen, it's hundreds of dimensions and winds, and when was the last time that I had a measurement of that wind? In a sense, you have to ingest all that while you're thinking about power, the time of the day, where you want to be, it's a lot.
What's the machine learning studying? Is it simply wind patterns and temperature?
The way it works is that we had a model of the winds that was a machine learning system, but it was not reinforcement learning. You have historical data about all sorts of different altitudes, so then we built a machine learning model on top of that. When I say “we”, I was not part of this, this was a thing that Loon did even before Google Brain got involved. They had this wind model that was beyond just the different altitudes, so how do you interpolate between the different altitudes?
You could say, “let's say, two years ago, this is what the wind looked like, but what it looked like maybe 10 meters above, we don't know”. Then you put a Gaussian process on top of that, so they had papers written on how good of a modeling that was. The way we did it is you started from a reinforcement learning perspective, we had a very good simulator of dynamics of the balloon, and then we also had this wind simulator. Then what we did was that we went back in time and said, “Let's pretend that I'm in 2010.” We have data for what the wind was like in 2010 across the whole world, but very coarse, but then we can overlay this machine learning model, this Gaussian process on top so we get actually the measurements of the winds, and then we can introduce noise, we can also do all sorts of things.
Then eventually, because we have the dynamics of the model and we have the winds and we're going back in time pretending that this is where we were, then we actually had a simulator.
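As a rough illustration of the wind-modeling idea described above (coarse historical wind data with a Gaussian process laid on top to fill in the altitudes in between), here is a minimal sketch. It is not Loon's model: the zero-mean prior, the RBF kernel, the length scale, the noise level, and all function names are assumptions chosen only to keep the example small and self-contained.

```python
import math

def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel between two altitudes (km). Assumed kernel."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def solve(A, b):
    """Solve A x = b with Gaussian elimination and partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def gp_wind(alts_km, winds, query_km, length_scale=1.0, noise=1e-6):
    """Posterior mean and variance of the wind at query_km (zero-mean GP prior)."""
    n = len(alts_km)
    K = [[rbf(alts_km[i], alts_km[j], length_scale) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [rbf(query_km, a, length_scale) for a in alts_km]
    alpha = solve(K, winds)       # K^{-1} y
    v = solve(K, k_star)          # K^{-1} k_*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(query_km, query_km, length_scale) - sum(k * vi for k, vi in zip(k_star, v))
    return mean, max(var, 0.0)

# Hypothetical coarse measurements: eastward wind (m/s) at four altitudes.
altitudes = [16.0, 17.0, 18.0, 19.0]
east_wind = [5.0, -2.0, 3.0, 8.0]
mean_between, var_between = gp_wind(altitudes, east_wind, 17.5)
```

The variance is the part that matters for navigation: it is close to zero at a measured altitude and grows as the query moves away from the data, which gives the agent a notion of how much to trust the interpolated wind.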
It's like a digital twin back in time.
Exactly, we designed a reward function that was about staying on target and being a bit power efficient, and we had the balloon learn by interacting with this world. It can only interact with this world because we were pretending that we're in the past, since we don't know how to model the weather and the winds, and then we managed to learn how to navigate. Basically it was do I go up, down, or stay? Given everything that is going around me, at the end of the day, the bottom line is that I want to serve internet to that region. That's what was the problem, in a sense.
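To make the shape of that problem concrete, here is a minimal sketch of a three-action control problem with a reward that mixes staying on target with power efficiency. The state fields, the target radius, and the size of the power penalty are all hypothetical; the interview does not give the actual reward used in the project.

```python
import math
from dataclasses import dataclass

# The only controls described in the interview: go down, stay, or go up.
ACTIONS = ("down", "stay", "up")

@dataclass
class BalloonState:
    x_km: float         # horizontal position relative to the target centre
    y_km: float
    altitude_km: float

def reward(state, action, radius_km=50.0, power_penalty=0.05):
    """Hypothetical scalarized reward: +1 while inside the target region,
    minus a small cost whenever the balloon spends power changing altitude."""
    distance_km = math.hypot(state.x_km, state.y_km)
    on_target = 1.0 if distance_km <= radius_km else 0.0
    power_cost = power_penalty if action != "stay" else 0.0
    return on_target - power_cost
```

Collapsing two objectives into one number with a fixed weight is the simplest possible choice and is used here only for illustration.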
What are some of the challenges in deploying reinforcement learning in the real world versus a game setting?
I think that there are a couple of challenges. I don't even think it's necessarily about games and real world, it's about fundamental research and applied research. Because you could do applied research in games, let's say that you're trying to deploy the next model in a game that is going to ship to millions of people, but I think that one of the main challenges is the engineering. If you're working, a lot of times you use games as a research environment because they capture a lot of the properties that we care about, but they capture them in a more well-defined set of constraints. Because of that, we can do the research, we can validate the learning, but it's kind of a safer set. Maybe “safer” is not the right word, but it's more of a constrained setting that we better understand.
It’s not that the research necessarily needs to be very different, but I think that the real world, they bring a lot of extra challenges. It's about deploying the systems like safety constraints, like we had to make sure that the solution was safe. When you're just doing games, you don't necessarily think about that. How do you make sure that the balloon is not going to do something stupid, or that the reinforcement learning agent didn't learn something that we hadn't foreseen, and that is going to have bad consequences? This was one of the utmost concerns that we had, was safety. Of course, if you're just playing games, then we're not really concerned about that, worst case, you lost the game.
This is the challenge, the other one is the engineering stack. It's very different than if you're a researcher on your own to interact with a computer game because you want to validate it, it's fine, but now you have an engineering stack of a whole product that you have to deal with. It's not that they're just going to let you go crazy and do whatever you want, so I think that you have to become much more familiar with that additional piece as well. I think the size of the team can also be vastly different, like Loon at the time, they had dozens if not hundreds of people. We were still of course interacting with a small number of them, but then they have a control room that would actually talk with aviation staff.
We were clueless about that, but then you have many more stakeholders in a sense. I think that a lot of the difference is that, one, engineering, safety and so on, and maybe the other one of course is that your assumptions don't hold. A lot of the assumptions that you make that these algorithms are based on, when they go to the real world, they don't hold, and then you have to figure out how to deal with that. The world is not as friendly as any application that you're going to do in games, it's mainly if you're talking about just a very constrained game that you are doing on your own.
One example that I really love is that they gave us everything, we're like, “Okay, so now we can try some of these things to solve this problem,” and then we went to do it, and then one week later, two weeks later, we come back to the Loon engineers like, “We solved your problem.” We were really smart, they looked at us with a smirk on their face like, “You didn't, we know you cannot solve this problem, it's too hard,” like, “No, we did, we absolutely solved your problem, look, we have 100% accuracy.” Like, “This is literally impossible, sometimes you don't have the winds that let you …” “No, let's look at what's going on.”
We figured out what was going on. The balloon, the reinforcement learning algorithm learned to go to the center of the region, and then it would go up, and up, and then the balloon would pop, and then the balloon would go down and it was inside the region forever. They're like, “This is clearly not what we want,” but then of course this was simulation, but then we say, “Oh yeah, so how do we fix that?” They're like, “Oh yeah, of course there are a couple of things, but one of the things, we make sure the balloon cannot go up above the level that it's going to burst.”
These constraints in the real world, these aspects of how your solution actually interacts with other things, it's easy to overlook when you're just a reinforcement learning researcher working on games, and then when you actually go to the real world, you're like, “Oh wait, these things have consequences, and I have to be aware of that.” I think that this is one of the main difficulties.
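The fix the Loon engineers describe, never letting the balloon climb above the altitude at which it would burst, can be sketched as a simple action mask applied before the agent's choice is executed. The ceiling, floor, and step size below are made-up numbers, and the function names are hypothetical.

```python
def safe_actions(altitude_km, step_km=0.5, ceiling_km=19.0, floor_km=15.0):
    """Return the actions that keep the balloon inside its safe altitude band."""
    actions = ["stay"]
    if altitude_km + step_km <= ceiling_km:
        actions.append("up")
    if altitude_km - step_km >= floor_km:
        actions.append("down")
    return actions

def apply_action(altitude_km, action, step_km=0.5, ceiling_km=19.0, floor_km=15.0):
    """Execute an action, replacing any unsafe request with 'stay'."""
    if action not in safe_actions(altitude_km, step_km, ceiling_km, floor_km):
        action = "stay"
    if action == "up":
        return altitude_km + step_km
    if action == "down":
        return altitude_km - step_km
    return altitude_km
```

With a mask like this, the "go up until the balloon pops" strategy is no longer reachable, whatever the learned policy prefers.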
I think that the other one is just like the cycle of these experiments is really long, like in a game I can just hit play. Worst case, after a week I have results, but then if I actually have to fly balloons in the stratosphere, we have this expression that I like to use in my talks that's like we were A/B testing the stratosphere, because eventually after we have the solution and we're confident with it, we want to make sure that it's actually statistically better. We got 13 balloons, I think, and we flew them over the Pacific Ocean for more than a month, because that's how long it took for us to even validate that everything we had come up with was actually better. The timescale is much different as well, so you don't get that many chances of trying stuff out.
Unlike games, there's not a million iterations of the same game running simultaneously.
Yeah. We had that for training because we were leveraging simulation, even though, again, the simulator is way slower than any game that you would have, but we were able to deal with that engineering-wise. When you do it in the real world, then it's different.
What research are you working on today?
Now I am at the University of Alberta, and I have a research group here with lots of students. My research is much more diverse in a sense, because my students afford me to do this. One thing that I'm particularly excited about is this notion of continual learning. What happens is that pretty much every time that we talk about machine learning in general, we're going to do some computation, be it using a simulator, be it using a dataset and processing the data, and we're going to learn a machine learning model, and we deploy that model and we hope it does okay, and that's fine. A lot of times that's exactly what you need, a lot of times that's perfect, but sometimes it's not, because sometimes the problem is that the real world is too complex for you to expect that a model, it doesn't matter how big it is, was actually able to incorporate everything that you wanted it to, all the complexities in the world, so you have to adapt.
One of the projects that I'm involved with, for example, here at the University of Alberta is a water treatment plant. Basically it's how do we come up with reinforcement learning algorithms that are able to support other humans in the decision making process, or how to do it autonomously for water treatment? We have the data, we can see the data, and sometimes the quality of the water changes within hours, so even if you say that, “Every day I'm going to train my machine learning model from the previous day, and I'm going to deploy it within hours of your day,” that model is not valid anymore because there is data drift, it's not stationary. It's really hard for you to model those things because maybe it's a forest fire that is going on upstream, or maybe the snow is starting to melt, so you would have to model the whole world to be able to do this.
Of course no one does that, we don't do that as humans, so what do we do? We adapt, we keep learning, we're like, “Oh, this thing that I was doing, it's not working anymore, so I might as well learn to do something else.” I think that there are a lot of applications, mainly the real world ones, that require you to be learning constantly and forever, and this is not the standard way that we talk about machine learning. Oftentimes we talk about, “I'm going to do a big batch of computation, and I'm going to deploy a model,” and maybe I deploy the model while I'm already doing more computation because I will deploy a model a couple of days, weeks later, but sometimes the time scale of those things doesn't work out.
The question is, “How can we learn continually forever, such that we're just getting better and adapting?” and this is really hard. We have a couple of papers about this, like our current machinery is not able to do this, like a lot of the solutions that we have that are the gold standard in the field, if you just have something just keep learning instead of stop and deploy, things get bad really quickly. This is one of the things that I'm really excited about, which I think is just like now that we have done so many successful things, deploy fixed models, and we will continue to do them, thinking as a researcher, “What is the frontier of the area?” I think that one of the frontiers that we have is this aspect of learning continually.
I think that reinforcement learning is particularly suited to do this, because a lot of our algorithms, they are processing data as the data is coming, and so a lot of the algorithms, in a sense, would be a natural fit to keep learning. It doesn't mean that they do or that they are good at that, but we don't have to question ourselves, and I think there are a lot of interesting research questions about what we can do.
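A toy sketch of the contrast between "train, then deploy a fixed model" and "keep learning as the data arrives" is given below. Everything in it is an assumption made for illustration: a one-parameter linear model, a single abrupt change in the data halfway through the stream, and a plain stochastic gradient update. It does not reproduce the harder phenomenon mentioned above, where standard deep learning methods degrade when they are simply left to keep learning.

```python
def run_stream(n_steps=2000, step_size=0.05):
    """Compare a frozen model with an online learner on a non-stationary stream.

    Returns the mean squared error of each model over the second half of the
    stream, after the data-generating process has changed.
    """
    half = n_steps // 2
    w_online = 0.0   # keeps updating forever
    w_frozen = 0.0   # stops updating at "deployment" (t == half)
    err_online = 0.0
    err_frozen = 0.0
    for t in range(n_steps):
        x = 1.0 + (t % 5) * 0.25             # deterministic inputs in [1, 2]
        true_w = 2.0 if t < half else -1.0    # the world changes at t == half
        y = true_w * x
        if t >= half:
            err_online += (y - w_online * x) ** 2
            err_frozen += (y - w_frozen * x) ** 2
        # One stochastic gradient step on the squared error, per sample.
        w_online += step_size * (y - w_online * x) * x
        if t < half:
            w_frozen = w_online               # snapshot taken before deployment
    return err_online / half, err_frozen / half

online_mse, frozen_mse = run_stream()
```

The frozen model keeps predicting with the old relationship after the change, while the online learner recovers within a few dozen samples, so its error over the second half is far lower.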
What future applications using this continual learning are you most excited about?
This is the billion-dollar question, because in a sense I've been looking for those applications. I think that in a sense as a researcher, being able to ask the right questions is more than half of the work, so I think that in reinforcement learning a lot of times, I like to be driven by problems. It's just like, “Oh look, we have this challenge, let's say fly balloons in the stratosphere, so now we have to figure out how to solve this,” and then along the way you are making scientific advances. Right now I'm working with other PIs like Adam White and Martha White on this, which is a project actually led by them, on this water treatment plant. It's something that I'm really excited about because it's one that is really hard to even describe with language in a sense, so it's not that all the current exciting successes that we have with language are easily applicable there.
They do require this continual learning aspect, as I was saying, the water changes quite often, be it the turbidity, be it its temperature and so on, and it operates at different timescales. I think that it's unavoidable that we need to learn continually. It has a huge social impact, it's hard to imagine something more important than actually providing drinking water to the population, and sometimes this matters a lot. Because it's easy to overlook the fact that sometimes in Canada, for example, when we go to these more sparsely populated regions like in the northern part and so on, sometimes we don't have even an operator to operate a water treatment plant. It's not that this is supposed to necessarily replace operators, but it's to actually empower us to do the things that otherwise we couldn't, because we just don't have the personnel or the strength to do that.
I think that it has a huge potential social impact, it is an extremely challenging research problem. We don't have a simulator, we don't have the means to procure one, so then we have to use best data, we have to be learning online, so there's a lot of challenges there, and this is one of the things that I'm excited about. Another one, and this is not something that I've been doing much, but another one is cooling buildings, and again, thinking about weather, about climate change and things that we can have an impact on, quite often it's just like, how do we decide how we are going to cool a building? Like this building that we have hundreds of people today here, this is very different than what was last week, and are we going to be using exactly the same policy? At most we have a thermostat, so we're like, “Oh yeah, it's warm, so we can probably be more clever about this and adapt,” again, and sometimes there are a lot of people in one room, not the other.
There's a lot of these opportunities in control systems that are high dimensional, very hard to reckon with in our minds, where we can probably do much better than the standard approaches that we have right now in the field.
In some places up to 75% of power consumption is literally A/C units, so that makes a lot of sense.
Exactly, and I think that for a lot of this in your house, there are already in a sense some products that do machine learning and that then learn from their clients. In these buildings, you can have a much more fine-grained approach, like Florida, Brazil, it's a lot of places that have this need. Cooling data centers, this is another one as well, there are some companies that are starting to do this, and this sounds like almost sci-fi, but there's an ability to be constantly learning and adapting as the need comes. This can have a huge impact on these control problems that are high dimensional and so on, like when we're flying the balloons. For example, one of the things that we were able to show was exactly how reinforcement learning, and specifically deep reinforcement learning, can learn decisions based on the sensors that are way more complex than what humans can design.
Just by definition, you look at how a human would design a response curve, just some sense where it's like, “Well, it's probably going to be linear, quadratic,” but when you have a neural network, it can learn all the non-linearities that make it a much more fine-grained decision, that sometimes it's quite effective.
Thank you for the amazing interview, readers who wish to learn more should visit the following resources: