
Geoffrey Hinton and Yann LeCun, 2018 ACM A.M. Turing Award Lecture "The Deep Learning Revolution"



– Good evening. That worked. I'm Vivek Sarkar from Georgia Tech. And it gives me great pleasure on behalf of all the organizers to welcome you here in Phoenix for FCRC 2019. And to also welcome everyone
connected on live stream. As we all know, and at this point, all our deans, provosts, high tech managers know, our conference is where
you find the cutting edge of computer science research. FCRC was created in 1993 with the idea of having a federated event
every three to four years at which major conferences
can be co-located. This year we have a record
number of 2700 participants in 13 major conferences and many related workshops and tutorials. That was about 20% more than
what we had planned for. Actually on that note, if your cell phone rings this evening, that's 2700 people who will remember you for the rest of your career, including everyone on live stream. Let me take a moment to read out the names of all the conferences to
remind you who all are here. So we have COLT, e-Energy, EC, HPDC, ICS, ISCA, ISMM, IWQoS, LCTES, PLDI, SIGMETRICS, SPAA, and STOC. These conferences cover
as you know a wide range of foundation areas of
computer science research including computer architecture, economics and computation, embedded systems, high performance computing, machine learning theory, measurement and modeling, compilers and programming languages, memory management, parallel algorithms, quality of service, smart energy systems, theory of computing, and many related topics. I'm especially pleased from
an attendance perspective that we also have a record number of over 1100 students attending this year. I'd like to encourage everyone, especially the students to take advantage of the unique opportunity offered by FCRC to attend sessions and conferences outside your research area as
well so you can be exposed to emerging ideas in other
fields of computer science. Just check out the Whova app or the online schedule to see what's going on in the co-located events. Now, a major highlight of FCRC is the opportunity for you
to hear plenary talks by eminent leaders from different areas of computer science, starting of course with this evening's Turing lecture by Geoff Hinton and Yann LeCun. All plenary talks will
be held in this beautiful symphony hall space. During next week, the talks are scheduled at 11:20 in the morning each day with no conflicting events. So be sure to attend. You have no excuse for missing them. As a reminder, the
plenary speakers next week are Jim Smith, Cynthia Dwork, Shriram Krishnamurthi, Jeannette Wing, and Erik Lindahl, and they will all be
introduced by Mary Hall, the plenary speaker chair for FCRC. Though I'll have the opportunity to give a more comprehensive round of thanks to everyone at the end of
the conference on Friday, I'd like to definitely express
my deepest appreciation to the sponsors of all conferences, and especially the sponsors listed here for FCRC as a whole. This is a unique once-in-four-years event. And that's just not possible without the companies listed here stepping up to support FCRC. So thanks to all of them. Also, the entire ACM team
has been working hard to make sure that this
entire week is a success. And I'd like to especially thank Donna Cappo for her tireless leadership of the conference administration team. Donna's been involved with organizing FCRC since its first instance in 1993. And is absolutely vital
to the success of FCRC. So thank you Donna. Yeah. (audience clapping) And finally thanks to all of you for coming to Phoenix for FCRC and for filling this room. I hope you have a great conference and enjoy the numerous interactions with all your CS research colleagues. And with that I would like
to invite Cherri Pancake president of ACM to the stage to introduce the Turing lecture. Thank you. (audience clapping) – Thank you. I'm delighted to be here at FCRC. As Vivek mentioned I have
the honor of being president of ACM, the world's largest society for computing professionals. Did you know that ACM has
almost 100,000 members around the globe? We serve the computing community in 190 countries with our
conferences like these, our publications, webinars, and learning resources. ACM is also very active
in computing education and curriculum guidelines
around the world. It's particularly great to be here at FCRC because of the reasons
that Vivek mentioned. What a unique opportunity it is. We all know that computing has become much more interdisciplinary
but it's not often that we have the chance
to meet and interact with leading researchers from other areas outside our own. I really encourage you to take
advantage of that this week. As all of us know, AI is the
most rapidly growing area in all the sciences. And certainly a hot topic
in society at large. The incredible advances
that we've been seeing in AI would not have been possible without some of the foundations
that were established by people like those
we're honoring tonight. For example, when we think about impact. Think about the research
that went into development of GPUs originally in the gaming industry. Who would have imagined at that point that later they would be
assembled into large arrays and used as a platform
for vast neural networks that in turn have driven, just leapfrogged advances
in fields like robotics and computer vision? The kinds of advances that
we're recognizing tonight are those generally in
the area of deep learning. Billions of people around the world benefit from these machine learning advances. Anybody with a smartphone has access to just amazing advances in things like computer vision
and speech recognition that we never even dreamed
of just a few years ago. Even more importantly perhaps, machine learning has
been giving scientists new tools that are allowing them to make advances in fields from medicine to astronomy and materials science. FCRC only happens every four years. So when we talked about this session we wanted to do something
special for the welcome session. I think you'll agree with me that hearing from this year's recipients of the Turing Award is the way to make it really special. The 2018 ACM A.M. Turing Award was presented just last week in San Francisco to three
pioneers of deep learning. Yoshua Bengio, Geoffrey Hinton, and Yann LeCun. The three of them
collectively and independently worked over a 30 year period to develop first of all
the conceptual foundations for deep neural networks, and then performed experimentation that ended up identifying a lot of very interesting phenomena. But they didn't stop there. They went on to develop
engineering advances that demonstrated conclusively that deep neural nets
could actually be applied in practice and in an economic way. This in turn allowed other people to develop these amazing concepts and advances that we are now benefiting from in so
many different areas. Computer vision, speech recognition, natural language processing, robotics, so many different other areas. So it is with great pleasure that I am able to introduce
tonight's speakers. The first is Geoff Hinton, who will be giving his
Turing lecture on the topic of the Deep Learning Revolution. He will be followed by Yann LeCun who very fittingly has called his talk the Deep Learning
Revolution, the Sequel. So Geoffrey I'd like to welcome you. (audience clapping) – I'd first like to thank
all the people at ACM who devote their time to making all of this run smoothly. So there have been two paradigms for AI. Since the 1950s, there's been
the logic inspired approach, where the essence of intelligence is seen as symbolic expressions operated on by symbolic rules. And the main problem has been reasoning. How do we get a computer to
do reasoning like people do? And there's been the
biologically inspired approach. Which is very different. It sees the essence of intelligence as learning the connection
strengths in the neural network and the main things to focus on at least to begin with are
learning and perception. So they're very different paradigms with very different initial goals. They have very different views of the internal representations
that should be used. So the symbolic paradigm thinks that you should use symbolic expressions and you can give these to the computer if you invent a good
language to express them in. And you can of course get new expressions within the computer by applying rules. The biological paradigm thinks the internal representations are nothing at all like language. They're just big vectors
of neural activity. And these big vectors
have cause and effects on other big vectors. And these vectors are
gonna be learned from data. So all the structure in these vectors is gonna be learned from data. I'm obviously giving sort of caricatures of the two positions to emphasize how different they are. They lead to two very different ways of trying to get a computer
to do what you want. So one method which I slightly naughtily call intelligent design is what you would call programming, it's you figure out how
to solve the problem. And then you tell the
computer exactly what to do. The other method is you
just show the computer a lot of examples of
inputs and the outputs that you produce and you let
the computer figure it out. Of course, you have to
program the computer there too. But it's programmed once with some general purpose
learning algorithm. That again is a simplification. So an example of a kind of thing that people spent 50 years trying to do with symbolic AI is take an image and
describe what's in the image. So think about taking
the millions of pixels in the image on the
left and converting them to a string of words. It's not obvious how
you'd write that program. People tried for a long time and they couldn't write that program. People doing neural nets
also tried for a long time and in the end, they managed to get a system
that worked quite well, which was based on the
pure learning approach. So the central question for neural nets was always we know the big neural nets with lots of layers and
non-linear processing elements can compute complicated things. At least we believe they can. But the question is, can they learn to do it? So can you learn a task
like object recognition or machine translation by taking a big net and
starting from random weights and somehow training it so it changes the weights, so it changes what it computes? There's an obvious learning
algorithm for such systems which was proposed by
Turing and by Selfridge and by many other people, variations of it. And the idea is you start
with random weights. So this is how Turing believed
human intelligence works. You start with random weights and rewards and punishments cause you to change the connection strengths so you eventually learn stuff. Um, this is extremely inefficient. It will work. But it's extremely inefficient. In the 1960s, Rosenblatt introduced a fairly simple and efficient learning procedure. Much more efficient than
random trial and error that could figure out
how to learn the weights on features in which you extract features from the image and then you combine the features using weights
to make a decision. And he managed to show you
can do some things like that. Some moderately impressive things. But in perceptrons you
don't learn the features. That again is a simplification. Rosenblatt had all sorts of ideas about how you would learn features. But he didn't invent backpropagation. In 1969, Minsky and Papert showed that the kinds of perceptrons that Rosenblatt had got to work were very limited in what they could do. There were some fairly simple things they were unable to do. And Minsky and Papert strongly implied that making them deeper wouldn't help. And better learning
algorithms wouldn't help. There was a basic limitation of this way of doing things. And that led to the
first neural net winter. In the 1970s and the 1980s, many different groups invented the backpropagation algorithm. Variations of it. And backpropagation
allows a neural network to learn the feature detectors and to have multiple layers
of learned feature detectors. That created a lot of excitement. It allowed neural networks for example to convert words into vectors that represented the
meanings of the words, and they could do that just by trying to predict the next word. And it looked as if it might be able to solve tough problems
like speech recognition and shape recognition. And indeed it did solve, it did do moderately well
at speech recognition. And for some forms of shape recognition it did very well. Like Yann LeCun's networks that read handwriting. But, what I'm gonna do now is explain very briefly how neural networks work. I know most of you will know this. But I just want to go
over it just in case. So we make a gross
idealization of a neuron. And the aim of this idealization is to get something that can learn so that we can study how you
put all these things together to learn something complicated in big networks of these things. So it has some incoming weights that you can vary. Well the learning algorithm will vary. And it gives an output that's just equal to its input, provided the input's
over a certain amount. So that's a rectified linear neuron. Which we actually didn't start using 'til later but these are the kinds of neurons that work very well. And then you hook them up into a network and you have weights on the incoming connections for each of these neurons. And as you change those incoming weights, you're changing what feature that neuron will respond to. So by learning these weights, you're learning the features.
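To make that concrete, here is a minimal NumPy sketch of the idealized neuron and a small layered net built out of them; the layer sizes, random weights, and the 4-pixel "image" are made up purely for illustration.

```python
import numpy as np

def relu(x):
    # Rectified linear unit: output equals the input when it is above zero,
    # and zero otherwise (the "over a certain amount" threshold, taken as 0 here).
    return np.maximum(0.0, x)

# A tiny two-hidden-layer network on a made-up 4-pixel "image".
rng = np.random.default_rng(0)
x = rng.normal(size=4)            # input activities
W1 = rng.normal(size=(8, 4))      # incoming weights of the first hidden layer
W2 = rng.normal(size=(8, 8))      # incoming weights of the second hidden layer
W3 = rng.normal(size=(2, 8))      # weights into two output neurons (e.g. dog / cat)

h1 = relu(W1 @ x)                 # each row of W1 is the feature one hidden neuron detects
h2 = relu(W2 @ h1)
output = W3 @ h2                  # learning means adjusting W1, W2 and W3
print(output)
```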
You put in a few hidden layers, and then you'd like to train it so that the output neurons do what you like. So for example, we might show it images of dogs and cats. And we might like the left neuron to turn on for a dog and the right for a cat. And the question is how are we gonna train it? So there's two kinds of
learning algorithms mainly. Oh there's actually three, but the third one doesn't work very well. That's called reinforcement learning. (audience laughing) There's a wonderful reductio ad absurdum of reinforcement learning called DeepMind. (audience laughing) So, that was a joke. (audience laughing) There's supervised training where you show the network what the output ought to be. And you adjust the
weights until it produces the output you want. And for that you need to know what the output ought to be. And there's unsupervised learning where you take some data and you try and represent
that data in the hidden layers in such a way that you
can reconstruct the data or perhaps reconstruct parts of the data. If I blank out small parts of the data, can I reconstruct them
now from the hidden nodes? That's the way unsupervised learning typically works in neural nets. So here's a really inefficient way to do supervised learning by using a mutation-based reinforcement kind of method. What you would do is you
take your neural net, you give it some, a
typical set of examples, you'd see how well it did. You then take one weight and you change that weight slightly, and you'd see if the neural
net does better or worse. If it does better, you keep that change. If it does worse you throw it away. Perhaps you change in
the opposite direction and that's already a
factor of two improvement. But this is an incredibly slow learning algorithm. It will work.
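As a rough illustration of how slow this is, here is a sketch of that perturbation procedure in Python; the loss function is just a stand-in for "how well the net did on a batch of examples," and every name in it is invented.

```python
import numpy as np

def perturbation_learning(loss_fn, weights, steps=1000, delta=1e-3):
    """The slow procedure described above: nudge one weight at a time, measure the
    loss on a batch of examples, and keep the nudge only if things got better.
    Trying the opposite direction when the first nudge fails is the factor-of-two trick."""
    for _ in range(steps):
        i = np.random.randint(len(weights))      # pick a single weight
        baseline = loss_fn(weights)              # one whole evaluation per tiny change
        for direction in (+delta, -delta):
            trial = weights.copy()
            trial[i] += direction
            if loss_fn(trial) < baseline:
                weights = trial                  # keep the change
                break                            # otherwise throw it away
    return weights

# Toy usage: find weights that match a target vector.
target = np.array([1.0, -2.0, 0.5])
loss = lambda w: float(np.sum((w - target) ** 2))   # stand-in for network performance
print(perturbation_learning(loss, np.zeros(3)))
```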
But what it achieves can be achieved many, many times faster by backpropagation. So you could think of backpropagation as just an efficient version of this algorithm. So in backpropagation, instead of changing a weight and measuring what effect that has on the performance of the network, what you do is you use the fact that all of the weights of the network are inside the computer. You use that fact to
compute what the effect of a weight change would
be on the performance. And you do that for all of
the weights in parallel. So if you have a million weights, you can compute for
all of them in parallel what the effect of a small
change in that weight would be on the performance. And then you can update
them all in parallel. That has its own problems, but it'll go a million times faster than the previous algorithm. Many people in the press describe that as an exponential speed up. Actually it's a linear speed up. The term exponential is used
quadratically too often. (audience laughing) So we get to backpropagation where you do a forward
pass through the net, you look to see what the outputs are, and then using the difference between what you got and what you wanted, you do a backwards pass which has much the same flavor as a forward pass. It's just high school calculus or maybe first university year calculus. And you can now compute in parallel which direction you should
change each weight in. And then very surprisingly you don't have to do that for the whole training set. You just take a small batch of examples and on that batch of examples you compute how to change the connection strengths. And you might have got it wrong because of the quirks of
that batch of examples. But you change them anyway. And then you take another
batch of examples. This is called stochastic gradient descent. And I guess the major discovery of the neural net community is that stochastic gradient descent, even though it has no real right to work, actually works really well. But it works really well at scale. If you give it lots of data and big nets, it really shows its colors.
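As an illustration only, here is a tiny NumPy version of that recipe: a one-hidden-layer net where the backward pass gives the gradient for every weight at once, updated on small random batches. The data, sizes, and learning rate are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))                 # toy inputs
Y = (X.sum(axis=1, keepdims=True) > 0) * 1.0   # toy targets

W1 = rng.normal(size=(4, 16)) * 0.1
W2 = rng.normal(size=(16, 1)) * 0.1
lr, batch_size = 0.1, 32

for step in range(500):
    idx = rng.integers(0, len(X), batch_size)  # a small random batch
    x, y = X[idx], Y[idx]

    # Forward pass.
    h = np.maximum(0.0, x @ W1)                # ReLU hidden layer
    pred = h @ W2
    err = pred - y                             # squared-error gradient at the output

    # Backward pass: the chain rule gives the gradient for every weight at once,
    # instead of perturbing weights one at a time.
    grad_W2 = h.T @ err / batch_size
    grad_h = err @ W2.T
    grad_h[h <= 0] = 0.0                       # ReLU gate
    grad_W1 = x.T @ grad_h / batch_size

    # Stochastic gradient descent: update on the quirky little batch anyway.
    W1 -= lr * grad_W1
    W2 -= lr * grad_W2
```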
However, in the 1980s we were very, very pleased by backpropagation. It seemed to have solved the problem. We were convinced it was gonna solve everything. And it did actually do quite well at speech recognition and some forms of auditory recognition. But it was basically a disappointment. It didn't work nearly
as well as we thought. And the real issue was why. And at the time people
had all sorts of analyses of why it didn't work. Most of which were wrong. So they said it's getting trapped in local optima. We now know that wasn't the problem. When other learning
algorithms worked better than backpropagation on
modest sized data sets, most people in the
machine learning community adopted the view that what
you guys are trying to do is learn these deep multi-layer networks from random weights just using stochastic gradient descent, and this is crazy. It's never gonna work. You're just asking for too much. There's no way you're gonna get systems like this to work unless you put in quite a lot of hand engineering. You somehow wire in some prior knowledge. So linguists for example have been indoctrinated to believe that a lot of language is innate and you'd never learn language
without prior knowledge. In fact they have mathematical theorems that proved you couldn't learn language without prior knowledge. My response to that is
beware of mathematicians bearing theorems. So I just want to give you
some really silly theories. I'm a Monty Python fan. So here's some really silly theories. The continents used to be connected and then drifted apart. And you can imagine how silly geologists thought that theory was. Great big neural nets that start with random weights and no prior knowledge can learn to do machine translation. That seemed like a very, very silly theory to many people. Just to add one more. If you take a natural remedy and you keep diluting it, the more you dilute it, the more potent it gets. (audience laughing) And some people believe that too. So the quote at the top was taken actually from the continental drift literature. Wegener, who suggested it in 1912, was kinda laughed out of town. Even though he actually
had very good arguments. He didn't have a good mechanism. And the geological community said we've gotta keep this
stuff out of the textbooks and out of the journals. It's just gonna confuse people. We had our own little experience of that in the second neural net winter. So NIPS, of all conferences, declined to take a paper of mine. You don't forget those things. (audience laughing) And like many other disappointed authors, I had a word with a friend
on the program committee. And my friend on the
program committee told me, "Well, you see they couldn't accept this because they had two
papers on deep learning and they had to decide
which one to accept." And they had actually
accepted the other one. So they couldn't reasonably
be expected to have two papers on the same thing
in the same conference. I suggest you go to NIPS now and see where they're at. (audience laughing) Yoshua Bengio submitted a paper to ICML in about 2009. I'm not certain of the
year, but it's around then. And one of the reviewers said
that neural network papers had no place in a machine
learning conference. So I suggest you go to ICML. CVPR, which is the leading
computer vision conference, that was the most
outrageous of all I think. Yann and his coworkers submitted a paper doing semantic segmentation that beat the state of the art. It beat what the mainstream computer vision people could do. And it got rejected. And one of the reviewers said, "This paper tells us nothing
about computer vision "because everything's learned." So the reviewer, like the field of computer
vision at the time, was stuck in the frame of mind that the way you do computer vision is you think about the nature
of the task of vision, you preferably write down some equations, you think about how to do the computations that are required to do vision, then you get some implementation of it, and then you see whether it works. The idea that you just learn everything was outside the realm of things that were worth considering. And so the reviewer
basically missed the point which was that everything was learned. He completely failed to see
how that completely changed computer vision. Now I shouldn't be too hard on those guys. 'Cause a little later on,
they were very reasonable. With a bit more evidence,
they suddenly flipped. So between 2005 and 2009, researchers, some of them in Canada, we make Yann an honorary Canadian 'cause he's French. (audience laughing) Made several technical advances that allowed backpropagation
to work better in feed forward nets. They involved using
unsupervised pre-training to initialize the weights before you turn on backpropagation. Things like dropping out units at random to make the whole thing much more robust. And introducing rectified linear units which turned out to be easier to train. For us the details of those advances are our bread and butter. We are very interested in those. But the main message is that with a few technical advances, backpropagation works amazingly well. And the main reason is 'cause we now have lots of labeled data and a lot of convenient compute power. Inconvenient compute power isn't much use. But things like GPUs
and more recently, TPUs, allow you to apply a lot of computation and they've made a huge difference. So really the deciding factor I think was the increase in compute power. So I think a lot of the
credit for deep learning really goes to the people who collected the big databases like Fei-Fei Li. And the people who made the computers go fast like David Patterson and others. Lots of others. So the killer app from my point of view is in 2009 when in my lab we got a bunch of GPUs and two graduate students and made them learn to do acoustic modeling. Acoustic modeling means you take something like a spectrogram and you try and figure out, for the middle frame of the spectrogram, which piece of which phone the speaker is trying to express. And in this little database we used, relatively little, there are 183 labels for which piece of which phone it might be. And so you pre-train a net with many layers of 2000 hidden units, you can't pre-train the last layer 'cause you don't know the labels yet. And you're training it just to be able to reproduce what's in the layer below. And then you turn on
learning in all the layers, and it does slightly better
than the state of the art, which had taken 30 years to develop. When people in speech saw that, the smart people, they realized that with
more development this stuff was gonna be amazing. And my graduate students went off to various groups like MSR, IBM, and Google. In particular Navdeep Jaitly went to Google and ported the system for acoustic modeling that was developed in Toronto, fairly literally, and it came out in Android in 2012. There was a lot of good engineering to make it run in real time. And it gave a big decrease in error rates. And at more or less the same time all the other groups started changing the way they did speech recognition. And now, all the good speech recognizers use neural nets. They're not like the neural
nets we introduced initially, neural nets have gradually
eroded more and more parts of the system. Sort of putting a neural
net in your system is a bit like getting gangrene. It'll gradually eat the whole system. Then in 2012, two other of my graduate students applied neural nets of the kind
developed over many years by Yann LeCun to object recognition on a big database that Fei-Fei Li had put together with 1000 different classes of object. And it was finally a big enough database of real images so you could show what neural nets could do. And they could do a lot. So if you looked at the results all the computer vision systems, the standard ones, had asymptoted at about 25% error. Our system developed by
two graduate students got 16% error. And then further work on
neural nets like that, by 2015 it was down to 5%. And now it's down to
considerably below that. So then what happened was exactly what ought to happen in science. Leaders of the computer vision community looked at this result and they said, "Oh, they really do work. "We were wrong. "Okay we're gonna switch." And within a year they all switched. And so science finally
worked like it was meant to. The last thing I want to talk about is a radically new way to
do machine translation. Which was introduced in
2014 by people at Google and also in Montreal by
people in Yoshua Bengio's lab. And the idea in 2014 was for each language we're gonna have a neural network. It'll be a recurrent network that is gonna encode the string of words in that language. Which it receives one at a time into a big vector. I call that big vector a thought vector. The idea is that big
vector captures the meaning of that string of words. Then you take that big vector and you give it to a decoder network. And the decoder network
turns the big vector into a string of words
in another language. And it sorta worked. And with a bit of development it worked very well.
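A minimal sketch of that encoder-decoder idea, written here in PyTorch with made-up vocabulary sizes and a GRU standing in for the recurrent nets of the time; none of these names or numbers come from the talk.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Reads the source sentence one word at a time and squashes it
    into a single 'thought vector' (the final hidden state)."""
    def __init__(self, vocab_size, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, src_tokens):
        _, thought = self.rnn(self.embed(src_tokens))
        return thought                      # shape: (1, batch, dim)

class Decoder(nn.Module):
    """Unrolls the thought vector into a word sequence in the target language."""
    def __init__(self, vocab_size, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, tgt_tokens, thought):
        h, _ = self.rnn(self.embed(tgt_tokens), thought)
        return self.out(h)                  # scores over the target vocabulary

# Toy usage: vocabulary sizes and token ids are invented.
enc, dec = Encoder(vocab_size=1000), Decoder(vocab_size=1200)
src = torch.randint(0, 1000, (2, 7))        # batch of 2 source sentences, 7 words each
tgt = torch.randint(0, 1200, (2, 9))
logits = dec(tgt, enc(src))                 # train with cross-entropy against the next word
```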
Since 2014, one of the major pieces of development has been that when you're decoding the meaning of a sentence, what you do is you look back at the sentence you were encoding, and that's called soft attention. So each time you produce a new word you're deciding where
to look in the sentence that you're translating. That helps a lot. You also now pre-train
the word embeddings. And that helps a lot. And the way the pre-training works is you take a bunch of words and you try and reproduce these words in a deep net but you've left out some of the words. So from these words you have to reproduce the same words, but you have to fill in
the blanks essentially. They use things called transformers where in this deep net, as each word goes through the net, it's looking at kind of nearby words to disambiguate what it might mean. So if you have a word like "may," when it goes in you'll
get an initial vector that's sort of ambiguous between the modal and the month, but if it sees "the 13th" next to it, it knows pretty well it's the month. And so in the next layer it can disambiguate that and the meaning of that May will be the month.
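To illustrate the kind of mixing being described, here is a toy self-attention step in NumPy; the three word vectors are invented numbers, not real embeddings, and the sentence is just the "may the 13th" example from the talk.

```python
import numpy as np

def self_attention(word_vectors):
    """One round of the kind of mixing a transformer layer does: each word's vector
    is replaced by a weighted average of all the vectors, and the weights come
    from how strongly the vectors match each other."""
    scores = word_vectors @ word_vectors.T / np.sqrt(word_vectors.shape[1])
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over the other words
    return weights @ word_vectors

# Made-up 4-dimensional vectors for "may the 13th". After attention, the vector
# for "may" is pulled towards "the 13th", so the next layer sees a version of
# "may" that leans towards the month reading rather than the modal one.
vectors = np.array([[0.9, 0.1, 0.0, 0.2],    # "may" (ambiguous)
                    [0.1, 0.8, 0.1, 0.0],    # "the"
                    [0.0, 0.2, 0.9, 0.1]])   # "13th"
print(self_attention(vectors))
```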
And those transformer nets now work really well for getting word embeddings. They also it turns out learn
a whole lot of grammar. So all the stuff that linguists thought had to be put in innately, these neural nets are
now getting in there. They're getting lots of
syntactical understanding. But it's all being learned from data. If you look in the early layers of transformer nets, they know what parts of speech things are. If you look in later parts of the nets, they know how to disambiguate
pronoun references. Basically, they're learning grammar the way a little kid learns grammar. Just from looking at sentences. So I think that the machine translation was really the final nail in the coffin of symbolic AI. 'Cause machine translation
is the ideal task for symbolic AI. It's symbols in and symbols out. But it turns out if
you want to do it well, inside what you need is big vectors. Okay, I have said everything I wanted to say about the
history up to 2014 or so of neural nets. I've emphasized the ideology. That there were these two camps. And that the good guys won. It's not over yet because of course what we
need is for neural nets now to begin to be able to explain reasoning. We can't do that yet. We're working on it. But reasoning is the last
thing that people do, not the first thing. And reasoning is built on
top of all this other stuff. And my view's always been, you're never gonna understand reasoning until you understand all this other stuff. And now we are beginning to
understand all this other stuff. And we're more or less ready to begin to understand reasoning. But reasoning just with
sort of bare symbols by using rules that
express other symbols. That seemed to me just hopeless. You're missing all the content. There's no meaning there. Okay, I want to talk a little bit about the
future of computer vision. So convolutional neural nets
have been very effective. And what convolutional neural
nets do is they wire in the idea that if a feature's
useful in one place, it's also gonna be
useful in another place. And that allows us to combine evidence from different locations to
learn a shared feature detector. That is to learn replicated feature detectors that are the same in all these places. And that's a huge win. It makes it much more data efficient.
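A tiny sketch of that weight sharing: one feature detector slid along a signal, so the same weights are reused at every position. The 1D example and numbers are made up for illustration.

```python
import numpy as np

def conv1d(signal, feature):
    """Apply the *same* feature detector (the same weights) at every position:
    the weight sharing that convolutional nets wire in."""
    k = len(feature)
    return np.array([signal[i:i + k] @ feature
                     for i in range(len(signal) - k + 1)])

signal = np.array([0., 0., 1., 1., 1., 0., 0.])
edge_detector = np.array([-1., 0., 1.])          # one shared feature detector
print(conv1d(signal, edge_detector))             # responds wherever the edge appears
```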
And those things Yann got working in the 1990s. They were one of the few things that worked really well in the 1990s. And they work even better now. But I don't think they're the way people do vision. I mean I think one aspect of it, that there's replicated apparatus. That's clearly true of the brain. But they don't recognize objects the same way as we do. And that leads to adversarial examples. So if I give you a big database, a convolutional neural
net will do very well. It may do better than a person. But it doesn't recognize things the same way as a person does. And so I can change things in a way that will cause the
convolutional neural net to change its mind. And a person can't even
see the changes I've made. They're using things much
more like texture and color. They're not using the
geometrical relationships between objects and their parts. I'm convinced that people, the main way in which
people recognize objects, they obviously use texture and color, but they're very well aware of
the geometrical relationships between an object and its parts. And that geometrical
relationship is completely independent of viewpoint. And that gives you something
that is very robust. That you should be able to
train from much less data. And I actually can't resist
doing a little demonstration to convince you that when
you understand objects, it's not just when
you're being a scientist that you use coordinate frames. It's even when you're just
naively thinking about objects, you impose coordinate frames on them. And so I'm gonna do a
little demonstration. And you have to participate
in this demonstration, otherwise it's no fun. Okay, so I want you to imagine
sitting on the table top in front of you there's a cube. So here's the top, here's the bottom. Here's the cube. It's a wire frame cube like this. Matte black wires. And what I'm gonna do with this cube is from your point of view, there's a front bottom
right hand corner here and this top back left hand corner here. Okay. And I'm gonna rotate the cube so that the top back left hand corner is vertically above the front
bottom right hand corner. So here we are. And so now I want you to
hold your fingertip in space, probably your left finger tip, where the top vertex of the cube is, okay? And now, nobody's doing it, come on. (audience laughing) Now, with your other finger tip I just want you to point
to where the other corners of the cube are. The ones that aren't resting on the table. So there's one on the table. One vertically above it here. Where are the other corners? And you have to do it, you have to point them out. Okay, now I can't see what you're doing, but I know that a large number of you will have pointed out four other corners 'cause I've done this before. And now I want you to imagine a cube in the normal orientation and ask how many corners does it have? It's got eight corners, right? So there's six of these guys. And what most people do is they say here, here, here, and here. What's the problem? Well the problem is that's not a cube. What you've done is you've preserved the four fold rotational
symmetry that a cube has. And pointed out a
completely different shape. It's a completely different shape that has the same number of faces as a cube has corners. And the same number of
corners as a cube has faces. It's the dual of a cube. If you substitute corners for faces. 'Cause you really like symmetry so much that you had to really mangle things to preserve the symmetries. Actually, a cube has three
edges coming down like that. And three just coming up like that. And my six fingertips are
where the corners are. And people just can't see that. Unless they're crystallographers
or very clever. So the main point of this demo is I forced you by doing this rotation to use an axis for the cube. The main axis that defined
the orientation of the cube was not one of the axes
of the coordinate frame you usually use for a cube. And by forcing you to use an
unfamiliar coordinate frame, I destroyed all your knowledge about where the parts of a cube are. You understand things
relative to coordinate frames. And if I get you to impose a
different coordinate frame, it's just a different object
as far as you're concerned. Now convolutional nets don't do that. And because they don't do that, I don't think they're the
way people perceive shapes. We've recently managed to
make neural nets do that by doing some self-supervised training. And there's an arXiv reference there which if you're very quick you could get or you could, I'll send
out a tweet about it later. And the last thing I want to say is not about shape
recognition in particular, but about the future of neural networks. There's something very
funny and very unbiological we've been doing for the last 50 years. Which is we've only been
using two time scales. That is you have neural activities, and they change rapidly. And you have weights
and they change slowly. And that's it. But we know that in
biology synapses change at all sorts of time scales. And the question is what happens if you now introduce more time scales. In particular, let's just
introduce one more time scale and let's say that in addition to these weights changing slowly, and that's what's going
on in long term learning. The weights have a component. The very same weights, the very same synapses. But there's an extra component that can change more rapidly and decays quite rapidly. So if you ask where's your memory of the fact that a minute ago I put my finger on this corner here, is that in a bunch of neurons that are sitting there sort of being active, so that you can remember that? That seems unlikely. It's much more likely your memory for this is in fast modifications to the weights of the neural network that allow you to reconstruct this very rapidly. And that will decay with time. So you've got a memory that's in the weights, that's a short term memory.
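Here is one way such fast weights are often sketched in code; this is an illustration of the general idea under simple assumptions (a Hebbian write rule and exponential decay), not the specific model being described.

```python
import numpy as np

class FastSlowLayer:
    """Two time scales on the same connections: slow weights changed by long-term
    learning, plus a fast component that is written quickly and decays quickly,
    acting as a short-term associative memory."""
    def __init__(self, n, decay=0.95, write_rate=0.5):
        self.W_slow = np.random.randn(n, n) * 0.01
        self.W_fast = np.zeros((n, n))
        self.decay, self.write_rate = decay, write_rate

    def step(self, h):
        # The fast component decays and gets a quick Hebbian imprint of the
        # current activity, so recent states can be reconstructed a moment later.
        self.W_fast = self.decay * self.W_fast + self.write_rate * np.outer(h, h)
        return np.tanh((self.W_slow + self.W_fast) @ h)
```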
As soon as you do that, all sorts of good things happen. You can use that to get a better optimization method. And you can use that to do something that may very well be relevant to reasoning. You can use it to allow neural networks to do true recursion. Not very deep, but true recursion. And what I mean by true recursion is when you do the recursive call like a relative clause in a sentence, the neural net can use
all the same neurons and all the same weights that it was using for the whole sentence to process the relative clause. And of course to do that, somehow it has to
remember what was going on when it decided to process
the relative clause. It has to store that somewhere. And I don't think it
stores it on the neurons, I think it stores it in temporary changes to synapse strengths. And when it's finished
processing the relative clause, it packages it up and says, basically it says now what was I doing when I started doing this processing? And it can get the information back from this associative
memory and the fast weights. I wanted to finish with that because the very first talk I gave in 1973 was about exactly that. I had a system that worked on a computer that had 64K of memory. I haven't got round to publishing it yet. But I think it's becoming
fashionable again, so I soon will. And that's the end of my talk. And I'm out of time. (audience clapping) And now I'd like to introduce Yann LeCun who's not only a colleague, but a very good friend. (audience clapping) – Okay, I'll talk about the sequel. But I'll start also with
a little bit of history and sort of go through some of the things that Geoff just mentioned. So Geoff talked about supervised learning. And supervised learning
works amazingly well if you have lots of data. We all knew this so we
can do speech recognition, we can do image recognition, we can do face recognition, we can generate captions for images. We can do translation. That works really well. And if you give your neural
net a particular structure in something like a convolutional net, as Geoff mentioned in
the late 80s, early 90s, we could train systems
to recognize handwriting that was quite successful. By the end of the 90s,
a system of this type that I built at Bell Labs, was reading something like 10 to 20% of all the checks in the U.S. So a big success, even a commercial success. But by that time the entire community had basically abandoned neural nets, partly because of the lack of large datasets for which they could work. Partly because the type
of software at the time that you had to write
was fairly complicated and it was a big investment to do this. Partly also because computers
were not fast enough for all kinds of other applications. But convolutional nets really are inspired by biology. They're not copying biology, but there is a lot of
inspiration from biology, from the architecture of the visual cortex, and ideas that come naturally when you're used to signal processing, the idea that filtering is a good way to kinda process signals. Whether they are audio
signals or image signals. And convolution as the way to do filtering is very natural. And the fact that you find this in the brain is really not that surprising. And those ideas of course were proposed by Hubel and Wiesel in
sort of classic work in neuroscience back
in the 60s as well as, and sort of picked up by Fukushima, who's a Japanese researcher who tried to build computer models of the Hubel & Wiesel model if you want. And I found that inspiring and sort of tried to reproduce this using neural nets that could be trained with backpropagation. That's basically what
a convolutional net is. And so the idea of a
convolutional net is that the world, the perceptual world is compositional. That the visual world, objects are formed by parts, and parts are formed by motifs, and motifs are formed by textures or elementary combinations of edges and edges are formed by pixels, arrangements of pixels. And so if you have a system that sort of hierarchically can detect unusually useful combinations
of pixels into edges, and edges into motifs, and motifs into parts of objects, then you will have a recognition system. This idea of hierarchy actually goes back a long time. And so that's really the principle of convolutional nets.
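As a concrete illustration of that compositional stack, here is a small PyTorch convolutional net; the channel counts, image size, and number of categories are arbitrary, and the comments only label the rough role of each stage.

```python
import torch
import torch.nn as nn

# Each stage pools over a slightly larger neighbourhood, so the layers roughly
# follow the compositional story: edges -> motifs -> parts -> objects.
convnet = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # edge-like features
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # motifs
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # parts
    nn.Flatten(),
    nn.Linear(64 * 4 * 4, 10),                                                # object categories
)

images = torch.randn(8, 3, 32, 32)      # a toy batch of 32x32 colour images
scores = convnet(images)                # shape: (8, 10)
```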
And it turns out that hierarchical representations are good not just for vision, but also for speech, for text, and for all kinds of other natural signals that are comprehensible because they are compositional. I think there is this saying. It's attributed to Einstein I believe. What is most mysterious about the world is that it is understandable. And it's probably because
of the compositional nature of natural signals. So in the early 90s, we were able to do things like build
recognition systems like this one. This is the younger
version of myself here. I'm at Bell Labs. This is by the way my
phone number at Bell Labs in Holmdel, no longer operating. I'm hitting a key here and
the system captures an image with a video camera. This runs on the PC with
a special DSP card in it. And it could run those convolutional nets at several hundred characters per second at the time which was amazing. We could run 20 mega flops. You know that was just incredible. So that worked pretty well. And pretty soon we realized we could use this on actual images as well to do things like detecting faces. Eventually detecting pedestrians. That took a few years. But as Geoff mentioned there
was sort of a neural net winter between the mid 90s and the late 2000s if you want, where almost nobody was working on neural nets except a few crazy people like us. So that didn't stop us. And so working on face detection, pedestrian detection. Even working on using machine learning and convolutional nets for robotics where we would use a convolutional net to label an entire
image in such a way that every pixel in an image would be labeled as to whether it's
traversable or not traversable by a robot. And the nice thing about this is that you can collect data automatically. You don't need to manually label it because using stereo
vision you can figure out if a pixel sticks out of the ground or not using 3D reconstruction. But unfortunately that
only works at short range. So if you want a system that can plan long range trajectories, then you can train a convolutional net to make the predictions for traversability using those labels and then let the robot drive itself around. So this particular robot here has a combination of different features that it uses extracted
by the convolutional net and also a rapid stereo vision system that allows it to avoid obstacles such as pesky graduate students. (audience laughing) Pierre Sermanet and Raia Hadsell by the way, who are pretty sure the robot is not gonna run them over because they actually wrote the code. (audience laughing) Okay, and then a couple years later we used a very similar system to do semantic segmentation. This is actually the work that Geoff was talking about that was rejected from CVPR 2011. So this is the system that could, in real time, using an FPGA implementation, segment. Basically give a category
for every pixel in an image at about 30 frames per second at sort of decent resolution. It was far from perfect, but it could sort of label with sort of reasonable accuracy, detect pedestrians, detect the roads and the trees, et cetera. But the results basically
were not immediately believed by the computer vision community. Now to measure the progress that has happened since then in the last 10 years essentially, this is an example of a result of a really recent system
that was put together by a team at Facebook that they call the panoptic feature pyramid network. So it's basically a
large convolutional net that has sort of a path that extracts features, a multi-layer path that extracts features, and then another path that sort of generates an output image. And the output image basically identifies and generates a mask for every instance of every object in the image and tells you what category they are. So here the name of the category's on the display, but it can recognize something like a few hundred categories. People, vehicles of various kinds, and not just object categories. But also sort of background, sort of textures or regions. Things like grass and sand, trees and things like that. So you would imagine a system like this would be very useful for
things like self driving cars if you have the complete segmentation, identification of all pixels in an image, it would make it easier to build self driving cars. Not just self driving cars, but also medical image analysis systems. So this is a relatively
similar architecture. People call this U-net
sometimes because of the obvious U shape of this convolutional net. Again, it has an encoder part that sort of extracts features and then a part that constructs the output image where the parts of the medical images are segmented.
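A minimal sketch of that encoder-decoder segmentation idea in PyTorch; it is only loosely U-Net-shaped, with one skip connection, and every size in it is illustrative rather than taken from any real system.

```python
import torch
import torch.nn as nn

class TinySegmenter(nn.Module):
    """Encoder-decoder in the U-Net spirit: an encoder extracts features at lower
    resolution, a decoder builds the output image back up, and a skip connection
    carries fine detail across. A class score is produced for every pixel."""
    def __init__(self, num_classes=5):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                                 nn.MaxPool2d(2),
                                 nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)
        self.skip = nn.Conv2d(1, 16, 3, padding=1)
        self.head = nn.Conv2d(32, num_classes, 1)

    def forward(self, x):
        features = self.enc(x)                       # half resolution
        upsampled = self.up(features)                # back to full resolution
        merged = torch.cat([upsampled, self.skip(x)], dim=1)
        return self.head(merged)                     # (batch, classes, H, W)

scan = torch.randn(1, 1, 64, 64)                     # a fake single-channel image
per_pixel_scores = TinySegmenter()(scan)
```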
This is the kind of result that it's producing. This is some work by some
of my colleagues at NYU. I was not involved in this work. A different subgroup of colleagues with some common co-authors, working on detecting breast cancer from imaging, from x-rays, from mammograms. In fact, one of the
sort of hottest topics in radiology these days
is using deep learning for medical image analysis. It's probably going to affect, if not revolutionize, radiology in the next few years. It already has to some extent. Some more work along those directions. This is actually a collaboration between the NYU medical school and Facebook AI Research on accelerating the
data collection for MRI. So when you go through an MRI you have to sit in the
machine for about an hour, or 20 minutes, depending on the kind of exam you're going through. And this technique here using those kinds of reconstruction
convolutional nets allows you to basically reduce the data collection time and get images that are essentially of the same quality. So it will not put radiologists out of jobs. But it will make the job
more interesting probably. Geoff was mentioning work on translation with neural nets. This is I think a very surprising and interesting development of the fact that you can use neural
nets to do translation. And there is a lot of innovation in the kind of architectures
that are used for this. So Geoff talked about
the attention mechanism, the transformer architecture. This is a new one called
dynamic convolutions which kinda recycles a bit of those ideas. And things work really well there. Those networks are very large. They have a few hundred
million parameters in them. And so, some of the challenges there is actually running them on GPUs. Having enough memory to run them. We're basically limited by GPU memory there. So those ideas of image segmentation have been used by people working on self driving cars. Particularly people at Mobileye, which is now Intel, going back several years. The first convolutional nets I think that were deployed for self driving cars or for driving assistance were
in the 2015 Tesla S model. NVIDIA has devoted a large sort of efforts also to self driving cars. And so there's a lot of
interesting things going on there. But progress is, I wouldn't say slow, but getting to completely autonomous
driving is a hard problem. It's not as easy as
people thought initially. Okay, so Geoff kinda brushed away reinforcement learning. But reinforcement learning is something that a lot of people are
really excited about. Particularly people at DeepMind. But there is a problem
with the current crop of reinforcement learning. Which is that it's
extremely data inefficient. If you want to train a
system to do anything using reinforcement learning, it will have to do lots and
lots of trials and errors. So for example, to get a
machine to play Atari games, classic Atari games, to the level that any human can reach in about 15 minutes of training, the machine will have
to play the equivalent of 80 hours of real time play. To play Go at superhuman level, it will have to play something
like 20 million games. To play StarCraft. This is a recent DeepMind work, it's a blog post, not a paper, the AlphaStar system took the equivalent of 200 years of real time play to reach human level, on a single map for kind
of a single type of player. By the way, all those systems use ConvNets and various other things. But that's an interesting thing. So the problem with reinforcement learning is that those models have to try something to know if it's gonna work. And it's really not practical
to use in the real world if you want to train a
robot to grasp those things or you want to train
a car to drive itself. So to figure out, to train a system to drive a car so it doesn't run off cliffs, it will actually have to run off a cliff multiple times before it figures out how not to do that. First of all, to figure
out it's a bad idea, and second, to figure
out how not to do it. Because it doesn't have
a model of the world. It doesn't, it can't
imagine what's gonna happen before it happens. It has to try things to correct itself. That's why it's so inefficient. So that begs the question, how is it that humans
and animals can learn so efficiently, so quickly? We can learn to drive a car. Most of us can learn to drive a car in about 20 hours of training with hardly any accidents. How does that happen? We don't run off cliffs because we have a pretty good
intuitive physics model that tells us if I'm
driving next to a cliff, and I'm turning the wheel to the right, the car is gonna run off the cliff, it's gonna fall and nothing good is gonna come out of this. So we have this internal model. And the question is how do
we learn this internal model? And the next question is
how do we get machines to learn internal models like that? Basically just by observation. So there is a gentleman called Emmanuel Dupoux in Paris. He's a developmental psychologist. He works actually on how
children learn language and speech and things like that. But also other concepts. And he made this chart about the time, the age in months at which
babies learn basic concepts. Like things like
distinguishing animate objects from inanimate objects. That happens really quickly
around three months old. The fact that some objects are stable, some of them will fall, and you can sort of measure whether babies are surprised by the
behavior of some objects. And then it takes about nine months for babies to figure out that objects that are not supported will fall. Basically gravity. So if you show a six month old baby the scenario on the top left where there's a little car and a platform and you push the little
car off the platform, and the car doesn't fall, it's a trick. Babies at six months old
don't even pay attention. That's just another thing that the world throws at them that they
have to learn, it's fine. A nine month old baby will go like the little girl at the bottom left. Be very, very surprised. In the meantime they've
learned the concept of gravity. And nobody has really
told them what gravity is. They've just kind of observed the world and they figured out that objects that are not supported just fall. And so when that doesn't happen, they get surprised. How does that happen? It's not just humans. Animals have those models too. You know cats, dogs, rats, orangutans. So here's a video. The orangutan here is
being shown a magic trick. Put an object in a cup. Remove the object, but
he doesn't see that. Then show the cup. It's empty. He rolls on the floor laughing. (audience laughing) So his model of the world was violated. He has a pretty good model of the world. Object permanence. That's a very basic concept. Objects are not supposed
to disappear like that. And when your model of the
world is being violated, you pay attention because
you're gonna learn something about the world you didn't know. If it violates a very basic thing about the world, it's funny. But it also, it might be dangerous. It's something that can kill you 'cause you just didn't predict what just happened. Okay, so what's the solution? Really, how do we get
machines to learn this kind of stuff? You know, learn all the huge
amount of background knowledge we learn about the world by just observing in the first few months of life. And animals do this too. So for example, if I ask you, if I train myself to predict what the world is gonna look like when I move my head slightly to the left, because of motion parallax, objects that are nearby and objects that are faraway won't move the same way relative to my viewpoint. And so the best way to
predict how the world is gonna look when I move my head, is to basically represent internally the notion of depth. And consequently, sort of conversely, if I train a system to predict what the world is gonna look like when it moves its camera, maybe it's gonna learn the
notion of depth automatically. And once you have depth, you have objects, because you have objects
in front of others. You have occlusion edges. Once you have objects, you have things you can influence. And things that can move
independently of others and things like that. So concepts can kind of build
on top of each other like this through prediction. So that's the idea of
self supervised learning. It's prediction and reconstruction. I give the machine a piece of data. Let's say a video clip. I mask a piece of that video clip and I ask the system to
predict the missing part from the part that it can observe. Okay so that would be video prediction. Just predict the future. But the more general form
of self supervised learning is I don't specify in advance which part I'm gonna mask or not, I'm just gonna tell the system
I'm gonna mask a piece of it, and whatever is masked, I'm asking you to reconstruct it. And in fact, I may not
even mask it at all. I'm just gonna virtually mask it and just ask the system to reconstruct the input under certain constraints.
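A toy version of that recipe, assuming a simple vector input and a small PyTorch model standing in for whatever network you would really use; the masking rate, sizes, and data are all invented.

```python
import torch
import torch.nn as nn

# A minimal sketch of the recipe: hide part of the input, ask the network to
# fill it in, and score it only on what was hidden.
model = nn.Sequential(nn.Linear(100, 32), nn.ReLU(), nn.Linear(32, 100))
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(100):
    x = torch.randn(16, 100)                    # stand-in for a clip or image patch
    mask = torch.rand_like(x) < 0.25            # choose which part to blank out
    corrupted = x.masked_fill(mask, 0.0)        # the part the machine is allowed to see

    reconstruction = model(corrupted)
    loss = ((reconstruction - x)[mask] ** 2).mean()   # only the hidden part is scored

    opt.zero_grad()
    loss.backward()
    opt.step()
```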
So the advantage of this self supervised learning is that it's not task dependent. You get the machine to
learn about the world without training it for a particular task. And so it can learn just by observation without having to interact with the world, which is a bit more efficient. But more importantly, you're asking the system
to predict a lot of stuff. Not just a value function like
in reinforcement learning, where basically the only thing you give the machine to predict is a scalar value once in a while. Not supervised learning where you ask the system
to predict a label, which is a few bits. In the case of self supervised learning you're asking the machine
to predict a lot of stuff. And so that led me to this
slightly obnoxious analogy, at least for people who work
on reinforcement learning, which is the idea that if intelligence or learning is a cake, the bulk of the cake, the genoise as we say in French is really
self supervised learning. Most of what we learn, most of the knowledge we
accumulate about the world is learned through self
supervised learning. Then there's this little bit of icing on the cake, which is supervised learning. We're shown a picture book and we're being told the name of objects, and with just a few examples, we can know what the objects are. We're taught the meaning of some words and babies can learn, young children can learn
many, many words per day. New words. And then the cherry on the
cake is reinforcement learning. It's a very small amount of information you're asking the machine to predict. And so there's no way
that the machine can learn purely from that form of learning. It has to be a combination of probably all three forms of learning, but principally self supervised learning. This idea is not new. A lot of people have argued
for the idea of prediction for learning. The idea of learning models, predictive models. And one such person is Geoff as a matter of fact. This is a quote from him, which this is from a few years ago, but he's been saying
this for about 40 years. At least for longer than I've known him. And it goes like this. "The brain has about 10 to the 14th synapses and we only live about 10 to the 9th seconds. So we have a lot more parameters than data. This motivates the idea that we must do a lot of unsupervised learning or self supervised learning since the perceptual input, including proprioception, is the only place where we can get 10 to the 5th dimensions of constraint per second." If you're asked to predict everything that comes into your senses, every fraction of a second, that's a lot of information
you have to learn. And that might be enough to constrain all the synapses we have in
our brain to learn things that are meaningful. So the sequel of deep
So the sequel of deep learning, in my opinion, is self supervised learning. And in fact, historically
as Geoff mentioned, the sort of deep learning conspiracy that Yoshua, Geoff and I
started in the early 2000s was focused on unsupervised learning, unsupervised pre-training. And it was partly successful. But we kind of put it on the back burner for a while. And it's coming back to the fore now. It's gonna create a new revolution. At least that's my prediction. And the next, the revolution
will not be supervised. So I have to thank Alyosha
Efros for this slogan. He invented it. Of course he got inspired
by Gil Scott-Heron, "The Revolution Will Not Be Televised." You can even get a t-shirt with it now. So what is self supervised
learning really? Self supervised learning
is filling in the blanks. And it works really well for
natural language processing. So in natural language processing, a method that has become
standard over the last year, in models like BERT and others, is: you take a long sequence of words extracted from a corpus of text, and you blank out some proportion of the words. And you train a very large neural net, based on those transformer architectures or various other architectures, to predict the missing words. And in fact, it cannot exactly predict the missing words, so you're asking it to predict a distribution over the entire vocabulary, the probability that each word may occur at those locations. So that's a special case of what we call a masked autoencoder: give it an input, ask it to reconstruct the part of the input that is not present.
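To make that concrete, here is a minimal sketch of such a masked prediction objective over tokens. It is a toy stand-in, not BERT itself: the vocabulary size, dimensions, and the special MASK_ID are made-up placeholders, and positional encodings are omitted.

```python
# Toy masked-token prediction (the "fill in the blanks" objective), not BERT itself.
# VOCAB, DIM, SEQ_LEN and MASK_ID are made-up placeholders; positions are ignored.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, SEQ_LEN, MASK_ID = 10000, 256, 128, 0

class MaskedLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(DIM, VOCAB)          # scores over the whole vocabulary

    def forward(self, tokens):                     # tokens: (batch, seq) of word ids
        return self.head(self.encoder(self.embed(tokens)))

def masked_lm_loss(model, tokens, mask_prob=0.15):
    mask = torch.rand(tokens.shape) < mask_prob    # choose which words to blank out
    corrupted = tokens.masked_fill(mask, MASK_ID)
    logits = model(corrupted)                      # (batch, seq, vocab)
    targets = tokens.masked_fill(~mask, -100)      # score only the masked positions
    return F.cross_entropy(logits.reshape(-1, VOCAB), targets.reshape(-1),
                           ignore_index=-100)

model = MaskedLM()
batch = torch.randint(1, VOCAB, (4, SEQ_LEN))      # fake word ids (0 is reserved for MASK)
masked_lm_loss(model, batch).backward()
```

The cross-entropy over the whole vocabulary is exactly the "distribution over all words at the blanked positions" described above.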
People have been trying to do this in the context of image recognition as well. There have been various attempts at doing it. This is work from Pathak et al. from a few years ago, where you blank out some pieces of an image and then you ask the system to fill them in. And it's only partially successful. Not nearly as successful as in the context of
natural language processing. So in natural language processing, there's been a revolution over the last year of using those pre-training systems for natural language understanding, translation, all kinds of stuff. And the performance is amazing. They're very, very big models, but they really work really well. And there were sort of
early indications of this in work that Yoshua Bengio
did a long time ago in the 90s, and that Ronan Collobert and Jason Weston did around 2010, using neural nets for NLP. And then more recent work, Word2vec, FastText, et cetera, also used this idea of predicting words from their context, basically. But now this whole idea has completely taken off. So why does it work for
natural language processing and why does it not work so well in the context of images and vision? I think it's because of the sort of, how we represent uncertainty or how we do not represent uncertainty. So let's say we want
to do video prediction. We have short video clips with a few frames. In this case here, a little girl approaching a birthday cake. And then we asked the machine to predict the next few frames in the video. If you train a large neural net to predict the next few frames using least squared error, what you get are blurry predictions. Why? Because the system cannot exactly predict what's gonna happen and so it, the best it can do is predict the average of all the possible futures. To be more concrete, let's say all the videos consist of someone putting a pen on the table and letting it go and every time you repeat the experiment, the pen falls in a different direction. And you can't really
predict in which direction it's gonna fall. Then if you predict the average of all the outcomes, it would be a transparent pen superimposed on itself in
all possible orientations. That's not a good prediction. So if you want the system to be able to represent multiple predictions, it has to have what's
called a latent variable. So you have a function implemented by a neural net. It takes the past, let's say a few frames from a video, and it wants to predict the next few frames. It has to have an extra variable, here it's called Z, so that when you vary this variable, the output varies over a particular set of possible predictions. Okay, that's called a latent variable model.
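Here is a toy illustration of both points, with the "pen" reduced to a single number that lands at +1 or -1: a plain squared-error predictor collapses to the average of the futures, while a predictor with a latent variable Z can cover both outcomes. The training rule for the latent model (fit each sample against its best-matching Z) is just one simple choice among several.

```python
# Toy illustration: a squared-error predictor of an uncertain outcome converges to
# the average of the possible futures, while a latent-variable predictor can cover
# the distinct outcomes. The "pen" is reduced to a single number that is +1 or -1.
import torch
import torch.nn as nn

torch.manual_seed(0)

def pen_falls(n):                                  # +1 or -1, at random
    return torch.where(torch.rand(n, 1) < 0.5, torch.ones(n, 1), -torch.ones(n, 1))

# 1) Deterministic prediction trained with squared error: ends up near 0, the mean.
pred = nn.Parameter(torch.zeros(1))
opt = torch.optim.SGD([pred], lr=0.1)
for _ in range(500):
    y = pen_falls(64)
    loss = ((pred - y) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print("squared-error prediction:", pred.item())    # ~0: the "blurry" average

# 2) One prediction per value of the latent variable Z; each sample is trained
#    against its best-matching Z (one simple way to fit a latent-variable model).
preds_z = nn.Parameter(torch.randn(2, 1) * 0.1)    # outputs for Z = 0 and Z = 1
opt = torch.optim.SGD([preds_z], lr=0.1)
for _ in range(500):
    y = pen_falls(64)                              # (64, 1)
    err = (preds_z.unsqueeze(0) - y.unsqueeze(1)) ** 2   # (64, 2, 1)
    loss = err.min(dim=1).values.mean()            # pick the best Z per sample
    opt.zero_grad(); loss.backward(); opt.step()
print("latent predictions:", preds_z.detach().squeeze().tolist())  # roughly +1 and -1
```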
The problem with training those things is that there are basically only two ways of training them that we know about, or two kinds of families of ways to train those systems. One is a very cool idea from Ian Goodfellow and his collaborators at the University of Montreal a few years ago called adversarial training, or generative adversarial networks. And the idea of GANs, generative adversarial networks, is to train a second neural net to tell the first neural net whether its prediction is on this manifold, or set, of plausible futures or not. And you train those two
networks simultaneously.
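For concreteness, here is a minimal sketch of that two-network game on a toy one-dimensional distribution; the sizes and the data are made up, and this is the generic GAN recipe rather than the video-prediction system being described.

```python
# Minimal GAN training loop on a toy 1-D data distribution: the discriminator D
# learns to tell real samples from generated ones, and the generator G learns to
# fool it. Sizes and the data distribution are arbitrary placeholders.
import torch
import torch.nn as nn

real_data = lambda n: torch.randn(n, 1) * 0.5 + 2.0                 # toy "plausible futures"
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))    # generator
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))    # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(2000):
    # Discriminator step: real samples get label 1, generated samples get label 0.
    real, fake = real_data(64), G(torch.randn(64, 8)).detach()
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: try to make the discriminator call its samples real.
    fake = G(torch.randn(64, 8))
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```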
There's another technique that consists of inferring what the ideal value of the latent variable would be to make a good prediction. But if you do this, you have the danger that the latent variable would capture all the information there is to capture about the prediction, and no information will actually be used from the past to make that prediction. So you have to regularize
this latent variable. Okay, so those ideas of things like adversarial training work really well. So what you see here at the bottom is a video prediction for a short clip where the system has been trained with this adversarial training. And there are various ways
of doing those predictions. Not just in pixel space, but also in the space of objects that have already been segmented. Those generative
adversarial networks can generate images that are used for kind of assistance to sort of artistic production. So these are non-existing faces. You have a system here
that has been trained to produce an image that looks like a celebrity and after
the system is trained, you feed it a few hundred random numbers, and out comes a face that doesn't exist. And they look pretty good. This is work by NVIDIA that was actually presented this year. You can use this to produce all kinds of different things. Like, you know, clothing for example, training on a collection of clothes from a famous designer. So I think we need new ways of formulating this problem of unsupervised learning so that our systems can
deal with this uncertainty in the prediction in the context of continuous high dimensional spaces. We don't have the problem in the context of natural language processing, because it's easy to represent
a distribution over words, it's just a discrete distribution. It's a long vector of numbers between zero and one that sum to one. But it's very hard in continuous
high dimensional spaces. And so we need new techniques for this. And one technique I'm proposing is something called energy
based self supervised learning. Imagine that your world is two dimensional. You only have two input variables, two sensors, and your entire world, your entire training set, is composed of those dots here in this two dimensional space. What you'd like is to train a contrast function, let's call it an energy function, that gives low energy to points that are on the manifold of data, and higher energy outside.
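As one concrete but deliberately simple possibility, here is a contrastive sketch that pushes energy down on synthetic 2-D training points and up on points sampled away from them; it is a generic recipe, not the specific approach favored in the talk, which comes next.

```python
# Contrastive shaping of an energy function over 2-D points: low energy on training
# data, pushed up (with a margin) on points sampled off the data. Toy data only.
import torch
import torch.nn as nn
import torch.nn.functional as F

energy = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(energy.parameters(), lr=1e-3)

def sample_data(n):                                 # a toy "manifold": a noisy circle
    angle = torch.rand(n) * 6.2832                  # ~ 2 * pi
    return torch.stack([angle.cos(), angle.sin()], dim=1) + 0.05 * torch.randn(n, 2)

for step in range(3000):
    data = sample_data(128)
    negatives = torch.rand(128, 2) * 4 - 2          # uniform points in a square
    loss = energy(data).mean() + F.relu(1.0 - energy(negatives)).mean()
    opt.zero_grad(); loss.backward(); opt.step()
# After training, energy(x) should be low near the circle and higher elsewhere.
```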
There is basically a lot of research to do there to find the best method to do this. My favorite one is what I call regularized latent variable models. And we had some success about 10 years ago using techniques of this type for learning features
in a convolutional net, completely unsupervised. What you see on the left here is an animation of a system that learns basically oriented filters, just by being trained with natural image patches to reconstruct them under sparsity constraints. And what you see on the right are filters of the convolutional net that are learned with the same algorithm with different numbers of filters. Those things kinda work. They don't beat supervised learning if you have tons of data. But the hope is that they will reduce the amount of necessary labeled data.
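A minimal sketch of that "reconstruct under a sparsity constraint" recipe, written as a plain (non-convolutional) autoencoder with an L1 penalty on the code; the patch source, sizes, and sparsity weight are placeholders, and this is a simplification of the actual convolutional sparse-coding work.

```python
# Sparse autoencoder on image patches: reconstruct each patch while keeping the
# code sparse. With whitened natural image patches, the decoder columns tend to
# become oriented, Gabor-like filters. All sizes are placeholders.
import torch
import torch.nn as nn

PATCH, CODE = 12 * 12, 256
encoder = nn.Linear(PATCH, CODE)
decoder = nn.Linear(CODE, PATCH, bias=False)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

def train_step(patches, sparsity=0.5):
    """patches: (batch, PATCH) tensor of flattened image patches."""
    code = torch.relu(encoder(patches))             # non-negative, encouraged to be sparse
    recon = decoder(code)
    loss = ((recon - patches) ** 2).mean() + sparsity * code.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# for patches in patch_loader:       # patch_loader is assumed, not provided here
#     train_step(patches)
# decoder.weight.T.reshape(CODE, 12, 12) would then hold the learned filters.
```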
So I'm gonna end with an example of how to combine all those to get a machine to learn something useful, like a task, a motor task. So here what I'm talking about is: can we train a machine to learn to drive just by observing other people driving, and by training a model of what goes on in the world? So you are in your car. You can see all the cars around you. And if you can predict
what the cars around you are gonna do ahead of time, then you can drive defensively basically. You can decide to stay away from this car because you see it swerving. You can decide to kinda slow down because the car in front of you is likely to slow down 'cause there's another car in front of it that is slowing down. So you have all those predictive models that basically keep you safe and you sort of learn to
integrate them over time. You don't even have to think about it. It's just in your sort
of reflexes of driving. You can talk at the same time, and it still works. But the way to train a system like this is that you first have to
train a forward model. So a forward model would be: here is the state of the world at time T, give me a prediction of the state of the world at time T plus one. And the problem with this, of course, is that the world is not deterministic. There are a lot of things that could happen. So it's the same problem that I was talking about with the pen, many things can happen. But if you had such a forward model, you could run the forward model for multiple time steps. And then, if you had an objective function like how far you are from the other cars, whether you are in lane, things like this, you could backpropagate gradients through this entire system to train a neural net to predict the correct course of action that would be safe over the long run.
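A hedged sketch of that loop: unroll an assumed, already-trained differentiable forward model, apply a hand-written cost to each imagined state, and backpropagate through the whole rollout into the policy network. Every module, size, and cost below is a placeholder, not the actual system.

```python
# Backpropagating through a learned world model to train a policy. Only the policy
# is updated here; the world model is assumed to have been trained beforehand.
import torch
import torch.nn as nn

STATE, ACTION = 64, 2
world_model = nn.Sequential(nn.Linear(STATE + ACTION, 128), nn.Tanh(),
                            nn.Linear(128, STATE))           # assumed pre-trained
policy = nn.Sequential(nn.Linear(STATE, 128), nn.Tanh(),
                       nn.Linear(128, ACTION), nn.Tanh())
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

def proximity_cost(state):
    # Placeholder for "how far you are from the other cars / are you in lane".
    return (state ** 2).mean()

def train_policy(initial_state, horizon=20):
    state, total_cost = initial_state, 0.0
    for _ in range(horizon):
        action = policy(state)
        state = world_model(torch.cat([state, action], dim=-1))  # imagined next state
        total_cost = total_cost + proximity_cost(state)
    opt.zero_grad()
    total_cost.backward()      # gradients flow back through the rollout into the policy
    opt.step()                 # world_model's parameters are simply never stepped
    return total_cost.item()

train_policy(torch.randn(8, STATE))
```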
And this can be done completely in your head. If you have a forward model in your head, you don't have to actually drive to train yourself to drive, you can just imagine all of those things. So that's a specific example. So put a camera looking down at a highway. It follows every car and extracts a little rectangle around each car; that's what you see at the bottom. And what you're doing
now is you're training a convolutional net to take a few frames centered on a particular car and predict the next state of the world. And if you do this, you get, oops, sorry. You get the second column. So the column on the left is what happens in the real world. The second column is what happens if you just train a convolutional
net with a squared (L2) error to predict what's gonna happen. It can only predict the
average of all the possible futures, and so you
get blurry predictions. If you now transform the model so that it has a latent variable that allows it to take into account the uncertainty about the world. And I'm not going to explain
exactly how that works. Then you get the prediction
that you just saw on the right where for every drawing
of this latent variable you get different predictions, but they are crisp. Okay, so now, to do this training I was telling you about earlier, what you can do is sample the latent variables so you get possible scenarios about what's gonna happen in your future. Then through backpropagation you train your policy network to get your system to drive. And if you do this, it doesn't work. It doesn't work because the system goes into regions of the state space where the forward model is very inaccurate and very uncertain. So what we have to do is add another term to the objective function that prevents the system from going into parts of the space where its predictions are bad. Okay, so it's like an inverse curiosity constraint if you want. And if you do this, it works.
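One simple way to build such a term (a generic sketch, not the exact regularizer used in that work) is to run the forward model several times with dropout active and penalize the policy in proportion to how much those predictions disagree:

```python
# Generic "uncertainty cost": sample the forward model several times with dropout
# on, and use the disagreement of its predictions as a penalty. Sizes are made up.
import torch
import torch.nn as nn

STATE, ACTION = 64, 2
world_model = nn.Sequential(nn.Linear(STATE + ACTION, 128), nn.Dropout(0.1),
                            nn.Tanh(), nn.Linear(128, STATE))

def uncertainty_cost(state, action, n_samples=8):
    world_model.train()                       # keep dropout active so samples differ
    x = torch.cat([state, action], dim=-1)
    preds = torch.stack([world_model(x) for _ in range(n_samples)])
    return preds.var(dim=0).mean()            # large where the model is unreliable

# The per-step cost would then be something like
#   cost = proximity_cost(state) + lambda_u * uncertainty_cost(state, action)
# so the trained policy stays out of badly-modeled parts of the state space.
```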
So these are examples where the blue car is driving itself. The little white dot indicates whether it accelerates, whether it brakes, or whether it turns. And it kinda keeps itself safe, away from the other cars. The other cars can't see it; the blue car is invisible here. Let me show you another example here. So here the yellow car is the actual car in the video. The blue car is what the agent that's been trained is doing. And it's being squeezed between two cars, so it has to escape because the other cars don't see it. So it has to squeeze out. But it works. It works reasonably well. And basically that system has never interacted with the real world. It just watched other people drive, and it then uses that for training its action plans, basically its policy. Okay, now I'm gonna go a little philosophical if you want. There is, throughout history
of technology and science, there's been this phenomenon. It's not universal, but it's pretty frequent where people invent an artifact and then derive science
out of this artifact to explain how this artifact works or to kind of figure out its limitations. So a good example is the
invention of the telescope in the 1600s. Optics was not developed
until at least 50 years later. But people had a good intuition of how to build telescopes before that. The steam engine was
invented in the late 1600s, early 1700s, and thermodynamics came out more than 100 years later, basically designed to explain the limitations of thermal engines. And thermodynamics is now one of the most fundamental foundations of all of science. So it was purposely defined to explain a particular artifact. That's very interesting. Same thing with electromagnetism and electrodynamics. With the invention of
sailboats and airplanes, aerodynamics. The invention of compounds, and chemistry to explain them, et cetera, right? Computer science came after the invention of computers, right? Information theory came after the invention of the first digital communication through radio and teletype and things like that. So it's quite possible that now, in the next few decades, we'll have empirical systems that are built by trial and error, perhaps by systematic optimization on powerful machines, perhaps by intuition, by empirical work, perhaps with a little bit of theory, perhaps a lot of theory hopefully. And the question is whether this will lead to a whole theory of intelligence. The fact that we can build an artifact that is intelligent might
lead to a general theory of information processing
and intelligence. And that's kind of a big hope. I'm not sure this is gonna be realized over the next few decades, but that's a good program. A word of caution about biological inspiration. So neural nets are biologically inspired. Convolutional nets are
biologically inspired. But they're just inspired, they're not copied. Let me give you a story of a gentleman called Clement Ader. Are there any French people in the room here? Okay, can you raise your
hand, French people? No French people. Yeah, okay, a couple. Have you heard of Clement Ader? Never heard of Clement Ader. Yeah, you have, okay. Is there anyone who is not French who has heard of Clement Ader? Okay, one person, two persons. Basically nobody. You guys have no idea who he is. Okay. So this guy built, in the late 1800s, a bat shaped airplane, steam powered. He was a steam engine designer. And his airplane actually took off under its own power 13 years
before the Wright brothers. Flew for about 50 meters at
about 50 centimeters altitude and then kinda crash-landed. It was basically uncontrollable. So basically the guy just copied bats and assumed that because it had the shape of a bat, it would just fly. That seems a little bit naive. He was not naive at all, but he kinda stuck a little bit too close to biology and got sort of hypnotized by it a little bit. And he didn't do things like build a model or a glider or a kite or, you know, a wind tunnel like the Wright brothers did. So he stuck a little too close to biology. On the other hand, he had a big legacy, which is that his second airplane was called the "Avion". And that's actually the word in French, Spanish, and Portuguese for airplane. So he had some legacy. But he was kind of a secretive guy. This was before the open source days. And so, this is why
you never heard of him. Thank you very much. (audience clapping) – Thank you. Thank you Geoff. Thank you Yann. We have, can we get the
house lights on please? We have two microphones up and we have time for
a couple of questions. House lights on please. I know they came on earlier. (audience laughing) Yes, so well first,
let's give Yann and Geoff a round of applause for an amazing, amazing talk. (audience clapping) A little piece of trivia while someone comes up to the microphone. Alan Turing's birthday
was June 23rd, 1912. So today is the 107th birth anniversary. And so it's very appropriate that we had this memorable Turing lecture today. Okay, question. Yeah. – [Woman] Hi thank you both. I'm really interested in the work on understanding reasoning. Can you give us a little taste of what you're thinking there? The reasoning of how neural nets reason. – Okay, so neural nets are pretty
good at things you do in parallel in a hundred milliseconds. So far they're not so good at things you do over longer time periods. And in particular, one thing that people criticize neural nets for is they can't do recursion. So when we understand a sentence, we can go off into a relative clause and understand the relative clause. And we devote all our effort to understanding that relative clause and then come back again. And that kind of thing we're just beginning to be able to do with neural nets. So people at Facebook
have done lots of that. People at Google are doing it. But in order to do things like that, you need some kind of memory. The typical thing to use in a neural net is you just have another bank of neurons, which are copies of
neurons you already have. But that's not biologically plausible. So I always want something
that's biologically plausible. And in the brain it seemed much more
likely that this memory is not copies of neural activities, it's an associative memory that can recreate neural activities. But it's one that's just
used for temporary things. – There's actually quite
a lot of work on this, on sort of trying to fill the gap of neural nets not being able to do long chains of reasoning. So one thing that Geoff has been sort of advocating for a long time is the fact that if you have sort of classical logic based reasoning, it's discrete, and therefore incompatible with gradient based learning. And so, how can you do reasoning with vectors, by replacing symbols by vectors and replacing logic by continuous functions, basically by differentiable continuous functions? And then if you want long chains of reasoning, you need to have a working memory. So Geoff was kinda mentioning one idea using fast weights; there are a lot of people working on what's called memory networks. So you have basically what amounts to a recurrent net which can access a separate neural net which is also differentiable, but the particular architecture of it turns it into an associative memory basically. And those kind of work in simple cases. They haven't really been scaled up to big problems.
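For readers who want the mechanics, here is a minimal sketch of a differentiable associative memory of that flavor: a soft, attention-style read over stored key-value slots, so what comes back is a recreated pattern rather than a literal copy. Sizes are arbitrary and this is the generic mechanism, not any particular memory-network paper.

```python
# Differentiable associative memory: store (key, value) pairs and retrieve by soft
# attention, so the whole read is differentiable and can sit inside a larger net.
import torch
import torch.nn.functional as F

D, SLOTS = 32, 10
keys = torch.randn(SLOTS, D)       # what each memory slot is indexed by
values = torch.randn(SLOTS, D)     # what each memory slot stores

def read(query):                   # query: (batch, D)
    scores = query @ keys.T / D ** 0.5        # similarity to every slot
    weights = F.softmax(scores, dim=-1)       # soft, differentiable addressing
    return weights @ values                   # a blend of the stored contents

out = read(torch.randn(4, D))      # (4, D): recreated activities, not stored copies
```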
But there is very interesting work on basically neural nets that do not directly compute an answer, but produce a neural net which is designed to answer the question that's being asked. So visual question answering is a typical example of this. You show a complex image to a system and you ask it, "Is there a cylinder that is larger than the two cubes in this picture?" And what the neural
net does is it produces another neural net which
has the right module to answer that question. You can train this whole
thing with backprop and it's kind of amazing
that it works at all. But it works. – [Vivek] Uh, one more question. Oh sorry, yeah go ahead. – [Efram] Efram from
Northeastern University. So I know many people
question neural networks, saying that we know that neural networks work but we don't know how they work. So I wonder what your comments are on this question. – It's not really true. I mean, we have some
understanding of course. I mean, first of all, we have access to everything inside the machine, right? Obviously those things
have hundreds of millions of parameters, hundreds of
thousands of variables inside. It's gonna be complicated. It has to be complicated because we want them to solve
complicated problems. So thinking that you're gonna have a complete understanding of exactly every detail is hopeless. On the other hand, I think there is quite a lot
of theoretical understanding of, for example, why optimization seems to work in large networks, why the system doesn't seem to be trapped in local minima for example, or the kind of representations
that are learned. – I have something to say about that too. Which is most of the things people do, we don't know how they work. We have no idea how they do it. And so if you replace people by neural networks, you're no worse off than you were with people. But you're probably better off, because you can correct for bias better with a neural network than you can with a person. But the other thing is there may be some tasks where you need to use hundreds of thousands of
weak regularities in the data to make a prediction. And there are no simple rules. There are just lots of weak regularities. And what big neural nets do is they use them all, and they say, you know, 300,000 regularities say yes and 150,000 regularities say no, so it's probably yes. And if you ask me, well, how did it do it? If you're expecting to get some lines of computer code that would compute that, you're not gonna get it. This neural network has
a billion weights in it and the way it did it
was those billion weights have these values. And that may be the best you can get. So for that kind of decision, you're just gonna have
to live with the fact that people have intuitions
and their intuitions tell 'em what to do. And neural nets have intuitions too. And they work by having, the same way they do with people, by having large numbers of weights that conspire together to say this is more likely than that. – [Vivek] Thanks, I
know you've been waiting, so why don't you go ahead, and then we'll take one more from there. – [Man] Sorry, just a quick question. On other AI aspects like evolutionary computation and stuff, do you have some opinion on whether they will...? – Sorry, I didn't hear what you said. – [Man] In other AI fields like evolutionary computation, for example, do you have any opinion on those? Because you sort of only mentioned symbolic AI earlier. – Did you say evolution? – [Man] Like genetic programming, those kinds of things. – Ah yes. I think it's great for
setting hyperparameters. That is if you're in a
high dimensional space, and you want to improve, if you can get a gradient
you're gonna do much better than someone who can't get a gradient. And the brain is a device
for getting gradients. Evolution can't get gradients 'cause a lot of what determines the relationship between the genotype and the phenotype is outside your control. It's the environment. So evolution has to use techniques like mutation, random changes, and recombinations. But we're not limited to that. We can produce a device
that can get gradients. Now obviously if you can get gradients, you can also use evolution
to make that device better. And if you look at what
happens in neural nets now, you train a neural net using gradients. But now you fiddle with the hyperparameters using something much more like an evolutionary technique. So.
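A hedged sketch of that division of labor: gradients do the learning inside each run, while a small evolutionary loop mutates and selects hyperparameter settings. The train_and_score function is a placeholder the reader would have to supply.

```python
# Evolutionary hyperparameter search: keep a small population of settings, score
# each by briefly training a model with it (gradient descent does the real work),
# keep the best, and mutate them. `train_and_score(hp) -> float` is a placeholder.
import random

def mutate(hp):
    return {"lr": hp["lr"] * random.choice([0.5, 1.0, 2.0]),
            "weight_decay": hp["weight_decay"] * random.choice([0.5, 1.0, 2.0])}

def evolve(train_and_score, generations=10, pop_size=8):
    population = [{"lr": 10 ** random.uniform(-4, -1),
                   "weight_decay": 10 ** random.uniform(-6, -2)}
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=train_and_score, reverse=True)
        parents = ranked[: pop_size // 2]                  # keep the best half
        population = parents + [mutate(random.choice(parents)) for _ in parents]
    return max(population, key=train_and_score)
```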
– [Vivek] I think we have time for one more question, from that side. – [Man] Thank you. So Yann talked about using CNNs or other types of neural nets in perception for autonomous vehicles, and it didn't sound very positive. So could you make some
more comments on that? – Well, there have been a lot of declarations, probably kinda more marketing oriented than science oriented, that fully autonomous driving
is just around the corner. And it's just a lot harder than most people imagine. A lot of people in the business of course knew it was hard and it's
not just around the corner. I think there's a similar story in a lot of areas of AI, and AI as a whole where a lot of people had
very optimistic expectations about when human level AI
will be attained for example. In my opinion, it's not
just around the corner. There are certainly things, like how to do self supervised learning properly, that need to be figured out before that happens. But they're not the only obstacles. It's just the first mountain that we see. And there might be a whole
bunch of mountains behind that we haven't figured out. So I think it's a little bit the same for autonomous driving. Autonomous driving, it's easy to get early impressive results
where a car appears to drive itself pretty well
for about half an hour. But to get the same level
of reliability as humans, which is one fatal accident per 100 miles, per 100 million miles, I'm sorry. (audience laughing) You know what's 10 to the
sixth between friends. (audience laughing) But it's really hard
to get to that level. And if you try to sort of
extrapolate how much data you need to get to that level by sort of seeing how the performance improves as you increase the amount of data, it's basically impractical. So we have to find new ways
of training those systems. And I think self supervised learning is part of the answer. We'll see, okay. It's a hard problem. You can over engineer it. You can add sensors that make the processing easier. You can do detailed maps. You can do all kinds of stuff to kinda make it practical
in some conditions. But fully sort of level five
autonomous driving is hard. – [Man] Thank you very much. – [Vivek] Okay, with
that I know many of you have other conference events to go to. Now we're done with the last question. So I would really like
to thank Yann and Geoff once again for a most
memorable Turing lecture. (audience clapping)
Thank you. And enjoy the rest of the conference including the events that
you have this evening. Thank you.
