John, welcome to the show.

Awesome to be here.

Great to have you back. This time we get to do it one on one, which is nice. It was great having you and Daniel Kahneman on. You and I have been speaking for a long time, since 2015, 2016, about the possibilities for artificial intelligence, and I think there was a quick moment where everyone started talking about crypto, but now I think we're focused on the right stuff again.

It was a distraction.
So let's talk a little bit about the wave of generative artificial intelligence that we're seeing, and how that might relate to general intelligence. Look, when we were talking for the first time, we were talking about how we can make a machine that looks like and thinks like a human, and you told me, okay, there's going to be a time when they start to predict, and if they can predict, they can plan, and that's how we're going to get close to artificial intelligence. And I said, okay, that sounds nice, but that's never going to happen. At least, that's what I was thinking. Now we're starting to chat with some of these advances, like ChatGPT, and I'm starting to think, oh, okay, maybe it wasn't that far off. So where are we now in terms of the pursuit of artificial general intelligence? Is this a big step forward towards that, or is it, again, maybe a distraction?
The short answer is, it's not a particularly big step towards, you know, more human-level intelligence. I don't like the term AGI, artificial general intelligence, because I think human intelligence is very specialized. So if you want to designate the type of intelligence that we observe in humans as general, that's a complete misnomer. I know that may be a losing battle, but I want to make the point that human intelligence is actually very specialized. So no, the short answer is, first of all, from the scientific point of view, ChatGPT is not a particularly interesting scientific advance. Of course, we don't know all the details, because OpenAI has not published any.

For a company called OpenAI, right now it's the least open that you can imagine.

They started out wanting to be open, and then they realized they couldn't fund their research unless they went slightly secretive.

By the way, can we just pause on that? Because that's interesting. What does that say about the AI research world, that if you want to start out open, you can't? They had to go for-profit, and now they're capped-profit. It's interesting that it's impossible to fund this type of research without that. Does something structurally need to change because of that?

Well, there are a lot of non-profit AI research organizations; they're called universities. But there are also other non-profits, like the Allen Institute for AI in Seattle, for example.
OpenAI initially was non-profit and then switched to for-profit. Originally they were publishing everything they were doing, and now they basically keep everything secret. They've become sort of a contract research house for Microsoft, to some extent. And it's because the funding model is not clear.

There's a reverse phenomenon that occurred at Google. When I created FAIR nine years ago, we had a big drum roll about the fact that we were going to be completely open, etc., and we still are; we are holding that line.
And as a result, it caused Google Brain, at the time, to become much more open than they were.

Interesting.

Because that's what the scientists wanted. If you tell a scientist, you can come work for us, but you can't say a word about what you're doing, you're killing that career. So you basically have to let people publish. And open research is much more efficient: you just get more stuff out of it, you get things that are more reliable, you attract better people, and you have a better intellectual impact, which means people have more respect for you and want to work for you, things like that. So there are a lot of advantages to this. But there has to be an economic model, and the only economic model I know, outside of universities and philanthropy, is an industry research lab inside of a large company that is profitable and sufficiently well established in its market that it can think for the long term and invest in fundamental research.
So that's the case for certain corners of Google, and that's the case for FAIR at Meta, but not quite for DeepMind. DeepMind is an interesting case, because they started out as a startup, and you absolutely cannot do research in a startup. You just can't, because you just don't have the funds, or the ability to wait long term.

That's right.

You can do it for two or three years, but then you basically have to focus your entire attention on building products, getting revenue, and getting the company to survive. So what allowed them to do what they're doing is that they got bought by Google. But even then, their economic model was not obvious, because they were sort of an ivory tower separated from Google, and to some extent they still are. Google had the foresight to fund them regardless of whether they were producing something that was useful. But in the current context of efficiency and money saving in the tech industry, that model might have to change. It's not clear that the economic return Google has gotten from DeepMind, after nine or ten years, is worth the investment, so they're banking on a bigger, longer-term payoff. Whereas for FAIR, the business model is very clear: FAIR has had a huge impact on the company, mostly indirectly, through other groups, because FAIR does not work on products. But there's been a huge amount of impact.
That was a long digression, sorry. OpenAI could not keep doing what they were doing unless they went commercial, essentially, and it also caused them to make kind of wild promises. What they have to do, to be able to raise enough money from Microsoft and others, is make very flashy demos. And they're very good at that, so they make really flashy demos. Their technology is not particularly innovative from the scientific point of view, but it's very well engineered. They put together a large, scaled-up system trained with very well-curated data. I mean, they know what they're doing. But in terms of scientific advance, there's not much.
Okay. And by the way, thank you for that little diversion. I think it's good we went into that discussion of how this stuff gets funded and where it goes. But going back to our discussion of the march towards general intel... sorry, I'll use your term, human-level intelligence.

Okay. [Laughter]

From a layperson's point of view, it does feel like, oh, okay, now I'm talking to AI. Now AI understands what I think and can actually draw it. Now I can take my voice and it starts talking on its own. So why isn't that a step towards intelligence?
Because the understanding that the current systems have of the underlying reality that language expresses is extremely shallow. Those systems have only been trained with text, a huge amount of text. So they can regurgitate text that they've seen, and interpolate for new situations, things like that. They can even produce code and stuff like that. But they do not understand; they have no knowledge of the underlying reality. They've never had any contact with the physical world. If I take a piece of paper, let me look for a piece of paper, and I hold it like this, and I tell you I'm going to lift my hand from one side, you can exactly predict what's going to happen, right?

And for listeners, that paper is being held horizontally.

Horizontally, with my two hands, and then one hand moves away, so one part of the paper droops, because of gravity. And you know exactly how it looks, because you know the properties of paper and stuff like that.
This type of knowledge, which all of us have learned in the first few months of life, none of those systems have any of it.

But I could test that, right? I could chat with ChatGPT and say, what happens if I'm holding a paper with two hands and I let go with one, and it will tell you it will droop.

No.

I mean, it will answer that, but you think it just won't understand it?

No, it might actually not tell you the right thing, because whether it tells you or not depends on whether there were similar situations in the text it was trained on. But it doesn't understand it at all, and I can come up with a huge stack of similar situations, each one of which will not have been described in any text.
So then the question you want to ask is: how much of human knowledge is present and described in text? And my answer to this is, a tiny portion. Most of human knowledge is not actually language-related. I don't know if you do carpentry, say, and build a piece of furniture: your ability to predict what the piece of furniture is going to look like, as a consequence of how you build it, is completely non-linguistic. Everything that involves any kind of motor control, planning, things like that, there's basically no language involved. Now think about the entire collection of knowledge in every animal. It's obviously non-linguistic, because they don't have language, or at least not human-type language; there is communication in a few species, like dolphins and such. Now, dogs and cats know a lot of stuff about how the world works, and humans have all of that knowledge too, to some extent, not to the same degree in all domains, because we're all specialized. But none of that knowledge is captured by any current AI system, essentially.

And that's a lot.
Okay, let's build on that. I'm just going to read the response. I asked ChatGPT: if I'm holding a paper horizontally with two hands and let go with one hand, what will happen? Now, I'm not going to say that you're wrong, obviously you're right, but I'm just going to read it to you for the sake of discussion. ChatGPT responds: "If you are holding a paper horizontally with two hands and let go with one hand, the paper will tilt or rotate in the direction of the hand that is no longer holding it, due to the unbalanced forces acting on the paper. If the paper was initially still, it will also move in the direction of the hand that let go, due to the force of gravity acting on it. If the paper was moving in a certain direction before you let go, it will continue in that direction but may also be affected by air resistance and other external factors." Okay, that's the response. It sounds correct.

And it's completely wrong. It's exactly wrong; it's actually the exact opposite of what happens. It's not moving; it's the part that you let go that droops, right?

Yeah, and this is saying the opposite. Oh, that's true. Yeah, yeah.

Right. So it sounds correct, it's grammatically correct, and the general theme is correct, because there were probably descriptions of similar situations that the system was trained on, and it kind of memorized them. It tries to adapt the text so that it's relevant to the current situation, but it gets it completely wrong.
And it gets wrong things like comparing numbers. You do a prompt and you say, you know for a fact that seven is smaller than five, or with bigger numbers, that 250 is smaller than 196, and then you start telling a story with numbers, and it will assume that what you prompted it with is right, even though it's false. And you don't even have to do this; there are a lot of situations where the system will just not say things that are actually correct.

Now, why is that? It's because large language models are trained to predict the next word in a text. They are trained on enormous amounts of text, and they have enormous amounts of memory, but they basically probabilistically generate the next word, then inject that word into their context of a few dozen previous words that they've said, or the prompt, then generate the next word again, then re-inject that into the input, etc. There are various ways to do this more efficiently, but that's the basic idea.
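The word-by-word loop described here can be sketched in a few lines. This is a toy illustration only, with a hand-written bigram table standing in for the neural net (the table, its words, and its probabilities are all made up for the example): score candidate next words given the context, sample one, append it, and repeat.

```python
import random

# Made-up next-word probabilities; a real LLM computes these with a
# giant neural net over a much longer context.
BIGRAM_SCORES = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"chases": 0.7, "sleeps": 0.3},
    "chases": {"the": 1.0},
    "dog": {"sleeps": 1.0},
}

def generate(prompt, n_words, rng=None):
    """Generate up to n_words tokens one at a time, re-injecting each
    generated token as the new context for the next prediction."""
    rng = rng or random.Random(0)
    tokens = prompt.split()
    for _ in range(n_words):
        context = tokens[-1]  # here the "context" is just the last word
        candidates = BIGRAM_SCORES.get(context)
        if not candidates:
            break  # no prediction available for this context
        words = list(candidates)
        weights = [candidates[w] for w in words]
        tokens.append(rng.choices(words, weights=weights)[0])
    return " ".join(tokens)

print(generate("the", 3))
```

The point of the sketch is the shape of the loop: there is no objective, no plan, just one probabilistic word choice after another.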
Now, the issue with this is that there is no way to specify a task that the system has to accomplish, other than by specifying that task inside the prompt, which is a very circuitous, inefficient, and complicated way of specifying a task. It's not very controllable; that's the first thing. The second thing is that the system is not optimizing an objective, if you want, not trying to satisfy an objective. It's just generating one word after the other. And because it's generating one word after the other, it's not doing any planning. It's not planning to tell a story or an answer, whether that's a kind of outline for the story, or a set of facts, or things like this; it just generates one word after another. It has no capability of generating commands for a tool, like, say, a calculator, or anything like that.
Or a physics simulator, for example. It could have simulated that piece of paper, observed the result, and then told you what the result was. That's what we do in our head: when we are described this kind of situation, we have our own internal mental simulator, and because we've learned how the world works, we can simulate what goes on and then describe the result. LLMs do not have that; they don't have any internal model of the world that allows them to predict. And then, in addition to this, when the system produces a statement, you'd like to be able to verify that that statement is factually correct, or that it does not break any kind of logic compared to another statement that was made before. And there is no way, in the current architecture of those systems, to do this.
So until we build systems that have: some internal model of the world, which allows them to simulate the world, if you want; some way of generating actions on the world, to use tools like a calculator, or to interrogate a database or a search engine; an objective that they have to satisfy for the task we ask them to accomplish; and a method by which they can plan an answer that satisfies the objective, is factually correct or not depending on the desired behavior, and perhaps interrogates the right sources of information. Until then, we're not going to have anything resembling human-level intelligence.
Okay, and I definitely want to get to the type of research and the models that might get us there. But first I want to talk a little bit about the hallucination that ChatGPT just had in my interaction with it, because hallucination is definitely a big issue. I'll be honest, and this is embarrassing to admit as a journalist, but as I read it, I was ready to believe it, because it was like, oh, here's AI answering a question with a somewhat plausible answer and stating it so confidently. And that is an issue with these models, right? They do hallucinate. That's probably why we haven't seen Google bring it into search. Go ahead.

Well, that's why you haven't seen any systems of this type from either Google or Meta, despite the fact that they have the technology.
Okay. So certainly we have to realize that most of the underlying techniques used in ChatGPT were invented at Google and Meta. The whole thing has been built with PyTorch, which was made by Meta; it's not owned by Meta anymore, but it originated there. For example, it uses large Transformer architectures; those were originally invented at Google. Transformers themselves use something called associative memory (it's called self-attention, but it's basically the same principle), and those ideas were proposed by Meta many years ago. Then there's self-supervised pre-training by removing words: those are techniques that go back to the 1990s, or even the 1980s in some cases, and they've been popularized by the BERT-style language models, again proposed at Google. And then there are a number of techniques for dialogue systems: there's a very active dialogue-system research group at Meta that has proposed lots and lots of methods, which, whether they say it or not, OpenAI must have been influenced by. And the technique being used now that ChatGPT is available, called reinforcement learning from human feedback, RLHF, was proposed by DeepMind. So they've done a good job at integrating a lot of things that have been proposed in the literature, and at engineering a system that produces an impressive demo. And they have to produce impressive demos, because that's their economic model; that's how they're going to raise money from Microsoft and others. Whereas if you are Meta or Google, you could think about putting out a system of this type that is going to spew nonsense, but because you are a large company, you have a lot to lose from people making fun of you for that, and it's not clear what the benefits are. So we're still working on those things to make them useful.

But Meta did put out a system of its own, Galactica. So talk us through what happened there, because this was a system that was supposed to summarize scientific literature and do lots of other cool things. It comes out, and then three days later it goes back behind closed doors.

That's right. So, what happened. There was a previous system called BlenderBot, and there's another story about that.

Oh right, that's the thing that started talking about how Mark Zuckerberg is sort of a money-hungry capitalist.

I mean, it was just reflecting what it was trained on in the press.

Right. You didn't get called into Zuck's office, like, hey, what are you guys telling it about me?

No, no. It just trained on the general conversations you read in the media, and in the media Mark Zuckerberg is very often painted as some sort of money-hungry bad guy, which he is not at all, but that's the way he's painted.
So let's start with BlenderBot. BlenderBot was put out several months ago, and it's a dialogue system, a bit like ChatGPT. It's designed to be entertaining more than anything else, and it's capable of having multiple personalities, so it can talk in several styles and things like that. It has provisions to somewhat verify factual correctness, although those are not particularly well developed. But it does have a lot of guardrails, systems to prevent it from saying things that might be offensive, or even objectionable or controversial. So if you try to get it to talk about anything related to politics or religion or anything like that, it will change the topic. And it won't tell you, it's because I don't want to talk about this; it will just change the topic. So people thought this thing was really stupid and boring, because it doesn't want to talk about anything that's controversial or fun, which is the kind of stuff everybody wants to talk about, and it's frustrating, because it would change the topic any time you wanted to go there. So it was not nearly as convincing. You could say that the reason it was not that impressive, in the end, was because it was made to be safe, essentially.

Okay, and if it's too safe, it's boring.
Now let's go to Galactica. Galactica is a different animal. It's also a large language model, but that large language model has been trained on the entire scientific literature, something like millions of scientific papers. And its purpose is to help scientists write papers. It's not going to write a scientific paper for you, and it's not going to answer scientific questions; you could try to use it for that, but it might make stuff up. It's designed to be, essentially, a predictive keyboard on steroids. So you start typing a paragraph about something, and it will complete the text of the entire paragraph. It will insert relevant citations: if you write, the state of the art in object recognition on the ImageNet database is, it will find the correct reference; it will actually build a table of results with links to the references and stuff like that. But the same way driving-assistance systems for cars are just that, driving assistance, this is just writing assistance. In the end, your hands have to be on the wheel at all times; you are responsible for the text that is finished. It just helps you; it's a tool that helps you write more efficiently, particularly if you are not a native English speaker, which most scientists aren't.

Right. I mean, I even use ChatGPT that way. I put in the beginning of a paragraph and say, hey, which ways could this go, understanding that it might not be accurate.
And that's the way you should treat it, really, as a predictive keyboard on steroids, something that just helps you write. It's not going to invent new things, answer questions, do science, blah blah blah. So what happened was that when we put out Galactica, people tried to break it. People who are not scientists didn't understand what the use of it was going to be, and they would prompt it with things like, what are the benefits of eating crushed glass, or something like that. And of course that's a leading question, so the system will make up a story about why it's good to eat crushed glass. And then the reaction on Twitter was, oh my God, people are going to eat crushed glass because they're going to listen to it, which is insane. People are not that dumb.

Well, I think you overestimate people a little bit, but sorry, continue.

Well, there might be a tiny proportion, but for a tool like that it's not clear, particularly since it was really ultimately designed to be integrated into the tools that scientists use to write papers.
And then others, more seriously, said, oh, this is going to destroy scientific publication, because now anybody can generate a nice-sounding scientific paper and submit it to a conference, and this will completely flood and overwhelm the reviewing system that we have in science, and destroy science. I thought that was a completely ridiculous argument, because the reason you might want to submit a paper is that you want to prop up your CV, so you have to put your name on it; otherwise, what's the point? And if you put your own name on it and it's garbage, it's bad for you, it's bad for your career. If you send a hundred papers that are complete nonsense to a conference with your name on them, is that going to be good for your career? Absolutely not. So I think this kind of knee-jerk reaction was completely unwarranted, and it really mirrors a lot of knee-jerk reactions that have happened in the past, when new tools or new communication technologies appeared and every one of them was supposedly going to destroy society. I think it's the same kind of knee-jerk reaction that we're observing with AI today. This is not to say that there is no danger, but it's not the horrible thing that people make it out to be.

Right. So then why not keep it up?
Well, what happened was that the team that worked on it, which is within FAIR and is called Papers with Code, was so distraught by the reaction that they just couldn't take it. They said, we're just going to take it down. This was not a high-level decision; it was not a decision by the communications department or by management. This was them. It's a small team, a few people, and they felt really good about what they had produced. They wrote a very long paper and they open-sourced their code. They took down the demo, but they left the code, so anybody can download it and run it on their own computer.

Interesting.
And this is sort of what we're going to see. I just wrote a story, and we've talked about it on the podcast, about how the battle over AI ethics and applications of AI is just going to be extremely intense, and I think we're starting to see some of that. So we've covered hallucination; good, we got to that. Let's talk about the type of models that you think could get us there. You talk about our move to human-level intelligence needing an understanding of the environment, things that can't be expressed with words. Now, when you tell me that, I'm back to where I was in the early days of our conversations, saying there's no way technology is going to be able to do that. But it sounds like you think there is a chance that it can. How do we get there, and what type of advances are we seeing today that might lead us to think we do have a chance?
So I actually wrote a position paper about this. It's rather long, but the intro is easily readable by non-specialists, and the title of it is "A Path Towards Autonomous Machine Intelligence." In it, I basically lay out a plan, a path forward, to address those questions: to build AI systems that are capable of planning, whose behavior is controlled by objectives that need to be satisfied, which can be specified by hand or learned (things like factual correctness, answering the question, not spewing offensive stuff), and that have internal models of the world, or of the thing they are interacting with, which could be a person. If they're dialogue systems, then when you're talking to someone, you need to have a mental model of what that person knows and can understand. If you speak in terms that the person cannot understand, the communication fails, right? So when you talk to someone, you have something to say, and you may have to give them some background information, depending on what you think they know, and then take them to the stage where you think they've absorbed the information that would be useful to them. So to talk to someone, you need an internal model of what that person knows, and of how that person will react to what you tell them.
If it's another type of agent, one that generates actions in the physical world or even in the digital world (say, a domestic robot that needs to cook or fill up the dishwasher), that system needs an internal model of the world that allows it to predict what the state of the world is going to be as a consequence of its own actions, because that's what you need to be able to plan. If you want to plan a sequence of actions to arrive at a goal, you need to be able to imagine: if I take this action, this will happen; and if I take that action, this will happen; etc. So you can optimize your sequence of actions so that the resulting sequence of states that the world follows satisfies your ultimate goal.
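The "if I take this action, this will happen" loop can be sketched as a search over imagined action sequences. Everything here is a toy assumption (a one-dimensional world, steps of -1/0/+1 as the actions, and a world model that predicts the next state perfectly); it illustrates the planning idea only, not any real system's architecture.

```python
import itertools

def world_model(state, action):
    """Imagined consequence of an action: the internal model's
    prediction of the next state (here, trivially exact)."""
    return state + action

def plan(start, goal, horizon=4, actions=(-1, 0, 1)):
    """Enumerate every short action sequence, roll each one through
    the world model, and keep the sequence whose final imagined
    state best satisfies the objective (ending near the goal)."""
    best_seq, best_cost = None, float("inf")
    for seq in itertools.product(actions, repeat=horizon):
        state = start
        for a in seq:  # simulate the sequence inside the model
            state = world_model(state, a)
        cost = abs(goal - state)  # the objective to satisfy
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq

print(plan(start=0, goal=3))
```

The key structural point matches the discussion: the objective and the world model are separate pieces, and planning happens by simulation inside the model before any action is taken.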
And then the issue with this is: how do we get a machine to learn models of the world? Baby humans and animals do this in the first few months of life, mostly by observation, and acquire a huge amount of background knowledge about the world, basically just by observation. This is the type of learning that we need to reproduce in machines, and we currently cannot do it, except for simple things like text, because you have to understand that text is actually simple compared to the real world.
So how do you do it? Okay, so there's one way to not do it, which is the way large language models are trained. The way you pre-train a large language model is not by training it to just predict the next word in a text. You take a long piece of text and you remove some of the words: you blank them out, you replace them with a blank marker, or you substitute another word, or you apply various corruptions. And then you train some gigantic neural net to predict the words that are missing. This is called self-supervised learning, and this particular instance is what's called a generative model, which is much more general than the usual colloquial use of the term generative model. It's generative in the sense that it produces signals that are the same kind as the input. It tries to fill in the blanks, if you want.
So it generates the missing information, if you want. And this particular instance is called a denoising autoencoder. Why autoencoder? Because you give it an input that is corrupted and you ask it to produce the clean version of that input; that's called a denoising autoencoder. And again, that concept goes back to the 1980s; nothing new there, except that the application of this denoising-autoencoder idea to text, using a Transformer architecture (those very large neural nets we use nowadays, with 40 or 90 layers or whatever, and hundreds of billions, or at least tens of billions, of parameters), works amazingly well, surprisingly well. In the process of being trained to fill in the blanks, those models basically learn to represent language as a series of numbers that represents the meaning of the input sentence, to the extent that this meaning is useful for filling in the blanks.
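The fill-in-the-blanks setup can be sketched with a toy corpus. This is an illustration of the idea only: the three-sentence corpus is made up, and simple co-occurrence counts stand in for the gigantic neural net that would actually score the candidates.

```python
from collections import Counter

# Made-up mini corpus; a real system trains on billions of sentences.
CORPUS = [
    "the cat chases the mouse in the kitchen",
    "the cat chases the laser in the kitchen",
    "the dog sleeps in the kitchen",
]

def corrupt(sentence, position):
    """Blank out one word with a [MASK] marker; return the corrupted
    text and the original word (the self-supervised training target)."""
    words = sentence.split()
    target = words[position]
    words[position] = "[MASK]"
    return " ".join(words), target

def score_candidates(corrupted):
    """Score vocabulary words for the blanked position by counting
    which words fill it in the corpus; a stand-in for the neural
    net's output scores over the whole dictionary."""
    left = corrupted.split()
    i = left.index("[MASK]")
    counts = Counter()
    for sent in CORPUS:
        w = sent.split()
        if len(w) > i and w[:i] == left[:i]:
            counts[w[i]] += 1
    return counts

masked, target = corrupt("the cat chases the mouse in the kitchen", 4)
print(masked)  # the blank replaces "mouse"
print(score_candidates(masked).most_common())
```

No labels were written by hand: the corrupted input and the target word both come from the raw text itself, which is exactly what makes the setup self-supervised.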
so inevitably
by being trained to do this the those
systems understand a little bit about
the real world but not much it's very
superficial you know if I if I train the
system we sentences of the type uh you
know the Cat chases the blank in the
kitchen
you know the blank there can only be a
few things right it's either a mouse
uh or an insect of some type or maybe a
laser spot that someone is you know
playing with the cat uh or something
like that right there's only a few
options
and so those systems are trained to
produce essentially a long list of
numbers that are scores for each of the
possible words in the dictionary for as
you know how likely they are to appear
at that location
and to be able to do a good job at this, the system has to understand a little bit about, you know, what's a kitchen and what's a cat, and, you know, cats catch mice and things like that, and so it learns that, but it's very superficial
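The "scores for each possible word in the dictionary" idea can be illustrated with a deliberately tiny count-based model. The corpus and the context are invented for this sketch; a real system learns these scores with a Transformer over billions of sentences rather than by counting.

```python
from collections import Counter

# Hypothetical toy corpus standing in for real training text.
corpus = [
    "the cat chases the mouse in the kitchen",
    "the cat chases the mouse in the kitchen at night",
    "the cat chases the laser in the kitchen",
    "the dog chases the ball in the yard",
]

vocab = sorted({w for s in corpus for w in s.split()})

def blank_scores(left, right):
    """Score every vocabulary word for the blank in 'left ___ right'
    by counting how often it fills that exact context in the corpus."""
    counts = Counter()
    lw, rw = left.split(), right.split()
    for s in corpus:
        words = s.split()
        for i in range(len(lw), len(words) - len(rw)):
            if words[i - len(lw):i] == lw and words[i + 1:i + 1 + len(rw)] == rw:
                counts[words[i]] += 1
    total = sum(counts.values()) or 1
    return {w: counts[w] / total for w in vocab}  # normalized scores over the vocab

scores = blank_scores("the cat chases the", "in the kitchen")
best = max(scores, key=scores.get)
print(best)  # "mouse" scores highest in this toy corpus
```

The output is exactly the object described above: one score per dictionary word for that blank position.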
um
and uh
and then what you do, once you've trained the system to do this, you chop off the top layers and then you can use the internal representation as input to a subsequent downstream task that you can train supervised, like for example translation or, you know, hate speech detection for example
so this technique has become completely mainstream in natural language processing
um and
so much so that companies like Meta and Alphabet, in their various services, have deployed those things massively for doing things like content moderation, right. So hate speech detection for example has made enormous progress over the last three, four years, and it's essentially entirely due to the fact that we're now using those techniques
okay
we can do hate speech detection in multiple languages basically with a single model, and we don't have to train it with a lot of data in each language, because we don't have a lot of data in every language. We have a lot of data in English and, you know, French and German and blah blah blah, you know, maybe the main languages in India, but India has, you know, an enormously large number of languages that people use, including online, you know, in local dialects and stuff. So, you know, how do you make hate speech detection work in 500 languages
and even 500 would not cover everything
so you need those techniques, and, you know, this has made a huge amount of progress. Now, one thing you can do with those things is you can fine-tune them to just predict the next word, and that's what large language models are
now
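Fine-tuning to "just predict the next word" is, at its core, the loop below. This is a throwaway sketch, not how the real systems are built: a bigram counter stands in for the trained network, and the corpus and prompt are invented; an actual LLM conditions on the whole preceding context, not just the last word.

```python
from collections import Counter, defaultdict

# Invented toy corpus standing in for the real training data.
text = "the cat sat on the mat and the cat ran to the kitchen"
words = text.split()

# Count which word follows which (an LLM learns a vastly richer version
# of this table, conditioned on everything seen so far).
following = defaultdict(Counter)
for a, b in zip(words, words[1:]):
    following[a][b] += 1

def continue_text(prompt, n):
    """Greedy autoregressive generation: repeatedly append the most
    frequent next word, feeding the output back in as context."""
    out = prompt.split()
    for _ in range(n):
        nxt = following.get(out[-1])
        if not nxt:
            break  # no known continuation, stop generating
        out.append(nxt.most_common(1)[0][0])
    return " ".join(out)

print(continue_text("the", 3))
```

The feedback loop, generate a word, append it, predict again, is the part that carries over unchanged to real large language models.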
coming back to this question of planning: last time we talked with Danny Kahneman, which was an amazing discussion. Danny is famous for, you know, this book Thinking, Fast and Slow, and for this idea that we have basically two ways of acting, System 1 and System 2. So System 1 corresponds to tasks that you accomplish subconsciously, you don't need to plan anything, it's completely reactive
so think about, like, you know, you're an experienced driver and you're driving on the highway, you're barely paying attention, and, you know, you're not calling on your sort of high-level cognitive functions to do that, you can do it more or less automatically
uh anything that involves a real-time
reaction like
um you know playing tennis or something
like that that's mostly subconscious
you're not you just don't have time to
plan right so it has to be built into
your muscle memory as we call it right
but even complex tasks, like, you know, you play chess and you are a chess grandmaster, or you're playing a simultaneous game against 50 bad players like me, you don't have to think, you can just look at the board and immediately play. You don't have to plan anything because it's so easy for you, you know, it's become kind of a subconscious task
now all of those tasks, when you learn them, when you're not very good at them yet, you have to use your entire cognitive resources. You learn to drive and you pay attention to everything, you imagine all kinds of catastrophe scenarios, you drive slowly, you're using your prefrontal cortex, your model of the world, that, you know, tells you: I'm driving next to a cliff, and I know that if I turn my steering wheel to the right I'm going to fall off the cliff, and nothing good is going to happen
you don't need to try it to know that
something bad is going to happen right
because you have this model that you've
you've built in your head for the last
17 years if you are a 17 year old
um so that model of the world allows you
to predict the consequences of your
actions and allows you to learn
extremely quickly any new skill
um
you know, same with chess. If you're a bad chess player, you will have to think for, you know, 15 minutes or more when you play against a challenging player
and and plan all kinds of strategies uh
so what characterizes intelligence is the ability to predict, first of all, and then the ability to use those predictions as a tool to plan, by predicting the consequences of actions you might take. Prediction is the essence of intelligence, okay. So now here's the problem:
the denoising autoencoder that we use to pre-train natural language processing systems works for text. It doesn't really work for anything else, for things like video or images
so the natural idea is: you take an image, you block out some pieces of it, and then you train some system to predict the missing pieces, reconstruct those parts, huh?
that's how DALL-E works?
not really, no, okay, I mean, DALL-E 2 uses a diffusion model, which is kind of a slightly different idea, but yeah
uh
but if you do it the way I just described, there are like one or two models that sort of worked that use that. One is called MAE, by my colleagues at FAIR, and that means masked autoencoder. But it doesn't work as well as other techniques, and those other techniques are not generative models, okay, so they're models that do not attempt to reconstruct missing information. They are techniques that attempt to predict missing information, but not reconstruct the image itself, rather reconstruct a representation of that image, an internal representation of that image. Those techniques I call joint embedding architectures
so essentially, in the usual case, you have an encoder that produces a representation of the input, whether it's an image or text or video or whatever, and then you try to reconstruct the uncorrupted version of the input, the input being corrupted, right, that's through a decoder
in a joint embedding architecture you
have two encoders
one encoder sees the perfect version of
the input the other encoder sees a
corrupted version or distorted version
of some kind
you run those two things through both encoders, and then you tell the encoder that sees the corrupted input: can you predict the representation of the full input? But you're not trying to reproduce all the details
and it makes sense for images, or video. So let's imagine we're in a scenario with video, right. So I have a video clip, a complete video clip, and what I do is I mask the last half of the video clip, okay, so the corrupted version of the video clip is just the first half of that video clip, okay, the rest is invisible. And the complete version, of course, is the full video clip. So you run those two video clips through some neural net that produces some representation of the video clip, and then you train this guy to predict the representation that the other guy has produced, and of course implicitly what it has to do is predict the rest of that video clip
in representation space. Now, why is that better than just reconstructing the piece of the video clip that is missing, just predicting it? The reason is, there is an infinite number of things that can happen after a particular video clip, right, and we don't know how to represent a distribution over all the possible things that could possibly happen
we cannot do it in pixel space, right. So for example, you know, you're seeing a green screen behind me right now, you're not seeing the back of my head, okay. So if I start rotating my head, you might be able to predict what the back of my head will look like, and then when you look at it you might be surprised by what you see, maybe I have a small ponytail or something, I don't, but you know
and so there is, you know, an infinite number of variations of what could be possible there, and I could decide to, you know, change the action and move my head in a particular way or something, so you can't just predict every detail of what's going to happen
this is very important if you want to do things like building self-driving cars, because to be able to drive safely you'd like to be able to predict what cars around you are going to do, or what pedestrians are going to do. You see a kid on the sidewalk and, you know, a soccer ball kind of crossing the road, there's a good chance that the kid is going to run after it, right, so you slow down. So, you know, we have these sort of internal models of the world that allow us to predict, and then as a consequence plan kind of safe actions
but then in that same street where the kid is with the soccer ball, you know, it's lined with trees, and there is a pond behind the trees, and it's a windy day, so, you know, all the leaves are moving and there are ripples on the pond and everything
and you don't want to spend any
resources predicting all of those
details which are essentially
unpredictable
and so that's why generative models uh
essentially fail when you train them on
images because there's just so many
details to predict the system gets
completely lost in predicting all kinds
of irrelevant stuff
which we call noise uh but you know how
would the system know
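The joint embedding setup described above can be sketched in a few lines of numpy. Caveats up front: the encoders here are random, untrained linear maps and the "video" is random numbers, so this only shows where the prediction loss lives (in representation space, never in pixel space); it is not a working model of anything.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D, H = 16, 32, 8                      # frames, pixels per frame, embedding size
clip = rng.normal(size=(T, D))           # a full "video clip" (synthetic)
first_half = clip[: T // 2]              # corrupted view: the second half is masked

def encoder(x, W):
    # Stand-in encoder: average the frames, then one linear map and a
    # nonlinearity. A real joint embedding model uses a deep network here.
    return np.tanh(x.mean(axis=0) @ W)

W_target = rng.normal(size=(D, H)) / np.sqrt(D)   # encoder that sees the full clip
W_context = rng.normal(size=(D, H)) / np.sqrt(D)  # encoder that sees the masked clip
W_pred = rng.normal(size=(H, H)) / np.sqrt(H)     # predictor between the two embeddings

z_full = encoder(clip, W_target)          # representation of the complete input
z_half = encoder(first_half, W_context)   # representation of the corrupted input
z_pred = z_half @ W_pred                  # predict the full representation

# The training loss is measured between embeddings, not between pixels,
# so the unpredictable details (leaves, ripples) never have to be generated.
loss = np.mean((z_pred - z_full) ** 2)
print(loss)
```

Training would adjust the weights to shrink this loss (with extra machinery to stop the embeddings from collapsing to a constant, which this sketch ignores).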
so one of the main things I'm proposing in that piece is to abandon those generative models, basically, and focus on those joint embedding architectures for getting models to learn System 2 models of the world, predictive models of the world, okay, and maybe that's what gets us there
okay, let's take a quick break here and then come back for about 10 more minutes. On the show we have Yann LeCun with us, he's the VP and chief AI scientist at Facebook, and known as the father of deep learning, so plenty to talk about. We'll be back right after the break
and we're back here with Yann LeCun, the VP and chief AI scientist at Facebook... Facebook, Meta, I don't know, is it the same? It's called Meta.
Actually I'm also a professor at NYU.
Right, of course, can't forget that
so, yeah, let's go one level deeper about this in the time that we have left. So Facebook, I know, released an application, or at least has it internally, where you can type a sentence and it will make, not an image like with DALL-E, but a small little video that actually resembles that sentence. So is that taking us closer to this ability to predict and understand the world that you're talking about?
yes and no. So there are two little demo applications. One is called Make-A-Scene, and this one is a bit similar to DALL-E, where you type a description of an image and it produces an image. It's based on slightly different ideas, but it actually came before DALL-E 2, right. The main author of DALL-E, by the way, was Aditya Ramesh, a brilliant young guy who was an undergraduate student with me and did some research projects in my lab before going on to OpenAI as an intern and then being hired as a scientist
so there's this thing called Make-A-Scene that produces still images, and then there is a similar thing called Make-A-Video that produces short video clips, essentially, and there are systems that can do this now from Google as well, right, but those are not really publicly available
yeah you can't use it I saw the blog
post I was like oh I want to use this
and then I realized it wasn't public I
imagine the problems that you had with Galactica would pale in comparison to whatever happens with this movie maker with language, where people pay, I think, you know, more attention.
No, the main issues are like, where does the training data come from and things like that. So, you know, before we can roll them out, it has to be trained on data that is publicly acceptable and all that stuff
there is a similar system also in the works, there's been some publication on it, that can generate audio, so we can generate sound effects also from a textual description, or simultaneously with the video. There are systems also from FAIR that produce music, they can do music continuation, so it's like a large language model but with music, right, so you feed it audio and it just continues
interesting
so you're going to see more and more of those things, and Meta is really interested in those things, in sort of creative aids, because everyone who is on an online service, whether it's a social network like, you know, Facebook or Instagram, or whether it's going to be, you know, in the metaverse, everyone is going to need easy ways to generate content and be creative without necessarily being technically astute enough in terms of art to do all that thing. So
being able to, you know, be creative and generate content, I think, is something that's very, very important, and so you're going to see things like that popping up on, you know, Meta's services in various ways: generating images and effects, modifying images, generating video, generating music, sound effects, 3D models, okay, which of course is important for the metaverse
um and uh and eventually you know all
the way up to intelligent virtual
characters and stuff like that really
okay, so Meta is going to go all in on this stuff?
Oh, totally, yeah.
I want to ask you a question about it, because, you know, the thing about, you know, art, songs, videos, all this stuff
there's a musician and songwriter, Nick Cave, who did this post about DALL-E writing... I mean not DALL-E, ChatGPT writing songs, and the authenticity of all of it
now there's been this debate: is this going to take everybody's job, or is it too soulless and dumb to actually do this? This is sort of the argument that this AI really can't do what humans do. So this person writes,
Nick writes: ChatGPT may be able to write a speech or an essay or a sermon or an obituary, but it cannot create a genuine song. It could perhaps in time create a song that is, on the surface, indistinguishable from an original, but it will always be a replication, a kind of burlesque. Songs arise out of suffering, by which I mean they are predicated upon the complex, internal human struggle of creation, and, well, as far as I know, algorithms
don't feel. So I'm curious what you think about that, like, from the perspective of a songwriter, and what this stuff actually produces.
Yeah, there are a lot of questions around this, right. So first of all, it is true that current AI systems don't feel
okay, but that's not going to be true for much longer. So if we have systems of the type that I was describing before, that have objectives and that plan actions to optimize objectives, they will have the ability to predict the outcome of a situation
which means if they predict that they are able to satisfy their goal, given a situation, they're going to feel the equivalent of elation; if they predict that they're not going to be able to satisfy their goal, you know, they're going to be disappointed
if they predict that the situation may
be dangerous for either themselves or
the person they are interacting with
they might actually have you know the
equivalent of fear so if you have
autonomous AI systems that work by
optimizing objectives and they have the
ability to predict they're going to have
emotions it's Inseparable from
autonomous intelligence fascinating so
maybe we'll have an AI that tries to
fulfill some objective and fails
miserably and writes a song about it and
that will be... okay, but a big part of art, I mean, it's certainly true for songwriting, but also for novels and for jazz improvisation and all kinds of stuff, right, is basically a vehicle to communicate human emotions
and Nick Cave has a point, which is that there is nothing to communicate unless you have emotions that the listener or reader can relate to, right. And that really depends on the person, like, you know, I'm totally taken by, you know, good jazz improvisation, but words in songs, like in his case, do nothing to me. I mean, we have different ways of reacting to different things. But, so, you know, he is right in the sense that if you want the genuine experience of art
of communicating human experience or human emotions, it has to come from another human. Even if machines eventually have emotions, they're going to be very different from human ones, so it's not going to replace this type of genuine art, you know
yeah, like, you know, project yourself back a couple hundred years ago, where if you wanted to buy a salad bowl or something, it would be handmade, right. It would be like a potter that would, you know, make it by hand and bake it and everything, right, and as a consequence it would be relatively expensive for most people
and then came industrialization, so now it became possible to build, you know, ceramic bowls for pennies, in sort of large quantities
and did that kill the whole idea of making bowls by hand? No. In fact, handmade objects became more expensive as a result
right, and that leads me to a question that I have to ask you, which is that everybody's asking about the training data that these things are using. In fact, I put a tweet out asking, hey, I'm going to talk about generative AI, so we'll end with this generative AI ethics question, what would you like to know? Everybody wanted to know how the people whose work these databases or programs are trained on are actually going to end up being compensated, and whether we should even allow them to train on stuff that people have created without
their consent. Here's like two of those comments. So Roger McNamee, early Facebook investor, now critic, says, and he didn't know it was going to be a conversation with someone from Facebook, but he says: why should it be legal for computer scientists, entrepreneurs, and investors to profit from unlicensed exploitation of the works of creative humans? And we also had one more, from Marshall Morrow, and he said: the inevitability of intrusive tech is a uniquely American phenomenon. I don't know why American, but anyway, this is what he says: they ask you to post baby pictures and are now training lucrative AI engines with your images. This was not disclosed at the time. Pay me now. So, yeah, when you think about that, what's your reaction?
it's going to be a debate for society to figure out, because I don't think the answer is totally clear. You know, for example, the invention of photography shrunk the market for painted portraits by a lot. It's not like portraiture disappeared, but it certainly reduced the market for it. Recorded music reduced the market for performing musicians. And in every instance of those things there were, you know, collectives of artists saying, like, you know, we have to stop this because this is going to kill our business. They were universally unsuccessful
okay, so you're not going to stop technology, right. Now the question is a legal one. So if you assume that the current legal interpretation of copyright, if you want, is used, then you cannot let those machines plagiarize. So if you use a generative model that's been trained on whatever, and, regardless of the process, it produces a piece of art that is too similar to an existing one, the artist that produced that existing one is entitled to sue the person who is distributing this new piece of art
um
and ask for compensation. Now, what if that generated piece of art is not copyrighted, so nobody can profit from it, who are you going to sue? You know, is there grounds for pursuing? Now if, again, it's a copy, yes, there are grounds for suing, even if you give it away for free
if that piece of art is in the same
style as a known artist
but is not the same piece of the same
painting or whatever it is
then that's where things become complicated, because human artists are absolutely authorized to get inspired by, and, short of outright copying, to work in someone else's style. That happens absolutely all the time in art. And so would it make sense to apply a different rule for, call them artificial artists, right, that generate things? Like, you know, they can get inspired by whatever they've been trained on, but then they cannot just reproduce it
so that would be a perfectly fine thing
to do I imagine that a lot of artists
would not be happy with this uh but
you know that's a it's a definite
possibility. Now, perhaps what you might want to do is what, you know, the early internet also put in place, which is that you can put a little file, you know, robots.txt or something, that tells search engines and crawlers: do not use my content for anything, like, you know, you can't index it, you can't use it for anything. So if you're an artist and you don't want your content to be used, you know, lobby for this kind of stuff, for this kind of policy to be respected by, you know, crawling algorithms
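For reference, the opt-out mechanism he's alluding to looks like this: a `robots.txt` file at the root of a site. The directives below are illustrative; `GPTBot` is one real crawler token (OpenAI's), but a standard token specifically for AI-training opt-outs is exactly what's still being debated.

```text
# robots.txt at the root of a site: asks compliant crawlers to stay away.
User-agent: *          # applies to every crawler
Disallow: /art/        # do not fetch this directory

User-agent: GPTBot     # OpenAI's training crawler honors its own token
Disallow: /
```

Note that this is a convention, not an enforcement mechanism: it only works to the extent that crawlers choose to respect it.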
um
and then, you know, the next step would be: if you use my artistic production as training data, you owe me money. And I think that's a bit extreme
so, you know, the answer is probably somewhere in between, not clear where. And it's not for people like me to decide, like, you know, I have no particular legitimacy to tell people this is the right thing to do. I'm a scientist, I'm an AI scientist. I think this is for society at large to decide, through the usual democratic process
um
but you have to be careful. In one of the remarks that you mentioned, it was said, like, you know, why should a large tech company in California profit from my art, me, poor artist. This is the wrong way to frame it, because within a year or two, you know, any teenager in their parents' basement is going to be able to do this. So you don't want to use the, you know, current bad press the tech industry has to motivate people to kind of go after this, because you're going to lose if you do. This is not something... you know, the big tech companies are just the first ones to have the technology to do this, but eventually everybody is going to be able to use this kind of stuff and train their own models and crawl the internet, right, I mean, you know, countless startups and, you know, young
programmers. Yeah, and it is a point I made in my most recent newsletter, that we are not going to see the centralization, I'm sure you agree, that we saw with, you know, communication online, like with the Facebooks and the Twitters. AI is going to be much more distributed
the last thing I'll say is that I'm so torn about this, because I have been plagiarized, by someone using my words, feeding them into a generative AI system, and then publishing them as their own. That's already happened to me
however, I just love these systems. I love using DALL-E to illustrate my stories, and I love speaking with ChatGPT, I mean, I have it on speed dial, it's so fun to speak with. So, okay, that will do it for us, unfortunately we're out of time. Yann, I feel like we could speak for hours each time, but unfortunately that's the end of the show, so thank you so much for coming on, we'll have to do it again soon
well, thanks for having me, guys.
Okay, great, it was a pleasure, awesome.
yeah sorry for