February 18, 2017, at the Cloudflare HQ. Work began winding down this Friday afternoon as the weekend approached. Morale was high, and no one was prepared for the disaster that was about to happen; the engineers were excited to go home early and enjoy the beautiful weekend free of any operational issues. Suddenly, at 4:11 PM Pacific time, at the friendly neighborhood Google complex, one of the Googlers working in Project Zero, a security research team, discovered a severe issue with Cloudflare's system. He immediately reached out through the most sensible channel for something urgent like this, and first contact was made minutes later. It was now 4:32 PM, and the alarming details of the report were made clear to Cloudflare, suggesting a possible widespread data leak. Always a Friday afternoon, isn't it?
There goes my weekend.

You may have seen Cloudflare's DDoS mitigation service before. It is built on top of their primary product, a content delivery network, or CDN. CDNs came into existence in the 1990s to speed up the delivery of internet content. They're kind of like distribution centers: Amazon isn't just going to have a single warehouse in the middle of the United States that every delivery driver starts from; there are many spread all across the country, and they store, or should I say cache, commonly sold items to minimize delivery time.
Similarly, it makes no sense to deliver internet content to all users across the world from a single centralized source. A CDN will have many points of presence across the world, with edge servers that cache content from the origin server. When a user makes a request for a particular website, the request is directed to the nearest edge server, where the content is most likely already cached.
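To make that concrete, here is a hedged little sketch of what an edge server's cache lookup boils down to. Everything in it (the edge_cache array, fetch_from_origin, and so on) is hypothetical and purely illustrative, not Cloudflare's code.

```c
/* Illustrative edge cache: serve from the local cache when possible,
 * otherwise fetch from the origin server and remember the result. */
#include <stdio.h>
#include <string.h>

#define MAX_ENTRIES 16

struct entry { const char *url; const char *body; };
static struct entry edge_cache[MAX_ENTRIES];
static int n_entries;

/* Stand-in for a round trip to the customer's origin server. */
static const char *fetch_from_origin(const char *url)
{
    printf("cache miss, fetching %s from the origin\n", url);
    return "<html>hello</html>";
}

static const char *serve(const char *url)
{
    for (int i = 0; i < n_entries; i++)
        if (strcmp(edge_cache[i].url, url) == 0)
            return edge_cache[i].body;             /* cache hit: answer locally */

    const char *body = fetch_from_origin(url);     /* cache miss: go to origin  */
    if (n_entries < MAX_ENTRIES)
        edge_cache[n_entries++] = (struct entry){ url, body };
    return body;
}

int main(void)
{
    serve("example.com/index.html");   /* miss: fetched from the origin */
    serve("example.com/index.html");   /* hit: served from the edge     */
    return 0;
}
```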
It was here that Cloudflare not only returned the requested website, but also cookies, keys, and other sensitive customer data. This is what it would look like, and look, plenty of useful information could be extracted from the leaked memory: full HTTPS requests, IP addresses, responses, passwords. And who knows how long this exploit had been out there; bad actors could have already compromised thousands of companies, and Cloudflare's monitoring evidently did not self-detect this issue, as a third party had to identify it and reach out to them. Data leakage like this can come with hefty consequences: FTC fines, lawsuits, and increased audits, but most importantly of all, it degrades customer trust. No customer trust, no customers; no customers, no revenue; no revenue, no taco Tuesdays. To make matters worse, search engines like Google also regularly index and cache websites, so this leaked data could also be accessed through Google's cache.

4:40 PM. Now this was serious business. Everyone immediately assembled in San Francisco, maybe even with some cross-company action
with the Google employees. The engineers noticed in the dashboards that the occurrence of this bug seemed to correlate with usage of the Email Obfuscation feature, which was also an immediate suspect, as there had been a recent deployment to partially migrate it to a new HTML parser. Either way, every feature that Cloudflare ships comes with a feature flag, and engineers immediately flipped what they called the global kill, which would prevent all customers from using the feature.
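As a rough idea of what a global kill amounts to, here is a hypothetical sketch (not Cloudflare's actual flag system): every feature's code path is gated on a flag that can be flipped off for everyone at once.

```c
/* Hypothetical feature flag with a global kill switch. In reality the flag
 * would live in a config system checked per request; names are made up. */
#include <stdbool.h>
#include <stdio.h>

static volatile bool email_obfuscation_global_kill = false;

static void apply_email_obfuscation(const char *html)
{
    if (email_obfuscation_global_kill)
        return;                       /* feature disabled for ALL customers */
    printf("obfuscating emails in: %s\n", html);
}

int main(void)
{
    apply_email_obfuscation("<html>a@b.com</html>");  /* feature runs         */
    email_obfuscation_global_kill = true;             /* flip the global kill */
    apply_email_obfuscation("<html>a@b.com</html>");  /* feature skipped      */
    return 0;
}
```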
By 5:22 PM PST, about an hour after the initial report, Email Obfuscation had been disabled worldwide. However, the bug was still occurring.
On the other side of the Atlantic, the London team had joined the call. All hands on deck; it was time to spend the Friday night debugging and rethinking life.
8:24 PM PST, four hours in. Another two features were found to be problematic: Automatic HTTPS Rewrites and Server-Side Excludes. Automatic HTTPS Rewrites was shut down immediately with its global kill. Server-Side Excludes, however, was such an old feature that it predated the practice of deploying with global kills. The engineers were at a crossroads here: they could release a patch for this feature to allow it to be turned off, but that would take some time for implementation and deployment. Alternatively, they could spend time root-causing the issue and deploy a single proper fix, but the root cause was not apparent. Thus, the engineers began working on the global kill for Server-Side Excludes and readied it for deployment.

11:22 PM PST,
seven hours in. As the night progressed, the streets outside the San Francisco office grew quieter, and it was daybreak in London; the engineers were more than ready to sign off and get some much-needed sleep. The patch to turn off Server-Side Excludes was finally deployed worldwide, but there was still much work to be done: cached data from search engines still needed to be purged, and without knowing the true root cause, a recurrence was still within the realm of possibility.

But what could have caused this? Well, edge servers contain software to perform all kinds of operations on the content they deliver, and this was the clear common denominator among the three aforementioned features: they all parsed and modified the returned HTML content in some way. Email Obfuscation would erase any email addresses in the returned webpage if the requester's source IP was deemed suspicious. Server-Side Excludes is very similar: it can automatically hide content wrapped in a special tag from suspicious source IPs. Automatic HTTPS Rewrites would simply rewrite any HTTP content embedded in the returned website to HTTPS.
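As a hedged illustration of that last one (a toy in plain C, not Cloudflare's implementation), the idea is simply to scan the HTML the edge server is about to return and swap embedded http:// links for https://.

```c
/* Toy version of an "Automatic HTTPS Rewrites"-style pass: copy the page,
 * replacing http:// with https://. Purely illustrative. */
#include <stdio.h>
#include <string.h>

static void rewrite_http_to_https(const char *html, char *out, size_t outlen)
{
    size_t o = 0;
    for (const char *p = html; *p != '\0' && o + 1 < outlen; ) {
        if (strncmp(p, "http://", 7) == 0 && o + 8 < outlen) {
            memcpy(out + o, "https://", 8);   /* 8 chars, no terminator yet */
            o += 8;
            p += 7;
        } else {
            out[o++] = *p++;
        }
    }
    out[o] = '\0';
}

int main(void)
{
    char out[128];
    rewrite_http_to_https("<img src=\"http://example.com/a.png\">", out, sizeof out);
    puts(out);   /* <img src="https://example.com/a.png"> */
    return 0;
}
```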
Furthermore, these three features all used the new HTML parser mentioned earlier, cf-html. The engineers, however, found nothing suspicious in the code despite thorough verification. It wasn't until the next few days that the root cause was made clear.
Now, Cloudflare had originally been using a parser generated with Ragel, and they were looking to migrate to something simpler and more maintainable. It was in this self-described ancient piece of software that the bug took root. Ragel is a parser language that no one knows how to pronounce, which works by defining finite state machines with regular expressions and performing various actions based on the match results. You can think of it like those flowcharts where you start at one state and transfer to different states based on various conditions; for example, here is a machine which matches consecutive numbers and letters. In practice, Ragel code is embedded within C, here using the double percent signs.
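If you've never seen Ragel, here is a minimal, hedged example of what that embedding looks like. It is my own toy machine, not the one shown in the video: it walks a string and reports each digit or letter it matches, and would be compiled with something like `ragel -C runs.rl`.

```c
/* runs.rl -- toy Ragel machine embedded in C (illustrative only).
 * Build: ragel -C runs.rl -o runs.c && cc runs.c -o runs */
#include <stdio.h>
#include <string.h>

%%{
    machine runs;
    write data;
}%%

static void scan(const char *str)
{
    const char *p  = str;                /* data pointer: current position  */
    const char *pe = str + strlen(str);  /* data end pointer: end of buffer */
    int cs;                              /* current state of the machine    */

    %%{
        action on_digit  { printf("digit:  %c\n", fc); }
        action on_letter { printf("letter: %c\n", fc); }

        # Consume any mix of digits and letters, firing an action per character.
        main := ( digit @on_digit | alpha @on_letter )*;

        write init;
        write exec;
    }%%

    if (cs < runs_first_final)
        fprintf(stderr, "hit a character that is neither a digit nor a letter\n");
}

int main(void)
{
    scan("abc123xy");
    return 0;
}
```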
It can then compile down to C, C++, Java, and more. Ragel is actually fairly readable and concise after a bit of getting used to, and I'd imagine it's very performant. Or maybe one engineer a long time ago thought it was a fun language and implemented it themselves with minimal communication, and it ended up just working fine; everyone else went "ain't broke, don't fix it," and it was untouched until now.
So, the HTML web page consumed by the Ragel parser is represented by a series of data buffers, with each buffer containing a portion of the HTML code. Each time the Ragel parser is invoked to consume a buffer, the user needs to pass in data pointers initialized to the beginning and end of the buffer: Ragel uses p to iterate through the buffer and pe to tell when the buffer has been fully parsed.
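In other words, the calling code drives the parser buffer by buffer, roughly like this hedged sketch (the exec_chunk stub stands in for the Ragel-generated exec block; none of this is Cloudflare's code):

```c
/* Driving a streaming Ragel-style parser over several buffers: the machine
 * state cs survives across calls, while p and pe are re-initialized to the
 * start and end of each new buffer. */
#include <string.h>

static void exec_chunk(int *cs, const char **p, const char *pe)
{
    (void)cs;             /* a real machine would read and update its state */
    while (*p != pe)      /* the generated code tests for the exact end     */
        (*p)++;           /* ...real code would run state transitions here  */
}

int main(void)
{
    const char *buffers[] = { "<html><script ", "type=\"text/javascript\">" };
    int cs = 0;           /* current machine state, kept between buffers */

    for (int i = 0; i < 2; i++) {
        const char *p  = buffers[i];                      /* start of buffer */
        const char *pe = buffers[i] + strlen(buffers[i]); /* end of buffer   */
        exec_chunk(&cs, &p, pe);
    }
    return 0;
}
```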
In Cloudflare's case, one of the things they wanted to parse was HTML attributes within script tags, such as type or src.
Taking a look at the Ragel code, this script_consume_attr machine will try to match this regular expression: attribute characters followed by a space, slash, or closing angle bracket. Then we have a few actions. This is an entering action, which is performed when the machine starts; it simply logs that the machine is running. This at symbol marks a finishing action, which is performed when the machine completes successfully. Here we call fhold, which is equivalent to p-- and will move the pointer back by one; this is likely because the script_tag_parse machine it proceeds to jump to expects to consume the space, slash, or angle bracket character that the attribute machine would have already matched, as those are also part of the tag. There is also a local error action, which is performed when an attribute fails to match: there's a log here for the failure, and then it recurses and tries to parse the next attribute.
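Put together, the machine being described looks roughly like this. This is a hedged reconstruction of the fragment shown on screen, following the version Cloudflare later published; the comments are mine.

```ragel
script_consume_attr := ((unquoted_attr_char)* :>> (space|'/'|'>'))
  >{ ddctx("script consume_attr"); }        # entering action: log that we started
  @{ fhold; fgoto script_tag_parse; }       # finishing action: step p back one char,
                                            # then jump to the script_tag_parse machine
  $lerr{ dd("script consume_attr failed");  # local error action: log the failure...
         fgoto script_consume_attr; };      # ...and try to parse the next attribute
```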
Going back to the success case: after exiting back to script_tag_parse, the many parser machines will continue until the end of the buffer is reached. But how do we know we've reached the end of the buffer? Well, if the data pointer p is equal to the data end pointer pe, then we have surely reached the end of the buffer.
So it turns out that something very bad happens if there is an unfinished attribute at the very end of a web page. When this happens, the failure to match occurs while the data pointer p is already equal to the data end pointer pe. The parser then reinvokes itself, now at risk of parsing undefined heap memory. Let's see if the buffer end check saves us... oh man. The pre-increment causes p to skip over pe and never be equal to it. A tricky mistake.
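Here is a tiny C demonstration of why that equality check is so fragile (purely illustrative, not the generated code): once anything nudges p past pe, p == pe never becomes true again, whereas a >= check would still stop the loop.

```c
/* Illustrative only: an unfinished attribute ends exactly at the end of the
 * buffer, the error path re-enters the machine, and the pre-increment pushes
 * p one past pe. An equality test can no longer stop the parser. */
#include <stdio.h>

int main(void)
{
    char buf[] = "type";            /* page ends mid-attribute              */
    const char *pe = buf + 4;       /* end of the data we actually received */
    const char *p  = buf + 4;       /* the failed match stopped right here  */

    ++p;                            /* error path re-enters; pre-increment skips over pe */

    printf("p == pe ? %s\n", p == pe ? "stop" : "keep parsing past the buffer");
    printf("p >= pe ? %s\n", p >= pe ? "stop" : "keep parsing");
    return 0;
}
```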
But wait, this is a bug in the old parser, which had been in use for years. Has Cloudflare been leaking data all this time? No. It was actually the migration to the new parser that triggered the issue.
Going back to the buffer overrun we were talking about before: if there are more buffers to come, the unfinished tag could just be due to the rest of the element being in the next buffer, so the error action will not be invoked. The error action is only triggered on an unfinished match within the very last buffer, as there is no more data at that point to complete the match. This is why, in the example, the unfinished attribute is at the very end of the page, that is, at the very end of the last possible buffer. However, the key here is that historically, when only the old parser was used, it would always receive an extra dummy last buffer that had no content. Why? No particular reason; it just did. This meant that for a website that ended with an unfinished tag, the unfinished tag would sit in the second-to-last buffer and the error action would not be triggered; then, since the last buffer was empty, the parser would not overrun it either. After the new parser was introduced, this behavior changed: the empty last buffer was no longer present in the buffer sequence passed to Ragel, causing the unfinished tag to land in the last buffer and making the overrun possible. Perhaps the new parser cleaned up the empty last buffer before passing data to the old one. This also meant that the bug could only occur when a customer enabled features which, in combination, used both the old and new parsers.
So what can be learned from this failure? Well, here we see a classic example of backwards compatibility: no matter how dumb the behavior of something is, if it's been set in stone for a long time and you change it, something is definitely going to break. However, it's not always so easy to maintain backwards compatibility. Obviously, Microsoft can easily choose not to deprecate the ability for Windows to run 32-bit programs, but cf-html removing the last buffer, or perhaps more accurately not adding the extra dummy buffer back for no reason, is something that can easily be overlooked. And it was not just this, but also a bug in the existing code, plus a very specific type of input, that in combination caused the data leak. When you consider even larger systems with dozens of interlocking components, each with millions of possible inputs, it's clear that there will inevitably be bugs in all software.
So what can be done to minimize the impact? Cloudflare mentions fuzzing the generated code to search for pointer overruns, as well as building test cases for malformed web pages. There are also various memory management techniques that can reduce the impact, and this likely could also have been caught by static code analysis.
coding standards for ratio are not very
clear but for my limited experimentation
I don't think it is possible for Rachel
to naturally overrun the buffer it's
possible to underrun the buffer by
spamming fold but radio's default
Behavior seems to make overrunning
impossible there's no radial command to
force iteration of the data pointer and
the way radio iterates the data pointer
forward naturally is as follows always
explicitly checking if it has reached
the data end this points to cloudflare
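From memory, the relevant part of the generated C looks roughly like this simplified fragment (not verbatim output; exact labels and state handling vary by machine). The point is that the only place p moves forward is bundled with a test against pe.

```c
/* Simplified shape of a Ragel-generated exec loop (illustrative, not verbatim). */
_again:
    if ( ++p != pe )
        goto _resume;     /* more data in this buffer: keep running the machine */
_test_eof: { }            /* otherwise fall through to end-of-buffer handling   */
```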
This points to Cloudflare potentially having gone in and modified the generated C code rather than the Ragel code itself, something that would obviously not be Ragel best practice.
Two days later, pointer checks to detect memory leaks were rolled out, and three days later the engineers determined it was safe enough to re-enable the three aforementioned features. Cloudflare then worked with the various search engines to purge their caches of affected websites. In terms of overall impact, evidence suggests that it was quite small: there were quite a few conditions that needed to be met for the bug to manifest, and Cloudflare claims there is no evidence of the bug being leveraged for any attacks. We know that 0.6 percent of Cloudflare websites ended with unfinished tags and that the bug occurred more than 18 million times, so it is reasonable to say that Cloudflare just got really lucky. In fact, one of the features which could trigger this bug had been available as far back as November 2016. Had this exploit fallen into the wrong hands, or occurred more recently now that Cloudflare is so much bigger, there may not have been such a happy ending.