Trojan Source: Raw-Bin Hood Pick Pocketing Source Code with Unicode Bidirectional Control Characters


Researchers at the University of Cambridge in the UK have disclosed a new type of vulnerability in which Unicode Bidirectional Control Characters make source code appear one way in an IDE or text editor while the compiler or interpreter processes it in a different order. Here is a link to the original paper and a GitHub repository released by the authors that includes proof-of-concept code samples for virtually every popular language, including C, C#, C++, Go, Java, Ruby, Python, JavaScript, Rust, and more. Two CVEs have been issued, CVE-2021-42574 and CVE-2021-42694, both with a severity score of 9.8 “Critical”.

Unicode needs Bidirectional Control Characters because it is meant to be a super encoding standard that holds all languages (and even emojis) in a single standard, as opposed to, say, ASCII, which covers only English. Some languages are written and read from right to left (RTL) instead of left to right (LTR), so Bidirectional Control Characters tell an application which direction to display a run of characters. However, as mentioned above, it has been discovered that IDEs do not display files containing Bidirectional Control Characters the same way that compilers read them: the editor reorders the text for display, while the compiler processes the characters in the order they are stored. OOOOOooops. That’s bad, since you might be looking at source code that looks one way but will be built and compiled to work differently.
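To make the characters in question concrete, here is a minimal Python sketch that prints the code point, bidirectional class, and official Unicode name of nine directional formatting characters. The choice of these nine (the embeddings, overrides, isolates, and their terminators) and the names `BIDI_CONTROLS` and `describe_bidi_controls` are assumptions made for illustration, not an exhaustive or authoritative list.

```python
import unicodedata

# Assumed set for illustration: embeddings, overrides, isolates, and the
# characters that terminate them.
BIDI_CONTROLS = [0x202A, 0x202B, 0x202C, 0x202D, 0x202E,
                 0x2066, 0x2067, 0x2068, 0x2069]


def describe_bidi_controls():
    """Return (code point, bidi class, Unicode name) for each character."""
    return [
        (f"U+{cp:04X}",
         unicodedata.bidirectional(chr(cp)),
         unicodedata.name(chr(cp)))
        for cp in BIDI_CONTROLS
    ]


for code, bidi_class, name in describe_bidi_controls():
    print(f"{code}  {bidi_class:<3}  {name}")
```

None of these characters has a visible glyph, which is why an editor can reorder the text around them without giving any hint that they are there.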

In response, GitHub has started showing a warning notice on source code files that include Unicode bidirectional control characters. However, since it’s unlikely that you will look at every file in a repository before you run the code, that won’t help much at scale. Perhaps the warning should be issued at the top of a repository if ANY of its files include those characters.

Here let’s look at a bit of code from the proof-of-concept repository in a file called :


You might expect the output to show that Alice is down $50.  But, it looks like this is Wonderland with hidden surprises that keep Alice’s bank account full.

What’s going on here?  First let’s look at the output of the file when printed from the terminal with the cat -v command:

Looks a little different, doesn’t it? Look specifically at the first line after the function definition: the return statement has magically moved outside of the quotation marks. The semicolon ‘;’ character is required to allow multiple statements on one line, which lets the ‘return’ execute without a syntax error. So, what looked like a comment line was actually a line that contained a comment and a statement to execute. The function simply returns before it can execute the rest of the code in the function.
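The effect can be reproduced in a few lines of Python. The sketch below is a hypothetical reconstruction written for this post, not a copy of the file in the proof-of-concept repository: it assumes a single RIGHT-TO-LEFT ISOLATE (U+2067) placed just before the closing triple quote, and the helper name `reveal` is made up. It runs the source to show that the balance never changes, then prints the suspicious line with the hidden character escaped, similar in spirit to what cat -v shows.

```python
# Hypothetical reconstruction for illustration; not the PoC repository's file.
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE, invisible in most editors

# The source is stored in "logical" order, which is what the interpreter reads.
source = (
    "bank = {'alice': 100}\n"
    "\n"
    "def subtract_funds(account, amount):\n"
    "    ''' Subtract funds from bank account then " + RLI + "''' ;return\n"
    "    bank[account] -= amount\n"
    "\n"
    "subtract_funds('alice', 50)\n"
)


def reveal(text):
    """Replace every non-ASCII character with a visible backslash escape."""
    return text.encode("ascii", "backslashreplace").decode("ascii")


namespace = {}
exec(source, namespace)  # run the source the way the interpreter sees it

print(namespace["bank"])              # the subtraction line is never reached
print(reveal(source.split("\n")[3]))  # the docstring line, hidden character exposed
```

In logical order the docstring closes before `;return`, so the interpreter sees a string, a semicolon, and a real `return` statement, while a bidi-aware editor may draw the `return` as if it were inside the quotes.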

This digital sleight of hand was accomplished by using Unicode Bidirectional Control Characters. The raw binary contains them, but the IDE or text editor does not really display them; it “interprets” them. The vulnerability arises because, as the researchers realized, compilers interpret them differently. As noted earlier, these control characters exist because Unicode is meant to hold all languages in a single character set, some of which are written from right to left instead of left to right, so control characters are used to mark whether a block of characters runs right to left (RTL) or left to right (LTR).

So, these bidirectional characters can be maliciously placed to move blocks of characters; in the example above, they move the “return” statement outside of the Python triple-quoted comment. The only good news is that, while this has been demonstrated across all major languages, not every version of every language’s compiler is exploitable.

Rapid7 seems to discount the vulnerability, and actually questions whether this should be considered a vulnerability at all! What is it then, a feature?!?! You can view their scoring schema here, which is very different from the 9.8 score issued by MITRE. I have a few problems with their interpretation of the vulnerability. First, since it took me about 30 minutes to fully understand the issue and write exploit code, I would say the attack complexity is pretty low. Basement low. Trivial.

On the other hand, Krebs On Security titled their article “‘Trojan Source’ Bug Threatens the Security of All Code”. So, apparently Brian Krebs sees this at the opposite end of the risk scale. I’m siding with Brian Krebs.

This vulnerability is the pickpocket of vulnerabilities. I think a fun name for it is “Raw-Bin Hood”, since the raw binary of the file will show characters that the IDE or text editor won’t. So don’t get hoodwinked. Scan your source code for hidden control characters before you compile.
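A scan like that does not need much code. The following is a minimal sketch, assuming the same nine directional formatting characters matter and that your source files are UTF-8; the function names `find_bidi_controls` and `scan_tree` are made up for this example, and a real pipeline might prefer a linter or compiler warning where one is available.

```python
import unicodedata
from pathlib import Path

# Assumed set for illustration: embeddings, overrides, isolates, terminators.
BIDI_CONTROLS = frozenset(
    "\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069"
)


def find_bidi_controls(text):
    """Yield (line, column, Unicode name) for each bidi control in text."""
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                yield line_no, col, unicodedata.name(ch)


def scan_tree(root):
    """Return {path: findings} for every UTF-8 file under root with a hit."""
    report = {}
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binaries and unreadable files
        findings = list(find_bidi_controls(text))
        if findings:
            report[path] = findings
    return report


# Small in-memory demonstration; line 2 hides a right-to-left override.
sample = "ok = 1\nx = 'a\u202eb'\n"
print(list(find_bidi_controls(sample)))
```

Calling `scan_tree(".")` from the top of a checkout would give the repository-wide answer: an empty dictionary means none of the readable files contain these characters, and anything else lists exactly where to look.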

