FCSC 2021 Qualification - Battle ship (misc / hardware)

FCSC 2021 Qualification - Battle Ship

This challenge was about a side channel attack on a shared cache between an untrusted and a trusted module in a chip8 emulator.

To help us understand a bit the architecture of the software we had a well detailed schema and the python source code of the emulator. It was a really nice challenge I really appreciated as the python code was a pleasure to read.

Understanding the architecture

The green part corresponds to the untrusted module and red one is trusted. We can directly execute code in the memory space of the untrusted one but not in the trusted. However we can call routine stored in the trusted.

img

We also had this information (in french sorry) :

L’élément sécurisé n’est rattaché à aucun périphérique. Il contient du code pré-compilé dans lequel vient s’ajouter la clé secrète précédemment générée.

Les opérations se déroulent presque toutes en 1 cycle sauf pour les opérations qui nécessitent l’utilisation de l’unité arithmétique et logique (UAL) auquel cas, 2 cycles seront nécessaires : un cycle pour soumettre le calcul à l’UAL, un cycle pour récupérer le résultat. Pour palier ce problème de performance, le constructeur a mis au point un système de cache partagé entre la VM utilisateur et l’élément sécurisé. Le cache repose sur un algorithme de type LRU et est placé en amont de l’UAL. Ainsi, si une opération a récemment été réalisée, le CPU a directement connaissance du résultat coûtant donc qu’un seul cycle.

En lien avec la gestion du cache, le constructeur a ajouté le code opération 00E1. Cette nouvelle instruction permet de vider toutes les lignes du cache de l’UAL.

Le constructeur a également implémenté deux autres codes d’opération : 0000 et 0001. Ils permettent d’interagir avec l’élément sécurisé en lui demandant de chiffrer une entrée de 10 octets avec la clé secrète ou de lui demander de vérifier si une entrée de taille 10 octet correspond à la clé secrète. Le code des deux routines est fourni. Un troisième code d’opération est implémenté : FFFF. Il permet de notifier la VM de la fin de l’exécution de la ROM.

When calling a routine the address space associated to the RAM is copied from the untrusted module to the trusted one and then some information are sent back through some registers I and VF. These are the only interactions between the two modules.

But there is also another shared thing : the LRU cache. The untrusted module can access trusted module cache value and vice versa. This way the architecture is vulnerable to a potential side channel attack on the cache.

To interact with the emulator we can simply send a custom hex encoded ROM that will be executed in the untrusted module.

Determining the goal

Thanks to the emulator source code we can identify where the flag is injected.

if msn == 0:
    if n == 0x0:
        se = self.cpu.context.secure_element
        se.reset()
        se.set_context(
            i=self.cpu.processor.i,
            memory=self.memory.data[self.memory.O_STACK:self.memory.SIZE]
        )

        se.exc_encrypt()
        self.cpu.processor.v[0xF] = se.cpu.tick

        elif n == 0x1:
            se = self.cpu.context.secure_element
            se.reset()
            se.set_context(
                i=self.cpu.processor.i,
                memory=self.memory.data[self.memory.O_STACK:self.memory.SIZE]
            )
            flag = se.exc_verify()
            self.cpu.processor.v[0xF] = se.cpu.processor.v[0xF]
            if self.cpu.processor.v[0xF] == 0:
                if self.cpu.processor.i+10+len(flag) > self.memory.SIZE:
                    raise MemoryError("Not enought memory space")
               self.memory.data[self.cpu.processor.i+10:self.cpu.processor.i+10+len(flag)] = flag 

As noticed in the challenge information, when the execution returns from the trusted module it copies the VF and I registers to the untrusted registers. Following the se.exc_verify we see that if the VF register is equal to 0 the flag will be loaded in the untrusted memory making it accessible for us.

We have now to understand how VF can be zero at the end of the function.

class TrustedContext(BaseContext):
    def __init__(self):
        super().__init__()
        self.execution_key = os.urandom(10)
        self.co_encrypt = routines.preco_encrypt.format(*self.execution_key)
        self.co_verify = routines.preco_verify.format(*self.execution_key)

    def set_context(self, i=None, memory=None):
        if i:
            self.cpu.processor.i = i
        if memory:
            assert len(memory) == Memory.SIZE - Memory.O_STACK
            self.memory.data[Memory.O_STACK:Memory.SIZE] = memory

    def exc_encrypt(self):
        self.load(self.co_encrypt)
        self.run()

    def exc_verify(self):
        self.load(self.co_verify)
        self.run()
        return flag.FLAG_LV2

We see in the above code that exc_verify load a pre-compiled routine named co_verify and pass it 10 random bytes. Lets disassemble this routine.

For the disassembly I simply used https://github.com/mwales/chip8 which do a great work :

0x0200  ld v1, #01
0x0202  ld v2, #00

0x0204  ld v0, [I]
0x0206  ld v3, #71
0x0208  xor v3, v0
0x020a  or v2, v3
0x020c  add I, v1

0x020e  ld v0, [I]
0x0210  ld v3, #43
0x0212  xor v3, v0
0x0214  or v2, v3
0x0216  add I, v1

0x0218  ld v0, [I]
0x021a  ld v3, #7c
0x021c  xor v3, v0
0x021e  or v2, v3
0x0220  add I, v1

0x0222  ld v0, [I]
0x0224  ld v3, #8e
0x0226  xor v3, v0
0x0228  or v2, v3

0x022a  add I, v1
0x022c  ld v0, [I]
0x022e  ld v3, #d6
0x0230  xor v3, v0
0x0232  or v2, v3

0x0234  add I, v1
0x0236  ld v0, [I]
0x0238  ld v3, #af
0x023a  xor v3, v0
0x023c  or v2, v3

0x023e  add I, v1
0x0240  ld v0, [I]
0x0242  ld v3, #60
0x0244  xor v3, v0
0x0246  or v2, v3
0x0248  add I, v1

0x024a  ld v0, [I]
0x024c  ld v3, #11
0x024e  xor v3, v0
0x0250  or v2, v3
0x0252  add I, v1

0x0254  ld v0, [I]
0x0256  ld v3, #d6
0x0258  xor v3, v0
0x025a  or v2, v3
0x025c  add I, v1

0x025e  ld v0, [I]
0x0260  ld v3, #10
0x0262  xor v3, v0
0x0264  or v2, v3
0x0266  add I, v1

0x0268  ld vf, v2
0x026a  BAD OPCODE 0xffff

The code is constituted of 10 patterns which do the same thing. It loads a value from the I register add 1 to this address, deference a byte and xor this byte with one of the random 10 bytes and finally stores the result in V2.

To resume it, xor a string provided through the I register with 10 random bytes and store the result in V2 before placing it in VF. So VF will be 0 only if we provided the same 10 characters as the random one.

However it’s impossible to guess this 10 bytes.

The shared cache vulnerability

Lucky us, there is another routine we can access trough the 0x00e0 instruction. This routine will simply xor the provided string against the same 10 random bytes and will return the number of CPU cycle executed trough the VF register.

As the 10 random bytes are used to xor our input are the same we need to guess we could simply send 10 null bytes in order to leak it. However the result of this encryption resides in the trusted module ram we can’t read.

But there is something really important :

Les opérations se déroulent presque toutes en 1 cycle sauf pour les opérations qui nécessitent l’utilisation de l’unité arithmétique et logique (UAL) auquel cas, 2 cycles seront nécessaires : un cycle pour soumettre le calcul à l’UAL, un cycle pour récupérer le résultat.

If the UAL operation is already in the CPU cache it will only do 1 cycle instead of 2.

Thats means we can know if an operation with it’s 2 operand has already been executed by looking at the number of cycle returned.

After some tests I noticed the encrypt routine will need 61 cycles to be executed. If the routine is executed with less than 61 it means that a xor with the two operands already have be done before.

Test case

For example lets take 0x41 as the first random byte.

Executing 1 ^ 2 in untrusted memory will store this instruction in the cache and will do 2 cycles.

The next time doing 1 ^ 2 the cache would directly send back 3 in one cycle.

If I did 1 ^ 0x41 it will store the operation in the cache. Now if I call the xor encryption routine it will also do 1 ^ 0x41 as 0x41 is the first random byte. The value will be cached as I already have done this instruction just before and so the routine will do 1 cycle less. At the end of the execution VF will hold 60 instead of 61.

To resume if the encrypt routine do an instruction we have already done just before we will able to know which value have been used.

Payload analysis

I used the following project to write CHIP8 asm https://github.com/wernsey/chip8. The code is splited in several steps.

start:  
    CLS
    LD VE, 0 ; counter for 10 bytes key
go:
    LD V5, 0 ; char to test

loop:
    LD V0, #13 ; we gonna set char_to_test ^ #13 in the cache right now
    LD V1, V5
    xor V0, V1

    LD V1, 1   
    LD I, #EA0 ; start of our buffer in ram
    ADD I, VE  ; index we test right now
    LD V0, #13 ; we gonna set #13 too

    LD [I], V0 ; ; we set in the ram #13 at the offset of the char of the key we want to test

    LD I, #EA0 ; we pass our buffer to the call to encrypt

    SYS 0
    SNE VF, 60 ; vf hold the cpu cycle count, if there are only cache miss its 61
               ; but if we already did an operation with the same value #13 ^ V5 (char we test)
               ; there will be a cache it and the cpu will have only one cycle instead of 2 
               ; and so VF is 60
    
    JP cont ; if VF is not 60 we test another char

    ADD V5, 1
    dw #00e1 ; but first we clean the AL cache
    JP loop
    
cont: ; we found a key
    LD I, #EC0 ; we prepare the buffer for the verify call
    ADD I, VE ; which offset we found
    LD V0, V5 ; the value found
    LD [I], V0 ; set it
    
LD I, #EA0 ; reset modified byte else the cache hit value will be erroned
ADD I, VE
LD V0, 0
LD [I], V0

ADD VE, 1 ; increment to reach 10 bytes
SNE VE, 10
JP then
JP go

This code will simply do a xor between 0x13 and value from 0 to 0xff after each xor it calls the routine with 0x13 at the right offset thanks to SYS 0. If VF is equal to 60 we have found the right value and can now set the value in another buffer which will be sent to the second routine for verification.

Everything is in a loop and between two calls to the routine we simply flush the UAL cache to reset the initial number of CPU cycle needed.

Once we found the 10 keys we can prepare the call to the second routine

then: ; we found all the bytes
    LD I, #EC0 ; our buffer with the bruteforced char
    SYS 1 ; calling the verify routine which will add the flag in RAM

If everything made it until here, the flag should resides in the memory right after our buffer. The remain of the payload is only used to display the flag and add some sleep function as the delay wasn’t implemented in the emulator.

  LD VE, 0 ; counter for displaying the flag
final:
    LD I, #EC5 ; EC0 + 10 ?? should be eca
    ADD I, VE
    LD V5, [I] ; we fetch the byte

dis_one:
    CLS ; to mark the transition between 2 digit
    LD V3, 1 ; SET x
    LD V4, 1 ; SET Y
    LD V6, 0
    LD V7, 15 ; 1111
    LD V8, V5
    LD V9, V5

    SHR V8, V6 ; fetch the 4 msb to the lsb
    SHR V8, V6
    SHR V8, V6
    SHR V8, V6
    AND V8, V7 ; keep only the 4 lsb now

    AND V9, V7 ; second part of the byte, 4 lsb

    LD V1, 1 ; sprite to use
    LD F, V8 ; set I with the selected sprite
    DRW V3, V4, 5 ; display sprite to the coordinate and size

letssleep: ; sleep as DT was not implemented to be able to see the char on screen
    LD VB, 0
sleep0:
    LD VA, 0
    ADD VB, 1
sleep:
    ADD VA, 1
    SE VA, #ff
    JP sleep
    SE VB, #60 
    JP sleep0
    CLS
    LD V1, 1 ; sprite to use
    LD F, V9 ; set I with the selected sprite
    DRW V3, V4, 5 ; display sprite to the coordinate and size

letssleep2:
    LD VB, 0
sleep2:
    LD VA, 0
    ADD VB, 1
sleep1:
    ADD VA, 1
    SE VA, #ff
    JP sleep1
    SE VB, #80 
    JP sleep2
    CLS

ADD VE, 1 ; increment
JP final

As we can’t display a byte I splited it into 2 parts composed of 4 bits each. I used the SHR instruction to rotate the bytes to reach the least significant bit position and then I could use the DRW instruction to display a sprite. Then I take the saved real 4 least significant bits to display another sprite finally. So I have to used 2 sprites to display one byte.

I did some loop to create a delay in order to have time between each hexadecimal characters displayed on the screen. As mentioned in the documentation there is no keyboard handled by the emulator and so and make the program hang until I press a key.

Executing the payload

To compile and run the payload I used the following command

c8asm payload.asm &&  cat a.ch8 |  (xxd -p -l 2000 -c 20000 | tr -d "\n" ; echo "ffff") | nc challenges1.france-cybersecurity-challenge.fr 7004

Then I could contemplate the hex characters displayed one by one on my screen :

I didn’t really talk about chip8 architecture as it’s for me more the purpose of the first challenge. But I found a lot of very well detailed documentation and a lot of projects.

Sources

https://en.wikipedia.org/wiki/CHIP-8#Virtual_machine_description

https://github.com/wernsey/chip8

http://devernay.free.fr/hacks/chip8/C8TECH10.HTM#Dxyn

https://github.com/mattmikolay/chip-8/wiki/Mastering-CHIP%E2%80%908

https://github.com/mwales/chip8/blob/master/customRom/helloworld.md

https://github.com/mwales/chip8