FCSC 2021 Qualification - Battle ship (misc / hardware)

FCSC 2021 Qualification - Battle Ship

This challenge was about a side channel attack on a shared cache between an untrusted and a trusted module in a chip8 emulator.

To help us understand a bit the architecture of the software we had a well detailed schema and the python source code of the emulator. It was a really nice challenge I really appreciated as the python code was a pleasure to read.

Understanding the architecture

The green part corresponds to the untrusted module and red one is trusted. We can directly execute code in the memory space of the untrusted one but not in the trusted. However we can call routine stored in the trusted.

img

We also had this information (in french sorry) :

L’élément sécurisé n’est rattaché à aucun périphérique. Il contient du code pré-compilé dans lequel vient s’ajouter la clé secrète précédemment générée.

Les opérations se déroulent presque toutes en 1 cycle sauf pour les opérations qui nécessitent l’utilisation de l’unité arithmétique et logique (UAL) auquel cas, 2 cycles seront nécessaires : un cycle pour soumettre le calcul à l’UAL, un cycle pour récupérer le résultat. Pour palier ce problème de performance, le constructeur a mis au point un système de cache partagé entre la VM utilisateur et l’élément sécurisé. Le cache repose sur un algorithme de type LRU et est placé en amont de l’UAL. Ainsi, si une opération a récemment été réalisée, le CPU a directement connaissance du résultat coûtant donc qu’un seul cycle.

En lien avec la gestion du cache, le constructeur a ajouté le code opération 00E1. Cette nouvelle instruction permet de vider toutes les lignes du cache de l’UAL.

Le constructeur a également implémenté deux autres codes d’opération : 0000 et 0001. Ils permettent d’interagir avec l’élément sécurisé en lui demandant de chiffrer une entrée de 10 octets avec la clé secrète ou de lui demander de vérifier si une entrée de taille 10 octet correspond à la clé secrète. Le code des deux routines est fourni. Un troisième code d’opération est implémenté : FFFF. Il permet de notifier la VM de la fin de l’exécution de la ROM.

When calling a routine the address space associated to the RAM is copied from the untrusted module to the trusted one and then some information are sent back through some registers I and VF. These are the only interactions between the two modules.

But there is also another shared thing : the LRU cache. The untrusted module can access trusted module cache value and vice versa. This way the architecture is vulnerable to a potential side channel attack on the cache.

To interact with the emulator we can simply send a custom hex encoded ROM that will be executed in the untrusted module.

Determining the goal

Thanks to the emulator source code we can identify where the flag is injected.

if msn == 0:
if n == 0x0:
se = self.cpu.context.secure_element
se.reset()
se.set_context(
i=self.cpu.processor.i,
memory=self.memory.data[self.memory.O_STACK:self.memory.SIZE]
)
se.exc_encrypt()
self.cpu.processor.v[0xF] = se.cpu.tick
elif n == 0x1:
se = self.cpu.context.secure_element
se.reset()
se.set_context(
i=self.cpu.processor.i,
memory=self.memory.data[self.memory.O_STACK:self.memory.SIZE]
)
flag = se.exc_verify()
self.cpu.processor.v[0xF] = se.cpu.processor.v[0xF]
if self.cpu.processor.v[0xF] == 0:
if self.cpu.processor.i+10+len(flag) > self.memory.SIZE:
raise MemoryError("Not enought memory space")
self.memory.data[self.cpu.processor.i+10:self.cpu.processor.i+10+len(flag)] = flag

As noticed in the challenge information, when the execution returns from the trusted module it copies the VF and I registers to the untrusted registers. Following the se.exc_verify we see that if the VF register is equal to 0 the flag will be loaded in the untrusted memory making it accessible for us.

We have now to understand how VF can be zero at the end of the function.

class TrustedContext(BaseContext):
def __init__(self):
super().__init__()
self.execution_key = os.urandom(10)
self.co_encrypt = routines.preco_encrypt.format(*self.execution_key)
self.co_verify = routines.preco_verify.format(*self.execution_key)
def set_context(self, i=None, memory=None):
if i:
self.cpu.processor.i = i
if memory:
assert len(memory) == Memory.SIZE - Memory.O_STACK
self.memory.data[Memory.O_STACK:Memory.SIZE] = memory
def exc_encrypt(self):
self.load(self.co_encrypt)
self.run()
def exc_verify(self):
self.load(self.co_verify)
self.run()
return flag.FLAG_LV2

We see in the above code that exc_verify load a pre-compiled routine named co_verify and pass it 10 random bytes. Lets disassemble this routine.

For the disassembly I simply used https://github.com/mwales/chip8 which do a great work :

0x0200 ld v1, #01
0x0202 ld v2, #00
0x0204 ld v0, [I]
0x0206 ld v3, #71
0x0208 xor v3, v0
0x020a or v2, v3
0x020c add I, v1
0x020e ld v0, [I]
0x0210 ld v3, #43
0x0212 xor v3, v0
0x0214 or v2, v3
0x0216 add I, v1
0x0218 ld v0, [I]
0x021a ld v3, #7c
0x021c xor v3, v0
0x021e or v2, v3
0x0220 add I, v1
0x0222 ld v0, [I]
0x0224 ld v3, #8e
0x0226 xor v3, v0
0x0228 or v2, v3
0x022a add I, v1
0x022c ld v0, [I]
0x022e ld v3, #d6
0x0230 xor v3, v0
0x0232 or v2, v3
0x0234 add I, v1
0x0236 ld v0, [I]
0x0238 ld v3, #af
0x023a xor v3, v0
0x023c or v2, v3
0x023e add I, v1
0x0240 ld v0, [I]
0x0242 ld v3, #60
0x0244 xor v3, v0
0x0246 or v2, v3
0x0248 add I, v1
0x024a ld v0, [I]
0x024c ld v3, #11
0x024e xor v3, v0
0x0250 or v2, v3
0x0252 add I, v1
0x0254 ld v0, [I]
0x0256 ld v3, #d6
0x0258 xor v3, v0
0x025a or v2, v3
0x025c add I, v1
0x025e ld v0, [I]
0x0260 ld v3, #10
0x0262 xor v3, v0
0x0264 or v2, v3
0x0266 add I, v1
0x0268 ld vf, v2
0x026a BAD OPCODE 0xffff

The code is constituted of 10 patterns which do the same thing. It loads a value from the I register add 1 to this address, deference a byte and xor this byte with one of the random 10 bytes and finally stores the result in V2.

To resume it, xor a string provided through the I register with 10 random bytes and store the result in V2 before placing it in VF. So VF will be 0 only if we provided the same 10 characters as the random one.

However it’s impossible to guess this 10 bytes.

The shared cache vulnerability

Lucky us, there is another routine we can access trough the 0x00e0 instruction. This routine will simply xor the provided string against the same 10 random bytes and will return the number of CPU cycle executed trough the VF register.

As the 10 random bytes are used to xor our input are the same we need to guess we could simply send 10 null bytes in order to leak it. However the result of this encryption resides in the trusted module ram we can’t read.

But there is something really important :

Les opérations se déroulent presque toutes en 1 cycle sauf pour les opérations qui nécessitent l’utilisation de l’unité arithmétique et logique (UAL) auquel cas, 2 cycles seront nécessaires : un cycle pour soumettre le calcul à l’UAL, un cycle pour récupérer le résultat.

If the UAL operation is already in the CPU cache it will only do 1 cycle instead of 2.

Thats means we can know if an operation with it’s 2 operand has already been executed by looking at the number of cycle returned.

After some tests I noticed the encrypt routine will need 61 cycles to be executed. If the routine is executed with less than 61 it means that a xor with the two operands already have be done before.

Test case

For example lets take 0x41 as the first random byte.

Executing 1 ^ 2 in untrusted memory will store this instruction in the cache and will do 2 cycles.

The next time doing 1 ^ 2 the cache would directly send back 3 in one cycle.

If I did 1 ^ 0x41 it will store the operation in the cache. Now if I call the xor encryption routine it will also do 1 ^ 0x41 as 0x41 is the first random byte. The value will be cached as I already have done this instruction just before and so the routine will do 1 cycle less. At the end of the execution VF will hold 60 instead of 61.

To resume if the encrypt routine do an instruction we have already done just before we will able to know which value have been used.

Payload analysis

I used the following project to write CHIP8 asm https://github.com/wernsey/chip8. The code is splited in several steps.

start:
CLS
LD VE, 0 ; counter for 10 bytes key
go:
LD V5, 0 ; char to test
loop:
LD V0, #13 ; we gonna set char_to_test ^ #13 in the cache right now
LD V1, V5
xor V0, V1
LD V1, 1
LD I, #EA0 ; start of our buffer in ram
ADD I, VE ; index we test right now
LD V0, #13 ; we gonna set #13 too
LD [I], V0 ; ; we set in the ram #13 at the offset of the char of the key we want to test
LD I, #EA0 ; we pass our buffer to the call to encrypt
SYS 0
SNE VF, 60 ; vf hold the cpu cycle count, if there are only cache miss its 61
; but if we already did an operation with the same value #13 ^ V5 (char we test)
; there will be a cache it and the cpu will have only one cycle instead of 2
; and so VF is 60
JP cont ; if VF is not 60 we test another char
ADD V5, 1
dw #00e1 ; but first we clean the AL cache
JP loop
cont: ; we found a key
LD I, #EC0 ; we prepare the buffer for the verify call
ADD I, VE ; which offset we found
LD V0, V5 ; the value found
LD [I], V0 ; set it
LD I, #EA0 ; reset modified byte else the cache hit value will be erroned
ADD I, VE
LD V0, 0
LD [I], V0
ADD VE, 1 ; increment to reach 10 bytes
SNE VE, 10
JP then
JP go

This code will simply do a xor between 0x13 and value from 0 to 0xff after each xor it calls the routine with 0x13 at the right offset thanks to SYS 0. If VF is equal to 60 we have found the right value and can now set the value in another buffer which will be sent to the second routine for verification.

Everything is in a loop and between two calls to the routine we simply flush the UAL cache to reset the initial number of CPU cycle needed.

Once we found the 10 keys we can prepare the call to the second routine

then: ; we found all the bytes
LD I, #EC0 ; our buffer with the bruteforced char
SYS 1 ; calling the verify routine which will add the flag in RAM

If everything made it until here, the flag should resides in the memory right after our buffer. The remain of the payload is only used to display the flag and add some sleep function as the delay wasn’t implemented in the emulator.

LD VE, 0 ; counter for displaying the flag
final:
LD I, #EC5 ; EC0 + 10 ?? should be eca
ADD I, VE
LD V5, [I] ; we fetch the byte
dis_one:
CLS ; to mark the transition between 2 digit
LD V3, 1 ; SET x
LD V4, 1 ; SET Y
LD V6, 0
LD V7, 15 ; 1111
LD V8, V5
LD V9, V5
SHR V8, V6 ; fetch the 4 msb to the lsb
SHR V8, V6
SHR V8, V6
SHR V8, V6
AND V8, V7 ; keep only the 4 lsb now
AND V9, V7 ; second part of the byte, 4 lsb
LD V1, 1 ; sprite to use
LD F, V8 ; set I with the selected sprite
DRW V3, V4, 5 ; display sprite to the coordinate and size
letssleep: ; sleep as DT was not implemented to be able to see the char on screen
LD VB, 0
sleep0:
LD VA, 0
ADD VB, 1
sleep:
ADD VA, 1
SE VA, #ff
JP sleep
SE VB, #60
JP sleep0
CLS
LD V1, 1 ; sprite to use
LD F, V9 ; set I with the selected sprite
DRW V3, V4, 5 ; display sprite to the coordinate and size
letssleep2:
LD VB, 0
sleep2:
LD VA, 0
ADD VB, 1
sleep1:
ADD VA, 1
SE VA, #ff
JP sleep1
SE VB, #80
JP sleep2
CLS
ADD VE, 1 ; increment
JP final

As we can’t display a byte I splited it into 2 parts composed of 4 bits each. I used the SHR instruction to rotate the bytes to reach the least significant bit position and then I could use the DRW instruction to display a sprite. Then I take the saved real 4 least significant bits to display another sprite finally. So I have to used 2 sprites to display one byte.

I did some loop to create a delay in order to have time between each hexadecimal characters displayed on the screen. As mentioned in the documentation there is no keyboard handled by the emulator and so and make the program hang until I press a key.

Executing the payload

To compile and run the payload I used the following command

c8asm payload.asm &&  cat a.ch8 |  (xxd -p -l 2000 -c 20000 | tr -d "\n" ; echo "ffff") | nc challenges1.france-cybersecurity-challenge.fr 7004

Then I could contemplate the hex characters displayed one by one on my screen :

I didn’t really talk about chip8 architecture as it’s for me more the purpose of the first challenge. But I found a lot of very well detailed documentation and a lot of projects.

Sources

https://en.wikipedia.org/wiki/CHIP-8#Virtual_machine_description

https://github.com/wernsey/chip8

http://devernay.free.fr/hacks/chip8/C8TECH10.HTM#Dxyn

https://github.com/mattmikolay/chip-8/wiki/Mastering-CHIP%E2%80%908

https://github.com/mwales/chip8/blob/master/customRom/helloworld.md

https://github.com/mwales/chip8