FCSC 2021 Qualification - Battle ship (misc / hardware)
FCSC 2021 Qualification - Battle Ship
This challenge was about a side channel attack on a shared cache between an untrusted and a trusted module in a chip8 emulator.
To help us understand a bit the architecture of the software we had a well detailed schema and the python source code of the emulator. It was a really nice challenge I really appreciated as the python code was a pleasure to read.
Understanding the architecture
The green part corresponds to the untrusted module and red one is trusted. We can directly execute code in the memory space of the untrusted one but not in the trusted. However we can call routine stored in the trusted.
We also had this information (in french sorry) :
L’élément sécurisé n’est rattaché à aucun périphérique. Il contient du code pré-compilé dans lequel vient s’ajouter la clé secrète précédemment générée.
Les opérations se déroulent presque toutes en 1 cycle sauf pour les opérations qui nécessitent l’utilisation de l’unité arithmétique et logique (UAL) auquel cas, 2 cycles seront nécessaires : un cycle pour soumettre le calcul à l’UAL, un cycle pour récupérer le résultat. Pour palier ce problème de performance, le constructeur a mis au point un système de cache partagé entre la VM utilisateur et l’élément sécurisé. Le cache repose sur un algorithme de type LRU et est placé en amont de l’UAL. Ainsi, si une opération a récemment été réalisée, le CPU a directement connaissance du résultat coûtant donc qu’un seul cycle.
En lien avec la gestion du cache, le constructeur a ajouté le code opération
00E1
. Cette nouvelle instruction permet de vider toutes les lignes du cache de l’UAL.Le constructeur a également implémenté deux autres codes d’opération : 0000 et 0001. Ils permettent d’interagir avec l’élément sécurisé en lui demandant de chiffrer une entrée de 10 octets avec la clé secrète ou de lui demander de vérifier si une entrée de taille 10 octet correspond à la clé secrète. Le code des deux routines est fourni. Un troisième code d’opération est implémenté :
FFFF
. Il permet de notifier la VM de la fin de l’exécution de la ROM.
When calling a routine the address space associated to the RAM is copied from the untrusted module to the trusted one and then some information are sent back through some registers I
and VF
. These are the only interactions between the two modules.
But there is also another shared thing : the LRU cache. The untrusted module can access trusted module cache value and vice versa. This way the architecture is vulnerable to a potential side channel attack on the cache.
To interact with the emulator we can simply send a custom hex encoded ROM that will be executed in the untrusted module.
Determining the goal
Thanks to the emulator source code we can identify where the flag is injected.
if msn == 0:
if n == 0x0:
= self.cpu.context.secure_element
se
se.reset()
se.set_context(=self.cpu.processor.i,
i=self.memory.data[self.memory.O_STACK:self.memory.SIZE]
memory
)
se.exc_encrypt()self.cpu.processor.v[0xF] = se.cpu.tick
elif n == 0x1:
= self.cpu.context.secure_element
se
se.reset()
se.set_context(=self.cpu.processor.i,
i=self.memory.data[self.memory.O_STACK:self.memory.SIZE]
memory
)= se.exc_verify()
flag self.cpu.processor.v[0xF] = se.cpu.processor.v[0xF]
if self.cpu.processor.v[0xF] == 0:
if self.cpu.processor.i+10+len(flag) > self.memory.SIZE:
raise MemoryError("Not enought memory space")
self.memory.data[self.cpu.processor.i+10:self.cpu.processor.i+10+len(flag)] = flag
As noticed in the challenge information, when the execution returns from the trusted module it copies the VF
and I
registers to the untrusted registers. Following the se.exc_verify
we see that if the VF
register is equal to 0
the flag will be loaded in the untrusted memory making it accessible for us.
We have now to understand how VF
can be zero at the end of the function.
class TrustedContext(BaseContext):
def __init__(self):
super().__init__()
self.execution_key = os.urandom(10)
self.co_encrypt = routines.preco_encrypt.format(*self.execution_key)
self.co_verify = routines.preco_verify.format(*self.execution_key)
def set_context(self, i=None, memory=None):
if i:
self.cpu.processor.i = i
if memory:
assert len(memory) == Memory.SIZE - Memory.O_STACK
self.memory.data[Memory.O_STACK:Memory.SIZE] = memory
def exc_encrypt(self):
self.load(self.co_encrypt)
self.run()
def exc_verify(self):
self.load(self.co_verify)
self.run()
return flag.FLAG_LV2
We see in the above code that exc_verify
load a pre-compiled routine named co_verify
and pass it 10 random bytes. Lets disassemble this routine.
For the disassembly I simply used https://github.com/mwales/chip8 which do a great work :
, #01
0x0200 ld v1, #00
0x0202 ld v2
, [I]
0x0204 ld v0, #71
0x0206 ld v3xor v3, v0
0x0208 or v2, v3
0x020a add I, v1
0x020c
, [I]
0x020e ld v0, #43
0x0210 ld v3xor v3, v0
0x0212 or v2, v3
0x0214 add I, v1
0x0216
, [I]
0x0218 ld v0, #7c
0x021a ld v3xor v3, v0
0x021c or v2, v3
0x021e add I, v1
0x0220
, [I]
0x0222 ld v0, #8e
0x0224 ld v3xor v3, v0
0x0226 or v2, v3
0x0228
add I, v1
0x022a , [I]
0x022c ld v0, #d6
0x022e ld v3xor v3, v0
0x0230 or v2, v3
0x0232
add I, v1
0x0234 , [I]
0x0236 ld v0, #af
0x0238 ld v3xor v3, v0
0x023a or v2, v3
0x023c
add I, v1
0x023e , [I]
0x0240 ld v0, #60
0x0242 ld v3xor v3, v0
0x0244 or v2, v3
0x0246 add I, v1
0x0248
, [I]
0x024a ld v0, #11
0x024c ld v3xor v3, v0
0x024e or v2, v3
0x0250 add I, v1
0x0252
, [I]
0x0254 ld v0, #d6
0x0256 ld v3xor v3, v0
0x0258 or v2, v3
0x025a add I, v1
0x025c
, [I]
0x025e ld v0, #10
0x0260 ld v3xor v3, v0
0x0262 or v2, v3
0x0264 add I, v1
0x0266
, v2
0x0268 ld vf0xffff 0x026a BAD OPCODE
The code is constituted of 10 patterns which do the same thing. It loads a value from the I
register add 1 to this address, deference a byte and xor this byte with one of the random 10 bytes and finally stores the result in V2
.
To resume it, xor a string provided through the I register with 10 random bytes and store the result in V2
before placing it in VF
. So VF
will be 0
only if we provided the same 10 characters as the random one.
However it’s impossible to guess this 10 bytes.
The shared cache vulnerability
Lucky us, there is another routine we can access trough the 0x00e0
instruction. This routine will simply xor the provided string against the same 10 random bytes and will return the number of CPU cycle executed trough the VF
register.
As the 10 random bytes are used to xor
our input are the same we need to guess we could simply send 10 null bytes in order to leak it. However the result of this encryption resides in the trusted module ram we can’t read.
But there is something really important :
Les opérations se déroulent presque toutes en 1 cycle sauf pour les opérations qui nécessitent l’utilisation de l’unité arithmétique et logique (UAL) auquel cas, 2 cycles seront nécessaires : un cycle pour soumettre le calcul à l’UAL, un cycle pour récupérer le résultat.
If the UAL operation is already in the CPU cache it will only do 1 cycle instead of 2.
Thats means we can know if an operation with it’s 2 operand has already been executed by looking at the number of cycle returned.
After some tests I noticed the encrypt
routine will need 61
cycles to be executed. If the routine is executed with less than 61
it means that a xor with the two operands already have be done before.
Test case
For example lets take 0x41
as the first random byte.
Executing 1 ^ 2
in untrusted memory will store this instruction in the cache and will do 2 cycles.
The next time doing 1 ^ 2
the cache would directly send back 3
in one cycle.
If I did 1 ^ 0x41
it will store the operation in the cache. Now if I call the xor encryption routine it will also do 1 ^ 0x41
as 0x41
is the first random byte. The value will be cached as I already have done this instruction just before and so the routine will do 1 cycle less. At the end of the execution VF
will hold 60
instead of 61
.
To resume if the encrypt routine do an instruction we have already done just before we will able to know which value have been used.
Payload analysis
I used the following project to write CHIP8 asm https://github.com/wernsey/chip8. The code is splited in several steps.
start:
CLS, 0 ; counter for 10 bytes key
LD VEgo:
, 0 ; char to test
LD V5
loop:
, #13 ; we gonna set char_to_test ^ #13 in the cache right now
LD V0, V5
LD V1xor V0, V1
, 1
LD V1, #EA0 ; start of our buffer in ram
LD IADD I, VE ; index we test right now
, #13 ; we gonna set #13 too
LD V0
[I], V0 ; ; we set in the ram #13 at the offset of the char of the key we want to test
LD
, #EA0 ; we pass our buffer to the call to encrypt
LD I
0
SYS , 60 ; vf hold the cpu cycle count, if there are only cache miss its 61
SNE VF; but if we already did an operation with the same value #13 ^ V5 (char we test)
; there will be a cache it and the cpu will have only one cycle instead of 2
; and so VF is 60
JP cont ; if VF is not 60 we test another char
ADD V5, 1
dw #00e1 ; but first we clean the AL cache
JP loop
cont: ; we found a key
, #EC0 ; we prepare the buffer for the verify call
LD IADD I, VE ; which offset we found
, V5 ; the value found
LD V0[I], V0 ; set it
LD
, #EA0 ; reset modified byte else the cache hit value will be erroned
LD IADD I, VE
, 0
LD V0[I], V0
LD
ADD VE, 1 ; increment to reach 10 bytes
, 10
SNE VEJP then
JP go
This code will simply do a xor between 0x13
and value from 0
to 0xff
after each xor it calls the routine with 0x13
at the right offset thanks to SYS 0
. If VF
is equal to 60
we have found the right value and can now set the value in another buffer which will be sent to the second routine for verification.
Everything is in a loop and between two calls to the routine we simply flush the UAL cache to reset the initial number of CPU cycle needed.
Once we found the 10 keys we can prepare the call to the second routine
then: ; we found all the bytes
, #EC0 ; our buffer with the bruteforced char
LD I1 ; calling the verify routine which will add the flag in RAM SYS
If everything made it until here, the flag should resides in the memory right after our buffer. The remain of the payload is only used to display the flag and add some sleep function as the delay wasn’t implemented in the emulator.
, 0 ; counter for displaying the flag
LD VEfinal:
, #EC5 ; EC0 + 10 ?? should be eca
LD IADD I, VE
, [I] ; we fetch the byte
LD V5
dis_one:
; to mark the transition between 2 digit
CLS , 1 ; SET x
LD V3, 1 ; SET Y
LD V4, 0
LD V6, 15 ; 1111
LD V7, V5
LD V8, V5
LD V9
SHR V8, V6 ; fetch the 4 msb to the lsb
SHR V8, V6
SHR V8, V6
SHR V8, V6
AND V8, V7 ; keep only the 4 lsb now
AND V9, V7 ; second part of the byte, 4 lsb
, 1 ; sprite to use
LD V1, V8 ; set I with the selected sprite
LD F, V4, 5 ; display sprite to the coordinate and size
DRW V3
letssleep: ; sleep as DT was not implemented to be able to see the char on screen
, 0
LD VBsleep0:
, 0
LD VAADD VB, 1
sleep:
ADD VA, 1
, #ff
SE VAJP sleep
, #60
SE VBJP sleep0
CLS, 1 ; sprite to use
LD V1, V9 ; set I with the selected sprite
LD F, V4, 5 ; display sprite to the coordinate and size
DRW V3
letssleep2:
, 0
LD VBsleep2:
, 0
LD VAADD VB, 1
sleep1:
ADD VA, 1
, #ff
SE VAJP sleep1
, #80
SE VBJP sleep2
CLS
ADD VE, 1 ; increment
JP final
As we can’t display a byte I splited it into 2 parts composed of 4 bits each. I used the SHR
instruction to rotate the bytes to reach the least significant bit position and then I could use the DRW
instruction to display a sprite. Then I take the saved real 4 least significant bits to display another sprite finally. So I have to used 2 sprites to display one byte.
I did some loop to create a delay in order to have time between each hexadecimal characters displayed on the screen. As mentioned in the documentation there is no keyboard handled by the emulator and so and make the program hang until I press a key.
Executing the payload
To compile and run the payload I used the following command
c8asm payload.asm && cat a.ch8 | (xxd -p -l 2000 -c 20000 | tr -d "\n" ; echo "ffff") | nc challenges1.france-cybersecurity-challenge.fr 7004
Then I could contemplate the hex characters displayed one by one on my screen :
I didn’t really talk about chip8 architecture as it’s for me more the purpose of the first challenge. But I found a lot of very well detailed documentation and a lot of projects.
Sources
https://en.wikipedia.org/wiki/CHIP-8#Virtual_machine_description
https://github.com/wernsey/chip8
http://devernay.free.fr/hacks/chip8/C8TECH10.HTM#Dxyn
https://github.com/mattmikolay/chip-8/wiki/Mastering-CHIP%E2%80%908
https://github.com/mwales/chip8/blob/master/customRom/helloworld.md
https://github.com/mwales/chip8