Updated specification

2024-12-21 23:38:54 +00:00 · 2018-11-04 19:42:19 +01:00 · 5114d6b5fe
commit 5114d6b5fe
parent d69d3d69a0
2 changed files with 43 additions and 32 deletions
--- a/README.md
+++ b/README.md
@@ -1,3 +1,4 @@
+
 # RandomX
 RandomX ("random ex") is an experimental proof of work (PoW) algorithm that uses random code execution to achieve ASIC resistance.

@@ -12,13 +13,13 @@ RandomX is intended to be run efficiently and easily on a general-purpose CPU. T
 The VM has access to 4 GiB of external memory in read-only mode. The DRAM memory blob is static within a single PoW epoch. The exact algorithm to generate the DRAM blob and its update schedule is to be determined.

 #### MMU
-The memory management unit (MMU) interfaces the CPU with the DRAM blob. The purpose of the MMU is to translate the random memory accesses generated by the random program into a DRAM-friendly access pattern, where memory reads are not bound by access latency. The MMU splits the 4 GiB DRAM blob into 64-byte blocks (corresponding to the L1 cache line size of a typical CPU). Data within one block is always read sequentially in eight reads (8x8 bytes). Blocks are read mostly sequentially apart from occasional random jumps that happen on average every 256 blocks. The address of the next block to be read is determined 1 block ahead of time to enable efficient prefetching. The MMU uses three internal registers:
+The memory management unit (MMU) interfaces the CPU with the DRAM blob. The purpose of the MMU is to translate the random memory accesses generated by the random program into a DRAM-friendly access pattern, where memory reads are not bound by access latency. The MMU splits the 4 GiB DRAM blob into 64-byte blocks (corresponding to the most common L1 cache line size). Data within one block is always read sequentially in eight reads (8×8 bytes). Blocks are read mostly sequentially apart from occasional random jumps that happen on average every 256 blocks. The address of the next block to be read is determined 1 block ahead of time to enable efficient prefetching. The MMU uses three internal registers:
 * **m0** - Address of the next quadword to be read from memory (32-bit, 8-byte aligned).
 * **m1** - Address of the next block to be read from memory (32-bit, 64-byte aligned).
 * **mx** - Random 64-bit counter that determines if reading continues sequentially or jumps to a random block. When an address `addr` is passed to the MMU, it performs `mx ^= addr` and checks if the last 8 bits of `mx` are zero. If yes, the adjacent 32 bits are copied to register `m1` and 64-byte aligned.
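
The `mx` update rule described in the bullet above can be sketched in Python. This is only an illustration under stated assumptions: the class name `Mmu` and method `on_address` are hypothetical, "last 8 bits" is read as the 8 least significant bits, and "the adjacent 32 bits" is read as bits 8..39 of `mx`. The specification text does not pin down either bit position.

```python
MASK64 = (1 << 64) - 1
ALIGN64 = 0xFFFFFFC0  # clears the low 6 bits of a 32-bit address (64-byte alignment)


class Mmu:
    """Toy model of the mx/m1 registers; bit positions are assumptions."""

    def __init__(self, mx, m1=0):
        self.mx = mx & MASK64
        self.m1 = m1 & ALIGN64

    def on_address(self, addr):
        """Mix addr into mx and report whether a random jump was scheduled."""
        self.mx = (self.mx ^ addr) & MASK64
        if self.mx & 0xFF == 0:
            # Assumption: the "adjacent 32 bits" are bits 8..39 of mx.
            self.m1 = (self.mx >> 8) & ALIGN64
            return True
        return False
```

If `mx` behaves like a uniformly random value, the low-byte check succeeds with probability 1/256, which is consistent with the stated average of one random jump every 256 blocks.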

 #### Cache
-The VM contains 256 KiB of cache. The cache is split into two segments of 16 KiB and 240 KiB. The cache is randomly accessed for both reading and writing. 75% of accesses are into the first 16 KiB.
+The VM contains 256 KiB of cache. The cache is split into two segments (16 KiB and 240 KiB). The cache is randomly accessed for both reading and writing. 75% of accesses are into the first 16 KiB.

 #### Program
 The actual program is stored in a 8 KiB ring buffer structure. Each program consists of 1024 random 64-bit instructions. The ring buffer structure makes sure that the program forms a closed infinite loop.
@@ -30,7 +31,7 @@ The control unit (CU) controls the execution of the program. It reads instructio
 * **ic** - Instruction counter = the number of instructions to execute before terminating. Initial value is 65536 and the register is decremented after each executed instruction.

 #### Stack
-To simulate function calls, the VM uses a stack structure. The program interacts with the stack using the CALL, CALLR and RET instructions. The stack has unlimited size and each stack element is 64 bits wide.
+To simulate function calls, the VM uses a stack structure. The program interacts with the stack using the CALL, DCALL and RET instructions. The stack has unlimited size and each stack element is 64 bits wide.

 #### Register file
 The VM has 8 integer registers r0-r7 and 8 floating point registers f0-f7. All registers are 64 bits wide.
@@ -39,10 +40,10 @@ The VM has 8 integer registers r0-r7 and 8 floating point registers f0-f7. All r
 The arithmetic logic unit (ALU) performs integer operations. The ALU can perform binary integer operations from 11 groups (ADD, SUB, MUL, DIV, AND, OR, XOR, SHL, SHR, ROL, ROR) with various operand sizes.

 #### FPU
-The floating-point unit performs IEEE-754 compliant math using 64-bit double precision floating point numbers. There are 4 binary operations (ADD, SUB, MUL, DIV) and one unary operation (SQRT).
+The floating-point unit performs IEEE-754 compliant math using 64-bit double precision floating point numbers.

 ## Instruction set
-The 64-bit instruction is encoded as follows:
+The instruction set was designed so that any bitstring is a valid program. The 64-bit instruction is encoded as follows:

 ![Imgur](https://i.imgur.com/TlgeYfk.png)

@@ -53,7 +54,7 @@ There are 256 opcodes, which are distributed between various operations dependin
 |---------|-----------------|----|
 |ALU operations|TBD|TBD|
 |FPU operations|TBD|TBD|
-|branching|32|12.5%|
+|Control flow |32|12.5%|

 #### Parameters a, b, c (8 bits)
 `a` and `b` encode the instruction operands and `c` is the destination. All have the same encoding:
@@ -62,7 +63,7 @@ There are 256 opcodes, which are distributed between various operations dependin

 Register number is encoded in the top 3 bits. ALU instructions use registers r0-r7, while FPU instructions use registers f0-f7. Addresses are always loaded from registers r0-r7. The bottom 3 bits determine where the operand is loaded from/result saved to:

-|location|a|b|c|
+|location|A|B|C|
 |---------|-|-|-
 |000|register|register|register|
 |001|register|register|register|
@@ -78,19 +79,17 @@ Register number is encoded in the top 3 bits. ALU instructions use registers r0-
 |cache|address length|
 |---------|-|
 |00|18 bits (whole 256 KiB)|
-|01|14 bits (first 16 KiB)|
-|10|14 bits (first 16 KiB)|
-|11|14 bits (first 16 KiB)|
+|01, 10, 11|14 bits (first 16 KiB)|

-* **DRAM** - The value of the register is used as an address to pass to the MMU.
-* **imm1** - 32-bit immediate value encoded within the instruction. For ALU instructions that use operands shorter than 32 bits, the value is truncated. For operands larger than 32 bits, the value is zero-extended for unsigned instructions and sign-extended fot signed instructions. For FPU instructions, the value is treated as a signed 32-bit integer, first converted to a single precision floating point format and then to a double precision format.
+* **DRAM** - The value of the register is used as an address to pass to the MMU for reading from DRAM.
+* **imm1** - 32-bit immediate value encoded within the instruction. For ALU instructions that use operands shorter than 32 bits, the value is truncated. For operands larger than 32 bits, the value is zero-extended for unsigned instructions and sign-extended for signed instructions. For FPU instructions, the value is treated as a signed 32-bit integer, first converted to a single precision floating point format and then to a double precision format.
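
The `imm1` conversion rules in the bullet above can be illustrated with a short Python sketch. The function names are hypothetical, and the sketch assumes that "truncated" means keeping the low-order bits. The FPU path goes through single precision first, as the text describes, using `struct` to round to a 32-bit float.

```python
import struct


def imm1_as_alu_operand(imm1, width, signed):
    """Adapt the 32-bit immediate to an ALU operand of `width` bits."""
    imm1 &= 0xFFFFFFFF
    if width <= 32:
        # Assumption: truncation keeps the low-order bits.
        return imm1 & ((1 << width) - 1)
    if signed and imm1 & 0x80000000:
        imm1 -= 1 << 32  # reinterpret as a negative value before widening
    return imm1 & ((1 << width) - 1)  # two's complement of the widened value


def imm1_as_fpu_operand(imm1):
    """Signed 32-bit integer -> single precision -> double precision."""
    imm1 &= 0xFFFFFFFF
    value = imm1 - (1 << 32) if imm1 & 0x80000000 else imm1
    # Packing as a 32-bit float rounds to single precision.
    single = struct.unpack("<f", struct.pack("<f", float(value)))[0]
    return single  # Python floats are already double precision
```

The detour through single precision matters for large magnitudes: a 32-bit integer has more significant bits than the 24-bit single precision significand, so some immediates are rounded before they reach the FPU.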

 #### imm0 (8 bits)
-An 8-bit immediate value that is used by the CALL instruction as jump offset.
+An 8-bit immediate value that is used to calculate the jump offset of the CALL and DCALL instructions.

 ### ALU instructions

-All ALU instructions take 2 operands A and B and produce result C.
+All ALU instructions take 2 operands `A` and `B` and produce result `C`. If `C` is shorter than 64 bits, it is zero-extended to 64 bits. 

 |opcodes|instruction|signed|A width|B width|C|C width|
 |-|-|-|-|-|-|-|
@@ -127,8 +126,20 @@ All ALU instructions take 2 operands A and B and produce result C.
 ##### Division
 For the division instructions, the divisor is half length of the dividend. The result `C` consists of both the quotient and the remainder (remainder is put the upper bits). The result of division by zero is equal to the dividend.

-##### Result write-back
-If `C` is shorter than 64 bits, it is zero-extended before the result is written back. If the destination is a register, the value is first encrypted with a single AES round (TBD).
+##### Register scrambling
+Because the values of the integer registers are used as read and write addresses, they must stay pseudorandom. To achieve this, every ALU instruction has a scrambling step at the end. The values of the integer registers `r(a)` and `r(c)` corresponding to operands `A` and `C` are concatenated to form a 128-bit value `D`. The value of the integer register `r(b)` corresponding to the `B` operand is concatenated with its corresponding FPU register `f(b)` to form a 128-bit value `K`. `D` is then encrypted with a single [AES](https://en.wikipedia.org/wiki/Advanced_Encryption_Standard) round using `K` as the round key and the result is saved into registers `r(a)` and `r(c)`. 
+
+In pseudocode:
+```
+D[127:64] = r(a)
+D[63:0] = r(c)
+K[127:64] = r(b)
+K[63:0] = f(b)
+E = AES_ROUND(D, K)
+r(a) = E[127:64]
+r(c) = E[63:0]
+```
+`AES_ROUND` consists of the ShiftRows, SubBytes and MixColumns steps followed by XOR with `K`.
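
As a rough illustration, the scrambling pseudocode above can be turned into runnable Python. Several details here are assumptions that the specification does not state: the 128-bit values are serialized big-endian into the 16-byte AES state, `f(b)` contributes its raw 64-bit pattern (passed in as the integer `f_b_bits`), and the helper names are hypothetical. ShiftRows and SubBytes commute, so applying them in the order given in the text yields the same result as the standard AES order.

```python
MASK64 = (1 << 64) - 1


def _xtime(b):
    """Multiply by x in GF(2^8) modulo the AES polynomial 0x11B."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b


def _gmul(a, b):
    """Multiply two bytes in GF(2^8)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = _xtime(a)
        b >>= 1
    return r


def _rotl8(x, n):
    return ((x << n) | (x >> (8 - n))) & 0xFF


def _build_sbox():
    sbox = []
    for x in range(256):
        inv = 1
        for _ in range(254):  # x^254 is the inverse of x (and 0 maps to 0)
            inv = _gmul(inv, x)
        s = inv ^ _rotl8(inv, 1) ^ _rotl8(inv, 2) ^ _rotl8(inv, 3) ^ _rotl8(inv, 4)
        sbox.append(s ^ 0x63)
    return sbox


SBOX = _build_sbox()


def aes_round(state, key):
    """One AES round on a 16-byte column-major state, then XOR with key."""
    # ShiftRows: row r of the state is rotated left by r columns.
    s = [state[4 * ((c + r) % 4) + r] for c in range(4) for r in range(4)]
    s = [SBOX[b] for b in s]  # SubBytes
    out = []
    for c in range(4):  # MixColumns, one column at a time
        a = s[4 * c:4 * c + 4]
        for r in range(4):
            out.append(_gmul(a[r], 2) ^ _gmul(a[(r + 1) % 4], 3)
                       ^ a[(r + 2) % 4] ^ a[(r + 3) % 4])
    return bytes(x ^ k for x, k in zip(out, key))


def scramble(r_a, r_c, r_b, f_b_bits):
    """Return the new (r(a), r(c)) pair; byte order is an assumption."""
    d = (((r_a & MASK64) << 64) | (r_c & MASK64)).to_bytes(16, "big")
    k = (((r_b & MASK64) << 64) | (f_b_bits & MASK64)).to_bytes(16, "big")
    e = int.from_bytes(aes_round(d, k), "big")
    return e >> 64, e & MASK64
```

A real implementation would more likely use a hardware AES instruction than a table built at start-up; the sketch only shows the data flow between the registers and the round function.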

 ### FPU instructions
 |opcodes|instruction|C|
@@ -140,12 +151,12 @@ If `C` is shorter than 64 bits, it is zero-extended before the result is written
 |TBD|FSQRT|sqrt(A)|
 |TBD|FROUND|-|

-FPU instructions conform to the IEEE-754 specification. Initial rounding mode is RN (Round to Nearest). Denormal values are treated as zero (this corresponds to setting the FTZ flag in x86 SSE and ARM Neon engines).
+FPU instructions conform to the IEEE-754 specification, so they must give bit-exact correctly rounded results. Initial rounding mode is RN (Round to Nearest). Denormal values are treated as zero (this corresponds to setting the FTZ flag in x86 SSE and ARM Neon engines).

 Operands loaded from memory are treated as signed 64-bit integers and converted to double precision floating point format. Operands loaded from floating point registers are used directly.

 ##### FSQRT
-The sign bit of the FSQRT operand is always cleared first, so only non-negative values are evaluated.
+The sign bit of the FSQRT operand is always cleared first, so only non-negative values are used.

 ##### FROUND
 The FROUND instruction changes the rounding mode for all subsequent FPU operations depending on the two right-most bits of A:
@@ -158,25 +169,25 @@ The FROUND instruction changes the rounding mode for all subsequent FPU operatio
 |11|Round towards Zero (RZ) mode


-### Branch instructions
-The CU supports 3 branch instructions:
+### Control flow instructions
+The following 3 control flow instructions are supported:

 |opcodes|instruction|function|
 |-|-|-|
-|TBD|CALL|conditional near procedure call with static offset|
-|TBD|CALLR|conditional near procedure call with register offset|
-|TBD|RET|conditional return from procedure|
+|TBD|CALL|near procedure call with a static offset|
+|TBD|DCALL|near procedure call with a dynamic offset|
+|TBD|RET|return from procedure|

-All three instructions are conditional. Branching pattern is determined by the value of `imm1` (exact mechanism TBD). In case the branch is not taken, all three instructions set `C = A` ("arithmetic no-op").
+All three instructions are conditional in 75% of cases. The jump is taken only if `B <= imm1`. For the 25% of cases when `B` is equal to `imm1`, the jump is unconditional. In case the branch is not taken, all three instructions become "arithmetic no-op" `C = A`.

-##### CALL and CALLR
-When the branch is taken, both CALL and CALLR instructions push the values `A` and `pc` (program counter) onto the stack and then perform a forward jump relative to the value of `pc`. The forward offset is equal to `8 * (imm0 + 1)` for the CALL instruction and `8 * ((C & 0xFF) + 1)` for the CALLR instruction. Maximum jump distance is therefore 256 instructions forward (this means that at least 4 correctly spaced CALL/CALLR instructions are needed to form a loop in the program).
+##### CALL and DCALL
+Taken CALL and DCALL instructions push the values `A` and `pc` (program counter) onto the stack and then perform a forward jump relative to the value of `pc`. The forward offset is equal to `8 * (imm0 + 1)` for the CALL instruction and `8 * ((imm0 ^ (A >> 56)) + 1)` for the DCALL instruction. Maximum jump distance is therefore 256 instructions forward (this means that at least 4 correctly spaced CALL/DCALL instructions are needed to form a loop in the program).
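
The offset formulas for CALL and DCALL can be checked with a small Python sketch. The function names are hypothetical. It also assumes that `pc` is a byte offset into the 8 KiB ring buffer and wraps around at its end, which the text implies but does not state outright.

```python
INSTRUCTION_SIZE = 8                    # each instruction is 64 bits
PROGRAM_SIZE = 1024 * INSTRUCTION_SIZE  # 8 KiB ring buffer


def call_offset(imm0):
    """Forward offset in bytes for CALL: 8 * (imm0 + 1)."""
    return INSTRUCTION_SIZE * ((imm0 & 0xFF) + 1)


def dcall_offset(imm0, a):
    """Forward offset in bytes for DCALL: 8 * ((imm0 ^ (A >> 56)) + 1)."""
    top_byte = (a & 0xFFFFFFFFFFFFFFFF) >> 56
    return INSTRUCTION_SIZE * (((imm0 & 0xFF) ^ top_byte) + 1)


def jump_target(pc, offset):
    """Assumption: pc is a byte offset that wraps around the ring buffer."""
    return (pc + offset) % PROGRAM_SIZE
```

With `imm0 = 255` the offset is 2048 bytes, or 256 instructions. Four such jumps cover the whole 8192-byte buffer, which matches the statement that at least 4 correctly spaced calls are needed to form a loop.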

 ##### RET
-When the branch is taken, the RET instruction pops the return address `raddr` from the stack (it's the instructions following the corresponding CALL or CALLR), then pops a return value `retval` from the stack and sets `C = retval`. Finally, the instruction jumps back to `raddr`.
+Taken RET instruction pops the return address `raddr` from the stack (it's the instruction following the previous CALL or DCALL), then pops a return value `retval` from the stack and sets `C = retval`. Finally, the instruction jumps back to `raddr`.

 ## Program generation
-The program is initialized from a 256-bit seed value using a suitable PRNG. The program is generated in this order:
+The program is initialized from a 256-bit seed value using a [PCG random number generator](http://www.pcg-random.org/). The program is generated in this order:
 1. All 1024 instructions are generated as a list of random 64-bit integers.
 2. Initial values of all integer registers r0-r7 are generated as random 64-bit integers.
 3. Initial values of all floating point registers f0-f7 are generated as random 64-bit signed integers converted to a double precision floating point format.
--- a/tests/branch_prediction/makefile
+++ b/tests/branch_prediction/makefile
@@ -1,15 +1,15 @@
 all: branch_always branch_predictably branch_randomly branch_mixed

-branch_always:
+branch_always: branch_always.c
 	gcc -O0 branch_always.c -o branch_always

-branch_predictably:
+branch_predictably: branch_predictably.c
 	gcc -O0 branch_predictably.c -o branch_predictably

-branch_randomly:
+branch_randomly: branch_randomly.c
 	gcc -O0 branch_randomly.c -o branch_randomly

-branch_mixed:
+branch_mixed: branch_mixed.c
 	gcc -O0 branch_mixed.c -o branch_mixed

 clean: