From 7e582c2815012f85e269c2f57a2c5a5aaaec649d Mon Sep 17 00:00:00 2001
From: tevador <37503146+tevador@users.noreply.github.com>
Date: Fri, 16 Nov 2018 19:30:38 +0100
Subject: [PATCH] Updated specification and instruction weights

---
 README.md | 91 +++++++++++++++++++++++++++----------------------------
 1 file changed, 45 insertions(+), 46 deletions(-)

diff --git a/README.md b/README.md
index ce1deec..e550f0a 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,3 @@
-
-
 # RandomX
 RandomX ("random ex") is an experimental proof of work (PoW) algorithm that uses random code execution to achieve ASIC resistance.
 
@@ -10,7 +8,7 @@ RandomX uses a simple low-level language (instruction set), which was designed s
 ## Virtual machine
 RandomX is intended to be run efficiently and easily on a general-purpose CPU. The virtual machine (VM) which runs RandomX code attempts to simulate a generic CPU using the following set of components:
 
-![Imgur](https://i.imgur.com/dRU8jiu.png)
+![Imgur](https://i.imgur.com/Xx5QVOV.png)
 
 #### DRAM
 The VM has access to 4 GiB of external memory in read-only mode. The DRAM memory blob is generated from the hash of the previous block using AES encryption (TBD). The contents of the DRAM blob change on average every 2 minutes. The DRAM blob is read with a maximum rate of 2.5 GiB/s per thread.
@@ -18,12 +16,13 @@ The VM has access to 4 GiB of external memory in read-only mode. The DRAM memory
 *The DRAM blob can be generated in 0.1-0.3 seconds using 8 threads with hardware-accelerated AES and dual channel DDR3 or DDR4 memory. Dual channel DDR4 memory has enough bandwidth to support up to 16 mining threads.*
 
 #### MMU
-The memory management unit (MMU) interfaces the CPU with the DRAM blob. The purpose of the MMU is to translate the random memory accesses generated by the random program into a DRAM-friendly access pattern, where memory reads are not bound by access latency. The MMU accepts a 32-bit address `addr` and outputs a 64-bit value from DRAM. The MMU splits the 4 GiB DRAM blob into 256-byte blocks. Data within one block is always read sequentially in 32 reads (32×8 bytes). When the current block has been consumed, reading jumps to a random block. The address of the next block is calculated 16 reads before the current block is exhausted to enable efficient prefetching. The MMU uses three internal registers:
-* **m0** - Address of the next quadword to be read from memory (32-bit, 8-byte aligned).
-* **m1** - Address of the next block to be read from memory (32-bit, 256-byte aligned).
-* **mx** - Random 32-bit counter that determines the address of the next block. After each read, the read address is mixed with the counter: `mx ^= addr`. When the 16th quadword of the current block is read (the value of the `m0` register ends with `0x80`), the value of the `mx` register is copied into register `m1` and the last 8 bits of `m1` are cleared.
+The memory management unit (MMU) interfaces the CPU with the DRAM blob. The purpose of the MMU is to translate the random memory accesses generated by the random program into a DRAM-friendly access pattern, where memory reads are not bound by access latency. The MMU accepts a 32-bit address `addr` and outputs a 64-bit value from DRAM. The DRAM blob is read mostly sequentially. After an average of 8192 sequential reads, a random read is performed. An average program reads a total of 4 MiB of DRAM and has 64 random reads.
 
-*When the value of the `m1` register is changed, the memory location can be preloaded into CPU cache using the x86 `PREFETCH` instruction or ARM `PRFM` instruction. Implicit prefetch should ensure that sequentially accessed memory is already in the cache.*
+The MMU uses two internal registers:
+* **ma** - Address of the next quadword to be read from memory (32-bit, 8-byte aligned).
+* **mx** - A 32-bit counter that determines if the next read is sequential or random. After each read, the read address is mixed with the counter: `mx ^= addr`. If the right-most 13 bits of the register are zero:  `(mx & 0x1FFF) == 0`, the value of the `mx` register is copied into register `ma`.
+
+*When the value of the `ma` register is changed to a random address, the memory location can be preloaded into CPU cache using the x86 `PREFETCH` instruction or ARM `PRFM` instruction. Implicit prefetch should ensure that sequentially accessed memory is already in the cache.*
 
 #### Scratchpad
 The VM contains a 256 KiB scratchpad, which is accessed randomly both for reading and writing. The scratchpad is split into two segments (16 KiB and 240 KiB). 75% of accesses are into the first 16 KiB.
@@ -66,12 +65,12 @@ The 128-bit instruction is encoded as follows:
 ![Imgur](https://i.imgur.com/thpvVHN.png)
 
 #### Opcode
-There are 256 opcodes, which are distributed between various operations depending on their weight (how often they will occur in the program on average). The distribution of opcodes is following (TBD):
+There are 256 opcodes, which are distributed between various operations depending on their weight (how often they will occur in the program on average). The distribution of opcodes is following:
 
 |operation|number of opcodes||
 |---------|-----------------|----|
-|ALU operations|158|61.7%|
-|FPU operations|66|25.8%|
+|ALU operations|142|55.5%|
+|FPU operations|82|32.0%|
 |Control flow |32|12.5%|
 
 #### Operand A
@@ -144,30 +143,30 @@ A 32-bit address mask that is used to calculate the write address for the C oper
 
 ### ALU instructions
 
-|opcodes|instruction|signed|A width|B width|C|C width|
+|weight|instruction|signed|A width|B width|C|C width|
 |-|-|-|-|-|-|-|
-|0-13|ADD_64|no|64|64|A + B|64|
-|14-20|ADD_32|no|32|32|A + B|32|
-|21-34|SUB_64|no|64|64|A - B|64|
-|35-41|SUB_32|no|32|32|A - B|32|
-|42-45|MUL_64|no|64|64|A * B|64|
-|46-49|MULH_64|no|64|64|A * B|64|
-|50-53|MUL_32|no|32|32|A * B|64|
-|54-57|IMUL_32|yes|32|32|A * B|64|
-|58-61|IMULH_64|yes|64|64|A * B|64|
-|62|DIV_64|no|64|32|A / B|32|
-|63|IDIV_64|yes|64|32|A / B|32|
-|64-76|AND_64|no|64|64|A & B|64|
-|77-82|AND_32|no|32|32|A & B|32|
-|83-95|OR_64|no|64|64|A &#124; B|64|
-|96-101|OR_32|no|32|32|A &#124; B|32|
-|102-115|XOR_64|no|64|64|A ^ B|64|
-|116-121|XOR_32|no|32|32|A ^ B|32|
-|122-128|SHL_64|no|64|6|A << B|64|
-|129-132|SHR_64|no|64|6|A >> B|64|
-|133-135|SAR_64|yes|64|6|A >> B|64|
-|136-146|ROL_64|no|64|6|A <<< B|64|
-|147-157|ROR_64|no|64|6|A >>> B|64|
+|16|ADD_64|no|64|64|A + B|64|
+|8|ADD_32|no|32|32|A + B|32|
+|16|SUB_64|no|64|64|A - B|64|
+|8|SUB_32|no|32|32|A - B|32|
+|7|MUL_64|no|64|64|A * B|64|
+|7|MULH_64|no|64|64|A * B|64|
+|7|MUL_32|no|32|32|A * B|64|
+|7|IMUL_32|yes|32|32|A * B|64|
+|7|IMULH_64|yes|64|64|A * B|64|
+|1|DIV_64|no|64|32|A / B|32|
+|1|IDIV_64|yes|64|32|A / B|32|
+|4|AND_64|no|64|64|A & B|64|
+|3|AND_32|no|32|32|A & B|32|
+|4|OR_64|no|64|64|A &#124; B|64|
+|3|OR_32|no|32|32|A &#124; B|32|
+|4|XOR_64|no|64|64|A ^ B|64|
+|3|XOR_32|no|32|32|A ^ B|32|
+|6|SHL_64|no|64|6|A << B|64|
+|6|SHR_64|no|64|6|A >> B|64|
+|6|SAR_64|yes|64|6|A >> B|64|
+|9|ROL_64|no|64|6|A <<< B|64|
+|9|ROR_64|no|64|6|A >>> B|64|
 
 ##### 32-bit operations
 Instructions ADD_32, SUB_32, AND_32, OR_32, XOR_32 only use the low-order 32 bits of the input operands. The result of these operations is 32 bits long and bits 32-63 of C are zero.
@@ -195,14 +194,14 @@ The shift/rotate instructions use just the bottom 6 bits of the `B` operand (`im
 
 ### FPU instructions
 
-|opcodes|instruction|C|
+|weight|instruction|C|
 |-|-|-|
-|158-175|FADD|A + B|
-|176-193|FSUB|A - B|
-|194-211|FMUL|A * B|
-|212-214|FDIV|A / B|
-|215-221|FSQRT|sqrt(A)|
-|222-223|FROUND|A|
+|22|FADD|A + B|
+|22|FSUB|A - B|
+|22|FMUL|A * B|
+|8|FDIV|A / B|
+|6|FSQRT|sqrt(A)|
+|2|FROUND|A|
 
 FPU instructions conform to the IEEE-754 specification, so they must give correctly rounded results. Initial rounding mode is RN (Round to Nearest). Denormal values may not be produced by any operation.
 
@@ -230,15 +229,15 @@ The FROUND instruction changes the rounding mode for all subsequent FPU operatio
 ### Control flow instructions
 The following 2 control flow instructions are supported:
 
-|opcodes|instruction|function|
+|weight|instruction|function|
 |-|-|-|
-|224-240|CALL|near procedure call|
-|241-255|RET|return from procedure|
+|17|CALL|near procedure call|
+|15|RET|return from procedure|
 
 Both instructions are conditional in 75% of cases. The jump is taken only if `B <= imm1`. For the 25% of cases when `B` is equal to `imm1`, the jump is unconditional. In case the branch is not taken, both instructions become "arithmetic no-op" `C = A`.
 
 ##### CALL
-Taken CALL instruction pushes the values `A` and `pc` (program counter) onto the stack and then performs a forward jump relative to the value of `pc`. The forward offset is equal to `16 * (imm0[7:0] + 1)`. Maximum jump distance is therefore 128 instructions forward (this means that at least 4 correctly spaced CALL instructions are needed to form a loop in the program).
+Taken CALL instruction pushes the values `A` and `pc` (program counter) onto the stack and then performs a forward jump relative to the value of `pc`. The forward offset is equal to `16 * (imm0[6:0] + 1)`. Maximum jump distance is therefore 128 instructions forward (this means that at least 4 correctly spaced CALL instructions are needed to form a loop in the program).
 
 ##### RET
 The RET instruction behaves like "not taken" when the stack is empty. Taken RET instruction pops the return address `raddr` from the stack (it's the instruction following the previous CALL), then pops a return value `retval` from the stack and sets `C = A ^ retval`. Finally, the instruction jumps back to `raddr`.
@@ -249,7 +248,7 @@ The program is initialized from a 256-bit seed value `S`.
 2. The generator is used to generate random 128 bytes `R1`.
 3. Integer registers `r0`-`r7` are initialized using bytes 0-63 of `R1`.
 4. Floating point registers `f0`-`f7` are initialized using bytes 64-127 of `R1` interpreted as 8 64-bit signed integers converted to a double precision floating point format.
-5. The initial value of the `m0` register is set to `S[95:64]` and the the last 8 bits are cleared (256-byte aligned).
+5. The initial value of the `ma` register is set to `S[95:64]` and the the last 3 bits are cleared (8-byte aligned).
 6. `S` is expanded into 10 AES round keys `K0`-`K9`.
 7. `R1` is exploded into a 264 KiB buffer `B` by repeated 10-round AES encryption.
 8. The scratchpad is set to the first 256 KiB of `B`.