RandomWOW/doc/vm.md

55 lines
4.8 KiB
Markdown
Raw Normal View History

2018-12-14 12:12:18 +01:00
2018-12-31 19:27:31 +01:00
2018-12-14 12:12:18 +01:00
## RandomX virtual machine
RandomX is intended to be run efficiently on a general-purpose CPU. The virtual machine (VM) which runs RandomX code attempts to simulate a generic CPU using the following set of components:
![Imgur](https://i.imgur.com/ZAfbX9m.png)
#### Dataset
The VM has access to a read-only dataset which has a size of 4 GiB and changes every ~34 hours. See [dataset.md](dataset.md) for details how the dataset is generated.
#### MMU
The memory management unit (MMU) interfaces the CPU with the external memory. The purpose of the MMU is to translate the random memory accesses generated by the random program into a DRAM-friendly access pattern, where memory reads are not bound by access latency. The MMU accepts a 32-bit address `addr` and outputs a 64-bit value from the dataset. The dataset is read mostly sequentially. On average, there is one random read for every 8192 sequential reads. An average program reads a total of 4 MiB of the dataset and has 64 random reads.
The MMU uses two internal registers:
* **ma** - Address of the next quadword to be read from memory (32-bit, 8-byte aligned).
* **mx** - A 32-bit counter that determines if the next read is sequential or random. After each read, the read address is XORed with the counter and if bits 3-15 of the register are zero, bits 0-2 are cleared and the value of the `mx` register is copied into register `ma`. Thus, all random reads are aligned to a 64 KiB block boundary.
*When the value of the `ma` register is changed to a random address, the memory location can be preloaded into CPU cache using the x86 `PREFETCH` instruction or ARM `PRFM` instruction. Implicit prefetch should ensure that sequentially accessed memory is already in the cache.*
#### Scratchpad
The VM contains a 256 KiB scratchpad, which is accessed randomly both for reading and writing. The scratchpad is split into two segments (16 KiB and 240 KiB). 75% of accesses are into the first 16 KiB.
*The scratchpad access pattern mimics the usual CPU cache structure. The first 16 KiB should be covered by the L1 cache, while the remaining accesses should hit the L2 cache. In some cases, the read address can be calculated in advance, which should limit the impact of L1 cache misses.*
#### Program
The actual program is stored in a 8 KiB ring buffer structure. Each program consists of 512 random 128-bit instructions. The ring buffer structure makes sure that the program forms a closed infinite loop.
*For high-performance mining, the program should be translated directly into machine code. The whole program will fit into the L1 instruction cache and hot execution paths should stay in the µOP cache that is used by newer x86 CPUs. This should limit the number of front-end stalls and keep the CPU busy most of the time.*
#### Control unit
The control unit (CU) controls the execution of the program. It reads instructions from the program buffer and sends commands to the other units. The CU contains 3 internal registers:
* **pc** - Address of the next instruction in the program buffer to be executed (64-bit, 8 byte aligned).
* **sp** - Address of the last element on the stack (64-bit, 8 byte aligned).
* **ic** - Instruction counter contains the number of instructions to execute before terminating. The register is decremented after each instruction and the program execution stops when `ic` reaches `0`.
*Fixed number of executed instructions ensure roughly equal runtime of each random program.*
#### Stack
To simulate function calls, the VM uses a stack structure. The program interacts with the stack using the CALL and RET instructions. The stack has unlimited size and each stack element is 64 bits wide.
2018-12-31 19:27:31 +01:00
*Although there is no explicit limit of the stack size, the maximum theoretical size of the stack is 16 MiB. Most programs will use around 4 KiB of stack.*
2018-12-14 12:12:18 +01:00
#### Register file
2018-12-31 19:27:31 +01:00
The VM has 8 integer registers `r0`-`r7` and 8 floating point registers `f0`-`f7`. The integer registers are 64 bits wide. The floating point registers are 128 bits wide and each stores two packed double precision numbers.
2018-12-14 12:12:18 +01:00
*The number of registers is low enough so that they can be stored in actual hardware registers on most CPUs.*
#### ALU
The arithmetic logic unit (ALU) performs integer operations. The ALU can perform binary integer operations from 7 groups (addition, subtraction, multiplication, division, bitwise operations, shift, rotation) with operand sizes of 64 or 32 bits.
#### FPU
2018-12-31 19:27:31 +01:00
The floating-point unit performs IEEE-754 compliant math using 64-bit double precision floating point numbers. Five basic operations are available: addition, subtraction, multiplication, division and square root. All operations work with two packed double precision numbers.
2018-12-14 12:12:18 +01:00
#### Binary encoding
2018-12-31 19:27:31 +01:00
The VM stores and loads all data in little-endian byte order. Signed integer numbers are represented using two's complement.