mirror of
https://git.wownero.com/wownero/RandomWOW.git
synced 2024-12-22 15:58:53 +00:00
Updated documentation
parent
40a08bb0c8
commit
4934bbf69d
21
README.md
@ -8,10 +8,10 @@ RandomX is a proof-of-work (PoW) algorithm that is optimized for general-purpose
|
|||||||
|
|
||||||
RandomX behaves like a keyed hashing function: it accepts a key `K` and arbitrary input `H` and produces a 256-bit result `R`. Under the hood, RandomX utilizes a virtual machine that executes programs in a special instruction set that consists of a mix of integer math, floating point math and branches. These programs can be translated into the CPU's native machine code on the fly. Example of a RandomX program translated into x86-64 assembly is [program.asm](doc/program.asm). A portable interpreter mode is also provided.
|
RandomX behaves like a keyed hashing function: it accepts a key `K` and arbitrary input `H` and produces a 256-bit result `R`. Under the hood, RandomX utilizes a virtual machine that executes programs in a special instruction set that consists of a mix of integer math, floating point math and branches. These programs can be translated into the CPU's native machine code on the fly. Example of a RandomX program translated into x86-64 assembly is [program.asm](doc/program.asm). A portable interpreter mode is also provided.
|
||||||
|
|
||||||
RandomX can operate in two modes:
|
RandomX can operate in two main modes with different memory requirements:
|
||||||
|
|
||||||
* **Fast mode** - requires 2080 MiB of shared memory.
|
* **Fast mode** - requires 2080 MiB of shared memory.
|
||||||
* **Light mode** - requires only 256 MiB of shared memory, but runs significantly slower and uses more power per hash.
|
* **Light mode** - requires only 256 MiB of shared memory, but runs significantly slower.
|
||||||
|
|
||||||
## Documentation
|
## Documentation
|
||||||
|
|
||||||
@ -46,14 +46,20 @@ RandomX was primarily designed as a PoW algorithm for [Monero](https://www.getmo
|
|||||||
* The key `K` is selected to be the hash of a block in the blockchain - this block is called the 'key block'. For optimal mining and verification performance, the key should change every 2048 blocks (~2.8 days) and there should be a delay of 64 blocks (~2 hours) between the key block and the change of the key `K`. This can be achieved by changing the key when `blockHeight % 2048 == 64` and selecting key block such that `keyBlockHeight % 2048 == 0`.
|
* The key `K` is selected to be the hash of a block in the blockchain - this block is called the 'key block'. For optimal mining and verification performance, the key should change every 2048 blocks (~2.8 days) and there should be a delay of 64 blocks (~2 hours) between the key block and the change of the key `K`. This can be achieved by changing the key when `blockHeight % 2048 == 64` and selecting key block such that `keyBlockHeight % 2048 == 0`.
|
||||||
* The input `H` is the standard hashing blob.
|
* The input `H` is the standard hashing blob.
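The key rotation rule above can be sketched as follows. This is a minimal illustration of the modular arithmetic only, not code taken from any node or miner. The function name `key_block_height`, the constant names, and the fallback to block 0 before the first key change are assumptions made for this example.

```python
EPOCH_BLOCKS = 2048  # the key changes every 2048 blocks
EPOCH_LAG = 64       # delay between the key block and the key change


def key_block_height(block_height: int) -> int:
    """Return the height of the key block in effect at block_height.

    Follows the rule from the text: the key changes when
    block_height % 2048 == 64, and the key block always satisfies
    key_block_height % 2048 == 0.
    """
    if block_height < EPOCH_LAG:
        # ASSUMPTION: before the first key change, fall back to block 0.
        return 0
    return ((block_height - EPOCH_LAG) // EPOCH_BLOCKS) * EPOCH_BLOCKS


for height in (0, 64, 2111, 2112, 4160):
    print(height, key_block_height(height))
```

With these definitions, heights up to 2111 map to key block 0, and height 2112 is the first one that uses key block 2048.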
|
||||||
|
|
||||||
|
If you wish to use RandomX as a PoW algorithm for your cryptocurrency, we strongly recommend not using the [default parameters](src/configuration.h) and changing at least the following:
|
||||||
|
|
||||||
|
* Size of the Dataset (`RANDOMX_DATASET_BASE_SIZE` and `RANDOMX_DATASET_EXTRA_SIZE`).
|
||||||
|
* Scratchpad size (`RANDOMX_SCRATCHPAD_L3`, `RANDOMX_SCRATCHPAD_L2` and `RANDOMX_SCRATCHPAD_L1`).
|
||||||
|
* Instruction frequencies (parameters starting with `RANDOMX_FREQ_`).
|
||||||
|
|
||||||
### Performance
|
### Performance
|
||||||
Preliminary performance of selected CPUs using the optimal number of threads (T) and large pages (if possible), in hashes per second (H/s):
|
Preliminary performance of selected CPUs using the optimal number of threads (T) and large pages (if possible), in hashes per second (H/s):
|
||||||
|
|
||||||
|CPU|RAM|OS|AES|Fast mode|Light mode|
|
|CPU|RAM|OS|AES|Fast mode|Light mode|
|
||||||
|---|---|--|---|---------|--------------|
|
|---|---|--|---|---------|--------------|
|
||||||
AMD Ryzen 7 1700|16 GB DDR4|Ubuntu 16.04|hardware|4080 H/s (8T)|620 H/s (16T)|
|
AMD Ryzen 7 1700|16 GB DDR4|Ubuntu 16.04|hardware|4090 H/s (8T)|620 H/s (16T)|
|
||||||
Intel Core i7-8550U|16 GB DDR4|Windows 10|hardware|1700 H/s (4T)|350 H/s (8T)|
|
Intel Core i7-8550U|16 GB DDR4|Windows 10|hardware|1700 H/s (4T)|350 H/s (8T)|
|
||||||
Intel Core i3-3220|2 GB DDR3|Ubuntu 16.04|software|-|120 H/s (4T)|
|
Intel Core i3-3220|2 GB DDR3|Ubuntu 16.04|software|-|145 H/s (4T)|
|
||||||
Raspberry Pi 3|1 GB DDR2|Ubuntu 16.04|software|-|2.0 H/s (4T) †|
|
Raspberry Pi 3|1 GB DDR2|Ubuntu 16.04|software|-|2.0 H/s (4T) †|
|
||||||
|
|
||||||
† Using the interpreter mode. Compiled mode is expected to increase performance by a factor of 10.
|
† Using the interpreter mode. Compiled mode is expected to increase performance by a factor of 10.
|
||||||
@ -90,8 +96,13 @@ The reference implementation has been validated on the following platforms:
|
|||||||
RandomX uses some source code from the following 3rd party repositories:
|
RandomX uses some source code from the following 3rd party repositories:
|
||||||
* Argon2d, Blake2b hashing functions: https://github.com/P-H-C/phc-winner-argon2
|
* Argon2d, Blake2b hashing functions: https://github.com/P-H-C/phc-winner-argon2
|
||||||
|
|
||||||
|
The author of RandomX declares no competing financial interest in RandomX adoption, other than being a holder of Monero. The development of RandomX was funded from the author's own pocket with only the help listed above.
|
||||||
|
|
||||||
## Donations
|
## Donations
|
||||||
XMR (tevador):
|
|
||||||
|
If you'd like to use RandomX, please consider donating to help cover the development cost of the algorithm.
|
||||||
|
|
||||||
|
Author's XMR address:
|
||||||
```
|
```
|
||||||
845xHUh5GvfHwc2R8DVJCE7BT2sd4YEcmjG8GNSdmeNsP5DTEjXd1CNgxTcjHjiFuthRHAoVEJjM7GyKzQKLJtbd56xbh7V
|
845xHUh5GvfHwc2R8DVJCE7BT2sd4YEcmjG8GNSdmeNsP5DTEjXd1CNgxTcjHjiFuthRHAoVEJjM7GyKzQKLJtbd56xbh7V
|
||||||
```
|
```
|
||||||
|
@ -34,7 +34,9 @@ Modern CPUs include a sophisticated branch predictor unit [[3](https://en.wikipe
|
|||||||
|
|
||||||
The best way to maximize CPU efficiency is not to have any branches at all. However, CPUs invest a lot of die area and energy to handle branches. Without branches, CPU design can be significantly simplified because there is no need for commit/retire stages, which must be part of all speculative-execution designs to be able to recover from branch mispredictions.
|
The best way to maximize CPU efficiency is not to have any branches at all. However, CPUs invest a lot of die area and energy to handle branches. Without branches, CPU design can be significantly simplified because there is no need for commit/retire stages, which must be part of all speculative-execution designs to be able to recover from branch mispredictions.
|
||||||
|
|
||||||
RandomX therefore uses random branches with a jump probability of 1/128. These branches will be predicted as "not taken" by the CPU. Such branches are "free" in most CPU designs unless they are taken. The branching conditions and jump targets are chosen in such way that infinite loops in RandomX code are impossible because the register controlling the branch will never be modified in the repeated code block. The additional instructions executed due to branches represent less than 1% of all instructions.
|
RandomX therefore uses random branches with a jump probability of 1/256. These branches will be predicted as "not taken" by the CPU. Such branches are "free" in most CPU designs unless they are taken. The branching conditions and jump targets are chosen in such a way that infinite loops in RandomX code are impossible because the register controlling the branch will never be modified in the repeated code block. Each CBRANCH instruction can jump at most twice in a row. The additional instructions executed due to branches represent less than 1% of all instructions.
|
||||||
|
|
||||||
|
Additionally, branches in the code significantly reduce the possibilities of static optimizations. For example, the ISWAP_R instruction could be optimized away by renaming registers if it wasn't for branches.
|
||||||
|
|
||||||
### CPU Caches
|
### CPU Caches
|
||||||
|
|
||||||
@ -67,7 +69,9 @@ The domains of floating point operations are separated into "additive" operation
|
|||||||
|
|
||||||
Because the limited range of group F registers allows more efficient fixed-point implementation (with 85-bit numbers), the FSCAL instruction manipulates the binary representation of the floating point format to make this optimization more difficult.
|
Because the limited range of group F registers allows more efficient fixed-point implementation (with 85-bit numbers), the FSCAL instruction manipulates the binary representation of the floating point format to make this optimization more difficult.
|
||||||
|
|
||||||
Group E registers are restricted to positive values, which avoids `NaN` results (such as square root of a negative number or `0 * ∞`). Division uses only memory source operand to avoid being optimized into multiplication by constant reciprocal. The exponent of group E operands is set to -240 to avoid division and multiplication by 0 and to increase the range of numbers that can be obtained. The approximate range of possible group E register values is `6.0E-73` to `infinity`.
|
Group E registers are restricted to positive values, which avoids `NaN` results (such as square root of a negative number or `0 * ∞`). Division uses only memory source operand to avoid being optimized into multiplication by constant reciprocal. The exponent of group E operands is set to a value between -255 and 0 to avoid division and multiplication by 0 and to increase the range of numbers that can be obtained. The approximate range of possible group E register values is `1.7E-77` to `infinity`.
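As a quick check of the lower bound quoted above, the smallest group E value can be estimated as a mantissa of 1.0 combined with the minimum exponent of -255. Treating the mantissa as lying in [1, 2) is an assumption made for this sketch.

```python
# Smallest exponent allowed for group E operands, per the text above.
MIN_EXPONENT = -255

# ASSUMPTION: the mantissa lies in [1, 2), so the smallest possible
# value is 1.0 * 2**MIN_EXPONENT.
smallest_group_e = 2.0 ** MIN_EXPONENT

print(f"{smallest_group_e:.1e}")
```

The result is on the order of `1.7E-77`, which matches the approximate lower bound given for group E registers.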
|
||||||
|
|
||||||
|
While all register-register operations use only 4 static source operands (registers `a0`-`a3`), the vast majority of operations cannot be optimized because floating point math is not associative (for example `x - a0 + a0` generally doesn't equal `x`). Additionally, optimizations are more difficult due to branches in the code (for example, the sequence of operations `x - a0; CBRANCH; x + a0` can produce a value close to `x` or `x - a0` depending on whether the branch was taken or not).
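The non-associativity argument can be seen with ordinary double-precision arithmetic. The snippet below is only an illustration: the values of `x` and `a0` are hypothetical and were picked to make the rounding loss obvious, not to reflect real register contents.

```python
x = 1.0
a0 = 2.0 ** 60  # hypothetical stand-in for a static source operand

# Mathematically, x - a0 + a0 equals x. In floating point, the
# intermediate result (x - a0) is rounded, and x is lost entirely.
result = (x - a0) + a0

print(result)
print(result == x)
```

Because the subtraction rounds to the nearest representable value, an optimizer cannot simply cancel the two operations without changing the result.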
|
||||||
|
|
||||||
To maximize entropy and also to fit into one 64-byte cache line, floating point registers are combined using the XOR operation at the end of each iteration before being stored into the Scratchpad.
|
To maximize entropy and also to fit into one 64-byte cache line, floating point registers are combined using the XOR operation at the end of each iteration before being stored into the Scratchpad.
|
||||||
|
|
||||||
@ -75,14 +79,10 @@ To maximize entropy and also to fit into one 64-byte cache line, floating point
|
|||||||
|
|
||||||
RandomX uses all primitive integer operations that preserve entropy: addition, subtraction, multiplication, XOR and rotation.
|
RandomX uses all primitive integer operations that preserve entropy: addition, subtraction, multiplication, XOR and rotation.
|
||||||
|
|
||||||
The IADD_RC and IMUL_9C instructions utilize the address calculation logic of CPUs and can be performed in a single instruction by most CPUs.
|
The IADD_RS instruction utilizes the address calculation logic of CPUs and can be performed in a single hardware instruction by most CPUs.
|
||||||
|
|
||||||
Because integer division is not fully pipelined in CPUs and can be made faster in ASICs, the IMUL_RCP instruction requires only one division per program to calculate the reciprocal. This forces an ASIC to include a hardware divider without giving them a performance advantage during program execution.
|
Because integer division is not fully pipelined in CPUs and can be made faster in ASICs, the IMUL_RCP instruction requires only one division per program to calculate the reciprocal. This forces an ASIC to include a hardware divider without giving them a performance advantage during program execution.
|
||||||
|
|
||||||
The ISWAP_R instruction can be performed efficiently by CPUs that utilize register renaming.
|
|
||||||
|
|
||||||
The COND instructions add branches to RandomX programs and also use the common condition flags that are supported by most CPU architectures.
|
|
||||||
|
|
||||||
### Memory access
|
### Memory access
|
||||||
|
|
||||||
RandomX randomly reads from large buffer of data (Dataset) 16384 times for each hash calculation. Since the Dataset must be stored in DRAM, it provides a natural parallelization limit, because DRAM cannot do more than about 25 million random accesses per second per bank group. Each separately addressable bank group allows a throughput of around 1500 H/s.
|
RandomX randomly reads from large buffer of data (Dataset) 16384 times for each hash calculation. Since the Dataset must be stored in DRAM, it provides a natural parallelization limit, because DRAM cannot do more than about 25 million random accesses per second per bank group. Each separately addressable bank group allows a throughput of around 1500 H/s.
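The throughput figure follows directly from the two numbers in the paragraph above. The short calculation below just reproduces that arithmetic; the constant names are chosen for this example.

```python
DATASET_READS_PER_HASH = 16384     # random Dataset reads per hash (from the text)
RANDOM_ACCESSES_PER_SECOND = 25e6  # approximate DRAM limit per bank group

hashes_per_second = RANDOM_ACCESSES_PER_SECOND / DATASET_READS_PER_HASH
print(round(hashes_per_second))
```

The result is a little over 1500 H/s per bank group, in line with the estimate quoted in the text.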
|
||||||
@ -91,9 +91,9 @@ All Dataset accesses read whole CPU cache line (64 bytes) and are fully prefetch
|
|||||||
|
|
||||||
#### Cache
|
#### Cache
|
||||||
|
|
||||||
The Cache, which is used for light verification and Dataset construction, is 8 times smaller than the Dataset. To keep a constant area-time product, each Dataset item is constructed by 8 Cache accesses (8 * 256 MiB = 1 * 2 GiB).
|
The Cache, which is used for light verification and Dataset construction, is about 8 times smaller than the Dataset. To keep a constant area-time product, each Dataset item is constructed from 8 random Cache accesses. The Dataset is 32 MiB larger than 8 times the Cache size. These additional 32 MiB compensate for chip area needed by SuperscalarHash.
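The size relationship described above can be checked with a few lines of arithmetic. The variable names are illustrative, and the extra size is taken as exactly 32 MiB here, which is an approximation based on the text rather than on the configuration file.

```python
MIB = 1024 * 1024

cache_size = 256 * MIB
extra_size = 32 * MIB  # ASSUMPTION: rounded to a whole number of MiB

dataset_size = 8 * cache_size + extra_size

print(dataset_size // MIB)        # total Dataset size in MiB
print(dataset_size / cache_size)  # ratio of Dataset size to Cache size
```

This gives 2080 MiB, the same figure the README lists as the memory requirement of fast mode, and a Dataset-to-Cache ratio slightly above 8.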
|
||||||
|
|
||||||
Because 256 MiB is small enough to be included on-chip, RandomX uses a high-latency mixing function (SquareHash) which defeats the benefits of using low-latency memory for mining in tradeoff mode.
|
Because 256 MiB is small enough to be included on-chip, RandomX uses a high-latency, high-power mixing function (SuperscalarHash), which defeats the benefits of using low-latency memory. In addition, the energy required to calculate SuperscalarHash makes light mode very inefficient for mining.
|
||||||
|
|
||||||
Using less than 256 MiB of memory is not possible due to the use of tradeoff-resistant Argon2d with 3 iterations. When using 3 iterations (passes), halving the memory usage increases computational cost 3423 times for the best tradeoff attack [[7](https://eprint.iacr.org/2015/430.pdf)].
|
Using less than 256 MiB of memory is not possible due to the use of tradeoff-resistant Argon2d with 3 iterations. When using 3 iterations (passes), halving the memory usage increases computational cost 3423 times for the best tradeoff attack [[7](https://eprint.iacr.org/2015/430.pdf)].
|
||||||
|
|
||||||
@ -103,33 +103,39 @@ The Scratchpad is used as read-write memory. Its size was selected to fit entire
|
|||||||
|
|
||||||
Additionally, Scratchpad operations require write-read coherency, because when a write to L1 Scratchpad is in progress, a read has a 1/2048 chance of being from the same address. This is handled by the load-store unit (LSU) inside the CPU and requires every read to be checked against the addresses of all pending writes. Hardware without these coherency checks will produce >99% of invalid results.
|
Additionally, Scratchpad operations require write-read coherency, because when a write to L1 Scratchpad is in progress, a read has a 1/2048 chance of being from the same address. This is handled by the load-store unit (LSU) inside the CPU and requires every read to be checked against the addresses of all pending writes. Hardware without these coherency checks will produce >99% of invalid results.
|
||||||
|
|
||||||
|
While most writes to the Scratchpad are into L1 and L2, 144 bytes of data are written into L3 scratchpad per iteration, on average (64 bytes for integer registers, 64 bytes for floating point registers and 16 bytes by ISTORE instruction).
|
||||||
|
|
||||||
|
The image below visualizes writes to the Scratchpad. Each pixel in this image represents 8 bytes of the Scratchpad. Red pixels represent portions of the Scratchpad that have been overwritten at least once during hash calculation. The L1 and L2 levels are on the left side (almost completely overwritten). The right side of the scratchpad represents the bottom 1792 KiB. Only about 66% of it is overwritten, but the writes are spread uniformly and randomly.
|
||||||
|
|
||||||
|
![Imgur](https://i.imgur.com/pRz6aBG.png)
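The observed coverage can be roughly estimated with a simple model of uniformly random writes. This is a back-of-the-envelope sketch, not a simulation of the VM. It assumes a 2 MiB Scratchpad (the default configuration), 16384 iterations per hash (one per Dataset read), and that the 144 bytes written per iteration land at uniformly random positions.

```python
import math

L3_SIZE = 2 * 1024 * 1024  # ASSUMPTION: 2 MiB Scratchpad (default configuration)
ITERATIONS = 16384         # ASSUMPTION: one iteration per Dataset read
BYTES_PER_ITERATION = 144  # average L3 writes per iteration (from the text)

written = BYTES_PER_ITERATION * ITERATIONS

# For uniformly random writes, a given location stays untouched with
# probability of roughly exp(-written / size).
covered = 1.0 - math.exp(-written / L3_SIZE)

print(f"{covered:.1%}")
```

Under these assumptions the model predicts that roughly two thirds of the area is overwritten, which is close to the figure observed in the image.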
|
||||||
|
|
||||||
### Choice of hashing function
|
### Choice of hashing function
|
||||||
RandomX uses Blake2b as its main cryptographically secure hashing function. Blake2b was specifically designed to be fast in software, especially on modern 64-bit processors, where it's around three times faster than SHA-3 and can run at a speed of around 3 clock cycles per byte of input.
|
RandomX uses Blake2b as its main cryptographically secure hashing function. Blake2b was specifically designed to be fast in software, especially on modern 64-bit processors, where it's around three times faster than SHA-3 and can run at a speed of around 3 clock cycles per byte of input.
|
||||||
|
|
||||||
### Custom functions
|
### Custom functions
|
||||||
|
|
||||||
#### SquareHash
|
#### SuperscalarHash
|
||||||
|
|
||||||
SquareHash was chosen for its relative simplicity (uses only two operations - multiplication and subtraction) and high latency. A single SquareHash calculation takes 40-80 ns on a CPU, which is about the same time as DRAM access latency. ASIC devices using low-latency memory will be bottlenecked by SquareHash when calculating Dataset items, while CPUs will finish the hash calculation in about the same time it takes to fetch data from RAM.
|
SuperscalarHash was designed to burn as much power as possible while the CPU is waiting for data to be loaded from DRAM. The target latency of 170 cycles corresponds to the usual DRAM latency of 40-80 ns. ASIC devices designed for light-mode mining with low-latency memory will be bottlenecked by SuperscalarHash when calculating Dataset items and their efficiency will be destroyed by the high power usage of SuperscalarHash.
|
||||||
|
|
||||||
From a cryptographic standpoint, SquareHash achieves full Avalanche effect [[8](https://en.wikipedia.org/wiki/Avalanche_effect)]. SquareHash was originally based on exponentiation by squaring [[9](https://en.wikipedia.org/wiki/Exponentiation_by_squaring)]. In the [x86 assembly implementation](https://github.com/tevador/RandomX/blob/master/src/asm/squareHash.inc), if `adc rax, 0` is added after each subtraction, SquareHash becomes the following operation:
|
The average SuperscalarHash function contains a total of 450 instructions, out of which 155 are 64-bit multiplications. On average, the longest dependency chain is 95 instructions long. An ASIC design for light-mode mining, with 256 MiB of on-die memory and the ability to execute 1 instruction per cycle, will need on average 760 cycles to construct a Dataset item, assuming unlimited parallelization. It will have to execute 1240 64-bit multiplications per item, which will consume energy comparable to loading data from DRAM.
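The per-item figures above are consistent with 8 SuperscalarHash evaluations per Dataset item, one for each Cache access. The sketch below reproduces that arithmetic and also shows which clock frequencies make 170 cycles correspond to 40-80 ns. Treating each Cache access as exactly one SuperscalarHash evaluation is an assumption of this example.

```python
CACHE_ACCESSES_PER_ITEM = 8      # ASSUMPTION: one SuperscalarHash per Cache access
LONGEST_DEPENDENCY_CHAIN = 95    # instructions, on average (from the text)
MULTIPLICATIONS_PER_HASH = 155   # 64-bit multiplications (from the text)
TARGET_LATENCY_CYCLES = 170      # target latency (from the text)

# At 1 instruction per cycle, the dependency chain bounds the latency.
cycles_per_item = CACHE_ACCESSES_PER_ITEM * LONGEST_DEPENDENCY_CHAIN
multiplications_per_item = CACHE_ACCESSES_PER_ITEM * MULTIPLICATIONS_PER_HASH

print(cycles_per_item, multiplications_per_item)

# Clock frequency (in GHz) at which 170 cycles take the given time.
for latency_ns in (40, 80):
    print(latency_ns, TARGET_LATENCY_CYCLES / latency_ns)
```

The products are 760 cycles and 1240 multiplications per item, and the 40-80 ns window corresponds to clock speeds between roughly 2.1 and 4.3 GHz.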
|
||||||
<code>
|
|
||||||
(x+9507361525245169745)<sup>4398046511104</sup> mod 2<sup>64</sup>+1
|
|
||||||
</code>,
|
|
||||||
where <code>4398046511104 = 2<sup>42</sup></code>. The addition of the carry was removed to improve CPU performance. The constant `9507361525245169745` is added to make SquareHash sensitive to zero (see chapter 3.4 of Specification).
|
|
||||||
|
|
||||||
#### Generator
|
#### AesGenerator
|
||||||
|
|
||||||
Generator was designed for fastest possible generation of pseudorandom data. It takes advantage of hardware accelerated AES in modern CPUs. Only one AES round is performed per 16 bytes of output, which results in throughput exceeding 20 GB/s. The Scratchpad can be filled in under 100 μs. The Generator state is initialized from the output of Blake2b.
|
AesGenerator was designed for fastest possible generation of pseudorandom data. It takes advantage of hardware accelerated AES in modern CPUs. Only one AES round is performed per 16 bytes of output, which results in throughput exceeding 20 GB/s. The Scratchpad can be filled in under 100 μs. The AesGenerator state is initialized from the output of Blake2b.
|
||||||
|
|
||||||
#### Finalizer
|
#### AesHash
|
||||||
|
|
||||||
The Finalizer was designed for fastest possible calculation of the Scratchpad fingerprint. It interprets the Scratchpad as a set of AES round keys, so it's equivalent to AES encryption with 32768 rounds. Two extra rounds are performed at the end to ensure avalanche of all Scratchpad bits in each lane. The output of the Finalizer is fed into the Blake2b hashing function to calculate the final proof hash.
|
AesHash was designed for fastest possible calculation of the Scratchpad fingerprint. It interprets the Scratchpad as a set of AES round keys, so it's equivalent to AES encryption with 32768 rounds. Two extra rounds are performed at the end to ensure avalanche of all Scratchpad bits in each lane. The output of the AesHash is fed into the Blake2b hashing function to calculate the final PoW hash.
|
||||||
|
|
||||||
### Chaining of VM executions
|
### Chaining of VM executions
|
||||||
|
|
||||||
RandomX chains 8 VM initializations and executions to prevent mining strategies that search for 'easy' programs.
|
RandomX chains 8 VM initializations and executions to prevent mining strategies that search for 'easy' programs.
|
||||||
|
|
||||||
|
For example, let's calculate the cost of avoiding the CFROUND instruction (frequency 1/256). There is a chance <code>Q = (255/256)<sup>256</sup> = 0.367</code> that a generated RandomX program doesn't contain any CFROUND instructions. If we assume that program generation is 'free', we can easily find such a program. However, after we execute the program, there is a chance `1-Q` that the next program *has* a CFROUND instruction and we have wasted one program execution.
|
||||||
|
|
||||||
|
For 8 chained executions, the chance is only <code>Q<sup>7</sup> = 0.0009</code> that all programs in the chain contain no CFROUND instruction. However, during each attempt to find such a chain, we will waste the execution of <code>(1-Q)*(1+2\*Q+3\*Q<sup>2</sup>+4\*Q<sup>3</sup>+5\*Q<sup>4</sup>+6\*Q<sup>5</sup>+7\*Q<sup>6</sup>) = 1.58</code> programs. In the end, we will have to execute, on average, 1734 programs to calculate a single hash! Compared to an honest miner that has to execute just 8 programs per hash, our hashrate is decreased more than 216 times, which is a lot more than the efficiency gain from avoiding 3 rounding modes.
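The estimate above can be recomputed numerically. The sketch below is one plausible way to combine the quantities: it multiplies the expected number of attempts by the expected waste per attempt and adds the 8 programs of the successful chain. That combination rule is an assumption of this example, so the total should be read as an order-of-magnitude check rather than an exact reproduction of the quoted figure.

```python
# Chance that a generated program contains no CFROUND instruction.
Q = (255 / 256) ** 256

# Chance that the remaining 7 programs of a chain are also CFROUND-free.
p_chain = Q ** 7

# Expected number of executions wasted per attempt: k programs are
# executed, then program k+1 turns out to contain CFROUND.
wasted = (1 - Q) * sum(k * Q ** (k - 1) for k in range(1, 8))

# ASSUMPTION: total cost = expected attempts * waste per attempt,
# plus the 8 programs of the chain that finally succeeds.
total = wasted / p_chain + 8

print(f"Q = {Q:.3f}")
print(f"Q^7 = {p_chain:.4f}")
print(f"wasted per attempt = {wasted:.2f}")
print(f"programs per hash = {total:.0f}")
print(f"slowdown = {total / 8:.0f}x")
```

The computed values land close to the rounded figures quoted above. Small differences come from rounding of intermediate values, and the conclusion is unchanged: the slowdown is on the order of 200 times.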
|
||||||
|
|
||||||
## References
|
## References
|
||||||
|
|
||||||
[1] CryptoNote whitepaper - https://cryptonote.org/whitepaper.pdf
|
[1] CryptoNote whitepaper - https://cryptonote.org/whitepaper.pdf
|
||||||
|
200
doc/specs.md
@ -2,6 +2,17 @@
|
|||||||
|
|
||||||
RandomX is a proof of work (PoW) algorithm which was designed to close the gap between general-purpose CPUs and specialized hardware. The core of the algorithm is a simulation of a virtual CPU.
|
RandomX is a proof of work (PoW) algorithm which was designed to close the gap between general-purpose CPUs and specialized hardware. The core of the algorithm is a simulation of a virtual CPU.
|
||||||
|
|
||||||
|
#### Table of contents
|
||||||
|
|
||||||
|
1. [Definitions](#1-definitions)
|
||||||
|
1. [Algorithm description](#2-algorithm-description)
|
||||||
|
1. [Custom functions](#3-custom-functions)
|
||||||
|
1. [Virtual Machine](#4-virtual-machine)
|
||||||
|
1. [Instruction set](#5-instruction-set)
|
||||||
|
1. [SuperscalarHash](#6-superscalarhash)
|
||||||
|
1. [Dataset](#7-dataset)
|
||||||
|
|
||||||
|
|
||||||
## 1. Definitions
|
## 1. Definitions
|
||||||
|
|
||||||
### 1.1 General definitions
|
### 1.1 General definitions
|
||||||
@ -18,7 +29,7 @@ RandomX is a proof of work (PoW) algorithm which was designed to close the gap b
|
|||||||
|
|
||||||
**BlakeGenerator** refers to a custom pseudo-random number generator described in chapter 3.4. It's based on the Blake2b hashing function.
|
**BlakeGenerator** refers to a custom pseudo-random number generator described in chapter 3.4. It's based on the Blake2b hashing function.
|
||||||
|
|
||||||
**SuperscalarHash** refers to a custom diffusion function designed to run efficiently on superscalar CPUs (see chapter 3.5). It transforms a 64-byte input value into a 64-byte output value.
|
**SuperscalarHash** refers to a custom diffusion function designed to run efficiently on superscalar CPUs (see chapter 6). It transforms a 64-byte input value into a 64-byte output value.
|
||||||
|
|
||||||
**Virtual Machine** or **VM** refers to the RandomX virtual machine as described in chapter 4.
|
**Virtual Machine** or **VM** refers to the RandomX virtual machine as described in chapter 4.
|
||||||
|
|
||||||
@ -32,9 +43,9 @@ RandomX is a proof of work (PoW) algorithm which was designed to close the gap b
|
|||||||
|
|
||||||
**Program Buffer** refers to the buffer from which the VM reads instructions.
|
**Program Buffer** refers to the buffer from which the VM reads instructions.
|
||||||
|
|
||||||
**Cache** refers to a read-only buffer initialized by Argon2d as described in chapter 6.2.
|
**Cache** refers to a read-only buffer initialized by Argon2d as described in chapter 7.1.
|
||||||
|
|
||||||
**Dataset** refers to a large read-only buffer described in chapter 6. It is constructed from the Cache using the SuperscalarHash function.
|
**Dataset** refers to a large read-only buffer described in chapter 7. It is constructed from the Cache using the SuperscalarHash function.
|
||||||
|
|
||||||
### 1.2 Configurable parameters
|
### 1.2 Configurable parameters
|
||||||
RandomX has several configurable parameters that are listed in Table 1.2.1 with their default values.
|
RandomX has several configurable parameters that are listed in Table 1.2.1 with their default values.
|
||||||
@ -205,10 +216,6 @@ The internal state is initialized from a seed value `K` (0-60 bytes long). The s
|
|||||||
|
|
||||||
The generator can generate 1 byte or 4 bytes at a time by supplying data from its internal state `S`. If there are not enough unused bytes left, the internal state is reinitialized as `S = Hash512(S)`.
|
The generator can generate 1 byte or 4 bytes at a time by supplying data from its internal state `S`. If there are not enough unused bytes left, the internal state is reinitialized as `S = Hash512(S)`.
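The refill behaviour described above can be sketched as follows. This is a simplified illustration, not the reference implementation. It assumes that `Hash512` is Blake2b with a 64-byte digest, that 4-byte outputs are read as little-endian integers, and that the initial state is simply `Hash512(K)`; the exact seeding procedure is defined in the specification and may differ. The class and method names are hypothetical.

```python
import hashlib


def hash512(data: bytes) -> bytes:
    """Blake2b with a 64-byte digest (assumed meaning of Hash512)."""
    return hashlib.blake2b(data, digest_size=64).digest()


class BlakeGeneratorSketch:
    """Byte-oriented generator that refills its state by rehashing it."""

    def __init__(self, seed: bytes):
        if len(seed) > 60:
            raise ValueError("seed must be 0-60 bytes long")
        # ASSUMPTION: simplified seeding, see the lead-in above.
        self.state = hash512(seed)
        self.pos = 0

    def _take(self, count: int) -> bytes:
        # Not enough unused bytes left: reinitialize S = Hash512(S).
        if self.pos + count > len(self.state):
            self.state = hash512(self.state)
            self.pos = 0
        chunk = self.state[self.pos:self.pos + count]
        self.pos += count
        return chunk

    def get_byte(self) -> int:
        return self._take(1)[0]

    def get_uint32(self) -> int:
        # ASSUMPTION: little-endian byte order.
        return int.from_bytes(self._take(4), "little")


gen = BlakeGeneratorSketch(b"example seed")
print(gen.get_byte(), gen.get_uint32())
```

Note that when fewer than 4 unused bytes remain, a 4-byte request discards them and rehashes, which follows from the rule that the state is reinitialized whenever there are not enough unused bytes.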
|
||||||
|
|
||||||
### 3.5 SuperscalarHash
|
|
||||||
|
|
||||||
TODO
|
|
||||||
|
|
||||||
## 4. Virtual Machine
|
## 4. Virtual Machine
|
||||||
|
|
||||||
The components of the RandomX virtual machine are summarized in Fig. 4.1.
|
The components of the RandomX virtual machine are summarized in Fig. 4.1.
|
||||||
@ -221,7 +228,7 @@ The VM is a complex instruction set computer ([CISC](https://en.wikipedia.org/wi
|
|||||||
|
|
||||||
### 4.1 Dataset
|
### 4.1 Dataset
|
||||||
|
|
||||||
Dataset is described in detail in chapter 6. It's a large read-only buffer. Its size is equal to `RANDOMX_DATASET_BASE_SIZE + RANDOMX_DATASET_EXTRA_SIZE` bytes. Each program uses only a random subset of the Dataset of size `RANDOMX_DATASET_BASE_SIZE`. All Dataset accesses read an aligned 64-byte item.
|
Dataset is described in detail in chapter 7. It's a large read-only buffer. Its size is equal to `RANDOMX_DATASET_BASE_SIZE + RANDOMX_DATASET_EXTRA_SIZE` bytes. Each program uses only a random subset of the Dataset of size `RANDOMX_DATASET_BASE_SIZE`. All Dataset accesses read an aligned 64-byte item.
|
||||||
|
|
||||||
### 4.2 Scratchpad
|
### 4.2 Scratchpad
|
||||||
|
|
||||||
@ -653,23 +660,182 @@ There is one explicit store instruction for integer values.
|
|||||||
#### 5.5.1 ISTORE
|
#### 5.5.1 ISTORE
|
||||||
This instruction stores the value of the source integer register to the memory at the address calculated from the value of the destination register. The `src` and `dst` can be the same register.
|
This instruction stores the value of the source integer register to the memory at the address calculated from the value of the destination register. The `src` and `dst` can be the same register.
|
||||||
|
|
||||||
## 6. Dataset
|
## 6. SuperscalarHash
|
||||||
|
|
||||||
|
SuperscalarHash is a custom diffusion function that was designed to burn as much power as possible using only the CPU's integer ALUs.
|
||||||
|
|
||||||
|
The input and output of SuperscalarHash are 8 integer registers `r0`-`r7`, each 64 bits wide. The output of SuperscalarHash is used to construct the Dataset (see chapter 7.3).
|
||||||
|
|
||||||
|
### 6.1 Instructions
|
||||||
|
The body of SuperscalarHash is a random sequence of instructions that can run on the Virtual Machine. SuperscalarHash uses a reduced set of only integer register-register instructions listed in Table 6.1.1. `dst` refers to the destination register, `src` to the source register.
|
||||||
|
|
||||||
|
*Table 6.1.1 - SuperscalarHash instructions*
|
||||||
|
|
||||||
|
|freq. †|instruction|Macro-ops|operation|rules|
|
||||||
|
|-|-|-|-|-|
|
||||||
|
|0.11|ISUB_R|`sub_rr`|`dst = dst - src`|`dst != src`|
|
||||||
|
|0.11|IXOR_R|`xor_rr`|`dst = dst ^ src`|`dst != src`|
|
||||||
|
|0.11|IADD_RS|`lea_sib`|`dst = dst + (src << mod.shift)`|`dst != src`, `dst != r5`|
|
||||||
|
|0.22|IMUL_R|`imul_rr`|`dst = dst * src`|`dst != src`|
|
||||||
|
|0.11|IROR_C|`ror_ri`|`dst = dst >>> imm32`|`imm32 % 64 != 0`|
|
||||||
|
|0.10|IADD_C|`add_ri`|`dst = dst + imm32`|
|
||||||
|
|0.10|IXOR_C|`xor_ri`|`dst = dst ^ imm32`|
|
||||||
|
|0.03|IMULH_R|`mov_rr`,`mul_r`,`mov_rr`|`dst = (dst * src) >> 64`|
|
||||||
|
|0.03|ISMULH_R|`mov_rr`,`imul_r`,`mov_rr`|`dst = (dst * src) >> 64` (signed)|
|
||||||
|
|0.06|IMUL_RCP|`mov_ri`,`imul_rr`|<code>dst = 2<sup>x</sup> / imm32 * dst</code>|`imm32 != 0`, <code>imm32 != 2<sup>N</sup></code>|
|
||||||
|
|
||||||
|
† Frequencies are approximate. Instructions are generated based on complex rules.
|
||||||
|
|
||||||
|
#### 6.1.1 ISUB_R
|
||||||
|
See chapter 5.2.3. Source and destination are always distinct registers.
|
||||||
|
|
||||||
|
#### 6.1.2 IXOR_R
|
||||||
|
See chapter 5.2.8. Source and destination are always distinct registers.
|
||||||
|
|
||||||
|
#### 6.1.3 IADD_RS
|
||||||
|
See chapter 5.2.1. Source and destination are always distinct registers and register `r5` cannot be the destination.
|
||||||
|
|
||||||
|
#### 6.1.4 IMUL_R
|
||||||
|
See chapter 5.2.4. Source and destination are always distinct registers.
|
||||||
|
|
||||||
|
#### 6.1.5 IROR_C
|
||||||
|
The destination register is rotated right. The rotation count is given by `imm32` masked to 6 bits and cannot be 0.
|
||||||
|
|
||||||
|
#### 6.1.6 IADD_C
|
||||||
|
A sign-extended `imm32` is added to the destination register.
|
||||||
|
|
||||||
|
#### 6.1.7 IXOR_C
|
||||||
|
The destination register is XORed with a sign-extended `imm32`.
|
||||||
|
|
||||||
|
#### 6.1.8 IMULH_R, ISMULH_R
|
||||||
|
See chapter 5.2.5.
|
||||||
|
|
||||||
|
#### 6.1.9 IMUL_RCP
|
||||||
|
See chapter 5.2.6. `imm32` is never 0 or a power of 2.
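The operations in Table 6.1.1 can be modelled on 64-bit unsigned integers as shown below. This is a minimal sketch for illustration: the function names are hypothetical, register values are assumed to be in the range 0 to 2^64-1, and the shift for IADD_RS is passed in directly instead of being decoded from `mod.shift`. IMUL_RCP is omitted because the exponent `x` is defined in chapter 5.2.6.

```python
MASK64 = (1 << 64) - 1


def sign_extend_imm32(imm32: int) -> int:
    """Sign-extend a 32-bit immediate to 64 bits (two's complement)."""
    imm32 &= 0xFFFFFFFF
    return imm32 | 0xFFFFFFFF00000000 if imm32 & 0x80000000 else imm32


def to_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as a signed integer."""
    return value - (1 << 64) if value & (1 << 63) else value


def isub_r(dst: int, src: int) -> int:
    return (dst - src) & MASK64


def ixor_r(dst: int, src: int) -> int:
    return dst ^ src


def iadd_rs(dst: int, src: int, shift: int) -> int:
    return (dst + (src << shift)) & MASK64


def imul_r(dst: int, src: int) -> int:
    return (dst * src) & MASK64


def iror_c(dst: int, imm32: int) -> int:
    count = imm32 & 63  # rotation count is imm32 masked to 6 bits
    if count == 0:
        raise ValueError("rotation count must not be 0")
    return ((dst >> count) | (dst << (64 - count))) & MASK64


def iadd_c(dst: int, imm32: int) -> int:
    return (dst + sign_extend_imm32(imm32)) & MASK64


def ixor_c(dst: int, imm32: int) -> int:
    return dst ^ sign_extend_imm32(imm32)


def imulh_r(dst: int, src: int) -> int:
    return (dst * src) >> 64  # upper 64 bits of the unsigned product


def ismulh_r(dst: int, src: int) -> int:
    # Upper 64 bits of the signed product, stored as unsigned.
    return ((to_signed64(dst) * to_signed64(src)) >> 64) & MASK64


print(hex(iror_c(1, 1)), hex(imulh_r(MASK64, MASK64)))
```

Python integers have arbitrary precision, so every result is masked to 64 bits explicitly to mimic register wrap-around.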
|
||||||
|
|
||||||
|
### 6.2 The reference CPU
|
||||||
|
|
||||||
|
Unlike a standard RandomX program, a SuperscalarHash program is generated using a strict set of rules to achieve the maximum performance on a superscalar CPU. For this purpose, the generator runs a simulation of a reference CPU.
|
||||||
|
|
||||||
|
The reference CPU is loosely based on the [Intel Ivy Bridge microarchitecture](https://en.wikipedia.org/wiki/Ivy_Bridge_(microarchitecture)). It has the following properties:
|
||||||
|
|
||||||
|
* The CPU has 3 integer execution ports P0, P1 and P5 that can execute instructions in parallel. Multiplication can run only on port P1.
* Each of the Superscalar instructions listed in Table 6.1.1 consists of one or more *Macro-ops*. Each Macro-op has a certain execution latency (in cycles) and size (in bytes) as shown in Table 6.2.1.
* Each of the Macro-ops listed in Table 6.2.1 consists of 0-2 *Micro-ops* that can go to a subset of the 3 execution ports. If a Macro-op consists of 2 Micro-ops, both must be executed together.
* The CPU can decode at most 16 bytes of code per cycle and at most 4 Micro-ops per cycle.
|
||||||
|
|
||||||
|
*Table 6.2.1 - Macro-ops*
|
||||||
|
|Macro-op|latency (cycles)|size (bytes)|1st Micro-op|2nd Micro-op|
|-|-|-|-|-|
|`sub_rr`|1|3|P015|-|
|`xor_rr`|1|3|P015|-|
|`lea_sib`|1|4|P01|-|
|`imul_rr`|3|4|P1|-|
|`ror_ri`|1|4|P05|-|
|`add_ri`|1|7, 8, 9|P015|-|
|`xor_ri`|1|7, 8, 9|P015|-|
|`mov_rr`|0|3|-|-|
|`mul_r`|4|3|P1|P5|
|`imul_r`|4|3|P1|P5|
|`mov_ri`|1|10|P015|-|
|
||||||
|
|
||||||
|
* P015 - Micro-op can be executed on any port
* P01 - Micro-op can be executed on ports P0 or P1
* P05 - Micro-op can be executed on ports P0 or P5
* P1 - Micro-op can be executed only on port P1
* P5 - Micro-op can be executed only on port P5
|
||||||
|
|
||||||
|
Macro-ops `add_ri` and `xor_ri` can be optionally padded to a size of 8 or 9 bytes for code alignment purposes. `mov_rr` has 0 execution latency and doesn't use an execution port, but still occupies space during the decoding stage (see chapter 6.3.1).
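
Table 6.2.1 maps naturally onto a small constant table with ports encoded as bit masks. The following is a minimal sketch; the type and variable names (`Port`, `MacroOp`, `kMacroOps`) are hypothetical and only the values come from the table. For `add_ri` and `xor_ri` the unpadded size of 7 bytes is stored.

```cpp
#include <cstdint>
#include <cstring>

// One bit per execution port; combined masks mirror the table legend.
enum Port : uint8_t {
    None = 0, P0 = 1, P1 = 2, P5 = 4,
    P01 = P0 | P1, P05 = P0 | P5, P015 = P0 | P1 | P5
};

struct MacroOp {
    const char* name;
    int latency;  // cycles
    int size;     // bytes (unpadded)
    Port uop1;    // ports accepted by the 1st Micro-op
    Port uop2;    // ports accepted by the 2nd Micro-op
};

constexpr MacroOp kMacroOps[] = {
    { "sub_rr",  1, 3,  P015, None },
    { "xor_rr",  1, 3,  P015, None },
    { "lea_sib", 1, 4,  P01,  None },
    { "imul_rr", 3, 4,  P1,   None },
    { "ror_ri",  1, 4,  P05,  None },
    { "add_ri",  1, 7,  P015, None },
    { "xor_ri",  1, 7,  P015, None },
    { "mov_rr",  0, 3,  None, None },
    { "mul_r",   4, 3,  P1,   P5   },
    { "imul_r",  4, 3,  P1,   P5   },
    { "mov_ri",  1, 10, P015, None },
};

// Number of Micro-ops (0-2) that a Macro-op expands to.
constexpr int uopCount(const MacroOp& op) {
    return (op.uop1 != None) + (op.uop2 != None);
}

// Linear lookup by name; returns nullptr if the name is unknown.
inline const MacroOp* findMacroOp(const char* name) {
    for (const MacroOp& op : kMacroOps)
        if (std::strcmp(op.name, name) == 0) return &op;
    return nullptr;
}
```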
|
||||||
|
|
||||||
|
### 6.3 CPU simulation
|
||||||
|
|
||||||
|
SuperscalarHash programs are generated to maximize the usage of all 3 execution ports of the reference CPU. The generation consists of 4 stages:
|
||||||
|
|
||||||
|
* Decoding stage
* Instruction selection
* Port assignment
* Operand assignment
|
||||||
|
|
||||||
|
Program generation is complete when one of two conditions is met:
|
||||||
|
|
||||||
|
1. An instruction is scheduled for execution on a cycle that is equal to or greater than `RANDOMX_SUPERSCALAR_LATENCY`.
2. The number of generated instructions reaches `RANDOMX_SUPERSCALAR_MAX_SIZE`.
|
||||||
|
|
||||||
|
#### 6.3.1 Decoding stage
|
||||||
|
|
||||||
|
The generator produces instructions in groups of 3 or 4 Macro-op slots such that the size of each group is exactly 16 bytes.
|
||||||
|
|
||||||
|
*Table 6.3.1 - Decoder configurations*
|
||||||
|
|
||||||
|
|decoder group|configuration|
|-------------|-------------|
|0|4-8-4|
|1|7-3-3-3|
|2|3-7-3-3|
|3|4-9-3|
|4|4-4-4-4|
|5|3-3-10|
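
Each configuration in Table 6.3.1 lists slot sizes in bytes, and every group must fill the 16-byte decode window exactly. A short sketch that encodes the table and verifies this property (the names `kDecodeGroups`, `groupSize` and `allGroupsFillDecodeWindow` are hypothetical):

```cpp
#include <array>
#include <numeric>
#include <vector>

// Slot sizes (bytes) of decoder groups 0-5, copied from Table 6.3.1.
const std::array<std::vector<int>, 6> kDecodeGroups = {{
    {4, 8, 4},
    {7, 3, 3, 3},
    {3, 7, 3, 3},
    {4, 9, 3},
    {4, 4, 4, 4},
    {3, 3, 10},
}};

// Total size of one decoder group in bytes.
inline int groupSize(const std::vector<int>& slots) {
    return std::accumulate(slots.begin(), slots.end(), 0);
}

// True if every group is exactly 16 bytes long.
inline bool allGroupsFillDecodeWindow() {
    for (const auto& group : kDecodeGroups)
        if (groupSize(group) != 16) return false;
    return true;
}
```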
|
||||||
|
|
||||||
|
The decoder group is selected according to the following rules, checked in order:

* If the currently processed instruction is IMULH_R or ISMULH_R, the next decode group is group 5 (the only group that starts with a 3-byte slot and has only 3 slots).
* If the total number of multiplications that have been generated is less than or equal to the current decoding cycle, the next decode group is group 4.
* If the currently processed instruction is IMUL_RCP, the next decode group is group 0 or 3 (must begin with a 4-byte slot for multiplication).
* Otherwise a random decode group is selected from groups 0-3.
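
The selection rules can be sketched as a chain of checks. This is not the reference implementation: the function name, the `Instr` enum and the way the random byte is mapped to a group (lowest bit for the 0-or-3 choice, lowest two bits for groups 0-3) are assumptions made for illustration.

```cpp
#include <cstdint>

enum class Instr {
    ISUB_R, IXOR_R, IADD_RS, IMUL_R, IROR_C,
    IADD_C, IXOR_C, IMULH_R, ISMULH_R, IMUL_RCP
};

// Returns the index (0-5) of the next decoder group from Table 6.3.1.
// `randomByte` stands in for one byte drawn from the generator.
inline int selectDecodeGroup(Instr current, int mulCount, int decodeCycle,
                             uint8_t randomByte) {
    // IMULH_R / ISMULH_R need the 3-3-10 group.
    if (current == Instr::IMULH_R || current == Instr::ISMULH_R) return 5;
    // Too few multiplications so far: force the 4-4-4-4 group.
    if (mulCount <= decodeCycle) return 4;
    // IMUL_RCP needs a group that begins with a 4-byte slot (0 or 3).
    if (current == Instr::IMUL_RCP) return (randomByte & 1) ? 0 : 3;
    // Otherwise pick one of groups 0-3 at random.
    return randomByte & 3;
}
```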
|
||||||
|
|
||||||
|
#### 6.3.2 Instruction selection
|
||||||
|
|
||||||
|
Instructions are selected based on the size of the current decode group slot - see Table 6.3.2.
|
||||||
|
|
||||||
|
*Table 6.3.2 - Instruction selection*
|
||||||
|
|
||||||
|
|slot size|note|instructions|
|-------------|-------------|-----|
|3|-|ISUB_R, IXOR_R|
|3|last slot in the group|ISUB_R, IXOR_R, IMULH_R, ISMULH_R|
|4|decode group 4, not the last slot|IMUL_R|
|4|-|IROR_C, IADD_RS|
|7,8,9|-|IADD_C, IXOR_C|
|10|-|IMUL_RCP|
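
Table 6.3.2 can be read as a lookup from slot properties to a set of candidate instructions. A minimal sketch of that lookup follows; the function name `candidates` and the use of instruction names as strings are illustrative choices, and how one candidate is then picked at random is left out.

```cpp
#include <string>
#include <vector>

// Returns the instructions that may fill a slot of `slotSize` bytes.
// `decodeGroup` is the group index from Table 6.3.1 and `lastSlot`
// tells whether the slot is the last one in its group.
inline std::vector<std::string> candidates(int slotSize, int decodeGroup,
                                           bool lastSlot) {
    switch (slotSize) {
    case 3:
        if (lastSlot) return {"ISUB_R", "IXOR_R", "IMULH_R", "ISMULH_R"};
        return {"ISUB_R", "IXOR_R"};
    case 4:
        if (decodeGroup == 4 && !lastSlot) return {"IMUL_R"};
        return {"IROR_C", "IADD_RS"};
    case 7:
    case 8:
    case 9:
        return {"IADD_C", "IXOR_C"};
    case 10:
        return {"IMUL_RCP"};
    default:
        return {};  // no instruction fits a slot of this size
    }
}
```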
|
||||||
|
|
||||||
|
#### 6.3.3 Port assignment
|
||||||
|
|
||||||
|
Micro-ops are issued to execution ports as soon as a suitable port is free. The scheduling is done optimistically by checking port availability in the order P5 -> P0 -> P1, so that port P1 (multiplication) is not overloaded with instructions that can go to any port. The cycle when all Micro-ops of an instruction can be executed is called the 'scheduleCycle'.
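
The port preference order can be illustrated with a small occupancy table. The sketch below handles a single Micro-op at a time and uses hypothetical names (`PortSchedule`, `issue`, `Slot`); scheduling both Micro-ops of a 2-Micro-op Macro-op in the same cycle is not modelled. It assumes a non-zero port mask and a non-negative starting cycle.

```cpp
#include <cstdint>
#include <vector>

enum Port : uint8_t { P0 = 1, P1 = 2, P5 = 4 };

struct Slot {
    int cycle;  // cycle in which the Micro-op executes
    Port port;  // port it was assigned to
};

class PortSchedule {
public:
    // Finds the earliest cycle >= `cycle` with a free port from `allowed`,
    // trying P5 first, then P0, then P1, and marks that port as busy.
    Slot issue(uint8_t allowed, int cycle) {
        static constexpr Port order[] = { P5, P0, P1 };
        for (;; ++cycle) {
            if (cycle >= static_cast<int>(busy_.size()))
                busy_.resize(cycle + 1, 0);
            for (Port p : order) {
                if ((allowed & p) && !(busy_[cycle] & p)) {
                    busy_[cycle] |= p;
                    return { cycle, p };
                }
            }
        }
    }

private:
    std::vector<uint8_t> busy_;  // bit mask of occupied ports per cycle
};
```

With this order, two P015 Micro-ops issued in the same cycle land on P5 and P0, which leaves P1 free for a multiplication.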
|
||||||
|
|
||||||
|
#### 6.3.4 Operand assignment
|
||||||
|
|
||||||
|
The source operand (if needed) is selected first. It is selected from the group of registers that are available at the 'scheduleCycle' of the instruction. A register is available if the latency of its last operation has elapsed.
|
||||||
|
|
||||||
|
The destination operand is selected with stricter rules (see column 'rules' in Table 6.1.1):
|
||||||
|
|
||||||
|
* value must be ready at the required cycle
* cannot be the same as the source register unless the instruction allows it (see column 'rules' in Table 6.1.1)
    * this avoids optimizable operations such as `reg ^ reg` or `reg - reg`
    * it also increases intermixing of register values
* register cannot be multiplied twice in a row unless `allowChainedMul` is true
    * this avoids accumulation of trailing zeroes in registers due to excessive multiplication
    * `allowChainedMul` is set to true if an attempt to find source/destination registers failed (this is quite rare, but prevents a catastrophic failure of the generator)
* either the last instruction applied to the register or its source must be different than the current instruction
    * this avoids optimizable instruction sequences such as `r1 = r1 ^ r2; r1 = r1 ^ r2` (can be eliminated) or `reg = reg >>> C1; reg = reg >>> C2` (can be reduced to one rotation) or `reg = reg + C1; reg = reg + C2` (can be reduced to one addition)
* register `r5` cannot be the destination of the IADD_RS instruction (limitation of the x86 lea instruction)
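
The destination rules above can be combined into a single eligibility predicate. The sketch below is a simplified reading of the list, not the reference implementation: all type and field names are hypothetical, a source of -1 stands for an immediate operand, and the chained-multiplication rule is assumed to apply to IMUL_R only, since the text does not say which multiplications count.

```cpp
#include <cstdint>

enum class Instr {
    ISUB_R, IXOR_R, IADD_RS, IMUL_R, IROR_C,
    IADD_C, IXOR_C, IMULH_R, ISMULH_R, IMUL_RCP
};

struct RegisterState {
    int readyCycle = 0;            // cycle at which the value is available
    bool hasLastOp = false;        // false until the register is first written
    Instr lastOp = Instr::ISUB_R;  // last instruction applied to the register
    int lastSrc = -1;              // its source register, -1 for an immediate
};

struct Candidate {
    Instr op;             // instruction being placed
    int src;              // source register, -1 for an immediate
    bool srcMayEqualDst;  // per column 'rules' in Table 6.1.1
};

// True if register `dst` may be the destination of `c` at `scheduleCycle`.
inline bool canBeDestination(int dst, const RegisterState& reg,
                             const Candidate& c, int scheduleCycle,
                             bool allowChainedMul) {
    // Value must be ready at the required cycle.
    if (reg.readyCycle > scheduleCycle) return false;
    // Must differ from the source unless the instruction allows it.
    if (!c.srcMayEqualDst && dst == c.src) return false;
    // No two multiplications in a row (assumption: IMUL_R only).
    if (!allowChainedMul && c.op == Instr::IMUL_R &&
        reg.hasLastOp && reg.lastOp == Instr::IMUL_R) return false;
    // Last instruction or its source must differ from the current one.
    if (reg.hasLastOp && reg.lastOp == c.op && reg.lastSrc == c.src) return false;
    // r5 cannot be the destination of IADD_RS.
    if (c.op == Instr::IADD_RS && dst == 5) return false;
    return true;
}
```

A generator built on this predicate would collect all registers for which it returns true and pick one at random; if none qualifies, it would retry with `allowChainedMul` set to true, as the list describes.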
|
||||||
|
|
||||||
|
## 7. Dataset
|
||||||
|
|
||||||
The Dataset is a read-only memory structure that is used during program execution (chapter 4.6.2, steps 6 and 7). The size of the Dataset is `RANDOMX_DATASET_BASE_SIZE + RANDOMX_DATASET_EXTRA_SIZE` bytes and it's divided into 64-byte 'items'.
|
The Dataset is a read-only memory structure that is used during program execution (chapter 4.6.2, steps 6 and 7). The size of the Dataset is `RANDOMX_DATASET_BASE_SIZE + RANDOMX_DATASET_EXTRA_SIZE` bytes and it's divided into 64-byte 'items'.
|
||||||
|
|
||||||
In order to allow PoW verification with a lower amount of memory, the Dataset is constructed in two steps using an intermediate structure called the "Cache", which can be used to calculate Dataset items on the fly.
|
In order to allow PoW verification with a lower amount of memory, the Dataset is constructed in two steps using an intermediate structure called the "Cache", which can be used to calculate Dataset items on the fly.
|
||||||
|
|
||||||
The whole Dataset is constructed from the key value `K`, which is an input parameter of RandomX. The whole Dataset needs to be recalculated everytime the key value changes. Fig. 6.1 shows the process of Dataset construction.
|
The whole Dataset is constructed from the key value `K`, which is an input parameter of RandomX. The whole Dataset needs to be recalculated everytime the key value changes. Fig. 7.1 shows the process of Dataset construction.
|
||||||
|
|
||||||
*Figure 6.1 - Dataset construction*
|
*Figure 7.1 - Dataset construction*
|
||||||
|
|
||||||
![Imgur](https://i.imgur.com/86h5SbW.png)
|
![Imgur](https://i.imgur.com/86h5SbW.png)
|
||||||
|
|
||||||
### 6.2 Cache construction
|
### 7.1 Cache construction
|
||||||
|
|
||||||
The key `K` is expanded into the Cache using the "memory fill" function of Argon2d with parameters according to Table 6.2.1. The key is used as the "password" field.
|
The key `K` is expanded into the Cache using the "memory fill" function of Argon2d with parameters according to Table 7.1.1. The key is used as the "password" field.
|
||||||
|
|
||||||
*Table 6.2.1 - Argon2 parameters*
|
*Table 7.1.1 - Argon2 parameters*
|
||||||
|
|
||||||
|parameter|value|
|
|parameter|value|
|
||||||
|------------|--|
|
|------------|--|
|
||||||
@ -686,12 +852,12 @@ The key `K` is expanded into the Cache using the "memory fill" function of Argon
|
|||||||
|
|
||||||
The finalizer and output calculation steps of Argon2 are omitted. The output is the filled memory array.
|
The finalizer and output calculation steps of Argon2 are omitted. The output is the filled memory array.
|
||||||
|
|
||||||
### 6.3 SuperscalarHash initialization
|
### 7.2 SuperscalarHash initialization
|
||||||
|
|
||||||
The key value `K` is used to initialize a BlakeGenerator (see chapter 3.4), which is then used to generate 8 SuperscalarHash instances for Dataset initialization.
|
The key value `K` is used to initialize a BlakeGenerator (see chapter 3.4), which is then used to generate 8 SuperscalarHash instances for Dataset initialization.
|
||||||
|
|
||||||
### 6.4 Dataset block generation
|
### 7.3 Dataset block generation
|
||||||
Dataset items are numbered sequentially with `itemNumber` starting from 0. Each 64-byte Dataset item is generated independently using 8 SuperscalarHash functions (generated according to chapter 6.3) and by XORing randomly selected data from the Cache (constructed according to chapter 6.2).
|
Dataset items are numbered sequentially with `itemNumber` starting from 0. Each 64-byte Dataset item is generated independently using 8 SuperscalarHash functions (generated according to chapter 7.2) and by XORing randomly selected data from the Cache (constructed according to chapter 7.1).
|
||||||
|
|
||||||
The item data is represented by 8 64-bit integer registers: `r0`-`r7`.
|
The item data is represented by 8 64-bit integer registers: `r0`-`r7`.
|
||||||
|
|
||||||
|
@ -26,7 +26,8 @@ const uint8_t seed[32] = { 191, 182, 222, 175, 249, 89, 134, 104, 241, 68, 191,
|
|||||||
|
|
||||||
int main() {
|
int main() {
|
||||||
|
|
||||||
constexpr int count = 100000;
|
constexpr int count = 1000000;
|
||||||
|
int isnCounts[randomx::SuperscalarInstructionType::COUNT] = { 0 };
|
||||||
int64_t asicLatency = 0;
|
int64_t asicLatency = 0;
|
||||||
int64_t codesize = 0;
|
int64_t codesize = 0;
|
||||||
int64_t cpuLatency = 0;
|
int64_t cpuLatency = 0;
|
||||||
@ -44,6 +45,10 @@ int main() {
|
|||||||
mulCount += prog.mulCount;
|
mulCount += prog.mulCount;
|
||||||
size += prog.getSize();
|
size += prog.getSize();
|
||||||
|
|
||||||
|
for (unsigned j = 0; j < prog.getSize(); ++j) {
|
||||||
|
isnCounts[prog(j).opcode]++;
|
||||||
|
}
|
||||||
|
|
||||||
if ((i + 1) % (count / 100) == 0) {
|
if ((i + 1) % (count / 100) == 0) {
|
||||||
std::cout << "Completed " << ((i + 1) / (count / 100)) << "% ..." << std::endl;
|
std::cout << "Completed " << ((i + 1) / (count / 100)) << "% ..." << std::endl;
|
||||||
}
|
}
|
||||||
@ -57,5 +62,10 @@ int main() {
|
|||||||
std::cout << "Avg. mul. count: " << (mulCount / (double)count) << std::endl;
|
std::cout << "Avg. mul. count: " << (mulCount / (double)count) << std::endl;
|
||||||
std::cout << "Avg. RandomX ops: " << (size / (double)count) << std::endl;
|
std::cout << "Avg. RandomX ops: " << (size / (double)count) << std::endl;
|
||||||
|
|
||||||
|
std::cout << "Frequencies: " << std::endl;
|
||||||
|
for (unsigned j = 0; j < randomx::SuperscalarInstructionType::COUNT; ++j) {
|
||||||
|
std::cout << j << " " << isnCounts[j] << " " << isnCounts[j] / (double)size << std::endl;
|
||||||
|
}
|
||||||
|
|
||||||
return 0;
|
return 0;
|
||||||
}
|
}
|