mirror of https://git.wownero.com/wownero/RandomWOW.git synced 2024-12-21 23:38:54 +00:00

tevador 3639a5e08d Updated specs: cache, FP rounding

2018-11-02 17:39:28 +01:00

11 KiB

Raw Blame History

RandomX

RandomX ("random ex") is an experimental proof of work (PoW) algorithm that uses random code execution to achieve ASIC resistance.

RandomX uses a simple low-level language (instruction set) to describe a variety of random programs. The instruction set was designed specifically for this proof of work algorithm, because existing languages and instruction sets are designed for a different goal (actual software development) and thus usually have a complex syntax and unnecessary flexibility.

Virtual machine

RandomX is intended to be run efficiently and easily on a general-purpose CPU. The virtual machine (VM) which runs RandomX code attempts to simulate a generic CPU using the following set of components:

DRAM

The VM has access to 4 GiB of external memory in read-only mode. The DRAM memory blob is static within a single PoW epoch. The exact algorithm to generate the DRAM blob and its update schedule is to be determined.

MMU

The memory management unit (MMU) interfaces the CPU with the DRAM blob. The purpose of the MMU is to translate the random memory accesses generated by the random program into a DRAM-friendly access pattern, where memory reads are not bound by access latency. The MMU splits the 4 GiB DRAM blob into 64-byte blocks (corresponding to the L1 cache line size of a typical CPU). Data within one block is always read sequentially in eight reads (8x8 bytes). Blocks are read mostly sequentially apart from occasional random jumps that happen on average every 256 blocks. The address of the next block to be read is determined 1 block ahead of time to enable efficient prefetching. The MMU uses three internal registers:

m0 - Address of the next quadword to be read from memory (32-bit, 8-byte aligned).
m1 - Address of the next block to be read from memory (32-bit, 64-byte aligned).
mx - Random 64-bit counter that determines if reading continues sequentially or jumps to a random block. When an address addr is passed to the MMU, it performs mx ^= addr and checks if the last 8 bits of mx are zero. If yes, the adjacent 32 bits are copied to register m1 and 64-byte aligned.

Cache

The VM contains 256 KiB of cache. The cache is split into two segments of 16 KiB and 240 KiB. The cache is randomly accessed for both reading and writing. 75% of accesses are into the first 16 KiB.

Program

The actual program is stored in a 8 KiB ring buffer structure. Each program consists of 1024 random 64-bit instructions. The ring buffer structure makes sure that the program forms a closed infinite loop.

Control unit

The control unit (CU) controls the execution of the program. It reads instructions from the program buffer and sends commands to the other units. The CU contains 3 internal registers:

pc - Address of the next instruction in the program buffer to be executed (64-bit, 8 byte aligned).
sp - Address of the top of the stack (64-bit, 8 byte aligned).
ic - Instruction counter = the number of instructions to execute before terminating. Initial value is 65536 and the register is decremented after each executed instruction.

Stack

To simulate function calls, the VM uses a stack structure. The program interacts with the stack using the CALL, CALLR and RET instructions. The stack has unlimited size and each stack element is 64 bits wide.

Register file

The VM has 8 integer registers r0-r7 and 8 floating point registers f0-f7. All registers are 64 bits wide.

ALU

The arithmetic logic unit (ALU) performs integer operations. The ALU can perform binary integer operations from 11 groups (ADD, SUB, MUL, DIV, AND, OR, XOR, SHL, SHR, ROL, ROR) with various operand sizes.

FPU

The floating-point unit performs IEEE-754 compliant math using 64-bit double precision floating point numbers. There are 4 binary operations (ADD, SUB, MUL, DIV) and one unary operation (SQRT).

Instruction set

The 64-bit instruction is encoded as follows:

Opcode (8 bits)

There are 256 opcodes, which are distributed between various operations depending on their weight (how often they will occur in the program on average). The distribution of opcodes is following:

operation	number of opcodes
ALU operations	TBD	TBD
FPU operations	TBD	TBD
branching	32	12.5%

Parameters a, b, c (8 bits)

a and b encode the instruction operands and c is the destination. All have the same encoding:

Register number is encoded in the top 3 bits. ALU instructions use registers r0-r7, while FPU instructions use registers f0-f7. Addresses are always loaded from registers r0-r7. The bottom 3 bits determine where the operand is loaded from/result saved to:

location	a	b	c
000	register	register	register
001	register	register	register
010	register	register	register
011	cache	register	cache
100	cache	register	cache
101	DRAM	register	cache
110	DRAM	imm1	cache
111	DRAM	imm1	cache

register - Direct register read/write.
cache - The value of the register is used as an address to read from/write to the cache. The bottom 3 bits of the address are cleared and the address is truncated to the following length depending on the cache bits:

cache	address length
00	18 bits (whole 256 KiB)
01	14 bits (first 16 KiB)
10	14 bits (first 16 KiB)
11	14 bits (first 16 KiB)

DRAM - The value of the register is used as an address to pass to the MMU.
imm1 - 32-bit immediate value encoded within the instruction. For ALU instructions that use operands shorter than 32 bits, the value is truncated. For operands larger than 32 bits, the value is zero-extended for unsigned instructions and sign-extended fot signed instructions. For FPU instructions, the value is treated as a signed 32-bit integer, first converted to a single precision floating point format and then to a double precision format.

imm0 (8 bits)

An 8-bit immediate value that is used by the CALL instruction as jump offset.

ALU instructions

All ALU instructions take 2 operands A and B and produce result C.

opcodes	instruction	signed	A width	B width	C	C width
TBD	ADD_64	no	64	64	A + B	64
TBD	ADD_32	no	32	32	A + B	32
TBD	ADD_16	no	16	16	A + B	16
TBD	SUB_64	no	64	64	A - B	64
TBD	SUB_32	no	32	32	A - B	32
TBD	SUB_16	no	16	16	A - B	16
TBD	MUL_64	no	64	64	A * B	64
TBD	MUL_32	no	32	32	A * B	64
TBD	MUL_16	no	16	16	A * B	32
TBD	IMUL_32	yes	32	32	A * B	64
TBD	IMUL_16	yes	16	16	A * B	32
TBD	DIV_64	no	64	32	A / B, A % B	64
TBD	IDIV_64	yes	64	32	A / B, A % B	64
TBD	DIV_32	no	32	16	A / B, A % B	32
TBD	IDIV_32	yes	32	16	A / B, A % B	32
TBD	AND_64	no	64	64	A & B	64
TBD	AND_32	no	32	32	A & B	32
TBD	AND_16	no	16	16	A & B	16
TBD	OR_64	no	64	64	A \| B	64
TBD	OR_32	no	32	32	A \| B	32
TBD	OR_16	no	16	16	A \| B	16
TBD	XOR_64	no	64	64	A ^ B	64
TBD	XOR_32	no	32	32	A ^ B	32
TBD	XOR_16	no	16	16	A ^ B	16
TBD	SHL_64	no	64	6	A << B	64
TBD	SHR_64	no	64	6	A >> B	64
TBD	SAR_64	yes	64	6	A >> B	64
TBD	ROL_64	no	64	6	A <<< B	64
TBD	ROR_64	no	64	6	A >>> B	64

Division

For the division instructions, the divisor is half length of the dividend. The result C consists of both the quotient and the remainder (remainder is put the upper bits). The result of division by zero is equal to the dividend.

Result write-back

If C is shorter than 64 bits, it is zero-extended before the result is written back. If the destination is a register, the value is first encrypted with a single AES round (TBD).

FPU instructions

opcodes	instruction	C
TBD	FADD	A + B
TBD	FSUB	A - B
TBD	FMUL	A * B
TBD	FDIV	A / B
TBD	FSQRT	sqrt(A)
TBD	FROUND	-

FPU instructions conform to the IEEE-754 specification. Initial rounding mode is RN (Round to Nearest). Denormal values are treated as zero (this corresponds to setting the FTZ flag in x86 SSE and ARM Neon engines).

Operands loaded from memory are treated as signed 64-bit integers and converted to double precision floating point format. Operands loaded from floating point registers are used directly.

FSQRT

The sign bit of the FSQRT operand is always cleared first, so only non-negative values are evaluated.

FROUND

The FROUND instruction changes the rounding mode for all subsequent FPU operations depending on the two right-most bits of A:

A[1:0]	rounding mode
00	Round to Nearest (RN) mode
01	Round towards Plus Infinity (RP) mode
10	Round towards Minus Infinity (RM) mode
11	Round towards Zero (RZ) mode

Branch instructions

The CU supports 3 branch instructions:

opcodes	instruction	function
TBD	CALL	conditional near procedure call with static offset
TBD	CALLR	conditional near procedure call with register offset
TBD	RET	conditional return from procedure

All three instructions are conditional. Branching pattern is determined by the value of imm1 (exact mechanism TBD). In case the branch is not taken, all three instructions set C = A ("arithmetic no-op").

CALL and CALLR

When the branch is taken, both CALL and CALLR instructions push the values A and pc (program counter) onto the stack and then perform a forward jump relative to the value of pc. The forward offset is equal to 8 * (imm0 + 1) for the CALL instruction and 8 * ((C & 0xFF) + 1) for the CALLR instruction. Maximum jump distance is therefore 256 instructions forward (this means that at least 4 correctly spaced CALL/CALLR instructions are needed to form a loop in the program).

RET

When the branch is taken, the RET instruction pops the return address raddr from the stack (it's the instructions following the corresponding CALL or CALLR), then pops a return value retval from the stack and sets C = retval. Finally, the instruction jumps back to raddr.

Program generation

The program is initialized from a 256-bit seed value using a suitable PRNG. The program is generated in this order:

All 1024 instructions are generated as a list of random 64-bit integers.
Initial values of all integer registers r0-r7 are generated as random 64-bit integers.
Initial values of all floating point registers f0-f7 are generated as random 64-bit signed integers converted to a double precision floating point format.
The initial value of the m0 register is generated as a random 32-bit value with the last 6 bits cleared (64-byte aligned).
The 256 KiB cache is initialized using AES encryption (TBD).
The remaining registers are initialized as pc = 0, sp = 0, ic = 65536, m1 = m0 + 64, mx = 0.

Result

When the program terminates (the value of ic register reaches 0), the register file and the stack are hashed using the Blake2b hash function to get the final PoW value. The generation/execution can be chained multiple times to discourage mining strategies that search for programs with particular properties.

11 KiB Raw Blame History