New instruction encoding

2024-12-22 15:58:53 +00:00 · 2018-11-10 22:25:51 +01:00
commit 2ea440d0f5
parent 3b2cb9b8c7
2 changed files with 122 additions and 156 deletions
--- a/README.md
+++ b/README.md
@@ -10,7 +10,7 @@ RandomX uses a simple low-level language (instruction set), which was designed s
 ## Virtual machine
 RandomX is intended to be run efficiently and easily on a general-purpose CPU. The virtual machine (VM) which runs RandomX code attempts to simulate a generic CPU using the following set of components:
-![Imgur](https://i.imgur.com/Of1tGPm.png)
+![Imgur](https://i.imgur.com/dRU8jiu.png)
 #### DRAM
 The VM has access to 4 GiB of external memory in read-only mode. The DRAM memory blob is generated from the hash of the previous block using AES encryption (TBD). The contents of the DRAM blob change on average every 2 minutes. The DRAM blob is read with a maximum rate of 2.5 GiB/s per thread.
@@ -18,7 +18,7 @@ The VM has access to 4 GiB of external memory in read-only mode. The DRAM memory
 *The DRAM blob can be generated in 0.1-0.3 seconds using 8 threads with hardware-accelerated AES and dual channel DDR3 or DDR4 memory. Dual channel DDR4 memory has enough bandwidth to support up to 16 mining threads.*
 #### MMU
-The memory management unit (MMU) interfaces the CPU with the DRAM blob. The purpose of the MMU is to translate the random memory accesses generated by the random program into a DRAM-friendly access pattern, where memory reads are not bound by access latency. The MMU accepts a 32-bit address `addr` and outputs a 64-bit value from DRAM. The MMU splits the 4 GiB DRAM blob into 256-byte blocks. Data within one block is always read sequentially in 32 reads (32×8 bytes). When a block has been consumed, reading jumps to a random block. The address of the next block is calculated 8 reads before the current block is exhausted to enable efficient prefetching. The MMU uses three internal registers:
+The memory management unit (MMU) interfaces the CPU with the DRAM blob. The purpose of the MMU is to translate the random memory accesses generated by the random program into a DRAM-friendly access pattern, where memory reads are not bound by access latency. The MMU accepts a 32-bit address `addr` and outputs a 64-bit value from DRAM. The MMU splits the 4 GiB DRAM blob into 256-byte blocks. Data within one block is always read sequentially in 32 reads (32×8 bytes). When the current block has been consumed, reading jumps to a random block. The address of the next block is calculated 8 reads before the current block is exhausted to enable efficient prefetching. The MMU uses three internal registers:
 * **m0** - Address of the next quadword to be read from memory (32-bit, 8-byte aligned).
 * **m1** - Address of the next block to be read from memory (32-bit, 256-byte aligned).
 * **mx** - Random 32-bit counter that determines the address of the next block. After each read, the read address is mixed with the counter: `mx ^= addr`. When the 24th quadword of the current block is read (the value of the `m0` register ends with `0xC0`), the value of the `mx` register is copied into register `m1` and the last 8 bits of `m1` are cleared.
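
A minimal sketch of the MMU bookkeeping described in this hunk, not code from the repository. The `MMU` class, the `dram_read` callback and the choice to update the registers after each read completes are assumptions made for illustration; the README text does not pin down that ordering.

```python
MASK32 = 0xFFFFFFFF
BLOCK_MASK = 0xFFFFFF00  # clears the last 8 bits (256-byte alignment)


class MMU:
    """Hypothetical model of the m0/m1/mx registers from the README."""

    def __init__(self, m0, dram_read):
        self.m0 = m0 & BLOCK_MASK   # next quadword to read, starts block-aligned
        self.m1 = 0                 # next block to read
        self.mx = 0                 # counter mixed with every read address
        self.dram_read = dram_read  # assumed callback: 32-bit address -> 64-bit value

    def read(self, addr):
        value = self.dram_read(self.m0)
        self.mx = (self.mx ^ addr) & MASK32   # mx ^= addr after each read
        self.m0 = (self.m0 + 8) & MASK32      # sequential 8-byte reads within a block
        low = self.m0 & 0xFF
        if low == 0xC0:
            # 8 reads remain in the block: latch the next block address early
            self.m1 = self.mx & BLOCK_MASK
        elif low == 0:
            # all 32 quadwords consumed: jump to the latched block
            self.m0 = self.m1
        return value
```

Under these assumptions, 32 consecutive calls to `read` walk one 256-byte block in order, and the address of the following block is fixed after the 24th call, which leaves room for a prefetch.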
@@ -31,7 +31,7 @@ The VM contains a 256 KiB scratchpad, which is accessed randomly both for readin
 *The scratchpad access pattern mimics the usual CPU cache structure. The first 16 KiB should be covered by the L1 cache, while the remaining accesses should hit the L2 cache. In some cases, the read address can be calculated in advance (see below), which should limit the impact of L1 cache misses.*
 #### Program
-The actual program is stored in a 8 KiB ring buffer structure. Each program consists of 1024 random 64-bit instructions. The ring buffer structure makes sure that the program forms a closed infinite loop.
+The actual program is stored in a 8 KiB ring buffer structure. Each program consists of 512 random 128-bit instructions. The ring buffer structure makes sure that the program forms a closed infinite loop.
 *For high-performance mining, the program should be translated directly into machine code. The whole program should fit into the L1 instruction cache and hot execution paths should stay in the µOP cache that is used by newer x86 CPUs. This should limit the number of front-end stalls and keep the CPU busy most of the time.*
@@ -47,12 +47,12 @@ The control unit (CU) controls the execution of the program. It reads instructio
 To simulate function calls, the VM uses a stack structure. The program interacts with the stack using the CALL and RET instructions. The stack has unlimited size and each stack element is 64 bits wide.
 #### Register file
-The VM has 8 integer registers `r0`-`r7` (each 64 bits wide), 8 floating point registers `f0`-`f7` (each 64 bits wide) and 4 memory address registers `g0`-`g3` (each 32 bits wide).
+The VM has 8 integer registers `r0`-`r7` and 8 floating point registers `f0`-`f7`. All registers are 64 bits wide.
-*The number of registers is low enough so that they can be stored in actual hardware registers on most CPUs. The memory address registers `g0`-`g3` can be stored in a single 128-bit vector register (`xmm0`-`xmm15` registers for x86 and `Q0`-`Q15` in ARM) for efficient address generation (see below).*
+*The number of registers is low enough so that they can be stored in actual hardware registers on most CPUs.*
 #### ALU
-The arithmetic logic unit (ALU) performs integer operations. The ALU can perform binary integer operations from 11 groups (ADD, SUB, MUL, DIV, AND, OR, XOR, SHL, SHR, ROL, ROR) with various operand sizes of 64, 32 or 16 bits.
+The arithmetic logic unit (ALU) performs integer operations. The ALU can perform binary integer operations from 11 groups (ADD, SUB, MUL, DIV, AND, OR, XOR, SHL, SHR, ROL, ROR) with operand sizes of 64 or 32 bits.
 #### FPU
 The floating-point unit performs IEEE-754 compliant math using 64-bit double precision floating point numbers.
@@ -61,11 +61,11 @@ The floating-point unit performs IEEE-754 compliant math using 64-bit double pre
 The VM stores and loads all data in little-endian byte order.
 ## Instruction set
-The 64-bit instruction is encoded as follows:
+The 128-bit instruction is encoded as follows:
-![Imgur](https://i.imgur.com/FwYyKBB.png)
+![Imgur](https://i.imgur.com/thpvVHN.png)
-#### Opcode (8 bits)
+#### Opcode
 There are 256 opcodes, which are distributed between various operations depending on their weight (how often they will occur in the program on average). The distribution of opcodes is following (TBD):
 |operation|number of opcodes||
@@ -74,102 +74,72 @@ There are 256 opcodes, which are distributed between various operations dependin
 |FPU operations|66|25.8%|
 |Control flow |32|12.5%|
-#### Operand a (8 bits)
+#### Operand A
-`a` encodes the first operand, which is read from memory.
+The first operand is read from memory. The location is determined by the `loc(a)` flag:
-![Imgur](https://i.imgur.com/JNIadYc.png)
+|loc(a)[2:0]|read A from|address size (W)
 |---------|-|-|
 |000|DRAM|32 bits|
 |001|DRAM|32 bits|
 |010|DRAM|32 bits|
 |011|DRAM|32 bits|
 |100|scratchpad|15 bits|
 |101|scratchpad|11 bits|
 |110|scratchpad|11 bits|
 |111|scratchpad|11 bits|
-The `loc(a)` flag determines where the operand `A` is read from where the result `C` is saved to (see Result write-back below):
+Flag `reg(a)` encodes an integer register `r0`-`r7`.  The read address is calculated as:
 |loc(a)|read A from|read address|write C to|write address
 |---------|-|-|-|-|
 |000|DRAM|32 bits|scratchpad|18 bits|
 |001|DRAM|32 bits|scratchpad|14 bits|
 |010|DRAM|32 bits|register `x(b)`|-|
 |011|DRAM|32 bits|register `x(b)`|-|
 |100|scratchpad|18 bits|scratchpad|14 bits|
 |101|scratchpad|14 bits|scratchpad|14 bits|
 |110|scratchpad|14 bits|register `x(b)`|-|
 |111|scratchpad|14 bits|register `x(b)`|-|
 The `r(a)` flag encodes an integer register (`r0`-`r7`). The value of the register is first XORed with the value of the `g0` register. The read address `addr` is then equal to the bottom 32 bits of `r(a)`. Additionally, the value of the register and all memory address registers are rotated.
 The `addr` value is then truncated to the required length (32, 18 or 14 bits). For reading from and writing to the scratchpad, the address is 8-byte aligned by clearing the bottom 3 bits.
 If the `gen` flag is equal to `00`, this instruction performs the Address generation step (see below).
 Pseudocode:
 ```
-FUNCTION GET_ADDRESS
+reg(a) ^= addr0
-	r(a) ^= g0
+addr(a) = reg(a)[W-1:0]
 	addr = r(a)
 	r(a) <<<= 32
 	g0 = g1
 	g1 = g2
 	g2 = g3
 	g3 = g0
 	IF gen == 0b00 THEN GENERATE_ADDRESSES
 	return addr
 END FUNCTION
 ```
-*The rotation of registers `g0`-`g3` can be performed with a single `PSHUFD` x86 instruction.*
+For reading from the scratchpad, `addr(a)` is multiplied by 8 for 8-byte aligned access.
 #### Operand B
 The second operand is loaded either from a register or from an immediate value encoded within the instruction. The `reg(b)` flag encodes an integer register (ALU operations) or a floating point register (FPU operations).
-#### Operand b (8 bits)
+|loc(b)[2:0]|read B from|
 `b` encodes the second operand, which is either a register or immediate value.
 ![Imgur](https://i.imgur.com/ppEiUfh.png)
 |loc(b)|read B from|
 |---------|-|
-|000|register `x(b)`|
+|000|register `reg(b)`|
-|001|register `x(b)`|
+|001|register `reg(b)`|
-|010|register `x(b)`|
+|010|register `reg(b)`|
-|011|register `x(b)`|
+|011|register `reg(b)`|
-|100|register `x(b)`|
+|100|register `reg(b)`|
-|101|register `x(b)`|
+|101|register `reg(b)`|
-|110|`imm1`|
+|110|`imm0` or `imm1`|
-|111|`imm1`|
+|111|`imm0` or `imm1`|
-The `x(b)` flag encodes a register. For ALU operations, this is an integer register (`r0`-`r7`) and for FPU operations, it's a floating point register (`f0`-`f7`).
+`imm0` is an 8-bit immediate value, which is used for shift and rotate ALU operations.
-`imm1` is a 32-bit immediate value encoded within the instruction. For ALU instructions that use operands shorter than 32 bits, the value is truncated. For operands larger than 32 bits, the value is zero-extended for unsigned instructions and sign-extended for signed instructions. For FPU instructions, the value is treated as a signed 32-bit integer and converted to a double precision floating point format.
+`imm1` is a 32-bit immediate value which is used for most operations. For operands larger than 32 bits, the value is zero-extended for unsigned instructions and sign-extended for signed instructions. For FPU instructions, the value is treated as a signed 32-bit integer and converted to a double precision floating point format.
-#### imm0 (8 bits)
+#### Operand C
-An 8-bit immediate value that is used to calculate the jump offset of the CALL instruction.
+The third operand is the location where the result is stored.
-
+|loc\(c\)[2:0]|write C to|address size (W)
-#### Result writeback
+|---------|-|-|
-
+|000|scratchpad|15 bits|
-All instructions take the operands `A` and `B` and produce a result `C`. Firstly, if `C` is shorter than 64 bits, it is zero-extended to 64 bits. The value of `C` is then written back either to the register `x(b)` or to the scratchpad using the same address `addr` from operand a (see table above).
+|001|scratchpad|11 bits|
 |010|scratchpad|11 bits|
 |011|scratchpad|11 bits|
 |100|register `reg(c)`|-|
 |101|register `reg(c)`|-|
 |110|register `reg(c)`|-|
 |111|register `reg(c)`|-|
 The `reg(c)` flag encodes an integer register (ALU operations) or a floating point register (FPU operations).  For writing to the scratchpad, an integer register is always used and the write address is calculated as:
 ```
 addr(c) = (addr1 ^ reg(c))[W-1:0] * 8
 ```
 *CPUs are typically designed for a 2:1 load:store ratio, so each VM instruction performs on average 1 memory read and 0.5 write to memory.*
-#### Address generation
+#### imm0
 An 8-bit immediate value that is used as the shift/rotate count by some ALU instructions and as the jump offset of the CALL instruction.
-To ensure that the values of the memory address registers remain pseudorandom, the values of the registers are regenerated on average once in every 4 instructions.
+#### addr0
-
+A 32-bit address mask that is used to calculate the read address for the A operand.
 During address generation, the 4 registers `g0`-`g3` are combined into one 128-bit register `G` and the registers `r(a)` and `x(b)` are combined into a 128-bit register `K`. `G` is then encrypted with a single [AES](https://en.wikipedia.org/wiki/Advanced_Encryption_Standard) round using `K` as the round key.
 In pseudocode:
 ```
 PROCEDURE GENERATE_ADDRESSES
 	G[127:96] = g3
 	G[95:64] = g2
 	G[63:32] = g1
 	G[31:0] = g0
 	K[127:64] = r(a)
 	K[63:0] = x(b)
 	G = AES_ROUND(G, K)
 	g3 = G[127:96]
 	g2 = G[95:64]
 	g1 = G[63:32]
 	g0 = G[31:0]
 END PROCEDURE
 ```
 `AES_ROUND` consists of the ShiftRows, SubBytes and MixColumns steps followed by XOR with `K`.
 *For x86 CPUs, address generation requires 2-3 move instructions to construct the key and a single `AESENC` instruction for encryption. ARM requires two separate instructions `AESE` and `AESMC` (for MixColumns). The whole address generation can run in parallel with the currently executed instruction.*
 #### addr1
 A 32-bit address mask that is used to calculate the write address for the C operand. `addr1` is equal to `imm1`.
 ### ALU instructions
@@ -220,7 +190,7 @@ For the division instructions, the dividend is 64 bits long and the divisor 32 b
 *Division by zero can be handled without branching by conditional move (`IF B == 0 THEN B = 1`). Signed overflow happens only for the signed variant when the minimum negative value is divided by -1. In this extremely rare case, ARM produces the "correct" result, but x86 throws a hardware exception, which must be handled.*
 ##### Shift and rotate
-The shift/rotate instructions use just the bottom 6 bits of the `B` operand. All treat `A` as unsigned except SAR_64, which performs an arithmetic right shift by copying the sign bit.
+The shift/rotate instructions use just the bottom 6 bits of the `B` operand (`imm0` is used as the immediate value). All treat `A` as unsigned except SAR_64, which performs an arithmetic right shift by copying the sign bit.
 ### FPU instructions
@@ -267,30 +237,30 @@ The following 2 control flow instructions are supported:
 Both instructions are conditional in 75% of cases. The jump is taken only if `B <= imm1`. For the 25% of cases when `B` is equal to `imm1`, the jump is unconditional. In case the branch is not taken, both instructions become "arithmetic no-op" `C = A`.
 ##### CALL
-Taken CALL instruction pushes the values `A` and `pc` (program counter) onto the stack and then performs a forward jump relative to the value of `pc`. The forward offset is equal to `8 * (imm0 + 1)`. Maximum jump distance is therefore 256 instructions forward (this means that at least 4 correctly spaced CALL instructions are needed to form a loop in the program).
+Taken CALL instruction pushes the values `A` and `pc` (program counter) onto the stack and then performs a forward jump relative to the value of `pc`. The forward offset is equal to `16 * (imm0[6:0] + 1)`. Maximum jump distance is therefore 128 instructions forward (this means that at least 4 correctly spaced CALL instructions are needed to form a loop in the program).
 ##### RET
 The RET instruction behaves like "not taken" when the stack is empty. Taken RET instruction pops the return address `raddr` from the stack (it's the instruction following the previous CALL), then pops a return value `retval` from the stack and sets `C = A ^ retval`. Finally, the instruction jumps back to `raddr`.
 ## Program generation
-The program is initialized from a 256-bit seed value using a [PCG random number generator](http://www.pcg-random.org/). The program is generated in this order:
+The program is initialized from a 256-bit seed value `S`.
-1. All 1024 instructions are generated as a list of random 64-bit integers.
+1. A [pcg32](http://www.pcg-random.org/)  random number generator is initialized with state `S[63:0]`.
-2. Initial values of all integer registers `r0`-`r7` are generated as random 64-bit integers.
+2. The generator is used to generate random 128 bytes `R1`.
-3. Initial values of all floating point registers `f0`-`f7` are generated as random 64-bit signed integers converted to a double precision floating point format.
+3. Integer registers `r0`-`r7` are initialized using bytes 0-63 of `R1`.
-4. Initial values of all memory address registers `g0`-`g3` are generated as random 32-bit integers.
+4. Floating point registers `f0`-`f7` are initialized using bytes 64-127 of `R1` interpreted as 8 64-bit signed integers converted to a double precision floating point format.
-5. The initial value of the `m0` register is generated as a random 32-bit value with the last 8 bits cleared (256-byte aligned).
+5. The initial value of the `m0` register is set to `S[95:64]` and the last 8 bits are cleared (256-byte aligned).
-6. A random 128-byte scratchpad seed is generated.
+6. `S` is expanded into 10 AES round keys `K0`-`K9`.
-7. The initial 256-bit seed is used to generate 10 AES round keys.
+7. `R1` is exploded into a 264 KiB buffer `B` by repeated 10-round AES encryption.
-6. The 256 KiB scratchpad is initialized by repeated 10-round AES encryption starting with the scratchpad seed.
+8. The scratchpad is set to the first 256 KiB of `B`.
-7. The remaining registers are initialized as `pc = 0`, `sp = 0`, `ic = 65536` (TBD), `mx = 0`.
+9. The program buffer is set to the final 8 KiB of `B`.
 10. The remaining registers are initialized as `pc = 0`, `sp = 0`, `ic = 1048576` (TBD), `mx = 0`.
 ## Result
 When the program terminates (the value of `ic` register reaches 0), the final result is calculated as follows:
-1. The register file is hashed using the Blake2b 256-bit hash function. The order of registers is: `r0`-`r7`, `f0`-`f7`, `g0`-`g3` (total of 144 bytes).
+1. The register file is treated as a 128-byte value `R2`.
-2. The 256-bit hash is expanded into 10 AES round keys.
+3. The 256 KiB scratchpad is imploded into a 128-byte digest `D` using 10-round AES decryption with keys `K0`-`K9` and XORing each 128-byte chunk with `R2`.
-3. The 256 KiB scratchpad is imploded into 128 bytes using 10-round AES decryption.
+4. `D` is hashed using the Blake2b 256-bit hash function. This is the result of the PoW.
 4. The 128 byte scratchpad digest is hashed again using the Blake2b 256-bit hash function. This is the result of the PoW.
 *The stack is not included in the result calculation to enable platform-specific return addresses.*
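
The address formulas added to the Operand A and Operand C sections of the README in this diff can be sketched as follows. This is an illustration under stated assumptions, not code from the repository: the function names and the two width tables are hypothetical helpers, with widths copied from the `loc(a)` and `loc(c)` tables shown above.

```python
# Address widths (W) by location flag, taken from the README tables.
A_WIDTH = {0: 32, 1: 32, 2: 32, 3: 32, 4: 15, 5: 11, 6: 11, 7: 11}
C_WIDTH = {0: 15, 1: 11, 2: 11, 3: 11}  # loc(c) 4-7 write to a register instead


def read_address(reg_a, addr0, loc_a):
    """Return (updated reg(a), addr(a)) for operand A."""
    reg_a ^= addr0                               # reg(a) ^= addr0
    addr = reg_a & ((1 << A_WIDTH[loc_a]) - 1)   # addr(a) = reg(a)[W-1:0]
    if loc_a >= 4:                               # scratchpad reads are 8-byte aligned
        addr *= 8
    return reg_a, addr


def write_address(reg_c, addr1, loc_c):
    """Return addr(c) for a scratchpad write (loc(c) in 0..3)."""
    return ((addr1 ^ reg_c) & ((1 << C_WIDTH[loc_c]) - 1)) * 8
```

With these widths, a 15-bit address multiplied by 8 spans the full 256 KiB scratchpad and an 11-bit address spans the first 16 KiB.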
--- a/tests/rx2c.py
+++ b/tests/rx2c.py
@@ -2,8 +2,8 @@ import random
 import sys
 import os
-PROGRAM_SIZE = 1024
+PROGRAM_SIZE = 512
-INSTRUCTION_COUNT = 65536
+INSTRUCTION_COUNT = 1024 * 1024
 def genBytes(count):
    return ', '.join(str(random.getrandbits(8)) for i in range(count))
@@ -33,14 +33,14 @@ def toSigned32(x):
 def toSigned64(x):
    return x - ((x & 0x8000000000000000) << 1)
-def immediateTo(val, type):
+def immediateTo(symbol, type):
    converters = {
-        0: toSigned32(val),
+        0: toSigned32(symbol.imm1),
-        1: val,
+        1: symbol.imm1,
-        2: toSigned32(val),
+        2: toSigned32(symbol.imm1),
-        3: val,
+        3: symbol.imm1,
-        4: float(toSigned32(val) << 32),
+        4: float(toSigned32(symbol.imm1) << 32),
-        5: val & 63
+        5: symbol.imm0 & 63
    }
    return repr(converters.get(type))
@@ -102,15 +102,14 @@ def getRegister(num, type):
 def writeInitialValues(file):
    file.write("\tclock_t clockStart = clock(), clockEnd;\n")
    for i in range(8):
-        file.write("\tr{0} = {1}ULL;\n".format(i, random.getrandbits(64)))
+        file.write("\tr{0} = *(uint64_t*)(aesSeed + {1});\n".format(i, i * 8))
    for i in range(8):
-        file.write("\tf{0} = {1};\n".format(i, toSigned64(random.getrandbits(64))))
+        file.write("\tf{0} = *(int64_t*)(aesSeed + {1});\n".format(i, 64 + i * 8))
-    file.write("\tG = _mm_set_epi64x({0}ULL, {1}ULL);\n".format(random.getrandbits(64), random.getrandbits(64)))
+    file.write("\tmmu.m0 = (aesKey[9] << 8) | (aesKey[10] << 16) | (aesKey[11] << 24);\n")
    file.write("\tmmu.m0 = {1};\n".format(i, random.getrandbits(32) & 0xFFFFFF00))
    file.write("\taesInitialize((__m128i*)aesKey, (__m128i*)aesSeed, (__m128i*)scratchpad, SCRATCHPAD_SIZE);\n")
    file.write("\tmmu.mx = 0;\n")
    file.write("\tmmu.sp = 0;\n")
-    file.write("\tic = 65536;\n")
+    file.write("\tic = {0};\n".format(INSTRUCTION_COUNT))
    file.write("\tmxcsr = (_mm_getcsr() | _MM_FLUSH_ZERO_ON) & ~_MM_ROUND_MASK; //flush denormals to zero, round to nearest\n")
    file.write("\t_mm_setcsr(mxcsr);\n")
@@ -131,13 +130,8 @@ def writeEpilog(file):
 def writeCommon(file, i, symbol, type, name):
    file.write("\ti_{0}: {{ //{1}\n".format(i, name))
    file.write("\t\tif(0 == ic--) goto end;\n")
-    file.write("\t\tr{0} ^= (uint32_t)_mm_cvtsi128_si32(G);\n".format(symbol.ra))
+    file.write("\t\tr{0} ^= {1};\n".format(symbol.rega, symbol.addr0))
-    file.write("\t\taddr_t addr = r{0};\n".format(symbol.ra))
+    file.write("\t\taddr_t addr = r{0};\n".format(symbol.rega))
    file.write("\t\tr{0} = __rolq(r{0}, 32);\n".format(symbol.ra))
    file.write("\t\tG = _mm_shuffle_epi32(G, _MM_SHUFFLE(1, 2, 3, 0));\n")
    if symbol.gen == 0:
        file.write("\t\t__m128i K = _mm_set_epi64x({0}, r{1});\n".format(registerFrom(symbol.xb, type), symbol.ra))
        file.write("\t\tG = _mm_aesenc_si128(G, K);\n")
 def readA(symbol, type):
    location = {
@@ -154,38 +148,40 @@ def readA(symbol, type):
 def writeC(symbol, type):
    location = {
-        0: "SCRATCHPAD_256K(addr)",
+        0: "SCRATCHPAD_256K(r{0} ^ {1})",
-        1: "SCRATCHPAD_16K(addr)",
+        1: "SCRATCHPAD_16K(r{0} ^ {1})",
-        2: "",
+        2: "SCRATCHPAD_16K(r{0} ^ {1})",
-        3: "",
+        3: "SCRATCHPAD_16K(r{0} ^ {1})",
-        4: "SCRATCHPAD_16K(addr)",
+        4: "",
-        5: "SCRATCHPAD_16K(addr)",
+        5: "",
        6: "",
        7: ""
    }
-    c = location.get(symbol.loca)
+    c = location.get(symbol.locc)
    if c == "":
-        c = getRegister(symbol.xb, type)
+        c = getRegister(symbol.regc, type)
    else:
-        c = convertibleFrom(c, type)
+        c = convertibleFrom(c.format(symbol.regc, symbol.addr1), type)
    return c
 def readB(symbol, type):
    if symbol.locb < 6:
-        return registerTo(getRegister(symbol.xb, type), type)
+        return registerTo(getRegister(symbol.regb, type), type)
    else:
-        return immediateTo(symbol.imm1, type)
+        return immediateTo(symbol, type)
 class CodeSymbol:
    def __init__(self, qi):
        self.opcode = qi & 255
        self.loca = (qi >> 8) & 7
-        self.ra = (qi >> 11) & 7
+        self.rega = (qi >> 16) & 7
-        self.gen = (qi >> 14) & 3
+        self.locb = (qi >> 24) & 7
-        self.locb = (qi >> 16) & 7
+        self.regb = (qi >> 32) & 7
-        self.xb = (qi >> 19) & 7
+        self.locc = (qi >> 40) & 7
-        self.imm0 = (qi >> 24) & 255
+        self.regc = (qi >> 48) & 7
-        self.imm1 = qi >> 32
+        self.imm0 = (qi >> 56) & 255
        self.addr0 = (qi >> 64) & 0xFFFFFFFF
        self.addr1 = self.imm1 = qi >> 96
 def writeOperation(file, i, symbol, type, name, op):
    writeCommon(file, i, symbol, type, name)
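
The field layout that the new `CodeSymbol` constructor reads from a 128-bit integer can be exercised in isolation. The sketch below mirrors the shifts and masks in this hunk; the `Instruction` tuple and the `decode` function are hypothetical names, and the explicit 32-bit mask on `imm1` is an addition that has no effect for inputs below 2^128.

```python
from collections import namedtuple

# Hypothetical container; field names follow CodeSymbol in this diff.
Instruction = namedtuple(
    "Instruction",
    "opcode loca rega locb regb locc regc imm0 addr0 imm1",
)


def decode(qi):
    """Split a 128-bit integer into instruction fields (one field per byte lane)."""
    return Instruction(
        opcode=qi & 255,
        loca=(qi >> 8) & 7,
        rega=(qi >> 16) & 7,
        locb=(qi >> 24) & 7,
        regb=(qi >> 32) & 7,
        locc=(qi >> 40) & 7,
        regc=(qi >> 48) & 7,
        imm0=(qi >> 56) & 255,
        addr0=(qi >> 64) & 0xFFFFFFFF,
        imm1=(qi >> 96) & 0xFFFFFFFF,  # also serves as addr1
    )
```

For example, `decode(random.getrandbits(128))` yields the same field values that `CodeSymbol` would store for that integer.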
@@ -326,7 +322,7 @@ def write_FROUND(file, i, symbol):
 def write_FROUND(file, i, symbol):
    type = OperandType.FLOAT
    writeCommon(file, i, symbol, type, 'FROUND')
-    file.write("\t\t{0} A = {1};\n".format(declareType(OperandType.UINT64), readA(symbol, OperandType.UINT64)))
+    file.write("\t\t{0} A = {1};\n".format(declareType(OperandType.INT64), readA(symbol, OperandType.INT64)))
    file.write("\t\t{0} = A;\n".format(writeC(symbol, type)))
    file.write("\t\t_mm_setcsr(mxcsr | ((uint32_t)(A << 13) & _MM_ROUND_MASK)); }\n")
@@ -335,13 +331,13 @@ def write_CALL(file, i, symbol):
    writeCommon(file, i, symbol, type, 'CALL')
    file.write("\t\t{0} A = {1};\n".format(declareType(type), readA(symbol, type)))
    if symbol.locb < 6:
-        file.write("\t\tif((uint32_t){0} <= {1}) {{\n".format(getRegister(symbol.xb, type), immediateTo(symbol.imm1, type)))
+        file.write("\t\tif((uint32_t)r{0} <= {1}) {{\n".format(symbol.regb, symbol.imm1))
    file.write("\t\t\tPUSH_VALUE(A);\n");
    file.write("\t\t\tPUSH_ADDRESS(&&i_{0});\n".format((i + 1) & (PROGRAM_SIZE - 1)));
-    file.write("\t\t\tgoto i_{0};\n".format((i + 1 + symbol.imm0) & (PROGRAM_SIZE - 1)));
+    file.write("\t\t\tgoto i_{0};\n".format((i + 1 + (symbol.imm0 & (PROGRAM_SIZE//4 - 1))) & (PROGRAM_SIZE - 1)));
    if symbol.locb < 6:
        file.write("\t\t}}\n\t\t{0} = A;".format(writeC(symbol, type)))
-    file.write(" }\n")
+    file.write("\t\t}\n")
 def write_RET(file, i, symbol):
    type = OperandType.UINT64
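
The jump target computed in `write_CALL` can be checked on its own. This is a small sketch assuming `PROGRAM_SIZE = 512` as set at the top of the file; `call_target` is a hypothetical helper, and it uses integer division (`//`) so that the mask stays an integer under Python 3 as well.

```python
PROGRAM_SIZE = 512  # value set by this commit


def call_target(i, imm0):
    """Index of the instruction a taken CALL at index i jumps to."""
    offset = imm0 & (PROGRAM_SIZE // 4 - 1)          # keeps the low 7 bits of imm0
    return (i + 1 + offset) & (PROGRAM_SIZE - 1)     # ring buffer wrap-around
```

The jump distance is between 1 and 128 instructions, so at least four taken CALLs are needed to travel once around the 512-instruction ring buffer.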
@@ -349,7 +345,7 @@ def write_RET(file, i, symbol):
    file.write("\t\t{0} A = {1};\n".format(declareType(type), readA(symbol, type)))
    file.write("\t\tif(!STACK_IS_EMPTY()")
    if symbol.locb < 6:
-        file.write(" && (uint32_t){0} <= {1}".format(getRegister(symbol.xb, type), immediateTo(symbol.imm1, type)))
+        file.write(" && (uint32_t)r{0} <= {1}".format(symbol.regb, symbol.imm1))
    file.write(") {\n")
    file.write("\t\t\tvoid* target = POP_ADDRESS();\n")
    file.write("\t\t\tuint64_t C = POP_VALUE();\n")
@@ -620,10 +616,9 @@ def writeCode(file, i, symbol):
    opcodeMap.get(symbol.opcode)(file, i, symbol)
 def writeMain(file):
-    file.write(("int main() {\n"
+    file.write(('__attribute__((optimize("Os"))) int main() {\n'
                "	register uint64_t r0, r1, r2, r3, r4, r5, r6, r7;\n"
                "	register double f0, f1, f2, f3, f4, f5, f6, f7;\n"
                "	register __m128i G; //g0-g3\n"
                "	register uint64_t ic;\n"
                "	convertible_t scratchpad[SCRATCHPAD_LENGTH];\n"
                "	stack_t stack[STACK_LENGTH];\n"
@@ -663,10 +658,10 @@ def writeProlog(file):
                "#define SCRATCHPAD_LENGTH (SCRATCHPAD_SIZE / sizeof(convertible_t))\n"
                "#define SCRATCHPAD_MASK14 (16 * 1024 / sizeof(convertible_t) - 1)\n"
                "#define SCRATCHPAD_MASK18 (SCRATCHPAD_LENGTH - 1)\n"
-                "#define SCRATCHPAD_16K(x) scratchpad[(x >> 3) & SCRATCHPAD_MASK14]\n"
+                "#define SCRATCHPAD_16K(x) scratchpad[(x) & SCRATCHPAD_MASK14]\n"
-                "#define SCRATCHPAD_256K(x) scratchpad[(x >> 3) & SCRATCHPAD_MASK18]\n"
+                "#define SCRATCHPAD_256K(x) scratchpad[(x) & SCRATCHPAD_MASK18]\n"
                "#define STACK_LENGTH (32 * 1024)\n"
-                "#define DRAM(x) __rolq(6364136223846793005*(x)+1442695040888963407,32)\n"
+                "#define DRAM(x) __rolq(6364136223846793005ULL*(x)+1442695040888963407ULL,32)\n"
                "//#define PREFETCH(x) _mm_prefetch(x, _MM_HINT_T0)\n"
                "#define PREFETCH(x)\n"
                "#define PUSH_VALUE(x) stack[mmu.sp++].value = x\n"
@@ -782,6 +777,7 @@ with sys.stdout as file:
    writeMain(file)
    writeInitialValues(file)
    for i in range(PROGRAM_SIZE):
-        writeCode(file, i, CodeSymbol(random.getrandbits(64)))
+        writeCode(file, i, CodeSymbol(random.getrandbits(128)))
    if PROGRAM_SIZE > 0:
        file.write("\t\tgoto i_0;\n")
    writeEpilog(file)