Merge branch 'dev'

2024-12-22 15:58:53 +00:00 · 2019-02-09 20:02:14 +01:00 · 2019-02-09 20:02:14 +01:00 · 98c4ccf5ca
commit 98c4ccf5ca
parent 85b31342e1 b8ce504be6
68 changed files with 5689 additions and 12165 deletions
--- a/README.md
+++ b/README.md
@@ -1,111 +1,87 @@
 # RandomX
-RandomX is an experimental proof of work (PoW) algorithm that uses random code execution.
+RandomX is a proof-of-work (PoW) algorithm that is optimized for general-purpose CPUs. RandomX uses random code execution (hence the name) together with several memory-hard techniques to achieve the following goals:

-### Key features
+* Prevent the development of a single-chip [ASIC](https://en.wikipedia.org/wiki/Application-specific_integrated_circuit)
+* Minimize the efficiency advantage of specialized hardware compared to a general-purpose CPU

-* Memory-hard (requires  >4 GiB of memory)
-* CPU-friendly (especially for x86 and ARM architectures)
-* arguably ASIC-resistant
-* inefficient on GPUs
-* unusable for web-mining
+## Design

-## Virtual machine
+The core of RandomX is a virtual machine (VM), which can be summarized by the following schematic:

-RandomX is intended to be run efficiently on a general-purpose CPU. The virtual machine (VM) which runs RandomX code attempts to simulate a generic CPU using the following set of components:
+![Imgur](https://i.imgur.com/8RYNWLk.png)

-![Imgur](https://i.imgur.com/ZAfbX9m.png)
+Notable parts of the RandomX VM are:

-Full description: [vm.md](doc/vm.md).
+* a large read-only 4 GiB dataset
+* a 2 MiB scratchpad (read/write), which is structured into three levels L1, L2 and L3
+* 8 integer and 12 floating point registers
+* an arithmetic logic unit (ALU)
+* a floating point unit (FPU)
+* a 2 KiB program buffer

-## Dataset
+The structure of the VM mimics the components that are found in a typical general purpose computer equipped with a CPU and a large amount of DRAM. The scratchpad is designed to fit into the CPU cache. The first 16 KiB and 256 KiB of the scratchpad are used more often to take advantage of the faster L1 and L2 caches. The ratio of random reads from L1/L2/L3 is approximately 9:3:1, which matches the inverse latencies of typical CPU caches.

-RandomX uses a 4 GiB read-only dataset. The dataset is constructed using a combination of the [Argon2d](https://en.wikipedia.org/wiki/Argon2) hashing function, [AES](https://en.wikipedia.org/wiki/Advanced_Encryption_Standard) encryption/decryption and a random permutation. The dataset is regenerated every ~34 hours.
+The VM executes programs in a special instruction set, which was designed in such a way that any random 8-byte word is a valid instruction and any sequence of valid instructions is a valid program. For more details see [RandomX ISA documentation](doc/isa.md). Because there are no "syntax" rules, generating a random program is as easy as filling the program buffer with random data. A RandomX program consists of 256 instructions. See [program.inc](src/program.inc) as an example of a RandomX program translated into x86-64 assembly.

-Full description: [dataset.md](doc/dataset.md).
+### Hash calculation

-## Instruction set
+Calculating a RandomX hash consists of initializing the 2 MiB scratchpad with random data, executing 8 RandomX loops and calculating a hash of the scratchpad.

-RandomX uses a simple low-level language (instruction set), which was designed so that any random bitstring forms a valid program. Each RandomX instruction has a length of 128 bits.
+Each RandomX loop is repeated 2048 times. The loop body has 4 parts:
+1. The values of all registers are loaded randomly from the scratchpad (L3)
+2. The RandomX program is executed
+3. A random block is loaded from the dataset and mixed with integer registers
+4. All register values are stored into the scratchpad (L3)

-Full description: [isa.md](doc/isa.md).
+A hash of the register state after 2048 iterations is used to initialize the random program for the next loop. The use of 8 different programs in the course of a single hash calculation prevents mining strategies that search for "easy" programs.

-## Implementation
-Proof-of-concept implementation is written in C++.
-```
-> bin/randomx --help
-Usage: bin/randomx [OPTIONS]
-Supported options:
-        --help                  shows this message
-        --compiled              use x86-64 JIT-compiled VM (default: interpreted VM)
-        --lightClient           use 'light-client' mode (default: full dataset mode)
-        --softAes               use software AES (default: x86 AES-NI)
-        --threads T             use T threads (default: 1)
-        --nonces N              run N nonces (default: 1000)
-        --genAsm                generate x86 asm code for nonce N
-```
+The loads from the dataset are fully prefetched, so they don't slow down the loop.

-Two RandomX virtual machines are implemented:
+RandomX uses the [Blake2b](https://en.wikipedia.org/wiki/BLAKE_%28hash_function%29#BLAKE2) cryptographic hash function. Special hashing functions `fillAes1Rx4` and `hashAes1Rx4` based on [AES](https://en.wikipedia.org/wiki/Advanced_Encryption_Standard) encryption are used to initialize and hash the scratchpad ([hashAes1Rx4.cpp](src/hashAes1Rx4.cpp)).

-### Interpreted VM
-The interpreted VM is the reference implementation, which aims for maximum portability.
+### Hash verification

-The VM has been tested for correctness on the following platforms:
-* Linux: x86-64, ARMv7 (32-bit), ARMv8 (64-bit)
-* Windows: x86, x86-64
-* MacOS: x86-64
+RandomX is a symmetric PoW algorithm, so the verifying party has to repeat the same steps as when a hash is calculated.

-The interpreted VM supports two modes: "full dataset" mode, which requires more than 4 GiB of virtual memory, and a "light-client" mode, which requires about 64 MiB of memory, but runs significantly slower because dataset blocks are created on the fly rather than simply fetched from memory.
+However, to allow hash verification on devices that cannot store the whole 4 GiB dataset, RandomX allows a time-memory tradeoff by using just 256 MiB of memory at the cost of 16 times more random memory accesses. See [Dataset initialization](doc/dataset.md) for more details.

-Software AES implementation is available for CPUs which don't support [AES-NI](https://en.wikipedia.org/wiki/AES_instruction_set).
+### Performance
+Preliminary mining performance with the x86-64 JIT compiled VM:

-The following table lists the performance for Intel Core i5-3230M (Ivy Bridge) CPU using a single core on Windows 64-bit, compiled with Visual Studio 2017:
+|CPU|RAM|threads|hashrate [H/s]|comment|
+|-----|-----|----|----------|-----|
+|AMD Ryzen 1700|DDR4-2933|8|4100|
+|Intel i5-3230M|DDR3-1333|1|280|without large pages
+|Intel i7-8550U|DDR4-2400|4|1200|limited by thermals
+|Intel i5-2500K|DDR3-1333|3|1350|

-|mode|required memory|AES|initialization time [s]|performance [programs/s]|
-|------|----|-----|-------------------------|------------------|
-|light client|64 MiB|software|1.0|9.2|
-|light client|64 MiB|AES-NI|1.0|16|
-|full dataset|4 GiB|software|54|40|
-|full dataset|4 GiB|AES-NI|26|40|
+Hash verification is performed using the portable interpreter in "light-client mode" and takes 30-70 ms depending on RAM latency and CPU clock speed. Hash verification in "mining mode" takes 2-4 ms.

-### JIT-compiled VM
-A JIT compiler is available for x86-64 CPUs. This implementation shows the approximate performance that can be achieved using optimized mining software. The JIT compiler generates generic x86-64 code without any architecture-specific optimizations. Only "full dataset" mode is supported.
+### Documentation
+* [RandomX ISA](doc/isa.md)
+* [RandomX instruction listing](doc/isa-ops.md)
+* [Dataset initialization](doc/dataset.md)

-For optimal performance, an x86-64 CPU needs:
-* 32 KiB of L1 instruction cache per thread
-* 16 KiB of L1 data cache per thread
-* 240 KiB of L2 cache (exclusive) per thread
+# FAQ

-The following table lists the performance of AMD Ryzen 7 1700 (clock fixed at 3350 MHz, 1.05 Vcore, dual channel DDR4 2400 MHz) on Linux 64-bit (compiled with GCC 5.4.0).
+### Can RandomX run on a GPU?

-Power consumption was measured for the whole system using a wall socket wattmeter (±1W). Table lists difference over idle power consumption. [Prime95](https://en.wikipedia.org/wiki/Prime95#Use_for_stress_testing)  (small/in-place FFT) and [Cryptonight V2](https://github.com/monero-project/monero/pull/4218) power consumption are listed for comparison.
+We don't expect GPUs will ever be competitive in mining RandomX. The reference miner is CPU-only.

-||threads|initialization time [s]|performance [programs/s]|power [W]
-|-|------|----|-----|-------------------------|
-|RandomX (interpreted)|1|27|52|16|
-|RandomX (interpreted)|8|4.0|390|63|
-|RandomX (interpreted)|16|3.5|620|74|
-|RandomX (compiled)|1|27|407|17|
-|RandomX (compiled)|2|14|810|26|
-|RandomX (compiled)|4|7.3|1620|42|
-|RandomX (compiled)|6|5.1|2410|56|
-|RandomX (compiled)|8|4.0|3200|71|
-|RandomX (compiled)|12|4.0|3670|82|
-|RandomX (compiled)|16|3.5|4110|92|
-|Cryptonight v2|8|-|-|47|
-|Prime95|8|-|-|77|
-|Prime95|16|-|-|81|
+RandomX was designed to be efficient on CPUs. Designing an algorithm compatible with both CPUs and GPUs brings too many limitations and ultimately decreases ASIC resistance. CPUs have the advantage of not needing proprietary drivers and most CPU architectures support a large common subset of primitive operations.

-## Proof of work
+Additionally, targeting CPUs allows for more decentralized mining for several reasons:

-RandomX VM can be used for PoW using the following steps:
+* Every computer has a CPU and even laptops will be able to mine efficiently.
+* CPU mining is easier to set up - no driver compatibility issues, BIOS flashing etc.
+* CPU mining is more difficult to centralize because computers can usually have only one CPU except for expensive server parts.

-1. Initialize the VM using a 256-bit hash of any data.
-2. Execute the RandomX program.
-3. Calculate `blake2b(RegisterFile || t1ha2(Scratchpad))`*
+### Does RandomX facilitate botnets/malware mining or web mining?
+Quite the opposite. Efficient mining requires 4 GiB of memory, which is very difficult to hide in an infected computer and disqualifies many low-end machines. Web mining is nearly impossible due to the large memory requirement and the need for a rather lengthy initialization of the dataset.

-\* [blake2b](https://en.wikipedia.org/wiki/BLAKE_%28hash_function%29#BLAKE2) is a cryptographic hash function, [t1ha2](https://github.com/leo-yuriev/t1ha) is a fast hashing function.
+### Since RandomX uses floating point calculations, how can it give reproducible results on different platforms?

-The above steps can be chained multiple times to prevent mining strategies that search for programs with particular properties (for example, without division).
+RandomX uses only operations that are guaranteed to give correctly rounded results by the [IEEE 754](https://en.wikipedia.org/wiki/IEEE_754) standard: addition, subtraction, multiplication, division and square root. Special care is taken to avoid corner cases such as NaN values or denormals.

 ## Acknowledgements
 The following people have contributed to the design of RandomX:
@@ -114,13 +90,10 @@ The following people have contributed to the design of RandomX:

 RandomX uses some source code from the following 3rd party repositories:
 * Argon2d, Blake2b hashing functions: https://github.com/P-H-C/phc-winner-argon2
-* PCG32 random number generator: https://github.com/imneme/pcg-c-basic
 * Software AES implementation https://github.com/fireice-uk/xmr-stak
-* t1ha2 hashing function: https://github.com/leo-yuriev/t1ha

 ## Donations
-
 XMR:
 ```
-4B9nWtGhZfAWsTxWujPDGoWfVpJvADxkxJJTmMQp3zk98n8PdLkEKXA5g7FEUjB8JPPHdP959WDWMem3FPDTK2JUU1UbVHo
+845xHUh5GvfHwc2R8DVJCE7BT2sd4YEcmjG8GNSdmeNsP5DTEjXd1CNgxTcjHjiFuthRHAoVEJjM7GyKzQKLJtbd56xbh7V
 ```
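Reviewer note on the README changes above: the sizes and counts quoted in the new text can be cross-checked with a little arithmetic. The sketch below is only an illustration. All constant names are hypothetical and do not come from the RandomX sources; the values are the ones stated in the README (8-byte instructions, 256 instructions per program, 8 loops of 2048 iterations, one dataset block per iteration, 16 times more random accesses in the 256 MiB mode).

```c++
#include <cstdint>

// Hypothetical names; values are taken from the README text in this diff.
constexpr std::uint64_t KiB = 1024;
constexpr std::uint64_t MiB = 1024 * KiB;
constexpr std::uint64_t GiB = 1024 * MiB;

constexpr std::uint64_t kDatasetSize       = 4 * GiB;   // read-only dataset
constexpr std::uint64_t kLightMemorySize   = 256 * MiB; // verification-only mode
constexpr std::uint64_t kInstructionSize   = 8;         // bytes per instruction
constexpr std::uint64_t kProgramLength     = 256;       // instructions per program
constexpr std::uint64_t kLoopsPerHash      = 8;         // one program per loop
constexpr std::uint64_t kIterationsPerLoop = 2048;
constexpr std::uint64_t kLightAccessFactor = 16;        // extra random accesses

// 256 instructions of 8 bytes each fill the program buffer exactly.
constexpr std::uint64_t kProgramBufferSize = kInstructionSize * kProgramLength;

// Step 3 of the loop body loads one dataset block per iteration.
constexpr std::uint64_t kDatasetReadsPerHash = kLoopsPerHash * kIterationsPerLoop;

// In the 256 MiB mode every dataset read costs 16 random memory accesses.
constexpr std::uint64_t kLightReadsPerHash = kDatasetReadsPerHash * kLightAccessFactor;

static_assert(kProgramBufferSize == 2 * KiB, "program buffer is 2 KiB");
static_assert(kDatasetReadsPerHash == 16384, "8 loops x 2048 iterations");
static_assert(kDatasetSize / kLightMemorySize == 16, "16x less memory in light mode");
```

If the numbers in the README are changed later, the `static_assert` lines show which derived figures (2 KiB program buffer, 16384 dataset reads per hash) would have to change with them.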
--- a/doc/dataset.md
+++ b/doc/dataset.md
@@ -1,15 +1,14 @@
+# Dataset

-## Dataset
+The dataset is randomly accessed 16384 times during each hash calculation, which significantly increases memory-hardness of RandomX. The size of the dataset is fixed at 4 GiB and it's divided into 67108864 blocks of 64 bytes.

-The dataset serves as the source of the first operand of all instructions and provides the memory-hardness of RandomX. The size of the dataset is fixed at 4 GiB and it's divided into 65536 blocks, each 64 KiB in size.
+In order to allow PoW verification with less than 4 GiB of memory, the dataset is constructed from a 256 MiB cache, which can be used to calculate dataset blocks on the fly.

-In order to allow PoW verification with less than 4 GiB of memory, the dataset is constructed from a 64 MiB cache, which can be used to calculate dataset blocks on the fly. To facilitate this, all random reads from the dataset are aligned to the beginning of a block.
+Because the initialization of the dataset is computationally intensive, it is recalculated only every 1024 blocks (~34 hours). The following figure visualizes the construction of the dataset:

-Because the initialization of the dataset is computationally intensive, it's recalculated on average every 1024 blocks (~34 hours). The following figure visualizes the construction of the dataset:
+![Imgur](https://i.imgur.com/b9WHOwo.png)

-![Imgur](https://i.imgur.com/JgLCjeq.png)
-
-### Seed block
+## Seed block
 The whole dataset is constructed from a 256-bit hash of the last block whose height is divisible by 1024 **and** has at least 64 confirmations.

 |block|Seed block|
@@ -19,9 +18,9 @@ The whole dataset is constructed from a 256-bit hash of the last block whose hei
 |2113-3136|2048|
 |...|...

-### Cache construction
+## Cache construction

-The 32-byte seed block hash is expanded into the 64 MiB cache using the "memory fill" function of Argon2d. [Argon2](https://github.com/P-H-C/phc-winner-argon2) is a memory-hard password hashing function, which is highly customizable. The variant with "d" suffix uses a data-dependent memory access pattern and provides the highest resistance against time-memory tradeoffs.
+The 32-byte seed block hash is expanded into the 256 MiB cache using the "memory fill" function of Argon2d. [Argon2](https://github.com/P-H-C/phc-winner-argon2) is a memory-hard password hashing function, which is highly customizable. The variant with "d" suffix uses a data-dependent memory access pattern and provides the highest resistance against time-memory tradeoffs.

 Argon2 is used with the following parameters:

@@ -29,8 +28,8 @@ Argon2 is used with the following parameters:
 |------------|--|
 |parallelism|1|
 |output size|0|
-|memory|65536 (64 MiB)|
-|iterations|12|
+|memory|262144 (256 MiB)|
+|iterations|3|
 |version|`0x13`|
 |hash type|0 (Argon2d)
 |password|seed block hash (32 bytes)
@@ -40,43 +39,66 @@ Argon2 is used with the following parameters:

 The finalizer and output calculation steps of Argon2 are omitted. The output is the filled memory array.

-The use of 12 iterations makes time-memory tradeoffs infeasible and thus 64 MiB is the minimum amount of memory required by RandomX.
+The use of 3 iterations makes time-memory tradeoffs infeasible and thus 256 MiB is the minimum amount of memory required by RandomX.

-When the memory fill is complete, the whole memory array is cyclically shifted backwards by 512 bytes (i.e. bytes 0-511 are moved to the end of the array). This is done to misalign the array so that each 1024-byte cache block spans two subsequent Argon2 blocks.
+## Dataset block generation
+The full 4 GiB dataset can be generated from the 256 MiB cache. Each 64-byte block is generated independently by XORing 16 pseudorandom cache blocks selected by the `SquareHash` function.

-### Dataset block generation
-The full 4 GiB dataset can be generated from the 64 MiB cache. Each block is generated separately: a 1024 byte block of the cache is expanded into 64 KiB of the dataset. The algorithm has 3 steps: expansion, AES and shuffle.
+### SquareHash
+`SquareHash` is a custom hash function with 64-bit input and 64-bit output. It is calculated by repeatedly squaring the input, splitting the 128-bit result into two 64-bit halves and subtracting the high half from the low half. This is repeated 42 times. It's available as a [portable C implementation](../src/squareHash.h) and [x86-64 assembly version](../src/asm/squareHash.inc).

-#### Expansion
-The 1024 cache bytes are split into 128 quadwords and interleaved with 504-byte chunks of null bytes. The resulting sequence is: 8 cache bytes + 504 null bytes + 8 cache bytes + 504 null bytes etc. Total length of the expanded block is 65536 bytes.
+Properties of `SquareHash`:

-#### AES
-The 256-bit seed block hash is expanded into 10 AES round keys `k0`-`k9`. Let `i = 0...65535` be the index of the block that is being expanded. If `i` is an even number, this step uses AES *decryption* and if `i` is an odd number, it uses AES *encryption*.  Since both encryption and decryption scramble random data, no distinction is made between them in the text below.
+* It achieves full [Avalanche effect](https://en.wikipedia.org/wiki/Avalanche_effect).
+* Since the whole calculation is a long dependency chain, which uses only multiplication and subtraction, the performance gains by using custom hardware are very limited.
+* A single `SquareHash` calculation takes 40-80 ns, which is about the same time as DRAM access latency. ASIC devices using low-latency memory will be bottlenecked by `SquareHash`, while CPUs will finish the hash calculation in about the same time it takes to fetch data from RAM.

-The AES encryption is performed with 10 identical rounds using round keys `k0`-`k9`. Note that this is different from the typical AES procedure, which uses a different key schedule for decryption and a modified last round.
+The output of 16 chained SquareHash calculations is used to determine cache blocks that are XORed together to produce a dataset block:

-Before the AES encryption is applied, each 16-byte chunk is XORed with the ciphertext of the previous chunk. This is similar to the [AES-CBC](https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#Cipher_Block_Chaining_%28CBC%29) mode of operation and forces the encryption to be sequential. For XORing the initial block, an initialization vector is formed by zero-extending `i` to 128 bits.
+```c++
+void initBlock(const uint8_t* cache, uint8_t* out, uint32_t blockNumber) {
+  uint64_t r0, r1, r2, r3, r4, r5, r6, r7;

-#### Shuffle
-When the AES step is complete, the last 16-byte chunk of the block is used to initialize a PCG32 random number generator. Bits 0-63 are used as the initial state and bits 64-127 are used as the increment. The least-significant bit of the increment is always set to 1 to form an odd number.
+  r0 = 4ULL * blockNumber;
+  r1 = r2 = r3 = r4 = r5 = r6 = r7 = 0;

-The whole block is then divided into 16384 doublewords (4 bytes) and the [Fisher–Yates shuffle](https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle) algorithm is applied to it. The algorithm generates a random in-place permutation of the 16384 doublewords. The result of the shuffle is the `i`-th block of the dataset.
+  constexpr uint32_t mask = (CacheSize - 1) & CacheLineAlignMask;

-The shuffle algorithm requires a uniform distribution of random numbers. The output of the PCG32 generator is always properly filtered to avoid the [modulo bias](https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#Modulo_bias).
+  for (auto i = 0; i < DatasetIterations; ++i) {
+    const uint8_t* mixBlock = cache + (r0 & mask);
+    PREFETCHNTA(mixBlock);
+    r0 = squareHash(r0);
+    r0 ^= load64(mixBlock + 0);
+    r1 ^= load64(mixBlock + 8);
+    r2 ^= load64(mixBlock + 16);
+    r3 ^= load64(mixBlock + 24);
+    r4 ^= load64(mixBlock + 32);
+    r5 ^= load64(mixBlock + 40);
+    r6 ^= load64(mixBlock + 48);
+    r7 ^= load64(mixBlock + 56);
+  }

-### Performance
-The initial 64-MiB cache construction using Argon2d takes around 1 second using an older laptop with an Intel i5-3230M CPU (Ivy Bridge). Cache generation is strictly serial and cannot be easily parallelized.
+  store64(out + 0, r0);
+  store64(out + 8, r1);
+  store64(out + 16, r2);
+  store64(out + 24, r3);
+  store64(out + 32, r4);
+  store64(out + 40, r5);
+  store64(out + 48, r6);
+  store64(out + 56, r7);
+}
+```

-Dataset generation performance depends on the support of the AES-NI instruction set. The following table lists the generation runtimes using the same Ivy Bridge laptop with a single thread:
+*Note: `SquareHash` doesn't calculate squaring modulo 2<sup>64</sup>+1 because the subtraction is performed modulo 2<sup>64</sup>. Squaring modulo 2<sup>64</sup>+1 can be calculated by adding the carry bit in every iteration (i.e. the sequence in x86-64 assembly would have to be: `mul rax; sub rax, rdx; adc rax, 0`), but this would decrease ASIC-resistance of `SquareHash`.*

-|AES|4 GiB dataset generation|single block generation|
-|-----|-----------------------------|----------------|
-|hardware (AES-NI)|25 s|380 µs|
-|software|53 s|810 µs|
+## Performance
+The initial 256-MiB cache construction using Argon2d takes around 1 second using an older laptop with an Intel i5-3230M CPU (Ivy Bridge). Cache generation is strictly serial and cannot be parallelized.

-While the generation of a single block is strictly serial, multiple blocks can be easily generated in parallel, so the dataset generation time decreases linearly with the number of threads. Using a recent 6-core CPU with AES-NI support, the whole dataset can be generated in about 4 seconds.
+On the same laptop, full dataset initialization takes around 100 seconds using a single thread (1.5 µs per block).

-Moreover, the seed block hash is known up to 64 blocks in advance, so miners can slowly precalculate the whole dataset by generating ~512 dataset blocks per minute (corresponds to less than 1% utilization of a single CPU core).
+While the generation of a single block is strictly serial, multiple blocks can be easily generated in parallel, so the dataset generation time decreases linearly with the number of threads. Using an 8-core AMD Ryzen CPU, the whole dataset can be generated in under 10 seconds.

-### Light clients
-Light clients, who cannot or do not want to generate and keep the whole dataset in memory, can generate just the cache and then generate blocks on the fly as the program is being executed. In this case, the program execution time will be increased by roughly 100 times the single block generation time. For the Intel Ivy Bridge laptop, this amounts to around 40 milliseconds per program.
+Moreover, the seed block hash is known up to 64 blocks in advance, so miners can slowly precalculate the whole dataset by generating 524288 dataset blocks per minute (corresponds to about 1% utilization of a single CPU core).
+
+## Light clients
+Light clients, who cannot or do not want to generate and keep the whole dataset in memory, can generate just the cache and then generate blocks on the fly during hash calculation. In this case, the hash calculation time will be increased by 16384 times the single block generation time. For the Intel Ivy Bridge laptop, this amounts to around 24.5 milliseconds per hash.
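Reviewer note on the `SquareHash` description in doc/dataset.md above: the prose (square, split into two 64-bit halves, subtract the high half from the low half, repeat 42 times) can be written down as a minimal sketch. This follows only the text of the document, not the referenced `squareHash.h`, which may differ in details such as additional constants. The function names are hypothetical, and `unsigned __int128` is a GCC/Clang extension assumed to be available.

```c++
#include <cstdint>

// One round as described in the text: 64x64 -> 128-bit square, then
// (low half - high half) modulo 2^64. This corresponds to the
// `mul rax; sub rax, rdx` sequence mentioned in the note on SquareHash.
inline std::uint64_t squareRoundSketch(std::uint64_t x) {
    const unsigned __int128 sq = static_cast<unsigned __int128>(x) * x;
    const std::uint64_t lo = static_cast<std::uint64_t>(sq);
    const std::uint64_t hi = static_cast<std::uint64_t>(sq >> 64);
    return lo - hi; // unsigned subtraction wraps modulo 2^64
}

// 42 chained rounds; every round depends on the previous result,
// so the whole calculation is a single serial dependency chain.
inline std::uint64_t squareHashSketch(std::uint64_t x, int rounds = 42) {
    for (int i = 0; i < rounds; ++i) {
        x = squareRoundSketch(x);
    }
    return x;
}
```

Taken literally, this description maps 0 to 0 and 1 to 1, so a production implementation would presumably need to handle such inputs; the sketch leaves that aside.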
--- a/doc/isa-ops.md
+++ b/doc/isa-ops.md
@@ -0,0 +1,103 @@
+# RandomX instruction listing
+
+## Integer instructions
+For integer instructions, the destination is always an integer register (register group R). Source operand (if applicable) can be either an integer register or memory value. If `dst` and `src` refer to the same register, most instructions use `imm32` as the source operand instead of the register. This is indicated in the 'src == dst' column.
+
+Memory operands are loaded as 8-byte values from the address indicated by `src`.  This indirect addressing is marked with square brackets: `[src]`.
+
+|frequency|instruction|dst|src|`src == dst ?`|operation|
+|-|-|-|-|-|-|
+|12/256|IADD_R|R|R|`src = imm32`|`dst = dst + src`|
+|7/256|IADD_M|R|mem|`src = imm32`|`dst = dst + [src]`|
+|16/256|IADD_RC|R|R|`src = dst`|`dst = dst + src + imm32`|
+|12/256|ISUB_R|R|R|`src = imm32`|`dst = dst - src`|
+|7/256|ISUB_M|R|mem|`src = imm32`|`dst = dst - [src]`|
+|9/256|IMUL_9C|R|-|-|`dst = 9 * dst + imm32`|
+|16/256|IMUL_R|R|R|`src = imm32`|`dst = dst * src`|
+|4/256|IMUL_M|R|mem|`src = imm32`|`dst = dst * [src]`|
+|4/256|IMULH_R|R|R|`src = dst`|`dst = (dst * src) >> 64`|
+|1/256|IMULH_M|R|mem|`src = imm32`|`dst = (dst * [src]) >> 64`|
+|4/256|ISMULH_R|R|R|`src = dst`|`dst = (dst * src) >> 64` (signed)|
+|1/256|ISMULH_M|R|mem|`src = imm32`|`dst = (dst * [src]) >> 64` (signed)|
+|4/256|IDIV_C|R|-|-|`dst = dst + dst / imm32`|
+|4/256|ISDIV_C|R|-|-|`dst = dst + dst / imm32` (signed)|
+|2/256|INEG_R|R|-|-|`dst = -dst`|
+|16/256|IXOR_R|R|R|`src = imm32`|`dst = dst ^ src`|
+|4/256|IXOR_M|R|mem|`src = imm32`|`dst = dst ^ [src]`|
+|10/256|IROR_R|R|R|`src = imm32`|`dst = dst >>> src`|
+|4/256|ISWAP_R|R|R|`src = dst`|`temp = src; src = dst; dst = temp`|
+
+#### IMULH and ISMULH
+These instructions output the high 64 bits of the whole 128-bit multiplication result. The result differs for signed and unsigned multiplication (`IMULH` is unsigned, `ISMULH` is signed). The variants with a register source operand do not use `imm32` (they perform a squaring operation if `dst` equals `src`).
+
+#### IDIV_C and ISDIV_C
+The division instructions use a constant divisor, so they can be optimized into a [multiplication by fixed-point reciprocal](https://en.wikipedia.org/wiki/Division_algorithm#Division_by_a_constant). `IDIV_C` performs unsigned division (`imm32` is zero-extended to 64 bits), while `ISDIV_C` performs signed division. In the case of division by zero, the instructions become a no-op. In the very rare case of signed overflow, the destination register is set to zero.
+
+#### ISWAP_R
+This instruction swaps the values of two registers. If source and destination refer to the same register, the result is a no-op.
+
+## Floating point instructions
+For floating point instructions, the destination can be a group F or group E register. Source operand is either a group A register or a memory value.
+
+Memory operands are loaded as 8-byte values from the address indicated by `src`. The 8-byte value is interpreted as two 32-bit signed integers and implicitly converted to floating point format. The lower and upper memory operands are marked as `[src][0]` and `[src][1]`.
+
+|frequency|instruction|dst|src|operation|
+|-|-|-|-|-|
+|8/256|FSWAP_R|F+E|-|`(dst0, dst1) = (dst1, dst0)`|
+|20/256|FADD_R|F|A|`(dst0, dst1) = (dst0 + src0, dst1 + src1)`|
+|5/256|FADD_M|F|mem|`(dst0, dst1) = (dst0 + [src][0], dst1 + [src][1])`|
+|20/256|FSUB_R|F|A|`(dst0, dst1) = (dst0 - src0, dst1 - src1)`|
+|5/256|FSUB_M|F|mem|`(dst0, dst1) = (dst0 - [src][0], dst1 - [src][1])`|
+|6/256|FNEG_R|F|-|`(dst0, dst1) = (-dst0, -dst1)`|
+|20/256|FMUL_R|E|A|`(dst0, dst1) = (dst0 * src0, dst1 * src1)`|
+|4/256|FDIV_M|E|mem|`(dst0, dst1) = (dst0 / [src][0], dst1 / [src][1])`|
+|6/256|FSQRT_R|E|-|`(dst0, dst1) = (√dst0, √dst1)`|
+
+#### Denormal and NaN values
+Due to restrictions on the values of the floating point registers, no operation results in `NaN`.
+`FDIV_M` can produce a denormal result. In that case, the result is set to `DBL_MIN = 2.22507385850720138309e-308`, which is the smallest positive normal number.
+
+#### Rounding
+All floating point instructions give correctly rounded results. The rounding mode depends on the value of the `fprc` register:
+
+|`fprc`|rounding mode|
+|-------|------------|
+|0|roundTiesToEven|
+|1|roundTowardNegative|
+|2|roundTowardPositive|
+|3|roundTowardZero|
+
+The rounding modes are defined by the IEEE 754 standard.
+
+## Other instructions
+There are 4 special instructions that either have more than one source operand or have a memory value as the destination operand.
+
+|frequency|instruction|dst|src|operation|
+|-|-|-|-|-|
+|7/256|COND_R|R|R|`if(condition(src, imm32)) dst = dst + 1`
+|1/256|COND_M|R|mem|`if(condition([src], imm32)) dst = dst + 1`
+|1/256|CFROUND|`fprc`|R|`fprc = src >>> imm32`
+|16/256|ISTORE|mem|R|`[dst] = src`
+
+#### COND
+
+These instructions conditionally increment the destination register. The condition function depends on the `mod.cond` flag and takes the lower 32 bits of the source operand and the value `imm32`.
+
+|`mod.cond`|signed|`condition`|probability|*x86*|*ARM*
+|---|---|----------|-----|--|----|
+|0|no|`src <= imm32`|0% - 100%|`JBE`|`BLS`
+|1|no|`src > imm32`|0% - 100%|`JA`|`BHI`
+|2|yes|`src - imm32 < 0`|50%|`JS`|`BMI`
+|3|yes|`src - imm32 >= 0`|50%|`JNS`|`BPL`
+|4|yes|`src - imm32` overflows|0% - 50%|`JO`|`BVS`
+|5|yes|`src - imm32` doesn't overflow|50% - 100%|`JNO`|`BVC`
+|6|yes|`src < imm32`|0% - 100%|`JL`|`BLT`
+|7|yes|`src >= imm32`|0% - 100%|`JGE`|`BGE`
+
+The 'signed' column specifies if the operands are interpreted as signed or unsigned 32-bit numbers. Column 'probability' lists the expected probability the condition is true (range means that the actual value for a specific instruction depends on `imm32`). *Columns 'x86' and 'ARM' list the corresponding hardware instructions (following a `CMP` instruction).*
+
+#### CFROUND
+This instruction sets the value of the `fprc` register to the 2 least significant bits of the source register rotated right by `imm32`. This changes the rounding mode of all subsequent floating point instructions.
+
+#### ISTORE
+The `ISTORE` instruction stores the value of the source integer register to the memory at the address specified by the destination register. The `src` and `dst` registers can be the same.
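Reviewer note on the instruction listing above: the semantics of `IMULH_R`, `ISMULH_R`, `IROR_R` and `CFROUND` can be illustrated with a few small functions. This is a sketch under stated assumptions, not the reference interpreter. The function names are hypothetical, `__int128` is a GCC/Clang extension, and reducing the rotation count modulo 64 is an assumption that the listing does not spell out.

```c++
#include <cstdint>

// IMULH: high 64 bits of the unsigned 128-bit product.
inline std::uint64_t imulhSketch(std::uint64_t a, std::uint64_t b) {
    return static_cast<std::uint64_t>((static_cast<unsigned __int128>(a) * b) >> 64);
}

// ISMULH: high 64 bits of the signed 128-bit product.
// The uint64 -> int64 casts reinterpret the bits as two's complement
// (guaranteed since C++20, and the behavior of GCC/Clang before that).
inline std::uint64_t ismulhSketch(std::uint64_t a, std::uint64_t b) {
    const __int128 p = static_cast<__int128>(static_cast<std::int64_t>(a)) *
                       static_cast<std::int64_t>(b);
    // Converting to unsigned first makes the shift well defined.
    return static_cast<std::uint64_t>(static_cast<unsigned __int128>(p) >> 64);
}

// IROR: rotate right. Assumption: the count is taken modulo 64.
inline std::uint64_t irorSketch(std::uint64_t x, std::uint64_t count) {
    const unsigned c = static_cast<unsigned>(count & 63);
    return (x >> c) | (x << ((64 - c) & 63)); // the mask avoids a shift by 64
}

// CFROUND: fprc = 2 least significant bits of src rotated right by imm32.
inline unsigned cfroundSketch(std::uint64_t src, std::uint32_t imm32) {
    return static_cast<unsigned>(irorSketch(src, imm32) & 3);
}
```

For example, `cfroundSketch(13, 2)` rotates `0b1101` right by two bits and keeps the low two bits, giving 3, which the rounding table lists as roundTowardZero.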
--- a/doc/isa.md
+++ b/doc/isa.md
@@ -1,213 +1,91 @@

-## RandomX instruction set
-RandomX uses a simple low-level language (instruction set), which was designed so that any random bitstring forms a valid program.
+# RandomX instruction set architecture
+RandomX VM is a complex instruction set computer ([CISC](https://en.wikipedia.org/wiki/Complex_instruction_set_computer)). All data are loaded and stored in little-endian byte order. Signed integer numbers are represented using [two's complement](https://en.wikipedia.org/wiki/Two%27s_complement). Floating point numbers are represented using the [IEEE 754 double precision format](https://en.wikipedia.org/wiki/Double-precision_floating-point_format).

-Each RandomX instruction has a length of 128 bits. The encoding is following:
+## Registers

-![Imgur](https://i.imgur.com/mbndESz.png)
+RandomX has 8 integer registers `r0`-`r7` (group R) and a total of 12 floating point registers split into 3 groups: `a0`-`a3` (group A), `f0`-`f3` (group F) and `e0`-`e3` (group E). Integer registers are 64 bits wide, while floating point registers are 128 bits wide and contain a pair of floating point numbers. The lower and upper half of floating point registers are not separately addressable.

-*All flags are aligned to an 8-bit boundary for easier decoding.*
+*Table 1: Addressable register groups*

-#### Opcode
-There are 256 opcodes, which are distributed between 30 instructions based on their weight (how often they will occur in the program on average). Instructions are divided into 5 groups:
+|index|R|A|F|E|F+E|
+|--|--|--|--|--|--|
+|0|`r0`|`a0`|`f0`|`e0`|`f0`|
+|1|`r1`|`a1`|`f1`|`e1`|`f1`|
+|2|`r2`|`a2`|`f2`|`e2`|`f2`|
+|3|`r3`|`a3`|`f3`|`e3`|`f3`|
+|4|`r4`||||`e0`|
+|5|`r5`||||`e1`|
+|6|`r6`||||`e2`|
+|7|`r7`||||`e3`|

-|group|number of opcodes||comment|
-|---------|-----------------|----|------|
-|IA|115|44.9%|integer arithmetic operations
-|IS|21|8.2%|bitwise shift and rotate
-|FA|70|27.4%|floating point arithmetic operations
-|FS|8|3.1%|floating point single-input operations
-|CF|42|16.4%|control flow instructions (branches)
-||**256**|**100%**
+Besides the directly addressable registers above, there is a 2-bit `fprc` register for rounding control, which is an implicit destination register of the `CFROUND` instruction, and two architectural 32-bit registers `ma` and `mx`, which are not accessible to any instruction. 

-#### Operand A
-The first 64-bit operand is read from memory. The location is determined by the `loc(a)` flag:
+Integer registers `r0`-`r7` can be the source or the destination operands of integer instructions or may be used as address registers for loading the source operand from the memory (scratchpad).

-|loc(a)[2:0]|read A from|address size (W)
-|---------|-|-|
-|000|dataset|32 bits|
-|001|dataset|32 bits|
-|010|dataset|32 bits|
-|011|dataset|32 bits|
-|100|scratchpad|15 bits|
-|101|scratchpad|11 bits|
-|110|scratchpad|11 bits|
-|111|scratchpad|11 bits|
+Floating point registers `a0`-`a3` are read-only and may not be written to except at the moment a program is loaded into the VM. They can be the source operand of any floating point instruction. The value of these registers is restricted to the interval `[1, 4294967296)`.

-Flag `reg(a)` encodes an integer register `r0`-`r7`.  The read address is calculated as:
-```
-reg(a) = reg(a) XOR signExtend(addr(a))
-read_addr = reg(a)[W-1:0]
-```
-`W` is the address width from the above table. For reading from the scratchpad, `read_addr` is multiplied by 8 for 8-byte aligned access.
+Floating point registers `f0`-`f3` are the *additive* registers, which can be the destination of floating point addition and subtraction instructions. The absolute value of these registers will not exceed `1.0e+12`.

-#### Operand B
-The second operand is loaded either from a register or from an immediate value encoded within the instruction. The `reg(b)` flag encodes an integer register (instruction groups IA and IS) or a floating point register (instruction group FA). Instruction group FS doesn't use operand B.
+Floating point registers `e0`-`e3` are the *multiplicative* registers, which can be the destination of floating point multiplication, division and square root instructions. Their value is always positive.

-|loc(b)[2:0]|B (IA)|B (IS)|B (FA)|B (FS)
-|---------|-|-|-|-|
-|000|integer `reg(b)`|integer `reg(b)`|floating point `reg(b)`|-
-|001|integer `reg(b)`|integer `reg(b)`|floating point `reg(b)`|-
-|010|integer `reg(b)`|integer `reg(b)`|floating point `reg(b)`|-
-|011|integer `reg(b)`|integer `reg(b)`|floating point `reg(b)`|-
-|100|integer `reg(b)`|`imm8`|floating point `reg(b)`|-
-|101|integer `reg(b)`|`imm8`|floating point `reg(b)`|-
-|110|`imm32`|`imm8`|floating point `reg(b)`|-
-|111|`imm32`|`imm8`|floating point `reg(b)`|-
+## Instruction encoding

-`imm8` is an 8-bit immediate value, which is used for shift and rotate integer instructions (group IS). Only bits 0-5 are used.
+Each instruction word is 64 bits long and has the following format:

-`imm32` is a 32-bit immediate value which is used for integer instructions from group IA.
+![Imgur](https://i.imgur.com/FtkWRwe.png)

-Floating point instructions don't use immediate values.
+### opcode
+There are 256 opcodes, which are distributed between 32 distinct instructions. Each instruction can be encoded using multiple opcodes (the number of opcodes specifies the frequency of the instruction in a random program).

-#### Operand C
-The third operand is the location where the result is stored. It can be a register or a 64-bit scratchpad location, depending on the value of flag `loc(c)`.
+*Table 2: Instruction groups*

-|loc\(c\)[2:0]|address size (W)| C (IA, IS)|C (FA, FS)
-|---------|-|-|-|-|-|
-|000|15 bits|scratchpad|floating point `reg(c)`
-|001|11 bits|scratchpad|floating point `reg(c)`
-|010|11 bits|scratchpad|floating point `reg(c)`
-|011|11 bits|scratchpad|floating point `reg(c)`
-|100|15 bits|integer `reg(c)`|floating point `reg(c)`, scratchpad
-|101|11 bits|integer `reg(c)`|floating point `reg(c)`, scratchpad
-|110|11 bits|integer `reg(c)`|floating point `reg(c)`, scratchpad
-|111|11 bits|integer `reg(c)`|floating point `reg(c)`, scratchpad
+|group|# instructions|# opcodes||
+|---------|-----------------|----|-|
+|integer |19|137|53.5%|
+|floating point |9|94|36.7%|
+|other |4|25|9.8%|
+||**32**|**256**|**100%**

-Integer operations write either to the scratchpad or to a register. Floating point operations always write to a register and can also write to the scratchpad. In that case, bit 3 of the `loc(c)` flag determines if the low or high half of the register is written:
+Full description of all instructions: [isa-ops.md](isa-ops.md).

-|loc\(c\)[3]|write to scratchpad|
-|------------|-----------------------|
-|0|floating point `reg(c)[63:0]`
-|1|floating point `reg(c)[127:64]`
+### dst
+Destination register. Only bits 0-1 (register groups A, F, E) or 0-2 (groups R, F+E) are used to encode a register according to Table 1.

-The FPROUND instruction is an exception and always writes the low half of the register.
+### src

-For writing to the scratchpad, an integer register is always used to calculate the address:
-```
-write_addr = 8 * (addr(c) XOR reg(c)[31:0])[W-1:0]
-```
-*CPUs are typically designed for a 2:1 load:store ratio, so each VM instruction performs on average 1 memory read and 0.5 writes to memory.*
+The `src` flag encodes a source operand register according to Table 1 (only bits 0-1 or 0-2 are used).

-#### imm8
-An 8-bit immediate value that is used as the shift/rotate count by group IS instructions and as the jump offset of the CALL instruction.
+Immediate value `imm32` is used as the source operand in cases when `dst` and `src` encode the same register.

-#### addr(a)
-A 32-bit address mask that is used to calculate the read address for the A operand. It's sign-extended to 64 bits.
+For register-memory instructions, the source operand determines the `address_base` value for calculating the memory address (see below).

-#### addr\(c\)
-A 32-bit address mask that is used to calculate the write address for the C operand. `addr(c)` is equal to `imm32`.
+### mod

-### ALU instructions
+The `mod` flag is encoded as:

-|weight|instruction|group|signed|A width|B width|C|C width|
-|-|-|-|-|-|-|-|-|
-|10|ADD_64|IA|no|64|64|`A + B`|64|
-|2|ADD_32|IA|no|32|32|`A + B`|32|
-|10|SUB_64|IA|no|64|64|`A - B`|64|
-|2|SUB_32|IA|no|32|32|`A - B`|32|
-|21|MUL_64|IA|no|64|64|`A * B`|64|
-|10|MULH_64|IA|no|64|64|`A * B`|64|
-|15|MUL_32|IA|no|32|32|`A * B`|64|
-|15|IMUL_32|IA|yes|32|32|`A * B`|64|
-|10|IMULH_64|IA|yes|64|64|`A * B`|64|
-|1|DIV_64|IA|no|64|32|`A / B`|32|
-|1|IDIV_64|IA|yes|64|32|`A / B`|32|
-|4|AND_64|IA|no|64|64|`A & B`|64|
-|2|AND_32|IA|no|32|32|`A & B`|32|
-|4|OR_64|IA|no|64|64|`A | B`|64|
-|2|OR_32|IA|no|32|32|`A | B`|32|
-|4|XOR_64|IA|no|64|64|`A ^ B`|64|
-|2|XOR_32|IA|no|32|32|`A ^ B`|32|
-|3|SHL_64|IS|no|64|6|`A << B`|64|
-|3|SHR_64|IS|no|64|6|`A >> B`|64|
-|3|SAR_64|IS|yes|64|6|`A >> B`|64|
-|6|ROL_64|IS|no|64|6|`A <<< B`|64|
-|6|ROR_64|IS|no|64|6|`A >>> B`|64|
+*Table 3: mod flag encoding*

-##### 32-bit operations
-Instructions ADD_32, SUB_32, AND_32, OR_32, XOR_32 only use the low-order 32 bits of the input operands. The result of these operations is 32 bits long and bits 32-63 of C are set to zero.
+|`mod`|description|
+|----|--------|
+|0-1|`mod.mem` flag|
+|2-4|`mod.cond` flag|
+|5-7|Reserved|

-##### Multiplication
-There are 5 different multiplication operations. MUL_64 and MULH_64 both take 64-bit unsigned operands, but MUL_64 produces the low 64 bits of the result and MULH_64 produces the high 64 bits. MUL_32 and IMUL_32 use only the low-order 32 bits of the operands and produce a 64-bit result. The signed variant interprets the arguments as signed integers. IMULH_64 takes two 64-bit signed operands and produces the high-order 64 bits of the result.
+The `mod.mem` flag determines the address mask when reading from or writing to memory:

-##### Division
-For the division instructions, the dividend is 64 bits long and the divisor 32 bits long. The IDIV_64 instruction interprets both operands as signed integers. In case of division by zero or signed overflow, the result is equal to the dividend `A`.
+*Table 4: memory address mask*

-*Division by zero can be handled without branching by a conditional move. Signed overflow happens only for the signed variant when the minimum negative value is divided by -1. This rare case must be handled in x86 (ARM produces the "correct" result).*
+|`mod.mem`|`address_mask`|(scratchpad level)|
+|---------|-|---|
+|0|262136|(L2)|
+|1-3|16376|(L1)|

-##### Shift and rotate
-The shift/rotate instructions use just the bottom 6 bits of the `B` operand (`imm8` is used as the immediate value). All treat `A` as unsigned except SAR_64, which performs an arithmetic right shift by copying the sign bit.
+Table 4 applies to all memory accesses except for cases when the source operand is an immediate value. In that case, `address_mask` is equal to 2097144 (L3).

-### FPU instructions
+The address for reading/writing is calculated by applying a bitwise AND operation to `address_base` and `address_mask`.

-|weight|instruction|group|C|
-|-|-|-|-|
-|20|FPADD|FA|`A + B`|
-|20|FPSUB|FA|`A - B`|
-|22|FPMUL|FA|`A * B`|
-|8|FPDIV|FA|`A / B`|
-|6|FPSQRT|FS|`sqrt(abs(A))`|
-|2|FPROUND|FS|`convertSigned52(A)`|
+The `mod.cond` flag is used only by the `COND` instruction to select a condition to be tested.

-All floating point instructions apart FPROUND are vector instructions that operate on two packed double precision floating point values.
-
-#### Conversion of operand A
-Operand A is loaded from memory as a 64-bit value. All floating point instructions apart FPROUND interpret A as two packed 32-bit signed integers and convert them into two packed double precision floating point values.
-
-The FPROUND instruction has a scalar output and interprets A as a 64-bit signed integer. The 11 least-significant bits are cleared before conversion to a double precision format. This is done so the number fits exactly into the 52-bit mantissa without rounding. Output of FPROUND is always written into the lower half of the result register and only this lower half may be written into the scratchpad.
-
-#### Rounding
-FPU instructions conform to the IEEE-754 specification, so they must give correctly rounded results. Initial rounding mode is *roundTiesToEven*. Rounding mode can be changed by the `FPROUND` instruction. Denormal values must be flushed to zero.
-
-#### NaN
-If an operation produces NaN, the result is converted into positive zero. NaN results may never be written into registers or memory. Only division and multiplication must be checked for NaN results (`0.0 / 0.0` and `0.0 * Infinity` result in NaN).
-
-##### FPROUND
-The FPROUND instruction changes the rounding mode for all subsequent FPU operations depending on the two least-significant bits of A.
-
-|A[1:0]|rounding mode|
-|-------|------------|
-|00|roundTiesToEven|
-|01|roundTowardNegative|
-|10|roundTowardPositive|
-|11|roundTowardZero|
-
-The rounding modes are defined by the IEEE-754 standard.
-
-*The two-bit flag value exactly corresponds to bits 13-14 of the x86 `MXCSR` register and bits 23 and 22 (reversed) of the ARM `FPSCR` register.*
-
-### Control instructions
-The following 2 control instructions are supported:
-
-|weight|instruction|function|condition|
-|-|-|-|-|
-|20|CALL|near procedure call|(see condition table below)
-|22|RET|return from procedure|stack is not empty
-
-Both instructions are conditional. If the condition evaluates to `false`, CALL and RET behave as "arithmetic no-op" and simply copy operand A into destination C without jumping.
-
-##### CALL
-The CALL instruction uses a condition function, which takes the lower 32 bits of integer register `reg(b)` and the value `imm32` and evaluates a condition based on the `loc(b)` flag: 
-
-|loc(b)[2:0]|signed|jump condition|probability|*x86*|*ARM*
-|---|---|----------|-----|--|----|
-|000|no|`reg(b)[31:0] <= imm32`|0% - 100%|`JBE`|`BLS`
-|001|no|`reg(b)[31:0] > imm32`|0% - 100%|`JA`|`BHI`
-|010|yes|`reg(b)[31:0] - imm32 < 0`|50%|`JS`|`BMI`
-|011|yes|`reg(b)[31:0] - imm32 >= 0`|50%|`JNS`|`BPL`
-|100|yes|`reg(b)[31:0] - imm32` overflows|0% - 50%|`JO`|`BVS`
-|101|yes|`reg(b)[31:0] - imm32` doesn't overflow|50% - 100%|`JNO`|`BVC`
-|110|yes|`reg(b)[31:0] < imm32`|0% - 100%|`JL`|`BLT`
-|111|yes|`reg(b)[31:0] >= imm32`|0% - 100%|`JGE`|`BGE`
-
-The 'signed' column specifies if the operands are interpreted as signed or unsigned 32-bit numbers. Column 'probability' lists the expected jump probability (range means that the actual value for a specific instruction depends on `imm32`). *Columns 'x86' and 'ARM' list the corresponding hardware instructions (following a `CMP` instruction).*
-
-Taken CALL instruction pushes the values `A` and `pc` (program counter) onto the stack and then performs a forward jump relative to the value of `pc`. The forward offset is equal to `16 * (imm8[6:0] + 1)`. Maximum jump distance is therefore 128 instructions forward (this means that at least 4 correctly spaced CALL instructions are needed to form a loop in the program).
-
-##### RET
-The RET instruction is taken only if the stack is not empty. Taken RET instruction pops the return address `raddr` from the stack (it's the instruction following the previous CALL), then pops a return value `retval` from the stack and sets `C = A XOR retval`. Finally, the instruction jumps back to `raddr`.
-
-## Reference implementation
-A portable C++ implementation of all ALU and FPU instructions is available in [instructionsPortable.cpp](../src/instructionsPortable.cpp).
+### imm32
+A 32-bit immediate value that can be used as the source operand. The immediate value is sign-extended to 64 bits unless specified otherwise.
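
The `mod` flag layout and the address calculation described in the new specification text above can be sketched in C++ roughly as follows. This is a minimal illustration, not code from this commit: the helper names `modMem`, `modCond`, `addressMask` and `memoryAddress` are hypothetical, and the mask values are copied from the address mask table in the specification (the `ScratchpadL1Mask`, `ScratchpadL2Mask` and `ScratchpadL3Mask` constants used by `AssemblyGeneratorX86.cpp` are assumed to hold the same values).

```cpp
#include <cstdint>

// Mask values copied from the specification tables (assumed to match the
// ScratchpadL*Mask constants in the source tree). Each mask is the level
// size in bytes minus 8, which keeps addresses 8-byte aligned.
constexpr uint32_t kL1Mask = 16376;    // 16 KiB - 8
constexpr uint32_t kL2Mask = 262136;   // 256 KiB - 8
constexpr uint32_t kL3Mask = 2097144;  // 2 MiB - 8

// Hypothetical helpers: bits 0-1 of `mod` hold mod.mem, bits 2-4 hold mod.cond.
constexpr uint32_t modMem(uint8_t mod) { return mod & 3u; }
constexpr uint32_t modCond(uint8_t mod) { return (mod >> 2) & 7u; }

// mod.mem == 0 selects L2 and 1-3 select L1, unless the source operand is
// an immediate value, in which case the L3 mask is used.
constexpr uint32_t addressMask(uint8_t mod, bool srcIsImmediate) {
    return srcIsImmediate ? kL3Mask : (modMem(mod) != 0 ? kL1Mask : kL2Mask);
}

// The address is the bitwise AND of address_base and address_mask.
constexpr uint32_t memoryAddress(uint32_t addressBase, uint8_t mod, bool srcIsImmediate) {
    return addressBase & addressMask(mod, srcIsImmediate);
}

// Compile-time checks of the sketch against the tables.
static_assert(memoryAddress(0xFFFFFFFFu, 0, false) == 262136, "mod.mem = 0 selects L2");
static_assert(memoryAddress(0xFFFFFFFFu, 1, false) == 16376, "mod.mem = 1-3 selects L1");
static_assert(memoryAddress(0xFFFFFFFFu, 0, true) == 2097144, "immediate source selects L3");
static_assert(modCond(0x14) == 5, "bits 2-4 encode mod.cond");
```

The register case mirrors the `(instr.mod % 4) ? ScratchpadL1Mask : ScratchpadL2Mask` expression in `genAddressReg`, and the immediate case mirrors `genAddressImm`.
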
--- a/45
+++ b/45
@ -11,12 +11,12 @@ SRCDIR=src
 OBJDIR=obj
 LDFLAGS=-lpthread
 TOBJS=$(addprefix $(OBJDIR)/,instructionsPortable.o TestAluFpu.o)
-ROBJS=$(addprefix $(OBJDIR)/,argon2_core.o argon2_ref.o AssemblyGeneratorX86.o blake2b.o CompiledVirtualMachine.o dataset.o JitCompilerX86.o instructionsPortable.o Instruction.o InterpretedVirtualMachine.o main.o Program.o softAes.o VirtualMachine.o t1ha2.o Cache.o)
+ROBJS=$(addprefix $(OBJDIR)/,argon2_core.o argon2_ref.o AssemblyGeneratorX86.o blake2b.o CompiledVirtualMachine.o dataset.o JitCompilerX86.o instructionsPortable.o Instruction.o InterpretedVirtualMachine.o main.o Program.o softAes.o VirtualMachine.o Cache.o virtualMemory.o divideByConstantCodegen.o LightClientAsyncWorker.o hashAes1Rx4.o)
 ifeq ($(PLATFORM),x86_64)
-    ROBJS += $(OBJDIR)/JitCompilerX86-static.o
+    ROBJS += $(OBJDIR)/JitCompilerX86-static.o $(OBJDIR)/squareHash.o
 endif

-all: release test
+all: release

 release: CXXFLAGS += -march=native -O3 -flto
 release: CCFLAGS += -march=native -O3 -flto
@ -27,6 +27,11 @@ debug: CCFLAGS += -g
 debug: LDFLAGS += -g
 debug: $(BINDIR)/randomx

+profile: CXXFLAGS += -pg
+profile: CCFLAGS += -pg
+profile: LDFLAGS += -pg
+profile: $(BINDIR)/randomx
+
 test: CXXFLAGS += -O0
 test: $(BINDIR)/AluFpuTest

@ -36,7 +41,7 @@ $(BINDIR)/randomx: $(ROBJS) | $(BINDIR)
 $(BINDIR)/AluFpuTest: $(TOBJS) | $(BINDIR)
 	$(CXX) $(TOBJS) $(LDFLAGS) -o $@
  
-$(OBJDIR)/TestAluFpu.o: $(addprefix $(SRCDIR)/,TestAluFpu.cpp instructions.hpp Pcg32.hpp) | $(OBJDIR)
+$(OBJDIR)/TestAluFpu.o: $(addprefix $(SRCDIR)/,TestAluFpu.cpp instructions.hpp) | $(OBJDIR)
 	$(CXX) $(CXXFLAGS) -c $(SRCDIR)/TestAluFpu.cpp -o $@
  
 $(OBJDIR)/argon2_core.o: $(addprefix $(SRCDIR)/,argon2_core.c argon2_core.h blake2/blake2.h blake2/blake2-impl.h) | $(OBJDIR)
@ -45,40 +50,52 @@ $(OBJDIR)/argon2_core.o: $(addprefix $(SRCDIR)/,argon2_core.c argon2_core.h blak
 $(OBJDIR)/argon2_ref.o: $(addprefix $(SRCDIR)/,argon2_ref.c argon2.h argon2_core.h blake2/blake2.h blake2/blake2-impl.h blake2/blamka-round-ref.h) | $(OBJDIR)
 	$(CC) $(CCFLAGS) -c $(SRCDIR)/argon2_ref.c -o $@

-$(OBJDIR)/AssemblyGeneratorX86.o: $(addprefix $(SRCDIR)/,AssemblyGeneratorX86.cpp AssemblyGeneratorX86.hpp Instruction.hpp Pcg32.hpp common.hpp instructions.hpp instructionWeights.hpp) | $(OBJDIR)
+$(OBJDIR)/AssemblyGeneratorX86.o: $(addprefix $(SRCDIR)/,AssemblyGeneratorX86.cpp AssemblyGeneratorX86.hpp Instruction.hpp common.hpp instructionWeights.hpp) | $(OBJDIR)
 	$(CXX) $(CXXFLAGS) -c $(SRCDIR)/AssemblyGeneratorX86.cpp -o $@

 $(OBJDIR)/blake2b.o: $(addprefix $(SRCDIR)/blake2/,blake2b.c blake2.h blake2-impl.h) | $(OBJDIR)
 	$(CC) $(CCFLAGS) -c $(SRCDIR)/blake2/blake2b.c -o $@

-$(OBJDIR)/CompiledVirtualMachine.o: $(addprefix $(SRCDIR)/,CompiledVirtualMachine.cpp CompiledVirtualMachine.hpp Pcg32.hpp common.hpp instructions.hpp) | $(OBJDIR)
+$(OBJDIR)/CompiledVirtualMachine.o: $(addprefix $(SRCDIR)/,CompiledVirtualMachine.cpp CompiledVirtualMachine.hpp common.hpp) | $(OBJDIR)
 	$(CXX) $(CXXFLAGS) -c $(SRCDIR)/CompiledVirtualMachine.cpp -o $@
  
-$(OBJDIR)/dataset.o: $(addprefix $(SRCDIR)/,dataset.cpp common.hpp Pcg32.hpp) | $(OBJDIR)
+$(OBJDIR)/dataset.o: $(addprefix $(SRCDIR)/,dataset.cpp common.hpp) | $(OBJDIR)
 	$(CXX) $(CXXFLAGS) -c $(SRCDIR)/dataset.cpp -o $@

+$(OBJDIR)/divideByConstantCodegen.o: $(addprefix $(SRCDIR)/,divideByConstantCodegen.c divideByConstantCodegen.h) | $(OBJDIR)
+	$(CC) $(CCFLAGS) -c $(SRCDIR)/divideByConstantCodegen.c -o $@
+
+$(OBJDIR)/hashAes1Rx4.o: $(addprefix $(SRCDIR)/,hashAes1Rx4.cpp softAes.h) | $(OBJDIR)
+	$(CXX) $(CXXFLAGS) -c $(SRCDIR)/hashAes1Rx4.cpp -o $@
+
 $(OBJDIR)/JitCompilerX86.o: $(addprefix $(SRCDIR)/,JitCompilerX86.cpp JitCompilerX86.hpp Instruction.hpp instructionWeights.hpp) | $(OBJDIR)
 	$(CXX) $(CXXFLAGS) -c $(SRCDIR)/JitCompilerX86.cpp -o $@

-$(OBJDIR)/JitCompilerX86-static.o: $(addprefix $(SRCDIR)/,JitCompilerX86-static.S $(addprefix asm/program_, prologue_linux.inc prologue_load.inc epilogue_linux.inc epilogue_store.inc read_r.inc read_f.inc)) | $(OBJDIR)
+$(OBJDIR)/JitCompilerX86-static.o: $(addprefix $(SRCDIR)/,JitCompilerX86-static.S $(addprefix asm/program_, prologue_linux.inc prologue_load.inc epilogue_linux.inc epilogue_store.inc read_dataset.inc loop_load.inc loop_store.inc xmm_constants.inc)) | $(OBJDIR)
 	$(CXX) -x assembler-with-cpp -c $(SRCDIR)/JitCompilerX86-static.S -o $@

-$(OBJDIR)/instructionsPortable.o: $(addprefix $(SRCDIR)/,instructionsPortable.cpp instructions.hpp intrinPortable.h) | $(OBJDIR)
+$(OBJDIR)/squareHash.o: $(addprefix $(SRCDIR)/,squareHash.S $(addprefix asm/, squareHash.inc))  | $(OBJDIR)
+	$(CXX) -x assembler-with-cpp -c $(SRCDIR)/squareHash.S -o $@
+
+$(OBJDIR)/instructionsPortable.o: $(addprefix $(SRCDIR)/,instructionsPortable.cpp intrinPortable.h) | $(OBJDIR)
 	$(CXX) $(CXXFLAGS) -c $(SRCDIR)/instructionsPortable.cpp -o $@

 $(OBJDIR)/Instruction.o: $(addprefix $(SRCDIR)/,Instruction.cpp Instruction.hpp instructionWeights.hpp) | $(OBJDIR)
 	$(CXX) $(CXXFLAGS) -c $(SRCDIR)/Instruction.cpp -o $@
  
-$(OBJDIR)/InterpretedVirtualMachine.o: $(addprefix $(SRCDIR)/,InterpretedVirtualMachine.cpp InterpretedVirtualMachine.hpp Pcg32.hpp instructions.hpp instructionWeights.hpp) | $(OBJDIR)
+$(OBJDIR)/InterpretedVirtualMachine.o: $(addprefix $(SRCDIR)/,InterpretedVirtualMachine.cpp InterpretedVirtualMachine.hpp instructionWeights.hpp) | $(OBJDIR)
 	$(CXX) $(CXXFLAGS) -c $(SRCDIR)/InterpretedVirtualMachine.cpp -o $@

+$(OBJDIR)/LightClientAsyncWorker.o: $(addprefix $(SRCDIR)/,LightClientAsyncWorker.cpp LightClientAsyncWorker.hpp common.hpp) | $(OBJDIR)
+	$(CXX) $(CXXFLAGS) -c $(SRCDIR)/LightClientAsyncWorker.cpp -o $@
+  
 $(OBJDIR)/main.o: $(addprefix $(SRCDIR)/,main.cpp InterpretedVirtualMachine.hpp Stopwatch.hpp blake2/blake2.h) | $(OBJDIR)
 	$(CXX) $(CXXFLAGS) -c $(SRCDIR)/main.cpp -o $@
  
-$(OBJDIR)/Program.o: $(addprefix $(SRCDIR)/,Program.cpp Program.hpp Pcg32.hpp) | $(OBJDIR)
+$(OBJDIR)/Program.o: $(addprefix $(SRCDIR)/,Program.cpp Program.hpp) | $(OBJDIR)
 	$(CXX) $(CXXFLAGS) -c $(SRCDIR)/Program.cpp -o $@

-$(OBJDIR)/Cache.o: $(addprefix $(SRCDIR)/,Cache.cpp Cache.hpp Pcg32.hpp argon2_core.h) | $(OBJDIR)
+$(OBJDIR)/Cache.o: $(addprefix $(SRCDIR)/,Cache.cpp Cache.hpp argon2_core.h) | $(OBJDIR)
 	$(CXX) $(CXXFLAGS) -c $(SRCDIR)/Cache.cpp -o $@
  
 $(OBJDIR)/softAes.o: $(addprefix $(SRCDIR)/,softAes.cpp softAes.h) | $(OBJDIR)
@ -87,8 +104,8 @@ $(OBJDIR)/softAes.o: $(addprefix $(SRCDIR)/,softAes.cpp softAes.h) | $(OBJDIR)
 $(OBJDIR)/VirtualMachine.o: $(addprefix $(SRCDIR)/,VirtualMachine.cpp VirtualMachine.hpp common.hpp dataset.hpp) | $(OBJDIR)
 	$(CXX) $(CXXFLAGS) -c $(SRCDIR)/VirtualMachine.cpp -o $@

-$(OBJDIR)/t1ha2.o: $(addprefix $(SRCDIR)/t1ha/,t1ha2.c t1ha.h t1ha_bits.h) | $(OBJDIR)
-	$(CC) $(CCFLAGS) -c $(SRCDIR)/t1ha/t1ha2.c -o $@
+$(OBJDIR)/virtualMemory.o: $(addprefix $(SRCDIR)/,virtualMemory.cpp virtualMemory.hpp) | $(OBJDIR)
+	$(CXX) $(CXXFLAGS) -c $(SRCDIR)/virtualMemory.cpp -o $@
  
 $(OBJDIR):
 	mkdir $(OBJDIR)
--- a/src/AssemblyGeneratorX86.cpp
+++ b/src/AssemblyGeneratorX86.cpp
@ -17,535 +17,528 @@ You should have received a copy of the GNU General Public License
 along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 */
 //#define TRACE
+#define MAGIC_DIVISION
 #include "AssemblyGeneratorX86.hpp"
-#include "Pcg32.hpp"
 #include "common.hpp"
-#include "instructions.hpp"
+#ifdef MAGIC_DIVISION
+#include "divideByConstantCodegen.h"
+#endif
+#include "Program.hpp"

 namespace RandomX {

 	static const char* regR[8] = { "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15" };
 	static const char* regR32[8] = { "r8d", "r9d", "r10d", "r11d", "r12d", "r13d", "r14d", "r15d" };
-	static const char* regF[8] = { "xmm8", "xmm9", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7" };
+	static const char* regFE[8] = { "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7" };
+	static const char* regF[4] = { "xmm0", "xmm1", "xmm2", "xmm3" };
+	static const char* regE[4] = { "xmm4", "xmm5", "xmm6", "xmm7" };
+	static const char* regA[4] = { "xmm8", "xmm9", "xmm10", "xmm11" };

-	void AssemblyGeneratorX86::generateProgram(const void* seed) {
+	static const char* fsumInstr[4] = { "paddb", "paddw", "paddd", "paddq" };
+
+	static const char* regA4 = "xmm12";
+	static const char* dblMin = "xmm13";
+	static const char* absMask = "xmm14";
+	static const char* signMask = "xmm15";
+	static const char* regMx = "rbp";
+	static const char* regIc = "rbx";
+	static const char* regIc32 = "ebx";
+	static const char* regIc8 = "bl";
+	static const char* regDatasetAddr = "rdi";
+	static const char* regScratchpadAddr = "rsi";
+
+	void AssemblyGeneratorX86::generateProgram(Program& prog) {
 		asmCode.str(std::string()); //clear
-		Pcg32 gen(seed);
-		for (unsigned i = 0; i < sizeof(RegisterFile) / sizeof(Pcg32::result_type); ++i) {
-			gen();
-		}
-		Instruction instr;
 		for (unsigned i = 0; i < ProgramLength; ++i) {
-			for (unsigned j = 0; j < sizeof(instr) / sizeof(Pcg32::result_type); ++j) {
-				*(((uint32_t*)&instr) + j) = gen();
-			}
+			Instruction& instr = prog(i);
+			instr.src %= RegistersCount;
+			instr.dst %= RegistersCount;
 			generateCode(instr, i);
-			asmCode << std::endl;
+			//asmCode << std::endl;
 		}
-		if(ProgramLength > 0)
-			asmCode << "\tjmp rx_i_0" << std::endl;
 	}

 	void AssemblyGeneratorX86::generateCode(Instruction& instr, int i) {
-		asmCode << "rx_i_" << i << ": ;" << instr.getName() << std::endl;
-		asmCode << "\tdec edi" << std::endl;
-		asmCode << "\tjz rx_finish" << std::endl;
+		asmCode << "\t; " << instr;
 		auto generator = engine[instr.opcode];
 		(this->*generator)(instr, i);
 	}

-	void AssemblyGeneratorX86::genar(Instruction& instr) {
-		asmCode << "\txor " << regR[instr.rega % RegistersCount] << ", 0" << std::hex << instr.addra << "h" << std::dec << std::endl;
-		switch (instr.loca & 7)
-		{
-		case 0:
-		case 1:
-		case 2:
-		case 3:
-			asmCode << "\tmov ecx, " << regR32[instr.rega % RegistersCount] << std::endl;
-			asmCode << "\tcall rx_read_dataset_r" << std::endl;
-			return;
+	void AssemblyGeneratorX86::genAddressReg(Instruction& instr, const char* reg = "eax") {
+		asmCode << "\tmov " << reg << ", " << regR32[instr.src] << std::endl;
+		asmCode << "\tand " << reg << ", " << ((instr.mod % 4) ? ScratchpadL1Mask : ScratchpadL2Mask) << std::endl;
+	}

-		case 4:
-			asmCode << "\tmov eax, " << regR32[instr.rega % RegistersCount] << std::endl;
-			asmCode << "\tand eax, " << (ScratchpadL2 - 1) << std::endl;
-			asmCode << "\tmov rax, qword ptr [rsi + rax * 8]" << std::endl;
-			return;
+	void AssemblyGeneratorX86::genAddressRegDst(Instruction& instr, int maskAlign = 8) {
+		asmCode << "\tmov eax" << ", " << regR32[instr.dst] << std::endl;
+		asmCode << "\tand eax" << ", " << ((instr.mod % 4) ? (ScratchpadL1Mask & (-maskAlign)) : (ScratchpadL2Mask & (-maskAlign))) << std::endl;
+	}

-		default:
-			asmCode << "\tmov eax, " << regR32[instr.rega % RegistersCount] << std::endl;
-			asmCode << "\tand eax, " << (ScratchpadL1 - 1) << std::endl;
-			asmCode << "\tmov rax, qword ptr [rsi + rax * 8]" << std::endl;
-			return;
+	int32_t AssemblyGeneratorX86::genAddressImm(Instruction& instr) {
+		return (int32_t)instr.imm32 & ScratchpadL3Mask;
+	}
+
+	//1 uOP
+	void AssemblyGeneratorX86::h_IADD_R(Instruction& instr, int i) {
+		if (instr.src != instr.dst) {
+			asmCode << "\tadd " << regR[instr.dst] << ", " << regR[instr.src] << std::endl;
+		}
+		else {
+			asmCode << "\tadd " << regR[instr.dst] << ", " << (int32_t)instr.imm32 << std::endl;
 		}
 	}

-
-	void AssemblyGeneratorX86::genaf(Instruction& instr) {
-		asmCode << "\txor " << regR[instr.rega % RegistersCount] << ", 0" << std::hex << instr.addra << "h" << std::dec << std::endl;
-		switch (instr.loca & 7)
-		{
-		case 0:
-		case 1:
-		case 2:
-		case 3:
-			asmCode << "\tmov ecx, " << regR32[instr.rega % RegistersCount] << std::endl;
-			asmCode << "\tcall rx_read_dataset_f" << std::endl;
-			return;
-
-		case 4:
-			asmCode << "\tmov eax, " << regR32[instr.rega % RegistersCount] << std::endl;
-			asmCode << "\tand eax, " << (ScratchpadL2 - 1) << std::endl;
-			asmCode << "\tcvtdq2pd xmm0, qword ptr [rsi + rax * 8]" << std::endl;
-			return;
-
-		default:
-			asmCode << "\tmov eax, " << regR32[instr.rega % RegistersCount] << std::endl;
-			asmCode << "\tand eax, " << (ScratchpadL1 - 1) << std::endl;
-			asmCode << "\tcvtdq2pd xmm0, qword ptr [rsi + rax * 8]" << std::endl;
-			return;
+	//2.75 uOP
+	void AssemblyGeneratorX86::h_IADD_M(Instruction& instr, int i) {
+		if (instr.src != instr.dst) {
+			genAddressReg(instr);
+			asmCode << "\tadd " << regR[instr.dst] << ", qword ptr [rsi+rax]" << std::endl;
+		}
+		else {
+			asmCode << "\tadd " << regR[instr.dst] << ", qword ptr [rsi+" << genAddressImm(instr) << "]" << std::endl;
 		}
 	}

-	void AssemblyGeneratorX86::genbr0(Instruction& instr, const char* instrx86) {
-		switch (instr.locb & 7)
-		{
-		case 0:
-		case 1:
-		case 2:
-		case 3:
-			asmCode << "\tmov rcx, " << regR[instr.regb % RegistersCount] << std::endl;
-			asmCode << "\t" << instrx86 << " rax, cl" << std::endl;
-			return;
-		default:
-			asmCode << "\t" << instrx86 << " rax, " << (instr.imm8 & 63) << std::endl;;
-			return;
+	//1 uOP
+	void AssemblyGeneratorX86::h_IADD_RC(Instruction& instr, int i) {
+		asmCode << "\tlea " << regR[instr.dst] << ", [" << regR[instr.dst] << "+" << regR[instr.src] << std::showpos << (int32_t)instr.imm32 << std::noshowpos << "]" << std::endl;
+	}
+
+	//1 uOP
+	void AssemblyGeneratorX86::h_ISUB_R(Instruction& instr, int i) {
+		if (instr.src != instr.dst) {
+			asmCode << "\tsub " << regR[instr.dst] << ", " << regR[instr.src] << std::endl;
+		}
+		else {
+			asmCode << "\tsub " << regR[instr.dst] << ", " << (int32_t)instr.imm32 << std::endl;
 		}
 	}

-	void AssemblyGeneratorX86::genbr1(Instruction& instr) {
-		switch (instr.locb & 7)
-		{
-		case 0:
-		case 1:
-		case 2:
-		case 3:
-		case 4:
-		case 5:
-			asmCode << regR[instr.regb % RegistersCount] << std::endl;
-			return;
-		default:
-			asmCode  << instr.imm32 << std::endl;;
-			return;
+	//2.75 uOP
+	void AssemblyGeneratorX86::h_ISUB_M(Instruction& instr, int i) {
+		if (instr.src != instr.dst) {
+			genAddressReg(instr);
+			asmCode << "\tsub " << regR[instr.dst] << ", qword ptr [rsi+rax]" << std::endl;
+		}
+		else {
+			asmCode << "\tsub " << regR[instr.dst] << ", qword ptr [rsi+" << genAddressImm(instr) << "]" << std::endl;
 		}
 	}

-	void AssemblyGeneratorX86::genbr132(Instruction& instr) {
-		switch (instr.locb & 7)
-		{
-		case 0:
-		case 1:
-		case 2:
-		case 3:
-		case 4:
-		case 5:
-			asmCode << regR32[instr.regb % RegistersCount] << std::endl;
-			return;
-		default:
-			asmCode << instr.imm32 << std::endl;;
-			return;
+	//1 uOP
+	void AssemblyGeneratorX86::h_IMUL_9C(Instruction& instr, int i) {
+		asmCode << "\tlea " << regR[instr.dst] << ", [" << regR[instr.dst] << "+" << regR[instr.dst] << "*8" << std::showpos << (int32_t)instr.imm32 << std::noshowpos << "]" << std::endl;
+	}
+
+	//1 uOP
+	void AssemblyGeneratorX86::h_IMUL_R(Instruction& instr, int i) {
+		if (instr.src != instr.dst) {
+			asmCode << "\timul " << regR[instr.dst] << ", " << regR[instr.src] << std::endl;
+		}
+		else {
+			asmCode << "\timul " << regR[instr.dst] << ", " << (int32_t)instr.imm32 << std::endl;
 		}
 	}

-	void AssemblyGeneratorX86::genbf(Instruction& instr, const char* instrx86) {
-		asmCode << "\t" << instrx86 << " xmm0, " << regF[instr.regb % RegistersCount] << std::endl;
+	//2.75 uOP
+	void AssemblyGeneratorX86::h_IMUL_M(Instruction& instr, int i) {
+		if (instr.src != instr.dst) {
+			genAddressReg(instr);
+			asmCode << "\timul " << regR[instr.dst] << ", qword ptr [rsi+rax]" << std::endl;
 		}
-
-	void AssemblyGeneratorX86::gencr(Instruction& instr) {
-		switch (instr.locc & 7)
-		{
-		case 0:
-			asmCode << "\tmov rcx, rax" << std::endl;
-			asmCode << "\tmov eax, " << regR32[instr.regc % RegistersCount] << std::endl;
-			asmCode << "\txor eax, 0" << std::hex << instr.addrc << "h" << std::dec << std::endl;
-			asmCode << "\tand eax, " << (ScratchpadL2 - 1) << std::endl;
-			asmCode << "\tmov qword ptr [rsi + rax * 8], rcx" << std::endl;
-			if (trace) {
-				asmCode << "\tmov qword ptr [rsi + rdi * 8 + 262136], rcx" << std::endl;
-			}
-			return;
-
-		case 1:
-		case 2:
-		case 3:
-			asmCode << "\tmov rcx, rax" << std::endl;
-			asmCode << "\tmov eax, " << regR32[instr.regc % RegistersCount] << std::endl;
-			asmCode << "\txor eax, 0" << std::hex << instr.addrc << "h" << std::dec << std::endl;
-			asmCode << "\tand eax, " << (ScratchpadL1 - 1) << std::endl;
-			asmCode << "\tmov qword ptr [rsi + rax * 8], rcx" << std::endl;
-			if (trace) {
-				asmCode << "\tmov qword ptr [rsi + rdi * 8 + 262136], rcx" << std::endl;
-			}
-			return;
-
-		default:
-			asmCode << "\tmov " << regR[instr.regc % RegistersCount] << ", rax" << std::endl;
-			if (trace) {
-				asmCode << "\tmov qword ptr [rsi + rdi * 8 + 262136], rax" << std::endl;
-			}
-		}
-	}
-
-	void AssemblyGeneratorX86::gencf(Instruction& instr, bool alwaysLow = false) {
-		if(!alwaysLow)
-			asmCode << "\tmovaps " << regF[instr.regc % RegistersCount] << ", xmm0" << std::endl;
-		const char* store = (!alwaysLow && (instr.locc & 8)) ? "movhpd" : "movlpd";
-		switch (instr.locc & 7)
-		{
-			case 4:
-				asmCode << "\tmov eax, " << regR32[instr.regc % RegistersCount] << std::endl;
-				asmCode << "\txor eax, 0" << std::hex << instr.addrc << "h" << std::dec << std::endl;
-				asmCode << "\tand eax, " << (ScratchpadL2 - 1) << std::endl;
-				asmCode << "\t" << store << " qword ptr [rsi + rax * 8], " << regF[instr.regc % RegistersCount] << std::endl;
-				break;
-
-			case 5:
-			case 6:
-			case 7:
-				asmCode << "\tmov eax, " << regR32[instr.regc % RegistersCount] << std::endl;
-				asmCode << "\txor eax, 0" << std::hex << instr.addrc << "h" << std::dec << std::endl;
-				asmCode << "\tand eax, " << (ScratchpadL1 - 1) << std::endl;
-				asmCode << "\t" << store << " qword ptr [rsi + rax * 8], " << regF[instr.regc % RegistersCount] << std::endl;
-				break;
-		}
-		if (trace) {
-			asmCode << "\t" << store << " qword ptr [rsi + rdi * 8 + 262136], " << regF[instr.regc % RegistersCount] << std::endl;
+		else {
+			asmCode << "\timul " << regR[instr.dst] << ", qword ptr [rsi+" << genAddressImm(instr) << "]" << std::endl;
 		}
 	}

-	void AssemblyGeneratorX86::h_ADD_64(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\tadd rax, ";
-		genbr1(instr);
-		gencr(instr);
+	//4 uOPs
+	void AssemblyGeneratorX86::h_IMULH_R(Instruction& instr, int i) {
+		asmCode << "\tmov rax, " << regR[instr.dst] << std::endl;
+		asmCode << "\tmul " << regR[instr.src] << std::endl;
+		asmCode << "\tmov " << regR[instr.dst] << ", rdx" << std::endl;
 	}

-	void AssemblyGeneratorX86::h_ADD_32(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\tadd eax, ";
-		genbr132(instr);
-		gencr(instr);
+	//5.75 uOPs
+	void AssemblyGeneratorX86::h_IMULH_M(Instruction& instr, int i) {
+		if (instr.src != instr.dst) {
+			genAddressReg(instr, "ecx");
+			asmCode << "\tmov rax, " << regR[instr.dst] << std::endl;
+			asmCode << "\tmul qword ptr [rsi+rcx]" << std::endl;
+		}
+		else {
+			asmCode << "\tmov rax, " << regR[instr.dst] << std::endl;
+			asmCode << "\tmul qword ptr [rsi+" << genAddressImm(instr) << "]" << std::endl;
+		}
+		asmCode << "\tmov " << regR[instr.dst] << ", rdx" << std::endl;
 	}

-	void AssemblyGeneratorX86::h_SUB_64(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\tsub rax, ";
-		genbr1(instr);
-		gencr(instr);
+	//4 uOPs
+	void AssemblyGeneratorX86::h_ISMULH_R(Instruction& instr, int i) {
+		asmCode << "\tmov rax, " << regR[instr.dst] << std::endl;
+		asmCode << "\timul " << regR[instr.src] << std::endl;
+		asmCode << "\tmov " << regR[instr.dst] << ", rdx" << std::endl;
 	}

-	void AssemblyGeneratorX86::h_SUB_32(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\tsub eax, ";
-		genbr132(instr);
-		gencr(instr);
+	//5.75 uOPs
+	void AssemblyGeneratorX86::h_ISMULH_M(Instruction& instr, int i) {
+		if (instr.src != instr.dst) {
+			genAddressReg(instr, "ecx");
+			asmCode << "\tmov rax, " << regR[instr.dst] << std::endl;
+			asmCode << "\timul qword ptr [rsi+rcx]" << std::endl;
+		}
+		else {
+			asmCode << "\tmov rax, " << regR[instr.dst] << std::endl;
+			asmCode << "\timul qword ptr [rsi+" << genAddressImm(instr) << "]" << std::endl;
+		}
+		asmCode << "\tmov " << regR[instr.dst] << ", rdx" << std::endl;
 	}

-	void AssemblyGeneratorX86::h_MUL_64(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\timul rax, ";
-		if ((instr.locb & 7) >= 6) {
-			asmCode << "rax, ";
-		}
-		genbr1(instr);
-		gencr(instr);
+	//1 uOP
+	void AssemblyGeneratorX86::h_INEG_R(Instruction& instr, int i) {
+		asmCode << "\tneg " << regR[instr.dst] << std::endl;
 	}

-	void AssemblyGeneratorX86::h_MULH_64(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\tmov rcx, ";
-		genbr1(instr);
+	//1 uOP
+	void AssemblyGeneratorX86::h_IXOR_R(Instruction& instr, int i) {
+		if (instr.src != instr.dst) {
+			asmCode << "\txor " << regR[instr.dst] << ", " << regR[instr.src] << std::endl;
+		}
+		else {
+			asmCode << "\txor " << regR[instr.dst] << ", " << (int32_t)instr.imm32 << std::endl;
+		}
+	}
+
+	//2.75 uOPs
+	void AssemblyGeneratorX86::h_IXOR_M(Instruction& instr, int i) {
+		if (instr.src != instr.dst) {
+			genAddressReg(instr);
+			asmCode << "\txor " << regR[instr.dst] << ", qword ptr [rsi+rax]" << std::endl;
+		}
+		else {
+			asmCode << "\txor " << regR[instr.dst] << ", qword ptr [rsi+" << genAddressImm(instr) << "]" << std::endl;
+		}
+	}
+
+	//1.75 uOPs
+	void AssemblyGeneratorX86::h_IROR_R(Instruction& instr, int i) {
+		if (instr.src != instr.dst) {
+			asmCode << "\tmov ecx, " << regR32[instr.src] << std::endl;
+			asmCode << "\tror " << regR[instr.dst] << ", cl" << std::endl;
+		}
+		else {
+			asmCode << "\tror " << regR[instr.dst] << ", " << (instr.imm32 & 63) << std::endl;
+		}
+	}
+
+	//1.75 uOPs
+	void AssemblyGeneratorX86::h_IROL_R(Instruction& instr, int i) {
+		if (instr.src != instr.dst) {
+			asmCode << "\tmov ecx, " << regR32[instr.src] << std::endl;
+			asmCode << "\trol " << regR[instr.dst] << ", cl" << std::endl;
+		}
+		else {
+			asmCode << "\trol " << regR[instr.dst] << ", " << (instr.imm32 & 63) << std::endl;
+		}
+	}
+
+	//~6 uOPs
+	void AssemblyGeneratorX86::h_IDIV_C(Instruction& instr, int i) {
+		if (instr.imm32 != 0) {
+			uint32_t divisor = instr.imm32;
+			if (divisor & (divisor - 1)) {
+				magicu_info mi = compute_unsigned_magic_info(divisor, sizeof(uint64_t) * 8);
+				if (mi.pre_shift == 0 && !mi.increment) {
+					asmCode << "\tmov rax, " << mi.multiplier << std::endl;
+					asmCode << "\tmul " << regR[instr.dst] << std::endl;
+				}
+				else {
+					asmCode << "\tmov rax, " << regR[instr.dst] << std::endl;
+					if (mi.pre_shift > 0)
+						asmCode << "\tshr rax, " << mi.pre_shift << std::endl;
+					if (mi.increment) {
+						asmCode << "\tadd rax, 1" << std::endl;
+						asmCode << "\tsbb rax, 0" << std::endl;
+					}
+					asmCode << "\tmov rcx, " << mi.multiplier << std::endl;
 					asmCode << "\tmul rcx" << std::endl;
-		asmCode << "\tmov rax, rdx" << std::endl;
-		gencr(instr);
+				}
+				if (mi.post_shift > 0)
+					asmCode << "\tshr rdx, " << mi.post_shift << std::endl;
+				asmCode << "\tadd " << regR[instr.dst] << ", rdx" << std::endl;
+			}
+			else { //divisor is a power of two
+				int shift = 0;
+				while (divisor >>= 1)
+					++shift;
+				if(shift > 0)
+					asmCode << "\tshr " << regR[instr.dst] << ", " << shift << std::endl;
+			}
+		}	
 	}

-	void AssemblyGeneratorX86::h_MUL_32(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\tmov ecx, eax" << std::endl;
-		asmCode << "\tmov eax, ";
-		genbr132(instr);
-		asmCode << "\timul rax, rcx" << std::endl;
-		gencr(instr);
-	}
-
-	void AssemblyGeneratorX86::h_IMUL_32(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\tmovsxd rcx, eax" << std::endl;
-		if ((instr.locb & 7) >= 6) {
-			asmCode << "\tmov rax, " << instr.imm32 << std::endl;
-		}
-		else {
-			asmCode << "\tmovsxd rax, " << regR32[instr.regb % RegistersCount] << std::endl;
-		}
-		asmCode << "\timul rax, rcx" << std::endl;
-		gencr(instr);
-	}
-
-	void AssemblyGeneratorX86::h_IMULH_64(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\tmov rcx, ";
-		genbr1(instr);
-		asmCode << "\timul rcx" << std::endl;
-		asmCode << "\tmov rax, rdx" << std::endl;
-		gencr(instr);
-	}
-
-	void AssemblyGeneratorX86::h_DIV_64(Instruction& instr, int i) {
-		genar(instr);
-		if ((instr.locb & 7) >= 6) {
-			if (instr.imm32 == 0) {
-				asmCode << "\tmov ecx, 1" << std::endl;
-			}
-			else {
-				asmCode << "\tmov ecx, " << instr.imm32 << std::endl;
-			}
-		}
-		else {
-			asmCode << "\tmov ecx, 1" << std::endl;
-			asmCode << "\tmov edx, " << regR32[instr.regb % RegistersCount] << std::endl;
-			asmCode << "\ttest edx, edx" << std::endl;
-			asmCode << "\tcmovne ecx, edx" << std::endl;
-		}
-		asmCode << "\txor edx, edx" << std::endl;
-		asmCode << "\tdiv rcx" << std::endl;
-		gencr(instr);
-	}
-
-	void AssemblyGeneratorX86::h_IDIV_64(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\tmov edx, ";
-		genbr132(instr);
-		asmCode << "\tcmp edx, -1" << std::endl;
-		asmCode << "\tjne short safe_idiv_" << i << std::endl;
+	//~8.5 uOPs
+	void AssemblyGeneratorX86::h_ISDIV_C(Instruction& instr, int i) {
+		int64_t divisor = (int32_t)instr.imm32;
+		if ((divisor & -divisor) == divisor || (divisor & -divisor) == -divisor) {
+			asmCode << "\tmov rax, " << regR[instr.dst] << std::endl;
+			// +/- power of two
+			bool negative = divisor < 0;
+			if (negative)
+				divisor = -divisor;
+			int shift = 0;
+			uint64_t unsignedDivisor = divisor;
+			while (unsignedDivisor >>= 1)
+				++shift;
+			if (shift > 0) {
 				asmCode << "\tmov rcx, rax" << std::endl;
-		asmCode << "\trol rcx, 1" << std::endl;
-		asmCode << "\tdec rcx" << std::endl;
-		asmCode << "\tjz short result_idiv_" << i << std::endl;
-		asmCode << "safe_idiv_" << i << ":" << std::endl;
-		asmCode << "\tmov ecx, 1" << std::endl;
-		asmCode << "\ttest edx, edx" << std::endl;
-		asmCode << "\tcmovne ecx, edx" << std::endl;
-		asmCode << "\tmovsxd rcx, ecx" << std::endl;
-		asmCode << "\tcqo" << std::endl;
-		asmCode << "\tidiv rcx" << std::endl;
-		asmCode << "result_idiv_" << i << ":" << std::endl;
-		gencr(instr);
+				asmCode << "\tsar rcx, 63" << std::endl;
+				uint32_t mask = (1ULL << shift) + 0xFFFFFFFF;
+				asmCode << "\tand ecx, 0" << std::hex << mask << std::dec << "h" << std::endl;
+				asmCode << "\tadd rax, rcx" << std::endl;
+				asmCode << "\tsar rax, " << shift << std::endl;
+			}
+			if (negative)
+				asmCode << "\tneg rax" << std::endl;
+			asmCode << "\tadd " << regR[instr.dst] << ", rax" << std::endl;
+		}
+		else if (divisor != 0) {
+			magics_info mi = compute_signed_magic_info(divisor);
+			asmCode << "\tmov rax, " << mi.multiplier << std::endl;
+			asmCode << "\timul " << regR[instr.dst] << std::endl;
+			//asmCode << "\tmov rax, rdx" << std::endl;
+			asmCode << "\txor eax, eax" << std::endl;
+			bool haveSF = false;
+			if (divisor > 0 && mi.multiplier < 0) {
+				asmCode << "\tadd rdx, " << regR[instr.dst] << std::endl;
+				haveSF = true;
+			}
+			if (divisor < 0 && mi.multiplier > 0) {
+				asmCode << "\tsub rdx, " << regR[instr.dst] << std::endl;
+				haveSF = true;
+			}
+			if (mi.shift > 0) {
+				asmCode << "\tsar rdx, " << mi.shift << std::endl;
+				haveSF = true;
+			}
+			if (!haveSF)
+				asmCode << "\ttest rdx, rdx" << std::endl;
+			asmCode << "\tsets al" << std::endl;
+			asmCode << "\tadd rdx, rax" << std::endl;
+			asmCode << "\tadd " << regR[instr.dst] << ", rdx" << std::endl;
+		}
 	}

-	void AssemblyGeneratorX86::h_AND_64(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\tand rax, ";
-		genbr1(instr);
-		gencr(instr);
+	//2 uOPs
+	void AssemblyGeneratorX86::h_ISWAP_R(Instruction& instr, int i) {
+		if (instr.src != instr.dst) {
+			asmCode << "\txchg " << regR[instr.dst] << ", " << regR[instr.src] << std::endl;
+		}
 	}

-	void AssemblyGeneratorX86::h_AND_32(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\tand eax, ";
-		genbr132(instr);
-		gencr(instr);
+	//1 uOP
+	void AssemblyGeneratorX86::h_FSWAP_R(Instruction& instr, int i) {
+		asmCode << "\tshufpd " << regFE[instr.dst] << ", " << regFE[instr.dst] << ", 1" << std::endl;
 	}

-	void AssemblyGeneratorX86::h_OR_64(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\tor rax, ";
-		genbr1(instr);
-		gencr(instr);
+	//1 uOP
+	void AssemblyGeneratorX86::h_FADD_R(Instruction& instr, int i) {
+		instr.dst %= 4;
+		instr.src %= 4;
+		asmCode << "\taddpd " << regF[instr.dst] << ", " << regA[instr.src] << std::endl;
+		//asmCode << "\t" << fsumInstr[instr.mod % 4] << " " << signMask << ", " << regF[instr.dst] << std::endl;
 	}

-	void AssemblyGeneratorX86::h_OR_32(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\tor eax, ";
-		genbr132(instr);
-		gencr(instr);
+	//5 uOPs
+	void AssemblyGeneratorX86::h_FADD_M(Instruction& instr, int i) {
+		instr.dst %= 4;
+		genAddressReg(instr);
+		asmCode << "\tcvtdq2pd xmm12, qword ptr [rsi+rax]" << std::endl;
+		asmCode << "\taddpd " << regF[instr.dst] << ", xmm12" << std::endl;
 	}

-	void AssemblyGeneratorX86::h_XOR_64(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\txor rax, ";
-		genbr1(instr);
-		gencr(instr);
+	//1 uOP
+	void AssemblyGeneratorX86::h_FSUB_R(Instruction& instr, int i) {
+		instr.dst %= 4;
+		instr.src %= 4;
+		asmCode << "\tsubpd " << regF[instr.dst] << ", " << regA[instr.src] << std::endl;
+		//asmCode << "\t" << fsumInstr[instr.mod % 4] << " " << signMask << ", " << regF[instr.dst] << std::endl;
 	}

-	void AssemblyGeneratorX86::h_XOR_32(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\txor eax, ";
-		genbr132(instr);
-		gencr(instr);
+	//5 uOPs
+	void AssemblyGeneratorX86::h_FSUB_M(Instruction& instr, int i) {
+		instr.dst %= 4;
+		genAddressReg(instr);
+		asmCode << "\tcvtdq2pd xmm12, qword ptr [rsi+rax]" << std::endl;
+		asmCode << "\tsubpd " << regF[instr.dst] << ", xmm12" << std::endl;
 	}

-	void AssemblyGeneratorX86::h_SHL_64(Instruction& instr, int i) {
-		genar(instr);
-		genbr0(instr, "shl");
-		gencr(instr);
+	//1 uOP
+	void AssemblyGeneratorX86::h_FNEG_R(Instruction& instr, int i) {
+		instr.dst %= 4;
+		asmCode << "\txorps " << regF[instr.dst] << ", " << signMask << std::endl;
 	}

-	void AssemblyGeneratorX86::h_SHR_64(Instruction& instr, int i) {
-		genar(instr);
-		genbr0(instr, "shr");
-		gencr(instr);
+	//1 uOP
+	void AssemblyGeneratorX86::h_FMUL_R(Instruction& instr, int i) {
+		instr.dst %= 4;
+		instr.src %= 4;
+		asmCode << "\tmulpd " << regE[instr.dst] << ", " << regA[instr.src] << std::endl;
 	}

-	void AssemblyGeneratorX86::h_SAR_64(Instruction& instr, int i) {
-		genar(instr);
-		genbr0(instr, "sar");
-		gencr(instr);
+	//7 uOPs
+	void AssemblyGeneratorX86::h_FMUL_M(Instruction& instr, int i) {
+		instr.dst %= 4;
+		genAddressReg(instr);
+		asmCode << "\tcvtdq2pd xmm12, qword ptr [rsi+rax]" << std::endl;
+		asmCode << "\tandps xmm12, xmm14" << std::endl;
+		asmCode << "\tmulpd " << regE[instr.dst] << ", xmm12" << std::endl;
+		asmCode << "\tmaxpd " << regE[instr.dst] << ", " << dblMin << std::endl;
 	}

-	void AssemblyGeneratorX86::h_ROL_64(Instruction& instr, int i) {
-		genar(instr);
-		genbr0(instr, "rol");
-		gencr(instr);
+	//2 uOPs
+	void AssemblyGeneratorX86::h_FDIV_R(Instruction& instr, int i) {
+		instr.dst %= 4;
+		instr.src %= 4;
+		asmCode << "\tdivpd " << regE[instr.dst] << ", " << regA[instr.src] << std::endl;
+		asmCode << "\tmaxpd " << regE[instr.dst] << ", " << dblMin << std::endl;
 	}

-	void AssemblyGeneratorX86::h_ROR_64(Instruction& instr, int i) {
-		genar(instr);
-		genbr0(instr, "ror");
-		gencr(instr);
+	//7 uOPs
+	void AssemblyGeneratorX86::h_FDIV_M(Instruction& instr, int i) {
+		instr.dst %= 4;
+		genAddressReg(instr);
+		asmCode << "\tcvtdq2pd xmm12, qword ptr [rsi+rax]" << std::endl;
+		asmCode << "\tandps xmm12, xmm14" << std::endl;
+		asmCode << "\tdivpd " << regE[instr.dst] << ", xmm12" << std::endl;
+		asmCode << "\tmaxpd " << regE[instr.dst] << ", " << dblMin << std::endl;
 	}

-	void AssemblyGeneratorX86::h_FPADD(Instruction& instr, int i) {
-		genaf(instr);
-		genbf(instr, "addpd");
-		gencf(instr);
+	//1 uOP
+	void AssemblyGeneratorX86::h_FSQRT_R(Instruction& instr, int i) {
+		instr.dst %= 4;
+		asmCode << "\tsqrtpd " << regE[instr.dst] << ", " << regE[instr.dst] << std::endl;
 	}	

-	void AssemblyGeneratorX86::h_FPSUB(Instruction& instr, int i) {
-		genaf(instr);
-		genbf(instr, "subpd");
-		gencf(instr);
-	}
-
-	void AssemblyGeneratorX86::h_FPMUL(Instruction& instr, int i) {
-		genaf(instr);
-		genbf(instr, "mulpd");
-		asmCode << "\tmovaps xmm1, xmm0" << std::endl;
-		asmCode << "\tcmpeqpd xmm1, xmm1" << std::endl;
-		asmCode << "\tandps xmm0, xmm1" << std::endl;
-		gencf(instr);
-	}
-
-	void AssemblyGeneratorX86::h_FPDIV(Instruction& instr, int i) {
-		genaf(instr);
-		genbf(instr, "divpd");
-		asmCode << "\tmovaps xmm1, xmm0" << std::endl;
-		asmCode << "\tcmpeqpd xmm1, xmm1" << std::endl;
-		asmCode << "\tandps xmm0, xmm1" << std::endl;
-		gencf(instr);
-	}
-
-	void AssemblyGeneratorX86::h_FPSQRT(Instruction& instr, int i) {
-		genaf(instr);
-		asmCode << "\tandps xmm0, xmm10" << std::endl;
-		asmCode << "\tsqrtpd xmm0, xmm0" << std::endl;
-		gencf(instr);
-	}
-
-	void AssemblyGeneratorX86::h_FPROUND(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\tmov rcx, rax" << std::endl;
-		asmCode << "\tshl eax, 13" << std::endl;
-		asmCode << "\tand rcx, -2048" << std::endl;
+	//6 uOPs
+	void AssemblyGeneratorX86::h_CFROUND(Instruction& instr, int i) {
+		asmCode << "\tmov rax, " << regR[instr.src] << std::endl;
+		int rotate = (13 - (instr.imm32 & 63)) & 63;
+		if (rotate != 0)
+			asmCode << "\trol rax, " << rotate << std::endl;
 		asmCode << "\tand eax, 24576" << std::endl;
-		asmCode << "\tcvtsi2sd " << regF[instr.regc % RegistersCount] << ", rcx" << std::endl;
 		asmCode << "\tor eax, 40896" << std::endl;
 		asmCode << "\tmov dword ptr [rsp-8], eax" << std::endl;
 		asmCode << "\tldmxcsr dword ptr [rsp-8]" << std::endl;
-		gencf(instr, true);
 	}

-	static inline const char* jumpCondition(Instruction& instr, bool invert = false) {
-		switch ((instr.locb & 7) ^ invert)
+	static inline const char* condition(Instruction& instr, bool invert = false) {
+		switch (((instr.mod >> 2) & 7) ^ invert)
 		{
 			case 0:
-				return "jbe";
+				return "be";
 			case 1:
-				return "ja";
+				return "a";
 			case 2:
-				return "js";
+				return "s";
 			case 3:
-				return "jns";
+				return "ns";
 			case 4:
-				return "jo";
+				return "o";
 			case 5:
-				return "jno";
+				return "no";
 			case 6:
-				return "jl";
+				return "l";
 			case 7:
-				return "jge";
+				return "ge";
+			default:
+				UNREACHABLE;
 		}
 	}

-	void AssemblyGeneratorX86::h_CALL(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\tcmp " << regR32[instr.regb % RegistersCount] << ", " << instr.imm32 << std::endl;
-		asmCode << "\t" << jumpCondition(instr);
-		asmCode << " short taken_call_" << i << std::endl;
-		gencr(instr);
-		asmCode << "\tjmp rx_i_" << wrapInstr(i + 1) << std::endl;
-		asmCode << "taken_call_" << i << ":" << std::endl;
-		if (trace) {
-			asmCode << "\tmov qword ptr [rsi + rdi * 8 + 262136], rax" << std::endl;
-		}
-		asmCode << "\tpush rax" << std::endl;
-		asmCode << "\tcall rx_i_" << wrapInstr(i + (instr.imm8 & 127) + 2) << std::endl;
+	//4 uOPs
+	void AssemblyGeneratorX86::h_COND_R(Instruction& instr, int i) {
+		asmCode << "\txor ecx, ecx" << std::endl;
+		asmCode << "\tcmp " << regR32[instr.src] << ", " << (int32_t)instr.imm32 << std::endl;
+		asmCode << "\tset" << condition(instr) << " cl" << std::endl;
+		asmCode << "\tadd " << regR[instr.dst] << ", rcx" << std::endl;
 	}

-	void AssemblyGeneratorX86::h_RET(Instruction& instr, int i) {
-		genar(instr);
-		asmCode << "\tcmp rsp, rbp" << std::endl;
-		asmCode << "\tje short not_taken_ret_" << i << std::endl;
-		asmCode << "\txor rax, qword ptr [rsp + 8]" << std::endl;
-		gencr(instr);
-		asmCode << "\tret 8" << std::endl;
-		asmCode << "not_taken_ret_" << i << ":" << std::endl;
-		gencr(instr);
+	//6 uOPs
+	void AssemblyGeneratorX86::h_COND_M(Instruction& instr, int i) {
+		asmCode << "\txor ecx, ecx" << std::endl;
+		genAddressReg(instr);
+		asmCode << "\tcmp dword ptr [rsi+rax], " << (int32_t)instr.imm32 << std::endl;
+		asmCode << "\tset" << condition(instr) << " cl" << std::endl;
+		asmCode << "\tadd " << regR[instr.dst] << ", rcx" << std::endl;
+	}
+
+	//3 uOPs
+	void AssemblyGeneratorX86::h_ISTORE(Instruction& instr, int i) {
+		genAddressRegDst(instr);
+		asmCode << "\tmov qword ptr [rsi+rax], " << regR[instr.src] << std::endl;
+	}
+
+	//3 uOPs
+	void AssemblyGeneratorX86::h_FSTORE(Instruction& instr, int i) {
+		genAddressRegDst(instr, 16);
+		asmCode << "\tmovapd xmmword ptr [rsi+rax], " << regFE[instr.src] << std::endl;
+	}
+
+	void AssemblyGeneratorX86::h_NOP(Instruction& instr, int i) {
+		asmCode << "\tnop" << std::endl;
 	}

 #include "instructionWeights.hpp"
 #define INST_HANDLE(x) REPN(&AssemblyGeneratorX86::h_##x, WT(x))

 	InstructionGenerator AssemblyGeneratorX86::engine[256] = {
-		INST_HANDLE(ADD_64)
-		INST_HANDLE(ADD_32)
-		INST_HANDLE(SUB_64)
-		INST_HANDLE(SUB_32)
-		INST_HANDLE(MUL_64)
-		INST_HANDLE(MULH_64)
-		INST_HANDLE(MUL_32)
-		INST_HANDLE(IMUL_32)
-		INST_HANDLE(IMULH_64)
-		INST_HANDLE(DIV_64)
-		INST_HANDLE(IDIV_64)
-		INST_HANDLE(AND_64)
-		INST_HANDLE(AND_32)
-		INST_HANDLE(OR_64)
-		INST_HANDLE(OR_32)
-		INST_HANDLE(XOR_64)
-		INST_HANDLE(XOR_32)
-		INST_HANDLE(SHL_64)
-		INST_HANDLE(SHR_64)
-		INST_HANDLE(SAR_64)
-		INST_HANDLE(ROL_64)
-		INST_HANDLE(ROR_64)
-		INST_HANDLE(FPADD)
-		INST_HANDLE(FPSUB)
-		INST_HANDLE(FPMUL)
-		INST_HANDLE(FPDIV)
-		INST_HANDLE(FPSQRT)
-		INST_HANDLE(FPROUND)
-		INST_HANDLE(CALL)
-		INST_HANDLE(RET)
+		//Integer
+		INST_HANDLE(IADD_R)
+		INST_HANDLE(IADD_M)
+		INST_HANDLE(IADD_RC)
+		INST_HANDLE(ISUB_R)
+		INST_HANDLE(ISUB_M)
+		INST_HANDLE(IMUL_9C)
+		INST_HANDLE(IMUL_R)
+		INST_HANDLE(IMUL_M)
+		INST_HANDLE(IMULH_R)
+		INST_HANDLE(IMULH_M)
+		INST_HANDLE(ISMULH_R)
+		INST_HANDLE(ISMULH_M)
+		INST_HANDLE(IDIV_C)
+		INST_HANDLE(ISDIV_C)
+		INST_HANDLE(INEG_R)
+		INST_HANDLE(IXOR_R)
+		INST_HANDLE(IXOR_M)
+		INST_HANDLE(IROR_R)
+		INST_HANDLE(IROL_R)
+		INST_HANDLE(ISWAP_R)
+
+		//Common floating point
+		INST_HANDLE(FSWAP_R)
+
+		//Floating point group F
+		INST_HANDLE(FADD_R)
+		INST_HANDLE(FADD_M)
+		INST_HANDLE(FSUB_R)
+		INST_HANDLE(FSUB_M)
+		INST_HANDLE(FNEG_R)
+
+		//Floating point group E
+		INST_HANDLE(FMUL_R)
+		INST_HANDLE(FMUL_M)
+		INST_HANDLE(FDIV_R)
+		INST_HANDLE(FDIV_M)
+		INST_HANDLE(FSQRT_R)
+
+		//Control
+		INST_HANDLE(COND_R)
+		INST_HANDLE(COND_M)
+		INST_HANDLE(CFROUND)
+
+		INST_HANDLE(ISTORE)
+		INST_HANDLE(FSTORE)
+
+		INST_HANDLE(NOP)
 	};
 }
--- a/src/AssemblyGeneratorX86.hpp
+++ b/src/AssemblyGeneratorX86.hpp
@ -24,13 +24,14 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.

 namespace RandomX {

+	class Program;
 	class AssemblyGeneratorX86;

 	typedef void(AssemblyGeneratorX86::*InstructionGenerator)(Instruction&, int);

 	class AssemblyGeneratorX86 {
 	public:
-		void generateProgram(const void* seed);
+		void generateProgram(Program&);
 		void printCode(std::ostream& os) {
 			os << asmCode.rdbuf();
 		}
@ -38,46 +39,48 @@ namespace RandomX {
 		static InstructionGenerator engine[256];
 		std::stringstream asmCode;

-		void genar(Instruction&);
-		void genaf(Instruction&);
-		void genbr0(Instruction&, const char*);
-		void genbr1(Instruction&);
-		void genbr132(Instruction&);
-		void genbf(Instruction&, const char*);
-		void gencr(Instruction&);
-		void gencf(Instruction&, bool);
+		void genAddressReg(Instruction&, const char*);
+		void genAddressRegDst(Instruction&, int);
+		int32_t genAddressImm(Instruction&);

 		void generateCode(Instruction&, int);

-		void h_ADD_64(Instruction&, int);
-		void h_ADD_32(Instruction&, int);
-		void h_SUB_64(Instruction&, int);
-		void h_SUB_32(Instruction&, int);
-		void h_MUL_64(Instruction&, int);
-		void h_MULH_64(Instruction&, int);
-		void h_MUL_32(Instruction&, int);
-		void h_IMUL_32(Instruction&, int);
-		void h_IMULH_64(Instruction&, int);
-		void h_DIV_64(Instruction&, int);
-		void h_IDIV_64(Instruction&, int);
-		void h_AND_64(Instruction&, int);
-		void h_AND_32(Instruction&, int);
-		void h_OR_64(Instruction&, int);
-		void h_OR_32(Instruction&, int);
-		void h_XOR_64(Instruction&, int);
-		void h_XOR_32(Instruction&, int);
-		void h_SHL_64(Instruction&, int);
-		void h_SHR_64(Instruction&, int);
-		void h_SAR_64(Instruction&, int);
-		void h_ROL_64(Instruction&, int);
-		void h_ROR_64(Instruction&, int);
-		void h_FPADD(Instruction&, int);
-		void h_FPSUB(Instruction&, int);
-		void h_FPMUL(Instruction&, int);
-		void h_FPDIV(Instruction&, int);
-		void h_FPSQRT(Instruction&, int);
-		void h_FPROUND(Instruction&, int);
-		void h_CALL(Instruction&, int);
-		void h_RET(Instruction&, int);
+		void  h_IADD_R(Instruction&, int);
+		void  h_IADD_M(Instruction&, int);
+		void  h_IADD_RC(Instruction&, int);
+		void  h_ISUB_R(Instruction&, int);
+		void  h_ISUB_M(Instruction&, int);
+		void  h_IMUL_9C(Instruction&, int);
+		void  h_IMUL_R(Instruction&, int);
+		void  h_IMUL_M(Instruction&, int);
+		void  h_IMULH_R(Instruction&, int);
+		void  h_IMULH_M(Instruction&, int);
+		void  h_ISMULH_R(Instruction&, int);
+		void  h_ISMULH_M(Instruction&, int);
+		void  h_IDIV_C(Instruction&, int);
+		void  h_ISDIV_C(Instruction&, int);
+		void  h_INEG_R(Instruction&, int);
+		void  h_IXOR_R(Instruction&, int);
+		void  h_IXOR_M(Instruction&, int);
+		void  h_IROR_R(Instruction&, int);
+		void  h_IROL_R(Instruction&, int);
+		void  h_ISWAP_R(Instruction&, int);
+		void  h_FSWAP_R(Instruction&, int);
+		void  h_FADD_R(Instruction&, int);
+		void  h_FADD_M(Instruction&, int);
+		void  h_FSUB_R(Instruction&, int);
+		void  h_FSUB_M(Instruction&, int);
+		void  h_FNEG_R(Instruction&, int);
+		void  h_FMUL_R(Instruction&, int);
+		void  h_FMUL_M(Instruction&, int);
+		void  h_FDIV_R(Instruction&, int);
+		void  h_FDIV_M(Instruction&, int);
+		void  h_FSQRT_R(Instruction&, int);
+		void  h_COND_R(Instruction&, int);
+		void  h_COND_M(Instruction&, int);
+		void  h_CFROUND(Instruction&, int);
+		void  h_ISTORE(Instruction&, int);
+		void  h_FSTORE(Instruction&, int);
+		void  h_NOP(Instruction&, int);
 	};
 }
--- a/src/Cache.cpp
+++ b/src/Cache.cpp
@ -23,7 +23,6 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 #include "Cache.hpp"
 #include "softAes.h"
 #include "argon2.h"
-#include "Pcg32.hpp"
 #include "argon2_core.h"

 namespace RandomX {
@ -134,11 +133,6 @@ namespace RandomX {
 		//Argon2d memory fill
 		argonFill(seed, seedSize);

-		//Circular shift of the cache buffer by 512 bytes
-		//realized by copying the first 512 bytes to the back 
-		//of the buffer and shifting the start by 512 bytes
-		memcpy(memory + CacheSize, memory, CacheShift);
-
 		//AES keys
 		expandAesKeys<softAes>((__m128i*)seed, keys.data());
 	}
--- a/src/Cache.hpp
+++ b/src/Cache.hpp
@ -23,12 +23,32 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 #include <new>
 #include "common.hpp"
 #include "dataset.hpp"
+#include "virtualMemory.hpp"

 namespace RandomX {

 	class Cache {
 	public:
-		void* operator new(size_t size) {
+		static void* alloc(bool largePages) {
+			if (largePages) {
+				return allocLargePagesMemory(sizeof(Cache));
+			}
+			else {
+				void* ptr = _mm_malloc(sizeof(Cache), sizeof(__m128i));
+				if (ptr == nullptr)
+					throw std::bad_alloc();
+				return ptr;
+			}
+		}
+		static void dealloc(Cache* cache, bool largePages) {
+			if (largePages) {
+				//NOTE: large-page memory is not released in this branch
+				//allocLargePagesMemory(sizeof(Cache));
+			}
+			else {
+				_mm_free(cache);
+			}
+		}
+		/*void* operator new(size_t size) {
 			void* ptr = _mm_malloc(size, sizeof(__m128i));
 			if (ptr == nullptr)
 				throw std::bad_alloc();
@ -37,7 +57,7 @@ namespace RandomX {

 		void operator delete(void* ptr) {
 			_mm_free(ptr);
-		}
+		}*/

 		template<bool softAes>
 		void initialize(const void* seed, size_t seedSize);
@ -46,12 +66,12 @@ namespace RandomX {
 			return keys;
 		}

-		const uint8_t* getCache() {
-			return memory + CacheShift;
+		const uint8_t* getCache() const {
+			return memory;
 		}
 	private:
 		alignas(16) KeysContainer keys;
-		uint8_t memory[CacheSize + CacheShift];
+		uint8_t memory[CacheSize];
 		void argonFill(const void* seed, size_t seedSize);
 	};
 }
--- a/src/CompiledVirtualMachine.cpp
+++ b/src/CompiledVirtualMachine.cpp
@ -18,37 +18,28 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 */

 #include "CompiledVirtualMachine.hpp"
-#include "Pcg32.hpp"
 #include "common.hpp"
-#include "instructions.hpp"
 #include <stdexcept>

 namespace RandomX {

-	CompiledVirtualMachine::CompiledVirtualMachine(bool softAes) : VirtualMachine(softAes) {
-
+	CompiledVirtualMachine::CompiledVirtualMachine() {
+		totalSize = 0;
 	}

-	void CompiledVirtualMachine::setDataset(dataset_t ds, bool lightClient) {
-		if (lightClient) {
-			throw std::runtime_error("Compiled VM does not support light-client mode");
-		}
-		VirtualMachine::setDataset(ds, lightClient);
+	void CompiledVirtualMachine::setDataset(dataset_t ds) {
+		mem.ds = ds;
 	}

-	void CompiledVirtualMachine::initializeProgram(const void* seed) {
-		Pcg32 gen(seed);
-		for (unsigned i = 0; i < sizeof(reg) / sizeof(Pcg32::result_type); ++i) {
-			*(((uint32_t*)&reg) + i) = gen();
-		}
-		compiler.generateProgram(gen);
-		mem.ma = (gen() ^ *(((uint32_t*)seed) + 4)) & ~7;
-		mem.mx = *(((uint32_t*)seed) + 5);
+	void CompiledVirtualMachine::initialize() {
+		VirtualMachine::initialize();
+		compiler.generateProgram(program);
 	}

 	void CompiledVirtualMachine::execute() {
-		//executeProgram(reg, mem, scratchpad, readDataset);
-		compiler.getProgramFunc()(reg, mem, scratchpad);
+		//executeProgram(reg, mem, scratchpad, InstructionCount);
+		totalSize += compiler.getCodeSize();
+		compiler.getProgramFunc()(reg, mem, scratchpad, InstructionCount);
 #ifdef TRACEVM
 		for (int32_t i = InstructionCount - 1; i >= 0; --i) {
 			std::cout << std::hex << tracepad[i].u64 << std::endl;
--- a/src/CompiledVirtualMachine.hpp
+++ b/src/CompiledVirtualMachine.hpp
@ -19,24 +19,39 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.

 #pragma once
 //#define TRACEVM
+#include <new>
 #include "VirtualMachine.hpp"
 #include "JitCompilerX86.hpp"
+#include "intrinPortable.h"

 namespace RandomX {

 	class CompiledVirtualMachine : public VirtualMachine {
 	public:
-		CompiledVirtualMachine(bool softAes);
-		void setDataset(dataset_t ds, bool light = false) override;
-		void initializeProgram(const void* seed) override;
+		void* operator new(size_t size) {
+			void* ptr = _mm_malloc(size, 64);
+			if (ptr == nullptr)
+				throw std::bad_alloc();
+			return ptr;
+		}
+		void operator delete(void* ptr) {
+			_mm_free(ptr);
+		}
+		CompiledVirtualMachine();
+		void setDataset(dataset_t ds) override;
+		void initialize() override;
 		virtual void execute() override;
 		void* getProgram() {
 			return compiler.getCode();
 		}
+		uint64_t getTotalSize() {
+			return totalSize;
+		}
 	private:
 #ifdef TRACEVM
 		convertible_t tracepad[InstructionCount];
 #endif
 		JitCompilerX86 compiler;
+		uint64_t totalSize;
 	};
 }
--- a/src/Instruction.cpp
+++ b/src/Instruction.cpp
@ -18,53 +18,419 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 */

 #include "Instruction.hpp"
+#include "common.hpp"

 namespace RandomX {

 	void Instruction::print(std::ostream& os) const {
-			os << "  A: loc = " << std::dec << (loca & 7) << ", reg: " << (rega & 7) << std::endl;
-			os << "  B: loc = " << (locb & 7) << ", reg: " << (regb & 7) << std::endl;
-			os << "  C: loc = " << (locc & 7) << ", reg: " << (regc & 7) << std::endl;
-			os << "  addra = " << std::hex << addra << std::endl;
-			os << "  addrc = " << addrc << std::endl;
-			os << "  imm8 = " << std::dec << (int)imm8 << std::endl;
-			os << "  imm32 = " << imm32 << std::endl;
+		os << names[opcode] << " ";
+		auto handler = engine[opcode];
+		(this->*handler)(os);
+	}
+
+	void Instruction::genAddressReg(std::ostream& os) const {
+		os << ((mod % 4) ? "L1" : "L2") << "[r" << (int)src << "]";
+	}
+
+	void Instruction::genAddressRegDst(std::ostream& os) const {
+		os << ((mod % 4) ? "L1" : "L2") << "[r" << (int)dst << "]";
+	}
+
+	void Instruction::genAddressImm(std::ostream& os) const {
+		os << "L3" << "[" << (imm32 & ScratchpadL3Mask) << "]";
+	}
+
+	void Instruction::h_IADD_R(std::ostream& os) const {
+		if (src != dst) {
+			os << "r" << (int)dst << ", r" << (int)src << std::endl;
+		}
+		else {
+			os << "r" << (int)dst << ", " << (int32_t)imm32 << std::endl;
+		}
+	}
+
+	void Instruction::h_IADD_M(std::ostream& os) const {
+		if (src != dst) {
+			os << "r" << (int)dst << ", ";
+			genAddressReg(os);
+			os << std::endl;
+		}
+		else {
+			os << "r" << (int)dst << ", ";
+			genAddressImm(os);
+			os << std::endl;
+		}
+	}
+
+	void Instruction::h_IADD_RC(std::ostream& os) const {
+		os << "r" << (int)dst << ", r" << (int)src << ", " << (int32_t)imm32 << std::endl;
+	}
+
+	//1 uOP
+	void Instruction::h_ISUB_R(std::ostream& os) const {
+		if (src != dst) {
+			os << "r" << (int)dst << ", r" << (int)src << std::endl;
+		}
+		else {
+			os << "r" << (int)dst << ", " << (int32_t)imm32 << std::endl;
+		}
+	}
+
+	void Instruction::h_ISUB_M(std::ostream& os) const {
+		if (src != dst) {
+			os << "r" << (int)dst << ", ";
+			genAddressReg(os);
+			os << std::endl;
+		}
+		else {
+			os << "r" << (int)dst << ", ";
+			genAddressImm(os);
+			os << std::endl;
+		}
+	}
+
+	void Instruction::h_IMUL_9C(std::ostream& os) const {
+		os << "r" << (int)dst << ", " << (int32_t)imm32 << std::endl;
+	}
+
+	void Instruction::h_IMUL_R(std::ostream& os) const {
+		if (src != dst) {
+			os << "r" << (int)dst << ", r" << (int)src << std::endl;
+		}
+		else {
+			os << "r" << (int)dst << ", " << (int32_t)imm32 << std::endl;
+		}
+	}
+
+	void Instruction::h_IMUL_M(std::ostream& os) const {
+		if (src != dst) {
+			os << "r" << (int)dst << ", ";
+			genAddressReg(os);
+			os << std::endl;
+		}
+		else {
+			os << "r" << (int)dst << ", ";
+			genAddressImm(os);
+			os << std::endl;
+		}
+	}
+
+	void Instruction::h_IMULH_R(std::ostream& os) const {
+		os << "r" << (int)dst << ", r" << (int)src << std::endl;
+	}
+
+	void Instruction::h_IMULH_M(std::ostream& os) const {
+		if (src != dst) {
+			os << "r" << (int)dst << ", ";
+			genAddressReg(os);
+			os << std::endl;
+		}
+		else {
+			os << "r" << (int)dst << ", ";
+			genAddressImm(os);
+			os << std::endl;
+		}
+	}
+
+	void Instruction::h_ISMULH_R(std::ostream& os) const {
+		os << "r" << (int)dst << ", r" << (int)src << std::endl;
+	}
+
+	void Instruction::h_ISMULH_M(std::ostream& os) const {
+		if (src != dst) {
+			os << "r" << (int)dst << ", ";
+			genAddressReg(os);
+			os << std::endl;
+		}
+		else {
+			os << "r" << (int)dst << ", ";
+			genAddressImm(os);
+			os << std::endl;
+		}
+	}
+
+	void Instruction::h_INEG_R(std::ostream& os) const {
+		os << "r" << (int)dst << std::endl;
+	}
+
+	void Instruction::h_IXOR_R(std::ostream& os) const {
+		if (src != dst) {
+			os << "r" << (int)dst << ", r" << (int)src << std::endl;
+		}
+		else {
+			os << "r" << (int)dst << ", " << (int32_t)imm32 << std::endl;
+		}
+	}
+
+	void Instruction::h_IXOR_M(std::ostream& os) const {
+		if (src != dst) {
+			os << "r" << (int)dst << ", ";
+			genAddressReg(os);
+			os << std::endl;
+		}
+		else {
+			os << "r" << (int)dst << ", ";
+			genAddressImm(os);
+			os << std::endl;
+		}
+	}
+
+	void Instruction::h_IROR_R(std::ostream& os) const {
+		if (src != dst) {
+			os << "r" << (int)dst << ", r" << (int)src << std::endl;
+		}
+		else {
+			os << "r" << (int)dst << ", " << (imm32 & 63) << std::endl;
+		}
+	}
+
+	void Instruction::h_IROL_R(std::ostream& os) const {
+		if (src != dst) {
+			os << "r" << (int)dst << ", r" << (int)src << std::endl;
+		}
+		else {
+			os << "r" << (int)dst << ", " << (imm32 & 63) << std::endl;
+		}
+	}
+
+	void Instruction::h_IDIV_C(std::ostream& os) const {
+		os << "r" << (int)dst << ", " << imm32 << std::endl;
+	}
+
+	void Instruction::h_ISDIV_C(std::ostream& os) const {
+		os << "r" << (int)dst << ", " << (int32_t)imm32 << std::endl;
+	}
+
+	void Instruction::h_ISWAP_R(std::ostream& os) const {
+		os << "r" << (int)dst << ", r" << (int)src << std::endl;
+	}
+
+	void Instruction::h_FSWAP_R(std::ostream& os) const {
+		const char reg = (dst >= 4) ? 'e' : 'f';
+		auto dstIndex = dst % 4;
+		os << reg << dstIndex << std::endl;
+	}
+
+	void Instruction::h_FADD_R(std::ostream& os) const {
+		auto dstIndex = dst % 4;
+		auto srcIndex = src % 4;
+		os << "f" << dstIndex << ", a" << srcIndex << std::endl;
+	}
+
+	void Instruction::h_FADD_M(std::ostream& os) const {
+		auto dstIndex = dst % 4;
+		os << "f" << dstIndex << ", ";
+		genAddressReg(os);
+		os << std::endl;
+	}
+
+	void Instruction::h_FSUB_R(std::ostream& os) const {
+		auto dstIndex = dst % 4;
+		auto srcIndex = src % 4;
+		os << "f" << dstIndex << ", a" << srcIndex << std::endl;
+	}
+
+	void Instruction::h_FSUB_M(std::ostream& os) const {
+		auto dstIndex = dst % 4;
+		os << "f" << dstIndex << ", ";
+		genAddressReg(os);
+		os << std::endl;
+	}
+
+	void Instruction::h_FNEG_R(std::ostream& os) const {
+		auto dstIndex = dst % 4;
+		os << "f" << dstIndex << std::endl;
+	}
+
+	void Instruction::h_FMUL_R(std::ostream& os) const {
+		auto dstIndex = dst % 4;
+		auto srcIndex = src % 4;
+		os << "e" << dstIndex << ", a" << srcIndex << std::endl;
+	}
+
+	void Instruction::h_FMUL_M(std::ostream& os) const {
+		auto dstIndex = dst % 4;
+		os << "e" << dstIndex << ", ";
+		genAddressReg(os);
+		os << std::endl;
+	}
+
+	void Instruction::h_FDIV_R(std::ostream& os) const {
+		auto dstIndex = dst % 4;
+		auto srcIndex = src % 4;
+		os << "e" << dstIndex << ", a" << srcIndex << std::endl;
+	}
+
+	void Instruction::h_FDIV_M(std::ostream& os) const {
+		auto dstIndex = dst % 4;
+		os << "e" << dstIndex << ", ";
+		genAddressReg(os);
+		os << std::endl;
+	}
+
+	void Instruction::h_FSQRT_R(std::ostream& os) const {
+		auto dstIndex = dst % 4;
+		os << "e" << dstIndex << std::endl;
+	}
+
+	void Instruction::h_CFROUND(std::ostream& os) const {
+		os << "r" << (int)src << ", " << (imm32 & 63) << std::endl;
+	}
+
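+	// Maps a 3-bit condition index to its mnemonic: be = below or equal, ab = above,
+	// sg = sign, ns = no sign, of = overflow, no = no overflow, lt = less than, ge = greater or equal.
+	static inline const char* condition(int index) {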
+		switch (index)
+		{
+		case 0:
+			return "be";
+		case 1:
+			return "ab";
+		case 2:
+			return "sg";
+		case 3:
+			return "ns";
+		case 4:
+			return "of";
+		case 5:
+			return "no";
+		case 6:
+			return "lt";
+		case 7:
+			return "ge";
+		default:
+			UNREACHABLE;
+		}
+	}
+
+	void Instruction::h_COND_R(std::ostream& os) const {
+		os << "r" << (int)dst << ", " << condition((mod >> 2) & 7) << "(r" << (int)src << ", " << (int32_t)imm32 << ")" << std::endl;
+	}
+
+	void Instruction::h_COND_M(std::ostream& os) const {
+		os << "r" << (int)dst << ", " << condition((mod >> 2) & 7) << "(";
+		genAddressReg(os);
+		os << ", " << (int32_t)imm32 << ")" << std::endl;
+	}
+
+	void  Instruction::h_ISTORE(std::ostream& os) const {
+		genAddressRegDst(os);
+		os << ", r" << (int)src << std::endl;
+	}
+
+	void  Instruction::h_FSTORE(std::ostream& os) const {
+		const char reg = (src >= 4) ? 'e' : 'f';
+		genAddressRegDst(os);
+		auto srcIndex = src % 4;
+		os << ", " << reg << srcIndex << std::endl;
+	}
+
+	void  Instruction::h_NOP(std::ostream& os) const {
+		os << std::endl;
 	}

 #include "instructionWeights.hpp"
 #define INST_NAME(x) REPN(#x, WT(x))
+#define INST_HANDLE(x) REPN(&Instruction::h_##x, WT(x))

 	const char* Instruction::names[256] = {
-		INST_NAME(ADD_64)
-		INST_NAME(ADD_32)
-		INST_NAME(SUB_64)
-		INST_NAME(SUB_32)
-		INST_NAME(MUL_64)
-		INST_NAME(MULH_64)
-		INST_NAME(MUL_32)
-		INST_NAME(IMUL_32)
-		INST_NAME(IMULH_64)
-		INST_NAME(DIV_64)
-		INST_NAME(IDIV_64)
-		INST_NAME(AND_64)
-		INST_NAME(AND_32)
-		INST_NAME(OR_64)
-		INST_NAME(OR_32)
-		INST_NAME(XOR_64)
-		INST_NAME(XOR_32)
-		INST_NAME(SHL_64)
-		INST_NAME(SHR_64)
-		INST_NAME(SAR_64)
-		INST_NAME(ROL_64)
-		INST_NAME(ROR_64)
-		INST_NAME(FPADD)
-		INST_NAME(FPSUB)
-		INST_NAME(FPMUL)
-		INST_NAME(FPDIV)
-		INST_NAME(FPSQRT)
-		INST_NAME(FPROUND)
-		INST_NAME(CALL)
-		INST_NAME(RET)
+		//Integer
+		INST_NAME(IADD_R)
+		INST_NAME(IADD_M)
+		INST_NAME(IADD_RC)
+		INST_NAME(ISUB_R)
+		INST_NAME(ISUB_M)
+		INST_NAME(IMUL_9C)
+		INST_NAME(IMUL_R)
+		INST_NAME(IMUL_M)
+		INST_NAME(IMULH_R)
+		INST_NAME(IMULH_M)
+		INST_NAME(ISMULH_R)
+		INST_NAME(ISMULH_M)
+		INST_NAME(IDIV_C)
+		INST_NAME(ISDIV_C)
+		INST_NAME(INEG_R)
+		INST_NAME(IXOR_R)
+		INST_NAME(IXOR_M)
+		INST_NAME(IROR_R)
+		INST_NAME(IROL_R)
+		INST_NAME(ISWAP_R)
+
+		//Common floating point
+		INST_NAME(FSWAP_R)
+
+		//Floating point group F
+		INST_NAME(FADD_R)
+		INST_NAME(FADD_M)
+		INST_NAME(FSUB_R)
+		INST_NAME(FSUB_M)
+		INST_NAME(FNEG_R)
+
+		//Floating point group E
+		INST_NAME(FMUL_R)
+		INST_NAME(FMUL_M)
+		INST_NAME(FDIV_R)
+		INST_NAME(FDIV_M)
+		INST_NAME(FSQRT_R)
+
+		//Control
+		INST_NAME(COND_R)
+		INST_NAME(COND_M)
+		INST_NAME(CFROUND)
+
+		INST_NAME(ISTORE)
+		INST_NAME(FSTORE)
+
+		INST_NAME(NOP)
+	};
+
+	InstructionVisualizer Instruction::engine[256] = {
+		//Integer
+		INST_HANDLE(IADD_R)
+		INST_HANDLE(IADD_M)
+		INST_HANDLE(IADD_RC)
+		INST_HANDLE(ISUB_R)
+		INST_HANDLE(ISUB_M)
+		INST_HANDLE(IMUL_9C)
+		INST_HANDLE(IMUL_R)
+		INST_HANDLE(IMUL_M)
+		INST_HANDLE(IMULH_R)
+		INST_HANDLE(IMULH_M)
+		INST_HANDLE(ISMULH_R)
+		INST_HANDLE(ISMULH_M)
+		INST_HANDLE(IDIV_C)
+		INST_HANDLE(ISDIV_C)
+		INST_HANDLE(INEG_R)
+		INST_HANDLE(IXOR_R)
+		INST_HANDLE(IXOR_M)
+		INST_HANDLE(IROR_R)
+		INST_HANDLE(IROL_R)
+		INST_HANDLE(ISWAP_R)
+
+		//Common floating point
+		INST_HANDLE(FSWAP_R)
+
+		//Floating point group F
+		INST_HANDLE(FADD_R)
+		INST_HANDLE(FADD_M)
+		INST_HANDLE(FSUB_R)
+		INST_HANDLE(FSUB_M)
+		INST_HANDLE(FNEG_R)
+
+		//Floating point group E
+		INST_HANDLE(FMUL_R)
+		INST_HANDLE(FMUL_M)
+		INST_HANDLE(FDIV_R)
+		INST_HANDLE(FDIV_M)
+		INST_HANDLE(FSQRT_R)
+
+		//Control
+		INST_HANDLE(COND_R)
+		INST_HANDLE(COND_M)
+		INST_HANDLE(CFROUND)
+
+		INST_HANDLE(ISTORE)
+		INST_HANDLE(FSTORE)
+
+		INST_HANDLE(NOP)
 	};

 }
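The `names` and `engine` tables above are filled by repeating each entry `WT(x)` times, so that a random 8-bit opcode selects a handler with probability proportional to its weight. The `REPN` and `WT` macros live in `instructionWeights.hpp`, which this diff does not show, so the following is only a minimal sketch of the same idea using a runtime fill loop instead of macros. `MiniInstruction`, the three handlers kept, and the weights 128/64/64 are all hypothetical and chosen purely for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical miniature of the Instruction class with three handlers only.
class MiniInstruction {
public:
	using Visualizer = void (MiniInstruction::*)(std::ostream&) const;

	uint8_t opcode = 0;
	uint8_t dst = 0;
	uint8_t src = 0;
	uint32_t imm32 = 0;

	// Looks up the name and the handler by opcode, as names[opcode] / engine[opcode] do.
	std::string toString() const {
		std::ostringstream os;
		const Tables& t = tables();
		os << t.names[opcode] << ' ';
		(this->*t.engine[opcode])(os);
		return os.str();
	}

private:
	void h_IADD_R(std::ostream& os) const {
		if (src != dst)
			os << "r" << (int)dst << ", r" << (int)src;
		else
			os << "r" << (int)dst << ", " << (int32_t)imm32;
	}
	void h_INEG_R(std::ostream& os) const { os << "r" << (int)dst; }
	void h_NOP(std::ostream&) const {}

	struct Entry {
		const char* name;
		Visualizer handler;
		int weight; // number of opcode slots this instruction occupies (assumed values)
	};

	struct Tables {
		std::array<const char*, 256> names;
		std::array<Visualizer, 256> engine;
	};

	// Builds both 256-entry tables once; the weights must sum to 256.
	static const Tables& tables() {
		static const Tables t = [] {
			const Entry entries[] = {
				{ "IADD_R", &MiniInstruction::h_IADD_R, 128 },
				{ "INEG_R", &MiniInstruction::h_INEG_R, 64 },
				{ "NOP", &MiniInstruction::h_NOP, 64 },
			};
			Tables out{};
			std::size_t pos = 0;
			for (const Entry& e : entries) {
				for (int i = 0; i < e.weight; ++i, ++pos) {
					out.names[pos] = e.name;
					out.engine[pos] = e.handler;
				}
			}
			assert(pos == 256u);
			return out;
		}();
		return t;
	}
};
```

With these assumed weights, opcodes 0 to 127 print as `IADD_R`, 128 to 191 as `INEG_R`, and 192 to 255 as `NOP`. Keeping the name table and the handler table in the same order is what lets a single opcode index both of them.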
--- a/src/Instruction.hpp
+++ b/src/Instruction.hpp
@ -24,21 +24,57 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.

 namespace RandomX {

+	class Instruction;
+
+	typedef void(Instruction::*InstructionVisualizer)(std::ostream&) const;
+
+	namespace InstructionType {
+		constexpr int IADD_R = 0;
+		constexpr int IADD_M = 1;
+		constexpr int IADD_RC = 2;
+		constexpr int ISUB_R = 3;
+		constexpr int ISUB_M = 4;
+		constexpr int IMUL_9C = 5;
+		constexpr int IMUL_R = 6;
+		constexpr int IMUL_M = 7;
+		constexpr int IMULH_R = 8;
+		constexpr int IMULH_M = 9;
+		constexpr int ISMULH_R = 10;
+		constexpr int ISMULH_M = 11;
+		constexpr int IDIV_C = 12;
+		constexpr int ISDIV_C = 13;
+		constexpr int INEG_R = 14;
+		constexpr int IXOR_R = 15;
+		constexpr int IXOR_M = 16;
+		constexpr int IROR_R = 17;
+		constexpr int IROL_R = 18;
+		constexpr int ISWAP_R = 19;
+		constexpr int FSWAP_R = 20;
+		constexpr int FADD_R = 21;
+		constexpr int FADD_M = 22;
+		constexpr int FSUB_R = 23;
+		constexpr int FSUB_M = 24;
+		constexpr int FNEG_R = 25;
+		constexpr int FMUL_R = 26;
+		constexpr int FMUL_M = 27;
+		constexpr int FDIV_R = 28;
+		constexpr int FDIV_M = 29;
+		constexpr int FSQRT_R = 30;
+		constexpr int COND_R = 31;
+		constexpr int COND_M = 32;
+		constexpr int CFROUND = 33;
+		constexpr int ISTORE = 34;
+		constexpr int FSTORE = 35;
+		constexpr int NOP = 36;
+	}
+
 	class Instruction {
 	public:
 		uint8_t opcode;
-		uint8_t loca;
-		uint8_t rega;
-		uint8_t locb;
-		uint8_t regb;
-		uint8_t locc;
-		uint8_t regc;
-		uint8_t imm8;
-		int32_t addra;
-		union {
-			uint32_t addrc;
-			int32_t imm32;
-		};
+		uint8_t dst;
+		uint8_t src;
+		uint8_t mod;
+		uint32_t imm32;
 		const char* getName() const {
 			return names[opcode];
 		}
@ -49,8 +85,51 @@ namespace RandomX {
 	private:
 		void print(std::ostream&) const;
 		static const char* names[256];
+		static InstructionVisualizer engine[256];
+
+		void genAddressReg(std::ostream& os) const;
+		void genAddressImm(std::ostream& os) const;
+		void genAddressRegDst(std::ostream&) const;
+
+		void  h_IADD_R(std::ostream&) const;
+		void  h_IADD_M(std::ostream&) const;
+		void  h_IADD_RC(std::ostream&) const;
+		void  h_ISUB_R(std::ostream&) const;
+		void  h_ISUB_M(std::ostream&) const;
+		void  h_IMUL_9C(std::ostream&) const;
+		void  h_IMUL_R(std::ostream&) const;
+		void  h_IMUL_M(std::ostream&) const;
+		void  h_IMULH_R(std::ostream&) const;
+		void  h_IMULH_M(std::ostream&) const;
+		void  h_ISMULH_R(std::ostream&) const;
+		void  h_ISMULH_M(std::ostream&) const;
+		void  h_IDIV_C(std::ostream&) const;
+		void  h_ISDIV_C(std::ostream&) const;
+		void  h_INEG_R(std::ostream&) const;
+		void  h_IXOR_R(std::ostream&) const;
+		void  h_IXOR_M(std::ostream&) const;
+		void  h_IROR_R(std::ostream&) const;
+		void  h_IROL_R(std::ostream&) const;
+		void  h_ISWAP_R(std::ostream&) const;
+		void  h_FSWAP_R(std::ostream&) const;
+		void  h_FADD_R(std::ostream&) const;
+		void  h_FADD_M(std::ostream&) const;
+		void  h_FSUB_R(std::ostream&) const;
+		void  h_FSUB_M(std::ostream&) const;
+		void  h_FNEG_R(std::ostream&) const;
+		void  h_FMUL_R(std::ostream&) const;
+		void  h_FMUL_M(std::ostream&) const;
+		void  h_FDIV_R(std::ostream&) const;
+		void  h_FDIV_M(std::ostream&) const;
+		void  h_FSQRT_R(std::ostream&) const;
+		void  h_COND_R(std::ostream&) const;
+		void  h_COND_M(std::ostream&) const;
+		void  h_CFROUND(std::ostream&) const;
+		void  h_ISTORE(std::ostream&) const;
+		void  h_FSTORE(std::ostream&) const;
+		void  h_NOP(std::ostream&) const;
 	};

-	static_assert(sizeof(Instruction) == 16, "Invalid alignment of struct Instruction");
+	static_assert(sizeof(Instruction) == 8, "Invalid alignment of struct Instruction");

 }
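The header change above shrinks `Instruction` from 16 to 8 bytes: four single-byte fields followed by a 32-bit immediate. As a rough illustration of that layout, the sketch below decodes one instruction from 8 raw bytes and extracts the two `mod` sub-fields that the visualizer code uses (`mod % 4` for the L1/L2 choice and `(mod >> 2) & 7` for the condition index). `PackedInstruction`, `decode`, `usesL1` and `conditionIndex` are hypothetical names, and the little-endian ordering of `imm32` in the byte stream is an assumption, since this diff does not show how the program buffer is filled.

```cpp
#include <cassert>
#include <cstdint>

// Same field order as the new Instruction class; the name is hypothetical.
struct PackedInstruction {
	uint8_t opcode;
	uint8_t dst;
	uint8_t src;
	uint8_t mod;
	uint32_t imm32;
};

static_assert(sizeof(PackedInstruction) == 8, "unexpected padding in PackedInstruction");

// Decodes one instruction from 8 bytes. Assembling imm32 byte by byte keeps the
// result independent of host endianness (little-endian stream order is assumed).
inline PackedInstruction decode(const uint8_t (&raw)[8]) {
	PackedInstruction ins;
	ins.opcode = raw[0];
	ins.dst = raw[1];
	ins.src = raw[2];
	ins.mod = raw[3];
	ins.imm32 = (uint32_t)raw[4]
		| ((uint32_t)raw[5] << 8)
		| ((uint32_t)raw[6] << 16)
		| ((uint32_t)raw[7] << 24);
	return ins;
}

// Mirrors genAddressReg: a nonzero (mod % 4) selects L1, zero selects L2.
inline bool usesL1(const PackedInstruction& ins) {
	return (ins.mod % 4) != 0;
}

// Mirrors h_COND_R / h_COND_M: bits 2..4 of mod select one of 8 conditions.
inline unsigned conditionIndex(const PackedInstruction& ins) {
	return (ins.mod >> 2) & 7u;
}
```

Because only one of four `mod % 4` values maps to L2, a uniformly random `mod` byte would pick L1 three times as often as L2 for register-addressed operands.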
--- a/src/InterpretedVirtualMachine.cpp
+++ b/src/InterpretedVirtualMachine.cpp
--- a/src/InterpretedVirtualMachine.hpp
+++ b/src/InterpretedVirtualMachine.hpp
@ -21,27 +21,57 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 //#define STATS
 #include "VirtualMachine.hpp"
 #include "Program.hpp"
-#include <vector>
+#include "intrinPortable.h"

 namespace RandomX {

+	class ITransform {
+	public:
+		virtual int32_t apply(int32_t) const = 0;
+		virtual const char* getName() const = 0;
+		virtual std::ostream& printAsm(std::ostream&) const = 0;
+		virtual std::ostream& printCxx(std::ostream&) const = 0;
+	};
+
+	struct InstructionByteCode;
 	class InterpretedVirtualMachine;

 	typedef void(InterpretedVirtualMachine::*InstructionHandler)(Instruction&);

+	struct alignas(16) InstructionByteCode {
+		int_reg_t* idst;
+		int_reg_t* isrc;
+		int_reg_t imm;
+		__m128d* fdst;
+		__m128d* fsrc;
+		uint32_t condition;
+		uint32_t memMask;
+		uint32_t type;
+		union {
+			uint64_t unsignedMultiplier;
+			int64_t signedMultiplier;
+		};
+		unsigned shift;
+		unsigned preShift;
+		unsigned postShift;
+		bool increment;
+	};
+
+	constexpr int asedwfagdewsa = sizeof(InstructionByteCode);
+
 	class InterpretedVirtualMachine : public VirtualMachine {
 	public:
-		InterpretedVirtualMachine(bool softAes) : VirtualMachine(softAes) {}
-		virtual void initializeProgram(const void* seed) override;
-		virtual void execute() override;
-		const Program& getProgam() {
-			return p;
-		}
+		InterpretedVirtualMachine(bool soft, bool async) : softAes(soft), asyncWorker(async) {}
+		~InterpretedVirtualMachine();
+		void setDataset(dataset_t ds) override;
+		void initialize() override;
+		void execute() override;
 	private:
 		static InstructionHandler engine[256];
-		Program p;
-		std::vector<convertible_t> stack;
-		uint64_t pc, ic;
+		DatasetReadFunc readDataset;
+		bool softAes, asyncWorker;
+		InstructionByteCode byteCode[ProgramLength];
+		
 #ifdef STATS
 		int count_ADD_64 = 0;
 		int count_ADD_32 = 0;
@ -65,17 +95,18 @@ namespace RandomX {
 		int count_SAR_64 = 0;
 		int count_ROL_64 = 0;
 		int count_ROR_64 = 0;
-		int count_FPADD = 0;
-		int count_FPSUB = 0;
-		int count_FPMUL = 0;
-		int count_FPDIV = 0;
-		int count_FPSQRT = 0;
+		int count_FADD = 0;
+		int count_FSUB = 0;
+		int count_FMUL = 0;
+		int count_FDIV = 0;
+		int count_FSQRT = 0;
 		int count_FPROUND = 0;
+		int count_JUMP_taken = 0;
+		int count_JUMP_not_taken = 0;
 		int count_CALL_taken = 0;
 		int count_CALL_not_taken = 0;
 		int count_RET_stack_empty = 0;
 		int count_RET_taken = 0;
-		int count_RET_not_taken = 0;
 		int count_jump_taken[8] = { 0 };
 		int count_jump_not_taken[8] = { 0 };
 		int count_max_stack = 0;
@ -83,66 +114,17 @@ namespace RandomX {
 		int count_retdepth_max = 0;
 		int count_endstack = 0;
 		int count_instructions[ProgramLength] = { 0 };
+		int count_FADD_nop = 0;
+		int count_FADD_nop2 = 0;
+		int count_FSUB_nop = 0;
+		int count_FSUB_nop2 = 0;
+		int count_FMUL_nop = 0;
+		int count_FMUL_nop2 = 0;
+		int datasetAccess[256] = { 0 };
 #endif
-
-		convertible_t loada(Instruction&);
-		convertible_t loadbr0(Instruction&);
-		convertible_t loadbr1(Instruction&);
-		convertible_t& getcr(Instruction&);
-		void writecf(Instruction&, fpu_reg_t&);
-		void writecflo(Instruction&, fpu_reg_t&);
-
-		void stackPush(convertible_t& c) {
-			stack.push_back(c);
-		}
-
-		void stackPush(uint64_t x) {
-			convertible_t c;
-			c.u64 = x;
-			stack.push_back(c);
-		}
-
-		convertible_t stackPopValue() {
-			convertible_t top = stack.back();
-			stack.pop_back();
-			return top;
-		}
-
-		uint64_t stackPopAddress() {
-			convertible_t top = stack.back();
-			stack.pop_back();
-			return top.u64;
-		}
-
-		void h_ADD_64(Instruction&);
-		void h_ADD_32(Instruction&);
-		void h_SUB_64(Instruction&);
-		void h_SUB_32(Instruction&);
-		void h_MUL_64(Instruction&);
-		void h_MULH_64(Instruction&);
-		void h_MUL_32(Instruction&);
-		void h_IMUL_32(Instruction&);
-		void h_IMULH_64(Instruction&);
-		void h_DIV_64(Instruction&);
-		void h_IDIV_64(Instruction&);
-		void h_AND_64(Instruction&);
-		void h_AND_32(Instruction&);
-		void h_OR_64(Instruction&);
-		void h_OR_32(Instruction&);
-		void h_XOR_64(Instruction&);
-		void h_XOR_32(Instruction&);
-		void h_SHL_64(Instruction&);
-		void h_SHR_64(Instruction&);
-		void h_SAR_64(Instruction&);
-		void h_ROL_64(Instruction&);
-		void h_ROR_64(Instruction&);
-		void h_FPADD(Instruction&);
-		void h_FPSUB(Instruction&);
-		void h_FPMUL(Instruction&);
-		void h_FPDIV(Instruction&);
-		void h_FPSQRT(Instruction&);
-		void h_FPROUND(Instruction&);
-		void h_CALL(Instruction&);
-		void h_RET(Instruction&);
+		void precompileProgram(int_reg_t(&r)[8], __m128d (&f)[4], __m128d (&e)[4], __m128d (&a)[4]);
+		template<int N>
+		void executeBytecode(int_reg_t(&r)[8], __m128d (&f)[4], __m128d (&e)[4], __m128d (&a)[4]);
+		void executeBytecode(int i, int_reg_t(&r)[8], __m128d (&f)[4], __m128d (&e)[4], __m128d (&a)[4]);
 	};
 }
--- a/src/JitCompilerX86-static.S
+++ b/src/JitCompilerX86-static.S
@ -27,32 +27,47 @@
 #define DECL(x) x
 #endif
 .global DECL(randomx_program_prologue)
-.global DECL(randomx_program_begin)
+.global DECL(randomx_program_loop_begin)
+.global DECL(randomx_program_loop_load)
+.global DECL(randomx_program_start)
+.global DECL(randomx_program_read_dataset)
+.global DECL(randomx_program_loop_store)
+.global DECL(randomx_program_loop_end)
 .global DECL(randomx_program_epilogue)
-.global DECL(randomx_program_read_r)
-.global DECL(randomx_program_read_f)
 .global DECL(randomx_program_end)

+#define db .byte
+
 .align 64
 DECL(randomx_program_prologue):
 	#include "asm/program_prologue_linux.inc"

 .align 64
-DECL(randomx_program_begin):
+	#include "asm/program_xmm_constants.inc"
+
+.align 64
+DECL(randomx_program_loop_begin):
+	nop
+
+DECL(randomx_program_loop_load):
+	#include "asm/program_loop_load.inc"
+
+DECL(randomx_program_start):
+	nop
+
+DECL(randomx_program_read_dataset):
+	#include "asm/program_read_dataset.inc"
+
+DECL(randomx_program_loop_store):
+	#include "asm/program_loop_store.inc"
+
+DECL(randomx_program_loop_end):
 	nop

 .align 64
 DECL(randomx_program_epilogue):
 	#include "asm/program_epilogue_linux.inc"

-.align 64
-DECL(randomx_program_read_r):
-	#include "asm/program_read_r.inc"
-
-.align 64
-DECL(randomx_program_read_f):
-	#include "asm/program_read_f.inc"
-
 .align 64
 DECL(randomx_program_end):
 	nop
--- a/src/JitCompilerX86-static.asm
+++ b/src/JitCompilerX86-static.asm
@ -15,13 +15,18 @@
 ;# You should have received a copy of the GNU General Public License
 ;# along with RandomX.  If not, see<http://www.gnu.org/licenses/>.

+IFDEF RAX
+
 _RANDOMX_JITX86_STATIC SEGMENT PAGE READ EXECUTE

 PUBLIC randomx_program_prologue
-PUBLIC randomx_program_begin
+PUBLIC randomx_program_loop_begin
+PUBLIC randomx_program_loop_load
+PUBLIC randomx_program_start
+PUBLIC randomx_program_read_dataset
+PUBLIC randomx_program_loop_store
+PUBLIC randomx_program_loop_end
 PUBLIC randomx_program_epilogue
-PUBLIC randomx_program_read_r
-PUBLIC randomx_program_read_f
 PUBLIC randomx_program_end

 ALIGN 64
@ -30,25 +35,38 @@ randomx_program_prologue PROC
 randomx_program_prologue ENDP

 ALIGN 64
-randomx_program_begin PROC
+	include asm/program_xmm_constants.inc
+
+ALIGN 64
+randomx_program_loop_begin PROC
 	nop
-randomx_program_begin ENDP
+randomx_program_loop_begin ENDP
+
+randomx_program_loop_load PROC
+	include asm/program_loop_load.inc
+randomx_program_loop_load ENDP
+
+randomx_program_start PROC
+	nop
+randomx_program_start ENDP
+
+randomx_program_read_dataset PROC
+	include asm/program_read_dataset.inc
+randomx_program_read_dataset ENDP
+
+randomx_program_loop_store PROC
+	include asm/program_loop_store.inc
+randomx_program_loop_store ENDP
+
+randomx_program_loop_end PROC
+	nop
+randomx_program_loop_end ENDP

 ALIGN 64
 randomx_program_epilogue PROC
 	include asm/program_epilogue_win64.inc
 randomx_program_epilogue ENDP

-ALIGN 64
-randomx_program_read_r PROC
-	include asm/program_read_r.inc
-randomx_program_read_r ENDP
-
-ALIGN 64
-randomx_program_read_f PROC
-	include asm/program_read_f.inc
-randomx_program_read_f ENDP
-
 ALIGN 64
 randomx_program_end PROC
 	nop
@ -56,4 +74,6 @@ randomx_program_end ENDP

 _RANDOMX_JITX86_STATIC ENDS

+ENDIF
+
 END
--- a/src/JitCompilerX86-static.hpp
+++ b/src/JitCompilerX86-static.hpp
@ -19,9 +19,12 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.

 extern "C" {
 	void randomx_program_prologue();
-  void randomx_program_begin();
+	void randomx_program_loop_begin();
+	void randomx_program_loop_load();
+	void randomx_program_start();
+	void randomx_program_read_dataset();
+	void randomx_program_loop_store();
+	void randomx_program_loop_end();
 	void randomx_program_epilogue();
-  void randomx_program_read_r();
-  void randomx_program_read_f();
 	void randomx_program_end();
 }
--- a/src/JitCompilerX86.cpp
+++ b/src/JitCompilerX86.cpp
--- a/src/JitCompilerX86.hpp
+++ b/src/JitCompilerX86.hpp
@ -24,94 +24,108 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 #include <cstring>
 #include <vector>

-class Pcg32;
-
 namespace RandomX {

+	class Program;
 	class JitCompilerX86;

-	typedef void(JitCompilerX86::*InstructionGeneratorX86)(Instruction&, int);
+	typedef void(JitCompilerX86::*InstructionGeneratorX86)(Instruction&);

 	constexpr uint32_t CodeSize = 64 * 1024;
-	constexpr uint32_t CacheLineSize = 64;
-
-	struct CallOffset {
-		CallOffset(int32_t p, int32_t i) : pos(p), index(i) {}
-		int32_t pos;
-		int32_t index;
-	};

 	class JitCompilerX86 {
 	public:
 		JitCompilerX86();
-		void generateProgram(Pcg32&);
+		void generateProgram(Program&);
 		ProgramFunc getProgramFunc() {
 			return (ProgramFunc)code;
 		}
 		uint8_t* getCode() {
 			return code;
 		}
+		size_t getCodeSize();
 	private:
 		static InstructionGeneratorX86 engine[256];
 		uint8_t* code;
 		int32_t codePos;
-		std::vector<int32_t> instructionOffsets;
-		std::vector<CallOffset> callOffsets;

-		void genar(Instruction&);
-		void genaf(Instruction&);
-		void genbr0(Instruction&, uint16_t, uint16_t);
-		void genbr1(Instruction&, uint16_t, uint16_t);
-		void genbr132(Instruction&, uint16_t, uint8_t);
-		void genbf(Instruction&, uint8_t);
-		void scratchpadStoreR(Instruction&, uint32_t);
-		void scratchpadStoreF(Instruction&, int, uint32_t, bool);
-		void gencr(Instruction&);
-		void gencf(Instruction&, bool);
-		void generateCode(Instruction&, int);
-		void fixCallOffsets();
+		void genAddressReg(Instruction&, bool);
+		void genAddressRegDst(Instruction&, bool);
+		void genAddressImm(Instruction&);
+		void genSIB(int scale, int index, int base);
+
+		void generateCode(Instruction&);

 		void emitByte(uint8_t val) {
 			code[codePos] = val;
 			codePos++;
 		}

-		template<typename T>
-		void emit(T val) {
-			*reinterpret_cast<T*>(code + codePos) = val;
-			codePos += sizeof(T);
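+		// Writes val byte by byte in little-endian order (no unaligned store, no dependence on host endianness).
+		void emit32(uint32_t val) {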
+			code[codePos + 0] = val;
+			code[codePos + 1] = val >> 8;
+			code[codePos + 2] = val >> 16;
+			code[codePos + 3] = val >> 24;
+			codePos += 4;
 		}

-		void h_ADD_64(Instruction&, int);
-		void h_ADD_32(Instruction&, int);
-		void h_SUB_64(Instruction&, int);
-		void h_SUB_32(Instruction&, int);
-		void h_MUL_64(Instruction&, int);
-		void h_MULH_64(Instruction&, int);
-		void h_MUL_32(Instruction&, int);
-		void h_IMUL_32(Instruction&, int);
-		void h_IMULH_64(Instruction&, int);
-		void h_DIV_64(Instruction&, int);
-		void h_IDIV_64(Instruction&, int);
-		void h_AND_64(Instruction&, int);
-		void h_AND_32(Instruction&, int);
-		void h_OR_64(Instruction&, int);
-		void h_OR_32(Instruction&, int);
-		void h_XOR_64(Instruction&, int);
-		void h_XOR_32(Instruction&, int);
-		void h_SHL_64(Instruction&, int);
-		void h_SHR_64(Instruction&, int);
-		void h_SAR_64(Instruction&, int);
-		void h_ROL_64(Instruction&, int);
-		void h_ROR_64(Instruction&, int);
-		void h_FPADD(Instruction&, int);
-		void h_FPSUB(Instruction&, int);
-		void h_FPMUL(Instruction&, int);
-		void h_FPDIV(Instruction&, int);
-		void h_FPSQRT(Instruction&, int);
-		void h_FPROUND(Instruction&, int);
-		void h_CALL(Instruction&, int);
-		void h_RET(Instruction&, int);
+		void emit64(uint64_t val) {
+			code[codePos + 0] = val;
+			code[codePos + 1] = val >> 8;
+			code[codePos + 2] = val >> 16;
+			code[codePos + 3] = val >> 24;
+			code[codePos + 4] = val >> 32;
+			code[codePos + 5] = val >> 40;
+			code[codePos + 6] = val >> 48;
+			code[codePos + 7] = val >> 56;
+			codePos += 8;
+		}
+
+		template<size_t N>
+		void emit(const uint8_t (&src)[N]) {
+			for (unsigned i = 0; i < N; ++i) {
+				code[codePos + i] = src[i];
+			}
+			codePos += N;
+		}
+
+		void  h_IADD_R(Instruction&);
+		void  h_IADD_M(Instruction&);
+		void  h_IADD_RC(Instruction&);
+		void  h_ISUB_R(Instruction&);
+		void  h_ISUB_M(Instruction&);
+		void  h_IMUL_9C(Instruction&);
+		void  h_IMUL_R(Instruction&);
+		void  h_IMUL_M(Instruction&);
+		void  h_IMULH_R(Instruction&);
+		void  h_IMULH_M(Instruction&);
+		void  h_ISMULH_R(Instruction&);
+		void  h_ISMULH_M(Instruction&);
+		void  h_IDIV_C(Instruction&);
+		void  h_ISDIV_C(Instruction&);
+		void  h_INEG_R(Instruction&);
+		void  h_IXOR_R(Instruction&);
+		void  h_IXOR_M(Instruction&);
+		void  h_IROR_R(Instruction&);
+		void  h_IROL_R(Instruction&);
+		void  h_ISWAP_R(Instruction&);
+		void  h_FSWAP_R(Instruction&);
+		void  h_FADD_R(Instruction&);
+		void  h_FADD_M(Instruction&);
+		void  h_FSUB_R(Instruction&);
+		void  h_FSUB_M(Instruction&);
+		void  h_FNEG_R(Instruction&);
+		void  h_FMUL_R(Instruction&);
+		void  h_FMUL_M(Instruction&);
+		void  h_FDIV_R(Instruction&);
+		void  h_FDIV_M(Instruction&);
+		void  h_FSQRT_R(Instruction&);
+		void  h_COND_R(Instruction&);
+		void  h_COND_M(Instruction&);
+		void  h_CFROUND(Instruction&);
+		void  h_ISTORE(Instruction&);
+		void  h_FSTORE(Instruction&);
+		void  h_NOP(Instruction&);
 	};

 }
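The `emit32`, `emit64` and array `emit` helpers above replace the old `reinterpret_cast` store with explicit byte writes. The following standalone sketch shows the same technique outside the JIT class. `CodeBuffer` is a hypothetical name, and the bounds check inside `emitByte` is an addition that the original helpers do not have.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the code/codePos pair in JitCompilerX86.
class CodeBuffer {
public:
	explicit CodeBuffer(std::size_t capacity) : code(capacity), pos(0) {}

	void emitByte(uint8_t val) {
		assert(pos < code.size()); // added safety check, not in the original
		code[pos++] = val;
	}

	// Least significant byte first, as x86 expects for immediates.
	void emit32(uint32_t val) {
		for (int i = 0; i < 4; ++i)
			emitByte((uint8_t)(val >> (8 * i)));
	}

	void emit64(uint64_t val) {
		for (int i = 0; i < 8; ++i)
			emitByte((uint8_t)(val >> (8 * i)));
	}

	// Copies a fixed byte sequence; N is deduced from the array type.
	template<std::size_t N>
	void emit(const uint8_t (&src)[N]) {
		for (std::size_t i = 0; i < N; ++i)
			emitByte(src[i]);
	}

	std::size_t size() const { return pos; }
	const uint8_t* data() const { return code.data(); }

private:
	std::vector<uint8_t> code;
	std::size_t pos;
};
```

For example, emitting the single opcode byte `0xB8` followed by `emit32(0x11223344)` produces the five bytes `B8 44 33 22 11`, which is the x86 encoding of `mov eax, 0x11223344`.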
--- a/src/LightClientAsyncWorker.cpp
+++ b/src/LightClientAsyncWorker.cpp
@ -0,0 +1,123 @@
+/*
+Copyright (c) 2019 tevador
+
+This file is part of RandomX.
+
+RandomX is free software: you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation, either version 3 of the License, or
+(at your option) any later version.
+
+RandomX is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
+*/
+
+#include "LightClientAsyncWorker.hpp"
+#include "dataset.hpp"
+#include "Cache.hpp"
+
+namespace RandomX {
+
+	template<bool softAes>
+	LightClientAsyncWorker<softAes>::LightClientAsyncWorker(const Cache* c) : ILightClientAsyncWorker(c), output(nullptr), hasWork(false), 
+#ifdef TRACE
+		sw(true),
+#endif
+		workerThread(&LightClientAsyncWorker::runWorker, this) {
+
+	}
+
+	template<bool softAes>
+	void LightClientAsyncWorker<softAes>::prepareBlock(addr_t addr) {
+#ifdef TRACE
+		std::cout << sw.getElapsed() << ": prepareBlock-enter " << addr / CacheLineSize << std::endl;
+#endif
+		{
+			std::lock_guard<std::mutex> lk(mutex);
+			startBlock = addr / CacheLineSize;
+			blockCount = 1;
+			output = currentLine.data();
+			hasWork = true;
+		}
+#ifdef TRACE
+		std::cout << sw.getElapsed() << ": prepareBlock-notify " << startBlock << "/" << blockCount << std::endl;
+#endif
+		notifier.notify_one();
+	}
+
+	template<bool softAes>
+	const uint64_t* LightClientAsyncWorker<softAes>::getBlock(addr_t addr) {
+#ifdef TRACE
+		std::cout << sw.getElapsed() << ": getBlock-enter " << addr / CacheLineSize << std::endl;
+#endif
+		uint32_t currentBlock = addr / CacheLineSize;
+		if (currentBlock != startBlock || output != currentLine.data()) {
+			initBlock(cache->getCache(), (uint8_t*)currentLine.data(), currentBlock, cache->getKeys());
+		}
+		else {
+			sync();
+		}
+#ifdef TRACE
+		std::cout << sw.getElapsed() << ": getBlock-return " << addr / CacheLineSize << std::endl;
+#endif
+		return currentLine.data();
+	}
+
+	template<bool softAes>
+	void LightClientAsyncWorker<softAes>::prepareBlocks(void* out, uint32_t startBlock, uint32_t blockCount) {
+#ifdef TRACE
+		std::cout << sw.getElapsed() << ": prepareBlocks-enter " << startBlock << "/" << blockCount << std::endl;
+#endif
+		{
+			std::lock_guard<std::mutex> lk(mutex);
+			this->startBlock = startBlock;
+			this->blockCount = blockCount;
+			output = out;
+			hasWork = true;
+			notifier.notify_one();
+		}
+	}
+
+	template<bool softAes>
+	void LightClientAsyncWorker<softAes>::getBlocks(void* out, uint32_t startBlock, uint32_t blockCount) {
+		for (uint32_t i = 0; i < blockCount; ++i) {
+			initBlock(cache->getCache(), (uint8_t*)out + CacheLineSize * i, startBlock + i, cache->getKeys());
+		}
+	}
+
+	template<bool softAes>
+	void LightClientAsyncWorker<softAes>::sync() {
+		std::unique_lock<std::mutex> lk(mutex);
+		notifier.wait(lk, [this] { return !hasWork; });
+	}
+
+	template<bool softAes>
+	void LightClientAsyncWorker<softAes>::runWorker() {
+#ifdef TRACE
+		std::cout << sw.getElapsed() << ": runWorker-enter " << std::endl;
+#endif
+		// Note: this loop has no exit condition, so the worker thread is never asked to stop.
+		for (;;) {
+			std::unique_lock<std::mutex> lk(mutex);
+			notifier.wait(lk, [this] { return hasWork; });
+#ifdef TRACE
+			std::cout << sw.getElapsed() << ": runWorker-getBlocks " << startBlock << "/" << blockCount << std::endl;
+#endif
+			//getBlocks(output, startBlock, blockCount);
+			initBlock(cache->getCache(), (uint8_t*)output, startBlock, cache->getKeys());
+			hasWork = false;
+#ifdef TRACE
+			std::cout << sw.getElapsed() << ": runWorker-finished " << startBlock << "/" << blockCount << std::endl;
+#endif
+			lk.unlock();
+			notifier.notify_one();
+		}
+	}
+
+	template class LightClientAsyncWorker<true>;
+	template class LightClientAsyncWorker<false>;
+}
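`LightClientAsyncWorker` hands work to a background thread through a mutex, a condition variable and a `hasWork` flag: `prepareBlock` publishes a request, `runWorker` consumes it, and `sync` waits until the flag is cleared. The sketch below reproduces that handoff in a reduced form, under stated assumptions. `SquareWorker` is a hypothetical class that squares an integer instead of calling `initBlock`, and it adds a `stop` flag and a joining destructor, which the worker in this diff does not have. It also uses `notify_all` rather than `notify_one`, since both sides wait on the same condition variable.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

// Hypothetical reduced version of the prepare/sync handoff.
class SquareWorker {
public:
	SquareWorker() : worker(&SquareWorker::run, this) {}

	// Added for this sketch: request shutdown and join the thread.
	~SquareWorker() {
		{
			std::lock_guard<std::mutex> lk(mutex);
			stop = true;
		}
		notifier.notify_all();
		worker.join();
	}

	// Publishes a request, like prepareBlock.
	void prepare(uint32_t in) {
		{
			std::lock_guard<std::mutex> lk(mutex);
			input = in;
			hasWork = true;
		}
		notifier.notify_all();
	}

	// Waits until the pending request has been processed, like sync.
	uint32_t get() {
		std::unique_lock<std::mutex> lk(mutex);
		notifier.wait(lk, [this] { return !hasWork; });
		return result;
	}

private:
	void run() {
		std::unique_lock<std::mutex> lk(mutex);
		for (;;) {
			notifier.wait(lk, [this] { return hasWork || stop; });
			if (stop)
				return;
			result = input * input; // stands in for the real block computation
			hasWork = false;
			notifier.notify_all();
		}
	}

	std::mutex mutex;
	std::condition_variable notifier;
	uint32_t input = 0;
	uint32_t result = 0;
	bool hasWork = false;
	bool stop = false;
	std::thread worker; // declared last so every other member exists before the thread starts
};
```

Calling `prepare(7)` and then `get()` returns 49 once the worker has cleared `hasWork`. As in the original header, the thread member is declared last so that the mutex, the condition variable and the flags are fully constructed before the worker function can touch them.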
--- a/src/LightClientAsyncWorker.hpp
+++ b/src/LightClientAsyncWorker.hpp
@ -0,0 +1,60 @@
+/*
+Copyright (c) 2019 tevador
+
+This file is part of RandomX.
+
+RandomX is free software: you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation, either version 3 of the License, or
+(at your option) any later version.
+
+RandomX is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
+*/
+
+//#define TRACE
+#include "common.hpp"
+
+#include <thread>
+#include <mutex>
+#include <condition_variable>
+#include <array>
+#ifdef TRACE
+#include "Stopwatch.hpp"
+#include <iostream>
+#endif
+
+namespace RandomX {
+
+	class Cache;
+
+	using DatasetLine = std::array<uint64_t, CacheLineSize / sizeof(uint64_t)>;
+
+	template<bool softAes>
+	class LightClientAsyncWorker : public ILightClientAsyncWorker {
+	public:
+		LightClientAsyncWorker(const Cache*);
+		void prepareBlock(addr_t) final;
+		void prepareBlocks(void* out, uint32_t startBlock, uint32_t blockCount) final;
+		const uint64_t* getBlock(addr_t) final;
+		void getBlocks(void* out, uint32_t startBlock, uint32_t blockCount) final;
+		void sync() final;
+	private:
+		void runWorker();
+		std::condition_variable notifier;
+		std::mutex mutex;
+		alignas(16) DatasetLine currentLine;
+		void* output;
+		uint32_t startBlock, blockCount;
+		bool hasWork;
+#ifdef TRACE
+		Stopwatch sw;
+#endif
+		std::thread workerThread;
+	};
+}
--- a/src/Pcg32.hpp
+++ b/src/Pcg32.hpp
@ -1,72 +0,0 @@
-/*
-Copyright (c) 2018 tevador
-
-This file is part of RandomX.
-
-RandomX is free software: you can redistribute it and/or modify
-it under the terms of the GNU General Public License as published by
-the Free Software Foundation, either version 3 of the License, or
-(at your option) any later version.
-
-RandomX is distributed in the hope that it will be useful,
-but WITHOUT ANY WARRANTY; without even the implied warranty of
-MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
-GNU General Public License for more details.
-
-You should have received a copy of the GNU General Public License
-along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
-*/
-
-// Based on:
-// *Really* minimal PCG32 code / (c) 2014 M.E. O'Neill / pcg-random.org
-// Licensed under Apache License 2.0 (NO WARRANTY, etc. see website)
-
-#pragma once
-#include <cstdint>
-
-#if defined(_MSC_VER)
-#pragma warning (disable : 4146)
-#endif
-
-class Pcg32 {
-public:
-	typedef uint32_t result_type;
-	static constexpr result_type min() { return 0U; }
-	static constexpr result_type max() { return UINT32_MAX; }
-	Pcg32(const void* seed) {
-		auto* u64seed = (const uint64_t*)seed;
-		state = *(u64seed + 0);
-		inc = *(u64seed + 1) | 1ull;
-	}
-	Pcg32(uint64_t state, uint64_t inc) : state(state), inc(inc | 1ull) {
-	}
-	result_type operator()() {
-		return next();
-	}
-	result_type getUniform(result_type min, result_type max) {
-		const result_type range = max - min;
-		const result_type erange = range + 1;
-		result_type ret;
-
-		for (;;) {
-			ret = next();
-			if (ret / erange < UINT32_MAX / erange || UINT32_MAX % erange == range) {
-				ret %= erange;
-				break;
-			}
-		}
-		return ret + min;
-	}
-private:
-	uint64_t state;
-	uint64_t inc;
-	result_type next() {
-		uint64_t oldstate = state;
-		// Advance internal state
-		state = oldstate * 6364136223846793005ULL + inc;
-		// Calculate output function (XSH RR), uses old state for max ILP
-		uint32_t xorshifted = ((oldstate >> 18u) ^ oldstate) >> 27u;
-		uint32_t rot = oldstate >> 59u;
-		return (xorshifted >> rot) | (xorshifted << (-rot & 31));
-	}
-};
--- a/src/Program.cpp
+++ b/src/Program.cpp
@ -18,19 +18,12 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 */

 #include "Program.hpp"
-#include "Pcg32.hpp"
+#include "hashAes1Rx4.hpp"

 namespace RandomX {
-	void Program::initialize(Pcg32& gen) {
-		for (unsigned i = 0; i < sizeof(programBuffer) / sizeof(Pcg32::result_type); ++i) {
-			*(((uint32_t*)&programBuffer) + i) = gen();
-		}
-	}
-
 	void Program::print(std::ostream& os) const {
 		for (int i = 0; i < RandomX::ProgramLength; ++i) {
 			auto instr = programBuffer[i];
-			os << std::dec << instr.getName() << " (" << i << "):" << std::endl;
 			os << instr;
 		}
 	}
--- a/src/Program.hpp
+++ b/src/Program.hpp
@ -24,22 +24,25 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 #include "common.hpp"
 #include "Instruction.hpp"

-class Pcg32;
-
 namespace RandomX {

 	class Program {
 	public:
-		Instruction& operator()(uint64_t pc) {
+		Instruction& operator()(int pc) {
 			return programBuffer[pc];
 		}
-		void initialize(Pcg32& gen);
 		friend std::ostream& operator<<(std::ostream& os, const Program& p) {
 			p.print(os);
 			return os;
 		}
+		uint64_t getEntropy(int i) {
+			return entropyBuffer[i];
+		}
 	private:
 		void print(std::ostream&) const;
+		uint64_t entropyBuffer[16];
 		Instruction programBuffer[ProgramLength];
 	};
+
+	static_assert(sizeof(Program) % 64 == 0, "Invalid size of class Program");
 }
--- a/src/Stopwatch.hpp
+++ b/src/Stopwatch.hpp
@ -53,7 +53,7 @@ public:
 			isRunning = false;
 		}
 	}
-	double getElapsed() {
+	double getElapsed() const {
 		return getElapsedNanosec() / 1e+9;
 	}
 private:
@ -63,7 +63,7 @@ private:
 	uint64_t elapsed;
 	bool isRunning;

-	uint64_t getElapsedNanosec() {
+	uint64_t getElapsedNanosec() const {
 		uint64_t elns = elapsed;
 		if (isRunning) {
 			chrono_t endMark = std::chrono::high_resolution_clock::now();
--- a/src/VirtualMachine.cpp
+++ b/src/VirtualMachine.cpp
@ -19,85 +19,82 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.

 #include "VirtualMachine.hpp"
 #include "common.hpp"
-#include "dataset.hpp"
-#include "Cache.hpp"
-#include "t1ha/t1ha.h"
+#include "hashAes1Rx4.hpp"
 #include "blake2/blake2.h"
 #include <cstring>
 #include <iomanip>
+#include "intrinPortable.h"

 std::ostream& operator<<(std::ostream& os, const RandomX::RegisterFile& rf) {
 	for (int i = 0; i < RandomX::RegistersCount; ++i)
-		os << std::hex << "r" << i << " = " << rf.r[i].u64 << std::endl << std::dec;
-	for (int i = 0; i < RandomX::RegistersCount; ++i)
-		os << std::hex << "f" << i << " = " << rf.f[i].hi.u64 << " (" << rf.f[i].hi.f64 << ")" << std::endl
-		<< "   = " << rf.f[i].lo.u64 << " (" << rf.f[i].lo.f64 << ")" << std::endl << std::dec;
+		os << std::hex << "r" << i << " = " << rf.r[i] << std::endl << std::dec;
+	for (int i = 0; i < 4; ++i)
+		os << std::hex << "f" << i << " = " << *(uint64_t*)&rf.f[i].hi << " (" << rf.f[i].hi << ")" << std::endl
+		<< "   = " << *(uint64_t*)&rf.f[i].lo << " (" << rf.f[i].lo << ")" << std::endl << std::dec;
+	for (int i = 0; i < 4; ++i)
+		os << std::hex << "e" << i << " = " << *(uint64_t*)&rf.e[i].hi << " (" << rf.e[i].hi << ")" << std::endl
+		<< "   = " << *(uint64_t*)&rf.e[i].lo << " (" << rf.e[i].lo << ")" << std::endl << std::dec;
+	for (int i = 0; i < 4; ++i)
+		os << std::hex << "a" << i << " = " << *(uint64_t*)&rf.a[i].hi << " (" << rf.a[i].hi << ")" << std::endl
+		<< "   = " << *(uint64_t*)&rf.a[i].lo << " (" << rf.a[i].lo << ")" << std::endl << std::dec;
 	return os;
 }

 namespace RandomX {

-	VirtualMachine::VirtualMachine(bool softAes) : softAes(softAes), lightClient(false) {
+	constexpr int mantissaSize = 52;
+	constexpr int exponentSize = 11;
+	constexpr uint64_t mantissaMask = (1ULL << mantissaSize) - 1;
+	constexpr uint64_t exponentMask = (1ULL << exponentSize) - 1;
+	constexpr int exponentBias = 1023;
+
+	static inline uint64_t getSmallPositiveFloatBits(uint64_t entropy) {
+		auto exponent = entropy >> 59; //0..31
+		auto mantissa = entropy & mantissaMask;
+		exponent += exponentBias;
+		exponent &= exponentMask;
+		exponent <<= mantissaSize;
+		return exponent | mantissa;
+	}
+
+	VirtualMachine::VirtualMachine() {
 		mem.ds.dataset = nullptr;
 	}

-	VirtualMachine::~VirtualMachine() {
-		if (lightClient) {
-			delete mem.ds.lightDataset->block;
-			delete mem.ds.lightDataset;
-		}
+	void VirtualMachine::resetRoundingMode() {
+		initFpu();
 	}

-	void VirtualMachine::setDataset(dataset_t ds, bool light) {
-		if (mem.ds.dataset != nullptr) {
-			throw std::runtime_error("Dataset is already initialized");
-		}
-		lightClient = light;
-		if (light) {
-			auto lds = mem.ds.lightDataset = new LightClientDataset();
-			lds->cache = ds.cache;
-			lds->block = (uint8_t*)_mm_malloc(DatasetBlockSize, sizeof(__m128i));
-			lds->blockNumber = -1;
-			if (lds->block == nullptr) {
-				throw std::bad_alloc();
-			}
-			if (softAes) {
-				readDataset = &datasetReadLight<true>;
-			}
-			else {
-				readDataset = &datasetReadLight<false>;
-			}
-		}
-		else {
-			mem.ds = ds;
-			readDataset = &datasetRead;
-		}
+	void VirtualMachine::initialize() {
+		store64(&reg.a[0].lo, getSmallPositiveFloatBits(program.getEntropy(0)));
+		store64(&reg.a[0].hi, getSmallPositiveFloatBits(program.getEntropy(1)));
+		store64(&reg.a[1].lo, getSmallPositiveFloatBits(program.getEntropy(2)));
+		store64(&reg.a[1].hi, getSmallPositiveFloatBits(program.getEntropy(3)));
+		store64(&reg.a[2].lo, getSmallPositiveFloatBits(program.getEntropy(4)));
+		store64(&reg.a[2].hi, getSmallPositiveFloatBits(program.getEntropy(5)));
+		store64(&reg.a[3].lo, getSmallPositiveFloatBits(program.getEntropy(6)));
+		store64(&reg.a[3].hi, getSmallPositiveFloatBits(program.getEntropy(7)));
+		mem.ma = program.getEntropy(8) & CacheLineAlignMask;
+		mem.mx = program.getEntropy(10);
+		auto addressRegisters = program.getEntropy(12);
+		readReg0 = 0 + (addressRegisters & 1);
+		addressRegisters >>= 1;
+		readReg1 = 2 + (addressRegisters & 1);
+		addressRegisters >>= 1;
+		readReg2 = 4 + (addressRegisters & 1);
+		addressRegisters >>= 1;
+		readReg3 = 6 + (addressRegisters & 1);
 	}

-	void VirtualMachine::initializeScratchpad(uint32_t index) {
-		if (lightClient) {
-			auto cache = mem.ds.lightDataset->cache;
-			if (softAes) {
-				for (int i = 0; i < ScratchpadSize / DatasetBlockSize; ++i) {
-					initBlock<true>(cache->getCache(), ((uint8_t*)scratchpad) + DatasetBlockSize * i, (ScratchpadSize / DatasetBlockSize) * index + i, cache->getKeys());
-				}
-			}
-			else {
-				for (int i = 0; i < ScratchpadSize / DatasetBlockSize; ++i) {
-					initBlock<false>(cache->getCache(), ((uint8_t*)scratchpad) + DatasetBlockSize * i, (ScratchpadSize / DatasetBlockSize) * index + i, cache->getKeys());
-				}
-			}
-		}
-		else {
-			memcpy(scratchpad, mem.ds.dataset + ScratchpadSize * index, ScratchpadSize);
+	template<bool softAes>
+	void VirtualMachine::getResult(void* scratchpad, size_t scratchpadSize, void* outHash) {
+		if (scratchpadSize > 0) {
+			hashAes1Rx4<softAes>(scratchpad, scratchpadSize, &reg.a);
 		}
+		blake2b(outHash, ResultSize, &reg, sizeof(RegisterFile), nullptr, 0);
 	}

-	void VirtualMachine::getResult(void* out) {
-		constexpr size_t smallStateLength = sizeof(RegisterFile) / sizeof(uint64_t) + 2;
-		uint64_t smallState[smallStateLength];
-		memcpy(smallState, &reg, sizeof(RegisterFile));
-		smallState[smallStateLength - 1] = t1ha2_atonce128(&smallState[smallStateLength - 2], scratchpad, ScratchpadSize, reg.r[0].u64);
-		blake2b(out, ResultSize, smallState, sizeof(smallState), nullptr, 0);
-	}
+	template void VirtualMachine::getResult<false>(void* scratchpad, size_t scratchpadSize, void* outHash);
+	template void VirtualMachine::getResult<true>(void* scratchpad, size_t scratchpadSize, void* outHash);
+
 }
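The bit manipulation in `getSmallPositiveFloatBits` can be checked in isolation. The following is a minimal standalone sketch, not the project's code: the `k`-prefixed constants and the helper `smallPositiveFloat` are hypothetical names, and the sketch assumes IEEE 754 binary64 doubles. Because the exponent field is built from only the top 5 bits of the entropy word, the resulting value should always lie in the range [1, 2^32).

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Same values as the constants added to VirtualMachine.cpp (renamed for the sketch)
constexpr int kMantissaSize = 52;
constexpr int kExponentSize = 11;
constexpr uint64_t kMantissaMask = (1ULL << kMantissaSize) - 1;
constexpr uint64_t kExponentMask = (1ULL << kExponentSize) - 1;
constexpr int kExponentBias = 1023;

// Builds the bit pattern of a double with an unbiased exponent of 0..31
// and 52 mantissa bits taken from the low end of the entropy word.
inline uint64_t smallPositiveFloatBits(uint64_t entropy) {
	uint64_t exponent = entropy >> 59;             // top 5 bits: 0..31
	uint64_t mantissa = entropy & kMantissaMask;   // low 52 bits
	exponent += kExponentBias;
	exponent &= kExponentMask;
	return (exponent << kMantissaSize) | mantissa; // sign bit stays 0
}

// Hypothetical helper: reinterprets the bit pattern as a double.
inline double smallPositiveFloat(uint64_t entropy) {
	static_assert(sizeof(double) == sizeof(uint64_t), "binary64 double assumed");
	uint64_t bits = smallPositiveFloatBits(entropy);
	double value;
	std::memcpy(&value, &bits, sizeof value);
	assert(value >= 1.0); // exponent field is never below the bias
	return value;
}
```

For example, an all-zero entropy word gives 1.0, and setting only bit 59 gives 2.0.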
--- a/src/VirtualMachine.hpp
+++ b/src/VirtualMachine.hpp
@ -20,26 +20,36 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 #pragma once
 #include <cstdint>
 #include "common.hpp"
+#include "Program.hpp"

 namespace RandomX {

+
+
 	class VirtualMachine {
 	public:
-		VirtualMachine(bool softAes);
-		virtual ~VirtualMachine();
-		virtual void setDataset(dataset_t ds, bool light = false);
-		void initializeScratchpad(uint32_t index);
-		virtual void initializeProgram(const void* seed) = 0;
+		VirtualMachine();
+		virtual ~VirtualMachine() {}
+		virtual void setDataset(dataset_t ds) = 0;
+		void setScratchpad(void* ptr) {
+			scratchpad = (uint8_t*)ptr;
+		}
+		void resetRoundingMode();
+		virtual void initialize();
 		virtual void execute() = 0;
-		void getResult(void*);
+		template<bool softAes>
+		void getResult(void* scratchpad, size_t scratchpadSize, void* outHash);
 		const RegisterFile& getRegisterFile() {
 			return reg;
 		}
+		Program* getProgramBuffer() {
+			return &program;
+		}
 	protected:
-		bool softAes, lightClient;
-		DatasetReadFunc readDataset;
+		alignas(16) Program program;
 		alignas(16) RegisterFile reg;
 		MemoryRegisters mem;
-		alignas(16) convertible_t scratchpad[ScratchpadLength];
+		uint8_t* scratchpad;
+		uint32_t readReg0, readReg1, readReg2, readReg3;
 	};
 }
--- a/src/asm/program_epilogue_store.inc
+++ b/src/asm/program_epilogue_store.inc
@ -1,6 +1,3 @@
-	;# unroll VM stack
-	mov rsp, rbp
-
 	;# save VM register values
 	pop rcx
 	mov qword ptr [rcx+0], r8
@ -11,8 +8,8 @@
 	mov qword ptr [rcx+40], r13
 	mov qword ptr [rcx+48], r14
 	mov qword ptr [rcx+56], r15
-	movdqa xmmword ptr [rcx+64], xmm8
-	movdqa xmmword ptr [rcx+80], xmm9
+	movdqa xmmword ptr [rcx+64], xmm0
+	movdqa xmmword ptr [rcx+80], xmm1
 	movdqa xmmword ptr [rcx+96], xmm2
 	movdqa xmmword ptr [rcx+112], xmm3
 	lea rcx, [rcx+64]
--- a/src/asm/program_epilogue_win64.inc
+++ b/src/asm/program_epilogue_win64.inc
@ -1,6 +1,12 @@
 	include program_epilogue_store.inc

 	;# restore callee-saved registers - Microsoft x64 calling convention
+	movdqu xmm15, xmmword ptr [rsp]
+	movdqu xmm14, xmmword ptr [rsp+16]
+	movdqu xmm13, xmmword ptr [rsp+32]
+	movdqu xmm12, xmmword ptr [rsp+48]
+	movdqu xmm11, xmmword ptr [rsp+64]
+	add rsp, 80
 	movdqu xmm10, xmmword ptr [rsp]
 	movdqu xmm9, xmmword ptr [rsp+16]
 	movdqu xmm8, xmmword ptr [rsp+32]
@ -17,4 +23,4 @@
 	pop rbx

 	;# program finished
-	ret	0
+	ret
--- a/src/asm/program_loop_load.inc
+++ b/src/asm/program_loop_load.inc
@ -0,0 +1,28 @@
+	mov rdx, rax
+	and eax, 2097088
+	lea rcx, [rsi+rax]
+	push rcx
+	xor r8,  qword ptr [rcx+0]
+	xor r9,  qword ptr [rcx+8]
+	xor r10, qword ptr [rcx+16]
+	xor r11, qword ptr [rcx+24]
+	xor r12, qword ptr [rcx+32]
+	xor r13, qword ptr [rcx+40]
+	xor r14, qword ptr [rcx+48]
+	xor r15, qword ptr [rcx+56]
+	ror rdx, 32
+	and edx, 2097088
+	lea rcx, [rsi+rdx]
+	push rcx
+	cvtdq2pd xmm0, qword ptr [rcx+0]
+	cvtdq2pd xmm1, qword ptr [rcx+8]
+	cvtdq2pd xmm2, qword ptr [rcx+16]
+	cvtdq2pd xmm3, qword ptr [rcx+24]
+	cvtdq2pd xmm4, qword ptr [rcx+32]
+	cvtdq2pd xmm5, qword ptr [rcx+40]
+	cvtdq2pd xmm6, qword ptr [rcx+48]
+	cvtdq2pd xmm7, qword ptr [rcx+56]
+	andps xmm4, xmm14
+	andps xmm5, xmm14
+	andps xmm6, xmm14
+	andps xmm7, xmm14
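The constant 2097088 used in the two `and` instructions of program_loop_load.inc appears to be the value of `ScratchpadL3Mask64` from common.hpp: it keeps an address inside the 2 MiB scratchpad and aligns it to a 64-byte line. The sketch below recomputes that value and shows how one 64-bit word could be split into two line offsets, mirroring the `ror rdx, 32` step. `SpadOffsets` and `splitSpadAddresses` are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <cstdint>

using int_reg_t = uint64_t;

// Copied from the new common.hpp
constexpr uint32_t ScratchpadSize = 2 * 1024 * 1024;
constexpr uint32_t ScratchpadLength = ScratchpadSize / sizeof(int_reg_t);
constexpr int ScratchpadL3Mask64 = (ScratchpadLength / 8 - 1) * 64;

static_assert(ScratchpadL3Mask64 == 2097088, "matches the immediate in the assembly");
static_assert((ScratchpadL3Mask64 & 63) == 0, "offsets are 64-byte aligned");
static_assert(static_cast<uint32_t>(ScratchpadL3Mask64) + 64 == ScratchpadSize,
	"the last addressable line ends exactly at the scratchpad end");

// Hypothetical pair of scratchpad line offsets
struct SpadOffsets {
	uint32_t first;   // from the low 32 bits  (and eax, 2097088)
	uint32_t second;  // from the high 32 bits (ror rdx, 32; and edx, 2097088)
};

inline SpadOffsets splitSpadAddresses(uint64_t addressPair) {
	const uint32_t mask = static_cast<uint32_t>(ScratchpadL3Mask64);
	SpadOffsets offsets;
	offsets.first = static_cast<uint32_t>(addressPair) & mask;
	offsets.second = static_cast<uint32_t>(addressPair >> 32) & mask;
	assert(offsets.first < ScratchpadSize && offsets.second < ScratchpadSize);
	return offsets;
}
```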
--- a/src/asm/program_loop_store.inc
+++ b/src/asm/program_loop_store.inc
@ -0,0 +1,18 @@
+	pop rcx
+	mov qword ptr [rcx+0], r8
+	mov qword ptr [rcx+8], r9
+	mov qword ptr [rcx+16], r10
+	mov qword ptr [rcx+24], r11
+	mov qword ptr [rcx+32], r12
+	mov qword ptr [rcx+40], r13
+	mov qword ptr [rcx+48], r14
+	mov qword ptr [rcx+56], r15
+	pop rcx
+	mulpd xmm0, xmm4
+	mulpd xmm1, xmm5
+	mulpd xmm2, xmm6
+	mulpd xmm3, xmm7
+	movapd xmmword ptr [rcx+0], xmm0
+	movapd xmmword ptr [rcx+16], xmm1
+	movapd xmmword ptr [rcx+32], xmm2
+	movapd xmmword ptr [rcx+48], xmm3
--- a/src/asm/program_prologue_linux.inc
+++ b/src/asm/program_prologue_linux.inc
@ -7,11 +7,13 @@
 	push r15

 	;# function arguments
+	mov rbx, rcx                ;# loop counter
 	push rdi                    ;# RegisterFile& registerFile
-	mov rbx, rsi    ;# MemoryRegisters& memory
-	mov rsi, rdx    ;# convertible_t* scratchpad
 	mov rcx, rdi
+	mov rbp, qword ptr [rsi]    ;# "mx", "ma"
+	mov rdi, qword ptr [rsi+8]  ;# uint8_t* dataset
+	mov rsi, rdx                ;# uint8_t* scratchpad

 	#include "program_prologue_load.inc"

-	jmp randomx_program_begin
+	jmp DECL(randomx_program_loop_begin)
--- a/src/asm/program_prologue_load.inc
+++ b/src/asm/program_prologue_load.inc
@ -1,63 +1,21 @@
-	mov rbp, rsp      ;# beginning of VM stack
-	mov rdi, 1048577  ;# number of VM instructions to execute + 1
+	mov rax, rbp

-	xorps xmm10, xmm10
-	cmpeqpd xmm10, xmm10
-	psrlq xmm10, 1    ;# mask for absolute value = 0x7fffffffffffffff7fffffffffffffff
+	;# zero integer registers
+	xor r8, r8
+	xor r9, r9
+	xor r10, r10
+	xor r11, r11
+	xor r12, r12
+	xor r13, r13
+	xor r14, r14
+	xor r15, r15

-	;# reset rounding mode
-	mov dword ptr [rsp-8], 40896
-	ldmxcsr dword ptr [rsp-8]
-
-	;# load integer registers
-	mov r8, qword ptr [rcx+0]
-	mov r9, qword ptr [rcx+8]
-	mov r10, qword ptr [rcx+16]
-	mov r11, qword ptr [rcx+24]
-	mov r12, qword ptr [rcx+32]
-	mov r13, qword ptr [rcx+40]
-	mov r14, qword ptr [rcx+48]
-	mov r15, qword ptr [rcx+56]
-
-	;# initialize floating point registers
-	xorps xmm8, xmm8
-	cvtsi2sd xmm8, qword ptr [rcx+72]
-	pslldq xmm8, 8
-	cvtsi2sd xmm8, qword ptr [rcx+64]
-
-	xorps xmm9, xmm9
-	cvtsi2sd xmm9, qword ptr [rcx+88]
-	pslldq xmm9, 8
-	cvtsi2sd xmm9, qword ptr [rcx+80]
-
-	xorps xmm2, xmm2
-	cvtsi2sd xmm2, qword ptr [rcx+104]
-	pslldq xmm2, 8
-	cvtsi2sd xmm2, qword ptr [rcx+96]
-
-	xorps xmm3, xmm3
-	cvtsi2sd xmm3, qword ptr [rcx+120]
-	pslldq xmm3, 8
-	cvtsi2sd xmm3, qword ptr [rcx+112]
-
-	lea rcx, [rcx+64]
-
-	xorps xmm4, xmm4
-	cvtsi2sd xmm4, qword ptr [rcx+72]
-	pslldq xmm4, 8
-	cvtsi2sd xmm4, qword ptr [rcx+64]
-
-	xorps xmm5, xmm5
-	cvtsi2sd xmm5, qword ptr [rcx+88]
-	pslldq xmm5, 8
-	cvtsi2sd xmm5, qword ptr [rcx+80]
-
-	xorps xmm6, xmm6
-	cvtsi2sd xmm6, qword ptr [rcx+104]
-	pslldq xmm6, 8
-	cvtsi2sd xmm6, qword ptr [rcx+96]
-
-	xorps xmm7, xmm7
-	cvtsi2sd xmm7, qword ptr [rcx+120]
-	pslldq xmm7, 8
-	cvtsi2sd xmm7, qword ptr [rcx+112]
+	;# load constant registers
+	lea rcx, [rcx+120]
+	movapd xmm8, xmmword ptr [rcx+72]
+	movapd xmm9, xmmword ptr [rcx+88]
+	movapd xmm10, xmmword ptr [rcx+104]
+	movapd xmm11, xmmword ptr [rcx+120]
+	movapd xmm13, xmmword ptr [minDbl]
+	movapd xmm14, xmmword ptr [absMask]
+	movapd xmm15, xmmword ptr [signMask]
--- a/src/asm/program_prologue_win64.inc
+++ b/src/asm/program_prologue_win64.inc
@ -13,12 +13,20 @@
 	movdqu xmmword ptr [rsp+32], xmm8
 	movdqu xmmword ptr [rsp+16], xmm9
 	movdqu xmmword ptr [rsp+0], xmm10
+	sub rsp, 80
+	movdqu xmmword ptr [rsp+64], xmm11
+	movdqu xmmword ptr [rsp+48], xmm12
+	movdqu xmmword ptr [rsp+32], xmm13
+	movdqu xmmword ptr [rsp+16], xmm14
+	movdqu xmmword ptr [rsp+0], xmm15

-	;# function arguments
-	push rcx        ;# RegisterFile& registerFile
-	mov rbx, rdx    ;# MemoryRegisters& memory
-	mov rsi, r8     ;# convertible_t* scratchpad
+	; function arguments
+	push rcx                    ; RegisterFile& registerFile
+	mov rbp, qword ptr [rdx]    ; "mx", "ma"
+	mov rdi, qword ptr [rdx+8]  ; uint8_t* dataset
+	mov rsi, r8                 ; uint8_t* scratchpad
+	mov rbx, r9                 ; loop counter

 	include program_prologue_load.inc

-	jmp randomx_program_begin
+	jmp randomx_program_loop_begin
--- a/src/asm/program_read_dataset.inc
+++ b/src/asm/program_read_dataset.inc
@ -0,0 +1,17 @@
+	xor rbp, rax                       ;# modify "mx"
+	xor eax, eax
+	and rbp, -64                       ;# align "mx" to the start of a cache line
+	mov edx, ebp                       ;# edx = mx
+	prefetchnta byte ptr [rdi+rdx]
+	ror rbp, 32                        ;# swap "ma" and "mx"
+	mov edx, ebp                       ;# edx = ma
+	lea rcx, [rdi+rdx]                 ;# dataset cache line
+	xor r8,  qword ptr [rcx+0]
+	xor r9,  qword ptr [rcx+8]
+	xor r10, qword ptr [rcx+16]
+	xor r11, qword ptr [rcx+24]
+	xor r12, qword ptr [rcx+32]
+	xor r13, qword ptr [rcx+40]
+	xor r14, qword ptr [rcx+48]
+	xor r15, qword ptr [rcx+56]
+	
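The dataset read in program_read_dataset.inc can be restated in C++ as a rough sketch. It assumes that "mx" occupies the low half and "ma" the high half of `rbp`, and that only the low 32 bits of `rax` carry address bits; `MemoryRegistersSketch` and `readDatasetLine` are hypothetical names, and the prefetch is only indicated by a comment. The line that is read was selected (and prefetched) during the previous call, which is why the two addresses trade places at the end.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <utility>

// Hypothetical stand-in for the "mx"/"ma" pair kept in rbp
struct MemoryRegistersSketch {
	uint32_t mx;  // line to prefetch for the next read
	uint32_t ma;  // line to read now
};

inline void readDatasetLine(MemoryRegistersSketch& mem, uint32_t addressBits,
                            const uint8_t* dataset, uint64_t (&r)[8]) {
	mem.mx ^= addressBits;             // xor rbp, rax
	mem.mx &= ~static_cast<uint32_t>(63); // and rbp, -64 (cache line alignment)
	// prefetchnta [dataset + mem.mx] would be issued here
	assert((mem.ma & 63) == 0);        // "ma" is expected to be line-aligned already
	for (int i = 0; i < 8; ++i) {
		uint64_t word;
		std::memcpy(&word, dataset + mem.ma + 8 * i, sizeof word);
		r[i] ^= word;                  // xor r8..r15, qword ptr [rcx + 8*i]
	}
	std::swap(mem.mx, mem.ma);         // ror rbp, 32
}
```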
--- a/src/asm/program_read_f.inc
+++ b/src/asm/program_read_f.inc
@ -1,13 +0,0 @@
-	mov edx, dword ptr [rbx]      ;# ma
-	mov rax, qword ptr [rbx+8]    ;# dataset
-	cvtdq2pd xmm0, qword ptr [rax+rdx]
-	add dword ptr [rbx], 8
-	xor ecx, dword ptr [rbx+4]    ;# mx
-	mov dword ptr [rbx+4], ecx
-	test ecx, 65528
-	jne short rx_read_dataset_f_ret
-	and ecx, -8
-	mov dword ptr [rbx], ecx
-	prefetcht0 byte ptr [rax+rcx]
-rx_read_dataset_f_ret:
-	ret 0
--- a/src/asm/program_read_r.inc
+++ b/src/asm/program_read_r.inc
@ -1,13 +0,0 @@
-	mov eax, dword ptr [rbx]      ;# ma
-	mov rdx, qword ptr [rbx+8]    ;# dataset
-	mov rax, qword ptr [rdx+rax]
-	add dword ptr [rbx], 8
-	xor ecx, dword ptr [rbx+4]    ;# mx
-	mov dword ptr [rbx+4], ecx
-	test ecx, 65528
-	jne short rx_read_dataset_r_ret
-	and ecx, -8
-	mov dword ptr [rbx], ecx
-	prefetcht0 byte ptr [rdx+rcx]
-rx_read_dataset_r_ret:
-	ret 0
--- a/src/asm/program_transform_address.inc
+++ b/src/asm/program_transform_address.inc
@ -0,0 +1,154 @@
+	;# 90 address transformations
+	;# forced REX prefix is used to make all transformations 4 bytes long
+	lea eax, [rax+rax*8+109]
+	db 64
+	xor eax, 96
+	lea eax, [rax+rax*8-19]
+	db 64
+	add eax, -98
+	db 64
+	add eax, -21
+	db 64
+	xor eax, -80
+	lea eax, [rax+rax*8-92]
+	db 64
+	add eax, 113
+	lea eax, [rax+rax*8+100]
+	db 64
+	add eax, -39
+	db 64
+	xor eax, 120
+	lea eax, [rax+rax*8-119]
+	db 64
+	add eax, -113
+	db 64
+	add eax, 111
+	db 64
+	xor eax, 104
+	lea eax, [rax+rax*8-83]
+	lea eax, [rax+rax*8+127]
+	db 64
+	xor eax, -112
+	db 64
+	add eax, 89
+	db 64
+	add eax, -32
+	db 64
+	add eax, 104
+	db 64
+	xor eax, -120
+	db 64
+	xor eax, 24
+	lea eax, [rax+rax*8+9]
+	db 64
+	add eax, -31
+	db 64
+	xor eax, -16
+	db 64
+	add eax, 68
+	lea eax, [rax+rax*8-110]
+	db 64
+	xor eax, 64
+	db 64
+	xor eax, -40
+	db 64
+	xor eax, -8
+	db 64
+	add eax, -10
+	db 64
+	xor eax, -32
+	db 64
+	add eax, 14
+	lea eax, [rax+rax*8-46]
+	db 64
+	xor eax, -104
+	lea eax, [rax+rax*8+36]
+	db 64
+	add eax, 100
+	lea eax, [rax+rax*8-65]
+	lea eax, [rax+rax*8+27]
+	lea eax, [rax+rax*8+91]
+	db 64
+	add eax, -101
+	db 64
+	add eax, -94
+	lea eax, [rax+rax*8-10]
+	db 64
+	xor eax, 80
+	db 64
+	add eax, -108
+	db 64
+	add eax, -58
+	db 64
+	xor eax, 48
+	lea eax, [rax+rax*8+73]
+	db 64
+	xor eax, -48
+	db 64
+	xor eax, 32
+	db 64
+	xor eax, -96
+	db 64
+	add eax, 118
+	db 64
+	add eax, 91
+	lea eax, [rax+rax*8+18]
+	db 64
+	add eax, -11
+	lea eax, [rax+rax*8+63]
+	db 64
+	add eax, 114
+	lea eax, [rax+rax*8+45]
+	db 64
+	add eax, -67
+	db 64
+	add eax, 53
+	lea eax, [rax+rax*8-101]
+	lea eax, [rax+rax*8-1]
+	db 64
+	xor eax, 16
+	lea eax, [rax+rax*8-37]
+	lea eax, [rax+rax*8-28]
+	lea eax, [rax+rax*8-55]
+	db 64
+	xor eax, -88
+	db 64
+	xor eax, -72
+	db 64
+	add eax, 36
+	db 64
+	xor eax, -56
+	db 64
+	add eax, 116
+	db 64
+	xor eax, 88
+	db 64
+	xor eax, -128
+	db 64
+	add eax, 50
+	db 64
+	add eax, 105
+	db 64
+	add eax, -37
+	db 64
+	xor eax, 112
+	db 64
+	xor eax, 8
+	db 64
+	xor eax, -24
+	lea eax, [rax+rax*8+118]
+	db 64
+	xor eax, 72
+	db 64
+	xor eax, -64
+	db 64
+	add eax, 40
+	lea eax, [rax+rax*8-74]
+	lea eax, [rax+rax*8+82]
+	lea eax, [rax+rax*8+54]
+	db 64
+	xor eax, 56
+	db 64
+	xor eax, 40
+	db 64
+	add eax, 87
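Each of the three transformation kinds in program_transform_address.inc (`lea` with factor 9, `add` and `xor` with a sign-extended 8-bit constant) is a bijection on 32-bit values, so no address bits are lost. The sketch below illustrates this under the assumption that the operations wrap modulo 2^32, as they do on `eax`. All function names are hypothetical, and `kInverseOfNine` is the multiplicative inverse of 9 modulo 2^32.

```cpp
#include <cassert>
#include <cstdint>

// Sign-extends an 8-bit immediate to 32 bits, as the CPU does for imm8/disp8
inline uint32_t signExtend(int8_t c) {
	return static_cast<uint32_t>(static_cast<int32_t>(c));
}

// lea eax, [rax+rax*8+c]  ->  9 * x + c  (mod 2^32)
inline uint32_t applyLea(uint32_t x, int8_t c) { return x * 9u + signExtend(c); }
// add eax, c
inline uint32_t applyAdd(uint32_t x, int8_t c) { return x + signExtend(c); }
// xor eax, c
inline uint32_t applyXor(uint32_t x, int8_t c) { return x ^ signExtend(c); }

// 9 is odd, so it has an inverse modulo 2^32
constexpr uint32_t kInverseOfNine = 0x38E38E39u;
static_assert(static_cast<uint32_t>(9u * kInverseOfNine) == 1u, "9 * inverse == 1 (mod 2^32)");

// Inverse of applyLea: subtract the constant, then multiply by 9^-1
inline uint32_t undoLea(uint32_t y, int8_t c) {
	uint32_t x = (y - signExtend(c)) * kInverseOfNine;
	assert(applyLea(x, c) == y);
	return x;
}
```

The inverses of the other two are simpler: `xor` with the same constant undoes itself, and `add` is undone by adding the negated constant.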
--- a/src/asm/program_xmm_constants.inc
+++ b/src/asm/program_xmm_constants.inc
@ -0,0 +1,6 @@
+minDbl:
+	db 0, 0, 0, 0, 0, 0, 16, 0, 0, 0, 0, 0, 0, 0, 16, 0
+absMask:
+	db 255, 255, 255, 255, 255, 255, 255, 127, 255, 255, 255, 255, 255, 255, 255, 127
+signMask:
+	db 0, 0, 0, 0, 0, 0, 0, 128, 0, 0, 0, 0, 0, 0, 0, 128
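The byte sequences in program_xmm_constants.inc are little-endian 64-bit words, each repeated twice to fill a 128-bit register. The sketch below decodes one 8-byte half of each constant; `wordFromLittleEndian` and `doubleFromBits` are hypothetical helpers, and binary64 doubles are assumed. Under that assumption `minDbl` decodes to the smallest positive normal double, `absMask` clears the sign bit, and `signMask` selects only the sign bit.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

// One 8-byte half of each constant, copied from the .inc file
const uint8_t minDblBytes[8]   = { 0, 0, 0, 0, 0, 0, 16, 0 };
const uint8_t absMaskBytes[8]  = { 255, 255, 255, 255, 255, 255, 255, 127 };
const uint8_t signMaskBytes[8] = { 0, 0, 0, 0, 0, 0, 0, 128 };

// Assembles a 64-bit word from bytes stored least significant first
inline uint64_t wordFromLittleEndian(const uint8_t (&bytes)[8]) {
	uint64_t w = 0;
	for (int i = 7; i >= 0; --i)
		w = (w << 8) | bytes[i];
	return w;
}

// Reinterprets a bit pattern as a double
inline double doubleFromBits(uint64_t bits) {
	static_assert(sizeof(double) == sizeof(uint64_t), "binary64 double assumed");
	double value;
	std::memcpy(&value, &bits, sizeof value);
	return value;
}
```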
--- a/src/asm/squareHash.inc
+++ b/src/asm/squareHash.inc
@ -0,0 +1,87 @@
+	mov rax, 1613783669344650115
+	add rax, rcx
+	mul rax
+	sub rax, rdx ;# 1
+	mul rax
+	sub rax, rdx ;# 2
+	mul rax
+	sub rax, rdx ;# 3
+	mul rax
+	sub rax, rdx ;# 4
+	mul rax
+	sub rax, rdx ;# 5
+	mul rax
+	sub rax, rdx ;# 6
+	mul rax
+	sub rax, rdx ;# 7
+	mul rax
+	sub rax, rdx ;# 8
+	mul rax
+	sub rax, rdx ;# 9
+	mul rax
+	sub rax, rdx ;# 10
+	mul rax
+	sub rax, rdx ;# 11
+	mul rax
+	sub rax, rdx ;# 12
+	mul rax
+	sub rax, rdx ;# 13
+	mul rax
+	sub rax, rdx ;# 14
+	mul rax
+	sub rax, rdx ;# 15
+	mul rax
+	sub rax, rdx ;# 16
+	mul rax
+	sub rax, rdx ;# 17
+	mul rax
+	sub rax, rdx ;# 18
+	mul rax
+	sub rax, rdx ;# 19
+	mul rax
+	sub rax, rdx ;# 20
+	mul rax
+	sub rax, rdx ;# 21
+	mul rax
+	sub rax, rdx ;# 22
+	mul rax
+	sub rax, rdx ;# 23
+	mul rax
+	sub rax, rdx ;# 24
+	mul rax
+	sub rax, rdx ;# 25
+	mul rax
+	sub rax, rdx ;# 26
+	mul rax
+	sub rax, rdx ;# 27
+	mul rax
+	sub rax, rdx ;# 28
+	mul rax
+	sub rax, rdx ;# 29
+	mul rax
+	sub rax, rdx ;# 30
+	mul rax
+	sub rax, rdx ;# 31
+	mul rax
+	sub rax, rdx ;# 32
+	mul rax
+	sub rax, rdx ;# 33
+	mul rax
+	sub rax, rdx ;# 34
+	mul rax
+	sub rax, rdx ;# 35
+	mul rax
+	sub rax, rdx ;# 36
+	mul rax
+	sub rax, rdx ;# 37
+	mul rax
+	sub rax, rdx ;# 38
+	mul rax
+	sub rax, rdx ;# 39
+	mul rax
+	sub rax, rdx ;# 40
+	mul rax
+	sub rax, rdx ;# 41
+	mul rax
+	sub rax, rdx ;# 42
+	ret
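The unrolled assembly in squareHash.inc repeats one step 42 times: square the value as a 128-bit product, then subtract the high 64 bits from the low 64 bits. A portable C++ sketch of the same arithmetic follows. `mulHigh` and `squareHashSketch` are hypothetical names, and the high half of the product is computed from 32-bit pieces so that no compiler-specific 128-bit type is needed.

```cpp
#include <cassert>
#include <cstdint>

// High 64 bits of the 128-bit product a * b
inline uint64_t mulHigh(uint64_t a, uint64_t b) {
	const uint64_t mask = 0xFFFFFFFFULL;
	uint64_t aLo = a & mask, aHi = a >> 32;
	uint64_t bLo = b & mask, bHi = b >> 32;
	uint64_t ll = aLo * bLo;
	uint64_t lh = aLo * bHi;
	uint64_t hl = aHi * bLo;
	uint64_t hh = aHi * bHi;
	// Carry out of the middle 32-bit column
	uint64_t mid = (ll >> 32) + (lh & mask) + (hl & mask);
	return hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}

// 42 rounds of x = low(x * x) - high(x * x), all modulo 2^64
inline uint64_t squareHashSketch(uint64_t input) {
	uint64_t x = input + 1613783669344650115ULL;  // mov rax, ...; add rax, rcx
	for (int round = 0; round < 42; ++round) {
		uint64_t high = mulHigh(x, x);            // rdx after "mul rax"
		uint64_t low = x * x;                     // rax after "mul rax"
		x = low - high;                           // sub rax, rdx
	}
	return x;
}
```

The values 0 and 1 are fixed points of a single round, which gives two inputs whose result can be predicted without running the assembly.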
--- a/src/blake2/blake2-impl.h
+++ b/src/blake2/blake2-impl.h
@ -27,105 +27,10 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 #define PORTABLE_BLAKE2_IMPL_H

 #include <stdint.h>
-#include <string.h>

-#if defined(_MSC_VER)
-#define BLAKE2_INLINE __inline
-#elif defined(__GNUC__) || defined(__clang__)
-#define BLAKE2_INLINE __inline__
-#else
-#define BLAKE2_INLINE
-#endif
+#include "endian.h"

- /* Argon2 Team - Begin Code */
- /*
-	Not an exhaustive list, but should cover the majority of modern platforms
-	Additionally, the code will always be correct---this is only a performance
-	tweak.
- */
-#if (defined(__BYTE_ORDER__) &&                                                \
-     (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)) ||                           \
-    defined(__LITTLE_ENDIAN__) || defined(__ARMEL__) || defined(__MIPSEL__) || \
-    defined(__AARCH64EL__) || defined(__amd64__) || defined(__i386__) ||       \
-    defined(_M_IX86) || defined(_M_X64) || defined(_M_AMD64) ||                \
-    defined(_M_ARM)
-#define NATIVE_LITTLE_ENDIAN
-#endif
- /* Argon2 Team - End Code */
-
-static BLAKE2_INLINE uint32_t load32(const void *src) {
-#if defined(NATIVE_LITTLE_ENDIAN)
-	uint32_t w;
-	memcpy(&w, src, sizeof w);
-	return w;
-#else
-	const uint8_t *p = (const uint8_t *)src;
-	uint32_t w = *p++;
-	w |= (uint32_t)(*p++) << 8;
-	w |= (uint32_t)(*p++) << 16;
-	w |= (uint32_t)(*p++) << 24;
-	return w;
-#endif
-}
-
-static BLAKE2_INLINE uint64_t load64(const void *src) {
-#if defined(NATIVE_LITTLE_ENDIAN)
-	uint64_t w;
-	memcpy(&w, src, sizeof w);
-	return w;
-#else
-	const uint8_t *p = (const uint8_t *)src;
-	uint64_t w = *p++;
-	w |= (uint64_t)(*p++) << 8;
-	w |= (uint64_t)(*p++) << 16;
-	w |= (uint64_t)(*p++) << 24;
-	w |= (uint64_t)(*p++) << 32;
-	w |= (uint64_t)(*p++) << 40;
-	w |= (uint64_t)(*p++) << 48;
-	w |= (uint64_t)(*p++) << 56;
-	return w;
-#endif
-}
-
-static BLAKE2_INLINE void store32(void *dst, uint32_t w) {
-#if defined(NATIVE_LITTLE_ENDIAN)
-	memcpy(dst, &w, sizeof w);
-#else
-	uint8_t *p = (uint8_t *)dst;
-	*p++ = (uint8_t)w;
-	w >>= 8;
-	*p++ = (uint8_t)w;
-	w >>= 8;
-	*p++ = (uint8_t)w;
-	w >>= 8;
-	*p++ = (uint8_t)w;
-#endif
-}
-
-static BLAKE2_INLINE void store64(void *dst, uint64_t w) {
-#if defined(NATIVE_LITTLE_ENDIAN)
-	memcpy(dst, &w, sizeof w);
-#else
-	uint8_t *p = (uint8_t *)dst;
-	*p++ = (uint8_t)w;
-	w >>= 8;
-	*p++ = (uint8_t)w;
-	w >>= 8;
-	*p++ = (uint8_t)w;
-	w >>= 8;
-	*p++ = (uint8_t)w;
-	w >>= 8;
-	*p++ = (uint8_t)w;
-	w >>= 8;
-	*p++ = (uint8_t)w;
-	w >>= 8;
-	*p++ = (uint8_t)w;
-	w >>= 8;
-	*p++ = (uint8_t)w;
-#endif
-}
-
-static BLAKE2_INLINE uint64_t load48(const void *src) {
+static FORCE_INLINE uint64_t load48(const void *src) {
 	const uint8_t *p = (const uint8_t *)src;
 	uint64_t w = *p++;
 	w |= (uint64_t)(*p++) << 8;
@ -136,7 +41,7 @@ static BLAKE2_INLINE uint64_t load48(const void *src) {
 	return w;
 }

-static BLAKE2_INLINE void store48(void *dst, uint64_t w) {
+static FORCE_INLINE void store48(void *dst, uint64_t w) {
 	uint8_t *p = (uint8_t *)dst;
 	*p++ = (uint8_t)w;
 	w >>= 8;
@ -151,11 +56,11 @@ static BLAKE2_INLINE void store48(void *dst, uint64_t w) {
 	*p++ = (uint8_t)w;
 }

-static BLAKE2_INLINE uint32_t rotr32(const uint32_t w, const unsigned c) {
+static FORCE_INLINE uint32_t rotr32(const uint32_t w, const unsigned c) {
 	return (w >> c) | (w << (32 - c));
 }

-static BLAKE2_INLINE uint64_t rotr64(const uint64_t w, const unsigned c) {
+static FORCE_INLINE uint64_t rotr64(const uint64_t w, const unsigned c) {
 	return (w >> c) | (w << (64 - c));
 }

--- a/src/blake2/blake2b.c
+++ b/src/blake2/blake2b.c
@ -51,29 +51,29 @@ static const unsigned int blake2b_sigma[12][16] = {
 	{14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3},
 };

-static BLAKE2_INLINE void blake2b_set_lastnode(blake2b_state *S) {
+static FORCE_INLINE void blake2b_set_lastnode(blake2b_state *S) {
 	S->f[1] = (uint64_t)-1;
 }

-static BLAKE2_INLINE void blake2b_set_lastblock(blake2b_state *S) {
+static FORCE_INLINE void blake2b_set_lastblock(blake2b_state *S) {
 	if (S->last_node) {
 		blake2b_set_lastnode(S);
 	}
 	S->f[0] = (uint64_t)-1;
 }

-static BLAKE2_INLINE void blake2b_increment_counter(blake2b_state *S,
+static FORCE_INLINE void blake2b_increment_counter(blake2b_state *S,
 	uint64_t inc) {
 	S->t[0] += inc;
 	S->t[1] += (S->t[0] < inc);
 }

-static BLAKE2_INLINE void blake2b_invalidate_state(blake2b_state *S) {
+static FORCE_INLINE void blake2b_invalidate_state(blake2b_state *S) {
 	//clear_internal_memory(S, sizeof(*S));      /* wipe */
 	blake2b_set_lastblock(S); /* invalidate for further use */
 }

-static BLAKE2_INLINE void blake2b_init0(blake2b_state *S) {
+static FORCE_INLINE void blake2b_init0(blake2b_state *S) {
 	memset(S, 0, sizeof(*S));
 	memcpy(S->h, blake2b_IV, sizeof(S->h));
 }
--- a/src/blake2/blamka-round-ref.h
+++ b/src/blake2/blamka-round-ref.h
@ -30,7 +30,7 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 #include "blake2-impl.h"

 /* designed by the Lyra PHC team */
-static BLAKE2_INLINE uint64_t fBlaMka(uint64_t x, uint64_t y) {
+static FORCE_INLINE uint64_t fBlaMka(uint64_t x, uint64_t y) {
 	const uint64_t m = UINT64_C(0xFFFFFFFF);
 	const uint64_t xy = (x & m) * (y & m);
 	return x + y + 2 * xy;
--- a/src/blake2/endian.h
+++ b/src/blake2/endian.h
@ -0,0 +1,99 @@
+#pragma once
+#include <stdint.h>
+#include <string.h>
+
+#if defined(_MSC_VER)
+#define FORCE_INLINE __inline
+#elif defined(__GNUC__) || defined(__clang__)
+#define FORCE_INLINE __inline__
+#else
+#define FORCE_INLINE
+#endif
+
+ /* Argon2 Team - Begin Code */
+ /*
+	Not an exhaustive list, but should cover the majority of modern platforms
+	Additionally, the code will always be correct---this is only a performance
+	tweak.
+ */
+#if (defined(__BYTE_ORDER__) &&                                                \
+     (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)) ||                           \
+    defined(__LITTLE_ENDIAN__) || defined(__ARMEL__) || defined(__MIPSEL__) || \
+    defined(__AARCH64EL__) || defined(__amd64__) || defined(__i386__) ||       \
+    defined(_M_IX86) || defined(_M_X64) || defined(_M_AMD64) ||                \
+    defined(_M_ARM)
+#define NATIVE_LITTLE_ENDIAN
+#endif
+ /* Argon2 Team - End Code */
+
+static FORCE_INLINE uint32_t load32(const void *src) {
+#if defined(NATIVE_LITTLE_ENDIAN)
+	uint32_t w;
+	memcpy(&w, src, sizeof w);
+	return w;
+#else
+	const uint8_t *p = (const uint8_t *)src;
+	uint32_t w = *p++;
+	w |= (uint32_t)(*p++) << 8;
+	w |= (uint32_t)(*p++) << 16;
+	w |= (uint32_t)(*p++) << 24;
+	return w;
+#endif
+}
+
+static FORCE_INLINE uint64_t load64(const void *src) {
+#if defined(NATIVE_LITTLE_ENDIAN)
+	uint64_t w;
+	memcpy(&w, src, sizeof w);
+	return w;
+#else
+	const uint8_t *p = (const uint8_t *)src;
+	uint64_t w = *p++;
+	w |= (uint64_t)(*p++) << 8;
+	w |= (uint64_t)(*p++) << 16;
+	w |= (uint64_t)(*p++) << 24;
+	w |= (uint64_t)(*p++) << 32;
+	w |= (uint64_t)(*p++) << 40;
+	w |= (uint64_t)(*p++) << 48;
+	w |= (uint64_t)(*p++) << 56;
+	return w;
+#endif
+}
+
+static FORCE_INLINE void store32(void *dst, uint32_t w) {
+#if defined(NATIVE_LITTLE_ENDIAN)
+	memcpy(dst, &w, sizeof w);
+#else
+	uint8_t *p = (uint8_t *)dst;
+	*p++ = (uint8_t)w;
+	w >>= 8;
+	*p++ = (uint8_t)w;
+	w >>= 8;
+	*p++ = (uint8_t)w;
+	w >>= 8;
+	*p++ = (uint8_t)w;
+#endif
+}
+
+static FORCE_INLINE void store64(void *dst, uint64_t w) {
+#if defined(NATIVE_LITTLE_ENDIAN)
+	memcpy(dst, &w, sizeof w);
+#else
+	uint8_t *p = (uint8_t *)dst;
+	*p++ = (uint8_t)w;
+	w >>= 8;
+	*p++ = (uint8_t)w;
+	w >>= 8;
+	*p++ = (uint8_t)w;
+	w >>= 8;
+	*p++ = (uint8_t)w;
+	w >>= 8;
+	*p++ = (uint8_t)w;
+	w >>= 8;
+	*p++ = (uint8_t)w;
+	w >>= 8;
+	*p++ = (uint8_t)w;
+	w >>= 8;
+	*p++ = (uint8_t)w;
+#endif
+}
--- a/src/common.hpp
+++ b/src/common.hpp
@ -21,62 +21,68 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.

 #include <cstdint>
 #include <iostream>
+#include "blake2/endian.h"

 namespace RandomX {

 	using addr_t = uint32_t;

-	constexpr int RoundToNearest = 0;
-	constexpr int RoundDown = 1;
-	constexpr int RoundUp = 2;
-	constexpr int RoundToZero = 3;
-
 	constexpr int SeedSize = 32;
-	constexpr int ResultSize = 32;
+	constexpr int ResultSize = 64;

-	constexpr int CacheBlockSize = 1024;
-	constexpr int CacheShift = CacheBlockSize / 2;
-	constexpr int BlockExpansionRatio = 64;
-	constexpr uint32_t DatasetBlockSize = BlockExpansionRatio * CacheBlockSize;
-	constexpr uint32_t DatasetBlockCount = 65536;
-	constexpr uint32_t CacheSize = DatasetBlockCount * CacheBlockSize;
-	constexpr uint64_t DatasetSize = (uint64_t)DatasetBlockCount * DatasetBlockSize;
-
-	constexpr int ArgonIterations = 12;
-	constexpr uint32_t ArgonMemorySize = 65536; //KiB
+	constexpr int ArgonIterations = 3;
+	constexpr uint32_t ArgonMemorySize = 262144; //KiB
 	constexpr int ArgonLanes = 1;
 	const char ArgonSalt[] = "Monero\x1A$";
 	constexpr int ArgonSaltSize = sizeof(ArgonSalt) - 1;

+	constexpr int CacheLineSize = 64;
+	constexpr uint32_t CacheLineAlignMask = 0xFFFFFFFF & ~(CacheLineSize - 1);
+	constexpr uint64_t DatasetSize = 4ULL * 1024 * 1024 * 1024; //4 GiB
+	constexpr uint32_t CacheSize = ArgonMemorySize * 1024;
+	constexpr int CacheBlockCount = CacheSize / CacheLineSize;
+	constexpr int BlockExpansionRatio = DatasetSize / CacheSize;
+	constexpr int DatasetBlockCount = BlockExpansionRatio * CacheBlockCount;
+	constexpr int DatasetIterations = 16;
+
+
 #ifdef TRACE
 	constexpr bool trace = true;
 #else
 	constexpr bool trace = false;
 #endif

-	union convertible_t {
-		double f64;
-		int64_t i64;
-		uint64_t u64;
-		int32_t i32;
-		uint32_t u32;
-		struct {
-			int32_t i32lo;
-			int32_t i32hi;
-		};
-	};
+#ifndef UNREACHABLE
+#ifdef __GNUC__
+#define UNREACHABLE __builtin_unreachable()
+#elif _MSC_VER
+#define UNREACHABLE __assume(false)
+#else
+#define UNREACHABLE
+#endif
+#endif
+
+	using int_reg_t = uint64_t;

 	struct fpu_reg_t {
-		convertible_t lo;
-		convertible_t hi;
+		double lo;
+		double hi;
 	};

-	constexpr int ProgramLength = 512;
-	constexpr uint32_t InstructionCount = 1024 * 1024;
-	constexpr uint32_t ScratchpadSize = 256 * 1024;
-	constexpr uint32_t ScratchpadLength = ScratchpadSize / sizeof(convertible_t);
-	constexpr uint32_t ScratchpadL1 = ScratchpadSize / 16 / sizeof(convertible_t);
-	constexpr uint32_t ScratchpadL2 = ScratchpadSize / sizeof(convertible_t);
+	constexpr int ProgramLength = 256;
+	constexpr uint32_t InstructionCount = 2048;
+	constexpr uint32_t ScratchpadSize = 2 * 1024 * 1024;
+	constexpr uint32_t ScratchpadLength = ScratchpadSize / sizeof(int_reg_t);
+	constexpr uint32_t ScratchpadL1 = ScratchpadSize / 128 / sizeof(int_reg_t);
+	constexpr uint32_t ScratchpadL2 = ScratchpadSize / 8 / sizeof(int_reg_t);
+	constexpr uint32_t ScratchpadL3 = ScratchpadSize / sizeof(int_reg_t);
+	constexpr int ScratchpadL1Mask = (ScratchpadL1 - 1) * 8;
+	constexpr int ScratchpadL2Mask = (ScratchpadL2 - 1) * 8;
+	constexpr int ScratchpadL1Mask16 = (ScratchpadL1 / 2 - 1) * 16;
+	constexpr int ScratchpadL2Mask16 = (ScratchpadL2 / 2 - 1) * 16;
+	constexpr int ScratchpadL3Mask = (ScratchpadLength - 1) * 8;
+	constexpr int ScratchpadL3Mask64 = (ScratchpadLength / 8 - 1) * 64;
+	constexpr uint32_t TransformationCount = 90;
 	constexpr int RegistersCount = 8;

 	class Cache;
@ -85,38 +91,50 @@ namespace RandomX {
 		return i % RandomX::ProgramLength;
 	}

-	struct LightClientDataset {
-		Cache* cache;
-		uint8_t* block;
-		uint32_t blockNumber;
+	class ILightClientAsyncWorker {
+	public:
+		virtual ~ILightClientAsyncWorker() {}
+		virtual void prepareBlock(addr_t) = 0;
+		virtual void prepareBlocks(void* out, uint32_t startBlock, uint32_t blockCount) = 0;
+		virtual const uint64_t* getBlock(addr_t) = 0;
+		virtual void getBlocks(void* out, uint32_t startBlock, uint32_t blockCount) = 0;
+		virtual void sync() = 0;
+		const Cache* getCache() {
+			return cache;
+		}
+	protected:
+		ILightClientAsyncWorker(const Cache* c) : cache(c) {}
+		const Cache* cache;
 	};

 	union dataset_t {
 		uint8_t* dataset;
 		Cache* cache;
-		LightClientDataset* lightDataset;
+		ILightClientAsyncWorker* asyncWorker;
 	};

 	struct MemoryRegisters {
-		addr_t ma, mx;
+		addr_t mx, ma;
 		dataset_t ds;
 	};

 	static_assert(sizeof(MemoryRegisters) == 2 * sizeof(addr_t) + sizeof(uintptr_t), "Invalid alignment of struct RandomX::MemoryRegisters");

 	struct RegisterFile {
-		convertible_t r[RegistersCount];
-		fpu_reg_t f[RegistersCount];
+		int_reg_t r[RegistersCount];
+		fpu_reg_t f[RegistersCount / 2];
+		fpu_reg_t e[RegistersCount / 2];
+		fpu_reg_t a[RegistersCount / 2];
 	};

-	static_assert(sizeof(RegisterFile) == 3 * RegistersCount * sizeof(convertible_t), "Invalid alignment of struct RandomX::RegisterFile");
+	static_assert(sizeof(RegisterFile) == 256, "Invalid alignment of struct RandomX::RegisterFile");

-	typedef convertible_t(*DatasetReadFunc)(addr_t, MemoryRegisters&);
+	typedef void(*DatasetReadFunc)(addr_t, MemoryRegisters&, int_reg_t(&reg)[RegistersCount]);

-	typedef void(*ProgramFunc)(RegisterFile&, MemoryRegisters&, convertible_t*);
+	typedef void(*ProgramFunc)(RegisterFile&, MemoryRegisters&, uint8_t* /* scratchpad */, uint64_t);

 	extern "C" {
-		void executeProgram(RegisterFile&, MemoryRegisters&, convertible_t*, DatasetReadFunc);
+		void executeProgram(RegisterFile&, MemoryRegisters&, uint8_t* /* scratchpad */, uint64_t);
 	}
 }
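
Note on the new constants in common.hpp: the derived sizes and scratchpad masks can be checked with plain arithmetic. The sketch below restates a subset of the definitions in a separate namespace (`sketch`, a hypothetical name) and pins the resulting values with `static_assert`. The helper `l1Offset` is likewise hypothetical and only illustrates how such a mask might be applied to an address.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {
    // Restated from the patch (types widened slightly to avoid narrowing).
    constexpr uint32_t ArgonMemorySize = 262144; // KiB
    constexpr uint32_t CacheLineSize = 64;
    constexpr uint64_t DatasetSize = 4ULL * 1024 * 1024 * 1024; // 4 GiB
    constexpr uint32_t CacheSize = ArgonMemorySize * 1024;      // 256 MiB
    constexpr uint64_t CacheBlockCount = CacheSize / CacheLineSize;
    constexpr uint64_t BlockExpansionRatio = DatasetSize / CacheSize;
    constexpr uint64_t DatasetBlockCount = BlockExpansionRatio * CacheBlockCount;

    constexpr uint32_t ScratchpadSize = 2 * 1024 * 1024;
    constexpr uint32_t ScratchpadLength = ScratchpadSize / sizeof(uint64_t);
    constexpr uint32_t ScratchpadL1 = ScratchpadSize / 128 / sizeof(uint64_t);
    constexpr uint32_t ScratchpadL2 = ScratchpadSize / 8 / sizeof(uint64_t);
    constexpr uint32_t ScratchpadL1Mask = (ScratchpadL1 - 1) * 8;
    constexpr uint32_t ScratchpadL2Mask = (ScratchpadL2 - 1) * 8;
    constexpr uint32_t ScratchpadL3Mask64 = (ScratchpadLength / 8 - 1) * 64;

    static_assert(CacheSize == 268435456u, "cache is 256 MiB");
    static_assert(CacheBlockCount == 4194304u, "2^22 cache lines");
    static_assert(BlockExpansionRatio == 16, "dataset is 16x the cache");
    static_assert(DatasetBlockCount * CacheLineSize == DatasetSize, "lines cover the dataset");
    static_assert(ScratchpadL1 * 8 == 16 * 1024, "L1 region is 16 KiB");
    static_assert(ScratchpadL2 * 8 == 256 * 1024, "L2 region is 256 KiB");
    static_assert(ScratchpadL1Mask == 16376, "8-byte aligned offset below 16 KiB");
    static_assert(ScratchpadL2Mask == 262136, "8-byte aligned offset below 256 KiB");
    static_assert(ScratchpadL3Mask64 == 2097088, "64-byte aligned offset below 2 MiB");

    // Hypothetical helper: maps an arbitrary address to an 8-byte aligned L1 offset.
    constexpr uint32_t l1Offset(uint32_t addr) {
        return addr & ScratchpadL1Mask;
    }
}
```

The 16 KiB and 256 KiB regions match the L1 and L2 scratchpad levels described in the README part of this commit.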

--- a/src/dataset.cpp
+++ b/src/dataset.cpp
@ -24,156 +24,103 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.

 #include "common.hpp"
 #include "dataset.hpp"
-#include "Pcg32.hpp"
 #include "Cache.hpp"
+#include "virtualMemory.hpp"
+#include "softAes.h"
+#include "squareHash.h"
+#include "blake2/endian.h"

 #if defined(__SSE2__)
 #include <wmmintrin.h>
-#define PREFETCH(memory) _mm_prefetch((const char *)((memory).ds.dataset + (memory).ma), _MM_HINT_T0)
+#define PREFETCHNTA(x) _mm_prefetch((const char *)(x), _MM_HINT_NTA)
 #else
-#define PREFETCH(memory)
+#define PREFETCHNTA(x)
 #endif

 namespace RandomX {

-	template<typename T>
-	static inline void shuffle(T* buffer, size_t bytes, Pcg32& gen) {
-		auto count = bytes / sizeof(T);
-		for (auto i = count - 1; i >= 1; --i) {
-			int j = gen.getUniform(0, i);
-			std::swap(buffer[j], buffer[i]);
-		}
+	void initBlock(const uint8_t* cache, uint8_t* out, uint32_t blockNumber, const KeysContainer& keys) {
+		uint64_t r0, r1, r2, r3, r4, r5, r6, r7;
+
+		r0 = 4ULL * blockNumber;
+		r1 = r2 = r3 = r4 = r5 = r6 = r7 = 0;
+
+		constexpr uint32_t mask = (CacheSize - 1) & CacheLineAlignMask;
+
+		for (auto i = 0; i < DatasetIterations; ++i) {
+			const uint8_t* mixBlock = cache + (r0 & mask);
+			PREFETCHNTA(mixBlock);
+			r0 = squareHash(r0);
+			r0 ^= load64(mixBlock + 0);
+			r1 ^= load64(mixBlock + 8);
+			r2 ^= load64(mixBlock + 16);
+			r3 ^= load64(mixBlock + 24);
+			r4 ^= load64(mixBlock + 32);
+			r5 ^= load64(mixBlock + 40);
+			r6 ^= load64(mixBlock + 48);
+			r7 ^= load64(mixBlock + 56);
 		}

-	template<bool soft>
-	static inline __m128i aesenc(__m128i in, __m128i key) {
-		return soft ? soft_aesenc(in, key) : _mm_aesenc_si128(in, key);
+		store64(out + 0, r0);
+		store64(out + 8, r1);
+		store64(out + 16, r2);
+		store64(out + 24, r3);
+		store64(out + 32, r4);
+		store64(out + 40, r5);
+		store64(out + 48, r6);
+		store64(out + 56, r7);
 	}

-	template<bool soft>
-	static inline __m128i aesdec(__m128i in, __m128i key) {
-		return soft ? soft_aesdec(in, key) : _mm_aesdec_si128(in, key);
-	}
-
-	template<bool soft, bool enc>
-	void initBlock(const uint8_t* in, uint8_t* out, uint32_t blockNumber, const KeysContainer& keys) {
-		__m128i xin, xout;
-		//Initialization vector = block number extended to 128 bits
-		xout = _mm_cvtsi32_si128(blockNumber);
-		//Expand + AES
-		for (uint32_t i = 0; i < DatasetBlockSize / sizeof(__m128i); ++i) {
-			if ((i % 32) == 0) {
-				xin = _mm_set_epi64x(*(uint64_t*)(in + i / 4), 0);
-				xout = _mm_xor_si128(xin, xout);
-			}
-			if (enc) {
-				xout = aesenc<soft>(xout, keys[0]);
-				xout = aesenc<soft>(xout, keys[1]);
-				xout = aesenc<soft>(xout, keys[2]);
-				xout = aesenc<soft>(xout, keys[3]);
-				xout = aesenc<soft>(xout, keys[4]);
-				xout = aesenc<soft>(xout, keys[5]);
-				xout = aesenc<soft>(xout, keys[6]);
-				xout = aesenc<soft>(xout, keys[7]);
-				xout = aesenc<soft>(xout, keys[8]);
-				xout = aesenc<soft>(xout, keys[9]);
-			}
-			else {
-				xout = aesdec<soft>(xout, keys[0]);
-				xout = aesdec<soft>(xout, keys[1]);
-				xout = aesdec<soft>(xout, keys[2]);
-				xout = aesdec<soft>(xout, keys[3]);
-				xout = aesdec<soft>(xout, keys[4]);
-				xout = aesdec<soft>(xout, keys[5]);
-				xout = aesdec<soft>(xout, keys[6]);
-				xout = aesdec<soft>(xout, keys[7]);
-				xout = aesdec<soft>(xout, keys[8]);
-				xout = aesdec<soft>(xout, keys[9]);
-			}
-			_mm_store_si128((__m128i*)(out + i * sizeof(__m128i)), xout);
-		}
-		//Shuffle
-		Pcg32 gen(&xout);
-		shuffle<uint32_t>((uint32_t*)out, DatasetBlockSize, gen);
-	}
-
-	template
-		void initBlock<true, true>(const uint8_t*, uint8_t*, uint32_t, const KeysContainer&);
-
-	template
-		void initBlock<true, false>(const uint8_t*, uint8_t*, uint32_t, const KeysContainer&);
-
-	template
-		void initBlock<false, true>(const uint8_t*, uint8_t*, uint32_t, const KeysContainer&);
-
-	template
-		void initBlock<false, false>(const uint8_t*, uint8_t*, uint32_t, const KeysContainer&);
-
-	convertible_t datasetRead(addr_t addr, MemoryRegisters& memory) {
-		convertible_t data;
-		data.u64 = *(uint64_t*)(memory.ds.dataset + memory.ma);
-		memory.ma += 8;
+	void datasetRead(addr_t addr, MemoryRegisters& memory, RegisterFile& reg) {
+		uint64_t* datasetLine = (uint64_t*)(memory.ds.dataset + memory.ma);
 		memory.mx ^= addr;
-		if ((memory.mx & 0xFFF8) == 0) {
-			memory.ma = memory.mx & ~7;
-			PREFETCH(memory);
-		}
-		return data;
+		memory.mx &= -64; //align to cache line
+		std::swap(memory.mx, memory.ma);
+		PREFETCHNTA(memory.ds.dataset + memory.ma);
+		for (int i = 0; i < RegistersCount; ++i)
+			reg.r[i] ^= datasetLine[i];
 	}

-	template<bool softAes>
-	void initBlock(const uint8_t* cache, uint8_t* block, uint32_t blockNumber, const KeysContainer& keys) {
-		if (blockNumber % 2 == 1) {
-			initBlock<softAes, true>(cache + blockNumber * CacheBlockSize, block, blockNumber, keys);
-		}
-		else {
-			initBlock<softAes, false>(cache + blockNumber * CacheBlockSize, block, blockNumber, keys);
-		}
-	}
-
-	template
-		void initBlock<true>(const uint8_t*, uint8_t*, uint32_t, const KeysContainer&);
-
-	template
-		void initBlock<false>(const uint8_t*, uint8_t*, uint32_t, const KeysContainer&);
-
-	template<bool softAes>
-	convertible_t datasetReadLight(addr_t addr, MemoryRegisters& memory) {
-		convertible_t data;
-		LightClientDataset* lds = memory.ds.lightDataset;
-		auto blockNumber = memory.ma / DatasetBlockSize;
-		if (lds->blockNumber != blockNumber) {
-			initBlock<softAes>(lds->cache->getCache(), (uint8_t*)lds->block, blockNumber, lds->cache->getKeys());
-			lds->blockNumber = blockNumber;
-		}
-		data.u64 = *(uint64_t*)(lds->block + (memory.ma % DatasetBlockSize));
-		memory.ma += 8;
+	void datasetReadLight(addr_t addr, MemoryRegisters& memory, int_reg_t (&reg)[RegistersCount]) {
 		memory.mx ^= addr;
-		if ((memory.mx & 0xFFF8) == 0) {
-			memory.ma = memory.mx & ~7;
-		}
-		return data;
+		memory.mx &= CacheLineAlignMask; //align to cache line
+		Cache* cache = memory.ds.cache;
+		uint64_t datasetLine[CacheLineSize / sizeof(uint64_t)];
+		initBlock(cache->getCache(), (uint8_t*)datasetLine, memory.ma / CacheLineSize, cache->getKeys());
+		for (int i = 0; i < RegistersCount; ++i)
+			reg[i] ^= datasetLine[i];
+		std::swap(memory.mx, memory.ma);
 	}

-	template
-		convertible_t datasetReadLight<false>(addr_t addr, MemoryRegisters& memory);
+	void datasetReadLightAsync(addr_t addr, MemoryRegisters& memory, int_reg_t(&reg)[RegistersCount]) {
+		ILightClientAsyncWorker* aw = memory.ds.asyncWorker;
+		const uint64_t* datasetLine = aw->getBlock(memory.ma);
+		for (int i = 0; i < RegistersCount; ++i)
+			reg[i] ^= datasetLine[i];
+		memory.mx ^= addr;
+		memory.mx &= CacheLineAlignMask; //align to cache line
+		std::swap(memory.mx, memory.ma);
+		aw->prepareBlock(memory.ma);
+	}

-	template
-		convertible_t datasetReadLight<true>(addr_t addr, MemoryRegisters& memory);
-
-	void datasetAlloc(dataset_t& ds) {
+	void datasetAlloc(dataset_t& ds, bool largePages) {
 		if (sizeof(size_t) <= 4)
 			throw std::runtime_error("Platform doesn't support enough memory for the dataset");
-		ds.dataset = (uint8_t*)_mm_malloc(DatasetSize, /*sizeof(__m128i)*/ 64);
+		if (largePages) {
+			ds.dataset = (uint8_t*)allocLargePagesMemory(DatasetSize);
+		}
+		else {
+			ds.dataset = (uint8_t*)_mm_malloc(DatasetSize, 64);
 			if (ds.dataset == nullptr) {
 				throw std::runtime_error("Dataset memory allocation failed. >4 GiB of free virtual memory is needed.");
 			}
 		}
+	}

 	template<bool softAes>
 	void datasetInit(Cache* cache, dataset_t ds, uint32_t startBlock, uint32_t blockCount) {
 		for (uint32_t i = startBlock; i < startBlock + blockCount; ++i) {
-			initBlock<softAes>(cache->getCache(), ds.dataset + i * DatasetBlockSize, i, cache->getKeys());
+			initBlock(cache->getCache(), ds.dataset + i * CacheLineSize, i, cache->getKeys());
 		}
 	}

@ -184,14 +131,26 @@ namespace RandomX {
 		void datasetInit<true>(Cache*, dataset_t, uint32_t, uint32_t);

 	template<bool softAes>
-	void datasetInitCache(const void* seed, dataset_t& ds) {
-		ds.cache = new Cache();
+	void datasetInitCache(const void* seed, dataset_t& ds, bool largePages) {
+		ds.cache = new(Cache::alloc(largePages)) Cache();
 		ds.cache->initialize<softAes>(seed, SeedSize);
 	}

 	template
-		void datasetInitCache<false>(const void*, dataset_t&);
+		void datasetInitCache<false>(const void*, dataset_t&, bool);

 	template
-		void datasetInitCache<true>(const void*, dataset_t&);
+		void datasetInitCache<true>(const void*, dataset_t&, bool);
+
+	template<bool softAes>
+	void aesBench(uint32_t blockCount) {
+		alignas(16) KeysContainer keys;
+		alignas(16) uint8_t buffer[CacheLineSize];
+		for (uint32_t block = 0; block < blockCount; ++block) {
+			initBlock(buffer, buffer, 0, keys);
+		}
+	}
+
+	template void aesBench<false>(uint32_t blockCount);
+	template void aesBench<true>(uint32_t blockCount);
 }
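
Note on the `mx`/`ma` handling in the dataset read functions above: each read consumes the line at the current `ma`, folds the new address into `mx`, aligns it to a cache line, and then swaps the two registers, so the address computed in one step is the one consumed in the next (which is what makes the prefetch useful). A minimal sketch of that bookkeeping, detached from the real dataset, is shown below. `MemRegs`, `kLineMask`, and `step` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical stand-in for the mx/ma pair of MemoryRegisters.
struct MemRegs {
    uint32_t mx;
    uint32_t ma;
};

// Same value as CacheLineAlignMask in the patch: clears the low 6 bits.
constexpr uint32_t kLineMask = 0xFFFFFFFFu & ~(64u - 1);

// Performs the address bookkeeping of one dataset read and returns the
// offset of the cache line that this step consumes.
inline uint32_t step(MemRegs& m, uint32_t addr) {
    const uint32_t readNow = m.ma; // line consumed by this step
    m.mx ^= addr;                  // mix in the new address
    m.mx &= kLineMask;             // align to a cache line
    std::swap(m.mx, m.ma);         // m.ma now names the line for the next step
    return readNow;
}
```

Starting from `mx = 0, ma = 128`, a call with `addr = 0x12345` reads line 128 and leaves `ma = 0x12340`, which is then the line read by the following call.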
--- a/src/dataset.hpp
+++ b/src/dataset.hpp
@ -23,7 +23,6 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 #include <array>
 #include "intrinPortable.h"
 #include "common.hpp"
-#include "softAes.h"

 namespace RandomX {

@ -32,20 +31,23 @@ namespace RandomX {
 	template<bool soft, bool enc>
 	void initBlock(const uint8_t* in, uint8_t* out, uint32_t blockNumber, const KeysContainer& keys);

-	template<bool softAes>
 	void initBlock(const uint8_t* cache, uint8_t* block, uint32_t blockNumber, const KeysContainer& keys);

-	void datasetAlloc(dataset_t& ds);
+	void datasetAlloc(dataset_t& ds, bool largePages);

 	template<bool softAes>
 	void datasetInit(Cache* cache, dataset_t ds, uint32_t startBlock, uint32_t blockCount);

-	convertible_t datasetRead(addr_t addr, MemoryRegisters& memory);
+	void datasetRead(addr_t addr, MemoryRegisters& memory, RegisterFile&);

 	template<bool softAes>
-	void datasetInitCache(const void* seed, dataset_t& dataset);
+	void datasetInitCache(const void* seed, dataset_t& dataset, bool largePages);
+
+	void datasetReadLight(addr_t addr, MemoryRegisters& memory, int_reg_t(&reg)[RegistersCount]);
+
+	void datasetReadLightAsync(addr_t addr, MemoryRegisters& memory, int_reg_t(&reg)[RegistersCount]);

 	template<bool softAes>
-	convertible_t datasetReadLight(addr_t addr, MemoryRegisters& memory);
+	void aesBench(uint32_t blockCount);
 }

--- a/src/divideByConstantCodegen.c
+++ b/src/divideByConstantCodegen.c
@ -0,0 +1,169 @@
+/*
+  Reference implementations of computing and using the "magic number" approach to dividing
+  by constants, including codegen instructions. The unsigned division incorporates the
+  "round down" optimization per ridiculous_fish.
+
+  This is free and unencumbered software. Any copyright is dedicated to the Public Domain.
+*/
+
+#include <limits.h> //for CHAR_BIT
+#include <assert.h>
+
+#include "divideByConstantCodegen.h"
+
+struct magicu_info compute_unsigned_magic_info(unsigned_type D, unsigned num_bits) {
+
+	//The numerator must fit in an unsigned_type
+	assert(num_bits > 0 && num_bits <= sizeof(unsigned_type) * CHAR_BIT);
+
+	// D must be larger than zero and not a power of 2
+	assert(D & (D - 1));
+
+	// The eventual result
+	struct magicu_info result;
+
+	// Bits in an unsigned_type
+	const unsigned UINT_BITS = sizeof(unsigned_type) * CHAR_BIT;
+
+	// The extra shift implicit in the difference between UINT_BITS and num_bits
+	const unsigned extra_shift = UINT_BITS - num_bits;
+
+	// The initial power of 2 is one less than the first one that can possibly work
+	const unsigned_type initial_power_of_2 = (unsigned_type)1 << (UINT_BITS - 1);
+
+	// The remainder and quotient of our power of 2 divided by d
+	unsigned_type quotient = initial_power_of_2 / D, remainder = initial_power_of_2 % D;
+
+	// ceil(log_2 D)
+	unsigned ceil_log_2_D;
+
+	// The magic info for the variant "round down" algorithm
+	unsigned_type down_multiplier = 0;
+	unsigned down_exponent = 0;
+	int has_magic_down = 0;
+
+	// Compute ceil(log_2 D)
+	ceil_log_2_D = 0;
+	unsigned_type tmp;
+	for (tmp = D; tmp > 0; tmp >>= 1)
+		ceil_log_2_D += 1;
+
+
+	// Begin a loop that increments the exponent, until we find a power of 2 that works.
+	unsigned exponent;
+	for (exponent = 0; ; exponent++) {
+		// Quotient and remainder are from the previous exponent; compute them for this exponent.
+		if (remainder >= D - remainder) {
+			// Doubling remainder will wrap around D
+			quotient = quotient * 2 + 1;
+			remainder = remainder * 2 - D;
+		}
+		else {
+			// Remainder will not wrap
+			quotient = quotient * 2;
+			remainder = remainder * 2;
+		}
+
+		// We're done if this exponent works for the round_up algorithm.
+		// Note that exponent may be larger than the maximum shift supported,
+		// so the check for >= ceil_log_2_D is critical.
+		if ((exponent + extra_shift >= ceil_log_2_D) || (D - remainder) <= ((unsigned_type)1 << (exponent + extra_shift)))
+			break;
+
+		// Set magic_down if we have not set it yet and this exponent works for the round_down algorithm
+		if (!has_magic_down && remainder <= ((unsigned_type)1 << (exponent + extra_shift))) {
+			has_magic_down = 1;
+			down_multiplier = quotient;
+			down_exponent = exponent;
+		}
+	}
+
+	if (exponent < ceil_log_2_D) {
+		// magic_up is efficient
+		result.multiplier = quotient + 1;
+		result.pre_shift = 0;
+		result.post_shift = exponent;
+		result.increment = 0;
+	}
+	else if (D & 1) {
+		// Odd divisor, so use magic_down, which must have been set
+		assert(has_magic_down);
+		result.multiplier = down_multiplier;
+		result.pre_shift = 0;
+		result.post_shift = down_exponent;
+		result.increment = 1;
+	}
+	else {
+		// Even divisor, so use a prefix-shifted dividend
+		unsigned pre_shift = 0;
+		unsigned_type shifted_D = D;
+		while ((shifted_D & 1) == 0) {
+			shifted_D >>= 1;
+			pre_shift += 1;
+		}
+		result = compute_unsigned_magic_info(shifted_D, num_bits - pre_shift);
+		assert(result.increment == 0 && result.pre_shift == 0); //expect no increment or pre_shift in this path
+		result.pre_shift = pre_shift;
+	}
+	return result;
+}
+
+struct magics_info compute_signed_magic_info(signed_type D) {
+	// D must not be zero and must not be a power of 2 (or its negative)
+	assert(D != 0 && (D & -D) != D && (D & -D) != -D);
+
+	// Our result
+	struct magics_info result;
+
+	// Bits in a signed_type
+	const unsigned SINT_BITS = sizeof(signed_type) * CHAR_BIT;
+
+	// Absolute value of D (we know D is not the most negative value since that's a power of 2)
+	const unsigned_type abs_d = (D < 0 ? -D : D);
+
+	// The initial power of 2 is one less than the first one that can possibly work
+	// "two31" in Warren
+	unsigned exponent = SINT_BITS - 1;
+	const unsigned_type initial_power_of_2 = (unsigned_type)1 << exponent;
+
+	// Compute the absolute value of our "test numerator,"
+	// which is the largest dividend whose remainder with d is d-1.
+	// This is called anc in Warren.
+	const unsigned_type tmp = initial_power_of_2 + (D < 0);
+	const unsigned_type abs_test_numer = tmp - 1 - tmp % abs_d;
+
+	// Initialize our quotients and remainders (q1, r1, q2, r2 in Warren)
+	unsigned_type quotient1 = initial_power_of_2 / abs_test_numer, remainder1 = initial_power_of_2 % abs_test_numer;
+	unsigned_type quotient2 = initial_power_of_2 / abs_d, remainder2 = initial_power_of_2 % abs_d;
+	unsigned_type delta;
+
+	// Begin our loop
+	do {
+		// Update the exponent
+		exponent++;
+
+		// Update quotient1 and remainder1
+		quotient1 *= 2;
+		remainder1 *= 2;
+		if (remainder1 >= abs_test_numer) {
+			quotient1 += 1;
+			remainder1 -= abs_test_numer;
+		}
+
+		// Update quotient2 and remainder2
+		quotient2 *= 2;
+		remainder2 *= 2;
+		if (remainder2 >= abs_d) {
+			quotient2 += 1;
+			remainder2 -= abs_d;
+		}
+
+		// Keep going as long as (2**exponent) / abs_d <= delta
+		delta = abs_d - remainder2;
+	} while (quotient1 < delta || (quotient1 == delta && remainder1 == 0));
+
+	result.multiplier = quotient2 + 1;
+	if (D < 0) result.multiplier = -result.multiplier;
+	result.shift = exponent - SINT_BITS;
+	return result;
+}
--- a/src/divideByConstantCodegen.h
+++ b/src/divideByConstantCodegen.h
@ -0,0 +1,117 @@
+/*
+Copyright (c) 2018 tevador
+
+This file is part of RandomX.
+
+RandomX is free software: you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation, either version 3 of the License, or
+(at your option) any later version.
+
+RandomX is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
+*/
+
+#pragma once
+#include <stdint.h>
+
+#if defined(__cplusplus)
+extern "C" {
+#endif
+
+	typedef uint64_t unsigned_type;
+	typedef int64_t signed_type;
+
+	/* Computes "magic info" for performing signed division by a fixed integer D.
+	   The type 'signed_type' is assumed to be defined as a signed integer type large enough
+	   to hold both the dividend and the divisor.
+	   Here >> is arithmetic (signed) shift, and >>> is logical shift.
+
+	   To emit code for n/d, rounding towards zero, use the following sequence:
+
+		 m = compute_signed_magic_info(D)
+		 emit("result = (m.multiplier * n) >> SINT_BITS");
+		 if d > 0 and m.multiplier < 0: emit("result += n")
+		 if d < 0 and m.multiplier > 0: emit("result -= n")
+		 if m.shift > 0: emit("result >>= m.shift")
+		 emit("result += (result < 0)")
+
+	  The shifts by SINT_BITS may be "free" if the high half of the full multiply
+	  is put in a separate register.
+
+	  The final add can of course be implemented via the sign bit, e.g.
+		  result += (result >>> (SINT_BITS - 1))
+	   or
+		  result -= (result >> (SINT_BITS - 1))
+
+	   This code is heavily indebted to Hacker's Delight by Henry Warren.
+	   See http://www.hackersdelight.org/HDcode/magic.c.txt
+	   Used with permission from http://www.hackersdelight.org/permissions.htm
+	 */
+
+	struct magics_info {
+		signed_type multiplier; // the "magic number" multiplier
+		unsigned shift; // shift for the dividend after multiplying
+	};
+	struct magics_info compute_signed_magic_info(signed_type D);
+
+
+	/* Computes "magic info" for performing unsigned division by a fixed positive integer D.
+	   The type 'unsigned_type' is assumed to be defined as an unsigned integer type large enough
+	   to hold both the dividend and the divisor. num_bits can be set appropriately if n is
+	   known to be smaller than the largest unsigned_type; if this is not known then pass
+	   (sizeof(unsigned_type) * CHAR_BIT) for num_bits.
+
+	   Assume we have a hardware register of width UINT_BITS, a known constant D which is
+	   not zero and not a power of 2, and a variable n of width num_bits (which may be
+	   up to UINT_BITS). To emit code for n/d, use one of the two following sequences
+	   (here >>> refers to a logical bitshift):
+
+		 m = compute_unsigned_magic_info(D, num_bits)
+		 if m.pre_shift > 0: emit("n >>>= m.pre_shift")
+		 if m.increment: emit("n = saturated_increment(n)")
+		 emit("result = (m.multiplier * n) >>> UINT_BITS")
+		 if m.post_shift > 0: emit("result >>>= m.post_shift")
+
+	   or
+
+		 m = compute_unsigned_magic_info(D, num_bits)
+		 if m.pre_shift > 0: emit("n >>>= m.pre_shift")
+		 emit("result = m.multiplier * n")
+		 if m.increment: emit("result = result + m.multiplier")
+		 emit("result >>>= UINT_BITS")
+		 if m.post_shift > 0: emit("result >>>= m.post_shift")
+
+	  The shifts by UINT_BITS may be "free" if the high half of the full multiply
+	  is put in a separate register.
+
+	  saturated_increment(n) means "increment n unless it would wrap to 0," i.e.
+		if n == (1 << UINT_BITS)-1: result = n
+		else: result = n+1
+	  A common way to implement this is with the carry bit. For example, on x86:
+		 add 1
+		 sbb 0
+
+	  Some invariants:
+	   1: At least one of pre_shift and increment is zero
+	   2: multiplier is never zero
+
+	   This code incorporates the "round down" optimization per ridiculous_fish.
+	 */
+
+	struct magicu_info {
+		unsigned_type multiplier; // the "magic number" multiplier
+		unsigned pre_shift; // shift for the dividend before multiplying
+		unsigned post_shift; //shift for the dividend after multiplying
+		int increment; // 0 or 1; if set then increment the numerator, using one of the two strategies
+	};
+	struct magicu_info compute_unsigned_magic_info(unsigned_type D, unsigned num_bits);
+
+#if defined(__cplusplus)
+}
+#endif
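
Note on the emit sequences documented in divideByConstantCodegen.h: the two unsigned strategies ("round up" with no increment, and "round down" with `increment = 1`) can be illustrated with 32-bit dividends, where the full product fits in a `uint64_t`. The constants below are derived by hand for a 32-bit register width and are an assumption of this sketch. They are not output of `compute_unsigned_magic_info`, which operates on 64-bit types in this patch. The function names `div3` and `div7` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Round-up variant: multiplier = ceil(2^33 / 3) = 0xAAAAAAAB,
// i.e. a shift by UINT_BITS (32) plus post_shift = 1, no increment.
inline uint32_t div3(uint32_t n) {
    return static_cast<uint32_t>((uint64_t{n} * 0xAAAAAAABu) >> 33);
}

// Round-down variant: multiplier = floor(2^34 / 7) = 0x92492492,
// post_shift = 2, increment = 1. This uses the second documented sequence,
// result = m * n + m, so no saturating increment of n is needed.
inline uint32_t div7(uint32_t n) {
    const uint64_t m = 0x92492492u;
    return static_cast<uint32_t>((m * n + m) >> 34);
}
```

Both functions agree with `n / 3` and `n / 7` for every 32-bit `n`. The product `m * n + m` stays below 2^64 because `m < 2^32` and `n + 1 <= 2^32`.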
--- a/src/executeProgram-win64.asm
+++ b/src/executeProgram-win64.asm
@ -15,20 +15,22 @@
 ;# You should have received a copy of the GNU General Public License
 ;# along with RandomX.  If not, see<http://www.gnu.org/licenses/>.

-PUBLIC executeProgram
+IFDEF RAX

-.code
+_RANDOMX_EXECUTE_PROGRAM SEGMENT PAGE READ EXECUTE
+
+PUBLIC executeProgram

 executeProgram PROC
 	; REGISTER ALLOCATION:
 	; rax -> temporary
-	; rbx -> MemoryRegisters& memory
+	; rbx -> "ic"
 	; rcx -> temporary
 	; rdx -> temporary
-	; rsi -> convertible_t& scratchpad
-	; rdi -> "ic" (instruction counter)
-	; rbp	-> beginning of VM stack
-	; rsp -> end of VM stack
+	; rsi -> scratchpad pointer
+	; rdi -> dataset pointer
+	; rbp -> "ma", "mx"
+	; rsp -> stack pointer
 	; r8 	-> "r0"
 	; r9 	-> "r1"
 	; r10 -> "r2"
@ -37,31 +39,22 @@ executeProgram PROC
 	; r13 -> "r5"
 	; r14 -> "r6"
 	; r15 -> "r7"
-	; xmm0 -> temporary
-	; xmm1 -> temporary
+	; xmm0 -> "f0"
+	; xmm1 -> "f1"
 	; xmm2 -> "f2"
 	; xmm3 -> "f3"
-	; xmm4 -> "f4"
-	; xmm5 -> "f5"
-	; xmm6 -> "f6"
-	; xmm7 -> "f7"
-	; xmm8 -> "f0"
-	; xmm9 -> "f1"
-	; xmm10 -> absolute value mask
-
-	; STACK STRUCTURE:
-	;   |
-	;   |
-	;   | saved registers
-	;   |
-	;   v
-	; [rbp] RegisterFile& registerFile
-	;   |
-	;   |
-	;   | VM stack
-	;   |
-	;   v
-	; [rsp] last element of VM stack
+	; xmm4 -> "e0"
+	; xmm5 -> "e1"
+	; xmm6 -> "e2"
+	; xmm7 -> "e3"
+	; xmm8 -> "a0"
+	; xmm9 -> "a1"
+	; xmm10 -> "a2"
+	; xmm11 -> "a3"
+	; xmm12 -> temporary
+	; xmm13 -> DBL_MIN
+	; xmm14 -> absolute value mask
+	; xmm15 -> sign mask

 	; store callee-saved registers
 	push rbx
@ -78,95 +71,131 @@ executeProgram PROC
 	movdqu xmmword ptr [rsp+32], xmm8
 	movdqu xmmword ptr [rsp+16], xmm9
 	movdqu xmmword ptr [rsp+0], xmm10
+	sub rsp, 80
+	movdqu xmmword ptr [rsp+64], xmm11
+	movdqu xmmword ptr [rsp+48], xmm12
+	movdqu xmmword ptr [rsp+32], xmm13
+	movdqu xmmword ptr [rsp+16], xmm14
+	movdqu xmmword ptr [rsp+0], xmm15

 	; function arguments
 	push rcx                    ; RegisterFile& registerFile
-	mov rbx, rdx		; MemoryRegisters& memory
-	mov rsi, r8			; convertible_t& scratchpad
-	push r9
+	mov rbp, qword ptr [rdx]    ; "mx", "ma"
+	mov eax, ebp                ; "mx"
+	mov rdi, qword ptr [rdx+8]  ; uint8_t* dataset
+	mov rsi, r8                 ; uint8_t* scratchpad
+	mov rbx, r9                 ; loop counter
 	
-	mov rbp, rsp			; beginning of VM stack
-	mov rdi, 1048577	; number of VM instructions to execute + 1
+	;# zero integer registers
+	xor r8, r8
+	xor r9, r9
+	xor r10, r10
+	xor r11, r11
+	xor r12, r12
+	xor r13, r13
+	xor r14, r14
+	xor r15, r15
 	
-	xorps xmm10, xmm10
-	cmpeqpd xmm10, xmm10
-	psrlq xmm10, 1		; mask for absolute value = 0x7fffffffffffffff7fffffffffffffff
+	;# load constant registers
+	lea rcx, [rcx+120]
+	movapd xmm8, xmmword ptr [rcx+72]
+	movapd xmm9, xmmword ptr [rcx+88]
+	movapd xmm10, xmmword ptr [rcx+104]
+	movapd xmm11, xmmword ptr [rcx+120]
+	movapd xmm13, xmmword ptr [minDbl]
+	movapd xmm14, xmmword ptr [absMask]
+	movapd xmm15, xmmword ptr [signMask]

-	; reset rounding mode
-	mov dword ptr [rsp-8], 40896
-	ldmxcsr dword ptr [rsp-8]
+	jmp program_begin

-	; load integer registers
-	mov r8, qword ptr [rcx+0]
-	mov r9, qword ptr [rcx+8]
-	mov r10, qword ptr [rcx+16]
-	mov r11, qword ptr [rcx+24]
-	mov r12, qword ptr [rcx+32]
-	mov r13, qword ptr [rcx+40]
-	mov r14, qword ptr [rcx+48]
-	mov r15, qword ptr [rcx+56]
+ALIGN 64
+minDbl:
+	db 0, 0, 0, 0, 0, 0, 16, 0, 0, 0, 0, 0, 0, 0, 16, 0
+absMask:
+	db 255, 255, 255, 255, 255, 255, 255, 127, 255, 255, 255, 255, 255, 255, 255, 127
+signMask:
+	db 0, 0, 0, 0, 0, 0, 0, 128, 0, 0, 0, 0, 0, 0, 0, 128

-	; load register f0 hi, lo
-	xorps xmm8, xmm8
-	cvtsi2sd xmm8, qword ptr [rcx+72]
-	pslldq xmm8, 8
-	cvtsi2sd xmm8, qword ptr [rcx+64]
-
-	; load register f1 hi, lo
-	xorps xmm9, xmm9
-	cvtsi2sd xmm9, qword ptr [rcx+88]
-	pslldq xmm9, 8
-	cvtsi2sd xmm9, qword ptr [rcx+80]
-
-	; load register f2 hi, lo
-	xorps xmm2, xmm2
-	cvtsi2sd xmm2, qword ptr [rcx+104]
-	pslldq xmm2, 8
-	cvtsi2sd xmm2, qword ptr [rcx+96]
-
-	; load register f3 hi, lo
-	xorps xmm3, xmm3
-	cvtsi2sd xmm3, qword ptr [rcx+120]
-	pslldq xmm3, 8
-	cvtsi2sd xmm3, qword ptr [rcx+112]
-
-	lea rcx, [rcx+64]
-
-	; load register f4 hi, lo
-	xorps xmm4, xmm4
-	cvtsi2sd xmm4, qword ptr [rcx+72]
-	pslldq xmm4, 8
-	cvtsi2sd xmm4, qword ptr [rcx+64]
-
-	; load register f5 hi, lo
-	xorps xmm5, xmm5
-	cvtsi2sd xmm5, qword ptr [rcx+88]
-	pslldq xmm5, 8
-	cvtsi2sd xmm5, qword ptr [rcx+80]
-
-	; load register f6 hi, lo
-	xorps xmm6, xmm6
-	cvtsi2sd xmm6, qword ptr [rcx+104]
-	pslldq xmm6, 8
-	cvtsi2sd xmm6, qword ptr [rcx+96]
-
-	; load register f7 hi, lo
-	xorps xmm7, xmm7
-	cvtsi2sd xmm7, qword ptr [rcx+120]
-	pslldq xmm7, 8
-	cvtsi2sd xmm7, qword ptr [rcx+112]
-
-	; program body
+ALIGN 64
+program_begin:
+	xor rax, r8                      ;# read address register 1
+	xor rax, r9                      ;# read address register 2
+	mov rdx, rax
+	and eax, 1048512
+	push rax
+	lea rcx, [rsi+rax]
+	xor r8,  qword ptr [rcx+0]
+	xor r9,  qword ptr [rcx+8]
+	xor r10, qword ptr [rcx+16]
+	xor r11, qword ptr [rcx+24]
+	xor r12, qword ptr [rcx+32]
+	xor r13, qword ptr [rcx+40]
+	xor r14, qword ptr [rcx+48]
+	xor r15, qword ptr [rcx+56]
+	ror rdx, 32
+	and edx, 1048512
+	push rdx
+	lea rcx, [rsi+rdx]
+	cvtdq2pd xmm0, qword ptr [rcx+0]
+	cvtdq2pd xmm1, qword ptr [rcx+8]
+	cvtdq2pd xmm2, qword ptr [rcx+16]
+	cvtdq2pd xmm3, qword ptr [rcx+24]
+	cvtdq2pd xmm4, qword ptr [rcx+32]
+	cvtdq2pd xmm5, qword ptr [rcx+40]
+	cvtdq2pd xmm6, qword ptr [rcx+48]
+	cvtdq2pd xmm7, qword ptr [rcx+56]
+	andps xmm4, xmm14
+	andps xmm5, xmm14
+	andps xmm6, xmm14
+	andps xmm7, xmm14

+	;# 256 instructions
 	include program.inc

-rx_finish:
-	; unroll the stack
-	mov rsp, rbp
+	mov eax, r8d                       ;# read address register 1
+	xor eax, r9d                       ;# read address register 2
+	xor rbp, rax                       ;# modify "mx"
+	and rbp, -64                       ;# align "mx" to the start of a cache line
+	mov edx, ebp                       ;# edx = mx
+	prefetchnta byte ptr [rdi+rdx]
+	ror rbp, 32                        ;# swap "ma" and "mx"
+	mov edx, ebp                       ;# edx = ma
+	lea rcx, [rdi+rdx]                 ;# dataset cache line
+	xor r8,  qword ptr [rcx+0]
+	xor r9,  qword ptr [rcx+8]
+	xor r10, qword ptr [rcx+16]
+	xor r11, qword ptr [rcx+24]
+	xor r12, qword ptr [rcx+32]
+	xor r13, qword ptr [rcx+40]
+	xor r14, qword ptr [rcx+48]
+	xor r15, qword ptr [rcx+56]
+	pop rax
+	lea rcx, [rsi+rax]
+	mov qword ptr [rcx+0], r8
+	mov qword ptr [rcx+8], r9
+	mov qword ptr [rcx+16], r10
+	mov qword ptr [rcx+24], r11
+	mov qword ptr [rcx+32], r12
+	mov qword ptr [rcx+40], r13
+	mov qword ptr [rcx+48], r14
+	mov qword ptr [rcx+56], r15
+	pop rax
+	lea rcx, [rsi+rax]
+	mulpd xmm0, xmm4
+	mulpd xmm1, xmm5
+	mulpd xmm2, xmm6
+	mulpd xmm3, xmm7
+	movapd xmmword ptr [rcx+0], xmm0
+	movapd xmmword ptr [rcx+16], xmm1
+	movapd xmmword ptr [rcx+32], xmm2
+	movapd xmmword ptr [rcx+48], xmm3
+	xor eax, eax
+	dec ebx
+	jnz program_begin
 	
+rx_finish:
 	; save VM register values
 	pop rcx
-	pop rcx
 	mov qword ptr [rcx+0], r8
 	mov qword ptr [rcx+8], r9
 	mov qword ptr [rcx+16], r10
@ -175,8 +204,8 @@ rx_finish:
 	mov qword ptr [rcx+40], r13
 	mov qword ptr [rcx+48], r14
 	mov qword ptr [rcx+56], r15
-	movdqa xmmword ptr [rcx+64], xmm8
-	movdqa xmmword ptr [rcx+80], xmm9
+	movdqa xmmword ptr [rcx+64], xmm0
+	movdqa xmmword ptr [rcx+80], xmm1
 	movdqa xmmword ptr [rcx+96], xmm2
 	movdqa xmmword ptr [rcx+112], xmm3
 	lea rcx, [rcx+64]
@ -186,6 +215,12 @@ rx_finish:
 	movdqa xmmword ptr [rcx+112], xmm7

 	; load callee-saved registers
+	movdqu xmm15, xmmword ptr [rsp]
+	movdqu xmm14, xmmword ptr [rsp+16]
+	movdqu xmm13, xmmword ptr [rsp+32]
+	movdqu xmm12, xmmword ptr [rsp+48]
+	movdqu xmm11, xmmword ptr [rsp+64]
+	add rsp, 80
 	movdqu xmm10, xmmword ptr [rsp]
 	movdqu xmm9, xmmword ptr [rsp+16]
 	movdqu xmm8, xmmword ptr [rsp+32]
@ -202,57 +237,50 @@ rx_finish:
 	pop rbx

 	; return
-	ret	0
+	ret
 	
-rx_read_dataset:
-	push r8
-	push r9
-	push r10
-	push r11
-	mov rdx, rbx
-	movd qword ptr [rsp - 8], xmm1
-	movd qword ptr [rsp - 16], xmm2
-	sub rsp, 48
-	call qword ptr [rbp]
-	add rsp, 48
-	movd xmm2, qword ptr [rsp - 16]
-	movd xmm1, qword ptr [rsp - 8]
-	pop r11
-	pop r10
-	pop r9
-	pop r8
-	ret 0
+TransformAddress MACRO reg32, reg64
+;# Transforms the address in the register so that the transformed address
+;# lies in a different cache line than the original address (mod 2^N).
+;# This is done to prevent a load-store dependency.
+;# There are 3 different transformations that can be used: x -> 9*x+C, x -> x+C, x -> x^C
+	;lea reg32, [reg64+reg64*8+127]  ;# C = -119 -110 -101 -92 -83 -74 -65 -55 -46 -37 -28 -19 -10 -1 9 18 27 36 45 54 63 73 82 91 100 109 118 127
+	db 64
+	add reg32, -39                   ;# C = all except -7 to +7
+	;xor reg32, -8                   ;# C = all except 0 to 7
+ENDM

-rx_read_dataset_r:
-	mov edx, dword ptr [rbx]	; ma
-	mov rax, qword ptr [rbx+8]	; dataset
-	mov rax, qword ptr [rax+rdx]
-	add dword ptr [rbx], 8
-	xor ecx, dword ptr [rbx+4]	; mx
-	mov dword ptr [rbx+4], ecx
-	test ecx, 0FFF8h
-	jne short rx_read_dataset_r_ret
-	and ecx, -8
-	mov dword ptr [rbx], ecx
-	mov rdx, qword ptr [rbx+8]
-	prefetcht0 byte ptr [rdx+rcx]
-rx_read_dataset_r_ret:
-	ret 0
-
-rx_read_dataset_f:
-	mov edx, dword ptr [rbx]	; ma
-	mov rax, qword ptr [rbx+8]	; dataset
-	cvtdq2pd xmm0, qword ptr [rax+rdx]
-	add dword ptr [rbx], 8
-	xor ecx, dword ptr [rbx+4]	; mx
-	mov dword ptr [rbx+4], ecx
-	test ecx, 0FFF8h
-	jne short rx_read_dataset_f_ret
-	and ecx, -8
-	mov dword ptr [rbx], ecx
-	prefetcht0 byte ptr [rax+rcx]
-rx_read_dataset_f_ret:
-	ret 0
+ALIGN 64
+rx_read:
+;# IN     eax = random 32-bit address
+;# GLOBAL rdi = address of the dataset address
+;# GLOBAL rsi = address of the scratchpad
+;# GLOBAL rbp = low 32 bits = "mx", high 32 bits = "ma"
+;# MODIFY rcx, rdx
+	TransformAddress eax, rax       ;# TransformAddress function
+	mov rcx, qword ptr [rdi]        ;# load the dataset address
+	xor rbp, rax                    ;# modify "mx"
+	;# prefetch cacheline "mx"
+	and rbp, -64                    ;# align "mx" to the start of a cache line
+	mov edx, ebp                    ;# edx = mx
+	prefetchnta byte ptr [rcx+rdx]
+	;# read cacheline "ma"
+	ror rbp, 32                     ;# swap "ma" and "mx"
+	mov edx, ebp                    ;# edx = ma
+	lea rcx, [rcx+rdx]              ;# dataset cache line
+	xor r8,  qword ptr [rcx+0]
+	xor r9,  qword ptr [rcx+8]
+	xor r10, qword ptr [rcx+16]
+	xor r11, qword ptr [rcx+24]
+	xor r12, qword ptr [rcx+32]
+	xor r13, qword ptr [rcx+40]
+	xor r14, qword ptr [rcx+48]
+	xor r15, qword ptr [rcx+56]
+	ret
 executeProgram ENDP

+_RANDOMX_EXECUTE_PROGRAM ENDS
+
+ENDIF
+
 END
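
Review note: the `rx_read` sequence above (and the inlined copy at the end of the program loop) keeps both dataset offsets in one 64-bit register, with "mx" in the low half and "ma" in the high half. A minimal C++ sketch of that bookkeeping, assuming `rbp` is modeled as a plain `uint64_t` and `eax` as the 32-bit address; `DatasetStep`, `datasetStep` and `rotr64` are hypothetical names used only for illustration and do not appear in the commit:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type: the offset that gets prefetched, the offset that
// gets read, and the updated packed register value.
struct DatasetStep {
    uint32_t prefetchOffset; // new "mx", aligned to a 64-byte cache line
    uint32_t readOffset;     // current "ma", used for the actual dataset read
    uint64_t mamx;           // register contents after the swap
};

inline uint64_t rotr64(uint64_t x, unsigned c) {
    return (x >> c) | (x << (64 - c)); // only called with c = 32 here
}

DatasetStep datasetStep(uint64_t mamx, uint32_t addr) {
    mamx ^= addr;                        // xor rbp, rax  (touches only the low half)
    mamx &= ~uint64_t(63);               // and rbp, -64  (align "mx" to a cache line)
    uint32_t mx = (uint32_t)mamx;        // mov edx, ebp  (offset to prefetch)
    assert((mx & 63) == 0);
    mamx = rotr64(mamx, 32);             // ror rbp, 32   (swap "ma" and "mx")
    uint32_t ma = (uint32_t)mamx;        // mov edx, ebp  (offset to read now)
    return { mx, ma, mamx };
}
```

With this layout the offset read in one step is the one that was prefetched in the previous step, which appears to be the point of the swap.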
--- a/src/hashAes1Rx4.cpp
+++ b/src/hashAes1Rx4.cpp
@ -0,0 +1,136 @@
+/*
+Copyright (c) 2019 tevador
+
+This file is part of RandomX.
+
+RandomX is free software: you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation, either version 3 of the License, or
+(at your option) any later version.
+
+RandomX is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with RandomX.  If not, see <http://www.gnu.org/licenses/>.
+*/
+
+#include "softAes.h"
+
+/*
+	Calculate a 512-bit hash of 'input' using 4 lanes of AES.
+	The input is treated as a set of round keys for the encryption
+	of the initial state.
+
+	'inputSize' must be a multiple of 64.
+
+	For a 2 MiB input, this has the same security as 32768-round
+	AES encryption.
+
+	Hashing throughput: >20 GiB/s per CPU core with hardware AES
+*/
+template<bool softAes>
+void hashAes1Rx4(const void *input, size_t inputSize, void *hash) {
+	const uint8_t* inptr = (uint8_t*)input;
+	const uint8_t* inputEnd = inptr + inputSize;
+
+	__m128i state0, state1, state2, state3;
+	__m128i in0, in1, in2, in3;
+
+	//initial state
+	state0 = _mm_set_epi32(0x9d04b0ae, 0x59943385, 0x30ac8d93, 0x3fe49f5d);
+	state1 = _mm_set_epi32(0x8a39ebf1, 0xddc10935, 0xa724ecd3, 0x7b0c6064);
+	state2 = _mm_set_epi32(0x7ec70420, 0xdf01edda, 0x7c12ecf7, 0xfb5382e3);
+	state3 = _mm_set_epi32(0x94a9d201, 0x5082d1c8, 0xb2e74109, 0x7728b705);
+
+	//process 64 bytes at a time in 4 lanes
+	while (inptr < inputEnd) {
+		in0 = _mm_load_si128((__m128i*)inptr + 0);
+		in1 = _mm_load_si128((__m128i*)inptr + 1);
+		in2 = _mm_load_si128((__m128i*)inptr + 2);
+		in3 = _mm_load_si128((__m128i*)inptr + 3);
+
+		state0 = aesenc<softAes>(state0, in0);
+		state1 = aesdec<softAes>(state1, in1);
+		state2 = aesenc<softAes>(state2, in2);
+		state3 = aesdec<softAes>(state3, in3);
+
+		inptr += 64;
+	}
+
+	//two extra rounds to achieve full diffusion
+	__m128i xkey0 = _mm_set_epi32(0x4ff637c5, 0x053bd705, 0x8231a744, 0xc3767b17);
+	__m128i xkey1 = _mm_set_epi32(0x6594a1a6, 0xa8879d58, 0xb01da200, 0x8a8fae2e);
+
+	state0 = aesenc<softAes>(state0, xkey0);
+	state1 = aesdec<softAes>(state1, xkey0);
+	state2 = aesenc<softAes>(state2, xkey0);
+	state3 = aesdec<softAes>(state3, xkey0);
+
+	state0 = aesenc<softAes>(state0, xkey1);
+	state1 = aesdec<softAes>(state1, xkey1);
+	state2 = aesenc<softAes>(state2, xkey1);
+	state3 = aesdec<softAes>(state3, xkey1);
+
+	//output hash
+	_mm_store_si128((__m128i*)hash + 0, state0);
+	_mm_store_si128((__m128i*)hash + 1, state1);
+	_mm_store_si128((__m128i*)hash + 2, state2);
+	_mm_store_si128((__m128i*)hash + 3, state3);
+}
+
+template void hashAes1Rx4<false>(const void *input, size_t inputSize, void *hash);
+template void hashAes1Rx4<true>(const void *input, size_t inputSize, void *hash);
+
+/*
+	Fill 'buffer' with pseudorandom data based on 512-bit 'state'.
+	The state is encrypted using a single AES round per 16 bytes of output
+	in 4 lanes.
+
+	'outputSize' must be a multiple of 64.
+
+	The modified state is written back to 'state' to allow multiple
+	calls to this function.
+*/
+template<bool softAes>
+void fillAes1Rx4(void *state, size_t outputSize, void *buffer) {
+	uint8_t* outptr = (uint8_t*)buffer;
+	const uint8_t* outputEnd = outptr + outputSize;
+
+	__m128i state0, state1, state2, state3;
+	__m128i key0, key1, key2, key3;
+
+	key0 = _mm_set_epi32(0x9274f206, 0x79498d2f, 0x7d2de6ab, 0x67a04d26);
+	key1 = _mm_set_epi32(0xe1f7af05, 0x2a3a6f1d, 0x86658a15, 0x4f719812);
+	key2 = _mm_set_epi32(0xd1b1f791, 0x9e2ec914, 0x14c77bce, 0xba90750e);
+	key3 = _mm_set_epi32(0x179d0fd9, 0x6e57883c, 0xa53bbe4f, 0xaa07621f);
+
+	state0 = _mm_load_si128((__m128i*)state + 0);
+	state1 = _mm_load_si128((__m128i*)state + 1);
+	state2 = _mm_load_si128((__m128i*)state + 2);
+	state3 = _mm_load_si128((__m128i*)state + 3);
+
+	while (outptr < outputEnd) {
+		state0 = aesdec<softAes>(state0, key0);
+		state1 = aesenc<softAes>(state1, key1);
+		state2 = aesdec<softAes>(state2, key2);
+		state3 = aesenc<softAes>(state3, key3);
+
+		_mm_store_si128((__m128i*)outptr + 0, state0);
+		_mm_store_si128((__m128i*)outptr + 1, state1);
+		_mm_store_si128((__m128i*)outptr + 2, state2);
+		_mm_store_si128((__m128i*)outptr + 3, state3);
+
+		outptr += 64;
+	}
+
+	_mm_store_si128((__m128i*)state + 0, state0);
+	_mm_store_si128((__m128i*)state + 1, state1);
+	_mm_store_si128((__m128i*)state + 2, state2);
+	_mm_store_si128((__m128i*)state + 3, state3);
+}
+
+template void fillAes1Rx4<true>(void *state, size_t outputSize, void *buffer);
+template void fillAes1Rx4<false>(void *state, size_t outputSize, void *buffer);
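
Review note: the header comment of `hashAes1Rx4` states that a 2 MiB input corresponds to 32768 rounds. That figure follows from the absorb loop consuming 64 bytes per iteration, 16 bytes for each of the 4 lanes, so each lane sees one AES round per iteration. A small sketch of the arithmetic; `roundsPerLane` is a hypothetical helper, not part of the commit:

```cpp
#include <cassert>
#include <cstddef>

// Number of absorb-loop iterations, i.e. AES rounds applied to each lane,
// for an input of 'inputSize' bytes. The two extra finalization rounds
// are not counted here.
std::size_t roundsPerLane(std::size_t inputSize) {
    assert(inputSize % 64 == 0); // same precondition as hashAes1Rx4
    return inputSize / 64;
}
```

For the 2 MiB scratchpad this gives 2097152 / 64 = 32768, matching the comment.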
--- a/src/hashAes1Rx4.hpp
+++ b/src/hashAes1Rx4.hpp
@ -0,0 +1,26 @@
+/*
+Copyright (c) 2019 tevador
+
+This file is part of RandomX.
+
+RandomX is free software: you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation, either version 3 of the License, or
+(at your option) any later version.
+
+RandomX is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with RandomX.  If not, see <http://www.gnu.org/licenses/>.
+*/
+
+#include "softAes.h"
+
+template<bool softAes>
+void hashAes1Rx4(const void *input, size_t inputSize, void *hash);
+
+template<bool softAes>
+void fillAes1Rx4(void *state, size_t outputSize, void *buffer);
--- a/src/instructionWeights.hpp
+++ b/src/instructionWeights.hpp
@ -19,44 +19,63 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.

 #pragma once

-#define WT_ADD_64 11
-#define WT_ADD_32 2
-#define WT_SUB_64 11
-#define WT_SUB_32 2
-#define WT_MUL_64 23
-#define WT_MULH_64 10
-#define WT_MUL_32 15
-#define WT_IMUL_32 15
-#define WT_IMULH_64 6
-#define WT_DIV_64 1
-#define WT_IDIV_64 1
-#define WT_AND_64 4
-#define WT_AND_32 2
-#define WT_OR_64 4
-#define WT_OR_32 2
-#define WT_XOR_64 4
-#define WT_XOR_32 2
-#define WT_SHL_64 3
-#define WT_SHR_64 3
-#define WT_SAR_64 3
-#define WT_ROL_64 6
-#define WT_ROR_64 6
-#define WT_FPADD 20
-#define WT_FPSUB 20
-#define WT_FPMUL 22
-#define WT_FPDIV 8
-#define WT_FPSQRT 6
-#define WT_FPROUND 2
-#define WT_CALL 20
-#define WT_RET 22
+//Integer
+#define WT_IADD_R 12
+#define WT_IADD_M 7
+#define WT_IADD_RC 16
+#define WT_ISUB_R 12
+#define WT_ISUB_M 7
+#define WT_IMUL_9C 9
+#define WT_IMUL_R 16
+#define WT_IMUL_M 4
+#define WT_IMULH_R 4
+#define WT_IMULH_M 1
+#define WT_ISMULH_R 4
+#define WT_ISMULH_M 1
+#define WT_IDIV_C 4
+#define WT_ISDIV_C 4
+#define WT_INEG_R 2
+#define WT_IXOR_R 16
+#define WT_IXOR_M 4
+#define WT_IROR_R 10
+#define WT_IROL_R 0
+#define WT_ISWAP_R 4

+//Common floating point
+#define WT_FSWAP_R 8

-constexpr int wtSum = WT_ADD_64 + WT_ADD_32 + WT_SUB_64 + WT_SUB_32 + \
-WT_MUL_64 + WT_MULH_64 + WT_MUL_32 + WT_IMUL_32 + WT_IMULH_64 + \
-WT_DIV_64 + WT_IDIV_64 + WT_AND_64 + WT_AND_32 + WT_OR_64 + \
-WT_OR_32 + WT_XOR_64 + WT_XOR_32 + WT_SHL_64 + WT_SHR_64 + \
-WT_SAR_64 + WT_ROL_64 + WT_ROR_64 + WT_FPADD + WT_FPSUB + WT_FPMUL \
-+ WT_FPDIV + WT_FPSQRT + WT_FPROUND + WT_CALL + WT_RET;
+//Floating point group F
+#define WT_FADD_R 20
+#define WT_FADD_M 5
+#define WT_FSUB_R 20
+#define WT_FSUB_M 5
+#define WT_FNEG_R 6
+
+//Floating point group E
+#define WT_FMUL_R 20
+#define WT_FMUL_M 0
+#define WT_FDIV_R 0
+#define WT_FDIV_M 4
+#define WT_FSQRT_R 6
+
+//Control
+#define WT_COND_R 7
+#define WT_COND_M 1
+#define WT_CFROUND 1
+
+//Store
+#define WT_ISTORE 16
+#define WT_FSTORE 0
+
+#define WT_NOP 0
+
+constexpr int wtSum = WT_IADD_R + WT_IADD_M + WT_IADD_RC + WT_ISUB_R + \
+WT_ISUB_M + WT_IMUL_9C + WT_IMUL_R + WT_IMUL_M + WT_IMULH_R + \
+WT_IMULH_M + WT_ISMULH_R + WT_ISMULH_M + WT_IDIV_C + WT_ISDIV_C + \
+WT_INEG_R + WT_IXOR_R + WT_IXOR_M + WT_IROR_R + WT_IROL_R + \
+WT_ISWAP_R + WT_FSWAP_R + WT_FADD_R + WT_FADD_M + WT_FSUB_R + WT_FSUB_M + \
+WT_FNEG_R + WT_FMUL_R + WT_FMUL_M + WT_FDIV_R + WT_FDIV_M + \
+WT_FSQRT_R + WT_COND_R + WT_COND_M + WT_CFROUND + WT_ISTORE + WT_FSTORE + WT_NOP;

 static_assert(wtSum == 256,
 	"Sum of instruction weights must be 256");
@ -97,8 +116,46 @@ static_assert(wtSum == 256,
 #define REP33(x) REP32(x) x,
 #define REP40(x) REP32(x) REP8(x)
 #define REP128(x) REP32(x) REP32(x) REP32(x) REP32(x)
+#define REP232(x) REP128(x) REP40(x) REP40(x) REP24(x)
 #define REP256(x) REP128(x) REP128(x)
 #define REPNX(x,N) REP##N(x)
 #define REPN(x,N) REPNX(x,N)
 #define NUM(x) x
 #define WT(x) NUM(WT_##x)
+
+#define REPCASE0(x)
+#define REPCASE1(x) case __COUNTER__:
+#define REPCASE2(x) REPCASE1(x) case __COUNTER__:
+#define REPCASE3(x) REPCASE2(x) case __COUNTER__:
+#define REPCASE4(x) REPCASE3(x) case __COUNTER__:
+#define REPCASE5(x) REPCASE4(x) case __COUNTER__:
+#define REPCASE6(x) REPCASE5(x) case __COUNTER__:
+#define REPCASE7(x) REPCASE6(x) case __COUNTER__:
+#define REPCASE8(x) REPCASE7(x) case __COUNTER__:
+#define REPCASE9(x) REPCASE8(x) case __COUNTER__:
+#define REPCASE10(x) REPCASE9(x) case __COUNTER__:
+#define REPCASE11(x) REPCASE10(x) case __COUNTER__:
+#define REPCASE12(x) REPCASE11(x) case __COUNTER__:
+#define REPCASE13(x) REPCASE12(x) case __COUNTER__:
+#define REPCASE14(x) REPCASE13(x) case __COUNTER__:
+#define REPCASE15(x) REPCASE14(x) case __COUNTER__:
+#define REPCASE16(x) REPCASE15(x) case __COUNTER__:
+#define REPCASE17(x) REPCASE16(x) case __COUNTER__:
+#define REPCASE18(x) REPCASE17(x) case __COUNTER__:
+#define REPCASE19(x) REPCASE18(x) case __COUNTER__:
+#define REPCASE20(x) REPCASE19(x) case __COUNTER__:
+#define REPCASE21(x) REPCASE20(x) case __COUNTER__:
+#define REPCASE22(x) REPCASE21(x) case __COUNTER__:
+#define REPCASE23(x) REPCASE22(x) case __COUNTER__:
+#define REPCASE24(x) REPCASE23(x) case __COUNTER__:
+#define REPCASE25(x) REPCASE24(x) case __COUNTER__:
+#define REPCASE26(x) REPCASE25(x) case __COUNTER__:
+#define REPCASE27(x) REPCASE26(x) case __COUNTER__:
+#define REPCASE28(x) REPCASE27(x) case __COUNTER__:
+#define REPCASE29(x) REPCASE28(x) case __COUNTER__:
+#define REPCASE30(x) REPCASE29(x) case __COUNTER__:
+#define REPCASE31(x) REPCASE30(x) case __COUNTER__:
+#define REPCASE32(x) REPCASE31(x) case __COUNTER__:
+#define REPCASENX(x,N) REPCASE##N(x)
+#define REPCASEN(x,N) REPCASENX(x,N)
+#define CASE_REP(x) REPCASEN(x, WT(x))
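
Review note: because the weights sum to 256, every value of an 8-bit opcode maps to exactly one instruction, and an instruction with weight `w` owns `w` consecutive opcode values. The `REPN` and `CASE_REP` macros expand this at preprocessing time; the sketch below builds the same kind of table at run time. It assumes only the first two weights from the diff (`IADD_R` = 12, `IADD_M` = 7) and lumps the remaining 237 into a hypothetical `REST` entry; `Op`, `Weighted` and `buildDecodeTable` are illustrative names:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

enum class Op : uint8_t { IADD_R, IADD_M, REST }; // REST stands in for all other instructions

struct Weighted {
    Op op;
    int weight;
};

// Fill a 256-entry table so that each instruction occupies 'weight'
// consecutive opcode values, in declaration order.
std::array<Op, 256> buildDecodeTable() {
    const Weighted weights[] = { {Op::IADD_R, 12}, {Op::IADD_M, 7}, {Op::REST, 237} };
    std::array<Op, 256> table{};
    int next = 0;
    for (const Weighted& w : weights) {
        for (int i = 0; i < w.weight; ++i) {
            assert(next < 256);
            table[next++] = w.op;
        }
    }
    assert(next == 256); // mirrors the static_assert on wtSum
    return table;
}
```

A weight of 0, as used for `IROL_R`, `FMUL_M`, `FDIV_R`, `FSTORE` and `NOP`, simply leaves the instruction without any opcode value.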
--- a/src/instructions.hpp
+++ b/src/instructions.hpp
@ -1,63 +0,0 @@
-/*
-Copyright (c) 2018 tevador
-
-This file is part of RandomX.
-
-RandomX is free software: you can redistribute it and/or modify
-it under the terms of the GNU General Public License as published by
-the Free Software Foundation, either version 3 of the License, or
-(at your option) any later version.
-
-RandomX is distributed in the hope that it will be useful,
-but WITHOUT ANY WARRANTY; without even the implied warranty of
-MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
-GNU General Public License for more details.
-
-You should have received a copy of the GNU General Public License
-along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
-*/
-
-#include <cstdint>
-#include "common.hpp"
-
-namespace RandomX {
-
-	//Clears the 11 least-significant bits before conversion. This is done so the number
-	//fits exactly into the 52-bit mantissa without rounding.
-	inline double convertSigned52(int64_t x) {
-		return (double)(x & -2048L);
-	}
-
-	extern "C" {
-		void ADD_64(convertible_t& a, convertible_t& b, convertible_t& c);
-		void ADD_32(convertible_t& a, convertible_t& b, convertible_t& c);
-		void SUB_64(convertible_t& a, convertible_t& b, convertible_t& c);
-		void SUB_32(convertible_t& a, convertible_t& b, convertible_t& c);
-		void MUL_64(convertible_t& a, convertible_t& b, convertible_t& c);
-		void MULH_64(convertible_t& a, convertible_t& b, convertible_t& c);
-		void MUL_32(convertible_t& a, convertible_t& b, convertible_t& c);
-		void IMUL_32(convertible_t& a, convertible_t& b, convertible_t& c);
-		void IMULH_64(convertible_t& a, convertible_t& b, convertible_t& c);
-		void DIV_64(convertible_t& a, convertible_t& b, convertible_t& c);
-		void IDIV_64(convertible_t& a, convertible_t& b, convertible_t& c);
-		void AND_64(convertible_t& a, convertible_t& b, convertible_t& c);
-		void AND_32(convertible_t& a, convertible_t& b, convertible_t& c);
-		void OR_64(convertible_t& a, convertible_t& b, convertible_t& c);
-		void OR_32(convertible_t& a, convertible_t& b, convertible_t& c);
-		void XOR_64(convertible_t& a, convertible_t& b, convertible_t& c);
-		void XOR_32(convertible_t& a, convertible_t& b, convertible_t& c);
-		void SHL_64(convertible_t& a, convertible_t& b, convertible_t& c);
-		void SHR_64(convertible_t& a, convertible_t& b, convertible_t& c);
-		void SAR_64(convertible_t& a, convertible_t& b, convertible_t& c);
-		void ROL_64(convertible_t& a, convertible_t& b, convertible_t& c);
-		void ROR_64(convertible_t& a, convertible_t& b, convertible_t& c);
-		bool JMP_COND(uint8_t, convertible_t&, int32_t);
-		void FPINIT();
-		void FPADD(convertible_t& a, fpu_reg_t& b, fpu_reg_t& c);
-		void FPSUB(convertible_t& a, fpu_reg_t& b, fpu_reg_t& c);
-		void FPMUL(convertible_t& a, fpu_reg_t& b, fpu_reg_t& c);
-		void FPDIV(convertible_t& a, fpu_reg_t& b, fpu_reg_t& c);
-		void FPSQRT(convertible_t& a, fpu_reg_t& b, fpu_reg_t& c);
-		void FPROUND(convertible_t& a, fpu_reg_t& b, fpu_reg_t& c);
-	}
-}
--- a/src/instructionsPortable.cpp
+++ b/src/instructionsPortable.cpp
@ -17,26 +17,27 @@ You should have received a copy of the GNU General Public License
 along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 */
 //#define DEBUG
-#include "instructions.hpp"
 #include "intrinPortable.h"
+#include "blake2/endian.h"
 #pragma STDC FENV_ACCESS on
 #include <cfenv>
 #include <cmath>
 #ifdef DEBUG
 #include <iostream>
 #endif
+#include "common.hpp"

 #if defined(__SIZEOF_INT128__)
 	typedef unsigned __int128 uint128_t;
 	typedef __int128 int128_t;
-	static inline uint64_t __umulhi64(uint64_t a, uint64_t b) {
+	uint64_t mulh(uint64_t a, uint64_t b) {
 		return ((uint128_t)a * b) >> 64;
 	}
-	static inline uint64_t __imulhi64(int64_t a, int64_t b) {
+	int64_t smulh(int64_t a, int64_t b) {
 		return ((int128_t)a * b) >> 64;
 	}
-	#define umulhi64 __umulhi64
-	#define imulhi64 __imulhi64
+	#define HAVE_MULH
+	#define HAVE_SMULH
 #endif

 #if defined(_MSC_VER)
@ -44,62 +45,62 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 	#define EVAL_DEFINE(X) HAS_VALUE(X)
 	#include <intrin.h>
 	#include <stdlib.h>
-	#define ror64 _rotr64
-	#define rol64 _rotl64
+
+	uint64_t rotl(uint64_t x, int c) {
+		return _rotl64(x, c);
+	}
+	uint64_t rotr(uint64_t x , int c) {
+		return _rotr64(x, c);
+	}
+	#define HAVE_ROTL
+	#define HAVE_ROTR
+
 	#if EVAL_DEFINE(__MACHINEARM64_X64(1))
-		#define umulhi64 __umulh
+		uint64_t mulh(uint64_t a, uint64_t b) {
+			return __umulh(a, b);
+		}
+		#define HAVE_MULH
 	#endif
+
 	#if EVAL_DEFINE(__MACHINEX64(1))
-		static inline uint64_t __imulhi64(int64_t a, int64_t b) {
+		int64_t smulh(int64_t a, int64_t b) {
 			int64_t hi;
 			_mul128(a, b, &hi);
 			return hi;
 		}
-		#define imulhi64 __imulhi64
+		#define HAVE_SMULH
 	#endif
-	static inline uint32_t _setRoundMode(uint32_t mode) {
-		return _controlfp(mode, _MCW_RC);
+
+	static void setRoundMode__(uint32_t mode) {
+		_controlfp(mode, _MCW_RC);
 	}
-	#define setRoundMode _setRoundMode
+	#define HAVE_SETROUNDMODE_IMPL
 #endif

-#ifndef setRoundMode
-	#define setRoundMode fesetround
+#ifndef HAVE_SETROUNDMODE_IMPL
+	static void setRoundMode__(uint32_t mode) {
+		fesetround(mode);
+	}
 #endif

-#ifndef ror64
-	static inline uint64_t __ror64(uint64_t a, int b) {
+#ifndef HAVE_ROTR
+	uint64_t rotr(uint64_t a, int b) {
 		return (a >> b) | (a << (64 - b));
 	}
-	#define ror64 __ror64
+	#define HAVE_ROTR
 #endif

-#ifndef rol64
-	static inline uint64_t __rol64(uint64_t a, int b) {
+#ifndef HAVE_ROTL
+	uint64_t rotl(uint64_t a, int b) {
 		return (a << b) | (a >> (64 - b));
 	}
-	#define rol64 __rol64
+	#define HAVE_ROTL
 #endif

-#ifndef sar64
-	#include <type_traits>
-	constexpr int64_t builtintShr64(int64_t value, int shift) noexcept {
-		return value >> shift;
-	}
-
-	struct UsesArithmeticShift : std::integral_constant<bool, builtintShr64(-1LL, 1) == -1LL> {
-	};
-
-	static inline int64_t __sar64(int64_t a, int b) {
-		return UsesArithmeticShift::value ? builtintShr64(a, b) : (a < 0 ? ~(~a >> b) : a >> b);
-	}
-	#define sar64 __sar64
-#endif
-
-#ifndef umulhi64
+#ifndef HAVE_MULH
 	#define LO(x) ((x)&0xffffffff)
 	#define HI(x) ((x)>>32)
-	static inline uint64_t __umulhi64(uint64_t a, uint64_t b) {
+	uint64_t mulh(uint64_t a, uint64_t b) {
 		uint64_t ah = HI(a), al = LO(a);
 		uint64_t bh = HI(b), bl = LO(b);
 		uint64_t x00 = al * bl;
@ -112,17 +113,17 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.

 		return (m3 << 32) + LO(m2);
 	}
-	#define umulhi64 __umulhi64
+	#define HAVE_MULH
 #endif

-#ifndef imulhi64
-	static inline int64_t __imulhi64(int64_t a, int64_t b) {
-		int64_t hi = umulhi64(a, b);
+#ifndef HAVE_SMULH
+	int64_t smulh(int64_t a, int64_t b) {
+		int64_t hi = mulh(a, b);
 		if (a < 0LL) hi -= b;
 		if (b < 0LL) hi -= a;
 		return hi;
 	}
-	#define imulhi64 __imulhi64
+	#define HAVE_SMULH
 #endif

 // avoid undefined behavior of signed overflow
@ -137,20 +138,20 @@ static inline int32_t safeSub(int32_t a, int32_t b) {

 #if defined(__has_builtin)
 #if __has_builtin(__builtin_sub_overflow)
-	static inline bool __subOverflow(int32_t a, int32_t b) {
+	static inline bool subOverflow__(uint32_t a, uint32_t b) {
 		int32_t temp;
-		return __builtin_sub_overflow(a, b, &temp);
+		return __builtin_sub_overflow(unsigned32ToSigned2sCompl(a), unsigned32ToSigned2sCompl(b), &temp);
 	}
-	#define subOverflow __subOverflow
+	#define HAVE_SUB_OVERFLOW
 #endif
 #endif

-#ifndef subOverflow
-	static inline bool __subOverflow(int32_t a, int32_t b) {
-		auto c = safeSub(a, b);
-		return (c < a) != (b > 0);
+#ifndef HAVE_SUB_OVERFLOW
+	static inline bool subOverflow__(uint32_t a, uint32_t b) {
+		auto c = unsigned32ToSigned2sCompl(a - b);
+		return (c < unsigned32ToSigned2sCompl(a)) != (unsigned32ToSigned2sCompl(b) > 0);
 	}
-	#define subOverflow __subOverflow
+	#define HAVE_SUB_OVERFLOW
 #endif

 static inline double FlushDenormalNaN(double x) {
@ -165,124 +166,50 @@ static inline double FlushNaN(double x) {
 	return x != x ? 0.0 : x;
 }

-namespace RandomX {
-
-	extern "C" {
-
-		void ADD_64(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = a.u64 + b.u64;
+void setRoundMode(uint32_t rcflag) {
+	switch (rcflag & 3) {
+		case RoundDown:
+			setRoundMode__(FE_DOWNWARD);
+			break;
+		case RoundUp:
+			setRoundMode__(FE_UPWARD);
+			break;
+		case RoundToZero:
+			setRoundMode__(FE_TOWARDZERO);
+			break;
+		case RoundToNearest:
+			setRoundMode__(FE_TONEAREST);
+			break;
+		default:
+			UNREACHABLE;
+	}
 }

-		void ADD_32(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = a.u32 + b.u32;
-		}
-
-		void SUB_64(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = a.u64 - b.u64;
-		}
-
-		void SUB_32(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = a.u32 - b.u32;
-		}
-
-		void MUL_64(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = a.u64 * b.u64;
-		}
-
-		void MULH_64(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = umulhi64(a.u64, b.u64);
-		}
-
-		void MUL_32(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = (uint64_t)a.u32 * b.u32;
-		}
-
-		void IMUL_32(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.i64 = (int64_t)a.i32 * b.i32;
-		}
-
-		void IMULH_64(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.i64 = imulhi64(a.i64, b.i64);
-		}
-
-		void DIV_64(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = a.u64 / (b.u32 != 0 ? b.u32 : 1U);
-		}
-
-		void IDIV_64(convertible_t& a, convertible_t& b, convertible_t& c) {
-			if (a.i64 == INT64_MIN && b.i32 == -1)
-				c.i64 = INT64_MIN;
-			else
-				c.i64 = a.i64 / (b.i32 != 0 ? b.i32 : 1);
-		}
-
-		void AND_64(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = a.u64 & b.u64;
-		}
-
-		void AND_32(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = a.u32 & b.u32;
-		}
-
-		void OR_64(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = a.u64 | b.u64;
-		}
-
-		void OR_32(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = a.u32 | b.u32;
-		}
-
-		void XOR_64(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = a.u64 ^ b.u64;
-		}
-
-		void XOR_32(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = a.u32 ^ b.u32;
-		}
-
-		void SHL_64(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = a.u64 << (b.u64 & 63);
-		}
-
-		void SHR_64(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = a.u64 >> (b.u64 & 63);
-		}
-
-		void SAR_64(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = sar64(a.i64, b.u64 & 63);
-		}
-
-		void ROL_64(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = rol64(a.u64, (b.u64 & 63));
-		}
-
-		void ROR_64(convertible_t& a, convertible_t& b, convertible_t& c) {
-			c.u64 = ror64(a.u64, (b.u64 & 63));
-		}
-
-		bool JMP_COND(uint8_t type, convertible_t& regb, int32_t imm32) {
+bool condition(uint32_t type, uint32_t value, uint32_t imm32) {
 	switch (type & 7)
 	{
 		case 0:
-					return regb.u32 <= (uint32_t)imm32;
+			return value <= imm32;
 		case 1:
-					return regb.u32 > (uint32_t)imm32;
+			return value > imm32;
 		case 2:
-					return safeSub(regb.i32, imm32) < 0;
+			return unsigned32ToSigned2sCompl(value - imm32) < 0;
 		case 3:
-					return safeSub(regb.i32, imm32) >= 0;
+			return unsigned32ToSigned2sCompl(value - imm32) >= 0;
 		case 4:
-					return subOverflow(regb.i32, imm32);
+			return subOverflow__(value, imm32);
 		case 5:
-					return !subOverflow(regb.i32, imm32);
+			return !subOverflow__(value, imm32);
 		case 6:
-					return regb.i32 < imm32;
+			return unsigned32ToSigned2sCompl(value) < unsigned32ToSigned2sCompl(imm32);
 		case 7:
-					return regb.i32 >= imm32;
+			return unsigned32ToSigned2sCompl(value) >= unsigned32ToSigned2sCompl(imm32);
+		default:
+			UNREACHABLE;
 	}
 }

-		void FPINIT() {
+void initFpu() {
 #ifdef __SSE2__
 	_mm_setcsr(0x9FC0); //Flush to zero, denormals are zero, default rounding mode, all exceptions disabled
 #else
@ -290,126 +217,13 @@ namespace RandomX {
 #endif
 }

-		void FPADD(convertible_t& a, fpu_reg_t& b, fpu_reg_t& c) {
-#ifdef __SSE2__
-			__m128i ai = _mm_loadl_epi64((const __m128i*)&a);
-			__m128d ad = _mm_cvtepi32_pd(ai);
-			__m128d bd = _mm_load_pd(&b.lo.f64);
-			__m128d cd = _mm_add_pd(ad, bd);
-			_mm_store_pd(&c.lo.f64, cd);
-#else
-			double alo = (double)a.i32lo;
-			double ahi = (double)a.i32hi;
-			c.lo.f64 = alo + b.lo.f64;
-			c.hi.f64 = ahi + b.hi.f64;
-#endif
-		}
+union double_ser_t {
+	double f;
+	uint64_t i;
+};

-		void FPSUB(convertible_t& a, fpu_reg_t& b, fpu_reg_t& c) {
-#ifdef __SSE2__
-			__m128i ai = _mm_loadl_epi64((const __m128i*)&a);
-			__m128d ad = _mm_cvtepi32_pd(ai);
-			__m128d bd = _mm_load_pd(&b.lo.f64);
-			__m128d cd = _mm_sub_pd(ad, bd);
-			_mm_store_pd(&c.lo.f64, cd);
-#else
-			double alo = (double)a.i32lo;
-			double ahi = (double)a.i32hi;
-			c.lo.f64 = alo - b.lo.f64;
-			c.hi.f64 = ahi - b.hi.f64;
-#endif
-		}
-
-		void FPMUL(convertible_t& a, fpu_reg_t& b, fpu_reg_t& c) {
-#ifdef __SSE2__
-			__m128i ai = _mm_loadl_epi64((const __m128i*)&a);
-			__m128d ad = _mm_cvtepi32_pd(ai);
-			__m128d bd = _mm_load_pd(&b.lo.f64);
-			__m128d cd = _mm_mul_pd(ad, bd);
-			__m128d mask = _mm_cmpeq_pd(cd, cd);
-			cd = _mm_and_pd(cd, mask);
-			_mm_store_pd(&c.lo.f64, cd);
-#else
-			double alo = (double)a.i32lo;
-			double ahi = (double)a.i32hi;
-			c.lo.f64 = FlushNaN(alo * b.lo.f64);
-			c.hi.f64 = FlushNaN(ahi * b.hi.f64);
-#endif
-		}
-
-		void FPDIV(convertible_t& a, fpu_reg_t& b, fpu_reg_t& c) {
-#ifdef __SSE2__
-			__m128i ai = _mm_loadl_epi64((const __m128i*)&a);
-			__m128d ad = _mm_cvtepi32_pd(ai);
-			__m128d bd = _mm_load_pd(&b.lo.f64);
-			__m128d cd = _mm_div_pd(ad, bd);
-			__m128d mask = _mm_cmpeq_pd(cd, cd);
-			cd = _mm_and_pd(cd, mask);
-			_mm_store_pd(&c.lo.f64, cd);
-#else
-			double alo = (double)a.i32lo;
-			double ahi = (double)a.i32hi;
-			c.lo.f64 = FlushDenormalNaN(alo / b.lo.f64);
-			c.hi.f64 = FlushDenormalNaN(ahi / b.hi.f64);
-#endif
-		}
-
-		void FPSQRT(convertible_t& a, fpu_reg_t& b, fpu_reg_t& c) {
-#ifdef __SSE2__
-			__m128i ai = _mm_loadl_epi64((const __m128i*)&a);
-			__m128d ad = _mm_cvtepi32_pd(ai);
-			const __m128d absmask = _mm_castsi128_pd(_mm_set1_epi64x(~(1LL << 63)));
-			ad = _mm_and_pd(ad, absmask);
-			__m128d cd = _mm_sqrt_pd(ad);
-			_mm_store_pd(&c.lo.f64, cd);
-#else
-			double alo = (double)a.i32lo;
-			double ahi = (double)a.i32hi;
-			c.lo.f64 = sqrt(std::abs(alo));
-			c.hi.f64 = sqrt(std::abs(ahi));
-#endif
-		}
-
-		void FPROUND(convertible_t& a, fpu_reg_t& b, fpu_reg_t& c) {
-			c.lo.f64 = convertSigned52(a.i64);
-			switch (a.u64 & 3) {
-				case RoundDown:
-#ifdef DEBUG
-					std::cout << "Round FE_DOWNWARD (" << FE_DOWNWARD << ") = " <<
-#endif
-					setRoundMode(FE_DOWNWARD);
-#ifdef DEBUG
-					std::cout << std::endl;
-#endif
-					break;
-				case RoundUp:
-#ifdef DEBUG
-					std::cout << "Round FE_UPWARD (" << FE_UPWARD << ") = " <<
-#endif
-					setRoundMode(FE_UPWARD);
-#ifdef DEBUG
-					std::cout << std::endl;
-#endif
-					break;
-				case RoundToZero:
-#ifdef DEBUG
-					std::cout << "Round FE_TOWARDZERO (" << FE_TOWARDZERO << ") = " <<
-#endif
-					setRoundMode(FE_TOWARDZERO);
-#ifdef DEBUG
-					std::cout << std::endl;
-#endif
-					break;
-				default:
-#ifdef DEBUG
-					std::cout << "Round FE_TONEAREST (" << FE_TONEAREST << ") = " <<
-#endif
-					setRoundMode(FE_TONEAREST);
-#ifdef DEBUG
-					std::cout << std::endl;
-#endif
-					break;
-			}
-		}
-	}
+double loadDoublePortable(const void* addr) {
+	double_ser_t ds;
+	ds.i = load64(addr);
+	return ds.f;
 }
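
Review note: the fallback `smulh` derives the signed high product from the unsigned one by subtracting `b` when `a` is negative and `a` when `b` is negative. A self-contained sketch of that correction follows. The unsigned part is a standard 32-bit limb decomposition written for this note, not a copy of the body elided from the hunk, and the names `mulhPortable`, `smulhPortable` and `checkExamples` are hypothetical. The correction is done in unsigned arithmetic here to avoid signed overflow:

```cpp
#include <cassert>
#include <cstdint>

// High 64 bits of the unsigned 128-bit product a * b.
uint64_t mulhPortable(uint64_t a, uint64_t b) {
    uint64_t ah = a >> 32, al = a & 0xffffffff;
    uint64_t bh = b >> 32, bl = b & 0xffffffff;
    uint64_t x00 = al * bl, x01 = al * bh, x10 = ah * bl, x11 = ah * bh;
    // Carry out of the middle 32-bit column.
    uint64_t mid = (x00 >> 32) + (x01 & 0xffffffff) + (x10 & 0xffffffff);
    return x11 + (x01 >> 32) + (x10 >> 32) + (mid >> 32);
}

// High 64 bits of the signed product, obtained from the unsigned result.
int64_t smulhPortable(int64_t a, int64_t b) {
    uint64_t hi = mulhPortable((uint64_t)a, (uint64_t)b);
    if (a < 0) hi -= (uint64_t)b;
    if (b < 0) hi -= (uint64_t)a;
    return (int64_t)hi; // assumes a two's complement target
}

void checkExamples() {
    assert(mulhPortable(UINT64_MAX, UINT64_MAX) == 0xFFFFFFFFFFFFFFFEULL);
    assert(smulhPortable(-1, -1) == 0);
    assert(smulhPortable(-1, 1) == -1);
    assert(smulhPortable(INT64_MIN, 2) == -1);
}
```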
--- a/src/intrinPortable.h
+++ b/src/intrinPortable.h
@ -19,6 +19,8 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.

 #pragma once

+#include <cstdint>
+
 #if defined(_MSC_VER)
 #if defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP == 2)
 #define __SSE2__ 1
@ -31,12 +33,21 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 #else
 #include <intrin.h>
 #endif
+
+inline __m128d _mm_abs(__m128d xd) {
+	const __m128d absmask = _mm_castsi128_pd(_mm_set1_epi64x(~(1LL << 63)));
+	return _mm_and_pd(xd, absmask);
+}
+
+#define PREFETCHNTA(x) _mm_prefetch((const char *)(x), _MM_HINT_NTA)
+
 #else
 #include <cstdint>
 #include <stdexcept>

 #define _mm_malloc(a,b) malloc(a)
 #define _mm_free(a) free(a)
+#define PREFETCHNTA(x)

 typedef union {
 	uint64_t u64[2];
@ -45,6 +56,18 @@ typedef union {
 	uint8_t u8[16];
 } __m128i;

+typedef struct {
+	double lo;
+	double hi;
+} __m128d;
+
+inline __m128d _mm_load_pd(const double* pd) {
+	__m128d x;
+	x.lo = *(pd + 0);
+	x.hi = *(pd + 1);
+	return x;
+}
+
 static const char* platformError = "Platform doesn't support hardware AES";

 inline __m128i _mm_aeskeygenassist_si128(__m128i key, uint8_t rcon) {
@ -132,3 +155,35 @@ inline __m128i _mm_slli_si128(__m128i _A, int _Imm) {
 }

 #endif
+
+constexpr int RoundToNearest = 0;
+constexpr int RoundDown = 1;
+constexpr int RoundUp = 2;
+constexpr int RoundToZero = 3;
+
+constexpr int32_t unsigned32ToSigned2sCompl(uint32_t x) {
+	return (-1 == ~0) ? (int32_t)x : (x > INT32_MAX ? (-(int32_t)(UINT32_MAX - x) - 1) : (int32_t)x);
+}
+
+constexpr int64_t unsigned64ToSigned2sCompl(uint64_t x) {
+	return (-1 == ~0) ? (int64_t)x : (x > INT64_MAX ? (-(int64_t)(UINT64_MAX - x) - 1) : (int64_t)x);
+}
+
+constexpr uint64_t signExtend2sCompl(uint32_t x) {
+	return (-1 == ~0) ? (int64_t)(int32_t)(x) : (x > INT32_MAX ? (x | 0xffffffff00000000ULL) : (uint64_t)x);
+}
+
+inline __m128d load_cvt_i32x2(const void* addr) {
+	__m128i ix = _mm_load_si128((const __m128i*)addr);
+	return _mm_cvtepi32_pd(ix);
+}
+
+double loadDoublePortable(const void* addr);
+
+uint64_t mulh(uint64_t, uint64_t);
+int64_t smulh(int64_t, int64_t);
+uint64_t rotl(uint64_t, int);
+uint64_t rotr(uint64_t, int);
+void initFpu();
+void setRoundMode(uint32_t);
+bool condition(uint32_t, uint32_t, uint32_t);
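
Review note on the two's complement helpers: `unsigned32ToSigned2sCompl` and `signExtend2sCompl` above first test `(-1 == ~0)`, which holds on two's complement hosts, and use a plain cast in that case. The sketch below keeps only the arithmetic fallback branch so its behaviour can be checked at compile time. The names `toSigned32` and `signExtend32` are hypothetical and exist only for this illustration.

```cpp
#include <cstdint>

// Map a 32-bit pattern to the signed value it encodes in two's complement,
// using only conversions whose operands fit in the target type.
constexpr int32_t toSigned32(uint32_t x) {
	return x > INT32_MAX ? (-(int32_t)(UINT32_MAX - x) - 1) : (int32_t)x;
}

// Sign-extend a 32-bit pattern to 64 bits by setting the upper half
// whenever bit 31 is set.
constexpr uint64_t signExtend32(uint32_t x) {
	return x > INT32_MAX ? (x | 0xffffffff00000000ULL) : (uint64_t)x;
}

// Compile-time checks of the boundary values.
static_assert(toSigned32(0x7fffffffu) == INT32_MAX, "largest positive value");
static_assert(toSigned32(0x80000000u) == INT32_MIN, "smallest negative value");
static_assert(toSigned32(0xffffffffu) == -1, "all ones encodes -1");
static_assert(signExtend32(0x80000000u) == 0xffffffff80000000ULL, "upper half set");
static_assert(signExtend32(5u) == 5u, "non-negative values are unchanged");
```

For `x > INT32_MAX`, the difference `UINT32_MAX - x` is at most `INT32_MAX`, so the cast and the negation cannot overflow.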
--- a/src/main.cpp
+++ b/src/main.cpp
@ -29,11 +29,11 @@ along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
 #include <cstring>
 #include "Program.hpp"
 #include <string>
-#include "instructions.hpp"
 #include <thread>
 #include <atomic>
 #include "dataset.hpp"
 #include "Cache.hpp"
+#include "hashAes1Rx4.hpp"

 const uint8_t seed[32] = { 191, 182, 222, 175, 249, 89, 134, 104, 241, 68, 191, 62, 162, 166, 61, 64, 123, 191, 227, 193, 118, 60, 188, 53, 223, 133, 175, 24, 123, 230, 55, 74 };

@ -115,7 +115,7 @@ void printUsage(const char* executable) {
 }

 void generateAsm(int nonce) {
-	uint64_t hash[4];
+	uint64_t hash[8];
 	unsigned char blockTemplate[] = {
 		0x07, 0x07, 0xf7, 0xa4, 0xf0, 0xd6, 0x05, 0xb3, 0x03, 0x26, 0x08, 0x16, 0xba, 0x3f, 0x10, 0x90, 0x2e, 0x1a, 0x14,
 		0x5a, 0xc5, 0xfa, 0xd3, 0xaa, 0x3a, 0xf6, 0xea, 0x44, 0xc1, 0x18, 0x69, 0xdc, 0x4f, 0x85, 0x3f, 0x00, 0x2b, 0x2e,
@ -126,11 +126,13 @@ void generateAsm(int nonce) {
 	*noncePtr = nonce;
 	blake2b(hash, sizeof(hash), blockTemplate, sizeof(blockTemplate), nullptr, 0);
 	RandomX::AssemblyGeneratorX86 asmX86;
-	asmX86.generateProgram(hash);
+	RandomX::Program p;
+	fillAes1Rx4<false>(hash, sizeof(p), &p);
+	asmX86.generateProgram(p);
 	asmX86.printCode(std::cout);
 }

-void mine(RandomX::VirtualMachine* vm, std::atomic<int>& atomicNonce, AtomicHash& result, int noncesCount, int thread) {
+void generateNative(int nonce) {
 	uint64_t hash[4];
 	unsigned char blockTemplate[] = {
 		0x07, 0x07, 0xf7, 0xa4, 0xf0, 0xd6, 0x05, 0xb3, 0x03, 0x26, 0x08, 0x16, 0xba, 0x3f, 0x10, 0x90, 0x2e, 0x1a, 0x14,
@ -139,18 +141,44 @@ void mine(RandomX::VirtualMachine* vm, std::atomic<int>& atomicNonce, AtomicHash
 		0xc3, 0x8b, 0xde, 0xd3, 0x4d, 0x2d, 0xcd, 0xee, 0xf9, 0x5c, 0xd2, 0x0c, 0xef, 0xc1, 0x2f, 0x61, 0xd5, 0x61, 0x09
 	};
 	int* noncePtr = (int*)(blockTemplate + 39);
+	*noncePtr = nonce;
+	blake2b(hash, sizeof(hash), blockTemplate, sizeof(blockTemplate), nullptr, 0);
+	alignas(16) RandomX::Program prog;
+	fillAes1Rx4<false>((void*)hash, sizeof(prog), &prog);
+	for (int i = 0; i < RandomX::ProgramLength; ++i) {
+		prog(i).dst %= 8;
+		prog(i).src %= 8;
+	}
+	std::cout << prog << std::endl;
+}
+
+void mine(RandomX::VirtualMachine* vm, std::atomic<int>& atomicNonce, AtomicHash& result, int noncesCount, int thread, uint8_t* scratchpad) {
+	alignas(16) uint64_t hash[8];
+	unsigned char blockTemplate[] = {
+		0x07, 0x07, 0xf7, 0xa4, 0xf0, 0xd6, 0x05, 0xb3, 0x03, 0x26, 0x08, 0x16, 0xba, 0x3f, 0x10, 0x90, 0x2e, 0x1a, 0x14,
+		0x5a, 0xc5, 0xfa, 0xd3, 0xaa, 0x3a, 0xf6, 0xea, 0x44, 0xc1, 0x18, 0x69, 0xdc, 0x4f, 0x85, 0x3f, 0x00, 0x2b, 0x2e,
+		0xea, 0x00, 0x00, 0x00, 0x00, 0x77, 0xb2, 0x06, 0xa0, 0x2c, 0xa5, 0xb1, 0xd4, 0xce, 0x6b, 0xbf, 0xdf, 0x0a, 0xca,
+		0xc3, 0x8b, 0xde, 0xd3, 0x4d, 0x2d, 0xcd, 0xee, 0xf9, 0x5c, 0xd2, 0x0c, 0xef, 0xc1, 0x2f, 0x61, 0xd5, 0x61, 0x09
+	};
+	int* noncePtr = (int*)(blockTemplate + 39);
 	int nonce = atomicNonce.fetch_add(1);

 	while (nonce < noncesCount) {
 		//std::cout << "Thread " << thread << " nonce " << nonce << std::endl;
 		*noncePtr = nonce;
 		blake2b(hash, sizeof(hash), blockTemplate, sizeof(blockTemplate), nullptr, 0);
-		int spIndex = ((uint8_t*)hash)[24] | ((((uint8_t*)hash)[25] & 63) << 8);
-		vm->initializeScratchpad(spIndex);
-		vm->initializeProgram(hash);
+		fillAes1Rx4<false>((void*)hash, RandomX::ScratchpadSize, scratchpad);
+		//vm->initializeScratchpad(scratchpad, spIndex);
+		vm->setScratchpad(scratchpad);
 		//dump((char*)((RandomX::CompiledVirtualMachine*)vm)->getProgram(), RandomX::CodeSize, "code-1337-jmp.txt");
+		for (int chain = 0; chain < 8; ++chain) {
+			fillAes1Rx4<false>((void*)hash, sizeof(RandomX::Program), vm->getProgramBuffer());
+			vm->initialize();
 			vm->execute();
-		vm->getResult(hash);
+			vm->getResult<false>(nullptr, 0, hash);
+		}
+		//vm->initializeProgram(hash);
+		vm->getResult<false>(scratchpad, RandomX::ScratchpadSize, hash);
 		result.xorWith(hash);
 		if (RandomX::trace) {
 			std::cout << "Nonce: " << nonce << " ";
@ -162,7 +190,7 @@ void mine(RandomX::VirtualMachine* vm, std::atomic<int>& atomicNonce, AtomicHash
 }

 int main(int argc, char** argv) {
-	bool softAes, lightClient, genAsm, compiled, help;
+	bool softAes, lightClient, genAsm, compiled, help, largePages, async, aesBench, genNative;
 	int programCount, threadCount;
 	readOption("--help", argc, argv, help);

@ -177,33 +205,56 @@ int main(int argc, char** argv) {
 	readOption("--compiled", argc, argv, compiled);
 	readIntOption("--threads", argc, argv, threadCount, 1);
 	readIntOption("--nonces", argc, argv, programCount, 1000);
+	readOption("--largePages", argc, argv, largePages);
+	readOption("--async", argc, argv, async);
+	readOption("--aesBench", argc, argv, aesBench);
+	readOption("--genNative", argc, argv, genNative);

 	if (genAsm) {
 		generateAsm(programCount);
 		return 0;
 	}

+	if (genNative) {
+		generateNative(programCount);
+		return 0;
+	}
+
+	if (softAes)
+		std::cout << "Using software AES." << std::endl;
+
+	if(aesBench) {
+		programCount *= 10;
+		Stopwatch sw(true);
+		if (softAes) {
+			RandomX::aesBench<true>(programCount);
+		}
+		else {
+			RandomX::aesBench<false>(programCount);
+		}
+		sw.stop();
+		std::cout << "AES performance: " << programCount / sw.getElapsed() << " blocks/s" << std::endl;
+		return 0;
+	}
+
 	std::atomic<int> atomicNonce(0);
 	AtomicHash result;
 	std::vector<RandomX::VirtualMachine*> vms;
 	std::vector<std::thread> threads;
 	RandomX::dataset_t dataset;

-	if (softAes)
-		std::cout << "Using software AES." << std::endl;
 	std::cout << "Initializing..." << std::endl;
-
 	try {
 		Stopwatch sw(true);
 		if (softAes) {
-			RandomX::datasetInitCache<true>(seed, dataset);
+			RandomX::datasetInitCache<true>(seed, dataset, largePages);
 		}
 		else {
-			RandomX::datasetInitCache<false>(seed, dataset);
+			RandomX::datasetInitCache<false>(seed, dataset, largePages);
 		}
 		if (RandomX::trace) {
 			std::cout << "Keys: " << std::endl;
-			for (int i = 0; i < dataset.cache->getKeys().size(); ++i) {
+			for (unsigned i = 0; i < dataset.cache->getKeys().size(); ++i) {
 				outputHex(std::cout, (char*)&dataset.cache->getKeys()[i], sizeof(__m128i));
 			}
 			std::cout << std::endl;
@ -212,11 +263,11 @@ int main(int argc, char** argv) {
 			std::cout << std::endl;
 		}
 		if (lightClient) {
-			std::cout << "Cache (64 MiB) initialized in " << sw.getElapsed() << " s" << std::endl;
+			std::cout << "Cache (256 MiB) initialized in " << sw.getElapsed() << " s" << std::endl;
 		}
 		else {
 			RandomX::Cache* cache = dataset.cache;
-			RandomX::datasetAlloc(dataset);
+			RandomX::datasetAlloc(dataset, largePages);
 			if (threadCount > 1) {
 				auto perThread = RandomX::DatasetBlockCount / threadCount;
 				auto remainder = RandomX::DatasetBlockCount % threadCount;
@ -229,7 +280,7 @@ int main(int argc, char** argv) {
 						threads.push_back(std::thread(&RandomX::datasetInit<false>, cache, dataset, i * perThread, count));
 					}
 				}
-				for (int i = 0; i < threads.size(); ++i) {
+				for (unsigned i = 0; i < threads.size(); ++i) {
 					threads[i].join();
 				}
 			}
@ -241,7 +292,7 @@ int main(int argc, char** argv) {
 					RandomX::datasetInit<false>(cache, dataset, 0, RandomX::DatasetBlockCount);
 				}
 			}
-			delete cache;
+			RandomX::Cache::dealloc(cache, largePages);
 			threads.clear();
 			std::cout << "Dataset (4 GiB) initialized in " << sw.getElapsed() << " s" << std::endl;
 		}
@ -249,37 +300,47 @@ int main(int argc, char** argv) {
 		for (int i = 0; i < threadCount; ++i) {
 			RandomX::VirtualMachine* vm;
 			if (compiled) {
-				vm = new RandomX::CompiledVirtualMachine(softAes);
+				vm = new RandomX::CompiledVirtualMachine();
 			}
 			else {
-				vm = new RandomX::InterpretedVirtualMachine(softAes);
+				vm = new RandomX::InterpretedVirtualMachine(softAes, async);
 			}
-			vm->setDataset(dataset, lightClient);
+			vm->setDataset(dataset);
 			vms.push_back(vm);
 		}
+		uint8_t* scratchpadMem;
+		if (largePages) {
+			scratchpadMem = (uint8_t*)allocLargePagesMemory(threadCount * RandomX::ScratchpadSize);
+		}
+		else {
+			scratchpadMem = (uint8_t*)_mm_malloc(threadCount * RandomX::ScratchpadSize, RandomX::CacheLineSize);
+		}
 		std::cout << "Running benchmark (" << programCount << " programs) ..." << std::endl;
 		sw.restart();
 		if (threadCount > 1) {
-			for (int i = 0; i < vms.size(); ++i) {
-				threads.push_back(std::thread(&mine, vms[i], std::ref(atomicNonce), std::ref(result), programCount, i));
+			for (unsigned i = 0; i < vms.size(); ++i) {
+				threads.push_back(std::thread(&mine, vms[i], std::ref(atomicNonce), std::ref(result), programCount, i, scratchpadMem + RandomX::ScratchpadSize * i));
 			}
-			for (int i = 0; i < threads.size(); ++i) {
+			for (unsigned i = 0; i < threads.size(); ++i) {
 				threads[i].join();
 			}
 		}
 		else {
-			mine(vms[0], std::ref(atomicNonce), std::ref(result), programCount, 0);
+			mine(vms[0], std::ref(atomicNonce), std::ref(result), programCount, 0, scratchpadMem);
+			if (compiled)
+				std::cout << "Average program size: " << ((RandomX::CompiledVirtualMachine*)vms[0])->getTotalSize() / programCount << std::endl;
 		}
 		double elapsed = sw.getElapsed();
 		std::cout << "Calculated result: ";
 		result.print(std::cout);
 		if(programCount == 1000)
 			std::cout << "Reference result:  3e1c5f9b9d0bf8ffa250f860bf5f7ab76ac823b206ddee6a592660119a3640c6" << std::endl;
-		std::cout << "Performance: " << programCount / elapsed << " programs per second" << std::endl;
-		/*if (threadCount == 1 && !compiled) {
-			auto ivm = (RandomX::InterpretedVirtualMachine*)vms[0];
-			std::cout << ivm->getProgam();
-		}*/
+		if (lightClient) {
+			std::cout << "Performance: " << 1000 * elapsed / programCount << " ms per hash" << std::endl;
+		}
+		else {
+			std::cout << "Performance: " << programCount / elapsed << " hashes per second" << std::endl;
+		}
 	}
 	catch (std::exception& e) {
 		std::cout << "ERROR: " << e.what() << std::endl;
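
Review note on nonce distribution: `mine` claims work with `atomicNonce.fetch_add(1)` and loops while the claimed nonce is below `noncesCount`. The sketch below isolates that pattern. `worker` and `runWorkers` are hypothetical names, the hashing step is replaced by a counter increment, and the second `fetch_add` at the end of the loop body is an assumption, since that part of `mine` lies outside the hunks shown here.

```cpp
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Each worker repeatedly claims the next unprocessed nonce from a shared
// counter. fetch_add returns the previous value, so no two workers can
// receive the same nonce.
void worker(std::atomic<int>& atomicNonce, int noncesCount, std::vector<int>& hits) {
	int nonce = atomicNonce.fetch_add(1);
	while (nonce < noncesCount) {
		++hits[nonce]; // stand-in for hashing; each element is touched by one thread only
		nonce = atomicNonce.fetch_add(1);
	}
}

// Run threadCount workers over noncesCount nonces and return how many
// times each nonce was processed.
std::vector<int> runWorkers(int threadCount, int noncesCount) {
	std::atomic<int> atomicNonce(0);
	std::vector<int> hits(noncesCount, 0);
	std::vector<std::thread> threads;
	for (int i = 0; i < threadCount; ++i) {
		threads.push_back(std::thread(&worker, std::ref(atomicNonce), noncesCount, std::ref(hits)));
	}
	for (auto& t : threads) {
		t.join();
	}
	return hits;
}
```

Every nonce is processed exactly once regardless of thread count, so threads that finish a hash early simply take the next nonce instead of waiting on a fixed partition.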
--- a/src/program.inc
+++ b/src/program.inc
--- a/src/softAes.h
+++ b/src/softAes.h
@ -26,3 +26,13 @@ __m128i soft_aeskeygenassist(__m128i key, uint8_t rcon);
 __m128i soft_aesenc(__m128i in, __m128i key);

 __m128i soft_aesdec(__m128i in, __m128i key);
+
+template<bool soft>
+inline __m128i aesenc(__m128i in, __m128i key) {
+	return soft ? soft_aesenc(in, key) : _mm_aesenc_si128(in, key);
+}
+
+template<bool soft>
+inline __m128i aesdec(__m128i in, __m128i key) {
+	return soft ? soft_aesdec(in, key) : _mm_aesdec_si128(in, key);
+}
--- a/src/squareHash.S
+++ b/src/squareHash.S
@ -0,0 +1,17 @@
+.intel_syntax noprefix
+#if defined(__APPLE__)
+.text
+#else
+.section .text
+#endif
+#if defined(__WIN32__) || defined(__APPLE__)
+#define DECL(x) _##x
+#else
+#define DECL(x) x
+#endif
+
+.global DECL(squareHash)
+
+DECL(squareHash):
+	mov rcx, rsi
+	#include "asm/squareHash.inc"
--- a/src/squareHash.asm
+++ b/src/squareHash.asm
@ -0,0 +1,9 @@
+PUBLIC squareHash
+
+.code
+
+squareHash PROC
+	include asm/squareHash.inc
+squareHash ENDP
+
+END
--- a/src/squareHash.h
+++ b/src/squareHash.h
@ -0,0 +1,76 @@
+/*
+Copyright (c) 2019 tevador
+
+This file is part of RandomX.
+
+RandomX is free software: you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation, either version 3 of the License, or
+(at your option) any later version.
+
+RandomX is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with RandomX.  If not, see<http://www.gnu.org/licenses/>.
+*/
+
+/*
+	Based on the original idea by SChernykh:
+	https://github.com/SChernykh/xmr-stak-cpu/issues/1#issuecomment-414336613
+*/
+
+#include <stdint.h>
+
+#if !defined(_M_X64) && !defined(__x86_64__)
+
+typedef struct {
+	uint64_t lo;
+	uint64_t hi;
+} uint128_t;
+
+#define LO(x) ((x)&0xffffffff)
+#define HI(x) ((x)>>32)
+static inline uint128_t square128(uint64_t x) {
+	uint64_t xh = HI(x), xl = LO(x);
+	uint64_t xll = xl * xl;
+	uint64_t xlh = xl * xh;
+	uint64_t xhh = xh * xh;
+	uint64_t m1 = 2 * LO(xlh) + HI(xll);
+	uint64_t m2 = 2 * HI(xlh) + LO(xhh) + HI(m1);
+	uint64_t m3 = HI(xhh) + HI(m2);
+
+	uint128_t x2;
+
+	x2.lo = (m1 << 32) + LO(xll);
+	x2.hi = (m3 << 32) + LO(m2);
+
+	return x2;
+}
+#undef LO
+#undef HI
+
+inline uint64_t squareHash(uint64_t x) {
+	x += 1613783669344650115;
+	for (int i = 0; i < 42; ++i) {
+		uint128_t x2 = square128(x);
+		x = x2.lo - x2.hi;
+	}
+	return x;
+}
+
+#else
+
+#if defined(__cplusplus)
+extern "C" {
+#endif
+
+uint64_t squareHash(uint64_t);
+
+#if defined(__cplusplus)
+}
+#endif
+
+#endif
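
Review note on the portable `square128`: the fallback above builds the 128-bit square of a 64-bit value from three 32x32-bit partial products. The sketch below restates it without macros so the carry handling can be checked against known squares. `U128` and `squareHashPortable` are hypothetical names, and no claim is made here about matching the output of the assembly version in `asm/squareHash.inc`, which is not shown in this diff.

```cpp
#include <cstdint>

struct U128 {
	uint64_t lo;
	uint64_t hi;
};

// x = xh * 2^32 + xl, so x^2 = xhh * 2^64 + 2 * xlh * 2^32 + xll.
inline U128 square128(uint64_t x) {
	const uint64_t xh = x >> 32, xl = x & 0xffffffff;
	const uint64_t xll = xl * xl;
	const uint64_t xlh = xl * xh;
	const uint64_t xhh = xh * xh;
	// m1, m2, m3 accumulate bits 32..63, 64..95 and 96..127, each passing
	// its overflow above 32 bits on to the next term.
	const uint64_t m1 = 2 * (xlh & 0xffffffff) + (xll >> 32);
	const uint64_t m2 = 2 * (xlh >> 32) + (xhh & 0xffffffff) + (m1 >> 32);
	const uint64_t m3 = (xhh >> 32) + (m2 >> 32);
	U128 r;
	r.lo = (m1 << 32) + (xll & 0xffffffff);
	r.hi = (m3 << 32) + (m2 & 0xffffffff);
	return r;
}

// Same iteration as the fallback squareHash in the diff: add a constant,
// then 42 rounds of "low half minus high half" of the square.
inline uint64_t squareHashPortable(uint64_t x) {
	x += 1613783669344650115ULL;
	for (int i = 0; i < 42; ++i) {
		const U128 x2 = square128(x);
		x = x2.lo - x2.hi;
	}
	return x;
}
```

A quick sanity check is the largest input: (2^64 - 1)^2 = 2^128 - 2^65 + 1, so the high half is `0xfffffffffffffffe` and the low half is `1`.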
--- a/src/t1ha/t1ha.h
+++ b/src/t1ha/t1ha.h
@ -1,723 +0,0 @@
-/*
- *  Copyright (c) 2016-2018 Positive Technologies, https://www.ptsecurity.com,
- *  Fast Positive Hash.
- *
- *  Portions Copyright (c) 2010-2018 Leonid Yuriev <leo@yuriev.ru>,
- *  The 1Hippeus project (t1h).
- *
- *  This software is provided 'as-is', without any express or implied
- *  warranty. In no event will the authors be held liable for any damages
- *  arising from the use of this software.
- *
- *  Permission is granted to anyone to use this software for any purpose,
- *  including commercial applications, and to alter it and redistribute it
- *  freely, subject to the following restrictions:
- *
- *  1. The origin of this software must not be misrepresented; you must not
- *     claim that you wrote the original software. If you use this software
- *     in a product, an acknowledgement in the product documentation would be
- *     appreciated but is not required.
- *  2. Altered source versions must be plainly marked as such, and must not be
- *     misrepresented as being the original software.
- *  3. This notice may not be removed or altered from any source distribution.
- */
-
-/*
- * t1ha = { Fast Positive Hash, aka "Позитивный Хэш" }
- * by [Positive Technologies](https://www.ptsecurity.ru)
- *
- * Briefly, it is a 64-bit Hash Function:
- *  1. Created for 64-bit little-endian platforms, in predominantly for x86_64,
- *     but portable and without penalties it can run on any 64-bit CPU.
- *  2. In most cases up to 15% faster than City64, xxHash, mum-hash, metro-hash
- *     and all others portable hash-functions (which do not use specific
- *     hardware tricks).
- *  3. Not suitable for cryptography.
- *
- * The Future will Positive. Всё будет хорошо.
- *
- * ACKNOWLEDGEMENT:
- * The t1ha was originally developed by Leonid Yuriev (Леонид Юрьев)
- * for The 1Hippeus project - zerocopy messaging in the spirit of Sparta!
- */
-
-#pragma once
-
-/*****************************************************************************
- *
- * PLEASE PAY ATTENTION TO THE FOLLOWING NOTES
- * about macros definitions which controls t1ha behaviour and/or performance.
- *
- *
- * 1) T1HA_SYS_UNALIGNED_ACCESS = Defines the system/platform/CPU/architecture
- *                                abilities for unaligned data access.
- *
- *    By default, when the T1HA_SYS_UNALIGNED_ACCESS not defined,
- *    it will defined on the basis hardcoded knowledge about of capabilities
- *    of most common CPU architectures. But you could override this
- *    default behavior when build t1ha library itself:
- *
- *      // To disable unaligned access at all.
- *      #define T1HA_SYS_UNALIGNED_ACCESS 0
- *
- *      // To enable unaligned access, but indicate that it is significantly slow.
- *      #define T1HA_SYS_UNALIGNED_ACCESS 1
- *
- *      // To enable unaligned access, and indicate that it is efficient.
- *      #define T1HA_SYS_UNALIGNED_ACCESS 2
- *
- *
- * 2) T1HA_USE_FAST_ONESHOT_READ = Controls the data reads at the end of buffer.
- *
- *    When defined to non-zero, t1ha will use 'one shot' method for reading
- *    up to 8 bytes at the end of data. In this case just the one 64-bit read
- *    will be performed even when the available less than 8 bytes.
- *
- *    This is little bit faster that switching by length of data tail.
- *    Unfortunately this will triggering a false-positive alarms from Valgrind,
- *    AddressSanitizer and other similar tool.
- *
- *    By default, t1ha defines it to 1, but you could override this
- *    default behavior when build t1ha library itself:
- *
- *      // For little bit faster and small code.
- *      #define T1HA_USE_FAST_ONESHOT_READ 1
- *
- *      // For calmness if doubt.
- *      #define T1HA_USE_FAST_ONESHOT_READ 0
- *
- *
- * 3) T1HA0_RUNTIME_SELECT = Controls choice fastest function in runtime.
- *
- *    t1ha library offers the t1ha0() function as the fastest for current CPU.
- *    But actual CPU's features/capabilities and may be significantly different,
- *    especially on x86 platform. Therefore, internally, t1ha0() may require
- *    dynamic dispatching for choice best implementation.
- *
- *    By default, t1ha enables such runtime choice and (may be) corresponding
- *    indirect calls if it reasonable, but you could override this default
- *    behavior when build t1ha library itself:
- *
- *      // To enable runtime choice of fastest implementation.
- *      #define T1HA0_RUNTIME_SELECT 1
- *
- *      // To disable runtime choice of fastest implementation.
- *      #define T1HA0_RUNTIME_SELECT 0
- *
- *    When T1HA0_RUNTIME_SELECT is nonzero the t1ha0_resolve() function could
- *    be used to get actual t1ha0() implementation address at runtime. This is
- *    useful for two cases:
- *      - calling by local pointer-to-function usually is little
- *        bit faster (less overhead) than via a PLT thru the DSO boundary.
- *      - GNU Indirect functions (see below) don't supported by environment
- *        and calling by t1ha0_funcptr is not available and/or expensive.
- *
- * 4) T1HA_USE_INDIRECT_FUNCTIONS = Controls usage of GNU Indirect functions.
- *
- *    In continue of T1HA0_RUNTIME_SELECT the T1HA_USE_INDIRECT_FUNCTIONS
- *    controls usage of ELF indirect functions feature. In general, when
- *    available, this reduces overhead of indirect function's calls though
- *    a DSO-boundary (https://sourceware.org/glibc/wiki/GNU_IFUNC).
- *
- *    By default, t1ha engage GNU Indirect functions when it available
- *    and useful, but you could override this default behavior when build
- *    t1ha library itself:
- *
- *      // To enable use of GNU ELF Indirect functions.
- *      #define T1HA_USE_INDIRECT_FUNCTIONS 1
- *
- *      // To disable use of GNU ELF Indirect functions. This may be useful
- *      // if the actual toolchain or the system's loader don't support ones.
- *      #define T1HA_USE_INDIRECT_FUNCTIONS 0
- *
- * 5) T1HA0_AESNI_AVAILABLE = Controls AES-NI detection and dispatching on x86.
- *
- *    In continue of T1HA0_RUNTIME_SELECT the T1HA0_AESNI_AVAILABLE controls
- *    detection and usage of AES-NI CPU's feature. On the other hand, this
- *    requires compiling parts of t1ha library with certain properly options,
- *    and could be difficult or inconvenient in some cases.
- *
- *    By default, t1ha engages AES-NI for t1ha0() on the x86 platform, but
- *    you could override this default behavior when build t1ha library itself:
- *
- *      // To disable detection and usage of AES-NI instructions for t1ha0().
- *      // This may be useful when you unable to build t1ha library properly
- *      // or known that AES-NI will be unavailable at the deploy.
- *      #define T1HA0_AESNI_AVAILABLE 0
- *
- *      // To force detection and usage of AES-NI instructions for t1ha0(),
- *      // but I don't known reasons to anybody would need this.
- *      #define T1HA0_AESNI_AVAILABLE 1
- *
- * 6) T1HA0_DISABLED, T1HA1_DISABLED, T1HA2_DISABLED = Controls availability of
- *    t1ha functions.
- *
- *    In some cases could be useful to import/use only few of t1ha functions
- *    or just the one. So, this definitions allows disable corresponding parts
- *    of t1ha library.
- *
- *      // To disable t1ha0(), t1ha0_32le(), t1ha0_32be() and all AES-NI.
- *      #define T1HA0_DISABLED
- *
- *      // To disable t1ha1_le() and t1ha1_be().
- *      #define T1HA1_DISABLED
- *
- *      // To disable t1ha2_atonce(), t1ha2_atonce128() and so on.
- *      #define T1HA2_DISABLED
- *
- *****************************************************************************/
-
-#define T1HA_VERSION_MAJOR 2
-#define T1HA_VERSION_MINOR 1
-#define T1HA_VERSION_RELEASE 0
-
-#ifndef __has_attribute
-#define __has_attribute(x) (0)
-#endif
-
-#ifndef __has_include
-#define __has_include(x) (0)
-#endif
-
-#ifndef __GNUC_PREREQ
-#if defined(__GNUC__) && defined(__GNUC_MINOR__)
-#define __GNUC_PREREQ(maj, min)                                                \
-  ((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min))
-#else
-#define __GNUC_PREREQ(maj, min) 0
-#endif
-#endif /* __GNUC_PREREQ */
-
-#ifndef __CLANG_PREREQ
-#ifdef __clang__
-#define __CLANG_PREREQ(maj, min)                                               \
-  ((__clang_major__ << 16) + __clang_minor__ >= ((maj) << 16) + (min))
-#else
-#define __CLANG_PREREQ(maj, min) (0)
-#endif
-#endif /* __CLANG_PREREQ */
-
-#ifndef __LCC_PREREQ
-#ifdef __LCC__
-#define __LCC_PREREQ(maj, min)                                                 \
-  ((__LCC__ << 16) + __LCC_MINOR__ >= ((maj) << 16) + (min))
-#else
-#define __LCC_PREREQ(maj, min) (0)
-#endif
-#endif /* __LCC_PREREQ */
-
-/*****************************************************************************/
-
-#ifdef _MSC_VER
-/* Avoid '16' bytes padding added after data member 't1ha_context::total'
- * and other warnings from std-headers if warning-level > 3. */
-#pragma warning(push, 3)
-#endif
-
-#if defined(__cplusplus) && __cplusplus >= 201103L
-#include <climits>
-#include <cstddef>
-#include <cstdint>
-#else
-#include <limits.h>
-#include <stddef.h>
-#include <stdint.h>
-#endif
-
-/*****************************************************************************/
-
-#if defined(i386) || defined(__386) || defined(__i386) || defined(__i386__) || \
-    defined(i486) || defined(__i486) || defined(__i486__) ||                   \
-    defined(i586) | defined(__i586) || defined(__i586__) || defined(i686) ||   \
-    defined(__i686) || defined(__i686__) || defined(_M_IX86) ||                \
-    defined(_X86_) || defined(__THW_INTEL__) || defined(__I86__) ||            \
-    defined(__INTEL__) || defined(__x86_64) || defined(__x86_64__) ||          \
-    defined(__amd64__) || defined(__amd64) || defined(_M_X64) ||               \
-    defined(_M_AMD64) || defined(__IA32__) || defined(__INTEL__)
-#ifndef __ia32__
-/* LY: define neutral __ia32__ for x86 and x86-64 archs */
-#define __ia32__ 1
-#endif /* __ia32__ */
-#if !defined(__amd64__) && (defined(__x86_64) || defined(__x86_64__) ||        \
-                            defined(__amd64) || defined(_M_X64))
-/* LY: define trusty __amd64__ for all AMD64/x86-64 arch */
-#define __amd64__ 1
-#endif /* __amd64__ */
-#endif /* all x86 */
-
-#if !defined(__BYTE_ORDER__) || !defined(__ORDER_LITTLE_ENDIAN__) ||           \
-    !defined(__ORDER_BIG_ENDIAN__)
-
-/* *INDENT-OFF* */
-/* clang-format off */
-
-#if defined(__GLIBC__) || defined(__GNU_LIBRARY__) || defined(__ANDROID__) ||  \
-    defined(HAVE_ENDIAN_H) || __has_include(<endian.h>)
-#include <endian.h>
-#elif defined(__APPLE__) || defined(__MACH__) || defined(__OpenBSD__) ||       \
-    defined(HAVE_MACHINE_ENDIAN_H) || __has_include(<machine/endian.h>)
-#include <machine/endian.h>
-#elif defined(HAVE_SYS_ISA_DEFS_H) || __has_include(<sys/isa_defs.h>)
-#include <sys/isa_defs.h>
-#elif (defined(HAVE_SYS_TYPES_H) && defined(HAVE_SYS_ENDIAN_H)) ||             \
-    (__has_include(<sys/types.h>) && __has_include(<sys/endian.h>))
-#include <sys/endian.h>
-#include <sys/types.h>
-#elif defined(__bsdi__) || defined(__DragonFly__) || defined(__FreeBSD__) ||   \
-    defined(__NETBSD__) || defined(__NetBSD__) ||                              \
-    defined(HAVE_SYS_PARAM_H) || __has_include(<sys/param.h>)
-#include <sys/param.h>
-#endif /* OS */
-
-/* *INDENT-ON* */
-/* clang-format on */
-
-#if defined(__BYTE_ORDER) && defined(__LITTLE_ENDIAN) && defined(__BIG_ENDIAN)
-#define __ORDER_LITTLE_ENDIAN__ __LITTLE_ENDIAN
-#define __ORDER_BIG_ENDIAN__ __BIG_ENDIAN
-#define __BYTE_ORDER__ __BYTE_ORDER
-#elif defined(_BYTE_ORDER) && defined(_LITTLE_ENDIAN) && defined(_BIG_ENDIAN)
-#define __ORDER_LITTLE_ENDIAN__ _LITTLE_ENDIAN
-#define __ORDER_BIG_ENDIAN__ _BIG_ENDIAN
-#define __BYTE_ORDER__ _BYTE_ORDER
-#else
-#define __ORDER_LITTLE_ENDIAN__ 1234
-#define __ORDER_BIG_ENDIAN__ 4321
-
-#if defined(__LITTLE_ENDIAN__) ||                                              \
-    (defined(_LITTLE_ENDIAN) && !defined(_BIG_ENDIAN)) ||                      \
-    defined(__ARMEL__) || defined(__THUMBEL__) || defined(__AARCH64EL__) ||    \
-    defined(__MIPSEL__) || defined(_MIPSEL) || defined(__MIPSEL) ||            \
-    defined(_M_ARM) || defined(_M_ARM64) || defined(__e2k__) ||                \
-    defined(__elbrus_4c__) || defined(__elbrus_8c__) || defined(__bfin__) ||   \
-    defined(__BFIN__) || defined(__ia64__) || defined(_IA64) ||                \
-    defined(__IA64__) || defined(__ia64) || defined(_M_IA64) ||                \
-    defined(__itanium__) || defined(__ia32__) || defined(__CYGWIN__) ||        \
-    defined(_WIN64) || defined(_WIN32) || defined(__TOS_WIN__) ||              \
-    defined(__WINDOWS__)
-#define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__
-
-#elif defined(__BIG_ENDIAN__) ||                                               \
-    (defined(_BIG_ENDIAN) && !defined(_LITTLE_ENDIAN)) ||                      \
-    defined(__ARMEB__) || defined(__THUMBEB__) || defined(__AARCH64EB__) ||    \
-    defined(__MIPSEB__) || defined(_MIPSEB) || defined(__MIPSEB) ||            \
-    defined(__m68k__) || defined(M68000) || defined(__hppa__) ||               \
-    defined(__hppa) || defined(__HPPA__) || defined(__sparc__) ||              \
-    defined(__sparc) || defined(__370__) || defined(__THW_370__) ||            \
-    defined(__s390__) || defined(__s390x__) || defined(__SYSC_ZARCH__)
-#define __BYTE_ORDER__ __ORDER_BIG_ENDIAN__
-
-#else
-#error __BYTE_ORDER__ should be defined.
-#endif /* Arch */
-
-#endif
-#endif /* __BYTE_ORDER__ || __ORDER_LITTLE_ENDIAN__ || __ORDER_BIG_ENDIAN__ */
-
-/*****************************************************************************/
-
-#ifndef __dll_export
-#if defined(_WIN32) || defined(_WIN64) || defined(__CYGWIN__)
-#if defined(__GNUC__) || __has_attribute(dllexport)
-#define __dll_export __attribute__((dllexport))
-#elif defined(_MSC_VER)
-#define __dll_export __declspec(dllexport)
-#else
-#define __dll_export
-#endif
-#elif defined(__GNUC__) || __has_attribute(visibility)
-#define __dll_export __attribute__((visibility("default")))
-#else
-#define __dll_export
-#endif
-#endif /* __dll_export */
-
-#ifndef __dll_import
-#if defined(_WIN32) || defined(_WIN64) || defined(__CYGWIN__)
-#if defined(__GNUC__) || __has_attribute(dllimport)
-#define __dll_import __attribute__((dllimport))
-#elif defined(_MSC_VER)
-#define __dll_import __declspec(dllimport)
-#else
-#define __dll_import
-#endif
-#else
-#define __dll_import
-#endif
-#endif /* __dll_import */
-
-#ifndef __force_inline
-#ifdef _MSC_VER
-#define __force_inline __forceinline
-#elif __GNUC_PREREQ(3, 2) || __has_attribute(always_inline)
-#define __force_inline __inline __attribute__((always_inline))
-#else
-#define __force_inline __inline
-#endif
-#endif /* __force_inline */
-
-#ifndef T1HA_API
-#if defined(t1ha_EXPORTS)
-#define T1HA_API __dll_export
-#elif defined(t1ha_IMPORTS)
-#define T1HA_API __dll_import
-#else
-#define T1HA_API
-#endif
-#endif /* T1HA_API */
-
-#if defined(_MSC_VER) && defined(__ia32__)
-#define T1HA_ALIGN_PREFIX __declspec(align(32)) /* required only for SIMD */
-#else
-#define T1HA_ALIGN_PREFIX
-#endif /* _MSC_VER */
-
-#if defined(__GNUC__) && defined(__ia32__)
-#define T1HA_ALIGN_SUFFIX                                                      \
-  __attribute__((aligned(32))) /* required only for SIMD */
-#else
-#define T1HA_ALIGN_SUFFIX
-#endif /* GCC x86 */
-
-#ifndef T1HA_USE_INDIRECT_FUNCTIONS
-/* GNU ELF indirect functions usage control. For more info please see
- * https://en.wikipedia.org/wiki/Executable_and_Linkable_Format
- * and https://sourceware.org/glibc/wiki/GNU_IFUNC */
-#if __has_attribute(ifunc) &&                                                  \
-    defined(__ELF__) /* ifunc is broken on Darwin/OSX */
-/* Use ifunc/gnu_indirect_function if corresponding attribute is available,
- * Assuming compiler will generate properly code even when
- * the -fstack-protector-all and/or the -fsanitize=address are enabled. */
-#define T1HA_USE_INDIRECT_FUNCTIONS 1
-#elif defined(__ELF__) && !defined(__SANITIZE_ADDRESS__) &&                    \
-    !defined(__SSP_ALL__)
-/* ifunc/gnu_indirect_function will be used on ELF, but only if both
- * -fstack-protector-all and -fsanitize=address are NOT enabled. */
-#define T1HA_USE_INDIRECT_FUNCTIONS 1
-#else
-#define T1HA_USE_INDIRECT_FUNCTIONS 0
-#endif
-#endif /* T1HA_USE_INDIRECT_FUNCTIONS */
-
-#if __GNUC_PREREQ(4, 0)
-#pragma GCC visibility push(hidden)
-#endif /* __GNUC_PREREQ(4,0) */
-
-#ifdef __cplusplus
-extern "C" {
-#endif
-
-typedef union T1HA_ALIGN_PREFIX t1ha_state256 {
-  uint8_t bytes[32];
-  uint32_t u32[8];
-  uint64_t u64[4];
-  struct {
-    uint64_t a, b, c, d;
-  } n;
-} t1ha_state256_t T1HA_ALIGN_SUFFIX;
-
-typedef struct t1ha_context {
-  t1ha_state256_t state;
-  t1ha_state256_t buffer;
-  size_t partial;
-  uint64_t total;
-} t1ha_context_t;
-
-#ifdef _MSC_VER
-#pragma warning(pop)
-#endif
-
-/******************************************************************************
- *
- * Self-testing API.
- *
- * Unfortunately, some compilers (exactly only Microsoft Visual C/C++) has
- * a bugs which leads t1ha-functions to produce wrong results. This API allows
- * check the correctness of the actual code in runtime.
- *
- * All check-functions returns 0 on success, or -1 in case the corresponding
- * hash-function failed verification. PLEASE, always perform such checking at
- * initialization of your code, if you using MSVC or other troubleful compilers.
- */
-
-T1HA_API int t1ha_selfcheck__all_enabled(void);
-
-#ifndef T1HA2_DISABLED
-T1HA_API int t1ha_selfcheck__t1ha2_atonce(void);
-T1HA_API int t1ha_selfcheck__t1ha2_atonce128(void);
-T1HA_API int t1ha_selfcheck__t1ha2_stream(void);
-T1HA_API int t1ha_selfcheck__t1ha2(void);
-#endif /* T1HA2_DISABLED */
-
-#ifndef T1HA1_DISABLED
-T1HA_API int t1ha_selfcheck__t1ha1_le(void);
-T1HA_API int t1ha_selfcheck__t1ha1_be(void);
-T1HA_API int t1ha_selfcheck__t1ha1(void);
-#endif /* T1HA1_DISABLED */
-
-#ifndef T1HA0_DISABLED
-T1HA_API int t1ha_selfcheck__t1ha0_32le(void);
-T1HA_API int t1ha_selfcheck__t1ha0_32be(void);
-T1HA_API int t1ha_selfcheck__t1ha0(void);
-
-/* Define T1HA0_AESNI_AVAILABLE to 0 for disable AES-NI support. */
-#ifndef T1HA0_AESNI_AVAILABLE
-#if defined(__e2k__) ||                                                        \
-    (defined(__ia32__) && (!defined(_M_IX86) || _MSC_VER > 1800))
-#define T1HA0_AESNI_AVAILABLE 1
-#else
-#define T1HA0_AESNI_AVAILABLE 0
-#endif
-#endif /* ifndef T1HA0_AESNI_AVAILABLE */
-
-#if T1HA0_AESNI_AVAILABLE
-T1HA_API int t1ha_selfcheck__t1ha0_ia32aes_noavx(void);
-T1HA_API int t1ha_selfcheck__t1ha0_ia32aes_avx(void);
-#ifndef __e2k__
-T1HA_API int t1ha_selfcheck__t1ha0_ia32aes_avx2(void);
-#endif
-#endif /* if T1HA0_AESNI_AVAILABLE */
-#endif /* T1HA0_DISABLED */
-
-/******************************************************************************
- *
- *  t1ha2 = 64 and 128-bit, SLIGHTLY MORE ATTENTION FOR QUALITY AND STRENGTH.
- *
- *    - The recommended version of "Fast Positive Hash" with good quality
- *      for checksum, hash tables and fingerprinting.
- *    - Portable and extremely efficient on modern 64-bit CPUs.
- *      Designed for 64-bit little-endian platforms;
- *      on other platforms it will run slowly.
- *    - Great quality of hashing and still faster than other non-t1ha hashes.
- *      Provides streaming mode and 128-bit result.
- *
- * Note: For performance reasons the 64- and 128-bit results are completely
- *       different from each other, i.e. the 64-bit result is NOT any part of the 128-bit one.
- */
-#ifndef T1HA2_DISABLED
-
-/* The at-once variant with 64-bit result */
-T1HA_API uint64_t t1ha2_atonce(const void *data, size_t length, uint64_t seed);
-
-/* The at-once variant with 128-bit result.
- * Argument `extra_result` is NOT optional and MUST be valid.
- * The high 64-bit part of 128-bit hash will be always unconditionally
- * stored to the address given by `extra_result` argument. */
-T1HA_API uint64_t t1ha2_atonce128(uint64_t *__restrict extra_result,
-                                  const void *__restrict data, size_t length,
-                                  uint64_t seed);
-
-/* The init/update/final trinity for streaming.
- * Returns a 64- or 128-bit result depending on the `extra_result` argument. */
-T1HA_API void t1ha2_init(t1ha_context_t *ctx, uint64_t seed_x, uint64_t seed_y);
-T1HA_API void t1ha2_update(t1ha_context_t *__restrict ctx,
-                           const void *__restrict data, size_t length);
-
-/* Argument `extra_result` is optional and MAY be NULL.
- *  - If `extra_result` is NOT NULL then the 128-bit hash will be calculated,
- *    and high 64-bit part of it will be stored to the address given
- *    by `extra_result` argument.
- *  - Otherwise the 64-bit hash will be calculated
- *    and returned from function directly.
- *
- * Note: For performance reasons the 64- and 128-bit results are completely
- *       different from each other, i.e. the 64-bit result is NOT any part of the 128-bit one. */
-T1HA_API uint64_t t1ha2_final(t1ha_context_t *__restrict ctx,
-                              uint64_t *__restrict extra_result /* optional */);
-
-#endif /* T1HA2_DISABLED */
-
-/******************************************************************************
- *
- *  t1ha1 = 64-bit, BASELINE FAST PORTABLE HASH:
- *
- *    - Runs fast on 64-bit platforms; on other platforms it may run slowly.
- *    - Portable and stable, returns same 64-bit result
- *      on all architectures and CPUs.
- *    - Unfortunately it fails the "strict avalanche criteria",
- *      see test results at https://github.com/demerphq/smhasher.
- *
- *      This flaw is insignificant for the t1ha1() purposes and imperceptible
- *      from a practical point of view.
- *      However, this issue has since been resolved in the successor t1ha2(),
- *      which was originally planned to provide a bit more quality.
- */
-#ifndef T1HA1_DISABLED
-
-/* The little-endian variant. */
-T1HA_API uint64_t t1ha1_le(const void *data, size_t length, uint64_t seed);
-
-/* The big-endian variant. */
-T1HA_API uint64_t t1ha1_be(const void *data, size_t length, uint64_t seed);
-
-#endif /* T1HA1_DISABLED */
-
-/******************************************************************************
- *
- *  t1ha0 = 64-bit, JUST ONLY FASTER:
- *
- *    - Provides fast-as-possible hashing for current CPU, including
- *      32-bit systems and engaging the available hardware acceleration.
- *    - It is a facade that selects the most quick-and-dirty hash
- *      for the current processor. For instance, on IA32 (x86) the actual function
- *      is selected at runtime, depending on current CPU capabilities.
- *
- * BE CAREFUL!!!  THIS MEANS:
- *
- *   1. The quality of the hash is subject to tradeoffs with performance.
- *      So, the quality and strength of t1ha0() may be lower than t1ha1(),
- *      especially on 32-bit targets, but it is then much faster.
- *      However, it is guaranteed to pass all SMHasher tests.
- *
- *   2. No warranty that the hash result will be the same for a particular
- *      key on another machine or another version of libt1ha.
- *
- *      Briefly, such hash results and their derivatives should be
- *      used only at runtime, and should not be persisted or transferred
- *      over a network.
- *
- *
- *  When T1HA0_RUNTIME_SELECT is nonzero the t1ha0_resolve() function can
- *  be used to get the actual t1ha0() implementation address at runtime. This is
- *  useful in two cases:
- *    - calling through a local pointer-to-function is usually a little
- *      bit faster (less overhead) than via the PLT across the DSO boundary.
- *    - GNU indirect functions (see below) are not supported by the environment
- *      and calling via t1ha0_funcptr is not available and/or expensive.
- */
-
-#ifndef T1HA0_DISABLED
-
-/* The little-endian variant for 32-bit CPU. */
-uint64_t t1ha0_32le(const void *data, size_t length, uint64_t seed);
-/* The big-endian variant for 32-bit CPU. */
-uint64_t t1ha0_32be(const void *data, size_t length, uint64_t seed);
-
-/* Define T1HA0_AESNI_AVAILABLE to 0 to disable AES-NI support. */
-#ifndef T1HA0_AESNI_AVAILABLE
-#if defined(__e2k__) ||                                                        \
-    (defined(__ia32__) && (!defined(_M_IX86) || _MSC_VER > 1800))
-#define T1HA0_AESNI_AVAILABLE 1
-#else
-#define T1HA0_AESNI_AVAILABLE 0
-#endif
-#endif /* T1HA0_AESNI_AVAILABLE */
-
-/* Define T1HA0_RUNTIME_SELECT to 0 to disable runtime dispatching of t1ha0. */
-#ifndef T1HA0_RUNTIME_SELECT
-#if T1HA0_AESNI_AVAILABLE && !defined(__e2k__)
-#define T1HA0_RUNTIME_SELECT 1
-#else
-#define T1HA0_RUNTIME_SELECT 0
-#endif
-#endif /* T1HA0_RUNTIME_SELECT */
-
-#if !T1HA0_RUNTIME_SELECT && !defined(T1HA0_USE_DEFINE)
-#if defined(__LCC__)
-#define T1HA0_USE_DEFINE 1
-#else
-#define T1HA0_USE_DEFINE 0
-#endif
-#endif /* T1HA0_USE_DEFINE */
-
-#if T1HA0_AESNI_AVAILABLE
-uint64_t t1ha0_ia32aes_noavx(const void *data, size_t length, uint64_t seed);
-uint64_t t1ha0_ia32aes_avx(const void *data, size_t length, uint64_t seed);
-#ifndef __e2k__
-uint64_t t1ha0_ia32aes_avx2(const void *data, size_t length, uint64_t seed);
-#endif
-#endif /* T1HA0_AESNI_AVAILABLE */
-
-#if T1HA0_RUNTIME_SELECT
-typedef uint64_t (*t1ha0_function_t)(const void *, size_t, uint64_t);
-T1HA_API t1ha0_function_t t1ha0_resolve(void);
-#if T1HA_USE_INDIRECT_FUNCTIONS
-T1HA_API uint64_t t1ha0(const void *data, size_t length, uint64_t seed);
-#else
-/* Otherwise function pointer will be used.
- * Unfortunately this may cause some overhead calling. */
-T1HA_API extern uint64_t (*t1ha0_funcptr)(const void *data, size_t length,
-                                          uint64_t seed);
-static __force_inline uint64_t t1ha0(const void *data, size_t length,
-                                     uint64_t seed) {
-  return t1ha0_funcptr(data, length, seed);
-}
-#endif /* T1HA_USE_INDIRECT_FUNCTIONS */
-
-#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
-
-#if T1HA0_USE_DEFINE
-
-#if (UINTPTR_MAX > 0xffffFFFFul || ULONG_MAX > 0xffffFFFFul) &&                \
-    (!defined(T1HA1_DISABLED) || !defined(T1HA2_DISABLED))
-#if defined(T1HA1_DISABLED)
-#define t1ha0 t1ha2_atonce
-#else
-#define t1ha0 t1ha1_be
-#endif /* T1HA1_DISABLED */
-#else  /* 32/64 */
-#define t1ha0 t1ha0_32be
-#endif /* 32/64 */
-
-#else /* T1HA0_USE_DEFINE */
-
-static __force_inline uint64_t t1ha0(const void *data, size_t length,
-                                     uint64_t seed) {
-#if (UINTPTR_MAX > 0xffffFFFFul || ULONG_MAX > 0xffffFFFFul) &&                \
-    (!defined(T1HA1_DISABLED) || !defined(T1HA2_DISABLED))
-#if defined(T1HA1_DISABLED)
-  return t1ha2_atonce(data, length, seed);
-#else
-  return t1ha1_be(data, length, seed);
-#endif /* T1HA1_DISABLED */
-#else  /* 32/64 */
-  return t1ha0_32be(data, length, seed);
-#endif /* 32/64 */
-}
-
-#endif /* !T1HA0_USE_DEFINE */
-
-#else /* !T1HA0_RUNTIME_SELECT && __BYTE_ORDER__ != __ORDER_BIG_ENDIAN__ */
-
-#if T1HA0_USE_DEFINE
-
-#if (UINTPTR_MAX > 0xffffFFFFul || ULONG_MAX > 0xffffFFFFul) &&                \
-    (!defined(T1HA1_DISABLED) || !defined(T1HA2_DISABLED))
-#if defined(T1HA1_DISABLED)
-#define t1ha0 t1ha2_atonce
-#else
-#define t1ha0 t1ha1_le
-#endif /* T1HA1_DISABLED */
-#else  /* 32/64 */
-#define t1ha0 t1ha0_32le
-#endif /* 32/64 */
-
-#else
-
-static __force_inline uint64_t t1ha0(const void *data, size_t length,
-                                     uint64_t seed) {
-#if (UINTPTR_MAX > 0xffffFFFFul || ULONG_MAX > 0xffffFFFFul) &&                \
-    (!defined(T1HA1_DISABLED) || !defined(T1HA2_DISABLED))
-#if defined(T1HA1_DISABLED)
-  return t1ha2_atonce(data, length, seed);
-#else
-  return t1ha1_le(data, length, seed);
-#endif /* T1HA1_DISABLED */
-#else  /* 32/64 */
-  return t1ha0_32le(data, length, seed);
-#endif /* 32/64 */
-}
-
-#endif /* !T1HA0_USE_DEFINE */
-
-#endif /* !T1HA0_RUNTIME_SELECT */
-
-#endif /* T1HA0_DISABLED */
-
-#ifdef __cplusplus
-}
-#endif
-
-#if __GNUC_PREREQ(4, 0)
-#pragma GCC visibility pop
-#endif /* __GNUC_PREREQ(4,0) */
--- a/src/t1ha/t1ha2.c
+++ b/src/t1ha/t1ha2.c
@ -1,329 +0,0 @@
-/*
- *  Copyright (c) 2016-2018 Positive Technologies, https://www.ptsecurity.com,
- *  Fast Positive Hash.
- *
- *  Portions Copyright (c) 2010-2018 Leonid Yuriev <leo@yuriev.ru>,
- *  The 1Hippeus project (t1h).
- *
- *  This software is provided 'as-is', without any express or implied
- *  warranty. In no event will the authors be held liable for any damages
- *  arising from the use of this software.
- *
- *  Permission is granted to anyone to use this software for any purpose,
- *  including commercial applications, and to alter it and redistribute it
- *  freely, subject to the following restrictions:
- *
- *  1. The origin of this software must not be misrepresented; you must not
- *     claim that you wrote the original software. If you use this software
- *     in a product, an acknowledgement in the product documentation would be
- *     appreciated but is not required.
- *  2. Altered source versions must be plainly marked as such, and must not be
- *     misrepresented as being the original software.
- *  3. This notice may not be removed or altered from any source distribution.
- */
-
-/*
- * t1ha = { Fast Positive Hash, aka "Позитивный Хэш" }
- * by [Positive Technologies](https://www.ptsecurity.ru)
- *
- * Briefly, it is a 64-bit Hash Function:
- *  1. Created for 64-bit little-endian platforms, predominantly for x86_64,
- *     but portable and able to run without penalties on any 64-bit CPU.
- *  2. In most cases up to 15% faster than City64, xxHash, mum-hash, metro-hash
- *     and all other portable hash functions (which do not use specific
- *     hardware tricks).
- *  3. Not suitable for cryptography.
- *
- * The Future will be Positive. Everything will be fine.
- *
- * ACKNOWLEDGEMENT:
- * The t1ha was originally developed by Leonid Yuriev (Леонид Юрьев)
- * for The 1Hippeus project - zerocopy messaging in the spirit of Sparta!
- */
-
-#ifndef T1HA2_DISABLED
-#include "t1ha_bits.h"
-//#include "t1ha_selfcheck.h"
-
-static __always_inline void init_ab(t1ha_state256_t *s, uint64_t x,
-                                    uint64_t y) {
-  s->n.a = x;
-  s->n.b = y;
-}
-
-static __always_inline void init_cd(t1ha_state256_t *s, uint64_t x,
-                                    uint64_t y) {
-  s->n.c = rot64(y, 23) + ~x;
-  s->n.d = ~y + rot64(x, 19);
-}
-
-/* TODO: C++ template in the next version */
-#define T1HA2_UPDATE(ENDIANNES, ALIGNESS, state, v)                            \
-  do {                                                                         \
-    t1ha_state256_t *const s = state;                                          \
-    const uint64_t w0 = fetch64_##ENDIANNES##_##ALIGNESS(v + 0);               \
-    const uint64_t w1 = fetch64_##ENDIANNES##_##ALIGNESS(v + 1);               \
-    const uint64_t w2 = fetch64_##ENDIANNES##_##ALIGNESS(v + 2);               \
-    const uint64_t w3 = fetch64_##ENDIANNES##_##ALIGNESS(v + 3);               \
-                                                                               \
-    const uint64_t d02 = w0 + rot64(w2 + s->n.d, 56);                          \
-    const uint64_t c13 = w1 + rot64(w3 + s->n.c, 19);                          \
-    s->n.d ^= s->n.b + rot64(w1, 38);                                          \
-    s->n.c ^= s->n.a + rot64(w0, 57);                                          \
-    s->n.b ^= prime_6 * (c13 + w2);                                            \
-    s->n.a ^= prime_5 * (d02 + w3);                                            \
-  } while (0)
-
-static __always_inline void squash(t1ha_state256_t *s) {
-  s->n.a ^= prime_6 * (s->n.c + rot64(s->n.d, 23));
-  s->n.b ^= prime_5 * (rot64(s->n.c, 19) + s->n.d);
-}
-
-/* TODO: C++ template in the next version */
-#define T1HA2_LOOP(ENDIANNES, ALIGNESS, state, data, len)                      \
-  do {                                                                         \
-    const void *detent = (const uint8_t *)data + len - 31;                     \
-    do {                                                                       \
-      const uint64_t *v = (const uint64_t *)data;                              \
-      data = (const uint64_t *)data + 4;                                       \
-      prefetch(data);                                                          \
-      T1HA2_UPDATE(le, ALIGNESS, state, v);                                    \
-    } while (likely(data < detent));                                           \
-  } while (0)
-
-/* TODO: C++ template in the next version */
-#define T1HA2_TAIL_AB(ENDIANNES, ALIGNESS, state, data, len)                   \
-  do {                                                                         \
-    t1ha_state256_t *const s = state;                                          \
-    const uint64_t *v = (const uint64_t *)data;                                \
-    switch (len) {                                                             \
-    default:                                                                   \
-      mixup64(&s->n.a, &s->n.b, fetch64_##ENDIANNES##_##ALIGNESS(v++),         \
-              prime_4);                                                        \
-    /* fall through */                                                         \
-    case 24:                                                                   \
-    case 23:                                                                   \
-    case 22:                                                                   \
-    case 21:                                                                   \
-    case 20:                                                                   \
-    case 19:                                                                   \
-    case 18:                                                                   \
-    case 17:                                                                   \
-      mixup64(&s->n.b, &s->n.a, fetch64_##ENDIANNES##_##ALIGNESS(v++),         \
-              prime_3);                                                        \
-    /* fall through */                                                         \
-    case 16:                                                                   \
-    case 15:                                                                   \
-    case 14:                                                                   \
-    case 13:                                                                   \
-    case 12:                                                                   \
-    case 11:                                                                   \
-    case 10:                                                                   \
-    case 9:                                                                    \
-      mixup64(&s->n.a, &s->n.b, fetch64_##ENDIANNES##_##ALIGNESS(v++),         \
-              prime_2);                                                        \
-    /* fall through */                                                         \
-    case 8:                                                                    \
-    case 7:                                                                    \
-    case 6:                                                                    \
-    case 5:                                                                    \
-    case 4:                                                                    \
-    case 3:                                                                    \
-    case 2:                                                                    \
-    case 1:                                                                    \
-      mixup64(&s->n.b, &s->n.a, tail64_##ENDIANNES##_##ALIGNESS(v, len),       \
-              prime_1);                                                        \
-    /* fall through */                                                         \
-    case 0:                                                                    \
-      return final64(s->n.a, s->n.b);                                          \
-    }                                                                          \
-  } while (0)
-
-/* TODO: C++ template in the next version */
-#define T1HA2_TAIL_ABCD(ENDIANNES, ALIGNESS, state, data, len)                 \
-  do {                                                                         \
-    t1ha_state256_t *const s = state;                                          \
-    const uint64_t *v = (const uint64_t *)data;                                \
-    switch (len) {                                                             \
-    default:                                                                   \
-      mixup64(&s->n.a, &s->n.d, fetch64_##ENDIANNES##_##ALIGNESS(v++),         \
-              prime_4);                                                        \
-    /* fall through */                                                         \
-    case 24:                                                                   \
-    case 23:                                                                   \
-    case 22:                                                                   \
-    case 21:                                                                   \
-    case 20:                                                                   \
-    case 19:                                                                   \
-    case 18:                                                                   \
-    case 17:                                                                   \
-      mixup64(&s->n.b, &s->n.a, fetch64_##ENDIANNES##_##ALIGNESS(v++),         \
-              prime_3);                                                        \
-    /* fall through */                                                         \
-    case 16:                                                                   \
-    case 15:                                                                   \
-    case 14:                                                                   \
-    case 13:                                                                   \
-    case 12:                                                                   \
-    case 11:                                                                   \
-    case 10:                                                                   \
-    case 9:                                                                    \
-      mixup64(&s->n.c, &s->n.b, fetch64_##ENDIANNES##_##ALIGNESS(v++),         \
-              prime_2);                                                        \
-    /* fall through */                                                         \
-    case 8:                                                                    \
-    case 7:                                                                    \
-    case 6:                                                                    \
-    case 5:                                                                    \
-    case 4:                                                                    \
-    case 3:                                                                    \
-    case 2:                                                                    \
-    case 1:                                                                    \
-      mixup64(&s->n.d, &s->n.c, tail64_##ENDIANNES##_##ALIGNESS(v, len),       \
-              prime_1);                                                        \
-    /* fall through */                                                         \
-    case 0:                                                                    \
-      return final128(s->n.a, s->n.b, s->n.c, s->n.d, extra_result);           \
-    }                                                                          \
-  } while (0)
-
-static __always_inline uint64_t final128(uint64_t a, uint64_t b, uint64_t c,
-                                         uint64_t d, uint64_t *h) {
-  mixup64(&a, &b, rot64(c, 41) ^ d, prime_0);
-  mixup64(&b, &c, rot64(d, 23) ^ a, prime_6);
-  mixup64(&c, &d, rot64(a, 19) ^ b, prime_5);
-  mixup64(&d, &a, rot64(b, 31) ^ c, prime_4);
-  *h = c + d;
-  return a ^ b;
-}
-
-//------------------------------------------------------------------------------
-
-uint64_t t1ha2_atonce(const void *data, size_t length, uint64_t seed) {
-  t1ha_state256_t state;
-  init_ab(&state, seed, length);
-
-#if T1HA_SYS_UNALIGNED_ACCESS == T1HA_UNALIGNED_ACCESS__EFFICIENT
-  if (unlikely(length > 32)) {
-    init_cd(&state, seed, length);
-    T1HA2_LOOP(le, unaligned, &state, data, length);
-    squash(&state);
-    length &= 31;
-  }
-  T1HA2_TAIL_AB(le, unaligned, &state, data, length);
-#else
-  const bool misaligned = (((uintptr_t)data) & (ALIGNMENT_64 - 1)) != 0;
-  if (misaligned) {
-    if (unlikely(length > 32)) {
-      init_cd(&state, seed, length);
-      T1HA2_LOOP(le, unaligned, &state, data, length);
-      squash(&state);
-      length &= 31;
-    }
-    T1HA2_TAIL_AB(le, unaligned, &state, data, length);
-  } else {
-    if (unlikely(length > 32)) {
-      init_cd(&state, seed, length);
-      T1HA2_LOOP(le, aligned, &state, data, length);
-      squash(&state);
-      length &= 31;
-    }
-    T1HA2_TAIL_AB(le, aligned, &state, data, length);
-  }
-#endif
-}
-
-uint64_t t1ha2_atonce128(uint64_t *__restrict extra_result,
-                         const void *__restrict data, size_t length,
-                         uint64_t seed) {
-  t1ha_state256_t state;
-  init_ab(&state, seed, length);
-  init_cd(&state, seed, length);
-
-#if T1HA_SYS_UNALIGNED_ACCESS == T1HA_UNALIGNED_ACCESS__EFFICIENT
-  if (unlikely(length > 32)) {
-    T1HA2_LOOP(le, unaligned, &state, data, length);
-    length &= 31;
-  }
-  T1HA2_TAIL_ABCD(le, unaligned, &state, data, length);
-#else
-  const bool misaligned = (((uintptr_t)data) & (ALIGNMENT_64 - 1)) != 0;
-  if (misaligned) {
-    if (unlikely(length > 32)) {
-      T1HA2_LOOP(le, unaligned, &state, data, length);
-      length &= 31;
-    }
-    T1HA2_TAIL_ABCD(le, unaligned, &state, data, length);
-  } else {
-    if (unlikely(length > 32)) {
-      T1HA2_LOOP(le, aligned, &state, data, length);
-      length &= 31;
-    }
-    T1HA2_TAIL_ABCD(le, aligned, &state, data, length);
-  }
-#endif
-}
-
-//------------------------------------------------------------------------------
-
-void t1ha2_init(t1ha_context_t *ctx, uint64_t seed_x, uint64_t seed_y) {
-  init_ab(&ctx->state, seed_x, seed_y);
-  init_cd(&ctx->state, seed_x, seed_y);
-  ctx->partial = 0;
-  ctx->total = 0;
-}
-
-void t1ha2_update(t1ha_context_t *__restrict ctx, const void *__restrict data,
-                  size_t length) {
-  ctx->total += length;
-
-  if (ctx->partial) {
-    const size_t left = 32 - ctx->partial;
-    const size_t chunk = (length >= left) ? left : length;
-    memcpy(ctx->buffer.bytes + ctx->partial, data, chunk);
-    ctx->partial += chunk;
-    if (ctx->partial < 32) {
-      assert(left >= length);
-      return;
-    }
-    ctx->partial = 0;
-    data = (const uint8_t *)data + chunk;
-    length -= chunk;
-    T1HA2_UPDATE(le, aligned, &ctx->state, ctx->buffer.u64);
-  }
-
-  if (length >= 32) {
-#if T1HA_SYS_UNALIGNED_ACCESS == T1HA_UNALIGNED_ACCESS__EFFICIENT
-    T1HA2_LOOP(le, unaligned, &ctx->state, data, length);
-#else
-    const bool misaligned = (((uintptr_t)data) & (ALIGNMENT_64 - 1)) != 0;
-    if (misaligned) {
-      T1HA2_LOOP(le, unaligned, &ctx->state, data, length);
-    } else {
-      T1HA2_LOOP(le, aligned, &ctx->state, data, length);
-    }
-#endif
-    length &= 31;
-  }
-
-  if (length)
-    memcpy(ctx->buffer.bytes, data, ctx->partial = length);
-}
-
-uint64_t t1ha2_final(t1ha_context_t *__restrict ctx,
-                     uint64_t *__restrict extra_result) {
-  uint64_t bits = (ctx->total << 3) ^ (UINT64_C(1) << 63);
-#if __BYTE_ORDER__ != __ORDER_LITTLE_ENDIAN__
-  bits = bswap64(bits);
-#endif
-  t1ha2_update(ctx, &bits, 8);
-
-  if (likely(!extra_result)) {
-    squash(&ctx->state);
-    T1HA2_TAIL_AB(le, aligned, &ctx->state, ctx->buffer.u64, ctx->partial);
-  }
-
-  T1HA2_TAIL_ABCD(le, aligned, &ctx->state, ctx->buffer.u64, ctx->partial);
-}
-
-#endif /* T1HA2_DISABLED */
--- a/src/t1ha/t1ha_bits.h
+++ b/src/t1ha/t1ha_bits.h
--- a/src/virtualMemory.cpp
+++ b/src/virtualMemory.cpp
@ -0,0 +1,114 @
+/*
+Copyright (c) 2018 tevador
+
+This file is part of RandomX.
+
+RandomX is free software: you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation, either version 3 of the License, or
+(at your option) any later version.
+
+RandomX is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with RandomX.  If not, see <http://www.gnu.org/licenses/>.
+*/
+
+#include "virtualMemory.hpp"
+
+#include <cstdint>
+#include <stdexcept>
+#include <string>
+
+#ifdef _WIN32
+#include <windows.h>
+#else
+#ifdef __APPLE__
+#include <mach/vm_statistics.h>
+#endif
+#include <sys/types.h>
+#include <sys/mman.h>
+#ifndef MAP_ANONYMOUS
+#define MAP_ANONYMOUS MAP_ANON
+#endif
+#endif
+
+#ifdef _WIN32
+std::string getErrorMessage(const char* function) {
+	LPSTR messageBuffer = nullptr;
+	size_t size = FormatMessageA(FORMAT_MESSAGE_ALLOCATE_BUFFER | FORMAT_MESSAGE_FROM_SYSTEM | FORMAT_MESSAGE_IGNORE_INSERTS,
+		NULL, GetLastError(), MAKELANGID(LANG_NEUTRAL, SUBLANG_DEFAULT), (LPSTR)&messageBuffer, 0, NULL);
+	std::string message(messageBuffer, size);
+	LocalFree(messageBuffer);
+	return std::string(function) + std::string(": ") + message;
+}
+
+void setPrivilege(const char* pszPrivilege, BOOL bEnable) {
+	HANDLE           hToken;
+	TOKEN_PRIVILEGES tp;
+	BOOL             status;
+	DWORD            error;
+
+	if (!OpenProcessToken(GetCurrentProcess(), TOKEN_ADJUST_PRIVILEGES | TOKEN_QUERY, &hToken))
+		throw std::runtime_error(getErrorMessage("OpenProcessToken"));
+
+	if (!LookupPrivilegeValue(NULL, pszPrivilege, &tp.Privileges[0].Luid))
+		throw std::runtime_error(getErrorMessage("LookupPrivilegeValue"));
+
+	tp.PrivilegeCount = 1;
+
+	if (bEnable)
+		tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
+	else
+		tp.Privileges[0].Attributes = 0;
+
+	status = AdjustTokenPrivileges(hToken, FALSE, &tp, 0, (PTOKEN_PRIVILEGES)NULL, 0);
+
+	error = GetLastError();
+	if (!status || (error != ERROR_SUCCESS))
+		throw std::runtime_error(getErrorMessage("AdjustTokenPrivileges"));
+
+	if (!CloseHandle(hToken))
+		throw std::runtime_error(getErrorMessage("CloseHandle"));
+}
+#endif
+
+void* allocExecutableMemory(std::size_t bytes) {
+	void* mem;
+#ifdef _WIN32
+	mem = VirtualAlloc(nullptr, bytes, MEM_COMMIT, PAGE_EXECUTE_READWRITE);
+	if (mem == nullptr)
+		throw std::runtime_error(getErrorMessage("allocExecutableMemory - VirtualAlloc"));
+#else
+	mem = mmap(nullptr, bytes, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+	if (mem == MAP_FAILED)
+		throw std::runtime_error("allocExecutableMemory - mmap failed");
+#endif
+	return mem;
+}
+
+constexpr std::size_t align(std::size_t pos, uint32_t align) {
+	return ((pos - 1) / align + 1) * align;
+}
+
+void* allocLargePagesMemory(std::size_t bytes) {
+	void* mem;
+#ifdef _WIN32
+	setPrivilege("SeLockMemoryPrivilege", 1);
+	mem = VirtualAlloc(NULL, align(bytes, 2 * 1024 * 1024), MEM_COMMIT | MEM_RESERVE | MEM_LARGE_PAGES, PAGE_READWRITE);
+	if (mem == nullptr)
+		throw std::runtime_error(getErrorMessage("allocLargePagesMemory - VirtualAlloc"));
+#else
+#ifdef __APPLE__
+	mem = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, VM_FLAGS_SUPERPAGE_SIZE_2MB, 0);
+#else
+	mem = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_POPULATE, -1, 0);
+#endif
+	if (mem == MAP_FAILED)
+		throw std::runtime_error("allocLargePagesMemory - mmap failed");
+#endif
+	return mem;
+}
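The `align` helper in `virtualMemory.cpp` rounds a byte count up to the next multiple of the page size before the Windows `VirtualAlloc` call. The following is a minimal sketch of that rounding, not the commit's code: the names `roundUp`, `kLargePageSize` and `largePagesNeeded` are hypothetical, and the sketch assumes a nonzero size and a 2 MiB large page.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper using the same formula as align() in virtualMemory.cpp.
// Assumes pos > 0 and alignment > 0.
constexpr std::size_t roundUp(std::size_t pos, std::size_t alignment) {
	return ((pos - 1) / alignment + 1) * alignment;
}

// 2 MiB, the large page size used in the Windows branch of the commit.
constexpr std::size_t kLargePageSize = 2 * 1024 * 1024;

// Compile-time checks of the three interesting cases.
static_assert(roundUp(1, kLargePageSize) == kLargePageSize,
	"a single byte still needs one full page");
static_assert(roundUp(kLargePageSize, kLargePageSize) == kLargePageSize,
	"exact multiples are left unchanged");
static_assert(roundUp(kLargePageSize + 1, kLargePageSize) == 2 * kLargePageSize,
	"one byte over a page boundary needs two pages");

// Hypothetical convenience: how many large pages a request of `bytes` occupies.
std::size_t largePagesNeeded(std::size_t bytes) {
	assert(bytes > 0);
	return roundUp(bytes, kLargePageSize) / kLargePageSize;
}
```

Under these assumptions, `largePagesNeeded(4u * 1024 * 1024 + 1)` evaluates to 3, since the request spills one byte into a third page.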
--- a/src/virtualMemory.hpp
+++ b/src/virtualMemory.hpp
@ -0,0 +1,25 @@
+/*
+Copyright (c) 2018 tevador
+
+This file is part of RandomX.
+
+RandomX is free software: you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation, either version 3 of the License, or
+(at your option) any later version.
+
+RandomX is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with RandomX.  If not, see <http://www.gnu.org/licenses/>.
+*/
+
+#pragma once
+
+#include <cstddef>
+
+void* allocExecutableMemory(std::size_t);
+void* allocLargePagesMemory(std::size_t);
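`allocLargePagesMemory` throws `std::runtime_error` when the large-page mapping fails, for example when no huge pages are reserved on Linux. A caller that prefers to keep running on regular pages could use a fallback along the following lines. This is a POSIX-only sketch under stated assumptions, not part of the commit: `Mapping`, `mapWithFallback`, `unmap` and `kHugePageSize` are hypothetical names, and a 2 MiB huge page size is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <sys/mman.h>

#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS MAP_ANON
#endif

// Hypothetical record of a mapping; `length` is the value to pass to munmap.
struct Mapping {
	void* ptr;
	std::size_t length;
	bool largePages;
};

// Assumed huge page size (2 MiB); the actual default may differ per system.
constexpr std::size_t kHugePageSize = 2 * 1024 * 1024;

// Try a huge-page mapping first; fall back to regular pages if that fails.
Mapping mapWithFallback(std::size_t bytes) {
	if (bytes == 0)
		throw std::invalid_argument("mapWithFallback - zero size");
#ifdef MAP_HUGETLB
	{
		// Round up so that the later munmap length is a multiple of the huge page size.
		const std::size_t hugeLength = ((bytes - 1) / kHugePageSize + 1) * kHugePageSize;
		void* mem = mmap(nullptr, hugeLength, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
		if (mem != MAP_FAILED)
			return Mapping{ mem, hugeLength, true };
	}
#endif
	void* mem = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
		MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED)
		throw std::runtime_error("mapWithFallback - mmap failed");
	return Mapping{ mem, bytes, false };
}

// Release a mapping obtained from mapWithFallback.
void unmap(const Mapping& m) {
	const int rc = munmap(m.ptr, m.length);
	assert(rc == 0);
	(void)rc;
}
```

The `largePages` flag lets the caller log which path was taken. Whether a silent fallback is acceptable depends on the application, since running without large pages may cost performance.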