X86 Assembly/AVX, AVX2, FMA3, FMA4
Prerequisites: X86 Assembly/SSE.
Example FMA4 program
The following program uses the FMA4 instruction vfmaddps to perform eight single-precision floating-point multiply-add operations in one instruction.
.data
# 2^-1 2^-2 2^-3 2^-4 2^-5 2^-6 2^-7 2^-8
v1: .float 0.50, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625
v2: .float 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0
v3: .float 512.0, 1024.0, 2048.0, 8192.0, 16384.0, 32768.0, 65536.0, 131072.0
v4: .float 0,0,0,0,0,0,0,0
.text
.globl _start
_start:
vmovups v1,%ymm0
vmovups v2,%ymm1
vmovups v3,%ymm2
# %ymm3 = %ymm2 * %ymm1 + %ymm0 (AT&T operand order: addend, multiplicand 1, multiplicand 2, destination)
vfmaddps %ymm0, %ymm1, %ymm2, %ymm3
vmovups %ymm3, v4
If you set a debugger breakpoint after the last line, you can use GDB to inspect the result. Look at the program and try to spot any problems.
Spoiler alert. Dumping the result "vector" in binary, we can see that precision has been lost.
(gdb) x/8t &v4
0x80490fc <v4>: 01000100100000000001000000000000 01000101100000000000001000000000 01000110100000000000000001000000 01001000000000000000000000000100
0x804910c <v4+16>: 01001001000000000000000000000000 01001010000000000000000000000000 01001011000000000000000000000000 01001100000000000000000000000000
(gdb)
Comparing the values at v4+12 and v4+16 shows that the addend became too small relative to the product and was lost. The addend was only halved between these two lanes (from 2^-4 to 2^-5), so why is it gone? Because the product grew at the same time: it went from 2^17 to 2^19, raising the exponent of the result by two. A single-precision float stores 23 fraction bits, so with an exponent of 19 the least significant bit has a weight of 2^-4. The addend 2^-5 is only half of that step, so a 32-bit single-precision float cannot represent it and the result is rounded back to exactly 2^19. The data loss is also visible when dumping the floats in base 10, but be careful: the base-10 output is itself rounded and isn't always faithful.
(gdb) x/8f &v4
0x80490fc <v4>: 1024.5 4096.25 16384.125 131072.062
0x804910c <v4+16>: 524288 2097152 8388608 33554432
(gdb)
Resources
- Virtual machine (pre-built for Ubuntu) and VM snapshot for testing new instructions on legacy CPUs
- IEEE 754 single-precision interactive applet
- Introduction to Intel AVX
- Binutils test suite (for AVX and FMA examples in AT&T and Intel syntax):
- http://sourceware.org/git/?p=binutils.git;a=blob_plain;f=gas/testsuite/gas/i386/fma.s;hb=HEAD
- http://sourceware.org/git/?p=binutils.git;a=blob_plain;f=gas/testsuite/gas/i386/fma4.s;hb=HEAD
- http://sourceware.org/git/?p=binutils.git;a=blob_plain;f=gas/testsuite/gas/i386/avx.s;hb=HEAD
- http://sourceware.org/git/?p=binutils.git;a=blob_plain;f=gas/testsuite/gas/i386/avx-gather.s;hb=HEAD
- http://sourceware.org/git/?p=binutils.git;a=blob_plain;f=gas/testsuite/gas/i386/avx2.s;hb=HEAD
- http://sourceware.org/git/?p=binutils.git;a=blob_plain;f=gas/testsuite/gas/i386/avx256int.s;hb=HEAD