x86 Assembly/SSE
SSE stands for Streaming SIMD Extensions. It is essentially the floating-point equivalent of the MMX instructions. The SSE registers are 128 bits, and can be used to perform operations on a variety of data sizes and types. Unlike MMX, the SSE registers do not overlap with the floating point stack.
Registers
SSE, introduced by Intel in 1999 with the Pentium III, creates eight new 128-bit registers:
XMM0 XMM1 XMM2 XMM3 XMM4 XMM5 XMM6 XMM7
Originally, an SSE register could only be used as four 32-bit single-precision floating-point numbers (the equivalent of a float in C). SSE2 expanded the capabilities of the XMM registers, so they can now be used as:
- 2 64-bit floating points (double precision)
- 2 64-bit integers
- 4 32-bit floating points (single-precision)
- 4 32-bit integers
- 8 16-bit integers
- 16 8-bit characters (bytes)
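The interpretations listed above are all views of the same 16 bytes. As a rough illustration only (a Python sketch using the standard `struct` module, not SSE code; the helper name `xmm_views` is made up for this example), the four floats 1.1, 2.2, 3.3 and 4.4 can be reinterpreted like this:

```python
import struct

def xmm_views(raw: bytes) -> dict:
    """Reinterpret 16 little-endian bytes as each of the packed layouts."""
    if len(raw) != 16:
        raise ValueError("an XMM register holds exactly 16 bytes")
    return {
        "v4_float": struct.unpack("<4f", raw),    # 4 x 32-bit floats
        "v2_double": struct.unpack("<2d", raw),   # 2 x 64-bit floats
        "v16_int8": struct.unpack("<16b", raw),   # 16 x signed bytes
        "v8_int16": struct.unpack("<8h", raw),    # 8 x 16-bit integers
        "v4_int32": struct.unpack("<4i", raw),    # 4 x 32-bit integers
        "v2_int64": struct.unpack("<2q", raw),    # 2 x 64-bit integers
    }

# Pack four single-precision floats, lowest element first (x86 is little-endian).
raw = struct.pack("<4f", 1.1, 2.2, 3.3, 4.4)
views = xmm_views(raw)
for name, value in views.items():
    print(name, value)
```

The bytes never change; only the way they are grouped and decoded does, which is why the same register contents look meaningless when read with the wrong element type.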
Data movement examples
The following program (using NASM syntax) performs data movements using SIMD instructions.
;
; nasm -felf32 -g sseMove.asm
; ld -g sseMove.o
;
global _start
section .data
align 16
v1: dd 1.1, 2.2, 3.3, 4.4 ; Four Single precision floats 32 bits each
v1dp: dq 1.1, 2.2 ; Two Double precision floats 64 bits each
v2: dd 5.5, 6.6, 7.7, 8.8
v2s1: dd 5.5, 6.6, 7.7, -8.8
v2s2: dd 5.5, 6.6, -7.7, -8.8
v2s3: dd 5.5, -6.6, -7.7, -8.8
v2s4: dd -5.5, -6.6, -7.7, -8.8
num1: dd 1.2
v3: dd 1.2, 2.3, 4.5, 6.7 ; No longer 16 byte aligned
v3dp: dq 1.2, 2.3 ; No longer 16 byte aligned
section .bss
mask1: resd 1
mask2: resd 1
mask3: resd 1
mask4: resd 1
section .text
_start:
;
; op dst, src
;
;
; SSE
;
; Using movaps since vectors are 16 byte aligned
movaps xmm0, [v1] ; Move four 32-bit(single precision) floats to xmm0
movaps xmm1, [v2]
movups xmm2, [v3] ; Need to use movups since v3 is not 16 byte aligned
;movaps xmm3, [v3] ; This would seg fault if uncommented
movss xmm3, [num1] ; Move 32-bit float num1 to the least significant element of xmm3
movss xmm3, [v3] ; Move first 32-bit float of v3 to the least significant element of xmm3
movlps xmm4, [v3] ; Move 64-bits(two single precision floats) from memory to the lower 64-bit elements of xmm4
movhps xmm4, [v2] ; Move 64-bits(two single precision floats) from memory to the higher 64-bit elements of xmm4
; Source and destination for movhlps and movlhps must be xmm registers
movhlps xmm5, xmm4 ; Transfers the higher 64-bits of the source xmm4 to the lower 64-bits of the destination xmm5
movlhps xmm5, xmm4 ; Transfers the lower 64-bits of the source xmm4 to the higher 64-bits of the destination xmm5
movaps xmm6, [v2s1]
movmskps eax, xmm6 ; Extract the sign bits from four 32-bits floats in xmm6 and create 4 bit mask in eax
mov [mask1], eax ; Should be 8
movaps xmm6, [v2s2]
movmskps eax, xmm6 ; Extract the sign bits from four 32-bits floats in xmm6 and create 4 bit mask in eax
mov [mask2], eax ; Should be 12
movaps xmm6, [v2s3]
movmskps eax, xmm6 ; Extract the sign bits from four 32-bits floats in xmm6 and create 4 bit mask in eax
mov [mask3], eax ; Should be 14
movaps xmm6, [v2s4]
movmskps eax, xmm6 ; Extract the sign bits from four 32-bits floats in xmm6 and create 4 bit mask in eax
mov [mask4], eax ; Should be 15
;
; SSE2
;
movapd xmm6, [v1dp] ; Move two 64-bit(double precision) floats to xmm6, using movapd since vector is 16 byte aligned
; The next two instructions together should have a result equivalent to movapd xmm6, [v1dp]
movhpd xmm6, [v1dp+8] ; Move a 64-bit(double precision) float into the higher 64-bit elements of xmm6
movlpd xmm6, [v1dp] ; Move a 64-bit(double precision) float into the lower 64-bit elements of xmm6
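movupd xmm6, [v3dp] ; Move two 64-bit floats to xmm6, using movupd since vector is not 16 byte aligned
;
; Exit cleanly: _start has no caller to return to
;
mov eax, 1 ; System call number for exit (32-bit Linux)
xor ebx, ebx ; Exit status 0
int 0x80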
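To make the effect of the half-register moves and of movmskps easier to follow, here is a small Python model of those instructions. It is only a sketch of the data flow (registers are plain lists with element 0 as the least significant element, and the function names simply mirror the mnemonics); the assumption that xmm4 and xmm5 start out as zero is made for the example and is not guaranteed by the program above.

```python
import struct

def f32_bits(x: float) -> int:
    """Return the 32-bit IEEE 754 pattern of x rounded to single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def movmskps(xmm):
    """Bit i of the result is the sign bit of element i."""
    mask = 0
    for i, value in enumerate(xmm):
        mask |= (f32_bits(value) >> 31) << i
    return mask

def movlps(dst, mem):
    """Load two floats from memory into the low half; the high half is kept."""
    return list(mem[:2]) + list(dst[2:])

def movhps(dst, mem):
    """Load two floats from memory into the high half; the low half is kept."""
    return list(dst[:2]) + list(mem[:2])

def movhlps(dst, src):
    """Copy the high half of src to the low half of dst."""
    return list(src[2:]) + list(dst[2:])

def movlhps(dst, src):
    """Copy the low half of src to the high half of dst."""
    return list(dst[:2]) + list(src[:2])

v2 = [5.5, 6.6, 7.7, 8.8]
v3 = [1.2, 2.3, 4.5, 6.7]

xmm4 = [0.0] * 4                 # assumed initial value
xmm4 = movlps(xmm4, v3)          # movlps xmm4, [v3]
xmm4 = movhps(xmm4, v2)          # movhps xmm4, [v2]

xmm5 = [0.0] * 4                 # assumed initial value
xmm5 = movhlps(xmm5, xmm4)       # movhlps xmm5, xmm4
xmm5 = movlhps(xmm5, xmm4)       # movlhps xmm5, xmm4

print(xmm4, xmm5)
masks = [movmskps(v) for v in ([5.5, 6.6, 7.7, -8.8],
                               [5.5, 6.6, -7.7, -8.8],
                               [5.5, -6.6, -7.7, -8.8],
                               [-5.5, -6.6, -7.7, -8.8])]
print(masks)
```

After the movhlps/movlhps pair, xmm5 holds the two halves of xmm4 swapped, and the four masks match the values noted in the comments of the assembly listing.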
Arithmetic example using packed singles
The following program (using NASM syntax) performs a few SIMD operations on some numbers.
global _start
section .data
v1: dd 1.1, 2.2, 3.3, 4.4 ;first set of 4 numbers
v2: dd 5.5, 6.6, 7.7, 8.8 ;second set
section .bss
v3: resd 4 ;result
section .text
_start:
movups xmm0, [v1] ;load v1 into xmm0
movups xmm1, [v2] ;load v2 into xmm1
addps xmm0, xmm1 ;add the 4 numbers in xmm1 (from v2) to the 4 numbers in xmm0 (from v1), store in xmm0. for the first float the result will be 5.5+1.1=6.6
mulps xmm0, xmm1 ;multiply the four numbers in xmm1 (from v2, unchanged) with the results from the previous calculation (in xmm0), store in xmm0. for the first float the result will be 5.5*6.6=36.3
subps xmm0, xmm1 ;subtract the four numbers in v2 (in xmm1, still unchanged) from the result of the previous calculation (in xmm0). for the first float, the result will be 36.3-5.5=30.8
movups [v3], xmm0 ;store the result (xmm0) in v3
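;end program: exit through the Linux int 0x80 interface, since _start has no caller to return to
mov eax, 1 ;system call number 1 is exit (32-bit Linux)
xor ebx, ebx ;exit status 0
int 0x80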
The result values should be:
30.800 51.480 77.000 107.360
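These values can be cross-checked without running the assembly. The following Python sketch models the three packed operations element by element, rounding every intermediate result to single precision with the standard `struct` module. The helper names `to_f32` and `packed_op` are invented for this illustration.

```python
import operator
import struct

def to_f32(x: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def packed_op(op, a, b):
    """Apply op to each pair of elements, rounding each result to single precision."""
    return [to_f32(op(x, y)) for x, y in zip(a, b)]

v1 = [to_f32(x) for x in (1.1, 2.2, 3.3, 4.4)]
v2 = [to_f32(x) for x in (5.5, 6.6, 7.7, 8.8)]

xmm0, xmm1 = v1, v2
xmm0 = packed_op(operator.add, xmm0, xmm1)   # addps xmm0, xmm1
xmm0 = packed_op(operator.mul, xmm0, xmm1)   # mulps xmm0, xmm1
xmm0 = packed_op(operator.sub, xmm0, xmm1)   # subps xmm0, xmm1

print(["%.3f" % x for x in xmm0])
```

Because single precision cannot represent values such as 1.1 exactly, the results agree with the listed values only to within rounding error.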
Using the GNU toolchain, you can debug and single-step like this:
% nasm -felf32 -g ssedemo.asm
% ld -g ssedemo.o
% gdb -q ./a.out
Reading symbols from a.out...done.
(gdb) break _start
Breakpoint 1 at 0x8048080
(gdb) r
Starting program: a.out
Breakpoint 1, 0x08048080 in _start ()
(gdb) disass
Dump of assembler code for function _start:
=> 0x08048080 <+0>: movups 0x80490a0,%xmm0
0x08048087 <+7>: movups 0x80490b0,%xmm1
0x0804808e <+14>: addps %xmm1,%xmm0
0x08048091 <+17>: mulps %xmm1,%xmm0
0x08048094 <+20>: subps %xmm1,%xmm0
0x08048097 <+23>: movups %xmm0,0x80490c0
End of assembler dump.
(gdb) stepi
0x08048087 in _start ()
(gdb)
0x0804808e in _start ()
(gdb) p $xmm0
$1 = {v4_float = {1.10000002, 2.20000005, 3.29999995, 4.4000001}, v2_double = {3.6000008549541236, 921.60022034645078}, v16_int8 = {-51, -52, -116, 63,
-51, -52, 12, 64, 51, 51, 83, 64, -51, -52, -116, 64}, v8_int16 = {-13107, 16268, -13107, 16396, 13107, 16467, -13107, 16524}, v4_int32 = {1066192077,
1074580685, 1079194419, 1082969293}, v2_int64 = {4615288900054469837, 4651317697086436147}, uint128 = 0x408ccccd40533333400ccccd3f8ccccd}
(gdb) x/4f &v1
0x80490a0 <v1>: 1.10000002 2.20000005 3.29999995 4.4000001
(gdb) stepi
0x08048091 in _start ()
(gdb) p $xmm0
$2 = {v4_float = {6.5999999, 8.80000019, 11, 13.2000008}, v2_double = {235929.65665283203, 5033169.0185546875}, v16_int8 = {51, 51, -45, 64, -51, -52, 12,
65, 0, 0, 48, 65, 52, 51, 83, 65}, v8_int16 = {13107, 16595, -13107, 16652, 0, 16688, 13108, 16723}, v4_int32 = {1087583027, 1091357901, 1093664768,
1095971636}, v2_int64 = {4687346494113788723, 4707162335057281024}, uint128 = 0x4153333441300000410ccccd40d33333}
(gdb)
Debugger commands explained
- break: In this case, sets a breakpoint at a given label.
- stepi: Steps one instruction forward in the program.
- p: Short for print; prints a given register or variable. Registers are prefixed by $ in GDB.
- x: Short for examine; examines a given memory address. The "/4f" means "4 floats" (floats in GDB are 32 bits). You can use c for chars or x for hexadecimal instead of f, and any other count instead of 4. The "&" takes the address of v1, as in C.
Shuffling example using shufps
shufps IMM8, arg2, arg1 | GAS Syntax |
shufps arg1, arg2, IMM8 | Intel Syntax |
shufps can be used to shuffle packed single-precision floats. The instruction takes three parameters: arg1, an xmm register that is both a source and the destination; arg2, an xmm register or a 128-bit memory location; and IMM8, an 8-bit immediate control byte. shufps selects two elements each from arg1 and arg2 and writes the four selected elements to arg1. The lower two elements of the result come from arg1 and the higher two elements from arg2. All selections read the original contents of arg1, so the result is well defined even when arg1 and arg2 are the same register.
IMM8 control byte description
The IMM8 control byte is split into four 2-bit fields that control the output into arg1 as follows:
Field | Selects an element of | Written to |
---|---|---|
IMM8[1:0] | arg1 | the least significant element of arg1, bits 31-0 |
IMM8[3:2] | arg1 | the second element of arg1, bits 63-32 |
IMM8[5:4] | arg2 | the third element of arg1, bits 95-64 |
IMM8[7:6] | arg2 | the most significant element of arg1, bits 127-96 |
Every field is interpreted the same way:
Field value | Element selected |
---|---|
00b | the least significant element, bits 31-0 |
01b | the second element, bits 63-32 |
10b | the third element, bits 95-64 |
11b | the most significant element, bits 127-96 |
IMM8 Example
Consider the byte 0x1B:
Bit number (0 being LSB) | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
---|---|---|---|---|---|---|---|---|
Bit value | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 |
2-bit field | IMM8[7:6] | | IMM8[5:4] | | IMM8[3:2] | | IMM8[1:0] | |
2-bit integer (decimal) value | 0 | | 1 | | 2 | | 3 | |
Nibble value | 0x1 | | | | 0xB | | | |
The 2-bit values shown above are used to determine which elements are copied to arg1. Bits 7-4 are "indexes" into arg2, and bits 3-0 are "indexes" into arg1.
- Since bits 7-6 are 0, the least significant element of arg2 is copied to the most significant element of arg1, bits 127-96.
- Since bits 5-4 are 1, the second element of arg2 is copied to the third element of arg1, bits 95-64.
- Since bits 3-2 are 2, the third element of arg1 is copied to the second element of arg1, bits 63-32.
- Since bits 1-0 are 3, the fourth element of arg1 is copied to the least significant element of arg1, bits 31-0.
Note that since the first and second arguments are equal in the following example, the mask 0x1B will effectively reverse the order of the floats in the XMM register, since the 2-bit integers are 0, 1, 2, 3. Had it been 3, 2, 1, 0 (0xE4) it would be a no-op. Had it been 0, 0, 0, 0 (0x00) it would be a broadcast of the least significant 32 bits.
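The selection rule can be tried out without an assembler. The following Python sketch models the instruction on plain lists, with element 0 as the least significant element. To avoid any confusion over operand order, it calls the register that receives the result `dst` and the other operand `src`; both names, and the helper `make_imm8`, are made up for this illustration.

```python
def make_imm8(sel3, sel2, sel1, sel0):
    """Build the control byte from four 2-bit selectors, most significant field first."""
    for sel in (sel3, sel2, sel1, sel0):
        if not 0 <= sel <= 3:
            raise ValueError("each selector must be in the range 0..3")
    return (sel3 << 6) | (sel2 << 4) | (sel1 << 2) | sel0

def shufps(dst, src, imm8):
    """Model of SHUFPS: the two low result elements are picked from dst and
    the two high result elements from src. All inputs are read before the
    result is written, so dst and src may be the same list."""
    return [
        dst[imm8 & 3],          # IMM8[1:0]
        dst[(imm8 >> 2) & 3],   # IMM8[3:2]
        src[(imm8 >> 4) & 3],   # IMM8[5:4]
        src[(imm8 >> 6) & 3],   # IMM8[7:6]
    ]

v = [1.1, 2.2, 3.3, 4.4]
print(hex(make_imm8(0, 1, 2, 3)))   # the control byte discussed above
print(shufps(v, v, 0x1B))           # reverse
print(shufps(v, v, 0xE4))           # no-op
print(shufps(v, v, 0x00))           # broadcast element 0
```

With two different operands the split becomes visible: `shufps([1, 2, 3, 4], [5, 6, 7, 8], 0x1B)` takes its two low elements from the first list and its two high elements from the second.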
Example
.data
.align 16
v1: .float 1.1, 2.2, 3.3, 4.4
v2: .float 5.5, 6.6, 7.7, 8.8
v3: .float 0, 0, 0, 0
.text
.global _start
_start:
movaps v1,%xmm0 # load v1 into xmm0 to xmm6
movaps v1,%xmm1 # using movaps since v1 is 16 byte aligned
movaps v1,%xmm2
movaps v1,%xmm3
movaps v1,%xmm4
movaps v1,%xmm5
movaps v1,%xmm6
shufps $0x1b, %xmm0, %xmm0 # reverse order of the 4 floats
shufps $0x00, %xmm1, %xmm1 # Broadcast least significant element to all elements
shufps $0x55, %xmm2, %xmm2 # Broadcast second element to all elements
shufps $0xAA, %xmm3, %xmm3 # Broadcast third element to all elements
shufps $0xFF, %xmm4, %xmm4 # Broadcast most significant element to all elements
shufps $0x39, %xmm5, %xmm5 # Rotate elements right
shufps $0x93, %xmm6, %xmm6 # Rotate elements left
movups %xmm0,v3 # store the reversed vector (xmm0) in v3
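movl $1, %eax # exit through the 32-bit Linux int 0x80 interface, since _start has no caller to return to
xorl %ebx, %ebx # exit status 0
int $0x80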
Using GAS to build an ELF executable
as -g shufps.S -o shufps.o
ld -g shufps.o
Text Processing Instructions
SSE 4.2 adds four string and text processing instructions: PCMPISTRI, PCMPISTRM, PCMPESTRI and PCMPESTRM. These instructions take three parameters: arg1, an xmm register; arg2, an xmm register or a 128-bit memory location; and IMM8, an 8-bit immediate control byte. The instructions perform arithmetic comparisons between the packed contents of arg1 and arg2. IMM8 specifies the format of the input/output as well as the operation of two intermediate stages of processing. The results of stage 1 and stage 2 of intermediate processing will be referred to as IntRes1 and IntRes2 respectively. These instructions also provide additional information about the result through overloaded use of the arithmetic flags (AF, CF, OF, PF, SF and ZF).
The instructions proceed in multiple steps:
- arg1 and arg2 are compared
- An aggregation operation is applied to the result of the comparison, with the result flowing into IntRes1
- An optional negation is performed, with the result flowing into IntRes2
- An output in the form of an index (in ECX) or a mask (in XMM0) is produced
IMM8 control byte description
The IMM8 control byte is split into four groups of bit fields that control the following settings:
IMM8[1:0] specifies the format of the 128-bit source data (arg1 and arg2):
IMM8[1:0] | Description |
---|---|
00b | unsigned bytes (16 packed unsigned bytes) |
01b | unsigned words (8 packed unsigned words) |
10b | signed bytes (16 packed signed bytes) |
11b | signed words (8 packed signed words) |
IMM8[3:2] specifies the aggregation operation whose result will be placed in intermediate result 1, which we will refer to as IntRes1. The size of IntRes1 depends on the format of the source data: 16 bits for packed bytes and 8 bits for packed words. In the examples below, IntRes1 is written with bit 0 (the first character of arg2) on the left:
- 00b, Equal Any: arg1 is a character set, arg2 is the string to search in. IntRes1[i] is set to 1 if arg2[i] is in the set represented by arg1:
arg1 = "aeiou"
arg2 = "Example string 1"
IntRes1 = 0010001000010000
- 01b, Ranges: arg1 is a set of character ranges, i.e. "09az" means all characters from 0 to 9 and from a to z; arg2 is the string to search over. IntRes1[i] is set to 1 if arg2[i] is in any of the ranges represented by arg1:
arg1 = "09az"
arg2 = "Testing 1 2 3, T"
IntRes1 = 0111111010101000
- 10b, Equal Each: arg1 is string one and arg2 is string two. IntRes1[i] is set to 1 if arg1[i] == arg2[i]:
arg1 = "The quick brown "
arg2 = "The quack green "
IntRes1 = 1111110111010011
- 11b, Equal Ordered: arg1 is a substring to search for, arg2 is the string to search within. IntRes1[i] is set to 1 if the substring arg1 can be found at position arg2[i]:
arg1 = "he"
arg2 = ", he helped her "
IntRes1 = 0010010000001000
IMM8[5:4] specifies the polarity, or the processing of IntRes1 into intermediate result 2, which will be referred to as IntRes2:
IMM8[5:4] | Description |
---|---|
00b | Positive Polarity: IntRes2 = IntRes1 |
01b | Negative Polarity: IntRes2 = -1 XOR IntRes1 |
10b | Masked Positive: IntRes2 = IntRes1 |
11b | Masked Negative: IntRes2[i] = IntRes1[i] if reg/mem[i] (element i of arg2) is invalid, else ~IntRes1[i] |
IMM8[6] specifies the output selection, or how IntRes2 will be processed into the output. For PCMPESTRI and PCMPISTRI, the output is an index into the data currently referenced by arg2:
IMM8[6] | Description |
---|---|
0b | Least Significant Index: ECX contains the index of the least significant set bit in IntRes2 |
1b | Most Significant Index: ECX contains the index of the most significant set bit in IntRes2 |
For PCMPESTRM and PCMPISTRM, the output is a mask reflecting all the set bits in IntRes2:
IMM8[6] | Description |
---|---|
0b | Bit Mask: the least significant bits of XMM0 contain the IntRes2 16(8) bit mask. XMM0 is zero-extended to 128 bits. |
1b | Byte/Word Mask: XMM0 contains IntRes2 expanded into a byte/word mask |
IMM8[7] should be set to zero since it has no defined meaning.
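The aggregation, polarity and index stages can be modelled in a few lines of Python. This is only a sketch for the unsigned-byte format with implicit (null-terminated) lengths; the function names are invented for the example, and the treatment of invalid elements (positions at or past the terminating null) follows the override rules in Intel's instruction reference rather than anything spelled out in this section. Bit lists are written with element 0 first, matching the IntRes1 examples in the text.

```python
def pad16(text: str) -> bytes:
    """Encode text and pad it with nulls to form a 16-byte operand."""
    data = text.encode("ascii")
    if len(data) > 16:
        raise ValueError("operand holds at most 16 bytes")
    return data.ljust(16, b"\0")

def implicit_length(operand: bytes) -> int:
    """Number of valid bytes: everything before the first null, at most 16."""
    return operand.index(0) if 0 in operand else 16

def aggregate(arg1: bytes, arg2: bytes, mode: int):
    """Return IntRes1 as a list of 16 bits for the aggregation in IMM8[3:2]."""
    n1, n2 = implicit_length(arg1), implicit_length(arg2)
    res = []
    for j in range(16):
        if mode == 0b00:      # Equal Any
            bit = j < n2 and any(arg2[j] == arg1[i] for i in range(n1))
        elif mode == 0b01:    # Ranges: arg1 holds (low, high) pairs
            bit = j < n2 and any(arg1[i] <= arg2[j] <= arg1[i + 1]
                                 for i in range(0, n1 - 1, 2))
        elif mode == 0b10:    # Equal Each
            if j < n1 and j < n2:
                bit = arg1[j] == arg2[j]
            else:
                bit = j >= n1 and j >= n2   # both past the end counts as equal
        else:                 # Equal Ordered
            bit = all(j + i < n2 and arg1[i] == arg2[j + i]
                      for i in range(min(n1, 16 - j)))
        res.append(int(bit))
    return res

def apply_polarity(int_res1, arg2: bytes, polarity: int):
    """Return IntRes2 for the polarity in IMM8[5:4]."""
    n2 = implicit_length(arg2)
    if polarity == 0b01:      # Negative: flip every bit
        return [b ^ 1 for b in int_res1]
    if polarity == 0b11:      # Masked Negative: flip only valid positions of arg2
        return [b ^ 1 if i < n2 else b for i, b in enumerate(int_res1)]
    return list(int_res1)     # Positive and Masked Positive

def index_output(int_res2, most_significant: bool) -> int:
    """Return the ECX value for IMM8[6]; 16 when no bit is set."""
    set_bits = [i for i, b in enumerate(int_res2) if b]
    if not set_bits:
        return 16
    return set_bits[-1] if most_significant else set_bits[0]

def bits(int_res) -> str:
    return "".join(str(b) for b in int_res)

examples = [
    ("Equal Any", 0b00, "aeiou", "Example string 1"),
    ("Ranges", 0b01, "09az", "Testing 1 2 3, T"),
    ("Equal Each", 0b10, "The quick brown ", "The quack green "),
    ("Equal Ordered", 0b11, "he", ", he helped her "),
]
for name, mode, a1, a2 in examples:
    print(name, bits(aggregate(pad16(a1), pad16(a2), mode)))
```

Combining the stages, Equal Each with negative polarity and the least significant index gives the position of the first mismatch between two strings, which is the building block of a string comparison.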
The Four Instructions
pcmpistri IMM8, arg2, arg1 | GAS Syntax |
pcmpistri arg1, arg2, IMM8 | Intel Syntax |
PCMPISTRI, Packed Compare Implicit Length Strings, Return Index. Compares strings of implicit length and generates an index in ECX.
Operands
- arg1: XMM register
- arg2: XMM register or memory
- IMM8: 8-bit immediate value
Modified flags
- CF is reset if IntRes2 is zero, set otherwise
- ZF is set if a null terminating character is found in arg2, reset otherwise
- SF is set if a null terminating character is found in arg1, reset otherwise
- OF is set to IntRes2[0]
- AF is reset
- PF is reset
Example
;
; nasm -felf32 -g sse4_2StrPcmpistri.asm -l sse4_2StrPcmpistri.lst
; gcc -o sse4_2StrPcmpistri sse4_2StrPcmpistri.o
;
global main
extern printf
extern strlen
extern strcmp
section .data
align 4
;
; Fill buf1 with a repeating pattern of ABCD
;
buf1: times 10 dd 0x44434241
s1: db "This is a string", 0
s2: db "This is a string slightly different string", 0
s3: db "This is a str", 0
fmtStr1: db "String: %s len: %d", 0x0A, 0
fmtStr1b: db "strlen(3): String: %s len: %d", 0x0A, 0
fmtStr2: db "s1: =%s= and s2: =%s= compare: %d", 0x0A, 0
fmtStr2b: db "strcmp(3): s1: =%s= and s2: =%s= compare: %d", 0x0A, 0
;
; Functions will follow the cdecl call convention
;
section .text
main: ; Using main since we are using gcc to link
and esp, -16 ; 16 byte align the stack
sub esp, 16 ; space for four 4 byte parameters
;
; Null terminate buf1, make it proper C string, length is now 39
;
mov [buf1+39], byte 0x00
lea eax, [buf1]
mov [esp], eax ; Arg1: pointer of string to calculate the length of
mov ebx, eax ; Save pointer in ebx since we will use it again
call strlenSSE42
mov edx, eax ; Copy length of arg1 into edx
mov [esp+8], edx ; Arg3: length of string
mov [esp+4], ebx ; Arg2: pointer to string
lea eax, [fmtStr1]
mov [esp], eax ; Arg1: pointer to format string
call printf ; Call printf(3):
; int printf(const char *format, ...);
lea eax, [buf1]
mov [esp], eax ; Arg1: pointer of string to calculate the length of
mov ebx, eax ; Save pointer in ebx since we will use it again
call strlen ; Call strlen(3):
; size_t strlen(const char *s);
mov edx, eax ; Copy length of arg1 into edx
mov [esp+8], edx ; Arg3: length of string
mov [esp+4], ebx ; Arg2: pointer to string
lea eax, [fmtStr1b]
mov [esp], eax ; Arg1: pointer to format string
call printf ; Call printf(3):
; int printf(const char *format, ...);
lea eax, [s2]
mov [esp+4], eax ; Arg2: pointer to second string to compare
lea eax, [s1]
mov [esp], eax ; Arg1: pointer to first string to compare
call strcmpSSE42
mov [esp+12], eax ; Arg4: result from strcmpSSE42
lea eax, [s2]
mov [esp+8], eax ; Arg3: pointer to second string
lea eax, [s1]
mov [esp+4], eax ; Arg2: pointer to first string
lea eax, [fmtStr2]
mov [esp], eax ; Arg1: pointer to format string
call printf
lea eax, [s2]
mov [esp+4], eax ; Arg2: pointer to second string to compare
lea eax, [s1]
mov [esp], eax ; Arg1: pointer to first string to compare
call strcmp ; Call strcmp(3):
; int strcmp(const char *s1, const char *s2);
mov [esp+12], eax ; Arg4: result from strcmp
lea eax, [s2]
mov [esp+8], eax ; Arg3: pointer to second string
lea eax, [s1]
mov [esp+4], eax ; Arg2: pointer to first string
lea eax, [fmtStr2b]
mov [esp], eax ; Arg1: pointer to format string
call printf
lea eax, [s3]
mov [esp+4], eax ; Arg2: pointer to second string to compare
lea eax, [s1]
mov [esp], eax ; Arg1: pointer to first string to compare
call strcmpSSE42
mov [esp+12], eax ; Arg4: result from strcmpSSE42
lea eax, [s3]
mov [esp+8], eax ; Arg3: pointer to second string
lea eax, [s1]
mov [esp+4], eax ; Arg2: pointer to first string
lea eax, [fmtStr2]
mov [esp], eax ; Arg1: pointer to format string
call printf
lea eax, [s3]
mov [esp+4], eax ; Arg2: pointer to second string to compare
lea eax, [s1]
mov [esp], eax ; Arg1: pointer to first string to compare
call strcmp ; Call strcmp(3):
; int strcmp(const char *s1, const char *s2);
mov [esp+12], eax ; Arg4: result from strcmp
lea eax, [s3]
mov [esp+8], eax ; Arg3: pointer to second string
lea eax, [s1]
mov [esp+4], eax ; Arg2: pointer to first string
lea eax, [fmtStr2b]
mov [esp], eax ; Arg1: pointer to format string
call printf
call exit
;
; size_t strlen(const char *s);
;
strlenSSE42:
push ebp
mov ebp, esp
mov edx, [ebp+8] ; Arg1: copy s(pointer to string) to edx
;
; We are looking for null terminating char, so set xmm0 to zero
;
pxor xmm0, xmm0
mov eax, -16 ; Avoid extra jump in main loop
strlenLoop:
add eax, 16
;
; IMM8[1:0] = 00b
; Src data is unsigned bytes(16 packed unsigned bytes)
; IMM8[3:2] = 10b
; We are using Equal Each aggregation
; IMM8[5:4] = 00b
; Positive Polarity, IntRes2 = IntRes1
; IMM8[6] = 0b
; ECX contains the least significant set bit in IntRes2
;
pcmpistri xmm0,[edx+eax], 0001000b
;
; Loop while ZF = 0, which means none of the bytes pointed to by edx+eax
; are zero.
;
jnz strlenLoop
;
; ecx will contain the offset from edx+eax where the first null
; terminating character was found.
;
add eax, ecx
pop ebp
ret
;
; int strcmp(const char *s1, const char *s2);
;
strcmpSSE42:
push ebp
mov ebp, esp
mov eax, [ebp+8] ; Arg1: copy s1(pointer to string) to eax
mov edx, [ebp+12] ; Arg2: copy s2(pointer to string) to edx
;
; Subtract s2(edx) from s1(eax). This admittedly looks odd, but we
; can now use edx to index into s1 and s2. As we adjust edx to move
; forward into s2, we can then add edx to eax and this will give us
; the comparable offset into s1 i.e. if we take edx + 16 then:
;
; edx = edx + 16 = edx + 16
; eax+edx = eax -edx + edx + 16 = eax + 16
;
; therefore edx points to s2 + 16 and eax + edx points to s1 + 16.
; We thus only need one index, convoluted but effective.
;
sub eax, edx
sub edx, 16 ; Avoid extra jump in main loop
strcmpLoop:
add edx, 16
movdqu xmm0, [edx]
;
; IMM8[1:0] = 00b
; Src data is unsigned bytes(16 packed unsigned bytes)
; IMM8[3:2] = 10b
; We are using Equal Each aggregation
; IMM8[5:4] = 01b
; Negative Polarity, IntRes2 = -1 XOR IntRes1
; IMM8[6] = 0b
; ECX contains the least significant set bit in IntRes2
;
pcmpistri xmm0, [edx+eax], 0011000b
;
; Loop while ZF=0 and CF=0:
;
; 1) We find a null in s1(edx+eax) ZF=1
; 2) We find a char that does not match CF=1
;
ja strcmpLoop
;
; Jump if CF=1, we found a mismatched char
;
jc strcmpDiff
;
; We terminated loop due to a null character i.e. CF=0 and ZF=1
;
xor eax, eax ; They are equal so return zero
jmp exitStrcmp
strcmpDiff:
add eax, edx ; Set offset into s1 to match s2
;
; ecx is offset from current position where two strings do not match,
; so copy the respective non-matching byte into eax and edx and fill
; in remaining bits w/ zero.
;
movzx eax, byte[eax+ecx]
movzx edx, byte[edx+ecx]
;
; If s1 is less than s2 return integer less than zero, otherwise return
; integer greater than zero.
;
sub eax, edx
exitStrcmp:
pop ebp
ret
exit:
;
; Invoke the exit system call directly (see _exit(2))
; void exit(int status)
;
mov ebx, 0 ; Arg one: the status
mov eax, 1 ; Syscall number:
int 0x80
Expected output:
String: ABCDABCDABCDABCDABCDABCDABCDABCDABCDABC len: 39
strlen(3): String: ABCDABCDABCDABCDABCDABCDABCDABCDABCDABC len: 39
s1: =This is a string= and s2: =This is a string slightly different string= compare: -32
strcmp(3): s1: =This is a string= and s2: =This is a string slightly different string= compare: -32
s1: =This is a string= and s2: =This is a str= compare: 105
strcmp(3): s1: =This is a string= and s2: =This is a str= compare: 105
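The control flow of strlenSSE42 and strcmpSSE42 can be checked with a small Python model of pcmpistri, restricted to the two control bytes the listing uses (unsigned bytes, Equal Each, least significant index, with positive or negative polarity). This is a sketch rather than a full emulation: the helper names are invented, and reads past the end of a buffer are modelled as returning null bytes.

```python
def implicit_length(chunk: bytes) -> int:
    """Number of bytes before the first null, at most 16."""
    return chunk.index(0) if 0 in chunk else 16

def load16(memory: bytes, offset: int) -> bytes:
    """Read 16 bytes, padding with nulls past the end of the modelled memory."""
    return memory[offset:offset + 16].ljust(16, b"\0")

def pcmpistri_equal_each(reg: bytes, mem: bytes, negate: bool):
    """Return (ECX, CF, ZF) for Equal Each on unsigned bytes."""
    n_reg, n_mem = implicit_length(reg), implicit_length(mem)
    int_res2 = []
    for i in range(16):
        if i < n_reg and i < n_mem:
            bit = reg[i] == mem[i]
        else:
            bit = i >= n_reg and i >= n_mem   # equal only if both are past the end
        int_res2.append(bit != negate)        # negative polarity flips every bit
    ecx = int_res2.index(True) if True in int_res2 else 16
    return ecx, any(int_res2), n_mem < 16

def strlen_sse42(s: bytes) -> int:
    offset = 0
    while True:
        # pcmpistri xmm0, [edx+eax], 0001000b with xmm0 = 0
        ecx, _cf, zf = pcmpistri_equal_each(bytes(16), load16(s, offset), False)
        if zf:                                # a null was found in this chunk
            return offset + ecx
        offset += 16

def strcmp_sse42(s1: bytes, s2: bytes) -> int:
    offset = 0
    while True:
        # movdqu xmm0, [s2 chunk]; pcmpistri xmm0, [s1 chunk], 0011000b
        c1, c2 = load16(s1, offset), load16(s2, offset)
        ecx, cf, zf = pcmpistri_equal_each(c2, c1, True)
        if not cf and not zf:                 # ja: no mismatch and no null yet
            offset += 16
            continue
        if cf:                                # jc: first mismatching byte is at ecx
            return c1[ecx] - c2[ecx]
        return 0                              # null reached without a mismatch

buf1 = bytearray(b"ABCD" * 10)
buf1[39] = 0
print(strlen_sse42(bytes(buf1)))
print(strcmp_sse42(b"This is a string\0",
                   b"This is a string slightly different string\0"))
print(strcmp_sse42(b"This is a string\0", b"This is a str\0"))
```

The three printed values correspond to the length and the two comparison results in the expected output of the assembly program.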
pcmpistrm IMM8, arg2, arg1 | GAS Syntax |
pcmpistrm arg1, arg2, IMM8 | Intel Syntax |
PCMPISTRM, Packed Compare Implicit Length Strings, Return Mask. Compares strings of implicit length and generates a mask stored in XMM0.
Operands
- arg1: XMM register
- arg2: XMM register or memory
- IMM8: 8-bit immediate value
Modified flags
- CF is reset if IntRes2 is zero, set otherwise
- ZF is set if a null terminating character is found in arg2, reset otherwise
- SF is set if a null terminating character is found in arg1, reset otherwise
- OF is set to IntRes2[0]
- AF is reset
- PF is reset
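For the mask-returning forms, IMM8[6] chooses between a compact bit mask and an expanded byte/word mask. The two layouts can be sketched in Python as follows; the helper names are hypothetical, and `int_res2` is given as a list of bits with element 0 first.

```python
def bit_mask(int_res2) -> bytes:
    """IMM8[6] = 0: IntRes2 sits in the low bits of XMM0, zero-extended to 128 bits."""
    value = 0
    for i, bit in enumerate(int_res2):
        value |= bit << i
    return value.to_bytes(16, "little")

def element_mask(int_res2, element_size: int = 1) -> bytes:
    """IMM8[6] = 1: every bit becomes an all-ones or all-zeros byte (or word)."""
    out = bytearray()
    for bit in int_res2:
        out += (b"\xff" if bit else b"\x00") * element_size
    return bytes(out)

# IntRes2 with bits 2, 6 and 11 set (16 byte elements).
int_res2 = [1 if i in (2, 6, 11) else 0 for i in range(16)]
print(bit_mask(int_res2).hex())
print(element_mask(int_res2).hex())
```

The expanded form is convenient as input to a following packed AND or blend, while the bit form is easier to move into a general-purpose register.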
pcmpestri IMM8, arg2, arg1 | GAS Syntax |
pcmpestri arg1, arg2, IMM8 | Intel Syntax |
PCMPESTRI, Packed Compare Explicit Length Strings, Return Index. Compares strings of explicit length and generates an index in ECX.
Operands
- arg1: XMM register
- arg2: XMM register or memory
- IMM8: 8-bit immediate value
Implicit Operands
- EAX holds the length of arg1
- EDX holds the length of arg2
Modified flags
- CF is reset if IntRes2 is zero, set otherwise
- ZF is set if EDX is < 16 (for bytes) or 8 (for words), reset otherwise
- SF is set if EAX is < 16 (for bytes) or 8 (for words), reset otherwise
- OF is set to IntRes2[0]
- AF is reset
- PF is reset
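Because the explicit-length forms take their lengths from EAX and EDX instead of a terminating null, a null byte can itself be one of the characters searched for. The Python sketch below illustrates this for the Equal Any aggregation on unsigned bytes with the least significant index. The function names are invented, and the rule that a length is the absolute value of the register saturated to 16 (bytes) or 8 (words) is taken from Intel's instruction reference rather than from the text above, so treat it as an assumption of this example.

```python
def explicit_length(reg_value: int, words: bool = False) -> int:
    """Length used for an operand: |value| saturated to 16 bytes or 8 words."""
    limit = 8 if words else 16
    return min(abs(reg_value), limit)

def length_flags(eax: int, edx: int, words: bool = False):
    """Return (SF, ZF): set when the corresponding operand is shorter than a full register."""
    limit = 8 if words else 16
    return abs(eax) < limit, abs(edx) < limit

def pcmpestri_equal_any(arg1: bytes, eax: int, arg2: bytes, edx: int) -> int:
    """Return ECX: index of the first byte of arg2 that occurs in arg1, or 16."""
    n1, n2 = explicit_length(eax), explicit_length(edx)
    for j in range(n2):
        if arg2[j] in arg1[:n1]:
            return j
    return 16

charset = b"\x00,".ljust(16, b"\0")       # search for a null byte or a comma
data = b"ab\x00cd,ef".ljust(16, b"x")     # contains an embedded null at index 2
print(pcmpestri_equal_any(charset, 2, data, 8))
print(length_flags(2, 8))
```

With EAX = 2 the set consists of exactly two characters, the null byte and the comma, so the embedded null in the data is reported as a match instead of ending the string.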
pcmpestrm IMM8, arg2, arg1 | GAS Syntax |
pcmpestrm arg1, arg2, IMM8 | Intel Syntax |
PCMPESTRM, Packed Compare Explicit Length Strings, Return Mask. Compares strings of explicit length and generates a mask stored in XMM0.
Operands
- arg1: XMM register
- arg2: XMM register or memory
- IMM8: 8-bit immediate value
Implicit Operands
- EAX holds the length of arg1
- EDX holds the length of arg2
Modified flags
- CF is reset if IntRes2 is zero, set otherwise
- ZF is set if EDX is < 16 (for bytes) or 8 (for words), reset otherwise
- SF is set if EAX is < 16 (for bytes) or 8 (for words), reset otherwise
- OF is set to IntRes2[0]
- AF is reset
- PF is reset
SSE Instruction Set
There are literally hundreds of SSE instructions, some of which are capable of much more than simple SIMD arithmetic. For more in-depth references take a look at the resources chapter of this book.
You may notice that many floating point SSE instructions end with something like PS or SD. These suffixes differentiate between different versions of the operation. The first letter describes whether the instruction should be Packed or Scalar. Packed operations are applied to every member of the register, while scalar operations are applied to only the first value. For example, in pseudo-code, a packed add would be executed as:
v1[0] = v1[0] + v2[0]
v1[1] = v1[1] + v2[1]
v1[2] = v1[2] + v2[2]
v1[3] = v1[3] + v2[3]
While a scalar add would only be:
v1[0] = v1[0] + v2[0]
The second letter refers to the data size: either Single or Double. This simply tells the processor whether to use the register as four 32-bit floats or two 64-bit doubles, respectively.
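The difference between the packed and scalar forms can also be written as runnable code. In this Python sketch (lists stand in for registers, element 0 is the least significant element, and the function names simply mirror the mnemonics), the scalar form leaves the upper elements of the destination untouched:

```python
def addps(dst, src):
    """Packed single: add all four pairs of elements."""
    return [d + s for d, s in zip(dst, src)]

def addss(dst, src):
    """Scalar single: add only element 0; elements 1..3 of dst are unchanged."""
    return [dst[0] + src[0]] + list(dst[1:])

def addpd(dst, src):
    """Packed double: the same idea, but the register holds two 64-bit elements."""
    return [d + s for d, s in zip(dst, src)]

v1 = [1.0, 2.0, 3.0, 4.0]
v2 = [10.0, 20.0, 30.0, 40.0]
print(addps(v1, v2))
print(addss(v1, v2))
print(addpd([1.0, 2.0], [10.0, 20.0]))
```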
SSE: Added with Pentium III
Floating-point Instructions:
ADDPS, ADDSS, CMPPS, CMPSS, COMISS, CVTPI2PS, CVTPS2PI, CVTSI2SS, CVTSS2SI, CVTTPS2PI, CVTTSS2SI, DIVPS, DIVSS, LDMXCSR, MAXPS, MAXSS, MINPS, MINSS, MOVAPS, MOVHLPS, MOVHPS, MOVLHPS, MOVLPS, MOVMSKPS, MOVNTPS, MOVSS, MOVUPS, MULPS, MULSS, RCPPS, RCPSS, RSQRTPS, RSQRTSS, SHUFPS, SQRTPS, SQRTSS, STMXCSR, SUBPS, SUBSS, UCOMISS, UNPCKHPS, UNPCKLPS
Integer Instructions:
ANDNPS, ANDPS, ORPS, PAVGB, PAVGW, PEXTRW, PINSRW, PMAXSW, PMAXUB, PMINSW, PMINUB, PMOVMSKB, PMULHUW, PSADBW, PSHUFW, XORPS
SSE2: Added with Pentium 4
Floating-point Instructions:
ADDPD, ADDSD, ANDNPD, ANDPD, CMPPD, CMPSD*, COMISD, CVTDQ2PD, CVTDQ2PS, CVTPD2DQ, CVTPD2PI, CVTPD2PS, CVTPI2PD, CVTPS2DQ, CVTPS2PD, CVTSD2SI, CVTSD2SS, CVTSI2SD, CVTSS2SD, CVTTPD2DQ, CVTTPD2PI, CVTTPS2DQ, CVTTSD2SI, DIVPD, DIVSD, MAXPD, MAXSD, MINPD, MINSD, MOVAPD, MOVHPD, MOVLPD, MOVMSKPD, MOVSD*, MOVUPD, MULPD, MULSD, ORPD, SHUFPD, SQRTPD, SQRTSD, SUBPD, SUBSD, UCOMISD, UNPCKHPD, UNPCKLPD, XORPD
- * CMPSD and MOVSD have the same name as the string instruction mnemonics CMPSD (CMPS) and MOVSD (MOVS); however, the former refer to scalar double-precision floating-points whereas the latter refer to doubleword strings.
Integer Instructions:
MOVDQ2Q, MOVDQA, MOVDQU, MOVQ2DQ, PADDQ, PSUBQ, PMULUDQ, PSHUFHW, PSHUFLW, PSHUFD, PSLLDQ, PSRLDQ, PUNPCKHQDQ, PUNPCKLQDQ
SSE3: Added with later Pentium 4
ADDSUBPD, ADDSUBPS, HADDPD, HADDPS, HSUBPD, HSUBPS, MOVDDUP, MOVSHDUP, MOVSLDUP
SSSE3: Added with Xeon 5100 and early Core 2
PSIGNW, PSIGND, PSIGNB, PSHUFB, PMULHRSW, PMADDUBSW, PHSUBW, PHSUBSW, PHSUBD, PHADDW, PHADDSW, PHADDD, PALIGNR, PABSW, PABSD, PABSB
SSE4
SSE4.1: Added with later Core 2
MPSADBW, PHMINPOSUW, PMULLD, PMULDQ, DPPS, DPPD, BLENDPS, BLENDPD, BLENDVPS, BLENDVPD, PBLENDVB, PBLENDW, PMINSB, PMAXSB, PMINUW, PMAXUW, PMINUD, PMAXUD, PMINSD, PMAXSD, ROUNDPS, ROUNDSS, ROUNDPD, ROUNDSD, INSERTPS, PINSRB, PINSRD, PINSRQ, EXTRACTPS, PEXTRB, PEXTRW, PEXTRD, PEXTRQ, PMOVSXBW, PMOVZXBW, PMOVSXBD, PMOVZXBD, PMOVSXBQ, PMOVZXBQ, PMOVSXWD, PMOVZXWD, PMOVSXWQ, PMOVZXWQ, PMOVSXDQ, PMOVZXDQ, PTEST, PCMPEQQ, PACKUSDW, MOVNTDQA
SSE4a: Added with Phenom
LZCNT, POPCNT, EXTRQ, INSERTQ, MOVNTSD, MOVNTSS
SSE4.2: Added with Nehalem
CRC32, PCMPESTRI, PCMPESTRM, PCMPISTRI, PCMPISTRM, PCMPGTQ