x86 Assembly/SSE

x86 Assembly
quick links: registers • move • jump • calculate • logic • rearrange • misc. • FPU

Wikipedia has related information at Streaming SIMD Extensions

SSE stands for Streaming SIMD Extensions. It is essentially the floating-point equivalent of the MMX instructions. The SSE registers are 128 bits, and can be used to perform operations on a variety of data sizes and types. Unlike MMX, the SSE registers do not overlap with the floating point stack.

Registers

SSE, introduced by Intel in 1999 with the Pentium III, creates eight new 128-bit registers:

XMM0 XMM1 XMM2 XMM3 XMM4 XMM5 XMM6 XMM7

Originally, an SSE register could only be used as four 32-bit single precision floating point numbers (the equivalent of a float in C). SSE2 expanded the capabilities of the XMM registers, so they can now be used as:

2 64-bit floating points (double precision)
2 64-bit integers
4 32-bit floating points (single-precision)
4 32-bit integers
8 16-bit integers
16 8-bit characters (bytes)

Data movement examples

The following program (using NASM syntax) performs data movements using SIMD instructions.

;
; nasm -felf32 -g sseMove.asm
; ld -g sseMove.o
;
global _start

section .data
	align 16
	v1:	dd 1.1, 2.2, 3.3, 4.4	; Four Single precision floats 32 bits each
	v1dp:	dq 1.1, 2.2		; Two Double precision floats 64 bits each
	v2:	dd 5.5, 6.6, 7.7, 8.8
	v2s1:	dd 5.5, 6.6, 7.7, -8.8
	v2s2:	dd 5.5, 6.6, -7.7, -8.8
	v2s3:	dd 5.5, -6.6, -7.7, -8.8
	v2s4:	dd -5.5, -6.6, -7.7, -8.8
	num1:	dd 1.2
	v3:	dd 1.2, 2.3, 4.5, 6.7	; No longer 16 byte aligned
	v3dp:	dq 1.2, 2.3		; No longer 16 byte aligned

section .bss
	mask1:	resd 1
	mask2:	resd 1
	mask3:	resd 1
	mask4:	resd 1

section .text
	_start:

;
;	op	dst,  src
;
				;
				; SSE
				;
				; Using movaps since vectors are 16 byte aligned
	movaps	xmm0, [v1]	; Move four 32-bit(single precision) floats to xmm0 
	movaps	xmm1, [v2]
	movups	xmm2, [v3]	; Need to use movups since v3 is not 16 byte aligned
	;movaps	xmm3, [v3]	; This would seg fault if uncommented 
	movss	xmm3, [num1]	; Move 32-bit float num1 to the least significant element of xmm3
	movss	xmm3, [v3]	; Move first 32-bit float of v3 to the least significant element of xmm3
	movlps	xmm4, [v3]	; Move 64-bits(two single precision floats) from memory to the lower 64-bit elements of xmm4
	movhps	xmm4, [v2]	; Move 64-bits(two single precision floats) from memory to the higher 64-bit elements of xmm4

				; Source and destination for movhlps and movlhps must be xmm registers
	movhlps	xmm5, xmm4	; Transfers the higher 64-bits of the source xmm4 to the lower 64-bits of the destination xmm5
	movlhps	xmm5, xmm4	; Transfers the lower 64-bits of the source xmm4 to the higher 64-bits of the destination xmm5


	movaps	xmm6, [v2s1]
	movmskps eax, xmm6	; Extract the sign bits from four 32-bits floats in xmm6 and create 4 bit mask in eax 
	mov	[mask1], eax	; Should be 8
	movaps	xmm6, [v2s2]
	movmskps eax, xmm6	; Extract the sign bits from four 32-bits floats in xmm6 and create 4 bit mask in eax
	mov	[mask2], eax	; Should be 12
	movaps	xmm6, [v2s3]
	movmskps eax, xmm6	; Extract the sign bits from four 32-bits floats in xmm6 and create 4 bit mask in eax
	mov	[mask3], eax	; Should be 14
	movaps	xmm6, [v2s4]
	movmskps eax, xmm6	; Extract the sign bits from four 32-bits floats in xmm6 and create 4 bit mask in eax
	mov	[mask4], eax	; Should be 15


				;
				; SSE2
				;
	movapd	xmm6, [v1dp]	; Move two 64-bit(double precision) floats to xmm6, using movapd since vector is 16 byte aligned 
				; Next two instruction should have equivalent results to movapd xmm6, [vldp]
	movhpd	xmm6, [v1dp+8]	; Move a 64-bit(double precision) float into the higher 64-bit elements of xmm6 
	movlpd	xmm6, [v1dp]	; Move a 64-bit(double precision) float into the lower 64-bit elements of xmm6
	movupd	xmm6, [v3dp]	; Move two 64-bit floats to xmm6, using movupd since vector is not 16 byte aligned

Arithmetic example using packed singles

The following program (using NASM syntax) performs a few SIMD operations on some numbers.

global _start

section .data
    v1: dd 1.1, 2.2, 3.3, 4.4    ;first set of 4 numbers
    v2: dd 5.5, 6.6, 7.7, 8.8    ;second set
    
section .bss
    v3: resd 4    ;result
    
section .text
    _start:
    
    movups xmm0, [v1]   ;load v1 into xmm0
    movups xmm1, [v2]   ;load v2 into xmm1
    
    addps xmm0, xmm1    ;add the 4 numbers in xmm1 (from v2) to the 4 numbers in xmm0 (from v1), store in xmm0. for the first float the result will be 5.5+1.1=6.6
    mulps xmm0, xmm1    ;multiply the four numbers in xmm1 (from v2, unchanged) with the results from the previous calculation (in xmm0), store in xmm0. for the first float the result will be 5.5*6.6=36.3
    subps xmm0, xmm1    ;subtract the four numbers in v2 (in xmm1, still unchanged) from result from previous calculation (in xmm1). for the first float, the result will be 36.3-5.5=30.8
    
    movups [v3], xmm0   ;store v1 in v3
    
    ;end program
    ret

The result values should be:

30.800    51.480    77.000    107.360

Using the GNU toolchain, you can debug and single-step like this:

 % nasm -felf32 -g ssedemo.asm
 % ld -g ssedemo.o            
 % gdb -q ./a.out                
Reading symbols from a.out...done.
(gdb) break _start
Breakpoint 1 at 0x8048080
(gdb) r
Starting program: a.out 

Breakpoint 1, 0x08048080 in _start ()
(gdb) disass
Dump of assembler code for function _start:
=> 0x08048080 <+0>:	movups 0x80490a0,%xmm0
   0x08048087 <+7>:	movups 0x80490b0,%xmm1
   0x0804808e <+14>:	addps  %xmm1,%xmm0
   0x08048091 <+17>:	mulps  %xmm1,%xmm0
   0x08048094 <+20>:	subps  %xmm1,%xmm0
   0x08048097 <+23>:	movups %xmm0,0x80490c0
End of assembler dump.
(gdb) stepi
0x08048087 in _start ()
(gdb) 
0x0804808e in _start ()
(gdb) p $xmm0
$1 = {v4_float = {1.10000002, 2.20000005, 3.29999995, 4.4000001}, v2_double = {3.6000008549541236, 921.60022034645078}, v16_int8 = {-51, -52, -116, 63, 
    -51, -52, 12, 64, 51, 51, 83, 64, -51, -52, -116, 64}, v8_int16 = {-13107, 16268, -13107, 16396, 13107, 16467, -13107, 16524}, v4_int32 = {1066192077, 
    1074580685, 1079194419, 1082969293}, v2_int64 = {4615288900054469837, 4651317697086436147}, uint128 = 0x408ccccd40533333400ccccd3f8ccccd}
(gdb) x/4f &v1
0x80490a0 <v1>:	1.10000002	2.20000005	3.29999995	4.4000001
(gdb) stepi
0x08048091 in _start ()
(gdb) p $xmm0
$2 = {v4_float = {6.5999999, 8.80000019, 11, 13.2000008}, v2_double = {235929.65665283203, 5033169.0185546875}, v16_int8 = {51, 51, -45, 64, -51, -52, 12, 
    65, 0, 0, 48, 65, 52, 51, 83, 65}, v8_int16 = {13107, 16595, -13107, 16652, 0, 16688, 13108, 16723}, v4_int32 = {1087583027, 1091357901, 1093664768, 
    1095971636}, v2_int64 = {4687346494113788723, 4707162335057281024}, uint128 = 0x4153333441300000410ccccd40d33333}
(gdb)

Debugger commands explained

break: In this case, sets a breakpoint at a given label
stepi: Steps one instruction forward in the program
p: short for print, prints a given register or variable. Registers are prefixed by $ in GDB.
x: short for examine, examines a given memory address. The "/4f" means "4 floats" (floats in GDB are 32-bits). You can use c for chars, x for hexadecimal and any other number instead of 4 of course. The "&" takes the address of v1, as in C.

Shuffling example using `shufps`

shufps IMM8, arg1, arg2	GAS Syntax
shufps arg2, arg1, IMM8	Intel Syntax

shufps can be used to shuffle packed single-precision floats. The instruction takes three parameters, arg1 an xmm register, arg2 an xmm or a 128-bit memory location and IMM8 an 8-bit immediate control byte. shufps will take two elements each from arg1 and arg2, copying the elements to arg2. The lower two elements will come from arg1 and the higher two elements from arg2.

IMM8 control byte description

IMM8 control byte is split into four group of bit fields that control the output into arg2 as follows:

IMM8[1:0] specifies which element of arg1 ends up in the least significant element of arg2:

IMM8[1:0] Description

00b Copy to the least significant element

01b Copy to the second element

10b Copy to the third element

11b Copy to the most significant element
IMM8[3:2] specifies which element of arg1 ends up in the second element of arg2:

IMM8[3:2] Description

00b Copy to the least significant element

01b Copy to the second element

10b Copy to the third element

11b Copy to the most significant element
IMM8[5:4] specifies which element of arg2 ends up in the third element of arg2:

IMM8[5:4] Description

00b Copy to the least significant element

01b Copy to the second element

10b Copy to the third element

11b Copy to the most significant element
IMM8[7:6] specifies which element of arg2 ends up in the most significant element of arg2:

IMM8[7:6] Description

00b Copy to the least significant element

01b Copy to the second element

10b Copy to the third element

11b Copy to the most significant element

IMM8 Example

Consider the byte 0x1B:

Bit number (0 being LSB)	7	6	5	4	3	2	1	0
Byte value	0x1B
Nibble value	0x1				0xB
2-bit integer (decimal) value	0		1		2		3
Bit value	0	0	0	1	1	0	1	1

The 2-bit values shown above are used to determine which elements are copied to arg2. Bits 7-4 are "indexes" into arg2, and bits 3-0 are "indexes" into the arg1.

Since bits 7-6 are 0, the least significant element of arg2 is copied to the most significant elements of arg2, bits 127-96.
Since bits 5-4 are 1, the second element of arg2 is copied to third element of arg2, bits 95-64.
Since bits 3-2 are 2, the third element of arg1 is copied to the second element of arg2, bits 63-32.
Since bits 0-1 are 3, the fourth element of arg1 is copied to the least significant elements of arg2, bits (31-0).

Note that since the first and second arguments are equal in the following example, the mask 0x1B will effectively reverse the order of the floats in the XMM register, since the 2-bit integers are 0, 1, 2, 3. Had it been 3, 2, 1, 0 (0xE4) it would be a no-op. Had it been 0, 0, 0, 0 (0x00) it would be a broadcast of the least significant 32 bits.

Example

.data
	.align 16
        v1: .float 1.1, 2.2, 3.3, 4.4
        v2: .float 5.5, 6.6, 7.7, 8.8
        v3: .float 0, 0, 0, 0
 
.text
.global _start 
_start:   
        movaps  v1,%xmm0        # load v1 into xmm0 to xmm6
        movaps  v1,%xmm1	# using movaps since v1 is 16 byte aligned
        movaps  v1,%xmm2
        movaps  v1,%xmm3
        movaps  v1,%xmm4
        movaps  v1,%xmm5
        movaps  v1,%xmm6
 
        shufps $0x1b, %xmm0, %xmm0 # reverse order of the 4 floats
        shufps $0x00, %xmm1, %xmm1 # Broadcast least significant element to all elements
        shufps $0x55, %xmm2, %xmm2 # Broadcast second element to all elements
        shufps $0xAA, %xmm3, %xmm3 # Broadcast third element to all elements
        shufps $0xFF, %xmm4, %xmm4 # Broadcast most significant element to all elements
        shufps $0x39, %xmm5, %xmm5 # Rotate elements right
        shufps $0x93, %xmm6, %xmm6 # Rotate elements left 

        movups  %xmm0,v3        #store v1 in v3
        ret

Using GAS to build an ELF executable

as -g shufps.S -o shufps.o
ld -g shufps.o

Text Processing Instructions

SSE 4.2 adds four string text processing instructions PCMPISTRI, PCMPISTRM, PCMPESTRI and PCMPESTRM. These instructions take three parameters, arg1 an xmm register, arg2 an xmm or a 128-bit memory location and IMM8 an 8-bit immediate control byte. These instructions will perform arithmetic comparison between the packed contents of arg1 and arg2. IMM8 specifies the format of the input/output as well as the operation of two intermediate stages of processing. The results of stage 1 and stage 2 of intermediate processing will be referred to as IntRes1 and IntRes2 respectively. These instructions also provide additional information about the result through overload use of the arithmetic flags(AF, CF, OF, PF, SF and ZF).

The instructions proceed in multiple steps:

arg1 and arg2 are compared
An aggregation operation is applied to the result of the comparison with the result flowing into IntRes1
An optional negation is performed with the result flowing into IntRes2
An output in the form of an index(in ECX) or a mask(in XMM0) is produced

IMM8 control byte description

IMM8 control byte is split into four group of bit fields that control the following settings:

IMM8[1:0] specifies the format of the 128-bit source data(arg1 and arg2):

IMM8[1:0]	Description
00b	unsigned bytes(16 packed unsigned bytes)
01b	unsigned words(8 packed unsigned words)
10b	signed bytes(16 packed signed bytes)
11b	signed words(8 packed signed words)

IMM8[3:2] specifies the aggregation operation whose result will be placed in intermediate result 1, which we will refer to as IntRes1. The size of IntRes1 will depend on the format of the source data, 16-bit for packed bytes and 8-bit for packed words:

IMM8[3:2]	Description
00b	Equal Any, arg1 is a character set, arg2 is the string to search in. IntRes1[i] is set to 1 if arg2[i] is in the set represented by arg1: arg1 = "aeiou" arg2 = "Example string 1" IntRes1 = 0010001000010000
01b	Ranges, arg1 is a set of character ranges i.e. "09az" means all characters from 0 to 9 and from a to z., arg2 is the string to search over. IntRes1[i] is set to 1 if arg[i] is in any of the ranges represented by arg1: arg1 = "09az" arg2 = "Testing 1 2 3, T" IntRes1 = 0111111010101000
10b	Equal Each, arg1 is string one and arg2 is string two. IntRes1[i] is set to 1 if arg1[i] == arg2[i]: arg1 = "The quick brown " arg2 = "The quack green " IntRes1 = 1111110111010011
11b	Equal Ordered, arg1 is a substring string to search for, arg2 is the string to search within. IntRes1[i] is set to 1 if the substring arg1 can be found at position arg2[i]: arg1 = "he" arg2 = ", he helped her " IntRes1 = 0010010000001000

IMM8[5:4] specifies the polarity or the processing of IntRes1, into intermediate result 2, which will be referred to as IntRes2:

IMM8[5:4]	Description
00b	Positive Polarity	IntRes2 = IntRes1
01b	Negative Polarity	IntRes2 = -1 XOR IntRes1
10b	Masked Positive	IntRes2 = IntRes1
11b	Masked Negative	IntRes2 = IntRes1 if reg/mem[i] is invalid else ~IntRes1

IMM8[6] specifies the output selection, or how IntRes2 will be processed into the output. For PCMPESTRI and PCMPISTRI, the output is an index into the data currently referenced by arg2:

IMM8[6] Description

0b Least Significant Index ECX contains the least significant set bit in IntRes2

1b Most Significant Index ECX contains the most significant set bit in IntRes2

For PCMPESTRM and PCMPISTRM, the output is a mask reflecting all the set bits in IntRes2:

IMM8[6]	Description
0b	Least Significant Index	Bit Mask, the least significant bits of XMM0 contain the IntRes2 16(8) bit mask. XMM0 is zero extended to 128-bits.
1b	Most Significant Index	Byte/Word Mask, XMM0 contains IntRes2 expanded into byte/word mask

IMM8[7] should be set to zero since it has no designed meaning.

The Four Instructions

pcmpistri IMM8, arg2, arg1	GAS Syntax
pcmpistri arg1, arg2, IMM8	Intel Syntax

PCMPISTRI, Packed Compare Implicit Length Strings, Return Index. Compares strings of implicit length and generates index in ECX.

Operands

arg1

XMM Register

arg2

XMM Register
Memory

IMM8

8-bit Immediate value

Modified flags

CF is reset if IntRes2 is zero, set otherwise
ZF is set if a null terminating character is found in arg2, reset otherwise
SF is set if a null terminating character is found in arg1, reset otherwise
OF is set to IntRes2[0]
AF is reset
PF is reset

Example

;
; nasm -felf32 -g sse4_2StrPcmpistri.asm -l sse4_2StrPcmpistri.lst
; gcc -o sse4_2StrPcmpistri sse4_2StrPcmpistri.o
;
global main 

extern printf
extern strlen
extern strcmp

section .data
	align 4
	;
	; Fill buf1 with a repeating pattern of ABCD
	;
	buf1:		times 10 dd 0x44434241
	s1:		db "This is a string", 0
	s2:		db "This is a string slightly different string", 0
	s3:		db "This is a str", 0
	fmtStr1:	db "String: %s len: %d", 0x0A, 0
	fmtStr1b:	db "strlen(3): String: %s len: %d", 0x0A, 0
	fmtStr2:	db "s1: =%s= and s2: =%s= compare: %d", 0x0A, 0
	fmtStr2b:	db "strcmp(3): s1: =%s= and s2: =%s= compare: %d", 0x0A, 0

;
; Functions will follow the cdecl call convention
;
section .text
	main:			; Using main since we are using gcc to link

	sub	esp, -16	; 16 byte align the stack
	sub	esp, 16		; space for four 4 byte parameters

	;
	; Null terminate buf1, make it proper C string, length is now 39
	;
	mov	[buf1+39], byte 0x00

	lea	eax, [buf1]
	mov	[esp], eax	; Arg1: pointer of string to calculate the length of
	mov	ebx, eax	; Save pointer in ebx since we will use it again
	call	strlenSSE42
	mov	edx, eax	; Copy length of arg1 into edx
	
	mov	[esp+8], edx	; Arg3: length of string
	mov	[esp+4], ebx	; Arg2: pointer to string
	lea	eax, [fmtStr1]
	mov	[esp], eax	; Arg1: pointer to format string
	call	printf		; Call printf(3):
				;	int printf(const char *format, ...);

	lea	eax, [buf1]
	mov	[esp], eax	; Arg1: pointer of string to calculate the length of
	mov	ebx, eax	; Save pointer in ebx since we will use it again
	call	strlen		; Call strlen(3):
				;	size_t strlen(const char *s);
	mov	edx, eax	; Copy length of arg1 into edx
	
	mov	[esp+8], edx	; Arg3: length of string
	mov	[esp+4], ebx	; Arg2: pointer to string
	lea	eax, [fmtStr1b]
	mov	[esp], eax	; Arg1: pointer to format string
	call	printf		; Call printf(3):
				;	int printf(const char *format, ...);

	lea	eax, [s2]
	mov	[esp+4], eax	; Arg2: pointer to second string to compare
	lea	eax, [s1]
	mov	[esp], eax	; Arg1: pointer to first string to compare
	call	strcmpSSE42

	mov	[esp+12], eax	; Arg4: result from strcmpSSE42  
	lea	eax, [s2]
	mov	[esp+8], eax	; Arg3: pointer to second string
	lea	eax, [s1]
	mov	[esp+4], eax	; Arg2: pointer to first string
	lea	eax, [fmtStr2]
	mov	[esp], eax	; Arg1: pointer to format string
	call	printf

	lea	eax, [s2]
	mov	[esp+4], eax	; Arg2: pointer to second string to compare
	lea	eax, [s1]
	mov	[esp], eax	; Arg1: pointer to first string to compare
	call	strcmp		; Call strcmp(3):
				;	int strcmp(const char *s1, const char *s2);

	mov	[esp+12], eax	; Arg4: result from strcmpSSE42  
	lea	eax, [s2]
	mov	[esp+8], eax	; Arg3: pointer to second string
	lea	eax, [s1]
	mov	[esp+4], eax	; Arg2: pointer to first string
	lea	eax, [fmtStr2b]
	mov	[esp], eax	; Arg1: pointer to format string
	call	printf

	lea	eax, [s3]
	mov	[esp+4], eax	; Arg2: pointer to second string to compare
	lea	eax, [s1]
	mov	[esp], eax	; Arg1: pointer to first string to compare
	call	strcmpSSE42

	mov	[esp+12], eax	; Arg4: result from strcmpSSE42  
	lea	eax, [s3]
	mov	[esp+8], eax	; Arg3: pointer to second string
	lea	eax, [s1]
	mov	[esp+4], eax	; Arg2: pointer to first string
	lea	eax, [fmtStr2]
	mov	[esp], eax	; Arg1: pointer to format string
	call	printf

	lea	eax, [s3]
	mov	[esp+4], eax	; Arg2: pointer to second string to compare
	lea	eax, [s1]
	mov	[esp], eax	; Arg1: pointer to first string to compare
	call	strcmp		; Call strcmp(3):
				;	int strcmp(const char *s1, const char *s2);

	mov	[esp+12], eax	; Arg4: result from strcmpSSE42  
	lea	eax, [s3]
	mov	[esp+8], eax	; Arg3: pointer to second string
	lea	eax, [s1]
	mov	[esp+4], eax	; Arg2: pointer to first string
	lea	eax, [fmtStr2b]
	mov	[esp], eax	; Arg1: pointer to format string
	call	printf

	call	exit


;
; size_t strlen(const char *s);
;
strlenSSE42:
	push	ebp
	mov	ebp, esp

	mov	edx, [ebp+8]	; Arg1: copy s(pointer to string) to edx 
	;
	; We are looking for null terminating char, so set xmm0 to zero
	;
	pxor	xmm0, xmm0
	mov	eax, -16	; Avoid extra jump in main loop

strlenLoop:
	add	eax, 16
	;
	; IMM8[1:0]	= 00b
	;	Src data is unsigned bytes(16 packed unsigned bytes)
	; IMM8[3:2]	= 10b
	; 	We are using Equal Each aggregation
	; IMM8[5:4]	= 00b
	;	Positive Polarity, IntRes2	= IntRes1
	; IMM8[6]	= 0b
	;	ECX contains the least significant set bit in IntRes2
	;
	pcmpistri	xmm0,[edx+eax], 0001000b
	;
	; Loop while ZF != 0, which means none of bytes pointed to by edx+eax
	; are zero.
	;
	jnz	strlenLoop
	
	;
	; ecx will contain the offset from edx+eax where the first null
	; terminating character was found.
	;
	add	eax, ecx
	pop	ebp
	ret

;
; int strcmp(const char *s1, const char *s2);
;
strcmpSSE42:
	push	ebp
	mov	ebp, esp

	mov	eax, [ebp+8]	; Arg1: copy s1(pointer to string) to eax
	mov	edx, [ebp+12]	; Arg2: copy s2(pointer to string) to edx
	;
	; Subtract s2(edx) from s1(eax). This admititedly looks odd, but we
	; can now use edx to index into s1 and s2. As we adjust edx to move
	; forward into s2, we can then add edx to eax and this will give us
	; the comparable offset into s1 i.e. if we take edx + 16 then:
	;
	;	edx 	= edx + 16		= edx + 16
	;	eax+edx	= eax -edx + edx + 16	= eax + 16
	;
	; therefore edx points to s2 + 16 and eax + edx points to s1 + 16.
	; We thus only need one index, convoluted but effective.
	;
	sub	eax, edx
	sub	edx, 16		; Avoid extra jump in main loop

strcmpLoop:
	add	edx, 16
	movdqu	xmm0, [edx]
	;
	; IMM8[1:0]	= 00b
	;	Src data is unsigned bytes(16 packed unsigned bytes)
	; IMM8[3:2]	= 10b
	; 	We are using Equal Each aggregation
	; IMM8[5:4]	= 01b
	;	Negative Polarity, IntRes2	= -1 XOR IntRes1
	; IMM8[6]	= 0b
	;	ECX contains the least significant set bit in IntRes2
	;
	pcmpistri	xmm0, [edx+eax], 0011000b
	;
	; Loop while ZF=0 and CF=0:
	;
	;	1) We find a null in s1(edx+eax) ZF=1
	;	2) We find a char that does not match CF=1
	;
	ja	strcmpLoop

	;
	; Jump if CF=1, we found a mismatched char
	;
	jc	strcmpDiff

	;
	; We terminated loop due to a null character i.e. CF=0 and ZF=1
	;
	xor	eax, eax	; They are equal so return zero
	jmp	exitStrcmp

strcmpDiff:
	add	eax, edx	; Set offset into s1 to match s2
	;
	; ecx is offset from current poition where two strings do not match,
	; so copy the respective non-matching byte into eax and edx and fill
	; in remaining bits w/ zero.
	;
	movzx	eax, byte[eax+ecx]
	movzx	edx, byte[edx+ecx]
	;
	; If s1 is less than s2 return integer less than zero, otherwise return
	; integer greater than zero.
	;
	sub	eax, edx

exitStrcmp:
	pop	ebp
	ret

exit:
				;
				; Call exit(3) syscall
				;	void exit(int status)
				;
	mov	ebx, 0		; Arg one: the status
	mov	eax, 1		; Syscall number:
	int 	0x80

Expected output:

String: ABCDABCDABCDABCDABCDABCDABCDABCDABCDABC len: 39
strlen(3): String: ABCDABCDABCDABCDABCDABCDABCDABCDABCDABC len: 39
s1: =This is a string= and s2: =This is a string slightly different string= compare: -32
strcmp(3): s1: =This is a string= and s2: =This is a string slightly different string= compare: -32
s1: =This is a string= and s2: =This is a str= compare: 105
strcmp(3): s1: =This is a string= and s2: =This is a str= compare: 105

pcmpistrm IMM8, arg2, arg1	GAS Syntax
pcmpistrm arg1, arg2, IMM8	Intel Syntax

PCMPISTRM, Packed Compare Implicit Length Strings, Return Mask. Compares strings of implicit length and generates a mask stored in XMM0.

Operands

arg1

XMM Register

arg2

XMM Register
Memory

IMM8

8-bit Immediate value

Modified flags

CF is reset if IntRes2 is zero, set otherwise
ZF is set if a null terminating character is found in arg2, reset otherwise
SF is set if a null terminating character is found in arg2, reset otherwise
OF is set to IntRes2[0]
AF is reset
PF is reset

pcmpestri IMM8, arg2, arg1	GAS Syntax
pcmpestri arg1, arg2, IMM8	Intel Syntax

PCMPESTRI, Packed Compare Explicit Length Strings, Return Index. Compares strings of explicit length and generates index in ECX.

Operands

arg1

XMM Register

arg2

XMM Register
Memory

IMM8

8-bit Immediate value

Implicit Operands

EAX holds the length of arg1
EDX holds the length of arg2

Modified flags

CF is reset if IntRes2 is zero, set otherwise
ZF is set if EDX is < 16(for bytes) or 8(for words), reset otherwise
SF is set if EAX is < 16(for bytes) or 8(for words), reset otherwise
OF is set to IntRes2[0]
AF is reset
PF is reset

pcmpestrm IMM8, arg2, arg1	GAS Syntax
pcmpestrm arg1, arg2, IMM8	Intel Syntax

PCMPESTRM, Packed Compare Explicit Length Strings, Return Mask. Compares strings of explicit length and generates a mask stored in XMM0.

Operands

arg1

XMM Register

arg2

XMM Register
Memory

IMM8

8-bit Immediate value

Implicit Operands

EAX holds the length of arg1
EDX holds the length of arg2

Modified flags

CF is reset if IntRes2 is zero, set otherwise
ZF is set if EDX is < 16(for bytes) or 8(for words), reset otherwise
SF is set if EAX is < 16(for bytes) or 8(for words), reset otherwise
OF is set to IntRes2[0]
AF is reset
PF is reset

SSE Instruction Set

There are literally hundreds of SSE instructions, some of which are capable of much more than simple SIMD arithmetic. For more in-depth references take a look at the resources chapter of this book.

You may notice that many floating point SSE instructions end with something like PS or SD. These suffixes differentiate between different versions of the operation. The first letter describes whether the instruction should be Packed or Scalar. Packed operations are applied to every member of the register, while scalar operations are applied to only the first value. For example, in pseudo-code, a packed add would be executed as:

v1[0] = v1[0] + v2[0]
v1[1] = v1[1] + v2[1]
v1[2] = v1[2] + v2[2]
v1[3] = v1[3] + v2[3]

While a scalar add would only be:

v1[0] = v1[0] + v2[0]

The second letter refers to the data size: either Single or Double. This simply tells the processor whether to use the register as four 32-bit floats or two 64-bit doubles, respectively.

SSE: Added with Pentium III

Floating-point Instructions:

ADDPS, ADDSS, CMPPS, CMPSS, COMISS, CVTPI2PS, CVTPS2PI, CVTSI2SS, CVTSS2SI, CVTTPS2PI, CVTTSS2SI, DIVPS, DIVSS, LDMXCSR, MAXPS, MAXSS, MINPS, MINSS, MOVAPS, MOVHLPS, MOVHPS, MOVLHPS, MOVLPS, MOVMSKPS, MOVNTPS, MOVSS, MOVUPS, MULPS, MULSS, RCPPS, RCPSS, RSQRTPS, RSQRTSS, SHUFPS, SQRTPS, SQRTSS, STMXCSR, SUBPS, SUBSS, UCOMISS, UNPCKHPS, UNPCKLPS

Integer Instructions:

ANDNPS, ANDPS, ORPS, PAVGB, PAVGW, PEXTRW, PINSRW, PMAXSW, PMAXUB, PMINSW, PMINUB, PMOVMSKB, PMULHUW, PSADBW, PSHUFW, XORPS

SSE2: Added with Pentium 4

Floating-point Instructions:

ADDPD, ADDSD, ANDNPD, ANDPD, CMPPD, CMPSD*, COMISD, CVTDQ2PD, CVTDQ2PS, CVTPD2DQ, CVTPD2PI, CVTPD2PS, CVTPI2PD, CVTPS2DQ, CVTPS2PD, CVTSD2SI, CVTSD2SS, CVTSI2SD, CVTSS2SD, CVTTPD2DQ, CVTTPD2PI, CVTTPS2DQ, CVTTSD2SI, DIVPD, DIVSD, MAXPD, MAXSD, MINPD, MINSD, MOVAPD, MOVHPD, MOVLPD, MOVMSKPD, MOVSD*, MOVUPD, MULPD, MULSD, ORPD, SHUFPD, SQRTPD, SQRTSD, SUBPD, SUBSD, UCOMISD, UNPCKHPD, UNPCKLPD, XORPD

* CMPSD and MOVSD have the same name as the string instruction mnemonics CMPSD (CMPS) and MOVSD (MOVS); however, the former refer to scalar double-precision floating-points whereas the latter refer to doubleword strings.

Integer Instructions:

MOVDQ2Q, MOVDQA, MOVDQU, MOVQ2DQ, PADDQ, PSUBQ, PMULUDQ, PSHUFHW, PSHUFLW, PSHUFD, PSLLDQ, PSRLDQ, PUNPCKHQDQ, PUNPCKLQDQ

SSE3: Added with later Pentium 4

ADDSUBPD, ADDSUBPS, HADDPD, HADDPS, HSUBPD, HSUBPS, MOVDDUP, MOVSHDUP, MOVSLDUP

SSSE3: Added with Xeon 5100 and early Core 2

PSIGNW, PSIGND, PSIGNB, PSHUFB, PMULHRSW, PMADDUBSW, PHSUBW, PHSUBSW, PHSUBD, PHADDW, PHADDSW, PHADDD, PALIGNR, PABSW, PABSD, PABSB

SSE4

SSE4.1: Added with later Core 2

MPSADBW, PHMINPOSUW, PMULLD, PMULDQ, DPPS, DPPD, BLENDPS, BLENDPD, BLENDVPS, BLENDVPD, PBLENDVB, PBLENDW, PMINSB, PMAXSB, PMINUW, PMAXUW, PMINUD, PMAXUD, PMINSD, PMAXSD, ROUNDPS, ROUNDSS, ROUNDPD, ROUNDSD, INSERTPS, PINSRB, PINSRD, PINSRQ, EXTRACTPS, PEXTRB, PEXTRW, PEXTRD, PEXTRQ, PMOVSXBW, PMOVZXBW, PMOVSXBD, PMOVZXBD, PMOVSXBQ, PMOVZXBQ, PMOVSXWD, PMOVZXWD, PMOVSXWQ, PMOVZXWQ, PMOVSXDQ, PMOVZXDQ, PTEST, PCMPEQQ, PACKUSDW, MOVNTDQA

SSE4a: Added with Phenom

LZCNT, POPCNT, EXTRQ, INSERTQ, MOVNTSD, MOVNTSS

SSE4.2: Added with Nehalem

CRC32, PCMPESTRI, PCMPESTRM, PCMPISTRI, PCMPISTRM, PCMPGTQ

Registers

Data movement examples

Arithmetic example using packed singles

Debugger commands explained

Shuffling example using shufps

IMM8 control byte description

Text Processing Instructions

IMM8 control byte description

The Four Instructions

SSE Instruction Set

SSE: Added with Pentium III

SSE2: Added with Pentium 4

SSE3: Added with later Pentium 4

SSSE3: Added with Xeon 5100 and early Core 2

SSE4

SSE4.1: Added with later Core 2

SSE4a: Added with Phenom

SSE4.2: Added with Nehalem

Shuffling example using `shufps`