Introduction – the optimization dilemma
Sometimes it can be useful to test the real performance of a basic algorithm as implemented by the vendor of a compiler/library/framework, to better understand if and how to improve the performance of a critical code path. Remember that, in the optimization field, this rule always applies: premature optimization is the root of all evil (or at least most of it) in programming.
This time I will spend a little effort studying the behaviour of the stdlib "strchr"-like routines in the Visual Studio 2013 implementation. My gut feeling says that the stdlib implementation is faster than the others, but is it just an impression? And if it really is faster, how much faster is it compared to a naive implementation written without any in-depth speed considerations? The algorithm has an intrinsic O(n) complexity (without any further knowledge about the length of the string), so differences in computation time are due only to implementation tricks.
Googling around, I found a similar question on StackOverflow, and I was surprised that the accepted answer is the weakest one, chosen without any testing done.
So I took some naive implementations into account and tested the different implementations on the same machine, with VS and its stdlib implementation (known to be not the best compiler/library in terms of performance), using this setup:
- HW: AMD FX(tm)-4300 Quad-Core Processor @ 3.00 GHz, MB ASRock 970 Extreme4, 16 GB DDR3
- OS: Windows 7 Pro 64-bit, SP1 (5.9 performance index)
- Compiler:
- Visual Studio 2013 Ultimate (RELEASE mode)
- Visual C++ Compiler Nov 2013 CTP (CTP_Nov2013)
Test methodology
- I prepare a random array of std::string of different lengths: 4, 8, 32, 256, 1024, 8192, 65536 chars; all chars are in the ASCII range [32..255], except the '|' char (7Ch)
- All times are taken in an easy way, with the High Precision Timer infrastructure of Windows. It's not the best method, but it was the fastest to implement using an old RAII library from 2000; it uses the profileapi.h API of the Windows SDK (QueryPerformanceCounter & QueryPerformanceFrequency) – see the sketch after this list. Times are expressed in seconds, so lower is better.
- Every routine is executed 1,000,000 times, to reduce sporadic errors, and is invoked looking for the '|' symbol, which never occurs due to how the strings are built (point 1); so the algorithm has a cost of O(n), with n = string length
- Each routine is implemented as a callable function (using the fastcall calling convention)
- A brief, preliminary integrity check of the functions was done, to verify that each implementation is always correct (10,000 random test cases)
- External interference: I killed all applications and services on Windows known to have relevant CPU/IO activity; maybe Windows is not the best platform for performance measurements, but I will adopt the new C++11 time facility to make the code portable, and I will update the results with Linux/Mac performance measurements (they both have a faster function calling mechanism)
- The program was built in RELEASE mode, in 32-bit mode, with VS2013 Ultimate
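For reference, here is a minimal sketch of a QueryPerformanceCounter-based RAII timer of the kind described above (my reconstruction, not the actual library I used; the names ScopedTimer and elapsed_seconds are mine):

#include <windows.h>   // QueryPerformanceCounter / QueryPerformanceFrequency
#include <cstdio>

// Minimal RAII timer: starts on construction, writes the elapsed
// seconds to the given variable on destruction.
class ScopedTimer {
public:
    explicit ScopedTimer(double & out) : out_(out) {
        QueryPerformanceFrequency(&freq_);
        QueryPerformanceCounter(&start_);
    }
    ~ScopedTimer() {
        LARGE_INTEGER end;
        QueryPerformanceCounter(&end);
        out_ = double(end.QuadPart - start_.QuadPart) / double(freq_.QuadPart);
    }
private:
    double & out_;
    LARGE_INTEGER freq_, start_;
};

// Usage sketch: time 1,000,000 invocations of a routine under test.
// double elapsed_seconds = 0.0;
// {
//     ScopedTimer t(elapsed_seconds);
//     for (int i = 0; i < 1000000; ++i)
//         found ^= routine_under_test(s, '|');   // keep the result live
// }
// std::printf("%f s\n", elapsed_seconds);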
Implementations examined
Implementation #1: strchr
Nothing special to say here: I use a std::string as the string container, so I access the underlying char array using the std::string::c_str() method.
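In code, the call under test looks roughly like this (a sketch; the wrapper name stdlib_find is mine):

#include <cstring>
#include <string>

// Implementation #1: delegate to the CRT strchr on the underlying char array.
static bool stdlib_find(const std::string & s, char key)
{
    return std::strchr(s.c_str(), key) != nullptr;
}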
Implementation #2: std::string::find
Nothing special to say: directly call the std::string::find routine and compare the result with std::string::npos.
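Again a sketch for reference (the wrapper name string_find is mine):

#include <string>

// Implementation #2: delegate to std::string::find and compare with npos.
static bool string_find(const std::string & s, char key)
{
    return s.find(key) != std::string::npos;
}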
Implementation #3: ANSI strchr found on the internet
I found this implementation googling around; take it as a bonus entry.
/// implementation found here:
/// http://uos-embedded.googlecode.com/svn/trunk/sources/runtime/strchr-fast.c

/* Nonzero if X is not aligned on a "long" boundary. */
#define UNALIGNED(X) ((long)X & (sizeof (long) - 1))

/* How many bytes are loaded each iteration of the word copy loop. */
#define LBLOCKSIZE (sizeof (long))

#define DETECTNULL(X) (((X) - 0x01010101) & ~(X) & 0x80808080)

/* DETECTCHAR returns nonzero if (long)X contains the byte used
   to fill (long)MASK. */
#define DETECTCHAR(X,MASK) (DETECTNULL(X ^ MASK))

unsigned char * strchr2(const unsigned char *s, unsigned char i)
{
    unsigned char c = (unsigned char)i;
    unsigned long mask, j;
    unsigned long *aligned_addr;

    if (!UNALIGNED(s))
    {
        mask = 0;
        for (j = 0; j < LBLOCKSIZE; j++)
            mask = (mask << 8) | c;

        aligned_addr = (unsigned long*)s;
        while (!DETECTNULL(*aligned_addr) && !DETECTCHAR(*aligned_addr, mask))
            aligned_addr++;

        /* The block of bytes currently pointed to by aligned_addr
           contains either a null or the target char, or both. We
           catch it using the bytewise search. */
        s = (unsigned char*)aligned_addr;
    }

    while (*s && *s != c)
        s++;
    if (*s == c)
        return (unsigned char*)s;
    return 0;
}
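The interesting part is the DETECTNULL bit trick, which flags a zero byte anywhere in a 32-bit word without a loop: subtracting 0x01 from each byte can only set the byte's top bit (while the original top bit was clear) when that byte was zero. A quick sanity check of the trick (my illustration, not part of the original source):

#include <cassert>
#include <cstdio>

#define DETECTNULL(X) (((X) - 0x01010101UL) & ~(X) & 0x80808080UL)

int main()
{
    assert(DETECTNULL(0x41004242UL) != 0);  // word contains a zero byte -> detected
    assert(DETECTNULL(0x41424344UL) == 0);  // "ABCD": no zero byte -> no false positive
    std::puts("DETECTNULL ok");
    return 0;
}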
Implementation #4: naive simple algorithm based on the STL
This implementation is very straightforward and easy to write: check every single char in the string using the default std::string operator[] accessor. The length is cached in the n variable (maybe a useless optimization for today's compilers).
#include <string>

bool naive_container_find(const std::string & s, char key)
{
    const std::string::size_type n = s.length();  // cache the length
    for (std::string::size_type i = 0; i < n; ++i)
    {
        if (s[i] == key)
            return true;
    }
    return false;
}
Implementation #5: naive C-like implementation
Even simpler: access all the chars of the string sequentially using pointer arithmetic.
bool my_fast_custom_find_one_byte(const unsigned char * pS, unsigned char key)
{
    while (*pS)
    {
        if (*pS == key)
            return true;
        pS++;
    }
    return false;
}
Implementation #6: empty function
I wanted to include the cost of a bare function call as a reminder of the jump overhead; I found that VS2013 performs a neat optimization, eliding a do-nothing function (one containing only a constant return false instruction), so I made it return a global variable instead, so that the optimizer is not allowed to cut off the subroutine call.
/// For test purposes: a do-nothing function.
bool global_flag;

static bool nothing_do_function(const std::string & s, char key)
{
    return global_flag;
}
Results
The following chart shows that with increasing string length the #1 cstdlib strchr routine does very well compared to the others, while for lower sizes the startup cost of strchr makes it perform worse than the others:
The stdlib must be doing some good optimization; since this is a memory-intensive algorithm, I expect that its memory handling is considerably better than the other algorithms'. So let's look at the disassembly of the strchr function:
0FAC4140  cmp         dword ptr ds:[0FBDB0DCh],1
0FAC4147  jb          0FAC41A8
0FAC4149  movzx       eax,byte ptr [esp+8]
0FAC414E  mov         edx,eax
0FAC4150  shl         eax,8
0FAC4153  or          edx,eax
0FAC4155  movd        xmm3,edx
0FAC4159  pshuflw     xmm3,xmm3,0
0FAC415E  movlhps     xmm3,xmm3
0FAC4161  mov         edx,dword ptr [esp+4]
0FAC4165  mov         ecx,0Fh
0FAC416A  or          eax,0FFFFFFFFh
0FAC416D  and         ecx,edx
0FAC416F  shl         eax,cl
0FAC4171  sub         edx,ecx
0FAC4173  movdqu      xmm1,xmmword ptr [edx]
0FAC4177  pxor        xmm2,xmm2
0FAC417B  pcmpeqb     xmm2,xmm1
0FAC417F  pcmpeqb     xmm1,xmm3
0FAC4183  por         xmm2,xmm1
0FAC4187  pmovmskb    ecx,xmm2
0FAC418B  and         ecx,eax
0FAC418D  jne         0FAC4197
0FAC418F  or          eax,0FFFFFFFFh
0FAC4192  add         edx,10h
0FAC4195  jmp         0FAC4173
0FAC4197  bsf         eax,ecx
0FAC419A  add         eax,edx
0FAC419C  movd        edx,xmm3
0FAC41A0  xor         ecx,ecx
0FAC41A2  cmp         dl,byte ptr [eax]
0FAC41A4  cmovne      eax,ecx
0FAC41A7  ret
As you can see, the function makes heavy use of the xmm registers (the 128-bit registers of the SSE extension on the x86 platform – see more here); reading 16 bytes at a time (movdqu, then add edx,10h) allows cache-friendly code, and this is the key point of this fast implementation.
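To make the listing easier to follow, here is a rough reconstruction of the same idea in SSE2 intrinsics (my sketch under my reading of the disassembly, not the actual CRT source; the function name strchr_sse2_sketch is mine, and _BitScanForward is MSVC-specific):

#include <emmintrin.h>  // SSE2 intrinsics
#include <intrin.h>     // _BitScanForward (MSVC)
#include <cstdint>

const char * strchr_sse2_sketch(const char * s, char c)
{
    const __m128i needle = _mm_set1_epi8(c);      // broadcast c into all 16 lanes
    const __m128i zero   = _mm_setzero_si128();

    // Align the pointer down to a 16-byte boundary; over-reading within the
    // same aligned block never crosses a page, and a bit mask discards the
    // bytes before s (the shl eax,cl / and ecx,eax dance in the listing).
    std::uintptr_t addr   = reinterpret_cast<std::uintptr_t>(s);
    unsigned       offset = static_cast<unsigned>(addr & 15);
    const char *   p      = s - offset;
    unsigned       valid  = ~0u << offset;

    for (;;)
    {
        __m128i block = _mm_load_si128(reinterpret_cast<const __m128i *>(p));
        __m128i hit   = _mm_or_si128(_mm_cmpeq_epi8(block, zero),    // '\0' found?
                                     _mm_cmpeq_epi8(block, needle)); // c found?
        unsigned mask = static_cast<unsigned>(_mm_movemask_epi8(hit)) & valid;
        if (mask)
        {
            unsigned long idx;
            _BitScanForward(&idx, mask);          // index of the first match (bsf)
            const char * found = p + idx;
            return (*found == c) ? found : nullptr;  // hit may be the terminator
        }
        valid = ~0u;   // all 16 bytes are valid from now on
        p += 16;
    }
}

The first two instructions of the real listing (cmp dword ptr ds:[…],1 / jb) are presumably a CPU feature check that falls back to a scalar path when SSE2 is unavailable; the sketch omits that.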
We can also note that #2 and #3 show a behavior tending asymptotically to a factor of 4x. Looking at the memory bandwidth consumed by the routines:
I can see that #1 is very near to my board's maximum memory read speed, which is 12 GB/s.
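As a back-of-envelope check (my arithmetic, from the numbers above): the 65536-char case scans 65536 B × 1,000,000 iterations ≈ 65.5 GB of data, so at the board's 12 GB/s read limit no implementation can finish that test in less than roughly 65.5 / 12 ≈ 5.5 seconds; an implementation whose measured time approaches that bound is memory-bound rather than compute-bound.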
Future work
I would like to spend a little more time studying the memory/cache behavior. To be continued…
Resources
- List of some benchmark tools
- Here is a similar, more in-depth discussion of a new, improved implementation