Introduction – the optimization dilemma
Sometimes it can be useful to test the real performance of a basic algorithm as implemented by the vendor of a compiler/library/framework, to better understand if and how to improve the performance of a critical code path. Remember that, in the optimization field, this rule always applies: premature optimization is the root of all evil (or at least most of it) in programming.
This time I will spend a little effort studying the behaviour of the stdlib "strchr"-like routines in the Visual Studio 2013 implementation. My gut feeling says that the stdlib implementation is faster than the others, but is it just an impression? And if it really is faster, how much faster is it compared to a naive implementation written without any in-depth speed considerations? The algorithm has an intrinsic O(n) complexity (without any further knowledge about the length of the string), so differences in computation time are due only to implementation tricks.
Googling around, I found a similar question on StackOverflow, and I was surprised that the accepted answer is the weakest one, chosen without any testing done.
So I took some naive implementations into account and tested the different implementations on the same machine, with VS and its stdlib implementation (known to be not the best compiler/library in terms of performance), using this setup:
- HW: AMD FX(tm)-4300 Quad-Core Processor @ 3.00 GHz, MB ASRock 970 Extreme4, 16 GB DDR3
- OS: Windows 7 Pro 64-bit, SP1 (5.9 performance index)
- Compiler:
- Visual Studio 2013 Ultimate (RELEASE mode)
- Visual C++ Compiler Nov 2013 CTP (CTP_Nov2013)
Test methodology
- I prepare a random array of std::string of different lengths: 4, 8, 32, 256, 1024, 8192, 65536 chars; all chars are in the ASCII range [32..255], except the '|' char (7Ch)
- All times are taken in an easy way, with the High Precision Timer infrastructure of Windows. It's not the best method, but it was the fastest to implement using an old RAII library from 2000; it uses the profileapi.h API of the Windows SDK (QueryPerformanceCounter & QueryPerformanceFrequency) – see the sketch after this list. Times are expressed in seconds, so lower is better.
- Every routine is executed 1,000,000 times, to reduce sporadic errors, and is invoked looking for the '|' symbol, which never occurs due to how the strings are built (point 1); so the algorithm has a cost of O(n), with n = string length
- Each routine is implemented as a callable function (using the fastcall calling convention)
- A brief, preliminary integrity check of the functions was done, to verify that each implementation is always correct (10,000 random test cases)
- External interference: I killed all applications and services on Windows known to have relevant CPU/IO activity; maybe Windows is not the best platform for performance measurements, but I will adopt the new C++11 time facility to make the code portable, and I will update the results with Linux/Mac performance measurements (they both have a faster function calling mechanism)
- The program was built in RELEASE mode, in 32-bit mode, with VS2013 Ultimate
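For reference, here is a minimal sketch of a QueryPerformanceCounter-based RAII timer of the kind described above (my reconstruction, not the actual library I used; the names ScopedTimer and elapsed_seconds are mine):

#include <windows.h>   // QueryPerformanceCounter / QueryPerformanceFrequency
#include <cstdio>

// Minimal RAII timer: starts on construction, writes the elapsed
// seconds to the given variable on destruction.
class ScopedTimer {
public:
    explicit ScopedTimer(double & out) : out_(out) {
        QueryPerformanceFrequency(&freq_);
        QueryPerformanceCounter(&start_);
    }
    ~ScopedTimer() {
        LARGE_INTEGER end;
        QueryPerformanceCounter(&end);
        out_ = double(end.QuadPart - start_.QuadPart) / double(freq_.QuadPart);
    }
private:
    double & out_;
    LARGE_INTEGER freq_, start_;
};

// Usage sketch: time 1,000,000 invocations of a routine under test.
// double elapsed_seconds = 0.0;
// {
//     ScopedTimer t(elapsed_seconds);
//     for (int i = 0; i < 1000000; ++i)
//         found ^= routine_under_test(s, '|');   // keep the result live
// }
// std::printf("%f s\n", elapsed_seconds);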
Implementations examined
Implementation #1: strchr
Nothing special to say here: I use a std::string as the string container, so I access the underlying char array using the std::string::c_str() method.
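In code, the call under test looks roughly like this (a sketch; the wrapper name stdlib_find is mine):

#include <cstring>
#include <string>

// Implementation #1: delegate to the CRT strchr on the underlying char array.
static bool stdlib_find(const std::string & s, char key)
{
    return std::strchr(s.c_str(), key) != nullptr;
}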
Implementation #2: std::string::find
Nothing special to say: directly call the std::string::find routine and compare the result with std::string::npos.
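Again a sketch for reference (the wrapper name string_find is mine):

#include <string>

// Implementation #2: delegate to std::string::find and compare with npos.
static bool string_find(const std::string & s, char key)
{
    return s.find(key) != std::string::npos;
}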
Implementation #3: ANSI strchr found on the internet
I found this implementation googling around; take it as a bonus entry.
/// implementation found here:
/// http://uos-embedded.googlecode.com/svn/trunk/sources/runtime/strchr-fast.c

/* Nonzero if X is not aligned on a "long" boundary. */
#define UNALIGNED(X) ((long)X & (sizeof (long) - 1))

/* How many bytes are loaded each iteration of the word copy loop. */
#define LBLOCKSIZE (sizeof (long))

#define DETECTNULL(X) (((X) - 0x01010101) & ~(X) & 0x80808080)

/* DETECTCHAR returns nonzero if (long)X contains the byte used
   to fill (long)MASK. */
#define DETECTCHAR(X,MASK) (DETECTNULL(X ^ MASK))

unsigned char * strchr2(const unsigned char *s, unsigned char i)
{
    unsigned char c = (unsigned char)i;
    unsigned long mask, j;
    unsigned long *aligned_addr;

    if (!UNALIGNED(s))
    {
        mask = 0;
        for (j = 0; j < LBLOCKSIZE; j++)
            mask = (mask << 8) | c;

        aligned_addr = (unsigned long*)s;
        while (!DETECTNULL(*aligned_addr) && !DETECTCHAR(*aligned_addr, mask))
            aligned_addr++;

        /* The block of bytes currently pointed to by aligned_addr
           contains either a null or the target char, or both. We
           catch it using the bytewise search. */
        s = (unsigned char*)aligned_addr;
    }

    while (*s && *s != c)
        s++;
    if (*s == c)
        return (unsigned char*)s;
    return 0;
}
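The interesting part is the DETECTNULL bit trick, which flags a zero byte anywhere in a 32-bit word without a loop: subtracting 0x01 from each byte can only set the byte's top bit (while the original top bit was clear) when that byte was zero. A quick sanity check of the trick (my illustration, not part of the original source):

#include <cassert>
#include <cstdio>

#define DETECTNULL(X) (((X) - 0x01010101UL) & ~(X) & 0x80808080UL)

int main()
{
    assert(DETECTNULL(0x41004242UL) != 0);  // word contains a zero byte -> detected
    assert(DETECTNULL(0x41424344UL) == 0);  // "ABCD": no zero byte -> no false positive
    std::puts("DETECTNULL ok");
    return 0;
}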
Implementation #4: naive simple algorithm based on the STL
This implementation is very straightforward and easy to write: check every single char in the string using the default std::string operator[] accessor. The length is cached in the n variable (maybe a useless optimization for today's compilers).
#include <string>

bool naive_container_find(const std::string & s, char key)
{
    const std::string::size_type n = s.length();  // cache the length
    for (std::string::size_type i = 0; i < n; ++i)
    {
        if (s[i] == key)
            return true;
    }
    return false;
}
Implementation #5: naive C-like implementation
Even simpler: access all the chars of the string sequentially using pointer arithmetic.
bool my_fast_custom_find_one_byte(const unsigned char * pS, unsigned char key)
{
    while (*pS)
    {
        if (*pS == key)
            return true;
        pS++;
    }
    return false;
}
Implementation #6: empty function
I wanted to include the cost of a bare function call as a reminder of the jump overhead; I found that VS2013 performs a neat optimization, eliding a do-nothing function (one containing only a constant return false instruction), so I made it return a global variable instead, so that the optimizer is not allowed to cut off the subroutine call.
/// For test purposes: a do-nothing function.
bool global_flag;

static bool nothing_do_function(const std::string & s, char key)
{
    return global_flag;
}
Results
The following chart shows that with increasing string length the #1 cstdlib strchr routine does very well compared to the others, while for lower sizes the startup cost of strchr makes it perform worse than the others:
The stdlib must be doing some good optimization; since this is a memory-intensive algorithm, I expect that its memory handling is considerably better than the other algorithms'. So let's look at the disassembly of the strchr function:
0FAC4140  cmp         dword ptr ds:[0FBDB0DCh],1
0FAC4147  jb          0FAC41A8
0FAC4149  movzx       eax,byte ptr [esp+8]
0FAC414E  mov         edx,eax
0FAC4150  shl         eax,8
0FAC4153  or          edx,eax
0FAC4155  movd        xmm3,edx
0FAC4159  pshuflw     xmm3,xmm3,0
0FAC415E  movlhps     xmm3,xmm3
0FAC4161  mov         edx,dword ptr [esp+4]
0FAC4165  mov         ecx,0Fh
0FAC416A  or          eax,0FFFFFFFFh
0FAC416D  and         ecx,edx
0FAC416F  shl         eax,cl
0FAC4171  sub         edx,ecx
0FAC4173  movdqu      xmm1,xmmword ptr [edx]
0FAC4177  pxor        xmm2,xmm2
0FAC417B  pcmpeqb     xmm2,xmm1
0FAC417F  pcmpeqb     xmm1,xmm3
0FAC4183  por         xmm2,xmm1
0FAC4187  pmovmskb    ecx,xmm2
0FAC418B  and         ecx,eax
0FAC418D  jne         0FAC4197
0FAC418F  or          eax,0FFFFFFFFh
0FAC4192  add         edx,10h
0FAC4195  jmp         0FAC4173
0FAC4197  bsf         eax,ecx
0FAC419A  add         eax,edx
0FAC419C  movd        edx,xmm3
0FAC41A0  xor         ecx,ecx
0FAC41A2  cmp         dl,byte ptr [eax]
0FAC41A4  cmovne      eax,ecx
0FAC41A7  ret
As you can see, the function makes heavy use of the xmm registers (the 128-bit registers of the SSE extension on the x86 platform – see more here); reading 16 bytes at a time (movdqu, then add edx,10h) allows cache-friendly code, and this is the key point of this fast implementation.
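To make the listing easier to follow, here is a rough reconstruction of the same idea in SSE2 intrinsics (my sketch under my reading of the disassembly, not the actual CRT source; the function name strchr_sse2_sketch is mine, and _BitScanForward is MSVC-specific):

#include <emmintrin.h>  // SSE2 intrinsics
#include <intrin.h>     // _BitScanForward (MSVC)
#include <cstdint>

const char * strchr_sse2_sketch(const char * s, char c)
{
    const __m128i needle = _mm_set1_epi8(c);      // broadcast c into all 16 lanes
    const __m128i zero   = _mm_setzero_si128();

    // Align the pointer down to a 16-byte boundary; over-reading within the
    // same aligned block never crosses a page, and a bit mask discards the
    // bytes before s (the shl eax,cl / and ecx,eax dance in the listing).
    std::uintptr_t addr   = reinterpret_cast<std::uintptr_t>(s);
    unsigned       offset = static_cast<unsigned>(addr & 15);
    const char *   p      = s - offset;
    unsigned       valid  = ~0u << offset;

    for (;;)
    {
        __m128i block = _mm_load_si128(reinterpret_cast<const __m128i *>(p));
        __m128i hit   = _mm_or_si128(_mm_cmpeq_epi8(block, zero),    // '\0' found?
                                     _mm_cmpeq_epi8(block, needle)); // c found?
        unsigned mask = static_cast<unsigned>(_mm_movemask_epi8(hit)) & valid;
        if (mask)
        {
            unsigned long idx;
            _BitScanForward(&idx, mask);          // index of the first match (bsf)
            const char * found = p + idx;
            return (*found == c) ? found : nullptr;  // hit may be the terminator
        }
        valid = ~0u;   // all 16 bytes are valid from now on
        p += 16;
    }
}

The first two instructions of the real listing (cmp dword ptr ds:[…],1 / jb) are presumably a CPU feature check that falls back to a scalar path when SSE2 is unavailable; the sketch omits that.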
We can also note that #2 and #3 show a behavior tending asymptotically to a factor of 4x. Looking at the memory bandwidth consumed by the routines:
I can see that #1 is very near to my board's maximum memory read speed, which is 12 GB/s.
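As a back-of-envelope check (my arithmetic, from the numbers above): the 65536-char case scans 65536 B × 1,000,000 iterations ≈ 65.5 GB of data, so at the board's 12 GB/s read limit no implementation can finish that test in less than roughly 65.5 / 12 ≈ 5.5 seconds; an implementation whose measured time approaches that bound is memory-bound rather than compute-bound.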
Future work
I would like to spend a little more time studying the memory/cache behavior. To be continued…
Resources
- List of some benchmark tools
- Here is a similar, more in-depth discussion of a new, improved implementation