A Quick Study On Tokenizing

I decided that it would be useful for me to have an OBJ parser in my pile of code. There is a lot of OBJ content kicking around, for example Morgan McGuire’s excellent data repository.

OBJ is a text-based format, which means I’d need to write a parser, and 3D models are big, which means my parser shouldn’t be too slow. This got me thinking about the various options for tokenizing a string in C++, which led me to this StackOverflow thread.

There are a variety of options mentioned. The objective of this post is to benchmark them. Let’s meet the contenders…

Option 1: Boost

Lots of people like boost, and it has a nice, easy-to-use tokenizer (along with gobs and gobs of other things). Boost was the accepted answer on the StackOverflow thread, and it is so popular that some commenters went so far as to declare it unreasonable not to assume it's available. Here is the boost tokenizer benchmark:

void DoBoost( std::ofstream& cout, std::string& text )
{
    boost::char_separator<char> sep(" \n\t\r\f");
    boost::tokenizer<boost::char_separator<char>> tokens(text, sep);
    for (const auto& t : tokens) {
        cout << t;
    }
}
I used boost 1.59.0 for this, built as per boost’s instructions.

Option 2: Stream Iterators

Somebody else suggested using stream iterators, which I didn’t know were a thing. This lets us build a non-boost tokenizer using only standard library code. Here’s the original from the StackOverflow post:

void DoIterator( std::ofstream& cout, std::string& str )
{
    // construct a stream from the string
    std::stringstream strstr(str);
    // use stream iterators to copy the stream to the vector as whitespace separated strings
    std::istream_iterator<std::string> it(strstr);
    std::istream_iterator<std::string> end;
    std::vector<std::string> results(it, end);
    // send the vector to the output stream
    std::ostream_iterator<std::string> oit(cout);
    std::copy(results.begin(), results.end(), oit);
}

Now, no offense to whoever wrote that, but, as we’ll see, sucking the entire token set into an std::vector of std::strings is, well, wrong :) and would not have been a fair comparison. I modified it thusly:

void DoIteratorCorrectly( std::ofstream& cout, std::string& str )
{
    // construct a stream from the string
    std::stringstream strstr(str);
    // use stream iterators to read individual strings
    std::istream_iterator<std::string> it(strstr);
    std::istream_iterator<std::string> end;
    std::for_each( it, end, [&cout]( const std::string& str ) { cout << str; } );
}

Option 3: Strtok

Some grizzled veteran suggested strtok, and was predictably chided for suggesting a non-thread-safe, C-like solution. Somebody else pointed out that strtok needs a non-const pointer to a null-terminated string, which is not something you find lying around very often in C++. I’m not a big fan of strtok either, but let’s give it a whirl.
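As an aside, the reentrancy complaint has a standard fix on most platforms: strtok_r (POSIX; MSVC spells the same idea strtok_s) keeps its cursor in a caller-owned pointer instead of hidden static state. A minimal sketch, with a helper name of my own invention:

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Sketch: reentrant tokenizing with strtok_r. Taking the string by value
// gives us the mutable, null-terminated buffer the strtok family needs.
std::vector<std::string> TokenizeReentrant( std::string str )
{
    std::vector<std::string> out;
    char* state = nullptr; // caller-owned cursor; no hidden static state
    for( char* p = strtok_r( &str[0], " \n\t\r\f", &state );
         p != nullptr;
         p = strtok_r( nullptr, " \n\t\r\f", &state ) )
    {
        out.push_back( p );
    }
    return out;
}
```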

In order to make it non-destructive, we’ll even take a big hit and create our own mutable copy of the input string.

void DoStrtok( std::ofstream& cout, std::string& str )
{
    char* pMutableString = (char*) malloc( str.size()+1 );
    strcpy( pMutableString, str.c_str() );
    char* p = strtok( pMutableString, " \n\t\r\f" );
    while (p) {
        cout << p;
        p = strtok( NULL, " \n\t\r\f" );
    }
    free( pMutableString );
}

To be fair, you couldn’t just copy the whole thing like that if you were streaming tokens from a very large on-disk file, but we could address this with a little more work by wrapping some streaming buffers around it.
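For illustration, one way that streaming variant might look is sketched below; the chunk size and carry logic are mine, not part of the benchmark. Each chunk is tokenized in place, and anything after the chunk's last delimiter may be a partial token, so it is carried into the next read:

```cpp
#include <cassert>
#include <cstring>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Sketch: stream tokens from an arbitrarily large input without ever
// holding the whole thing in memory at once.
std::vector<std::string> TokenizeStreamInChunks( std::istream& in, size_t chunkSize )
{
    const char* DELIMS = " \n\t\r\f";
    std::vector<std::string> tokens;
    std::string pending;                 // carried text not yet tokenized
    std::vector<char> buf( chunkSize );
    for( ;; )
    {
        in.read( buf.data(), (std::streamsize)buf.size() );
        std::streamsize n = in.gcount();
        if( n == 0 )
            break;
        pending.append( buf.data(), (size_t)n );

        // Everything up to the last delimiter is safe to tokenize now.
        size_t cut = pending.find_last_of( DELIMS );
        if( cut == std::string::npos )
            continue;                    // no complete token in hand yet

        std::string region = pending.substr( 0, cut + 1 );
        pending.erase( 0, cut + 1 );
        for( char* p = strtok( &region[0], DELIMS ); p; p = strtok( NULL, DELIMS ) )
            tokens.push_back( p );
    }
    if( !pending.empty() )
        tokens.push_back( pending );     // final token had no trailing delimiter
    return tokens;
}
```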

Option 4: Doing It Ourselves

Everybody always says you shouldn’t roll your own code for anything interesting, because the standard library’s always going to be faster and safer, but let’s ignore them. My home brew has a few disadvantages. It doesn’t handle UTF-8 or locales or any of those niceties, but it’s adequate for my target: parsing pure ASCII files containing data that… really shouldn’t be ASCII-encoded in the first place.

static bool IsDelim( char tst )
{
    const char* DELIMS = " \n\t\r\f";
    do // Delimiter string cannot be empty, so don't check for it.  Real code should assert on it.
    {
        if( tst == *DELIMS )
            return true;
        ++DELIMS;
    } while( *DELIMS );
    return false;
}

void DoJoshsWay( std::ofstream& cout, std::string& str )
{
    char* pMutableString = (char*) malloc( str.size()+1 );
    strcpy( pMutableString, str.c_str() );
    char* p = pMutableString;

    // skip leading delimiters
    while( *p && IsDelim(*p) )
        ++p;

    while( *p )
    {
        // note start of token
        char* pTok = p;

        do // skip non-delimiters
        {
            ++p;
        } while( *p && !IsDelim(*p) );

        if( *p ) // clobber trailing delimiter with null
        {
            *p = 0;
            ++p; // and step past it
        }
        cout << pTok; // send the token

        // skip any subsequent trailing delimiters
        while( *p && IsDelim(*p) )
            ++p;
    }

    free( pMutableString );
}


The contest shall be to tokenize the 20MB obj for ‘crytek_sponza’, which may be found here. I’m testing it by loading the entire file into memory ahead of time and tokenizing out of a string stream, then writing the resulting tokens back to disk. This lets me run multiple test cases back to back without having to worry about disk caching skewing the results. There is I/O on the other end, of course, but since my process never reads that stuff it shouldn’t interfere too much. Whatever noise there is should be evenly distributed amongst the test cases. Results presented are averages over five consecutive runs as measured by boost’s timer. My CPU is a Core i3-4010U. The code for the test is here. I compiled this for x86 with MSVC 2013 express edition. Here is the command line it used:
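In outline, the harness looks something like the following. The structure matches the test just described, but the names are mine, and std::chrono stands in for boost's timer to keep the sketch self-contained (the real tokenizers above take a std::ofstream rather than a generic std::ostream):

```cpp
#include <cassert>
#include <chrono>
#include <ostream>
#include <sstream>
#include <string>

// Sketch of the benchmark harness: run one tokenizer several times over the
// in-memory text and return the mean wall-clock seconds per run.
template <typename Fn>
double TimeTokenizer( std::string& text, Fn tokenize, int runs = 5 )
{
    using Clock = std::chrono::steady_clock;
    double total = 0.0;
    for( int i = 0; i < runs; ++i )
    {
        std::ostringstream out;          // stand-in for the on-disk token file
        auto t0 = Clock::now();
        tokenize( out, text );           // e.g. DoBoost, DoStrtok, ...
        total += std::chrono::duration<double>( Clock::now() - t0 ).count();
    }
    return total / runs;
}
```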

/GS /GL /analyze- /W3 /Gy /Zc:wchar_t /I".\boost_1_59_0" /Zi /Gm- /O2 /Ob2 /Fd"Release\vc120.pdb" /fp:precise /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /Gd /Oy- /Oi /MD /Fa"Release\" /EHsc /nologo /Fo"Release\" /Ot /Fp"Release\Parse.pch" 


Here are the timing results.


Perhaps surprisingly (not to me), it is faster to do two passes over a 20MB string using C-style techniques than it is to do one pass using either of the more OOPy C++ approaches. This has nothing to do with cache-warming, by the way. 20MB is about 10 times the size of my machine’s L3.

More surprisingly (even to me), my homebrew is outperforming the strtok() implementation by a comfortable margin.

Keep in mind that I did not “micro-optimize” this code, apart from the sort of on-the-fly tweaking that I always do when I write code I don’t intend to throw away.

Source Code Size

Here is the code size for each solution, measured in C statements. For this purpose I define a statement as:

  • A control structure (if, while, do/while) OR
  • A function definition OR
  • Anything ending in a semicolon

This is a better metric than simply counting lines, since code formatting and comments can throw these off.


Unsurprisingly, rolling your own code results in… more code.


Broadly speaking, there is a certain “conventional C++ wisdom” which says the following:

You shouldn't roll your own code.  
The standard library was written by gurus and mere mortals won't do better.  
Even if you do, you'll write bugs, and you'll end up spending more time fixing them than it's worth.

The first part is clearly false, at least in the example I’ve shown. The second part has some validity. It is true that writing your own code will result in more bugs than using other, existing, well-tested code. However, given enough care and testing, any implementation can be made as bug-free as any other. Code for a task as well-defined as this CAN be brought to a point where it is well and truly finished. This is not always true of all code, but it is certainly true of the fundamental building blocks. Once the code is done, the only thing that matters is how good that code is at its job.

Someone else will say: “it’s a waste of time to re-write this code every time.” It is indeed wasteful to write the same loops again and again and again, but it would not take much to generalize the 14 statements I wrote and collect them into a simple helper class. Extra time spent on a component that is scalable and encapsulated is time invested in software quality, and your users will appreciate it.
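To make that concrete, here is one shape such a helper might take (the name Tokenize and the callback interface are my own): the scanning loop from above, generalized over the delimiter set, handing each token to a caller-supplied callback so no intermediate container is ever built.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Sketch: the hand-rolled scan generalized into a reusable helper.
// Invokes perToken once per token; callers decide what to do with each one.
template <typename Fn>
void Tokenize( const std::string& str, const char* delims, Fn&& perToken )
{
    std::string copy( str );             // own a writable buffer, as before
    char* p = &copy[0];
    auto isDelim = [delims]( char c ) { return c && std::strchr( delims, c ); };

    while( isDelim( *p ) )               // skip leading delimiters
        ++p;
    while( *p )
    {
        char* pTok = p;                  // note start of token
        do { ++p; } while( *p && !isDelim( *p ) );
        bool atEnd = ( *p == 0 );
        *p = 0;                          // clobber trailing delimiter
        perToken( pTok );
        if( atEnd )
            break;
        do { ++p; } while( isDelim( *p ) );   // skip subsequent delimiters
    }
}
```

A caller that does want a container can build one in the callback; a caller that just counts tokens, or parses them into floats on the fly, pays for nothing extra.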

Let’s view the performance results in a slightly different light:


The accepted solution to the StackOverflow thread takes 60% longer than a hand-coded solution. That is a very steep tax. I really, REALLY hope that the people who write my compilers are not paying this tax in their lexers, because I spend a sizable fraction of my life waiting for them to run. The amount of engineer-time annually spent waiting on text processing probably represents a non-trivial fraction of a typical engineering budget.

Still, you might scoff at the extra half-second of wall-clock, and you might assert that whatever I’m DOING with the tokens is going to always be so expensive that this won’t matter anyway. If you’re going to take that position, consider this: What is the average tax rate on that other code? In how many more places are you willing to pay 50, 60, or 70 percent? If you multiply enough small constants by 2, you will eventually arrive at a significant number. By the time you notice, it will generally be too late to do anything about it.