Listed from least expensive to most expensive at run-time:
str::strtok
is the cheapest standard provided tokenization method, it also allows the delimiter to be modified between tokens, but it incurs 3 difficulties with modern C++:
std::strtok
cannot be used on multiple strings
at the same time (though some implementations do extend to support this, such as: strtok_s
)std::strtok
cannot be used on multiple threads simultaneously (this may however be implementation defined, for example: Visual Studio's implementation is thread safe)std::strtok
modifies the std::string
it is operating on, so it cannot be used on const string
s, const char*
s, or literal strings, to tokenize any of these with std::strtok
or to operate on a std::string
who's contents need to be preserved, the input would have to be copied, then the copy could be operated onGenerally any of these options cost will be hidden in the allocation cost of the tokens, but if the cheapest algorithm is required and std::strtok
's difficulties are not overcomable consider a hand-spun solution.
// String to tokenize
std::string str{ "The quick brown fox" };
// Vector to store tokens
vector<std::string> tokens;
for (auto i = strtok(&str[0], " "); i != NULL; i = strtok(NULL, " "))
tokens.push_back(i);
std::istream_iterator
uses the stream's extraction operator iteratively. If the input std::string
is white-space delimited this is able to expand on the std::strtok
option by eliminating its difficulties, allowing inline tokenization thereby supporting the generation of a const vector<string>
, and by adding support for multiple delimiting white-space character:// String to tokenize
const std::string str("The quick \tbrown \nfox");
std::istringstream is(str);
// Vector to store tokens
const std::vector<std::string> tokens = std::vector<std::string>(
std::istream_iterator<std::string>(is),
std::istream_iterator<std::string>());
std::regex_token_iterator
uses a std::regex
to iteratively tokenize. It provides for a more flexible delimiter definition. For example, non-delimited commas and white-space:// String to tokenize
const std::string str{ "The ,qu\\,ick ,\tbrown, fox" };
const std::regex re{ "\\s*((?:[^\\\\,]|\\\\.)*?)\\s*(?:,|$)" };
// Vector to store tokens
const std::vector<std::string> tokens{
std::sregex_token_iterator(str.begin(), str.end(), re, 1),
std::sregex_token_iterator()
};
See the regex_token_iterator
Example for more details.