ansaurus

Question

A better way to split a string into an array of strings in C/C++ using whitespace as a delimiter.

Answer 1

A:

The best way to do it would be to use strtok. That link should be self-explanatory on how to use it, and you can use multiple delimiters as well. Very handy C function.

adam_0 2010-07-01 22:24:06

+1, but I'm sure someone will have a crazy C++ solution including bizarro syntax to take away your upvotes.

Carl Norum 2010-07-01 22:26:20

This is a good solution for C. If you want to use C++, there's nothing "crazy" about coding to the language's strengths.

Cogwheel - Matthew Orlando 2010-07-01 22:29:39

If `strtok` is the right answer, you've usually asked the wrong question.

Jerry Coffin 2010-07-01 22:34:25

@Jerry, what do you mean? Should it be strntok instead?

Hamish Grubijan 2010-07-01 22:48:42

@Hamish: Usually it should be (among other things) something that doesn't modify its input, and doesn't have quite such an error-prone interface.

Jerry Coffin 2010-07-01 22:53:16

@Jerry, sorry, confused again ... so, `strtok` can modify its input and it has an error-prone interface? How so? Where can I read more on this?

Hamish Grubijan 2010-07-01 23:04:33

@Hamish: notice how `strtok` does not say `const char *str`, this is because it modifies the input string.

SiegeX 2010-07-01 23:16:22

@Hamish: More accurately, `strtok` *will* (at least normally) modify its input -- that's how it's defined to work, and part of what makes it error-prone (e.g., passing it a string literal leads to undefined behavior). The sequence to use it is also clumsy -- call once passing the input string, then repeatedly passing NULL, stopping only when it returns NULL.

Jerry Coffin 2010-07-01 23:17:00
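
For reference, the call sequence Jerry Coffin describes above (one call with the buffer, then repeated calls with NULL until NULL comes back) might look like the sketch below. The helper name `split_strtok` and its default delimiter set are made up for illustration; the only library call used is the standard `std::strtok`. Because `strtok` overwrites delimiters with '\0', the sketch tokenizes a private copy rather than the caller's string.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): split `input` on any character in `delims`.
std::vector<std::string> split_strtok(const std::string& input,
                                      const char* delims = " \t\r\n") {
    // strtok needs a writable, NUL-terminated buffer, so work on a copy.
    std::vector<char> buf(input.begin(), input.end());
    buf.push_back('\0');

    std::vector<std::string> tokens;
    // First call passes the buffer; every later call passes NULL to continue
    // where the previous call stopped. A NULL return means no tokens are left.
    for (char* tok = std::strtok(&buf[0], delims); tok != NULL;
         tok = std::strtok(NULL, delims)) {
        tokens.push_back(tok);
    }
    return tokens;
}
```

Calling `split_strtok("kas\nhjkfh kjsdjkasf")` should give the three tokens "kas", "hjkfh" and "kjsdjkasf". Note that the hidden state inside `strtok` means two tokenizations cannot be interleaved, which is the thread-safety concern raised in the following comments.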

One small detail worth noting about strtok: it's not thread safe. I don't know if this applies to OP's question really, but still notable.

Tom 2010-07-01 23:45:38

@Tom: that's really an implementation detail. The version on most POSIX systems isn't. OTOH, the one in the MS multi-threaded library is thread safe (though non-trivially by allocating thread local storage).

Jerry Coffin 2010-07-02 01:03:12

There's usually nothing *wrong* with `strtok()` modifying its input - in most of the cases where you want to parse with `strtok()` you don't care about the unparsed string any more. The string literal complaint is also bogus; runtime parsing of a compile-time constant string is most definitely a corner case.

caf 2010-07-02 01:44:13

I am more worried about thread-safety than mutation. I could always make a copy of a string to be mutated, but I do want the result to be correct.

Hamish Grubijan 2010-07-02 02:49:14

@Hamish Grubijan: Most systems provide a `strtok()` alternative that can be used in thread-safe way - eg the re-entrant `strtok_r()`.

caf 2010-07-03 10:45:46
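
A rough sketch of the re-entrant variant caf mentions: `strtok_r` is POSIX, and to my knowledge MSVC ships `strtok_s` with the same three-argument shape, so the macro below is an assumption about the toolchain rather than something stated in this thread. The helper name `split_nested` is hypothetical. Because the position is kept in a caller-owned pointer instead of hidden static state, two tokenizations can be nested, which plain `strtok` cannot do.

```cpp
#include <string.h>
#include <string>
#include <vector>

// Assumption: POSIX strtok_r, or MSVC's strtok_s with the same argument order.
#ifdef _MSC_VER
#define STRTOK_REENTRANT strtok_s
#else
#define STRTOK_REENTRANT strtok_r
#endif

// Hypothetical helper: split into lines, then split each line into words.
std::vector<std::vector<std::string> > split_nested(const std::string& input) {
    std::vector<char> buf(input.begin(), input.end());
    buf.push_back('\0');                       // writable, NUL-terminated copy

    std::vector<std::vector<std::string> > lines;
    char* outer = NULL;                        // state of the line tokenizer
    for (char* line = STRTOK_REENTRANT(&buf[0], "\n", &outer); line != NULL;
         line = STRTOK_REENTRANT(NULL, "\n", &outer)) {
        std::vector<std::string> words;
        char* inner = NULL;                    // independent state for the word tokenizer
        for (char* word = STRTOK_REENTRANT(line, " \t\r", &inner); word != NULL;
             word = STRTOK_REENTRANT(NULL, " \t\r", &inner)) {
            words.push_back(word);
        }
        lines.push_back(words);
    }
    return lines;
}
```

Each thread (or each nesting level) owns its own state pointer, so nothing is shared between concurrent tokenizations as long as they work on separate buffers.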

Ah, nice ... so, can I do this in VS2010 without installing extra stuff? Do you mind posting a separate answer with an example of working code?

Hamish Grubijan 2010-07-03 17:26:33

@Hamish Grubijan: `strtok()` is a standard function. I can provide example code for `strtok()` in a few hours, when the code is available to me. Is this what you want, or are you looking for `strtok_r()`? I've never worked with the latter, but I could check it out if that's what you need. Let me know if this would help and if you're still looking at the problem.

adam_0 2010-07-19 21:47:24

@adam_0 If you do have a solution, then I could use it. I have not fixed the bug yet; other things had more priority.

Hamish Grubijan 2010-07-20 01:13:42

I'm sorry, I've been very busy, but here's an example on the web. Hope this helps. http://www.elook.org/programming/c/strtok.html

adam_0 2010-07-23 03:53:38

Answer 2

+3 A:

In C++, it's probably easiest to use a stringstream:

std::istringstream buffer("kas\nhjkfh kjsdjkasf");

std::vector<std::string> strings;

std::copy(std::istream_iterator<std::string>(buffer),
          std::istream_iterator<std::string>(),
          std::back_inserter(strings));

I haven't tried to stick to exactly the same signature, mostly because most of it is non-standard, so it doesn't apply to C++ in general.

Another possibility would be to use Boost::tokenizer, though obviously that does involve another library, so I won't try to cover it in more detail.

I'm not sure if that qualifies as "bizarro syntax" or not. I may have to work a bit on that part...

Edit: I've got it -- initialize the vector instead:

std::istringstream buffer("kas\nhjkfh kjsdjkasf");

std::vector<std::string> strings(
    (std::istream_iterator<std::string>(buffer)),
    std::istream_iterator<std::string>());

The "bizarro" part is that without the extra parentheses around the first argument, this would invoke the "most vexing parse", so it would declare a function instead of defining a vector. :-)
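
To make the parse issue concrete, here is a minimal sketch that wraps the same stream-iterator approach in a function. The name `split_whitespace` is hypothetical and not part of the answer; the commented-out lines show the form that would be read as a function declaration.

```cpp
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split `text` on whitespace using stream extraction.
std::vector<std::string> split_whitespace(const std::string& text) {
    std::istringstream buffer(text);

    // Without the extra parentheses, the compiler would read this as the
    // declaration of a function named `tokens` (the most vexing parse):
    //   std::vector<std::string> tokens(std::istream_iterator<std::string>(buffer),
    //                                   std::istream_iterator<std::string>());
    std::vector<std::string> tokens(
        (std::istream_iterator<std::string>(buffer)),
        std::istream_iterator<std::string>());
    return tokens;
}
```

Since `operator>>` skips any run of whitespace, leading, trailing and repeated separators never produce empty tokens.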

Edit2: As far as the edit to the question goes, it seems nearly impossible to answer directly -- it depends on too many types (e.g., CGXStyle, CLVDateTime) that are neither standard nor explained. I, for one, can't follow it in any detail at all. Offhand, this looks like a fairly poor design, letting the user enter things that are more or less ambiguous, and then trying to sort out the mess. Better to use a control that only allows unambiguous input to start with, and you can just read some fields that contain a date and time directly.

Edit3: code to do the splitting that also treats commas as separators could be done like this:

#include <iostream>
#include <locale>
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>
#include <sstream>

class my_ctype : public std::ctype<char> {
public:
    // static, because it is called before the base class has been constructed:
    static mask const *get_table() { 
        // this copies the "classic" table used by <ctype.h>:
        static std::vector<std::ctype<char>::mask> 
            table(classic_table(), classic_table()+table_size);

        // Anything we want to separate tokens, we mark its spot in the table as 'space'.
        table[','] = (mask)space;

        // and return a pointer to the table:
        return &table[0];
    }
    my_ctype(size_t refs=0) : std::ctype<char>(get_table(), false, refs) { }
};

int main() { 
    // put our data in a stream:
    std::istringstream buffer("first kas\nhjkfh kjsdjk,asf\tlast");

    // Create a ctype facet and tell the stream to use it for parsing tokens.
    // With refs == 0 the locale owns the facet and deletes it, so allocate it with new:
    buffer.imbue(std::locale(std::locale(), new my_ctype));

    // separate the stream into tokens:
    std::vector<std::string> strings(
        (std::istream_iterator<std::string>(buffer)),
        std::istream_iterator<std::string>());

    // copy the tokens to cout so we can see what we got:
    std::copy(strings.begin(), strings.end(), 
        std::ostream_iterator<std::string>(std::cout, "\n"));
    return 0;
}

Jerry Coffin 2010-07-01 22:36:58

Cool, we use VS2010 to compile this, so `Boost` is a bit of a stretch, but I am sure that lots of libraries are available.

Hamish Grubijan 2010-07-01 23:01:34

Wait, Jerry, where do I specify the list of characters to tokenize on? See answer by Beh Tou Cheh above for example. he has: `strtk::parse(data, ", \r\n", str_list);`.

Hamish Grubijan 2010-07-02 18:20:37

They're specified in the `locale` used by the stream. By default it'll only be whitespace, but you can create a locale with a `ctype facet` that uses whatever you want. http://stackoverflow.com/questions/1894886/parsing-a-comma-delimited-stdstring/1895584#1895584

Jerry Coffin 2010-07-02 18:38:16

Uh ... this is so close, but it is not apparent to me how to use `ctype facet` `locale` in code. A working example would be so nice!

Hamish Grubijan 2010-07-02 18:59:24

Boost.Tokenizer is ideal for splitting up strings. The iterator interface is also very easy to work with for those familiar with the STL.

Dr. Watson 2010-07-02 20:22:09

@Jerry, I noted the edit, and I appreciate it. I am going to test this next week now.

Hamish Grubijan 2010-07-02 22:30:37

Jerry, does this program fully compile for you? I had this problem: `thefile.cpp(1404): error C2228: left of '.imbue' must have class/struct/union` as well as: `thefile.cpp(1412): error C2440: '<function-style-cast>' : cannot convert from 'std::istringstream (__cdecl *)(CStringA)' to 'std::istream_iterator<_Ty>'`.

Hamish Grubijan 2010-07-07 21:27:35

@Hamish: it does compile, though looking at it again, it really *should* also `#include <iterator>` (VC++ doesn't mind but, for one example, g++ requires it). For better or worse, the C++ standard allows one header to include another, allowing problems like this to slip through...

Jerry Coffin 2010-07-08 02:15:31

Answer 3

A:

Parsing strings in C/C++ rarely turns out to be a simple matter. The method you posted looks like it has quite a bit of "history" involved in it. For example, you state that you want to split the string on white space. But the method itself appears to be using a member variable m_strDelim as part of the splitting decision. Simply replacing the method could lead to other unexpected issues.

Using an existing tokenizing class such as this Boost library could simplify things quite a bit.

Mark Wilkins 2010-07-01 22:38:45

"Parsing strings in C/C++ rarely turns out to be a simple matter." It is a simple matter, but solutions like parsing tend to involve processing strings character-by-character and loops. I.e. I can't think of something (standard) like python's split() function in C++....

SigTerm 2010-07-01 23:43:52
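
For what it's worth, the kind of hand-written loop SigTerm alludes to can be kept fairly short with only the standard library. The sketch below is one way to do it, using `find_first_not_of` and `find_first_of`; the function name `split_any` and its default delimiter set are assumptions made for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split `text` on any character in `delims`,
// skipping empty tokens (runs of delimiters count as one separator).
std::vector<std::string> split_any(const std::string& text,
                                   const std::string& delims = " \t\r\n") {
    std::vector<std::string> tokens;

    // Start of the first token: first character that is not a delimiter.
    std::string::size_type start = text.find_first_not_of(delims);
    while (start != std::string::npos) {
        // End of the token: next delimiter, or npos if the token runs to the end.
        std::string::size_type end = text.find_first_of(delims, start);
        // substr clips the count to the end of the string when end is npos.
        tokens.push_back(text.substr(start, end - start));
        // Searching from npos returns npos, which ends the loop.
        start = text.find_first_not_of(delims, end);
    }
    return tokens;
}
```

Unlike `strtok`, this leaves the input untouched and keeps no hidden state, so it is safe to call from several threads on different strings.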

Answer 4

A:

Quite an over the top way of sorting this problem is to use the Qt libraries. If you're using KDE then they're installed already. The QString class has a member function split which works like the python version. For example

QString("This is a string").split(" ", QString::SkipEmptyParts)

returns a QStringList of QStrings:

["This", "is", "a", "string"]

(in pythonic syntax). Note that the second argument is needed: without it, words separated by several spaces would also yield an empty string for each extra space.

In general I find with the help of the Qt libraries, most of the simplicity of python, eg. simple string parsing and list iteration, can be handled with ease and with the power of C++.

Simon Walker 2010-07-01 23:16:59

We are a MSFT shop, so I can use whatever standard libs come with VS2010. I like Linux, open source, etc, but I cannot just install arbitrary libs. I could steal a couple of .h and .cpp files though if a license allows it.

Hamish Grubijan 2010-07-02 02:51:29

Answer 5

+5 A:

The String Toolkit Library (Strtk) has the following solution to your problem:

#include <string>
#include <deque>
#include "strtk.hpp"
int main()
{ 
   std::string data("kas\nhjkfh kjsdjkasf");
   std::deque<std::string> str_list;
   strtk::parse(data, ", \r\n", str_list);
   return 0;
}

More examples can be found Here

Beh Tou Cheh 2010-07-02 02:28:55

Hm ... I wonder if I can steal just a couple of headers and cpp files without installing the whole thing.

Hamish Grubijan 2010-07-02 02:53:32

@Hamish: Feel free to take what you like, it's all under CPL. If you don't have Boost, or just use vanilla C++ with the STL, you can comment out #define ENABLE_LEXICAL_CAST, #define ENABLE_RANDOM and #define ENABLE_REGEX, and it should all work. It's all explained in the readme.txt.

Beh Tou Cheh 2010-07-02 03:43:53

Answer 6

+1 A:

You can use boost::algorithm::split. I.e.:

std::string myString;
std::vector<std::string> splitStrings;
boost::algorithm::split(splitStrings, myString, boost::is_any_of(" \r\n"));

Billy ONeal 2010-07-02 02:31:59

+1 The Boost String Algorithms library is essential for string operations!

Matthieu M. 2010-07-02 06:47:21

I really wish I knew why people were downvoting this.

Billy ONeal 2010-07-02 20:54:51

I did not downvote. I suppose it is because of my comment (not original post) that I prefer to use standard libs only ...

Hamish Grubijan 2010-07-02 22:32:39

@Hamish Grubijan: Yes -- agreed -- I was going to delete my answer except I saw that the most upvoted answer to this question is asking to use another library....

Billy ONeal 2010-07-03 01:49:15

@Billy: I've +1'd in the hopes you don't break out into tears :D

Beh Tou Cheh 2010-07-03 23:03:27

@Beh: Thank you :) I don't have a major issue with the downvotes -- just if someone is going to downvote the answer they should at least leave a comment as to why there's a problem with the answer.

Billy ONeal 2010-07-04 02:16:51

Answer 7

A:

A better method than my other answer: TR1's regex feature. Here's a small tutorial to get you started. This answer is C++, uses regular expressions (which is perhaps the best / easiest way to split a string), and I used it myself recently, so I know it's a nice tool.

adam_0 2010-07-27 17:07:12
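
A minimal sketch of the regex approach, assuming a compiler with the C++11 `<regex>` header (the standardized successor of the TR1 facility mentioned above; with TR1 itself the names would live in `std::tr1`). The helper name `split_regex` and the default pattern are illustrative choices, not taken from the answer.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: split `text` wherever `pattern` matches.
std::vector<std::string> split_regex(const std::string& text,
                                     const std::string& pattern = "\\s+") {
    const std::regex separator(pattern);
    std::vector<std::string> tokens;

    // Submatch index -1 asks the iterator for the text between matches
    // instead of the matches themselves.
    std::sregex_token_iterator it(text.begin(), text.end(), separator, -1);
    const std::sregex_token_iterator end;
    for (; it != end; ++it) {
        const std::string token = it->str();
        // A separator at the very start yields an empty first token; skip it.
        if (!token.empty()) {
            tokens.push_back(token);
        }
    }
    return tokens;
}
```

Passing a pattern such as "[\\s,]+" would treat commas as separators as well, at the cost of compiling a regular expression for what is otherwise a simple scan.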
