Parse string into array based on spaces or "double quotes strings"

Airon Zagarella picture Airon Zagarella · Mar 11, 2012 · Viewed 11.5k times · Source

Im trying to take a user input string and parse is into an array called char *entire_line[100]; where each word is put at a different index of the array but if a part of the string is encapsulated by a quote, that should be put in a single index. So if I have

char buffer[1024]={0,};
fgets(buffer, 1024, stdin);

example input: "word filename.txt "this is a string that shoudl take up one index in an output array";

tokenizer=strtok(buffer," ");//break up by spaces
        do{
            if(strchr(tokenizer,'"')){//check is a word starts with a "
            is_string=YES;
            entire_line[i]=tokenizer;// if so, put that word into current index
            tokenizer=strtok(NULL,"\""); //should get rest of string until end "
            strcat(entire_line[i],tokenizer); //append the two together, ill take care of the missing space once i figure out this issue

              }  
        entire_line[i]=tokenizer;
        i++;
        }while((tokenizer=strtok(NULL," \n"))!=NULL);

This clearly isn't working and only gets close if the double quote encapsulated string is at the end of the input string but i could have input: word "this is text that will be user entered" filename.txt Been trying to figure this out for a while, always get stuck somewhere. thanks

Answer

torek picture torek · Mar 12, 2012

The strtok function is a terrible way to tokenize in C, except for one (admittedly common) case: simple whitespace-separated words. (Even then it's still not great due to lack of re-entrance and recursion ability, which is why we invented strsep for BSD way back when.)

Your best bet in this case is to build your own simple state-machine:

char *p;
int c;
enum states { DULL, IN_WORD, IN_STRING } state = DULL;

for (p = buffer; *p != '\0'; p++) {
    c = (unsigned char) *p; /* convert to unsigned char for is* functions */
    switch (state) {
    case DULL: /* not in a word, not in a double quoted string */
        if (isspace(c)) {
            /* still not in a word, so ignore this char */
            continue;
        }
        /* not a space -- if it's a double quote we go to IN_STRING, else to IN_WORD */
        if (c == '"') {
            state = IN_STRING;
            start_of_word = p + 1; /* word starts at *next* char, not this one */
            continue;
        }
        state = IN_WORD;
        start_of_word = p; /* word starts here */
        continue;

    case IN_STRING:
        /* we're in a double quoted string, so keep going until we hit a close " */
        if (c == '"') {
            /* word goes from start_of_word to p-1 */
            ... do something with the word ...
            state = DULL; /* back to "not in word, not in string" state */
        }
        continue; /* either still IN_STRING or we handled the end above */

    case IN_WORD:
        /* we're in a word, so keep going until we get to a space */
        if (isspace(c)) {
            /* word goes from start_of_word to p-1 */
            ... do something with the word ...
            state = DULL; /* back to "not in word, not in string" state */
        }
        continue; /* either still IN_WORD or we handled the end above */
    }
}

Note that this does not account for the possibility of a double quote inside a word, e.g.:

"some text in quotes" plus four simple words p"lus something strange"

Work through the state machine above and you will see that "some text in quotes" turns into a single token (that ignores the double quotes), but p"lus is also a single token (that includes the quote), something is a single token, and strange" is a token. Whether you want this, or how you want to handle it, is up to you. For more complex but thorough lexical tokenization, you may want to use a code-building tool like flex.

Also, when the for loop exits, if state is not DULL, you need to handle the final word (I left this out of the code above) and decide what to do if state is IN_STRING (meaning there was no close-double-quote).