Hi,

I have some content like this:

    author = "Marjan Mernik  and Viljem Zumer",
    title = "Implementation of multiple attribute grammar inheritance in the tool LISA",
    year = 1999

    author = "Manfred Broy and Martin Wirsing",
    title = "Generalized
             Heterogeneous Algebras and
             Partial Interpretations",
    year = 1983

    author = "Ikuo Nakata and Masataka Sassa",
    title = "L-Attributed LL(1)-Grammars are
             LR-Attributed",
    journal = "Information Processing Letters"

I need to capture everything between the double quotes of each title. My first try was this:

^(" "|\t)+"title"" "*=" "*"\"".+"\","

This catches the first example, but not the other two. The others span multiple lines, and that's the problem. I thought about adding \n somewhere to allow multiple lines, like this:

^(" "|\t)+"title"" "*=" "*"\""(.|\n)+"\","

But this doesn't help; instead, it catches everything up to the last ", in the input.

Then I thought, "what I want is between double quotes, so what if I catch everything until I find another " followed by ,?" This way I would know whether I was at the end of the title or not, no matter the number of lines, like this:

^(" "|\t)+"title"" "*=" "*"\""[^"\""]+","

But this has another problem. The examples above don't show it, but the double quote symbol (") can appear inside the title itself. For instance:

title = "aaaaaaa \"X bbbbbb",

And yes, it will always be preceded by a backslash (\).

Any suggestions to fix this regexp?

+2  A: 

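The classical regex to match strings in double quotes is:

\"([^\"\\]|\\.)*\"

The character class has to exclude the backslash as well as the quote, so that a backslash can only be consumed together with the character it escapes.

In your case, you'll want something like this:

"title"\ *=\ *\"([^\"\\]|\\.)*\"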

PS: IMHO, you're putting too many quotes in your regexes, which makes them hard to read.
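To make it clearer what the quoted-string pattern accepts, here is a minimal hand-written C sketch of roughly the same scan: start at the opening quote, skip every backslash together with the character after it, and stop at the first unescaped quote. The function name `quoted_length` is made up for illustration and is not part of Lex; unlike `.` in a Lex pattern, this sketch also lets a backslash escape a newline.

```c
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only.
 * Returns the length of the double-quoted string that starts at s,
 * counting both quotes, or 0 if s does not begin with '"' or the
 * string is never closed.
 */
size_t quoted_length(const char *s)
{
    size_t i = 1;

    if (s[0] != '"')
        return 0;                 /* must start at an opening quote */

    while (s[i] != '\0') {
        if (s[i] == '\\') {       /* escape: backslash plus one character */
            if (s[i + 1] == '\0')
                return 0;         /* dangling backslash at end of input */
            i += 2;
        } else if (s[i] == '"') { /* first unescaped quote closes the string */
            return i + 1;
        } else {                  /* any other character, newlines included */
            i++;
        }
    }
    return 0;                     /* input ended before the closing quote */
}
```

Because a newline is just another character here, a title spread over several lines is handled the same way as a one-line title.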

rz0
Lex doesn't accept bare spaces in a pattern; it needs `" "` to match a space. I only do this because of Lex; I don't usually write regexes this way in other languages like PHP (where I'm most used to working with regex).
Nazgulled
You can also use '`\ `' to match a space in most lex versions.
Chris Dodd
I believe '\ ' is POSIX-compliant. See http://www.opengroup.org/onlinepubs/009695399/utilities/lex.html , Table: Escape Sequences in lex.
rz0
It's just a matter of preference, it doesn't really matter in the end.
Nazgulled
A: 

You could use start conditions to simplify each separate pattern, for example:

%x title
%%
"title"\ *=\ *\"  { /* mark title start */
  BEGIN(title);
  fputs("found title = <|", yyout);
}

<title>[^"\\]* { /* process title part; use ([^"\\]|\\.)* to grab it all at once */
  ECHO;
}

<title>\\. { /* process escapes inside title */
  char c = *(yytext + 1);
  fputc(c, yyout); /* double escaped characters */
  fputc(c, yyout);
}

<title>\" { /* mark end of title */
  fputs("|>", yyout);
  BEGIN(INITIAL); /* continue as usual */
}

To make an executable:

$ flex parse_ini.y
$ gcc -o parse_ini lex.yy.c -lfl

Run it:

$ ./parse_ini < input.txt 

Where input.txt is:

author = "Marjan\" Mernik  and Viljem Zumer",
title = "Imp\"lementation of multiple...",
year = 1999

Output:

author = "Marjan\" Mernik  and Viljem Zumer",
found title = <|Imp""lementation of multiple...|>,
year = 1999

It replaced '"' around the title with '<|' and '|>'. Also, '\"' is replaced by '""' inside the title.

J.F. Sebastian
I'm already using too many start conditions, so this complicates things a bit. Also, it's easier to catch everything in one regex because I need to pass the match to a C function.
Nazgulled
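For the single-regex route, a minimal sketch of the kind of C function the whole match could be handed to. It assumes the matched text looks like `title = "...",` and that the first `"` in it opens the title. The name `extract_title` and its signature are hypothetical, not something from the thread or from Lex.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only.
 * Copies the text between the first '"' in match and the next unescaped
 * '"' into out, dropping the backslash of each escape (\" becomes ").
 * Returns the number of characters written, not counting the NUL, or -1
 * if there is no complete quoted string or out is too small.
 */
int extract_title(const char *match, char *out, size_t out_size)
{
    const char *p = strchr(match, '"');
    size_t n = 0;

    if (p == NULL || out_size == 0)
        return -1;

    for (p++; *p != '\0' && *p != '"'; p++) {
        if (*p == '\\' && p[1] != '\0')
            p++;                  /* skip the backslash, keep the escaped char */
        if (n + 1 >= out_size)
            return -1;            /* no room for this char plus the NUL */
        out[n++] = *p;
    }
    if (*p != '"')
        return -1;                /* closing quote never found */

    out[n] = '\0';
    return (int)n;
}
```

In a Lex action this could be called with `yytext` as the first argument. Multi-line titles keep their newlines and indentation, so any whitespace cleanup would be a separate step.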