views:

56

answers:

2

Hi all,

I've got a (proprietary) output from a software that I need to parse. Sadly, there are unescaped user names and I'm scratching my hairs trying to know if I can, or not, describe the files I need to parse using a BNF (or EBNF or ABNF).

The problem, oversimplified (it's really just an example), may look like this:

(data) ::= <username>
<username> ::= (other type of data)

And in some case, instead of appearing at the left or at the right, the username can also appear in the middle of a line.

The problem is that the username is unescaped and there are not enough restrictions on user names (they're printable ASCII, max 20 chars and they can't contain line break). So "=" would be a perfectly valid username, for example. And so would "= 1 = john = 2" (because user, at sign-on, where allowed to choose any user name they wanted and these appear unescaped in the output I've got).

I'm asking because my parser chocked on some very creative usernames (once again, not in my control, they're "weird" and I need to deal with it) and I cannot find an easy way to deal with this. Also note that I do not know in advance the user names (for example I don't have access to a database that would contain all the user names that the users created).

So are unrestricted and unescaped user names incompatibles with BNF?

P.S: be cool with me if I made mistakes, it's my first post on stackoverflow :)

+1  A: 

BNF doesn't "care" for user names per-se. It works on the token level. If you define a username token, you can build describe a grammar using BNF based on it.

Your problem should be solved on the lexer level. The lexer should be smart enough to recognize user names, even when they're not escaped, and pass username tokens to the parser.

In theory you could describe all kinds of user names with a grammar, but this heavily depends on the other things in your language. Is = a valid token on its own right? How can you tell a username having = in it apart if it is? I think you'll have to describe the rest of the rules and valid tokens in your language to get a fuller answer here.

Eli Bendersky
The problem comes if there are other tokens on the same line, or if the user name contains things that the parser also considers tokens, doesn't it?
Paul Tomblin
@Eli: Yup OK, my question wasn't formulated well enough: we've got problems when tokenizing. Can the lexer always be smart enough to recognize user names, knowing that basically any user name is valid? Let's imagine users will use, on purpose, names that would break the lexer. Is this always possible to solve when users have no restriction at all on their username?
Webinator
@Paul: Yup, that is precisely the point of my question :)
Webinator
@OldEnthusiast: no, it's not always possible to solve. If your language is simple arithmetic expressions on user names (just for example...) and usernames are allowed to have spaces and + and = in them, it won't be possible to parse. This is why you need to present more of your grammar to get real help here
Eli Bendersky
+1  A: 

It might be possible to work by recognising things that are not usernames and then declaring everything else a username, even if this means parsing from right to left instead of left to right or doing something equally eccentric.

It may be worth looking to see if your input is actually ambiguous: can you find two different situations that lead to identical output being generated? If so, you need to go back and get requirements for which of them to favour, or what sort of error to produce, or whatever. If not, the reason why not might help you work out what your parser or lexer or whatever needs to do.

mcdowella
@mcdowella: yup, it meant exactly such "eccentric" parsing. I basically used (before even asking the question here) a LR parser, a RL parser and a "outside to inside" parser (parsing one char at the left, one at the right, then going towards the center of each line) depending on the kind of input I had (once again the input format wasn't in my control).But I asked the question here to make sure I wasn't "seeing thing": I abandonned the BNF/tokenizer approach and wanted to be sure it was simply, well, not working in such a case.Thanks a lot for your answer too :)
Webinator