I want to make pattern matches conditional on whether they occur after a word character or after a non-word character. In other words, I want to simulate the \b word-boundary assertion at the beginning of a pattern, which flex/lex does not support.

Here's my attempt below (which does not work as desired):

%{
#include <stdio.h>
%}

%x inword
%x nonword

%%
[a-zA-Z]    { BEGIN inword; yymore(); }
[^a-zA-Z]   { BEGIN nonword; yymore(); }

<inword>a { printf("'a' in word\n"); }
<nonword>a { printf("'a' not in word\n"); }

%%

Input:

a
ba
a

Expected output:

'a' not in word
'a' in word
'a' not in word

Actual output:

a
'a' in word
'a' in word

I'm doing this because I want to build something like the dialectizer, and I have always wanted to learn how to use a real lexer. Sometimes the patterns I want to replace need to be fragments of words; sometimes they need to be whole words only.

A: 
%%
[a-zA-Z]+a[a-zA-Z]* {printf("a in word: %s\n", yytext);}
a[a-zA-Z]+ {printf("a in word: %s\n", yytext);}
a {printf("a not in word\n");}
. ;

Testing:

user@cody /tmp $ ./a.out <<EOF
> a
> ba
> ab
> a
> EOF
a not in word

a in word: ba

a in word: ab

a not in word
A: 

Here's what accomplished what I wanted:

%{
#include <stdio.h>
%}

WC      [A-Za-z']
NW      [^A-Za-z']

%start      INW NIW

%%

{WC}  { BEGIN INW; REJECT; }
{NW}  { BEGIN NIW; REJECT; }

<INW>a { printf("'a' in word\n"); }
<NIW>a { printf("'a' not in word\n"); }

This way I can do the equivalent of \B or \b at the beginning or end of any pattern. To apply the check at the end of a pattern, use trailing context: a/{WC} or a/{NW}.
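For anyone who wants to see the trailing-context check outside of lex, here is a rough C sketch of what a/{WC} and a/{NW} test: the character right after the 'a', which is looked at but not consumed. The function names is_word_char and next_class_after_a are made up for illustration. Returning -1 when nothing follows the 'a' is how this sketch models the fact that trailing context needs an actual character to match.

```c
#include <stddef.h>

/* Same character class as {WC} above: ASCII letters plus the apostrophe. */
int is_word_char(char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '\'';
}

/*
 * Hypothetical helper, roughly analogous to a/{WC} versus a/{NW}.
 * The caller must pass an index i that is within the string s.
 * Returns  1 if s[i] is 'a' and the next character is a word character,
 *          0 if s[i] is 'a' and the next character is a non-word character,
 *         -1 if s[i] is not 'a' or no character follows it.
 * Only s[i + 1] is inspected; nothing past the 'a' is "consumed".
 */
int next_class_after_a(const char *s, size_t i)
{
    if (s[i] != 'a' || s[i + 1] == '\0')
        return -1;
    return is_word_char(s[i + 1]) ? 1 : 0;
}
```

With the string "ab a\n", index 0 gives 1 (the 'a' is followed by 'b') and index 3 gives 0 (the 'a' is followed by a newline).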

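I wanted to set up the states without consuming any characters. The trick is using REJECT rather than yymore(), which I had not fully understood: REJECT makes the scanner fall through to the next-best rule for the same input, whereas yymore() only causes the next match to be appended to the current yytext, so the matched character is still consumed.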
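If it helps to see the idea without lex, here is a rough C sketch of what the INW/NIW start conditions end up tracking: the class of the previous character, updated for every character while each 'a' is still available to be reported. The names prev_class, is_word_char and classify_a are made up for illustration, and treating the start of input as "non-word" is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names mirroring the NIW / INW start conditions above. */
enum prev_class { PREV_NONWORD, PREV_WORD };

/* Same character class as {WC} above: ASCII letters plus the apostrophe. */
int is_word_char(char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '\'';
}

/*
 * Writes one code per 'a' found in input into out:
 *   'W' if the character before it was a word character,
 *   'N' otherwise (start of input counts as non-word: an assumption).
 * out is always NUL-terminated; codes that do not fit are dropped.
 * Returns the number of codes written.
 */
size_t classify_a(const char *input, char *out, size_t outsz)
{
    enum prev_class state = PREV_NONWORD;
    size_t n = 0;

    assert(input != NULL && out != NULL && outsz > 0);

    for (const char *p = input; *p != '\0'; ++p) {
        /* Decide using the state left behind by the previous character. */
        if (*p == 'a' && n + 1 < outsz)
            out[n++] = (state == PREV_WORD) ? 'W' : 'N';

        /* Then update the state from the current character. */
        state = is_word_char(*p) ? PREV_WORD : PREV_NONWORD;
    }
    out[n] = '\0';
    return n;
}
```

Calling classify_a("a\nba\na\n", out, sizeof out) with a 16-byte buffer fills out with "NWN", which lines up with the expected output in the question (not in word, in word, not in word).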

ʞɔıu