I want to extract only the attribute1 and attribute3 values. I don't understand why my charset doesn't "skip" the other attributes in my case: attribute3 is not extracted as I would like.

content: {<tag attribute1="valueattribute1" attribute2="valueattribute2" attribute3="valueattribute3">
</tag>
<tag attribute2="valueattribute21" attribute1="valueattribute11" >
</tag>
}


attribute1: [{attribute1="} copy valueattribute1 to {"} thru {"}]
attribute3: [{attribute3="} copy valueattribute3 to {"} thru {"}]

spacer: charset reduce [tab newline #" "]
letter: complement spacer 
to-space: [some letter | end]

attributes-rule: [(valueattribute1: none valueattribute3: none) [attribute1 | none] any letter [attribute3 | none] (print valueattribute1 print valueattribute3)
| [attribute3 | none] any letter [attribute1 | none] (print valueattribute3 print valueattribute1
valueattribute1: none valueattribute3: none
)
| none
]

rule: [any [to {<tag } thru {<tag } attributes-rule {>} to {</tag>} thru {</tag>}] to end]

parse content rule

output is

>> parse content rule
valueattribute1
none
== true
>>
+1  A: 

Short answer: [any letter] eats your attribute3="..." because the #"^"" character is, by your definition, a 'letter. Additionally, you may have problems where there is no attribute2: your generic second-attribute rule will then eat attribute3, and your attribute3 rule will have nothing left to match. It is better to be explicit that there is either an optional attribute2 or an optional anything-but-attribute3.

attribute1="foo"       attribute2="bar" attribute3="foobar" 
<- attribute1="..." -> <-     any letter                 -> <- attribute3="..." ->

Also, 'parse without the /all refinement ignores spaces (or at least is very unwieldy where spaces are concerned) - /all is highly recommended for this type of parsing.
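A minimal sketch of the first point, assuming Rebol 2 and using parse/all so that whitespace is matched literally. The sample string and the word eaten are made up for illustration:

```rebol
spacer: charset reduce [tab newline #" "]
letter: complement spacer       ; everything that is not tab, newline or space

probe find letter #"^""         ; shows whether the double quote counts as a letter

; any letter runs over a whole name="value" pair, up to the next space
parse/all {attribute3="valueattribute3" rest} [copy eaten any letter to end]
print eaten                     ; attribute3="valueattribute3"
```

By the time the attribute3 rule gets its turn, the text it was meant to match has already been consumed.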

rgchris
And what if I have 10 optional attributes? My example is just a simplified case of what I need in the real world :) It would be cumbersome to be explicit, as I would have to list every ordering of the 10 attributes!
Rebol Tutorial
Thanks for [any letter] eating and for #"^"" I didn't realize
Rebol Tutorial
+1  A: 

Firstly you're not using parse/all. In Rebol 2 that means that whitespace has been effectively stripped out before the parse runs. That's not true in Rebol 3: if your parse rules are in block format (as you are doing here) then /all is implied.
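A small sketch of the difference, assuming Rebol 2 (the input string is made up). I would expect the first and third calls to succeed and the second to fail, but it is worth running them in your own console:

```rebol
; Without /all, Rebol 2 skips whitespace between matches
probe parse "a b" ["a" "b"]

; With /all the space is ordinary input, and no rule here consumes it
probe parse/all "a b" ["a" "b"]

; ... so it has to be matched explicitly
probe parse/all "a b" ["a" " " "b"]
```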

(Note: There seemed to be consensus that Rebol 3 would throw out the non-block form of parse rules, in favor of the split function for those "minimal" parse scenarios. That would get rid of /all entirely. No action has yet been taken on this, unfortunately.)

Secondly your code has bugs, which I'm not going to spend time sorting out. (That's mostly because I think using Rebol's parse to process XML/HTML is a fairly silly idea :P)

But don't forget you have an important tool. If you use a set-word in the parse rule, then that will capture the parse position into a variable. You can then print it out and see where you're at. Change the part of attributes-rule where you first say any letter to pos: (print pos) any letter and you'll see this:

>> parse/all content rule
 attribute2="valueattribute2" attribute3="valueattribute3">
</tag>
<tag attribute2="valueattribute21" attribute1="valueattribute11" >
</tag>

valueattribute1
none
== true

See the leading space? Your rules right before the any letter put you at a space... and since you said any letter was ok, no letters are fine, and everything's thrown off.
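The same trick in isolation, as a minimal sketch on a made-up input (pos is just an illustrative word):

```rebol
; pos: stores the current parse position; the paren runs ordinary Rebol code
parse/all "abc def" ["abc" pos: (probe pos) to end]
; probe shows the remaining input, which starts with the space after "abc"
```

Dropping a pos: (probe pos) pair in front of any rule you suspect is a cheap way to see exactly what that rule is being asked to match.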

(Note: Rebol 3 has an even better debugging tool...the word ??. When you put it in the parse block it tells you what token/rule you're currently processing as well as the state of the input. With this tool you can more easily find out what's going on:

>> parse "hello world" ["hello" ?? space ?? "world"]
space: " world"
"world": "world"
== true

...though it's really buggy on r3 mac intel right now.)

Additionally, if you're not using copy then your pattern of to X thru X is unnecessary; you can achieve that with just thru X. If you want to do a copy, you can also do that with the briefer copy Y to X X, or if it's just a single symbol you could write the clearer copy Y to X skip.
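A short sketch of both simplifications together, assuming parse/all; the sample string and the word val are invented for the example:

```rebol
sample: {<tag name="value">}    ; made-up input

; thru X alone replaces the to X thru X pair,
; and skip steps over the single closing quote after the copy
parse/all sample [thru {name="} copy val to {"} skip to end]
print val                       ; value
```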

In places where you see yourself writing repetitive code, remember that Rebol can go a step above by using compose etc:

>> temp: [thru (rejoin [{attribute} num {="}]) 
          copy (to-word rejoin [{valueattribute} num]) to {"} thru {"}]

>> num: 1
>> attribute1: compose temp
== [thru {attribute1="} copy valueattribute1 to {"} thru {"}]

>> num: 2
>> attribute2: compose temp
== [thru {attribute2="} copy valueattribute2 to {"} thru {"}]
Hostile Fork
>using Rebol's parse to process XML/HTML is a fairly silly idea:) in truth it's for parsing ASP.NET C# code ... but maybe that's still silly !
Rebol Tutorial
But I can't see any other tool to parse ASP.NET code so I use Rebol :)
Rebol Tutorial
And I don't need to write a full blown parser, I just need to do simple extraction of some attributes.
Rebol Tutorial
By the way I still use Rebol 2 (2.7.7 January). Haven't tried your code yet ...
Rebol Tutorial
forgot to say: an important requirement is that the rules should be human readable by layman, that's why I don't want full blown classical parser.
Rebol Tutorial
The problem is you're building in a lot of expectations to the code of things like the order of the attributes, which those developing XML and HTML oriented things are pretty liberal about changing because such order isn't supposed to carry meaning. You might be able to hack something that works for one case, but if the code is intended to have a broader usefulness then it's best to stick with a real DOM-oriented parser.
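For simple extraction only, one order-independent sketch is to loop over every name="value" pair inside a tag and keep just the names you care about. This assumes Rebol 2 with parse/all, double-quoted values, and attribute names made of letters and digits; name-chars, wanted, name and value are illustrative words, and it is not a substitute for a real parser:

```rebol
content: {<tag attribute1="valueattribute1" attribute2="valueattribute2" attribute3="valueattribute3">
</tag>
<tag attribute2="valueattribute21" attribute1="valueattribute11" >
</tag>
}

name-chars: charset [#"a" - #"z" #"A" - #"Z" #"0" - #"9"]
wanted: ["attribute1" "attribute3"]     ; attributes to report, in any order

parse/all content [
    any [
        thru {<tag}
        any [
            some " "                    ; spaces between attributes
            | copy name some name-chars {="} copy value to {"} skip
              (if find wanted name [print [name value]])
        ]
        thru {>}
    ]
    to end
]
```

Adding an eleventh attribute then means adding one string to wanted rather than another alternation.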
Hostile Fork
You're right, I have to change my strategy: sometimes I nearly believe Rebol is not a programming language but a natural language that can mimic human thoughts. Not so in this case, but I learnt more about parse this way, thanks to you :)
Rebol Tutorial
Finally I thought of parsing ALL attributes using parse and break, but it seems it won't work, and I can't see why break can't do this: http://stackoverflow.com/questions/2457618/parse-and-break-why-break-cannot-be-used-for-getting-out-of-any-or-some-rule
Rebol Tutorial
A: 

When adding parse/all it didn't seem to change anything. Finally this seems to work (using set-word has been indeed a great help for debugging !!!), what do you think ?

content: {<tag attribute1="valueattribute1" attribute2="valueattribute2" attribute3="valueattribute3">
</tag>
<tag attribute2="valueattribute21" attribute1="valueattribute11" >
</tag>
}


attribute1: [to {attribute1="} thru {attribute1="} copy valueattribute1 to {"} thru {"}]
attribute3: [to {attribute3="} thru {attribute3="} copy valueattribute3 to {"} thru {"}]

letter: charset reduce ["ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890="]

attributes-rule: [(valueattribute1: none valueattribute3: none) 
[attribute1 | none] any letter pos: 
[attribute3 | none] (print valueattribute1 print valueattribute3)
| [attribute3 | none] any letter [attribute1 | none] (print valueattribute3 print valueattribute1
valueattribute1: none valueattribute3: none
)
| none
]

rule: [any [to {<tag } thru {<tag } attributes-rule {>} to {</tag>} thru {</tag>}] to end]

parse/all content rule

which outputs:

>> parse/all content rule
valueattribute1
valueattribute3
valueattribute11
none
== true
>>
Rebol Tutorial
Since you weren't aware of the "get parse position" trick with set-words, I'll mention the analogous tool for "set parse position" from a variable pointing to a series type... use a get-word! Cool, huh?
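A minimal sketch of the pair working together, assuming Rebol 2; mark and captured are made-up words:

```rebol
parse/all "abcdef" [
    mark:                   ; set-word: remember the current position
    "abc"                   ; move forward past abc
    :mark                   ; get-word: jump back to the saved position
    copy captured to end    ; so this copy starts from the beginning again
]
print captured              ; abcdef
```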
Hostile Fork
I edited my answer to include a couple of other notes.
Hostile Fork