views:

282

answers:

6

I'm having a hard time understanding this regex stuff...

I have a string like this:

<wn20schema:NounSynset rdf:about="&dn;synset-56242" rdfs:label="{saddelmageri_1}">

I want to use findall() and groups to get this:

['56242','saddelmageri']

I can match the number with something like "synset-[0-9]" and the word with something like "{(.*?)}" but how do I write it to get the above result?

And here's a follow-up question - some lines look like this:

<wn20schema:NounSynset rdf:about="&dn;synset-2589" rdfs:label="**{cykel_3: trehjulet cykel; tricykel,1_1}**">

In this case I want to extract the stuff between the {} with this result:

['2589', ['cykel', 'trehjulet cykel', 'tricykel']]

so that I can drop it in a dictionary later as a key(2589) : value(['cykel', 'trehjulet cykel', 'tricykel']) pair.

Any thoughts?

+2  A: 

Please see the top answer to this question. It is generally a terrible idea to parse xml with regular expressions. XML parsers are built for this purpose.

The quickest way to do this would probably be python's built-in minidom

Mike
A: 

Since this appears to be xml data, you would be better off using an xml parser, since parsing xml with regular expressions is very, very difficult to do right.

However, since you specifically asked for a regular expression...

Your specifications are a bit imprecise, and with regular expressions you need to be very precise in what constitutes a match. For example, will the rdfs:label value always have a _1 that you want to strip off? Will there always only be one of these blocks of data per line, or multiple per line? Also, is the order of the result important?

Here's a quick hack that might give you close to what you want:

import re
data=r'<wn20schema:NounSynset rdf:about="&dn;synset-56242" rdfs:label="{saddelmageri_1}">"'

matches=re.findall('synset-([0-9]+).*label="{(.*)_1}"', data)
print "matches:", matches

When I run the above, I get the following output, which is a list of two-tuples containing the two strings you wanted (though in a different order):

matches: [('56242', 'saddelmageri')]
Bryan Oakley
That did the trick for my question!But yes, you're right, I wasn't very precise. As it happens, it's not always a _1, it could be a _2 or more. And even sometimes if there's nothing the word will be prefixed by DN:, like {DN:saddelmageri}.I guess you're all right, it must be easier to use an XML parser.
Steen H
+1  A: 

If you do a lot with this data, consider even a specialized RDF library (e.g. RDFLib). If not, an XML parser is definitely the way to go!

  • What if tomorrow it won't be on a single line?
  • What if tomorrow the label will come before the about?
  • There are at a least a dozen more ways in which it can remain valid XML but break your regexp!

Anyway, I tried applying an XML parser, but I'm getting an "undefined entity error" for the &dn; there. Can you post the top of the file (doctype, namespace definitions, and the like)?

Beni Cherniavsky-Paskin
A: 

Yes, sure :)

<!DOCTYPE rdf:RDF [
<!ENTITY rdf        "http://www.w3.org/1999/02/22-rdf-syntax-ns#"&gt;
<!ENTITY rdfs       "http://www.w3.org/2000/01/rdf-schema#"&gt;
<!ENTITY owl        "http://www.w3.org/2002/07/owl#" >
<!ENTITY xsd        "http://www.w3.org/2001/XMLSchema#"&gt;
<!ENTITY wn20schema "http://www.w3.org/2006/03/wn/wn20/schema/"&gt;
<!ENTITY dn         "http://www.wordnet.dk/owl/instance/2009/03/instances/"&gt;
<!ENTITY dn_schema  "http://www.wordnet.dk/owl/instance/2009/03/schema/"&gt;
]>

And the first part of the tag:

<rdf:RDF
            xmlns:owl="&owl;"
            xmlns:rdf="&rdf;"
            xmlns:rdfs="&rdfs;"
            xmlns:dn="&dn;"
            xmlns:wn20schema="&wn20schema;"
            xmlns:dn_schema="&dn_schema;">

I'm playing around with minidom, and I can get to the wn20schema:NounSynset nodes, but what to do from there? I guess I need to reference the rdf and rdfs attributes somehow?

Thanks for your help btw :)

Steen H
A: 

You're doing two different kinds of parsing here, and you'll need to use two different tools.

First, you're parsing XML. For that, you're going to need to use an XML parser, not regular expressions. Because these elements are functionally identical XML:

<wn20schema:NounSynset rdf:about="&dn;synset-56242" rdfs:label="{saddelmageri_1}">
</wn20schema:NounSysnset>

<wn20schema:NounSynset rdf:about="&dn;synset-56242" rdfs:label="{saddelmageri_1}"/>

<wn20schema:NounSynset rdfs:label="{saddelmageri_1}" rdf:about="&dn;synset-56242"/>

and conceivably even:

<NounSynset xmlns="my_wn20schema_namespace_urn" C:label='not_of_interest' A:label='{saddelmageri_1}' B:about='&dn;synset-56242'/>

To parse that element, you need to know the names of the namespaces that the element and the attributes you're interested in belong to, and then use an XML parser to find them - specifically, an XML parser that properly supports XML namespaces and XPath, like lxml.

You'll end up with something like this to find the attributes you're looking for (assuming that doc is the parsed XML document, and that variables ending in _urn are strings containing the various namespace URNs):

def find_attributes(doc):
    for elm in doc.xpath('//x:NounSynset', namespaces={'x': wn20schema_namespace_urn}):
        yield (elm.get(rdf_namespace_urn + "about"), elm.get(rdfs_namespace_urn + "label"))

Now you can look at the second part of the problem, which is parsing the values you need out of the attribute values you have. For that, you would use regular expressions. To parse the about attribute, this might work:

re.match(r'[^\d]*(\d*)', about).groups()[0]

which returns the first series of digit characters found. And to parse the label attribute, you might use:

re.match(r'{([^_]*)', label).groups()[0]

which returns all characters in label after a leading left brace and up to but not including the first underscore. (As far as parsing the second form of label that you posted, you haven't posted enough information for me to guess what a regular expression to parse that would look like.)

Robert Rossney
A: 

Thank you for your input. Here are the different forms of label:

label="{forsikring_1}"   # most common
label="{DN:bilforsikring; DN:autoforsikring}"
label="{kampvogn_1; tank,2_1}"
label="{panserkøretøj_1; panservogn_1}"
label="{automekaniker_1; DN:bilreparatør; DN:autoelektriker; DN:autoelektromekaniker; bilmekaniker_1}"
label="{lære,1_2_1}"
label="{uddannelse_1_1}"
label="{studie,1_4; studium_1}"
label="{vogn_6: den blå vogn}"

What it shows is that each word in a label is separated by a semi-colon . Whatever comes after the word (underscore, comma, colon) should be stripped and if a word is preceded by DN: this should also be stripped.

So the ultimate output of e.g.:

label="{automekaniker_1; DN:bilreparatør; DN:autoelektriker; DN:autoelektromekaniker; bilmekaniker_1}"

should be

[automekaniker, bilreparatør, autoelektriker, autoelektromekanikerm, bilmekaniker]

But I'd be satisfied just to get whatever is between the curly braces, as the rest can be processed later on. Of course it would be nice to have some magic regex to do it all in one sitting :)

Steen H