You're doing two different kinds of parsing here, and you'll need to use two different tools.
First, you're parsing XML. For that, you need an XML parser, not regular expressions, because all of these elements are functionally identical XML:
<wn20schema:NounSynset rdf:about="&dn;synset-56242" rdfs:label="{saddelmageri_1}">
</wn20schema:NounSynset>
<wn20schema:NounSynset rdf:about="&dn;synset-56242" rdfs:label="{saddelmageri_1}"/>
<wn20schema:NounSynset rdfs:label="{saddelmageri_1}" rdf:about="&dn;synset-56242"/>
and conceivably even:
<NounSynset xmlns="my_wn20schema_namespace_urn" C:label='not_of_interest' A:label='{saddelmageri_1}' B:about='&dn;synset-56242'/>
To parse that element, you need to know the names of the namespaces that the element and the attributes you're interested in belong to, and then use an XML parser to find them - specifically, an XML parser that properly supports XML namespaces and XPath, like lxml.
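To make the equivalence concrete, here is a minimal sketch that parses the forms listed above and checks that a namespace-aware parser reports the same element name and attributes for each. It uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra installs. The namespace URIs are placeholders I made up, the wrapping root element exists only to declare the prefixes, and the &dn; entity is written out as plain text because resolving it would need a DTD declaration that isn't shown in the question.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs, assumed for illustration only.
WN = "urn:example:wn20schema"
RDF = "urn:example:rdf"
RDFS = "urn:example:rdfs"


def parse_synset(fragment):
    """Parse one element inside a root that declares the usual prefixes."""
    xml = (
        '<root xmlns:wn20schema="%s" xmlns:rdf="%s" xmlns:rdfs="%s">%s</root>'
        % (WN, RDF, RDFS, fragment)
    )
    elm = ET.fromstring(xml)[0]
    # Tag and attribute names come back in Clark notation: {namespace-uri}name
    return elm.tag, dict(elm.attrib)


forms = [
    # Explicit end tag
    '<wn20schema:NounSynset rdf:about="synset-56242" rdfs:label="{saddelmageri_1}">\n'
    '</wn20schema:NounSynset>',
    # Self-closing tag
    '<wn20schema:NounSynset rdf:about="synset-56242" rdfs:label="{saddelmageri_1}"/>',
    # Attributes in the opposite order
    '<wn20schema:NounSynset rdfs:label="{saddelmageri_1}" rdf:about="synset-56242"/>',
    # Default namespace for the element, different prefixes for the attributes
    '<NounSynset xmlns="%s" xmlns:A="%s" xmlns:B="%s" '
    "A:label='{saddelmageri_1}' B:about='synset-56242'/>" % (WN, RDFS, RDF),
]

parsed = [parse_synset(form) for form in forms]

# Every form yields the same namespaced tag and the same attribute mapping.
print(all(p == parsed[0] for p in parsed))
```

A regular expression written against any one of these spellings would miss the others, while the parser treats them all alike.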
You'll end up with something like this to find the attributes you're looking for (assuming that doc is the parsed XML document, and that variables ending in _urn are strings containing the various namespace URNs):
def find_attributes(doc):
    # lxml names namespaced attributes in Clark notation: '{namespace-uri}localname'
    for elm in doc.xpath('//x:NounSynset', namespaces={'x': wn20schema_namespace_urn}):
        yield (elm.get('{%s}about' % rdf_namespace_urn),
               elm.get('{%s}label' % rdfs_namespace_urn))
Now you can look at the second part of the problem, which is parsing the values you need out of the attribute values you have. For that, you would use regular expressions. To parse the about attribute, this might work:
re.match(r'[^\d]*(\d*)', about).groups()[0]
which returns the first series of digit characters found. And to parse the label attribute, you might use:
re.match(r'{([^_]*)', label).groups()[0]
which returns all characters in label after a leading left brace and up to but not including the first underscore. (As for the second form of label that you posted, you haven't given enough information for me to guess what a regular expression to parse it would look like.)
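The two expressions above can be wrapped in small helpers, sketched here under a few assumptions. The function names synset_id and label_word are hypothetical. The first helper uses re.search with \d+ instead of the re.match pattern shown earlier, so that a value with no digits gives None rather than an empty string. Both helpers return None when nothing matches instead of raising an exception.

```python
import re


def synset_id(about):
    """Return the first run of digits in `about`, or None if there is none."""
    m = re.search(r'\d+', about)
    return m.group(0) if m else None


def label_word(label):
    """Return the text between a leading '{' and the first underscore, or None."""
    m = re.match(r'\{([^_]*)', label)
    return m.group(1) if m else None


print(synset_id("&dn;synset-56242"))   # prints 56242
print(label_word("{saddelmageri_1}"))  # prints saddelmageri
```

Note that the id comes back as a string; convert it with int() if you need a number.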