I am having a hard time creating a regular expression that extracts the namespaces from this SPARQL query:

SELECT * 
WHERE {
    ?Vehicle rdf:type umbel-sc:CompactCar ;
             skos:subject <http://dbpedia.org/resource/Category:Vehicles_with_CVT_transmission>;
             dbp-prop:assembly ?Place.
    ?Place geo-ont:parentFeature dbpedia:United_States .
}

I need to get:

"rdf", "umbel-sc", "skos", "dbp-prop", "geo-ont", "dbpedia"

I need an expression like this:

\\s+([^\\:]*):[^\\s]+

But the above one does not work, because it also eats spaces before reaching :. What am I doing wrong?

A: 

I don't know the details of SPARQL syntax, but I would imagine that it is not a regular language, so regular expressions won't be able to do this perfectly. However, you can get pretty close if you search for something that looks like a word with whitespace on its left and a colon on its right.

This method might be good enough for a quick solution, or if your input format is known and sufficiently restricted. For a more general solution, I suggest you look for or create a proper parser for the SPARQL language.

With that said, try this:

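using System;
using System.Text.RegularExpressions;

string s = @"SELECT * 
WHERE {
    ?Vehicle rdf:type umbel-sc:CompactCar ;
    skos:subject <http://dbpedia.org/resource/Category:Vehicles_with_CVT_transmission>;
    dbp-prop:assembly ?Place.
    ?Place geo-ont:parentFeature dbpedia:United_States .
}";

// Capture a run of word characters or hyphens that follows whitespace and precedes a colon.
foreach (Match match in Regex.Matches(s, @"\s([\w-]+):"))
{
    // Group 1 holds the prefix without the leading whitespace or the colon.
    Console.WriteLine(match.Groups[1].Value);
}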

Result:

rdf
umbel-sc
skos
dbp-prop
geo-ont
dbpedia
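
Real queries often use the same prefix more than once, in which case the loop above prints it each time. If a list of distinct prefixes is wanted, something along these lines might do. It is only a sketch: the class name PrefixExtractor, the method name Extract and the short sample query are made up for illustration, and it has the same limitations as the regex it reuses.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PrefixExtractor
{
    // Hypothetical helper: returns each prefix once, in order of first appearance.
    public static List<string> Extract(string query)
    {
        var seen = new HashSet<string>();
        var prefixes = new List<string>();

        // Same pattern as above: whitespace, then word characters or hyphens, then a colon.
        foreach (Match match in Regex.Matches(query, @"\s([\w-]+):"))
        {
            string prefix = match.Groups[1].Value;

            // HashSet.Add returns false when the prefix was already recorded.
            if (seen.Add(prefix))
            {
                prefixes.Add(prefix);
            }
        }

        return prefixes;
    }

    public static void Main()
    {
        // Illustrative query in which both prefixes occur twice.
        string query = "SELECT * WHERE { ?a rdf:type dbpedia:Car . ?b rdf:type dbpedia:Place . }";

        // Prints: rdf, dbpedia
        Console.WriteLine(string.Join(", ", Extract(query)));
    }
}
```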
Mark Byers
Cool! That was quick. Thanks!
Anton Andreev
@Anton Andreev: So... does it do what you want? Have you tested it?
Mark Byers
yes, but I had to change it a bit: @"\s\[*([\w-]+):(?!//)" with "[" and probably there will be more cases like this to add. Testing will take time. You can try for fun some SPARQL queries on my company's website: http://factforge.net/sparql
Anton Andreev
@Anton Andreev: I just noticed that there is no whitespace before http so that special case I added is not actually necessary. I've updated my post to reflect that.
Mark Byers
A: 

So I need an expression like this:

\\s+([^\\:]*):[^\\s]+

But the above one does not work, because it also eats spaces before reaching ":".

The regular expression will eat those spaces, yes, but the group captured by your parentheses won't contain them. Is that a problem? You can access this group by reading Groups[1].Value from the Match object returned by Regex.Match.

If you really need the regex to not match these spaces, you can use a so-called look-behind assertion:

(?<=\s)([^:]*):[^\s]+

As an aside, you don’t need to double all your backslashes. Use a verbatim string instead, like this:

Regex.Match(input, @"(?<=\s)([^:]*):[^\s]+")
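
One thing to watch for: [^:]* also matches whitespace and newlines, so on a multi-line query the captured group can hold much more than the prefix. A possible tightening, sketched below, swaps in a narrower character class ([\w-]+, which is an assumption about what a prefix may contain) and adds a look-ahead for the colon, so that the match itself is the prefix and no group is needed. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LookaroundPrefixes
{
    // Hypothetical helper: every match is a prefix, so Match.Value can be used directly.
    public static List<string> Find(string query)
    {
        var result = new List<string>();

        // (?<=\s)  the preceding character must be whitespace (not part of the match)
        // [\w-]+   word characters or hyphens (assumed shape of a prefix)
        // (?=:)    the next character must be a colon (not part of the match)
        foreach (Match match in Regex.Matches(query, @"(?<=\s)[\w-]+(?=:)"))
        {
            result.Add(match.Value);
        }

        return result;
    }

    public static void Main()
    {
        string query = @"SELECT *
WHERE {
    ?Vehicle rdf:type umbel-sc:CompactCar ;
             skos:subject <http://dbpedia.org/resource/Category:Vehicles_with_CVT_transmission> .
}";

        // Prints: rdf, umbel-sc, skos
        Console.WriteLine(string.Join(", ", Find(query)));
    }
}
```

The http and Category parts of the IRI are skipped because neither is directly preceded by whitespace.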
Timwi