+4  Q: 

SQL for the web

Does anyone have experience with a query language for the web?

I am looking for a project, commercial or not, that does a good job of making a web page queryable and that can even follow the links on it to aggregate information from a bunch of pages.

I would prefer a SQL- or LINQ-like syntax. I could of course download a web page and run some XPath on it, but I'm looking for a solution with a nicer abstraction.

I found WebSQL:

http://www.cs.utoronto.ca/~websql/

It looks good, but I'm not into Java. A sample query:

SELECT a.label
FROM Anchor a SUCH THAT base = "http://www.SomeDoc.html"
WHERE a.href CONTAINS ".ps.Z";
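
To make clear what a query like that boils down to, here is a minimal sketch in Python using only the standard library. It is not how WebSQL works internally, and it skips the download step entirely. The names `AnchorCollector` and `select_labels`, and the sample page, are made up for illustration.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects (href, label) pairs for every <a> element in a document."""

    def __init__(self):
        super().__init__()
        self.anchors = []   # finished (href, label) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def select_labels(html, href_contains):
    """Roughly: SELECT a.label FROM Anchor a WHERE a.href CONTAINS ..."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return [label for href, label in parser.anchors if href_contains in href]


# Hypothetical sample page; a real one would be downloaded first.
SAMPLE = """
<html><body>
  <a href="papers/intro.ps.Z">Intro paper</a>
  <a href="index.html">Home</a>
  <a href="papers/results.ps.Z">Results <b>(draft)</b></a>
</body></html>
"""

print(select_labels(SAMPLE, ".ps.Z"))
```

What this leaves out is exactly the abstraction I am after: a declarative syntax and following links across pages.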

Are there others out there?

Is there a library that can be used in a .NET language?

A: 

I'm not sure whether this is exactly what you're looking for, but Freebase is an open database of information with a programmatic query interface.

Greg Hewgill
+1  A: 

You are probably looking for SPARQL. It doesn't let you parse pages, but it's designed to solve the same problem (i.e. getting data out of a site, from the cloud). It's a W3C standard, but apparently Microsoft does not support it yet.

Sklivvz
+3  A: 

See hpricot (a Ruby library).

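require 'rubygems'
require 'hpricot'
# open-uri lets open() fetch URLs (on Ruby 3.0+ call URI.open instead)
require 'open-uri'
# load the RedHanded home page
doc = Hpricot(open("http://redhanded.hobix.com/index.html"))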
# change the CSS class on links
(doc/"span.entryPermalink").set("class", "newLinks")
# remove the sidebar
(doc/"#sidebar").remove
# print the altered HTML
puts doc

It supports querying with CSS or XPath selectors.
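
If you are not on Ruby, the same selector idea can be sketched with the limited XPath support in Python's standard `xml.etree.ElementTree`. This is only an illustration under assumptions: the markup below is a made-up, well-formed sample, `permalink_hrefs` is a hypothetical helper name, and ElementTree is a strict XML parser, so unlike hpricot it will not cope with messy real-world HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not the real RedHanded page).
PAGE = """
<html><body>
  <div id="sidebar"><a href="/about">About</a></div>
  <div id="content">
    <span class="entryPermalink"><a href="/post/1">First</a></span>
    <span class="entryPermalink"><a href="/post/2">Second</a></span>
  </div>
</body></html>
"""


def permalink_hrefs(markup):
    """Return the href of every <a> directly inside span.entryPermalink."""
    root = ET.fromstring(markup.strip())
    # XPath counterpart of the CSS selector "span.entryPermalink > a"
    links = root.findall(".//span[@class='entryPermalink']/a")
    return [a.get("href") for a in links]


print(permalink_hrefs(PAGE))
```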

Pistos
+2  A: 

Beautiful Soup and hpricot are the canonical choices for Python and Ruby respectively.

For C#, I have used and appreciated HTML Agility Pack. It does an excellent job of turning messy, invalid HTML into queryable goodness.

There is also this C# HTML parser, which looks good, but I have not tried it.

Colin Pickard