I am trying to parse a bit of data from an HTML file, but my LINQ statement is not working. The XML/HTML is shown below. How can I extract the string "41.8;12.23" from the geo.position meta tag? Thanks!

Here is my LINQ query:

   String longLat = (String)
        from el in xdoc.Descendants()
              where
               (string)el.Name.LocalName == "meta"
               & el.FirstAttribute.Name == "geo.position"
                select (String) el.LastAttribute.Value;

Here is my XDocument:

<span>
  <!--CTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dt -->
  <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
      <meta content="application/xhtml+xml; charset=utf-8" http-equiv="Content-Type" />
      <meta content="text/css" http-equiv="Content-Style-Type" />
      <meta name="geo.position" content="41.8;12.23" />
      <meta name="geo.placename" content="RomeFiumicino, Italy" />
      <title>RomeFiumicino, Italy</title>
    </head>
    <body />
  </html>
</span>

Edit: My query as given returns nothing. The "inner" query seems to return a list of all the meta elements instead of just the one element I want.

Edit: The following LINQ query works against the same XDocument to retrieve a table with class name "data":

    var dataTable =
        from el in xdoc.Descendants()
        where (string)el.Attribute("class") == "data"
        select el;
+4  A: 

A span around your html tag?

You could do this with XLinq, but it would only support well-formed XML. You might want to look at the HTML Agility Pack instead.

Edit - This works for me:

string xml = "...";
var geoPosition = XElement.Parse(xml).Descendants().
    Where(e => e.Name.LocalName == "meta" &&
        e.Attribute("name") != null &&
        e.Attribute("name").Value == "geo.position").
    Select(e => e.Attribute("content").Value).
    SingleOrDefault();
Thorarin
Thanks much, Thorarin. I used the HTML Agility Pack to get the XDocument; the pack added the span.
Tom A
This isn't well-formed XML? Sure looks that way to the parser.
Robert Rossney
Yeah, it actually is. I noticed a missing double quote, but didn't notice the doctype was actually converted to an XML comment ;)
Thorarin
Excellent! That works for me, too. Thanks again, Thorarin.
Tom A
+1  A: 

I agree with Thorarin: use the HTML Agility Pack, it's much more robust.

However, I suspect the problem you are having with LINQ to XML is caused by the namespace. See MSDN here for how to handle namespaces in your queries.

"If you have XML that is in a default namespace, you still must declare an XNamespace variable, and combine it with the local name to make a qualified name to be used in the query.

One of the most common problems when querying XML trees is that if the XML tree has a default namespace, the developer sometimes writes the query as though the XML were not in a namespace."
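
As a rough illustration of that advice applied to the document in the question, something like the following should work. This is only a sketch: the class and method names (`GeoPositionExample`, `FindGeoPosition`) are made up for the example, and the XML is a trimmed-down copy of the one posted above.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class GeoPositionExample
{
    // Hypothetical helper: returns the content attribute of the meta element
    // whose name attribute is "geo.position", or null if there is none.
    public static string FindGeoPosition(string xml)
    {
        // The html element declares this as its default namespace, so its
        // descendants (head, meta, title, ...) are all in this namespace too.
        XNamespace xhtml = "http://www.w3.org/1999/xhtml";

        return XDocument.Parse(xml)
            .Descendants(xhtml + "meta")   // qualified name: namespace + local name
            // Unprefixed attributes are in no namespace, so plain names work here.
            .Where(m => (string)m.Attribute("name") == "geo.position")
            .Select(m => (string)m.Attribute("content"))
            .SingleOrDefault();
    }

    public static void Main()
    {
        // Trimmed-down version of the document from the question.
        string xml =
            @"<span>
                <html xmlns=""http://www.w3.org/1999/xhtml"">
                  <head>
                    <meta content=""text/css"" http-equiv=""Content-Style-Type"" />
                    <meta name=""geo.position"" content=""41.8;12.23"" />
                  </head>
                  <body />
                </html>
              </span>";

        Console.WriteLine(FindGeoPosition(xml));
    }
}
```

Casting the XAttribute to string (rather than reading `.Value`) returns null when the attribute is missing, so meta elements without a name attribute are skipped instead of throwing.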

Dan Diplo
Thanks, Dan. Yes, I am a big fan of the Agility Pack, which got me far enough to have this problem. :) I have other Linq queries which *do* work against the same doc. I added an example of the query, but not the big table that it extracts for me.
Tom A
+1  A: 

I'd bet that the problem you're having comes from not referencing the namespace correctly with an XmlNamespaceManager. Here are two ways to do it:

string xml =
        @"<span>
   <!--CTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 Transitional//EN""
        ""http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dt -->
   <html xmlns=""http://www.w3.org/1999/xhtml"">
    <head>
     <meta content=""application/xhtml+xml; charset=utf-8"" http-equiv=""Content-Type"" />
      <meta content=""text/css"" http-equiv=""Content-Style-Type"" />
      <meta name=""geo.position"" content=""41.8;12.23"" />
      <meta name=""geo.placename"" content=""RomeFiumicino, Italy"" />
      <title>RomeFiumicino, Italy</title>
    </head>
    <body />
   </html>
    </span>";

    string ns = "http://www.w3.org/1999/xhtml";
    XmlNamespaceManager nsm;

    // pre-Linq:
    XmlDocument d = new XmlDocument();
    d.LoadXml(xml);
    nsm = new XmlNamespaceManager(d.NameTable);
    nsm.AddNamespace("h", ns);

    Console.WriteLine(d.SelectSingleNode(
        "/span/h:html/h:head/h:meta[@name='geo.position']/@content", nsm).Value);

    // Linq - note that you have to create an XmlReader so that you can
    // use its NameTable in creating the XmlNamespaceManager:
    XmlReader xr = XmlReader.Create(new StringReader(xml));
    XDocument xd = XDocument.Load(xr);
    nsm = new XmlNamespaceManager(xr.NameTable);
    nsm.AddNamespace("h", ns);

    Console.WriteLine(
        xd.XPathSelectElement("/span/h:html/h:head/h:meta[@name='geo.position']", nsm)
            .Attribute("content").Value);
Robert Rossney
Thanks, Robert.
Tom A