views:

63

answers:

1

Hi,

I've been trying to get either an <object> or an <embed> tag using:

HtmlNode videoObjectNode = doc.DocumentNode.SelectSingleNode("//object");
HtmlNode videoEmbedNode = doc.DocumentNode.SelectSingleNode("//embed");

This doesn't seem to work.

Can anyone please tell me how to get these tags and their InnerHtml?

A YouTube embedded video looks like this:

    <embed height="385" width="640" type="application/x-shockwave-flash" 
src="http://s.ytimg.com/yt/swf/watch-vfl184368.swf" id="movie_player" flashvars="..." 
allowscriptaccess="always" allowfullscreen="true" bgcolor="#000000">

I got a feeling the JavaScript might stop the swf player from working, hope not...

Cheers

+3  A: 

Update 2010-08-26 (in response to OP's comment):

I think you're thinking about it the wrong way, Alex. Suppose I wrote some C# code that looked like this:

string codeBlock = "if (x == 1) Console.WriteLine(\"Hello, World!\");";

Now, if I wrote a C# parser, should it recognize the contents of the string literal above as C# code and highlight it (or whatever) as such? No, because in the context of a well-formed C# file, that text represents a string to which the codeBlock variable is being assigned.

Similarly, in the HTML on YouTube's pages, the <object> and <embed> elements are not really elements at all in the context of the current HTML document. They are the contents of string values residing within JavaScript code.

In fact, if HtmlAgilityPack did ignore this fact and attempted to recognize all portions of text that could be HTML, it still wouldn't succeed with these elements because, being inside JavaScript, they're heavily escaped with \ characters (notice the precarious Unescape method in the code I posted to get around this issue).

I'm not saying my hacky solution below is the right way to approach this problem; I'm just explaining why obtaining these elements isn't as straightforward as grabbing them with HtmlAgilityPack.


YouTubeScraper

OK, Alex: you asked for it, so here it is. Some truly hacky code to extract your precious <object> and <embed> elements out from that sea of JavaScript.

class YouTubeScraper
{
    public HtmlNode FindObjectElement(string url)
    {
        HtmlNodeCollection scriptNodes = FindScriptNodes(url);

        for (int i = 0; i < scriptNodes.Count; ++i)
        {
            HtmlNode scriptNode = scriptNodes[i];

            string javascript = scriptNode.InnerHtml;

            int objectNodeLocation = javascript.IndexOf("<object");

            if (objectNodeLocation != -1)
            {
                string htmlStart = javascript.Substring(objectNodeLocation);

                int objectNodeEndLocation = htmlStart.IndexOf(">\" :");

                if (objectNodeEndLocation != -1)
                {
                    string finalEscapedHtml = htmlStart.Substring(0, objectNodeEndLocation + 1);

                    string unescaped = Unescape(finalEscapedHtml);

                    var objectDoc = new HtmlDocument();

                    objectDoc.LoadHtml(unescaped);

                    HtmlNode objectNode = objectDoc.GetElementbyId("movie_player");

                    return objectNode;
                }
            }
        }

        return null;
    }

    public HtmlNode FindEmbedElement(string url)
    {
        HtmlNodeCollection scriptNodes = FindScriptNodes(url);

        for (int i = 0; i < scriptNodes.Count; ++i)
        {
            HtmlNode scriptNode = scriptNodes[i];

            string javascript = scriptNode.InnerHtml;

            int approxEmbedNodeLocation = javascript.IndexOf("<\\/object>\" : \"<embed");

            if (approxEmbedNodeLocation != -1)
            {
                string htmlStart = javascript.Substring(approxEmbedNodeLocation + 15);

                int embedNodeEndLocation = htmlStart.IndexOf(">\";");

                if (embedNodeEndLocation != -1)
                {
                    string finalEscapedHtml = htmlStart.Substring(0, embedNodeEndLocation + 1);

                    string unescaped = Unescape(finalEscapedHtml);

                    var embedDoc = new HtmlDocument();

                    embedDoc.LoadHtml(unescaped);

                    HtmlNode videoEmbedNode = embedDoc.GetElementbyId("movie_player");

                    return videoEmbedNode;
                }
            }
        }

        return null;
    }

    protected HtmlNodeCollection FindScriptNodes(string url)
    {
        var doc = new HtmlDocument();

        WebRequest request = WebRequest.Create(url);
        using (var response = request.GetResponse())
        using (var stream = response.GetResponseStream())
        {
            doc.Load(stream);
        }

        HtmlNode root = doc.DocumentNode;
        HtmlNodeCollection scriptNodes = root.SelectNodes("//script");

        return scriptNodes;
    }

    static string Unescape(string htmlFromJavascript)
    {
        // The JavaScript has escaped all of its HTML using backslashes. We need
        // to reverse this.

        // DISCLAIMER: I am a TOTAL Regex n00b; I make no claims as to the robustness
        // of this code. If you could improve it, please, I beg of you to do so. Personally,
        // I tested it on a grand total of three inputs. It worked for those, at least.
        return Regex.Replace(htmlFromJavascript, @"\\(.)", UnescapeFromBeginning);
    }

    static string UnescapeFromBeginning(Match match)
    {
        string text = match.ToString();

        if (text.StartsWith("\\"))
        {
            return text.Substring(1);
        }

        return text;
    }
}

And in case you're interested, here's a little demo I threw together (super fancy, I know):

class Program
{
    static void Main(string[] args)
    {
        var scraper = new YouTubeScraper();

        HtmlNode davidAfterDentistEmbedNode = scraper.FindEmbedElement("http://www.youtube.com/watch?v=txqiwrbYGrs");
        Console.WriteLine("David After Dentist:");
        Console.WriteLine(davidAfterDentistEmbedNode.OuterHtml);
        Console.WriteLine();

        HtmlNode drunkHistoryObjectNode = scraper.FindObjectElement("http://www.youtube.com/watch?v=jL68NyCSi8o");
        Console.WriteLine("Drunk History:");
        Console.WriteLine(drunkHistoryObjectNode.OuterHtml);
        Console.WriteLine();

        HtmlNode jessicaDailyAffirmationEmbedNode = scraper.FindEmbedElement("http://www.youtube.com/watch?v=qR3rK0kZFkg");
        Console.WriteLine("Jessica's Daily Affirmation:");
        Console.WriteLine(jessicaDailyAffirmationEmbedNode.OuterHtml);
        Console.WriteLine();

        HtmlNode jazzerciseObjectNode = scraper.FindObjectElement("http://www.youtube.com/watch?v=VGOO8ZhWFR4");
        Console.WriteLine("Jazzercise - Move your Boogie Body:");
        Console.WriteLine(jazzerciseObjectNode.OuterHtml);
        Console.WriteLine();

        Console.Write("Finished! Hit Enter to quit.");
        Console.ReadLine();
    }
}

Original Answer

Why not try using the element's Id instead?

HtmlNode videoEmbedNode = doc.GetElementbyId("movie_player");

Update: Oh man, you're searching for HTML tags that are themselves within JavaScript? That's definitely why this isn't working. (They aren't really tags to be parsed from the perspective of HtmlAgilityPack; all of that JavaScript is really one big string inside a <script> tag.) Maybe there's some way you can parse the <script> tag's inner text itself as HTML and go from there.

Dan Tao
I get an error with the code: `HtmlAgilityPack.HtmlDocument' does not contain a definition for 'GetElementById' and no extension method 'GetElementById'`
AlexW
@AlexW: Yikes, OK then what am I thinking of? Just a sec...
Dan Tao
@Dan Tao - last few questions I've put to SO haven't got answered, appreciate the tip ;)
AlexW
@AlexW: It looks like the `b` should be lowercase? Try that (`GetElementbyId`) and see if you have any luck.
Dan Tao
@Dan Tao - still didn't scrape anything `videoEmbedNode == null;`
AlexW
@AlexW: Gimme a url and I'll take a stab at it as well.
Dan Tao
AlexW
@Dan Tao - Searching inside the script tags sounds like a plan. Thanks man!
AlexW
YouTube player is not part of the HTML document, it's added to the DOM by a script.
Franci Penov
@AlexW: I was able to get it to work. Did you get it? If not I can send you my code.
Dan Tao
@Dan Tao - no I wasn't able to get it working yet..would much appreciate your code!
AlexW
@AlexW: There you go, man. I need to take a shower now.
Dan Tao
@Dan Tao - thanks a lot dude! Looks like a neat workaround, can't wait to get it incorporated into the rest.
AlexW
@Dan Tao - the thing I don't get is why HtmlAgility can't go and locate those tags amongst the script. It sees the script just as any other web page data... maybe these tags are not defined in it's library?
AlexW
@AlexW: My response to that question ended up being too big to fit into a comment box, so I added it to my answer.
Dan Tao
@Dan Tao - it makes sense with the escape characters that it would a problem, thanks for explanation. In the end I used `StringBuilder.Replace("\", "");` and `StringBuilder.Replace(" + ", "");` to clean the code up nicely. Now all I have to do is stop it playing automatically (it's not nice when 10 YT videos start playing at the same time!). Now I have clean tags, I can use the Agility Pack to change the param tags (remove one) to this end. Thanks again for your effort, very helpful for a young dev :)
AlexW