Update 2010-08-26 (in response to OP's comment):
I think you're thinking about it the wrong way, Alex. Suppose I wrote some C# code that looked like this:
string codeBlock = "if (x == 1) Console.WriteLine(\"Hello, World!\");";
Now, if I wrote a C# parser, should it recognize the contents of the string literal above as C# code and highlight it (or whatever) as such? No, because in the context of a well-formed C# file, that text is just a string that is being assigned to the codeBlock variable.
Similarly, in the HTML on YouTube's pages, the <object> and <embed> elements are not really elements at all in the context of the current HTML document. They are the contents of string values residing within JavaScript code.
In fact, even if HtmlAgilityPack did ignore this fact and attempted to recognize every portion of text that could be HTML, it still wouldn't succeed with these elements because, being inside JavaScript, they're heavily escaped with \ characters (notice the precarious Unescape method in the code I posted to get around this issue).
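To make the escaping problem concrete, here is a minimal sketch using only the .NET standard library. The fragment in fromJavaScript is a made-up example of what such markup might look like inside a JavaScript string literal (it is not copied from a real YouTube page), and UnescapeDemo is a hypothetical name used only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class UnescapeDemo
{
    // Removes one level of backslash escaping: "\x" becomes "x" for any
    // character x other than a newline.
    public static string Unescape(string escaped)
    {
        return Regex.Replace(escaped, @"\\(.)", "$1");
    }

    static void Main()
    {
        // Hypothetical fragment as it might appear inside a JavaScript string literal.
        string fromJavaScript = @"<object id=\""movie_player\""><\/object>";

        Console.WriteLine(fromJavaScript);
        // prints: <object id=\"movie_player\"><\/object>

        Console.WriteLine(Unescape(fromJavaScript));
        // prints: <object id="movie_player"></object>
    }
}
```

Note that this treats every escape sequence the same way, so a JavaScript \n would come out as a plain n rather than a newline. That is only acceptable if the markup you are extracting contains no such sequences.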
I'm not saying my hacky solution below is the right way to approach this problem; I'm just explaining why obtaining these elements isn't as straightforward as grabbing them with HtmlAgilityPack.
YouTubeScraper
OK, Alex: you asked for it, so here it is. Some truly hacky code to extract your precious <object> and <embed> elements out from that sea of JavaScript.
using System.Net;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

class YouTubeScraper
{
    public HtmlNode FindObjectElement(string url)
    {
        HtmlNodeCollection scriptNodes = FindScriptNodes(url);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (scriptNodes == null)
        {
            return null;
        }

        for (int i = 0; i < scriptNodes.Count; ++i)
        {
            HtmlNode scriptNode = scriptNodes[i];
            string javascript = scriptNode.InnerHtml;

            int objectNodeLocation = javascript.IndexOf("<object");
            if (objectNodeLocation != -1)
            {
                string htmlStart = javascript.Substring(objectNodeLocation);

                // The markup ends where the JavaScript string literal closes: >" :
                int objectNodeEndLocation = htmlStart.IndexOf(">\" :");
                if (objectNodeEndLocation != -1)
                {
                    // + 1 keeps the closing '>' of the markup.
                    string finalEscapedHtml = htmlStart.Substring(0, objectNodeEndLocation + 1);
                    string unescaped = Unescape(finalEscapedHtml);

                    // Parse the recovered markup as its own small HTML document.
                    var objectDoc = new HtmlDocument();
                    objectDoc.LoadHtml(unescaped);

                    HtmlNode objectNode = objectDoc.GetElementbyId("movie_player");
                    return objectNode;
                }
            }
        }

        return null;
    }

    public HtmlNode FindEmbedElement(string url)
    {
        HtmlNodeCollection scriptNodes = FindScriptNodes(url);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (scriptNodes == null)
        {
            return null;
        }

        for (int i = 0; i < scriptNodes.Count; ++i)
        {
            HtmlNode scriptNode = scriptNodes[i];
            string javascript = scriptNode.InnerHtml;

            // The <embed> markup follows the end of the <object> string literal.
            int approxEmbedNodeLocation = javascript.IndexOf("<\\/object>\" : \"<embed");
            if (approxEmbedNodeLocation != -1)
            {
                // Skip the 15 characters of <\/object>" : " so we start at <embed.
                string htmlStart = javascript.Substring(approxEmbedNodeLocation + 15);

                // The markup ends where the JavaScript string literal closes: >";
                int embedNodeEndLocation = htmlStart.IndexOf(">\";");
                if (embedNodeEndLocation != -1)
                {
                    // + 1 keeps the closing '>' of the markup.
                    string finalEscapedHtml = htmlStart.Substring(0, embedNodeEndLocation + 1);
                    string unescaped = Unescape(finalEscapedHtml);

                    var embedDoc = new HtmlDocument();
                    embedDoc.LoadHtml(unescaped);

                    HtmlNode videoEmbedNode = embedDoc.GetElementbyId("movie_player");
                    return videoEmbedNode;
                }
            }
        }

        return null;
    }

    // Downloads the page and returns its <script> nodes.
    // The result is null if the page contains no <script> elements.
    protected HtmlNodeCollection FindScriptNodes(string url)
    {
        var doc = new HtmlDocument();

        WebRequest request = WebRequest.Create(url);
        using (var response = request.GetResponse())
        using (var stream = response.GetResponseStream())
        {
            doc.Load(stream);
        }

        HtmlNode root = doc.DocumentNode;
        HtmlNodeCollection scriptNodes = root.SelectNodes("//script");
        return scriptNodes;
    }

    static string Unescape(string htmlFromJavascript)
    {
        // The JavaScript has escaped all of its HTML using backslashes. We need
        // to reverse this.

        // DISCLAIMER: I am a TOTAL Regex n00b; I make no claims as to the robustness
        // of this code. If you could improve it, please, I beg of you to do so. Personally,
        // I tested it on a grand total of three inputs. It worked for those, at least.
        return Regex.Replace(htmlFromJavascript, @"\\(.)", UnescapeFromBeginning);
    }

    static string UnescapeFromBeginning(Match match)
    {
        string text = match.ToString();
        if (text.StartsWith("\\"))
        {
            return text.Substring(1);
        }

        return text;
    }
}
And in case you're interested, here's a little demo I threw together (super fancy, I know):
class Program
{
    static void Main(string[] args)
    {
        var scraper = new YouTubeScraper();

        HtmlNode davidAfterDentistEmbedNode = scraper.FindEmbedElement("http://www.youtube.com/watch?v=txqiwrbYGrs");
        Console.WriteLine("David After Dentist:");
        Console.WriteLine(davidAfterDentistEmbedNode.OuterHtml);
        Console.WriteLine();

        HtmlNode drunkHistoryObjectNode = scraper.FindObjectElement("http://www.youtube.com/watch?v=jL68NyCSi8o");
        Console.WriteLine("Drunk History:");
        Console.WriteLine(drunkHistoryObjectNode.OuterHtml);
        Console.WriteLine();

        HtmlNode jessicaDailyAffirmationEmbedNode = scraper.FindEmbedElement("http://www.youtube.com/watch?v=qR3rK0kZFkg");
        Console.WriteLine("Jessica's Daily Affirmation:");
        Console.WriteLine(jessicaDailyAffirmationEmbedNode.OuterHtml);
        Console.WriteLine();

        HtmlNode jazzerciseObjectNode = scraper.FindObjectElement("http://www.youtube.com/watch?v=VGOO8ZhWFR4");
        Console.WriteLine("Jazzercise - Move your Boogie Body:");
        Console.WriteLine(jazzerciseObjectNode.OuterHtml);
        Console.WriteLine();

        Console.Write("Finished! Hit Enter to quit.");
        Console.ReadLine();
    }
}
Original Answer
Why not try using the element's Id instead?
HtmlNode videoEmbedNode = doc.GetElementbyId("movie_player");
Update: Oh man, you're searching for HTML tags that are themselves within JavaScript? That's definitely why this isn't working. (They aren't really tags to be parsed from the perspective of HtmlAgilityPack; all of that JavaScript is really one big string inside a <script> tag.) Maybe there's some way you can parse the <script> tag's inner text itself as HTML and go from there.