views: 523
answers: 4

I want to parse a random website, modify the content so that every word is a link (for a dictionary tooltip) and then display the website in an iframe.

I'm not looking for a complete solution, but for a hint or a possible strategy. The linking is my problem; parsing the website and displaying it in an iframe is quite simple. So basically I have a string with all the HTML content. I'm not even sure whether it's better to do it server-side or after the page is loaded, with JS.

I'm working with Ruby on Rails, jQuery, jRails.

Note: The content of the href attribute depends on the word.

Clarification: I tried a regexp and it already kind of works:

@site.gsub!(/[A-Za-z]+(?:['-][A-Za-z]+)?|\d+(?:[,.]\d+)?/) {|word| '<a href="">' + word + '</a>'}

But the problem is to only replace words in the text and leave the HTML as it is. So I guess it is a regex problem...

Thanks for any ideas.

+1  A: 

It sounds like you have it mostly planned out already.

Split the content into words and then for each word, create a link, such as <a href="http://dictionary.reference.com/dic?q=whatever&amp;search=search">whatever</a>
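A minimal Ruby sketch of that step (the dictionary URL mirrors the example above and is just a placeholder; this ignores the HTML-tag problem, which comes up below):

```ruby
require 'cgi'

# Build a dictionary-lookup link for a single word. The URL is the
# placeholder example from above; swap in whatever tooltip endpoint you use.
def dictionary_link(word)
  %(<a href="http://dictionary.reference.com/dic?q=#{CGI.escape(word)}&amp;search=search">#{word}</a>)
end

puts "consider each word".split.map { |w| dictionary_link(w) }.join(' ')
```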

EDIT (based on your comment): Ahh ... I recommend you search around for screen scraping techniques. Most of them should start with removing anything between < and > characters, and replacing <br> and <p> with newlines.

belgariontheking
thanks, but that is kind of difficult. I tried @site.gsub!(/[A-Za-z]+(?:['-][A-Za-z]+)?|\d+(?:[,.]\d+)?/) {|word| '<a href="">' + word + '</a>'} but I need a way to only replace words in the text, not HTML tags. Any ideas?
ole_berlin
+2  A: 

I don't think a regexp is going to work for this - or, at least, it will always be brittle. A better way is to parse the page using Hpricot or Nokogiri, then go through it and modify the nodes that are plain text.

Sarah Mei
A: 

Simple. Hash the HTML, run your regex, then unhash the HTML.

<?php
class ht
{
 static $hashes = array();

 # hashes everything that matches $pattern and saves matches for later unhashing
 static function hash($text, $pattern) {
  return preg_replace_callback($pattern, array('self', 'push'), $text);
 }

 # hashes all html tags and saves them
 static function hash_html($html) {
  return self::hash($html, '`<[^>]+>`');
 }

 # hashes and saves $value, returns key
 static function push($value) {
  if(is_array($value)) $value = $value[0];
  static $i = 0;
  $key = "\x05".++$i."\x06";
  self::$hashes[$key] = $value;
  return $key;
 }

 # unhashes all saved values found in $text
 static function unhash($text) {
  return str_replace(array_keys(self::$hashes), self::$hashes, $text);
 }

 static function get($key) {
  return self::$hashes[$key];
 }

 static function clear() {
  self::$hashes = array();
 }
}
?>

Example usage:

$hashed = ht::hash_html($your_html);
// run your word->href converter over $hashed to get $your_formatted_html
$result = ht::unhash($your_formatted_html);

Oh... right, I wrote this in PHP. Guess you'll have to convert it to Ruby or JS, but the idea is the same.
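For what it's worth, here is a rough Ruby translation of the same hash/unhash idea (a sketch, not the original author's code; the \x05/\x06 placeholder delimiters follow the PHP version, and the empty href is a placeholder):

```ruby
# Stash HTML tags behind placeholder keys, run the word regex over
# what's left, then restore the tags.
module HtmlHasher
  def self.linkify(html)
    stash = {}
    i = 0
    # Replace every tag with a key that can't match the word regex.
    hashed = html.gsub(/<[^>]+>/) do |tag|
      key = "\x05#{i += 1}\x06"
      stash[key] = tag
      key
    end
    # The word regex now only ever sees plain text.
    linked = hashed.gsub(/[A-Za-z]+/) { |w| %(<a href="">#{w}</a>) }
    # Put the original tags back.
    linked.gsub(/\x05\d+\x06/) { |key| stash[key] }
  end
end

puts HtmlHasher.linkify('<p>hello</p>')
```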

Mark
Your regex is a bit hard to read, but you could also match all the text between > and <, then pass that off to another function that just explodes it into words. Assuming you have well-formed HTML (that starts and ends with a tag), you shouldn't need the edge cases.
Mark
That would make a total mess of scripts of the form <script> ... </script>
Jim Mischel
No it wouldn't actually. You just need to modify the hash pattern to hash everything inside the script tags too so that the code doesn't get parsed. In fact, that's exactly why I wrote this class.
Mark
+1  A: 

I would use Nokogiri to remove the HTML structure before you use the regex.

no_html = Nokogiri::HTML(html_as_string).text
Sam C