Hi all,

I want to find all URLs (http://...) in the page and replace them with anchors (<a></a>), with one requirement: do not touch existing anchors or the page definition (doctype), such as:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

So I need to match only URLs that appear as plain text.

I'm trying to override the page's rendering, so I registered a browser adapter:

<browser refID="default">
 <controlAdapters>
  <adapter controlType="System.Web.Mvc.ViewPage"
     adapterType="Facad.Adapters.AnchorAdapter" />
 </controlAdapters>
</browser>

The adapter looks like this:

public class AnchorAdapter : PageAdapter
{
 protected override void Render(HtmlTextWriter writer)
 {
  /* Get page output into string */
  var sb = new StringBuilder();
  TextWriter tw = new StringWriter(sb);
  var htw = new HtmlTextWriter(tw);

  // Render into my writer
  base.Render(htw);

  string page = sb.ToString();
  //regular expression 
  Regex regx = new Regex("http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);

  //get the first match 
  Match match = regx.Match(page); 

  //loop through matches 
  while (match.Success)
  {

   //output the match info 
   System.Web.HttpContext.Current.Response.Write("<p>url match: " + match.Groups[0].Value+"</p>");

   //get next match 
   match = match.NextMatch();
  }

  writer.Write(page);
 }
}
A:

You just need to look one character before and after the URL to see whether it is in quotes. It's unlikely that someone would paste a quoted URL as plain text, but URLs inside tags and doctypes are normally quoted. So your regex becomes:

(^|[^'"])(http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?)([^'"]|$)

(^|[^'"]) means start of string or a character that is NOT a quote. ([^'"]|$) means a character that is NOT a quote or end of string.

The extra parentheses around the old regex make it a capture group, so you can retrieve the actual URL as group 2 (\2, or match.Groups[2] in .NET) instead of the whole match, which also includes whatever the guards matched on either side of the URL.
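As a rough illustration of the guard-plus-capture-group idea above, here is a minimal C# sketch. The `UrlLinker` class and `Linkify` method are hypothetical names, and the URL body `[^\s<>'"]+` is a simplified stand-in for the pattern in the question, not a recommended URL matcher. It keeps the leading guard as group 1 and the URL as group 2; the trailing guard is left out here because the simplified body already stops at a quote.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlLinker
{
    // Group 1: start of input or one non-quote character (the guard).
    // Group 2: the URL itself. The body [^\s<>'"]+ is an assumption made
    // for this sketch, not the pattern from the question.
    private static readonly Regex PlainUrl = new Regex(
        @"(^|[^'""])(http://[^\s<>'""]+)",
        RegexOptions.IgnoreCase);

    public static string Linkify(string html)
    {
        // $1 writes the guard character back; $2 is the bare URL.
        return PlainUrl.Replace(html, "$1<a href=\"$2\">$2</a>");
    }

    public static void Main()
    {
        // Plain-text URL: gets wrapped in an anchor.
        Console.WriteLine(Linkify("See http://example.com/page for details"));

        // Quoted URL inside a doctype: the guard rejects it, so the input
        // comes back unchanged.
        Console.WriteLine(Linkify(
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" " +
            "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"));
    }
}
```

The replacement string writes `$1` back so the character consumed by the guard is not lost. One limit of this approach: a URL used as the visible text of an existing anchor is preceded by `>`, not by a quote, so the guard alone would still match it.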

BTW, your URL regex looks pretty bad; there are more compact and accurate forms. You really don't need to escape EVERYTHING.
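To illustrate the point about escaping: inside a character class most punctuation is already literal, so usually only `\`, `-`, `]` and a leading `^` need care. The sketch below compares the pattern from the question with a trimmed version that keeps the same character sets. `EscapeDemo`, `Original` and `Compact` are hypothetical names, and the sketch only suggests that the two patterns match the same text, not that either is a good URL matcher.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // The pattern from the question, written as a regular C# string.
    private static readonly Regex Original = new Regex(
        "http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?",
        RegexOptions.IgnoreCase);

    // Same character sets with only the necessary escapes, written as a
    // verbatim string so the regex backslashes are not doubled.
    private static readonly Regex Compact = new Regex(
        @"http://[\w+?.]+[a-zA-Z0-9_~!@#$%^&*()\-=+\\/?.:;',]*",
        RegexOptions.IgnoreCase);

    public static void Main()
    {
        string text = "visit http://www.example.com/path?x=1&y=2 today";

        string a = Original.Match(text).Value;
        string b = Compact.Match(text).Value;

        Console.WriteLine(a);      // the text matched by the original pattern
        Console.WriteLine(a == b); // True when both patterns match the same text
    }
}
```

Using a verbatim string (`@"..."`) removes the C# layer of escaping, which accounts for much of the noise in the original pattern.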

SpliFF
Could you provide any samples of good regexes?
omoto