Hi

I am not sure what I am doing wrong. I am trying to use the ASP.NET Regex.Replace method, but it keeps replacing the wrong item.

I have two replaces. The first one replaces exactly what I want. The second, which is almost a mirror image of the first, does not.

This is my sample code:

<%@ Page Title="Tour" Language="C#" MasterPageFile="~/Views/Shared/Site.Master" Inherits="System.Web.Mvc.ViewPage" %>
<asp:Content ID="Content1" ContentPlaceHolderID="HeadContent" runat="server">
    <title>Website Portfolio Section - VisionWebCS</title>
    <meta name="description" content="A" />
    <meta name="keywords" content="B" />
</asp:Content>
<asp:Content ID="Content2" ContentPlaceHolderID="MainContent" runat="server">
    <!-- **START** -->

I am looking to replace both meta tags.

<meta name=\"description\" content=\"A\" />
<meta name=\"keywords\" content=\"B\" />

In my code, I first replace the keywords meta tag with

<meta name=\"keywords\" content=\"C\" />

This works, so my next task is to replace the description meta tag with this

<meta name=\"description\" content=\"D\" />

This does not work. Instead, it replaces the "keywords" meta tag along with the "description" tag, so only one tag is left.

Here is my test program so you can all try it out. Just throw it into a C# console app.

        private const string META_DESCRIPTION_REGEX = "<\\s* meta \\s* name=\"description\" \\s* content=\"(?<Description>.*)\" \\s* />";
        private const string META_KEYWORDS_REGEX = "<\\s* meta \\s* name=\"keywords\" \\s* content=\"(?<Keywords>.*)\" \\s* />";
        private static RegexOptions regexOptions = RegexOptions.IgnoreCase
                                   | RegexOptions.Multiline
                                   | RegexOptions.CultureInvariant
                                   | RegexOptions.IgnorePatternWhitespace
                                   | RegexOptions.Compiled;

        static void Main(string[] args)
        {

            string text = "<%@ Page Title=\"Tour\" Language=\"C#\" MasterPageFile=\"~/Views/Shared/Site.Master\" Inherits=\"System.Web.Mvc.ViewPage\" %><asp:Content ID=\"Content1\" ContentPlaceHolderID=\"HeadContent\" runat=\"server\">    <title>Website Portfolio Section - VisionWebCS</title>    <meta name=\"description\" content=\"A\" />    <meta name=\"keywords\" content=\"B\" /></asp:Content><asp:Content ID=\"Content2\" ContentPlaceHolderID=\"MainContent\" runat=\"server\"><!-- **START** -->";
            Regex regex = new Regex(META_KEYWORDS_REGEX, regexOptions);
            string newKeywords = String.Format("<meta name=\"keywords\" content=\"{0}\" />", "C");
            string output = regex.Replace(text, newKeywords);

            Regex regex2 = new Regex(META_DESCRIPTION_REGEX, regexOptions);
            string newDescription = String.Format("<meta name=\"description\" content=\"{0}\" />", "D");
            string newOutput = regex2.Replace(output, newDescription);
            Console.WriteLine(newOutput);
        }

This gets me a final output of

<%@ Page Title="Tour" Language="C#" MasterPageFile="~/Views/Shared/Site.Master" Inherits="System.Web.Mvc.ViewPage" %>
<asp:Content ID="Content1" ContentPlaceHolderID="HeadContent" runat="server">
    <title>Website Portfolio Section - VisionWebCS</title>
    <meta name="description" content="D" />
</asp:Content>
<asp:Content ID="Content2" ContentPlaceHolderID="MainContent" runat="server">
    <!-- **START** -->

Thanks

+7  A: 

What are you doing wrong? You are parsing HTML with a regex!

Recommended library for .NET: HTML Agility Pack

Will
So - what would you do instead, then?
marc_s
@Will: +1, but you should provide a link or code snippet showing how to parse it with a proper parser
RageZ
the graphic alone is funny enough to click
bobby
AGREED! USE THE DOM!
Visionary Software Solutions
@marc_s my thoughts exactly. A quick glance at the article does not show what to use instead. It also says that in certain situations you can use regex. I have been able to parse HTML many times without a problem, and it was 100 times more complicated than what I am doing right now.
chobo2
You are highly encouraged to check out http://msdn.microsoft.com/en-us/library/aa290341%28VS.71%29.aspx http://www.w3schools.com/htmldom/default.asp
Visionary Software Solutions
Link to a parsing library added
Will
A: 

http://www.codinghorror.com/blog/archives/001311.html

Priyank Bolia
The downvoting wasn't fair, so I pushed it up again; it's the same link as mine. The poster has a good reading list, and it's all about forming the post to be more informative; a link on its own does not invite clicking.
Will
Sorry, I didn't see your link. I was reading the above link yesterday, so I posted it here as a reply to the question.
Priyank Bolia
A: 

Learn, love, and use the DOM. It is the W3C (HTML standards body) approved way to work with HTML and XML documents. Unless you have sufficient reason to believe your input HTML is horribly wrong, this is usually the best approach to start with.

Learn here

You are highly encouraged to check out Walkthrough: Accessing the DHTML DOM from C#

You may also want to try jQuery, as it makes it very easy to search the DOM. Like so.

Visionary Software Solutions
+4  A: 

To answer your question without useless life lessons, you are having trouble because of greedy quantifiers. Try making them lazy by adding question marks:

<meta\\s+?name=\"description\"\\s+?content=\"(?<Description>.*?)\"\\s*?/>

Sure, this regex won't work for every page in the world, but if you just need a quick replacement script for your own templates, regex is the fastest and easiest solution and the way to go.
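
For anyone who wants to try the lazy version against the two tags from the question, here is a minimal sketch. The class name `MetaReplaceDemo`, the helper `ReplaceMeta`, and the shortened input string are made up for illustration; they are not from the original program. It keeps `RegexOptions.IgnorePatternWhitespace` from the question, so the spaces in the pattern are only there for readability.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MetaReplaceDemo
{
    // Options taken from the question. IgnorePatternWhitespace makes the
    // engine ignore the literal spaces written inside the pattern.
    private const RegexOptions Options = RegexOptions.IgnoreCase
                                       | RegexOptions.CultureInvariant
                                       | RegexOptions.IgnorePatternWhitespace;

    // Hypothetical helper: rewrites the content of the meta tag with the given name.
    public static string ReplaceMeta(string html, string name, string newContent)
    {
        // (.*?) is lazy, so it stops at the first closing quote that is
        // followed by optional whitespace and "/>".
        string pattern = "<\\s* meta \\s+ name=\"" + Regex.Escape(name)
                       + "\" \\s+ content=\"(.*?)\" \\s* />";
        string replacement = String.Format("<meta name=\"{0}\" content=\"{1}\" />", name, newContent);

        // A MatchEvaluator is used so that a '$' in newContent is not
        // treated as a substitution pattern.
        return Regex.Replace(html, pattern, m => replacement, Options);
    }

    public static void Main()
    {
        // Shortened, single-line input (an assumption; the question uses the full page).
        string text = "<meta name=\"description\" content=\"A\" />    <meta name=\"keywords\" content=\"B\" />";

        string output = ReplaceMeta(text, "keywords", "C");
        string newOutput = ReplaceMeta(output, "description", "D");

        Console.WriteLine(newOutput);
        // prints: <meta name="description" content="D" />    <meta name="keywords" content="C" />
    }
}
```

Both tags survive because each lazy match ends at its own tag's `" />`.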

serg
Hmm, that works, but I don't get it. I thought that even with a greedy quantifier it would keep going until it sees the "/>" and stop. So why does it go further? Even when I checked how many matches it found, it always came back with one.
chobo2
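
On the "why does it go further" question: a greedy `.*` first runs to the end of the line and then backtracks only as far as needed for the rest of the pattern to match, so it stops at the last `" />` on the line, not the first. The test string is a single line, so the last `" />` belongs to the keywords tag, and the whole span counts as one match. A small sketch that prints what each quantifier captures (the class name `GreedyDemo`, the method `Capture`, and the shortened input are illustrative assumptions):

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyDemo
{
    // Shortened single-line input with both tags, as in the question's test string.
    public const string Input =
        "<meta name=\"description\" content=\"A\" />    <meta name=\"keywords\" content=\"B\" />";

    // Returns what the content group captures for the given quantifier (".*" or ".*?").
    public static string Capture(string quantifier)
    {
        string pattern = "<meta\\s+name=\"description\"\\s+content=\"(" + quantifier + ")\"\\s*/>";
        return Regex.Match(Input, pattern).Groups[1].Value;
    }

    public static void Main()
    {
        // Greedy: the capture swallows the end of the first tag and most of the second.
        Console.WriteLine(Capture(".*"));
        // prints: A" />    <meta name="keywords" content="B

        // Lazy: the capture stops at the first closing quote that lets the pattern finish.
        Console.WriteLine(Capture(".*?"));
        // prints: A
    }
}
```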
+1  A: 

I agree with @serg555's answer: the issue is the greedy quantifiers. Making them lazy with '?' should solve the problem:

<meta\\s*name=\"description\"\\s*content=\"(?<Description>.*?)\"\\s*/>
HishHash