Hi,

My problem is a bit case-specific.

First of all, it only concerns <p> tags, not any other tag, so you need not worry about other tags.

I have an HTML document that is the output of some software, but it has errors such as unclosed <p> tags.

E.g., I have read the whole document into a string, and it looks like this:

    <html>
    ....
    ....
      <head>
      </head>
    ....
    ....
       <body>

    ...
    ...
    <p>                 // tag is to be removed as no closing tag

    <p align="left">   AAA   </p>
    <p class="style6">   BBB    </P>
    <p class="style1" align="center">    CCC    </P>

    <p align="left">  DDD               // tag is to be removed as no closing tag
    <p class="style6">   EEE              // tag is to be removed as no closing tag
    <p class="style1" align="center">    FFF             // tag is to be removed as no closing tag

    <p class="style15"><strong>xxyyzz</strong><br/></p>

    <p>                // tag is to be removed as no closing tag



    <p> stack Overflow </P>


       </body>
      </html>

The tags with DDD, EEE, FFF and the bare unclosed <p> tags are to be removed. As you can see, it should work for every unclosed <p> tag, whether or not it has attributes such as class or align.

I also want to mention that there is never a <p> tag inside another <p> tag. I mean:

<p>
    <p>
    </p>

     <p>
     </p>

</p>

Such a condition will never occur.

I tried using regex and StringBuilder but could not get a perfect answer.

Thanks a lot in advance to those who help.

Regards

+1  A: 

You might get better results using the Html Agility Pack:

It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML.

Just load the document into the DOM, iterate over the elements looking for <p> and filter them out, almost like you were doing valid XML manipulation.

chakrit
Is it possible to do this without the Html Agility Pack? I am not supposed to use it, which is why I'm asking for a regex or some other method.
Sangram
Of course it is, but I'd rather encourage you to use this instead of wasting time doing fuzzy regex searches.
chakrit
A: 

First of all, please have a look here. If that didn't deter you from using regular expressions for parsing HTML (and because I understand it's a very specific case that might not warrant using a full DOM parser, even though that's the absolute best recommended way), I've posted an answer to a similar question here; you can easily adapt it for your case, but please understand that it's not recommended and many things can go wrong if you decide to use it (including, as outlined in the first link above, the end of the universe etc. :P).

If the regex I pointed you to seems too complex or you're having problems understanding or simplifying it, post a comment and I'll add more clarifications.
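
For illustration only, here is a minimal sketch of what a regex for this narrow case could look like. It is not the regex from the linked answer; the class and method names are made up for the example, and it assumes no nested <p> tags, no HTML comments, and no > characters inside attribute values. It removes every opening <p ...> tag that is not followed by a </p> before the next opening <p ...> tag, leaving the text after the tag in place.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexParagraphCleaner
{
    // <p\b[^>]*>        an opening <p> tag with optional attributes
    //                   (\b keeps <pre>, <param> etc. from matching)
    // (?! ... </p>)     fail if a </p> can be reached...
    // (?:(?!<p\b).)*?   ...without crossing another opening <p> tag
    private static readonly Regex UnclosedP = new Regex(
        @"<p\b[^>]*>(?!(?:(?!<p\b).)*?</p>)",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical helper: strips only the unmatched opening tags.
    public static string RemoveUnclosed(string html)
    {
        return UnclosedP.Replace(html, string.Empty);
    }
}
```

IgnoreCase is there because the sample document mixes </p> and </P>, and Singleline lets the lookahead scan across line breaks.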

Alex Paven
Thanks for that first link. I about killed myself trying to stifle laughter here at work. :D
kcoppock
@Alex, @kcoppock: It's done, just take a look at the code. It's not very great code, but I am not supposed to use the Html Agility Pack, so I had to do it manually.
Sangram
+1  A: 

Disclaimer: Please note that I do not advocate trying to parse arbitrary HTML with regular expressions or simple substring matches. The solution below is for this specific problem, which appears to be purposely limited to make parsing possible with simple methods. In general, I agree with the consensus: To parse HTML, use an HTML parser.

That said . . .

Given that nested <p> tags aren't allowed, and assuming that there aren't any HTML comments allowed, it should be relatively easy to do the following in a loop to find and eliminate all <p> tags that have no corresponding </p>.

string inputText = GetHtmlText();
int startTag = inputText.IndexOf("<p>");
while (startTag != -1)
{
    // Resume scanning just past the opening tag ("<p>" is 3 characters).
    int scanPos = startTag + 3;
    // Now look for a closing tag or another open tag
    int closeTag = inputText.IndexOf("</p>", scanPos);
    int nextStartTag = inputText.IndexOf("<p>", scanPos);
    if (closeTag == -1 || (nextStartTag != -1 && nextStartTag < closeTag))
    {
        // Error at position startTag.  No closing tag.
    }
    else
    {
        // You have a full paragraph between startTag and (closeTag + 4).
    }
    startTag = nextStartTag;
}

The code assumes that the strings <p> and </p> cannot exist in the text except as actual paragraph open and closing tags. If you can make that guarantee, then the above (or something very similar) should work quite well.

ADDED:

Handling things like <p class="classname">, etc., gets a little less sure. If you can guarantee that there won't be any > characters between the opening <p and the closing >, then you can modify the code above to search for <p as well as for <p>, and if found then locate the closing >. It's a little bit messy, but not particularly difficult.

All that said, I would not recommend this approach for parsing arbitrary HTML, because of the caveats I've already stated: it won't handle comments and it makes what are probably invalid assumptions about the format of the HTML in general. It also won't handle things like <p > and </p >, both of which are perfectly valid (and that I've encountered in the wild).
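
As a rough sketch of the modification described above (search for <p, then locate the closing >), something like the following could work under the same assumptions: no nested <p> tags, no comments, and no > inside attribute values. The names ParagraphCleaner, FindOpenTag and RemoveUnclosedParagraphTags are invented for this example. It removes only the unmatched opening tag and keeps the text that followed it.

```csharp
using System;
using System.Text;

public static class ParagraphCleaner
{
    // Finds the next "<p" that is followed by '>' or whitespace,
    // so tags such as <pre> or <param> are not mistaken for <p>.
    private static int FindOpenTag(string html, int from)
    {
        int i = html.IndexOf("<p", from, StringComparison.OrdinalIgnoreCase);
        while (i != -1)
        {
            int after = i + 2;
            if (after < html.Length &&
                (html[after] == '>' || char.IsWhiteSpace(html[after])))
            {
                return i;
            }
            i = html.IndexOf("<p", after, StringComparison.OrdinalIgnoreCase);
        }
        return -1;
    }

    public static string RemoveUnclosedParagraphTags(string html)
    {
        var result = new StringBuilder(html.Length);
        int copiedUpTo = 0;                 // everything before this index is handled
        int open = FindOpenTag(html, 0);
        while (open != -1)
        {
            int tagEnd = html.IndexOf('>', open);
            if (tagEnd == -1) break;        // truncated tag: leave the rest untouched

            int close = html.IndexOf("</p>", tagEnd + 1, StringComparison.OrdinalIgnoreCase);
            int next = FindOpenTag(html, tagEnd + 1);

            // Unclosed if there is no </p> at all, or another <p> opens first.
            if (close == -1 || (next != -1 && next < close))
            {
                result.Append(html, copiedUpTo, open - copiedUpTo);
                copiedUpTo = tagEnd + 1;    // skip the opening tag itself
            }
            open = next;
        }
        result.Append(html, copiedUpTo, html.Length - copiedUpTo);
        return result.ToString();
    }
}
```

Building the output in a StringBuilder avoids recomputing indexes after each removal, which is where hand-rolled versions of this loop tend to go wrong. The comparison is case-insensitive so that </P> counts as a closing tag.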

Jim Mischel
Any idea how to deal with <p class="style1" align="center"> or <p class="style1">?
Sangram
I would just substitute "<p" for "<p>" in the loop. Then, once you find an unmatched tag that needs to be removed, just remove from the index of "<" to the first index of ">". That will be your full "p" tag.
kcoppock
@Sangram: see my additional information.
Jim Mischel
@Jim: Sure, thanks.
Sangram
@Jim: It's done, just take a look at the code. It's not very great code, but I am not supposed to use the Html Agility Pack, so I had to do it manually.
Sangram
+1  A: 

I really appreciate the help from all of you, especially Jim and Alex. I tried it and it's working nicely. Thanks a lot.

    public static string CleanUpXHTML(string xhtml)
    {
        int pOpen = 0, pClose = 0, pSlash = 0, pNext = 0, length = 0;
        pOpen = xhtml.IndexOf("<p", 0);
        if (pOpen < 0)
        {
            // No <p tag at all: nothing to clean up.
            return xhtml;
        }
        pClose = xhtml.IndexOf(">", pOpen);      // end of the current opening tag
        pSlash = xhtml.IndexOf("</p>", pClose);  // next closing tag
        pNext = xhtml.IndexOf("<p", pClose);     // next opening tag

        while (pSlash > -1)
        {
            if (pSlash < pNext)
            {
                // The current tag is closed before the next one opens: keep it and move on.
                pOpen = pNext;
                pClose = xhtml.IndexOf(">", pOpen);
                pSlash = xhtml.IndexOf("</p>", pClose);
                pNext = xhtml.IndexOf("<p", pClose);
            }
            else
            {
                length = pClose - pOpen + 1;
                if (pNext < 0 && pSlash > 0)
                {
                    // Last <p tag in the document, and it is closed: done.
                    break;
                }

                // Another <p opens before any </p>: remove the unclosed opening tag.
                xhtml = xhtml.Remove(pOpen, length);

                // Shift the index of the next tag back by the removed length.
                pOpen = pNext - length;
                pClose = xhtml.IndexOf(">", pOpen);
                pSlash = xhtml.IndexOf("</p>", pClose);
                pNext = xhtml.IndexOf("<p", pClose);
            }

            if (pSlash < 0)
            {
                // No closing tags remain: every remaining <p tag is unclosed, so remove them all.
                int lastp = 0, lastclosep = 0, lastnextp = 0, length3 = 0;

                lastp = xhtml.IndexOf("<p", pOpen - 1);
                lastclosep = xhtml.IndexOf(">", lastp);
                lastnextp = xhtml.IndexOf("<p", lastclosep);

                while (lastp > 0)
                {
                    length3 = lastclosep - lastp + 1;
                    xhtml = xhtml.Remove(lastp, length3);
                    if (lastnextp < 0)
                    {
                        break;
                    }
                    lastp = lastnextp - length3;
                    lastclosep = xhtml.IndexOf(">", lastp);
                    lastnextp = xhtml.IndexOf("<p", lastclosep);
                }

                break;
            }
        }

        return xhtml;
    }
Sangram
This code is for a specific case; please do not use it as a general parsing technique.
Sangram
Good job. It's considered good manners to upvote helpful answers and select one as "the" answer.
Jim Mischel