ansaurus

Question

Removing incomplete P Tags (using REGEX or any other method)

Answer 1

+1 A:

You might get better results using the Html Agility Pack:

It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to what System.Xml offers, but for HTML.

Just load the document into the DOM, iterate over the elements looking for <p> elements, and filter them out, almost as if you were doing valid XML manipulation.

chakrit 2010-09-21 14:26:39

Is it possible to do this without the Html Agility Pack? I am not supposed to use it, which is why I asked for a regex or any other method.

Sangram 2010-09-21 14:31:37

Of course it is, but I'd encourage you to use it rather than waste time on fuzzy regex searches.

chakrit 2010-09-21 14:33:51

Answer 2

A:

First of all, please have a look here. If that didn't deter you from using regular expressions for parsing HTML (and because I understand it's a very specific case that might not warrant using a full DOM parser, even though that's the absolute best recommended way), I've posted an answer to a similar question here; you can easily adapt it for your case, but please understand that it's not recommended and many things can go wrong if you decide to use it (including, as outlined in the first link above, the end of the universe etc. :P).

If the regex I pointed you to seems too complex or you're having problems understanding or simplifying it, post a comment and I'll add more clarifications.
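For illustration only, here is a minimal regex sketch for the narrow case discussed in this thread. It is not the regex from the linked answer. It assumes bare <p> and </p> tags only (no attributes, no nesting, no comments), and the class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class RegexParagraphCleaner
{
    // Matches "<p>" only when no "</p>" appears before the next "<p>"
    // (or before the end of the text). Singleline lets '.' cross newlines.
    private static readonly Regex UnclosedOpenTag = new Regex(
        @"<p>(?!(?:(?!<p>).)*?</p>)",
        RegexOptions.Singleline);

    // Hypothetical helper: removes every unclosed "<p>" from the input.
    public static string RemoveUnclosedParagraphTags(string html)
    {
        return UnclosedOpenTag.Replace(html, string.Empty);
    }
}
```

Calling RegexParagraphCleaner.RemoveUnclosedParagraphTags on "<p>one<p>two</p>" leaves the closed paragraph alone and drops only the first opening tag. All the usual warnings about regex and HTML still apply.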

Alex Paven 2010-09-21 14:39:02

Thanks for that first link. I about killed myself trying to stifle laughter here at work. :D

kcoppock 2010-09-21 14:51:54

@Alex, @kcoppock: It's done, please take a look at the code. It's not great code, but I am not supposed to use the Html Agility Pack, so I had to do it manually.

Sangram 2010-09-22 11:22:33

Answer 3

+1 A:

Disclaimer: Please note that I do not advocate trying to parse arbitrary HTML with regular expressions or simple substring matches. The solution below is for this specific problem, which appears to be purposely limited to make parsing possible with simple methods. In general, I agree with the consensus: To parse HTML, use an HTML parser.

That said . . .

Given that nested <p> tags aren't allowed, and assuming that there aren't any HTML comments allowed, it should be relatively easy to do the following in a loop to find and eliminate all <p> tags that have no corresponding </p>.

string inputText = GetHtmlText();
int startTag = inputText.IndexOf("<p>");
while (startTag != -1)
{
    // Start scanning just past the opening tag ("<p>" is 3 characters).
    int scanPos = startTag + 3;
    // Now look for a closing tag or another open tag
    int closeTag = inputText.IndexOf("</p>", scanPos);
    int nextStartTag = inputText.IndexOf("<p>", scanPos);
    if (closeTag == -1 || (nextStartTag != -1 && nextStartTag < closeTag))
    {
        // Error at position startTag.  No closing tag.
    }
    else
    {
        // You have a full paragraph between startTag and (closeTag+4).
    }
    startTag = nextStartTag;
}

The code assumes that the strings <p> and </p> cannot exist in the text except as actual paragraph open and closing tags. If you can make that guarantee, then the above (or something very similar) should work quite well.
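As a rough sketch of how that loop could be turned into a method that actually removes the unmatched tags, something like the following might do. The class and method names are hypothetical, and it relies on the same assumptions: <p> and </p> appear only as real tags, with no nesting, attributes or comments.

```csharp
using System;
using System.Text;

public static class ParagraphCleaner
{
    // Hypothetical helper: returns the input with every "<p>" that has no
    // matching "</p>" removed. Everything else is copied through unchanged.
    public static string RemoveUnclosedParagraphTags(string html)
    {
        const string Open = "<p>";
        const string Close = "</p>";
        var result = new StringBuilder(html.Length);
        int copyFrom = 0; // start of the text not yet copied to result
        int startTag = html.IndexOf(Open, StringComparison.Ordinal);

        while (startTag != -1)
        {
            int scanPos = startTag + Open.Length;
            int closeTag = html.IndexOf(Close, scanPos, StringComparison.Ordinal);
            int nextStartTag = html.IndexOf(Open, scanPos, StringComparison.Ordinal);

            // Unclosed if no "</p>" follows, or another "<p>" comes first.
            bool unclosed = closeTag == -1
                || (nextStartTag != -1 && nextStartTag < closeTag);
            if (unclosed)
            {
                // Copy everything before the tag, then skip the tag itself.
                result.Append(html, copyFrom, startTag - copyFrom);
                copyFrom = scanPos;
            }
            startTag = nextStartTag;
        }

        result.Append(html, copyFrom, html.Length - copyFrom);
        return result.ToString();
    }
}
```

Building the output in a StringBuilder instead of calling Remove on the string avoids having to adjust the indexes after each deletion.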

ADDED:

Handling <p> tags that carry attributes gets a little less sure. If you can guarantee that there won't be any > characters between the opening <p and the closing >, then you can modify the code above to search for <p rather than only for <p>, and, when it is found, locate the closing >. It's a little bit messy, but not particularly difficult.
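A sketch of that modification might look like the following. The names are hypothetical, and one extra assumption is added that is not in the description above: a <p only counts as a paragraph tag when it is followed directly by > or whitespace, so that tags such as <pre> are skipped.

```csharp
using System;
using System.Text;

public static class AttributeAwareParagraphCleaner
{
    // Hypothetical helper: index of the next "<p>" or "<p ...>" at or after
    // 'from', or -1 if there is none.
    private static int FindOpenTag(string html, int from)
    {
        int pos = html.IndexOf("<p", from, StringComparison.Ordinal);
        while (pos != -1)
        {
            int after = pos + 2;
            if (after < html.Length
                && (html[after] == '>' || char.IsWhiteSpace(html[after])))
            {
                return pos;
            }
            pos = html.IndexOf("<p", after, StringComparison.Ordinal);
        }
        return -1;
    }

    public static string RemoveUnclosedParagraphTags(string html)
    {
        var result = new StringBuilder(html.Length);
        int copyFrom = 0; // start of the text not yet copied to result
        int startTag = FindOpenTag(html, 0);

        while (startTag != -1)
        {
            // Assumes no '>' occurs inside the opening tag's attributes.
            int tagEnd = html.IndexOf('>', startTag);
            if (tagEnd == -1)
            {
                break; // truncated tag at the end of the text: leave it alone
            }
            int scanPos = tagEnd + 1;
            int closeTag = html.IndexOf("</p>", scanPos, StringComparison.Ordinal);
            int nextStartTag = FindOpenTag(html, scanPos);

            if (closeTag == -1 || (nextStartTag != -1 && nextStartTag < closeTag))
            {
                // Unclosed: copy the text before the tag and skip the whole tag.
                result.Append(html, copyFrom, startTag - copyFrom);
                copyFrom = scanPos;
            }
            startTag = nextStartTag;
        }

        result.Append(html, copyFrom, html.Length - copyFrom);
        return result.ToString();
    }
}
```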

All that said, I would not recommend this approach for parsing arbitrary HTML, because of the caveats I've already stated: it won't handle comments and it makes what are probably invalid assumptions about the format of the HTML in general. It also won't handle things like  and , both of which are perfectly valid (and that I've encountered in the wild).

Jim Mischel 2010-09-21 15:29:12

any idea how to deal with or

Sangram 2010-09-21 16:08:22

I would just substitute "<p" for "<p>" in the loop. Then, once you find an unmatched tag that needs to be removed, just remove from the index of "<" to the first index of ">". That will be your full "p" tag.

kcoppock 2010-09-21 17:04:18

@Sangram: see my additional information.

Jim Mischel 2010-09-21 17:38:19

@Jim: Sure, thanks.

Sangram 2010-09-21 18:05:57

@Jim: It's done, please take a look at the code. It's not great code, but I am not supposed to use the Html Agility Pack, so I had to do it manually.

Sangram 2010-09-22 11:22:06

Answer 4

+1 A:

I really appreciate the help from all of you, especially Jim and Alex. I tried it and it's working nicely. Thanks a lot.

public static string CleanUpXHTML(string xhtml)
{
    // pOpen: current "<p"; pClose: the ">" that ends it;
    // pSlash: next "</p>"; pNext: next "<p".
    int pOpen = xhtml.IndexOf("<p", 0);
    if (pOpen < 0)
    {
        return xhtml; // no <p tags at all, nothing to clean up
    }
    int pClose = xhtml.IndexOf(">", pOpen);
    int pSlash = xhtml.IndexOf("</p>", pClose);
    int pNext = xhtml.IndexOf("<p", pClose);

    while (pSlash > -1)
    {
        if (pSlash < pNext)
        {
            // The current <p> is closed before the next one opens: move on.
            pOpen = pNext;
            pClose = xhtml.IndexOf(">", pOpen);
            pSlash = xhtml.IndexOf("</p>", pClose);
            pNext = xhtml.IndexOf("<p", pClose);
        }
        else
        {
            int length = pClose - pOpen + 1;
            if (pNext < 0 && pSlash > 0)
            {
                break; // this is the last <p> and it is closed
            }

            // Another <p opens before the closing tag: remove the unclosed opening tag.
            xhtml = xhtml.Remove(pOpen, length);

            pOpen = pNext - length;
            pClose = xhtml.IndexOf(">", pOpen);
            pSlash = xhtml.IndexOf("</p>", pClose);
            pNext = xhtml.IndexOf("<p", pClose);
        }

        if (pSlash < 0)
        {
            // No closing tags remain: remove every remaining opening tag.
            int lastp = xhtml.IndexOf("<p", pOpen - 1);
            int lastclosep = xhtml.IndexOf(">", lastp);
            int lastnextp = xhtml.IndexOf("<p", lastclosep);

            while (lastp > 0)
            {
                int length3 = lastclosep - lastp + 1;
                xhtml = xhtml.Remove(lastp, length3);
                if (lastnextp < 0)
                {
                    break;
                }
                lastp = lastnextp - length3;
                lastclosep = xhtml.IndexOf(">", lastp);
                lastnextp = xhtml.IndexOf("<p", lastclosep);
            }

            break;
        }
    }

    return xhtml;
}

Sangram 2010-09-22 11:27:28

This code is for a specific case. Please do not use it as a general parsing technique.

Sangram 2010-09-22 11:29:14

Good job. It's considered good manners to upvote helpful answers and select one as "the" answer.

Jim Mischel 2010-09-22 14:47:51
