((<(\\s*?)(object|OBJECT|EMBED|embed))+(.*?)+((object|OBJECT|EMBED|embed)(\\s*?)>))

I need to extract object and embed tags from some HTML files stored locally on disk. I've come up with the above regex to match the tags in Java, and then use matcher.group(1) to get the entire tag and its contents.

Can anyone perhaps improve this? Is there anything that stands out immediately that I should change?

It does work, BTW; I just want some input on whether it can be better, because I'm fairly new to regex myself.
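
For reference, the approach described above could be sketched roughly as follows. This is only an illustration: the pattern is a simplified stand-in rather than the regex from the question, and the class and method names (TagExtractor, extract) are made up for the example. It assumes every tag of interest has an explicit closing tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagExtractor {
    // Simplified stand-in pattern (an assumption, not the original regex):
    // CASE_INSENSITIVE replaces the object|OBJECT alternation,
    // DOTALL lets '.' match across line breaks, and the backreference \1
    // requires the closing tag name to repeat the opening one.
    private static final Pattern TAG = Pattern.compile(
            "<\\s*(object|embed)\\b.*?<\\s*/\\s*\\1\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns every matched element (opening tag through closing tag) as text.
    static List<String> extract(String html) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = TAG.matcher(html);
        while (matcher.find()) {
            tags.add(matcher.group()); // group() with no argument is the whole match
        }
        return tags;
    }

    public static void main(String[] args) {
        String html = "<p>intro</p>\n"
                + "<OBJECT data=\"a.swf\">\n<param name=\"x\"></OBJECT>"
                + "<p>end</p>";
        for (String tag : extract(html)) {
            System.out.println(tag);
        }
    }
}
```

Calling matcher.find() in a loop collects every occurrence in the file, not just the first one.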

+2  A: 

Yes, here's the improvement:

  1. Download a full-fledged Java HTML parser like Jsoup and put it on the classpath.

  2. Now you can select all <object> and <embed> elements as follows:

    Document document = Jsoup.parse(new File("/path/to/file.html"), "UTF-8");
    Elements elements = document.select("object,embed");
    for (Element element : elements) {
        System.out.println(element.outerHtml());
    }
    

See also:

BalusC
Add to that list: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454
cHao
@cHao: this one is mentioned in 2nd link.
BalusC
Thanks for the link, but I don't want to use an HTML parser. I just want my one line of regex; using a parser teaches me nothing. I use parsers when I have an actual system that depends on the result. I'm just trying to learn to do it from scratch so that one day I will know enough to even write my own parser.
robinsonc494
If it isn't clear yet: regex is the wrong tool for the job. Look for other subjects to practice regex on. HTML is not a regular language.
BalusC
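
A small sketch of what "not a regular language" means in practice: an object element may legally contain another object element as fallback content, and a lazy match stops at the first closing tag it finds. The pattern and the names below (NestedTagDemo, firstMatch) are hypothetical and only serve to show the failure.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedTagDemo {
    // Hypothetical simplified pattern: from an opening <object ...>
    // to the nearest </object>, matched lazily.
    private static final Pattern OBJECT = Pattern.compile(
            "<object\\b.*?</object\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the first match, or null when nothing matches.
    static String firstMatch(String html) {
        Matcher m = OBJECT.matcher(html);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<object id=\"outer\">"
                + "<object id=\"inner\"></object>"
                + "<p>fallback</p>"
                + "</object>";
        // Prints: <object id="outer"><object id="inner"></object>
        // The outer element is cut off at the inner closing tag.
        System.out.println(firstMatch(html));
    }
}
```

Making the quantifier greedy does not fix this; it then overshoots and swallows everything between two unrelated object elements. Matching opening and closing tags at arbitrary depth needs a counter or a stack, which is what a parser provides.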
Using regular expressions to parse HTML will only teach you a whole new definition for frustration.
cHao
@robin: read at least the first two linked articles; they are written by Jeff Atwood, the founder/owner of Stack Overflow. They are really enlightening write-ups.
BalusC
OK, point taken: Regex + HTML = nightmare on code street. One question though (related but kinda off topic): what is/should regex be used for, and when? I only read Jeff Atwood's post, so if any of the links cover it then...
robinsonc494