ansaurus

Question

How to replace all HTML tags from <anything> to \n<anything>\n [using regexp (JavaScript)]

Answer 1

+3 A:

Just don't parse HTML using regex. Read this: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

In JavaScript, you can turn HTML into DOM using the .innerHTML property, and after that you can use other DOM methods to traverse it.

Simple example (needs Firebug):

var div = document.createElement('div');
var html = '<p>foo <span>bar</span><br /></p>';
div.innerHTML = html;

function scan(node, depth)
{
    depth = depth || 0;
    var is_tag = node.nodeType === 1; // 1 = element node
    var self_contained = false;
    var tag_name;
    if (is_tag) {
        // An element without children is printed as a self-contained tag.
        self_contained = node.childNodes.length === 0;
        tag_name = node.tagName.toLowerCase();
        console.log('<' + tag_name + (self_contained ? ' /' : '') + '>', depth);
    } else {
        console.log(node.data); // text (or comment) content
    }
    // Recurse into the children, one level deeper.
    for (var i = 0, n = node.childNodes.length; i < n; i++) {
        scan(node.childNodes[i], depth + 1);
    }
    if (!self_contained && is_tag) {
        console.log('</' + tag_name + '>', depth);
    }
}

scan(div);

Output:

<div> 0
<p> 1
foo
<span> 2
bar
</span> 2
<br /> 2
</p> 1
</div> 0

You could also modify this to output attributes and use the depth argument for indentation.
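
As a rough sketch of that modification (not tested against real pages): the function below collects lines instead of logging them, prints attributes, and uses depth for indentation. The name formatNode, the two-space indent, and the sample markup are my own choices for illustration. Attribute values are not escaped, and an empty element such as <span></span> is still printed as self-contained, just like in scan above. The guarded part at the end shows how to start from an HTML string held in a variable.

```javascript
// Hypothetical helper: returns an array of indented lines for a DOM node.
function formatNode(node, depth, lines) {
    depth = depth || 0;
    lines = lines || [];
    var indent = new Array(depth + 1).join('  '); // two spaces per level
    if (node.nodeType === 1) { // element node
        var tagName = node.tagName.toLowerCase();
        var attrs = '';
        for (var a = 0; a < node.attributes.length; a++) {
            // Note: values are copied as-is, without escaping quotes.
            attrs += ' ' + node.attributes[a].name + '="' + node.attributes[a].value + '"';
        }
        var selfContained = node.childNodes.length === 0;
        lines.push(indent + '<' + tagName + attrs + (selfContained ? ' /' : '') + '>');
        for (var i = 0; i < node.childNodes.length; i++) {
            formatNode(node.childNodes[i], depth + 1, lines);
        }
        if (!selfContained) {
            lines.push(indent + '</' + tagName + '>');
        }
    } else if (node.nodeType === 3) { // text node
        lines.push(indent + node.data);
    }
    return lines; // other node types (e.g. comments) are skipped
}

// Browser-only usage, starting from an HTML string in a variable.
if (typeof document !== 'undefined') {
    var html = '<p class="intro">foo <span>bar</span><br /></p>';
    var container = document.createElement('div');
    container.innerHTML = html;
    console.log(formatNode(container.firstChild).join('\n'));
}
```

Joining the returned lines with '\n' gives one tag or text run per line, which is close to what the question asks for.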

Reinis I. 2010-08-31 14:07:44

Beat me to it. I don't necessarily agree with the article, but everyone should read it.

Rushyo 2010-08-31 14:10:12

`innerHTML` is not a method but a property.

Gumbo 2010-08-31 14:13:44

Right, thanks. I've been using jQuery's `.html()` for so long that I forget.

Reinis I. 2010-08-31 14:14:46

How can I do that? My code is in a variable, not in a div.

faressoft 2010-08-31 14:16:26

Answer 2

+4 A:

You can prettify XML without regex:

var text = "<anything>welcome</anything><anything>Hello</anything>";
var xml = new XML("<root>" + text + "</root>");
console.log(xml.children().toXMLString());

Output:

<anything>welcome</anything>
<anything>Hello</anything>

Amarghosh 2010-08-31 14:08:51

How do I get the output?

faressoft 2010-08-31 14:12:00

@faressoft `var text1 = xml.children().toXMLString();`

Amarghosh 2010-08-31 14:12:55

+1 neat approach. E4X has very little availability though.

Anurag 2010-08-31 14:13:32

This will only work if it's XHTML, right?

Abe Miessler 2010-08-31 14:23:07

@Abe Yes, the string must be valid XML when enclosed in the root tag.

Amarghosh 2010-08-31 14:28:45

Answer 3

+1 A:

Try this:

str.replace(/<(\/?)[a-zA-Z]+(?:[^>"']+|"[^"]*"|'[^']*')*>/g, function($0, $1) {
    return $1 === "/" ? $0+"\n" : "\n"+$0;
})
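
For illustration, here is the same replacement wrapped in a function and applied to a sample string. The function name breakAroundTags and the sample markup are assumptions made for this example. Note that it puts the newline before opening tags and after closing tags, and that the quoted-attribute branches keep a > inside an attribute value from ending the match early.

```javascript
// Hypothetical wrapper around the replacement shown above.
function breakAroundTags(str) {
    return str.replace(/<(\/?)[a-zA-Z]+(?:[^>"']+|"[^"]*"|'[^']*')*>/g, function ($0, $1) {
        // $1 is "/" for a closing tag, "" otherwise.
        return $1 === "/" ? $0 + "\n" : "\n" + $0;
    });
}

var sample = '<p class="a>b">foo <span>bar</span></p>';
console.log(JSON.stringify(breakAroundTags(sample)));
// prints "\n<p class=\"a>b\">foo \n<span>bar</span>\n</p>\n"
```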

Gumbo 2010-08-31 14:09:29

This code helped me in my project, thanks

faressoft 2010-08-31 14:50:55

Answer 4

A:

text = text.replace(/<(?!\/)/g, "\n<"); // replace every < that is not followed by / with \n<
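
A quick check of what this does, using the sample string from Answer 2 (the variable names are only for the example). It only inserts a newline before opening tags, and because it looks at the < character alone, a literal < in ordinary text is affected too.

```javascript
var text = "<anything>welcome</anything><anything>Hello</anything>";
var result = text.replace(/<(?!\/)/g, "\n<");
console.log(JSON.stringify(result));
// prints "\n<anything>welcome</anything>\n<anything>Hello</anything>"

// Caveat: a bare < that is not part of a tag is matched as well.
var caveat = "1 < 2".replace(/<(?!\/)/g, "\n<");
console.log(JSON.stringify(caveat));
// prints "1 \n< 2"
```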

gawi 2010-08-31 14:15:18

Answer 5

A:

Expanding on @Amarghosh's answer:

Assuming the HTML you are trying to parse is more complicated than your example (which I would guess it is), you may want to convert your HTML page into XHTML. This will allow you to treat it as XML and do a number of things, including:

Use an XSL stylesheet to transform the data.
Use .NET's extensive set of XML libraries to extract and manipulate the data.

I have done this in the past with a free .NET library called SGML.

Abe Miessler 2010-08-31 14:41:45
