ansaurus

Question

Cleaning and stripping of strings/HTML - Python

Answer 1

+2 A:

For the first one, you can use split and then a list comprehension to trim the extra whitespace:

result = [x.strip() for x in i.split(',')]

And to remove the empty strings from the list:

result = [x for x in result if x]

Mark Byers 2010-10-28 21:38:26

It would have to be result = [x.strip() for x in i.split(',') if x.strip()]. I was hoping there would be a more efficient way of doing this, though. Well, thanks anyway.

RadiantHex 2010-10-28 21:43:37

btw [x.strip() for x in i.split(',') if x.strip()] does both at the same time :)

RadiantHex 2010-10-28 21:48:46

@RadiantHex: ... by performing the `strip` twice. This answer would be better if the first operation were a generator, not a list comprehension.

hughdbrown 2010-10-29 03:44:15
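The point raised in the comments (doing both steps at once without calling strip twice) can be sketched as follows. This is only an illustration: the function name split_clean and the sample input are hypothetical and not part of the original answers.

```python
def split_clean(text, sep=","):
    """Hypothetical helper: split, trim each piece once, drop empty pieces."""
    # Inner generator: strip every piece exactly once, lazily.
    stripped = (piece.strip() for piece in text.split(sep))
    # Outer comprehension: keep only the non-empty results.
    return [piece for piece in stripped if piece]


print(split_clean(" a , b,, c ,  "))  # ['a', 'b', 'c']
```

The inner generator is never materialised as a list, so the input is walked once and each piece is stripped once.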

Answer 2

+2 A:

To clean HTML, use lxml.html; text_content() returns the text with all tags removed:

import lxml.html
text = lxml.html.fromstring("...")
text.text_content()

infinity 2010-10-28 21:39:36

thanks :) but it doesn't really clean/sanitize the HTML. I just need br, p, strong, italic, span elements :)

RadiantHex 2010-10-28 21:47:13

A good way to sanitise HTML is to parse it into a DOM, remove all the elements, attributes, and URL-schemes that aren't known-safe, and serialise back to HTML.

bobince 2010-10-28 22:47:11
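The allow-list idea from the comment above can be illustrated with a rough sketch. Note the assumptions: it uses the standard library's html.parser as a streaming filter rather than a real DOM, the class and function names are hypothetical, the allowed tags are taken from the asker's comment (with "italic" read as i and em), and all attributes are discarded, so no URL-scheme check is needed. It does not balance unclosed tags and is not a substitute for a maintained sanitizer.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: tags the asker wants to keep; everything else is unwrapped.
ALLOWED_TAGS = {"br", "p", "strong", "i", "em", "span"}
VOID_TAGS = {"br"}  # allowed tags that never get a closing tag
DROP_CONTENT = {"script", "style"}  # elements whose content is dropped too


class WhitelistSanitizer(HTMLParser):
    """Hypothetical helper: re-emits only allowed tags, without attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        elif tag in ALLOWED_TAGS and not self.skip_depth:
            self.out.append("<%s>" % tag)  # attrs are deliberately ignored

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in ALLOWED_TAGS and tag not in VOID_TAGS and not self.skip_depth:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            # Re-escape text so stray < > & cannot form new markup.
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = WhitelistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


dirty = '<p onclick="x()">Hi <b>there</b><script>alert(1)</script><br/></p>'
print(sanitize(dirty))  # <p>Hi there<br></p>
```

Disallowed tags such as b are unwrapped (their text is kept), while script and style are removed together with their content.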

Answer 3

+1 A:

I am somewhat of a beginner at Python web development, but for cleaning/sanitizing HTML I have found that the markdown2 library has some very nice features. You can use it with the MarkItUp! jQuery-based editor. They may not solve all your problems, but they might help you do a lot of work in a short time.

klausbyskov 2010-10-28 21:40:26

Answer 4

+1 A:

1) you can use the strip method

2) you can use sanitize: http://wonko.com/post/sanitize

3) some unicode tips here: http://blog.trydionel.com/2010/03/23/some-unicode-tips-for-ruby/

Orbit 2010-10-28 21:41:02

Erm... the question appears to be Python, rather than Ruby? The way the two languages handle Unicode is very, very different.

bobince 2010-10-28 22:45:29

oh, wow. glad you pointed that out. loving the 1 up.

Orbit 2010-10-28 22:46:35

Answer 5

+1 A:

1) [j.strip() for j in a.split(',') if j.strip()]

2) check tidy

singularity 2010-10-28 21:47:14

thanks so much for pointing tidy out!!! :)

RadiantHex 2010-10-28 21:48:07

Answer 6

+1 A:

I tend to write multiple cascading generators, particularly if I want some intermediate output to be part of a test:

stripped_iter = (x.strip() for x in l.split(','))
non_empty_iter = (x for x in stripped_iter if x)

The inspiration is Beazley's presentation on coroutines.

hughdbrown 2010-10-29 03:48:49
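One way to read the testing remark above is that each stage can be checked on its own. The sketch below rewrites the two stages as generator functions; the function names and the sample line are hypothetical and chosen only for illustration.

```python
def stripped(pieces):
    """Stage 1: yield each piece with surrounding whitespace removed."""
    for piece in pieces:
        yield piece.strip()


def non_empty(pieces):
    """Stage 2: yield only the pieces that are not empty strings."""
    for piece in pieces:
        if piece:
            yield piece


line = "x, , y ,z"

# Inspect the intermediate stage by itself.
print(list(stripped(line.split(","))))  # ['x', '', 'y', 'z']

# Chain the stages; nothing is computed until list() consumes the pipeline.
print(list(non_empty(stripped(line.split(",")))))  # ['x', 'y', 'z']
```

A generator can be consumed only once, so the pipeline is rebuilt for the second call instead of reusing the first one.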
