Hi folks,

I have a few questions that I don't have answers to.

1) Stripping lists of strings

input:
'item1,   item2, \t\t\t item3, \n\n\n \t, item4, , , item5, '

output:
['item1', 'item2', 'item3', 'item4', 'item5']

Anything more efficient than doing the following?

[x.strip() for x in l.split(',') if x.strip()]

2) Cleaning/Sanitizing HTML

keeping basic tags e.g. strong, p, br, ...

removing malicious javascript, css and divs

3) Unicode handling...

What would you recommend for dealing with Unicode text parsed from documents?


Any ideas? :) Thanks guys!

+2  A: 

For the first one you can use split then a list comprehension to trim the extra whitespace:

result = [x.strip() for x in i.split(',')]

And to remove the empty strings from the list:

result = [x for x in result if x]
Mark Byers
It would have to be result = [x.strip() for x in i.split(',') if x.strip()]. I was hoping there would be a more efficient way of doing this, though. Well, thanks anyway.
RadiantHex
btw [x.strip() for x in i.split(',') if x.strip()] does both at the same time :)
RadiantHex
@RadiantHex: ... by performing the `strip` twice. This answer would be better if the first operation were a generator, not a list comprehension.
hughdbrown
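One way to address the double `strip` raised in the comments is to strip lazily with `map` and drop the empty results with `filter`. This is only a sketch: it assumes Python 3 and plain `str` input, and `split_clean` is a made-up helper name, not something from the thread. Whether it is measurably faster than the list comprehension depends on the data, so it is worth checking with `timeit` before relying on it.

```python
def split_clean(text, sep=','):
    """Split text on sep, strip each piece once, and drop the empty ones."""
    # map() strips each piece lazily; filter(None, ...) discards empty strings.
    return list(filter(None, map(str.strip, text.split(sep))))


raw = 'item1,   item2, \t\t\t item3, \n\n\n \t, item4, , , item5, '
print(split_clean(raw))  # ['item1', 'item2', 'item3', 'item4', 'item5']
```

Each piece is stripped exactly once, and no intermediate list of stripped strings is built.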
+2  A: 

To clean HTML use lxml.html

import lxml.html
text = lxml.html.fromstring("...")
text.text_content()
infinity
thanks :) but it doesn't really clean/sanitize the HTML. I just need br, p, strong, italic, span elements :)
RadiantHex
A good way to sanitise HTML is to parse it into a DOM, remove all the elements, attributes, and URL-schemes that aren't known-safe, and serialise back to HTML.
bobince
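The allowlist idea in the comment above could look roughly like the sketch below. Note the swap: instead of building a full DOM, it uses the standard library's event-based `html.parser.HTMLParser` (Python 3) and re-emits only known-safe tags. The tag list, the class name `AllowlistSanitizer`, and the `sanitize` helper are assumptions for illustration. It drops every attribute, so it does not need URL-scheme checks, but it also does not repair unbalanced markup. For real user input a maintained sanitizing library is the safer choice.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: an illustrative allowlist based on the tags named in the thread.
ALLOWED_TAGS = {'p', 'br', 'strong', 'em', 'i', 'span'}
VOID_TAGS = {'br'}                 # tags that take no closing tag
DROP_CONTENT = {'script', 'style'}  # tags whose text is discarded as well


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in ALLOWED_TAGS:
            # Attributes are dropped entirely (no onclick, style, href, ...).
            self.out.append('<%s>' % tag)

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            if self.skip_depth:
                self.skip_depth -= 1
        elif (self.skip_depth == 0 and tag in ALLOWED_TAGS
              and tag not in VOID_TAGS):
            self.out.append('</%s>' % tag)

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Text arrives with entities decoded, so re-escape it on output.
            self.out.append(escape(data))


def sanitize(html_text):
    """Return html_text with only allowlisted, attribute-free tags kept."""
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.out)


dirty = ('<div onclick="evil()"><p style="x">Hi <strong>there</strong>'
         '<script>alert(1)</script><br/></p></div>')
print(sanitize(dirty))  # <p>Hi <strong>there</strong><br></p>
```

Tags outside the allowlist (such as `div`) are removed while their text is kept. Only `script` and `style` lose their content too.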
+1  A: 

I am somewhat of a beginner at Python web development, but for cleaning/sanitizing HTML I have found that the markdown2 library has some very nice features. You can use it with the MarkItUp! jQuery-based editor. They may not solve all your problems, but they might help you do a lot of work in a short time.

klausbyskov
+1  A: 

1) you can use the strip method

2) you can use sanitize: http://wonko.com/post/sanitize

3) some unicode tips here: http://blog.trydionel.com/2010/03/23/some-unicode-tips-for-ruby/

Orbit
Erm... the question appears to be Python, rather than Ruby? The way the two languages handle Unicode is very, very different.
bobince
oh, wow. glad you pointed that out. loving the 1 up.
Orbit
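For the Python side of question 3, a common pattern is to decode bytes to text as soon as they enter the program, work with text internally, and encode only on output. The sketch below assumes Python 3 and UTF-8 input, and the helper names `to_text` and `normalize` are made up for illustration. In Python 2 the same idea applies with `unicode` in place of `str`.

```python
import unicodedata


def to_text(raw, encoding='utf-8'):
    """Decode bytes to str at the input boundary; pass str through as-is."""
    if isinstance(raw, bytes):
        # errors='replace' substitutes U+FFFD for undecodable bytes
        # instead of raising UnicodeDecodeError.
        return raw.decode(encoding, errors='replace')
    return raw


def normalize(text):
    """Compose characters (NFC) so visually equal strings compare equal."""
    return unicodedata.normalize('NFC', text)


raw = 'Cafe\u0301'.encode('utf-8')   # 'e' followed by a combining accent
text = normalize(to_text(raw))
print(text == 'Caf\u00e9')           # True: now a single precomposed 'é'
```

If the encoding of a document is not known in advance, it has to be detected or declared somewhere. The UTF-8 default here is only a guess.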
+1  A: 

1) [j.strip() for j in a.split(',') if j.strip()]

2) check tidy

singularity
thanks so much for pointing tidy out!!! :)
RadiantHex
+1  A: 

I tend to write multiple cascading generators, particularly if I want some output to be part of a test:

stripped_iter = (x.strip() for x in l.split(','))
non_empty_iter = (x for x in stripped_iter if x)

The inspiration is Beazley's presentation on coroutines.
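A short sketch of how the two generators above behave when consumed, assuming Python 3 and the input string from the question. Nothing is split or stripped until something iterates over the last generator, and a generator can only be consumed once.

```python
l = 'item1,   item2, \t\t\t item3, \n\n\n \t, item4, , , item5, '

stripped_iter = (x.strip() for x in l.split(','))
non_empty_iter = (x for x in stripped_iter if x)

# Work happens here, one item at a time, with a single strip per item.
result = list(non_empty_iter)
print(result)  # ['item1', 'item2', 'item3', 'item4', 'item5']

# The generators are now exhausted; iterating again yields nothing.
leftover = list(non_empty_iter)
print(leftover)  # []
```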

hughdbrown