ansaurus

Question

Extra characters Extracted with XPath and Python (html)

Answer 1

A:

Use strip() to remove the leading and trailing whitespace.

>>> u'\r\n\t\t 1 \u2013 MathOverflow\r\n\t\t '.strip()
u'1 \u2013 MathOverflow'

Wai Yip Tung 2010-05-25 22:51:27

How would I do that in the program? Can I just write `item1['Title'] = item1['title'].strip()`? I am new to Python.

Nacari 2010-05-25 22:55:18

Yes, assuming item1['title'] is a string.

Wai Yip Tung 2010-05-25 23:04:24
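A minimal sketch of that check, assuming Python 3 and a made-up `item1` dictionary (in the thread the real values come from an XPath query). It strips the value directly when it is a string and strips each element when it is a list:

```python
# Hypothetical item; the key name and value are illustrative only.
item1 = {'title': u'\r\n\t\t 1 \u2013 MathOverflow\r\n\t\t '}

raw = item1['title']
if isinstance(raw, str):
    # A single string: strip it directly.
    item1['title'] = raw.strip()
else:
    # Assumed to be a list of strings: strip each one.
    item1['title'] = [s.strip() for s in raw]
```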

Answer 2

A:

What does the line of code that returns [u'204'] look like? It looks like what is being returned is a Python list containing a unicode string with the value you want. Nothing wrong there; just subscript it. As for the carriage returns, linefeeds and tabs, as Wai Yip Tung just answered, strip() will take them out.

Probably

my_answer = item1['Title'][0].strip()

Or if you are expecting several matches

for ans_i in item1['Title']:
    do_something_with( ans_i.strip() )

Dan Menes 2010-05-25 23:00:46

OK thanks, that kind of fixed it. It seems to be picking up the dash in `1 - MathOverflow` as an odd string `\u2013`, and ASCII can't read it. As for the [u'204'], I have no idea why XPath is putting it around the data. The XPath statement is `//div[@id="content"]/div[@id="directory-list"]/div[@class="wrapper"]/table/tr[@class="odd"][1]/td[1]/text()`

Nacari 2010-05-25 23:11:32

I think you are confusing what is actually being returned with how Python renders that when it prints it at the prompt. When you see `[u'204']` on the screen, that is not a string that begins with a `[` character. Rather, it is how Python tells you that it is showing you a list object that contains a single unicode string. The value inside that unicode string is the three characters `2`, `0` and `4`. Which is exactly what you want. The code I showed you should unpack this for you.

Dan Menes 2010-05-25 23:51:26
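The difference between the displayed form and the actual value can be checked directly. This is a small sketch in which `result` is a hypothetical stand-in for whatever the XPath query returns. Under Python 3 the `u` prefix no longer appears in the displayed form, but the distinction is the same:

```python
# Hypothetical stand-in for the XPath result: a list holding one string.
result = [u'204']

shown = repr(result)     # how the prompt renders the list, brackets and quotes included
value = result[0]        # the string stored inside the list

print(shown)
print(value)

# The brackets and quotes belong to the rendering, not to the data.
is_list = isinstance(result, list)
length = len(value)
```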

Likewise, Python is not replacing the dash with the string `\u2013`. Rather, it is just showing you that the Unicode string that has been returned contains the character at code point U+2013 (hexadecimal 2013). Which, it hopefully won't surprise you to learn, is the code point for "EN DASH." Python isn't altering the string; it is returning exactly what is in the browser. If you want to remove the non-ASCII character, this recent thread will help: http://stackoverflow.com/questions/2854230/whats-the-fastest-way-to-strip-and-replace-a-document-of-high-unicode-characters

Dan Menes 2010-05-25 23:56:14
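To illustrate, the standard `unicodedata` module can name the character, and there are two simple ways to get ASCII-only text. This is a sketch under the assumption that the title has already been stripped; which option fits depends on whether the dash should survive as a hyphen:

```python
import unicodedata

title = u'1 \u2013 MathOverflow'

dash = title[2]
name = unicodedata.name(dash)   # official Unicode name of the character
code_point = hex(ord(dash))     # numeric code point, in hexadecimal

# Option 1: swap the en dash for a plain ASCII hyphen.
ascii_title = title.replace(u'\u2013', u'-')

# Option 2: drop every character that ASCII cannot represent.
stripped_title = title.encode('ascii', 'ignore').decode('ascii')
```

Option 2 leaves two consecutive spaces where the dash used to be, so option 1 is usually the nicer result for titles like this one.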

Ah, I'm just trying to put this stuff in a CSV document and am having issues.

Nacari 2010-05-26 00:41:57
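For the CSV step, one possible approach is sketched below, assuming Python 3, where the `csv` module writes text and the en dash needs no special handling. The `rows` data and field names are hypothetical. An in-memory buffer keeps the example self-contained; for a real file, `open('out.csv', 'w', newline='', encoding='utf-8')` would take its place:

```python
import csv
import io

# Hypothetical scraped rows; the field names are illustrative only.
rows = [
    {'title': u'\r\n\t\t 1 \u2013 MathOverflow\r\n\t\t ', 'count': [u'204']},
]

# newline='' lets the csv module control the line endings itself.
buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
writer.writerow(['title', 'count'])
for row in rows:
    # Strip the title string; unpack the one-element list, then strip it.
    writer.writerow([row['title'].strip(), row['count'][0].strip()])

csv_text = buffer.getvalue()
```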

Answer 3

A:

The standard XPath function normalize-space() has exactly the wanted effect.

It deletes the leading and trailing whitespace and replaces each run of inner whitespace with a single space.

So, you could use:

normalize-space(someExpression)

Dimitre Novatchev 2010-05-25 23:17:02

Ah OK, so what is the syntax for that? Is ('normalize-space(//div[@id="content"]/div[@id="directory-list"]/div[@class="wrapper"]/table') items = []') correct?

Nacari 2010-05-25 23:21:48

@Nacari: This is a correct XPath expression: `normalize-space(//div[@id="content"]/div[@id="directory-list"]/div[@class="wrapper"]/table)`

Dimitre Novatchev 2010-05-26 01:02:10
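For comparison, the effect of normalize-space() can be approximated on the Python side after the text has been extracted. The helper name `normalize_space` below is hypothetical and not part of any library. It assumes the input is already a single string, whereas the XPath function, when given a node-set, works on the string value of the first node only:

```python
import re

def normalize_space(text):
    """Approximate XPath normalize-space() for a Python string."""
    # XPath treats only space, tab, carriage return and line feed as whitespace.
    collapsed = re.sub(r'[ \t\r\n]+', ' ', text)
    # After collapsing, at most one space remains at each end.
    return collapsed.strip(' ')

raw = u'\r\n\t\t 1 \u2013   MathOverflow\r\n\t\t '
cleaned = normalize_space(raw)
```

Unlike strip(), this also collapses the run of spaces inside the string.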
