I have been using XPath with Scrapy to extract text from HTML tags, but the results come back with extra characters attached. For example, extracting a number such as "204" from a <td> tag gives [u'204']. In some cases it's much worse: extracting "1 - MathOverflow" gives [u'\r\n\t\t 1 \u2013 MathOverflow\r\n\t\t ']. Is there a way to prevent this, or to trim the strings so that the extra characters aren't part of them? (I am using items to store the data.) It looks like it has something to do with formatting, so how do I get XPath to not pick up that stuff?
views: 96
answers: 3
A:
Use strip() to remove the leading and trailing whitespace.
>>> u'\r\n\t\t 1 \u2013 MathOverflow\r\n\t\t '.strip()
u'1 \u2013 MathOverflow'
Wai Yip Tung
2010-05-25 22:51:27
How would I do that in the program? Can I just write `item1['Title'] = item1['title'].strip()`? I am new to Python.
Nacari
2010-05-25 22:55:18
Yes, assuming item1['title'] is a string.
Wai Yip Tung
2010-05-25 23:04:24
A:
What does the line of code that returns [u'204'] look like? It looks like what is being returned is a Python list containing a unicode string with the value you want. Nothing wrong there; just subscript it. As for the carriage returns, linefeeds and tabs, as Wai Yip Tung just answered, strip() will take them out.
Probably:
my_answer = item1['Title'][0].strip()
Or, if you are expecting several matches:
for ans_i in item1['Title']:
    do_something_with(ans_i.strip())
Dan Menes
2010-05-25 23:00:46
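A minimal sketch of the subscript-and-strip idea above, written for Python 3 (where every str is Unicode and the u prefix is optional). The helper names `clean_matches` and `first_clean` are hypothetical, not part of Scrapy; they only assume that the XPath query returns a plain list of strings, possibly an empty one.

```python
def clean_matches(matches):
    """Strip surrounding whitespace from every extracted string."""
    return [m.strip() for m in matches]


def first_clean(matches, default=None):
    """Return the first stripped match, or `default` if nothing matched."""
    cleaned = clean_matches(matches)
    return cleaned[0] if cleaned else default


# The raw value from the question: a list holding one padded string.
raw = ['\r\n\t\t 1 \u2013 MathOverflow\r\n\t\t ']
title = first_clean(raw)

# An empty result list no longer raises IndexError.
missing = first_clean([], default='')
```

Returning a default for an empty list is a design choice for rows where the cell is absent; indexing with `[0]` directly would raise an IndexError there instead.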
OK thanks, that kind of fixed it. It seems to be picking up the dash in `1 - MathOverflow` as an odd string, `\u2013`, and ASCII can't read it. As for the [u'204'], I have no idea why XPath is putting that around the data. The XPath statement is `//div[@id="content"]/div[@id="directory-list"]/div[@class="wrapper"]/table/tr[@class="odd"][1]/td[1]/text()`
Nacari
2010-05-25 23:11:32
I think you are confusing what is actually being returned with how Python renders that when it prints it at the prompt. When you see `[u'204']` on the screen, that is not a string that begins with a `[` character. Rather, it is how Python tells you that it is showing you a list object that contains a single unicode string. The value inside that unicode string is the three characters `2`, `0` and `4`. Which is exactly what you want. The code I showed you should unpack this for you.
Dan Menes
2010-05-25 23:51:26
Likewise, Python is not replacing the dash with the string `\u2013`. Rather, it is just showing you that the Unicode string that has been returned contains the character at code point U+2013. Which, it hopefully won't surprise you to learn, is the code point for "EN DASH." Python isn't altering the string, it is returning exactly what is in the browser. If you want to remove the non-ASCII character, this recent thread will help: http://stackoverflow.com/questions/2854230/whats-the-fastest-way-to-strip-and-replace-a-document-of-high-unicode-characters
Dan Menes
2010-05-25 23:56:14
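The difference between what Python displays and what the string contains can be checked directly. This is a small illustration assuming Python 3, where the list is displayed as ['204'] rather than [u'204']; the variable names are made up for the example.

```python
import unicodedata

result = ['204']             # a list holding one string, as the XPath query returns
shown = repr(result)         # the text the interactive prompt displays for the list
value = result[0]            # the string itself: the characters 2, 0 and 4

dash = '\u2013'              # an escape sequence in source code, one character in memory
dash_name = unicodedata.name(dash)

print(shown, value, len(value))
print(len(dash), dash_name)
```

The brackets and quotes exist only in the displayed form; `value` has length 3, and the escaped dash is a single character whose Unicode name is EN DASH.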
Ah, I'm just trying to put this stuff in a csv document and am having issues.
Nacari
2010-05-26 00:41:57
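For the CSV case, one possible approach is sketched below. It assumes Python 3, where a file can be opened with an explicit encoding; the function names `to_ascii` and `write_rows` and the file name in the usage note are hypothetical. Two options are shown: keep the en dash and write the file as UTF-8, or swap the dash for a plain hyphen before writing.

```python
import csv


def to_ascii(text):
    """Replace the en dash with a plain hyphen; a deliberately narrow substitution."""
    return text.replace('\u2013', '-')


def write_rows(path, rows):
    """Write rows to a CSV file encoded as UTF-8 so non-ASCII text survives."""
    # newline='' lets the csv module control line endings itself.
    with open(path, 'w', newline='', encoding='utf-8') as handle:
        csv.writer(handle).writerows(rows)
```

A call such as `write_rows('titles.csv', [['204', '1 \u2013 MathOverflow']])` keeps the dash intact, while `to_ascii('1 \u2013 MathOverflow')` gives a title that an ASCII-only consumer can read.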
A:
The standard XPath function normalize-space() has exactly the wanted effect.
It deletes the leading and trailing whitespace and replaces each run of inner whitespace with a single space.
So, you could use:
normalize-space(someExpression)
Dimitre Novatchev
2010-05-25 23:17:02
Ah OK, so what is the syntax for that? Is ('normalize-space(//div[@id="content"]/div[@id="directory-list"]/div[@class="wrapper"]/table') items = []') correct?
Nacari
2010-05-25 23:21:48
@Nacari: This is a correct XPath expression: `normalize-space(//div[@id="content"]/div[@id="directory-list"]/div[@class="wrapper"]/table)`
Dimitre Novatchev
2010-05-26 01:02:10
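To make the behaviour of normalize-space() concrete, here is a rough Python imitation that works on a string that has already been extracted. It is only a sketch: `normalize_space` is a hypothetical helper, not a Scrapy or XPath API, and it treats only space, tab, carriage return and line feed as whitespace, as XPath 1.0 does. Note also that when normalize-space() is given a node-set, XPath 1.0 uses the string value of the first node only.

```python
import re

# XPath 1.0 whitespace is limited to space, tab, carriage return and line feed.
_XPATH_WS = ' \t\r\n'


def normalize_space(text):
    """Mimic XPath normalize-space() on an already extracted string."""
    # Collapse every run of whitespace into a single space...
    collapsed = re.sub('[' + _XPATH_WS + ']+', ' ', text)
    # ...then drop the single space that may remain at either end.
    return collapsed.strip(' ')


title = normalize_space('\r\n\t\t 1 \u2013   MathOverflow\r\n\t\t ')
```

Unlike strip(), this also tidies whitespace in the middle of the string, which matters when the text spans several lines of HTML source.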