questions about nokogiri | ansaurus

How to unescape HTML in Nokogiri Ruby, so & remains & and not &amp;

I have a title doc.at('head/title').inner_html that comes out as &amp; when it should be &. My original document is: <head><title>Foo & Bar</title></head> but it comes out as the following: >> doc = Nokogiri::HTML.parse(file, nil, "UTF-8") >> doc.at('head/title') => #<Nokogiri::XML::Element:0x..fdb851bea name="title" children=#<Nokogiri...

extracting content of content attribute in meta tag of a website given a specified value for the name attribute with nokogiri in ruby?

My first question here, would be awesome to find an answer. I am new to using nokogiri. Here is my problem. I have something like this in the HTML head on a target site (here a techcrunch post): <meta content="During my time at TechCrunch I've seen thousands of startups and written about hundreds of them. I sure as hell don't know all ...

Ruby Nokogiri Parsing HTML table II

Hi all, I have just installed ruby+mechanize. It seems to me that what I want to do is possible in ruby nokogiri, but I do not know how to do it. What about this table? It is just part of the html of a vBulletin forum site. I tried to keep the html structure but deleted some text and tag attributes. I want to get some details per thread like...

parse 'page 1 of x' - the best method (ruby/mechanize/nokogiri)

What is the best method using ruby/mechanize/nokogiri to go/click through all pages in case there is more than 1 page I need to access/click on? For example here Page 1 of 34 Should I click the page number or next? Or is there any better solution out there? ...

how to use xpath? (nokogiri)

I have not found any documentation nor tutorial for that. Does anything like that exist? doc.xpath('//table/tbody[@id="threadbits_forum_251"]/tr') the code above will get me any table, anywhere, that has a tbody child with the attribute id equal to threadbits_forum_251. But why does it start with double //? why there is /tr at the e...

Editing Text in a Nokogiri Element or Using Regex

Is there a way to edit the text of a nokogiri element? I have a nokogiri element that contains a list element (<li>) and I would like to remove some characters from the text while preserving the <li> html. Specifically, I want to remove a leading ":" character in the text if it exists. It doesn't look like there's a text= method for n...

Nokogiri Element Removal Using Regex in Ruby

This seems like the hardest problem I have had yet, but maybe I am making it harder than it needs to be. I need to remove an unknown number of nested elements that may or may not be at the beginning of a sentence. The span elements contain a number of words in parentheses. So in the sentence: (cryptography, slang) An internet firewa...

webrat, rspec, nokogiri segfault

I'm getting a segfault in nokogiri (1.4.1) run (under cucumber 0.6.1/webrat 0.7.0/rspec 1.3.x) response.should have_selector("div", :class => "fieldWithErrors") and the div in the page is actually <div class="fieldWithErrors validation_error"> stuff </div> Everything runs fine if I just test nokogiri against a test document >> req...

extract single string from html using ruby/mechanize (and nokogiri)

I am extracting data from a forum. My script based on is working fine. Now I need to extract the date and time (21 Dec 2009, 20:39) from a single post. I cannot get it to work. I used FireXPath to determine the xpath. Sample code: require 'rubygems' require 'mechanize' post_agent = WWW::Mechanize.new post_page = post_agent.get('http:/...

What is the absolutely cheapest way to select a child node in Nokogiri?

I know that there are dozens of ways to select the first child element in Nokogiri, but which is the cheapest? I can't get around using Node#children, which sounds awfully expensive. Say that there are 10000 child nodes, and I don't want to touch the 9999 others... ...

how to use nokogiri methods .xpath & .at_xpath

I'm learning how to use nokogiri and a few questions came to me based on the code below: require 'rubygems' require 'mechanize' post_agent = WWW::Mechanize.new post_page = post_agent.get('http://www.vbulletin.org/forum/showthread.php?t=230708') puts "\nabsolute path with tbody gives nil" puts post_page.parser.xpath('/html/body/div/div/d...

How can I extract html escape chars/entities as text when scraping web? (ruby & nokogiri)

In my ruby+mechanize(nokogiri) script I use this piece of code: row.at_xpath('td[3]/div[1]/a/text()').to_s.strip on a forum where the post title html looks like: <a href="showthread.php?t=233891" >&lt;/body&gt; on Footer ?</a> and I receive from xpath this string &lt;/body&gt; on Footer ? I would like to get what I can see in ...

how to read nokogiri.org documentation? (ruby+mechanize+nokogiri)

Last week I started to write a script in ruby. I needed to scrape some data from the web so I was recommended to use mechanize and then nokogiri. Mechanize documentation says Mechanize uses nokogiri to parse html. What does this mean for you? You can treat a mechanize page like a nokogiri object. After you have used Mechanize to navig...

how to get horizontal depth of a node ?

Note: I made up the term horizontal depth to measure the sub-dimension of a node within a tree. So imagine a <td> which would have an xpath something like /html/table/tbody/tr/td, and a "horizontal depth" of 5. I am trying to see if there is a way to identify and select elements based on this horizontal depth. How can I find the maximum depth? ...

Rake task to fetch XML stream with nokogiri and write selected fields to DB

Hey Everybody, I am trying to build a rake task that fetches a product feed and adds it to my db. task :testme => :environment do require 'nokogiri' require 'zlib' require 'open-uri' @url = "http://some_url/filename.xml.gz" @source = open((@url), :http_basic_authentication=>[USERID, "PASSWORD"]) @gz = Zlib::GzipReader.new(@s...

Mechanize not recognizing anchor tags via CSS selector methods

(Hope this isn't a breach of etiquette: I posted this on RailsForum, but I haven't been getting much response from there recently.) Has anyone else had problems with Mechanize not recognizing anchor tags via CSS selectors? The HTML looks like this (snippet with white space removed for clarity): <td class='calendarCell' align='left'> <...

Nokogiri replace tag values

How to replace "foo" with "bar"? From <h1>foo1<p>foo2<a href="foo3.com">foo4</a>foo5</p>foo6</h1> to <h1>bar1<p>bar2<a href="foo3.com">bar4</a>bar5</p>bar6</h1> I want to replace only the tags' inner content, not the tag attributes. Any ideas? ...

Rails .scan won't work with a model variable using nokogiri

Hello, I'll try to be as explicit as possible. I am using nokogiri to parse links from paths and rules out of a database. I have this model: --- !ruby/object:Content attributes: id: "2" name: http://www.****** try description: try url_base: http://www.****** scan_flv: /"file","([^<>]*flv)"\);/imu source_site_id: "2" con...

How can I create a nokogiri case insensitive Xpath selector?

I'm using nokogiri to select the 'keywords' attribute like this: puts page.parser.xpath("//meta[@name='keywords']").to_html One of the pages I'm working with has the keywords label with a capital "K" which has motivated me to make the query case insensitive. <meta name="keywords"> AND <meta name="Keywords"> So, my question is: W...

How to get 'value' of select tag based on content of select tag, using Nokogiri

How would one get the contents of the 'value' attribute of a select tag, based on content of the select tag (i.e. the text wrapped by option), using Nokogiri? For example, given the following HTML: <select id="options" name="options"> <option value="1">First Option - 4</option> <option value="2">Second Option - 5</option> <op...
