questions about nokogiri | ansaurus

nokogiri

How should I loop a Nokogiri search in Ruby?

I have the following code, which retrieves the title of each URL from an array that contains a list of URLs. require 'rubygems' require 'nokogiri' require 'open-uri' @urls = ["http://google.com", "http://yahoo.com", "http://rubyonrails.org"] @found_titles = Array.new @found_titles[0] = Nokogiri::HTML(open("#{@urls[0]}")).search("title").inn...

Parsing some results returned by nokogiri in ruby, getting an error message

The following code returns an error: require 'nokogiri' require 'open-uri' @doc = Nokogiri::HTML(open("http://www.amt.qc.ca/train/deux-montagnes/deux-montagnes.aspx")) #@doc = Nokogiri::HTML(File.open("deux-montagnes.html")) stations = @doc.xpath("//area") stations.each { |station| str = station reg = /href="(.*)" title="(.*)"/ ...

In Mechanize what is the cookiejar and how does it differ from cookies?

How does Mechanize::CookieJar differ from the Mechanize::Cookies array? There must be some difference but after poking around for a little bit I can't seem to find a good explanation? ...

Nokogiri and Special Characters

I'm using Nokogiri to grab the contents of the title tag on a webpage, but am having trouble with accented characters. What's the best way to deal with these? Here's what I'm doing: require 'open-uri' require 'nokogiri' doc = Nokogiri::HTML(open(link)) title = doc.at_css("title") At this point, the title looks like this: Rag\30...

screen-scraping

NameError: uninitialized constant Nokogiri::HTML::DocumentFragment

About three hours ago I started seeing the above error in my production server. It comes from a call to the sanitize gem: vendor/rails/activerecord/lib/../../activesupport/lib/active_support/dependencies.rb:276:in 'load_missing_constant' vendor/rails/activerecord/lib/../../activesupport/lib/active_support/dependencies.rb:468:in `const_m...

XML Pretty Printer Missing 2 Key Edge Cases

Using the XSLT file found on this blog to pretty-print XML with Nokogiri, almost everything works, but two problems keep me from using it for HTML. First, if a node is empty, it turns it into a self closing node, so: <textarea></textarea> gets converted to <textarea/> But that messes up the html tree when rendered. Second, if th...

How to translate this Hpricot code to Nokogiri?

Hpricot(html).inner_text.gsub("\r"," ").gsub("\n"," ").split(" ").join(" ") hpricot = Hpricot(html) hpricot.search("script").remove hpricot.search("link").remove hpricot.search("meta").remove hpricot.search("style").remove found it on http://www.savedmyday.com/2008/04/25/how-to-extract-text-from-html-using-rubyhpricot/ ...

Nokogiri Doc Element Not Returning Correctly

I am trying to scrape a wiktionary entry: uri = URI.parse("http://en.wiktionary.org/wiki/" + CGI.escape('abjure')) doc = Nokogiri::HTML(open(uri, 'User-Agent' => 'ruby')) but the doc shows no elements for this word. The other words work fine and this word used to work. I have no idea what changed. Anyone see anything wrong with thi...

screen-scraping

Nokogiri: add <tbody> after <table> elements as Firefox does

I have a problem: Firefox adds <tbody> after <table> whether it's there or not. I have no problem with this, but Nokogiri doesn't add it, and I need Nokogiri to emulate Firefox's behavior. How can I add tbody after <table> elements in a given HTML page? If tbody is already there, then move on to the next <table>....until all <tbody> tags a...

Unable to have nokogiri obey custom path parameters during install

I am trying to install nokogiri locally on dreamhost using the commands: $ wget ftp://xmlsoft.org/libxml2/libxml2-2.7.6.tar.gz $ wget ftp://xmlsoft.org/libxml2/libxslt-1.1.26.tar.gz $ tar zxvf libxml2-2.7.6.tar.gz $ cd libxml2-2.7.6 $ ./configure --prefix=$HOME/local/ --exec-prefix=$HOME/local $ make && make install $ cd .. $ tar zxvf l...

nokogiri: how to insert tbody tag immediately after table tag?

I want to make sure every table's immediate child is tbody. How can I write this with XPath or Nokogiri? doc.search("//table/").each do |j| new_parent = Nokogiri::XML::Node.new('tbody',doc) j.replace new_parent new_parent << j end ...

Changing href attributes with nokogiri and ruby on rails

Hi, I have an HTML document with links, for example: <html> <body> <ul> <li><a href="http://someurl.com/etc/etc">teste1</a></li> <li><a href="http://someurl.com/etc/etc">teste2</a></li> <li><a href="http://someurl.com/etc/etc">teste3</a></li> <ul> </body> </html...

string-manipulation

nokogiri: wrap <tbody> around <table>'s child

How can I do this? I need to place tbody after table tags, basically to emulate Firefox's behavior. I tried this: nodes = @doc.css "table > *" wrapper = nodes.wrap("<tbody></tbody>") Thanks. ...

nokogiri: strip all tbody tags without destroying its children!

doc.xpath("//tbody").remove removes tbody's children! I only want to remove all <tbody> tags from the document! How can I achieve this? ...

osx rvm ruby 1.8.7 nokogiri 1.4.1 - ERROR: Failed to build gem native extension.

I'm stuck with this problem. cat ~/.rvm/gems/ruby-1.8.7-p249/gems/nokogiri-1.4.1/ext/nokogiri/mkmf.log gives these errors (clipped): conftest.c:3: error: 'xmlParseDoc' undeclared (first use in this function) conftest.c:3: error: (Each undeclared identifier is reported only once conftest.c:3: error: for each function it appears in.) F...

Nokogiri pull parser (Nokogiri::XML::Reader) issue with self closing tag

I have a huge XML file (>400MB) containing products. Using a DOM parser is therefore excluded, so I tried to parse and process it using a pull parser. Below is a snippet from the each_product(&block) method where I iterate over the product list. Basically, using a stack, I transform each <product> ... </product> node into a hash and process ...

xml-pull-parser

How to install Nokogiri as a MacRuby gem?

The latest MacRuby release notes (v0.6) state that the authors have managed to get this release working with the SQLite and Nokogiri gems. However when I run sudo macgem install nokogiri I get the following errors: ERROR: Error installing nokogiri: extconf failed: and then a bunch of paths followed by: libxml2 is missing. try 'por...

Nokogiri find text in paragraphs

I want to replace the inner_text in all paragraphs in my XHTML document. I know I can get all text with Nokogiri like this: doc.xpath("//text()") But I only want to operate on text in paragraphs. How can I select all text in paragraphs without affecting any existing anchor text in links? #For example : <p>some text <a href="/">...

string-manipulation

Following a link using Nokogiri for scraping

Is there a method to follow a link using Nokogiri for scraping? I know I can extract the href and open it, but I thought I saw a method to do this using hpricot and was wondering if there was something like that in Nokogiri. ...

screen-scraping

Scraping &#151 character (long dash) error in Nokogiri

I am having trouble scraping a certain long dash that is encoded as &#151; on the Time magazine site. It looks like this: —. It works fine when this dash is encoded as &mdash;, but when the problem dash is scraped, it is returned as unknown characters. I am using Nokogiri and am wondering if I have to use some sort of special encoding? The p...

screen-scraping
