nokogiri

Better Way to Remove Blank Lines After Nokogiri Node Removal

Perhaps this is nitpicky, but I have to ask. I'm using Nokogiri to parse XML, remove certain tags, and write over the original file with the results. Using .remove leaves blank lines in the XML. I'm currently using a regex to get rid of the blank lines. Is there some built-in Nokogiri method I should be using? Here's what I have: requir...

nokogiri: xml to html

Hi, I just want to do some straight conversion (almost just search and replace) but I'm having trouble just getting things to sit in place - I'm ending up with links out of place and duplicated content. I'm sure I'm doing something silly with my attempts at traversing the xml : ) I'm trying with: builder = Nokogiri::HTML::Builder.new d...

ruby script memory % consumption keeps going up...any way to prevent this ?

As I run my Ruby script, which is a very long series of loops, each iteration parses some random HTML file via Nokogiri. top reveals that memory consumption % increments by 0.1, along with CPU usage, every few seconds. Eventually the Ruby script crashes due to "not enough memory". UPDATED to latest: def extract(newdoc, newarra...

nokogiri: invalid xpath ??

lotofxpath = arrayofmanyxpaths.map{|s| "\"" + s + "\""}.join(",") puts lotofxpath #=> "/html/body/a[1]", "/html/body/a[2]" newb = doc.xpath(lotofxpath).to_a this will not work, and complain about invalid xpath. however, copying pasting the output string newb = doc.xpath("/html/body/a[1]", "/html/body/a[2]").to_a will work wit...

nokogiri doc.xpath() problem....

When looping through many web pages and calling something simple like below manyhtmlpages.each do |page| doc = Nokogiri::HTML(page) puts doc.xpath("/html/body/h2[1]","/html/body/a[1]").to_s end I observe that memory consumption continually goes up until the script terminates due to running out of memory. When I remove the doc.xpa...

running nokogiri in Jruby vs. just ruby

I found startling difference in CPU and memory consumption usage. It seems garbage collection is not happening when i run the following nokogiri script require 'rubygems' require 'nokogiri' require 'open-uri' def getHeader() doz = Nokogiri::HTML(open('http://losangeles.craigslist.org/wst/reb/1484772751.html')) puts doz.xpath("html[1]...

Adding non-escaped Ampersands to HTML with Nokogiri::XML::Builder

I would like to add things like bullet points "•" and such to html using the XML Builder in Nokogiri, but everything is being escaped. How do I prevent it from being escaped? I would like the result to be: <span>&#8226;</span> rather than <span>&amp;#8226;</span> What am I missing? I'm just doing this: xml.span { xml...

possible to load nokogiri in jruby without installing nokogiri-java ?

i need a way to run following nokogiri script #parser.rb require 'nokogiri' def parseit() //... end and call the parseit() while running below main.rb in jruby #main.rb require 'parser' parseit() Of course the problem is jruby cannot find 'nokogiri' as I have not installed it aka nokogiri-java via jruby -S gem install nokogiri T...

Fastest/One-liner way to print XML node's XPath in Ruby?

What's the fastest/one-liner way to print the current nodes xpath, or just "path/to/node", in Ruby with Nokogiri? So this: <nodeA> <nodeB> <nodeC/> </nodeB> </nodeA> to this (say we've gone down to nodeC by processing xml.children.each, etc...): "nodeA/nodeB/nodeC" ...

Creating an XML document with a namespaced root element with Nokogiri builder

I'm implementing an exporter for an XML data format that requires namespaces. I'm using the Nokogiri XML Builder (version 1.4.0) to do this. However, I can't get Nokogiri to create a root node with a namespace. This works: Nokogiri::XML::Builder.new { |xml| xml.root('xmlns:foobar' => 'my-ns-url') }.to_xml <?xml version="1.0"?> <root ...

How can I get nokogiri to select node attributes and add them to other nodes?

Is it possible to grab a following element's attributes and use them in the preceding one like this?: <title>Section X</title> <paragraph number="1">Stuff</paragraph> <title>Section Y</title> <paragraph number="2">Stuff</paragraph> into: <title id="ID1">1. Section X</title> <paragraph number="1">Stuff</paragraph> <title id="ID2">2. S...

Nokogiri how to get the parent text and not the childs text and reference the text back to its parent

Let's say I have this sample: page = "<html><body><h1 class='foo'></h1><p class='foo'>hello people<a href='http://'>hello world</a></p></body></html>" @nodes = [] Nokogiri::HTML(page).traverse do |n| if n[:class] == "foo" @nodes << {:name => n.name, :xpath => n.path, :text => n.text } end end ...

Nokogiri and random div name

Using Nokogiri and Ruby. I have a page to parse with div ids like: div id="some-list-number^875" The numbers after ...-number^ change randomly, so I can't just do doc.css('#wikid-list-genres^875').each do |n| puts n.text.to_s end But the base structure is always the same: -number^ followed by some digits. So I need some kind of wildm...

fastest method to find a specific word in an xhtml document

What would be the fastest way to do this? I have many HTML documents that might (or might not) contain the word "Instructions" followed by several lines of instructions. I want to parse the pages that contain the word "Instructions" and the lines that follow. ...

Nokogiri pretty printing

Sorry for this question, apparently all my googling and api searching skills must be failing me. I've written a web crawler in ruby and I'm using Nokogiri::HTML to parse the page. I need to print the page out and while messing around in IRB I noticed a pretty_print method. However it takes a parameter and I can't figure out what it wants...

nokogiri with :after css pseudo selector

I have the following HTML: <li><a href="/stumbler/millisami/tag/company/" class=""> <span class="right">69</span> company</a> </li> and I want to scrape the text after the span tag, i.e. "company". When I tried doc.at_css("span:after"), a no-method error for :after was thrown. How do I use pseudo selectors with Nokogiri? ...

Nokogiri::XML::Builder with hyphen in element name

I am trying to build an XML document using Nokogiri. Some of the elements have hyphens in them. Here is an example to illustrate the problem: require "nokogiri" builder = Nokogiri::XML::Builder.new do |xml| xml.foo_bar "hello" end puts builder.to_xml Produces: <?xml version="1.0"?> <foo_bar>hello</foo_bar> However, when I try: build...

Add a dtd using nokogiri builder

I am using nokogiri to generate svg pictures. I would like to add the correct xml preamble and svg DTD declaration to get something like: <?xml version="1.0" standalone="no"?> <!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> <svg> ... With builder I could use instruct! and declare!...

Nokogiri screen-scrapes a large set of data. How do I save all of the data in Rails?

The following code screen-scrapes fishersci.com for 3 pieces of information: The product name, The product URL and the catalog number and saves the data into 3 table items rec_item, rec_url and rec_cat respectively. # lib/tasks/inventory_courses_new_item.rake task :fetch_new_courses => :environment do require 'nokogiri' require 'o...

Removing the <script> elements of an HTML

Hi, I'm using Ruby with the Nokogiri module, and I want to get the content of the body without the script elements. Nokogiri supports XPath and CSS 3 selectors. I don't really understand XPath, and I can't find a CSS selector that achieves my goal. ...