Perhaps this is nitpicky, but I have to ask. I'm using Nokogiri to parse XML, remove certain tags, and write over the original file with the results. Using .remove leaves blank lines in the XML. I'm currently using a regex to get rid of the blank lines. Is there some built-in Nokogiri method I should be using? Here's what I have:
requir...
Hi, I just want to do some straight conversion (almost just search and replace) but I'm having trouble just getting things to sit in place - I'm ending up with links out of place and duplicated content. I'm sure I'm doing something silly with my attempts at traversing the xml : )
I'm trying with:
builder = Nokogiri::HTML::Builder.new d...
I run my Ruby script, which is a very long series of loops. On each iteration, some random HTML file is parsed via Nokogiri.
top shows memory consumption % rising by 0.1, along with CPU usage, every few seconds.
Eventually the Ruby script crashes due to "not enough memory".
UPDATED to latest:
def extract(newdoc, newarra...
lotofxpath = arrayofmanyxpaths.map{|s| "\"" + s + "\""}.join(",")
puts lotofxpath #=> "/html/body/a[1]", "/html/body/a[2]"
newb = doc.xpath(lotofxpath).to_a
This does not work; it complains about an invalid XPath.
However, copying and pasting the output string
newb = doc.xpath("/html/body/a[1]", "/html/body/a[2]").to_a
will work wit...
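A likely explanation for the question above is that join produces one single string that only looks like two quoted arguments, whereas the pasted version passes two separate strings. A minimal sketch of the difference, in plain Ruby: count_paths is a hypothetical stand-in for any method that, like doc.xpath, accepts several path strings as separate arguments, and array_of_many_xpaths is an assumed sample array.

```ruby
# Hypothetical stand-in for a method that, like doc.xpath, takes any
# number of path strings as separate arguments.
def count_paths(*paths)
  paths.length
end

# Assumed sample data, mirroring the paths printed in the question.
array_of_many_xpaths = ["/html/body/a[1]", "/html/body/a[2]"]

# Joining builds ONE string containing quotes and a comma.
joined = array_of_many_xpaths.map { |s| "\"#{s}\"" }.join(",")

one_argument  = count_paths(joined)                 # the joined string is a single argument
two_arguments = count_paths(*array_of_many_xpaths)  # the splat passes each path separately
```

Under that assumption, the Nokogiri call would be written as doc.xpath(*array_of_many_xpaths), with no string building at all.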
When looping through many web pages and calling something simple like the code below:
manyhtmlpages.each do |page|
  doc = Nokogiri::HTML(page)
  puts doc.xpath("/html/body/h2[1]","/html/body/a[1]").to_s
end
I observe that memory consumption continually goes up until the script terminates due to running out of memory.
When I remove the doc.xpa...
I found a startling difference in CPU and memory usage. It seems garbage collection is not happening when I run the following Nokogiri script:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
def getHeader()
doz = Nokogiri::HTML(open('http://losangeles.craigslist.org/wst/reb/1484772751.html'))
puts doz.xpath("html[1]...
I would like to add things like bullet points "•" and such to html using the XML Builder in Nokogiri, but everything is being escaped. How do I prevent it from being escaped?
I would like the result to be:
<span>•</span>
rather than
<span>&#8226;</span>
What am I missing?
I'm just doing this:
xml.span {
xml...
I need a way to run the following Nokogiri script:
#parser.rb
require 'nokogiri'
def parseit()
  # ...
end
and call parseit() from the main.rb below, running under JRuby:
#main.rb
require 'parser'
parseit()
Of course, the problem is that JRuby cannot find 'nokogiri', as I have not installed it (i.e., nokogiri-java) via jruby -S gem install nokogiri
T...
What's the fastest/one-liner way to print the current node's XPath, or just "path/to/node", in Ruby with Nokogiri?
So this:
<nodeA>
  <nodeB>
    <nodeC/>
  </nodeB>
</nodeA>
to this (say we've gone down to nodeC by processing xml.children.each, etc...):
"nodeA/nodeB/nodeC"
...
I'm implementing an exporter for an XML data format that requires namespaces. I'm using the Nokogiri XML Builder (version 1.4.0) to do this.
However, I can't get Nokogiri to create a root node with a namespace.
This works:
Nokogiri::XML::Builder.new { |xml| xml.root('xmlns:foobar' => 'my-ns-url') }.to_xml
<?xml version="1.0"?>
<root ...
Is it possible to grab a following element's attributes and use them in the preceding one like this?:
<title>Section X</title>
<paragraph number="1">Stuff</paragraph>
<title>Section Y</title>
<paragraph number="2">Stuff</paragraph>
into:
<title id="ID1">1. Section X</title>
<paragraph number="1">Stuff</paragraph>
<title id="ID2">2. S...
Let's say I have this sample:
page = "<html><body><h1 class='foo'></h1><p class='foo'>hello people<a href='http://'>hello world</a></p></body></html>"
@nodes = []
Nokogiri::HTML(page).traverse do |n|
  if n[:class] == "foo"
    @nodes << {:name => n.name, :xpath => n.path, :text => n.text }
  end
end
...
Using Nokogiri and Ruby.
I have a page to parse with div ids like:
div id="some-list-number^875"
The digits after ...-number^ change randomly, so I can't just do
doc.css('#wikid-list-genres^875').each do |n|
  puts n.text.to_s
end
But the base structure is always the same: -number^ followed by some digits.
So I need some kind of wildm...
What would be the fastest way to do this?
I have many HTML documents that might (or might not) contain the word "Instructions" followed by several lines of instructions. I want to parse the pages that contain the word "Instructions" and the lines that follow.
...
Sorry for this question; apparently all my googling and API searching skills are failing me. I've written a web crawler in Ruby and I'm using Nokogiri::HTML to parse the page. I need to print the page out, and while messing around in IRB I noticed a pretty_print method. However, it takes a parameter and I can't figure out what it wants...
I have the following HTML:
<li><a href="/stumbler/millisami/tag/company/" class="">
<span class="right">69</span>
company</a>
</li>
and I want to scrape the text after the span tag, i.e. "company"
So, when I tried
doc.at_css("span:after")
a no-method error for :after is thrown.
How do I use pseudo-selectors with Nokogiri?
...
I am trying to build an XML document using Nokogiri. Some of the elements have hyphens in them. Here is an example to illustrate the problem:
require "nokogiri"
builder = Nokogiri::XML::Builder.new do |xml|
  xml.foo_bar "hello"
end
puts builder.to_xml
Produces:
<?xml version="1.0"?>
<foo_bar>hello</foo_bar>
However, when I try:
build...
I am using nokogiri to generate svg pictures. I would like to add the correct xml preamble and svg DTD declaration to get something like:
<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg>
...
With builder I could use instruct! and declare!...
The following code screen-scrapes fishersci.com for 3 pieces of information (the product name, the product URL and the catalog number) and saves the data into 3 table items: rec_item, rec_url and rec_cat respectively.
# lib/tasks/inventory_courses_new_item.rake
task :fetch_new_courses => :environment do
require 'nokogiri'
require 'o...
Hi,
I'm using Ruby with the Nokogiri module, and I want to get the content of the body without the script elements.
Nokogiri parsing uses XPath or CSS3 selectors. I really don't understand XPath, and I can't find a CSS selector that achieves my goal.
...