What would be the fastest way to do this?

I have many HTML documents that might (or might not) contain the word "Instructions" followed by several lines of instructions. I want to parse the pages that contain the word "Instructions" and pull out the lines that follow it.

A: 

This is not the most "correct" way, but it will mostly work. Use a Ruby regular expression to find the strings.

The regex you want is something like /Instructions([^<]+)/ (add the i flag if the capitalization varies). This assumes that the instructions end at a < character.
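
As a rough sketch of the regex approach, the snippet below runs that pattern against an inline HTML string. The `extract_instructions` helper name and the sample markup are made up for illustration, and because the capture stops at the next `<`, it only works when the instructions themselves contain no tags.

```ruby
# Hypothetical helper: returns the text between "Instructions" and the
# next "<", or nil when the pattern does not match.
def extract_instructions(html)
  match = html.match(/Instructions([^<]+)/)
  match && match[1].strip
end

# Made-up sample markup, only for demonstration.
sample = "<p>intro</p><p>Instructions\n- step one\n- step two\n</p>"
puts extract_instructions(sample)
```

Returning nil for pages without a match makes it easy to skip documents that have no instructions.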

Jamie
A: 

You can start by just testing if a document matches:

if File.read('docname.html') =~ /Instructions/
  # Parse to remove the instructions.
end

I'd then recommend using Hpricot to extract the part you want. How difficult that is depends on how your HTML is structured. Please post more details about the structure if you want more specific help.
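
Since the question mentions many documents, the same check could be applied to a whole directory first, so that only matching files are handed to a parser. This is a minimal sketch: the `files_with_instructions` name, the `pages` directory, and the `*.html` glob are assumptions, not something given in the question.

```ruby
# Hypothetical helper: returns the paths under `dir` whose contents mention
# "Instructions". The "*.html" glob is an assumption about the file layout.
def files_with_instructions(dir)
  Dir.glob(File.join(dir, '*.html')).sort.select do |path|
    File.read(path).include?('Instructions')
  end
end

# 'pages' is a placeholder directory name.
files_with_instructions('pages').each { |path| puts path }
```

A plain substring test is cheaper than parsing every file, which matters if most documents contain no instructions at all.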

Peter
A: 

Maybe something along these lines:

require 'rubygems'
require 'nokogiri'

def find_instructions(doc)
  doc.xpath('//body//text()').each do |text|
    # String is not Enumerable in Ruby 1.9+, so split into lines first
    instructions = text.content.lines.select do |line|
      # flip-flop matches all sections starting with
      # "Instructions" and ending with a blank line
      true if (line =~ /Instructions/)..(line.strip.empty?)
    end
    return instructions unless instructions.empty?
  end
  []
end

puts find_instructions(Nokogiri::HTML(DATA.read))


__END__
<html>
<head>
  <title>Instructions</title>
</head>
<body>
lorem
ipsum
<p>
lorem
ipsum
<p>
lorem
ipsum
<p>
Instructions
- Browse stackoverflow
- Answer questions
- ???
- Profit

More
<p>
lorem
ipsum
</body>
</html>
Adrian
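
The flip-flop operator in the answer above is compact but rarely seen. The same idea can be sketched on plain text with `drop_while` and `take_while`. Note the differences: this variant returns only the first section and leaves out the terminating blank line. The `instruction_lines` name and the sample text are made up for illustration.

```ruby
# Hypothetical helper: keeps the lines from the first one mentioning
# "Instructions" up to (not including) the next blank line.
def instruction_lines(text)
  text.lines
      .drop_while { |line| !line.include?('Instructions') }
      .take_while { |line| !line.strip.empty? }
      .map(&:chomp)
end

# Made-up sample text, only for demonstration.
text = "lorem\nInstructions\n- Browse\n- Answer\n\nMore\n"
puts instruction_lines(text)
```

It could be applied to the content of each text node in place of the `select` block.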