I'm trying to write a word counter for LyX files.

Life is almost simple, since most lines that need to be ignored begin with a \ (I'm prepared to assume that no textual lines begin with a backslash). However, some lines that look like real text aren't, and they are enclosed by \begin_inset and \end_inset:

I'm genuine text.

\begin_inset something
I'm not real text
Perhaps there will be more than one line! Or none at all! Who knows.
\end_inset

\begin_layout
I also need to be counted, and thus not removed
\end_layout

Is there a quick way in Ruby to strip the (smallest amount of) text between two markers? I imagine regular expressions are the way forward, but I can't figure out what they'd have to be.

Thanks in advance

+3  A: 

Is there a quick way in Ruby to strip the (smallest amount of) text between two markers?

str = "lala BEGIN_MARKER \nlu\nlu\n END_MARKER foo BEGIN_MARKER bar END_MARKER baz"
str.gsub(/BEGIN_MARKER.*?END_MARKER/m, "")
#=> "lala  foo  baz"
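
Applied to the LyX markers from the question, the same non-greedy, multiline pattern might look like the sketch below. The helper name lyx_word_count, the sample text, and the word pattern are assumptions made for illustration, not part of LyX or of the question. Note that if insets are nested, a non-greedy match stops at the first \end_inset, so this only suits files without nested insets.

```ruby
# Sample modelled on the question; a real .lyx file contains more markup.
lyx = <<~'LYX'
  I'm genuine text.

  \begin_inset something
  I'm not real text
  \end_inset

  \begin_layout
  I also need to be counted
  \end_layout
LYX

# Hypothetical helper: drop insets (non-greedy, multiline), then drop any
# remaining line that starts with a backslash, and count what is left.
def lyx_word_count(text)
  without_insets = text.gsub(/\\begin_inset.*?\\end_inset/m, "")
  lines = without_insets.lines.reject { |l| l.start_with?("\\") }
  lines.join.scan(/[\w'-]+/).size
end

puts lyx_word_count(lyx) # prints 9
```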
sepp2k
D'oh! *thumps head* of course - thanks!
JP
+1  A: 

gsub could be expensive for longer files if you're reading the whole file in as one string.

If you have to process it in chunks anyway, you might want to use a stateful, line-by-line parser:

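in_block = false
# File.foreach reads one line at a time and closes the file afterwards.
File.foreach(fname) do |line|
  if in_block
    # Inside a block: skip lines until the end marker has been seen.
    in_block = false if line =~ /END_MARKER/
    next
  elsif line =~ /BEGIN_MARKER/
    # Start of a block: skip this line and everything up to the end marker.
    in_block = true
    next
  end
  count_words(line) # count_words is assumed to be defined elsewhere
end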
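
For the LyX case, the same line-by-line idea could be sketched as follows. This is a minimal sketch that assumes \begin_inset and \end_inset always appear at the start of a line; the helper name count_lyx_words and the word pattern are made up for illustration. A depth counter replaces the boolean flag so that nested insets are skipped as well.

```ruby
require "stringio"

# Hypothetical helper: counts words line by line, skipping everything
# inside \begin_inset ... \end_inset (nesting is tracked with a counter)
# and every other line that starts with a backslash.
def count_lyx_words(io)
  depth = 0
  total = 0
  io.each_line do |line|
    if line.start_with?("\\begin_inset")
      depth += 1
    elsif line.start_with?("\\end_inset")
      depth -= 1 if depth > 0
    elsif depth.zero? && !line.start_with?("\\")
      total += line.scan(/[\w'-]+/).size
    end
  end
  total
end

sample = StringIO.new(<<~'LYX')
  real text here
  \begin_inset Foot
  hidden
  \begin_inset Formula
  also hidden
  \end_inset
  still hidden
  \end_inset
  more real text
LYX

puts count_lyx_words(sample) # prints 6
```

For a file on disk, something like File.open(path) { |f| count_lyx_words(f) } should work, since the helper only needs an object that responds to each_line.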
klochner
A: 

You should look at str.scan(). Assuming your text is in the variable s, something like this should work:

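# gsub (not sub!) removes every inset and always returns a string;
# the /m flag lets . match newlines so multi-line insets are matched.
s_strip_inset = s.gsub(/\\begin_inset.*?\\end_inset/m, "")
word_count = s_strip_inset.scan(/[\w-]+/).size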
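
One detail of String#scan is worth knowing when counting words: if the pattern contains a capture group, scan returns the captures rather than the whole matches. The number of matches is the same either way, but the contents differ, as this small sketch with made-up sample text shows.

```ruby
text = "well-known word counter"

# With a capture group, each result is an array holding the group's
# last captured character, not the whole word.
grouped = text.scan(/(\w|-)+/)

# With a character class (no group), each result is the matched word.
plain = text.scan(/[\w-]+/)

p grouped.size # prints 3
p plain        # prints ["well-known", "word", "counter"]
```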
ghoppe