views:

289

answers:

8

Question: Is there a clever way to parse plain-text lists into HTML?

Or, must we resort to esoteric recursive methods, or sheer brute force?

I've been wondering this for a while now. In my own ruminations I have come back again and again to the brute-force, and odd recursive, methods ... but it always seems so clunky. There must be a better way, right?

So what's the clever way?

Assumptions

It is necessary to set up a scenario, so these are my assumptions.

  1. Lists may be nested 3 levels deep (at a minimum), of either unordered or ordered lists. The list type and depth is controlled by its prefix:

    1. There is a mandatory space following the prefix.
    2. List depth is controlled by how many non-spaced characters there are in the prefix; ***** would be nested five lists deep.
    3. List type is enforced by character type, * or - being an unordered list, # being a disordered list.
  2. Items are separated by only 1 \n character. (Lets pretend two consecutive new-lines qualify as a "group", a paragraph, div, or some other HTML tag like in Markdown or Textile.)

  3. List types may be freely mixed.

  4. Output should be valid HTML 4, preferably with ending </li>s

  5. Parsing can be done with, or without, Regex as desired.

Sample Markup

* List
*# List
** List
**# List
** List

# List
#* List
## List
##* List
## List

Desired Output

Broken up a bit for readability, but it should be a valid variation of this (remember, that I'm just spacing it nicely!):

<ul>
  <li>List</li>
  <li>
    <ol><li>list</li></ol>
    <ul><li>List</li></ul>
  </li>
  <li>List</li>
  <li>
    <ol><li>List</li></ol>
  </li>
  <li>List</li>
</ul>


<ol>
  <li>List</li>
  <li>
    <ul><li>list</li></ul>
    <ol><li>List</li></ol>
  </li>
  <li>List</li>
  <li>
    <ul><li>List</li></ul>
  </li>
  <li>List</li>
</ol>

In Summary

Just how do you do this? I'd really like to understand the good ways to handle unpredictably recursing lists, because it strikes me as an ugly mess for anyone to tangle with.

+1  A: 

Look at Textile.

It is available in a number of languages.

Josh
-1; I'm not asking for an implementation, I'm looking for the principle(s) and methodry.
The Wicked Flea
there are open source versions of textile you could use to see how they have solved the problem.
Josh
The trouble is that they're written 100% for performance, once I find the code. And, when it's written for performance, I can have trouble comprehending its operation.
The Wicked Flea
I doubt they're written "100% for performance". Take a look, e.g., at http://github.com/jsamsa/python-textile/blob/76d6a5d9710b8d4bbc5684c2f1c4abf8d589f8a2/textile.py , it looks pretty straightforward to me.
Roberto Bonvallet
@Roberto: Thanks, but that is still slightly obfuscated to me. It doesn't help that I don't know Python. :/
The Wicked Flea
Some of us don't want to sift through hundreds of lines of code that we've never seen just to get a small idea of how to do something. Isn't that was SO is for, in the first place? :)
musicfreak
here is an interesting article about transforming html to textile with some example code http://debuggable.com/posts/how-to-transform-html-to-textile-markup-the-cakephp-textilehelper-revisited:480f4dfe-a258-4b54-a31b-4933cbdd56cb
Josh
+1  A: 

This how you can do it with regexp and cycle (^ stands for newline, $ for endline):

do { 
    ^#anything$ -> <ol><li>$^anything</li></ol>$
    ^*anything$ -> <ul><li>$^anything</li></ul>$
} while any of those above applies

do {
    </ol><ol> -> 
    </ul><ul> -> 
    </li><li> -> 
} while any of those above applies

This makes it much simpler than a simple regexp. The way it works: you first expand each line as if it was isolated, but then eat extra list markers.

ilya n.
Regexp is about the worst way to tackle this, I'd think. It's not what I would call simple, and it's a bitch to maintain. Recursion would be much more elegant and much easier to change later. But regexp is a working solution so no -1 from me.
musicfreak
A regexp is fine, if you need to, say, identify a list within a larger body of non-list text (as within a wiki page). But if you know ahead of time that *all* of your lines are list items, you don't really need to bother.
Shog9
I added how it works... You handle nested lists by iterating over those expressions until you cannot apply them.
ilya n.
There we go. Good examples. +1
musicfreak
+2  A: 

Basic iterative technique:

  1. A regex or some other simple parser that'll recognize the format for a list, capturing each list item (including those with additional levels of indentation).
  2. A counter to keep track of the current indentation level.
  3. Logic to iterate through each capture, writing out <li>s and inserting appropriate begin / end tags (<ol></ol>, <ul></ul>) and incrementing / decrementing the indentation counter whenever the current indentation level is greater or less than the previous one.


Edit: Here's a simple expression that'll probably work for you with a bit of tweaking: each match is a top-level list, with two sets of named captures, the markers (char count is indentation level, last char indicates desired list type) and the list item text.

(?:(?:^|\n)[\t ]*(?<marker>[*#]+)[\t ]*(?<text>[^\n\r]+)\r*(?=\n|$))+
Shog9
So basically, loop through it checking for changes in the indentation level? I come to this premise, but keep doubting myself and not writing part 3 of that.
The Wicked Flea
Yeah - it's really that simple. The only interesting part is writing a solid regex, and your format is simple enough (no explicit list item numbering that you wish to preserve) that that won't be too bad either.
Shog9
Hmm, maybe I should just do it. It just never has looked that simple. I've never done a lot of text-processing, and so am still novice-like with it.
The Wicked Flea
Yeah, go for it. I've edited in a simple regex to give you a start; though as I mentioned in a comment elsewhere you can probably get away with just looping line-by-line if you know that everything in the text you're working with is a list.
Shog9
Thanks Shog, I'll post an answer eventually with my code and see how well it works.
The Wicked Flea
Sounds good. Thinking about it a bit more, I realize you'll have to keep a bit of additional state as well: a stack of open list tags. Else you won't know how to close out mixed-type nested lists.
Shog9
+2  A: 

The line-by-line solution with some pythonic concepts:

cur = ''
for line in lines():
    prev = cur
    cur, text = split_line_into_marker_and_remainder(line)
    if cur && (cur == prev) :
         print '</li><li>'
    else :
         nprev, ncur = kill_common_beginning(prev, cur)
         for c in nprev: print '</li>' + ((c == '#') ? '</ol>' : '</ul>') 
         for c in ncur:  print           ((c == '#') ? '<ol>'  : '<ul>' )  + '<li>'
    print text

This is how it works: to process the line, I compare the marker for previous line with the marker for this line.

I use a fictional function split_line_into_marker_and_remainder, which returns two results, marker cur and the text itself. It's trivial to implement it as a C++ function with 3 arguments, an input and 2 output strings.

At the core is a fictional function kill_common_beginning which would take away the repeat part of prev and cur. After that, I need to close everything that remains in previous marker and open everything that remains in current marker. I can do it with a replace, by mapping characters to string, or by a loop.

The three lines wil be pretty straightforward in C++:

char * saved = prev;
for (; *prev && (*prev == *cur);  prev++, cur++ ); // "kill_common_beginning"
while (*prev) *(prev++) == '#' ? ...
while (*cur)  *(cur++) == '#' ? ...
cur = saved;

Note, however, that there is a special case: when the indentation didn't change, those lines don't output anything. That's fine if we're outside of the list, but that's not fine in the list: so in that case we should output the </li><li> manually.

ilya n.
I'd like the stress the key idea: you don't need indent counter.
ilya n.
I'm primarily confused about that else statement you've got, could you explain its code, specifically?
The Wicked Flea
Aha, now I've got it. Thank you. :)
The Wicked Flea
+2  A: 

Best explanation I've seen is from Higher-Order Perl by Mark Jason Dominus. The full text is available online at http://hop.perl.plover.com/book/.

Though the examples are all in Perl, the breakdown of the logic behind each area is fantastic.

Chapter 8 (! PDF link) is specifically about parsing. Though the lessons through out the book are somewhat related.

Mark
Thanks for the link, it's a pretty good resource for parsing. However, it is long-ish and I've spent too much time on this curiosity already.
The Wicked Flea
A: 

This Perl program is a first attempt at that.

#! /usr/bin/env perl
use strict;
use warnings;
use 5.010;

my $data = [];
while( my $line = <> ){
  last if $line =~ /^[.]{3,3}$/;
  my($nest,$rest) = $line =~ /^([\#*]*)\s+(.*)$/x;
  my @nest = split '', $nest;

  if( @nest ){
    recourse($data,$rest,@nest);
  }else{
    push @$data, $line;
  }
}

de_recourse($data);

sub de_recourse{
  my($ref) = @_;
  my %de_map = (
    '*' => 'ul',
    '#' => 'ol'
  );

  if( ref $ref ){
    my($type,@elem) = @$ref;
    if( ref $type ){
      for my $elem (@$ref){
        de_recourse($elem);
      }
    }else{
      $type = $de_map{$type};

      say "<$type>";
      for my $elem (@elem){
        say "<li>";
        de_recourse($elem);
        say "</li>"
      }
      say "</$type>";
    }
  }else{
    print $ref;
  }
}

sub recourse{
  my($last_ref,$str,@nest) = @_;
  die unless @_ >= 2;
  die unless ref $last_ref;
  my $nest = shift @nest;

  if( @_ == 2 ){
    push @$last_ref, $str;
    return;
  }

  my $previous = $last_ref->[-1];
  if( ref $previous ){
    if( $previous->[0] eq $nest ){
      recourse( $previous,$str,@nest );
      return;
    }
  }

  my $new_ref = [ $nest ];
  push @$last_ref, $new_ref;
  recourse( $new_ref, $str, @nest );
}

Hope it helps

Brad Gilbert
+1  A: 

Here is my own solution, which seems to be a hybrid of Shog9's suggestions (a variation on his regex, Ruby doesn't support named matches) and Ilya's iterative method. My working language was Ruby.

Some things of note: I used a stack-based system, and that "String#scan(pattern)" is really just a "match-all" method that returns an array of matches.

def list(text)
  # returns [['*','text'],...]
  parts = text.scan(/(?:(?:^|\n)([#*]+)[\t ]*(.+)(?=\n|$))/)

  # returns ul/ol based on the byte passed in
  list_type = lambda { |c| (c == '*' ? 'ul' : 'ol') }

  prev = []
  tags = [list_type.call(parts[0][0][0].chr)]
  result = parts.inject("<#{tags.last}><li>") do |output,newline|
    unless prev.count == 0
      # the following comparison says whether added or removed,
      # this is the "how much"
      diff = (prev[0].length - newline[0].length).abs
      case prev[0].length <=> newline[0].length
        when -1: # new tags to add
          part = ((diff > 1) ? newline[0].slice(-1 - diff,-1) : newline[0][-1].chr)
          part.each_char do |c|
            tags << list_type.call(c)
            output << "<#{tags.last}><li>"
          end
        when 0: # no new tags... but possibly changed
          if newline[0] == prev[0]
            output << '</li><li>'
          else
            STDERR.puts "Bad input string: #{newline.join(' ')}"
          end
        when 1: # tags removed
          diff.times{ output << "</li></#{tags.pop}>" }
          output << '</li><li>'
      end
    end

    prev = newline
    output + newline[1]
  end

  tags.reverse.each { |t| result << "</li></#{t}>" }
  result
end

Thankfully this code does work and generate valid HTML. And this did turn out better than I had anticipated. It doesn't even feel clunky.

The Wicked Flea
++ Nicely done!
Shog9
Cool! Can one of us get an answer accepted? I can rework my program if necessary.
ilya n.
I don't know that there is one true answer to this, as both of you helped me almost equally. To understand what you wrote I had to rewrite the whole thing in Ruby, though looking at it I did understand the basic premise. Put the advice and code together and I got a working result, without either I might not have made it. (I wish I could accept both as answers.)
The Wicked Flea
A: 

Try Gelatin. The syntax definition would probably be 5 lines or less.

knipknap
Looks interesting, but I don't know how to program in python.
The Wicked Flea
There's no need to program in Python, Gelatin can be used as any other command line tool (gel -s syntax.gel input.txt).
knipknap