ansaurus

Question

Is there a clever way to parse plain-text lists into HTML?

Answer 1

+1 A:

Look at Textile.

It is available in a number of languages.

Josh 2009-06-17 18:44:33

-1; I'm not asking for an implementation, I'm looking for the principle(s) and methodry.

The Wicked Flea 2009-06-17 18:53:46

there are open source versions of textile you could use to see how they have solved the problem.

Josh 2009-06-17 18:55:52

The trouble is that they're written 100% for performance, once I find the code. And, when it's written for performance, I can have trouble comprehending its operation.

The Wicked Flea 2009-06-17 18:58:28

I doubt they're written "100% for performance". Take a look, e.g., at http://github.com/jsamsa/python-textile/blob/76d6a5d9710b8d4bbc5684c2f1c4abf8d589f8a2/textile.py , it looks pretty straightforward to me.

Roberto Bonvallet 2009-06-17 19:16:41

@Roberto: Thanks, but that is still slightly obfuscated to me. It doesn't help that I don't know Python. :/

The Wicked Flea 2009-06-17 19:28:05

Some of us don't want to sift through hundreds of lines of code that we've never seen just to get a small idea of how to do something. Isn't that was SO is for, in the first place? :)

musicfreak 2009-06-17 19:29:51

here is an interesting article about transforming html to textile with some example code http://debuggable.com/posts/how-to-transform-html-to-textile-markup-the-cakephp-textilehelper-revisited:480f4dfe-a258-4b54-a31b-4933cbdd56cb

Josh 2009-06-17 23:29:57

Answer 2

+1 A:

This how you can do it with regexp and cycle (^ stands for newline, $ for endline):

do { 
    ^#anything$ -> <ol><li>$^anything</li></ol>$
    ^*anything$ -> <ul><li>$^anything</li></ul>$
} while any of those above applies

do {
    </ol><ol> -> 
    </ul><ul> -> 
    </li><li> -> 
} while any of those above applies

This makes it much simpler than a simple regexp. The way it works: you first expand each line as if it was isolated, but then eat extra list markers.

ilya n. 2009-06-17 18:46:01

Regexp is about the worst way to tackle this, I'd think. It's not what I would call simple, and it's a bitch to maintain. Recursion would be much more elegant and much easier to change later. But regexp is a working solution so no -1 from me.

musicfreak 2009-06-17 18:53:44

A regexp is fine, if you need to, say, identify a list within a larger body of non-list text (as within a wiki page). But if you know ahead of time that *all* of your lines are list items, you don't really need to bother.

Shog9 2009-06-17 19:05:59

I added how it works... You handle nested lists by iterating over those expressions until you cannot apply them.

ilya n. 2009-06-17 19:23:51

There we go. Good examples. +1

musicfreak 2009-06-17 19:32:22

Answer 3

+2 A:

Basic iterative technique:

A regex or some other simple parser that'll recognize the format for a list, capturing each list item (including those with additional levels of indentation).
A counter to keep track of the current indentation level.
Logic to iterate through each capture, writing out <li>s and inserting appropriate begin / end tags (<ol></ol>, <ul></ul>) and incrementing / decrementing the indentation counter whenever the current indentation level is greater or less than the previous one.

Edit: Here's a simple expression that'll probably work for you with a bit of tweaking: each match is a top-level list, with two sets of named captures, the markers (char count is indentation level, last char indicates desired list type) and the list item text.

(?:(?:^|\n)[\t ]*(?<marker>[*#]+)[\t ]*(?<text>[^\n\r]+)\r*(?=\n|$))+

Shog9 2009-06-17 18:58:24

So basically, loop through it checking for changes in the indentation level? I come to this premise, but keep doubting myself and not writing part 3 of that.

The Wicked Flea 2009-06-17 19:02:41

Yeah - it's really that simple. The only interesting part is writing a solid regex, and your format is simple enough (no explicit list item numbering that you wish to preserve) that that won't be too bad either.

Shog9 2009-06-17 19:04:27

Hmm, maybe I should just do it. It just never has looked that simple. I've never done a lot of text-processing, and so am still novice-like with it.

The Wicked Flea 2009-06-17 19:08:52

Yeah, go for it. I've edited in a simple regex to give you a start; though as I mentioned in a comment elsewhere you can probably get away with just looping line-by-line if you know that everything in the text you're working with is a list.

Shog9 2009-06-17 19:22:04

Thanks Shog, I'll post an answer eventually with my code and see how well it works.

The Wicked Flea 2009-06-17 19:26:02

Sounds good. Thinking about it a bit more, I realize you'll have to keep a bit of additional state as well: a stack of open list tags. Else you won't know how to close out mixed-type nested lists.

Shog9 2009-06-17 19:33:12

Answer 4

+2 A:

The line-by-line solution with some pythonic concepts:

cur = ''
for line in lines():
    prev = cur
    cur, text = split_line_into_marker_and_remainder(line)
    if cur && (cur == prev) :
         print '</li><li>'
    else :
         nprev, ncur = kill_common_beginning(prev, cur)
         for c in nprev: print '</li>' + ((c == '#') ? '</ol>' : '</ul>') 
         for c in ncur:  print           ((c == '#') ? '<ol>'  : '<ul>' )  + '<li>'
    print text

This is how it works: to process the line, I compare the marker for previous line with the marker for this line.

I use a fictional function split_line_into_marker_and_remainder, which returns two results, marker cur and the text itself. It's trivial to implement it as a C++ function with 3 arguments, an input and 2 output strings.

At the core is a fictional function kill_common_beginning which would take away the repeat part of prev and cur. After that, I need to close everything that remains in previous marker and open everything that remains in current marker. I can do it with a replace, by mapping characters to string, or by a loop.

The three lines wil be pretty straightforward in C++:

char * saved = prev;
for (; *prev && (*prev == *cur);  prev++, cur++ ); // "kill_common_beginning"
while (*prev) *(prev++) == '#' ? ...
while (*cur)  *(cur++) == '#' ? ...
cur = saved;

Note, however, that there is a special case: when the indentation didn't change, those lines don't output anything. That's fine if we're outside of the list, but that's not fine in the list: so in that case we should output the </li><li> manually.

ilya n. 2009-06-17 19:54:00

I'd like the stress the key idea: you don't need indent counter.

ilya n. 2009-06-17 20:16:02

I'm primarily confused about that else statement you've got, could you explain its code, specifically?

The Wicked Flea 2009-06-17 20:28:58

Aha, now I've got it. Thank you. :)

The Wicked Flea 2009-06-17 20:49:49

Answer 5

+2 A:

Best explanation I've seen is from Higher-Order Perl by Mark Jason Dominus. The full text is available online at http://hop.perl.plover.com/book/.

Though the examples are all in Perl, the breakdown of the logic behind each area is fantastic.

Chapter 8 (! PDF link) is specifically about parsing. Though the lessons through out the book are somewhat related.

Mark 2009-06-17 21:37:35

Thanks for the link, it's a pretty good resource for parsing. However, it is long-ish and I've spent too much time on this curiosity already.

The Wicked Flea 2009-06-18 13:56:11

Answer 6

A:

This Perl program is a first attempt at that.

#! /usr/bin/env perl
use strict;
use warnings;
use 5.010;

my $data = [];
while( my $line = <> ){
  last if $line =~ /^[.]{3,3}$/;
  my($nest,$rest) = $line =~ /^([\#*]*)\s+(.*)$/x;
  my @nest = split '', $nest;

  if( @nest ){
    recourse($data,$rest,@nest);
  }else{
    push @$data, $line;
  }
}

de_recourse($data);

sub de_recourse{
  my($ref) = @_;
  my %de_map = (
    '*' => 'ul',
    '#' => 'ol'
  );

  if( ref $ref ){
    my($type,@elem) = @$ref;
    if( ref $type ){
      for my $elem (@$ref){
        de_recourse($elem);
      }
    }else{
      $type = $de_map{$type};

      say "<$type>";
      for my $elem (@elem){
        say "<li>";
        de_recourse($elem);
        say "</li>"
      }
      say "</$type>";
    }
  }else{
    print $ref;
  }
}

sub recourse{
  my($last_ref,$str,@nest) = @_;
  die unless @_ >= 2;
  die unless ref $last_ref;
  my $nest = shift @nest;

  if( @_ == 2 ){
    push @$last_ref, $str;
    return;
  }

  my $previous = $last_ref->[-1];
  if( ref $previous ){
    if( $previous->[0] eq $nest ){
      recourse( $previous,$str,@nest );
      return;
    }
  }

  my $new_ref = [ $nest ];
  push @$last_ref, $new_ref;
  recourse( $new_ref, $str, @nest );
}

Hope it helps

Brad Gilbert 2009-06-18 05:43:11

Answer 7

+1 A:

Here is my own solution, which seems to be a hybrid of Shog9's suggestions (a variation on his regex, Ruby doesn't support named matches) and Ilya's iterative method. My working language was Ruby.

Some things of note: I used a stack-based system, and that "String#scan(pattern)" is really just a "match-all" method that returns an array of matches.

def list(text)
  # returns [['*','text'],...]
  parts = text.scan(/(?:(?:^|\n)([#*]+)[\t ]*(.+)(?=\n|$))/)

  # returns ul/ol based on the byte passed in
  list_type = lambda { |c| (c == '*' ? 'ul' : 'ol') }

  prev = []
  tags = [list_type.call(parts[0][0][0].chr)]
  result = parts.inject("<#{tags.last}><li>") do |output,newline|
    unless prev.count == 0
      # the following comparison says whether added or removed,
      # this is the "how much"
      diff = (prev[0].length - newline[0].length).abs
      case prev[0].length <=> newline[0].length
        when -1: # new tags to add
          part = ((diff > 1) ? newline[0].slice(-1 - diff,-1) : newline[0][-1].chr)
          part.each_char do |c|
            tags << list_type.call(c)
            output << "<#{tags.last}><li>"
          end
        when 0: # no new tags... but possibly changed
          if newline[0] == prev[0]
            output << '</li><li>'
          else
            STDERR.puts "Bad input string: #{newline.join(' ')}"
          end
        when 1: # tags removed
          diff.times{ output << "</li></#{tags.pop}>" }
          output << '</li><li>'
      end
    end

    prev = newline
    output + newline[1]
  end

  tags.reverse.each { |t| result << "</li></#{t}>" }
  result
end

Thankfully this code does work and generate valid HTML. And this did turn out better than I had anticipated. It doesn't even feel clunky.

The Wicked Flea 2009-06-18 13:54:42

++ Nicely done!

Shog9 2009-06-18 18:48:47

Cool! Can one of us get an answer accepted? I can rework my program if necessary.

ilya n. 2009-06-23 04:29:12

I don't know that there is one true answer to this, as both of you helped me almost equally. To understand what you wrote I had to rewrite the whole thing in Ruby, though looking at it I did understand the basic premise. Put the advice and code together and I got a working result, without either I might not have made it. (I wish I could accept both as answers.)

The Wicked Flea 2009-06-23 13:27:16

Answer 8

A:

Try Gelatin. The syntax definition would probably be 5 lines or less.

knipknap 2010-01-19 22:59:43

Looks interesting, but I don't know how to program in python.

The Wicked Flea 2010-01-21 23:48:19

There's no need to program in Python, Gelatin can be used as any other command line tool (gel -s syntax.gel input.txt).

knipknap 2010-01-22 04:30:57

ansaurus

tags:

views:

answers:

Is there a clever way to parse plain-text lists into HTML?

Question: Is there a clever way to parse plain-text lists into HTML?

Assumptions

Sample Markup

Desired Output

In Summary

Basic iterative technique:

This Perl program is a first attempt at that.

related questions