
Long story coming up, but I'll try to keep it brief. I have many plain-text paragraphs which I extract from a system and re-output in wiki format, so that copying said data is less of an arduous task. This all goes really well, except that no references are generated automatically for the 'topics' we have pages for; those currently have to be added by reading through all the text and manually changing Topic to [[Topic]].

First requirement: each topic is to be made clickable only once, at its first occurrence. Otherwise it would become a really spammy linkfest, which would detract from readability.

Second requirement: overlapping topic names (topics that start with the same words) should be handled in such a way that the most 'precise' topic gets the link, and at later occurrences the less precise topics do not get linked, since they're likely not correct.

Example:

topics = { "Project", "Mary", "Mr. Moore", "Project Omega"}
input = "Mary and Mr. Moore work together on Project Omega. Mr. Moore hates both Mary and Project Omega, but Mary simply loves the Project."
output = function_to_be_written(input)
-- "[[Mary]] and [[Mr. Moore]] work together on [[Project Omega]]. Mr. Moore hates both Mary and Project Omega, but Mary simply loves the [[Project]]."

Now, I quickly figured out that a string.gsub() call, simple or complicated, could not get me what I need to satisfy the second requirement, as it provides no way to say 'consider this match as if it did not happen; I want you to backtrack further'. I need the engine to do something akin to:

input = "abc def ghi"
-- Looping over the input would, in this order, match the following strings:
-- 1) abc def ghi
-- 2) abc def
-- 3) abc
-- 4) def ghi
-- 5) def
-- 6) ghi

Once a string matches an actual topic and has not been replaced before by its wikified version, it is replaced. If this topic has been replaced by a wikified version before, don't replace, but simply continue the matching at the end of the topic. (So for a topic "abc def", it would test "ghi" next in both cases.)
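To make the gsub limitation concrete, here is a minimal sketch of mine (topic names taken from the example above) showing why a single gsub pass cannot satisfy both requirements:

```lua
local input = "Mary simply loves Project Omega."

-- gsub has no backtracking between alternatives: replacing the short
-- topic first clobbers the longer topic "Project Omega"...
print((input:gsub("Project", "[[%0]]")))
--> Mary simply loves [[Project]] Omega.

-- ...and even replacing longest-first, gsub links *every* occurrence,
-- so the link-only-once rule still needs extra bookkeeping on top.
print((input:gsub("Project Omega", "[[%0]]")))
--> Mary simply loves [[Project Omega]].
```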

Thus I arrive at LPeg. I have read up on it, played with it, but it is considerably complex, and while I think I need to use lpeg.Cmt and lpeg.Cs somehow, I am unable to mix the two properly to make what I want to do work. I am refraining from posting my practice attempts as they are of miserable quality and probably more likely to confuse anyone than assist in clarifying my problem.

(Why do I want to use a PEG instead of writing a triple-nested loop myself? Because I don't want to, and because it is a great excuse to learn PEGs... except that I am in over my head a bit. Unless it is not possible with LPeg, the loop is not an option.)

+1  A: 

So why don't you use string.find? It searches only for the first occurrence of a topic and gives you its starting index and length. All you have to do is wrap the result in '[[' and ']]'. For each chunk, copy the topics table, and once a topic's first occurrence has been found, remove it from the copy. Sort the topics by length, longest first, so that the most precise topic is found first.
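A rough, untested sketch of that approach; the span-claiming bookkeeping is my own addition to also handle the overlapping-topics requirement:

```lua
-- Link each topic's first free occurrence; longer topics claim their
-- spans first, so shorter overlapping topics cannot link inside them.
local function wikify(input, topics)
    local sorted = {}
    for i, t in ipairs(topics) do sorted[i] = t end
    table.sort(sorted, function(a, b) return #a > #b end) -- longest first

    local taken = {} -- taken[pos] = true for characters already claimed
    local links = {} -- links[startpos] = topic to wikify at that position

    for _, topic in ipairs(sorted) do
        local init, linked = 1, false
        while true do
            -- plain find (4th argument true), so the dot in "Mr. Moore"
            -- is a literal dot, not a pattern item
            local s, e = input:find(topic, init, true)
            if not s then break end
            local free = true
            for p = s, e do
                if taken[p] then free = false end
            end
            if free then
                -- claim the span even when not linking, so a later
                -- occurrence doesn't get linked as a shorter topic
                for p = s, e do taken[p] = true end
                if not linked then
                    links[s] = topic
                    linked = true
                end
            end
            init = s + 1
        end
    end

    local out, i = {}, 1
    while i <= #input do
        if links[i] then
            out[#out + 1] = "[[" .. links[i] .. "]]"
            i = i + #links[i]
        else
            out[#out + 1] = input:sub(i, i)
            i = i + 1
        end
    end
    return table.concat(out)
end

print(wikify(
    "Mary and Mr. Moore work together on Project Omega. " ..
    "Mr. Moore hates both Mary and Project Omega, but Mary simply loves the Project.",
    { "Project", "Mary", "Mr. Moore", "Project Omega" }))
```

Like the original suggestion, this sketch does not check word boundaries (it would happily link inside "Projectaaa"), which the LPeg answer below does handle.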

LPeg is a good tool, but it's not necessary to use it here.

phil pirozhkov
Yeah, I ended up using an ordinary bunch of loops. Point is that I want to learn LPeg, and from what I understand, it is capable of matching a wider variety of patterns than simple regular expressions can. Thus, when I found something regular expressions could not do easily, I figured this was my chance to use LPeg on an actual problem rather than just messing with it for the hell of it. :)
Stigma
I know an interesting project where you can help out and learn LPEG at once: http://github.com/tarcieri/reia has an incomplete LPEG syntax and a complete BNF one (http://github.com/tarcieri/reia/blob/neotoma/src/compiler, http://github.com/pirj/ryan/tree/master/src/retem/). I had some whitespace-related issues with LPEG's implementation in Erlang, Neotoma. Hope all that stuff will be interesting to you.
phil pirozhkov
+1  A: 

So... I got bored and needed something to do:

topics = { "Project", "Mary", "Mr. Moore", "Project Omega"}

pcall ( require , 'luarocks.require' )
require 'lpeg'
local locale = lpeg.locale ( )
local endofstring = -lpeg.P(1)
local endoftoken = (locale.space+locale.punct)^1

table.sort ( topics , function ( a , b ) return #a > #b end ) -- Sort by word length (longest first)
local topicpattern = lpeg.P ( false )
for i = 1, #topics do
    topicpattern = topicpattern + topics [ i ]
end

function wikify ( input )
    local topicsleft = { }
    for i = 1 , #topics do
        topicsleft [ topics [ i ] ] = true
    end

    local makelink = function ( topic )
        if topicsleft [ topic ] then
            topicsleft [ topic ] = nil
            return "[[" .. topic .. "]]"
        else
            return topic
        end
    end

    local patt = lpeg.Ct ( 
        (
            lpeg.Cs ( ( topicpattern / makelink ) ) * #(-locale.alnum+endofstring) -- Match topics followed by something that's not alphanumeric
            + lpeg.C ( ( lpeg.P ( 1 ) - endoftoken )^0 * endoftoken ) -- Skip tokens that aren't topics
        )^0 * endofstring -- Repeat ad infinitum until end of string
    )
    return table.concat ( patt:match ( input ) )
end

print('"'..wikify("Mary and Mr. Moore work together on Project Omega. Mr. Moore hates both Mary and Project Omega, but Mary simply loves the Project.")..'"')
print('"'..wikify("Mary and Mr. Moore work on Project Omegality. Mr. Moore hates Mary and Project Omega, but Mary loves the Projectaaa.")..'"')

I start off by making a pattern which matches all the different topics; we want to match the longest topics first, so the table is sorted by word length, longest to shortest. Then we build a set of the topics we haven't yet seen in the current input. makelink quotes/links the topic if we haven't seen it already; otherwise it leaves it be.

Now for the actual lpeg stuff:

  • lpeg.Ct packs all our captures into a table (to be concated together for output)
  • topicpattern / makelink captures a topic and passes it through our makelink function.
  • lpeg.Cs substitutes the result of makelink back in where the match of the topic was.
  • + lpeg.C ( ( lpeg.P ( 1 ) - endoftoken )^0 * endoftoken ) if we didn't match a topic, skip a token (that is, non-space/non-punctuation characters followed by spaces or punctuation)
  • ^0 repeat.
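As a smaller, self-contained illustration of how lpeg.Cs splices function results back into the subject (a toy example of mine, not part of the answer above):

```lua
local lpeg = require "lpeg"

-- In a substitution capture (lpeg.Cs), the value returned by the "/"
-- function replaces the matched text; the bare P(1) alternative lets
-- every other character pass through unchanged.
local word = lpeg.R("az", "AZ")^1
local shout = lpeg.Cs(((word / string.upper) + lpeg.P(1))^0)

print(shout:match("hello, world"))
--> HELLO, WORLD
```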

Hope that's what you wanted :)

Daurn

Note: Edited code, description no longer correct

daurnimator
Awesome effort, I'm learning a lot. Also, it is case-sensitive (which I admit I didn't specifically mention or test in the example), but I think that is easily fixed by tracking all topics in their lowercased form, although I think lpeg.P() might disagree with me. Bigger issues: it likes to match partial words, and it drops the data after the last match when I run your code. For example, try this (modified) test: print(wikify("Mary and Mr. Moore work on Project Omegality. Mr. Moore hates Mary and Project Omega, but Mary loves the Projectaaa.")) It shows both issues.
Stigma
Fixed/Changed :)
daurnimator
This is awesome. A real gem. Thank you for your efforts - I learned a lot from you. If you still have the chance, could you explain how one would, for example, match topics case-insensitively? I am imagining pre-processing each topic into an lpeg form of the regex '[Mm][Aa][Rr][Yy]', but somehow I think that is the wrong way to take it. -- Either way, I am accepting this answer, since I have learned plenty from it and it does everything I asked. Thanks! :)
Stigma
The easiest way to get case insensitivity would be to lowercase everything: the list of topics, and use input:lower() in your call to :match().
daurnimator
Right. But I thought lpeg.P() matched case-sensitively? I can't lowercase the input string (for the obvious reason that the end output needs to keep its normal casing), meaning it could never match topic 'test' to the word 'TEST' in the actual data? Or have I got my understanding of lpeg.P() wrong somehow?
Stigma
Hmmm, yeah, I didn't think of that. You'll have to either change the topicpattern generator, or do a match-time capture that checks back with the original string. Here is an example of the former, but I don't think it's the best solution (untested). Change: topicpattern = topicpattern + topics [ i ] to: local topic = lpeg.P(true) for char in topics [ i ]:lower():gmatch(".") do topic = topic * ( lpeg.P(char) + lpeg.P(char:upper()) ) end topicpattern = topicpattern + topic
daurnimator
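For completeness, the comment's idea can be wrapped into a small helper. This is my own untested elaboration of the suggestion, not daurnimator's code:

```lua
local lpeg = require "lpeg"

-- Build an LPeg pattern that matches the string `s` case-insensitively
-- by alternating the lower- and upper-case variant of every character.
-- (For non-letters, lower == upper, so the choice is harmless.)
local function caseless(s)
    local patt = lpeg.P(true)
    for char in s:lower():gmatch(".") do
        patt = patt * (lpeg.P(char) + lpeg.P(char:upper()))
    end
    return patt
end

local topics = { "Project", "Mary", "Mr. Moore", "Project Omega" }
table.sort(topics, function(a, b) return #a > #b end) -- longest first

local topicpattern = lpeg.P(false)
for i = 1, #topics do
    topicpattern = topicpattern + caseless(topics[i])
end

-- The match still consumes the subject's original text, so a function
-- capture such as makelink would receive e.g. "PROJECT omega" with its
-- casing intact; only the seen-topics lookup would need lowercased keys.
print(topicpattern:match("PROJECT omega"))
--> 14
```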