
Long story coming up, but I'll try to keep it brief. I have many plain-text paragraphs which I extract from a system and re-output in wiki format, so that copying said data is less of an arduous task. This all goes really well, except that no references are generated automatically for the 'topics' we have pages for; those currently have to be added by reading through all the text and manually changing Topic to [[Topic]].

First requirement: each topic is to be made clickable only once, at its first occurrence. Otherwise it would become a really spammy linkfest, which would detract from readability.

Second requirement: overlapping topic names (topics that start with the same words) should be handled in such a way that the most 'precise' topic gets the link, and at later occurrences the less precise topics do not get linked, since they're likely not correct.

Example:

topics = { "Project", "Mary", "Mr. Moore", "Project Omega"}
input = "Mary and Mr. Moore work together on Project Omega. Mr. Moore hates both Mary and Project Omega, but Mary simply loves the Project."
output = function_to_be_written(input)
-- "[[Mary]] and [[Mr. Moore]] work together on [[Project Omega]]. Mr. Moore hates both Mary and Project Omega, but Mary simply loves the [[Project]]."

Now, I quickly figured out that a string.gsub() call, simple or complicated, could not get me what I need to satisfy the second requirement, as it provides no way to say 'consider this match as if it did not happen; I want you to backtrack further'. I need the engine to do something akin to:

input = "abc def ghi"
-- Looping over the input would, in this order, match the following strings:
-- 1) abc def ghi
-- 2) abc def
-- 3) abc
-- 4) def ghi
-- 5) def
-- 6) ghi

Once a string matches an actual topic and has not been replaced before by its wikified version, it is replaced. If this topic has been replaced by a wikified version before, don't replace, but simply continue the matching at the end of the topic. (So for a topic "abc def", it would test "ghi" next in both cases.)
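To make the gsub limitation concrete, here is a minimal sketch of mine (topic names taken from the example above) showing why a single gsub pass cannot satisfy both requirements:

```lua
local input = "Mary simply loves Project Omega."

-- gsub has no backtracking between alternatives: replacing the short
-- topic first clobbers the longer topic "Project Omega"...
print((input:gsub("Project", "[[%0]]")))
--> Mary simply loves [[Project]] Omega.

-- ...and even replacing longest-first, gsub links *every* occurrence,
-- so the link-only-once rule still needs extra bookkeeping on top.
print((input:gsub("Project Omega", "[[%0]]")))
--> Mary simply loves [[Project Omega]].
```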

Thus I arrive at LPeg. I have read up on it, played with it, but it is considerably complex, and while I think I need to use lpeg.Cmt and lpeg.Cs somehow, I am unable to mix the two properly to make what I want to do work. I am refraining from posting my practice attempts as they are of miserable quality and probably more likely to confuse anyone than assist in clarifying my problem.

(Why do I want to use a PEG instead of writing a triple-nested loop myself? Because I don't want to, and because it is a great excuse to learn PEGs... except that I am in over my head a bit. Unless it is not possible with LPeg, the loop is not an option.)

+1  A: 

So why don't you use string.find? It searches only for the first occurrence of a topic and gives you its starting index and length. All you have to do is wrap the result in '[[' and ']]'. For each chunk, copy the topics table, and once a topic's first occurrence has been found, remove it from the copy. Sort the topics by length, longest first, so that the most precise topic is found first.
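A rough, untested sketch of that approach; the span-claiming bookkeeping is my own addition to also handle the overlapping-topics requirement:

```lua
-- Link each topic's first free occurrence; longer topics claim their
-- spans first, so shorter overlapping topics cannot link inside them.
local function wikify(input, topics)
    local sorted = {}
    for i, t in ipairs(topics) do sorted[i] = t end
    table.sort(sorted, function(a, b) return #a > #b end) -- longest first

    local taken = {} -- taken[pos] = true for characters already claimed
    local links = {} -- links[startpos] = topic to wikify at that position

    for _, topic in ipairs(sorted) do
        local init, linked = 1, false
        while true do
            -- plain find (4th argument true), so the dot in "Mr. Moore"
            -- is a literal dot, not a pattern item
            local s, e = input:find(topic, init, true)
            if not s then break end
            local free = true
            for p = s, e do
                if taken[p] then free = false end
            end
            if free then
                -- claim the span even when not linking, so a later
                -- occurrence doesn't get linked as a shorter topic
                for p = s, e do taken[p] = true end
                if not linked then
                    links[s] = topic
                    linked = true
                end
            end
            init = s + 1
        end
    end

    local out, i = {}, 1
    while i <= #input do
        if links[i] then
            out[#out + 1] = "[[" .. links[i] .. "]]"
            i = i + #links[i]
        else
            out[#out + 1] = input:sub(i, i)
            i = i + 1
        end
    end
    return table.concat(out)
end

print(wikify(
    "Mary and Mr. Moore work together on Project Omega. " ..
    "Mr. Moore hates both Mary and Project Omega, but Mary simply loves the Project.",
    { "Project", "Mary", "Mr. Moore", "Project Omega" }))
```

Like the original suggestion, this sketch does not check word boundaries (it would happily link inside "Projectaaa"), which the LPeg answer below does handle.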

LPeg is a good tool, but it's not necessary to use it here.

phil pirozhkov
Yeah, I ended up using an ordinary bunch of loops. Point is that I want to learn LPeg, and from what I understand, it is capable of matching a wider variety of patterns than simple regular expressions can. Thus, when I found something regular expressions could not do easily, I figured this was my chance to use LPeg on an actual problem rather than just messing with it for the hell of it. :)
Stigma
I know an interesting project where you can help out and learn LPEG at once: http://github.com/tarcieri/reia has an incomplete LPEG syntax and a complete BNF one (http://github.com/tarcieri/reia/blob/neotoma/src/compiler, http://github.com/pirj/ryan/tree/master/src/retem/). I had some whitespace-related issues with LPEG's implementation in Erlang, Neotoma. Hope all that stuff will be interesting to you.
phil pirozhkov
+1  A: 

So... I got bored and needed something to do:

topics = { "Project", "Mary", "Mr. Moore", "Project Omega"}

pcall ( require , 'luarocks.require' )
require 'lpeg'
local locale = lpeg.locale ( )
local endofstring = -lpeg.P(1)
local endoftoken = (locale.space+locale.punct)^1

table.sort ( topics , function ( a , b ) return #a > #b end ) -- Sort by word length (longest first)
local topicpattern = lpeg.P ( false )
for i = 1, #topics do
    topicpattern = topicpattern + topics [ i ]
end

function wikify ( input )
    local topicsleft = { }
    for i = 1 , #topics do
        topicsleft [ topics [ i ] ] = true
    end

    local makelink = function ( topic )
        if topicsleft [ topic ] then
            topicsleft [ topic ] = nil
            return "[[" .. topic .. "]]"
        else
            return topic
        end
    end

    local patt = lpeg.Ct ( 
        (
            lpeg.Cs ( ( topicpattern / makelink ) ) * #(-locale.alnum+endofstring) -- Match topics followed by something that's not alphanumeric
            + lpeg.C ( ( lpeg.P ( 1 ) - endoftoken )^0 * endoftoken ) -- Skip tokens that aren't topics
        )^0 * endofstring -- Repeat ad infinitum until end of string
    )
    return table.concat ( patt:match ( input ) )
end

print('"'..wikify("Mary and Mr. Moore work together on Project Omega. Mr. Moore hates both Mary and Project Omega, but Mary simply loves the Project.")..'"')
print('"'..wikify("Mary and Mr. Moore work on Project Omegality. Mr. Moore hates Mary and Project Omega, but Mary loves the Projectaaa.")..'"')

I start off by making a pattern which matches all the different topics; we want to match the longest topics first, so the table is sorted by word length, longest to shortest. Then we build a set of the topics we haven't yet seen in the current input. makelink quotes/links the topic if we haven't seen it already; otherwise it leaves it be.

Now for the actual lpeg stuff:

  • lpeg.Ct packs all our captures into a table (to be concated together for output)
  • topicpattern / makelink captures a topic and passes it through our makelink function.
  • lpeg.Cs substitutes the result of makelink back in where the match of the topic was.
  • + lpeg.C ( ( lpeg.P ( 1 ) - endoftoken )^0 * endoftoken ) if we didn't match a topic, skip a token (that is, non-space/non-punctuation characters followed by spaces or punctuation)
  • ^0 repeat.
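As a smaller, self-contained illustration of how lpeg.Cs splices function results back into the subject (a toy example of mine, not part of the answer above):

```lua
local lpeg = require "lpeg"

-- In a substitution capture (lpeg.Cs), the value returned by the "/"
-- function replaces the matched text; the bare P(1) alternative lets
-- every other character pass through unchanged.
local word = lpeg.R("az", "AZ")^1
local shout = lpeg.Cs(((word / string.upper) + lpeg.P(1))^0)

print(shout:match("hello, world"))
--> HELLO, WORLD
```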

Hope that's what you wanted :)

Daurn

Note: Edited code, description no longer correct

daurnimator
Awesome effort, I'm learning a lot. Also, it is case-sensitive (which I admit I didn't specifically mention or test in the example), but I think that is easily fixed by tracking all topics in their lowercased form, although I think lpeg.P() might disagree with me. Bigger issues: it likes to match partial words, and it drops the data after the last match when I run your code. For example, try this (modified) test: print(wikify("Mary and Mr. Moore work on Project Omegality. Mr. Moore hates Mary and Project Omega, but Mary loves the Projectaaa.")) It shows both issues.
Stigma
Fixed/Changed :)
daurnimator
This is awesome. A real gem. Thank you for your efforts - I learned a lot from you. If you still have the chance, could you explain how one would, for example, match topics case-insensitively? I am imagining pre-processing each topic into an lpeg form of the regex '[Mm][Aa][Rr][Yy]', but somehow I think that is the wrong way to take it. -- Either way, I am accepting this answer, since I have learned plenty from it and it does everything I asked. Thanks! :)
Stigma
The easiest way to get case insensitivity would be to lowercase everything: the list of topics, and use input:lower() in your call to :match().
daurnimator
Right. But I thought lpeg.P() matched case-sensitively? I can't lowercase the input string (for the obvious reason that the end output needs to keep its normal casing), meaning it could never match topic 'test' to the word 'TEST' in the actual data? Or have I got my understanding of lpeg.P() wrong somehow?
Stigma
Hmmm, yeah, I didn't think of that. You'll have to either change the topicpattern generator, or do a match-time capture that checks back with the original string. Here is an example of the former, but I don't think it's the best solution (untested). Change: topicpattern = topicpattern + topics [ i ] to: local topic = lpeg.P(true) for char in topics [ i ]:lower():gmatch(".") do topic = topic * ( lpeg.P(char) + lpeg.P(char:upper()) ) end topicpattern = topicpattern + topic
daurnimator
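For completeness, the comment's idea can be wrapped into a small helper. This is my own untested elaboration of the suggestion, not daurnimator's code:

```lua
local lpeg = require "lpeg"

-- Build an LPeg pattern that matches the string `s` case-insensitively
-- by alternating the lower- and upper-case variant of every character.
-- (For non-letters, lower == upper, so the choice is harmless.)
local function caseless(s)
    local patt = lpeg.P(true)
    for char in s:lower():gmatch(".") do
        patt = patt * (lpeg.P(char) + lpeg.P(char:upper()))
    end
    return patt
end

local topics = { "Project", "Mary", "Mr. Moore", "Project Omega" }
table.sort(topics, function(a, b) return #a > #b end) -- longest first

local topicpattern = lpeg.P(false)
for i = 1, #topics do
    topicpattern = topicpattern + caseless(topics[i])
end

-- The match still consumes the subject's original text, so a function
-- capture such as makelink would receive e.g. "PROJECT omega" with its
-- casing intact; only the seen-topics lookup would need lowercased keys.
print(topicpattern:match("PROJECT omega"))
--> 14
```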