I'd like to write a Greasemonkey script that finds the lines ending with a particular string ("copies.") and sorts those lines by the number preceding that string.

Unfortunately, the page I'm looking to modify does not use tables, just the br/ tag, so I assume that this will involve regex:

http://www.publishersweekly.com/article/CA6591208.html

(Lines without the matching string will just be ignored.)

Would be grateful for any tips to get me started.

A: 

It's not clear to me what you're trying to do. When posting questions here, I encourage you to post (part of) your actual data and clearly indicate what exactly you're trying to match.

But I am guessing you know very little regex, in which case, why use regex at all? If you study the topic a bit, you will soon see that regex is not some magical tool that produces whatever you're thinking of. Regex cannot sort anything. It simply matches text, that's all.

Have a look at this excellent on-line resource: http://www.regular-expressions.info/

And if, after reading, you think a regex solution to your problem is appropriate, feel free to elaborate on your question, and I'm sure I or someone else will be able to give you a hand.

Best of luck.

Bart Kiers
bibliwho
Ah, OK. I really got the impression you thought regex would be able to sort. To me, the description of the data wasn't clear, otherwise I wouldn't have asked for clarification. But it seems Peter did understand.
Bart Kiers
+2  A: 

Most times, HTML and RegEx do not go together, and when parsing HTML your first thought should not be RegEx.

However, in this situation, the markup looks simple enough that it should be okay - at least until Publishers Weekly changes how they do that page.

Here's a function that will extract the data, grab the appropriate lines, sort them, and put them back again:
($j is jQuery)

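function reorderPwList()
{
    var Container = $j('#article span.table');

    var TargetLines = /^.+?(\d+(?:,\d{3})*) copies\.<br ?\/?>$/gmi;

    var Lines = Container.html().match( TargetLines );

    // match() returns null when nothing matches; leave the page alone then.
    if ( ! Lines ) return;

    Lines.sort( sortPwCopies );

    Container.html( Lines.join('\n') );


    // Comparator: ascending by number of copies.
    function sortPwCopies( LineA , LineB )
    {
        return getCopyNum(LineA) - getCopyNum(LineB);
    }

    // Swap the line for its $1 capture, then strip everything but digits.
    function getCopyNum( Line )
    {
        return Number( Line.replace(TargetLines,'$1').replace(/\D/g,'') );
    }
}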


And an explanation of the regex used there:

^           # start of line
.+?         # lazy match one or more non-newline characters
(           # start capture group $1
  \d+       # match one or more digits (0-9)
  (?:       # non-capture group
    ,\d{3}  # comma, then three digits
  )*        # end group, repeat zero or more times
)           # end group $1
 copies\.   # literal text, with . escaped
<br ?\/?>   # match a br tag, with optional space or slash just in case
$           # end of line

(For readability, I've indented the groups - only the space before 'copies' and the one after 'br' are part of the actual pattern.)
The regex flags gmi are used, for global, multi-line mode, case-insensitive matching.
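
To see what the pattern actually picks up, here is a small sketch that runs it against a string instead of the page. The sample markup is made up (it is not copied from the real article), and the variable names other than TargetLines are just for illustration:

```javascript
// Same pattern as in reorderPwList above.
var TargetLines = /^.+?(\d+(?:,\d{3})*) copies\.<br ?\/?>$/gmi;

// Made-up sample markup, not copied from the real page.
var sampleHtml = [
    'Spring Fiction<br>',
    'First Title (Some Press) 20,000 copies.<br>',
    'Second Title (Other Press) 1,300,000 copies.<br />',
    'A line without a print run.<br>'
].join('\n');

// Only the two lines ending in "copies." come back; the others are ignored.
var matched = sampleHtml.match(TargetLines) || [];

// Swap each matched line for its $1 capture, then drop the commas.
var copyCounts = matched.map(function (line) {
    return Number(line.replace(TargetLines, '$1').replace(/,/g, ''));
});

console.log(matched.length);  // 2
console.log(copyCounts);      // [ 20000, 1300000 ]
```

Because the group allows any number of ",ddd" repeats, print runs of a million or more come out as a single number.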



<OLD ANSWER>

Once you've extracted just the text you want to look at (using DOM/jQuery), you can then pass it to the following function, which will put the relevant information into a format that can then be sorted:

function makeSortable(Text)
{
    // Mark sortable lines and put number before main content.
    // (.*?) is lazy so that ([\d,]+) gets the whole number, not just its last digit.
    Text = Text.replace
     ( /^(.*?)([\d,]+) copies\.<br \/>/gm
     , "SORT ME$2 $1"
     );

    // Remove anything not marked for sorting.
    Text = Text.replace( /^(?!SORT ME).*$/gm , '' );

    // Remove blank lines.
    Text = Text.replace( /\n{2,}/g , '\n' );

    // Remove sort token.
    Text = Text.replace( /SORT ME/g , '' );

    return Text;
}


You'll then need a sort function to ensure that the numbers are sorted correctly (by default, the standard JS array.sort method sorts as text, and will put 100,000 before 20,000).


Oh, and here's a quick explanation of the regexes used here:

/^(.*?)([\d,]+) copies\.<br \/>/gm

/.../gm    a regex with global-match and multi-line modes
^          matches start of line
(.*?)      capture to $1, any char (except newline), zero or more times, as few as possible
([\d,]+)   capture to $2, any digit or comma, one or more times
 copies    literal text
\.<br \/>  literal text, with . and / escaped (they would be special otherwise)


/^(?!SORT ME).*$/gm

/.../gm      again, enable global and multi-line
^            match start of line
(?!SORT ME)  a negative lookahead, fails the match if text 'SORT ME' is after it
.*           any char (except newline), zero or more times
$            end of line


/\n{2,}/g

\n{2,}    a newline character, two or more times
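
The sort function mentioned above isn't shown in this old answer. As a rough sketch of the difference (compareCopies is a hypothetical name, the values are made up, and it is applied to bare number strings rather than whole lines):

```javascript
// Made-up print-run strings.
var printRuns = ['100,000', '20,000', '1,300,000'];

// Default sort compares as text, character by character.
var asText = printRuns.slice().sort();
// [ '1,300,000', '100,000', '20,000' ]

// Hypothetical comparator: strip the commas and compare as numbers.
function compareCopies(a, b) {
    return Number(a.replace(/,/g, '')) - Number(b.replace(/,/g, ''));
}

var asNumbers = printRuns.slice().sort(compareCopies);
// [ '20,000', '100,000', '1,300,000' ]
```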

</OLD ANSWER>

Peter Boughton
I'm not convinced this is the best solution; however, +1 for the fine explanation of regexp
Justin Johnson
bibliwho
Justin, I agree with that - I appear to have done it backwards :/ stereofrog's approach of just picking out the appropriate lines is a more sensible one (though his formatting is terribly icky).
Peter Boughton
bibliwho
If it doesn't jump out at you, I'm not sure I can explain it, it's all just... *all over the place*. :S What I'll do is update my answer with how I would probably write it, which is hopefully easier to follow.
Peter Boughton
+1  A: 

You can start with something like this (just copy and paste it into the Firebug console):

 // where are the things
 var elem = document.getElementById("article").
  getElementsByTagName("span")[1].
  getElementsByTagName("span")[0];

 // extract lines into array
 var lines = [];
 elem.innerHTML.replace(/.+?\d+\s+copies\.\s*<br>/g,
    function($0) { lines.push($0); });

 // sort the array by number of copies

 // earlier version, which only handles numbers with a single comma:
 //   lines.sort(function(a, b) {
 //      var ma = a.match(/(\d+),(\d+)\s+copies/);
 //      var mb = b.match(/(\d+),(\d+)\s+copies/);
 //
 //      return parseInt(ma[1] + ma[2]) -
 //             parseInt(mb[1] + mb[2]);
 //   });

 lines.sort(function(a, b) {
    function getNum(p) {
       return parseInt(
          p.match(/([\d,]+)\s+copies/)[1].replace(/,/g, ""), 10);
    }
    return getNum(a) - getNum(b);
 });

 // put it back
 elem.innerHTML = lines.join("");
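
If you want to try the idea without the page, the same steps can be wrapped in a function that takes a string. This is only a sketch: sortLinesByCopies and its descending flag are hypothetical names, and the sample markup is made up, with one entry per line.

```javascript
// Hypothetical helper: same steps as above, but it takes a string
// instead of elem.innerHTML so it can run without a page.
function sortLinesByCopies(html, descending) {
    // keep only the lines ending in "<number> copies.<br>"
    var lines = html.match(/.+?\d+\s+copies\.\s*<br>/g) || [];

    function getNum(p) {
        return parseInt(
            p.match(/([\d,]+)\s+copies/)[1].replace(/,/g, ""), 10);
    }

    lines.sort(function (a, b) {
        return descending ? getNum(b) - getNum(a) : getNum(a) - getNum(b);
    });

    return lines.join("");
}

// Made-up sample markup, one entry per line; not taken from the real page.
var sample = [
    "Some heading<br>",
    "Book A (Press One) 100,000 copies.<br>",
    "Book B (Press Two) 20,000 copies.<br>",
    "Book C (Press Three) 1,300,000 copies.<br>"
].join("\n");

var ascending = sortLinesByCopies(sample, false);
var descendingResult = sortLinesByCopies(sample, true);
```

With the flag set, the largest print run (Book C here) comes first, and the heading line is dropped because it does not match.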
stereofrog
Wow, extremely slick. Added an extra line -- lines.reverse() -- to get the titles in descending order. Only problem I noticed is that print runs over 999,999 (e.g., "1,300,000 copies") are not handled correctly.
bibliwho
oh, you're right. post edited
stereofrog