In the following code, if the string s is extended to something like 10 or 20 thousand characters, the Mathematica kernel segfaults.

s = "This is the first line.
MAGIC_STRING
Everything after this line should get removed.
12345678901234567890123456789012345678901234567890123456789012345678901234567890
12345678901234567890123456789012345678901234567890123456789012345678901234567890
12345678901234567890123456789012345678901234567890123456789012345678901234567890
12345678901234567890123456789012345678901234567890123456789012345678901234567890
12345678901234567890123456789012345678901234567890123456789012345678901234567890
...";

s = StringReplace[s, RegularExpression@"(^|\\n)[^\\n]*MAGIC_STRING(.|\\n)*"->""]

I think this is primarily Mathematica's fault and I've submitted a bug report and will follow up here if I get a response. But I'm also wondering if I'm doing this in a stupid/inefficient way. And even if not, ideas for working around Mathematica's bug would be appreciated.
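
For anyone trying to reproduce this, a string of the size described can be built programmatically instead of pasted. This is only a sketch: the names digits, header and longS are made up here, and 200 filler lines is just one way to get past 15,000 characters. On the affected versions reported in this thread, running the StringReplace above on such a string may take the kernel down, so save your work first.

```mathematica
(* One 80-character filler line, like the digit lines in the example above *)
digits = StringJoin[Table["1234567890", {8}]];

(* The three meaningful lines, each ending in a newline *)
header = "This is the first line.\nMAGIC_STRING\nEverything after this line should get removed.\n";

(* Append 200 filler lines; the result is a bit over 16,000 characters *)
longS = header <> StringJoin[Table[digits <> "\n", {200}]];

StringLength[longS]
```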

+4  A: 

The best way to optimize the regex depends on the internals of Mathematica's regex engine, but I would definitely get rid of the (.|\\n)*, as @Simon mentioned. It's not just the alternation--although it's almost always a mistake to have an alternation in which every alternative matches exactly one character; that's what character classes are for. But you're also capturing each character when you match it (because of the parentheses), only to throw it away when you match the next character.

A quick scan of the Mathematica regex docs doesn't turn up anything like the /s (Singleline or DOTALL) modifier, so I recommend the old JavaScript standby, [\\s\\S]* -- match anything that is whitespace or anything that isn't whitespace. Also, it might help to add the $ anchor to the end of the regex:

"(^|\\n)[^\\n]*MAGIC_STRING[\\s\\S]*$"

But your best option would probably be not to use regexes at all. I don't see anything here that requires them, and it would probably be much easier as well as more efficient to use Mathematica's normal string-manipulation functions.
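
One way to do this without a regex is to split the string into lines, keep the lines up to (but not including) the first one that contains the marker, and join them back together. A minimal sketch, assuming lines are separated by "\n"; truncateAtMarker is a name invented here, not a built-in:

```mathematica
(* Hypothetical helper: drop the first line containing marker and everything after it *)
truncateAtMarker[s_String, marker_String] :=
  StringJoin[
    Riffle[
      (* All keeps empty lines at the start and end, so a string without the marker comes back unchanged *)
      TakeWhile[StringSplit[s, "\n", All], StringFreeQ[#, marker] &],
      "\n"]]

truncateAtMarker[
  "This is the first line.\nMAGIC_STRING\nEverything after this line should get removed.",
  "MAGIC_STRING"]
(* "This is the first line." *)
```

Because the work is a single pass over a list of lines, there is no backtracking involved, whatever the length of the input.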

Alan Moore
That was extremely edifying. Thank you!
dreeves
+6  A: 

Mathematica uses PCRE syntax, so it does have the /s (aka DOTALL, aka Singleline) modifier: put (?s) before the part of the expression to which you want it to apply.

See the RegularExpression documentation (expand the section labeled "More Information"):
http://reference.wolfram.com/mathematica/ref/RegularExpression.html

The following set options for all regular expression elements that follow them:
(?i) treat uppercase and lowercase as equivalent (ignore case)
(?m) make ^ and $ match start and end of lines (multiline mode)
(?s) allow . to match newline
(?-c) unset options
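
A small sketch of what two of these switches do, using throwaway test strings that are not from the question:

```mathematica
(* Without (?s), . stops at a newline, so the whole string does not match *)
StringMatchQ["a\nb", RegularExpression["a.b"]]      (* False *)

(* With (?s), . also matches the newline *)
StringMatchQ["a\nb", RegularExpression["(?s)a.b"]]  (* True *)

(* Without (?m), ^ matches only at the start of the string *)
StringCases["one\ntwo", RegularExpression["^\\w+"]]      (* {"one"} *)

(* With (?m), ^ also matches at the start of each line *)
StringCases["one\ntwo", RegularExpression["(?m)^\\w+"]]  (* {"one", "two"} *)
```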

This modified input doesn't crash Mathematica 7.0.1 for me (the original did), using a string that is 15,000 characters long, producing the same output as your expression:

s = StringReplace[s,RegularExpression@".*MAGIC_STRING(?s).*"->""]

It should also be a bit faster, for the reasons @AlanMoore explained.
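
One detail worth checking on your own input: because . cannot match the newline in front of the MAGIC_STRING line, the pattern above appears to start its match at the beginning of that line and so leaves the preceding newline in place, whereas the original (^|\\n) form consumes it. In notebook output the difference is easy to miss. A sketch of the comparison on a short string; the variable name short and the optional \\n? prefix in the second call are additions of this note, not part of the answer:

```mathematica
short = "This is the first line.\nMAGIC_STRING\nEverything after this line should get removed.";

(* Pattern from this answer: the match starts at the beginning of the MAGIC_STRING line *)
StringReplace[short, RegularExpression[".*MAGIC_STRING(?s).*"] -> ""] // InputForm
(* "This is the first line.\n" *)

(* Variant with an optional leading newline, so the line break is removed as well *)
StringReplace[short, RegularExpression["\\n?.*MAGIC_STRING(?s).*"] -> ""] // InputForm
(* "This is the first line." *)
```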

Michael Pilat
Guess I scanned a little *too* quickly. :-/ Have you tested that regex with `(?m)^` on the front? Seems like that should speed it up a bit more.
Alan Moore
A: 

Mathematica is a great executive toy, but I'd advise against trying to do anything serious with it, like regexes over long strings or any kind of computation over significant amounts of data (or where correctness is important). Use something tried and tested. Visual F# 2010 takes 5 milliseconds and one line of code to get the correct answer without crashing:

> let str =
    "This is the first line.\nMAGIC_STRING\nEverything after this line should get removed." +
      String.replicate 2000 "0123456789";;
val str : string =
  "This is the first line.
MAGIC_STRING
Everything after this li"+[20022 chars]

> open System.Text.RegularExpressions;;
> #time;;
--> Timing now on

> (Regex "(^|\\n)[^\\n]*MAGIC_STRING(.|\\n)*").Replace(str, "");;
Real: 00:00:00.005, CPU: 00:00:00.015, GC gen0: 0, gen1: 0, gen2: 0
val it : string = "This is the first line."
Jon Harrop
"or where correctness is important": Mathematica is all about getting correct answers. This is a serious and unfounded accusation.
gdelfino
Thanks for replicating this in another language; that's useful. (I tend to agree with gdelfino about your preamble though.)
dreeves
@gdelfino: Just look at the ridiculous bugs in major versions of Mathematica they ship. For example, in Mathematica 7.0.0 virtually every Fourier transform done by the `Fourier` function gave the wrong answer because it dropped imaginary components. http://flyingfrogblog.blogspot.com/2009/05/mathematica-bug-afflicting-our-product.html
Jon Harrop
@dreeves: Well, I have been using Mathematica for over 10 years and it is great fun but every single time I have tried to do any serious work with it, Mathematica got all the answers wrong. When I did my PhD, I had four major symbolic derivations and Mathematica (v4) gave wrong answers for all of them as well as having a serious bug in `ListConvolve` that silently corrupted all of my numerical computations. The idea that "Mathematica is all about getting correct answers" is absurd.
Jon Harrop
FWIW, I've been playing with Mathematica 7 for a few hours, uncovered bugs everywhere and documented them here: http://flyingfrogblog.blogspot.com/2010/07/half-dozen-bugs-in-mathematica-7.html
Jon Harrop