Finding Line Beginning using Regular expression in Notepad++

(Sorry if this is a newbie question)

I want to strip all the jQuery "done" attributes from a 4000-line HTML file, e.g.:

<DIV class=menu done27="1" done26="0"
done9="1" done8="0" done7="1"
done6="0" done4="20">

should be replaced with:

<DIV class=menu>

In http://www.zytrax.com/tech/web/regex.htm#experiment I can do it with RE:

[ ^]done[0-9]+="[0-9]+"

but in Notepad++ 5.6.8 UNICODE, with a .HTM file encoded in ANSI, it does not work: under Search > Find with Search mode = Regular expression, putting this RE in the "Find what" field finds only the 5 occurrences that start with a space and misses the 2 occurrences that start at the beginning of a line. IOW, the caret for line beginning, or alternating it with a space, fails. How do I do this? TIA,
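
The 5-versus-2 split described above can be reproduced outside Notepad++. The following is only a sketch in Python's `re` module, not a statement about the Notepad++ engine: in Python, a caret that is not the first character inside `[...]` is a literal `^`, so `[ ^]` means "a space or a caret" rather than "a space or the start of a line". The variable names are made up for this illustration.

```python
import re

# The sample tag from the question, spread over three lines.
html = (
    '<DIV class=menu done27="1" done26="0"\n'
    'done9="1" done8="0" done7="1"\n'
    'done6="0" done4="20">\n'
)

# Inside [...] a caret that is not first is a literal '^' in Python,
# so this only finds attributes preceded by a space (or a caret).
bracket_hits = re.findall(r'[ ^]done[0-9]+="[0-9]+"', html)

# Alternation outside the class expresses "line start or a space";
# re.MULTILINE lets ^ match after every newline.
anchored_hits = re.findall(
    r'(?:^| )done[0-9]+="[0-9]+"', html, flags=re.MULTILINE
)

print(len(bracket_hits), len(anchored_hits))
```

Whether a given Notepad++ version accepts the alternation form is a separate question; the sketch only shows what the two patterns mean in an engine that does.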

Versailles, Wed 21 Apr 2010 10:42:20 +0200

A: 

I'm afraid Notepad++ regex cannot do that.

Notepad++ uses the Scintilla regex engine, which works line by line, so a multiline search/replace cannot be done.

Note that \r and \n are never matched because in Scintilla, regular expression searches are made line per line (stripped of end-of-line chars).

Quoted from http://www.scintilla.org/SciTERegEx.html
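
To picture what "line per line" means, here is a rough Python emulation. It is not Scintilla's code, and the helper name `search_line_by_line` is invented for this sketch. The pattern is applied to each line separately after the end-of-line characters have been stripped, so a pattern containing `\r` or `\n` has nothing to match against.

```python
import re

# Two lines of sample text with CR+LF line endings.
text = 'class=menu done27="1"\r\ndone9="1">\r\n'

def search_line_by_line(pattern, text):
    """Apply pattern to each line on its own, end-of-line chars removed."""
    hits = []
    for line in text.splitlines():  # splitlines() drops \r\n, \n and \r
        hits.extend(m.group(0) for m in re.finditer(pattern, line))
    return hits

# A line break is never visible to the pattern...
print(search_line_by_line(r'\r?\n', text))
# ...but content inside each line is still found.
print(search_line_by_line(r'done[0-9]+', text))
```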

S.Mark
A: 

I like Notepad++ too but the regexing is really a pain. If you insist on using Notepad++ try this:

  • First find out which newline characters are being used in your document (View>Show Symbol>Show End Of Line)
  • Delete those line-breaks by replacing them with a single space (Search and replace. CR is \r LF is \n. Be sure to tick "Extended" search mode)
  • Regex-replace done[0-9][0-9]*=\"[0-9][0-9]*\" with the empty string (be sure to put a single space before the regex)

Voila! Not very nice n clean but it works ;o)

After that, if you want it human-readable again, you could use the HTMLTidy functions.
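
For comparison, the same join-then-delete idea can be sketched in Python. This is an illustration of the steps above, not something Notepad++ runs. The sample string and variable names are assumptions, and the line-break pattern covers CR+LF, lone CR and lone LF so that the first step (finding out which one is in use) is not needed here.

```python
import re

# Sample tag with CR+LF line endings (assumed for this sketch).
html = (
    '<DIV class=menu done27="1" done26="0"\r\n'
    'done9="1" done8="0" done7="1"\r\n'
    'done6="0" done4="20">\r\n'
)

# Replace every line break with a single space, so each attribute
# is now preceded by a space.
joined = re.sub(r'\r\n|\r|\n', ' ', html)

# Delete ' doneN="M"'; note the single leading space in the pattern.
cleaned = re.sub(r' done[0-9][0-9]*="[0-9][0-9]*"', '', joined)

print(repr(cleaned))
```

As in the manual procedure, all original line breaks are gone afterwards, which is why a tidy-up pass is suggested.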

das_weezul
A: 

You almost had it! Unfortunately, the complete solution in Notepad++ would have to be a 3 step process.

  1. Regex search/replace with the following search: \<done[0-9]+="[0-9]+"[ ]* and leave the replace field empty, so that it simply deletes everything that matches. (In Notepad++'s regular-expression syntax, \< represents the "beginning of a word".)

  2. Select the portion of text affected by your previous search/replace. You don't want to select the entirety of your document, because we're going to...

  3. Strip newlines. Hit Ctrl-F to bring up the Search/Replace dialog again and this time select "Extended" search mode, instead of "Regular expression". Depending on the format of your document you are going to want to search for either \n or \r\n. The replacement field should, again, be empty. Also, make sure that the "In Selection" checkbox is checked.

Click "Replace All" and you're done!
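
The effect of these steps can be previewed with a small Python sketch. Python has no `\<`, so `\b` (word boundary) is used as a stand-in; that substitution is an assumption of this example, as are the variable names. Because the boundary sits directly before `done`, it behaves like a word start here and leaves words such as "undone" alone.

```python
import re

# The affected portion of text (what step 2 would select).
tag = (
    '<DIV class=menu done27="1" done26="0"\n'
    'done9="1" done8="0" done7="1"\n'
    'done6="0" done4="20">'
)

# Step 1: delete each attribute plus any spaces that follow it.
step1 = re.sub(r'\bdone[0-9]+="[0-9]+"[ ]*', '', tag)

# Step 3: strip the now-empty line breaks inside the selection.
step3 = step1.replace('\r\n', '').replace('\n', '')

print(repr(step1))
print(repr(step3))

# A word that merely ends in "done" is left untouched.
unchanged = re.sub(r'\bdone[0-9]+="[0-9]+"[ ]*', '', 'undone5="1"')
```

In this sketch one space is left before the closing `>`, because the pattern removes spaces after each attribute but not the space before the first one.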

kurige
+3  A: 

Extended Replace "\n" with "LINEBREAK "

Thanks a lot to all for these timely replies. Following your advice, here's what I did:

  • "Notepad++ > View > Show Symbol > Show End Of Line" shows "CR+LF" at each line end.
  • "Notepad++ > Search > Find", "Search mode" = "Normal": made sure that "Find what" = "LINEBREAK" finds nothing
  • "Search mode" = "Extended": "Find what" = "\n\r" only finds the double breaks (CR + LF + a blank line); "\n \r" finds nothing; yet "\n" does find exactly all line breaks, and only them.
  • Saved my "Towncar.htm" test file as "Towncar_02.htm" (also encoded in ANSI) as a backup
  • Under "Extended", replaced all "\n" with "LINEBREAK " (notice the trailing space)
  • Under "Regular expression", replaced each occurrence of:

     done[0-9]*="[0-9]*"
    

(Be careful to check that there is A LEADING SPACE before "done"
and that there is NO TRAILING SPACE! See below.)

with an empty string

  • Under "Extended", replaced each occurrence of "LINEBREAK" with "\n" (no trailing space this time after "LINEBREAK"!)
  • Checked that the resulting "Towncar.htm" file (after a little cosmetic reformatting) looked OK and pretty, and that after a refresh it still rendered the same as the "Towncar_02.htm" backup.
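
The placeholder round trip in the steps above can be sketched in Python as follows. This is only an illustration: the function name `strip_done_attributes` and the sample input are invented here, while the placeholder word and the regular expression are the ones used in the steps.

```python
import re

PLACEHOLDER = 'LINEBREAK'

def strip_done_attributes(html):
    """Mark line breaks, delete ' doneN="M"', then restore the breaks."""
    # Same precaution as the "finds nothing" check in Normal mode.
    if PLACEHOLDER in html:
        raise ValueError('placeholder already occurs in the document')
    # Replace "\n" with "LINEBREAK " (trailing space), so an attribute at
    # the start of a line is now preceded by a space.
    marked = html.replace('\n', PLACEHOLDER + ' ')
    # Leading space before "done", no trailing space.
    cleaned = re.sub(r' done[0-9]*="[0-9]*"', '', marked)
    # Replace "LINEBREAK" (no trailing space) with "\n".
    return cleaned.replace(PLACEHOLDER, '\n')

# Sample with CR+LF endings, plus a word that must not be altered.
sample = (
    '<DIV class=menu done27="1" done26="0"\r\n'
    'done9="1" done8="0" done7="1"\r\n'
    'done6="0" done4="20">\r\n'
    '<P>methadone1="1"</P>\r\n'
)
result = strip_done_attributes(sample)
print(repr(result))
```

The output keeps every line break but leaves empty lines inside the tag and a stray space after breaks that were not followed by a "done" attribute, which matches the need for some cosmetic reformatting afterwards.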

Reminders and Notes:

  • This forum apparently works well in Chrome 4, but with some browsers (e.g. IE6 and other discontinued ones) it can, under some circumstances, show display artifacts. So be careful:
  • even if the forum doesn't show it in your browser, there is a leading space at the beginning of the Regex (the " done..." Regular expression above), so that it replaces only strings starting with " done", space included; this makes even surer NOT to alter any other strings containing "undone" or "methadone" or the like
  • likewise, even if the forum shows one in your browser, there is no trailing space at the end of the Regex!
  • in the Regex, [0-9] matches 1 and only 1 occurrence of any decimal digit (characters in the 0-9 range); IOW it matches « 0 » or « 1 » or « 9 » etc, but NOT « 01 » or « 835 » or « » (the empty string) or anything else.
  • * (asterisk) matches the preceding item 0 or more times (here, [0-9]* matches the empty string or any string made exclusively of digits)
  • likewise, + (plus sign) matches the preceding item 1 or more times (here, [0-9]+ matches any string, at least 1 character long, made exclusively of digits)
    Ref: http://sourceforge.net/apps/mediawiki/notepad-plus/index.php?title=Regular_Expressions#Notepad.2B.2B_regex_syntax
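
The three notes on [0-9], * and + can be checked with a short Python sketch. Python's `re` module is used here only as a convenient stand-in for these basic constructs, and the helper name `fully_matches` is made up for the example.

```python
import re

def fully_matches(pattern, s):
    """True if the whole string s, and nothing less, matches pattern."""
    return re.fullmatch(pattern, s) is not None

# Columns: the string, then [0-9], [0-9]* and [0-9]+ in turn.
for s in ['', '0', '9', '01', '835']:
    print(
        repr(s),
        fully_matches('[0-9]', s),
        fully_matches('[0-9]*', s),
        fully_matches('[0-9]+', s),
    )
```

Only the single digits match [0-9] as a whole; the empty string matches [0-9]* but not [0-9]+; the multi-digit strings match both quantified forms.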

Again a lot of thanks to all 3!

Versailles, Wed 21 Apr 2010 16:20:45 +0200,
edited (correcting small display errors) Fri 23 Apr 19:47

Michel Merlin
Wow! This is a very well-written and detailed answer. Impressive! I've voted on both question and answer.
S.Mark