I have a document that was converted from PDF to HTML for use on a company website, where it will be referenced and indexed for search. I'm formatting the converted document to meet my needs, and in doing so I'm trying to clean up some of the junk carried over from the PDF, such as page numbers, headers, and footers. Luckily, all of the lines that need to be removed come in blocks of 4 lines. Unfortunately, the blocks are not exactly the same, so they cannot be removed with a simple literal replace: they contain numbers that increment with the page numbers. How can I remove blocks like the following example from my HTML file?

Title<br>
10<br>
<hr>
<A name=11></a>Footer<br>

I've tried many different regular expressions, but my skill in that area is limited and I can't find the proper syntax. I'm sure I'm missing something fairly easy, as it seems all I need is a wildcard replace for the two numbers in the code, with the rest literal.

Any help is appreciated.

A: 

If I have understood your request correctly, this pattern matches your string:

Title<br>( ?)\n([0-9]+)<br>( ?)\n<hr>( ?)\n<A name=([0-9]+)></a>Footer<br>

I use the Regex Coach to try out complicated regex patterns. Other utilities are available.
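
For anyone whose editor cannot match across line breaks, here is a minimal Python sketch of the same pattern. It is only an illustration under a few assumptions: the names `BLOCK_RE` and `strip_page_blocks` are made up for this example, the capture groups are dropped because nothing is reused in the replacement, and `\r?\n` is used in place of `\n` so that both LF and CRLF files should match.

```python
import re

# Hypothetical names, for illustration only.
# " ?" mirrors the optional trailing space in the pattern above;
# "\r?\n" accepts both LF and CRLF line endings.
BLOCK_RE = re.compile(
    r"Title<br> ?\r?\n"
    r"[0-9]+<br> ?\r?\n"
    r"<hr> ?\r?\n"
    r"<A name=[0-9]+></a>Footer<br> ?(?:\r?\n)?"
)


def strip_page_blocks(html: str) -> str:
    """Remove every four-line header/footer block from the text."""
    return BLOCK_RE.sub("", html)


sample = (
    "<p>First page text</p>\r\n"
    "Title<br>\r\n"
    "10<br>\r\n"
    "<hr>\r\n"
    "<A name=11></a>Footer<br>\r\n"
    "<p>Second page text</p>\r\n"
)
cleaned = strip_page_blocks(sample)
print(cleaned)
```

The literal words `Title` and `Footer` come from the example in the question; a real document would need its own header and footer text substituted in.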

edit

As I do not use Notepad++ I cannot be sure that this pattern will work for you. Apologies if that transpires to be the case. (I'm a TextPad man myself, and it does work with that tool).

APC
Unfortunately, Notepad++ has a bad regex parser: it does not recognize `\n`, at least for me in version 5.6.8.
tanascius
Notepad++ won't recognize `\n` between lines of text/code if your files' line endings are `\r\n`. I've not had trouble myself with `\n` on LF-terminated-line files.
BoltClock
+1  A: 

The search & replace in Notepad++ is quite odd. I can't find newline characters with a regular expression, although the documentation says:

As of v4.9 the Simple find/replace (control+h) has changed, allowing the use of \r \n and \t in regex mode and the extended mode.

I updated to the latest version, but it just doesn't work. Extended mode lets me find newlines, but I can't specify wildcards there.

However, you can use macros to work around this problem.

  • prepare a search that will find a unique passage (like Title<br>\r\n; here you can use the extended mode)
  • start recording a macro
  • press F3 to use your search
  • mark the four lines and delete them
  • stop recording the macro ... done!

Just replay it and it deletes what you wanted to delete.
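
If a macro is not an option, the same steps can be sketched as a small Python script: find the line that starts a block, then drop it together with the three lines after it. This is only a rough equivalent, not part of Notepad++. The function name `delete_blocks` and its parameters are hypothetical, and it assumes, as the macro does, that `Title<br>` appears alone on a line only at the start of a block to be deleted.

```python
def delete_blocks(lines, marker="Title<br>", block_len=4):
    """Drop each line equal to marker plus the block_len - 1 lines after it."""
    kept = []
    skip = 0  # number of following lines still to discard
    for line in lines:
        if skip:
            skip -= 1
            continue
        # strip() ignores trailing spaces and either style of line ending
        if line.strip() == marker:
            skip = block_len - 1
            continue
        kept.append(line)
    return kept


page = (
    "<p>First page text</p>\r\n"
    "Title<br>\r\n"
    "10<br>\r\n"
    "<hr>\r\n"
    "<A name=11></a>Footer<br>\r\n"
    "<p>Second page text</p>\r\n"
)
# keepends=True preserves the original line endings in the output
result = "".join(delete_blocks(page.splitlines(keepends=True)))
print(result)
```

Unlike the regex approach, this never checks the contents of the three lines after the marker, so it is worth running on a copy of the file first.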

tanascius
Thanks a lot, this was an easy fix for my issue. I was definitely overthinking it; the macro was much easier for such a simple removal scenario.
Levi