Hi, I have a messy HTML file that looks like this:

<div id=":0.page.0" class="page-element" style="width: 1620px;">
 <div>
  <img src="viewer_files/viewer_004.png" class="page-image" style="width: 800px; height: 1131px; display: none;">
  <img src="viewer_files/viewer_005.png" class="page-image" style="width: 1600px;">
 </div>
</div>// this repeats 100+ times with different 'src' attributes

Now this is actually all on one line (I have split it across multiple lines for readability). I am trying to remove all <img> tags that have display: none; set in the inline CSS. Is it possible to use sed/awk or some other Unix command to achieve this? I think it would have been easy if it were a well-indented HTML document.

A: 

This should do it:

sed -e "s@<img.*display: none;.*>@@g" FILENAME
sha
Isn't the second .* going to match greedily?
pdbartlett
it removed all the img tags :|
fenderplayer
Well, it did work on original sample. But if greedy would be a problem we can always replace . with [^>]
sha
Are you sure? I just tried it with your file, and it worked like a charm.
sha
this is greedy...
ghostdog74
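To make the greediness point in the comments above concrete, here is a small sketch that runs both patterns on a single-line input. The sample line and the variable names are made up for illustration; only the two sed expressions come from the thread.

```shell
# Hypothetical one-line sample: one hidden image followed by one visible image.
line='<div><img src="a.png" style="display: none;"><img src="b.png"></div>'

# Greedy: the trailing .* runs to the last '>' on the line,
# so the visible image and the closing </div> are deleted as well.
greedy=$(printf '%s\n' "$line" | sed -e 's@<img.*display: none;.*>@@g')

# Bounded: [^>]* cannot cross a '>', so the match stays inside one tag.
bounded=$(printf '%s\n' "$line" | sed -e 's@<img[^>]*display: none;[^>]*>@@g')

printf 'greedy:  %s\n' "$greedy"
printf 'bounded: %s\n' "$bounded"
```

This probably explains both reports: with the sample split across lines each tag sits on its own line, so the greedy version looks fine, while on the real one-line file it swallows far more than one tag.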
+2  A: 

HTML and regexes are a notoriously bad match, so you probably want something that is HTML-aware. I'd probably go for something like TagSoup, but there are no doubt other options that are more shell-friendly, or suitable for any favourite scripting language you may have.

pdbartlett
+2  A: 

I would use either Twig or XMLStarlet for this kind of processing; they are a lot more reliable than sed/awk/grep. That said, since your pattern is regular and repeating, sed/awk/grep would work too.

Noufal Ibrahim
+1 love xmlstarlet as much as I can love anything related to XML.
pra
A: 
sed -e "s/<img[^>]*display: none;[^>]*>//g" filein

A quick explanation of sed:

s stands for substitution, and the / characters are delimiters.

s means that the first field is a pattern to search for, which is replaced by the second field. The last field holds options: g means global (replace every match on the line, not just the first).

To replace in place: sed -i -e "..."
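Here is a minimal sketch of the command applied to a one-line sample modelled on the question's markup. The helper name strip_hidden_imgs and the shortened sample line are assumptions for illustration, not part of the original file.

```shell
# strip_hidden_imgs is a hypothetical helper name; it filters stdin to stdout.
strip_hidden_imgs() {
  # [^>]* cannot cross a tag's closing '>', so each match stays inside one <img ...> tag.
  # The empty replacement field deletes the match; g repeats this for every match on the line.
  sed -e 's/<img[^>]*display: none;[^>]*>//g'
}

# One-line sample with a hidden image and a visible image.
printf '%s\n' '<div><img src="viewer_004.png" class="page-image" style="width: 800px; display: none;"><img src="viewer_005.png" class="page-image" style="width: 1600px;"></div>' | strip_hidden_imgs
```

Only the tag carrying display: none; should disappear; the second image and the surrounding div are left alone.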

Scharron
ok this worked but without the 'r' option.
fenderplayer
should be `display: *none\b` instead of `display: none;`
Pumbaa80
@Pumbaa80 what is the difference?
fenderplayer
Matches zero or more spaces, instead of exactly one.
pdbartlett
also, it matches `"display: none"` and `"display: none ;"`
Pumbaa80
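A rough sketch of the looser spacing suggested in these comments. It keeps the `display: *none` part but drops the `\b`, since as far as I know that is a GNU sed extension; the trade-off is that it would also match a value that merely starts with "none". The helper name and sample line are invented for illustration.

```shell
# strip_hidden_imgs_loose is a hypothetical helper name; it filters stdin to stdout.
strip_hidden_imgs_loose() {
  # ' *' allows zero or more spaces after the colon; the trailing [^>]* absorbs
  # an optional space, semicolon, and closing quote before the tag ends.
  sed -e 's/<img[^>]*display: *none[^>]*>//g'
}

# Three spellings: no space, space before the semicolon, and a visible image.
printf '%s\n' '<p><img src="a.png" style="display:none"><img src="b.png" style="display: none ;"><img src="c.png" style="display: block;"></p>' | strip_hidden_imgs_loose
```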
+1  A: 
sed 's/<img[^>]*display: none;[^>]*>//g' file
ghostdog74