Hi guys, I wonder if you could help me. I'm trying to write a bash script that displays some values from a section of HTML code, and I'm stuck on the regular expression part.

I have the following piece of code:

 <li><div friendid="107647498" class="friendHelperBox"><div><a href="http://www.myspace.com/rockyrobsyn" class="msProfileTextLink" title="rØbylin">rØbylin</a></div><span class="msProfileLink friendToolTipBox" friendid="107647498" style="width:90px;"><a href="http://www.myspace.com/rockyrobsyn"><img src="http://x.myspacecdn.com/modules/common/static/img/spacer.gif" source="http://c2.ac-images.myspacecdn.com/images01/59/s_8b94c89a98de643e59ab9a1cf03885c1.jpg" alt="rØbylin" class="profileimagelink" onerror="UseNoPicImage(event.target||event.srcElement)" /><span class="pilRealName">Robyn</span></a></span></div><br /><img src="http://x.myspacecdn.com/images/onlinenow.gif" /></li><li><div friendid="59261168" class="friendHelperBox"><div><a href="http://www.myspace.com/christownsendmusic" class="msProfileTextLink" title="Chris Townsend">Chris Townsend</a></div><span class="msProfileLink friendToolTipBox" friendid="59261168" style="width:90px;"><a href="http://www.myspace.com/christownsendmusic"><img src="http://x.myspacecdn.com/modules/common/static/img/spacer.gif" source="http://c4.ac-images.myspacecdn.com/images02/83/s_029c098cc40c40ff8f88fe54d53a1277.jpg" alt="Chris Townsend" class="profileimagelink" onerror="UseNoPicImage(event.target||event.srcElement)" /></a></span></div><br /><img src="http://x.myspacecdn.com/images/onlinenow.gif" /></li></ul>

It is all on one line, and I would like to pull out the text that is inside

..class="msProfileTextLink" title="<GRAB THIS TEXT>">....

I would like to grab all occurrences. How can I do this?

A: 

What about Perl? ;)

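#!/usr/bin/perl

use strict;
use warnings;

my $string = 'Your string';

# The /g modifier in a while loop visits every match, not just the first one.
while ($string =~ m/class="msProfileTextLink" title="([^"]*)"/g) {
    print "$1\n";
}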
Artyom Sokolov
A: 

The following Perl-style regex should work for you:

m/class="msProfileTextLink"\s*title="([^"]+)"/g

To use it from a bash script, you should be able to run it as a Perl one-liner (see the -n, -p and -e Perl command-line options), or use another language that supports Perl-style regexes, such as Python or PHP.
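As a rough sketch of the one-liner approach, the regex above can be wrapped in a small shell function. The function name extract_titles and the shortened sample input are made up for illustration; it assumes perl is installed and that the HTML arrives on standard input.

```shell
# Hypothetical helper: print every title="..." value that directly follows
# class="msProfileTextLink" in the HTML read from stdin, one value per line.
extract_titles() {
    # -n reads input line by line without printing it; the /g match in a
    # while loop walks through every occurrence on the line.
    perl -ne 'print "$1\n" while /class="msProfileTextLink"\s*title="([^"]+)"/g'
}

# Shortened sample input (an assumption, modelled on the HTML in the question).
html='<a href="#" class="msProfileTextLink" title="Chris Townsend">Chris Townsend</a><a href="#" class="msProfileTextLink" title="Robyn">Robyn</a>'

printf '%s\n' "$html" | extract_titles
```

In a real script you would feed it the page instead, for example extract_titles < file.html.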

EvanK
+1  A: 

I'm assuming that it's okay to invoke standard Unix tools, not just bash built-ins.

Well,

grep -o 'class="msProfileTextLink" title="[^"]*"' file.html

gets you as far as:

class="msProfileTextLink" title="rØbylin"

class="msProfileTextLink" title="Chris Townsend"

That assumes there's never any whitespace variation in the HTML. Otherwise you need to do

egrep -o 'class="msProfileTextLink"[[:space:]]*title="([^"])*"' file.html

inserting [[:space:]]* wherever there might be some whitespace.

Then piping that output through grep -o '"[^"]*"$' gets it down to:

"rØbylin"

"Chris Townsend"

Ealdwulf
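For reference, the two grep steps in the answer above can be chained into one pipeline, with a final tr to drop the surrounding quotes. This is only a sketch: the function name grep_titles and the sample input are invented here, and it assumes a grep that supports -o and -E (GNU and BSD grep do; -o is not required by POSIX).

```shell
# Hypothetical helper: chain the two grep steps, then strip the quotes.
grep_titles() {
    # Step 1: isolate each class/title attribute pair (one match per line).
    grep -Eo 'class="msProfileTextLink"[[:space:]]*title="[^"]*"' |
        # Step 2: keep only the last quoted string on each of those lines.
        grep -o '"[^"]*"$' |
        # Step 3: remove the double quotes around the value.
        tr -d '"'
}

# Shortened sample input (an assumption, modelled on the HTML in the question).
html='<a href="#" class="msProfileTextLink" title="Chris Townsend">Chris Townsend</a><a href="#" class="msProfileTextLink" title="Robyn">Robyn</a>'

printf '%s\n' "$html" | grep_titles
```

With a file on disk, grep_titles < file.html does the same thing.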
+1  A: 

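Try this, which splits the line on the attribute prefix and prints the text up to the next quote:

awk -F'class="msProfileTextLink" title="' '{for (i = 2; i <= NF; i++) {split($i, a, "\""); print a[1]}}' file.html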
Journeyman Programmer