I need to get images from a webpage's source.

I can use cfhttp with method="get" and htmleditformat() to read the HTML from that page. Now I need to loop through the content to get all the image URLs (the src attributes).

Can I use rematch() or refind() for this, and if so, how?

Please help!

If I'm not clear, I can try to clarify.

+3  A: 

It can be very difficult to reliably parse HTML with regex.

Antony
+1 - That made me laugh so hard, I nearly fell off my chair.
Leigh
+1  A: 

Here's a function that will probably trip up on a lot of bad cases, but might work if you just need something quick and dirty.

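<cffunction name="getSrcAttributes" access="public" returntype="array" output="No">
    <cfargument name="pageContents" required="Yes" type="string" />

    <cfset var continueSearch = true />
    <cfset var cursor = "" />
    <cfset var startPos = 1 />
    <cfset var finalPos = 0 />
    <cfset var imgSrc = "" />
    <cfset var images = ArrayNew(1) />

    <cfloop condition="continueSearch eq true">
        <!--- Find the next src= followed by an opening quote --->
        <cfset cursor = REFindNoCase("src\s*=\s*[""']", arguments.pageContents, startPos, true) />

        <cfif cursor.pos[1] neq 0>
            <!--- The value starts right after the opening quote --->
            <cfset startPos = (cursor.pos[1] + cursor.len[1]) />
            <!--- The value ends at the next quote or whitespace character --->
            <cfset finalPos = REFindNoCase("[""'\s]", arguments.pageContents, startPos) />

            <cfif finalPos eq 0>
                <!--- No terminator found: stop rather than call Mid() with a negative count --->
                <cfset continueSearch = false />
            <cfelseif finalPos gt startPos>
                <cfset imgSrc = Mid(arguments.pageContents, startPos, finalPos - startPos) />
                <cfset ArrayAppend(images, imgSrc) />
            </cfif>
        <cfelse>
            <cfset continueSearch = false />
        </cfif>
    </cfloop>

    <cfreturn images />
</cffunction>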

Note: I can't verify at the moment that this code works.
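For what it's worth, a minimal usage sketch, assuming the function above is defined on the same page. The URL and the variable names pageResponse and imageUrls are placeholders, not anything from the question. It passes the raw response body to the function, because htmleditformat() escapes quotes and angle brackets and so is only useful when displaying the results, not before searching.

```cfml
<!--- Hypothetical URL, for illustration only --->
<cfhttp url="http://www.example.com/" method="get" result="pageResponse" />

<!--- Search the raw response body; do not HTMLEditFormat() it first --->
<cfset imageUrls = getSrcAttributes(pageResponse.FileContent) />

<cfoutput>
    <cfloop index="i" from="1" to="#ArrayLen(imageUrls)#">
        <!--- Escape only at display time --->
        #HTMLEditFormat(imageUrls[i])#<br />
    </cfloop>
</cfoutput>
```

Like the function itself, this will happily return src values from script, iframe and other tags, not just images.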

Soldarnal
Huh? *If* you're going the regex route (see Antony's answer for why you shouldn't), you just want:

<!--- INFO: Grab things resembling src attributes: --->
<cfset SrcMatches = rematch( 'src\s*=\s*(["'']?)((?!\1).)+' , InputText ) />

<!--- INFO: Clean-up front of match (remove src=" part) --->
<cfloop index="i" from="1" to="#ArrayLen(SrcMatches)#">
    <cfset SrcMatches[i] = rereplace(SrcMatches[i],'src\s*=\s*["'']','')/>
</cfloop>
Peter Boughton
I had written this function a while back (before CF8, hence no REMatch) for, like I mention above, something quick and dirty. I make no pretense that it is production code - obviously it doesn't check if src= is even in an img tag (or in a tag at all!) - but not all code has to be.
Soldarnal
Peter Boughton: thanks for the code. It seems to pick up only one src attribute. If you can modify it to list all the src values, I would appreciate that. I added #SrcMatches[i]#<br> in the loop, assuming it would list every src found:

<cfset SrcMatches = rematch( 'src\s*=\s*(["'']?)((?!\1).)+' , InputText ) />

<!--- INFO: Clean-up front of match (remove src=" part) --->
<cfloop index="i" from="1" to="#ArrayLen(SrcMatches)#">
    <cfset SrcMatches[i] = rereplace(SrcMatches[i],'src\s*=\s*["'']','')/>#SrcMatches[i]#<br>
</cfloop>
loo
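One way to list every image src, sketched under a few assumptions: InputText holds the raw page source (for example cfhttp.FileContent, not an htmleditformat()-escaped copy, since escaping turns the quotes into entities the pattern cannot match), the src values are quoted, and the names ImgTags, ImageUrls, SrcMatch and SrcValue are made up for illustration. It first collects the img tags, then pulls the src out of each one. This is still regex-on-HTML, so it carries the caveats from Antony's answer.

```cfml
<!--- Step 1: collect every img tag from the raw page source --->
<cfset ImgTags = REMatchNoCase('<img\b[^>]*>', InputText) />
<cfset ImageUrls = ArrayNew(1) />

<cfloop index="i" from="1" to="#ArrayLen(ImgTags)#">
    <!--- Step 2: find a quoted src attribute inside this one tag --->
    <cfset SrcMatch = REMatchNoCase('\bsrc\s*=\s*("[^"]*"|''[^'']*'')', ImgTags[i]) />

    <cfif ArrayLen(SrcMatch) gt 0>
        <!--- Step 3: strip the leading src= with its opening quote, then the closing quote --->
        <cfset SrcValue = REReplaceNoCase(SrcMatch[1], '^src\s*=\s*["'']', '') />
        <cfset SrcValue = REReplace(SrcValue, '["'']$', '') />
        <cfset ArrayAppend(ImageUrls, SrcValue) />
    </cfif>
</cfloop>

<!--- Output needs a cfoutput block, otherwise the #...# expressions are not evaluated --->
<cfoutput>
    <cfloop index="i" from="1" to="#ArrayLen(ImageUrls)#">
        #HTMLEditFormat(ImageUrls[i])#<br />
    </cfloop>
</cfoutput>
```

Unquoted src values are skipped, and an attribute such as data-src would also be picked up, so treat the result as a rough list rather than a guaranteed one.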
+1  A: 

Using a browser and jQuery to query all the img tags out of the DOM might be easier...

Henry
Or use this Java-based CSS selector library to do the querying: http://github.com/chrsan/css-selectors/tree
Peter Boughton