I need to parse the URL and title from multiple href tags in a string using a regex. I need to get each URL and title into a variable.
E.g.:

<DT><A HREF="http://www.partyboatnj.com/" ADD_DATE="1210713679" LAST_VISIT="1225055180"     LAST_MODIFIED="1210713679">NJ Party Boat - Sea Devil of Point Pleasant Beach, NJ</A> 
<DT><A     HREF="http://www.test.com/" ADD_DATE="1210713679" LAST_VISIT="1225055180"     LAST_MODIFIED="1210713679">test parse</A> 
 <DT><A HREF="http://www.google.com/"     ADD_DATE="1210713679" LAST_VISIT="1225055180" LAST_MODIFIED="1210713679">google</A>
A:

Ok, if I understand correctly, I would do something like this:

<cffunction name="reMatchGroups" access="public" returntype="array" output="false">
    <cfargument name="text" type="string" required="true" />
    <cfargument name="pattern" type="string" required="true" />
    <cfargument name="scope" type="string" required="false" default="all" />

    <cfscript>
         // var-scope the working struct so it stays local to this call
         var l = {};
         l.results = [];

         l.pattern = createObject("java", "java.util.regex.Pattern").compile(javacast("string", arguments.pattern));
         l.matcher = l.pattern.matcher(javacast("string", arguments.text));

         while(l.matcher.find()) {
             l.groups = {};

             for(l.i = 1; l.i <= l.matcher.groupCount(); l.i++) {
                 l.groups[l.i] = l.matcher.group(javacast("int", l.i));
             }

             arrayAppend(l.results, l.groups);

             if(arguments.scope == "one")
                 break;
         }

         return l.results;
    </cfscript>
</cffunction>

The function above returns the captured groups for each match of the regex pattern. Passing "one" as the scope stops after the first match.

You could use it like this:

<cfset a = reMatchGroups("<a href=""http://iamalink.com"" class=""testlink"">This is a link</a>", "href=[""']([^""']*)[""'][^>]*>([^<]*)", "all") />

This gives you an array of structs, one per match, with a key-value pair for each back reference in the regex. In this case key 1 is the href value and key 2 is the node text.
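To get each URL and title into its own variable, you can loop over the returned array. The snippet below is only a sketch: the variable names (bookmarks, links, linkUrl, linkTitle) are made up for illustration, and the sample input is two lines from the question with the attributes shortened. The (?i) flag is added on the assumption that the real input uses uppercase HREF, as in the question, since Java regex matching is case-sensitive by default.

```cfml
<!--- Sample input: two bookmark lines from the question (attributes shortened) --->
<cfsavecontent variable="bookmarks">
<DT><A HREF="http://www.test.com/" ADD_DATE="1210713679">test parse</A>
<DT><A HREF="http://www.google.com/" ADD_DATE="1210713679">google</A>
</cfsavecontent>

<!--- (?i) makes the match case-insensitive so that HREF and href both match --->
<cfset links = reMatchGroups(bookmarks, "(?i)href=[""']([^""']*)[""'][^>]*>([^<]*)", "all") />

<cfloop array="#links#" index="link">
    <!--- Key 1 is the first captured group (the URL), key 2 the second (the link text) --->
    <cfset linkUrl = link[1] />
    <cfset linkTitle = trim(link[2]) />
    <cfoutput>#htmlEditFormat(linkTitle)#: #htmlEditFormat(linkUrl)#<br /></cfoutput>
</cfloop>
```

The variable is called linkUrl rather than url because url is a built-in scope name in ColdFusion.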

Bigfellahull
Thanks for the code, it works wonders! What I'm really trying to achieve is to parse an HTML file exported from a browser such as Chrome or Firefox and import the URLs into a database. Would you have a way to do that? The catch is that the bookmarks are organised into categories (folders) using <h3> tags, and I need to get each URL together with the category it is in. I can clarify if necessary. Thanks.
loo
OK, for that you really need an HTML parser. I'm not sure of any in ColdFusion, but I'm sure a search will turn some up. I came across a simple "parser" by Ben Nadel which stores each HTML element in a structure. You could then look for just the h3's and write the attributes you want to the database. You may have to tweak his code, though: http://www.bennadel.com/blog/779-Parsing-HTML-Tag-Data-Into-A-ColdFusion-Structure.htm
Bigfellahull
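
Short of a full HTML parser, the folder tracking described in the comments could be approximated with the reMatchGroups function from the answer. This is a rough sketch, not a tested importer. It assumes that the export puts each <H3> heading and each <A> link on its own line, and that it is enough to record the most recent heading seen, so nested folders are not reconstructed. The function name extractBookmarks and the struct keys folder, url and title are hypothetical.

```cfml
<cffunction name="extractBookmarks" access="public" returntype="array" output="false">
    <cfargument name="html" type="string" required="true" />

    <cfset var l = {} />
    <cfset l.results = [] />
    <!--- Links seen before any heading get an empty folder name --->
    <cfset l.folder = "" />
    <cfset l.h3Pattern = "(?i)<h3[^>]*>([^<]*)</h3>" />
    <cfset l.linkPattern = "(?i)<a\s[^>]*href=[""']([^""']*)[""'][^>]*>([^<]*)</a>" />

    <!--- Walk the file line by line (assumes one tag per line) --->
    <cfloop list="#arguments.html#" index="l.line" delimiters="#chr(10)##chr(13)#">
        <cfset l.match = reMatchGroups(l.line, l.h3Pattern, "one") />
        <cfif arrayLen(l.match)>
            <!--- A folder heading: remember it for the links that follow --->
            <cfset l.folder = trim(l.match[1][1]) />
        <cfelse>
            <cfset l.match = reMatchGroups(l.line, l.linkPattern, "one") />
            <cfif arrayLen(l.match)>
                <!--- A link: store it together with the current folder --->
                <cfset l.item = {} />
                <cfset l.item.folder = l.folder />
                <cfset l.item.url = l.match[1][1] />
                <cfset l.item.title = trim(l.match[1][2]) />
                <cfset arrayAppend(l.results, l.item) />
            </cfif>
        </cfif>
    </cfloop>

    <cfreturn l.results />
</cffunction>
```

Each element of the returned array could then be written to the database in a loop. If the export nests folders or wraps tags across lines, this approach breaks down and a real parser is the safer route.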