I'm looking to learn how to create a regex in ColdFusion that will scan through a large block of HTML text and create a list of items.

The items I want are contained between the following tags:

<span class="findme">The Goods</span>

Thanks for any tips to get this going.

+5  A: 

You don't say which version of CF you're on. Since v8 you can use REMatch to get an array of matches:

results = REMatch('(?i)<span[^>]+class="findme"[^>]*>(.+?)</span>', text)

Use ArrayToList to turn that array into a list. For older versions, use REFindNoCase to locate each match and Mid() to extract the substrings.
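As a rough sketch of the CF8+ route: REMatch returns each whole match, tags included, so the example below reuses the same pattern with REReplace and the \1 backreference to keep only the captured text before building the list. The sample `text` and the variable names `matches`, `items` and `itemList` are illustrative assumptions, not part of the original answer.

```cfml
<!--- Sample input; in practice this would be the HTML block to scan --->
<cfset text = '<span class="findme">The Goods</span><span class="findme">More Goods</span>'>
<cfset pattern = '(?i)<span[^>]+class="findme"[^>]*>(.+?)</span>'>

<!--- REMatch returns an array of whole matches, including the span tags --->
<cfset matches = REMatch(pattern, text)>

<!--- Strip the tags by replacing each whole match with its first subexpression --->
<cfset items = ArrayNew(1)>
<cfloop index="i" from="1" to="#ArrayLen(matches)#">
    <cfset ArrayAppend(items, REReplace(matches[i], pattern, "\1"))>
</cfloop>

<!--- Default delimiter is a comma --->
<cfset itemList = ArrayToList(items)>
<cfoutput>#itemList#</cfoutput>
```

If the captured text can itself contain commas, pass a different delimiter as the second argument to ArrayToList, or keep working with the array.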

EDIT: To answer your follow-up comment: using REFind to return all matches is quite involved, because the function only returns the FIRST match. You have to call REFind repeatedly, passing a new start position each time. Ben Forta has written a UDF that does exactly this and will save you some time.

<!---
Returns all the matches of a regular expression within a string.
NOTE: Updated to allow subexpression selection (rather than whole match)

@param regex      Regular expression. (Required)
@param text       String to search. (Required)
@param subexnum   Sub-expression to extract (Optional)
@return Returns a structure with two arrays, len and pos.
@author Ben Forta
@version 1, July 15, 2005
--->
<cffunction name="reFindAll" output="true" returnType="struct">
<cfargument name="regex" type="string" required="yes">
<cfargument name="text" type="string" required="yes">
<cfargument name="subexnum" type="numeric" default="1">

<!--- Define local variables --->    
<cfset var results=structNew()>
<cfset var pos=1>
<cfset var subex="">
<cfset var done=false>

<!--- Initialize results structure --->
<cfset results.len=arraynew(1)>
<cfset results.pos=arraynew(1)>

<!--- Loop through text --->
<cfloop condition="not done">

   <!--- Perform search --->
   <cfset subex=reFind(arguments.regex, arguments.text, pos, true)>
   <!--- Anything matched? --->
   <cfif subex.len[1] is 0>
      <!--- Nothing found, outta here --->
      <cfset done=true>
   <cfelse>
      <!--- Got one, add to arrays --->
      <cfset arrayappend(results.len, subex.len[arguments.subexnum])>
      <cfset arrayappend(results.pos, subex.pos[arguments.subexnum])>
      <!--- Reposition start point --->
      <cfset pos=subex.pos[1]+subex.len[1]>
   </cfif>
</cfloop>

<!--- If no matches, add 0 to both arrays --->
<cfif arraylen(results.len) is 0>
   <cfset arrayappend(results.len, 0)>
   <cfset arrayappend(results.pos, 0)>
</cfif>

<!--- and return results --->
<cfreturn results>
</cffunction>

This gives you the start (pos) and length (len) of each match, so use another loop to extract each substring with Mid():

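<cfset text = '<span class="findme">The Goods</span><span class="findme">More Goods</span>' />
<cfset pattern = '(?i)<span[^>]+class="findme"[^>]*>(.+?)</span>' />
<!--- 2 selects the first subexpression, i.e. the text between the tags --->
<cfset results = reFindAll(pattern, text, 2) />
<cfloop index="i" from="1" to="#ArrayLen(results.pos)#">
    <!--- reFindAll returns a single 0/0 entry when nothing matched, and Mid() rejects a start of 0 --->
    <cfif results.pos[i] gt 0>
        <cfoutput>match #i#: #Mid(text, results.pos[i], results.len[i])#<br></cfoutput>
    </cfif>
</cfloop>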

EDIT: Updated reFindAll with subexnum argument. Setting this to 2 will capture the first subexpression. The default value 1 captures the entire match.

SpliFF
AnApprentice
Good news! I was able to use your regex above with a CF script called REGet, and it found all the spans. The issue is that it returns the tags, which I don't want: <span id="581-1268367477845" class="findme">WTC Captive was created with a $1 billion FEMA grant and provides insurance coverage</span> How can the regex above be updated to send back just: WTC Captive was created with a $1 billion FEMA grant and provides insurance coverage
AnApprentice
If you look at the function above you'll see it looks up the match with `subex.pos[1]` and `subex.len[1]`. Why the 1 there? If you check the docs for REFind you'll see len and pos are actually arrays of `matched subexpressions`. The whole expression is always the first match, so it's showing you where the match started (at the edge of the tag). A subexpression is a match in parentheses. Look closely at the regex I gave you and you'll see `(.+?)`. It's a subexpression, which will be stored as match 2. So just change pos[1] to pos[2] and len[1] to len[2] in the function above.
SpliFF
Just to clarify, I'm talking about changing the function reFindAll, not the loop over its results shown below it, i.e. `subex.len[1]` => `subex.len[2]`, etc.
SpliFF
BTW, the updated function will break if you don't have a subexpression in your regex, because then the arrays won't have 2 elements. Also, I said subexpressions are in parentheses, but that's an over-simplification; for instance, (?i) is NOT a subexpression, it's a flag declaration. Consult regex documentation for all the gory specifics.
SpliFF
+1  A: 

Try making your HTML work with a regular DOM parser and querying it via XPath instead of hammering this through a regex-based abomination.

  1. To make the HTML input usable, pass it through jTidy (see http://jtidy.riaforge.org/)
  2. Once you have well-formed XML/XHTML, build an XML document from it
    <cfset dom = XmlParse(scrubbedHtml, true)>
  3. Query the XML document using XPath
    <cfset result = XmlSearch(dom, "//span[@class='findme']")>

Done.
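A minimal sketch of steps 2 and 3, assuming the markup has already been cleaned into well-formed XML. The sample `scrubbedHtml` string and the names `nodes`, `items` and `itemList` are made up for illustration. XmlSearch returns an array of matching nodes, and each node's XmlText holds the text directly inside the element.

```cfml
<!--- Stand-in for cleaned-up, well-formed markup with no XML namespace --->
<cfset scrubbedHtml = '<div><span class="findme">The Goods</span><span class="other">Skip me</span><span class="findme">More Goods</span></div>'>

<!--- Second argument makes element and attribute names case-sensitive --->
<cfset dom = XmlParse(scrubbedHtml, true)>
<cfset nodes = XmlSearch(dom, "//span[@class='findme']")>

<!--- Collect the text of every matching span --->
<cfset items = ArrayNew(1)>
<cfloop index="i" from="1" to="#ArrayLen(nodes)#">
    <cfset ArrayAppend(items, nodes[i].XmlText)>
</cfloop>

<cfset itemList = ArrayToList(items)>
<cfoutput>#itemList#</cfoutput>
```

Note that XmlText does not include text from nested child elements, so a span that wraps other tags would need its children walked as well.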

EDIT: ColdFusion's XmlSearch() doesn't have great XML namespace support. If you end up producing XHTML instead of the more recommendable XML, use the following XPath (note the colon): "//:span[@class='findme']" or "//*:span[@class='findme']".
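Another way around the namespace problem, not taken from this answer, is the standard XPath local-name() function, which compares element names without regard to their namespace. A small sketch, assuming a hypothetical XHTML sample string:

```cfml
<!--- Hypothetical XHTML input: the default namespace is what defeats a plain //span --->
<cfset xhtml = '<html xmlns="http://www.w3.org/1999/xhtml"><body><span class="findme">The Goods</span></body></html>'>
<cfset dom = XmlParse(xhtml, true)>

<!--- Match any element whose local name is span, whatever namespace it is in --->
<cfset nodes = XmlSearch(dom, "//*[local-name()='span'][@class='findme']")>

<cfoutput>#ArrayLen(nodes)# match(es)</cfoutput>
```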

See the jTidy API documentation for a complete overview of what jTidy can do.
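For the clean-up step itself, one possible approach is to call the JTidy Java class directly from CFML. The sketch below assumes the JTidy jar is on the ColdFusion classpath and that its `org.w3c.tidy.Tidy` class exposes the setters used here. `rawHtml` and `scrubbedHtml` are illustrative names, and the exact output depends on the JTidy version and settings, so treat it as a starting point rather than a tested recipe.

```cfml
<!--- Assumption: the JTidy jar is available to ColdFusion's class loader --->
<cfset rawHtml = '<p>Some text <span class="findme">The Goods</span>'>

<cfset tidy = CreateObject("java", "org.w3c.tidy.Tidy").init()>
<!--- Ask for XML output with numeric entities so XmlParse() can read it --->
<cfset tidy.setXmlOut(true)>
<cfset tidy.setNumEntities(true)>
<!--- Keep Tidy from writing its report and warnings --->
<cfset tidy.setQuiet(true)>
<cfset tidy.setShowWarnings(false)>

<!--- Tidy reads from an InputStream and writes to an OutputStream --->
<cfset inStream = CreateObject("java", "java.io.ByteArrayInputStream").init(rawHtml.getBytes())>
<cfset outStream = CreateObject("java", "java.io.ByteArrayOutputStream").init()>
<cfset tidy.parse(inStream, outStream)>

<!--- The cleaned markup, ready to hand to XmlParse() --->
<cfset scrubbedHtml = outStream.toString()>
```

Input with non-ASCII characters may need explicit character-encoding settings on both the byte conversion and the Tidy object.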

Tomalak
What do you mean by well-formed HTML? It's already an HTML block of text from a WYSIWYG editor, so is this needed? Also, what would `result` be set to? One string, an array, etc.? There will be multiple spans that should match in the text.
AnApprentice
Well-formed means by-the-book, according-to-the-spec, no-errors-whatsoever HTML. Better yet XHTML. Not all WYSIWYG editors produce that. Something that can be fed to an XML parser without making it choke. jTidy can easily clean up lax HTML that does not conform to the spec. Once you have done that, you can evaluate the *structure* of it instead of throwing mind-boggling regular expressions at it that never will be quite good enough to do the job. `result` would then contain an array of matching `<span>` nodes. My advice is to abandon the regex approach, tempting as it may be, in favor of this.
Tomalak
Ok this helps, but why is this better than the REGEX?
AnApprentice
@nobosh: Because HTML is a language that cannot be parsed successfully with regex. Even if it is a predictable sample of HTML in a controlled environment - regex just can't understand the nested structures. What if your `<span>` contains other `<span>`s? Regex cannot find the correct end tag, an HTML/XML parser can. Since ColdFusion is built on Java, you can easily utilize one of the existing libraries.
Tomalak
@tomalak, I just hooked this all up. Sometimes it works, other times it errors. I'm getting: "An error occured while Parsing an XML document. Content is not allowed in prolog" or "The markup in the document following the root element must be well-formed." Is there a way to make this a little less buggy or sensitive? I'm going to be using copy from a WYSIWYG editor, so this seems way too sensitive. Thanks.
AnApprentice
Did you try jTidy to clean up the malformed HTML, as he suggested?
Carson Myers
@nobosh: Tidy has several options for what it should produce. By default, it produces HTML (which is not XML), but `XmlParse()` understands XML only. It's best to set up Tidy to produce XML (or XHTML with numeric entities). This format can be fed to `XmlParse()` without problems.
Tomalak