Could someone show me a regular expression that looks through this document and selects the value of every href attribute that ends in RELATION_ID? For each match, I then need the ID that comes before the question mark (for example, 37004e1f800021f3 in href="dctm://ISDOFSDdev/37004e1f800021f3?DMS_OBJECT_SPEC=RELATION_ID").

Thanks!

<?xml version="1.0" encoding="utf-8"?>
<?dctm xml_app="elearningContent"?>
<!DOCTYPE OnlineContent PUBLIC "-//ISDOFSD//DTD Online Content//EN" "file:C:/dmExport/New%20Folder%20(2)/ISDOFSDdev/elearningContent/OnlineContent.dtd">
<OnlineContent outputclass="Graphic Down" id="OnlineContent_955627C91D8743B98DCB8BD9BE379DE8">
    <title>Text and Popup</title>
    <OnlineContentBody>
     <lcInstruction id="lcInstruction_770F26218C064A84BFA1813562173970">
      <p>This is an example of a plain text screen with an attached popup.</p>
      <p>
       Popups are used to display additional content in a popup window. A <xref scope="local" type="topic" format="dita" href="dctm://ISDOFSDdev/37004e1f800021f3?DMS_OBJECT_SPEC=RELATION_ID">link is provided</xref> in the main text of the screen, which may clicked on to open a popup. A screen may contain <xref scope="local" type="topic" format="dita" href="dctm://ISDOFSDdev/37004e1f800021f4?DMS_OBJECT_SPEC=RELATION_ID">more than one popup</xref>.
      </p>
     </lcInstruction>
    </OnlineContentBody>
    <OnlinePopup id="OnlinePopup_AFE53E2CACBF4D8196E6360D4DDB6B70">
     <title>A Popup</title>
     <OnlinePopupBody>
      <p>This is an example of popup content.</p>
      <p>A popup may contain one or more paragraphs of text. They may also contain lists, like this:</p>
      <ul id="ul_7812991BBBDD4995B7499A9557C4EA9C">
       <li id="li_E83BDB28EC494B98BFF3DD5924AF855E">An item in a list</li>
       <li id="li_270F2A3A85BA4E6EBF98CB4023344475">Another item in a list</li>
      </ul>
      <p>A numbered list is demonstrated in the second popup.</p>
     </OnlinePopupBody>
    </OnlinePopup>
    <OnlinePopup id="OnlinePopup_5AE081BFB97043CE99F39A9E4A063332">
     <title>Another Popup</title>
     <OnlinePopupBody>
      <p>This is the second popup on this screen, containing a numbered list.</p>
      <ol id="ol_EF18C080E7CC40B7998DEB75772367A6">
       <li id="li_91B42F1B886B4CF887C001577C14B3F0">An item in a list</li>
       <li id="li_95C4F32E093843FAB985A3F6981A7D07">Another item in a list</li>
      </ol>
     </OnlinePopupBody>
    </OnlinePopup>
</OnlineContent>
+1  A: 

Something like: href="[^"]*/([^"?/]*)\?[^"]*RELATION_ID[^"]*". This assumes you consistently use double quotes for your attributes. It should work in both Perl and Java.

The ([^"?/]*) group captures the part between the last slash and the question mark. In Java, you would use Matcher.group(int), here group(1), to get the value. If you're trying to get multiple values from the same document, call Matcher.find() in a loop.
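The same capture-group idea can be sketched in Python instead of Java. This is only an illustration: the `sample` string and the name `HREF_ID` are made up, and double-quoted attributes are assumed. The pattern uses `[^"]*` rather than `.*` before the slash, so a match cannot run past a closing quote when several links share a line.

```python
import re

# Hypothetical sample: two matching links on one line, as in the posted
# document, plus one link that should be ignored.
sample = (
    '<xref href="dctm://ISDOFSDdev/37004e1f800021f3?DMS_OBJECT_SPEC=RELATION_ID">a</xref> '
    '<xref href="dctm://ISDOFSDdev/37004e1f800021f4?DMS_OBJECT_SPEC=RELATION_ID">b</xref> '
    '<xref href="dctm://ISDOFSDdev/37004e1f800021f5?DMS_OBJECT_SPEC=OTHER">c</xref>'
)

# Group 1 is the path segment between the last slash and the question mark.
HREF_ID = re.compile(r'href="[^"]*/([^"?/]*)\?[^"]*RELATION_ID[^"]*"')

# finditer plays the role of calling Matcher.find() in a loop.
ids = [m.group(1) for m in HREF_ID.finditer(sample)]
print(ids)
```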

sblundy
Thanks!! Is there an easy way to get the ID that is between the slash and the question mark, then?
joe
+1  A: 

It might not be prudent to attack this with a plain-old regex. XPath with a built-in url-parsing function might be a better solution.

As stated before, the best solution depends on the language you're using.
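As a rough sketch of the parser-based route, assuming Python: the standard library's ElementTree supports only a small XPath subset and has no URL-parsing function, so here `urllib.parse` does that part. The `doc` string is a hypothetical, trimmed-down stand-in for the posted file.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit, parse_qs

# Hypothetical, trimmed-down version of the posted document.
doc = """<OnlineContent>
  <p>
    <xref href="dctm://ISDOFSDdev/37004e1f800021f3?DMS_OBJECT_SPEC=RELATION_ID">one</xref>
    <xref href="dctm://ISDOFSDdev/37004e1f800021f4?DMS_OBJECT_SPEC=RELATION_ID">two</xref>
    <xref href="dctm://ISDOFSDdev/37004e1f800021f5?DMS_OBJECT_SPEC=OTHER">three</xref>
  </p>
</OnlineContent>"""

root = ET.fromstring(doc)
ids = []
# ".//*[@href]" selects every descendant element that has an href attribute.
for elem in root.iterfind(".//*[@href]"):
    url = urlsplit(elem.get("href"))
    params = parse_qs(url.query)
    if params.get("DMS_OBJECT_SPEC") == ["RELATION_ID"]:
        # The ID is the last path segment, e.g. "/37004e1f800021f3".
        ids.append(url.path.rsplit("/", 1)[-1])

print(ids)
```

Because the attribute value is read through the parser, the quoting style and the attribute order in the source file no longer matter.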

Cybis
+1  A: 

Maybe something like this: href="([^"]+)/([^"]+?)\?([^"]+?)RELATION_ID". Use the second group if you're only looking for the ID part (37004e1f800021f3 in your example).

Tjofras
+1  A: 

Here is a python solution:

import re

# test_string is assumed to hold the XML document from the question
expr = re.compile(r'href=.*?/(.*?)\?.*?=RELATION_ID')

for x in expr.finditer(test_string):  # iterate through all matches
    s = x.group(1)      # get the one and only group of the match
    ss = s.split("/")   # split off the ISDOFSDdev
    s = ss[-1]          # grab the last element
    print(s)            # print it

Output where test_string is the string you posted:

37004e1f800021f3
37004e1f800021f4

Again, this is in Python, but you should be able to replicate it with any modern regex library.

It is extremely difficult to write a regular expression that pulls out just the ID. I am not saying it is impossible, but it is often easier to get close with the regex and then split what you need out of the substring the regex gives you.

Documentation on the python regex module.

grieve
Well Will showed an extremely easy regular expression that does pull out the ID. :) :(
grieve
+4  A: 

You can use this regular expression:

[a-fA-F0-9]+(?=\?DMS_OBJECT_SPEC=RELATION_ID)

which matches the hex number immediately before the query string.

I'd also suggest using XPath to do this over regex.
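A small sketch of how the lookahead behaves, assuming Python's `re` module (the `sample` string is invented for illustration). Because `(?=...)` only asserts what follows without consuming it, the match itself is just the hex ID, so no capture group is needed.

```python
import re

# Hypothetical sample: only the first href ends in RELATION_ID.
sample = (
    'href="dctm://ISDOFSDdev/37004e1f800021f3?DMS_OBJECT_SPEC=RELATION_ID" '
    'href="dctm://ISDOFSDdev/37004e1f800021f4?DMS_OBJECT_SPEC=OTHER"'
)

pattern = re.compile(r'[a-fA-F0-9]+(?=\?DMS_OBJECT_SPEC=RELATION_ID)')

# With no groups in the pattern, findall returns the full matches.
ids = pattern.findall(sample)
print(ids)
```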

Will
I misread the question, sorry.
Will
Thanks that works good now.
joe
+2  A: 

As you have XML data, why not use an XSLT stylesheet? This example uses only XPath 1.0 functions, which are somewhat limited, and outputs the full value of each matching href attribute.

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     >
     <xsl:output method="text" indent="no"/>
     <xsl:template match="*[@href]">
      <xsl:if test="contains(@href, 'RELATION_ID')">
       <xsl:value-of select="@href"/>
       <xsl:text>&#xa;</xsl:text>
      </xsl:if>
      <xsl:apply-templates select="*"/>
     </xsl:template>
     <xsl:template match="*">
      <xsl:apply-templates select="*"/>
     </xsl:template>
</xsl:stylesheet>

Assuming you name the given file "example.xml" and the XSLT stylesheet "example-xslt.xsl", you can use the following command to save the result to a file "out.txt" using MSXSL.exe:

C:\Documents and Settings\fer\Escritorio>msxsl.exe -xw example.xml example-xslt.xsl > out.txt

Edit: Next is an XSLT stylesheet using XPath 2.0, which lets you use the power of regular expressions inside string-handling functions. The result is the ID inside the URL you were looking for (instead of the whole value of the href attributes).

<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fn="http://www.w3.org/2005/xpath-functions" >
     <xsl:output method="text" indent="no"/>
     <xsl:template match="*[@href]">
      <xsl:if test="fn:contains(@href, 'RELATION_ID')">
       <xsl:value-of select="fn:replace(@href,'.*/([^/]*)\?.*', '$1')"/>
       <xsl:text>&#xa;</xsl:text>
      </xsl:if>
      <xsl:apply-templates select="*"/>
     </xsl:template>
     <xsl:template match="*">
      <xsl:apply-templates select="*"/>
     </xsl:template>
</xsl:stylesheet>

There are not many free XSLT v2.0 processors out there, but AltovaXML-2008 is one of those. The following command line gives you the expected result.

C:\Documents and Settings\fer\Escritorio>AltovaXML -xslt2 example-xslt.xsl -in example.xml
Fernando Miguélez
A: 

First find the href attribute using this regex: href="[^=]*=RELATION_ID"

Once you have a collection of those attributes, use the following regex to pull out the part of the URL before the question mark (the ID is its last path segment): dctm:[^?]*

Explanation of first regex

href=" : Match the characters "href="" literally
[^=]* : Match any character that is NOT a "=" between zero and unlimited times
=RELATION_ID" : Match the characters "=RELATION_ID"" literally.

Explanation of second regex

dctm: : Match the characters "dctm:" literally
[^?]* : Match any character that is NOT a "?" between zero and unlimited times.
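The two steps can be sketched in Python as follows; this is an illustration rather than a definitive implementation, and the `sample` string is made up. Note that the second regex returns the whole URL up to the question mark, so a final split on "/" is assumed here to isolate the ID.

```python
import re

# Hypothetical sample with two matching links.
sample = (
    '<xref href="dctm://ISDOFSDdev/37004e1f800021f3?DMS_OBJECT_SPEC=RELATION_ID">a</xref>'
    '<xref href="dctm://ISDOFSDdev/37004e1f800021f4?DMS_OBJECT_SPEC=RELATION_ID">b</xref>'
)

ids = []
# Step 1: collect the href attributes that end in =RELATION_ID.
for attr in re.findall(r'href="[^=]*=RELATION_ID"', sample):
    # Step 2: keep everything from "dctm:" up to the question mark.
    url = re.search(r'dctm:[^?]*', attr).group(0)
    # The ID is the last path segment of that URL.
    ids.append(url.rsplit("/", 1)[-1])

print(ids)
```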

If you are going to use regular expressions often, you should strongly consider buying RegexBuddy at http://www.regexbuddy.com/

Jason