I'm having trouble wrapping my head around this. I need to parse the following string with a regular expression to create the definition list below.

Width=3/8 in|Length=1 in|Thread - TPI or Pitch=|Bolt/Screw Length=|Material=|Coating=|Type=Snap-On|Used With=|Quantity=5000 per pack|Wt.=20 lb|Color=

The result would be something like this

<dt>Width</dt>
<dd>3/8 in</dd>
<dt>Length</dt>
<dd>1 in</dd>
<dt>Thread - TPI or Pitch</dt>
<dd></dd>
<dt>Quantity</dt>
<dd>5000 per pack</dd>
<dt>Wt.</dt>
<dd>20 lb</dd>
A: 

Something like this:

/(?:(.*?)=(.*?)(\||$))+/
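One thing to watch for with this pattern: when a capturing group sits inside a repeated group, most regex engines keep only the last iteration's captures. The sketch below uses Python's `re` module and a shortened input (both are assumptions, since the question names no language) to contrast that with matching one pair at a time.

```python
import re

# Shortened sample input, assumed for illustration only.
spec = "Width=3/8 in|Length=1 in|Color="

# With the outer (?:...)+ the whole string is one match, and the
# capturing groups hold only the final key/value/separator.
whole = re.match(r"(?:(.*?)=(.*?)(\||$))+", spec)
last_only = whole.groups()

# Dropping the outer repetition and iterating over matches instead
# yields one (name, value) pair per entry.
pairs = [(m.group(1), m.group(2))
         for m in re.finditer(r"(.*?)=(.*?)(?:\||$)", spec)]
```

Here `last_only` holds just the `Color` entry, while `pairs` contains all three entries, including the empty `Color` value.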
Greg
Greg, you missed a closing ')' just before the `+`.
Bart Kiers
Oops. Fixed - thanks
Greg
A: 

You can use

([^=|]+)=([^|]+)(?=\||$)

Apply with the "global" flag.

Explanation:

(             # start match group 1
  [^=|]+      # any character that's not a "=" or "|", at least once
)             # end match group 1
=             # a literal "="
(             # start match group 2
  [^|]+       # any character that's not a "|", at least once
)             # end match group 2
(?=           # look-ahead: followed by
  \|          # either a literal "|"
  |           # or…
  $           # the end of the string
)             # end look-ahead

The string parts you are interested in are in match groups 1 and 2, respectively. For me the above matches:

  1. Width = 3/8 in
  2. Length = 1 in
  3. Type = Snap-On
  4. Quantity = 5000 per pack
  5. Wt. = 20 lb

Your example is inconsistent in the Thread - TPI or Pitch case.
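As a minimal sketch of how the pattern might be applied, here is one way to do it in Python (an assumption, since the question names no language). The helper name `render_dl` is hypothetical, `re.findall` plays the role of the "global" flag, and the look-ahead form from the explanation is used.

```python
import re
from html import escape

# Look-ahead form of the pattern explained above.
PAIR = re.compile(r"([^=|]+)=([^|]+)(?=\||$)")

def render_dl(spec):
    """Render every non-empty name=value pair as <dt>/<dd> lines."""
    lines = ["<dl>"]
    for name, value in PAIR.findall(spec):
        # Escape the text so characters like & or < stay valid HTML.
        lines.append(f"<dt>{escape(name)}</dt>")
        lines.append(f"<dd>{escape(value)}</dd>")
    lines.append("</dl>")
    return "\n".join(lines)

spec = ("Width=3/8 in|Length=1 in|Thread - TPI or Pitch=|Bolt/Screw Length=|"
        "Material=|Coating=|Type=Snap-On|Used With=|Quantity=5000 per pack|"
        "Wt.=20 lb|Color=")
matches = PAIR.findall(spec)
```

On the sample string, `matches` contains the same five pairs listed above; entries with no value are skipped because group 2 requires at least one character.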

Tomalak
Tomalak, in some cases the data includes blank entries that have a term defined, like "Thread - TPI or Pitch", but no value associated with it. In that case a blank <dd> would be used.
jeff
Your "result something like this" example seems to omit empty entries. I figured you wanted to hide them. If you want them to show, change `[^|]+` in match group 2 to `[^|]*`.
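A rough Python sketch of the two behaviours (Python and the helper name `parse_pairs` are assumptions, not part of the thread): the `*` variant keeps entries with empty values, and a plain filter afterwards is one way to drop them from the generated results.

```python
import re

# Group 2 uses * instead of +, so empty values also match.
PAIR_ALLOW_EMPTY = re.compile(r"([^=|]+)=([^|]*)(?=\||$)")

def parse_pairs(spec, keep_empty=True):
    """Return (name, value) tuples, optionally dropping blank values."""
    pairs = PAIR_ALLOW_EMPTY.findall(spec)
    if keep_empty:
        return pairs
    # Filter in code rather than in the pattern.
    return [(name, value) for name, value in pairs if value.strip()]

sample = "Width=3/8 in|Thread - TPI or Pitch=|Color="
with_empty = parse_pairs(sample)
without_empty = parse_pairs(sample, keep_empty=False)
```

With this sample, `with_empty` has three entries (two with empty values) and `without_empty` has only the `Width` entry.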
Tomalak
You're right, I want to remove them from the result set. How would I do that? Your pattern skips them, but they are not removed from the actual results being generated.
jeff
+1  A: 

If you don't need to reorder items or change their values, and are confident the values themselves don't contain the equals signs or vertical bars used as markup in the input, you could apply a series of regular expressions to introduce the HTML. Using Java's String class from Scala, this could be a dense but effective one-liner:

"Escape test=&<>|Width=3/8 in|Length=1 in|Thread - TPI or Pitch=|Bolt/Screw Length=|Material=|Coating=|Type=Snap-On|Used With=|Quantity=5000 per pack|Wt.=20 lb|Color=".
replaceAll("&","&amp;").
replaceAll("<","&lt;").
replaceAll(">","&gt;").
replaceAll("^","<dl>\n\t<dt>").
replaceAll("=","</dt>\n\t<dd>").
replaceAll("\\|","</dd>\n\n\t<dt>").
replaceAll("$","</dd>\n</dl>")

which yields

<dl>
<dt>Escape test</dt>
<dd>&amp;&lt;&gt;</dd>

<dt>Width</dt>
<dd>3/8 in</dd>

<dt>Length</dt>
<dd>1 in</dd>

<dt>Thread - TPI or Pitch</dt>
<dd></dd>

<dt>Bolt/Screw Length</dt>
<dd></dd>

<dt>Material</dt>
<dd></dd>

<dt>Coating</dt>
<dd></dd>

<dt>Type</dt>
<dd>Snap-On</dd>

<dt>Used With</dt>
<dd></dd>

<dt>Quantity</dt>
<dd>5000 per pack</dd>

<dt>Wt.</dt>
<dd>20 lb</dd>

<dt>Color</dt>
<dd></dd>
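For comparison, the same chained-replacement idea can be sketched in Python. This is an illustrative assumption rather than part of the answer: the function name `to_dl` is hypothetical, `html.escape` stands in for the three escaping replacements, and plain `str.replace` is enough because the markup characters are literals.

```python
from html import escape

def to_dl(spec):
    """Turn 'name=value|name=value' text into a <dl> by plain substitution."""
    # Escape &, < and > first so the tags added afterwards are left alone.
    escaped = escape(spec, quote=False)
    body = (escaped
            .replace("=", "</dt>\n\t<dd>")        # name/value boundary
            .replace("|", "</dd>\n\n\t<dt>"))     # boundary between entries
    # Wrap the result in the opening and closing markup.
    return "<dl>\n\t<dt>" + body + "</dd>\n</dl>"

html_out = to_dl("Escape test=&<>|Width=3/8 in")
```

As with the Scala version, this only works if names and values never contain `=` or `|`, and it keeps entries whose value is empty.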

ewg