Hi,

I've been tasked with building an accessible RSS feed for my company's job listings. I already have an RSS feed from our recruiting partner, so I'm transforming their RSS XML into our own proxy RSS feed to add additional data as well as limit the number of items in the feed, so that we list only the latest jobs.

The RSS validates via feedvalidator.org (with warnings), but the problem is this: no matter how many times I tell them not to, my company's HR team copies and pastes their Word documents directly into our recruiting partner's CMS when inserting new job listings, leaving WordML in my feed. I believe this WordML is causing issues with Feedburner's BrowserFriendly feature, which we want to show up to make it easier for people to subscribe. Therefore, I need to remove the WordML markup from the feed.

Anybody have experience doing this? Can anyone point me to a good solution to this problem?

Preferably, I'd like to be pointed to a solution in .NET (VB or C# is fine) and/or XSL.

Any advice on this is greatly appreciated.

Thanks.

A: 

I would do something like this:

// Word "smart" punctuation: right/left single quotes, left/right double quotes, en dash.
char[] charToRemove = { (char)8217, (char)8216, (char)8220, (char)8221, (char)8211 };
// Plain ASCII replacements, in the same order as charToRemove.
char[] charToAdd = { '\'', '\'', '"', '"', '-' };
string cleanedStr = "Your WordML filled Feed Text.";

for (int i = 0; i < charToRemove.Length; i++)
{
    // The char overload of Replace avoids the GetValue()/ToString() round trip.
    cleanedStr = cleanedStr.Replace(charToRemove[i], charToAdd[i]);
}

This looks for the listed characters (the Word special characters that mess everything up) and replaces them with their ASCII equivalents.

Jeremy Reagan
A: 

Jeff Atwood blogged about how to do this a while ago. His post contains some C# code that will clean the WordML.

http://www.codinghorror.com/blog/archives/000485.html

d4nt
Jeff's article is about cleaning up the nasty HTML that Word generates, not stripping out the XML elements from a WordML file.
Chris Zwiryk
The questioner was saying that content copied and pasted from Word contains lots of unwanted HTML tags. Jeff's code will remove those.
d4nt
+1  A: 

I haven't yet worked with WordML, but assuming that its elements are in a different namespace from RSS, it should be quite simple to do with XSLT.

Start with a basic identity transform (a stylesheet that adds all nodes from the input doc "as is" to the output tree). You need these two templates:

  <!-- Copy all elements, and recur on their child nodes. -->
  <xsl:template match="*">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>

  <!-- Copy all non-element nodes. -->
  <xsl:template match="@*|text()|comment()|processing-instruction()">
    <xsl:copy/>
  </xsl:template>

A transformation using a stylesheet containing just the above two templates would exactly reproduce its input document on output, modulo those things that standards-compliant XML processors are permitted to change, such as entity replacement.

Now, add in a template that matches any element in the WordML namespace. Let's give it the namespace prefix 'wml' for the purposes of this example:

  <!-- Do not copy WordML elements or their attributes to the 
       output tree; just recur on child nodes. -->
  <xsl:template match="wml:*">
    <xsl:apply-templates/>
  </xsl:template>

The beginning and end of the stylesheet are left as an exercise for the coder.
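For reference, one way the exercise might be completed is sketched below: the three templates from this answer wrapped in a stylesheet element. The namespace URI bound to 'wml' is an assumption (it is the Word 2003 WordprocessingML namespace); markup pasted from Word as HTML often uses urn:schemas-microsoft-com:office:word instead, so check which URI your feed actually declares and change it to match. This also only helps if the Word markup appears as real namespaced elements in the feed; if it is escaped as text inside a description element, XSLT will not see it as elements.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:wml="http://schemas.microsoft.com/office/word/2003/wordml">
  <!-- ASSUMPTION: the wml namespace URI above must match the one used
       in your feed; adjust it if your feed declares a different URI. -->

  <xsl:output method="xml" encoding="utf-8"/>

  <!-- Copy all elements, and recur on their child nodes. -->
  <xsl:template match="*">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>

  <!-- Copy all non-element nodes. -->
  <xsl:template match="@*|text()|comment()|processing-instruction()">
    <xsl:copy/>
  </xsl:template>

  <!-- Do not copy WordML elements or their attributes to the
       output tree; just recur on child nodes. This pattern has a
       higher default priority than match="*", so it wins for any
       element in the wml namespace. -->
  <xsl:template match="wml:*">
    <xsl:apply-templates/>
  </xsl:template>

</xsl:stylesheet>
```

Note that xsl:copy also copies namespace nodes, so a WordML namespace declaration on a copied RSS element may still appear in the output even though no WordML elements remain.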

ChuckB