ansaurus

Question

How to "HTML encode" Em Dash in Visual Basic.NET

Answer 1

A:

Take a look at A List Apart, as I suggested in the HTML Apostrophe question.

The em dash — is represented by &mdash;.

mouviciel 2009-01-08 10:56:55

I should have been clearer - my problem is not finding what to encode it to, it's finding what to encode it from. I'll fix the question to make that clear.

RB 2009-01-08 11:15:30

Answer 2

+2 A:

As this character is not an ASCII character, how do I encode it?

It's not an ASCII character, but it is a Unicode character, U+2014. If your page output is going to be UTF-8, which in this day and age it really should be, you don't need to HTML-encode it, just output the character directly.

Are there other characters which are likely to give me problems.

What problems exactly is it giving you? If you can't output '—', you probably can't output any other non-ASCII Unicode character, and there are thousands of them.

Replace "\u2014" with "&#x2014;" if you really must, but with today's Unicode-aware tools there should be no need to go around replacing every non-ASCII Unicode character with markup.
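VB.NET string literals have no "\u2014" escape, so in VB the character has to be built with ChrW. The following is only a minimal sketch of both the single replacement and the "every non-ASCII character" variant; the names DashDemo and EncodeNonAscii are hypothetical and do not come from the question.

```vb
Imports System
Imports System.Text
Imports Microsoft.VisualBasic

Module DashDemo
    ' Hypothetical helper (not from the question): replaces every character
    ' above U+007F with a hexadecimal numeric character reference.
    ' Surrogate pairs are combined into a single code point first.
    Function EncodeNonAscii(ByVal source As String) As String
        If source Is Nothing Then Return Nothing
        Dim sb As New StringBuilder(source.Length)
        Dim i As Integer = 0
        While i < source.Length
            Dim c As Char = source(i)
            If AscW(c) < 128 Then
                sb.Append(c)
            ElseIf Char.IsHighSurrogate(c) AndAlso i + 1 < source.Length _
                   AndAlso Char.IsLowSurrogate(source(i + 1)) Then
                Dim codePoint As Integer = Char.ConvertToUtf32(c, source(i + 1))
                sb.Append("&#x").Append(codePoint.ToString("X")).Append(";")
                i += 1 ' skip the low surrogate that was just consumed
            Else
                sb.Append("&#x").Append(AscW(c).ToString("X")).Append(";")
            End If
            i += 1
        End While
        Return sb.ToString()
    End Function

    Sub Main()
        ' ChrW(&H2014) is the VB.NET way to write the em dash, U+2014.
        Dim emDash As String = ChrW(&H2014)
        Dim sample As String = "before " & emDash & " after"

        ' Single targeted replacement.
        Console.WriteLine(sample.Replace(emDash, "&#x2014;"))
        ' Blanket replacement of everything outside ASCII.
        Console.WriteLine(EncodeNonAscii(sample))
    End Sub
End Module
```

Both lines print "before &#x2014; after" for this sample. The blanket variant is only worth considering when the consumer of the output genuinely cannot handle UTF-8.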

bobince 2009-01-08 11:29:12

I've updated my question with my current solution - I think that might explain my problem better than I have been.

RB 2009-01-08 11:44:39

I'm not sure what it is trying to do. You don't need a CDATA section, HTML encoding doesn't work inside a CDATA section, and neither HtmlEncode nor HtmlDecode does anything special with Unicode characters. Don't do any of this; just use UTF-8 and spit the HtmlEncoded output directly into the page.

bobince 2009-01-08 12:58:20

Oh - sorry. I should have explained that. It's a web-feed to a property portal (in this case, www.fish4.co.uk), so it is web-content, delivered in an XML element, hence the CDATA tags.

RB 2009-01-08 17:03:18

Answer 3

A:

bobince's answer solves what seems to be your main concern: replacing your use of HtmlDecode with a more straightforward declaration of the character to replace.
Rewrite

sWebsiteText.Replace(HttpUtility.HtmlDecode("&#8211;"), "&#8211;")

as

sWebsiteText.Replace("\u2013", "&#x2013;")

('\u2014' (dec 8212) is em dash, '\u2013' (dec 8211) is en dash.)
For readability, it may be considered better to use "&#x2013;" rather than "&#8211;", since the .NET declaration for the char ("\u2013") is in hex too. But as decimal notation seems more common in HTML, I personally would prefer using "&#8211;".
For reuse, you should probably write your own HtmlEncode function in a custom HttpUtility class, so that you can call it from anywhere else in your site without duplicating it.
(Have something like this (sorry, I have written it in C#, forgetting your examples were in VB):

using System.Text;
using System.Web;

/// <summary>
/// Supplies some custom processing to some HttpUtility functions.
/// </summary>
public static class CustomHttpUtility
{
    /// <summary>
    /// Html encodes a string.
    /// </summary>
    /// <param name="input">string to be encoded.</param>
    /// <returns>A html encoded string.</returns>
    public static string HtmlEncode(string input)
    {
        if (input == null)
            return null;
        StringBuilder encodedString = new StringBuilder(
            HttpUtility.HtmlEncode(input));
        encodedString.Replace("\u2013", "&#x2013;");
        // add other missing replacements here, as for &#8212;
        encodedString.Replace("\u2014", "&#x2014;");
        //...

        return encodedString.ToString();
    }
}

Then replace

sWebsiteText = _
    "<![CDATA[" & _
    HttpUtility.HtmlEncode(sSomeText) & _
    "]]>"
'This is the bit which seems "hacky"'
sWebsiteText = _
    sWebsiteText.Replace(HttpUtility.HtmlDecode("&#8211;"), "&#8211;")

With:

sWebsiteText = _
    "<![CDATA[" & _
    CustomHttpUtility.HtmlEncode(sSomeText) & _
    "]]>"

)
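Since the question's examples are in VB, here is a rough VB.NET port of the C# helper above. It is a sketch only: it assumes a reference to System.Web (for HttpUtility), and the EnDash and EmDash fields and the Program module are names introduced here for illustration.

```vb
Imports System
Imports System.Text
Imports System.Web
Imports Microsoft.VisualBasic

''' <summary>
''' Supplies some custom processing to some HttpUtility functions.
''' </summary>
Public Module CustomHttpUtility
    ' VB.NET has no "\u2013" escape, so the dashes are built with ChrW.
    Private ReadOnly EnDash As String = ChrW(&H2013)
    Private ReadOnly EmDash As String = ChrW(&H2014)

    ''' <summary>
    ''' Html encodes a string, then replaces dashes with numeric references.
    ''' </summary>
    Public Function HtmlEncode(ByVal source As String) As String
        If source Is Nothing Then Return Nothing
        Dim encoded As New StringBuilder(HttpUtility.HtmlEncode(source))
        encoded.Replace(EnDash, "&#x2013;")
        ' Add other missing replacements here, as for the em dash.
        encoded.Replace(EmDash, "&#x2014;")
        Return encoded.ToString()
    End Function
End Module

Module Program
    Sub Main()
        Dim sSomeText As String = "Tom & Jerry " & ChrW(&H2013) & " 2009"
        Dim sWebsiteText As String = _
            "<![CDATA[" & _
            CustomHttpUtility.HtmlEncode(sSomeText) & _
            "]]>"
        Console.WriteLine(sWebsiteText)
    End Sub
End Module
```

For this sample the program prints "<![CDATA[Tom &amp; Jerry &#x2013; 2009]]>". The dash replacements run after HttpUtility.HtmlEncode so that the ampersands in the inserted references are not encoded a second time.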

2009-06-09 09:53:24
