Why would the code below strip the whitespace around matched keyword text when replacing it with an anchor link? Note that this error only occurs in Chrome, not Firefox.

For complete context, the file is located at: http://seox.org/lbp/lb-core.js

To see the code in action (no errors found yet), visit the demo page at http://seox.org/test.html. Copying and pasting the first paragraph into a rich text editor (e.g. Dreamweaver, or Gmail with the rich text editor turned on) reveals the problem: the words are bunched together. Pasting it into a plain text editor does not.

// Find page text (not in links) -> doxdesk.com
function findPlainTextExceptInLinks(element, substring, callback) {
    for (var childi= element.childNodes.length; childi-->0;) {
        var child= element.childNodes[childi];
        if (child.nodeType===1) {
            if (child.tagName.toLowerCase()!=='a')
                findPlainTextExceptInLinks(child, substring, callback);
        } else if (child.nodeType===3) {
            var index= child.data.length;
            while (true) {
                index= child.data.lastIndexOf(substring, index);
                if (index===-1 || limit.indexOf(substring.toLowerCase()) !== -1)
                    break;
                // don't match an alphanumeric char
                var dontMatch =/\w/;
                if(child.nodeValue.charAt(index - 1).match(dontMatch) || child.nodeValue.charAt(index+keyword.length).match(dontMatch))
                    break;
                // alert(child.nodeValue.charAt(index+keyword.length + 1));
                callback.call(window, child, index)
            }
        }
    }
}

// Linkup function, call with various type cases (below)
function linkup(node, index) {

    node.splitText(index+keyword.length);
    var a= document.createElement('a');
    a.href= linkUrl;
    a.appendChild(node.splitText(index));
    node.parentNode.insertBefore(a, node.nextSibling);
    limit.push(keyword.toLowerCase()); // Add the keyword to memory
    urlMemory.push(linkUrl); // Add the url to memory
}

// lower case (already applied)
findPlainTextExceptInLinks(lbp.vrs.holder, keyword, linkup);

Thanks in advance for your help. I'm nearly ready to launch the script, and will gladly comment in kudos to you for your assistance.

A: 

I'd like to help you more, but it's hard to guess without being able to test it. I suppose you can get around it by adding space-like characters around your links, e.g. &nbsp;.

By the way, this feature of yours that adds helpful links on copying is really interesting.

mqchen
Thanks chen. You should be able to test it at the link I provided. Please let me know what problems you're running into.
Matrym
A: 

It's not anything to do with the linking functionality; it happens to copied links that are already on the page too, and the credit content, even if the processSel() call is commented out.

It seems to be a weird bug in Chrome's rich text copy function. The content in the holder is fine; if you cloneContents the selected range and alert its innerHTML at the end, the whitespaces are clearly there. But whitespaces just before, just after, and at the inner edges of any inline element (not just links!) don't show up in rich text.

Even if you add new text nodes to the DOM containing spaces next to a link, Chrome swallows them. I was able to make it look right by inserting non-breaking spaces:

var links= lbp.vrs.holder.getElementsByTagName('a');
for (var i= links.length; i-->0;) {
    links[i].parentNode.insertBefore(document.createTextNode('\xA0 '), links[i]);
    links[i].parentNode.insertBefore(document.createTextNode(' \xA0'), links[i].nextSibling);
}

but that's pretty ugly, should be unnecessary, and doesn't fix up other inline elements. Bad Chrome!

var keyword = links[i].innerHTML.toLowerCase();

It's unwise to rely on innerHTML to get text from an element, as the browser may escape or not-escape characters in it. Most notably &, but there's no guarantee over what characters the browser's innerHTML property will output.

As you seem to be using jQuery already, grab the content with text() instead.
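
If a library-free route is wanted instead, the text can be gathered by walking the element's child nodes depth-first. The following is only a minimal sketch: getText is a hypothetical name, and the plain object at the end is a made-up stand-in for a real link element so the snippet can run outside a page.

```javascript
// Collect the text of an element by recursing into element children
// (nodeType 1) and appending the data of text nodes (nodeType 3).
function getText(node) {
    var text = '';
    for (var i = 0; i < node.childNodes.length; i++) {
        var child = node.childNodes[i];
        if (child.nodeType === 1) {
            text += getText(child);
        } else if (child.nodeType === 3) {
            text += child.data;
        }
    }
    return text;
}

// Hypothetical stand-in for <a>R&amp;D <b>team</b></a>, built from plain objects.
var fakeLink = {
    nodeType: 1,
    childNodes: [
        { nodeType: 3, data: 'R&D ' },
        { nodeType: 1, childNodes: [{ nodeType: 3, data: 'team' }] }
    ]
};
var linkText = getText(fakeLink);
```

Because the text comes straight from the text nodes, the & arrives unescaped, which is the point of avoiding innerHTML here.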

var isDomain = new RegExp(document.domain, 'g');
if (isDomain.test(linkUrl)) { ...

That'll fail every second time, because global (g) regexps remember their previous state (lastIndex): when used with methods like test, you're supposed to keep calling them repeatedly until they return no match.
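
A small sketch of that behaviour, using a made-up domain and URL rather than the values from the script:

```javascript
// Hypothetical values, for illustration only.
var isDomain = new RegExp('example.org', 'g');
var linkUrl = 'http://example.org/page';

// First call: a match is found and lastIndex moves to just past it.
var first = isDomain.test(linkUrl);
var indexAfterFirst = isDomain.lastIndex;

// Second call: the search resumes from lastIndex, finds nothing more,
// reports no match and resets lastIndex to 0.
var second = isDomain.test(linkUrl);

// Third call: starts from 0 again, so it matches once more.
var third = isDomain.test(linkUrl);
```

The same string is tested each time, yet the result alternates, purely because of the state stored on the regexp object.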

You don't seem to need g (multiple matches) here... but then you don't seem to need regexp here either as a simple String indexOf would be more reliable. (In a regexp, each . in the domain would match any character in the link.)

Better still, use the URL decomposition properties on Location to do a direct comparison of hostnames, rather than crude string-matching over the whole URL:

if (location.hostname===links[i].hostname) { ...
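
To make the difference concrete, here is a rough sketch that can run outside a page. It swaps in the standard URL constructor for the hostname property of location and of the link elements (an assumption, made only so the example is self-contained), and the lookalike address is invented.

```javascript
// Hypothetical addresses, for illustration only.
var pageUrl = 'http://seox.org/test.html';
var sameSiteLink = 'http://seox.org/lbp/lb-core.js';
var lookalikeLink = 'http://seoxxorg.example/';

// Crude matching: the unescaped "." in the pattern matches any character,
// so an unrelated host is accepted.
var crudeMatch = new RegExp('seox.org').test(lookalikeLink);

// Direct comparison of the parsed hostnames (sameHost is a hypothetical helper).
function sameHost(a, b) {
    return new URL(a).hostname === new URL(b).hostname;
}
var sameSite = sameHost(pageUrl, sameSiteLink);
var lookalike = sameHost(pageUrl, lookalikeLink);
```

Here crudeMatch comes out true for the lookalike address, while the hostname comparison accepts only the link that really is on the same host.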

// don't match an alphanumeric char
var dontMatch =/\w/;
if(child.nodeValue.charAt(index - 1).match(dontMatch) || child.nodeValue.charAt(index+keyword.length).match(dontMatch))
    break;

If you want to match words on word boundaries, and case insensitively, I think you'd be better off using a regex rather than plain substring matching. That'd also save doing four calls to findText for each keyword as it is at the moment. You can grab the inner bit (in if (child.nodeType==3) { ...) of the function in this answer and use that instead of the current string matching.

The annoying thing about making regexps from strings is adding a load of backslashes to the punctuation, so you'll want a function for that:

// Backslash-escape string for literal use in a RegExp
//
function RegExp_escape(s) {
    return s.replace(/([/\\^$*+?.()|[\]{}])/g, '\\$1');
}

var keywordre= new RegExp('\\b'+RegExp_escape(keyword)+'\\b', 'gi');
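
As a quick check of what the escaping buys you, here is a sketch with a made-up keyword and sentence (neither comes from the script):

```javascript
// Backslash-escape string for literal use in a RegExp (same function as above).
function RegExp_escape(s) {
    return s.replace(/([/\\^$*+?.()|[\]{}])/g, '\\$1');
}

// Hypothetical keyword and text, for illustration only.
var keyword = 'seo.tools';
var text = 'SEO.tools beats seoXtools and myseo.tools';

// Escaped: the "." is literal, and \b keeps the match on word boundaries.
var escapedRe = new RegExp('\\b' + RegExp_escape(keyword) + '\\b', 'gi');
var escapedMatches = text.match(escapedRe);

// Unescaped: the "." matches any character, so "seoXtools" is picked up too.
var rawRe = new RegExp('\\b' + keyword + '\\b', 'gi');
var rawMatches = text.match(rawRe);
```

With escaping, only the first word matches: "seoXtools" is rejected because the dot is literal, and "myseo.tools" is rejected because there is no word boundary before "seo".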

You could even do all the keyword replacements in one go for efficiency:

var keywords= [];
var hrefs= [];
for (var i=0; i<links.length; i++) {
    ...
    var text= $(links[i]).text();
    keywords.push('(\\b'+RegExp_escape(text)+'\\b)');
    hrefs.push(links[i].href);
}
var keywordre= new RegExp(keywords.join('|'), 'gi');

and then for each match in linkup, check which match group actually matched (has non-zero length) and link with the corresponding hrefs entry.
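
One way that lookup might go, as a sketch only: the keyword/URL pairs, the sample sentence and the hrefForMatch helper are all made up for illustration, and it collects the matches into an array instead of touching the DOM.

```javascript
function RegExp_escape(s) {
    return s.replace(/([/\\^$*+?.()|[\]{}])/g, '\\$1');
}

// Hypothetical keyword/URL pairs standing in for the links on the page.
var pairs = [
    { text: 'chrome', href: 'http://example.com/chrome' },
    { text: 'rich text', href: 'http://example.com/rich-text' }
];

// One capturing group per keyword; hrefs is kept in the same order.
var keywords = [];
var hrefs = [];
for (var i = 0; i < pairs.length; i++) {
    keywords.push('(\\b' + RegExp_escape(pairs[i].text) + '\\b)');
    hrefs.push(pairs[i].href);
}
var keywordre = new RegExp(keywords.join('|'), 'gi');

// Groups are numbered from 1, the array from 0: group g belongs to hrefs[g - 1].
// Groups that did not take part in the match are undefined.
function hrefForMatch(match) {
    for (var g = 1; g < match.length; g++) {
        if (match[g] !== undefined) {
            return hrefs[g - 1];
        }
    }
    return null;
}

var text = 'Chrome drops spaces in rich text copies';
var found = [];
var match;
while ((match = keywordre.exec(text)) !== null) {
    found.push({ word: match[0], index: match.index, href: hrefForMatch(match) });
}
```

In the real script, the body of the while loop would call the linking code with match.index and the chosen href rather than pushing onto found.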

bobince
Bobince, you're my hero :). Did you notice the doxdesk kudos? You'll be showered with appreciation on my project page!
Matrym
Heh! Just noticed I forgot to link the other answer containing the regex-based `findText`... fixed.
bobince
Matrym
I meant library independent. Typo
Matrym
"and then for each match in linkup, check which match group has non-zero length and link with the hrefs[ of the same number." <-- Sorry, but I'm unable to follow you. Could you show me? http://jsbin.com/oroxo3/edit
Matrym
The canonical way to get text (and what `text()` uses) is to do a depth-first traversal of the DOM tree from the element's childNodes collecting text (ie. recurse on `child.nodeType===1` and add to string on `child.nodeType===3`). There is also the DOM Level 3 Core property `element.textContent`, but it isn't supported in IE or some other older browsers. On IE you can branch and use `element.innerText` instead, but this isn't quite exactly the same (it is especially sloppy about whitespaces).
bobince
Well, if you have a regexp like `(term1)|(term2)|(term3)`, you can use a replacement function that takes a `match` object and looks at `match[1]`. If it's undefined we know that `term1` was not the expression that caused the match; then look at `match[2]`, and so on until you find which term it was that matched.
bobince