views: 836
answers: 10

I know that spellcheckers are not perfect, but they become more useful as the amount of text grows. How can I spell check a site which has thousands of pages?

Edit: Because of complicated server-side processing, the only way I can get the pages is over HTTP. Also it cannot be outsourced to a third party.

Edit: I have a list of all of the URLs on the site that I need to check.

A: 

You could do this with a shell script combining wget with aspell. Did you have a programming environment in mind?

I'd personally use python with Beautiful Soup to extract the text from the tags, and pipe the text through aspell.
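
A rough sketch of the shell route (assuming GNU wget and aspell are installed; example.com and the /tmp path are placeholders for your own site and working directory):

# Mirror the site, strip HTML tags, and list the words aspell doesn't know
wget -q -m -P /tmp/site http://www.example.com
find /tmp/site -name '*.html' |
while read -r f
do
        sed 's/<[^>]*>//g' "$f" | aspell list
done | sort -u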

Ant
+1  A: 

If it's a one-off, then given the number of pages to check it might be worth considering something like spellr.us, which would be a quick solution. You can enter your website URL on the homepage to get a feel for how it would report spelling mistakes.

http://spellr.us/

but I'm sure there are some free alternatives.

kevchadders
A: 

Use templates (well) with your web app (if you're programming the site instead of just writing HTML), and an HTML editor that includes spell-checking. Eclipse does, for one.

If that's not possible for some reason... yeah, wget to download the finished pages, and something like this:

http://netsw.org/dict/tools/ispell-html-mode.patch

Lee B
+2  A: 

If you can access the site's content as files, you can write a small Unix shell script that does the job. The following script prints each file's name, the line number, and any misspelled words it finds. The quality of the output depends on the quality of your system's dictionary.

#!/bin/sh

# Sort the system dictionary with the same collation used for the word
# list below, so that comm(1) sees both inputs in the same order
dict=/tmp/spell.$$.dict
LC_ALL=C sort /usr/share/dict/words >"$dict"

# Find HTML files
find "$1" -name '*.html' -type f |
while read -r f
do
        # Split file into words
        sed '
# Remove CSS
/<style/,/<\/style/d
# Remove JavaScript
/<script/,/<\/script/d
# Remove HTML tags
s/<[^>]*>//g
# Remove non-word characters
s/[^a-zA-Z]/ /g
# Split words into lines
s/  */\
/g ' "$f" |
        # Remove blank lines
        sed '/^$/d' |
        # Sort the words, dropping duplicates
        LC_ALL=C sort -u |
        # Print words not in the dictionary
        comm -23 - "$dict" >/tmp/spell.$$.out
        # See if errors were found
        if [ -s /tmp/spell.$$.out ]
        then
                # Print file, line number, and matching words
                fgrep -Hno -f /tmp/spell.$$.out "$f"
        fi
done
# Remove the temporary files
rm -f /tmp/spell.$$.out "$dict"
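
A hypothetical invocation, assuming the script is saved as spellcheck.sh and the site has been mirrored into ./site (e.g. with wget -m):

sh spellcheck.sh site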
Diomidis Spinellis
+1 :: Even if you cannot get the site source files, you can use wget -m (mirror mode) to spider the site.
garrow
This does not filter out JavaScript and CSS embedded in the HTML.
Liam
Also, some words like 'at' and 'me' are output as misspelled words even though they are in the dictionary.
Liam
I modified the code to remove JavaScript and CSS. Note: the code is an example, you should modify it to make it fit your setup.
Diomidis Spinellis
Awesome! I'll test it out
Liam
+5  A: 

Lynx seems to be good at getting just the text I need (body content and alt text) and ignoring what I don't need (embedded JavaScript and CSS).

lynx -dump http://www.example.com

It also lists all URLs (converted to their absolute form) in the page, which can be filtered out using grep:

lynx -dump http://www.example.com | grep -v "http"

The URLs could also be local (file://) if I have used wget to mirror the site.

I will write a script that will process a set of URLs using this method, and output each page to a separate text file. I can then use an existing spellchecking solution to check the files (or a single large file combining all of the small ones).
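
A rough sketch of that script (urls.txt, holding one URL per line, and the output file names are assumptions; aspell stands in for whatever spellchecker you use):

#!/bin/sh
# Dump each page's rendered text to its own file, filtering out the
# URL list that lynx appends, then check all the output in one pass
n=0
while read -r url
do
        n=$((n + 1))
        lynx -dump "$url" | grep -v "http" > "page-$n.txt"
done < urls.txt
cat page-*.txt | aspell list | sort -u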

This will ignore text in title and meta elements. These can be spellchecked separately.

Liam
You can use wget -r to grab all your web pages recursively. Then, run lynx on the local files, and spellcheck from there.
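For instance (a hypothetical sketch; example.com is a placeholder, and aspell stands in for your spellchecker):

wget -r http://www.example.com
find www.example.com -name '*.html' -exec lynx -dump {} \; | aspell list | sort -u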
strager
A: 

We use the Telerik RAD Spell control in our ASP.NET applications.

Telerik RAD Spell

Michael Kniskern
A: 

You may want to check out a library like jspell.

Jas Panesar
A: 

allankirsch.tech.officelive.com offers a content quality assurance service for large websites. It is effective at identifying document content issues and provides suggestions to correct misspelled words, going well beyond simply spellchecking web pages. Samples of large websites are provided on the site as a demonstration of its capabilities, and technical and scientific jargon is also covered. Please see allankirsch.tech.officelive.com for more information.

A: 

Just a few days ago I discovered Spello, a web site spell checker. It uses my NHunspell library (an OpenOffice spell checker for .NET). You can give it a try.

Thomas Maierhofer
A: 

I highly recommend Inspyder InSite. It is commercial software, but a trial is available, and it is well worth the money. I have used it for years to check the spelling of client websites. It supports automation/scheduling, can integrate with custom CMS word lists, is also a good way to check links, and can generate reports.

Luke P M