I know that spellcheckers are not perfect, but they become more useful as the amount of text to check grows. How can I spell check a site that has thousands of pages?
Edit: Because of complicated server-side processing, the only way I can get the pages is over HTTP. Also, the checking cannot be outsourced to a third party.
Edit: I have a list of all of the URLs on the site that I need to check.
Lynx seems to be good at getting just the text I need (body content and alt text) and ignoring what I don't need (embedded JavaScript and CSS).
lynx -dump http://www.example.com
It also lists all URLs (converted to their absolute form) in the page, which can be filtered out using grep:
lynx -dump http://www.example.com | grep -v "http"
The URLs could also be local (file://) if I have used wget to mirror the site.
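For example, a mirror could be created first and then dumped from disk; this is only a sketch, and the exact wget flags and resulting directory name will depend on the site:

wget --mirror --convert-links --no-parent http://www.example.com/
lynx -dump file://$(pwd)/www.example.com/index.html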
I will write a script that processes a set of URLs using this method and outputs each page to a separate text file (a sketch follows below). I can then use an existing spellchecking solution to check the files (or a single large file combining all of the small ones).
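A minimal sketch of that script, assuming the URLs are listed one per line in a file called urls.txt and that aspell is the spellchecker (both are my assumptions, not part of the original setup):

#!/bin/sh
# Dump each URL in urls.txt to its own text file, then list
# possible misspellings with aspell.
mkdir -p dumps
n=0
while read -r url; do
    n=$((n + 1))
    # Rendered text only; drop the appended list of absolute URLs
    lynx -dump "$url" | grep -v "http" > "dumps/page-$n.txt"
done < urls.txt
# Combine the dumps and collect unique misspelling candidates
cat dumps/*.txt | aspell list | sort -u > misspellings.txt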
This will ignore text in title and meta elements. These can be spellchecked separately (see the sketch below).
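One possible way to handle the title separately, again assuming aspell; the sed expression is only illustrative and handles a single-line title tag, and meta description text could be extracted in a similar way:

lynx -source http://www.example.com | sed -n 's/.*<title>\(.*\)<\/title>.*/\1/p' | aspell list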