Digital Content Quality Assurance Project

Context-based Spell Check | Data Mining | SEO

This website describes the development of an advanced context-based spell check system designed to handle very large business and educational websites, or document batches, containing complex terminology. It has numerous features not found anywhere else on the Internet. Most people look for a content proofing service where they enter a website's URL and press a button to get a report. An automated single-pass system is fine for small, simple websites, but large, complex sites require a two-pass approach. The results from the first pass are used to train this context-based system, allowing the final pass to provide much more accurate results. Compared with a one-pass system, this substantially reduces the amount of information that must be reviewed to determine what is correct and what is incorrect, and it is especially useful for reducing false positives.
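
The two-pass idea can be illustrated with a minimal sketch. This is not the project's engine: the source does not say how the first pass trains the system, so the sketch assumes that an unrecognized term recurring across several documents is probably legitimate terminology. The names `BASE_DICTIONARY`, `first_pass`, `learn_terms`, `second_pass` and the `min_docs` threshold are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical, tiny stand-in for a real dictionary.
BASE_DICTIONARY = {
    "the", "system", "checks", "spelling", "of", "a",
    "large", "website", "is", "on", "every", "page",
}

WORD_RE = re.compile(r"[A-Za-z']+")


def unknown_words(text, dictionary):
    """Return the lowercased words in text that are not in the dictionary."""
    return [w.lower() for w in WORD_RE.findall(text) if w.lower() not in dictionary]


def first_pass(documents, dictionary):
    """Count, for each unknown word, how many documents contain it."""
    counts = Counter()
    for text in documents.values():
        counts.update(set(unknown_words(text, dictionary)))
    return counts


def learn_terms(counts, min_docs=2):
    """Assumption: a term seen in at least min_docs documents is site terminology."""
    return {word for word, n in counts.items() if n >= min_docs}


def second_pass(documents, dictionary):
    """Report the remaining unknown words per document."""
    return {
        name: sorted(set(unknown_words(text, dictionary)))
        for name, text in documents.items()
    }


documents = {
    "a.html": "The Zyntrax system checks spelling",
    "b.html": "Zyntrax is on every page",
    "c.html": "A large websit",
}
learned = learn_terms(first_pass(documents, BASE_DICTIONARY))
report = second_pass(documents, BASE_DICTIONARY | learned)
print(report)
```

In this sketch the made-up product name "Zyntrax" drops out of the final report because it recurs, while the one-off "websit" is still flagged. A frequency heuristic alone would also accept a typo that is repeated consistently across a site, so the learned terms would still need review before the final pass.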

This is a very mature system that has already discovered and classified a vast number of U.S. English words and named entities. The dictionaries and knowledge base used by this system are among the most comprehensive for U.S. English. Please see the Project Specs page for details.

The Demonstrations page provides detailed information about this system's capabilities, using some very challenging content sources to test it. Try loading a large page from the CIA World Factbook into the best typographical error checking system you can find to see what I mean. The main focus of this project is on the engine itself rather than a fancy interactive website.

The truth is that if you paid for such an inexpensive service and your website had significant content and SEO issues, you would recoup your cost several times over. I have seen very embarrassing and reputation-damaging typos on numerous occasions. One political candidate blogged about his "qualifactions" for governor. Another site misspells the word "management" several hundred times, seriously hurting its search engine ranking. One business owner who works in a particular scientific field misspells the name of that field on the website, which creates a credibility issue and hurts the website's search engine ranking. Another website is littered with foul language in its forum section. Government websites littered with typos are used as educational resources for schoolchildren. The list goes on and on. You never know what typo monsters are lurking on your website until you check it.

This project was developed with the end goal of creating a process capable of comprehending a large portion of the English language. You'll see it's well on its way to meeting that goal if you take a look at the Demonstrations page. Why create such a system? It's all about sifting through massive amounts of useless information to find the gems that you need to know about. Such a system has to be very adept at dealing with imperfect content. It's very rare to find near-perfect content once the volume of content reaches a certain level -- hence this phase of the project.

I'm still processing large websites for businesses and accredited educational institutions for free.
Businesses must be listed on a major stock exchange and have at least 100 web pages or documents to qualify.

The dictionaries are constantly being updated with the latest words and terminology. It simply isn't cost-effective to try to duplicate the capabilities of this system, which have been evolving for a number of years. It has been trained and refined by processing hundreds of websites and documents.

What's next? The data mining/information extraction capabilities are constantly being improved. Check out the News page for the latest information. The plan is to find a way for this system to mine information that pays off through indirect means.

Core System Features

This context-based spell check system identifies:

  • Spelling errors in document bodies (example)
  • Spelling errors in HTML tags (title, desc., keywords) (example)
  • Mis-capitalizations of proper nouns (example)
  • Double-typed words such as "the the" (example) (example)
  • Unpopulated HTML tags affecting your Internet search engine ranking
  • Incorrect use of the articles "a" and "an" (example)
  • Documents containing profanity
  • Documents containing foreign languages (useful for grouping them in their own area)
  • U.K. English words, along with their U.S. English counterparts
  • Duplicate document files (to free disk space) (example)
  • Zero size document files
  • Unreadable or corrupt document files (example)
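
Two of the checks listed above, double-typed words and "a"/"an" usage, can be sketched with regular expressions. This is only an illustration under stated assumptions, not how this system works internally: the function names are hypothetical, and the article check uses a simple first-letter rule instead of pronunciation.

```python
import re

# A word followed by whitespace and the same word again, e.g. "the the".
DOUBLED_RE = re.compile(r"\b([A-Za-z]+)\s+\1\b", re.IGNORECASE)

# The article "a" or "an" followed by the next word.
ARTICLE_RE = re.compile(r"\b(a|an)\s+([A-Za-z]+)", re.IGNORECASE)

VOWELS = "aeiou"


def find_doubled_words(text):
    """Return each word that appears twice in a row (lowercased)."""
    return [m.group(1).lower() for m in DOUBLED_RE.finditer(text)]


def find_article_errors(text):
    """Return (found, suggested) pairs where "a"/"an" disagrees with the next word.

    Assumption: a word starting with a vowel letter takes "an". That is a
    spelling-based shortcut, so sound-based cases such as "an hour" or
    "a university" would be flagged incorrectly.
    """
    errors = []
    for m in ARTICLE_RE.finditer(text):
        article, word = m.group(1).lower(), m.group(2)
        expected = "an" if word[0].lower() in VOWELS else "a"
        if article != expected:
            errors.append((m.group(0), f"{expected} {word}"))
    return errors


print(find_doubled_words("This is the the best, and the theory holds."))
print(find_article_errors("She ate a apple and an banana."))
```

A context-based checker would need an exception list or pronunciation data for the article rule; the sketch only shows the basic shape of the two checks.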

Additional Features

  • Includes specialized business, medical and scientific terminology
  • On-the-fly identification of specialized terms not in dictionaries (example)
  • Full hyphenated word support
  • Top keyword phrase extraction for trend/content analysis (example)
  • HTML tag counts and population statistics (example)
  • HTML alternate image text summary report (example)
  • List of website addresses present in your documents (example)
  • List of e-mail addresses present in your documents (example)
  • List of unrecognized words you can add to a local custom dictionary
  • File selection filters to select specific documents by location and file name
  • MD5 file checksum listing for document modification monitoring
  • Robust and refined dictionaries to limit the number of unrecognized words
  • Recognition of popular word abbreviations
  • Can process movie scripts and screenplay content containing slang
  • Bird's-eye view of HTML tag content
  • No software to learn or install.  Simply request service via e-mail.
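
An MD5 checksum listing of the kind mentioned above can be produced with Python's standard library. The sketch below is an assumption about how such a listing might be built and used, not a description of this system; `checksum_listing`, `changed_files` and `duplicate_groups` are hypothetical names. The same listing also serves the duplicate-file check, since files with identical content share a checksum.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_listing(root):
    """Map each file under root (by relative path) to its MD5 checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): md5_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(old_listing, new_listing):
    """Return files that are new or whose checksum differs from the old listing."""
    return sorted(
        name for name, digest in new_listing.items()
        if old_listing.get(name) != digest
    )


def duplicate_groups(listing):
    """Group file names that share a checksum, i.e. have identical content."""
    by_digest = {}
    for name, digest in listing.items():
        by_digest.setdefault(digest, []).append(name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]
```

Saving the listing from one run and comparing it with the next run via `changed_files` shows which documents were modified in between. MD5 is adequate for spotting accidental changes and duplicates, but it is not suitable where deliberate tampering is a concern.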


This system currently supports the following document types:

 

  • HTML web pages
  • Microsoft Word Documents (.doc)
  • Active Server Pages (.asp/.aspx)
  • JavaServer Pages (.jsp)
  • ASCII text files
  • Rich-text files (.rtf)
  • Adobe PDF files (.pdf) (limited support)
  • HTML files containing an XML ID tag*
  • ColdFusion Markup Language files (.cfm)
  • PHP web pages (.php)


* XML files can be processed when they contain solely English-language words.


System Limitations

This system is designed for documents which are available to the public.  This system does not have the security safeguards needed to protect documents that contain sensitive information such as social security numbers, credit card numbers, military secrets, bank account numbers, etc.

Documents heavily populated with foreign language characters are not good candidates for this system. If your documents contain large segments of English intermixed with a foreign language, the "unrecognized items file" may be larger, making it more time-consuming to use that information to create a custom dictionary for your document set. This system can currently recognize a large number of Spanish, French and German words, which prevents those words from appearing in the unrecognized words list. Some Latin words, such as those used in legal terminology, may not be recognized.

Smaller websites and document sets create a smaller "unrecognized items file" making it easier to create a custom dictionary for future spell checks.

This system cannot process web pages that use copy protection schemes, or document files that are compressed, encrypted or password-protected. This system may not be able to process websites that use exotic methods to render web pages.

See this website's Legal page for additional information regarding terms.

This system uses a website's "robots.txt" file to determine which files to check.
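
As an illustration of how a "robots.txt" file can decide which files get checked, here is a minimal sketch using Python's standard `urllib.robotparser` module. The source does not describe how this system reads the file, so the helper `allowed_urls`, the user agent name `ExampleSpellBot` and the example URLs are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt, urls, user_agent="ExampleSpellBot"):
    """Return the URLs that the given robots.txt content permits for user_agent."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network request is made here.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


robots_txt = """\
User-agent: *
Disallow: /private/
"""

urls = [
    "https://example.com/index.html",
    "https://example.com/private/notes.html",
]
print(allowed_urls(robots_txt, urls))
```

Pages under a disallowed path are simply left out of the check, so a site owner can exclude sections by listing them in "robots.txt".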



 


Now a proud partner with Trade Service Port of China -- IIOM China









Allan Kirsch is a member of the business network.