spell
Matt Cohen
sysnmc at magic706.chron.com
Sat Jun 9 03:13:12 AEST 1990
> Specifically I would like to know:
>
> 1. Why does spell refuse to learn some words (either using +local_words, or
> using spellin)? E.g.: "rerouting".
>
> 2. Why does spell accept some words as correct, though I see no way they
> could (mistakenly) be derived from correct words? E.g.: "neeeds", "miist".
> (This holds for an SCO Xenix distribution where I am absolutely sure that
> nobody has tampered with /usr/lib/spell/hlist[ab] *after* installation, that
> is.)
>
> 3. Is there a way to see all the words currently in /usr/lib/spell/hlist[ab]?
> Is that file in any way related to /usr/lib/spell/words?
Ulrich,
To the best of my knowledge, spell works by hashing the dictionary
into a very large hash table. Unfortunately, the table would be ridiculously
large if it were stored in the usual fashion, with each hash value pointing
to a linked list of the words that hash to it. Instead, spell uses a
technique called "optimistic hashing".
The trick is to drop the linked list at each hash value and keep only
a single bit that says "at least one word in the dictionary hashes to this
value". This decreases space usage dramatically. However, even though the
table is huge, there is still a chance that a random non-word will hash to
the same value as a real dictionary word and so be accepted.
When one of these non-words is identified, it is entered into the
"stop list", which is hashed with a different hash function. The stop list
is typically small, since these non-words are hard to find except by chance.
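Very roughly, and in Python just for brevity, the two bit tables might
look something like this. (This is only a sketch: the hash functions, the
table size, and the word lists below are made up for illustration, and the
real hlist[ab] and stop-list files use a different format.)

    import hashlib

    TABLE_BITS = 1 << 24              # a guess at "very large"; the real size differs

    def h_dict(word):                 # stand-in for spell's dictionary hash
        return int(hashlib.md5(word.encode()).hexdigest(), 16) % TABLE_BITS

    def h_stop(word):                 # stand-in for the different stop-list hash
        return int(hashlib.sha1(word.encode()).hexdigest(), 16) % TABLE_BITS

    def build_bit_table(words, h):
        table = bytearray(TABLE_BITS // 8)
        for w in words:
            i = h(w)
            table[i // 8] |= 1 << (i % 8)   # one bit: "something hashes to this value"
        return table

    def bit_set(table, i):
        return bool(table[i // 8] & (1 << (i % 8)))

    # Toy word lists; the real ones are the full dictionary and the
    # collected stop list.
    dictionary_bits = build_bit_table(["route", "need", "nee", "mi"], h_dict)
    stop_bits       = build_bit_table(["thier"], h_stop)   # made-up stop-list entry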
If a word hashes to a set bit in the hashed dictionary and
to a clear bit in the stop list, it is declared correct.
The word is also examined to see whether it could be the result of
applying one of a set of prefixes and suffixes to a root word. Each of
these potential root words is checked the same way, and if any of them
passes, the word is declared correct.
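Continuing that sketch, the lookup plus the prefix/suffix stripping might
look roughly like the following. (The affix rules here are invented for
illustration; the real spell has a much larger, hand-built rule table and is
somewhat pickier about how the rules combine.)

    # Toy affix rules: (suffix to strip, letters to put back).
    SUFFIX_RULES = [("ing", ""), ("ing", "e"), ("ed", ""), ("ed", "e"),
                    ("s", ""), ("ist", "")]
    PREFIX_RULES = ["re", "un"]

    def in_tables(word):
        # Set bit in the hashed dictionary, clear bit in the stop list.
        return (bit_set(dictionary_bits, h_dict(word))
                and not bit_set(stop_bits, h_stop(word)))

    def candidates(word):
        # The word itself plus every potential root reachable by stripping affixes.
        seen, frontier = {word}, [word]
        while frontier:
            stem = frontier.pop()
            forms = [stem[len(p):] for p in PREFIX_RULES
                     if stem.startswith(p) and len(stem) > len(p)]
            forms += [stem[:-len(suf)] + add
                      for suf, add in SUFFIX_RULES
                      if stem.endswith(suf) and len(stem) > len(suf)]
            for f in forms:
                if f not in seen:
                    seen.add(f)
                    frontier.append(f)
        return seen

    def spell_ok(word):
        return any(in_tables(c) for c in candidates(word))

With the toy tables, spell_ok("routing") passes by stripping "ing" and
putting the "e" back to reach "route", whose dictionary bit is set.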
--- Now, this explains your problems.
> 1. Why does spell refuse to learn some words (either using +local_words, or
> using spellin)? E.g.: "rerouting".
It looks like some quirk of spell keeps it from recognizing that
	rerouting = re + route - e + ing.
It seems to figure out "reroute" and "routing" OK.
My guess is that "rerouting" hashes to the same value as a word in the
stop list. Therefore, even when you create a dictionary with "rerouting" in
it, it can't pass: the set stop-list bit overrides the dictionary bit.
> 2. Why does spell accept some words as correct, though I see no way they
> could (mistakenly) be derived from correct words? E.g.: "neeeds", "miist".
> (This holds for an SCO Xenix distribution where I am absolutely sure that
> nobody has tampered with /usr/lib/spell/hlist[ab] *after* installation, that
> is.)
These words are "derived" from correct words:
	neeeds = nee + ed + s
	miist  = mi + ist
The problem is that the dictionary contains no information
about which suffixes/prefixes are appropriate for each word. You
can therefore come up with many ridiculous things that spell says
are good words.
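(In terms of the toy sketch above: spell_ok("neeeds") strips the "s" and
then the "ed" to reach "nee", whose dictionary bit is set, so the word
passes; "miist" reaches "mi" the same way by stripping "ist". Nothing in
the bit tables ever says whether "nee" actually takes those endings.)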
> 3. Is there a way to see all the words currently in /usr/lib/spell/hlist[ab]?
No. The hashed files contain none of the input words; see above.
> Is that file in any way related to /usr/lib/spell/words?
It is likely that a file similar to /usr/lib/spell/words
(/usr/dict/words on many Unixes) was hashed to produce /usr/lib/spell/hlista.
American spellings were probably naively changed to British and hashed
to produce hlistb.
The source of hlista is probably not exactly /usr/dict/words, at
least on my version of Unix (SunOS 4.0.3). You can test this easily
with "spell /usr/lib/spell/words". By its very definition, no
word in the dictionary should be deemed incorrect.
Whew! That was a mouthful. Hope it helped. You can gather
that modern spell checkers don't use this kind of optimistic hashing.
You may want to check out GNU's ispell program, which solves many
of these problems. You can ftp it from uunet.uu.net. The files
are ~/gnu/{ispell.shar,ispell.el,dict.shar}. I can also send it to
you if you want.
You may want to play with "spell -v" and "spell -x", if your
system has these options.
Please tell me what else you discover. Good luck!
-- Matt
Matt Cohen INET: sysnmc at chron.com
Department of Technology Resources UUCP: ...!uunet!chron!sysnmc
The Houston Chronicle AT&T: +1 713 220 7023
801 Texas Avenue, Houston, TX 77002 "The opinions above are most likely
"Houston's Leading Information Source" those of my employer."