spell and /usr/dict/words.

Tue Aug 28 01:54:33 AEST 1990

In article <7385 at star.cs.vu.nl> mike at cs.vu.nl (Mike Marcel Jonkmans) writes:
>Allright I had nothing to do and typed for fun the following :
>
>(csh)% spell < /usr/dict/words > error
>(csh)% cat error
>belying
>revisable
>(csh)% 
>
>What's so special about belying and revisable ??
>
>
>--
>
>			Mike Jonkmans.  (mike at cs.vu.nl)
>			       ..!uunet!mcsun!botter!mike
Interesting. I tried the same thing on our VS system and found the
following words from /usr/dict/words which spell did not accept:

acclimatize
belying (same as your list)
implementer
Remus
revisable (same as your list)
vis

My guess would be that an out-of-date hash file would account for
such errors, but finding the same two words on different systems seems
too much of a coincidence to support that theory. Maybe others on
the net can try this on their systems and see what anomalies they get.

I have also found another interesting deficiency in spell. It appears
to test only whether the hashed value of a string matches the hashed
value of some word from the dictionary. For relatively long strings
there are many hash collisions, which cause nonwords to be accepted.
I have run some tests which confirm that this is the reason for the
false acceptance.
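
To illustrate the kind of failure I mean, here is a minimal sketch in
Python. It is not the hash that spell actually uses; toy_hash, the
12-bit TABLE_BITS width, and the "w0", "w1", ... test strings are all
assumptions chosen so that collisions are easy to find. The point is
only that a checker which keeps hash codes, and not the words
themselves, must accept any string whose code happens to match.

```python
import zlib

# Assumption: a deliberately tiny 12-bit code space so collisions are easy
# to find. The real spell hash and its code width are not reproduced here.
TABLE_BITS = 12

def toy_hash(word):
    """Stand-in hash: CRC-32 of the word, truncated to TABLE_BITS bits."""
    return zlib.crc32(word.encode("ascii")) & ((1 << TABLE_BITS) - 1)

def build_table(words):
    """Keep only the hash codes; the words themselves are thrown away."""
    return {toy_hash(w) for w in words}

def accepted(table, candidate):
    """Accept a candidate if its hash code is present in the table."""
    return toy_hash(candidate) in table

def find_colliding_pair(limit=(1 << TABLE_BITS) + 1):
    """Return two distinct strings with the same toy hash.

    Among more strings than there are possible codes, two must collide.
    """
    seen = {}
    for i in range(limit):
        s = "w%d" % i
        h = toy_hash(s)
        if h in seen:
            return seen[h], s
        seen[h] = s
    return None

if __name__ == "__main__":
    real, fake = find_colliding_pair()
    table = build_table([real])          # the "dictionary" holds one word
    print(real, accepted(table, real))   # the dictionary word is accepted
    print(fake, accepted(table, fake))   # so is a string never put in it
```

With a wider code the collisions become rarer, but they never go away
as long as only the codes are compared.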

For example, passing all 8! permutations of "abcdefgh" through 
/usr/lib/spell (the executable which does the actual spell check for 
the spell script) finds 10 strings which are accepted, none of which
are really words. Trying the 9! permutations of "abcdefghi" gives
159 accepted strings. There are only 9 times as many permutations,
yet nearly 16 times as many were accepted, which suggests to me
that the hashing algorithm is better at generating unique codes for
short strings, a desirable feature, I think. When
I experimented with this some time ago I found that the word
"receptionist" had many thousands of accepted anagrams. Recently I tried
to duplicate this and ran out of disk space. The partial result showed
2380 anagrams beginning with 'c' alone. (For efficiency my program
generates them in collating sequence, hence I found all the 'c' words
and some of the 'e' words before running out of space.) These results
have been about the same using VS, AIX (on an RT), SCO Xenix, and SunOS,
often with exact matches across different systems, indicating that
the same dictionary and hashing algorithm are being used.
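
For anyone who wants to try the permutation test, the following Python
sketch shows the general shape of it. It is not the program I used.
The names distinct_permutations, accepted_permutations and
acceptance_rate are made up for this example, and is_accepted is a
hypothetical callable which you would have to connect to whatever
checker you want to test. The generator produces each distinct
ordering once, in collating sequence, without storing them, which
matters for a word with repeated letters such as "receptionist".

```python
from math import factorial

def distinct_permutations(letters):
    """Yield each distinct ordering of `letters` once, in collating sequence."""
    a = sorted(letters)
    n = len(a)
    while True:
        yield "".join(a)
        # Find the rightmost position i with a[i] < a[i + 1].
        i = n - 2
        while i >= 0 and a[i] >= a[i + 1]:
            i -= 1
        if i < 0:
            return  # a is in descending order: that was the last one
        # Find the rightmost element larger than a[i] and swap it in.
        j = n - 1
        while a[j] <= a[i]:
            j -= 1
        a[i], a[j] = a[j], a[i]
        # Reverse the tail so it is the smallest arrangement again.
        a[i + 1:] = reversed(a[i + 1:])

def accepted_permutations(letters, is_accepted):
    """Yield the distinct orderings that the (hypothetical) checker accepts."""
    for candidate in distinct_permutations(letters):
        if is_accepted(candidate):
            yield candidate

def acceptance_rate(accepted_count, length):
    """Fraction of the length! orderings of distinct letters accepted."""
    return accepted_count / factorial(length)

if __name__ == "__main__":
    # Counts reported above: 10 of the 8! and 159 of the 9! permutations.
    print(acceptance_rate(10, 8))
    print(acceptance_rate(159, 9))
```

Because the output is a generator, the accepted strings can be counted
as they appear, so the whole list need not be written to disk.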

P.S. to Brad Appleton of Harris Computer Systems, who sent me email
asking for a copy of the program I used to find this: I got your
request. I would be glad to share the code. Unfortunately my system
doesn't seem to know how to send email to yours. Can you suggest
a routing which might work? I'm not very good at figuring out the
mysteries of the Internet.