fuzzy strcmp
Fred Mitchell - PA
mitchell at cbmvax.commodore.com
Tue Jan 9 09:30:42 AEST 1990
In article <297 at hhb.UUCP> istvan at hhb.UUCP (Istvan Mohos) writes:
>tchrist at convexe.uucp (Tom Christiansen @ Convex Computer) writes:
>>I'm looking for an algorithm that would allow me to determine
>>whether two strings were similar. Thus
>>
>> "abcde" !~ "xyzzy"
>> "this old man can read" =~ "that old man can't read"
>>
>>... perhaps just
>> float strfzcmp(string1,string2)
I will give a general description of an algorithm that can accomplish
what you ask.
NOTE: I am doing this from the 'top of my head'. Some refinement may
be in order.
You will do a weighted comparision based on two factors:
a) The number of characters each string has in common
b) The number of matches the strings have in sequence
Let's take two arbitrary strings:
"This old man can't read" string alpha
"That silly man can't read" string beta
Notice that they are of different lengths. Also, there is an alignment
shift. The following algorithm _should_ properly handle this:
a) Count the number of occurences of each character in each string.
Compare the count of each character in alpha to that in beta in the
following way:
For each different character from alpha & beta:
Normalize the counts so that the greater is normalized to 1.
Multiply the two normalized values toghether.
Add this product to a running total.
Normalize this total by the the length of the longest string
Multiply that by weight w1 yeilding (we shall call it) yw1
b) Compare the two strings byte for byte as follows:
Start at the beginning of alpha & beta (let's use i and j as indexes)
initialize k to 0
Until we hit the end of either string:
if (alpha[i] == beta[j])
++k, ++i, ++j
else
we scan forward on beta to find a byte that matchess
our current location on alpha. If so, adjust j to index
it. If nothing was found, do vice-versa for alpha.
When the above loop is completed,
Normalize k against the size of the longest string
and multiply it by weight w2 yeilding (we shall call it) yw2.
return (yw1 + yw2) / (w1 + w2)
This should produce a good evaluation of how closely matched alpha
is to beta, taking into account mis-alignment. The value returned
will be between 0 and 1 inclusive. Weight w1 should be less than w2.
Perhaps w1 = w2 / 3. Experiment.
As I said, this is off the top of my head, so some refinement is in
order, as should be evident during the implementation/testing phase.
If anyone knows of a better way to accomplish this, let's hear it!
-Mitchell
mitchell at cbmvax.UUCP
"The shortening of the Way."
More information about the Comp.unix.wizards
mailing list