Efficient STRing CoMPares?
R. Kym Horsell
kym at bingvaxu.cc.binghamton.edu
Thu Mar 21 07:31:33 AEST 1991
In article <1991Mar20.174452.4246 at zoo.toronto.edu> henry at zoo.toronto.edu (Henry Spencer) writes:
>In article <1991Mar19.034802.24340 at cbnewsk.att.com> barton at cbnewsk.att.com (jim.m.barton) writes:
>>... The above macro, StrEq(), MIGHT be faster if most
>>of the strings being compared tend to differ in their first character...
>>IMHO, I doubt that StrEq() will achieve any significant optimization for
>>MOST cases...
>Sorry, there is considerable experimental evidence that it does. The fact
>is, most string comparisons fail on the very first character, and even a
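(The StrEq() macro itself isn't reproduced in the quote above; a plausible sketch -- an assumption on my part, the real definition appeared upthread -- is:)

```c
#include <string.h>

/* Plausible shape of the StrEq() macro being discussed (an assumption;
   the actual upthread definition may differ): test the first characters
   inline and fall through to strcmp() only when they match. */
#define StrEq(a, b) (*(a) == *(b) && strcmp((a), (b)) == 0)
```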
An easy way to verify that two strings typically differ
in their first character is to run the following (dumb) sorting program.
For the first 10k words out of /usr/dict/words, scrambled into
essentially random order, it says 89% of the comparisons failed on their
first character.
This is not much of a test, granted, and there _will_
be some cases (e.g. sorting nearly-sorted lists of words) where
comparisons _won't_ be decided by the first characters. However, even in
these latter cases the overhead of an extra `compare' versus the
overhead of the strcmp loop will be small -- please feel free to
verify this experimentally.
(As a back-of-the-envelope calculation,
say it costs S instructions for the strcmp code and
C instruction(s) for the inline compare; then on average we perform
C + .11 S instructions. We can then ask ``how large would C have to be
before it's worth using S alone?'' We find break-even at C = .89 S. Therefore,
provided the compare sequence (including jumps and register loads) on your
machine costs less than 89% of a (typical) call/inline of strcmp, USE THE
COMPARE.)
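The arithmetic above can be written out as a tiny cost model (my own
restatement of the envelope calculation; p is the measured fraction of
comparisons decided by the first character, .89 in the experiment below):

```c
/* Expected-cost model for the inlined first-character compare.
   p = fraction of comparisons decided by the first character,
   C = cost of the inline compare, S = cost of a strcmp call.
   With the macro we always pay C, and pay S only for the (1 - p)
   comparisons whose first characters match. */
double avg_cost(double c, double s, double p) {
    return c + (1.0 - p) * s;
}

/* Break-even: C + (1 - p) * S = S  =>  C = p * S.
   With p = .89, the macro wins while C < .89 S. */
double break_even_c(double s, double p) {
    return p * s;
}
```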
As a slight side note, there are certain contexts where string comparisons
can be made MUCH more efficient than a character-by-character compare.
One example is the well-known family of `string search' algorithms that
compare _rare_ characters first (e.g. if one string contains a `z',
first search for `z' in the other string -- if found, then match the
rest) and then adjust pointers by an appropriate amount (rather than just
incrementing them) depending on the outcome of each character comparison.
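A minimal sketch of that rare-character strategy (my own illustration of
the idea, not the published algorithm; the caller picks rare_ix, the
index of a rare character in the pattern, with rare_ix < pat_len assumed):

```c
#include <string.h>

/* Search hay[0..hay_len) for pat[0..pat_len) by scanning with memchr()
   for pat[rare_ix], a character assumed to be rare in the text, and
   verifying the rest of the pattern around each hit. */
const char *rare_search(const char *hay, size_t hay_len,
                        const char *pat, size_t pat_len, size_t rare_ix)
{
    const char *p = hay;
    if (pat_len == 0)
        return hay;
    while ((p = memchr(p, pat[rare_ix], hay_len - (size_t)(p - hay))) != NULL) {
        /* a match would start rare_ix bytes before the hit; bounds-check it */
        if ((size_t)(p - hay) >= rare_ix
            && (size_t)(p - hay) - rare_ix + pat_len <= hay_len) {
            const char *start = p - rare_ix;
            if (memcmp(start, pat, pat_len) == 0)
                return start;
        }
        p++;
    }
    return NULL;
}
```

memchr() typically runs word-at-a-time (or as a single machine instruction),
so the inner loop touches far fewer characters than a naive compare when
pat[rare_ix] really is rare.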
I remember reading an article in Software Practice and Experience
that compared such a statistical method with hardware
string-matching INSTRUCTIONS (I _think_ it was on an IBM 370, so that dates
me, doesn't it?) and the algorithm was found to be SUBSTANTIALLY faster
(like an order of magnitude, unless I'm mistaken) than the hardware
string compare.
-kym
/* Dumb sort program that figgas out how many strings (mis)match on
   first character -- it says 11% match for /usr/dict/words. */
#include <stdio.h>
#include <stdlib.h>     /* malloc(), rand(), exit() */
#include <string.h>

#define MAXWORD 10000

char *word[MAXWORD];
int nmatch1 = 0;        /* comparisons whose first characters matched */
int ncmp = 0;           /* total comparisons */

void init(void);
void scramble(void);
void sort(void);
int cmp(const char *, const char *);

int main(void)
{
    init();
    scramble();
    sort();
    printf("fraction match first char=%g\n", (double)nmatch1/ncmp);
    return 0;
}

/* read MAXWORD words, one per line, from stdin */
void init(void)
{
    int i;
    char buf[100];
    for (i = 0; i < MAXWORD; i++) {
        if (fgets(buf, sizeof buf, stdin) == NULL)
            exit(1);
        buf[strcspn(buf, "\n")] = '\0';
        word[i] = strcpy(malloc(strlen(buf)+1), buf);
    }
}

/* scramble the words into essentially random order */
void scramble(void)
{
    int i;
    for (i = 0; i < MAXWORD; i++) {
        int j = rand() % MAXWORD;
        char *tmp = word[i];
        word[i] = word[j];
        word[j] = tmp;
    }
}

/* selection sort, counting comparisons via cmp() */
void sort(void)
{
    int i, k;
    for (i = 0; i < MAXWORD; i++) {
        int j = i;
        char *tmp = word[j];
        for (k = j+1; k < MAXWORD; k++)
            if (cmp(word[k], tmp) < 0)
                j = k, tmp = word[j];
        word[j] = word[i];
        word[i] = tmp;
    }
}

/* strcmp wrapper that tallies first-character matches */
int cmp(const char *s1, const char *s2)
{
    ++ncmp;
    if (*s1 == *s2)
        ++nmatch1;
    return strcmp(s1, s2);
}
More information about the Comp.lang.c mailing list