Fast "find" sources
pag at hao.UUCP
pag at hao.UUCP
Thu Jul 14 06:40:30 AEST 1983
echo extracting:
echo READ.ME
cat > READ.ME << 'NEWFILE'
FAST FIND MODIFICATIONS (see ;login: 3/83)
Run the shell on this file to separate into components
(poor man's network archive style). They are:
READ.ME
news.item -- brief announcement for local news
find.1 -- manual page
Makefile
find.squeeze -- csh script to build the pathname database
find.bigram.c -- called from find.squeeze to list bigrams
find.code.c -- does the actual filename coding
additions to find.c:
find.c.mod1 -- header
find.c.mod2 -- invocation to fastfind()
find.c.mod3 -- fastfind() itself
Procedure (to be done in a local source directory):
(1) in 'Makefile':
(a) define BINDIR to be the directory to contain the new executable
'find', which overrides the standard (Bell/Berkeley) command.
(b) LIBDIR is to house the auxiliary programs
'find.squeeze', 'find.bigram', and 'find.code'.
Create this directory if it does not already exist.
(c) by changing 'chown root' to 'chown daemon', and by running
as 'daemon' instead of 'root' in step (4) below,
the contents of non-searchable directories will be kept out
of the filelist. We do NOT do this.
in 'find.squeeze':
(a) specify LIBDIR as above.
(b) set STDFIND to the standard 'find' command.
(c) set FINDHONCHO to the name of some user who can act on
any [rare] error messages emanating from the script.
(d) fix a daily time for the 'at' command (ATTIME).
Alternatively, wire it into /usr/lib/crontab for finer control.
The script takes about 20 min. on our 11/70 running v7.
(e) on 4.1/4.2 systems where 'at' runs the Berkeley shell as standard,
take out the lines containing 'EOF'.
(2) cp /usr/src/cmd/find.c find.c.old
Integrate the three find.c mods. Start with
cat find.c.mod1 find.c.old find.c.mod3 > find.c
Then edit find.c to incorporate find.c.mod2--place this after the
main() declarations and before the time() call.
FCODES, as in the Makefile, tells where the database lives.
Our 20000 or so files (100 MB content) takes up ~100K bytes in /etc.
Remove find.c.mod[1-3] when done.
(3) Run the Makefile as root.
(4) Do (as root)
at [suitable time] find.squeeze &
Come back in awhile, then try something like
find find or maybe
find / | wc (to yield total number of files on system)
(5) Install the 'man' page.
Announce 'find' to the world with something like 'news.item'.
NEWFILE
echo news.item
cat > news.item << 'NEWFILE'
>From jaw Tue Jun 9 12:35:48 1981
To: news
Subj: The Ames fast file finder
To locate files on this system in seconds, type
find <name>
where <name> is any character string. E.g.
find xyz
lists all files whose pathname contains 'xyz'.
Shell filename "globbing" is also permitted, i.e.:
find '????'
will list all four-letter files on the system.
vtroff -man `find '*man*ls.[0-9u]'`
would typeset all manual pages for 'ls' variants.
The code is actually a superfast implementation of
find / -mtime +0 -name "*<name>*" -print
with a much simplified form. Note that the two-argument
invocation of 'find' does not conflict with the syntax of
the standard utility, so that the more verbose 'find'
will behave as expected. The .cshrc idiom
alias f 'set noglob; find \*; unset noglob'
proves handy for saving keystrokes.
NEWFILE
echo find.1
cat > find.1 << 'NEWFILE'
.TH FIND AMES
.SH NAME
find \- find files
.SH SYNOPSIS
.B find
pathname-list expression
.br
.B find
name
.SH DESCRIPTION
.I Find
recursively descends
the directory hierarchy for
each pathname in the
.I pathname-list
(i.e., one or more pathnames)
seeking files that match a boolean
.I expression
written in the primaries given below.
In the descriptions, the argument
.I n
is used as a decimal integer
where
.I +n
means more than
.I n,
.I \-n
means less than
.I n
and
.I n
means exactly
.IR n .
.PP
The second simplified form will list all files on the system
whose pathname contains
.I name.
This is similar to
find / -mtime +0 -name "*<name>*" -print
but much faster.
As with
.B -name
below, shell syntax may be used for
.I name.
.TP 10n
.BR \-name " filename"
True if the
.I filename
argument matches the current file name.
Normal
Shell
argument syntax may be used if escaped (watch out for
`[', `?' and `*').
.TP
.BR \-perm " onum"
True if the file permission flags
exactly
match the
octal number
.I onum
(see
.IR chmod (1)).
If
.I onum
is prefixed by a minus sign,
more flag bits (017777, see
.IR stat (2))
become significant and
the flags are compared:
.IR (flags&onum)==onum .
.TP
.BR \-type " c"
True if the type of the file
is
.I c,
where
.I c
is
.B "b, c, d"
or
.B f
for
block special file, character special file,
directory or plain file.
.TP
.BR \-links " n"
True if the file has
.I n
links.
.TP
.BR \-user " uname"
True if the file belongs to the user
.I uname
(login name or numeric user ID).
.TP
.BR \-group " gname"
True if the file belongs to group
.I gname
(group name or numeric group ID).
.TP
.BR \-size " n"
True if the file is
.I n
blocks long (512 bytes per block).
.TP
.BR \-inum " n"
True if the file has inode number
.I n.
.TP
.BR \-atime " n"
True if the file has been accessed in
.I n
days.
.TP
.BR \-mtime " n"
True if the file has been modified in
.I n
days.
.TP
.BR \-exec " command"
True if the executed command returns
a zero value as exit status.
The end of the command must be punctuated by an escaped
semicolon.
A command argument `{}' is replaced by the
current pathname.
.TP
.BR \-ok " command"
Like
.B \-exec
except that the generated command is written on
the standard output, then the standard input is read
and the command executed only upon response
.BR y .
.TP
.B \-print
Always true;
causes the current pathname to be printed.
.TP
.BR \-newer " file"
True if
the current file has been modified more recently than the argument
.I file.
.PP
The primaries may be combined using the following operators
(in order of decreasing precedence):
.TP 4
1)
A parenthesized group of primaries and operators
(parentheses are special to the Shell and must be escaped).
.TP 4
2)
The negation of a primary
(`!' is the unary
.I not
operator).
.TP 4
3)
Concatenation of primaries
(the
.I and
operation
is implied by the juxtaposition of two primaries).
.TP 4
4)
Alternation of primaries
.RB "(`" \-o "' is the"
.I or
operator).
.SH EXAMPLES
To generate nancy's file types:
.IP
find nancy | pump file
.PP
To typeset all variants of manual pages for 'ls':
.IP
vtroff -man `find '*man*/ls.?'`
.PP
To remove all files named
`a.out' or `*.o' that have not been accessed for a week:
.IP
find / \\( \-name a.out \-o \-name '*.o' \\)
\-atime +7 \-exec rm {} \\;
.SH FILES
/etc/passwd
.br
/etc/group
.br
/etc/find.codes coded filenames
.SH "SEE ALSO"
sh(1), test(1), filsys(5)
.br
Relevant paper in February, 1983 issue of
.I ;login:.
.SH BUGS
The syntax (except for the second form), is painful.
NEWFILE
echo Makefile
cat > Makefile << 'NEWFILE'
BINDIR=/usr/local
LIBDIR=/usr/lib/find
find: find.c find.bigram.c find.code.c
cc -n -s -O find.c -o $(BINDIR)/find
cc -s -O find.bigram.c -o $(LIBDIR)/find.bigram
cc -s -O find.code.c -o $(LIBDIR)/find.code
cp find.squeeze $(LIBDIR)
chown root $(LIBDIR)/find.squeeze
chmod og-wx $(LIBDIR)/find.squeeze
NEWFILE
echo find.squeeze
cat > find.squeeze << 'NEWFILE'
#! /bin/csh --this line for VAX BSD--next and last line for V7 only
/bin/csh << 'EOF'
set LIBDIR = /usr/lib/find # for subprograms
set STDFIND = /bin/find # in /usr/bin on some systems
set ATTIME = 0130 # daily
set FINDHONCHO = root # for error messages
set FCODES = /etc/find.codes # the database
set path = ( $LIBDIR /usr/ucb /bin /usr/bin )
set bigrams = /tmp/f.bigrams$$
set filelist = /tmp/f.list$$
set errs = /tmp/f.errs$$
# Make a file list and compute common bigrams.
# Alphabetize '/' before any other char with 'tr'.
# If the system is very short of sort space, 'find.bigram' can be made
# smarter to accumulate common bigrams directly without sorting
# ('awk', with its associative memory capacity, can do this in several
# lines, but is too slow, and runs out of string space on small machines).
at $ATTIME $LIBDIR/find.squeeze
nice +10
$STDFIND / -print | tr '/' '\001' | \
(sort -f; echo $status > $errs) | \
tr '\001' '/' | tee $filelist | $LIBDIR/find.bigram | \
(sort; echo $status >> $errs) | uniq -c | sort -nr | \
awk '{ if (NR <= 128) print $2 }' | tr -d '\012' > $bigrams
# code the file list
if { grep -s -v 0 $errs } then
echo 'find.squeeze error: out of sort space' | mail $FINDHONCHO
else
$LIBDIR/find.code $bigrams < $filelist > $FCODES
rm $bigrams $filelist $errs
endif
'EOF'
NEWFILE
echo find.bigram.c
cat > find.bigram.c << 'NEWFILE'
/*
********************************************************************************
find.bigram < text > bigrams
List bigrams for 'find.squeeze' script.
Use 'find.code' to encode a file using this output.
********************************************************************************
*/
#include <stdio.h>
#define MAXPATH 1024 /* maximum pathname length */
char path[MAXPATH];
char oldpath[MAXPATH] = " ";
main ( )
{
register int count, j;
while ( gets ( path ) != NULL ) {
count = prefix_length ( oldpath, path );
/*
output post-residue bigrams only
*/
for ( j = count; path[j] != NULL; j += 2 ) {
if ( path[j + 1] == NULL )
break;
putchar ( path[j] );
putchar ( path[j + 1] );
putchar ( '\n' );
}
strcpy ( oldpath, path );
}
}
prefix_length ( s1, s2 ) /* return length of longest common prefix */
char *s1, *s2; /* ... of strings s1 and s2 */
{
register char *start;
for ( start = s1; *s1 == *s2; s1++, s2++ )
if ( *s1 == NULL )
break;
return ( s1 - start );
}
NEWFILE
echo find.code.c
cat > find.code.c << 'NEWFILE'
/*
********************************************************************************
NAME: find.code
PURPOSE: sorted list compressor (works with a modified 'find'
to encode/decode a filename database)
USAGE: find.bigram < list > bigrams
process bigrams (see find.squeeze) > common_bigrams
find.code common_bigrams < list > squozen_list
METHOD: Uses 'front compression' (see ";login:", March 1983, p. 8 ).
Output format is, per line, an offset differential count byte
followed by a partially bigram-encoded ascii residue.
The codes are:
0-28 likeliest differential counts + offset to make nonnegative
30 escape code for out-of-range count to follow in next word
128-255 bigram codes, (128 most common, as determined by 'find.squeeze')
32-127 single character (printable) ascii residue
SEE ALSO: find.squeeze, find.bigram.c, find.c
AUTHOR: James A. Woods, Informatics General Corp.,
NASA Ames Research Center, 10/82
********************************************************************************
*/
#include <stdio.h>
#define MAXPATH 1024 /* maximum pathname length */
#define RESET 30 /* switch code */
char path[MAXPATH];
char oldpath[MAXPATH] = " ";
char bigrams[257] = { 0 };
main ( argc, argv )
int argc; char *argv[];
{
int count, oldcount, diffcount;
int j, code;
char bigram[3];
FILE *fp;
oldcount = 0;
bigram[2] = NULL;
if ( (fp = fopen ( argv[1], "r" )) == NULL )
printf ( "Usage: find.code common_bigrams < list > coded_list\n" ), exit ( 1 );
fgets ( bigrams, 257, fp );
fwrite ( bigrams, 1, 256, stdout );
while ( gets ( path ) != NULL ) {
/*
squelch unprintable chars so as not to botch decoding
*/
for ( j = 0; path[j] != NULL; j++ ) {
path[j] &= 0177;
if ( path[j] < 040 || path[j] == 0177 )
path[j] = '?';
}
count = prefix_length ( oldpath, path );
diffcount = count - oldcount;
if ( (diffcount < -14) || (diffcount > 14) ) {
putc ( RESET, stdout );
putw ( diffcount + 14, stdout );
}
else
putc ( diffcount + 14, stdout );
for ( j = count; path[j] != NULL; j += 2 ) {
if ( path[j + 1] == NULL ) {
putchar ( path[j] );
break;
}
bigram[0] = path[j];
bigram[1] = path[j + 1];
/*
linear search for specific bigram in string table
*/
if ( (code = strindex ( bigrams, bigram )) % 2 == 0 )
putchar ( (code / 2) | 0200 );
else
fputs ( bigram, stdout );
}
strcpy ( oldpath, path );
oldcount = count;
}
}
strindex ( string, pattern ) /* return location of pattern in string or -1 */
char *string, *pattern;
{
register char *s, *p, *q;
for ( s = string; *s != NULL; s++ )
if ( *s == *pattern ) { /* fast first char check */
for ( p = pattern + 1, q = s + 1; *p != NULL; p++, q++ )
if ( *q != *p )
break;
if ( *p == NULL )
return ( q - strlen ( pattern ) - string );
}
return ( -1 );
}
prefix_length ( s1, s2 ) /* return length of longest common prefix */
char *s1, *s2; /* ... of strings s1 and s2 */
{
register char *start;
for ( start = s1; *s1 == *s2; s1++, s2++ )
if ( *s1 == NULL )
break;
return ( s1 - start );
}
NEWFILE
echo find.c.mod1
cat > find.c.mod1 << 'NEWFILE'
/*
*******************************************************************************
NAME: find.c -- locate files
USAGE: find pathname-list expression, or
find <name> (Ames modification)
INSTALL: edit/run Makefile
run find.squeeze as root
FILES: /etc/find.codes
SEE ALSO: find.squeeze, find.bigram.c, find.code.c
Usenix ;login:, February/March, 1983, p. 8.
REVISIONS: James A. Woods, Informatics General Corporation,
NASA Ames Research Center, 6/81.
The second form searches a pre-computed filelist
(constructed nightly by /usr/lib/crontab) which is
compressed by find.squeeze (v.i.z.) The effect of
find <name>
is similar to
find / +0 -name "*<name>*" -print
but much faster.
8/82 faster yet + incorporation of bigram coding -- jaw
1/83 incorporate glob-style matching -- jaw
********************************************************************************
*/
#define AMES 1
NEWFILE
echo find.c.mod2
cat > find.c.mod2 << 'NEWFILE'
#ifdef AMES
if ( argc < 2 ) {
fprintf ( stderr, "Usage: find name, or find path-list predicate-list\n" );
exit ( 1 );
}
if ( argc == 2 ) {
fastfind ( argv[1] );
exit ( 0 );
}
#endif
NEWFILE
echo find.c.mod3
cat > find.c.mod3 << 'NEWFILE'
#ifdef AMES
/*
'fastfind' scans a file list for the full pathname of a file
given only a piece of the name. The list has been processed with
with "front-compression" and bigram coding. Front compression reduces
space by a factor of 4-5, bigram coding by a further 20-25%.
The codes are:
0-28 likeliest differential counts + offset to make nonnegative
30 escape code for out-of-range count to follow in next word
128-255 bigram codes, (128 most common, as determined by 'find.squeeze')
32-127 single character (printable) ascii residue
A novel two-tiered string search technique is employed:
First, a metacharacter-free subpattern and partial pathname is
matched BACKWARDS to avoid full expansion of the pathname list.
The time savings is 40-50% over forward matching, which cannot efficiently
handle overlapped search patterns and compressed path residue.
Then, the actual shell glob-style regular expression (if in this form)
is matched against the candidate pathnames using the slower routines
provided in the standard 'find'.
*/
#define FCODES "/etc/find.codes"
#define YES 1
#define NO 0
#define OFFSET 14
#define ESCCODE 30
fastfind ( pathpart )
char pathpart[];
{
register char *p, *s;
register int c;
char *q, *index(), *patprep();
int i, count, globflag;
FILE *fp, *fopen();
char *patend, *cutoff;
char path[1024];
char bigram1[128], bigram2[128];
int found = NO;
if ( (fp = fopen ( FCODES, "r" )) == NULL ) {
fprintf ( "find: can't open %s\n", FCODES );
exit ( 1 );
}
for ( i = 0; i < 128; i++ )
bigram1[i] = getc ( fp ), bigram2[i] = getc ( fp );
if ( index ( pathpart, '*' ) || index ( pathpart, '?' ) || index ( pathpart, '[' ) )
globflag = YES;
patend = patprep ( pathpart );
c = getc ( fp );
for ( ; ; ) {
count += ( (c == ESCCODE) ? getw ( fp ) : c ) - OFFSET;
for ( p = path + count; (c = getc ( fp )) > ESCCODE; ) /* overlay old path */
if ( c < 0200 )
*p++ = c;
else /* bigrams are parity-marked */
*p++ = bigram1[c & 0177], *p++ = bigram2[c & 0177];
if ( c == EOF )
break;
*p-- = NULL;
cutoff = ( found ? path : path + count);
for ( found = NO, s = p; s >= cutoff; s-- )
if ( *s == *patend ) { /* fast first char check */
for ( p = patend - 1, q = s - 1; *p != NULL; p--, q-- )
if ( *q != *p )
break;
if ( *p == NULL ) { /* success on fast match */
found = YES;
if ( globflag == NO || amatch ( path, pathpart ) )
puts ( path );
break;
}
}
}
}
/*
extract first glob-free subpattern for fast pre-match;
prepend NULL for backwards match; return end of pattern
*/
static char globfree[100];
char *
patprep ( p )
register char *p;
{
register char *subp = globfree;
while ( *p == '*' || *p == '?' )
p++;
if ( *p == '[' ) {
while ( *p != ']' && *p != NULL )
p++;
p++;
}
*subp++ = NULL;
if ( *p != NULL ) /* copy first noglob string */
while ( *p != '*' && *p != '?' && *p != '[' && *p != NULL &&
subp < (globfree + sizeof(globfree)) )
*subp++ = *p++;
else /* pattern has noglob chars only */
*subp++ = '/'; /* ... check every path */
*subp = NULL;
return ( --subp );
}
#endif
NEWFILE
More information about the Comp.sources.unix
mailing list