file locking issues, NFS, lockf
Robert Thurlow
thurlow at convex.com
Wed May 1 13:37:29 AEST 1991
In <1991Apr30.192117.4730 at xn.ll.mit.edu> rkc at xn.ll.mit.edu writes:
>=This is a slight modification of a posting that occurred in comp.sys.sun.
>=I received only a few answers which seemed to open as many questions as they
>=answered. I now call upon the unix wizards to help me out.
The best audience may have been comp.protocols.nfs; though NFS and the
Sun lock manager are almost completely separate, they are both based on
RPC and many companies picked them both up from Sun.
> S1. The client dies and the server doesn't realize it. In order to
> avoid processes being killed when they own the lock, I catch the
> following signals: ... Should I catch more?
I guess you have no idea why they are dying? That looked like a pretty
good list to me; I can't say why your clients might be dying.
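
For what it's worth, here is a minimal sketch of the catch-and-release
idea, in Python for brevity (its fcntl.lockf is a wrapper around the
fcntl() locking calls). The signal list is illustrative only and is NOT
the list from the original post, and the function names are made up for
this example. SIGKILL cannot be caught, so a process killed that way
still depends on the kernel dropping the lock when the descriptor closes.

```python
import fcntl
import signal
import sys

# Illustrative signal list (an assumption, not the poster's actual list).
# SIGKILL and SIGSTOP cannot be caught, so they are not included.
SIGNALS_TO_CATCH = (signal.SIGHUP, signal.SIGINT, signal.SIGQUIT, signal.SIGTERM)


def make_unlock_handler(fd):
    """Return a signal handler that unlocks fd and then exits."""
    def handler(signum, frame):
        try:
            # Release the lock on the whole file before exiting.
            fcntl.lockf(fd, fcntl.LOCK_UN)
        finally:
            # Shell-style exit status for "terminated by signal".
            sys.exit(128 + signum)
    return handler


def install_unlock_handlers(fd, signals=SIGNALS_TO_CATCH):
    """Install the unlock-and-exit handler for each signal in `signals`."""
    handler = make_unlock_handler(fd)
    for sig in signals:
        signal.signal(sig, handler)
```

Call install_unlock_handlers(fd) right after taking the lock; it has to
be called from the main thread, since that is the only place Python
allows signal handlers to be installed.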
> I avoid the indefinite wait lock because this appears to increase the
> probability that an error will occur.
Something you may want to try to verify: Sun is said to have badly
broken the server side of the SunOS 4.1.x lock manager in that F_LOCK
requests that have to pend are answered with a GRANTED message with the
wrong process id. These responses are discarded by the client kernel
as being ridiculous. Do you have greater problems working against the
4.1.1 servers? You may get more happiness from either backing off to
4.0.3 or yelling at Sun _really_ loudly :-)
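
If you want to stay away from requests that pend, one option is to poll
with the non-blocking form and give up after a deadline. The sketch
below is in Python (LOCK_NB on fcntl.lockf is the non-blocking request,
comparable to F_TLOCK); the function name, timeout and retry interval
are assumptions for illustration. Whether polling actually sidesteps the
server problem described above is something you would have to verify.

```python
import errno
import fcntl
import time


def lock_with_timeout(fd, timeout=10.0, interval=0.5):
    """Try a non-blocking exclusive lock on fd until `timeout` seconds pass.

    Returns True if the lock was obtained, False if the deadline expired.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Non-blocking request: fails immediately if the lock is held.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return True
        except OSError as exc:
            # EACCES or EAGAIN means "held by someone else"; anything
            # else is a real error and is passed up to the caller.
            if exc.errno not in (errno.EACCES, errno.EAGAIN):
                raise
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller that gets False back can log the failure and decide what to do,
rather than sitting inside the lock call indefinitely.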
> S2. Sometimes the client doesn't die--it just hangs. Attaching the
>hung program indicates something hangs inside of fcntl.
Hmmm. Does anyone know how to get a backtrace of the kernel context
of a sleeping/waiting process on a Sun? I could use the information,
and it would be helpful here.
> S3. Occasionally, I get messages like
> unknown klm_reply proc(0)
> unknown klm_reply proc(40)
> Does anyone have any idea where these come from?
See S1; do you see this anywhere other than on a SunOS 4.1.x server?
> Other questions include:
> 1. Is there any known way to unconfuse our machines and reset
>state without rebooting the things? Killing statd and lockd is not
>sufficient.
Part (but not enough) of the lock manager lives in the kernel, and if
things get bad enough, a reboot will be necessary. I don't find I have
to do this very much, though. Do both daemons start and respond to
"rpcinfo -u <host> {l,n}lockmgr" requests when they're confused?
> 2. I was once told that sun released patches to their lock daemon, but
>no one could direct me to them. Does a wizard know where such things exist?
Sun, as far as I can tell, can give you patches, but not THE patches
needed to make the lock manager work properly. They _are_ finally
working on it, but it's taking a while.
> 3. If lockf cannot be made to work, would I be at risk using the old
>technique of creating a "lock directory"?
No, keep after Sun for a working lock manager, because NFS doesn't do
locking-via-file-creation well. The protocol has no O_EXCL flag, so
you can't be sure another process on another machine or on the server
didn't get to the file while your NFS daemon was trying your request,
and you can easily get false failures if your server doesn't keep track
of the retransmissions made necessary by UDP. Rauhl Desai posted a
scheme based on symbolic link creation to comp.protocols.nfs that will
at least work better than creating files.
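
To illustrate the general idea of locking by symbolic link creation,
here is a rough Python sketch. It is NOT a reproduction of the posted
scheme, whose details are not given here; the function names and the
"host:pid" owner string are assumptions. The link's target records who
made it, so if a retransmitted request comes back as "already exists"
the caller can read the link and see whether it is in fact its own.

```python
import os
import socket


def default_owner():
    """Build an owner tag of the form host:pid (an illustrative choice)."""
    return "%s:%d" % (socket.gethostname(), os.getpid())


def acquire_symlink_lock(lock_path, owner):
    """Try to take the lock by creating a symlink whose target is `owner`."""
    try:
        # Creating a symlink fails if the name already exists.
        os.symlink(owner, lock_path)
        return True
    except FileExistsError:
        try:
            # The link may be ours if an earlier attempt really succeeded.
            return os.readlink(lock_path) == owner
        except OSError:
            # Link vanished or is unreadable: treat as "not acquired".
            return False


def release_symlink_lock(lock_path, owner):
    """Remove the lock only if the link still names `owner`."""
    try:
        if os.readlink(lock_path) != owner:
            return False
        os.unlink(lock_path)
        return True
    except OSError:
        return False
```

Note that with this owner check a second acquire by the same owner also
returns True, and a lock left behind by a dead client still has to be
cleaned up by some other means.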
>I would prefer to get this to work properly using lockf, since this seems to
>be exactly what lockf is designed for.
You're right; sadly, the only implementation available for a whole lot
of machines is Sun's, and it's never worked properly. It's also never
been something Sun cared about until customers started eating them for
breakfast. There are a lot of OEM vendors who feel helpless waiting for
Sun to fix the thing properly, as well. Keep pressuring your Sun sales
rep for information.
>Our network consists of sparcstation 1+ and IPC's running either 4.0.1, 4.1 or
>4.1.1, and sun3's running 4.0.3. In the near future we will also be using
>DG's aviion/UX workstations.
Good for you; DG has done a splendid job of fixing their lock manager.
Rob T
--
Rob Thurlow, thurlow at convex.com
An employee and not a spokesman for Convex Computer Corp., Dallas, TX
More information about the Comp.unix.wizards
mailing list