sed script to remove cr/lf except at paragraph breaks

Ubben Greg bink at aplcen.apl.jhu.edu
Mon May 22 04:52:01 AEST 1989


In article <119 at sherpa.UUCP> rac at sherpa.UUCP (Roger A. Cornelius) writes:
> I'm in need of a sed script to remove MSDOS cr/lf (actually replace each
> cr/lf combination with one space) except at the start of a paragraph.
> i.e. only the cr/lf preceding a paragraph break should remain.  Paragraphs
> are marked only by four leading spaces and nothing else.
>
> Here's where I am now:
>
> N
> h
> /\n    /{
> P
> D
> }
> s/^M\n/ /g

The h here is useless, because you never use G, g, or x to get the text back.
The problem with using N to gather an arbitrary number of lines in the pattern
space is that SED prints and empties the pattern space at the end of each cycle
(unless the cycle ends in a D that leaves text behind), so once the script runs
off its end you are back to a single line.  You must code an explicit loop:

	: loop
	$q
	N
	/\n    /{ P; D; }
	s/^M\n/ /
	b loop
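
For anyone who wants to try the loop, here is a rough test harness.  It is
not part of the script itself: it assumes a POSIX shell, join_loop and sample
are made-up names, and the ^M is supplied as a real carriage return through
the CR variable.  Each sed command gets its own -e so that no sed has to
parse a label or a closing brace after a semicolon.

```shell
#!/bin/sh
# Sketch only: join_loop and sample are illustrative names, not from the post.
CR=$(printf '\r')    # literal carriage return (the ^M in the script)

# The explicit-loop script, one command per -e for portability.
join_loop() {
    sed -e ':loop' \
        -e '$q' \
        -e 'N' \
        -e '/\n    /{' -e 'P' -e 'D' -e '}' \
        -e "s/${CR}\\n/ /" \
        -e 'b loop'
}

# Two DOS-style paragraphs; the first is wrapped over three lines.
sample() {
    printf '    First paragraph,\r\nwrapped over\r\nthree lines.\r\n'
    printf '    Second paragraph,\r\nalso wrapped.\r\n'
}

# od -c makes the surviving \r \n pairs visible.
sample | join_loop | od -c
```

Each paragraph should come out as a single line, and the only cr/lf pairs
left are the ones that end a paragraph.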

Also, the $q is needed because SED will stop dead without printing the pattern
space if an N (or n) is attempted on the last line of the input.  If you don't
care for "gotos" (or correctness), here's an alternative method that makes use
of the hold space and SED's natural cycle for looping:

	/^    /!{ H; $!d; }
	x
	1d
	s/^M\n/ /g

Since this algorithm is based on the transition BETWEEN two paragraphs, the
1d and $! are necessary to handle the special cases of the first and last
lines (and even then it doesn't work right when the first line is not the
beginning of a paragraph or the last line IS the beginning of a paragraph).
This problem requires a 1-line look-ahead, and in general, the x command is
a good way to implement this in SED.
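
To see the last-line caveat in action, here is a small sketch (POSIX shell
assumed; join_hold and sample_tail are invented names, and CR stands in for
the ^M).  The input ends with a one-line paragraph, so that line is swapped
into the hold space on the final cycle and never comes back out.

```shell
#!/bin/sh
# Sketch only: join_hold and sample_tail are illustrative names.
CR=$(printf '\r')    # literal carriage return (the ^M in the script)

# The hold-space script, one command per -e for portability.
join_hold() {
    sed -e '/^    /!{' -e 'H' -e '$!d' -e '}' \
        -e 'x' \
        -e '1d' \
        -e "s/${CR}\\n/ /g"
}

# The LAST line is itself the beginning of a paragraph.
sample_tail() {
    printf '    First paragraph,\r\nwrapped.\r\n'
    printf '    Lonely last paragraph.\r\n'
}

# Only the first paragraph is printed; the last one is left in the hold space.
sample_tail | join_hold | od -c
```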

> This works correctly for the first match, ie beginning of a paragraph,
> but for all other lines, the substitution of a space for cr/lf only
> works correctly for the first occurrence in the line (the g flag seems
> to have no effect).  But there are two occurrences due to the N function.

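That's because you're never gathering more than 2 lines in the pattern space
at once: the cycle ends, as explained above, before a third line can be
appended, so there is only ever one embedded cr/lf for the g flag to find.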
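
The two-line limit is easy to reproduce.  This sketch runs the quoted script
unchanged over a single four-line paragraph (POSIX shell assumed; join_pairs
and one_para are invented names, and CR stands in for the ^M).  An even
number of lines is used so that N never runs out of input.

```shell
#!/bin/sh
# Sketch only: join_pairs and one_para are illustrative names.
CR=$(printf '\r')    # literal carriage return (the ^M in the script)

# The script from the question, one command per -e.
join_pairs() {
    sed -e 'N' \
        -e 'h' \
        -e '/\n    /{' -e 'P' -e 'D' -e '}' \
        -e "s/${CR}\\n/ /g"
}

# One paragraph wrapped over four lines.
one_para() {
    printf '    one\r\ntwo\r\nthree\r\nfour\r\n'
}

# Lines are joined in pairs; the cr/lf between "two" and "three" survives
# because it falls on a cycle boundary.
one_para | join_pairs | od -c
```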

> How can I match (and substitute for) the terminating nl in the pattern
> space?  The sed man pages concerning addresses say you can't.  What am
> I missing or how can I get around this?

The terminating newline can only be matched by a $ because it is not really
there -- it is always tacked on when the line is output.
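
A three-line experiment shows the difference, assuming a POSIX shell; the
variable names are made up for the demo.  A \n in the pattern only ever
matches a newline that N (or G, or H followed by x) has put inside the
pattern space.

```shell
#!/bin/sh
# Sketch only: the variable names are illustrative.

# \n cannot see the terminating newline, so nothing is replaced.
no_match=$(printf 'abc\n' | sed -e 's/\n/X/')

# $ matches the end of the pattern space, just before the newline is added.
at_end=$(printf 'abc\n' | sed -e 's/$/X/')

# After N the newline between the two lines is embedded, and \n matches it.
embedded=$(printf 'abc\ndef\n' | sed -e 'N' -e 's/\n/+/')

printf '%s\n' "$no_match" "$at_end" "$embedded"
# prints:
#   abc
#   abcX
#   abc+def
```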

						-- Greg Ubben "A SED fanatic"
						   bink at aplcen.apl.jhu.edu


