
Re: Wrong mime2 (rfc2047) header encoding.



On Wed, Jul 05, 2000 at 01:21:31PM +0400, Alexey Gadzhiev wrote:
>  Hello,
>  when zmailer attempts to "correct" 8-bit headers, it encodes the words as
>  separate mime2 "encoded words" separated by spaces. This is wrong.
>  Spaces in the original headers should be encoded in the mime2 header as
>  "_" or "=20".
> 
>   From rfc2047:
>   "When displaying a particular header field that contains multiple
>    'encoded-word's, any 'linear-white-space' that separates a pair of
>    adjacent 'encoded-word's is ignored.  (This is to allow the use of
>    multiple 'encoded-word's to represent long strings of unencoded text,
>    without having to separate 'encoded-word's where spaces occur in the
>    unencoded text.)"
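
	The effect of that rule can be reproduced with a short Python
	sketch.  It uses the standard library's email.header module as
	a stand-in for a conforming display agent; the Finnish sample
	text and the decode() helper are illustrative assumptions, not
	anything taken from zmailer.

```python
from email.header import decode_header, make_header


def decode(value: str) -> str:
    """Decode the RFC 2047 encoded-words in a header value to plain text."""
    return str(make_header(decode_header(value)))


# One encoded-word per original word, separated by a space: a conforming
# decoder ignores the whitespace between adjacent encoded-words.
split_words = "=?iso-8859-1?Q?hyv=E4=E4?= =?iso-8859-1?Q?p=E4iv=E4=E4?="

# The space is carried inside a single encoded-word as "_".
one_word = "=?iso-8859-1?Q?hyv=E4=E4_p=E4iv=E4=E4?="

print(decode(split_words))
print(decode(one_word))
```

	With a current Python 3 the first form should come out with
	the two words run together, and the second with the space
	preserved.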

	Ack, that is probably what the current code is violating.
	All in all, "MIME2" is not a trivial thing to implement,
	which shows up all the time around the world when new people
	try to do it and produce invalid results. (*)

	The current code is rarely wrong when 8-bit characters are rare
	in the words of a header, but for KOI8-R I presume the ratio is
	quite the opposite of the ISO-8859-1/Finnish case.
	(And the proper place for MIME-2 encoding is really the user agent,
	 not the MTA, so I have not taken any active interest in the
	 internal MIME-2 encoder code.)

	Reading the relevant code, I think the only sensible approach
	is to scan through the entire header, tokenize it into
	whitespace / ok-as-is-text / must-encode-mime2 / unencodable-separator
	tokens, and then recombine them with possible conversions.  That way
	a header like this ("8" represents an 8-bit character):

		To: "8888888" (some 888 text) <foo@bar> (888 888 text)

	will be converted to (schematically, with the charset and encoding
	fields left out):

		To: =?"8888888"?= (some =?888?= text) <foo@bar> (=?888_888?= text)

	plus applying the token folding rules, injecting folding
	whitespace if the line length exceeds 78 characters (columns).
	(Converting a long ok-as-is text TTTTTTTTTTTTTTT into a
	 =?TTTTTTTTT?= =?TTTT?= token pair is semi-dangerous, as such
	 code would have to recognize when it is scanning ADDRESS entities
	 and not mere TEXT entities, i.e. it would have to contain a full
	 RFC 822 scanner.
	 It is better to fold in front of such long text, and possibly
	 suffer if the length of the non-whitespace sequence exceeds 78 chars
	 and the 78-char line length limit bites at some point.)
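
	The tokenize-and-recombine idea can be sketched for the simple
	case of an unstructured (*text) field such as Subject.  This is
	a minimal Python illustration under stated assumptions, not the
	zmailer code: the names tokenize(), q_encode() and
	encode_unstructured() are hypothetical, the charset is assumed
	to be known to the caller, and neither line folding, the
	75-character limit on an encoded-word, nor the comments and
	quoted strings of structured fields are handled.

```python
import re


def q_encode(raw: bytes) -> str:
    """Q-encode bytes for an encoded-word body; a space becomes '_'."""
    out = []
    for b in raw:
        ch = chr(b)
        if ch == " ":
            out.append("_")
        elif b < 0x80 and ch.isalnum():
            out.append(ch)
        else:
            out.append("=%02X" % b)
    return "".join(out)


def tokenize(raw: bytes):
    """Split a field body into ('ws' | 'plain' | 'encode', bytes) tokens."""
    tokens = []
    for m in re.finditer(rb"[ \t]+|[^ \t]+", raw):
        piece = m.group()
        if piece[:1] in (b" ", b"\t"):
            kind = "ws"
        elif any(b >= 0x80 for b in piece):
            kind = "encode"      # contains at least one 8-bit character
        else:
            kind = "plain"       # 7-bit text, passed through as-is
        tokens.append((kind, piece))
    return tokens


def encode_unstructured(raw: bytes, charset: str = "iso-8859-1") -> str:
    """Encode 8-bit words, keeping inner whitespace inside the encoded-word."""
    tokens = tokenize(raw)
    out = []
    i = 0
    while i < len(tokens):
        kind, piece = tokens[i]
        if kind != "encode":
            out.append(piece.decode("ascii"))
            i += 1
            continue
        # Absorb every following "whitespace + must-encode" pair, so the
        # whitespace is encoded instead of separating two encoded-words.
        run = piece
        j = i + 1
        while (j + 1 < len(tokens)
               and tokens[j][0] == "ws"
               and tokens[j + 1][0] == "encode"):
            run += tokens[j][1] + tokens[j + 1][1]
            j += 2
        out.append("=?%s?Q?%s?=" % (charset, q_encode(run)))
        i = j
    return "".join(out)


print(encode_unstructured(b"hyv\xe4\xe4 p\xe4iv\xe4\xe4 to all"))
```

	Two adjacent 8-bit words end up in one encoded-word with the
	space written as "_", while plain 7-bit words and the
	whitespace around them pass through untouched.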

	It seems to require an RFC 822 token scanner, which (as I recall) is
	context sensitive :-(  Ah, indeed:
		RFC 2047, section 5, rule (2):

    It is important to note that 'comment's are only recognized inside
    "structured" field bodies.  In fields whose bodies are defined as
    '*text', "(" and ")" are treated as ordinary characters rather than
    comment delimiters, and rule (1) of this section applies.  (See RFC
    822, sections 3.1.2 and 3.1.3)
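
	The context sensitivity can be illustrated with a small Python
	sketch that locates comments only when the field is a
	structured one.  The STRUCTURED set and the comment_spans()
	name are assumptions made for the illustration; the set is not
	exhaustive, and the scanner is much simpler than a full RFC 822
	one (it ignores domain literals, for example).

```python
# Assumption: an illustrative, non-exhaustive set of structured fields.
STRUCTURED = {"from", "to", "cc", "bcc", "sender", "reply-to"}


def comment_spans(field_name: str, body: str):
    """Return (start, end) slices of the top-level comments in body.

    For fields not listed as structured, "(" and ")" are ordinary
    characters, so no comments are reported at all.
    """
    if field_name.lower() not in STRUCTURED:
        return []
    spans = []
    depth = 0            # nesting level of "(" ... ")"
    in_quotes = False    # inside a "quoted string"
    start = 0
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and (in_quotes or depth):
            i += 2       # quoted-pair: skip the escaped character
            continue
        if depth == 0 and ch == '"':
            in_quotes = not in_quotes
        elif not in_quotes:
            if ch == "(":
                if depth == 0:
                    start = i
                depth += 1
            elif ch == ")" and depth:
                depth -= 1
                if depth == 0:
                    spans.append((start, i + 1))
        i += 1
    return spans


body = '"8888888" (some 888 text) <foo@bar> (888 888 text)'
print([body[a:b] for a, b in comment_spans("To", body)])
print(comment_spans("Subject", "release (final) notes"))
```

	The same parenthesized text is a comment in a To: field and
	ordinary text in a Subject: field, which is why an encoder
	cannot decide what to encode without knowing which field it is
	scanning.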


	Yes, this is probably why I haven't spent much time implementing
	it.  It really should be done in the router, not in the transport
	agents.

	For that matter, message MIME structure analysis for body conversions
	should also be done in the router, and not in the transport agents for
	each recipient's transmission.

>   Alexey

-- 
/Matti Aarnio	<mea@nic.funet.fi>

(*): The latest invalid encoding result I have seen was something like:
	To:  =?xxxx?Q?ttttttt?= <foo@bar> =?xxxx?Q?(foo@bar)?=
     The last token on the line carries an @ character, so it looks
     like a piece of an address, while it *really* was a comment.
     The proper encoding would have been:
	To: =?xxxx?Q?ttttttt?= <foo@bar> (=?xxxx?Q?foo@bar?=)