
Re: ISO Latin Strings



> Do you know where I could get some kind of encoder/decoder thing? So
> that I could store it as something that is supported in current
> releases, and then map it back to an ISO Latin string when I retrieve
> it?

Here are the scripts I use.  Replace /local/bin/perl5 with the pathname
of a perl 5 release on your machine.

-------- iso2utf --------
#!/local/bin/perl5 -wp
# Convert iso8859-1 to UTF-8
s/([\200-\377])/pack('CC', 0xC0 + (ord($1) >> 6), (ord($1) & 0xBF))/ge;

-------- utf2iso --------
#!/local/bin/perl5 -p
# Convert UTF-8 to iso8859-1.  Anything that cannot be converted is output
# as "{" "UTF8:" utf8-char "}" and the script reports an error.

# Note:	Do not remove the code that checks if the UTF-8 sequence
#	is valid and can be converted to iso8859-1.
#	Otherwise filters like this can be fooled into converting
#	some 8-bit chars to \0 or control characters.

# The first octet of a non-ASCII char should be [\300-\377].  However, we
# check for [\200-...] in case the input starts in the middle of a UTF-8 char.
s/([\200-\377][\200-\277]*)/&utf2iso($1)/ge;

sub utf2iso {
    my($first,@rest) = unpack('C*', $_[0]);
    $first -= 0xC2;
    if (@rest != 1 || ($first & ~1)) {
	warn "\nutf2iso: Non-iso8859-1 character(s) in text.\n" unless $w++;
	return "{UTF8:$_[0]}";
    }
    chr($first * 0x40 + $rest[0]);
}

END {
    warn "utf2iso: $w non-iso8859-1 characters.\n" if $w && $w > 1;
    $? = 1 if $w;	# exit non-zero even if only one character failed
}
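
If you want to check the arithmetic before trusting the filters, here is a
minimal sketch that wraps the same two substitutions in functions and runs a
round trip.  The names iso2utf_str and utf2iso_str are made up for this
example and are not part of the scripts above; the interpreter path is also
an assumption, so adjust it as for the other two.

```perl
#!/usr/bin/perl
# Sketch only: the same conversions as the two filters, as functions.
# iso2utf_str and utf2iso_str are hypothetical names for this example.
use strict;
use warnings;

# iso8859-1 -> UTF-8: each octet 0x80..0xFF becomes a two-octet sequence.
sub iso2utf_str {
    my ($s) = @_;
    $s =~ s/([\200-\377])/pack('CC', 0xC0 + (ord($1) >> 6), ord($1) & 0xBF)/ge;
    return $s;
}

# UTF-8 -> iso8859-1: only a lead octet of C2 or C3 followed by exactly one
# continuation octet is accepted.  Anything else is wrapped in {UTF8:...}.
# Returns the converted string and the number of rejected sequences.
sub utf2iso_str {
    my ($s) = @_;
    my $bad = 0;
    my $conv = sub {
        my ($first, @rest) = unpack('C*', $_[0]);
        $first -= 0xC2;
        if (@rest != 1 || ($first & ~1)) {
            $bad++;
            return "{UTF8:$_[0]}";
        }
        return chr($first * 0x40 + $rest[0]);
    };
    $s =~ s/([\200-\377][\200-\277]*)/$conv->($1)/ge;
    return ($s, $bad);
}

my $latin1 = "caf\xE9";               # "cafe" with e-acute, in iso8859-1
my $utf8   = iso2utf_str($latin1);    # 0xE9 becomes 0xC3 0xA9

# Show the UTF-8 octets in hex.
print join(' ', map { sprintf '%02X', $_ } unpack('C*', $utf8)), "\n";

my ($back, $bad) = utf2iso_str($utf8);
print $back eq $latin1 && !$bad ? "round trip ok\n" : "round trip FAILED\n";

# An overlong encoding of NUL (C0 80) must be rejected, not decoded to \0.
my ($out, $rejected) = utf2iso_str("\xC0\x80");
print "rejected sequences: $rejected\n";
```

The last case is the reason for the note in utf2iso: without the validity
check, the sequence C0 80 would come out as a \0 character.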