NAME

Ru::Text - basic conversions of Russian text between various encodings


SYNOPSIS

    use Ru::Text; 

    $text = "cp866-coded russian text";
    $to_windows = alt_win $text;  # alt - alias for cp866
                                  # win - alias for cp1251

    $Ru::Text::ENCODING = 'win';
    $upper_case = uc $to_windows; # 'uc' using cp1251 russian letters

    $Ru::Text::ENCODING = 'koi8r';
    $to_koi8r = cp866_koi8r $text;
    $capitalise = cap $to_koi8r;   # capitalization using koi8r russian letters

    undef $Ru::Text::ENCODING;
    $text = "text without cyrillic letters";
    $text =~ s/[^\s]+/ucfirst $&/ge;  # again 'ucfirst' knows only English

    $Ru::Text::ENCODING = 'cp1251';
    $text = "some english-russian-cp1251 mixture";
    @eng = $text =~ /([^$RUS_HIGH$RUS_LOW\s]+)/g;  # print out non-russian words
    print "@eng \n";

    @words = $text =~ /([$RUS_w]+)/g; # all alphanumeric 'words'
    print "@words \n\n";              # including cyrillic


DESCRIPTION

Ru::Text provides a large set of functions that convert text from one character set to another. All function names are lowercase and have the form:

from_to ( )
where from and to are different Cyrillic codetables.

The values currently allowed are alt, cp866, win, win1251, cp1251, koi8r, iso, iso88595 and mac.

   Codetable    Aliases         Meaning
   cp866        alt             Russian Alternative (CP866) Codepage
   cp1251       win, win1251    MS windows-1251 Codepage
   koi8r        -               Relcom's KOI8-R (RFC 1489)
   iso88595     iso             ECMA ISO-8859-5 Cyrillic Standard
   mac          -               Apple MAC Cyrillic (Ukrainian)

Examples:
alt_win, cp866_win, cp866_win1251 and alt_cp1251 all name the same conversion.

For example, if $text contains some text in iso88595 encoding, iso_koi8r($text) converts it to koi8r, and so does iso88595_koi8r($text).
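The naming scheme can be illustrated with a short, self-contained sketch that does not load Ru::Text. It expands the alias table above into every from_to name for one conversion. The helper name equivalent_names is hypothetical, and the sketch assumes that every alias combination is actually provided by the module.

```perl
use strict;
use warnings;

# Alias lists copied from the codetable table above.
my %aliases = (
    cp866    => [qw(cp866 alt)],
    cp1251   => [qw(cp1251 win win1251)],
    koi8r    => [qw(koi8r)],
    iso88595 => [qw(iso88595 iso)],
    mac      => [qw(mac)],
);

# Hypothetical helper: list every from_to name that denotes the
# conversion between two canonical codetables.
sub equivalent_names {
    my ($from, $to) = @_;
    my @names;
    for my $f (@{ $aliases{$from} }) {
        for my $t (@{ $aliases{$to} }) {
            push @names, "${f}_${t}";
        }
    }
    return @names;
}

print join(", ", equivalent_names('cp866', 'cp1251')), "\n";
# prints: cp866_cp1251, cp866_win, cp866_win1251, alt_cp1251, alt_win, alt_win1251
```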

Magic
The magic variable $Ru::Text::ENCODING, which is initially undef, lets you make the built-in functions understand Cyrillic letters in a given encoding.

You may assign $Ru::Text::ENCODING any of the codetable names or aliases listed above; the value is case-insensitive.

For example

       $Ru::Text::ENCODING = 'CP866'; 

does a lot of work.

1. uc, lc, ucfirst and lcfirst now treat Russian letters (in cp866) as letters ;-)

2. Some additional variables are being exported to the current package:

$RUS_LOW
A scalar holding the Russian lower-case letters in the chosen charset.

$RUS_HIGH
A scalar holding the Russian upper-case letters in the chosen charset.

$RUS_w
An analogue of \w: the complete set of Russian letters under the chosen ENCODING plus the characters that \w normally matches.

$RUS_W
An analogue of \W: everything that is not in $RUS_w.

These variables can be used in regular expressions inside [ ] to match an alphanumeric or non-alphanumeric byte (see examples).

3. Note that all those variables, as well as the changed uc, lc, ucfirst and lcfirst, take effect only in the package where you use'd Ru::Text and defined $Ru::Text::ENCODING. So, if you need this behavior in several packages, you must take these actions in each one.

4. To remove the magic variables and restore the default behavior of uc, lc, ucfirst and lcfirst, use:

       undef $Ru::Text::ENCODING; # or just 

       $Ru::Text::ENCODING = ''; 

5. None of these changes affect \b and \B. Well, it's quite possible to do without them.

6. $Ru::Text::ENCODING also affects the behavior of the cap_words ( ) function (see below).
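Items 1 and 2 can be pictured with a self-contained sketch that does not load Ru::Text. The variables $rus_high, $rus_low and $rus_w are assumed stand-ins for what the exported variables might contain under cp1251, and uc_cp1251_sketch is a hypothetical byte-wise helper, not the module's own implementation.

```perl
use strict;
use warnings;

# Assumed stand-ins for $RUS_HIGH, $RUS_LOW and $RUS_w under cp1251.
# cp1251 stores the capital letters in bytes 0xC0-0xDF and the small
# letters in 0xE0-0xFF; the letter IO sits apart at 0xA8 / 0xB8.
my $rus_high = join '', map { chr($_) } 0xC0 .. 0xDF, 0xA8;
my $rus_low  = join '', map { chr($_) } 0xE0 .. 0xFF, 0xB8;
my $rus_w    = 'a-zA-Z0-9_' . $rus_high . $rus_low;

# Hypothetical helper: byte-wise upper-casing of cp1251 text.
sub uc_cp1251_sketch {
    my ($s) = @_;
    $s =~ tr/a-z\xE0-\xFF\xB8/A-Z\xC0-\xDF\xA8/;
    return $s;
}

# Hex dump so the result is readable on any terminal.
sub hex_bytes { join ' ', map { sprintf '%02X', ord($_) } split //, $_[0] }

# "Perl", then the same word in small Russian letters (cp1251), then "5".
my $text  = "Perl \xEF\xE5\xF0\xEB 5";
my @words = $text =~ /([$rus_w]+)/g;                 # all three words
my @eng   = $text =~ /([^$rus_high$rus_low\s]+)/g;   # only "Perl" and "5"

print scalar(@words), " words, ", scalar(@eng), " non-Russian\n";
print hex_bytes(uc_cp1251_sketch($words[1])), "\n";  # prints: CF C5 D0 CB
```

Because the stand-ins are plain byte strings, they only make sense inside [ ], exactly as described for the exported variables.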

Additional functions
For now only two functions are provided (and exported):

cap ( )
Capitalize. Tries to make the first character of each word upper-case and all the others lower-case. It is rather naive: given '!word', it will treat ! as the first letter.

cap_words ( )
The same as cap ( ) but much better. It finds real alphanumeric words based on \w or, if $Ru::Text::ENCODING is defined, on $RUS_w.
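The difference between the two can be sketched in plain Perl with ASCII text. The functions cap_sketch and cap_words_sketch below are hypothetical approximations, assuming that cap ( ) splits on whitespace and cap_words ( ) on \w; the module's real code may differ.

```perl
use strict;
use warnings;

# Assumed behavior of cap ( ): a "word" is any run of non-space bytes.
sub cap_sketch {
    my ($s) = @_;
    $s =~ s/(\S+)/\u\L$1/g;   # lower-case the run, then upper-case its first char
    return $s;
}

# Assumed behavior of cap_words ( ): a "word" is a run of \w characters.
sub cap_words_sketch {
    my ($s) = @_;
    $s =~ s/(\w+)/\u\L$1/g;
    return $s;
}

print cap_sketch("hello !wORLD"), "\n";        # prints: Hello !world
print cap_words_sketch("hello !wORLD"), "\n";  # prints: Hello !World
```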


TABLES AND METHODS

I got all conversion tables except MAC from the excellent package "trans Character Encoding Converter Generator" by Kosta Kostis <kosta@kostis.net>. It contains the most adjusted and precise tables that I could find on the net.

MAC (applecyru) is my own creation - it seems to work without problems.

K. Kostis uses 'strict' conversion, where a character is replaced only with the same character, never with a similar one. I designed 'applecyru' under the same conditions. So, for now it is still impossible to get a good replacement for, say, the cp866 pseudo-graphics.

All chars that have no strict equivalent in destination codetable appear as '?' in the result document.
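A strict table-driven conversion with the '?' fallback can be sketched as follows. The table is a deliberately partial cp866 to cp1251 mapping that covers only ASCII and the Russian letters; it is an illustration built from the published code pages, not the table shipped with the module, and the function name cp866_to_cp1251_sketch is hypothetical.

```perl
use strict;
use warnings;

# Partial cp866 -> cp1251 table. Every byte starts out unmapped ('?').
my @table = ('?') x 256;
$table[$_] = chr($_)        for 0x00 .. 0x7F;  # ASCII is identical
$table[$_] = chr($_ + 0x40) for 0x80 .. 0xAF;  # capitals and the first 16 small letters
$table[$_] = chr($_ + 0x10) for 0xE0 .. 0xEF;  # the remaining 16 small letters
$table[0xF0] = chr(0xA8);                      # capital IO
$table[0xF1] = chr(0xB8);                      # small io

# Hypothetical helper: replace each byte by its table entry.
sub cp866_to_cp1251_sketch {
    my ($s) = @_;
    return join '', map { $table[ ord($_) ] } split //, $s;
}

# Four Russian letters in cp866 followed by a box-drawing byte (0xC4),
# which has no strict cp1251 equivalent in this table.
my $out = cp866_to_cp1251_sketch("\x8F\xA5\xE0\xAB \xC4");
print join(' ', map { sprintf '%02X', ord($_) } split //, $out), "\n";
# prints: CF E5 F0 EB 20 3F
```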


INSTALLATION

As this is just a plain module, no special installation is needed. Just create a subdirectory /Ru in one of the directories in your @INC and put the module there.

NOTE: Please, don't distribute this module as plain text - by mail or by Copy/Paste from browser or editor. It contains binary data that will definitely be corrupted.


CAVEATS

This module has been created and tested in a Win32 environment. Although I expect it to function correctly on other platforms, that fact has not been confirmed.


TODO

Some additional code tables are planned for the near future (KOI7, Translit and maybe Unicode), and also some simple recognition methods. The idea of making the conversion tables more user-friendly by allowing ``conversion-to-similar'' instead of ``strict'' conversion is still far from being well understood. Sorting.

Any suggestions are much appreciated.


BUGS

Please report.


CREDITS

Thanks to Ron Wantock <ron.wantock@emd.com> and Jan Krynicky <Jenda@mccann.cz> for their great samples of using tie. Thanks to Aleksandr Korostin <kav@kkc1.lipetsk.su>, author of the best russian coder/converter Codepage.

Thanks to all the guys on the Perl-Win32 mailing list for their creativity and real Perl spirit.


VERSION

This man page documents ``Ru::Text'' version 0.01.

September 10, 1998.


AUTHOR

Mike Blazer <blazer@mail.nevalink.ru>


COPYRIGHT

Copyright (C) 1998 by Mike Blazer. All rights reserved.


LICENSE

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself.