use Ru::Text;

$text = "cp866-coded russian text";
$to_windows = alt_win $text;        # alt - alias for cp866
                                    # win - alias for cp1251

$Ru::Text::ENCODING = 'win';
$upper_case = uc $to_windows;       # 'uc' using cp1251 russian letters

$Ru::Text::ENCODING = 'koi8r';
$to_koi8r = cp866_koi8r $text;
$capitalise = cap $to_koi8r;        # capitalization using koi8r russian letters

undef $Ru::Text::ENCODING;
$text = "text without cyrillic letters";
$text =~ s/[^\s]+/ucfirst $&/ge;    # again 'ucfirst' knows only English

$Ru::Text::ENCODING = 'cp1251';
$text = "some english-russian-cp1251 mixture";
@eng = $text =~ /([^$RUS_HIGH$RUS_LOW\s]+)/g;
# print out non-russian words
print "@eng \n";

@words = $text =~ /([$RUS_w]+)/g;   # all alphanumeric 'words'
print "@words \n\n";                # including cyrillic
The currently allowed values are alt, cp866, win, win1251, cp1251, koi8r, iso, iso88595 and mac.
Codetable   Aliases        Meaning
cp866       alt            Russian Alternative (CP866) Codepage
cp1251      win, win1251   MS windows-1251 Codepage
koi8r       -              Relcom's KOI8-R (RFC 1489)
iso88595    iso            ECMA ISO-8859-5 Cyrillic Standard
mac         -              Apple MAC Cyrillic (Ukrainian)
For example, if $text contains text in the iso88595 encoding, iso_koi8r($text) converts it to koi8r. So does iso88595_koi8r($text), since a codetable name and its alias are interchangeable in function names.
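The naming scheme above can be imitated with the core Encode module. The following is only a sketch of the idea, not the way Ru::Text is implemented: the %ENC table and the loop are illustrative assumptions, the byte tables come from Encode rather than from this module, and mac is left out because the matching Encode name would be a guess.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical table: codetable name or alias => Encode encoding name.
my %ENC = (
    alt   => 'cp866',      cp866    => 'cp866',
    win   => 'cp1251',     win1251  => 'cp1251',  cp1251 => 'cp1251',
    koi8r => 'koi8-r',
    iso   => 'iso-8859-5', iso88595 => 'iso-8859-5',
);

# Install one converter per (source, destination) pair, named "<from>_<to>".
for my $from (keys %ENC) {
    for my $to (keys %ENC) {
        next if $ENC{$from} eq $ENC{$to};    # skip same-encoding pairs
        my ($src, $dst) = ($ENC{$from}, $ENC{$to});
        no strict 'refs';
        *{"main::${from}_${to}"} = sub {
            my ($octets) = @_;               # work on a copy of the argument
            Encode::from_to($octets, $src, $dst);
            return $octets;
        };
    }
}

# Both spellings reach the same conversion (iso-8859-5 to koi8-r).
my $iso_text = "\xB0\xD0";                   # Cyrillic "Aa" in iso-8859-5
my $koi_a = main::iso_koi8r($iso_text);
my $koi_b = main::iso88595_koi8r($iso_text);
print "same result\n" if $koi_a eq $koi_b;
```

The functions are called with parentheses here because they are created at run time; the list-operator style used in the synopsis needs names that are known at compile time.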
Setting $Ru::Text::ENCODING lets you adjust the built-in functions to understand Cyrillic letters in a given encoding. You may assign $Ru::Text::ENCODING the same values as are used for codetables, in upper or lower case.
For example,
$Ru::Text::ENCODING = 'CP866';
does a lot of work:
1. uc, lc, ucfirst and lcfirst now treat Russian letters (in cp866) as letters ;-)
2. Some additional variables are exported to the current package, among them $RUS_HIGH, $RUS_LOW and $RUS_w, which appear in the synopsis.
3. Note that all those variables, as well as the changed uc, lc, ucfirst and lcfirst, take effect only in the package where you use'd Ru::Text and defined $Ru::Text::ENCODING. So if you need this behavior in several packages, you must take these actions in each one.
4. To remove the magic variables and return to the default behavior of the uc, lc, ucfirst and lcfirst functions, use:
undef $Ru::Text::ENCODING; # or just
$Ru::Text::ENCODING = '';
5. None of these changes affect \b and \B. Well, it's quite possible to do without them.
6. $Ru::Text::ENCODING also affects the execution of the cap_words() function (see below).
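To make the list above more concrete, here is a minimal sketch of what encoding-aware upper-casing and letter classes amount to for cp866 byte strings. It does not use Ru::Text: uc_cp866, $HIGH and $LOW are hypothetical stand-ins for the changed uc and for variables such as $RUS_HIGH and $RUS_LOW. Only the byte ranges are standard cp866.

```perl
use strict;
use warnings;

# Russian letters in cp866:
#   upper case: 0x80-0x9F and 0xF0 (IO)
#   lower case: 0xA0-0xAF, 0xE0-0xEF and 0xF1 (io)
my $HIGH = join '', map { chr($_) } 0x80 .. 0x9F, 0xF0;
my $LOW  = join '', map { chr($_) } 0xA0 .. 0xAF, 0xE0 .. 0xEF, 0xF1;

# Upper-case English and cp866 Russian letters in one pass.
sub uc_cp866 {
    my ($s) = @_;
    $s =~ tr/a-z\xA0-\xAF\xE0-\xEF\xF1/A-Z\x80-\x9F\xF0/;
    return $s;
}

my $mixed = "perl \xAF\xA5\xE0\xAB";            # "perl" in English, then in cp866 Russian
my $upper = uc_cp866($mixed);

# The class strings interpolate into a character class, as in the synopsis.
my @russian = $mixed =~ /([$HIGH$LOW]+)/g;      # Russian words only
my @english = $mixed =~ /([^$HIGH$LOW\s]+)/g;   # everything else
```

Since \b and \B keep their built-in meaning, the explicit character classes do the work of finding word boundaries.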
"trans Character Encoding Converter Generator" by Kosta Kostis <kosta@kostis.net>. It contains the most carefully adjusted and precise tables that I could find on the net. The MAC (applecyru) table is my own creation; it seems to have no problems.
K. Kostis uses 'strict' conversion, in which a character is replaced only with the same character, never with a similar one. I designed 'applecyru' under the same conditions. So for now there is still no good replacement for, say, cp866 pseudo-graphics.
All characters that have no strict equivalent in the destination codetable appear as '?' in the resulting document.
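The effect of strict conversion can be reproduced with the core Encode module. This is a sketch under assumptions, not this module's code: strict_convert is a hypothetical helper, and the tables are Encode's, which may differ in detail from the Kostis tables.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Convert octets between two single-byte codetables. A character with
# no exact equivalent in the destination becomes '?'.
sub strict_convert {
    my ($octets, $from, $to) = @_;
    my $chars = decode($from, $octets);
    # The code reference is called for each unmappable character;
    # whatever it returns is used in its place.
    return encode($to, $chars, sub { '?' });
}

# cp866: 0x80 is a capital Russian A, 0xC4 is a box-drawing horizontal
# line, 0xA0 is a small Russian a. cp1251 has no box-drawing characters.
my $result = strict_convert("\x80\xC4\xA0", 'cp866', 'cp1251');
printf "%vX\n", $result;    # dotted hex dump of the converted bytes
```

The two letters convert exactly, while the pseudo-graphic byte comes out as a question mark.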
@INC and put it there. NOTE: Please don't distribute this module as plain text, whether by mail or by copy/paste from a browser or editor. It contains binary data that will certainly be corrupted.
Any suggestions are much appreciated.
<ron.wantock@emd.com> and Jan Krynicky <Jenda@mccann.cz> for their great samples of using tie. Thanks to Aleksandr Korostin <kav@kkc1.lipetsk.su>, author of the best Russian coder/converter, Codepage. Thanks to all the guys on the Perl-Win32 mailing list for their creativity and real Perl spirit.
September 10, 1998.
<blazer@mail.nevalink.ru>