
When I use Perl or C to printf some data, I use the format specifier to control the width of each column, like

printf("%-30s", str);

But when str contains Chinese characters, the columns don't align as expected; see the attached picture.

My Ubuntu locale is zh_CN.utf8. As far as I know, UTF-8 encodes each character in 1 to 4 bytes, and a Chinese character takes 3 bytes. In my tests, I found that printf's width specifier counts a Chinese character as 3, but it actually displays as 2 ASCII-width columns.

So the real display width is not the constant I expected but a variable that depends on the number of Chinese characters, i.e.

Sw(x) = 1 * (w - 3x) + 2 * x = w - x

where w is the expected width, x is the number of Chinese characters, and Sw(x) is the actual display width.

So the more Chinese characters str contains, the narrower it displays. For example, with w = 30 and x = 5, the field occupies only 30 - 5 = 25 columns on screen.

How can I get what I want? Do I have to count the Chinese characters before calling printf?

As far as I know, all Chinese characters (and, I guess, all wide characters) display as 2 columns, so why does printf count them as 3? UTF-8's byte length has nothing to do with display width.
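Here is a minimal sketch in C of what I mean, assuming the source file is saved as UTF-8 and the terminal uses a UTF-8 locale (the item strings are just arbitrary examples):

    #include <stdio.h>

    /* printf's field width counts bytes, not display columns, so rows
     * containing CJK text come out narrower than the ASCII-only rows. */
    int main(void)
    {
        const char *items[] = { "pears", "hamburger", "包子", "寿司" };
        for (int i = 0; i < 4; i++)
            printf("%-12s| price\n", items[i]);
        return 0;
    }

The ASCII-only rows line up, but the rows with CJK text come out narrower than the requested field width.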

  • In other words, you're looking for a multibyte-aware version of printf for Perl and/or C?
    – deceze
    Commented May 25, 2012 at 9:32
  • I've never done UTF-8 decoding in C, but here's some Go code that counts runes in a UTF-8 string: golang.org/src/pkg/unicode/utf8/utf8.go?s=4824:4876#L202 Commented May 25, 2012 at 9:33
  • @dystroy It isn’t just a matter of counting the code points (i.e., runes). Rather, it is taking into account that different code points represent 0, 1, or 2 print columns per UAX#11, and this is fairly subtle, especially with the East_Asian_Width=Ambiguous characters. I don’t know of any Go library that deals with this the way the Perl library described in my answer does, but if there is such a thing for Go, I’d love to learn about it! Thanks.
    – tchrist
    Commented May 26, 2012 at 7:40
  • @tchrist : I learned something. And I just ran a test: "go fmt" doesn't correctly format structs containing "long" characters. So I guess there are still imperfections in Go's handling of the gigantic beast that is Unicode... Commented May 26, 2012 at 7:59
  • Display width (number of screen positions), number of characters and number of bytes are three different things. printf only cares about the number of bytes. If you want to take into account the number of characters, use wprintf (remember, it takes a wchar_t* format). There's no formatting function in C that takes into account display width. Commented May 26, 2012 at 8:00

1 Answer


Yes, this is a problem with all versions of printf that I am aware of. I briefly discuss the matter in this answer and also in this one.

For C, I do not know of a library that will do this for you, but if any library has it, it would be ICU.
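If pulling in ICU is too much, a rough sketch of the do-it-yourself approach is to convert to wide characters, ask POSIX wcswidth(3) for the column count, and pad by hand. This is only an illustration, not ICU: the helper names below are made up, and wcswidth is only as good as your locale's width tables, so it does not address the East_Asian_Width=Ambiguous subtleties.

    #define _XOPEN_SOURCE 700   /* for wcwidth/wcswidth on glibc */
    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <wchar.h>

    /* Number of terminal columns a UTF-8 string occupies, or -1 on
     * conversion failure or non-printable characters. Locale-dependent. */
    static int display_columns(const char *s)
    {
        size_t n = mbstowcs(NULL, s, 0);
        if (n == (size_t)-1)
            return -1;
        wchar_t *w = malloc((n + 1) * sizeof *w);
        if (w == NULL)
            return -1;
        mbstowcs(w, s, n + 1);
        int cols = wcswidth(w, n);
        free(w);
        return cols;
    }

    /* Left-justify s in `width` display columns, padding with spaces. */
    static void print_padded(const char *s, int width)
    {
        int cols = display_columns(s);
        int pad  = (cols >= 0 && cols < width) ? width - cols : 0;
        printf("%s%*s", s, pad, "");
    }

    int main(void)
    {
        setlocale(LC_ALL, "");   /* e.g. zh_CN.utf8 */
        const char *items[] = { "pears", "包子", "シュークリーム", "hamburger" };
        for (int i = 0; i < 4; i++) {
            print_padded(items[i], 20);
            printf("| aligned\n");
        }
        return 0;
    }

That covers the simple CJK case from the question; the Perl approach below handles the trickier cases more thoroughly.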

For Perl, you have to use the Unicode::GCString module from CPAN to calculate the number of print columns a Unicode string will take up. This takes into account Unicode Standard Annex #11: East Asian Width.

For example, some code points take up 1 column and others take up 2 columns. There are even some that take up no columns at all, like combining characters and invisible control characters. The class has a columns method that returns how many columns the string takes up.

I have an example of using this for aligning Unicode text vertically here. It will sort a bunch of Unicode strings, including some with combining characters and “wide” Asian ideograms (CJK characters), and allow you to align things vertically.

[sample terminal output]

Code for the little umenu demo program, which prints that nicely aligned output, is included below.

You might also be interested in the far more ambitious Unicode::LineBreak module, of which the aforementioned Unicode::GCString class is just a smaller component. This module is much cooler, and takes into account Unicode Standard Annex #14: Unicode Line Breaking Algorithm.

Here’s the code for the little umenu demo, tested on Perl v5.14:

 #!/usr/bin/env perl
 # umenu - demo sorting and printing of Unicode food
 #
 # (obligatory and increasingly long preamble)
 #
 use utf8;
 use v5.14;                       # for locale sorting
 use strict;
 use warnings;
 use warnings  qw(FATAL utf8);    # fatalize encoding faults
 use open      qw(:std :utf8);    # undeclared streams in UTF-8
 use charnames qw(:full :short);  # unneeded in v5.16

 # std modules
 use Unicode::Normalize;          # std perl distro as of v5.8
 use List::Util qw(max);          # std perl distro as of v5.10
 use Unicode::Collate::Locale;    # std perl distro as of v5.14

 # cpan modules
 use Unicode::GCString;           # from CPAN

 # forward defs
 sub pad($$$);
 sub colwidth(_);
 sub entitle(_);

 my %price = (
     "γύρος"             => 6.50, # gyros, Greek
     "pears"             => 2.00, # like um, pears
     "linguiça"          => 7.00, # spicy sausage, Portuguese
     "xoriço"            => 3.00, # chorizo sausage, Catalan
     "hamburger"         => 6.00, # burgermeister meisterburger
     "éclair"            => 1.60, # dessert, French
     "smørbrød"          => 5.75, # sandwiches, Norwegian
     "spätzle"           => 5.50, # Bayerisch noodles, little sparrows
     "包子"              => 7.50, # bao1 zi5, steamed pork buns, Mandarin
     "jamón serrano"     => 4.45, # country ham, Spanish
     "pêches"            => 2.25, # peaches, French
     "シュークリーム"    => 1.85, # cream-filled pastry like éclair, Japanese
     "막걸리"            => 4.00, # makgeolli, Korean rice wine
     "寿司"              => 9.99, # sushi, Japanese
     "おもち"            => 2.65, # omochi, rice cakes, Japanese
     "crème brûlée"      => 2.00, # tasty broiled cream, French
     "fideuà"            => 4.20, # more noodles, Valencian (Catalan=fideuada)
     "pâté"              => 4.15, # gooseliver paste, French
     "お好み焼き"        => 8.00, # okonomiyaki, Japanese
 );

 my $width = 5 + max map { colwidth } keys %price;

 # So the Asian stuff comes out in an order that someone
 # who reads those scripts won't freak out over; the
 # CJK stuff will be in JIS X 0208 order that way.
 my $coll  = new Unicode::Collate::Locale locale => "ja";

 for my $item ($coll->sort(keys %price)) {
     print pad(entitle($item), $width, ".");
     printf " €%.2f\n", $price{$item};
 }

 sub pad($$$) {
     my($str, $width, $padchar) = @_;
     return $str . ($padchar x ($width - colwidth($str)));
 }

 sub colwidth(_) {
     my($str) = @_;
     return Unicode::GCString->new($str)->columns;
 }

 sub entitle(_) {
     my($str) = @_;
     $str =~ s{ (?=\pL)(\S)     (\S*) }
              { ucfirst($1) . lc($2)  }xge;
     return $str;
 }

As you see, the key to making it work in that particular program is this line of code, which just calls other functions defined above, and uses the module I was discussing:

print pad(entitle($item), $width, ".");

That will pad out the item to the given width using dots as the fill character.

Yes, it’s a lot less convenient than printf, but at least it is possible.

