
In The Linux Command Line, William Shotts claims that character ranges can be problematic. See the relevant excerpt below (emphasis mine).

Character Ranges

If you are coming from another Unix-like environment or have been reading some other books on this subject, you may have encountered the [A-Z] and [a-z] character range notations. These are traditional Unix notations and worked in older versions of Linux as well. They can still work, but you have to be careful with them because they will not produce the expected results unless properly configured. For now, you should avoid using them and use character classes instead.

What is he talking about in the last couple of sentences? What do the POSIX standards say about this?

  • The book is freely available for download here.
    – Git Gud
    Commented Mar 10, 2019 at 17:18
  • wildcards are typically used in the area of filename generation; do you see any connection to variables for your question? (I don't, but it's your question)
    – Jeff Schaller
    Commented Mar 10, 2019 at 17:20
  • @JeffSchaller My suspicion stems from the second paragraph here. If you think the tag isn't appropriate here, please let me know and I'll remove it. Also, feel free to remove it yourself. Thanks.
    – Git Gud
    Commented Mar 10, 2019 at 17:24
  • The reference is probably to locale-dependence: see for example Why does [A-Z] match lowercase letters in bash?
    – steeldriver
    Commented Mar 10, 2019 at 17:24
  • @steeldriver Thanks, this is very promising. In my two systems LC_COLLATE isn't even defined. It would be helpful to know more about this variable.
    – Git Gud
    Commented Mar 10, 2019 at 17:27

1 Answer

That most likely refers to locales whose collation order interleaves uppercase and lowercase letters, instead of sorting all of one case before the other:

$ echo "$LANG"
en_US.UTF-8
$ touch a A z Z
$ ls
A  Z  a  z
$ bash -c 'echo [a-z]'
a A z

However, the appropriate character class works:

$ bash -c 'echo [[:lower:]]'
a z

But it might also match more than just a to z:

$ LANG=fi_FI.UTF-8
$ touch ä Ä ö Ö
$ bash -c 'echo [[:lower:]]'
a z ä ö

If you want to avoid that, and only match the English lowercase letters a to z, Bash in particular has an option to interpret the ranges in the ASCII order:

$ bash -c 'shopt -s globasciiranges; echo [a-z]'
a z

And you can always force the C (POSIX) collating order for a single command:

$ LC_COLLATE=C bash -c 'echo [a-z]'
a z
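To make the experiment repeatable, here is a minimal sketch that runs it in a throwaway directory. The variable names demo_dir and c_matches are made up for this illustration. It uses LC_ALL rather than LC_COLLATE because LC_ALL, when set, takes precedence over the individual LC_* variables and LANG, so the result should not depend on what the calling environment exports.

```shell
#!/bin/sh
# Sketch only: demo_dir and c_matches are hypothetical names.
# Work in a scratch directory so the glob has known files to match.
demo_dir=$(mktemp -d) || exit 1
cd "$demo_dir" || exit 1
touch a A z Z

# LC_ALL overrides LC_COLLATE and LANG, so the child bash uses the C
# collation order even if the parent environment sets another locale.
c_matches=$(LC_ALL=C bash -c 'echo [a-z]')
echo "$c_matches"    # prints: a z

# Clean up the scratch directory.
cd / && rm -rf "$demo_dir"
```

Running the glob in an empty directory instead would print the pattern [a-z] unchanged, since an unmatched glob is left as-is by default.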

As for what POSIX says, it seems to me that ranges in bracket expressions are left undefined in locales other than the default POSIX one. The pattern matching description refers to the regex description of bracket expressions, which says:

In the POSIX locale, a range expression represents the set of collating elements that fall between two elements in the collation sequence, inclusive. In other locales, a range expression has unspecified behavior: strictly conforming applications shall not rely on whether the range expression is valid, or on the set of collating elements matched.
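Given that wording, one way to avoid relying on ranges at all is to spell out the wanted characters in the bracket expression. The following is a sketch under that assumption, not something taken from the standard's text; is_ascii_lower is a hypothetical helper name. As far as I can tell, an explicit list is matched character by character and so is not affected by where the letters fall in the collation sequence.

```shell
# Sketch only: is_ascii_lower is a hypothetical helper name.
# Succeeds when $1 is exactly one of the ASCII letters a to z.
# The bracket expression lists every letter, so no range is involved.
is_ascii_lower() {
    case $1 in
        [abcdefghijklmnopqrstuvwxyz]) return 0 ;;
        *) return 1 ;;
    esac
}

# Classify a few sample names.
for name in a A z Z; do
    if is_ascii_lower "$name"; then
        echo "$name: lowercase ASCII letter"
    else
        echo "$name: something else"
    fi
done
```

The same explicit list can be used in a filename glob, at the cost of being more verbose than [a-z].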

  • Thanks. I don't understand how bash -c 'echo [a-z]' works. If I don't run the commands in the specific order that you do in your answer, it just echoes [a-z]. Can you shed some light on that? Also, is there a way to check the definition of each character class?
    – Git Gud
    Commented Mar 10, 2019 at 18:20
  • @GitGud [a-z] is a glob pattern, just like foo*; if you don't have any file starting with foo in the current dir, echo foo* will just echo foo*; if you don't have any file named a, b, c, etc in the current dir, echo [a-z] will echo just [a-z].
    – user313992
    Commented Mar 10, 2019 at 18:29
  • Ah! Of course! Got it. Thanks @UncleBilly
    – Git Gud
    Commented Mar 10, 2019 at 18:31
  • @ikkachu Just to make sure you didn't miss my second question: is there a way to check the definition of each character class?
    – Git Gud
    Commented Mar 10, 2019 at 22:24
  • Even worse, "unspecified behavior" means that it can change at any time. As of Ubuntu 20.04, the lowercase ranges in the en_US.UTF-8 locale behave just like in the POSIX or C locale, unlike in the above example, which outputs (some) uppercase. (The uppercase letter at the end of the range was always missed, most likely an implementation error.)
    – ubfan1
    Commented Nov 2, 2020 at 21:59
