7
\$\begingroup\$

As part of an assignment, I wrote this helper function to convert a specified name to a required format. This function simply converts a string in the format "this-string" to "This String". I was wondering if there was any simpler way to do this, as this seems a bit "clunky".

For this specific assignment, we load names from a .csv file where they are formatted as "ability-name", and are required to output them as "Ability Name".

While this might conflict with some answers using a different C compiler, I am compiling this code with -std=gnu11.

#include <ctype.h> // toupper
#include <string.h> // NULL
void format_name(char* name) {

    if (name == NULL) return;

    int next_index = 0;
    for (int i = 0; name[i] != '\0'; i++) {
        if (i == 0) {
            name[i] = (char)toupper(name[i]);
        }
        if (name[i] == '-') {
            next_index = i + 1;
            name[i] = ' ';
        }
        if (i == next_index) {
            name[i] = (char)toupper(name[i]);
        }
    }

}

An example of how this function can be used:

#include <stdio.h>
int main(int argc, char** argv) {

    (void)argc; (void)argv; // Silence -Wunused-value

    char string1[12] = "this-string";
    format_name(string1);
    printf("String1: %s\n", string1);
    
    char string2[22] = "this-string-is-longer";
    format_name(string2);
    printf("String2: %s\n", string2);

}
\$\endgroup\$
6
  • 2
    \$\begingroup\$ OT: Why are the source strings peppered with hyphens instead of conventional spaces? \$\endgroup\$
    – Fe2O3
    Commented Jun 3 at 22:09
  • 2
    \$\begingroup\$ @Fe2O3 For this specific assignment, we load names from a .csv file where they are formatted as "ability-name", and are required to output them as "Ability Name". \$\endgroup\$
    – Linny
    Commented Jun 4 at 19:05
  • 2
    \$\begingroup\$ @Linny Thanks for clarifying... Suggest you consider changing function name to titleCase(). Good luck with your studies... And, the function is short, but... Suggest a one line comment: "Replace hyphens with SP and capitalise all words." Helps the reader to understand what they're looking at... \$\endgroup\$
    – Fe2O3
    Commented Jun 4 at 19:24
  • \$\begingroup\$ Just to note: You have to be the judge of right/wrong. When asked to clarify something, IF that clarification does not invalidate any part of any existing answers, it's best to edit your question and add the clarification there. Comments are "ephemeral" and not all users will read down into the comment blocks. Just something to know about using this website. A recently posted answer (below) has explored variations of what might be your test data and use case, so, now, please respond to that answer below that answer (if you want to). Just FYI... Cheers! :-) \$\endgroup\$
    – Fe2O3
    Commented Jun 5 at 19:39
  • 1
    \$\begingroup\$ lower case, UPPER CASE, Title Case, snake_case, camelCase, "just-in-case"... Rapidly becoming something of a basket case :-D \$\endgroup\$
    – Fe2O3
    Commented Jun 5 at 21:47

8 Answers 8

9
\$\begingroup\$

Avoid unhelpful comments:

#include <ctype.h> // toupper
#include <string.h> // NULL

NULL from <string.h>? I would have expected <stddef.h>, but I am aware that the answers on StackOverflow say that <string.h> also has it. Not wrong, but quite unconventional. Pray why include the entirety of string.h for the mere definition of NULL?

And whilst you do not need to mention that toupper() resides in <ctype.h>, it might be a good idea if it lets a future maintainer know when a header is safe to remove. Otherwise, not really needed.

No need to check for a nullptr:

The documentation does not specify whether nullptr is a valid value for this function. Though there is no need to check for it. It is not your fault if some person thought it would be a good idea to format a nullptr, or did not read the documentation of this function (which is missing). The Standard library functions also do not check for it.

Two things you can do:

  • Use array notation for pointer parameters:

    void format_name(char name[static 1]);
    

    static 1 says that name should point to at-least 1 item of type char. GCC will complain on most wrong use cases of this function. Clang and Intel's compiler, not so much. MSVC does not support this C99 feature for some reason, neither is it in C++. Code aiming for portability can define at_least/AT_LEAST to expand to static when available, or nothing.

  • Document that this function would invoke undefined behavior if a nullptr is passed in as an argument. This is redundant in the above case, as an argument that is expected to point to at-least one item can not possibly be a nullptr.

Signed index?

int next_index = 0;

You are indexing into a string, the index can not possible be signed, and an int can not hold all possible lengths. This should be size_t. Same goes for i in the for loop. I would not repeat what has already been mentioned about passing a char to the character-classification functions.

Better way to avoid the warning from -Wunused-value:

int main(int argc, char** argv) {

    (void)argc; (void)argv; // Silence -Wunused-value

It is as trivial as not using this prototype when argc and argv are not going to be used.

Code before C23 should use:

int main(void)

Code in C23 and above could use:

int main();

int main([[maybe_unused]] int argc, [[maybe_unused]] char **argv);

int main(int, char **);

Empty parameters and void now mean the same thing. C23 has introduced attribute specifier sequences. And unnamed parameters have been included too.

Let the compiler do its job:

char string1[12] = "this-string";
char string2[22] = "this-string-is-longer";

Specifying the sizes by counting it ourselves takes time and is error-prone. The compiler can detect it automatically for us. Consider changing it to:

char string1[] = "this-string";
char string2[] = "this-string-is-longer";
\$\endgroup\$
4
  • \$\begingroup\$ @Downvoter What is your issue? \$\endgroup\$
    – Harith
    Commented Jun 4 at 13:59
  • 5
    \$\begingroup\$ Listing what the module actually uses from its headers is a good idea, if it lets a future maintainer know when a header is safe to remove. \$\endgroup\$
    – Davislor
    Commented Jun 4 at 16:47
  • \$\begingroup\$ @Davislor Valid argument, I would add it to my answer. \$\endgroup\$
    – Harith
    Commented Jun 4 at 16:48
  • 2
    \$\begingroup\$ I upvoted, by the way. \$\endgroup\$
    – Davislor
    Commented Jun 4 at 23:22
7
\$\begingroup\$

There are currently four VG answers. Without simply rephrasing or plagiarising that content, I've nothing else to add to their commentary about the OP's code.

However, there is one more suggestion that might satisfy the OP's desire for something "less clunky"; to recognise that mixing two operations into one function makes code both highly specific (not as re-usable as it could be), and a bit more difficult to assess for correctness.

Below, the two operations are separated at the expense of making two traversals of the data instead of one.

#include <stdio.h>
#include <ctype.h>
#include <string.h>

typedef unsigned char uc; // be brief

char *mkUpper( char *pOrg ) { // aka "Title Case"
    for( char *p = pOrg; *p; p += !!*p ) // don't venture beyond '\0'
        if( isalpha( (uc)*p ) ) // 1st char to CAP, then skip adjacent alpha
            for( *p = (uc)toupper((uc)*p); isalpha((uc)*++p); ) {}
    return pOrg;
}

char *subChar( char *pOrg, char c ) {
    for( char *p = pOrg; ( p = strchr( p, c ) ) != NULL ); ) *p++ = ' ';
    return pOrg;
}

char *format_name( char *p ) {
    return p ? mkUpper( subChar( p, '-' ) ) : p;
}

int main( void ) {
    char *s[] = { "this-string", "this-string-is-longer" };

    printf( "String1: %s\n", format_name( s[0] ) );
    printf( "String2: %s\n", format_name( s[1] ) );

    return 0;
}
  • format_name() sequentially invokes two helper functions
  • mkUpper() capitalises the first alpha character of each block of alpha
  • subChar() iteratively replaces one character with a SP.

Now there's a general purpose function to replace a character (any single character) with a SP. And, there is a general purpose function to make CAPITAL the first alpha character of any block of alpha characters. The capitalisation is no longer tied to being prefixed by a hyphen.

I've used the OP's snake_case function name while also using my preferred camelCase for the two helper function names. Tastes differ, and it's easy to adjust in either direction.

De-clunked?


Actually, there is one thing I don't see mentioned elsewhere

..._name(char* name) versus
..._name(char *name)
There are two tribes with differing opinions about where to place the splats. Members of each tribe can-and-will argue the virtues of their preference. I prefer reading code that is reasonably compact and gets to the point.

When first learning C, I made the usual newbie mistake and wrote int* p1, p2; (back in the days of less communicative compilers). I then spent (wasted) MANY hours trying to find my bug.

I recommend putting the splats hard-up against the variable name: int *p1, *p2;, habitually. This practice has precluded repeating my mistake, even in the heat of the battle, completing assigned work on-time. Just a recommendation to the learning OP.


Seeing the two operations separated, it's tempting to put them back together again to make only one traversal of the data. For brevity, this version of the OP's function presumes the caller won't pass a NULL pointer. The only significant difference here is using one loop and two distinct if() blocks:

typedef unsigned char uc; // be brief

char *format_name( char *pOrg ) {
    for( char *p = pOrg; *p; p += !!*p ) { // don't venture beyond '\0'
        if( isalpha( (uc)*p ) ) // 1st char to CAP; skip adjacent alpha
            for( *p = (uc)toupper((uc)*p); isalpha((uc)*++p); ) {}
        if( *p == '-' ) // any hyphen becomes SP
            *p = ' ';
    } // braces... **shiver** :-)

    return pOrg;
}

It's often the simplest challenges that offer the opportunity to highlight basic ideas and enhance one's "skill set". The following has occasionally been recommended to beginners in other Q&A.

Compare these two blocks of two lines of "test harness" code:

    printf( "String1: %s\n", format_name( s[0] ) );
    printf( "String2: %s\n", format_name( s[1] ) );
    printf( "String1: '%s'\n", format_name( s[0] ) );
    printf( "String2: '%s'\n", format_name( s[1] ) );

Here, apostrophes have been used to mark the full length of the processed test strings. The tester is now able to determine if invisible tabs or spaces have somehow made their way onto the ends of the strings. (Use a '|' or '/' instead of apostrophe if desired.)

\$\endgroup\$
14
  • 2
    \$\begingroup\$ @Harith You and the OP should "nut out" the meaning of p += !!*p... The full expression means "add 1 to pointer p if not at end of string, else add 0 so as not to venture too far." The conditional of the for() will then "mop up" and end its looping. No need for an additional if() to break out of the loop... My motto: "Less code is better code" Once the idiom has been seen, its purpose becomes part of the reader's skill set for future use. I didn't learn this from reading a textbook. It comes from reading others' code and wondering "what the heck does this do??" :-) \$\endgroup\$
    – Fe2O3
    Commented Jun 5 at 2:10
  • 1
    \$\begingroup\$ @Harith Some might call it "obfuscated code". I call it "code a professional would understand and, perhaps, adopt." Opinions vary as to where the line is, as always... :-) Cheers! (Casting to bool may not work in all implementations AND requires more typing. :-) \$\endgroup\$
    – Fe2O3
    Commented Jun 5 at 2:13
  • 1
    \$\begingroup\$ Funny: when I wrote my first version of format_name I changed direction midstream and split it into two functions, and then wrote something in my answer about being tempted to split this into two functions. Then I looked at the code after posting and thought that it looked like a mess (my solution, not yours) so I ended up rewriting it as one function again in the end. \$\endgroup\$ Commented Jun 5 at 4:21
  • 1
    \$\begingroup\$ @adabsurdum This, but without the extension about "differing..." :-D \$\endgroup\$
    – Fe2O3
    Commented Jun 5 at 4:25
  • 1
    \$\begingroup\$ "Less code is (usually!) better code" - but there's a whole SE site devoted to pushing a sensible maxim much too far... \$\endgroup\$ Commented Jun 9 at 14:26
7
\$\begingroup\$

converts a string in the format "this-string" to "This String"

  • I'd take this to mean each subset of letters should begin with an uppercase letter.

  • The NULL macro is not needed as code can test with if (!name) (or if (name) for the opposite polarity).
    I do prefer if (name == NULL) over if (!name) as I find == easier to reviewers comprehend than !=.

  • is...() functions are design to work with characters in the unsigned char range (and EOF), which might differ from the char range.

  • Consider using a flag to determine a possible beginning of a word.

  • Pedantic: accessing characters as via unsigned char* matches the approach used by str...() functions. This is important for the nearly extinct non-2's complement encodings for a signed char. This also avoids implementation defined behavior from a char cast.

  • Consider incrementing the pointer and drop the index (which should be of type size_t, not int).

  • Consider returning something useful, like the original pointer.


#include <ctype.h>
#include <stdbool.h>

char *format_name(char* name) {
  unsigned char *uname = (unsigned char *) name;
  if (uname) {
    bool word_beginning = true;

    while (*uname) {
      if (isalpha(*uname)) {
        if (word_beginning) {
          word_beginning = false;
          *uname = (unsigned char) toupper(*uname);
        }
      } else {
        word_beginning = true;
        if (*uname == '-')) {
          *uname = ' ';
        }
      }
      uname++;  
    }

  }
  return name;
}
\$\endgroup\$
2
  • \$\begingroup\$ In C23, all signed integral types are two’s-complement. \$\endgroup\$
    – Davislor
    Commented Jun 4 at 16:48
  • 3
    \$\begingroup\$ @Davislor Code is not tagged C23. IAC, as of May, C23 is yet to be released so "all signed integral types are two’s-complement." is likely yet in the future. \$\endgroup\$
    – chux
    Commented Jun 4 at 17:26
6
\$\begingroup\$

If you're willing to link against a regex library there's simpler ways to accomplish this. Let's stick with just ISO C for now.

specification

We helpfully see "converts a string in the format "this-string" to "This String" in the Review Context. This is almost a specification.

The source code needs to include documentation comments that are at least as helpful as that. As written, format_name is hopelessly vague.

Also, it seems it is important that caller does not attempt to pass in a UTF8 string, as we are limited to something like US-ASCII, or maybe Latin1 code page. In particular, there definitely isn't support for en-dash or em-dash, which is fine. It brings us back to the whole business of specifying just what this function offers to the caller.

NUL

    for ( ... ; name[i] != '\0'; ... ) {

This is just fine. But standard idiom would be simply name[i], similar to while (*p++) loops.

DRY

This seems slightly regrettable.

        if (i == 0) {
            name[i] = (char)toupper(name[i]);
        }

I mean, we already have next_index = 0, right? Why not just let the if (i == next_index) clause worry about this?

On which topic, when defining next_index it wouldn't hurt to explain its meaning. Yes, we see how it interacts, and it's not a complex interaction. But still it's worthwhile to get in the habit of commenting on state variables when the variable's name is not self explanatory.

unit test

Thank you for including a test suite.

But a unit test should "know the right answer", so it can display a Green bar in automated fashion.

    char string1[12] = ... ;
    ...
    char string2[22] = ... ;

Those seem tedious and painful, not a pretty sight for a maintenance engineer to gaze upon. Surely we could default the length, given that each string is a constant known at compile time?

You didn't mention if you're writing to c99, c11, or something more modern. The code looks like it is C, but these feel like they want to be C++ constexprs.

\$\endgroup\$
1
  • 1
    \$\begingroup\$ As an adjunct to the final sentence: it is actually a quite reasonable idea to use a C++ main() (or a C++ testing framework) to exercise functions written in C. I do it quite frequently. \$\endgroup\$ Commented Jun 9 at 14:33
6
\$\begingroup\$

The functions in ctype.h expect input in the range of unsigned char, so you should cast char inputs (which may be signed) to unsigned char to be safe. The posted code fails to do this. On the other hand, the posted code casts the result of calling toupper to char before assigning the value to a location in name[]; there is no reason for this cast since the returned value corresponds to a character in the current locale.

I would consider having format_name return a pointer to the modified string so that the result can be used directly.

I think that this code would be much nicer if written in the spirit of some of the examples from the classic K&R book. In the code below the pointer c is used to traverse the input string, in_word tracks whether it is in between words or inside a word. When c moves into a word the first letter can be converted.

This approach is much cleaner and much more clear, in my opinion. There is no need to special case i == 0 and no need to juggle the indices i and next_index, which I find to be unclear. Perhaps it is this aspect that OP found to be "clunky."

#include <stdio.h>
#include <ctype.h>
#include <stdbool.h> // pre C23

char * format_name(char s[static 1]) {
    bool in_word = false;
    char *c = s;

    while (*c) {
        if (*c == '-') {
            *c = ' ';
            in_word = false;
        } else if (!in_word) {
            *c = toupper((unsigned char) *c);
            in_word = true;
        }
        ++c;
    }

    return s;
}

int main(void) {
    char s1[] = "this-string";
    char s2[] = "this-string-is-longer";
    char s3[] = "---another---possibility---";

    puts(s1);
    puts(format_name(s1));
    puts(s1);
    putchar('\n');

    puts(s2);
    format_name(s2);
    puts(s2);
    putchar('\n');

    puts(s3);
    format_name(s3);
    puts(s3);
}

In the above program stdbool.h was included, but with the arrival of C23 this will no longer be needed.

One could maintain an index into the string s for traversal via array indexing instead of using the pointer c with pointer arithmetic. I have grown accustomed to the clarity and concision of this style, but there would be nothing wrong with using array indexing instead and some would prefer that.

Note that there are some cases that may be of further interest which were not specified in the OP post. Perhaps most obviously, how should multiple hyphens be handled? How should hyphens at the start or end of an input string be handled? In the above program all hyphens are simply converted to spaces; this, I believe, is the same behavior produced by the OP posted code.

Here is a sample run of the program:

this-string
This String
This String

this-string-is-longer
This String Is Longer

---another---possibility---
   Another   Possibility

A Note About Null Pointer Checks

The OP posted code included a null pointer check at the top of the format_name function. In my rewriting of this function I removed the null pointer check. A brief discussion follows.

In the C Standard Library string handling functions expect pointers to null-terminated arrays of char for their string arguments. NULL (and nullptr in C23) is not a pointer to anything, so it cannot be a string argument in the Standard Library sense. Passing a null pointer where a pointer to a string is expected to a string handling function from the C Standard Library results in undefined behavior, so programmers don't do that and often include null checks before passing arguments to such functions.

format_name is presumably a string-processing function, and it seems reasonable to follow this practice. There may sometimes be reasons to pass a null pointer for a string argument in a string handling function, but in this case no such reason is apparent; there is no reason to pass a null pointer to format_name. format_name should document what it expects of callers, and callers must conform to those expectations.

Some compilers allow function attributes to help with this, e.g., GCC and Clang allow the nonnull function attribute which can be used to generate warnings when null pointers are passed where they are not allowed.

Since C99 the static keyword has been allowed to appear in function array parameter declarations to indicate the minimum number of elements that must be present in the corresponding array. This is what I used in my code above: char s[static 1] indicates that s will be adjusted to a pointer into an array of at least one char. This cannot be a null pointer.

According to the C Standard, if a caller attempts to pass a null pointer (or rather, a pointer that violates the minimum asserted by the [static n] notation), the result is undefined behavior.

Fortunately, compilers should warn when code attempts to pass a null pointer to a parameter annotated with [static 1]. To get these warnings in GCC, you need to compile with -Wnonnull enabled; this flag is included when you compile with -Wall enabled, which you should always be doing. Compiling with default optimization (-O0) this will catch attempts to pass NULL (or nullptr in C23) directly, and when compiling at levels higher than -O0 this will also catch many cases of pointer variables to which the null pointer has been assigned.

Before calling this function, a null pointer check would be a good idea, e.g., if (s1) { format_name(s1); } else { /* handle null pointer */ }. But what you should not do is place the null pointer check inside of format_name. Since passing a null pointer to this function would be undefined behavior, the compiler is allowed to assume that this will never happen and it may optimize away any null pointer checks it finds inside of the function.

\$\endgroup\$
6
\$\begingroup\$

A nice classic C program. Other than the other various notions on fixing the tests for a letter to be upper-cased, I have two issues:

Header

For a function in an external compilation unit, there should be a shared header file to define a prototype. (This is not required if the real code is just one compilation unit, and particularly if format_name is static.)

Invalid Input

Your code, and your apparent specification, stated the input would be restricted to 27 possible characters. What if it doesn't? Consider what your output should be for these possible inputs:

THAT-STRING
tHAT-sTRING
this.string
This String
ThIs-sTrInG
this-.string
this---string
12345
123this-456string
-----
this,string

I can't really say what it should produce in any if these cases, but I do think that you should think about it, and probably document it. (If I were giving the assignment, and you weren't normally given the test input, I would provide unexpected input. Even to the point of using a CSV quoting syntax.)

In particular: What do you do with uppercase input? What do you do with other characters? Do you uppercase only the first and after dash, or after any non-alphabetic? How do other characters (like digits) affect capitalization?

\$\endgroup\$
1
  • 1
    \$\begingroup\$ Good answer! Wonderful opportunity to introduce beginner to many test cases in an array, and old chestnut of sizeof arr / sizeof arr[0]. Just a thought... Cheers! :-) \$\endgroup\$
    – Fe2O3
    Commented Jun 5 at 19:09
5
\$\begingroup\$

Title-Case of Hyphenated Words

This algorithm only recognizes the space character as a word separator. It doesn’t capitalize words after hyphens, unlike Standard English.

Unicode

If you’re formatting text for real-world display in this century, you need to support Unicode, not just ASCII. There are some Unicode ligatures that are different in title-case than in uppercase, like dzDz, (as in Dzlecko) and some whose capitalization changes in another locale, like iİ in Turkish.

Since your input and output in 2023 should support UTF-8, and the title-case mapping of a UTF-8 character does not necessarily have the same length, you will need separate input and output buffers. As totitle is not standardized, you may want to fall back to toupper for portability.

Putting it All Together

First, some boilerplate for compatibility and error-handling:

#include <assert.h>
#include <limits.h>
#include <locale.h>
#include <stdbool.h>
#include <stdio.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <uchar.h>
#include <wctype.h>

#if defined(_MSC_VER) && !defined(__clang__)
#  define MUSTTAIL /**/
#elif __has_c_attribute(clang::musttail)
#  define MUSTTAIL [[clang::musttail]]
#else
#  define MUSTTAIL /**/
#endif

#if defined(_MSC_VER) && !defined(__clang__)
#  define NORETURN __declspec(noreturn)
#elif __has_c_attribute(noreturn)
#  define NORETURN [[noreturn]]
#else // Any implementation supporting <uchar.h> probably also supports _Noreturn.
#  define NORETURN _Noreturn
#endif

#define BUFSIZE 4096U

NORETURN void fatal_error_handler( const char* const msg, const char* const file, const int line );

NORETURN void fatal_error_handler( const char* const msg, const char* const file, const int line ) {
    fflush(stdout);
    fprintf(stderr, "%s (at line  %s:%d)\n", msg, file, line);
    exit(EXIT_FAILURE);
}

#define fatal_error(msg) fatal_error_handler((msg), __FILE__, __LINE__)

I implemented the UTF-8 title-case transformation as a finite state machine using tail recursion, because I’m weird. This is only efficient on LLVM compilers that support [[musttail]], and it might not work for codepoints outside the BMP on platforms where wchar_t is 16-bit.

static inline char* titlecase_utf8_helper( const size_t n_in,
                                    const char in[],
                                    const size_t n_out,
                                    char out[],
                                    mbstate_t* const in_st,
                                    mbstate_t* const out_st,
                                    const wctrans_t tf,
                                    const bool is_in_word ) {
    char32_t wc = U'\0';
    const size_t result_in = mbrtoc32(&wc, in, n_in, in_st);

    if (0 == result_in || 0 == n_in) {
        if (0 == n_out) {
            fatal_error("Output buffer is full");
        }
        *out = '\0';
        return out;
    } else if ((size_t)-1 == result_in) {
        fatal_error("UTF-8 encoding error");
    } else if ((size_t)-2 == result_in) {
        fatal_error("Incomplete UTF-8 string");
    } else if (0 > (ptrdiff_t)result_in) {
        fatal_error("Unknown UTF-8 processing error");
    } else if (result_in > n_in) { // Prevent buggy implementation from causing a buffer overrun.
        fprintf(stderr, "%u\n", (unsigned)*in);
        fatal_error("Broken mbrtoc32() returned invalid result");
    } else if (is_in_word) { // Converted a positive number of bytes, and should not capitalize.
        if (result_in >= n_out)  {
            fatal_error("Output buffer full");
        }

        assert(result_in <= n_in && result_in < n_out);
        memcpy(out, in, (size_t)result_in);
        MUSTTAIL return titlecase_utf8_helper(
            n_in - result_in,
            in + result_in,
            n_out - result_in,
            out + result_in,
            in_st,
            out_st, // Can this actually change?
            tf,
            !(iswspace((wint_t)wc) || iswpunct((wint_t)wc)) );
    } else { // Should capitalize this letter and start a new word.
        const char32_t cap = (char32_t)towctrans((wint_t)wc, tf);
        char utf8_buf[MB_LEN_MAX] = "";
        const size_t result_out = c32rtomb(utf8_buf, cap, out_st);

        if ((size_t)-1 == result_out) {
            fatal_error("unicode mapping returned invalid character");
        } else if (result_out >= n_out) {
            fatal_error("Output buffer full");
        } else { // In a word-initial position
            assert(result_out < n_out && result_in <= n_in);
            memcpy(out, utf8_buf, result_out);
            MUSTTAIL return titlecase_utf8_helper(
                n_in - result_in,
                in + result_in,
                n_out - result_out,
                out + result_out,
                in_st,
                out_st,
                tf,
                !(iswspace((wint_t)cap) || iswpunct((wint_t)cap)) );
        }
    }
}

static size_t titlecase_utf8( const size_t n_in,
                              const char in[],
                              const size_t n_out,
                              char out[] ) {
    mbstate_t in_state = {};
    // initialize the states to the initial conversion state:
    mbrtoc32(NULL, "", 1, &in_state);
    mbstate_t out_state = in_state;
    wctrans_t tf = wctrans("totitle");
    if (tf == 0) { // Fall back to uppercase if titlecase is unsupported
        fflush(stdout);
        fputs("Using uppercase, not titlecase.\n", stderr);
        fflush(stderr);
        tf = wctrans("toupper"); // Guaranteed to work.
        assert(tf);
    }

    char* const endp = titlecase_utf8_helper(n_in, in, n_out, out, &in_state, &out_state, tf, false);
    return (size_t)(endp - out);
}

Finally, a basic test driver. It requires the environment to be set to a UTF-8 locale.

int main(void) {
    static char input_buf[BUFSIZE];
    static char output_buf[BUFSIZE];

    setlocale(LC_ALL, "");
    fgets(input_buf, sizeof(input_buf), stdin);
    const size_t output_size = titlecase_utf8(strlen(input_buf), input_buf, sizeof(output_buf), output_buf);
    assert(strlen(output_buf) == output_size);
    fwrite(output_buf, 1, output_size, stdout);
    return EXIT_SUCCESS;
}

Try it on the Godbolt compiler explorer.

\$\endgroup\$
13
  • \$\begingroup\$ Yes, strlen(input_buf)+1 is better. Fair point about input_buf initialized to null bytes. \$\endgroup\$
    – chux
    Commented Jun 5 at 0:33
  • \$\begingroup\$ #define BUFSIZE 4096U ==> 1) Why not BUFSIZ from <stdio.h>? It is usually 8K though. 2) Perhaps first check whether __has_c_attribute is defined? 3) NORETURN can be made better, as C11 provides _Noreturn, or its convenience macro noreturn in <stdnoreturn.h>. 4. I am curious about the forward declaration. Why? Also, is the function not small enough to be made static inline? 5. Good code, I should start caring about UTF-8 too. \$\endgroup\$
    – Harith
    Commented Jun 5 at 0:46
  • \$\begingroup\$ @Harith The definition of BUFSIZE was good enough for this test driver. If it weren’t enough, I’d probably want something arbitrarily long. \$\endgroup\$
    – Davislor
    Commented Jun 5 at 0:51
  • 1
    \$\begingroup\$ @Davislor In C23, yes. Not before that. I was talking about checking __STDC_VERSION__ for C11 and defining it to _Noreturn in the case that #if __has_attribute(noreturn) evaluated to false. But not a big deal. \$\endgroup\$
    – Harith
    Commented Jun 5 at 0:54
  • 1
    \$\begingroup\$ @Harith Finally, GCC is the only major compiler to support #if defined(__has_c_attribute) && __has_c_attribute(clang::musttail). Without the ability to short-circuit like that, the conditionals were more trouble than it’s worth for this project. \$\endgroup\$
    – Davislor
    Commented Jun 5 at 1:21
3
\$\begingroup\$

Yes, I think it is possible to make it simpler. Firstly, here is the code:

#include <ctype.h>

void format_name(char* name) {

    char *uppos = name;  // where we next need to make upper case
    for (; *name; name++) {
        if (name == uppos) {
            *name = toupper((unsigned char) *name);
        }
        if (*name == '-') {
            uppos = name + 1;
            *name = ' ';
        }
    }

}

There is no need to check for a null pointer. Just specify that the function precondition includes that the pointer is not null. Then on many systems you can allow the hardware to verify this for free.

Advance the pointer instead of using a separate index. This is the idiomatic C way of doing it. An experienced C programmer will start by assuming that using an index means you might be going back to previous positions or have something more complex that mere advancement.

There's no need for a special case at the beginning of the string because seeing a '-' merely puts you into the same state as you were at the start - about to uppercase the next char.

In this code, at every position we test to see if we must uppercase the char. We could avoid this test, probably making the code slightly faster, but I'm not sure it's worth it as it makes it a little harder to understand. This is what I mean:

#include <ctype.h>

void format_name(char* name) {

        while (*name) { // at start or just after '-'
                *name = toupper((unsigned char) *name);
                        // name has not advanced to catch "--..."
                while (*name != '-') {
                        if (!*++name) return;
                }
                *name++ = ' ';
        }
}

Whether you prefer this is really a matter of taste, but it certainly merits a few more test cases. These are the ones I used:

const char *tests[] = {
        "this-string",
        "this-string-is-longer",
        "",
        "-",
        "x",
        "x--this",
        "x-",
        "-x",
        "x-----------y",
        0,
};

int main()
{
 for (const char **test = tests; *test; test++) {
        char buf[256];
        strcpy(buf, *test);
        format_name(buf);
        printf("'%s' -> '%s'\n", *test, buf);
 }
 return 0;
}

\$\endgroup\$
7
  • \$\begingroup\$ +1, I like the second version! Is there a reason to declare uppos outside of the for loop? \$\endgroup\$
    – Harith
    Commented Jun 6 at 2:36
  • 2
    \$\begingroup\$ I did consider putting it in the first clause of the for(;;) but then decided it made it look too much like the standard usage of for(;;) with a controlling variable which might be very slightly harder to understand. If there had been code following the for loop, I would very likely have done it the other way so that it was obvious that the value was never used after the loop \$\endgroup\$
    – c19
    Commented Jun 6 at 2:42
  • 1
    \$\begingroup\$ Although this cow has been quite thoroughly milked, this is "Code Review", not "Stack Overflow". Can you find something (new) about the OP's code to critique. Just a suggestion... \$\endgroup\$
    – Fe2O3
    Commented Jun 6 at 3:07
  • 1
    \$\begingroup\$ "Whether you prefer this is really a matter of taste" -- perhaps. To my taste, expressions like if (!*++name) return; aren't a model of clarity. \$\endgroup\$ Commented Jun 6 at 5:15
  • \$\begingroup\$ To second the view of @adabsurdum, I find thinking in negative logic far more challenging than positive logic. Don't you think so, too? :-) \$\endgroup\$
    – Fe2O3
    Commented Jun 6 at 7:34

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Not the answer you're looking for? Browse other questions tagged or ask your own question.