3
\$\begingroup\$

Since the standard fgets() does not suffice for my use cases, as it doesn't automatically enlarge the target buffer if needed, and getline()/fgetln() are not part of the C standard, I rolled out my own.

Code:

The code below is listed as one file for code review.

#define TEST_MAIN // Or pass this with make/gcc.

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>

/*
 * Reads the next line from the stream pointed to by `stream`.
 *
 * The line is terminated by a NUL character.
 * On success, the memory pointed to by `len` shall contain the length of the line 
 * (including the terminating NUL character).
 * The returned line does not contain a newline.
 *
 * Upon successful completion a pointer is returned, otherwise NULL is returned.
 * `fgetline()` does not distinguish between end-of-file and error; the routines
 * `feof()` and `ferror()` must be used to determine which occurred.
 */
char *fgetline(FILE *stream, size_t *len)
{
    clearerr(stream);

    char *line = NULL;
    size_t capacity = 0;
    size_t size = 0;

    while (true) {
        if (size >= capacity) {
            char *const tmp = realloc(line, capacity += BUFSIZ);
            
            if (!tmp) {
                free(line);
                return NULL;
            }

            line = tmp;
        }
        
        if (!fgets(line + size, BUFSIZ, stream)) {
            if (ferror(stream)) {
                free(line);
                return NULL;
            } 
            break;
        }

        if (strpbrk(line, "\n\r")) {
            break;
        }

        size += BUFSIZ;
    }
    
    if (feof(stream) && size == 0) {
        free(line);
        return NULL;
    }

    /* TODO: Shrink the memory chunk here. */ 
    *len = strcspn(line, "\n\r"); 
    line[*len] = '\0';
    *len += 1;
    return line;
}

#ifdef TEST_MAIN

#include <assert.h>

int main(int argc, char **argv)
{
    assert(argv[1]);
    FILE *fp = fopen(argv[1], "rb");
    assert(fp);
    
    size_t size = 0;

    assert(!strcmp(fgetline(fp, &size), "12"));
    assert(size == 3);

    assert(!strcmp(fgetline(fp, &size), "AB"));
    assert(size == 3);

    assert(!strcmp(fgetline(fp, &size), "123"));
    assert(size == 4);

    assert(!strcmp(fgetline(fp, &size), "ABC"));
    assert(size == 4);

    assert(!strcmp(fgetline(fp, &size), "1234"));
    assert(size == 5);

    assert(!strcmp(fgetline(fp, &size), "ABCD"));
    assert(size == 5);

    rewind(fp);
    char *res = NULL;

    while ((res = fgetline(fp, &size)) != NULL) {
        printf("Line read: \"%s\", line length: %zu, string length: %zu\n", res, size, strlen(res));
    }

    assert(!fgetline(fp, &size));
    assert(feof(fp));    
    return EXIT_SUCCESS;
}
#endif /* TEST_MAIN */

prints:

Line read: "12", line length: 3, string length: 2
Line read: "AB", line length: 3, string length: 2
Line read: "123", line length: 4, string length: 3
Line read: "ABC", line length: 4, string length: 3
Line read: "1234", line length: 5, string length: 4
Line read: "ABCD", line length: 5, string length: 4
Line read: "", line length: 1, string length: 0

where the sample text file contained:

12
AB
123
ABC
1234
ABCD
^@^@^@^@

The last line contains 4 NUL characters, courtesy of fputc(0, fp).

Review Request:

Are there any bugs in the code?

Are there any edge cases I have missed?

Does the function provide what's expected of a good line-reading function?

If there are embedded NUL characters in the line, the function returns the string up to the first NUL character. Was that a bad decision to take?

\$\endgroup\$
1
  • \$\begingroup\$ If you map the file to memory, you can return a line as a string slice, and then parse or copy it. This has some drawbacks (reading a new page of the file might be slow on some OSes, won’t work on streams such as a network, console or pipe), but if can simplify this operation a lot. \$\endgroup\$
    – Davislor
    Commented Feb 26 at 6:38

1 Answer 1

5
\$\begingroup\$

A real good get line function is tricky and sadly missing in standard C.


Are there any bugs in the code?

See below.

Are there any edge cases I have missed?

See below.

Does the function provide what's expected of a good line-reading function?

Overall a fairly good effort. Yet buffer allocating O(n*n) and use of clearerr() are amiss.

If there are embedded NUL characters in the line, the function returns the string up to the first NUL character. Was that a bad decision to take?

A get line function that returns length should be able to cope with reading null characters. If not, why have a len parameter as calling code can simply perform strlen() to determine the length?


Questionable clearerr(stream)?

fgets(), fscanf() and others do not clear the error flag and end-of-file flags. The choice to do that is usually left to the calling code, not the get line code. I'd expect fgetline() to correctly operate with the error and end-of-file flags in either state upon function entry and to not employ a clearerr().

Concerning if (ferror(stream)) {, consider using if (!feof(stream)) { to not get fooled by an end-of-file with the error flag set prior to this call.
Is it possible to distinguish the error returned by fgets

*len on failure

Code has "On success, the memory pointed to by len shall contain the length of the line ...", but is silent on what happens when the is a failure. Perhaps add that *len is unchanged on failure.

Consider *len should be set in all cases with 0 on failure.

Code assumes a null character is not read

strpbrk(line, "\n\r") and strcspn(line, "\n\r") will not detect data read beyond the first '\0' in the buffer. There will be multiple '\0' in line when a null character is read, the ones read and the one appended.

Properly handling the reading of a null character is tricky.

Bug?: Code assumes a '\r' ends the line

strpbrk(line, "\n\r") and strcspn(line, "\n\r") performed as if the line stopped at '\r'. Consider a line of user input such as "abc\rdef\r\n".

A text stream is an ordered sequence of characters composed into lines, each line consisting of zero or more characters plus a terminating new-line character. ... C23dr § 7.23.2 2

What makes for a raw terminating new-line character varies between systems - it is implementation dependent. Trying to make a universal detector of line terminator is tricky should code encounter text files from local and "foreign" sources.

File open in binary mode?

If this fgetline() requires the stream is opened in text or binary, that requirement needs to be documented. Better if the function works well in either case.

Redundant work

*len = strcspn(line, "\n\r") rescans the initial portion of line that is known not to contain "\n\r". Something like *len = strcspn(line + (size - BUFSIZ), "\n\r") + (size - BUFSIZ); can avoid that.

Surprising "length"

*len += 1; does follow "all contain the length of the line * (including the terminating NUL character)." yet maybe something other than length is warranted - like size or count or .... With strings, "length" is tied closely to the return value of strlen() and that differs by 1 here.

FWIW, I see advantages of code returning this + 1 count, especially when code changed to set the *len to 0 on error.

Minor: NUL vs. null character`

Consider using null character.

C defines null character and only uses NUL in a footnote referring to ASCII.

ASCII defines NUL.

When NULL is returned

I see 3 cases:

  • End-of-file and nothing read.

  • Input error

  • Memory allocation failure.

The last case is not documented in the fgetline() preface.

Minor: Big buffers

BUFSIZ is at least 256 so each line read can have a cumulative significant extra size. Consider a final realloc() to right-size the buffer - or warn about these inefficient allocations.

I suppose that is all implied with the TODO: Shrink the memory chunk here.

Minor: buffer growth

line grows linearly and so can make for O(len*len) for long lines. More complexity, but other approaches have advantages.

One way that is O(len) is to form a linked list of BUFSIZ buffers and then collect them as one in the end.

Testing

Code only tested in binary mode with FILE *fp = fopen(argv[1], "rb");, yet fgetline() more likely used in text mode.

.h file

Consider a preceding

#include < ... >
// documentation
char *fgetline(FILE *stream, size_t *len);

and then

#include < ... >
char *fgetline(FILE *stream, size_t *len) {

To convey how one would present this function in a typical header file.

\$\endgroup\$
8
  • \$\begingroup\$ "Re: strpbrk(line, "\n\r") and strcspn(line, "\n\r") will not detect data read beyond the first '\0' in the buffer." ==> Yes. Do you have an alternate method with the current approach? Or do I read byte by byte? \$\endgroup\$
    – Harith
    Commented Feb 26 at 14:00
  • \$\begingroup\$ Rewriting it to read byte-by-byte cut down the code size by half and took care of the embedded null characters. \$\endgroup\$
    – Harith
    Commented Feb 26 at 14:47
  • \$\begingroup\$ @Harith Byte-by-byte has a performance hit vs. fgets(). fgets() without a pre-fill makes it hard to find those null characters, fscanf(f, "%width[^n]%n", buf, &len) works but is not size flexible and likely slow. IMO, there is not a good clean line input function in C. User code solutions incur some weakness. For a input function that allocates, consider *nix's getline() (and lift its source code when your system does not have it) - . It has weaknesses - but at least there is a chance it will perform well. \$\endgroup\$
    – chux
    Commented Feb 26 at 17:32
  • \$\begingroup\$ @Harith "Do you have an alternate method with the current approach?" --> For a non-allocating version, I have had some success with fscanf(f, "%width[^n]%n", buf, &len). IMO, input that exceeds the caller's maximum expected input should read the whole line, yet is an error - and so the whole line need not get saved. This prevents the end-user from overwhelming memory usage. My model is to be flexible, yet user input is evil until vetted. \$\endgroup\$
    – chux
    Commented Feb 26 at 17:39
  • \$\begingroup\$ This sample implementation of getline(): github.com/lattera/freebsd/blob/master/contrib/file/getline.c reads byte-by-byte with fgetc(). \$\endgroup\$
    – Harith
    Commented Feb 27 at 9:59

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Not the answer you're looking for? Browse other questions tagged or ask your own question.