Manipulating Binary Data Using The Korn Shell
Most people are unaware that ksh93 (Korn Shell 93) can handle binary data. Indeed, the situation
has been made worse by a certain well-known person on unix.stockexchange.com who claims that
zsh is the only shell that is 8-bit clean and can internally handle binary data. As far as I am aware,
this person does not contribute to the development of zsh, so I am not sure why that person is so
adamant about the unique binary data capabilities of zsh.
As the following examples will demonstrate, ksh93 is perfectly capable of generating binary data,
making an exact copy of a binary file and manipulating binary files.
For my first example, I demonstrate how to create a 256-byte binary file containing all the binary
values from 0x00 (NUL) to 0xFF.
#!/bin/ksh93
typeset -i8 value
redirect 3>out.hex || exit 1
for ((value = 0; value < 256; value++))
do
    print -u 3 -f "\\${value#8#}"
done
redirect 3<&- || echo 'cannot close FD 3'
exit 0
As the xxd output below shows, a perfect binary file was created containing the full ASCII table
(NUL to 0x7F) plus the values from 0x80 to 0xFF (sometimes known as the Extended ASCII table).
By the way, redirect is simply an alias for command exec. I assume you are familiar with
manipulating file descriptors in ksh93 or other shells. If not, read the appropriate section of the
ksh93 man page.
$ xxd out.hex
00000000: 0001 0203 0405 0607 0809 0a0b 0c0d 0e0f ................
00000010: 1011 1213 1415 1617 1819 1a1b 1c1d 1e1f ................
00000020: 2021 2223 2425 2627 2829 2a2b 2c2d 2e2f !"#$%&'()*+,-./
00000030: 3031 3233 3435 3637 3839 3a3b 3c3d 3e3f 0123456789:;<=>?
00000040: 4041 4243 4445 4647 4849 4a4b 4c4d 4e4f @ABCDEFGHIJKLMNO
00000050: 5051 5253 5455 5657 5859 5a5b 5c5d 5e5f PQRSTUVWXYZ[\]^_
00000060: 6061 6263 6465 6667 6869 6a6b 6c6d 6e6f `abcdefghijklmno
00000070: 7071 7273 7475 7677 7879 7a7b 7c7d 7e7f pqrstuvwxyz{|}~.
00000080: 8081 8283 8485 8687 8889 8a8b 8c8d 8e8f ................
00000090: 9091 9293 9495 9697 9899 9a9b 9c9d 9e9f ................
000000a0: a0a1 a2a3 a4a5 a6a7 a8a9 aaab acad aeaf ................
000000b0: b0b1 b2b3 b4b5 b6b7 b8b9 babb bcbd bebf ................
000000c0: c0c1 c2c3 c4c5 c6c7 c8c9 cacb cccd cecf ................
000000d0: d0d1 d2d3 d4d5 d6d7 d8d9 dadb dcdd dedf ................
000000e0: e0e1 e2e3 e4e5 e6e7 e8e9 eaeb eced eeef ................
000000f0: f0f1 f2f3 f4f5 f6f7 f8f9 fafb fcfd feff ................
$
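If you would rather not check the dump by eye, the same check can be scripted. The following is only a minimal sketch in portable sh, using od and awk rather than anything specific to ksh93; the function name verify_all_bytes is my own invention for illustration.

```shell
#!/bin/sh
# verify_all_bytes FILE (hypothetical helper, not part of ksh93)
# Succeeds only if FILE consists of exactly the 256 bytes 0x00..0xFF, in order.
verify_all_bytes() {
    # od prints every byte as an unsigned decimal; awk puts one value per line.
    actual=$(od -An -v -tu1 "$1" | awk '{ for (i = 1; i <= NF; i++) print $i }')
    # The values we expect to see: 0, 1, 2, ... 255.
    expected=$(awk 'BEGIN { for (i = 0; i < 256; i++) print i }')
    [ "$actual" = "$expected" ]
}
```

It could then be used as: verify_all_bytes out.hex && echo "out.hex holds 0x00 to 0xFF"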
My next example simply uses builtin ksh93 functionality to copy a binary file, image.jpg, to a
binary file named image.cpy.
#!/bin/ksh93
#
# copy a binary file
#
typeset -b byte
command exec 3<image.jpg || exit 1
bytes=0
eof=$(3<#((EOF)))
3<#((0))
:> image.cpy
while (( $(3<#((CUR))) < $eof ))
do
    # print "At offset $(3<#)"
    read -r -u 3 -N 1 byte
    printf "%B" byte >> image.cpy
    (( bytes ++ ))
done
redirect 3<&- || echo 'cannot close FD 3'
print "$bytes copied"
exit 0
The key to understanding how this script works is understanding what typeset -b does. From the
ksh93 manpage:
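-b The variable can hold any number of bytes of data. The data can be text
or binary. The value is represented by the base64 encoding of the data.
If -Z is also specified, the size in bytes of the data in the buffer will
be determined by the size associated with the -Z. If the base64 string
assigned results in more data, it will be truncated. Otherwise, it will
be filled with bytes whose value is zero. The printf format %B can be
used to output the actual data in this buffer instead of the base64
encoding of the data.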
The script works by reading the source file byte by byte and storing each byte read in the variable
called byte. Internally, this byte is stored as a base64-encoded string. This was David Korn’s
solution to the design issue of how to store a NUL (character 0 in the portable character set,
corresponding to US ASCII) in a NUL-terminated string. Remember, unlike some other
programming languages such as Pascal, strings are NUL terminated in the C programming
language.
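To see why a base64 representation sidesteps the NUL problem, here is a small sketch using the external base64 utility as a stand-in. It is not how ksh93 implements typeset -b internally. It only shows that a NUL byte becomes ordinary printable text once encoded. I am assuming a base64 command that accepts -d for decoding, as GNU coreutils does; other implementations may spell that flag differently.

```shell
#!/bin/sh
# Encode a single NUL byte: the result is ordinary printable text (AA==),
# which is safe to keep in a NUL-terminated C string.
encoded=$(printf '\0' | base64)
printf 'base64 of one NUL byte: %s\n' "$encoded"

# Decode it again and count the bytes that come back.
# Assumption: this base64 implementation uses -d to decode.
count=$(( $(printf '%s' "$encoded" | base64 -d | wc -c) ))
printf 'bytes after decoding: %d\n' "$count"
```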
The zsh shell uses a different mechanism but the end result is the same. It uses a guard byte, Meta,
to guard the following byte.
From ..zsh/Src/zsh.h:
/* Meta together with the character following Meta denotes the character *
* which is the exclusive or of 32 and the character following Meta. *
* This is used to represent characters which otherwise has special *
* meaning for zsh. These are the characters for which the imeta() test *
* is true: the null character, and the characters from Meta to Marker. */
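The rule described in that comment can be sketched in a few lines of portable sh. This is only an illustration of the arithmetic, not zsh code. The numeric values I use for Meta (0x83) and Marker (0xa2) are assumptions on my part and may differ between zsh versions, and the function name metafy_byte is hypothetical.

```shell
#!/bin/sh
META=131      # 0x83: assumed value of Meta
MARKER=162    # 0xa2: assumed value of Marker, may vary by zsh version

# metafy_byte N (hypothetical helper)
# Print, in hex, how the byte with decimal value N (0-255) would be stored.
metafy_byte() {
    if [ "$1" -eq 0 ] || { [ "$1" -ge "$META" ] && [ "$1" -le "$MARKER" ]; }; then
        # Special byte: emit Meta, then the byte XORed with 32.
        printf '%02x %02x\n' "$META" $(( $1 ^ 32 ))
    else
        # Ordinary byte: stored unchanged.
        printf '%02x\n' "$1"
    fi
}

metafy_byte 0     # NUL
metafy_byte 65    # 'A'
```

Under these assumptions a NUL is stored as the two bytes 83 20, so the stored string never contains a real NUL, while an ordinary byte such as 'A' is stored as itself.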
The interesting thing about the zsh special character guard mechanism is that zsh provides a
way to adjust how embedded NULs are handled, using the options POSIX_STRINGS
(setopt posixstrings) or NO_POSIX_STRINGS (setopt noposixstrings). When the option is
unset, the entire string, embedded NULs included, is kept within the shell and output to files
where necessary, although owing to “restrictions of the library interface a string is truncated at
the NUL character in file names, environment variables, or in arguments to external programs.”
For my next example, we are going to reverse a binary file, i.e. image.gpj to image.jpg. Have a
look at the following code:
#!/bin/ksh93
#
# reverse a binary file
#
typeset -b byte
redirect 3< image.gpj || exit 1
eof=$(3<#((EOF)))
read -r -u 3 -N 1 byte
printf "%B" byte > image.jpg
3<#((CUR - 1))
while (( $(3<#) > 0 ))
do
    read -r -u 3 -N 1 byte
    printf "%B" byte >> image.jpg
    3<#((CUR - 2))
done
# offset is now 0: read and write the first byte of the input file
read -r -u 3 -N 1 byte
printf "%B" byte >> image.jpg
redirect 3<&- || echo 'cannot close FD 3'
exit 0
Again, I use typeset -b byte to declare byte to be a binary type which, by the way, can hold up to
64KB of either binary or text data. Again, I use the ksh93 I/O mechanism to open the input file,
image.gpj, using file descriptor 3. Again, I read byte by byte, but this time backwards from the last
byte of the file to the byte at offset 0. Why decrement the file offset by 2 in the loop? Simple:
read advances the file offset by 1, so the script has to compensate for the last read and also
decrement the offset by 1 so that the next read reads the previous byte in the file.
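If the "minus 2" still looks odd, the following sketch simulates the offset bookkeeping without touching a file. It is plain sh arithmetic, not ksh93 seek syntax, and the function name trace_reverse_reads is made up for this illustration. It prints the offset of each byte that would be read for a file of the given size.

```shell
#!/bin/sh
# trace_reverse_reads SIZE (hypothetical helper)
# Print, one per line, the offsets read when walking a SIZE-byte file backwards.
trace_reverse_reads() {
    [ "$1" -gt 0 ] || return 0
    cur=$(( $1 - 1 ))          # start positioned on the last byte
    while [ "$cur" -gt 0 ]; do
        printf '%d\n' "$cur"   # the byte read in this iteration
        cur=$(( cur + 1 ))     # read advances the offset by 1
        cur=$(( cur - 2 ))     # undo that advance, then step back one more byte
    done
    printf '%d\n' "$cur"       # offset 0: the final read after the loop
}

trace_reverse_reads 4
```

For a 4-byte file this prints 3, 2, 1 and 0, one per line, which is exactly the order in which the script reads the bytes.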
Obviously this script is fairly inefficient, as it reads and writes individual bytes. Use strace to
see just how inefficient it is. In fact, it turns out that a lot of the inefficiencies are due [...]
In the following example, the previous script has been modified to read and write in chunks of 16
bytes where possible.
#!/bin/ksh
#
# reverse a binary file - chunks
#
typeset -b bytes
redirect 3< image.gpj || exit 1
eof=$(3<#((EOF)))
read -r -u 3 -N 16 bytes
printf "%B" bytes > image.jpg
3<#((CUR - 16))
offset=0
redirect 3<&- || echo 'cannot close FD 3'
exit 0
This script works as intended, but I am pushing the limits of ksh93 file I/O using the CUR and EOF
builtin variables. If, instead of redirecting output to image.jpg using > and/or >>, I assigned file
descriptor 4 to image.jpg, the script would never terminate. This is due to an implementation/design
issue in ksh93 when using either or both of these two builtin variables, CUR or EOF, with more than
one file descriptor simultaneously.
Look at how CUR and EOF are set in ../ast/src/cmd/ksh93/sh/io.c
return (Sfdouble_t)end;
}
char *cp;
Sfoff_t off;
struct Eof endf;
Namval_t *mp = nv_open("EOF", shp->var_tree, 0);
Namval_t *pp = nv_open("CUR", shp->var_tree, 0);
sh_iovalidfd(shp, fd);
sp = shp->sftable[fd];
memset(&endf, 0, sizeof(struct Eof));
endf.fd = fd;
endf.hdr.disc = &EOF_disc;
endf.hdr.nofree = 1;
if (mp) nv_stack(mp, &endf.hdr);
if (pp) nv_stack(pp, &endf.hdr);
if (sp) sfsync(sp);
off = sh_strnum(shp, fname, &cp, 0);
if (mp) nv_stack(mp, NULL);
if (pp) nv_stack(pp, NULL);
return *cp ? (Sfoff_t)-1 : off;
}
As you can see, the code for these two builtins (name-value pairs) is inextricably tangled together
in both the file_offset function and the discipline function associated with each builtin. Not the
best of designs; the result is that the shell can easily get confused as to which file descriptor to
use. A redesign is definitely warranted if ksh93 is intended to support seeking to more than one
user-specified file offset in a shell script. The man page is silent on the issue.
My final example shows how to work around this issue by limiting the use of EOF and avoiding the
use of CUR.
#!/bin/ksh
#
# reverse a binary file
#
typeset -b byte
redirect 3< image.gpj || exit 1
iof=$(3<#((EOF)))
redirect 4> image.jpg || exit 1
# oof=0
read -r -u 3 -N 1 byte
3<#(( --iof ))
print -u 4 -f "%B" byte
# (( oof++ ))
while (( iof > 0 ))
do
    # print "At offset $iof $oof"
    read -r -u 3 -N 1 byte
    3<#(( --iof ))
    print -u 4 -f "%B" byte
    # (( oof++ ))
done
read -r -u 3 -N 1 byte
print -u 4 -f "%B" byte
redirect 3<&- || echo 'cannot close FD 3'
redirect 4>&- || echo 'cannot close FD 4'
exit 0
In the above script, the builtin variable EOF is used but once, i.e. to initially set the variable iof,
which is used to store the current offset of the input file. The CUR builtin variable, used in
previous examples, is never used. The script then tracks the location of the input file offset using
iof from that point on, until the script exits when iof decrements to 0.
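The same idea, keeping the input offset in an ordinary shell variable instead of asking the shell for it, can be sketched in portable sh with dd doing the positioned reads. To be clear, this swaps ksh93's builtin seek for an external dd process per byte, so it is even slower than the script above. It is meant only to illustrate the bookkeeping, and the function name reverse_file is hypothetical.

```shell
#!/bin/sh
# reverse_file IN OUT (hypothetical helper)
# Write the bytes of IN to OUT in reverse order, tracking the offset ourselves.
reverse_file() {
    off=$(( $(wc -c < "$1") ))   # start just past the last byte
    : > "$2"                     # create or truncate the output file
    while [ "$off" -gt 0 ]; do
        off=$(( off - 1 ))       # step back to the previous byte
        # Read exactly one byte at offset $off and append it to the output.
        dd if="$1" bs=1 skip="$off" count=1 2>/dev/null >> "$2"
    done
}
```

It would be called as: reverse_file image.gpj image.jpg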
Well, I have run out of time and must finish this blog post. The above examples should have
adequately demonstrated to you that ksh93 is perfectly capable of handling NULs and binary data.
The next time somebody tells you that ksh93 cannot handle binary data internally, or that zsh is the
only shell that can handle binary data, just point that person to this blog post.
Enjoy!