I’m a firm believer in automated testing. Even if I don’t always go for full test driven development, I like watching “passing tests” tick upward as I implement an interface. I like having corner-cases reproducible, so I don’t have to remember them myself. I like being able to set up some scenario, and then see how the system’s behaviour changes as I tweak the implementation. But like any other task, building automated tests is a lot easier if you have the right-shaped components to build with.
When I write Python code,
I use the standard library’s unittest module.
Writing tests for Python code is straightforward,
and so is using them:
execute python3 -m unittest in the root directory of a project,
and it runs all the tests it can find, highlighting the failures.
However,
these days I don’t write so much Python;
instead, I’m writing little tools to be used from the Unix shell,
or helper processes for Kakoune plugins, or whatever.
I want something like Python’s unittest framework,
but with a Unix shell-based API rather than a Python one.
Of course, lots of people have made test harnesses for shell-scripts, but I don’t want people who run the test suite to have to install some third-party library, and I don’t want to include a big dependency in my project either. I want something I can put together myself, and extend in the ways I need when I need it.
And so I stumbled across the prove command and TAP.
Let me show you all the things I learned about them
while setting up a test suite
for the last tool I worked on.
What’s this prove command?

prove is a command-line tool
(originally part of Perl)
that discovers tests,
runs them,
and summarises the results.
It’s similar to python3 -m unittest,
but while Python’s test tool requires all the tests to be written in Python,
prove is language-agnostic.
prove finds files matching a particular pattern and executes them,
interpreting their output as a sequence of test results.
It can make smart decisions about which tests to run and in which order,
but I’ll talk about that later.
For now,
the benefit of prove
is that it can run tests
written in any language that can write to standard output in TAP format.
The Test Anything Protocol (TAP) is a plain-text, human-readable format that represents the results of running a test suite. It’s something like JUnit XML result files, but much simpler.
Here’s an example of a simple TAP stream:
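(A sketch: the test names follow the description below, while the comment text is invented.)

TAP version 14
ok 1 - Input file opened
not ok 2 - First line of the input valid
# Expected a version header, got something else
1..2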
Line 1 announces the version of the TAP specification we’re following.
Lines 2 and 3 represent test results.
The first test, named “Input file opened”, passed (ok);
the second test, named “First line of the input valid”, failed (not ok).
Line 4 is a comment,
extra information that is not a test result,
but might help the person diagnosing one.
The last line records how many tests there should have been in the suite
(so a parser can tell if the test suite crashed before finishing).
TAP is more flexible than this,
as you’ll learn if you read the specification,
but this is enough for our purposes.
If we can write a test harness that reports results in TAP format,
and place that script somewhere that prove can find it,
then we’ve got automated testing.
By default,
prove will look for files matching t/*.t,
and execute them as tests.
Therefore, the absolute minimal test harness looks something like this:
$ mkdir t
$ cat > t/example.t << EOF
#!/bin/sh
echo "TAP version 14"
echo "ok 1 - an example test"
echo "1..1"
EOF
With that set up, we run the tests like so:
$ prove
t/example.t .. ok
All tests successful.
Files=1, Tests=1, 0 wallclock secs ( 0.07 usr + 0.01 sys = 0.08 CPU)
Result: PASS
Note that if you don’t include the #! line,
prove will try to execute the script as Perl.
It’s not necessary to mark the test script as executable,
but it’s still a good idea so you can run it outside prove if you need to:
$ chmod +x t/example.t
$ t/example.t
TAP version 14
ok 1 - an example test
1..1
I found one problem with the above setup:
although prove will run tests written in any language,
some text editors will assume any *.t file is a Perl script
regardless of the #! line.
To get around this,
we rename the tests to use the .test extension,
and tell prove to look for that instead:
$ mv t/example.t t/example.test
$ prove --ext test
t/example.test .. ok
All tests successful.
Files=1, Tests=1, 1 wallclock secs ( 0.07 usr + 0.01 sys = 0.08 CPU)
Result: PASS
It would be tedious to type out the --ext test option every time,
so we tell prove to always use that setting
when run from this directory:
$ cat > .proverc << EOF
--ext test
EOF
$ prove # without any extra options!
t/example.test .. ok
All tests successful.
Files=1, Tests=1, 1 wallclock secs ( 0.07 usr + 0.01 sys = 0.08 CPU)
Result: PASS
We want to write an actual test,
not just spit out a hard-coded result,
so let’s say we want to test the behaviour of mkdir.
We create t/make-a-directory.test with the following contents:
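(A sketch of such a script; the test names are my own invention.)

#!/bin/sh
echo "TAP version 14"

mkdir foo
if [ $? -eq 0 ]; then
    echo "ok 1 - mkdir exited successfully"
else
    echo "not ok 1 - mkdir exited successfully"
fi

if [ -d foo ]; then
    echo "ok 2 - the directory was created"
else
    echo "not ok 2 - the directory was created"
fi

echo "1..2"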
And let’s run the tests:
$ prove
t/example.test ........... ok
t/make-a-directory.test .. ok
All tests successful.
Files=2, Tests=3, 0 wallclock secs ( 0.07 usr 0.01 sys + 0.00 cusr 0.01 csys = 0.09 CPU)
Result: PASS
Two test script files, with a total of three tests between them. They all pass; it works!
Let’s run that test again.
$ prove
t/example.test ........... ok
t/make-a-directory.test .. mkdir: cannot create directory ‘foo’: File exists
t/make-a-directory.test .. Failed 1/2 subtests
Test Summary Report
-------------------
t/make-a-directory.test (Wstat: 0 Tests: 2 Failed: 1)
Failed test: 1
Files=2, Tests=3, 1 wallclock secs ( 0.08 usr 0.01 sys + 0.00 cusr 0.01 csys = 0.10 CPU)
Result: FAIL
This time it failed. When we ran the tests the first time, the directory they created was not cleaned up, so it could not be created a second time. Our tests are going to need to create a place to put temporary files, and to clean it up afterwards. Let’s make that change:
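(The same sketch, revised; the line numbers mentioned below count from the top of this listing.)

#!/bin/sh
echo "TAP version 14"
TESTDATA=$(mktemp -d)

mkdir "$TESTDATA/foo"
if [ $? -eq 0 ]; then
    echo "ok 1 - mkdir exited successfully"
else
    echo "not ok 1 - mkdir exited successfully"
fi

if [ -d "$TESTDATA/foo" ]; then
    echo "ok 2 - the directory was created"
else
    echo "not ok 2 - the directory was created"
fi

echo "1..2"

rm -rf "$TESTDATA"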
We’ve added lines 3 and 20, to create a temporary directory,
store the name in $TESTDATA,
and clean it up when the test is done.
Lines 5 and 12 are modified to create the foo directory
inside $TESTDATA, and to look for it there.
Now we can run the tests as many times as we like,
and they never interfere with each other.
Come to think of it,
“cannot create the same directory twice”
is an important part of mkdir’s behaviour.
We should test that too!
We copy the first test,
paste it at the end,
and fix it to expect a non-zero exit status:
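(A sketch of the pasted chunk; it goes just above the plan line at the bottom.)

mkdir "$TESTDATA/foo"
if [ $? -ne 0 ]; then
    echo "ok 1 - mkdir fails a second time"
else
    echo "not ok 1 - mkdir fails a second time"
fi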
Now when we run it:
$ prove
t/example.test ........... ok
t/make-a-directory.test .. 1/? mkdir: cannot create directory ‘/tmp/tmp.GDiIhUNScM/foo’: File exists
t/make-a-directory.test .. Failed -1/2 subtests
Test Summary Report
-------------------
t/make-a-directory.test (Wstat: 0 Tests: 3 Failed: 0)
Parse errors: Tests out of sequence. Found (1) but expected (3)
Bad plan. You planned 2 tests but ran 3.
Files=2, Tests=4, 0 wallclock secs ( 0.06 usr 0.02 sys + 0.02 cusr 0.01 csys = 0.11 CPU)
Result: FAIL
Uh oh. We added a new test, but we forgot to change the test number, and we forgot to increment the expected number of tests at the end. We could go back and fix these things manually, but the point of a test framework is to make adding tests easy. Let’s make the script handle that bookkeeping for us:
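(A sketch again; the line numbers mentioned below count from the top of this listing.)

#!/bin/sh
echo "TAP version 14"
TESTDATA=$(mktemp -d)
TESTCOUNT=0

report_ok() {
    TESTCOUNT=$((TESTCOUNT + 1))
    echo "ok $TESTCOUNT - $1"
}

report_not_ok() {
    TESTCOUNT=$((TESTCOUNT + 1))
    echo "not ok $TESTCOUNT - $1"
}

mkdir "$TESTDATA/foo"
if [ $? -eq 0 ]; then
    report_ok "mkdir exited successfully"
else
    report_not_ok "mkdir exited successfully"
fi

if [ -d "$TESTDATA/foo" ]; then
    report_ok "the directory was created"
else
    report_not_ok "the directory was created"
fi

mkdir "$TESTDATA/foo"
if [ $? -ne 0 ]; then
    report_ok "mkdir fails a second time"
else
    report_not_ok "mkdir fails a second time"
fi

echo "1..$TESTCOUNT"

rm -rf "$TESTDATA"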
We’ve added line 4, to initialise $TESTCOUNT to zero,
and line 36 uses it to print the final number of tests.
The new report_ok() and report_not_ok() helpers
in lines 6-14 print properly-numbered pass and fail messages (respectively)
and also increment the test count,
so there can never be duplicates.
The actual tests have been updated to call report_ok() and report_not_ok()
as appropriate.
Now it works properly again:
t/example.test ........... ok
t/make-a-directory.test .. 1/? mkdir: cannot create directory ‘/tmp/tmp.Ib1Oh7B4jn/foo’: File exists
t/make-a-directory.test .. ok
All tests successful.
Files=2, Tests=4, 1 wallclock secs ( 0.08 usr 0.01 sys + 0.00 cusr 0.02 csys = 0.11 CPU)
Result: PASS
A lot of people depend on the “unofficial bash strict mode”
to help them make their shell-scripts more robust.
This is generally a good idea,
but specifically for writing tests,
the set -e part of the strict mode can cause problems.
That’s the setting that makes the shell exit immediately
if any command produces a non-zero exit status,
unless failure is explicitly captured with
if, while, ||, or similar constructs.
In a test suite,
we don’t just care whether an operation failed,
we want to be sure that it failed in the correct way,
and set -e makes that difficult to figure out.
If you write something like:
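(A sketch, reusing the reporting helpers from above.)

set -e

mkdir "$TESTDATA/foo"
if [ $? -ne 0 ]; then
    report_ok "mkdir fails a second time"
else
    report_not_ok "mkdir fails a second time"
fi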
…the shell will exit immediately after the mkdir,
before the if has a chance to execute.
On the other hand, if you write:
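(Another sketch; commands in an if condition are exempt from set -e.)

set -e

if ! mkdir "$TESTDATA/foo"; then
    report_ok "mkdir fails a second time"
else
    report_not_ok "mkdir fails a second time"
fi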
…then we can detect mkdir producing a non-zero exit status,
but we can no longer detect which status.
The ! inverts (and throws away) mkdir’s actual exit status,
so by the time our code inside the if runs,
$? has already been reset to 0.
To detect a command’s exit status within the limitations of “strict mode”, I figured out this trick:
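(The shape of the trick; STATUS is my own variable name.)

STATUS=0
mkdir "$TESTDATA/foo" || STATUS=$?
# set -e ignores the failure because of the ||, but $STATUS remembers it.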
It’s a bit awkward, but if you’d rather have strict mode everywhere than keep this particular chunk of code simple, that’s how to do it.
Our test suite works properly,
but mkdir’s error message still gets scribbled into prove’s output.
It would be nice to clean that up,
and it would also be nice to capture that output
so we can confirm it’s failing due to “File exists”
rather than “Permission denied” or some other, weirder error.
It’s easy enough to do both at once:
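(A sketch; the capture-file names are my own.)

STATUS=0
mkdir "$TESTDATA/foo" > "$TESTDATA/stdout.txt" 2> "$TESTDATA/stderr.txt" || STATUS=$?

if [ "$STATUS" -ne 0 ] && grep -q "File exists" "$TESTDATA/stderr.txt"; then
    report_ok "mkdir fails a second time"
else
    report_not_ok "mkdir fails a second time"
fi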
This works, but we can do more to conform to the conventions of the TAP ecosystem.
TAP is a stream of test results (ok and not ok lines)
and comments (lines beginning with #).
There’s no defined connection between any given comment and any given test,
so prove has no way to show only the comments related to a failing test.
It can show you all the comments or none of them,
but nothing in between.
As a result, TAP test scripts only use comments for log-like messages that happen every run regardless of success or failure. The details of failing tests are sent to standard error, so they’re not buried in the noise of comment lines. Of course, it’s possible to plumb a test script’s standard output and error together, so even though failure messages are normally separate from comments, it’s still a good idea to format them as comments anyway.
One more detail:
in the output above,
text written to stderr appears
tacked onto the end of a line printed by prove,
so it would be tidy to have error messages start with a line-break.
Let’s write some helper functions to implement these conventions:
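(A sketch of such helpers: record, assert_exit_status and the report functions are referred to later in the post, while names like start_test, diag and assert_stderr_contains are my own coinage.)

TESTCOUNT=0
TESTNAME=""

# Comments that belong to a failure report go to standard error.
diag() {
    echo "# $*" >&2
}

# Remember the name of the test we're about to run.
start_test() {
    TESTNAME="$1"
}

report_ok() {
    TESTCOUNT=$((TESTCOUNT + 1))
    echo "ok $TESTCOUNT - $TESTNAME"
}

report_not_ok() {
    TESTCOUNT=$((TESTCOUNT + 1))
    echo "not ok $TESTCOUNT - $TESTNAME"
}

# Run a command, capturing its output and exit status for later assertions.
record() {
    LASTCOMMAND="$*"
    STATUS=0
    "$@" > "$TESTDATA/stdout.txt" 2> "$TESTDATA/stderr.txt" || STATUS=$?
}

assert_exit_status() {
    if [ "$STATUS" -eq "$1" ]; then
        report_ok
        return
    fi
    # Failure details start with a line-break, to get clear of prove's output.
    echo >&2
    diag "Test: $TESTNAME"
    diag "Last command:"
    diag "$LASTCOMMAND"
    diag "Exit status: $STATUS"
    diag "stdout was:"
    sed -e 's/^/# /' "$TESTDATA/stdout.txt" >&2
    diag "stderr was:"
    sed -e 's/^/# /' "$TESTDATA/stderr.txt" >&2
    diag "Expected exit status $1, got $STATUS"
    report_not_ok
}

assert_stderr_contains() {
    if grep -q "$1" "$TESTDATA/stderr.txt"; then
        report_ok
        return
    fi
    echo >&2
    diag "Test: $TESTNAME"
    diag "Last command:"
    diag "$LASTCOMMAND"
    diag "Exit status: $STATUS"
    diag "stderr was:"
    sed -e 's/^/# /' "$TESTDATA/stderr.txt" >&2
    diag "Expected stderr to contain: $1"
    report_not_ok
}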
These functions also make our tests more straightforward to write:
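(A sketch of the rewritten middle of t/make-a-directory.test; the setup at the top and the plan and cleanup at the bottom stay as they were.)

start_test "mkdir exited successfully"
record mkdir "$TESTDATA/foo"
assert_exit_status 0

start_test "the directory was created"
if [ -d "$TESTDATA/foo" ]; then
    report_ok
else
    report_not_ok
fi

start_test "mkdir fails a second time"
record mkdir "$TESTDATA/foo"
assert_exit_status 1
assert_stderr_contains "File exists"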
And of course, the tests still pass:
$ prove
t/example.test ........... ok
t/make-a-directory.test .. ok
All tests successful.
Files=2, Tests=5, 0 wallclock secs ( 0.07 usr 0.02 sys + 0.00 cusr 0.03 csys = 0.12 CPU)
Result: PASS
…wait, that doesn’t show off the changes we just made.
Let me deliberately break the tests
by changing assert_exit_status 1 to assert_exit_status 42:
$ prove
t/example.test ........... ok
t/make-a-directory.test .. 1/?
# Test: mkdir fails a second time
# Last command:
# mkdir /tmp/tmp.X61ggkxxnZ/foo
# Exit status: 1
# stdout was:
# stderr was:
# mkdir: cannot create directory ‘/tmp/tmp.X61ggkxxnZ/foo’: File exists
# Expected exit status 42, got 1
t/make-a-directory.test .. Failed 1/4 subtests
Test Summary Report
-------------------
t/make-a-directory.test (Wstat: 0 Tests: 4 Failed: 1)
Failed test: 3
Files=2, Tests=5, 0 wallclock secs ( 0.07 usr 0.01 sys + 0.01 cusr 0.05 csys = 0.14 CPU)
Result: FAIL
Now it’s a lot clearer which test failed and why, without any distracting output from passing tests.
While it’s possible to keep adding new tests
to the end of our test script,
that’s not always a great idea.
Some sets of tests form a sensible chain,
each building upon the last,
but it can be hard to figure out where to insert a new test.
And if the behaviour of the system changes,
and one of the early tests needs to be set up differently,
that can invalidate all the following tests.
We want it to be easy to add new tests,
so we’d rather have a bunch of independent test scripts,
one per behaviour we want to test.
The prove tool will automatically discover tests to run,
so we can have as many as we like.
Unfortunately, our current test script has a lot of helper functions we’d want to re-use in every test. While it’s possible to have every script use its own versions of these helper functions (and sometimes that might be a good idea) I’d rather the default be a shared implementation.
First,
we create a new file, t/common.sh.
Because the filename doesn’t end in .test,
prove won’t try to run it.
It also doesn’t need a #! line,
because it doesn’t need to be executed directly.
Into that file we place all the helper functions we created above,
along with new helper functions
that wrap the setup and teardown code:
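(A sketch of the new additions to t/common.sh.)

# ...all the helper functions from before go here, then:

setup() {
    echo "TAP version 14"
    TESTDATA=$(mktemp -d)
}

teardown() {
    echo "1..$TESTCOUNT"
    rm -rf "$TESTDATA"
}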
Now we can make a new test script that includes our common code,
calls setup, does a test, then calls teardown.
Behold, t/make-a-directory-modularised.test:
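(A sketch; note that the include lands on line 3.)

#!/bin/sh

. common.sh

setup

start_test "mkdir exited successfully"
record mkdir "$TESTDATA/foo"
assert_exit_status 0

start_test "the directory was created"
if [ -d "$TESTDATA/foo" ]; then
    report_ok
else
    report_not_ok
fi

teardown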
Now, when we run the tests…
$ prove
t/example.test ....................... ok
t/make-a-directory-modularised.test .. t/make-a-directory-modularised.test: 3: .: common.sh: not found
t/make-a-directory-modularised.test .. Dubious, test returned 2 (wstat 512, 0x200)
No subtests run
t/make-a-directory.test .............. ok
Test Summary Report
-------------------
t/make-a-directory-modularised.test (Wstat: 512 (exited 2) Tests: 0 Failed: 0)
Non-zero exit status: 2
Parse errors: No plan found in TAP output
Files=3, Tests=5, 0 wallclock secs ( 0.10 usr 0.02 sys + 0.00 cusr 0.04 csys = 0.16 CPU)
Result: FAIL
…it fails with the error “common.sh: not found”.
The shell’s include directive, the . command,
is very weird compared to other languages.
Where other languages might look for the named file
beside the file that’s doing the including,
or in some special system-wide directory for libraries written in that language,
POSIX shell checks the directories listed in $PATH
(which is normally for directly-executable commands, not shell libraries).
I don’t know why it works that way,
but at least it’s easy to work around.
Before we try to import common.sh,
we’ll just change to the directory containing the test script,
and make sure we refer to the file with an explicitly relative path:
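(Something like this, at the very top of the script:)

cd "$(dirname "$0")"
. ./common.sh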
With that change, our test now works properly again:
$ prove
t/example.test ....................... ok
t/make-a-directory-modularised.test .. ok
t/make-a-directory.test .............. ok
All tests successful.
Files=3, Tests=7, 0 wallclock secs ( 0.10 usr 0.02 sys + 0.03 cusr 0.03 csys = 0.18 CPU)
Result: PASS
For the sake of completeness,
let’s modularise the “mkdir fails a second time” test too.
Here’s t/cannot-recreate-a-directory.test:
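(A sketch along the same lines as before.)

#!/bin/sh

cd "$(dirname "$0")"
. ./common.sh

setup

start_test "create the directory the first time"
record mkdir "$TESTDATA/foo"
assert_exit_status 0

start_test "mkdir fails a second time"
record mkdir "$TESTDATA/foo"
assert_exit_status 1
assert_stderr_contains "File exists"

teardown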
And running it:
$ prove
t/cannot-recreate-a-directory.test ... ok
t/example.test ....................... ok
t/make-a-directory-modularised.test .. ok
t/make-a-directory.test .............. ok
All tests successful.
Files=4, Tests=10, 1 wallclock secs ( 0.11 usr 0.02 sys + 0.05 cusr 0.05 csys = 0.23 CPU)
Result: PASS
Now we have make-a-directory-modularised.test
to test the basics of making a directory,
and cannot-recreate-a-directory.test
to test what happens if we create the same directory twice.
But since it’s a separate, isolated test,
before it can try creating a directory for a second time,
it has to create it once, first.
If there’s some kind of problem with creating a directory,
that will be detected by make-a-directory-modularised.test.
If other tests stumble over the same problem
and report failures,
we might have dozens or even hundreds of failures to look through,
with no indication which is the root cause.
It would be nice for a test script to be able to say
“I cannot set up the situation I need,
this test is not worth running”.
TAP provides this with the Bail out! message.
Anything after that message is considered a reason why
the test script cannot continue.
It’s easy enough to write a helper function for it:
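(A sketch of such a helper.)

bail_out() {
    echo "Bail out! $*"
    exit 1
}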
In practice, a lot of the setup we’ll need to do for any given test will be “run a command, and it has to set exit status 0”. We might as well write a helper function for that too:
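(A sketch, with reporting lines pasted in from assert_exit_status() above.)

run_or_bail() {
    record "$@"
    if [ "$STATUS" -eq 0 ]; then
        return
    fi
    echo >&2
    diag "Last command:"
    diag "$LASTCOMMAND"
    diag "Exit status: $STATUS"
    diag "stderr was:"
    sed -e 's/^/# /' "$TESTDATA/stderr.txt" >&2
    bail_out "setup command failed"
}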
This pastes some reporting code from assert_exit_status(),
but you could factor that into a helper function too
if you wanted.
Now cannot-recreate-a-directory.test looks like this:
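(The same sketch as before, reworked.)

#!/bin/sh
cd "$(dirname "$0")"
. ./common.sh

setup
run_or_bail mkdir "$TESTDATA/foo"

start_test "mkdir fails a second time"
record mkdir "$TESTDATA/foo"
assert_exit_status 1
assert_stderr_contains "File exists"

teardown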
On line 6, we’ve replaced record with run_or_bail.
We no longer need to call assert_exit_status after it,
since run_or_bail() does that check for us.
However much setup we need to do,
we can just wrap each command in run_or_bail
without writing extra assertions or error messages,
and be certain the test will stop immediately
if its setup fails.
Now we’ve got a full test suite of independent tests, what can we do with them?
One of the biggest problems of having a comprehensive test suite
is that running the tests can take a while.
To address this,
prove supports the -j option
to control how many test scripts are run simultaneously.
For example, prove -j2 will run two test scripts at once.
Our test suite is too small and disk-I/O-bound
to make parallelisation worthwhile,
but if your tests are CPU-bound and don’t access some shared resource,
you can crank this up to the number of CPUs you have available.
For that matter,
how do you know if your tests access some shared resource?
Even if you try to keep your tests isolated,
there may be some way in which different tests interfere with each other,
causing mysterious test failures or hiding bugs.
prove -s will shuffle the list of test scripts before running them.
If some test accidentally depends on some other test having completed first,
that should flush it out quickly.
Possibly the most powerful option prove has
is the --state option.
The simplest version is prove --state=save,
which just makes it save some statistics about the test suite
in the .prove file in the current directory.
However,
once you have some statistics available,
there are more interesting options.
For example,
prove --state=slow reads test-timing information from the state file
and uses it to schedule the slowest-running tests first.
Combined with -j,
you can make sure that you don’t wind up
waiting for the slowest tests at the end of the run.
Alternatively,
prove --state=fast runs the fastest tests first,
giving you quick feedback on whether a given change has broken everything.
If you’re debugging a regression,
prove --state=failed only runs the tests that failed last time,
saving time running tests that you don’t expect to be affected.
Combined with save (as in prove --state=failed,save),
tests that you fix will be removed from the list,
so you can whittle down failures one by one.
Even fancier,
the hot option runs the most recently-failing tests first,
automatically focusing on testing whatever code you’ve been working on.
It can be combined with all and save
to run recently failing tests,
then tests that have not been observed to fail,
and then to save the updated statistics for next time.
This particular combination can be spelled as
prove --state=hot,all,save
or as
prove --state=adrian —
I presume somebody named Adrian
really loved that particular setting.
While it doesn’t do all the things I’m used to with Python’s unittest module,
prove does some fancy things I wasn’t expecting
(like “failing tests first” and parallelisation).
The boilerplate in each test script is appealingly minimal,
and the final version of common.sh does everything I need
while being small enough
that I don’t mind customising it for different projects.
I will definitely be using this more in future!