A. Rothuis

Testing software, whether as specification beforehand or as verification afterwards, is often based on examples. Examples are easy to understand, explain the language of the domain and help illustrate the intended use of the system or unit under test.

However, it can be difficult to fully test a system or unit using examples alone. Coverage alone is not enough. When there are lots of cases that need consideration, there is a risk that edge-cases and certain combinations of inputs are overlooked. In some cases, the sheer amount of possible inputs and outputs make total testing a prohibitive effort.

That is where property-based testing can offer a helping hand. In this post, we will look at the similarities and differences between example-based and property-based approaches to testing and why there is value in combining both. All this against the backdrop of a relatively simple game: Rock, Paper, Scissors.

The Holy Trinity of Testing

In 1969, Tony Hoare wrote a paper seeking to provide a logical basis for proofs of the properties of a program, including whether or not a program carries out its intended function.

The intended function of a program, or part of a program, can be specified by making general assertions about the values which the relevant variables will take after execution of the program.


In many cases, the validity of the results of a program (or part of a program) will depend on the values taken before that program is initiated. These initial preconditions of successful use can be specified by the same type of general assertion as is used to describe the results obtained on termination. To state the required connection between a precondition (P), a program (Q) and a description of the result of its execution (R), we introduce a new notation:

P {Q} R

Hoare, C.A.R. (1969). An Axiomatic Basis for Computer Programming. Communications of the ACM, Vol. 12 (10), pp. 371-372.

Although based on Robert Floyd’s work (1967) on “Assigning Meanings to Programs”, in which he approaches the analysis and verification of commands in a flowchart through the lens of predicate logic, the idea of meeting a precondition so that a command can be executed which establishes the postcondition is known as a Hoare Triple.

In its modern appearance, a Hoare Triple is defined as having the form of {P}C{Q} where:

  1. P is the precondition, which must be true before execution of the command;
  2. C is a command, a set of instructions or a program; and
  3. Q is the postcondition, which must be true after execution of the command.

This logical way of asserting the (correct) execution of a certain instruction or set of instructions is the basis for a lot of common testing techniques.

Hoare’s Triple in Practice

Although Hoare logic was primarily aimed at formally proving (partial) program correctness using predicate logic, it has found its way in the toolbox of informal specification and verification techniques.

When performing tests, a common way to verify the working of a (sub)system or unit is to bring the system under test (SUT) in an initial state receptive to an instruction or set of instructions. Once setup, the instructions are executed. Afterwards, the resulting state is compared to a desired state. Whether a test has passed or failed is determined by a test oracle. This can be a human, a specification, an approximation or even a program of similar functionality.

We will explore the ways in which Hoare’s Triples make their appearance in modern testing practices by looking at a deceptively simple example: Rock, Paper, Scissors.

Rock, Paper, Scissors (Over-)Analyzed

Rock, Paper, Scissors is a game in which two or more players simultaneously select a move (often expressed by hand symbols): rock (a closed fist), paper (a flat hand), scissors (a V-sign).

The rules for winning are:

  • Rock crushes scissors
  • Paper covers rock
  • Scissors cuts paper

In all other cases:

  • if two moves are the same it is a draw
  • otherwise, we lose

We can represent this as a decision table, where the two moves are our independent variables (our inputs) and our result is the dependent variable (our result):

mine theirs result
Rock Scissors Win
Paper Rock Win
Scissors Paper Win
Rock Rock Draw
Paper Paper Draw
Scissors Scissors Draw
Rock Paper Lose
Paper Scissors Lose
Scissors Rock Lose

More elegantly, we can describe the results as endorelations in the set of possible moves:

  • the WIN relation contains: (Rock, Scissors), (Scissors, Paper) and (Paper, Rock)
  • the DRAW relation contains: (Rock, Rock), (Scissors, Scissors) and (Rock, Rock)
  • the LOSE relation contains: (Scissors, Rock), (Paper, Scissors) and (Rock, Paper)

Each relation is a pair of moves, a subset of the Cartesian product of the set of moves and itself, which can be mapped to a result. More generally: the domain of our game exists of a pair of moves, while our codomain contains our three outcomes.

These endorelations can be drawn as directed graphs:

Ofcourse, we can also represent this in pseudocode, in which our game takes two moves and gives us an outcome:

game Rock Scissors = Win
game Paper Rock = Win
game Scissors Paper = Win
game mine theirs | mine == theirs = Draw
                 | otherwise = Lose

Implementation in Java

In Java, we can model our moves and results as the enums Move and Result and model our game using simple if-statements or switch statements.

// Package declaration and imports omitted...

public class RockPaperScissors {
    public Result decide(Move mine, Move theirs) {
        if (mine.equals(Move.ROCK) && theirs.equals(Move.SCISSORS)
                || mine.equals(Move.PAPER) && theirs.equals(Move.ROCK)
                || mine.equals(Move.SCISSORS) && theirs.equals(Move.PAPER)) {
            return Result.WIN;

        if (!mine.equals(theirs)) {
            return Result.LOSE;

        return Result.DRAW;

OK, that’s not the prettiest code. Especially if we compare it to the “pseudocode” mentioned above which was totally not Haskell…

We can do slightly better. For instance, we could extract the “winning” branch into a private method to encapsulate that logic more precisely and have a more intent-revealing line of code. Another, arguably more declarative approach would be to keep track of a set of winning combinations. Sadly, Java does not have tuples. A Set of Lists will do.

public class RockPaperScissors {
    private Set<List<Move>> wins = Set.of(
            List.of(Move.ROCK, Move.SCISSORS),
            List.of(Move.PAPER, Move.ROCK),
            List.of(Move.SCISSORS, Move.PAPER)

    public Result decide(Move mine, Move theirs) {
        if (this.wins.contains(List.of(mine, theirs))) {
            return Result.WIN;

        if (!mine.equals(theirs)) {
            return Result.LOSE;

        return Result.DRAW;

In modern versions of Java, this could be implemented more elegantly using switch expressions, pattern matching and value or record types.

Example-Based Testing

In automated example-based testing, Hoare’s Triple prominently shows up in two forms.

Arrange, Act, Assert

First, the Triple comes up as a pattern from eXtreme Programming (XP) invented by Bill Wake, called Arrange, Act, Assert, Assemble, Activate, Assert or 3A. It prescribes that one should write simple test cases according to an easy-to-follow formula:

  1. Arrange: setup the SUT and, if needed, collaborators (real or test objects)
  2. Act: act on the object (using real or test parameters)
  3. Assert: make claims about the object and its collaborators (or global state)

Using a framework like JUnit, we specify our examples as follows:

// Package declaration and imports omitted...

public class RockPaperScissorsTest {
    @DisplayName("Rock crushes scissors")
    void rockCrushesScissors() {
        RockPaperScissors game = new RockPaperScissors();
        Result result = game.decide(Move.ROCK, Move.SCISSORS);
        assertEquals(Result.WIN, result);

    // Other examples omitted...

As you can see, Arrange, Act, Assert is implicit:

  1. Arrange: we setup our game
  2. Act: the game decides the outcome of two moves
  3. Assert: we assert that the result equals our expected result (Result.WIN)

The Arrange step is performed using the JUnit’s assertions package, but we could also use a more declarative library like hamcrest or assertJ.

We can refactor our test.

The Arrange step could be omitted by using lifecycle methods (annotated with @BeforeEach or @BeforeAll), but I prefer to be more explicit about the actual use of the objects and use the lifecycle methods primarily for complex setup logic. Tests can document the intended use of the code. Besides, there is a more effective refactoring to perform — as we will see soon.

Given, When, Then

A more explicit manifestation of the Triple is found in Gherkin, the language for specifying behaviour-driven examples, used in frameworks like Cucumber: Given, When, Then.

  1. Given some state of the system
  2. When I perform one or more actions
  3. Then there should be some expected outcome(s)

In Gherkin, a specification would look as follows. We won’t specify the other examples.

Scenario: Winning Rock, Paper, Scissors
    Given a game of Rock, Paper, Scissors
    When I play "Rock" and my opponent plays "Scissors"
    Then I "Win" the game

Keep in mind that the specifications would still need to be converted to actual test code using something like Cucumber. The quoted words are entered as parameters to be used in the reusable step definitions for Cucumber.

This conversion can happen at the end-to-end or system test level, but can just as easily be applied as a unit or integration test.

Parametrized Tests

Whether we use JUnit, Cucumber or something
like JBehave, writing every example as a separate test is a bit unwieldy. Especially since the only thing that varies are the parameters that go in and the results that go out. This is a good candidate for parameterized tests or data-driven testing.

In JUnit, we can use junit-jupiter-params for that. This gives us the power of the @ParameterizedTest annotation. We can use several sources as the inputs for our parameterized tests: values, csv-like arrays and even files. In this example we will use a data provider method, using the @MethodSource annotation. Let’s first write our method source:

private static Stream<Arguments> provideMovesAndResults() {
    return Stream.of(
        Arguments.of(Move.ROCK, Move.SCISSORS, Result.WIN),
        Arguments.of(Move.PAPER, Move.ROCK, Result.WIN),
        Arguments.of(Move.SCISSORS, Move.PAPER, Result.WIN),
        Arguments.of(Move.ROCK, Move.PAPER, Result.LOSE),
        Arguments.of(Move.PAPER, Move.SCISSORS, Result.LOSE),
        Arguments.of(Move.SCISSORS, Move.ROCK, Result.LOSE),
        Arguments.of(Move.ROCK, Move.ROCK, Result.DRAW),
        Arguments.of(Move.PAPER, Move.PAPER, Result.DRAW),
        Arguments.of(Move.SCISSORS, Move.SCISSORS, Result.DRAW)

As you can see, it looks a lot like our decision table from earlier! We produce a stream of 9 examples, with 3 arguments each: my move, their move and the end result. We can use this provider as a @MethodSource for a parametrized test.

Because only the input and output vary, we can replace our examples with a single test method:

void verifyMovesAndResults(Move mine, Move theirs, Result expectedResult) {
    RockPaperScissors game = new RockPaperScissors();
    Result result = game.decide(mine, theirs);
    assertEquals(expectedResult, result);

In Gherkin we can achieve something similar, but more readable using Scenario Outlines and Examples:

Scenario Outline: playing Rock, Paper, Scissors
    Given a game of Rock, Paper, Scissors
    When I play <mine> and my opponent plays <theirs>
    Then I <result> the game

        mine     | theirs   | result
        Rock     | Scissors | Win
        Paper    | Rock     | Win
        Scissors | Paper    | Win
        Rock     | Paper    | Lose
        Paper    | Scissors | Lose
        Scissors | Rock     | Lose
        Rock     | Rock     | Draw
        Paper    | Paper    | Draw
        Scissors | Scissors | Draw

Opinion on Example-Based Testing

I like example-based specification and testing, especially when it is so readable that it can serve as documentation to developers and stakeholders. It encourages collaboration with domain experts and exposes a shared language through the examples given. This is what I appreciate about Domain-Driven Design and its quest for ubiquitous language.

Another reason I like this kind of testing is that I can prove it works as specified. But this has a flipside. It can often look as if we are re-implementing our tests as code (or viceversa), instead of verifying the actual abstraction.

Lastly, we cannot rely fully on the specification. In the words of Edsger Dijkstra (1969): “Testing shows the presence, not the absence of bugs”. Coverage can only bring you so far.

Sure, mutation testing can help test the quality of your tests and the fidelity of your coverage. But what if, out of a thousand examples, a few examples are wrong? Would we say that system functions correctly? What if we don’t test enough? What if we can’t test enough? What if we don’t know the exact implementation yet? What if we only know some guiding principles our solution should conform to?

This is where property-based testing can offer a helping hand.

Property-Based Testing

Citing Hoare’s 1969-paper, I left out a part in which Hoare referred to asserting a programs correctness by looking at the properties of the produced result.

These assertions will usually not ascribe particular values to each variable, but will rather specify certain general properties of the values and the relationship holding between them.

Hoare, C.A.R. (1969). An Axiomatic Basis for Computer Programming. Communications of the ACM, Vol. 12 (10), pp. 371-372.

Specifically testing for properties in code was more recently popularised in Claessen and Hughes’ paper (2000) on QuickCheck, the first property-based testing tool.

We have designed a simple domain-specific language of testable specifiations which the tester uses to define expected properties of the functions under test. QuickCheck then checks that the properties hold in a large number of cases.


A testing tool must also be able to generate test cases automatically. We have chosen the simplest method, random testing, which competes surprisingly favourably with systematic methods in practice.


We have chosen to put distribution under the human tester’s control, by defining a test data generation language (…), and a way to observe the distribution of test cases. By programming a suitable generator, the tester can not only control the distribution of test cases, but also ensure that they satisfy arbitrarily complex invariants.

Claessen, K. & Hughes, J. (2000). QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs. ICFP ‘00: Proceedings of the fifth ACM SIGPLAN international conference on Functional programming.

When we aim to discover Hoare’s Triple in property-based testing, it looks somewhat like this:

  1. For an arbitrary P
  2. When I do C
  3. Property Q should hold

Let’s unpack this.

In terms of the triple, the processing step C is pretty much the same: we execute an operation or run some data through a system.

The precondition P and postcondition Q are more abstract, as the precondition is meant to constrain randomly generated inputs while the postcondition is a general truth after performing C.

Precondition: Generating Arbitrary Values

Instead of using examples for input, we generate our input based on the domain for our system or unit under test.

If our system works with arbitrary integers, we generate random integers. If a property should only hold when dealing with arbitrary positive even numbers, we make sure those are generated. If we are dealing with pairs of arbitrary moves in a game of rock, paper, scissors, we select from those.

When the domain of the unit under test is small, it won’t cost too much time to exhaustively test all cases. If the domain is too large (i.e. all integers), we generate a maximum amount of random numbers each time the test is run.

Using randomized data, we can discover edge-cases we could not come up with on our own. For instance, when working with multibyte strings. However, we do need to be able to constrain our input data, to make sure generally sensible values are generated (i.e. types).

Postcondition: Asserting Properties

Using property-based testing, we generate some arbitrary values that conform to some precondition. After we run them against our unit under test, we want to assert that one or more properties hold. Or, more importantly: we would want our test to fail when the expected property does not hold under a value that does hold for the given preconditions.

Finding properties

Using properties, we don’t focus on implementation details. We focus on the abstract rules that govern our code.

Finding the right property to base your test around is one of the hardest things to do. Some common patterns include:

  • checking commutative operations (“different paths, same destination”)
  • checking an operation’s inverse operation (“there and back again”)
  • checking for preserved invariants (“some things never change”)
  • checking for idempotence (“the more things change, the more they stay the same”)
  • checking through structural induction (“solve a smaller problem first”)
  • checking for typical results (“hard to prove, easy to verify”)
  • checking against an alternative implementation (“test oracle”)

In Rock, Paper, Scissors, we determined that the relations within the set of moves, which map to a result, can be modelled using digraphs:

Based on these endorelations, the following truths hold.

  • DRAW is a reflexive relation, comparable to that of identity.
    • Given an arbitrary move M, if we wish to DRAW, we need our opponent to also perform M.
    • This means a DRAW does not occur if there are two different moves.
  • WIN is characterized as asymmetric.
    • Given that an arbitrary move M beats an arbitrary move N, N will not beat M.
    • This means LOSE is the inverse of WIN.
  • WIN and LOSE are anti-transitive.
    • Given that n arbitrary move M beats an arbitrary move N and move N beats an arbitrary move O, then M will not beat O.

This yields 4 testable properties:

  1. Draw is reflexive
  2. Non-draw is winning or losing
  3. Winning is the inverse of losing
  4. Anti-transitivity

We let our tests generate the arbitrary moves required, apply our operation and verify that the outcome exhibits the desired property.

Using jqwik

Jqwik is a property-based testing framework that hooks into JUnit’s test runner. Its aim is “to combine the intuitiveness of microtests with the effectiveness of randomized, generated test data.”

First, add it to your project’s dependencies, alongside JUnit. For maven, this means:

<!-- ... -->

<!-- ... -->

Property I: Draw is reflexive

Create a test class. For this example, let’s create the file src/test/java/domain/RockPaperScissorsTest.java. In this class, we will create a first property to test:

public class RockPaperScissorsTest {
    boolean same_moves_are_draw(
            @ForAll("moves") Move move
    ) {
        RockPaperScissors game = new RockPaperScissors();
        Result result = game.decide(move, move);

        return result.equals(Result.DRAW);

    Arbitrary<Move> moves() {
        return Arbitraries.of(Move.class);

We annotate a property with the @Property annotation. Our method receives an arbitrary move, provided by the @ForAll annotation pointing to the moves method, which returns an arbitrary value from the Move enum upon each run. This is a provider method, annotated by @Provide. Back in the same_moves_are_draw method, we setup the game and verify that comparing a move with the same move results in a draw.

The test returns a boolean. Jqwik assumes true to be a passing test, false to be a failing one, but of course you could use an assertion library.

When we run the test, we see the following report:

timestamp = 2020-04-08T01:41:44.189095500, RockPaperScissorsTest:same moves are draw = 
tries = 3                     | # of calls to property
checks = 3                    | # of not rejected calls
generation-mode = EXHAUSTIVE  | parameters are exhaustively generated
after-failure = PREVIOUS_SEED | use the previous seed
seed = -946291026846295022    | random seed to reproduce generated values

Jqwik has tested draws three times – for all current moves: for Rock, Paper and Scissors.

Property II: Non-draw is winning or losing

We can add a test to verify the inverse of drawing. Two different moves should result in either a win or a draw!

For this move we need two arbitrary moves that are not the same. We can re-use the earlier provider for this, but add some annotations to constrain the input:

boolean two_different_moves_give_either_win_or_lose(
        @ForAll @Size(2) List<@From("moves") @Unique Move> moves
) {
    RockPaperScissors game = new RockPaperScissors();
    Result result = game.decide(moves.get(0), moves.get(1));

    return result.equals(Result.WIN) || result.equals(Result.LOSE);

We could also create another provider, which is arguably slightly more readable:

boolean two_different_moves_give_either_win_or_lose(
        @ForAll("two_moves") List<Move> moves
) {
    RockPaperScissors game = new RockPaperScissors();
    Result result = game.decide(moves.get(0), moves.get(1));

    return result.equals(Result.WIN) || result.equals(Result.LOSE);

Arbitrary<List<Move>> two_moves() {
    return Arbitraries.of(Move.class)

Jqwik will generate 6 tests for based on these providers and properties: the unique combinations of 2 moves (3 * 2).

Property III: Win is the inverse of lose

Next up: the anti-symmetric relation between winning and losing. If A vs B means losing, B vs A should mean winning. A result of two unique moves should not be the same as its inverse.

We can express that as follows, using our earlier provider.

boolean two_different_moves_give_antisymmetric_results(
        @ForAll("two_moves") List<Move> moves
) {
    RockPaperScissors game = new RockPaperScissors();
    Result result = game.decide(moves.get(0), moves.get(1));
    Result inverse = game.decide(moves.get(1), moves.get(0));

    return !result.equals(inverse);

Again: 6 tests are run.

Property IV: anti-transitivity

Rock beats scissors. Scissors beats paper. But rock does not beat paper. In other words, given three arbitrary unique moves, (M, N, O) if M > N > O then M > O should not hold.

There might be a clever way to write this using a provider and a 3-value tuple, but I opted for a simple provider and a more complex verification of the property:

boolean moves_are_not_transitive(
        @ForAll("three_moves") List<Move> moves
) {
    RockPaperScissors game = new RockPaperScissors();

    Result mVersusN = game.decide(moves.get(0), moves.get(1));
    Result nVersusO = game.decide(moves.get(1), moves.get(2));
    Result mVersusO = game.decide(moves.get(0), moves.get(2));

    return mVersusN.equals(nVersusO) != mVersusO.equals(mVersusN);

Arbitrary<List<Move>> three_moves() {
    return Arbitraries.of(Move.class)

For this property, 6 tests are run. One for each unique combination of rock, paper and scissors.

Note that we have tested 4 properties without explicitly referring to the moves themselves. Total tests generated: 21.

Although this misses the specificity of concrete examples about which move beats which, we could add examples using the @Example annotation. Combining the happy path examples with these properties offers sufficient coverage. More interestingly, these tests allow us to add moves to our game, without paying for it in coverage.

Scaling Up

A popular variant of Rock, Paper, Scissors is the game Rock, Paper, Scissors, Lizard, Spock, increasing the number of possible moves. There is even an instance of the game with 101 moves: RPS-100!

Let’s add Move.LIZARD and Move.SPOCK:

public enum Move {

What will happen if we run our tests?

In our current implementation, we have two failing tests: properties III and IV. Jqwik provides us with a counter example for the failing test “two different moves give antisymmetric results”: [[ROCK, LIZARD]].

This means losing in both ways. That is how we (unintentionally) deal with unimplemented moves. Let’s fix this by expanding the set of winning moves:

        private Set<List<Move>> wins = Set.of(
                List.of(Move.ROCK, Move.SCISSORS),
                List.of(Move.ROCK, Move.LIZARD),
                List.of(Move.PAPER, Move.ROCK),
                List.of(Move.PAPER, Move.SPOCK),
                List.of(Move.SCISSORS, Move.PAPER),
                List.of(Move.SCISSORS, Move.LIZARD),
                List.of(Move.LIZARD, Move.PAPER),
                List.of(Move.LIZARD, Move.SPOCK),
                List.of(Move.SPOCK, Move.ROCK),
                List.of(Move.SPOCK, Move.SCISSORS)

Now, only 1 tests fails. There are combinations of three moves that are transitive! Jqwik provides a falsification in Rock, Scissors, Lizard. Rock crushes Scissors and Scissor decapitates Lizard, but Rock also crushes Lizard!

Antitransitivity is apparently a property specific only to a 3-move variant of Rock, Paper, Scissors, not a general variant. Our other, more important, properties still hold which form the basis of a balanced game. Collectively, they guarantee that there is an equal amount of winning, losing and drawing combinations.

Adding more moves will result in a failing test whenever the game is unbalanced. Which is at the core of what our code is supposed to do.


In our example, we tested the essential, abstract rules that make up a game of Rock, Paper, Scissors. The game only works because of these properties.

This can make property-based testing interesting for designing against heuristics using Test-Driven Development (TDD). Imagine we knew nothing about Rock, Paper, Scissors, just that we wanted to make a balanced game and define a general heuristic for “balance”. Testing for these properties using TDD would continuously evaluate the game balance while designing it! We focus on a higher abstraction: one can define general principles the software should conform to without needing to fill in the details upfront.

That is not to say that the details are unimportant. One would still benefit from other test oracles, like examples. Not only because it invites collaboration and maximizes on the use of the ubiquitous language, but also because specifications require specificity.

Testing properties scales well. A given design heuristic will hold regardless of the expansion of domain or codomain. That being said, as the amount of possibilities increases, it will be more difficult (performance-wise) to exhaustively generate tests. However, property-based testing tools can test against a randomized subset of arbitrary values and can shrink a failing test to a minimum counterproof that falsifies the underlying assumption. This is useful for discovering implementation oversights. It is difficult to do this using examples alone.

It takes a while getting used to testing properties, because it is to find them. Distilling the general truth that holds for a certain set of arbitrary values when applying a certain operation is easier said than done. Knowing certain common patterns can help with that, but it can be tough to design the right test without re-implementing (part of) the desired behavior. That, like most things in software engineering, takes practice.


Leave a comment below!