Segue Coplien Testing
By James O Coplien
2.1 Introduction
I guess it’s time for the second installment. My earlier essay
started just as a casual reply to a client, but when Rex Black
posted it on his web site, it went viral — going to #3 on Reddit
and appearing prominently on other social networking
amalgamation sites. Since then I’ve had the pleasure of
watching the dialog unfold. It’s ranged from profound to just
silly and, sadly, the majority of it falls into the latter category. It
appears that a widespread mythology is at work, perhaps
propelled by industry hope and fuelled by academic programs
and consultants desperate to justify their existence.
archetypical responses that the gallery offered on the first round,
together with my analysis. These provide a good cross-section
of the typical misunderstandings that flood our industry.
You may know about how big the map should be based on the
client for the upcoming release but there will be more releases
after that, and you want to make the class reasonably change-
proof. So you allow the map to dynamically create as much
memory as it needs. The current names are ASCII strings and
the telephone numbers are eight decimal digits, but you make
the map slightly more general than that (because maybe your
country doesn’t use 8-digit numbers) — maybe you make it a
template instead of a class, or you insist that its objects adhere to
some declared interface.
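To make the shape of such a class concrete, here is a minimal sketch of what that generalization might look like; the name Directory and its methods are invented for illustration, not taken from any real codebase:

    #include <cstddef>
    #include <map>

    // Hypothetical sketch: a map generalized over its key and value types,
    // so it no longer assumes ASCII names or eight-digit telephone numbers.
    template <typename Key, typename Value>
    class Directory {
    public:
        void add(const Key& k, const Value& v) { entries_[k] = v; }   // insert or replace
        bool remove(const Key& k) { return entries_.erase(k) > 0; }   // delete an association
        const Value* find(const Key& k) const {
            auto it = entries_.find(k);
            return it == entries_.end() ? nullptr : &it->second;      // fetch, if present
        }
        std::size_t count() const { return entries_.size(); }         // number of associations
    private:
        std::map<Key, Value> entries_;   // grows dynamically; no fixed capacity
    };

    // e.g. Directory<std::string, std::string> phoneBook;  // names to numbers of any format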
Now you write the unit tests for the map class methods. Unless
you have full traceability, you can’t know exactly what kind of
data the program will offer the map nor in what order it will
invoke the methods, so you test a reasonable number of
combinations — a large, reasonable number of combinations.
This is all the more important if the map is used from several
different loci within the program, as we want to cover as many
of its countably infinite usage scenarios as possible.
The map may even have a method count that reports how many
associations it holds. And we need to test that. It seems like a
natural part of what a map should do. We test the entire
interface of the map. Maybe there is a method to replace an
association. And to delete one. Or to fetch one out as a pair. It
depends how good the programmer is, doesn’t it?
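As a rough sketch only (reusing the hypothetical Directory above, and plain assert rather than any particular test framework), unit tests for that whole interface might start like this, and then multiply across data values and call orderings:

    #include <cassert>
    #include <string>
    // (Assumes the hypothetical Directory template sketched above is in scope.)

    int main() {
        Directory<std::string, std::string> d;

        assert(d.count() == 0);                    // an empty map reports zero associations
        d.add("Alice", "12345678");
        d.add("Bob",   "87654321");
        assert(d.count() == 2);                    // count tracks insertions

        d.add("Alice", "11112222");                // replace an existing association
        assert(*d.find("Alice") == "11112222");

        assert(d.remove("Bob"));                   // delete one association
        assert(d.find("Bob") == nullptr);          // ...and it is really gone
        assert(d.count() == 1);
        return 0;
    }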
The map in isolation requires more exercising by its tests than
the application will require at any point in its lifetime!
In theory we could reduce this number if we had a crystal ball
for the orders of method executions, argument values, and so
forth, that the class’s objects will experience in the field. And
we in fact can’t be sure about what tests faithfully reproduce
tomorrow’s system behavior. We make lots of arguments about
equivalence classes across configurations of values, but we rarely
prove these arguments. In an OO system with polymorphism we
probably can’t even prove that a given method will be invoked
at all, let alone know the context of invocation! All bets are off
for such formal proofs in an OO world (though there would be
hope in FORTRAN).
One can make ideological arguments for testing the unit, but the
fact is that the map is much larger as a unit, tested as a unit, than
it is as an element of the system. You can usually reduce your
test mass with no loss of quality by testing the system at the use
case level instead of testing the unit at the programming
interface level. System tests line up many method invocations,
all of which must work for the test to pass. (I know this is a bit
simplistic, but it holds as a generalization.) System testing
becomes a form of compression of information that one simply
cannot realise at the unit level.
It can be even worse: the very act of unit testing may cause the
interface of the map to grow in a way that’s invisible in the
delivered program as a whole. Felix Petriconi and I have been
debating the unit testing issue in email, and today he wrote me
that: “You are right. E.g. we introduced in our application lots of
interfaces to get the code under (unit) test and from my
point of view the readability degraded.” David Heinemeier
Hansson calls this “test-induced design damage”: degradation of
code and quality in the interest of making testing more
convenient (http://david.heinemeierhansson.com/2014/test-
induced-design-damage.html). Rex Black adds that such
tradeoffs exist at the system level as well as at the unit level.
2.2.1 A corollary
Weinberg’s Law of Composition tells us that we need much
more than the unit to show that a bug has been mitigated.
A second historic note: CRC cards (“A Laboratory for Teaching
Object-Oriented Thinking,” Kent Beck and Ward Cunningham,
OOPSLA ’89 Conference Proceedings; see
http://c2.com/doc/oopsla89/paper.html) used to be a powerful
way to create a class-based or object-based design from end-
user scenarios. In a role-play of a system scenario, each team
member represents the interests of one or more classes or
objects, using a recipe card to represent each crisply-named
object; the name appears on the top line of the card. The rest of
the card is split in two: the left half lists the responsibilities of
the object or class to the system, and the right half lists the
collaborators, or helpers (other cards) that the object enlists to
complete its work. Today the CRC acronym stands for
Candidate object, Responsibilities, and Collaborators. If the card
is on the table, it’s a class; if it’s active in the discussion of a
scenario, its person holds it aloft and we think of it as an object.
(In reality, the cards represent roles rather than either classes or
objects, but that’s another discussion.)
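For readers who have never seen one, a card might look roughly like this; the candidate object and its entries are invented for illustration:

    TelephoneDirectory
    --------------------------------------------------------
    Responsibilities               | Collaborators
    -------------------------------+------------------------
    Find the number for a name     | Subscriber
    Add or remove a subscriber     | Subscriber, ChangeLog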
design is minimal because the only way to add a responsibility
to a card is to support a use case or scenario.
network routing algorithm contributes to lowering congestion,
decreasing latency, or increasing reliability, and testing it has a
first-order, traceable tie to product value. It takes a lot of
imagination, hand-waving, or indirect inferences to say the same
of many object instance methods.
See: http://www.differencebetween.info/difference-between-
fault-and-failure)
Some failures are hard to test in practice, and there are many
failures that simply cannot be tested. (Think of synchronization
errors that occur only within some timing window; in a system
with multiple clocks, the window of opportunity for error can be
arbitrarily small, which means you would need an unbounded number of
discrete tests to explore the space of possible failures.) Most
software failures come from the interactions between objects
rather than being a property of an object or method in isolation.
mean something only if they lead to failures. Without the
contextualization of either requirements or interactions with
other units, finding failures at the unit level is a dicey
proposition.
Let’s say that your unit tests discover a problem with your Stack
library: that when you push more than 2^15 items, all items but
one on the Stack are lost. This is certainly a bug: the kind of
thing we look for in unit testing. It is certainly a fault. But we
find that the application program never pushes more than three
things on any Stack at any time. By definition, it is not a failure:
it is irrelevant to product value. It’s likely that fixing this bug is
waste: that is, I will never realize my return on the investment of
fixing it. Even testing for it is waste.
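To make that concrete, here is a hypothetical sketch (the Stack below is a stand-in for the buggy library, not anyone's real code): the unit test dutifully exposes the fault, while the application's actual usage never comes near it.

    #include <cassert>
    #include <cstddef>
    #include <vector>

    // Stand-in for the buggy library Stack: past 2^15 items, all but one are lost.
    template <typename T>
    class Stack {
    public:
        void push(const T& v) {
            if (items_.size() >= (1u << 15)) items_.clear();   // the fault
            items_.push_back(v);
        }
        std::size_t size() const { return items_.size(); }
    private:
        std::vector<T> items_;
    };

    // Unit test: exposes the fault, but only in a state the product never reaches.
    void unitTestDeepStack() {
        Stack<int> s;
        for (int i = 0; i < (1 << 15) + 1; ++i) s.push(i);
        assert(s.size() == (1 << 15) + 1);     // fires: items were silently lost
    }

    // Application usage: never more than three items on any Stack,
    // so the fault above never becomes a failure in the field.
    void applicationUsage() {
        Stack<int> s;
        s.push(1); s.push(2); s.push(3);
    }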
All this extra baggage does provide an outlet for more nerd
work! Given that we design at the unit level without regard for
the details of what code is used by the application, we end up
creating a good deal of dead code. Some organizations pride
themselves on removing that code after finding it and they label
it with the noble title of refactoring, aided by code coverage.
(George Carlin’s standup routine on Stuff, and how to deal with
leftovers in your refrigerator, comes to mind here.) Given that
unit testing has become so popular in the past decade it’s no
wonder that refactoring and code coverage are having their
heyday.
Except maybe for those making kitchen recipe-filing programs,
most of the rest of us work in complex domains where we will
never prevail to the point of perfection. The question is: what do
we do about it? It’s easy to let the user deal with the crash or to
pray that our errors don’t corrupt their data. Yet the very idea
behind testing is that it gives us tools to deal with this challenge
more intelligently. Why we don’t exercise them more fully than
we do is indeed a wonder, and that’s what we’ll explore here.
We usually think of a test as a combination of some stimulus, or
exercise, together with an oracle-checker that compares actual
and expected results to detect faults. While we usually separate
software into test drivers, the software under test, and the oracle-
checker, considering the two testing components together helps
us extend our notion of quality into post-partum software. And
this is not some ideal dream, but can be achieved today just by
an act of will and design.
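One way to read "an act of will and design" (a minimal sketch with invented names, not a prescription): ship the oracle-check with the product, so the same comparison a lab test would make also guards the running system.

    #include <stdexcept>

    // Hypothetical sketch: the oracle travels with the delivered code.
    struct Transfer { long fromBefore, toBefore, fromAfter, toAfter; };

    // Oracle: a transfer neither creates nor destroys money.
    bool transferOracle(const Transfer& t) {
        return t.fromBefore + t.toBefore == t.fromAfter + t.toAfter;
    }

    void commitTransfer(const Transfer& t) {
        if (!transferOracle(t))                                        // the check a test driver would make...
            throw std::runtime_error("transfer invariant violated");   // ...now guards production
        // ... proceed with the commit
    }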
[Figure: Venn diagram of three overlapping sets: scenarios we test in the lab, scenarios that run in the field, and scenarios exhibiting faults. The regions are labelled (a) through (g).]
Good analysis can reduce the effort to test sequences that will
never occur in real life ((a), (d), and (g)). (Trying to increase the
overlap between scenarios tested in the lab and scenarios that
run in the field is referred to as test fidelity — thanks, Rex!) The
more understanding we have of our client, the less we will
deliver something that works wrong or that is never used. More
importantly, good analysis also reduces field failures ((c) and
(f)), which is where the rework cost lies.
The problem with the last case — unanticipated field faults (f)
— is that the fault may go undetected in testing. The code may
just silently generate the wrong result and go on. The only
scenarios for which testing delivers value are those that we
foresee, that we test, which actually run in the field, but which
exhibit failure in the lab (c). That is, testing generates value in
only one out of six of these combinations. We get value because
we bring together the right scenario-generator and the right
correctness oracle in one place.
libraries, and the rest of the execution environment. We can see
if their value falls within the range that knowledge validates for
the operation.
independently written software to return the system to a known
safe state while retaining as much work in progress as possible.
This is one of the fundamental building blocks of fault-tolerant
computing. In the end, these “tests” change your quality
mentality. If my browser fails to connect to a site, it doesn’t just
issue an error message. The “test” digs deeper to find out why.
If I can’t reach that page because the Internet is in a particular
stage of disconnection, it queues the request, and automatically
re-tries it on my behalf when the network comes back online.
Great quality assurance turns the drudgery of tests into
customer-pleasers.
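A rough sketch of that retry behaviour, with invented stand-ins for the real environment (this is not any particular browser's code): the check that detects the failed fetch also owns the recovery.

    #include <chrono>
    #include <string>
    #include <thread>

    // Stubs standing in for the real environment (assumptions of this sketch).
    static bool networkUp() { return true; }                    // probe connectivity
    static bool fetchPage(const std::string&) { return true; }  // attempt the fetch

    // Instead of reporting an error, queue the work and retry with backoff,
    // returning the system to a known good state with the request preserved.
    bool fetchWithRecovery(const std::string& url, int maxRetries = 5) {
        for (int attempt = 0; attempt < maxRetries; ++attempt) {
            if (networkUp() && fetchPage(url))
                return true;                                     // succeeded; nothing to tell the user
            std::this_thread::sleep_for(std::chrono::seconds(1 << attempt));  // back off
        }
        return false;                                            // only now surface the failure
    }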
What struck me is that most arguments for unit testing and TDD
are of the form: “be fearless and you’ll be faster,” or present
homilies such as “aim for test feedback in 300 ms.” In this talk,
it seems that only Corey Haines comes out with
reasoned advice related to focusing on the places where there is
the most payoff. I, too, feel that in Chapter 1 I strove to
give compelling models that are in essence gedanken
experiments that lead the developer into reasonable practice. I’ll
argue that there’s some science, or at least some reason, behind
my admonitions. Many of the other “arguments” are either
emotive or are credos.
Removal Step                  Efficiency
                            Lowest   Modal   Highest
Formal Design Inspections     35%      55%     75%
Modeling or Prototyping       35%      65%     80%
Field Testing                 35%      50%     65%
Informal design review        30%      40%     60%
Formal code inspections       30%      60%     70%
Integration Test              25%      45%     60%
Functional Test               20%      35%     55%
Code Desk Check               20%      40%     60%
Design Doc Desk Check         15%      35%     70%
Unit Test                     10%      25%     50%
Total                         93%      99%     99%
Desk Check 27%
Prototyping 20%
Acceptance Tests 17%
Regression Tests 14%
Total 99.96%
from being wrong than it prevents the change itself from being
flawed. This is therefore a seriously flawed and
dangerous perspective.
The dynamics are different for small chunks of code than for
aggregate behaviour, which is one reason that requirements-
level testing is qualitatively different than unit testing. Good
Smalltalk methods are about three lines long: good object
methods are very short and behave like atomic
operations on the object. Clean Code (Bob Martin, 2008) says
that it’s rare that a method should grow to 20 lines, and he
describes Kent Beck’s Smalltalk code as comprising methods
that are two, three, or four lines long. Trygve Reenskaug
(inventor of MVC) recently calculated the average method
length in a large program typical of his code: it also came out
around three statements per method.
Consider that such a method is the unit under test. I have ten
unit tests for it. I make a change to one line of the method and
try to argue that five of the tests are still valid: i.e., that the
single-line change won’t change how the function responds to
those tests. The chances of the other two lines being well
encapsulated from the changed line are pretty small, so changes
to code should almost always imply changes to a test. This
would lead to a unit-test rule:
process improvement). And those problems aren’t usually
expressible at the unit level, but rather at the system level.
Toyota has recently renewed its faith in these ideas to the point
of replacing robots on its assembly lines with individuals. See
the article from April 2014 at qz.com:
http://qz.com/196200/toyota-is-becoming-more-efficient-by-
replacing-robots-with-humans/
debunked some software engineering myths that many have held
dear for years. One of these myths is a common excuse for unit
testing: the earlier you test, the cheaper it is. We have this belief
that the closer to the source our testing is, the more cost-
effective it will be. Jørgensen challenges this claim (“Myths
and Over-simplifications in Software Engineering”, research
paper, simula.no,
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.259.50
23)
The report states (page 5-4) ‘... regardless of when an error is
introduced it is always more costly to fix it downstream in the
development process.’ An assumption of no added error detection
cost and always decreasing correction cost when detecting more
errors in the phases where they are introduced, creates, not
surprisingly, great savings from better testing infrastructure. The
assumption is, as argued earlier, not supported with evidence, and,
likely to be incorrect.
2.9 Conclusion
The most important thing I want you to take away from this
chapter is that most bugs don’t happen inside the objects, but
between the objects. Adele Goldberg used to say “it always
happens somewhere else.” The crucial design decisions of
object-oriented programming happen outside the wall. Unit
testing has returned us to the pre-object days of modules in what
might better be called class-oriented programming than object-
oriented programming. Unit testing checks up on exactly the
same small design decisions “inside the wall” whose mistakes are easily
caught without the waste of additional code: inspections, Cleanroom,
code reviews, and pair programming come to mind.
Developers make a big deal out of this pittance, perhaps because
it is something they feel they can control. Broader testing
requires cooperation across other software modules and a sense
of teamwork. Nerds, who tend to be introverted, would much
rather sit with JUnit than sit around a table playing CRC cards.
Their ignorant bosses who view the latter as a meeting and the
former as real work add fuel to the fire.
Dig deeper.
2.10 Acks
Tons of thanks to Felix Petriconi, Neil Harrison, Brian Okken,
and Rex Black.