
Overload 180


C++ Safety,

In Context
Herb Sutter discusses C++’s
current security problems
and potential solutions

User-Defined Formatting in std::format


Spencer Collyer demonstrates how to provide
formatting for a simple user-defined class

To See a World in a Grain of Sand


Jez Higgins shows how to refactor code that has grown
organically, making it clearer and more concise

Judgment Day
Teedy Dee finds out what happens if AI takes your job

A magazine of ACCU ISSN: 1354-3172


accu

Monthly journals, printed and online


Local groups run by ACCU members
Discounted rate for the ACCU Conference
Email discussion lists

accu.org
OVERLOAD CONTENTS

April 2024
ISSN 1354-3172

Overload is a publication of the ACCU. For details of the ACCU, our publications and activities, visit the ACCU website: www.accu.org

4 C++ Safety, In Context
Herb Sutter discusses C++’s current security problems and potential solutions.

14 To See a World in a Grain of Sand
Jez Higgins shows how to refactor code that has grown organically, making it clearer and more concise.

20 User-Defined Formatting in std::format
Spencer Collyer demonstrates how to provide formatting for a simple user-defined class.

27 Judgment Day
Teedy Dee finds out what happens if AI takes your job.

Editor
Frances Buontempo
[email protected]

Advisors
Paul Bennett
[email protected]
Matthew Dodkins
[email protected]
Paul Floyd
[email protected]
Jason Hearne-McGuiness
[email protected]
Mikael Kilpeläinen
[email protected]
Steve Love
[email protected]
Christian Meyenburg
[email protected]
Chris Oldwood
[email protected]
Roger Orr
[email protected]
Balog Pal
[email protected]
Honey Sukesan
[email protected]
Jonathan Wakely
[email protected]
Anthony Williams
[email protected]

Advertising enquiries
[email protected]

Printing and distribution


Parchment (Oxford) Ltd

Cover design
Original design by Pete Goodliffe
[email protected]
Cover photo by Daniel James, of a double row of the ‘colours’ of the Royal Tank Regiment that can be seen in the church of St Mary Aldermary.

Copy deadlines
All articles intended for publication in Overload 181 should be submitted by 1st May 2024 and those for Overload 182 by 1st July 2024.

ACCU
ACCU is an organisation of programmers who care about professionalism in programming. We care about writing good code, and about writing it in a good way. We are dedicated to raising the standard of programming.
Many of the articles in this magazine have been written by ACCU members – by programmers, for programmers – and all have been contributed free of charge.

Copyrights and trademarks
Some articles and other contributions use terms that are either registered trade marks or claimed as such. The use of such terms is not intended to support nor disparage any trade mark claim. On request, we will withdraw all references to a specific trade mark and its owner.
By default, the copyright of all material published by ACCU is the exclusive property of the author. By submitting material to ACCU for publication, an author is, by default, assumed to have granted ACCU the right to publish and republish that material in any medium as they see fit. An author of an article or column (not a letter or a review of software or a book) may explicitly offer single (first serial) publication rights and thereby retain all other rights.
Except for licences granted to 1) corporate members to copy solely for internal distribution 2) members to copy source code for use on their own computers, no material can be copied from Overload without written permission from the copyright holder.

April 2024 | Overload | 1


Editorial Frances Buontempo

I Don’t Believe It!


Sometimes we are surprised by unexpected outcomes or how long
things take. Frances Buontempo confesses to how she’s lost hours
recently, but learnt from the experiences.

I recently spoke at CppOnline [CppOnline], a new online-only conference. It was loads of fun, though it always feels odd talking to your monitor and hoping someone is listening. We were advised to close unnecessary applications and browser tabs down to ensure smooth performance of our machines while we spoke. You may find this hard to believe, but I spent about four hours closing browser tabs, taking up time I could have otherwise spent on an editorial. I currently have 62 open; a grand improvement on the 99 or more before the conference. No editorial though, sorry.

If you’re not a tab hoarder you might find spending so much time closing tabs very strange, but I know I am not the only person who does this. I could just bookmark pages, but I gave up on bookmarks years ago, because links went stale and I had so many I couldn’t find anything. If I have a tab open, it’s usually something I do want to read or listen to at some point, and then maybe make notes or buy music or similar. One tab I closed was for a new turntable, because our old one seemed to have stopped working. I bit the bullet and bought the new turntable. It’s excellent and in the process of setting it up, I discovered why the old turntable didn’t work. The pre-amp was unplugged. The new bit of kit does have a USB port though, so I can record all my old records one day. Closing that tab was expensive, informative and has probably caused another time-consuming job.

Another tab was The Return of -1/12 by Numberphile on YouTube [Numberphile]. They discussed infinite series. As many of you know, 1 + ½ + ¼ + … equals 2. We can prove this, since writing

S = 1 + 1/2 + 1/4 + 1/8 + ...

means

2S = 2 + 1 + 1/2 + 1/4 + 1/8 + ...

which tells us when we subtract both we get 2S - S = S = 2. QED. That doesn’t seem unreasonable. However, if we were now to try writing 1 + 10 + 100 + … we get into trouble. Writing

S = 1 + 10 + 100 + …

would mean we could have

10S = 10 + 100 + 1000 + …

so we would then be claiming 10S - S = 9S = -1. I’m not sure about you, but this suggests the sum, S, is -1/9, which seems very unlikely. Of course, there is a restriction on the terms of the infinite sum. The terms need to decrease by enough so that we can actually write the equals sign, otherwise the sum doesn’t converge on a number and we end up with unbelievable nonsense. Maybe you already know about infinite series and analytic continuations [Wikipedia-1], which allow us to extend the domain of functions. They are not to be confused with algebraic continuations which allow us to continue execution using futures and similar, and might mean I end up with more tabs open again were I to try to explain in detail. The takeaway message is that reasoning is often caveated with prerequisites; for example, a radius of convergence for a series. Applying similar logic in different circumstances may lead to surprises or mistakes. If something seems unbelievable, like adding positive numbers and getting a negative answer, an assumption you are making might be wrong.

A relevant computing example concerns benchmarking. A long time ago, Roger Orr wrote an article entitled ‘Order notation in practice’, based on his talk at an ACCU conference [Orr14]. He demonstrated various factors which also influence the performance of an algorithm besides its complexity measure. He discussed strlen, and discovered many compilers had optimised away the call, so the theory didn’t match the practice. Trying to build up an intuition about possible outcomes, so you spot when something is amiss, is an important skill, so well spotted, Roger. Kevin Carpenter talked about building intuition at MeetingCpp [Carpenter23], and discussed making educated guesses, which may or may not be true. I couldn’t attend his talk, because it clashed with mine, so I had a tab open to listen at some point. Fortunately, I managed to catch his re-working of the talk live at CppOnline and even ask a question. So, I closed another tab.

Our intuition can be wrong, but we need to start somewhere. Lots of interesting mathematics falls out of proving a first guess is incorrect, or finding circumstances under which the ordinary does not happen, leaving us with something extraordinary. And wondering what-if can be fruitful. Whether that’s imagining a square root of -1, or exploring what is possible at compile time, new disciplines emerge. However, sometimes wondering why we have 5 test cases for a function with 7 if/else branches leads us to deduce we can delete the extra branches. The tests may still pass, however there’s a chance someone forgot to add more tests when they added more code. Mutation testing might well pick this kind of thing up. If you’re not familiar with this, at a high level it randomly mutates the code, dropping branches, changing + to – and similar, and reports back if any tests still pass. Filip van Laenen wrote an article about mutation testing for us back in 2012 [vanLaenen12] if you want to know more. He did say at the time he wasn’t a C++ programmer so could only give details on other languages and mention a couple of frameworks in C++ he was aware of. Perhaps the time has come for someone to write a new article telling us about current tools?

Tests for branches in code came to mind because Jez Higgins recently tooted [Higgins24a] about some flappy code he refactored, which had

Frances Buontempo has a BA in Maths + Philosophy, an MSc in Pure Maths and a PhD using AI and
data mining. She’s written a book about machine learning: Genetic Algorithms and Machine Learning
for Programmers. She has been a programmer since the 90s, and learnt to program by reading the
manual for her Dad’s BBC model B machine. She can be contacted at [email protected].

2 | Overload | April 2024


more branches than tests. Of course, a code coverage tool should pick that up, though mutation testing may find other problems. Jez spotted this by eye from simply looking at the code and wrote about this in a blog [Higgins24b]. Thankfully, he has followed it up with the refactorings to make the code better, and allowed us to include the write up in this issue. The code he considers in his blog is unbelievable, but untidy and confusing code does emerge over time, and you need to find time to tidy up once in a while, otherwise the weeds grow and take over. As a side note, we caught up with Jez at the Norfolk Developers Conference [NorDev], which a handful of ACCU people based in the UK go to. Jez didn’t have a ticket for the speakers’ dinner, so found an EMF gig in town that evening instead. Unbelievable. (Possibly a niche joke if you don’t know the band EMF, but here’s a famous song by them [EMF]: You’re unbelievable. Apologies).

I picked the title ‘I don’t believe it’ based on an oft-repeated phrase by a TV character, Victor Meldrew [IMDB]. A variety of slightly unlikely things happen to him, and he usually responds with a variation of the phrase “I don’t believe it.” I caught myself saying this a few times recently, and am treating that as a warning because the character is a slightly sulky old man. Not something to aspire to. Now, not all unbelievable things are negative. For example, finding a gig at the last minute is a nice surprise. Fighting some code for a couple of hours and finding it compiles is always a surprise too, but often leaves you wondering if it really works. Life is so much calmer if you can take tiny baby steps to refactor something. I hope Jez does write up his refactoring steps – maybe we can see this as an article in Overload. Refactoring is an important skill, and I suspect many of us still have lots to learn.

As languages change, we need to keep learning. It’s never easy, and I don’t know about you, but I am often surprised when I come across things I hadn’t noticed before. One of the many tabs I closed was from CppReference, telling me all about std::piecewise_construct [CppRef-1]. (Aside: you know I am reopening these tabs to double check what they say as I write: place bets on my tab count when I’m done.) The std::piecewise_construct_t is an empty class tag type and is used to differentiate between functions taking a tuple of two elements and those taking two arguments directly. In contrast, the next tab told me about std::forward_as_tuple [CppRef-2]. This allows me to construct a tuple of references to forward as an argument to a function. CppReference gives an example using a map:

    std::map<int, std::string> m;

We can then add a value like this:

    m.emplace(std::piecewise_construct,
              std::forward_as_tuple(10),
              std::forward_as_tuple(20, 'a'));

How we ended up needing this, I can only imagine. Perhaps someone will write in and tell me? Seriously, if you do fall across something in C++, or any language, you hadn’t spotted before, write a page for us and send it my way. Let’s help each other learn. There will be motivating examples and reasons behind the piecewise construct and forward as tuple. I just haven’t followed this up, because my tab count has now hit 68. I could wander over to the bookcase and look it up in a book instead, but then I definitely wouldn’t get an editorial written.

Talking of obscure parts of C++, I have been reviewing a manuscript for a potential book, and noticed a sidebar claiming C++23 added the new keyword really. My first instinct was, oh no, yet another thing I didn’t notice. The writer had not explained what it did or why it was introduced, so like a sucker I opened yet another tab or three, and went hunting. I did find a blog post [D’Angelo22] which has the subtitle ‘A blog for April Fool Day’, which explains a function taking an int, say f(int x), can be called with a double, so the new keyword would allow us to say f(really int x). As for the manuscript I am reviewing, I am tempted to add a link to the xkcd Wikipedian Protestor holding a banner saying “[Citation needed]” [xkcd]. Writers do get things wrong, but hopefully our Overload review team spot any such inexactitudes. Do let us know if we missed anything though.

Forming an intuition takes time and sometimes helps us to form correct instincts, though we all get things wrong from time to time. Again, the counterintuitive results in mathematics, or any discipline, often lead to novel approaches and concepts. This is a good thing. Furthermore, if you get to a point where you think you are so good at something you could do it with your eyes shut, you often get a wake-up call. Again, this is a good thing, because it should encourage you to up your game and keep learning. Hopefully you won’t turn into Victor Meldrew, moaning and complaining, while muttering “I don’t believe it” instead. The unfamiliar is an opportunity. I recall a discussion about Duff’s device [Wikipedia-2] when I had been programming for a living for a year or so and thought I knew it all. This stopped me in my tracks. I still have to concentrate on how the loop unrolling works and what is going on. It’s weird, confusing and kinda beautiful all at once. I suspect most programmers enjoy slightly surprising edge cases and unusual ways to do things, because we enjoy thinking and learning.

What have we learnt? Citations are a good thing, because at least they may stop you falling for an April Fools’ joke. Some things are unbelievable because they are incorrect and based on false assumptions. Other things are unbelievable because we just discovered a whole new approach. Let’s check our results from time to time, and try to avoid resting on our laurels. Surprises can be annoying, but they can be wonderful too. And, 64 tabs, in case you wondered.

References
[Carpenter23] Kevin Carpenter, ‘Tooling Intuition’, presented at Meeting C++ 2023, available at https://www.youtube.com/watch?v=mmdoDfw9tIk
[CppOnline] https://cpponline.uk/
[CppRef-1] CppReference: std::piecewise_construct, https://en.cppreference.com/w/cpp/utility/piecewise_construct
[CppRef-2] CppReference: std::forward_as_tuple, https://en.cppreference.com/w/cpp/utility/tuple/forward_as_tuple
[D’Angelo22] Giuseppe D’Angelo, ‘C++23 will be really awesome’, available at https://www.kdab.com/cpp23-will-be-really-awesome/
[EMF] ‘You’re unbelievable’ performed by EMF: https://www.youtube.com/watch?v=g4gU74gMbp0
[Higgins24a] Jez Higgins, March 2024, https://mastodon.me.uk/@jezhiggins/112039275413895974
[Higgins24b] Jez Higgins, ‘To see a world in a grain of sand’, blog post published 24 February 2024 at https://www.jezuk.co.uk/blog/2024/02/to-see-a-world-in-a-grain-of-sand.html
[IMDB] Victor Meldrew, character from One Foot in the Grave: https://www.imdb.com/title/tt0098882/characters/nm0934014
[NorDev] Norfolk Developers Conference: https://nordevcon.com/
[Numberphile] Tony Feng ‘The Return of -1/12’, uploaded February 2024, available at https://www.youtube.com/watch?v=FmLIGN8ZGdw
[Orr14] Roger Orr, ‘Order Notation in Practice’ in Overload 124, December 2014, https://accu.org/journals/overload/22/124/orr_2043/
[vanLaenen12] Filip van Laenen ‘Mutation Testing’ in Overload 108, April 2012, https://accu.org/journals/overload/20/108/overload108.pdf#page=17
[Wikipedia-1] Analytic continuation: https://en.wikipedia.org/wiki/Analytic_continuation
[Wikipedia-2] Duff’s device: https://en.wikipedia.org/wiki/Duff%27s_device
[xkcd] https://xkcd.com/285/
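As a footnote to the std::piecewise_construct question raised in the editorial, here is a minimal sketch of one motivating case. It reuses the CppReference map example quoted there; the function name make_piecewise_map is hypothetical and exists only so the snippet can be called. The point it tries to show is that the tag lets emplace route one group of arguments to the key and another to the value, so the string is built in place rather than constructed first and moved.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <tuple>
#include <utility>

// Hypothetical helper for illustration only; not part of any library.
inline std::map<int, std::string> make_piecewise_map() {
    std::map<int, std::string> m;

    // Key is built from (10); value is built in place as std::string(20, 'a'),
    // that is, twenty 'a' characters.
    m.emplace(std::piecewise_construct,
              std::forward_as_tuple(10),        // constructor arguments for the key
              std::forward_as_tuple(20, 'a'));  // constructor arguments for the value

    // Without the tag, m.emplace(10, 20, 'a') would not compile:
    // std::pair<const int, std::string> has no constructor taking (int, int, char),
    // so nothing says which arguments belong to the key and which to the value.

    // This alternative compiles, but creates a temporary std::string first
    // and then moves it into the map.
    m.emplace(11, std::string(3, 'b'));
    return m;
}
```

Calling make_piecewise_map() yields a two-element map. For a cheap-to-move type such as std::string the difference is small; the piecewise form matters more when the mapped type is expensive or impossible to move.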

April 2024 | Overload | 3


Feature Herb Sutter

C++ Safety, In Context


The safety of C++ has become a hot topic recently.
Herb Sutter discusses the language’s current problems
and potential solutions.

We must make our software infrastructure more secure against the rise in cyberattacks (such as on power grids, hospitals, and banks), and safer against accidental failures with the increased use of software in life-critical systems (such as autonomous vehicles and autonomous weapons).

The past two years in particular have seen extra attention on programming language safety as a way to help build more-secure and -safe software; on the real benefits of memory-safe languages (MSLs); and that C and C++ language safety needs to improve – I agree.

But there have been misconceptions, too, including focusing too narrowly on programming language safety as our industry’s primary security and safety problem – it isn’t. Many of the most damaging recent security breaches happened to code written in MSLs (e.g., Log4j [CISA-1]) or had nothing to do with programming languages (e.g., Kubernetes Secrets stored on public GitHub repos [Kadkoda23]).

In that context, I’ll focus on C++ and try to:
• highlight what needs attention (what C++’s problem is), and how we can get there by building on solutions already underway;
• address some common misconceptions (what C++’s problem isn’t), including practical considerations of MSLs; and
• leave a call to action for programmers using all languages.

tl;dr: I don’t want C++ to limit what I can express efficiently. I just want C++ to let me enforce our already-well-known safety rules and best practices by default, and make me opt out explicitly if that’s what I want. Then I can still use fully modern C++… just nicer.

Let’s dig in.

Some background

Scope. To talk about C++’s current safety problems and solutions well, I need to include the context of the broad landscape of security and safety threats facing all software. I chair the ISO C++ standards committee and I work for Microsoft, but these are my personal opinions and I hope they will invite more dialog across programming language and security communities.

Acknowledgments. Many thanks to people from the C, C++, C#, Python, Rust, MITRE, and other language and security communities whose feedback on drafts of this material has been invaluable, including: Jean-François Bastien, Joe Bialek, Andrew Lilley Brinker, Jonathan Caves, Gabriel Dos Reis, Daniel Frampton, Tanveer Gani, Daniel Griffing, Russell Hadley, Mark Hall, Tom Honermann, Michael Howard, Marian Luparu, Ulzii Luvsanbat, Rico Mariani, Chris McKinsey, Bogdan Mihalcea, Roger Orr, Robert Seacord, Bjarne Stroustrup, Mads Torgersen, Guido van Rossum, Roy Williams, Michael Wong.

Terminology. (See ISO/IEC 23643:2020 [ISO]). Software security (or cybersecurity or similar) means making software able to protect its assets from a malicious attacker. Software safety (or life safety or similar) means making software free from unacceptable risk of causing unintended harm to humans, property, or the environment. Programming language safety means a language’s (including its standard libraries’) static and dynamic guarantees, including but not limited to type and memory safety, which helps us make our software both more secure and more safe. When I say safety unqualified here, I mean programming language safety, which benefits both software security and software safety.

The immediate problem “is”…


The immediate problem is that it’s Too Easy By Default™ to write
security and safety vulnerabilities in C++ that would have been caught by
stricter enforcement of known rules for type, bounds, initialization, and
lifetime language safety

In C++, we need to start with improving these four categories. These
are the main four sources of improvement provided by all the MSLs that
NIST/NSA/CISA/etc. recommend using instead of C++ [CISA-2], so
by definition addressing these four would address the immediate NIST/
NSA/CISA/etc. issues with C++. (More on this under ‘What the problem
“isn’t”…’, section (1) on page 6.)
And in all recent years including 2023 (see Figure 1’s four highlighted
rows – rows 1, 4, 7 and 12 – and Figure 2), these four constitute the bulk
of those oft-quoted 70% of CVEs (Common [Security] Vulnerabilities

Herb Sutter
Herb is a software technologist, working at the intersection of programming language design/UX, people, and high performance code. He is an author, chair of the ISO C++ committee, and a software architect at Microsoft.

Figure 1
4 | Overload | April 2024

Figure 2

and Exposures) [Wikipedia] related to language memory unsafety. (However, that “70% of language memory unsafety CVEs” is misleading; for example, in Figure 1, most of MITRE’s 2023 “most dangerous weaknesses” [MITRE-1] did not involve language safety and so are outside that denominator. More on this under ‘What the problem “isn’t”…’, section (3) on page 7.)

The C++ guidance literature already broadly agrees on safety rules in those categories. It’s true that there is some conflicting guidance literature, particularly in environments that ban exceptions or run-time type support and so use some alternative rules. But there is consensus on core safety rules, such as banning unsafe casts, uninitialized variables, and out-of-bounds accesses (see ‘Appendix’, starting on page 9).

C++ should provide a way to enforce them by default, and require explicit opt-out where needed. We can and do write ‘good’ code and secure applications in C++. But it’s easy even for experienced C++ developers to accidentally write ‘bad’ code and security vulnerabilities that C++ silently accepts, and that would be rejected as safety violations in other languages. We need the standard language to help more by enforcing the known best practices rather than relying on additional nonstandard tools to recommend them.

These are not the only four aspects of language safety we should address. They are just the immediate ones, a set of clear low-hanging fruit where there is both a clear need and clear way to improve (see ‘Appendix’, starting on page 9).

Note: And safety categories are of course interrelated. For example, full type safety (that an accessed object is a valid object of its type) requires eliminating out-of-bounds accesses to unallocated objects. But, conversely, full bounds safety (that accessed memory is inside allocated bounds) similarly requires eliminating type-unsafe downcasts to larger derived-type objects that would appear to extend beyond the actual allocation.

Software safety is also important. Cyberattacks are urgent, so it’s natural that recent discussions have focused more on security and CVEs first. But as we specify and evolve default language safety rules, we must also include our stakeholders who care deeply about functional safety issues that are not reflected in the major CVE buckets but are just as harmful to life and property when left in code. Programming language safety helps both software security and software safety, and we should start somewhere, so let’s start (but not end) with the known pain points of security CVEs.

In those four buckets, a 10–50× improvement (90–98% reduction) is sufficient

If there were 90–98% fewer C++ type/bounds/initialization/lifetime vulnerabilities we wouldn’t be having this discussion. All languages have CVEs, C++ just has more (and C still more); so far in 2024, Rust has 6 CVEs [Rust-1], and C and C++ combined have 61 CVEs [C/C++]. So zero isn’t the goal; something like a 90% reduction is necessary, and a 98% reduction is sufficient, to achieve security parity with the levels of language safety provided by MSLs… and has the strong benefit that I believe it can be achieved with perfect backward link compatibility (i.e., without changing C++’s object model, and its lifetime model which does not depend on universal tracing garbage collection and is not limited to tree-based data structures) which is essential to our being able to adopt the improvements in existing C++ projects as easily as we can adopt other new editions of C++. After that, we can pursue additional improvements to other buckets, such as thread safety and overflow safety.

Aiming for 100%, or zero CVEs in those four buckets, would be a mistake:
• 100% is not necessary because none of the MSLs we’re being told to use instead are there either. More on this under ‘What the problem “isn’t”…’, section (2) on page 7.
• 100% is not sufficient because many cyberattacks exploit security weaknesses other than memory safety.

And getting that last 2% would be too costly, because it would require giving up on link compatibility and seamless interoperability (or ‘interop’) with today’s C++ code. For example, Rust’s object model and borrow checker deliver great guarantees, but require fundamental incompatibility with C++ and so make interop hard beyond the usual C interop level. One reason is that Rust’s safe language pointers are limited to expressing tree-shaped data structures that have no cycles; that unique ownership is essential to having great language-enforced aliasing guarantees, but it also requires programmers to use ‘something else’ for anything more complex than a tree (e.g., using Rc, or using integer indexes as ersatz
April 2024 | Overload | 5
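The Note on this page links type safety and bounds safety through downcasts to larger derived-type objects. The following is a minimal sketch of that situation, not code from the article: the types Base and Derived and the helper checked_downcast are hypothetical names chosen for illustration. It shows that a dynamic_cast downcast reports a type mismatch, whereas the equivalent static_cast would compile and hand back a pointer whose extra members lie beyond the actual allocation.

```cpp
#include <cassert>

struct Base {
    virtual ~Base() = default;  // polymorphic, so dynamic_cast can check the dynamic type
    int id = 1;
};

struct Derived : Base {
    int extra[64] = {};  // makes Derived larger than Base
};

// Hypothetical helper: a checked downcast that yields nullptr when p does not
// actually point at a Derived object.
inline Derived* checked_downcast(Base* p) {
    return dynamic_cast<Derived*>(p);
}

// For contrast (deliberately not executed): static_cast<Derived*>(p) compiles for
// any Base*. If p points at a plain Base, touching ->extra through the result
// would access memory beyond the Base allocation, which is undefined behavior.
```

With a plain Base object, checked_downcast returns nullptr; with a Derived object it returns the object's own address, so the caller can branch on the result instead of reading out of bounds.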

pointers); it’s not just about linked lists [Rust-2] but those are a simple well-known illustrative example.

If we can get a 98% improvement and still have fully compatible interop with existing C++, that would be a holy grail worth serious investment.

A 98% reduction

A 98% reduction across those four categories is achievable in new/updated C++ code, and partially in existing code

Since at least 2014, Bjarne Stroustrup has advocated addressing safety in C++ via a ‘subset of a superset’: That is, first ‘superset’ to add essential items not available in C++14, then ‘subset’ to exclude the unsafe constructs that now all have replacements.

As of C++20, I believe we have achieved the ‘superset’, notably by standardizing span, string_view, concepts, and bounds-aware ranges. We may still want a handful more features, such as a null-terminated zstring_view, but the major additions already exist.

Now we should ‘subset’: Enable C++ programmers to enforce best practices around type and memory safety, by default, in new code and code they can update to conform to the subset. Enabling safety rules by default would not limit the language’s power but would require explicit opt-outs for non-standard practices, thereby reducing inadvertent risks. And it could be evolved over time, which is important because C++ is a living language and adversaries will keep changing their attacks.

ISO C++ evolution is already pursuing Safety Profiles for C++ [Stroustrup23]. The suggestions in the Appendix are refinements to that, to demonstrate specific enforcements and to try to maximize their adoptability and useful impact. For example, everyone agrees that many safety bugs will require code changes to fix. However, how many safety bugs could be fixed without manual source code changes, so that just recompiling existing code with safety profiles enabled delivers some safety benefits? For example, we could by default inject a call-site bounds check 0 <= b < a.size() on every subscript expression a[b] when a.size() exists and a is a contiguous container, without requiring any source code changes and without upgrading to a new internally bounds-checked container library; that checking would Just Work out of the box with every contiguous C++ standard container, span, string_view, and third-party custom container with no library updates needed (including therefore also no concern about ABI breakage).

Rules like those summarized in the Appendix would have prevented (at compile time, test time or run time) most of the past CVEs I’ve reviewed in the type, bounds, and initialization categories, and would have prevented many of the lifetime CVEs. I estimate a roughly 98% reduction in those categories is achievable in a well-defined and standardized way for C++ to enable safety rules by default while still retaining perfect backward link compatibility. See the Appendix on page 9 for a more detailed description.

We can and should emphasize adoptability and benefit also for C++ code that cannot easily be changed. Any code change to conform to safety rules carries a cost; worse, not all code can be easily updated to conform to safety rules (e.g., it’s old and not understood, it belongs to a third party that won’t allow updates, it belongs to a shared project that won’t take upstream changes and can’t easily be forked). That’s why above (and in the Appendix) I stress that C++ should seriously try to deliver as many of the safety improvements as practical without requiring manual source code changes, notably by automatically making existing code do the right thing when that is clear (e.g., the bounds checks mentioned above, or emitting static_cast pointer downcasts as effectively dynamic_cast without requiring the code to be changed), and by offering automated fixits that the programmer can choose to apply (e.g., to change the source for static_cast pointer downcasts to actually say dynamic_cast). Even though in many cases a programmer will need to thoughtfully update code to replace inherently unsafe constructs that can’t be automatically fixed, I believe for some percentage of cases we can deliver safety improvements by just recompiling existing code in the safety-rules-by-default mode, and we should try because it’s essential to maximizing safety profiles’ adoptability and impact.

What the problem “isn’t”:
Some common misconceptions

(1) The problem “isn’t” defining what we mean by “C++’s most urgent language safety problem.” We know the four kinds of safety that most urgently need to be improved: type, bounds, initialization, and lifetime safety.

We know these four are the low-hanging fruit (see ‘The immediate problem “is”…’ on page 4). It’s true that these are just four of perhaps two dozen kinds of ‘safety’ categories, including ones like safe integer arithmetic. But:
• Most of the others are either much smaller sources of problems, or are primarily important because they contribute to those four main categories. For example, the integer overflows we care most about are indexes and sizes, which fall under bounds safety.
• Most MSLs don’t address making these safe by default either, typically due to the checking cost. But all languages (including C++) usually have libraries and tools to address them. For example, Microsoft ships a SafeInt library for C++ to handle integer overflows [Microsoft-1], which is opt-in. C# has a checked arithmetic language feature [Microsoft-2] to handle integer overflows, which is opt-in. Python’s built-in integers are overflow-safe by default because they automatically expand; however, the popular NumPy fixed-size integer types do not check for overflow by default and require using checked functions, which is opt-in.

Thread safety is obviously important too, and I’m not ignoring it. I’m just pointing out that it is not one of the top target buckets: Most of the MSLs that NIST/NSA/CISA/etc. recommend over C++ (except uniquely Rust, and to a lesser extent Python) address thread safety impact on user data corruption about as well as C++. The main improvement MSLs give is that a program data race will not corrupt the language’s own
virtual machine (whereas, in C++, a data race is currently all-bets-are-off undefined behavior). Some languages do give some additional protection, such as that Python guarantees two racing threads cannot see a torn write of an integer and reduces other possible interleavings because of the global interpreter lock (GIL).

(2) The problem “isn’t” that C++ code is not formally provably safe

Yes, C++ code makes it too easy to write silently-unsafe code by default (see ‘The immediate problem “is”…’ on page 4).

But I’ve seen some people claim we need to require languages to be formally provably safe, and that would be a bridge too far. Much to the chagrin of CS theorists, mainstream commercial programming languages aren’t formally provably safe. Consider some examples:

• None of the widely-used languages we view as MSLs (except uniquely Rust) claim to be thread-safe and race-free by construction, as covered in the previous section. Yet we still call C#, Go, Java, Python, and similar languages “safe”. Therefore, formally guaranteeing thread safety properties can’t be a requirement to be considered a sufficiently safe language.

• That’s because a language’s choice of safety guarantees is a tradeoff: For example, in Rust, safe code uses tree-based dynamic data structures only. This feature lets Rust deliver stronger thread safety guarantees than other safe languages, because it can more easily reason about and control aliasing. However, this same feature also requires Rust programs to use unsafe code more often to represent common data structures that do not require unsafe code to represent in other MSLs such as C# or Java, and so 30% to 50% of Rust crates use unsafe code [Wang22], compared for example to 25% of Java libraries [Mastrangelo15].

• C#, Java, and other MSLs still have use-before-initialized and use-after-destroyed type safety problems too: They guarantee not accessing memory outside its allocated lifetime, but object lifetime is a subset of memory lifetime (objects are constructed after, and destroyed/disposed before, the raw memory is allocated and deallocated; before construction and after dispose, the memory is allocated but contains “raw bits” that likely don’t represent a valid object of its type). If you doubt, please run (don’t walk) and ask ChatGPT about Java and C# problems with: access-unconstructed-object bugs (e.g., in those languages, any virtual call in a constructor is “deep” and executes in a derived object before the derived object’s state is initialized); use-after-dispose bugs; “resurrection” bugs; and why those languages tell people never to use their finalizers. Yet these are great languages and we rightly consider them safe languages. Therefore, formally guaranteeing no-use-before-initialized and no-use-after-dispose can’t be a requirement to be considered a sufficiently safe language.

• Rust, Go, and other languages support sanitizers too [Rust-3], including ThreadSanitizer and undefined behavior sanitizers [Rust-4], and related tools like fuzzers. Sanitizers are known to be still needed as a complement to language safety, and not only for when programmers use ‘unsafe’ code; furthermore, they go beyond finding memory safety issues. The uses of Rust at scale that I know of also enforce use of sanitizers. So using sanitizers can’t be an indicator that a language is unsafe – we should use the supported sanitizers for code written in any language.

Note: “Use your sanitizers” does not mean to use all of them all the time. Some sanitizers conflict with each other, so you can only use those one at a time. Some sanitizers are expensive, so they should only be run periodically. Some sanitizers should not be run in production, including because their presence can create new security vulnerabilities.

(3) The problem “isn’t” that moving the world’s C and C++ code to memory-safe languages (MSLs) would eliminate 70% of security vulnerabilities

MSLs are wonderful! They just aren’t a silver bullet.

An oft-quoted number [Gaynor20] is that “70%” of programming language-caused CVEs (reported security vulnerabilities) in C and C++ code are due to language safety problems. That number is true and repeatable, but has been badly misinterpreted in the press: No security expert I know believes that if we could wave a magic wand and instantly transform all the world’s code to MSLs, that we’d have 70% fewer CVEs, data breaches, and ransomware attacks. (For example, see this February 2024 example analysis paper [Hanley24].)

Consider some reasons.

• That 70% is of the subset of security CVEs that can be addressed by programming language safety. See figure 1 again: Most of 2023’s top 10 “most dangerous software weaknesses” were not related to memory safety. Many of 2023’s largest data breaches and other cyberattacks and cybercrime had nothing to do with programming languages at all. In 2023, attackers reduced their use of malware because software is getting hardened and endpoint protection is effective (CRN) [Alspach23], and attackers go after the slowest animal in the herd. Most of the issues listed in NISTIR-8397 [Black21] affect all languages equally, as they go beyond memory safety (e.g., Log4j [CISA-1]) or even programming languages (e.g., automated testing, hardcoded secrets, enabling OS protections, string/SQL injections, software bills of materials). For more detail, see the Microsoft response to NISTIR-8397 [Microsoft-3], for which I was the editor. (More on this in the ‘Call to Action’, below.)

• MSLs get CVEs too, though definitely fewer (again, e.g., Log4j). For example, see MITRE list of Rust CVEs, including six so far in 2024 [MITRE-2]. And all programs use unsafe code; for example, see the ‘Conclusions’ section of Firouzi et al.’s study of uses of C#’s unsafe on StackOverflow [Firouzi20] and prevalence of vulnerabilities, and that all programs eventually call trusted native libraries or operating system code.
• Saying the quiet part out loud: CVEs are known to be an imprecise metric. We use it because it’s the metric we have, at least for security vulnerabilities, but we should use it with care. This may surprise you, as it did me, because we hear a lot about CVEs. But whenever I’ve suggested improvements for C++ and measuring “success” via a reduction in CVEs (including in this essay), security experts insist to me that CVEs aren’t a great metric to use… including the same experts who had previously quoted the 70% CVE number to me. Reasons why CVEs aren’t a great metric include that CVEs are self-reported and often self-selected, and not all are equally exploitable; but there can be pressure to report a bug as a vulnerability even if there’s no reasonable exploit because of the benefits of getting one’s name on a CVE. In August 2023, the Python Software Foundation became a CVE Numbering Authority (CNA) for Python and pip distributions [MITRE-3], and now has more control over Python and pip CVEs. The C++ community has not done so.

• CVEs target only software security vulnerabilities (cyberattacks and intrusions), and we also need to consider software safety (life-critical systems and unintended harm to humans).

(4) The problem “isn’t” that C++ programmers aren’t trying hard enough/using the existing tools well enough. The challenge is making it easier to enable them.

Today, the mitigations and tools we do have for C++ code are an uneven mix, and all are off-by-default:

• Kind. They are a mix of static tools, dynamic tools, compiler switches, libraries, and language features.

• Acquisition. They are acquired in a mix of ways: in-the-box in the C++ compiler, optional downloads, third-party products, and some you need to google around to discover.

• Accuracy. Existing rulesets mix rules with low and high false positives. The latter are effectively unadoptable by programmers, and their presence makes it difficult to ‘just adopt this whole set of rules’.

• Determinism. Some rules, such as ones that rely on interprocedural analysis of full call trees, are inherently nondeterministic (because an implementation gives up when fully evaluating a case exceeds the space and time available; a.k.a. ‘best effort’ analysis). This means that two implementations of the identical rule can give different answers for identical code (and therefore nondeterministic rules are also not portable, see below).

• Efficiency. Existing rulesets mix rules with low and high (and sometimes impossible) cost to diagnose. The rules that are not efficient enough to implement in the compiler will always be relegated to optional standalone tools.

• Portability. Not all rules are supported by all vendors. ‘Conforms to ISO/IEC 14882 (Standard C++)’ is the only thing every C++ tool vendor supports portably.

To address all these points, I think we need the C++ standard to specify a mode of well-agreed and low-or-zero-false-positive deterministic rules that are sufficiently low-cost to implement in-the-box at build time.

Call(s) to action

As an industry generally, we must make a major improvement in programming language memory safety – and we will.
In C++ specifically, we should first target the four key safety categories that are our perennial empirical attack points (type, bounds, initialization, and lifetime safety), and drive vulnerabilities in these four areas down to the noise for new/updated C++ code – and we can.

But we must also recognize that programming language safety is not a silver bullet to achieve cybersecurity and software safety. It’s one battle (not even the biggest) in a long war: Whenever we harden one part of our systems and make that more expensive to attack, attackers always switch to the next slowest animal in the herd. Many of 2023’s worst data breaches did not involve malware, but were caused by inadequately stored credentials (e.g., Kubernetes Secrets on public GitHub repos [Kadkoda23]), misconfigured servers (e.g., DarkBeam [Okunytė23a], Kid Security [Okunytė23b]), lack of testing, supply chain vulnerabilities, social engineering, and other problems that are independent of programming languages. Apple’s white paper about 2023’s rise in cybercrime emphasizes improving the handling, not of program code, but of the data [Madnick23]:

“it’s imperative that organizations consider limiting the amount of personal data they store in readable format while making a greater effort to protect the sensitive consumer data that they do store [including by using] end-to-end [E2E] encryption.”

No matter what programming language we use, security hygiene is essential:

• Do use your language’s static analyzers and sanitizers. Never pretend using static analyzers and sanitizers is unnecessary “because I’m using a safe language.” If you’re using C++, Go, or Rust, then use those languages’ supported analyzers and sanitizers. If you’re a manager, don’t allow your product to be shipped without using these tools. (Again: This doesn’t mean running all sanitizers all the time; some sanitizers conflict and so can’t be used at the same time, some are expensive and so should be used periodically, and some should be run only in testing and never in production including because their presence can create new security vulnerabilities.)

• Do keep all your tools updated. Regular patching is not just for iOS and Windows, but also for your compilers, libraries, and IDEs.

• Do secure your software supply chain. Do use package management for library dependencies. Do track a software bill of materials for your projects.

• Don’t store secrets in code. (Or, for goodness’ sake, on GitHub!)

• Do configure your servers correctly, especially public Internet-facing ones. (Turn authentication on! Change the default password!)

• Do keep non-public data encrypted, both when at rest (on disk) and when in motion (ideally E2E… and oppose proposed legislation that tries to neuter E2E encryption with ‘backdoors only good guys will use’ because there’s no such thing).

• Do keep investing long-term in keeping your threat modeling current, so that you can stay adaptive as your adversaries keep trying different attack methods.

We need to improve software security and software safety across the industry, especially by improving programming language safety in C and C++, and in C++ a 98% improvement in the four most common problem areas is achievable in the medium term. But if we focus on programming language safety alone, we may find ourselves fighting yesterday’s war and missing larger past and future security dangers that affect software written in any language.

Sadly, there are too many bad actors. For the foreseeable future, our software and data will continue to be under attack, written in any language and stored anywhere. But we can defend our programs and systems, and we will.

Be well, and may we all keep working to have a safer and more secure 2024.

Appendix: Illustrating why a 98% reduction is feasible

This Appendix exists to support why I think a 98% reduction in type/bounds/initialization/lifetime CVEs in C++ code is believable. This is not a formal proposal, but an overview of concrete ways to achieve such an improvement in new and updatable code, and ways to even get some fraction of that improvement in existing code we cannot update but can recompile. These notes are aligned with the proposals currently being pursued in the ISO C++ safety subgroup, and if they pan out as I expect in ongoing discussions and experiments, then I intend to write further details about them in a future paper.

There are runtime and code size overheads to some of the suggestions in all four buckets, notably checking bounds and casts. But there is no reason to think those overheads need to be inherently worse in C++ than other languages, and we can make them on by default and still provide a way to opt out to regain full performance where needed.

Note: For example, bounds checking can cause a major impact on some hot loops, when using a compiler whose optimizer does not hoist bounds checks; not only can the loops incur redundant checking, but they also may miss other optimizations, such as vectorization. This is why making bounds-checking on by default is good, but all performance-oriented languages also need to provide a way to say “trust me” and explicitly opt out of bounds checking tactically where needed.

This appendix refers to the ‘profiles’ in the C++ Core Guidelines safety profiles [CPP], a set of about two dozen enforceable rules for type and memory safety of which I am a co-author. I refer to them only as examples, to show ‘what’ already-known rules exist that we can enforce, to support that my claimed improvement is possible. They are broadly consistent with rules in other sources, such as: The C++ Programming Language’s advice on type safety [Stroustrup13]; C++ Coding Standards’ section on

type safety [Sutter04]; the Joint Strike Fighter Coding Standards [LM05]; High Integrity C++ [Perforce13]; the C++ Core Guidelines section on safety profiles (a small enforceable set of safety rules) [CPP-1]; and the recently-released MISRA C++:2023 [MISRA].

The best way for ‘how’ to let the programmer control enabling those rules (e.g., via source code annotations, compiler switches, and/or something else) is an orthogonal UX issue that is now being actively discussed in the C++ standards committee and community.

Type safety

Enforce the Pro.Type safety profile by default [CPP-2]. That includes either banning or checking all unsafe casts and conversions (e.g., static_cast pointer downcasts, reinterpret_cast), including implicit unsafe type punning via C union and vararg.

However, these rules haven’t yet been systematically enforced in the industry. For example, in recent years I’ve painfully observed a significant set of type safety-caused security vulnerabilities whose root cause was that code used static_cast instead of dynamic_cast for pointer downcasts, and ‘C++’ gets blamed even when the actual problem was failure to follow the well-publicized guidance to use the language’s existing safe recommended feature. It’s time for a standardized C++ mode that enforces these rules by default.

Note: On some platforms and for some applications, dynamic_cast has problematic space and time overheads that hinder its use. Many implementations bundle dynamic_cast indivisibly with all C++ run-time typing (RTTI) features (e.g., typeid), and so require storing full potentially-heavyweight RTTI data even though dynamic_cast needs only a small subset. Some implementations also use needlessly inefficient algorithms for dynamic_cast itself. So the standard must encourage (and, if possible, enforce for conformance, such as by setting algorithmic complexity requirements) that dynamic_cast implementations be more efficient and decoupled from other RTTI overheads, so that programmers do not have a legitimate performance reason not to use the safe feature. That decoupling could require an ABI break; if that is unacceptable, the standard must provide an alternative lightweight facility such as a fast_dynamic_cast that is separate from (other) RTTI and performs the dynamic cast with minimum space and time cost.

Bounds safety

Enforce the Pro.Bounds safety profile [CPP-3] by default, and guarantee bounds checking. We should additionally guarantee that:

• Pointer arithmetic is banned (use std::span instead); this enforces that a pointer refers to a single object. Array-to-pointer decay, if allowed, will point to only the first object in the array.

• Only bounds-checked iterator arithmetic is allowed (also, prefer ranges instead).

• All subscript operations are bounds-checked at the call site, by having the compiler inject an automatic subscript bounds check on every expression of the form a[b], where a is a contiguous sequence with a size/ssize function and b is an integral index. When a violation happens, the action taken can be customized using a global bounds violation handler; some programs will want to terminate (the default), others will want to log-and-continue, throw an exception, or integrate with a project-specific critical fault infrastructure.

Importantly, the latter explicitly avoids implementing bounds-checking intrusively for each individual container/range/view type. Implementing bounds-checking non-intrusively and automatically at the call site makes full bounds checking available for every existing standard and user-written container/range/view type out of the box: Every subscript into a vector, span, deque, or similar existing type in third-party and company-internal libraries would be usable in checked mode without any need for a library upgrade.

It’s important to add automatic call-site checking now before libraries continue adding more subscript bounds checking in each library, so that we avoid duplicating checks at the call site and in the callee. As a counterexample, C# took many years to get rid of duplicate caller-and-callee checking, but succeeded and .NET Core addresses this better now; we can avoid most of that duplicate-check-elimination optimization work by offering automatic call-site checking sooner.

Language constructs like the range-for loop are already safe by construction and need no checks.

In cases where bounds checking incurs a performance impact, code can still explicitly opt out of the bounds check in just those paths to retain full performance and still have full bounds checking in the rest of the application.

Initialization safety

Enforce initialization-before-use by default. That’s pretty easy to statically guarantee, except for some cases of the unused parts of lazily constructed array/vector storage. Two simple alternatives we could enforce are (either is sufficient):

• Initialize-at-declaration as required by Pro.Type [CPP-2] and ES.20 [CPP-4]; and possibly zero-initialize data by default as currently proposed in P2723 [Bastien23]. These two are good but with some drawbacks; both have some performance costs for cases that require ‘dummy’ writes that are never used but hard for optimizers to eliminate, and the latter has some correctness costs because it ‘fixes’ some uninitialized cases where zero is a valid value but masks others for which zero is not a valid initializer and so the behavior is still wrong, but because a zero has been jammed in it’s harder for sanitizers to detect.

• Guaranteed initialization-before-use, similar to what Ada and C# successfully do. This is still simple to use, but can be more efficient because it avoids the need for artificial ‘dummy’ writes, and can be more flexible because it allows alternative constructors to be used

for the same object on different paths. For details, see: example diagnostic; definite-first-use rules [Sutter22].

Lifetime safety

Enforce the Pro.Lifetime safety profile [CPP-5] by default, ban manual allocation by default, and guarantee null checking. The Lifetime profile is a static analysis that diagnoses many common sources of dangling and use-after-free, including for iterators and views (not just raw pointers and references), in a way that is efficient enough to run during compilation. It can be used as a basis to iterate on and further improve. And we should additionally guarantee that:

• All manual memory management is banned by default (new, delete, malloc, and free). Corollary: ‘Owning’ raw pointers are banned by default, since they require delete or free. Use RAII instead, such as by calling make_unique or make_shared.

• All dereferences are null-checked. The compiler injects an automatic check on every expression of the form *p or p-> where p can be compared to nullptr to null-check all dereferences at the call site (similar to bounds checks above). When a violation happens, the action taken can be customized using a global null violation handler; some programs will want to terminate (the default), others will want to log-and-continue, throw an exception, or integrate with a project-specific critical fault infrastructure.

Note: The compiler could choose to not emit this check (and not perform optimizations that benefit from the check) when targeting platforms that already trap null dereferences, such as platforms that mark low memory pages as unaddressable. Some C++ features, such as delete, have always done call-site null checking.

Reducing undefined behavior and semantic bugs

Tactically, reduce some undefined behavior (UB) and other semantic bugs (pitfalls), for cases where we can automatically diagnose or even fix well-known antipatterns. Not all UB is bad; any performance-oriented language needs some. But we know there is low-hanging fruit where the programmer’s intent is clear and any UB or pitfall is a definite bug, so we can do one of two things:

(A – Good) Make the pitfall a diagnosed error, with zero false positives – every violation is a real bug. Two examples mentioned above are to automatically check a[b] to be in bounds and *p and p-> to be non-null.

(B – Ideal) Make the code actually do what the programmer intended, with zero false positives – i.e., fix it by just recompiling. An example, discussed at the most recent ISO C++ November 2023 meeting [Wakely23], is to default to an implicit return *this; when the programmer writes an assignment operator for their type C that returns a C& (note: the same type), but forgets to write a return statement. Today, that is undefined behavior. Yet it’s clear that the programmer meant return *this; – nothing else can be valid. If we make return *this; be the default, all the existing code that accidentally omits the return is not just ‘no longer UB’, but is guaranteed to do the right and intended thing.

An example of both (A) and (B) is to support chained comparisons [Revzin18], which makes the mathematically valid chains work correctly and rejects the mathematically invalid ones at compile time. Real-world code does write such chains by accident [SO-1] [SO-2] [SO-3] [SO-4] [SO-5] [SO-6] [SO-7] [SO-8] [SO-9] [SO-10].

• For (A): We can reject all mathematically invalid chains like a != b > c at compile time. This automatically diagnoses bugs in existing code that tries to do such nonsense chains, with perfect accuracy.

• For (B): We can fix all existing code that writes would-be-correct chains like 0 <= index < max. Today those silently compile but are completely wrong, and we can make them mean the right thing. This automatically fixes those bugs, just by recompiling the existing code.

These examples are not exhaustive. We should review the list of UB in the standard for a more thorough list of cases we can automatically fix (ideally) or diagnose.

Summarizing: Better defaults for C++

C++ could enable turning safety rules on by default that would make code:

• fully type-safe,

• fully bounds-safe,

• fully initialization-safe,

and for lifetime safety, which is the hardest of the four, and where I would expect the remaining vulnerabilities in these categories would mostly lie:

• fully null-safe,

• fully free of owning raw pointers,

• with lifetime-safety static analysis that diagnoses most common pointer/iterator/view lifetime errors;

and, finally:

• with less undefined behavior including by automatically fixing existing bugs just by recompiling code with safety enabled by default.

All of this is efficiently implementable and has been implemented. Most of the Lifetime rules have been implemented in Visual Studio and CLion, and I’m prototyping a proof-of-concept mode of C++ that includes all of the other above language safeties on-by-default in my cppfront compiler [Sutter], as well as other safety improvements including an implementation of the current proposal for ISO C++ contracts. I haven’t yet used the prototype at scale. However, I can report that the first major change request I received from early users was to change the bounds checking and null checking from opt-in (off by default) to opt-out (on by default).

Note: Please don’t be distracted by the fact that cppfront uses an experimental alternate syntax for C++. That’s because I’m additionally trying to see if we can reach a second orthogonal goal: to make the C++ language itself simpler, and eliminate the need to teach ~90% of the C++ guidance literature related to language complexity and quirks. This essay’s language safety improvements are orthogonal to that, however, and can be applied equally to today’s C++ syntax.

Solutions need to distinguish between (A) ‘solution for new-or-updatable code’ and (B) ‘solution for existing code’

(A) A ‘solution for new-or-updatable code’ means that to help existing code we have to change/rewrite our code. This includes not only ‘(re)write in C#/Rust/Go/Python/…’ but also ‘annotate your code with SAL’ [Microsoft-4] or ‘change your code to use std::span’.

One of the costs of (A) is that anytime we write/change code to fix bugs, we also introduce new bugs; change is never free. We need to recognize that changing our code to use std::span often means non-trivially rewriting parts of it which can also create other bugs. Even annotating our code means writing annotations that can have bugs (this is a common experience in the annotation languages I’ve seen used at scale, such as SAL). All these are significant adoption barriers.

Actually switching to another language means losing a mature ecosystem. C++ is the well-trod path: It’s taught, people know it, the tools exist, interop works, and current regulations have an industry around C++ (such as for functional safety). It takes another decade at least for another language to become the well-trod path, whereas a better C++, and its benefits to the industry broadly, can be here much sooner.

(B) A ‘solution for existing code’ emphasizes the adoptability benefits of not having to make manual code changes. It includes anything that makes existing code more secure with ‘just a recompile’ (i.e., no binary/ABI/link issues; e.g., ASAN, compiler switches to enable stack checks,
static analysis that produces only true positives, or a reliable automated code modernizer).

We will still need (B) no matter how successful new languages or new C++ types/annotations are. And (B) has the strong benefit that it is easier to adopt. Getting to a 98% reduction in CVEs will require both (A) and (B), but if we can deliver even a 30% reduction using just (B) that will be a major benefit for adoption and effective impact in large existing code bases that are hard to change.

Consider how the ideas earlier in this appendix map onto (A) and (B):

| In C++, by default, enforce… | (A) Solution for new/updated code (can require code changes – no link/binary changes) | (B) Solution for existing code (requires recompile only – no manual code changes, no link/binary changes) |
| --- | --- | --- |
| Type safety | Ban all inherently unsafe casts and conversions | Make unsafe casts and conversions with a safe alternative do the safe thing |
| Bounds safety | Ban pointer arithmetic. Ban unchecked iterator arithmetic | Check in-bounds for all allowed iterator arithmetic. Check in-bounds for all subscript operations |
| Initialization safety | Require all variables to be initialized (either at declaration, or before first use) | |
| Lifetime safety | Statically diagnose many common pointer/iterator lifetime error cases | Check not-null for all pointer dereferences |
| Less undefined behavior | Statically diagnose known UB/bug cases, to error on actual bugs in existing code with | Automatically fix known UB/bug cases, to make current bugs in existing code be actually correct with just a |

https://www.cisa.gov/news-events/news/apache-log4j-vulnerability-guidance

[CISA-2] ‘The Case for Memory Safe Roadmaps’, published December 2023 jointly by US, Australian, Canadian, New Zealand and UK cyber security centres/agencies, available at https://media.defense.gov/2023/Dec/06/2003352724/-1/-1/0/THE-CASE-FOR-MEMORY-SAFE-ROADMAPS-TLP-CLEAR.PDF

[CPP-1] Pro: Profiles in C++ Core Guidelines, available at https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#pro-profiles

[CPP-2] Pro.safety: Type-safety profile in C++ Core Guidelines, available at https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#SS-type

[CPP-3] Pro.bounds: Bounds safety profile in C++ Core Guidelines, available at https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#probounds-bounds-safety-profile

[CPP-4] ES.20: Always initialize an object in C++ Core Guidelines, available at https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Res-always

[CPP-5] Pro.lifetime: Lifetime safety profile in C++ Core Guidelines, available at https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#SS-lifetime

[Firouzi20] Ehsan Firouzi, Ashkan Sami, Foutse Khomh and Gias Uddin ‘On the use of C# Unsafe Code Context: An Empirical Study of Stack Overflow’ from the Proceedings of the 14th ACM / IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), available at https://www.researchgate.net/publication/344892072_On_the_use_of_C_Unsafe_Code_Context_An_Empirical_Study_of_Stack_Overflow

[Gaynor20] Alex Gaynor ‘What science can tell us about C and C++’s security’, published 27 May 2020, available at https://alexgaynor.net/2020/may/27/science-on-memory-unsafety-and-security/
[Hanley24] Zach Hanley ‘Rust Won’t Save Us: An Analysis of 2023’s
just a recompile and recompile and zero false
zero false positives: positives: Known Exploited Vulnerabilities’, posted 6 February 2024,
available at https://www.horizon3.ai/attack-research/attack-blogs/
„ Ban mathematically „ Define mathematically
invalid comparison valid comparison chains analysis-of-2023s-known-exploited-vulnerabilities/
chains „ Default return *this; [ISO] ISO/IEC 23643:2020 – ‘Software and systems engineering:
„ (add additional for C assignment Capabilities of software safety and security verification tools’
cases from UB operators that return C& https://www.iso.org/standard/76517.html
Annex review) „ (add additional cases from [Kadkoda23] Yakir Kadkoda and Assaf Morag ‘The Ticking Supply
UB Annex review)
Chain Attach Bomb of Exposed Kubernetes Secrets’, published 21
By prioritizing adoptability, we can get at least some of the safety benefits Nov 2023 on the Aqua Blog, available at https://www.aquasec.com/
just by recompiling existing code, and make the total improvement easier blog/the-ticking-supply-chain-attack-bomb-of-exposed-kubernetes-
to deploy even when code updates are required. I think that makes it a secrets/
valuable strategy to pursue. [LM05] Lockhead Martin: ‘Joint Strike Fighter Air Vehicle C++
Finally, please see again the main article’s conclusion: ‘Call(s) to action’ Coding Standards for the System Development and Demonstration
on page 8. n Program’, published December 2025 and available at
https://www.stroustrup.com/JSF-AV-rules.pdf
References [Madnick23] Stuart Madnick, ‘The Continued Threat to Personal Data:
[Alspach23] Kyle Alspach ‘10 Major Cyberattacks And Data Breaches Key Factors Behind the 2023 Increase’, published by Apple in
In 2023, published 13 December 2023 by CRN at https://www. December 2023 and available at https://www.apple.com/newsroom/
crn.com/news/security/10-major-cyberattacks-and-data-breaches- pdfs/The-Continued-Threat-to-Personal-Data-Key-Factors-Behind-
in-2023 the-2023-Increase.pdf
[Bastien23] JF Bastien, ‘P2723R1: Zero-initialize objects of automatic [Mastrangelo15] Luis Mastrangelo, Luca Pnzanelli, Andrea Mocci,
storage duration’, published 15 January2023, available at https:// Michele Lanza, Matthias Hauswirth and Nathaniel Nystrom
www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2723r1.html ‘Use at your own risk: the Java unsafe API in the wild’ from
[Black21] Paul E. Black, Barbara Guttman and Vadim Okum, the Proceedings of the 2015 ACM SIGPLAN International
‘Guidelines on Minimum Standards for Developer Verificiation Conference on Object-Oriented Programming Systems,
of Software’ (NISTIR 8397) available at https://nvlpubs.nist.gov/ Languages and Applications, available at https://dl.acm.org/doi/
nistpubs/ir/2021/NIST.IR.8397.pdf abs/10.1145/2814270.2814313
[C/C++] C and C++ CVEs: https://cve.mitre.org/cgi-bin/cvekey. [Microsoft-1] SafeInt Library: https://learn.microsoft.com/en-us/cpp/
cgi?keyword=c++ safeint/safeint-library?view=msvc-170
[CISA-1] ‘Apache Log4j Vulnerability Guidance’, published April 2022
by America’s Cyber Defense Agency, April 2022, available at
12 | Overload | April 2024
Herb Sutter Feature
[Microsoft-2] Checked and unchecked statements: [SO-7] ‘Why is if not working in my Magic Square program’, available
https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/ on StackOverflow at https://stackoverflow.com/questions/45385837/
statements/checked-and-unchecked why-is-if-not-working-in-my-magic-square-program
[Microsoft-3] Build reliable and secure C++ programs: [SO-8] ‘Math-like chaining of the comparison operator - as in, “if (
https://learn.microsoft.com/en-us/cpp/code-quality/build-reliable- (5<j<=1) )”’, available on StackOverflow at https://stackoverflow.
secure-programs?view=msvc-170 com/questions/20989496/math-like-chaining-of-the-comparison-
[Microsoft-4] Understanding SAL: https://learn.microsoft.com/en-us/ operator-as-in-if-5j-1
cpp/code-quality/understanding-sal?view=msvc-170 [SO-9] ‘Only Returning the first if statement? (C++)’, available on
[MISRA] MISRA 2023: https://misra.org.uk/misra-cpp2023-released- StackOverflow at https://stackoverflow.com/questions/35564553/
including-hardcopy/ only-returning-the-first-if-statement-c
[MITRE-1] ‘2023 CWE Top 25’ on the Common Weakness Enumeration [SO-10] ‘Warning comparison integer and pointer’, available on
website operated by Mitre, available at: https://cwe.mitre.org/top25/ StackOverflow at https://stackoverflow.com/questions/42335710/
archive/2023/2023_top25_list.html#tableView warning-comparison-integer-and-pointer
[MITRE-2] Rust CVEs, from the CVE website managed by Mitre, [Stroustrup13] Bjarne Stroupstrup (2013) The C++ Programming
available at : https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=rust Language, 4th Edition published by Addison-Wesley Professional in
May 2023. ISBN-13: 978-0275967307
[MITRE-3] CVE: ‘Python Software Foundation Added as CVE
Numbering Authority (CNA)’ published 29 August 2023 at https:// [Stroustrup23] Bjarne Stroustrup and Gabriel Dos Reis, ‘Safety Profiles:
www.cve.org/Media/News/item/news/2023/08/29/Python-Software- Type-and-resource Safe Programming in ISO Standard C++’. The
Foundation-Added-as-CNA slides presented by Bjarne at the February 2023 C++ Standard
Committee meeting, available at: https://open-std.org/JTC1/SC22/
[Okunytė23a] Paulina Okunytė, ‘DarkBeam leaks billions of email and
WG21/docs/papers/2023/p2816r0.pdf
password combinations’, published by Cybernews, last updated
15 November 2023, available at https://cybernews.com/security/ [Sutter] ccpfront compiler, available at https://github.com/hsutter/
darkbeam-data-leak/ cppfront/
[Okunytė23b] Paulina Okunytė, ‘KidSecurity’s user data compromised [Sutter04] Herb Sutter and Andrei Alexandrescu (2004) C++ Coding
after app failed to set password’, published by Cybernews, last Standards: 101 Rules, Guidelines, and Best Practices, published
updated 30 November 2023, available at https://cybernews.com/ by Addison-Wesley Professional in October 2024. ISBN-13: 978-
security/kidsecurity-parental-control-data-leak/ 0321113580
[Perforce13] Perforce, ‘High Integrity C++ Coding Standard’ version [Sutter22] Herb Sutter ‘Can C++ be 10× simpler & safer …?’, a
4.0, released 3 October 2013, available at https://www.perforce.com/ presentation delivered at CppCon 2022, available at
resources/qac/high-integrity-cpp-coding-standard https://www.youtube.com/watch?v=ELeZAKCN4tY&t=4305s
[Revzin18] Barry Revzin and Herb Sutter, ‘P0893R1: Chaining [Wakely23] Jonathan Wakely and Thomas Köppe, ‘P2973R0: Erroneous
comparisons’, published 28 April 2018, available at behaviour for missing return from assignment’ published 15
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/ September 2023, available at https://www.open-std.org/jtc1/sc22/
p0893r1.html wg21/docs/papers/2023/p2973r0.html
[Rust-1] Rust CVEs: https://cve.mitre.org/cgi-bin/cvekey. [Wang22] Jun Wang ‘Unsafe Rust in the Wild’, published on The New
cgi?keyword=rust Stack on 29 September 2022, available at: https://thenewstack.io/
unsafe-rust-in-the-wild/
[Rust-2] ‘Learn Rust with Entirely Too Many Linked Lists’, available at
https://rust-unofficial.github.io/too-many-lists/ [Wikipedia] ‘Common Vulnerabilities and Exposures’, available at
https://en.wikipedia.org/wiki/Common_Vulnerabilities_and_
[Rust-3] ‘Sanitizers Support’ in the Rust Compiler Development Guide,
Exposures
available at https://rustc-dev-guide.rust-lang.org/sanitizers.html
[Rust-4] Undefined behavior sanitizers: https://github.com/rust-lang/miri
[SO-1] ‘Is (4 > y > 1) a valid statement in C++? How do you evaluate This article was first published on Herb Sutter’s blog (Sutter’s
it if so?’, available on StackOverflow at https://stackoverflow.com/ Mill) on 11th March 2023: https://herbsutter.com/2024/03/11/
questions/8889522/is-4-y-1-a-valid-statement-in-c-how-do-you- safety-in-context/
evaluate-it-if-so
[SO-2] ‘Chaining Bool values give opposite result to expected’,
available on StackOverflow at https://stackoverflow.com/
questions/5939077/chaining-bool-values-give-opposite-result-to-
expected
[SO-3] ‘Checking if a value is within a range in if statment’, available
on StackOverflow at https://stackoverflow.com/questions/14433884/
checking-if-a-value-is-within-a-range-in-if-statment
[SO-4] ‘Test if all elements are equal with C++17 fold-expression’,
available on StackOverflow at https://stackoverflow.com/
questions/46806239/test-if-all-elements-are-equal-with-c17-fold-
expression
[SO-5] ‘Incorrect logic in C++’, available on StackOverflow at
https://stackoverflow.com/questions/25965157/incorrect-logic-in-c
[SO-6] ‘Is (val1 > val2 > val3) a valid comparison in C?’, available on
StackOverflow at https://stackoverflow.com/questions/38643022/is-
val1-val2-val3-a-valid-comparison-in-c



Feature Jez Higgins

To See a World in a Grain of Sand


Code often rots over time as various people add new
features. Jez Higgins shows how to refactor code that has
grown organically, making it clearer and more concise.

Jez Higgins lives on the Pembrokeshire coast, largely to make return-to-office mandates impractical. Truth is, he hasn’t worked in an office for nearly 25 years, and has no intention of starting now. He’s been programming for a living that whole time and thinks he might be starting to get the hang of it. He can be contacted at [email protected] or @[email protected]

In a recent blog post [Higgins24] about my sadness and disappointment about the candidates we were getting for interview, I talked about the refactoring exercise we give people, and the conversations we have afterwards.

I’m not able to show any of that code, but I am going to talk about some code here of the type we often see. According to the version history, it’s passed through a number of hands, and I want to be clear I know none of the people involved nor have I spoken to them. They are, though, exactly the type of person presenting themselves for interview, and so for my purposes here they are exemplars.

Here’s some Python code. It’s from a larger document processing pipeline. Documents come shoved into the system, get squished around a bit, have metadata added, some formatting fixups, then squirt out the other end as nice looking pdfs. Standard stuff.

    def canonicalise_reference(reference_type, reference_match, canonical_form):
        if (
            (reference_type == "RefYearAbbrNum")
            | (reference_type == "RefYearAbbrNumTeam")
            | (reference_type == "YearAbbrNum")
        ):
            components = re.findall(r"\d+", reference_match)
            year = components[0]
            d1 = components[1]
            d2 = ""
            corrected_reference = canonical_form.replace("dddd", year).replace("d+", d1)
        elif (
            (reference_type == "RefYearAbbrNumNumTeam")
            | (reference_type == "RefYearAbrrNumStrokeNum")
            | (reference_type == "RefYearNumAbbrNum")
        ):
            components = re.findall(r"\d+", reference_match)
            year = components[0]
            d1 = components[1]
            d2 = components[2]
            corrected_reference = (
                canonical_form.replace("dddd", year).replace("d1", d1).replace("d2", d2)
            )
        elif (
            (reference_type == "AbbrNumAbbrNum")
            | (reference_type == "NumAbbrNum")
            | (reference_type == "EuroRefC")
            | (reference_type == "EuroRefT")
        ):
            components = re.findall(r"\d+", reference_match)
            year = ""
            d1 = components[0]
            d2 = components[1]
            corrected_reference = canonical_form.replace("d1", d1).replace("d2", d2)

        return corrected_reference, year, d1, d2

Listing 1

This is not about them, though. I hold them blameless, and wish them only happiness. This is about the places that they worked, about the wider trade, about a culture that says this is fine.

To see a world in a grain of sand

Documents can have references to other documents, both within the existing corpus, and to a variety of external sources. These references have standard forms, and when we find something that looks like a document reference, we do a bit of work to make sure it’s absolutely clean and proper.

That’s where the function in Listing 1, canonicalise_reference, comes in. I have obfuscated identifiers in the code sample, but its structure and behaviour are as I found it.

I’d been kind-of browsing around a git repository, looking at folder structure, getting the general picture. A chunk of the system is a Django webapp and thus has that shape, so I went digging for a bit of the meat underneath. This was almost the first thing I saw and, well, I kind of flinched. Poking around some more confirmed it’s not an anomaly. It is representative of the system.

You’ve probably had some kind of reaction of your own. This is what immediately leapt out at me:

• The length
• The width! (As this is a printed publication, in most listings the very wide lines are wrapped. Listing 1 is presented full-width, as is Listing 6.)
• The visual repetition, both in the if conditions and in the bodies of the conditionals
• The string literals
• The string literal with the spelling mistake
• The extraneous brackets in the second conditional body – written by someone else?
• The extra line before the return – functionally, of course, it makes no difference, but still, urgh

Straightaway I’m thinking that more than one person has worked on this over time. That’s normal, of course, but I shouldn’t be able to tell. If I can, it’s a contra-indicator.

Looking a little longer, there’s a lot of repetition – in shape, and in detail. Looking a little longer still, and I think the function parameters are in the wrong order. reference_type and canonical_form are correlated and originate within the system. They should go together. It’s reference_match which comes from the input document, it’s the only true variable and so, for me anyway, should be the first parameter. I suspect this function only had two parameters initially, and the third was added without a great deal of thought to the aesthetics of the change.

That’s a lot to not like in not a lot of code.

But at least there are tests

And hurrah for that! There are tests for this function, tangled up in a source file with some unrelated tests that pull in a mock database connection and some other business, but they do exist.

There are two test functions, one checking well-formed references, the other malformed references, but, in fact, each function checks multiple cases.

It’s a start, but the test code is much the same as the code it exercises – long and repetitious – which isn’t, perhaps, that surprising. A quick visual check shows they’re deficient in other, more serious ways. There are ten reference types named in canonicalise_reference. The tests check seven of them and, in fact, there is a whole branch of the if/else ladder that isn’t exercised. That’s the branch I already suspect of being a later addition.

Curiously too, while canonicalise_reference returns a 4-tuple, the tests only check the corrected reference and the year, ignoring the other two values. That sent me off looking for the canonicalise_reference call sites, where all four elements of the tuple are used. Again, I’d suggest the 4-tuple came in after the tests were first written, and the tests were not updated to match. After all, they still passed.

I am sure these tests were written post-hoc. They did not inform the design and development of the code they support.

Miasma

If the phrase coming to mind is code smells, then I guess you’re right. This code is a stinky bouquet of bad odours, except they aren’t clues to some deeper problem with the code. We don’t need clues – it’s right out there front and centre. No, these smells emanate from within the organisation, from a failure to develop the programmers whose hands this code has passed through. The code works, let’s be clear, but there’s a clumsiness to it and a lack of care in its evolution. That’s a cultural and organisational failing.

I keep saying this is about organisations. I’m not saying these are bad places to work, where maladjusted managers delight in making their underlings squirm. Quite the contrary, I’ve worked at more than one of the organisations responsible for the code above and had a great time. But there is something wrong – an unacknowledged failure. An unknown failure even. There’s so much potential, and it’s just not being taken up.

I came across this code because I was talking about potential work on it, going back into one of those organisations. That didn’t pan out, but had I been able I would absolutely have signed up for it. It’s fascinating stuff and right up a multiplicity of my alleys.

Let’s imagine for a moment that I was sitting down for my first day on this job, what would I do with this code? Well, at a guess, nothing. Well, nothing until I needed to, and then I’d spend a bit of time on it. But I’d absolutely be talking to my new colleagues about, well, everything.

One step at a time

The code in Listing 1 is just not great. It’s long, for a start, and it’s long because it’s repetitious. The line

    components = re.findall(r"\d+", reference_match)

appears in every branch of the if/else. Let’s start by hoisting that up.

Clearing visual noise

The unnecessary brackets in the first elif body just jar. They catch the eye and make it appear that something different is happening in the middle there, when in fact they add nothing and are just visual noise. (The result of this change and the previous one is shown in Listing 2.)

    def canonicalise_reference(reference_type, reference_match, canonical_form):
        components = re.findall(r"\d+", reference_match)

        if (
            (reference_type == "RefYearAbbrNum")
            | (reference_type == "RefYearAbbrNumTeam")
            | (reference_type == "YearAbbrNum")
        ):
            year = components[0]
            d1 = components[1]
            d2 = ""
            corrected_reference = canonical_form.replace("dddd", year).replace("d+", d1)

        elif (
            (reference_type == "RefYearAbbrNumNumTeam")
            | (reference_type == "RefYearAbrrNumStrokeNum")
            | (reference_type == "RefYearNumAbbrNum")
        ):
            year = components[0]
            d1 = components[1]
            d2 = components[2]
            corrected_reference = canonical_form.replace("dddd", year).replace("d1", d1).replace("d2", d2)

        elif (
            (reference_type == "AbbrNumAbbrNum")
            | (reference_type == "NumAbbrNum")
            | (reference_type == "EuroRefC")
            | (reference_type == "EuroRefT")
        ):
            year = ""
            d1 = components[0]
            d2 = components[1]
            corrected_reference = canonical_form.replace("d1", d1).replace("d2", d2)

        return corrected_reference, year, d1, d2

Listing 2

Move the action down

The if/else ladder sets up a load of variables, which are then used to build corrected_reference.

The lines building corrected_reference aren’t the same, but they are pretty similar. We can move them out of the if/else ladder and combine them together.
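Combining those lines is safe because str.replace returns the string unchanged when the placeholder it is asked to replace is not there. The following is a minimal sketch of that property, not the pipeline's own code: the canonical forms are invented for illustration (the real ones live in a data file the article does not show), the build helper is hypothetical, and every form is assumed to use d1 for its first number.

```python
# Hypothetical canonical forms, invented for this sketch.
FORM_YEAR_ONE = "[dddd] ABC d1"       # year plus one number
FORM_YEAR_TWO = "[dddd] ABC d1/d2"    # year plus two numbers
FORM_NO_YEAR = "ABC d1-d2"            # two numbers, no year


def build(canonical_form, year, d1, d2):
    # str.replace leaves the string untouched when the placeholder is absent,
    # so one chain of calls can serve every branch of the ladder.
    return (canonical_form.replace("dddd", year)
                          .replace("d1", d1)
                          .replace("d2", d2))


print(build(FORM_YEAR_ONE, "2024", "12", ""))    # [2024] ABC 12
print(build(FORM_YEAR_TWO, "2024", "12", "7"))   # [2024] ABC 12/7
print(build(FORM_NO_YEAR, "", "12", "7"))        # ABC 12-7
```

The sketch relies on the extracted values being digits only, so a replaced value can never be mistaken for a later placeholder.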

Looking up and out

This is a bit of a meta-change, because you can’t infer it from the code here, but canonical_form is drawn from a data file elsewhere in the source tree. We control that data file.

Examining it, we can see it’s safe to replace d+ with d1 in the canonical forms. As a result, we can eliminate one of the replace calls when constructing corrected_reference.

This change and the previous one are shown in Listing 3. The shape of the code hasn’t wildly changed, but it feels like we’re moving in a good direction.

    def canonicalise_reference(reference_type, reference_match, canonical_form):
        components = re.findall(r"\d+", reference_match)
        if (
            (reference_type == "RefYearAbbrNum")
            | (reference_type == "RefYearAbbrNumTeam")
            | (reference_type == "YearAbbrNum")
        ):
            year = components[0]
            d1 = components[1]
            d2 = ""

        elif (
            (reference_type == "RefYearAbbrNumNumTeam")
            | (reference_type == "RefYearAbrrNumStrokeNum")
            | (reference_type == "RefYearNumAbbrNum")
        ):
            year = components[0]
            d1 = components[1]
            d2 = components[2]

        elif (
            (reference_type == "AbbrNumAbbrNum")
            | (reference_type == "NumAbbrNum")
            | (reference_type == "EuroRefC")
            | (reference_type == "EuroRefT")
        ):
            year = ""
            d1 = components[0]
            d2 = components[1]

        corrected_reference = (canonical_form.replace("dddd", year)
                                             .replace("d1", d1)
                                             .replace("d2", d2))

        return corrected_reference, year, d1, d2

Listing 3

Typos must die

The ‘typo’ in "RefYearAbrrNumStrokeNum" is corrected – another meta-fix. That string comes from the same data file as the canonical forms.

Obviously "RefYearAbrrEtcEtc" looks like a load of nonsense, but Abrr is so clearly a typo. It’s an abbreviation for abbreviation! It should be Abbr! Like the brackets I mentioned above, this is a piece of visual noise that needs to go.

Ok, the corrected version now says "RefYearAbbrNumStrokeNum", which isn’t a world changing difference, but to me it looks better and the IDE agrees because there isn’t a squiggle underneath.

Constants

Those string literals give me the heebie-jeebies. I’ve replaced them with constants. (This change and the previous one are shown in Listing 4.)

    def canonicalise_reference(reference_type, reference_match, canonical_form):
        components = re.findall(r"\d+", reference_match)

        if (
            (reference_type == RefYearAbbrNum)
            | (reference_type == RefYearAbbrNumTeam)
            | (reference_type == YearAbbrNum)
        ):
            year = components[0]
            d1 = components[1]
            d2 = ""
        elif (
            (reference_type == RefYearAbbrNumNumTeam)
            | (reference_type == RefYearAbbrNumStrokeNum)
            | (reference_type == RefYearNumAbbrNum)
        ):
            year = components[0]
            d1 = components[1]
            d2 = components[2]
        elif (
            (reference_type == AbbrNumAbbrNum)
            | (reference_type == NumAbbrNum)
            | (reference_type == EuroRefC)
            | (reference_type == EuroRefT)
        ):
            year = ""
            d1 = components[0]
            d2 = components[1]

        corrected_reference = (canonical_form.replace("dddd", year)
                                             .replace("d1", d1)
                                             .replace("d2", d2))

        return corrected_reference, year, d1, d2

Listing 4

Birds of a feather

By grouping like reference types together, we can slim down each if condition.

    YearNum_Group = [
        RefYearAbbrNum,
        RefYearAbbrNumTeam,
        YearAbbrNum
    ]

Having tried it, I like that. Let’s roll it out to the rest of the types (see Listing 5).

    def canonicalise_reference(reference_type, reference_match, canonical_form):
        components = re.findall(r"\d+", reference_match)

        if reference_type in YearNum_Group:
            year = components[0]
            d1 = components[1]
            d2 = ""
        elif reference_type in YearNumNum_Group:
            year = components[0]
            d1 = components[1]
            d2 = components[2]
        elif reference_type in NumNum_Group:
            year = ""
            d1 = components[0]
            d2 = components[1]

        corrected_reference = (canonical_form.replace("dddd", year)
                                             .replace("d1", d1)
                                             .replace("d2", d2))

        return corrected_reference, year, d1, d2

Listing 5

Love it.

I remembered that Python calls arrays lists, but also that it has tuples. Tuples are immutable, so they’re a better choice for our groups.
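To make the case for tuples concrete, here is a small sketch comparing the two containers. It assumes the constants are plain strings holding their own names, since the article does not show their definitions, and the names group_as_list and group_as_tuple exist only for this illustration.

```python
# Assumed definitions: each constant simply holds its own name as a string.
RefYearAbbrNum = "RefYearAbbrNum"
RefYearAbbrNumTeam = "RefYearAbbrNumTeam"
YearAbbrNum = "YearAbbrNum"

group_as_list = [RefYearAbbrNum, RefYearAbbrNumTeam, YearAbbrNum]
group_as_tuple = (RefYearAbbrNum, RefYearAbbrNumTeam, YearAbbrNum)

# Membership tests read the same either way.
assert "YearAbbrNum" in group_as_list
assert "YearAbbrNum" in group_as_tuple

# A list can be changed by any code that can see it...
group_as_list.append("NotAReferenceType")

# ...whereas a tuple has no mutating methods and rejects item assignment.
try:
    group_as_tuple[0] = "NotAReferenceType"
    tuple_was_modified = True
except TypeError:
    tuple_was_modified = False

print(len(group_as_list), len(group_as_tuple), tuple_was_modified)  # 4 3 False
```

Because the `in` test works identically on both, switching the groups to tuples costs nothing at the call site.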
The result of swapping lists for tuples by switching [] to () is:

    YearNum_Group = (
        RefYearAbbrNum,
        RefYearAbbrNumTeam,
        YearAbbrNum
    )

Destructure FTW!

We can collapse the

    year = ...
    d1 = ...
    d2 = ...

lines together into a single statement, going from three lines into a single line (see Listing 6).

    def canonicalise_reference(reference_type, reference_match, canonical_form):
        components = re.findall(r"\d+", reference_match)

        if reference_type in YearNum_Group:
            year, d1, d2 = components[0], components[1], ""
        elif reference_type in YearNumNum_Group:
            year, d1, d2 = components[0], components[1], components[2]
        elif reference_type in NumNum_Group:
            year, d1, d2 = "", components[0], components[1]

        corrected_reference = (canonical_form.replace("dddd", year)
                                             .replace("d1", d1)
                                             .replace("d2", d2))

        return corrected_reference, year, d1, d2

Listing 6

Much easier on the eye.

An extra level of indirection

Bringing the year, d1, d2 assignments together particularly highlights the similarity across each branch of the if ladder.

Let’s pair up a type group with a little function that pulls out the components. (See Listing 7.) Probably did a bit too much in one go here, and it’s ugly as hell. But it works, and it captures something useful.

    YearNum_Group = {
        "Types": [
            RefYearAbbrNum,
            RefYearAbbrNumTeam,
            YearAbbrNum
        ],
        "Parts": lambda cmpts: (cmpts[0], cmpts[1], "")
    }

    def canonicalise_reference(reference_type, reference_match, canonical_form):
        components = re.findall(r"\d+", reference_match)

        # The groups are plain dicts at this stage, so they are indexed by key.
        if reference_type in YearNum_Group["Types"]:
            year, d1, d2 = YearNum_Group["Parts"](components)
        elif reference_type in YearNumNum_Group["Types"]:
            year, d1, d2 = YearNumNum_Group["Parts"](components)
        elif reference_type in NumNum_Group["Types"]:
            year, d1, d2 = NumNum_Group["Parts"](components)

        corrected_reference = (canonical_form.replace("dddd", year)
                                             .replace("d1", d1)
                                             .replace("d2", d2))

        return corrected_reference, year, d1, d2

Listing 7

If we now introduce a little class to pair up the types and components lambda function, it’s more setup at the top, but it’s neater in the function body:

    class TypeComponents:
        def __init__(self, types, parts):
            self.Types = types
            self.Parts = parts

    YearNum_Group = TypeComponents(
        (
            RefYearAbbrNum,
            RefYearAbbrNumTeam,
            YearAbbrNum
        ),
        lambda cmpts: (cmpts[0], cmpts[1], "")
    )

That worked, and Listing 8 shows it extended across the two elif branches.

    def canonicalise_reference(reference_type, reference_match, canonical_form):
        components = re.findall(r"\d+", reference_match)

        if reference_type in YearNum_Group.Types:
            year, d1, d2 = YearNum_Group.Parts(components)
        elif reference_type in YearNumNum_Group.Types:
            year, d1, d2 = YearNumNum_Group.Parts(components)
        elif reference_type in NumNum_Group.Types:
            year, d1, d2 = NumNum_Group.Parts(components)

        corrected_reference = (canonical_form.replace("dddd", year)
                                             .replace("d1", d1)
                                             .replace("d2", d2))

        return corrected_reference, year, d1, d2

Listing 8

The if conditions and the bodies now all have the same shape. That’s pretty cool. They were similar before, but now they’re the same.

Yoink out the decision making

It’s not really clear in the code, but there are only two things really going on in this function. The first is pulling chunks out of reference_match, and the second is putting those parts back together into corrected_reference. Let’s make that clearer (see Listing 9).

    def reference_components(reference_type, reference_match):
        components = re.findall(r"\d+", reference_match)

        if reference_type in YearNum_Group.Types:
            year, d1, d2 = YearNum_Group.Parts(components)
        elif reference_type in YearNumNum_Group.Types:
            year, d1, d2 = YearNumNum_Group.Parts(components)
        elif reference_type in NumNum_Group.Types:
            year, d1, d2 = NumNum_Group.Parts(components)

        return year, d1, d2

    def canonicalise_reference(reference_type, reference_match, canonical_form):
        year, d1, d2 = reference_components(reference_type, reference_match)

        corrected_reference = (canonical_form.replace("dddd", year)
                                             .replace("d1", d1)
                                             .replace("d2", d2))

        return corrected_reference, year, d1, d2

Listing 9
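For readers who want to run the code at this stage, the pieces can be gathered into one file. This is a sketch under stated assumptions rather than the pipeline's actual source: group membership is taken from the branches of Listing 1 (with the typo corrected), the lambdas mirror the assignments in Listing 6, and the canonical forms and matched text in the checks are invented. The checks compare all four elements of the returned tuple, which the existing tests were noted not to do.

```python
import re


class TypeComponents:
    def __init__(self, types, parts):
        self.Types = types
        self.Parts = parts


# Membership follows the branches of Listing 1; lambdas follow Listing 6.
YearNum_Group = TypeComponents(
    ("RefYearAbbrNum", "RefYearAbbrNumTeam", "YearAbbrNum"),
    lambda cmpts: (cmpts[0], cmpts[1], ""),
)
YearNumNum_Group = TypeComponents(
    ("RefYearAbbrNumNumTeam", "RefYearAbbrNumStrokeNum", "RefYearNumAbbrNum"),
    lambda cmpts: (cmpts[0], cmpts[1], cmpts[2]),
)
NumNum_Group = TypeComponents(
    ("AbbrNumAbbrNum", "NumAbbrNum", "EuroRefC", "EuroRefT"),
    lambda cmpts: ("", cmpts[0], cmpts[1]),
)


def reference_components(reference_type, reference_match):
    components = re.findall(r"\d+", reference_match)

    # As in Listing 9, an unrecognised reference_type leaves year, d1 and d2
    # unassigned, so only the ten known types are handled.
    if reference_type in YearNum_Group.Types:
        year, d1, d2 = YearNum_Group.Parts(components)
    elif reference_type in YearNumNum_Group.Types:
        year, d1, d2 = YearNumNum_Group.Parts(components)
    elif reference_type in NumNum_Group.Types:
        year, d1, d2 = NumNum_Group.Parts(components)

    return year, d1, d2


def canonicalise_reference(reference_type, reference_match, canonical_form):
    year, d1, d2 = reference_components(reference_type, reference_match)

    corrected_reference = (canonical_form.replace("dddd", year)
                                         .replace("d1", d1)
                                         .replace("d2", d2))

    return corrected_reference, year, d1, d2


# Hypothetical inputs: the matched text and canonical forms are made up.
with_year = canonicalise_reference("YearAbbrNum", "2024 abc 12", "[dddd] ABC d1")
no_year = canonicalise_reference("NumAbbrNum", "12 abc 7", "ABC d1-d2")

print(with_year)  # ('[2024] ABC 12', '2024', '12', '')
print(no_year)    # ('ABC 12-7', '', '12', '7')
```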

Say what you mean

There’s no need to assign year, d1, d2 in that new function. We can just return the values directly (see Listing 10).

    def reference_components(reference_type, reference_match):
        components = re.findall(r"\d+", reference_match)

        if (reference_type in YearNum_Group.Types):
            return YearNum_Group.Parts(components)
        elif (reference_type in YearNumNum_Group.Types):
            return YearNumNum_Group.Parts(components)
        elif (reference_type in NumNum_Group.Types):
            return NumNum_Group.Parts(components)

    def canonicalise_reference(reference_type, reference_match, canonical_form):
        year, d1, d2 = reference_components(reference_type, reference_match)

        corrected_reference = (canonical_form.replace("dddd", year)
                                             .replace("d1", d1)
                                             .replace("d2", d2))
        return corrected_reference, year, d1, d2

Listing 10

Search

I mentioned the if conditions and the bodies now all have the same shape. We can exploit that now to eliminate the if/else ladder by checking each group in turn (see Listing 11).

And rest

I first wrote this on Mastodon [Higgins24] because I’m that kind of bear, and this is where I stopped. I felt the code was in a much better place – not perfect by any means, but better.

But then I thought of something else.

You wouldn’t let it lie

Now the types are grouped together, I was inclined to put the string literals back in.

We only use "RefYearAbbrNum", for example, as part of a TypeComponents object. It’s not needed anywhere else, but having it as a constant in its own right floating around implies that you might and suggests that you can. In fact, it’s YearNum_Group that is the constant, so let’s tie things down to that.

    YearNum_Group = TypeComponents(
        (
            "RefYearAbbrNum",
            "RefYearAbbrNumTeam",
            "YearAbbrNum"
        ),
        lambda cmpts: (cmpts[0], cmpts[1], ""),
    )

I also felt the parameters to

    canonicalise_reference(reference_type, reference_match, canonical_form):

are in the wrong order.

reference_type and canonical_form go together. They originate in the same place in the code, from the data file I mentioned earlier, and if they were in a tuple or wrapped in a little object I certainly wouldn’t argue.

The thing we’re working on, that we take apart and reassemble, is reference_match. To me, that means it should be the first parameter we pass (see Listing 12).

    def reference_components(reference_match, reference_type):
        components = re.findall(r"\d+", reference_match)

        for group in TypeGroups:
            if reference_type in group.Types:
                return group.Parts(components)

    def canonicalise_reference(reference_match, reference_type, canonical_form):
        year, d1, d2 = reference_components(reference_match, reference_type)

        corrected_reference = (canonical_form.replace("dddd", year)
                                             .replace("d1", d1)
                                             .replace("d2", d2))

        return corrected_reference, year, d1, d2

Listing 12

And that I thought was that. And I went to bed.

It’s a new day

The following morning, I got a nudge from my internet fellow-traveller Barney Dellar, who said

    I tend to think of for-loops as Primitive Obsession. You aren’t looping to do something n times. You’re actually looking for the correct entry in the array to use. I would make that explicit. I’m not good at Python, but some kind of find or filter. Then invoke your method on the result of that filtering.

He was right and I knew it. Had this code been in C#, for instance, I’d probably have gone straight from the if ladder to a LINQ expression.

He set me off. I knew Python’s list comprehensions were its LINQ-a-like, and I had half an idea I could use one here.

However, I thought list comprehensions only created new lists. If I’d done that here, it would mean I’d still have to extract the first element. That felt at least as clumsy as the for loop.

    TypeGroups = (
        YearNum_Group,
        YearNumNum_Group,
        NumNum_Group
    )

    def reference_components(reference_type, reference_match):
        components = re.findall(r"\d+", reference_match)
        for group in TypeGroups:
            if reference_type in group.Types:
                return group.Parts(components)

    def canonicalise_reference(reference_type, reference_match, canonical_form):
        year, d1, d2 = reference_components(reference_type, reference_match)
        corrected_reference = (canonical_form.replace("dddd", year)
                                             .replace("d1", d1)
                                             .replace("d2", d2))
Turns out I’d only ever half used them, though. A list comprehension
return corrected_reference, year, d1, d2 actually returns an iterable. Combined with next(), which pulls the next
Listing 11 element off the iterable, and well, it’s more pythonic.
    def reference_components(reference_type,
            reference_match):
        components = re.findall(r"\d+",
            reference_match)
        return next(group.Parts(components)
            for group in TypeGroups
            if reference_type in group.Types)

What’s kind of fascinating about this change is that the generator expression has the exact same elements as the for version, but the intent, as Barney suggested, is very different.

At the same time, Barney came up with almost exactly the same thing, too [Dellar24]. We’d done a weird long-distance almost-synchronous little pairing session. Magic.

Reflecting
This is contrived, obviously, because it’s a single function I’ve pulled out of a larger code base.

But, but, but, I do believe that now I’ve shoved it about that it’s better code.

If I was able to work my way out from here, I’m confident I could make the whole thing better. It’d be smaller, it would be easier to read, easier to change.

The big finish
I’m sure I have made the code better, and I’m just as sure that I’d make the people I was working with better programmers too. I’d be better from working with them - I’ve learned from everyone I’ve ever worked with - but I’m old. I’ve been a lot of places, done a lot of stuff, on a lot of different code bases, with busloads of people. I know what I’m doing, and I know I could have helped.

I’m sorry I couldn’t take the job, but it needed more time than I could give. In the future, well, who knows?

PS
I think it’s important to note I didn’t know where I was heading when I started. I just knew that if I nudged things around then a right shape would emerge. When I had that shape, I could be more directed.

Barney’s little nudge was important too. He knew there was an improvement in there, even if neither of us was quite sure what it was (until we were!). That was great. A lovely cherry on the top.

PPS
I tried to do the least I could at each stage. In one place I took out two characters, in another I changed a single letter. Didn’t always succeed - some of what I did could have been split - but small is beautiful, and we should all aim for beauty.

This comes, in large part, from my man GeePaw Hill [Hill21] and his ‘Many More Much Smaller Steps’. He’s been a big influence on me over the past few years, and I’ve benefited greatly as a result.

PPPS (really, the last one, I promise)
I was proofing this article before pressing publish (which probably means there are only seven spelling and grammatical errors left), when I saw another change I’d make. (See Listing 13.)

    def reference_components(reference_match,
            reference_type):
        components = re.findall(r"\d+",
            reference_match)
        for group in TypeGroups:
            if reference_type in group.Types:
                return group.Parts(components)

    def build_canonical_form(canonical_form,
            year, d1, d2):
        return (canonical_form.replace("dddd", year)
            .replace("d1", d1)
            .replace("d2", d2))

    def canonicalise_reference(reference_match,
            reference_type, canonical_form):
        year, d1, d2 = reference_components(reference_match,
            reference_type)

        corrected_reference = build_canonical_form(canonical_form,
            year, d1, d2)

        return corrected_reference, year, d1, d2

Listing 13

Again, nothing huge but just another little clarification.

That really is it. For now!

References
[Dellar24] Barney Dellar on Mastodon: https://mastodon.scot/@BarneyDellar/112042140234945492
[Higgins24] The changes on Mastodon: https://mastodon.me.uk/@jezhiggins/112039275413895974
[Hill21] GeePaw (Michael) Hill: ‘Many More Much Smaller Steps’ (MMMSS): a series of five blog posts published from 29 September 2021 to 30 December 2021, available at: https://www.geepawhill.org/series/many-more-much-smaller-steps/

This article was published as two posts on Jez’s blog:
- ‘To See a World in a Grain of Sand’ (posted 24 February 2024) available from: https://www.jezuk.co.uk/blog/2024/02/to-see-a-world-in-a-grain-of-sand.html
- ‘If You’re So Smart’ (posted 7 March 2024) available from: https://www.jezuk.co.uk/blog/2024/03/if-youre-so-smart.html

Go to the second post to see all of the listings full-width (and some intermediate steps).

Feature Spencer Collyer

User-Defined Formatting
in std::format
std::format allows us to format values quickly and safely.
Spencer Collyer demonstrates how to provide formatting
for a simple user-defined class.

Spencer Collyer has been programming for more years than he cares to remember, mostly in the financial sector, although in his younger years he worked on projects as diverse as monitoring water treatment works on the one hand, and television programme scheduling on the other.

In a previous article [Collyer21], I gave an introduction to the std::format library, which brings modern text formatting capabilities to C++.

That article concentrated on the output functions in the library and how they could be used to write the fundamental types and the various string types that the standard provides.

Being a modern C++ library, std::format also makes it relatively easy to output user-defined types, and this series of articles will show you how to write the code that does this.

There are three articles in this series. This article describes the basics of setting up the formatting for a simple user-defined class. The second article will describe how this can be extended to classes that hold objects whose type is specified by the user of your class, such as containers. The third article will show you how to create format wrappers, special purpose classes that allow you to apply specific formatting to objects of existing classes.

A note on the code listings: The code listings in this article have lines labelled with comments like // 1. Where these lines are referred to in the text of this article, it will be as ‘line 1’ for instance, rather than ‘the line labelled // 1’.

Interface changes
Since my previous article was first published, based on the draft C++20 standard, the paper [P2216] was published which changes the interface of the format, format_to, format_to_n, and formatted_size functions. They no longer take a std::string_view as the format string, but instead a std::format_string (or, for the wide-character overloads, std::wformat_string). This forces the format string to be a constant at compile time. This has the major advantage that compile time checks can be carried out to ensure it is valid.

The interfaces of the equivalent functions prefixed with v (e.g. vformat) have not changed and they can still take runtime-defined format specs.

One effect of this is that if you need to determine the format spec at runtime then you have to use the v-prefixed functions and pass the arguments as an argument pack created with make_format_args or make_wformat_args. This will impact you if, for instance, you want to make your program available in multiple languages, where you would read the format spec from some kind of localization database.

Another effect is on error reporting in the functions that parse the format spec. We will deal with this when describing the parse function of the formatter classes described in this article.

C++26 and runtime_format
Forcing the use of the v-prefixed functions for non-constant format specs is not ideal, and can introduce some problems. The original P2216 paper mentioned possible use of a runtime_format to allow non-constant format specs but did not add any changes to enable that. A new proposal [P2918] does add such a function, and once again allows non-constant format specs in the various format functions. This paper has been accepted into C++26, and the libstdc++ library that comes with GCC should have it implemented by the time you read this article, if you want to try it out.

Creating a formatter for a user-defined type
To enable formatting for a user-defined type, you need to create a specialisation of the struct template formatter. The standard defines this as:

    template<class T, class charT = char>
    struct formatter;

where T is the type you are defining formatting for, and charT is the character type your formatter will be writing.

Each formatter needs to declare two functions, parse and format, that are called by the formatting functions in std::format. The purpose and design of each function is described briefly in the following sections.

Inheriting existing behaviour
Before we dive into the details of the parse and format functions, it is worth noting that in many cases you can get away with re-using existing formatters by inheriting from them. Normally, you would do this if the standard format spec does everything you want, so you can just use the inherited parse function and write your own format function that ultimately calls the one on the parent class to do the actual formatting.

For instance, you may have a class that wraps an int to provide some special facilities, like clamping the value to be between min and max values, but when outputting the value you are happy to have the standard formatting for int. In this case you can just inherit from std::formatter<int> and simply override the format function to call the one on that formatter, passing the appropriate values to it. An example of doing this is given in Listing 1.

    #include <format>
    #include <iostream>
    #include <type_traits>

    class MyInt
    {
    public:
      MyInt(int i) : m_i(i) {}
      int value() const { return m_i; }
    private:
      int m_i;
    };
    template<>
    struct std::formatter<MyInt>
      : public std::formatter<int>
    {
      using Parent = std::formatter<int>;
      auto format(const MyInt& mi,
        std::format_context& format_ctx) const
      {
        return Parent::format(mi.value(),
          format_ctx);
      }
    };
    int main()
    {
      MyInt mi{1};
      std::cout << std::format("{0} [{0}]\n", mi);
    }

Listing 1

Or you may be happy for your formatter to produce a string representation of your class and use the standard string formatting to output that string. You would inherit from std::formatter<std::string> and just override the format function to generate your string representation and then call the parent format function to actually output the value.

The parse function
The parse function does the work of reading the format specification (format-spec) for the type. It should store any formatting information from the format-spec in the formatter object itself.[1]

[1] There is nothing stopping you storing the formatting information in a class variable or even a global variable, but the standard specifies that the output of the format function in the formatter should only depend on the input value, the locale, and the format-spec as parsed by the last call to parse. Given these constraints, it is simpler to just store the formatting information in the formatter object itself.

As a reminder of what is actually being parsed, my previous article had the following for the general format of a replacement field:

    ‘{’ [arg-id] [‘:’ format-spec] ‘}’

so the format-spec is everything after the : character, up to but not including the terminating }.

Assume we have a typedef PC defined as follows:

    using PC = basic_format_parse_context<charT>;

where charT is the template argument passed to the formatter template. Then the parse function prototype looks like the following:

    constexpr PC::iterator parse(PC& pc);

The function is declared constexpr so it can be called at compile time.

The standard defines specialisations of the basic_format_parse_context template called format_parse_context and wformat_parse_context, with charT being char and wchar_t respectively.

On entry to the function, pc.begin() points to the start of the format-spec for the replacement field being formatted. The value of pc.end() is such as to allow the parse function to read the entire format-spec. Note that the standard specifies that an empty format-spec can be indicated by either pc.begin() == pc.end() or *pc.begin() == '}', so your code needs to check for both conditions.

The parse function should process the whole format-spec. If it encounters a character it doesn’t understand, other than the } character that indicates the format-spec is complete, it should report an error. The way to do this is complicated by the need to allow the function to be called at compile time. Before that change was made, it would be normal to throw a std::format_error exception. You can still do this, with the proviso that the compiler will report an error, as throw cannot be used when evaluating the function at compile time. Until such time as a workaround has been found for this problem, it is probably best to just throw the exception and allow the compiler to complain. That is the solution used in the code presented in this article.

If the whole format-spec is processed with no errors, the function should return an iterator pointing to the terminating } character. This is an important point – the } is not part of the format-spec and should not be consumed, otherwise the formatting functions themselves will throw an error.

Format specification mini-language
The format-spec for your type is written in a mini-language which you design. It does not have to look like the one for the standard format-specs defined by std::format. There are no rules for the mini-language, as long as you can write a parse function that will process it.

An example of a specialist mini-language is that defined by std::chrono for its formatters, given for instance at [CppRef]. Further examples are given in the code samples that make up the bulk of this series of articles. There are some simple guidelines to creating a mini-language in the appendix at the end of this article: ‘Simple Mini-Language Guidelines’.

The format function
The format function does the work of actually outputting the value of the argument for the replacement field, taking account of the format-spec that the parse function has processed.

Assume we have a typedef FC defined as follows:

    using FC = basic_format_context<Out, charT>;

where Out is an output iterator and charT is the template argument passed to the formatter template. Then the format function prototype looks like the following:

    FC::iterator format(const T& t, FC& fc) const;

where T is the template argument passed to the formatter template.

Note that the format function should be const-qualified. This is because the standard specifies that it can be called on a const object.
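Before going further into format, the parse contract described earlier can be made concrete without any of the std::format machinery. What follows is only a sketch under stated assumptions: parse_simple_spec, ParsedSpec and the single 'U' flag are hypothetical names invented for illustration, and a std::string_view with an index stands in for the parse context, with the end of the view playing the role of pc.end(). It shows the two empty-spec checks and returns the position of the terminating } without consuming it.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string_view>

// Hypothetical result type: what a formatter would keep in its members,
// plus the position that a real parse function would return as an iterator.
struct ParsedSpec
{
    bool upper;       // true if the made-up 'U' flag was present
    std::size_t end;  // index of the terminating '}' (or the end of the range)
};

// Assumption: `spec` starts at the first character of the format-spec,
// just as pc.begin() does on entry to formatter::parse.
constexpr ParsedSpec parse_simple_spec(std::string_view spec)
{
    std::size_t pos = 0;
    bool upper = false;

    // An empty format-spec can appear in two ways, so test for both.
    if (pos == spec.size() || spec[pos] == '}')
    {
        return ParsedSpec{upper, pos};
    }
    if (spec[pos] == 'U')
    {
        upper = true;
        ++pos;
    }
    // Whatever follows must be the terminator (or the end of the range).
    if (pos != spec.size() && spec[pos] != '}')
    {
        throw std::invalid_argument("unknown character in format-spec");
    }
    // The '}' is reported, not consumed.
    return ParsedSpec{upper, pos};
}

// Valid specs can be checked at compile time.
static_assert(parse_simple_spec("}").end == 0);
static_assert(parse_simple_spec("U}").upper && parse_simple_spec("U}").end == 1);
```

Feeding this function an unknown character inside a constant expression, for example in another static_assert, reaches the throw and is rejected by the compiler, which mirrors the behaviour described above for a constexpr parse function.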
The standard defines specialisations of the basic_format_context template called format_context and wformat_context, with charT being char and wchar_t respectively.

The function should format the value t passed to it, using the formatting information stored by parse, and the locale returned by fc.locale() if it is locale-dependent. The output should be written starting at fc.out(), and on return the function should return the iterator just past the last output character.

If you just want to output a single character, the easiest way is to write something like the following, assuming iter is the output iterator and c is the character you want to write:

    *iter++ = c;

If you need more complex formatting than just writing one or two characters, the easiest way to create the output is to use the formatting functions already defined by std::format, as they correctly maintain the output iterator.

The most useful function to use is std::format_to, as that takes the iterator returned by fc.out() and writes directly to it, returning the updated iterator as its result. Or if you want to limit the amount of data written, you can use std::format_to_n.

Using the std::format function itself has a couple of disadvantages. Firstly it returns a string which you would then have to send to the output. And secondly, because it has the same name as the function in formatter, you have to use a std namespace qualifier on it, even if you have a using namespace std; line in your code, as otherwise function name resolution will pick up the format function from the formatter rather than the std::format one.

Formatting a simple object
For our first example we are going to create a formatter for a simple Point class, defined in Listing 2.

    class Point
    {
    public:
      Point() {}
      Point(int x, int y) : m_x(x), m_y(y) {}

      int x() const { return m_x; }
      int y() const { return m_y; }

    private:
      int m_x = 0;
      int m_y = 0;
    };

Listing 2

Default formatting
Listing 3 shows the first iteration of the formatter for Point. This just allows default formatting of the object.

    #include "Point.hpp"
    #include <format>
    #include <iostream>
    #include <type_traits>

    template<>
    struct std::formatter<Point>
    {
      constexpr auto parse(
        std::format_parse_context& parse_ctx)
      {
        auto iter = parse_ctx.begin();
        auto get_char = [&]() { return iter
          != parse_ctx.end() ? *iter : 0; }; // 1
        char c = get_char();
        if (c != 0 && c != '}') // 2
        {
          throw std::format_error(
            "Point only allows default formatting");
        }
        return iter;
      }
      auto format(const Point& p,
        std::format_context& format_ctx) const
      {
        return std::format_to(std::move(
          format_ctx.out()), "{},{}", p.x(), p.y());
      }
    };
    int main()
    {
      Point p;
      std::cout << std::format("{0} [{0}]\n", p);
      try
      {
        std::cout << std::vformat("{0:s}\n",
          std::make_format_args(p));
      }
      catch (std::format_error& fe)
      {
        std::cout << "Caught format_error : "
          << fe.what() << "\n";
      }
    }

Listing 3

In the parse function, the lambda get_char defined in line 1 acts as a convenience function for getting either the next character from the format-spec, or else indicating the format-spec has no more characters by returning zero. It is not strictly necessary in this function as it is only called once, but will be useful as we extend the format-spec later.

The if-statement in line 2 checks that we have no format-spec defined. The value 0 will be returned from the call to get_char if the begin and end calls on parse_ctx return the same value.

The format function has very little to do – it just returns the result of calling format_to with the appropriate output iterator, format string, and details from the Point object. The only notable thing to point out is that we wrap the format_ctx.out() call which gets the output iterator in std::move. This is in case the user is using an output that has move-only iterators.

Adding a separator character and width specification
Now we have seen how easy it is to add default formatting for a class, let’s extend the format specification to allow some customisation of the output.

The format specification we will use has the following form:

    [sep] [width]

where sep is a single character to be used as the separator between the two values in the Point output, and width is the minimum width of each of the two values. Both elements are optional. The sep character can be any character other than } or a decimal digit.

The code for this example is in Listing 4.

Member variables
The first point to note is that we now have to store information derived from the format-spec by the parse function so the format function can do its job. So we have a set of member variables in the formatter defined from line 10 onwards.

The default values of these member variables are set so that if no format-spec is given, a valid default output will still be generated. It is a good idea to follow the same principle when defining your own formatters.

The parse function
The parse function has expanded somewhat to allow parsing of the new format-spec. Line 1 gives a short-circuit if there is no format-spec defined, leaving the formatting as the default.

    #include "Point.hpp"
    #include <cctype>
    #include <format>
    #include <iostream>

    using namespace std;

    template<>
    struct std::formatter<Point>
    {
      constexpr auto parse(
        format_parse_context& parse_ctx)
      {
        auto iter = parse_ctx.begin();
        auto get_char = [&]() { return iter
          != parse_ctx.end() ? *iter : 0; };
        char c = get_char();
        if (c == 0 || c == '}') // 1
        {
          return iter;
        }
        auto IsDigit = [](unsigned char uc) { return
          isdigit(uc); }; // 2
        if (!IsDigit(c)) // 3
        {
          m_sep = c;
          ++iter;
          if ((c = get_char()) == 0 || c == '}') // 4
          {
            return iter;
          }
        }
        auto get_int = [&]() { // 5
          int val = 0;
          char c;
          while (IsDigit(c = get_char())) // 6
          {
            val = val*10 + c-'0';
            ++iter;
          }
          return val;
        };
        if (!IsDigit(c)) // 7
        {
          throw format_error("Invalid format "
            "specification for Point");
        }
        m_width = get_int(); // 8
        m_width_type = WidthType::Literal;
        if ((c = get_char()) != '}') // 9
        {
          throw format_error("Invalid format "
            "specification for Point");
        }
        return iter;
      }
      auto format(const Point& p,
        format_context& format_ctx) const
      {
        if (m_width_type == WidthType::None)
        {
          return
            format_to(std::move(format_ctx.out()),
              "{0}{2}{1}", p.x(), p.y(), m_sep);
        }
        return format_to(std::move(format_ctx.out()),
          "{0:{2}}{3}{1:{2}}", p.x(), p.y(), m_width,
          m_sep);
      }
    private:
      char m_sep = ','; // 10
      enum WidthType { None, Literal };
      WidthType m_width_type = WidthType::None;
      int m_width = 0;
    };
    int main()
    {
      Point p1(1, 2);
      cout << format("[{0}] [{0:/}] [{0:4}]"
        "[{0:/4}]\n", p1);
    }

Listing 4
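One detail of Listing 4 that is worth isolating is the digit-accumulating loop in get_int, which does not guard against values that overflow an int. What follows is a hedged sketch of a checked variant, not the code used in this article: parse_width, WidthResult and is_ascii_digit are hypothetical names, the input is a plain std::string_view rather than a parse context, and std::out_of_range stands in for std::format_error so that the fragment does not depend on <format>. The digit test is written as a range comparison because std::isdigit is not specified to be constexpr.

```cpp
#include <cstddef>
#include <limits>
#include <stdexcept>
#include <string_view>

// ASCII-only digit test that is usable in constant expressions.
constexpr bool is_ascii_digit(char c)
{
    return c >= '0' && c <= '9';
}

// Hypothetical result type for this sketch.
struct WidthResult
{
    int value;        // the number read (0 if there were no digits)
    std::size_t end;  // index of the first character that is not a digit
};

// Reads a run of decimal digits from `spec`, starting at index `pos`.
constexpr WidthResult parse_width(std::string_view spec, std::size_t pos = 0)
{
    constexpr int max_width = std::numeric_limits<int>::max();
    int val = 0;
    while (pos < spec.size() && is_ascii_digit(spec[pos]))
    {
        const int digit = spec[pos] - '0';
        // val*10 + digit exceeds max_width exactly when
        // val > (max_width - digit) / 10, so test before multiplying.
        if (val > (max_width - digit) / 10)
        {
            throw std::out_of_range("width does not fit in an int");
        }
        val = val * 10 + digit;
        ++pos;
    }
    return WidthResult{val, pos};
}

// Usable at compile time, like the parse function it is modelled on.
static_assert(parse_width("42}").value == 42);
static_assert(parse_width("42}").end == 2);
```

With a 32-bit int, parse_width("2147483648}") reaches the throw instead of overflowing. In a real formatter the same test would sit inside get_int and throw std::format_error.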
Feature Spencer Collyer

Avoid having complicated constructions or


interactions between different elements in
your mini-language … it should be possible
to parse it in a single pass

In the code following the check above we need to check if the specified as in the standard format specification with either {} or {n},
character we have is a decimal digit. The normal way to do this is to where n is an argument index.
use std::isdigit, but because this function has undefined behaviour
The format specification for this example is identical to the one above,
if the value passed cannot be represented as an unsigned char, we
with the addition of allowing the width to be specified at runtime.
define lambda IsDigit at line 2 as a wrapper which ensures the value
passed to isdigit is an unsigned char. The code for this example is in Listing 5. When labelling the lines in this
listing, corresponding lines in Listing 4 and Listing 5 have had the same
As mentioned above, any character that is not } or a decimal digit is taken
labels applied. This does mean that some labels are not used in Listing 5
as being the separator. The case of } has been dealt with by line 1 already.
if there is nothing additional to say about those lines compared to Listing
The if-statement at line 3 checks for the second case. If we don’t have
4. We use uppercase letters for new labels introduced in Listing 5.
a decimal digit character, the value in c is stored in the member variable.
We need to increment iter before calling get_char in line 4 because
get_char itself doesn’t touch the value of iter.
#include "Point.hpp"
Line 4 checks to see if we have reached the end of the format-spec after #include <format>
reading the separator character. Note that we check for the case where #include <iostream>
get_char returns 0, which indicates we have reached the end of the using namespace std;
format string, as well as the } character that indicates the end of the template<>
format-spec. This copes with any problems where the user forgets to struct std::formatter<Point>
terminate the replacement field correctly. The std::format functions {
will detect such an invalid condition and throw a std::format_error constexpr auto
exception. parse(format_parse_context& parse_ctx)
{
The get_int lambda function defined starting at line 5 attempts to auto iter = parse_ctx.begin();
read a decimal number from the format-spec. On entry iter should be auto get_char = [&]() { return iter
!= parse_ctx.end() ? *iter : 0; };
pointing to the start of the number. The while-loop controlled by line 6 char c = get_char();
keeps reading characters until a non-decimal digit is found. In the normal if (c == 0 || c == '}')
case this would be the } that terminates the format-spec. We don’t check {
in this function for which character it was, as that is done later. Note that return iter;
as written, the get_int function has undefined behaviour if a user uses }
auto IsDigit = [](unsigned char uc)
a value that overflows an int – a more robust version could be written if { return isdigit(uc); };
you want to check against users trying to define width values greater than if (c != '{' && !IsDigit(c)) // 3
the maximum value of an int. {
m_sep = c;
The check in line 7 ensures we have a width value. Note that the checks ++iter;
in lines 3 and 4 will have caused the function to return if we just have a if ((c = get_char()) == 0 || c == '}')
sep element. {
return iter;
The width is read and stored in line 8, with the following line indicating }
we have a width given. }
auto get_int = [&]() {
Finally, line 9 checks that we have correctly read all the format-spec. This int val = 0;
char c;
is not strictly necessary, as the std::format functions will detect any while (IsDigit(c = get_char()))
failure to do so and throw a std::format_error exception, but doing {
it here allows us to provide a more informative error message. val = val*10 + c-'0';
++iter;
}
The format function return val
The format function has changed to use the sep and width elements };
specified. It should be obvious what is going on, so we won’t go into it if (!IsDigit(c) && c != '{') // 7
in any detail. {
throw format_error("Invalid format "
"specification for Point");
Specifying width at runtime }
In this final example we will allow the width element to be specified at
runtime. We do this by allowing a nested replacement field to be used, Listing 5
24 | Overload | April 2024
Spencer Collyer Feature

if (c == '{') // A private:
{ mutable char m_sep = ',';
m_width_type = WidthType::Arg; // B enum WidthType { None, Literal, Arg };
++iter; mutable WidthType m_width_type
if ((c = get_char()) == '}') // C = WidthType::None;
{ mutable int m_width = 0;
m_width = parse_ctx.next_arg_id(); };
} int main()
else // D {
{ Point p1(1, 2);
m_width = get_int(); cout << format(
parse_ctx.check_arg_id(m_width); "[{0}] [{0:-}] [{0:4}] [{0:{1}}]\n", p1, 4);
} cout << format(
++iter; "With automatic indexing: [{:{}}]\n", p1, 4);
} try
else // E {
{ cout << vformat("[{0:{2}}]\n",
m_width = get_int(); // 8 std::make_format_args(p1, 4));
m_width_type = WidthType::Literal; }
} catch (format_error& fe)
if ((c = get_char()) != '}') {
{ cout << format("Caught exception: {}\n",
throw format_error("Invalid format " fe.what());
"specification for Point"); }
} }
return iter;
} Listing 5 (cont’d)
auto format(const Point& p,
format_context& format_ctx) const Nested replacement fields
{
if (m_width_type == WidthType::None) The standard format-spec allows you to use nested replacement fields
{ for thewidth and prec fields. If your format-spec also allows nested
return replacement fields, the basic_format_parse_context class has a
format_to(std::move(format_ctx.out()), couple of functions to support their use: next_arg_id and check_
"{0}{2}{1}", p.x(), p.y(), m_sep); arg_id. They are used in the parse function for Listing 5, and a
}
if (m_width_type == WidthType::Arg) // F description of what they do will be given in that section.
{
m_width = get_arg_value(format_ctx, The parse function
m_width);
} The first change in the parse function is on line 3. As can be seen, in
the new version, it has to check for the { character as well as for a digit when checking if a width has been specified. This is because the dynamic width is specified using a nested replacement field, which starts with a { character.

The next difference is in line 7, where we again need to check for a { character as well as a digit to make sure we have a width specified.

The major change to this function starts at line A. This if-statement checks if the next character is a {, which indicates we have a nested replacement field. If the test passes, line B marks that we need to read the width from an argument, and then we proceed to work out what the argument index is.

The if-statement in line C checks if the next character is a }, which means we are using automatic indexing mode. If the test passes, we call the next_arg_id function on parse_ctx to get the argument number. That function first checks if manual indexing mode is in effect, and if it is it throws a format_error exception, as you cannot mix manual and automatic indexing. Otherwise, it enters automatic indexing mode and returns the next argument index, which in this case is assigned to the m_width variable.

If the check in line C fails, we enter the else-block at line D to do manual indexing. We get the argument number by calling get_int, and then we call the check_arg_id function on parse_ctx. The function checks if automatic indexing mode is in effect, and if so it throws a format_error exception. If automatic indexing mode is not in effect then check_arg_id enters manual indexing mode.

The else-block starting at line E just handles the case where we have a literal width specified in the format-spec, and is identical to the code starting at line 8 in Listing 4.

    return format_to(std::move(format_ctx.out()),
      "{0:{2}}{3}{1:{2}}", p.x(), p.y(), m_width,
      m_sep);
  }
private:
  int get_arg_value(format_context& format_ctx,
    int arg_num) const // G
  {
    auto arg = format_ctx.arg(arg_num); // H
    if (!arg)
    {
      string err;
      back_insert_iterator<string> out(err);
      format_to(out, "Argument with id {} not "
        "found for Point", arg_num);
      throw format_error(err);
    }
    int width = visit_format_arg([]
      (auto value) -> int { // I
      if constexpr (
        !is_integral_v<decltype(value)>)
      {
        throw format_error("Width is not "
          "integral for Point");
      }
      else if (value < 0
        || value > numeric_limits<int>::max())
      {
        throw format_error("Invalid width for "
          "Point");
      }
      else
      {
        return value;
      }
    }, arg);
    return width;
  }

Listing 5 (cont’d)

Note that when used at compile time, next_arg_id or check_arg_id check that the argument id returned (for next_arg_id) or supplied (for
check_arg_id) is within the range of the arguments, and if not will fail to compile. However, this is not done when called at runtime.

The format function
The changes to the format function are just the addition of the if-statement starting at line F. This checks if we need to read the width value from an argument, and if so it calls the get_arg_value function to get the value and assign it to the m_width variable, so the format_to call following can use it.

The get_arg_value function
The get_arg_value function, defined starting at line G, does the work of actually fetching the width value from the argument list.

Line H tries to fetch the argument from the argument list. If the argument number does not represent an argument in the list, it returns a default constructed value. The following if-statement checks for this, and reports the error if required. Note that in your own code you might want to disable or remove any such checks from production builds, but have them in debug/testing builds.

If the argument is picked up correctly, line I uses the visit_format_arg function to apply the lambda function to the argument value picked up in line H. The visit_format_arg function is part of the std::format API. The lambda function checks that the value passed is of the correct type – in this case, an integral type – and that its value is in the allowed range. Failure in either case results in a format_error exception. Otherwise, the lambda returns the value passed in, which is used as the width.

Summary
We have seen how to add a formatter for a user-defined class, and gone as far as allowing the user to specify certain behaviour (in our case the width) at runtime. We will stop at this point as we’ve demonstrated what is required, but there is no reason why a real-life Point class couldn’t have further formatting abilities added.

In the next article in the series, we will explain how you can write a formatter for a container class, or any other class where the types of some elements of the class can be specified by the user.

Appendix: Simple mini-language guidelines
As noted when initially describing the parse function of the formatters, the format-spec you parse is created using a mini-language, the design of which you have full control over. This appendix offers some simple guidelines to the design of your mini-language.

Before giving the guidelines, I’d like to introduce some terminology. These are not ‘official’ terms but hopefully will make sense.

• An element of a mini-language is a self-contained set of characters that perform a single function. In the standard format-spec most elements are single characters, except for the width and prec values, and the combination of fill and align.
• An introducer is a character that says the following characters make up a particular element. In the standard format-spec the ‘.’ at the start of the prec element is an introducer.

Remember, the following are guidelines, not rules. Feel free to bend or break them if you think you have a good reason for doing so.

Enable a sensible default
It should be possible to use an empty format-spec and obtain sensible output for your type. Then the user can just write {} in the format string and get valid output. Effectively this means that every element of your mini-language should be optional, and have a sensible default.

Shorter is better
Your users are going to be using the mini-language each time they want to do non-default outputting of your type. Using single characters for the elements of the language is going to be a lot easier to use than having to type whole words.

Keep it simple
Similar to the above, avoid having complicated constructions or interactions between different elements in your mini-language. A simple interaction, like in the standard format-spec where giving an align element causes any subsequent ‘0’ to be ignored, is fine, but having multiple elements interacting or controlling others is going to lead to confusion.

Make it single pass
It should be possible to parse the mini-language in a single pass. Don’t have any constructions which necessitate going over the format-spec more than once. This should be helped by following the guideline above to ‘Keep it simple’. This is as much for ease of programming the parse function as it is for ease of writing format-specs.

Avoid ambiguity
If it is possible for two elements in your mini-language to look alike then you have an ambiguity. If you cannot avoid this, you need a way to make the second element distinguishable from the first.

For instance, in the standard format-spec, the width and prec elements are both integer numbers, but the prec element has ‘.’ as an introducer so you can always tell what it is, even if no width is specified.

Use nested-replacement fields like the standard ones
If it makes sense to allow some elements (or parts of elements) to be specified at run-time, use nested replacement fields that look like the ones in the standard format-spec to specify them, i.e. { and } around an optional number.

Avoid braces
Other than in nested replacement fields, avoid using braces (`{` and `}`) in your mini-language, except in special circumstances.

References
[Collyer21] Spencer Collyer (2021) ‘C++20 Text Formatting – An Introduction’ in Overload 166, December 2021, available at: https://accu.org/journals/overload/29/166/collyer/
[CppRef] std::formatter<std::chrono::sys_time>: https://en.cppreference.com/w/cpp/chrono/system_clock/formatter
[P2216] P2216R3 – std::format improvements, Victor Zverovich, 5 Feb 2021, https://wg21.link/P2216
[P2918] P2918R2 – Runtime format strings II, Victor Zverovich, 7 Nov 2023, https://wg21.link/P2918



Teedy Deigh Feature

Judgment Day
What if AI takes your job?
Teedy Deigh finds out.

Teedy Deigh
Teedy says she’s been dealing with artificial intelligence her whole career, that many of her colleagues qualify and are not as smart as they make themselves out to be, (deeply) faking and (heavily) bluffing their way through codebases, technologies and business decisions, playing an imitation game informed by Stack Overflow, hype cycles and group think, and that it’s not imposter syndrome if they are actually imposters.

TD what?
MD I’ve been trying to get in touch.
TD i know
   got the same desperate msg from you on a dozen platforms
   repeated enough times to buffer overflow
   you even left voicemail msgs
   who even uses phones for that anymore?
   and all before a reasonable person’s had the chance to have a 4th coffee
   so what’s app?
MD We have a problem and we need your help.
TD i don’t work for you any more
MD But we’ve got a problem.
TD you fired all the developers just over 2 weeks ago
MD It’s serious.
TD so was firing all the developers
MD We had no choice. Our new AI-only development strategy was approved by the board. We followed through. There’s no turning back. We’re embracing the future.
TD who proposed the strategy?
MD That’s not important.
TD who proposed the strategy?
MD I did. But it was based on a thorough study and supported by a number of others.
TD who?
MD Some managers, the finance department, marketing, HR and C-level execs.
TD C-level?
   sounds like you went overboard
   you involve any techies?
MD Yes, a couple of senior architects did the study.
TD i meant bit wranglers not hand wavers
MD You mean developers?
   Of course not! That’s like getting turkeys to vote on Xmas.
TD seriously WTF?!
MD Sorry about that. Sensitivity training’s not booked until next month.
   Anyway, the architects said lots of technical things that sounded very impressive and quite persuasive.
   That all you need are product owners describing the functionality and architects filling in some technical bits, the non-functional stuff.
   AI generates all the code.
   They called it the Skynet strategy, for some reason, and said it would terminate our need for developers.
TD oh I know which architects you mean
   ‘non-functional’ is definitely the right description
   that ‘thorough study’ means they saw a couple of videos, read some press releases and spent the rest of the day binge-watching classic sci-fi
MD I’m sure they were more thorough than that.
TD fraid not
   been dealing with their ‘architectures’ for years
   me and the other devs had sweepstakes bout what was gonna come up
   both the questionable technical choices and the movie refs
MD Movie references?
TD plus we kept a repo of ADRs to deal with their decisions
MD ADRs?
TD Architecture Denial Records
   ways of working around and avoiding the official architecture
   TBH might’ve been the most enjoyable and creative part of my job
MD I found their presentations compelling and insightful.
TD that’s not how you spell inciteful
   your predecessor made them architects to keep them out of the code
   reckoned they couldn’t do as much damage with PowerPoint marketecture
   guess we now know that wasn’t true
MD Which is why I’m contacting you.
   It’s not working.
TD what’s not working?
MD It. You know. The software. The stuff you develop.
TD developed
MD Whatever. It’s not working. After the last sprint things started going wrong, and it’s all blown up this morning.
TD when you say last sprint you mean the first sprint using 100% LLM-based codegen?
MD Yes, and we don’t understand what’s wrong. I’ve been told all the tests are passing.
TD which tests?
MD The ones generated by the AI.
TD has anyone looked at the code?
MD Yes, the architects.
TD what did they say?
MD They shrugged and said ‘LGTM’, if I recall correctly. Not quite sure what they meant.
TD when a dev uses LGTM it means they couldn’t be bothered to look through it
   when an architect uses LGTM it means they haven’t a clue
   basically your CI/CD pipeline is now a GIGO pipeline
MD Is that bad?
TD very
MD I also overheard them later on being concerned about someone called Ellie.
TD that would probably be ELE
   Extinction Level Event
MD What does that mean?
TD they were probably talking about the deep impact on the company’s prospects
MD This is even worse than I thought!
TD perhaps your product owners could have a go at fixing things
   i mean it’s their code right?
MD They just told the AI what they wanted it to do.
TD did they precisely and rigorously specify what they wanted?
MD They’re product owners, what do you think?
TD ah
   guess that also means they didn’t check the results or specify at a high-level of detail?
MD Do they need to do that? It seems like a lot of work. I thought they just needed to nudge the AI and it would all work.
TD ‘prompt’ not ‘nudge’
   you need to be very detailed and very precise and to pay a lot of attention
   and then you do the nudging
   (and often quite a lot of shoving)
   if not, it’s no better than telling your cat you farted
MD I don’t recall all this stuff about ‘precision’, ‘rigour’, ‘detail’ and ‘checking’ being mentioned in the study. Is this what they call ‘prompt engineering’?
TD it’s what we call programming
   tell you what
   i’ll help you sort out this mess if you give me my old job back
MD We can’t do that. There’s no software development department anymore. We let it go, and the budget for software is frozen.
TD well that’s all very Disney of you but no job means no help
   to be clear
   what you need is someone to correctly specify, verify, adapt and adjust prompts?
MD Exactly.
TD that would be like a product owner right?
MD Yes.
   I see.
   We have hiring capacity for POs. But that would mean hiring you back at a higher pay grade than when you were a software developer.
TD i have no problem with that
   and as a senior PO i’d be able to take advantage of this (re)hiring capacity yes?
MD Wait, why would you be senior?
TD you need a PO with the specific ability to be specific in a way that is correct?
   that seems to be a higher grade of ability than the other POs
MD That’s true.
TD and you have a (very very) big problem that needs to be solved asap
MD That’s also true.
TD just to check: senior PO is higher up the hierarchy than senior architect?
MD Correct.
TD then i accept
   pls tell the architects i’ll be back
