A review of Accelerate: The Science of Lean Software and DevOps

2022-06-07

Summary

N. Forsgren, J. Humble, G. Kim. Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations. IT Revolution Press, 2018.

Accelerate is an influential book of advice for building and delivering software. I’m a practicing software engineer; I spend a fair fraction of my free time trying to get better at my craft; so, among other things, I read books. I read this one after coming across numerous positive references (example), and skimming several of the annual “State of DevOps” reports that the authors and their collaborators have published under the “DORA” (DevOps Research and Assessment) banner.

Accelerate investigates some important questions in software development that are hard to investigate. I believe that the authors are smart people who are trying their best. Gathering and analyzing this data was a lot of work, and I admire that effort.

Nevertheless, the claim to have scientifically validated a set of superior software development practices seems questionable. I find the authors’ approach inadequate to the task in multiple ways. Moreover, their use of statistical language, and rhetoric in general, is sloppy enough to raise doubts about whether the methodological issues go deeper than just the language.

As for what you should do in your software development organization: I came away thinking that the authors’ high-level recommendations are mostly fine; see the appendix for a summary. However, I urge you to consider for yourself, from first principles, whether improvement along any given axis is the most valuable use of your limited resources, compared to other things you could be doing. For example, think hard about whether investing in improving deploy frequency is where you should spend your time now, and at what point you would see diminishing returns.

At this point, you might reasonably reply:

Yeah? Well, you know, that’s just like uh, your opinion, man. [The Big Lebowski]

Fair enough. Now the burden falls on me to substantiate my opinion. I’ll begin with a summary of the book’s line of argument, and then go into my criticisms.

What does the book claim?

Accelerate claims to have identified a set of practices which “predict” (more on that word later) high performance and positive business outcomes in organizations that make software.

Importantly, it claims that its findings are scientifically validated. The word “science” is in the title, and the authors spend a large fraction of the book discussing and defending their methodology. All of the practices advocated in this book have been described qualitatively and in greater detail by other sources; arguably the primary contribution of this work is its claim to empiricism and quantifiable rigor. Since this claim is so central to the book’s argument and influence, we should spend some time discussing methodology.

What did the authors do?

The authors introduce their methodology in Chapter 2, “Measuring Performance”; some details appear in later chapters and appendices. I’ll summarize the relevant parts here.

A first step towards measuring performance in software development is simply defining what “performance” means. The authors briefly outline and avoid some of the obvious traps, such as treating intermediate outputs (lines of code, tickets closed), or even inputs (utilization, i.e. how much developer time is allocated to work) as performance measures. To their credit, they recognize that high performance must consist of delivering value to users and the sponsoring organization.

Having made this argument, they move on to defining performance using four key metrics:

Deployment frequency - How frequently is software deployed to production?
Lead time for changes - How long does it take from the time code is committed to version control until it is deployed in production?
Time to restore service - How long does it take from the time a failure occurs until service is restored for customers?
Change failure rate - What fraction of changes deployed to production fail?
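To make these definitions concrete, here is a minimal sketch of how the four metrics could be computed from a deployment log. This is not how the authors gathered their data (they used surveys, as described next). The `Deploy` record, its field names, and the sample log are all hypothetical, and the sketch assumes that a failure begins at the moment of the failing deploy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deploy:
    """One production deployment (hypothetical record shape)."""
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # only set for failed deploys


def four_key_metrics(deploys, window_days):
    """Compute the four metrics over a log covering `window_days` days."""
    lead_times = [d.deployed_at - d.committed_at for d in deploys]
    failures = [d for d in deploys if d.failed]
    # Assumption: the failure clock starts when the failing change is deployed.
    restore_times = [d.restored_at - d.deployed_at
                     for d in failures if d.restored_at is not None]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(lead_times),
        "median_time_to_restore": median(restore_times) if restore_times else None,
        "change_failure_rate": len(failures) / len(deploys),
    }


# Made-up log: four deploys over two days, one of which failed.
t0 = datetime(2022, 6, 1, 9, 0)
day = timedelta(days=1)
hour = timedelta(hours=1)
log = [
    Deploy(t0, t0 + 1 * hour),
    Deploy(t0 + 2 * hour, t0 + 5 * hour),
    Deploy(t0 + day, t0 + day + 2 * hour, failed=True,
           restored_at=t0 + day + 2 * hour + timedelta(minutes=30)),
    Deploy(t0 + day + 4 * hour, t0 + day + 8 * hour),
]
metrics = four_key_metrics(log, window_days=2)
print(metrics)
```

Note that nothing in this record describes work done before the commit or after the deploy; the clock only runs between those two events.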

They collected data on these four metrics, as well as many other features, using surveys sent to “over 23,000 respondents” in various software development organizations. Survey respondents were reached via a combination of marketing campaigns and the authors’ social networks (transitively), so there is undoubtedly some unquantifiable selection effect here, but for the purposes of this review I’ll set this aside. At least, I think there are bigger bones to pick than the survey population.

Next, they applied hierarchical clustering — specifically, some version of Ward’s method — and found 3 clusters for their 4 metrics. They call these clusters “high”, “medium”, and “low” performers.
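For readers unfamiliar with Ward’s method, here is a rough sketch of the idea in plain Python: start with every respondent as its own cluster, then repeatedly merge the pair of clusters whose merger increases the within-cluster variance the least. The book does not say which variant the authors used, so this is only an illustration. The respondent scores (four metrics coded 1 to 5) are invented, and real analyses would more likely use a library routine such as SciPy’s `scipy.cluster.hierarchy.linkage` with `method="ward"`.

```python
from itertools import combinations


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def ward_clusters(points, k):
    """Greedy agglomerative clustering: merge the cheapest pair until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters


# Invented responses: (deploy frequency, lead time, restore time, failure rate),
# each coded from 1 (worst) to 5 (best).
respondents = [
    (5, 5, 5, 5), (5, 4, 5, 5), (5, 5, 4, 5),
    (3, 3, 3, 3), (3, 2, 3, 3), (3, 3, 2, 3),
    (1, 1, 1, 1), (1, 2, 1, 1), (1, 1, 2, 1),
]
groups = sorted(sorted(c) for c in ward_clusters(respondents, k=3))
for g in groups:
    print(g)
```

The procedure returns exactly as many clusters as it is asked for; labeling them “high”, “medium”, and “low” is an interpretive step that happens afterwards.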

Finally, the authors report a variety of statistical correlations in the survey data between cluster membership and various other “constructs” of interest.

Now, anyone with a passing familiarity with statistics will have the obvious objection that it is trivial to find spurious correlations in a high-dimensional data set. The authors explicitly acknowledge this danger, and claim to have avoided it by formulating hypotheses before designing the surveys to test them. This is not as strong as performing a blinded experiment with a preregistered hypothesis, but it’s something. The authors contend that their methodology reduces the scope for mining arbitrary correlations out of the data, and then fitting constructs to them.
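If the spurious-correlation objection is unfamiliar, a small simulation shows the danger. The sketch below generates purely random answers on a 1 to 7 scale and then counts how many pairs of questions look correlated anyway. The respondent and question counts are arbitrary, and the 0.36 threshold is roughly the conventional p < 0.05 cutoff for 30 samples; none of these numbers come from the book.

```python
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


rng = random.Random(0)
n_respondents, n_questions = 30, 40

# Every answer is independent noise: there is nothing real to find here.
answers = [[rng.randint(1, 7) for _ in range(n_respondents)]
           for _ in range(n_questions)]

pairs = list(combinations(range(n_questions), 2))
strong = [(i, j) for i, j in pairs
          if abs(pearson(answers[i], answers[j])) > 0.36]

print(f"{len(strong)} of {len(pairs)} question pairs look 'significantly' correlated")
```

With 780 pairs to choose from, something on the order of 5% of them should clear the threshold by chance alone, which is why committing to hypotheses before looking at the data matters.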

What did they find?

The authors found correlations between respondents’ answers pertaining to a wide variety of management practices, software delivery practices, and business outcomes. A full list of the reported correlations is beyond the scope of this post, but I’ll try to convey the flavor of their findings with an example: the following constructs are found to be positively correlated with successfully implementing “continuous delivery”:

Version control

Deployment automation

Continuous integration

Trunk-based development

Test automation

Test data management

Shift left on security

Loosely coupled architecture

Empowered teams

Monitoring

Proactive notification

where “continuous delivery” is defined as “a team’s ability to achieve the following outcomes”:

Teams can deploy to production (or to end users) on demand, throughout the software delivery lifecycle.

Fast feedback on the quality and deployability of the system is available to everyone on the team, and people make acting on this feedback their highest priority.

Again, all the constructs were measured using survey responses. Presumably, for example, there were one or more survey questions whose answers measure how loosely coupled a respondent’s architecture is; the answers to that question were correlated with other questions which measured whether a respondent’s team had fast feedback on the quality and deployability of the system.

Most of the book is a laundry list of correlations of this flavor, lightly connected with a tissue of management-advice-book-quality prose and references to other writing on software development (for example, a Steve Yegge rant is referenced).

If you want a quick list of takeaways — that is, if you just want to know what the authors think you should do — I’ve summarized all the authors’ recommended “capabilities” in an appendix. Many of these will strike experienced practitioners as obvious. But it is extremely useful to investigate the empirical evidence for “obvious” ideas, so that’s not really a criticism.

Critical analysis

I have tried to present Accelerate’s line of argument relatively sympathetically in the preceding section; but if you detected an undertone of skepticism, you wouldn’t be wrong.

Here is a summary of the main issues I have:

The things being measured are inadequate: The development performance metrics are incomplete and sometimes circular.
The instruments used to measure them are inadequate: The survey-based methodology is vulnerable to halo effects and other threats to validity.
The statistical analysis of the measurements is suspect: The presentation of statistical correlations is sloppy and misleading.
The published results are impossible to evaluate or replicate: The authors have released neither the full survey questions nor the data gathered from them; without these, nobody can precisely understand the measurements or the analysis thereof.

I’ll discuss these in turn.

Development performance metrics

Recall the authors’ definitions of software development performance:

deployment frequency

lead time for changes

time to restore service

change failure rate

One obvious objection is that these metrics do not even attempt to measure all the work that occurs before a code change is committed. The authors explain why:

. . . in the context of product development, where we aim to satisfy multiple customers in ways they may not anticipate, there are two parts to lead time: the time it takes to design and validate a product or feature, and the time to deliver the feature to customers. In the design part of the lead time, it’s often unclear when to start the clock, and often there is high variability. For this reason, Reinertsen calls this part of the lead time the “fuzzy front end” (Reinertsen 2009). However, the delivery part of the lead time—the time it takes for the work to be implemented, tested, and delivered—is easier to measure and has a lower variability.

In short, they decided to measure delivery time because it was amenable to measurement. I don’t want to go all drunkards-and-lampposts here, but this is a clear limitation of the methodology. In my experience, the most abject software development failures stem from building the wrong thing, not from building the thing wrong, and the former failure occurs outside anything that the authors even attempt to measure.

There is also some sleight of hand in the passage quoted above (the attentive reader will notice many little slippages like this in the book — too many for me to enumerate in this review). Time for work to be “implemented” and “tested” is classified as part of “delivery” time. However, the authors’ four metrics will naturally exclude both pre-commit implementation and testing, and post-deploy observation and analysis. For many changes, these two categories naturally comprise the bulk of the engineering effort, and any definition of “performance” that ignores them is dramatically incomplete.

Another objection is that many of the Accelerate correlations have a circular quality. For example, it is not only obvious but, I would argue, tautologically trivial to find that “deployment automation” is highly correlated with fast deployments. A deployment process that does not achieve (relatively) high speed is almost definitionally not automated; and if deployments are not high speed, it is hard to make them high frequency. Saying that deployment automation is correlated with high performance, where high performance is defined as having fast and frequent deployments, is circular reasoning.

M. Lamourine of Usenix ;login: book reviews had a similar reaction:

What most worries me is the possibility that the metric definitions are in some way forming a logical circle with the Agile methods supposedly under test. The descriptions of the metrics align fairly strongly with my understanding of the purpose and goals behind Agile philosophy and methods design. Is it possible that what is being measured is the effectiveness of the methods used to achieve the stated behavioral goals, without ever evaluating whether those behaviors actually improve the software being delivered? I’m inclined to view Agile methods as indeed effective and beneficial. I’m not sure how to show something stronger than “They do what they claim, and aim, to do.”

I’ll leave it to you to ponder in detail which practices in the appendix seem tautologically correlated with the authors’ performance metrics.

Taken together, the limited scope of measurement and the circularity of (some) measure definitions mean that, to a first approximation, the Accelerate view of high performance amounts to “the team can push a button and deploy changes to prod quickly”. If you care about any software quality outcome other than that, then Accelerate does not even claim to measure it.

Now, I want to be clear about what I’m saying here, because it is easy to misinterpret: the authors do go on to correlate other organizational outcomes with these core software delivery performance metrics. But when they refer, for example, to a cluster of “high performers”, they are talking about these core software delivery metrics, which, I claim, all amount to “can you deploy to prod quickly”.

(Q: What about “time to restore service” and “change failure rate”? Aren’t those qualitatively different from “deploying to prod quickly”?

A: Well…

Time to restore service has three components: (a) identify the issue; (b) fix the issue; (c) deploy the fix. A team that can deploy to prod quickly can do (c) quickly. Therefore, trivially, we ought to observe a correlation between faster deployment and lower time to restore service; it would be shocking if we did not, since it would mean that implementing fast deployment automation is somehow correlated with making the team slower at identifying or fixing issues. The authors present no evidence that (a) or (b) are accelerated by devops practices (although, to be clear, I do believe that practices like “having monitoring” should trivially be correlated with identifying and fixing issues faster). Therefore, in this context, measuring time to restore service is another way of measuring whether you can push a button and deploy to prod quickly.
In a technical aside that most readers are likely to gloss over, the authors admit that their measures of change failure rate do not pass statistical tests of construct validity, i.e. they could not statistically tease a reliable signal for change failure rate out of their data. I could also write more about my conceptual quibbles with change failure rate as a proxy for quality, but this review is already too long; I’ll leave you with the thought exercise of considering how a team could game this metric, and what that implies even for teams that are not intentionally gaming it.)

Do surveys reliably measure software development outcomes?

A further concern is that I don’t believe that surveys alone produce an adequate signal for objectively measuring performance.

The authors devote the entirety of Chapter 14 to defending surveys, so I’ll try to give them a fair hearing by presenting their arguments here. In summary, they argue the following:

1. Surveys allow you to collect and analyze data quickly.

2. Measuring the full stack with system data is difficult.

3. Measuring completely with system data is difficult.

4. You can trust survey data.

5. Some things can only be measured through surveys.

Item (1) is a practical admission that more reliable data collection methods could be prohibitively costly, rather than an argument that surveys are a first-best method. Let’s set that aside. Surely if we could obtain equally broad data of higher quality through some other means (for example, by employing independent auditors to observe actual software delivery performance), we would do so.

Items (2) and (3) seem like valid arguments against purely mechanical measurement methods. The authors point out, for example, that it is impossible to determine what fraction of artifacts are in version control by examining the version control system; the very nature of the question being investigated is whether data exists that is inaccessible to straightforward mechanical inspection. I will grant that these are fair arguments for using humans in the loop, although again they are not an argument that surveys are a first-best data gathering method.

Item (5) seems like a mixture of trivially true and obviously wrong. Certainly, characteristics like whether an organization “supports or embodies transformational leadership” seem impossible to measure without human judgment in the loop. If the alternative we are considering is mechanical measurement, then surveys seem obviously better. However, it also seems obviously false that surveys are literally the only way to measure such qualities (compared, for example, to ethnographic methods), and it is not obvious that surveys do successfully measure them. Still, for the sake of space, I will set this point aside too.

Item (4) — whether you can trust survey data — is where I have the most trouble. The authors point out that system data (for example, logging output) can be easily corrupted, and then say this:

In the case of survey data, a few highly motivated bad actors can lie on survey questions, and their responses may skew the results of the overall group. Their impact on the data depends on the size of the group surveyed. In the research conducted for this book, we have over 23,000 respondents whose responses are pooled together. It would take several hundred people “lying” in a coordinated, organized way to make a noticeable difference—that is, they would need to lie about every item in the latent construct to the same degree in the same direction. In this case, the use of a survey actually protects us against bad actors. There are additional steps taken to ensure good data is collected; for example, all responses are anonymous, which helps people who take the survey feel safe to respond and share honest feedback.

This is why we can trust the data in our survey—or at least have a reasonable assurance that the data is telling us what we think it is telling us: we use latent constructs and write our survey measures carefully and thoughtfully, avoiding the use of any propaganda items; we perform several statistical tests to confirm that our measures meet psychometric standards for validity and reliability; and we have a large dataset that pulls respondents from around the world, which serves as a safeguard against errors or bad actors.

This is, more or less, all they say about humans giving poor responses. But focusing on “bad actors” misses the elephant in the room, which is that humans just cannot answer some questions correctly, at least not without additional work that the subjects probably did not undertake for this survey.

For example, if you were to ask me about the distribution of times to recovery from failures in the systems that I currently work on personally, I would be able to make a ballpark guess. However, without doing a careful review of recent production incident history, I would not trust any figure that I quoted to be statistically representative, despite having been personally involved in the resolution of many incidents. This is in part because, in the systems I work on, both the causes and effects of system failure are diverse. How does a severe degradation in performance for a single customer over several hours compare to a total system outage of ten minutes? To communicate the reality of failure recovery in my engineering organization, you’d need, at a minimum, a 3-dimensional chart of incidents categorized by severity, scope, and duration. Based on my past experience with taking psychometric surveys, I am skeptical that these nuances were captured.

Now, the authors claim that errors must produce wrong answers on “every item in the latent construct to the same degree in the same direction” to skew the results. But this is objectively wrong, as an elementary mathematical fact. It is sufficient merely for bad answers to have a distribution that is asymmetric with respect to the truth.

(How likely is it that some errors are asymmetric? I challenge you to mentally model the likelihood of error being symmetric across every dimension of a high-dimensional space, given a tractable count of data points. If you think you have a good handle on this, try visualizing some of the counterintuitive properties of higher dimensions and think again.)
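The point about asymmetric errors can be illustrated with a toy simulation. In the sketch below, nobody lies and nobody coordinates; each respondent independently misremembers a true value. The true value of 10 hours, the error sizes, and the two error models are assumptions chosen purely for illustration.

```python
import random
from statistics import mean

rng = random.Random(1)
TRUE_HOURS = 10.0  # hypothetical true time to restore service


def symmetric_error():
    """Honest recall error, equally likely to be too high or too low."""
    return rng.gauss(0, 2)


def optimistic_error():
    """Honest, uncoordinated recall error that always errs on the low side."""
    return -abs(rng.gauss(0, 2))


def survey_mean(n, error):
    """Average answer from n independent respondents."""
    return mean(TRUE_HOURS + error() for _ in range(n))


for n in (100, 23000):
    sym = survey_mean(n, symmetric_error)
    opt = survey_mean(n, optimistic_error)
    print(f"n={n:>6}: symmetric errors -> {sym:.2f}h, optimistic errors -> {opt:.2f}h")
```

Under these assumptions, more respondents shrink the noise in the symmetric case, but the optimistic case converges confidently on the wrong answer. A large sample protects against scatter, not against bias.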

More generally, any survey-based instrument must cope with several ways that respondents' answers could be wrong. Here are a couple of important issues:

Divergent understanding of survey questions.
Halo effects and confounding.

The authors’ defenses — for example, that “measures meet psychometric standards for validity and reliability” — unfortunately do not defang these threats, for reasons that will become clear shortly.

Divergent understandings of survey questions

A fundamental challenge for any survey-based methodology is that survey questions use human language. When respondents answer a question, the signal is always mediated by the respondents’ interpretation of the questions, and there is no guarantee that the readers agree on that interpretation, either with each other or with the investigators.

To even begin evaluating whether survey questions measure anything (and I emphasize that this would only be a first step), we must examine the questions carefully. Unfortunately, the text does not include the full survey questions (I’ll return to this issue later). However, in Chapter 13’s summary of psychometric methods, the authors do present sample questions for two of their recommended capabilities:

To measure the “Westrum organizational culture” construct, the authors asked the respondents the following:
On my team…
- Information is actively sought.
- Messengers are not punished when they deliver news of failures or other bad news.
- Responsibilities are shared.
- Cross-functional collaboration is encouraged and rewarded.
- Failure causes inquiry.
- New ideas are welcomed.
- Failures are treated primarily as opportunities to improve the system.
Using a scale from “1 = Strongly disagree” to “7 = Strongly agree,” teams can quickly and easily measure their organizational culture.
To measure the “proactive notification” construct, they asked (presumably on the same 1-7 scale), whether respondents agreed with the following:
- We get failure alerts from logging and monitoring.
- We monitor system health based on threshold warnings (ex. CPU exceeds 90%).
- We monitor system health based on rate-of-change warnings (ex. CPU usage has increased by 25% over the last 10 minutes)

Readers’ opinions may differ, but it seems to me that many of these are worded vaguely enough for respondents to diverge widely in their understandings of key terms like “actively sought”, “punished”, or “welcomed”.

And even relatively objective facts like “[w]e get failure alerts from logging and monitoring” seem subject to what I’ll call the “lying to your dentist” effect. Everyone knows they are supposed to floss, so people may interpret an ambiguous question like “do you floss?” as charitably as possible — for example, to mean “do you intend to floss, and even manage to do it once or twice a year?” — so that they can give the right answer.

Similarly, “we get failure alerts” is ambiguous enough to be interpreted as anything ranging from “we have gotten a nonzero number of failure alerts sometime in my tenure at this company” to “we routinely set up alerts for important system failures”. Most every software engineer knows that they are supposed to have automated alerts for system failures. When asked, will they “lie to their dentist” or will they grade themselves as strictly as possible? Surely it differs from person to person. What is being reliably measured here?

This issue of motivated interpretation relates to (but is distinct from) my next point…

Halo effects and confounding

Consider the “halo effect”:

The halo effect (sometimes called the halo error) is the tendency for positive impressions of a person, company, brand or product in one area to positively influence one’s opinion or feelings in other areas.

P. Rosenzweig wrote a whole book which cogently demolishes vast swathes of business management advice based on the halo effect and related threats to research validity. Many of those threats to validity seem like significant risks of the Accelerate methodology. I came across Rosenzweig’s book via A. Swartz’s book reviews, and I can’t do the book justice here (just read it, especially if you read Accelerate), but I’ll observe that no matter how objective you try to make your survey questions, it is incredibly hard to avoid halo effects.

To their credit, the Accelerate authors are not entirely naive about this issue. From Chapter 14:

By only surveying those familiar with DevOps, we had to be careful in our wording. That is, some who responded to our survey might want to paint their team or organization in a favorable light, or they might have their own definition of key terms. For example, everyone knows (or claims to know) what continuous integration (CI) is, and many organizations claim CI as a core competency. Therefore, we never asked any respondents in our surveys if they practiced continuous integration. (At least, we didn’t ask in any questions about CI that would be used for any prediction analysis.) Instead, we would ask about practices that are a core aspect of CI, e.g. if automated tests are kicked off when code is checked in. This helped us avoid bias that could creep in by targeting users that were familiar with DevOps.

I trust the authors far enough to believe that they tried to ask questions about objective outcomes. But even survey responses to seemingly objective questions can be skewed by halo effects.

Consider my time-to-recovery example from above: what would I say on a survey asking how long it takes to restore service after a failure? If my morale were low, I might recall the bad incidents that dragged on longer than we’d like, and had engineers burning the midnight oil. If my morale were high, I might recall the incidents where we quickly identified the problem and rolled back or failed over in minutes. Either way, it’s just a survey and there’s no real career capital on the line for me if I get it wrong, so I’ll probably go with my gut and click. Halo effects ahoy.

And that’s for a question where there is, arguably, an objective answer that could be derived with sufficient cross-validation against ground truth data. I struggle to imagine how a question could measure, for example, how organizations “support a generative culture” objectively, without being vulnerable to halo effects.

The authors find that high software delivery performance correlates with both high “organizational performance” and high “noncommercial performance”. (To avoid a long digression, I won’t get into how they measure these, but think of them as rough proxies for “do we make money?” and “do we retain staff?” respectively.) At first blush, this seems to make the research all the more compelling: follow this recipe, and not only will you deliver software faster, your business will improve all around! But paradoxically, when you consider halo effects, it raises more threats to research validity. Staff employed by successful businesses will generally have more positive opinions of the organization and their work, leading to greater halo effects.

Lastly, consider that businesses achieving positive business and staffing outcomes will have more resources, and will tend to be more effective in applying resources to problems, in general. This means that any divergence between “high performing” and “low performing” businesses will have multiple plausible causes. And it will be extraordinarily hard to disentangle the effect sizes of all these causes. (For what it’s worth, this last point combines Rosenzweig’s “Delusion of Single Explanations” and “Delusion of the Wrong End of the Stick”.)

Note that this issue — that of confounding factors — may be as serious a threat to validity as all my other complaints about surveys put together. I found no evidence in the text that the authors seriously tried to account for confounders like these.
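Here is a toy model of the confounding problem, under deliberately simple assumptions: a hidden “resources” variable drives both adoption of a practice and the business outcome, while the practice itself has no effect on the outcome at all. The variable names and effect sizes are invented for illustration and say nothing about the authors’ actual data.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


rng = random.Random(2)
n = 5000

resources = [rng.gauss(0, 1) for _ in range(n)]       # hidden confounder
practice = [r + rng.gauss(0, 1) for r in resources]   # richer orgs adopt the practice more
outcome = [r + rng.gauss(0, 1) for r in resources]    # richer orgs do better; practice plays no part

r_raw = pearson(practice, outcome)

# Partial correlation: the practice/outcome correlation after controlling for resources.
r_pz = pearson(practice, resources)
r_oz = pearson(outcome, resources)
r_partial = (r_raw - r_pz * r_oz) / ((1 - r_pz ** 2) * (1 - r_oz ** 2)) ** 0.5

print(f"practice vs outcome:           r = {r_raw:.2f}")
print(f"controlling for resources:     r = {r_partial:.2f}")
```

In this made-up world the raw correlation is substantial and the controlled one is near zero. The control step is only possible because the simulation records the confounder; a survey that never measured it cannot make the same correction.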

Construct validity tests do not save us

Let’s revisit the following claim by the authors:

we perform several statistical tests to confirm that our measures meet psychometric standards for validity and reliability

Most readers, I assume, will not be familiar with what this means. Such readers may be tempted to conclude that there is some mysterious statistical alchemy in the notion of “construct validity and reliability” that addresses all of the issues I’ve raised with surveys.

Construct validity, as described by the authors, consists of “convergent validity” and “divergent validity”. I am not a statistics expert, but based on my reading of the text and my own independent forays into Wikipedia and other sources, I would summarize these concepts as follows:

Convergent validity tests detect whether multiple measurements (such as multiple survey questions) are statistically correlated enough that they behave as though they might be noisy measurements of some common underlying phenomenon (“construct”).
Divergent validity tests also detect whether multiple measurements are statistically correlated in a way that suggests they are measuring the same thing, but apply to cases where you’re trying to ensure that this does not occur. If you claim that X and Y are different constructs, but your measurements of X and Y always line up, then they are probably not, in fact, different constructs.

The authors’ Appendix C offers some details about how they did construct validity testing. I’ve reproduced the relevant passage in an appendix in case you want to read it yourself. I must confess here that I am not an expert in statistics, and even after close reading of the text, I found it challenging to understand exactly what the authors did here. Nevertheless, mastery of all the details does not seem necessary to make the key observation: even if we grant that all of the right methods were used, and used correctly, these construct validity tests confirm just two things:

Answers to the questions which allegedly describe each construct are highly correlated.
Answers to questions which allegedly describe different constructs are not so highly correlated.

Neither of these fundamentally addresses any of the issues that I’ve raised. All the authors’ construct validity tests check, basically, whether the patterns in the dataset exhibit some desirable internal consistency traits. They do not at all address the fact that the things being measured may not accurately reflect ground truth (whether due to high-dimensional randomness, divergent understandings, halo effects, or confounders).
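To see how internal consistency and accuracy can come apart, consider an extreme toy case in which every answer is driven by the respondent’s general mood (a halo) and not at all by the real state of the team. The sketch below uses Cronbach’s alpha, a common internal-consistency statistic, as a stand-in; it is not necessarily the statistic the authors used, and the three-item construct and noise levels are assumptions.

```python
import random
from statistics import pvariance


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def cronbach_alpha(items):
    """items: k lists (one per question), each holding one score per respondent."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    return k / (k - 1) * (1 - sum(pvariance(q) for q in items) / pvariance(totals))


rng = random.Random(3)
n = 5000

truth = [rng.gauss(0, 1) for _ in range(n)]  # real state of each team, never observed
halo = [rng.gauss(0, 1) for _ in range(n)]   # respondent's mood, independent of truth

# Three survey items for one construct, each answered from mood plus a little noise.
items = [[h + rng.gauss(0, 0.5) for h in halo] for _ in range(3)]
scale = [sum(scores) / 3 for scores in zip(*items)]

alpha = cronbach_alpha(items)
r_between_items = pearson(items[0], items[1])
r_with_truth = pearson(scale, truth)

print(f"Cronbach's alpha:            {alpha:.2f}")
print(f"item 1 vs item 2:            r = {r_between_items:.2f}")
print(f"construct score vs truth:    r = {r_with_truth:.2f}")
```

In this contrived example the items hang together very well and would look like a clean construct, yet the resulting score carries essentially no information about the thing it is named after.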

And while we’re on the subject of statistics…

The curious case of “inferential predictive analysis”

A pivotal moment in Accelerate occurs in a sidebar in Chapter 2 describing the many figures interspersed throughout the text. Despite being a sidebar, this snippet of text is key to the book’s explanations, many of which are communicated via diagrams:

We will include figures to help guide you through the research.

When you see a box, this is a construct we have measured. (For details on constructs, see Chapter 13.)

When you see an arrow linking boxes, this signifies a predictive relationship. You read that right: the research in this book includes analyses that go beyond correlation into prediction. (For details, see Chapter 12 on inferential prediction.) You can read these arrows using the words “drives,” “predicts,” “affects,” or “impacts.” These are all positive relationships unless otherwise noted. For example, Figure 2.4 could be read as “software delivery performance impacts organizational performance and noncommercial performance.”

A casual reader would reasonably conclude that the authors claim to have demonstrated causality. Three of these words — “drives”, “affects”, and “impacts” — signify causal relationships in everyday English. “Predict” is more ambiguous: “predicting” an outcome refers to the act of saying that it will occur, rather than to bringing that outcome about, but in this context most readers likely lump it with the others.

I confess that when I read the above sidebar, I was taken aback. Software engineering is a messy, complex, and multidimensional endeavor, stubbornly resistant to quantitative empirical analysis. In my opinion, a paucity of detailed objective data and the curse of dimensionality conspire to defeat most quantitative software engineering research.

If the authors had found rigorously empirical causal relationships between a constellation of concrete practices and successful software development, it would have been a stunning accomplishment.

And: “For details, see Chapter 12 on inferential prediction.”

Being the kind of person I am, I went immediately to Chapter 12, excited to read the details.

Here the Accelerate authors situate their work in a framework by professor J. T. Leek, of the Johns Hopkins biostatistics department, allegedly outlining six levels of statistical analysis (verbatim quote from Accelerate):

Descriptive

Exploratory

Inferential predictive

Predictive

Causal

Mechanistic

The citation for this breakdown is given as “Leek 2013” in the text, and the corresponding bibliography reference from the current Kindle edition (spring 2022) is as follows:

Leek, Jeffrey. “Six Types of Analyses Every Data Scientist Should Know.” Data Scientist Insights. January 29, 2013. https://datascientistinsights.com/2013/01/29/six-types-of-analyses-every-data-scientist-should-know/

The most interesting distinction for our purposes is between levels 2 and 3:

“exploratory” analysis, in which the investigator freely hunts in the data for patterns, and
“inferential predictive” analysis, which allegedly does something more, and is the level of analysis that Accelerate itself claims to have accomplished.

In their description of “inferential predictive” analysis, the Accelerate authors write:

The third level of analysis, inferential, is one of the most common types conducted in business and technology research today. It is also called inferential predictive . . . Inferential design is used when purely experimental design is not possible and field experiments are preferred . . .

To avoid problems with “fishing for data” and finding spurious correlations, hypotheses are theory driven. This type of analysis is the first step in the scientific method. Many of us are familiar with the scientific method: state a hypothesis and then test it . . . hypothesis must be based on a well-developed and well-supported theory.

Whenever we talk about impacting or driving results in this book, our research design utilized this third type of analysis. While some suggest that using theory-based design opens us up to confirmation bias, this is how science is done. Well, wait—almost. Science isn’t done by simply confirming what the research team is looking for. Science is done by stating hypotheses, designing research to test those hypotheses, collecting data, and then testing the stated hypotheses. The more evidence we find to support a hypothesis, the more confidence we have for it. This process also helps to avoid the dangers that come from fishing for data—finding the spurious correlations that might randomly exist but have no real reason or explanation beyond chance.

This writing is not a model of clarity (and, despite my elisions, I do not think I have omitted anything which makes it clearer), but my best attempt at a paraphrase is: “inferential predictive” analysis must be prompted by a “well-developed and well-supported theory”. In other words:

If you find a correlation between A and B, it’s just a correlation.
If you have a theory which predicts a correlation between A and B, and then you find that correlation in data, it’s an inferential predictive analysis.

Clearly, there is a conceptual problem here: the notion of “well-developed and well-supported theory” is inevitably subjective. A correlation does not become more than a correlation just because someone holds a theory beforehand which is consistent with it; for one thing, there may be other theories which also explain the data. I admit that rigorously defining the relationship between correlation and causality is conceptually sticky, and frankly beyond my philosophical sophistication, but saying “we had a theory!” does not transmute a correlation into some more elevated substance by fiat.
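The point about competing theories can be illustrated with a small simulation, offered as a sketch under invented assumptions rather than as a model of any real organization. Here a hidden factor (call it resources, a hypothetical name) raises both adoption of some practice and delivery performance, while the practice itself has zero effect. Someone holding the theory "the practice drives performance" would find exactly the correlation they predicted.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, xs):
    """Residuals of ys after a least-squares straight-line fit on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]


def simulate(n=5000, seed=1):
    rng = random.Random(seed)
    # Hidden confounder (hypothetical): how well-resourced each organization is.
    resources = [rng.gauss(0, 1) for _ in range(n)]
    # Both observed variables depend on resources, but not on each other.
    practice = [r + rng.gauss(0, 1) for r in resources]
    performance = [r + rng.gauss(0, 1) for r in resources]
    return resources, practice, performance


resources, practice, performance = simulate()

# Correlation a survey would observe between practice and performance.
raw_r = pearson(practice, performance)

# Correlation that remains once the confounder is regressed out of both.
partial_r = pearson(residuals(practice, resources),
                    residuals(performance, resources))

print(f"raw correlation: {raw_r:.2f}")
print(f"after controlling for resources: {partial_r:.2f}")
```

By construction the raw correlation sits around 0.5 and the controlled one around zero. The data alone cannot distinguish "the practice works" from "well-resourced organizations do both things", and having written down the first theory in advance does not change that.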

Now, reasonable people can disagree on what constitutes high-quality evidence of some mechanism stronger than correlation. The most systematic practical framework I’m aware of is the hierarchy of evidence proposed by the medical community (and enthusiastically taken up by the “rationalist community”). In this framework, the Accelerate authors’ study is somewhere between a single cross-sectional study and a single case-control study, with the additional complicating factor that these authors used survey results in lieu of direct measurement of objective symptoms. This is one of the lower levels in the hierarchy.

For further discussion of correlation and causality in complex systems, I’ll toss you down this Gwern rabbit-hole.

Moving along, another problem is that the authors inaccurately cite “Leek 2013”, and, more importantly, either misunderstand or misrepresent the statistical concepts therein. First, let’s quote from the very first paragraph of the cited blog post:

Jeffrey Leek, Assistant Professor of Biostatistics at John Hopkins Bloomberg School of Public Health, has identified six(6) archetypical analyses. As presented, they range from the least to most complex, in terms of knowledge, costs, and time. In summary,

Descriptive

Exploratory

Inferential

Predictive

Causal

Mechanistic

Where is “inferential predictive”? Nowhere. In fact, Leek specifically distinguishes “inferential” from “predictive” analysis.

OK, but that’s a blog post (datascientistinsights.com appears to be a low-quality secondary source — not, despite the citation, published by Leek himself). Maybe Leek uses the term “inferential predictive” somewhere else?

As it happens, the blog post just reproduces lecture slides from Leek’s homepage, which in turn appear to be course materials for Leek’s Coursera course. I haven’t taken the course, but from the written materials available, it seems clear that Leek carefully distinguishes inferential analysis from predictive analysis, and certainly never uses the combined term “inferential predictive”. Leek and his collaborators have proposed the same sixfold taxonomy in other venues: a Science article (pdf) coauthored with R. D. Peng, and an online textbook by R. D. Peng and E. Matsui. All of these sources agree: inferential is different from predictive, and the term “inferential predictive” does not appear anywhere. I also have it on good authority (private correspondence with a statistician) that “inferential predictive” is not a standard term in the field.

(Cue “stop trying to make ‘inferential predictive’ happen, it’s not going to happen” meme.)

However, this sloppy citation is less significant than the conceptual mischaracterizations that result.

Again, as far as I can tell, Accelerate characterizes inferential analysis as one which validates a prior hypothesis by examining data. In contrast, Leek, Peng, and Matsui characterize inferential analysis as the practice of using sampled data to estimate something about a larger, unobserved population. Maybe one could, with sufficient imagination, construct a relationship between the Accelerate definition and the Leek/Peng/Matsui definition, but it seems to me that they’re talking about fundamentally different things. It’s not that the Accelerate definition is a poor paraphrase; the very concepts involved seem totally different.

Moreover, although Accelerate does not claim to be doing “predictive” analysis, it’s worth noting that they mischaracterize that too. In an aside on predictive analysis, they say:

Predictive analysis is used to predict, or forecast, future events based on previous events. Common examples include cost or utilities prediction in business. Prediction is very hard, particularly as you try to look farther away into the future. This analysis generally requires historical data.

This is just wrong. None of the Leek et al. references describe predictive analysis as specifically related to the past, the future, or historical data. In the Leek/Peng/Matsui taxonomy, predictive analysis tries to predict values for specific individuals, whereas inferential analysis tries to predict properties of a population as a whole. This individual vs. population distinction makes no appearance in Accelerate’s recap.

Popping back up a level, note that in the Leek/Peng/Matsui framework, both inferential and predictive analysis just work with correlations. Indeed, the implication that correlations are somehow categorically separate from inference has no basis in statistical theory: statisticians frequently perform inference using correlations.
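As a rough sketch of that distinction (my own toy example with invented numbers, not something taken from Leek, Peng, Matsui, or Accelerate), the snippet below does both kinds of analysis on one simulated sample. The inferential step estimates a property of the unobserved population, here an approximate 95% interval for the population correlation via the standard Fisher z-transform. The predictive step fits a least-squares line and uses it to guess the value for one specific individual. Both are built from the same correlational ingredients.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def fisher_interval(r, n, z_crit=1.96):
    """Approximate 95% interval for a population correlation (Fisher z)."""
    z = math.atanh(r)
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)


def fit_line(xs, ys):
    """Least-squares slope and intercept for predicting y from x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


# Simulated sample of 200 individuals; the 0.6 and 0.8 are arbitrary choices.
rng = random.Random(2)
xs = [rng.gauss(0, 1) for _ in range(200)]
ys = [0.6 * x + rng.gauss(0, 0.8) for x in xs]

# Inferential: what can we say about the population this sample came from?
r = pearson(xs, ys)
low, high = fisher_interval(r, len(xs))
print(f"sample r = {r:.2f}, population r plausibly in [{low:.2f}, {high:.2f}]")

# Predictive: what value do we expect for one new individual with x = 1.5?
slope, intercept = fit_line(xs, ys)
predicted = intercept + slope * 1.5
print(f"predicted y for x = 1.5: {predicted:.2f}")
```

Neither step says anything about whether x causes y, which is the point: in this taxonomy, inference and prediction are both things you do with associations in data.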

I have no idea how the Accelerate authors arrived at “inferential predictive”. The most charitable explanation I can imagine is that whichever author originally wrote Chapter 12 was working from some accidentally garbled notes based on Leek’s lectures. But however it happened, the end result is that this term and its definition have been invented, perhaps accidentally, by the authors.

Why does this matter? Recall the language of the quotes which started this section: “drives”, “impacts”, “affects”. Throughout the main body of the text, the authors pervasively use terms that imply causality in ordinary English. Later, they walk this language back in a statement which admits the absence of causal analysis, but also confusingly misappropriates technical jargon from statisticians. Along the way, the authors evince some fundamental misunderstandings of the statistical framework they rely on. Each successive step demands progressively more work in order for the reader to spot the missteps. In the end, I had to hunt for the original sources behind a low-quality secondhand citation to find the authors’ error.

Do these rhetorical moves inspire confidence that the rest of their work was conducted and communicated with rigor and conscientiousness?

This misdirection and confusion is not limited to the text itself. On Twitter, you can see Accelerate author J. Humble writing things like this:

It’s not correlation, nor is it causation, it’s inferential prediction. Longer merge times predict worse performance.

– https://twitter.com/jezhumble/status/1415915467802234886

and this:

Just to get this off my chest, sometimes people look at the DORA research & say, “correlation doesn’t imply causation.”

We were careful to say which results are correlations vs predictive inferential. The arrows on the diagram on this page are the latter: https://devops-research.com/research.html

…

Much of the DORA research uses inferential predictive analysis. While the results aren’t as “strong”, they do tell you why (hey are theory-based).

Inferential predictive methods are also foundational for particle physics: see https://nature.com/articles/s42254-021-00305-6 & https://arxiv.org/abs/1609.04150

So if you’re prepared to accept the results of the experiments done at CERN, you should also be fine with the DORA research :-)

(Obviously statistics is enormously complex and I will deserve any shit I get from professional statisticians as a result of this statement, but inasmuch as anything you can prove with stats is true, this statement is true.)

– https://twitter.com/jezhumble/status/1397626349515206659

Inasmuch as anything you can prove with stats is true? Humble is way out over his skis here. I am very far from a professional statistician, but the above statements don’t pass muster, and you absolutely do not need a statistics degree to question them. The question of correlation vs. causation is fundamental, and one cannot wield “inferential predictive” (a term the authors just made up!) as a talismanic incantation against it.

Humble also makes a totally wild comparison to statistical methods used in high energy physics simulations, and I feel comfortable using the phrase “totally wild” because:

The phrase “inferential predictive” does not appear in the “particle physics” links that he references (one of which is just a bog standard “statistics for physicists” tutorial); of course, it cannot, because the authors made it up.
Correlations inferred from particle simulations can be refined by repeated comparisons against ongoing physical experiments (not surveys of humans!) which have ridiculously greater precision and longitudinal detail than anything that we could dream of in software engineering research.
Particle physics experiments occur in carefully controlled conditions wherein all the agents are far simpler than the stakeholders in software development, and yet I would venture a guess that experimental physicists report their findings with a decent dose of humility and respect for measurement error.
Particle physicists do not typically publish results with p < 0.1, which is not a mortal sin but is at least a venial one.
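For a sense of scale on that last point, here is some back-of-the-envelope arithmetic. The five-sigma figure is the conventional discovery threshold in particle physics; the count of 20 tests and the assumption that they are independent are mine, chosen purely for illustration, and say nothing about how many hypotheses any particular study tested.

```python
from statistics import NormalDist


def chance_of_any_false_positive(alpha, n_tests):
    """P(at least one false positive) across n independent tests of true nulls."""
    return 1 - (1 - alpha) ** n_tests


def one_sided_p(sigma):
    """One-sided tail probability of a standard normal beyond `sigma`."""
    return NormalDist().cdf(-sigma)


# Hypothetical: 20 independent tests, none with a real effect, threshold 0.1.
survey_style = chance_of_any_false_positive(0.1, 20)

# Conventional five-sigma discovery threshold in particle physics.
physics_style = one_sided_p(5)

print(f"P(at least one spurious hit at p < 0.1, 20 tests): {survey_style:.3f}")
print(f"p-value corresponding to five sigma: {physics_style:.2e}")
```

Under these assumptions, a p < 0.1 threshold produces at least one spurious "finding" about 88% of the time, while the five-sigma threshold corresponds to a p-value of roughly 3 in 10 million. The two regimes differ by several orders of magnitude.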

Am I being too nitpicky about arcane terminological details? I’ll close this section by turning the mic over to some online reactions from readers. You can be the judge of whether this work is being received with an appropriately nuanced understanding of the statistical results.

This book feels like two books under the same cover. The first part, “What we found,” is practical, while the second, “The research,” explains in detail the science behind the book. Finally, there’s a third smaller part, “Transformation,” with a case study. I read the first part and only skimmed [through] the second and the third.

Summarizing this book in a phrase, I would say, “Everything you knew about development is right!” The book doubles down on the importance of DevOps and agile and proves their superiority by science.

– https://roman.pt/posts/accelerate/

Good choice! I really liked it, especially the finding that organizations with high tempo also have high stability. But one thing was unclear to me. Does Inferential predictive establish causation, or is it only showing correlation?

– https://twitter.com/henrikwarne/status/1203258403310768128

Don’t take my word for it - take a look at five years’ worth of academically defensible State of DevOps research that shows causality of feature branching with poor performance. So yes, it is always harmful. You may be able to compensate in other ways, but why would you?

– https://twitter.com/tastapod/status/1415588779176480770

Peer programming, daily checkups, a rock solid CI, and, above all, trust in the professionalism of your team are some ingredients for high quality, high throughput software development.

This is not an opinion. It’s a scientifically proven fact. As laid out in the book Accellarate.

– https://news.ycombinator.com/item?id=31051045

The impossibility of evaluation or replication

Earlier, I mentioned in passing that I don’t have access to the Accelerate survey questions. In fact, as far as I can tell, the authors have never publicly released either their full questions (other than to the survey takers, who presumably only have ephemeral access), or the datasets that they gathered.

As noted earlier, the text itself gives full questions for just two constructs. Beyond this, it occasionally mentions in passing what appear to be paraphrases or fragmentary quotations of a few others. I will not enumerate all these passing references, since I don’t think they materially increase the availability of questions overall.

Searching beyond the book, I have been unable to find the full questions in the two papers by Forsgren et al. from the book’s bibliography (which are the only citations of previous psychometric research by the authors themselves): see CAIS 2016 and WDSI 2016. The CAIS article does contain some survey questions (see “Appendix B: Survey Instrument”), but they appear to be for a different study; at least, the results presented look quite different to me from those presented in Accelerate.

This is a more serious problem for this research than it may initially seem.

Recall that the objects of analysis in Accelerate are “constructs” like “loosely coupled architecture”. As an exercise, I suggest you reflect on how you would devise a set of survey questions that you’d ask an engineer in order to measure how loosely coupled their project’s software architecture is.

I’ll wait; go ahead and think for a moment.

If you’re an experienced programmer, you probably came up with some ideas. I came up with some too. But I’m not going to mention them, because I don’t want to focus on any particular ideas. I have two much more important follow-up questions for you to consider:

How confident are you that you came up with the same list that I did?
How confident are you, a priori, that answers to either of our surveys, taken by a diverse population of engineers, would correlate with the same objectively measurable characteristics of the codebases that they work on?

If you reflect for a while on your answers to these questions, I think you’ll come to the same conclusion that I did: it is impossible to confidently evaluate the accuracy of the Accelerate authors’ constructs without access to the questions.

Now, I believe the authors are smart and experienced people. J. Humble and G. Kim, for example, have long and successful careers in engineering leadership. They probably know a thing or two about engineering! But to accept the validity of survey constructs based on the qualifications of the authors would be an argument from authority.

Per the Royal Society, nullius in verba: it is incumbent upon people making scientific claims, particularly claims as sweeping and allegedly conclusive as those the Accelerate authors make, to “show their work” so that readers can evaluate the work on its merits.

Moreover, consider the plight of a researcher who, inspired by Accelerate, decided that they wanted to replicate the results or do follow-up work. How would they go about doing this? I think it’s clearly impossible, at least given the publicly available data. A researcher could, at best, engineer their own study from scratch which attempted to investigate the same questions.

So, in summary, the published results seem impossible to evaluate, let alone to replicate.

Interlude: A story about engineering interviews

At this point, you might be saying something to yourself like this:

Okay, I get it: you can raise a whole lot of technical nitpicks about this research. But I don’t need proof, which is impossible anyway. I just need results that are good enough to act on. So many smart people at the best tech companies believe in this devops stuff! So many people were surveyed! There has got to be something to these results.

I think the only way to address this reaction is by switching tack from technical arguments about validity to jolting your intuition. So I’m just going to tell a story.

A few decades ago, Microsoft was legendary for its approach to technical interviewing, which relied on a mixture of brain teasers and whiteboard questions involving linked list manipulation and the like. Proponents of this interviewing practice had a variety of a priori arguments for why this was a good idea; one might imagine that they went something like this:

Writing performant code means writing C code, and when you write C, it is often necessary to bang out some code by hand that directly manipulates linked lists.
The linked list is a fundamental data structure in computer science, and it is the simplest case of the important broader category of linked data structures. A programmer who cannot fluently manipulate linked lists probably cannot manipulate more complicated linked structures at all, and is hopelessly lacking in fundamental computer science skills.
The ability to manipulate discrete structures in your head and communicate that thinking on a whiteboard is a strong indicator of general intelligence. Brain teasers are too. And general intelligence is the most powerful tool for doing intellectual work like programming.

Put a bunch of ideas like this together, and you might even call it a theory.

Microsoft was the highest-performing software business at the time, by far; it may be difficult for younger readers, accustomed to the multipolar tech world of the 2020s, to understand viscerally how massive a juggernaut they were. Companies high and low imitated Microsoft’s approach to hiring, to the extent that they were able. Presumably, they could afford to be as picky as Microsoft only if they could bid competitively with Microsoft in the talent market.

If you had sent out a survey ca. 1998 measuring the effectiveness of technical interview styles, I suspect that you would have found a strong correlation between adoption of the Microsoft interview format and a variety of other indicators, including software delivery performance, profitability, and employee satisfaction. (All those Microsoft millionaires were pretty satisfied!) With those survey results in hand, you could credibly claim to have a theory-backed statistical analysis showing that Microsoft-style technical interview questions “drive” the creation of high performance software teams.

In the years since, the winds have shifted, and this interview style is almost universally reviled. With good reason: the industry has largely learned, from hard experience, that it spuriously excludes huge numbers of skilled programmers and overfits on narrow measures of engineering skill.

I hope that the analogies to Accelerate are clear. Even if devops practices are more well-founded than the 1990s Microsoft interview, surveys alone cannot be considered conclusive evidence.

Closing thoughts: Where do we go from here?

OK, I’ve written a few thousand words now, and perhaps you have even read them. What am I hoping to accomplish here? How would I like the devops discourse to change?

I have criticized the Accelerate research and its presentation extensively, so I should reiterate up front that I believe in most of the practices advanced under the “devops” banner (by the Accelerate authors and many others). I mean, like, “version control”? Yes, you should use it! And continuous integration, and trunk-based development, and so forth, I believe all that stuff is good, and we do it at my day job.

But I don’t believe in it because of this research. There are too many issues with the methodology and presentation to be fully credible.

Admittedly, the Accelerate research is evidence of something — at a minimum, evidence of how engineers might verbally describe their practices in a broad sample of software development teams. But I claim that it is not convincing evidence of the superiority of the exact software development methods studied.

A. Tabarrok has coined the aphorism “trust literatures not papers”, and that applies here. Of the original book’s extensive claims, only a small subset, which appeared in just two papers by Forsgren et al., were formally peer reviewed at all. Some of the authors’ subsequent work has also been peer reviewed, but all of it constitutes essentially a single line of research, using methods that I find dubious. This does not add up to a literature, let alone a convincing one.

So, what would make this research more convincing? Well, here are a few things the authors could do:

Publish errata or a revised edition of Accelerate correcting clear defects, like the presentation of “inferential predictive” statistics, and clarifying the statistical methods overall.
Moderate the tone of their communications, both in the book and in other venues like social media, to more clearly communicate the caveats in their research methodology.
Release their surveys and the resulting datasets so that others can both replicate their methods and analyses, and independently correlate survey results against other forms of “ground truth” data, like actual site reliability metrics.

I do not have high hopes that the authors will do any of this, especially at the suggestion of an essay by an Internet rando. But I am somewhat more hopeful that I can convince you, dear reader, to alter your thinking and behavior, just a little bit. Here are some suggestions.

First, help vet my analysis. Nullius in verba applies to this review as well. Maybe my criticisms are mistaken because of something I missed or misunderstood; conversely, maybe there are problems with the Accelerate findings that I haven’t even mentioned. I encourage anyone so inclined to study Accelerate and the “State of DevOps” reports (and, I suggest, Rosenzweig’s The Halo Effect). Reproduce or refute my criticisms, or develop your own, and report your findings. I would especially like it if somebody with deeper statistical expertise than me examined the statistical details in the book’s appendices.

Second, stop citing Accelerate and the “State of DevOps” reports uncritically. Until the literature is shored up considerably, it’s misleading to your readers to cite these results as settled questions with hard evidence behind them. I promise you that it is possible to make the case for your preferred set of software development practices some other way.

Third, stop giving credence to other people who cite Accelerate and the “State of DevOps” reports with more certainty than the work warrants. I recently watched a terrible video by a YouTube personality (not one of the Accelerate authors) who dogmatically and arrogantly insisted that it was not “scientific” or “rational” to dispute any aspect of their preferred development style, citing these sources as proof. As I hope this review makes clear, anybody who talks this way is ironically revealing their own deficient skills in close reading and critical thinking. I suggest you stop paying attention to people like that.

Fourth, if there are any software engineering researchers out there reading this: consider independently researching the effectiveness of devops practices. Frankly, I think that merely replicating the Forsgren et al. methodology, while perhaps valuable in some abstract sense, would advance human knowledge less than trying to study these important subjects with methods that are less subject to the problems that I have discussed.

Lastly, I reiterate my plea to all software practitioners: think carefully, from first principles, about where you should spend your efforts right now, taking into account the context of your particular development team’s problems. Do not delegate your decision-making blindly to anyone who claims to have all the answers. Take evidence from research and other sources seriously, but do so critically, paying attention to the details which may make the results more or less applicable to your context.

I’ll close with a quotation from a different book about software engineering evidence: Making Software: What Really Works, and Why We Believe It, edited by A. Oram and G. Wilson:

By now you will probably agree that high credibility is far from easy to obtain. However, that does not mean there are no (or almost no) credible studies; it is just that the credible ones are almost always limited. They are far more specialized than we would like them to be and burdened with more ifs, whens, and presumablys than we would prefer. There is not much sense in complaining about this situation; it is simply an unavoidable result of the complexity of the world in which we live and (as far as technology is concerned) that we have created. And it would not even be much of a problem if we were patient enough to be happy with the things we do find out.

And here is what we believe to be the real problem: although engineers and scientists understand a lot about complexity, have respect for it and for the effort it often implies, and are still capable of awe in that regard, our society and culture as a whole do not. We are surrounded by so many spectacular things and events that we come to feel that small news is no news. A finding that requires an intricate 50-word clause to summarize it without distortion is unlikely to receive our attention.

The general media act accordingly. In order to capture attention, they ignore, inflate, or distort the findings of an empirical study, often beyond recognition. Scientists often are not helpful either and write so-called abstracts that merely announce results rather than summarizing them. In either case, the burden is on the critical reader to take a closer look. You will have to dig out a report about the study, digest it, decide its credibility, and take home what is both credible and relevant for you. Your qualifications as a software engineer mean you are able to do this, and progress in software engineering depends on many engineers actually exercising this capability on a regular basis.

Thus, I commend you to the cause of progress in software engineering.

Appendices

Appendix A: Capabilities

As of mid-2022, everything in the list below is described in more detail, in interactive form, at the Google Cloud-sponsored devops-research.com website, but I’ve reproduced it below in order to make this review more self-contained.

Continuous delivery capabilities

Use version control for all production artifacts.

Automate your deployment process.

Implement continuous integration.

Use trunk-based development methods.

Implement test automation.

Support test data management.

Shift left on security.

Implement continuous delivery.

Architectural capabilities

Use a loosely coupled architecture.

Architect for empowered teams.

Product and process capabilities

Gather and implement customer feedback.

Make the flow of work visible through the value stream.

Work in small batches.

Foster and enable team experimentation.

Lean management and monitoring capabilities

Have a lightweight change approval process.

Monitor across application and infrastructure to inform business decisions.

Check system health proactively.

Improve processes and manage work with work-in-progress limits.

Visualize work to monitor quality and communicate throughout the team.

Cultural capabilities

Support a generative culture (as outlined by Westrum).

Encourage and support learning.

Support and facilitate collaboration among teams.

Provide resources and tools that make work meaningful.

Support or embody transformational leadership.

Appendix B: Construct validity tests

An extended excerpt from Accelerate Appendix C:

Testing for Relationships

Consistent with best practices and accepted research, we conducted our analysis in two stages (Gefen and Straub 2005). In the first step, we conduct analyses on the measures to validate and form our latent constructs (see Chapter 13). This allows us to determine which constructs can be included in the second stage of our research.

Tests of the Measurement Model

Principal components analysis (PCA). A test to help confirm convergent validity. This method is used to help explain the variance-covariance structure of a set of variables.

Principal components analysis was conducted with varimax rotation, with separate analyses for independent and dependent variables (Straub et al. 2004).

There are two types of PCA that can be done: confirmatory factor analysis (CFA) and exploratory factor analysis (EFA). In almost all cases, we performed EFA. We chose this method because it is a stricter test used to uncover the underlying structure of the variables without imposing or suggesting a structure a priori. (One notable exception was when we used CFA to confirm the validity for transformational leadership; this was selected because the items are well-established in the literature.) Items should load on their respective constructs higher than 0.60 and should not cross-load.

Average variance extracted (AVE). A test to help confirm both convergent and discriminant validity. AVE is a measure of the amount of variance that is captured by a construct in relation to the amount of variance due to measurement error.

AVE must be greater than 0.50 to indicate convergent validity.

The square root of the AVE must be greater than any cross-diagonal correlations of the constructs (when you place the square root of the AVE on the diagonal of the correlation table) to indicate divergent validity.

Correlation. This test helps confirm divergent validity when correlations between constructs are below 0.85 (Brown 2006). Pearson correlations were used (see below for details).

Reliability

Cronbach’s alpha: A measure of internal consistency. The acceptable cutoff for CR is 0.70 (Nunnally 1978); all constructs met either this cutoff or CR (listed next). Note that Cronbach’s alpha is known to be biased against small scales (i.e., constructs with a low number of items), so both Cronbach’s alpha and composite reliability were run to confirm reliability.

Composite reliability (CR): A measure of internal consistency and convergent validity. The acceptable cutoff for CR is 0.70 (Chin et al. 2003); all constructs either met this cutoff or Cronbach’s alpha (listed above).

All of the above tests must pass for a construct to be considered suitable for use in further analysis. We say that a construct “exhibits good psychometric properties” if this is the case, and proceed. All constructs used in our research passed these tests.

(Aside: it seems to be a somewhat controversial question whether EFA is technically a “type of PCA”, or whether PCA “with varimax rotation” is technically still PCA, but with respect to this passage these seem like minor terminological nits.)

Citations in the above passage refer to the following bibliography entries:

Brown, Timothy A. Confirmatory Factor Analysis for Applied Research. New York: Guilford Press, 2006.

Chin, Wynne W., Barbara L. Marcolin, and Peter R. Newsted. “A Partial Least Squares Latent Variable Modeling Approach for Measuring Interaction Effects: Results from a Monte Carlo Simulation Study and an Electronic-Mail Emotion/Adoption Study.” Information Systems Research 14, no. 2 (2003): 189-217.

Gefen, D., and D. Straub. “A Practical Guide to Factorial Validity using PLS-Graph: Tutorial and Annotated Example.” Communications of the Association for Information Systems 16, art. 5 (2005): 91-109.

Nunnally, J. C. Psychometric Theory. New York: McGraw-Hill, 1978.

Straub, D., M.-C. Boudreau, and D. Gefen. “Validation Guidelines for IS Positivist Research.” Communications of the AIS 13 (2004): 380-427.
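To make the thresholds in the excerpt concrete, here is a minimal sketch of the reliability and validity checks it names. It uses the common textbook formulas, assuming standardized factor loadings (so each item's error variance is taken as 1 minus the squared loading). The function names, the example loadings, and the example correlation are my own illustrative choices; the book does not publish its loadings or code, so this is not a reproduction of the authors' analysis.

```python
import math
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for one construct.

    `items` is a list of columns, one list of respondent scores per survey item.
    """
    k = len(items)
    n = len(items[0])
    # Total score per respondent across all items of the construct.
    totals = [sum(col[i] for col in items) for i in range(n)]
    item_var = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_var / variance(totals))


def ave(loadings):
    """Average variance extracted: mean of squared standardized loadings."""
    return sum(l * l for l in loadings) / len(loadings)


def composite_reliability(loadings):
    """Composite reliability, assuming error variance = 1 - loading^2."""
    s = sum(loadings)
    error = sum(1 - l * l for l in loadings)
    return s * s / (s * s + error)


def fornell_larcker_ok(ave_by_construct, correlations):
    """True if sqrt(AVE) of each construct exceeds its correlations with others.

    `correlations` maps (construct_a, construct_b) pairs to Pearson r.
    """
    for (a, b), r in correlations.items():
        floor = min(math.sqrt(ave_by_construct[a]), math.sqrt(ave_by_construct[b]))
        if abs(r) >= floor:
            return False
    return True


# Hypothetical loadings for a three-item construct (not from the book).
example_loadings = [0.80, 0.70, 0.75]
print("AVE > 0.50:", ave(example_loadings) > 0.50)
print("CR > 0.70:", composite_reliability(example_loadings) > 0.70)
```

Note what these checks do and do not establish: they test whether survey items hang together as a scale, not whether the scale measures the real-world quantity its name suggests.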

Appendix C: “Statistically significant” with p < 0.1

At one point, the authors evaluate the statistical significance of performance differences between clusters as follows:

Pairwise comparisons were done across clusters using each software delivery performance variable, and significant differences sorted the clusters into groups wherein the variable’s mean value does not significantly differ across clusters within a group, but differs at a statistically significant level (p < 0.10 in our research) across clusters in different groups.

I am curious why a p-value threshold (alpha) of 0.10 was used, rather than the more typical 0.05.

To be clear, there is no fundamental reason that 0.05 ought to be the threshold; it is merely customary. However, departing from the customary value adds another degree of freedom, which is troubling given the other oddities in the authors’ statistical rhetoric.

And raising the threshold runs counter to the prevailing winds of scientific opinion. See Di Leo and Sardanelli 2020, for example, for an argument that medical research should adopt a stricter threshold and also report actual p-values. See also the ASA’s 2016 statement on p-values, which sets some fairly high standards:

P-values and related analyses should not be reported selectively. Conducting multiple analyses of the data and reporting only those with certain p-values (typically those passing a significance threshold) renders the reported p-values essentially uninterpretable. Cherry-picking promising findings, also known by such terms as data dredging, significance chasing, significance questing, selective inference, and “p-hacking,” leads to a spurious excess of statistically significant results in the published literature and should be vigorously avoided. One need not formally carry out multiple statistical tests for this problem to arise: Whenever a researcher chooses what to present based on statistical results, valid interpretation of those results is severely compromised if the reader is not informed of the choice and its basis. Researchers should disclose the number of hypotheses explored during the study, all data collection decisions, all statistical analyses conducted, and all p-values computed. Valid scientific conclusions based on p-values and related statistics cannot be drawn without at least knowing how many and which analyses were conducted, and how those analyses (including p-values) were selected for reporting.

As I’ve admitted multiple times in this review, I am not a statistician. I also don’t think that every published result needs to pass the above gold standards of statistical transparency. But I do think that you should communicate and interpret results with the caveats appropriate to the methods used.
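For a rough sense of why the choice of threshold matters when many pairwise comparisons are run, here is a back-of-the-envelope sketch of the family-wise error rate: the chance of at least one false positive across a batch of tests when no real differences exist. It assumes the tests are independent, which pairwise comparisons over the same clusters are not, so treat the result as an illustration rather than a calculation about the book's data. The cluster and variable counts below are hypothetical inputs I chose for the example, and the function names are mine.

```python
from math import comb


def pairwise_tests(n_clusters, n_variables):
    """Number of pairwise cluster comparisons, run once per variable."""
    return comb(n_clusters, 2) * n_variables


def familywise_error_rate(alpha, n_tests):
    """P(at least one false positive) across n independent tests at level alpha."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(alpha, n_tests):
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests


# Hypothetical setup: 3 clusters compared on 4 performance variables.
n = pairwise_tests(n_clusters=3, n_variables=4)
for alpha in (0.05, 0.10):
    print(f"alpha={alpha:.2f}, tests={n}, FWER={familywise_error_rate(alpha, n):.2f}")
```

Under these assumptions, twelve tests at a threshold of 0.10 give roughly a 72% chance of at least one spurious "significant" difference, versus roughly 46% at 0.05. Whether the authors applied a correction for multiple comparisons is not something I could determine from the text.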