On Scaling (Meta-)Software for Value

A value-focused analysis of software (and, "what am I doing at Bloomberg that warrants a 3000-word blog post?")


What is software?

You could cop out and say that software is simply a "thing" that tells a computer what to do, sure; but for something that is just a set of instructions, society values it a LOT: after all, it is how we manage to put food on the table, whether we're building it, maintaining it, running it, designing it, etc.

But why?

Another cop-out answer would be that it generates "value". More specifically, these sets of instructions solve "things" in society, and those solutions can be used to create value. Most importantly for software, the solution it creates can be applied repeatedly, raking in more value every time the software is "run".

And for the most part, it is this scalable value creation that provides us with much higher wages than most other similarly-educated workers: we can spend a day or a week or a month solving a "thing" (a fixed cost - for the most part; we'll get back to this later), and the company can just run that any number of times to generate incredible value (the marginal value per run stays roughly constant, so the total value scales almost linearly with the number of runs).

So, the scalability of value creation is important for those of us in the industry. And when we say the word "scalability", we immediately think of scalability in terms of running the same thing an incredibly high number of times (e.g. using distributed databases and stateless applications to "scale out" your cool webapp to support an ever-increasing userbase). In other words, raising the ceiling for the number of runs.

And that makes sense - more runs of the same thing equal more value. A good way to explain this would be through an equation like this: value of software = (value generated per run) * (number of runs) - (cost to create a "solution") (or, shortened, output = V * N - C), and because the number of runs is multiplied by the value generated per run, the value of software is O(n) with respect to the number of runs; hence, scalable value.
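
Just to make that concrete, here's the equation as a tiny code sketch - every number here is pulled out of thin air, purely for illustration:

```typescript
// Toy model of the value equation: output = V * N - C.
// All of these numbers are made up for illustration.
const valuePerRun = 5;          // V: value generated per run
const numberOfRuns = 1_000_000; // N: how many times the software gets run
const costToBuild = 50_000;     // C: one-off cost of building the "solution"

const output = valuePerRun * numberOfRuns - costToBuild;
console.log(output); // 4950000 - once N gets big enough, C barely registers
```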

But that's not the only way software can scale in terms of value. For a library, for instance, "more runs" come from allowing it to be integrated into lots of different applications (left-pad, anybody?). And for the more "business-y" applications (especially internal ones), "more runs" come not from raising the absolute number of runs for an application, but from solving more problems (after all, there are only so many business users within a company who can use your application).

Consider a purely hypothetical application in a purely hypothetical business where a lot of agreements are made (sales, HR, legal, hiring, etc). Without software, a lot of poor schmucks have to manually write, review, and sign legalese with counterparties who have to do the same thing, not to mention that if they happen to fuck up even one form, it could cost the company millions of dollars. But with an application to generate forms from word templates, those people no longer have to draft up every form from scratch and are less likely to make mistakes copy+pasting, hooray!

But business comes back the next day, and complains that they're making too many mistakes by selecting the wrong templates. Oh no! So you add in a feature to the application to make sure people are getting the right templates for their needs.

But business comes back the next day, and complains that they're making too many mistakes by letting people who don't work in HR select HR templates. Oh no! So you add in a feature to the application to make sure the right people are getting the right templates for their needs.

But business comes back the next day, and complains that they're making too many mistakes by accidentally missing some fields. Oh no! So you add in a feature to the application to make sure the right people are getting the right templates for their needs, and that they fill in the right fields.

But business comes back the next day, and complains that they're making too many mistakes by filling out the counterparty field incorrectly. Oh no! So you add in a feature to the application to make sure the right people are getting the right templates for their needs, and that they fill in the right fields and assign it to the right people.

And so on and so forth...

If you work on business-oriented applications, I'm sure this is all too familiar to you. You build an application that solves a business issue. Then people want that application to also solve something else. And else. And else. And else...

Assuming all of the problems they're asking your application to solve are reasonably within scope (which is its own can of worms I won't touch today), the scalability of value for your application can't come from simply increasing N; hell, we're not even optimizing for N, for we need to continuously develop and deliver new features in order to deliver more value.

So is scalable value generation dead for this type of software?

Solve For X

Warning: math (notations) ahead

This is where the magician reveals that she has tricked the audience, for the equation output = V * N - C was really only meant for a single feature. It's not like we just stop working on an application after developing a single feature (maybe you do, good on you if that's the case), so we need to account for all of the features we have developed and will continue developing.

Then, the full equation for a software's output looks more like this: output = (V_1 * N_1 - C_1) + (V_2 * N_2 - C_2) + ... = Σ_(i = 1...x) (V_i * N_i - C_i). Furthermore, for business-oriented applications, since we can consider any N_i to be effectively capped, we can fold it into V_i as a constant factor, and the equation looks more like this: output = Σ_(i = 1...x) (V_i * N_i - C_i) ≈ Σ_(i = 1...x) (V_i - C_i) ≈ X * (V - C), where X is the number of features.

Thus, we can see that the output is O(n) with respect to X, the number of features (yes, I know this isn't exactly how O notation works, but you get the point). So even in cases where N is capped, a software's scalability of value is not dead, per se; it is just scalable against a different metric, X. The scalability of such software comes from having V - C be essentially constant, so that the output can scale roughly linearly against X, no matter how high X gets.
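
Here's the same idea as a toy model in code (again, every figure is hypothetical): with each N_i capped, each feature contributes a roughly constant chunk of output, so the total grows with the feature count rather than the run count.

```typescript
// Toy model of per-feature output when the number of runs is capped.
// All figures are hypothetical.
interface Feature {
  valuePerRun: number; // V_i
  runs: number;        // N_i - effectively capped for internal business apps
  cost: number;        // C_i - cost to build the feature
}

const features: Feature[] = [
  { valuePerRun: 100, runs: 500, cost: 20_000 },
  { valuePerRun: 80,  runs: 500, cost: 15_000 },
  { valuePerRun: 120, runs: 500, cost: 25_000 },
];

// output = Σ (V_i * N_i - C_i); with N_i capped, each term contributes a
// roughly constant amount, so output grows linearly with X (the feature count).
const output = features.reduce(
  (sum, f) => sum + (f.valuePerRun * f.runs - f.cost),
  0
);
console.log(output); // 90000 here - roughly X * (V - C)
```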

And we know how to optimize for such problems (for leetcoders, it's basically a knapsack-style selection, and the solution is a simple greedy algorithm)! Think of oil wells. Hydrocarbons are a finite resource, and yet an oil company can keep drilling wells one after another. Given infinite investment, an oil company will first drill the most profitable site (max(V_i - C_i)), then move onto the next one, and so on and so forth.
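
For the curious, here's roughly what that greedy prioritization might look like in code - a sketch with made-up sites and figures, not anyone's actual drilling (or roadmapping) strategy:

```typescript
// A minimal sketch of greedy prioritization: given a limited budget, tackle
// the most profitable "wells" (or features) first. All names/numbers are hypothetical.
interface Prospect {
  name: string;
  value: number; // V_i
  cost: number;  // C_i
}

function pickGreedily(prospects: Prospect[], budget: number): Prospect[] {
  // Sort by marginal profit (V_i - C_i), highest first...
  const sorted = [...prospects].sort(
    (a, b) => (b.value - b.cost) - (a.value - a.cost)
  );
  const picked: Prospect[] = [];
  let remaining = budget;
  // ...then keep drilling while it's still profitable and affordable.
  for (const p of sorted) {
    if (p.value - p.cost <= 0) break; // diminishing returns: stop here
    if (p.cost > remaining) continue;
    picked.push(p);
    remaining -= p.cost;
  }
  return picked;
}

const wells: Prospect[] = [
  { name: "site-A", value: 900, cost: 200 },
  { name: "site-B", value: 400, cost: 350 },
  { name: "site-C", value: 300, cost: 400 }, // known reserves, but not worth extracting
];
console.log(pickGreedily(wells, 600).map((w) => w.name)); // ["site-A", "site-B"]
```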

But eventually, the law of diminishing returns will catch up to the company, and its money-printing operations will no longer be scalable. Every well they dig will have to be more and more complex, and they might choose to skip over wells they know hold vast stores of value, simply because extracting it would cost more than it's worth.

And sure, once a new technology comes around (e.g. fracking), they will get right back into the scalable extraction model, but eventually the diminishing returns will catch up again, and the arms race seemingly never ends, all because C is constant while V is variable. Is there any way out of this?

Optimizing for C (no, not the language)

While I said before that solving a "thing" is a fixed cost for the most part, development costs for any given "thing" are technically not fixed. Once you develop a thing, you will inevitably have to fix it later down the line and keep it running: our assumptions change, code rots, the software landscape shifts rapidly, and it never stops.

And that's really one of the more obvious reasons why scalable value generation breaks down - because the marginal cost starts to creep up (and this is where the oil well analogy breaks down). The more you develop, the more the cost of development rises due to factors such as 1. mounting tech debt, 2. increasing fragility stemming from the sheer number of moving parts, and 3. increasing complexity, making it difficult to architect and design new features.

For something like a public-facing application, while it does have some marginal cost in terms of feature development, it is "effectively" a fixed cost, as it is minuscule compared to the sheer scalability of value generation. Sure, there are unique challenges (and associated costs) that come from operating at web scale, but those are essentially nothing compared to the value of countless users being able to use the feature (i.e. it costs virtually nothing to onboard your 10 millionth user because you've already laid the groundwork).

The same can't be said for the business-oriented applications (where we are especially sensitive to rising marginal costs due to the capped value generation "per feature"), so we resort to addressing the above 3 points to keep marginal costs from creeping up.

It's $current_year, so we know how to solve these problems. For the mounting tech debt: the occasional "feature pause" to reevaluate the current architecture and clean up the biggest debts (akin to GC, but for development). For the increasing fragility and complexity: the move to clearly-defined microservices/modules (hello, Java 9) + failover behaviour + mesh networking/resilient message passing + local reproducible builds allows us to tackle things at a much smaller scope, make software more stable, and limit the "blast radius".

Okay, so if we know the answer, why do we let these things drag down our development momentum? This is where the oil company analogy breaks down once again, because while Chevron or Exxon can have essentially infinite investments, we - the individual teams down on the ground - don't; the higher-ups who ultimately sign our paychecks only have a limited amount of resources to invest in any one application, and it just so happens that our recency bias heavily overweights the value of "the next new thing" (ooh, shiny) relative to the cost of inaction.

Optimizing for V

All of these scalability problems arise because V doesn't "scale" with C for any given feature (i.e. the marginal cost is decoupled from the value generated from developing a feature). Here's a wild idea, though: what if it did?

Nobody knows the problem domain (and in particular, all of the hidden context around the singular feature they ask of us) better than the person requesting the feature, whether that's the business or run-of-the-mill users. Nobody knows the unknown unknowns that inevitably bump up development time better than the person requesting the feature.

So we build "meta"-software and empower these users to build their own features (without killing the codebase), so that us developers' time could be better spent on more productive things (perhaps addressing the 3 problems I mentioned above), so that features could be developed at the users' own pace (thus making V "scale").

Let us go back to the purely hypothetical workplace with purely hypothetical applications and users to see how development works now versus how it might work.

Currently, a user complains about something and requests a feature. First, you need to decipher the user's request to figure out what exactly it is that they're requesting and which parts are actually needed; then you need to figure out which parts of the codebase you need to touch to add the feature, design it, plan it, scope it, implement it, test it, deploy it, etc.

Not to mention, there is a whole jumble of business logic sprawled all across your codebase. You don't know why something acts in a certain way, but you do know that you can't fuck with it and that it's what runs the business. It's a necessary evil that you need to deal with every time you need to develop a feature or need to figure out why the application is doing something (intentionally or not).

Maybe it lives across your entire stack; in some giant, disgusting stored procedure; in some middle layer that was only supposed to transparently transfer requests from A to B, etc. Wherever it is, not only do you need to figure out where the needle is in the haystack to expand upon/modify/interact with feature X, but you also need to understand what it is doing, made complicated by the fact that it sprawls across the codebase.

So I guess we can all agree that the troves of business logic sprawling across the codebase are evil, for sure, but are they necessary?

Hypothetically, let's say that the team lead is proactive in fighting said evil and has assigned you the task of tackling the most obvious area: business specifications (or, more specifically, rules). And let's say you were previously involved in the company's inner-sourced rule engine, and that you've used it to express business rules. And let's say you've built rule & context builder components so that users could literally craft all parts of the rules themselves. And let's say you've built a "translator" for the rules so that users could intuitively understand them. And let's say you've built the rules around the raw underlying data so the users could build literally any rule they needed, on any "thing", run any number of transformations, and "join" on any number of "things".

And let's say you've built out UI integrations that allowed users to manage the rules easily, so that all of the business logic was centralized in one place and was crystal clear to the users. And let's say you've built reusable higher-level frontend components and hooks, and backend validators, so that developers could easily integrate rules to replace "evil" business rules over time. And let's say you've designed the APIs involved with the rule engine to be completely stateless, so that you could support any number of validations by just spinning up more instances of that service.

And let's say you've designed all of these pieces across the stack to support the rule engine in an isomorphic manner, so that one could use the rules as the source of truth for all business logic, no matter where in the stack you are. And let's say that isomorphic design allowed you to build a seamless business logic validation workflow across the whole stack. And let's say you've architected the business logic validation system for easy rollback with versioning, with tests users could build out to ensure their rules won't accidentally let a "bad" form (or document, or literally anything else) slip through. And let's say you've made it easy to configure which rules to use when, by crafting rule selectors (which are, in fact, rules in and of themselves - providing you and the users with infinite flexibility) and JSON Schema validation for rule inputs, all of which are completely transparent to the users.

What would happen then?

You have the same business logic, but it's centralized, so you and the team know exactly where it is and which business rules are in effect. You have the same business logic, but it's completely transparent to the users so now they can see what these rules actually are. You have the same business logic, but now you've empowered the users so they can craft these rules and logic themselves in a safe manner.
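
To make this a little less abstract, here's a minimal sketch of what a user-crafted, declarative rule and its evaluator might look like. The rule shape, field names, and operators here are purely illustrative - this is not the actual inner-sourced engine, just the general flavour:

```typescript
// A hypothetical, minimal sketch of a declarative business rule and its evaluator.
// The shapes and names are illustrative only - not the actual rule engine.
type Condition =
  | { op: "equals"; field: string; value: unknown }
  | { op: "in"; field: string; values: unknown[] }
  | { op: "notEmpty"; field: string };

interface Rule {
  id: string;
  version: number;         // supports rollback to an earlier version
  description: string;     // the human-readable "translation" users see
  conditions: Condition[]; // all must hold for the form to pass
}

// The same evaluator can run in the browser and on the server ("isomorphic"),
// so the rule stays the single source of truth for the business logic.
function evaluate(rule: Rule, form: Record<string, unknown>): boolean {
  return rule.conditions.every((c) => {
    const actual = form[c.field];
    switch (c.op) {
      case "equals":
        return actual === c.value;
      case "in":
        return c.values.includes(actual);
      case "notEmpty":
        return actual !== undefined && actual !== null && actual !== "";
      default:
        return false;
    }
  });
}

// A rule a user might craft themselves, without touching the codebase:
const hrTemplateRule: Rule = {
  id: "hr-template-access",
  version: 3,
  description: "Only HR staff may select HR templates, and counterparty is required",
  conditions: [
    { op: "in", field: "requesterDepartment", values: ["HR"] },
    { op: "notEmpty", field: "counterparty" },
  ],
};

console.log(evaluate(hrTemplateRule, { requesterDepartment: "HR", counterparty: "ACME" })); // true
```

The point is that the rule is just data: it can be versioned, translated into plain English, validated against a schema, and evaluated anywhere in the stack.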

Of course, the complexity of said business logic is not gone - it's still present within the rules. But at least now, it is far more maintainable and efficient for engineers to build the surrounding pieces supporting the business logic. Their time can be better spent on more productive things than hand crafting ACLs and CRUD until the end of time.

And most importantly, features can be "developed" at the user's pace - making the marginal value output scalable (i.e. V ~ C)!

This is just an example (not a definitive guide) of how you could scale your software even in cases where it has an inherent "cap". It all just depends on how you define your software: is it the thing delivering the features, or the thing enabling the delivery of features?

Scaling the Value Output

Ultimately, all of this discussion is, at its core, about how to make our skills and expertise scale. We are getting paid to write and maintain something, but modern applications have gotten way too complex, and our demands of software way too great, for us to be working on all of it in its entirety by ourselves.

On the web of the early 2000s, maybe. But not nowadays.

So, in this complex mess of a machine we call "software", we must take care not to spread ourselves too thin (the modern "fullstack engineer" is expected to do literally everything for the salary of a regular developer and keep up with changes in literally every part of the stack); we must take care to consider not only the part that directly faces the users, but also what's going on "under the hood". And perhaps most importantly, we must think about how the (part of the) software we work on fits into the bigger picture, and assess the overall impact it has on the value output of the whole.

After all, we only have so much time on this planet: we'll only get to work on so many parts of software before our time is up, and I don't know about you, but I want to work on the parts of software that can be made to scale their value generation; to have the most ripple effect, and to quell my fear that one day, I might look back at my career and realize that my passion - my life's work - has been inconsequential in the "big picture".