Note: This diagram is deliberately a stereotype; think of it as a projection along the dimension of churn.
Before I go further I should explain what I mean by “churn”. Churn is a term I have borrowed from marketing, where it usually appears as “customer churn”: the rate at which customers turn over. In the context of software systems I use churn to mean, more generally, the rate of change within a software system. You can think of this as the churn rate of the code, but it can also refer to the rate of change of the use cases, configurations, and business systems surrounding the code itself. In essence it is the rate of change imposed by external systems, end users, and business requirements on the software system built to solve that particular business problem.
Thinking about the previous “classic enterprise architecture” in this context presents some interesting ways of viewing the phenomenon. You start to see that at the edges of the software system, where it interacts with external systems (typically end users), there is more churn, and that we compensate for that greater rate of churn by adopting languages and paradigms that are more dynamic and amenable to a higher rate of change.
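One rough way to make this observation concrete is to measure churn per layer directly from version-control history. A minimal sketch, using a hypothetical list of (path, month) file touches rather than a real `git log` parse, and hypothetical layer names:

```python
from collections import Counter

def churn_by_layer(touches, layer_of):
    """Count how often files in each layer were touched.

    touches: iterable of (file_path, month) pairs, e.g. parsed
             from `git log --name-only --date=format:%Y-%m`.
    layer_of: maps a file path to a layer name.
    """
    return dict(Counter(layer_of(path) for path, _ in touches))

# Hypothetical history: the SQL/config edge changes far more
# often than the core library.
touches = [
    ("views/client_a.sql", "2020-01"), ("views/client_b.sql", "2020-01"),
    ("views/client_a.sql", "2020-02"), ("views/client_c.sql", "2020-02"),
    ("core/Algebra.hs", "2020-01"),
]
layer = lambda p: "edge-sql" if p.startswith("views/") else "core"
print(churn_by_layer(touches, layer))  # {'edge-sql': 4, 'core': 1}
```

Run over a real repository, a histogram like this tends to show exactly the gradient described above: the edges run hot, the core runs cold.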
Churn in analogy to thermodynamic entropy
One way to think about this is in terms of heat in materials engineering. You design a material to tolerate the highest heat at its outer surface, then build up layers of other materials with different desirable characteristics, each more insulated from the heat as you move inward. Another analogy I use a lot when thinking about designing reactive systems to handle large volumes of data is traffic patterns: off-ramps and speed zones are tools for taking a heavy flow of traffic and gradually siphoning parts of it off until a manageable amount has been distributed to the right places. Yet another analogy in the same vein is water sluiceways and how they are used to prevent floods. You get the idea; long story short, the same reasoning applies here if you substitute “churn volume” for “traffic volume” or “water volume”.
This is where I come to my recent experiences with Formation.ai and some of the insights I have had around its architecture. Formation is an amazing and challenging place to work for a number of reasons, and if you are a data engineer I highly recommend it. The main defining feature of Formation is that it has been seeded from the beginning with brilliant, often PhD-level practitioners of functional programming. At the same time, Formation finds itself in the extremely churn-filled world of online marketing, building bleeding-edge, reinforcement-learning-inspired customer loyalty programs for major brands that deliver tremendous value. The gap between these two worlds is tremendously challenging and produces some amazingly novel concepts.
A main desirable characteristic of functional programming is that it lets you build extremely safe, maintainable software systems: systems where you know exactly what is happening at any given moment. That is an oversimplification, obviously, but relative to other approaches there is a kernel of truth to it. One aspect of functional programming that makes it challenging, however, is that it generally takes longer to build systems designed to be purely functional. I’ve had many conversations with my colleagues, and I realize this too is an oversimplification, and from their perspective perhaps a straight falsehood. They would argue, and I think this is completely valid, that for someone sufficiently experienced in FP it actually takes less time to implement systems with a pure FP approach than with any other, and that you get the benefits of pure programming as a bonus. The key phrase is “someone sufficiently experienced”: if you average over the individuals actually available in your hiring pool, and the implementation time added by lesser experience, it works out to a longer implementation time than, say, throwing together a quick Python script. And that’s totally fine, because the added benefits of FP often make it worth the investment.
However, let’s consider the aspect of churn I brought up before. Imagine you observe a particular kind of relationship in the business patterns, spend a week developing the FP algebra that models that relationship, and implement its requisite categories with a very pure FP approach, only to find that within that week those relationships have fundamentally changed, or been abandoned altogether, multiple times. In general, we find that the business world wants to move faster than the engineering world. That impedance mismatch is well known to us all and needs to be respected. The challenge then becomes (going back to the heat analogy): given that the business operations are “hot”, how do we arrange our materials (layers of software systems) to best accommodate that reality, and insulate the systems that are less able to absorb a high rate of churn but have other characteristics, such as maintainability, which we desperately need if we are to succeed long term?
Credit: Toggl engineering team
Notice that a key aspect of this architecture is using SQL as what I call a “data configuration” language, by means of Dremio. Dremio, for those who aren’t familiar, is a tool built on Apache Arrow and Apache Calcite that lets you use SQL to define virtual datasets, which you can then access over JDBC and use in, for example, Spark pipelines. Think of it as a competitor to Presto, with a materialization framework for external views, a snazzy front end, and the ability to read from multiple types of data stores other than S3.
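To illustrate the “data configuration” idea, a per-client virtual dataset can be little more than a generated SELECT that renames bespoke client columns onto a canonical schema. A sketch (the `CREATE VDS` form is an approximation of Dremio’s DDL, and the client column names are hypothetical):

```python
def canonical_view_sql(view_name, source_table, column_map):
    """Build SQL for a virtual dataset that maps a client's
    bespoke column names onto canonical ones.

    column_map: {canonical_name: client_column_name}
    """
    select_list = ",\n  ".join(
        f'"{src}" AS {canonical}' for canonical, src in column_map.items()
    )
    return (
        f"CREATE VDS {view_name} AS\n"
        f"SELECT\n  {select_list}\nFROM {source_table}"
    )

# Hypothetical client whose export calls customer id "cust_ref".
sql = canonical_view_sql(
    "canonical.orders_client_a",
    "s3.client_a.raw_orders",
    {"customer_id": "cust_ref", "order_ts": "purchased_at"},
)
print(sql)
```

The point is that the bespoke-to-canonical mapping lives entirely in this thin SQL layer, so when a client changes their export format, only the view changes, not the pipelines downstream of the canonical tables.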
It’s an interesting tool that seems to be exactly the “right tool for the job” in this particular project, despite its immaturity in areas like having a separate persistent metadata database you can interact with directly (unlike Hive). We were able to work around that by building a migration tool in Scala that pulled the metadata into version control using the Dremio REST API. This allows Dremio to serve as the outermost, most flexible layer of our system: the one most exposed to churn, and the one allowed to be deliberately client-bespoke. It lets us massage the data into “canonical” tables that the rest of the system can use, which siphons off a significant amount of churn. The other really interesting thing it enables is deferring product decisions until a statistically significant pattern emerges across our clients. In essence, Dremio becomes our product feature hopper: we dig into it, pull components out, and develop them later as patterns emerge in the Dremio queries between clients.
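The core of that migration tool (ours was in Scala; this is a Python sketch) is just flattening the catalog the REST API returns into files you can commit. Assuming an abbreviated, hypothetical shape for the entries that Dremio’s catalog endpoint returns:

```python
import json

def vds_to_files(catalog_entries):
    """Turn virtual-dataset entries (as pulled from Dremio's
    REST catalog API) into {relative_path: sql} pairs suitable
    for writing out and committing to version control."""
    out = {}
    for entry in catalog_entries:
        if entry.get("type") == "VIRTUAL_DATASET":
            # e.g. ["canonical", "orders"] -> "canonical/orders.sql"
            out["/".join(entry["path"]) + ".sql"] = entry["sql"]
    return out

# Abbreviated, hypothetical response entries.
entries = [
    {"type": "VIRTUAL_DATASET",
     "path": ["canonical", "orders"],
     "sql": "SELECT * FROM s3.raw.orders"},
    {"type": "PHYSICAL_DATASET", "path": ["s3", "raw", "orders"]},
]
print(json.dumps(vds_to_files(entries), indent=2))
```

Once the view definitions are plain `.sql` files in a repository, the usual tooling (diffs, reviews, rollbacks) applies to the highest-churn layer of the system for free.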
From there you can see the continuum: from the very high churn Dremio SQL configuration layer, down into the data science, which is still dynamic and flexible but requires a more rigorous development process, down to the data engineering, which is more type safe, written in Scala, and heavily leverages the canonical tables so the code rarely has to change, all the way into the Haskell layer, which can be completely application-agnostic: the “core” operations that don’t need to change constantly but do need to be ultra safe and maintainable. I argue that this arrangement of the right tools for the right jobs is both intuitive and puts the right pressures in the right places, so that the company can operate at maximal efficiency in the highly churn-driven environment it finds itself in. Notice that the symmetry of the low-to-high-churn pattern exists even on the egress side of the fence, where Formation has to configure sometimes very bespoke measurement metrics and guidelines to prove attribution of its value proposition. Basically, any time the system is exposed to an “end user” or non-specialist user of some sort, we see these layers of churn insulation (going back to the thermodynamic analogy) emerge toward that eventual end goal.
It’s also interesting to note how this parallels many of the data strategies that have emerged over the years, which go from very unstructured data to highly use-case-specific structured data. In our case, using a data virtualization tool also alleviates some of the heavier ETL that would otherwise be involved in such a pipeline. The fact that Dremio is a distributed technology means it can scale to most of our batch use cases within acceptable SLAs (barring the complexity of reflection management, which can sometimes make the SLAs a bit more dicey). We use it as part one of a one-two punch for scalable data processing, the second being Apache Spark. Basically, we have moved most of our Spark SQL use cases into Dremio, because it lets non-specialists query the data without having to know Python or Scala. This is really useful for things like debugging, QA, and validation teams.
One note of caution: Dremio is still a green tool and has its fair share of problems, particularly around materialization scheduling and handling. We realized after the fact that we could likely get the same behavior out of AWS Glue Data Catalog and Athena (Spark’s ability to point its metastore at the Glue catalog underlying Athena is especially convenient) to handle the “service-oriented metadata specification”, with some additional work to maintain certain views as materialized views for quicker access. Moving forward this is likely what we will use in place of Dremio. A similar tool with a lot of promise is being developed by LinkedIn, called Dali, which seems to be solving a similar problem, though it is not yet open source. Another alternative could have been something like Zeppelin or Databricks notebooks, but that would still limit users to people who know Python or Scala. Delta Lake SQL could have helped here, but Databricks overall felt too disruptive to introduce at the time, and still wasn’t quite as friendly to non-programmers.
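As a sketch of the “maintain certain views as materialized views” workaround: Athena has no native materialized views, but a periodic CTAS snapshot of a Glue view gets close. The statement shape below is standard Athena CTAS; the table names and S3 location are hypothetical:

```python
def athena_materialize(view_name, table_name, s3_location):
    """Build an Athena CTAS statement that snapshots a Glue
    view into a Parquet table for faster repeated access."""
    return (
        f"CREATE TABLE {table_name}\n"
        f"WITH (format = 'PARQUET',\n"
        f"      external_location = '{s3_location}')\n"
        f"AS SELECT * FROM {view_name}"
    )

stmt = athena_materialize(
    "analytics.canonical_orders_view",
    "analytics.canonical_orders_mat",
    "s3://example-bucket/materialized/canonical_orders/",
)
print(stmt)
```

A scheduler would drop and recreate the snapshot table on whatever cadence the SLAs demand, which is roughly the role Dremio’s reflections played for us, minus the automatic query rewriting.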
To summarize, I want to discuss various types of complexity in software engineering for data systems. There are three broad types of complexity I would like to shed light on as a result of this architecture, the third being perhaps the most overlooked:
System Complexity (internal complexity): This is what we typically think of when we refer to complexity in software. In general: how well does the software function on its own when no external forces are changing it or exerting pressure on it, i.e., how well does it do its job? How well does it handle volume, throughput, latency, reliability, and stability? How easy is it to provision and maintain the infrastructure to run it?
Operational Complexity (external complexity): This covers some, but not all, of what we mean when we say complexity in software. It is essentially how well the software copes with external business processes and change. How easy is it to extend and evolve over time? How easy is it to configure? How easy is it for users to get analytic or diagnostic information out of it? Churn as it pertains to business configuration falls into this category, and the concept of Churn Based Programming focuses a light on this type of complexity.
I will end my discussion here as it’s starting to get long, but I hope Churn Based Programming proves a useful concept that others will draw inspiration from and try to express in more formal ways. I’m eager to learn of more obvious or orthodox expressions of the same idea. In my next post I’m going to continue the discussion about programming languages, looking at the dueling concepts of Cohesion and Coupling, how they affect which programming language you might choose for a particular task, and how those ideas look through the lens of program churn.