Ergalics: Predictive Causal Models

Overview

Ergalics, the science of computer science, has as a primary goal the development of predictive causal models. Such models are a product of empirical sciences, with these sciences differentiated by what the models are about. For ergalics, the models make predictions about the behavior of computational tools, about the use of computational tools, and about computation itself.

For ergalics, spelling out predictive causal models is especially relevant because the discipline of computer science has historically relied on the mathematical and engineering perspectives, which utilize methodologies and formalisms different from that of the scientific perspective at issue here. So it is appropriate in this context to characterize these models. Interestingly, computer science, through the deep work of Judea Pearl and others, has contributed to our formal understanding of these models, by ascribing to the models' directed graph representation a precise semantics. The language of directed graphs is very familiar to computer scientists, rendering this language a useful entré into sophisticated empirical methods for representing hypotheses about relationships under study.

Predictive causal models relate independent variables (which can be manipulated) to dependent variables (which can be measured), generating predictions for the values of dependent variables given set values for the independent variables.

As a simple example, Newton's second law relates the impact of applying a force to an object (the force being an independent variable) to the velocity of an object (a dependent variable).

There have been many formal models that have been used to represent such relationships between independent and dependent variables, as well as methods for testing those relationships.

More ...

A SCM consists of a set of random variables (RVs) and the relationships between them. Each RV describes a particular aspect of the world as assuming, at any given time, one of some number of states or values. This set of values defines the domain of the RV, and the domain can be continuous or discrete. RVs may be observed (measured) or they may be posited to describe aspects of the world that cannot directly be observed (i.e., they are hidden or latent).

A RV also describes the tendency of the represented aspect of the world to assume particular values. This tendency is generally not considered in isolation: a central goal of the causal modeling framework is to represent the relationships between RVs. Such relationships determine how the value of a RV depends on the values of one or more other RVs.

More ...

For example, in the model depicted in Figure 1a, the state of Y depends in some way on the state of X. In causal Bayesian networks [Pearl 1988], a type of SCM, the directed arrow represents a conditional probabilistic relationship: the distribution of values for Y is represented by the conditional probability P(Y|X), that is, the probability of Y taking on a particular value given the value of X. Provided the marginal distributions of Y and X are known, P(Y) and P(X), Bayes rule can be used to compute the distribution of X given the value of Y: P(X|Y) = P(Y|X)P(X) / P(Y). This ability to relate the behavior of variables to one another is very powerful. For example, if Y is not observable, but it is possible to observe that X is in a particular state, it is possible to predict what state Y is likely to be in, and vice versa.

Figure 1: Simple structural causal moels (SCMs)

SCMs add a critical additional semantics to the relationships: directed links between RVs are interpreted as indicating the direction of causal influence from parent (X) to child (Y), but not the other direction. This is made concrete by considering the difference in the relationship between X and Y were we to control, or intervene, on the state of X or Y. If we control the value of X, Y's behavior will be governed by the original conditional probability P(Y|X). However, if we intervene on Y, as depicted in Figure 1b, the causal influence of X on Y is broken or removed: Y will be in the state we set it to, no matter what X is; furthermore we can't turn to the intervened state of Y to determine the state of X -- now X can only be predicted by P(X), not the conditional probability P(X|Y) we calculated above with Bayes rule.

This semantics of directed causal influence, in conjunction with the semantics of probabilistic graphical models [Koller 2009, Pearl 1988], allows SCMs to represent complex relationships among variables assuming only passive observation, as well as represent the logic of controlled experiments [Spirtes 2000, Glymour 1999].

Less ...

Experimental control is a powerful tool for identifying whether observed dependencies between variables are due to direct causal relationship(s) or some other common cause.

More ...

There are a variety of modeling tools and resulting kinds of predictive causal models that one can construct. The discussion above has considered statistical methods, used heavily in other sciences as well. Other methods are relevant to ergalics, such as dynamical systems. At their core, all of these methods make claims about causal relationships among observed phenomena.

DBMSes as a Domain of Ergalic Theories

As an example, let's consider database management systems (DBMSes) as a type of computational model for which we would like develop and test predictive causal models.

Despite their considerable engineering and the formal theory developed for them, DBMSes are not well understood. For example, there is surprisingly little known about the following, quite basic questions concerning the prevalent cost-based query optimization and two-phase locking transaction management architectures: (1) What is the relative impact of the non-linearity of some operator cost models on the selection of a suboptimal plan for a query? (2) How critical is the accuracy of the cost model in selecting good plans? (3) Does adding a physical operator for an algebraic operation always improve the effectiveness of query optimization, or is there a limit to the number of operators that can be practically a accommodated?

These questions get at the heart of research and development of DBMS query optimization, query evaluation, and transaction processing architectures and algorithms. They address fundamental limits and complex interactions within these architectures. But these questions share another commonality: they cannot be answered by investigating a single algorithm or even a single DBMS. Rather, they require the development of predictive models across multiple DBMSes.

Let's examine a causal structure that can address the third question above, concerning whether adding a physical operator always improves query optimization performance.

Query evaluation proceeds in several steps [Ramakrishnan 2003]. First, the query (generally expressed in SQL, a high-level declarative language) is translated into alternative query evaluation plans based on the relational algebra via query enumeration. The cost of each plan is then estimated and the plan with the lowest estimated cost is chosen. These steps comprise query optimization, specifically cost-based query optimization [Selinger 1979]. The identified query plan is then evaluated by the query execution engine which implements a set of physical operators, often several for each logical operator, such as join [Graefe 1993]

The third question above concerns the role of optimizer complexity, specifically, the number of physical operators that the optimizer must contend with, in relation to suboptimality, those situations when the optimizer selects the wrong plan, where there is another plan that is faster that the optimizer passed over. The goal is to determine if a fundamental limitation exists, which implies that it is necessary to study multiple instances of that optimization architecture. To do so, a model could be articulated for how and in what circumstances suboptimality arises and then evidence could be collected via hypothesis testing that this model can indeed predict the behavior of these several DBMSes.

More ...

Figure 2: Model of relation between physical operators and suboptimality

Figure 2 shows an example structural causal model hypothesizing the possible relationships between component phenomena in a database. The reasoning behind this model is that suboptimality is hypothesized to be due in part to the complexity of the optimizer and concomitant unanticipated interactions between the different components used within the optimizer. With the proliferation of concerns and variability that an optimizer must contend with, it is extremely difficult to ensure that for an arbitrary query that the optimizer will always pick the right plan. The components of the optimizer are all interacting with each other; this theory implicates these interactions as a primary source of suboptimality.

More ...

The causal model in Figure 2 identifies some of the variables that may impact the prevalence of suboptimality. "Query Complexity" is an independent variable. Query Complexity includes such measures as the number of correlation names in the from clause, the number of predicates in the where clause, the presence of correlated subqueries, and the presence of aggregates; each of these measures can be independently manipulated.

Another independent variable is "Query Complexity." Once a particular DBMS is chosen, it will include specific optimizer code, including a set of physical operators (e.g., relation scan, sort-merge join) at its disposal. So one measure of Optimizer Complexity is the number of physical operators supported by the DBMS. Other measures of this construct include the number of rules in the optimizer and the amount of code in the DBMS, the non-linearity of the cost model, and cost model accuracy. Each DBMS optimizer will thus have a measurable complexity, across a variety of measures.

There are three dependent variables. "Suboptimality" has already been discussed. "Flutter" is an interesting phenomenon, exhibited by some queries, in which the query optimizer returns to a previous plan as the inaccuracy increases. Finally, "Plan Space Complexity" is a variable that expresses how challenging the plan space is. The causal model predicts that the plan space will be more complex when the query is complex (say, more tables to join) as well as when the optimizer is complex (say, more physical operators for joins).

Each of these dependent variables can be associated with one or more measures. Flutter can be operationalized as a categorical variable with the value of 1 if the specific query run on the specific DBMS (of a specific optimizer complexity and resulting plan space complexity) flutters. Plan Space Complexity can be operationalized several ways: through the number of plans considered, the number of plans actually generated by the optimizer when presented with different cardinality estimates of the underlying tables, the average number of operators generated, and the number of different operators involved in some plan that is considered.

Less ...

The central aspects of this model for the question at hand are the Optimizer Complexity variable and the causal paths from that variable to Suboptimality. The causal model states that if the optimizer is very complex, i.e., involving a large set of components that interact in complex ways, the optimizer will have a higher chance of fluttering as well as a higher chance of generating a suboptimal plan. This effect is due to four paths in the causal model: Optimizer Complexity can cause Suboptimality (i) directly, or (ii) by increasing the prevalence of flutter, which can cause suboptimality, or (iii) by increasing plan space complexity, which can directly cause suboptimality, or (iv) indirectly through flutter. Query Complexity also causes Suboptimality, but only indirectly, mediated by the Plan Space Complexity.

The next step is to develop hypotheses from the model that make predictions that can then be tested experimentally. Only if the model holds up to rigorous testing can it be used to address the question we started out with.

Analyses of experimental data over such predictive models also can indicate the relative strength of the causal paths in the model. For example, there are three causal paths impinging on Suboptimality in our model. Regression analysis of experimental data can estimate the amount of variance accounted for by each of these paths. Experiments can thus validate or refute intuition about what is really going on in these complex software systems, and can indicate the relative important of identified causal variables.

Summary

Structural causal models allow predictions of the behavior of computational tools. Predictive models can make predictions about the tool itself, as the DBMS example above illustrates, or about the use of such tools, or about computation itself. A predictive model can be tested by comparing its predictions to what we observe in experiments. These tests give us evidence for whether the model accurately describes nature, lending credence to the model, and thus helping us uncover causal relationships. Knowing about these underlying causal relationship enables control of the computational tool, which then enables improvement (in terms of performance, functionality, reliability, and other engineering goals) as well as possibly mathematical insights in the form of new theorems.

Webmaster: Andrey Kvochko