A Causal Analytics Toolkit (CAT) for Assessing Potential Causal Relations in Data

May 11, 2016

Louis Anthony Cox, Jr.


Introduction

To best inform real-world policy decisions, it is necessary for policy analysts, risk analysts, scientists, and economists to attempt to answer the crucial question:  How would changing what we choose to do change the consequences that we care about?  In a world of realistically incomplete knowledge and imperfect information, the answer is seldom certain. A more sophisticated version of the same question is then: How would the probabilities of consequences that we care about change if we made different choices?  This is the fundamental question of causal analysis in public policy, linking alternative choices to expected results and to probabilities of different consequences when prediction with certainty is impossible.  Such causal analysis provides the key technical information on which decision analysis, risk analysis, and policy analysis build.

Expert Opinions are Often Unreliable

Several different ways to try to answer this causal analysis question have become well established in policy and regulatory circles.  One popular approach is to elicit the opinions of selected experts. But experts are notoriously prone to error about causation, even when they feel confident about their answers (Kahneman, 2012; Tetlock and Gardner, 2015).  Moreover, they are often asked to answer misleading or meaningless questions, such as “What is the probability that this association is causal?” rather than better questions such as “What fraction of this association is causal?”

Statistical and Econometric Models Usually Require Uncertain Assumptions

A second approach is to select a statistical or econometric model, estimate its parameters from data, and then use it to answer what-if questions and to solve for actions that make preferred outcomes more likely and undesired outcomes less likely.  But models, too, may be mistaken or unreliable.  Recipients of model-based predictions are often left to wonder how much the predictions are determined by data and how much by the investigator’s choice of modeling assumptions. In general, model-based predictions are trustworthy only when a model has been carefully validated for the specific circumstances to which it is applied, and such validation is rare in practice.  The availability of powerful statistical modeling packages has made it easy to search for combinations of modeling assumptions that imply desired (e.g., publishable) results, a practice known as p-hacking that undermines the credibility of many published scientific findings and reported significance levels based on statistical modeling.

A Data Science Approach: The Causal Analytics Toolkit (CAT)

The Causal Analytics Toolkit (CAT) is an Excel add-in for Microsoft Windows users, developed by Cox Associates with support from the GW Regulatory Studies Center, that takes a third, more objective, approach to causal analysis: Use data to discover how changes in inputs have changed outcomes in the past, and learn from this experience the causal relations among variables.  Then, use this causal knowledge to make choices that increase the desirability (e.g., the expected utility) of the probability distributions of future outcomes. 

This empirically guided approach, which is becoming increasingly promising in the era of big data and rapidly advancing data science, requires both relevant data and methods for analyzing it to determine what has caused what – the central challenge for causal analytics in data science.  To guide sound policy-making, the causal analytics algorithms that process data to reach causal conclusions must identify stable causal laws or relations that will hold between future actions and consequence probabilities.  Thus, they must do more than quantify past statistical associations between or among variables (such as policy inputs and outputs): they must also identify the pathways by which changes in some variables can produce changes in others.  To do so without relying on unproved modeling assumptions, CAT applies the following principles of data science for causal analytics: 

  • Information principle for identifying potential causes:  For X to be a cause of Y, X must be informative about Y.  That is, it must be possible to predict Y better if X is known than if it is not, and there must be no set of other observed variables that makes Y statistically independent of X once their values are conditioned on.  Several statistical tests for this information condition are built into CAT.  The information relations among variables are summarized in a network diagram with arrows between variables indicating which variables are informative about which others.  In such a diagram, the potential causes of any variable are limited to its neighbors in the diagram, i.e., other variables that are informative about it.
  • Temporal principles for identifying potential causes:  Not only must causes precede their effects, but past changes in causes should help to predict and explain future changes in their effects.
  • Ensemble principle for model uncertainty:  To avoid making causal conclusions contingent on modeling assumptions having unknown or uncertain validity, CAT relies heavily on “model-free” (non-parametric) statistical methods and on the use of model ensembles, which are collections of hundreds of non-parametric models that provide plausible descriptions of the data.  Averaging the predictions from many such models typically yields more accurate and reliable predictions than using any single model, so CAT incorporates such non-parametric model ensembles as a primary way to estimate the quantitative relations between variables.
  • Conditional expectation principle for quantifying causal relations:  The direct causal relation between one variable and another (e.g., between exposure and effect) can be quantified by studying how the effect changes as only that one cause is varied, holding all other variables fixed at their observed values.   CAT includes sensitivity plots (also called partial dependence plots) to present this information, as well as classification and regression trees that show how the effect depends on multiple, possibly interacting, variables.
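
The information principle above can be illustrated with a small simulation. The sketch below is written in Python, not in R or in CAT itself, and it uses linear partial correlation as a simple stand-in for the non-parametric conditional independence tests that CAT applies. The variable names (x, w, z, y), the simulated data, and the helper functions corr, partial_corr, and simulate are all hypothetical and chosen only for illustration. In the simulated data, z is a common cause of x and y, w is a direct cause of y, and x has no effect on y.

```python
import math
import random


def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(a, b, c):
    """Correlation of a and b after linearly adjusting both for c."""
    r_ab, r_ac, r_bc = corr(a, b), corr(a, c), corr(b, c)
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))


def simulate(n=2000, seed=0):
    """Hypothetical data: z drives x and y; w drives y; x does not affect y."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]          # common cause
    w = [rng.gauss(0, 1) for _ in range(n)]          # direct cause of y
    x = [zi + rng.gauss(0, 1) for zi in z]           # depends on z only
    y = [2 * zi + 1.5 * wi + rng.gauss(0, 1) for zi, wi in zip(z, w)]
    return x, w, z, y


x, w, z, y = simulate()

r_xy = corr(x, y)             # marginal association between x and y
pc_x = partial_corr(x, y, z)  # association left after conditioning on z
pc_w = partial_corr(w, y, z)  # w stays informative after conditioning on z

print(f"corr(x, y)          = {r_xy:.3f}")
print(f"partial corr x,y|z  = {pc_x:.3f}")
print(f"partial corr w,y|z  = {pc_w:.3f}")
```

In this sketch, x is clearly associated with y when z is ignored, but the association should shrink to roughly zero once z is conditioned on, so x fails the information condition and would not be kept as a potential cause of y. By contrast, w remains informative about y after conditioning on z and would be kept as a neighbor of y. Partial correlation detects only linear dependence, which is why it is offered here as an illustration of the idea rather than as the test CAT uses.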

Easy Access to Analytics for All

The Causal Analytics Toolkit (CAT) provides simple, powerful commands and a point-and-click interface for applying these principles and for doing other advanced analytics (e.g., estimating and visualizing regression models and associations) from Excel in Windows using advanced R packages from the R project for statistical computing. It can be used in many ways, from push-button, fully automated analyses to programming in R with additional CAT commands to increase programming productivity, depending on the user’s level of experience and familiarity with statistics and with R. For users who have no knowledge of R, a few mouse clicks will display results from advanced R packages without the need to learn R.

Advanced Analytics Made Simple

CAT gives simplified access to the analytics power of a vast array of R packages for detecting, analyzing, quantifying, and visualizing associations and other relations (such as information relations among multiple variables) in data sets using standardized, well-documented, and well-supported algorithms. For advanced users, CAT provides a convenient way to integrate R programming directly into Excel, while also providing pre-built commands with simplified syntax, push-button analytics capabilities that can save time on routine tasks, and reports that often integrate the analyses from multiple R packages to provide different perspectives on relations in the data.  The current version of CAT for Windows includes a first working prototype of artificial intelligence (AI) support providing rules for automatically selecting and running appropriate analyses and interpreting their results for non-statisticians.  This approach to intelligent automated analysis will be extended in future releases of CAT.

CAT’s developers plan to add more capabilities and to update CAT frequently as R releases and packages are updated and as new packages for advanced analytics and causal analysis are added to the R project’s CRAN repository.  Users are therefore encouraged to check for CAT updates often and to send comments, questions, notifications of bugs or difficulties in using CAT, and suggestions for improvements and additions to [email protected] or [email protected].