Causality in Time Series - ClopiNet

Causality in Time Series

Volume 5: Causality in Time Series

Florin Popescu, Isabelle Guyon and Nicola Talbot

Editors: Florin Popescu and Isabelle Guyon
Production Editor: Nicola Talbot


Foreword

Florin Popescu
Fraunhofer Institute FIRST


Preface

The following is a print version of articles submitted by speakers at the Mini Symposium on Time Series Causality, which was part of the 2009 Neural Information Processing Systems (NIPS) Conference in Vancouver, CA.

March 2011

The Editorial Team:

Florin Popescu
Fraunhofer Institute FIRST, Berlin
florin.popescu@first.fraunhofer.de

Isabelle Guyon
Clopinet, San Francisco
guyon@clopinet.com


Table of Contents

Foreword
Preface
Linking Granger Causality and the Pearl Causal Model with Settable Systems
  H. White, K. Chalak & X. Lu; JMLR W&CP 12:1–29, 2011.
Robust statistics for describing causality in multivariate time-series
  F. Popescu; JMLR W&CP 12:30–64, 2011.
Causal Time Series Analysis of functional Magnetic Resonance Imaging Data
  A. Roebroeck, A.K. Seth & P. Valdes-Sosa; JMLR W&CP 12:65–94, 2011.
Causal Search in Structural Vector Autoregressive Models
  A. Moneta, N. Chlaß, D. Entner & P. Hoyer; JMLR W&CP 12:95–114, 2011.
Time Series Analysis with the Causality Workbench
  I. Guyon, A. Statnikov & C. Aliferis; JMLR W&CP 12:115–139, 2011.


JMLR: Workshop and Conference Proceedings 12:1–29

Linking Granger Causality and the Pearl Causal Model with Settable Systems

Halbert White
Department of Economics
University of California, San Diego
La Jolla, CA 92093
hwhite@ucsd.edu

Karim Chalak
Department of Economics
Boston College
140 Commonwealth Avenue
Chestnut Hill, MA 02467
chalak@bc.edu

Xun Lu
Department of Economics
Hong Kong University of Science and Technology
Clear Water Bay, Hong Kong
xunlu@ust.hk

Editors: Florin Popescu and Isabelle Guyon

Abstract

The causal notions embodied in the concept of Granger causality have been argued to belong to a different category than those of Judea Pearl's Causal Model, and so far their relation has remained obscure. Here, we demonstrate that these concepts are in fact closely linked by showing how each relates to straightforward notions of direct causality embodied in settable systems, an extension and refinement of the Pearl Causal Model designed to accommodate optimization, equilibrium, and learning. We then provide straightforward practical methods to test for direct causality using tests for Granger causality.

Keywords: Causal Models, Conditional Exogeneity, Conditional Independence, Granger Non-causality

1. Introduction

The causal notions embodied in the concept of Granger causality ("G-causality") (e.g., Granger, C.W.J., 1969; Granger, C.W.J. and P. Newbold, 1986) are probabilistic, relating to the ability of one time series to predict another, conditional on a given information set. On the other hand, the causal notions of the Pearl Causal Model ("PCM") (e.g., Pearl, J., 2000) involve specific notions of interventions and of functional rather than probabilistic dependence.
The relation between these causal concepts has so far remained obscure. For his part, Granger, C.W.J. (1969) acknowledged that G-causality was not "true" causality, whatever that might be, but that it seemed likely to be an important part of the full story. On the other hand, Pearl, J. (2000, p. 39) states that

© H. White, K. Chalak & X. Lu.


"econometric concepts such as 'Granger causality' (Granger, C.W.J., 1969) and 'strong exogeneity' (Engle, R., D. Hendry, and J.-F. Richard, 1983) will be classified as statistical rather than causal." In practice, especially in economics, numerous studies have used G-causality either explicitly or implicitly to draw structural or policy conclusions, but without any firm foundation.

Recently, White, H. and X. Lu (2010a, "WL") have provided conditions under which G-causality is equivalent to a form of direct causality arising naturally in dynamic structural systems, defined in the context of settable systems. The settable systems framework, introduced by White, H. and K. Chalak (2009, "WC"), extends and refines the PCM to accommodate optimization, equilibrium, and learning. In this paper, we explore the relations between direct structural causality in the settable systems framework and notions of direct causality in the PCM for both recursive and non-recursive systems. The close correspondence between these concepts in the recursive systems relevant to G-causality then enables us to show that there is in fact a close linkage between G-causality and PCM notions of direct causality. This enables us to provide straightforward practical methods to test for direct causality using tests for Granger causality.

In a related paper, Eichler, M. and V. Didelez (2009) also study the relation between G-causality and interventional notions of causality. They give conditions under which Granger non-causality implies that an intervention has no effect. In particular, Eichler, M. and V. Didelez (2009) use graphical representations as in Eichler, M.
(2007) of given G-causality relations satisfying the "global Granger causal Markov property" to provide graphical conditions for the identification of effects of interventions in "stable" systems. Here, we pursue a different route for studying the interrelations between G-causality and interventional notions of causality. Specifically, we see that G-causality and certain settable systems notions of direct causality based on functional dependence are equivalent under a conditional form of exogeneity. Our conditions are alternative to "stability" and the "global Granger causal Markov property," although particular aspects of our conditions have a similar flavor.

As a referee notes, the present work also provides a rigorous complement, in discrete time, to work by other authors in this volume (for example Roebroeck, A., Seth, A.K., and Valdes-Sosa, P., 2011) on combining structural and dynamic concepts of causality.

The plan of the paper is as follows. In Section 2, we briefly review the PCM. In Section 3, we motivate settable systems by discussing certain limitations of the PCM using a series of examples involving optimization, equilibrium, and learning. We then specify a formal version of settable systems that readily accommodates the challenges to causal discourse presented by the examples of Section 3. In Section 4, we define direct structural causality for settable systems and relate this to corresponding notions in the PCM. The close correspondence between these concepts in recursive systems establishes the first step in linking G-causality and the PCM.
In Section 5, we discuss how the results of WL complete the chain by linking direct structural causality and G-causality. This also involves a conditional form of exogeneity. Section 6 constructs convenient practical tests for structural causality based on proposals of WL, using tests for G-causality and conditional exogeneity. Section 7 contains a summary and concluding remarks.

2. Pearl's Causal Model

Pearl's definition of a causal model (Pearl, J., 2000, def. 7.1.1, p. 203) provides a formal statement of elements supporting causal reasoning. The PCM is a triple $M := (u, v, f)$, where $u := \{u_1, \ldots, u_m\}$ contains "background" variables determined outside the model, $v := \{v_1, \ldots, v_n\}$ contains "endogenous" variables determined within the model, and $f := \{f_1, \ldots, f_n\}$ contains "structural" functions specifying how each endogenous variable is determined by the other variables of the model, so that $v_i = f_i(v_{(i)}, u)$, $i = 1, \ldots, n$. Here, $v_{(i)}$ is the vector containing every element of $v$ but $v_i$. The integers $m$ and $n$ are finite. The elements of $u$ and $v$ are system "units."

Finally, the PCM requires that for each $u$, $f$ yields a unique fixed point. Thus, there must be a unique collection $g := \{g_1, \ldots, g_n\}$ such that for each $u$,

$$v_i = g_i(u) = f_i(g_{(i)}(u), u), \quad i = 1, \ldots, n. \tag{1}$$

The unique fixed point requirement is crucial to the PCM, as this is necessary for defining the potential response function (Pearl, J., 2000, def. 7.1.4). This provides the foundation for discourse about causal relations between endogenous variables; without the potential response function, causal discourse is not possible in the PCM.
A variant of the PCM (Halpern, J., 2000) does not require a fixed point, but if any exist, there may be multiple collections of functions $g$ yielding a fixed point. We call this a Generalized Pearl Causal Model (GPCM). As GPCMs also do not possess an analog of the potential response function in the absence of a unique fixed point, causal discourse in the GPCM is similarly restricted.

In presenting the PCM, we have adapted Pearl's notation somewhat to facilitate subsequent discussion, but all essential elements are present and complete.

Pearl, J. (2000) gives numerous examples for which the PCM is ideally suited for supporting causal discourse. As a simple game-theoretic example, consider a market in which there are exactly two firms producing similar but not identical products (e.g., Coke and Pepsi in the cola soft-drink market). Price determination in this market is a two-player game known as "Bertrand duopoly."

In deciding its price, each firm maximizes its profit, taking into account the prevailing cost and demand conditions it faces, as well as the price of its rival. A simple system representing price determination in this market is

$$p_1 = a_1 + b_1 p_2$$
$$p_2 = a_2 + b_2 p_1.$$

Here, $p_1$ and $p_2$ represent the prices chosen by firms 1 and 2 respectively, and $a_1$, $b_1$, $a_2$, and $b_2$ embody the prevailing cost and demand conditions.
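As a quick numerical check (the parameter values below are our own illustrative choices, not from the text), iterating the two best-response equations converges to the unique fixed point whenever $|b_1 b_2| < 1$, matching the closed-form solution:

```python
# Bertrand duopoly: iterate the best-response system
#   p1 = a1 + b1*p2,  p2 = a2 + b2*p1
# Parameter values are illustrative; any |b1*b2| < 1 works.
a1, b1, a2, b2 = 10.0, 0.5, 8.0, 0.4

p1, p2 = 0.0, 0.0           # arbitrary starting prices
for _ in range(200):        # simultaneous best-response iteration
    p1, p2 = a1 + b1 * p2, a2 + b2 * p1

# Closed-form unique fixed point (valid since b1*b2 != 1)
p1_star = (a1 + b1 * a2) / (1 - b1 * b2)
p2_star = (a2 + b2 * a1) / (1 - b1 * b2)

print(p1, p1_star)  # both 17.5
print(p2, p2_star)  # both 15.0
```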


We see that this maps directly to the PCM with $n = 2$, endogenous variables $v = (p_1, p_2)$, background variables $u = (a_1, b_1, a_2, b_2)$, and structural functions

$$f_1(v_2, u) = a_1 + b_1 p_2$$
$$f_2(v_1, u) = a_2 + b_2 p_1.$$

These functions are the Bertrand "best response" or "reaction" functions. Further, provided $b_1 b_2 \neq 1$, this system has a unique fixed point,

$$p_1 = g_1(u) = (a_1 + b_1 a_2)/(1 - b_1 b_2)$$
$$p_2 = g_2(u) = (a_2 + b_2 a_1)/(1 - b_1 b_2).$$

This fixed point represents the Nash equilibrium for this two-player game.

Clearly, the PCM applies perfectly, supporting causal discourse for this Bertrand duopoly game. Specifically, we see that $p_1$ causes $p_2$ and vice versa, and that the effect of $p_2$ on $p_1$ is $b_1$, whereas that of $p_1$ on $p_2$ is $b_2$.

In fact, the PCM applies directly to a wide variety of games, provided that the game has a unique equilibrium. But there are many important cases where there may be no equilibrium or multiple equilibria. This limits the applicability of the PCM. We explore examples of this below, as well as other features of the PCM that limit its applicability.

3. Settable Systems

3.1. Why Settable Systems?

WC motivate the development of the settable system (SS) framework as an extension of the PCM that accommodates optimization, equilibrium, and learning, which are central features of the explanatory structures of interest in economics.
But these features are of interest more broadly, especially in machine learning, as optimization corresponds to any intelligent or rational behavior, whether artificial or natural; equilibrium (e.g., Nash equilibrium) or transitions toward equilibrium characterize stable interactions between multiple interacting systems; and learning corresponds to adaptation and evolution within and between interacting systems. Given the prevalence of these features in natural and artificial systems, it is clearly desirable to provide means for explicit and rigorous causal discourse relating to systems with these features.

To see why an extension of the PCM is needed to handle optimization, equilibrium, and learning, we consider a series of examples that highlight certain limiting features of the PCM: (i) in the absence of a unique fixed point, causal discourse is undefined; (ii) background variables play no causal role; (iii) the role of attributes is restricted; and (iv) only a finite rather than a countable number of units is permitted. WC discuss further relevant aspects of the PCM, but these suffice for present purposes.

Example 3.1 (Equilibria in Game Theory) Our first example concerns general two-player games, extending the discussion that we began above in considering Bertrand duopoly.


Let two players $i = 1, 2$ have strategy sets $S_i$ and utility functions $u_i$, such that $\pi_i = u_i(z_1, z_2)$ gives player $i$'s payoff when player 1 plays $z_1 \in S_1$ and player 2 plays $z_2 \in S_2$. Each player solves the optimization problem

$$\max_{z_i \in S_i} u_i(z_1, z_2).$$

The solution to this problem, when it exists, is player $i$'s best response, denoted

$$y_i = r_i^e(z_{(i)}; a),$$

where $r_i^e$ is player $i$'s best response function (the superscript "e" stands for "elementary," conforming to notation formally introduced below); $z_{(i)}$ denotes the strategy played by the player other than $i$; and $a := (S_1, u_1, S_2, u_2)$ denotes given attributes defining the game. For simplicity here, we focus on "pure strategy" games; see Gibbons, R. (1992) for an accessible introduction to game theory.

Different configurations for $a$ correspond to different games. For example, one of the most widely known games is prisoner's dilemma, where two suspects in a crime are separated and offered a deal: if one confesses and the other does not, the confessor is released and the other goes to jail. If both confess, both receive a mild punishment. If neither confesses, both are released. The strategies are whether to confess or not. Each player's utility is determined by both players' strategies and the punishment structure.

Another well known game is hide and seek. Here, player 1 wins by matching player 2's strategy and player 2 wins by mismatching player 1's strategy.
A familiar example is a penalty kick in soccer: the goalie wins by matching the direction (right or left) of the kicker's kick; the kicker wins by mismatching the direction of the goalie's lunge. The same structure applies to baseball (hitter vs. pitcher) or troop deployment in battle (aggressor vs. defender).

A third famous game is battle of the sexes. In this game, Ralph and Alice are trying to decide how to spend their weekly night out. Alice prefers the opera, and Ralph prefers boxing; but both would rather be together than apart.

Now consider whether the PCM permits causal discourse in these games, e.g., about the effect of one player's action on that of the other. We begin by mapping the elements of the game to the elements of the PCM. First, we see that $a$ corresponds to PCM background variables $u$, as these are specified outside the system. The variables determined within the system, i.e., the PCM endogenous variables, are $z := (z_1, z_2)$, corresponding to $v$, provided that (for now) we drop the distinction between $y_i$ and $z_i$. Finally, we see that the best response functions $r_i^e$ correspond to the PCM structural functions $f_i$.

To determine whether the PCM permits causal discourse in these games, we can check whether there is a unique fixed point for the best responses. In prisoner's dilemma, there is indeed a unique fixed point (both confess), provided the punishments are suitably chosen. The PCM therefore applies to this game to support causal discourse. But there is no fixed point for hide and seek, so the PCM cannot support causal discourse there.
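The fixed-point check can be made concrete by enumeration. In this sketch the payoff numbers are illustrative assumptions of ours (a standard prisoner's dilemma and a matching-pennies rendering of hide and seek), chosen only so that each strategy profile can be tested for being a mutual best response:

```python
# Enumerate pure-strategy fixed points of the best-response map for
# two 2x2 games. Payoff entries are illustrative choices of ours.
def fixed_points(payoffs, strategies):
    """Profiles (z1, z2) where each z_i is a best response to the other."""
    out = []
    for z1 in strategies:
        for z2 in strategies:
            u1 = {s: payoffs[(s, z2)][0] for s in strategies}  # player 1's options
            u2 = {s: payoffs[(z1, s)][1] for s in strategies}  # player 2's options
            if u1[z1] == max(u1.values()) and u2[z2] == max(u2.values()):
                out.append((z1, z2))
    return out

# Prisoner's dilemma payoffs (player 1, player 2); higher is better.
pd = {("confess", "confess"): (-2, -2), ("confess", "quiet"): (0, -3),
      ("quiet", "confess"): (-3, 0),    ("quiet", "quiet"): (-1, -1)}

# Hide and seek (matching-pennies structure): player 1 wins by matching.
hs = {("L", "L"): (1, -1), ("L", "R"): (-1, 1),
      ("R", "L"): (-1, 1), ("R", "R"): (1, -1)}

print(fixed_points(pd, ["confess", "quiet"]))  # unique: both confess
print(fixed_points(hs, ["L", "R"]))            # empty: no pure fixed point
```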
On the other hand, there are two fixed points for battle of the sexes: both Ralph and Alice choose opera, or both choose boxing. The PCM does not support causal discourse there either. Nor does the GPCM apply to the latter games, because even though


it does not require a unique fixed point, the potential response functions required for causal discourse are not defined.

The importance of game theory generally in describing the outcomes of interactions of goal-seeking agents, and the fact that the unique fixed point requirement prohibits the PCM from supporting causal discourse in important cases, strongly motivate formulating a causal framework that drops this requirement. As we discuss below, the SS framework does not require a unique fixed point, and it applies readily to games generally. Moreover, recognizing and enforcing the distinction between $y_i$ ($i$'s best response strategy) and $z_i$ (an arbitrary setting of $i$'s strategy) turns out to be an important component to eliminating this requirement.

Another noteworthy aspect of this example is that $a$ is a fixed list of elements that define the game. Although elements of $a$ may differ across players, they do not vary for a given player. This distinction should be kept in mind when referring to the elements of $a$ as background "variables."

Example 3.2 (Optimization in Consumer Demand) The neoclassical theory of consumer demand posits that consumers determine their optimal goods consumption by maximizing utility subject to a budget constraint (see, e.g., Varian, H., 2009). Suppose for simplicity that there are just two goods, say beer and pizza. Then a typical consumer solves the problem

$$\max_{z_1, z_2} U(z_1, z_2) \quad \text{s.t.} \quad m = z_1 + p z_2,$$

where $z_1$ and $z_2$ represent quantities consumed of goods 1 (beer) and 2 (pizza) respectively, and $U$ is the utility function that embodies the consumer's preferences for the two goods. For simplicity, let the price of a beer be \$1, and let $p$ represent the price of pizza; $m$ represents funds available for expenditure, "income" for short.¹ The budget constraint $m = z_1 + p z_2$ ensures that total expenditure on beer and pizza does not exceed income (no borrowing) and also that total expenditure on beer and pizza is not less than $m$. (As long as utility is increasing in consumption of the goods, it is never optimal to expend less than the funds available.)

¹ Since a beer costs a dollar, it is the "numeraire," implying that income is measured in units of beer. This is a convenient convention ensuring that we only need to keep track of the price ratio between pizza and beer, $p$, rather than their two separate prices.

Solving the consumer's demand problem leads to the optimal consumer demands for beer and pizza, $y_1$ and $y_2$. It is easy to show that these can be represented as

$$y_1 = r_1^a(p, m; a) \quad \text{and} \quad y_2 = r_2^a(p, m; a),$$

where $r_1^a$ and $r_2^a$ are known as the consumer's market demand functions for beer and pizza. The "a" superscript stands for "agent," corresponding to notation formally introduced below. The attributes $a$ include the consumer's utility function $U$ (preferences) and the admissible values for $z_1$, $z_2$, $p$, and $m$, e.g., $\mathbb{R}_+ := [0, \infty)$.

Now consider how this problem maps to the PCM. First, we see that $a$ and $(p, m)$ correspond to the background variables $u$, as these are not determined within the system. Next, we see that $y := (y_1, y_2)$ corresponds to PCM endogenous variables $v$. Finally,


we see that the consumer demand functions $r_i^a$ correspond to the PCM structural functions $f_i$. Also, because the demand for beer, $y_1$, does not enter the demand function for pizza, $r_2^a$, and vice versa, there is a unique fixed point for this system of equations. Thus, the PCM supports causal discourse in this system.

Nevertheless, this system is one where, in the PCM, the causal discourse natural to economists is unavailable. Specifically, economists find it natural to refer to "price effects" and "income effects" on demand, implicitly or explicitly viewing price $p$ and income $m$ as causal drivers of demand. For example, the pizza demand price effect is $(\partial/\partial p)\, r_2^a(p, m; a)$. This represents how much optimal pizza consumption (demand) will change as a result of a small (marginal) increase in the price of pizza. Similarly, the pizza demand income effect is $(\partial/\partial m)\, r_2^a(p, m; a)$, representing how much optimal pizza consumption will change as a result of a small increase in income. But in the PCM, causal discourse is reserved only for endogenous variables $y_1$ and $y_2$. The fact that background variables $p$ and $m$ do not have causal status prohibits speaking about their effects.

Observe that the "endogenous" status of $y$ and "exogenous" status of $p$ and $m$ is determined in SS by utility maximization, the "governing principle" here. In contrast, there is no formal mechanism in the PCM that permits making these distinctions.
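To make price and income effects concrete, consider an assumed Cobb-Douglas utility $U(z_1, z_2) = z_1^{1-\alpha} z_2^{\alpha}$ (our illustrative choice, not from the text), for which the demands have the closed form $y_1 = (1-\alpha)m$ and $y_2 = \alpha m / p$. The sketch below checks the analytic price and income effects on pizza demand against numerical derivatives:

```python
# Cobb-Douglas consumer demand (illustrative utility choice):
#   U(z1, z2) = z1**(1 - alpha) * z2**alpha, budget m = z1 + p*z2,
# which gives demands y1 = (1 - alpha)*m and y2 = alpha*m/p.
alpha = 0.4

def demand_pizza(p, m):
    """Market demand function r2(p, m; a) for pizza under this utility."""
    return alpha * m / p

p, m, h = 2.0, 10.0, 1e-6

# Analytic effects for this utility:
price_effect  = -alpha * m / p**2   # (d/dp) r2 = -alpha*m/p^2
income_effect = alpha / p           # (d/dm) r2 = alpha/p

# Central differences agree with the analytic derivatives:
num_price  = (demand_pizza(p + h, m) - demand_pizza(p - h, m)) / (2 * h)
num_income = (demand_pizza(p, m + h) - demand_pizza(p, m - h)) / (2 * h)

print(price_effect, num_price)    # -1.0 and ~-1.0
print(income_effect, num_income)  # 0.2 and ~0.2
```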
Although causal discourse in the PCM can be rescued for such systems by "endogenizing" $p$ and $m$, that is, by positing additional structure that explains the genesis of $p$ and $m$ in terms of further background variables, this is unduly cumbersome. It is much more natural simply to permit $p$ and $m$ to have causal status from the outset, so that price and income effects are immediately meaningful, without having to specify their determining processes. The SS framework embodies this direct approach. Those familiar with theories of price and income determination will appreciate the considerable complications avoided in this way. The same simplifications occur with respect to the primitive variables appearing in any responses determined by optimizing behavior.

Also noteworthy here is the important distinction between $a$, which represents fixed attributes of the system, and $p$ and $m$, which are true variables that can each take a range of different possible values. As WC (p. 1774) note, restricting the role of attributes by "lumping together" attributes and structurally exogenous variables as background objects without causal status creates difficulties for causal discourse in the PCM:

    [this] misses the opportunity to make an important distinction between invariant aspects of the system units on the one hand and counterfactual variation admissible for the system unit values on the other. Among other things, assigning attributes to $u$ interferes with assigning natural causal roles to structurally exogenous variables.

By distinguishing between attributes and structurally exogenous variables, settable systems permit causal status for variables determined outside a given system, such as when price and income drive consumer demand.

Example 3.3 (Learning in Structural Vector Autoregressions) Structural vector autoregressions (VARs) are widely used to analyze time-series data. For example,


consider the structural VAR

$$y_{1,t} = a_{11} y_{1,t-1} + a_{12} y_{2,t-1} + u_{1,t}$$
$$y_{2,t} = a_{21} y_{1,t-1} + a_{22} y_{2,t-1} + u_{2,t}, \quad t = 1, 2, \ldots,$$

where $y_{1,0}$ and $y_{2,0}$ are given scalars, $a := (a_{11}, a_{12}, a_{21}, a_{22})'$ is a given real "coefficient" vector, and $\{u_t := (u_{1,t}, u_{2,t}) : t = 1, 2, \ldots\}$ is a given sequence. This system describes the evolution of $\{y_t := (y_{1,t}, y_{2,t}) : t = 1, 2, \ldots\}$ through time.

Now consider how this maps to the PCM. We see that $y_0 := (y_{1,0}, y_{2,0})$, $\{u_t\}$, and $a$ correspond to the PCM background variables $u$, as these are not determined within the system. Further, we see that the sequence $\{y_t\}$ corresponds to the endogenous variables $v$, and that the PCM structural functions $f_i$ correspond to

$$r_{1,t}(y^{t-1}, u^t; a) = a_{11} y_{1,t-1} + a_{12} y_{2,t-1} + u_{1,t}$$
$$r_{2,t}(y^{t-1}, u^t; a) = a_{21} y_{1,t-1} + a_{22} y_{2,t-1} + u_{2,t}, \quad t = 1, 2, \ldots,$$

where $y^{t-1} := (y_0, \ldots, y_{t-1})$ and $u^t := (u_1, \ldots, u_t)$ represent finite "histories" of the indicated variables. We also see that this system is recursive, and therefore has a unique fixed point.

The challenge to the PCM here is that it permits only a finite rather than a countable number of units: both the number of background variables ($m$) and endogenous variables ($n$) must be finite in the PCM, whereas the structural VAR requires a countable infinity of background and endogenous variables. In contrast, settable systems permit (but do not require) a countable infinity of units, readily accommodating structural VARs.

In line with our previous discussion, settable systems distinguish between system attributes $a$ (a fixed vector) and structurally exogenous causal variables $y_0$ and $\{u_t\}$.
The difference in the roles of $y_0$ and $\{u_t\}$ on the one hand and $a$ on the other is particularly clear in this example. In the PCM, these are lumped together as background variables devoid of causal status. Since $a$ is fixed, its lack of causal status is appropriate; indeed, $a$ represents effects here², not causes. But the lack of causal status is problematic for the variables $y_0$ and $\{u_t\}$; for example, this prohibits discussing the effects of structural "shocks" $u_t$.

Observe that the structural VAR represents $u_{1,t}$ as a causal driver of $y_{1,t}$, as is standard. Nevertheless, settable systems do not admit "instantaneous" causation, so even though $u_{1,t}$ has the same time index as $y_{1,t}$, i.e., $t$, we adopt the convention that $u_{1,t}$ is realized prior to $y_{1,t}$. That is, there must be some positive time interval $\delta > 0$, no matter how small, separating these realizations. For example, $\delta$ can represent the amount of time it takes to compute $y_{1,t}$ once all its determinants are in place. Strictly speaking, then, we could write $u_{1,t-\delta}$ in place of $u_{1,t}$, but for notational convenience, we leave this implicit. We refer to this as "contemporaneous" causation to distinguish it from instantaneous causation.

² For example, $(\partial/\partial y_{1,t-1})\, r_{1,t}(y^{t-1}, u^t; a) = a_{11}$ can be interpreted as the marginal effect of $y_{1,t-1}$ on $y_{1,t}$.
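The VAR above, and the least-squares learning of $a$ from observed realizations, can be sketched numerically. Coefficient values and the shock distribution below are illustrative assumptions of ours:

```python
# Simulate the bivariate structural VAR and learn a by least squares.
# Coefficient values and shock distribution are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.5, 0.2],
              [0.1, 0.4]])       # true coefficients (a11, a12; a21, a22)
T = 50_000                       # sample size

y = np.zeros((T + 1, 2))         # y[0] = y_0 = (0, 0), given
u = rng.normal(size=(T, 2))      # IID mean-zero shocks, realized "just before" y_t
for t in range(1, T + 1):
    y[t] = A @ y[t - 1] + u[t - 1]

# Least-squares estimator a_hat_T: regress y_t on y_{t-1}, using {y_t} only.
X, Y = y[:-1], y[1:]
A_hat = np.linalg.lstsq(X, Y, rcond=None)[0].T

print(np.round(A_hat, 2))        # close to A for large T
```

With IID mean-zero, finite-variance shocks and a stable coefficient matrix, the estimate tightens around the true $a$ as $T$ grows, illustrating that $a$ can be fully learned in the limit.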


A common focus of interest when applying structural VARs is to learn the coefficient vector $a$. In applications, it is typically assumed that the realizations $\{y_t\}$ are observed, whereas $\{u_t\}$ is unobserved. The least squares estimator for a sample of size $T$, say $\hat{a}_T$, is commonly used to learn (estimate) $a$ in such cases. This estimator is a straightforward function of $y^T$, say $\hat{a}_T = r_{a,T}(y^T)$. If $\{u_t\}$ is generated as a realization of a sequence of mean zero finite variance independent identically distributed (IID) random variables, then $\hat{a}_T$ generally converges to $a$ with probability one as $T \to \infty$, implying that $a$ can be fully learned in the limit. Viewing $\hat{a}_T$ as causally determined by $y^T$, we see that we require a countable number of units to treat this learning problem.

As these examples demonstrate, the PCM exhibits a number of features that limit its applicability to systems involving optimization, equilibrium, and learning. These limitations motivate a variety of features of settable systems, extending the PCM in ways that permit straightforward treatment of such systems. We now turn to a more complete description of the SS framework.

3.2. Formal Settable Systems

We now provide a formal description of settable systems that readily accommodates causal discourse in the foregoing examples and that also suffices to establish the desired linkage between Granger causality and causal notions in the PCM. The material that follows is adapted from Chalak, K. and H. White (2010).
For additional details, see WC.

A stochastic settable system is a mathematical framework in which a countable number of units $i$, $i = 1,\dots,n$, interact under uncertainty. Here, $n \in \bar{\mathbb{N}}^+ := \mathbb{N}^+ \cup \{\infty\}$, where $\mathbb{N}^+$ denotes the positive integers. When $n = \infty$, we interpret $i = 1,\dots,n$ as $i = 1,2,\dots$. Units have attributes $a_i \in \mathbb{A}$; these are fixed for each unit, but may vary across units. Each unit also has associated random variables, defined on a measurable space $(\Omega, \mathcal{F})$. It is convenient to define a principal space $\Omega_0$ and let $\Omega := \times_{i=0}^{n} \Omega_i$, with each $\Omega_i$ a copy of $\Omega_0$. Often, $\Omega_0 = \mathbb{R}$ is convenient. A probability measure $P_a$ on $(\Omega, \mathcal{F})$ assigns probabilities to events involving random variables. As the notation suggests, $P_a$ can depend on the attribute vector $a := (a_1,\dots,a_n) \in A := \times_{i=1}^{n} \mathbb{A}$.

The random variables associated with unit $i$ define a settable variable $\mathcal{X}_i$ for that unit. A settable variable $\mathcal{X}_i$ has a dual aspect. It can be set to a random variable denoted by $Z_i$ (the setting), where $Z_i : \Omega_i \to S_i$. Here $S_i$ denotes the admissible setting values for $Z_i$, a multi-element subset of $\mathbb{R}$. Alternatively, the settable variable can be free to respond to settings of other settable variables. In the latter case, it is denoted by the response $Y_i : \Omega \to S_i$. The response $Y_i$ of a settable variable $\mathcal{X}_i$ to the settings of other settable variables is determined by a response function, $r_i$. For example, $r_i$ can be determined by optimization, determining the response for unit $i$ that is best in some sense, given the settings of other settable variables.
The dual role of a settable variable $\mathcal{X}_i : \{0,1\} \times \Omega \to S_i$, distinguishing responses $\mathcal{X}_i(0,\omega) := Y_i(\omega)$ and settings $\mathcal{X}_i(1,\omega) := Z_i(\omega_i)$, $\omega \in \Omega$, permits formalizing the directional nature of causal relations, whereby settings of some variables (causes) determine responses of others.
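As an informal aid (not part of the formal development), the dual aspect just described — a variable either set to a value or free to respond through its response function — can be sketched as a small data structure; all names below are invented for the sketch:

```python
# Illustrative sketch of a settable variable's dual aspect.
class SettableVariable:
    def __init__(self, name, response_fn):
        self.name = name
        self.response_fn = response_fn  # r_i: settings of the others -> value
        self.setting = None             # holds Z_i while the variable is set

    def set_to(self, value):
        # Setting aspect: X_i(1, omega) := Z_i(omega_i).
        self.setting = value

    def free(self):
        # Return to the responding regime.
        self.setting = None

    def value(self, other_settings):
        # Response aspect: X_i(0, omega) := r_i(Z_(i)(omega); a).
        if self.setting is not None:
            return self.setting
        return self.response_fn(other_settings)

# A variable that, when free, responds with the sum of the others' settings.
x = SettableVariable("x", response_fn=lambda z: sum(z))
print(x.value([1.0, 2.0]))  # responding regime
x.set_to(10.0)
print(x.value([1.0, 2.0]))  # set regime: the setting overrides the response
```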


White Chalak Lu

The principal unit $i = 0$ also plays a key role. We let the principal setting $Z_0$ and principal response $Y_0$ of the principal settable variable $\mathcal{X}_0$ be such that $Z_0 : \Omega_0 \to \Omega_0$ is the identity map, $Z_0(\omega_0) := \omega_0$, and we define $Y_0(\omega) := Z_0(\omega_0)$. The setting $Z_0$ of the principal settable variable may directly influence all other responses in the system, whereas its response $Y_0$ is unaffected by other settings. Thus, $\mathcal{X}_0$ supports introducing an aspect of "pure randomness" to responses of settable variables.

3.2.1. Elementary Settable Systems

In elementary settable systems, $Y_i$ is determined (actually or potentially) by the settings of all other system variables, denoted $Z_{(i)}$. Thus, in elementary settable systems, $Y_i = r_i(Z_{(i)}; a)$. The settings $Z_{(i)}$ take values in $S_{(i)} \subseteq \Omega_0 \times_{j \neq i} S_j$. We have that $S_{(i)}$ is a strict subset of $\Omega_0 \times_{j \neq i} S_j$ if there are joint restrictions on the admissible settings values, for example, when certain elements of $S_{(i)}$ represent probabilities that sum to one.

We now give a formal definition of elementary settable systems.

Definition 3.1 (Elementary Settable System) Let $\mathbb{A}$ be a set and let attributes $a \in \mathbb{A}$ be given. Let $n \in \bar{\mathbb{N}}^+$ be given, and let $(\Omega, \mathcal{F}, P_a)$ be a complete probability space such that $\Omega := \times_{i=0}^{n} \Omega_i$, with each $\Omega_i$ a copy of the principal space $\Omega_0$, containing at least two elements.

Let the principal setting $Z_0 : \Omega_0 \to \Omega_0$ be the identity mapping.
For $i = 1,2,\dots,n$, let $S_i$ be a multi-element Borel-measurable subset of $\mathbb{R}$ and let settings $Z_i : \Omega_i \to S_i$ be surjective measurable functions. Let $Z_{(i)}$ be the vector including every setting except $Z_i$ and taking values in $S_{(i)} \subseteq \Omega_0 \times_{j \neq i} S_j$, $S_{(i)} \neq \emptyset$. Let response functions $r_i(\,\cdot\,; a) : S_{(i)} \to S_i$ be measurable functions and define responses $Y_i(\omega) := r_i(Z_{(i)}(\omega); a)$. Define settable variables $\mathcal{X}_i : \{0,1\} \times \Omega \to S_i$ as
$$\mathcal{X}_i(0,\omega) := Y_i(\omega) \quad \text{and} \quad \mathcal{X}_i(1,\omega) := Z_i(\omega_i), \quad \omega \in \Omega.$$
Define $Y_0$ and $\mathcal{X}_0$ by $Y_0(\omega) := \mathcal{X}_0(0,\omega) := \mathcal{X}_0(1,\omega) := Z_0(\omega_0)$, $\omega \in \Omega$.

Put $\mathcal{X} := \{\mathcal{X}_0, \mathcal{X}_1, \dots\}$. The triple $\mathcal{S} := \{(\mathbb{A},a), (\Omega,\mathcal{F},P_a), \mathcal{X}\}$ is an elementary settable system.

An elementary settable system thus comprises an attribute component, $(\mathbb{A},a)$, a stochastic component, $(\Omega,\mathcal{F},P_a)$, and a structural or causal component $\mathcal{X}$, consisting of settable variables whose properties are crucially determined by response functions $r := \{r_i\}$. It is formally correct to write $\mathcal{X}_a$ instead of $\mathcal{X}$; we write $\mathcal{X}$ for simplicity. Note the absence of any fixed point requirement, the distinct roles played by fixed attributes $a$ and setting variables $Z_i$ (including principal settings $Z_0$), and the countable number of units allowed.

Example 3.1 is covered by this definition. There, $n = 2$. Attributes $a := (\mathbb{S}_1, u_1, \mathbb{S}_2, u_2)$ belong to a suitably chosen set $\mathbb{A}$. Here, $S_i = \mathbb{S}_i$. We take $z_i = Z_i(\omega_i)$, $\omega_i \in \Omega_i$, and $y_i = Y_i(\omega) = r^e_i(Z_{(i)}(\omega); a) = r^e_i(z_{(i)}; a)$, $i = 1,2$. The "e" superscript in $r^e_i$ emphasizes that the response function is for an elementary settable system. In the example games, the responses $y_i$ only depend on settings $(z_1, z_2)$.
In more elaborate games, dependence on $z_0 = \omega_0$ can accommodate random responses.
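For illustration only, here is a minimal elementary settable system with $n = 2$ in which the principal setting $z_0 = \omega_0$ supplies a random component of unit 1's response; the response functions and attribute values are invented, not those of Example 3.1:

```python
import random

# Invented response functions r^e_i(z_(i); a) for a two-unit system.
def r1(z0, z2, a):
    # Unit 1 responds to unit 2's setting; z0 adds pure randomness.
    return a["b1"] * z2 + z0

def r2(z0, z1, a):
    # Unit 2 responds deterministically to unit 1's setting.
    return a["b2"] * z1

a = {"b1": 0.5, "b2": -1.0}   # fixed attributes
rng = random.Random(0)
z0 = rng.gauss(0.0, 1.0)      # principal setting: the system's randomness

# Responses to the settings (z1, z2) = (2.0, 4.0):
y1 = r1(z0, 4.0, a)
y2 = r2(z0, 2.0, a)
print(y1, y2)
```

Unit 2's response ignores $z_0$, so it is deterministic given $z_1$; unit 1's response inherits whatever distribution $P_a$ gives to $\omega_0$.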


3.2.2. Partitioned Settable Systems

In elementary settable systems, each single response $Y_i$ can freely respond to settings of all other system variables. We now consider systems where several settable variables jointly respond to settings of the remaining settable variables, as when responses represent the solution to a joint optimization problem. For this, partitioned settable systems group jointly responding variables into blocks. In elementary settable systems, every unit $i$ forms a block by itself. We now define general partitioned settable systems.

Definition 3.2 (Partitioned Settable System) Let $(\mathbb{A},a)$, $(\Omega,\mathcal{F},P_a)$, $\mathcal{X}_0$, $n$, and $S_i$, $i = 1,\dots,n$, be as in Definition 3.1. Let $\Pi = \{\Pi_b\}$ be a partition of $\{1,\dots,n\}$, with cardinality $B \in \bar{\mathbb{N}}^+$ ($B := \#\Pi$).

For $i = 1,2,\dots,n$, let $Z^{\Pi}_i$ be settings and let $Z^{\Pi}_{(b)}$ be the vector containing $Z_0$ and $Z^{\Pi}_i$, $i \notin \Pi_b$, taking values in $S^{\Pi}_{(b)} \subseteq \Omega_0 \times_{i \notin \Pi_b} S_i$, $S^{\Pi}_{(b)} \neq \emptyset$, $b = 1,\dots,B$. For $b = 1,\dots,B$ and $i \in \Pi_b$, suppose there exist measurable functions $r^{\Pi}_i(\,\cdot\,; a) : S^{\Pi}_{(b)} \to S_i$, specific to $\Pi$, such that responses $Y^{\Pi}_i(\omega)$ are jointly determined as
$$Y^{\Pi}_i := r^{\Pi}_i(Z^{\Pi}_{(b)}; a).$$
Define the settable variables $\mathcal{X}^{\Pi}_i : \{0,1\} \times \Omega \to S_i$ as
$$\mathcal{X}^{\Pi}_i(0,\omega) := Y^{\Pi}_i(\omega) \quad \text{and} \quad \mathcal{X}^{\Pi}_i(1,\omega) := Z^{\Pi}_i(\omega_i), \quad \omega \in \Omega.$$
Put $\mathcal{X}^{\Pi} := \{\mathcal{X}_0, \mathcal{X}^{\Pi}_1, \mathcal{X}^{\Pi}_2, \dots\}$.
The triple $\mathcal{S} := \{(\mathbb{A},a), (\Omega,\mathcal{F},P_a), (\Pi, \mathcal{X}^{\Pi})\}$ is a partitioned settable system.

The settings $Z^{\Pi}_{(b)}$ may be partition-specific; this is especially relevant when the admissible set $S^{\Pi}_{(b)}$ imposes restrictions on the admissible values of $Z^{\Pi}_{(b)}$. Crucially, response functions and responses are partition-specific. In Definition 3.2, the joint response function $r^{\Pi}_{[b]} := (r^{\Pi}_i, i \in \Pi_b)$ specifies how the settings $Z^{\Pi}_{(b)}$ outside of block $\Pi_b$ determine the joint response $Y^{\Pi}_{[b]} := (Y^{\Pi}_i, i \in \Pi_b)$, i.e., $Y^{\Pi}_{[b]} = r^{\Pi}_{[b]}(Z^{\Pi}_{(b)}; a)$. For convenience below, we let $\Pi_0 = \{0\}$ represent the block corresponding to $\mathcal{X}_0$.

Example 3.2 makes use of partitioning. Here, we have $n = 4$ settable variables with $B = 2$ blocks. Let settable variables 1 and 2 correspond to beer and pizza consumption, respectively, and let settable variables 3 and 4 correspond to price and income. The agent partition groups together all variables under the control of a given agent. Let the consumer be agent 2, so $\Pi_2 = \{1,2\}$. Let the rest of the economy, determining price and income, be agent 1, so $\Pi_1 = \{3,4\}$. The agent partition is $\Pi^a = \{\Pi_1, \Pi_2\}$. Then for block 2,
$$y_1 = Y^a_1(\omega) = r^a_1(Z_0(\omega_0), Z^a_3(\omega_3), Z^a_4(\omega_4); a) = r^a_1(p, m; a)$$
$$y_2 = Y^a_2(\omega) = r^a_2(Z_0(\omega_0), Z^a_3(\omega_3), Z^a_4(\omega_4); a) = r^a_2(p, m; a)$$
represents the joint demand for beer and pizza (belonging to block 2) as a function of settings of price and income (belonging to block 1). This joint demand is unique under


mild conditions. Observe that $z_0 = Z_0(\omega_0)$ formally appears as an allowed argument of $r^a_i$ after the second equality, but when the consumer's optimization problem has a unique solution, there is no need for a random component to demand. We thus suppress this argument in writing $r^a_i(p, m; a)$, $i = 1,2$. Nevertheless, when the solution to the consumer's optimization problem is not unique, a random component can act to ensure a unique consumer demand. We do not pursue this here; WC provide related discussion.

We write the block 1 responses for the price and income settable variables as
$$y_3 = Y^a_3(\omega) = r^a_3(Z_0(\omega_0), Z^a_1(\omega_1), Z^a_2(\omega_2); a) = r^a_3(z_0; a)$$
$$y_4 = Y^a_4(\omega) = r^a_4(Z_0(\omega_0), Z^a_1(\omega_1), Z^a_2(\omega_2); a) = r^a_4(z_0; a).$$
In this example, price and income are not determined by the individual consumer's demands, so although $Z^a_1(\omega_1)$ and $Z^a_2(\omega_2)$ formally appear as allowed arguments of $r^a_i$ after the second equality, we suppress these in writing $r^a_i(z_0; a)$, $i = 3,4$. Here, price and income responses (belonging to block 1) are determined solely by block 0 settings $z_0 = Z_0(\omega_0) = \omega_0$. This permits price and income responses to be randomly distributed, under the control of $P_a$.

It is especially instructive to consider the elementary partition for this example, $\Pi^e = \{\{1\},\{2\},\{3\},\{4\}\}$, so that $\Pi_i = \{i\}$, $i = 1,\dots,4$. The elementary partition specifies how each system variable freely responds to settings of all other system variables. In particular, it is easy to verify that when consumption of pizza is set to a given level, the consumer's optimal response is to spend whatever income is left on beer, and vice versa.
Thus,
$$y_1 = r^e_1(Z_0(\omega_0), Z^e_2(\omega_2), Z^e_3(\omega_3), Z^e_4(\omega_4); a) = r^e_1(z_2, p, m; a) = m - p z_2$$
$$y_2 = r^e_2(Z_0(\omega_0), Z^e_1(\omega_1), Z^e_3(\omega_3), Z^e_4(\omega_4); a) = r^e_2(z_1, p, m; a) = (m - z_1)/p.$$
Replacing $(y_1, y_2)$ with $(z_1, z_2)$, we see that this system does not have a unique fixed point, as any $(z_1, z_2)$ such that $m = z_1 + p z_2$ satisfies both
$$z_1 = m - p z_2 \quad \text{and} \quad z_2 = (m - z_1)/p.$$
Causal discourse in the PCM is ruled out by the lack of a fixed point. Nevertheless, the settable systems framework supports the natural economic causal discourse here about effects of prices, income, and, e.g., pizza consumption on beer demand. Further, in settable systems, the governing principle of optimization (embedded in $a$) ensures that the response functions for both the agent partition and the elementary partition are mutually consistent.

3.2.3. Recursive and Canonical Settable Systems

The link between Granger causality and the causal notions of the PCM emerges from a particular class of recursive partitioned settable systems that we call canonical settable systems, where the system evolves naturally without intervention. This corresponds to what are also called "idle regimes" in the literature (Pearl, J., 2000; Eichler, M. and V. Didelez, 2009; Dawid, 2010).
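Returning to the beer–pizza example of Section 3.2.2, the failure of a unique fixed point under the elementary partition is easy to verify numerically; the income and price values below are hypothetical:

```python
# Hypothetical income m and pizza price p.
m, p = 12.0, 3.0

def r1(z2):
    # Beer response when pizza consumption is set: spend what remains.
    return m - p * z2

def r2(z1):
    # Pizza response when beer consumption is set.
    return (m - z1) / p

# Every (z1, z2) on the budget line m = z1 + p * z2 solves both equations,
# so the system has a continuum of fixed points rather than a unique one.
for z2 in [0.0, 1.0, 2.0, 4.0]:
    z1 = m - p * z2
    assert r1(z2) == z1 and r2(z1) == z2

print("every budget-exhausting pair (z1, z2) is a fixed point")
```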


To define recursive settable systems, for $b \geq 0$ define $\Pi_{[0:b]} := \Pi_0 \cup \dots \cup \Pi_{b-1} \cup \Pi_b$.

Definition 3.3 (Recursive Partitioned Settable System) Let $\mathcal{S}$ be a partitioned settable system. For $b = 0,1,\dots,B$, let $Z^{\Pi}_{[0:b]}$ denote the vector containing the settings $Z^{\Pi}_i$ for $i \in \Pi_{[0:b]}$ and taking values in $S_{[0:b]} \subseteq \Omega_0 \times_{i \in \Pi_{[1:b]}} S_i$, $S_{[0:b]} \neq \emptyset$. For $b = 1,\dots,B$ and $i \in \Pi_b$, suppose that $r^{\Pi} := \{r^{\Pi}_i\}$ is such that the responses $Y^{\Pi}_i = \mathcal{X}^{\Pi}_i(0,\cdot)$ are determined as
$$Y^{\Pi}_i := r^{\Pi}_i(Z^{\Pi}_{[0:b-1]}; a).$$
Then we say that $\Pi$ is a recursive partition, that $r^{\Pi}$ is recursive, and that $\mathcal{S} := \{(\mathbb{A},a), (\Omega,\mathcal{F},P_a), (\Pi, \mathcal{X}^{\Pi})\}$ is a recursive partitioned settable system, or simply that $\mathcal{S}$ is recursive.

Example 3.2 is a recursive settable system, as the responses of block 1 depend on the settings of block 0, and the responses of block 2 depend on the settings of block 1.

Canonical settable systems are recursive settable systems in which the settings for a given block equal the responses for that block, i.e.,
$$Z^{\Pi}_{[b]} = Y^{\Pi}_{[b]} := r^{\Pi}_{[b]}(Z^{\Pi}_{[0:b-1]}; a), \quad b = 1,\dots,B.$$
Without loss of generality, we can represent canonical responses and settings solely as a function of $\omega_0$, so that
$$Z^{\Pi}_{[b]}(\omega_0) = Y^{\Pi}_{[b]}(\omega_0) := r^{\Pi}_{[b]}(Z^{\Pi}_{[0:b-1]}(\omega_0); a), \quad b = 1,\dots,B.$$
The canonical representation drops the distinction between settings and responses; we write
$$Y^{\Pi}_{[b]} = r^{\Pi}_{[b]}(Y^{\Pi}_{[0:b-1]}; a), \quad b = 1,\dots,B.$$
It is easy to see that the structural VAR of Example 3.3 corresponds to the canonical representation of a canonical settable system.
The canonical responses $y_0$ and $\{u_t\}$ belong to the first block, and canonical responses $y_t = (y_{1,t}, y_{2,t})$ belong to block $t+1$, $t = 1,2,\dots$. Example 3.3 implements the time partition, where joint responses for a given time period depend on previous settings.

4. Causality in Settable Systems and in the PCM

In this section we examine the relations between concepts of direct causality in settable systems and in the PCM, specifically the PCM notions of direct cause and controlled direct effect (Pearl, J. (2000, p. 222); Pearl, J. (2001, definition 1)). The close correspondence between these notions for the recursive systems relevant to Granger causality enables us to take the first step in linking Granger causality and causal notions in the PCM. Section 5 completes the chain by linking direct structural causality and Granger causality.


4.1. Direct Structural Causality in Settable Systems

Direct structural causality is defined for both recursive and non-recursive partitioned settable systems. For notational simplicity in what follows, we may drop the explicit partition superscript $\Pi$ when the specific partition is clearly understood. Thus, we may write $Y$, $Z$, and $\mathcal{X}$ in place of the more explicit $Y^{\Pi}$, $Z^{\Pi}$, and $\mathcal{X}^{\Pi}$ when there is no possibility of confusion.

Let $\mathcal{X}_j$ belong to block $b$ ($j \in \Pi_b$). Heuristically, we say that a settable variable $\mathcal{X}_i$, outside of block $b$, directly causes $\mathcal{X}_j$ in $\mathcal{S}$ when the response for $\mathcal{X}_j$ differs for different settings of $\mathcal{X}_i$, while holding all other variables outside of block $b$ to the same setting values. There are two main ingredients to this notion. The first ingredient is an admissible intervention. To define this, let $z^*_{(b);i}$ denote the vector otherwise identical to $z_{(b)}$, but replacing elements $z_i$ with $z^*_i$. An admissible intervention $z_{(b)} \to z^*_{(b);i} := (z_{(b)}, z^*_{(b);i})$ is a pair of distinct elements of $S_{(b)}$. The second ingredient is the behavior of the response under this intervention.

We formalize this notion of direct causality as follows.

Definition 4.1 (Direct Causality) Let $\mathcal{S}$ be a partitioned settable system. For given positive integer $b$, let $j \in \Pi_b$. (i) For given $i \notin \Pi_b$, $\mathcal{X}_i$ directly causes $\mathcal{X}_j$ in $\mathcal{S}$ if there exists an admissible intervention $z_{(b)} \to z^*_{(b);i}$ such that
$$r_j(z^*_{(b);i}; a) - r_j(z_{(b)}; a) \neq 0,$$
and we write $\mathcal{X}_i \Rightarrow^{D}_{\mathcal{S}} \mathcal{X}_j$. Otherwise, we say $\mathcal{X}_i$ does not directly cause $\mathcal{X}_j$ in $\mathcal{S}$ and write $\mathcal{X}_i \not\Rightarrow^{D}_{\mathcal{S}} \mathcal{X}_j$.
(ii) For $i, j \in \Pi_b$, $\mathcal{X}_i \not\Rightarrow^{D}_{\mathcal{S}} \mathcal{X}_j$.

We emphasize that although we follow the literature in referring to "interventions," with their mechanistic or manipulative connotations, the formal concept only involves the properties of a response function on its domain.

By definition, variables within the same block do not directly cause each other. In particular $\mathcal{X}_i \not\Rightarrow^{D}_{\mathcal{S}} \mathcal{X}_i$. Also, Definition 4.1 permits mutual causality, so that $\mathcal{X}_i \Rightarrow^{D}_{\mathcal{S}} \mathcal{X}_j$ and $\mathcal{X}_j \Rightarrow^{D}_{\mathcal{S}} \mathcal{X}_i$ without contradiction for $i$ and $j$ in different blocks. Nevertheless, in recursive systems, mutual causality is ruled out: if $\mathcal{X}_i \Rightarrow^{D}_{\mathcal{S}} \mathcal{X}_j$ then $\mathcal{X}_j \not\Rightarrow^{D}_{\mathcal{S}} \mathcal{X}_i$.

We call the response value difference in Definition 4.1 the direct effect of $\mathcal{X}_i$ on $\mathcal{X}_j$ in $\mathcal{S}$ of the specified intervention. Chalak, K. and H. White (2010) also study various notions of indirect and total causality.

These notions of direct cause and direct effect are well defined regardless of whether or not the system possesses a unique fixed point. Further, all settable variables, including $\mathcal{X}_0$, can act as causes and have effects. On the other hand, attributes $a$, being fixed, do not play a causal role. These definitions apply regardless of whether there is a finite or countable number of units. It is readily verified that this definition rigorously supports causal discourse in each of the examples of Section 3.

As we discuss next, in the recursive systems relevant for G-causality, these concepts correspond closely to notions of direct cause and "controlled" direct effect in


Pearl, J. (2000, 2001). To distinguish the settable system direct causality concept from Pearl's notion and later from Granger causality, we follow WL and refer to direct causality in settable systems as direct structural causality.

4.2. Direct Causes and Effects in the PCM

Pearl, J. (2000, p. 222), drawing on Galles, D. and J. Pearl (1997), gives a succinct statement of the notion of direct cause, coherent with the PCM as specified in Section 2:

    $X$ is a direct cause of $Y$ if there exist two values $x$ and $x'$ of $X$ and a value $u$ of $U$ such that $Y_{xr}(u) \neq Y_{x'r}(u)$, where $r$ is some realization of $V \setminus \{X, Y\}$.

To make this statement fully meaningful requires applying Pearl's (2000) definitions 7.1.2 (Submodel) and 7.1.4 (Potential Response) to arrive at the potential response, $Y_{xr}(u)$. For brevity, we do not reproduce Pearl's definitions here. Instead, it suffices to map $Y_{xr}(u)$ and its elements to their settable system counterparts. Specifically, $u$ corresponds to $(a, z_0)$; $x$ corresponds to $z_i$; $r$ corresponds to the elements of $z_{(b)}$ other than $z_0$ and $z_i$, say $z_{(b)(i,0)}$; and, provided it exists, $Y_{xr}(u)$ corresponds to $r_j(z_{(b)}; a)$.

The caveat about the existence of $Y_{xr}(u)$ is significant, as $Y_{xr}(u)$ is not defined in the absence of a unique fixed point for the system. Further, even with a unique fixed point, the potential response $Y_{xr}(u)$ must also uniquely solve a set of equations denoted $F_x$ (see Pearl, J., 2000, eq. (7.1)) for a submodel, and there is no general guarantee of such a solution.
Fortunately, however, this caveat matters only for non-recursive PCMs. In the recursive case relevant for G-causality, the potential response is generally well defined.

Making a final identification between $x'$ and $z^*_i$, and given the existence of potential responses $Y_{x'r}(u)$ and $Y_{xr}(u)$, we see that $Y_{x'r}(u) \neq Y_{xr}(u)$ corresponds to the settable systems requirement $r_j(z^*_{(b);i}; a) - r_j(z_{(b)}; a) \neq 0$.

Pearl, J. (2001, definition 1) gives a formal statement of the notion stated above, saying that if for given $u$ and some $r$, $x$, and $x'$ we have $Y_{xr}(u) \neq Y_{x'r}(u)$ then $X$ has a controlled direct effect on $Y$ in model $M$ and situation $U = u$. In definition 2, Pearl, J. (2001) labels $Y_{x'r}(u) - Y_{xr}(u)$ the controlled direct effect, corresponding to the direct structural effect $r_j(z^*_{(b);i}; a) - r_j(z_{(b)}; a)$ defined for settable systems.

Thus, although there are important differences, especially in non-recursive systems, the settable systems and PCM notions of direct causality and direct effects closely correspond in recursive systems. These differences are sufficiently modest that the results of WL linking direct structural causality to Granger causality, discussed next, also serve to closely link the PCM notion of direct cause to that of Granger causality.

5. G-Causality and Direct Structural Causality

In this section we examine the relation between direct structural causality and Granger causality, drawing on results of WL. See WL for additional discussion and proofs of all formal results given here and in Section 6.
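The correspondence just noted can be made concrete: the direct structural effect $r_j(z^*_{(b);i}; a) - r_j(z_{(b)}; a)$ of Definition 4.1, which matches Pearl's controlled direct effect in the recursive case, is computed below for a toy response function invented for this sketch:

```python
# Invented response function; z collects the settings outside block b.
def r_j(z, a):
    z1, z2, z3 = z
    return a * z1 ** 2 + z2   # constant in z3 by construction

a = 2.0
z = (1.0, 5.0, 7.0)

# Intervening on z1 (holding z2, z3 fixed) produces a nonzero direct
# effect, so X_1 directly causes X_j; note the size of the effect depends
# on which intervention pair is chosen.
z_star1 = (3.0, 5.0, 7.0)
effect_1 = r_j(z_star1, a) - r_j(z, a)

# No admissible intervention on z3 ever changes the response, so X_3 is
# not a direct cause of X_j.
z_star3 = (1.0, 5.0, -4.0)
effect_3 = r_j(z_star3, a) - r_j(z, a)
print(effect_1, effect_3)
```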


5.1. Granger Causality

Granger, C.W.J. (1969) defined G-causality in terms of conditional expectations. Granger, C.W.J. and P. Newbold (1986) gave a definition using conditional distributions. We work with the latter, as this is what relates generally to structural causality. In what follows, we adapt Granger and Newbold's notation, but otherwise preserve the conceptual content.

For any sequence of random vectors $\{Y_t,\ t = 0,1,\dots\}$, let $Y^t := (Y_0, \dots, Y_t)$ denote its "$t$-history," and let $\sigma(Y^t)$ denote the sigma-field ("information set") generated by $Y^t$. Let $\{Q_t, S_t, Y_t\}$ be a sequence of random vectors. Granger, C.W.J. and P. Newbold (1986) say that $Q^{t-1}$ does not G-cause $Y_{t+k}$ with respect to $\sigma(Q^{t-1}, S^{t-1}, Y^{t-1})$ if for all $t = 0,1,\dots$,
$$F_{t+k}(\,\cdot \mid Q^{t-1}, S^{t-1}, Y^{t-1}) = F_{t+k}(\,\cdot \mid S^{t-1}, Y^{t-1}), \quad k = 0,1,\dots, \qquad (2)$$
where $F_{t+k}(\,\cdot \mid Q^{t-1}, S^{t-1}, Y^{t-1})$ denotes the conditional distribution function of $Y_{t+k}$ given $Q^{t-1}, S^{t-1}, Y^{t-1}$, and $F_{t+k}(\,\cdot \mid S^{t-1}, Y^{t-1})$ denotes that of $Y_{t+k}$ given $S^{t-1}, Y^{t-1}$. Here, we focus only on the $k = 0$ case, as this is what relates generally to structural causality.

As Florens, J.P. and M. Mouchart (1982) and Florens, J.P. and D. Fougère (1996) note, G non-causality is a form of conditional independence. Following Dawid (1979), we write $X \perp Y \mid Z$ when $X$ and $Y$ are independent given $Z$. Translating (2) gives the following version of the classical definition of Granger causality:

Definition 5.1 (Granger Causality) Let $\{Q_t, S_t, Y_t\}$ be a sequence of random vectors. Suppose that
$$Y_t \perp Q^{t-1} \mid Y^{t-1}, S^{t-1}, \quad t = 1,2,\dots. \qquad (3)$$
Then $Q$ does not G-cause $Y$ with respect to $S$.
Otherwise, $Q$ G-causes $Y$ with respect to $S$.

As it stands, this definition has no necessary structural content, as $Q_t$, $S_t$, and $Y_t$ can be any random variables whatsoever. This definition relates solely to the ability of $Q^{t-1}$ to help in predicting $Y_t$ given $Y^{t-1}$ and $S^{t-1}$.

In practice, researchers do not test classical G-causality, as this involves data histories of arbitrary length. Instead, researchers test a version of G-causality involving only a finite number of lags of $Y_t$, $Q_t$, and $S_t$. This does not test classical G-causality, but rather a related property, finite-order G-causality, that is neither necessary nor sufficient for classical G-causality.

Because of its predominant practical relevance, we focus here on finite-order rather than classical G-causality. (See WL for discussion of classical G-causality.) To define the finite-order concept, we define the finite histories $\mathbf{Y}_{t-1} := (Y_{t-l}, \dots, Y_{t-1})$ and $\mathbf{Q}_t := (Q_{t-k}, \dots, Q_t)$.

Definition 5.2 (Finite-Order Granger Causality) Let $\{Q_t, S_t, Y_t\}$ be a sequence of random variables, and let $k \geq 0$ and $l \geq 1$ be given finite integers. Suppose that
$$Y_t \perp \mathbf{Q}_t \mid \mathbf{Y}_{t-1}, \mathbf{S}_t, \quad t = 1,2,\dots.$$


Then we say $Q$ does not finite-order G-cause $Y$ with respect to $S$. Otherwise, we say $Q$ finite-order G-causes $Y$ with respect to $S$.

We call $\max(k, l-1)$ the "order" of the finite-order G non-causality.

Observe that $\mathbf{Q}_t$ replaces $Q^{t-1}$ in the classical definition, that $\mathbf{Y}_{t-1}$ replaces $Y^{t-1}$, and that $\mathbf{S}_t$ replaces $S^{t-1}$. Thus, in addition to dropping all but a finite number of lags in $Q^{t-1}$ and $Y^{t-1}$, this version includes $Q_t$. As WL discuss, however, the appearance of $Q_t$ need not involve instantaneous causation. It suffices that realizations of $Q_t$ precede those of $Y_t$, as in the case of contemporaneous causation discussed above. The replacement of $S^{t-1}$ with $\mathbf{S}_t$ entails first viewing $\mathbf{S}_t$ as representing a finite history, and second the recognition that since $\mathbf{S}_t$ plays purely a conditioning role, there need be no restriction whatever on its timing. We thus call $\mathbf{S}_t$ "covariates." As WL discuss, the covariates can even include leads relative to time $t$. When covariate leads appear, we call this the "retrospective" case.

In what follows, when we refer to G-causality, it will be understood that we are referring to finite-order G-causality, as just defined. We will always refer to the concept of Definition 5.1 as classical G-causality to avoid confusion.

5.2. A Dynamic Structural System

We now specify a canonical settable system that will enable us to examine the relation between G-causality and direct structural causality.
As described above, in such systems "predecessors" structurally determine "successors," but not vice versa. In particular, future variables cannot precede present or past variables, enforcing the causal direction of time. We write $Y \Leftarrow X$ to denote that $Y$ succeeds $X$ ($X$ precedes $Y$). When $Y$ and $X$ have identical time indexes, $Y \Leftarrow X$ rules out instantaneous causation but allows contemporaneous causation.

We now specify a version of the causal data generating structures analyzed by WL and White, H. and P. Kennedy (2009). We let $\mathbb{N}$ denote the integers $\{0,1,\dots\}$ and define $\bar{\mathbb{N}} := \mathbb{N} \cup \{\infty\}$. For given $l, m \in \mathbb{N}$, $l \geq 1$, we let $\mathbf{Y}_{t-1} := (Y_{t-l}, \dots, Y_{t-1})$ as above; we also define $\mathbf{Z}_t := (Z_{t-m}, \dots, Z_t)$. For simplicity, we keep attributes implicit in what follows.

Assumption A.1 Let $\{U_t, W_t, Y_t, Z_t;\ t = 0,1,\dots\}$ be a stochastic process on $(\Omega, \mathcal{F}, P)$, a complete probability space, with $U_t$, $W_t$, $Y_t$, and $Z_t$ taking values in $\mathbb{R}^{k_u}$, $\mathbb{R}^{k_w}$, $\mathbb{R}^{k_y}$, and $\mathbb{R}^{k_z}$ respectively, where $k_u \in \bar{\mathbb{N}}$ and $k_w, k_y, k_z \in \mathbb{N}$, with $k_y > 0$. Further, suppose that $Y_t \Leftarrow (\mathbf{Y}_{t-1}, U_t, W_t, \mathbf{Z}_t)$, where, for an unknown measurable $k_y \times 1$ function $q_t$, and for given $l, m \in \mathbb{N}$, $l \geq 1$, $\{Y_t\}$ is structurally generated as
$$Y_t = q_t(\mathbf{Y}_{t-1}, \mathbf{Z}_t, U_t), \quad t = 1,2,\dots, \qquad (4)$$
such that, with $Y_t := (Y'_{1,t}, Y'_{2,t})'$ and $U_t := (U'_{1,t}, U'_{2,t})'$,
$$Y_{1,t} = q_{1,t}(\mathbf{Y}_{t-1}, \mathbf{Z}_t, U_{1,t}), \qquad Y_{2,t} = q_{2,t}(\mathbf{Y}_{t-1}, \mathbf{Z}_t, U_{2,t}).$$

Such structures are well suited to representing the structural evolution of time-series data in economic, biological, or other systems. Because $Y_t$ is a vector, this covers the


case of panel data, where one has a cross-section of time-series observations, as in fMRI or EEG data sets. For practical relevance, we explicitly impose the Markov assumption that $Y_t$ is determined by only a finite number of its own lags and those of $Z_t$ and $U_t$. WL discuss the general case.

Throughout, we suppose that realizations of $W_t$, $Y_t$, and $Z_t$ are observed, whereas realizations of $U_t$ are not. Because $U_t$, $W_t$, or $Z_t$ may have dimension zero, their presence is optional. Usually, however, some or all will be present. Since there may be a countable infinity of unobservables, there is no loss of generality in specifying that $Y_t$ depends only on $U_t$ rather than on a finite history of lags of $U_t$.

This structure is general: the structural relations may be nonlinear and non-monotonic in their arguments and non-separable between observables and unobservables. This system may generate stationary processes, non-stationary processes, or both. Assumption A.1 is therefore a general structural VAR; Example 3.3 is a special case.

The vector $Y_t$ represents responses of interest. Consistent with a main application of G-causality, our interest here attaches to the effects on $Y_{1,t}$ of the lags of $Y_{2,t}$. We thus call $Y_{2,t-1}$ and its further lags "causes of interest." Note that A.1 specifies that $Y_{1,t}$ and $Y_{2,t}$ each have their own unobserved drivers, $U_{1,t}$ and $U_{2,t}$, as is standard.

The vectors $U_t$ and $Z_t$ contain causal drivers of $Y_t$ whose effects are not of primary interest; we thus call $U_t$ and $Z_t$ "ancillary causes." The vector $W_t$ may contain responses to $U_t$.
Observe that W_t does not appear in the argument list for q_t, so it explicitly does not directly determine Y_t. Note also that Y_t ⇐ (Y_{t-1}, U_t, W_t, Z_t) ensures that W_t is not determined by Y_t or its lags. A useful convention is that W_t ⇐ (W_{t-1}, U_t, Z_t), so that W_t does not drive unobservables. If a structure does not have this property, then suitable substitutions can usually yield a derived structure satisfying this convention. Nevertheless, we do not require this, so W_t may also contain drivers of unobservable causes of Y_t.

For concreteness, we now specialize the settable systems definition of direct structural causality (Definition 4.1) to the specific system given in A.1. For this, let y_{s,t-1} be the sub-vector of y_{t-1} with elements indexed by the non-empty set s ⊆ {1,...,k_y} × {t-l,...,t-1}, and let y_{(s),t-1} be the sub-vector of y_{t-1} with elements of s excluded.

Definition 5.3 (Direct Structural Causality) Given A.1, for given t > 0, j ∈ {1,...,k_y}, and s, suppose that for all admissible values of y_{(s),t-1}, z_t, and u_t, the function y_{s,t-1} → q_{j,t}(y_{t-1}, z_t, u_t) is constant in y_{s,t-1}. Then we say Y_{s,t-1} does not directly structurally cause Y_{j,t} and write Y_{s,t-1} ⇏^D_S Y_{j,t}. Otherwise, we say Y_{s,t-1} directly structurally causes Y_{j,t} and write Y_{s,t-1} ⇒^D_S Y_{j,t}.

We can similarly define direct causality or non-causality of Z_{s,t} or U_{s,t} for Y_{j,t}, but we leave this implicit. We write, e.g., Y_{s,t-1} ⇒^D_S Y_t when Y_{s,t-1} ⇒^D_S Y_{j,t} for some j ∈ {1,...,k_y}.

Building on work of White, H. (2006a) and White, H. and K. Chalak (2009), WL discuss how certain exogeneity restrictions permit identification of expected causal effects in dynamic structures.
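Definition 5.3's constancy requirement can be checked numerically for a known toy structural function. The sketch below is purely illustrative (the function names and the simulation check are our own, not part of WL's formal apparatus): it holds the excluded arguments fixed, varies the candidate cause, and checks whether the output moves.

```python
import numpy as np

def q1(y1_lag, y2_lag, z, u):
    # Toy structural function: responds to y1_lag, z, and u, but is
    # constant in y2_lag, so Y2_{t-1} does not directly cause Y1_t here.
    return 0.8 * y1_lag + 0.5 * z + u

def is_constant_in_y2(q, n_draws=200, seed=0):
    # Hold (y1_lag, z, u) fixed at random admissible values, vary y2_lag,
    # and check that q's output never changes (Definition 5.3's criterion).
    rng = np.random.default_rng(seed)
    for _ in range(20):
        y1_lag, z, u = rng.normal(size=3)
        vals = [q(y1_lag, y2, z, u) for y2 in rng.normal(size=n_draws)]
        if not np.allclose(vals, vals[0]):
            return False
    return True

print(is_constant_in_y2(q1))
```

A structural function that does respond to its second argument, e.g. `lambda y1, y2, z, u: y1 + y2`, fails the same check, signaling direct structural causality.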
Our next result shows that a specific form of exogeneity


enables us to link direct structural causality and finite-order G-causality. To state this exogeneity condition, we write Y_{1,t-1} := (Y_{1,t-l},...,Y_{1,t-1}), Y_{2,t-1} := (Y_{2,t-l},...,Y_{2,t-1}), and, for given τ_1, τ_2 ≥ 0, X_t := (X_{t-τ_1},...,X_{t+τ_2}), where X_t := (W'_t, Z'_t)'.

Assumption A.2 For l and m as in A.1 and for τ_1 ≥ m, τ_2 ≥ 0, suppose that Y_{2,t-1} ⊥ U_{1,t} | (Y_{1,t-1}, X_t), t = 1,...,T-τ_2.

The classical strict exogeneity condition specifies that (Y_{t-1}, Z_t) ⊥ U_{1,t}, which implies Y_{2,t-1} ⊥ U_{1,t} | (Y_{1,t-1}, Z_t). (Here, W_t can be omitted.) Assumption A.2 is a weaker requirement, as it may hold when strict exogeneity fails. Because of the conditioning involved, we call this conditional exogeneity. Chalak, K. and H. White (2010) discuss structural restrictions for canonical settable systems that deliver conditional exogeneity. Below, we also discuss practical tests for this assumption.

Because of the finite numbers of lags involved in A.2, this is a finite-order conditional exogeneity assumption. For convenience and because no confusion will arise here, we simply refer to this as "conditional exogeneity."

Assumption A.2 ensures that expected direct effects of Y_{2,t-1} on Y_{1,t} are identified. As WL note, it suffices for A.2 that U_{t-1} ⊥ U_{1,t} | (Y_0, Z_{t-1}, X_t) and Y_{2,t-1} ⊥ (Y_0, Z_{t-τ_1-1}) | (Y_{1,t-1}, X_t). Imposing U_{t-1} ⊥ U_{1,t} | (Y_0, Z_{t-1}, X_t) is the analog of requiring that serial correlation is absent when lagged dependent variables are present.
Imposing Y_{2,t-1} ⊥ (Y_0, Z_{t-τ_1-1}) | (Y_{1,t-1}, X_t) ensures that ignoring Y_0 and omitting distant lags of Z_t from X_t doesn't matter.

Our first result linking direct structural causality and G-causality shows that, given A.1 and A.2 and with proper choice of Q_t and S_t, G-causality implies direct structural causality.

Proposition 5.4 Let A.1 and A.2 hold. If Y_{2,t-1} ⇏^D_S Y_{1,t}, t = 1,2,..., then Y_2 does not finite-order G-cause Y_1 with respect to X, i.e.,

    Y_{1,t} ⊥ Y_{2,t-1} | Y_{1,t-1}, X_t,  t = 1,...,T-τ_2.

In stating G non-causality, we make the explicit identifications Q_t = Y_{2,t-1} and S_t = X_t.

This result leads one to ask whether the converse relation also holds: does direct structural causality imply G-causality? Strictly speaking, the answer is no. WL discuss several examples. The main issue is that with suitably chosen causal and probabilistic relationships, Y_{2,t-1} can cause Y_{1,t}, but Y_{2,t-1} and Y_{1,t} can be independent, conditionally or unconditionally, i.e., Granger non-causal.

As WL further discuss, however, these examples are exceptional, in the sense that mild perturbations to their structure destroy the Granger non-causality. WL introduce a refinement of the notion of direct structural causality that accommodates these special cases and that does yield a converse result, permitting a characterization of structural and Granger causality. Let supp(Y_{1,t}) denote the support of Y_{1,t}, i.e., the smallest set containing Y_{1,t} with probability 1, and let F_{1,t}(· | Y_{1,t-1}, X_t) denote the conditional distribution function of U_{1,t} given Y_{1,t-1}, X_t. WL introduce the following definition:


Definition 5.5 Suppose A.1 holds and that for given τ_1 ≥ m, τ_2 ≥ 0 and for each y ∈ supp(Y_{1,t}) there exists a σ(Y_{1,t-1}, X_t)-measurable version of the random variable

    ∫ 1{q_{1,t}(Y_{t-1}, Z_t, u_{1,t}) < y} dF_{1,t}(u_{1,t} | Y_{1,t-1}, X_t).

Then Y_{2,t-1} ⇏^D_{S(Y_{1,t-1},X_t)} Y_{1,t} (direct non-causality-σ(Y_{1,t-1}, X_t) a.s.). If not, Y_{2,t-1} ⇒^D_{S(Y_{1,t-1},X_t)} Y_{1,t}.

For simplicity, we refer to this as direct non-causality a.s. The requirement that the integral in this definition is σ(Y_{1,t-1}, X_t)-measurable means that the integral does not depend on Y_{2,t-1}, despite its appearance inside the integral as an argument of q_{1,t}. For this, it suffices that Y_{2,t-1} does not directly cause Y_{1,t}; but it is also possible that q_{1,t} and the conditional distribution of U_{1,t} given Y_{1,t-1}, X_t are in just the right relation to hide the structural causality. Without the ability to manipulate this distribution, the structural causality will not be detectable. One possible avenue to manipulating this distribution is to modify the choice of X_t, as there are often multiple choices for X_t that can satisfy A.2 (see White, H. and X. Lu, 2010b). For brevity and because hidden structural causality is an exceptional circumstance, we leave aside further discussion of this possibility here. The key fact to bear in mind is that the causal concept of Definition 5.5 distinguishes between those direct causal relations that are empirically detectable and those that are not, for a given set of covariates X_t.

We now give a structural characterization of G-causality for structural VARs:

Theorem 5.6 Let A.1 and A.2 hold.
Then Y_{2,t-1} ⇏^D_{S(Y_{1,t-1},X_t)} Y_{1,t}, t = 1,...,T-τ_2, if and only if

    Y_{1,t} ⊥ Y_{2,t-1} | Y_{1,t-1}, X_t,  t = 1,...,T-τ_2,

i.e., Y_2 does not finite-order G-cause Y_1 with respect to X.

Thus, given conditional exogeneity of Y_{2,t-1}, G non-causality implies direct non-causality a.s. and vice versa, justifying tests of direct non-causality a.s. in structural VARs using tests for G-causality.

This result completes the desired linkage between G-causality and direct causality in the PCM. Because direct causality in the recursive PCM corresponds essentially to direct structural causality in canonical settable systems, and because the latter is essentially equivalent to G-causality, as just shown, direct causality in the PCM is essentially equivalent to G-causality, provided A.1 and A.2 hold.

5.3. The Central Role of Conditional Exogeneity

To relate direct structural causality to G-causality, we maintain A.2, a specific conditional exogeneity assumption. Can this assumption be eliminated or weakened? We show that the answer is no: A.2 is in a precise sense a necessary condition. We also give a result supporting tests for conditional exogeneity.


First, we specify the sense in which conditional exogeneity is necessary for the equivalence of G-causality and direct structural causality.

Proposition 5.7 Given A.1, suppose that Y_{2,t-1} ⇏^D_S Y_{1,t}, t = 1,2,... . If A.2 does not hold, then for each t there exists q_{1,t} such that Y_{1,t} ⊥ Y_{2,t-1} | Y_{1,t-1}, X_t does not hold.

That is, if conditional exogeneity does not hold, then there are always structures that generate data exhibiting G-causality, despite the absence of direct structural causality. Because q_{1,t} is unknown, this worst-case scenario can never be discounted. Further, as WL show, the class of worst-case structures includes precisely those usually assumed in applications, namely separable structures (e.g., Y_{1,t} = q_{1,t}(Y_{1,t-1}, Z_t) + U_{1,t}), as well as the more general class of invertible structures. Thus, in the cases typically assumed in the literature, the failure of conditional exogeneity guarantees G-causality in the absence of structural causality. We state this formally as a corollary.

Corollary 5.8 Given A.1 with Y_{2,t-1} ⇏^D_S Y_{1,t}, t = 1,2,..., suppose that q_{1,t} is invertible in the sense that Y_{1,t} = q_{1,t}(Y_{1,t-1}, Z_t, U_{1,t}) implies the existence of ξ_{1,t} such that U_{1,t} = ξ_{1,t}(Y_{1,t-1}, Z_t, Y_{1,t}), t = 1,2,... .
If A.2 fails, then Y_{1,t} ⊥ Y_{2,t-1} | Y_{1,t-1}, X_t fails, t = 1,2,... .

Together with Theorem 5.6, this establishes that in the absence of direct causality and for the class of invertible structures predominant in applications, conditional exogeneity is necessary and sufficient for G non-causality.

Tests of conditional exogeneity for the general separable case follow from:

Proposition 5.9 Given A.1, suppose that E(Y_{1,t}) < ∞ and that

    q_{1,t}(Y_{t-1}, Z_t, U_{1,t}) = ζ_t(Y_{t-1}, Z_t) + υ_t(Y_{1,t-1}, Z_t, U_{1,t}),

where ζ_t and υ_t are unknown measurable functions. Let ε_t := Y_{1,t} - E(Y_{1,t} | Y_{t-1}, X_t). If A.2 holds, then

    ε_t = υ_t(Y_{1,t-1}, Z_t, U_{1,t}) - E(υ_t(Y_{1,t-1}, Z_t, U_{1,t}) | Y_{1,t-1}, X_t),
    E(ε_t | Y_{t-1}, X_t) = E(ε_t | Y_{1,t-1}, X_t) = 0, and
    Y_{2,t-1} ⊥ ε_t | Y_{1,t-1}, X_t.   (5)

Tests based on this result detect the failure of A.2, given separability. Such tests are feasible because even though the regression error ε_t is unobserved, it can be consistently estimated, say as ε̂_t := Y_{1,t} - Ê(Y_{1,t} | Y_{t-1}, X_t), where Ê(Y_{1,t} | Y_{t-1}, X_t) is a parametric or nonparametric estimator of E(Y_{1,t} | Y_{t-1}, X_t). These estimated errors can then be used to test (5). If we reject (5), then we must reject A.2. We discuss a practical procedure in the next section. WL provide additional discussion.

WL also discuss dropping the separability assumption. For brevity, we maintain separability here. Observe that under the null of direct non-causality, q_{1,t} is necessarily separable, as then ζ_t is the zero function.


6. Testing Direct Structural Causality

Here, we discuss methods for testing direct structural causality. First, we discuss a general approach that combines tests of G non-causality (GN) and conditional exogeneity (CE). Then we describe straightforward practical methods for implementing the general approach.

6.1. Combining Tests for GN and CE

Theorem 5.6 implies that if we test and reject GN, then we must reject either direct structural non-causality (SN) or CE, or both. If CE is maintained, then we can directly test SN by testing GN; otherwise, a direct test is not available.

Similarly, under the traditional separability assumption, Corollary 5.8 implies that if we test and reject CE, then we must reject either SN or GN (or both). If GN is maintained, then we can directly test SN by testing CE; otherwise, a direct test is not available.

When neither CE nor GN is maintained, no direct test of SN is possible. Nevertheless, we can test structural causality indirectly by combining the results of the GN and CE tests to isolate the source of any rejections. WL propose the following indirect test:

(1) Reject SN if either:
    (i) the CE test fails to reject and the GN test rejects; or
    (ii) the CE test rejects and the GN test fails to reject.

If these rejection conditions do not hold, however, we cannot just decide to "accept" (i.e., fail to reject) SN. As WL explain in detail, difficulties arise when CE and GN both fail, as failing to reject SN here runs the risk of Type II error, whereas rejecting SN runs the risk of Type I error.
We resolve this dilemma by specifying the further rules:

(2) Fail to reject SN if the CE and GN tests both fail to reject;
(3) Make no decision as to SN if the CE and GN tests both reject.

In the latter case, we conclude only that CE and GN both fail, thereby obstructing structural inference. This sends a clear signal that the researcher needs to revisit the model specification, with particular attention to specifying covariates sufficient to ensure conditional exogeneity.

Because of the structure of this indirect test, it is not enough simply to consider its level and power. We must also account for the possibility of making no decision. For this, define

    p := P[ wrongly make a decision ]
       = P[ fail to reject CE or GN | CE is false and GN is false ],
    q := P[ wrongly make no decision ]
       = P[ reject CE and GN | CE is true or GN is true ].
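Decision rules (1)-(3) amount to a small dispatch on the two test outcomes. The sketch below simply restates them as code; the boolean inputs stand for the rejection decisions of whatever CE and GN tests the researcher actually runs.

```python
def indirect_sn_test(ce_rejects: bool, gn_rejects: bool) -> str:
    # WL's indirect test of structural non-causality (SN), combining the
    # conditional exogeneity (CE) and Granger non-causality (GN) tests.
    if ce_rejects and gn_rejects:
        # Rule (3): both CE and GN fail, obstructing structural inference.
        return "no decision"
    if ce_rejects != gn_rejects:
        # Rule (1): exactly one of the two tests rejects.
        return "reject SN"
    # Rule (2): neither test rejects.
    return "fail to reject SN"

print(indirect_sn_test(ce_rejects=False, gn_rejects=True))  # reject SN
```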


These are the analogs of the probabilities of Type I and Type II errors for the "no decision" action. We would like these probabilities to be small. Next, we consider

    α* := P[ reject SN or make no decision | CE is true and GN is true ],
    π* := P[ reject SN | exactly one of CE and GN is true ].

These quantities correspond to notions of level and power, but with the sample space restricted to the subset on which CE is true or GN is true, that is, the space where a decision can be made. Thus, α* differs from the standard notion of level, but it does capture the probability of taking an incorrect action when SN (the null) holds in the restricted sample space, i.e., when CE and GN are both true. Similarly, π* captures the probability of taking the correct action when SN does not hold in the restricted sample space. We would like the "restricted level" α* to be small and the "restricted power" π* to be close to one.

WL provide useful bounds on the asymptotic properties (T → ∞) of the sample-size-T values of the probabilities defined above, p_T, q_T, α*_T, and π*_T:

Proposition 6.1 Suppose that for T = 1,2,... the significance levels of the CE and GN tests are α_{1T} and α_{2T}, respectively, and that α_{1T} → α_1 < .5 and α_{2T} → α_2 < .5. Suppose the powers of the CE and GN tests are π_{1T} and π_{2T}, respectively, and that π_{1T} → 1 and π_{2T} → 1.
Then

    p_T → 0,  limsup q_T ≤ max{α_1, α_2},
    |α_1 - α_2| ≤ liminf α*_T ≤ limsup α*_T ≤ α_1 + α_2 + min{α_1, α_2}, and
    min{1-α_1, 1-α_2} ≤ liminf π*_T ≤ limsup π*_T ≤ max{1-α_1, 1-α_2}.

When π_{1T} → 1 and π_{2T} → 1, one can also typically ensure α_1 = 0 and α_2 = 0 by suitable choice of an increasing sequence of critical values. In this case, q_T → 0, α*_T → 0, and π*_T → 1. Because GN and CE tests will not be consistent against every possible alternative, weaker asymptotic bounds on the level and power of the indirect test hold for these cases by Proposition 8.1 of WL. Thus, whenever possible, one should carefully design GN and CE tests to have power against particularly important or plausible alternatives. See WL for further discussion.

6.2. Practical Tests for GN and CE

To test GN and CE, we require tests for conditional independence. Nonparametric tests for conditional independence consistent against arbitrary alternatives are readily available (e.g., Linton, O. and P. Gozalo, 1997; Fernandes, M. and R. G. Flores, 2001; Delgado, M. A. and W. Gonzalez-Manteiga, 2001; Su, L. and H. White, 2007a,b, 2008; Song, K., 2009; Huang, M. and H. White, 2009). In principle, one can apply any of these to consistently test GN and CE.

But nonparametric tests are often not practical, due to the typically modest number of time-series observations available relative to the number of relevant observable


variables. In practice, researchers typically use parametric methods. These are convenient, but they may lack power against important alternatives. To provide convenient procedures for testing GN and CE with power against a wider range of alternatives, WL propose augmenting standard tests with neural network terms, motivated by the "QuickNet" procedures introduced by White, H. (2006b) or the extreme learning machine (ELM) methods of Huang, G.B., Q.Y. Zhu, and C.K. Siew (2006). We now provide explicit practical methods for testing GN and CE for a leading class of structures obeying A.1.

6.2.1. Testing Granger Non-Causality

Standard tests for finite-order G-causality (e.g., Stock, J. and M. Watson, 2007, p. 547) typically assume a linear regression, such as³

    E(Y_{1,t} | Y_{t-1}, X_t) = α_0 + Y'_{1,t-1} ρ_0 + Y'_{2,t-1} β_0 + X'_t β_1.

For simplicity, we let Y_{1,t} be a scalar here. The extension to the case of vector Y_{1,t} is completely straightforward. Under the null of GN, i.e., Y_{1,t} ⊥ Y_{2,t-1} | Y_{1,t-1}, X_t, we have β_0 = 0. The standard procedure therefore tests β_0 = 0 in the regression equation

    Y_{1,t} = α_0 + Y'_{1,t-1} ρ_0 + Y'_{2,t-1} β_0 + X'_t β_1 + ε_t.   (GN Test Regression 1)

If we reject β_0 = 0, then we also reject GN. But if we don't reject β_0 = 0, care is needed, as not all failures of GN will be indicated by β_0 ≠ 0.

Observe that when CE holds and if GN Test Regression 1 is correctly specified, i.e., the conditional expectation E(Y_{1,t} | Y_{t-1}, X_t) is indeed linear in the conditioning variables, then β_0 represents precisely the direct structural effect of Y_{2,t-1} on Y_{1,t}.
Thus, GN Test Regression 1 may not only permit a test of GN, but it may also provide a consistent estimate of the direct structural effect of interest.

To mitigate specification error and gain power against a wider range of alternatives, WL propose augmenting GN Test Regression 1 with neural network terms, as in White's (2006b, p. 476) QuickNet procedure. This involves testing β_0 = 0 in

    Y_{1,t} = α_0 + Y'_{1,t-1} ρ_0 + Y'_{2,t-1} β_0 + X'_t β_1
              + Σ_{j=1}^r ψ(Y'_{1,t-1} γ_{1,j} + X'_t γ_j) β_{j+1} + ε_t.   (GN Test Regression 2)

Here, the activation function ψ is a generically comprehensively revealing (GCR) function (see Stinchcombe, M. and H. White, 1998). For example, ψ can be the logistic cdf ψ(z) = 1/(1 + exp(-z)) or a ridgelet function, e.g., ψ(z) = (-z⁵ + 10z³ - 15z) exp(-.5z²) (see, for example, Candès, E. (1999)). The integer r lies between 1 and r̄, the maximum number of hidden units. We randomly choose (γ_{0j}, γ_j) as in White, H. (2006b, p. 477).

3. For notational convenience, we understand that all regressors have been recast as vectors containing the referenced elements.
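As a hedged illustration of the mechanics of GN Test Regression 2, the numpy sketch below simulates a toy system in which Y_{2,t-1} does drive Y_{1,t}, augments the regressors with r = 3 randomly weighted logistic hidden units, and tests β_0 = 0 with an F statistic comparing restricted and unrestricted fits. The data-generating process, lag length, and r are illustrative choices, not prescriptions from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 500
# Toy structural VAR in which Y2_{t-1} drives Y1_t (so GN should be rejected).
y1 = np.zeros(T); y2 = np.zeros(T); x = rng.normal(size=T)
for t in range(1, T):
    y2[t] = 0.5 * y2[t-1] + rng.normal()
    y1[t] = 0.3 * y1[t-1] + 0.6 * y2[t-1] + 0.4 * x[t] + rng.normal()

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

y = y1[1:]
base = np.column_stack([np.ones(T-1), y1[:-1], x[1:]])
r = 3
G = rng.normal(size=(2, r))                    # random hidden-unit weights, QuickNet-style
hidden = logistic(np.column_stack([y1[:-1], x[1:]]) @ G)
X_r = np.column_stack([base, hidden])          # restricted model: beta_0 = 0 imposed
X_u = np.column_stack([base, y2[:-1], hidden]) # unrestricted model

def ssr(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid

# F statistic for H0: beta_0 = 0 (one restriction, since Y2 is scalar here).
q = 1
F = ((ssr(X_r, y) - ssr(X_u, y)) / q) / (ssr(X_u, y) / (len(y) - X_u.shape[1]))
print(F > 3.85)   # exceeds the approximate 5% critical value: reject GN
```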


Parallel to our comment above about estimating direct structural effects of interest, we note that given A.1, A.2, and some further mild regularity conditions, such effects can be identified and estimated from a neural network regression of the form

    Y_{1,t} = α_0 + Y'_{1,t-1} ρ_0 + Y'_{2,t-1} β_0 + X'_t β_1
              + Σ_{j=1}^r ψ(Y'_{1,t-1} γ_{1,j} + Y'_{2,t-1} γ_{2,j} + X'_t γ_{3,j}) β_{j+1} + ε_t.

Observe that this regression includes Y_{2,t-1} inside the hidden units. With r chosen sufficiently large, this permits the regression to achieve a sufficiently close approximation to E(Y_{1,t} | Y_{t-1}, X_t) and its derivatives (see Hornik, K., M. Stinchcombe, and H. White, 1990; Gallant, A.R. and H. White, 1992) that regression misspecification is not such an issue. In this case, the derivative of the estimated regression with respect to Y_{2,t-1} well approximates

    (∂/∂y_2) E(Y_{1,t} | Y_{1,t-1}, Y_{2,t-1} = y_2, X_t)
        = E[ (∂/∂y_2) q_{1,t}(Y_{1,t-1}, y_2, Z_t, U_{1,t}) | Y_{1,t-1}, X_t ].

This quantity is the covariate-conditioned expected marginal direct effect of Y_{2,t-1} on Y_{1,t}.

Although it is possible to base a test for GN on these estimated effects, we do not propose this here, as the required analysis is much more involved than that associated with GN Test Regression 2.

Finally, to gain additional power WL propose tests using transformations of Y_{1,t}, Y_{1,t-1}, and Y_{2,t-1}, as Y_{1,t} ⊥ Y_{2,t-1} | Y_{1,t-1}, X_t implies f(Y_{1,t}) ⊥ g(Y_{2,t-1}) | Y_{1,t-1}, X_t for all measurable f and g. One then tests β_{1,0} = 0 in

    ψ_1(Y_{1,t}) = α_{1,0} + ψ_2(Y_{1,t-1})' ρ_{1,0} + ψ_3(Y_{2,t-1})' β_{1,0} + X'_t β_{1,1}
                   + Σ_{j=1}^r ψ(Y'_{1,t-1} γ_{1,1,j} + X'_t γ_{1,j}) β_{1,j+1} + η_t.
(GN Test Regression 3)

We take ψ_1 and the elements of the vector ψ_3 to be GCR, e.g., ridgelets or the logistic cdf. The choices of γ, r, and ψ are as described above. Here, ψ_2 can be the identity (ψ_2(Y_{1,t-1}) = Y_{1,t-1}), its elements can coincide with ψ_1, or it can be a different GCR function.

6.2.2. Testing Conditional Exogeneity

Testing conditional exogeneity requires testing A.2, i.e., Y_{2,t-1} ⊥ U_{1,t} | Y_{1,t-1}, X_t. Since U_{1,t} is unobservable, we cannot test this directly. But with separability (which holds under the null of direct structural non-causality), Proposition 5.9 shows that Y_{2,t-1} ⊥ U_{1,t} | Y_{1,t-1}, X_t implies Y_{2,t-1} ⊥ ε_t | Y_{1,t-1}, X_t, where ε_t := Y_{1,t} - E(Y_{1,t} | Y_{t-1}, X_t). With correct specification of E(Y_{1,t} | Y_{t-1}, X_t) in either GN Test Regression 1 or 2 (or some other appropriate regression), we can estimate ε_t and use these estimates to test Y_{2,t-1} ⊥


ε_t | Y_{1,t-1}, X_t. If we reject this, then we also must reject CE. We describe the procedure in detail below.

As WL discuss, such a procedure is not "watertight," as this method may miss certain alternatives to CE. But, as it turns out, there is no completely infallible method. By offering the opportunity of falsification, this method provides crucial insurance against being naively misled into inappropriate causal inferences. See WL for further discussion.

The first step in constructing a practical test for CE is to compute estimates of ε_t, say ε̂_t. This can be done in the obvious way by taking ε̂_t to be the estimated residuals from a suitable regression. For concreteness, suppose this is either GN Test Regression 1 or 2.

The next step is to use ε̂_t to test Y_{2,t-1} ⊥ ε_t | Y_{1,t-1}, X_t. WL recommend doing this by estimating the following analog of GN Test Regression 3:

    ψ_1(ε̂_t) = α_{2,0} + ψ_2(Y_{1,t-1})' ρ_{2,0} + ψ_3(Y_{2,t-1})' β_{2,0} + X'_t β_{2,1}
                + Σ_{j=1}^r ψ(Y'_{1,t-1} γ_{2,1,j} + X'_t γ_{2,j}) β_{2,j+1} + η_t.   (CE Test Regression)

Note that the right-hand-side regressors are identical to those of GN Test Regression 3; we just replace the dependent variable ψ_1(Y_{1,t}) for GN with ψ_1(ε̂_t) for CE. Nevertheless, the transformations ψ_1, ψ_2, and ψ_3 here may differ from those of GN Test Regression 3. To keep the notation simple, we leave these possible differences implicit.
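To illustrate the residual-based CE testing idea, the sketch below simulates a violation of A.2 in which the scale of U_{1,t} depends on Y_{2,t-1}, estimates first-stage residuals, and regresses ψ_1(ε̂_t) = |ε̂_t| (one of the non-GCR choices mentioned below) on the same regressors. It is deliberately simplified: no hidden-unit terms, and a naive t statistic that ignores the first-stage estimation effect discussed below; the data-generating process is our own toy example.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 400
y2 = rng.normal(size=T); x = rng.normal(size=T)
y1 = np.zeros(T)
for t in range(1, T):
    # CE fails here: the scale of the unobserved driver depends on y2[t-1],
    # even though y2[t-1] has no direct structural effect on the mean of y1[t].
    y1[t] = 0.4 * y1[t-1] + 0.5 * x[t] + np.exp(0.5 * y2[t-1]) * rng.normal()

y = y1[1:]
X = np.column_stack([np.ones(T-1), y1[:-1], y2[:-1], x[1:]])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
eps_hat = y - X @ beta                      # first-stage (GN regression) residuals

# CE test regression: psi1(eps_hat) = |eps_hat| on the same regressors;
# under CE the coefficient on the Y2_{t-1} column should be zero.
psi = np.abs(eps_hat)
b, *_ = np.linalg.lstsq(X, psi, rcond=None)
resid = psi - X @ b
sigma2 = resid @ resid / (len(psi) - X.shape[1])
cov = sigma2 * np.linalg.inv(X.T @ X)
t_stat = b[2] / np.sqrt(cov[2, 2])          # column 2 is the Y2_{t-1} regressor
print(abs(t_stat) > 1.96)                   # the scale dependence is detected: reject CE
```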
To test CE using this regression, we test the null hypothesis β_{2,0} = 0: if we reject β_{2,0} = 0, then we reject CE.

As WL explain, the fact that ε̂_t is obtained from a "first-stage" estimation (GN) involving potentially the same regressors as those appearing in the CE regression means that choosing ψ_1(ε̂_t) = ε̂_t can easily lead to a test with no power. For CE, WL thus recommend choosing ψ_1 to be GCR. Alternatively, non-GCR choices may be informative, such as

    ψ_1(ε̂_t) = |ε̂_t|,  ψ_1(ε̂_t) = ε̂_t(λ - 1{ε̂_t < 0}), λ ∈ (0,1),  or  ψ_1(ε̂_t) = ε̂_t².

Significantly, the asymptotic sampling distributions needed to test β_{2,0} = 0 will generally be impacted by the first-stage estimation. Handling this properly is straightforward, but somewhat involved. To describe a practical method, we denote the first-stage (GN) estimator as θ̂_{1,T} := (α̂_{1,T}, ρ̂_{1,T}, β̂'_{1,0,T}, β̂'_{1,1,T}, ..., β̂_{1,r+1,T})', computed from GN Test Regression 1 (r = 0) or 2 (r > 0). Let the second-stage (CE) regression estimator be θ̂_{2,T}; this contains the estimated coefficients for Y_{2,t-1}, say β̂_{2,0,T}, which carry the information about CE. Under mild conditions, a central limit theorem ensures that

    √T (θ̂_T - θ_0) →_d N(0, C_0),

where θ̂_T := (θ̂'_{1,T}, θ̂'_{2,T})', θ_0 := plim(θ̂_T), convergence in distribution as T → ∞ is denoted →_d, and N(0, C_0) denotes the multivariate normal distribution with mean zero and


covariance matrix C_0 := A_0⁻¹ B_0 A_0⁻¹′, where

    A_0 := [ A_{0,11}   0
             A_{0,21}   A_{0,22} ]

is a two-stage analog of the log-likelihood Hessian and B_0 is an analog of the information matrix. See White, H. (1994, pp. 103-108) for specifics.⁴ This fact can then be used to construct a well-behaved test for β_{2,0} = 0.

Constructing this test is especially straightforward when the regression errors of the GN and CE regressions, ε_t and η_t, are suitable martingale differences. Then B_0 has the form

    B_0 := [ E[Z_t ε_t ε'_t Z'_t]   E[Z_t ε_t η'_t Z'_t]
             E[Z_t η_t ε'_t Z'_t]   E[Z_t η_t η'_t Z'_t] ],

where the CE regressors Z_t are measurable-σ(X_t), X_t := (vec[Y_{t-1}]', vec[X_t]')', ε_t := Y_{1,t} - E(Y_{1,t} | X_t), and η_t := ψ_1(ε_t) - E[ψ_1(ε_t) | X_t]. For this, it suffices that U_{1,t} ⊥ (Y_{t-l-1}, X_{t-τ_1-1}) | X_t, as WL show. This memory condition is often plausible, as it says that the more distant history (Y_{t-l-1}, X_{t-τ_1-1}) is not predictive for U_{1,t}, given the more recent history X_t of (Y_{t-1}, X_{t+τ_2}). Note that separability is not needed here.

The details of C_0 can be involved, especially with choices like ψ_1(ε̂_t) = |ε̂_t|. But this is a standard m-estimation setting, so we can avoid explicit estimation of C_0: suitable bootstrap methods deliver valid critical values, even without the martingale difference property (see, e.g., Gonçalves, S. and H. White, 2004; Kiefer, N. and T. Vogelsang, 2002, 2005; Politis, D. N., 2009).

An especially appealing method is the weighted bootstrap (Ma, S. and M.
Kosorok, 2005), which works under general conditions, given the martingale difference property. To implement this, for i = 1,...,n generate sequences {W_{t,i}, t = 1,...,T} of IID positive scalar weights with E(W_{t,i}) = 1 and σ²_W := var(W_{t,i}) = 1. For example, take W_{t,i} ~ χ²_1/√2 + (1 - 1/√2), where χ²_1 is chi-squared with one degree of freedom. The weights should be independent of the sample data and of each other. Then compute estimators θ̂_{T,i} by weighted least squares applied to the GN and CE regressions using (the same) weights {W_{t,i}, t = 1,...,T}. By Ma, S. and M. Kosorok (2005, theorem 2), the random variables

    √T (θ̂_{T,i} - θ̂_T),  i = 1,...,n,

can then be used to form valid asymptotic critical values for testing hypotheses about θ_0.

To test CE, we test β_{2,0} = 0. This is a restriction of the form S_2 θ_0 = 0, where S_2 is the selection matrix that selects the elements β_{2,0} from θ_0. Thus, to conduct an asymptotic level-α test, we can first compute the test statistic, say

    T_T := T θ̂'_T S'_2 S_2 θ̂_T = T β̂'_{2,0,T} β̂_{2,0,T}.

4. The regularity conditions include plausible memory and moment requirements, together with certain smoothness and other technical conditions.
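A minimal sketch of the weighted-bootstrap mechanics on a single generic regression (not the full two-stage GN/CE system): draw Ma-Kosorok weights with mean 1 and variance 1, re-estimate by weighted least squares, and form the critical value from the bootstrap statistics. The toy design, sample sizes, and the nonzero coefficient being detected are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
T, n = 300, 499
X = rng.normal(size=(T, 2))
# Toy regression with a genuinely nonzero second coefficient, so the
# unstudentized statistic T_T should exceed the bootstrap critical value.
y = 0.5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=T)

def wls(X, y, w):
    # Weighted least squares: solve X' diag(w) X beta = X' diag(w) y.
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)

beta_hat = wls(X, y, np.ones(T))
stat = T * beta_hat[1] ** 2               # T_T for the restriction beta_2 = 0

boot = np.empty(n)
for i in range(n):
    # Ma-Kosorok weights: chi2_1 / sqrt(2) + (1 - 1/sqrt(2)), mean 1, variance 1,
    # independent of the sample data and of each other.
    w = rng.chisquare(1, size=T) / np.sqrt(2) + (1 - 1 / np.sqrt(2))
    beta_i = wls(X, y, w)
    boot[i] = T * (beta_i[1] - beta_hat[1]) ** 2

crit = np.quantile(boot, 0.95)            # bootstrap 95% critical value
print(stat > crit)                        # the nonzero coefficient is detected
```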


We then reject CE if T_T > ĉ_{T,n,1-α}, where, with n chosen sufficiently large, ĉ_{T,n,1-α} is the 1-α percentile of the weighted bootstrap statistics

    T_{T,i} := T (θ̂_{T,i} - θ̂_T)' S'_2 S_2 (θ̂_{T,i} - θ̂_T) = T (β̂_{2,0,T,i} - β̂_{2,0,T})' (β̂_{2,0,T,i} - β̂_{2,0,T}),  i = 1,...,n.

This procedure is asymptotically valid, even though T_T is based on the "unstudentized" statistic S_2 θ̂_T = β̂_{2,0,T}. Alternatively, one can construct a studentized statistic

    T*_T := T θ̂'_T S'_2 [S_2 Ĉ_{T,n} S'_2]⁻¹ S_2 θ̂_T,

where Ĉ_{T,n} is an asymptotic covariance estimator constructed from √T (θ̂_{T,i} - θ̂_T), i = 1,...,n. The test rejects CE if T*_T > c_{1-α}, where c_{1-α} is the 1-α percentile of the chi-squared distribution with dim(β_{2,0}) degrees of freedom. This method is more involved but may have better control over the level of the test. WL provide further discussion and methods.

Because the given asymptotic distribution is joint for θ̂_{1,T} and θ̂_{2,T}, the same methods conveniently apply to testing GN, i.e., β_{1,0} = S_1 θ_0 = 0, where S_1 selects β_{1,0} from θ_0. In this way, GN and CE test statistics can be constructed at the same time.

WL discuss three examples, illustrating tests for direct structural non-causality based on tests of Granger non-causality and conditional exogeneity. A MATLAB module, testsn, implementing the methods described here is available at http://ihome.ust.hk/~xunlu/code.htm.

7. Summary and Concluding Remarks

In this paper, we explore the relations between direct structural causality in the settable systems framework and direct causality in the PCM for both recursive and nonrecursive systems.
The close correspondence between these concepts in recursive systems and the equivalence between direct structural causality and G-causality established by WL enable us to show the close linkage between G-causality and PCM notions of direct causality. We apply WL's results to provide straightforward practical methods for testing direct causality using tests for Granger causality and conditional exogeneity.

The methods and results described here draw largely from work of WC and WL. These papers contain much additional relevant discussion and detail. WC provide further examples contrasting settable systems and the PCM. Chalak and White (2010) build on WC, examining not only direct causality in settable systems, but also notions of indirect causality, which in turn yield implications for conditional independence relations, such as those embodied in conditional exogeneity, which plays a key role here. WL treat not only the structural VAR case analyzed here, but also the "time-series natural experiment" case, where causal effects of variables $D_t$, absorbed here into $Z_t$, are explicitly analyzed. The sequence $\{D_t\}$ represents external stimuli, not driven by $\{Y_t\}$, whose effects on $\{Y_t\}$ are of interest. For example, $\{D_t\}$ could represent passively observed visual or auditory stimuli, and $\{Y_t\}$ could represent measured neural activity. Interest may attach to which stimuli directly or indirectly affect which neurons or


Linking Granger Causality and the Pearl Causal Model with Settable Systems

groups of neurons. WL also examine the structural content of classical Granger causality and a variety of related alternative versions that emerge naturally from different versions of Assumption A.1.

Acknowledgments

We express our deep appreciation to Sir Clive W.J. Granger for his encouragement of the research underlying the work presented here.

References

E. Candès. Ridgelets: Estimating with Ridge Functions. Annals of Statistics, 31:1561–1599, 1999.

K. Chalak and H. White. Causality, Conditional Independence, and Graphical Separation in Settable Systems. Technical report, Department of Economics, Boston College, 2010.

A. P. Dawid. Conditional Independence in Statistical Theory. Journal of the Royal Statistical Society, Series B, 41:1–31, 1979.

A. P. Dawid. Beware of the DAG! Proceedings of the NIPS 2008 Workshop on Causality, Journal of Machine Learning Research Workshop and Conference Proceedings, 6:59–86, 2010.

M. A. Delgado and W. Gonzalez-Manteiga. Significance Testing in Nonparametric Regression Based on the Bootstrap. Annals of Statistics, 29:1469–1507, 2001.

M. Eichler. Granger Causality and Path Diagrams for Multivariate Time Series. Journal of Econometrics, 137:334–353, 2007.

M. Eichler and V. Didelez. Granger-causality and the Effect of Interventions in Time Series. Lifetime Data Analysis, forthcoming.

R. Engle, D. Hendry, and J.F. Richard. Exogeneity. Econometrica, 51:277–304, 1983.

M. Fernandes and R. G. Flores. Tests for Conditional Independence, Markovian Dynamics, and Noncausality.
Technical report, European University Institute, 2001.

J.P. Florens and D. Fougère. Non-causality in Continuous Time. Econometrica, 64:1195–1212, 1996.

J.P. Florens and M. Mouchart. A Note on Non-causality. Econometrica, 50:583–591, 1982.

A.R. Gallant and H. White. On Learning the Derivatives of an Unknown Mapping with Multilayer Feedforward Networks. Neural Networks, 5:129–138, 1992.


D. Galles and J. Pearl. Axioms of Causal Relevance. Artificial Intelligence, 97:9–43, 1997.

R. Gibbons. Game Theory for Applied Economists. Princeton University Press, Princeton, 1992.

S. Gonçalves and H. White. Maximum Likelihood and the Bootstrap for Nonlinear Dynamic Models. Journal of Econometrics, 119:199–219, 2004.

C. W. J. Granger. Investigating Causal Relations by Econometric Models and Cross-spectral Methods. Econometrica, 37:424–438, 1969.

C. W. J. Granger and P. Newbold. Forecasting Economic Time Series (2nd edition). Academic Press, New York, 1986.

J. Halpern. Axiomatizing Causal Reasoning. Journal of Artificial Intelligence Research, 12:317–337, 2000.

K. Hornik, M. Stinchcombe, and H. White. Universal Approximation of an Unknown Mapping and its Derivatives Using Multilayer Feedforward Networks. Neural Networks, 3:551–560, 1990.

G. B. Huang, Q.Y. Zhu, and C.K. Siew. Extreme Learning Machines: Theory and Applications. Neurocomputing, 70:489–501, 2006.

M. Huang and H. White. A Flexible Test for Conditional Independence. Technical report, Department of Economics, University of California, San Diego.

N. Kiefer and T. Vogelsang. Heteroskedasticity-autocorrelation Robust Testing Using Bandwidth Equal to Sample Size. Econometric Theory, 18:1350–1366, 2002.

N. Kiefer and T. Vogelsang. A New Asymptotic Theory for Heteroskedasticity-autocorrelation Robust Tests. Econometric Theory, 21:1130–1164, 2005.

O. Linton and P. Gozalo. Conditional Independence Restrictions: Testing and Estimation. Technical report, Cowles Foundation for Research, Yale University, 1997.

S. Ma and M. Kosorok.
Robust Semiparametric M-estimation and the Weighted Bootstrap. Journal of Multivariate Analysis, 96:190–217, 2005.

J. Pearl. Causality. Cambridge University Press, New York, 2000.

J. Pearl. Direct and Indirect Effects. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 411–420, 2001.

D.N. Politis. Higher-Order Accurate, Positive Semi-definite Estimation of Large-Sample Covariance and Spectral Density Matrices. Technical report, Department of Economics, University of California, San Diego, 2009.


A. Roebroeck, A.K. Seth, and P. Valdes-Sosa. Causality Analysis of Functional Magnetic Resonance Imaging Data. Journal of Machine Learning Research, (this issue), 2011.

K. Song. Testing Conditional Independence via Rosenblatt Transforms. Annals of Statistics, 37:4011–4015, 2009.

M. Stinchcombe and H. White. Consistent Specification Testing with Nuisance Parameters Present Only Under the Alternative. Econometric Theory, 14:295–324, 1998.

J. Stock and M. Watson. Introduction to Econometrics. Addison-Wesley, Boston, 2007.

L. Su and H. White. A Consistent Characteristic Function-Based Test for Conditional Independence. Journal of Econometrics, 141:807–834, 2007a.

L. Su and H. White. Testing Conditional Independence via Empirical Likelihood. Technical report, Department of Economics, University of California, San Diego, 2007b.

L. Su and H. White. A Nonparametric Hellinger Metric Test for Conditional Independence. Econometric Theory, 24:829–864, 2008.

H. Varian. Intermediate Microeconomics (8th edition). Norton, New York, 2009.

H. White. Estimation, Inference, and Specification Analysis. Cambridge University Press, New York, 1994.

H. White. Time-series Estimation of the Effects of Natural Experiments. Journal of Econometrics, 135:527–566, 2006a.

H. White. Approximate Nonlinear Forecasting Methods. In G. Elliott, C.W.J. Granger, and A. Timmermann, editors, Handbook of Economic Forecasting, pages 460–512, Elsevier, New York, 2006b.

H. White and K. Chalak. Settable Systems: An Extension of Pearl's Causal Model with Optimization, Equilibrium, and Learning.
Journal of Machine Learning Research, 10:1759–1799, 2009.

H. White and P. Kennedy. Retrospective Estimation of Causal Effects Through Time. In J. Castle and N. Shephard, editors, The Methodology and Practice of Econometrics: A Festschrift in Honour of David F. Hendry, pages 59–87, Oxford University Press, Oxford, 2009.

H. White and X. Lu. Granger Causality and Dynamic Structural Systems. Journal of Financial Econometrics, 8:193–243, 2010a.

H. White and X. Lu. Causal Diagrams for Treatment Effect Estimation with Application to Selection of Efficient Covariates. Technical report, Department of Economics, University of California, San Diego, 2010b.


JMLR: Workshop and Conference Proceedings 12:30–64, 2011

Causality in Time Series

Robust statistics for describing causality in multivariate time-series.

Florin Popescu
Fraunhofer Institute FIRST
Kekulestr. 7, Berlin 12489, Germany
florin.popescu@first.fraunhofer.de

Editors: Florin Popescu and Isabelle Guyon

Abstract

A widely agreed upon definition of time series causality inference, established in the seminal 1969 article of Clive Granger (1969), is based on the relative ability of the history of one time series to predict the current state of another, conditional on all other past information. While the Granger Causality (GC) principle remains uncontested, its literal application is challenged by practical and physical limitations of the process of discretely sampling continuous dynamic systems. Advances in methodology for time-series causality subsequently evolved mainly in econometrics and brain imaging: while each domain has specific data and noise characteristics, the basic aims and challenges are similar. Dynamic interactions may occur at higher temporal or spatial resolution than our ability to measure them, which leads to the potentially false inference of causation where only correlation is present. Causality assignment can be seen as the principled partition of spectral coherence among interacting signals using both auto-regressive (AR) modelling and spectral decomposition.
While both approaches are theoretically equivalent, interchangeably describing linear dynamic processes, the purely spectral approach currently differs in its somewhat higher ability to accurately deal with mixed additive noise.

Two new methods are introduced: 1) a purely auto-regressive method named Causal Structural Information, which unlike current AR-based methods is robust to mixed additive noise, and 2) a novel means of calculating multivariate spectra for unevenly sampled data based on cardinal trigonometric functions, incorporated into the recently introduced phase slope index (PSI) spectral causal inference method (Nolte et al., 2008). In addition to these, PSI, partial-coherence-based PSI and existing AR-based causality measures were tested on a specially constructed data-set simulating possible confounding effects of mixed noise, and on another additionally testing the influence of common, background driving signals. Tabulated statistics are provided, in which true causal influence is subjected to an acceptable level of false inference probability. The new methods as well as PSI are shown to allow reliable inference for signals as short as 100 points, to be robust to additive colored mixed noise and to the influence of commonly coupled driving signals, and to provide a useful measure of strength of causal influence.

Keywords: Causality, spectral decomposition, cross-correlation, auto-regressive models.

© 2011 F. Popescu.


1. Introduction

Causality is the sine qua non of scientific inference methodology, allowing us, among other things, to advocate effective policy, diagnose and cure disease and explain brain function. While it has recently attracted much interest within Machine Learning, it bears reminding that a lot of this recent effort has been directed toward static data rather than time series. The 'classical' statisticians of the early 20th century, such as Fisher, Gosset and Karl Pearson, aimed at a rational and general recipe for causal inference and discovery (Gigerenzer et al., 1990), but the tools they developed applied to simple types of inference which required the pre-selection, through consensus or by design, of a handful of candidate causes (or 'treatments') and a handful of subsequently occurring candidate effects. Numerical experiments yielded tables which were intended to serve as a technician's almanac (Pearson, 1930; Fisher, 1925), and are today an essential part of the vocabulary of scientific discourse, although tables have been replaced by precise formulae and specialized software. These methods rely on removing possible causal links at a certain 'significance level', on the basic premise that a twin experiment on data of similar size generated by a hypothetical non-causal mechanism would yield a result of similar strength only with a known (small) probability. While it may have been hoped that a generalization of the statistical test of difference among population means (e.g.
the t-test) to the case of time series causal structure may be possible using a similar almanac or recipe-book approach, in reality causality has proven to be a much more contentious - and difficult - issue.

Time series theory and analysis immediately followed the development of classical statistics (Yule, 1926; Wold, 1938) and was spurred thereafter by exigence (a severe economic boom/bust cycle, an intense high-tech global conflict) as well as opportunity (the post-war advent of a machine able to perform large linear algebra calculations). From a wide historical perspective, Fisher's 'almanac' has rendered the industrial age more orderly and understandable. It can be argued, however, that the 'scientific method', at least in its accounting/statistical aspects, has not kept up with the explosive growth of data tabulated in history, geology, neuroscience, medicine, population dynamics, economics, finance and other fields in which causal structure is at best partially known and understood, but is needed in order to cure or to advocate policy. While it may have been hoped that the advent of the computer might give rise to an automatic inference machine able to 'sort out' the ever-expanding data sphere, the potential of a computer of any conceivable power to condense the world to predictable patterns has long been proven to be shockingly limited by mathematicians such as Turing (Turing, 1936) and Kolmogorov (Kolmogorov and Shiryayev, 1992) - even before the ENIAC was built.
The basic problem reduces itself to the curse of dimensionality: being forced to choose among combinations of members of a large set of hypotheses (Lanterman, 2001). Scientists as a whole took a more positive outlook, in line with post-war boom optimism, and focused on accessible automatic inference problems. One of these scientists was Norbert Wiener, who, besides founding the field of cybernetics (the precursor of ML), introduced some of the basic tools of modern time-series analysis, a line of research he began during wartime and focused on feedback control in ballistics. The


time-series causality definition of Granger (1969) owes inspiration to earlier discussion of causality by Wiener (1956). Granger's approach blended spectral analysis with vector auto-regression, which had long been basic tools of economics (Wold, 1938; Koopmans, 1950), and appeared nearly at the same time as similar work by Akaike (1968) and Gersch and Goddard (1970).

It is useful to highlight the differences in methodological principle and in motivation for static vs. time series data causality inference, starting with the former as it comprises a large part of the pertinent corpus in Machine Learning and in data mining. Static causal inference is important in the sense that any classification or regression presumes some kind of causality, for the resulting relation to be useful in identifying elements or features of the data which 'cause' or predict target labels or variables and are to be selected at the exclusion of other confounding 'features'. In learning and generalization of static data, sample ordering is either uninformative or unknown. Yet order is implicitly relevant to learning both in the sense that some calculation occurs in the physical world in some finite number of steps which transform independent inputs (stimuli) to dependent output (responses), and in the sense that generalization should occur on expected future stimuli.
To ably generalize from a limited set of samples implies making accurate causal inference. With this priority in mind, prior NIPS workshops have concentrated on feature selection and on graphical model-type causal inference (Guyon and Elisseeff, 2003; Guyon et al., 2008, 2010) inspired by the work of Pearl (2000) and Spirtes et al. (2000). The basic technique or underlying principle of this type of inference is vanishing partial correlation, or the inference of static conditional independence among 3 or more random variables. While it may seem limiting that no unambiguous, generally applicable causality assignment procedure exists among single pairs of random variables, for large ensembles the ambiguity may be partially resolved. Statistical tests exist which assign, with a controlled probability of false inference, random variable $X_1$ as dependent on $X_2$ given no other information, but as independent of $X_2$ given $X_3$, a conceptual framework proposed for time-series causality soon after Granger's 1969 paper using partial coherence rather than static correlation (Gersch and Goddard, 1970). Applied to an ensemble of observations $X_1 \ldots X_N$, efficient polynomial-time algorithms have been devised which combine information about pairs, triples and other sub-ensembles of random variables into a complete dependency graph including, but not limited to, a directed acyclical graph (DAG).
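The vanishing-partial-correlation principle can be illustrated with a small numerical sketch (a hypothetical jointly Gaussian example; the common driver $X_3$ and all variable names are illustrative, not from the text): $X_1$ and $X_2$ are strongly correlated marginally, yet their partial correlation given $X_3$ vanishes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x3 = rng.normal(size=n)              # hypothetical common driver
x1 = x3 + 0.5 * rng.normal(size=n)   # X1 and X2 share X3 but do not
x2 = x3 + 0.5 * rng.normal(size=n)   # influence one another directly

def partial_corr(a, b, c):
    # correlation of a and b after linearly regressing out c from both
    ra = a - np.polyval(np.polyfit(c, a, 1), c)
    rb = b - np.polyval(np.polyfit(c, b, 1), c)
    return np.corrcoef(ra, rb)[0, 1]

marginal = np.corrcoef(x1, x2)[0, 1]     # strongly positive (near 0.8)
conditional = partial_corr(x1, x2, x3)   # near zero
```

The pattern "marginal correlation large, partial correlation vanishing" is what a constraint-based search reads as $X_1 \perp X_2 \,|\, X_3$ and uses to rule out a direct edge between $X_1$ and $X_2$.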
Such inference algorithms operate in a nearly deductive manner but are not guaranteed to have a unique, optimal solution. Underlying predictive models upon which this type of inference can operate include linear regression (or structural equation modeling) (Richardson and Spirtes, 1999; Lacerda et al., 2008; Pearl, 2000) and Markov chain probabilistic models (Scheines et al., 1998; Spirtes et al., 2000). Importantly, a previously unclear conceptual link between the notions of time series causality and static causal inference has been formally described: see White and Lu (2010) in this volume.

Likewise, algorithmic and functional relation constraints, or at least likelihoods thereof, have been proposed as a means to assign causality for co-observed random variable pairs (i.e. simply by analyzing the scatter plot of $X_1$ vs. $X_2$) (Hoyer et al., 2009). In


general terms, if we are presented with a scatter plot of $X_1$ vs. $X_2$ which looks like a noisy sine wave, we may reasonably infer that $X_2$ causes $X_1$, since a given value of $X_2$ 'determines' $X_1$ and not vice versa. We may even make some mild assumptions about the noise process which superimposes on a functional relation ($X_2 = X_1$ plus additive noise which is independent of $X_1$) and by this means turn our intuition into a proper asymmetric statistic, i.e. a controlled probability that $X_1$ does not determine $X_2$, an approach that has proven remarkably successful in some cases where the presence of a causal relation was known but the direction was not (Hoyer et al., 2009). The challenge here is that, unlike in traditional statistics, there is not simply the null hypothesis and its converse, but four mutually exclusive cases: A) $X_1$ is independent of $X_2$; B) $X_1$ causes $X_2$; C) $X_2$ causes $X_1$; and D) $X_1$ and $X_2$ are observations of dependent and non-causally related random variables (bidirectional information flow or feedback). The appearance of a symmetric bijection (with additive noise) between $X_1$ and $X_2$ does not mean absence of a causal relation, as asymmetry in the apparent relations is merely a clue and not a determinant of causality. Inference over static data is not free of ambiguities without additional assumptions, and requires observations of interacting triples (or more) of variables so as to allow somewhat reliable descriptions of causal relations or lack thereof (see Guyon et al. (2010) for a more comprehensive overview).
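The additive-noise intuition above can be turned into a simple numerical procedure. The sketch below illustrates the general idea, not the exact method of Hoyer et al. (2009): the polynomial regression, the HSIC-style dependence score and the noisy-sine data are all illustrative assumptions. Each variable is regressed on the other, and the direction whose residual is less dependent on the putative cause is preferred.

```python
import numpy as np

def hsic(u, v, sigma=1.0):
    # biased empirical HSIC with Gaussian kernels; larger => more dependent
    n = len(u)
    K = np.exp(-(u[:, None] - u[None, :]) ** 2 / (2 * sigma ** 2))
    L = np.exp(-(v[:, None] - v[None, :]) ** 2 / (2 * sigma ** 2))
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def residual_dependence(cause, effect, deg=5):
    # fit effect = f(cause) + r by polynomial regression, standardize r,
    # and score how dependent r remains on the putative cause
    r = effect - np.polyval(np.polyfit(cause, effect, deg), cause)
    return hsic(cause, r / r.std())

rng = np.random.default_rng(0)
x2 = rng.uniform(-3, 3, 300)
x1 = np.sin(x2) + 0.1 * rng.normal(size=300)   # X2 'determines' X1

fwd = residual_dependence(x2, x1)   # hypothesis: X2 -> X1
rev = residual_dependence(x1, x2)   # hypothesis: X1 -> X2
# fwd < rev supports inferring X2 -> X1
```

In the forward direction the residual is essentially the independent noise, so its dependence score is small; in the reverse direction the sine is non-invertible, the residual stays structured in $X_1$, and the score is larger.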
Statistical evaluation requires estimation of the relative likelihood of various candidate models or causal structures, including a null hypothesis of non-causality. In the case of complex multidimensional data, theoretical derivation of such probabilities is quite difficult, since it is hard to analytically describe the class of dynamic systems we may be expected to encounter. Instead, common ML practice consists in running toy experiments in which the 'ground truth' (in our case, causal structure) is known only to those who run the experiment, while other scientists aim to test their discovery algorithms on such data, and the methodological validity (including error rate) of any candidate method rests on its ability to predict responses to a set of 'stimuli' (test data samples) available only to the scientists organizing the challenge. This is the underlying paradigm of the Causality Workbench (Guyon, 2011). In time series causality, we fortunately have far more information at our disposal relevant to causality than in the static case. Any type of reasonable interpretation of causality implies a physical mechanism which accepts a modifiable input and performs some operations in some finite time which then produce an output, and includes a source of randomness which gives it a stochastic nature, be it inherent to the mechanism itself or in the observation process.
Intuitively, the structure or connectivity among input-output blocks that govern a data generating process is related to causality no matter (within limits) what the exact input-output relationships are: this is what we mean by structural causality. However, not all structures of data generating processes are obviously causal, nor is it self-evident how structure corresponds to Granger (non-)causality (GC), as shown in further detail by White and Lu (2010). Granger causality is a measure of relative predictive information among variables and not evidence of a direct physical mechanism linking the two processes: no amount of analysis can exclude a latent unobserved cause. Strictly speaking, the GC statistic is


not a measure of causal relation: it is the possible non-rejection of a null hypothesis of time-ordered independence.

Although time information helps resolve many of the ambiguities of static data, and despite the large body of literature on time-series modeling, several problems in time-series causality remain vexing. Knowledge of the structure of the overall multivariate data generating process is an indispensable aid to inferring causal relationships: but how to infer the structure using weak a priori assumptions is an open research question. Sections 3, 4 and 5 will address this issue. Even in the simplest case (the bivariate case), the observation process can introduce errors in time-series causal inference by means of co-variate observation noise (Nolte et al., 2010). The bivariate dataset NOISE in the Causality Workbench addresses this case, and is extended in this study to the evaluation datasets PAIRS and TRIPLES. Two new methods are introduced: an autoregressive method named Causal Structural Information (Section 7) and a method for estimating spectral coherence in the case of unevenly sampled data (Section 8.1). A principled comparison of different methods, as well as their performance in terms of type I, II and III errors, is necessary, which addresses both the presence/absence of causal interaction and directionality.
In discussing causal influence in real-world processes, we may reasonably expect that not inferring a potentially weak causal link may be acceptable, but positing one where none exists may be problematic. Sections 2, 6, 7 and 8 address robustness of bivariate causal inference, introducing a pair of novel methods and evaluating them along with existing ones. Another common source of argument in discussions of causal structure is the case of false inference caused by neglecting to condition the proposed causal information on other background variables which may explain the proposed effect equally well. While the description of a general deductive method of causal connectivity in multivariate time series is beyond the scope of this article, Section 9 evaluates numerical and statistical performance in the tri-variate case, using methods such as CSI and partial-coherence-based PSI which can apply to bivariate interactions conditioned on an arbitrary number of background variables.

2. Causality statistic

Causality inference is subject to a wider class of errors than classical statistics, which tests independence among variables. A general hypothesis evaluation framework can be:
\[
\begin{aligned}
&\text{Null Hypothesis (no causal interaction)} && H_0: A \perp B \,|\, C \\
&\text{Hypothesis 1a ($A$ drives $B$)} && H_a: A \rightarrow B \,|\, C \\
&\text{Hypothesis 1b ($B$ drives $A$)} && H_b: B \rightarrow A \,|\, C \\
&\text{Type I error prob.} && \alpha = P\big(\hat H_a \text{ or } \hat H_b \mid H_0\big) \qquad (1) \\
&\text{Type II error prob.} && \beta = P\big(\hat H_0 \mid H_a \text{ or } H_b\big) \\
&\text{Type III error prob.} && \gamma = P\big(\hat H_a \mid H_b \ \text{ or } \ \hat H_b \mid H_a\big)
\end{aligned}
\]
The notation $\hat H$ means that our statistical estimate of the likelihood of $H$ exceeds the threshold needed for our decision to confirm it. This formulation carries some caveats, the justification for which is pragmatic and will be expounded upon in later sections. The main one is the use of the term 'drives' in place of 'causes'. The null hypothesis can be viewed as equivalent to strong Granger non-causality (as it will be argued is necessary), but it does not mean that the signals $A$ and $B$ are independent: they may well be correlated to one another. Furthermore, we cannot realistically aim at statistically supporting strict Granger causality, i.e. strictly one-sided causal interaction, since asymmetry in bidirectional interaction may be more likely in real-world observations and is equally meaningful. By 'driving' we mean instead that the history of one time series element $A$ is more useful for predicting the current state of $B$ than vice-versa, and not that the history of $B$ is irrelevant to predicting $A$. In the latter case we would specify 'G-causes' instead of 'drives', and for $H_0$ we would employ non-parametric independence tests of Granger non-causality (GNC) which have already been developed, as in Su and White (2008) and Moneta et al. (2010).
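The three error types in Definition (1) amount to a simple taxonomy of decision outcomes; a minimal sketch (the string labels are hypothetical, not from the paper) classifies a method's decision against ground truth:

```python
def error_type(truth, decision):
    # truth and decision each in {"none", "A->B", "B->A"};
    # returns "I", "II", "III", or None when the decision is correct
    if truth == "none":
        return "I" if decision != "none" else None   # false causal inference
    if decision == "none":
        return "II"                                  # missed causal interaction
    return "III" if decision != truth else None      # wrong direction
```

Tabulating this classifier's output over repeated simulated trials is one way to estimate the empirical rates corresponding to $\alpha$, $\beta$ and $\gamma$.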
Note that the definition in (1) differs from that recently proposed in White and Lu (2010), which goes further than GNC testing to make the point that structural causality inference must also involve a further conditional independence test: Conditional Exogeneity (CE). In simple terms, CE tests whether the innovations process of the potential effect is conditionally independent of the cause (or, by practical consequence, whether the innovations processes are uncorrelated). White and Lu argue that if both GNC and CE fail we ought not make any decision regarding causality, and they combine the power of both tests in a principled manner such that the probability of false causal inference, or non-decision, is controlled. The difference in this study is that the concurrent failure of GNC and CE is precisely the difficult situation requiring additional focus, and it will be argued that methods that can cope with this situation can also perform well in the case of CE, although they require stronger assumptions. In effect, it is assumed that real-world signals feature a high degree of non-causal correlation, due to aliasing effects as described in the following section, and that strong evidence to the contrary is required, i.e. that non-decision is equivalent to inference of non-causality. The precise meaning of 'driving' will also be made explicit in the description of Causal Structural Information, which is implicitly a proposed definition of $H_0$.
Also different in Definition (1) from White and Lu is the accounting of potential error in causal direction assignment under a framework which forces the practitioner to make such a choice if GNC is rejected.

One of the difficulties of causality inference methodology is that it is hard to ascertain what true causality in the real world ('ground truth') is for a sufficiently comprehensive class of problems (such that we can reliably gauge error probabilities): hence the need for extensive simulation. A clear means of validating a causal hypothesis would be intervention (Pearl, 2000), i.e. modification of the presumed cause, but in instances such as historic and geological data this is not feasible. The basic approach will be to assume a non-informative probability distribution over the degree of mixing, or non-causal dynamic interaction, as well as over individual spectra, and to compile inference error probabilities over a wide class of coupled dynamic systems. In constructing a 'robust causality' statistic there is more than simply null-hypothesis rejection and accurate directionality to consider, however. In scientific practice we are not only interested in whether A and B are causally related, but in which is the main driver in case of bidirectional coupling; and among a time series vector A, B, C, D, ... it is important to determine which of these factors are the main causes of the target variable, say A. The relative effect size and relative causal influence strength must also be quantified, lest the analysis be misused (Ziliak and McCloskey, 2008). The rhetorical and scientific value of effect size in no way devalues the underlying principle of robust statistics and controlled inference error probabilities used to quantify it.

3. Auto-regression and aliasing

A simple multivariate time series model is the multivariate auto-regressive model (abbreviated as MVAR or VAR). It assumes that the data generating process (DGP) that created the observations is a linear dynamic model and, as such, it contains poles only, i.e. the numerator of the transfer function between the innovations process and the observation is a scalar. The more complex auto-regressive moving average model (ARMA) includes zeros as well.
Despite the rather stringent assumptions of VAR, a time-series extension of ordinary least squares linear regression, it has been hugely successful in applications from neuroscience to engineering to sociology and economics. Its familiar VAR (or VARX) formulation is:

$$y_i = \sum_{k=1}^{K} A_k y_{i-k} + Bu + w_i \qquad (2)$$

where $\{y_{i,d=1..D}\}$ is a real-valued vector of dimension $D$. Notice the absence of a subscript in the exogenous input term $u$. This is because a general treatment of exogenous inputs requires a lagged sum, i.e. $\sum_{k=1}^{K} B_k u_{i-k}$. Since exogenous inputs are not explicitly addressed in the following derivations, the general linear operator placeholder $Bu$ is used instead and can be re-substituted for subsequent use.

Granger non-causality for this system, expressed in terms of conditional independence, would place a relation among elements of $y$ subject to knowledge of $u$. If $D = 2$, for all $i$:

$$y_{1,i} \perp y_{2,i-1..i-K} \,|\, y_{1,i-1..i-K} \qquad (3)$$

If the above is true, we would say that $y_2$ does not finite-order G-cause $y_1$. If the world were made exclusively of linear VARs, it would not be terribly difficult to devise a reliable statistic for G-causality. We would, given a sequence of $N$ data points, identify the maximum-likelihood parameters $A$ and $B$ via ordinary least squares (OLS) linear regression, after having determined the order $K$ via some model selection criterion. Furthermore, we would choose another criterion (e.g. a test and p-value) which tells us whether any particular coefficient is likely to be statistically indistinguishable from 0, which would correspond to a vanishing partial correlation. If all the $A_k$ are lower triangular, G non-causality is satisfied (in one direction but not the converse). It is however very rare that the physical mechanism we are observing is indeed the embodiment of a VAR, and therefore, even when G non-causality can be safely rejected, it is not likely that the best VAR approximation of the observed data is strictly lower/upper triangular. The necessity of a distinction between strict causality, which has a structural interpretation, and a causality statistic, which does not measure independence in the sense of Granger non-causality but rather the relative degree of dependence in both directions between two signals (driving), is most evident in this case. If the VAR in question had very small (but statistically observable) upper triangular elements, would a discussion of causality of the observed time series be rendered moot?

One of the most common physical mechanisms incompatible with VAR is aliasing, i.e. dynamics which are faster than the (shortest) sampling interval. The standard interpretation of aliasing is the false representation of frequency components of a signal due to sub-Nyquist-frequency sampling; in the multivariate time-series case it can also lead to spurious correlations in the observed innovations process (Phillips, 1973).
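The OLS-based identification procedure described above can be sketched concretely. This is a minimal illustration, not the paper's method: the model order $K$ is fixed by hand rather than chosen by a selection criterion, and coefficient magnitudes are inspected directly rather than via a formal test.

```python
import numpy as np

def fit_var_ols(y, K):
    """OLS fit of y_i = sum_k A_k y_{i-k} + w_i; y is (N, D).
    Returns A of shape (K, D, D) and the residuals."""
    N, D = y.shape
    # Regressor block for lag k+1 is y[K-k-1 : N-k-1]
    X = np.hstack([y[K - k - 1:N - k - 1] for k in range(K)])
    Y = y[K:]
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)       # shape (K*D, D)
    A = coef.T.reshape(D, K, D).transpose(1, 0, 2)     # A[k, out, in]
    return A, Y - X @ coef

# Simulate a VAR(2) in which y1 drives y2 but not vice-versa
# (all A_k lower triangular, as in the G non-causality condition above).
rng = np.random.default_rng(1)
A_true = np.array([[[0.5, 0.0], [0.4, 0.3]],
                   [[0.2, 0.0], [0.1, 0.2]]])
N, K = 2000, 2
y = np.zeros((N, 2))
for i in range(K, N):
    y[i] = sum(A_true[k] @ y[i - k - 1] for k in range(K)) \
           + rng.standard_normal(2)
A_hat, resid = fit_var_ols(y, K)
print(np.round(A_hat, 2))   # upper-triangular (y2 -> y1) entries near zero
```

With enough data the estimated upper-triangular entries shrink toward zero, which is exactly the vanishing-partial-correlation criterion; the next paragraphs explain why real data rarely behave this cleanly.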
Consider a continuous bivariate VAR of order 1 with Gaussian innovations in which the sampling frequency is several orders of magnitude smaller than the Nyquist frequency. In this case we would observe a covariate, time-independent Gaussian process, since for all practical purposes the information travels 'instantaneously'. In economics, this effect could be due to social interactions or market reactions to news which happen faster than the sampling interval (be it daily, hourly or monthly). In fMRI analysis, sub-sampling-interval brain dynamics are observed through a relatively slow time convolution process, the hemodynamic response to neural activity (for a detailed exposition of causality inference in fMRI see Roebroeck et al. (2011) in this volume). Although 'aliasing' normally refers to temporal aliasing, the same process can occur spatially. In neuroscience and in economics the observed variables are summations (dimensionality reductions) of a far larger set of interacting agents, be they individuals or neurons. In electroencephalography (EEG) the propagation of electrical potential from cortical axons arrives via multiple pathways at the same recording location on the scalp: micrometer-scale electric potentials are summed on the scalp at centimeter scale. Once again there are spurious observable correlations: this is known as the mixing problem.
Such effects can be modeled, albeit with significant information loss, by a DGP class which is a superset of VAR, known in econometrics as SVAR (structural vector auto-regression, the time series equivalent of structural equation modeling (SEM), often used in static causality inference (Pearl, 2000)). Another basic problem in dynamic system identification is that not only do we discard much information about the world in sampling it, but our observations are susceptible to additive noise, and the randomness we see in the data is not entirely the randomness of the mechanism we intend to study. One of the most problematic additive noise models is mixed colored noise, in which there are structured correlations both in time and across elements of the time series, but in no causal way: there is only a linear transformation of colored noise, sometimes called mixing, due to spatial aliasing.
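The cross-predictability induced by mixing alone can be demonstrated in a few lines. In this hypothetical sketch, two AR(1) noise sources with no causal link whatsoever are mixed instantaneously; the past of each mixed channel nevertheless correlates strongly with the present of the other:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
# Two independent colored (AR(1)) noise sources: no causal link between them.
s = np.zeros((n, 2))
for i in range(1, n):
    s[i] = 0.9 * s[i - 1] + rng.standard_normal(2)
M = np.array([[1.0, 0.6],
              [0.4, 1.0]])      # instantaneous (spatial) mixing matrix
y = s @ M.T
# Lagged cross-correlations: both far from zero despite independent sources.
r_21 = np.corrcoef(y[:-1, 1], y[1:, 0])[0, 1]   # past of y2 vs present of y1
r_12 = np.corrcoef(y[:-1, 0], y[1:, 1])[0, 1]   # past of y1 vs present of y2
print(f"r_21={r_21:.2f}  r_12={r_12:.2f}")
```

A naive Granger test applied to `y` would find 'predictability' in both directions, which is exactly the spurious cross-predictability discussed next.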


Mixing may occur due to temporal aliasing in sampling a coupled continuous-variable VAR system. In EEG analysis, mixed colored noise models the background electrical activity of the brain. In other domains such as economics, one can imagine the influence of unpredictable events such as natural cataclysms or macroeconomic cycles which are not white noise and which are reflected nearly 'instantaneously', but to varying degrees, in all our measurements. In this case, since each additive noise component is colored (it has temporal auto-correlation), its past helps predict its current value. Since the observation is a linear mixture of noise components, all current observations are correlated, and the past of any component can help predict the current state of any other. In this case, the strict definition of Granger causality would not make practical sense, since this cross-predictability is not meaningful.

It should be noted on this point that the literature contains (sometimes inconsistent) sub-classifications of Granger causality, such as weak and strong Granger causality. One definition particularly pertinent to this work, given in Caines (1976) and Solo (2006), is that strong Granger causality allows instantaneous dependence while weak Granger causality does not (i.e. it is strictly time ordered). We are aiming in this work at strong Granger causality inference, i.e. one which is robust to aliasing effects such as colored noise. While we should account for instantaneous interactions, we do not have to assign causal interpretations to them, since they are symmetric (the cross-correlation of independent mixed signals is symmetric).

4. Auto-regression, learning and Granger Causality

Learning is the process of discovering predictable patterns in the real world, where a 'pattern' is described by an algorithm or an automaton. Besides the object of learning, i.e. the algorithm which we infer and which maps stimuli to responses, we need to consider the algorithm which performs the learning process and outputs the former. The third algorithm we should consider is the one embodied in the real world, which we do not know, which generates the data we observe, and which we hope to be able to recover, or at least approximate. How can we formally describe it? A Data Generating Process (DGP) can be a machine or automaton: an algorithm that performs every operation deterministically in a finite number of steps, but which contains an oracle that generates perfectly random numbers. It is sufficient that this oracle generate 1's and 0's only: all other computable probability distributions can be calculated from it. A DGP contains rational-valued parameters (rational so as to comply with finite computability), in this case the integer $K$ and all elements of the matrices $A$. Last but not least, a DGP specification may limit the set of admissible parameter values and probability distributions of the oracle-generated values.
The set of all possible outputs of a DGP corresponds to the set of all probability distributions generated by it over all admissible parameter values, which we shall call the DGP class.

Definition 1 Let $i \in \mathbb{N}$ and let $s_a$, $s_w$, $p_w$ be finite-length prefix-free binary strings. Furthermore let $y$ and $u$ be rational-valued matrices of size $N \times i$ and $M \times i$, and let $t$ be a rational-valued vector with distinct elements, of length $i$. Let $a$ also be a finite rational-valued vector. A Data Generating Process is a quintuple $\{s_a, p_w, T_a, T_w\}$ where $T_a$, $T_w$ are finite-time Turing machines which perform the following operations: given an input of the incompressible string $p_w$, the machine $T_w$ calculates a rational-valued matrix $w$. The machine $T_a$, when given matrices $y$, $a$, $u$, $t$, $w$ and a positive rational $\Delta t$, outputs a vector $y_{i+1}$ which is assigned, for future operations, to the time $t_{i+1} = \max(t) + \Delta t$.

The definition is somewhat unusual in defining stochastic systems as embodiments of Turing machines, but it is quite standard in terms of defining an innovations term $w$, a probability distribution thereof $p_w$, a state $y$, a generating function $p_a$ with parameters $a$, and an exogenous input $u$. The motivation for using the terminology of algorithmic information theory is to analyse causality assignment as a computational problem. For reasons of finite description and computability, our variables are rational rather than real valued. Notice that there is no real restriction on how the time series is to be generated, recursively or otherwise. The initial condition in case of recursion is implicit, and time is specified as distinct and increasing but otherwise arbitrarily distributed; it does not necessarily grow in constant increments (it is asynchronous).
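Definition 1 admits a loose executable analogue for the VAR case. This sketch is illustrative only: numpy's seeded pseudo-random generator stands in for the oracle $p_w$, the machine $T_a$ is the recursion of Equation (2) without exogenous input, and timestamps grow by an arbitrary positive $\Delta t$, reflecting the asynchronous time specification.

```python
import numpy as np

class VarDGP:
    """Loose executable analogue of Definition 1 for the VAR case:
    T_w draws innovations from a seeded pseudo-random 'oracle';
    T_a applies the recursion of Eq. (2); time stamps are asynchronous."""
    def __init__(self, A, seed=0):
        self.A = np.asarray(A, dtype=float)        # parameters a: (K, D, D)
        self.K, self.D, _ = self.A.shape
        self.rng = np.random.default_rng(seed)     # stands in for the oracle p_w
        self.y = [np.zeros(self.D) for _ in range(self.K)]
        self.t = [float(k) for k in range(self.K)]
    def step(self, dt=1.0):
        w = self.rng.standard_normal(self.D)       # T_w: one innovations sample
        y_next = sum(self.A[k] @ self.y[-(k + 1)] for k in range(self.K)) + w
        self.y.append(y_next)
        self.t.append(self.t[-1] + dt)             # t_{i+1} = max(t) + dt
        return y_next

dgp = VarDGP([[[0.5, 0.0], [0.4, 0.3]]], seed=3)
traj = np.array([dgp.step(dt=0.5 + 0.1 * i) for i in range(100)])
print(traj.shape)                                  # 100 samples of dimension 2
```

Varying `dt` from call to call produces a strictly increasing but non-uniform time axis, matching the asynchronous specification above.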
The slight paradox about describing stochastic dynamical systems in algorithmic terms is the necessity of postulating a random number generator (an oracle), which in some ways is our main tool for abstracting the complexity of the real world, yet is a physical impossibility (since such an oracle would require infinite computational time; see Li and Vitanyi (1997) for an overview). Also, the Turing machines we consider have finite memory and are time restricted (they implement a predefined maximum number of operations before yielding a default output). Otherwise the rules of algebra (since they perform algebraic operations) apply normally. The cover of a DGP can be defined as:

Definition 2 The cover of a Data Generating Process (DGP) class is the cover of the set of all outputs $y$ that a DGP calculates for each member of the set of admissible parameters $a, u, t, w$ and for each initial condition $y_1$. Two DGPs are stochastically equivalent if the cover of the set of their possible outputs (for fixed parameters) is the same.

Let us now attempt to define a Granger Causality statistic in algorithmic terms. Allowing for the notation $j..k = \{j-1, j-2, .., k+1, k\}$ if $j > k$, and in reverse order if $j < k$:

$$\frac{1}{i}\sum_{j=1}^{i} K(y_{1,j} \,|\, y_{1,j-1..1}, u_{j-1..1}) - K(y_{1,j} \,|\, y_{2,j-1..1}, y_{1,j-1..1}, u_{j-1..1}) \qquad (4)$$

This differs from Equation (3) in two elemental ways: it is not a statement of independence but a number (statistic), namely the average difference (rate) of conditional (or prefix) Kolmogorov complexity of each point in the presumed effect vector when given both vector histories or just one, and given the exogenous input history. It is a generalized conditional entropy rate, and may reasonably be normalized as such:


$$F^{K}_{2\rightarrow 1|u} = \frac{1}{i}\sum_{j=1}^{i}\left(1 - \frac{K(y_{1,j} \,|\, y_{2,j-1..1},\, y_{1,j-1..1},\, u_{j-1..1})}{K(y_{1,j} \,|\, y_{1,j-1..1},\, u_{j-1..1})}\right) \qquad (5)$$

which is a fraction ranging from 0, meaning no influence of $y_1$ by $y_2$, to 1, corresponding to complete determination of $y_1$ by $y_2$, and which can be transformed into a statistic comparing different data sets and processes and giving probabilities of spurious results. Another difference from Equation (3) is that we do not refer to finite-order G-causality but simply G-causality (in the general case we do not know the maximum lag order but must infer it). For a more in-depth look at DGPs, structure and G-causality, see White and Lu (2010). The larger the value $F^{K}_{2\rightarrow 1|u}$, the more likely that $y_2$ G-causes $y_1$. The definition is one of conditional information, and it describes an averaged process rather than a single instance (time point). However, Kolmogorov complexity is incomputable, and as such Granger (non-)causality must also be, in general, incomputable. A detailed look at this issue is beyond the scope of this article, but in essence, we can never test all possible models that could tell us whether the history of a time series helps or does not help predict (compress) another, and the set of finite-running-time Turing machines is not enumerable.
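Although $K(\cdot)$ is incomputable, a practical compressor gives a crude upper bound, and a compression analogue of the normalized statistic (5) can be sketched. This is emphatically not the method proposed in this article: it replaces the per-timepoint conditional complexities with whole-series compressed lengths (via the standard approximation $K(a\,|\,b) \approx C(b \Vert a) - C(b)$), coarsely quantizes the data, and ignores the exogenous input.

```python
import zlib
import numpy as np

def clen(*series):
    """Compressed length (bytes) of concatenated, coarsely quantized series:
    a crude stand-in for Kolmogorov complexity."""
    q = np.concatenate([np.asarray(s) for s in series])
    q = np.clip(np.round(q * 4), -127, 127).astype(np.int8)
    return len(zlib.compress(q.tobytes(), 9))

def f_compress(y1, y2):
    """Compression analogue of Eq. (5): 1 - K(y1 | y2) / K(y1),
    with K(y1 | y2) approximated as C(y2, y1) - C(y2)."""
    return 1.0 - (clen(y2, y1) - clen(y2)) / clen(y1)

rng = np.random.default_rng(5)
n = 2000
y2 = np.zeros(n)
for i in range(1, n):
    y2[i] = 0.9 * y2[i - 1] + rng.standard_normal()
y1_coupled = np.roll(y2, 1)            # y1 deterministically driven by y2
y1_indep = rng.standard_normal(n)      # y1 unrelated to y2
print(f_compress(y1_coupled, y2), f_compress(y1_indep, y2))
```

The coupled pair scores far higher than the independent pair, mirroring the intended behavior of (5); the quantization step size and the choice of compressor are both arbitrary knobs of this illustration.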
We have partially circumvented the halting problem, since we have specified finite-state, finite-operation machines as the basis of DGPs, but we have not specified a search procedure over all DGPs that enumerates them. Even if we limit ourselves to DGPs which are MVAR, the necessary computational time to calculate the description length (instead of $K(\cdot)$) is NP-complete, i.e. it requires an enumeration of all possible parameters of a DGP class, barring any special properties thereof: finding the optimal model order requires such a search (keep in mind that VAR estimation is convex only once we know the model order and AR structure).

In practice, we should limit the class of DGPs we consider within our statistic to one which allows the possibility of polynomial-time computation. If we take Equation (2) and further make the common assumption that the input vector $w$ is an i.i.d. normally distributed sequence, independent along dimension $d$, we have specified the linear VAR Gaussian DGP class (which we shall shorten to VAR class). This DGP class, again, has proven remarkably useful in cases where nothing else except the time series vector $y$ is known. Re-writing (2):

$$y_i = \sum_{k=1}^{K} A_k y_{i-k} + D w_i, \qquad D_{ii} > 0, \; D_{ij} = 0 \qquad (6)$$

The matrix $D$ is a positive diagonal matrix containing the scaling, or effective standard deviations, of the innovation terms. The standard deviation of each element of the innovations term $w$ is assumed hereafter to be equal to 1.

5. Equivalence of auto-regressive data generation processes

In econometrics the following formulation is familiar (SVAR):


$$y_i = \sum_{k=0}^{K} A_k y_{i-k} + Bu + Dw_i \qquad (7)$$

The difference between this and Equation (6) is the presence of a 0-lag matrix $A_0$, which for easy tractability has zero diagonal entries and is sometimes placed on the LHS. This 0-lag matrix is meant to model the sub-sampling-interval dynamic interactions among observations, which appear instantaneous; see Moneta et al. (2011) in this volume. Let us call this form zero-lag SVAR. In electro- and magneto-encephalography (EEG/MEG) we often encounter the following form:

$$x_i = \sum_{k=1}^{K} {}^{\mu}A_k x_{i-k} + {}^{\mu}Bu + Dw_i, \qquad y_i = C x_i \qquad (8)$$

where $C$ represents the observation matrix, or mixing matrix; it is determined by the conductivity/permeability of tissue and accounts for the superposition of the electromagnetic fields created by neural activity, which propagates at nearly the speed of light and therefore appears instantaneous. Let us call this mixed-output SVAR. In certain engineering applications we may instead see structured disturbances:

$$y_i = \sum_{k=1}^{K} {}^{\theta}A_k y_{i-k} + {}^{\theta}Bu + D_w w_i \qquad (9)$$

which we shall call covariate-innovations SVAR ($D_w$ is a general nonsingular matrix, unlike $D$, which is diagonal). Another SVAR form to consider is one in which the 0-lag matrix ${}^{\triangleleft}A_0$ is strictly upper triangular (upper triangular zero-lag SVAR):

$$y_i = {}^{\triangleleft}A_0 y_i + \sum_{k=1}^{K} A_k y_{i-k} + {}^{\triangleleft}Bu + Dw_i \qquad (10)$$

Finally, we may consider an upper or lower triangular covariate-innovations SVAR:

$$y_i = \sum_{k=0}^{K} A_k y_{i-k} + Bu + {}^{\triangleleft}Dw_i \qquad (11)$$

where ${}^{\triangleleft}D$ is upper/lower triangular.
The SVAR forms (6)-(10) may look different, and in fact each of them may uniquely represent physical processes and allow for direct interpretation of parameters. From a statistical point of view, however, all four SVAR DGPs introduced above are equivalent, since they have identical cover.

Lemma 3 The Gaussian covariate-innovations SVAR DGP has the same cover as the Gaussian mixed-output SVAR DGP. Each of these sets has a redundancy of $2^N N!$ for instances in which the matrix $D_w$ is the product of a unitary and a diagonal matrix, the matrix $C$ is a unitary matrix, and the matrix $A_0$ is a permutation of an upper triangular matrix.

Proof Starting with the definition of covariate-innovations SVAR in Equation (9), we use the variable transformation $y = D_w x$ and obtain the mixed-output form (trivial). The set of Gaussian random variables is closed under scalar multiplication (and hence sign change) and addition. This means that the variance of the innovations term in Equation (9) can be written as:

$$\Sigma_w = D_w^T D_w = D_w^T U^T U D_w$$

where $U$ is a unitary (orthogonal, unit 2-norm) matrix. Since all innovations term elements are zero mean, the covariance matrix is the sole descriptor of the Gaussian innovations term. This in turn means that any other matrix $D'_w = U D_w$ substituted into the DGP described in Equation (9) amounts to a stochastically equivalent DGP. The matrix $D'_w$ can belong to a number of general sets of matrices, one of which is the set of nonsingular upper triangular matrices (the transformation is achievable through the QR decomposition of $\Sigma_w$). Another such set is the set of lower triangular matrices. Both are subsets of the set of matrices sometimes called 'psychologically upper triangular', meaning a permutation of an upper triangular matrix.

If we constrain $D_w$ to be of the form $D_w = UD$, i.e. such that (by polar decomposition) it is the product of a unitary and a diagonal positive definite matrix, the only stochastically equivalent transformations of $D_w$ are a symmetry-preserving permutation of its rows/columns and a sign change in one of the columns (this is a property of orthogonal matrices such as $U$). There are $N!$ such permutations and $2^N$ possible sign changes. For the general case, in which the input $u$ has no special properties, there are no other redundancies in the SVAR model (since changing any parameter in $A$ and $B$ will otherwise change the output). Without loss of generality, then, we can write the transformation from covariate-innovations to mixed-output SVAR form as:

$$x_i = \sum_{k=1}^{K} {}^{\theta}A_k x_{i-k} + {}^{\theta}Bu + U D_w w_i$$
$$y_i = \sum_{k=1}^{K} U^T ({}^{\theta}A_k) U\, y_{i-k} + U^T ({}^{\theta}B) u + D_w w_i, \qquad y_i = U^T x_i$$

Since the transformation $U$ is one-to-one and invertible, and since this transformation is what allows a (restricted) covariate-noise SVAR to map, one-to-one, onto a mixed-output SVAR, the cardinality of both covers is the same.

Now consider the zero-lag SVAR form:


Popescu

    y_i = \sum_{k=0}^{K} A_k y_{i-k} + B u + D w_i

    D^{-1}(I - A_0) y_i = \sum_{k=1}^{K} D^{-1} A_k y_{i-k} + D^{-1} B u + w_i

Taking the singular value decomposition of the (nonsingular) matrix coefficient on the LHS:

    U_0 S V_0^T y_i = \sum_{k=1}^{K} D^{-1} A_k y_{i-k} + D^{-1} B u + w_i

    V_0^T y_i = S^{-1} U_0^T \sum_{k=1}^{K} D^{-1} A_k y_{i-k} + S^{-1} U_0^T D^{-1} B u + S^{-1} U_0^T w_i

Using the coordinate transformation z = V_0^T y, and noting that the unitary transformation U_0^T can be ignored due to the closure properties of the Gaussian, we are left with the mixed-output form:

    z_i = \sum_{k=1}^{K} S^{-1} U_0^T D^{-1} A_k V_0 z_{i-k} + S^{-1} U_0^T D^{-1} B u + S^{-1} w'_i,    y = V_0 z

So far we have shown that for every zero-lag SVAR there is at least one mixed-output VAR. Let us for a moment consider the covariate-noise SVAR (after pre-multiplication):

    D_w^{-1} y_i = \sum_{k=1}^{K} D_w^{-1} A_k y_{i-k} + D_w^{-1} B u + w_i

We can then easily write it in terms of zero lag:

    y_i = (I - D_w^{-1}) y_i + \sum_{k=1}^{K} D_w^{-1} A_k y_{i-k} + D_w^{-1} B u + w_i

However, the diagonal entries of I - D_w^{-1} are not zero (as required by definition). This can be fixed by scaling by the diagonal D_0 = \mathrm{diag}(D_w^{-1}):

    y_i = (I - D_0^{-1} D_w^{-1}) y_i + \sum_{k=1}^{K} D_0^{-1} D_w^{-1} A_k y_{i-k} + D_0^{-1} D_w^{-1} B u + D_0^{-1} w_i

    A_0 = I - D_0^{-1} D_w^{-1},    D_w^{-1} = \mathrm{diag}(D_w^{-1}) (I - A_0)

while the following relation on the (constant) innovations covariance preserves DGP equivalence:

    \Sigma_w^{-1} = (D_w^T D_w)^{-1} = D_w^{-1} D_w^{-T} = D_0 (I - A_0)(I - A_0)^T D_0

The zero-lag matrix is thus a function of D_w^{-1}, the inversion of which is an eigenvalue problem. However, as long as the covariance matrix (or its inverse) is constant, the DGP is unchanged, and this allows N(N-1)/2 degrees of freedom. Let us consider only mixed-input systems for which the innovations terms are of unit variance. There is no real loss of generality, since a simple row division by each element of D_0 normalizes the covariate-noise form (to be regained by scaling the output). In this case the equivalence constraint is one in which:

    (I - ⊳A_0)^T (I - ⊳A_0) = (I - A_0)^T (I - A_0)

If (I - A_0) is full rank, a strictly upper triangular matrix ⊳A_0 may be found that is equivalent (this is the Cholesky decomposition of the inverse covariance matrix in reverse order). As D_w is equivalent to any unitary transformation U D_w, this includes permutations and orthogonal rotations; any permutation of D_w implies a corresponding permutation of A_0, which (along with rotations) gives 2^N N! solutions. The non-uniqueness of SVAR and the problematic interpretation of AR coefficients with respect to variable permutation is a known problem (Sims, 1981), as is the fact that modeling zero-lag matrices is equivalent to covariance estimation for the Gaussian case in which the other lag coefficients are zero. In fact, statistically vanishing elements of the covariance matrix are used in Structural Equation Modeling and are given causal interpretations (Pearl, 2000). It is not clear how robust such inferences are with respect to equivalent permutations. The point of the lemma above is to illustrate the ambiguity of interpreting the structure of (sparse or full) AR systems in the covariate-innovations, zero-lag, and mixed-output cases, which are equivalent to one another. In the case of SVAR, one approach is to perform standard AR estimation followed by a Cholesky decomposition of the covariance of the residuals and pre-multiplication. In Popescu (2008), the upper triangular SVAR estimation is done directly by singular value decomposition after regression, and the innovations covariance is estimated from the zero-lag matrix.
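The lemma can be checked numerically. The sketch below (a minimal NumPy illustration with arbitrary coefficients, not the paper's estimation procedure) draws a random full-rank zero-lag matrix A_0, forms the implied innovations precision (I - A_0)^T (I - A_0), and recovers a strictly upper triangular equivalent ⊳A_0 via a Cholesky factorization, equal to the original up to the diagonal scaling freedom discussed above:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4
A0 = 0.3 * rng.normal(size=(N, N))
np.fill_diagonal(A0, 0.0)            # a zero-lag matrix has zero diagonal by definition
M = np.eye(N) - A0
P = M.T @ M                          # implied innovations precision (all that is identifiable)

# Upper-triangular factor R with R^T R = P (transpose of the lower Cholesky factor)
R = np.linalg.cholesky(P).T
D0 = np.diag(np.diag(R))             # diagonal scaling that normalizes R to unit diagonal
A0_tri = np.eye(N) - np.linalg.inv(D0) @ R   # strictly upper triangular equivalent

assert np.allclose(A0_tri, np.triu(A0_tri, 1))
# Same precision, i.e. the same Gaussian DGP, up to the diagonal (unit-variance) rescaling:
assert np.allclose((np.eye(N) - A0_tri).T @ D0 @ D0 @ (np.eye(N) - A0_tri), P)
```

Since only P is identifiable from Gaussian data, the dense A0 and the triangular A0_tri cannot be distinguished, which is the ambiguity the lemma points to.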


Figure 1: Structural VAR equivalence and causality. A) Direct structural Granger causality (both directions shown); z stands for the delay operator. B) Equivalent covariate-innovations (left) and mixed-output (right) systems; neither representation shows dynamic interaction. C) A sparse, one-sided covariate-innovations DAG is non-sparse in the mixed-output case (and vice versa). D) The upper triangular structure of the zero-lag matrix is not informative in the 2-variable Gaussian case, and is equivalent to a full mixed-output system.


Granger, in his 1969 paper, suggests that 'instantaneous' (i.e. covariate) effects be ignored and only the temporal structure be used. Whether or not we accept instantaneous causality depends on prior knowledge: in the case of EEG, the mixing matrix cannot have any physical 'causal' explanation even if it is sparse. Without additional a priori assumptions, either we infer causality on unseen and presumably interacting hidden variables (the mixed-output form, as in EEG/MEG) or we assume a non-causal mixed-innovations input. Note also that the zero-lag system appears to be causal but can be written in a form which suggests the opposite direction of causal influence (hence it is sometimes termed 'spurious causality'). In short, since instantaneous interaction in the Gaussian case cannot be resolved causally purely in terms of prediction and conditional information (as intended by Wiener and Granger), it is proposed that such interactions be accounted for but not given causal interpretation (as in 'strong' Granger non-causality).

There are at least four distinct overall approaches to dealing with aliasing effects in time-series causality. 1) Make prior assumptions about covariance matrices and limit inference to domain-relevant and interpretable posteriors, as in Bernanke et al. (2005) in economics and Valdes-Sosa et al. (2005) in neuroscience. 2) Allow for unconstrained graphical-causal-model type inference among covariate innovations, assuming either Gaussianity or non-Gaussianity, the latter allowing for stronger causal inferences (see Moneta et al. (2011) in this volume). One possible drawback of this approach is that DAG-type inference is non-unique, at least in the Gaussian case, in which there is so-called 'Markov equivalence' among candidate graphs. 3) Assume a physically interpretable mixed-output or covariate-innovations form, and take the inferred sparsity structure (or the intersection thereof over the nonzero-lag coefficient matrices) as the connection graph. Popescu (2008) implemented such an approach by using the minimum description length principle to provide a universal prior over rational-valued coefficients, and was able to recover structure in the majority of simulated covariate-innovations processes of arbitrary sparsity. This approach is computationally laborious, as it is NP-hard and non-convex; moreover, a system that is sparse in one form (covariate-innovations or mixed-output) is not necessarily sparse in another equivalent SVAR form, and completely dense SVAR systems may be non-causal (in the strong GC sense). 4) Interpret causality not as a binary value, but determine the direction of interaction as a continuous-valued statistic, one which is theoretically robust to covariate innovations or mixtures. This is the principle of the recently introduced phase slope index (PSI), which belongs to a class of methods based on spectral decomposition and partition of coherency. Although autoregressive, spectral, and impulse-response-convolution forms are theoretically equivalent representations of linear dynamics, they do differ numerically, and spectral representations afford direct access to phase estimates, which are crucial to the interpretation of lead and lag as they relate to causal influence. These methods are reviewed in the next section.


6. Spectral methods and phase estimation

Cross- and auto-spectral densities of a time series, assuming zero-mean or detrended values, are defined as:

    \rho_{ij}(\tau) = E( y_i(t) y_j(t - \tau) ),    S_{ij}(\omega) = F( \rho_{ij}(\tau) )    (12)

Note that continuous, linear, raw correlation values are used in the above definition, as well as the continuous Fourier transform F. The bivariate coherency is defined as:

    C_{ij}(\omega) = S_{ij}(\omega) / \sqrt{ S_{ii}(\omega) S_{jj}(\omega) }    (13)

which consists of a complex numerator and a real-valued denominator. The coherence is the squared magnitude of the coherency:

    c_{ij}(\omega) = C_{ij}(\omega)^* C_{ij}(\omega)    (14)

Besides the various histogram and discrete (fast) Fourier transform methods available for the computation of coherence, AR methods may also be used, since they are also linear transforms, the Fourier transform of the delay operator being simply z^k = e^{-j 2\pi \omega \tau_S k}, where \tau_S is the sampling time. Plugging this into Equation (9) we obtain:

    X(j\omega) = ( \sum_{k=1}^{K} A_k e^{-j 2\pi \omega \tau_S k} ) X(j\omega) + B U(j\omega) + D W(j\omega),    Y(j\omega) = C X(j\omega)    (15)

    Y(j\omega) = C ( I - \sum_{k=1}^{K} A_k e^{-j 2\pi \omega \tau_S k} )^{-1} ( B U(j\omega) + D W(j\omega) )    (16)

In terms of an SVAR therefore (as opposed to a VAR), the mixing matrix C affects neither stability nor the dynamic response (i.e. the poles). The transfer functions from the ith innovation to the jth output are entries of the following matrix of functions:

    H(j\omega) = C ( I - \sum_{k=1}^{K} A_k e^{-j 2\pi \omega \tau_S k} )^{-1} D    (17)

The spectral matrix is simply (having already assumed independent unit Gaussian noise):

    S(j\omega) = H(j\omega) H(j\omega)^*    (18)

The coherency and the coherence then follow the definitions above.
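Equations (13)-(18) can be exercised directly. The sketch below (a minimal NumPy illustration with hypothetical coefficient values; here C = I and D = I, i.e. a plain VAR) builds H(jω) per Eq. (17), forms the spectral matrix per Eq. (18), and checks that the resulting coherence lies in [0, 1] as the Cauchy-Schwarz inequality requires:

```python
import numpy as np

A = np.array([[[0.5, 0.3],
               [0.0, 0.5]]])     # K=1 lag coefficient matrix (hypothetical, stable)
C = np.eye(2)                    # output mixing (C = I -> ordinary VAR)
D = np.eye(2)                    # innovations shaping matrix
tau = 1.0                        # sampling interval

def transfer(om):
    """Transfer function H(jw) from innovations to outputs, Eq. (17)."""
    z = sum(A[k] * np.exp(-1j * 2 * np.pi * om * tau * (k + 1)) for k in range(len(A)))
    return C @ np.linalg.inv(np.eye(2) - z) @ D

coh = []
for om in np.linspace(0.01, 0.49, 50):
    H = transfer(om)
    S = H @ H.conj().T                                   # spectral matrix, Eq. (18)
    coh.append(abs(S[0, 1])**2 / (S[0, 0].real * S[1, 1].real))  # Eqs. (13)-(14)
coh = np.array(coh)

assert np.all((coh >= 0) & (coh <= 1 + 1e-9))            # coherence is bounded
```

The spectral matrix is Hermitian positive semi-definite by construction, which is what bounds the coherence; a nonzero off-diagonal lag coefficient (here 0.3) is what makes it nonzero.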
The partial coherence considers the pair (i, j) of signals conditioned on all other signals, the (ordered) set of which we denote \overline{(i,j)}:


    S_{i,j|\overline{(i,j)}}(j\omega) = S_{(i,j),(i,j)} - S_{(i,j),\overline{(i,j)}} S_{\overline{(i,j)},\overline{(i,j)}}^{-1} S_{\overline{(i,j)},(i,j)}    (19)

where the subscripts refer to row/column subsets of the matrix S(j\omega). The partial spectrum, substituted into Equation (13), gives us the partial coherency C_{i,j|\overline{(i,j)}}(j\omega) and, correspondingly, the partial coherence c_{i,j|\overline{(i,j)}}(j\omega). These functions are symmetric and therefore cannot indicate the direction of interaction in the pair (i, j). Several alternatives have been proposed to account for this limitation. Kaminski and Blinowska (1991); Blinowska et al. (2004) proposed the following normalization of H(j\omega), which attempts to measure the relative magnitude of the transfer function from any innovations process to any output (which is equivalent to measuring the normalized strength of Granger causality) and is called the directed transfer function (DTF):

    \gamma_{ij}(j\omega) = H_{ij}(j\omega) / \sqrt{ \sum_k |H_{ik}(j\omega)|^2 },    \gamma^2_{ij}(j\omega) = |H_{ij}(j\omega)|^2 / \sum_k |H_{ik}(j\omega)|^2    (20)

A similar measure is called directed coherence (Baccalá et al., 1991), later elaborated into a method complementary to DTF, called partial directed coherence (PDC) (Baccalá and Sameshima, 2001; Sameshima and Baccalá, 1999), based on the inverse of H:

    \pi_{ij}(j\omega) = H^{-1}_{ij}(j\omega) / \sqrt{ \sum_k |H^{-1}_{ik}(j\omega)|^2 }

The objective of these coherency-like measures is to place a measure of directionality on the otherwise information-symmetric coherency.
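The two normalizations differ only in what they normalize and over which axis. A minimal sketch (hypothetical transfer-function values at a single frequency; in the standard Baccalá-Sameshima formulation the PDC normalization runs over a column of the inverse transfer function, which is the convention used here):

```python
import numpy as np

# Hypothetical transfer-function matrix H(jw) at one frequency (innovations -> outputs)
H = np.array([[1.0 + 0.2j, 0.4 - 0.1j],
              [0.6 + 0.3j, 1.1 + 0.0j]])

# DTF, Eq. (20): magnitude of H normalized over all inflows to output i (row-wise)
dtf2 = np.abs(H)**2 / np.sum(np.abs(H)**2, axis=1, keepdims=True)

# PDC: the same idea applied to the inverse transfer function, normalized
# over all outflows from innovation j (column-wise, standard convention)
G = np.linalg.inv(H)
pdc2 = np.abs(G)**2 / np.sum(np.abs(G)**2, axis=0, keepdims=True)

assert np.allclose(dtf2.sum(axis=1), 1.0)   # DTF rows sum to one
assert np.allclose(pdc2.sum(axis=0), 1.0)   # PDC columns sum to one
```

The unit row/column sums make explicit that both are relative strength measures, not independence statistics.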
While SVAR is not generally used as a basis for the autoregressive means of spectral and coherence estimation, or for DTF/PDC, it is done so in this paper for completeness (otherwise it is assumed that C = I). Granger's 1969 paper did consider a mixing matrix (indirectly, by adding non-diagonal terms to the zero-lag matrix), and suggested ignoring the role of that part of coherency which depends on mixing terms as non-informative 'instantaneous causality'. Note that the ambiguity of the role and identifiability of the full zero-lag matrix, as described herein, was fully known at the time and was one of the justifications given for separating sub-sampling-time dynamics. Another measure of directionality, proposed by Schreiber (2000), is a Shannon-entropy interpretation of Granger Causality, and will therefore be referred to as GC herein. The Shannon entropy, and conditional Shannon entropy, of a random process is related to its spectrum. The conditional-entropy formulation of Granger Causality for AR models in the multivariate case is (where (j) denotes, as above, all elements of the vector except j):

    H^{GC}_{j \to i|u} = H( y_{i,t+1} | y_{:,t:t-K}, u_{:,t:t-K} ) - H( y_{i,t+1} | y_{(j),t:t-K}, u_{:,t:t-K} )


    H^{GC}_{j \to i|u} = \log D_i - \log D_i^{(j)}    (21)

The Shannon entropy of a Gaussian random variable is the logarithm of its standard deviation plus a constant. Notice that in this paper the definition of Granger Causality is slightly different from the literature, in that it relates to the innovations process of a mixed-output SVAR system of closest rotation and not to a regular MVAR. The second term D_i^{(j)} is formed by computing a reduced SVAR system which omits the jth variable. Recently, Barrett et al. have proposed an extension of GC, based on prior work by Geweke (1982), from interaction among pairs of variables to groups of variables, termed multivariate Granger Causality (MVGC) (Barrett et al., 2010). The above definition is straightforwardly extensible to the group case, where I and J are subsets of 1..D, since the total entropy of independent variables is the sum of the individual entropies:

    H^{GC}_{J \to I|u} = \sum_{i \in I} ( \log D_i - \log D_i^{(J)} )    (22)

The Granger entropy can be calculated directly from the transfer function, using the Shannon-Hartley theorem:

    H^{GCH}_{j \to i} = - \sum_{\omega} \Delta\omega \ln( 1 - |H_{ij}(\omega)|^2 / S_{ii}(\omega) )    (23)

Finally, Nolte (Nolte et al., 2008) introduced a method called the Phase Slope Index (PSI), which evaluates bilateral causal interaction and is robust to mixing effects (i.e. zero-lag, observation, or innovations covariance matrices that depart from MVAR):

    PSI_{ij} = \mathrm{Im}( \sum_{\omega} C^*_{ij}(\omega) C_{ij}(\omega + d\omega) )    (24)

PSI, as a method, is based on the observation that pure mixing (that is to say, all effects stochastically equivalent to output mixing as outlined above) does not affect the imaginary part of the coherency C_{ij}, just as (equivalently) it does not affect the antisymmetric part of the cross-correlation of a signal pair. It does not place a measure on the phase relationship per se, but rather on the slope of the coherency phase weighted by the magnitude of the coherency.

7. Causal Structural Information

Granger causality estimation based on linear VAR modeling has been shown to be susceptible to mixed noise, in the presence of which it may produce false causality assignments (Nolte et al., 2010). In order to allow for accurate causality assignment in the presence of instantaneous interaction and aliasing, the Causal Structural Information (CSI) method and statistic for causality assignment is introduced below.

Consider the SVAR lower triangular form in (10) for a set of observations y. The information transfer from i to j may be obtained by first defining the index re-orderings:


    ij* \triangleq \{ i, j, \overline{(i,j)} \}
    i* \triangleq \{ i, \overline{(i,j)} \}

This means that we reorder the (identified) mixed-innovations system by placing the target time series first and the driver series second, followed by all the rest. The same ordering, minus the driver, is also useful. We define CSI as:

    CSI( j \to i | \overline{(i,j)} ) \triangleq \log( {}^{\triangleright ij*}D_{11} ) - \log( {}^{\triangleright i*}D_{11} )    (25)

    CSI( i, j | \overline{(i,j)} ) \triangleq CSI( j \to i | \overline{(i,j)} ) - CSI( i \to j | \overline{(i,j)} )    (26)

where D is the upper-triangular factor in each instance. This Granger Causality formulation requires the identification of three different SVAR models: one for the entire time-series vector, and one each for all elements except i and all elements except j. Via Cholesky decomposition, the logarithm of the top vertex of the triangle is proportional to the entropy rate (conditional information) of the innovations process of the target series given all other (past and present) information, including the innovations process. While this definition is clearly an interpretation of the core idea of Granger causality, it is, like DTF and PDC, not an independence statistic but a measure of (causal) information flow among elements of a time-series vector. Note the antisymmetry (by definition) of this information measure: CSI(i, j | \overline{(i,j)}) = -CSI(j, i | \overline{(i,j)}). Note also that CSI(j \to i | \overline{(i,j)}) and CSI(i \to j | \overline{(i,j)}) may very conceivably have the same sign: the various triangular forms used to derive this measure are purely for calculation purposes, and do not carry intrinsic meaning. As a matter of fact, other re-orderings and SVAR forms may be employed for convenient calculation as well.
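In the bivariate case the three required models reduce to one joint fit and two univariate fits, and the top of the Cholesky triangle is just the conditional standard deviation of one innovation given the other. A minimal sketch (NumPy, OLS-fitted VAR on a simulated strictly causal pair; the sign convention follows Eq. (25) as printed, so only magnitudes are compared here):

```python
import numpy as np

rng = np.random.default_rng(1)
T, K = 4000, 2
y = np.zeros((T, 2))
for t in range(1, T):                       # y0 drives y1 with a one-sample lag
    y[t, 0] = 0.6 * y[t-1, 0] + rng.normal()
    y[t, 1] = 0.6 * y[t-1, 1] + 0.9 * y[t-1, 0] + rng.normal()

def resid(Z, col):
    """OLS innovations of column `col` of Z given K lags of all columns of Z."""
    X = np.hstack([Z[K-k-1:len(Z)-k-1] for k in range(K)])
    Y = Z[K:, col]
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return Y - X @ beta

S = np.cov(np.vstack([resid(y, 0), resid(y, 1)]))   # joint innovations covariance

def cond_std(S, i, j):
    return np.sqrt(S[i, i] - S[i, j]**2 / S[j, j])  # top vertex of the triangle

# Eq. (25): full-model conditional term minus reduced-model (driver omitted) term
csi_0_to_1 = np.log(cond_std(S, 1, 0)) - np.log(np.std(resid(y[:, [1]], 0)))
csi_1_to_0 = np.log(cond_std(S, 0, 1)) - np.log(np.std(resid(y[:, [0]], 0)))

# Information transfer is far larger in the true direction (0 -> 1):
assert abs(csi_0_to_1) > 3 * abs(csi_1_to_0)
```

Omitting the driver inflates the target's innovation variance, which is what the difference of log terms picks up; in the reverse direction the two fits are nearly identical.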
In order to improve the explanatory power of the CSI statistic, the following normalization is proposed, mirroring that defined in Equation (5):

    F^{CSI}_{j \to i | \overline{(i,j)}} \triangleq CSI( i, j | \overline{(i,j)} ) / ( \log( {}^{\triangleright i*}D_{11} ) + \log( {}^{\triangleright j*}D_{11} ) + \zeta )    (27)

This normalization effectively measures the ratio of causal to non-causal information, where \zeta is a constant which depends on the number of dimensions and the quantization width, and is necessary to transform continuous entropy into discrete entropy.

8. Estimation of multivariate spectra and causality assignment

In Section 6 a series of causality assignment methods based on spectral decomposition of a multivariate signal were described. In this section spectral decomposition itself is discussed, and a novel means of performing it for unevenly sampled data is introduced and evaluated, along with the other methods, on a bivariate benchmark data set.


8.1. The cardinal transform of the autocorrelation

There are currently few commonly used methods for cross-power spectrum estimation (i.e. multivariate spectral power estimation), as opposed to univariate power spectrum estimation, and these methods average over repeated, or shifting, time windows and therefore require many data points. Furthermore, all commonly used multivariate spectral power estimation methods rely on synchronous, evenly spaced sampling, despite the fact that much of the available data is unevenly sampled, has missing values, and can be composed of relatively short sequences. A novel method is therefore presented below for multivariate spectral power estimation which can be computed from asynchronous data.

Returning to the definition of coherency as the Fourier transform of the cross-correlation, both of which are continuous transforms, we may extend the conceptual means of its estimation in the discrete sense as a regression problem (as a discrete Fourier transform, DFT) in the evenly sampled case as:

    \Omega_n = n / ( 2 \tau_0 (N - 1) ),    n = -\lfloor N/2 \rfloor ... \lfloor N/2 \rfloor    (28)

    \hat{C}_{ij}(\omega)|_{\omega=\Omega_n} = a_{ij,n} + j b_{ij,n}    (29)

    \rho_{ji}(-k\tau) = \rho_{ij}(k\tau) = E( x_i(t) x_j(t + k\tau) )    (30)

    \rho_{ij}(k\tau_0) \triangleq (1/(N - k)) \sum_{q=1}^{N-k} x_i(q) x_j(q + k)    (31)

    \{ a_{ij}, b_{ij} \} = \arg\min \sum_{k=-\lfloor N/2 \rfloor}^{\lfloor N/2 \rfloor} ( \rho_{ij}(k\tau_0) - \sum_n a_{ij,n} \cos(2\pi\Omega_n \tau_0 k) - b_{ij,n} \sin(2\pi\Omega_n \tau_0 k) )^2    (32)

where \tau_0 is the sampling interval. Note that for an odd number of points the regression above is actually a well-determined set of equations, corresponding to the two-sided DFT.
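The determinacy claim behind Eq. (32) can be checked numerically. The sketch below (NumPy, a hypothetical white-noise record) uses only the cosine half of the regression, since an autocorrelation is even and the sine coefficients vanish; the frequency grid is chosen over the lag span used, in the spirit of Eq. (28). With as many cosine basis functions as lags, the lag-domain regression fits the sample autocorrelation exactly, i.e. it reproduces a (two-sided) DFT:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=201)
N = len(x)
M = N // 2
tau0 = 1.0

# Sample autocorrelation at lags 0..M, Eq. (31)
rho = np.array([x[:N-k] @ x[k:] / (N - k) for k in range(M + 1)])

# Cosine design matrix: M+1 frequencies for M+1 lags -> a determined system
Omega = np.arange(M + 1) / (2 * tau0 * M)          # frequency grid over the lag span
lags = np.arange(M + 1) * tau0
Phi = np.cos(2 * np.pi * np.outer(lags, Omega))
a, *_ = np.linalg.lstsq(Phi, rho, rcond=None)

assert np.allclose(Phi @ a, rho, atol=1e-6)        # exact fit: residual is numerical zero
```

The cardinal variant of Eqs. (33)-(34) replaces the cos/sin columns with cosc/sinc evaluated at arbitrary lag differences t_p - t_q, so the same regression applies to unevenly sampled data.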
Note also that by replacing the expectation with the geometric mean, the above equation can also be written (with a slight change in weighting at individual lags) as:

    \{ a_{ij}, b_{ij} \} = \arg\min \sum_{p,q \in 1..N} ( x_{i,p} x_{j,q} - \sum_n a_{ij,n} \cos(2\pi\Omega_n (t_{i,p} - t_{j,q})) - b_{ij,n} \sin(2\pi\Omega_n (t_{i,p} - t_{j,q})) )^2    (33)

The above equation holds even for time series sampled at unequal (but overlapping) times (x_i, t_i) and (x_j, t_j), as long as the frequency basis definition is adjusted (for example \tau_0 = 1). It represents a discrete, finite approximation of the continuous, infinite autocorrelation function of an infinitely long random process: it is a regression on the outer product of the vectors x_i and x_j. Since autocorrelations of finite-memory systems tend


to fall off to zero with increasing lag magnitude, a novel coherency estimate is proposed, based on the cardinal sine and cosine functions (which also decay) as a compact basis:

    \hat{C}_{ij}(\omega) = \sum_n a_{ij,n} C(\Omega_n) + j b_{ij,n} S(\Omega_n)    (34)

    \{ a_{ij}, b_{ij} \} = \arg\min \sum_{p,q \in 1..N} ( x_{i,p} x_{j,q} - \sum_n a_{ij,n} \mathrm{cosc}(2\pi\Omega_n (t_{i,p} - t_{j,q})) - b_{ij,n} \mathrm{sinc}(2\pi\Omega_n (t_{i,p} - t_{j,q})) )^2

where the sine cardinal is defined as sinc(x) = sin(\pi x)/x, and its Fourier transform is S(j\omega) = 1 for |j\omega| < 1 and S(j\omega) = 0 otherwise; the Fourier transform of the cosine cardinal can likewise be written as C(j\omega) = j\omega S(j\omega). Although in principle we could choose any complete basis as a means of Fourier transform estimation, the cardinal transform preserves the odd-even function structure of the standard trigonometric pair. Computationally, this means that for autocorrelations, which are real-valued and even, only sinc needs to be calculated and used, while for cross-correlations both functions are needed. As linear mixtures of independent signals have only symmetric cross-correlations, any nonzero values of the cosc coefficients indicate the presence of dynamic interaction. Note that the Fast Fourier Transform earns its moniker thanks to the orthogonality of sin and cos, which allows us to avoid a matrix inversion. However, their orthogonality holds true only for infinite support, and slight correlations are found for finite windows; in practice this effect requires further computation (windowing) to counteract. The cardinal basis is not orthogonal, requires a full regression, and may have demanding memory requirements. For moderately sized data this is not problematic, and implementation details will be discussed elsewhere.

8.2. Robustness evaluation based on the NOISE dataset

A dataset named NOISE, intended as a benchmark for the bivariate case, was introduced in the preceding NIPS workshop on causality (Nolte et al., 2010) and can be found online at www.causality.inf.ethz.ch/repository.php, along with the code that generated it. It was awarded the best-dataset prize in the previous NIPS causality workshop and challenge (Guyon et al., 2010). For further discussion of the Causality Workbench and current dataset usage see Guyon (2011). NOISE is created by the summation of the output of a strictly causal VAR DGP and a non-causal SVAR DGP which consists of mixed colored noise:

    y_{C,i} = \sum_{k=1}^{K} [ a_{11} a_{12} ; 0 a_{22} ]_{C,k} \, y_{C,i-k} + w_{C,i}    (35)

    x_{N,i} = \sum_{k=1}^{K} [ a_{11} 0 ; 0 a_{22} ]_{N,k} \, x_{N,i-k} + w_{N,i}    (36)


    y_{N,i} = B x_{N,i}    (37)

    y = (1 - |\beta|) y_N + |\beta| y_C \, \|y_N\|_F / \|y_C\|_F    (38)

The two sub-systems are pictured graphically as systems A and B in Figure 1. If \beta < 0, the AR matrices that create y_C are transposed (meaning that y_{1C} causes y_{2C} instead of the opposite). The coefficient \beta is represented in Nolte et al. (2010) by \gamma, where \beta = \mathrm{sgn}(\gamma)(1 - |\gamma|). All coefficients are generated as independent Gaussian random variables of unit variance, and unstable systems are discarded. While both the causal and the noise-generating systems have the same order, note that generating their sum requires an infinite-order SVAR DGP (it is not stochastically equivalent to any SVAR DGP but is instead a SVARMA DGP, having both poles and zeros). Nevertheless, it is an interesting benchmark, since the exact parameters are not fully recoverable via the commonly used VAR modeling procedure, and because the causality interpretation is fairly clear: the sum of a strictly causal DGP and a stochastic non-causal DGP should retain the causality of the former.

In this study the same DGPs were used as in NOISE but, as one of the current aims is to study the influence of sample size on the reliability of causality assignment, signals of 100, 500, 1000, and 5000 points were generated (as opposed to the original 6000). This is the dataset referred to as PAIRS below, which differs only in the number of samples per time series. For each evaluation, 500 time series were simulated, with the order of each system uniformly distributed from 1 to 10.
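A minimal generator for this kind of mixture can be sketched as follows (NumPy; fixed stable coefficients and K = 1 for brevity, whereas the benchmark draws coefficients at random and discards unstable systems):

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate(T, A1, rng):
    """VAR(1) simulation: y_i = A1 y_{i-1} + w_i with unit Gaussian innovations."""
    y = np.zeros((T, 2))
    for t in range(1, T):
        y[t] = A1 @ y[t-1] + rng.normal(size=2)
    return y

A_c = np.array([[0.5, 0.4], [0.0, 0.5]])   # strictly causal system, Eq. (35)
A_n = np.array([[0.5, 0.0], [0.0, 0.3]])   # independent colored-noise system, Eq. (36)
B = rng.normal(size=(2, 2))                # instantaneous mixing of the noise, Eq. (37)

y_C = simulate(5000, A_c, rng)
y_N = simulate(5000, A_n, rng) @ B.T

def mix(beta, y_C, y_N):
    """Eq. (38): y = (1-|beta|) y_N + |beta| y_C ||y_N||_F / ||y_C||_F."""
    s = np.linalg.norm(y_N) / np.linalg.norm(y_C)
    return (1 - abs(beta)) * y_N + abs(beta) * y_C * s

y = mix(0.5, y_C, y_N)
assert y.shape == (5000, 2)
assert np.allclose(mix(0.0, y_C, y_N), y_N)   # beta = 0: pure mixed noise
```

At \beta = 1 the output is the (rescaled) causal component alone, and at \beta = 0 it is pure mixed colored noise; intermediate values interpolate with matched Frobenius norms, which is what makes the benchmark difficulty controllable.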
The following methods were evaluated:

∙ PSI (Ψ), using Welch's method, with segment and epoch lengths equal and set to \lceil \sqrt{N} \rceil; otherwise the same as Nolte et al. (2010).

∙ Directed transfer function (DTF), estimated using an automatic model-order selection criterion (BIC, Bayesian Information Criterion) with a maximum model order of 20. DTF has been shown to be equivalent to GC for linear AR models (Kaminski et al., 2001), and therefore GC itself is not shown. The covariance matrix of the residuals is also included in the estimate of the transfer function; the same holds for all methods described below.

∙ Partial directed coherence (PDC). As described in the previous section, it is similar to DTF except that it operates on the signal-to-innovations (i.e. inverse) transfer function.

∙ Causal Structural Information (CSI). As described above, this is based on the triangular innovations form equivalent to the estimated SVAR (of which there are two possible forms in the bivariate case), and takes into account instantaneous interaction / innovations-process covariance.


All methods were statistically evaluated for robustness and generality by performing a 5-fold jackknife, which gives both a mean and a standard deviation estimate for each method and each simulation. All statistics reported below are means normalized by standard deviation (from the jackknife). For all methods the final output could be -1, 0, or 1, corresponding to causality assignment 1 → 2, no assignment, and causality 2 → 1. A true positive (TP) was the rate of correct causality assignment, while a false positive (FP) was the rate of incorrect causality assignment (type III error), such that TP + FP + NA = 1, where NA stands for the rate of no (or neutral) assignment. The TP and FP rates can be co-modulated by increasing/decreasing the threshold on the absolute value of the mean/std statistic, under which no causality assignment is made:

    STAT = rawSTAT / std(rawSTAT),    rawSTAT = PSI, DTF, PDC, ...
    c = sign(STAT) if |STAT| > THRESH, 0 otherwise

Table 1 shows results for overall accuracy and controlled true-positive rate for the non-mixed colored-noise case (meaning that the matrix B above is diagonal). In Tables 1 and 2, methods are ordered according to the mean TP rate over time-series lengths (highest on top).

Table 1: Unmixed colored noise PAIRS

              Max. Accuracy             TP, FP < 0.10
         100   500   1000  5000    100   500   1000  5000
    Ψ    0.62  0.73  0.83  0.88    0.25  0.56  0.75  0.85
    DTF  0.58  0.79  0.82  0.88    0.18  0.58  0.72  0.86
    CSI  0.62  0.72  0.79  0.89    0.23  0.53  0.66  0.88
    Ψ_C  0.57  0.68  0.81  0.88    0.19  0.29  0.70  0.87
    PDC  0.64  0.67  0.75  0.78    0.23  0.33  0.48  0.57

In Table 2 we see results for PAIRS in which the noise-mixing matrix B is not strictly diagonal.

Table 2: Mixed colored noise PAIRS

              Max. Accuracy             TP, FP < 0.10
         N=100 500   1000  5000    N=100 500   1000  5000
    Ψ_C  0.64  0.74  0.81  0.83    0.31  0.49  0.64  0.73
    Ψ    0.66  0.76  0.78  0.81    0.25  0.59  0.61  0.71
    CSI  0.63  0.77  0.79  0.80    0.27  0.62  0.59  0.66
    PDC  0.64  0.71  0.69  0.66    0.24  0.30  0.29  0.24
    DTF  0.55  0.61  0.66  0.66    0.11  0.10  0.09  0.12
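The thresholded decision rule and the TP/FP/NA bookkeeping behind these tables can be sketched compactly (NumPy; the replicate counts, effect size, and threshold below are illustrative stand-ins, not the benchmark's actual values):

```python
import numpy as np

def decide(raw_stats, thresh):
    """Normalize a causality statistic by its jackknife std and threshold it.
    raw_stats: jackknife replicates of rawSTAT for one time-series pair.
    Returns +/-1 (direction call) or 0 (no assignment)."""
    stat = np.mean(raw_stats) / np.std(raw_stats)
    return int(np.sign(stat)) if abs(stat) > thresh else 0

rng = np.random.default_rng(4)
truth = rng.choice([-1, 1], size=200)                 # simulated true directions
# replicates = directional signal plus estimation noise (hypothetical effect size)
reps = truth[:, None] * 0.5 + rng.normal(size=(200, 5))
calls = np.array([decide(r, thresh=1.0) for r in reps])

tp = np.mean(calls == truth)                          # correct assignment
fp = np.mean((calls != 0) & (calls != truth))         # wrong direction (type III)
na = np.mean(calls == 0)                              # no assignment
assert abs(tp + fp + na - 1.0) < 1e-12                # the three rates partition 1
```

Raising `thresh` trades TP for FP by moving calls into the NA bin, which is exactly the co-modulation used to draw the TP-vs-FP curves in Figures 2 and 3.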


(a) Unmixed colored noise    (b) Mixed colored noise

Figure 2: PSI vs. DTF. Scatter plots of β vs. STAT (to the left of each panel) and TP vs. FP curves for different time-series lengths (100, 500, 1000, and 5000) (right). (a) Colored unmixed additive noise. (b) Colored mixed additive noise. DTF is equivalent to Granger causality for linear systems. All STAT values are jackknife means normalized by standard deviation.


(a) Unmixed colored noise    (b) Mixed colored noise

Figure 3: PSI vs. CSI. Scatter plots of β vs. STAT (to the left of each panel) and TP vs. FP curves for different time-series lengths (right). (a) Unmixed additive noise. (b) Mixed additive noise.


As we can see in both Figure 2 and Table 1, all methods are almost equally robust to unmixed colored additive noise (except PDC). However, while the addition of mixed colored noise induces only a mild gap in maximum accuracy, it creates a large gap in TP/FP rates. Note the dramatic drop-off of the TP rate of the VAR/SVAR-based methods PDC and DTF. Figure 3 shows this most clearly, through the wide scatter of STAT outputs for DTF around β = 0 (that is, with no actual causality in the time series) and the corresponding fall-off of TP vs. FP rates. Note also that the PSI methods still allow a fairly reasonable TP-rate determination at a low FP rate of 10%, even at 100 points per time series, while the CSI method was also robust to the addition of colored mixed noise, not showing any significant difference with respect to PSI except a higher FP rate for longer time series (N = 5000). The cardinal-transform variant Ψ_C was near PSI in overall accuracy. In conclusion, DTF (or weak Granger causality) and PDC are not robust with respect to additive mixed colored noise, although they perform similarly to PSI and CSI for independent colored noise.¹

9. Conditional causality assignment

In multivariate time-series analysis we are often concerned with inference of causal relationships among more than two variables, in which the role of a potential common cause must be accounted for, analogously to vanishing partial correlation in the static-data case.
For this reason the PAIRS data set was extended into a set called TRIPLES, in which the degree of common-driver influence versus direct coupling was controlled. The TRIPLES DGP is similar to PAIRS, in that the additive noise is mixed colored noise (in 3 dimensions), but in this case another variable x_3 may drive the pair x_1, x_2 independently of each other, also with random coefficients (but with either one set to 1/10 of the other, at random). That is to say, the signal is itself a mixture of one in which there is a direct one-sided causal link between x_1 and x_2, as in PAIRS, and one in which they are actually independent but commonly driven, according to a parameter χ which at 0 gives a purely commonly driven pair and at 1 a purely causal one.

    \beta < 0:    y_{C,i} = \sum_{k=1}^{K} [ a_{11} a_{12} 0 ; 0 a_{22} 0 ; 0 0 a_{33} ]_{C,k} \, y_{C,i-k} + w_{C,i}

1. Correlation and rank-correlation analysis was performed (for N = 5000) to shed light on the reason for the discrepancy between PSI and CSI. The linear correlation between rawSTAT and STAT was 0.87 and 0.89 for PSI and CSI, respectively. No influence of the model order K of the simulated system was seen on the error of either PSI or CSI, where error is estimated as the difference in rank: rankerr(STAT) = |rank(β) − rank(STAT)|. There were, however, significant correlations between rank(|β|) and rankerr(STAT): −0.13 for PSI and −0.27 for CSI. Note that, as expected, standard Granger causality (GC) performed the same as DTF (TP=0.116 for FP


Robust Statistics for Causality

Figure 4: Diagram of TRIPLES dataset with common driver


\beta > 0: \quad y_{C,i} = \sum_{k=1}^{K} \begin{bmatrix} a_{11} & 0 & 0 \\ a_{12} & a_{22} & 0 \\ 0 & 0 & a_{33} \end{bmatrix}_{C,k} y_{C,i-k} + w_{C,i} \qquad (39)

x_{N,i} = \sum_{k=1}^{K} \begin{bmatrix} a_{11} & 0 & 0 \\ 0 & a_{22} & 0 \\ 0 & 0 & a_{22} \end{bmatrix}_{N,k} x_{N,i-k} + w_{N,i}, \qquad y_{N,i} = B x_{N,i} \qquad (40)

x_{D,i} = \sum_{k=1}^{K} \begin{bmatrix} a_{11} & 0 & a_{13} \\ 0 & a_{22} & a_{23} \\ 0 & 0 & a_{22} \end{bmatrix}_{D,k} x_{D,i-k} + w_{D,i}

y_{MC} = (1-|\beta|)\, y_N + |\beta|\, \frac{\|y_N\|_F}{\|y_C\|_F}\, y_C \qquad (41)

y_{DC} = (1-\chi)\, y_{MC} + \chi\, \frac{\|y_{MC}\|_F}{\|y_D\|_F}\, y_D \qquad (42)

Table 3, similar to the tables in the preceding section, shows results for all the usual methods, as well as for PSIpartial (Ψp), which is PSI calculated on the partial coherence as defined above and computed from Welch (cross-spectral) estimators, in the case of mixed noise and a common driver.

Table 3: TRIPLES: commonly driven, additive mixed colored noise

             Max. Accuracy                TP, FP < 0.10
         100    500    1000   5000    100    500    1000   5000
Ψp       0.53   0.61   0.71   0.75    0.12   0.31   0.49   0.56
Ψ        0.54   0.60   0.70   0.72    0.10   0.25   0.40   0.52
CSI      0.51   0.60   0.69   0.76    0.09   0.27   0.38   0.45
PDC      0.55   0.54   0.60   0.58    0.13   0.12   0.16   0.13
DTF      0.51   0.56   0.59   0.61    0.12   0.09   0.09   0.11

Notice that the TP rates are lower for all methods with respect to Table 2, which represents the mixed-noise situation without any common driver.

10. Discussion

In a recent talk, Emanuel Parzen (Parzen, 2004) proposed, both in hindsight and for future consideration, that the aim of statistics consists in an 'answer machine', i.e. a more intelligent, automatic and comprehensive version of Fisher's almanac, which currently consists in a plenitude of chapters and sections related to different types of hypotheses


and assumption sets meant to model, insofar as possible, the ever-expanding variety of data available. These categories and sub-categories are not always distinct, and furthermore there are competing general approaches to the same problems (e.g. Bayesian vs. frequentist). Is an 'answer machine' realistic in terms of time-series causality, prerequisites for which are found throughout this almanac, and which has developed in parallel in different disciplines?

This work began by discussing Granger causality in abstract terms, pointing out the implausibility of finding a general method of causal discovery, since that depends on the general learning and time-series prediction problems, which are incomputable. However, if any consistent pattern can be found mapping the history of one time series variable to the current state of another (using non-parametric tests), there is sufficient evidence of causal interaction and the null hypothesis is rejected. Such a determination still does not address the direction of interaction or the relative strength of causal influence, which may require a complete model of the DGP. This study, like many others, relied on the rather strong assumption of stationary linear Gaussian DGPs, but otherwise made weak assumptions on model order, sampling and observation noise. Are there, instead, more general assumptions we can use?
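Before turning to these alternatives, the basic linear Granger construction referred to throughout can be sketched as a comparison of nested least-squares autoregressions. This is the generic textbook construction, not the CSI or PSI estimator of this paper; the lag order and the log-ratio statistic are illustrative choices.

```python
import numpy as np

def granger_logratio(x, y, K=2):
    """log(RSS_restricted / RSS_full) for predicting y[t] from its own past
    (restricted) vs. its own past plus x's past (full), at lag order K.
    Larger values mean stronger linear evidence that x Granger-causes y."""
    n = len(y)
    target = y[K:]
    # Lag-j columns, j = 1..K, aligned with target index t = K..n-1.
    own = np.column_stack([y[K - j:n - j] for j in range(1, K + 1)])
    full = np.column_stack([own] + [x[K - j:n - j] for j in range(1, K + 1)])

    def rss(Z):
        beta, *_ = np.linalg.lstsq(Z, target, rcond=None)
        resid = target - Z @ beta
        return resid @ resid

    return np.log(rss(own) / rss(full))
```

On data where x drives y (e.g. y[t] = 0.5 y[t-1] + 0.4 x[t-1] + noise with x white), this statistic is clearly positive in the x-to-y direction and near zero in the reverse direction.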
The following is a list of competing approaches, in increasing order of (subjectively judged) strength of the underlying assumptions:

∙ Non-parametric tests of conditional probability for Granger non-causality rejection. These directly compare the conditional distributions P(y_{1,j} | y_{1,j-1..1}, y_{2,j-1..1}, u_{j-1..1}) and P(y_{1,j} | y_{1,j-1..1}, u_{j-1..1}) to detect a possible statistically significant difference. Proposed approaches (see the chapter in this volume by Moneta et al. (2011) for a detailed overview and tabulated robustness comparison) include product kernel density with kernel smoothing (Chlaß and Moneta, 2010), made robust by bootstrapping, with density distances such as the Hellinger (Su and White, 2008) or Euclidean (Szekely and Rizzo, 2004) distances, or completely nonparametric difference tests such as Cramer-von Mises or Kolmogorov-Smirnov. A potential pitfall of nonparametric approaches is their loss of power for higher dimensionality of the space over which the probabilities are estimated, i.e. the curse of dimensionality (Yatchew, 1998). This can occur if the lag order K that needs to be considered is high, if the system memory is long, or if the number of other variables over which GC must be conditioned (u_{j-1..1}) is high. In the case of mixed noise, strong GC estimation would require accounting for all observed variables (which in neuroscience can number in the hundreds). While non-parametric non-causality rejection is a very useful tool (and could be valid even if the lag considered in the analysis is much smaller than the true lag K), in practice we would require robust estimates of causal direction and relative strength of different factors, which implies a complete accounting of all relevant factors.
As was already discussed, in many cases Granger non-causality is likely to be rejected in both directions: it is useful to find the dominant one.

∙ General parametric or semi-parametric (black-box) predictive modeling subject to GC interpretation, which can provide directionality, factor analysis and interpretation of information flow. A large body of literature exists on neural network time series modeling (in this context see White (2006)), complemented in recent years by support vector regression and Bayesian processes. The major concern with black-box predictive approaches is model validation: does the fact that a given model features a high cross-validation score automatically imply the implausibility of another predictive model with an equal CV score that would lead to different conclusions about causal structure? A reasonable compromise between nonlinearity and DGP class restriction can be seen in Chu and Glymour (2008) and Ryali et al. (2010), in which the VAR model is augmented by additive nonlinear functions of the regressed variable and exogenous input. Robustness to noise, sample size influence, and accuracy of effect strength and direction determination are open questions.

∙ Linear dynamic models which incorporate (and often require) non-Gaussianity in the innovations process, such as ICA and TDSEP (Ziehe and Mueller, 1998). See Moneta et al. (2011) in this volume for a description of causality inference using ICA and causal modeling of innovation processes (i.e. independent components). Robustness under permutation is necessary for a principled accounting of dynamic interaction and partition of innovations-process entropy. Note that many ICA variants assume that at most one of the innovations processes is Gaussian, a strong assumption which requires a posteriori checks.
To be elucidated is the robustness to filtering and additive noise.

∙ Non-stationary Gaussian linear models. In neuroscience, non-stationarity is important (the brain may change state abruptly, respond to stimuli, have transient pathological episodes, etc.). Furthermore, accounting for non-stochastic exogenous inputs needs further consideration. Encouragingly, the current study shows that even in the presence of complex confounders such as common driving signals and co-variate noise, segments as small as 100 points can yield accurate causality estimation, such that changes in longer time series can be adaptively tracked. Note that in establishing statistical significance we must take into account signal bandwidth: up-sampling the same process would arbitrarily increase the number of samples but not the information contained in the signal. See Appendix A for a proposal on non-parametric bandwidth estimation.

∙ Linear Gaussian dynamic models: in this work we have considered SVAR but not wider classes of linear DGPs such as VARMA and heteroskedastic (GARCH) models. In comparing PSI and CSI, note that the overall accuracy of directionality assignment was virtually identical, but PSI correlated slightly better with effect size. While CSI made slightly more errors at low strengths of 'causality', PSI made slightly more errors at high strengths. Nevertheless, PSI was most robust to (colored, mixed) noise and hidden driving/conditioning signals (tabulated significance results are provided in Appendix A).
Jackknifed, normalized estimates can help establish causality at low strength levels, although a large raw PSI statistic value may also suffice. A potential problem with the jackknife (or bootstrap)


procedure is the strong stationarity assumption which allows segmentation and rearrangement of the data.

Although AR modeling has commonly been used to model interaction in time series and served as a basis for (linear) Granger causality modeling (Blinowska et al., 2004; Baccalá and Sameshima, 2001), robustness to mixed noise remained a problem, which the spectral method PSI was meant to resolve (Nolte et al., 2008). While 'phase', if structured, already implies prediction, precedence and mutual information among time series elements, it was not clear until now how SVAR methods would reconcile with PSI performance. This prompted the introduction in this article of the causal AR method (CSI), which takes into account 'instantaneous' causality. A prior study had shown that strong Granger causality is preserved under addition of colored noise, as opposed to weak (i.e. strictly time-ordered) causality (Solo, 2006). This is consistent with the results obtained herein. The CSI method, measuring strong Granger causality, was in fact robust with respect to a type of noise not studied in Solo (2006), namely mixed colored noise; other VAR-based methods and (weak) Granger causality measures were not. While an SVAR DGP observed under additive colored noise is a VARMA process (the case of the PAIRS and TRIPLES datasets), SVAR modeling did not result in a severe loss of accuracy.
AR processes of longer lags can approximate VARMA processes by using higher orders and more parameters, even if doing so increases exposure to over-fitting and may have resulted in a small number of outliers. Future work must be undertaken to ascertain what robustness and specificity advantages result from VARMA modeling, and whether it is worth doing so considering the increased computational load. Among the common 'defects' of real-life data are missing or outlier samples, uneven sampling in time, or time stamps of two time series to be compared that are unrelated though overlapping: it is for these cases that the method PSIcardinal was developed, and it was shown to be practically equal in numerical performance to the Welch-estimate-based PSI method (though it is slower computationally). Both PSI estimates were robust to common driver influence even when based not on partial but on direct coherency, because it is the asymmetry in the influence of the driver on phase that is measured, rather than its overall strength. While 2-way interaction with conditioning was considered, future work must organize multivariate signals using directed graphs, as in DAG-type static causal inference. Although only one conditioning signal was analysed in this paper, the methods apply to higher numbers of background variables.
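For reference, the PSI statistic discussed throughout can be sketched from its published definition (Nolte et al., 2008) as the imaginary part of the coherency slope across neighboring frequencies. The segmentation scheme, window, and sign convention below are simplifying assumptions rather than the exact estimator used in the experiments.

```python
import numpy as np

def psi(x, y, seg_len=256):
    """Phase Slope Index between two 1-D signals: sum over frequency bins of
    Im(conj(C(f)) * C(f + df)), where C is the complex coherency estimated by
    averaging cross- and auto-spectra over non-overlapping Hann-windowed
    segments (a crude Welch-style estimate)."""
    n_seg = len(x) // seg_len
    segs_x = np.reshape(x[:n_seg * seg_len], (n_seg, seg_len))
    segs_y = np.reshape(y[:n_seg * seg_len], (n_seg, seg_len))
    win = np.hanning(seg_len)
    X = np.fft.rfft(segs_x * win, axis=1)
    Y = np.fft.rfft(segs_y * win, axis=1)
    Sxy = np.mean(X * np.conj(Y), axis=0)
    Sxx = np.mean(np.abs(X) ** 2, axis=0)
    Syy = np.mean(np.abs(Y) ** 2, axis=0)
    C = Sxy / np.sqrt(Sxx * Syy)  # complex coherency
    return np.imag(np.sum(np.conj(C[:-1]) * C[1:]))
```

With this convention, a signal y that is a slightly delayed copy of x yields a positive value of psi(x, y), and the statistic is antisymmetric, so psi(y, x) is negative; the overall sign convention depends on how the cross-spectrum is defined.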
Directed Transfer Function and Partial Directed Coherence did not perform as well under additive colored noise, but their formulation does address a theoretically important question, namely the partition of strength of influence among various candidate causes of an observation; CSI also proposes an index for this important purpose. Whether the assumptions about stationarity or any other data properties discussed are warranted may be checked by performing appropriate a posteriori tests. If these tests justify the prior assumptions and a correspondingly significant causal effect is observed, we can assign statistical confidence intervals to the causality structure of the system under study. The 'almanac' chapter on time-series causality is rich, and new alternatives are emerging. For the entire corpus of time-series causality statistics to become an 'answer machine', however,


it is suggested that a principled bottom-up investigation be undertaken, beginning with the simple SVAR form studied in this paper, and that all proposed criteria be quantified: type I, II and III errors, accurate determination of causality strength and direction, and robustness in the presence of conditioning variables and colored mixed noise.

Acknowledgments

I would like to thank Guido Nolte for his insightful feedback and informative discussions.

References

H. Akaike. On the use of a linear model for the identification of feedback systems. Annals of the Institute of Statistical Mathematics, 20(1):425–439, 1968.

L. A. Baccalá and K. Sameshima. Partial directed coherence: a new concept in neural structure determination. Biological Cybernetics, 84(6):463–474, 2001.

L. A. Baccalá, M. A. Nicolelis, C. H. Yu, and M. Oshiro. Structural analysis of neural circuits using the theory of directed graphs. Computers and Biomedical Research, an International Journal, 24:7–28, February 1991. URL http://www.ncbi.nlm.nih.gov/pubmed/2004525.

A. B. Barrett, L. Barnett, and A. K. Seth. Multivariate Granger causality and generalized variance. Physical Review E, 81(4):041907, April 2010. doi: 10.1103/PhysRevE.81.041907. URL http://link.aps.org/doi/10.1103/PhysRevE.81.041907.

B. S. Bernanke, J. Boivin, and P. Eliasz. Measuring the effects of monetary policy: a factor-augmented vector autoregressive (FAVAR) approach. Quarterly Journal of Economics, 120(1):387–422, 2005.

K. J. Blinowska, R. Kuś, and M. Kamiński. Granger causality and information flow in multivariate processes. Physical Review E, 70(5):050902, November 2004. doi: 10.1103/PhysRevE.70.050902.
URL http://link.aps.org/doi/10.1103/PhysRevE.70.050902.

P. E. Caines. Weak and strong feedback free processes. IEEE Transactions on Automatic Control, 21:737–739, 1976.

N. Chlaß and A. Moneta. Can graphical causal inference be extended to nonlinear settings? EPSA Epistemology and Methodology of Science, pages 63–72, 2010.

T. Chu and C. Glymour. Search for additive nonlinear time series causal models. The Journal of Machine Learning Research, 9:967–991, 2008.


R. A. Fisher. Statistical Methods for Research Workers. Macmillan Pub Co, 1925. ISBN 0028447301.

W. Gersch and G. V. Goddard. Epileptic focus location: spectral analysis method. Science, 169(946):701–702, August 1970. ISSN 0036-8075. URL http://www.ncbi.nlm.nih.gov/pubmed/5429908. PMID: 5429908.

J. Geweke. Measurement of linear dependence and feedback between multiple time series. Journal of the American Statistical Association, 77:304–313, 1982.

G. Gigerenzer, Z. Swijtink, T. Porter, L. Daston, J. Beatty, and L. Kruger. The Empire of Chance: How Probability Changed Science and Everyday Life. Cambridge University Press, October 1990. ISBN 052139838X.

C. W. J. Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica, 37(3):424–438, August 1969. ISSN 00129682. URL http://www.jstor.org/stable/1912791.

I. Guyon. Time series analysis with the Causality Workbench. Journal of Machine Learning Research, Workshop and Conference Proceedings, XX. Time Series Causality:XX–XX, 2011.

I. Guyon and A. Elisseeff. An introduction to variable and feature selection. JMLR, 3:1157–1182, March 2003.

I. Guyon, C. Aliferis, G. Cooper, A. Elisseeff, J.-P. Pellet, P. Spirtes, and A. Statnikov. Design and analysis of the causation and prediction challenge. WCCI 2008 workshop on causality, Hong Kong, June 3–4, 2008. Journal of Machine Learning Research Workshop and Conference Proceedings, 3:1–33, 2008.

I. Guyon, D. Janzing, and B. Schölkopf. Causality: objectives and assessment. JMLR W&CP, 6:1–38, 2010.

P. O. Hoyer, D. Janzing, J. M. Mooij, J. Peters, and B. Schölkopf.
Nonlinear causal discovery with additive noise models. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 689–696, 2009.

M. Kaminski, M. Ding, W. A. Truccolo, and S. L. Bressler. Evaluating causal relations in neural systems: Granger causality, directed transfer function and statistical assessment of significance. Biological Cybernetics, 85(2):145–157, August 2001. ISSN 0340-1200. URL http://www.ncbi.nlm.nih.gov/pubmed/11508777. PMID: 11508777.

M. J. Kaminski and K. J. Blinowska. A new method of the description of the information flow in the brain structures. Biological Cybernetics, 65(3):203–210, 1991. ISSN 0340-1200. doi: 10.1007/BF00198091. URL http://dblp.uni-trier.de/rec/bibtex/journals/bc/KaminskiB91.


A. N. Kolmogorov and A. N. Shiryayev. Selected Works of A. N. Kolmogorov: Probability Theory and Mathematical Statistics. Springer, 1992. ISBN 9789027727978.

T. C. Koopmans. Statistical Inference in Dynamic Economic Models. Cowles Commission Monograph No. 10. New York: John Wiley & Sons, 1950.

G. Lacerda, P. Spirtes, J. Ramsey, and P. O. Hoyer. Discovering cyclic causal models by independent component analysis. In Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence (UAI-2008), Helsinki, Finland, 2008.

A. D. Lanterman. Schwarz, Wallace and Rissanen: intertwining themes in theories of model selection. International Statistical Review, 69(2):185–212, 2001.

M. Li and P. M. B. Vitanyi. An Introduction to Kolmogorov Complexity and Its Applications, 2nd edition. Springer-Verlag, 1997.

A. Moneta, D. Entner, P. O. Hoyer, and A. Coad. Causal inference by independent component analysis with applications to micro- and macroeconomic data. Jena Economic Research Papers, 2010:031, 2010.

A. Moneta, N. Chlaß, D. Entner, and P. Hoyer. Causal search in structural vector autoregression models. Journal of Machine Learning Research, Workshop and Conference Proceedings, XX. Time Series Causality:XX–XX, 2011.

G. Nolte, A. Ziehe, V. V. Nikulin, A. Schlögl, N. Krämer, T. Brismar, and K.-R. Müller. Robustly estimating the flow direction of information in complex physical systems. Physical Review Letters, 100(23):234101, 2008.

G. Nolte, A. Ziehe, N. Kraemer, F. Popescu, and K.-R. Müller.
Comparison of Granger causality and phase slope index. Journal of Machine Learning Research Workshop & Conference Proceedings, Causality: Objectives and Assessment:267–276, 2010.

E. Parzen. Long memory of statistical time series modeling. Presented at the 2004 NBER/NSF Time Series Conference at SMU, Dallas, USA. Technical report, Texas A&M University, http://www.stat.tamu.edu/~eparzen/Long%20Memory%20of%20Statistical%20Time%20Series%20Modeling.pdf, 2004.

J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, Cambridge, 2000.

K. Pearson. Tables for Statisticians and Biometricians. University Press, University College, London, 1930.

P. C. B. Phillips. The problem of identification in finite parameter continuous time models. Journal of Econometrics, 1:351–362, 1973.

F. Popescu. Identification of sparse multivariate autoregressive models. Proceedings of the European Signal Processing Conference (EUSIPCO 2008), Lausanne, Switzerland, 2008.


T. Richardson and P. Spirtes. Automated discovery of linear feedback models. In Computation, Causation and Discovery. AAAI Press and MIT Press, Menlo Park, 1999.

A. Roebroeck, A. K. Seth, and P. Valdes-Sosa. Causal time series analysis of functional magnetic resonance imaging data. Journal of Machine Learning Research, Workshop and Conference Proceedings, XX. Time Series Causality:XX–XX, 2011.

S. Ryali, K. Supekar, T. Chen, and V. Menon. Multivariate dynamical systems models for estimating causal interactions in fMRI. NeuroImage, 2010.

K. Sameshima and L. A. Baccalá. Using partial directed coherence to describe neuronal ensemble interactions. Journal of Neuroscience Methods, 94(1):93–103, 1999.

R. Scheines, P. Spirtes, C. Glymour, C. Meek, and T. Richardson. The TETRAD project: constraint-based aids to causal model specification. Multivariate Behavioral Research, 33(1):65–117, 1998.

T. Schreiber. Measuring information transfer. Physical Review Letters, 85(2):461, July 2000. doi: 10.1103/PhysRevLett.85.461. URL http://link.aps.org/doi/10.1103/PhysRevLett.85.461.

C. A. Sims. An autoregressive index model for the U.S. 1948–1975. In J. Kmenta and J. B. Ramsey, editors, Large-Scale Macro-Econometric Models: Theory and Practice, pages 283–327. North-Holland, 1981.

V. Solo. On causality I: sampling and noise. Proceedings of the 46th IEEE Conference on Decision and Control, pages 3634–3639, 2006.

P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, Cambridge MA, 2nd edition, 2000.

L. Su and H. White.
A nonparametric Hellinger metric test for conditional independence. Econometric Theory, 24(04):829–864, 2008.

G. J. Szekely and M. L. Rizzo. Testing for equal distributions in high dimension. InterStat, 5, 2004.

A. M. Turing. On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, 42:230–265, 1936.

P. A. Valdes-Sosa, J. M. Sanchez-Bornot, A. Lage-Castellanos, M. Vega-Hernandez, J. Bosch-Bayard, L. Melie-Garcia, and E. Canales-Rodriguez. Estimating brain functional connectivity with sparse multivariate autoregression. Philosophical Transactions of the Royal Society B, 360(1457):969, 2005.

H. White. Approximate nonlinear forecasting methods. In G. Elliott, C. W. J. Granger, and A. Timmermann, editors, Handbook of Economic Forecasting, chapter 9, pages 460–512. Elsevier, New York, 2006.


H. White and X. Lu. Granger causality and dynamic structural systems. Journal of Financial Econometrics, 8(2):193, 2010.

N. Wiener. The theory of prediction. Modern Mathematics for Engineers, Series 1:125–139, 1956.

H. O. Wold. A Study in the Analysis of Stationary Time Series. Stockholm: Almqvist and Wiksell, 1938.

A. Yatchew. Nonparametric regression techniques in economics. Journal of Economic Literature, 36(2):669–721, 1998.

G. U. Yule. Why do we sometimes get nonsense correlations between time series? A study in sampling and the nature of time series. Journal of the Royal Statistical Society, 89:1–64, 1926.

A. Ziehe and K.-R. Mueller. TDSEP: an efficient algorithm for blind separation using time structure. ICANN Proceedings, pages 675–680, 1998.

S. T. Ziliak and D. N. McCloskey. The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. University of Michigan Press, February 2008. ISBN 0472050079.

Appendix A. Statistical significance tables for Type I and Type III errors

In order to assist practitioners in evaluating the statistical significance of bivariate causality testing, tables were prepared for type I and type III error probabilities, as defined in (1), for different values of the base statistic. Tables are provided below both for the jackknifed statistic Ψ/std(Ψ) and for the raw statistic Ψ, which is needed in case the number of points is too low to allow a jackknife/cross-validation/bootstrap, or when computational speed is at a premium. The spectral evaluation method is Welch's method as described in Section 6. There were 2000 simulations for each condition.
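The jackknifed statistic Ψ/std(Ψ) can be formed, for instance, by a leave-one-block-out scheme such as the following sketch. The block count and the use of the standard jackknife variance formula are illustrative assumptions, not necessarily the exact procedure used to generate the tables, and the block rearrangement is only valid under the stationarity assumption noted in the Discussion.

```python
import numpy as np

def jackknife_normalized(stat_fn, x, y, n_blocks=10):
    """Leave-one-block-out jackknife normalization of a bivariate statistic.
    Returns stat / std_jack, where std_jack is the jackknife standard error.
    Assumes stationarity, so that blocks are interchangeable."""
    n = (len(x) // n_blocks) * n_blocks
    xb = np.reshape(x[:n], (n_blocks, -1))
    yb = np.reshape(y[:n], (n_blocks, -1))
    full = stat_fn(x[:n], y[:n])
    leave_one_out = np.array([
        stat_fn(np.delete(xb, i, axis=0).ravel(),
                np.delete(yb, i, axis=0).ravel())
        for i in range(n_blocks)
    ])
    var_jack = ((n_blocks - 1) / n_blocks
                * np.sum((leave_one_out - leave_one_out.mean()) ** 2))
    return full / np.sqrt(var_jack)
```

Any bivariate statistic can be plugged in as stat_fn; for a strongly coupled pair, the normalized value is many standard errors away from zero.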
The tables in this Appendix differ in one important aspect with respect to those in the main text. In order to avoid non-informative comparisons of datasets which are, for example, analyses of the same physical process sampled at different sampling rates, the number of points is scaled to an 'effective' number of points, which is essentially the number of samples relative to a simple estimate of the observed signal bandwidth:

N^{*} = \frac{N \tau_S}{\widehat{BW}}, \qquad \widehat{BW} = \frac{\|X\|_F}{\|\Delta X / \Delta T\|_F}

The values marked with an asterisk have values of both α and γ of less than 5%. Note also that Ψ is a non-dimensional index.
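The effective-number-of-points rescaling above can be computed as in the following sketch; the function name and array conventions are illustrative, and the finite-difference derivative is the simplest possible bandwidth proxy.

```python
import numpy as np

def effective_n(X, tau_s):
    """Effective number of points N* = N * tau_s / BW_hat, where
    BW_hat = ||X||_F / ||dX/dt||_F is a crude characteristic timescale
    of the observed signal. X: 1-D or (N, d) array; tau_s: sampling
    interval in seconds."""
    X = np.asarray(X, dtype=float)
    dX = np.diff(X, axis=0) / tau_s  # finite-difference derivative
    bw_hat = np.linalg.norm(X) / np.linalg.norm(dX)
    return X.shape[0] * tau_s / bw_hat
```

The point of the rescaling is visible directly: up-sampling the same underlying signal (same duration, more samples) leaves N* essentially unchanged, whereas naively counting samples would inflate it.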


Table 4: α vs. Ψ/std(Ψ)

N* →     50     100    200    500    750    1000   1500   2000   5000
0.125    0.82   0.83   0.86   0.87   0.88   0.90   0.89   0.89   0.89
0.25     0.67   0.69   0.73   0.77   0.76   0.78   0.78   0.78   0.77
0.5      0.41   0.44   0.50   0.54   0.55   0.56   0.56   0.59   0.57
0.75     0.26   0.26   0.32   0.36   0.36   0.37   0.38   0.40   0.39
1        0.15   0.15   0.20   0.23   0.23   0.25   0.25   0.26   0.26
1.25     0.09   0.09   0.11   0.13   0.14   0.16   0.14   0.15   0.16
1.5      0.06   0.05   0.06   0.07   0.08   0.09   0.08   0.10   0.10
1.75     0.04   0.03   0.04   0.05   0.05*  0.05*  0.04*  0.06   0.06
2        0.03   0.02   0.03   0.02*  0.02*  0.03*  0.02*  0.03*  0.03*
2.5      0.01   0.01   0.01   0.01*  0.01*  0.01*  0.00*  0.01*  0.01*
3        0.01   0.00   0.00   0.00*  0.00*  0.00*  0.00*  0.00*  0.00*
4        0.00   0.00*  0.00*  0.00*  0.00*  0.00*  0.00*  0.00*  0.00*
8        0.00*  0.00*  0.00*  0.00*  0.00*  0.00*  0.00*  0.00*  0.00*
16       0.00*  0.00*  0.00*  0.00*  0.00*  0.00*  0.00*  0.00*  0.00*

Table 5: γ vs. Ψ/std(Ψ)

N* →     50     100    200    500    750    1000   1500   2000   5000
0.125    0.46   0.41   0.35   0.26   0.23   0.21   0.19   0.18   0.16
0.25     0.46   0.39   0.33   0.24   0.21   0.19   0.18   0.16   0.14
0.5      0.44   0.36   0.29   0.21   0.18   0.15   0.14   0.13   0.12
0.75     0.43   0.35   0.28   0.17   0.14   0.12   0.11   0.10   0.09
1        0.43   0.31   0.23   0.13   0.12   0.09   0.08   0.07   0.06
1.25     0.42   0.31   0.20   0.09   0.08   0.06   0.05   0.05   0.04
1.5      0.40   0.26   0.20   0.06   0.05   0.04   0.04   0.04   0.02
1.75     0.42   0.26   0.16   0.05   0.04*  0.02*  0.02*  0.03   0.01
2        0.41   0.23   0.11   0.03*  0.03*  0.02*  0.02*  0.02*  0.01*
2.5      0.41   0.20   0.06   0.02*  0.02*  0.01*  0.01*  0.01*  0.01*
3        0.33   0.09   0.06   0.03*  0.01*  0.00*  0.00*  0.00*  0.00*
4        0.33   0.00*  0.00*  0.02*  0.00*  0.00*  0.00*  0.00*  0.00*
8        0.00*  0.00*  0.00*  0.00*  0.00*  0.00*  0.00*  0.00*  0.00*
16       0.00*  0.00*  0.00*  0.00*  0.00*  0.00*  0.00*  0.00*  0.00*


Table 6: α vs. Ψ

N* →     50     100    200    500    750    1000   1500   2000   5000
0.125    0.25   0.22   0.25   0.24   0.24   0.24   0.21   0.21   0.15
0.25     0.09   0.09   0.09   0.09   0.10   0.09   0.09   0.08   0.05*
0.5      0.02*  0.01*  0.01*  0.02*  0.02*  0.02*  0.01*  0.01*  0.01*
0.75     0.00*  0.00*  0.00*  0.00*  0.00*  0.00*  0.00*  0.00*  0.00*
1        0.00*  0.00*  0.00*  0.00*  0.00*  0.00*  0.00*  0.00*  0.00*

Table 7: γ vs. Ψ

N* →     50     100    200    500    750    1000   1500   2000   5000
0.125    0.21   0.17   0.14   0.10   0.08   0.07   0.06   0.05   0.03
0.25     0.11   0.08   0.06   0.04   0.03   0.03   0.02   0.02   0.01*
0.5      0.03*  0.01*  0.01*  0.01*  0.01*  0.00*  0.00*  0.01*  0.00*
0.75     0.02*  0.00*  0.00*  0.00*  0.00*  0.00*  0.00*  0.00*  0.00*
1        0.00*  0.00*  0.00*  0.00*  0.00*  0.00*  0.00*  0.00*  0.00*


JMLR: Workshop and Conference Proceedings 12:65–94, 2011    Causality in Time Series

Causal Time Series Analysis of functional Magnetic Resonance Imaging Data

Alard Roebroeck    a.roebroeck@maastrichtuniversity.nl
Faculty of Psychology & Neuroscience, Maastricht University, the Netherlands

Anil K. Seth    a.k.seth@sussex.ac.uk
Sackler Centre for Consciousness Science, University of Sussex, UK

Pedro Valdes-Sosa    peter@cneuro.edu.cu
Cuban Neuroscience Centre, Playa, Cuba

Editors: Florin Popescu and Isabelle Guyon

© 2011 A. Roebroeck, A.K. Seth & P. Valdes-Sosa.

Abstract

This review focuses on dynamic causal analysis of functional magnetic resonance imaging (fMRI) data to infer brain connectivity from a time series analysis and dynamical systems perspective. Causal influence is expressed in the Wiener-Akaike-Granger-Schweder (WAGS) tradition and dynamical systems are treated in a state space modeling framework. The nature of the fMRI signal is reviewed with emphasis on the involved neuronal, physiological and physical processes and their modeling as dynamical systems. In this context, two streams of development in modeling causal brain connectivity using fMRI are discussed: time series approaches to causality in a discrete time tradition, and dynamic systems and control theory approaches in a continuous time tradition. This review closes with discussion of ongoing work and future perspectives on the integration of the two approaches.

Keywords: fMRI, hemodynamics, state space model, Granger causality, WAGS influence

1. Introduction

Understanding how interactions between brain structures support the performance of specific cognitive tasks or perceptual and motor processes is a prominent goal in cognitive neuroscience. Neuroimaging methods, such as electroencephalography (EEG), magnetoencephalography (MEG) and functional Magnetic Resonance Imaging (fMRI), are employed more and more to address questions of functional connectivity, inter-region coupling and networked computation that go beyond the 'where' and 'when' of task-related activity (Friston, 2002; Horwitz et al., 2000; McIntosh, 2004; Salmelin and Kujala, 2006; Valdes-Sosa et al., 2005a). A network perspective on the parallel and distributed processing in the brain, even on the large scale accessible by neuroimaging methods, is a promising approach to enlarge our understanding of perceptual, cognitive and motor functions. Functional Magnetic Resonance Imaging (fMRI) in particular is


increasingly used not only to localize structures involved in cognitive and perceptual processes but also to study the connectivity in large-scale brain networks that support these functions.

Generally a distinction is made between three types of brain connectivity. Anatomical connectivity refers to the physical presence of an axonal projection from one brain area to another. Identification of large axon bundles connecting remote regions in the brain has recently become possible non-invasively in vivo by diffusion weighted magnetic resonance imaging (DWMRI) and fiber tractography analysis (Johansen-Berg and Behrens, 2009; Jones, 2010). Functional connectivity refers to the correlation structure (or more generally: any order of statistical dependency) in the data such that brain areas can be grouped into interacting networks. Finally, effective connectivity modeling moves beyond statistical dependency to measures of directed influence and causality within the networks, constrained by further assumptions (Friston, 1994).

Recently, effective connectivity techniques that make use of the temporal dynamics in the fMRI signal and employ time series analysis and systems identification theory have become popular. Within this class of techniques two separate developments have been most used: Granger causality analysis (GCA; Goebel et al., 2003; Roebroeck et al., 2005; Valdes-Sosa, 2004) and Dynamic Causal Modeling (DCM; Friston et al., 2003).
Despite the common goal, there are notable differences between the two methods. Whereas GCA explicitly models temporal precedence and uses the concept of Granger causality (or G-causality), mostly formulated in a discrete time-series analysis framework, DCM employs a biophysically motivated generative model formulated in a continuous time dynamic system framework. In this chapter we will give a general causal time-series analysis perspective onto both developments from what we have called the Wiener-Akaike-Granger-Schweder (WAGS) influence formalism (Valdes-Sosa et al., in press).

Effective connectivity modeling of neuroimaging data entails the estimation of multivariate mathematical models that benefit from a state space formulation, as we will discuss below. Statistical inference on estimated parameters that quantify the directed influence between brain structures, either individually or in groups (model comparison), then provides information on directed connectivity. In such models, brain structures are defined from at least two viewpoints. From a structural viewpoint they correspond to a set of "nodes" that comprise a graph, the purpose of causal discovery being the identification of active links in the graph. The structural model contains i) a selection of the structures in the brain that are assumed to be of importance in the cognitive process or task under investigation, ii) the possible interactions between those structures and iii) the possible effects of exogenous inputs onto the network.
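To make these two viewpoints concrete, the sketch below pairs a hypothetical three-region structural model (a graph of possible links plus one exogenous input) with a very simple linear dynamical model. The regions, coupling strengths and input timing are all illustrative assumptions, not a model from the literature; the matrix notation merely anticipates the state space models discussed later in this chapter.

```python
import numpy as np

# Hypothetical structural model of three regions (all values illustrative):
# possible interactions as a coupling matrix A, exogenous input entering via C.
A = np.array([[-1.0,  0.0,  0.0],    # self-decay on the diagonal
              [ 0.4, -1.0,  0.0],    # region 1 -> region 2
              [ 0.0,  0.4, -1.0]])   # region 2 -> region 3
C = np.array([1.0, 0.0, 0.0])        # input drives region 1 only

dt, T = 0.01, 20.0
steps = int(T / dt)
t = np.arange(steps) * dt
u = ((t % 4.0) < 2.0).astype(float)  # simple on/off indicator function (stimulus)

z = np.zeros((steps, 3))             # neural state of each region over time
for i in range(steps - 1):
    # linear dynamics dz/dt = A z + C u, integrated with forward Euler
    z[i + 1] = z[i] + dt * (A @ z[i] + C * u[i])

print("peak activity per region:", z.max(axis=0).round(3))
```

Activity injected into region 1 propagates down the chain with attenuation and delay, which is the kind of directed, time-lagged structure that the causal models reviewed below aim to recover.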
The exogenous inputs may be under control of the experimenter and often have the form of a simple indicator function that can represent, for instance, the presence or absence of a visual stimulus in the subject's view. From a dynamical viewpoint brain structures are represented by states or variables that describe time varying neural activity within a time-series model of the measured fMRI time-series data. The functional form of the model equations can embed assumptions on signal dynamics, temporal precedence or physiological processes from which signals originate.

We start this review by focusing on the nature of the fMRI signal in some detail in section 2, separating the treatment into neuronal, physiological and physical processes. In section 3 we review two important formal concepts: causal influence in the Wiener-Akaike-Granger-Schweder tradition and the state space modeling framework, with some emphasis on the relations between discrete and continuous time series models. Building on this discussion, section 4 reviews time series modeling of causality in fMRI data. The review proceeds somewhat chronologically, discussing and comparing the two separate streams of development (GCA and DCM) that have recently begun to be integrated. Finally, section 5 summarizes and discusses the main topics in general dynamic state space models of brain connectivity and provides an outlook on future developments.

2. The fMRI Signal

The fMRI signal reflects the activity within neuronal populations non-invasively with excellent spatial resolution (millimeters down to hundreds of micrometers at high field strength), good temporal resolution (seconds down to hundreds of milliseconds) and whole-brain coverage of the human or animal brain (Logothetis, 2008). Although fMRI is possible with a few different techniques, the Blood Oxygenation Level Dependent (BOLD) contrast mechanism is employed in the great majority of cases.
In short, the BOLD fMRI signal is sensitive to changes in blood oxygenation, blood flow and blood volume that result from oxidative glucose metabolism which, in turn, is needed to fuel local neuronal activity (Buxton et al., 2004). This is why fMRI is usually classified as a 'metabolic' or 'hemodynamic' neuroimaging modality. Its superior spatial resolution, in particular, distinguishes it from other functional brain imaging modalities used in humans, such as EEG, MEG and Positron Emission Tomography (PET). Although its temporal resolution is far superior to PET (another 'metabolic' neuroimaging modality), it is still an order of magnitude below that of EEG and MEG, resulting in a relatively sparse sampling of fast neuronal processes, as we will discuss below. The final fMRI signal arises from a complex chain of processes that we can classify into neuronal, physiological and physical processes (Uludag et al., 2005), each of which contains some crucial parameters and variables and has been modeled in various ways, as illustrated in Figure 1. We will discuss each of the three classes of processes to explain the intricacies involved in trying to model this causal chain of events with the ultimate goal of estimating neuronal activity and interactions from the measured fMRI signal.

On the neuronal level, it is important to realize that fMRI reflects certain aspects of neuronal functioning more than others. A wealth of processes are continuously in operation at the microscopic level (i.e.
in any single neuron), including maintaining a resting potential, post-synaptic conduction and integration (spatial and temporal) of graded excitatory and inhibitory post-synaptic potentials (EPSPs and IPSPs) arriving at the dendrites, subthreshold dynamic (possibly oscillatory) potential changes, spike generation at the axon hillock, propagation of spikes by continuous regeneration of


Figure 1: The neuronal, physiological and physical processes (top row) and variables and parameters involved (middle row) in the complex causal chain of events that leads to the formation of the fMRI signal. The bottom row lists some mathematical models of the sub-processes that play a role in the analysis and modeling of fMRI signals. See main text for further explanation.


the action potential along the axon, and release of neurotransmitter substances into the synaptic cleft at arrival of an action potential at the synaptic terminal. There are many different types of neurons in the mammalian brain that express these processes in different degrees and ways. In addition, there are other cells, such as glia cells, that perform important processes, some of them possibly directly relevant to computation or signaling. As explained below, the fMRI signal is sensitive to the local oxidative metabolism in the brain. This means that, indirectly, it mainly reflects the most energy consuming of the neuronal processes. In primates, post-synaptic processes account for the great majority (about 75%) of the metabolic costs of neuronal signaling events (Attwell and Iadecola, 2002). Indeed, the greater sensitivity of fMRI to post-synaptic activity, rather than axon generation and propagation ('spiking'), has been experimentally verified. For instance, in a simultaneous invasive electrophysiology and fMRI measurement in the primate, Logothetis and colleagues (Logothetis et al., 2001) found the fMRI signal to be more correlated to the mean Local Field Potential (LFP) of the electrophysiological signal, known to reflect post-synaptic graded potentials, than to high-frequency and multi-unit activity, known to reflect spiking.
In another study it was shown that, by suppressing action potentials while keeping LFP responses intact by injecting a serotonin agonist, the fMRI response remained intact, again suggesting that LFP is a better predictor for fMRI activity (Rauch et al., 2008). These results confirmed earlier results obtained on the cerebellum of rats (Thomsen et al., 2004).

Neuronal activity, dynamics and computation can be modeled at different levels of abstraction, including the macroscopic (whole brain areas), mesoscopic (sub-areas to cortical columns) and microscopic level (individual neurons or groups of these). The levels most relevant to modeling fMRI signals are the macro- and mesoscopic levels. Macroscopic models, used to represent considerable expanses of gray matter tissue or sub-cortical structures as Regions Of Interest (ROIs), prominently include single variable deterministic (Friston et al., 2003) or stochastic (autoregressive; Penny et al., 2005; Roebroeck et al., 2005; Valdes-Sosa et al., 2005b) exponential activity decay models. Although the simplicity of such models entails a large degree of abstraction in representing neuronal activity dynamics, their modest complexity is generally well matched to the limited temporal resolution available in fMRI. Nonetheless, more complex multi-state neuronal dynamics models have been investigated in the context of fMRI signal generation.
These include the 2-state-variable Wilson-Cowan model (Marreiros et al., 2008), with one excitatory and one inhibitory sub-population per ROI, and the 3-state-variable Jansen-Rit model, with a pyramidal excitatory output population and an inhibitory and excitatory interneuron population, particularly in the modeling of simultaneously acquired fMRI and EEG (Valdes-Sosa et al., 2009).

The physiology and physics of the fMRI signal is most easily explained by starting with the physics. We will give a brief overview here and refer to more dedicated overviews (Haacke et al., 1999; Uludag et al., 2005) for extended treatment. The hallmark of Magnetic Resonance (MR) spectroscopy and imaging is the use of the resonance frequency of magnetized nuclei possessing a magnetic moment, mostly protons (hydrogen nuclei, 1H), called 'spins'. Radiofrequency antennas (RF coils) can measure
signal from ensembles of spins that resonate in phase at the moment of measurement. The first important physical factor in MR is the main magnetic field strength (B_0), which determines both the resonance frequency (directly proportional to field strength) and the baseline signal-to-noise ratio of the signal, since higher fields make a larger proportion of spins in the tissue available for measurement. The most used field strengths for fMRI research in humans range from 1.5T (Tesla) to 7T. The second important physical factor – containing several crucial parameters – is the MR pulse sequence that determines the magnetization preparation of the sample and the way the signal is subsequently acquired. The pulse sequence is essentially a series of radiofrequency pulses, linear magnetic gradient pulses and signal acquisition (readout) events (Bernstein et al., 2004; Haacke et al., 1999). An important variable in a BOLD fMRI pulse sequence is whether it is a gradient-echo (GRE) sequence or a spin-echo (SE) sequence, which determines the granularity of the vascular processes that are reflected in the signal, as explained later in this section. These effects are further modulated by the echo time (time to echo; TE) and repetition time (time to repeat; TR) that are usually set by the end-user of the pulse sequence. Finally, an important variable within the pulse sequence is the type of spatial encoding that is employed. Spatial encoding can primarily be achieved with gradient pulses and it embodies the essence of 'Imaging' in MRI.
It is only with spatial encoding that signal can be localized to certain 'voxels' (volume elements) in the tissue. A strength of fMRI as a neuroimaging technique is that an adjustable trade-off is available to the user between spatial resolution, spatial coverage, temporal resolution and signal-to-noise ratio (SNR) of the acquired data. For instance, although fMRI can achieve excellent spatial resolution at good SNR and reasonable temporal resolution, one can choose to sacrifice some spatial resolution to gain a better temporal resolution for any given study. Note, however, that this concerns the resolution and SNR of the data acquisition. As explained below, the physiology of fMRI can put fundamental limitations on the nominal resolution and SNR that is achieved in relation to the neuronal processes of interest.

On the physiological level, the main variables that mediate the BOLD contrast in fMRI are cerebral blood flow (CBF), cerebral blood volume (CBV) and the cerebral metabolic rate of oxygen (CMRO2), which all change the oxygen saturation of the blood (as usefully quantified by the concentration of deoxygenated hemoglobin). The BOLD contrast is made possible by the fact that oxygenation of the blood changes its magnetic susceptibility, which has an effect on the MR signal as measured in GRE and SE sequences. More precisely, oxygenated and deoxygenated hemoglobin (oxy-Hb and deoxy-Hb) have different magnetic properties, the former being diamagnetic and the latter paramagnetic. As a consequence, deoxygenated blood creates local microscopic magnetic field gradients, such that local spin ensembles dephase, which is reflected in a lower MR signal.
Conversely, oxygenation of blood above baseline lowers the concentration of deoxy-Hb, which decreases local spin dephasing and results in a higher MR signal. This means that fMRI is directly sensitive to the relative amount of oxy- and deoxy-Hb and to the fraction of cerebral tissue that is occupied by blood (the CBV), which are controlled by local neurovascular coupling processes. Neurovascular processes, in
turn, are tightly coupled to neurometabolic processes controlling the rate of oxidative glucose metabolism (the CMRO2) that is needed to fuel neural activity.

Naively one might expect local neuronal activity to quickly increase CMRO2 and increase the local concentration of deoxy-Hb, leading to a lowering of the MR signal. However, this transient increase in deoxy-Hb, or the initial dip in the fMRI signal, is not consistently observed and, thus, there is a debate whether this signal is robust, elusive or simply non-existent (Buxton, 2001; Ugurbil et al., 2003; Uludag, 2010). Instead, early experiments showed that the dynamics of blood flow and blood volume, the hemodynamics, lead to a robust BOLD signal increase. Neuronal activity is quickly followed by a large CBF increase that serves the continued functioning of neurons by clearing metabolic by-products (such as CO2) and supplying glucose and oxy-Hb. This CBF response is an overcompensating response, supplying much more oxy-Hb to the local blood system than has been metabolized. As a consequence, within 1-2 seconds, the oxygenation of the blood increases and the MR signal increases. The increased flow also induces a 'ballooning' of the blood vessels, increasing CBV, the proportion of volume taken up by blood, further increasing the signal.

Figure 2: A: Simplified causal chain of hemodynamic events as modeled by the balloon model. Grey arrows show how variable increases (decreases) tend to relate to each other.
The dynamic changes after a brief pulse of neuronal activity are plotted for CBF (in red), CBV (in purple), deoxy-Hb (in green) and BOLD signal (in blue). B: A more abstract representation of the hemodynamic response function as a set of linear basis functions acting as convolution kernels (arbitrary amplitude scaling). Solid line: canonical two-gamma HRF; Dotted line: time derivative; Dashed line: dispersion derivative.

A mathematical characterization of the hemodynamic processes in BOLD fMRI at 1.5-3T has been given in the biophysical balloon model (Buxton et al., 2004, 1998), schematized in Figure 2A. A simplification of the full balloon model has become important in causal models of brain connectivity (Friston et al., 2000). In this simplified model, the dependence of the fractional fMRI signal change ΔS/S on normalized cerebral
blood flow f, normalized cerebral blood volume v and normalized deoxyhemoglobin content q is modeled as:

    \frac{\Delta S}{S} = V_0 \left[ k_1 (1 - q) + k_2 \left( 1 - \frac{q}{v} \right) + k_3 (1 - v) \right]    (1)

    \dot{v}_t = \frac{1}{\tau_0} \left( f_t - v_t^{1/\alpha} \right)    (2)

    \dot{q}_t = \frac{1}{\tau_0} \left( f_t \, \frac{1 - (1 - E_0)^{1/f_t}}{E_0} - \frac{q_t}{v_t^{1 - 1/\alpha}} \right)    (3)

The term E_0 is the resting oxygen extraction fraction, V_0 is the resting blood volume fraction, τ_0 is the mean transit time of the venous compartment, α is the stiffness component of the model balloon and {k_1, k_2, k_3} are calibration parameters. The main simplifications of this model with respect to a more complete balloon model (Buxton et al., 2004) are a one-to-one coupling of flow and volume in (2), thus neglecting the actual balloon effect, and a perfect coupling between flow and metabolism in (3). Friston et al. (2000) augment this model with a putative relation between a neuronal activity variable z, a flow-inducing signal s, and the normalized cerebral blood flow f. They propose the following relations, in which neuronal activity z causes an increase in a vasodilatory signal that is subject to autoregulatory feedback:

    \dot{s}_t = z_t - \frac{s_t}{\tau_s} - \frac{f_t - 1}{\tau_f^2}    (4)

    \dot{f}_t = s_t    (5)

Here τ_s is the signal decay time constant, τ_f is the time constant of the feedback autoregulatory mechanism¹, and f is the flow normalized to baseline flow. The physiological interpretation of the autoregulatory mechanism is unspecified, leaving us with a neuronal activity variable z that is measured in units of s^{-2}.
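To make the behavior of this simplified model concrete, equations (1)-(5) can be integrated numerically. The following is a minimal sketch using forward Euler integration; the parameter values are typical of the balloon-model literature and are illustrative assumptions, not values given in this chapter.

```python
import numpy as np

# Illustrative parameter values (typical of the balloon-model literature,
# not taken from this chapter)
tau_s, tau_f = 1.56, 1.77      # signal decay / autoregulation time constants (s)
tau_0 = 2.0                    # mean transit time of the venous compartment (s)
alpha, E0, V0 = 0.32, 0.4, 0.04
k1, k2, k3 = 7.0 * E0, 2.0, 2.0 * E0 - 0.2   # calibration parameters

dt, T = 0.01, 25.0
n = int(T / dt)
t = np.arange(n) * dt
z = np.where(t < 0.5, 1.0, 0.0)   # brief neuronal input pulse

s = np.zeros(n)                   # vasodilatory signal, eq. (4)
f = np.ones(n)                    # normalized blood flow, eq. (5)
v = np.ones(n)                    # normalized blood volume, eq. (2)
q = np.ones(n)                    # normalized deoxyhemoglobin content, eq. (3)
for i in range(n - 1):
    s[i+1] = s[i] + dt * (z[i] - s[i] / tau_s - (f[i] - 1.0) / tau_f**2)
    f[i+1] = f[i] + dt * s[i]
    v[i+1] = v[i] + dt * (f[i] - v[i] ** (1.0 / alpha)) / tau_0
    q[i+1] = q[i] + dt * (f[i] * (1.0 - (1.0 - E0) ** (1.0 / f[i])) / E0
                          - q[i] / v[i] ** (1.0 - 1.0 / alpha)) / tau_0

# eq. (1): fractional BOLD signal change
bold = V0 * (k1 * (1.0 - q) + k2 * (1.0 - q / v) + k3 * (1.0 - v))
print("peak BOLD change: %.2f%% at t = %.1f s" % (100 * bold.max(), t[bold.argmax()]))
```

The simulated response shows the qualitative features discussed in the text: a sluggish, delayed positive BOLD response peaking several seconds after the brief neural event.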
The physiology of the hemodynamics contained in differential equations (2) to (5), on the other hand, is more readily interpretable, and when integrated for a brief neuronal input pulse shows the behavior as described above (Figure 2A, upper panel). This simulation highlights a few crucial features. First, the hemodynamic response to a brief neural activity event is sluggish and delayed, entailing that the fMRI BOLD signal is a delayed and low-pass filtered version of underlying neuronal activity. More than the distorting effects of hemodynamic processes on the temporal structure of fMRI signals per se, it is the difference in hemodynamics in different parts of the brain that forms a severe confound for dynamic brain connectivity models. Particularly, the delay imposed upon fMRI signals with respect to the underlying neural activity is known to vary between subjects and between different brain regions of the same subject (Aguirre et al., 1998; Saad et al., 2001). Second, although CBF, CBV and deoxy-Hb changes range in the tens of percents, the BOLD signal change at 1.5T or 3T is in the range of 0.5-2%.

¹ Note that we have reparametrized the equation here in terms of τ_f² to make τ_f a proper time constant in units of seconds.

Nevertheless,
the SNR of BOLD fMRI in general is very good in comparison to electrophysiological techniques like EEG and MEG.

Although the balloon model and its variations have played an important role in describing the transient features of the fMRI response and inferring neuronal activity, simplified ways of representing the BOLD signal responses are very often used. Most prominent among these is a linear finite impulse response (FIR) convolution with a suitable kernel. The most used single convolution kernel characterizing the 'canonical' hemodynamic response is formed by a superposition of two gamma functions (Glover, 1999), the first characterizing the initial signal increase, the second the later negative undershoot (Figure 2B, solid line):

    h(t) = \frac{t^{\tau_1} e^{-l_1 t}}{m_1} - c \, \frac{t^{\tau_2} e^{-l_2 t}}{m_2}, \qquad m_i = \max_t \left( t^{\tau_i} e^{-l_i t} \right)    (6)

with times-to-peak in seconds τ_1 = 6, τ_2 = 16, scale parameters l_i (typically equal to 1) and a relative amplitude of undershoot to peak of c = 1/6.

Often, the canonical two-gamma HRF kernel is augmented with one or two additional orthogonalized convolution kernels: a temporal derivative and a dispersion derivative. Together the convolution kernels form a flexible basis function expansion of possible HRF shapes, with the temporal derivative of the canonical accounting for variation in the response delay and the dispersion derivative accounting for variations in temporal response width (Henson et al., 2002; Liao et al., 2002). Thus, the linear basis function representation is a more abstract characterization of the HRF (i.e.
further away from the physiology) that still captures the possible variations in responses.

It is an interesting property of hemodynamic processes that, although they are characterized by a large overcompensating reaction to neuronal activity, their effects are highly local. The locality of the hemodynamic response to neuronal activity limits the actual spatial resolution of fMRI. The path of blood inflow in the brain is from large arteries through arterioles into capillaries, where exchange with neuronal tissue takes place at a microscopic level. Blood outflow takes place via venules into the larger veins. The main regulators of blood flow are the arterioles, which are surrounded by smooth muscle, although arteries and capillaries are also thought to be involved in blood flow regulation (Attwell et al., 2010). Different hemodynamic parameters have different spatial resolutions. While CBV and CBF change in all compartments (but mostly venules), oxygenation changes mostly in the venules and veins. Thus, the achievable spatial resolution with fMRI is limited by its specificity to the smaller arterioles and venules and microscopic capillaries supplying the tissue, rather than the larger supplying arteries and draining veins. The larger vessels have a larger domain of supply or extraction and, as a consequence, their signal is blurred and mislocalized with respect to active tissue. Here, physiology and physics interact in an important way.
It can be shown theoretically – by the effects of thermal motion of spin diffusion over time and the distance of the spins to deoxy-Hb – that the origin of the BOLD signal in SE sequences at high main field strengths (larger than 3T) is much more specific to the microscopic vasculature than to the larger arteries and veins. This does not hold for GRE sequences or SE sequences
at lower field strengths. The cost of this greater specificity and higher effective spatial resolution is that SE-BOLD has a lower intrinsic SNR than GRE-BOLD. The balloon model equations above are specific to GRE-BOLD at 1.5T and 3T and have been extended to reflect diffusion effects for higher field strengths (Uludag et al., 2009).

In summary, fMRI is an indirect measure of neuronal and synaptic activity. The physiological quantities directly determining signal contrast in BOLD fMRI are hemodynamic quantities such as cerebral blood flow and volume and oxygen metabolism. fMRI can achieve an excellent spatial resolution (millimeters down to hundreds of micrometers at high field strength) with good temporal resolution (seconds down to hundreds of milliseconds). The potential to resolve neuronal population interactions at a high spatial resolution is what drives attempts at causal time series modeling of fMRI data. However, the significant aspects of fMRI that pose challenges for such attempts are i) the enormous dimensionality of the data, which contains hundreds of thousands of channels (voxels), ii) the temporal convolution of neuronal events by sluggish hemodynamics that can differ between remote parts of the brain and iii) the relatively sparse temporal sampling of the signal.

3. Causality and state-space models

The inference of causal influence relations from statistical analysis of observed data has two dominant approaches.
The first approach is in the tradition of Granger causality or G-causality, which has its signature in improved predictability of one time series by another. The second approach is based on graphical models and the notion of intervention (Glymour, 2003), which has been formalized using a Bayesian probabilistic framework termed causal calculus or do-calculus (Pearl, 2009). Interestingly, recent work has combined the two approaches in a third line of work, termed Dynamic Structural Systems (White and Lu, 2010). The focus here will be on the first approach, initially firmly rooted in econometrics and time-series analysis. We will discuss this tradition in a very general form, acknowledging early contributions from Wiener, Akaike, Granger and Schweder, and will follow (Valdes-Sosa et al., in press) in referring to the crucial concept as WAGS influence.

3.1. Wiener-Akaike-Granger-Schweder (WAGS) influence

The crucial premise of the WAGS statistical causal modeling tradition is that a cause must precede and increase the predictability of its effect. In other words: a variable X_2 influences another variable X_1 if the prediction of X_1 improves when we use past values of X_2, given that all other relevant information (importantly: the past of X_1 itself) is taken into account. This type of reasoning can be traced back at least to Hume and is particularly popular in analyzing dynamical data measured as time series. In a formal framework it was originally proposed (in an abstract form) by Wiener (Wiener, 1956), and then introduced into practical data analysis and popularized by Granger (Granger, 1969). A point stressed by Granger is that increased predictability is a necessary but not sufficient condition for a causal relation between time series.


In fact, Granger distinguished true causal relations – only to be inferred in the presence of knowledge of the state of the whole universe – from “prima facie” causal relations that we refer to as “influence” in agreement with other authors (Commenges and Gegout-Petit, 2009). Almost simultaneously with Granger's work, Akaike (Akaike, 1968) and Schweder (Schweder, 1970) introduced similar concepts of influence, prompting (Valdes-Sosa et al., in press) to coin the term WAGS influence (for Wiener-Akaike-Granger-Schweder). This is a generalization of a proposal by Aalen (Aalen, 1987; Aalen and Frigessi, 2007), who was among the first to point out the connections between Granger's and Schweder's influence concepts. Within this framework we can define several general types of WAGS influence, which are applicable to both Markovian and non-Markovian processes, in discrete or continuous time.

For three vector time series X_1(t), X_2(t), X_3(t) we wish to know if time series X_1(t) is influenced by time series X_2(t) conditional on X_3(t). Here X_3(t) can be considered any set of relevant time series to be controlled for. Let X_[a,b] = {X(t), t ∈ [a,b]} denote the history of a time series in the discrete or continuous time interval [a,b]. The first categorical distinction is based on what part of the present or future of X_1(t) can be predicted by the past or present of X_2(τ_2), τ_2 ≤ t. This leads to the following classification (Florens, 2003; Florens and Fougere, 1996):

1. If X_2(τ_2) : τ_2 < t can influence any future value of X_1(t), it is a global influence.
2. If X_2(τ_2) : τ_2 < t can influence X_1(t) at time t, it is a local influence.
3. If X_2(τ_2) : τ_2 = t can influence X_1(t), it is a contemporaneous influence.

A second distinction is based on predicting the whole probability distribution (strong influence) or only given moments (weak influence). Since the most natural formal definition is one of independence, every influence type amounts to the negation of an independence statement. The two classifications give rise to six types of independence and corresponding influence, as set out in Table 1.

To illustrate, X_1(t) is strongly, conditionally, and globally independent of X_2(t) given X_3(t), if

P(X_1(∞,t] | X_1(t,−∞], X_2(t,−∞], X_3(t,−∞]) = P(X_1(∞,t] | X_1(t,−∞], X_3(t,−∞])

That is: the probability distribution of the future values of X_1 does not depend on the past of X_2, given that the influence of the past of both X_1 and X_3 has been taken into account. When this condition does not hold we say X_2(t) strongly, conditionally, and globally influences (SCGi) X_1(t) given X_3(t). Here we use a convention for intervals [a,b) which indicates that the left endpoint is included but not the right and that b precedes a. Note that the whole future of X_1(t) is included (hence the term “global”). And the whole past of all time series is considered.
This means these definitions accommodate non-Markovian processes (for Markovian processes, we only consider the previous time point). Furthermore, these definitions do not depend on an assumption of linearity or any given functional relationship between time series. Note also that this definition is appropriate for point processes, discrete and continuous time series, even for categorical (qualitative valued) time series. The only problem with this formulation is that it calls on the whole probability distribution, and therefore its practical assessment requires the use of measures such as mutual information that estimate the probability densities nonparametrically.

Table 1: Types of influence, each defined by the absence of the corresponding (strong or weak; conditional; global, local or contemporaneous) independence relation. See text for acronym definitions.

                              Strong                            Weak
                              (probability distribution)       (expectation)
  Global (all horizons)       X_2(t) SCGi X_1(t) || X_3(t)     X_2(t) WCGi X_1(t) || X_3(t)
  Local (immediate future)    X_2(t) SCLi X_1(t) || X_3(t)     X_2(t) WCLi X_1(t) || X_3(t)
  Contemporaneous             X_2(t) SCCi X_1(t) || X_3(t)     X_2(t) WCCi X_1(t) || X_3(t)

As an alternative, weak concepts of influence can be defined based on expectations. Consider weak conditional local independence in discrete time, which is defined:

E[X_1[t + ∆t] | X_1[t,−∞], X_2[t,−∞], X_3[t,−∞]] = E[X_1[t + ∆t] | X_1[t,−∞], X_3[t,−∞]]    (7)

When this condition does not hold we say X_2 weakly, conditionally and locally influences (WCLi) X_1 given X_3.
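This prediction-improvement criterion can be illustrated with a minimal numerical sketch (not part of the original text; the coupling strength, lag order, and series length are arbitrary choices). Two coupled AR(1) series are simulated, and one-step least-squares predictions of X_1 are compared with and without the past of X_2:

```python
import numpy as np

# Simulate two coupled AR(1) series in which x2 drives x1 with a one-step lag
rng = np.random.default_rng(0)
T = 2000
x1, x2 = np.zeros(T), np.zeros(T)
for t in range(1, T):
    x2[t] = 0.5 * x2[t - 1] + rng.standard_normal()
    x1[t] = 0.5 * x1[t - 1] + 0.8 * x2[t - 1] + rng.standard_normal()

def resid_var(y, X):
    # ordinary least squares fit; return the variance of the residuals
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.var(y - X @ beta)

y = x1[1:]
v_r = resid_var(y, np.column_stack([x1[:-1]]))            # own past only
v_f = resid_var(y, np.column_stack([x1[:-1], x2[:-1]]))   # plus the past of x2
print(v_r > v_f)   # True: the past of x2 improves one-step prediction of x1
```

In the terminology above, the drop in residual variance when the past of X_2 is added is evidence that X_2 weakly, conditionally and locally influences X_1 (here with an empty conditioning set X_3).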
To make the implementation of this definition insightful, consider a discrete first-order vector autoregressive (VAR) model for X = [X_1 X_2 X_3]:

X[t + ∆t] = A X[t] + e[t + ∆t]    (8)

For this case E[X[t + ∆t] | X[t,−∞]] = A X[t], and analyzing influence reduces to finding which of the autoregressive coefficients are zero. Thus, many proposed operational tests of WAGS influence, particularly in fMRI analysis, have been formulated as tests of discrete autoregressive coefficients, although not always of order 1. Within the same model one can operationalize weak conditional instantaneous independence in discrete time as zero off-diagonal entries in the covariance matrix of the innovations e[t]:

Σ_e = cov[X[t + ∆t] | X[t,−∞]] = E[X[t + ∆t] X′[t + ∆t] | X[t,−∞]]

In comparison, weak conditional local independence in continuous time is defined:

E[Y_1[t] | Y_1(t,−∞], Y_2(t,−∞], Y_3(t,−∞]] = E[Y_1[t] | Y_1(t,−∞], Y_3(t,−∞]]    (9)

Now consider a first-order stochastic differential equation (SDE) model for Y = [Y_1 Y_2 Y_3]:

dY = BY dt + dω    (10)

Then, since ω is a Wiener process with zero-mean white Gaussian noise as a derivative, E[Y[t] | Y(t,−∞]] = BY(t), and analyzing influence amounts to estimating the parameters B of the SDE. However, if one were to observe a discretely sampled version X[k] = Y(k∆t) at sampling interval ∆t and model this with the discrete autoregressive model above, this would be inadequate to estimate the SDE parameters for large ∆t, since the exact relations between continuous and discrete system matrices are known to be:

A = e^{B∆t} = I + Σ_{i=1..∞} (∆t^i / i!) B^i
Σ_e = ∫_t^{t+∆t} e^{Bs} Σ_ω e^{B′s} ds    (11)

The power series expansion of the matrix exponential in the first line shows A to be a weighted sum of successive matrix powers B^i of the continuous time system matrix. Thus, A will contain contributions from direct (in B) and indirect (in i steps in B^i) causal links between the modeled areas. The contribution of the more indirect links is progressively down-weighted with the number of causal steps from one area to another and is smaller when the sampling interval ∆t is smaller.
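The appearance of indirect links in the discrete matrix can be made concrete with a small numerical sketch (illustrative only; the three-region chain matrix B is invented). It evaluates the truncated power series for A = e^{B∆t} and shows that a 1 → 2 → 3 chain with no direct 1 → 3 entry in B still produces a nonzero 1 → 3 entry in A, which shrinks with the sampling interval:

```python
import numpy as np

def discrete_A(B, dt, terms=30):
    # truncated power series for the matrix exponential A = exp(B * dt)
    A = np.eye(B.shape[0])
    P = np.eye(B.shape[0])
    for i in range(1, terms):
        P = P @ (B * dt) / i      # accumulates (B*dt)^i / i!
        A = A + P
    return A

# Hypothetical three-region chain: region 1 drives 2, region 2 drives 3,
# and there is no direct 1 -> 3 coupling in the continuous matrix B
B = np.array([[-1.0,  0.0,  0.0],
              [ 0.9, -1.0,  0.0],
              [ 0.0,  0.9, -1.0]])

A_slow = discrete_A(B, dt=2.0)    # coarse sampling (fMRI-like interval)
A_fast = discrete_A(B, dt=0.1)    # fine sampling
print(abs(A_slow[2, 0]) > 1e-6)               # True: an indirect 1 -> 3 link appears in A
print(abs(A_fast[2, 0]) < abs(A_slow[2, 0]))  # True: it shrinks with smaller dt
```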
This makes clear that multivariate discrete signal models have some undesirable properties for coarsely sampled signals (i.e. a large ∆t with respect to the system dynamics), such as fMRI data. Critically, entirely ruling out indirect influences is not actually achieved merely by employing a multivariate discrete model. Furthermore, estimated WAGS influence (particularly the relative contribution of indirect links) is dependent on the employed sampling interval. However, the discrete system matrix still represents the presence and direction of influence, possibly mediated through other regions.

When the goal is to estimate WAGS influence for discrete data starting from a continuous time model, one has to model explicitly the mapping to discrete time. Mapping continuous time predictions to discrete samples is a well known topic in engineering and can be solved by explicit integration over discrete time steps as performed in (11) above. Although this defines the mapping from continuous to discrete parameters, it does not solve the reverse assignment of estimating continuous model parameters from discrete data.
Do<strong>in</strong>g so requires a solution to the alias<strong>in</strong>g problem (Mccrorie, 2003) <strong>in</strong>cont<strong>in</strong>uous stochastic system identification by sett<strong>in</strong>g sufficient conditions on the matrixlogarithm function to make Babove identifiable (uniquely def<strong>in</strong>ed) <strong>in</strong> terms of A.Interest<strong>in</strong>g <strong>in</strong> this regard is a l<strong>in</strong>e of work <strong>in</strong>itiated by Bergstrom (Bergstrom, 1966,(11)85


Roebroeck Seth Valdes-Sosa1984) and Phillips (Phillips, 1973, 1974) study<strong>in</strong>g the estimation of cont<strong>in</strong>uous timeAutoregressive models (McCrorie, 2002), and cont<strong>in</strong>uous time Autoregressive Mov<strong>in</strong>gAverage Models (Chambers and Thornton, 2009) from discrete data. This work restson the observation that the lag zero covariance matrix Σ e will show contemporaneouscovariance even if the cont<strong>in</strong>uous covariance matrix Σ ω is diagonal. In other words,the discrete noise becomes correlated over the discrete time-series because the randomfluctuations are aggregated over time. Rather than consider<strong>in</strong>g this a disadvantage, thisapproach tries to use both lag <strong>in</strong>formation (the AR part) and zero-lag covariance <strong>in</strong>formationto identify the underly<strong>in</strong>g cont<strong>in</strong>uous l<strong>in</strong>ear model.Notwithstand<strong>in</strong>g the desirability of a cont<strong>in</strong>uous time model for consistent <strong>in</strong>ferenceon WAGS <strong>in</strong>fluence, there are a few <strong>in</strong>variances of discrete VAR models, or moregenerally discrete Vector Autoregressive Mov<strong>in</strong>g Average (VARMA) models that allowtheir carefully qualified usage <strong>in</strong> estimat<strong>in</strong>g causal <strong>in</strong>fluence. The VAR formulation ofWAGS <strong>in</strong>fluence has the property of <strong>in</strong>variance under <strong>in</strong>vertible l<strong>in</strong>ear filter<strong>in</strong>g. Moreprecisely, a general measure of <strong>in</strong>fluence rema<strong>in</strong>s unchanged if channels are each premultipliedwith different <strong>in</strong>vertible lag operators (Geweke, 1982). However, <strong>in</strong> practicethe order of the estimated VAR model would need to be sufficient to accommodatethese operators. 
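The aggregation observation above, that a diagonal continuous noise covariance Σ_ω still yields a non-diagonal discrete innovation covariance Σ_e, can be verified numerically. The sketch below (a hypothetical two-region coupled system, not from the original text) approximates the integral Σ_e = ∫ e^{Bs} Σ_ω e^{B′s} ds over one sampling interval with a Riemann sum:

```python
import numpy as np

def expm(M, terms=40):
    # truncated power series for the matrix exponential
    A, P = np.eye(M.shape[0]), np.eye(M.shape[0])
    for i in range(1, terms):
        P = P @ M / i
        A = A + P
    return A

# Hypothetical coupled two-region system; the continuous noise is diagonal
B = np.array([[-1.0,  0.0],
              [ 0.8, -1.0]])
Sigma_w = np.eye(2)

dt, n = 1.0, 2000
Sigma_e = np.zeros((2, 2))
for s in np.linspace(0.0, dt, n):     # Riemann sum over the sampling interval
    E = expm(B * s)
    Sigma_e += E @ Sigma_w @ E.T * (dt / n)

# Aggregation over the interval induces contemporaneous covariance
print(abs(Sigma_e[0, 1]) > 0.01)   # True, despite Sigma_w being diagonal
```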
Beyond <strong>in</strong>vertible l<strong>in</strong>ear filter<strong>in</strong>g, a VARMA formulation has further<strong>in</strong>variances. Solo (2006) showed that causality <strong>in</strong> a VARMA model is preserved undersampl<strong>in</strong>g and additive noise. More precisely, if both local and contemporaneous <strong>in</strong>fluenceis considered (as def<strong>in</strong>ed above) the VARMA measure is preserved under sampl<strong>in</strong>gand under the addition of <strong>in</strong>dependent but colored noise to the different channels. F<strong>in</strong>ally,Amendola et al. (2010) shows the class of VARMA models to be closed underaggregation operations, which <strong>in</strong>clude both sampl<strong>in</strong>g and time-w<strong>in</strong>dow averag<strong>in</strong>g.3.2. State-space modelsA general state-space model for a cont<strong>in</strong>uous vector time-series y(t)can be formulatedwith the set of equations:ẋ(t) = f (x(t),v(t),Θ) + ω(t)y(t) = g(x(t),v(t),Θ) + ε(t)This expresses the observed time-series y(t)as a function of the state variables x(t),which are possibly hidden (i.e. unobserved) and observed exogenous <strong>in</strong>puts v(t), whichare possibly under control. All parameters <strong>in</strong> the model are grouped <strong>in</strong>to Θ. Notethat some generality is sacrificed from the start s<strong>in</strong>ce f and gdo not depend on t (Themodel is autonomous and generates stationary processes) or on ω(t) or ε(t), that is:noise enters only additively. The first set of equations, the transition equations or stateequations, describe the evolution of the dynamic system over time <strong>in</strong> terms of stochasticdifferential equations (SDEs, though technically only when ω(t) = Σẇ(t) with w(t) aWiener process), captur<strong>in</strong>g relations among the hidden state variables x(t) themselvesand the <strong>in</strong>fluence of exogenous <strong>in</strong>puts v(t). 
The second set of equations, the observation equations or measurement equations, describe how the measurement variables y(t) are obtained from the instantaneous values of the hidden state variables x(t) and the inputs v(t). In fMRI experiments the exogenous inputs v(t) mostly reflect experimental control and often have the form of a simple indicator function that can represent, for instance, the presence or absence of a visual stimulus. The vector-functions f and g can generally be non-linear.

The state-space formalism allows representation of a very large class of stochastic processes. Specifically, it allows representation of both so-called ‘black-box’ models, in which parameters are treated as means to adjust the fit to the data without reflecting physically meaningful quantities, and ‘grey-box’ models, in which the adjustable parameters do have a physical or physiological (in the case of the brain) interpretation. A prominent example of a black-box model in econometric time-series analysis and systems identification is the (discrete) Vector Autoregressive Moving Average model with exogenous inputs (VARMAX model), defined as (Ljung, 1999; Reinsel, 1997):

F(B) y_t = G(B) v_t + L(B) e_t  ⇔  Σ_{i=0..p} F_i y_{t−i} = Σ_{j=0..s} G_j v_{t−j} + Σ_{k=0..q} L_k e_{t−k}    (13)

Here, the backshift operator B is defined, for any η_t, as B^i η_t = η_{t−i}; F, G and L are polynomials in the backshift operator, such that e.g. F(B) = Σ_{i=0..p} F_i B^i; and p, s and q are the dynamic orders of the VARMAX(p,s,q) model. The minimal constraints on (13) to make it identifiable are F_0 = L_0 = I, which yields the standard VARMAX representation. The VARMAX model and its various reductions (by use of only one or two of the polynomials, e.g. VAR, VARX or VARMA models) have played a large role in time-series prediction and WAGS influence modeling. Thus, in the context of state space models it is important to consider that the VARMAX model form can be equivalently formulated in a discrete linear state space form:

x_{k+1} = A x_k + B v_k + K e_k
y_k = C x_k + D v_k + e_k    (14)

In turn, the discrete linear state space form can be explicitly accommodated by the continuous general state-space framework in (12) when we define:

f(x(t), v(t), Θ) ≃ F x(t) + G v(t)
g(x(t), v(t), Θ) ≃ H x(t) + D v(t)
ω(t) = K̃ ε(t)
Θ = {F, G, H, D, K̃, Σ_e}    (15)

Again, the exact relations between the discrete and continuous state space parameter matrices can be derived analytically by explicit integration over time (Ljung, 1999). And, as discussed above, wherever discrete data is used to model continuous influence relations, the problems of temporal aggregation and aliasing have to be taken into account. Although analytic solutions for the discretely sampled continuous linear systems exist, the discretization of the nonlinear stochastic model (12) does not have a unique global solution. However, physiological models of neuronal population dynamics and hemodynamics are formulated in continuous time and are mostly nonlinear, while fMRI data is inherently discrete with low sampling frequencies. Therefore, it is the discretization of the nonlinear dynamical stochastic models that is especially relevant to causal
analysis of fMRI data. A local linearization approach was proposed by (Ozaki, 1992) as a bridge between discrete time series models and nonlinear continuous dynamical systems models. Consider the nonlinear state equation without exogenous input:

ẋ(t) = f(x(t)) + ω(t)    (16)

The essential assumption in local linearization (LL) of this nonlinear system is to consider the Jacobian matrix J(l,m) = ∂f_l(x)/∂x_m as constant over the time period [t + ∆t, t]. This Jacobian plays the same role as the autoregressive matrix in the linear systems above. Integration over this interval gives the solution:

x_{k+1} = x_k + J^{−1}(e^{J∆t} − I) f(x_k) + e_{k+1}    (17)

where I is the identity matrix. Note that the integration should not be computed this way in practice, since it is numerically unstable, especially when the Jacobian is poorly conditioned. A list of robust and fast procedures is reviewed in (Valdes-Sosa et al., 2009). This solution is locally linear, but crucially it changes with the state at the beginning of each integration interval; this is how it accommodates nonlinearity (i.e., a state-dependent autoregression matrix). As above, the discretized noise shows instantaneous correlations due to the aggregation of ongoing dynamics within the span of a sampling period. Once again, this highlights the underlying mechanism for problems with temporal sub-sampling and aggregation for some discrete time models of WAGS influence.

4. Dynamic causality in fMRI connectivity analysis

Two streams of developments have recently emerged that make use of the temporal dynamics in the fMRI signal to analyse directed influence (effective connectivity): Granger causality analysis (GCA; Goebel et al., 2003; Roebroeck et al., 2005; Valdes-Sosa, 2004) in the tradition of time series analysis and WAGS influence on the one hand, and Dynamic Causal Modeling (DCM; Friston et al., 2003) in the tradition of system control on the other hand. As we will discuss in the final section, these approaches have recently started developing into an integrated single direction. However, initially each was focused on separate issues that pose challenges for the estimation of causal influence from fMRI data. Whereas DCM is formulated as an explicit grey-box state space model to account for the temporal convolution of neuronal events by sluggish hemodynamics, GCA has been mostly aimed at solving the problem of region selection in the enormous dimensionality of fMRI data.

4.1. Hemodynamic deconvolution in a state space approach

While having a long history in engineering, state space modeling was only recently introduced for the inference of neural states from neuroimaging signals. The earliest attempts targeted estimating hidden neuronal population dynamics from scalp-level EEG data (Hernandez et al., 1996; Valdes-Sosa et al., 1999). This work first advanced the idea that state space models and appropriate filtering algorithms are an important tool to
estimate the trajectories of hidden neuronal processes from observed neuroimaging data if one can formulate an accurate model of the processes leading from neuronal activity to data records. A few years later, this idea was robustly transferred to fMRI data in the form of DCM (Friston et al., 2003). DCM combines three ideas about causal influence analysis in fMRI data (or neuroimaging data in general), which can be understood in terms of the discussion of the fMRI signal and state space models above (Daunizeau et al., 2009a).

First, neuronal interactions are best modeled at the level of unobserved (latent) signals, instead of at the level of observed BOLD signals. This requires a state space model with a dynamic model of neuronal population dynamics and interactions. The original model that was formulated for the dynamics of neuronal states x = {x_1, ..., x_N} is a bilinear ODE model:

ẋ = A x + Σ_j v_j B^j x + C v    (18)

That is, the noiseless neuronal dynamics are characterized by a linear term (with entries in A representing intrinsic coupling between populations), an exogenous term (with C representing the driving influence of experimental variables) and a bilinear term (with B^j representing the modulatory influence of experimental variables on coupling between populations). More recent work has extended this model, e.g. by adding a quadratic term (Stephan et al., 2008), stochastic dynamics (Daunizeau et al., 2009b) or multiple state variables per region (Marreiros et al., 2008).

Second, the latent neuronal dynamics are related to observed data by a generative (forward) model that accounts for the temporal convolution of neuronal events by slow and variably delayed hemodynamics. This generative forward model in DCM for fMRI is exactly the (simplified) balloon model set out in section 2. Thus, for every selected region a single state variable represents the neuronal or synaptic activity of a local population of neurons, and (in DCM for BOLD fMRI) four or five more represent hemodynamic quantities such as capillary blood volume, blood flow and deoxy-hemoglobin content. All state variables (and the equations governing their dynamics) that serve the mapping of neuronal activity to the fMRI measurements (including the observation equation) can be called the observation model. Most of the physiologically motivated generative model in DCM for fMRI is therefore concerned with an observation model encapsulating hemodynamics. The parameters in this model are estimated conjointly with the parameters quantifying neuronal connectivity. Thus, the forward biophysical model of hemodynamics is ‘inverted’ in the estimation procedure to achieve a deconvolution of fMRI time series and obtain estimates of the underlying neuronal states. DCM has also been applied to EEG/MEG, in which case the observation model encapsulates the lead-field matrix from neuronal sources to EEG electrodes or MEG sensors (Kiebel et al., 2009).

Third, the approach to estimating the hidden state trajectories (i.e. filtering and smoothing) and parameter values in DCM is cast in a Bayesian framework. In short, Bayes’ theorem is used to combine priors p(Θ|M) and likelihood p(y|Θ, M) into the
marginal likelihood or evidence:

p(y|M) = ∫ p(y|Θ, M) p(Θ|M) dΘ    (19)

Here, the model M is understood to define the priors on all parameters and the likelihood through the generative models for neuronal dynamics and hemodynamics. A posterior for the parameters p(Θ|y, M) can be obtained as the distribution over parameters which maximizes the evidence (19). Since this optimization problem has no analytic solution and is intractable with numerical sampling schemes for complex models such as DCM, approximations must be used. The inference approach for DCM relies on variational Bayes methods (Beal, 2003) that optimize an approximation density q(Θ) to the posterior. The approximation density is taken to have a Gaussian form, which is often referred to as the “Laplace approximation” (Friston et al., 2007). In addition to the approximate posterior on the parameters, the variational inference also results in a lower bound on the evidence, sometimes referred to as the “free energy”. This lower bound (or other approximations to the evidence, such as the Akaike Information Criterion or the Bayesian Information Criterion) is used for model comparison (Penny et al., 2004).
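As a toy illustration of comparing models through an evidence approximation (here the BIC, used as a crude stand-in; the data and both candidate models are synthetic and not from the original text), one can score a model of y that includes the past of a coupled series x against one that omits it:

```python
import numpy as np

# Synthetic data: x drives y with a one-step lag
rng = np.random.default_rng(2)
T = 500
x, y = np.zeros(T), np.zeros(T)
for t in range(1, T):
    x[t] = 0.5 * x[t - 1] + rng.standard_normal()
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + rng.standard_normal()

def bic(target, X):
    # Gaussian BIC for a least-squares model: n*log(RSS/n) + k*log(n)
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    rss = float(((target - X @ beta) ** 2).sum())
    n, k = len(target), X.shape[1]
    return n * np.log(rss / n) + k * np.log(n)

yt = y[1:]
bic_restricted = bic(yt, np.column_stack([y[:-1]]))        # y's own past only
bic_full = bic(yt, np.column_stack([y[:-1], x[:-1]]))      # adds the past of x
print(bic_full < bic_restricted)   # True: the coupled model wins despite its penalty
```

The k·log(n) penalty is the complexity charge that offsets improvements in fit, so the better score of the coupled model reflects genuine predictive gain rather than extra parameters.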
Importantly, these quantities explicitly balance goodness-of-itaga<strong>in</strong>st model complexity as a means of avoid<strong>in</strong>g overfitt<strong>in</strong>g.An important limit<strong>in</strong>g aspect of DCM for fMRI is that the models M that are comparedalso (implicitly) conta<strong>in</strong> an anatomical model or structural model that conta<strong>in</strong>s i)a selection of the ROIs <strong>in</strong> the bra<strong>in</strong> that are assumed to be of importance <strong>in</strong> the cognitiveprocess or task under <strong>in</strong>vestigation, ii) the possible <strong>in</strong>teractions between those structuresand iii) the possible effects of exogenous <strong>in</strong>puts onto the network. In other words, eachmodel M specifies the nodes and edges <strong>in</strong> a directed (possibly cyclic) structural graphmodel. S<strong>in</strong>ce the anatomical model also determ<strong>in</strong>es the selected part y of the totaldataset (all voxels) one cannot use the evidence to compare different anatomical models.This is because the evidence of different anatomical models is def<strong>in</strong>ed over differentdata. Applications of DCM to date <strong>in</strong>variably use very simple anatomical models (typicallyemploy<strong>in</strong>g 3-6 ROIs) <strong>in</strong> comb<strong>in</strong>ation with its complex parameter-rich dynamicalmodel discussed above. The clear danger with overly simple anatomical models is thatof spurious <strong>in</strong>fluence: an erroneous <strong>in</strong>fluence found between two selected regions that<strong>in</strong> reality is due to <strong>in</strong>teractions with additional regions which have been ignored. 
Prototypical examples of spurious influence, of relevance in brain connectivity, are those between unconnected structures A and B that receive common input from, or are linked through, an unmodeled intervening region C.

4.2. Exploratory approaches for model selection

Early applications of WAGS influence to fMRI data were aimed at counteracting the problems with overly restrictive anatomical models by employing more permissive anatomical models in combination with a simple dynamical model (Goebel et al., 2003; Roebroeck et al., 2005; Valdes-Sosa, 2004). These applications reflect the observation


Causal analysis of fMRI

that estimation of mathematical models from time-series data generally has two important aspects: model selection and model identification (Ljung, 1999). In the model selection stage a class of models is chosen by the researcher that is deemed suitable for the problem at hand. In the model identification stage the parameters in the chosen model class are estimated from the observed data record. In practice, model selection and identification often occur in a somewhat interactive fashion where, for instance, model selection can be informed by the fit of different models to the data achieved in an identification step. The important point is that model selection involves a mixture of choices and assumptions on the part of the researcher and the information gained from the data record itself. These considerations indicate that an important distinction must be made between exploratory and confirmatory approaches, especially in structural model selection procedures for brain connectivity. Exploratory techniques use information in the data to investigate the relative applicability of many models. As such, they have the potential to detect 'missing' regions in structural models. Confirmatory approaches, such as DCM, test hypotheses about connectivity within a set of models assumed to be applicable. Sources of common input or intervening causes are taken into account in a multivariate confirmatory model, but only if the employed structural model allows it (i.e.
if the common input or intervening node is incorporated in the model).

The technique of Granger Causality Mapping (GCM) was developed to explore all regions in the brain that interact with a single selected reference region, using autoregressive modeling of fMRI time-series (Roebroeck et al., 2005). By employing a simple bivariate model containing the reference region and, in turn, every other voxel in the brain, the sources and targets of influence for the reference region can be mapped. It was shown that such an 'exploratory' mapping approach can form an important tool in structural model selection. Although a bivariate model does not discern direct from indirect influences, the mapping approach locates potential sources of common input and areas that could act as intervening network nodes. In addition, by settling for a bivariate model one trivially avoids the conflation of direct and indirect influences that can arise in discrete AR models due to temporal aggregation, as discussed above. Other applications of autoregressive modeling to fMRI data have considered full multivariate models on large sets of selected brain regions, illustrating the possibility of estimating high-dimensional dynamical models. For instance, Valdes-Sosa (2004) and Valdes-Sosa et al.
(2005b) applied these models to parcellations of the entire cortex in conjunction with sparse regression approaches that enforce an implicit structural model selection within the set of parcels. In another, more recent example (Deshpande et al., 2008), a full multivariate model was estimated over 25 ROIs (that were found to be activated in the investigated task) together with an explicit reduction procedure to prune regions from the full model as a structural model selection procedure. Additional variants of VAR-model-based causal inference that have been applied to fMRI include time-varying influence (Havlicek et al., 2010), blockwise (or 'cluster-wise') influence from one group of variables to another (Barrett et al., 2010; Sato et al., 2010) and frequency-decomposed influence (Sato et al., 2009).
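The flavor of such a reduction procedure can be conveyed with a toy sketch. Coefficient thresholding here is a deliberately crude stand-in for the sparse-regression and statistical pruning steps of the cited studies, and the three-region system, coupling values and threshold are all invented for illustration:

```python
import random

random.seed(0)

# Ground-truth VAR(1): region 0 drives region 1, region 1 drives region 2
# (an edge (i, j) below means "region j influences region i").
A_true = [[0.5, 0.0, 0.0],
          [0.4, 0.5, 0.0],
          [0.0, 0.4, 0.5]]
n, d = 400, 3

x = [[0.0] * d]
for _ in range(n - 1):
    prev = x[-1]
    x.append([sum(A_true[i][j] * prev[j] for j in range(d)) + random.gauss(0, 1)
              for i in range(d)])

def solve(G, rhs):
    """Solve the normal equations G a = rhs by Gaussian elimination."""
    m = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(G)]
    for c in range(m):
        p = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, m):
            f = M[r][c] / M[c][c]
            for k in range(c, m + 1):
                M[r][k] -= f * M[c][k]
    a = [0.0] * m
    for c in reversed(range(m)):
        a[c] = (M[c][m] - sum(M[c][k] * a[k] for k in range(c + 1, m))) / M[c][c]
    return a

# Full multivariate model: per-equation OLS of x_i(t) on all regions at t-1.
lagged, A_hat = x[:-1], []
for i in range(d):
    y = [x[t][i] for t in range(1, n)]
    G = [[sum(z[a] * z[b] for z in lagged) for b in range(d)] for a in range(d)]
    rhs = [sum(z[a] * yi for z, yi in zip(lagged, y)) for a in range(d)]
    A_hat.append(solve(G, rhs))

# Crude pruning step: keep only connections with sizable coefficients.
edges = {(i, j) for i in range(d) for j in range(d) if abs(A_hat[i][j]) > 0.2}
print(sorted(edges))
```

With 400 samples the true 0.4–0.5 couplings stand well above the sampling noise of the zero entries, so the surviving edge set should recover the generating pattern; real applications replace the fixed threshold with sparsity penalties or significance tests.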


The initial developments in autoregressive modeling of fMRI data led to a number of interesting applications studying human mental states and cognitive processes, such as gestural communication (Schippers et al., 2010), top-down control of visual spatial attention (Bressler et al., 2008), switching between executive control and default-mode networks (Sridharan et al., 2008), fatigue (Deshpande et al., 2009) and the resting state (Uddin et al., 2009). Nonetheless, the inability of AR models to account for the varying hemodynamics convolving the signals of interest, and for the aggregation of dynamics between time samples, has prompted a set of validation studies evaluating the conditions under which discrete AR models can provide reliable connectivity estimates. In (Roebroeck et al., 2005) simulations were performed to validate the use of bivariate AR models in the face of hemodynamic convolution and sampling. They showed that under these conditions (even without variability in hemodynamics) AR estimates for a unidirectional influence are biased towards inferring bidirectional causality, a well-known problem when dealing with aggregated time series (Wei, 1990). They then went on to show that unbiased non-parametric inference for bivariate AR models can instead be based on a difference of influence terms (X → Y − Y → X).
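Such a difference-of-influence statistic can be sketched numerically. The toy below simulates a unidirectional two-variable system (all coupling values invented), computes Geweke-style log variance-ratio influence terms from single-lag OLS fits, and takes their difference:

```python
import math
import random

random.seed(1)

# Unidirectional toy system: x drives y with lag 1; y never drives x.
n = 2000
x, y = [random.gauss(0, 1)], [random.gauss(0, 1)]
for t in range(1, n):
    x.append(0.5 * x[t - 1] + random.gauss(0, 1))
    y.append(0.5 * y[t - 1] + 0.8 * x[t - 1] + random.gauss(0, 1))

def res_var(resid):
    return sum(r * r for r in resid) / len(resid)

def restricted(z, a):
    """Residual variance of z_t regressed (OLS) on a_{t-1} alone."""
    c = (sum(a[t - 1] * z[t] for t in range(1, n)) /
         sum(a[t - 1] ** 2 for t in range(1, n)))
    return res_var([z[t] - c * a[t - 1] for t in range(1, n)])

def full(z, a, b):
    """Residual variance of z_t regressed (OLS) on a_{t-1} and b_{t-1}."""
    Saa = sum(a[t - 1] ** 2 for t in range(1, n))
    Sbb = sum(b[t - 1] ** 2 for t in range(1, n))
    Sab = sum(a[t - 1] * b[t - 1] for t in range(1, n))
    Saz = sum(a[t - 1] * z[t] for t in range(1, n))
    Sbz = sum(b[t - 1] * z[t] for t in range(1, n))
    det = Saa * Sbb - Sab * Sab
    ca = (Sbb * Saz - Sab * Sbz) / det
    cb = (Saa * Sbz - Sab * Saz) / det
    return res_var([z[t] - ca * a[t - 1] - cb * b[t - 1] for t in range(1, n)])

# Geweke-style influence terms and their difference.
F_xy = math.log(restricted(y, y) / full(y, y, x))  # X -> Y
F_yx = math.log(restricted(x, x) / full(x, x, y))  # Y -> X
print(F_xy - F_yx)  # clearly positive: net influence runs from x to y
```

The sign of the difference, rather than each term alone, carries the directional conclusion, which is what makes the statistic robust to symmetric biases.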
In addition, they posited that inference on such influence estimates should always include experimental modulation of influence, in order to rule out hemodynamic variation as an underlying reason for spurious causality. In Deshpande et al. (2010) the authors simulated fMRI data by manipulating the causal influence and neuronal delays between local field potentials (LFPs) acquired from the macaque cortex, and by varying the hemodynamic delays of a convolving hemodynamic response function, the signal-to-noise ratio (SNR) and the sampling period of the final simulated fMRI data. They found that in multivariate (4-dimensional) simulations with hemodynamic and neuronal delays drawn from a uniform random distribution, correct network detection from fMRI was well above chance and was up to 90% under conditions of fast sampling and low measurement noise. Other studies confirmed the observation that techniques with intermediate temporal resolution, such as fMRI, can yield good estimates of the causal connections based on AR models (Stevenson and Kording, 2010), even in the face of variable hemodynamics (Ryali et al., 2010). However, another recent simulation study, investigating a host of connectivity methods, concluded that AR models detect directed influence poorly under general conditions (Smith et al., 2010).

4.3. Toward integrated models

David et al. (2008) aimed at a direct comparison of autoregressive modeling and DCM for fMRI time series and explicitly pointed at deconvolution of variable hemodynamics for causality inferences.
The authors created a controlled animal experiment in which gold-standard validation of neuronal connectivity estimation was provided by intracranial EEG (iEEG) measurements. As discussed extensively in Friston (2009b) and Roebroeck et al. (2009a), such a validation experiment can provide important information on best practices in fMRI-based brain connectivity modeling that, however, needs to be carefully discussed and weighed. In David et al.'s study, simultaneous fMRI, EEG, and iEEG were measured in 6 rats during epileptic episodes in which spike-and-wave discharges (SWDs) spread through the brain. fMRI was used to map the hemodynamic response throughout the brain to seizure activity, where ictal and interictal states were quantified by the simultaneously recorded EEG. Three structures were selected by the authors as the crucial nodes in the network that generates and sustains seizure activity and further analysed with i) DCM, ii) simple AR modeling of the fMRI signal and iii) AR modeling applied to neuronal state-variable estimates obtained with a hemodynamic deconvolution step. By applying G-causality analysis to deconvolved fMRI time-series, the stochastic dynamics of the linear state-space model are augmented with the complex biophysically motivated observation model in DCM. This step is crucial if the goal is to compare the dynamic connectivity models and draw conclusions on the relative merits of linear stochastic models (explicitly estimating WAGS influence) and bilinear deterministic models.
The results showed both AR analysis after deconvolution and DCM analysis to be in accordance with the gold-standard iEEG analyses, identifying the most pertinent influence relations undisturbed by variations in HRF latencies. In contrast, the final result of simple AR modeling of the fMRI signal showed less correspondence with the gold standard, due to the confounding effects of different hemodynamic latencies, which are not accounted for in the model.

Two important lessons can be drawn from David et al.'s study and the ensuing discussions (Bressler and Seth, 2010; Daunizeau et al., 2009a; David, 2009; Friston, 2009b,a; Roebroeck et al., 2009a,b). First, it confirms again the distorting effects of hemodynamic processes on the temporal structure of fMRI signals and, more importantly, that the difference in hemodynamics in different parts of the brain can form a confound for dynamic brain connectivity models (Roebroeck et al., 2005). Second, state-space models that embody observation models connecting latent neuronal dynamics to observed fMRI signals have the potential to identify causal influence unbiased by this confound. As a consequence, substantial recent methodological work has aimed at combining different models of latent neuronal dynamics with a form of hemodynamic observation model in order to provide an inversion or filtering algorithm for estimation of parameters and hidden state trajectories.
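The hemodynamic-latency confound at issue is easy to reproduce in a toy simulation (all kernels, timings and signals below are invented): two regions share identical neuronal activity, but delaying one region's response kernel already makes the measured signals appear lag-ordered.

```python
import math

# Identical neuronal activity in two regions; only region B's response
# kernel is delayed (crude gamma-like shape, invented parameters).
def hrf(delay=0, length=12):
    base = [t ** 3 * math.exp(-t) for t in range(length)]
    return [0.0] * delay + base

def convolve(signal, kernel):
    return [sum(kernel[k] * signal[t - k] for k in range(len(kernel)) if t >= k)
            for t in range(len(signal))]

neural = [1.0 if t % 17 < 3 else 0.0 for t in range(200)]  # shared events
bold_a = convolve(neural, hrf(delay=0))
bold_b = convolve(neural, hrf(delay=2))  # B's hemodynamics lag by 2 samples

# Lag at which A best predicts B: positive lag reads as "A leads B".
def xcorr(lag):
    lo, hi = max(0, -lag), min(len(bold_a), len(bold_b) - lag)
    return sum(bold_a[t] * bold_b[t + lag] for t in range(lo, hi))

best = max(range(-5, 6), key=xcorr)
print(best)  # 2: a purely hemodynamic delay masquerades as a neuronal lead
```

Since the neuronal activity in the two regions is identical by construction, any apparent lead-lag structure here is entirely an artifact of the observation process, which is exactly why region-specific HRFs must be modeled or deconvolved.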
Following the original formulation of DCM, which provides a bilinear ODE form for the hidden neuronal dynamics, attempts have been made at explicit integration of hemodynamic convolution with stochastic dynamic models that are interpretable in the framework of WAGS influence.

For instance, in (Ryali et al., 2010), following earlier work (Penny et al., 2005; Smith et al., 2009), a discrete state-space model with a bilinear vector autoregressive model is proposed to quantify dynamic neuronal state evolution and both intrinsic and modulatory interactions:

x_k = A x_{k−1} + Σ_j v_k^j B^j x_{k−1} + C v_k + ε_k
x̃_k^m = [ x_k^m, x_{k−1}^m, … , x_{k−L+1}^m ]
y_k^m = β^m Φ x̃_k^m + e_k^m    (20)

Here, we index exogenous inputs with j and ROIs with m in superscripts. The entries in the autoregressive matrix A, the exogenous influence matrix C and the bilinear matrices B^j have the same interpretation as in deterministic DCM. The relation between


observed BOLD-fMRI data y and latent neuronal sources x is modeled by a temporal embedding of x^m into x̃_k^m for each region or ROI m. This allows convolution with a flexible basis-function expansion of possible HRF shapes to be represented by a simple matrix multiplication β^m Φ x̃_k^m in the observation equation. Here Φ contains the temporal basis functions in Figure 2B and β^m the basis-function parameters to be estimated. By estimating basis-function parameters individually per region, variations in HRF shape between regions can be accounted for and their confounding effects on WAGS influence estimates can be avoided. Ryali et al. found that robust estimates of the parameters Θ = {A, B^j, C, β^m, Σ_ε, Σ_e} and the states x_k can be obtained with a variational Bayesian approach. In their simulations, they show that a state-space model with interactions modeled at the latent level can compensate well for the effects of HRF variability, even when relative HRF delays are opposed to delayed interactions. Note, however, that subsampling of the BOLD signal is not explicitly characterized in their state-space model.

A few interesting variations on this discrete state-space modeling have recently been proposed. For instance, in (Smith et al., 2009) a switching linear systems model, rather than a bilinear model, was used for latent neuronal state evolution. This model represents experimental modulation of connections as a random variable, to be learned from the data.
This variable switches between different linear system instantiations that each characterize connectivity in a single experimental condition. Such a scheme has the important advantage that an n-fold cross-validation approach can be used to obtain a measure of absolute model evidence (rather than evidence relative to a selected set of models). Specifically, one could learn parameters for each context-specific linear system with knowledge of the timing of changing experimental conditions in a training data set. The accuracy with which experimental condition periods in a test data set are then classified on the basis of connectivity provides an absolute model-fit measure, controlled for model complexity, which can be used to validate the overall usefulness of the fitted model. In particular, poor classification accuracy can point to important brain regions missing from the model.

Another related line of development has instead involved generalizing the ODE models in DCM for fMRI to stochastic dynamic models formulated in continuous time (Daunizeau et al., 2009b; Friston et al., 2008). An early exponent of this approach used local linearization in a (generalized) Kalman filter to estimate states and parameters in non-linear SDE models of hemodynamics (Riera et al., 2004).
Interestingly, the inclusion of stochastics in the state equations makes inference on coupling parameters of such models usefully interpretable in the framework of WAGS influence. This hints at the ongoing convergence, in the modeling of brain connectivity, of time series approaches to causality in a discrete-time tradition with dynamic systems and control theory approaches in a continuous-time tradition.

5. Discussion and Outlook

The modeling of an enormously complex biological system such as the brain poses many challenges. The abstractions and choices to be made in useful models of brain connectivity are therefore unlikely to be accommodated by one single 'master' model that


does better than all other models on all counts. Nonetheless, the ongoing development efforts towards improved approaches are continually extending and generalizing the contexts in which dynamic time series models can be applied. It is clear that state-space modeling and inference on WAGS influence are fundamental concepts within this endeavor. We end here with some considerations of dynamic brain connectivity models that summarize important points and anticipate future developments.

We have emphasized that WAGS influence models of brain connectivity have largely been aimed at data-driven exploratory analysis, whereas biophysically motivated state-space models are mostly used for hypothesis-led confirmatory analysis. This is especially relevant in the interaction between model selection and model identification. Exploratory techniques use information in the data to investigate the relative applicability of many models. As such, they have the potential to detect 'missing' regions in anatomical models. Confirmatory approaches test hypotheses about connectivity within a set of models assumed to be applicable.

As mentioned above, the WAGS influence approach to statistical analysis of causal influence that we focused on here is complemented by the interventional approach rooted in the theory of graphical models and causal calculus. Graphical causal models have recently been applied to brain connectivity analysis of fMRI data (Ramsey et al., 2009).
Recent work combining the two approaches (White and Lu, 2010) possibly leads the way to a combined causal treatment of brain imaging data incorporating dynamic models and interventions. Such a combination could enable the incorporation of direct manipulation of brain activity by (for example) transcranial magnetic stimulation (Pascual-Leone et al., 2000; Paus, 1999; Walsh and Cowey, 2000) into the current state-space modeling framework.

Causal models of brain connectivity are increasingly inspired by biophysical theories. For fMRI this is primarily applicable in modeling the complex chain of events separating neuronal population activity from the BOLD signal. Inversion of such a model (in state-space form) by a suitable filtering algorithm amounts to a model-based deconvolution of the fMRI signal, resulting in an estimate of latent neuronal population activity. If the biophysical model is appropriately formulated to be identifiable (possibly including priors on relevant parameters), it can take into account the variation in hemodynamics between brain regions that can otherwise confound time series causality analyses of fMRI signals. Although models of hemodynamics for causal fMRI analysis have reached a reasonable level of complexity, the models of neuronal dynamics used to date have remained simple, comprising one or two state variables for an entire cortical region or subcortical structure.
Realistic dynamic models of neuronal activity have a long history and have reached a high level of sophistication (Deco et al., 2008; Markram, 2006). It remains an open issue to what degree complex realistic equation systems can be embedded in the analysis of fMRI – or, in fact, of any brain imaging modality – and result in identifiable models of neuronal connectivity and computation.

Two recent developments create opportunities to increase the complexity and realism of neuronal dynamics models and to move the level of modeling from the macroscopic (whole brain areas) towards the mesoscopic level comprising sub-populations of areas


and cortical columns. First, the fusion of multiple imaging modalities, possibly recorded simultaneously, has received a great deal of attention. In particular, several attempts at model-driven fusion of simultaneously recorded fMRI and EEG data, by inverting a separate observation model for each modality while using the same underlying neuronal model, have been reported (Deneux and Faugeras, 2010; Riera et al., 2007; Valdes-Sosa et al., 2009). This approach holds great potential to fruitfully combine the superior spatial resolution of fMRI with the superior temporal resolution of EEG. In (Valdes-Sosa et al., 2009) anatomical connectivity information obtained from diffusion tensor imaging and fiber tractography is also incorporated. Second, advances in MRI technology, particularly increases of the main field strength to 7T (and beyond) and advances in parallel imaging (de Zwart et al., 2006; Heidemann et al., 2006; Pruessmann, 2004; Wiesinger et al., 2006), greatly increase the level of spatial detail accessible with fMRI. For instance, fMRI at 7T with sufficient spatial resolution to resolve orientation columns in human visual cortex has been reported (Yacoub et al., 2008).

The development of state-space models for causal analysis of fMRI data has moved from discrete to continuous and from deterministic to stochastic models. Continuous models with stochastic dynamics have desirable properties, chief among them robust inference on causal influence interpretable in the WAGS framework, as discussed above.
However, dealing with continuous stochastic models leads to technical issues such as the properties and interpretation of Wiener processes and Itô calculus (Friston, 2008). A number of inversion or filtering methods for continuous stochastic models have recently been proposed, particularly for the goal of causal analysis of brain imaging data, including the local linearization and innovations approach (Hernandez et al., 1996; Riera et al., 2004), dynamic expectation maximization (Friston et al., 2008) and generalized filtering (Friston et al., 2010). The ongoing development of these filtering methods, their validation and their scalability towards large numbers of state variables will be a topic of continuing research.

Acknowledgments

The authors thank Kamil Uludag for comments and discussion.

References

Odd O. Aalen. Dynamic modeling and causality. Scandinavian Actuarial Journal, pages 177–190, 1987.
O. O. Aalen and A. Frigessi. What can statistics contribute to a causal understanding? Scandinavian Journal of Statistics, 34:155–168, 2007.
G. K. Aguirre, E. Zarahn, and M. D'Esposito. The variability of human, BOLD hemodynamic responses. NeuroImage, 8(4):360–9, 1998.
H. Akaike. On the use of a linear model for the identification of feedback systems. Annals of the Institute of Statistical Mathematics, 20(1):425–439, 1968.


A. Amendola, M. Niglio, and C. Vitale. Temporal aggregation and closure of VARMA models: Some new results. In F. Palumbo et al., editors, Data Analysis and Classification: Studies in Classification, Data Analysis, and Knowledge Organization, pages 435–443. Springer Berlin/Heidelberg, 2010.
D. Attwell and C. Iadecola. The neural basis of functional brain imaging signals. Trends Neurosci, 25(12):621–5, 2002.
D. Attwell, A. M. Buchan, S. Charpak, M. Lauritzen, B. A. Macvicar, and E. A. Newman. Glial and neuronal control of brain blood flow. Nature, 468(7321):232–43, 2010.
A. B. Barrett, L. Barnett, and A. K. Seth. Multivariate Granger causality and generalized variance. Phys Rev E Stat Nonlin Soft Matter Phys, 81(4 Pt 1):041907, 2010.
M. J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, University College London, 2003.
A. R. Bergstrom. Nonrecursive models as discrete approximations to systems of stochastic differential equations. Econometrica, 34:173–182, 1966.
A. R. Bergstrom. Continuous time stochastic models and issues of aggregation. In Z. Griliches and M. D. Intriligator, editors, Handbook of Econometrics, volume II. Elsevier, 1984.
M. A. Bernstein, K. F. King, and X. J. Zhou. Handbook of MRI Pulse Sequences. Elsevier Academic Press, Burlington, 2004.
S. L. Bressler and A. K. Seth. Wiener-Granger causality: A well established methodology. NeuroImage, 2010.
S. L. Bressler, W. Tang, C. M. Sylvester, G. L. Shulman, and M. Corbetta. Top-down control of human visual cortex by frontal and parietal cortex in anticipatory visual spatial attention. J Neurosci, 28(40):10056–61, 2008.
R. B. Buxton. The elusive initial dip. NeuroImage, 13(6 Pt 1):953–8, 2001.
R. B. Buxton, E. C. Wong, and L. R. Frank.
Dynamics of blood flow and oxygenation changes during brain activation: the balloon model. Magn Reson Med, 39(6):855–64, 1998.
R. B. Buxton, K. Uludag, D. J. Dubowitz, and T. T. Liu. Modeling the hemodynamic response to brain activation. NeuroImage, 23 Suppl 1:S220–33, 2004.
Marcus J. Chambers and Michael A. Thornton. Discrete time representation of continuous time ARMA processes, 2009.


Daniel Commenges and Anne Gegout-Petit. A general dynamical statistical model with possible causal interpretation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3):1–43, 2009.
J. Daunizeau, O. David, and K. E. Stephan. Dynamic causal modelling: A critical review of the biophysical and statistical foundations. NeuroImage, 2009a.
J. Daunizeau, K. J. Friston, and S. J. Kiebel. Variational Bayesian identification and prediction of stochastic nonlinear dynamic causal models. Physica D, 238(21):2089–2118, 2009b.
O. David. fMRI connectivity, meaning and empiricism. Comments on: Roebroeck et al. The identification of interacting networks in the brain using fMRI: Model selection, causality and deconvolution. NeuroImage, 2009.
O. David, I. Guillemain, S. Saillet, S. Reyt, C. Deransart, C. Segebarth, and A. Depaulis. Identifying neural drivers with functional MRI: an electrophysiological validation. PLoS Biol, 6(12):2683–97, 2008.
J. A. de Zwart, P. van Gelderen, X. Golay, V. N. Ikonomidou, and J. H. Duyn. Accelerated parallel imaging for functional imaging of the human brain. NMR Biomed, 19(3):342–51, 2006.
G. Deco, V. K. Jirsa, P. A. Robinson, M. Breakspear, and K. Friston. The dynamic brain: from spiking neurons to neural masses and cortical fields. PLoS Comput Biol, 4(8):e1000092, 2008.
T. Deneux and O. Faugeras. EEG-fMRI fusion of paradigm-free activity using Kalman filtering. Neural Comput, 22(4):906–48, 2010.
G. Deshpande, X. Hu, R. Stilla, and K. Sathian.
Effective connectivity during haptic perception: a study using Granger causality analysis of functional magnetic resonance imaging data. NeuroImage, 40(4):1807–14, 2008.
G. Deshpande, S. LaConte, G. A. James, S. Peltier, and X. Hu. Multivariate Granger causality analysis of fMRI data. Hum Brain Mapp, 30(4):1361–73, 2009.
G. Deshpande, K. Sathian, and X. Hu. Effect of hemodynamic variability on Granger causality analysis of fMRI. NeuroImage, 52(3):884–96, 2010.
J. Florens. Some technical issues in defining causality. Journal of Econometrics, 112:127–128, 2003.
J. P. Florens and D. Fougere. Noncausality in continuous time. Econometrica, 64(5):1195–1212, 1996.
K. Friston. Functional and effective connectivity in neuroimaging: A synthesis. Hum Brain Mapp, 2:56–78, 1994.


Causal analysis of fMRI

K. Friston. Beyond phrenology: what can neuroimaging tell us about distributed circuitry? Annu Rev Neurosci, 25:221–50, 2002.
K. Friston. Dynamic causal modeling and granger causality. Comments on: The identification of interacting networks in the brain using fmri: Model selection, causality and deconvolution. Neuroimage, 2009a.
K. Friston, J. Mattout, N. Trujillo-Barreto, J. Ashburner, and W. Penny. Variational free energy and the laplace approximation. Neuroimage, 34(1):220–34, 2007.
K. J. Friston, A. Mechelli, R. Turner, and C. J. Price. Nonlinear responses in fmri: the balloon model, volterra kernels, and other hemodynamics. Neuroimage, 12(4):466–77, 2000.
K. J. Friston, L. Harrison, and W. Penny. Dynamic causal modelling. Neuroimage, 19(4):1273–302, 2003.
K. J. Friston, N. Trujillo-Barreto, and J. Daunizeau. Dem: a variational treatment of dynamic systems. Neuroimage, 41(3):849–85, 2008.
Karl Friston. Hierarchical models in the brain. PLoS Computational Biology, 4, 2008.
Karl Friston. Causal modelling and brain connectivity in functional magnetic resonance imaging. PLoS Biology, 7:e33, 2009b.
Karl Friston, Klaas Stephan, Baojuan Li, and Jean Daunizeau. Generalised filtering. Mathematical Problems in Engineering, 2010:1–35, 2010.
J. F. Geweke. Measurement of linear dependence and feedback between multiple time series. Journal of the American Statistical Association, 77(378):304–324, 1982.
G. H. Glover. Deconvolution of impulse response in event-related bold fmri. Neuroimage, 9(4):416–29, 1999.
C. Glymour. Learning, prediction and causal bayes nets. Trends Cogn Sci, 7(1):43–48, 2003.
R. Goebel, A. Roebroeck, D. S. Kim, and E. Formisano. Investigating directed cortical interactions in time-resolved fmri data using vector autoregressive modeling and granger causality mapping. Magn Reson Imaging, 21(10):1251–61, 2003.
C. W. J. Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica, 37(3):424–438, 1969.
E.M. Haacke, R.W. Brown, M.R. Thompson, and R. Venkatesan. Magnetic Resonance Imaging: Physical Principles and Sequence Design. John Wiley and Sons, Inc, New York, 1999.


M. Havlicek, J. Jan, M. Brazdil, and V. D. Calhoun. Dynamic granger causality based on kalman filter for evaluation of functional network connectivity in fmri data. Neuroimage, 53(1):65–77, 2010.
R. M. Heidemann, N. Seiberlich, M. A. Griswold, K. Wohlfarth, G. Krueger, and P. M. Jakob. Perspectives and limitations of parallel mr imaging at high field strengths. Neuroimaging Clin N Am, 16(2):311–20, 2006.
R. N. Henson, C. J. Price, M. D. Rugg, R. Turner, and K. J. Friston. Detecting latency differences in event-related bold responses: application to words versus nonwords and initial versus repeated face presentations. Neuroimage, 15(1):83–97, 2002.
J. L. Hernandez, P. A. Valdés, and P. Vila. Eeg spike and wave modelled by a stochastic limit cycle. NeuroReport, 1996.
B. Horwitz, K. J. Friston, and J. G. Taylor. Neural modeling and functional brain imaging: an overview. Neural Netw, 13(8-9):829–46, 2000.
H. Johansen-Berg and T.E.J. Behrens, editors. Diffusion MRI: From quantitative measurement to in-vivo neuroanatomy. Academic Press, London, 2009.
D.K. Jones, editor. Diffusion MRI: Theory, Methods, and Applications. Oxford University Press, Oxford, 2010.
S. J. Kiebel, M. I. Garrido, R. Moran, C. C. Chen, and K. J. Friston. Dynamic causal modeling for eeg and meg. Hum Brain Mapp, 30(6):1866–76, 2009.
C. H. Liao, K. J. Worsley, J. B. Poline, J. A. Aston, G. H. Duncan, and A. C. Evans. Estimating the delay of the fmri response. Neuroimage, 16(3 Pt 1):593–606, 2002.
L. Ljung. System Identification: Theory for the User. Prentice-Hall, New Jersey, 2nd edition, 1999.
N. K. Logothetis. What we can do and what we cannot do with fmri. Nature, 453(7197):869–78, 2008.
N. K. Logothetis, J. Pauls, M. Augath, T. Trinath, and A. Oeltermann. Neurophysiological investigation of the basis of the fmri signal. Nature, 412(6843):150–7, 2001.
H. Markram. The blue brain project. Nat Rev Neurosci, 7(2):153–60, 2006.
A. C. Marreiros, S. J. Kiebel, and K. J. Friston. Dynamic causal modelling for fmri: a two-state model. Neuroimage, 39(1):269–78, 2008.
J. Roderick McCrorie. The likelihood of the parameters of a continuous time vector autoregressive model. Statistical Inference for Stochastic Processes, 5:273–286, 2002.
J. Roderick McCrorie. The problem of aliasing in identifying finite parameter continuous time stochastic models. Acta Applicandae Mathematicae, 79:9–16, 2003.


A. R. McIntosh. Contexts and catalysts: a resolution of the localization and integration of function in the brain. Neuroinformatics, 2(2):175–82, 2004.
T. Ozaki. A bridge between nonlinear time series models and nonlinear stochastic dynamical systems: A local linearization approach. Statistica Sinica, 2:113–135, 1992.
A. Pascual-Leone, V. Walsh, and J. Rothwell. Transcranial magnetic stimulation in cognitive neuroscience–virtual lesion, chronometry, and functional connectivity. Curr Opin Neurobiol, 10(2):232–7, 2000.
T. Paus. Imaging the brain before, during, and after transcranial magnetic stimulation. Neuropsychologia, 37(2):219–24, 1999.
J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, New York, 2nd edition, 2009.
W. Penny, Z. Ghahramani, and K. Friston. Bilinear dynamical systems. Philos Trans R Soc Lond B Biol Sci, 360(1457):983–93, 2005.
W. D. Penny, K. E. Stephan, A. Mechelli, and K. J. Friston. Comparing dynamic causal models. Neuroimage, 22(3):1157–72, 2004.
Peter C.B. Phillips. The problem of identification in finite parameter continuous time models. Journal of Econometrics, 1:351–362, 1973.
Peter C.B. Phillips. The estimation of some continuous time models. Econometrica, 42:803–823, 1974.
K. P. Pruessmann. Parallel imaging at high field strength: synergies and joint potential. Top Magn Reson Imaging, 15(4):237–44, 2004.
J. D. Ramsey, S. J. Hanson, C. Hanson, Y. O. Halchenko, R. A. Poldrack, and C. Glymour. Six problems for causal inference from fmri. Neuroimage, 49(2):1545–58, 2009.
A. Rauch, G. Rainer, and N. K. Logothetis. The effect of a serotonin-induced dissociation between spiking and perisynaptic activity on bold functional mri. Proc Natl Acad Sci U S A, 105(18):6759–64, 2008.
G.C. Reinsel. Elements of Multivariate Time Series Analysis. Springer-Verlag, New York, 2nd edition, 1997.
J. J. Riera, J. Watanabe, I. Kazuki, M. Naoki, E. Aubert, T. Ozaki, and R. Kawashima. A state-space model of the hemodynamic approach: nonlinear filtering of bold signals. Neuroimage, 21(2):547–67, 2004.
J. J. Riera, J. C. Jimenez, X. Wan, R. Kawashima, and T. Ozaki. Nonlinear local electrovascular coupling. ii: From data to neuronal masses. Hum Brain Mapp, 28(4):335–54, 2007.


A. Roebroeck, E. Formisano, and R. Goebel. Mapping directed influence over the brain using granger causality and fmri. Neuroimage, 25(1):230–42, 2005.
A. Roebroeck, E. Formisano, and R. Goebel. The identification of interacting networks in the brain using fmri: Model selection, causality and deconvolution. Neuroimage, 2009a.
A. Roebroeck, E. Formisano, and R. Goebel. Reply to Friston and David after comments on: The identification of interacting networks in the brain using fmri: Model selection, causality and deconvolution. Neuroimage, 2009b.
S. Ryali, K. Supekar, T. Chen, and V. Menon. Multivariate dynamical systems models for estimating causal interactions in fmri. Neuroimage, 2010.
Z. S. Saad, K. M. Ropella, R. W. Cox, and E. A. DeYoe. Analysis and use of fmri response delays. Hum Brain Mapp, 13(2):74–93, 2001.
R. Salmelin and J. Kujala. Neural representation of language: activation versus long-range connectivity. Trends Cogn Sci, 10(11):519–25, 2006.
J. R. Sato, D. Y. Takahashi, S. M. Arcuri, K. Sameshima, P. A. Morettin, and L. A. Baccala. Frequency domain connectivity identification: an application of partial directed coherence in fmri. Hum Brain Mapp, 30(2):452–61, 2009.
J. R. Sato, A. Fujita, E. F. Cardoso, C. E. Thomaz, M. J. Brammer, and Jr. E. Amaro. Analyzing the connectivity between regions of interest: an approach based on cluster granger causality for fmri data analysis. Neuroimage, 52(4):1444–55, 2010.
M. B. Schippers, A. Roebroeck, R. Renken, L. Nanetti, and C. Keysers. Mapping the information flow from one brain to another during gestural communication. Proc Natl Acad Sci U S A, 107(20):9388–93, 2010.
T. Schweder. Composable markov processes. Journal of Applied Probability, 7(2):400–410, 1970.
J. F. Smith, A. Pillai, K. Chen, and B. Horwitz. Identification and validation of effective connectivity networks in functional magnetic resonance imaging using switching linear dynamic systems. Neuroimage, 52(3):1027–40, 2009.
S. M. Smith, K. L. Miller, G. Salimi-Khorshidi, M. Webster, C. F. Beckmann, T. E. Nichols, J. D. Ramsey, and M. W. Woolrich. Network modelling methods for fmri. Neuroimage, 2010.
V. Solo. On causality i: Sampling and noise. Proceedings of the 46th IEEE Conference on Decision and Control, pages 3634–3639, 2006.
D. Sridharan, D. J. Levitin, and V. Menon. A critical role for the right fronto-insular cortex in switching between central-executive and default-mode networks. Proc Natl Acad Sci U S A, 105(34):12569–74, 2008.


K. E. Stephan, L. Kasper, L. M. Harrison, J. Daunizeau, H. E. den Ouden, M. Breakspear, and K. J. Friston. Nonlinear dynamic causal models for fmri. Neuroimage, 42(2):649–62, 2008.
I. H. Stevenson and K. P. Kording. On the similarity of functional connectivity between neurons estimated across timescales. PLoS One, 5(2):e9206, 2010.
K. Thomsen, N. Offenhauser, and M. Lauritzen. Principal neuron spiking: neither necessary nor sufficient for cerebral blood flow in rat cerebellum. J Physiol, 560(Pt 1):181–9, 2004.
L. Q. Uddin, A. M. Kelly, B. B. Biswal, F. Xavier Castellanos, and M. P. Milham. Functional connectivity of default mode network components: correlation, anticorrelation, and causality. Hum Brain Mapp, 30(2):625–37, 2009.
K. Ugurbil, L. Toth, and D. S. Kim. How accurate is magnetic resonance imaging of brain function? Trends Neurosci, 26(2):108–14, 2003.
K. Uludag. To dip or not to dip: reconciling optical imaging and fmri data. Proc Natl Acad Sci U S A, 107(6):E23; author reply E24, 2010.
K. Uludag, D. J. Dubowitz, and R. B. Buxton. Basic principles of functional mri. In R. Edelman, J. Hesselink, and M. Zlatkin, editors, Clinical MRI. Elsevier, San Diego, 2005.
K. Uludag, B. Muller-Bierl, and K. Ugurbil. An integrative model for neuronal activity-induced signal changes for gradient and spin echo functional imaging. Neuroimage, 48(1):150–65, 2009.
P. Valdes-Sosa, J. C. Jimenez, J. Riera, R. Biscay, and T. Ozaki. Nonlinear eeg analysis based on a neural mass model. Biological Cybernetics, 81:415–24, 1999.
P. Valdes-Sosa, A. Roebroeck, J. Daunizeau, and K. Friston. Effective connectivity: Influence, causality and biophysical modeling. Neuroimage, in press.
P. A. Valdes-Sosa. Spatio-temporal autoregressive models defined over brain manifolds. Neuroinformatics, 2(2):239–50, 2004.
P. A. Valdes-Sosa, R. Kotter, and K. J. Friston. Introduction: multimodal neuroimaging of brain connectivity. Philos Trans R Soc Lond B Biol Sci, 360(1457):865–7, 2005a.
P. A. Valdes-Sosa, J. M. Sanchez-Bornot, A. Lage-Castellanos, M. Vega-Hernandez, J. Bosch-Bayard, L. Melie-Garcia, and E. Canales-Rodriguez. Estimating brain functional connectivity with sparse multivariate autoregression. Philos Trans R Soc Lond B Biol Sci, 360(1457):969–81, 2005b.
P. A. Valdes-Sosa, J. M. Sanchez-Bornot, R. C. Sotero, Y. Iturria-Medina, Y. Aleman-Gomez, J. Bosch-Bayard, F. Carbonell, and T. Ozaki. Model driven eeg/fmri fusion of brain oscillations. Hum Brain Mapp, 30(9):2701–21, 2009.


V. Walsh and A. Cowey. Transcranial magnetic stimulation and cognitive neuroscience. Nat Rev Neurosci, 1(1):73–9, 2000.
W. W. S. Wei. Time Series Analysis: Univariate and Multivariate Methods. Addison-Wesley, Redwood City, 1990.
Halbert White and Xun Lu. Granger causality and dynamic structural systems. Journal of Financial Econometrics, 8(2):193–243, 2010.
N. Wiener. The theory of prediction. In E.F. Berkenbach, editor, Modern Mathematics for Engineers. McGraw-Hill, New York, 1956.
F. Wiesinger, P. F. Van de Moortele, G. Adriany, N. De Zanche, K. Ugurbil, and K. P. Pruessmann. Potential and feasibility of parallel mri at high field. NMR Biomed, 19(3):368–78, 2006.
E. Yacoub, N. Harel, and K. Ugurbil. High-field fmri unveils orientation columns in humans. Proc Natl Acad Sci U S A, 105(30):10607–12, 2008.


JMLR: Workshop and Conference Proceedings 12:95–114, 2011
Causality in Time Series

Causal Search in Structural Vector Autoregressive Models

Alessio Moneta (moneta@econ.mpg.de)
Max Planck Institute of Economics, Jena, Germany

Nadine Chlaß (nadine.chlass@uni-jena.de)
Friedrich Schiller University of Jena, Germany

Doris Entner (doris.entner@cs.helsinki.fi)
Helsinki Institute for Information Technology, Finland

Patrik Hoyer (patrk.hoyer@helsinki.fi)
Helsinki Institute for Information Technology, Finland

Editors: Florin Popescu and Isabelle Guyon

Abstract

This paper reviews a class of methods to perform causal inference in the framework of a structural vector autoregressive model. We consider three different settings. In the first setting the underlying system is linear with normal disturbances and the structural model is identified by exploiting the information incorporated in the partial correlations of the estimated residuals. Zero partial correlations are used as input of a search algorithm formalized via graphical causal models. In the second, semi-parametric, setting the underlying system is linear with non-Gaussian disturbances. In this case the structural vector autoregressive model is identified through a search procedure based on independent component analysis. Finally, we explore the possibility of causal search in a nonparametric setting by studying the performance of conditional independence tests based on kernel density estimations.

Keywords: Causal inference, econometric time series, SVAR, graphical causal models, independent component analysis, conditional independence tests

1. Introduction

1.1. Causal inference in econometrics

Applied economic research is pervaded by questions about causes and effects. For example, what is the effect of a monetary policy intervention? Is energy consumption causing growth or the other way around? Or does causality run in both directions? Are economic fluctuations mainly caused by monetary, productivity, or demand shocks? Does foreign aid improve living standards in poor countries? Does firms' expenditure in R&D causally influence their profits? Are recent rises in oil prices in part caused by speculation? These are seemingly heterogeneous questions, but they all require some knowledge of the causal process by which variables came to take the values we observe.

© 2011 A. Moneta, N. Chlaß, D. Entner & P. Hoyer.


Moneta Chlass Entner Hoyer

A traditional approach to address such questions hinges on the explicit use of a priori economic theory. The gist of this approach is to partition a causal process into a deterministic and a random part, and to articulate the deterministic part so as to reflect the causal dependencies dictated by economic theory. If the formulation of the deterministic part is accurate and reliable enough, the random part is expected to display properties that can easily be analyzed by standard statistical tools. The touchstone of this approach is represented by the work of Haavelmo (1944), which inspired the research program subsequently pursued by the Cowles Commission (Koopmans, 1950; Hood and Koopmans, 1953). Therein, the causal process is formalized by means of a structural equation model (SEM), that is, a system of equations with endogenous variables, exogenous variables, and error terms, first developed by Wright (1921). Its coefficients were given a causal interpretation (Pearl, 2000).

This approach was strongly criticized in the 1970s for being ineffective in both policy evaluation and forecasting. Lucas (1976) pointed out that the economic theory included in the SEM fails to take economic agents' (rational) motivations and expectations into consideration. Agents, according to Lucas, are able to anticipate policy intervention and act contrary to the prediction derived from the structural equation model, since the model usually ignores such anticipations. Sims (1980) put forth another critique which runs parallel to Lucas' one. It explicitly addresses the status of exogeneity which the Cowles Commission approach attributes (arbitrarily, according to Sims) to some variables so that the structural model can be identified. Sims argues that theory is not a reliable source for deeming a variable exogenous. More generally, the Cowles Commission approach, with its strong a priori commitment to theory, risks falling into a vicious circle: if causal information (even if only about direction) can exclusively be derived from background theory, how do we obtain an empirically justified theory? (Cf. Hoover, 2006, p. 75.)

An alternative approach has been pursued since Wiener (1956) and Granger's (1969) work. It aims at inferring causal relations directly from the statistical properties of the data, relying only to a minimal extent on background knowledge. Granger (1980) proposes a probabilistic concept of causality, similar to Suppes (1970). Granger defines causality in terms of the incremental predictability (at horizon one) of a time series variable {Y_t} (given the present and past values of {Y_t} and of a set {Z_t} of possibly relevant variables) when another time series variable {X_t} (in its present and past values) is not omitted. More formally, {X_t} Granger-causes {Y_t} if

P(Y_{t+1} | X_t, X_{t-1}, ..., Y_t, Y_{t-1}, ..., Z_t, Z_{t-1}, ...) ≠ P(Y_{t+1} | Y_t, Y_{t-1}, ..., Z_t, Z_{t-1}, ...)    (1)

As pointed out by Florens and Mouchart (1982), testing the hypothesis of Granger noncausality corresponds to testing conditional independence. Given lags p, {X_t} does not Granger-cause {Y_t} if

Y_{t+1} ⊥ (X_t, X_{t-1}, ..., X_{t-p}) | (Y_t, Y_{t-1}, ..., Y_{t-p}, Z_t, Z_{t-1}, ..., Z_{t-p})    (2)


Causal Search in SVAR

To test Granger noncausality, researchers often specify linear vector autoregressive (VAR) models:

Y_t = A_1 Y_{t-1} + ... + A_p Y_{t-p} + u_t,    (3)

in which Y_t is a k × 1 vector of time series variables (Y_{1,t}, ..., Y_{k,t})′, where ()′ denotes the transpose, the A_j (j = 1, ..., p) are k × k coefficient matrices, and u_t is the k × 1 vector of random disturbances. In this framework, testing the hypothesis that {Y_{i,t}} does not Granger-cause {Y_{j,t}} reduces to testing whether the (j, i) entries of the matrices A_1, ..., A_p vanish simultaneously. Granger noncausality tests have been extended to nonlinear settings by Baek and Brock (1992), Hiemstra and Jones (1994), and Su and White (2008), using nonparametric tests of conditional independence (more on this topic in section 4).

The concept of Granger causality has been criticized for failing to capture 'structural causality' (Hoover, 2008). Suppose one finds that a variable A Granger-causes another variable B. This does not necessarily imply that an economic mechanism exists by which A can be manipulated to affect B. The existence of such a mechanism in turn does not necessarily imply Granger causality either (for a discussion see Hoover 2001, pp. 150-155). Indeed, the analysis of Granger causality is based on coefficients of reduced-form models, like those incorporated in equation (3), which are unlikely to reliably represent actual economic mechanisms. For instance, in equation (3) the simultaneous causal structure is not modeled, in order to facilitate estimation.
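As a rough sketch of the coefficient-restriction test just described (the simulated data and coefficient values below are invented for illustration, not taken from the paper), Granger noncausality of {Y_{i,t}} for {Y_{j,t}} can be assessed with an ordinary F-test comparing the unrestricted equation for Y_j against a restricted one that omits the lags of Y_i:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a bivariate VAR(1) in which x Granger-causes y but not vice versa:
#   x_t = 0.5 x_{t-1} + e2_t,   y_t = 0.5 y_{t-1} + 0.8 x_{t-1} + e1_t
T = 500
x = np.zeros(T)
y = np.zeros(T)
for t in range(1, T):
    x[t] = 0.5 * x[t - 1] + rng.standard_normal()
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + rng.standard_normal()

def rss(target, regressors):
    """Residual sum of squares of an OLS regression (with intercept)."""
    X = np.column_stack([np.ones(len(target))] + regressors)
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    return resid @ resid

# Unrestricted: y_t on (y_{t-1}, x_{t-1}); restricted: y_t on y_{t-1} only.
rss_u = rss(y[1:], [y[:-1], x[:-1]])
rss_r = rss(y[1:], [y[:-1]])

# F-statistic for the single restriction "the (y, x) entry of A_1 is zero"
n, q, k_params = T - 1, 1, 3
F = ((rss_r - rss_u) / q) / (rss_u / (n - k_params))
print(F)  # a large F rejects Granger noncausality of x for y
```

With p lags and k variables, the same comparison uses q = p restrictions per tested direction; packages wrap this, but the mechanics are just the restricted-versus-unrestricted regression above.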
(However, note that Eichler (2007) and White and Lu (2010) have recently developed and formalized richer structural frameworks in which Granger causality can be fruitfully analyzed.)

1.2. The SVAR framework

Structural vector autoregressive (SVAR) models constitute a middle way between the Cowles Commission approach and the Granger-causality approach. SVAR models aim at recovering the concept of structural causality, but at the same time eschew the strong 'apriorism' of the Cowles Commission approach. The idea is, as in the Cowles Commission approach, to articulate an unobserved structural model, formalized as a dynamic generative model: at each time unit the system is affected by unobserved innovation terms, by which, once filtered by the model, the variables come to take the values we observe. But, differently from the Cowles Commission approach, and similarly to the Granger-VAR model, the data generating process is articulated at a general enough level that time series variables are not distinguished a priori between exogenous and endogenous. A linear SVAR model is in principle a VAR model 'augmented' by the contemporaneous structure:

Γ_0 Y_t = Γ_1 Y_{t-1} + ... + Γ_p Y_{t-p} + ε_t.    (4)

This is easily obtained by pre-multiplying each side of the VAR model

Y_t = A_1 Y_{t-1} + ... + A_p Y_{t-p} + u_t,    (5)

by a matrix Γ_0, so that Γ_i = Γ_0 A_i for i = 1, ..., p, and ε_t = Γ_0 u_t. Note, however, that not every matrix Γ_0 will be suitable. The appropriate Γ_0 will be the matrix corresponding


to the 'right' rotation of the VAR model, that is, the rotation compatible both with the contemporaneous causal structure of the variables and with the structure of the innovation term. Let us consider a matrix B_0 = I − Γ_0. If the system is normalized such that the matrix Γ_0 has all the elements of the principal diagonal equal to one (which can be done straightforwardly), the diagonal elements of B_0 will be equal to zero. We can write:

Y_t = B_0 Y_t + Γ_1 Y_{t-1} + ... + Γ_p Y_{t-p} + ε_t    (6)

from which we see that B_0 (and thus Γ_0) determines in which form the values of a variable Y_{i,t} depend on the contemporaneous value of another variable Y_{j,t}. The 'right' rotation will also be the one which makes ε_t a vector of authentic innovation terms, which are expected to be independent (not only over time, but also contemporaneously) sources or shocks.

In the literature, different methods have been proposed to identify the SVAR model (4) on the basis of the estimation of the VAR model (5). Notice that there are more unobserved parameters in (4), namely k²(p + 1), than parameters that can be estimated from (5), namely k²p + k(k + 1)/2, so one has to impose at least k(k − 1)/2 restrictions on the system. One solution to this problem is to find a rotation of (5) such that the covariance matrix of the SVAR residuals Σ_ε is diagonal, using the Cholesky factorization of the estimated residual covariance matrix Σ_u. That is, let P be the lower-triangular Cholesky factor of Σ_u (i.e. Σ_u = PP′), let D be a k × k diagonal matrix with the same diagonal as P, and let Γ_0 = DP⁻¹. By pre-multiplying (5) by Γ_0, it turns out that Σ_ε = E[Γ_0 u_t u_t′ Γ_0′] = DD′, which is diagonal.
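The Cholesky rotation can be verified numerically in a few lines (the residual covariance matrix below is a made-up 3 × 3 example, not estimated from data):

```python
import numpy as np

# Hypothetical residual covariance matrix Sigma_u of an estimated VAR (k = 3)
sigma_u = np.array([[1.0, 0.4, 0.3],
                    [0.4, 2.0, 0.5],
                    [0.3, 0.5, 1.5]])

P = np.linalg.cholesky(sigma_u)   # lower triangular, sigma_u = P P'
D = np.diag(np.diag(P))           # diagonal matrix with the same diagonal as P
gamma0 = D @ np.linalg.inv(P)     # the rotation matrix Gamma_0 = D P^{-1}

# The rotated residual covariance Sigma_eps = Gamma_0 Sigma_u Gamma_0' equals
# D D' and is diagonal; Gamma_0 has ones on its principal diagonal, matching
# the normalization used in the text.
sigma_eps = gamma0 @ sigma_u @ gamma0.T
print(np.round(sigma_eps, 10))
print(np.round(np.diag(gamma0), 10))  # all ones
```

The role of D is purely a normalization: Γ_0 = P⁻¹ alone would already diagonalize Σ_u (to the identity), but rescaling by D puts ones on the diagonal of Γ_0 so that each equation keeps its own variable on the left-hand side with unit coefficient.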
A problem with this method is that P changes if the ordering of the variables (Y_{1,t}, ..., Y_{k,t})′ in Y_t, and consequently the order of the residuals in Σ_u, changes. Since researchers who estimate a SVAR are often exclusively interested in tracking down the effect of a structural shock ε_{i,t} on the variables Y_{1,t}, ..., Y_{k,t} over time (impulse response functions), Sims (1981) suggested investigating to what extent the impulse response functions remain robust under changes of the order of variables.

Popular alternatives to the Cholesky identification scheme are based either on the use of a priori, theory-based restrictions or on the use of long-run restrictions. The former solution consists in imposing economically plausible constraints on the contemporaneous interactions among variables (Blanchard and Watson, 1986; Bernanke, 1986) and has the drawback of ultimately depending on the a priori reliability of economic theory, similarly to the Cowles Commission approach. The second solution is based on the assumption that certain economic shocks have long-run effects on some variables but do not influence the long-run level of other variables (see Shapiro and Watson, 1988; Blanchard and Quah, 1989; King et al., 1991). This approach has been criticized as not being very reliable unless strong a priori restrictions are imposed (see Faust and Leeper, 1997).

In the rest of the paper, we first present a method, based on the graphical causal model framework, to identify the SVAR (section 2).
This method is based on conditional independence tests among the residuals of the estimated VAR model. Such tests rely on the assumption that the shocks affecting the model are Gaussian.


We then relax the Gaussianity assumption and present a method to identify the SVAR model based on independent component analysis (section 3). Here the main assumption is that shocks are non-Gaussian and independent. Finally (section 4), we explore the possibility of extending the framework for causal inference to a nonparametric setting. In section 5 we wrap up the discussion and conclude by formulating some open questions.

2. SVAR identification via graphical causal models

2.1. Background

A data-driven approach to identify the structural VAR is based on the analysis of the estimated residuals û_t. Notice that when a basic VAR model is estimated (equation 3), the information about contemporaneous causal dependence is incorporated exclusively in the residuals (being not modeled among the variables). Graphical causal models, as originally developed by Pearl (2000) and Spirtes et al. (2000), represent an efficient method to recover, at least in part, the contemporaneous causal structure, moving from the analysis of the conditional independencies among the estimated residuals. Once the contemporaneous causal structure is recovered, the estimation of the lagged autoregressive coefficients permits us to identify the complete SVAR model.

This approach was initiated by Swanson and Granger (1997), who proposed to test whether a particular causal order of the VAR is in accord with the data by testing all the partial correlations of order one among error terms and checking whether some partial correlations are vanishing.
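The logic behind Swanson and Granger's order-one partial correlation test can be seen in a toy simulation (the causal chain and its coefficients below are invented for illustration): when u1 causes u3 only through u2, the marginal correlation between u1 and u3 is nonzero, but the partial correlation given u2 vanishes.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Toy causal chain u1 -> u2 -> u3 (coefficients chosen arbitrarily)
u1 = rng.standard_normal(n)
u2 = 0.8 * u1 + rng.standard_normal(n)
u3 = 0.7 * u2 + rng.standard_normal(n)

def partial_corr(x, y, z):
    """First-order partial correlation rho(x, y | z) from pairwise correlations."""
    rxy = np.corrcoef(x, y)[0, 1]
    rxz = np.corrcoef(x, z)[0, 1]
    ryz = np.corrcoef(y, z)[0, 1]
    return (rxy - rxz * ryz) / np.sqrt((1 - rxz**2) * (1 - ryz**2))

print(np.corrcoef(u1, u3)[0, 1])  # clearly nonzero: u1 and u3 are dependent
print(partial_corr(u1, u3, u2))   # near zero: u1 independent of u3 given u2
```

A causal order that puts u2 between u1 and u3 is thus "in accord with the data", whereas an order implying a direct u1–u3 link is not.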
Reale and Wilson (2001), Bessler and Lee (2002), Demiralp and Hoover (2003), and Moneta (2008) extended the approach by using the partial correlations of the VAR residuals as input to graphical causal model search algorithms. In graphical causal models, the structural model is represented as a causal graph (a Directed Acyclic Graph if the presence of causal loops is excluded), in which each node represents a random variable and each edge a causal dependence. Furthermore, a set of assumptions or 'rules of inference' is formulated, which regulates the relationship between causal and probabilistic dependencies: the causal Markov and the faithfulness conditions (Spirtes et al., 2000). The former restricts the joint probability distribution of the modeled variables: each variable is independent of its graphical non-descendants conditional on its graphical parents. The latter makes causal discovery possible: all of the conditional independence relations among the modeled variables follow from the causal Markov condition. Thus, for example, if the causal structure is represented as Y_1t → Y_2t → Y_3t, it follows from the Markov condition that Y_1t ⊥ Y_3t | Y_2t. If, on the other hand, the only (conditional) independence relation among Y_1t, Y_2t, Y_3t is Y_1t ⊥ Y_3t, it follows from the faithfulness condition that Y_1t → Y_3t


Moneta Chlaß Entner Hoyer

second step, conditional independence relations (or d-separations, their graphical characterization) are merely used to erase edges and, in further steps, to direct edges. The output of such algorithms is not necessarily one single graph, but a class of Markov equivalent graphs.

Nothing in either the Markov or the faithfulness condition, nor in the constraint-based algorithms, limits them to linear and Gaussian settings. Graphical causal models do not per se require any a priori specification of the functional dependence between variables. However, in applications of graphical models to SVARs, conditional independence is ascertained by testing vanishing partial correlations (Swanson and Granger, 1997; Bessler and Lee, 2002; Demiralp and Hoover, 2003; Moneta, 2008). Since the normal distribution guarantees the equivalence between zero partial correlation and conditional independence, these applications deal de facto with linear and Gaussian processes.

2.2. Testing residuals for zero partial correlations

There are alternative methods to test for zero partial correlations among the error terms u_t = (u_1t, ..., u_kt)′. Swanson and Granger (1997) use the partial correlation coefficient. That is, in order to test, for instance, ρ(u_it, u_kt | u_jt) = 0, they use the standard t statistic of α_k from a least squares regression of the model

  u_it = α_j u_jt + α_k u_kt + ε_it,   (7)

on the basis that α_k = 0 ⇔ ρ(u_it, u_kt | u_jt) = 0.
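The regression-based test of equation (7) can be sketched in a few lines. The chain structure, coefficients, and seed below are hypothetical, chosen only so that ρ(u_1t, u_3t | u_2t) = 0 holds by construction:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
# hypothetical residual series with contemporaneous chain u1 -> u2 -> u3
u1 = rng.normal(size=n)
u2 = 0.8 * u1 + rng.normal(size=n)
u3 = 0.8 * u2 + rng.normal(size=n)

def coef_t(y, x1, x2):
    """OLS of y on (x1, x2); t statistic of the x2 coefficient.
    By equation (7): alpha_k = 0  <=>  rho(y, x2 | x1) = 0."""
    X = np.column_stack([x1, x2])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - 2)
    cov = s2 * np.linalg.inv(X.T @ X)
    return beta[1] / np.sqrt(cov[1, 1])

t_13_2 = coef_t(u1, u2, u3)   # tests rho(u1, u3 | u2): zero along the chain
t_12_3 = coef_t(u1, u3, u2)   # tests rho(u1, u2 | u3): clearly non-zero
```

With these hypothetical coefficients, |t_13_2| stays in the acceptance region while |t_12_3| lies far beyond any conventional critical value.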
Since Swanson and Granger (1997) impose the partial correlation constraints looking only at the set of partial correlations of order one (that is, conditioned on only one variable), in order to run their tests they consider regression equations with only two regressors, as in equation (7).

Bessler and Lee (2002) and Demiralp and Hoover (2003) use Fisher's z, which is incorporated in the software TETRAD (Scheines et al., 1998):

  z(ρ_XY.K, T) = (1/2) √(T − |K| − 3) log( |1 + ρ_XY.K| / |1 − ρ_XY.K| ),   (8)

where |K| equals the number of variables in K and T the sample size. If the variables (for instance X = u_it, Y = u_kt, K = (u_jt, u_ht)) are normally distributed, we have that

  z(ρ_XY.K, T) − z(ρ̂_XY.K, T) ∼ N(0, 1)   (9)

(see Spirtes et al., 2000, p. 94).

A different approach, which takes into account the fact that correlations are obtained from the residuals of a regression, is proposed by Moneta (2008). In this case it is useful to write the VAR model of equation (3) in a more compact form:

  Y_t = Π′ X_t + u_t,   (10)
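Fisher's z of equation (8) is straightforward to compute directly; the sample values below are illustrative, not taken from the paper:

```python
import math

def fisher_z(r, T, card_K):
    """Fisher's z of equation (8); approximately N(0, 1) under
    H0: rho_XY.K = 0 when the variables are normally distributed."""
    return 0.5 * math.sqrt(T - card_K - 3) * math.log(abs(1 + r) / abs(1 - r))

# A sample partial correlation of 0.15 with T = 200 and |K| = 2 gives
# |z| slightly above 1.96, so H0 would be rejected at the 5% level.
z = fisher_z(0.15, T=200, card_K=2)
```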


Causal Search <strong>in</strong> SVARwhere X ′ t = [Y′ t−1 , ...,Y′ t−p ], which has dimension (1 × kp) and Π′ = [A 1 ,...,A p ], whichhas dimension (k × kp). In case of stable VAR process (see next subsection), the conditionalmaximum likelihood estimate of Π for a sample of size T is given byMoreover, the ith row of ˆΠ ′ is⎡ ⎤⎡⎤T∑︁ T∑︁ˆΠ ′ = ⎢⎣ Y t X ′ t⎥⎦⎢⎣ X t X ′ t⎥⎦t=1t=1t=1⎤⎡⎤T∑︁ T∑︁ˆπ ′ i⎡⎢⎣ = Y it X ′ t⎥⎦⎢⎣ X t X ′ t⎥⎦which co<strong>in</strong>cides with the estimated coefficient vector from an OLS regression of Y it onX t (Hamilton 1994: 293). The maximum likelihood estimate of the matrix of varianceand covariance among the error terms Σ u turns out to be ˆΣ u = (1/T) ∑︀ Tt=1 û t û ′ t , whereû t = Y t − ˆΠ ′ X t . Therefore, the maximum likelihood estimate of the covariance betweenu it and u jt is given by the (i, j) element of ˆΣ u : ˆσ i j = (1/T) ∑︀ Tt=1 û it û jt . Denot<strong>in</strong>g by σ i jthe (i, j) element of Σ u , let us first def<strong>in</strong>e the follow<strong>in</strong>g matrix transform operators: vec,which stacks the columns of a k × k matrix <strong>in</strong>to a vector of length k 2 and vech, whichvertically stacks the elements of a k × k matrix on or below the pr<strong>in</strong>cipal diagonal <strong>in</strong>toa vector of length k(k + 1)/2. For example:[︃ ]︃σ11 σvec 12σ 21 σ 22⎡=⎢⎣t=1−1−1σ 11[︃ ]︃σ 21 σ11 σ, vech 12σ 12 σ⎤⎥⎦21 σ 22σ 22.,⎡= ⎢⎣σ 11σ 21σ 22⎤⎥⎦ .The process be<strong>in</strong>g stationary and the error terms Gaussian, it turns out that:√ dT [vech( ˆΣ u ) − vech(Σ u )] −→ N(0, Ω), (11)where Ω = 2D + k (Σ u ⊗ Σ u )(D + k )′ , D + k ≡ (D′ k D k) −1 D ′ k , D k is the unique (k 2 × k(k +1)/2) matrix satisfy<strong>in</strong>g D k vech(Ω) = vec(Ω), and ⊗ denotes the Kronecker product (seeHamilton 1994: 301). 
For example, for k = 2 we have

  √T ( σ̂_11 − σ_11, σ̂_12 − σ_12, σ̂_22 − σ_22 )′ →d N(0, V),  where

  V = ( 2σ_11²       2σ_11σ_12          2σ_12²
        2σ_11σ_12    σ_11σ_22 + σ_12²   2σ_12σ_22
        2σ_12²       2σ_12σ_22          2σ_22² ).

Therefore, to test the null hypothesis ρ(u_it, u_jt) = 0 from the estimated VAR residuals, it is possible to use the Wald statistic

  T (σ̂_ij)² / ( σ̂_ii σ̂_jj + σ̂_ij² ) ≈ χ²(1).
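A sketch of this Wald test on simulated residuals (all parameters and seeds hypothetical; the 5% critical value of χ²(1) is about 3.84):

```python
import numpy as np

rng = np.random.default_rng(2)
T = 6_000
e = rng.normal(size=(T, 2))
u_dep = np.column_stack([e[:, 0], 0.5 * e[:, 0] + e[:, 1]])  # correlated pair
u_ind = rng.normal(size=(T, 2))                              # independent pair

def wald_corr(u):
    """T * s12^2 / (s11*s22 + s12^2), asymptotically chi^2(1) under rho = 0."""
    c = u - u.mean(axis=0)
    S = c.T @ c / len(u)
    return len(u) * S[0, 1] ** 2 / (S[0, 0] * S[1, 1] + S[0, 1] ** 2)

w_dep = wald_corr(u_dep)   # far above 3.84: reject rho = 0
w_ind = wald_corr(u_ind)   # typically below 3.84: do not reject
```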


The Wald statistic for testing vanishing partial correlations of any order is obtained by applying the delta method: if X_T is an (r × 1) sequence of vector-valued random variables such that [√T(X_1T − θ_1), ..., √T(X_rT − θ_r)] →d N(0, Σ), and h_1, ..., h_r are real-valued functions of θ = (θ_1, ..., θ_r), h_i : R^r → R, defined and continuously differentiable in a neighborhood ω of the parameter point θ and such that the matrix B = ||∂h_i/∂θ_j|| of partial derivatives is nonsingular in ω, then

  [ √T(h_1(X_T) − h_1(θ)), ..., √T(h_r(X_T) − h_r(θ)) ] →d N(0, B Σ B′)

(see Lehmann and Casella 1998: 61).

Thus, for k = 4, suppose one wants to test corr(u_1t, u_3t | u_2t) = 0. First, notice that ρ(u_1, u_3 | u_2) = 0 if and only if σ_22 σ_13 − σ_12 σ_23 = 0 (by the definition of partial correlation). One can define a function g : R^{k(k+1)/2} → R such that g(vech(Σ_u)) = σ_22 σ_13 − σ_12 σ_23. Thus,

  ∇g′ = (0, −σ_23, σ_22, 0, σ_13, −σ_12, 0, 0, 0, 0).

Applying the delta method,

  √T [ (σ̂_22 σ̂_13 − σ̂_12 σ̂_23) − (σ_22 σ_13 − σ_12 σ_23) ] →d N(0, ∇g′ Ω ∇g).

The Wald test of the null hypothesis corr(u_1t, u_3t | u_2t) = 0 is then

  T (σ̂_22 σ̂_13 − σ̂_12 σ̂_23)² / (∇g′ Ω ∇g) ≈ χ²(1).

Tests for higher-order partial correlations and for k > 4 follow analogously (see also Moneta, 2003). Compared to the alternative methods, this testing procedure has the advantage that it applies straightforwardly to the case of cointegrated data, as explained in the next subsection.

2.3. Cointegration case

A typical feature of economic time series data in which there is some form of causal dependence is cointegration.
This term denotes the phenomenon that nonstationary processes can have linear combinations that are stationary. That is, suppose that each component Y_it of Y_t = (Y_1t, ..., Y_kt)′, which follows the VAR process

  Y_t = A_1 Y_{t−1} + ... + A_p Y_{t−p} + u_t,

is nonstationary and integrated of order one (∼ I(1)). This means that the VAR process Y_t is not stable, i.e. det(I_k − A_1 z − ... − A_p z^p) is equal to zero for some |z| ≤ 1 (Lütkepohl, 2006), and that each component ΔY_it of ΔY_t = (Y_t − Y_{t−1}) is stationary (I(0)), that is, it has time-invariant means, variances, and covariance structure. A linear combination of the elements of Y_t is called a cointegrating relationship if there is a linear combination c_1 Y_1t + ... + c_k Y_kt which is stationary (I(0)).
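A minimal simulated illustration (coefficients and seed hypothetical): Y_1 is a random walk and Y_2 tracks it, so both components are I(1) while the combination Y_2 − Y_1 is I(0), i.e. (1, −1)′ is a cointegrating vector:

```python
import numpy as np

rng = np.random.default_rng(4)
T = 20_000
# Y1: random walk (I(1)); Y2 = Y1 + stationary noise
y1 = np.cumsum(rng.normal(size=T))
y2 = y1 + rng.normal(size=T)

# the levels wander (huge sample variance over the path),
# while the cointegrating combination stays bounded
spread = y2 - y1
```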


If the VAR process is unstable with the presence of cointegrating relationships, it is more appropriate (Lütkepohl, 2006; Johansen, 2006) to estimate the following re-parametrization of the VAR model, called the Vector Error Correction Model (VECM):

  ΔY_t = F_1 ΔY_{t−1} + ... + F_{p−1} ΔY_{t−p+1} − G Y_{t−p} + u_t,   (12)

where F_i = −(I_k − A_1 − ... − A_i) for i = 1, ..., p−1, and G = I_k − A_1 − ... − A_p. The (k × k) matrix G has rank r, and thus G can be written as HC, with H and C′ of dimension (k × r) and of rank r. C ≡ [c_1, ..., c_r]′ is called the cointegrating matrix.

Let C̃, H̃, and F̃_i be the maximum likelihood estimators of C, H, and F_i according to Johansen's (1988, 1991) approach. Then the asymptotic distribution of Σ̃_u, the maximum likelihood estimator of the covariance matrix of u_t, is

  √T [vech(Σ̃_u) − vech(Σ_u)] →d N(0, 2 D_k⁺ (Σ_u ⊗ Σ_u)(D_k⁺)′),   (13)

which is the same as in equation (11) (see it again for the definitions of the various operators). Thus, the asymptotic distribution of the maximum likelihood estimator Σ̃_u is the same as that of the OLS estimator Σ̂_u in the case of a stable VAR.

The method described above for testing residuals for zero partial correlations can therefore be applied straightforwardly to cointegrated data: the model is estimated as a VECM using Johansen's (1988, 1991) approach, correlations are tested exploiting the asymptotic distribution of Σ̃_u, and finally the model can be parameterized back into its VAR form of equation (3).

2.4. Summary of the search procedure

The graphical causal models approach to SVAR identification, which we suggest in the case of Gaussian and linear processes, can be summarized as follows.

Step 1 Estimate the VAR model Y_t = A_1 Y_{t−1} + ... + A_p Y_{t−p} + u_t with the usual specification tests for normality, zero autocorrelation of residuals, lag order, and unit roots (see Lütkepohl, 2006). If the hypothesis of nonstationarity is rejected, estimate the VAR model via OLS (equivalent to MLE under the assumption of normality of the errors). If unit root tests do not reject I(1) nonstationarity in the data, specify the model as a VECM, testing for the presence of cointegrating relationships. If tests suggest the presence of cointegrating relationships, estimate the model as a VECM. If cointegration is rejected, estimate the VAR model in first differences.

Step 2 Run tests for zero partial correlations between the elements u_1t, ..., u_kt of u_t using the Wald statistics based on the asymptotic distribution of the covariance matrix of u_t. Note that not all possible partial correlations ρ(u_it, u_jt | u_ht, ...) need to be tested, but only those necessary for Step 3.
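Steps 1 and 2 can be sketched for a stable bivariate VAR(1); the coefficient matrix and seed below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(6)
A1 = np.array([[0.5, 0.1],
               [0.0, 0.3]])            # stable: eigenvalues inside unit circle
T = 20_000
Y = np.zeros((T, 2))
for t in range(1, T):
    Y[t] = A1 @ Y[t - 1] + rng.normal(size=2)

# Step 1: OLS regression of Y_t on Y_{t-1} gives Pi-hat (here just A1-hat)
X, Ynext = Y[:-1], Y[1:]
A1_hat = np.linalg.lstsq(X, Ynext, rcond=None)[0].T

# Step 2 input: residual covariance matrix Sigma_u-hat
U = Ynext - X @ A1_hat.T
Sigma_u = U.T @ U / len(U)
```

With T this large, A1_hat should be close to the true A1 and Sigma_u close to the identity used to generate the shocks.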


Step 3 Apply a causal search algorithm to recover the causal structure among u_1t, ..., u_kt, which is equivalent to the causal structure among Y_1t, ..., Y_kt (cf. section 1.2 and see Moneta 2003). In the case of an acyclic (no feedback loops) and causally sufficient (no latent variables) structure, the suggested algorithm is the PC algorithm of Spirtes et al. (2000, pp. 84-85). Moneta (2008) suggested a few modifications to the PC algorithm in order to make the orientation of edges compatible with as many conditional independence tests as possible. This increases the computational time of the search algorithm, but considering that VAR models deal with a small number of time series variables (rarely more than six to eight; see Bernanke et al. 2005), this slowing down does not create a serious concern in this context. Table 1 reports the modified PC algorithm. In the case of an acyclic structure without causal sufficiency (i.e. possibly including latent variables), the suggested algorithm is FCI (Spirtes et al. 2000, pp. 144-145). In the case of no latent variables but in the presence of feedback loops, the suggested algorithm is CCD (Richardson and Spirtes, 1999). There is no algorithm in the literature which is consistent for search when both latent variables and feedback loops may be present. If the goal of the study is only impulse response analysis (i.e. tracing out the effects of structural shocks ε_1t, ..., ε_kt on Y_t, Y_{t+1}, ...) and neither contemporaneous feedbacks nor latent variables can be excluded a priori, a possible solution is to apply only steps (A) and (B) of the PC algorithm.
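Steps (A) and (B) of the skeleton search can be sketched as follows for three residual series with a hypothetical chain structure. This is a deliberately simplified version: conditioning sets are drawn from all remaining variables rather than from current adjacency sets, and a conservative 0.1% critical value of 3.29 is used for Fisher's z:

```python
import math
from itertools import combinations
import numpy as np

rng = np.random.default_rng(7)
n = 20_000
u = np.zeros((n, 3))
u[:, 0] = rng.normal(size=n)
u[:, 1] = 0.8 * u[:, 0] + rng.normal(size=n)   # u1 -> u2
u[:, 2] = 0.8 * u[:, 1] + rng.normal(size=n)   # u2 -> u3

R = np.corrcoef(u, rowvar=False)

def pcorr(i, j, k):
    """First-order partial correlation from the correlation matrix."""
    return (R[i, j] - R[i, k] * R[j, k]) / math.sqrt(
        (1 - R[i, k] ** 2) * (1 - R[j, k] ** 2))

def z(r, cond):
    """Fisher's z with conditioning-set size cond."""
    return 0.5 * math.sqrt(n - cond - 3) * math.log((1 + r) / (1 - r))

# (A) complete undirected graph; (B) prune by order-0, then order-1 tests
edges = set(combinations(range(3), 2))
for i, j in list(edges):
    if abs(z(R[i, j], 0)) < 3.29:                     # order 0
        edges.discard((i, j))
for i, j in list(edges):
    for k in set(range(3)) - {i, j}:                  # order 1
        if abs(z(pcorr(i, j, k), 1)) < 3.29:
            edges.discard((i, j))
```

For the chain, only the edge between u1 and u3 should be deleted (because u1 ⊥ u3 | u2), leaving the skeleton u1 — u2 — u3.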
If the resulting set of possible causal structures (represented by an undirected graph) contains a manageable number of elements, one can study the characteristics of the impulse response functions which are robust across all the possible causal structures, where the presence of both feedbacks and latent variables is allowed (Moneta, 2004).

Step 4 Calculate structural coefficients and impulse response functions. If the output of Step 3 is a set of causal structures, run a sensitivity analysis to investigate the robustness of the conclusions under the different possible causal structures. Bootstrap procedures may also be applied to determine the most reliable causal order (see simulations and applications in Demiralp et al., 2008).

3. Identification via independent component analysis

The methods considered in the previous section use tests for zero partial correlation on the VAR residuals to obtain (partial) information about the contemporaneous structure in an SVAR model with Gaussian shocks. In this section we show how non-Gaussian and independent shocks can be exploited for model identification by using the statistical method of Independent Component Analysis (ICA; see Comon, 1994; Hyvärinen et al., 2001).
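As a preliminary to the ICA-based method, residuals can be screened for non-Gaussianity; a minimal Jarque-Bera check might look like this (sample sizes, seed, and thresholds are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

def jarque_bera(x):
    """JB = n/6 * (skew^2 + (kurt - 3)^2 / 4); approx. chi^2(2) under normality."""
    z = x - x.mean()
    m2 = np.mean(z ** 2)
    skew = np.mean(z ** 3) / m2 ** 1.5
    kurt = np.mean(z ** 4) / m2 ** 2
    return len(x) / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)

jb_gauss = jarque_bera(rng.normal(size=5000))      # small: looks Gaussian
jb_unif = jarque_bera(rng.uniform(-1, 1, 5000))    # large: clearly non-Gaussian
```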
The method is again based on the VAR residuals u_t, which can be obtained as in the Gaussian case by estimating the VAR model using, for example, ordinary least squares or least absolute deviations, and which can be tested for non-Gaussianity using any normality test (such as the Shapiro-Wilk or Jarque-Bera test).

To motivate, we note that, from equations (3) and (4) (with matrix Γ_0) or the Cholesky factorization in section 1.2 (with matrix PD⁻¹), the VAR disturbances u_t and


Table 1: Search algorithm (adapted from the PC algorithm of Spirtes et al. (2000: 84-85); modifications shown in bold in the original).

Under the assumption of Gaussianity, conditional independence is tested by zero partial correlation tests.

(A) (connect everything):
Form the complete undirected graph G on the vertex set u_1t, ..., u_kt, so that each vertex is connected to any other vertex by an undirected edge.

(B) (cut some edges):
n = 0
repeat:
    repeat:
        select an ordered pair of variables u_ht and u_it that are adjacent in G such that the number of variables adjacent to u_ht is equal to or greater than n + 1. Select a set S of n variables adjacent to u_ht such that u_it ∉ S. If u_ht ⊥ u_it | S, delete edge u_ht — u_it from G;
    until all ordered pairs of adjacent variables u_ht and u_it such that the number of variables adjacent to u_ht is equal to or greater than n + 1, and all sets S of n variables adjacent to u_ht such that u_it ∉ S, have been checked to see if u_ht ⊥ u_it | S;
    n = n + 1;
until for each ordered pair of adjacent variables u_ht, u_it, the number of variables adjacent to u_ht is less than n + 1;

(C) (build colliders):
for each triple of vertices u_ht, u_it, u_jt such that the pair u_ht, u_it and the pair u_it, u_jt are each adjacent in G but the pair u_ht, u_jt is not adjacent in G, orient u_ht — u_it — u_jt as u_ht −→ u_it


the structural shocks ε_t are connected by

  u_t = Γ_0⁻¹ ε_t = PD⁻¹ ε_t   (14)

with square matrices Γ_0 and PD⁻¹, respectively. Equation (14) has two important properties. First, the vectors u_t and ε_t are of the same length, meaning that there are as many residuals as structural shocks. Second, the residuals u_t are linear mixtures of the shocks ε_t, connected by the 'mixing matrix' Γ_0⁻¹. This resembles the ICA model, when placing certain assumptions on the shocks ε_t.

In short, the ICA model is given by x = As, where x are the mixed components, s the independent, non-Gaussian sources, and A a square invertible mixing matrix (meaning that there are as many mixtures as independent components). Given samples from the mixtures x, ICA estimates the mixing matrix A and the independent components s by linearly transforming x in such a way that the dependencies among the independent components s are minimized. The solution is unique up to ordering, sign, and scaling (Comon, 1994; Hyvärinen et al., 2001).

By comparing the ICA model x = As and equation (14), we see a one-to-one correspondence of the mixtures x to the residuals u_t and of the independent components s to the shocks ε_t. Thus, to be able to apply ICA, we need to assume that the shocks are non-Gaussian and mutually independent. We want to emphasize that no specific non-Gaussian distribution is assumed for the shocks, but only that they cannot be Gaussian.¹
For the shocks to be mutually independent, their joint distribution has to factorize into the product of the marginal distributions. In the non-Gaussian setting, independence implies zero partial correlation, but the converse is not true (as opposed to the Gaussian case, where the two statements are equivalent). Thus, for non-Gaussian distributions, conditional independence is a much stronger requirement than uncorrelatedness.

Under the assumption that the shocks ε_t are non-Gaussian and independent, equation (14) follows exactly the ICA model, and applying ICA to the VAR residuals u_t yields a unique solution (up to ordering, sign, and scaling) for the mixing matrix Γ_0⁻¹ and the independent components ε_t (i.e. the structural shocks in our case). However, the ambiguities of ICA make it hard to interpret the shocks found by ICA directly, since without further analysis we cannot relate the shocks directly to the measured variables.

Hence, we assume that the residuals u_t follow a linear non-Gaussian acyclic model (Shimizu et al., 2006), which means that the contemporaneous structure is represented by a DAG (directed acyclic graph). In particular, the model is given by

  u_t = B_0 u_t + ε_t   (15)

with a matrix B_0 whose diagonal elements are all zero and which, if permuted according to the causal order, is strictly lower triangular. By rewriting equation (15) we see that

  Γ_0 = I − B_0.   (16)

From this equation it follows that the matrix B_0 describes the contemporaneous structure of the variables Y_t in the SVAR model, as shown in equation (6). Thus, if we can

1. Actually, the requirement is that at most one of the residuals can be Gaussian.


identify the matrix Γ_0, we also obtain the matrix B_0 for the contemporaneous effects. As pointed out above, the matrix Γ_0⁻¹ (and hence Γ_0) can be estimated using ICA up to ordering, scaling, and sign. With the restriction that B_0 represents an acyclic system, we can resolve these ambiguities and fully identify the model. For simplicity, let us assume that the variables are arranged according to a causal ordering, so that the matrix B_0 is strictly lower triangular. From equation (16) it then follows that the matrix Γ_0 is lower triangular with all ones on the diagonal. Using this information, the ambiguities of ICA can be resolved in the following way.

The lower triangularity of B_0 allows us to find the unique permutation of the rows of Γ_0 which yields all non-zero elements on the diagonal of Γ_0, meaning that we replace the matrix Γ_0 with Q_1 Γ_0, where Q_1 is the uniquely determined permutation matrix. Finding this permutation resolves the ordering ambiguity of ICA and links the shocks ε_t to the components of the residuals u_t in a one-to-one manner. The sign and scaling ambiguity is now easy to fix by simply dividing each row of Γ_0 (the row-permuted version from above) by the corresponding diagonal element, yielding all ones on the diagonal, as implied by equation (16).
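The permutation and rescaling steps can be sketched directly. B_0 below is a hypothetical example, and W stands in for an ICA estimate of Γ_0 scrambled by an arbitrary (unknown) row permutation and row scaling:

```python
from itertools import permutations
import numpy as np

# hypothetical contemporaneous matrix, already in causal order
B0 = np.array([[0.0, 0.0, 0.0],
               [0.5, 0.0, 0.0],
               [-0.3, 0.7, 0.0]])
G0 = np.eye(3) - B0                 # Gamma_0: unit diagonal, lower triangular

P = np.eye(3)[[2, 0, 1]]            # unknown row permutation
S = np.diag([1.5, -2.0, 0.7])       # unknown signs and scales
W = S @ P @ G0                      # what ICA might hand back

# ordering ambiguity: the unique row permutation with a zero-free diagonal
for perm in permutations(range(3)):
    Wp = W[list(perm), :]
    if np.all(np.abs(np.diag(Wp)) > 1e-9):
        break

# sign/scaling ambiguity: divide each row by its diagonal element
G0_hat = Wp / np.diag(Wp)[:, None]
B0_hat = np.eye(3) - G0_hat         # recovered contemporaneous effects
```

Because Γ_0 is lower triangular with a unit diagonal, exactly one permutation qualifies, and the rescaling restores B_0 exactly.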
This ensures that the connection strength of the shock ε_t on the residual u_t is fixed to one in our model (equation (15)).

For the general case where B_0 is not arranged in the causal order, the above arguments for solving the ambiguities still apply. Furthermore, we can find the causal order of the contemporaneous variables by performing simultaneous row and column permutations on Γ_0 yielding the matrix closest to lower triangular, in particular Γ̃_0 = Q_2 Γ_0 Q_2′ with an appropriate permutation matrix Q_2. In case none of these permutations leads to a close-to-lower-triangular matrix, a warning is issued.

Essentially, the assumption of acyclicity allows us to uniquely connect the structural shocks ε_t to the components of u_t and fully identify the contemporaneous structure. Details of the procedure can be found in Shimizu et al. (2006) and Hyvärinen et al. (2010). In the sense of the Cholesky factorization of the covariance matrix explained in Section 1 (with PD⁻¹ = Γ_0⁻¹), full identifiability means that a causal order among the contemporaneous variables can be determined.

In addition to yielding full identification, a further benefit of using the ICA-based procedure when shocks are non-Gaussian is that it does not rely on the faithfulness assumption, which was necessary in the Gaussian case.

We note that there are many ways of exploiting non-Gaussian shocks for model identification as alternatives to directly using ICA. One such approach was introduced by Shimizu et al. (2009).
Their method relies on iteratively finding an exogenous variable and regressing out its influence on the remaining variables. An exogenous variable is characterized by being independent of the residuals when regressing any other variable in the model on it. Starting from the model in equation (15), this procedure returns a causal ordering of the variables u_t, and then the matrix B_0 can be estimated using the Cholesky approach.

One relatively strong assumption of the above methods is the acyclicity of the contemporaneous structure. In Lacerda et al. (2008) an extension was proposed in which feedback loops are allowed. In terms of the matrix B_0 this means that it is not restricted to being lower triangular (in an appropriate ordering of the variables). While in general this model is not identifiable, because we cannot uniquely match the shocks to the residuals, Lacerda et al. (2008) showed that the model is identifiable when assuming stability of the generating model in (15) (the absolute value of the biggest eigenvalue of B_0 is smaller than one) and disjoint cycles.

Another restriction of the above model is that all relevant variables must be included in the model (causal sufficiency). Hoyer et al. (2008b) extended the above model by allowing for hidden variables. This leads to an overcomplete-basis ICA model, meaning that there are more independent non-Gaussian sources than observed mixtures. While there exist methods for estimating overcomplete-basis ICA models, those methods which achieve the required accuracy do not scale well. Additionally, the solution is again only unique up to ordering, scaling, and sign, and when including hidden variables the ordering ambiguity cannot be resolved and in some cases leads to several observationally equivalent models, just as in the cyclic case above.

We note that it is also possible to combine the approach of section 2 with the one described here. That is, if some of the shocks are Gaussian or close to Gaussian, it may be advantageous to use a combination of constraint-based search and non-Gaussianity-based search. Such an approach was proposed in Hoyer et al. (2008a).
In particular, the proposed method does not make any assumptions about the distributions of the VAR residuals u_t. Basically, the PC algorithm (see Section 2) is run first, followed by the use of whatever non-Gaussianity there is to further direct edges. Note that there is no need to know in advance which shocks are non-Gaussian, since finding such shocks is part of the algorithm.

Finally, we need to point out that while the basic ICA-based approach does not require the faithfulness assumption, the extensions discussed at the end of this section do.

4. Nonparametric setting

4.1. Theory

Linear systems dominate VAR, SVAR and, more generally, multivariate time series models in econometrics. However, it is not always the case that we know how a variable X may cause another variable Y. We may have little or no a priori knowledge about the way Y depends on X. In its most general form, we want to know whether X is independent of Y conditional on the set of potential graphical parents Z, i.e.

  H_0 : Y ⊥ X | Z,   (17)

where Y, X, Z is a set of time series variables. Thereby, we do not per se require an a priori specification of how Y possibly depends on X. However, constraint-based algorithms typically specify conditional independence in a very restrictive way. In continuous settings, they simply test for nonzero partial correlations, or in other words, for linear (in)dependencies. Hence, these algorithms will fail whenever the data generation process (DGP) includes nonlinear causal relations.
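A one-line illustration of that failure mode: for Y = X² with symmetric X, the correlation is zero although Y is a deterministic function of X (all values below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(size=50_000)
y = x ** 2                      # perfectly dependent, but nonlinearly

r = np.corrcoef(x, y)[0, 1]     # near zero: a correlation test sees nothing
# yet the dependence is obvious, e.g. conditional means of y differ sharply
m_far = y[np.abs(x) > 1].mean()
m_near = y[np.abs(x) <= 1].mean()
```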


Causal Search <strong>in</strong> SVARIn search for a more general specification of conditional <strong>in</strong>dependency, Chlaß andMoneta (2010) suggest a procedure based on nonparametric density estimation. There<strong>in</strong>,neither the type of dependency between Y and X, nor the probability distributions ofthe variables need to be specified. The procedure exploits the fact that if two randomvariables are <strong>in</strong>dependent of a third, one obta<strong>in</strong>s their jo<strong>in</strong>t density by the product of thejo<strong>in</strong>t density of the first two, and the marg<strong>in</strong>al density of the third. Hence, hypothesistest (17) translates <strong>in</strong>to:f (Y, X,Z)H 0 : = f (YZ)f (XZ) f (Z) . (18)If we def<strong>in</strong>e h 1 (·) := f (Y, X,Z) f (Z), and h 2 (·) := f (YZ) f (XZ), we have:H 0 : h 1 (·) = h 2 (·). (19)We estimate h 1 and h 2 us<strong>in</strong>g a kernel smooth<strong>in</strong>g approach (see Wand and Jones, 1995,ch.4). Kernel smooth<strong>in</strong>g has the outstand<strong>in</strong>g property that it is <strong>in</strong>sensitive to autocorrelationphenomena and, therefore, immediately applicable to longitud<strong>in</strong>al or time seriessett<strong>in</strong>gs (Welsh et al., 2002).In particular, we use a so-called product kernel estimator:)︁ (︁1ĥ 1 (x,y,z;b) = KYi)︁ (︁−y KZi −z)︁}︁{︁∑︀ (︁ ni=1KZi)︁}︁−zp{︁∑︀ ni=1K (︁ X i −xN 2 b m+d b )︁ (︁ b bbZi −z)︁}︁{︁∑︀ ni=1 KZ K (︁ )︁ (︁Y i −y Zi)︁}︁−z (20)Kp ,1{︁∑︀ĥ 2 (x,y,z;b) = ni=1K (︁ X i −xN 2 b m+d bwhere X i , Y i , and Z i are the i th realization of the respective time series, K denotes thekernel function, b <strong>in</strong>dicates a scalar bandwidth parameter, and K p represents a productkernel 2 .So far, we have shown how we can estimate h 1 and h 2 . To see whether these aredifferent, we require some similarity measure between both conditional densities. 
There are different ways to measure the distance between such products of densities:

(i) The weighted Hellinger distance proposed by Su and White (2008):

    d_H = (1/n) Σ_{i=1}^n {1 − √(h_2(X_i, Y_i, Z_i) / h_1(X_i, Y_i, Z_i))}² a(X_i, Y_i, Z_i),    (21)

where a(·) is a nonnegative weighting function. Both the weighting function a(·) and the resulting test statistic are specified in Su and White (2008).

(ii) The Euclidean distance proposed by Szekely and Rizzo (2004) in their 'energy test':

    d_E = (1/n²) Σ_{i=1}^n Σ_{j=1}^n ||h_{1i} − h_{2j}|| − (1/(2n²)) Σ_{i=1}^n Σ_{j=1}^n ||h_{1i} − h_{1j}|| − (1/(2n²)) Σ_{i=1}^n Σ_{j=1}^n ||h_{2i} − h_{2j}||,    (22)

2. I.e., K_p((Z_i − z)/b) = Π_{j=1}^d K((Z_{ji} − z_j)/b). For our simulations (see next section) we choose the kernel K(u) = (3 − u²)φ(u)/2, with φ(u) the standard normal probability density function. We use a "rule-of-thumb" bandwidth: b = n^(−1/8.5).


Moneta Chlass Entner Hoyer

where h_{1i} = h_1(X_i, Y_i, Z_i), h_{2i} = h_2(X_i, Y_i, Z_i), and ||·|| is the Euclidean norm.³

Given these test statistics and their distributions, we compute the type-I error, or p-value, of our test problem (19). If Z = ∅, the tests are available in the R packages energy and cramer. The Hellinger distance is not suitable here, since one can only test for Z ≠ ∅.

For Z ≠ ∅, our test problem (19) requires higher-dimensional kernel density estimation. The more dimensions, i.e. the more elements in Z, the scarcer the data, and the greater the distance between two subsequent data points. This so-called Curse of dimensionality strongly reduces the accuracy of a nonparametric estimation (Yatchew, 1998). To circumvent this problem, we calculate the type-I errors for Z ≠ ∅ by a local bootstrap procedure, as described in Su and White (2008, pp. 840–841) and Paparoditis and Politis (2000, pp. 144–145). The local bootstrap draws repeatedly with replacement from the sample and counts how many times the bootstrap statistic is larger than the test statistic of the entire sample. Details on the local bootstrap procedure can be found in the appendix.

Now, let us see how this procedure fares in those time series settings where other testing procedures failed: the case of nonlinear time series.

4.2. Simulation Design

Our simulation design should allow us to see how the search procedures of 4.1 perform in terms of size and power. To identify size properties (type-I error), H_0 (19) must hold everywhere. We call data generation processes for which H_0 holds everywhere size-DGPs.
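As a preview of the simplest such size-DGP (design S0.1 in the results below: mutually independent AR(1) series with a_1 = 0.5 and e_t ~ N(0, 1)), a minimal sketch in Python (our illustration, not the authors' code):

```python
import numpy as np

def ar1_series(n, a1=0.5, rng=None):
    """Simulate an AR(1) process V_t = a1 * V_{t-1} + e_t with e_t ~ N(0, 1)."""
    rng = rng or np.random.default_rng()
    e = rng.normal(size=n)
    v = np.zeros(n)
    for t in range(1, n):
        v[t] = a1 * v[t - 1] + e[t]
    return v

# Three mutually independent AR(1) series: H0 holds for every pair, so a
# well-calibrated independence test should reject at roughly its nominal level.
rng = np.random.default_rng(2)
V1, V2, V3 = (ar1_series(100, rng=rng) for _ in range(3))
```

Each series is autocorrelated but carries no information about the others; any rejection of H_0 on such data is a type-I error.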
We induce a system of time series {V_{1,t}, V_{2,t}, V_{3,t}}_{t=1}^n whereby each time series follows an autoregressive process AR(1) with a_1 = 0.5 and error term e_t ~ N(0, 1); for instance, V_{1,t} = a_1 V_{1,t−1} + e_{V_1,t}. These time series may cause each other as illustrated in Fig. 1.

Figure 1: Time series DAG, with edges V_{1,t−1} → V_{1,t}, V_{1,t−1} → V_{2,t}, V_{2,t−1} → V_{2,t}, and V_{3,t−1} → V_{3,t}.

Therein, V_{1,t} ⊥ V_{2,t} | V_{1,t−1}, since V_{1,t−1} d-separates V_{1,t} from V_{2,t}, while V_{2,t} ⊥ V_{3,s} for any t and s. Hence, the set of variables Z, conditional on which two sets of variables X and Y are independent of each other, contains zero elements, i.e. V_{2,t} ⊥ V_{3,t−1}; contains one element, i.e. V_{1,t} ⊥ V_{2,t} | V_{1,t−1}; or contains two elements, i.e. V_{1,t} ⊥

3. An alternative Euclidean distance is proposed by Baringhaus and Franz (2004) in their Cramer test. This distance turns out to be d_E/2. The only substantial difference from the distance proposed in (ii) lies in the method used to obtain the critical values (see Baringhaus and Franz, 2004).


V_{2,t} | V_{1,t−1}, V_{3,t−1}.

In our simulations, we vary two aspects. The first aspect is the functional form of the causal dependency. To systematically vary nonlinearity and its impact, we characterize the causal relation between, say, V_{1,t−1} and V_{2,t} in a polynomial form, i.e. via V_{2,t} = f(V_{1,t−1}) + e, where f = Σ_{j=0}^p b_j V_{1,t−1}^j. Herein, j reflects the degree of nonlinearity, while b_j captures the impact nonlinearity exerts. For polynomials of any degree, we set only b_p ≠ 0. An additive error term e completes the specification.

The second aspect is the number of variables in Z conditional on which X and Y can be independent. Either zero, one, or at most two variables may form the set Z = {Z_1, ..., Z_d} of conditioned variables; hence Z has cardinality #Z ∈ {0, 1, 2}.

To identify power properties, H_0 must not hold anywhere, i.e. X ⊥̸ Y | Z. We call data generation processes where H_0 does not hold anywhere power-DGPs. Such DGPs can be induced by (i) a direct path between X and Y which does not include Z, (ii) a common cause of X and Y which is not an element of Z, or (iii) a "collider" between X and Y belonging to Z.⁴ As before, we vary the functional form f of these causal paths polynomially, where again only b_p ≠ 0. Third, we investigate different cardinalities #Z ∈ {0, 1, 2} of the set Z conditional on which X and Y become dependent.

4.3. Simulation Results

Let us start with #Z = 0, that is, H_0 := X ⊥ Y. Table 2 reports our simulation results for both size and power DGPs.
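Before looking at the rejection frequencies, a minimal sketch (our illustration) of the polynomial power-DGP just described, where only b_p is nonzero; whether the dependent series keeps its own autoregressive term alongside f(V_{1,t−1}) is an assumption we make here:

```python
import numpy as np

def power_dgp(n, p, a1=0.5, b_p=1.0, rng=None):
    """Two series where V2 depends polynomially (degree p, only b_p nonzero)
    on lagged V1:  V1_t = a1*V1_{t-1} + e1_t,
                   V2_t = a1*V2_{t-1} + b_p * V1_{t-1}**p + e2_t."""
    rng = rng or np.random.default_rng()
    e1, e2 = rng.normal(size=(2, n))
    v1, v2 = np.zeros(n), np.zeros(n)
    for t in range(1, n):
        v1[t] = a1 * v1[t - 1] + e1[t]
        v2[t] = a1 * v2[t - 1] + b_p * v1[t - 1] ** p + e2[t]
    return v1, v2

rng = np.random.default_rng(3)
v1_lin, v2_lin = power_dgp(100, p=1, rng=rng)   # linear causal path (P0.1-style)
v1_qua, v2_qua = power_dgp(100, p=2, rng=rng)   # quadratic causal path (P0.2-style)
```

Here H_0 is false by construction, so the rejection frequency of a test on such data estimates its power.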
Rejection frequencies are reported for three different tests, for theoretical significance levels of 0.05 and 0.1.

Table 2: Proportion of rejections of H_0 (no conditioned variables)

                                 significance level 5%       significance level 10%
                                Energy  Cramer  Fisher      Energy  Cramer  Fisher
Size DGPs
S0.1 (ind. time series)          0.065   0.000   0.151       0.122   0.000   0.213
Power DGPs
P0.1 (time series linear)        0.959   0.308   0.999       0.981   0.462   1
P0.2 (time series quadratic)     0.986   0.255   0.432       0.997   0.452   0.521
P0.3 (time series cubic)         1       0.905   1           1       0.975   1
P0.4 (time series quartic)       1       0.781   0.647       1       0.901   0.709

Note: length of series (n) = 100; number of iterations = 1000.

Take the first line of Table 2. For size DGPs, H_0 holds everywhere. A test performs accurately if it rejects H_0 in accordance with the respective theoretical significance level. We see that the energy test rejects H_0 slightly more often than it should (0.065 > 0.05; 0.122 > 0.1), whereas the Cramer test does not reject H_0 often enough (0.000 < 0.05; 0.000 < 0.1).


Fisher's z, in turn, rejects H_0 much more often than it should. The energy test keeps the type-I error most accurately. Contrary to both nonparametric tests, the parametric procedure leads us to suspect many more causal relationships than there actually are if #Z = 0.

How well do these tests perform if H_0 does not hold anywhere? That is, how accurately do they reject H_0 if it is false (power-DGPs)? For linear time series, we see that the nonparametric energy test has nearly as much power as Fisher's z. For nonlinear time series, the energy test clearly outperforms Fisher's z.⁵ As it did for size, the Cramer test generally underperforms in terms of power. Interestingly, its power appears to be higher for higher degrees of nonlinearity. In summary, if one wishes to test for marginal independence without any information on the type of a potential dependence, one would opt for the energy test. It has a size close to the theoretical significance level and power similar to that of a parametric specification.

Let us turn to #Z = 1, where H_0 := X ⊥ Y | Z, for which the results are shown in Table 3. Starting with the size DGPs, the tests based on the Hellinger and Euclidean distances slightly underreject H_0, whereas for the highest polynomial degree, the Hellinger test strongly overrejects H_0.
The parametric Fisher's z slightly overrejects H_0 in the case of linearity and, for higher degrees, starts to underreject H_0.

Table 3: Proportion of rejections of H_0 (one conditioned variable)

                                  significance level 5%          significance level 10%
                                Hellinger  Euclid  Fisher      Hellinger  Euclid  Fisher
Size DGPs
S1.1 (time series linear)        0.035      0.035   0.062       0.090      0.060   0.103
S1.2 (time series quadratic)     0.040      0.020   0.048       0.065      0.035   0.104
S1.3 (time series cubic)         0.010      0.010   0.050       0.020      0.015   0.093
S1.4 (time series quartic)       0.13       0       0.023       0.2        0.1     0.054
Power DGPs
P1.1 (time series linear)        0.875      0.910   0.999       0.925      0.950   1
P1.2 (time series quadratic)     0.905      0.895   0.416       0.940      0.950   0.504
P1.3 (time series cubic)         0.990      1       1           1          1       1
P1.4 (time series quartic)       0.84       0.995   0.618       0.91       0.995   0.679

Note: n = 100; number of iterations = 200; number of bootstrap iterations (I) = 200.

Turning to power DGPs, Fisher's z suffers a dramatic loss in power for those polynomial degrees which depart most from linearity, i.e. quadratic and quartic relations. The nonparametric tests, which do not require linearity, have high power in absolute terms, and nearly twice as much as Fisher's z. The power properties of the nonparametric procedures indicate that our local bootstrap succeeds in mitigating the Curse of dimensionality. In sum, nonparametric tests exhibit good power properties for #Z = 1, whereas Fisher's z would fail to discover underlying quadratic or quartic relationships in some 60% and 40% of the cases, respectively.

5. For cubic time series, Fisher's z performs as well as the energy test does. This may be due to the fact that a cubic relation resembles a line more closely than other polynomial specifications do.
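For reference, the parametric benchmark used in all tables, Fisher's z test of a (partial) correlation, can be sketched as follows (a standard textbook construction; our code, not the authors'): for one conditioning variable, the partial correlation is computed from the pairwise correlations, transformed by z = artanh(r), and compared to a normal null with standard deviation 1/√(n − #Z − 3).

```python
import numpy as np
from math import atanh, erf, sqrt

def fisher_z_pvalue(x, y, z=None):
    """Fisher's z test of H0: corr(X, Y) = 0, or partial corr(X, Y | Z) = 0 for
    a single conditioning variable z.  Returns a two-sided p-value."""
    n, k = len(x), 0
    r = np.corrcoef(x, y)[0, 1]
    if z is not None:
        r_xz, r_yz = np.corrcoef(x, z)[0, 1], np.corrcoef(y, z)[0, 1]
        r = (r - r_xz * r_yz) / sqrt((1 - r_xz**2) * (1 - r_yz**2))
        k = 1
    zstat = sqrt(n - k - 3) * abs(atanh(r))          # z-transform over its null sd
    return 2 * (1 - 0.5 * (1 + erf(zstat / sqrt(2))))  # two-sided normal tail

rng = np.random.default_rng(4)
z = rng.normal(size=500)
x = 0.8 * z + rng.normal(size=500)
y = 0.8 * z + rng.normal(size=500)   # X and Y associated only through Z
p_marginal = fisher_z_pvalue(x, y)      # small: X and Y are correlated
p_partial = fisher_z_pvalue(x, y, z)    # not small in general: X indep. Y given Z
```

By construction this test only detects linear (partial) association, which is why its power collapses in the quadratic and quartic rows of the tables.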


The results for #Z = 2 are presented in Table 4. We find that both nonparametric tests have a size which is notably smaller than the theoretical significance level we induce. Hence, both have a strong tendency to underreject H_0. Turning to power DGPs, we find that the Euclidean test still has over 90% power to correctly reject H_0. For those polynomial degrees which depart most from linearity, i.e. quadratic and quartic, the Euclidean test has three times as much power as Fisher's z. However, the Hellinger test performs even worse than Fisher's z. Here, it may be the Curse of dimensionality which starts to show an impact.

Table 4: Proportion of rejections of H_0 (two conditioned variables)

                                  significance level 5%          significance level 10%
                                Hellinger  Euclid  Fisher      Hellinger  Euclid  Fisher
Size DGPs
S2.1 (time series linear)        0.006      0.020   0.050       0.033      0.046   0.102
S2.2 (time series quadratic)     0.000      0.010   0.035       0.000      0.040   0.087
S2.3 (time series cubic)         0          0.007   0.056       0          0.007   0.109
S2.4 (time series quartic)       0.006      0       0.031       0.013      0       0.067
Power DGPs
P2.1 (time series linear)        0.28       0.92    1           0.4        0.973   1
P2.2 (time series quadratic)     0.170      0.960   0.338       0.250      0.980   0.411
P2.3 (time series cubic)         0.667      1       1           0.754      1       1
P2.4 (time series quartic)       0.086      0.946   0.597       0.133      0.966   0.665

Note: n = 100; number of iterations = 150; number of bootstrap iterations (I) = 100.

To sum up, we can say that both marginal independencies and higher-dimensional conditional independencies (#Z = 1, 2) are best tested for using Euclidean tests. The Hellinger test seems to be more affected by the Curse of dimensionality.
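The local bootstrap that produced the p-values for #Z ≥ 1 (detailed in the appendix) can be sketched as follows. This is our simplified illustration under stated assumptions: a one-dimensional Z, a Gaussian resampling kernel, and a generic `statistic` callable standing in for the Hellinger or energy distance.

```python
import numpy as np

def local_bootstrap_pvalue(X, Y, Z, statistic, b, I=100, rng=None):
    """Local bootstrap p-value for H0: X indep. Y | Z (one-dimensional Z).
    `statistic` maps (X, Y, Z) to a scalar; the bootstrap resamples X and Y
    independently, conditional on kernel-smoothed draws of Z, so the bootstrap
    world satisfies H0 by construction."""
    rng = rng or np.random.default_rng()
    n = len(X)
    S_n = statistic(X, Y, Z)
    weight = lambda u: np.exp(-u**2 / 2)   # Gaussian kernel weights (our choice)
    count = 0
    for _ in range(I):
        # step (1): draw Z* from the smoothed empirical density of Z
        Z_star = Z[rng.integers(n, size=n)] + b * rng.normal(size=n)
        # step (2): draw X* and Y* independently given each Z*_t
        X_star, Y_star = np.empty(n), np.empty(n)
        for t in range(n):
            w = weight((Z - Z_star[t]) / b)
            w /= w.sum()
            X_star[t] = X[rng.choice(n, p=w)] + b * rng.normal()
            Y_star[t] = Y[rng.choice(n, p=w)] + b * rng.normal()
        # steps (3)-(4): bootstrap statistic under H0
        if statistic(X_star, Y_star, Z_star) > S_n:
            count += 1
    return count / I                       # step (5): p-value

# Toy check with a simple placeholder statistic (|corr|), strongly dependent X, Y
rng = np.random.default_rng(5)
Z = rng.normal(size=60)
X = rng.normal(size=60)
Y = X + 0.1 * rng.normal(size=60)
corr_stat = lambda x, y, z: abs(np.corrcoef(x, y)[0, 1])
p_dep = local_bootstrap_pvalue(X, Y, Z, corr_stat, b=0.5, I=50, rng=rng)
```

In the paper, `statistic` would be the Hellinger or Euclidean distance between ĥ_1 and ĥ_2 evaluated at the sample points; the placeholder |corr| above only illustrates the resampling mechanics.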
We see that our local bootstrap procedure mitigates the latter, but we admit that the number of variables our nonparametric procedure can deal with is very small. Here, it might be promising to opt for semiparametric procedures (Chu and Glymour, 2008), which combine parametric and nonparametric approaches, rather than fully nonparametric ones.

5. Conclusions

The difficulty of learning causal relations from passive, that is, non-experimental, observations is one of the central challenges of econometrics. Traditional solutions involve the distinction between structural and reduced form models. The former is meant to formalize the unobserved data generating process, whereas the latter aims to describe a simpler transformation of that process. The structural model is articulated hinging on a priori economic theory. The reduced form model is formalized in such a way that it can be estimated directly from the data. In this paper, we have presented an approach to identifying the structural model which minimizes the role of a priori economic theory and emphasizes the need for an appropriate and rich statistical model of the data. Graphical


causal models, independent component analysis, and tests of conditional independence are the tools we propose for structural identification in vector autoregressive models. We conclude with an overview of some important issues which are left open in this domain.

1. Specification of the statistical model. Data-driven procedures for SVAR identification depend upon the specification of the (reduced form) VAR model. Therefore, it is important to make sure that the estimated VAR model is an accurate description of the dynamics of the included variables (whereas the contemporaneous structure is intentionally left out, as seen in Section 1.2). The usual criterion for accuracy is to check that the model's estimated residuals conform to white noise processes (although serial independence of residuals is not a sufficient criterion for model validation). This implies stable dependencies captured by the relationships among the modeled variables, and an unsystematic noise. It may be the case, as in many empirical applications, that different VAR specifications pass the model checking tests equally well. For example, a VAR with Gaussian errors and p lags may fit the data as well as a VAR with non-Gaussian errors and q lags, and these two specifications justify two different causal search procedures. So far, we do not know how to adjudicate among alternative and seemingly equally accurate specifications.

2. Background knowledge and assumptions. Search algorithms are based on different assumptions, such as, for example, causal sufficiency, acyclicity, the Causal Markov Condition, Faithfulness, and/or the existence of independent components. Background knowledge could perhaps justify some of these assumptions and reject others.
For example, institutional or theoretical knowledge about an economic process might inform us that Faithfulness is a plausible assumption in some contexts rather than in others, or, instead, that one should expect feedback loops if data are collected at certain levels of temporal aggregation. Yet, if background information could inform us here, this might again provoke the problem of circularity mentioned at the outset of the paper.

3. Search algorithms in nonparametric settings. We have provided some information on which nonparametric test procedures might be more appropriate in certain circumstances. However, it is not clear which causal search algorithms are most efficient in exploiting the nonparametric conditional independence tests proposed in Section 4. The more variables the search algorithm needs to be informed about at the same point of the search, the higher the number of conditioned variables, and hence the slower, or the more inaccurate, the test.

4. Number of shocks and number of variables. To conserve degrees of freedom, SVARs rarely model more than six to eight time series variables (Bernanke et al., 2005, p. 388). It is an open question how the procedures for causal inference we reviewed can be applied to large-scale systems such as dynamic factor models (Forni et al., 2000).

5. Simulations and empirical applications.
Graphical causal models for identifying SVARs, equivalent or similar to the search procedures described in Section 2, have been applied to several sets of macroeconomic data (Swanson and Granger, 1997; Bessler and Lee, 2002; Demiralp and Hoover, 2003; Moneta, 2004; Demiralp et al., 2008; Moneta, 2008; Hoover et al., 2009). Demiralp and Hoover (2003) present Monte Carlo


simulations to evaluate the performance of the PC algorithm for such an identification. There are no simulation results so far about the performance of the alternative tests on residual partial correlations presented in Section 2.2. Moneta et al. (2010) applied independent component analysis, as described in Section 3, to microeconomic US data about firms' expenditures on R&D and performance, as well as to macroeconomic US data about monetary policy and its effects on the aggregate economy. Hyvärinen et al. (2010) assess the performance of independent component analysis for identifying SVAR models. It is yet to be established how independent component analysis applied to SVARs fares compared to graphical causal models (based on the appropriate conditional independence tests) in non-Gaussian settings. Nonparametric tests of conditional independence, such as those proposed in Section 4, have been applied to test for Granger non-causality (Su and White, 2008), but there are not yet any applications where these test results inform a graphical causal search algorithm. Overall, there is a need for more empirical applications of the procedures described in this paper. Such applications will be useful to test, compare, and improve different search procedures, to suggest new problems, and to obtain new causal knowledge.

6. Appendix

6.1.
Appendix 1 - Details of the bootstrap procedure from Section 4.1.

(1) Draw a bootstrap sample Z*_t (for t = 1, ..., n) from the estimated kernel density f̂(z) = n^(−1) b^(−d) Σ_{t=1}^n K_p((Z_t − z)/b).

(2) For t = 1, ..., n, given Z*_t, draw X*_t and Y*_t independently from the estimated kernel densities f̂(x | Z*_t) and f̂(y | Z*_t), respectively.

(3) Using X*_t, Y*_t, and Z*_t, compute the bootstrap statistic S*_n using one of the distances defined above.

(4) Repeat steps (1) to (3) I times to obtain I statistics {S*_{ni}}_{i=1}^I.

(5) The p-value is then obtained by:

    p ≡ Σ_{i=1}^I 1{S*_{ni} > S_n} / I,

where S_n is the statistic obtained from the original data using one of the distances defined above, and 1{·} denotes an indicator function taking the value one if the expression between brackets is true and zero otherwise.

References

E. Baek and W. Brock. A general test for nonlinear Granger causality: Bivariate model. Discussion paper, Iowa State University and University of Wisconsin, Madison, 1992.


L. Baringhaus and C. Franz. On a new multivariate two-sample test. Journal of Multivariate Analysis, 88(1):190–206, 2004.

B. S. Bernanke. Alternative explanations of the money-income correlation. In Carnegie-Rochester Conference Series on Public Policy, volume 25, pages 49–99. Elsevier, 1986.

B. S. Bernanke, J. Boivin, and P. Eliasz. Measuring the effects of monetary policy: A factor-augmented vector autoregressive (FAVAR) approach. Quarterly Journal of Economics, 120(1):387–422, 2005.

D. A. Bessler and S. Lee. Money and prices: US data 1869-1914 (a study with directed graphs). Empirical Economics, 27:427–446, 2002.

O. J. Blanchard and D. Quah. The dynamic effects of aggregate demand and supply disturbances. The American Economic Review, 79(4):655–673, 1989.

O. J. Blanchard and M. W. Watson. Are business cycles all alike? The American Business Cycle: Continuity and Change, 25:123–182, 1986.

N. Chlaß and A. Moneta. Can graphical causal inference be extended to nonlinear settings? EPSA Epistemology and Methodology of Science, pages 63–72, 2010.

T. Chu and C. Glymour. Search for additive nonlinear time series causal models. The Journal of Machine Learning Research, 9:967–991, 2008.

P. Comon. Independent component analysis, a new concept? Signal Processing, 36(3):287–314, 1994.

S. Demiralp and K. D. Hoover. Searching for the causal structure of a vector autoregression. Oxford Bulletin of Economics and Statistics, 65:745–767, 2003.

S. Demiralp, K. D. Hoover, and D. J. Perez. A bootstrap method for identifying and evaluating a structural vector autoregression. Oxford Bulletin of Economics and Statistics, 65:745–767, 2008.

M. Eichler.
Granger causality and path diagrams for multivariate time series. Journal of Econometrics, 137(2):334–353, 2007.

J. Faust and E. M. Leeper. When do long-run identifying restrictions give reliable results? Journal of Business & Economic Statistics, 15(3):345–353, 1997.

J. P. Florens and M. Mouchart. A note on noncausality. Econometrica, 50(3):583–591, 1982.

M. Forni, M. Hallin, M. Lippi, and L. Reichlin. The generalized dynamic-factor model: Identification and estimation. Review of Economics and Statistics, 82(4):540–554, 2000.


C. W. J. Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica, 37(3):424–438, 1969.

C. W. J. Granger. Testing for causality: A personal viewpoint. Journal of Economic Dynamics and Control, 2:329–352, 1980.

T. Haavelmo. The probability approach in econometrics. Econometrica, 12:1–115, 1944.

C. Hiemstra and J. D. Jones. Testing for linear and nonlinear Granger causality in the stock price-volume relation. Journal of Finance, 49(5):1639–1664, 1994.

W. C. Hood and T. C. Koopmans. Studies in Econometric Method, Cowles Commission Monograph No. 14. New York: John Wiley & Sons, 1953.

K. D. Hoover. Causality in Macroeconomics. Cambridge University Press, 2001.

K. D. Hoover. The methodology of econometrics. New Palgrave Handbook of Econometrics, 1:61–87, 2006.

K. D. Hoover. Causality in economics and econometrics. In The New Palgrave Dictionary of Economics. London: Palgrave Macmillan, 2008.

K. D. Hoover, S. Demiralp, and S. J. Perez. Empirical identification of the vector autoregression: The causes and effects of US M2. In The Methodology and Practice of Econometrics: A Festschrift in Honour of David F. Hendry, pages 37–58. Oxford University Press, 2009.

P. O. Hoyer, A. Hyvärinen, R. Scheines, P. Spirtes, J. Ramsey, G. Lacerda, and S. Shimizu. Causal discovery of linear acyclic models with arbitrary distributions. In Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence, 2008a.

P. O. Hoyer, S. Shimizu, A. J. Kerminen, and M. Palviainen.
Estimation of causal effects using linear non-Gaussian causal models with hidden variables. International Journal of Approximate Reasoning, 49:362–378, 2008b.

A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. Wiley, 2001.

A. Hyvärinen, K. Zhang, S. Shimizu, and P. O. Hoyer. Estimation of a structural vector autoregression model using non-Gaussianity. Journal of Machine Learning Research, 11:1709–1731, 2010.

S. Johansen. Statistical analysis of cointegrating vectors. Journal of Economic Dynamics and Control, 12:231–254, 1988.

S. Johansen. Estimation and hypothesis testing of cointegrating vectors in Gaussian vector autoregressive models. Econometrica, 59:1551–1580, 1991.


S. Johansen. Cointegration: An overview. In Palgrave Handbook of Econometrics, Volume 1: Econometric Theory, pages 540–577. Palgrave Macmillan, 2006.

R. G. King, C. I. Plosser, J. H. Stock, and M. W. Watson. Stochastic trends and economic fluctuations. American Economic Review, 81:819–840, 1991.

T. C. Koopmans. Statistical Inference in Dynamic Economic Models, Cowles Commission Monograph No. 10. New York: John Wiley & Sons, 1950.

G. Lacerda, P. Spirtes, J. Ramsey, and P. O. Hoyer. Discovering cyclic causal models by independent components analysis. In Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence (UAI-2008), Helsinki, Finland, 2008.

R. E. Lucas. Econometric policy evaluation: A critique. In Carnegie-Rochester Conference Series on Public Policy, volume 1, pages 19–46. Elsevier, 1976.

H. Lütkepohl. Vector autoregressive models. In Palgrave Handbook of Econometrics, Volume 1: Econometric Theory, pages 477–510. Palgrave Macmillan, 2006.

A. Moneta. Graphical models for structural vector autoregressions. LEM Papers Series, Sant'Anna School of Advanced Studies, Pisa, 2003.

A. Moneta. Identification of monetary policy shocks: A graphical causal approach. Notas Económicas, 20:39–62, 2004.

A. Moneta. Graphical causal models and VARs: An empirical assessment of the real business cycles hypothesis. Empirical Economics, 35(2):275–300, 2008.

A. Moneta, D. Entner, P. O. Hoyer, and A. Coad. Causal inference by independent component analysis with applications to micro- and macroeconomic data. Jena Economic Research Papers, 2010:031, 2010.

E. Paparoditis and D. N. Politis. The local bootstrap for kernel estimators under general dependence conditions. Annals of the Institute of Statistical Mathematics, 52(1):139–159, 2000.

J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge, 2000.

M. Reale and G. T. Wilson. Identification of vector AR models with recursive structural errors using conditional independence graphs. Statistical Methods and Applications, 10:49–65, 2001.

T. Richardson and P. Spirtes. Automated discovery of linear feedback models. In Computation, Causation, and Discovery. AAAI Press and MIT Press, Menlo Park, 1999.


R. Scheines, P. Spirtes, C. Glymour, C. Meek, and T. Richardson. The TETRAD project: Constraint based aids to causal model specification. Multivariate Behavioral Research, 33(1):65–117, 1998.

M. D. Shapiro and M. W. Watson. Sources of business cycle fluctuations. NBER Macroeconomics Annual, 3:111–148, 1988.

S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7:2003–2030, 2006.

S. Shimizu, A. Hyvärinen, Y. Kawahara, and T. Washio. A direct method for estimating a causal ordering in a linear non-Gaussian acyclic model. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, 2009.

C. A. Sims. Macroeconomics and reality. Econometrica, 48:1–47, 1980.

C. A. Sims. An autoregressive index model for the U.S. 1948-1975. In J. Kmenta and J. B. Ramsey, editors, Large-Scale Macro-Econometric Models: Theory and Practice, pages 283–327. North-Holland, 1981.

P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, Cambridge, MA, 2nd edition, 2000.

L. Su and H. White. A nonparametric Hellinger metric test for conditional independence. Econometric Theory, 24(4):829–864, 2008.

P. Suppes. A Probabilistic Theory of Causality. Acta Philosophica Fennica, XXIV, 1970.

N. R. Swanson and C. W. J. Granger. Impulse response function based on a causal approach to residual orthogonalization in vector autoregressions. Journal of the American Statistical Association, 92:357–367, 1997.

G. J. Szekely and M. L. Rizzo.
Testing for equal distributions in high dimension. InterStat, 5, 2004.

M. P. Wand and M. C. Jones. Kernel Smoothing. Chapman & Hall, London, 1995.

A. H. Welsh, X. Lin, and R. J. Carroll. Marginal longitudinal nonparametric regression. Journal of the American Statistical Association, 97(458):482–493, 2002.

H. White and X. Lu. Granger causality and dynamic structural systems. Journal of Financial Econometrics, 8(2):193, 2010.

N. Wiener. The theory of prediction. Modern Mathematics for Engineers, Series 1:125–139, 1956.

S. Wright. Correlation and causation. Journal of Agricultural Research, 20(7):557–585, 1921.


A. Yatchew. Nonparametric regression techniques in economics. Journal of Economic Literature, 36(2):669–721, 1998.


JMLR: Workshop and Conference Proceedings 12:115–139, 2011    Causality in Time Series

Time Series Analysis with the Causality Workbench

Isabelle Guyon  isabelle@clopinet.com
ClopiNet, Berkeley, California

Alexander Statnikov  Alexander.Statnikov@nyumc.org
NYU Langone Medical Center, New York City

Constantin Aliferis  Constantin.Aliferis@nyumc.org
NYU Center for Health Informatics and Bioinformatics, New York City

Editors: Florin Popescu and Isabelle Guyon

Abstract

The Causality Workbench project is an environment to test causal discovery algorithms. Via a web portal (http://clopinet.com/causality), it provides a number of resources, including a repository of datasets, models, and software packages, and a virtual laboratory allowing users to benchmark causal discovery algorithms by performing virtual experiments to study artificial causal systems. We regularly organize competitions. In this paper, we describe what the platform offers for the analysis of causality in time series.

Keywords: Causality, Benchmark, Challenge, Competition, Time Series Prediction.

1. Introduction

Uncovering cause-effect relationships is central to many aspects of everyday life in both highly industrialized and developing countries: what affects our health, the economy, climate change, world conflicts, and which actions have beneficial effects?
Establishing causality is critical to guiding policy decisions in areas including medicine and pharmacology, epidemiology, climatology, agriculture, economy, sociology, law enforcement, and manufacturing.

One important goal of causal modeling is to predict the consequences of given actions, also called interventions, manipulations, or experiments. This is fundamentally different from the classical machine learning, statistics, or data mining setting, which focuses on making predictions from observations. Observations imply no manipulation of the system under study, whereas actions introduce a disruption in its natural functioning. In the medical domain, this is the distinction made between "diagnosis" and "prognosis" (prediction of diseases or disease evolution from observations) and "treatment" (intervention). For instance, smoking and coughing might both be predictive of respiratory disease and helpful for diagnostic purposes. However, if smoking is a cause and coughing a consequence, acting on the cause (smoking) can change your health status, but acting on the symptom or consequence (coughing) cannot. Thus it


is extremely important to distinguish between causes and consequences in order to predict the result of actions, like predicting the effect of forbidding smoking in public places.

The need for assisting policy making while reducing the cost of experimentation, together with the availability of massive amounts of "observational" data, prompted the proliferation of proposed computational causal discovery techniques (Glymour and Cooper, 1999; Pearl, 2000; Spirtes et al., 2000; Neapolitan, 2003; Koller and Friedman, 2009), but it is fair to say that, to this day, they have not been widely adopted by scientists and engineers. Part of the problem is the lack of appropriate evaluation and demonstration of the usefulness of the methods on a range of pilot applications. To fill this need, we started a project called the "Causality Workbench", which offers the possibility of exposing the research community to challenging causal problems and disseminating newly developed causal discovery technology. In this paper, we outline our setup and methods and the possibilities offered by the Causality Workbench for solving problems of causal inference in time series analysis.

2. Causality in Time Series

Causal discovery is a multi-faceted problem. The definition of causality itself has eluded philosophers of science for centuries, even though the notion of causality is at the core of the scientific endeavor and is also a universally accepted and intuitive notion of everyday life.
But the lack of broadly acceptable definitions of causality has not prevented the development of successful and mature mathematical and algorithmic frameworks for inducing causal relationships.

Causal relationships are frequently modeled by causal Bayesian networks or structural equation models (SEM) (Pearl, 2000; Spirtes et al., 2000; Neapolitan, 2003). In the graphical representation of such models, an arrow between two variables, A → B, indicates the direction of a causal relationship: A causes B. A node in the graph corresponding to a particular variable X represents a "mechanism" to evaluate the value of X given the values of the "parent" node variables (immediate antecedents in the graph). For Bayesian networks, this evaluation is carried out by a conditional probability distribution P(X | Parents(X)), while for structural equation models it is carried out by a function of the parent variables and a noise model.

Our everyday-life concept of causality is very much linked to time dependencies (causes precede their effects). Hence an intuitive interpretation of an arrow in a causal network representing "A causes B" is that A preceded B.¹ But, in reality, Bayesian networks are a graphical representation of a factorization of conditional probabilities, hence a pure mathematical construct. The arrows in a "regular" Bayesian network (not a "causal Bayesian network") do not necessarily represent either causal relationships or precedence, which often creates some confusion. In particular, many machine learning problems are concerned with stationary systems or "cross-sectional studies", which are studies where many samples are drawn at a given point in time. Thus, the reference to time in Bayesian networks is sometimes replaced by the notion of "causal ordering". Causal ordering can be understood as fixing a particular time scale and considering only causes happening at time t and effects happening at time t + δt, where δt can be made as small as we want. Within this framework, causal relationships may be inferred from data including no explicit reference to time. Causal clues in the absence of temporal information include conditional independencies between variables and loss of information due to irreversible transformations or the corruption of signals by noise (Sun et al., 2006; Zhang and Hyvärinen, 2009).

It seems reasonable to think that temporal information should resolve many causal relationship ambiguities. Yet the addition of the time dimension simplifies the problem of inferring causal relationships only to a limited extent. For one, it reduces, but does not eliminate, the problem of confounding: a correlated event A happening in the past of event B cannot be a consequence of B; however, it is not necessarily a cause, because a previous event C might have been a "common cause" of A and B. Secondly, it opens the door to many subtle modeling questions, including problems arising with modeling dynamic systems, which may or may not be stationary.

1. More precise semantics have been developed. Such semantics assume discrete time point or interval time models and allow for continuous or episodic "occurrences" of the values of a variable, as well as overlapping or non-overlapping intervals (Aliferis, 1998). Such practical semantics in Bayesian networks allow for abstracted and explicit time.
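The confounding point above can be made concrete with a minimal simulation (a hypothetical toy system, not one of the workbench tasks): a hidden process C drives A with a one-step lag and B with a two-step lag, so A reliably precedes and predicts B without causing it.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10000

# Hidden common cause C drives A at lag 1 and B at lag 2,
# so A always precedes B but does not cause it.
C = rng.normal(size=T)
A = np.zeros(T)
B = np.zeros(T)
for t in range(2, T):
    A[t] = 0.9 * C[t - 1] + 0.1 * rng.normal()
    B[t] = 0.9 * C[t - 2] + 0.1 * rng.normal()

# Observationally, past A is strongly correlated with future B ...
r_obs = np.corrcoef(A[2:-1], B[3:])[0, 1]

# ... but intervening on A (overwriting it with fresh noise) leaves B
# untouched, because B depends only on C.
A_do = rng.normal(size=T)
r_do = np.corrcoef(A_do[2:-1], B[3:])[0, 1]

print(f"observational correlation: {r_obs:.2f}, after intervention: {r_do:.2f}")
```

Temporal precedence alone would wrongly suggest A → B here; only the intervention (or conditioning on the confounder C) reveals that the dependence is spurious.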
One of the charters of our Causality Workbench project is to collect problems of both practical and academic interest to push the envelope of research in inferring causal relationships from time series analysis.

3. A Virtual Laboratory

Methods for learning cause-effect relationships without experimentation (learning from observational data) are attractive because observational data are often available in abundance, while experimentation may be costly, unethical, impractical, or even plain impossible. Still, many causal relationships cannot be ascertained without recourse to experimentation, and the use of a mix of observational and experimental data might be more cost-effective. We implemented a Virtual Lab allowing researchers to perform experiments on artificial systems to infer their causal structure. The design of the platform is such that researchers can submit new artificial systems for others to experiment with, experimenters can place queries and get answers, the activity is logged, and registered users have their own virtual lab space. This environment allows researchers to test computational causal discovery algorithms and, in particular, to test whether the modeling assumptions they make hold in real and simulated data.

We have released a first version at http://www.causality.inf.ethz.ch/workbench.php. We plan to attach to the virtual lab sizeable realistic simulators such as the Spatiotemporal Epidemiological Modeler (STEM), an epidemiology simulator developed at IBM, now publicly available at http://www.eclipse.org/stem/. The virtual lab was put to work in a recent challenge we organized on the problem of "Active Learning" (see http://clopinet.com/al).
More details on the virtual lab are given in the appendix.


4. A Data Repository

Part of our benchmarking effort is dedicated to collecting problems from diverse application domains. Via the organization of competitions, we have successfully channeled the effort of dozens of researchers to solve new problems of scientific and practical interest and identified effective methods. However, competition without collaboration is sterile. Recently, we have started introducing new dimensions to our effort of research coordination: stimulating creativity, collaborations, and data exchange. We are organizing regular teleconference seminars. We have created a data repository for the Causality Workbench, already populated by 15 datasets. All the resources, which are the product of our effort, are freely available on the Internet at http://clopinet.com/causality.
The repository already includes several time series datasets illustrating problems of practical and academic interest (see Table 1):

- Learning the structure of a fairly complex dynamic system that disobeys equilibration-manipulation commutability, and predicting the effect of manipulations that do not cause instabilities (the MIDS task) (Voortman et al., 2010);

- Learning causal relationships using time series when noise is corrupting data in a way that makes classical "Granger causality" fail (the NOISE task) (Nolte et al., 2010);

- Uncovering which promotions most affect sales in a marketing database (the PROMO task) (Pellet, 2010);

- Identifying, in a manufacturing process (wafer production), faulty steps affecting a performance metric (the SEFTI task) (Tuv, 2010);

- Modeling a biological signalling process (the SIGNET task) (Jenkins, 2010).

The donor of the dataset NOISE (Guido Nolte) received the best dataset award. The reviewers appreciated that the task includes both real data from EEG time series and artificial data modeling EEG. We want to encourage future data donors to move in this direction.

5. Benchmarks and Competitions

Our effort has been gaining momentum with the organization of two challenges, each of which attracted over 50 participants.
The first causality challenge we organized (the Causation and Prediction challenge, December 15, 2007 - April 30, 2008) allowed researchers from both the causal discovery community and the machine learning community to try their algorithms on sizable tasks of real practical interest in medicine, pharmacology, and sociology (see http://www.causality.inf.ethz.ch/challenge.php). The goal was to train models exclusively on observational data, then make predictions of a target variable on data collected after interventions on the system under study were performed. This first challenge reached a number of goals that we had set ourselves: familiarizing many new researchers and practitioners with causal discovery problems and existing tools to address them, pointing out the limitations of current


Table 1: Time-dependent datasets. "TP" is the data type, "NP" the number of participants who returned results, and "V" the number of views as of January 2011. The semi-artificial datasets are obtained from simulators of real tasks. N is the number of variables, T the number of time samples (not necessarily evenly spaced), and R the number of simulations with different initial states or conditions.

MIDS (Artificial; NA; 794)
  Size: T=12 sampled values in time (unevenly spaced); R=10000 simulations; N=9 variables.
  Description: Mixed Dynamic Systems. Simulated time series based on linear Gaussian models with no latent common causes, but with multiple dynamic processes.
  Objective: Use the training data to build a model able to predict the effects of manipulations on the system in test data.

NOISE (Real + artificial; NA; 783)
  Size: Artificial: T=6000 time points; R=1000 simulations; N=2 variables. Real: R=10 subjects; T≃200000 points sampled at 256 Hz; N=19 channels.
  Description: Real and simulated EEG data. Learning causal relationships using time series when noise is corrupting data, causing the classical Granger causality method to fail.
  Objective: Artificial task: find the causal direction in pairs of variables. Real task: find which brain regions influence each other.

PROMO (Semi-artificial; 3; 1601)
  Size: T=365*3 days; R=1 simulation; N=1000 promotions + 100 products.
  Description: Simulated marketing task. Daily values of 1000 promotions and 100 product sales for three years, incorporating seasonal effects.
  Objective: Predict a 1000x100 boolean matrix of causal influences of promotions on product sales.

SEFTI (Semi-artificial; NA; 908)
  Size: R=4000 manufacturing lots; T=300 asynchronous operations (pairs of values {one of N=25 tool IDs, date of processing}) + continuous target (circuit performance for each lot).
  Description: Semiconductor manufacturing. Each wafer undergoes 300 steps, each involving one of 25 tools. A regression problem for quality control of end-of-line circuit performance.
  Objective: Find the tools that are guilty of performance degradation, and eventual interactions and the influence of time.

SIGNET (Semi-artificial; 2; 2663)
  Size: T=21 asynchronous state updates; R=300 pseudodynamic simulations; N=43 rules.
  Description: Abscisic Acid Signaling Network. Model inspired by a true biological signaling network.
  Objective: Determine the set of 43 boolean rules that describe the network.
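As a point of reference for the NOISE task, the classical bivariate Granger test it stresses can be sketched as a pair of lag regressions (a generic illustration on simulated series, not the challenge's reference implementation): x is declared a Granger-cause of y if x's past lowers the prediction error of y beyond what y's own past achieves.

```python
import numpy as np

rng = np.random.default_rng(1)
T, p = 2000, 2  # series length, lag order

# Toy pair in which x genuinely drives y with a one-step delay.
x = np.zeros(T)
y = np.zeros(T)
for t in range(1, T):
    x[t] = 0.5 * x[t - 1] + rng.normal()
    y[t] = 0.5 * y[t - 1] + 0.4 * x[t - 1] + rng.normal()

def resid_var(target, predictors):
    """Residual variance of a least-squares fit of target on the predictors."""
    X = np.column_stack(predictors + [np.ones(len(target))])
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    return np.var(target - X @ beta)

y_now = y[p:]
y_lags = [y[p - k:T - k] for k in range(1, p + 1)]
x_lags = [x[p - k:T - k] for k in range(1, p + 1)]

# Granger's criterion: compare y's prediction error with and without x's past.
v_restricted = resid_var(y_now, y_lags)
v_full = resid_var(y_now, y_lags + x_lags)
print(f"restricted: {v_restricted:.2f}, full: {v_full:.2f}")
```

In practice the variance drop is turned into an F statistic; the point of the NOISE task is precisely that realistic noise can corrupt the data so that this criterion fails.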


methods on some particular difficulties, and fostering the development of new algorithms. The results indicated that causal discovery from observational data is not an impossible task, but a very hard one, and pointed to the need for further research and benchmarks (Guyon et al., 2008). The Causal Explorer package (Aliferis et al., 2003), which we had made available to the participants and which is downloadable as shareware, proved to be competitive and is a good starting point for researchers new to the field. It is a Matlab toolkit supporting "local" causal discovery algorithms, efficient at discovering the causal structure around a target variable, even for a large number of variables. The algorithms are based on structure learning from tests of conditional independence, as were all the top-ranking methods in this first challenge.

The first challenge (Guyon et al., 2008) explored an important problem in causal modeling, but is only one of many possible problem statements. The second challenge (Guyon et al., 2010), called "competition pot-luck", aimed at enlarging the scope of causal discovery algorithm evaluation by inviting members of the community to submit their own problems and/or solve problems proposed by others. The challenge started September 15, 2008 and ended November 20, 2008; see http://www.causality.inf.ethz.ch/pot-luck.php. One task proposed by a participant drew a lot of attention: the cause-effect pair task. The problem was to try to determine, in pairs of variables (of known causal relationships), which one was the cause of the other.
This problem is hard for a lot of algorithms, which rely on the results of conditional independence tests of three or more variables. Yet the winners of the challenge succeeded in unraveling 8/8 correct causal directions (Zhang and Hyvärinen, 2009).

Our planned challenge ExpDeCo (Experimental Design in Causal Discovery) will benchmark methods of experimental design in application to causal modeling. The goal will be to identify effective methods to unravel causal models, requiring a minimum of experimentation, using the Virtual Lab. A budget of virtual cash will be allocated to participants to "buy" the right to observe or manipulate certain variables, manipulations being more expensive than observations. The participants will have to spend their budget optimally to make the best possible predictions on test data. This setup lends itself to incorporating problems of relevance to development projects, in particular in medicine and epidemiology, where experimentation is difficult, while developing new methodology.

We are planning another challenge called CoMSICo, for "Causal Models for System Identification and Control", which is more ambitious in nature because it will perform a continuous evaluation of causal models rather than separating training and test phases.
In contrast with ExpDeCo, in which the organizers will provide test data with prescribed manipulations to test the ability of the participants to predict the consequences of actions, in CoMSICo the participants will be in charge of making their own plan of action (policy) to optimize an overall objective (e.g., improve the life expectancy of a population, improve the GNP, etc.), and they will be judged directly on this objective, on an ongoing basis, with no distinction between "training" and "test" data. This challenge will also be run via the Virtual Lab. The participants will be given an initial amount of virtual cash, and, as previously, both actions and observations will


have a price. New in CoMSICo, virtual cash rewards will be given for achieving good intermediate performance, which the participants will be allowed to re-invest to conduct additional experiments and improve their plan of action (policy). The winner will be the participant ending up with the largest amount of virtual cash.

6. Conclusion

Our program of data exchange and benchmarks proposes to challenge the research community with a wide variety of problems from many domains and focuses on realistic settings. Causal discovery is a problem of fundamental and practical interest in many areas of science and technology, and there is a need for assisting policy making in all these areas while reducing the costs of data collection and experimentation. Hence, the identification of efficient techniques to solve causal problems will have a widespread impact. By choosing applications from a variety of domains and making connections between disciplines as varied as machine learning, causal discovery, experimental design, decision making, optimization, system identification, and control, we anticipate that there will be a lot of cross-fertilization between different domains.

Acknowledgments

This project is an activity of the Causality Workbench supported by the Pascal network of excellence funded by the European Commission and by the U.S. National Science Foundation under Grant No. ECCS-0725746.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. We are very grateful to all the members of the Causality Workbench team for their contributions, and in particular to our co-founders Constantin Aliferis, Greg Cooper, André Elisseeff, Jean-Philippe Pellet, Peter Spirtes, and Alexander Statnikov.

References

C. F. Aliferis, I. Tsamardinos, A. Statnikov, and L. E. Brown. Causal Explorer: A probabilistic network learning toolkit for biomedical discovery. In 2003 International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences (METMBS), Las Vegas, Nevada, USA, June 23-26, 2003. CSREA Press.

Constantin Aliferis. A Temporal Representation and Reasoning Model for Medical Decision-Support Systems. PhD thesis, University of Pittsburgh, 1998.

C. Glymour and G. F. Cooper, editors. Computation, Causation, and Discovery. AAAI Press/The MIT Press, Menlo Park, California, Cambridge, Massachusetts, London, England, 1999.

I. Guyon, C. Aliferis, G. Cooper, A. Elisseeff, J.-P. Pellet, P. Spirtes, and A. Statnikov. Design and analysis of the causation and prediction challenge. In JMLR W&CP,


volume 3, pages 1–33, WCCI2008 workshop on causality, Hong Kong, June 3-4, 2008.

I. Guyon, D. Janzing, and B. Schölkopf. Causality: Objectives and assessment. JMLR W&CP, 6:1–38, 2010.

Jerry Jenkins. Signet: Boolean rule determination for abscisic acid signaling. In Causality: Objectives and Assessment (NIPS 2008), volume 6, pages 215–224. JMLR W&CP, 2010.

Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

R. E. Neapolitan. Learning Bayesian Networks. Prentice Hall Series in Artificial Intelligence. Prentice Hall, 2003.

G. Nolte, A. Ziehe, N. Krämer, F. Popescu, and K.-R. Müller. Comparison of Granger causality and phase slope index. In Causality: Objectives and Assessment (NIPS 2008), volume 6, pages 267–276. JMLR W&CP, 2010.

J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.

J.-P. Pellet. Detecting simple causal effects in time series. In Causality: Objectives and Assessment (NIPS 2008). JMLR W&CP volume 6, supplemental material, 2010.

P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. The MIT Press, Cambridge, Massachusetts, London, England, 2000.

X. Sun, D. Janzing, and B. Schölkopf. Causal inference by choosing graphs with most plausible Markov kernels. In Ninth International Symposium on Artificial Intelligence and Mathematics, 2006.

E. Tuv. Pot-luck challenge: Tied. In Causality: Objectives and Assessment (NIPS 2008). JMLR W&CP volume 6, supplemental material, 2010.

M. Voortman, D. Dash, and M. J. Druzdzel.
Learning causal models that make correct manipulation predictions. In Causality: Objectives and Assessment (NIPS 2008), volume 6, pages 257–266. JMLR W&CP, 2010.

K. Zhang and A. Hyvärinen. Distinguishing causes from effects using nonlinear acyclic causal models. In Causality: Objectives and Assessment (NIPS 2008), volume 6, pages 157–164. JMLR W&CP, 2009.


