Inferring Hidden Statuses and Actions in Video by Causal Reasoning
Amy Fire and Song-Chun Zhu


In the physical world, cause and effect are inseparable: ambient conditions trigger humans to perform actions, thereby driving status changes of objects. In video, these actions and statuses may be hidden due to ambiguity, occlusion, or because they are otherwise unobservable, but humans nevertheless perceive them. In this paper, we extend the Causal And-Or Graph (C-AOG) to a sequential model representing actions and their effects on objects over time, and we build a probability model for it. For inference, we apply a Viterbi algorithm, grounded on probabilistic detections from video, to fill in hidden and misdetected actions and statuses. We analyze our method on a new video dataset that showcases causes and effects. Our results demonstrate the effectiveness of reasoning with causality over time.

Fluents are specifically those object statuses that change over time.

Reasoning over Time

Causal relationships pass information between actions and fluents over time. This long-term reasoning fills in hidden and missing actions and fluents consistently.
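The idea of passing information over time to fill in hidden or misdetected fluents can be illustrated with a minimal Viterbi sketch. This is not the released inference code and not the full C-AOG model: it assumes a single hypothetical two-valued fluent (a light that is "off" or "on"), made-up transition probabilities, and per-frame detection scores supplied by hand. Actions are omitted entirely. All names and numbers below are illustrative assumptions.

```python
import math

# Hypothetical fluent values for one object (for example, a light).
STATES = ("off", "on")

# Assumed transition probabilities: a fluent tends to keep its value
# from one frame to the next. These numbers are illustrative only.
TRANSITION = {
    "off": {"off": 0.9, "on": 0.1},
    "on": {"off": 0.1, "on": 0.9},
}
PRIOR = {"off": 0.5, "on": 0.5}


def viterbi(detections):
    """Return the most probable fluent sequence for a list of frames.

    detections: one dict per frame mapping each state to
    P(observation | state). A frame where the fluent is hidden can be
    given equal scores for every state.
    """
    # best[s] is the log-probability of the best path ending in state s.
    best = {s: math.log(PRIOR[s]) + math.log(detections[0][s]) for s in STATES}
    backpointers = []
    for frame in detections[1:]:
        new_best, pointers = {}, {}
        for s in STATES:
            # Pick the predecessor that maximizes the path score into s.
            prev = max(STATES, key=lambda p: best[p] + math.log(TRANSITION[p][s]))
            new_best[s] = best[prev] + math.log(TRANSITION[prev][s]) + math.log(frame[s])
            pointers[s] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Frame 3 carries a weak, conflicting detection; its neighbors are confident.
frames = [
    {"off": 0.9, "on": 0.1},
    {"off": 0.9, "on": 0.1},
    {"off": 0.4, "on": 0.6},
    {"off": 0.9, "on": 0.1},
    {"off": 0.9, "on": 0.1},
]
print(viterbi(frames))
```

In this toy setting the weak "on" detection in the third frame is overridden, because switching the fluent on and back off costs two unlikely transitions. A hidden frame (equal scores for both states) is filled in the same way, from the evidence on either side of it.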

Fluent Examples

Inference Example


  title={Inferring Hidden Statuses and Actions in Video by Causal Reasoning},
  author={Fire, A. and Zhu, S.-C.},
  booktitle = {CVPR Workshop: Vision Meets Cognition}


The code to run inference is available on GitHub. It is primarily Python, with some analysis in R (mostly for older experiments). A file in the repository explains the basic workflow for using the code and gives a rough sketch of its inner workings.

The "minimum data specific to replicating the experiments" described below can also serve as a quick-start guide: once downloaded, it shows the flow of the code and provides a foundation for experimenting.


Inference Dataset Summary

The full dataset is roughly 18 GB. Note that this "full dataset" excludes the "minimum data specific to replicating the experiments" below.

The minimum data specific to replicating the experiments is roughly 1 MB. It includes only the action and fluent detections used in the paper.