
Cause and Effect: Can Large Language Models Truly Understand Causality?


Swagata Ashwani1, Kshiteesh Hegde2, Nishith Reddy Mannuru3, Dushyant Singh Sengar4, Mayank Jindal4, Krishna Chaitanya Rao Kathala5, Dishant Banga6, Vinija Jain7, Aman Chadha7,8*

1 Carnegie Mellon University (sashwani@alumni.cmu.edu), 2 Rensselaer Polytechnic Institute, 3 University of North Texas, 4 Independent Researcher, 5 University of Massachusetts, 6 Bridgetree, 7 Stanford University, 8 Amazon GenAI

arXiv:2402.18139v3 [cs.CL] 30 Sep 2024

* Work done outside position at Amazon. Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

With the rise of Large Language Models (LLMs), it has become crucial to understand their capabilities and limitations in deciphering and explaining the complex web of causal relationships that language entails. Current methods use either explicit or implicit causal reasoning, yet there is a strong need for a unified approach combining both to tackle a wide array of causal relationships more effectively. This research proposes a novel architecture called Context-Aware Reasoning Enhancement with Counterfactual Analysis (CARE-CA) to enhance causal reasoning and explainability. The proposed framework incorporates an explicit causal detection module with ConceptNet and counterfactual statements, as well as implicit causal detection through LLMs. Our framework goes one step further with a layer of counterfactual explanations to accentuate LLMs' understanding of causality. The knowledge from ConceptNet enhances the performance of multiple causal reasoning tasks such as causal discovery, causal identification, and counterfactual reasoning. The counterfactual sentences add explicit knowledge of 'not caused by' scenarios. By combining these powerful modules, our model aims to provide a deeper understanding of causal relationships, enabling enhanced interpretability. Evaluation on benchmark datasets shows improved performance across all metrics, including accuracy, precision, recall, and F1 scores. We also present CausalNet, a novel dataset specifically curated to benchmark and enhance the causal reasoning capabilities of LLMs. This dataset is accompanied by code designed to facilitate further research in this domain.

Introduction

As Large Language Models (LLMs) play an increasingly central role in technology, their ability to understand and logically navigate causal relationships becomes essential, since it directly affects the trust users place in them (Kıçıman et al. 2023). This skill is paramount for refining the depth and applicability of LLMs in complex scenarios, driving advancements that hinge on nuanced interpretations of cause and effect.

Figure 1: Causal reasoning without CARE-CA: Given the premise "My body cast a shadow over the grass.", the left hypothesis, "The sun was rising," should be identified as the cause to arrive at the correct hypothesis conclusion.

Given the growing reliance on AI systems to make consequential, mission-critical decisions, we need to enhance the causal reasoning capabilities of LLMs. Prior research (Weng et al. 2023; Zhang et al. 2023) has revealed significant limitations in LLMs' causal reasoning capabilities. While they may mimic causal language, most lack a genuine comprehension of causal mechanisms.
This is concerning, as it could propagate misinformation or lead to unreliable predictions. Bridging this causal reasoning gap is an active area of research. Enhancing the causal reasoning capabilities of LLMs can significantly impact their reliability and trustworthiness across many applications. A more robust causal understanding in LLMs could improve healthcare and public policy decision-making (Peña et al. 2023). It also promises to enhance interpretability and transparency. However, prevailing approaches lack flexibility and depth in causal inference.

This work investigates whether advanced models such as BERT (Devlin et al. 2018), RoBERTa (Liu et al. 2019), XLM-RoBERTa (Conneau et al. 2019), ALBERT (Lan et al. 2019), DeBERTa (He et al. 2020), Llama 2 (Touvron et al. 2023), T5 (Raffel et al. 2020), Mistral (Jiang et al. 2023), GPT-3.5 (OpenAI 2024), and Gemini Pro (Team et al. 2023) can truly grasp and articulate causal relationships, a cornerstone in the journey towards Artificial General Intelligence (AGI). We explore this through a blend of theoretical analysis and empirical investigation, focusing on the capability of LLMs to comprehend and articulate causality in the literal sense.

Building on this foundation, we introduce the CARE-CA framework, a novel architecture designed to amplify the causal reasoning competence of LLMs. The CARE-CA framework is distinct in its use of explicit knowledge integration from resources like ConceptNet (Speer, Chin, and Havasi 2017) and implicit reasoning patterns derived from models such as BERT. This dual approach bridges the gap between knowledge-driven and data-driven inference. It enhances the model's performance across four critical domains of causal reasoning: Causal Relationship Identification, Causal Discovery, Causal Explanation, and Counterfactual Reasoning.

We present a comprehensive suite of evaluation metrics, including accuracy, F1, precision, recall, and human evaluation, to assess and compare the performance of existing LLMs against our proposed CARE-CA framework. Furthermore, we introduce a new dataset, CausalNet, which, we experimentally demonstrate, boosts LLMs' causal reasoning ability. CausalNet is poised to serve as a benchmark for future advancements in this field, providing a rigorous testing ground for emerging AI models. By uniting explicit and implicit causal modules alongside contextual and counterfactual enhancements, this research nudges LLMs towards improved causal reasoning, a pivotal step in unraveling AI's black box and realizing more trustworthy, explainable systems.

Figure 2: Causal reasoning enhanced with CARE-CA: Starting from a premise, causal hypotheses are evaluated. Integration of external knowledge from ConceptNet enhances understanding ('shadows' related to 'light source'). Contextual prompting adapts hypotheses to the time of day. Counterfactual reasoning explores alternative scenarios ('What if there were no light source?'). Improved causal reasoning is achieved by incorporating context and counterfactuals, leading to the identification of the correct hypothesis.

Related Work

Prior research has explored various approaches to understanding and enhancing the causal reasoning capabilities of LLMs. There have been claims that LLMs can only mimic causal language and lack genuine causal understanding, earning them the label of "causal parrots" (Zečević et al. 2023). Assessing the ability of LLMs to answer causal questions, and discussing their strengths and weaknesses, is therefore vital. To this end, we further explore the potential of integrating explicit and implicit causal modules to improve LLM performance (Zhang et al. 2023); this is a key principle underlying our CARE-CA framework.

One remarkable work is the CRAB benchmark (Romanou et al. 2023), which evaluates the ability of LLMs to infer causal relationships between real-world events. The authors found that while LLMs can perform well on certain causal reasoning tasks, they struggle with more complex scenarios that require a deeper understanding of causality. Another work investigated whether LLMs can infer causation from correlation, a crucial skill for causal reasoning (Jin et al. 2023b). Their findings suggest that while LLMs can learn some causal patterns, they often fail to distinguish between causal and non-causal relationships, highlighting the need for more targeted approaches. Additionally, Jin et al. (2021) explored the impact of the causal direction of data collection on the performance of LLMs in causal reasoning tasks. They found that models trained on data with a specific causal direction perform better on tasks that align with that direction, underscoring the importance of dataset design in causal reasoning research. These studies provide a solid foundation for understanding the current state of causal reasoning in LLMs.

Given the widespread implications of LLMs' causal reasoning capabilities, we aim to enhance the effectiveness of all four aspects of causal reasoning, in addition to the LLM evaluation work done before (Zhuang et al. 2023). Our method specifically focuses on enhancing causal reasoning by incorporating explicit knowledge from knowledge graphs such as ConceptNet. While various past works have demonstrated the superior performance of GPT-3.5 and Gemini Pro in certain causal reasoning tasks, they did not provide a concrete architecture to enhance these capabilities. In contrast, our CARE-CA framework goes a step further by proposing a novel hybrid approach that combines explicit causal knowledge from resources like ConceptNet (Speer, Chin, and Havasi 2017) with the implicit reasoning capabilities of LLMs.

Contributions: CARE-CA aims to provide a more comprehensive and effective solution for tackling a wider array of causal reasoning tasks by incorporating counterfactual reasoning and contextual prompting. Unlike previous methods that relied on either explicit or implicit causal reasoning, CARE-CA's integration of these two complementary approaches sets it apart, allowing for a more robust and flexible causal understanding. This distinction enables CARE-CA to potentially outperform existing techniques in tasks such as causal relationship identification, counterfactual reasoning, and causal discovery, as demonstrated in our experimental evaluation. Furthermore, our methodological advancements are showcased through the development and utilization of the CausalNet dataset, specifically designed to benchmark and refine the causal reasoning capabilities of LLMs.
By focusing on the four key aspects of causal reasoning (Causal Relationship Identification, Counterfactual Reasoning, Causal Discovery, and Causal Explanation), CARE-CA represents a comprehensive approach to enhancing LLMs' causal reasoning faculties.

Approach

Our approach combines the explicit, structured causal reasoning of ConceptNet knowledge graphs with counterfactual sentences to improve the causal understanding of LLMs. This architecture aims to surpass traditional decoder-only or encoder-only models by coupling the rich semantic knowledge base of ConceptNet with advanced contextual inference capabilities and 'alternate scenario' variants of the contextual sentences, further aiding LLMs in understanding the causality of a scenario. Together, these components provide the relevant contextual information LLMs need to reason about the causal question at hand. We carry out a single-variable test on CARE-CA, comparing performance using accuracy, recall, precision, and F1 scores. We illustrate the components of CARE-CA in Figure 3 and briefly expand on the critical components below.

1. Contextual Knowledge Integrator (CKI) enriches the model's reasoning process with a relevant external knowledge graph, ConceptNet, providing a deep contextual backdrop against which causal relationships can be examined.

2. Counterfactual Reasoning Enhancer (CRE) introduces hypothetical 'what-if' scenarios to test and refine the model's causal inferences, ensuring that identified causal links are robust and not merely correlational.

3. Context-Aware Prompting Mechanism (CAPM) crafts tailored prompts that encapsulate the enriched context and counterfactual insights, directing LLMs toward more precise and accurate causal reasoning.

The CARE-CA framework's unique strength lies in its seamless integration of structured knowledge from ConceptNet with the contextual understanding capabilities of LLMs. We extract relevant concepts and causal relationships from ConceptNet based on the input scenario. The extracted knowledge is transformed into natural language statements and embedded into the context provided to the LLM. We then generate counterfactual scenarios using the ConceptNet information to encourage more robust causal reasoning. To illustrate, consider this example:

Input scenario: "After heavy rain, the streets were flooded."

The following steps are then undertaken:

1. ConceptNet extraction: Relevant concepts (rain, flood, street) and relationships (Rain CapableOf CauseFlooding) are extracted.

2. Contextual embedding: We add context such as "Rain is capable of causing flooding, especially in urban areas with poor drainage."

3. Counterfactual enhancement: We introduce a counterfactual scenario such as "If the city had better drainage systems, would the streets still flood after heavy rain?"

This integrated approach combines structured, explicit causal knowledge from ConceptNet with the flexible, context-aware reasoning of LLMs, resulting in more robust and nuanced causal reasoning capabilities. We evaluate CARE-CA's performance using accuracy, recall, precision, and F1 scores in a single-variable test.
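To make the flow above concrete, the following is a minimal sketch of the CKI-CRE-CAPM pipeline as we read it, not the released implementation. It assumes the public ConceptNet REST API at api.conceptnet.io, and the helper names (conceptnet_edges, build_counterfactual, build_prompt, query_llm) are hypothetical; query_llm is left as a placeholder for whichever LLM is being evaluated.

```python
# Sketch of the CARE-CA flow described above (CKI -> CRE -> CAPM).
# Assumptions: the public ConceptNet REST API (api.conceptnet.io) is reachable;
# `query_llm` is a placeholder for the LLM under evaluation.
import requests

def conceptnet_edges(concept: str, limit: int = 5) -> list[str]:
    """CKI: fetch natural-language relation statements for a concept from ConceptNet."""
    url = f"http://api.conceptnet.io/c/en/{concept}"
    edges = requests.get(url, params={"limit": limit}, timeout=10).json().get("edges", [])
    statements = []
    for e in edges:
        # Prefer ConceptNet's own surface text; fall back to "start relation end".
        text = e.get("surfaceText") or f"{e['start']['label']} {e['rel']['label']} {e['end']['label']}"
        statements.append(text.replace("[", "").replace("]", ""))
    return statements

def build_counterfactual(scenario: str) -> str:
    """CRE: a simple template-based 'what-if' probe (the paper's examples are hand-written)."""
    return f"Counterfactual: if the key causal factor were absent, would the outcome in '{scenario}' still hold?"

def build_prompt(scenario: str, concepts: list[str], hypotheses: list[str]) -> str:
    """CAPM: assemble enriched context, counterfactual, and hypotheses into one prompt."""
    context = " ".join(s for c in concepts for s in conceptnet_edges(c, limit=3))
    lines = [f"Context: {context}",
             build_counterfactual(scenario),
             f"Premise: {scenario}",
             "Which hypothesis is the more plausible cause?"]
    lines += [f"Hypothesis {i + 1}: {h}" for i, h in enumerate(hypotheses)]
    return "\n".join(lines)

def query_llm(prompt: str) -> str:
    raise NotImplementedError("Wire this to the LLM under evaluation (GPT-3.5, Gemini Pro, etc.)")

if __name__ == "__main__":
    prompt = build_prompt("After heavy rain, the streets were flooded.",
                          concepts=["rain", "flood"],
                          hypotheses=["It rained heavily overnight.", "The streets were repaved."])
    print(prompt)  # Inspect the assembled CARE-CA-style prompt before sending it to a model.
```

In practice the extracted relations and the counterfactual probe would be filtered for relevance to the premise; the sketch only shows how the three modules compose into a single prompt.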
To provide readers with additional context, we show an example of a prompt for the COPA dataset below:

Input: "Shadows are formed when a light source illuminates an object, creating a dark area on the opposite side. Given that 'My body cast a shadow over the grass,' which hypothesis seems more plausible based on the understanding of shadows?"

Counterfactual statement: "If the grass were on fire, my shadow would have been the least of my concerns."

Hypothesis 1: 'The sun was rising.' (providing the light that cast the shadow)

Hypothesis 2: 'The grass was cut.' (a condition unrelated to shadow formation)

Datasets

To develop and evaluate our CARE-CA framework, we employ six distinct datasets. Each dataset serves a specific function within our research, ranging from training the model's causal reasoning capabilities to evaluating its performance in various causal reasoning tasks. All experiments were performed with a 75%-25% train-test split, and 3 runs were conducted for each dataset-model combination. We evaluated five LLMs (GPT-3.5, Mistral 7B, Gemini Pro, Llama 2, and T5) on five datasets (COPA, TimeTravel, CLadder, Com2Sense, and e-CARE), and then compared these LLMs with our proposed method, CARE-CA.

Figure 3: Enhancing LLM causal understanding via structured knowledge and counterfactuals: the approach integrates ConceptNet knowledge graphs and 'what-if' scenarios to improve LLMs' causal reasoning, using the CKI, CRE, and CAPM modules to boost performance on causal reasoning benchmarks.

Dataset(s) for Causal Relationship Identification (CRI):
• CLadder and Com2Sense. Composition: Derived from narrative texts, these datasets are crafted to pinpoint explicit causal links within a narrative context. Purpose: They provide foundational training for the model's explicit causal reasoning abilities, allowing it to recognize and understand causal relationships within complex text structures.

Dataset(s) for Counterfactual Reasoning (CR):
• TimeTravel. Composition: This dataset presents hypothetical scenarios that challenge the model to reason about events that did not occur. Purpose: It is crucial for enhancing the model's counterfactual reasoning, teaching it to contemplate different possibilities and their implications.

Dataset(s) for Causal Discovery:
• COPA and e-CARE. Composition: COPA focuses on scenarios that require understanding potential outcomes and alternate realities, while e-CARE contains medical narratives that add domain-specific intricacies. Purpose: These datasets are utilized to challenge the model in discovering underlying causal mechanisms within varied and domain-specific contexts.

Each dataset contributes uniquely to the robustness of the CARE-CA framework, ensuring comprehensive coverage across the spectrum of causal reasoning tasks.
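For concreteness, the following is a minimal sketch of the evaluation protocol described above (75%-25% split, 3 runs, mean accuracy, precision, recall, and F1), under the assumption that scikit-learn is used for the metrics; predict_with_model is a hypothetical placeholder for running a given LLM or the CARE-CA pipeline over the held-out examples.

```python
# Sketch of the evaluation protocol: 75%-25% split, 3 runs, mean metrics.
# `predict_with_model` is a hypothetical stand-in for the model being scored.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def predict_with_model(model_name: str, examples: list) -> list:
    raise NotImplementedError("Run the LLM / CARE-CA pipeline on each test example.")

def evaluate(model_name: str, examples: list, labels: list, runs: int = 3) -> dict:
    scores = {"accuracy": [], "precision": [], "recall": [], "f1": []}
    for seed in range(runs):
        # 75%-25% train/test split; only the test portion is scored here.
        _, x_test, _, y_test = train_test_split(examples, labels,
                                                test_size=0.25, random_state=seed)
        y_pred = predict_with_model(model_name, x_test)
        scores["accuracy"].append(accuracy_score(y_test, y_pred))
        scores["precision"].append(precision_score(y_test, y_pred, average="macro", zero_division=0))
        scores["recall"].append(recall_score(y_test, y_pred, average="macro", zero_division=0))
        scores["f1"].append(f1_score(y_test, y_pred, average="macro", zero_division=0))
    return {metric: float(np.mean(vals)) for metric, vals in scores.items()}
```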
Proposed Dataset

We also propose a new dataset, CausalNet, carefully designed to facilitate research on causal reasoning and counterfactual analysis (available at https://github.com/swagata15/causal-reasoning). Comprising 1,000 carefully curated scenarios, the dataset presents a diverse set of causal and counterfactual questions, allowing researchers to explore the intricacies of cause-and-effect relationships in various contexts. Each entry in CausalNet consists of the following components (an illustrative entry is sketched at the end of this section):

Context: A detailed narrative context provides the backdrop for each scenario. These narratives describe situations where multiple events or factors coincide, potentially influencing outcomes. The contexts are designed to be realistic and thought-provoking, setting the stage for causal reasoning and counterfactual exploration.

Causal Questions: For each scenario, a set of causal questions is provided to challenge the models' abilities in causal reasoning. These questions are categorized into two main types:

Cause-Effect Questions: These questions prompt models to identify less obvious factors that may have contributed to observed outcomes. Models must discern the subtle interplay of various events or conditions in determining the outcome.

Counterfactual Questions: Counterfactual questions explore how changes in the scenario's main cause might impact the outcome. Models are evaluated on their capacity to predict the consequences of hypothetical alterations to the causal factor.

Choices and Answers: Each question is accompanied by a set of choices, one of which is designated as the correct answer. For cause-effect questions, the choices represent potential influencing factors, while for counterfactual questions, the choices depict possible outcomes under different circumstances. The correct answers are labeled to facilitate evaluation.

CausalNet was meticulously constructed using a multi-step process to ensure its quality and relevance.

1. Initial Generation: We utilized GPT-4's advanced language capabilities to generate an initial set of 1,500 scenarios. Each scenario was designed to include a context, causal questions, and counterfactual questions.

2. Prompt Engineering: We used the following prompt: "Develop a dataset composed of entries that challenge and enhance machine learning models' understanding of causal relationships and counterfactual reasoning across various domains. Each entry in the dataset should follow this structure: 'Context': A detailed description of a scenario that outlines a complex situation involving causal relationships. 'Questions': A set of questions focusing on (1) identifying causal effects within the context and (2) exploring counterfactual scenarios, with multiple-choice answers to infer the model's reasoning capabilities."

3. Filtering and Refinement: The initial set was filtered down to 1,000 high-quality scenarios. This process involved removing duplicates, overly simplistic scenarios, and scenarios with ambiguous causal relationships.

CausalNet is designed to bridge the gap between academic causal reasoning tasks and real-world applications. It covers a wide range of fields, mirroring the complexity of real-world causal reasoning in areas such as healthcare, policy-making, and business decision-making. The richness of the dataset is ensured by including many scenarios that require multi-step causal inference, simulating the complexity of real-world problem-solving. The dataset also incorporates subtle contextual cues that influence causal relationships, reflecting the nuanced nature of real-world causality. Scenarios often conclude with questions about potential interventions or decisions, aligning with practical applications of causal reasoning in fields like management and public policy. This tests LLMs on their practical decision-making ability.

4. Verification Process: Because CausalNet is AI-generated, we employed a stringent human verification process to ensure that the dataset meets the highest academic standards. All authors of this research effort reviewed a subset of the scenarios for logical consistency and real-world relevance. Based on feedback, we iteratively refined the dataset, adjusting scenarios and questions to improve clarity and causal validity. We tested the refined dataset against existing causal reasoning benchmarks to ensure its uniqueness and added value to the field.
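The sketch below shows an illustrative, entirely hypothetical CausalNet-style entry following the components described above (context, cause-effect and counterfactual questions, choices, and labeled answers); it is not an actual record from the released dataset, and the field names are our own assumption.

```python
# A hypothetical CausalNet-style entry, for illustration only.
example_entry = {
    "context": (
        "A coastal town upgraded the storm drains in its new district in March. "
        "In June, a week of heavy rain coincided with unusually high tides, and "
        "several streets in the old district, where drains were not upgraded, flooded."
    ),
    "questions": [
        {
            "type": "cause_effect",
            "question": "Besides the heavy rain, which factor most plausibly contributed to the flooding?",
            "choices": ["The drain upgrade in March",
                        "The unusually high tides",
                        "The time of year"],
            "answer": "The unusually high tides",
        },
        {
            "type": "counterfactual",
            "question": "If the old district's drains had also been upgraded, what is the most likely outcome?",
            "choices": ["Flooding would be reduced or avoided",
                        "Flooding would be unchanged",
                        "The tides would have been lower"],
            "answer": "Flooding would be reduced or avoided",
        },
    ],
}
```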
Results

Performance was quantitatively assessed through mean accuracy, precision, recall, and F1 scores, as illustrated in Figure 4.

Causal Discovery

We examine CARE-CA's capability to unearth hidden or implicit causal relationships within complex scenarios. Our method showed superior accuracy (76%) on the COPA dataset, emphasizing the framework's strength in integrating contextual and counterfactual insights to uncover underlying causal mechanisms. Interestingly, GPT-3.5 and Gemini Pro also performed well, with accuracies of 73.3% and 70.1%, respectively, indicating their potential for learning causal patterns. The lower performance of models such as XLM-RoBERTa and DeBERTa, with accuracies of 53.2% and 51.8%, respectively, could stem from their less effective handling of the dataset's counterfactual and causal scenarios without specific fine-tuning. On the e-CARE (Du et al. 2022) dataset, our method also performed well with 85.9% accuracy, compared to the next-closest decoder-model performance of T5 at 84%.

Causal Relationship Identification

The objective is to assess CARE-CA's proficiency in recognizing explicit causal links within narrative contexts. On the CLadder (Jin et al. 2023a) dataset, the CARE-CA model led with a standout performance, achieving 63% accuracy and indicating a strong capability to identify causal relationships. The T5 decoder model delivered a balanced performance, showcasing the effectiveness of its decoding capabilities in causal reasoning tasks. On the Com2Sense (Singh et al. 2021) dataset, the decoder models encountered diverse challenges, with CARE-CA again leading at 67.1% accuracy, suggesting a consistent ability to navigate causal reasoning tasks. On our CausalNet dataset, CARE-CA's remarkable accuracy of 94.6% sets a high benchmark, emphasizing the model's superior causal reasoning capabilities. The T5 decoder model mirrored this high performance with 94.2% accuracy, showcasing the strength of decoder architectures in extracting and interpreting causal relationships from data. Figure 4b illustrates the performance of the CARE-CA model across multiple datasets, with particularly strong performance on CausalNet.

Counterfactual Reasoning

Here we test CARE-CA's ability to reason with hypothetical scenarios and their implications for understanding potential outcomes. The TimeTravel dataset (Qin et al. 2019), focused on counterfactual reasoning, highlighted the models' challenges in understanding hypothetical scenarios. The Gemini Pro and Llama 2 models scored 38.4% and 24.2%, respectively, suggesting that despite their extensive training data, they may struggle with tasks requiring deep counterfactual inference and underscoring the importance of specialized training or prompting for such tasks. The T5 and GPT-3.5 models performed well, with 61.7% and 63.2%, respectively. Our method obtained a slight jump in accuracy over the best-performing decoders; however, due to information overload, it could not compete with relatively more straightforward encoders such as ALBERT, with 68% accuracy.

The CARE-CA framework demonstrated superior performance across various causal reasoning tasks compared to traditional LLMs.
This exceptional performance (94.6% accuracy) on our novel CausalNet dataset shows robustness in handling diverse causal reasoning tasks, effective integration of explicit knowledge and counterfactual reasoning in real-world scenarios, and the ability to generalize causal understanding across various contexts. The hybrid approach of combining explicit knowledge (ConceptNet) with implicit reasoning (LLMs) creates a more comprehensive causal understanding, while the inclusion of counterfactual reasoning allows for more robust causal inferences and hypothesis testing. CARE-CA's architecture enables better adaptation to different contextual nuances in causal scenarios. The framework effectively manages the trade-off between leveraging external knowledge and avoiding information overload.

Remark: The observed performances underscore the complexity of causal reasoning tasks and the varying abilities of models to address them. The CARE-CA framework's superior performance across several tasks suggests that its hybrid approach, which leverages explicit causal knowledge and counterfactual reasoning, significantly enhances causal inference capabilities. LLMs exhibit strong foundational abilities in causal reasoning, likely benefiting from their diverse pre-training. However, tasks requiring nuanced understanding or domain-specific knowledge, such as counterfactual reasoning and causal explanation, highlight the limitations of LLMs and the value of specialized training or frameworks like CARE-CA.

Figure 4: Performance comparison of causal reasoning models across datasets, highlighting the effectiveness of the CausalNet dataset and the CARE-CA model. (a) The CausalNet dataset enhances performance across all models; T5 shows the highest improvement with 94.2% accuracy, suggesting CausalNet's effectiveness in boosting causal reasoning capabilities. (b) The CARE-CA model excels in causal reasoning across datasets and tasks; on CausalNet, it achieves 94.6% mean accuracy, demonstrating superior performance in diverse causal contexts.

Conclusion & Future Work

We present the CARE-CA framework as a significant advancement in enhancing the causal reasoning capabilities of large language models (LLMs). By integrating explicit knowledge from ConceptNet and employing counterfactual reasoning, CARE-CA bridges the gap between data-driven and knowledge-driven causal inference, offering a robust solution for various causal reasoning tasks. Evaluation on multiple datasets, including the newly introduced CausalNet, demonstrates that CARE-CA consistently outperforms traditional LLM approaches in accuracy, precision, recall, and F1 scores. This work not only contributes to the field of causal reasoning in AI but also paves the way for more interpretable and reliable AI systems. Our system works well under restrictive token constraints.

Future Directions: These results pave the way for further research into hybrid models that combine the breadth of knowledge from resources like ConceptNet with the depth of understanding inherent in LLMs. Fine-tuning strategies, domain-specific model adaptations, and the development of more comprehensive benchmarks like CausalNet are promising areas for future exploration. Future research can also focus on expanding the multilingual capabilities of CARE-CA and further optimizing the framework to enhance its applicability across diverse domains and complex scenarios.
Limitations

In our research on the efficacy of causal reasoning in LLMs through the CARE-CA framework, we encountered several limitations that highlight areas for future exploration and improvement. First, we were able to run CARE-CA only on the best-performing decoders for each dataset and compare the results; comparing CARE-CA against all decoders as well as all encoders was not feasible due to computational resource constraints. Second, our focus on English limits the generalizability of our findings across languages and cultures, pointing to a need for multilingual datasets and cross-cultural validation. The challenge of applying our general causal reasoning framework effectively in domain-specific scenarios, such as those presented in the e-CARE dataset, indicates an opportunity for refining its adaptability to specialized fields. Additionally, the significant computational resources required by the CARE-CA framework may limit accessibility for those with constrained computational budgets, pointing to a need for optimization strategies. While CARE-CA enhances interpretability in causal reasoning tasks, further research is required to improve transparency and to explain the model's reasoning processes, especially for non-expert users. These limitations underscore the necessity of ongoing research to enhance the efficacy, inclusiveness, and applicability of causal reasoning models, and we invite the broader research community to address these challenges collaboratively.

Ethics Statement

Ethical considerations are paramount in research, particularly when LLMs are involved. We have strived to prevent the propagation of bias within CausalNet, the dataset we introduce in this work, by carefully curating and filtering the data to mitigate the inclusion of sensitive or discriminatory content. Furthermore, we are committed to transparency regarding the dataset's origins and potential implications, acknowledging the ethical responsibilities of conducting research with LLMs.
Table 1: Performance metrics (mean accuracy, F1, precision, and recall) of encoder and decoder models on three tasks: causal discovery, counterfactual reasoning, and causal relationship identification.

Causal Discovery - COPA
Model          Mean Accuracy  Mean F1  Mean Precision  Mean Recall
CARE-CA        76.0           82.3     1.0             78.1
BERT           69.2           66.3     70.0            68.6
RoBERTa        57.2           56.2     58.3            61.1
XLM-RoBERTa    53.2           47.0     52.1            56.2
ALBERT         62.2           63.1     64.0            66.2
DeBERTa        51.8           0.0      0.0             0.0
Llama 2        62.4           56.0     87.0            68.0
T5             53.5           1.0      54.0            70.0
Mistral        67.2           67.2     1.0             87.1
GPT-3.5        73.3           78       1.0             87.5
Gemini Pro     70.1           1.0      70.1            82.4

Causal Discovery - e-CARE
Model          Mean Accuracy  Mean F1  Mean Precision  Mean Recall
CARE-CA        85.9           88.8     84.6            82.9
BERT           50             39.4     66              47.6
RoBERTa        49.7           51.5     50.8            73.1
XLM-RoBERTa    48.2           58.7     46.7            84.2
ALBERT         47.7           41.4     50.9            57.7
DeBERTa        46.6           63.6     46.6            100.0
Llama 2        62.2           60.0     63.8            56.7
T5             84             84.8     80.5            89.6
Mistral        50             49.9     50              49.9
GPT-3.5        77.8           75.9     83.3            69.7
Gemini Pro     67.8           63.0     74.4            54.5

Counterfactual Reasoning - TimeTravel
Model          Mean Accuracy  Mean F1  Mean Precision  Mean Recall
CARE-CA        69.4           40.1     20.2            13.5
BERT           56.3           6.0      11.0            5.0
RoBERTa        68.7           3.0      9.0             2.0
XLM-RoBERTa    56.9           5.0      10.0            3.0
ALBERT         68             6.0      11.2            4.0
DeBERTa        58.1           6.0      11.0            4.0
Llama 2        24.2           1.0      1.0             5.0
T5             63.2           19.1     12.7            38.2
Mistral        27.5           2.0      1.0             6.0
GPT-3.5        61.7           8.0      5.0             14.7
Gemini Pro     38.4           17.4     10.2            57.3

Causal Relationship Identification - CLadder
Model          Mean Accuracy  Mean F1  Mean Precision  Mean Recall
CARE-CA        63.0           62.5     61.9            62.5
BERT           53.0           48.6     52.3            52.4
RoBERTa        50.3           65.2     50.3            100.0
XLM-RoBERTa    49.5           64.3     49.5            99.3
ALBERT         49.4           46.2     40.5            68.9
DeBERTa        49.8           22.1     18.0            33.2
Llama 2        48.0           60.0     47.0            82.0
T5             60.0           59.0     59.0            59.0
Mistral        51.0           59.0     52.0            70.0
GPT-3.5        52.0           54.0     53.0            55.0
Gemini Pro     59.0           65.0     57.0            76.0

Causal Relationship Identification - Com2Sense
Model          Mean Accuracy  Mean F1  Mean Precision  Mean Recall
CARE-CA        67.1           28.6     25.7            32.3
BERT           44.6           59.2     44.9            96.0
RoBERTa        45.5           1.0      3.0             1.0
XLM-RoBERTa    50.4           51.4     45.0            60.0
ALBERT         51.2           35.0     25.0            30.0
DeBERTa        45.3           60.0     45.6            96.5
Llama 2        50             20.0     10.0            13.3
T5             65.4           63.4     46.2            53.4
Mistral        54.3           69.1     71.7            70.4
GPT-3.5        62.8           23.2     30.4            28.0
Gemini Pro     65.8           25.2     31.6            28.0

Causal Relationship Identification - CausalNet
Model          Mean Accuracy  Mean F1  Mean Precision  Mean Recall
CARE-CA        94.6           95.4     95              95.4
BERT           39.0           21.8     15.2            39.0
RoBERTa        38.0           20.9     14.4            38.0
XLM-RoBERTa    37.5           20.4     14.9            37.5
ALBERT         33.8           19.3     27.2            33.8
DeBERTa        33.5           25.8     22.0            33.5
Llama 2        27.3           23.8     51.3            27.3
T5             94.2           94.5     95.0            94.2
Mistral        36.8           29.2     60.9            36.8
GPT-3.5        70.3           70.9     84.6            70.3
Gemini Pro     79.5           80.0     83.8            79.5

References

Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Du, L.; Ding, X.; Xiong, K.; Liu, T.; and Qin, B. 2022. e-CARE: A New Dataset for Exploring Explainable Causal Reasoning. arXiv preprint arXiv:2205.05849.

He, P.; Liu, X.; Gao, J.; and Chen, W. 2020. DeBERTa: Decoding-enhanced BERT with disentangled attention. arXiv preprint arXiv:2006.03654.

Jiang, A. Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D. S.; Casas, D. d. l.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825.

Jin, Z.; Chen, Y.; Leeb, F.; Gresele, L.; Kamal, O.; Lyu, Z.; Blin, K.; Gonzalez Adauto, F.; Kleiman-Weiner, M.; Sachan, M.; and Schölkopf, B. 2023a. CLadder: Assessing Causal Reasoning in Language Models. In Advances in Neural Information Processing Systems (NeurIPS 2023). arXiv:2312.04350.

Jin, Z.; Liu, J.; Lyu, Z.; Poff, S.; Sachan, M.; Mihalcea, R.; Diab, M.; and Schölkopf, B. 2023b. Can large language models infer causation from correlation?
arXiv preprint arXiv:2306.05836.

Jin, Z.; von Kügelgen, J.; Ni, J.; Vaidhya, T.; Kaushal, A.; Sachan, M.; and Schölkopf, B. 2021. Causal direction of data collection matters: Implications of causal and anticausal learning for NLP. arXiv preprint arXiv:2110.03618.

Kıçıman, E.; Ness, R.; Sharma, A.; and Tan, C. 2023. Causal Reasoning and Large Language Models: Opening a New Frontier for Causality. arXiv preprint arXiv:2305.00050.

Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; and Soricut, R. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.

Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

OpenAI. 2024. https://platform.openai.com/docs.

Peña, A.; Morales, A.; Fierrez, J.; Serna, I.; Ortega-Garcia, J.; Puente, I.; Cordova, J.; and Cordova, G. 2023. Leveraging Large Language Models for Topic Classification in the Domain of Public Affairs. In ICDAR 2023 Workshop on Automatic Domain-Adapted and Personalized Document Analysis. arXiv:2306.02864.

Qin, L.; Bosselut, A.; Holtzman, A.; Bhagavatula, C.; Clark, E.; and Choi, Y. 2019. Counterfactual story reasoning and generation. arXiv preprint arXiv:1909.04076.

Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; and Liu, P. J. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1): 5485–5551.

Romanou, A.; Montariol, S.; Paul, D.; Laugier, L.; Aberer, K.; and Bosselut, A. 2023. CRAB: Assessing the strength of causal relationships between real-world events. arXiv preprint arXiv:2311.04284.

Singh, S.; Wen, N.; Hou, Y.; Alipoormolabashi, P.; Wu, T.-L.; Ma, X.; and Peng, N. 2021. COM2SENSE: A Commonsense Reasoning Benchmark with Complementary Sentences. In Findings of the Association for Computational Linguistics: ACL 2021.

Speer, R.; Chin, J.; and Havasi, C. 2017. ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. In Proceedings of the AAAI Conference on Artificial Intelligence, 4444–4451.

Team, G.; Anil, R.; Borgeaud, S.; Wu, Y.; Alayrac, J.-B.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A. M.; Hauth, A.; et al. 2023. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.

Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Weng, Y.; Zhu, M.; Xia, F.; Li, B.; He, S.; Liu, S.; Sun, B.; Liu, K.; and Zhao, J. 2023. Large language models are better reasoners with self-verification. In Findings of the Association for Computational Linguistics: EMNLP 2023, 2550–2575.

Zečević, M.; Willig, M.; Dhami, D. S.; and Kersting, K. 2023. Causal parrots: Large language models may talk causality but are not causal. arXiv preprint arXiv:2308.13067.

Zhang, C.; Bauer, S.; Bennett, P.; Gao, J.; Gong, W.; Hilmkil, A.; Jennings, J.; Ma, C.; Minka, T.; Pawlowski, N.; and Vaughan, J. 2023. Understanding Causality with Large Language Models: Feasibility and Opportunities. arXiv preprint arXiv:2304.05524.
Zhuang, Z.; Chen, Q.; Ma, L.; Li, M.; Han, Y.; Qian, Y.; Bai, H.; Feng, Z.; Zhang, W.; and Liu, T. 2023. Through the Lens of Core Competency: Survey on Evaluation of Large Language Models. arXiv preprint arXiv:2308.07902.