Title
Dynamic identification and measurement of factors influencing tail risk driven by text data: A case study of listed companies in China's financial industry
Author
LIU Chao; QIAN Cun
Abstract
Amid rapid advances in big data, cloud computing, artificial intelligence, and blockchain technology, accurately identifying risk vulnerabilities and effectively preventing and controlling financial tail risk has become a critical concern for governments and academia alike. Addressing these challenges is difficult: traditional analyses of structured data are limited by reporting-period lags and content gaps, which prevent timely and comprehensive capture of potential risk drivers. In contrast, micro-level financial data, particularly unstructured text data, which are inherently complex, large-scale, and interconnected, offer promising new avenues for supporting financial tail risk decision-making. Nonetheless, methods for systematically extracting relevant risk factors from text and measuring their unique contributions remain underdeveloped.

In response, this paper proposes a framework that integrates text feature mining with multivariate statistical methods to dynamically identify and measure tail-risk factors. Specifically, the framework takes periodic reports and analyst commentaries as its text sources and uses a hybrid algorithm combining Latent Dirichlet Allocation (LDA) and Word2Vec models to extract thematic risk factors. To quantify the impact of these themes, multiple regression analysis is applied to measure the marginal contribution of the extracted risk factors relative to comprehensive price information. By capturing both long-term and short-term risk factors, the framework helps alleviate the data lags and limited timeliness that often constrain studies based on traditional structured data.

To validate this approach, we conducted static and dynamic empirical analyses using data on listed companies in China's financial industry from 2001 to 2022. The results underscore the framework's effectiveness, demonstrating its capacity not only to extract long-term and emerging risk factors from text-based disclosures but also to continuously track new risk indicators. By integrating statistical models, the framework verifies the marginal contribution of various risk factors to tail risk, revealing the key factors that influence tail risk during periods of heightened financial instability.

Our static long-term model analysis identified several persistent risk categories that companies and analysts frequently highlight in periodic reporting: market risk, credit risk, investment risk, market instruments and information, macroeconomic and policy risks, and risk management challenges. By clustering text-extracted risk factors and validating them via multiple regression analysis, we confirmed that these themes provide supplementary explanatory power for tail risk measures, such as value at risk (VaR) and expected shortfall (ES), beyond what structured data alone reveal. Multiple regression was applied quarterly, allowing annual and quarterly marginal contributions of the risk themes to be calculated. The findings show that when tail risk increases, the marginal contributions of text-disclosed risk themes also increase, often fluctuating significantly during financial crises. Further analysis of these fluctuations revealed that different risk factors display distinct marginal contribution patterns across different extreme events, offering insight into the pivotal risk factors specific to particular financial conditions or events.

Short-term dynamic model analysis further underscored the sensitivity of dynamic models in detecting short-term risk factors that static models may miss. When quarterly data are used for multiple regression, dynamic models provide a more time-sensitive and robust understanding of risk, revealing short-term risk drivers that static models overlook.
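
For readers less familiar with the two tail-risk measures named above, the following is a minimal sketch of how VaR and ES could be estimated from a return series by historical simulation. The abstract does not state which estimator the authors use, so the historical-simulation approach, the function name `historical_var_es`, the tail probability `alpha`, and the quantile convention are all assumptions made for illustration.

```python
import math


def historical_var_es(returns, alpha=0.05):
    """Estimate (VaR, ES) at tail probability alpha by historical simulation.

    Both values are reported as positive loss numbers. This is an
    illustrative estimator only; it is not taken from the paper.
    """
    if not returns:
        raise ValueError("returns must be non-empty")
    # Convert returns to losses and sort so the largest loss comes first.
    losses = sorted((-r for r in returns), reverse=True)
    # Number of observations treated as the tail (assumed convention:
    # floor(alpha * n), with at least one observation).
    k = max(1, math.floor(alpha * len(losses)))
    tail = losses[:k]
    var = tail[-1]        # smallest loss among the k worst outcomes
    es = sum(tail) / k    # average loss among the k worst outcomes
    return var, es


if __name__ == "__main__":
    sample = [-0.10, -0.05, -0.02, 0.00, 0.01, 0.01, 0.02, 0.03, 0.04, 0.05]
    print(historical_var_es(sample, alpha=0.2))
```

ES averages over the whole tail, so it is never smaller than VaR at the same level; this is why the two measures are often reported together when studying extreme losses.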
Regression analysis using rolling event windows further enabled an in-depth examination of quarterly shifts in marginal contributions, showing that these changes are more pronounced at the quarterly frequency. This cyclical variation highlights the dynamic explanatory power of continuously tracked text-based risk factors for tail risk, which structured data are often insufficient to capture.

This text-data-driven approach to dynamically identifying and measuring the factors influencing tail risk represents an innovative application of text mining to tail risk management. Theoretically, the framework extends the tail risk literature by providing a text-based approach to analysing influencing factors and by confirming the informational value of thematic risk factors. It not only complements traditional structured data analysis but also deepens the understanding of how risk factors behave over time. Practically, the framework offers regulatory authorities a tool for harnessing collective intelligence through text mining to capture potential risk factors more comprehensively, which can aid the early detection and prevention of tail risk events. For investors, the framework enables more informed investment decisions and supports improved earnings forecasting by providing richer data on underlying risk dynamics.
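
The rolling-window measurement of marginal contributions described in the abstract can be illustrated with a small sketch. It assumes that "marginal contribution" is proxied by the incremental R-squared obtained when text-theme scores are added to a regression of a tail-risk measure on structured controls; the abstract does not give the authors' exact definition, so this proxy, the plain OLS estimator, and all function names below are hypothetical.

```python
def ols_r2(X, y):
    """R-squared of an OLS fit of y on the columns of X plus an intercept."""
    n = len(y)
    rows = [[1.0] + list(x) for x in X]  # prepend the intercept column
    p = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        piv = max(range(c, p), key=lambda k: abs(A[k][c]))
        if abs(A[piv][c]) < 1e-12:
            raise ValueError("singular design matrix")
        A[c], A[piv] = A[piv], A[c]
        for k in range(p):
            if k != c:
                f = A[k][c] / A[c][c]
                A[k] = [a - f * b for a, b in zip(A[k], A[c])]
    beta = [A[i][p] / A[i][i] for i in range(p)]
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)  # assumed non-zero
    return 1.0 - ss_res / ss_tot


def marginal_contribution(controls, themes, y):
    """Incremental R-squared from adding theme scores to the controls."""
    base = ols_r2(controls, y)
    full = ols_r2([c + t for c, t in zip(controls, themes)], y)
    return full - base


def rolling_marginal_contribution(controls, themes, y, window):
    """Marginal contribution recomputed over each window of `window` periods."""
    return [
        marginal_contribution(controls[s:s + window],
                              themes[s:s + window],
                              y[s:s + window])
        for s in range(len(y) - window + 1)
    ]


if __name__ == "__main__":
    # Toy data: one structured control and one text-theme score per period.
    controls = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
    themes = [[0.0], [1.0], [0.0], [1.0], [1.0], [0.0]]
    tail_risk = [2 * c[0] + 3 * t[0] for c, t in zip(controls, themes)]
    print(rolling_marginal_contribution(controls, themes, tail_risk, window=4))
```

Each row of `controls` and `themes` is a list of values for one period, and each window must contain more observations than regressors. Comparing the resulting series across windows shows how the explanatory weight of the text themes shifts over time, which is the kind of variation the abstract reports for crisis periods.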
Keywords
Text-driven decision-making; Tail risk; LDA model; Word2Vec model; LASSO model
Issue
Vol. 39, No. 6, 2025