PhD Thesis Defense: Erica Cai, From Text to Networks: Enabling and Investigating Social Measurement via Low-Resource Knowledge Graph Extraction
Content
Speaker
The thesis is motivated by the challenge of extracting structured instances of action or relationship occurrences from large amounts of unstructured text to populate knowledge graphs (KGs). In KGs, nodes represent entities mentioned in the text (e.g., Portugal, the United Kingdom, protein) and edges represent relationships or events (e.g., ally, payment). Extracted KGs allow researchers to perform various downstream analyses, such as identifying central nodes that indicate important entities in intelligence reports or examining co-membership density in affiliation networks of elites. However, the research literature shows that information extraction methods often struggle to perform well in low-resource settings, which makes them difficult to apply to massive text data in real settings. Therefore, my thesis aims to (1) introduce information extraction methods that perform effectively and efficiently in low-resource contexts and improve the evaluation of such methods, and (2) examine how errors in populating a knowledge graph using these information extraction methods affect subsequent analyses on the extracted KGs.
In the first part of the thesis, we focus on information extraction methods that extract tuple structures which populate graphs with entities (e.g., Sherlock Holmes, John Watson) and relationships between them (e.g., friend, enemy). We contribute methods and evaluation improvements for two key tasks. (1) Event extraction: (a) we propose an interpretable, efficient approach to extracting event structures from text that outperforms state-of-the-art methods; (b) we apply a slightly modified version of this method to millions of news articles to investigate bias in global news coverage of critical disaster and terrorist attack events; and (c) we provide recommendations for, and implementations of, fixes for issues in evaluating such methods. (2) Named entity recognition and relation extraction: (a) we develop a few-shot method for extracting fine-grained named entities (e.g., religious institution, soldier, politician) that achieves state-of-the-art performance, and (b) we propose solutions to challenges in evaluating relation extraction methods that arise from issues in how dataset labels are assigned.
In the second part of the thesis, we investigate how errors from information extraction impact downstream analyses on the extracted knowledge graph. These analyses include measurements of centrality, projection network density, and clustering coefficients, which are crucial for capturing node importance and how nodes tend to cluster. Because KGs in existing NLP datasets are often too small or disconnected for reliable experimentation, we curate a new collection of datasets that pair scanned book text with a large, labeled KG. We examine the effects of real errors (e.g., those introduced by OCR or relation extraction) on downstream analyses over KGs extracted from text and find that widely and exclusively used KG evaluation metrics may not always correlate with performance on these real-world analyses. To deepen our understanding, we also study the effect of error on synthetic graphs, where errors range from simple (e.g., random) to more realistic (e.g., node disaggregation, preferential attachment). We provide closed-form expressions for how errors affect the measurement of projection network density, global clustering coefficient, and other real-world relevant analyses on simpler synthetic networks, and conduct simulations to understand how more realistic types of errors affect these measurements on real-world networks. These findings guide social scientists and biomedical researchers on how varying error magnitudes and types may influence downstream analyses over extracted KGs.
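To make the idea of error propagating into a downstream measurement concrete, the following is a minimal Python sketch and not the thesis's method or datasets. It assumes a toy affiliation network (the entity and group names are hypothetical), a simple random error model in which each entity-group membership is missed independently with probability p, and a projection in which two entities are linked if they share at least one group. All function names are illustrative.

```python
import random
from itertools import combinations


def projection_density(memberships):
    """Density of the one-mode projection: the fraction of entity pairs
    that share at least one group."""
    entities = sorted(memberships)
    n = len(entities)
    if n < 2:
        return 0.0
    linked = sum(
        1 for a, b in combinations(entities, 2)
        if memberships[a] & memberships[b]
    )
    return linked / (n * (n - 1) / 2)


def drop_memberships(memberships, p, rng):
    """Assumed error model: each (entity, group) edge is missed
    independently with probability p. Entities remain as nodes even
    if all of their memberships are lost."""
    return {
        entity: {g for g in sorted(groups) if rng.random() >= p}
        for entity, groups in memberships.items()
    }


def mean_observed_density(memberships, p, trials=1000, seed=0):
    """Average projection density over repeated simulated extractions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += projection_density(drop_memberships(memberships, p, rng))
    return total / trials


# Hypothetical toy data: entity -> set of groups it belongs to.
toy = {
    "a": {"g1"},
    "b": {"g1", "g2"},
    "c": {"g2"},
    "d": {"g3"},
}

if __name__ == "__main__":
    print("true density:", projection_density(toy))
    for p in (0.1, 0.3, 0.5):
        print("p =", p, "mean observed density:", mean_observed_density(toy, p))
```

In this toy network every linked pair shares exactly one group, so under the assumed error model a pair stays linked only if both of its memberships survive, which happens with probability (1 - p)^2. The simulated mean should therefore sit near (1 - p)^2 times the true density, a small instance of the kind of relationship between error rate and measurement that can be worked out in closed form on simple networks.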
Advisor
Brendan O'Connor