Unsupervised Methods for Determining Object and Relation Synonyms on the Web

Alexander Yates and Oren Etzioni

Abstract:

The task of identifying synonymous relations and objects, or synonym resolution, is critical for high-quality information extraction. This paper investigates synonym resolution in the context of unsupervised information extraction, where neither hand-tagged training examples nor domain knowledge is available. The paper presents a scalable, fully-implemented system that runs in O(KN log N) time in the number of extractions, N, and the maximum number of synonyms per word, K. The system, called Resolver, introduces a probabilistic relational model for predicting whether two strings are co-referential based on the similarity of the assertions containing them. On a set of two million assertions extracted from the Web, Resolver resolves objects with 78% precision and 68% recall, and resolves relations with 90% precision and 35% recall. Several variations of Resolver's probabilistic model are explored, and experiments demonstrate that under appropriate conditions these variations can improve F1 by 5%. An extension to the basic Resolver system allows it to handle polysemous names with 97% accuracy and 95% recall on a data set from the TREC corpus.

Paper:

PDF

Bibtex:
@article{jair09resolver,
  author = {Alexander Yates and Oren Etzioni},
  title = {Unsupervised Methods for Determining Object and Relation Synonyms on the Web},
  journal = {Journal of Artificial Intelligence Research (JAIR)},
  volume = {34},
  pages = {255-296},
  month = {March},
  year = {2009}
}

Previous Publications

A version of the probabilistic model for Resolver (called ESP) was first described in a conference paper at NAACL-HLT 2007. (pdf)

My dissertation includes a chapter on Resolver, with details on some experiments we performed that were not included in the JAIR publication. In those experiments, we used simulations to test variations of Resolver's probabilistic model and clustering algorithm. The simulations showed that under appropriate conditions, a version of the ESP model that incorporates TFIDF-like weights can improve precision by up to 26%. A variation of the Resolver clustering algorithm, which we called ``Mutual Recursion,'' was able to improve recall by using object synonym clusters to inform decisions on relation synonymy, and vice versa.

Data:

Output Synonym Clusters (150KB, gzipped tarball, high precision, low recall)
Output Ranked List of Similar Pairs (233MB, gzipped tarball, low precision, high recall)
Input TextRunner Tuples (25MB, gzipped tarball)
Gold standard entity clusters
Gold standard relation clusters

Links:

TextRunner, the Open Information Extraction system used to produce the input relationships for Resolver.
The Turing Center at the University of Washington, where this research was conducted.
Temple University's Center for Information Science and Technology, where this research was conducted.