Unsupervised Learning of Natural Language Structure
We argue that previous methods for this task have generally failed because of the representations they used. Overly complex models are easily distracted by non-syntactic correlations (such as topical associations), while overly simple models aren't rich enough to capture important first-order properties of language (such as directionality, adjacency, and valence). We describe several syntactic representations which are designed to capture the basic character of natural language syntax as directly as possible.
With these representations, high-quality parses can be learned from surprisingly little text, with no labeled examples and no language-specific biases. Our results are the first to show above-baseline performance in unsupervised parsing, and far exceed the baseline (in multiple languages). These specific grammar learning methods are useful since parsed corpora exist for only a small number of languages. More generally, most high-level NLP tasks, such as machine translation and question answering, lack richly annotated corpora, making unsupervised methods extremely appealing even for common languages like English.