To generate the entailments for the PETE task we followed these three steps:
1. Identifying Challenging Dependencies
To identify syntactic dependencies that are challenging for current state-of-the-art parsers, we tested a number of parsers (both phrase-structure and dependency) on a mixed-domain corpus and identified the differences in their output. We took sentences where at least one of the parsers gave an answer that differed from the others or from the gold parse. Some of these differences reflected linguistic convention rather than semantic disagreement (e.g. the representation of coordination) and some did not represent meaningful differences that could be expressed with entailments (e.g. labeling a phrase ADJP vs. ADVP). The remaining differences typically reflected genuine semantic disagreements that would affect downstream applications; these were selected to be turned into entailments in the next step.
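To make the comparison step concrete, the following is a minimal sketch of the kind of disagreement check involved. It assumes each parse is available as a set of (head, dependent) edges and that each parser is a callable returning such a set; both the data format and the function names are simplifications introduced for illustration, not the actual tools or formats we used.

```python
# Minimal sketch (not the actual pipeline): flag sentences on which
# parsers disagree with each other or with the gold parse.
# A parse is represented as a set of unlabeled (head, dependent) edges,
# which is a simplification of real parser output.

def disagreement_sentences(sentences, parsers, gold_parses):
    """Return sentences where at least one parser's edge set differs
    from the gold parse or from another parser's output."""
    flagged = []
    for sentence, gold_edges in zip(sentences, gold_parses):
        parses = [frozenset(parse(sentence)) for parse in parsers]
        gold = frozenset(gold_edges)
        # A sentence is interesting if the parsers do not all agree,
        # or if any parser deviates from the gold standard.
        if len(set(parses)) > 1 or any(p != gold for p in parses):
            flagged.append(sentence)
    return flagged
```

In the actual procedure the flagged differences were then inspected by hand, so that conventional and label-only differences could be discarded as described above.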
2. Constructing Entailments
We tried to make the entailments as targeted as possible by building them around two content words that are syntactically related. When the two content words were not sufficient to construct a grammatical sentence, we used one of the following techniques:
3. Filtering Entailments
To identify the entailments that are clear to human judgment, we used the following procedure:
The instructions for the annotators were brief and targeted people with no linguistic background:
The "Not sure" answers were grouped with the "No" answers during evaluation. Approximately 50% of the original entailments were retained after the inter-annotator-agreement filtering.
By picking dependencies on which current state-of-the-art parsers disagree, we hoped to create a dataset that would focus attention on the long tail of parsing problems that receive insufficient attention under common evaluation metrics. By further restricting ourselves to differences that can be expressed as natural language entailments, we hoped to focus on semantically relevant decisions rather than on accidents of convention, which get mixed up in common evaluation metrics. We chose to rely on untrained annotators performing a natural inference task rather than trained annotators performing an artificial tagging task because we believe that (i) many subfields of computational linguistics are struggling to make progress because of the noise in artificially tagged data, and (ii) systems should try to model natural human linguistic competence rather than humans' dubious competence in artificial tagging tasks. Our hope is that datasets like PETE will be used not only for evaluation but also for training and fine-tuning systems in the future.