To generate the entailments for the PETE task we followed these three steps:
  • Identify syntactic dependencies that are challenging for state-of-the-art parsers.
  • Construct short entailment sentences that paraphrase those dependencies.
  • Identify the subset of entailments with high inter-annotator agreement.

1. Identifying Challenging Dependencies

To identify syntactic dependencies that are challenging for current state-of-the-art parsers, we tested a number of parsers (both phrase-structure and dependency) on a mixed-domain corpus and identified the differences in their output.  We kept sentences where at least one of the parsers gave a different answer than the others or the gold parse.  Some of these differences reflected linguistic convention rather than semantic disagreement (e.g. the representation of coordination), and some did not represent meaningful differences that can be expressed with entailments (e.g. labeling a phrase ADJP vs. ADVP).  The remaining differences typically reflected genuine semantic disagreements that would affect downstream applications.  These were chosen to turn into entailments in the next step.
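
The disagreement-detection step can be sketched as a small filter over parser outputs.  This is a hypothetical illustration, not the authors' actual tooling: the parser names, the record layout, and the representation of a parse as a set of (head, label, dependent) triples are all assumptions.

```python
# Hypothetical sketch: flag sentences whose parses disagree.
# Each parse is represented as a set of (head, label, dependent)
# triples; the parser names and record layout are illustrative.

def find_disagreements(sentences):
    """Yield sentences where at least one parser differs from the
    others or from the gold parse."""
    for sent in sentences:
        parses = [sent["gold"]] + list(sent["parser_outputs"].values())
        # A disagreement exists if the parses are not all identical.
        if any(p != parses[0] for p in parses[1:]):
            yield sent

sentences = [
    {
        "text": "John kissed Mary.",
        "gold": {("kissed", "nsubj", "John"), ("kissed", "dobj", "Mary")},
        "parser_outputs": {
            "parserA": {("kissed", "nsubj", "John"), ("kissed", "dobj", "Mary")},
            # parserB swaps subject and object, so this sentence is flagged.
            "parserB": {("kissed", "nsubj", "Mary"), ("kissed", "dobj", "John")},
        },
    },
]
flagged = list(find_disagreements(sentences))
```

Representing parses as sets of triples makes the comparison order-independent, which is one simple way to abstract away from parser-specific output formats.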

2. Constructing Entailments

We tried to make the entailments as targeted as possible by building them around two content words that are syntactically related.  When the two content words were not sufficient to construct a grammatical sentence, we used one of the following techniques:

  • complete the mandatory elements using the words "somebody" or "something".  (e.g. To test the subject-verb dependency in "John kissed Mary." we construct the entailment "John kissed somebody.")
  • make a passive sentence to avoid using a spurious subject.  (e.g. To test the verb-object dependency in "John kissed Mary." we construct the entailment "Mary was kissed.")
  • make a copular sentence to express noun modification.  (e.g. To test the noun-modifier dependency in "The big red boat sank." we construct the entailment "The boat was big.")
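
The three techniques above can be sketched as a small set of templates.  This is a minimal illustration under simplifying assumptions: the relation labels and the expectation that the verb is supplied in the appropriate form (e.g. as a past participle for passivization) are ours, not part of the original construction process.

```python
# Hypothetical sketch of the three entailment templates described above.
# Relation labels are illustrative; real construction was done by hand.

def make_entailment(relation, head, dependent):
    """Build a short entailment sentence for a (head, dependent) pair."""
    if relation == "subject":
        # Keep the subject; fill the object slot with a dummy word.
        return f"{dependent} {head} somebody."
    if relation == "object":
        # Passivize to avoid inventing a spurious subject.
        # Assumes `head` is given as a past participle.
        return f"{dependent} was {head}."
    if relation == "modifier":
        # Express noun modification as a copular sentence.
        return f"The {head} was {dependent}."
    raise ValueError(f"unsupported relation: {relation}")

print(make_entailment("subject", "kissed", "John"))   # John kissed somebody.
print(make_entailment("object", "kissed", "Mary"))    # Mary was kissed.
print(make_entailment("modifier", "boat", "big"))     # The boat was big.
```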

3. Filtering Entailments

To identify the entailments that are clear to human judges, we used the following procedure:

  • each entailment was tagged by 5 untrained annotators.
  • the results from the annotators whose agreement with the gold parse fell below 70% were eliminated.
  • the entailments for which there was unanimous agreement of at least 3 annotators were kept.
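
The filtering procedure above can be sketched as follows.  The function names, data layout, and worked example are hypothetical; the 70% threshold, the minimum of 3 annotators, and the grouping of "Not sure" with "No" follow the text.

```python
# Hypothetical sketch of the annotation-filtering procedure.
# `tags` maps entailment id -> {annotator: "Yes"/"No"/"Not sure"};
# `gold` maps entailment id -> "Yes"/"No".  Names are illustrative.

def filter_entailments(tags, gold, min_agreement=0.70, min_annotators=3):
    # "Not sure" is grouped with "No", as in the evaluation.
    def norm(answer):
        return "Yes" if answer == "Yes" else "No"

    # Step 1: measure each annotator's agreement with the gold parse.
    annotators = {a for t in tags.values() for a in t}
    scores = {}
    for a in annotators:
        answered = [e for e in tags if a in tags[e]]
        correct = sum(norm(tags[e][a]) == gold[e] for e in answered)
        scores[a] = correct / len(answered)

    # Step 2: drop annotators whose agreement falls below the threshold.
    kept = {a for a in annotators if scores[a] >= min_agreement}

    # Step 3: keep entailments on which at least `min_annotators`
    # surviving annotators agree unanimously.
    result = []
    for e, t in tags.items():
        answers = [norm(v) for a, v in t.items() if a in kept]
        if len(answers) >= min_annotators and len(set(answers)) == 1:
            result.append(e)
    return result

# Worked example: a5 always disagrees with gold (score 0.0, dropped);
# a2 misses one of four items (score 0.75, kept); e3 is filtered out
# because the surviving annotators do not agree unanimously on it.
gold = {"e1": "Yes", "e2": "No", "e3": "Yes", "e4": "Yes"}
tags = {
    "e1": {"a1": "Yes", "a2": "Yes", "a3": "Yes", "a4": "No",  "a5": "Yes"},
    "e2": {"a1": "No",  "a2": "No",  "a3": "No",  "a4": "Yes", "a5": "No"},
    "e3": {"a1": "Yes", "a2": "No",  "a3": "Yes", "a4": "No",  "a5": "Yes"},
    "e4": {"a1": "Yes", "a2": "Yes", "a3": "Yes", "a4": "No",  "a5": "Yes"},
}
retained = filter_entailments(tags, gold)
```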

The instructions for the annotators were brief and targeted people with no linguistic background:

Computers try to understand long sentences by dividing them into a set of short facts. You will help judge whether the computer extracted the right facts from a given set of 25 English sentences. Each of the following examples consists of a sentence (T), and a short statement (H) derived from this sentence by a computer. Please read both of them carefully and choose "Yes" if the meaning of (H) can be inferred from the meaning of (T). Here is an example:

(T) Any lingering suspicion that this was a trick Al Budd had thought up was dispelled . 
(H) The suspicion was dispelled. Answer: YES 
(H) The suspicion was a trick. Answer: NO

You can choose the third option "Not sure" when the (H) statement is unrelated, unclear, ungrammatical or confusing in any other manner.

The "Not sure" answers were grouped with the "No" answers during evaluation.  Approximately 50% of the original entailments were retained after the inter-annotator agreement filtering.

4. Rationale

By picking dependencies that current state-of-the-art parsers disagree on, we hoped to create a dataset that would focus attention on the long tail of parsing problems that do not get sufficient attention under common evaluation metrics.  By further restricting ourselves to differences that can be expressed by natural language entailments, we hoped to focus on semantically relevant decisions rather than accidents of convention, which get mixed up in common evaluation metrics.  We chose to rely on untrained annotators on a natural inference task rather than trained annotators on an artificial tagging task because we believe (i) many subfields of computational linguistics are struggling to make progress because of the noise in artificially tagged data, and (ii) systems should try to model natural human linguistic competence rather than humans' dubious competence in artificial tagging tasks.  Our hope is that datasets like PETE will be used not only for evaluation but also for training and fine-tuning systems in the future.