Task #10: English Lexical Substitution

    Questions

    1. Q: Are verbs and other PoS among the target words?
    2. A: Yes, verbs, adjectives and adverbs are included (see the full description at http://nlp.cs.swarthmore.edu/semeval/tasks/task10/description.shtml). There may be minor modifications to the description as we nail down a few measures.

    3. Q: Each target word will have at least one substitute?
    4. A: Yes, all words will have at least one substitute. We are going to weed out cases which are clearly problematic.

    5. Q: Is it true that the word types (lexelt items) included in the trial data (e.g. bright, film, take, etc.) are not necessarily the words that you will include in the test data?
    6. A: You are correct that the words in the trial dataset are not those that will be in the test set.

    7. Q: Will there be any training data? If so, will those lexelt items be representative of the test data, or will it be similar to the trial data (in that you make no claims about those words appearing in the test data)?
    8. A: There is no training data (from the full description: "For this reason we will not provide training data since this would mean we would need to specify potential substitutes in advance."). There is no guarantee how the test words will behave or what the synonyms will be. We are not certain yet how much data we can get annotated in time, but we should have at least 1500 test sentences (hopefully more). Each word will have 10 sentences (one or two sentences may be removed if they are problematic in some way). Some words are selected manually and some randomly from lists of potential candidates. In the test data, approximately 20 words for each PoS will have their sentences selected manually rather than by the random process. This may mean that the sense distribution in those sentences is not representative of the corpus as a whole. As mentioned in our description, we will provide a breakdown of scores for these two sentence-sampling approaches.

    9. Q: I don't understand why score.pl uses 298 as the total number of items in the trial data when there are 300?
    10. A: This is because it only uses items with two or more responses from the annotators that are neither NIL nor proper names (see the document task10documentation.pdf).
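
    To illustrate the rule, here is a minimal Python sketch (the data layout and function name are hypothetical; score.pl itself applies this filter to the gold-standard files described in task10documentation.pdf):

        # A response is a (substitute, is_proper_name) pair; "NIL" marks
        # an annotator who supplied no substitute.
        def is_scorable(responses):
            valid = [word for word, is_proper in responses
                     if word != "NIL" and not is_proper]
            return len(valid) >= 2

        # Only one usable response remains, so this item would be excluded;
        # two such items are why 300 trial items yield 298 scorable ones.
        print(is_scorable([("NIL", False), ("Hoover", True), ("bright", False)]))  # False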

    11. Q: We noticed that some suggested synonyms in the trial data were spelled in British English (e.g. bright.a #3 lists 'colourful' as the first suggested replacement). Will spelling differences be accounted for in the scoring, or should we make every effort to report our answers with British spellings?
    12. A: The annotators are all British, so I would advise providing substitutes with British spellings. (We do mention on the task description web page that all subjects are living in the UK.) Our subjects are free to use American spellings, though, and we suspect that will happen some of the time. We don't want to promise to allow for this in the scoring, in case we add rules which inadvertently cause errors.

    13. Q: I have a question after looking at the examples you provided. The word to be substituted may not be in its base form, for example "<head>takes</head>". If we find "last" as a correct substitution for "take" in a given sentence, do we have to change "last" to "lasts"? In other words, if we output "last", will it be judged as correct?
    14. A: We are expecting substitutes in lemmatised form, i.e. last, not lasts. This is stated in our documentation (see the document task10documentation.pdf) and should be evident from the trial gold standard. The multiword identification and detection subtask is perhaps more complicated, since in some cases it isn't obvious that the lemmatised form is the canonical one; in those cases we take the response from the annotators as is.
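
    For example, here is a minimal sketch of lemmatising a substitute before output, assuming NLTK's WordNet lemmatiser is available (any tool that yields the base forms used in the gold standard would do):

        from nltk.stem import WordNetLemmatizer  # requires nltk.download('wordnet')

        lemmatizer = WordNetLemmatizer()
        # Submit the lemma "last" rather than the inflected form "lasts".
        print(lemmatizer.lemmatize("lasts", pos="v"))  # -> last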

    15. Q: If we participate in the Best or OOT method, are we expected to come up with the multi-word synonyms? Using your scorer (distributed with the trial data), it seems like we are penalised for not guessing these, even though MW is evaluated as a separate task. It would be useful (to us, at least) if there were a scoring method that only judged us on our ability to guess single-word replacements.
    16. A: The MW task is identification of multiwords in the original sentence rather than scoring of multiword substitutes. In response to your request, we have given the scorer for the test run an option to score only the single-word substitutes in the gold standard. This will work on the subset of the data that has two or more single-word (non-proper-name) responses from the annotators. We will provide this breakdown to participants, as well as the scores on the full set of substitutes.

    17. Q: Is it true that in the "best" scoring a precision of 100% is not always possible? For instance, in the example given in Section 4, here is what the various possible submissions would get for item 9999, if my understanding is correct:

        glad => 3/7
        merry => 2/7
        cheerful => 1/7
        glad;merry;cheerful;jovial => (7/4)/7 = 1/4
    18. A: Yes, that is right. The reason is that there is more uncertainty about the correct answer for items with more variation. To get maximum credit, your best strategy is to guess the mode and give only one answer. We want to favour systems which provide the best answer, and we put more weight on items where there is more agreement. For oot, 100% is possible provided |Hi| does not exceed 10.

    19. Q: So is there a simple reason why the maximum score in the best evaluation is only attainable by giving a single answer?

    20. A: The system should be trying to find the best substitute, not hedging its bets. If a system really thought several substitutes were equally good, it should provide these as best, and this would be reflected by equal choices from the annotators. The system needs to guess the favourite of the annotators. The idea of scoring against all of the annotators' responses is that there will be variation; it is not a black and white situation, and we want to emphasise test items with better agreement and less variation.
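
    To make the arithmetic in this exchange concrete, here is a minimal Python sketch of the per-item credit under the "best" measure (our reading of the worked example above, not score.pl itself; the gold counts for item 9999 are the hypothetical ones from the question):

        from collections import Counter

        # Hypothetical gold responses for item 9999: seven annotator
        # responses in total, so |Hi| = 7.
        gold = Counter({"glad": 3, "merry": 2, "cheerful": 1, "jovial": 1})

        def best_credit(answers, gold):
            # Frequencies of the system's answers in the gold multiset,
            # averaged over the number of answers and normalised by |Hi|.
            h_i = sum(gold.values())
            return sum(gold[a] for a in answers) / len(answers) / h_i

        print(best_credit(["glad"], gold))                                 # 3/7
        print(best_credit(["glad", "merry", "cheerful", "jovial"], gold))  # 0.25

    Guessing only the mode ("glad") earns 3/7, while hedging across all four responses earns only 1/4, which is why a single best guess is the optimal strategy here.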

    You may also be interested in reading about some issues with the trial data that have been raised by other participants.

    Contact

    If you have questions, please write to lexsub at sussex dot ac dot uk.


