Task #10: English Lexical Substitution

    Issues with the trial data

    1. We are aware that there was an error in the DTD released with the trial dataset. The lines

      "<!ATTLIST instance
              id ID #REQUIRED>"

      should be replaced with

      "<!ATTLIST instance id CDATA #REQUIRED>"

      because our IDs are numeric, and an attribute of type ID must be a valid XML name, which cannot begin with a digit. We will fix this in the test run version.

    2. We are aware that there are XML errors in the input file (lexsub_trial.xml). For example, XML entities representing extended characters are sometimes corrupted. This particularly affects quotes, apostrophes and non-English characters. The cause is that we left the original corpus data (http://corpus.leeds.ac.uk/internet.html) as it was, without any further cleaning of the sentences, which is how the annotators will have received it. You can fix some of the XML issues if you wish with the Perl script available here. We hope to provide a simple tool for participants to use on the data if they wish. Thanks to Richard Wicentowski for drawing this to our attention and suggesting the fix.
    3. Also thanks to Richard Wicentowski, we realise there are a few spelling mistakes (optimisitc, ascertail, ininformed, assesment, vicioous) in the following lines of the file "gold.trial":
      bright.a 7 :: positive 3;promising 2;good 1;optimisitc 1;hopeful 1;
      find.v 79 :: discover 2;locate 2;determine 1;ascertail 1;calculate 1
      dark.n 173 :: ininformed 1;unsure 1;unclear 1;oblivious 1;uncertain 1;ignorant 1;unaware 1;unilluminated 1;
      examination.n 182 :: investigation 2;inspection 2;consultation 1;assesment 1;
      nasty.a 246 :: mean 3;unpleasant 2;spiteful 1;vicioous 1;
      We checked the annotators' responses manually for lemmatisation and spelling errors, but these slipped through the net. We will run a spell check on the gold standard for the test dataset.
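
For participants who want to run their own check on the gold file, the line format shown in item 3 can be read with a few lines of Python. This is only a sketch: the function names and the `vocabulary` set are hypothetical, the format is inferred from the five lines quoted above, and it is not the spell check the organisers will use.

```python
def parse_gold_line(line):
    """Split one gold line into (target, instance_id, {substitute: count})."""
    head, _, tail = line.partition("::")
    target, instance_id = head.split()
    substitutes = {}
    for item in tail.split(";"):
        item = item.strip()
        if not item:
            continue  # a trailing ';' leaves an empty field
        # The count is the last space-separated token, so a substitute
        # made of several words would stay intact.
        word, _, count = item.rpartition(" ")
        substitutes[word] = int(count)
    return target, int(instance_id), substitutes


def unknown_substitutes(line, vocabulary):
    """Return substitutes with at least one token missing from `vocabulary`.

    `vocabulary` is an assumed set of lower-case known words supplied by the caller.
    """
    _, _, substitutes = parse_gold_line(line)
    return [
        word
        for word in substitutes
        if not all(token in vocabulary for token in word.lower().split())
    ]


if __name__ == "__main__":
    sample = "nasty.a 246 :: mean 3;unpleasant 2;spiteful 1;vicioous 1;"
    known = {"mean", "unpleasant", "spiteful", "vicious"}
    print(parse_gold_line(sample))
    print(unknown_substitutes(sample, known))
```

With a real word list in place of the toy `known` set, the same call over every line of "gold.trial" would list candidate misspellings for manual review.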

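Regarding item 2, one way to see where the input file stops being well-formed is to hand it to a standard XML parser and read back the position of the first error. The sketch below uses Python's `xml.etree.ElementTree`; the function name `first_xml_error` is hypothetical, and this is not the Perl script mentioned above, only a way of locating problems.

```python
import xml.etree.ElementTree as ET


def first_xml_error(text):
    """Return (line, column, message) for the first parse error, or None if well-formed."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None


if __name__ == "__main__":
    # A bare ampersand is a typical well-formedness error in uncleaned text.
    print(first_xml_error("<context>fish & chips</context>"))
    print(first_xml_error("<context>fish &amp; chips</context>"))
```

To check the trial data, read lexsub_trial.xml in binary mode and pass the bytes to `first_xml_error`, so that the parser can honour the encoding declared in the file. Only the first error is reported, so the check needs to be repeated after each fix.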

    If you have questions, please write to lexsub at sussex dot ac dot uk.

