Commit e79a0fa

Added missing code in exemplary notebook - custom datasets fine-tuning (huggingface#15300)
* Added missing code in the `tokenize_and_align_labels` function in the example notebook on custom datasets (token classification). The missing code assigns labels to all but the first token of each word. The added code was taken directly from the official Hugging Face example, this [colab notebook](https://github.com/huggingface/notebooks/blob/master/transformers_doc/custom_datasets.ipynb).
* Changes requested in the review: keep the code as simple as possible.
1 parent 0501beb commit e79a0fa

File tree

1 file changed: +3 additions, -1 deletion


docs/source/custom_datasets.mdx

Lines changed: 3 additions & 1 deletion
```diff
@@ -326,7 +326,9 @@ def tokenize_and_align_labels(examples):
                 label_ids.append(-100)
             elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                 label_ids.append(label[word_idx])
-            previous_word_idx = word_idx
+            else:
+                label_ids.append(-100)
+            previous_word_idx = word_idx
         labels.append(label_ids)

     tokenized_inputs["labels"] = labels
```
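The patched branch handles subword tokens: when a tokenizer splits one word into several tokens, only the first token keeps the word's label and the rest receive `-100`, the index PyTorch's cross-entropy loss ignores by default. A minimal sketch of that alignment logic, with a hypothetical helper name (`align_labels_with_tokens`) and hand-written `word_ids` standing in for the tokenizer's `word_ids()` output:

```python
def align_labels_with_tokens(word_ids, labels):
    """Align word-level labels to subword tokens.

    word_ids maps each token to the index of the word it came from;
    None marks special tokens such as [CLS] and [SEP].
    """
    label_ids = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            label_ids.append(-100)  # special token: ignored by the loss
        elif word_idx != previous_word_idx:
            label_ids.append(labels[word_idx])  # first token of a word
        else:
            label_ids.append(-100)  # later subword of the same word (the branch this commit adds)
        previous_word_idx = word_idx
    return label_ids


# Three words with labels [3, 0, 7]; the second word splits into two subwords.
word_ids = [None, 0, 1, 1, 2, None]
print(align_labels_with_tokens(word_ids, [3, 0, 7]))  # → [-100, 3, 0, -100, 7, -100]
```

Without the `else:` branch, the repeated `word_idx` for the second subword would fall through and no label would be appended for that token, leaving `labels` shorter than `input_ids`.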
