I have trained an SBERT model with MLM on my own corpus, which is somewhat domain-specific, using these guides:
- https://ireneli.eu/2021/03/28/deep-learning-19-training-mlm-on-any-pre-trained-bert-models/
- https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm.py
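For reference, here is roughly what my training setup looked like; the checkpoint name and file paths below are placeholders, not my exact values:

```python
# Rough sketch of my MLM training, following the guides above.
# "sentence-transformers/all-MiniLM-L6-v2" and "corpus.txt" are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "sentence-transformers/all-MiniLM-L6-v2"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# One plain-text file with one training example per line
dataset = load_dataset("text", data_files={"train": "corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# Standard MLM objective: 15% of tokens are masked
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-out", num_train_epochs=1),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
trainer.save_model("mlm-out")
tokenizer.save_pretrained("mlm-out")
```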
When I saved the tokenizer with
tokenizer.save_pretrained(output_dir)
it created a set of files. I opened vocab.txt and searched for some domain-specific words, but I could not find them.
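To be concrete, this is more or less how I checked (the example word is just a stand-in for my actual domain terms):

```python
# "electrocardiogram" stands in for one of my domain-specific words.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(output_dir)

word = "electrocardiogram"
print(word in tokenizer.get_vocab())  # False for my domain words
print(tokenizer.tokenize(word))       # the word gets split into subword pieces instead
```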
- Do I need to train a tokenizer on my corpus as well? (Or could I just add the missing words to the existing one? See the sketch after this list.)
- If so, do I then need to retrain the SBERT model with MLM again? (That would be really disappointing, since I don't have a GPU and had to pay for cloud GPU time to train.)
- Is the model I have now useless if I use it with the original tokenizer, which lacks the domain-specific words?
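Regarding the first question, I was wondering whether something like the following would be enough, instead of training a whole new tokenizer (the word list is a placeholder for my actual terms):

```python
# What I was considering instead of training a tokenizer from scratch.
# new_words is a placeholder for my actual list of domain terms.
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(output_dir)
model = AutoModelForMaskedLM.from_pretrained(output_dir)

new_words = ["myDomainTerm1", "myDomainTerm2"]
num_added = tokenizer.add_tokens(new_words)

# The embedding matrix has to grow to cover the new token ids.
# The new rows are randomly initialized, which is why I suspect
# some further MLM training would still be needed afterwards.
model.resize_token_embeddings(len(tokenizer))
```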