I have trained an SBERT model with MLM on my own corpus, which is somewhat domain-specific, using these guides:
- https://ireneli.eu/2021/03/28/deep-learning-19-training-mlm-on-any-pre-trained-bert-models/
- https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm.py
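For reference, here is roughly what my training setup looked like; the checkpoint name and file paths below are placeholders, not my exact values:

```python
# Rough sketch of my MLM training, following the guides above.
# "sentence-transformers/all-MiniLM-L6-v2" and "corpus.txt" are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "sentence-transformers/all-MiniLM-L6-v2"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# One plain-text file with one training example per line
dataset = load_dataset("text", data_files={"train": "corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# Standard MLM objective: 15% of tokens are masked
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-out", num_train_epochs=1),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
trainer.save_model("mlm-out")
tokenizer.save_pretrained("mlm-out")
```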
When I saved the tokenizer with
tokenizer.save_pretrained(output_dir)
it created a set of files. I opened vocab.txt and searched for some domain-specific words, but I could not find them.
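To be concrete, this is more or less how I checked (the example word is just a stand-in for my actual domain terms):

```python
# "electrocardiogram" stands in for one of my domain-specific words.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(output_dir)

word = "electrocardiogram"
print(word in tokenizer.get_vocab())  # False for my domain words
print(tokenizer.tokenize(word))       # the word gets split into subword pieces instead
```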
- Do I need to train a tokenizer on my corpus as well? (Or could I just add the missing words to the existing one? See the sketch after this list.)
- If so, do I then need to retrain the SBERT model with MLM again? (That would be really disappointing, since I don't have a GPU and had to pay for cloud GPU time to train.)
- Is the model I have now useless if I use it with the original tokenizer, which lacks the domain-specific words?
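Regarding the first question, I was wondering whether something like the following would be enough, instead of training a whole new tokenizer (the word list is a placeholder for my actual terms):

```python
# What I was considering instead of training a tokenizer from scratch.
# new_words is a placeholder for my actual list of domain terms.
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(output_dir)
model = AutoModelForMaskedLM.from_pretrained(output_dir)

new_words = ["myDomainTerm1", "myDomainTerm2"]
num_added = tokenizer.add_tokens(new_words)

# The embedding matrix has to grow to cover the new token ids.
# The new rows are randomly initialized, which is why I suspect
# some further MLM training would still be needed afterwards.
model.resize_token_embeddings(len(tokenizer))
```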