This is not necessarily wrong, but using a ReLU here is not a very common choice as far as I know. It may not hurt anything, but if training is unstable or results are worse than expected, this would be one thing to check.
Also: you can tie the input and output embedding matrices (i.e. use a single weight matrix for both `self.embedding.weight` and `self.out.weight`). This does not change the vocabulary size, but it roughly halves the number of parameters spent on those two matrices and can help a bit with overfitting. Note that you would still keep the bias that is part of the `self.out` layer.
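A minimal sketch of what that tying could look like in PyTorch. I'm assuming `self.embedding` is an `nn.Embedding` and `self.out` is an `nn.Linear` as in your code; `TiedLM`, `vocab_size`, and `embed_dim` are placeholder names:

```python
import torch.nn as nn

class TiedLM(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.out = nn.Linear(embed_dim, vocab_size, bias=True)
        # Tie the matrices: the output projection reuses the embedding weights.
        # nn.Embedding stores (vocab_size, embed_dim) and nn.Linear stores
        # (out_features, in_features) = (vocab_size, embed_dim), so the shapes match.
        self.out.weight = self.embedding.weight

    def forward(self, tokens):
        h = self.embedding(tokens)
        # ... rest of your model would go here ...
        return self.out(h)
```

One caveat: this only works if the dimension of whatever feeds `self.out` equals the embedding dimension; otherwise you would need an extra projection in between.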