I’m working on the NLP task of finding anomalies in text.
I have a collection of medical articles, and for each new medical article I want to decide whether it contains anomalous information or not. By “anomalous” information I mean, for example: if my training dataset doesn’t contain any articles about ‘death’, but a new article is about it, then that article is an anomaly.
I’ve tried vectorizing my dataset with Google “use” (Universal Sentence Encoder) and then applying autoencoder models, isolation forest, local outlier factor, etc., but this doesn’t solve my task. For new articles the classification is strange (it depends heavily on the “use” vectors), which is why I’m trying another approach based on bag-of-words methods.
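Here is a minimal sketch of the pipeline I tried. Note that the embeddings below are random stand-ins (the real “use” embeddings need `tensorflow_hub` and a model download); the corpus sizes and dimensions are just placeholders:

```python
# Sketch of the embedding + outlier-detector pipeline (stand-in data:
# random vectors replace the real Universal Sentence Encoder embeddings,
# which are 512-dimensional per document).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
train_vectors = rng.normal(size=(100, 512))  # one vector per training article
new_vectors = rng.normal(size=(5, 512))      # vectors for new articles

forest = IsolationForest(contamination=0.05, random_state=0)
forest.fit(train_vectors)

# predict() returns 1 for inliers and -1 for anomalies;
# decision_function() gives a continuous score (lower = more anomalous).
labels = forest.predict(new_vectors)
scores = forest.decision_function(new_vectors)
```

The local outlier factor variant is the same shape, just with `sklearn.neighbors.LocalOutlierFactor(novelty=True)` in place of the forest.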
So I’ve tried a TF-IDF model on my dataset. Then a new article arrives that may contain some new words that my TF-IDF model doesn’t know yet. My question is: how can I tell whether such a word is really an ‘anomaly’ word or not? Because getting a new word doesn’t by itself mean it’s an anomaly!
What can you suggest?