Unbalanced Samples | Coursera Community

Unbalanced Samples

  • 16 November 2018
  • 7 replies

Badge +1
I have a dataset for input to a neural network containing ~52,000 labeled records. However, only about 1,500 labels are true and the rest are false.

Is this going to impact the successful training of my network?


Userlevel 5
Badge +5
If you train an ML model for a classification problem using an unbalanced dataset like the one you describe, your model may have trouble detecting the true data, since it has far fewer examples of what that data looks like.

Perhaps you should rephrase your problem as an anomaly detection problem? Train a model on all of the false data (which you have lots of); then any time the model sees something that differs from the false data, it will flag it as anomalous.
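As a rough illustration of that idea, here is a minimal sketch that fits a per-feature mean and standard deviation to the false class only and flags anything far away from it. The synthetic data, the function names and the threshold of 3 standard deviations are all assumptions made for the example, not anything taken from the dataset in this thread.

```python
import random
import statistics

def fit_normal_profile(rows):
    """Estimate per-feature mean and standard deviation from 'normal' rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.stdev(col) for col in columns]
    return means, stdevs

def is_anomalous(row, means, stdevs, threshold=3.0):
    """Flag a row if any feature lies more than `threshold` std devs from the mean."""
    return any(abs(x - m) / s > threshold for x, m, s in zip(row, means, stdevs))

# Synthetic stand-in for the "false" class: two features centred on (10, 50).
rng = random.Random(0)
false_rows = [(rng.gauss(10, 1), rng.gauss(50, 5)) for _ in range(5000)]

# The profile is fitted on false examples only; no true examples are needed.
means, stdevs = fit_normal_profile(false_rows)

typical_row = (10.2, 48.0)   # resembles the false class
unusual_row = (17.0, 52.0)   # first feature is far from anything seen in training
print(is_anomalous(typical_row, means, stdevs))   # False
print(is_anomalous(unusual_row, means, stdevs))   # True
```

In practice the threshold would be tuned on a validation set that contains some of the true cases.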

But this all really depends on your dataset. Could you give a bit of information about the specific problem you are working on and the dataset you are working with? Even though you only have 1,500 true data points, is it fair to say that they are a very good representation of the true class? You also need to think about the data your model will have to deal with after deployment: how similar is it to the training data? If the training data fully represents it, then you shouldn't have too much of a problem.

Also, do you plan to use all 50,500 false-labelled data points in training? Do you really need this much data to teach the model what false data looks like?

You can also just train the model, see how it performs, and plot some learning curves (this is easy if you are using Python's scikit-learn package). The learning curves will tell you whether adding more training data might improve your results.
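A minimal sketch of what that could look like with scikit-learn's `learning_curve` helper. The synthetic data from `make_classification`, the logistic regression model and the F1 scoring are assumptions chosen for illustration; swap in your own data and model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic imbalanced data (about 10% positives) standing in for the real dataset.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Train on growing fractions of the data, with 5-fold cross-validation at each size.
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
    scoring="f1",  # accuracy would be misleading on imbalanced data
)

# Average over the folds and print one line per training-set size.
for n, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"{int(n):5d} training examples: train F1 = {tr:.2f}, validation F1 = {va:.2f}")
```

If the validation score is still climbing at the largest training size, more data may help; if it has flattened out, better features or a different model are more likely to pay off.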
Userlevel 1
This post on 8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset might be worth a look.
Badge +1
Liz, Robert,

Thanks for the replies. The model is trying to forecast a result based on a set of historical input data. A positive (true) result is achieved when production volume actually hits or exceeds a certain level, depending on past production volumes, position in the cycle, etc. A negative (false) result is when production does not reach that threshold.

My concerns are exactly those expressed in the links: that a 'good' result will be 'learnt' by simply ignoring the positive cases and still achieving about 97% accuracy.

The problem I have is I am exploring ideas right now. I am not sure that the features I have chosen will actually end up being good predictors.

The data is truly representative and comes from several years' historical inputs. There are genuinely not that many 'positive' cases within the history and I would not expect that to differ going forward.

I will try to see if I can reduce the dataset on the false cases. I just worry I may be losing some important info.
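For reference, a minimal sketch of random undersampling of the false cases. The function name, the 3:1 ratio and the placeholder records are assumptions for illustration; only the class counts come from this thread.

```python
import random

def undersample_majority(records, labels, ratio=3, seed=0):
    """Keep every positive and a random sample of at most `ratio` negatives per positive."""
    positives = [i for i, label in enumerate(labels) if label]
    negatives = [i for i, label in enumerate(labels) if not label]
    rng = random.Random(seed)
    n_keep = min(len(negatives), ratio * len(positives))
    kept = sorted(positives + rng.sample(negatives, n_keep))
    return [records[i] for i in kept], [labels[i] for i in kept]

# Toy data with the class counts mentioned in the thread: 1,500 true, 50,500 false.
labels = [True] * 1500 + [False] * 50500
records = list(range(len(labels)))   # placeholder for the real feature rows

small_records, small_labels = undersample_majority(records, labels, ratio=3)
print(len(small_labels), sum(small_labels))   # 6000 1500
```

To limit the risk of losing information, the reduced set would normally be used for training only, with evaluation done on data that keeps the original class proportions.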
Userlevel 5
Badge +5
@DavidA So you should not use accuracy as your evaluation metric. Look at the number of false positives and false negatives to get a better idea of how well your model actually performs.
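A small sketch of why those counts are more informative than accuracy. The predictions below are made up for illustration (900 of the 1,500 true cases found, 600 false cases wrongly flagged); only the class sizes come from the thread.

```python
def confusion_counts(y_true, y_pred):
    """Return (true positives, false positives, false negatives, true negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    return tp, fp, fn, tn

# Hypothetical model output on 1,500 true and 50,500 false records.
y_true = [True] * 1500 + [False] * 50500
y_pred = [True] * 900 + [False] * 600 + [True] * 600 + [False] * 49900

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # of the records flagged true, how many really are
recall = tp / (tp + fn)      # of the records that are true, how many were found

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
# TP=900 FP=600 FN=600 TN=49900
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# accuracy=0.977 precision=0.600 recall=0.600
```

The accuracy looks excellent, yet the model misses 40% of the true cases and 40% of its positive calls are wrong.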

The reason I ask about your dataset and how well it represents the real-life test data is not about how often you might encounter data points from the smaller class. What I mean to ask is: does the data you have from the smaller class fully represent all possible future data points from that class?

If this is not the case, but you do have enough data to give a good representation of the bigger class, then I think it makes sense to train an anomaly detection model, rather than a classifier.

Also in my experience the best way to see if the features you have selected are good predictors is to just train a model with a very simple architecture - do not worry about things like hyperparameter tuning yet. Train it, and see how it performs on your cross-validation data. Once you have set up the code to do this once, trying out the same methodology for different combinations of parameters is trivial.
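A minimal sketch of that workflow using scikit-learn's `cross_validate`. The synthetic data, the plain logistic regression and the three candidate feature sets (picked by column index) are all hypothetical stand-ins for whatever features you are considering.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic imbalanced data standing in for the real dataset.
X, y = make_classification(
    n_samples=2000, n_features=6, weights=[0.9, 0.1], random_state=0
)

# Hypothetical candidate feature sets, identified by column index.
candidate_feature_sets = {
    "first three": [0, 1, 2],
    "last three": [3, 4, 5],
    "all six": [0, 1, 2, 3, 4, 5],
}

results = {}
for name, cols in candidate_feature_sets.items():
    # Same simple model and same 5-fold split for every feature set.
    scores = cross_validate(
        LogisticRegression(max_iter=1000),
        X[:, cols],
        y,
        cv=5,
        scoring=["precision", "recall"],
    )
    results[name] = (scores["test_precision"].mean(), scores["test_recall"].mean())
    print(f"{name}: precision = {results[name][0]:.2f}, recall = {results[name][1]:.2f}")
```

Once the loop is in place, adding another feature set to compare is a one-line change to the dictionary.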
Badge +1
Thanks Liz. Definitely good advice to look at the false positives and negatives. I am just starting out on this journey of ML modelling, so I really appreciate your input.
Userlevel 5
Badge +5
@DavidA No problem, I know the struggle!
Userlevel 4
Badge +3
The main issue you will have with imbalanced datasets is that the algorithm can get very high levels of accuracy by just guessing the majority class all the time. A simple rule ends up being unreasonably effective.

If your dataset is skewed 90/10, then always guessing the majority class already scores 90% accuracy, so any algorithm you build needs an accuracy of 91% or more. And even in that case it's only one percentage point better than the dumb "always guess false" algorithm.

So long as you understand how to interpret the results, a skewed dataset isn't so bad.
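The baseline is easy to work out for any class split. A short sketch (the function name is made up for the example), using the 90/10 case and the counts from the original question:

```python
def majority_baseline_accuracy(n_positive, n_negative):
    """Accuracy of a 'model' that always predicts the larger class."""
    return max(n_positive, n_negative) / (n_positive + n_negative)

print(f"{majority_baseline_accuracy(10, 90):.1%}")        # 90.0%
print(f"{majority_baseline_accuracy(1500, 50500):.1%}")   # 97.1%
```

Any accuracy figure for a real model only means something when compared against this number.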

