
Using Deep Learning for Tabular Data


Hey. I've been working on loan approval and customer retention models at work, built on data from operational data stores (in-house databases) within the organization. I've been using tree-based models from sklearn, which are giving me great results. Since business users in my organization are drawn to buzzwords like deep learning and neural networks, we've been having a debate about using deep learning for these problems. I personally think deep learning is better suited to problems in NLP and computer vision. Moreover, we've also tried deep learning models on the same dataset and couldn't beat the tree-based models, which further supports my stance. I believe deep learning is very powerful but should be used where it fits the problem you're solving and would outperform algorithms like AdaBoost, XGBoost, random forests, bagging classifiers, etc. What's your stance on this?
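For context, the kind of comparison we've been running looks roughly like this; the CSV file, column names, and model settings below are placeholders for our internal data, not the exact pipeline:

```python
# Rough sketch of the tree-vs-neural-net comparison described above.
# "loan_applications.csv" and the "approved" target column are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

df = pd.read_csv("loan_applications.csv")            # export from the operational data store
X = pd.get_dummies(df.drop(columns=["approved"]))    # one-hot encode categorical columns
y = df["approved"]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Tree-based baseline
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

# Simple feed-forward network on the same features (scaled inputs)
scaler = StandardScaler().fit(X_train)
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
mlp.fit(scaler.transform(X_train), y_train)

print("RF  AUC:", roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]))
print("MLP AUC:", roc_auc_score(y_test, mlp.predict_proba(scaler.transform(X_test))[:, 1]))
```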

2 replies

@Liz @THANGA MANICKAM M – do you have any insights on this?
Deep learning tends to work well when the dataset is very large, when the features are not sparse, and when transfer learning is applicable. That is why it shows such good results in computer vision and NLP. Tabular data rarely has these properties. Most tabular datasets contain categorical variables like customer ID or city name, which become sparse features once they are one-hot encoded for a neural network. Also, in tasks like loan approval we cannot expect a very large dataset, and using deep learning on these smaller datasets can lead to overfitting. Hence tree-based models and other classical machine learning algorithms usually show better results on tabular data; controlling overfitting is the main difficulty when applying deep learning here. That said, you can consider deep learning on tabular data for the following reasons:
  1. When the dataset is very large, a deep learning model can learn complex relationships in the data. A larger dataset also allows bigger validation and test sets, which helps keep overfitting in check.
  2. Deep learning models have many more hyperparameters to tune than tree-based models. Finding the right settings takes time, but it lets us experiment with many combinations, which gives more options for improving the model's generalization.
  3. Embeddings make it possible to use deep learning on datasets with categorical variables. Embeddings are reusable, and their main advantage is that little hand engineering is involved; the representation is learned automatically (see the sketch after this list).
  4. When the tabular dataset involves a time series (stock prediction, customer activity over the past 12 months, etc.), recurrent neural networks can give good results.
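A minimal sketch of point 3, assuming a Keras/TensorFlow setup; the column names, category counts, and embedding sizes below are illustrative assumptions, not taken from any specific dataset:

```python
# Entity embeddings for categorical tabular columns (illustrative sketch).
from tensorflow import keras
from tensorflow.keras import layers

n_cities, n_call_types = 500, 4   # assumed number of distinct categories per column

city_in = keras.Input(shape=(1,), dtype="int32", name="city_id")
call_in = keras.Input(shape=(1,), dtype="int32", name="call_type")
num_in  = keras.Input(shape=(8,), name="numeric_features")   # e.g. income, tenure, ...

# Each categorical column gets its own learned embedding instead of a sparse one-hot vector.
city_emb = layers.Flatten()(layers.Embedding(n_cities, 16)(city_in))
call_emb = layers.Flatten()(layers.Embedding(n_call_types, 2)(call_in))

x = layers.Concatenate()([city_emb, call_emb, num_in])
x = layers.Dense(64, activation="relu")(x)
x = layers.Dense(32, activation="relu")(x)
out = layers.Dense(1, activation="sigmoid")(x)

model = keras.Model([city_in, call_in, num_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
model.summary()
```

Training would then take the integer-encoded categorical columns and the numeric matrix as a list of inputs; the learned embedding weights can later be reused as features for other models.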
Artificial Neural Networks Applied to Taxi Destination Prediction is an example of deep learning applied to tabular data. That model topped the leaderboard in Kaggle's taxi destination prediction challenge. The dataset was tabular, with fields like customer ID, day, and call type, and contained about 1.7 million data points, which was large enough for deep learning. The model used embeddings to represent categorical features and outscored other machine learning approaches such as random forests.
My opinion: when your dataset is large, try deep learning on it, keeping in mind that deep learning models take considerable time to train and improve. In some projects with large tabular datasets I have found the best results with deep learning.

