Using Deep Learning for Tabular Data | Coursera Community


  • 27 March 2019
  • 4 replies

Hey. I've been working on some loan approval and customer retention models at work, developed using data from operational data stores (in-house databases) within the organization. I've been using tree-based models from sklearn, which are giving me great results. Since business users in my organization are attracted to buzzwords like deep learning and neural networks, we have been debating whether to use deep learning for these problems. I personally think deep learning is best suited to problems in NLP and computer vision. Moreover, we've tried deep learning models on the same dataset but couldn't get better results than the tree-based models, which further supports my stance. I believe deep learning is very powerful, but it should be used where it is well suited to the problem you're solving and would outperform algorithms like AdaBoost, XGBoost, random forests, bagging classifiers, etc. What's your stance on this?
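As a point of reference, a tree-based baseline of the kind described above can be set up in a few lines of sklearn. This is a minimal sketch on synthetic data; the dataset, feature counts, and hyperparameters are illustrative, not from the original post:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a tabular loan-approval dataset
X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = GradientBoostingClassifier(n_estimators=200, max_depth=3,
                                   random_state=42)
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {acc:.3f}")
```

On real data you would of course swap in your own features and tune `n_estimators` and `max_depth` with cross-validation.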

4 replies

@Liz @THANGA MANICKAM M – do you have any insights on this?
Deep learning tends to work well when the dataset is very large, when the features are not sparse, and when transfer learning is applicable. Those conditions are why it shows such good results in computer vision and NLP, and they are not common in tabular data. Most tabular datasets have categorical variables like customer ID or city name, which become sparse features after one-hot encoding them for use in deep learning. Also, in tasks like loan approval we cannot expect a very large dataset, and using deep learning on these smaller datasets can lead to overfitting. Hence tree-based models and other machine learning algorithms show better results on tabular data; controlling overfitting is the serious issue in using deep learning here. That said, you can try deep learning on tabular data for the following reasons:
  1. When the dataset is very large, a deep learning model can learn complex relationships in the data. A larger dataset also makes it possible to hold out bigger validation and test sets, which helps reduce overfitting.
  2. Deep learning models have many more hyperparameters to tune than tree-based models. While finding the right set of hyperparameters can take time, it lets us experiment with different combinations, giving us more options for improving the model's generalization.
  3. Embeddings let us apply deep learning to datasets with categorical variables. Embeddings have the advantage of reusability, and their main advantage is that little hand-engineering is involved: they are learned automatically.
  4. When the tabular dataset involves time series (like stock prediction or customer activity over the past 12 months), recurrent neural networks can show good results.
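To make point 3 concrete: an embedding replaces a sparse one-hot vector with a dense learned vector. A minimal numpy sketch of the lookup follows; in a real network the table is a trainable parameter updated by backpropagation, and the cardinality and embedding size here are made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

n_cities = 1000        # cardinality of a categorical column, e.g. city name
embedding_dim = 8      # dense representation size (an illustrative choice)

# In a real network this table would be learned; here it is random
embedding_table = rng.normal(size=(n_cities, embedding_dim))

city_id = 42
one_hot = np.zeros(n_cities)      # sparse: 1000 dims, one non-zero entry
one_hot[city_id] = 1.0
dense = embedding_table[city_id]  # dense: 8 dims, all informative

# The lookup is equivalent to multiplying the one-hot vector by the table
assert np.allclose(one_hot @ embedding_table, dense)
print(dense.shape, one_hot.shape)
```

The contrast between the 1000-dimensional one-hot vector and the 8-dimensional embedding is exactly the sparsity problem mentioned above.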
Artificial Neural Networks Applied to Taxi Destination Prediction is an example of deep learning used on tabular data. The model topped the leaderboard in Kaggle's taxi destination prediction challenge. The dataset was tabular, with fields like customer ID, day, and call type, and it had 1.7 million data points, which was large enough for deep learning. The model used embeddings to represent categorical features and outscored other machine learning algorithms like random forests.
My opinion: when your dataset is large, try deep learning on it, but bear in mind that deep learning models take a lot of time to train and improve. In some projects with large tabular datasets I have found the best results with deep learning.
I saw a presentation by the Kaggle CEO, where he said that most of the winning submissions on unstructured data (images, words, sounds, etc.) were based on deep learning, while most of the winning submissions on structured data (tables of rows and columns) were based on gradient boosting (a.k.a. boosted trees). And most winning submissions used stacking, whatever the kind of data.
Here is a teaser of the presentation.
Unfortunately the whole thing can only be accessed via O'Reilly Safari, but it is possible to view it during an evaluation period without paying.
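The stacking mentioned above (combining several base models through a meta-learner trained on their predictions) can be sketched with scikit-learn's `StackingClassifier`. The base models, meta-learner, and synthetic data below are illustrative choices, not what any particular winning submission used:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic tabular classification problem
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    # Meta-learner fit on out-of-fold predictions of the base models
    final_estimator=LogisticRegression(),
    cv=5,
)
score = cross_val_score(stack, X, y, cv=3).mean()
print(f"stacked CV accuracy: {score:.3f}")
```

Competition-grade stacks are typically much larger and mix model families (boosted trees, neural networks, linear models), but the mechanics are the same.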

We have tried using CNNs for tabular data, and the result was published in the CVPR 2019 Precognition Workshop.

The paper is titled “SuperTML: Two-Dimensional Word Embedding for the Precognition on Structured Tabular Data”

“In this paper, we propose the SuperTML method, which borrows the idea of Super Characters method and two-dimensional embeddings to address the problem of classification on tabular data. For each input of tabular data, the features are first projected into two-dimensional embeddings like an image, and then this image is fed into fine-tuned two-dimensional CNN models for classification. The proposed SuperTML method handles the categorical data and missing values in tabular data automatically, without any need to pre-process into numerical values. Comparisons of model performance are conducted on one of the largest and most active competitions on the Kaggle platform, as well as on the top three most popular data sets in the UCI Machine Learning Repository. Experimental results have shown that the proposed SuperTML method have achieved state-of-the-art results on both large and small datasets.”


We also found that it has already been replicated by people who are interested in it and published on GitHub (thanks, EmjayAhn!):


You're welcome to try it!