One hot encoder after data split | Coursera Community
Question

One hot encoder after data split


Badge +1

Hello,

I was recently experimenting with sklearn Pipelines. I made a preprocessing pipeline with one-hot encoding for the categorical data, and I split my training set into a training set and a dev set. The problem is that some categories only show up in the dev/test set, so they have no column after encoding and my program fails :/. Is there a way to get the preprocessed dataset back from the pipeline object, so I can preprocess all the data first, then split, and still use my pipeline?
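To make it concrete, here's a tiny made-up example of the kind of failure I mean (toy column, not my actual dataset):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_part = pd.DataFrame({"neighborhood": ["A", "A", "B"]})
dev_part = pd.DataFrame({"neighborhood": ["A", "C"]})  # "C" never appears in the training part

encoder = OneHotEncoder()        # default is handle_unknown='error'
encoder.fit(train_part)          # learns columns only for "A" and "B"

try:
    encoder.transform(dev_part)  # "C" has no column, so this raises
except ValueError as err:
    print(err)                   # something like: Found unknown categories ['C'] in column 0 ...

That's basically what happens inside my pipeline when the split leaves a category out of the training part.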

Thanks, 

P.S. Sorry if this is a rookie mistake, I'm just getting started.
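Edit: what I had in mind is roughly this, I just don't know if it's the right way to use a pipeline (the columns here are made up, not my real code):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "neighborhood": ["A", "A", "B", "B", "C"],  # "C" is a rare category
    "area": [50, 60, 55, 70, 65],
    "price": [100, 120, 110, 140, 130],
})

preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(), ["neighborhood"])],
    remainder="passthrough")

# encode everything first, so every category gets its column...
X_all = preprocess.fit_transform(df.drop(columns="price"))
y_all = df["price"].to_numpy()

# ...then split the already-encoded data into train and dev parts
X_train, X_dev, y_train, y_dev = train_test_split(
    X_all, y_all, test_size=0.4, random_state=0)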


2 replies

Userlevel 2
Badge +2

The problem is that some categories only show up in the dev/test set, so they have no column after encoding and my program fails :/.

Hey parth,

Could you please update the question with the complete traceback, if there is an error?

If it's a logic error, could you post the format of the dataset? I'm trying to understand your issue, but right now it's too vaguely defined.

Badge +1


import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

dataset_training = pd.read_csv("../datasets/HousePrices/train.csv")
dataset_test = pd.read_csv("../datasets/HousePrices/test.csv")

models = [LinearRegression(), SVR(kernel='rbf', degree=2), DecisionTreeRegressor(),
          RandomForestRegressor(n_estimators=10), KNeighborsRegressor(n_neighbors=5),
          GradientBoostingRegressor(learning_rate=0.1, n_estimators=10)]


def Cleaner(dataset, argument='train'):
    cleaned_data = dataset.loc[:, dataset.isnull().sum(axis=0) < 500]
    if argument == 'train':
        labels = cleaned_data['SalePrice']
        data = cleaned_data.drop(['SalePrice', 'Id'], axis=1)
    else:
        data = cleaned_data.drop(['Id'], axis=1)
        labels = np.array([])
    return data, labels


def prediction(dataset, labels):
    labels = labels.to_numpy()
    numeric_features = dataset.select_dtypes(include=['int64', 'float64']).columns
    categorical_features = dataset.select_dtypes(include=['object']).columns

    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scale', StandardScaler())
    ])
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent', fill_value='missing')),
        ('onehot', OneHotEncoder())
    ])
    data_pipeline = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)
        ])
    predictor_pipeline = Pipeline(steps=[
        ('preprocessor', data_pipeline),
        ('classifier', LinearRegression()),
    ])

    kf = StratifiedKFold(n_splits=2, shuffle=True)
    for classifier in models:
        score_list = []
        for train_index, test_index in kf.split(dataset, labels):
            X_train, X_test = dataset.loc[train_index, :], dataset.loc[test_index, :]
            y_train, y_test = labels[train_index], labels[test_index]
            predictor_pipeline.set_params(classifier=classifier)
            predictions = predictor_pipeline.fit(X_train, y_train).predict(X_test)
            results = cross_val_score(classifier, predictions, y_test, cv=kf, scoring='accuracy')
            score_list = score_list + [results]
    return results


data, label = Cleaner(dataset_training, 'train')
results = prediction(data, label)

Hello, thanks for the reply. I'm working on the dataset below; sorry the code isn't pretty, I'm just starting out.

https://www.kaggle.com/c/house-prices-advanced-regression-techniques
