python - Label encoding across multiple columns in scikit-learn
I'm trying to use scikit-learn's LabelEncoder to encode a pandas DataFrame of string labels. As the DataFrame has many (50+) columns, I want to avoid creating a LabelEncoder object for each column; I'd rather have one big LabelEncoder object that works across all of my columns of data.

Throwing the entire DataFrame into LabelEncoder creates the error below. Please bear in mind that I'm using dummy data here; in actuality I'm dealing with about 50 columns of string-labeled data, so I need a solution that doesn't reference any columns by name.
    import pandas
    from sklearn import preprocessing

    df = pandas.DataFrame({
        'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
        'owner': ['champ', 'ron', 'brick', 'champ', 'veronica', 'ron'],
        'location': ['san_diego', 'new_york', 'new_york', 'san_diego', 'san_diego', 'new_york']
    })

    le = preprocessing.LabelEncoder()
    le.fit(df)

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 103, in fit
        y = column_or_1d(y, warn=True)
      File "/users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 306, in column_or_1d
        raise ValueError("bad input shape {0}".format(shape))
    ValueError: bad input shape (6, 3)
Any thoughts on how to get around this problem?
You can easily do this, though:

    from sklearn.preprocessing import LabelEncoder
    df.apply(LabelEncoder().fit_transform)
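As a minimal sketch of what this one-liner does on the question's dummy data: `apply` hands each column to `fit_transform` as a 1-D Series, which is why the "bad input shape" error goes away. The variable name `encoded` is only for illustration. Note that the single LabelEncoder instance is refit on every column, so the per-column mappings are not kept for later use.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Dummy data from the question.
df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['champ', 'ron', 'brick', 'champ', 'veronica', 'ron'],
    'location': ['san_diego', 'new_york', 'new_york',
                 'san_diego', 'san_diego', 'new_york'],
})

# Each column is encoded independently; within a column the codes
# follow the sorted order of its unique labels.
encoded = df.apply(LabelEncoder().fit_transform)
print(encoded)
```

Because labels are numbered in sorted order per column, `cat`, `dog` and `monkey` become 0, 1 and 2, while `new_york` and `san_diego` become 0 and 1.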
EDIT:
Since this answer was posted over a year ago and has generated many upvotes (including a bounty), I should probably extend it further.
For inverse_transform and transform, you have to do a little bit of a hack.
    from collections import defaultdict
    d = defaultdict(LabelEncoder)
With this, you now retain every column's LabelEncoder in a dictionary, keyed by column name.
    # Encode each column, fitting one LabelEncoder per column name
    fit = df.apply(lambda x: d[x.name].fit_transform(x))

    # Invert the encoding
    fit.apply(lambda x: d[x.name].inverse_transform(x))

    # Use the dictionary to label future data
    df.apply(lambda x: d[x.name].transform(x))
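Put together, the round trip can be sketched as follows. This is a self-contained illustration under a few assumptions: the names `restored`, `new_df` and `new_encoded` are hypothetical, and the "future data" is made up from labels that were already present during fitting.

```python
from collections import defaultdict

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['champ', 'ron', 'brick', 'champ', 'veronica', 'ron'],
    'location': ['san_diego', 'new_york', 'new_york',
                 'san_diego', 'san_diego', 'new_york'],
})

# One LabelEncoder per column, created on first access by column name.
d = defaultdict(LabelEncoder)

# Fit and encode every column.
fit = df.apply(lambda x: d[x.name].fit_transform(x))

# Decode back to the original strings.
restored = fit.apply(lambda x: d[x.name].inverse_transform(x))

# Encode new rows with the already-fitted encoders (illustrative data).
new_df = pd.DataFrame({
    'pets': ['dog', 'monkey'],
    'owner': ['ron', 'brick'],
    'location': ['new_york', 'san_diego'],
})
new_encoded = new_df.apply(lambda x: d[x.name].transform(x))
```

Keep in mind that `transform` raises a ValueError for any label the column's encoder did not see during fitting, so future data must only contain known labels.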