Python - pandas: groupby and unstack to create feature vector for classification
I have a pandas DataFrame displaying users' performance on test questions. It looks like this:
userid  questionid  correct
-------------------------------
1       1           1
1       5           1
1       6           0
1       8           0
1       10          1
2       3           1
2       5           1
2       6           0
.       .           .
.       .           .
.       .           .
I want to make a feature vector for each user saying whether or not they got each question right, which looks like this:
questionid  1    2    3    4    5    6    ...
userid
-------------------------------------------------
1           1    NaN  NaN  NaN  1    0    ...
2           NaN  NaN  1    NaN  1    0    ...
.           ...
.           ...
.
Each user gets shown only a subset of the questions, so it's a sparse matrix.
How can I make the above table in pandas?
I wanted to do something like the below, grouping by userid and questionid and then unstacking, but I'm not sure how it should work.
df = df.groupby(['user_id','question_id'])
df.unstack()
Thanks for the help.
You're looking for pivot:
In [11]: df.pivot(values='correct', index='userid', columns='questionid')
Out[11]:
questionid    1    3    5    6    8   10
userid
1             1  NaN    1    0    0    1
2           NaN    1    1    0  NaN  NaN
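For anyone who wants to reproduce this, here is a minimal self-contained sketch. The DataFrame is rebuilt from the rows visible in the question (the elided rows are left out), and the variable name `features` is just an illustrative choice.

```python
import pandas as pd

# Sample rows copied from the question's table (elided rows omitted).
df = pd.DataFrame({
    "userid":     [1, 1, 1, 1, 1, 2, 2, 2],
    "questionid": [1, 5, 6, 8, 10, 3, 5, 6],
    "correct":    [1, 1, 0, 0, 1, 1, 1, 0],
})

# One row per user, one column per question.
# Questions a user never saw come out as NaN.
features = df.pivot(index="userid", columns="questionid", values="correct")
print(features)
```

Because of the NaN entries the values are stored as floats, so recent pandas versions will likely print 1.0 and 0.0 rather than 1 and 0.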
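You might want to reindex the columns (based on the full set of questions) if you're not surjective, i.e. if some questions were never answered by any user and so have no column yet.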
In [12]: _.reindex_axis(np.arange(1, 10), 1)
Out[12]:
          1    2    3    4    5    6    7    8    9
userid
1         1  NaN  NaN  NaN    1    0  NaN    0  NaN
2       NaN  NaN    1  NaN    1    0  NaN  NaN  NaN
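`reindex_axis` has since been removed from pandas; `reindex(columns=...)` is the current way to do the same thing. A sketch of the equivalent call, assuming the full question set is ids 1 through 9 as in the `np.arange(1, 10)` above (the names `all_questions` and `full` are illustrative):

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question's table (elided rows omitted).
df = pd.DataFrame({
    "userid":     [1, 1, 1, 1, 1, 2, 2, 2],
    "questionid": [1, 5, 6, 8, 10, 3, 5, 6],
    "correct":    [1, 1, 0, 0, 1, 1, 1, 0],
})
features = df.pivot(index="userid", columns="questionid", values="correct")

# Assumed full set of question ids: 1..9 (np.arange excludes the stop value).
all_questions = np.arange(1, 10)

# Add an all-NaN column for every question nobody answered,
# and drop any column that is not in all_questions.
full = features.reindex(columns=all_questions)
print(full)
```

Note that `np.arange(1, 10)` stops at 9, so the column for question 10 is dropped; `np.arange(1, 11)` would keep it.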
Note: pivot_table has also been suggested as an answer. It applies an aggfunc to repeated values (by default the mean), which is not what you want here, as @u2ef1 points out. It offers some other additional features over pivot but is a little slower:
df.pivot_table(values='correct', index='userid', columns='questionid')
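To make the difference concrete, here is a small sketch on hypothetical data (not from the question) in which one user has two rows for the same question. As far as I can tell, pivot_table quietly averages them while pivot raises an error.

```python
import pandas as pd

# Hypothetical data: user 1 has two rows for question 5 (one right, one wrong).
dup = pd.DataFrame({
    "userid":     [1, 1, 2],
    "questionid": [5, 5, 5],
    "correct":    [1, 0, 1],
})

# pivot_table aggregates the repeated pair with the default aggfunc (mean),
# so user 1 ends up with 0.5 for question 5.
averaged = dup.pivot_table(values="correct", index="userid", columns="questionid")
print(averaged)

# pivot does not aggregate; a duplicate (userid, questionid) pair raises ValueError.
try:
    dup.pivot(index="userid", columns="questionid", values="correct")
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print("pivot raised:", err)
```

For a 0/1 feature vector the error from pivot is arguably the safer behaviour, since a value like 0.5 would otherwise slip into the features unnoticed.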
I have a feeling that in older versions of pandas, pivot was sensitive to NaN and you had to use pivot_table...
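For completeness, the unstack idea from the question can also be made to work. The snippet in the question fails because groupby on its own returns a GroupBy object, which has no unstack method. One way around that, sketched here as an alternative rather than as the recommended route, is to put both keys into the index and unstack the inner level:

```python
import pandas as pd

# Sample rows copied from the question's table (elided rows omitted).
df = pd.DataFrame({
    "userid":     [1, 1, 1, 1, 1, 2, 2, 2],
    "questionid": [1, 5, 6, 8, 10, 3, 5, 6],
    "correct":    [1, 1, 0, 0, 1, 1, 1, 0],
})

# Index by (userid, questionid), take the 'correct' column as a Series,
# then move the questionid level from the rows into the columns.
features = df.set_index(["userid", "questionid"])["correct"].unstack()
print(features)
```

This should give the same table as the pivot call above, and like pivot it raises an error if a (userid, questionid) pair appears more than once.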