python - Multiple column pandas vectorized string function?
Is there a way of querying a DataFrame for rows that contain a string in any column? Something like Series.str, except for a DataFrame? Here's what I have so far:
In [1]: import numpy as np; import pandas as pd

In [2]: s = "lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est"

In [3]: df = pd.DataFrame(np.array(s.split(' ')).reshape((-1, 4)), columns=['one', 'two', 'three', 'four'])

In [4]: df
Out[4]:
           one            two         three        four
0        lorem          ipsum         dolor         sit
1        amet,    consectetur   adipisicing       elit,
2          sed             do       eiusmod      tempor
3   incididunt             ut        labore          et
4       dolore          magna       aliqua.          ut
5         enim             ad         minim     veniam,
6         quis        nostrud  exercitation     ullamco
7      laboris           nisi            ut     aliquip
8           ex             ea       commodo  consequat.
9         duis           aute         irure       dolor
10          in  reprehenderit            in   voluptate
11       velit           esse        cillum      dolore
12          eu         fugiat         nulla   pariatur.
13   excepteur           sint      occaecat   cupidatat
14         non      proident,          sunt          in
15       culpa            qui       officia    deserunt
16      mollit           anim            id         est

[17 rows x 4 columns]

In [5]: mask = df['one'].str.contains('dolor') | df['two'].str.contains('dolor') | df['three'].str.contains('dolor') | df['four'].str.contains('dolor')

In [6]: df[mask]
Out[6]:
       one    two    three    four
0    lorem  ipsum    dolor     sit
4   dolore  magna  aliqua.      ut
9     duis   aute    irure   dolor
11   velit   esse   cillum  dolore

[4 rows x 4 columns]
Ideally, I'd like to replace the last 2 lines with something similar to this:
df[df.ix[:, 'one':'four'].str.contains('dolor')]
Is this possible?
pandas does not have DataFrame.str methods (at least not yet). However, you could use
import numpy as np
mask = np.logical_or.reduce([df[col].str.contains('dolor') for col in df.loc[:, 'one':'four'].columns])
This is a little less writing, and a bit quicker, than
mask = df['one'].str.contains('dolor') | df['two'].str.contains('dolor') | df['three'].str.contains('dolor') | df['four'].str.contains('dolor')
In [29]: %timeit mask = np.logical_or.reduce([df[col].str.contains('dolor') for col in df.loc[:, 'one':'four'].columns]); df[mask]
1000 loops, best of 3: 761 µs per loop

In [30]: %timeit mask = df['one'].str.contains('dolor') | df['two'].str.contains('dolor') | df['three'].str.contains('dolor') | df['four'].str.contains('dolor'); df[mask]
1000 loops, best of 3: 1.13 ms per loop
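For anyone who wants to try this outside an IPython session, here is a minimal self-contained sketch of the np.logical_or.reduce approach. It assumes the sample text is the usual 68-word lorem ipsum fragment, so it reshapes into 17 rows of 4. `contains_any` is a hypothetical helper name used only for illustration, not a pandas function. The `apply`/`any` line is an alternative spelling that is not part of the answer above and is included only as a cross-check.

```python
import numpy as np
import pandas as pd

# Sample text assumed to be the 68-word lorem ipsum fragment (17 rows x 4 columns).
s = (
    "lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod "
    "tempor incididunt ut labore et dolore magna aliqua. ut enim ad minim "
    "veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea "
    "commodo consequat. duis aute irure dolor in reprehenderit in voluptate "
    "velit esse cillum dolore eu fugiat nulla pariatur. excepteur sint "
    "occaecat cupidatat non proident, sunt in culpa qui officia deserunt "
    "mollit anim id est"
)
df = pd.DataFrame(np.array(s.split(' ')).reshape((-1, 4)),
                  columns=['one', 'two', 'three', 'four'])


def contains_any(frame, columns, pattern):
    """Hypothetical helper: True for rows where any listed column matches pattern."""
    # One boolean Series per column, OR-ed together element-wise.
    masks = [frame[col].str.contains(pattern) for col in columns]
    return pd.Series(np.logical_or.reduce(masks), index=frame.index)


mask = contains_any(df, ['one', 'two', 'three', 'four'], 'dolor')

# Alternative (not from the answer): apply str.contains column-wise, then any().
alt_mask = df.apply(lambda col: col.str.contains('dolor')).any(axis=1)

print(df[mask])
```

Wrapping the result in a Series keeps the original index, which is handy if the mask is reused later. The timings above were measured on the plain reduce version, so treat any speed difference for this wrapper or the `apply` variant as untested.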