hadoop - Flattening tuples in Pig


I have data in the following format:

(id, description)

1, xyz is something. abc bcd & so on.

1, xyz is something. abc xyz & so on.

2, abc is something. abc xyz & so on.

I need the output in this format:

id, word

I tried this:

a = load './data.txt' using PigStorage(',') as (id:int, desc:chararray);

b = foreach a generate id, flatten(STRSPLIT(desc, '[,?:;\s]'));

This results in output such as this:

1, xyz, is, something, abc, bcd, so, on

What I want is:

1, xyz

1, is

1, something

etc.

How can I do this in Pig (without writing a UDF)?

PS: I also tried:

b = foreach a generate id, flatten(datafu.pig.util.TransposeTupleToBag(STRSPLIT(desc, '[.&,?:;\s]')));

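You can use TOKENIZE in Pig. STRSPLIT returns a tuple, which FLATTEN expands into extra columns; TOKENIZE returns a bag, which FLATTEN expands into one row per word. Please find the answer below.

Here is the input file: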

cat file1

1,xyz is something

2,abc is something

a = load 'file1' using PigStorage(',');

b = foreach a generate $0, flatten(TOKENIZE($1));

dump b;

(1,xyz)

(1,is)

(1,something)

(2,abc)

(2,is)

(2,something)
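
One caveat for the question's data: by default TOKENIZE splits only on spaces, double quotes, commas, parentheses and asterisks, so punctuation such as `.` and `&` would stay in the output as part of a word or as its own token. A minimal sketch of one way around that, assuming it is acceptable to turn that punctuation into spaces first with the built-in REPLACE. The `word` alias and the exact character class are illustrative choices, not something from the original post:

```pig
-- Load the question's file with the schema the question describes.
a = load './data.txt' using PigStorage(',') as (id:int, desc:chararray);

-- Replace the unwanted punctuation with spaces, then tokenize.
-- TOKENIZE returns a bag, so FLATTEN emits one (id, word) row per token.
b = foreach a generate id, flatten(TOKENIZE(REPLACE(desc, '[.&?:;]', ' '))) as word;

dump b;
```

Because the splitting itself is still done by TOKENIZE, runs of spaces left behind by REPLACE should not produce empty words.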

