How do I store data in multiple, partitioned files on HDFS using Pig?
I've got a Pig job that analyzes a large number of log files and generates a relationship between a group of attributes and a bag of IDs that have those attributes. I'd like to store that relationship on HDFS, and I'd like to do it in a way that is friendly for other Hive/Pig/MapReduce jobs to operate on the data, or on subsets of the data, without having to ingest the full output of my Pig job, since that is a significant amount of data.
For example, if the schema of the relationship is something like:
relation: {group: (attr1: long,attr2: chararray,attr3: chararray),ids: {(id: chararray)}}
I'd like to be able to partition the data, storing it in a file structure that looks like:
/results/attr1/attr2/attr3/file(s)
where the attrX values in the path are the values from the group, and the file(s) contain only the IDs. This would allow me to subset the data for subsequent analysis without duplicating it.
Is such a thing possible, perhaps with a custom StoreFunc? Is there a different approach I should be taking to accomplish this goal?
I'm pretty new to Pig, so any general suggestions about my approach would be appreciated.
Thanks in advance.
MultiStorage wasn't a perfect fit for what I was trying to do, but it proved to be a good example of how to write a custom StoreFunc that writes multiple, partitioned output files. I downloaded the Pig source code and created my own storage function, which parsed the group tuple, using each of its items to build the HDFS path, and then parsed the bag of IDs, writing one ID per line to the result file.
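In outline, the storage function looked something like the sketch below. This is a minimal reconstruction rather than the exact code: the class names (PartitionedIdStorage, PartitionedOutputFormat, PartitionedRecordWriter) are hypothetical, it assumes the exact schema from the question, and for brevity it writes result files directly rather than through the task's output committer the way MultiStorage does, so a real version would need extra care around task retries and speculative execution.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import org.apache.pig.StoreFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

// Hypothetical StoreFunc that routes each group's ids into a file under
// <output>/<attr1>/<attr2>/<attr3>/ -- a sketch, not production code.
public class PartitionedIdStorage extends StoreFunc {

    private RecordWriter<Text, NullWritable> writer;

    @Override
    public OutputFormat getOutputFormat() {
        return new PartitionedOutputFormat();
    }

    @Override
    public void setStoreLocation(String location, Job job) throws IOException {
        FileOutputFormat.setOutputPath(job, new Path(location));
    }

    @Override
    @SuppressWarnings("unchecked")
    public void prepareToWrite(RecordWriter writer) {
        this.writer = writer;
    }

    @Override
    public void putNext(Tuple t) throws IOException {
        try {
            // Field 0 is the group tuple (attr1, attr2, attr3); field 1 is the bag of ids.
            Tuple group = (Tuple) t.get(0);
            DataBag ids = (DataBag) t.get(1);
            String partition = group.get(0) + "/" + group.get(1) + "/" + group.get(2);
            for (Tuple idTuple : ids) {
                // Pack the partition path in front of the id; the record writer
                // splits them back apart and routes each id to the right file.
                // (Assumes neither the attributes nor the ids contain tabs.)
                writer.write(new Text(partition + "\t" + idTuple.get(0)), NullWritable.get());
            }
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }

    // OutputFormat that hands every record to a partition-aware writer.
    public static class PartitionedOutputFormat extends TextOutputFormat<Text, NullWritable> {
        @Override
        public RecordWriter<Text, NullWritable> getRecordWriter(TaskAttemptContext ctx) throws IOException {
            return new PartitionedRecordWriter(ctx);
        }
    }

    // Keeps one open stream per partition, lazily creating
    // <output>/<attr1>/<attr2>/<attr3>/part-<taskid> as new partitions appear.
    public static class PartitionedRecordWriter extends RecordWriter<Text, NullWritable> {
        private final TaskAttemptContext ctx;
        private final Map<String, FSDataOutputStream> streams = new HashMap<String, FSDataOutputStream>();

        PartitionedRecordWriter(TaskAttemptContext ctx) {
            this.ctx = ctx;
        }

        @Override
        public void write(Text key, NullWritable ignored) throws IOException {
            String[] parts = key.toString().split("\t", 2);
            FSDataOutputStream out = streams.get(parts[0]);
            if (out == null) {
                Path dir = new Path(FileOutputFormat.getOutputPath(ctx), parts[0]);
                Path file = new Path(dir, "part-" + ctx.getTaskAttemptID().getTaskID().getId());
                out = file.getFileSystem(ctx.getConfiguration()).create(file, false);
                streams.put(parts[0], out);
            }
            out.writeBytes(parts[1] + "\n"); // one id per line
        }

        @Override
        public void close(TaskAttemptContext context) throws IOException {
            for (FSDataOutputStream out : streams.values()) {
                out.close();
            }
        }
    }
}

With the jar registered in the script, storing the relation is then just STORE relation INTO '/results' USING PartitionedIdStorage(); and each group's IDs land in their own file under /results/attr1/attr2/attr3/.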