How do I store data in multiple, partitioned files on HDFS using Pig?
I've got a Pig job that analyzes a large number of log files and generates a relationship between a group of attributes and a bag of IDs that have those attributes. I'd like to store that relationship on HDFS, and I'd like to do it in a way that is friendly for other Hive/Pig/MapReduce jobs to operate on the data, or on subsets of the data, without having to ingest the full output of my Pig job, since that is a significant amount of data.
For example, if the schema of the relationship is something like:
relation: {group: (attr1: long,attr2: chararray,attr3: chararray),ids: {(id: chararray)}}
I'd like to be able to partition the data, storing it in a file structure that looks like:
/results/attr1/attr2/attr3/file(s)
where the attrX values in the path are the values from the group, and the file(s) contain only the IDs. This would allow me to subset the data for subsequent analysis without duplicating it.
Is such a thing possible, perhaps with a custom StoreFunc? Is there a different approach I should be taking to accomplish this goal?
I'm pretty new to Pig, so any general suggestions about my approach would be appreciated.
Thanks in advance.
MultiStorage wasn't a perfect fit for what I was trying to do, but it proved to be a good example of how to write a custom StoreFunc that writes multiple, partitioned output files. I downloaded the Pig source code and created my own storage function, which parsed the group tuple, using each of its items to build the HDFS path, and then parsed the bag of IDs, writing one ID per line to the result file.
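In outline, the storage function looked something like the sketch below. This is a minimal reconstruction rather than the exact code: the class names (PartitionedIdStorage, PartitionedOutputFormat, PartitionedRecordWriter) are hypothetical, it assumes the exact schema from the question, and for brevity it writes result files directly rather than through the task's output committer the way MultiStorage does, so a real version would need extra care around task retries and speculative execution.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import org.apache.pig.StoreFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

// Hypothetical StoreFunc that routes each group's ids into a file under
// <output>/<attr1>/<attr2>/<attr3>/ -- a sketch, not production code.
public class PartitionedIdStorage extends StoreFunc {

    private RecordWriter<Text, NullWritable> writer;

    @Override
    public OutputFormat getOutputFormat() {
        return new PartitionedOutputFormat();
    }

    @Override
    public void setStoreLocation(String location, Job job) throws IOException {
        FileOutputFormat.setOutputPath(job, new Path(location));
    }

    @Override
    @SuppressWarnings("unchecked")
    public void prepareToWrite(RecordWriter writer) {
        this.writer = writer;
    }

    @Override
    public void putNext(Tuple t) throws IOException {
        try {
            // Field 0 is the group tuple (attr1, attr2, attr3); field 1 is the bag of ids.
            Tuple group = (Tuple) t.get(0);
            DataBag ids = (DataBag) t.get(1);
            String partition = group.get(0) + "/" + group.get(1) + "/" + group.get(2);
            for (Tuple idTuple : ids) {
                // Pack the partition path in front of the id; the record writer
                // splits them back apart and routes each id to the right file.
                // (Assumes neither the attributes nor the ids contain tabs.)
                writer.write(new Text(partition + "\t" + idTuple.get(0)), NullWritable.get());
            }
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }

    // OutputFormat that hands every record to a partition-aware writer.
    public static class PartitionedOutputFormat extends TextOutputFormat<Text, NullWritable> {
        @Override
        public RecordWriter<Text, NullWritable> getRecordWriter(TaskAttemptContext ctx) throws IOException {
            return new PartitionedRecordWriter(ctx);
        }
    }

    // Keeps one open stream per partition, lazily creating
    // <output>/<attr1>/<attr2>/<attr3>/part-<taskid> as new partitions appear.
    public static class PartitionedRecordWriter extends RecordWriter<Text, NullWritable> {
        private final TaskAttemptContext ctx;
        private final Map<String, FSDataOutputStream> streams = new HashMap<String, FSDataOutputStream>();

        PartitionedRecordWriter(TaskAttemptContext ctx) {
            this.ctx = ctx;
        }

        @Override
        public void write(Text key, NullWritable ignored) throws IOException {
            String[] parts = key.toString().split("\t", 2);
            FSDataOutputStream out = streams.get(parts[0]);
            if (out == null) {
                Path dir = new Path(FileOutputFormat.getOutputPath(ctx), parts[0]);
                Path file = new Path(dir, "part-" + ctx.getTaskAttemptID().getTaskID().getId());
                out = file.getFileSystem(ctx.getConfiguration()).create(file, false);
                streams.put(parts[0], out);
            }
            out.writeBytes(parts[1] + "\n"); // one id per line
        }

        @Override
        public void close(TaskAttemptContext context) throws IOException {
            for (FSDataOutputStream out : streams.values()) {
                out.close();
            }
        }
    }
}

With the jar registered in the script, storing the relation is then just STORE relation INTO '/results' USING PartitionedIdStorage(); and each group's IDs land in their own file under /results/attr1/attr2/attr3/.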