unix - awk count and sum based on slab: -
would extract lines first file (gunzip *.gz i.e input.csv.gz), if first file 4th field falls within range of second file (slab.csv) first field (start range) , second field (end range) populate slab wise count of rows , sum of 4th , 5th field of first file.
input.csv.gz (gunzip)
desc,date,zone,duration,calls ab,01-06-2014,xyz,450,3 ab,01-06-2014,xyz,642,3 ab,01-06-2014,xyz,0,0 ab,01-06-2014,xyz,205,3 ab,01-06-2014,xyz,98,1 ab,01-06-2014,xyz,455,1 ab,01-06-2014,xyz,120,1 ab,01-06-2014,xyz,0,0 ab,01-06-2014,xyz,193,1 ab,01-06-2014,xyz,0,0 ab,01-06-2014,xyz,161,2 slab.csv
startrange,endrange 0,0 1,10 11,100 101,200 201,300 301,400 401,500 501,10000 expected output:
startrange,endrange,count,sum-4,sum-5 0,0,3,0,0 1,10,notfound,notfound,notfound 11,100,1,98,1 101,200,3,474,4 201,300,1,205,3 301,400,notfound,notfound,notfound 401,500,2,905,4 501,10000,1,642,3 i using below 2 commands above output , expect "notfound"cases .
awk -f, 'nr==fnr{s[nr]=$1;e[nr]=$2;c[nr]=$0;n++;next} {for(i=1;i<=n;i++) if($4>=s[i]&&$4<=e[i]) {print $0,","c[i];break}}' slab.csv <(gzip -dc input.csv.gz) >op_step1.csv cat op_step1.csv | awk -f, '{key=$6","$7;++a[key];b[key]=b[key]+$4;c[key]=c[key]+$5} end{for(i in a)print i","a[i]","b[i]","c[i]}' >op_step2.csv op_step2.csv
101,200,3,474,4 501,10000,1,642,3 0,0,3,0,0 401,500,2,905,4 11,100,1,98,1 201,300,1,205,3 any suggestions make 1 liner command achieve expected output , don't have perl , python access.
here 1 way using awk , sort: 
awk ' begin {      fs = ofs = subsep = ",";     print "startrange,endrange,count,sum-4,sum-5"  }  fnr == 1 { next } nr == fnr {     ranges[$1,$2]++;     next } {     (range in ranges) {         split(range, tmp, subsep);          if ($4 >= tmp[1] && $4 <= tmp[2]) {             count[range]++;             sum4[range]+=$4;             sum5[range]+=$5;              next         }     } } end {     for(range in ranges)          print range, (count[range]?count[range]:"notfound"), (sum4[range]?sum4[range]:"notfound"), (sum5[range]?sum5[range]:"notfound") | "sort -t, -nk1,2" }' slab input startrange,endrange,count,sum-4,sum-5 0,0,3,notfound,notfound 1,10,notfound,notfound,notfound 11,100,1,98,1 101,200,3,474,4 201,300,1,205,3 301,400,notfound,notfound,notfound 401,500,2,905,4 501,10000,1,642,3 - set input, output field separators , subsep,. print header line.
- if first line skip it.
- load entire slab.txtin array calledranges.
- for every range in rangesarray, split field start , end range. if 4th column in range, increment count array , add valuesum4,sum5array appropriately.
- in endblock, iterate through ranges , print them.
- pipe output sortoutput in order.
Comments
Post a Comment