unix - awk count and sum based on slab: -
would extract lines first file (gunzip *.gz i.e input.csv.gz), if first file 4th field falls within range of second file (slab.csv) first field (start range) , second field (end range) populate slab wise count of rows , sum of 4th , 5th field of first file.
input.csv.gz (gunzip)
desc,date,zone,duration,calls ab,01-06-2014,xyz,450,3 ab,01-06-2014,xyz,642,3 ab,01-06-2014,xyz,0,0 ab,01-06-2014,xyz,205,3 ab,01-06-2014,xyz,98,1 ab,01-06-2014,xyz,455,1 ab,01-06-2014,xyz,120,1 ab,01-06-2014,xyz,0,0 ab,01-06-2014,xyz,193,1 ab,01-06-2014,xyz,0,0 ab,01-06-2014,xyz,161,2
slab.csv
startrange,endrange 0,0 1,10 11,100 101,200 201,300 301,400 401,500 501,10000
expected output:
startrange,endrange,count,sum-4,sum-5 0,0,3,0,0 1,10,notfound,notfound,notfound 11,100,1,98,1 101,200,3,474,4 201,300,1,205,3 301,400,notfound,notfound,notfound 401,500,2,905,4 501,10000,1,642,3
i using below 2 commands above output , expect "notfound"cases .
awk -f, 'nr==fnr{s[nr]=$1;e[nr]=$2;c[nr]=$0;n++;next} {for(i=1;i<=n;i++) if($4>=s[i]&&$4<=e[i]) {print $0,","c[i];break}}' slab.csv <(gzip -dc input.csv.gz) >op_step1.csv cat op_step1.csv | awk -f, '{key=$6","$7;++a[key];b[key]=b[key]+$4;c[key]=c[key]+$5} end{for(i in a)print i","a[i]","b[i]","c[i]}' >op_step2.csv
op_step2.csv
101,200,3,474,4 501,10000,1,642,3 0,0,3,0,0 401,500,2,905,4 11,100,1,98,1 201,300,1,205,3
any suggestions make 1 liner command achieve expected output , don't have perl , python access.
here 1 way using awk
, sort
:
awk ' begin { fs = ofs = subsep = ","; print "startrange,endrange,count,sum-4,sum-5" } fnr == 1 { next } nr == fnr { ranges[$1,$2]++; next } { (range in ranges) { split(range, tmp, subsep); if ($4 >= tmp[1] && $4 <= tmp[2]) { count[range]++; sum4[range]+=$4; sum5[range]+=$5; next } } } end { for(range in ranges) print range, (count[range]?count[range]:"notfound"), (sum4[range]?sum4[range]:"notfound"), (sum5[range]?sum5[range]:"notfound") | "sort -t, -nk1,2" }' slab input startrange,endrange,count,sum-4,sum-5 0,0,3,notfound,notfound 1,10,notfound,notfound,notfound 11,100,1,98,1 101,200,3,474,4 201,300,1,205,3 301,400,notfound,notfound,notfound 401,500,2,905,4 501,10000,1,642,3
- set input, output field separators ,
subsep
,
. print header line. - if first line skip it.
- load entire
slab.txt
in array calledranges
. - for every range in
ranges
array, split field start , end range. if 4th column in range, increment count array , add valuesum4
,sum5
array appropriately. - in
end
block, iterate through ranges , print them. - pipe output
sort
output in order.
Comments
Post a Comment