unix - awk count and sum based on slab: -


would extract lines first file (gunzip *.gz i.e input.csv.gz), if first file 4th field falls within range of second file (slab.csv) first field (start range) , second field (end range) populate slab wise count of rows , sum of 4th , 5th field of first file.

input.csv.gz (gunzip)

desc,date,zone,duration,calls ab,01-06-2014,xyz,450,3 ab,01-06-2014,xyz,642,3 ab,01-06-2014,xyz,0,0 ab,01-06-2014,xyz,205,3 ab,01-06-2014,xyz,98,1 ab,01-06-2014,xyz,455,1 ab,01-06-2014,xyz,120,1 ab,01-06-2014,xyz,0,0 ab,01-06-2014,xyz,193,1 ab,01-06-2014,xyz,0,0 ab,01-06-2014,xyz,161,2 

slab.csv

startrange,endrange 0,0 1,10 11,100 101,200 201,300 301,400 401,500 501,10000 

expected output:

startrange,endrange,count,sum-4,sum-5 0,0,3,0,0 1,10,notfound,notfound,notfound 11,100,1,98,1 101,200,3,474,4 201,300,1,205,3 301,400,notfound,notfound,notfound 401,500,2,905,4 501,10000,1,642,3 

i using below 2 commands above output , expect "notfound"cases .

awk -f, 'nr==fnr{s[nr]=$1;e[nr]=$2;c[nr]=$0;n++;next} {for(i=1;i<=n;i++) if($4>=s[i]&&$4<=e[i]) {print $0,","c[i];break}}' slab.csv <(gzip -dc input.csv.gz) >op_step1.csv cat op_step1.csv | awk -f, '{key=$6","$7;++a[key];b[key]=b[key]+$4;c[key]=c[key]+$5} end{for(i in a)print i","a[i]","b[i]","c[i]}' >op_step2.csv 

op_step2.csv

101,200,3,474,4 501,10000,1,642,3 0,0,3,0,0 401,500,2,905,4 11,100,1,98,1 201,300,1,205,3 

any suggestions make 1 liner command achieve expected output , don't have perl , python access.

here 1 way using awk , sort:

awk ' begin {      fs = ofs = subsep = ",";     print "startrange,endrange,count,sum-4,sum-5"  }  fnr == 1 { next } nr == fnr {     ranges[$1,$2]++;     next } {     (range in ranges) {         split(range, tmp, subsep);          if ($4 >= tmp[1] && $4 <= tmp[2]) {             count[range]++;             sum4[range]+=$4;             sum5[range]+=$5;              next         }     } } end {     for(range in ranges)          print range, (count[range]?count[range]:"notfound"), (sum4[range]?sum4[range]:"notfound"), (sum5[range]?sum5[range]:"notfound") | "sort -t, -nk1,2" }' slab input startrange,endrange,count,sum-4,sum-5 0,0,3,notfound,notfound 1,10,notfound,notfound,notfound 11,100,1,98,1 101,200,3,474,4 201,300,1,205,3 301,400,notfound,notfound,notfound 401,500,2,905,4 501,10000,1,642,3 
  • set input, output field separators , subsep ,. print header line.
  • if first line skip it.
  • load entire slab.txt in array called ranges.
  • for every range in ranges array, split field start , end range. if 4th column in range, increment count array , add value sum4 , sum5 array appropriately.
  • in end block, iterate through ranges , print them.
  • pipe output sort output in order.

Comments

Popular posts from this blog

javascript - RequestAnimationFrame not working when exiting fullscreen switching space on Safari -

linux - phpmyadmin, neginx error.log - Check group www-data has read access and open_basedir -