Splitting a large, complex one column file into several columns with awk


I have a text file produced by some commercial software that looks like the sample below. It consists of sections delimited by brackets, each of which contains several million elements; the exact count changes from one case to another.

(1  2  3 ... ) (11 22 33 ... ) (111 222 333 ... ) 

I need to achieve an output like:

 1;  11;   111
 2;  22;   222
 3;  33;   333
...  ...  ...

I found a complicated way that is:

  • perform sed operations to get

    1 2 3 ... # 11 22 33 ... # 111 222 333 ... 
  • use awk as follows to split my file in several sub-files

    awk -v RS="#" '{print > ("splitted-" NR ".txt")}' 
  • remove white spaces from my subfiles again with sed

    sed -i '/^[[:space:]]*$/d' splitted*.txt 
  • join everything together:

    paste splitted*.txt > out.txt 
  • add a field separator (defined in my bash script)

    awk -v sep="$my_sep" 'BEGIN{OFS=sep}{$1=$1; print }' out.txt > formatted.txt 

I feel this is clumsy, as I loop over millions of lines several times. Even though the run time is quite OK (~80 sec), I'd like to find a pure awk solution but can't work it out. Something like:

awk 'BEGIN{RS="(\\n)"; OFS=";"} { print something } ' 

I found some related questions, especially this one, "row to column conversion with awk", but it assumes a constant number of lines between brackets, which I can't rely on.

Any help would be appreciated.

 


With GNU awk for multi-char RS and true multidimensional arrays:

$ cat tst.awk
BEGIN {
    RS  = "(\\s*[()]\\s*)+"
    OFS = ";"
}
NR>1 {
    cell[NR][1]
    split($0,cell[NR])
}
END {
    for (rowNr=1; rowNr<=NF; rowNr++) {
        for (colNr=2; colNr<=NR; colNr++) {
            printf "%6s%s", cell[colNr][rowNr], (colNr<NR ? OFS : ORS)
        }
    }
}

$ awk -f tst.awk file
     1;    11;   111
     2;    22;   222
     3;    33;   333
   ...;   ...;   ...
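If GNU awk is not available, the same idea can be sketched with only POSIX awk features. This is a rough variant, not a tested drop-in for the real data: it replaces the regexp RS with per-line tokenising and the true 2D array with the classic `cell[col, row]` subscript. The file name `sample.txt`, the sample values, and the variable names `ncols`, `nrows` and `maxrows` are made up for illustration; `my_sep` and `formatted.txt` are taken from the question. It assumes values are separated by whitespace, and it should not matter whether the brackets sit on their own lines or next to a value.

```shell
# Tiny stand-in for the real file (hypothetical sample data).
printf '%s\n' '(' 1 2 3 ')' '(' 11 22 33 ')' '(' 111 222 333 ')' > sample.txt

# Separator defined in the calling bash script, as in the question.
my_sep=';'

awk -v sep="$my_sep" '
{
    # Put spaces around every bracket so each one becomes its own field.
    gsub(/[()]/, " & ")
    for (i = 1; i <= NF; i++) {
        if ($i == "(") {
            ncols++                     # an opening bracket starts a new column
        } else if ($i != ")") {
            cell[ncols, ++nrows[ncols]] = $i
            if (nrows[ncols] > maxrows) maxrows = nrows[ncols]
        }
    }
}
END {
    # Emit row by row; a section shorter than the longest one yields empty cells.
    for (r = 1; r <= maxrows; r++) {
        line = ""
        for (c = 1; c <= ncols; c++)
            line = line (c > 1 ? sep : "") cell[c, r]
        print line
    }
}' sample.txt > formatted.txt

cat formatted.txt
```

On the sample above this prints `1;11;111`, `2;22;222` and `3;33;333`, one per line, without the `%6s` padding used in tst.awk. Like the gawk version, it keeps every value in memory until the END block, so whether it copes with several million elements per section depends on the available RAM.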
