Speeding up separation of large text file based on line content in Bash


I have a very large text file (around 20 GB and 300 million lines), which contains three columns separated by tabs:

word1 word2 word3
word1 word2 word3
word1 word2 word3
word1 word2 word3

word1, word2, and word3 are different in each line. word3 specifies the class of the line, and repeats often for different lines (having thousands of different values). The goal is to separate the file by the line class (word3). I.e. word1 and word2 should be stored in a file called word3, for all the lines. For example, for the line:

a b c 

the string "a b" should be appended to the file called c.
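To make the expected result concrete, here is a tiny hypothetical run (the one-line input is made up for illustration; the real file is tab-separated just like this):

$ printf 'a\tb\tc\n' > inputfile   # hypothetical one-line input
$ # ... run the separation (e.g. the while loop below, or the awk from the answer) ...
$ cat c
a b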

Now I know how this can be done with a while loop, reading the file line by line and appending to the proper file for each line:

while IFS='' read -r line || [[ -n "$line" ]]; do
    # Variables
    read -a line_array <<< ${line}
    word1=${line_array[0]}
    word2=${line_array[1]}
    word3=${line_array[2]}

    # Adding word1 and word2 to file word3
    echo "${word1} ${word2}" >> ${word3}
done < "inputfile"

It works, but it is very slow (even though I have a fast workstation with an SSD). How can this be sped up? I have already tried carrying out this procedure in /dev/shm, and I split the file into 10 pieces and ran the above script in parallel on each piece. But it is still quite slow. Is there a way to speed this up further?
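For reference, the split-and-parallelize attempt looked roughly like this; this is only a minimal sketch, where the chunk_ prefix and the assumption that the loop above is saved as separate.sh (reading from "$1" instead of the hard-coded inputfile) are illustrative:

$ split -n l/10 -d inputfile chunk_        # 10 pieces, split on line boundaries (GNU split)
$ for f in chunk_*; do ./separate.sh "$f" & done
$ wait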

 


Let's generate an example file:

$ seq -f "%.0f" 3000000 | awk -F $'\t' '{print $1 FS "Col_B" FS int(2000*rand())}' >file 

That generates a 3 million line file similar to this:

$ head -n 3 file; echo "..."; tail -n 3 file
1       Col_B   1680
2       Col_B   788
3       Col_B   1566
...
2999998 Col_B   1562
2999999 Col_B   1803
3000000 Col_B   1252

With a simple awk you can generate the files you describe this way:

$ time awk -F $'\t' '{ print $1 " " $2 >> $3; close($3) }' file

real    3m31.011s
user    0m25.260s
sys     3m0.994s

That awk generates the 2,000 group files in about 3 minutes 31 seconds. Certainly faster than Bash, but it can be made faster still by presorting the file on the third column and writing each group file in one go.

You can use the Unix sort utility in a pipe and feed the output to a script that separates the sorted groups into different files. Use the -s (stable) option with sort, so that the value of the third field is the only thing that changes the relative order of the lines.
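A quick illustration of what -s buys you (GNU sort assumed; both lines carry the same third field, so a stable sort keeps them in input order, whereas the default last-resort comparison on the whole line would swap them):

$ printf '3\tB\tx\n1\tA\tx\n' | sort -s -k3
3   B   x
1   A   x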

Since we can assume sort has partitioned the file into groups based on column 3, the script only needs to detect when that value changes:

$ time sort -s -k3 file | awk -F $'\t' 'fn != ($3 "") { close(fn); fn = $3 } { print $1 " " $2 > fn }'

real    0m4.727s
user    0m5.495s
sys     0m0.541s

Because of the efficiency gained by presorting, the same net process completes in 5 seconds.
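For readability, here is the same awk program spread over several lines with comments; it is functionally identical to the one-liner above:

sort -s -k3 file | awk -F $'\t' '
  fn != ($3 "") {        # the third column changed ("" forces a string comparison)
      close(fn)          # finish the previous group file
      fn = $3            # the new value is also the output file name
  }
  { print $1 " " $2 > fn }   # write the first two columns to the current group file
'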

If you are sure that the 'words' in column 3 are ASCII only (i.e., you do not need to deal with UTF-8), you can set LC_ALL=C for additional speed:

$ time LC_ALL=C sort -s -k3 file | awk -F $'\t' 'fn != ($3 "") { close(fn); fn = $3 } { print $1 " " $2 > fn }'

real    0m3.801s
user    0m3.796s
sys     0m0.479s
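Applied to the original 20 GB file (called inputfile in the question's loop), the whole job is a single pipeline. For input of that size GNU sort spills to temporary files, so the optional -T flag can point them at a fast disk; the /ssd/tmp directory below is illustrative, and LC_ALL=C is only safe if column 3 is ASCII, as noted above:

$ LC_ALL=C sort -T /ssd/tmp -s -k3 inputfile | awk -F $'\t' 'fn != ($3 "") { close(fn); fn = $3 } { print $1 " " $2 > fn }'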
