Working with large files (bash)

24 Jun 2019

to count the number of rows in a file:


wc -l largefile.csv
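one caveat: wc -l counts every line, including the header row if the file has one. A quick sketch against a small made-up sample file (the filename and contents are just for illustration):

```shell
# create a tiny sample file: a header plus 3 data rows
printf 'id|name\n1|a\n2|b\n3|c\n' > sample.csv

# wc -l counts all lines, header included:
wc -l < sample.csv

# subtract 1 to count data rows only:
echo $(( $(wc -l < sample.csv) - 1 ))
```

reading from stdin (`wc -l < file`) also has the side benefit of printing just the number, without the filename.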

to see the first 5 rows of a file:


head -5 largefile.csv

to filter rows by a certain column (e.g. column 5 >= 20 and != 255) and write the result to another file:


awk '$5 >= 20 && $5 != 255{print $0}' largefile.csv > output.csv
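one caveat here: without an -F flag, awk splits fields on whitespace, so for a genuinely comma-separated file you probably want -F ','. A sketch against a small made-up sample file (filename and values are assumptions):

```shell
# sample comma-separated file: 3 rows, numeric second column
printf 'a,10,x\nb,25,y\nc,255,z\n' > sample.csv

# -F ',' makes $2 refer to the second comma-separated column;
# keep rows where it is at least 20 but not 255
awk -F ',' '$2 >= 20 && $2 != 255 {print $0}' sample.csv > output.csv

cat output.csv   # b,25,y
```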

to filter rows by a certain column (numerical only - e.g. col 3 == 99) in a pipe-delimited file and write the result, with header, to another file:


head -1 largefile.csv > output.csv
awk -F "|" '$3 == 99 {print $0}' largefile.csv >> output.csv
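the head-then-awk pair above can also be collapsed into a single awk pass: NR==1 matches the header row, and a pattern with no action prints the line. A sketch with a hypothetical sample file:

```shell
# sample pipe-delimited file: header plus 3 data rows
printf 'a|b|c\n1|2|99\n3|4|50\n5|6|99\n' > sample.csv

# keep the header (NR==1) plus any data row where column 3 equals 99
awk -F '|' 'NR==1 || $3 == 99' sample.csv > output.csv

cat output.csv
# a|b|c
# 1|2|99
# 5|6|99
```

reading the large file once instead of twice also matters when the file is big.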

to filter rows by a certain column containing a string (e.g. col 1 contains 'dec') in a pipe-delimited file and write the result, with header, to another file:


head -1 largefile.csv > output.csv
awk -F "|" 'match($1,/dec/) {print $0}' largefile.csv >> output.csv
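note that match() does a substring match, not an equality test, so a value like "decade" would also pass. If you want column 1 to be exactly "dec", compare with == instead. A sketch with a made-up sample file:

```shell
# sample file where one value merely contains "dec"
printf 'month|val\ndec|1\ndecade|2\nnov|3\n' > sample.csv

# match($1, /dec/) is a substring match: "decade" passes too
awk -F '|' 'match($1, /dec/)' sample.csv
# dec|1
# decade|2

# exact match on column 1:
awk -F '|' '$1 == "dec"' sample.csv > output.csv
cat output.csv   # dec|1
```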

to filter rows containing a string anywhere on the line and write the result, with header, to another file:


head -1 largefile.csv > output.csv
awk '/dec/' largefile.csv >> output.csv
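the same head-then-filter pattern can also be written with grep, grouping the two commands so a single redirect captures both. A sketch with an assumed sample file:

```shell
# sample file: header plus 2 data rows
printf 'month|val\ndec|1\nnov|3\n' > sample.csv

# grep prints every line containing the string, like awk '/dec/';
# the { ...; } group lets one redirect capture header and matches together
{ head -1 sample.csv; grep 'dec' sample.csv; } > output.csv

cat output.csv
# month|val
# dec|1
```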

modified from https://stackoverflow.com/questions/29503699/filtering-a-csv-file-with-awk

also, check out https://en.wikibooks.org/wiki/An_Awk_Primer/Awk_Command-Line_Examples for more details.

https://www.theurbanpenguin.com/filtering-with-awk/ is also very helpful!!