I recently received a blob of data which needed some alteration to properly suit the CSV format. I found that the data file ‘xyz.dat’ contained close to 600,000 rows so writing a VBA script in Excel to format the data was not an option. I opted to do it using some plain commands in Unix. I had to attain the following purpose:
1. Remove all rows with blank lines.
2. Remove all rows that didn’t start with the number ‘1’
3. Trim the file to 1000 rows to sample the data.
I executed the following commands to attain the above objectives.
To remove all blank lines I used this command. I am basically asking grep to treat the text as ASCII text and using a regular expression to identify blank rows:
cat xyz.dat | grep -a -v -e '^[[:space:]]*$' > xyz_no_space.dat
To remove all rows that don’t start with ‘1’ I used this command. This command is similar to above command except for the regular expression to identify rows that start with ‘1’.
cat xyz_no_space.dat | grep -a '^1' > xyz_no_space_start_with_1.dat
At this point we have the data we want but for sampling purpose I need to copy first 1000 rows into another file. This command uses ‘sed’ to do it.
sed -n -e '1,1000p' xyz_no_space_start_with_1.dat > xyz_trimmed_1_1000.dat
This command basically renames the *.dat file into *.csv for convenience.
mv xyz_trimmed_1_1000.dat xyz_trimmed_1_1000.csv
A summary of all commands I executed is listed below:
cat xyz.dat | grep -a -v -e '^[[:space:]]*$' > xyz_no_space.dat
cat xyz_no_space.dat | grep -a '^1' > xyz_no_space_start_with_1.dat
sed -n -e '1,1000p' xyz_no_space_start_with_1.dat > xyz_trimmed_1_1000.dat
mv xyz_trimmed_1_1000.dat xyz_trimmed_1_1000.csv