Monthly Archives: July 2015

Using grep, sed to filter data in Linux, Mac OS

I recently received a blob of data that needed some changes before it would fit the CSV format. The data file ‘xyz.dat’ contained close to 600,000 rows, so writing a VBA script in Excel to format it was not a practical option. I opted to do it with a few plain Unix commands instead. I needed to do three things:

1. Remove all blank rows.
2. Remove all rows that don’t start with the number ‘1’.
3. Trim the file to 1000 rows to sample the data.

I ran the following commands to get there.

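To remove all blank lines I used this command. The -a option tells grep to treat the file as text even if it looks like binary data, -v inverts the match so that matching rows are dropped, and the regular expression matches rows that are empty or contain only whitespace: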
cat xyz.dat | grep -a -v -e '^[[:space:]]*$' > xyz_no_space.dat
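
To see what the regular expression actually matches, here is a small sketch on a made-up file (sample_blank.dat and its contents are hypothetical, not part of the original data):

```shell
# Hypothetical sample: one empty row and one row holding only spaces and a tab
printf 'a,b\n\n  \t \nc,d\n' > sample_blank.dat

# -c counts the rows that match, i.e. the rows that would be dropped
grep -a -c -e '^[[:space:]]*$' sample_blank.dat

# Same filter as above with -v: keep everything that is not blank
grep -a -v -e '^[[:space:]]*$' sample_blank.dat > sample_no_blank.dat
cat sample_no_blank.dat
```

Counting first is a cheap way to check how many rows a filter will remove before running it on the real file.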

To remove all rows that don’t start with ‘1’ I used this command. It is similar to the previous one, except that there is no -v, so grep keeps the rows that match the regular expression ‘^1’, that is, rows whose first character is ‘1’.
cat xyz_no_space.dat | grep -a '^1' > xyz_no_space_start_with_1.dat

At this point we have the data we want, but for sampling purposes I needed to copy the first 1000 rows into another file. This command uses ‘sed’ to do it: -n suppresses the default output and ‘1,1000p’ prints only rows 1 to 1000.
sed -n -e '1,1000p' xyz_no_space_start_with_1.dat > xyz_trimmed_1_1000.dat
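
There are a couple of equivalent ways to take the first 1000 rows. A minimal sketch, assuming a made-up 1500-row file (sample_rows.dat is hypothetical); all three commands should produce identical output:

```shell
# Hypothetical sample with 1500 rows, each starting with 1
seq 1 1500 | sed 's/^/1,/' > sample_rows.dat

# Prints rows 1-1000 but still reads the rest of the file
sed -n -e '1,1000p' sample_rows.dat > sample_sed.dat

# Prints each row and quits right after row 1000
sed -e '1000q' sample_rows.dat > sample_quit.dat

# head does the same job
head -n 1000 sample_rows.dat > sample_head.dat

# Each output file should contain 1000 rows
wc -l sample_sed.dat sample_quit.dat sample_head.dat
```

For a file of this size the difference is negligible; the quitting variants only start to matter on much larger inputs.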

Finally, this command renames the .dat file to .csv for convenience.
mv xyz_trimmed_1_1000.dat xyz_trimmed_1_1000.csv

A summary of all commands I executed is listed below:
cat xyz.dat | grep -a -v -e '^[[:space:]]*$' > xyz_no_space.dat
cat xyz_no_space.dat | grep -a '^1' > xyz_no_space_start_with_1.dat
sed -n -e '1,1000p' xyz_no_space_start_with_1.dat > xyz_trimmed_1_1000.dat
mv xyz_trimmed_1_1000.dat xyz_trimmed_1_1000.csv
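
If the intermediate files are not needed, the same steps can be chained into a single pipeline. This is a sketch on a made-up input (sample.dat and its rows are hypothetical), not the commands I originally ran:

```shell
# Hypothetical sample standing in for xyz.dat
printf '1,alpha\n\n2,beta\n   \n1,gamma\n10,delta\n' > sample.dat

# Drop blank rows, keep rows starting with 1, keep at most the first 1000 rows
grep -a -v -e '^[[:space:]]*$' sample.dat \
  | grep -a '^1' \
  | sed -n -e '1,1000p' > sample_trimmed.csv

cat sample_trimmed.csv
```

Note that ‘^1’ keeps any row whose first character is ‘1’, so a row such as ‘10,delta’ passes the filter as well.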

Maven and Eclipse Issues

I use Maven for almost all the Java projects I work on. I prefer NetBeans with Maven, as it has solid Maven integration and I find it a hassle-free way of working without cluttering the project with ‘special’ project files. However, I have found that Eclipse cannot be ignored: in my experience it performs much faster and needs less memory than NetBeans. So here are some tricks to work around the build issues I have faced when combining Eclipse with Maven. The usual issue was that I could build my project from the console, but building from inside Eclipse produced a lot of build errors, which were logged in the Eclipse ‘Problems’ view.

Refresh Project On pom.xml change
If you have changed anything in pom.xml, you should refresh your Eclipse project. Eclipse may pick up the changes from pom.xml and resolve the build or dependency issues at this step itself.

Re-generate Eclipse Project Files
If you have added a dependency to pom.xml, run the following commands to regenerate the Eclipse project configuration files.

mvn eclipse:clean
This deletes the Eclipse .project and .classpath files and the .settings folder.

mvn eclipse:eclipse
This regenerates the .project and .classpath files and the .settings folder for your project.
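
Both goals can also be passed to a single Maven invocation. A minimal sketch, assuming mvn is on the PATH and the command is run from the directory that holds pom.xml; the function name regen_eclipse_files is made up for illustration, and the guard only exists so the snippet does nothing outside a Maven project:

```shell
# Hypothetical helper: regenerate the Eclipse files in one Maven run
regen_eclipse_files() {
  if command -v mvn >/dev/null 2>&1 && [ -f pom.xml ]; then
    # Goals run in the order given: clean first, then regenerate
    mvn eclipse:clean eclipse:eclipse
  else
    echo "mvn or pom.xml not found; nothing to regenerate"
  fi
}

regen_eclipse_files
```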

Once you have executed the above commands, go back to Eclipse and refresh the project, either by right-clicking it and choosing Refresh in the context menu or by selecting it and pressing the ‘F5’ key.

In my experience so far, Eclipse is then able to resolve all dependency and build issues, and you should have a trouble-free coding experience.