{"id":440,"date":"2015-07-18T07:49:19","date_gmt":"2015-07-18T02:19:19","guid":{"rendered":"http:\/\/www.cyberaka.com\/?p=440"},"modified":"2015-07-18T07:51:22","modified_gmt":"2015-07-18T02:21:22","slug":"440","status":"publish","type":"post","link":"https:\/\/www.cyberaka.com\/?p=440","title":{"rendered":"Using grep, sed to filter data in Linux, Mac OS"},"content":{"rendered":"<p>I recently received a blob of data which needed some alteration to properly suit the CSV format. I found that the data file &#8216;xyz.dat&#8217; contained close to 600,000 rows so writing a VBA script in Excel to format the data was not an option. I opted to do it using some plain commands in Unix. I had to attain the following purpose:<\/p>\n<p>1. Remove all rows with blank lines.<br \/>\n2. Remove all rows that didn&#8217;t start with the number &#8216;1&#8217;<br \/>\n3. Trim the file to 1000 rows to sample the data.<\/p>\n<p>I executed the following commands to attain the above objectives.<\/p>\n<p>To remove all blank lines I used this command. I am basically asking grep to treat the text as ASCII text and using a regular expression to identify blank rows:<br \/>\n<code>cat xyz.dat | grep -a -v -e '^[[:space:]]*$' &gt; xyz_no_space.dat<\/code><\/p>\n<p>To remove all rows that don&#8217;t start with &#8216;1&#8217; I used this command. This command is similar to above command except for the regular expression to identify rows that start with &#8216;1&#8217;.<br \/>\n<code>cat xyz_no_space.dat | grep -a '^1' &gt; xyz_no_space_start_with_1.dat<\/code><\/p>\n<p>At this point we have the data we want but for sampling purpose I need to copy first 1000 rows into another file. This command uses &#8216;sed&#8217; to do it.<br \/>\n<code>sed -n -e '1,1000p' xyz_no_space_start_with_1.dat &gt; xyz_trimmed_1_1000.dat<\/code><\/p>\n<p>This command basically renames the *.dat file into *.csv for convenience.<br \/>\n<code>mv xyz_trimmed_1_1000.dat xyz_trimmed_1_1000.csv<\/code><\/p>\n<p>A summary of all commands I executed is listed below:<code><br \/>\ncat xyz.dat | grep -a -v -e '^[[:space:]]*$' &gt; xyz_no_space.dat<br \/>\ncat xyz_no_space.dat | grep -a '^1' &gt; xyz_no_space_start_with_1.dat<br \/>\nsed -n -e '1,1000p' xyz_no_space_start_with_1.dat &gt; xyz_trimmed_1_1000.dat<br \/>\nmv xyz_trimmed_1_1000.dat xyz_trimmed_1_1000.csv<br \/>\n<\/code><\/p>\n","protected":false},"excerpt":{"rendered":"<p>I recently received a blob of data which needed some alteration to properly suit the CSV format. I found that the data file &#8216;xyz.dat&#8217; contained close to 600,000 rows so writing a VBA script in Excel to format the data was not an option. I opted to do it using some plain commands in Unix. [&hellip;]<\/p>\n","protected":false},"author":4,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[27,35,24],"tags":[],"class_list":["post-440","post","type-post","status-publish","format-standard","hentry","category-linux-os","category-mac-os","category-tips-and-tricks"],"_links":{"self":[{"href":"https:\/\/www.cyberaka.com\/index.php?rest_route=\/wp\/v2\/posts\/440","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.cyberaka.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.cyberaka.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.cyberaka.com\/index.php?rest_route=\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/www.cyberaka.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=440"}],"version-history":[{"count":3,"href":"https:\/\/www.cyberaka.com\/index.php?rest_route=\/wp\/v2\/posts\/440\/revisions"}],"predecessor-version":[{"id":443,"href":"https:\/\/www.cyberaka.com\/index.php?rest_route=\/wp\/v2\/posts\/440\/revisions\/443"}],"wp:attachment":[{"href":"https:\/\/www.cyberaka.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=440"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.cyberaka.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=440"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.cyberaka.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=440"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}