Splitting a file in linux based on content

Greenhorn picture Greenhorn · Dec 17, 2011 · Viewed 62k times · Source

I have an email dump of around 400mb. I want to split this into .txt files, consisting of one mail in each file. Every e-mail starts with the standard HTML header specifying the doctype.

This means I will have to split my files based on the above said header. How do I go about it in linux?

Answer

kev picture kev · Dec 17, 2011

If you have a mail.txt

$ cat mail.txt
<html>
    mail A
</html>

<html>
    mail B
</html>

<html>
    mail C
</html>

run csplit to split by <html>

$ csplit mail.txt '/^<html>$/' '{*}'

 - mail.txt    => input file
 - /^<html>$/  => pattern match every `<html>` line
 - {*}         => repeat the previous pattern as many times as possible

check output

$ ls
mail.txt  xx00  xx01  xx02  xx03

If you want do it in awk

$ awk '/<html>/{filename=NR".txt"}; {print >filename}' mail.txt
$ ls
1.txt  5.txt  9.txt  mail.txt