I have an email dump of around 400mb. I want to split this into .txt files, consisting of one mail in each file. Every e-mail starts with the standard HTML header specifying the doctype.
This means I will have to split my files based on the above said header. How do I go about it in linux?
If you have a mail.txt
$ cat mail.txt
<html>
mail A
</html>
<html>
mail B
</html>
<html>
mail C
</html>
run csplit
to split by <html>
$ csplit mail.txt '/^<html>$/' '{*}'
- mail.txt => input file
- /^<html>$/ => pattern match every `<html>` line
- {*} => repeat the previous pattern as many times as possible
check output
$ ls
mail.txt xx00 xx01 xx02 xx03
If you want do it in awk
$ awk '/<html>/{filename=NR".txt"}; {print >filename}' mail.txt
$ ls
1.txt 5.txt 9.txt mail.txt