I'm looking for some command-line tools for Linux that can help me detect and convert files from character sets like iso-8859-1 and windows-1252 to utf-8 and from Windows line endings to Unix line endings.
The reason I need this is that I'm working on projects on Linux servers via SFTP with editors on Windows (like Sublime Text) that constantly screw these things up. Right now I'm guessing about half my files are utf-8 and the rest are iso-8859-1 or windows-1252, as it seems Sublime Text picks the character set based on which symbols the file contains when I save it. The line endings are ALWAYS Windows line endings even though I've specified in the options that default line endings are LF, so about half of my files have LF and half have CRLF.
So I would need at least a tool that would recursively scan my project folder and alert me to files that deviate from utf-8 with LF line endings, so I could manually fix them before I commit my changes to Git.
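A minimal sketch of the kind of recursive check described above, assuming Python 3 is available on the server. The names `check_bytes`, `scan`, `report` and the `SKIP_DIRS` list are hypothetical, not part of any existing tool; a file counts as deviating if its bytes do not decode as UTF-8 or if it contains a CRLF sequence.

```python
import os

# Assumption: directories that should not be scanned.
SKIP_DIRS = {".git", "node_modules"}


def check_bytes(data):
    """Return a list of problems found in one file's raw bytes."""
    problems = []
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("not valid UTF-8")
    if b"\r\n" in data:
        problems.append("CRLF line endings")
    return problems


def scan(root):
    """Walk root recursively and yield (path, problems) for deviating files."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    problems = check_bytes(handle.read())
            except OSError:
                continue  # unreadable file or broken symlink: skip it
            if problems:
                yield path, problems


def report(root):
    """Print one line per deviating file and return how many were found."""
    count = 0
    for path, problems in scan(root):
        print(f"{path}: {', '.join(problems)}")
        count += 1
    return count
```

Calling `report(".")` from the project root would list the offending files. Note that binary files such as images will also be reported as not valid UTF-8, so a real version would probably need a filter on file extensions.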
Any comments and personal experiences on the topic would also be welcome.
Thanks
Edit: I have a temporary solution in place where I use tree and file to output information about every file in my project, but it's kinda wonky. If I don't include the -i option for file, then a lot of my files get different output like ASCII C++ program text, HTML document text, English text, etc.:
$ tree -f -i -a -I node_modules --noreport -n | xargs file | grep -v directory
./config.json: ASCII C++ program text
./debugserver.sh: ASCII text
./.gitignore: ASCII text, with no line terminators
./lib/config.js: ASCII text
./lib/database.js: ASCII text
./lib/get_input.js: ASCII text
./lib/models/stream.js: ASCII English text
./lib/serverconfig.js: ASCII text
./lib/server.js: ASCII text
./package.json: ASCII text
./public/index.html: HTML document text
./src/config.coffee: ASCII English text
./src/database.coffee: ASCII English text
./src/get_input.coffee: ASCII English text, with CRLF line terminators
./src/jtv.coffee: ASCII English text
./src/models/stream.coffee: ASCII English text
./src/server.coffee: ASCII text
./src/serverconfig.coffee: ASCII text
./testserver.sh: ASCII text
./vendor/minify.json.js: ASCII C++ program text, with CRLF line terminators
But if I do include -i, it doesn't show me line terminators:
$ tree -f -i -a -I node_modules --noreport -n | xargs file -i | grep -v directory
./config.json: text/x-c++; charset=us-ascii
./debugserver.sh: text/plain; charset=us-ascii
./.gitignore: text/plain; charset=us-ascii
./lib/config.js: text/plain; charset=us-ascii
./lib/database.js: text/plain; charset=us-ascii
./lib/get_input.js: text/plain; charset=us-ascii
./lib/models/stream.js: text/plain; charset=us-ascii
./lib/serverconfig.js: text/plain; charset=us-ascii
./lib/server.js: text/plain; charset=us-ascii
./package.json: text/plain; charset=us-ascii
./public/index.html: text/html; charset=us-ascii
./src/config.coffee: text/plain; charset=us-ascii
./src/database.coffee: text/plain; charset=us-ascii
./src/get_input.coffee: text/plain; charset=us-ascii
./src/jtv.coffee: text/plain; charset=us-ascii
./src/models/stream.coffee: text/plain; charset=us-ascii
./src/server.coffee: text/plain; charset=us-ascii
./src/serverconfig.coffee: text/plain; charset=us-ascii
./testserver.sh: text/plain; charset=us-ascii
./vendor/minify.json.js: text/x-c++; charset=us-ascii
Also, why does it display charset=us-ascii and not utf-8? And what's text/x-c++? Is there a way I could output only charset=utf-8 and line-terminators=LF for each file?
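On the us-ascii question: ASCII is a subset of UTF-8, so a file that contains only bytes below 0x80 is valid in both, and a detector can reasonably report the narrower charset. The sketch below shows one way to produce a single label per file in roughly the shape asked for above. It assumes Python 3; the function name `describe` and the label strings (including "unknown-8bit" and "mixed") are made up for illustration and are not the output format of file.

```python
def describe(data):
    """Label raw bytes with a charset guess and a line-terminator style."""
    try:
        data.decode("ascii")
        charset = "us-ascii"  # also valid UTF-8, since ASCII is a subset of it
    except UnicodeDecodeError:
        try:
            data.decode("utf-8")
            charset = "utf-8"
        except UnicodeDecodeError:
            # Some single-byte encoding such as iso-8859-1 or windows-1252;
            # the bytes alone do not say which one.
            charset = "unknown-8bit"

    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf  # LFs that are not part of a CRLF pair
    if crlf and lf:
        terminators = "mixed"
    elif crlf:
        terminators = "CRLF"
    elif lf:
        terminators = "LF"
    else:
        terminators = "none"
    return f"charset={charset}; line-terminators={terminators}"


print(describe(b"hello\n"))
print(describe(b"caf\xc3\xa9\r\n"))
```

Lone CR terminators (old Mac style) are not handled here and would show up as "none".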
The solution I ended up with is the two Sublime Text 2 plugins "EncodingHelper" and "LineEndings". I now get both the file encoding and line endings in the status bar.
If the encoding is wrong, I can use File->Save with Encoding. If the line endings are wrong, the latter plugin comes with commands for changing them.
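For fixing files outside the editor, the same two corrections can be sketched in a few lines, assuming Python 3. The function names `to_utf8_lf` and `convert_file` are hypothetical. The sketch also assumes that any file that is not valid UTF-8 is windows-1252 (falling back to iso-8859-1 for the few bytes windows-1252 leaves undefined), which matches the encodings mentioned in this question but is only a guess for other projects.

```python
def to_utf8_lf(data):
    """Return data re-encoded as UTF-8 with CRLF replaced by LF."""
    try:
        text = data.decode("utf-8")  # already UTF-8 (or plain ASCII)
    except UnicodeDecodeError:
        try:
            text = data.decode("cp1252")  # assumption: windows-1252
        except UnicodeDecodeError:
            text = data.decode("latin-1")  # iso-8859-1 accepts every byte
    return text.replace("\r\n", "\n").encode("utf-8")


def convert_file(path):
    """Rewrite path in place; return True if the contents changed."""
    with open(path, "rb") as handle:
        original = handle.read()
    converted = to_utf8_lf(original)
    if converted == original:
        return False
    with open(path, "wb") as handle:
        handle.write(converted)
    return True
```

Because the rewrite happens in place, it is safest to run something like this only on files already tracked in version control, so that a wrong encoding guess can be reverted.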