Convert files to UTF-8

I occasionally come across text files with weird squiggles or numbers in them where there should be characters. Usually it’s accented characters, but in extreme cases I’ve seen it happen with speech marks.

The problem, of course, is that the files are not ASCII, and text files don’t store what character set they were created with. So if the set my system happens to use doesn’t match the one the file uses, the result is screwed-up letters.
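
To make the mismatch concrete, here’s a quick demonstration: the same bytes come out as different characters depending on which codec you decode them with.

# the same bytes, read with the right codec and the wrong one
data = u'caf\xe9'.encode('utf-8')   # 'café' as UTF-8 bytes: 'caf\xc3\xa9'
print data.decode('utf-8')          # café   - codec matches, looks right
print data.decode('cp1252')         # cafÃ©  - codec mismatch, mojibake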

To solve this I created a little Python script that converts the file to the UTF-8 character set, which is nice and universal. You can specify the input codec with the -c option; if you don’t, it assumes the Windows-1252 code page, as in my experience that is usually what it is. There is also a force option (-f) for when the conversion comes across characters that don’t match the input codec but you want it to convert anyway.

I thought about making it autodetect the codec of the input file, but it would be a lot of work for little benefit; the current defaults work in 99% of cases for me.
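
(If you do want detection, the third-party chardet library will guess an encoding from the raw bytes. A rough sketch, with mystery.txt standing in for whatever file you have:)

import chardet

raw = open('mystery.txt', 'rb').read()   # read the raw bytes, no decoding
guess = chardet.detect(raw)              # e.g. {'encoding': 'windows-1252', 'confidence': 0.7}
print guess['encoding'], guess['confidence']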

#!/usr/bin/env python

import codecs
import optparse
import sys

if len(sys.argv) < 2:
    print 'convert_cp1252_to_UTF-8.py [-f] [-c codec] $filename'
    sys.exit(0)

opts = optparse.OptionParser()
opts.add_option('-f', '--force', action='store_true',
                help='ignore characters that do not match the input codec')
opts.add_option('-c', '--codec', default='cp1252',
                help='codec of the input file (default: cp1252)')
(parsedOpts, args) = opts.parse_args()
filename = args[0]

try:
    if parsedOpts.force:
        # skip any bytes that do not decode with the chosen codec
        textfile = codecs.open(filename, 'r', encoding=parsedOpts.codec, errors='ignore')
    else:
        textfile = codecs.open(filename, 'r', encoding=parsedOpts.codec)
    utffile = codecs.open(filename + '.utf-8', 'w', encoding='utf-8')
    utffile.write(textfile.read())
    textfile.close()
    utffile.close()
except UnicodeDecodeError:
    print "Error converting to UTF-8, source file probably doesn't use " + parsedOpts.codec
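
To give an example of running it: `./convert_cp1252_to_UTF-8.py -c latin-1 notes.txt` would read notes.txt as Latin-1 and write the converted text to notes.txt.utf-8 alongside the original (notes.txt here is just a stand-in filename).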

The above code is made available under the MIT license.