<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The Troll-Range &#187; conversion</title>
	<atom:link href="http://blog.trollgod.org.uk/tag/conversion/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.trollgod.org.uk</link>
	<description>Ghworg&#039;s wibblings and geek projects.</description>
	<lastBuildDate>Mon, 26 Apr 2010 19:31:11 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Convert files to UTF-8</title>
		<link>http://blog.trollgod.org.uk/2009/04/convert-files-to-utf-8/</link>
		<comments>http://blog.trollgod.org.uk/2009/04/convert-files-to-utf-8/#comments</comments>
		<pubDate>Thu, 16 Apr 2009 05:00:54 +0000</pubDate>
		<dc:creator>Ghworg</dc:creator>
				<category><![CDATA[Development]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[conversion]]></category>
		<category><![CDATA[text]]></category>
		<category><![CDATA[utf8]]></category>

		<guid isPermaLink="false">http://blog.trollgod.org.uk/?p=178</guid>
		<description><![CDATA[<p>I occasionally come across text files with weird squiggles or numbers in them where there should be characters. Usually it&#8217;s accented characters, but in extreme cases I&#8217;ve seen it happen with speech marks.</p> <p>The problem of course is that the files are not ASCII, and text files don&#8217;t store what character set they were <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.trollgod.org.uk/2009/04/convert-files-to-utf-8/">Convert files to UTF-8</a></span>]]></description>
			<content:encoded><![CDATA[<p>I occasionally come across text files with weird squiggles or numbers in them where there should be characters.  Usually it&#8217;s accented characters, but in extreme cases I&#8217;ve seen it happen with speech marks.</p>
<p>The problem of course is that the files are not ASCII, and text files don&#8217;t store what character set they were created with.  So if the set my system happens to use doesn&#8217;t match the one the file uses then the result is screwed up letters.</p>
<p>To solve this I created a little Python script that converts a file to the UTF-8 character set, as that is nice and universal.  You can specify what codec the input is in with the -c option; if you don&#8217;t bother, it assumes the Windows-1252 code page, as that is usually what it is in my experience.  There is also a force option (-f) for when the conversion comes across characters that don&#8217;t match the input codec but you want it to convert anyway; those characters are silently dropped.  The result is written next to the original with .utf-8 appended to the filename.</p>
<p>I thought about making it autodetect the codec of the input file, but it is a lot of work for little benefit.  The current code works in 99% of cases for me.</p>
<div class="dean_ch" style="white-space: nowrap;">
<p><span class="co1">#!/usr/bin/env python</span></p>
<p><span class="kw1">import</span> <span class="kw3">codecs</span><br />
<span class="kw1">import</span> <span class="kw3">optparse</span><br />
<span class="kw1">import</span> <span class="kw3">sys</span></p>
<p><span class="kw1">if</span> <span class="kw2">len</span><span class="br0">&#40;</span><span class="kw3">sys</span>.<span class="me1">argv</span><span class="br0">&#41;</span> &lt; <span class="nu0">2</span>:<br />
&nbsp; &nbsp; <span class="kw1">print</span> <span class="st0">&#8216;convert_cp1252_to_UTF-8.py [-f] $filename&#8217;</span><br />
&nbsp; &nbsp; <span class="kw3">sys</span>.<span class="me1">exit</span><span class="br0">&#40;</span><span class="nu0">0</span><span class="br0">&#41;</span></p>
<p>opts = <span class="kw3">optparse</span>.<span class="me1">OptionParser</span><span class="br0">&#40;</span><span class="br0">&#41;</span><br />
<span class="co1"># &#8216;help&#8217;, &#8216;config=&#8217;, &#8216;logfile=&#8217;</span><br />
opts.<span class="me1">add_option</span><span class="br0">&#40;</span><span class="st0">&#8216;-f&#8217;</span>, <span class="st0">&#8216;&#8211;force&#8217;</span>, action=<span class="st0">&#8216;store_true&#8217;</span><span class="br0">&#41;</span><br />
opts.<span class="me1">add_option</span><span class="br0">&#40;</span><span class="st0">&#8216;-c&#8217;</span>, <span class="st0">&#8216;&#8211;codec&#8217;</span>, default=<span class="st0">&#8216;cp1252&#8242;</span><span class="br0">&#41;</span><br />
<span class="br0">&#40;</span>parsedOpts, args<span class="br0">&#41;</span> = opts.<span class="me1">parse_args</span><span class="br0">&#40;</span><span class="br0">&#41;</span><br />
filename = args<span class="br0">&#91;</span><span class="nu0">0</span><span class="br0">&#93;</span></p>
<p><span class="kw1">try</span>:<br />
&nbsp; &nbsp; <span class="kw1">if</span> parsedOpts.<span class="me1">force</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; textfile = <span class="kw3">codecs</span>.<span class="kw2">open</span><span class="br0">&#40;</span>filename, <span class="st0">'r'</span>, encoding=parsedOpts.<span class="me1">codec</span>, errors=<span class="st0">'ignore'</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; <span class="kw1">else</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; textfile = <span class="kw3">codecs</span>.<span class="kw2">open</span><span class="br0">&#40;</span>filename, <span class="st0">'r'</span>, encoding=parsedOpts.<span class="me1">codec</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; utffile = <span class="kw3">codecs</span>.<span class="kw2">open</span><span class="br0">&#40;</span>filename+<span class="st0">'.utf-8'</span>, <span class="st0">'w'</span>, encoding=<span class="st0">'utf-8'</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; utffile.<span class="me1">write</span><span class="br0">&#40;</span>textfile.<span class="me1">read</span><span class="br0">&#40;</span><span class="br0">&#41;</span><span class="br0">&#41;</span><br />
<span class="kw1">except</span>:<br />
&nbsp; &nbsp; <span class="kw1">print</span> <span class="st0">&#8216;Error converting to UTF-8, source file probably doesn&#8217;</span>t use cp1252<span class="st0">&#8216;<br />
</span></div>
<p>The above code is made available under the <a href="http://www.opensource.org/licenses/mit-license.php">MIT</a> license.</p>
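<p>On the autodetection idea mentioned earlier: something much cruder than real detection may already be enough.  The following is only a sketch of a try-in-order fallback, not part of the script above, and the names guess_codec and CANDIDATES are made up for illustration.  It assumes that UTF-8 should be tried first (because it rejects most non-UTF-8 input) and that latin-1 is an acceptable last resort (because it accepts any byte).  It cannot tell one single-byte codec from another, so the order of the candidates is itself a guess.</p>
<pre>
```python
# Hypothetical helper, not part of the original script: guess a codec by
# trying candidates in order and keeping the first one that decodes cleanly.
CANDIDATES = ('utf-8', 'cp1252', 'latin-1')  # assumed order; latin-1 never fails


def guess_codec(data, candidates=CANDIDATES):
    """Return (codec, text) for the first candidate that decodes the bytes."""
    for codec in candidates:
        try:
            return codec, data.decode(codec)
        except UnicodeDecodeError:
            continue  # this codec does not fit, try the next one
    raise ValueError('none of the candidate codecs could decode the data')


if __name__ == '__main__':
    # An e-acute encoded as cp1252 is a single byte that is not valid UTF-8,
    # so the first candidate is rejected and the second one is used.
    sample = u'caf\u00e9'.encode('cp1252')
    print(guess_codec(sample))
```
</pre>
<p>To use it, read the file in binary mode and pass the bytes in.  An explicit -c option would still be the better choice whenever the real codec is known.</p>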
]]></content:encoded>
			<wfw:commentRss>http://blog.trollgod.org.uk/2009/04/convert-files-to-utf-8/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
