Codec Bug in Python
I seem to have stumbled across a bug in Python’s libraries for dealing with Unicode data: mixing calls to file.readline() and file.readlines() works fine for regular 8-bit input files, but not for files read through a codec. Given any file sample.txt with more than a few dozen characters (even just ASCII characters), the version of Python 2.5.1 that ships with Mac OS X behaves as shown below. The sample file here has 27 lines, so readlines() should return 26 lines after a single readline():
>>> f = open('sample.txt')
>>> first_line = f.readline()
>>> remaining_lines = f.readlines()
>>> len(remaining_lines)
26
>>> import codecs
>>> f = codecs.open('sample.txt')
>>> first_line = f.readline()
>>> remaining_lines = f.readlines()
>>> len(remaining_lines)
26
>>> f = codecs.open('sample.txt', encoding='utf-8')
>>> first_line = f.readline()
>>> remaining_lines = f.readlines()
>>> len(remaining_lines)
3
Luckily, there’s an obvious workaround:
>>> f = codecs.open('sample.txt', encoding='utf-8')
>>> all_lines = f.readlines()
>>> len(all_lines)
27
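If the first line still needs separate handling, the same workaround can be wrapped in a small helper. This is only a sketch, not something the codecs module provides: the name split_first_line is made up for illustration, it assumes the whole file fits in memory, and the in-memory io.StringIO sample (Python 2.6 or later) merely stands in for a reader returned by codecs.open.

```python
import io


def split_first_line(f):
    """Return (first_line, remaining_lines) using a single readlines() call."""
    # Hypothetical helper: never mixes readline() with readlines().
    all_lines = f.readlines()
    if not all_lines:
        return u"", []
    return all_lines[0], all_lines[1:]


# In-memory stand-in for a decoded file; an object returned by
# codecs.open(path, encoding='utf-8') could be passed instead.
sample = io.StringIO(u"first\nsecond\nthird\n")
first_line, remaining_lines = split_first_line(sample)
```

Because readlines() is the only read call made on the file object, the helper avoids the mixed-call pattern that triggers the problem.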
If you want to test it with your own data, here is a test script:
#!/usr/bin/env python
import sys
import codecs

def testfile(f):
    # Read one line with readline(), then the rest with readlines().
    firstline = f.readline()
    remaining_lines = f.readlines()
    print "Read " + str(len(remaining_lines) + 1) + " lines."
    f.close()

def main(argv=None):
    if argv is None:
        argv = sys.argv
    for filename in argv[1:]:
        print "Opening " + filename + " using `open` built-in:"
        testfile(open(filename))
        print "Opening " + filename + " using `codecs.open` with no encoding:"
        testfile(codecs.open(filename))
        print "Opening " + filename + " using `codecs.open` with encoding:"
        testfile(codecs.open(filename, encoding='utf-8'))

if __name__ == "__main__":
    sys.exit(main())
The documentation suggests that mixing calls to readline() and readlines() should be safe for any file object, so it does look like a bug. I’ll have to check whether it’s fixed in the Python 3 betas (which use Unicode by default).
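One way to run that check without a hand-made sample file is sketched below. It is not part of the original test script: the function names are hypothetical, the 27-line count simply mirrors the sample used above, and the syntax assumes Python 2.6 or later (including Python 3). It writes a throwaway ASCII file, reads it through codecs.open with one readline() followed by readlines(), and reports how many lines came back.

```python
import codecs
import os
import tempfile


def count_lines_mixed(f):
    """Call readline() once, then readlines(); return the total line count."""
    try:
        first_line = f.readline()
        remaining_lines = f.readlines()
    finally:
        f.close()
    return (1 if first_line else 0) + len(remaining_lines)


def check_mixed_reads(line_count=27):
    """Write a temporary ASCII file and count its lines via codecs.open."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "w") as out:
            for i in range(line_count):
                out.write("sample line number %d\n" % i)
        return count_lines_mixed(codecs.open(path, encoding="utf-8"))
    finally:
        # Clean up the throwaway file whether or not the read succeeded.
        os.remove(path)


observed = check_mixed_reads()
print("expected 27 lines, read %d" % observed)
```

If the two numbers differ on a given interpreter, the mixed-call problem is still present there.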