I seem to have stumbled across a bug in Python’s standard library support for Unicode data: apparently mixing calls to file.readline() and file.readlines() works just fine for regular 8-bit input files, but not for files read through a codec. Given any file sample.txt with more than a few dozen characters (even pure ASCII), the Python 2.5.1 that ships with Mac OS X behaves like this:

>>> f = open('sample.txt')
>>> first_line = f.readline()
>>> remaining_lines = f.readlines()
>>> len(remaining_lines)
26
>>> import codecs
>>> f = codecs.open('sample.txt')
>>> first_line = f.readline()
>>> remaining_lines = f.readlines()
>>> len(remaining_lines)
26
>>> f = codecs.open('sample.txt', encoding='utf-8')
>>> first_line = f.readline()
>>> remaining_lines = f.readlines()
>>> len(remaining_lines)
3

Luckily, there’s an obvious workaround:

>>> f = codecs.open('sample.txt', encoding='utf-8')
>>> all_lines = f.readlines()
>>> len(all_lines)
27

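If you really do want to read one line first and the rest afterwards, here is a sketch of another approach (not something I have folded into the test script below): read the file with the ordinary built-in open() and decode each line yourself. Since UTF-8 never hides a newline byte inside a multi-byte sequence, splitting into lines before decoding is safe.

# Sketch: decode each line by hand instead of using a codecs reader.
f = open('sample.txt')
first_line = f.readline().decode('utf-8')
remaining_lines = [line.decode('utf-8') for line in f.readlines()]
f.close()
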
If you want to test it with your own data, here is a test script:

#!/usr/bin/env python

import sys
import codecs

def testfile(f):
    # Read one line first, then ask for all the rest: this is the
    # readline()/readlines() mix that triggers the bug.
    firstline = f.readline()
    remaining_lines = f.readlines()
    print "Read " + str(len(remaining_lines) + 1) + " lines."
    f.close()

def main(argv=None):
    if argv is None: argv = sys.argv

    for filename in argv[1:]:
        print "Opening " + filename + " using `open` built-in:"
        testfile(open(filename))
        print "Opening " + filename + " using `codecs.open` with no encoding:"
        testfile(codecs.open(filename))
        print "Opening " + filename + " using `codecs.open` with encoding:"
        testfile(codecs.open(filename, encoding='utf-8'))
    
if __name__ == "__main__": sys.exit(main())
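
Save the script under any name you like (I will assume readline_test.py here; the name is arbitrary) and pass it one or more filenames:

$ python readline_test.py sample.txt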

The documentation suggests that mixing calls to readline() and readlines() should be safe for any file object, so this does look like a bug. I’ll have to check whether it has been fixed in the Python 3 betas (which use Unicode strings by default).
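
For reference, here is a sketch of what I expect the equivalent check to look like under Python 3, where the built-in open() accepts an encoding argument directly (untested against the betas, so treat it as a sketch only):

# Python 3 sketch of the same readline()/readlines() mix.
f = open('sample.txt', encoding='utf-8')
first_line = f.readline()
remaining_lines = f.readlines()
print("Read", len(remaining_lines) + 1, "lines.")
f.close()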