Programming
python encoding text-files
Updated Tue, 26 Jul 2022 11:11:38 GMT

How to determine the encoding of text?


I received some text that is encoded, but I don't know what charset was used. Is there a way to determine the encoding of a text file using Python? The question "How can I detect the encoding/codepage of a text file" deals with C#.




Solution

EDIT: chardet seems to be unmaintained, but most of this answer still applies. See https://pypi.org/project/charset-normalizer/ for an alternative.

Correctly detecting the encoding every time is impossible.

(From chardet FAQ:)

However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds txzqJv 2!dasd0a QqdKjvz will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of typical text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.
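To see why a guess is needed at all, here is a small standard-library sketch. It shows that the same bytes decode without error under several encodings, so a detector has to judge which result looks like real text. The helper name `decode_candidates` and the sample string are made up for illustration.

```python
def decode_candidates(raw: bytes, encodings):
    """Return {encoding: text} for every encoding that decodes raw without error."""
    results = {}
    for encoding in encodings:
        try:
            results[encoding] = raw.decode(encoding)
        except UnicodeDecodeError:
            # This encoding cannot represent these bytes; skip it.
            continue
    return results


# The bytes below were produced as UTF-8, but the bytes alone do not say so.
raw = "café".encode("utf-8")
for encoding, text in decode_candidates(raw, ["utf-8", "latin-1", "cp1252"]).items():
    print(f"{encoding}: {text!r}")
```

All three decodes succeed, but only one yields the intended word. Telling them apart takes knowledge of what plausible text looks like, which is the "fluency" the FAQ describes.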

The chardet library uses this kind of study to try to detect the encoding. chardet is a port of the auto-detection code in Mozilla.
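A minimal usage sketch, assuming chardet is installed (`pip install chardet`). `chardet.detect` takes raw bytes and returns a dictionary that includes the guessed `encoding` and a `confidence` score. The wrapper name `detect_with_chardet` and the sample text are hypothetical, and the guarded import is there only so the snippet still runs when the package is missing.

```python
def detect_with_chardet(raw: bytes):
    """Return chardet's guess for raw, or None if chardet is not installed."""
    try:
        import chardet  # third-party package, assumed to be installed
    except ImportError:
        return None
    return chardet.detect(raw)


# Sample bytes; for a real file, read it in binary mode instead:
#     raw = open(path, "rb").read()
raw = "Привет, мир! Это пример русского текста.".encode("utf-8")
guess = detect_with_chardet(raw)
if guess is None:
    print("chardet is not installed")
else:
    print(guess["encoding"], guess["confidence"])
```

The result is a guess, not a fact: short inputs in particular can come back with low confidence or with `None` as the encoding.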

You can also use UnicodeDammit, which is part of Beautiful Soup. It tries the following methods, in order:

  • An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
  • An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
  • An encoding sniffed by the chardet library, if you have it installed.
  • UTF-8
  • Windows-1252
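The fallback idea in the list above can be sketched with the standard library alone. This is a simplified illustration, not Beautiful Soup's implementation. It covers only the byte-order-mark sniffing, UTF-8 and Windows-1252 steps, and it skips the in-document declaration and chardet steps. The function name `decode_with_fallbacks` is hypothetical.

```python
import codecs

# Byte order marks, with the UTF-32 ones first: the UTF-32 LE mark
# starts with the same two bytes as the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_with_fallbacks(raw: bytes):
    """Return (text, encoding_used) using a fixed order of attempts."""
    # Step 1: sniff the first few bytes for a byte order mark.
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            # These codecs consume the mark while decoding.
            return raw.decode(encoding), encoding
    # Step 2: try strict UTF-8, which rejects most non-UTF-8 input.
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    # Step 3: last resort. A few byte values are undefined in cp1252,
    # so substitute a replacement character instead of raising.
    return raw.decode("cp1252", errors="replace"), "cp1252"


print(decode_with_fallbacks(b"caf\xe9"))
print(decode_with_fallbacks(codecs.BOM_UTF8 + b"hi"))
```

With Beautiful Soup installed, the real class does this work for you: `UnicodeDammit(raw).unicode_markup` holds the decoded text and `UnicodeDammit(raw).original_encoding` reports the encoding it settled on.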




Comments (5)

  • +1 – Thanks for the chardet reference. Seems good, although a bit slow. — Jan 28, 2010 at 05:15  
  • +0 – @Geomorillo: There's no such thing as "the encoding standard". Text encoding is something as old as computing, it grew organically with time and needs, it wasn't planned. "Unicode" is an attempt to fix this. — Dec 02, 2013 at 14:34  
  • +1 – And not a bad one, all things considered. What I would like to know is, how do I find out what encoding an open text file was opened with? — Mar 14, 2014 at 06:27  
  • +2 – @dumbledad what I said is that correctly detecting it all times is impossible. All you can do is a guess, but it can fail sometimes, it won't work every time, due to encodings not being really detectable. To do the guess, you can use one of the tools I suggested in the answer — Apr 20, 2018 at 15:41  
  • +1 – Apparently cchardet is faster, but requires cython. — May 26, 2019 at 19:48