22.5 Coding Systems
Users of various languages have established many more-or-less standard coding systems for representing them. Emacs does not use these coding systems internally; instead, it converts from various coding systems to its own system when reading data, and converts the internal coding system to other coding systems when writing data. Conversion is possible in reading or writing files, in sending or receiving from the terminal, and in exchanging data with subprocesses.
Emacs assigns a name to each coding system. Most coding systems are
used for one language, and the name of the coding system starts with
the language name. Some coding systems are used for several
languages; their names usually start with ‘ iso
’. There are also
special coding systems, such as no-conversion
, raw-text
,
and emacs-internal
.
A special class of coding systems, collectively known as
codepages, is designed to support text encoded by MS-Windows and
MS-DOS software. The names of these coding systems are
cpnnnn
, where nnnn is a 3- or 4-digit number of the
codepage. You can use these encodings just like any other coding
system; for example, to visit a file encoded in codepage 850, type
C-x RET c cp850 RET C-x C-f filename RET
.
In addition to converting various representations of non-ASCII characters, a coding system can perform end-of-line conversion. Emacs handles three different conventions for how to separate lines in a file: newline (Unix), carriage return followed by linefeed (DOS), and just carriage return (Mac).
C-h C coding RET
Describe coding system coding ( describe-coding-system
).
C-h C RET
Describe the coding systems currently in use ( describe-coding-system
).
M-x list-coding-systems
Display a list of all the supported coding systems.
The command C-h C
( describe-coding-system
) displays
information about particular coding systems, including the end-of-line
conversion specified by those coding systems. You can specify a coding
system name as the argument; alternatively, with an empty argument, it
describes the coding systems currently selected for various purposes,
both in the current buffer and as the defaults, and the priority list
for recognizing coding systems (see Recognizing Coding Systems).
To display a list of all the supported coding systems, type M-x list-coding-systems
. The list gives information about each coding
system, including the letter that stands for it in the mode line
(see The Mode Line).
Each of the coding systems that appear in this list—except for
no-conversion
, which means no conversion of any kind—specifies
how and whether to convert printing characters, but leaves the choice of
end-of-line conversion to be decided based on the contents of each file.
For example, if the file appears to use the sequence carriage return
and linefeed to separate lines, DOS end-of-line conversion will be used.
Each of the listed coding systems has three variants, which specify exactly what to do for end-of-line conversion:
…-unix
Don’t do any end-of-line conversion; assume the file uses newline to separate lines. (This is the convention normally used on Unix and GNU systems, and macOS.)
…-dos
Assume the file uses carriage return followed by linefeed to separate lines, and do the appropriate conversion. (This is the convention normally used on Microsoft systems. 8)
…-mac
Assume the file uses carriage return to separate lines, and do the appropriate conversion. (This was the convention used in Classic Mac OS.)
These variant coding systems are omitted from the
list-coding-systems
display for brevity, since they are entirely
predictable. For example, the coding system iso-latin-1
has
variants iso-latin-1-unix
, iso-latin-1-dos
and
iso-latin-1-mac
.
The coding systems unix
, dos
, and mac
are
aliases for undecided-unix
, undecided-dos
, and
undecided-mac
, respectively. These coding systems specify only
the end-of-line conversion, and leave the character code conversion to
be deduced from the text itself.
The coding system raw-text
is good for a file which is mainly
ASCII text, but may contain byte values above 127 that are
not meant to encode non-ASCII characters. With
raw-text
, Emacs copies those byte values unchanged, and sets
enable-multibyte-characters
to nil
in the current buffer
so that they will be interpreted properly. raw-text
handles
end-of-line conversion in the usual way, based on the data
encountered, and has the usual three variants to specify the kind of
end-of-line conversion to use.
In contrast, the coding system no-conversion
specifies no
character code conversion at all—none for non-ASCII byte values and
none for end of line. This is useful for reading or writing binary
files, tar files, and other files that must be examined verbatim. It,
too, sets enable-multibyte-characters
to nil
.
The easiest way to edit a file with no conversion of any kind is with
the M-x find-file-literally
command. This uses
no-conversion
, and also suppresses other Emacs features that
might convert the file contents before you see them. See Visiting Files.
The coding system emacs-internal
(or utf-8-emacs
,
which is equivalent) means that the file contains non-ASCII
characters stored with the internal Emacs encoding. This coding
system handles end-of-line conversion based on the data encountered,
and has the usual three variants to specify the kind of end-of-line
conversion.
Footnotes
(8)
It is also specified for
MIME ‘ text/*
’ bodies and in other network transport contexts. It
is different from the SGML reference syntax record-start/record-end
format, which Emacs doesn’t support directly.