Latin-1 Character Set for HTML

David D. McFarland


Latin-1 is the default character set for HTML, the one that is supposed to be available in every web browser. For that reason some familiarity with Latin-1 is useful for every web user who has occasion to read or write web pages containing symbols beyond the ordinary keyboard characters.

Latin-1's main feature is an extensive range of accented characters used in western European languages, and scholars who use those languages will be interested in Latin-1 for that reason.

Mathematical sociologists are among those for whose work the ubiquity of Latin-1 creates more problems than solutions. We are not alone, however. Many scholars need non-Roman alphabets, such as Cyrillic, Greek, or Hebrew, which are not included in Latin-1. Some need Arabic, Chinese, or other characters. (These needs arise in international commerce as well as in scholarship.)

Unicode, if and when it becomes widely implemented, promises to solve many of the current problems with languages that do not use the Roman alphabet. That widespread implementation might happen before the turn of the century. The most widely used computer platforms at this writing, Windows 3.1 and Windows 95, do not support Unicode, but Microsoft has built Unicode capability into Windows NT 4.0, and it is reasonable to guess that revisions of other operating systems will do likewise.

What follows, then, is not a full account of Latin-1, but an account intended for mathematical sociologists. Our main concern herein is to know enough about Latin-1 to recognize and work around its limitations for the mathematical notation in our work.

Mathematical notation involves special characters, whose presence or absence in Latin-1 is our current topic. But mathematical notation also involves considerations beyond the availability of particular characters. For example, a mathematical expression may involve characters positioned above or below the baseline of the expression, such as limits of integration, or multiple levels of sub- or superscripts. Some characters must span several rows, such as the brackets enclosing a matrix. So the availability of an appropriate character set will solve many, but not all, of the problems faced in attempts to publish mathematical sociology on the Web.


7-Bit ASCII

The Latin-1 character set is an extension of the earlier ASCII character set, to which we turn first. 7-bit ASCII uses 7 bits to distinguish 128 (= 2 to the power 7) different codes, numbered 0 through 127. Codes 0-31 and 127 are reserved for control codes, most of which are rarely used these days. The main part of 7-bit ASCII covers upper and lower case letters, digits, and punctuation marks. These can all be entered directly from the keyboard; no special codes are needed, but just for the record, the numeric codes are as follows:
  • 0-31. Control characters, such as 12 = formfeed
  • 32. Space
  • 33-47. ! " # $ % & ' ( ) * + , - . /
  • 48-57. Digits 0 through 9
  • 58-64. : ; < = > ? @
  • 65-90. Upper case A through Z
  • 91-96. [ \ ] ^ _ `
  • 97-122. Lower case a through z
  • 123-126. { | } ~
  • 127. Control character
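The ranges listed above can be checked with a few lines of code. The following is a minimal sketch in Python; the function name `ascii_class` and the class labels are illustrative choices, not part of any standard.

```python
def ascii_class(code: int) -> str:
    """Return the 7-bit ASCII range that a code (0 through 127) falls into."""
    if not 0 <= code <= 127:
        raise ValueError("7-bit ASCII codes run from 0 through 127")
    if code <= 31 or code == 127:
        return "control"
    if code == 32:
        return "space"
    ch = chr(code)
    if ch.isdigit():
        return "digit"
    if ch.isupper():
        return "upper case letter"
    if ch.islower():
        return "lower case letter"
    # Everything left over is one of the punctuation marks in the list above.
    return "punctuation"


# Tally how many of the 2**7 codes fall into each range.
counts = {}
for code in range(2 ** 7):
    name = ascii_class(code)
    counts[name] = counts.get(name, 0) + 1

for name, n in sorted(counts.items()):
    print(f"{name}: {n}")
```

The tally makes the point of the list concrete: about a quarter of the 128 codes go to control characters, and the remaining printable codes are letters, digits, a space, and a few dozen punctuation marks.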

    7-bit ASCII offers little to the mathematical sociologist, being geared to an elementary school level of mathematics that includes addition ( + ) and subtraction ( - ), but not yet multiplication or division (the conventional signs for which, × and ÷, are missing from ASCII).

    Workarounds have a long history, and a glance at a couple of historical workaround strategies may be instructive for our own attempts to use the World Wide Web for mathematical sociology.

    We no longer have to go through such contortions to write about multiplication and division. However, the concept of "workaround" and the notion of pursuing various workaround strategies will be of continuing use to mathematical sociologists, as well as to various others whose work is outside of the market for which software vendors design their mainstream products.


    8-Bit Latin-1

    Clearly the 7-bit ASCII character set is inadequate for many purposes, so various organizations developed extensions, and the lack of agreement among those extensions is a continuing source of headaches. Here we consider just one of them, Latin-1, which is the default character set on the World Wide Web.

    Adding an 8th bit to the transmission code doubled the total number of different codes from 128 (= 2 to the power 7) to 256 (= 2 to the power 8). To maintain backward compatibility (though at the cost of wasting space on control characters that are no longer used), the codes 0 through 127 were assigned to the same characters as in 7-bit ASCII. Also, since 8-bit codes passing through equipment intended for 7-bit codes might have their 8th bits truncated, and since control codes could be dangerous, the codes from 128 to 159, which on truncation of their 8th bits would become the control codes 0 to 31, were not assigned printable characters. Code 255, which would truncate to the control code 127, is nonetheless assigned to a character, y with diaeresis ( ÿ ).
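The truncation described above amounts to clearing the top bit of each 8-bit code. Here is a minimal sketch in Python, assuming the 7-bit equipment simply drops the 8th bit; the function name `strip_eighth_bit` is hypothetical.

```python
def strip_eighth_bit(code: int) -> int:
    """Simulate 7-bit equipment by discarding the 8th (highest) bit of a code."""
    if not 0 <= code <= 255:
        raise ValueError("8-bit codes run from 0 through 255")
    return code & 0x7F  # keep only the low 7 bits


# One extra bit doubles the number of distinct codes.
assert 2 ** 8 == 2 * 2 ** 7

# Codes 128-159 collapse onto the control codes 0-31; 255 collapses onto 127.
for code in (128, 140, 159, 160, 233, 255):
    print(code, "->", strip_eighth_bit(code))
```

Code 140, for example, comes out as 12, the formfeed control code, which is the kind of accident the designers wanted to avoid.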

    The Latin-1 characters with numerical codes above 127 are mostly accented letters used in various European languages: c cedilla ( ç ), e grave ( è ), n tilde ( ñ ), u umlaut ( ü ), and such. These are needed for writing in French, German, Spanish, etc.
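As an illustration, the numeric codes of the four accented letters just mentioned can be looked up by encoding each one as Latin-1. This is a sketch in Python using the standard `latin-1` codec; the helper name `latin1_code` is made up for this example.

```python
import unicodedata


def latin1_code(ch: str) -> int:
    """Return the Latin-1 code (0 through 255) of a single character."""
    return ch.encode("latin-1")[0]


# c cedilla, e grave, n tilde, u umlaut
for ch in ("\u00e7", "\u00e8", "\u00f1", "\u00fc"):
    print(latin1_code(ch), unicodedata.name(ch))
```

All four codes lie above 127, so none of these letters can be represented in 7-bit ASCII.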

    These characters are not directly on the keyboard, and the ways of getting them into a document vary by computer platform and software. Some programs offer their own ways of entering non-keyboard characters. Here we mention only the ways that come with the operating system.

    The Latin-1 characters with numbers above 127 consist mostly of accented letters; the exceptions are as follows:

    n.    &#n;    Name
    161.  &#161;  Inverted exclamation mark
    162.  &#162;  Cent
    163.  &#163;  Pound (currency)
    164.  &#164;  Currency
    165.  &#165;  Yen
    166.  &#166;  Broken vertical bar
    167.  &#167;  Section
    168.  &#168;  Umlaut/diaeresis
    169.  &#169;  Copyright
    170.  &#170;  Feminine ordinal
    171.  &#171;  Left angle quote
    172.  &#172;  Not sign
    173.  &#173;  Soft hyphen
    174.  &#174;  Registered trade mark
    175.  &#175;  Macron
    176.  &#176;  Degree
    177.  &#177;  Plus/minus
    178.  &#178;  Superscript 2
    179.  &#179;  Superscript 3
    180.  &#180;  Acute accent
    181.  &#181;  Micro
    182.  &#182;  Paragraph
    183.  &#183;  Middle dot
    184.  &#184;  Cedilla
    185.  &#185;  Superscript 1
    186.  &#186;  Masculine ordinal
    187.  &#187;  Right angle quote
    188.  &#188;  One quarter
    189.  &#189;  One half
    190.  &#190;  Three quarters
    191.  &#191;  Inverted question mark
    ...
    215.  &#215;  Multiplication
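In HTML source, a character with code n can be written as the numeric character reference `&#n;`, for example `&#177;` for the plus-or-minus sign. The sketch below, in Python, builds such references and decodes them again with the standard library's `html.unescape`; the helper name `numeric_reference` is an assumption made for this example.

```python
import html


def numeric_reference(ch: str) -> str:
    """Return the HTML numeric character reference for a Latin-1 character."""
    code = ord(ch)
    if code > 255:
        raise ValueError("character is outside the Latin-1 range 0-255")
    return f"&#{code};"


# Plus-or-minus, multiplication, and superscript 2, as listed in the table.
for ch in ("\u00b1", "\u00d7", "\u00b2"):
    ref = numeric_reference(ch)
    # Decoding the reference should give back the original character.
    assert html.unescape(ref) == ch
    print(ref)
```

Writing the reference rather than the character itself keeps the HTML file in plain 7-bit ASCII, which sidesteps the question of how to type the character on a given platform.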

    So, just what does Latin-1 offer the mathematical sociologist? Not much: the plus-or-minus sign ( ± ), the multiplication sign ( × ), and one level of superscripts, provided the superscript you need happens to be 1, 2, or 3.
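One way to see how little is on offer is to test whether a given symbol can be encoded in Latin-1 at all. The following Python sketch does so with the standard `latin-1` codec. The function name `in_latin1` is hypothetical, and the summation sign, integral sign, Greek alpha, and less-than-or-equal sign are examples chosen here, not symbols discussed in the text.

```python
def in_latin1(ch: str) -> bool:
    """Return True if the character can be encoded in Latin-1."""
    try:
        ch.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


# A few symbols a mathematical paper might need (illustrative selection).
symbols = {
    "plus-or-minus": "\u00b1",
    "multiplication": "\u00d7",
    "superscript 2": "\u00b2",
    "summation": "\u2211",
    "integral": "\u222b",
    "Greek small alpha": "\u03b1",
    "less-than-or-equal": "\u2264",
}

for name, ch in symbols.items():
    status = "available" if in_latin1(ch) else "missing"
    print(f"{name}: {status}")
```

Only the first three entries are reported as available; the rest need some other character set or a workaround.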

    Next we will consider the Symbol character set, which is widely used for printed documents but is not yet available by default in all web browsers, as Latin-1 is supposed to be. A standards committee has proposed making it available, but as mentioned earlier, software vendors do not always follow such recommendations. Nevertheless, Symbol has many characters of use to mathematical sociologists, and there are good reasons to be familiar with that character set.


    References:

    Raggett, Dave, Jenny Lam, and Ian Alexander. 1996. HTML 3: Electronic Publishing on the World Wide Web. Reading: Addison-Wesley.

