Lecture 1: Encoding Language: LING 1330/2330: Introduction To Computational Linguistics Na-Rae Han

1) The document discusses different encoding systems for representing language on computers, from the basic binary representation to ASCII, ISO-8859, and Unicode. 2) ASCII uses 7-bit encoding for English text but other languages required extended 8-bit encodings like ISO-8859. However, these created conflicts due to overlapping character codes. 3) Unicode provides a single universal character encoding that assigns a unique code to every character across all languages, with various encoding forms like UTF-8, UTF-16, and UTF-32.

Uploaded by

Laura Amwayi
Lecture 1: Encoding Language

LING 1330/2330: Introduction to Computational Linguistics


Na-Rae Han
Objectives
• Understand the fundamentals of how language is encoded on a computer
• Text encoding systems
  - ASCII
  - ISO-8859
  - Unicode

1/12/2017 2
How is language represented on a computer?

• Natural ("human") languages:
  - Spoken form
  - Written form
  *Also: sign languages
• The language of computers:

The language of computers
• At the lowest level, computer language is binary: information on a computer is stored in bits
  - A bit is either ON (=1, =yes) or OFF (=0, =no)
  - This language essentially contains two alphabetic characters
• Next level up: byte
  - A byte is made up of a sequence of 8 bits
    * ex. 01001101
  - Historically, a byte was the number of bits used to encode a single character of text in a computer
  - The byte is the basic addressable unit in most computer architectures
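
The bit and byte arithmetic above can be checked with a few lines of Python. This is only a small sketch; the variable names are illustrative and do not come from the lecture.

```python
# Interpret the slide's example byte as a base-2 number.
example_byte = "01001101"
value = int(example_byte, 2)  # decimal value of the bit pattern

# n bits can distinguish 2**n different patterns.
patterns_per_bit = 2 ** 1
patterns_per_byte = 2 ** len(example_byte)

print(value, patterns_per_bit, patterns_per_byte)
```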

Encoding a written language
• How to represent a text with 0s and 1s?
  - Hello world!
    * 010010000110010101101100011011000110111100100000011101110110111101110010011011000110010000100001
• Each character is mapped to a code point (= character code), i.e., a unique integer.
  - H → 72 (decimal)
  - e → 101 (decimal)
• Each code point is represented as a binary number, using a fixed number of bits.
  - 8 bits = 1 byte in the example above
    * H → 72 (decimal) → 01001000 (2^6 + 2^3 = 64 + 8 = 72)
    * e → 101 (decimal) → 01100101 (2^6 + 2^5 + 2^2 + 2^0 = 64 + 32 + 4 + 1 = 101)
• One byte can represent 256 (= 2^8) different characters
  - 00000000 → 0 (decimal)    11111111 → 255 (decimal)
ASCII encoding for English
• How many bits are needed to encode English?
  - 26 lowercase letters: a, b, c, d, e, …
  - 26 uppercase letters: A, B, C, D, E, …
  - 10 Arabic digits: 0, 1, 2, 3, 4, …
  - Punctuation: . , : ; ? ! ' "
  - Symbols: ( ) < > & % * $ + -
  → We are already up to 80
  → 6 bits (2^6 = 64) is not enough; we will need at least 7 (2^7 = 128)
• ASCII (the American Standard Code for Information Interchange) did just that
  - Uses 7-bit code (= 128 characters) for storing English text
  - Range 0 to 127
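
The count above can be reproduced with a short calculation. This is a sketch using the slide's tallies; the dictionary keys and variable names are illustrative.

```python
# Character counts taken from the slide's list.
counts = {
    "lowercase": 26,
    "uppercase": 26,
    "digits": 10,
    "punctuation": 8,  # . , : ; ? ! ' "
    "symbols": 10,     # ( ) < > & % * $ + -
}
total = sum(counts.values())

# Smallest n such that 2**n >= total: the bit length of (total - 1).
bits_needed = (total - 1).bit_length()

print(total, bits_needed, 2 ** bits_needed)
```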

The ASCII chart
• https://en.wikipedia.org/wiki/ASCII
• http://web.alfredstate.edu/weimandn/miscellaneous/ascii/ASCII%20Conversion%20Chart.pdf

Decimal   Binary (7-bit)   Character
0         000 0000         (NULL)
…         …                …
35        010 0011         #
36        010 0100         $
…         …                …
48        011 0000         0
49        011 0001         1
50        011 0010         2
…         …                …
65        100 0001         A
66        100 0010         B
67        100 0011         C
…         …                …
97        110 0001         a
98        110 0010         b
99        110 0011         c
…         …                …
127       111 1111         (DEL)
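
Rows of the chart can be regenerated programmatically. The helper `ascii_row` below is hypothetical; its "(control)" and "(space)" labels are placeholders chosen for this sketch, not names from the chart.

```python
def ascii_row(code: int) -> tuple:
    """Return (decimal, 7-bit binary grouped as 3+4, character label) for an ASCII code."""
    if not 0 <= code <= 127:
        raise ValueError("ASCII codes range from 0 to 127")
    bits = format(code, "07b")
    grouped = f"{bits[:3]} {bits[3:]}"
    if code == 0:
        label = "(NULL)"
    elif code == 127:
        label = "(DEL)"
    elif code < 32:
        label = "(control)"  # placeholder label for the other control codes
    elif code == 32:
        label = "(space)"
    else:
        label = chr(code)
    return code, grouped, label


for code in (0, 35, 36, 48, 65, 97, 127):
    print(*ascii_row(code))
```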
ASCII (the American Standard Code for Information Interchange)

• The ASCII encoding scheme
  - First published in 1963
  - Uses 7-bit code (= 128 characters) for storing English text, ranging from 0 to 127
  - In an 8-bit (1 byte) representation, the highest bit is always 0
  - Printable characters
    * Upper- and lowercase Roman alphabet
    * Digits
    * Punctuation marks, symbols, and space
  - Includes 33 non-printing characters (codes 0-31 and 127)
    * Control characters: BELL, ACKNOWLEDGE, BACKSPACE, DELETE, etc.
      → originally for teletypewriters, many obsolete now
    * WHITESPACE characters: TAB, LINE FEED, CARRIAGE RETURN
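
The split between printing and non-printing codes can be tallied directly. This sketch treats codes 0-31 and 127 as control characters and codes 32-126 (space included) as printable; the variable names are illustrative.

```python
# Partition the 128 ASCII codes into control and printable ranges.
control = [c for c in range(128) if c < 32 or c == 127]
printable = [c for c in range(128) if 32 <= c < 127]

# Every ASCII code fits in 7 bits, so bit 8 (mask 0x80) is never set.
high_bit_always_zero = all(c & 0x80 == 0 for c in range(128))

# The whitespace control characters named on the slide.
whitespace_codes = {"TAB": ord("\t"), "LINE FEED": ord("\n"), "CARRIAGE RETURN": ord("\r")}

print(len(control), len(printable), high_bit_always_zero, whitespace_codes)
```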

Practice
• What is this English text?
  - Note: byte (= 8-bit) ASCII representation instead of 7-bit
  - Spaces provided for your convenience only!

01001000 01101001 00100001

• Answer:
Hi!
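
The practice answer can be verified by reversing the encoding. `bits_to_text` is a hypothetical helper that assumes the input is space-separated 8-bit groups of ASCII codes.

```python
def bits_to_text(bits: str) -> str:
    """Decode space-separated 8-bit groups as ASCII text."""
    codes = [int(group, 2) for group in bits.split()]
    return bytes(codes).decode("ascii")  # raises UnicodeDecodeError on non-ASCII codes


print(bits_to_text("01001000 01101001 00100001"))
```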

Extending ASCII: ISO-8859, etc.
• ASCII (= 7 bit, 128 characters) was sufficient for encoding English. But what about characters used in other languages?
• Solution: Extend ASCII into 8-bit (= 256 characters) and use the additional 128 slots for non-English characters
  - ISO-8859: has 15 different parts (numbered 1-16; part 12 was never published)!
    * ISO-8859-1 aka Latin-1: French, German, Spanish, etc.
    * ISO-8859-7 Greek alphabet
    * ISO-8859-8 Hebrew alphabet
  - JIS X 0208: Japanese characters
• Problem: overlapping character code space.
  224 (decimal) means à in Latin-1 but א in ISO-8859-8!
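
The overlap is easy to reproduce: the same byte decodes to different characters depending on which code page is assumed. This sketch uses Python's built-in `iso8859_1` and `iso8859_8` codecs; the variable names are illustrative.

```python
# One byte with the decimal value 224 (hex E0).
raw = bytes([224])

as_latin1 = raw.decode("iso8859_1")  # Latin-1 interpretation
as_hebrew = raw.decode("iso8859_8")  # Hebrew interpretation

print(as_latin1, as_hebrew)
print(f"U+{ord(as_latin1):04X}", f"U+{ord(as_hebrew):04X}")
```

The byte itself carries no record of which code page was intended, which is why the receiver needs that information from somewhere else.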

The problem with multiple encoding systems
Problem: Multiple coding systems map different characters to the same character code
• Solution 1: Provide meta-information on coding system
  - Ex. MIME (Multipurpose Internet Mail Extensions)
    Mime-Version: 1.0
    Content-Type: text/plain; charset=US-ASCII
    Content-Transfer-Encoding: 7bit
  - But what if your message contains characters from multiple coding systems?
• Solution 2: Have a single universal code system for all writing systems → UNICODE
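
For Solution 1, a receiving program can read the declared charset from the MIME headers and use it to decode the body. The sketch below uses Python's standard `email` package on the header lines from the slide. The message body "Hello world!" is an assumption added for illustration.

```python
from email import message_from_string

# Headers copied from the slide; the body line is an illustrative assumption.
raw_message = (
    "Mime-Version: 1.0\n"
    "Content-Type: text/plain; charset=US-ASCII\n"
    "Content-Transfer-Encoding: 7bit\n"
    "\n"
    "Hello world!\n"
)

msg = message_from_string(raw_message)
declared_charset = msg.get_content_charset()  # returned in lowercase

# Decode the body bytes according to the declared coding system.
body_bytes = msg.get_payload().encode("ascii")
text = body_bytes.decode(declared_charset)

print(declared_charset, repr(text))
```

A single charset parameter describes the whole body part, which is the limitation raised on the slide for text that mixes coding systems.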
Unicode
• A character encoding standard developed by the Unicode Consortium
• Provides a single representation for all the world's writing systems

→ "Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language."
  (http://www.unicode.org)

How big is Unicode?
• Version 9.0 (2016) has codes for 128,237 characters
• Full Unicode standard uses 32 bits (4 bytes): it can represent 2^32 = 4,294,967,296 characters!
• In reality, only 21 bits are needed

→ Unicode has three encoding forms
  - UTF-32 (32 bits/4 bytes): direct representation
  - UTF-16 (16 bits/2 bytes): 2^16 = 65,536 possibilities
  - UTF-8 (8 bits/1 byte): 2^8 = 256 possibilities
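
The "21 bits" figure follows from the largest code point Unicode allows, U+10FFFF. That value is not stated on the slide; it comes from the Unicode Standard and is exposed by Python as `sys.maxunicode`. A brief sketch:

```python
import sys

MAX_CODE_POINT = 0x10FFFF  # highest Unicode code point (not given on the slide)

bits_needed = MAX_CODE_POINT.bit_length()  # bits required to write the largest code point
slots_in_32_bits = 2 ** 32
slots_in_16_bits = 2 ** 16
slots_in_8_bits = 2 ** 8

print(bits_needed, sys.maxunicode == MAX_CODE_POINT)
print(slots_in_32_bits, slots_in_16_bits, slots_in_8_bits)
```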

8-bit, 16-bit, 32-bit
• UTF-32 (32 bits/4 bytes): direct representation
• UTF-16 (16 bits/2 bytes): 2^16 = 65,536 possibilities
• UTF-8 (8 bits/1 byte): 2^8 = 256 possibilities

• Wait! But how do you represent all of 2^32 (= 4 billion) code points with only one byte (UTF-8: 2^8 = 256 slots)?
  - You don't.
  - In reality, only 21 bits are ever utilized for 128K characters.
  - UTF-8 and UTF-16 use a variable-width encoding.

• Why UTF-16 and UTF-8?
  - They are more compact (more so for certain languages, e.g., English)
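
The compactness claim can be checked by encoding the same text in all three forms and counting bytes. The sample strings are assumptions chosen for illustration. The little-endian codec names (`utf-16-le`, `utf-32-le`) are used so that no byte-order mark is added to the counts.

```python
def encoded_sizes(text: str) -> dict:
    """Number of bytes needed for text in each Unicode encoding form (no byte-order mark)."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


samples = {
    "English": "Hello",
    "Korean": "\uc548\ub155",          # two Hangul syllables
    "Musical symbol": "\U0001D122",    # one character above U+FFFF
}

for label, text in samples.items():
    print(label, encoded_sizes(text))
```

For the English sample UTF-8 needs a quarter of the bytes UTF-32 does. For the Korean sample UTF-16 is the most compact of the three.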

Variable-width encoding
• 'H' as 1 byte (8 bits): 01001000
  cf. 'H' as 2 bytes (16 bits): 0000000001001000
• UTF-8 as a variable-width encoding
  - ASCII characters get encoded with just 1 byte
    * ASCII is originally 7-bit, so the highest bit is always 0 in an 8-bit encoding
  - All other characters are encoded with multiple bytes
  - How to tell? The highest bit is used as a flag.
    * Highest bit 0: single character
    * Highest bit 1: part of a multi-byte character
  Example (slide figure, labeled É): 01001000 11001001 10001000 01101001 01101001
• Advantage for English: 8-bit ASCII is already valid UTF-8!
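
The highest-bit flag can be observed by printing the UTF-8 bytes of a short string. This is a minimal sketch: `describe_utf8_bytes` is a hypothetical helper, and its distinction between lead bytes (starting with 11) and continuation bytes (starting with 10) goes one step beyond the slide. The test string "HÉi" is an illustrative choice.

```python
def describe_utf8_bytes(text: str) -> list:
    """Return (bit pattern, role) for every byte of the UTF-8 encoding of text."""
    rows = []
    for byte in text.encode("utf-8"):
        bits = format(byte, "08b")
        if byte < 0x80:       # 0xxxxxxx: highest bit 0
            role = "single-byte (ASCII)"
        elif byte < 0xC0:     # 10xxxxxx: inside a multi-byte character
            role = "continuation byte"
        else:                 # 11xxxxxx: first byte of a multi-byte character
            role = "lead byte"
        rows.append((bits, role))
    return rows


for bits, role in describe_utf8_bytes("H\u00c9i"):
    print(bits, role)

# ASCII text produces identical bytes under ASCII and UTF-8.
print("Hi!".encode("ascii") == "Hi!".encode("utf-8"))
```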
A look at Unicode chart
• How to find your Unicode character:
  - http://www.unicode.org/standard/where/
  - http://www.unicode.org/charts/

• Basic Latin (ASCII)
  - http://www.unicode.org/charts/PDF/U0000.pdf

Code point for M. But "004D"?

Another representation: hexadecimal
Hexadecimal (hex) = base-16
• Utilizes 16 characters: 0123456789ABCDEF
• Designed for human readability & easy byte conversion
  - 2^4 = 16: 1 hexadecimal digit is equivalent to 4 bits
  - 1 byte (= 8 bits) is encoded with just 2 hex chars!

Letter   Base-10 (decimal)   Base-2 (binary)        Base-16 (hex)
M        77                  0000 0000 0100 1101    004D

• Unicode characters are usually referenced by their hexadecimal code
  - Lower-number characters go by their 4-char hex codes (2 bytes), e.g. U+004D ("M", U+ designates Unicode)
  - Higher-number characters go by 5 or 6 hex digits, e.g. U+1D122 (http://www.unicode.org/charts/PDF/U1D100.pdf)
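
The decimal, binary, and hex columns of the table, and the U+ notation, can be produced with Python's format specifiers. `u_plus` is a hypothetical helper name used only for this sketch.

```python
def u_plus(ch: str) -> str:
    """Format a character's code point in U+ notation with at least 4 hex digits."""
    return f"U+{ord(ch):04X}"


code_point = ord("M")
as_binary = format(code_point, "016b")  # 16 bits, zero-padded
as_hex = format(code_point, "04X")      # 4 hex digits, zero-padded

print(code_point, as_binary, as_hex)
print(u_plus("M"), u_plus(chr(0x1D122)))

# Each hex digit corresponds to exactly 4 bits: 4 -> 0100, D -> 1101.
print(format(0x4, "04b"), format(0xD, "04b"))
```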
