Lecture 1: Encoding Language: LING 1330/2330: Introduction To Computational Linguistics Na-Rae Han

1) The document discusses different encoding systems for representing language on computers, from the basic binary representation to ASCII, ISO-8859, and Unicode. 2) ASCII uses 7-bit encoding for English text but other languages required extended 8-bit encodings like ISO-8859. However, these created conflicts due to overlapping character codes. 3) Unicode provides a single universal character encoding that assigns a unique code to every character across all languages, with various encoding forms like UTF-8, UTF-16, and UTF-32.

Lecture 1: Encoding Language

LING 1330/2330: Introduction to Computational Linguistics

Na-Rae Han
Objectives
• Understand the fundamentals of how language is encoded on a computer
• Text encoding systems
  • ASCII
  • ISO-8859
  • Unicode

1/12/2017 2
How is language represented on a computer?

• Natural ("human") languages
  • Spoken form
  • Written form
  • Also: sign languages
• The language of computers

The language of computers
• At the lowest level, computer language is binary: information on a computer is stored in bits
  • A bit is either ON (=1, =yes) or OFF (=0, =no)
  • This language essentially contains two alphabetic characters
• Next level up: byte
  • A byte is made up of a sequence of 8 bits
    • ex. 01001101
  • Historically, a byte was the number of bits used to encode a single character of text in a computer
  • The byte is the basic addressable unit in most computer architectures

Encoding a written language
• How to represent a text with 0s and 1s?
  • Hello world!
  • 010010000110010101101100011011000110111100100000011101110110111101110010011011000110010000100001
• Each character is mapped to a code point (= character code), i.e., a unique integer.
  • H → 72 (decimal)
  • e → 101 (decimal)
• Each code point is represented as a binary number, using a fixed number of bits.
  • 8 bits = 1 byte in the example above
  • H → 72 (decimal) → 01001000 (2^6 + 2^3 = 64 + 8 = 72)
  • e → 101 (decimal) → 01100101 (2^6 + 2^5 + 2^2 + 2^0 = 64 + 32 + 4 + 1 = 101)
• One byte can represent 256 (= 2^8) different characters
  • 00000000 → 0 (decimal); 11111111 → 255 (decimal)
ASCII encoding for English
• How many bits are needed to encode English?
  • 26 lowercase letters: a, b, c, d, e, …
  • 26 uppercase letters: A, B, C, D, E, …
  • 10 Arabic digits: 0, 1, 2, 3, 4, …
  • Punctuation: . , : ; ? ! ' "
  • Symbols: ( ) < > & % * $ + -
  • We are already up to 80
  • 6 bits (2^6 = 64) is not enough; we will need at least 7 (2^7 = 128)
• ASCII (the American Standard Code for Information Interchange) did just that
  • Uses a 7-bit code (= 128 characters) for storing English text
  • Range: 0 to 127
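The counting argument above can be checked mechanically. The sketch below tallies the inventory listed on the slide and computes the smallest bit width that covers it; `bits_needed` and `inventory` are illustrative names, not part of any library.

```python
def bits_needed(n_symbols: int) -> int:
    """Smallest b such that 2**b >= n_symbols."""
    return max(1, (n_symbols - 1).bit_length())


# Character inventory taken from the slide's list.
inventory = {
    "lowercase": 26,
    "uppercase": 26,
    "digits": 10,
    "punctuation": len(".,:;?!'\""),
    "symbols": len("()<>&%*$+-"),
}

total = sum(inventory.values())
print(total, "characters need", bits_needed(total), "bits")
print("6 bits cover", 2**6, "codes; 7 bits cover", 2**7)
```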

The ASCII chart
• https://en.wikipedia.org/wiki/ASCII
• http://web.alfredstate.edu/weimandn/miscellaneous/ascii/ASCII%20Conversion%20Chart.pdf

Decimal   Binary (7-bit)   Character
0         000 0000         (NULL)
…         …                …
35        010 0011         #
36        010 0100         $
…         …                …
48        011 0000         0
49        011 0001         1
50        011 0010         2
…         …                …
65        100 0001         A
66        100 0010         B
67        100 0011         C
…         …                …
97        110 0001         a
98        110 0010         b
99        110 0011         c
…         …                …
127       111 1111         (DEL)
ASCII (the American Standard Code for Information
Interchange)

• The ASCII encoding scheme
  • First published in 1963
  • Uses a 7-bit code (= 128 characters) for storing English text, ranging from 0 to 127
  • In an 8-bit (1-byte) representation, the highest bit is always 0
• Printable characters
  • Upper- and lower-case Roman alphabet
  • Digits
  • Punctuation marks, symbols, and space
• Includes 33 non-printing characters (codes 0 to 31, plus 127)
  • Control characters: BELL, ACKNOWLEDGE, BACKSPACE, DELETE, etc. → originally for typewriters, many obsolete now
  • Whitespace characters: TAB, LINE FEED, CARRIAGE RETURN
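As a rough check on this breakdown, the 128 ASCII codes can be split into control and printable ranges. Counting codes 0 to 31 together with DELETE (127) gives 33 non-printing codes, leaving 95 printable ones (space included). The `named` dictionary is a small hand-picked sample for illustration, not a complete table.

```python
control = [code for code in range(128) if code < 32 or code == 127]
printable = [code for code in range(128) if 32 <= code < 127]

# A few of the control codes named on the slide.
named = {6: "ACKNOWLEDGE", 7: "BELL", 8: "BACKSPACE", 9: "TAB",
         10: "LINE FEED", 13: "CARRIAGE RETURN", 127: "DELETE"}

print(len(control), "control codes,", len(printable), "printable codes")
for code, name in named.items():
    print(f"{code:3d}  {code:08b}  {name}")

# In an 8-bit representation the highest bit of every ASCII code is 0.
print(all(code & 0b10000000 == 0 for code in range(128)))
```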

Practice
• What is this English text?
  • Note: byte (= 8-bit) ASCII representation instead of 7-bit
  • Spaces are provided for your convenience only!

01001000 01101001 00100001

• Answer: Hi!
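The practice item can also be solved programmatically. A minimal sketch, assuming the bytes are separated by spaces as on the slide; `bits_to_text` is an illustrative helper name.

```python
def bits_to_text(bit_string: str) -> str:
    """Decode space-separated 8-bit groups as ASCII text."""
    values = [int(group, 2) for group in bit_string.split()]
    return bytes(values).decode("ascii")


print(bits_to_text("01001000 01101001 00100001"))
```

Decoding with the `"ascii"` codec raises `UnicodeDecodeError` if any byte has its highest bit set, which matches the 7-bit range of ASCII.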

Extending ASCII: ISO-8859, etc.
• ASCII (= 7 bits, 128 characters) was sufficient for encoding English. But what about characters used in other languages?
• Solution: extend ASCII to 8 bits (= 256 characters) and use the additional 128 slots for non-English characters
  • ISO-8859: has 16 different implementations!
    • ISO-8859-1, aka Latin-1: French, German, Spanish, etc.
    • ISO-8859-7: Greek alphabet
    • ISO-8859-8: Hebrew alphabet
  • JIS X 0208: Japanese characters
• Problem: overlapping character code space.
  224 (decimal) means à in Latin-1 but א in ISO-8859-8!
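The overlap is easy to observe by decoding one and the same byte under different ISO-8859 parts. The sketch below uses the codec names shipped with Python's standard library (`latin-1`, `iso8859_7`, `iso8859_8`).

```python
raw = bytes([224])  # a single byte with value 224 (binary 11100000)

for codec in ("latin-1", "iso8859_7", "iso8859_8"):
    char = raw.decode(codec)
    print(f"{codec:10s} -> {char}  (U+{ord(char):04X})")

# Plain 7-bit ASCII has no character at all for this byte.
try:
    raw.decode("ascii")
except UnicodeDecodeError as err:
    print("ascii      ->", err.reason)
```

The byte itself carries no hint about which table to use, so a reader that guesses the wrong one silently displays the wrong character.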

The problem with multiple encoding systems
Problem: Multiple coding systems map different characters to the same character code
• Solution 1: Provide meta-information on the coding system
  • Ex. MIME (Multipurpose Internet Mail Extensions)
      Mime-Version: 1.0
      Content-Type: text/plain; charset=US-ASCII
      Content-Transfer-Encoding: 7bit
  • But what if your message contains characters from multiple coding systems?
• Solution 2: Have a single universal code system for all writing systems → UNICODE
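To see why declaring a single legacy charset is not enough for mixed-script text, one can try encoding a string that combines a Latin-1 letter with a Hebrew letter. The sample string and the helper `can_encode` are assumptions chosen for illustration.

```python
def can_encode(text: str, codec: str) -> bool:
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


mixed = "\u00e0 \u05d0"  # à (in Latin-1) followed by א (in ISO-8859-8)

for codec in ("latin-1", "iso8859_8", "utf-8"):
    print(f"{codec:10s} can encode the mixed text: {can_encode(mixed, codec)}")
```

Neither 8-bit codec covers both letters, while a Unicode encoding such as UTF-8 handles them in one message.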
Unicode
• A character encoding standard developed by the Unicode Consortium
• Provides a single representation for all the world's writing systems
• "Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language." (http://www.unicode.org)

How big is Unicode?
• Version 9.0 (2016) has codes for 128,237 characters
• The full Unicode standard uses 32 bits (4 bytes): it can represent 2^32 = 4,294,967,296 characters!
  • In reality, only 21 bits are needed (code points run from U+0000 to U+10FFFF)
• Unicode has three encoding forms
  • UTF-32 (32 bits/4 bytes): direct representation
  • UTF-16 (16 bits/2 bytes): 2^16 = 65,536 possibilities
  • UTF-8 (8 bits/1 byte): 2^8 = 256 possibilities
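The "only 21 bits" figure follows from the highest code point Unicode defines, U+10FFFF. A short sketch of the arithmetic, using Python's `sys.maxunicode` and `int.bit_length`; the constant name `MAX_CODE_POINT` is illustrative.

```python
import sys

MAX_CODE_POINT = 0x10FFFF  # highest code point in the Unicode code space

print(sys.maxunicode == MAX_CODE_POINT)
print(MAX_CODE_POINT.bit_length(), "bits are enough for every code point")
print(f"{MAX_CODE_POINT + 1:,} code points fit in the code space")
print(f"{2**32:,} values fit in a 32-bit unit")

# Anything beyond the code space is rejected outright.
try:
    chr(MAX_CODE_POINT + 1)
except ValueError as err:
    print("chr() refused:", err)
```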

8-bit, 16-bit, 32-bit
• UTF-32 (32 bits/4 bytes): direct representation
• UTF-16 (16 bits/2 bytes): 2^16 = 65,536 possibilities
• UTF-8 (8 bits/1 byte): 2^8 = 256 possibilities
• Wait! But how do you represent all 2^32 (= 4 billion) code points with only one byte (UTF-8: 2^8 = 256 slots)?
  • You don't.
  • In reality, only 21 bits are ever utilized for 128K characters.
  • UTF-8 and UTF-16 use a variable-width encoding.
• Why UTF-16 and UTF-8?
  • They are more compact (more so for certain languages, e.g., English)
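The compactness trade-off can be measured by encoding the same text three ways and counting bytes. The sample strings are assumptions picked for illustration, and the big-endian codec variants (`utf-16-be`, `utf-32-be`) are used so that no byte-order mark is added to the counts.

```python
ENCODINGS = ("utf-8", "utf-16-be", "utf-32-be")


def encoded_sizes(text: str) -> dict:
    """Number of bytes the text occupies in each Unicode encoding form."""
    return {enc: len(text.encode(enc)) for enc in ENCODINGS}


samples = {
    "English": "Hello",
    "Korean": "한국어",
    "Musical symbol": "\U0001D122",
}

for label, text in samples.items():
    print(f"{label:15s} {len(text)} code point(s): {encoded_sizes(text)}")
```

English text is smallest in UTF-8, the Korean sample is smallest in UTF-16, and UTF-32 always spends 4 bytes per code point.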

Variable-width encoding
• 'H' as 1 byte (8 bits): 01001000
  cf. 'H' as 2 bytes (16 bits): 0000000001001000
• UTF-8 as a variable-width encoding
  • ASCII characters get encoded with just 1 byte
    • ASCII was originally 7-bit, so the highest bit is always 0 in an 8-bit encoding
  • All other characters are encoded with multiple bytes
  • How to tell? The highest bit is used as a flag.
    • Highest bit 0: single-byte character
    • Highest bit 1: part of a multi-byte character
    • Example, "HÉii": 01001000 11000011 10001001 01101001 01101001 (É takes the two bytes 11000011 10001001)
• Advantage for English: 8-bit ASCII is already valid UTF-8!
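The high-bit flag can be inspected directly on encoded bytes. In this sketch the string "HÉii" is an assumed example, `classify` is an illustrative helper, and the distinction between lead bytes (starting with 11) and continuation bytes (starting with 10) is a UTF-8 detail that goes slightly beyond the slide.

```python
def classify(byte: int) -> str:
    """Describe the role of one byte in a UTF-8 stream."""
    if byte < 0x80:      # highest bit 0
        return "single-byte (ASCII) character"
    if byte >= 0xC0:     # starts with 11
        return "first byte of a multi-byte character"
    return "continuation byte"  # starts with 10


encoded = "H\u00c9ii".encode("utf-8")  # "HÉii"
for byte in encoded:
    print(format(byte, "08b"), classify(byte))

# Pure ASCII text produces identical bytes under both codecs.
print("Hi!".encode("ascii") == "Hi!".encode("utf-8"))
```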
A look at Unicode chart
• How to find your Unicode character:
  • http://www.unicode.org/standard/where/
  • http://www.unicode.org/charts/
• Basic Latin (ASCII)
  • http://www.unicode.org/charts/PDF/U0000.pdf

Code point for M. But why "004D"?

Another representation: hexadecimal
Hexadecimal (hex) = base-16
• Utilizes 16 characters: 0123456789ABCDEF
• Designed for human readability & easy byte conversion
  • 2^4 = 16: 1 hexadecimal digit is equivalent to 4 bits
  • 1 byte (= 8 bits) is encoded with just 2 hex chars!

Letter   Base-10 (decimal)   Base-2 (binary)        Base-16 (hex)
M        77                  0000 0000 0100 1101    004D

• Unicode characters are usually referenced by their hexadecimal code
  • Lower-number characters go by their 4-char hex codes (2 bytes), e.g. U+004D ("M"; U+ designates Unicode)
  • Higher-number characters go by 5 or 6 hex digits, e.g. U+1D122 (http://www.unicode.org/charts/PDF/U1D100.pdf)
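The decimal, binary, and hex columns for "M" can be derived from one another with Python's built-in formatting. A small sketch; `u_plus` is an illustrative helper that writes a character's code point in the U+ notation.

```python
def u_plus(char: str) -> str:
    """Format a character's code point as U+XXXX (at least 4 hex digits)."""
    return f"U+{ord(char):04X}"


code = ord("M")
print(code)                   # base 10
print(format(code, "016b"))   # base 2, padded to 16 bits
print(format(code, "04X"))    # base 16, padded to 4 digits
print(int("4D", 16))          # hex back to decimal

print(u_plus("M"))
print(u_plus(chr(0x1D122)))   # a code point that needs 5 hex digits
```

Each hex digit stands for exactly 4 bits, so the 16-bit pattern for M splits into the four groups 0000, 0000, 0100, 1101, which read as 0, 0, 4, D.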

