Machine Level Representation of Data Character Representation

Machine-level representation of data can be summarized as follows: 1) Bits are used to represent characters, logical values, colors, and other data in computers. The number of bits needed depends on how many distinct values must be represented: 5 bits suffice for 26 letters, 7 bits for the 128 ASCII characters, and 8, 16, or 32 bits per code unit for the Unicode encodings. 2) ASCII was the original standard character encoding; it uses 7 bits to represent 128 characters, including the English alphabet, digits, and punctuation. It was later extended to 8-bit encodings to cover additional languages. 3) Beyond ASCII, character representation evolved to multi-byte encodings and then to Unicode, which can support the thousands of characters in languages such as Chinese, Japanese, and Korean, in some cases using variable-length encodings.

Uploaded by

Renchie Gonzales

Machine level representation of data

Character representation
Bits can represent anything!
• Characters?
  • 26 letters → 5 bits (2^5 = 32)
  • upper/lower case + punctuation → 7 bits (stored in 8) (“ASCII”)
  • standard code to cover all the world’s languages → 8, 16, or 32 bits (“Unicode”)
• Logical values?
  • 0 → False, 1 → True
• Colors? Example: Red (100), Green (010), Blue (001)
• Locations / addresses? Commands?

• MEMORIZE: N bits → at most 2^N things
Character representation in computers and devices
ASCII → ANSI → Multi-byte → Unicode

American Standard Code for Information Interchange

• This was the de facto worldwide standard for the code numbers
used by computers to represent all the upper- and lower-case Latin
letters, digits, punctuation, etc. How many bits do we need to
represent 5 letters?
• How many bits do we need to represent
4 letters (A, B, C, D)? 2 bits: 2^2 = 4 patterns
• How many bits for
26 English uppercase letters? 5 bits: 2^5 = 32 patterns

• 26 English lowercase letters: 5 bits, 2^5 = 32 patterns

• Decimal digits and special signs: 5 bits, 2^5 = 32 patterns

• Special control characters: 5 bits, 2^5 = 32 patterns

How many bits do we need to represent all 4 × 32 = 128
patterns? 7 bits: 2^7 = 128 patterns
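
The bit counts on this slide all follow from the rule "N bits give at most 2^N patterns". As a minimal sketch (the function name bits_needed is an assumption made for this illustration, not something from the slides), the smallest sufficient N can be computed like this:

```python
def bits_needed(n_things: int) -> int:
    """Return the smallest N such that 2**N >= n_things."""
    if n_things < 1:
        raise ValueError("need at least one thing to represent")
    # For n >= 1, (n - 1).bit_length() equals ceil(log2(n)),
    # computed with integers only (no floating-point rounding).
    return (n_things - 1).bit_length()


# The cases asked about on the slide.
for n in (4, 5, 26, 128):
    print(n, "patterns need", bits_needed(n), "bits")
```

Note that 5 letters already need 3 bits, because 2 bits give only 4 patterns.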
The content of the ASCII table
• Contains 128 characters
• ASCII needs only 7 bits per character (“A” is 100 0001 in binary)
• The first printable character is SP (space) and corresponds to the bit
pattern 010 0000 (0x20).
• The characters A and B correspond to
A: 100 0001 (0x41)   B: 100 0010 (0x42)
Find the ASCII code of “z”: z is 111 1010 (0x7A)
ASCII chart: rows give the 16's place (high hex digit), columns give the 1's place (low hex digit)

    0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
0   NUL SOH STX ETX EOT ENQ ACK BEL BS  HT  LF  VT  FF  CR  SO  SI
1   DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM  SUB ESC FS  GS  RS  US
2   SP  !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /
3   0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?
4   @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
5   P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _
6   `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o
7   p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~   DEL
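
To check a chart lookup such as the code of “z”, a short sketch can print a character's 7-bit pattern, its hex code, and its chart position. The helper name describe is hypothetical and only serves this illustration:

```python
def describe(ch: str) -> str:
    """Describe one ASCII character: 7-bit pattern, hex code, chart row and column."""
    code = ord(ch)
    if len(ch) != 1 or code > 0x7F:
        raise ValueError("expected a single ASCII character")
    # The 16's place is the chart row, the 1's place is the chart column.
    row, col = divmod(code, 16)
    return f"{ch!r}: binary {code:07b}, hex 0x{code:02X}, row {row:X}, column {col:X}"


for ch in (" ", "A", "B", "z"):
    print(describe(ch))
```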
Control characters
• Control characters are not displayed or printed.
Instead, they control the device.
• These control characters can have different meanings on different devices.
Examples: 0x0A – Line Feed (LF), 0x0D – Carriage Return (CR)
ASCII code: Advantages, Disadvantages

Advantages of ASCII
• The 26 letter codes are contiguous. Why is this an advantage?
• Uppercase letters can be converted to lowercase and back by flipping one bit. How?
• The codes for the 10 digits are easily derived from the values of the digits. How?
• No other standard is as prevalent or as ingrained in our keyboards, video displays,
system hardware, printers, font files, operating systems, and the Internet.
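
One way to answer the "How?" questions above is with a short sketch. The function names are assumptions made for this example. The arithmetic relies only on the ASCII codes themselves: 'A' is 0x41, 'a' is 0x61, and '0' is 0x30.

```python
CASE_BIT = 0x20  # the single bit that differs between 'A' (0x41) and 'a' (0x61)


def flip_case(ch: str) -> str:
    """Toggle the case of one ASCII letter by flipping a single bit."""
    if not (ch.isascii() and ch.isalpha()):
        return ch  # leave digits, punctuation, and non-ASCII characters alone
    return chr(ord(ch) ^ CASE_BIT)


def digit_value(ch: str) -> int:
    """Numeric value of an ASCII digit: its code minus the code of '0'."""
    if not ("0" <= ch <= "9"):
        raise ValueError("expected an ASCII digit")
    return ord(ch) - ord("0")


def letter_index(ch: str) -> int:
    """Position of an ASCII letter in the alphabet (A = 0); works because the codes are contiguous."""
    if not (ch.isascii() and ch.isalpha()):
        raise ValueError("expected an ASCII letter")
    return ord(ch.upper()) - ord("A")


print(flip_case("a"), flip_case("Z"), digit_value("7"), letter_index("d"))
```

Because the letter codes are contiguous, range checks and alphabetical comparisons reduce to simple integer comparisons.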

The history of ASCII since 1967 is mostly a history of attempts to overcome its
limitations and make it more applicable to languages other than American English.
ASCII code extension. Character sets.
Extended ASCII
• By the time the early small computers were being developed, the 8-bit byte had been
firmly established.
• Thus, if a byte were used to store characters, 128 additional characters could be
invented to supplement ASCII.
• When the original IBM PC was introduced in 1981, the video adapters included a
ROM-based character set of 256 characters, which in itself was to become an
important part of the IBM standard.

ANSI
• The native Windows character set was called the "ANSI character set" because it was
based on a draft ANSI and ISO standard (ANSI/ISO 8859-1-1987).
• There are 256 possible characters in ANSI.

[Figure: three 8-bit, 256-character sets (ANSI, IBM, and another such as Armenian). Each consists of the 128 7-bit ASCII characters plus 128 extra characters.]
1 byte – 256 characters: code pages
• First IBM (for PCs), then Microsoft, and then other companies started to create their own
256-character sets.
• Thus code pages appeared (Latin-1, Cyrillic, Armenian, …).

Advantage: Fully supports 2 languages – English and the chosen one.

Disadvantages:
• Documents created with different code pages, even for the same language
(Cyrillic or Armenian), are not compatible. The code page must be “attached” to the
document so that it can be opened correctly on another computer.
• A single byte cannot accommodate the ideographic symbols of Chinese, Japanese, and Korean,
which number about 21,000. The solution is to dedicate 2 bytes to them.

ANSI and similar code pages need 8 bits per character, for example:
“A” – 0x41 – 0100 0001, “Ա” – 0xB2 – 1011 0010, “Б” – 0xE2 – 1110 0010 (binary).
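
The incompatibility between code pages can be seen by decoding the same bytes under different code pages. The sketch below uses Python's built-in codecs. The slides do not say which Cyrillic code page places “Б” at 0xE2; KOI8-R is one that does, so it is used here as an assumption. The helper name decode_with is hypothetical.

```python
def decode_with(raw: bytes, code_page: str) -> str:
    """Interpret the same raw bytes under a given 8-bit code page."""
    return raw.decode(code_page)


raw = bytes([0x41, 0xE2])  # 'A' followed by the byte 0xE2

# The ASCII half (0x41) is the same everywhere; the upper half (0xE2) is not.
for code_page in ("koi8_r", "cp1251", "latin_1"):
    print(code_page, "->", decode_with(raw, code_page))
```

The first byte decodes to “A” in every case, while 0xE2 becomes a different character each time. That is why a document needs its code page attached.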
[Figure: five 256-character code pages side by side (ANSI, IBM, Armenian code page #1, Armenian code page #2, Cyrillic code page #1). All share the same 128 ASCII characters and differ in their 128 extra characters.]
Double (or multi) byte character sets

Language               Character set name                   Code page  Lead-byte ranges       Trail-byte ranges
Chinese (Simplified)   GB 2312-80                           CP 936     0xA1–0xFE              0xA1–0xFE
Chinese (Traditional)  Big-5                                CP 950     0x81–0xFE              0x40–0x7E, 0xA1–0xFE
Japanese               Shift-JIS (Japan Industry Standard)  CP 932     0x81–0x9F, 0xE0–0xFC   0x40–0xFC (except 0x7F)
Korean (Wansung)       KS C-5601-1987                       CP 949     0x81–0xFE              0x41–0x5A, 0x61–0x7A, 0x81–0xFE
Korean (Johab)         KS C-5601-1992                       CP 1361    0x84–0xD3, 0xD8,       0x41–0x7E, 0x81–0xFE
                                                                       0xD9–0xDE, 0xE0–0xF9   (Government standard: 0x31–0x7E, 0x41–0xFE)

[Figure: a Chinese code page, showing the 128 ASCII characters, a lead-byte code page, and a trail-byte code page.]

• A DBCS starts off with 256 codes.
• Like any well-behaved code page, the first 128 of these codes are ASCII.
• However, some of the codes in the higher 128 are always followed by a
second byte. The two bytes together (called a lead byte and a trail
byte) define a single character, usually a complex ideograph.
Double (or multi) byte character sets

• Advantage: A DBCS makes it possible to create code pages for languages that have more
than 256 letters or signs.

• Disadvantages:

• Documents created with different code pages, even for the
same language (Cyrillic or Armenian), are still not compatible.

• The problem with a double-byte character set is not that characters
are represented by 2 bytes. The problem is that some characters (in
particular, the ASCII characters) are represented by 1 byte. This
creates odd programming problems. For example, the number of
characters in a character string cannot be determined from the byte
size of the string.
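
The counting problem can be illustrated with Shift-JIS, whose lead-byte ranges (0x81–0x9F and 0xE0–0xFC) are listed in the DBCS table. This is a minimal sketch under that assumption; the function names are invented for the example, and it assumes well-formed input rather than validating trail bytes.

```python
def is_lead_byte(b: int) -> bool:
    """Shift-JIS (CP 932) lead-byte ranges as given in the DBCS table."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def count_dbcs_chars(data: bytes) -> int:
    """Count characters by walking the bytes: a lead byte consumes two bytes, any other byte one."""
    count = 0
    i = 0
    while i < len(data):
        i += 2 if is_lead_byte(data[i]) else 1
        count += 1
    return count


text = "A水B"                     # two ASCII letters around one ideograph
data = text.encode("shift_jis")  # Python's built-in Shift-JIS codec
print("bytes:", len(data), "characters:", count_dbcs_chars(data))
```

The byte length and the character count differ, so the string has to be scanned from the start to find character boundaries.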
Unicode

• The best thing about Unicode is that there's only one
character set. There's simply no ambiguity.

• The representation has enough bits to
accommodate all languages and signs.

• Flavors – UTF-8, UTF-16, UTF-32.

Code point – the symbol’s identifier

Unicode UTF-8 (a multi-byte encoding)
Code point – the symbol’s identifier

Bits  Last code point  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
7     U+007F           0xxxxxxx
11    U+07FF           110xxxxx  10xxxxxx
16    U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
21    U+1FFFFF         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
26    U+3FFFFFF        111110xx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
31    U+7FFFFFFF       1111110x  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx

• Single byte: the header bit is always 0.
• Leading byte of a multi-byte sequence: the number of “1”s in the header bits equals the number of bytes in the sequence.
• Continuation bytes of a multi-byte sequence: the header bits are always 10.
• The 5- and 6-byte forms belong to the original UTF-8 design. Since RFC 3629 (2003), UTF-8 is restricted to code points up to U+10FFFF, so at most 4 bytes are used.

The Armenian character set uses the code points U+0530 through U+058F.
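
The bit layout in the UTF-8 table can be turned into a small encoder. The sketch below covers the 1- to 4-byte forms only, which is all that modern UTF-8 permits. The function name utf8_encode is an assumption for this illustration; in practice Python's own str.encode("utf-8") does this work.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by filling the x bits of the table."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp <= 0x7F:      # 7 bits:  0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:     # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:    # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Armenian capital letter Ayb, U+0531, falls in the 2-byte range.
print(utf8_encode(0x0531).hex(" "))
```

Comparing the result with chr(cp).encode("utf-8") is a quick way to confirm that the hand-written version matches the built-in encoder.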

Unicode UTF-16
• The Unicode code space is divided into seventeen PLANES of 2^16 (65,536) code
points each.
• The code points in each plane have the hexadecimal values xx0000 to xxFFFF,
where xx is a hex value from 00 to 10.
• The first plane (code points U+0000 to U+D7FF and U+E000 to U+FFFF) is the Basic
Multilingual Plane (BMP), which holds the most frequently used characters.
• Code points U+D800 to U+DFFF are reserved for surrogates: UTF-16 combines two of
them (a surrogate pair) to encode code points beyond the BMP.

Examples

Code point  Glyph   Character                                                   UTF-16 code units (hex)  UTF-16BE bytes (hex)  UTF-16LE bytes (hex)
U+007A      z       LATIN SMALL LETTER Z                                        007A                     00, 7A                7A, 00
U+6C34      水      CJK UNIFIED IDEOGRAPH-6C34 (water)                          6C34                     6C, 34                34, 6C
U+10000     𐀀       LINEAR B SYLLABLE B008 A (first non-BMP code point)         D800, DC00               D8, 00, DC, 00        00, D8, 00, DC
U+1D11E     𝄞       MUSICAL SYMBOL G CLEF                                       D834, DD1E               D8, 34, DD, 1E        34, D8, 1E, DD
U+10FFFD    (none)  PRIVATE USE CHARACTER-10FFFD (last private-use code point)  DBFF, DFFD               DB, FF, DF, FD        FF, DB, FD, DF
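
The code units in the examples table can be reproduced with the standard surrogate-pair arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. This is a sketch with hypothetical function names, not a replacement for a real codec.

```python
def utf16_code_units(cp: int) -> list:
    """Return the UTF-16 code units (as integers) for one code point."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp <= 0xFFFF:
        return [cp]                   # BMP: a single 16-bit code unit
    v = cp - 0x10000                  # 20-bit offset above the BMP
    high = 0xD800 | (v >> 10)         # top 10 bits -> high (lead) surrogate
    low = 0xDC00 | (v & 0x3FF)        # bottom 10 bits -> low (trail) surrogate
    return [high, low]


def utf16_bytes(cp: int, byteorder: str = "big") -> bytes:
    """Serialize the code units as UTF-16BE ("big") or UTF-16LE ("little")."""
    return b"".join(u.to_bytes(2, byteorder) for u in utf16_code_units(cp))


for cp in (0x007A, 0x6C34, 0x10000, 0x1D11E, 0x10FFFD):
    units = " ".join(f"{u:04X}" for u in utf16_code_units(cp))
    print(f"U+{cp:04X}: {units}")
```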
Unicode

Advantages:

• Supports all languages and signs with a single code
page.

• UTF-8 is backward compatible with ASCII.

• The UTF-16 Basic Multilingual Plane (one 16-bit code unit, 2 bytes) and
UTF-32 (4 bytes) use a fixed number of bytes per character, so such text
can be processed like regular fixed-width text.
Unicode
Disadvantages:

• With fixed-width units of 2 or 4 bytes, UTF-16 and UTF-32
character strings occupy twice or four times as much
memory as ASCII strings.

UTF-8 characters have variable length.

• This is an advantage: text takes less space than in the fixed-width
Unicode encodings.

• And it is a disadvantage for programming.

• For example, the number of characters in a character string
cannot be determined from the byte size of the string.
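
The size trade-off can be measured directly with Python's built-in codecs. In this sketch the helper name encoded_sizes is an assumption, and the little-endian codec variants are chosen so that no byte-order mark is added to the counts.

```python
def encoded_sizes(text: str) -> dict:
    """Byte size of the same text in the three Unicode encoding forms."""
    return {codec: len(text.encode(codec))
            for codec in ("utf-8", "utf-16-le", "utf-32-le")}


for sample in ("hello", "z水"):
    print(repr(sample), len(sample), "characters:", encoded_sizes(sample))
```

For pure ASCII text, UTF-8 is as compact as ASCII while UTF-16 and UTF-32 double and quadruple the size. For mixed text, the UTF-8 byte count no longer equals the character count.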
