Machine Level Representation of Data: Character Representation
Character representation
Bits can represent anything!!
• Characters?
•26 letters: 5 bits (2^5 = 32)
•upper/lower case + punctuation: 7 bits (in 8) (“ASCII”)
•standard code to cover all the world’s languages: 8, 16, 32 bits (“Unicode”)
• Logical values?
•0 False, 1 True
• Colors? Ex: Red (100), Green (010), Blue (001)
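The bit counts in the list above follow from one rule: n bits give 2^n distinct patterns. A minimal Python sketch of that arithmetic (the helper name `bits_needed` is made up for illustration):

```python
def bits_needed(symbol_count: int) -> int:
    """Smallest number of bits whose 2**bits patterns cover symbol_count symbols."""
    if symbol_count < 1:
        raise ValueError("need at least one symbol")
    # (n - 1).bit_length() is the exact integer form of ceil(log2(n)).
    return max(1, (symbol_count - 1).bit_length())


for label, count in [("letters", 26), ("ASCII codes", 128), ("logical values", 2)]:
    print(f"{count} {label}: {bits_needed(count)} bits")
```

For 26 letters this gives 5 bits, which leaves 32 - 26 = 6 patterns unused.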
ASCII code table (rows: 16's place, columns: 1's place)
     0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
0    NUL SOH STX ETX EOT ENQ ACK BEL BS  HT  LF  VT  FF  CR  SO  SI
1    DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM  SUB ESC FS  GS  RS  US
2    SP  !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /
3    0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?
4    @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
5    P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _
6    `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o
7    p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~   DEL
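The table is indexed by the two hexadecimal digits of a character's code: the 16's place selects the row and the 1's place selects the column. A small Python sketch of that lookup (the function name `table_position` is hypothetical):

```python
def table_position(ch: str) -> tuple[int, int]:
    """Return (16's place, 1's place) of a character's ASCII code."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("not an ASCII character")
    # Dividing by 16 splits the code into its high and low hex digits.
    return divmod(code, 16)


for ch in ("A", "a", "5"):
    row, col = table_position(ch)
    print(f"{ch!r}: code 0x{ord(ch):02X} -> row {row:X}, column {col:X}")
```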
ASCII code Advantages, Disadvantages
Advantages of ASCII
The 26 letter codes are contiguous. Why is this an advantage?
Uppercase letters can be converted to lowercase and back by flipping one bit. How?
The codes for the 10 digits are easily derived from the value of the digits. How?
No other standard is as prevalent or as ingrained in our keyboards, video displays,
system hardware, printers, font files, operating systems, and the Internet.
The history of ASCII since 1967 is mostly a history of attempts to overcome its
limitations and make it more applicable to languages other than American English.
Letter rows of the ASCII table (rows: 16's place, columns: 1's place)
     0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
4    @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
5    P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _
6    `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o
7    p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~   DEL
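One way to answer the "Why?" and "How?" questions on the advantages slide is to try them in code. The Python sketch below is illustrative only; the helper names are invented, and it relies on nothing beyond the code values shown in the ASCII table ('A' = 0x41, 'a' = 0x61, '0' = 0x30).

```python
CASE_BIT = 0x20  # the only bit that differs between 'A' (0x41) and 'a' (0x61)


def flip_case(ch: str) -> str:
    """Toggle the case of an ASCII letter by flipping one bit."""
    if not (ch.isascii() and ch.isalpha()):
        return ch  # leave digits, punctuation, and non-ASCII text alone
    return chr(ord(ch) ^ CASE_BIT)


def digit_value(ch: str) -> int:
    """'0'..'9' are 0x30..0x39, so the low four bits are the digit's value."""
    if not ("0" <= ch <= "9"):
        raise ValueError("not an ASCII digit")
    return ord(ch) & 0x0F


def letter_index(ch: str) -> int:
    """Contiguous codes turn 'position in the alphabet' into a subtraction."""
    if not (ch.isascii() and ch.isalpha()):
        raise ValueError("not an ASCII letter")
    return ord(ch.upper()) - ord("A")


print(flip_case("A"), flip_case("z"))  # case toggled with one XOR
print(digit_value("7"))                # numeric value of the digit character
print(letter_index("c"))               # 0-based position in the alphabet
```

Contiguity also means that sorting or range-checking letters needs only integer comparisons.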
ASCII code Extension. Character sets.
Extended ASCII
By the time the early small computers were being developed, the 8-bit byte had been
firmly established.
Thus, if a byte were used to store characters, 128 additional characters could be
invented to supplement ASCII.
When the original IBM PC was introduced in 1981, the video adapters included a
ROM-based character set of 256 characters, which in itself was to become an
important part of the IBM standard.
ANSI
The native Windows character set was called the "ANSI character set" because it was
based on a draft ANSI and ISO standard (ANSI/ISO 8859-1-1987).
An 8-bit ANSI code allows 256 possible characters.
1 byte – 256 characters. Code pages
Code pages
First IBM (for PCs), then Microsoft, and then other companies started to create their
own 256-character sets.
This is how code pages (Latin1, Cyrillic, Armenian, …) appeared.
ANSI and similar code pages need 8 bits for character representation
(“A” – 0x41 – 0100 0001₂) (“Ա” – 0xB2 – 1011 0010₂) (“Б” – 0xE2 – 1110 0010₂)
[Figure: 256-character tables side by side: ANSI, IBM, Armenian code page #1, Armenian code page #2, Cyrillic code page #1]
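Because every code page reuses the same 256 byte values, a byte above 0x7F has no meaning until a code page is chosen. The Python sketch below decodes the byte 0xE2 with three codecs from the standard library. The choice of codecs is an assumption made for illustration: KOI8-R happens to map 0xE2 to Б as in the example above, but the slides do not say which Cyrillic code page they use.

```python
def decode_byte(value: int, codec: str) -> str:
    """Interpret a single byte under the given 8-bit code page."""
    return bytes([value]).decode(codec)


# The same byte, three different characters.
for codec in ("latin_1", "cp1251", "koi8_r"):
    print(f"0xE2 in {codec}: {decode_byte(0xE2, codec)}")

# Codes below 0x80 are plain ASCII in all three code pages.
for codec in ("latin_1", "cp1251", "koi8_r"):
    print(f"0x41 in {codec}: {decode_byte(0x41, codec)}")
```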
Double (or multi) byte character sets
• Like any well-behaved code page, the first 128 codes of a double-byte character
set (DBCS) are ASCII.
• However, some of the codes in the upper 128 are always followed by a
second byte. The two bytes together (called a lead byte and a trail
byte) define a single character, usually a complex ideograph.
• Advantage: DBCS makes it possible to create code pages for languages that have
more than 256 letters or signs.
• Disadvantage:
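To make the lead byte and trail byte idea concrete, the Python sketch below encodes a short string with Shift JIS, used here only as one example of a double-byte character set; the slides do not name a specific DBCS. The helper `byte_lengths` is invented for the example.

```python
def byte_lengths(text: str, codec: str = "shift_jis") -> list[int]:
    """Number of bytes each character occupies in the given code page."""
    return [len(ch.encode(codec)) for ch in text]


text = "A水"
encoded = text.encode("shift_jis")

print(byte_lengths(text))       # one byte for 'A', two bytes for the ideograph
print(encoded.hex(" "))         # the first byte is plain ASCII 0x41
print(len(text), len(encoded))  # character count and byte count differ
```

The byte after 0x41 is a lead byte (its value is above 0x7F), so a program scanning the data must read the following trail byte as part of the same character.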
Examples
code point  glyph*  character                                                 UTF-16 code units (hex)  UTF-16BE code units (hex)  UTF-16LE code units (hex)
U+007A      z       LATIN SMALL LETTER Z                                      007A                     00, 7A                     7A, 00
U+6C34      水      CJK UNIFIED IDEOGRAPH-6C34 (water)                        6C34                     6C, 34                     34, 6C
U+10000     𐀀       LINEAR B SYLLABLE B008 A (first non-BMP code point)       D800, DC00               D8, 00, DC, 00             00, D8, 00, DC
U+1D11E     𝄞       MUSICAL SYMBOL G CLEF                                     D834, DD1E               D8, 34, DD, 1E             34, D8, 1E, DD
U+10FFFD    (none)  PRIVATE USE CHARACTER-10FFFD (last Unicode code point)    DBFF, DFFD               DB, FF, DF, FD             FF, DB, FD, DF
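The rows of the examples table can be reproduced with a few lines of arithmetic. The Python sketch below applies the standard UTF-16 rule: code points below U+10000 are stored as a single 16-bit unit, and larger ones are split into a high and a low surrogate. The function names are illustrative, not taken from any library.

```python
def utf16_code_units(code_point: int) -> list[int]:
    """Split a Unicode code point into one or two 16-bit UTF-16 code units."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x10000:
        return [code_point]              # BMP: one unit, value unchanged
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return [high, low]


def units_to_bytes(units: list[int], byteorder: str) -> bytes:
    """Serialize code units as UTF-16BE ('big') or UTF-16LE ('little')."""
    return b"".join(unit.to_bytes(2, byteorder) for unit in units)


for cp in (0x007A, 0x6C34, 0x10000, 0x1D11E, 0x10FFFD):
    units = utf16_code_units(cp)
    print(
        f"U+{cp:04X}",
        [f"{u:04X}" for u in units],
        units_to_bytes(units, "big").hex(" ").upper(),
        units_to_bytes(units, "little").hex(" ").upper(),
    )
```

The output can be checked against Python's built-in codecs, for example `chr(0x1D11E).encode("utf-16-be")`.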
Unicode
Advantage: