
Machine Level Representation of Data Character Representation

Machine level representation of data can be summarized as follows: 1) Bits are used to represent characters, logical values, colors, and other data types in computers. The number of bits used depends on how many distinct values must be represented, such as 5 bits for 26 letters or 8, 16, or 32 bits per code unit for Unicode. 2) ASCII was the original standard character encoding; it uses 7 bits to represent 128 characters, including the English alphabet, digits, and punctuation. It was later extended to 8-bit encodings to cover additional languages. 3) Beyond ASCII, character representations evolved into multi-byte encodings and then Unicode, which can support the thousands of characters in languages like Chinese, Japanese, and Korean, in part through variable-length encodings.

Uploaded by

Renchie Gonzales
Copyright
© All Rights Reserved

Machine level representation of data

Character representation
Bits can represent anything!
• Characters?
  • 26 letters → 5 bits (2^5 = 32)
  • upper/lower case + punctuation → 7 bits (in 8) (“ASCII”)
  • standard code to cover all the world’s languages → 8, 16, 32 bits (“Unicode”)
• Logical values?
  • 0 → False, 1 → True
• Colors? Ex: Red (100), Green (010), Blue (001)
• Locations / addresses? Commands?
• MEMORIZE: N bits → at most 2^N things
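The rule "N bits → at most 2^N things" can be turned around to ask how many bits a given number of things needs. The following is a minimal Python sketch of that calculation; the function name `bits_needed` is made up for this illustration and does not come from the slides.

```python
def bits_needed(n_things: int) -> int:
    """Smallest number of bits N such that 2**N >= n_things."""
    if n_things < 1:
        raise ValueError("need at least one thing to represent")
    # (n - 1).bit_length() is the ceiling of log2(n) for n >= 1.
    return (n_things - 1).bit_length()


for n in (2, 4, 5, 26, 128, 256):
    print(f"{n:>3} things -> {bits_needed(n)} bits")
```

For example, 26 letters need 5 bits and 128 patterns need 7 bits, which matches the counts used on the slides.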
Character representation in computers and devices
ASCII → ANSI → MultiByte → Unicode

American Standard Code for Information Interchange


• This was the de facto worldwide standard for the code numbers used by computers to represent all the upper- and lower-case Latin letters, numbers, punctuation, etc.
• How many bits do we need to represent 5 letters?
• How many bits do we need to represent:
  • 4 letters (A, B, C, D)? 2 bits: 2^2 = 4 patterns
  • 26 English uppercase letters? 5 bits: 2^5 = 32 patterns
  • 26 English lowercase letters? 5 bits: 2^5 = 32 patterns
  • Decimal digits and special signs? 5 bits: 2^5 = 32 patterns
  • Special control characters? 5 bits: 2^5 = 32 patterns
• How many bits do we need to represent 128 patterns?
The content of the ASCII table
• Contains 128 characters.
• ASCII needs only 7 bits per character (“A” is 100 0001₂).
• The first printable character is SP (space); it corresponds to the bit pattern 010 0000₂ (0x20).
• The characters A and B correspond to:
  A: 100 0001₂ (0x41)   B: 100 0010₂ (0x42)
• Find the ASCII code of “z”: 111 1010₂ (0x7A)
ASCII chart (rows: 16's place; columns: 1's place)

    0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
0   NUL SOH STX ETX EOT ENQ ACK BEL BS  HT  LF  VT  FF  CR  SO  SI
1   DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM  SUB ESC FS  GS  RS  US
2   SP  !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /
3   0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?
4   @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
5   P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _
6   `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o
7   p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~   DEL
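As a hedged illustration of how to read the chart, the short Python sketch below splits a character's code into its 16's-place row and 1's-place column and prints the 7-bit pattern. The helper name `chart_position` is hypothetical and is used only for this example.

```python
def chart_position(ch: str) -> tuple[int, int]:
    """Return (row, column) of a 7-bit ASCII character in the 16-column chart."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("not a 7-bit ASCII character")
    # Row is the 16's place (high hex digit), column is the 1's place (low hex digit).
    return divmod(code, 16)


for ch in ("A", "B", "z", " "):
    row, col = chart_position(ch)
    bits = format(ord(ch), "07b")
    print(f"{ch!r}: {bits} = {ord(ch):#04x} -> row {row:X}, column {col:X}")
```

Looking up “z” this way gives row 7, column A, which is the cell that holds z in the chart.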
Control characters
• Control characters are not displayed or printed; they control the device.
• The same control character can have a different meaning on different devices.
• Examples: 0x0A is Line Feed (LF), 0x0D is Carriage Return (CR).
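One everyday consequence, offered here as an illustration rather than something stated on the slides: Unix-like systems usually end a line with LF alone, while Windows usually uses the pair CR LF. The Python sketch below shows that both codes are non-printable and that `str.splitlines` accepts either convention. The sample text is made up.

```python
LF, CR = 0x0A, 0x0D

# Both codes map to non-printable control characters.
print(repr(chr(LF)), chr(LF).isprintable())
print(repr(chr(CR)), chr(CR).isprintable())

# Made-up sample that mixes a CR LF line ending with a plain LF one.
text = "first\r\nsecond\nthird"
lines = text.splitlines()
print(lines)
```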
ASCII code Advantages, Disadvantages

Advantages of ASCII
• The 26 letter codes are contiguous. Why is this an advantage?
• Uppercase letters can be converted to lowercase and back by flipping one bit. How?
• The codes for the 10 digits are easily derived from the value of the digits. How?
• No other standard is as prevalent or as ingrained in our keyboards, video displays, system hardware, printers, font files, operating systems, and the Internet.
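One possible set of answers to the "How?" questions above, sketched in Python. The function names are invented for this illustration. The sketch relies only on the ASCII layout: upper- and lowercase letters differ in the single bit 0x20, digits start at 0x30, and letters are contiguous from 0x41.

```python
CASE_BIT = 0x20  # the one bit that differs between 'A' (0x41) and 'a' (0x61)


def toggle_case(ch: str) -> str:
    """Flip the case of an ASCII letter by flipping one bit; leave other characters alone."""
    if ch.isascii() and ch.isalpha():
        return chr(ord(ch) ^ CASE_BIT)
    return ch


def digit_value(ch: str) -> int:
    """Numeric value of an ASCII digit: '0' is 0x30, so subtract 0x30."""
    if not "0" <= ch <= "9":
        raise ValueError("not an ASCII digit")
    return ord(ch) - 0x30


def letter_index(ch: str) -> int:
    """Position of an uppercase ASCII letter in the alphabet (A = 0), using contiguity."""
    if not "A" <= ch <= "Z":
        raise ValueError("not an uppercase ASCII letter")
    return ord(ch) - ord("A")


print(toggle_case("A"), toggle_case("z"), digit_value("7"), letter_index("C"))
```

Contiguity is what makes `letter_index` a single subtraction, and it also means that comparing codes sorts letters alphabetically.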

The history of ASCII since 1967 is mostly a history of attempts to overcome its limitations and make it more applicable to languages other than American English.
ASCII code extension. Character sets.
Extended ASCII
• By the time the early small computers were being developed, the 8-bit byte had been firmly established.
• Thus, if a byte was used to store a character, 128 additional characters could be invented to supplement ASCII.
• When the original IBM PC was introduced in 1981, its video adapters included a ROM-based character set of 256 characters, which itself became an important part of the IBM standard.

ANSI
• The native Windows character set was called the "ANSI character set" because it was based on a draft ANSI and ISO standard (ANSI/ISO 8859-1-1987).
• ANSI has 256 possible characters.

[Figure: 8-bit character sets (ANSI, IBM, other), 256 characters each: 128 ASCII characters (7 bits) plus 128 extra characters, for example Armenian.]
1 byte: 256-character code pages
Code pages
• First IBM (for PCs), then Microsoft, and then other companies started to create their own 256-character sets.
• Thus code pages appeared (Latin1, Cyrillic, Armenian, …).

Advantage: a code page fully supports 2 languages, English and the chosen one.

Disadvantage:
• Documents created with different code pages, even for the same language (Cyrillic or Armenian), are not compatible. The code page has to be “attached” to a document so that it opens correctly on another computer.
• How can the ideographic symbols of Chinese, Japanese, and Korean, which number about 21,000, be accommodated? Dedicate 2 bytes to them.

ANSI and similar code pages need 8 bits per character:
(“A”: 0x41 = 0100 0001₂) (“Ա”: 0xB2 = 1011 0010₂) (“Б”: 0xE2 = 1110 0010₂)
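The incompatibility between code pages can be seen by decoding the same two bytes under different code pages. The Python sketch below is only an illustration. The slides do not name their Cyrillic code page; KOI8-R is assumed here because it places “Б” at 0xE2, as on the slide. Windows-1251 and Latin-1 are added for contrast.

```python
raw = bytes([0x41, 0xE2])  # 'A' followed by the byte 0xE2

# The codec names are Python's standard codec names. Choosing KOI8-R as the
# slide's Cyrillic code page is an assumption made for this example.
decoded = {codec: raw.decode(codec) for codec in ("koi8_r", "cp1251", "latin_1")}

for codec, text in decoded.items():
    print(f"{codec:8} -> {text}")
```

The ASCII half (0x41) reads as “A” everywhere, while 0xE2 becomes a different character under each code page. This is why the code page has to travel with the document.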
[Figure: code page layouts. ANSI and IBM (256 characters each): 128 ASCII characters plus 128 extra characters. Armenian code pages #1 and #2 and Cyrillic code page #1: 128 ASCII characters plus the national characters (Armenian 1, Armenian 2, Cyrillic 1).]
Double (or multi) byte character sets

[Figure: Chinese code page (lead byte), which includes the 128 ASCII characters, and Chinese code page (trail byte).]

Language               Character set name                    Code page   Lead-byte ranges        Trail-byte ranges
Chinese (Simplified)   GB 2312-80                            CP 936      0xA1–0xFE               0xA1–0xFE
Chinese (Traditional)  Big-5                                 CP 950      0x81–0xFE               0x40–0x7E, 0xA1–0xFE
Japanese               Shift-JIS (Japan Industry Standard)   CP 932      0x81–0x9F, 0xE0–0xFC    0x40–0xFC (except 0x7F)
Korean (Wansung)       KS C-5601-1987                        CP 949      0x81–0xFE               0x41–0x5A, 0x61–0x7A, 0x81–0xFE
Korean (Johab)         KS C-5601-1992                        CP 1361     0x84–0xD3, 0xD8,        0x41–0x7E, 0x81–0xFE
                                                                         0xD9–0xDE, 0xE0–0xF9    (Government standard: 0x31–0x7E, 0x41–0xFE)

• A DBCS starts off with 256 codes.

• Like any well-behaved code page, the first 128 of these codes are ASCII.
• However, some of the codes in the higher 128 are always followed by a
second byte. The two bytes together (called a lead byte and a trail
byte) define a single character, usually a complex ideograph.
Double (or multi) byte character sets

• Advantage: DBCS makes it possible to create code pages for languages that have more than 256 letters or signs.
• Disadvantage:
  • Documents created with different code pages, even for the same language (Cyrillic or Armenian), are still not compatible.
  • The problem with a double-byte character set is not that characters are represented by 2 bytes. The problem is that some characters (in particular, the ASCII characters) are represented by 1 byte. This creates odd programming problems. For example, the number of characters in a character string cannot be determined from the byte size of the string.
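To make the counting problem concrete, here is a small Python sketch that walks a DBCS byte string and counts characters by checking for lead bytes. It assumes the GB 2312-80 lead-byte range 0xA1–0xFE from the lead-byte table earlier in this section; the function name `count_dbcs_chars` and the sample text are invented for this illustration.

```python
def count_dbcs_chars(data: bytes, lead_lo: int = 0xA1, lead_hi: int = 0xFE) -> int:
    """Count characters in a DBCS byte string: a lead byte takes its trail byte with it."""
    count = 0
    i = 0
    while i < len(data):
        # A byte in the lead range starts a 2-byte character; anything else is 1 byte.
        i += 2 if lead_lo <= data[i] <= lead_hi else 1
        count += 1
    return count


text = "A\u6c34B"              # 'A', the ideograph for water, 'B'
data = text.encode("gb2312")   # Python's codec for GB 2312-80
print(len(data), "bytes,", count_dbcs_chars(data), "characters")
```

The byte length (4) and the character count (3) differ, so the string has to be scanned from the start to find character boundaries.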
Unicode

• The best thing about Unicode is that there is only one character set. There is simply no ambiguity.
• The representation in bits is large enough to accommodate all languages and signs.
• Flavors: UTF-8, UTF-16, UTF-32.

Unicode UTF-8 (like multibyte)

Code point: the symbol's identifier.

Bits  Last code point  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
 7    U+007F           0xxxxxxx
11    U+07FF           110xxxxx  10xxxxxx
16    U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
21    U+1FFFFF         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
26    U+3FFFFFF        111110xx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
31    U+7FFFFFFF       1111110x  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx

• Byte 1 is either a single byte (0xxxxxxx) or the leading byte of a multi-byte sequence; bytes 2 to 6 are continuation bytes (10xxxxxx).
• Header bits of a leading byte: the number of “1”s equals the number of bytes in the sequence.
• The header bit that follows the “1”s is always 0.
• Note: the 5- and 6-byte forms come from the original UTF-8 design. The current standard (RFC 3629) limits UTF-8 to 4 bytes and to code points up to U+10FFFF.

Armenian character set uses the codes 0x0530 through 0x058F
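The bit layout in the UTF-8 table can be applied by hand. The Python sketch below is a minimal encoder for the 1- to 4-byte forms only, checked against Python's built-in encoder. The function name `utf8_encode` is made up for this example, and the 5- and 6-byte forms and surrogate checks are deliberately left out.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point using the 1- to 4-byte rows of the UTF-8 table."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    if cp <= 0x7F:        # 7 bits:  0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# 'z', Armenian capital AYB (U+0531), the ideograph U+6C34, and U+1D11E.
for ch in ("z", "\u0531", "\u6c34", "\U0001D11E"):
    encoded = utf8_encode(ord(ch))
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')}")
```

The Armenian letter falls in the 11-bit row, so it takes 2 bytes, while ASCII characters keep their single-byte codes.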


Unicode UTF-16
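• The Unicode code space is divided into seventeen planes of 2^16 (65,536) code points each.
• The code points in each plane have the hexadecimal values xx0000 to xxFFFF, where xx is a hex value from 00 to 10.
• The first plane (code points U+0000 to U+D7FF and U+E000 to U+FFFF) is the Basic Multilingual Plane (BMP), which holds the most frequently used characters. UTF-16 encodes each of them as a single 16-bit code unit.
• Code points U+D800 to U+DFFF are reserved for extension: UTF-16 uses pairs of values from this range (surrogate pairs) to encode code points outside the BMP.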

Examples

Code point  Glyph   Character                                                  UTF-16 code units (hex)  UTF-16BE bytes (hex)  UTF-16LE bytes (hex)
U+007A      z       LATIN SMALL LETTER Z                                       007A                     00, 7A                7A, 00
U+6C34      水      CJK UNIFIED IDEOGRAPH-6C34 (water)                         6C34                     6C, 34                34, 6C
U+10000     𐀀       LINEAR B SYLLABLE B008 A (first non-BMP code point)        D800, DC00               D8, 00, DC, 00        00, D8, 00, DC
U+1D11E     𝄞       MUSICAL SYMBOL G CLEF                                      D834, DD1E               D8, 34, DD, 1E        34, D8, 1E, DD
U+10FFFD    (none)  PRIVATE USE CHARACTER-10FFFD (last Unicode code point)     DBFF, DFFD               DB, FF, DF, FD        FF, DB, FD, DF
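The code units in the examples can be reproduced with a few lines of arithmetic. The Python sketch below is an illustration with invented function names: for a code point outside the BMP it subtracts 0x10000 and splits the remaining 20 bits between a high unit starting at 0xD800 and a low unit starting at 0xDC00.

```python
def utf16_code_units(cp: int) -> list[int]:
    """Return the UTF-16 code units (one or two 16-bit values) for a code point."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp <= 0xFFFF:
        return [cp]                      # BMP: a single code unit
    offset = cp - 0x10000                # 20-bit value
    high = 0xD800 | (offset >> 10)       # top 10 bits
    low = 0xDC00 | (offset & 0x3FF)      # bottom 10 bits
    return [high, low]


def units_to_bytes(units: list[int], byteorder: str) -> bytes:
    """Serialize code units as big-endian ('big') or little-endian ('little') bytes."""
    return b"".join(u.to_bytes(2, byteorder) for u in units)


for cp in (0x007A, 0x6C34, 0x10000, 0x1D11E, 0x10FFFD):
    units = utf16_code_units(cp)
    print(f"U+{cp:04X}:",
          " ".join(f"{u:04X}" for u in units),
          "| BE", units_to_bytes(units, "big").hex(" "),
          "| LE", units_to_bytes(units, "little").hex(" "))
```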
Unicode

Advantage:
• Supports all languages and signs with a single code page.
• UTF-8 is backward compatible with ASCII.
• UTF-16 within the Basic Multilingual Plane (one 16-bit unit, 2 bytes) and UTF-32 (4 bytes) use a fixed size per character, so strings can be handled like regular fixed-width text.
Unicode
Disadvantage:
• Fixed-width 2-byte (UTF-16) and 4-byte (UTF-32) Unicode strings occupy two or four times as much memory as ASCII strings.

UTF-8 is a variable-length encoding.
• This is an advantage: text takes less space than in the fixed-width Unicode encodings.
• This is a disadvantage for programming: for example, the number of characters in a string cannot be determined from its size in bytes.
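Both points can be checked with a short Python sketch, given here as an illustration with made-up sample strings. It compares the encoded size of an ASCII-only string in each flavor, then counts the characters in a UTF-8 byte string by skipping continuation bytes (those of the form 10xxxxxx). The function name `count_utf8_chars` is hypothetical.

```python
def count_utf8_chars(data: bytes) -> int:
    """Count code points in valid UTF-8 by counting bytes that are not 10xxxxxx."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


ascii_text = "hello"
sizes = {enc: len(ascii_text.encode(enc))
         for enc in ("ascii", "utf-8", "utf-16-le", "utf-32-le")}
print(sizes)

mixed = "z\u6c34\U0001D11E"          # 1-byte, 3-byte, and 4-byte characters in UTF-8
utf8_bytes = mixed.encode("utf-8")
print(len(utf8_bytes), "bytes,", count_utf8_chars(utf8_bytes), "characters")
```

The "-le" codec names are used so that Python does not prepend a byte order mark, which would add 2 or 4 bytes to the counts.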
