7-Text Preprocessing - ASCII and UNICODE-10!01!2024
Text Preprocessing
• Word Segmentation
• Tokens - Identifying the point where one word ends and the next word begins
• Normalization - Merging the different forms of a token into a canonical form.
Ex: Mr. , Mr, Mister, mister – Normalized to a single form.
• Sentence Segmentation
• Finding the boundary between words that belong to different sentences.
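The tokenization and normalization steps above can be sketched as follows. This is a minimal illustration, not a production tokenizer: the names `CANONICAL`, `tokenize`, and `normalize` are hypothetical, and the choice of "Mr." as the canonical form is an assumption made only for this example.

```python
import re

# Hypothetical canonical-form table; the entries mirror the slide's example.
CANONICAL = {"mr.": "Mr.", "mr": "Mr.", "mister": "Mr."}

def tokenize(text):
    # Keep the abbreviation "Mr." as one token; otherwise split into
    # word tokens and single punctuation tokens.
    return re.findall(r"Mr\.|\w+|[^\w\s]", text)

def normalize(token):
    # Map known variants to one canonical form; leave other tokens unchanged.
    return CANONICAL.get(token.lower(), token)

tokens = tokenize("Mister Smith met Mr. Jones and mr Brown.")
normalized = [normalize(t) for t in tokens]
print(tokens)
print(normalized)
```

All three variants end up as the same token, so later processing steps can treat them as one word.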
Sentence segmentation
• Sentence boundary detection
• Sentence boundary disambiguation
• Sentence boundary recognition
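Sentence boundary disambiguation can be sketched with a simple rule: a period, exclamation mark, or question mark followed by whitespace ends a sentence unless it belongs to a known abbreviation. The abbreviation list and the function name `split_sentences` below are assumptions for illustration; real systems use much larger lists or trained models.

```python
import re

# Hypothetical abbreviation list; a real system would use a larger one.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e."}

def split_sentences(text):
    sentences, start = [], 0
    # Candidate boundaries: ., ! or ? followed by whitespace or end of text.
    for match in re.finditer(r"[.!?](?=\s|$)", text):
        end = match.end()
        words = text[start:end].split()
        last_word = words[-1].lower() if words else ""
        if last_word in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, not a boundary
        sentences.append(text[start:end].strip())
        start = end
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Mr. Smith arrived. He sat down!"))
```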
Challenges in Text pre-processing
In NLP, the writing system can be,
Running → Run
Running, Runs, Runned, Runly → Run
Better → Good
Lemmatization returns the base or dictionary form of a word, which is known as the lemma.
Stemming and Lemmatization
The goal of both stemming and lemmatization is to reduce inflectional forms and
sometimes derivationally related forms of a word to a common base form. For
instance:
am, are, is → be
car, cars, car's, cars' → car
the boy's cars are different colors → the boy car be differ color
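The difference between the two techniques can be sketched with a toy suffix-stripping stemmer and a toy dictionary lemmatizer. Both are assumptions made for illustration: the suffix list, the `LEMMAS` table, and the function names are hypothetical and far simpler than a real stemmer (such as Porter's) or a real lexicon.

```python
# Toy lemma dictionary (an assumption for illustration, not a real lexicon).
LEMMAS = {"am": "be", "are": "be", "is": "be", "better": "good",
          "cars": "car", "car's": "car", "cars'": "car"}

# Suffixes tried in this order.
SUFFIXES = ("ing", "ly", "ed", "s")

def stem(word):
    # Crude suffix stripping: remove the first matching suffix if at
    # least three characters remain, then undo a doubled final consonant.
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            w = w[: -len(suffix)]
            if len(w) >= 2 and w[-1] == w[-2] and w[-1] not in "lsz":
                w = w[:-1]
            break
    return w

def lemmatize(word):
    # Dictionary lookup; unknown words are returned unchanged.
    w = word.lower()
    return LEMMAS.get(w, w)

print([stem(w) for w in ["Running", "Runs", "Runned", "Runly"]])
print(stem("Better"), lemmatize("Better"))
```

The stemmer only chops characters, so it cannot relate "better" to "good"; the lemmatizer can, because it looks the word up.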
Character Encoding
• Encoding - The process of putting a sequence of characters (letters,
numbers, punctuation, and symbols) into a specialized format for storage or transmission
• Character Encoding – The way that letters, digits, and other
symbols are expressed as binary values that a computer can
understand.
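As a small sketch of this idea, Python can show the number assigned to a character and the binary value of each encoded byte. The helper name `to_bits` is hypothetical, and UTF-8 is assumed as the default encoding for the example.

```python
def to_bits(text, encoding="utf-8"):
    # Encode the text to bytes, then show each byte as 8 binary digits.
    return [format(byte, "08b") for byte in text.encode(encoding)]

code_point = ord("A")   # the number assigned to the character "A"
bits = to_bits("A")     # the binary value stored for that character
print(code_point, bits)
print(to_bits("Hi"))
```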
ASCII and UNICODE
• Character Encoding - ASCII was one of the first widely adopted character
encoding standards (also called a character set). ASCII defines 128
characters: control characters, numbers (0-9), English letters
(A-Z and a-z), and some special characters like ! ...
ASCII
American Standard Code for Information
Interchange
http://en.wikipedia.org/wiki/ASCII
ASCII table (figure): 128 characters in total
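The 128-character limit can be checked directly. The sketch below uses Python's built-in "ascii" codec; the helper names `is_ascii` and `encode_ascii` are hypothetical.

```python
def is_ascii(text):
    # A string is ASCII if every code point is in the range 0-127.
    return all(ord(ch) < 128 for ch in text)

def encode_ascii(text):
    # Returns the ASCII bytes, or None if a character is outside ASCII.
    try:
        return text.encode("ascii")
    except UnicodeEncodeError:
        return None

# Codes 32 through 126 are the printable characters; the rest are control codes.
printable = [chr(i) for i in range(32, 127)]

print(is_ascii("Hello!"), is_ascii("caf\u00e9"))
print(encode_ascii("A"), encode_ascii("\u00e9"))
print(len(printable))
```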
What is Unicode?
• A worldwide character-encoding standard
• Its main objective is to enable a single, unique character
set that is capable of supporting all characters from all
scripts, as well as symbols, that are commonly utilized for
computer processing throughout the globe
• Fun fact: Unicode has room for 1,114,112 code points
(U+0000 through U+10FFFF)!
ASCII vs. UNICODE
ASCII:
• Has 128 code points, 0 through 127
• Can only encode characters in 7 bits
• Can only encode characters from the English language
Both:
• Both are character encoding standards
• The first 128 code positions of Unicode mean the same as in ASCII
UNICODE:
• Has 1,114,112 code positions
• Encodes characters in 8-, 16-, or 32-bit units (UTF-8, UTF-16, UTF-32)
• Can encode characters from virtually all kinds of languages
• It is a superset of ASCII
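The comparison can be verified with a short sketch using Python's standard `sys` and `unicodedata` modules. The sample characters are arbitrary choices for illustration.

```python
import sys
import unicodedata

# Highest code point is U+10FFFF, so the code space holds 0x110000 values.
total_code_points = sys.maxunicode + 1

# The first 128 code points encode to the same bytes in ASCII and UTF-8.
ascii_text = "".join(chr(i) for i in range(128))
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Characters from different scripts each have their own code point and name.
samples = {ch: (hex(ord(ch)), unicodedata.name(ch)) for ch in "A\u00e9\u3042"}

print(total_code_points, same_bytes)
for ch, info in samples.items():
    print(ch, info)
```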
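8-Bit Encoding (ISO-8859 and UTF-8)
• Each character is encoded in a single 8-bit byte, giving 256 possible values
• The first 128 values (0 through 127) are reserved for ASCII
• Suited to alphabetic and syllabic writing systems
• ISO-8859 series of 10+ character sets for most European languages
• Results in a large number of overlapping character sets for different
languages
• UTF-8 also uses 8-bit units but is variable-length: ASCII characters take
1 byte and all other characters take 2 to 4 bytes (dominant character encoding)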
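The overlap between 8-bit character sets can be illustrated by decoding one byte value with three ISO-8859 codecs that ship with Python. The byte 0xE9 is an arbitrary choice for this sketch; the last lines show, for contrast, how many bytes UTF-8 uses for a few sample characters.

```python
byte = bytes([0xE9])

# The same byte value maps to different characters in different
# ISO-8859 character sets, so the encoding must be known in advance.
as_latin1 = byte.decode("iso-8859-1")     # Western European
as_cyrillic = byte.decode("iso-8859-5")   # Cyrillic
as_greek = byte.decode("iso-8859-7")      # Greek

# Bytes below 128 mean the same thing (ASCII) in all three sets.
ascii_same = {bytes([0x41]).decode(enc)
              for enc in ("iso-8859-1", "iso-8859-5", "iso-8859-7")}

# UTF-8 is variable-length: 1 byte for ASCII, up to 4 bytes otherwise.
utf8_lengths = [len(ch.encode("utf-8")) for ch in "A\u00e9\u20ac\U0001F600"]

print(as_latin1, as_cyrillic, as_greek)
print(ascii_same, utf8_lengths)
```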
UTF Encoding
• UTF-16
• An extension of the "UCS-2" Unicode encoding, which used exactly two
bytes per character and could represent only 65,536 characters
• UTF-16 uses two bytes for those characters and four bytes (a surrogate pair)
for all others
• Used by platforms such as Java and the Qualcomm BREW operating system
• UTF-32
• A fixed-width encoding that represents each character with 4 bytes
• This makes it space inefficient
• Main use is in internal APIs where the data is single code points or glyphs,
rather than strings of characters
• Used on Unix systems sometimes for storage of information
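The size trade-off between the UTF encodings can be sketched by encoding the same character three ways. The helper name `byte_lengths` and the sample characters are assumptions for illustration; the little-endian codec variants are used so that no byte-order mark is added to the counts.

```python
def byte_lengths(ch):
    # Use the little-endian variants so no byte-order mark is prepended.
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }

bmp_char = "\u4e2d"          # U+4E2D, within the first 65,536 code points
astral_char = "\U0001F600"   # U+1F600, beyond the first 65,536 code points

# UTF-16 stores astral_char as a surrogate pair of two 16-bit units.
units = astral_char.encode("utf-16-be")
surrogates = [units[0:2].hex(), units[2:4].hex()]

print(byte_lengths("A"))
print(byte_lengths(bmp_char))
print(byte_lengths(astral_char), surrogates)
```

UTF-32 always costs 4 bytes, which is why it is space inefficient for mostly-ASCII text but convenient when each item is a single code point.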
2-Byte Encoding (UTF-16)
• Two bytes allow up to 65,536 unique characters
• Pairs of bytes represent a single character
• Sometimes single-byte letters, spaces, and punctuation are
interspersed with two-byte characters
• They may also contain ASCII characters
• Big5 - Traditional (complex form) Chinese character encoding
(Taiwan, Hong Kong)
• GB - Simplified form Chinese character encoding
• One digital file can contain multiple writing systems and multiple
encodings, which poses a challenge in tokenization
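The mix of single-byte and two-byte characters, and the problem of not knowing which encoding a file uses, can be sketched with Python's built-in "big5" and "gb2312" codecs. The sample text and the helper name `decodes_correctly` are assumptions for illustration.

```python
text = "A\u4e2d\u6587"   # the letter "A" followed by two Chinese characters

big5_bytes = text.encode("big5")      # Traditional Chinese encoding
gb_bytes = text.encode("gb2312")      # Simplified Chinese encoding

# The ASCII letter takes 1 byte and each Chinese character takes 2 bytes.
lengths = (len(big5_bytes), len(gb_bytes))

def decodes_correctly(data, encoding, expected):
    # The same characters have different byte values in the two encodings,
    # so decoding with the wrong one does not recover the original text.
    try:
        return data.decode(encoding) == expected
    except UnicodeDecodeError:
        return False

print(lengths, big5_bytes != gb_bytes)
print(decodes_correctly(big5_bytes, "big5", text))
print(decodes_correctly(big5_bytes, "gb2312", text))
```

A tokenizer therefore needs the correct encoding for each part of a file before it can find character boundaries, let alone word boundaries.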
Unicode in the Future…
• Unicode may be capable of encoding characters from
every language across the globe
• Can become the most dominant and resourceful tool in
encoding every kind of character and symbol
• Integrates all kinds of character encoding schemes into
its operations