7-Text Preprocessing - ASCII and UNICODE-10!01!2024

Text processing

Uploaded by

Biji CL
Text Preprocessing

• Converting a raw text file (a sequence of digital bits) into a well-defined sequence of linguistically meaningful units
• Crucial because the characters, words and sentences identified at this stage are used in all further processing
• Two Stages
• DOCUMENT TRIAGE
• TEXT SEGMENTATION
Document Triage
• Converting a set of digital files into well-defined text documents
• Steps
• Characters in the file must be MACHINE READABLE (Character Encoding)
One or more bytes in a file map to a known character.
• Character Encoding Identification (ASCII, UNICODE, ...)
Determines the character encoding of a file and optionally converts between encodings.
• Language Identification (English, French, ...)
Determines the natural language of a document so that language-specific algorithms can be applied.
• Text Sectioning
Identifies the actual content within the file by discarding images, headers, HTML tags, ...
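
The triage steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the candidate encoding list, the helper names (decode_bytes, section_text, TextExtractor) and the sample bytes are all hypothetical, and real encoding identification uses statistical detection rather than trial decoding.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags, scripts, and styles."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def decode_bytes(raw, candidates=("utf-8", "latin-1")):
    """Try each candidate encoding in order; return (text, encoding name)."""
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding matched")


def section_text(raw):
    """Decode the bytes, then keep only the textual content."""
    text, _ = decode_bytes(raw)
    extractor = TextExtractor()
    extractor.feed(text)
    extractor.close()
    return " ".join(extractor.parts)


# Hypothetical input file: HTML with a style block, an image, and UTF-8 text.
raw_file = (
    b"<html><head><style>p{color:red}</style></head>"
    b"<body><h1>Title</h1><p>Caf\xc3\xa9 menu</p><img src='x.png'></body></html>"
)
plain = section_text(raw_file)
```

Trying UTF-8 first is a deliberate choice in this sketch: UTF-8 decoding fails loudly on most non-UTF-8 input, whereas latin-1 accepts any byte sequence and so only works as a last resort.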
Text Segmentation
Process of converting a well-defined text corpus into its
component words and sentences.

• Word Segmentation
• Tokenization - Finding the points where one word ends and another begins; the resulting units are tokens
• Normalization - Merging the different forms of a token into a canonical form.
Ex: Mr., Mr, Mister, mister – Normalized to a single form.
• Sentence Segmentation
• Finding the boundaries between words that belong to different sentences.
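
As a rough illustration of word segmentation followed by normalization, the sketch below splits on whitespace and maps the variants of "Mr." to one canonical form. The CANONICAL table and the function names are hypothetical; a real system would rely on a much larger lexicon.

```python
# Hypothetical canonical-form table covering the "Mr." example.
CANONICAL = {"mr.": "mister", "mr": "mister", "mister": "mister"}


def tokenize(text):
    """Whitespace tokenization: the simplest possible word segmentation."""
    return text.split()


def normalize(token):
    """Lower-case the token, then map known variants to their canonical form."""
    lowered = token.lower()
    if lowered in CANONICAL:
        return CANONICAL[lowered]
    # Otherwise only strip punctuation attached to the edges of the token.
    return lowered.strip(".,!?;:")


tokens = [normalize(t) for t in tokenize("Mr. Smith met Mister Jones and mr Lee.")]
```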
Sentence segmentation
Also referred to as:
• Sentence boundary detection
• Sentence boundary disambiguation
• Sentence boundary recognition
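
Sentence boundary disambiguation is needed because a period does not always end a sentence. The following sketch assumes a small, hypothetical abbreviation list (ABBREVIATIONS) and treats ".", "!" and "?" as boundary marks unless the token is a known abbreviation; trained models handle many more cases.

```python
# Hypothetical abbreviation list; real systems use larger lists or a trained model.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "prof.", "etc."}


def split_sentences(text):
    """Group whitespace tokens into sentences at ., ! or ? marks."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_with_mark = token[-1] in ".!?"
        # A trailing period on a known abbreviation is not a sentence boundary.
        if ends_with_mark and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text that has no closing mark
        sentences.append(" ".join(current))
    return sentences


result = split_sentences("Dr. Rao arrived late. Did Mr. Lee wait? Yes!")
```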
Challenges in Text pre-processing
The writing system of the text being processed can be:

• Logographic - Individual symbols represent words
• Syllabic - Individual symbols represent syllables
• Alphabetic - Individual symbols represent sounds
Logographic and Syllabic
Text preprocessing
1. Tokenization
- Tokenization is the process that splits an input sequence into so-called tokens.
- Tokens are useful units for semantic processing
- A token can be a word, sentence, paragraph, etc.
- Tokenization is also referred to as text segmentation or lexical analysis

Breaking a stream of input text into words and sentences for
subsequent processing
2. Token Normalization
• Converting all text to the same case (upper or lower), removing punctuation, converting numbers to their word equivalents, and so on.
• Normalization puts all words on equal footing and allows processing to proceed uniformly.
• Approaches:
1. Stemming
2. Lemmatization
3. Everything else
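
The "everything else" category (case folding, punctuation removal, numbers to words) might look like the sketch below. The digit-by-digit spelling is a simplifying assumption made here for brevity, and normalize_text and DIGIT_WORDS are hypothetical names.

```python
import string

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def normalize_text(text):
    """Lower-case, drop ASCII punctuation, and spell out digits one by one."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = []
    for token in no_punct.split():
        if token.isascii() and token.isdigit():
            # Simplification: "42" becomes "four two", not "forty-two".
            tokens.extend(DIGIT_WORDS[int(d)] for d in token)
        else:
            tokens.append(token)
    return tokens


normalized = normalize_text("The 2 Cats, ran!")
```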
2.1 Stemming
• Process of removing or replacing affixes (prefixes, suffixes, infixes, circumfixes) to get the root form of the word.

Running → Run
Running, Runs, Runned, Runly → Run

• A simple stemmer applies suffix rules such as:
• if the word ends in 'ed', remove the 'ed'
• if the word ends in 'ing', remove the 'ing'
• if the word ends in 'ly', remove the 'ly'
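
The three suffix rules above translate almost directly into code. This is a minimal sketch of exactly those rules (naive_stem is a hypothetical name), not a full stemmer: it turns "running" into "runn" rather than "run", which is why practical stemmers such as Porter's add further rules, for example for doubled consonants.

```python
SUFFIXES = ("ed", "ing", "ly")


def naive_stem(word):
    """Strip the first matching suffix; leave the word alone otherwise."""
    for suffix in SUFFIXES:
        # Require at least one character to remain after stripping.
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


stems = [naive_stem(w) for w in ["walked", "walking", "quickly", "running", "run"]]
```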
2.2 Lemmatization
• Lemmatization is related to stemming, but differs in that it is able to capture a canonical form based on a word's lemma.
• It is a more complex approach to the problem of determining the base form of a word.

Better → Good
Returns the base or dictionary form of a word, which is known as the lemma.
Stemming and Lemmatization

The goal of both stemming and lemmatization is to reduce inflectional forms and
sometimes derivationally related forms of a word to a common base form. For
instance:

am, are, is → be
car, cars, car's, cars' → car

The result of this mapping of text will be something like:

the boy's cars are different colors → the boy car be differ color
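
The mapping above can be imitated with a plain lookup table. The BASE_FORMS dictionary below is a toy, hand-written assumption that covers only the words in these examples; real lemmatizers use a full dictionary plus part-of-speech information.

```python
# Toy table of base forms, hand-written for the examples on this slide.
BASE_FORMS = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
    "boy's": "boy", "different": "differ", "colors": "color",
    "better": "good",
}


def reduce_forms(sentence):
    """Replace each word with its base form when the table knows it."""
    return " ".join(BASE_FORMS.get(word, word) for word in sentence.split())


reduced = reduce_forms("the boy's cars are different colors")
```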
Character Encoding
• Encoding - Process of putting a sequence of characters (letters, numbers, punctuation, and symbols) into a specific format for storage or transmission.

• Character - A character is the smallest unit of a writing system that is capable of conveying information.

• Character Encoding – The way that letters, digits and other symbols are expressed as binary values that a computer can understand.
ASCII and UNICODE
• Character Encoding - ASCII was one of the first widely adopted character encoding standards (also called a character set). ASCII defines 128 different characters that could be used on the internet: digits (0-9), English letters (A-Z, a-z), and some special characters like ! ...

• UTF-8 (Unicode) covers almost all of the characters and symbols in the world.
ASCII -Character Set
List of characters recognised by the computer hardware and software. Each character is represented by a number.

ASCII Character set:

• American Standard Code for Information Interchange (English)
• Numbers 0 to 127 represent all English letters and digits as well as special characters
• The ASCII table has 128 characters with values from 0 through 127.
• Thus, 7 bits are sufficient to represent a character in ASCII; however, most computers typically reserve 1 byte (8 bits) for an ASCII character.
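
The 7-bit claim is easy to check: 2 to the power 7 is 128, so the codes 0 through 127 all fit in seven binary digits. A short sketch (variable names are illustrative only):

```python
# Seven bits give 2**7 distinct values, exactly the ASCII range 0..127.
distinct_values = 2 ** 7
highest_code_bits = (127).bit_length()  # bits needed for the largest ASCII code

# The letter "A" has code 65, written here as a 7-digit binary string.
bits_for_a = format(ord("A"), "07b")

# In practice the character is still stored in one full 8-bit byte.
stored = "A".encode("ascii")
```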
ASCII
• It is an acronym for the American Standard Code for Information Interchange.
• It is a standard seven-bit code that was first published in 1963 by the American Standards Association (now the American National Standards Institute, ANSI) and finalized in 1968 as Standard X3.4.
• The purpose of ASCII was to provide a standard way to code various symbols (visible and invisible).
From Bits & Bytes to ASCII
• Bytes can represent any collection of items using a "look-up table" approach
• ASCII is used to represent characters

ASCII: American Standard Code for Information Interchange
http://en.wikipedia.org/wiki/ASCII
ASCII table
Bytes: ASCII

• If you were to look at the file as a computer looks at it, you would find that each byte contains not a letter but a number: the ASCII code corresponding to the character. So on disk, the numbers for the file look like this:
• Four and seven
• 70 111 117 114 32 97 110 100 32 115 101 118 101 110
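
The same numbers can be reproduced in Python by encoding the text as ASCII and listing the byte values; the variable names are illustrative.

```python
text = "Four and seven"

# Each character becomes one byte holding its ASCII code (32 is the space).
codes = list(text.encode("ascii"))

# The look-up works in both directions: the bytes decode back to the text.
decoded = bytes(codes).decode("ascii")
```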
ASCII and EBCDIC (Extended Binary Coded Decimal Interchange Code)
ASCII Table
Character class                      Count
Digits (0-9)                            10
Space                                    1
Punctuation marks (.,+{)%)              32
Lower case (a-z)                        26
Upper case (A-Z)                        26
Control characters (tab, cr, lf)        33

Total                                  128
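
These counts can be verified by classifying all 128 codes. The sketch below uses the constants in Python's standard string module; the classify function and its category labels are hypothetical names chosen to match the table.

```python
import string
from collections import Counter


def classify(code):
    """Assign one ASCII code (0..127) to a category from the table."""
    ch = chr(code)
    if code < 32 or code == 127:
        return "control"
    if ch == " ":
        return "space"
    if ch in string.digits:
        return "digits"
    if ch in string.ascii_uppercase:
        return "upper"
    if ch in string.ascii_lowercase:
        return "lower"
    return "punctuation"


counts = Counter(classify(code) for code in range(128))
```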
What is Unicode?
• A worldwide character-encoding standard
• Its main objective is to provide a single, unique character set that is capable of supporting all characters from all scripts, as well as symbols, that are commonly used in computer processing throughout the globe
• Fun fact: Unicode has room for more than 1.1 million characters (1,114,112 code points)!
UNICODE

• Unicode - standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems.
• Most OSs support Unicode.
• Unicode is required for international document and data interchange.
• Most modern standards use Unicode, such as
• Programming languages – Java, C#, Perl, Python
• Markup languages and other standards – XML, HTML, JavaScript, CORBA, etc.
UNICODE scripts

[Figure: Unicode is capable of encoding characters from virtually every kind of language]
[Figure: comparison of what ASCII and Unicode are able to encode]
ASCII vs Unicode

ASCII
- Has 128 code points, 0 through 127
- Can only encode characters in 7 bits
- Can only encode characters from the English language

Both
- Both are character codes
- The first 128 code positions of Unicode mean the same as ASCII

Unicode
- Has 1,114,112 code positions
- Can encode characters in 16 bits and more
- Can encode characters from virtually all kinds of languages
- It is a superset of ASCII
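
The superset relationship can be checked directly in Python, whose str type is based on Unicode code points. This is a small sketch; can_encode_ascii is a hypothetical helper name.

```python
import sys

# Every ASCII byte decodes to the same character under ASCII and UTF-8,
# and that character sits at the same Unicode code point.
same_first_128 = all(
    bytes([code]).decode("ascii") == bytes([code]).decode("utf-8") == chr(code)
    for code in range(128)
)

# Unicode code points run from 0 to sys.maxunicode (0x10FFFF) inclusive.
total_code_points = sys.maxunicode + 1


def can_encode_ascii(text):
    """Return True when every character of text fits in ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False
```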
8-Bit Encoding (UTF-8)
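• Characters are encoded in 8-bit bytes; UTF-8, today's dominant character encoding, uses one to four bytes per character.
• In single-byte encodings, the lower 128 values (0-127) are reserved for ASCII and only the upper 128 differ between character sets
• Single-byte encodings suit alphabetic and syllabic writing systems
• ISO-8859 is a series of 10+ character sets for most European languages
• This results in a large number of overlapping character sets for different languages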
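
The overlap problem can be seen by decoding one and the same byte under several ISO-8859 character sets. The sketch below uses three codec names that ship with Python (latin-1 is ISO-8859-1, iso8859_5 is Cyrillic, iso8859_7 is Greek); the choice of byte 0xE9 is arbitrary.

```python
# One byte above 127: its meaning depends entirely on the character set.
raw = bytes([0xE9])

decoded = {
    name: raw.decode(name)
    for name in ("latin-1", "iso8859_5", "iso8859_7")
}

# UTF-8 avoids the ambiguity by giving each character its own byte sequence.
utf8_bytes = "é".encode("utf-8")
```

Without knowing which character set was used, the byte alone cannot be interpreted, which is why encoding identification has to come first.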
UTF Encoding
• UTF-16
• An extension of the "UCS-2" Unicode encoding, which uses exactly two bytes per character and so can represent only 65,536 characters
• UTF-16 uses at least two bytes per character; characters outside that range take four bytes (a surrogate pair)
• Used by platforms such as Java and Qualcomm BREW
• UTF-32
• A fixed-width encoding that represents each character with 4 bytes
• This makes it space inefficient
• Main use is in internal APIs where the data is single code points or glyphs, rather than strings of characters
• Used on Unix systems sometimes for storage of information
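
To compare the encodings, the sketch below measures how many bytes a few sample characters take in UTF-8, UTF-16 and UTF-32. The little-endian codec variants are used so that no byte order mark is added to the counts; the sample characters are arbitrary.

```python
samples = {
    "A": "A",                    # ASCII letter
    "e-acute": "\u00e9",         # Latin letter outside ASCII
    "CJK": "\u4e2d",             # Chinese character in the 16-bit range
    "emoji": "\U0001F600",       # character beyond the 16-bit range
}

# Byte length of each sample as (UTF-8, UTF-16, UTF-32).
sizes = {
    label: tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))
    for label, ch in samples.items()
}
```

The emoji row shows UTF-16 falling back to a four-byte surrogate pair, while UTF-32 spends four bytes on every character, including plain ASCII.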
2-Byte Encoding (UTF-16)
• 65,536 unique characters
• Pairs of bytes for a single character
• Sometimes single-byte letters, spaces and punctuation are interspersed with two-byte characters
• They may also contain ASCII characters
• Big-5 – Traditional (complex) Chinese character encoding (Taiwan, Hong Kong)
• GB – Simplified Chinese character encoding
• One digital file can have multiple writing systems and multiple encodings, which poses a challenge in tokenization
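
A short sketch of the mixed single-byte and two-byte behaviour, using the big5 and gb2312 codecs that ship with Python. The sample string is an arbitrary choice: one ASCII letter followed by one Chinese character that exists in both character sets.

```python
mixed = "A\u4e2d"  # ASCII letter followed by a Chinese character

big5_bytes = mixed.encode("big5")
gb_bytes = mixed.encode("gb2312")

# Decoding Big-5 bytes as if they were GB gives the wrong text (or fails);
# errors="replace" keeps this sketch from raising in the failure case.
misread = big5_bytes.decode("gb2312", errors="replace")
```

Both encodings spend one byte on the ASCII letter and two on the Chinese character, yet the two-byte values differ, so a tokenizer must know the encoding before it can find character boundaries.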
Unicode in the Future…
• Unicode may be capable of encoding characters from
every language across the globe
• Can become the most dominant and resourceful tool in
encoding every kind of character and symbol
• Integrates all kinds of character encoding schemes into
its operations
