Unicode

Unicode, formally The Unicode Standard,[note 1][note 2] is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard, which is maintained by the Unicode Consortium, defines 149,186 characters[3][4] as of the current version (15.0), covering 161 modern and historic scripts, as well as symbols, 3664 emoji[5] (including in color), and non-visual control and formatting codes.
Unicode's success at unifying character sets has led to its widespread and predominant use in
the internationalization and localization of computer software. The standard has been implemented
in many recent technologies, including modern operating systems, XML, and most
modern programming languages.
The Unicode character repertoire is synchronized with ISO/IEC 10646, each being code-for-code
identical with the other. The Unicode Standard, however, includes more than just the
base code. Alongside the character encodings, the Consortium's official publication includes a wide
variety of details about the scripts and how to display them: normalization rules,
decomposition, collation, rendering, and bidirectional text display order for multilingual texts, and so
on.[6] The Standard also includes reference data files and visual charts to help developers and
designers correctly implement the repertoire.
Unicode can be stored using several different encodings, which translate the character codes into sequences of bytes. The Unicode Standard defines three such encodings, and several others exist; in practice, all are variable-length encodings. The most common encodings are the ASCII-compatible UTF-8, the ASCII-incompatible UTF-16 (compatible with the obsolete UCS-2), and the Chinese encoding standard GB18030, which is not an official Unicode standard but implements Unicode fully and is widely used in China.
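As a minimal illustrative sketch in Python (whose standard codecs include utf-8, utf-16, and gb18030), encoding the same string under each shows how the byte sequences differ:

    # Encode one string under three encodings that all cover the full
    # Unicode repertoire, and compare the resulting byte sequences.
    text = "A\u00e9\u4e2d"          # 'A', 'é', '中'

    for codec in ("utf-8", "utf-16", "gb18030"):
        data = text.encode(codec)
        print(f"{codec:8} {len(data):2} bytes: {data.hex(' ')}")

UTF-8 keeps 'A' as the single ASCII byte 0x41, which is why it is ASCII-compatible; the "utf-16" codec prepends a byte-order mark and uses two bytes even for ASCII letters; GB18030 reaches the same repertoire with its own byte layout.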

Origin and development


Unicode has the explicit aim of transcending the limitations of traditional character encodings, such
as those defined by the ISO/IEC 8859 standard, which find wide usage in various countries of the
world but remain largely incompatible with each other. Many traditional character encodings share a
common problem in that they allow bilingual computer processing (usually using Latin
characters and the local script), but not multilingual computer processing (computer processing of
arbitrary scripts mixed with each other).
Unicode, in intent, encodes the underlying characters—graphemes and grapheme-like units—rather
than the variant glyphs (renderings) for such characters. In the case of Chinese characters, this
sometimes leads to controversies over distinguishing the underlying character from its variant glyphs
(see Han unification).
In text processing, Unicode takes the role of providing a unique code point—a number, not a glyph—
for each character. In other words, Unicode represents a character in an abstract way and leaves
the visual rendering (size, shape, font, or style) to other software, such as a web browser or word
processor. This simple aim becomes complicated, however, because of concessions made by
Unicode's designers in the hope of encouraging a more rapid adoption of Unicode.
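As an illustrative sketch in Python, the built-ins ord and chr move between a character and its code point, with no font or style involved:

    # A code point is just a number; rendering is left to other software.
    print(hex(ord("A")))     # 0x41   -> U+0041 LATIN CAPITAL LETTER A
    print(hex(ord("中")))    # 0x4e2d -> U+4E2D
    print(chr(0x1F600))      # the character at U+1F600; how it looks
                             # depends entirely on the font that draws it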
The first 256 code points were made identical to the content of ISO/IEC 8859-1 so as to make it
trivial to convert existing western text. Many essentially identical characters were encoded multiple
times at different code points to preserve distinctions used by legacy encodings and therefore allow
conversion from those encodings to Unicode (and back) without losing any information. For
example, the "fullwidth forms" section of code points encompasses a full duplicate of the Latin
alphabet because Chinese, Japanese, and Korean (CJK) fonts contain two versions of these letters,
"fullwidth" matching the width of the CJK characters, and normal width. For other examples,
see duplicate characters in Unicode.
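A small Python sketch of one such duplicate: fullwidth 'Ａ' (U+FF21) is a distinct code point from ASCII 'A' (U+0041), though compatibility normalization (NFKC) folds it back to the ordinary letter:

    import unicodedata

    ascii_a = "\u0041"   # LATIN CAPITAL LETTER A
    full_a = "\uff21"    # FULLWIDTH LATIN CAPITAL LETTER A

    print(ascii_a == full_a)                   # False: distinct code points
    print(unicodedata.name(full_a))            # FULLWIDTH LATIN CAPITAL LETTER A
    print(unicodedata.normalize("NFKC", full_a) == ascii_a)   # True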
Recipients of the Unicode Bulldog Award include many names influential in the development of Unicode, among them Tatsuo Kobayashi, Thomas Milo, Roozbeh Pournader, Ken Lunde, and Michael Everson.[7]

History
Based on experiences with the Xerox Character Code Standard (XCCS) since 1980,[8] the origins of Unicode can be traced back to 1987, when Joe Becker from Xerox, with Lee Collins and Mark Davis from Apple, started investigating the practicalities of creating a universal character set.[9] With additional input from Peter Fenwick and Dave Opstad,[8] Joe Becker published a draft proposal for an "international/multilingual text character encoding system, tentatively called Unicode" in August 1988. He explained that "the name 'Unicode' is intended to suggest a unique, unified, universal encoding".[8]
In this document, entitled Unicode 88, Becker outlined a 16-bit character model:[8]
Unicode is intended to address the need for a workable, reliable world text encoding. Unicode could
be roughly described as "wide-body ASCII" that has been stretched to 16 bits to encompass the
characters of all the world's living languages. In a properly engineered design, 16 bits per character
are more than sufficient for this purpose.
His original 16-bit design was based on the assumption that only those scripts and characters in
modern use would need to be encoded:[8]
Unicode gives higher priority to ensuring utility for the future than to preserving past antiquities.
Unicode aims in the first instance at the characters published in modern text (e.g. in the union of all
newspapers and magazines printed in the world in 1988), whose number is undoubtedly far below
2¹⁴ = 16,384. Beyond those modern-use characters, all others may be defined to be obsolete or rare;
these are better candidates for private-use registration than for congesting the public list of generally
useful Unicodes.
In early 1989, the Unicode working group expanded to include Ken Whistler and Mike Kernaghan of
Metaphor, Karen Smith-Yoshimura and Joan Aliprand of RLG, and Glenn Wright of Sun
Microsystems, and in 1990, Michel Suignard and Asmus Freytag from Microsoft and Rick McGowan
of NeXT joined the group. By the end of 1990, most of the work on mapping existing character
encoding standards had been completed, and a final review draft of Unicode was ready.
The Unicode Consortium was incorporated in California on 3 January 1991,[10] and in October 1991,
the first volume of the Unicode standard was published. The second volume, covering Han
ideographs, was published in June 1992.
In 1996, a surrogate character mechanism was implemented in Unicode 2.0, so that Unicode was no
longer restricted to 16 bits. This increased the Unicode codespace to over a million code points,
which allowed for the encoding of many historic scripts (e.g., Egyptian hieroglyphs) and thousands of
rarely used or obsolete characters that had not been anticipated as needing encoding. Among the
characters not originally intended for Unicode are rarely used Kanji or Chinese characters, many of
which are part of personal and place names, making them much more essential than envisioned in
the original architecture of Unicode.[11]
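A hedged Python sketch of the surrogate mechanism: a character beyond U+FFFF, such as U+1F600, does not fit in one 16-bit code unit, so UTF-16 represents it as a pair of surrogates:

    import struct

    ch = "\U0001F600"                     # a code point above U+FFFF
    units = struct.unpack("<2H", ch.encode("utf-16-le"))
    print([hex(u) for u in units])        # ['0xd83d', '0xde00']

    # Recover the code point from the high and low surrogate:
    high, low = units
    assert 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00) == ord(ch)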
The Microsoft TrueType specification version 1.0 from 1992 used the name 'Apple Unicode' instead
of 'Unicode' for the Platform ID in the naming table.

Unicode Consortium
Main article: Unicode Consortium
The Unicode Consortium is a nonprofit organization that coordinates Unicode's development. Full
members include most of the main computer software and hardware companies with any interest in
text-processing standards, including Adobe, Apple, Facebook, Google, IBM, Microsoft, Netflix,
and SAP SE.[12]
Over the years several countries or government agencies have been members of the Unicode
Consortium. Presently only the Ministry of Endowments and Religious Affairs (Oman) is a full
member with voting rights.[12]
The Consortium has the ambitious goal of eventually replacing existing character encoding schemes
with Unicode and its standard Unicode Transformation Format (UTF) schemes, as many of the
existing schemes are limited in size and scope and are incompatible with multilingual environments.

Scripts covered
Main article: Script (Unicode)

[Image: a screenshot from the OpenOffice.org application, demonstrating that many modern applications can render a substantial subset of the scripts in Unicode.]
Unicode currently covers most major writing systems in use today.[13][better source needed]
As of 2022, a total of 161 scripts[14] are included in the latest version of Unicode (covering alphabets, abugidas, and syllabaries), although there are still scripts that are not yet encoded, particularly those mainly used in historical, liturgical, and academic contexts. Characters continue to be added to the already encoded scripts, as do symbols, in particular for mathematics and music (in the form of notes and rhythmic symbols).
The Unicode Roadmap Committee (Michael Everson, Rick McGowan, Ken Whistler, V.S. Umamaheswaran)[15] maintains the list of scripts that are candidates or potential candidates for encoding, along with their tentative code block assignments, on the Unicode Roadmap[16] page of the Unicode Consortium website. For some scripts on the Roadmap, such as Jurchen and Khitan small script, encoding proposals have been made and are working their way through the approval process. For other scripts, such as Mayan (besides numbers) and Rongorongo, no proposal has yet been made, and they await agreement on character repertoire and other details from the user communities involved.
Some modern invented scripts which have not yet been included in Unicode (e.g., Tengwar), or which do not qualify for inclusion in Unicode due to lack of real-world use (e.g., Klingon), are listed in the ConScript Unicode Registry, along with unofficial but widely used Private Use Area code assignments.
There is also a Medieval Unicode Font Initiative focused on special Latin medieval characters. Some of these proposals have already been included in Unicode.

Script Encoding Initiative


The Script Encoding Initiative,[17] a project run by Deborah Anderson at the University of California, Berkeley, was founded in 2002 with the goal of funding proposals for scripts not yet encoded in the standard. The project has become a major source of proposed additions to the standard in recent years.[18]

Versions
The Unicode Consortium and the International Organization for Standardization (ISO) have together
developed a shared repertoire following the initial publication of The Unicode Standard in 1991;
Unicode and the ISO's Universal Coded Character Set (UCS) use identical character names and
code points. However, the Unicode versions do differ from their ISO equivalents in two significant
ways.
While the UCS is a simple character map, Unicode specifies the rules, algorithms, and properties
necessary to achieve interoperability between different platforms and languages. Thus, The Unicode
Standard includes more information, covering—in depth—topics such as bitwise
encoding, collation and rendering. It also provides a comprehensive catalog of character properties,
including those needed for supporting bidirectional text, as well as visual charts and reference data
sets to aid implementers. Previously, The Unicode Standard was sold as a print volume containing
the complete core specification, standard annexes, and code charts. However, Unicode 5.0,
published in 2006, was the last version printed this way. Starting with version 5.2, only the core
specification, published as print-on-demand paperback, may be purchased.[19] The full text, on the
other hand, is published as a free PDF on the Unicode website.
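For a taste of those character properties, a minimal Python sketch: the standard library's unicodedata module (which tracks the Unicode Character Database, typically a version behind the newest standard) exposes a few of them, including the general category and the bidirectional class used for bidirectional text:

    import unicodedata

    for ch in ("A", "\u0664", "\u4e2d"):
        print(
            f"U+{ord(ch):04X}",
            unicodedata.name(ch),
            unicodedata.category(ch),        # general category, e.g. Lu, Nd, Lo
            unicodedata.bidirectional(ch),   # bidirectional class, e.g. L, AN
        )

    # U+0041 LATIN CAPITAL LETTER A Lu L
    # U+0664 ARABIC-INDIC DIGIT FOUR Nd AN
    # U+4E2D CJK UNIFIED IDEOGRAPH-4E2D Lo L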
A practical reason for this publication method highlights the second significant difference between
the UCS and Unicode—the frequency with which updated versions are released and new characters
added. Expanded versions of The Unicode Standard have been released regularly, roughly once a year, occasionally with more than one version in a calendar year and with rare cases where a scheduled release had to be postponed. For instance, in April 2020, only a month after version 13.0 was published, the Unicode Consortium announced that it had changed the intended release date for version 14.0, pushing it back six months from March 2021 to September 2021 due to the COVID-19 pandemic.
The latest version of Unicode, 15.0.0, was released on 13 September 2022. Several annexes were updated, including Unicode Security Mechanisms (UTS #39), and a total of 4489 new characters were encoded, including 20 new emoji characters (such as a "wireless" network symbol and hearts in additional colors such as pink), two new scripts, a CJK Unified Ideographs extension, and multiple additions to existing blocks.[20][21]
