Mark the option containing the sentence which has the least number of vowels

In this program, our task is to count the total number of vowels and consonants present in the given string. As we know that, The characters a, e, i, o, u are known as vowels in the English alphabet. Any character other than that is known as the consonant. To solve this problem, First of all, we need to convert every upper-case character in the string to lower-case so that the comparisons can be done with the lower-case vowels only not upper-case vowels, i.e.(A, E, I, O, U). Then, we have to traverse the string using a for or while loop and match each character with all the vowels, i.e., a, e, I, o, u. If the match is found, increase the value of count by 1 otherwise continue with the normal flow of the program. The algorithm of the program is given below.

Algorithm

  1. Define a string.
  2. Convert the string to lower case so that comparisons can be reduced. Else we need to compare with capital (A, E, I, O, U).
  3. If any character in string matches with vowels (a, e, i, o, u ) then increment the vcount by 1.
  4. If any character lies between 'a' and 'z' except vowels, then increment the count for ccount by 1.
  5. Print both the counts.

Complexity

O(n)

Solution

Python

Output:

Number of vowels: 10
Number of consonants: 17

Output:

Number of vowels: 10
Number of consonants: 17

JAVA

Output:

Number of vowels: 10
Number of consonants: 17

C#

Output:

Number of vowels: 10
Number of consonants: 17

PHP

Output:

Number of vowels: 10
Number of consonants: 17

For this, we've lowercased the line using toLowerCase(). This is an optimization done not to check for capitalized A to Z and vowels.

We've used length() function to know the length of the string and charAt() to get the character at the given index (position).

This annex provides the core documentation for the Unicode Character Database (UCD). It describes the layout and organization of the Unicode Character Database and how it specifies the formal definitions of the Unicode Character Properties.

Status

This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.

A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, but is published online as a separate document. The Unicode Standard may require conformance to normative content in a Unicode Standard Annex, if so specified in the Conformance chapter of that version of the Unicode Standard. The version number of a UAX document corresponds to the version of the Unicode Standard of which it forms a part.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this annex is found in Unicode Standard Annex #41, “Common References for Unicode Standard Annexes.” For the latest version of the Unicode Standard, see [Unicode]. For a list of current Unicode Technical Reports, see [Reports]. For more information about versions of the Unicode Standard, see [Versions]. For any errata which may apply to this annex, see [Errata].

Contents

  • 1 Introduction
  • 2 Conformance
    • 2.1 Simple and Derived Properties
    • 2.2 Use of Default Values
    • 2.3 Stability of Releases
  • 3 Documentation
    • 3.1 Character Properties in the Standard
    • 3.2 The Character Property Model
    • 3.3 NamesList.html
    • 3.4 StandardizedVariants.html
    • 3.5 Emoji Variation Sequences
    • 3.6 Unihan and UAX #38
    • 3.7 UTC-Source Ideographs and UAX #45
    • 3.8 Data File Comments
    • 3.9 Obsolete Documentation Files
  • 4 UCD Files
    • 4.1 Directory Structure
    • 4.2 File Format Conventions
    • 4.3 File List
    • 4.4 Zipped Files
    • 4.5 UCD in XML
  • 5 Properties
    • 5.1 Property Index
    • 5.2 About the Property Table
    • 5.3 Property Definitions
    • 5.4 Derived Extracted Properties
    • 5.5 Contributory Properties
    • 5.6 Case and Case Mapping
    • 5.7 Property Value Lists
    • 5.8 Property and Property Value Aliases
    • 5.9 Matching Rules
    • 5.10 Invariants
    • 5.11 Validation
    • 5.12 Deprecation
    • 5.13 Property APIs
    • 5.14 Character Age
  • 6 Test Files
    • 6.1 NormalizationTest.txt
    • 6.2 Segmentation Test Files and Documentation
    • 6.3 Bidirectional Test Files
  • 7 UCD Change History
  • Acknowledgments
  • References
  • Modifications

Note: the information in this annex is not intended as an exhaustive description of the use and interpretation of Unicode character properties and behavior. It must be used in conjunction with the data in the other files in the Unicode Character Database, and relies on the notation and definitions supplied in The Unicode Standard. All chapter references are to Version 15.0.0 of the standard unless otherwise indicated.

1 Introduction

The Unicode Standard is far more than a simple encoding of characters. The standard also associates a rich set of semantics with each encoded character—properties that are required for interoperability and correct behavior in implementations, as well as for Unicode conformance. These semantics are cataloged in the Unicode Character Database (UCD), a collection of data files which contain the Unicode character code points and character names. The data files define the Unicode character properties and mappings between Unicode characters (such as case mappings).

This annex describes the UCD and provides a guide to the various documentation files associated with it. Additional information about character properties and their use is contained in the Unicode Standard and its annexes. In particular, implementers should familiarize themselves with the formal definitions and conformance requirements for properties detailed in Section 3.5, Properties in [Unicode] and with the material in Chapter 4, Character Properties in [Unicode]. Additional discussion about the Unicode character property model can be found in [UTR23].

The latest version of the UCD is always located on the Unicode website at:

https://www.unicode.org/Public/UCD/latest/

The specific files for the UCD associated with this version of the Unicode Standard (15.0.0) are located at:

https://www.unicode.org/Public/15.0.0/

Stable, archived versions of the UCD associated with all earlier versions of the Unicode Standard can be accessed from:

https://www.unicode.org/ucd/

For a description of the changes in the UCD for this version and earlier versions, see the UCD Change History.

2 Conformance

The Unicode Character Database is an integral part of the Unicode Standard.

The UCD contains normative property and mapping information required for implementation of various Unicode algorithms such as the Unicode Bidirectional Algorithm, Unicode Normalization, and Unicode Casefolding. The data files also contain additional informative and provisional character property information.

Each specification of a Unicode algorithm, whether specified in the text of [Unicode] or in one of the Unicode Standard Annexes, designates which data file(s) in the UCD are needed to provide normative property information required by that algorithm.

For information on the meaning and application of the terms, normative, informative, contributory, and provisional, see Section 3.5, Properties in [Unicode].

For information about the applicable terms of use for the UCD, see the Unicode Terms of Use.

2.1 Simple and Derived Properties

2.1.1 Simple Properties

Some character properties in the UCD are simple properties. This status has no bearing on whether or not the properties are normative, but merely indicates that their values are not derived from some combination of other properties.

2.1.2 Derived Properties

Other character properties are derived. This means that their values are derived by rule from some other combination of properties. Generally such rules are stated as set operations, and may or may not include explicit exception lists for individual characters.

Certain simple properties are defined merely to make the statement of the rule defining a derived property more compact or general. Such properties are known as contributory properties. Sometimes these contributory properties are defined to encapsulate the messiness inherent in exception lists. At other times, a contributory property may be defined to help stabilize the definition of an important derived property which is subject to stability guarantees.

Derived character properties are not considered second-class citizens among Unicode character properties. They are defined to make implementation of important algorithms easier to state. Included among the first-class derived properties important for such implementations are: Uppercase, Lowercase, XID_Start, XID_Continue, Math, and Default_Ignorable_Code_Point, all defined in DerivedCoreProperties.txt, as well as derived properties for the optimization of normalization, defined in DerivedNormalizationProps.txt.

Implementations should simply use the derived properties, and should not try to rederive them from lists of simple properties and collections of rules, because of the chances for error and divergence when doing so.

Definitions of property derivations are provided for information only, typically in comment fields in the data files. Such definitions may be refactored, refined, or corrected over time. These definitions are presented in a modified set notation, expressed as set additions and/or subtractions of various other property values. For example:

# Derived Property: ID_Start
#  Characters that can start an identifier.
#  Generated from:
#      Lu + Ll + Lt + Lm + Lo + Nl
#    + Other_ID_Start
#    - Pattern_Syntax
#    - Pattern_White_Space

When interpreting definitions of derived properties of this sort, keep in mind that set subtraction is not a commutative operation. Thus "Lo + Lm - Pattern_Syntax" defines a different set than "Lo - Pattern_Syntax + Lm". The order of property set operations stated in the definitions affects the composition of the derived set.

If there are any cases of mismatches between the definition of a derived property as listed in DerivedCoreProperties.txt or similar data files in the UCD, and the definition of a derived property as a set definition rule, the explicit listing in the data file should always be taken as the normative definition of the property. As described in Stability of Releases the property listing in the data files for any given version of the standard will never change for that version.

2.1.3 Properties Dependent on External Specifications

In limited cases, a Unicode character property defined in the Unicode Character Database may have an external dependency on another specification which is not a part of the Unicode Standard, and whose data is not formally part of the UCD. In such cases, version stability for the UCD is attained by requiring that dependency to be based on a known, published version of the external specification.

Starting with Version 10.0 of the UCD and continuing through Version 12.1, the clear example of such an external dependency was the derivation of some segmentation-related character properties, in part based on emoji properties associated with UTS #51, "Unicode Emoji" [UTS51]. The details of the derivation were described in the respective annexes, [UAX14] and [UAX29], as well as in the documentation portions of the associated UCD property files. See [Data14] and [Props]. The version of UTS #51 used for those segmentation properties in each of the relevant versions of the UCD was clearly identified in those annexes and data files. Starting with Version 13.0 of the UCD, however, the emoji properties which the UCD previously depended on have been formally incorporated into the UCD, so that they no longer constitute an external dependency.

An external dependency may impact either a simple or a derived property.

2.2 Use of Default Values

Unicode character properties have default values. Default values are the value or values that a character property takes for an unassigned code point, or in some instances, for designated subranges of code points, whether assigned or unassigned. For example, the default value of a binary Unicode character property is always "N".

For the formal discussion of default values, see D26 in Section 3.5, Properties in [Unicode]. For conventions related to default values in various data files of the UCD and for documentation regarding the particular default values of individual Unicode character properties, see Default Values.

2.3 Stability of Releases

Just as for the Unicode Standard as a whole, each version of the UCD, once published, is absolutely stable and will never change. Each released version is archived in a directory on the Unicode website, with a directory number associated with that version. URLs pointing to that version's directory are also stable and will be maintained in perpetuity.

Any errors discovered for a released version of the UCD are noted in [Errata], and if appropriate will be corrected in a subsequent version of the UCD.

Stability guarantees constraining how Unicode character properties can (or cannot) change between releases of the UCD are documented in the Unicode Consortium Stability Policies [Stability].

2.3.1 Changes to Properties Between Releases

Updates to character properties in the Unicode Character Database may be required for any of three reasons:

  1. To cover new characters added to the standard
  2. To add new character properties to the standard
  3. To change the assigned values for a property for some characters already in the standard

While the Unicode Consortium endeavors to keep the values of all character properties as stable as possible between versions, occasionally circumstances may arise which require changing them. In particular, as less well-documented scripts, such as those for minority languages, or historic scripts are added to the standard, the exact character properties and behavior may not fully be known when the script is first encoded. The properties for some of these characters may change as further information becomes available or as implementations turn up problems in the initial property assignments. As far as possible, any readjustment of property values based on growing implementation experience is made to be compatible with established practice.

All changes to normative or informative property values, to the status or type of a property, or to property or property value aliases, must be approved by an explicit decision taken by the Unicode Technical Committee. Changes to provisional property values are subject to less stringent oversight.

Occasionally, a character property value is changed to prevent incorrect generalizations about a character's use based on its nominal property values. For example, U+200B ZERO WIDTH SPACE was originally classified as a space character (General_Category=Zs), but it was reclassified as a Format character (General_Category=Cf) to clearly distinguish it from space characters in its function as a format control for line breaking.

There is no guarantee that a particular value for an enumerated property will actually have characters associated with it. Also, because of changes in property value assignments between versions of the standard, a property value that once had characters associated with it may later have none. Such conditions and changes are rare, but implementations must not assume that all property values are associated with non-null sets of characters. For example, currently the special Script property value Katakana_Or_Hiragana has no characters associated with it.

2.3.2 Obsolete Properties

In some instances an entire property may become obsolete. For example, the ISO_Comment property was once used to keep track of annotations for characters used in the production of name lists for ISO/IEC 10646 code charts. As of Unicode 5.2.0 that property became obsolete, and its value is now defaulted to the null string for all Unicode code points.

An obsolete property is never removed from the UCD.

2.3.3 Deprecated Properties

Occasionally an obsolete property may also be formally deprecated. This is an indication that the property is no longer recommended for use, perhaps because its original intent has been replaced by another property or because its specification was somehow defective. See also the general discussion of Deprecation.

A deprecated property is never removed from the UCD.

Table 1 lists the properties that are formally deprecated as of this version of the Unicode Standard.

Table 1. Deprecated Properties

Property NameDeprecation VersionReasonGrapheme_Link5.0.0Duplication of ccc=9Hyphen6.0.0Supplanted by Line_Break property valuesISO_Comment6.0.0No longer needed for chart generation; otherwise not usefulExpands_On_NFC6.0.0Less useful than UTF-specific calculationsExpands_On_NFD6.0.0Less useful than UTF-specific calculationsExpands_On_NFKC6.0.0Less useful than UTF-specific calculationsExpands_On_NFKD6.0.0Less useful than UTF-specific calculationsFC_NFKC_Closure6.0.0Supplanted in usage by NFKC_Casefold; otherwise not useful

 

2.3.4 Stabilized Properties

Another possibility is that an obsolete property may be declared to be stabilized. Such a determination does not indicate that the property should or should not be used; instead it is a declaration that the UTC (Unicode Technical Committee) will no longer actively maintain the property or extend it for newly encoded characters. The property values of a stabilized property are frozen as of a particular release of the standard.

A stabilized property is never removed from the UCD.

Table 2 lists the properties that are formally stabilized as of this version of the Unicode Standard.

Table 2. Stabilized Properties

Property NameStabilization VersionHyphen4.0.0ISO_Comment6.0.0

 

3 Documentation

This annex provides the core documentation for the UCD, but additional information about character properties is available in other parts of the standard and in additional documentation files contained within the UCD.

3.1 Character Properties in the Standard

The formal definitions related to character properties used by the Unicode Standard are documented in Section 3.5, Properties in [Unicode]. Understanding those definitions and related terminology is essential to the appropriate use of Unicode character properties.

See Section 4.1, Unicode Character Database, in [Unicode] for a general discussion of the UCD and its use in defining properties. The rest of Chapter 4 provides important explanations regarding the meaning and use of various normative character properties.

3.2 The Character Property Model

For a general discussion of the property model which underlies the definitions associated with the UCD, see Unicode Technical Report #23, "The Unicode Character Property Model" [UTR23]. That technical report is informative, but over the years various content from it has been incorporated into normative portions of the Unicode Standard, particularly for the definitions in Chapter 3.

UTR #23 presents the important distinction between properties defined for strings (in contrast to properties defined for characters or code points) and character properties that have values that are strings. The latter are referred to as string-valued properties in UTR #23 and in this annex. UTR #23 also discusses string functions and their relation to character properties.

3.3 NamesList.html

NamesList.html formally describes the format of the NamesList.txt data file in BNF. That data file is used to drive the PDF formatting of the Unicode code charts and names list. See also Section 24.1, Character Names List, in [Unicode] for a detailed discussion of the conventions used in the Unicode names list as formatted for the online code charts.

3.4 StandardizedVariants.html

StandardizedVariants.html has been obsoleted as of Version 9.0 of the UCD. This file formerly documented standardized variants, showing a representative glyph for each. It was closely tied to the data file, StandardizedVariants.txt, which defines those sequences normatively.

The function of StandardizedVariants.html to show representative glyphs for standardized variants has been superseded. There are now better means of illustrating the glyphs. Many standardized variation sequences are shown in the Unicode code charts directly, in summary sections at the ends of the names list for any block which contains them. Glyphs for standardized variants of CJK compatibility ideographs are also shown directly in the Unicode code charts.

3.5 Emoji Variation Sequences

Emoji variation sequences are a special class of variation sequences involving emoji characters. They are divided into two subtypes: an emoji presentation sequence, consisting of an emoji character base followed by the variation selector U+FE0F, and a text presentation sequence, consisting of an emoji character base followed by the variation selector U+FE0E. Such sequences come in pairs: the text presentation sequence shown with a black and white presentation, as seen in the Unicode code charts, and the emoji presentation sequence shown with a colorful icon, as usually seen in implementations on mobile devices and elsewhere.

Starting with Version 9.0.0, the following page in the Unicode emoji subsite area shows appropriate representative glyphs for all emoji variation sequences, with separate columns for text presentation sequences and for emoji presentation sequences:

https://www.unicode.org/emoji/charts/emoji-variants.html

The data file which defines the exact list of emoji variation sequences is emoji-variation-sequences.txt. That file is maintained in the UCD, but emoji variation sequences are documented in Unicode Technical Standard #51, Unicode Emoji [UTS51].

3.6 Unihan and UAX #38

Unicode Standard Annex #38, "Unicode Han Database (Unihan)" [UAX38] describes the format and content of the Unihan Database [Unihan], which collects together all property information for CJK unified ideographs. That annex also specifies in detail which of the Unihan character properties are normative, informative, or provisional.

The Unihan Database contains extensive and detailed mapping information for CJK unified ideographs encoded in the Unicode Standard, but it is aimed only at those ideographs, not at other characters used in the East Asian context in general. In contrast, East Asian legacy character sets, including important commercial and national character set standards, contain many non-CJK characters. As a result, the Unihan Database must be supplemented from other sources to establish mapping tables for those character sets.

The majority of the content of the Unihan Database is released for each version of the Unicode Standard as a collection of Unihan data files in the UCD. Because of their large size, these data files are released only as a zipped file, Unihan.zip. The details of the particular data files in Unihan.zip and the CJK properties each one contains are provided in [UAX38]. For versions of the UCD prior to Version 5.2.0, all of the CJK properties were listed together in a very large, single file, Unihan.txt.

3.7 UTC-Source Ideographs and UAX #45

Unicode Standard Annex #45, "U-Source Ideographs" [UAX45] describes the format of USourceData.txt, which lists all of the information for UTC-Source ideographs.

3.8 Data File Comments

In addition to the specific documentation files for the UCD, individual data files often contain extensive header comments describing their content and any special conventions used in the data.

In some instances, individual property definition sections also contain comments with information about how the property may be derived. Such comments are informative; while they are intended to convey the intent of the derivation, in case of any mismatch between a statement of a derivation in a comment field and the actual listing of the derived property, the list is considered to be definitive. See Simple and Derived Properties.

3.9 Obsolete Documentation Files

UCD.html was formerly the primary documentation file for the UCD. As of Version 5.2.0, its content has been wholly incorporated into this document.

Unihan.html was formerly the primary documentation file for the Unihan Database. As of Version 5.1.0, its content has been wholly incorporated into [UAX38].

Versions of the Unicode Standard prior to Version 4.0.0 contained small, focused documentation files, UnicodeCharacterDatabase.html, PropList.html, and DerivedProperties.html, which were later consolidated into UCD.html.

StandardizedVariants.html has been obsoleted as of Version 9.0.0. See Section 3.4, StandardizedVariants.html.

4 UCD Files

The heart of the UCD consists of the data files themselves. This section describes the directory structure for the UCD, the format conventions for the data files, and provides documentation for data files not documented elsewhere in this annex.

4.1 Directory Structure

Each version of the UCD is released in a separate, numbered directory under the Public directory on the Unicode website. The content of that directory is complete for that release. It is also stable—once released, it will be archived permanently in that directory, unchanged, at a stable URL.

The specific files for the UCD associated with this version of the Unicode Standard (15.0.0) are located at:

https://www.unicode.org/Public/15.0.0/

The latest released version of the UCD is always accessible via the following stable URL:

https://www.unicode.org/Public/UCD/latest/

Zipped copies of the latest released version of the UCD are always accessible via the following stable URL:

https://www.unicode.org/Public/zipped/latest/

Prior to Version 6.3.0, access to the latest released version of the UCD was via the following stable URL:

https://www.unicode.org/Public/UNIDATA/

That "UNIDATA" URL will be maintained, but is no longer recommended, because it points to the ucd subdirectory of the latest release, rather than to the parent directory for the release. The "UNIDATA" naming convention is also very old, and does not follow the directory naming conventions currently used for other data releases in the Public directory on the Unicode website.

4.1.1 UCD Files Proper

The UCD proper is located in the ucd subdirectory of the numbered version directory. That directory contains all of the documentation files and most of the data files for the UCD, including some data files for derived properties.

Although all UCD data files are version-specific for a release and most contain internal date and version stamps, the file names of the released data files do not differ from version to version. When linking to a version-specific data file, the version will be indicated by the version number of the directory for the release.

All files for derived extracted properties are in the extracted subdirectory of the ucd subdirectory. See Derived Extracted Properties for documentation regarding those data files and their content.

A number of auxiliary properties are specified in files in the auxiliary subdirectory of the ucd subdirectory. It contains data files specifying properties associated with Unicode Standard Annex #29, "Unicode Text Segmentation" [UAX29] and with Unicode Standard Annex #14, "Unicode Line Breaking Algorithm" [UAX14], as well as test data for those algorithms. See Segmentation Test Files and Documentation for more information about the test data.

Certain data files associated with emoji properties are maintained in the emoji subdirectory of the ucd subdirectory. Those data files define the simple character properties associated with emoji characters, as well as the emoji variation sequences. Other data files associated with emoji, including those which define the RGI ("recommended for general interchange") sets of various types of emoji sequences, as well as emoji test data, are maintained elsewhere, and are not considered formally a part of the UCD. See [UTS51] for documentation regarding those data files and their content.

4.1.2 UCD XML Files

The XML version of the UCD is located in the ucdxml subdirectory of the numbered version directory. See the UCD in XML for more details.

4.1.3 Charts

The code charts specific to a version of Unicode are archived as a single large PDF file in the charts subdirectory of the numbered version directory. See the readme.txt in that subdirectory and the general web page explaining the Unicode Code Charts for more details.

4.1.4 Beta Review Considerations

Prior to the formal release for any particular version of the UCD, a beta review is conducted. The beta review files are located in the same directory that is later used for the released UCD, but during the beta review period, the subdirectory structure differs somewhat and may contain temporary files, including documentation of diffs between deltas for the beta review. Also, during the beta review, all data file names are suffixed with version numbers and delta numbers. So a typical file name during beta review may be "PropList-5.2.0d13.txt" instead of the finally released "PropList.txt".

Notices contained in a ReadMe.txt file in the UCD directory during the beta review period also make it clear that that directory contains preliminary material under review, rather than a final, stable release.

4.1.5 File Directory Differences for Early Releases

The UCD in XML was introduced in Version 5.1.0, so UCD directories prior to that do not contain the ucdxml subdirectory.

UCD directories prior to Version 13.0.0 do not contain the emoji subdirectory.

UCD directories prior to Version 4.1.0 do not contain the auxiliary subdirectory.

UCD directories prior to Version 3.2.0 do not contain the extracted subdirectory.

The general structure of the file directory for a released version of the UCD described above applies to Versions 4.1.0 and later. Prior to Version 4.1.0, versions of the UCD were not self-contained, complete sets of data files for that version, but instead only contained any new data files or any data files which had changed since the prior release.

Because of this, the property files for a given version prior to Version 4.1.0 can be spread over several directories. Consult the component listings at Enumerated Versions to find out which files in which directories comprise a complete set of data files for that version.

The directory naming conventions and the file naming conventions also differed prior to Version 4.1.0. So, for example, Version 4.0.0 of the UCD is contained in a directory named 4.0-Update, and Version 4.0.1 of the UCD in a directory named 4.0-Update1. Furthermore, for these earlier versions, the data file names do contain explicit version numbers.

4.2 File Format Conventions

Files in the UCD use the format conventions described in this section, unless otherwise specified.

4.2.1 Data Fields

  • Each line of data consists of fields separated by semicolons. The fields are numbered starting with zero.
  • The first field (0) of each line in the Unicode Character Database files represents a code point or range. The remaining fields (1..n) are properties associated with that code point.
  • Leading and trailing spaces within a field are not significant. However, no leading or trailing spaces are allowed in any field of UnicodeData.txt. For legacy reasons, no spaces are allowed before or after the semicolon in LineBreak.txt and in EastAsianWidth.txt.
  • The Unihan data files [Unihan] in the UCD have a separate format, using tab characters instead of semicolons to separate fields. See [UAX38] for the detailed specification of the format of the Unihan data files. The data files TangutSources.txt and NushuSources.txt also use this format.

4.2.2 Code Points and Sequences

  • Code points are expressed as hexadecimal numbers with four to six digits. (See Appendix A, Notational Conventions in [Unicode] for a full, formal definition of this convention.) They are written without the "U+" prefix in all data files except the Unihan data files. The Unihan data files use the "U+" prefix for all Unicode code points, to distinguish them from other decimal and hexadecimal numerical references occurring in their data fields.
  • When a data field contains a sequence of code points, spaces separate the code points.

4.2.3 Code Point Ranges

  • A range of code points is specified by the form "X..Y".
  • Each code point in a range has the associated property value specified on a data file. For example (from Blocks.txt):
    0000..007F; Basic Latin
    0080..00FF; Latin-1 Supplement
          
  • For backward compatibility, ranges in the file UnicodeData.txt are specified by entries for the start and end characters of the range, rather than by the form "X..Y". The start character is indicated by a range identifier, followed by a comma and the string "First", in angle brackets. This entry takes the place of a regular character name in field 1 for that line. The end character is indicated on the next line with the same range identifier, followed by a comma and the string "Last", in angle brackets:
    4E00;;Lo;0;L;;;;;N;;;;;
    9FEF;;Lo;0;L;;;;;N;;;;;
          
    For character ranges using this convention, the names of all characters in the range are algorithmically derivable. See Section 4.8, Name in [Unicode] for more information on derivation of character names for such ranges.

4.2.4 Comments

  • U+0023 NUMBER SIGN ("#") is used to indicate comments: all characters from the number sign to the end of the line are considered part of the comment, and are disregarded when parsing data.
  • In many files, the comments on data lines use a common format, as illustrated here (from Scripts.txt):
    09B2          ; Bengali # Lo       BENGALI LETTER LA
  • The first part of a comment using this common format is the General_Category value, provided for information. This is followed by the character name for the code point in the first field (0).
  • The printing of the General_Category value is suppressed in instances where it would be redundant, as for DerivedGeneralCategory.txt, in which the value of the property value in the data field is already the General_Category value.
  • The symbol "L&" indicates characters of General_Category Lu, Ll, or Lt (uppercase, lowercase, or titlecase letter). For example:
    0386          ; Greek # L&       GREEK CAPITAL LETTER ALPHA WITH TONOS
    L& as used in these comments is an alias for the derived LC value (cased letter) for the General_Category property, as documented in PropertyValueAliases.txt.
  • When the data line contains a range of code points, this common format for a comment also indicates a range of character names, separated by "..", as illustrated here (from DerivedNumericType.txt):
    00BC..00BE    ; Numeric # No   [3] VULGAR FRACTION ONE QUARTER..VULGAR FRACTION THREE QUARTERS
  • Normally, consecutive characters with the same property value would be represented by a single code point range. In data files using this comment convention, such ranges are subdivided so that all characters in a range also have the same General_Category value (or LC). While this convention results in more ranges than are strictly necessary, it makes the contents of the ranges clearer.
  • When a code point range occurs, the number of items in the range is included in the comment (in square brackets), immediately following the General_Category value.
  • The comments are purely informational, and may change format or be omitted in the future. They should not be parsed for content. However, see Section 4.2.10 @missing Conventions.

4.2.5 Code Point Labels

  • Surrogate code points, private-use characters, control codes, noncharacters, and unassigned code points have no names. When such code points are listed in the data files, for example to list their General_Category values, the comments use code point labels instead of character names. For example (from DerivedCoreProperties.txt):
    2065          ; Default_Ignorable_Code_Point # Cn       
  • Although code point labels are not formally character names and are not considered values of the Name property for characters, they are designed to be maintained as unique values within the namespace for Unicode character names. Hence, implementations can safely use them as identifiers for code points without overlap with actual character names.
  • Code point labels use one of the tags as documented in Section 4.8, Name in [Unicode] and as shown in Table 3, followed by "-" and the code point expressed in hexadecimal. The entire label is then enclosed in angle brackets when listed in data files of the UCD.

Table 3. Code Point Label Tags

TagGeneral_CategoryNotereservedCnNoncharacter_Code_Point=FnoncharacterCnNoncharacter_Code_Point=TcontrolCc private-useCo surrogateCs 

 

4.2.6 Multiple Properties in One Data File

  • When a file contains the specification for multiple properties, the second field specifies the name of the property and the third field specifies the property value. For example (from DerivedNormalizationProps.txt):
    03D2  ; FC_NFKC; 03C5           # L&  GREEK UPSILON WITH HOOK SYMBOL
    03D3  ; FC_NFKC; 03CD           # L&  GREEK UPSILON WITH ACUTE AND HOOK SYMBOL
          

4.2.7 Binary Property Values

  • For binary properties, the second field specifies the name of the applicable property, with the implied value of the property being "True". Only the ranges of characters with the binary property value of "Y" (= True) are listed. For example (from PropList.txt):
    1680       ; White_Space # Zs      OGHAM SPACE MARK
    2000..200A ; White_Space # Zs [11] EN QUAD..HAIR SPACE
          

4.2.8 Multiple Values for Properties

  • When a data file defines a property which may take multiple values for a single code point, the multiple values are expressed in a space-delimited list. For example (from ScriptExtensions.txt):
    0640          ; Adlm Arab Mand Mani Phlp Rohg Sogd Syrc # Lm       ARABIC TATWEEL
          
  • In some cases—but not all—the order of multiple elements in a space-delimited list may be significant. When the order of multiple elements is significant, it is documented along with the property itself. For example (from Unihan_Readings.txt), for the tag kMandarin, when there are two values for a code point, the first value is used to indicate a preferred pronunciation for zh-Hans (CN) and the second a preferred pronunciation for zh-Hant (TW).
  • For further discussion, see Section 5.7.6 Properties Whose Values Are Sets of Values.

4.2.9 Default Values

  • Entries for a code point may be omitted in a data file if the code point has a default value for the property in question.
  • For string-valued properties, including the definition of foldings, the default value is the code point of the character itself.
  • For miscellaneous properties which take strings as values, such as the Unicode Name property, the default value is a null string.
  • For binary properties except for Extended_Pictographic, the default value is always "N" (= False) and is always omitted.
  • For enumerated and catalog properties, the default value is listed in a comment. For example (from Scripts.txt):
    0000..007F; Basic Latin
    0080..00FF; Latin-1 Supplement
          
    0
  • A few properties of the enumerated type have multiple default values. In those cases, comments in the file explain the code point ranges for applicable values. See also Table 4.
  • Default values are also listed in specially-formatted comment lines, using the keyword "@missing". Parsers which extract and process these lines can algorithmically determine the default values for all code points. See @missing Conventions for details about the syntax and use of these lines.
  • Because of the legacy format constraints for UnicodeData.txt, that file contains no specific information about default values for properties. The default values for fields in UnicodeData.txt are documented in Table 4 below if they cannot be derived from the general rules about default values for properties.
  • The file ArabicShaping.txt is also exceptional, because it omits the listing of many characters whose property value (jt=T) can be derived by rule. Adding an "@missing" line to that file would result in the wrong interpretation of Joining_Type values for omitted characters. The full explicit listing of Joining_Type values and the correct "@missing" line for the default Joining_Type value (jt=U) can be found in the file DerivedJoiningType.txt instead.

Default values for common catalog, enumeration, and numeric properties are listed in Table 4, along with the exceptional binary property, Extended_Pictographic. Further explanation is provided below the table, in those cases where the default values are complex, as indicated in the third column.

Table 4. Default Values for Properties

Property NameDefault Value(s)Complex?AgeUnassigned (= NA)NoBidi_ClassL, AL, R, BN, ETYesBlockNo_BlockNoCanonical_Combining_ClassNot_Reordered (= 0)NoDecomposition_TypeNoneNoEast_Asian_WidthNeutral (= N), Wide (= W)YesExtended_PictographicN (= False), Y (= True)YesGeneral_CategoryCnNoLine_BreakUnknown (= XX), ID, PRYesNumeric_TypeNoneNoNumeric_ValueNaNNoScriptUnknown (= Zzzz)NoVertical_OrientationRotated (= R), Upright (= U)Yes

4.2.9.1 Complex Default Values

Complex default values are those which take multiple values, contingent on code point ranges or other conditions. Complex default values other than those specified in the "@missing" line are explicitly listed in the relevant property file, except for instances noted in this section. This means that a parser extracting property values from the UCD should never encounter an ambiguous condition for which the default value of a property for a particular code point is unclear.

  • Bidi_Class:
    See Unicode Standard Annex #9, "Unicode Bidirectional Algorithm" [UAX9] and DerivedBidiClass.txt for full details.
  • East_Asian_Width:
    This property defaults to Neutral for most code points, but defaults to Wide for unassigned code points in blocks associated with CJK ideographs. See Unicode Standard Annex #11, "East Asian Width" [UAX11] and EastAsianWidth.txt for documentation of the default values and DerivedEastAsianWidth.txt for the full listing of values.
  • Line_Break:
    This property defaults to Unknown for most code points, but defaults to ID for unassigned code points in blocks associated with CJK ideographs, and in blocks in the ranges U+1F000..U+1FAFF and U+1FC00..U+1FFFD. The property defaults to PR for unassigned code points in the Currency Symbols block. See Unicode Standard Annex #14, "Unicode Line Breaking Algorithm" [UAX14] and LineBreak.txt for documentation of the default values and DerivedLineBreak.txt for the full listing of values.
  • Extended_Pictographic:
    This property defaults to N (= False) for most code points, but defaults to Y (= True) for unassigned code points in blocks in the ranges U+1F000..U+1FAFF and U+1FC00..U+1FFFD. Those ranges are correlated with the ranges associated with default values for the Line_Break property, and have the same rationale. They help future-proof the behavior of Unicode segmentation algorithms for code point ranges most likely to be used for future assignment of new emoji characters.
  • Vertical_Orientation:
    This property defaults to Rotated (R) for most code points, but defaults to Upright (U) for unassigned code points in blocks associated with scripts that are themselves predominantly Upright, in blocks for some notational systems, and in blocks predominantly associated with pictographic symbols and emoji. See Unicode Standard Annex #50, "Unicode Vertical Text Layout" [UAX50] and VerticalOrientation.txt for full details.

4.2.10 @missing Conventions

Specially-formatted comment lines with the keyword "@missing" are used to define default property values for ranges of code points not explicitly listed in a data file. These lines follow regular conventions that make them machine-readable.

An @missing line starts with the comment character "#", followed by a space, then the "@missing" keyword, followed by a colon, another space, a code point range, and a semicolon. Then the line typically continues with a semicolon-delimited list of one or more default property values. For example:

0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
      
1

In general, the code point range and semicolon-delimited list follow the same syntactic conventions as the data file in which the @missing line occurs, so that any parser which interprets that data file can easily be adapted to also parse and interpret an @missing line to pick up default property values for code points.

@missing lines are also supplied for many properties in the file PropertyValueAliases.txt. In this case, because there are many @missing lines in that single data file, each @missing line in that file uses the syntactic pattern code_point_range; property_name; default_prop_val.

An @missing line is never provided for a binary property, because the default value for binary properties is always "N" and need not be defined redundantly for each binary property.

Because of the addition of property names when @missing lines are included in PropertyValueAliases.txt, there are currently two syntactic patterns used for @missing lines, as summarized schematically below:

  1. code_point_range; default_prop_val
  2. code_point_range; property_name; default_prop_val

In this schematic representation, "default_prop_val" stands in for either an explicit property value or for a special tag such as or

Numeric_Type is extracted as follows. If fields 6, 7, and 8 in UnicodeData.txt are all non-empty, then Numeric_Type=Decimal. Otherwise, if fields 7 and 8 are both non-empty, then Numeric_Type=Digit. Otherwise, if field 8 is non-empty, then Numeric_Type=Numeric. For characters listed in the Unihan data files, Numeric_Type=Numeric for characters that have kPrimaryNumeric, kAccountingNumeric, or kOtherNumeric tags. The default value is Numeric_Type=None.

5.5 Contributory Properties

Contributory properties contain sets of exceptions used in the generation of other properties derived from them. The contributory properties specifically concerned with identifiers and casing contribute to the maintenance of stability guarantees for properties and/or to invariance relationships between related properties. Other contributory properties are simply defined as a convenience for property derivation.

Most contributory properties have names using the pattern "Other_XXX" and are used to derive the corresponding "XXX" property. For example, the Other_Alphabetic property is used in the derivation of the Alphabetic property.

Contributory properties are typically defined in PropList.txt and the corresponding derived property is then listed in DerivedCoreProperties.txt.

Jamo_Short_Name is an unusual contributory property, both in terms of its name and how it is used. It is defined in its own property file, Jamo.txt, and is used to derive the Name property value for Hangul syllable characters, according to the rules spelled out in Section 3.12, Conjoining Jamo Behavior in [Unicode].

Contributory is considered to be a distinct status for a Unicode character property. Contributory properties are neither normative nor informative. This distinct status is marked with the symbol "C" in the status column in the property table. For convenience of reference, all contributory properties are also listed in Table 10a, along with the properties whose derivation they contribute to.

Table 10a. Contributory Properties

FilePropertyUsed in Derivation ofJamo.txtJamo_Short_NameNamePropList.txtOther_AlphabeticAlphabeticOther_Default_Ignorable_Code_PointDefault_Ignorable_Code_PointOther_Grapheme_ExtendGrapheme_ExtendOther_ID_StartID_Start, XID_StartOther_ID_ContinueID_Continue, XID_ContinueOther_LowercaseLowercaseOther_MathMathOther_UppercaseUppercase

Contributory properties are incomplete by themselves and are not intended for independent use. For example, an API returning Unicode property values should implement the derived core properties such as Alphabetic or Default_Ignorable_Code_Point, rather than the corresponding contributory properties, Other_Alphabetic or Other_Default_Ignorable_Code_Point.

5.6 Case and Case Mapping

Case for bicameral scripts and case mapping of characters are complicated topics in the Unicode Standard—both because of their inherent algorithmic complexity and because of the number of characters and special edge cases involved.

This section provides a brief roadmap to discussions about these topics, and specifications and definitions in the standard, as well as explaining which case-related properties are defined in the UCD.

Section 3.13, Default Case Algorithms in [Unicode] provides formal definitions for a number of case-related concepts (cased, case-ignorable, ...), for case conversion (toUppercase(X), ...), and for case detection (isUppercase(X), ...). It also provides the formal definition of caseless matching for the standard, taking normalization into account.

Section 4.2, Case in [Unicode] introduces case and case mapping properties. Table 4-3, Sources for Case Mapping Information in [Unicode] describes the kind of case-related information that is available in various data files of the UCD. Table 11 lists those data files again, giving the explicit list of case-related properties defined in each. The link on each property leads its description in Table 9, Property Table.

Table 11. UCD Files and Case Properties

File NameCase PropertiesUnicodeData.txtSimple_Uppercase_Mapping, Simple_Lowercase_Mapping, Simple_Titlecase_MappingSpecialCasing.txtUppercase_Mapping, Lowercase_Mapping, Titlecase_MappingCaseFolding.txtSimple_Case_Folding, Case_FoldingDerivedCoreProperties.txtUppercase, Lowercase, Cased, Case_Ignorable, Changes_When_Lowercased, Changes_When_Uppercased, Changes_When_Titlecased, Changes_When_Casefolded, Changes_When_CasemappedDerivedNormalizationProps.txtNFKC_Casefold, Changes_When_NFKC_CasefoldedPropList.txtSoft_Dotted, Other_Uppercase, Other_Lowercase

For compatibility with existing parsers, UnicodeData.txt only contains case mappings for characters where they constitute one-to-one mappings; it also omits information about context-sensitive case mappings. Information about these special cases can be found in the separate data file, SpecialCasing.txt, expressed as separate properties.

Section 5.18, Case Mappings, in [Unicode] discusses various implementation issues for handling case, including language-specific case mapping, as for Greek and for Turkish. That section also describes case folding in particular detail.

The special casing conditions associated with case mapping for Greek, Turkish, and Lithuanian are specified in an additional field in SpecialCasing.txt. For example, the lowercase mapping for sigma in Greek varies according to its position in a word. The condition list does not constitute a formal character property in the UCD, because it is a statement about the context of occurrence of casing behavior for a character or characters, rather than a semantic attribute of those characters. Versions of the UCD from Version 3.2.0 to Version 5.0.0 did list property aliases for Special_Case_Condition (scc), but this was determined to be an error when the UCD was analyzed for representation in XML; consequently, the Special_Case_Condition property aliases were removed as of Version 5.1.0.

Caseless matching is of particular concern for a number of text processing algorithms, so is also discussed at some length in Unicode Standard Annex #31, "Unicode Identifier and Pattern Syntax" [UAX31] and in Unicode Technical Standard #10, "Unicode Collation Algorithm" [UTS10].

Further information about locale-specific casing conventions can be found in the Unicode Common Locale Data Repository [CLDR].

5.7 Property Value Lists

The following subsections give summaries of property values for certain Enumeration properties. Other property values are documented in other, topically-specific annexes; for example, the Line_Break property values are documented in Unicode Standard Annex #14, "Unicode Line Breaking Algorithm" [UAX14] and the various segmentation-related property values are documented in Unicode Standard Annex #29, "Unicode Text Segmentation" [UAX29].

5.7.1 General Category Values

The General_Category property of a code point provides for the most general classification of that code point. It is usually determined based on the primary characteristic of the assigned character for that code point. For example, is the character a letter, a mark, a number, punctuation, or a symbol, and if so, of what type? Other General_Category values define the classification of code points which are not assigned to regular graphic characters, including such statuses as private-use, control, surrogate code point, and reserved unassigned.

Many characters have multiple uses, and not all such cases can be captured entirely by the General_Category value. For example, the General_Category value of Latin, Greek, or Hebrew letters does not attempt to cover (or preclude) the numerical use of such letters as Roman numerals or in other numerary systems. Conversely, the General_Category of ASCII digits 0..9 as Nd (decimal digit) neither attempts to cover (or preclude) the occasional use of these digits as letters in various orthographies. The General_Category is simply the first-order, most usual categorization of a character.

For more information about the General_Category property, see Chapter 4, Character Properties in [Unicode].

The values in the General_Category field in UnicodeData.txt make use of the short, abbreviated property value aliases for General_Category. For convenience in reference, Table 12 lists all the abbreviated and long value aliases for General_Category values, reproduced from PropertyValueAliases.txt, along with a brief description of each category.

Table 12. General_Category Values

AbbrLongDescriptionLuUppercase_Letteran uppercase letterLlLowercase_Lettera lowercase letterLtTitlecase_Lettera digraphic character, with first part uppercaseLCCased_LetterLu | Ll | LtLmModifier_Lettera modifier letterLoOther_Letterother letters, including syllables and ideographsLLetterLu | Ll | Lt | Lm | LoMnNonspacing_Marka nonspacing combining mark (zero advance width)McSpacing_Marka spacing combining mark (positive advance width)MeEnclosing_Markan enclosing combining markMMarkMn | Mc | MeNdDecimal_Numbera decimal digitNlLetter_Numbera letterlike numeric characterNoOther_Numbera numeric character of other typeNNumberNd | Nl | NoPcConnector_Punctuationa connecting punctuation mark, like a tiePdDash_Punctuationa dash or hyphen punctuation markPsOpen_Punctuationan opening punctuation mark (of a pair)PeClose_Punctuationa closing punctuation mark (of a pair)PiInitial_Punctuationan initial quotation markPfFinal_Punctuationa final quotation markPoOther_Punctuationa punctuation mark of other typePPunctuationPc | Pd | Ps | Pe | Pi | Pf | PoSmMath_Symbola symbol of mathematical useScCurrency_Symbola currency signSkModifier_Symbola non-letterlike modifier symbolSoOther_Symbola symbol of other typeSSymbolSm | Sc | Sk | SoZsSpace_Separatora space character (of various non-zero widths)ZlLine_SeparatorU+2028 LINE SEPARATOR onlyZpParagraph_SeparatorU+2029 PARAGRAPH SEPARATOR onlyZSeparatorZs | Zl | ZpCcControla C0 or C1 control codeCfFormata format control characterCsSurrogatea surrogate code pointCoPrivate_Usea private-use characterCnUnassigneda reserved unassigned code point or a noncharacterCOtherCc | Cf | Cs | Co | Cn

Note that the value gc=Cn does not actually occur in UnicodeData.txt, because that data file does not list unassigned code points.

The distinctions between some General_Category values are somewhat arbitrary for edge cases, particularly those involving symbols and punctuation. For example, a number of multiple-function ASCII characters, including "@", "#", "%", and "&", have long been classified as Other_Punctuation (gc=Po), although they are not among the characters used as punctuation marks in traditional Western typography. Other characters may also be ambiguous between functioning to organize and delimit textual units (punctuation-like) or to represent concepts (symbol-like). Likewise, it may not always be clear whether some symbols are primarily used for mathematics or whether they are general symbols with occasional or even common use in mathematics. For example, many arrow symbols are classed as Other_Symbol, although they are widely used in mathematics. The General_Category values constitute a rough partitioning of characters to make distinctions for algorithmic processing, but do not provide a definitive classification for such overlapping or ambiguous usage of characters.

Characters with the quotation-related General_Category values Pi or Pf may behave like opening punctuation (gc=Ps) or closing punctuation (gc=Pe), depending on usage and quotation conventions.

General_Category values in the table highlighted in light blue (LC, L, M, N, P, S, Z, C) stand for groupings of related General_Category values. The classes they represent can be derived by unions of the relevant simple values, as shown in the table. The abbreviated and long value aliases for these classes are provided as a convenience for implementations, such as regex, which may wish to match more generic categories, such as "letter" or "number", rather than the detailed subtypes for General_Category. These aliases for groupings of General_Category values do not occur in UnicodeData.txt, which instead always specifies the enumerated subtype for the General_Category of a character.

The symbol "L&" is a label used to stand for any combination of uppercase, lowercase or titlecase letters (Lu, Ll, or Lt), in the first part of comments in the data files of the UCD. It is equivalent to gc=LC, but is only a label in comments, and is not expected to be used as an identifier for regular expression matching.

The Unicode Standard does not assign nondefault property values to control characters (gc=Cc), except for certain well-defined exceptions involving the Unicode Bidirectional Algorithm, the Unicode Line Breaking Algorithm, and Unicode Text Segmentation. Also, implementations will usually assign behavior to certain line breaking control characters—most notably U+000D and U+000A (CR and LF)—according to platform conventions. See Section 5.8, Newline Guidelines in [Unicode] for more information.

5.7.2 Bidirectional Class Values

The values in the Bidi_Class field in UnicodeData.txt make use of the short, abbreviated property value aliases for Bidi_Class. For convenience in reference, Table 13 lists all the abbreviated and long value aliases for Bidi_Class values, reproduced from PropertyValueAliases.txt, along with a brief description of each category.

Table 13. Bidi_Class Values

AbbrLongDescriptionStrong TypesLLeft_To_Rightany strong left-to-right characterRRight_To_Leftany strong right-to-left (non-Arabic-type) characterALArabic_Letterany strong right-to-left (Arabic-type) characterWeak TypesENEuropean_Numberany ASCII digit or Eastern Arabic-Indic digitESEuropean_Separatorplus and minus signsETEuropean_Terminatora terminator in a numeric format context, includes currency signsANArabic_Numberany Arabic-Indic digitCSCommon_Separatorcommas, colons, and slashesNSMNonspacing_Markany nonspacing markBNBoundary_Neutralmost format characters, control codes, or noncharactersNeutral TypesBParagraph_Separatorvarious newline charactersSSegment_Separatorvarious segment-related control codesWSWhite_SpacespacesONOther_Neutralmost other symbols and punctuation marksExplicit Formatting TypesLRELeft_To_Right_EmbeddingU+202A: the LR embedding controlLROLeft_To_Right_OverrideU+202D: the LR override controlRLERight_To_Left_EmbeddingU+202B: the RL embedding controlRLORight_To_Left_OverrideU+202E: the RL override controlPDFPop_Directional_FormatU+202C: terminates an embedding or override controlLRILeft_To_Right_IsolateU+2066: the LR isolate controlRLIRight_To_Left_IsolateU+2067: the RL isolate controlFSIFirst_Strong_IsolateU+2068: the first strong isolate controlPDIPop_Directional_IsolateU+2069: terminates an isolate control

Please refer to Unicode Standard Annex #9, "Unicode Bidirectional Algorithm" [UAX9] for an an explanation of the significance of these values when formatting bidirectional text.

The four enumerated values for the isolate controls were added in Unicode 6.3. That means there is a discontinuity in the enumeration for Bidi_Class between Unicode 6.2 and Unicode 6.3 (and later versions) which parsers of UnicodeData.txt and DerivedBidiClass.txt must take into account.

5.7.3 Character Decomposition Mapping

The value of the Decomposition_Mapping property for a character is provided in field 5 of UnicodeData.txt. This is a string-valued property, consisting of a sequence of one or more Unicode code points. The default value of the Decomposition_Mapping property is the code point of the character itself. The use of the default value for a character is indicated by leaving field 5 empty in UnicodeData.txt. Informally, the value of the Decomposition_Mapping property for a character is known simply as its decomposition mapping. When a character's decomposition mapping is other than the default value, the decomposition mapping is printed out explicitly in the names list for the Unicode code charts.

The prefixed tags supplied with a subset of the decomposition mappings generally indicate formatting information. Where no such tag is given, the mapping is canonical. Conversely, the presence of a formatting tag also indicates that the mapping is a compatibility mapping and not a canonical mapping. In the absence of other formatting information in a compatibility mapping, the tag is used to distinguish it from canonical mappings.

In some instances a canonical mapping or a compatibility mapping may consist of a single character. For a canonical mapping, this indicates that the character is a canonical equivalent of another single character. For a compatibility mapping, this indicates that the character is a compatibility equivalent of another single character.

A canonical mapping may also consist of a pair of characters, but is never longer than two characters. When a canonical mapping consists of a pair of characters, the first character may itself be a character with a decomposition mapping, but the second character never has a decomposition mapping.

Compatibility mappings can be much longer than canonical mappings. For historical reasons, the longest compatibility mapping is 18 characters long. Compatibility mappings are guaranteed to be no longer than 18 characters, although most consist of just a few characters.

The compatibility formatting tags used in the UCD are listed in Table 14.

Table 14. Compatibility Formatting Tags

TagDescriptionFont variant (for example, a blackletter form)No-break version of a space or hyphenInitial presentation form (Arabic)Medial presentation form (Arabic)Final presentation form (Arabic)Isolated presentation form (Arabic)Encircled formSuperscript formSubscript formVertical layout presentation formWide (or zenkaku) compatibility characterNarrow (or hankaku) compatibility characterSmall variant form (CNS compatibility)CJK squared font variantVulgar fraction formOtherwise unspecified compatibility character

Note: There is a difference between decomposition and the Decomposition_Mapping property. The Decomposition_Mapping property is a string-valued property whose values (mappings) are defined in UnicodeData.txt, while the decomposition (also termed "full decomposition") is defined in Section 3.7, Decomposition in [Unicode] to use those mappings recursively.

  • The canonical decomposition is formed by recursively applying the canonical mappings, then applying the Canonical Ordering Algorithm.
  • The compatibility decomposition is formed by recursively applying the canonical and compatibility mappings, then applying the Canonical Ordering Algorithm.

Starting from Unicode 2.1.9, the decomposition mappings in UnicodeData.txt can be used to derive the full decomposition of any single character in canonical order, without the need to separately apply the Canonical Ordering Algorithm. However, canonical ordering of combining character sequences must still be applied in decomposition when normalizing source text which contains any combining marks.

The normalization of Hangul conjoining jamos and of Hangul syllables depends on algorithmic mapping, as specified in Section 3.12, Conjoining Jamo Behavior in [Unicode]. That algorithm specifies the full decomposition of all precomposed Hangul syllables, but effectively it is equivalent to the recursive application of pairwise decomposition mappings, as for all other Unicode characters. Formally, the Decomposition_Mapping property value for a Hangul syllable is the pairwise decomposition and not the full decomposition.

Each character with the Hangul_Syllable_Type value LVT will have a Decomposition_Mapping consisting of a character with an LV value and a character with a T value. Thus for U+CE31 the Decomposition_Mapping is , rather than .

The Unihan property kCompatibilityVariant consists of a listing of the canonical Decomposition_Mapping property values just for CJK compatibility ideographs. Because its values are derived from UnicodeData.txt, it is formally considered to be a derived property. The exact statement of the derivation for kCompatibilityVariant is listed in Unicode Standard Annex #38, "Unicode Han Database (Unihan)" [UAX38].

5.7.4 Canonical Combining Class Values

The values in the Canonical_Combining_Class field in UnicodeData.txt are numerical values used in the Canonical Ordering Algorithm. Some of those numerical values also have explicit symbolic labels as property value aliases, to make their intended application more understandable. For convenience in reference, Table 15 lists the long symbolic aliases for Canonical_Combining_Class values, reproduced from PropertyValueAliases.txt, along with a brief description of each category. The listing for fixed position classes, with long symbolic aliases of the form "Ccc10", and so forth, is abbreviated, as when those labels occur they are predictable in form, based on the numeric values.

Table 15. Canonical_Combining_Class Values

ValueLongDescription0Not_ReorderedSpacing and enclosing marks; also many vowel and consonant signs, even if nonspacing1OverlayMarks which overlay a base letter or symbol6Han_ReadingDiacritic reading marks for CJK unified ideographs7NuktaDiacritic nukta marks in Brahmi-derived scripts8Kana_VoicingHiragana/Katakana voicing marks9ViramaViramas10Ccc10Start of fixed position classes...... 199 End of fixed position classes200Attached_Below_LeftMarks attached at the bottom left202Attached_BelowMarks attached directly below204 Marks attached at the bottom right208 Marks attached to the left210 Marks attached to the right212 Marks attached at the top left214Attached_AboveMarks attached directly above216Attached_Above_RightMarks attached at the top right218Below_LeftDistinct marks at the bottom left220BelowDistinct marks directly below222Below_RightDistinct marks at the bottom right224LeftDistinct marks to the left226RightDistinct marks to the right228Above_LeftDistinct marks at the top left230AboveDistinct marks directly above232Above_RightDistinct marks at the top right233Double_BelowDistinct marks subtending two bases234Double_AboveDistinct marks extending above two bases240Iota_SubscriptGreek iota subscript only

Some of the Canonical_Combining_Class values in the table are not currently used for any characters but are specified here for completeness. Some values do not have long symbolic aliases and are not listed in PropertyValueAliases.txt. Do not assume that absence of a long symbolic alias implies non-use of a particular Canonical_Combining_Class. See DerivedCombiningClass.txt for a complete listing of the use of Canonical_Combining_Class values for any particular version of the UCD.

For use in regular expression matching, fixed position classes (ccc=10 through ccc=199) which actually occur in the Unicode Character Database for any version are given predictable aliases of the form "Ccc10", "Ccc11", and so forth. The complete list of such aliases which are actually defined can be found in PropertyValueAliases.txt.

The character property invariants regarding Canonical_Combining_Class guarantee that values, once assigned, will never change, and that all values used will be in the range 0..254. See Invariants in Implementations.

Combining marks with ccc=224 (Left) follow their base character in storage, as for all combining marks, but are rendered visually on the left side of them. For all past versions of the UCD and continuing with this version of the UCD, only two tone marks used in certain notations for Hangul syllables have ccc=224. Those marks are actually rendered visually on the left side of the preceding grapheme cluster, in the case of Hangul syllables resulting from sequences of conjoining jamos.

Those few instances of combining marks with ccc=Left should be distinguished from the far more numerous examples of left-side vowel signs and vowel letters in Brahmi-derived scripts. The Canonical_Combining_Class value is zero (Not_Reordered) for both ordinary, left-side (reordrant) vowel signs such as U+093F DEVANAGARI VOWEL SIGN I and for Thai-style left-side (Logical_Order_Exception=Yes) vowel letters such as U+0E40 THAI CHARACTER SARA E. The "Not_Reordered" of ccc=Not_Reordered refers to the behavior of the character in terms of the Canonical Ordering Algorithm as part of the definition of Unicode Normalization; it does not refer to any issues of visual reordering of glyphs involved in display and rendering. See "Canonical Ordering Algorithm" in Section 3.11, Normalization Forms in [Unicode].

5.7.5 Decompositions and Normalization

Decomposition is specified in Chapter 3, Conformance of [Unicode]. That chapter also specifies the interaction between decomposition and normalization.

A number of derived properties related to Unicode normalization are called the "Quick_Check" properties. These are defined to enable various optimizations for implementations of normalization, as explained in Section 9, Detecting Normalization Forms, in Unicode Standard Annex #15, "Unicode Normalization Forms" [UAX15]. The values for the four Quick_Check properties for all code points are listed in DerivedNormalizationProps.txt. The interpretations of the possible property values are summarized in Table 16.

Table 16. Quick_Check Property Values

PropertyValueDescriptionNFC_QC, NFKC_QC, NFD_QC, NFKD_QCNoCharacters that cannot ever occur in the respective normalization form.NFC_QC, NFKC_QCMaybeCharacters that may occur in the respective normalization, depending on the context.NFC_QC, NFKC_QC, NFD_QC, NFKD_QCYesAll other characters. This is the default value for Quick_Check properties.

The Quick_Check property values are recommended for exposure in a public library API which supports Unicode character properties, because they can be used to optimize code that needs to normalize Unicode strings. They enable fast checking of whether some input strings are already in the desired normalization form. This may make it possible to bypass the more time-consuming call to run the complete Unicode Normalization Algorithm on the input string.

In contrast, some normalization-related Unicode character properties are not recommended for exposure in a public library API. Notably, these include Decomposition_Mapping, Composition_Exclusion, and the derived Full_Composition_Exclusion. These properties are only used internally in a conformant implementation of the Unicode Normalization Algorithm. Exposing them in a public API can lead to confusion by users of the API. In particular, Decomposition_Mapping is very easy to misinterpret as designating the decomposition of a character, also known as the character's full decomposition. See Definitions D62 and D64 in Section 3.7, Decomposition in [Unicode].

5.7.6 Properties Whose Values Are Sets of Values

Most properties have a single value associated with each code point. However, some properties may instead associate a set of multiple different values with each code point. For example, the provisional kCantonese property, which lists Cantonese pronunciations for unified CJK ideographs, has values which consist of a set of zero or more romanized pronunciation strings. Thus, the Unihan Database contains an entry:

0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
      
8

This line is to be interpreted as associating a set of three string values, {"gun3", "hung1", "zung1"} with the kCantonese property for U+342B.

Similarly, the Script_Extensions property has values which consist of a set of one or more Script property values. Thus the property file ScriptExtensions.txt in the UCD contains an entry:

0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
      
9

This line is to be interpreted as associating a set of eight enumerated Script property values, {Adlm, Arab, Mand, Mani, Phlp, Rohg, Sogd, Syrc}, with the Script_Extensions property for U+0640.

In the case of Script_Extensions, in particular, the set of sets which constitute meaningful values of the property is relatively small, and could be explicitly evaluated for any particular Unicode version. For example:

4E00;;Lo;0;L;;;;;N;;;;;
9FEF;;Lo;0;L;;;;;N;;;;;
      
0

However, an enumeration of this set of set values is unlikely to be of much implementation value, and would be likely to change significantly between versions of the standard. In other cases, such as for properties defining pronunciation readings for unified CJK ideographs, these sets of sets are completely open-ended, and there is no point to attempting to provide explicit enumerations of such sets in the UCD.

The order of the element values in such sets may or may not be significant. For example, the order among the element values for kCantonese and for Script_Extensions is not significant. By way of contrast, when the kMandarin property shows two values for a code point, the first value is used to indicate a preferred pronunciation for zh-Hans (CN) and the second a preferred pronunciation for zh-Hant (TW).

For data file format considerations regarding properties which take sets of values, see Section 4.2.8 Multiple Values for Properties. For considerations regarding validation of such properties, see Section 5.11.5 Validation of Multivalued Properties. See also Unicode Technical Standard #18, "Unicode Regular Expressions" [UTS18] for a discussion of how to handle such properties when processing regular expressions.

5.8 Property and Property Value Aliases

Both Unicode character properties themselves and their values are given symbolic aliases. The formal lists of aliases are provided so that well-defined symbolic values are available for XML formats of the UCD data, for regular expression property tests, and for other programmatic textual descriptions of Unicode data. The aliases for properties are defined in PropertyAliases.txt. The aliases for property values are defined in PropertyValueAliases.txt.

Table 17. Alias Files in the UCD

File NameStatusDescriptionPropertyAliases.txtNNames and abbreviations for propertiesPropertyValueAliases.txtNNames and abbreviations for property values

Aliases are defined as ASCII-compatible identifiers, using only uppercase or lowercase A-Z, digits, and underscore "_". Case is not significant when comparing aliases, but the preferred form used in the data files for longer aliases is to titlecase them.

Aliases may be translated in appropriate environments, and additional aliases may be useful in certain contexts. There is no requirement that only the aliases defined in the alias files of the UCD be used when referring to Unicode character properties or their values; however, their use is recommended for interoperability in data formats or in programmatic contexts.

Aliases may be provided for provisional properties. There are stability guarantees for property aliases and property value aliases, but no stability guarantees for provisional properties or other provisional data files; consequently, there can also be no stability guarantee for property aliases or property value aliases associated with provisional properties.

5.8.1 Property Aliases

In PropertyAliases.txt, the first field typically specifies an abbreviated symbolic name for the property, and the second field specifies the long symbolic name for the property. These are the preferred aliases. Additional aliases for a few properties are specified in the third or subsequent fields.

Aliases for normative and informative properties defined in the Unihan data files are included in PropertyAliases.txt, beginning with Version 5.2.

The long symbolic name alias is self-descriptive, and is treated as the official name of a Unicode character property. For clarity it is used whenever possible when referring to that property in this annex and elsewhere in the Unicode Standard. For example: "The Line_Break property is discussed in Unicode Standard Annex #14, "Unicode Line Breaking Algorithm" [UAX14]."

The abbreviated symbolic name alias is usually short and less mnemonic, but is useful for expressions such as "lb=BA" in data or in other contexts where the meaning is clear. Note that although the UCD documentation refers to this first symbolic name alias as "abbreviated", there is no requirement that the first field be an actual abbreviation or even that it be shorter than the "long" symbolic name alias. If the long symbolic name alias is already a short identifier, in many cases the "abbreviated" symbolic name alias is identical to the value in the second field. There is also one principled class where the "abbreviated" field is actually longer than the "long" field—the property aliases for the Unihan tags. In that case, the second field deliberately matches the Unihan tags exactly, so that it can serve its function as being the official property value identifier. Then, because there was no systematic way to abbreviate Unihan tags, while still retaining any reasonable comprehensibility for them, the first field in PropertyAliases.txt was created by systematically prefixing "cj" to each Unihan tag, resulting in labels with the mnemonic "cjk" prefix. Thus it is not a mistake that in such cases the first field contains a longer string than the second field. Implementations should not build in assumptions about the relative length of these symbolic name aliases.

The property aliases specified in PropertyAliases.txt constitute a unique namespace. When using these symbolic values, no alias for one property will match an alias for another property.

5.8.2 Property Value Aliases

In PropertyValueAliases.txt, the first field contains the abbreviated alias for a Unicode property, the second field specifies an abbreviated symbolic name for a value of that property, and the third field specifies the long symbolic name for that value of that property. These are the preferred aliases. Additional aliases for some property values may be specified in the fourth or subsequent fields. For example, for binary properties, the abbreviated alias for the True value is "Y", and the long alias is "Yes", but each entry also specifies "T" and "True" as additional aliases for that value, as shown in Table 18.

Table 18. Binary Property Value Aliases

LongAbbreviatedOther AliasesYesYTrue, TNoNFalse, F

Not every property value has an associated alias. Property value aliases are typically supplied for catalog and enumeration properties, which have well-defined, enumerated values. It does not make sense to specify property value aliases, for example, for the Numeric_Value property, whose value could be any number, or for a string-valued property such as Simple_Lowercase_Mapping, whose values are mappings from one code point to another.

The Canonical_Combining_Class property requires special handling in PropertyValueAliases.txt. The values of this property are numeric, but they comprise a closed, enumerated set of values. The more important of those values are given symbolic name aliases. In PropertyValueAliases.txt, the second field provides the numeric value, while the third field contains the abbreviated symbolic name alias and the fourth field contains the long symbolic name alias for that numeric value. For example:

4E00;;Lo;0;L;;;;;N;;;;;
9FEF;;Lo;0;L;;;;;N;;;;;
      
1

Taken by themselves, property value aliases do not constitute a unique namespace. The abbreviated aliases, in particular, are often re-used as aliases for values for different properties. All of the binary property value aliases, for example, make use of the same "Y", "Yes", "T", "True" symbols. Property value aliases may also overlap the symbols used for property aliases. For example, "Sc" is the abbreviated alias for the "Currency_Symbol" value of the General_Category property, but it is also the abbreviated alias for the Script property. However, the aliases for values for any single property are always unique within the context of that property. That means that expressions that combine a property alias and a property value alias, such as "lb=BA" or "gc=Sc" always refer unambiguously just to one value of one given property, and will not match any other value of any other property.

Prior to Version 6.1.0, the property value alias entries for three properties, Age, Block, and Joining_Group, made use of a special metavalue "n/a" in the field for the abbreviated alias. This should be understood as meaning that no abbreviated alias was defined for that value for that property, rather than as an alias per se. Starting with Version 6.1.0, all property values for those three properties have abbreviated aliases, so there is no current use of the "n/a" metavalue.

In a few cases, because of longstanding legacy practice in referring to values of a property by short identifiers, the abbreviated alias and the long alias are the same. This can be seen, for example, in some property value aliases for the Line_Break property and the Grapheme_Cluster_Break property.

The property Script_Extensions consists of enumerated sets of Script property values. The set of those sets is potentially open-ended, and no property value aliases are defined for them.

5.9 Matching Rules

When matching Unicode character property names and values, it is strongly recommended that all Property and Property Value Aliases be recognized. For best results in matching, rather than using exact binary comparisons, the following loose matching rules should be observed.

5.9.1 Matching Numeric Property Values

For all numeric properties, and for properties such as Unicode_Radical_Stroke which are constructed from combinations of numeric values, use loose matching rule UAX44-LM1 when comparing property values.

UAX44-LM1. Apply numeric equivalences.

  • "01.00" is equivalent to "1".
  • "1.666667" in the UCD is a repeating fraction, and equivalent to "10/6" or "5/3".

5.9.2 Matching Character Names

Unicode character names constitute a special case. Formally, they are values of the Name property. While each Unicode character name for an assigned character is guaranteed to be unique, names are assigned in such a way that the presence or absence of spaces cannot be used to distinguish them. Furthermore, implementations sometimes create identifiers from Unicode character names by inserting underscores for spaces. For best results in comparing Unicode character names, use loose matching rule UAX44-LM2.

UAX44-LM2. Ignore case, whitespace, underscore ('_'), and all medial hyphens except the hyphen in U+1180 HANGUL JUNGSEONG O-E.

  • "zero-width space" is equivalent to "ZERO WIDTH SPACE" or "zerowidthspace"
  • "character -a" is not equivalent to "character a"

In this rule "medial hyphen" is to be construed as a hyphen occurring immediately between two alphanumeric characters [A..Z, 0..9] in the normative Unicode character name, as published in the Unicode names list in the UCD, and not to any hyphen that may transiently occur medially as a result of removing whitespace before removing hyphens in a particular implementation of matching. (See Section 4.8, Name in [Unicode] for the normative specification of the Unicode Name property and of name uniqueness.)

Thus the hyphens in the following examples of character names are medial, and should be ignored in loose matching:

  • U+10089 LINEAR B IDEOGRAM B107M HE-GOAT
  • U+2F800 CJK COMPATIBILITY IDEOGRAPH-2F800
  • U+1FB23 BLOCK SEXTANT-136
  • U+10749 LINEAR A SIGN A709-2 L2
  • U+1F090 DOMINO TILE VERTICAL-06-03

In contrast, the hyphens in the following examples of character names are not medial, and should not be ignored in loose matching.

  • U+0F39 TIBETAN MARK TSA -PHRU
  • U+11C88 MARCHEN LETTER -A

An implementation of this loose matching rule can obtain the correct results when comparing two strings by doing the following three operations, in order:

  1. remove all medial hyphens (except the medial hyphen in the name for U+1180)
  2. remove all whitespace and underscore characters
  3. apply toLowercase() to both strings

After applying these three operations, if the two strings compare binary equal, then they are considered to match.

This is a logical statement of how the rule works. If programmed carefully, an implementation of the matching rule can transform the strings in a single pass. It is also possible to compare two name strings for loose matching while transforming each string incrementally.

Loose matching rule UAX44-LM2 is also appropriate for matching character name aliases, the names of named character sequences, and code point labels, which all share the unique namespace (and matching behavior) of Unicode character names. See Section 4.8, Name in [Unicode]

Examples of medial hyphens in character name aliases include:

  • U+008E SINGLE-SHIFT-2
  • U+11EC HANGUL JONGSEONG YESIEUNG-KIYEOK

Examples of non-medial hyphens in character name aliases include:

  • U+0FD0 TIBETAN MARK BKA- SHOG GI MGO RGYAN

Examples of medial hyphens in named character sequences include:

  • MODIFIER LETTER EXTRA-HIGH EXTRA-LOW CONTOUR TONE BAR;02E5 02E9

Implementations of name matching should use extreme care when matching non-standard, alternative names for particular characters. The Name Uniqueness Policy in the Unicode Consortium Stability Policies [Stability] guarantees that the Unicode Standard will never add a character whose name would match an existing encoded character, according to matching rule UAX44-LM2. However, any other name for a character might be used in the future.

The following is a concrete example of the kind of trouble that can occur. Prior to Unicode 6.0 some implementations of regex allowed matching of the name "BELL" for the control code U+0007. When Unicode 6.0 added a different encoded character, U+1F514 BELL for emoji symbols, those regex implementations broke.

As of Version 6.1 of the Unicode Standard, the most commonly occurring alternative names for control codes, as well as many commonly used abbreviations for Unicode format characters, have been added as character name aliases. This automatically excludes all such alternative names and abbreviations from the potential pool for future Unicode character names, because name uniqueness is defined over the namespace which includes both character names and character name aliases. That exclusion should reduce the potential for surprises similar to the "BELL" case, where implementers assume that a name for a control code is already well-defined.

5.9.3 Matching Symbolic Values

Property aliases and property value aliases are symbolic values. When comparing them, use loose matching rule UAX44-LM3.

UAX44-LM3. Ignore case, whitespace, underscore ('_'), hyphens, and any initial prefix string "is".

  • "linebreak" is equivalent to "Line_Break" or "Line-break"
  • "lb=BA" is equivalent to "lb=ba" or "LB=BA"
  • "Script=Greek" is equivalent to "Script=isGreek" or "Script=Is_Greek"

Loose matching is generally appropriate for the property values of Catalog, Enumeration, and Binary properties, which have symbolic aliases defined for their values. Loose matching should not be done for the property values of String-valued properties, which do not have symbolic aliases defined for their values; exact matching for String-valued property values is important, as case distinctions or other distinctions in those values may be significant.

For loose matching of symbolic values, an initial prefix string "is" is ignored. The reason for this is that APIs returning property values are often named using the convention of prefixing "is" (or "Is" or "Is_", and so forth) to a property value. Ignoring any initial "is" on a symbolic value during loose matching is likely to produce the best results in application areas such as regex. Removal of an initial "is" string for a loose matching comparison only needs to be done once for a symbolic value, and need not be tested recursively. There are no property aliases or property value aliases of the form "isisisisistooconvoluted" defined just to test implementation edge cases.

Existing and future property aliases and property value aliases are guaranteed to be unique within their relevant namespaces, even if an initial prefix string "is" is ignored. The existing cases of note for aliases that do start with "is" are: dt=Iso (Decomposition_Type=Isolated) and lb=IS. The Decomposition_Type value alias does not cause any problem, because there is no contrasting value alias dt=o (Decomposition_Type=olated). For lb=IS, note that the "IS" is the entire property value alias, and is not a prefix. There is no null value for the Line_Break property for it to contrast with, but implementations of loose matching should be careful of this edge case, so that "lb=IS" is not misinterpreted as matching a null value.

Implementations sometimes use other syntactic constructs that interact with loose matching. For example, the property matching expression \p{L} may be defaulted to refer to the Unicode General_Category property: \p{General_Category=L}. For more information about the use of property values in regular expressions and other environments, see Section 1.2, Properties, in Unicode Technical Standard #18, "Unicode Regular Expressions" [UTS18].

5.10 Invariants

Property values in the UCD may be subject to correction in subsequent versions of the standard, as errors are found. Furthermore, any new version of the Unicode Standard may introduce new property values for a given property, except where the set of allowable values is fixed by the property type (such as for binary properties), or where the set of allowable values is subject to a provision of the Unicode Character Encoding Stability Policy [Stability]. Finally, a new version may also introduce new properties or new data files in the UCD.

Implementers of the UCD need to be aware of such changes when updating to new versions. However, some property values and some aspects of the file formats are considered invariant. This section documents such invariants.

5.10.1 Character Property Invariants

All formally guaranteed invariants for properties or property values are described in the Unicode Character Encoding Stability Policy [Stability]. That policy and the list of invariants it enumerates are maintained outside the context of the Unicode Standard per se. They are not part of the standard, but rather are constraints on what can and cannot change in the standard between versions, and on what decisions the Unicode Technical Committee can and cannot take regarding the standard.

In addition to the formally guaranteed invariants described in the Unicode Character Encoding Stability Policy, this section notes a few additional points regarding character property invariants in the UCD.

Some character properties are simply considered immutable: once assigned, they are never changed. For example, a character's name is immutable, because of its importance in exact identification of the character. The Canonical_Combining_Class and Decomposition_Mapping of a character are immutable, because of their importance to the stability of the Unicode Normalization Algorithm [UAX15].

The list of immutable character properties is shown in Table 19.

Table 19. Immutable Properties

Property NameAbbr NameDefault ValueAssignable to New?AgeAgeUnassignedYesNamenanull stringYesName_AliasName_Aliasnull stringYes (see note)Jamo_Short_Namejsnnull stringNoCanonical_Combining_Classccc0YesDecomposition_MappingdmYesPattern_SyntaxPat_SynNoNoPattern_White_SpacePat_WSNoNoNoncharacter_Code_PointNCharNoNo

If a property has "Yes" in the "Assignable to New?" column in Table 19, that means that the property value is immutable once it is initially assigned to a newly encoded character. The value for a reserved code point takes the default value, as shown in the third column of the table, but may change from the default value once the character is encoded. On the other hand, if a property has "No" in the "Assignable to New?" column, that means that it is absolutely immutable: all code points, including reserved code points, have a specific property value assigned, and that value does not change if a new character is encoded at a particular reserved code point in a future version of the standard.

The Name_Alias property is unusual, in that there can be more than one formal name alias assigned to a given encoded character. The default value for Name_Alias is the null string, but once any Name_Alias is assigned to an encoded character, that value is immutable. If more than one formal name alias is assigned to the same encoded character, each of those values is immutable.

A set of binary character properties associated with identifiers have a different kind of immutability, which can be described as locked to Yes. This results from the way these properties are used in the specification of identifiers. Unicode identifiers have the characteristic of stability between versions, so that once a string is specified as belonging to a particular class of identifier, it must stay in that class for future versions of the standard. Because of that requirement for identifier stability, there are associated constraints on how the related character properties can change. In particular, the identifier-related properties listed in Table 19a may have their values for any particular assigned character change from No to Yes between versions of the standard, but once a character has the value Yes, that value is locked in, and cannot ever be changed back to No.

Table 19a. Yes-Locked Properties

Property NameAbbr NameDefault ValueID_StartIDSNoID_ContinueIDCNoXID_StartXIDSNoXID_ContinueXIDCNo

In some cases, a property is not immutable, but the list of possible values that it can have is considered invariant. For example, while at least some General_Category values are subject to change and correction, the enumerated set of possible values that the General_Category property can have is fixed and cannot be added to in the future. However, not all Enumeration properties used by Unicode algorithms have immutable lists of property values. For example, the enumerated lists of values associated with the Line_Break and the Word_Break properties have changed in the past, and may be changed again in future versions of the standard.

All characters other than those of General_Category Mn or Mc are guaranteed to have Canonical_Combining_Class=0.

In Unicode 4.0 and thereafter, the General_Category value Decimal_Number (Nd), and the Numeric_Type value Decimal (de) are defined to be co-extensive; that is, the set of characters having General_Category=Nd will always be the same as the set of characters having NumericType=de.

5.10.2 UCD File Format Invariants

There are also some constraints on allowable change in the file formats for UCD files. In general, the file format conventions are changed as little as possible, to minimize the impact on implementations which parse the machine-readable data files. However, some of the constraints on allowable file format change go beyond conservatism in format and instead have the status of invariants. These guarantees apply in particular to UnicodeData.txt, the very first data file associated with the UCD.

The number and order of the fields in UnicodeData.txt is fixed. Any additional information about character properties to be added to the UCD in the future will appear in separate data files, rather than being added as an additional field to UnicodeData.txt or by reinterpretation of any of the existing fields.

5.10.3 Invariants in Implementations

Applications may wish to take the various character property and file format invariants into account when choosing how to implement character properties.

The Canonical_Combining_Class offers a good example. The character property invariants regarding Canonical_Combining_Class guarantee that values, once assigned, will never change, and that all values used will be in the range 0..254. This means that the Canonical_Combining_Class can be safely implemented in an unsigned byte and that any value stored in a table for an existing character will not need to be updated dynamically for a later version.

In practice, for Canonical_Combining_Class far fewer than 256 values are used. Unicode 3.0 used 53 values; Unicode 3.1 through Unicode 4.1 used 54 values; and Unicode 5.0 through Unicode 9.0 used 55 values. New, non-zero Canonical_Combining_Class values are seldom added to the standard. (For details about this history, see DerivedCombiningClass.txt.) Implementations may take advantage of this fact for compression, because only the ordering of the non-zero values, and not their absolute values, matters for the Canonical Ordering Algorithm. In principle, it would be possible for up to 255 values to be used in the future, but the chances of the actual number of values exceeding 128 are remote at this point. There are implementation advantages in restricting the number of internal class values to 128—for example, the ability to use signed bytes without implicit widening to ints in Java.

5.11 Validation

The Unicode character property values in the UCD files can be validated by means of regular expressions. Such validation can also be useful in testing of implementations that return property values. The method of validation depends on the type of property, as described below. These expressions use Perl syntax, but may of course be converted to other formal conventions for use with other regular expression engines.

The regular expressions which are appropriate for validation of particular properties may change in each subsequent version of the UCD. However, because of stability guarantees for character property aliases, these regular expressions for one version of the Unicode Standard will match valid values for previous versions of the standard.

5.11.1 Enumerated and Binary Properties

Enumerated and binary character properties can be validated by generating a regular expression using the PropertyValueAliases.txt file. Because enumerated properties have a defined list of possible values, the validating regular expression simply ORs together all of the possible values. Binary properties are a special case of enumerated property, with a predefined very short list of possible values.

For example, to validate the East_Asian_Width property in the UCD, or to test an implementation that returns the East_Asian_Width property, parse the following relevant lines from PropertyValueAliases.txt and produce a regular expression that concatenates each of the short and long property alias values.

4E00;;Lo;0;L;;;;;N;;;;;
9FEF;;Lo;0;L;;;;;N;;;;;
      
2

The resulting regular expression would then be:

4E00;;Lo;0;L;;;;;N;;;;;
9FEF;;Lo;0;L;;;;;N;;;;;
      
3

For each Unicode binary character property, the regular expression can be precomputed simply as:

4E00;;Lo;0;L;;;;;N;;;;;
9FEF;;Lo;0;L;;;;;N;;;;;
      
4

The Catalog properties, Age, Block, and Script, are another type of enumerated character property. All possible values of those properties for any given version of the Unicode Standard are listed in PropertyValueAliases.txt, so a validating regular expression for a Catalog property for that given version of the UCD can be generated by concatenating values, as for the other enumerated properties.

5.11.2 Combining_Character_Class Property

The Combining_Character_Class (ccc) property is a hybrid type. The possible values defined for it in UnicodeData.txt range from 0 to 254 and are numeric values. However, Combining_Character_Class also has symbolic aliases defined for those particular values that are in actual use; those symbolic aliases are listed in PropertyValueAliases.txt. To produce a validating regular expression for Combining_Character_Class, concatenate together the symbolic aliases from PropertyValueAliases.txt, and then add the numeric range 0..254.

The value 255 is reserved for use by implementations. When the ccc values are represented by bytes, that additional value of 255 may be used by an implementation for other purposes.

The value 133 is reserved. No characters have that value. The property value alias CCC133 is retained in accordance with the stability policy regarding property value aliases.

5.11.3 Unihan Properties

The validating regular expressions for each property tag defined in the Unihan database are described in detail in [UAX38].

5.11.4 Other Properties

Regular expressions to validate String and Miscellaneous properties in the UCD are provided in Table 21. Although Catalog properties may use strict tests, as described in Section 5.11.1 Enumerated and Binary Properties, generic patterns for Block and Script are also provided in Table 21.

To simplify the presentation of these expressions, commonly occurring subexpressions are first abstracted out as variables defined in Table 20.

Table 20. Common Subexpressions for Validation

VariableValueNotes and Examples$digit[0-9]"0", "3"$hexDigit[0-9A-F]"1", "A"$alphaNum[0-9A-Za-z]"1", "A", "z"$digits$digit+"0", "12345"$label$alphaNum+"A", "Syriac", "NGKWAEN", "123467", "A005A"$positiveDecimal$digits\.$digits"3.1"$decimal-?$positiveDecimal"3.5", "-0.5"$rational-?$digits(/$digits)?"3/4", "-3/4"$optionalDecimal-?$digits(\.$digits)?"3.5", "-0.5", "2", "1000"$name$label(( -|- |[-_ ])$label)*name, with potential non-medial hyphens$name2$label([-_ ]$label)*name, no non-medial hyphens allowed$annotatedName$name2( \(.*\))?name with optional parenthetical annotation$shortName[A-Z]{0,3}"", "O", "WA", "WAE"$codePoint(10|$hexDigit)?$hexDigit{4}"00A0", "E0100", "10FFFF"$codePoints$codePoint(\s$codePoint)*space-delimited list of 1 to n code points$codePoint0($codePoints)?space-delimited list of 0 to n code points

The regular expressions listed in Table 21 cover all the straightforward cases for other property values. For properties involving somewhat more irregular values, such as Age, ISO_Comment, and Unicode_1_Name, details for validation can be found in [UAX42].

Table 21. Regular Expressions for Other Property Values

AbbrNameRegex for Allowable ValuesnvNumeric_Value/$decimal/Field 2/$optionalDecimal/Field 3/$rational/blkBlock/$name2/scScriptdmDecomposition_Mapping/$codePoints/FC_NFKCFC_NFKC_ClosureNFKC_CFNFKC_Casefold/$codePoint0/cfCase_Folding/$codePoints/lcLowercase_MappingtcTitlecase_MappingucUppercase_MappingscfSimple_Case_Folding/$codePoint/slcSimple_Lowercase_MappingstcSimple_Titlecase_MappingsucSimple_Uppercase_MappingbmgBidi_Mirroring_Glyph/$codePoint/bpbBidi_Paired_Bracket/$codePoint/EqUIdeoEquivalent_Unified_Ideograph/$codePoint/naName/$name/Name_AliasName_Alias--Names for named sequences*na1Unicode_1_Name/$annotatedName/JSNJamo_Short_Name/$shortName/

* The Unicode named character sequences constitute a string-valued property for an enumerated set of strings (the actual sequences which are given names). They follow the same syntax as the Name and Name_Alias property values and form part of the same namespace.

5.11.5 Validation of Multivalued Properties

Some properties, such as Script_Extensions of kCantonese, have property values each consisting of a set of element values. In the data files, these element values are separated by spaces. Validation of the property values is performed by first splitting each set into element values at the spaces, and then validating each element value individually. For example, the elements for Script_Extensions are values of the Script property; they are validated according to the validation requirements for the Script property. See also Section 5.7.6 Properties Whose Values Are Sets of Values.

The Name_Alias property has values which consist of sets of one or more name strings. In the data file for this property, each element value occurs on a separate line and can be validated as a separate element.

5.12 Deprecation

In the Unicode Standard, the term deprecation is used somewhat differently than it is in some other standards. Deprecation is used to mean that a character or other feature is strongly discouraged from use. This should not, however, be taken as indicating that anything has been removed from the standard, nor that anything is planned for removal from the standard. Any such change is constrained by the Unicode Consortium Stability Policies [Stability].

For the Unicode Character Database, there are two important types of deprecation to be noted. First, an encoded character may be deprecated. Second, a character property may be deprecated.

When an encoded character is strongly discouraged from use, it is given the property value Deprecated=True. The Deprecated property is a binary property defined specifically to carry this information about Unicode characters. Very few characters are ever formally deprecated this way; it is not enough that a character be uncommon, obsolete, disliked, or not preferred. Only those few characters which have been determined by the UTC to have serious architectural defects or which have been determined to cause significant implementation problems are ever deprecated. Even in the most severe cases, such as the deprecated format control characters (U+206A..U+206F), an encoded character is never removed from the standard. Furthermore, although deprecated characters are strongly discouraged from use, and should be avoided in favor of other, more appropriate mechanisms, they may occur in data. Conformant implementations of Unicode processes such a Unicode normalization must handle even deprecated characters correctly.

In the Unicode Character Database, a character property may also become strongly discouraged—usually because it no longer serves the purpose it was originally defined for. In such cases, the property is labelled "deprecated" in Table 9, Property Table. For example, see the Grapheme_Link property. Deprecated properties are not recommended for exposure in public APIs that support Unicode character properties.

5.13 Property APIs

The Unicode Standard does not specify the exact form of APIs which may be defined in software libraries to surface Unicode character properties to applications. However, there are some recommendations and general guidelines to follow, which should serve to reduce potential confusion and to promote better interoperability between applications using the Unicode Character Database.

In the discussion which follows here, the term API is used to refer to a particular function or method, whereas the term API collection is used to refer to a related group of APIs, which might constitute a set of functions exported from a library, a class definition, or other groupings of related functionality. A distinction is also made between a public API, which is exported for general application use, and a private API, which may be kept hidden within a library or class, intended for internal use.

First, if an API surfaces values of a particular Unicode character property and purports that value to represent a Unicode character property, it should exactly follow the specification of that property in the UCD. This principle follows from the general approach to conformance for the Unicode Standard: If you say it is Unicode, then it should follow the Unicode Standard specification.

Second, an API should be clear about which version of the UCD it supports. This can be done, for example, with documentation, either external or included in the source in header files, class definition notes, and so forth. For an API collection, an even better option is to include an API which explicitly reports which version of the UCD is supported. This provision should reduce confusion regarding particular property values which might change between versions of the Unicode Standard, as well as making it clear which repertoire of encoded characters is intended to be covered. There is no principled constraint on an API supporting more than one version of the UCD, as long as it is clear about how it does so.

Third, although there is no constraint on an API declaring that it only supports a designated subset of Unicode characters, best practice for a general purpose character property API would be to support the entire range of Unicode code points, providing determinant and well-documented property values for any valid Unicode code point input. That would include providing correct default property values for any unassigned code point. See Section 2.2, Use of Default Values for an explanation of that concept.

Fourth, a Unicode character property API is not precluded from extending or tailoring its support of character properties, as long as such behavior is clearly documented, so that applications understand the values they will be getting by calling the API. For example, an API might surface an extended new property such as IsDanda, which is not formally part of the properties specified by the UCD, but which can be inferred from the documentation of the Unicode Standard. An API supporting a particular tailoring of the Unicode Line Breaking Algorithm could surface tailored Line_Break property values to support that behavior. Alternatively, an API supporting a particular private use agreement could surface privately-defined properties for a designated range of PUA characters. All such use of APIs should be considered conformant ways of extending API collections using the UCD.

Designers of API collections to support Unicode character properties must also be aware that not all Unicode character properties are equal. There is no requirement, express or implied, that all Unicode character properties should be supported in a given API collection. In fact, an approach that simply parses the UCD and surfaces all Unicode character properties verbatim is very likely to result in a bad design. Character properties need to be understood in the context of the various Unicode algorithms they are designed to support.

The following subtypes of Unicode character properties should generally not be exposed in APIs, except in limited circumstances. They may not be useful, particularly in public API collections, and may instead prove misleading to the users of such API collections.

  • Contributory properties are not recommended for public APIs.
  • A subset of Unicode normalization-related properties are not recommended for public APIs. See Section 5.7.5, Decompositions and Normalization.
  • Deprecated properties are not recommended for public APIs. See Section 5.12, Deprecation.

5.14 Character Age

The Age property indicates the first version in which a particular Unicode character was assigned. For example, U+20AC € EURO SIGN was added to Version 2.1 of the Unicode Standard, so it has age=2.1, while U+20B9 ₹ INDIAN RUPEE SIGN was added to Version 6.0 of the Unicode Standard, so it has age=6.0.

Formally, the Age property is a catalog property whose enumerated values correspond to a list of tuples consisting of a major version integer and a minor version integer. The major version is a positive integer constrained to the range 1..255. The minor version is a non-negative integer constrained to the range 0..255. These range limitations are specified so that implementations can be guaranteed that all valid, assigned Age values can be represented in a sequence of two unsigned bytes. A third value corresponding to the Unicode update version is not required, because new characters are never assigned in update versions of the standard.

The short values listed in PropertyValueAliases.txt for the Age property for assigned (designated) code points are of the form "m.n", with the first field corresponding to the major version, and the second field corresponding to the minor version.

The long values listed in PropertyValueAliases.txt for the Age property for assigned code points start with a "V" and use an underscore instead of a dot between the major and minor version numbers: V2_1, V6_0, and so on. This makes the long format more useful as an identifier in programming languages. It is also useful in regular expressions, where the dot has other significance.

The default value of the Age property, used for unassigned (undesignated) code points, is expressed with labels that depart from the numerical versioning scheme of the Age property for assigned code points; the short form for the default is "NA", and the long form for the default is "Unassigned". Implementations of parsers which manipulate the Age property need to be prepared for this special case, rather than expecting the default value to be expressed numerically, as "0.0", for example.

The Age property is based on when a character is encoded in the standard. It is normative and immutable, and cannot be meaningfully tailored.

The minimum value of the Age property is "1.1", instead of "1.0", because of the substantial and incompatible changes to the standard resulting from the merger of code points and character names between the Unicode Standard and ISO/IEC 10646 for their 1993 publications. For Hangul syllable characters, which were extensively augmented in Unicode 2.0, the Age value is set to "2.0", even though a subset of the Hangul syllables had been published in earlier versions, at different code points.

Private use characters, noncharacter code points, and surrogate code points also get Age values. The private use characters and noncharacter code points on the BMP have age=1.1. However, the full architecture for UTF-16 and multiple planes was not fully documented until Unicode 2.0, so the private use characters and noncharacter code points on supplementary planes, as well as the surrogate code points in the range D800..DFFF, are given the value age=2.0.

The Age property cannot be derived from the other data files in any single version of the Unicode Character Database. Its derivation is done, rather, by tools that compare the assigned characters between subsequent versions. The data file DerivedAge.txt provides the definitive listing of the Age property value for all code points, as of that version of the standard.

The typical use case for the Age property in regular expressions is to search for all characters that were present in a given version. For this reason, an expression such as "\p{age=V3_0}" is exceptionally defined to match all of the code points assigned in Version 3.0—that is, all the code points with a value less than or equal to the value 3.0 for the Age property, rather than just the subset of those code points with the value 3.0. This interprets "\p{age=V3_0}" as the set of all characters assigned as of Unicode 3.0, rather than as just the set of characters added to Unicode 3.0 subsequent to the prior version. For more information, see Unicode Technical Standard #18, "Unicode Regular Expressions" [UTS18].

6 Test Files

The UCD contains a number of test data files. Those provide data in standard formats which can be used to test implementations of Unicode algorithms. The test data files distributed with this version of the UCD are listed in Table 22.

Table 22. Unicode Algorithm Test Data Files

File NameSpecificationStatusUnicode AlgorithmBidiTest.txt[UAX9]NUnicode Bidirectional AlgorithmBidiCharacterTest.txt[UAX9]NUnicode Bidirectional AlgorithmNormalizationTest.txt[UAX15]NUnicode Normalization AlgorithmLineBreakTest.txt[UAX14]NUnicode Line Breaking AlgorithmGraphemeBreakTest.txt[UAX29]NGrapheme Cluster Boundary DeterminationWordBreakTest.txt[UAX29]NWord Boundary DeterminationSentenceBreakTest.txt[UAX29]NSentence Boundary Determination

The normative status of these test files reflects their use to determine the correctness of implementations claiming conformance to the respective algorithms listed in the table. There is no requirement that any particular Unicode implementation also implement the Unicode Line Breaking Algorithm, for example, but if it implements that algorithm correctly, it should be able to replicate the test case results specified in the data entries in LineBreakTest.txt.

6.1 NormalizationTest.txt

This file contains data which can be used to test an implementation of the Unicode Normalization Algorithm. (See [UAX15] and [Tests15].)

The data file has a Unicode string in the first field (which may consist of just a single code point). The next four fields then specify the expected output results of converting that string to Unicode Normalization Forms NFC, NFD, NFKC, and NFKD, respectively. There are many tricky edge cases included in the input data, to ensure that implementations have correctly implemented some of the more complex subtleties of the Unicode Normalization Algorithm.

The header section of NormalizationTest.txt provides additional information regarding the normalization invariant relations that any conformant implementation should be able to replicate.

The Unicode Normalization Algorithm is not tailorable. Conformant implementations should be expected to produce results as specified in NormalizationTest.txt and should not deviate from those results.

6.2 Segmentation Test Files and Documentation

LineBreakTest.txt, located in the auxiliary directory of the UCD, contains data which can be used to test an implementation of the Unicode Line Breaking Algorithm. (See [UAX14] and [Tests14].) The header of that file specifies the data format and the use of the test data to specify line break opportunities. Note that non-ASCII characters are used in this test data as field delimiters.

There is an associated documentation file, LineBreakTest.html, which displays the results of the Line Breaking Algorithm in an interactive chart form, with a documented listing of the rules.

The Unicode text segmentation test data files are also located in the auxiliary directory of the UCD. (See [Tests29].) They contain data which can be used to test an implementation of the segmentation algorithms specified in [UAX29]. The headers of those file specify the data format and the use of the test data to specify text segmentation opportunities. Note that non-ASCII characters are used in this test data as field delimiters.

There are also associated documentation files, which display the results of the segmentation algorithms in an interactive chart form, with a documented listing of the rules:

  • GraphemeBreakTest.html
  • SentenceBreakTest.html
  • WordBreakTest.html

Unlike the Unicode Normalization Algorithm, the Unicode Line Breaking Algorithm and the various text segmentation algorithms are tailorable, and there is every expectation that implementations will tailor these algorithms to produce results as needed. The test data files only test the default behavior of the algorithms. Testing of tailored implementations will need to modify and/or extend the test cases as appropriate to match any documented tailoring.

6.3 Bidirectional Test Files

These files contain data which can be used to test an implementation of the Unicode Bidirectional Algorithm. (See [UAX9] and [Tests9].)

The data in BidiTest.txt is intended to exhaustively test all possible combinations of Bidi_Class values for strings of length four or less. To allow for the resulting very large number of test cases, the data file has a somewhat complicated format which is described in the header. Fundamentally, for each input string and for each possible input paragraph level, the test data specifies the resulting bidi levels and expected reordering.

The data in BidiCharacterTest.txt is provided to test various edge cases for the algorithm. It contains an extra field which allows for explicit control of the overall directional context for each test case.

The Unicode Bidirectional Algorithm is tailorable within certain limits. Conformant implementations with no tailoring are expected to produce the results as specified in BidiTest.txt and BidiCharacterTest.txt, and should not deviate from those results. Tailored implementations can also use the data in the test files to test for overall conformance to the algorithm by changing the assignment of properties to characters to reflect the details of their tailoring.

7 UCD Change History

This section summarizes the recent changes to the UCD—including its documentation files—and is organized by Unicode versions.

References in the change history are often made to a Public Review Issue (PRI). See https://www.unicode.org/review/resolved.html for more information about each of those cases.


Unicode 15.0.0

Changes in specific files:

Appropriate existing data files were updated to add the 4489 new characters encoded in Unicode 15.0. Major changes that are most likely to affect implementations are documented in Section M of the Unicode 15.0.0 page. Significant data file updates resulting from encoding the new characters and from various character property changes are summarized below, in the same grouping manner used in Components of Unicode 15.0.0.

Note that minor editorial updates and changes to the derived and extracted data files are not documented here. Routine additions of expected property values for newly encoded characters are likewise not called out explicitly in this summary.

Core Data

  • Blocks.txt
    • Seven new blocks were added, six allocated in the Supplementary Multilingual Plane, and one in the Tertiary Ideographic Plane. These include two blocks for the newly encoded scripts in Version 15.0—Kawi and Nag Mundari.
    • The block range for Egyptian Hieroglyph Format Controls was extended to end at 1345F, instead of 1343F.
  • IndicPositionalCategory.txt
    • 0953 and 0954 were removed from InPC=Top.
    • Five Kayah Li vowel signs, A926..A92A, were added to InPC=Top.
  • IndicSyllabicCategory.txt
    • Clarification was added re 0A71 GURMUKHI ADDAK.
    • 0AFB GUJARATI SIGN SHADDA was changed from InSC=Cantillation_Mark to InSC=Gemination_Mark.
    • The note regarding the Brahmi Joining Number class was expanded.
  • LineBreak.txt
    • The double diacritic marks 1DCD and 1DFC were changed from lb=CM to lb=GL.
    • 2057 QUADRUPLE PRIME was changed from lb=AL to lb=PO.
  • NamedAliases.txt
    • One formal name alias of type abbreviation was added for 0019.
    • Two formal name aliases of type correction were added for 0616 and 1BBD.
  • NamesList.txt
    • Content was updated throughout with new characters, as well as annotations, cross references, subheadings, and new comments.
  • PropertyValueAliases.txt
    • The 150 value, with the alias V150, was added to the catalog property Age.
    • Aliases were added for new Script and new Block property values.
    • Redundant @missing lines were removed for Bidi_Mirroring_Glyph, Equivalent_Unified_Ideograph, NFKC_Casefold, and Script_Extensions.
  • PropList.txt
    • Two Tibetan bindus, 0F82..0F83, were added to Other_Alphabetic.
    • Several overlooked modifier letters, 10FC, A7F2..A7F4, and AB69, were added to Other_Lowercase, for consistency with other modifier letters.
  • StandardizedVariants.txt
    • 94 standardized variation sequences were added for rotational variants of Egyptian hieroglyph signs.
    • 4 standardized variation sequences were added for expanded variants of Egyptian hieroglyph lost signs.

Unihan Database (Unihan.zip)

  • Unihan_DictionaryIndices.txt
    • Added a single new kCheungBauerIndex record.
    • Added the approximately 14,000 records for the kCihaiT property (moved from Unihan_DictionaryLikeData.txt).
    • Changed two kDaeJaweon property values.
    • Added approximately 50,000 new records to the kKangXi property that were derived from the kIRG_GSource and kIRGKangXi properties, changed approximately 250 kKangXi property values, and removed approximately 30 records with meaningless property values.
    • Removed approximately 300 records for the kMorohashi property with meaningless property values.
  • Unihan_DictionaryLikeData.txt
    • Added the kAlternateTotalStrokes property with approximately 100 records.
    • Removed the approximately 14,000 records for the kCihaiT property (moved to Unihan_DictionaryIndices.txt).
    • Changed a very small number of kHKGlyph property values.
    • Removed approximately 400 records for the kPhonetic property.
  • Unihan_IRGSources.txt
    • Added the "GXM-" and "GZA-" prefixes to the kIRG_GSource property, along with new records.
    • Added the "T12-" prefix to the kIRG_TSource property, along with new records.
    • Added new records to the kIRG_HSource, kIRG_KSource, and kIRG_USource properties.
    • Changed a single kIRG_VSource property value as a result of a disunification.
    • Changed a very small number of kRSUnicode and kTotalStrokes property values.
    • Moved a single kIRG_UKSource source reference.
    • Added IRG source data, kRSUnicode, and kTotalStrokes records for a single new character that was appended to the CJK Unified Ideographs Extension C block.
    • Added IRG source data, kRSUnicode, and kTotalStrokes records for the characters in the newly encoded CJK Unified Ideographs Extension H block.
  • Unihan_RadicalStrokecounts.txt
    • Changed two kRSKangXi property values.
  • Unihan_Readings.txt
    • Changed approximately 150 kCantonese property values, and added approximately 150 new kCantonese records.
    • Changed approximately 800 kDefinition property values, and added approximately 1500 new kDefinition records.
    • Changed a single kVietnamese property value, and added a very small number of new kVietnamese records.
  • Unihan_Variants.txt
    • Changed a small number of kSemanticVariant property values, and added approximately 100 new kSemanticVariant records.
    • Changed a very small number of kSimplifiedVariant property values, and added approximately 400 new kSimplifiedVariant records.
    • Changed a small number of kSpecializedSemanticVariant property values, and added an even smaller number of new kSpecializedSemanticVariant records.
    • Removed a very small number of kSpoofingVariant records.
    • Changed approximately 400 kTraditionalVariant property values, and added a small number of new kTraditionalVariant records.
    • Added a very small number of new kZVariant records, and removed an even smaller number of kZVariant records.

Data for UAX #45

  • USourceData.txt
    • 59 new entries were added for new UTC-Source ideographs.
    • Various entries were updated to reflect their encoding as part of CJK Extension H.
    • Status field values for various CJK Extensions were all changed from single letter values to four letter values ("A" → "ExtA", etc.), as documented in UAX #45.
  • USourceGlyphs.pdf
    • Glyphs were added for the 59 new UTC-Source ideographs introduced in USourceData.txt.
  • USourceRSChart.pdf
    • Added new entries for the radical-stroke index.

Extracted Data

  • DerivedBidiClass.txt
    • Multiple @missing lines were added, to deal with all default value range assignments.
    • Values explicitly assigned formerly to unassigned code points were removed, because they are now redundant.
  • DerivedLineBreak.txt
    • Multiple @missing lines were added, to deal with all default value range assignments.
    • Values explicitly assigned formerly to unassigned code points were removed, because they are now redundant.

Conformance Test Data

No specific items to highlight.

Auxiliary Data for UAX #14 and UAX #29

  • SentenceBreakProperty.txt
    • Several modifier letters (10FC, A7F2..A7F4, and AB69) changed from OLetter to Lower, as a result of their addition to Other_Lowercase.

Documentation for Auxiliary Data

No specific items to highlight.

Emoji Data

No specific items to highlight.


Unicode 14.0.0

Changes in specific files:

Appropriate existing data files were updated to add the 838 new characters encoded in Unicode 14.0. Major changes that are most likely to affect implementations are documented in Section M of the Unicode 14.0.0 page. Significant data file updates resulting from encoding the new characters and from various character property changes are summarized below, in the same grouping manner used in Components of Unicode 14.0.0.

Note that minor editorial updates and changes to the derived and extracted data files are not documented here. Routine additions of expected property values for newly encoded characters are likewise not called out explicitly in this summary.

Core Data

  • Blocks.txt
    • Twelve new blocks were added, all allocated in the Supplementary Multilingual Plane, including four blocks for the newly encoded scripts in Version 14.0—Old Uyghur, Cypro-Minoan, Tangsa, and Toto.
    • The block range for Ahom was extended to end at 1174F, instead of at 1173F.
    • A typo in the end range for the Tangut Supplement block was corrected from 18D8F to 18D7F.
  • IndicPositionalCategory.txt
    • The value for U+1734 HANUNOO PAMUDPOD was changed from InPC=Bottom to Right, for consistency in treatment with the newly encoded U+1715 TAGALOG PAMUDPOD.
  • IndicSyllabicCategory.txt
    • The values for U+1B03 BALINESE SIGN SURANG, U+1B81 SUNDANESE SING PANGLAYAR, and U+A982 JAVANESE SIGN LAYAR were changed from InSC=Consonant_Succeeding_Repha to Consonant_Final, following an analysis of their behavior in rendering.
  • LineBreak.txt
    • Of the 37 newly encoded emoji symbols, 10 were assigned the Line_Break property value E_Base. Those occur as bases in valid emoji modifier sequences, and thus have the Emoji_Modifier_Base property. (See [UTS51].) That Line_Break value represents a change from the default value Line_Break=Ideographic for all unassigned code points in the range U+1F000..U+1FAFF. The other 27 new emoji were assigned the Line_Break property value Ideographic, retaining the default value for that code point range.
  • NamedSequences.txt
    • Added 19 entries that were approved for Unicode 14.0. (Moved from NamedSequencesProv.txt.)
  • NamedSequencesProv.txt
    • Removed 19 provisional entries that were approved for Unicode 14.0.
  • NamesList.txt
    • Content was updated throughout with new characters, as well as annotations, cross references, subheadings, and new comments.
  • PropertyValueAliases.txt
    • The 140 value, with the alias V140, was added to the catalog property Age.
  • PropList.txt
    • Newly encoded modifier letters in the range U+10780..U+107BA were assigned the value Other_Lowercase, for consistency with other, similar modifier letters.
  • Scripts.txt
    • U+16FE2 OLD CHINESE HOOK MARK and U+16FE3 OLD CHINESE ITERATION MARK were changed from sc=Common to Han, to reflect their use exclusively with Chinese.
  • ScriptExtensions.txt
    • Nkoo was added to the scx sets for several characters also used with Arabic, to reflect attested usage.
  • UnicodeData.txt
    • The newly encoded Vithkuqi script is bicameral, with case pairs. New case pairs were also added for four new Latin letters and one new Glagolitic letter.
    • Among the newly encoded nonspacing combining marks, there are 40 which have nonzero Canonical_Combining_Class values.
    • The General_Category for U+1734 HANUNOO PAMUDPOD was changed from Mn to Mc, for consistency in treatment with the newly encoded U+1715 TAGALOG PAMUDPOD.

Unihan Database (Unihan.zip)

  • Unihan_DictionaryLikeData.txt
    • Expanded the kPhonetic property to include over 8,500 new records.
    • Added the kStrange property with nearly 700 records.
  • Unihan_IRGSources.txt
    • Added the "GDM-" prefix to the kIRG_GSource property, along with new records.
    • Added the "MA-," "MB-," "MC-," and "MD-" prefixes to the kIRG_MSource property, along with new records, and changed existing records that used the "MAC-" prefix to use the new "MC-" prefix.
    • Added the "VN-" prefix to the kIRG_VSource property, along with new records, and changed existing records that used the "VU-" prefix to use the new "VN-" prefix.
    • Added new records to the kIRG_KSource, kIRG_TSource, kIRG_USource, and kIRG_UKSource properties.
    • Changed existing kIRG_TSource and kIRG_KPSource records as a result of disunifications.
    • Added IRG source data, kRSUnicode, and kTotalStrokes records for nine new characters that were appended to the CJK Unified Ideographs (3), CJK Unified Ideographs Extension B (2), and CJK Unified Ideographs Extension C (4) blocks.
    • Corrected a small number of kRSUnicode and kTotalStrokes property values.
    • Added 21 missing records for the kTotalStrokes property.
  • Unihan_OtherMappings.txt
    • Changed nearly 300 existing kKoreanName records to include only "2015" as their property value, and added over 100 new kKoreanName records with the property value "2018."
  • Unihan_Readings.txt
    • Overhauled the kCantonese property, which now includes nearly 30,000 records.
    • Corrected a small number of kDefinition property values, and added nearly 300 new kDefinition records.
    • Corrected a small number of kMandarin property values, and added a small number of new kMandarin records.
  • Unihan_Variants.txt
    • Changed a small number of kSimplifiedVariant and kTraditionalVariant property values, and added a small number of new kSimplifiedVariant and kTraditionalVariant records.
    • Added over 100 new kSpoofingVariant records.

Data for UAX #45

  • USourceData.txt
    • Two fields were added to all of the data records, documenting the total strokes value and the first residual stroke for each character.
    • A total of 28 new entries were added for Version 14.0, with the identifiers UTC-03211 through UTC-03238, most noted as for Working Set 2021.
    • For many entries which have already been encoded, the status was changed and the appropriate Unicode code point was added.
  • USourceGlyphs.pdf
    • Glyphs were added for the 28 new UTC-Source ideographs introduced in USourceData.txt.
  • USourceRSChart.pdf
    • Added new entries for the radical-stroke index.

Conformance Test Data

Auxiliary Data for UAX #14 and UAX #29

  • GraphemeBreakProperty.txt
    • U+1715 TAGALOG SIGN PAMUDPOD and U+1734 HANUNOO SIGN PAMUDPOD are now gcb=SpacingMark.
    • U+11720 AHOM VOWEL SIGN A and U+11721 AHOM VOWEL SIGN AA are no longer gcb=SpacingMark.
  • LineBreakTest.txt
    • Test cases were added to exercise the change in rule LB30b in UAX #14.

Documentation for Auxiliary Data

  • LineBreakTest.html
    • Changes were made to the Line_Break chart and the numbering of rules, to reflect the changes to rules LB27 and LB30b in UAX #14.

Emoji Data

  • emoji-data.txt
    • 10 of the 37 new emoji characters were given the Emoji_Modifier_Base=Yes property value.

Acknowledgments

Mark Davis and Ken Whistler are the authors of the initial version and have added to and maintained the text of this annex. Laurențiu Iancu assisted in the documentation of UCD changes for Versions 6.3.0 through 13.0.0. Ken Lunde and John Jenkins assisted in the documentation of Unihan changes for Versions 13.0.0 through 15.0.0. Julie Allen and Asmus Freytag provided editorial suggestions for improvement of the text. Over the years, many members of the UTC have participated in the review of the UCD and its documentation.

References

For references for this annex, see Unicode Standard Annex #41, "Common References for Unicode Standard Annexes."

Modifications

The following summarizes modifications from previous revisions of this annex.

Revision 30 [KW]

  • Reissued for Unicode 15.0.0.
  • Added clarification and examples in UAX44-LM2 regarding the meaning of "medial hyphen" in character name matching.
  • Added a reference out to UTR #23 for more discussion of types of properties, in Section 5.2 About the Property Table.
  • Added a cross-reference to Section 4.2.10 @missing Conventions from Section 4.2.4 Comments.
  • Added a discussion of multiple @missing lines for a single property in Section 4.2.10 @missing Conventions.
  • Added mention of rule GB11 from UAX #29 and rule LB30b from UAX #14 to the discussion of uses of Extended_Pictographic.
  • Updated end range of Egyptian hieroglyph format controls in list of exceptions for Default_Ignorable_Code_Point.
  • Corrected miscellaneous typos.
  • Removed modification log for older versions of the document.

Revision 28 [KW]

  • Reissued for Unicode 14.0.0.
  • In Section 3.2, clarified the distinction between properties of strings and string-valued properties, drawing on terminology from UTR #23.
  • To mark the distinction between properties of strings and string-valued properties, all mentions of "string property" in this annex have been changed to "string-valued property". The exceptions are where the term "string property" is explicitly used in contrast to character or code point property, in the sense of a "property of strings".
  • In Section 4.2.9.1, noted that Vertical_Orientation defaults to U in some blocks associated with notational systems and emoji, as well as in blocks for some scripts.
  • Corrected an erroneous statement about which General_Category values can be associated with ccc≠0, in Section 5.10.1.

Modifications for previous versions are listed in those respective versions.


© 2022 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.