mirror of
https://github.com/python/cpython.git
synced 2025-12-23 09:19:18 +00:00
[3.13] gh-54874: Expand unicodedata module documentation (GH-138301) (#138345)
* gh-54874: Expand unicodedata module documentation (GH-138301)
Closes GH-54874
(cherry picked from commit 0d383f86ee)
Co-authored-by: Stan Ulbrych <89152624+StanFromIreland@users.noreply.github.com>
Co-authored-by: Alexander Belopolsky <abalkin@users.noreply.github.com>
* Changes links to UCD 15.0.0
---------
Co-authored-by: Stan Ulbrych <89152624+StanFromIreland@users.noreply.github.com>
Co-authored-by: Alexander Belopolsky <abalkin@users.noreply.github.com>
parent a38f0266df
commit 76aa2abdef
1 changed file with 68 additions and 32 deletions
@@ -25,80 +25,133 @@ Standard Annex #44, `"Unicode Character Database"
 <https://www.unicode.org/reports/tr44/>`_. It defines the
 following functions:
 
+.. seealso::
+
+The :ref:`unicode-howto` for more information about Unicode and how to use
+this module.
+
 
 .. function:: lookup(name)
 
 Look up character by name. If a character with the given name is found, return
 the corresponding character. If not found, :exc:`KeyError` is raised.
+For example::
+
+>>> unicodedata.lookup('LEFT CURLY BRACKET')
+'{'
+
+The characters returned by this function are the same as those produced by
+``\N`` escape sequence in string literals. For example::
+
+>>> unicodedata.lookup('MIDDLE DOT') == '\N{MIDDLE DOT}'
+True
 
 .. versionchanged:: 3.3
 Support for name aliases [#]_ and named sequences [#]_ has been added.
 
 
-.. function:: name(chr[, default])
+.. function:: name(chr, default=None, /)
 
 Returns the name assigned to the character *chr* as a string. If no
 name is defined, *default* is returned, or, if not given, :exc:`ValueError` is
-raised.
+raised. For example::
+
+>>> unicodedata.name('½')
+'VULGAR FRACTION ONE HALF'
+>>> unicodedata.name('\uFFFF', 'fallback')
+'fallback'
 
 
-.. function:: decimal(chr[, default])
+.. function:: decimal(chr, default=None, /)
 
 Returns the decimal value assigned to the character *chr* as integer.
 If no such value is defined, *default* is returned, or, if not given,
-:exc:`ValueError` is raised.
+:exc:`ValueError` is raised. For example::
+
+>>> unicodedata.decimal('\N{ARABIC-INDIC DIGIT NINE}')
+9
+>>> unicodedata.decimal('\N{SUPERSCRIPT NINE}', -1)
+-1
 
 
-.. function:: digit(chr[, default])
+.. function:: digit(chr, default=None, /)
 
 Returns the digit value assigned to the character *chr* as integer.
 If no such value is defined, *default* is returned, or, if not given,
-:exc:`ValueError` is raised.
+:exc:`ValueError` is raised::
+
+>>> unicodedata.digit('\N{SUPERSCRIPT NINE}')
+9
 
 
-.. function:: numeric(chr[, default])
+.. function:: numeric(chr, default=None, /)
 
 Returns the numeric value assigned to the character *chr* as float.
 If no such value is defined, *default* is returned, or, if not given,
-:exc:`ValueError` is raised.
+:exc:`ValueError` is raised::
+
+>>> unicodedata.numeric('½')
+0.5
 
 
 .. function:: category(chr)
 
 Returns the general category assigned to the character *chr* as
-string.
+string. General category names consist of two letters.
+See the `General Category Values section of the Unicode Character
+Database documentation <https://www.unicode.org/reports/tr44/tr44-30.html#General_Category_Values>`_
+for a list of category codes. For example::
+
+>>> unicodedata.category('A') # 'L'etter, 'u'ppercase
+'Lu'
 
 
 .. function:: bidirectional(chr)
 
 Returns the bidirectional class assigned to the character *chr* as
 string. If no such value is defined, an empty string is returned.
+See the `Bidirectional Class Values section of the Unicode Character
+Database <https://www.unicode.org/reports/tr44/tr44-30.html#Bidi_Class_Values>`_
+documentation for a list of bidirectional codes. For example::
+
+>>> unicodedata.bidirectional('\N{ARABIC-INDIC DIGIT SEVEN}') # 'A'rabic, 'N'umber
+'AN'
 
 
 .. function:: combining(chr)
 
 Returns the canonical combining class assigned to the character *chr*
 as integer. Returns ``0`` if no combining class is defined.
+See the `Canonical Combining Class Values section of the Unicode Character
+Database <https://www.unicode.org/reports/tr44/tr44-30.html#Canonical_Combining_Class_Values>`_
+for more information.
 
 
 .. function:: east_asian_width(chr)
 
 Returns the east asian width assigned to the character *chr* as
-string.
+string. For a list of widths and for more information, see the
+`Unicode Standard Annex #11 <https://www.unicode.org/reports/tr11/tr11-41.html>`_.
 
 
 .. function:: mirrored(chr)
 
 Returns the mirrored property assigned to the character *chr* as
 integer. Returns ``1`` if the character has been identified as a "mirrored"
-character in bidirectional text, ``0`` otherwise.
+character in bidirectional text, ``0`` otherwise. For example::
+
+>>> unicodedata.mirrored('>')
+1
 
 
 .. function:: decomposition(chr)
 
 Returns the character decomposition mapping assigned to the character
 *chr* as string. An empty string is returned in case no such mapping is
-defined.
+defined. For example::
+
+>>> unicodedata.decomposition('Ã')
+'0041 0303'
 
 
 .. function:: normalize(form, unistr)
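
The hunk above adds reference links for ``combining()`` and ``east_asian_width()`` but no examples. The following is a minimal sketch of what such examples could look like. It is not part of the commit, and the dictionary names are only illustrative.

```python
import unicodedata

# combining(): 0 means no combining class is defined; non-zero values
# are canonical combining classes from the Unicode Character Database.
combining_classes = {
    'A': unicodedata.combining('A'),
    'COMBINING TILDE': unicodedata.combining('\N{COMBINING TILDE}'),
    'COMBINING CEDILLA': unicodedata.combining('\N{COMBINING CEDILLA}'),
}
print(combining_classes)
# {'A': 0, 'COMBINING TILDE': 230, 'COMBINING CEDILLA': 202}

# east_asian_width(): short width codes described in
# Unicode Standard Annex #11.
widths = {
    'narrow': unicodedata.east_asian_width('A'),
    'fullwidth': unicodedata.east_asian_width('\N{FULLWIDTH LATIN CAPITAL LETTER A}'),
    'halfwidth': unicodedata.east_asian_width('\N{HALFWIDTH KATAKANA LETTER A}'),
    'wide': unicodedata.east_asian_width('\u4e2d'),  # CJK ideograph
}
print(widths)
# {'narrow': 'Na', 'fullwidth': 'F', 'halfwidth': 'H', 'wide': 'W'}
```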
@@ -122,9 +175,9 @@ following functions:
 normally would be unified with other characters. For example, U+2160 (ROMAN
 NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I).
 However, it is supported in Unicode for compatibility with existing character
-sets (e.g. gb2312).
+sets (for example, gb2312).
 
-The normal form KD (NFKD) will apply the compatibility decomposition, i.e.
+The normal form KD (NFKD) will apply the compatibility decomposition, that is,
 replace all compatibility characters with their equivalents. The normal form KC
 (NFKC) first applies the compatibility decomposition, followed by the canonical
 composition.
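
To illustrate the difference between NFKD and NFKC described in this context, here is a small sketch that is not part of the commit. It reuses U+2160 from the text and adds U+00C3, the character used in the ``decomposition()`` example, to show the recomposition step. The variable names are illustrative.

```python
import unicodedata

# One compatibility character (U+2160) followed by one precomposed
# character (U+00C3, LATIN CAPITAL LETTER A WITH TILDE).
text = '\N{ROMAN NUMERAL ONE}\N{LATIN CAPITAL LETTER A WITH TILDE}'

# NFKD only decomposes: U+2160 becomes U+0049, U+00C3 becomes U+0041 U+0303.
nfkd = unicodedata.normalize('NFKD', text)

# NFKC decomposes the same way, then recomposes U+0041 U+0303 into U+00C3.
nfkc = unicodedata.normalize('NFKC', text)

nfkd_points = [f'U+{ord(c):04X}' for c in nfkd]
nfkc_points = [f'U+{ord(c):04X}' for c in nfkc]
print(nfkd_points)  # ['U+0049', 'U+0041', 'U+0303']
print(nfkc_points)  # ['U+0049', 'U+00C3']
```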
@@ -133,6 +186,7 @@ following functions:
 a human reader, if one has combining characters and the other
 doesn't, they may not compare equal.
 
+
 .. function:: is_normalized(form, unistr)
 
 Return whether the Unicode string *unistr* is in the normal form *form*. Valid
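
The note in this hunk's context says that visually identical strings may not compare equal. A minimal sketch of that situation, and of how ``is_normalized()`` and ``normalize()`` relate to it, might look like the following. It is not part of the commit, and it assumes Python 3.8 or later, where ``is_normalized()`` is available.

```python
import unicodedata

composed = '\N{LATIN CAPITAL LETTER C WITH CEDILLA}'  # single code point U+00C7
decomposed = 'C\N{COMBINING CEDILLA}'                  # U+0043 followed by U+0327

# Both render the same, but they are different code point sequences.
raw_equal = composed == decomposed
print(raw_equal)  # False

# Only the precomposed string is already in NFC.
composed_is_nfc = unicodedata.is_normalized('NFC', composed)
decomposed_is_nfc = unicodedata.is_normalized('NFC', decomposed)
print(composed_is_nfc, decomposed_is_nfc)  # True False

# Normalizing both sides to the same form makes the comparison succeed.
normalized_equal = (unicodedata.normalize('NFC', composed)
                    == unicodedata.normalize('NFC', decomposed))
print(normalized_equal)  # True
```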
@@ -154,24 +208,6 @@ In addition, the module exposes the following constant:
 Unicode database version 3.2 instead, for applications that require this
 specific version of the Unicode database (such as IDNA).
 
-Examples:
-
->>> import unicodedata
->>> unicodedata.lookup('LEFT CURLY BRACKET')
-'{'
->>> unicodedata.name('/')
-'SOLIDUS'
->>> unicodedata.decimal('9')
-9
->>> unicodedata.decimal('a')
-Traceback (most recent call last):
-File "<stdin>", line 1, in <module>
-ValueError: not a decimal
->>> unicodedata.category('A') # 'L'etter, 'u'ppercase
-'Lu'
->>> unicodedata.bidirectional('\u0660') # 'A'rabic, 'N'umber
-'AN'
-
 
 .. rubric:: Footnotes
 
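
The last hunk removes the combined examples block and leaves the version 3.2 database object without one. As a hedged sketch that is not part of the commit, that object can be queried with the same function names as the module itself. The variable name ``legacy`` is illustrative.

```python
import unicodedata

# ucd_3_2_0 offers the module's functions backed by Unicode 3.2 data.
legacy = unicodedata.ucd_3_2_0

legacy_version = legacy.unidata_version
print(legacy_version)  # 3.2.0

# Characters that already existed in Unicode 3.2 behave as usual.
solidus_name = legacy.name('/')
nine_value = legacy.decimal('9')
print(solidus_name, nine_value)  # SOLIDUS 9
```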