UnicodeData.txt

Some personal notes on UnicodeData.txt.

Field 0

Code point.
A hexadecimal number without prefix. It can be stored in a wchar_t, provided wchar_t is 32 bits wide; a 16-bit wchar_t cannot hold code points above FFFF.
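A minimal sketch in C of reading this field, assuming the line comes straight from UnicodeData.txt. The function names parse_code_point and wchar_holds_all_code_points are made up for this note, and the parsing is deliberately loose (strtoul also accepts leading blanks or a 0x prefix).

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Parse field 0 of a UnicodeData.txt line. Returns the code point,
   or -1 if the line does not start with a hex number <= 10FFFF
   followed by ';'. (Hypothetical helper, not part of any library.) */
static long parse_code_point(const char *line)
{
    char *end;
    unsigned long cp;

    assert(line != NULL);
    cp = strtoul(line, &end, 16);
    if (end == line || *end != ';' || cp > 0x10FFFFUL)
        return -1;
    return (long)cp;
}

/* Nonzero if every code point fits in this platform's wchar_t. */
static int wchar_holds_all_code_points(void)
{
    return WCHAR_MAX >= 0x10FFFF;
}
```

On a platform where wchar_holds_all_code_points() returns 0, the result of parse_code_point() should be kept in a wider type such as long.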

Field 1

Character name.
Can be used as a comment in header files.
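One way this could look, as a sketch: turn a line of UnicodeData.txt into a table entry for a generated header, with the name as a trailing comment. The function name format_header_entry and the layout of the entry are assumptions for illustration only.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format one UnicodeData.txt line as an entry for a generated header:
   field 0 as a hex literal, field 1 as a comment behind it.
   Returns 0 on success, -1 on a malformed line or a too-small buffer.
   (Hypothetical helper; the entry layout is only an example.) */
static int format_header_entry(const char *line, char *out, size_t out_size)
{
    const char *name, *name_end;
    int n;

    assert(line != NULL && out != NULL);
    name = strchr(line, ';');               /* end of field 0 */
    if (name == NULL || name == line)
        return -1;
    name_end = strchr(name + 1, ';');       /* end of field 1 */
    if (name_end == NULL)
        return -1;
    n = snprintf(out, out_size, "    0x%.*s, /* %.*s */",
                 (int)(name - line), line,
                 (int)(name_end - name - 1), name + 1);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

Character names consist of capital letters, digits, spaces and hyphens (plus a few bracketed placeholders such as <control>), so they cannot close the comment early.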

Field 2

General Category

L   Letter
Ll  Lower case            a
Lm  Modifier
Lo  Other                 ª
Lt  Title                 Dz
Lu  Upper case            A

M   Mark
Mc  Spacing combining
Me  Enclosing
Mn  Nonspacing            Accent

N   Number
Nd  Decimal               8
Nl  Letter
No  Other                 E.g. superscript

P   Punctuation
Pc  Connector             _
Pd  Dash                  -
Pe  Close                 )
Pf  Final quote           »
Pi  Initial quote         «
Po  Other                 &
Ps  Open                  (

S   Symbol
Sc  Currency              $
Sk  Modifier              ^
Sm  Math                  +
So  Other                 ¦

Z   Separator
Zl  Line
Zp  Paragraph
Zs  Space

C   Other
Cc  Control
Cf  Format
Cn  Not assigned
Co  Private
Cs  Surrogate

These can be used to find the non-ASCII equivalent of 'alnum', or its opposite.
I treat every category except Ll, Lo, Lt, Lu, Mc, Mn, Nd, Nl and Cs as a word separator / delimiter.
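The rule above can be sketched in C as a lookup on the two-letter category code from field 2. The function name is_word_category is hypothetical; the list of categories is exactly the one given in the text.

```c
#include <assert.h>
#include <string.h>

/* Nonzero if general category `gc` (field 2) counts as part of a word
   under the rule in the text; every other category is a delimiter.
   (Hypothetical helper for illustration.) */
static int is_word_category(const char *gc)
{
    static const char *const word_cats[] = {
        "Ll", "Lo", "Lt", "Lu", "Mc", "Mn", "Nd", "Nl", "Cs"
    };
    size_t i;

    assert(gc != NULL);
    for (i = 0; i < sizeof word_cats / sizeof word_cats[0]; i++)
        if (strcmp(gc, word_cats[i]) == 0)
            return 1;
    return 0;
}
```

Note that Lm (modifier letters) is not in the list, so under this rule such characters split words.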

Field 5

Character Decomposition Mapping. Type: E, S (enumeration, string). Status: N (normative).

<font>      A font variant (e.g. a blackletter form).
<noBreak>   A no-break version of a space or hyphen.
<initial>   An initial presentation form (Arabic).
<medial>    A medial presentation form (Arabic).
<final>     A final presentation form (Arabic).
<isolated>  An isolated presentation form (Arabic).
<circle>    An encircled form.
<super>     A superscript form.
<sub>       A subscript form.
<vertical>  A vertical layout presentation form.
<wide>      A wide (or zenkaku) compatibility character.
<narrow>    A narrow (or hankaku) compatibility character.
<small>     A small variant form (CNS compatibility).
<square>    A CJK squared font variant.
<fraction>  A vulgar fraction form.
<compat>    Otherwise unspecified compatibility character.

The decomposition mapping (compat and otherwise) can be used, among other things, to find ASCII equivalents for non-ASCII glyphs:

Glyph  ASCII
À      A
Ĳ      IJ
ǈ      Lj
Ⅷ      VIII
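A sketch in C of how such an ASCII equivalent could be derived from the text of field 5. The function name ascii_from_decomposition is hypothetical, and the approach is an assumption: skip a leading <tag> if there is one, keep the code points up to 7F, and drop everything else (such as combining accents).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Build an ASCII equivalent from a decomposition mapping (field 5),
   e.g. "<compat> 0056 0049 0049 0049" gives "VIII" and "0041 0300"
   gives "A". A leading <tag> is skipped; code points above 7F are
   dropped. Returns the number of characters written, or -1 on error.
   (Hypothetical helper; pass the field, not the whole line.) */
static int ascii_from_decomposition(const char *field5, char *out,
                                    size_t out_size)
{
    const char *p = field5;
    size_t n = 0;

    assert(field5 != NULL && out != NULL && out_size > 0);
    if (*p == '<') {                  /* skip "<compat>", "<super>", ... */
        p = strchr(p, '>');
        if (p == NULL)
            return -1;
        p++;
    }
    for (;;) {
        char *end;
        unsigned long cp = strtoul(p, &end, 16);

        if (end == p)                 /* no more hex numbers */
            break;
        if (cp != 0 && cp <= 0x7F) {  /* keep ASCII only */
            if (n + 1 >= out_size)
                return -1;
            out[n++] = (char)cp;
        }
        p = end;
    }
    out[n] = '\0';
    return (int)n;
}
```

The mappings in the file are single-step, so a result can come out empty or incomplete when the mapping points at another non-ASCII character; in that case the lookup would have to be repeated on the mapped code points.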

Field 6

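Decimal digit value. Type: E, N (enumeration, numeric). Status: N (normative).
E.g. 8 for ٨ (ARABIC-INDIC DIGIT EIGHT). Decimal digits (category Nd) have a value here. Ⅷ (ROMAN NUMERAL EIGHT) leaves this field empty and carries its 8 in field 8 (numeric value).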

Field 13

Lower case mapping.
This has some quirks. The lower case mapping is also given for characters that only exist for compatibility. For instance, 'OHM SIGN' ('Ω') decomposes to 'GREEK CAPITAL LETTER OMEGA' and gets the same lower case mapping. A 'towlower()' built from this field will therefore convert 'OHM SIGN' to 'GREEK SMALL LETTER OMEGA' ('ω'). Not the same thing at all.
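The quirk can be seen by reading field 13 directly. Below is a sketch in C; the function name simple_lowercase is hypothetical, and the two sample lines in the usage note are quoted from memory, so check them against your own copy of the file.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return field 13 (lower case mapping) of a UnicodeData.txt line as a
   code point, or -1 if the field is empty or missing.
   (Hypothetical helper for illustration.) */
static long simple_lowercase(const char *line)
{
    const char *p = line;
    char *end;
    unsigned long cp;
    int i;

    assert(line != NULL);
    for (i = 0; i < 13; i++) {        /* skip fields 0..12 */
        p = strchr(p, ';');
        if (p == NULL)
            return -1;
        p++;
    }
    cp = strtoul(p, &end, 16);
    if (end == p || cp > 0x10FFFFUL)  /* empty field or out of range */
        return -1;
    return (long)cp;
}
```

Fed with "2126;OHM SIGN;Lu;0;L;03A9;;;;N;OHM;;;03C9;" and with "03A9;GREEK CAPITAL LETTER OMEGA;Lu;0;L;;;;;N;;;;03C9;", the function returns 03C9 (GREEK SMALL LETTER OMEGA) in both cases.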