A significant subset of Unicode characters is visually similar to Latin letters but carries distinct symbolic and linguistic meanings. For example, the Latin "a" and the Cyrillic "а" appear visually identical, but their underlying Unicode code points, U+0061 and U+0430, are not equivalent. The noise introduced by homoglyphs has ramifications for cybersecurity and natural language processing, as homoglyphs have been found in material ranging from spoofed domain names to inappropriate tweets.
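As a minimal sketch of why this matters in practice, the Python snippet below inspects the two code points and then maps a spoofed string back to Latin letters. The `HOMOGLYPH_TO_LATIN` table and the `normalize_homoglyphs` helper are hypothetical and hand-picked for illustration; they are not the Unicode Standard's data, the human annotations, or GPT-4o output.

```python
import unicodedata

latin_a = "\u0061"     # LATIN SMALL LETTER A
cyrillic_a = "\u0430"  # CYRILLIC SMALL LETTER A

# Visually alike, yet different code points with different Unicode names.
for ch in (latin_a, cyrillic_a):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Plain equality and NFKC normalization both keep the two letters apart.
same_char = latin_a == cyrillic_a
same_after_nfkc = unicodedata.normalize("NFKC", cyrillic_a) == latin_a

# Hypothetical lookup table (an assumption for this example only).
HOMOGLYPH_TO_LATIN = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
}


def normalize_homoglyphs(text: str) -> str:
    """Replace each character found in the table with its Latin look-alike."""
    return "".join(HOMOGLYPH_TO_LATIN.get(ch, ch) for ch in text)


spoofed = "p\u0430yp\u0430l"  # looks like "paypal" but contains Cyrillic letters
restored = normalize_homoglyphs(spoofed)

print(same_char, same_after_nfkc, spoofed == "paypal", restored == "paypal")
# prints: False False False True
```

A fixed table like this only covers the characters someone thought to list, which is one reason different sources of homoglyph mappings are worth comparing.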
The following visualization allows a direct comparison between
the 26 Latin letters and their common non-Latin homoglyphs, as
determined by 1) the Unicode Standard, 2) a human annotator, and 3) GPT-4o
used as a normalization tool.
Color encodes the data source of each matrix. The letter
visual similarity scores (range 1-7), derived by Simpson et al., are
depicted in the blue matrix. The common homoglyphs established by the
Unicode Standard (Std) are presented in the red matrix. The common
homoglyphs determined by the human annotator and by the GPT-4o large
language model were both found through observation of a sample (n=700)
of real-world tweets containing homoglyph characters. As such, both of
these matrices are green.
[Matrix panels: Unicode Std, Human, GPT]