A Visual Representation of Homoglyphs and the Latin Letters They Resemble

Homoglyphs

A significant subset of Unicode characters is visually similar to Latin letters but carries distinct symbolic and linguistic meanings. For example, the Latin "a" and the Cyrillic "а" appear visually identical, but their underlying Unicode code points, U+0061 and U+0430, are not equivalent. The noise introduced by homoglyphs has ramifications for cybersecurity and natural language processing, as homoglyphs have been found in material ranging from spoofed domain names to inappropriate tweets.
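The difference between the two code points can be checked directly. The following is a minimal sketch using only Python's standard `unicodedata` module; the helper name `describe` is an illustrative assumption and is not part of the visualization.

```python
import unicodedata

latin_a = "\u0061"     # Latin small letter a
cyrillic_a = "\u0430"  # Cyrillic small letter a

def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

print(describe(latin_a))      # U+0061 LATIN SMALL LETTER A
print(describe(cyrillic_a))   # U+0430 CYRILLIC SMALL LETTER A

# The glyphs look alike on screen, but the strings are not equal.
print(latin_a == cyrillic_a)  # False
```

Because string comparison works on code points rather than rendered shapes, any exact-match lookup (a domain allowlist, a keyword filter) treats the two characters as unrelated.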


The following visualization allows a direct comparison between the 26 Latin letters and the non-Latin homoglyphs most commonly associated with each, as determined by 1) the Unicode Standard, 2) a human annotator, and 3) GPT-4o used as a normalization tool.

Color encodes the data source of each matrix. The letter visual similarity scores (range 1-7), derived by Simpson et al., are shown in the blue matrix. The common homoglyphs established by the Unicode Standard (std) are shown in the red matrix. The common homoglyphs determined by a human annotator and by the GPT-4o large language model were both identified from a sample (n=700) of real-world tweets containing homoglyph characters, so both of those matrices are green.
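One way to picture how "common homoglyphs" per Latin letter could be tallied from annotated tweets is sketched below. This is an assumption about the general approach, not the procedure actually used for the matrices: the `annotations` list is hypothetical placeholder data, and the function name `common_homoglyphs` and its `top_n` parameter are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical annotations, NOT data from the study:
# (homoglyph seen in a tweet, Latin letter it was read as).
annotations = [
    ("\u0430", "a"),  # Cyrillic а
    ("\u0430", "a"),
    ("\u03b1", "a"),  # Greek α
    ("\u043e", "o"),  # Cyrillic о
    ("\u03bf", "o"),  # Greek ο
    ("\u043e", "o"),
]

def common_homoglyphs(pairs, top_n=3):
    """Map each Latin letter to its most frequently observed homoglyphs."""
    counts = defaultdict(Counter)
    for glyph, letter in pairs:
        counts[letter][glyph] += 1
    return {
        letter: [glyph for glyph, _ in counter.most_common(top_n)]
        for letter, counter in counts.items()
    }

result = common_homoglyphs(annotations)
for letter, glyphs in sorted(result.items()):
    print(letter, [f"U+{ord(g):04X}" for g in glyphs])
```

Running the same tally once over the human annotator's labels and once over the model's labels would give two per-letter lists that can be compared side by side.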


Usage Tips

Sort All Based on Average...

Sort All Based on Maximum...

Search for a Cell

