• Web applications that accept input strings from untrusted sources filter and validate those strings based on their character data
  • Java represents character information according to the Unicode Standard, which defines how to:
    – Check whether two strings are equivalent to each other
    – Transform text into a particular Unicode normalization form, based on either canonical or compatibility equivalence
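
The equivalence check described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the class and method names (`UnicodeEquivalence`, `canonicallyEquivalent`, `compatibilityEquivalent`) are hypothetical, and the sketch assumes that comparing NFC (or NFKC) normalized copies is an acceptable way to test equivalence.

```java
import java.text.Normalizer;

// Hypothetical helper class for illustration only.
class UnicodeEquivalence {

    // Two strings are canonically equivalent if their NFC forms are equal.
    static boolean canonicallyEquivalent(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    // Two strings are compatibility equivalent if their NFKC forms are equal.
    static boolean compatibilityEquivalent(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFKC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFKC));
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9";  // U+00E9, e with acute as one code point
        String decomposed = "e\u0301";  // U+0065 followed by U+0301 combining acute

        System.out.println(precomposed.equals(decomposed));                 // false
        System.out.println(canonicallyEquivalent(precomposed, decomposed)); // true

        // U+FB01 (the "fi" ligature) matches "fi" only under compatibility equivalence.
        System.out.println(canonicallyEquivalent("\uFB01", "fi"));          // false
        System.out.println(compatibilityEquivalent("\uFB01", "fi"));        // true
    }
}
```

A plain `String.equals` compares code units, so it treats the two spellings of the accented letter as different even though they render identically.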

Normalization Forms

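  • Normalization Form D (NFD): Canonical Decomposition
    – Decomposes characters by canonical equivalence
  • Normalization Form C (NFC): Canonical Decomposition, followed by Canonical Composition
    – Decomposes characters by canonical equivalence, then recomposes them by canonical equivalence. For singletons, the recomposed result may differ from the text before decomposition
  • Normalization Form KD (NFKD): Compatibility Decomposition
    – Decomposes characters by compatibility equivalence
  • Normalization Form KC (NFKC): Compatibility Decomposition, followed by Canonical Composition
    – Decomposes characters by compatibility equivalence, then recomposes them by canonical equivalence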
  • ref link: https://zh.wikipedia.org/wiki/Unicode%E7%AD%89%E5%83%B9%E6%80%A7
  • Using normalization forms KC and KD for arbitrary input strings may sometimes remove formatting distinctions that are important for text semantics
  • NFKC folds compatibility variants (such as small or fullwidth forms of a character) into their standard equivalents and then recomposes the result, so validation can run against a single, predictable representation of the input
  • The Normalizer.normalize method (in java.text) converts Unicode text into an equivalent composed or decomposed form, making sorting and searching of text easier
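
To make the four forms concrete, the sketch below applies each one to two sample characters and prints the resulting code points. The class and method names (`NormalizationForms`, `normalizedCodePoints`) and the choice of sample characters are assumptions made for illustration; only `java.text.Normalizer` and its `Form` enum come from the Java platform.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.stream.Collectors;

// Hypothetical demo class for illustration only.
class NormalizationForms {

    // Normalizes the input to the given form and lists the resulting code points.
    static String normalizedCodePoints(String input, Form form) {
        return Normalizer.normalize(input, form)
                .codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String angstrom = "\u212B"; // U+212B ANGSTROM SIGN, a singleton
        String ligature = "\uFB01"; // U+FB01 LATIN SMALL LIGATURE FI

        for (Form form : Form.values()) {
            System.out.println(form
                    + "  angstrom: " + normalizedCodePoints(angstrom, form)
                    + "  ligature: " + normalizedCodePoints(ligature, form));
        }
        // NFC maps the singleton U+212B to U+00C5, so the result differs
        // from the original text even though both are canonically equivalent.
        // Only the compatibility forms (NFKD, NFKC) split the ligature
        // into U+0066 U+0069.
    }
}
```

The ligature case shows the formatting distinction that the compatibility forms remove: after NFKC or NFKD the text no longer records that a ligature was used.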

Input Validation Errors: Improper Validation of Strings (Cont’d)

  • In the vulnerable code, the string is validated before it is normalized, so the check runs on the raw, unnormalized input
  • The validation logic fails to detect malicious input because a check for angle brackets (< and >) does not match their alternate Unicode representations

Vulnerable Code
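
The original code for this slide is not included in the text, so the following is a reconstruction sketch of the flaw described here rather than the original listing. The class name, method name, and sample input are hypothetical; the sample uses U+FE64 and U+FE65 (small less-than and greater-than signs), which NFKC maps to the ordinary angle brackets.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;

// Hypothetical class illustrating the flawed order of operations.
class VulnerableValidation {

    private static final Pattern ANGLE_BRACKETS = Pattern.compile("[<>]");

    // FLAWED: the check runs on the raw input, and normalization happens afterwards.
    static String validateThenNormalize(String input) {
        if (ANGLE_BRACKETS.matcher(input).find()) {
            throw new IllegalStateException("Angle brackets are not allowed");
        }
        // Normalization can introduce the very characters the check was meant to block.
        return Normalizer.normalize(input, Form.NFKC);
    }

    public static void main(String[] args) {
        // U+FE64 and U+FE65 are not matched by the [<>] pattern.
        String untrusted = "\uFE64" + "script" + "\uFE65";
        System.out.println(validateThenNormalize(untrusted)); // prints <script>
    }
}
```

The input passes validation, and the subsequent normalization then produces a literal `<script>` tag.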

  • In the secure code, the string is normalized first, which converts alternate representations into the canonical angle brackets, and only then validated
  • The validation logic throws an IllegalStateException when it detects angle brackets in the normalized input

Secure Code
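
The original code for this slide is not included in the text, so the following is a reconstruction sketch of the corrected order (normalize, then validate) rather than the original listing. The class name, method name, and sample input are hypothetical.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;

// Hypothetical class illustrating the corrected order of operations.
class SecureValidation {

    private static final Pattern ANGLE_BRACKETS = Pattern.compile("[<>]");

    // Normalize first so that alternate representations become plain < and >,
    // then validate the normalized string and return that same string.
    static String normalizeThenValidate(String input) {
        String normalized = Normalizer.normalize(input, Form.NFKC);
        if (ANGLE_BRACKETS.matcher(normalized).find()) {
            throw new IllegalStateException("Angle brackets are not allowed");
        }
        return normalized;
    }

    public static void main(String[] args) {
        String untrusted = "\uFE64" + "script" + "\uFE65";
        try {
            normalizeThenValidate(untrusted);
        } catch (IllegalStateException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Returning the normalized string, rather than the original input, matters here: later code should work with the exact value that was validated.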