- Web applications that accept input strings from untrusted sources filter and validate those strings based on their character data
- Java follows the Unicode Standard for character information, which makes it possible to:
– Check whether two strings are equivalent to each other
– Transform a string into a particular Unicode normalization form, based on either canonical or compatibility equivalence
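The equivalence check mentioned above can be sketched with `java.text.Normalizer`. This is a minimal illustration, not a definitive implementation: the class name `EquivalenceDemo` and both helper methods are hypothetical names chosen for this example.

```java
import java.text.Normalizer;

final class EquivalenceDemo {
    // Hypothetical helper: two strings are canonically equivalent
    // if they are equal once both are normalized to NFC.
    static boolean canonicallyEquivalent(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    // Hypothetical helper: compatibility equivalence is the looser
    // relation, tested here by comparing the NFKC forms.
    static boolean compatibilityEquivalent(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFKC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFKC));
    }

    public static void main(String[] args) {
        String composed = "\u00E9";     // "é" as a single code point
        String decomposed = "e\u0301";  // "e" followed by a combining acute accent

        System.out.println(composed.equals(decomposed));                 // false
        System.out.println(canonicallyEquivalent(composed, decomposed)); // true

        String ligature = "\uFB01";     // the "fi" ligature
        System.out.println(canonicallyEquivalent(ligature, "fi"));       // false
        System.out.println(compatibilityEquivalent(ligature, "fi"));     // true
    }
}
```

A plain `equals` call compares code units, so it treats the two spellings of "é" as different strings even though they render identically.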
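Normalization Forms
- Normalization Form D (NFD)
– Canonical Decomposition: characters are decomposed by canonical equivalence
- Normalization Form C (NFC)
– Canonical Decomposition, followed by Canonical Composition: characters are decomposed by canonical equivalence and then recomposed by canonical equivalence. For singletons, the recomposed result may differ from the text before decomposition
- Normalization Form KD (NFKD)
– Compatibility Decomposition: characters are decomposed by compatibility equivalence
- Normalization Form KC (NFKC)
– Compatibility Decomposition, followed by Canonical Composition: characters are decomposed by compatibility equivalence and then recomposed by canonical equivalence
- ref link: https://zh.wikipedia.org/wiki/Unicode%E7%AD%89%E5%83%B9%E6%80%A7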
- Using normalization forms KC and KD for arbitrary input strings may sometimes remove formatting distinctions that are important for text semantics
- NFKC is nevertheless the most suitable form for input validation: it converts the input string into an equivalent canonical form that can be safely compared with the required input form
- The Normalizer.normalize method (java.text.Normalizer) converts Unicode text into an equivalent composed or decomposed form, making sorting and searching of text easier
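To make the difference between the forms concrete, here is a small sketch using `Normalizer.normalize`. The class name `NormalizationFormsDemo` and the helper `toNfkc` are illustrative assumptions; the sample strings were chosen only to show composition, decomposition, and the loss of a formatting distinction under NFKC.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

final class NormalizationFormsDemo {
    // Hypothetical convenience wrapper around the standard API.
    static String toNfkc(String s) {
        return Normalizer.normalize(s, Form.NFKC);
    }

    public static void main(String[] args) {
        // NFD splits "é" into a base letter plus a combining accent.
        System.out.println(Normalizer.normalize("\u00E9", Form.NFD).length());  // 2

        // NFC recombines the pair into a single code point.
        System.out.println(Normalizer.normalize("e\u0301", Form.NFC).length()); // 1

        // Superscript two (U+00B2) survives NFC unchanged...
        String squared = "x\u00B2";
        System.out.println(Normalizer.normalize(squared, Form.NFC).equals(squared)); // true

        // ...but NFKC folds it to a plain digit, losing the "squared" meaning.
        System.out.println(toNfkc(squared)); // x2
    }
}
```

The last line shows the trade-off: the same folding that makes NFKC useful for validation also erases distinctions that may matter when the text is displayed.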
Input Validation Errors: Improper Validation of Strings (Cont’d)
- In the code below, the string is validated before it is normalized, so the check runs against the raw input and fails to detect malicious input
- The check for angle brackets does not match their alternate Unicode representations, which normalization then converts into real angle brackets after validation has already passed
Vulnerable Code
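The original listing is not reproduced in these notes. The following is a minimal sketch reconstructed from the description above, assuming the input uses U+FE64 and U+FE65 (small less-than and greater-than signs) as the alternate representations; the class and method names are hypothetical.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;

final class VulnerableValidation {
    private static final Pattern ANGLE_BRACKETS = Pattern.compile("[<>]");

    // Hypothetical method: validates FIRST, normalizes SECOND (the flaw).
    static String validateThenNormalize(String s) {
        // The check only sees the raw input, which contains no '<' or '>'.
        if (ANGLE_BRACKETS.matcher(s).find()) {
            throw new IllegalStateException("Angle brackets are not allowed");
        }
        // NFKC now turns U+FE64 and U+FE65 into '<' and '>'.
        return Normalizer.normalize(s, Form.NFKC);
    }

    public static void main(String[] args) {
        // Assumed attacker-controlled input using small-form angle brackets.
        String input = "\uFE64" + "script" + "\uFE65";
        System.out.println(validateThenNormalize(input)); // <script>
    }
}
```

The returned string contains a real `<script>` tag even though validation reported no problem.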

- In the code below, the string is normalized first, which converts alternate representations into canonical angle brackets, and only then validated
- The input validation throws an IllegalStateException if it detects malicious input
Secure Code
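The original listing is not reproduced in these notes. As a minimal sketch of the approach described above, the version below normalizes to NFKC before checking for angle brackets; the class and method names are hypothetical, and the sample input is an assumption chosen to exercise the check.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;

final class SecureValidation {
    private static final Pattern ANGLE_BRACKETS = Pattern.compile("[<>]");

    // Hypothetical method: normalizes FIRST, validates SECOND.
    static String normalizeThenValidate(String s) {
        // Fold alternate representations into their canonical characters.
        String normalized = Normalizer.normalize(s, Form.NFKC);
        // The check now sees the same text the rest of the application will use.
        if (ANGLE_BRACKETS.matcher(normalized).find()) {
            throw new IllegalStateException("Angle brackets are not allowed");
        }
        return normalized;
    }

    public static void main(String[] args) {
        String input = "\uFE64" + "script" + "\uFE65";
        try {
            normalizeThenValidate(input);
            System.out.println("accepted");
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Returning the normalized string, rather than the original, keeps later code from working on a form that was never validated.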
