PHP RFC: Unicode Codepoint Escape Syntax


Despite the wide and increasing adoption of Unicode (and UTF-8 in particular) in PHP applications, PHP does not yet have a Unicode codepoint escape syntax in string literals, unlike many other languages. This is unfortunate, as in many cases it can be useful to specify Unicode codepoints by number, rather than embedding the character directly. For example, say you wish to output the UTF-8 encoded Unicode codepoint U+202E RIGHT-TO-LEFT OVERRIDE in order to display text right-to-left. You could embed it in source code directly, but it is an invisible character and would display the rest of the line of code (or indeed the entire program) in reverse!

The solution is to add a Unicode codepoint escape sequence syntax to string literals. This would mean you could produce U+202E like so:

echo "\u{202E}Reversed text"; // outputs ‮Reversed text
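
As a rough illustration of what the escape stands for, the sketch below inspects the bytes it produces. It assumes a PHP version that implements this RFC (PHP 7.0 or later); the variable names are illustrative only. It also shows that the escape is only interpreted in double-quoted strings, and that the result matches the same bytes written with the existing `\x` byte escapes.

```php
<?php
// Assumes PHP 7.0 or later, where the \u{...} escape is available.

$rlo = "\u{202E}"; // double-quoted: the escape is interpreted
$raw = '\u{202E}'; // single-quoted: taken literally, no escape processing

echo bin2hex($rlo), "\n"; // e280ae (the UTF-8 encoding of U+202E)
echo strlen($rlo), "\n";  // 3
echo strlen($raw), "\n";  // 8

// The same three bytes written with byte-level hex escapes.
var_dump($rlo === "\xE2\x80\xAE"); // bool(true)
```

The codepoint form is arguably easier to check against a Unicode chart than the three hand-encoded bytes.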

Another use is to distinguish between visually similar or identical, yet differently encoded, Unicode characters when you need to output one or the other specifically. The following two lines of code produce slightly different output, but you couldn't tell by looking at them:

echo "mañana";
echo "mañana";

However, by using an escape sequence to produce the ñ, it becomes clearer:

echo "ma\u{00F1}ana"; // pre-composed character
echo "man\u{0303}ana"; // "n" with combining ~ character (U+0303)
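
To make the difference between the two forms concrete, the following sketch compares them byte by byte. It assumes PHP 7.0 or later; the variable names are made up for the example.

```php
<?php
// Assumes PHP 7.0 or later, where the \u{...} escape is available.

$precomposed = "ma\u{00F1}ana"; // U+00F1 LATIN SMALL LETTER N WITH TILDE
$combining   = "man\u{0303}ana"; // "n" followed by U+0303 COMBINING TILDE

// The strings render alike but are not equal as byte strings.
var_dump($precomposed === $combining); // bool(false)

echo strlen($precomposed), ' ', bin2hex($precomposed), "\n"; // 7 6d61c3b1616e61
echo strlen($combining), ' ', bin2hex($combining), "\n";     // 8 6d616ecc83616e61
```

The pre-composed form is one byte shorter because U+00F1 encodes to two bytes (c3 b1), while "n" plus U+0303 takes three (6e cc 83).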

A further use is to produce characters you can't type on your keyboard. If you are unable to type the emoji for FACE WITH TEARS OF JOY, you can use its escape sequence instead:

echo "\u{1F602}"; // outputs 😂
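
To show which bytes such an escape expands to, here is a minimal sketch of a UTF-8 encoder written in userland PHP. The function `codepoint_to_utf8` is a hypothetical helper invented for this illustration, not part of PHP or of this RFC, and it does not treat surrogate codepoints specially. It assumes PHP 7.0 or later so that its output can be compared against the escape syntax.

```php
<?php
// Hypothetical helper (not part of PHP): encode one codepoint as UTF-8 bytes.
function codepoint_to_utf8(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF) {
        throw new InvalidArgumentException('codepoint out of range');
    }
    if ($cp < 0x80) {
        // 1 byte: 0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(codepoint_to_utf8(0x1F602)), "\n";       // f09f9882
var_dump(codepoint_to_utf8(0x1F602) === "\u{1F602}"); // bool(true)
var_dump(codepoint_to_utf8(0x202E) === "\u{202E}");   // bool(true)
```

With the escape syntax, none of this runtime work is needed for codepoints known when the code is written: the literal already contains the encoded bytes.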


Version  Changed                                                    Date
5        Added Errata
0.1.3    \u without a following opening { passes through verbatim
0.1.2    Ruby support
0.1.1    Added Future Scope note on named literals
0.1      Initial version


An option needs 2/3 votes to win

Accept the Unicode Codepoint Escape Syntax RFC and merge into master? (92% approved)
