TL;DR
A bug in how a collaborative editor handles Unicode surrogate pairs can cause silent data loss. Developers confirmed it after finding that inserting certain emojis broke document sync. The issue shows how easily Unicode handling goes wrong in web apps.
Developers have confirmed a bug in their real-time collaborative editing tool that causes silent data loss when inserting certain emojis, specifically those requiring surrogate pairs in Unicode. This issue impacts the reliability of content synchronization in web-based editors, raising concerns for developers working with Unicode and emoji input.
The bug was discovered while debugging a migration to a collaborative editor built on TipTap, ProseMirror, and Yjs. Developers observed that when users inserted emojis above U+FFFF, such as 🤠 or 👩‍🚀, the underlying CRDT library could split a surrogate pair, leaving a malformed string that failed to encode. Subsequent edits were then lost without any notification to the user.
Further investigation showed that the issue occurred only when an editing operation cut the string at an offset that fell inside a surrogate pair. JavaScript strings are indexed by UTF-16 code units, not bytes or characters, so such an offset is a valid index even though it lands in the middle of a character. The root cause was identified as the lib0 splice method, which used JavaScript’s .slice() to modify strings containing surrogate pairs. When a splice split a pair, it left orphaned surrogates that caused URI errors during encoding, halting the sync process silently.
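The failure mode described above can be reproduced in a few lines. This is a minimal sketch using plain String.prototype.slice, not the actual lib0 code. encodeURIComponent is used only because it is a well-known built-in that throws URIError on a lone surrogate; the article does not say which encoding call failed in practice.

```javascript
// Illustrative only: plain .slice(), not the lib0 implementation.
const text = "hi🤠!"; // 🤠 is U+1F920, stored as two UTF-16 code units

// Offsets count code units: h=0, i=1, high surrogate=2, low surrogate=3, !=4
const left = text.slice(0, 3); // ends with a lone high surrogate (0xD83E)
const right = text.slice(3);   // starts with a lone low surrogate (0xDD20)

// Return the encoded string, or the error name if encoding throws.
function tryEncode(s) {
  try {
    return encodeURIComponent(s);
  } catch (err) {
    return err.name;
  }
}

const wholeResult = tryEncode(text);  // encodes normally
const leftResult = tryEncode(left);   // "URIError"
const rightResult = tryEncode(right); // "URIError"
console.log(wholeResult, leftResult, rightResult);
```

If an exception like this is caught and discarded somewhere in a sync pipeline, nothing reaches the user, which matches the silent behavior the developers reported.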
Why It Matters
This bug highlights a critical challenge in handling Unicode characters, especially emojis, in web applications. Since emojis are increasingly common in user content, such issues can lead to data loss and reduce trust in collaborative tools. The incident underscores the importance of robust Unicode handling and error management in real-time editing platforms.

Background
Prior to this discovery, developers suspected network issues or websocket instability as causes of sync failures. The problem was rare and difficult to reproduce, often occurring only with specific emoji operations. The bug emerged as part of ongoing efforts to enhance collaborative editing experiences, which rely heavily on CRDT algorithms and Unicode string manipulation.
Unicode surrogate pairs are a well-known complexity in string processing, but their impact on CRDT-based sync systems was previously underestimated. This incident has prompted a review of string handling practices in such systems.
“This bug exposed the fragility of handling surrogate pairs in our CRDT operations, causing silent data loss during collaborative editing.”
— Lead developer
“Inserting specific emojis triggered the splice that broke the sync, revealing a subtle Unicode handling flaw.”
— Product manager involved in debugging
What Remains Unclear
It remains unclear whether similar issues exist in other parts of the system or with other Unicode characters. The full scope of the bug’s impact across different browsers, devices, and input methods is still being evaluated. Additionally, the long-term solution and whether other CRDT implementations are affected are under investigation.
What’s Next
The development team is working on patches to improve Unicode handling, including safer string manipulation methods that avoid splitting surrogate pairs. They plan to release an update in the coming weeks, along with guidelines for developers on managing Unicode in CRDT-based systems.
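One way a splice could avoid splitting surrogate pairs is to widen the cut whenever a boundary lands inside a pair. The sketch below is an assumption about what such a method might look like, not the team's patch; safeSplice and splitsPair are hypothetical names that do not exist in lib0 or Yjs.

```javascript
// Hypothetical helpers; not part of lib0, Yjs, or any library named in the article.
const isHigh = (cu) => cu >= 0xd800 && cu <= 0xdbff;
const isLow = (cu) => cu >= 0xdc00 && cu <= 0xdfff;

// True when offset `i` sits between a high surrogate and its low surrogate.
function splitsPair(str, i) {
  return i > 0 && i < str.length &&
    isHigh(str.charCodeAt(i - 1)) && isLow(str.charCodeAt(i));
}

// Remove `remove` code units at `index` and insert `insert`, moving the
// boundaries so that a surrogate pair is never divided.
function safeSplice(str, index, remove, insert = "") {
  let start = Math.max(0, Math.min(index, str.length));
  let end = Math.max(start, Math.min(index + remove, str.length));
  const pureInsert = end === start;
  if (splitsPair(str, start)) start -= 1;  // pull start back to the pair's first unit
  if (pureInsert) end = start;             // insert before the character, delete nothing
  else if (splitsPair(str, end)) end += 1; // push end past the pair's second unit
  return str.slice(0, start) + insert + str.slice(end);
}

console.log(safeSplice("a🤠b", 2, 0, "X")); // offset 2 is inside 🤠; X goes before it
console.log(safeSplice("a🤠b", 0, 2));      // removes "a" and the whole emoji
```

Moving offsets changes what an operation means, so in a CRDT every replica would need to apply the same rule, or the rule would need to be applied before the operation is created. Treat this as an illustration of the boundary check rather than a drop-in fix.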
Key Questions
What are surrogate pairs in Unicode?
Surrogate pairs are how UTF-16 encodes characters outside the Basic Multilingual Plane (U+0000 to U+FFFF). Each such character takes two 16-bit code units: a high surrogate (U+D800 to U+DBFF) followed by a low surrogate (U+DC00 to U+DFFF). Many emojis are encoded this way.
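The arithmetic behind a surrogate pair is short enough to show directly. The sketch below applies the standard UTF-16 rule to 🤠 (U+1F920); toSurrogatePair is a hypothetical helper name used for illustration.

```javascript
// Split a supplementary code point (above U+FFFF) into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;    // leaves a 20-bit value
  const high = 0xd800 + (offset >> 10);  // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

const cowboy = "🤠"; // U+1F920
const [high, low] = toSurrogatePair(cowboy.codePointAt(0));

console.log(cowboy.length);                       // 2 (code units, not characters)
console.log([...cowboy].length);                  // 1 (string iteration walks code points)
console.log(high.toString(16), low.toString(16)); // d83e dd20
```

The gap between .length and the iterated length is why index-based operations such as .slice() can land in the middle of a character.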
Why did this bug cause silent data loss?
The splice operation split surrogate pairs, creating orphaned surrogates that the encoding process couldn’t handle, leading to errors that halted synchronization without notifying users.
Is this issue limited to specific emojis?
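Yes. It primarily affects emojis that require surrogate pairs, i.e., those above U+FFFF, such as 🤠 or 👩‍🚀. Characters inside the Basic Multilingual Plane, including older symbols such as ☺ (U+263A), occupy a single code unit and cannot be split this way.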
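Some emojis are sequences of several code points, which gives them more than one offset where a cut would orphan a surrogate. As an illustration, the sketch below inspects the woman astronaut emoji, written with escapes so the zero-width joiner is visible in the source, and lists the offsets that fall inside a pair.

```javascript
// Woman astronaut: woman (U+1F469) + zero-width joiner (U+200D) + rocket (U+1F680).
const astronaut = "\u{1F469}\u200D\u{1F680}";

const codePoints = [...astronaut].map((ch) => ch.codePointAt(0).toString(16));

// Collect every offset that sits between a high and a low surrogate.
const unsafeOffsets = [];
for (let i = 1; i < astronaut.length; i++) {
  const prev = astronaut.charCodeAt(i - 1);
  const cur = astronaut.charCodeAt(i);
  if (prev >= 0xd800 && prev <= 0xdbff && cur >= 0xdc00 && cur <= 0xdfff) {
    unsafeOffsets.push(i); // cutting here orphans a surrogate
  }
}

console.log(astronaut.length); // 5 code units
console.log(codePoints);       // [ '1f469', '200d', '1f680' ]
console.log(unsafeOffsets);    // [ 1, 4 ]
```

A cut at offset 2 or 3 would break the emoji apart visually but would still leave well-formed strings, so it would not trigger the encoding error described in this article.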
Will this bug affect other applications?
Potentially. Any application that manipulates Unicode strings with surrogate pairs and relies on similar splice operations could encounter comparable issues.
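Applications that want to catch this class of problem early could check strings for unpaired surrogates before encoding or sending them. The following is a minimal sketch; findLoneSurrogate and assertWellFormed are hypothetical names. Newer JavaScript runtimes also provide String.prototype.isWellFormed(), which performs the same kind of check.

```javascript
// Hypothetical guard: index of the first unpaired surrogate, or -1 if none.
function findLoneSurrogate(str) {
  for (let i = 0; i < str.length; i++) {
    const cu = str.charCodeAt(i);
    if (cu >= 0xd800 && cu <= 0xdbff) {
      const next = str.charCodeAt(i + 1); // NaN at end of string
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // valid pair, skip the low half
        continue;
      }
      return i; // high surrogate with no partner
    }
    if (cu >= 0xdc00 && cu <= 0xdfff) return i; // low surrogate with no preceding high
  }
  return -1;
}

// Fail loudly instead of letting a malformed string reach the encoder.
function assertWellFormed(str) {
  const at = findLoneSurrogate(str);
  if (at !== -1) {
    throw new RangeError(`Lone surrogate at code unit ${at}`);
  }
  return str;
}
```

Calling assertWellFormed on the result of every string operation turns a silent sync stall into an error that can be logged or shown to the user.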