You can now use Intl.Segmenter for locale-sensitive text segmentation to split
a string into words, sentences, or graphemes.
Many languages written in non-Latin scripts, such as Chinese and Japanese,
don’t use spaces to separate words. Therefore, using the JavaScript split()
method on whitespace to split such text into words returns incorrect results.
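As a quick illustration of the problem, the sketch below splits an English and a Japanese sentence on whitespace. Both sample strings are assumptions chosen for this example.

```javascript
// Sample sentences (assumed for illustration).
const english = 'The quick brown fox';
const japanese = '吾輩は猫である。名前はまだ無い。';

// Splitting on whitespace works when words are separated by spaces...
const englishWords = english.split(/\s+/);
console.log(englishWords.length); // 4

// ...but the Japanese text contains no whitespace, so it comes back
// as a single "word" holding the entire string.
const japaneseWords = japanese.split(/\s+/);
console.log(japaneseWords.length); // 1
```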
When creating a new Intl.Segmenter object with the Intl.Segmenter()
constructor, pass in a locale and an options object. The granularity option
can have a value of "grapheme" (the default), "word", or "sentence". The
following example creates a new Intl.Segmenter object for Japanese, splitting
on words.
const segmenter = new Intl.Segmenter('ja-JP', { granularity: 'word' });
Calling the segment() method on an Intl.Segmenter object with a string of text
returns an iterable:
// Sample input (hypothetical); any string can be used here.
const str = '吾輩は猫である。';
const segments = segmenter.segment(str);
console.table(Array.from(segments));
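Putting the pieces together, here is a minimal, self-contained sketch that extracts only the words from a Japanese string. The sample text and the variable names are assumptions for illustration. The exact word boundaries depend on the segmentation data shipped with the JavaScript engine, so no expected output is shown.

```javascript
// Sample text (assumed for illustration).
const text = '吾輩は猫である。名前はまだ無い。';

const wordSegmenter = new Intl.Segmenter('ja-JP', { granularity: 'word' });

// Each item has `segment` (the text), `index` (its offset in the input),
// `input` (the original string) and, for word granularity, `isWordLike`.
const allSegments = Array.from(wordSegmenter.segment(text));

// Keep only word-like segments, dropping punctuation and spaces.
const words = allSegments
  .filter((s) => s.isWordLike)
  .map((s) => s.segment);

console.log(words);
```

Filtering on isWordLike is one way to discard punctuation. Keep the full list instead if the original string needs to be reassembled from its segments.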
Read "Using the Intl.Segmenter API" on the Polypane blog for an excellent
tutorial on how to use this feature. "International Text Segmentation with
Intl.Segmenter in JavaScript" has more examples, including how to use
Intl.Segmenter with emoji.
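For the emoji case, a rough sketch of why grapheme granularity matters: a single visible emoji can consist of several code points. The countGraphemes helper below is hypothetical, written only for this example, and the emoji is spelled with escapes so that the invisible joiners are explicit.

```javascript
// Family emoji built from four emoji joined by zero-width joiners (U+200D).
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';

// Hypothetical helper: counts user-perceived characters in a string.
function countGraphemes(input, locale = 'en') {
  const graphemeSegmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  return Array.from(graphemeSegmenter.segment(input)).length;
}

console.log(family.length);          // 11 UTF-16 code units
console.log([...family].length);     // 7 code points
console.log(countGraphemes(family)); // 1 grapheme
```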