From d6e5aa3bdcae7c5f84f794ccc09c59dbe6666725 Mon Sep 17 00:00:00 2001 From: Richard Feldman Date: Mon, 16 Mar 2020 00:13:02 -0400 Subject: [PATCH 1/6] Write some Str docs --- compiler/builtins/docs/Str.roc | 110 +++++++++++++++++++++++++++++++-- 1 file changed, 105 insertions(+), 5 deletions(-) diff --git a/compiler/builtins/docs/Str.roc b/compiler/builtins/docs/Str.roc index 450189640c..39664350be 100644 --- a/compiler/builtins/docs/Str.roc +++ b/compiler/builtins/docs/Str.roc @@ -2,12 +2,66 @@ api Str provides Str, isEmpty, join ## Types -## A sequence of [UTF-8](https://en.wikipedia.org/wiki/UTF-8) text characters. +## A [Unicode](https://unicode.org) text value. +## +## Dealing with text is deep topic, so by design, Roc's `Str` module sticks +## to the basics. For more advanced uses such as working with raw [code points](https://en.wikipedia.org/wiki/Code_point), +## see the [roc/unicode](roc/unicode) package, and for locale-specific text +## functions (including capitalization, as capitalization rules vary by locale) +## see the [roc/locale](roc/locale) package. +## +## ### Unicode +## +## Unicode can represent text values which span multiple languages, symbols, and emoji. +## Here are some valid Roc strings: +## +## * "Roc" +## * "鹏" +## * "🐦" +## +## Every Unicode string is a sequence of [grapheme clusters](https://unicode.org/glossary/#grapheme_cluster). +## A grapheme cluster corresponds to what a person reading a string might call +## a "character", but because the term "character" is used to mean many different +## concepts across different programming languages, we intentionally avoid it in Roc. +## Instead, we use the term "clusters" as a shorthand for "grapheme clusters." 
+## +## You can get the number of grapheme clusters in a string by calling `Str.countClusters` on it: +## +## >>> Str.countClusters "Roc" +## +## >>> Str.countClusters "音乐" +## +## >>> Str.countClusters "👍" +## +## > The `countClusters` function traverses the entire string to calculate its answer, +## > so it's much better for performance to use `Str.isEmpty` instead of +## > calling `Str.countClusters` and checking whether the count was `0`. +## +## ### Escape characters +## +## ### String interpolation +## +## ### Encoding +## +## Whenever any Roc string is created, its [encoding](https://en.wikipedia.org/wiki/Character_encoding) +## comes from a configuration option chosen by [the host](guide|hosts). +## Because of this, None of the functions in this module +## make assumptions about the underlying encoding. After all, different hosts +## may choose different encodings! Here are some factors hosts may consider +## when deciding which encoding to choose: +## +## * Linux APIs typically use [UTF-8](https://en.wikipedia.org/wiki/UTF-8) encoding +## * Windows APIs and Apple [Objective-C](https://en.wikipedia.org/wiki/Objective-C) APIs typically use [UTF-16](https://en.wikipedia.org/wiki/UTF-16) encoding +## * Hosts which call [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) functions may choose [MUTF-8](https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8) to disallow a valid UTF-8 character which can prematurely terminate C strings +## +## > Roc strings only support Unicode, so they do not support non-Unicode character +## > encodings like [ASCII](https://en.wikipedia.org/wiki/ASCII). +## +## To write code which behaves differently depending on which encoding the host chose, +## the #Str.codeUnits function will do that. However, if you are doing encoding-specific work, +## you should take a look at the [roc/unicode](roc/unicode) pacakge; +## it has many more tools than this module does. ## -## One #Str can be up to 2 gigabytes in size. 
If you need to store larger -## strings than that, you can split them into smaller chunks and operate -## on those instead of on one large #Str. This often runs faster in practice, -## even for strings much smaller than 2 gigabytes. Str : [ @Str ] ## Convert @@ -59,3 +113,49 @@ padStart : Str, Int, Str -> Str padEnd : Str, Int, Str -> Str +foldClusters : Str, { start: state, step: (state, Str -> state) } -> state + +## Returns #True if the string begins with a capital letter, and #False otherwise. +## +## >>> Str.isCapitalized "hi" +## +## >>> Str.isCapitalized "Hi" +## +## >>> Str.isCapitalized " Hi" +## +## >>> Str.isCapitalized "Česká" +## +## >>> Str.isCapitalized "Э" +## +## >>> Str.isCapitalized "東京" +## +## >>> Str.isCapitalized "🐦" +## +## >>> Str.isCapitalized "" +## +## Since the rules for how to capitalize an uncapitalized string vary by locale, +## see the [roc/locale](roc/locale) package for functions which do that. +isCapitalized : Str -> Bool + + +## Deconstruct the string into raw code unit integers. (Note that code units +## are not the same as code points; to work with code points, see [roc/unicode](roc/unicode)). +## +## This returns a different tag depending on the string encoding chosen by the host. +## +## The size of an individual code unit depends on the encoding. For example, +## in UTF-8 and MUTF-8, a code unit is 8 bits, so those encodings +## are returned as `List U8`. In contrast, UTF-16 encoding uses 16-bit code units, +## so the `Utf16` tag holds a `List U16` instead. +## +## > Code units are no substitute for grapheme clusters! +## > +## > For example, `Str.countGraphemes "👍"` always returns `1` no matter what, +## > whereas `Str.codeUnits "👍"` could give you back a `List U8` with a length +## > of 4, or a `List U16` with a length of 2, neither of which is equal to +## > the correct number of grapheme clusters in that string. 
+## > +## > This function exists for more advanced use cases like those found in +## > [roc/unicode](roc/unicode), and using code points when grapheme clusters would +## > be more appropriate can very easily lead to bugs. +codeUnits : Str -> [ Utf8 (List U8), Mutf8 (List U8), Ucs2 (List U16), Utf16 (List U16), Utf32 (List U32) ] From aa3030ab85fbb940cce5eb506e484e175a9f1c2e Mon Sep 17 00:00:00 2001 From: Richard Feldman Date: Mon, 16 Mar 2020 01:51:12 -0400 Subject: [PATCH 2/6] Revise Str docs --- compiler/builtins/docs/Str.roc | 59 ++++++++++++++++------------------ 1 file changed, 27 insertions(+), 32 deletions(-) diff --git a/compiler/builtins/docs/Str.roc b/compiler/builtins/docs/Str.roc index 39664350be..db8ff78c1e 100644 --- a/compiler/builtins/docs/Str.roc +++ b/compiler/builtins/docs/Str.roc @@ -43,25 +43,16 @@ api Str provides Str, isEmpty, join ## ## ### Encoding ## -## Whenever any Roc string is created, its [encoding](https://en.wikipedia.org/wiki/Character_encoding) -## comes from a configuration option chosen by [the host](guide|hosts). -## Because of this, None of the functions in this module -## make assumptions about the underlying encoding. After all, different hosts -## may choose different encodings! Here are some factors hosts may consider -## when deciding which encoding to choose: -## -## * Linux APIs typically use [UTF-8](https://en.wikipedia.org/wiki/UTF-8) encoding -## * Windows APIs and Apple [Objective-C](https://en.wikipedia.org/wiki/Objective-C) APIs typically use [UTF-16](https://en.wikipedia.org/wiki/UTF-16) encoding -## * Hosts which call [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) functions may choose [MUTF-8](https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8) to disallow a valid UTF-8 character which can prematurely terminate C strings -## -## > Roc strings only support Unicode, so they do not support non-Unicode character -## > encodings like [ASCII](https://en.wikipedia.org/wiki/ASCII). 
-## -## To write code which behaves differently depending on which encoding the host chose, -## the #Str.codeUnits function will do that. However, if you are doing encoding-specific work, -## you should take a look at the [roc/unicode](roc/unicode) pacakge; -## it has many more tools than this module does. +## Roc strings are not coupled to any particular +## [encoding](https://en.wikipedia.org/wiki/Character_encoding). As it happens, +## they are currently encoded in UTF-8, but this module is intentionally designed +## not to rely on that implementation detail so that a future release of Roc can +## potentially change it without breaking existing Roc applications. ## +## This module has functions to can convert a #Str to a #List of raw code unit integers +## in a particular encoding, but if you are doing encoding-specific work, +## you should take a look at the [roc/unicode](roc/unicode) pacakge. +## It has many more tools than this module does! Str : [ @Str ] ## Convert @@ -137,25 +128,29 @@ foldClusters : Str, { start: state, step: (state, Str -> state) } -> state ## see the [roc/locale](roc/locale) package for functions which do that. isCapitalized : Str -> Bool - -## Deconstruct the string into raw code unit integers. (Note that code units -## are not the same as code points; to work with code points, see [roc/unicode](roc/unicode)). +## ## Code Units ## -## This returns a different tag depending on the string encoding chosen by the host. +## Besides grapheme clusters, another way to break down strings is into +## raw code unit integers. ## -## The size of an individual code unit depends on the encoding. For example, -## in UTF-8 and MUTF-8, a code unit is 8 bits, so those encodings -## are returned as `List U8`. In contrast, UTF-16 encoding uses 16-bit code units, -## so the `Utf16` tag holds a `List U16` instead. +## The size of a code unit depends on the string's encoding. For example, in a +## string encoded in UTF-8, a code unit is 8 bits. 
This is why #Str.toUtf8 +## returns a `List U8`. In contrast, UTF-16 encoding uses 16-bit code units, +## so #Str.toUtf16 returns a `List U16` instead. ## ## > Code units are no substitute for grapheme clusters! ## > ## > For example, `Str.countGraphemes "👍"` always returns `1` no matter what, -## > whereas `Str.codeUnits "👍"` could give you back a `List U8` with a length -## > of 4, or a `List U16` with a length of 2, neither of which is equal to -## > the correct number of grapheme clusters in that string. +## > whereas `Str.toUtf8 "👍"` returns a list with a length of 4, +## > and `Str.toUtf16 "👍"` returns a list with a length of 2. ## > -## > This function exists for more advanced use cases like those found in -## > [roc/unicode](roc/unicode), and using code points when grapheme clusters would +## > These functions exists for more advanced use cases like those found in +## > [roc/unicode](roc/unicode), and using code units when grapheme clusters would ## > be more appropriate can very easily lead to bugs. -codeUnits : Str -> [ Utf8 (List U8), Mutf8 (List U8), Ucs2 (List U16), Utf16 (List U16), Utf32 (List U32) ] + +toUtf8 : Str -> List U8 + +toUtf16 : Str -> List U16 + +toUtf32 : Str -> List U16 + From 1bee949ad077f3dd9852feeceb4bd082dcacccf4 Mon Sep 17 00:00:00 2001 From: Richard Feldman Date: Mon, 16 Mar 2020 02:06:12 -0400 Subject: [PATCH 3/6] Fix some Str docs --- compiler/builtins/docs/Str.roc | 24 +++++++++--------------- 1 file changed, 9 insertions(+), 15 deletions(-) diff --git a/compiler/builtins/docs/Str.roc b/compiler/builtins/docs/Str.roc index db8ff78c1e..092cdc053a 100644 --- a/compiler/builtins/docs/Str.roc +++ b/compiler/builtins/docs/Str.roc @@ -5,7 +5,7 @@ api Str provides Str, isEmpty, join ## A [Unicode](https://unicode.org) text value. ## ## Dealing with text is deep topic, so by design, Roc's `Str` module sticks -## to the basics. 
For more advanced uses such as working with raw [code points](https://en.wikipedia.org/wiki/Code_point), +## to the basics. For more advanced use cases like working with raw [code points](https://en.wikipedia.org/wiki/Code_point), ## see the [roc/unicode](roc/unicode) package, and for locale-specific text ## functions (including capitalization, as capitalization rules vary by locale) ## see the [roc/locale](roc/locale) package. @@ -133,24 +133,18 @@ isCapitalized : Str -> Bool ## Besides grapheme clusters, another way to break down strings is into ## raw code unit integers. ## -## The size of a code unit depends on the string's encoding. For example, in a -## string encoded in UTF-8, a code unit is 8 bits. This is why #Str.toUtf8 -## returns a `List U8`. In contrast, UTF-16 encoding uses 16-bit code units, -## so #Str.toUtf16 returns a `List U16` instead. +## Code units are no substitute for grapheme clusters! +## These functions exist to support advanced use cases like those found in +## [roc/unicode](roc/unicode), and using code units when grapheme clusters would +## be more appropriate can very easily lead to bugs. ## -## > Code units are no substitute for grapheme clusters! -## > -## > For example, `Str.countGraphemes "👍"` always returns `1` no matter what, -## > whereas `Str.toUtf8 "👍"` returns a list with a length of 4, -## > and `Str.toUtf16 "👍"` returns a list with a length of 2. -## > -## > These functions exists for more advanced use cases like those found in -## > [roc/unicode](roc/unicode), and using code units when grapheme clusters would -## > be more appropriate can very easily lead to bugs. +## For example, `Str.countGraphemes "👍"` returns `1`, +## whereas `Str.toUtf8 "👍"` returns a list with a length of 4, +## and `Str.toUtf16 "👍"` returns a list with a length of 2. 
toUtf8 : Str -> List U8 toUtf16 : Str -> List U16 -toUtf32 : Str -> List U16 +toUtf32 : Str -> List U32 From 0ed8f90f110e200dda21068ff48b9a9a44896b8c Mon Sep 17 00:00:00 2001 From: Richard Feldman Date: Mon, 16 Mar 2020 02:25:31 -0400 Subject: [PATCH 4/6] Fix some type signatures in Str docs --- compiler/builtins/docs/Str.roc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/compiler/builtins/docs/Str.roc b/compiler/builtins/docs/Str.roc index 092cdc053a..452f828991 100644 --- a/compiler/builtins/docs/Str.roc +++ b/compiler/builtins/docs/Str.roc @@ -66,10 +66,10 @@ Str : [ @Str ] ## but it's recommended to pass much smaller numbers instead. ## ## Passing a negative number for decimal places is equivalent to passing 0. -decimal : Int, Float -> Str +decimal : Float *, Int * -> Str ## Convert an #Int to a string. -int : Float -> Str +int : Int * -> Str ## Check From 3fa75dc2f7b073aaa64fc36ea3c1b51cf7f1be58 Mon Sep 17 00:00:00 2001 From: Richard Feldman Date: Mon, 16 Mar 2020 02:26:03 -0400 Subject: [PATCH 5/6] Add Str.reverseClusters to docs --- compiler/builtins/docs/Str.roc | 1 + 1 file changed, 1 insertion(+) diff --git a/compiler/builtins/docs/Str.roc b/compiler/builtins/docs/Str.roc index 452f828991..5fab75eba7 100644 --- a/compiler/builtins/docs/Str.roc +++ b/compiler/builtins/docs/Str.roc @@ -103,6 +103,7 @@ padStart : Str, Int, Str -> Str padEnd : Str, Int, Str -> Str +reverseClusters : Str -> Str foldClusters : Str, { start: state, step: (state, Str -> state) } -> state From 6637bfb226a05b439ec204dcb591515858178e09 Mon Sep 17 00:00:00 2001 From: Richard Feldman Date: Mon, 16 Mar 2020 02:39:49 -0400 Subject: [PATCH 6/6] Add some more Str docs --- compiler/builtins/docs/Str.roc | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+) diff --git a/compiler/builtins/docs/Str.roc b/compiler/builtins/docs/Str.roc index 5fab75eba7..bfb757b542 100644 --- a/compiler/builtins/docs/Str.roc +++ b/compiler/builtins/docs/Str.roc @@ -71,6 +71,18 @@ 
decimal : Float *, Int * -> Str ## Convert an #Int to a string. int : Int * -> Str +## Split a string around a separator. +## +## >>> Str.splitClusters "1,2,3" "," +## +## Passing `""` for the separator is not useful; it returns the original string +## wrapped in a list. +## +## >>> Str.splitClusters "1,2,3" "" +## +## To split a string into its grapheme clusters, use #Str.clusters +splitClusters : Str, Str -> List Str + ## Check isEmpty : Str -> Bool @@ -103,6 +115,16 @@ padStart : Str, Int, Str -> Str padEnd : Str, Int, Str -> Str +## Grapheme Clusters + +## Split a string into its grapheme clusters. +## +## >>> Str.clusters "1,2,3" +## +## >>> Str.clusters "👍👍👍" +## +clusters : Str -> List Str + reverseClusters : Str -> Str foldClusters : Str, { start: state, step: (state, Str -> state) } -> state
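
The code unit counts quoted in the Code Units docs in this series (a length of 4 from `Str.toUtf8` and 2 from `Str.toUtf16` for a single thumbs-up emoji) can be checked outside Roc. The following is a minimal sketch in Python, not Roc: the helper names `utf8_units`, `utf16_units`, and `utf32_units` are hypothetical and only mirror the element counts one would expect from `toUtf8`, `toUtf16`, and `toUtf32`, assuming those functions return one list element per code unit.

```python
# Independent check of code unit counts per Unicode encoding form.
# Assumption: Roc's toUtf8 / toUtf16 / toUtf32 yield one list element
# per code unit; the helper names below are hypothetical.

def utf8_units(text: str) -> int:
    # UTF-8 code units are 8 bits, so one byte per unit.
    return len(text.encode("utf-8"))

def utf16_units(text: str) -> int:
    # "utf-16-le" omits the byte order mark; each code unit is 2 bytes.
    return len(text.encode("utf-16-le")) // 2

def utf32_units(text: str) -> int:
    # "utf-32-le" omits the byte order mark; each code unit is 4 bytes.
    return len(text.encode("utf-32-le")) // 4

thumbs_up = "\U0001F44D"  # the thumbs-up emoji, one code point
counts = (utf8_units(thumbs_up), utf16_units(thumbs_up), utf32_units(thumbs_up))
print(counts)  # (4, 2, 1)
```

None of the three counts is a grapheme cluster count in general. The UTF-32 figure happens to be 1 here only because this emoji is a single code point, which is why the docs steer readers toward the cluster functions for anything user-facing.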