Title: A Byte-Pair-Encoding (BPE) Tokenizer for OpenAI's Large Language Models
Version: 0.0.7
Description: A thin wrapper around the tiktoken-rs crate that encodes text into Byte-Pair-Encoding (BPE) tokens and decodes tokens back to text. This is useful for understanding how Large Language Models (LLMs) perceive text.
License: MIT + file LICENSE
URL: https://davzim.github.io/rtiktoken/, https://github.com/DavZim/rtiktoken/
BugReports: https://github.com/DavZim/rtiktoken/issues
Suggests: testthat (≥ 3.0.0)
SystemRequirements: Cargo (Rust's package manager), rustc >= 1.65.0
Encoding: UTF-8
RoxygenNote: 7.3.2
Config/rextendr/version: 0.3.1.9001
Config/testthat/edition: 3
Config/rtiktoken/MSRV: 1.65.0
Depends: R (≥ 4.2)
NeedsCompilation: yes
Packaged: 2025-04-14 20:20:47 UTC; david
Author: David Zimmermann-Kollenda [aut, cre], Roger Zurawicki [aut] (tiktoken-rs Rust library), Authors of the dependent Rust crates [aut] (see AUTHORS file)
Maintainer: David Zimmermann-Kollenda <david_j_zimmermann@hotmail.com>
Repository: CRAN
Date/Publication: 2025-04-14 22:50:02 UTC
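A minimal round-trip sketch of the package's workflow, using the functions documented below (exact token values depend on the tokenizer and are omitted here):

library(rtiktoken)

text <- "Hello World"
tokens <- get_tokens(text, "gpt-4o")  # encode text into integer BPE tokens
decode_tokens(tokens, "gpt-4o")       # decode back; should return "Hello World"
get_token_count(text, "gpt-4o")       # token count, equal to length(tokens)
model_to_tokenizer("gpt-4o")          # tokenizer backing the model ("o200k_base")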
decode_tokens: Decodes tokens back to text
Description
Decodes tokens back to text
Usage
decode_tokens(tokens, model)
Arguments
tokens   a vector of tokens to decode, or a list of token vectors
model    a model to use for tokenization, either a model name, e.g., "gpt-4o", or a tokenizer name, e.g., "o200k_base"
Value
a character string of the decoded tokens, or a vector of strings if tokens is a list
See Also
model_to_tokenizer(), get_tokens()
Examples
tokens <- get_tokens("Hello World", "gpt-4o")
tokens
decode_tokens(tokens, "gpt-4o")
tokens <- get_tokens(c("Hello World", "Alice Bob Charlie"), "gpt-4o")
tokens
decode_tokens(tokens, "gpt-4o")
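As a sketch (not an official package example), decoding the tokens of a text should reproduce the input exactly, since BPE tokenization is lossless:

text <- c("Hello World", "Alice Bob Charlie")
tokens <- get_tokens(text, "gpt-4o")              # a list of token vectors
identical(decode_tokens(tokens, "gpt-4o"), text)  # expected to be TRUE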
get_token_count: Returns the number of tokens in a text
Description
Returns the number of tokens in a text
Usage
get_token_count(text, model)
Arguments
text     a character string to encode to tokens, can be a vector
model    a model to use for tokenization, either a model name, e.g., "gpt-4o", or a tokenizer name, e.g., "o200k_base"
Value
the number of tokens in the text, as a vector of integers
See Also
model_to_tokenizer(), get_tokens()
Examples
get_token_count("Hello World", "gpt-4o")
get_tokens: Converts text to tokens
Description
Converts text to tokens
Usage
get_tokens(text, model)
Arguments
text     a character string to encode to tokens, can be a vector
model    a model to use for tokenization, either a model name, e.g., "gpt-4o", or a tokenizer name, e.g., "o200k_base"
Value
a vector of integer tokens for the given text
See Also
model_to_tokenizer(), decode_tokens()
Examples
get_tokens("Hello World", "gpt-4o")
get_tokens("Hello World", "o200k_base")
model_to_tokenizer: Gets the name of the tokenizer used by a model
Description
Gets the name of the tokenizer used by a model
Usage
model_to_tokenizer(model)
Arguments
model    the model to use, e.g., "gpt-4o"
Value
the name of the tokenizer used by the model
Examples
model_to_tokenizer("gpt-4o")
model_to_tokenizer("gpt-4-1106-preview")
model_to_tokenizer("text-davinci-002")
model_to_tokenizer("text-embedding-ada-002")
model_to_tokenizer("text-embedding-3-small")