Title: | Retrieval-Augmented Generation (RAG) Workflows |
Version: | 0.2.0 |
Description: | Provides tools for implementing Retrieval-Augmented Generation (RAG) workflows with Large Language Models (LLM). Includes functions for document processing, text chunking, embedding generation, storage management, and content retrieval. Supports various document types and embedding providers ('Ollama', 'OpenAI'), with 'DuckDB' as the default storage backend. Integrates with the 'ellmer' package to equip chat objects with retrieval capabilities. Designed to offer both sensible defaults and customization options with transparent access to intermediate outputs. For a review of retrieval-augmented generation methods, see Gao et al. (2023) "Retrieval-Augmented Generation for Large Language Models: A Survey" <doi:10.48550/arXiv.2312.10997>. |
License: | MIT + file LICENSE |
URL: | https://ragnar.tidyverse.org/, https://github.com/tidyverse/ragnar |
BugReports: | https://github.com/tidyverse/ragnar/issues |
Depends: | R (≥ 4.3.0) |
Imports: | blob, cli, commonmark, curl, DBI, dotty, dplyr, duckdb (≥ 1.2.2), glue, httr2, methods, reticulate (≥ 1.42.0), rlang (≥ 1.1.0), rvest, S7, stringi, tibble, tidyr, vctrs, withr, xml2 |
Suggests: | dbplyr, ellmer (≥ 0.2.0), lifecycle, knitr, pandoc, paws.common, rmarkdown, shiny, stringr, testthat (≥ 3.0.0), connectcreds, gargle |
VignetteBuilder: | knitr |
Config/Needs/website: | tidyverse/tidytemplate, rmarkdown |
Config/testthat/edition: | 3 |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.2 |
NeedsCompilation: | yes |
Packaged: | 2025-07-12 15:29:26 UTC; tomasz |
Author: | Tomasz Kalinowski [aut, cre],
Daniel Falbel [aut],
Posit Software, PBC [cph, fnd] |
Maintainer: | Tomasz Kalinowski <tomasz@posit.co> |
Repository: | CRAN |
Date/Publication: | 2025-07-12 21:00:02 UTC |
ragnar: Retrieval-Augmented Generation (RAG) Workflows
Description
Provides tools for implementing Retrieval-Augmented Generation (RAG) workflows with Large Language Models (LLM). Includes functions for document processing, text chunking, embedding generation, storage management, and content retrieval. Supports various document types and embedding providers ('Ollama', 'OpenAI'), with 'DuckDB' as the default storage backend. Integrates with the 'ellmer' package to equip chat objects with retrieval capabilities. Designed to offer both sensible defaults and customization options with transparent access to intermediate outputs. For a review of retrieval-augmented generation methods, see Gao et al. (2023) "Retrieval-Augmented Generation for Large Language Models: A Survey" doi:10.48550/arXiv.2312.10997.
Author(s)
Maintainer: Tomasz Kalinowski tomasz@posit.co
Authors:
Daniel Falbel daniel@posit.co
Other contributors:
Posit Software, PBC (ROR: 03wc8by49) [copyright holder, funder]
See Also
Useful links:
https://ragnar.tidyverse.org/
https://github.com/tidyverse/ragnar
Report bugs at https://github.com/tidyverse/ragnar/issues
Markdown documents
Description
MarkdownDocument represents a complete Markdown document stored as a single character string. The constructor normalizes text by collapsing lines and ensuring UTF-8 encoding, so downstream code can rely on a consistent format.
read_as_markdown() is the recommended way to create a MarkdownDocument. The constructor itself is exported only so advanced users can construct one by other means when needed.
Arguments
text |
[string] Markdown text. |
origin |
[string] Optional source path or URL. |
Value
An S7 object that inherits from MarkdownDocument, which is a length-1 string of markdown text with an @origin property.
Examples
md <- MarkdownDocument(
"# Title\n\nSome text.",
origin = "example.md"
)
md
Markdown document chunks
Description
MarkdownDocumentChunks stores information about candidate chunks in a Markdown document. It is a tibble with three required columns:
- start, end — integers. These are character positions (1-based, inclusive) in the source MarkdownDocument, so that substring(md, start, end) yields the chunk text. Ranges can overlap.
- context — character. A general-purpose field for adding context to a chunk. This column is combined with text to augment chunk content when generating embeddings with ragnar_store_insert(), and is also returned by ragnar_retrieve(). Keep in mind that when chunks are de-overlapped (in ragnar_retrieve() or chunks_deoverlap()), only the context value from the first chunk is kept. markdown_chunk() by default populates this column with all the markdown headings that are in scope at the chunk start position.
Additional columns can be included.
The original document is available via the @document property.
For normal use, chunk a Markdown document with markdown_chunk(); the class constructor itself is exported only so advanced users can generate or tweak chunks by other means.
Arguments
chunks |
A data frame containing start, end, and optionally context and text columns. |
document |
A MarkdownDocument. |
Value
An S7 object that inherits from MarkdownDocumentChunks, which is also a tibble.
Examples
doc_text <- "# A\n\nB\n\n## C\n\nD"
doc <- MarkdownDocument(doc_text, origin = "some/where")
chunk_positions <- tibble::tibble(
start = c(1L, 9L),
end = c(8L, 15L),
context = c("", "# A"),
text = substring(doc, start, end)
)
chunks <- MarkdownDocumentChunks(chunk_positions, doc)
identical(chunks@document, doc)
Merge overlapping chunks in retrieved results
Description
Groups and merges overlapping text chunks from the same origin in the retrieval results.
Usage
chunks_deoverlap(store, chunks)
Arguments
store |
A RagnarStore object. |
chunks |
A tibble of retrieved chunks, such as the output of ragnar_retrieve(deoverlap = FALSE). |
Details
When multiple retrieved chunks from the same origin have overlapping character ranges, this function combines them into a single non-overlapping region.
Value
A tibble of de-overlapped chunks.
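Examples
# A minimal sketch: retrieve overlapping chunks, then merge them. This mirrors
# the version 2 store from the ragnar_store_create() examples.
store <- ragnar_store_create(embed = NULL)
doc <- MarkdownDocument(paste0(letters, collapse = ""), origin = "/some/where")
chunks <- markdown_chunk(doc, target_size = 3, target_overlap = 2 / 3)
ragnar_store_insert(store, chunks)
ragnar_store_build_index(store)
retrieved <- ragnar_retrieve(store, "abc bcd", deoverlap = FALSE)
chunks_deoverlap(store, retrieved)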
Embed text using a Bedrock model
Description
Embed text using a Bedrock model
Usage
embed_bedrock(x, model, profile, api_args = list())
Arguments
x |
x can be:
- a character vector of texts to embed
- a data frame with a text column
- missing or NULL, in which case a function is returned that can be called to get embeddings later |
model |
Currently only Cohere.ai and Amazon Titan models are supported. There are no guardrails for the kind of model that is used, but the model must be available in the AWS region specified by the profile. You may look for available models in the Bedrock Model Catalog. |
profile |
AWS profile to use. |
api_args |
Additional arguments to pass to the Bedrock API. Depending on the model, different arguments may be supported. |
Value
If x is missing, returns a function that can be called to get embeddings.
If x is not missing, returns a matrix of embeddings with 1 row per input string, or a data frame with an 'embedding' column.
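Examples
## Not run:
# A sketch, assuming AWS credentials are configured for the given profile and
# that the model id is available in that profile's region.
embed_bedrock(
  c("a chunk of text", "another chunk of text"),
  model = "amazon.titan-embed-text-v2:0",
  profile = "default"
)
## End(Not run)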
Embed text using a Databricks model
Description
embed_databricks()
gets embeddings for text using a model hosted in a
Databricks workspace. It relies on the ellmer package for managing
Databricks credentials. See ellmer::chat_databricks
for more on
supported modes of authentication.
Usage
embed_databricks(
x,
workspace = databricks_workspace(),
model = "databricks-bge-large-en",
batch_size = 512L
)
Arguments
x |
x can be:
- a character vector of texts to embed
- a data frame with a text column
- missing or NULL, in which case a function is returned that can be called to get embeddings later |
workspace |
The URL of a Databricks workspace, e.g. "https://example.cloud.databricks.com". |
model |
The name of a text embedding model. |
batch_size |
Integer. x is split into batches of at most batch_size strings per API request. |
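Examples
## Not run:
# A sketch, assuming Databricks credentials are configured as described in
# ellmer::chat_databricks, and that the workspace URL is your own.
embed_databricks(
  c("a chunk of text", "another chunk of text"),
  workspace = "https://example.cloud.databricks.com"
)
## End(Not run)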
Embed using Google Vertex API platform
Description
Embed using Google Vertex API platform
Usage
embed_google_vertex(
x,
model,
location,
project_id,
task_type = "RETRIEVAL_QUERY"
)
Arguments
x |
x can be:
- a character vector of texts to embed
- a data frame with a text column
- missing or NULL, in which case a function is returned that can be called to get embeddings later |
model |
Character specifying the embedding model. See supported models in Text embeddings API |
location |
Location, e.g. "us-central1". |
project_id |
Project ID. |
task_type |
Used to convey the intended downstream application to help the model produce better embeddings. If left blank, the default used is "RETRIEVAL_QUERY". |
Examples
## Not run:
embed_google_vertex(
  "hello world",
  model = "gemini-embedding-001",
  project_id = "<your-project-id>",
  location = "us-central1"
)
## End(Not run)
Embed Text
Description
Embed Text
Usage
embed_ollama(
x,
base_url = "http://localhost:11434",
model = "snowflake-arctic-embed2:568m",
batch_size = 10L
)
embed_openai(
x,
model = "text-embedding-3-small",
base_url = "https://api.openai.com/v1",
api_key = get_envvar("OPENAI_API_KEY"),
dims = NULL,
user = get_user(),
batch_size = 20L
)
Arguments
x |
x can be:
- a character vector of texts to embed
- a data frame with a text column
- missing or NULL, in which case a function is returned that can be called to get embeddings later |
base_url |
string, url where the service is available. |
model |
string; model name |
batch_size |
Integer. x is split into batches of at most batch_size strings per API request. |
api_key |
String. Resolved by default using the OPENAI_API_KEY environment variable. |
dims |
An integer, can be used to truncate the embedding to a specific size. |
user |
User name passed via the API. |
Value
If x is a character vector, then a numeric matrix is returned, where nrow = length(x) and ncol = <model-embedding-size>. If x is a data.frame, then a new embedding matrix "column" is added, containing the matrix described in the previous sentence.
Examples
text <- c("a chunk of text", "another chunk of text", "one more chunk of text")
## Not run:
text |>
embed_ollama() |>
str()
text |>
embed_openai() |>
str()
## End(Not run)
Chunk a Markdown document
Description
markdown_chunk() splits a single Markdown string into shorter, optionally overlapping chunks, nudging cut points to the nearest sensible boundary (heading, paragraph, sentence, line, word, or character). It returns a tibble recording the character ranges, heading context, and text for each chunk.
Usage
markdown_chunk(
md,
target_size = 1600L,
target_overlap = 0.5,
...,
max_snap_dist = target_size * (1 - target_overlap)/3,
segment_by_heading_levels = integer(),
context = TRUE,
text = TRUE
)
Arguments
md |
A MarkdownDocument, or a string of Markdown text. |
target_size |
Integer. Target chunk size in characters. Default: 1600. Use NA to disable chunking by size. |
target_overlap |
Numeric in [0, 1). Target fraction of overlap between consecutive chunks. Default: 0.5. |
... |
These dots are for future extensions and must be empty. |
max_snap_dist |
Integer. Furthest distance (in characters) a cut point may move to reach a semantic boundary. Defaults to one third of the stride size between target chunk starts. Chunks that end up on identical boundaries are merged. |
segment_by_heading_levels |
Integer vector with possible values 1:6. The document is first split at headings of these levels, and chunks never span those segment boundaries. Default: integer(), i.e., no heading-based segmenting. |
context |
Logical. Add a context column containing the markdown headings that are in scope at each chunk's start position. Default: TRUE. |
text |
Logical. If TRUE (the default), include a text column containing the chunk text. |
Value
A MarkdownDocumentChunks object, which is a tibble (data.frame) with columns start, end, and optionally context and text. It also has a @document property, which is the input md document (potentially normalized and converted to a MarkdownDocument).
See Also
ragnar_chunks_view()
to interactively inspect the output of
markdown_chunk()
. See also MarkdownDocumentChunks()
and
MarkdownDocument()
, where the input and return value of
markdown_chunk()
are described more fully.
Examples
md <- "
# Title
## Section 1
Some text that is long enough to be chunked.
A second paragraph to make the text even longer.
## Section 2
More text here.
### Section 2.1
Some text under a level three heading.
#### Section 2.1.1
Some text under a level four heading.
## Section 3
Even more text here.
"
markdown_chunk(md, target_size = 40)
markdown_chunk(md, target_size = 40, target_overlap = 0)
markdown_chunk(md, target_size = NA, segment_by_heading_levels = c(1, 2))
markdown_chunk(md, target_size = 40, max_snap_dist = 100)
Segment markdown text
Description
Segment markdown text
Usage
markdown_segment(
text,
tags = c("h1", "h2", "h3", "h4"),
trim = FALSE,
omit_empty = FALSE
)
markdown_frame(text, frame_by = c("h1", "h2", "h3"), segment_by = NULL)
Arguments
text |
Markdown string |
tags , segment_by |
A character vector of html tag names, e.g., c("h1", "h2", "h3", "pre"). |
trim |
logical, trim whitespace on segments |
omit_empty |
logical, whether to remove empty segments |
frame_by |
Character vector of tags that will become columns in the returned dataframe. |
Value
A named character vector. Names will correspond to tags
, or ""
for content in between tags.
Examples
md <- r"---(
# Sample Markdown File
## Introduction
This is a sample **Markdown** file for testing.
### Features
- Simple **bold** text
- _Italicized_ text
- `Inline code`
- A [link](https://example.com)
- ‘Curly quotes are 3 bytes chars.’ Non-ascii text is fine.
This is a paragraph with <p> tag.
This next segment with code has a <pre> tag
```r
hello_world <- function() {
cat("Hello, World!\n")
}
```
A table <table>:
| Name | Age | City |
|-------|----:|-----------|
| Alice | 25 | New York |
| Bob | 30 | London |
## Conclusion
Common tags:
- h1, h2, h3, h4, h5, h6: section headings
- p: paragraph (prose)
- pre: pre-formatted text, meant to be displayed with monospace font.
Typically code or code output
- blockquote: A blockquote
- table: A table
- ul: Unordered list
- ol: Ordered list
- li: Individual list item in a <ul> or <ol>
)---"
markdown_segment(md) |> tibble::enframe()
markdown_segment(md |> trimws()) |> tibble::enframe()
markdown_segment(md, c("li"), trim = TRUE, omit_empty = TRUE) |> tibble::enframe()
markdown_segment(md, c("table"), trim = TRUE, omit_empty = TRUE) |> tibble::enframe()
markdown_segment(md, c("ul"), trim = TRUE, omit_empty = TRUE) |> tibble::enframe()
Chunk text
Description
These functions are deprecated in favor of markdown_chunk()
, which is more
flexible, supports overlapping chunks, enables deoverlapping or rechunking
downstream by ragnar_retrieve()
, and automatically builds a context
string of in-scope markdown headings for each chunk instead of requiring
manual string interpolation from extracted headings.
Usage
ragnar_chunk(
x,
max_size = 1600L,
boundaries = c("paragraph", "sentence", "line_break", "word", "character"),
...,
trim = TRUE,
simplify = TRUE
)
ragnar_segment(x, boundaries = "sentence", ..., trim = FALSE, simplify = TRUE)
ragnar_chunk_segments(x, max_size = 1600L, ..., simplify = TRUE, trim = TRUE)
Arguments
x |
A character vector, list of character vectors, or data frame containing a text column. |
max_size |
Integer. The maximum number of characters in each chunk. Defaults to 1600L. |
boundaries |
A sequence of boundary types to use in order until chunks fit within max_size. |
... |
Additional arguments passed to internal functions (e.g., the tokenizer to use). |
trim |
logical, whether to trim leading and trailing whitespace from strings. Default TRUE for ragnar_chunk() and ragnar_chunk_segments(), FALSE for ragnar_segment(). |
simplify |
Logical. If TRUE (the default), simplify the output; see Value. |
Details
Functions for chunking text into smaller pieces while preserving meaningful semantics. These functions provide flexible ways to split text based on various boundaries (sentences, words, etc.) while controlling chunk sizes and overlap.
Chunking is the combination of two fundamental operations:
- identifying boundaries: finding character positions where it makes sense to split a string.
- extracting slices: extracting substrings using the candidate boundaries to produce chunks that match the requested chunk_size and chunk_overlap.
ragnar_chunk() is a higher-level function that does both: it identifies boundaries and extracts slices.
If you need lower-level control, you can instead use ragnar_segment() in combination with ragnar_chunk_segments():
- ragnar_segment(): splits text at semantic boundaries.
- ragnar_chunk_segments(): combines text segments into chunks.
For most use cases, these two are equivalent:
x |> ragnar_chunk()
x |> ragnar_segment() |> ragnar_chunk_segments()
When working with data frames, these functions preserve all columns and use
tidyr::unchop()
to handle the resulting list-columns when simplify = TRUE
.
Value
- For character input with simplify = FALSE: a list of character vectors.
- For character input with simplify = TRUE: a character vector of chunks.
- For data frame input with simplify = FALSE: a data frame with the same number of rows as the input, where the text column is transformed into a list of character vectors.
- For data frame input with simplify = TRUE: same as with simplify = FALSE, but with the text column expanded by tidyr::unchop().
Examples
# Basic chunking with max size
text <- "This is a long piece of text. It has multiple sentences.
We want to split it into chunks. Here's another sentence."
ragnar_chunk(text, max_size = 40) # splits at sentences
# smaller chunk size: first splits at sentence boundaries, then word boundaries
ragnar_chunk(text, max_size = 20)
# only split at sentence boundaries. Note, some chunks are oversized
ragnar_chunk(text, max_size = 20, boundaries = c("sentence"))
# only consider word boundaries when splitting:
ragnar_chunk(text, max_size = 20, boundaries = c("word"))
# first split at sentence boundaries, then word boundaries,
# as needed to satisfy `max_size`
ragnar_chunk(text, max_size = 20, boundaries = c("sentence", "word"))
# Use a stringr pattern to find semantic boundaries
ragnar_chunk(text, max_size = 10, boundaries = stringr::fixed(". "))
ragnar_chunk(text, max_size = 10, boundaries = list(stringr::fixed(". "), "word"))
# Working with data frames
df <- data.frame(
id = 1:2,
text = c("First sentence. Second sentence.", "Another sentence here.")
)
ragnar_chunk(df, max_size = 20, boundaries = "sentence")
ragnar_chunk(df$text, max_size = 20, boundaries = "sentence")
# Chunking pre-segmented text
segments <- c("First segment. ", "Second segment. ", "Third segment. ", "Fourth segment. ")
ragnar_chunk_segments(segments, max_size = 20)
ragnar_chunk_segments(segments, max_size = 40)
ragnar_chunk_segments(segments, max_size = 60)
View chunks with the store inspector
Description
Visualize chunks (e.g., the output of markdown_chunk() or ragnar_read()) for quick inspection. Helpful for checking the results of reading and chunking while iterating on the ingestion pipeline.
Usage
ragnar_chunks_view(chunks)
Arguments
chunks |
A data frame containing a few chunks. |
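Examples
## Not run:
# Opens an interactive viewer; a sketch assuming chunks produced by
# read_as_markdown() and markdown_chunk().
chunks <- markdown_chunk(read_as_markdown("https://r4ds.hadley.nz/base-R.html"))
ragnar_chunks_view(chunks)
## End(Not run)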
Find links on a page
Description
Find links on a page
Usage
ragnar_find_links(
x,
depth = 0L,
children_only = TRUE,
progress = TRUE,
...,
url_filter = identity
)
Arguments
x |
URL, HTML file path, or XML document. For Markdown, convert to HTML first, e.g. with commonmark::markdown_html(). |
depth |
Integer specifying how many levels deep to crawl for links. When depth > 0, linked pages are visited recursively and their links are collected as well. |
children_only |
Logical or string. If TRUE (the default), only return links that point to children of x. If a string, only return links that are children of the given URL prefix. |
progress |
Logical, draw a progress bar if TRUE. |
... |
Currently unused. Must be empty. |
url_filter |
A function that takes a character vector of URLs and may subset them to return a smaller list. This can be useful for filtering out URLs by rules other than children_only. |
Value
A character vector of links on the page.
Examples
## Not run:
ragnar_find_links("https://r4ds.hadley.nz/base-R.html")
ragnar_find_links("https://ellmer.tidyverse.org/")
ragnar_find_links("https://ellmer.tidyverse.org/", depth = 2)
ragnar_find_links("https://ellmer.tidyverse.org/", depth = 2, children_only = FALSE)
ragnar_find_links(
paste0("https://github.com/Snowflake-Labs/sfquickstarts/",
"tree/master/site/sfguides/src/build_a_custom_model_for_anomaly_detection"),
children_only = "https://github.com/Snowflake-Labs/sfquickstarts",
depth = 1
)
## End(Not run)
Read a document as Markdown
Description
This function is deprecated in favor of read_as_markdown()
.
Usage
ragnar_read(x, ..., split_by_tags = NULL, frame_by_tags = NULL)
Arguments
x |
file path or url. |
... |
passed on to read_as_markdown(). |
split_by_tags |
character vector of html tag names used to split the returned text |
frame_by_tags |
character vector of html tag names used to create a dataframe of the returned content |
Details
ragnar_read() uses markitdown to convert a document to markdown. If frame_by_tags or split_by_tags is provided, the converted markdown content is split and converted to a multi-row data frame; otherwise, a single-row data frame is returned with the markdown in the text column.
Value
Always returns a data frame with the columns:
- origin: the file path or url
- hash: a hash of the text content
- text: the markdown content
If split_by_tags
is not NULL
, then a tag
column is also included containing
the corresponding tag for each text chunk. ""
is used for text chunks that
are not associated with a tag.
If frame_by_tags
is not NULL
, then additional columns are included for each
tag in frame_by_tags
. The text chunks are associated with the tags in the
order they appear in the markdown content.
Examples
file <- tempfile(fileext = ".html")
download.file("https://r4ds.hadley.nz/base-R.html", file, quiet = TRUE)
# with no arguments, returns a single row data frame.
# the markdown content is in the `text` column.
file |> ragnar_read() |> str()
# use `split_by_tags` to get a data frame where the text is split by the
# specified tags (e.g., "h1", "h2", "h3")
file |>
ragnar_read(split_by_tags = c("h1", "h2", "h3"))
# use `frame_by_tags` to get a dataframe where the
# headings associated with each text chunk are easily accessible
file |>
ragnar_read(frame_by_tags = c("h1", "h2", "h3"))
# use `split_by_tags` and `frame_by_tags` together to further break up `text`.
file |>
ragnar_read(
split_by_tags = c("p"),
frame_by_tags = c("h1", "h2", "h3")
)
# Example workflow adding context to each chunk
file |>
ragnar_read(frame_by_tags = c("h1", "h2", "h3")) |>
glue::glue_data(r"--(
## Excerpt from the book "R for Data Science (2e)"
chapter: {h1}
section: {h2}
content: {text}
)--") |>
# inspect
_[6:7] |> cat(sep = "\n~~~~~~~~~~~\n")
# Advanced example of postprocessing the output of ragnar_read()
# to add language to code blocks, markdown style
library(dplyr, warn.conflicts = FALSE)
library(stringr)
library(rvest)
library(xml2)
file |>
ragnar_read(frame_by_tags = c("h1", "h2", "h3"),
split_by_tags = c("p", "pre")) |>
mutate(
is_code = tag == "pre",
text = ifelse(is_code, str_replace(text, "```", "```r"), text)
) |>
group_by(h1, h2, h3) |>
summarise(text = str_flatten(text, "\n\n"), .groups = "drop") |>
glue::glue_data(r"--(
# Excerpt from the book "R for Data Science (2e)"
chapter: {h1}
section: {h2}
content: {text}
)--") |>
# inspect
_[9:10] |> cat(sep = "\n~~~~~~~~~~~\n")
Read an HTML document
Description
Read an HTML document
Usage
ragnar_read_document(
x,
...,
split_by_tags = frame_by_tags,
frame_by_tags = NULL
)
Arguments
x |
file path or url, passed on to rvest::read_html(). |
... |
passed on to rvest::read_html(). |
split_by_tags |
character vector of html tag names used to split the returned text |
frame_by_tags |
character vector of html tag names used to create a dataframe of the returned content |
Value
If frame_by_tags
is not NULL
, then a data frame is returned,
with column names c("frame_by_tags", "text")
.
If frame_by_tags
is NULL
but split_by_tags
is not NULL
, then a named
character vector is returned.
If both frame_by_tags
and split_by_tags
are NULL
, then a string
(length-1 character vector) is returned.
Examples
file <- tempfile(fileext = ".html")
download.file("https://r4ds.hadley.nz/base-R.html", file, quiet = TRUE)
# with no arguments, returns a single string of the text.
file |> ragnar_read_document() |> str()
# use `split_by_tags` to get a named character vector of length > 1
file |>
ragnar_read_document(split_by_tags = c("h1", "h2", "h3")) |>
tibble::enframe("tag", "text")
# use `frame_by_tags` to get a dataframe where the
# headings associated with each text chunk are easily accessible
file |>
ragnar_read_document(frame_by_tags = c("h1", "h2", "h3"))
# use `split_by_tags` and `frame_by_tags` together to further break up `text`.
file |>
ragnar_read_document(
split_by_tags = c("p"),
frame_by_tags = c("h1", "h2", "h3")
)
# Example workflow adding context to each chunk
file |>
ragnar_read_document(frame_by_tags = c("h1", "h2", "h3")) |>
glue::glue_data(r"--(
## Excerpt from the book "R for Data Science (2e)"
chapter: {h1}
section: {h2}
content: {text}
)--") |>
# inspect
_[6:7] |> cat(sep = "\n~~~~~~~~~~~\n")
# Advanced example of postprocessing the output of ragnar_read_document()
# to wrap code blocks in backticks, markdown style
library(dplyr, warn.conflicts = FALSE)
library(stringr)
library(rvest)
library(xml2)
file |>
ragnar_read_document(frame_by_tags = c("h1", "h2", "h3"),
split_by_tags = c("p", "pre")) |>
mutate(
is_code = tag == "pre",
text = ifelse(is_code,
str_c("```", text, "```", sep = "\n"),
text)) |>
group_by(h1, h2, h3) |>
summarise(text = str_flatten(text, "\n"), .groups = "drop") |>
glue::glue_data(r"--(
# Excerpt from the book "R for Data Science (2e)"
chapter: {h1}
section: {h2}
content: {text}
)--") |>
# inspect
_[9:10] |> cat(sep = "\n~~~~~~~~~~~\n")
# Example of preprocessing the input to ragnar_read_document()
# to wrap code in backticks, markdown style
# same outcome as above, except via pre processing instead of post processing.
file |>
read_html() |>
(\(doc) {
# fence preformatted code with triple backticks
for (node in html_elements(doc, "pre")) {
xml_add_child(node, "code", "```\n", .where = 0)
xml_add_child(node, "code", "\n```")
}
# wrap inline code with single backticks
for (node in html_elements(doc, "code")) {
if (!"pre" %in% xml_name(xml_parents(node))) {
xml_text(node) <- str_c("`", xml_text(node), "`")
}
}
doc
})() |>
ragnar_read_document(frame_by_tags = c("h1", "h2", "h3")) |>
glue::glue_data(r"--(
# Excerpt from the book "R for Data Science (2e)"
chapter: {h1}
section: {h2}
content: {text}
)--") |> _[6]
Register a 'retrieve' tool with ellmer
Description
Register a 'retrieve' tool with ellmer
Usage
ragnar_register_tool_retrieve(
chat,
store,
store_description = "the knowledge store",
...,
name = NULL,
title = NULL
)
Arguments
chat |
an ellmer Chat object. |
store |
a string of a store location, or a RagnarStore object. |
store_description |
Optional string, used for composing the tool description. |
... |
arguments passed on to ragnar_retrieve(). |
name , title |
Optional tool function name and title. By default, these are derived from the store's name and title. |
Value
chat
, invisibly.
Examples
system_prompt <- stringr::str_squish("
You are an expert assistant in R programming.
When responding, you first quote relevant material from books or documentation,
provide links to the sources, and then add your own context and interpretation.
")
chat <- ellmer::chat_openai(system_prompt, model = "gpt-4o")
store <- ragnar_store_connect("r4ds.ragnar.duckdb")
ragnar_register_tool_retrieve(chat, store)
chat$chat("How can I subset a dataframe?")
Retrieve chunks from a RagnarStore
Description
Combines both vss
and bm25
search and returns the
union of chunks retrieved by both methods.
Usage
ragnar_retrieve(store, text, top_k = 3L, ..., deoverlap = TRUE)
Arguments
store |
A RagnarStore object. |
text |
Character. Query string to match. |
top_k |
Integer. Number of nearest entries to find per method. |
... |
Additional arguments passed to the lower-level retrieval functions. |
deoverlap |
Logical. If TRUE (the default), overlapping chunks from the same origin are merged with chunks_deoverlap(). |
Value
A tibble
of retrieved chunks. Each row
represents a chunk and always contains a text
column.
Note
The results are not re-ranked after identifying the unique values.
See Also
Other ragnar_retrieve:
ragnar_retrieve_bm25()
,
ragnar_retrieve_vss()
,
ragnar_retrieve_vss_and_bm25()
Examples
## Build a small store with categories
store <- ragnar_store_create(
embed = \(x) ragnar::embed_openai(x, model = "text-embedding-3-small"),
extra_cols = data.frame(category = character()),
version = 1 # store text chunks directly
)
ragnar_store_insert(
store,
data.frame(
category = c(rep("pets", 3), rep("dessert", 3)),
text = c("playful puppy", "sleepy kitten", "curious hamster",
"chocolate cake", "strawberry tart", "vanilla ice cream")
)
)
ragnar_store_build_index(store)
# Top 3 chunks without filtering
ragnar_retrieve(store, "sweet")
# Combine filter with similarity search
ragnar_retrieve(store, "sweet", filter = category == "dessert")
Retrieves chunks using the BM25 score
Description
BM25 refers to Okapi Best Matching 25. See doi:10.1561/1500000019 for more information.
Usage
ragnar_retrieve_bm25(
store,
text,
top_k = 3L,
...,
k = 1.2,
b = 0.75,
conjunctive = FALSE,
filter
)
Arguments
store |
A RagnarStore object. |
text |
String, the text to search for. |
top_k |
Integer. Number of nearest entries to find per method. |
... |
Additional arguments passed to the lower-level retrieval functions. |
k , b |
Numeric BM25 parameters: k controls term-frequency saturation and b controls document-length normalization. |
conjunctive |
Whether to make the query conjunctive, i.e., all terms in the query string must be present in order for a chunk to be retrieved. |
filter |
Optional. A filter expression evaluated against the store's columns, used to restrict which chunks can be retrieved. |
See Also
Other ragnar_retrieve:
ragnar_retrieve()
,
ragnar_retrieve_vss()
,
ragnar_retrieve_vss_and_bm25()
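Examples
# A minimal sketch with a dummy embedding; bm25 search does not use
# embeddings. This mirrors the stores built in the other retrieval examples.
store <- ragnar_store_create(
  embed = \(x) matrix(stats::runif(10), nrow = length(x), ncol = 10),
  version = 1
)
ragnar_store_insert(
  store,
  data.frame(text = c("playful puppy", "chocolate cake", "vanilla ice cream"))
)
ragnar_store_build_index(store)
ragnar_retrieve_bm25(store, "chocolate cake", top_k = 2)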
Vector Similarity Search Retrieval
Description
Computes a similarity measure between the query and the document embeddings and uses this similarity to rank and retrieve document chunks.
Usage
ragnar_retrieve_vss(
store,
query,
top_k = 3L,
...,
method = "cosine_distance",
query_vector = store@embed(query),
filter
)
Arguments
store |
A RagnarStore object. |
query |
Character. The query string to embed and use for similarity search. |
top_k |
Integer. Maximum number of document chunks to retrieve. Defaults to 3. |
... |
Additional arguments passed to methods. |
method |
Character. Similarity method to use: one of "cosine_distance" (the default), "euclidean_distance", or "negative_inner_product". |
query_vector |
Numeric vector. The embedding for query. Defaults to store@embed(query). |
filter |
Optional. A filter expression evaluated against the store's columns; it is applied in an outer SQL query after the similarity search (see Details). |
Details
Supported methods:
- cosine_distance – cosine of the angle between two vectors.
- euclidean_distance – L2 distance between vectors.
- negative_inner_product – negative sum of element-wise products.
If filter
is supplied, the function first performs the similarity
search, then applies the filter in an outer SQL query. It uses the HNSW
index when possible and falls back to a sequential scan for large result
sets or filtered queries.
Value
A tibble
with the top_k retrieved chunks,
ordered by metric_value
.
Note
The results are not re-ranked after identifying the unique values.
See Also
Other ragnar_retrieve:
ragnar_retrieve()
,
ragnar_retrieve_bm25()
,
ragnar_retrieve_vss_and_bm25()
Examples
## Build a small store with categories
store <- ragnar_store_create(
embed = \(x) ragnar::embed_openai(x, model = "text-embedding-3-small"),
extra_cols = data.frame(category = character()),
version = 1 # store text chunks directly
)
ragnar_store_insert(
store,
data.frame(
category = c(rep("pets", 3), rep("dessert", 3)),
text = c("playful puppy", "sleepy kitten", "curious hamster",
"chocolate cake", "strawberry tart", "vanilla ice cream")
)
)
ragnar_store_build_index(store)
# Top 3 chunks without filtering
ragnar_retrieve(store, "sweet")
# Combine filter with similarity search
ragnar_retrieve(store, "sweet", filter = category == "dessert")
Retrieve VSS and BM25
Description
Runs ragnar_retrieve_vss() and ragnar_retrieve_bm25() and returns the distinct documents.
Usage
ragnar_retrieve_vss_and_bm25(store, text, top_k = 3, ...)
Arguments
store |
A RagnarStore object. |
text |
Character. Query string to match. |
top_k |
Integer, the number of entries to retrieve per method. |
... |
Forwarded to ragnar_retrieve_vss() and ragnar_retrieve_bm25(). |
Value
A tibble
of retrieved chunks. Each row
represents a chunk and always contains a text
column.
Note
The results are not re-ranked after identifying the unique values.
See Also
Other ragnar_retrieve:
ragnar_retrieve()
,
ragnar_retrieve_bm25()
,
ragnar_retrieve_vss()
Examples
## Build a small store with categories
store <- ragnar_store_create(
embed = \(x) ragnar::embed_openai(x, model = "text-embedding-3-small"),
extra_cols = data.frame(category = character()),
version = 1 # store text chunks directly
)
ragnar_store_insert(
store,
data.frame(
category = c(rep("pets", 3), rep("dessert", 3)),
text = c("playful puppy", "sleepy kitten", "curious hamster",
"chocolate cake", "strawberry tart", "vanilla ice cream")
)
)
ragnar_store_build_index(store)
# Top 3 chunks without filtering
ragnar_retrieve(store, "sweet")
# Combine filter with similarity search
ragnar_retrieve(store, "sweet", filter = category == "dessert")
Build a Ragnar Store index
Description
A search index must be built before calling ragnar_retrieve()
. If
additional entries are added to the store with ragnar_store_insert()
,
ragnar_store_build_index()
must be called again to rebuild the index.
Usage
ragnar_store_build_index(store, type = c("vss", "fts"))
Arguments
store |
a RagnarStore object. |
type |
The retrieval search type to build an index for. |
Value
store
, invisibly.
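Examples
# Insert, then (re)build the index before retrieving; a minimal sketch using
# the dummy embedding from the ragnar_store_create() examples.
store <- ragnar_store_create(
  embed = \(x) matrix(stats::runif(10), nrow = length(x), ncol = 10),
  version = 1
)
ragnar_store_insert(store, data.frame(text = c("hello", "world")))
ragnar_store_build_index(store)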
Create and connect to a vector store
Description
Create and connect to a vector store
Usage
ragnar_store_create(
location = ":memory:",
embed = embed_ollama(),
...,
embedding_size = ncol(embed("foo")),
overwrite = FALSE,
extra_cols = NULL,
name = NULL,
title = NULL,
version = 2
)
ragnar_store_connect(location, ..., read_only = TRUE)
Arguments
location |
filepath, or ":memory:" (the default) for an in-memory store. |
embed |
A function that is called with a character vector and returns a matrix of embeddings. Note this function will be serialized and then deserialized in new R sessions, so it cannot reference any objects in the global or parent environments. Make sure to namespace all function calls with ::, e.g., ragnar::embed_openai(). |
... |
unused; must be empty. |
embedding_size |
integer, the number of columns in the matrix returned by embed. Defaults to ncol(embed("foo")). |
overwrite |
logical, what to do if location already exists: TRUE overwrites it, FALSE (the default) throws an error. |
extra_cols |
A zero row data frame used to specify additional columns
that should be added to the store. Such columns can be used for adding
additional context when retrieving. See the examples for more information.
|
name |
A unique name for the store; used by ragnar_register_tool_retrieve() to name the tool. |
title |
A title for the store, used by ragnar_register_tool_retrieve() when composing the tool description. |
version |
integer. The version of the store to create. See details. |
read_only |
logical, whether the returned connection can be used to modify the store. |
Details
Store versions
Version 2 – documents with chunk ranges (default)
With version = 2
, ragnar stores each document once and records the start
and end positions of its chunks. This provides strong support for overlapping
chunk ranges with de-overlapping at retrieval, and generally allows
retrieving arbitrary ranges from source documents, but does not support
modifying chunks directly before insertion. Chunks can be augmented via the
context
field and with additional fields passed to extra_cols
. The
easiest way to prepare chunks
for version = 2
is with
read_as_markdown()
and markdown_chunk()
.
Version 1 – flat chunks
With version = 1
, ragnar keeps all chunks in a single table. This lets you
easily modify chunk text before insertion. However, dynamic rechunking
(de-overlapping) or extracting arbitrary ranges from source documents is not
supported, since the original full documents are no longer available. Chunks
can be augmented by modifying the chunk text directly (e.g., with glue()
).
Additionally, if you intend to call ragnar_store_update()
, it is your
responsibility to provide rlang::hash(original_full_document)
with each
chunk. The easiest way to prepare chunks
for version = 1
is with
ragnar_read()
and ragnar_chunk()
.
Value
a RagnarStore
object
Examples
# A store with a dummy embedding
store <- ragnar_store_create(
embed = \(x) matrix(stats::runif(10), nrow = length(x), ncol = 10),
version = 1
)
ragnar_store_insert(store, data.frame(text = "hello"))
# A store with a schema. When inserting into this store, users need to
# provide an `area` column.
store <- ragnar_store_create(
embed = \(x) matrix(stats::runif(10), nrow = length(x), ncol = 10),
extra_cols = data.frame(area = character()),
version = 1
)
ragnar_store_insert(store, data.frame(text = "hello", area = "rag"))
# If you already have a data.frame with chunks that will be inserted into
# the store, you can quickly create a suitable store with `vec_ptype()`:
chunks <- data.frame(text = letters, area = "rag")
store <- ragnar_store_create(
embed = \(x) matrix(stats::runif(10), nrow = length(x), ncol = 10),
extra_cols = vctrs::vec_ptype(chunks),
version = 1
)
ragnar_store_insert(store, chunks)
# version = 2 (the default) has support for deoverlapping
store <- ragnar_store_create(
# if embed = NULL, then only bm25 search is used (not vss)
embed = NULL
)
doc <- MarkdownDocument(
paste0(letters, collapse = ""),
origin = "/some/where"
)
chunks <- markdown_chunk(doc, target_size = 3, target_overlap = 2 / 3)
chunks$context <- substring(chunks$text, 1, 1)
chunks
ragnar_store_insert(store, chunks)
ragnar_store_build_index(store)
ragnar_retrieve(store, "abc bcd xyz", deoverlap = FALSE)
ragnar_retrieve(store, "abc bcd xyz", deoverlap = TRUE)
Insert chunks into a RagnarStore
Description
Insert chunks into a RagnarStore
Usage
ragnar_store_insert(store, chunks)
Arguments
store |
a RagnarStore object. |
chunks |
a character vector or a data frame with a text column (version 1 stores), or a MarkdownDocumentChunks object (version 2 stores). |
Value
store
, invisibly.
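Examples
# A minimal sketch with a dummy embedding (see also the ragnar_store_create()
# examples).
store <- ragnar_store_create(
  embed = \(x) matrix(stats::runif(10), nrow = length(x), ncol = 10),
  version = 1
)
ragnar_store_insert(store, data.frame(text = c("first chunk", "second chunk")))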
Launches the Ragnar Inspector Tool
Description
Launches the Ragnar Inspector Tool
Usage
ragnar_store_inspect(store, ...)
Arguments
store |
A RagnarStore object. |
... |
Passed on to the underlying Shiny app. |
Value
NULL
invisibly
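Examples
## Not run:
# Launches a Shiny app for browsing and searching the store's chunks.
# Assumes "my_store.duckdb" is an existing store on disk.
store <- ragnar_store_connect("my_store.duckdb")
ragnar_store_inspect(store)
## End(Not run)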
Inserts or updates chunks in a RagnarStore
Description
Inserts or updates chunks in a RagnarStore
Usage
ragnar_store_update(store, chunks)
Arguments
store |
a RagnarStore object. |
chunks |
Content to update. The precise input structure depends on the store version; see Details. |
Details
Store Version 2
chunks must be a MarkdownDocumentChunks object.
Store Version 1
chunks
must be a data frame containing origin
, hash
, and text
columns. We first filter out chunks for which origin
and hash
are already
in the store. If an origin
is in the store, but with a different hash
, we
replace all of its chunks with the new chunks. Otherwise, a regular insert is
performed.
This can help avoid needing to compute embeddings for chunks that are already in the store.
Value
store
, invisibly.
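Examples
# A minimal sketch for a version 1 store: each chunk carries the hash of its
# original full document, so unchanged documents are not re-embedded.
store <- ragnar_store_create(
  embed = \(x) matrix(stats::runif(10), nrow = length(x), ncol = 10),
  version = 1
)
doc_text <- "hello world"
chunks <- data.frame(
  origin = "greeting.txt",
  hash = rlang::hash(doc_text),
  text = doc_text
)
ragnar_store_update(store, chunks)
# A second call with the same origin and hash skips re-embedding
ragnar_store_update(store, chunks)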
Convert files to Markdown
Description
Convert files to Markdown
Usage
read_as_markdown(
path,
...,
html_extract_selectors = c("main"),
html_zap_selectors = c("nav")
)
Arguments
path |
[string] A filepath or URL. Accepts a wide variety of file types, including PDF, PowerPoint, Word, Excel, images (EXIF metadata and OCR), audio (EXIF metadata and speech transcription), HTML, text-based formats (CSV, JSON, XML), ZIP files (iterates over contents), YouTube URLs, and EPUBs. |
... |
Passed on to the underlying markitdown converter (e.g., llm_client and llm_model, as used in the Examples). |
html_extract_selectors |
Character vector of CSS selectors. If a match for a selector is found in the document, only the matched node's contents are converted. Unmatched extract selectors have no effect. |
html_zap_selectors |
Character vector of CSS selectors. Elements
matching these selectors will be excluded ("zapped") from the HTML document
before conversion to markdown. This is useful for removing navigation bars,
sidebars, headers, footers, or other unwanted elements. By default,
navigation elements ( |
Details
Converting HTML
When converting HTML, you might want to omit certain elements, like sidebars, headers, footers, etc. You can pass CSS selector strings to either extract nodes or exclude nodes during conversion.
The easiest way to make selectors is to use SelectorGadget: https://rvest.tidyverse.org/articles/selectorgadget.html
You can also right-click on a page and select "Inspect Element" in a browser to better understand an HTML page's structure.
For comprehensive or advanced usage of CSS selectors, consult https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors-through-the-css-property and https://facelessuser.github.io/soupsieve/selectors/
Value
A MarkdownDocument
object, which is a single string of Markdown
with an @origin
property.
Examples
## Not run:
# Convert HTML
md <- read_as_markdown("https://r4ds.hadley.nz/base-R.html")
md
cat_head <- \(md, n = 10) writeLines(head(strsplit(md, "\n")[[1L]], n))
cat_head(md)
## Using selector strings
# By default, this output includes the sidebar and other navigational elements
url <- "https://duckdb.org/code_of_conduct"
read_as_markdown(url) |> cat_head(15)
# To extract just the main content, use a selector
read_as_markdown(url, html_extract_selectors = "#main_content_wrap") |>
cat_head()
# Alternative approach: zap unwanted nodes
read_as_markdown(
url,
html_zap_selectors = c(
"header", # name
".sidenavigation", # class
".searchoverlay", # class
"#sidebar" # ID
)
) |> cat_head()
# Quarto example
read_as_markdown(
"https://quarto.org/docs/computations/python.html",
html_extract_selectors = "main",
html_zap_selectors = c(
"#quarto-sidebar",
"#quarto-margin-sidebar",
"header",
"footer",
"nav"
)
) |> cat_head()
## Convert PDF
pdf <- file.path(R.home("doc"), "NEWS.pdf")
read_as_markdown(pdf) |> cat_head(15)
## Alternative:
# pdftools::pdf_text(pdf) |> cat_head()
# Convert images to markdown descriptions using OpenAI
jpg <- file.path(R.home("doc"), "html", "logo.jpg")
if (Sys.getenv("OPENAI_API_KEY") != "") {
# if (xfun::is_macos()) system("brew install ffmpeg")
reticulate::py_require("openai")
llm_client <- reticulate::import("openai")$OpenAI()
read_as_markdown(jpg, llm_client = llm_client, llm_model = "gpt-4.1-mini") |>
writeLines()
# # Description:
# The image displays the logo of the R programming language. It features a
# large, stylized capital letter "R" in blue, positioned prominently in the
# center. Surrounding the "R" is a gray oval shape that is open on the right
# side, creating a dynamic and modern appearance. The R logo is commonly
# associated with statistical computing, data analysis, and graphical
# representation in various scientific and professional fields.
}
# Alternative approach to image conversion:
if (
Sys.getenv("OPENAI_API_KEY") != "" &&
rlang::is_installed("ellmer") &&
rlang::is_installed("magick")
) {
chat <- ellmer::chat_openai(echo = TRUE)
chat$chat("Describe this image", ellmer::content_image_file(jpg))
}
## End(Not run)
Objects exported from other packages
Description
These objects are imported from other packages. Follow the links below to see their documentation.
- dotty: . (the dot assignment helper)