Help for package seekr

Title:

Extract Matching Lines from Matching Files

Version:

0.1.3

Description:

Provides a simple interface to recursively list files from a directory, filter them using a regular expression, read their contents, and extract lines that match a user-defined pattern. The package returns a data frame containing the matched lines, their line numbers, file paths, and the corresponding matched substrings. Designed for quick code base exploration, log inspection, or any use case involving pattern-based file and line filtering.

License:

MIT + file LICENSE

Encoding:

UTF-8

RoxygenNote:

7.3.2

Suggests:

testthat (≥ 3.0.0), withr

Config/testthat/edition:

Imports:

checkmate, cli, fs, lifecycle, purrr, readr, stringr, tibble, tidyr

URL:

https://github.com/smartiing/seekr, https://smartiing.github.io/seekr/

BugReports:

https://github.com/smartiing/seekr/issues

NeedsCompilation:

Packaged:

2025-05-09 17:32:24 UTC; smarting

Author:

Sacha Martingay [aut, cre, cph]

Maintainer:

Sacha Martingay <martingay.sacha@hotmail.com>

Repository:

CRAN

Date/Publication:

2025-05-10 23:00:02 UTC

seekr: Extract Matching Lines from Matching Files

Description

Author(s)

Maintainer: Sacha Martingay <martingay.sacha@hotmail.com> [copyright holder]

Assert Flag or Scalar Integerish

Description

Assertion function for check_flag_or_scalar_integerish(). Will throw an error if the input is invalid.

Usage

assert_flag_or_scalar_integerish(
  x,
  .var.name = checkmate::vname(x),
  add = NULL
)

Arguments

x

The object to check.

Check Flag or Scalar Integerish

Description

This function validates whether the input is either a logical flag (TRUE/FALSE) or a scalar integer-like value (e.g., 1 or 2L).

Usage

check_flag_or_scalar_integerish(x)

Arguments

x

The object to check.

Value

TRUE if the input is a valid flag or scalar integerish value; otherwise, a string containing the error message.
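
The return convention described above (TRUE on success, a message string on failure) can be illustrated with a minimal base R sketch. The name check_flag_or_int_sketch() and the wording of the message are hypothetical; the package's own implementation may rely on checkmate and behave differently in edge cases.

```r
# Hypothetical re-implementation for illustration only; not the package's code.
check_flag_or_int_sketch <- function(x) {
  is_scalar <- length(x) == 1L && !is.na(x)
  is_flag <- is.logical(x) && is_scalar
  # "Integerish": a single finite number with no fractional part (1, 2L, 3.0).
  is_int <- is.numeric(x) && is_scalar && is.finite(x) && x == trunc(x)
  if (is_flag || is_int) {
    TRUE
  } else {
    "Must be a flag (TRUE/FALSE) or a scalar integerish value"
  }
}

check_flag_or_int_sketch(TRUE)   # TRUE
check_flag_or_int_sketch(2L)     # TRUE
check_flag_or_int_sketch(1.5)    # returns the message string
```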

Extract Lowercase File Extensions

Description

Extracts the file extensions from the provided file paths, normalizes them to lowercase, and returns them as a character vector. The extension includes the leading period (.).

Usage

extract_lower_file_extension(files)

Arguments

files

A character vector of files to search (only for seek_in()).

Value

A character vector of lowercase file extensions.
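
A minimal sketch of the behaviour described above, using tools::file_ext() from base R. The helper name lower_ext_sketch() is hypothetical, and returning an empty string for files without an extension is an assumption; the package's own function may handle that case differently.

```r
# Hypothetical helper; the package's own implementation may differ.
lower_ext_sketch <- function(files) {
  ext <- tools::file_ext(files)   # "R", "PNG", "" (no leading period)
  # Prepend the period and lowercase; assumed: "" when there is no extension.
  ifelse(nzchar(ext), paste0(".", tolower(ext)), "")
}

lower_ext_sketch(c("src/Main.R", "img/logo.PNG", "Makefile"))
# ".r" ".png" ""
```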

Filter Files by Pattern and Content Type

Description

Filters a character vector of file paths using a user-defined pattern and additional content-based criteria to ensure only likely text files are retained.

This function applies multiple filters:

A regex-based path filter (if provided).
Exclusion of files located within .git folders.
Exclusion of files with known binary or non-text extensions.
A fallback scan for embedded null bytes to detect binary content in ambiguous files.

The function returns a filtered character vector of file paths likely to be valid text files.

Usage

filter_files(files, filter, negate, n = 1000L)

Arguments

files

A character vector of files to search (only for seek_in()).

filter

Optional. A regular expression pattern used to filter file paths before reading. If NULL, all text files are considered.

negate

Logical. If TRUE, files matching the filter pattern are excluded instead of included. Useful to skip files based on name or extension.

n

The number of bytes to read for binary detection in files with unknown extensions. Defaults to 1000.

Value

A character vector of file paths identified as potential text files. If no matching files are found, an informative error is thrown.
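
The interaction between filter and negate can be sketched in base R as follows. Only the regex-based path filter is shown; the helper name filter_paths_sketch() is hypothetical, and grepl(perl = TRUE) stands in for whatever regex engine the package actually uses.

```r
# Hypothetical helper illustrating only the regex-based path filter.
filter_paths_sketch <- function(files, filter = NULL, negate = FALSE) {
  if (is.null(filter)) return(files)   # no pattern: keep every path
  keep <- grepl(filter, files, perl = TRUE)
  if (negate) keep <- !keep            # negate = TRUE excludes the matches
  files[keep]
}

files <- c("R/seek.R", "tests/test-seek.R", "README.md")
filter_paths_sketch(files, "\\.R$")                   # keeps the two .R files
filter_paths_sketch(files, "^tests/", negate = TRUE)  # drops the tests/ file
```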

Identify Files with Known Non-Text Extensions

Description

Checks whether the provided file paths have extensions typically associated with binary or non-text formats (e.g., images, archives, executables).

Usage

has_known_nontext_extension(files)

Arguments

files

A character vector of files to search (only for seek_in()).

Value

A logical vector indicating whether each file has a known non-text extension.

Identify Files with Known Text Extensions

Description

Checks whether the provided file paths have extensions commonly associated with text-based formats (e.g., scripts, markdown, configuration files).

Usage

has_known_text_extension(files)

Arguments

files

A character vector of files to search (only for seek_in()).

Value

A logical vector indicating whether each file has a known text extension.

Detect Null Bytes in a File

Description

Reads the first n bytes of a file and checks whether any null bytes (0x00) are present, which is commonly used to detect binary files.

If the file cannot be read (e.g., corrupted or permission issues), the function safely assumes the file is binary and returns TRUE.

Usage

has_null_bytes(file, n = 1000L)

Arguments

file

A character string representing a single file path.

n

The number of bytes to read for binary detection in files with unknown extensions. Defaults to 1000.

Value

TRUE if a null byte is found or if an error occurs. FALSE otherwise.
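
The null-byte check can be sketched with readBin() from base R. The helper name has_null_bytes_sketch() is hypothetical; treating both warnings and errors as "binary" is an assumption based on the description above.

```r
# Hypothetical re-implementation for illustration only.
has_null_bytes_sketch <- function(file, n = 1000L) {
  tryCatch(
    {
      bytes <- readBin(file, what = "raw", n = n)  # read at most n bytes
      any(bytes == as.raw(0))
    },
    # Unreadable file: assume it is binary, as the description states.
    warning = function(w) TRUE,
    error = function(e) TRUE
  )
}

txt <- tempfile(fileext = ".txt")
writeLines("plain text", txt)
bin <- tempfile(fileext = ".dat")
writeBin(as.raw(c(0x41, 0x00, 0x42)), bin)

has_null_bytes_sketch(txt)         # FALSE
has_null_bytes_sketch(bin)         # TRUE
has_null_bytes_sketch(tempfile())  # TRUE (file does not exist)
```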

Check if Files Are Located in a `.git` Folder

Description

Identifies whether the provided file paths are located inside a .git directory.

This function assumes that the file paths are normalized beforehand (i.e., using forward slashes / even on Windows systems).

Usage

is_in_gitfolder(files)

Arguments

files

A character vector of files to search (only for seek_in()).

Value

A logical vector indicating whether each file is located within a .git folder.
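
A minimal sketch of such a check on forward-slash paths. The helper name in_git_sketch() and the exact regular expression are assumptions, not the package's own code.

```r
# Hypothetical helper; assumes paths already use forward slashes.
in_git_sketch <- function(files) {
  # Match a ".git" path component, at the start of the path or after a "/".
  grepl("(^|/)\\.git/", files)
}

in_git_sketch(c("repo/.git/config", ".git/HEAD", "repo/R/seek.R", "repo/.github/ci.yml"))
# TRUE TRUE FALSE FALSE
```

Requiring the trailing slash keeps look-alike folders such as .github from being excluded.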

List Files in Directory

Description

Lists all files from a given directory with support for recursive search and inclusion of hidden files. The function throws a specific error when no files are found, based on the combination of recurse and all parameters. Returned file paths are made unique and are assumed to be normalized using forward slashes (/).

Usage

list_files(path, recurse, all)

Arguments

path

A character vector of one or more paths.

recurse

If TRUE, recurse fully; if a positive number, the number of levels to recurse.

all

If TRUE, hidden files are also returned.

Value

A character vector of unique file paths. If no files are found, the function aborts with a message suggesting how to adjust search parameters (recurse and all), and includes a class-specific error identifier depending on the search mode:

"error_list_files_TT" for recurse = TRUE, all = TRUE
"error_list_files_TF" for recurse = TRUE, all = FALSE
"error_list_files_FT" for recurse = FALSE, all = TRUE
"error_list_files_FF" for recurse = FALSE, all = FALSE

Examples

## Not run: 
list_files("myfolder", recurse = TRUE, all = FALSE)

## End(Not run)
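
The class-specific error identifiers make it possible to handle one "no files found" case without catching every error. The sketch below uses a hypothetical stand-in, list_files_stub(), that signals an error carrying one of the classes listed in the Value section; it does not call the package itself.

```r
# Hypothetical stand-in that signals a classed error like those listed in Value.
list_files_stub <- function() {
  cond <- structure(
    class = c("error_list_files_TF", "error", "condition"),
    list(message = "No files found.", call = NULL)
  )
  stop(cond)
}

result <- tryCatch(
  list_files_stub(),
  # Handle only the recurse = TRUE, all = FALSE case; other errors propagate.
  error_list_files_TF = function(e) character(0)
)
result
```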

Prepare Tidy Data Frame from Matched Lines

Description

Constructs a tidy data frame from matched lines across a set of files. This function takes the output of read_filter_lines() and returns one row per match, including file path, line number, full line content, and regex match(es).

Usage

prepare_df(files, pattern, lines, path, relative_path, matches)

Arguments

files

A character vector of files to search (only for seek_in()).

pattern

A regular expression pattern used to match lines.

lines

A list with line_number and line, as returned by read_filter_lines().

path

A character vector of one or more directories where files should be discovered (only for seek()).

relative_path

Logical. If TRUE, file paths are made relative to the path argument. If multiple root paths are provided, relative_path is automatically ignored and absolute paths are kept to avoid ambiguity.

matches

Logical. If TRUE, all matches per line are also returned in a matches list-column.

Details

All steps are executed sequentially to transform file-based pattern matches into a structured tabular format. The function assumes that input files and their corresponding line data are correctly aligned. It handles path normalization, match extraction, and output column selection according to the matches and relative_path arguments.

Value

A tibble with the following columns:

path: File path (relative if specified), marked with class fs_path.
line_number: Line number of the match within the file.
match: The first matched substring from the line.
matches (optional): All matched substrings as a list-column.
line: Full content of the matching line.
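
The difference between the match and matches columns can be sketched with base R regex functions. The input lines, line numbers, and pattern below are made up for illustration, and a plain data.frame stands in for the tibble the package returns.

```r
lines <- c("x <- mean(a) + mean(b)", "y <- sum(a)")
pattern <- "[a-z]+(?=\\()"   # lowercase names directly followed by "("

# All matches per line, as a list with one character vector per line.
all_matches <- regmatches(lines, gregexpr(pattern, lines, perl = TRUE))

out <- data.frame(line_number = c(3L, 7L), line = lines, stringsAsFactors = FALSE)
out$match <- vapply(all_matches, `[`, character(1), 1L)  # first match per line
out$matches <- all_matches                               # list-column
out$match
# "mean" "sum"
```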

Should CLI Output Be Printed?

Description

Determines whether CLI progress or messaging functions should be executed. This helper evaluates the seekr.verbose option, checks for an interactive session, and disables output during testthat tests.

Usage

print_cli()

Details

This function is designed to control conditional CLI output (e.g., cli::cli_progress_step()). It returns TRUE only when:

getOption("seekr.verbose", TRUE) is TRUE
the session is interactive (interactive())
testthat is not running (!testthat::is_testing())

Value

A logical scalar: TRUE if CLI output should be shown, FALSE otherwise.
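
A minimal sketch of the three conditions, together with the option used to silence output. The helper name should_print_sketch() is hypothetical, and checking the TESTTHAT environment variable is an assumption that stands in for testthat::is_testing().

```r
# Hypothetical re-implementation for illustration only.
should_print_sketch <- function() {
  verbose <- isTRUE(getOption("seekr.verbose", TRUE))
  # Assumption: testthat sets the TESTTHAT environment variable to "true".
  testing <- identical(Sys.getenv("TESTTHAT"), "true")
  verbose && interactive() && !testing
}

old <- options(seekr.verbose = FALSE)  # silence progress output
should_print_sketch()                  # FALSE
options(old)                           # restore the previous setting
```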

Read and Filter Matching Lines in Text Files

Description

Reads lines from a set of text files and returns only the lines that match a specified regular expression pattern. The function processes each file one-by-one to maintain memory efficiency, making it suitable for reading large files. Files that cannot be read (due to warnings or errors) are skipped with a warning.

If verbosity is enabled via seekr.verbose = TRUE and the session is interactive, the function reports progress.

Usage

read_filter_lines(files, pattern, ...)

Arguments

files

A character vector of files to search (only for seek_in()).

pattern

A regular expression pattern used to match lines.

...

Additional arguments passed to readr::read_lines(), such as skip, n_max, or locale.

Details

Files are processed sequentially to minimize memory usage, especially when working with large files. Only the lines matching the pattern are retained for each file.

If a file raises a warning or an error during reading, it is silently skipped and contributes an empty entry to the result lists.

Value

A list with two elements:

line_number: A list of integer vectors giving the line numbers of matching lines, one per file.
line: A list of character vectors containing the matched lines, one per file.
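
The shape of this return value can be sketched in base R. The helper name read_filter_sketch() is hypothetical, and readLines() stands in for readr::read_lines(); unreadable files contribute empty entries, as described in Details.

```r
# Hypothetical re-implementation for illustration only.
read_filter_sketch <- function(files, pattern) {
  line_number <- vector("list", length(files))
  line <- vector("list", length(files))
  for (i in seq_along(files)) {   # one file at a time
    content <- tryCatch(
      readLines(files[i], warn = FALSE),
      warning = function(w) NULL,
      error = function(e) NULL
    )
    if (is.null(content)) content <- character(0)  # unreadable: empty entry
    hits <- grep(pattern, content, perl = TRUE)
    line_number[[i]] <- hits
    line[[i]] <- content[hits]    # keep only the matching lines
  }
  list(line_number = line_number, line = line)
}

f <- tempfile(fileext = ".R")
writeLines(c("a <- 1", "# TODO: fix", "b <- 2"), f)
res <- read_filter_sketch(c(f, tempfile()), "TODO")
res$line_number[[1]]   # 2
res$line[[2]]          # character(0), the second file does not exist
```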

Extract Matching Lines from Files

Description

These functions search through one or more text files, extract lines matching a regular expression pattern, and return a tibble containing the results.

seek(): Discovers files inside one or more directories (recursively or not), applies optional file name and text file filtering, and searches lines.
seek_in(): Searches inside a user-provided character vector of files.

Usage

seek(
  pattern,
  path = ".",
  ...,
  filter = NULL,
  negate = FALSE,
  recurse = FALSE,
  all = FALSE,
  relative_path = TRUE,
  matches = FALSE
)

seek_in(files, pattern, ..., matches = FALSE)

Arguments

pattern

A regular expression pattern used to match lines.

path

A character vector of one or more directories where files should be discovered (only for seek()).

...

Additional arguments passed to readr::read_lines(), such as skip, n_max, or locale.

filter

Optional. A regular expression pattern used to filter file paths before reading. If NULL, all text files are considered.

negate

Logical. If TRUE, files matching the filter pattern are excluded instead of included. Useful to skip files based on name or extension.

recurse

If TRUE, recurse fully; if a positive number, the number of levels to recurse.

all

If TRUE, hidden files are also returned.

relative_path

Logical. If TRUE, file paths are made relative to the path argument. If multiple root paths are provided, relative_path is automatically ignored and absolute paths are kept to avoid ambiguity.

matches

Logical. If TRUE, all matches per line are also returned in a matches list-column.

files

A character vector of files to search (only for seek_in()).

Details

The overall process involves the following steps:

File Selection
- seek(): Files are discovered using fs::dir_ls(), starting from one or more directories.
- seek_in(): Files are directly supplied by the user (no discovery phase).
File Filtering
- Files located inside .git/ folders are automatically excluded.
- Files with known non-text extensions (e.g., .png, .exe, .rds) are excluded.
- If a file's extension is unknown, a check is performed to detect embedded null bytes (binary indicator).
- Optionally, an additional regex-based path filter (filter) can be applied.
Line Reading
- Files are read line-by-line using readr::read_lines().
- Only lines matching the provided regular expression pattern are retained.
- If a file cannot be read, it is skipped gracefully without failing the process.
Data Frame Construction
- A tibble is constructed with one row per matched line.

These functions are particularly useful for analyzing source code, configuration files, logs, and other structured text data.
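
The four steps above can be sketched end to end in base R. This is an illustration under stated assumptions, not the package's implementation: seek_sketch() is a hypothetical name, list.files() and readLines() stand in for fs::dir_ls() and readr::read_lines(), and the .git, extension, and null-byte filters are omitted.

```r
# Hypothetical end-to-end sketch; omits the text-file detection filters.
seek_sketch <- function(pattern, path = ".", filter = NULL) {
  # 1. File selection
  files <- list.files(path, recursive = TRUE, full.names = TRUE)
  # 2. File filtering (only the optional regex path filter is sketched)
  if (!is.null(filter)) files <- files[grepl(filter, files, perl = TRUE)]
  # 3. Line reading and 4. data frame construction, one file at a time
  rows <- lapply(files, function(f) {
    content <- tryCatch(readLines(f, warn = FALSE), error = function(e) character(0))
    hits <- grep(pattern, content, perl = TRUE)
    data.frame(path = rep(f, length(hits)), line_number = hits,
               line = content[hits], stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)   # NULL when no files were selected
}

d <- tempfile()
dir.create(d)
writeLines(c("x <- 1", "# TODO later"), file.path(d, "a.R"))
writeLines("TODO in notes", file.path(d, "notes.txt"))
out <- seek_sketch("TODO", d, filter = "\\.R$")
out$line   # "# TODO later"
```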

Value

A tibble with one row per matched line, containing:

path: File path (relative or absolute).
line_number: Line number in the file.
match: The first matched substring.
matches: All matched substrings (if matches = TRUE).
line: Full content of the matching line.

Examples

path = system.file("extdata", package = "seekr")

# Search all function definitions in R files
seek("[^\\s]+(?= (=|<-) function\\()", path, filter = "\\.R$")

# Search for "TODO" comments in source code, case-insensitively
seek("(?i)TODO", path, filter = "\\.R$")

# Search for "error" in log files, case-insensitively
seek("(?i)error", path, filter = "\\.log$")

# Search for config keys in YAML
seek("database:", path, filter = "\\.ya?ml$")

# Looking for "length" in all types of text files
seek("(?i)length", path)

# Search for specific CSV headers using seek_in() and reading only the first line
csv_files <- list.files(path, "\\.csv$", full.names = TRUE)
seek_in(csv_files, "(?i)specie", n_max = 1)

Read and Prepare Matching Lines

Description

Reads a set of files, filters lines based on a regular expression pattern, and constructs a tidy tibble of the results.

Usage

seek_lines(files, pattern, ..., path, relative_path, matches)

Arguments

files

A character vector of files to search (only for seek_in()).

pattern

A regular expression pattern used to match lines.

...

Additional arguments passed to readr::read_lines(), such as skip, n_max, or locale.

path

A character vector of one or more directories where files should be discovered (only for seek()).

relative_path

Logical. If TRUE, file paths are made relative to the path argument. If multiple root paths are provided, relative_path is automatically ignored and absolute paths are kept to avoid ambiguity.

matches

Logical. If TRUE, all matches per line are also returned in a matches list-column.

Value

A tibble with one row per matching line.
