Title: | Extract Matching Lines from Matching Files |
Version: | 0.1.3 |
Description: | Provides a simple interface to recursively list files from a directory, filter them using a regular expression, read their contents, and extract lines that match a user-defined pattern. The package returns a dataframe containing the matched lines, their line numbers, file paths, and the corresponding matched substrings. Designed for quick code base exploration, log inspection, or any use case involving pattern-based file and line filtering. |
License: | MIT + file LICENSE |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.2 |
Suggests: | testthat (≥ 3.0.0), withr |
Config/testthat/edition: | 3 |
Imports: | checkmate, cli, fs, lifecycle, purrr, readr, stringr, tibble, tidyr |
URL: | https://github.com/smartiing/seekr, https://smartiing.github.io/seekr/ |
BugReports: | https://github.com/smartiing/seekr/issues |
NeedsCompilation: | no |
Packaged: | 2025-05-09 17:32:24 UTC; smarting |
Author: | Sacha Martingay [aut, cre, cph] |
Maintainer: | Sacha Martingay <martingay.sacha@hotmail.com> |
Repository: | CRAN |
Date/Publication: | 2025-05-10 23:00:02 UTC |
seekr: Extract Matching Lines from Matching Files
Description
Provides a simple interface to recursively list files from a directory, filter them using a regular expression, read their contents, and extract lines that match a user-defined pattern. The package returns a dataframe containing the matched lines, their line numbers, file paths, and the corresponding matched substrings. Designed for quick code base exploration, log inspection, or any use case involving pattern-based file and line filtering.
Author(s)
Maintainer: Sacha Martingay martingay.sacha@hotmail.com [copyright holder]
See Also
Useful links:
- https://github.com/smartiing/seekr
- https://smartiing.github.io/seekr/
- Report bugs at https://github.com/smartiing/seekr/issues
Assert Flag or Scalar Integerish
Description
Assertion function for check_flag_or_scalar_integerish(). Will throw an error if the input is invalid.
Usage
assert_flag_or_scalar_integerish(
x,
.var.name = checkmate::vname(x),
add = NULL
)
Arguments
x |
The object to check. |
Check Flag or Scalar Integerish
Description
This function validates whether the input is either a logical flag (TRUE/FALSE) or a scalar integer-like value (e.g., 1, 2L, etc.).
Usage
check_flag_or_scalar_integerish(x)
Arguments
x |
The object to check. |
Value
TRUE if the input is a valid flag or scalar integerish, otherwise an error message string.
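The "TRUE or an error message string" return convention can be illustrated with a minimal base R sketch. The function name check_flag_or_scalar_integerish_sketch() and the wording of the message are hypothetical; the package's own implementation may differ (for instance, it may delegate to checkmate).

```r
# Hypothetical sketch: returns TRUE for a valid input, otherwise a message string.
check_flag_or_scalar_integerish_sketch <- function(x) {
  # A scalar here means: exactly one element, and not missing.
  is_scalar <- length(x) == 1L && !is.na(x)
  is_flag <- is.logical(x) && is_scalar
  # "Integerish" means a finite number with no fractional part (1, 2L, 3.0, ...).
  is_integerish <- is.numeric(x) && is_scalar && is.finite(x) && x == round(x)
  if (is_flag || is_integerish) {
    TRUE
  } else {
    "Must be a flag (TRUE/FALSE) or a scalar integerish value"
  }
}

check_flag_or_scalar_integerish_sketch(TRUE)
check_flag_or_scalar_integerish_sketch(2L)
check_flag_or_scalar_integerish_sketch(1.5)
```

An assertion wrapper would typically call such a check and raise an error whenever the result is not TRUE.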
Extract Lowercase File Extensions
Description
Extracts the file extensions from the provided file paths, normalizes them to lowercase, and returns them as a character vector. The extension includes the leading period (.).
Usage
extract_lower_file_extension(files)
Arguments
files |
A character vector of files to search (only for |
Value
A character vector of lowercase file extensions.
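As a rough illustration of the behavior described above, the following base R sketch lowercases extensions and keeps the leading period. The name extract_lower_ext_sketch() is hypothetical, and returning an empty string for files without an extension is an assumption, not documented behavior.

```r
# Hypothetical sketch built on tools::file_ext(), which returns the extension
# without the leading period ("" when there is none).
extract_lower_ext_sketch <- function(files) {
  ext <- tools::file_ext(files)
  # Re-attach the period only where an extension exists (assumed behavior).
  tolower(ifelse(nzchar(ext), paste0(".", ext), ""))
}

extract_lower_ext_sketch(c("script.R", "archive.tar.gz", "README"))
```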
Filter Files by Pattern and Content Type
Description
Filters a character vector of file paths using a user-defined pattern and additional content-based criteria to ensure only likely text files are retained.
This function applies multiple filters:
- A regex-based path filter (if provided).
- Exclusion of files located within .git folders.
- Exclusion of files with known binary or non-text extensions.
- A fallback scan for embedded null bytes to detect binary content in ambiguous files.
The function returns a filtered character vector of file paths likely to be valid text files.
Usage
filter_files(files, filter, negate, n = 1000L)
Arguments
files |
A character vector of files to search (only for |
filter |
Optional. A regular expression pattern used to filter file paths
before reading. If |
negate |
Logical. If |
n |
The number of bytes to read for binary detection in files with unknown extensions. Defaults to 1000. |
Value
A character vector of file paths identified as potential text files. If no matching files are found, an informative error is thrown.
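The regex-based path filter can be sketched in base R as follows. The name filter_paths_sketch() is hypothetical, and two behaviors are assumptions rather than documented facts: that a NULL filter keeps every file, and that negate = TRUE keeps the paths that do not match.

```r
# Hypothetical sketch of the path-filtering step only (no content checks).
filter_paths_sketch <- function(files, filter = NULL, negate = FALSE) {
  # Assumption: no filter means every path is kept.
  if (is.null(filter)) {
    return(files)
  }
  keep <- grepl(filter, files, perl = TRUE)
  # Assumption: negate inverts the selection.
  if (negate) {
    keep <- !keep
  }
  files[keep]
}

files <- c("R/a.R", "R/b.R", "README.md")
filter_paths_sketch(files, filter = "\\.R$")
filter_paths_sketch(files, filter = "\\.R$", negate = TRUE)
```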
Identify Files with Known Non-Text Extensions
Description
Checks whether the provided file paths have extensions typically associated with binary or non-text formats (e.g., images, archives, executables).
Usage
has_known_nontext_extension(files)
Arguments
files |
A character vector of files to search (only for |
Value
A logical vector indicating whether each file has a known non-text extension.
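A check of this kind amounts to a lookup of the lowercased extension in a fixed set. The sketch below is hypothetical: the function name and the three-element extension set are illustrative only, and the package's actual list is presumably much longer.

```r
# Illustrative extension set; not the package's real list.
nontext_ext <- c(".png", ".exe", ".rds")

has_known_nontext_extension_sketch <- function(files, known = nontext_ext) {
  ext <- tools::file_ext(files)
  # Files without an extension yield ".", which is not in the set.
  tolower(paste0(".", ext)) %in% known
}

has_known_nontext_extension_sketch(c("plot.PNG", "model.rds", "notes.txt", "Makefile"))
```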
Identify Files with Known Text Extensions
Description
Checks whether the provided file paths have extensions commonly associated with text-based formats (e.g., scripts, markdown, configuration files).
Usage
has_known_text_extension(files)
Arguments
files |
A character vector of files to search (only for |
Value
A logical vector indicating whether each file has a known text extension.
Detect Null Bytes in a File
Description
Reads the first n bytes of a file and checks whether any null bytes (0x00) are present, which is commonly used to detect binary files.
If the file cannot be read (e.g., corrupted or permission issues), the function safely assumes the file is binary and returns TRUE.
Usage
has_null_bytes(file, n = 1000L)
Arguments
file |
A character string representing a single file path. |
n |
The number of bytes to read for binary detection in files with unknown extensions. Defaults to 1000. |
Value
TRUE if a null byte is found or if an error occurs. FALSE otherwise.
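The null-byte heuristic can be sketched with base R's readBin(). The name has_null_bytes_sketch() is hypothetical, and treating warnings the same way as errors is an assumption of this sketch.

```r
# Hypothetical sketch: TRUE when a 0x00 byte appears in the first n bytes,
# or when the file cannot be read at all.
has_null_bytes_sketch <- function(file, n = 1000L) {
  tryCatch(
    {
      bytes <- readBin(file, what = "raw", n = n)
      any(bytes == as.raw(0L))
    },
    # Unreadable files are conservatively treated as binary.
    warning = function(w) TRUE,
    error = function(e) TRUE
  )
}

text_file <- tempfile(fileext = ".txt")
writeLines("plain text", text_file)
binary_file <- tempfile(fileext = ".bin")
writeBin(as.raw(c(0x41, 0x00, 0x42)), binary_file)

has_null_bytes_sketch(text_file)
has_null_bytes_sketch(binary_file)
```

Reading only a fixed-size prefix keeps the check cheap on large files, at the cost of missing null bytes that appear later in the file.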
Check if Files Are Located in a .git Folder
Description
Identifies whether the provided file paths are located inside a .git directory.
This function assumes that the file paths are normalized beforehand (i.e., using forward slashes / even on Windows systems).
Usage
is_in_gitfolder(files)
Arguments
files |
A character vector of files to search (only for |
Value
A logical vector indicating whether each file is located within a .git folder.
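One way to express this test on normalized paths is a regular expression that looks for a .git path component. The name is_in_gitfolder_sketch() and the exact pattern are assumptions of this sketch, not the package's implementation.

```r
# Hypothetical sketch: a path is inside a .git folder when ".git" appears as a
# full directory component (at the start or after a slash, followed by a slash).
is_in_gitfolder_sketch <- function(files) {
  grepl("(^|/)\\.git/", files)
}

is_in_gitfolder_sketch(c(
  "repo/.git/config",
  "repo/src/main.R",
  ".git/HEAD",
  "repo/.gitignore",
  ".github/workflows/check.yml"
))
```

Anchoring on the surrounding slashes avoids false positives on names that merely start with .git, such as .gitignore or .github.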
List Files in Directory
Description
Lists all files from a given directory with support for recursive search and inclusion of hidden files.
The function throws a specific error when no files are found, based on the combination of recurse and all parameters. Returned file paths are made unique and are assumed to be normalized using forward slashes (/).
Usage
list_files(path, recurse, all)
Arguments
path |
A character vector of one or more paths. |
recurse |
If |
all |
If |
Value
A character vector of unique file paths. If no files are found, the function aborts with a message suggesting how to adjust search parameters (recurse and all), and includes a class-specific error identifier depending on the search mode:
- "error_list_files_TT" for recurse = TRUE, all = TRUE
- "error_list_files_TF" for recurse = TRUE, all = FALSE
- "error_list_files_FT" for recurse = FALSE, all = TRUE
- "error_list_files_FF" for recurse = FALSE, all = FALSE
Examples
## Not run:
list_files("myfolder", recurse = TRUE, all = FALSE)
## End(Not run)
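Class-specific errors such as those listed above can be caught selectively with tryCatch(). The sketch below is a base R stand-in: list_files_sketch() is hypothetical, uses list.files() rather than fs::dir_ls(), and only mimics the error class naming scheme.

```r
# Hypothetical stand-in that signals a classed error when nothing is found.
list_files_sketch <- function(path, recurse, all) {
  files <- unique(list.files(
    path,
    recursive = recurse,
    all.files = all,
    full.names = TRUE,
    no.. = TRUE # drop "." and ".." when all.files = TRUE
  ))
  if (length(files) == 0L) {
    suffix <- paste0(if (recurse) "T" else "F", if (all) "T" else "F")
    cond <- structure(
      class = c(paste0("error_list_files_", suffix), "error", "condition"),
      list(message = "No files found.", call = NULL)
    )
    stop(cond)
  }
  files
}

empty_dir <- tempfile("empty")
dir.create(empty_dir)

# Handle only the non-recursive, non-hidden case; other errors propagate.
outcome <- tryCatch(
  list_files_sketch(empty_dir, recurse = FALSE, all = FALSE),
  error_list_files_FF = function(e) "no files: try recurse = TRUE or all = TRUE"
)
outcome
```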
Prepare Tidy Data Frame from Matched Lines
Description
Constructs a tidy data frame from matched lines across a set of files.
This function takes the output of read_filter_lines() and returns one row per match, including file path, line number, full line content, and regex match(es).
Usage
prepare_df(files, pattern, lines, path, relative_path, matches)
Arguments
files |
A character vector of files to search (only for |
pattern |
A regular expression pattern used to match lines. |
lines |
A list with |
path |
A character vector of one or more directories where files should be
discovered (only for |
relative_path |
Logical. If TRUE, file paths are made relative to the path argument. If multiple root paths are provided, relative_path is automatically ignored and absolute paths are kept to avoid ambiguity. |
matches |
Logical. If |
Details
All steps are executed sequentially to transform file-based pattern matches into a structured tabular format.
The function assumes that input files and their corresponding line data are correctly aligned.
It handles path normalization, match extraction, and output column selection according to the matches and relative_path arguments.
Value
A tibble with the following columns:
- path: File path (relative if specified), marked with class fs_path.
- line_number: Line number of the match within the file.
- match: The first matched substring from the line.
- matches (optional): All matched substrings as a list-column.
- line: Full content of the matching line.
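The relationship between the match, matches, and line columns can be illustrated with base R. This sketch uses a plain data frame and regmatches() instead of tibble and stringr, and the sample lines, pattern, and path "example.R" are made up for illustration.

```r
# Made-up input: lines of a single file and a pattern with a lookahead.
lines <- c("x <- 1; y <- 2", "no assignment here", "z <- 3")
pattern <- "[a-z]+(?= <-)"

# Keep only lines containing at least one match.
hit <- grepl(pattern, lines, perl = TRUE)

# All matches per retained line, as a list of character vectors.
all_matches <- regmatches(lines[hit], gregexpr(pattern, lines[hit], perl = TRUE))

result <- data.frame(
  path = "example.R",
  line_number = which(hit),
  match = vapply(all_matches, `[`, character(1), 1L), # first match only
  line = lines[hit],
  stringsAsFactors = FALSE
)
result$matches <- all_matches # list-column holding every match

result
```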
Should CLI Output Be Printed?
Description
Determines whether CLI progress or messaging functions should be executed.
This helper evaluates the seekr.verbose option, checks for an interactive session, and disables output during testthat tests.
Usage
print_cli()
Details
This function is designed to control conditional CLI output (e.g., cli::cli_progress_step()).
It returns TRUE only when:
- getOption("seekr.verbose", TRUE) is TRUE
- the session is interactive (interactive())
- testthat is not running (!testthat::is_testing())
Value
A logical scalar: TRUE if CLI output should be shown, FALSE otherwise.
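The three conditions can be combined in a few lines. In this hypothetical sketch, print_cli_sketch() replaces testthat::is_testing() with a check of the TESTTHAT environment variable so that it runs without testthat installed; that substitution is an assumption about how testthat signals a test run.

```r
# Hypothetical sketch of the three-way condition.
print_cli_sketch <- function() {
  isTRUE(getOption("seekr.verbose", TRUE)) &&
    interactive() &&
    # Assumed stand-in for !testthat::is_testing().
    !identical(Sys.getenv("TESTTHAT"), "true")
}

# Silencing output for the current session via the option:
old <- options(seekr.verbose = FALSE)
quiet <- print_cli_sketch()
options(old)
quiet
```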
Read and Filter Matching Lines in Text Files
Description
Reads lines from a set of text files and returns only the lines that match a specified regular expression pattern. The function processes each file one-by-one to maintain memory efficiency, making it suitable for reading large files. Files that cannot be read (due to warnings or errors) are skipped with a warning.
If verbosity is enabled via seekr.verbose = TRUE and the session is interactive, the function reports progress.
Usage
read_filter_lines(files, pattern, ...)
Arguments
files |
A character vector of files to search (only for |
pattern |
A regular expression pattern used to match lines. |
... |
Additional arguments passed to |
Details
Files are processed sequentially to minimize memory usage, especially when working with large files. Only the lines matching the pattern are retained for each file.
If a file raises a warning or an error during reading, it is silently skipped and contributes an empty entry to the result lists.
Value
A list with two elements:
- line_number: A list of integer vectors giving the line numbers of matching lines, one per file.
- line: A list of character vectors containing the matched lines, one per file.
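The file-by-file strategy and the shape of the returned list can be sketched in base R. The name read_filter_lines_sketch() is hypothetical, and it uses readLines() and grep() where the package relies on readr and stringr; unreadable files simply contribute empty entries here.

```r
# Hypothetical sketch: one file in memory at a time, only matching lines kept.
read_filter_lines_sketch <- function(files, pattern) {
  line_number <- vector("list", length(files))
  line <- vector("list", length(files))
  for (i in seq_along(files)) {
    content <- tryCatch(
      readLines(files[i], warn = FALSE),
      warning = function(w) NULL,
      error = function(e) NULL
    )
    if (is.null(content)) {
      # Unreadable file: keep the lists aligned with empty entries.
      line_number[[i]] <- integer(0)
      line[[i]] <- character(0)
      next
    }
    hits <- grep(pattern, content, perl = TRUE)
    line_number[[i]] <- hits
    line[[i]] <- content[hits]
  }
  list(line_number = line_number, line = line)
}

log_file <- tempfile(fileext = ".log")
writeLines(c("alpha", "TODO: fix", "beta", "todo later"), log_file)
missing_file <- file.path(tempdir(), "missing-seekr-sketch.log")

res <- read_filter_lines_sketch(c(log_file, missing_file), "(?i)todo")
res$line_number
res$line
```

Keeping one entry per input file, even when empty, is what lets a later step pair each result with its file path by position.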
Extract Matching Lines from Files
Description
These functions search through one or more text files, extract lines matching a regular expression pattern, and return a tibble containing the results.
- seek(): Discovers files inside one or more directories (recursively or not), applies optional file name and text file filtering, and searches lines.
- seek_in(): Searches inside a user-provided character vector of files.
Usage
seek(
pattern,
path = ".",
...,
filter = NULL,
negate = FALSE,
recurse = FALSE,
all = FALSE,
relative_path = TRUE,
matches = FALSE
)
seek_in(files, pattern, ..., matches = FALSE)
Arguments
pattern |
A regular expression pattern used to match lines. |
path |
A character vector of one or more directories where files should be
discovered (only for |
... |
Additional arguments passed to |
filter |
Optional. A regular expression pattern used to filter file paths
before reading. If |
negate |
Logical. If |
recurse |
If |
all |
If |
relative_path |
Logical. If TRUE, file paths are made relative to the path argument. If multiple root paths are provided, relative_path is automatically ignored and absolute paths are kept to avoid ambiguity. |
matches |
Logical. If |
files |
A character vector of files to search (only for |
Details
The overall process involves the following steps:
- File Selection
  - seek(): Files are discovered using fs::dir_ls(), starting from one or more directories.
  - seek_in(): Files are directly supplied by the user (no discovery phase).
- File Filtering
  - Files located inside .git/ folders are automatically excluded.
  - Files with known non-text extensions (e.g., .png, .exe, .rds) are excluded.
  - If a file's extension is unknown, a check is performed to detect embedded null bytes (binary indicator).
  - Optionally, an additional regex-based path filter (filter) can be applied.
- Line Reading
  - Files are read line-by-line using readr::read_lines().
  - Only lines matching the provided regular expression pattern are retained.
  - If a file cannot be read, it is skipped gracefully without failing the process.
- Data Frame Construction
  - A tibble is constructed with one row per matched line.

These functions are particularly useful for analyzing source code, configuration files, logs, and other structured text data.
Value
A tibble with one row per matched line, containing:
- path: File path (relative or absolute).
- line_number: Line number in the file.
- match: The first matched substring.
- matches: All matched substrings (if matches = TRUE).
- line: Full content of the matching line.
See Also
fs::dir_ls(), readr::read_lines(), stringr::str_detect()
Examples
path = system.file("extdata", package = "seekr")
# Search all function definitions in R files
seek("[^\\s]+(?= (=|<-) function\\()", path, filter = "\\.R$")
# Search for "TODO" comments in source code, case-insensitively
seek("(?i)TODO", path, filter = "\\.R$")
# Search for "error" (case-insensitive) in log files
seek("(?i)error", path, filter = "\\.log$")
# Search for config keys in YAML
seek("database:", path, filter = "\\.ya?ml$")
# Looking for "length" in all types of text files
seek("(?i)length", path)
# Search for specific CSV headers using seek_in() and reading only the first line
csv_files <- list.files(path, "\\.csv$", full.names = TRUE)
seek_in(csv_files, "(?i)specie", n_max = 1)
Read and Prepare Matching Lines
Description
Reads a set of files, filters lines based on a regular expression pattern, and constructs a tidy tibble of the results.
Usage
seek_lines(files, pattern, ..., path, relative_path, matches)
Arguments
files |
A character vector of files to search (only for |
pattern |
A regular expression pattern used to match lines. |
... |
Additional arguments passed to |
path |
A character vector of one or more directories where files should be
discovered (only for |
relative_path |
Logical. If TRUE, file paths are made relative to the path argument. If multiple root paths are provided, relative_path is automatically ignored and absolute paths are kept to avoid ambiguity. |
matches |
Logical. If |
Value
A tibble with one row per matching line.