Date: | 2024-08-25 |
Type: | Package |
Title: | A 'robots.txt' Parser and 'Webbot'/'Spider'/'Crawler' Permissions Checker |
Version: | 0.7.15 |
Description: | Provides functions to download and parse 'robots.txt' files. Ultimately the package makes it easy to check if bots (spiders, crawlers, scrapers, ...) are allowed to access specific resources on a domain. |
License: | MIT + file LICENSE |
BugReports: | https://github.com/ropensci/robotstxt/issues |
URL: | https://docs.ropensci.org/robotstxt/, https://github.com/ropensci/robotstxt |
Imports: | stringr (≥ 1.0.0), httr (≥ 1.0.0), spiderbar (≥ 0.2.0), future.apply (≥ 1.0.0), magrittr, utils |
Suggests: | knitr, rmarkdown, dplyr, testthat, covr, curl |
Depends: | R (≥ 3.0.0) |
VignetteBuilder: | knitr |
RoxygenNote: | 7.2.3 |
Encoding: | UTF-8 |
NeedsCompilation: | no |
Packaged: | 2024-08-25 07:16:34 UTC; pbtz |
Author: | Pedro Baltazar [aut, cre], Peter Meissner [aut], Kun Ren [aut, cph] (Author and copyright holder of list_merge.R.), Oliver Keys [ctb] (original release code review), Rich Fitz John [ctb] (original release code review) |
Maintainer: | Pedro Baltazar <pedrobtz@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2024-08-29 17:00:01 UTC |
re-export magrittr pipe operator
Description
re-export magrittr pipe operator
Method as.list() for class robotstxt_text
Description
Method as.list() for class robotstxt_text
Usage
## S3 method for class 'robotstxt_text'
as.list(x, ...)
Arguments
x |
class robotstxt_text object to be transformed into list |
... |
further arguments (inherited from base::as.list()) |
fix_url
Description
fix_url
Usage
fix_url(url)
Arguments
url |
a character string containing a single URL |
downloading robots.txt file
Description
downloading robots.txt file
Usage
get_robotstxt(
domain,
warn = getOption("robotstxt_warn", TRUE),
force = FALSE,
user_agent = utils::sessionInfo()$R.version$version.string,
ssl_verifypeer = c(1, 0),
encoding = "UTF-8",
verbose = FALSE,
rt_request_handler = robotstxt::rt_request_handler,
rt_robotstxt_http_getter = robotstxt::get_robotstxt_http_get,
on_server_error = on_server_error_default,
on_client_error = on_client_error_default,
on_not_found = on_not_found_default,
on_redirect = on_redirect_default,
on_domain_change = on_domain_change_default,
on_file_type_mismatch = on_file_type_mismatch_default,
on_suspect_content = on_suspect_content_default
)
Arguments
domain |
domain from which to download robots.txt file |
warn |
warn about being unable to download the domain/robots.txt file |
force |
if TRUE, the function will re-download the robots.txt file instead of using possibly cached results |
user_agent |
HTTP user-agent string to be used to retrieve robots.txt file from domain |
ssl_verifypeer |
either 1 (default) or 0, if 0 it disables SSL peer verification, which might help with robots.txt file retrieval |
encoding |
Encoding of the robots.txt file. |
verbose |
make function print out more information |
rt_request_handler |
handler function that handles the request according to the event handlers specified |
rt_robotstxt_http_getter |
function that executes HTTP request |
on_server_error |
request state handler for any 5xx status |
on_client_error |
request state handler for any 4xx HTTP status that is not 404 |
on_not_found |
request state handler for HTTP status 404 |
on_redirect |
request state handler for any 3xx HTTP status |
on_domain_change |
request state handler for any 3xx HTTP status where domain did change as well |
on_file_type_mismatch |
request state handler for content type other than 'text/plain' |
on_suspect_content |
request state handler for content that seems to be something other than a robots.txt file (usually JSON, XML, or HTML) |
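For orientation, a minimal usage sketch (not part of the package manual; the domain is illustrative and network access is required):
## Not run:
# download and inspect a robots.txt file (domain is illustrative)
rtxt <- get_robotstxt(domain = "example.com")
rtxt
# bypass the cache and force a fresh download
rtxt <- get_robotstxt(domain = "example.com", force = TRUE)
## End(Not run)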
function to get multiple robotstxt files
Description
function to get multiple robotstxt files
Usage
get_robotstxts(
domain,
warn = TRUE,
force = FALSE,
user_agent = utils::sessionInfo()$R.version$version.string,
ssl_verifypeer = c(1, 0),
use_futures = FALSE,
verbose = FALSE,
rt_request_handler = robotstxt::rt_request_handler,
rt_robotstxt_http_getter = robotstxt::get_robotstxt_http_get,
on_server_error = on_server_error_default,
on_client_error = on_client_error_default,
on_not_found = on_not_found_default,
on_redirect = on_redirect_default,
on_domain_change = on_domain_change_default,
on_file_type_mismatch = on_file_type_mismatch_default,
on_suspect_content = on_suspect_content_default
)
Arguments
domain |
domain from which to download robots.txt file |
warn |
warn about being unable to download the domain/robots.txt file |
force |
if TRUE, the function will re-download the robots.txt file instead of using possibly cached results |
user_agent |
HTTP user-agent string to be used to retrieve robots.txt file from domain |
ssl_verifypeer |
either 1 (default) or 0, if 0 it disables SSL peer verification, which might help with robots.txt file retrieval |
use_futures |
Should future.apply::future_lapply be used for possible parallel/async retrieval or not. Note: check out the help pages and vignettes of the future package on how to set up plans for future execution, because the robotstxt package does not do it on its own. |
verbose |
make function print out more information |
rt_request_handler |
handler function that handles the request according to the event handlers specified |
rt_robotstxt_http_getter |
function that executes HTTP request |
on_server_error |
request state handler for any 5xx status |
on_client_error |
request state handler for any 4xx HTTP status that is not 404 |
on_not_found |
request state handler for HTTP status 404 |
on_redirect |
request state handler for any 3xx HTTP status |
on_domain_change |
request state handler for any 3xx HTTP status where domain did change as well |
on_file_type_mismatch |
request state handler for content type other than 'text/plain' |
on_suspect_content |
request state handler for content that seems to be something other than a robots.txt file (usually JSON, XML, or HTML) |
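A hedged sketch of fetching several robots.txt files at once (domains are illustrative); when use_futures = TRUE, a future plan has to be set up beforehand because the package does not do so on its own:
## Not run:
library(future)
plan(multisession)  # set up a parallel plan for future-based retrieval
domains <- c("example.com", "example.org")
rtxts <- get_robotstxts(domain = domains, use_futures = TRUE)
## End(Not run)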
function guessing domain from path
Description
function guessing domain from path
Usage
guess_domain(x)
Arguments
x |
path aka URL from which to infer domain |
http_domain_changed
Description
http_domain_changed
Usage
http_domain_changed(response)
Arguments
response |
an httr response object, e.g. from a call to httr::GET() |
Value
logical of length 1 indicating whether or not any domain change happened during the HTTP request
http_subdomain_changed
Description
http_subdomain_changed
Usage
http_subdomain_changed(response)
Arguments
response |
an httr response object, e.g. from a call to httr::GET() |
Value
logical of length 1 indicating whether or not any subdomain change happened during the HTTP request
http_was_redirected
Description
http_was_redirected
Usage
http_was_redirected(response)
Arguments
response |
an httr response object, e.g. from a call to httr::GET() |
Value
logical of length 1 indicating whether or not any redirect happened during the HTTP request
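A brief sketch of how these helpers might be used on an httr response (the URL is illustrative; depending on the installed version the helpers may be internal and need the robotstxt::: prefix):
## Not run:
response <- httr::GET("http://example.com/robots.txt")
http_was_redirected(response)
http_domain_changed(response)
http_subdomain_changed(response)
## End(Not run)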
is_suspect_robotstxt
Description
function that checks whether the content looks suspect, i.e. like something other than a valid / parsable robots.txt file
Usage
is_suspect_robotstxt(text)
Arguments
text |
content of a robots.txt file provided as a character vector |
function that checks if file is valid / parsable robots.txt file
Description
function that checks if file is valid / parsable robots.txt file
Usage
is_valid_robotstxt(text, check_strickt_ascii = FALSE)
Arguments
text |
content of a robots.txt file provided as a character vector |
check_strickt_ascii |
whether or not to check that the content adheres to the RFC requirement of using plain text aka ASCII |
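A small sketch checking hand-written robots.txt content (the content is illustrative):
## Not run:
txt <- "User-agent: *\nDisallow: /private/\n"
is_valid_robotstxt(txt)
is_valid_robotstxt(txt, check_strickt_ascii = TRUE)
## End(Not run)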
Merge a number of named lists in sequential order
Description
Merge a number of named lists in sequential order
Usage
list_merge(...)
Arguments
... |
named lists |
Details
List merging is usually useful for merging program settings or configurations that exist in multiple versions across time or across administrative levels. For example, a program's settings may have an initial version in which most keys are defined and specified. In later versions, partial modifications are recorded. In this case, list merging can be used to merge all versions of the settings in release order, resulting in fully updated settings with all later modifications applied.
Author(s)
Kun Ren <mail@renkun.me>
The function merges a number of lists in sequential order via modifyList: the later list always modifies the former list, forming a merged list, and the resulting list is then merged with the next list. The process is repeated until all lists in ... are exhausted.
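An illustrative sketch of sequential merging (the list names and values are made up for the example):
## Not run:
defaults  <- list(timeout = 10, retries = 2, verbose = FALSE)
overrides <- list(retries = 5)
list_merge(defaults, overrides)
# for two lists this is roughly equivalent to utils::modifyList(defaults, overrides)
## End(Not run)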
make automatically named list
Description
make automatically named list
Usage
named_list(...)
Arguments
... |
things to be put in list |
null_to_defeault
Description
null_to_defeault
Usage
null_to_defeault(x, d)
Arguments
x |
value to check and return |
d |
value to return in case x is NULL |
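A trivial sketch of the fallback behavior (values are illustrative; if the helper is not exported in your installed version, prefix it with robotstxt:::):
## Not run:
null_to_defeault(NULL, 10)  # x is NULL, so the default 10 is returned
null_to_defeault(5, 10)     # x is not NULL, so 5 is returned
## End(Not run)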
function parsing robots.txt
Description
function parsing robots.txt
Usage
parse_robotstxt(txt)
Arguments
txt |
content of the robots.txt file |
Value
a named list with useragents, comments, permissions, sitemap
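A sketch of parsing robots.txt content supplied as text (the content is illustrative):
## Not run:
txt <- "User-agent: *\nDisallow: /private/\nSitemap: https://example.com/sitemap.xml\n"
parsed <- parse_robotstxt(txt)
parsed$useragents
parsed$permissions
parsed$sitemap
## End(Not run)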
parse_url
Description
parse_url
Usage
parse_url(url)
Arguments
url |
url to parse into its components |
Value
data.frame with columns protocol, domain, path
Examples
## Not run:
url <-
c(
"google.com",
"google.com/",
"www.google.com",
"http://google.com",
"https://google.com",
"sub.domain.whatever.de"
"s-u-b.dom-ain.what-ever.de"
)
parse_url(url)
## End(Not run)
check if a bot has permissions to access page(s)
Description
check if a bot has permissions to access page(s)
Usage
paths_allowed(
paths = "/",
domain = "auto",
bot = "*",
user_agent = utils::sessionInfo()$R.version$version.string,
check_method = c("spiderbar"),
warn = getOption("robotstxt_warn", TRUE),
force = FALSE,
ssl_verifypeer = c(1, 0),
use_futures = TRUE,
robotstxt_list = NULL,
verbose = FALSE,
rt_request_handler = robotstxt::rt_request_handler,
rt_robotstxt_http_getter = robotstxt::get_robotstxt_http_get,
on_server_error = on_server_error_default,
on_client_error = on_client_error_default,
on_not_found = on_not_found_default,
on_redirect = on_redirect_default,
on_domain_change = on_domain_change_default,
on_file_type_mismatch = on_file_type_mismatch_default,
on_suspect_content = on_suspect_content_default
)
Arguments
paths |
paths for which to check the bot's permission, defaults to "/". Please note that the path to a folder should end with a trailing slash ("/"). |
domain |
Domain for which paths should be checked. Defaults to "auto". If set to "auto", the function will try to guess the domain by parsing the paths argument. Note, however, that these are educated guesses which might utterly fail. To be on the safe side, provide appropriate domains manually. |
bot |
name of the bot, defaults to "*" |
user_agent |
HTTP user-agent string to be used to retrieve robots.txt file from domain |
check_method |
kept only for backward compatibility; do not use this parameter anymore, simply let the function use the default |
warn |
warn about being unable to download the domain/robots.txt file |
force |
if TRUE, the function will re-download the robots.txt file instead of using possibly cached results |
ssl_verifypeer |
either 1 (default) or 0, if 0 it disables SSL peer verification, which might help with robots.txt file retrieval |
use_futures |
Should future.apply::future_lapply be used for possible parallel/async retrieval or not. Note: check out the help pages and vignettes of the future package on how to set up plans for future execution, because the robotstxt package does not do it on its own. |
robotstxt_list |
either NULL – the default – or a list of character vectors with one vector per path to check |
verbose |
make function print out more information |
rt_request_handler |
handler function that handles the request according to the event handlers specified |
rt_robotstxt_http_getter |
function that executes HTTP request |
on_server_error |
request state handler for any 5xx status |
on_client_error |
request state handler for any 4xx HTTP status that is not 404 |
on_not_found |
request state handler for HTTP status 404 |
on_redirect |
request state handler for any 3xx HTTP status |
on_domain_change |
request state handler for any 3xx HTTP status where domain did change as well |
on_file_type_mismatch |
request state handler for content type other than 'text/plain' |
on_suspect_content |
request state handler for content that seems to be something other than a robots.txt file (usually JSON, XML, or HTML) |
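A hedged usage sketch (the domain and paths are illustrative; a network connection is required):
## Not run:
# check whether a generic bot may fetch two paths on a domain
paths_allowed(
  paths  = c("/", "/private/"),
  domain = "example.com",
  bot    = "*"
)
# with domain = "auto" (the default) the domain is guessed from full URLs
paths_allowed("https://example.com/private/")
## End(Not run)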
paths_allowed_worker spiderbar flavor
Description
paths_allowed_worker spiderbar flavor
Usage
paths_allowed_worker_spiderbar(domain, bot, paths, robotstxt_list)
Arguments
domain |
Domain for which paths should be checked. Defaults to "auto". If set to "auto", the function will try to guess the domain by parsing the paths argument. Note, however, that these are educated guesses which might utterly fail. To be on the safe side, provide appropriate domains manually. |
bot |
name of the bot, defaults to "*" |
paths |
paths for which to check the bot's permission, defaults to "/". Please note that the path to a folder should end with a trailing slash ("/"). |
robotstxt_list |
either NULL – the default – or a list of character vectors with one vector per path to check |
printing robotstxt
Description
printing robotstxt
Usage
## S3 method for class 'robotstxt'
print(x, ...)
Arguments
x |
robotstxt instance to be printed |
... |
goes down the sink |
printing robotstxt_text
Description
printing robotstxt_text
Usage
## S3 method for class 'robotstxt_text'
print(x, ...)
Arguments
x |
character vector aka robotstxt$text to be printed |
... |
goes down the sink |
function to remove domain from path
Description
function to remove domain from path
Usage
remove_domain(x)
Arguments
x |
path aka URL from which to first infer domain and then remove it |
request_handler_handler
Description
Helper function to handle robotstxt handlers.
Usage
request_handler_handler(request, handler, res, info = TRUE, warn = TRUE)
Arguments
request |
the request object returned by a call to httr::GET() |
handler |
the handler: either a character string naming one of several options or a function producing a specific list (see Value) |
res |
a list with elements '[handler names]', ..., 'rtxt', and 'cache' |
info |
info to add to problems list |
warn |
if FALSE warnings and messages are suppressed |
Value
a list with elements '[handler name]', 'rtxt', and 'cache'
Generate a representations of a robots.txt file
Description
The function generates a list that contains the data resulting from parsing a robots.txt file as well as a function called check() that allows asking the representation whether a bot (or a particular bot) is allowed to access a resource on the domain.
Usage
robotstxt(
domain = NULL,
text = NULL,
user_agent = NULL,
warn = getOption("robotstxt_warn", TRUE),
force = FALSE,
ssl_verifypeer = c(1, 0),
encoding = "UTF-8",
verbose = FALSE,
on_server_error = on_server_error_default,
on_client_error = on_client_error_default,
on_not_found = on_not_found_default,
on_redirect = on_redirect_default,
on_domain_change = on_domain_change_default,
on_file_type_mismatch = on_file_type_mismatch_default,
on_suspect_content = on_suspect_content_default
)
Arguments
domain |
Domain for which to generate a representation. If text is NULL (the default), the function will download the file from the server. |
text |
If automatic download of the robots.txt is not preferred, the text can be supplied directly. |
user_agent |
HTTP user-agent string to be used to retrieve robots.txt file from domain |
warn |
warn about being unable to download the domain/robots.txt file |
force |
if TRUE, the function will re-download the robots.txt file instead of using possibly cached results |
ssl_verifypeer |
either 1 (default) or 0, if 0 it disables SSL peer verification, which might help with robots.txt file retrieval |
encoding |
Encoding of the robots.txt file. |
verbose |
make function print out more information |
on_server_error |
request state handler for any 5xx status |
on_client_error |
request state handler for any 4xx HTTP status that is not 404 |
on_not_found |
request state handler for HTTP status 404 |
on_redirect |
request state handler for any 3xx HTTP status |
on_domain_change |
request state handler for any 3xx HTTP status where domain did change as well |
on_file_type_mismatch |
request state handler for content type other than 'text/plain' |
on_suspect_content |
request state handler for content that seems to be something other than a robots.txt file (usually JSON, XML, or HTML) |
Value
Object (list) of class robotstxt with parsed data from a robots.txt file (domain, text, bots, permissions, host, sitemap, other) and a function (check()) to check resource permissions.
Fields
domain
character vector holding domain name for which the robots.txt file is valid; will be set to NA if not supplied on initialization
text
character vector of text of robots.txt file; either supplied on initialization or automatically downloaded from domain supplied on initialization
bots
character vector of bot names mentioned in robots.txt
permissions
data.frame of bot permissions found in robots.txt file
host
data.frame of host fields found in robots.txt file
sitemap
data.frame of sitemap fields found in robots.txt file
other
data.frame of other - none of the above - fields found in robots.txt file
check()
Method to check for bot permissions. Defaults to the domain's root and no bot in particular. check() has two arguments: paths and bot. The first supplies the paths for which to check permissions and the latter the name of the bot. Please note that the path to a folder should end with a trailing slash ("/").
Examples
## Not run:
rt <- robotstxt(domain="google.com")
rt$bots
rt$permissions
rt$check( paths = c("/", "forbidden"), bot="*")
## End(Not run)
get_robotstxt() cache
Description
get_robotstxt() cache
Usage
rt_cache
Format
An object of class environment of length 0.
extracting comments from robots.txt
Description
extracting comments from robots.txt
Usage
rt_get_comments(txt)
Arguments
txt |
content of the robots.txt file |
extracting permissions from robots.txt
Description
extracting permissions from robots.txt
Usage
rt_get_fields(txt, regex = "", invert = FALSE)
Arguments
txt |
content of the robots.txt file |
regex |
regular expression to specify the field |
invert |
invert selection made via regex? |
extracting robotstxt fields
Description
extracting robotstxt fields
Usage
rt_get_fields_worker(txt, type = "all", regex = NULL, invert = FALSE)
Arguments
txt |
content of the robots.txt file |
type |
name or names of the fields to be returned, defaults to all fields |
regex |
subsetting field names via regular expressions |
invert |
invert the field selection made via regex? |
load robots.txt files saved along with the package
Description
load robots.txt files saved along with the package: these functions are very handy for testing (not used otherwise)
Usage
rt_get_rtxt(name = sample(rt_list_rtxt(), 1))
Arguments
name |
name of the robots.txt file, defaults to a randomly drawn file ;-) |
extracting HTTP useragents from robots.txt
Description
extracting HTTP useragents from robots.txt
Usage
rt_get_useragent(txt)
Arguments
txt |
content of the robots.txt file |
storage for http request response objects
Description
storage for http request response objects
get_robotstxt() worker function to execute HTTP request
Usage
rt_last_http
get_robotstxt_http_get(
domain,
user_agent = utils::sessionInfo()$R.version$version.string,
ssl_verifypeer = 1
)
Arguments
domain |
the domain to get the robots.txt file for |
user_agent |
the user agent to use for HTTP request header |
ssl_verifypeer |
either 1 (default) or 0, if 0 it disables SSL peer verification, which might help with robots.txt file retrieval |
Format
An object of class environment of length 1.
list robots.txt files saved along with the package
Description
list robots.txt files saved along with the package: these functions are very handy for testing (not used otherwise)
Usage
rt_list_rtxt()
rt_request_handler
Description
A helper function for get_robotstxt() that will extract the robots.txt file from the HTTP request result object. Furthermore, it will inform get_robotstxt() whether the request should be cached and which problems occurred.
Usage
rt_request_handler(
request,
on_server_error = on_server_error_default,
on_client_error = on_client_error_default,
on_not_found = on_not_found_default,
on_redirect = on_redirect_default,
on_domain_change = on_domain_change_default,
on_sub_domain_change = on_sub_domain_change_default,
on_file_type_mismatch = on_file_type_mismatch_default,
on_suspect_content = on_suspect_content_default,
warn = TRUE,
encoding = "UTF-8"
)
on_server_error_default
on_client_error_default
on_not_found_default
on_redirect_default
on_domain_change_default
on_sub_domain_change_default
on_file_type_mismatch_default
on_suspect_content_default
Arguments
request |
result of an HTTP request (e.g. httr::GET()) |
on_server_error |
request state handler for any 5xx status |
on_client_error |
request state handler for any 4xx HTTP status that is not 404 |
on_not_found |
request state handler for HTTP status 404 |
on_redirect |
request state handler for any 3xx HTTP status |
on_domain_change |
request state handler for any 3xx HTTP status where domain did change as well |
on_sub_domain_change |
request state handler for any 3xx HTTP status where domain did change but only to www-sub_domain |
on_file_type_mismatch |
request state handler for content type other than 'text/plain' |
on_suspect_content |
request state handler for content that seems to be something other than a robots.txt file (usually JSON, XML, or HTML) |
warn |
whether or not to emit warnings; if FALSE, warnings and messages are suppressed |
encoding |
The text encoding to assume if no encoding is provided in the headers of the response |
Format
An object of class list of length 4.
An object of class list of length 4.
An object of class list of length 4.
An object of class list of length 2.
An object of class list of length 3.
An object of class list of length 2.
An object of class list of length 4.
An object of class list of length 4.
Value
a list with three items following this schema:
list( rtxt = "", problems = list( "redirect" = list( status_code = 301 ),
"domain" = list(from_url = "...", to_url = "...") ) )
making paths uniform
Description
making paths uniform
Usage
sanitize_path(path)
Arguments
path |
path to be sanitized |
Value
sanitized path