Help for package heapsofpapers

Title:

Easily Download Heaps of PDF and CSV Files

Version:

0.1.0

Description:

Makes it easy to download a large number of files such as PDF files and CSV files, while automatically slowing down requests, letting you know where it is up to, and adjusting for files that have already been downloaded.

License:

MIT + file LICENSE

URL:

https://github.com/RohanAlexander/heapsofpapers

BugReports:

https://github.com/RohanAlexander/heapsofpapers/issues

Imports:

aws.s3, curl, dplyr, fs, magrittr, rlang, scales, utils

Suggests:

knitr, rmarkdown, spelling, testthat (≥ 3.0.0), tibble

VignetteBuilder:

knitr

Config/testthat/edition:

Encoding:

UTF-8

Language:

en-US

LazyData:

false

RoxygenNote:

7.1.1

NeedsCompilation:

Packaged:

2021-08-19 14:19:16 UTC; rohanalexander

Author:

Rohan Alexander

[aut, cre, cph], A Mahfouz [aut]

Maintainer:

Rohan Alexander <rohan.alexander@utoronto.ca>

Repository:

CRAN

Date/Publication:

2021-08-23 18:40:01 UTC

Pipe operator

Description

See magrittr::%>% for details.

Usage

lhs %>% rhs

Arguments

lhs

A value or the magrittr placeholder.

rhs

A function call using the magrittr semantics.

Value

The result of calling rhs(lhs).

check_for_existence

Description

The check_for_existence function looks at the folder that you are going to save your PDFs to and checks whether you have already got any of them. It then suggests that you filter to remove them.

Usage

check_for_existence(data, save_names = "save_names", dir = "heaps_of")

Arguments

data

A dataframe that contains URLs that you want to download and the names that you want to save them as.

save_names

The name of the column whose values should be the saved file names where the downloaded file will be saved, save_names by default.

dir

The directory to download files to, current working directory by default.

Value

The data dataframe with a column specifying whether the file has been downloaded.

Examples

## Not run: two_pdfs <-
 tibble::tibble(
  locations_are = c("https://osf.io/preprints/socarxiv/z4qg9/download",
    "https://osf.io/preprints/socarxiv/a29h8/download"),
  save_here = c("competing_effects_on_the_average_age_of_infant_death.pdf",
     "cesr_an_r_package_for_the_canadian_election_study.pdf")
   )

heapsofpapers::get_and_save(
data = two_pdfs,
links = "locations_are",
save_names = "save_here"
)

heapsofpapers::check_for_existence(data = two_pdfs, save_names = "save_here")

## End(Not run)

get_these_and_save_them

Description

The get_and_save function works with a tibble of locations (usually URLs) and file names, and then downloads the PDF from the location to the file name, saving as it goes, and letting you know where it is up to. It politely waits around 5 seconds between calls to the location, and skips locations that give an error.

Usage

get_and_save(
  data,
  links = "links",
  save_names = "save_names",
  dir = "heaps_of",
  bucket = NULL,
  delay = 5,
  print_every = 1,
  dupe_strategy = "overwrite"
)

Arguments

data

A dataframe that contains URLs that you want to download and the names that you want to save them as.

links

The name of the column whose values should be the URLs that you want to download, links by default.

save_names

The name of the column whose values should be the saved file names where the downloaded file will be saved, save_names by default.

dir

The directory to download files to, current working directory by default.

bucket

name of AWS S3 bucket to save files to.

delay

The number of seconds to wait between downloads, default (and minimum) is five seconds. We automatically add a bit of noise to lessen the effect on systematic processes that might be otherwise working.

print_every

The default is that you get a print message for every file, but you can change this. If you want to print an update for every second file then set this equal to 2, for a printed update every tenth file, set it to 10, etc.

dupe_strategy

There are a variety of ways of dealing with the situation where you already have some of the files downloaded. By default the function will just get them again and overwrite. However you can also specify 'ignore' in which case those files will be ignored. You can also investigate duplicates yourself using heapsofpapers::check_for_existence().

Value

A print statement in the console about whether each of the links was saved (if not turned off by the user), and notification that the function has finished.

Examples

## Not run: two_pdfs <-
tibble::tibble(
  locations_are = c("https://osf.io/preprints/socarxiv/z4qg9/download",
                    "https://osf.io/preprints/socarxiv/a29h8/download"),
  save_here = c("competing_effects_on_the_average_age_of_infant_death.pdf",
                "cesr_an_r_package_for_the_canadian_election_study.pdf")
)

heapsofpapers::get_and_save(
data = two_pdfs,
links = "locations_are",
save_names = "save_here"
)

## End(Not run)