
Package ‘polite’

June 30, 2023


Version 0.1.3
Title Be Nice on the Web
Description Be responsible when scraping data from websites by following polite principles:
introduce yourself, ask for permission, take slowly and never ask twice.
License MIT + file LICENSE
Encoding UTF-8
ByteCompile true
URL https://github.com/dmi3kno/polite,
https://dmi3kno.github.io/polite/
BugReports https://github.com/dmi3kno/polite/issues
RoxygenNote 7.2.3
Imports httr, magrittr, memoise, ratelimitr, robotstxt, rvest, stats,
usethis
Suggests dplyr, testthat, covr, webmockr
NeedsCompilation no
Author Dmytro Perepolkin [aut, cre] (<https://orcid.org/0000-0001-8558-6183>)
Maintainer Dmytro Perepolkin <[email protected]>
Repository CRAN
Date/Publication 2023-06-30 08:30:02 UTC

R topics documented:
bow
guess_basename
html_attrs_dfr
nod
politely
print.polite
rip
scrape
set_scrape_delay
use_manners

Index

bow Introduce yourself to the host

Description

Introduce yourself to the host

Usage

bow(
url,
user_agent = "polite R package",
delay = 5,
times = 3,
force = FALSE,
verbose = FALSE,
...
)

is.polite(x)

Arguments

url URL
user_agent character value passed as the user agent string
delay desired delay between scraping attempts. The final value will be the maximum of
the desired delay and the delay mandated by robots.txt for the relevant user agent
times number of times to attempt scraping. Default is 3.
force refresh all memoised functions. Clears the robotstxt and scrape caches. Default
is FALSE
verbose TRUE/FALSE
... other curl parameters wrapped into the httr::config function
x object of class polite, session

Value

object of class polite, session



Examples

library(polite)

host <- "https://www.cheese.com"


session <- bow(host)
session
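
# is.polite() (documented in Usage above) can be used to confirm that bow()
# returned a polite session object; this extra line is illustrative only:
is.polite(session)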

guess_basename Guess download file name from the URL

Description
Guess download file name from the URL

Usage
guess_basename(x)

Arguments
x URL to guess the basename from

Value
guessed file name

Examples
guess_basename("https://bit.ly/polite_sticker")

html_attrs_dfr Convert collection of html nodes into data frame

Description
Convert collection of html nodes into data frame

Usage
html_attrs_dfr(
x,
attrs = NULL,
trim = FALSE,
defaults = NA_character_,
add_text = TRUE
)

Arguments
x xml_nodeset object, containing text and attributes of interest
attrs character vector of attribute names. If missing, all attributes will be used
trim if TRUE, will trim leading and trailing spaces
defaults character vector of default values to be passed to rvest::html_attr(). Recycled
to match the length of attrs
add_text if TRUE, node content will be added as a .text column (using rvest::html_text)

Value
data frame with one row per xml node, consisting of an html_text column with text and additional
columns with attributes

Examples

library(polite)
library(rvest)
bow("https://en.wikipedia.org/wiki/List_of_cognitive_biases") %>%
scrape() %>%
html_nodes("tr td:nth-child(1) a") %>%
html_attrs_dfr()
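
# The attrs argument limits the result to selected attributes; the attribute
# names below are illustrative (not run here, to avoid a repeated request):
# bow("https://en.wikipedia.org/wiki/List_of_cognitive_biases") %>%
#   scrape() %>%
#   html_nodes("tr td:nth-child(1) a") %>%
#   html_attrs_dfr(attrs = c("href", "title"))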

nod Agree modification of session path with the host

Description
Agree modification of session path with the host

Usage
nod(bow, path, verbose = FALSE)

Arguments
bow object of class polite, session created by polite::bow()
path string value of the path/URL to follow. The function accepts either a path (the
part of the URL following the domain name) or a full URL
verbose TRUE/FALSE

Value
object of class polite, session with modified URL

Examples

library(polite)

host <- "https://www.cheese.com"


session <- bow(host) %>%
nod(path="by_type")
session
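
# nod() also accepts a full URL instead of a bare path (illustrative, not run):
# session <- bow(host) %>%
#   nod("https://www.cheese.com/by_type")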

politely Give your web-scraping function good manners

Description
Give your web-scraping function good manners

Usage
politely(
fun,
user_agent = paste0("polite ", getOption("HTTPUserAgent"), " bot"),
robots = TRUE,
force = FALSE,
delay = 5,
verbose = FALSE,
cache = memoise::cache_memory()
)

Arguments
fun function to be turned "polite". Must contain an argument named url, which
contains the URL to be queried.
user_agent optional, user agent string to be used. Defaults to paste("polite", getOption("HTTPUserAgent"),
"bot")
robots optional, should robots.txt be consulted for permissions. Default is TRUE
force whether or not to force a fresh download of robots.txt
delay minimum delay in seconds, not less than 1. Default is 5.
verbose output more information about querying process
cache memoise cache function for storing results. Default memoise::cache_memory()

Value
polite function

Examples

polite_GET <- politely(httr::GET)
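
# The wrapped function is then called like the original one; its url argument
# is checked against robots.txt and rate-limited. Illustrative call (performs a
# real request, so it is not run here):
# resp <- polite_GET("https://www.cheese.com")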

print.polite Print host introduction object

Description

Print host introduction object

Usage

## S3 method for class 'polite'
print(x, ...)

Arguments

x object of class polite, session


... other parameters passed to methods
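
Examples

## print.polite() is dispatched automatically when a polite session object is
## printed; the host below is illustrative:
session <- bow("https://www.cheese.com")
print(session)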

rip Polite file download

Description

Polite file download

Usage

rip(
bow,
destfile = NULL,
...,
mode = "wb",
path = tempdir(),
overwrite = FALSE
)

Arguments

bow host introduction object of class polite, session created by bow() or nod()
destfile optional new file name to use when saving the file. If missing, it will be guessed
from basename(url)
... other parameters passed to download.file
mode character. The mode with which to write the file. Useful values are w, wb (binary),
a (append) and ab. Not used for methods wget and curl.
path character. Path where to save the destfile. By default it is the temporary directory
created with tempdir(). Ignored if destfile contains a path along with the filename.
overwrite if TRUE will overwrite file on disk

Value

Full path to the locally saved file indicated by the user in destfile (and path)

Examples

bow("https://en.wikipedia.org/") %>%
nod("wiki/Flag_of_the_United_States#/media/File:Flag_of_the_United_States.svg") %>%
rip()
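# An explicit file name and target directory can also be supplied; the values
# below are illustrative (not run):
# bow("https://en.wikipedia.org/") %>%
#   nod("wiki/Flag_of_the_United_States#/media/File:Flag_of_the_United_States.svg") %>%
#   rip(destfile = "us_flag.svg", path = tempdir())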

scrape Scrape the content of authorized page/API

Description

Scrape the content of authorized page/API

Usage

scrape(
bow,
query = NULL,
params = NULL,
accept = "html",
content = NULL,
verbose = FALSE
)

Arguments
bow host introduction object of class polite, session created by bow() or nod()
query named list of parameters to be appended to URL in the format list(param1=valA,
param2=valB)
params deprecated. Use query argument above.
accept character value of expected data type to be returned by host (e.g. html, json,
xml, csv, txt, etc.)
content MIME type (aka internet media type) used to override the content type returned
by the server. See http://en.wikipedia.org/wiki/Internet_media_type for a list of
common types. You can add the charset parameter to override the server’s
default encoding
verbose extra feedback from the function. Defaults to FALSE

Value
Object of class httr::response which can be further processed by functions in the rvest package

Examples

library(rvest)
bow("https://en.wikipedia.org/wiki/List_of_cognitive_biases") %>%
scrape(content="text/html; charset=UTF-8") %>%
html_nodes(".wikitable") %>%
html_table()
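
# The query argument appends named parameters to the URL; the endpoint and
# parameter names below are hypothetical (not run):
# bow("https://www.cheese.com") %>%
#   nod("search") %>%
#   scrape(query = list(q = "cheddar", page = 1))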

set_scrape_delay Reset scraping/ripping rate limit

Description
Reset scraping/ripping rate limit

Usage
set_scrape_delay(delay)

set_rip_delay(delay)

Arguments
delay Delay between subsequent requests. The package default is 5 seconds. It can only
be set lower if a custom user-agent string is specified.

Value
Updates rate-limit property of scrape and rip functions, respectively.

Examples

library(polite)

host <- "https://www.cheese.com"


session <- bow(host)
session
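
# A delay below the 5-second default requires a custom user-agent string; the
# values below are illustrative (not run):
# session <- bow(host, user_agent = "me@example.com")
# set_scrape_delay(2)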

use_manners Use manners in your own package or script

Description
Creates a collection of polite functions for scraping and downloading

Usage
use_manners(save_as = "R/polite-scrape.R", open = TRUE)

Arguments
save_as File where the functions should be created. Defaults to "R/polite-scrape.R"
open if TRUE, open the resultant files
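
Examples

## Not run:
# Writes a file of polite scraping helpers into the current package or project;
# the path below is the documented default:
use_manners(save_as = "R/polite-scrape.R", open = FALSE)
## End(Not run)
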
Index

bow
guess_basename
html_attrs_dfr
is.polite (bow)
nod
politely
print.polite
rip
scrape
set_rip_delay (set_scrape_delay)
set_scrape_delay
use_manners
