Scraping refers to using a computer to retrieve the code of a webpage. Once the code is obtained, it must be parsed into a useful form for further use in R.
Base R does not have many of the tools required for these processes, so scraping and parsing are typically done with packages. Some packages are most useful for scraping (RSelenium, httr, curl, RCurl), some for parsing (XML, xml2), and some for both (rvest).
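As a rough illustration of the parsing step, the sketch below uses xml2 on a small HTML string. The markup, the `item` class, and the variable names are invented for this example; in practice the string would be the page source retrieved by one of the scraping packages above.

```r
library(xml2)

# Hypothetical stand-in for a downloaded page. A real page source would
# come from a request made with httr, curl, RCurl, or rvest.
html <- '<html><body>
  <h1>Fruit prices</h1>
  <ul>
    <li class="item">Apple</li>
    <li class="item">Pear</li>
  </ul>
</body></html>'

# Parse the raw markup into a document that can be queried.
page <- read_html(html)

# Select nodes with XPath, then extract their text as an ordinary
# character vector for further use in R.
items <- xml_find_all(page, "//li[@class='item']")
item_names <- xml_text(items)
title <- xml_text(xml_find_first(page, "//h1"))
```

Here `item_names` is a plain character vector and `title` a single string, so both can be passed on to the usual R tools.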
A related process is scraping a web API, which, unlike a webpage, returns data intended to be machine-readable. Many of the same packages are used for both.
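To show why API responses are easier to work with, here is a minimal sketch that parses a JSON body. It assumes the jsonlite package, which is not among the packages listed above, and the response text is made up for this example; a real body would typically be fetched with a package such as httr.

```r
library(jsonlite)

# Hypothetical response body. With httr, a real one could be obtained
# with something like:
#   resp <- httr::GET("https://example.com/api/weather")  # placeholder URL
#   body <- httr::content(resp, as = "text")
body <- '{"city": "Oslo", "temps": [1.5, 2, 3.25]}'

# Because the data is already structured, one call converts it into
# R objects: a named list holding a string and a numeric vector.
result <- fromJSON(body)
city <- result$city
mean_temp <- mean(result$temps)
```

No XPath or CSS selectors are needed here, since the API has already separated the data from any presentation markup.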
Some websites object to being scraped, whether due to increased server loads or concerns about data ownership. If a website forbids scraping in its Terms of Use, scraping it violates those terms and, depending on the jurisdiction, may be illegal.