Web scraping and parsing


Scraping means programmatically retrieving the source code (typically HTML) of a webpage. Once retrieved, that code must be parsed into a structure that R can work with.

Base R does not have many of the tools required for these processes, so scraping and parsing are typically done with packages. Some packages are most useful for scraping (RSelenium, httr, curl, RCurl), some for parsing (XML, xml2), and some for both (rvest).
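As a minimal sketch of the parsing step, the example below uses xml2 to pull text out of a small HTML document. The HTML string, its class name `item`, and the variable names are all made up for illustration; an inline string stands in for a downloaded page so the example runs without network access.

```r
library(xml2)

# Inline HTML stands in for a downloaded page (hypothetical content),
# so this sketch runs offline. read_html() also accepts a URL or file path.
html <- '<html><body>
  <h1>Example page</h1>
  <ul>
    <li class="item">alpha</li>
    <li class="item">beta</li>
  </ul>
</body></html>'

page <- read_html(html)

# XPath query: every <li> element whose class attribute is "item"
items <- xml_find_all(page, "//li[@class='item']")
item_text <- xml_text(items)

# First <h1> element in the document
title <- xml_text(xml_find_first(page, "//h1"))

print(title)
print(item_text)
```

For a live site, the same `xml_find_all()` and `xml_text()` calls would apply after passing the page's URL to `read_html()`; rvest offers a similar workflow with CSS selectors.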

A related task is querying a web API, which, unlike a webpage, returns data intended to be machine-readable, commonly as JSON or XML. Many of the same packages are used for both.
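To illustrate the difference, here is a hedged sketch of handling an API-style response. It assumes the API returns JSON and uses the jsonlite package, which is not among the packages listed above but is a common choice for this step. The response body and its field names (`results`, `id`, `name`) are invented for the example.

```r
library(jsonlite)

# A literal JSON string stands in for an API response body (hypothetical data).
# With httr, the body could be obtained with content(GET(url), as = "text").
body <- '{"results": [{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]}'

parsed <- fromJSON(body)

# By default, fromJSON() simplifies an array of records into a data frame
results <- parsed$results

print(results)
```

Because the response is already structured, no XPath or CSS selection is needed; the parsed object can be used like any other list or data frame.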


Some websites object to being scraped, whether because of increased server load or concerns about data ownership. If a website forbids scraping in its Terms of Use, scraping it violates those terms and may have legal consequences, depending on the jurisdiction.

