Web scraping and parsing

Download r eBook

Remarks

Scraping refers to using a computer to retrieve the code of a webpage. Once the code is obtained, it must be parsed into a useful form for further use in R.

Base R does not have many of the tools required for these processes, so scraping and parsing are typically done with packages. Some packages are most useful for scraping (RSelenium, httr, curl, RCurl), some for parsing (XML, xml2), and some for both (rvest).

A related process is scraping a web API, which unlike a webpage returns data intended to be machine-readable. Many of the same packages are used for both.

Legality

Some websites object to being scraped, whether due to increased server loads or concerns about data ownership. If a website forbids scraping in it Terms of Use, scraping it is illegal.

Related Examples

Stats

61 Contributors: 2
Friday, September 30, 2016
Licensed under: CC-BY-SA

Not affiliated with Stack Overflow
Rip Tutorial: info@zzzprojects.com

Download eBook