rvest is a package for web scraping and parsing by Hadley Wickham, inspired by Python's Beautiful Soup. It leverages Hadley's xml2 package's libxml2 bindings for HTML parsing.

As part of the tidyverse, rvest is designed to be used with the pipe. It uses xml2::read_html to scrape the HTML of a webpage, html_node and html_nodes to select elements with CSS or XPath selectors, and functions like html_text and html_table to parse the selected elements into text or data.frames.

To scrape the table of milestones from the Wikipedia page on R, the code would look like
library(rvest)

url <- 'https://en.wikipedia.org/wiki/R_(programming_language)'

# scrape HTML from website
url %>% read_html() %>%
    # select HTML tag with class="wikitable"
    html_node(css = '.wikitable') %>%
    # parse table into data.frame
    html_table() %>%
    # trim for printing
    dplyr::mutate(Description = substr(Description, 1, 70))
## Release Date Description
## 1 0.16 This is the last alpha version developed primarily by Ihaka
## 2 0.49 1997-04-23 This is the oldest source release which is currently availab
## 3 0.60 1997-12-05 R becomes an official part of the GNU Project. The code is h
## 4 0.65.1 1999-10-07 First versions of update.packages and install.packages funct
## 5 1.0 2000-02-29 Considered by its developers stable enough for production us
## 6 1.4 2001-12-19 S4 methods are introduced and the first version for Mac OS X
## 7 2.0 2004-10-04 Introduced lazy loading, which enables fast loading of data
## 8 2.1 2005-04-18 Support for UTF-8 encoding, and the beginnings of internatio
## 9 2.11 2010-04-22 Support for Windows 64 bit systems.
## 10 2.13 2011-04-14 Adding a new compiler function that allows speeding up funct
## 11 2.14 2011-10-31 Added mandatory namespaces for packages. Added a new paralle
## 12 2.15 2012-03-30 New load balancing functions. Improved serialization speed f
## 13 3.0 2013-04-03 Support for numeric index values 231 and larger on 64 bit sy
While this returns a data.frame, note that, as is typical for scraped data, there is still further cleaning to be done: here, formatting the dates, inserting NAs, and so on.
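As a minimal sketch of that cleanup (the variable name milestones and the use of dplyr below are illustrative, not part of the example above), one might write:

library(rvest)
library(dplyr)

# re-scrape the table and keep it, rather than only printing it
milestones <- 'https://en.wikipedia.org/wiki/R_(programming_language)' %>%
    read_html() %>%
    html_node(css = '.wikitable') %>%
    html_table()

milestones <- milestones %>%
    mutate(Date = na_if(Date, ''),   # turn empty cells into NA
           Date = as.Date(Date))     # parse ISO-8601 date strings

This assumes the Date column comes through as character strings in ISO format; other columns may need analogous treatment.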
Note that data in a less consistently rectangular format may require looping or other munging to parse successfully, as sketched below. If the website uses jQuery or other means to insert content dynamically, read_html may be insufficient to scrape it, and a more robust scraper like RSelenium may be necessary.
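For non-tabular content, a sketch using html_nodes and html_text might look like the following; the CSS selectors here are assumptions about the page's markup, not taken from the example above.

library(rvest)

page <- read_html('https://en.wikipedia.org/wiki/R_(programming_language)')

# html_nodes returns every match, so the result is a character vector
# rather than a rectangular data.frame
headings <- page %>%
    html_nodes(css = 'h2 .mw-headline') %>%
    html_text(trim = TRUE)

# attributes are extracted similarly, e.g. link targets
links <- page %>%
    html_nodes(css = 'a.external') %>%
    html_attr('href')

Results like these typically need a further loop or purrr::map step to assemble into a data.frame, depending on how consistent the page structure is.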