rvest is a package for web scraping and parsing by Hadley Wickham, inspired by Python's Beautiful Soup. It leverages Hadley's xml2 package's libxml2 bindings for HTML parsing.

As part of the tidyverse, rvest is designed to be used with the pipe. It uses xml2::read_html to scrape the HTML of a webpage, html_node and html_nodes to select elements with CSS or XPath selectors, and functions like html_text and html_table to parse the selected elements into text or data.frames.

To scrape the table of milestones from the Wikipedia page on R, the code would look like
library(rvest)

url <- 'https://en.wikipedia.org/wiki/R_(programming_language)'

# scrape HTML from website
url %>% read_html() %>%
    # select HTML tag with class="wikitable"
    html_node(css = '.wikitable') %>%
    # parse table into data.frame
    html_table() %>%
    # trim for printing
    dplyr::mutate(Description = substr(Description, 1, 70))
## Release Date Description
## 1 0.16 This is the last alpha version developed primarily by Ihaka
## 2 0.49 1997-04-23 This is the oldest source release which is currently availab
## 3 0.60 1997-12-05 R becomes an official part of the GNU Project. The code is h
## 4 0.65.1 1999-10-07 First versions of update.packages and install.packages funct
## 5 1.0 2000-02-29 Considered by its developers stable enough for production us
## 6 1.4 2001-12-19 S4 methods are introduced and the first version for Mac OS X
## 7 2.0 2004-10-04 Introduced lazy loading, which enables fast loading of data
## 8 2.1 2005-04-18 Support for UTF-8 encoding, and the beginnings of internatio
## 9 2.11 2010-04-22 Support for Windows 64 bit systems.
## 10 2.13 2011-04-14 Adding a new compiler function that allows speeding up funct
## 11 2.14 2011-10-31 Added mandatory namespaces for packages. Added a new paralle
## 12 2.15 2012-03-30 New load balancing functions. Improved serialization speed f
## 13 3.0 2013-04-03 Support for numeric index values 231 and larger on 64 bit sy
While this returns a data.frame, note that, as is typical for scraped data, there is still further cleaning to be done: here, formatting the dates, inserting NAs, and so on.
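As a minimal sketch of that cleanup (the variable name milestones and the use of dplyr below are illustrative, not part of the example above), one might write:

library(rvest)
library(dplyr)

# re-scrape the table and keep it, rather than only printing it
milestones <- 'https://en.wikipedia.org/wiki/R_(programming_language)' %>%
    read_html() %>%
    html_node(css = '.wikitable') %>%
    html_table()

milestones <- milestones %>%
    mutate(Date = na_if(Date, ''),   # turn empty cells into NA
           Date = as.Date(Date))     # parse ISO-8601 date strings

This assumes the Date column comes through as character strings in ISO format; other columns may need analogous treatment.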
Note that data in a less consistently rectangular format may require looping or other munging to parse successfully, as sketched below. If the website uses jQuery or other means to insert content dynamically, read_html may be insufficient to scrape it, and a more robust scraper like RSelenium may be necessary.
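For non-tabular content, a sketch using html_nodes and html_text might look like the following; the CSS selectors here are assumptions about the page's markup, not taken from the example above.

library(rvest)

page <- read_html('https://en.wikipedia.org/wiki/R_(programming_language)')

# html_nodes returns every match, so the result is a character vector
# rather than a rectangular data.frame
headings <- page %>%
    html_nodes(css = 'h2 .mw-headline') %>%
    html_text(trim = TRUE)

# attributes are extracted similarly, e.g. link targets
links <- page %>%
    html_nodes(css = 'a.external') %>%
    html_attr('href')

Results like these typically need a further loop or purrr::map step to assemble into a data.frame, depending on how consistent the page structure is.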