Web scraping is an automated, programmatic process for extracting ('scraping') data from webpages, often on a recurring basis. Also known as screen scraping or web harvesting, it can provide up-to-date data from any publicly accessible webpage. On some websites, web scraping may be illegal.
requests
A simple, but powerful package for making HTTP requests.
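As a minimal sketch of how a page might be retrieved with requests: the URL, the User-Agent string and the helper name fetch_html below are placeholders chosen for illustration, not part of the library. The prepared request at the end only builds the final URL and sends nothing over the network.

```python
import requests


def fetch_html(url, params=None, timeout=10):
    """Return the body of `url` as text (hypothetical helper)."""
    response = requests.get(
        url,
        params=params,
        timeout=timeout,  # avoid hanging forever on a slow site
        headers={"User-Agent": "example-scraper/0.1"},  # placeholder value
    )
    response.raise_for_status()  # raise on 4xx/5xx status codes
    return response.text


# Preparing a request shows how query parameters are encoded,
# without actually contacting the (placeholder) site.
prepared = requests.Request(
    "GET", "https://example.com/search", params={"q": "python"}
).prepare()
print(prepared.url)
```

Calling fetch_html("https://example.com") would then return the page's HTML as a string, ready to hand to a parser.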
requests-cache
Caching for requests; caching data is very useful. In development, it means you can avoid hitting a site unnecessarily. While running a real collection, it means that if your scraper crashes for some reason (maybe you didn't handle some unusual content on the site, or maybe the site went down) you can repeat the collection very quickly from where you left off.
scrapy
Useful for building web crawlers, where you need something more powerful than using requests and iterating through pages.
selenium
Python bindings for Selenium WebDriver, for browser automation. Using requests to make HTTP requests directly is often simpler for retrieving webpages. However, this remains a useful tool when it is not possible to replicate the desired behaviour of a site using requests alone, particularly when JavaScript is required to render elements on a page.
BeautifulSoup
Query HTML and XML documents, using a number of different parsers (Python's built-in HTML parser, html5lib, lxml or lxml.html)
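A short sketch of querying a document with BeautifulSoup, using Python's built-in parser. The HTML snippet and its class names are made up for the example.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page.
page = """
<html><body>
  <h1>Books</h1>
  <ul>
    <li class="book"><a href="/b/1">Dune</a></li>
    <li class="book"><a href="/b/2">Emma</a></li>
  </ul>
</body></html>
"""

# "html.parser" is the built-in parser; "lxml" or "html5lib" could be
# passed instead if those packages are installed.
soup = BeautifulSoup(page, "html.parser")

heading = soup.h1.get_text(strip=True)
books = [
    (li.a.get_text(strip=True), li.a["href"])
    for li in soup.find_all("li", class_="book")
]
print(heading, books)
```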
lxml
Processes HTML and XML. Can be used to query and select content from HTML documents via CSS selectors and XPath.
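The same kind of query can be sketched with lxml and XPath. The HTML snippet is again invented for illustration. Only XPath is shown here because lxml's CSS selector support relies on the separate cssselect package being installed.

```python
from lxml import html

# Made-up HTML standing in for a downloaded page.
page = """
<html><body>
  <ul>
    <li class="book"><a href="/b/1">Dune</a></li>
    <li class="book"><a href="/b/2">Emma</a></li>
  </ul>
</body></html>
"""

doc = html.fromstring(page)

# XPath expressions can return text nodes and attribute values directly.
titles = doc.xpath('//li[@class="book"]/a/text()')
hrefs = doc.xpath('//li[@class="book"]/a/@href')
print(list(zip(titles, hrefs)))
```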