Tutorial by Examples | RIP Tutorial

Basic example of using requests and lxml to scrape some data

    # For Python 2 compatibility.
    from __future__ import print_function

    import lxml.html
    import requests

    def main():
        r = requests.get("https://httpbin.org")
        html_source = r.text
        root_element = lxml.html.fromstring(html_source)
        # Note root_element.xpath() gives a *...
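The parsing step can be tried without any network access. The following is a minimal sketch that assumes an inline HTML string in place of a downloaded page; `SAMPLE_HTML`, `extract_links`, and `extract_title` are hypothetical names introduced for illustration and are not part of the original example. It shows that `xpath()` returns a list of matches, even when only one node matches.

```python
import lxml.html

# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <h1>Example page</h1>
  <a href="/first">First</a>
  <a href="/second">Second</a>
</body></html>
"""


def extract_links(html_source):
    """Return the href of every anchor tag, in document order."""
    root_element = lxml.html.fromstring(html_source)
    # xpath() returns a list of matches, which may be empty.
    return root_element.xpath("//a/@href")


def extract_title(html_source):
    """Return the text of the first <h1>, or None if there is none."""
    root_element = lxml.html.fromstring(html_source)
    headings = root_element.xpath("//h1/text()")
    return headings[0] if headings else None


print(extract_links(SAMPLE_HTML))
print(extract_title(SAMPLE_HTML))
```

Because the result is always a list, indexing it directly raises `IndexError` when nothing matches, which is why `extract_title` checks for an empty result first.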

Python Language • Web scraping with Python

Maintaining web-scraping session with requests

It is a good idea to maintain a web-scraping session to persist cookies and other parameters. Additionally, it can result in a performance improvement because requests.Session reuses the underlying TCP connection to a host:

    import requests

    with requests.Session() as session:
        # all req...
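One way to see what a session persists, without sending anything over the network, is to prepare a request and inspect it. This is a sketch under stated assumptions: `build_session` and `preview_request` are hypothetical helpers, and the user agent string, cookie name, and cookie value are made up for illustration.

```python
import requests


def build_session(user_agent, cookies=None):
    """Create a Session whose headers and cookies apply to every request it makes."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    for name, value in (cookies or {}).items():
        session.cookies.set(name, value)
    return session


def preview_request(session, url):
    """Prepare a GET request with the session's state merged in, without sending it."""
    return session.prepare_request(requests.Request("GET", url))


# Hypothetical values; nothing is sent over the network here.
with build_session("example-scraper/0.1", {"token": "abc"}) as session:
    prepared = preview_request(session, "https://httpbin.org/cookies")
    print(prepared.headers["User-Agent"])
    print(prepared.headers.get("Cookie"))
```

Calling `session.get(url)` instead of preparing the request would send it with the same merged headers and cookies, and cookies set by the server's responses would be stored on the session for later requests.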


Scraping using the Scrapy framework

First you have to set up a new Scrapy project. Enter a directory where you’d like to store your code and run:

    scrapy startproject projectName

To scrape we need a spider. Spiders define how a certain site will be scraped. Here’s the code for a spider that follows the links to the top voted questi...


Modify Scrapy user agent

Sometimes the default Scrapy user agent ("Scrapy/VERSION (+http://scrapy.org)") is blocked by the host. To change the default user agent, open settings.py, then uncomment the following line and edit it to whatever you want:

    #USER_AGENT = 'projectName (+http://www.yourdomain.com)'

For example ...
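As an illustration only, an uncommented and edited line in settings.py might look like the fragment below. The bot name and contact URL are hypothetical placeholders, not values from the original example.

```python
# settings.py (fragment)
# Hypothetical value: replace the name and URL with ones that identify your crawler.
USER_AGENT = 'examplebot/1.0 (+https://www.example.com/bot-info)'
```

A user agent that names the crawler and gives a contact URL lets site operators identify the traffic and reach you if it causes problems.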


Scraping using BeautifulSoup4

    from bs4 import BeautifulSoup
    import requests

    # Use the requests module to obtain a page
    res = requests.get('https://www.codechef.com/problems/easy')

    # Create a BeautifulSoup object
    page = BeautifulSoup(res.text, 'lxml')  # the text field contains the source of the page

    # Now use a CSS ...
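The CSS selector step can be sketched offline as follows. This is not the original example's selector or page structure: the markup in `SAMPLE_HTML`, the `table.problems a` selector, and the `list_problems` helper are assumptions made for illustration, and the standard library's `html.parser` is used in place of `lxml` so the sketch needs no extra parser.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
SAMPLE_HTML = """
<table class="problems">
  <tr><td><a href="/problems/A1">First problem</a></td></tr>
  <tr><td><a href="/problems/B2">Second problem</a></td></tr>
</table>
"""


def list_problems(html_source):
    """Return (link text, href) pairs for every link inside the problems table."""
    page = BeautifulSoup(html_source, "html.parser")
    # select() takes a CSS selector and returns a list of matching tags.
    links = page.select("table.problems a")
    return [(link.get_text(strip=True), link["href"]) for link in links]


for title, href in list_problems(SAMPLE_HTML):
    print(title, href)
```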


Scraping using Selenium WebDriver

Some websites don’t like to be scraped. In these cases you may need to simulate a real user working with a browser. Selenium launches and controls a web browser.

    from selenium import webdriver

    browser = webdriver.Firefox()  # launch firefox browser
    browser.get('http://stackoverflow.com/questi...


Simple web content download with urllib.request

The standard library module urllib.request can be used to download web content:

    from urllib.request import urlopen

    response = urlopen('http://stackoverflow.com/questions?sort=votes')
    data = response.read()

    # The received bytes should usually be decoded according to the response's characte...
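The decoding step can be sketched as below. `fetch_text` is a hypothetical helper, and the fallback to UTF-8 when the response declares no charset is an assumption rather than something the original example states. The demonstration uses a `data:` URL, which `urlopen` handles locally, so it runs without network access.

```python
from urllib.request import urlopen


def fetch_text(url, default_charset="utf-8"):
    """Download a URL and decode the body with the charset the response declares."""
    with urlopen(url) as response:
        data = response.read()
        # get_content_charset() reads the charset parameter of the
        # Content-Type header and returns None when it is absent.
        charset = response.headers.get_content_charset() or default_charset
    return data.decode(charset)


# Offline demonstration: a data: URL carrying percent-encoded UTF-8 bytes.
sample = fetch_text("data:text/html;charset=utf-8,%3Cp%3Ecaf%C3%A9%3C%2Fp%3E")
print(sample)
```

The same function can be called with an `http://` or `https://` URL, in which case the charset comes from the server's Content-Type header.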


Scraping with curl

Imports:

    from subprocess import Popen, PIPE
    from lxml import etree
    from io import StringIO

Downloading:

    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
    url = 'http://stackoverflow.com'
    get = Popen(['curl...
