Tutorial by Examples

# For Python 2 compatibility. from __future__ import print_function import lxml.html import requests def main(): r = requests.get("https://httpbin.org") html_source = r.text root_element = lxml.html.fromstring(html_source) # Note root_element.xpath() gives a *...
It is a good idea to maintain a web-scraping session to persist the cookies and other parameters. Additionally, it can result into a performance improvement because requests.Session reuses the underlying TCP connection to a host: import requests with requests.Session() as session: # all req...
First you have to set up a new Scrapy project. Enter a directory where you’d like to store your code and run: scrapy startproject projectName To scrape we need a spider. Spiders define how a certain site will be scraped. Here’s the code for a spider that follows the links to the top voted questi...
Sometimes the default Scrapy user agent ("Scrapy/VERSION (+http://scrapy.org)") is blocked by the host. To change the default user agent open settings.py, uncomment and edit the following line to what ever you want. #USER_AGENT = 'projectName (+http://www.yourdomain.com)' For example ...
from bs4 import BeautifulSoup import requests # Use the requests module to obtain a page res = requests.get('https://www.codechef.com/problems/easy') # Create a BeautifulSoup object page = BeautifulSoup(res.text, 'lxml') # the text field contains the source of the page # Now use a CSS ...
Some websites don’t like to be scraped. In these cases you may need to simulate a real user working with a browser. Selenium launches and controls a web browser. from selenium import webdriver browser = webdriver.Firefox() # launch firefox browser browser.get('http://stackoverflow.com/questi...
The standard library module urllib.request can be used to download web content: from urllib.request import urlopen response = urlopen('http://stackoverflow.com/questions?sort=votes') data = response.read() # The received bytes should usually be decoded according the response's characte...
imports: from subprocess import Popen, PIPE from lxml import etree from io import StringIO Downloading: user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36' url = 'http://stackoverflow.com' get = Popen(['curl...

Page 1 of 1