Tutorial

January 15, 2024

Python Crawler

Web Crawling Ethics

Always check a website's crawler protocol before crawling it. To view the protocol, append /robots.txt to the site's domain name.

Excerpt from https://www.google.com/robots.txt

User-agent: *
Disallow: /search
Allow: /search/about
Allow: /search/static

# AdsBot
User-agent: AdsBot-Google
Disallow: /maps/api/js/
Allow: /maps/api/js
Disallow: /maps/api/place/js/

# Crawlers of certain social media sites are allowed to access page markup when google.com/imgres* links are shared. To learn more, please contact [email protected].
User-agent: Twitterbot
Allow: /imgres
Allow: /search
Disallow: /groups

User-agent: facebookexternalhit
Allow: /imgres
Allow: /search
Disallow: /groups

Sitemap: https://www.google.com/sitemap.xml

User-agent: * # Applies to all crawlers

Disallow: /search # Do not crawl any path that starts with /search

Allow: /search/about # Exception: paths that start with /search/about may be crawled
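
The same check can be done in code. The sketch below is one possible approach using Python's standard-library urllib.robotparser; it parses three rules copied from the excerpt above from an in-memory string, so it makes no network request. The test URLs are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# Three rules copied from the excerpt above.
rules = """\
User-agent: *
Disallow: /search
Allow: /search/about
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())  # parse in-memory lines; no request is sent

# Illustrative URLs: one under /search, one that no rule mentions.
search_ok = parser.can_fetch("*", "https://www.google.com/search?q=python")
maps_ok = parser.can_fetch("*", "https://www.google.com/maps")

print(search_ok)  # False: the path starts with /search
print(maps_ok)    # True: no rule matches, so crawling is allowed
```

For a live site, parser.set_url('https://example.com/robots.txt') followed by parser.read() would download the file instead. Note that the standard-library parser checks rules in file order, so its answer for /search/about may differ from the more specific Allow rule that Google's own matching would prefer.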

Web Crawling Steps

Acquire data
Parse data
Extract data
Store data

Libraries

Step 1. Acquire data

Install the requests library:
pip install requests

Import the requests library:
import requests

Send a request to the URL and save the response in the variable res:
res = requests.get('URL')
print(type(res))  # Output: <class 'requests.models.Response'>

requests.get() returns a Response object.

Response attributes and usage

Attribute	Description
res.status_code	HTTP status code of the response. 2xx: success / 4xx: client-side error / 5xx: server-side error
res.content	Response body as raw bytes (binary data)
res.text	Response body decoded to a string
res.encoding	Encoding used to decode res.text; it can be set manually

print(res.status_code)  # Sample output: 200
print(res.content)  # Sample output: b'\xe9\x9d\x92'
print(res.text)  # Sample output: HelloWorld
res.encoding = 'gbk'  # Decode res.text as GBK from now on
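
To see why the encoding matters, the short sketch below mimics what res.text does with res.content, using plain bytes instead of a real response. The byte string is the sample shown above; the variable names are illustrative.

```python
raw = b"\xe9\x9d\x92"  # sample bytes, standing in for res.content

# Decoding with the correct encoding recovers the original character.
as_utf8 = raw.decode("utf-8")

# Decoding the same bytes with the wrong encoding gives different text.
# errors="replace" substitutes a placeholder instead of raising an error.
as_gbk = raw.decode("gbk", errors="replace")

print(as_utf8)            # 青
print(as_utf8 == as_gbk)  # False
```

If res.text looks garbled, setting res.encoding to the page's actual encoding before reading res.text is usually the first thing to try.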

Examples

import requests

# Crawling for text
res = requests.get('https://xxx.com')
res.encoding = 'utf-8'
novel = res.text
print(novel[:200])  # print the first 200 characters
with open('test.txt', 'a+', encoding='utf-8') as file:  # append to test.txt
    file.write(novel)

# Crawling for a picture
res = requests.get('https://xxx.com/zzz.png')
pic = res.content
with open('test.png', 'wb') as photo:  # image data is binary, so open the file in 'wb' mode
    photo.write(pic)

Step 2. Parse data

Install BeautifulSoup:
pip install beautifulsoup4

Import the BeautifulSoup class:
from bs4 import BeautifulSoup

Syntax

bsObject = BeautifulSoup(string_text_being_parsed, 'parser')

Usage

import requests
from bs4 import BeautifulSoup

res = requests.get('https://xxx.com') 
soup = BeautifulSoup(res.text, 'html.parser')
print(type(soup))  # Sample output: <class 'bs4.BeautifulSoup'>; soup is a BeautifulSoup object.
print(soup)  # Prints the full HTML source code

Differences between soup and res.text

res.text is a plain string: <class 'str'>.

soup is a BeautifulSoup object: <class 'bs4.BeautifulSoup'>. Both print the same HTML, but only the parsed object can be searched by tag and attribute.
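
The difference shows up as soon as you search. A minimal sketch, using an inline HTML string as a stand-in for res.text so that it runs without a network connection:

```python
from bs4 import BeautifulSoup

# Inline stand-in for res.text (an assumption, so no request is needed).
html = "<html><body><h1>This is heading</h1></body></html>"
soup = BeautifulSoup(html, "html.parser")

text_hit = html.find("h1")  # str.find: position of the substring "h1"
tag_hit = soup.find("h1")   # BeautifulSoup.find: the parsed <h1> element

print(type(text_hit))  # <class 'int'>
print(type(tag_hit))   # <class 'bs4.element.Tag'>
print(tag_hit.text)    # This is heading
```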

Step 3. Extract data

BeautifulSoup methods

Method	Description	Example
soup.find()	Returns the first matching element.	soup.find('div', class_='books')
soup.find_all()	Returns all matching elements.	soup.find_all('div', class_='books')

Examples

Example.html

<html>
  <head>
    <meta charset="utf-8">
  </head>
  <body>
    <h1>This is heading</h1>
    <div class='hello'>Hello world!</div>
    <div class='andy'>Hello Andy!</div>
    <div class='burger'>Hello burger!</div>
    <div class='burger'>Hello cheeseburger!</div>
  </body>
</html>

Example.py

import requests
from bs4 import BeautifulSoup

res = requests.get('https://xxx.com')
print(res.status_code)

soup = BeautifulSoup(res.text,'html.parser')
item_1 = soup.find('div')	# Extract first div found
item_2 = soup.find_all('div')	# Extract all div found
item_3 = soup.find_all(class_='burger')	# Extract all with class name 'burger'

print(type(item_1))
print(item_1)
print('================================')
print(type(item_2))
print(item_2)
print('================================')
for i in item_3:
  print(type(i))
  print(i)

Output

200
<class 'bs4.element.Tag'>
<div class="hello">Hello world!</div>
================================
<class 'bs4.element.ResultSet'>
[<div class="hello">Hello world!</div>, <div class="andy">Hello Andy!</div>, <div class="burger">Hello burger!</div>, <div class="burger">Hello cheeseburger!</div>]
================================
<class 'bs4.element.Tag'>
<div class="burger">Hello burger!</div>
<class 'bs4.element.Tag'>
<div class="burger">Hello cheeseburger!</div>

A ResultSet behaves like a list: [item1, item2, item3, …]. You can loop over it, index it, and take its length.
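
A small sketch of those list-like operations, using the four div elements from Example.html as an inline string (the variable names are illustrative):

```python
from bs4 import BeautifulSoup

# The <div> elements from Example.html, inlined so the sketch runs offline.
html = """
<div class='hello'>Hello world!</div>
<div class='andy'>Hello Andy!</div>
<div class='burger'>Hello burger!</div>
<div class='burger'>Hello cheeseburger!</div>
"""
soup = BeautifulSoup(html, "html.parser")
divs = soup.find_all("div")

count = len(divs)                       # number of matches
first_text = divs[0].text               # indexing works like a list
last_two = [d.text for d in divs[-2:]]  # so does slicing

print(count)       # 4
print(first_text)  # Hello world!
print(last_two)    # ['Hello burger!', 'Hello cheeseburger!']
```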

Tag Object

Method	Description
Tag.find()	Returns the first matching tag inside this tag
Tag.find_all()	Returns all matching tags inside this tag
Tag.text	Returns the text inside the tag
Tag['attribute']	Returns the value of the named attribute, for example Tag['href']
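
The table above can be tried on a single element. This is a minimal sketch on an inline snippet modelled on the book markup used in this section; Tag.get() is shown as an optional alternative for attributes that may be missing.

```python
from bs4 import BeautifulSoup

# Inline snippet (an assumption) modelled on the book markup in this section.
html = (
    '<div class="books">'
    '<a href="https://xxx.com" class="title">The Three-Body Problem</a>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

book = soup.find("div", class_="books")  # a Tag object
link = book.find("a")                    # Tag.find searches inside that tag

link_text = link.text          # text inside the tag
link_href = link["href"]       # attribute value; raises KeyError if missing
link_name = link.get("name")   # .get() returns None if the attribute is missing

print(link_text)  # The Three-Body Problem
print(link_href)  # https://xxx.com
print(link_name)  # None
```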

Book.html

<div class="books">
    <h2><a name="scientific">Scientific fiction</a></h2>
    <a href="https://xxx.com" class="title">The Three-Body Problem</a>
    <p class="info">The Three-Body Problem (Chinese: 三体; lit. 'Three-Body') is a novel by Chinese science fiction author Liu Cixin, the first in the Remembrance of Earth's Past trilogy—though the series as a whole is often referred to as The Three-Body Problem, or simply as Three-Body.
    </p> 
    <img class="img" src="https://upload.wikimedia.org/wikipedia/en/0/0f/Threebody.jpg">
    <br>
    <br>
    <hr size="1">
</div>

<div class="books">
    <h2><a name="cosmology">Cosmology</a></h2>
    <a href="https://xxx.com" class="title">A Brief History of Time</a>
    <p class="info">A Brief History of Time: From the Big Bang to Black Holes is a book on theoretical cosmology by English physicist Stephen Hawking. It was first published in 1988. Hawking wrote the book for readers who had no prior knowledge of physics.
    </p> 
    <img class="img" src="https://upload.wikimedia.org/wikipedia/en/a/a3/BriefHistoryTime.jpg">
    <br>
    <br>
    <hr size="1">
</div>

Example.py

import requests
from bs4 import BeautifulSoup

res = requests.get('https://xxx.com')
html = res.text
soup = BeautifulSoup(html, 'html.parser')  # Parse the HTML source into a BeautifulSoup object
items = soup.find_all(class_='books')  # Extract every element with class 'books'

for item in items:
    subject = item.find('h2')  # Extract the <h2> tag
    title = item.find(class_='title')  # Extract the tag with class 'title'
    brief = item.find(class_='info')  # Extract the tag with class 'info'
    print(subject, title, brief, sep='\n')
    print(type(subject), type(title), type(brief))
    print('===========================================================================================')

# Use .text and ['href'] to extract the contents of each tag
for item in items:
    subject = item.find('h2')  # Extract the <h2> tag
    title = item.find(class_='title')  # Extract the tag with class 'title'
    brief = item.find(class_='info')  # Extract the tag with class 'info'
    print(subject.text, title.text, title['href'], brief.text, sep='\n')
    print('===========================================================================================')

Output

<h2><a name="scientific">Scientific fiction</a></h2>
<a href="https://xxx.com" class="title">The Three-Body Problem</a>
<p class="info">The Three-Body Problem (Chinese: 三体; lit. 'Three-Body') is a novel by Chinese science fiction author Liu Cixin, the first in the Remembrance of Earth's Past trilogy—though the series as a whole is often referred to as The Three-Body Problem, or simply as Three-Body. </p>
<class 'bs4.element.Tag'> <class 'bs4.element.Tag'> <class 'bs4.element.Tag'>
===========================================================================================
<h2><a name="cosmology">Cosmology</a></h2>
<a href="https://xxx.com" class="title">A Brief History of Time</a>
<p class="info">A Brief History of Time: From the Big Bang to Black Holes is a book on theoretical cosmology by English physicist Stephen Hawking. It was first published in 1988. Hawking wrote the book for readers who had no prior knowledge of physics. </p>
<class 'bs4.element.Tag'> <class 'bs4.element.Tag'> <class 'bs4.element.Tag'>
===========================================================================================
Scientific fiction
The Three-Body Problem
https://xxx.com
The Three-Body Problem (Chinese: 三体; lit. 'Three-Body') is a novel by Chinese science fiction author Liu Cixin, the first in the Remembrance of Earth's Past trilogy—though the series as a whole is often referred to as The Three-Body Problem, or simply as Three-Body. 
===========================================================================================
Cosmology
A Brief History of Time
https://xxx.com
A Brief History of Time: From the Big Bang to Black Holes is a book on theoretical cosmology by English physicist Stephen Hawking. It was first published in 1988. Hawking wrote the book for readers who had no prior knowledge of physics.
===========================================================================================

Step 4. Store data

Open function modes (File Access modes)

Mode	Description
r	Read only. Opens a text file for reading. The pointer is positioned at the beginning of the file. Raises FileNotFoundError if the file does not exist. This is the default mode.
r+	Read and write. Opens the file for reading and writing. The pointer is positioned at the beginning of the file. Raises FileNotFoundError if the file does not exist.
w	Write only. Opens the file for writing. An existing file is truncated and overwritten. The pointer is positioned at the beginning of the file. Creates the file if it does not exist.
w+	Write and read. Opens the file for reading and writing. An existing file is truncated and overwritten. The pointer is positioned at the beginning of the file. Creates the file if it does not exist.
a	Append only. Opens the file for writing. The file is created if it does not exist. The pointer is positioned at the end of the file, so new data is written after the existing data.
a+	Append and read. Opens the file for reading and writing. The file is created if it does not exist. The pointer is positioned at the end of the file, so new data is written after the existing data.
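
The difference between w and a is easiest to see by running them in sequence. The sketch below writes to a temporary directory so that it does not touch any real file; the file name and contents are illustrative.

```python
import os
import tempfile

# Work in a fresh temporary directory (illustrative file name).
path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

with open(path, "w", encoding="utf-8") as f:  # 'w' creates the file
    f.write("first\n")

with open(path, "a", encoding="utf-8") as f:  # 'a' adds to the end
    f.write("second\n")

with open(path, "r", encoding="utf-8") as f:  # 'r' reads from the start
    after_append = f.read()

with open(path, "w", encoding="utf-8") as f:  # 'w' again discards old data
    f.write("third\n")

with open(path, "r", encoding="utf-8") as f:
    after_rewrite = f.read()

print(repr(after_append))   # 'first\nsecond\n'
print(repr(after_rewrite))  # 'third\n'
```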

Save to .txt format

title = 'Sample title'  # placeholder; use the title extracted in Step 3
contents = ['First paragraph.', 'Second paragraph.']  # placeholder; use the extracted text

# Open test.txt in append-and-read mode with UTF-8 encoding.
with open('test.txt', 'a+', encoding='utf-8') as file:
    file.write(title + '\n')  # Write the title
    for content in contents:
        file.write(content + '\n\n')  # Write each item, followed by a blank line
# No explicit close() is needed: the with block closes the file automatically.
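
Putting the four steps together, a compact sketch of the whole flow might look like this. Step 1 is replaced by an inline HTML string so that the example runs offline, and the output goes to a temporary directory; the markup, file name, and variable names are all illustrative assumptions.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Step 1. Acquire data. An inline string stands in for the download here;
# with a live site this would be: html = requests.get(url).text
html = """
<div class="books">
    <a href="https://xxx.com" class="title">The Three-Body Problem</a>
</div>
<div class="books">
    <a href="https://xxx.com" class="title">A Brief History of Time</a>
</div>
"""

# Step 2. Parse data.
soup = BeautifulSoup(html, "html.parser")

# Step 3. Extract data.
titles = [a.text for a in soup.find_all("a", class_="title")]

# Step 4. Store data (one title per line, in a temporary directory).
out_path = os.path.join(tempfile.mkdtemp(), "books.txt")
with open(out_path, "w", encoding="utf-8") as f:
    for t in titles:
        f.write(t + "\n")

# Read the file back to confirm what was stored.
with open(out_path, encoding="utf-8") as f:
    saved = f.read()

print(titles)  # ['The Three-Body Problem', 'A Brief History of Time']
print(saved)
```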
