Web Crawling Ethics
Always check a website's crawler protocol before crawling it. Append /robots.txt
to the domain name to view it.
Excerpt from https://www.google.com/robots.txt:
    User-agent: *          # Applies to all crawlers
    Disallow: /search      # Do not crawl paths that begin with /search
    Allow: /search/about   # Exception: paths that begin with /search/about may be crawled
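The same check can be done programmatically. Below is a minimal sketch, assuming Python's standard-library urllib.robotparser and feeding it the three rules quoted above instead of downloading the live file. The helper name is_allowed and the test URLs are illustrative only.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted above, hard-coded so the sketch runs offline.
RULES = """\
User-agent: *
Disallow: /search
Allow: /search/about
"""


def is_allowed(url, user_agent="*", rules=RULES):
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


# A path under /search is disallowed; a path with no matching rule is allowed.
print(is_allowed("https://www.google.com/search?q=python"))
print(is_allowed("https://www.google.com/maps"))
```

For a live site, calling set_url() with the robots.txt address and then read() downloads the file, taking the place of parse().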
Web Crawling Steps
- Acquire data
- Parse data
- Extract data
- Store data
Libraries
Step 1. Acquire data
Install requests
    pip install requests
Import the requests library
    import requests
Send a request to the URL and save the response in the variable res.
    res = requests.get('URL')
    print(type(res))  # <class 'requests.models.Response'>
requests.get() returns a Response object.
Response attributes and usage
| Attribute | Explanation |
| --- | --- |
| response.status_code | Whether the request succeeded. 2xx: success / 4xx: client-side error / 5xx: server-side error |
| response.content | The response body as binary data (bytes) |
| response.text | The response body as a string |
| response.encoding | The encoding used to decode the response body into response.text |

    print(response.status_code)  # Sample output: 200
    print(response.content)      # Sample output: b'\xe9\x9d\x92...'
    print(response.text)         # Sample output: HelloWorld
    response.encoding = 'gbk'    # Set the encoding used for response.text
Examples
    import requests

    # Crawl text
    res = requests.get('https://xxx.com')
    res.encoding = 'utf-8'
    novel = res.text
    print(novel[:200])  # Print the first 200 characters
    # Append the text to test.txt; the with block closes the file automatically
    with open('test.txt', 'a+', encoding='utf-8') as file:
        file.write(novel)

    # Crawl a picture
    res = requests.get('https://xxx.com/zzz.png')
    pic = res.content
    # Picture content is binary, so the file must be opened in 'wb' mode
    with open('test.png', 'wb') as photo:
        photo.write(pic)
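To make the attribute table concrete without sending a request, the sketch below mimics two of its rows in plain Python. The helper names describe_status and decode_body are hypothetical and not part of requests. decode_body only approximates how the text of a response relates to its content and encoding.

```python
def describe_status(status_code):
    """Map an HTTP status code to the categories listed in the table."""
    if 200 <= status_code < 300:
        return "success"
    if 400 <= status_code < 500:
        return "client-side error"
    if 500 <= status_code < 600:
        return "server-side error"
    return "other"  # e.g. 1xx informational or 3xx redirection


def decode_body(content, encoding):
    """Decode raw bytes into a string, replacing undecodable bytes."""
    return content.decode(encoding, errors="replace")


raw = "青".encode("utf-8")               # bytes, comparable to res.content
print(raw)                               # b'\xe9\x9d\x92'
print(decode_body(raw, "utf-8"))         # 青
print(decode_body(raw, "gbk") == "青")   # False: the wrong encoding garbles the text
print(describe_status(200), describe_status(404), describe_status(503))
```

This is why the encoding should be set to match the page before reading the text: the same bytes decode to different characters under different encodings.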
Step 2. Parse data
Install BeautifulSoup
    pip install beautifulsoup4
Import the BeautifulSoup library
    from bs4 import BeautifulSoup
Syntax
    bsObject = BeautifulSoup(string_text_being_parsed, 'parser')
Usage
    import requests
    from bs4 import BeautifulSoup

    res = requests.get('https://xxx.com')
    soup = BeautifulSoup(res.text, 'html.parser')
    print(type(soup))  # Sample output: <class 'bs4.BeautifulSoup'>; soup is a BeautifulSoup object
    print(soup)        # Prints the entire HTML source code
Differences between soup and res.text
The type of res.text is <class 'str'>, a string.
The type of soup is <class 'bs4.BeautifulSoup'>, a BeautifulSoup object.
Step 3. Extract data
BeautifulSoup methods
| Method | Explanation | Example |
| --- | --- | --- |
| soup.find() | Extracts the first matching element. | soup.find('div', class_='books') |
| soup.find_all() | Extracts all matching elements. | soup.find_all('div', class_='books') |
Examples
Example.html
    <html>
      <head>
        <meta charset="utf-8">
      </head>
      <body>
        <h1>This is heading</h1>
        <div class='hello'>Hello world!</div>
        <div class='andy'>Hello Andy!</div>
        <div class='burger'>Hello burger!</div>
        <div class='burger'>Hello cheeseburger!</div>
      </body>
    </html>
Example.py
    import requests
    from bs4 import BeautifulSoup

    res = requests.get('https://xxx.com')
    print(res.status_code)
    soup = BeautifulSoup(res.text, 'html.parser')
    item_1 = soup.find('div')                # Extract the first div found
    item_2 = soup.find_all('div')            # Extract all divs found
    item_3 = soup.find_all(class_='burger')  # Extract all elements with class 'burger'
    print(type(item_1))
    print(item_1)
    print('================================')
    print(type(item_2))
    print(item_2)
    print('================================')
    for i in item_3:
        print(type(i))
        print(i)
Output
    200
    <class 'bs4.element.Tag'>
    <div class="hello">Hello world!</div>
    ================================
    <class 'bs4.element.ResultSet'>
    [<div class="hello">Hello world!</div>, <div class="andy">Hello Andy!</div>, <div class="burger">Hello burger!</div>, <div class="burger">Hello cheeseburger!</div>]
    ================================
    <class 'bs4.element.Tag'>
    <div class="burger">Hello burger!</div>
    <class 'bs4.element.Tag'>
    <div class="burger">Hello cheeseburger!</div>
The ResultSet data type behaves like a list: [item1, item2, item3, …]
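Because a ResultSet behaves like a list, the usual list operations apply to it. The sketch below parses a trimmed copy of Example.html from an inline string, an assumption made so that it runs without a network request. It also shows that find() returns None when nothing matches, which is worth checking before using the result.

```python
from bs4 import BeautifulSoup

# A trimmed copy of Example.html, inlined so no request is needed.
HTML = """
<html>
  <body>
    <h1>This is heading</h1>
    <div class='hello'>Hello world!</div>
    <div class='andy'>Hello Andy!</div>
    <div class='burger'>Hello burger!</div>
    <div class='burger'>Hello cheeseburger!</div>
  </body>
</html>
"""

soup = BeautifulSoup(HTML, 'html.parser')

divs = soup.find_all('div')               # ResultSet of every <div>
burgers = soup.find_all(class_='burger')  # ResultSet filtered by class

print(len(divs))                   # Number of matches
print(divs[0].text)                # Index it like a list
print([d.text for d in burgers])   # Iterate over it like a list

missing = soup.find('table')       # There is no <table> in the document
print(missing is None)             # find() returns None when nothing matches
print(len(soup.find_all('table'))) # find_all() returns an empty ResultSet instead
```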
Tag Object
| Method / attribute | Explanation |
| --- | --- |
| Tag.find() | Extracts the first matching tag |
| Tag.find_all() | Extracts all matching tags |
| Tag.text | Extracts the text inside the tag |
| Tag['attribute'] | Extracts the value of the named attribute |
Book.html
    <div class="books">
      <h2><a name="scientific">Scientific fiction</a></h2>
      <a href="https://xxx.com" class="title">The Three-Body Problem</a>
      <p class="info">The Three-Body Problem (Chinese: 三体; lit. 'Three-Body') is a novel by Chinese science fiction author Liu Cixin, the first in the Remembrance of Earth's Past trilogy—though the series as a whole is often referred to as The Three-Body Problem, or simply as Three-Body.
      </p>
      <img class="img" src="https://upload.wikimedia.org/wikipedia/en/0/0f/Threebody.jpg">
      <br>
      <br>
      <hr size="1">
    </div>
    <div class="books">
      <h2><a name="cosmology">Cosmology</a></h2>
      <a href="https://xxx.com" class="title">A Brief History of Time</a>
      <p class="info">A Brief History of Time: From the Big Bang to Black Holes is a book on theoretical cosmology by English physicist Stephen Hawking. It was first published in 1988. Hawking wrote the book for readers who had no prior knowledge of physics.
      </p>
      <img class="img" src="https://upload.wikimedia.org/wikipedia/en/a/a3/BriefHistoryTime.jpg">
      <br>
      <br>
      <hr size="1">
    </div>
Example.py
    import requests
    from bs4 import BeautifulSoup

    res = requests.get('https://xxx.com')
    html = res.text
    soup = BeautifulSoup(html, 'html.parser')  # Parse the HTML source into a BeautifulSoup object
    items = soup.find_all(class_='books')      # Extract every element with class 'books'
    for item in items:
        subject = item.find('h2')          # Extract the <h2> tag
        title = item.find(class_='title')  # Extract the tag with class 'title'
        brief = item.find(class_='info')   # Extract the tag with class 'info'
        print(subject, '\n', title, '\n', brief)
        print(type(subject), type(title), type(brief))
        print('===========================================================================================')
    print('===========================================================================================')
    # Use .text and ['href'] to extract the contents of a tag
    for item in items:
        subject = item.find('h2')          # Extract the <h2> tag
        title = item.find(class_='title')  # Extract the tag with class 'title'
        brief = item.find(class_='info')   # Extract the tag with class 'info'
        print(subject.text, '\n', title.text, '\n', title['href'], '\n', brief.text)
        print('===========================================================================================')
Output
    <h2><a name="scientific">Scientific fiction</a></h2>
    <a href="https://xxx.com" class="title">The Three-Body Problem</a>
    <p class="info">The Three-Body Problem (Chinese: 三体; lit. 'Three-Body') is a novel by Chinese science fiction author Liu Cixin, the first in the Remembrance of Earth's Past trilogy—though the series as a whole is often referred to as The Three-Body Problem, or simply as Three-Body. </p>
    <class 'bs4.element.Tag'> <class 'bs4.element.Tag'> <class 'bs4.element.Tag'>
    ===========================================================================================
    <h2><a name="cosmology">Cosmology</a></h2>
    <a href="https://xxx.com" class="title">A Brief History of Time</a>
    <p class="info">A Brief History of Time: From the Big Bang to Black Holes is a book on theoretical cosmology by English physicist Stephen Hawking. It was first published in 1988. Hawking wrote the book for readers who had no prior knowledge of physics. </p>
    <class 'bs4.element.Tag'> <class 'bs4.element.Tag'> <class 'bs4.element.Tag'>
    ===========================================================================================
    ===========================================================================================
    Scientific fiction
    The Three-Body Problem
    https://xxx.com
    The Three-Body Problem (Chinese: 三体; lit. 'Three-Body') is a novel by Chinese science fiction author Liu Cixin, the first in the Remembrance of Earth's Past trilogy—though the series as a whole is often referred to as The Three-Body Problem, or simply as Three-Body.
    ===========================================================================================
    Cosmology
    A Brief History of Time
    https://xxx.com
    A Brief History of Time: From the Big Bang to Black Holes is a book on theoretical cosmology by English physicist Stephen Hawking. It was first published in 1988. Hawking wrote the book for readers who had no prior knowledge of physics.
    ===========================================================================================
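The same pattern extends to other attributes, such as the src of each <img>. Below is a minimal sketch, assuming a shortened inline copy of Book.html with the info paragraphs omitted for brevity. It uses Tag.get() alongside Tag['attribute'] because indexing raises KeyError when the attribute is absent, whereas get() returns None.

```python
from bs4 import BeautifulSoup

# Shortened copy of Book.html (the <p class="info"> text is omitted).
HTML = """
<div class="books">
  <h2><a name="scientific">Scientific fiction</a></h2>
  <a href="https://xxx.com" class="title">The Three-Body Problem</a>
  <img class="img" src="https://upload.wikimedia.org/wikipedia/en/0/0f/Threebody.jpg">
</div>
<div class="books">
  <h2><a name="cosmology">Cosmology</a></h2>
  <a href="https://xxx.com" class="title">A Brief History of Time</a>
  <img class="img" src="https://upload.wikimedia.org/wikipedia/en/a/a3/BriefHistoryTime.jpg">
</div>
"""

soup = BeautifulSoup(HTML, 'html.parser')

books = []
for item in soup.find_all(class_='books'):
    title = item.find(class_='title')
    image = item.find('img')
    books.append({
        'subject': item.find('h2').text,  # Text of the <a> nested inside <h2>
        'title': title.text,
        'link': title['href'],            # Raises KeyError if href is missing
        'image': image['src'],
        'alt': image.get('alt'),          # None, because <img> has no alt attribute
    })

for book in books:
    print(book)
```

Collecting each book into a dictionary keeps the extracted fields together, which makes the storage step easier.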
Step 4. Store data
Open function modes (File Access modes)
| Mode | Explanation |
| --- | --- |
| r | Read only. Opens a text file for reading. The pointer is positioned at the beginning of the file. Raises an error (FileNotFoundError) if the file does not exist. This is the default mode. |
| r+ | Read and write. Opens the file for reading and writing. The pointer is positioned at the beginning of the file. Raises an error (FileNotFoundError) if the file does not exist. |
| w | Write only. Opens the file for writing. If the file exists, its data is truncated and overwritten. The pointer is positioned at the beginning of the file. Creates the file if it does not exist. |
| w+ | Write and read. Opens the file for reading and writing. If the file exists, its data is truncated and overwritten. The pointer is positioned at the beginning of the file. Creates the file if it does not exist. |
| a | Append only. Opens the file for writing. The file is created if it does not exist. The pointer is positioned at the end of the file, so new data is written after the existing data. |
| a+ | Append and read. Opens the file for reading and writing. The file is created if it does not exist. The pointer is positioned at the end of the file, so new data is written after the existing data. |
Save to .txt format
    # title (a string) and contents (a list of strings) are assumed to come from Step 3.
    # Open test.txt in append-and-read mode with UTF-8 encoding.
    with open('test.txt', 'a+', encoding='utf-8') as file:
        file.write(title + '\n')  # Write the title
        for content in contents:
            # Write each item in contents, followed by a blank line
            file.write(content + '\n' + '\n')
    # No explicit close() is needed: the with block closes the file automatically.
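The difference between the 'w' and 'a' modes in the table is easy to verify. Below is a minimal sketch, assuming a throwaway file in a temporary directory. The file names demo.txt and missing.txt and the sample strings are illustrative only.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.txt')

    # 'w' creates the file, or truncates it if it already exists.
    with open(path, 'w', encoding='utf-8') as file:
        file.write('first\n')

    # 'a' keeps the existing data and writes at the end.
    with open(path, 'a', encoding='utf-8') as file:
        file.write('second\n')

    with open(path, 'r', encoding='utf-8') as file:
        after_append = file.read()

    # Opening with 'w' again discards everything written so far.
    with open(path, 'w', encoding='utf-8') as file:
        file.write('third\n')

    with open(path, 'r', encoding='utf-8') as file:
        after_truncate = file.read()

    # 'r' on a file that does not exist raises FileNotFoundError.
    try:
        open(os.path.join(tmp_dir, 'missing.txt'), 'r')
        missing_raises = False
    except FileNotFoundError:
        missing_raises = True

print(repr(after_append))    # 'first\nsecond\n'
print(repr(after_truncate))  # 'third\n'
print(missing_raises)        # True
```

For a crawler that runs repeatedly, this is the practical trade-off: 'a' accumulates results across runs, while 'w' starts from a clean file each time.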
About this Post
This post is written by Andy, licensed under CC BY-NC 4.0.