
Commit 278f5d7

author
jinxin0924
committed
Rough version (粗糙版)
0 parents  commit 278f5d7

File tree

7 files changed: +149 / -0 lines changed


.idea/Crawler.iml

Lines changed: 8 additions & 0 deletions
Some generated files are not rendered by default.

.idea/encodings.xml

Lines changed: 4 additions & 0 deletions

.idea/misc.xml

Lines changed: 80 additions & 0 deletions

.idea/modules.xml

Lines changed: 8 additions & 0 deletions

.idea/scopes/scope_settings.xml

Lines changed: 5 additions & 0 deletions

.idea/vcs.xml

Lines changed: 6 additions & 0 deletions

Python网络爬虫Ver 1.0 alpha.py

Lines changed: 38 additions & 0 deletions
__author__ = 'Xing'
import re
import urllib.request

from collections import deque

queue = deque()
visited = set()

url = 'http://news.dbanotes.net'  # entry page; can be replaced with any other site

queue.append(url)
cnt = 0

while queue:
    url = queue.popleft()  # dequeue the front element
    if url in visited:
        continue  # the same URL may have been enqueued more than once
    visited |= {url}  # mark as visited

    print('Fetched: ' + str(cnt) + '  Now fetching <--- ' + url)
    cnt += 1
    if cnt > 3:
        break

    # urlopen can fail (bad URL, network error); skip such pages
    try:
        urlop = urllib.request.urlopen(url)
    except OSError:
        continue
    content_type = urlop.getheader('Content-Type')
    if content_type is None or 'html' not in content_type:
        continue

    # Use try..except so a decode failure does not abort the program
    try:
        data = urlop.read().decode('utf-8')
    except UnicodeDecodeError:
        continue

    # Extract all links on the page with a regex, skip those already
    # visited, and add the rest to the crawl queue
    linkre = re.compile('href="(.+?)"')
    for x in linkre.findall(data):
        if 'http' in x and x not in visited:
            queue.append(x)
            print('Enqueued ---> ' + x)
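The link-extraction step above can be exercised without any network access. This is a minimal sketch using the same `href="(.+?)"` pattern as the crawler; the sample HTML snippet is made up for illustration:

```python
import re

# Same pattern as the crawler; the sample HTML below is a made-up snippet
linkre = re.compile('href="(.+?)"')
sample = '<a href="http://example.com/a">A</a> <a href="/relative">B</a>'

found = linkre.findall(sample)
absolute = [x for x in found if 'http' in x]

print(found)     # ['http://example.com/a', '/relative']
print(absolute)  # ['http://example.com/a']
```

Note that the `'http' in x` filter drops relative links entirely, so the crawler only ever follows absolute URLs.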
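The crawl loop is a breadth-first traversal driven by a `deque` and a `visited` set. The same pattern can be tested on an in-memory link graph (the graph below is a made-up stand-in for fetched pages):

```python
from collections import deque

# Hypothetical link graph: page -> links found on that page
graph = {
    'a': ['b', 'c'],
    'b': ['c', 'd'],
    'c': [],
    'd': ['a'],
}

queue = deque(['a'])
visited = set()
order = []

while queue:
    node = queue.popleft()
    if node in visited:
        continue  # a node can be enqueued twice before being visited
    visited.add(node)
    order.append(node)
    for nxt in graph.get(node, []):
        if nxt not in visited:
            queue.append(nxt)

print(order)  # ['a', 'b', 'c', 'd']
```

The `if node in visited: continue` guard matters: checking only at enqueue time (as the crawler does) still allows the same URL to enter the queue twice before its first visit.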

0 commit comments
