Commit 28a2d8e

Scraping
1 parent 4148f32 commit 28a2d8e

1 file changed, +34 -0 lines changed

README.md

Lines changed: 34 additions & 0 deletions
@@ -1297,6 +1297,40 @@ from urllib.parse import quote, quote_plus, unquote, unquote_plus
```


Scraping
--------
```python
# $ pip3 install beautifulsoup4
from http.cookiejar import CookieJar
from urllib.error import HTTPError, URLError
from urllib.request import build_opener, HTTPCookieProcessor
from bs4 import BeautifulSoup

def scrape(url):
    """Returns tree of HTML elements located at URL."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    try:
        html = opener.open(url)
    except ValueError as error:
        return print(f'Malformed URL: {url}.\n{error}')
    except (HTTPError, URLError) as error:
        return print(f"Can't find URL: {url}.\n{error}")
    return BeautifulSoup(html, 'html.parser')
```
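
The error handling in `scrape()` can be tried without network access. The following is only a sketch under stated assumptions: `fetch()` is a hypothetical helper (not part of the commit) that mirrors the same opener setup but returns decoded text instead of a BeautifulSoup tree, and a `data:` URL stands in for a real page so that only the standard library is needed.

```python
from http.cookiejar import CookieJar
from urllib.error import URLError
from urllib.request import build_opener, HTTPCookieProcessor

def fetch(url):
    """Hypothetical helper: returns page text, or None if the URL can't be opened."""
    opener = build_opener(HTTPCookieProcessor(CookieJar()))
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    try:
        with opener.open(url) as response:
            return response.read().decode('utf-8')
    except ValueError as error:   # Raised for a URL that has no scheme.
        print(f'Malformed URL: {url}.\n{error}')
    except URLError as error:     # HTTPError is a subclass of URLError.
        print(f"Can't find URL: {url}.\n{error}")
    return None

# A percent-encoded data: URL is handled locally by urllib, so no request is sent.
page = fetch('data:text/html,%3Cp%3Ehello%3C/p%3E')
missing = fetch('not-a-url')
```

Here `page` holds the decoded markup, while `missing` is `None` because the second string has no URL scheme and takes the `ValueError` branch.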
```python
>>> document = scrape('https://en.wikipedia.org/wiki/Python_(programming_language)')
>>> table = document.find('table', class_='infobox vevent')
>>> rows = table.find_all('tr')
>>> rows[11].find('a')['href']
'https://www.python.org/'
>>> rows[6].find('div').text.split()[0]
'3.7.2'
```
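
The row positions used above (`rows[11]`, `rows[6]`) depend on the layout of the live page. As a hedged illustration, the same `find()` and `find_all()` calls can be run against a small inline document and keyed by header text instead of position. The `HTML` string below is a made-up stand-in, not Wikipedia's actual markup, and the `cells` mapping is an assumption of this sketch rather than something the commit does.

```python
# $ pip3 install beautifulsoup4
from bs4 import BeautifulSoup

# Hypothetical stand-in for a fetched page; not Wikipedia's actual markup.
HTML = '''
<table class="infobox vevent">
  <tr><th>Stable release</th><td><div>3.7.2 / release date</div></td></tr>
  <tr><th>Website</th><td><a href="https://www.python.org/">python.org</a></td></tr>
</table>
'''

document = BeautifulSoup(HTML, 'html.parser')
table = document.find('table', class_='infobox vevent')

# Map each row's header text to its data cell instead of relying on row indices.
cells = {row.find('th').text: row.find('td') for row in table.find_all('tr')}
website = cells['Website'].find('a')['href']
version = cells['Stable release'].find('div').text.split()[0]
```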


Web
---
```python
