Commit d09ab2a ("updated")

1 parent cca891d

File tree: 5 files changed (+379, -32 lines)

README.md

Lines changed: 7 additions & 3 deletions

@@ -34,10 +34,14 @@ pip install -r requirements.txt
 ```
 
 ## The Python Snippets
-The snippets are organized by topic.
+The Python snippets are organized by topic.
 
 __GeoJSON__
-- [GeoJSON Example - Stations](geojson/geojson_stations.ipynb)
+- [Convert a pandas DataFrame to GeoJSON (Stations)](geojson/geojson_stations.ipynb)
 
 __RSS__
-- [Parse a RSS feed with feedparser](rss/feedparser.ipynb)
+- [Parse an RSS feed with feedparser](rss/feedparser.ipynb)
+
+__HTTP__
+- [Load the content from a website with urllib.request](http/urlib.ipynb)
+- [Extract the text from an HTML document with Beautiful Soup](http/beautifulsoup4.ipynb)

geojson/geojson_stations.ipynb

Lines changed: 1 addition & 1 deletion

@@ -4,7 +4,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "## GeoJSON Example - Stations\n",
+ "## Convert a pandas DataFrame to GeoJSON (Stations)\n",
  "\n",
  "In this Python snippet we use stations (geographic data) from the Swiss public transportation network and convert the data to a GeoJSON (http://geojson.org/) file.\n",
  "\n",

http/beautifulsoup4.ipynb

Lines changed: 183 additions & 0 deletions

@@ -0,0 +1,183 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Extract the text from an HTML document with Beautiful Soup\n",
    "\n",
    "We start with the following HTML document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "html_doc = \"\"\"\n",
    "<html><head><title>My new page</title></head>\n",
    "<body>\n",
    "<p class=\"title\"><b>Cool my new page</b></p>\n",
    "\n",
    "<p class=\"story\">I have written the following articles:\n",
    "<a href=\"http://foo.bar/A1\" class=\"sister\" id=\"link1\">A1</a>,\n",
    "<a href=\"http://foo.bar/A2\" class=\"sister\" id=\"link2\">A2</a>\n",
    "<a href=\"http://foo.bar/A3\" class=\"sister\" id=\"link3\">A3</a>;\n",
    "</p>\n",
    "\n",
    "<p class=\"story\">...</p>\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's parse an HTML document. Here we already have the page as a string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "soup = BeautifulSoup(html_doc, 'html.parser')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or you could load the content with urllib.request.\n",
    "\n",
    "```\n",
    "import urllib.request\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "url = 'https://foo.bar/'\n",
    "req = urllib.request.Request(url, headers={'User-Agent': \"Magic Browser\"})\n",
    "con = urllib.request.urlopen(req)\n",
    "html_doc = con.read()\n",
    "\n",
    "soup = BeautifulSoup(html_doc, 'html.parser')\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can get the title from the page"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'My new page'"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.title.string"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "or all the links in the HTML document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "http://foo.bar/A1\n",
      "http://foo.bar/A2\n",
      "http://foo.bar/A3\n"
     ]
    }
   ],
   "source": [
    "for link in soup.find_all('a'):\n",
    "    print(link.get('href'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Of course it is very easy to get the plain text from the document without the HTML tags."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'\\nMy new page\\n\\nCool my new page\\nI have written the following articles:\\nA1,\\nA2\\nA3;\\n\\n...\\n'"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.text"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
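The notebook's link extraction relies on Beautiful Soup. When bs4 is not installed, the same idea can be sketched with the standard-library html.parser module; this is a dependency-free alternative, not what the notebook itself uses:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, similar to soup.find_all('a')."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; keep every href on an <a> tag.
        if tag == 'a':
            self.links.extend(value for name, value in attrs if name == 'href')


parser = LinkCollector()
parser.feed('<p><a href="http://foo.bar/A1">A1</a>, '
            '<a href="http://foo.bar/A2">A2</a></p>')
print(parser.links)  # ['http://foo.bar/A1', 'http://foo.bar/A2']
```

Beautiful Soup remains the better choice for messy real-world HTML, since it also builds a navigable tree (`soup.title`, `soup.text`) rather than just streaming events.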

http/urlib.ipynb

Lines changed: 151 additions & 0 deletions

@@ -0,0 +1,151 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Load the content from a website with urllib.request\n",
    "\n",
    "In this example we use _urllib.request_ to load the content from a website."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import urllib.request"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some sites will block requests from urllib, so we set a custom 'User-Agent' header\n",
    "to load the content from the remote site."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "url = 'https://medium.com/tag/machine-learning'\n",
    "req = urllib.request.Request(url, headers={'User-Agent': \"Magic Browser\"})\n",
    "con = urllib.request.urlopen(req)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's check the HTTP status and the message."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "200 OK\n"
     ]
    }
   ],
   "source": [
    "print(con.status, con.msg)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also read a specific HTTP response header."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'text/html; charset=utf-8'"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "con.getheader('Content-Type')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can load the content from the website."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "b'<!DOCTYPE html><html xmlns:cc=\"http://creativecommons.org/ns#\"><head prefix=\"og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# medium-com: http://ogp.me/ns/fb/medium-com#\"><meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\"><meta name=\"viewport\" content=\"width=device-width, initial-scale=1\"><title>Machine Learning \\xe2\\x80\\x93 Medium</title><link rel=\"canonical\" href=\"https://medium.com/tag/machine-learning\"><link id=\"feedLink\" rel=\"alternate\" type=\"application/rss+xml\" title=\"RSS\" href=\"/fee'"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text = con.read()\n",
    "text[:500]"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
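The notebook sends its request immediately. The effect of the custom header can be seen without any network traffic by inspecting the Request object before calling urlopen; note that urllib.request normalizes header names to capitalized form, so 'User-Agent' is stored as 'User-agent':

```python
import urllib.request

# Same URL and header as the notebook, but the request is only built, not sent.
url = 'https://medium.com/tag/machine-learning'
req = urllib.request.Request(url, headers={'User-Agent': 'Magic Browser'})

print(req.get_full_url())
print(req.has_header('User-agent'))   # True: name normalized by Request
print(req.get_header('User-agent'))   # Magic Browser
```

Only after `urllib.request.urlopen(req)` does the response object exist, with `con.status`, `con.msg`, and `con.getheader(...)` as shown in the notebook.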
