Commit d09ab2a ("updated")

1 parent cca891d

File tree: 5 files changed (+379, -32 lines)

README.md

Lines changed: 7 additions & 3 deletions

@@ -34,10 +34,14 @@ pip install -r requirements.txt
 ```
 
 ## The Python Snippets
-The snippets are organized by topic.
+The Python snippets are organized by topic.
 
 __GeoJSON__
-- [GeoJSON Example - Stations](geojson/geojson_stations.ipynb)
+- [Convert a pandas DataFrame to GeoJSON (Stations)](geojson/geojson_stations.ipynb)
 
 __RSS__
-- [Parse a RSS feed with feedparser](rss/feedparser.ipynb)
+- [Parse an RSS feed with feedparser](rss/feedparser.ipynb)
+
+__HTTP__
+- [Load the content from a website with urllib.request](http/urlib.ipynb)
+- [Extract the text from an HTML document with Beautiful Soup](http/beautifulsoup4.ipynb)

geojson/geojson_stations.ipynb

Lines changed: 1 addition & 1 deletion

@@ -4,7 +4,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "## GeoJSON Example - Stations\n",
+ "## Convert a pandas DataFrame to GeoJSON (Stations)\n",
  "\n",
  "In this Python snippet we use stations (geographic data) from the Swiss public transportation network and convert the data to a GeoJSON (http://geojson.org/) file.\n",
  "\n",

http/beautifulsoup4.ipynb

Lines changed: 183 additions & 0 deletions

@@ -0,0 +1,183 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Extract the text from an HTML document with Beautiful Soup\n",
    "\n",
    "We start with the following HTML document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "html_doc = \"\"\"\n",
    "<html><head><title>My new page</title></head>\n",
    "<body>\n",
    "<p class=\"title\"><b>Cool my new page</b></p>\n",
    "\n",
    "<p class=\"story\">I have written the following articles:\n",
    "<a href=\"http://foo.bar/A1\" class=\"sister\" id=\"link1\">A1</a>,\n",
    "<a href=\"http://foo.bar/A2\" class=\"sister\" id=\"link2\">A2</a>\n",
    "<a href=\"http://foo.bar/A3\" class=\"sister\" id=\"link3\">A3</a>;\n",
    "</p>\n",
    "\n",
    "<p class=\"story\">...</p>\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's parse an HTML document. Here we already have the page as a string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "soup = BeautifulSoup(html_doc, 'html.parser')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or you could load the content with urllib.request.\n",
    "\n",
    "```\n",
    "import urllib.request\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "url = 'https://foo.bar/'\n",
    "req = urllib.request.Request(url, headers={'User-Agent': \"Magic Browser\"})\n",
    "con = urllib.request.urlopen(req)\n",
    "html_doc = con.read()\n",
    "\n",
    "soup = BeautifulSoup(html_doc, 'html.parser')\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can get the title from the page"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'My new page'"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.title.string"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "or all the links in the HTML document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "http://foo.bar/A1\n",
      "http://foo.bar/A2\n",
      "http://foo.bar/A3\n"
     ]
    }
   ],
   "source": [
    "for link in soup.find_all('a'):\n",
    "    print(link.get('href'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Of course it is very easy to get the plain text from the document without the HTML tags."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'\\nMy new page\\n\\nCool my new page\\nI have written the following articles:\\nA1,\\nA2\\nA3;\\n\\n...\\n'"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.text"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
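The notebook's link extraction relies on Beautiful Soup. When bs4 is not installed, the same idea can be sketched with the standard-library html.parser module; this is a dependency-free alternative, not what the notebook itself uses:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, similar to soup.find_all('a')."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; keep every href on an <a> tag.
        if tag == 'a':
            self.links.extend(value for name, value in attrs if name == 'href')


parser = LinkCollector()
parser.feed('<p><a href="http://foo.bar/A1">A1</a>, '
            '<a href="http://foo.bar/A2">A2</a></p>')
print(parser.links)  # ['http://foo.bar/A1', 'http://foo.bar/A2']
```

Beautiful Soup remains the better choice for messy real-world HTML, since it also builds a navigable tree (`soup.title`, `soup.text`) rather than just streaming events.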

http/urlib.ipynb

Lines changed: 151 additions & 0 deletions

@@ -0,0 +1,151 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Load the content from a website with urllib.request\n",
    "\n",
    "In this example we use _urllib.request_ to load the content from a website."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import urllib.request"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some sites will block requests from urllib, so we set a custom 'User-Agent' header\n",
    "to load the content from the remote site."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "url = 'https://medium.com/tag/machine-learning'\n",
    "req = urllib.request.Request(url, headers={'User-Agent': \"Magic Browser\"})\n",
    "con = urllib.request.urlopen(req)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's check the HTTP status and the message."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "200 OK\n"
     ]
    }
   ],
   "source": [
    "print(con.status, con.msg)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also read a specific HTTP response header."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'text/html; charset=utf-8'"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "con.getheader('Content-Type')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can load the content from the website."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "b'<!DOCTYPE html><html xmlns:cc=\"http://creativecommons.org/ns#\"><head prefix=\"og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# medium-com: http://ogp.me/ns/fb/medium-com#\"><meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\"><meta name=\"viewport\" content=\"width=device-width, initial-scale=1\"><title>Machine Learning \\xe2\\x80\\x93 Medium</title><link rel=\"canonical\" href=\"https://medium.com/tag/machine-learning\"><link id=\"feedLink\" rel=\"alternate\" type=\"application/rss+xml\" title=\"RSS\" href=\"/fee'"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text = con.read()\n",
    "text[:500]"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
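The notebook sends its request immediately. The effect of the custom header can be seen without any network traffic by inspecting the Request object before calling urlopen; note that urllib.request normalizes header names to capitalized form, so 'User-Agent' is stored as 'User-agent':

```python
import urllib.request

# Same URL and header as the notebook, but the request is only built, not sent.
url = 'https://medium.com/tag/machine-learning'
req = urllib.request.Request(url, headers={'User-Agent': 'Magic Browser'})

print(req.get_full_url())
print(req.has_header('User-agent'))   # True: name normalized by Request
print(req.get_header('User-agent'))   # Magic Browser
```

Only after `urllib.request.urlopen(req)` does the response object exist, with `con.status`, `con.msg`, and `con.getheader(...)` as shown in the notebook.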
