@@ -6,13 +6,13 @@ Web Scraping

Web sites are written using HTML, which means that each web page is a
structured document. Sometimes it would be great to obtain some data from
- them and preserve the structure while we're at it, but this isn't always easy.
- It's not often that web sites provide their data in comfortable formats
- such as ``.csv``.
+ them and preserve the structure while we're at it. Web sites
+ don't always provide their data in comfortable formats such as ``.csv``.

- This is where web scraping comes in. Web scraping is the practice of using
+ This is where web scraping comes in. Web scraping is the practice of using a
computer program to sift through a web page and gather the data that you need
- in a format most useful to you.
+ in a format most useful to you while at the same time preserving the structure
+ of the data.

lxml and Requests
-----------------
@@ -43,12 +43,12 @@ we can go over two different ways: XPath and CSSSelect. In this example, I
will focus on the former.

XPath is a way of locating information in structured documents such as
- HTML or XML pages. A good introduction to XPath is `here <http://www.w3schools.com/xpath/default.asp>`_.
+ HTML or XML documents. A good introduction to XPath is on `W3Schools <http://www.w3schools.com/xpath/default.asp>`_.

- One can also use various tools for obtaining the XPath of elements such as
- FireBug for Firefox or in Chrome you can right click an element, choose
- 'Inspect element', highlight the code and the right click again and choose
- 'Copy XPath'.
+ There are also various tools for obtaining the XPath of elements such as
+ FireBug for Firefox or if you're using Chrome you can right click an
+ element, choose 'Inspect element', highlight the code and then right
+ click again and choose 'Copy XPath'.

After a quick analysis, we see that in our page the data is contained in
two elements - one is a div with title 'buyer-name' and the other is a
@@ -90,10 +90,10 @@ Lets see what we got exactly:
'$15.00', '$114.07', '$10.09']

Congratulations! We have successfully scraped all the data we wanted from
- a web page using lxml and we have it stored in memory as two lists. Now we
- can either continue our work on it, analyzing it using python or we can
- export it to a file and share it with friends.
+ a web page using lxml and Requests. We have it stored in memory as two
+ lists. Now we can do all sorts of cool stuff with it: we can analyze it
+ using Python or we can save it to a file and share it with the world.

- A cool idea to think about is writing a script to iterate through the rest
- of the pages of this example data set or making this application use
- threads to improve its speed.
+ A cool idea to think about is modifying this script to iterate through
+ the rest of the pages of this example dataset or rewriting this
+ application to use threads for improved speed.
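
The two follow-up ideas in the last hunk (saving the lists to a file, and walking the remaining pages with threads) could be sketched roughly as below. This is only an illustration under stated assumptions, not the guide's own code: ``rows_to_csv``, ``scrape_pages``, ``scrape_one`` and ``fake_scrape`` are hypothetical names, and ``fake_scrape`` stands in for the real Requests + lxml step so that the sketch runs without network access.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor


def rows_to_csv(buyers, prices):
    """Pair each buyer with its price and return the result as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["buyer", "price"])
    writer.writerows(zip(buyers, prices))
    return buffer.getvalue()


def scrape_pages(urls, scrape_one, max_workers=4):
    """Call scrape_one(url) for every URL on a small thread pool.

    scrape_one is a hypothetical callable returning (buyers, prices)
    for one page. pool.map yields results in the same order as urls.
    """
    buyers, prices = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for page_buyers, page_prices in pool.map(scrape_one, urls):
            buyers.extend(page_buyers)
            prices.extend(page_prices)
    return buyers, prices


def fake_scrape(url):
    # Stand-in for the real per-page scraping code; the values are made up.
    return ["buyer@" + url], ["$15.00"]


if __name__ == "__main__":
    all_buyers, all_prices = scrape_pages(["page1", "page2"], fake_scrape)
    print(rows_to_csv(all_buyers, all_prices))
```

To write an actual file, the same ``csv.writer`` calls can target ``open("buyers.csv", "w", newline="")`` instead of the in-memory buffer. Threads suit this job because each page fetch spends most of its time waiting on the network.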