XPath Navigation
WEB SCRAPING IN PYTHON
Thomas Laetsch
Data Scientist, NYU
Slashes and Brackets
Single forward slash / looks forward one generation (direct children only)
Double forward slash // looks forward all future generations (children, grandchildren, and so on)
Square brackets [] help narrow in on specific elements
To Bracket or not to Bracket
xpath = '/html/body'
xpath = '/html[1]/body[1]'
Both expressions give the same selection
A Body of P
xpath = '/html/body/p'
The Birds and the Ps
xpath = '/html/body/div/p'
xpath = '/html/body/div/p[2]'
Double Slashing the Brackets
xpath = '//p'
xpath = '//p[1]'
The Wildcard
xpath = '/html/body/*'
The asterisk * is the "wildcard": it matches any element, whatever its tag
Xposé
Off the Beaten XPath
(At)tribute
@ represents "attribute"
@class
@id
@href
Brackets and Attributes
xpath = '//p[@class="class-1"]'
xpath = '//*[@id="uid"]'
xpath = '//div[@id="uid"]/p[2]'
Content with Contains
XPath contains notation:
contains( @attri-name, "string-expr" )
Contain This
xpath = '//*[contains(@class,"class-1")]'
xpath = '//*[@class="class-1"]'
Get Classy
xpath = '/html/body/div/p[2]'
xpath = '/html/body/div/p[2]/@class'
End of the Path
Introduction to the scrapy Selector
Setting up a Selector
from scrapy import Selector
html = '''
<html>
<body>
<div class="hello datacamp">
<p>Hello World!</p>
</div>
<p>Enjoy DataCamp!</p>
</body>
</html>
'''
sel = Selector( text = html )
Created a scrapy Selector object using a string with the html code
The selector sel has selected the entire html document
Selecting Selectors
We can call the xpath method on a Selector to create new Selectors for specific pieces of the HTML code
The return value is a SelectorList of Selector objects
sel.xpath("//p")
# outputs the SelectorList:
[<Selector xpath='//p' data='<p>Hello World!</p>'>,
<Selector xpath='//p' data='<p>Enjoy DataCamp!</p>'>]
Extracting Data from a SelectorList
Use the extract() method
>>> sel.xpath("//p")
out: [<Selector xpath='//p' data='<p>Hello World!</p>'>,
<Selector xpath='//p' data='<p>Enjoy DataCamp!</p>'>]
>>> sel.xpath("//p").extract()
out: [ '<p>Hello World!</p>',
'<p>Enjoy DataCamp!</p>' ]
We can use extract_first() to get the first element of the list
>>> sel.xpath("//p").extract_first()
out: '<p>Hello World!</p>'
Extracting Data from a Selector
ps = sel.xpath('//p')
second_p = ps[1]
second_p.extract()
out: '<p>Enjoy DataCamp!</p>'
Select This Course!
"Inspecting the HTML"
Thomas Laetsch, PhD
Data Scientist, NYU
"Source" = HTML Code
Inspecting Elements
HTML text to Selector
from scrapy import Selector
import requests
url = 'https://www.datacamp.com/courses/all'
html = requests.get( url ).content
sel = Selector( text = html )
You Know Our Secrets