Slide10 Part1

This chapter discusses web scraping using APIs and the Selenium library. It introduces APIs and how they can be used for web scraping by making HTTP requests and accessing structured data formats. Selenium is introduced as a tool for automating web browsers for tasks like clicking links and extracting page content. Code examples are provided for accessing APIs, searching API data, and using Selenium in Python to open browsers, find elements, click links, and get page titles and URLs. Beautiful Soup is also mentioned as a library for scraping HTML and JavaScript content.

FACULTY OF INFORMATION SYSTEMS

Course:
Web Data Analysis
(3 credits)

Lecturer: Nguyen Thon Da Ph.D.


LECTURER’S INFORMATION

Chapter 10
Working with Web-Based APIs,
Beautiful Soup and Selenium
(Part 1)



MAIN CONTENTS
- Introduction to web APIs
- Accessing web API and data formats
- Web scraping using APIs
- Introduction to Selenium
- Using Selenium for web scraping
- Hypertext Markup Language: HTML
- Using Your Browser as a Development Tool
- Cascading Style Sheets: CSS
- The Beautiful Soup Library
- Scraping JavaScript



Introduction to web APIs
A web API is an interface for websites to return information in response to requests.
- It enables websites to share information with users and third-party web applications.
- Web APIs are language-agnostic and typically return information in JSON, XML, or CSV
  formats.
- APIs are used to develop applications and often include documentation, methods, and
  libraries.
- Web APIs operate based on the HTTP protocol and often require authentication, such as
  an API key, for requests.
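
A minimal sketch of such an authenticated request in Python, assuming the requests
library; the endpoint URL, query parameters, and API key below are hypothetical
placeholders rather than a real service:

    import requests

    # Hypothetical endpoint and API key -- placeholders for illustration only.
    API_URL = "https://api.example.com/v1/books"
    API_KEY = "your_api_key_here"

    # Many web APIs take the key and filters as query parameters or headers.
    response = requests.get(
        API_URL,
        params={"q": "data analysis", "format": "json"},
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=10,
    )
    response.raise_for_status()      # fail early on HTTP errors
    data = response.json()           # structured data -- no HTML parsing required
    print(type(data))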



Introduction to web APIs
- REST (Representational State Transfer)
  - REST is a resource-oriented architectural style based on simple principles. It uses
    HTTP methods such as GET, POST, PUT, and DELETE to perform operations on resources.
  - REST uses URLs (Uniform Resource Locators) to identify resources and typically
    returns data in JSON or XML format.
  - REST is easy to understand, flexible, and suitable for lightweight, high-performance
    web applications.

- SOAP (Simple Object Access Protocol)
  - SOAP is a document-oriented data transmission protocol that uses XML to package
    messages.
  - It defines specific operations and messages through a special file called a WSDL
    (Web Services Description Language) document.
  - SOAP is commonly used in complex and reliable systems, such as enterprise web
    services.
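
As a brief sketch of how the standard HTTP methods map onto operations on a REST
resource (the base URL and resource below are hypothetical placeholders):

    import requests

    BASE = "https://api.example.com/v1"   # hypothetical REST service

    # GET: read a resource
    r = requests.get(f"{BASE}/students/42", timeout=10)
    print(r.status_code, r.json())

    # POST: create a new resource
    requests.post(f"{BASE}/students", json={"name": "An", "major": "IS"}, timeout=10)

    # PUT: replace an existing resource
    requests.put(f"{BASE}/students/42", json={"name": "An", "major": "MIS"}, timeout=10)

    # DELETE: remove a resource
    requests.delete(f"{BASE}/students/42", timeout=10)
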
Introduction to web APIs
- Benefits of web APIs
  - An API's returned data is completely specific to the request being performed, along
    with the filters or parameters that have been applied to it.
  - Tasks such as parsing HTML or XML with Python libraries such as BeautifulSoup,
    pyquery, and lxml are not always required.
  - The format of the data is structured and easy to handle.
  - Data cleaning and processing for the final listings is easier or might not be
    required at all.
  - There are significant reductions in processing time (compared to writing code,
    analyzing the web page, and applying XPath and CSS selectors to retrieve data).
  - The results are easy to process.

Accessing web API and data formats
- Making requests to the web API using a web browser
  - Case 1 – accessing a simple API (request and response)
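
Such a request/response exchange can also be reproduced outside the browser; the sketch
below (with a placeholder URL, assuming the requests library) prints the status code,
content type, and JSON body of the response:

    import requests

    # Placeholder URL -- substitute any simple public JSON API endpoint.
    url = "https://api.example.com/v1/ping"

    response = requests.get(url, timeout=10)

    print("Status code :", response.status_code)                  # e.g. 200
    print("Content-Type:", response.headers.get("Content-Type"))
    print("Body        :", response.json())                       # parsed JSON payload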


Accessing web API and data formats
- Making requests to the web API using a web browser
  - Case 2 – demonstrating RESTful API cache functionality
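
A minimal sketch of how cache behaviour might be observed, assuming a hypothetical
endpoint that returns an ETag validator; a conditional request with If-None-Match
typically gets 304 Not Modified while the cached copy is still valid:

    import requests

    url = "https://api.example.com/v1/universities"   # hypothetical cached endpoint

    # First request: full response; the server may include a cache validator (ETag).
    first = requests.get(url, timeout=10)
    etag = first.headers.get("ETag")
    print("First call :", first.status_code, "ETag:", etag)

    # Second request: send the validator back; an unchanged resource can yield 304.
    headers = {"If-None-Match": etag} if etag else {}
    second = requests.get(url, headers=headers, timeout=10)
    print("Second call:", second.status_code)   # 304 means the cached body is reusable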


Web scraping using APIs
- Example 1 – searching and collecting university names and URLs
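
A short sketch of what such a search could look like, assuming the public Hipo
universities API (http://universities.hipolabs.com/search) is the data source; if the
course notebook uses a different API, adjust the URL and field names accordingly:

    import requests

    # Assumed data source: the free universities API from Hipo Labs.
    url = "http://universities.hipolabs.com/search"

    response = requests.get(url, params={"name": "technology"}, timeout=10)
    response.raise_for_status()

    for uni in response.json():
        # Each record is a dict; 'web_pages' holds the university's URLs.
        print(uni["name"], "->", ", ".join(uni.get("web_pages", [])))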


Web scraping using APIs
- Example 2 – scraping information from GitHub events
  Demo source code: Chapter10_Ex3.ipynb (output shown in the notebook)
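
As a hedged sketch of what the notebook likely does (the exact fields printed in
Chapter10_Ex3.ipynb may differ), the public GitHub events endpoint returns a JSON list
of recent events that can be iterated over directly:

    import requests

    # Public GitHub REST API endpoint listing recent public events.
    url = "https://api.github.com/events"

    response = requests.get(url, timeout=10)
    response.raise_for_status()

    for event in response.json():
        # Each event has an id, a type (PushEvent, WatchEvent, ...) and the repo it touched.
        print(event["id"], event["type"], event["repo"]["name"], event["created_at"])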



Introduction to Selenium
Four key points about Selenium:
- Web Application Framework: Selenium is described as a web application framework; it
  provides a structured set of tools and resources for working with web applications.

- Web Scraping: Selenium can be used for web scraping activities, i.e. extracting data
  from websites.

- Browser Automation: Selenium can function as a browser automation tool; it can perform
  various tasks within a web browser without human intervention, including actions like
  clicking links, saving screenshots, downloading images, and filling out HTML <form>
  templates.

- Dynamic and Secure Web Services: Selenium is suitable for working with dynamic or
  secure web services that use technologies like JavaScript, cookies, and scripts. It
  can load, test, crawl, and scrape data from these types of websites.

Introduction to Selenium
Selenium projects: Selenium WebDriver
- Selenium WebDriver for Browser Automation: A crucial part of Selenium used to automate
  web browsers.

- Multiple Language Bindings and Third-Party Drivers: It supports multiple programming
  languages and interfaces with browsers like Chrome, Firefox, and Opera through
  third-party drivers.

- No External Dependencies: Selenium WebDriver works independently without relying on
  external software or servers.

- Enhanced Features and Overcoming Limitations: It offers an object-oriented API with
  improved features to address limitations seen in previous Selenium versions and
  Selenium Remote Control (RC).

Introduction to Selenium
Selenium projects: (cont.)
- Selenium RC is a server that is programmed in Java. It uses HTTP to accept commands
  for the browser and is used to test complex AJAX-based web applications.

- Selenium Grid is a server enabling parallel test execution on multiple machines,
  across different browsers and operating systems, reducing performance issues and time
  consumption.

- Selenium IDE is an open-source integrated development environment for building test
  cases with Selenium.


Python Selenium – Open Chrome
Use webdriver_manager to create a driver object for Chrome.
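
A minimal sketch, assuming Selenium 4 and the webdriver_manager package are installed;
webdriver_manager downloads a chromedriver binary that matches the locally installed
Chrome:

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    # webdriver_manager fetches and caches a matching chromedriver automatically.
    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service)

    driver.get("https://www.example.com")   # open a page in the automated browser
    print(driver.title)

    driver.quit()   # close the browser when done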



Python Selenium – Open Chrome
To open a given URL in the browser window using Selenium for Python, call the get()
method on the driver object and pass the URL (including the scheme) as its argument.
Code: driver.get('https://www.anywebsite.com')



Python Selenium – Find Element by Link Text
To find a link element (hyperlink) by the value (link text) inside the link, using
Selenium in Python, call the find_element() method and pass By.LINK_TEXT as the first
argument, and the link text as the second argument.
Code: driver.find_element(By.LINK_TEXT, 'Contact MySQL')


Python Selenium – Get Title of a Website
To get the title of the web page using Selenium for Python, read the title property of
the WebDriver object. Code: driver.title



Python Selenium – Get Link after clicking a link (current link)
To get the current URL in the browser window using Selenium for Python, read the
current_url property of the WebDriver object. Code: driver.current_url
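
Pulling the last few slides together, a minimal sketch that clicks a link and then reads
the title and current URL; it assumes the 'Contact MySQL' link from the earlier example
lives on https://www.mysql.com, so adjust the URL and link text for your own target page:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

    # Assumed target page; the 'Contact MySQL' link text comes from the earlier slide.
    driver.get("https://www.mysql.com")

    link = driver.find_element(By.LINK_TEXT, "Contact MySQL")
    link.click()

    # After the click, the driver reflects the new page.
    print("Title      :", driver.title)
    print("Current URL:", driver.current_url)

    driver.quit()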



Python Selenium – Find Element by ID
To find an HTML element by its id attribute using Selenium in Python, call the
find_element() method and pass By.ID as the first argument, and the id attribute's value
(of the HTML element we need to find) as the second argument.
Code: find_element(By.ID, "id_value")


Python Selenium – Find Element by Class Name
To find an HTML element by its class attribute using Selenium in Python, call the
find_element() method and pass By.CLASS_NAME as the first argument, and the class name
(of the HTML element we need to find) as the second argument.
Code: find_element(By.CLASS_NAME, "class_name_value")


Python Selenium – Find Element by CSS Selector
To find an HTML element by CSS selector, call the find_element() method and pass
By.CSS_SELECTOR as the first argument, and the CSS selector string (of the HTML element
we need to find) as the second argument.
Code: find_element(By.CSS_SELECTOR, "css_selector_value")
If multiple HTML elements match the given CSS selector, find_element() returns the first
of them. Regarding the CSS selector string: if we would like to get the first paragraph
element whose class name is 'that_class_name' and the paragraph is inside a div, then
the CSS selector string is 'div p.that_class_name'.
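
A short sketch combining the three locator strategies above; the page URL, id value, and
class names are hypothetical placeholders, so swap in values from a page you are actually
inspecting:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    driver.get("https://www.example.com")   # placeholder page

    # By ID: matches an element such as <h1 id="main-heading"> (placeholder id)
    heading = driver.find_element(By.ID, "main-heading")

    # By class name: first element whose class attribute contains "intro" (placeholder)
    intro = driver.find_element(By.CLASS_NAME, "intro")

    # By CSS selector: first <p class="that_class_name"> inside a <div>
    para = driver.find_element(By.CSS_SELECTOR, "div p.that_class_name")

    for element in (heading, intro, para):
        print(element.tag_name, "->", element.text)

    driver.quit()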


Python Selenium – Find Elements by Partial Link Text
To find link elements (hyperlinks) by partial link text, call the find_elements() method
and pass By.PARTIAL_LINK_TEXT as the first argument and the partial link text value as
the second argument. Code: find_elements(By.PARTIAL_LINK_TEXT, "partial_link_text_value")
The find_elements() method returns all the HTML elements that match the given partial
link text, as a list. If no elements match the given partial link text, find_elements()
returns an empty list.
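
A minimal sketch, assuming a placeholder page; every link whose visible text contains
"Download" would be collected, and an empty list simply means nothing matched:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    driver.get("https://www.example.com")   # placeholder page

    # All links whose visible text contains the word "Download".
    links = driver.find_elements(By.PARTIAL_LINK_TEXT, "Download")
    print(f"Found {len(links)} matching link(s)")   # 0 if nothing matches

    for link in links:
        print(link.text, "->", link.get_attribute("href"))

    driver.quit()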


Python Selenium – Find Elements by Tag Name
To find all HTML elements that have a given tag name in a document, using Selenium in
Python, call the find_elements() method and pass By.TAG_NAME as the first argument and
the tag name (of the HTML elements we need to find) as the second argument.
Code: find_elements(By.TAG_NAME, "tag_name_value")
The find_elements() method returns all the HTML elements that match the given tag name,
as a list. If no elements have the given tag name, find_elements() returns an empty list.
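
As a final sketch, collecting every anchor (<a>) element on a placeholder page and
printing its link target:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    driver.get("https://www.example.com")   # placeholder page

    # Every element with the tag name "a" (hyperlinks) on the page.
    anchors = driver.find_elements(By.TAG_NAME, "a")
    print(f"{len(anchors)} anchor element(s) found")

    for a in anchors:
        print(a.text or "(no text)", "->", a.get_attribute("href"))

    driver.quit()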
