Web Scraping Course Notes
Prerequisites:
• Basic knowledge of Python and Pandas
• Basic understanding of how the web works
• Basic knowledge of HTML
Course Structure:
3. Primer on Web
○ How websites interact
○ HTTP methods
Basics
Tuesday, October 29, 2024 10:39 PM
• Web Scraping targets particular data points within a webpage, such as prices, reviews, product listings, or other structured information.
• Scraping is focused on extracting certain elements or fields from a webpage, rather than exploring links or indexing the entire page.
1. Sending a Request:
○ The scraper sends an HTTP request to the target URL and receives the page's HTML in response.
2. Parsing HTML:
○ The script navigates through the received HTML structure to identify and extract data of interest.
○ This involves extracting only the required specific data.
3. Storage:
○ After extraction, data is cleaned and stored in the desired format.
○ Data is usually stored in a database, CSV file, or spreadsheet for further analysis.
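A minimal sketch of this fetch-parse-store workflow, using Requests and Beautiful Soup (the URL, tag name, and CSV layout are illustrative assumptions, not taken from a real site):

    import csv
    import requests
    from bs4 import BeautifulSoup

    # 1. Send a request and receive the page's HTML
    response = requests.get("https://example.com/blog")

    # 2. Parse the HTML and extract only the fields of interest
    soup = BeautifulSoup(response.text, "html.parser")
    titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

    # 3. Clean and store the data in a CSV file for further analysis
    with open("titles.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["title"])
        writer.writerows([t] for t in titles)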
Types of Web Scraping
Tuesday, October 29, 2024 11:16 PM
1. HTML Parsing:
○ HTML parsing is the most common form of web scraping.
○ It involves analyzing a web page’s HTML structure to extract relevant data.
○ Works well for websites with static content or basic HTML structures.
○ Example: Extracting blog titles, author names, and publication dates from a blog page.
4. API-based Scraping:
○ Many websites offer APIs (Application Programming Interfaces) for structured data access.
○ This can be a more efficient and ethical alternative to traditional scraping methods.
○ Example: Extracting user information, posts, and comments from a social media platform’s API.
5. Image and Multimedia Scraping:
○ Image scraping involves extracting images, videos, or other media files from web pages.
○ Scrapers target img tags or other media tags in HTML, and download the files directly.
Ethical Considerations
Tuesday, October 29, 2024 11:43 PM
• Ethical considerations in web scraping are essential to ensure that data collection practices are conducted responsibly and in line with legal and moral obligations.
• These considerations mainly revolve around respecting website policies, data privacy, intellectual property,
and transparency with users.
• Transparency:
○ Presenting scraped data without attribution can mislead users and harm the reputation of the data's original source.
○ What To Do: If using scraped data for public purposes, clearly disclose its source, the data collection process, and any
limitations.
Advantages of Web Scraping
Wednesday, October 30, 2024 12:01 AM
• Helps save considerable time and effort, enabling faster access to information.
• This is particularly beneficial for industries that rely on large datasets, such as e-commerce, market
research, and finance.
• Access to real-time data provides businesses with a competitive edge by allowing them to adjust
strategies based on the latest trends.
• Web scraping can gather data from various websites, blogs, social media, and online forums, providing a
broader view of the market landscape.
• Web scraping helps compile data that is crucial for understanding consumer behavior, trends, and
competitor activities.
• Helps companies analyze historical data to identify trends and predict future behaviors, aiding long-term
strategy planning.
• Companies can use web scraping to validate information about their own products and services by
comparing data across different platforms, detecting inconsistencies that may indicate fraud.
6. Enhanced SEO and Content Strategy:
• Web scraping can help companies analyze competitors' keywords, backlinks, and content strategies to
improve their own SEO performance.
• Understanding high-performing content on competitors' websites can guide and allow companies to
identify and replicate successful topics and formats.
Disadvantages of Web Scraping
Wednesday, October 30, 2024 12:12 AM
• Extracting data without permission can lead to copyright issues, potential lawsuits, or restrictions
from the website owner.
• Scraping personal information, even if publicly available, can raise privacy issues, especially under
data protection laws like GDPR.
• Companies can face penalties for scraping personal data without consent.
• Anti-scraping measures such as CAPTCHAs and bot-detection systems can interrupt scraping processes, requiring continual adjustment to circumvent these systems.
• Many scrapers use rotating proxies to avoid detection, which can be costly.
• IPs can also quickly become blocked, rendering scraping scripts useless.
• Websites often change their layout and HTML structure; these changes require scrapers to be reconfigured frequently, increasing maintenance time and cost.
• Extracted data may contain inconsistencies, missing values, or irrelevant information that
requires significant preprocessing before it becomes usable.
4. Dynamic Content:
• Many pages render their content with JavaScript; scraping such content requires additional tools like Selenium or Puppeteer, which increase complexity.
• JavaScript-heavy pages can be slower to load and scrape, making data extraction more time-
consuming and resource-demanding.
5. Environmental Impact:
• Large-scale scraping operations consume substantial computational resources, which contributes
to energy usage and, indirectly, environmental impact.
Alternatives to Web Scraping
Wednesday, October 30, 2024 12:13 AM
1. Public APIs:
• Many websites offer public APIs that allow developers to access structured data directly.
• APIs provide clean and organized data formats, eliminating the need for extensive parsing or cleaning.
• Using an official API helps avoid legal risks associated with web scraping.
2. RSS Feeds:
• Really Simple Syndication feeds are a way to automatically receive updates from websites in a single feed.
• RSS feeds are updated frequently, making it easy to access new content automatically.
• Since RSS feeds are structured in XML, they’re easy to parse and don’t require complex scraping scripts.
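A small sketch of reading an RSS feed with Python's standard library (the feed URL is a placeholder):

    import requests
    import xml.etree.ElementTree as ET

    xml_text = requests.get("https://example.com/feed.xml").text
    root = ET.fromstring(xml_text)

    # RSS 2.0 nests <item> elements under <channel>
    for item in root.iter("item"):
        print(item.findtext("title"), "-", item.findtext("link"))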
3. Public Datasets:
• Data portals provide clean, verified, and well-documented datasets, which are typically updated periodically.
• Most data portals offer free access, with datasets available in formats like CSV, JSON, or Excel.
• Using existing datasets reduces time spent on collection and cleaning.
Topics
Sunday, November 3, 2024 12:44 PM
1. About Anaconda
2. Tutorial of Common Anaconda Prompts
3. Creating a project environment using Anaconda
About Anaconda
Thursday, October 31, 2024 10:32 AM
• Anaconda is a popular open-source distribution of Python (and R) mainly used for data science,
machine learning, and scientific computing.
• Anaconda includes Conda, a package, dependency, and environment manager, making it easy to
install and manage libraries and environments.
• It comes with over 1,500 pre-installed packages for data science, including popular libraries like
NumPy, Pandas, Matplotlib, TensorFlow, and Scikit-Learn.
• Anaconda provides easy access to Jupyter Notebook, a powerful tool for interactive coding, data
visualization, and exploratory analysis.
• Conda plays a role similar to pip but also manages non-Python dependencies and isolated environments.
Common Anaconda Prompts
Thursday, October 31, 2024 10:46 AM
• conda list: lists all the packages installed in the current environment
• conda env export --name <env_name> --file environment.yml: export an environment to a .yml file
• conda env create --file environment.yml: import the exported environment in another system
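A typical workflow built from these and related commands (the environment name scraping_env is an arbitrary example):

    conda create --name scraping_env python=3.11
    conda activate scraping_env
    conda install requests beautifulsoup4
    conda env export --name scraping_env --file environment.yml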
Setup
Thursday, October 31, 2024 9:39 AM
2. Install Anaconda
• Official Website
• Register the environment as a Jupyter kernel: python -m ipykernel install --user --name=<env_name> --display-name "<Your Env Display Name>"
7. Launch Jupyter and create new notebooks using the appropriate kernel
Topics
Thursday, October 31, 2024 12:07 PM
1. Client-Server model
2. HTTP Request and Response
3. HTTP Methods
4. Status Codes
5. Web Technologies
• HTML
• CSS
• JavaScript
Client-Server model/architecture
Thursday, October 31, 2024 12:14 PM
Client:
• A device or application that initiates requests for services or resources.
• Clients are typically end-user devices (e.g., smartphones, laptops) or software applications (e.g.,
browsers, email clients) that communicate over a network.
Server:
• A server is a dedicated system or application that listens for and fulfills requests from clients.
• Servers provide resources, data, or services to clients, typically through a network connection.
Working:
• The server hosts a site's content and logic, which is why every website or app depends on one.
• Clients and servers communicate over a network in a request-response cycle (e.g., via HTTP).
HTTP Request and Response
Thursday, October 31, 2024 12:14 PM
HTTP:
• The HTTP (Hypertext Transfer Protocol) is the foundation of data communication on the web.
• It facilitates the exchange of information between clients and servers.
• The HTTP request-response cycle is central to how web applications function.
Working:
HTTP Request:
• An HTTP request is a message sent by the client to the server to initiate an action or request a resource.
○ Request Line:
▪ Method / Verb
▪ Address (URI)
▪ Version
HTTP Response:
• An HTTP response is the message sent by the server back to the client after processing the request.
○ Response Line:
▪ Status Code
▪ Version
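An illustrative exchange showing where these parts appear (headers and body included for context):

    GET /index.html HTTP/1.1        <- request line: method, URI, version
    Host: www.example.com
    Accept: text/html

    HTTP/1.1 200 OK                 <- response line: version, status code
    Content-Type: text/html

    <html>...</html>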
HTTP Methods
• HTTP methods, also known as HTTP verbs, are a fundamental part of the HTTP protocol.
• They define the action to be performed on a resource identified by a URI.
• Each method has specific semantics and is used for different purposes in client-server communication.
1. GET:
• Used to request data from a specified resource.
• It is the most commonly used HTTP method.
• Application: Retrieving web pages, images, or other resources from a server.
2. POST:
• Used to submit data to be processed to a specified resource.
• This often results in the creation of a new resource.
• Application: Submitting forms or uploading files
3. PUT:
• Used to update an existing resource or create a new resource if it does not exist.
• It sends data to the server to replace the current representation of the resource.
• Application: Updating user details or replacing an entire user record.
4. PATCH:
• Used to apply partial modifications to a resource.
• It sends a set of instructions to update the resource rather than replacing it entirely.
• Application: Updating specific fields of a user, like changing a user’s email without altering other attributes.
5. DELETE:
• Used to remove a specified resource from the server.
• Application: Deleting user accounts, posts, or other resources.
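A quick sketch of these methods using Python's Requests library against httpbin.org, a public HTTP testing service (the payload values are arbitrary):

    import requests

    base = "https://httpbin.org"

    requests.get(f"{base}/get")                               # GET: retrieve a resource
    requests.post(f"{base}/post", data={"name": "Alice"})     # POST: create a resource
    requests.put(f"{base}/put", json={"name": "Alice"})       # PUT: replace a resource
    requests.patch(f"{base}/patch", json={"age": 31})         # PATCH: partial update
    requests.delete(f"{base}/delete")                         # DELETE: remove a resource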
HTTP Status Codes
Thursday, October 31, 2024 12:14 PM
• HTTP status codes are three-digit numbers sent by a server in response to a client's request made to
the server.
• These codes are crucial for understanding the results of HTTP requests.
• They provide important feedback to clients about the success or failure of requests and help
developers diagnose issues.
• Properly using and interpreting HTTP status codes is essential for effective communication between
clients and servers in the web ecosystem.
1. 1xx - Informational:
• These codes indicate that the request has been received and the process is continuing.
• They are rarely used in web applications but are important for certain protocols.
• Examples:
▪ 100 Continue
▪ 101 Switching Protocols
2. 2xx - Success:
• These codes indicate that the client's request was successfully received, understood, and accepted.
• Examples:
▪ 200 OK
▪ 201 Created
▪ 202 Accepted
3. 3xx - Redirection:
• These codes indicate that further action is needed to complete the request.
• Examples:
▪ 300 Multiple Choices
▪ 301 Moved Permanently
▪ 302 Found
4. 4xx - Client Error:
• These codes indicate that the request contains an error or cannot be fulfilled by the server.
• Examples:
▪ 400 Bad Request
▪ 401 Unauthorized
▪ 404 Not Found
5. 5xx - Server Error:
• These codes indicate that the server failed to fulfill a valid request.
• Examples:
▪ 502 Bad Gateway
▪ 503 Service Unavailable
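A short sketch of inspecting status codes with Requests (httpbin.org simply returns whatever status code the URL asks for):

    import requests

    response = requests.get("https://httpbin.org/status/404")
    print(response.status_code)          # 404

    if 200 <= response.status_code < 300:
        print("Success")
    elif 400 <= response.status_code < 500:
        print("Client error")

    response.raise_for_status()          # raises HTTPError for any 4xx/5xx response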
Web Technologies
• Web technologies encompass a wide array of tools, languages, and protocols used to create, maintain, and manage websites and web applications.
• HTML provides the foundational structure, CSS handles visual presentation, and JavaScript
adds dynamic behavior.
• Together, they enable developers to build rich, interactive user experiences on the web.
• Understanding these technologies is crucial for web scraping, as it allows developers to
extract data from web pages effectively, even when dealing with dynamic content.
1. HTML:
• HTML is the standard markup language used for creating web pages.
• It provides the structure and layout of a web document by defining elements like headings,
paragraphs, links, images, and other types of content using appropriate tags.
2. CSS:
• CSS is a stylesheet language used to describe the presentation of a document written in HTML.
• It controls the layout, colors, fonts, and overall visual appearance of web pages.
3. JavaScript:
• JavaScript is a high-level, dynamic programming language that enables interactivity and functionality
on web pages.
About Requests
• The Requests module is a powerful and user-friendly HTTP library for Python, designed to make it easier to send HTTP requests.
• It abstracts away the complexities of making requests and handling responses, allowing developers to focus
on building applications.
Key Features:
• Simplicity: Known for its clean and straightforward syntax, making it accessible for beginners and experienced developers
alike.
• Flexibility: Supports a wide range of HTTP methods and allows for customization, such as adding headers, query
parameters, and more.
• User-Friendly: Compared to the built-in urllib library, Requests provides a more intuitive API which is particularly
beneficial for rapid development and prototyping.
• Automatic Content Decoding: Requests automatically decodes the content based on the response headers, so you can
work with the data directly without worrying about the encoding.
• Session Management: You can persist certain parameters across multiple requests using session objects, which can be
useful for maintaining state.
• Built-in Error Handling: Requests comes with built-in mechanisms to handle common HTTP errors and exceptions.
• Community & Support: The Requests library has a large and active community. You can easily find tutorials, documentation, and support for any issues you encounter.
Applications of Requests:
1. Web Scraping:
• Requests is often the first step in web scraping.
• It allows developers to send HTTP requests to retrieve the HTML content of web pages, which can then be parsed and
analyzed using libraries like Beautiful Soup or lxml.
2. API Interaction:
• Commonly used to interact with RESTful APIs.
• It can send various types of HTTP requests (GET, POST, PUT, DELETE) to perform operations like retrieving data,
creating new records, or updating existing ones.
3. File Uploads:
• Requests supports file uploads, which is essential for applications where users need to submit files to a server.
4. Session Management:
• Requests provides the ability to maintain sessions across multiple requests, which is useful for applications that
require authentication.
• A web application that requires users to log in can use a session to maintain the user's login state while making
subsequent requests.
5. Data Submission:
• Requests can be used to submit data to web servers, especially in web forms where users enter information.
• Example: An application could use Requests to send user feedback or comments to a server.
GET requests
Thursday, October 31, 2024 5:47 PM
Basic Syntax:
• A GET request is sent with requests.get(url); see the sketch below.
Query Parameters:
• Parameters included in the URI as part of the request message, appended as a query string (e.g., ?q=value).
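A minimal sketch (httpbin.org is used as a stand-in endpoint; the parameter names are arbitrary):

    import requests

    params = {"q": "web scraping", "page": 2}
    response = requests.get("https://httpbin.org/get", params=params)

    print(response.url)        # final URI including ?q=web+scraping&page=2
    print(response.json())     # httpbin echoes the request back as JSON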
POST requests
Thursday, October 31, 2024 6:28 PM
Basic Syntax:
• Since POST requests are used to send data to the server to create a resource, the Requests library
provides the data parameter for this
• We can also pass JSON data directly using the json parameter
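A minimal sketch of both options (httpbin.org echoes the submitted data back):

    import requests

    # Form-encoded body via the data parameter
    r1 = requests.post("https://httpbin.org/post", data={"username": "alice"})
    print(r1.json()["form"])

    # JSON body via the json parameter (sets Content-Type: application/json)
    r2 = requests.post("https://httpbin.org/post", json={"username": "alice"})
    print(r2.json()["json"])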
Headers
Thursday, October 31, 2024 7:33 PM
• Content-Type: Specifies the format of the request data, such as JSON or XML.
• User-Agent: Provides information about the client making the request, such as the browser or app.
• Accept: Informs the server about the types of content the client can process, like JSON or HTML.
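A sketch of sending these headers with Requests (the User-Agent value is an arbitrary example):

    import requests

    headers = {
        "Content-Type": "application/json",
        "User-Agent": "my-scraper/1.0",
        "Accept": "application/json",
    }
    response = requests.get("https://httpbin.org/headers", headers=headers)
    print(response.json())     # httpbin echoes back the headers it received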
About Beautiful Soup
Saturday, November 2, 2024 9:39 AM
• Beautiful Soup is a Python library used to parse HTML and XML documents.
• It’s especially useful for web scraping because it helps navigate, search, and modify the
HTML (or XML) content fetched from a webpage.
• It transforms complex HTML documents into a tree structure, where each node
corresponds to elements such as tags, text, attributes, etc.
• This makes it easy to locate and extract specific information.
• Flexible Parsing: Works with different parsers, such as the built-in Python parser or lxml, offering
flexibility in terms of speed and error handling.
• Handles Broken HTML: Automatically fixes errors in the HTML structure, allowing users to scrape data
from pages that other parsers might struggle with.
• Efficient Navigation and Search Functions: Provides intuitive functions like find, find_all, and select to
search and navigate through HTML tags and CSS selectors.
• Integration with Other Libraries: Integrates smoothly with libraries like Requests, to retrieve web
pages before parsing them. Also works well with Pandas for data analysis or Selenium for JavaScript-
heavy pages, making it a versatile choice for a complete web scraping workflow.
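A minimal sketch of these functions (example.com is a placeholder page):

    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://example.com").text
    soup = BeautifulSoup(html, "html.parser")     # or "lxml" if installed

    print(soup.find("h1").get_text())             # first <h1> element
    print(soup.find_all("a"))                     # every <a> element
    print(soup.select("div p"))                   # CSS selector: <p> inside <div>
    for link in soup.find_all("a"):
        print(link.get("href"))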
Applications:
• Job Listings Aggregation: Commonly used to scrape job listings from platforms like LinkedIn, Indeed, or company career pages. This can help create job aggregators that compile positions from various sources.
• Market Research and Sentiment Analysis: Companies often use web scraping to collect data from
forums, blogs, and review sites to analyze customer sentiment about their products or their
competitors.
• Real Estate Listings: Useful for gathering real estate listings from sites like Zillow or Realtor.com. Data
on prices, locations, features, and property availability can be scraped and analyzed to identify trends,
track prices, and help potential buyers and real estate investors make informed decisions.
• Travel and Flight Price Tracking: Used to monitor and compare prices across different airlines, hotels,
and booking platforms. By gathering this data, users can develop apps to track flight and
accommodation prices, helping travelers find the best deals.
Topics
Tuesday, November 5, 2024 11:14 PM
1. About Selenium
2. Getting Started
1. Definition
Tuesday, November 5, 2024 11:14 PM
What is Selenium?
• Selenium is a powerful, open-source tool used for automating web browsers.
• It is often utilized for web scraping when interacting with dynamic websites that rely on JavaScript to load
content, which static scraping libraries like Beautiful Soup or Requests cannot handle effectively.
• When scraping websites using Python, Selenium acts as a web driver, automating browser actions to interact
with web pages like a human user.
• It can navigate to web pages, simulate user interactions (clicks, scrolls, form submissions), and extract data
directly from rendered HTML.
• Selenium was originally developed for testing web applications. Over time, it became a popular tool for web
scraping due to its ability to handle dynamic, JavaScript-heavy websites.
2. Key Features
Monday, December 2, 2024 11:41 PM
Key Features:
• Dynamic Content Handling:
○ Renders pages in a real browser, so content loaded by JavaScript/AJAX can be scraped once it appears.
• Interaction Simulation:
○ Handles tasks such as clicking buttons, filling forms, selecting dropdowns, and scrolling pages.
○ Useful for scraping data hidden behind user interactions.
• Cross-Browser Support:
○ Works with popular browsers like Chrome, Firefox, Edge, and Safari.
• Customizable Waits:
○ Implements explicit and implicit waits to ensure elements are fully loaded before actions are performed.
3. Comparison
Monday, December 2, 2024 11:41 PM
Comparative Analysis:
4. Advantages
Monday, December 2, 2024 11:42 PM
Advantages:
• Handles JavaScript and AJAX (Asynchronous JavaScript and XML)
5. Disadvantages
Monday, December 2, 2024 11:42 PM
Disadvantages:
• Slower than static scraping methods since it requires a full browser environment
1. First Steps
Monday, December 2, 2024 11:47 PM
2. Install selenium
• Selenium provides the function find_element to find and locate elements of a webpage
• There are several strategies to identify an element of a webpage (see the sketch after this list):
1. ID:
2. Name:
3. Class:
4. Tag:
5. XPath:
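A sketch of these strategies (the URL and locator values are illustrative assumptions):

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get("https://example.com")

    driver.find_element(By.ID, "login")                       # 1. ID
    driver.find_element(By.NAME, "q")                         # 2. Name
    driver.find_element(By.CLASS_NAME, "nav-item")            # 3. Class
    driver.find_element(By.TAG_NAME, "h1")                    # 4. Tag
    driver.find_element(By.XPATH, "//div[@class='example']")  # 5. XPath
    driver.quit()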
What is XPath?
• XPath (XML Path Language) is a query language used to navigate and locate elements within XML or HTML documents.
• Selenium uses XPath as one of its locator strategies to find elements on a webpage.
• XPath is a powerful tool for locating elements in Selenium, offering unmatched flexibility and precision.
• It's a go-to solution when working with complex web pages or when other locators are insufficient.
Advantages:
○ XPath can traverse the entire DOM, allowing you to locate elements in deep nested structures or without unique identifiers.
○ Works for all elements, even those without id, name, or class.
• Offers Rich Syntax: XPath supports a variety of conditions and operators (e.g., contains(), text(), and, or), enabling users to build precise queries.
• Supports Relative Paths: You can locate elements without specifying their full path in the DOM, making XPath expressions robust to changes
○ Relative: //div[@class='example']
○ Absolute: /html/body/div[1]/div[2]
• Locates Hidden Elements:
○ XPath can navigate through the DOM and locate elements that might not be directly visible or styled.
○ Elements in the DOM can exist even if they are hidden from the user's view, such as elements with CSS properties like display: none or visibility: hidden
Disadvantages:
• XPath is slower than CSS Selectors because of its ability to traverse the entire DOM.
• XPath expressions can be harder to read and maintain, especially for deeply nested elements.
• Some older browsers may have limited support for advanced XPath queries.
1. Typing Text (send_keys):
• It's used to simulate typing into an element, such as a text input field or a text area.
• It allows users to send keystrokes or input text programmatically as if a user were typing on a keyboard.
2. Clearing Text (clear):
• It's used to clear the text content of a text input element on a web page
• It ensures the field is empty before entering new data (which can be done using the send_keys function)
3. Clicking Buttons:
• This is achieved using the click function
• Allows users to interact with clickable elements on a web page, such as buttons, links, checkboxes, or radio
buttons.
4. Submitting Forms:
• This is achieved using the submit function
• Helps to automatically submit a form without explicitly clicking the "Submit" button
• Executes the action associated with the <form> tag, such as navigating to a new page or processing data.
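A sketch covering these interactions (assumes driver is an open WebDriver and the locators exist on the page):

    from selenium.webdriver.common.by import By

    search_box = driver.find_element(By.NAME, "q")
    search_box.clear()                                 # empty the field first
    search_box.send_keys("web scraping")               # simulate typing

    driver.find_element(By.ID, "search-btn").click()   # click a button

    form = driver.find_element(By.TAG_NAME, "form")
    form.submit()                                      # submit without clicking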
1. Dropdown:
• Wrap the identified element in the Select class, imported from the module selenium.webdriver.support.select (see the sketch below)
2. Multiselect:
• For <select multiple> elements, the same Select class can choose several options and clear them all with deselect_all
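A sketch covering both cases (the element ID and option values are assumptions):

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.select import Select

    dropdown = Select(driver.find_element(By.ID, "country"))
    dropdown.select_by_visible_text("India")   # or select_by_value / select_by_index

    if dropdown.is_multiple:                   # multi-select elements only
        dropdown.select_by_index(0)
        dropdown.deselect_all()                # clear every selected option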
1. execute_script:
○ This method is mainly used to execute JavaScript code within the context of the currently loaded webpage
○ It allows users to directly interact with and manipulate the Document Object Model (DOM) of the page
○ Helps interact with elements that might not be accessible using Selenium's standard methods
2. Scrolling Vertically:
3. Scrolling Horizontally:
5. Infinite Scrolling:
○ Initially, the webpage loads a fixed amount of content.
○ As the user scrolls close to the bottom of the page, a JavaScript function triggers a request to load more content dynamically
○ The new content is added to the page, and the process repeats.
Algorithm:
○ Get the height of the currently loaded page (h1)
○ In a loop, scroll to the bottom of the page and wait briefly for new content to load
○ Get the height of the page again (h2)
○ If h1 is the same as h2, break out of the loop as no new content has been loaded; otherwise set h1 = h2 and repeat (see the sketch below)
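A sketch of this algorithm (the 2-second pause is an arbitrary buffer for content to load):

    import time

    h1 = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)                   # give new content time to load
        h2 = driver.execute_script("return document.body.scrollHeight")
        if h1 == h2:                    # no new content was loaded
            break
        h1 = h2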
• This section covers advanced web interactions that go beyond basic navigation and element manipulation.
• By mastering these techniques, developers will be able to handle real-world web scraping and automation
challenges, including interacting with dynamic content, handling alerts, and managing iframes.
Topics:
• Explicit Waits
• Implicit Waits
• Handling Alerts
• Frames & iFrames
Explicit Waits
• An explicit wait pauses the script until a specific condition is met, up to a maximum timeout.
• Useful when dealing with dynamic web elements that take time to appear or become interactable on the page.
How to Implement?
• Selenium provides the WebDriverWait class to implement explicit waits
• The script checks for the condition at regular intervals (default - 500ms) until it's met or the timeout occurs
Syntax:
• Parameters:
○ driver: The WebDriver instance controlling the browser
○ timeout: Maximum time (in seconds) to wait for the condition to be met
○ poll_frequency: How often (in seconds) the condition is checked (default: 0.5 seconds)
○ ignored_exceptions: A tuple of exceptions to ignore while waiting (optional)
• WebDriverWait often works in conjunction with Expected Conditions (EC) to define what to wait for:
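A minimal sketch (the element ID is an assumption):

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    wait = WebDriverWait(driver, timeout=10, poll_frequency=0.5)
    element = wait.until(EC.presence_of_element_located((By.ID, "results")))

    # until() also accepts any callable that receives the driver
    wait.until(lambda d: d.title != "")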
Implicit Waits
• If an element is not immediately found, the WebDriver waits for the specified duration before throwing a NoSuchElementException.
• This type of wait applies globally to all element searches in the WebDriver instance.
How It Works:
• We can set up the waiting duration using the implicitly_wait method of the webdriver instance.
• Makes Selenium scripts resilient to minor delays in the loading of web elements caused by network speed, animations, or dynamic content.
• Once set, it applies to all find_element and find_elements methods for the lifetime of the WebDriver instance.
• If the element is found before the timeout period, the script proceeds immediately. Otherwise, it waits until the timeout is reached and raises an
exception if the element is still not found.
Advantages:
• Simplicity: Easy to implement and applies globally, avoiding repetitive waits for every element.
• Resilience: Handles minor delays in loading dynamically generated elements, reducing script failures.
• Better Control: Provides a default buffer for all element searches without the need for explicit handling.
Disadvantages:
• Since it applies globally, it may not suit situations where different elements require different wait times.
• Mixing implicit waits with explicit waits can cause unpredictable behavior, as implicit waits can interfere with explicit wait polling mechanisms.
• Implicit waits only handle element visibility or presence and cannot wait for specific conditions like page titles or JavaScript execution.
Best Practices:
• Use either implicit waits or explicit waits in your script, but not both, to avoid conflicts.
• Set reasonable timeout durations; avoid very high implicit wait times (e.g., 60 seconds), as they can unnecessarily delay script execution.
• Implicit waits are suitable for simple scripts without complex wait conditions.
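A minimal sketch of an implicit wait:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.implicitly_wait(10)        # applies to every find_element call
    driver.get("https://example.com")
    heading = driver.find_element(By.TAG_NAME, "h1")   # waits up to 10s if needed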
What is a Frame/IFrame?
• In web development, frames and iframes are HTML elements that allow you to embed one HTML document inside another.
Frame:
• Part of the <frame> and <frameset> HTML tags, which were used in early web development to divide the browser window into multiple
sections, each capable of loading a separate HTML document.
• Now obsolete in HTML5, frames are rarely used. They were replaced by iframes and other modern layout techniques like CSS Grid and
Flexbox.
IFrame:
• The <iframe> (inline frame) tag embeds one HTML document inside another and replaced frames in modern development.
• Changes in the parent page (like CSS or JavaScript) typically do not affect the iframe's content, and vice versa.
• Each iframe has its own DOM (Document Object Model), CSS, and JavaScript scope.
• Interaction between the parent and iframe is restricted if they originate from different domains for security reasons.
• After interacting with an iframe, switch back to the parent page using driver.switch_to.default_content()
Best Practices:
• Ensure you know which frame or iframe contains the desired elements by inspecting the page source.
• Whenever possible, avoid switching by index to maintain flexibility if the page structure changes.
• Always switch back to the main content after interacting with a frame.
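A sketch of switching into an iframe and back (the frame name is an assumption):

    from selenium.webdriver.common.by import By

    driver.switch_to.frame("content_frame")        # by name/ID, index, or WebElement
    text = driver.find_element(By.TAG_NAME, "p").text   # interact inside the iframe
    driver.switch_to.default_content()             # return to the parent page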
What is an Alert?
• Alerts are pop-up dialogs shown to the user. They are typically generated by JavaScript or built into the HTML/CSS structure of a webpage.
• Alerts are used for various purposes, including notifying users, obtaining confirmation, or prompting for input.
Types of Alerts:
1. JavaScript alerts:
○ Created using JavaScript's alert() function.
2. Confirmation Alerts:
○ Created using JavaScript's confirm() function.
○ Asks the user to confirm an action with "OK" and "Cancel" buttons.
3. Prompt Alerts:
○ Created using JavaScript's prompt() function.
○ Requests user input and provides an input field along with "OK" and "Cancel" buttons.
4. Custom Pop-ups:
○ Built with HTML/CSS/JavaScript rather than the browser's native dialogs.
○ Offers more flexibility in design and functionality (e.g., styled dialog boxes with multiple buttons or inputs).
• Use the accept() method to click the "OK" button on an alert
• Use the dismiss() method to click the "Cancel" button on confirmation pop-ups
• Use the text attribute to retrieve the message displayed on the alert (see the sketch below)
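A sketch of handling an alert (assumes an alert is currently displayed):

    alert = driver.switch_to.alert
    print(alert.text)            # message shown on the alert
    alert.send_keys("hello")     # type into a prompt alert's input field
    alert.accept()               # click "OK"; use alert.dismiss() for "Cancel"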
• This section provides strategies for writing maintainable, efficient, and robust Selenium scripts.
• Adhering to these best practices will improve the performance of test automation or web scraping
projects and make the code easier to debug and maintain.
1. Page Object Model:
○ Elements are defined as variables, and interactions (methods) are encapsulated within the class
2. Variable Names:
○ Avoid overly generic names (ex: element1, button1)
3. Waits:
○ Look to use Explicit Waits and Implicit Waits for better performance
Error Handling:
1. Try-Except blocks:
○ Wrap code for critical interactions within try-except blocks to manage unexpected failures
2. Release Resources:
○ Always close the browser session at the end of the script to free resources
Debugging:
1. Logging:
○ Avoid printing directly to the console to debug; use logs instead
2. Capture Screenshots:
○ Save a screenshot for debugging if the code fails
3. Developer Tools:
○ Use the browser's Developer Tools to inspect element locators and understand dynamic content of the webpage
○ Gives a better understanding of the website structure and its respective code
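A sketch combining these practices: logging instead of print, a try-except block around a critical interaction, a screenshot on failure, and releasing the browser (the locator is an assumption):

    import logging
    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.by import By

    logging.basicConfig(level=logging.INFO)
    driver = webdriver.Chrome()
    try:
        driver.get("https://example.com")
        driver.find_element(By.ID, "submit-btn").click()
    except NoSuchElementException:
        logging.error("Element not found; saving screenshot")
        driver.save_screenshot("failure.png")
    finally:
        driver.quit()              # always release browser resources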
Action Plan
Sunday, December 15, 2024 9:47 AM
1. Action Chains:
○ Action Chains in Selenium is a feature that allows automating complex user interactions such as mouse and keyboard events.
○ It is part of the ActionChains class in Selenium, designed to handle actions like clicking, dragging, hovering, and sending keyboard input.
○ Call the perform() method to execute all actions in the defined sequence
Common Methods:
○ click(): Clicks on a specified web element
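A sketch of a hover-then-click chain (the element IDs are assumptions):

    from selenium.webdriver import ActionChains
    from selenium.webdriver.common.by import By

    menu = driver.find_element(By.ID, "menu")
    item = driver.find_element(By.ID, "submenu-item")

    # Actions are queued in order and run only when perform() is called
    ActionChains(driver).move_to_element(menu).click(item).perform()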
2. Stocks Jargon:
a. Symbol:
○ The ticker symbol that uniquely identifies a stock on an exchange (e.g., AAPL for Apple)
b. Price:
○ The current trading price of the stock (in dollars)
○ This is the most recent price at which the stock was bought/sold on the market
c. Change:
○ The difference in the stock's current price compared to the previous day's closing price
○ +ve change indicates the price of the stock has increased from the previous day
○ -ve change indicates the price of the stock has decreased from the previous day
d. Volume:
○ The total number of shares of the stock traded during the current trading session
e. Market Cap:
○ The total market value of a company's outstanding shares (share price × number of shares)
○ Ex: If a company has 1,000 shares and each costs $10, then the company's market value is $10,000
f. PE Ratio:
○ Price to Earnings ratio: the share price divided by the company's earnings per share
3. Loading a Webpage:
○ We can use return document.readyState to retrieve the current loading state of the web page
○ This is commonly used in Selenium scripts to wait until the page is fully loaded before interacting with elements
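A sketch of waiting until the page reports itself fully loaded:

    from selenium.webdriver.support.ui import WebDriverWait

    WebDriverWait(driver, 10).until(
        lambda d: d.execute_script("return document.readyState") == "complete"
    )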
4. until():
○ It has a parameter method
○ This parameter must be a callable (function or lambda) that takes the WebDriver instance as input and returns a value
○ If the callable doesn't return a truthy value within the timeout period, a TimeoutException is raised
Steps Taken:
• Causes:
○ The ChromeDriver or WebDriver version might not match the installed browser version
• Solutions:
○ Ensure your WebDriver matches the browser version
○ Use a user-agent string to make requests appear like they are coming from a browser
2. Selenium Options:
• In Selenium, Options is used to customize and configure browser settings and behavior when automating browsers
• Allows to specify preferences, enable or disable features, and set options that are specific to the browser you are automating
• Each browser driver has its own Options class to provide these configurations
Common Usecases:
○ Running browser in headless mode (without GUI)
Key Methods:
○ add_argument(arg): Adds a command-line argument to the browser
Advantages:
○ Helps customize browser behavior according to your testing needs
○ Ensure the browser runs with specific settings each time, promoting consistency
○ Improves efficiency as we can run browsers in headless mode to save resources during testing
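A sketch of configuring Chrome options (the user-agent value is an arbitrary example):

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")             # run without a GUI
    options.add_argument("--window-size=1920,1080")
    options.add_argument("user-agent=my-scraper/1.0")  # custom user-agent string

    driver = webdriver.Chrome(options=options)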
3. getBoundingClientRect():
• A JavaScript function that returns an element's size and its position relative to the viewport; Selenium invokes it by executing JavaScript
• Useful when standard Selenium methods can't achieve certain tasks directly, such as precise scrolling or positioning
• This approach ensures that Selenium scripts can interact with dynamically loaded or partially visible elements more reliably
Advantages:
○ Helps in cases where the elements are not immediately visible in the viewport
○ Handles scenarios where the webpage layout changes, requiring precise adjustments
○ Works even when the standard Selenium methods (scrollIntoView(), actions.move_to_element(), etc.) don't work as intended
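A sketch of using getBoundingClientRect() through execute_script for precise scrolling (the element ID is an assumption):

    from selenium.webdriver.common.by import By

    element = driver.find_element(By.ID, "target")

    # Position of the element relative to the current viewport
    top = driver.execute_script(
        "return arguments[0].getBoundingClientRect().top;", element)

    # Scroll so the element lands at the top of the viewport
    driver.execute_script("window.scrollBy(0, arguments[0]);", top)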