
Programming for Analytics II

Session 5
Thursday September 19th, 2024

Joseph Siryani, Ph.D.


Adjunct Professor, Department of Decision Sciences
George Washington University – School of Business
Agenda

1  Session 4 Recap
   Recap of Session 4 key topics, and address any questions related to the lecture or hands-on.

2  Lecture 5
   Present Lecture 5 – Web Scraping, Basic HTML.

3  Hands-On
   Carry out instructor-led Python hands-on work, and case studies.

4  Q & A’s
   Questions & Answers.
1 Session 4 Recap
Different Types of Analytics

Source: www.gartner.com
Predictive Analytics
What is predictive analytics?

❖ Predictive analytics is an applied field that uses a variety of quantitative methods that make use of data in order to make predictions

❖ An applied field: finance, telecommunications, advertising, insurance, healthcare, education, entertainment, banking, and so on

❖ Uses a variety of quantitative methods: Bayesian inference, machine learning, deep learning, empirical findings, visualization

❖ That makes use of data: data is the raw material out of which predictive analytics models are built

❖ Example: we can build a predictive model that is able to "predict" whether a patient has the disease X using their clinical data
▪ now, when we gather the patient's data, the disease X is already present or not
▪ we are not "predicting" whether the patient will have the disease X in the future
▪ the model is giving an assessment (an educated guess) about the unknown event "the patient has disease X"
▪ sometimes, of course, the prediction will be about the future, though keep in mind that won't necessarily be the case
Regular Expressions

❖ A regular expression (shortened as regex) is a sequence of characters that specifies a search pattern in text

❖ Such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings

❖ Regular expression techniques are developed in theoretical computer science and formal language theory

❖ The concept of regular expressions began in the 1950s

❖ The American mathematician Stephen Cole Kleene formalized the concept of a regular language

❖ Most general-purpose programming languages support regex capabilities including Python, Java, C, C++, etc.

❖ Regular expressions are used in search engines, in search and replace dialogs of word processors and text editors
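A small illustration of the "find" and "find and replace" operations mentioned above, using Python's built-in `re` module (the sample text and the simplified e-mail pattern are made up for illustration; the pattern is not a fully general e-mail matcher):

```python
import re

text = "Contact us at support@example.com or sales@example.org."

# "Find": extract every e-mail-like token with a simple pattern
emails = re.findall(r"[\w.]+@[\w.]+\.\w+", text)
print(emails)  # ['support@example.com', 'sales@example.org']

# "Find and replace": mask the addresses
masked = re.sub(r"[\w.]+@[\w.]+\.\w+", "[hidden]", text)
print(masked)  # Contact us at [hidden] or [hidden].
```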
Regular Expressions

Why Are Regular Expressions Important?

❖ Accessing and processing large amounts of data is not the main problem

❖ Filtering the data is

❖ Regular expressions provide one kind of filter that can be used to extract relevant data from large chunks of data

❖ For example, consider an XML file containing 4GB of data on movies

▪ Regular expressions make it possible to query this XML text and find all movies that were filmed in Budapest in 2016

❖ Regular expressions are everywhere, and many small text-processing tasks become easier with them. These skills will come in handy in your IT engineering career.
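The movie-filtering idea above can be sketched as follows; in practice the lines would come from the large data file, but here they are a small invented sample so the example is self-contained:

```python
import re

# Hypothetical records from a large movie data file: title | city | year
records = [
    "Arrival | Budapest | 2016",
    "Inception | Paris | 2010",
    "Spy | Budapest | 2015",
    "Sing Street | Budapest | 2016",
]

# Keep only the lines that end with "Budapest | 2016"
pattern = re.compile(r"Budapest \| 2016$")
matches = [line for line in records if pattern.search(line)]
print(matches)  # the two films shot in Budapest in 2016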
Linear Regression


❖ Regression analysis can be used to develop an equation showing how the variables are related:

▪ The variable being predicted is called the dependent variable and is denoted by “y”

▪ The variables being used to predict the value of the dependent variable are called the independent variables

▪ The independent variables are denoted by “x”


Simple / Multiple Linear Regression


❖ Simple linear regression involves one independent variable and one dependent variable

❖ The relationship between the two variables is approximated by a straight line

❖ Regression analysis involving two or more independent variables is called multiple regression
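As a minimal sketch of simple linear regression, the line y = slope·x + intercept can be fitted by least squares with NumPy (the data points here are invented for illustration):

```python
import numpy as np

# Hypothetical data: x is the independent variable, y the dependent variable
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])  # roughly y = 2x

# Fit y = slope * x + intercept by least squares (degree-1 polynomial)
slope, intercept = np.polyfit(x, y, deg=1)
print(f"y = {slope:.2f}x + {intercept:.2f}")

# Use the fitted equation to predict y for a new x value
y_pred = slope * 6.0 + intercept
```

Multiple regression extends the same idea to several independent variables, fitting one coefficient per variable.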
2 Lecture 5
Web Scraping, Basic HTML
Web Scraping
❖ Web scraping (web harvesting or web data extraction) is a technique for extracting information
from websites

❖ Web scraping scripts tend to simulate a person viewing a website with a browser. With these
scripts you can connect to a web server and request a page, exactly as a browser would do

❖ The Web server will send back the page which you can then manipulate or extract specific
information from.
Document Object Model (DOM)
❖ Document Object Model (DOM) is an application programming interface (API) for valid HTML and
well-formed XML documents

❖ It defines the logical structure of documents and the way a document is accessed and
manipulated

❖ It is also considered a standard object representation of HTML


An HTML Document
An HTML Document Format
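The slide's example document is shown as an image; as a stand-in, the sketch below feeds a small made-up HTML document through Python's standard-library `html.parser` and prints each tag indented by its depth, which makes the DOM's tree structure visible:

```python
from html.parser import HTMLParser

# A minimal, made-up HTML document (every tag here is explicitly closed)
DOC = """<html>
<head><title>My Page</title></head>
<body>
<h1>Welcome</h1>
<p>Hello, <b>world</b>!</p>
</body>
</html>"""

class TreePrinter(HTMLParser):
    """Record each start tag indented by its depth in the DOM tree."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

p = TreePrinter()
p.feed(DOC)
print("\n".join(p.lines))
```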
Web Scraping Process

• Importing information from a website


Web Scraping Tools
Web Scraping Applications
❖ For Marketing: Lead Generation
❖ For Businesses / eCommerce
❖ Gathering data from multiple sources for analysis
❖ For Research
❖ Collecting data set for Machine Learning
❖ For building your own website
❖ For providing data to real estate agents
❖ Get user profiles
❖ Build health applications
❖ Build review applications
❖ Translation of documents
Python Requests Module

• Requests is an elegant and simple HTTP library for Python, built for human beings

• Some useful features


• Keep-Alive & Connection Pooling
• International Domains and URLs
• Sessions with Cookie Persistence
• Browser-style SSL Verification
• Streaming Downloads
• Connection Timeouts
Python Requests Module

• Getting Started

import requests

r = requests.get('https://quotes.toscrape.com/')  # issue an HTTP GET request
r.status_code   # HTTP status code of the response, e.g. 200 on success
r.encoding      # character encoding used to decode the body
r.text          # the response body as a string (the page's HTML)
Python Requests Module

• Exercise: extract quotes and author names from the website and save
  them in a CSV file named “quotes.csv”
• https://quotes.toscrape.com/
• e.g.:
  • Albert Einstein, Try not to become a man of success. Rather become a man of value
  • André Gide, It is better to be hated for what you are than to be loved for what you are not
Beautiful Soup

❖ Beautiful Soup is a Python library for pulling data out of HTML

❖ Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits
atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying
the parse tree
Beautiful Soup
• What is the major difference between Requests and Beautiful Soup?
Beautiful Soup
• Requests is responsible for fetching the data from the internet. BS4 is
responsible for parsing and extracting data from the HTML
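A minimal sketch of that division of labor. To keep the example self-contained, the HTML is an inline stand-in mimicking the `div.quote` / `span.text` / `small.author` structure of quotes.toscrape.com, rather than a page fetched with Requests; in the exercise you would instead parse `requests.get(url).text`:

```python
import csv
from bs4 import BeautifulSoup

# Inline stand-in for a page fetched with requests.get(...).text
html = """
<div class="quote">
  <span class="text">Try not to become a man of success.</span>
  <small class="author">Albert Einstein</small>
</div>
<div class="quote">
  <span class="text">It is better to be hated for what you are.</span>
  <small class="author">Andre Gide</small>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Walk every quote block and pull out author and text
rows = []
for quote in soup.find_all("div", class_="quote"):
    text = quote.find("span", class_="text").get_text(strip=True)
    author = quote.find("small", class_="author").get_text(strip=True)
    rows.append((author, text))

# Save the pairs to quotes.csv, as in the exercise
with open("quotes.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

print(rows)
```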
Scrapy

• Scrapy is an application framework for crawling websites and extracting structured data
• Why Scrapy?
Comparison
• Requests vs. Scrapy

  Requests                               Scrapy
  Sequential / Synchronous               Parallel / Asynchronous
  No parser                              Built-in parser
  Suitable for small projects and APIs   Suitable for small and large projects
Scrapy
• Extract the tags along with the Quotes and Author name
Scrapy
• Extract the year from the first page as well and pass it to the next
callback
Selenium

• Selenium is not a data scraping tool
• Selenium is generally used for browser automation
• Selenium is slow compared to Scrapy
3 Hands-On
Let the Hands-On begin!
4 Q & A’s
Questions & Answers!
Thank You! Questions?

joesiryani@gwu.edu

josephsiryani
