Commit 114adc7

added js-css files webpage extractor tutorial
1 parent dc5b082 commit 114adc7

4 files changed: 61 additions and 0 deletions

README.md

Lines changed: 1 addition & 0 deletions
@@ -68,6 +68,7 @@ This is a repository of all the tutorials of [The Python Code](https://www.thepy
     - [How to Make an Email Extractor in Python](https://www.thepythoncode.com/article/extracting-email-addresses-from-web-pages-using-python). ([code](web-scraping/email-extractor))
     - [How to Convert HTML Tables into CSV Files in Python](https://www.thepythoncode.com/article/convert-html-tables-into-csv-files-in-python). ([code](web-scraping/html-table-extractor))
     - [How to Use Proxies to Anonymize your Browsing and Scraping using Python](https://www.thepythoncode.com/article/using-proxies-using-requests-in-python). ([code](web-scraping/using-proxies))
+    - [How to Extract Script and CSS Files from Web Pages in Python](https://www.thepythoncode.com/article/extract-web-page-script-and-css-files-in-python). ([code](web-scraping/webpage-js-css-extractor))

 - ### [Python Standard Library](https://www.thepythoncode.com/topic/python-standard-library)
     - [How to Transfer Files in the Network using Sockets in Python](https://www.thepythoncode.com/article/send-receive-files-using-sockets-python). ([code](general/transfer-files/))

web-scraping/webpage-js-css-extractor/README.md

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
# [How to Extract Script and CSS Files from Web Pages in Python](https://www.thepythoncode.com/article/extract-web-page-script-and-css-files-in-python)
To run this:
- `pip3 install -r requirements.txt`
- Extracting `http://books.toscrape.com`'s CSS & Script files:
    ```
    python extractor.py http://books.toscrape.com/
    ```
    Two files will be created: `javascript_files.txt` with the JavaScript file URLs and `css_files.txt` with the CSS file URLs.
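
The URLs written to the two output files are absolute because `extractor.py` passes each `src` and `href` value through `urljoin`. As a minimal sketch of how that resolution behaves, the snippet below joins a few made-up references against the base URL from the example. The paths and the `cdn.example.com` host are hypothetical and are not taken from the real page.

```python
from urllib.parse import urljoin

# Base URL from the README example; every reference below is hypothetical.
page_url = "http://books.toscrape.com/"

references = [
    "static/js/app.js",            # relative to the page
    "/static/css/styles.css",      # relative to the site root
    "//cdn.example.com/lib.js",    # protocol-relative, inherits the scheme
    "https://example.com/app.js",  # already absolute, returned unchanged
]

resolved = [urljoin(page_url, ref) for ref in references]
for ref, absolute in zip(references, resolved):
    print(f"{ref!r:32} -> {absolute}")

# A reference found on a nested page is resolved against that page's directory.
nested = urljoin("http://books.toscrape.com/catalogue/page-2.html",
                 "../static/js/app.js")
print(nested)
```

Joining against the page URL, rather than concatenating strings, is what lets the same loop handle relative, root-relative, and already-absolute references.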

web-scraping/webpage-js-css-extractor/extractor.py

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin

import sys

# URL of the web page you want to extract
url = sys.argv[1]

# initialize a session
session = requests.Session()
# set the User-agent as a regular browser
session.headers["User-Agent"] = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"

# get the HTML content
html = session.get(url).content

# parse HTML using beautiful soup
soup = bs(html, "html.parser")

# get the JavaScript files
script_files = []

for script in soup.find_all("script"):
    if script.attrs.get("src"):
        # if the tag has the attribute 'src'
        script_url = urljoin(url, script.attrs.get("src"))
        script_files.append(script_url)

# get the CSS files
css_files = []

for css in soup.find_all("link"):
    if css.attrs.get("href"):
        # if the link tag has the 'href' attribute
        css_url = urljoin(url, css.attrs.get("href"))
        css_files.append(css_url)


print("Total script files in the page:", len(script_files))
print("Total CSS files in the page:", len(css_files))

# write file links into files
with open("javascript_files.txt", "w") as f:
    for js_file in script_files:
        print(js_file, file=f)

with open("css_files.txt", "w") as f:
    for css_file in css_files:
        print(css_file, file=f)
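
The `link` loop in `extractor.py` keeps every `<link>` tag that has an `href`, so favicons, canonical links, and feeds can end up in `css_files.txt` alongside real stylesheets. One possible refinement is to keep only links whose `rel` attribute contains the `stylesheet` token. The sketch below illustrates that check with the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra dependencies; this is a swapped-in parser, not what the commit uses. `AssetCollector`, `extract_assets`, and the sample HTML are hypothetical names and data introduced only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AssetCollector(HTMLParser):
    """Hypothetical helper: collects script and stylesheet URLs from HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.script_files = []
        self.css_files = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("src"):
            # external script: resolve its src against the page URL
            self.script_files.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            # rel is a space-separated token list, e.g. "alternate stylesheet"
            rel_tokens = (attrs.get("rel") or "").lower().split()
            if "stylesheet" in rel_tokens:
                self.css_files.append(urljoin(self.base_url, attrs["href"]))


def extract_assets(html, base_url):
    """Return (script_urls, stylesheet_urls) found in the given HTML string."""
    collector = AssetCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.script_files, collector.css_files


# Made-up page content; the paths are assumptions, not the real site's markup.
sample_html = """
<html><head>
<link rel="stylesheet" href="static/css/styles.css">
<link rel="shortcut icon" href="static/favicon.ico">
<script src="static/js/app.js"></script>
<script>console.log("inline script, no src");</script>
</head><body></body></html>
"""

scripts, styles = extract_assets(sample_html, "http://books.toscrape.com/")
print(scripts)
print(styles)
```

With BeautifulSoup the same idea should carry over: `rel` is treated as a multi-valued attribute and comes back as a list, so a check along the lines of `"stylesheet" in (css.attrs.get("rel") or [])` inside the existing loop would filter out the favicon in this sample.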

web-scraping/webpage-js-css-extractor/requirements.txt

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
requests
bs4
