
University of the Immaculate Conception

Jacinto St., Davao City

BA 315 - BUSINESS INTELLIGENCE AND ANALYTICS

Prepared by: RODIBEE B. ROJO
             ROLAN SEAN U. MAGAWAY
             RICHARD D. SISION

Submitted to: RAYMOND PIDOR

GROUP ASSIGNMENT 5: TEXT, WEB, AND SOCIAL MEDIA ANALYTICS

Instructions:

Examine how textual data can be captured automatically using Web-based technologies. Once
captured, what are the potential patterns that you can extract from these unstructured data sources?

Answer:

The technique of extracting data from the web automatically is called web scraping. It is used to capture large amounts of data from websites, data that would otherwise be accessible only through a web browser.

Web data is of great use to e-commerce portals, media companies, and research firms; data scientists, and sometimes even government agencies, also make use of these techniques.

There are several web-based technologies used to carry out this kind of data extraction. Some of them are as follows:
• DaaS (Data as a Service) providers
• In-house data extraction tools
• Freely available data extraction tools

How does this work?

To extract data from websites, we make use of bots called crawlers. A crawler always starts from a URL that we provide to it. Hence, the first thing to do in data extraction is to decide which website we want to crawl and pass its URL to the crawler as the “seed URL”.
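For illustration, a minimal Python sketch of this first step might look like the following (the seed URL is a placeholder, and the requests and BeautifulSoup libraries are assumed to be installed):

# A minimal sketch of the first step: fetching a seed URL and collecting links.
# The URL below is a placeholder, not a real crawl target.
import requests
from bs4 import BeautifulSoup

SEED_URL = "https://example.com"  # hypothetical seed URL

response = requests.get(SEED_URL, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Collect the links found on the seed page; these become the next URLs to crawl.
links = [a["href"] for a in soup.find_all("a", href=True)]
print(f"Found {len(links)} links on the seed page")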

Once the crawler has fetched the seed URL, the next step is to feed the bot the directions to follow while crawling the website. These directions help the crawler look for the data that is required.
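These directions can be expressed as a small set of rules, for example which link patterns the crawler may follow and which page elements hold the required data. A hedged sketch, with purely hypothetical URL patterns and CSS selectors:

# A sketch of the "directions" given to the crawler: which link patterns it may
# follow and which CSS selectors point at the required data points.
# The patterns and selectors are illustrative assumptions, not a real site's layout.
import re

CRAWL_RULES = {
    # Only follow links whose path looks like a product page.
    "follow": re.compile(r"^/products/\d+$"),
    # Selectors naming the data points to extract from each page.
    "fields": {
        "title": "h1.product-title",
        "price": "span.price",
        "description": "div.description",
    },
}

def should_follow(href: str) -> bool:
    """Return True if a link matches the crawl rules."""
    return bool(CRAWL_RULES["follow"].match(href))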

After the directions are set, we need to tell the crawler how deep into the website it should go so that it reaches the pages from which the data needs to be extracted. Once the crawler knows the depth of the site, its next job is to compile all of those pages and save them to a repository.
The crawler then scrapes through the saved pages and extracts only the required data points; at this stage we instruct it to pick out only the data that is needed and ignore the rest.
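A rough sketch of such a depth-limited crawl, saving pages to a local folder that stands in for the repository and then extracting only a couple of hypothetical data points:

# A sketch of a depth-limited crawl that saves fetched pages into a local
# "repository" folder and then picks out only the required data points.
# The selectors and on-disk layout are illustrative assumptions.
import pathlib
import requests
from bs4 import BeautifulSoup

REPO = pathlib.Path("pages")   # simple file-based repository for raw pages
REPO.mkdir(exist_ok=True)
MAX_DEPTH = 2                  # how deep into the website the crawler may go

def crawl(url, depth=0, seen=None):
    """Fetch a page, save it to the repository, and follow its links up to MAX_DEPTH."""
    seen = seen if seen is not None else set()
    if depth > MAX_DEPTH or url in seen:
        return seen
    seen.add(url)
    html = requests.get(url, timeout=10).text
    (REPO / f"page_{len(seen)}.html").write_text(html, encoding="utf-8")
    for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        # Kept simple here; a real crawler would also restrict links to the same domain.
        if a["href"].startswith(url):
            crawl(a["href"], depth + 1, seen)
    return seen

def extract(html):
    """Scrape only the required data points from a saved page and ignore the rest."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.select_one("h1")
    price = soup.select_one("span.price")   # hypothetical selector
    return {"title": title.get_text(strip=True) if title else None,
            "price": price.get_text(strip=True) if price else None}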

After all of this data has been captured at the crawler’s end, it needs to be checked and de-duplicated so that repeated records are removed from the freshly scraped data. This cleanup step also deletes unwanted HTML tags and any stray text that was scraped along with the relevant data.
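A small sketch of this cleanup step, using made-up records to show tag stripping and de-duplication:

# Cleanup sketch: strip leftover HTML tags from scraped fields and remove
# duplicate records. The sample records are invented for illustration.
from bs4 import BeautifulSoup

raw_records = [
    {"title": "<b>Blue Mug</b>", "price": "$4.99"},
    {"title": "Blue Mug", "price": "$4.99"},        # duplicate once tags are stripped
    {"title": "Red Mug", "price": "$5.49"},
]

def clean(record):
    """Remove unwanted HTML tags and surrounding whitespace from every field."""
    return {k: BeautifulSoup(v, "html.parser").get_text(strip=True)
            for k, v in record.items()}

cleaned = [clean(r) for r in raw_records]

# De-duplicate by turning each record into a hashable tuple of its items.
deduplicated = [dict(t) for t in {tuple(sorted(r.items())) for r in cleaned}]
print(deduplicated)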

The final step is to structure the data into a machine-readable format so that it can be imported into a database and used by the analytics system.
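A minimal sketch of this structuring step, writing hypothetical cleaned records out as JSON and CSV so a database or analytics system can import them (file names and fields are illustrative):

# Structuring sketch: export cleaned records as JSON and CSV for downstream import.
import csv
import json

records = [
    {"title": "Blue Mug", "price": "$4.99"},
    {"title": "Red Mug", "price": "$5.49"},
]

# Machine-readable JSON for an analytics system or document store.
with open("products.json", "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2)

# CSV for easy import into a relational database or spreadsheet.
with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(records)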

Hence, this is how data is captured automatically from websites using these web-based techniques; the extracted data can be saved in formats such as XML, JSON, CSV, HTML, and TXT.
