WP WebDataExtractionPlaybook


The Web Data Extraction Playbook
5 Steps to Leveraging the Open Web as a Data Source
About this Guide
The internet has become an undeniable force in our lives over the past few decades,
changing everything from the way we do our shopping to the way our brains are wired.
In recent years, marketing and tech companies have started eyeing the vast troves of online
information as a potential data source that can be mined for analytical insights, trends
and patterns.

Today, many companies are already seeing actual value from analyzing the massive
amounts of web content published daily - whether it’s media monitoring companies
looking to follow the discussion around their clients, cyber security providers combing the
dark web for criminal activity, or researchers looking for datasets to train AI and machine
learning algorithms.

However, there are challenges to overcome: in its raw form, the web is not a structured
dataset but rather a mass of unstructured data that is constantly changing form, growing
and increasing in complexity. To reap the business and academic benefits of analyzing this
data, one must first transform it into a machine-readable data feed from which they can
extract insights and value using existing analytic technology and tools.

This 5-step guide will help you form a solid game plan for your next web data initiative,
give you the tools to define and scope out the project according to your needs, and offer
some pointers on assessing existing technologies and techniques for accessing
web data.

For more information or to try out our web data API playground,
please visit webhose.io.
1. Outline Your Business Needs
Start with the questions you want to answer

A general rule of thumb for data analysis is to start with a question you want to answer.
Just poking at the data in the hopes that it drops some kind of insight into your lap tends
to be less than entirely effective - instead, it is always wiser to ask the practical question
first, and then find the best way to approach the data in order to find the answer.

The same applies to extracting data from the web: if you don’t know what you’re looking for,
you’re never going to find it. Some examples of the types of topics that can be examined
through the prism of open web data could include:

- Price fluctuations of products or groups of products in e-commerce websites
- Monitoring news and online discussions to identify trends, sentiments, or mentions of
  a certain person or entity
- Predicting stock behavior based on information published on the web

Each of these types of analysis poses its own challenges and needs to be approached
differently. For example, the e-commerce use case would require you to mainly look at
a subset of sites or even specific pages in those sites - but to constantly crawl and re-crawl
these pages while keeping the historical results easily accessible; whereas for media
monitoring you might want to focus on a larger breadth of websites in order to catch the gist
expressed in a wide array of forums, blogs and news articles, and to avoid missing any
important coverage.
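
The e-commerce case above comes down to re-crawling the same pages and appending each observation to a history instead of overwriting it. The following is a minimal sketch of that idea, assuming a hypothetical `PriceHistory` class and made-up URLs and prices; the fetching and parsing of the pages is deliberately left out.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class PriceHistory:
    """Keeps every observed price per product page, oldest first."""
    observations: dict = field(default_factory=dict)

    def record(self, url: str, price: float, seen_at: datetime) -> None:
        # Append rather than overwrite, so earlier crawls stay accessible.
        self.observations.setdefault(url, []).append((seen_at, price))

    def latest_change(self, url: str):
        """Return (previous, current) if the last crawl changed the price, else None."""
        history = self.observations.get(url, [])
        if len(history) < 2:
            return None
        previous, current = history[-2][1], history[-1][1]
        return (previous, current) if previous != current else None


# Hypothetical values standing in for two crawls of the same page.
history = PriceHistory()
history.record("https://shop.example/item/1", 19.99, datetime(2024, 1, 1))
history.record("https://shop.example/item/1", 17.49, datetime(2024, 1, 2))
print(history.latest_change("https://shop.example/item/1"))
```

A media monitoring project would probably look different: many more sources, with less emphasis on keeping a per-page history.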

To begin scoping out your project in an informed manner, try to answer the basic
questions: what kind of information are you looking for? Where is this information
typically published? How often are these websites refreshed, and how fresh does your
data need to be? How will you want to deliver the findings to your end users?
2. Define the Data Requirements
Avoid costly mistakes when extracting and transforming web data

Once you’ve understood the business use case or research question you want answered,
it’s time to dive into the more technical side of things. This is the point to think about
how the data needs to be structured in order to answer the questions you’re asking, and
how you would integrate it into your existing technology stack.

Certain analytical queries you want to run might create prerequisites in terms of the data
structure, which should be addressed in advance. There might be limitations around file
formats and databases stemming from data visualization tools you plan to use. Text analytics
and NLP sampling might benefit more from a schema-less data structure, while a SQL
database might be a better fit for business intelligence analysis.
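
To make the trade-off concrete, the sketch below stores the same extracted article twice: once as a schema-less JSON document and once as a row in a fixed SQL table. The field names (`url`, `title`, `published`, `entities`) and the sample values are illustrative assumptions, not a format prescribed by this guide.

```python
import json
import sqlite3

# A hypothetical extracted article; field names are assumptions for illustration.
article = {
    "url": "https://news.example/story",
    "title": "Example headline",
    "published": "2024-01-01",
    "entities": ["Acme Corp", "Jane Doe"],  # nested data is natural in a document
}

# Schema-less: the whole record, nested fields included, is kept as one document.
document = json.dumps(article)

# Relational: columns are fixed up front, which suits BI-style aggregate queries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (url TEXT PRIMARY KEY, title TEXT, published TEXT)")
conn.execute(
    "INSERT INTO articles VALUES (?, ?, ?)",
    (article["url"], article["title"], article["published"]),
)

# Count articles per publication date, the kind of question BI tools ask.
rows = conn.execute(
    "SELECT published, COUNT(*) FROM articles GROUP BY published"
).fetchall()
print(rows)
print(json.loads(document)["entities"])
```

Note that the nested `entities` list has no column in the fixed table; fitting it in would take a second table or a serialized field. That is the kind of modeling decision worth making before extraction starts.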

It’s important to start thinking about these things ahead of time, because they can deeply
impact the types of tools and techniques you use to extract data from the web. In some
cases this won’t be a big deal and you’ll be able to massage the data into whichever format
you need after it is extracted, but taking these things into account beforehand can save
you a lot of trouble down the road.
For example, and as we will see further on, commercial web scraping tools might be
better suited when you need customized parsing for extracted datasets on a smaller
scale, whereas web data feeds as a service could work better when you’re running
a large scale operation in which the data requirements remain fairly consistent.
Using the wrong tool for the job won’t necessarily result in complete failure, but it
will probably be costly and inefficient.

Before making any investment in web data extraction, make sure to have a
comprehensive understanding of the technical considerations in terms of the way you
will want the data to be structured, modeled and integrated into your IT infrastructure.
3. Choosing the Right Tool for the Job
You can get web data manually, through scraping tools or as a
service - weigh the pros and cons of each against your project goals.

Now that you’ve covered both the business and technical aspects, you should already have
a firm understanding of both what you need to get started and what your endgame
would look like. The next step is to start considering the various tools, technologies and
techniques that are available to get the data you need.

There are dozens of free, freemium and premium tools that might be relevant for your web data
extraction project, but we can schematically divide them into three subgroups:

DIY for Complete Control


The first option, which might be appealing to the more gung-ho developers among us, would be
to simply write your own web crawler, scrape whatever data you need and run it as often as you
need. You could write such a crawler yourself from scratch in Python, PHP or Java, or use one of
many open source options.
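
As a rough illustration of what writing your own crawler involves, here is a minimal breadth-first crawler sketch in Python using only the standard library. The `crawl` function, the `LinkParser` class and the in-memory `PAGES` dictionary are hypothetical names invented for this example; the dictionary stands in for real HTTP requests so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl; `fetch` maps a URL to HTML text, or None if unavailable."""
    seen, queue, visited = {start_url}, deque([start_url]), []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# A tiny in-memory "web" so the sketch runs without network access.
PAGES = {
    "https://site.example/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://site.example/a": '<a href="/b">B</a>',
    "https://site.example/b": '<a href="/">home</a>',
}
print(crawl("https://site.example/", PAGES.get))
```

A production crawler would also need real HTTP fetching, politeness rules such as robots.txt handling and rate limiting, error handling, and storage. These are part of the developer effort discussed below.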

The main advantage of this approach is the level of flexibility and customizability you have:
you can define exactly what data you want to grab, at what frequency, and how you would like to
parse the data in your own databases. This allows you to tailor your web extraction solution to
the exact scope of your initiative. If you’re trying to answer a specific, relatively narrow question,
or monitor a very specific group of websites on an ad-hoc basis, this could be a good and
simple solution.

However, manual crawling and scraping is not without its downsides, especially when it comes
to more complex projects.

If you’re looking to understand wider trends across a large group of sites, some of which
you might not even know you’re looking for in advance, DIY crawling becomes much
more complex - requiring larger investments in computational resources and developer
hours that could be better spent on the core aspects of your business.

To learn more about the pros and cons of building your own web crawling infrastructure,
check out our Build vs Buy comparison guide.

Scraping Tools for Ad-Hoc Analysis


Another common technique to turn websites into data is to purchase a commercial scraping
tool and use it to crawl, extract and parse whichever areas of the web you need for your project.
There are dozens of scraping tools available, with features and pricing varying wildly - from simple
browser-based tools that mimic a regular user’s behavior to highly sophisticated visual and
AI-based products.

Scraping tools remove some of the complications of the DIY approach since your developers will
be able to focus on their (and your company’s) core competencies rather than spending precious
time and resources on developing crawlers. However, they are still best suited for an ad-hoc
project - i.e., scraping a specific group of websites in specific time intervals, to answer a specific
set of questions. Scraping tools are very useful for these types of ad-hoc analyses, and they
have the added advantage of generally being easy to use and allowing you to customize the way
the extracted data is parsed and stored.

On the other hand, if you’re looking to set up a larger scale operation in which the focus is not
on custom parsing but rather on comprehensive coverage of the open web, frequent data refresh
rates and easy access to massive datasets, web scraping tools are less viable as you run into
several types of limitations:
- By definition, web scraping tools only grab the data from whichever website you’ve “pointed”
  them at. If you don’t know exactly where to look in advance, you could miss out on important
  data - e.g., in a media monitoring use case where you’re not aware of every possible
  publication that could mention your clients.
- Advanced scraping tools are built for customized extraction, and often have very advanced
  capabilities in terms of identifying and parsing the data for analytical usage. However, this
  often manifests itself in pricing models that are based on the number of sites scraped -
  resulting in ballooning costs for larger projects.
- Developer overhead still exists in the form of managing lists of crawled sites and maintaining
  the scraping tools.
- Since the data is not collected before you activate the scraping tool, you won’t have access
  to historical data.

Modern scraping tools offer powerful solutions for ad-hoc projects, giving you highly
sophisticated means of grabbing and parsing data from specific websites. However,
they are less scalable and viable when it comes to building a comprehensive monitoring
solution for a large “chunk” of the world wide web; and their advanced capabilities
could become overkill in terms of pricing and time-to-production when all you really
need is access to web data in machine-readable format.

Web Data as a Service for Scalable Operations


The third option is to forego crawling, scraping and parsing entirely and rely on a data as
a service (DaaS) provider. In this model you would purchase access to clean, structured and
organized data extracted by the DaaS provider, enabling you to skip the entire process of building
or buying your own extraction infrastructure and focus on the analysis, research or product
you’re developing.

In this scenario you would generally have less ability to apply customized parsing on the data
as it is extracted, instead relying on the data structure dictated by the provider. Additionally,
you would need to contact your DaaS provider if you need to add sources (rather than simply
point your purchased or in-house scraping tool at whichever source you’re interested in).
These factors make web data as a service less viable for ad-hoc projects that require very
specific sites to be extracted into very specific data structures.
However, for larger operations, web data as a service offers several unique advantages in
terms of scale and ease of development:

- Working with a proprietary provider allows you to leverage best-in-class crawling and scraping
  technologies, rather than having your own developers try to re-invent the wheel.
- A reliable web DaaS provider will offer comprehensive coverage, enabling you to immediately
  access data from any relevant source on the web. Smart indexing and crawling enable new
  sources to be added automatically as content spreads across the web, rather than waiting for
  you to “point” at them.
- Structured data is easily accessible via an API call, making integration dead simple. To see
  how this works, you can check out an example of the webhose.io web data API.
- The ability to consume data on-demand gives you more flexibility to launch and grow your
  data-driven operations without making any large upfront investments.
- Access to comprehensive coverage of the web without having to maintain your own lists of
  sites to crawl.
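
To give a feel for what consuming structured data via an API call can look like, here is a minimal sketch. The endpoint `https://api.example.com/search`, the parameter names (`token`, `q`, `format`) and the `posts` response shape are all hypothetical placeholders, not the webhose.io API; check your provider's documentation for the real ones. The response is a canned string so the example runs without network access.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names; consult your provider's docs for real ones.
BASE_URL = "https://api.example.com/search"


def build_query_url(token: str, query: str) -> str:
    """Build the request URL for a keyword query against the hypothetical API."""
    return f"{BASE_URL}?{urlencode({'token': token, 'q': query, 'format': 'json'})}"


def extract_titles(response_body: str) -> list:
    """Pull article titles out of a JSON response with an assumed `posts` array."""
    payload = json.loads(response_body)
    return [post["title"] for post in payload.get("posts", [])]


# A canned response standing in for what an HTTP GET on the URL might return.
SAMPLE_RESPONSE = '{"posts": [{"title": "First story"}, {"title": "Second story"}]}'

print(build_query_url("YOUR_TOKEN", "data extraction"))
print(extract_titles(SAMPLE_RESPONSE))
```

Because the data arrives already structured, the integration code is mostly query building and field access, with no crawling or parsing logic of your own.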

These and other advantages make web data as a service the best solution for media
monitoring, financial analysis, cyber security, text analytics and other use cases that
center around fast access to comprehensive, frequently updated data feeds.

Learn more about on-demand web data as a service at the webhose.io website.

Feature Comparison of Web Data Extraction Methods

                    DIY                 Scraping Tools    Data as a Service

Typical Scale       Small               Small             Large

Custom Parsing      Yes                 Yes               No

Historical Data     No                  No                Yes

Price               Project-dependent   Tool-dependent    Based on usage

Development Costs   High                Low               Low

Coverage            Low                 Low               High


4. Launch Your Initiative - Start Small
Increase your investment and reliance on web data iteratively
and according to the value it generates for your business.

Having defined your requirements and examined the various tools available to you, it’s time
to start turning the web into your big data playground. Go ahead and ask a question, then see
how long it takes you to find the answers and deliver them to your end users.

Whichever product, technique or vendor you choose to go with, it’s always a good idea
to start small and scale up gradually. You should try to minimize your upfront investment
by going with a vendor that offers a free trial as well as flexible pricing plans, allowing
you to experiment and see value before making large, long-term contractual
commitments.

As you grow more comfortable with your web data extraction methods and providers, you can
grow your operations further - offering additional services based around the data, new types of
reports, etc. - but keep a close eye on your costs and time to production.

As web data becomes a core part of your business model, you’ll want to develop a strong
relationship with the vendors you’re working with, whether they’re providing tools, technology
or access to data feeds. See that there is adequate documentation and that support tickets
are answered in a timely fashion. Your clients have high expectations - so should you!
5. Ensure Ongoing ROI
Stay one step ahead in the race to achieve maximal coverage
of the web through a mix of methods and technologies.

Finally, you have your operation up and running: you have access to clean, organized and
structured data extracted from the web, and you’re comfortable with the update frequency
and the level of coverage; additionally, everything is integrated nicely between your own back
and front-end systems, and you’re happy with the way reports and insights are delivered to
end users.

However, when it comes to web data, the possibilities are endless - the web itself is evolving at
a rapid rate, with the amount of content growing exponentially. Hence it’s important to always
stay on top of developments on the web (such as the growth in dark web data) and seek new
possibilities to capture more data, or to increase your coverage - often using a mix of methods
rather than relying on a single vendor or tool.

To learn more, check out The Race to Achieve 100% Coverage of the Open Web.
You can find a detailed calculation of the costs and expected ROI of various web data
solutions in The Crawled Web Data Dilemma: Build or Buy?
About Webhose.io
Webhose.io provides on-demand access to structured web data anyone can consume.
We empower you to build, launch, and scale big data operations - whether you’re a budding
entrepreneur working out of the garage, a researcher at a university, or an executive at the helm
of a Fortune 500 company.

See powerful web data feeds in action

Schedule a demo with one of our experts today
