URLs and selectors are outdated #10

Open
endolith opened this issue Jan 15, 2023 · 5 comments

@endolith

endolith commented Jan 15, 2023

    domain = 'https://www.mountainproject.com'

    # URL should be preceded by a /
    # e.g. /destinations or /v/STATENAME/ID
    relativeURL = '/v/hawaii/106316122'

    start_urls = [domain + relativeURL]
    allowed_domains = ['mountainproject.com']
    rules = [
        Rule(
            LinkExtractor(allow='v/(.+)'),
            callback='parse',
            follow=True
        )
    ]

The /v/ URLs redirect to a new scheme:

2023-01-15 11:19:22 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.mountainproject.com/viewer-old/106316122> from <GET https://www.mountainproject.com/v/hawaii/106316122>
2023-01-15 11:19:22 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.mountainproject.com/area/106316122/hawaii> from <GET https://www.mountainproject.com/viewer-old/106316122>

        if self.relativeURL != '/destinations':
            # use the following links variable if testing from an individual state page (e.g. WA states routes)
            links = response.css('#viewerLeftNavColContent a[target="_top"] ::attr(href)').extract()

<div id="viewerLeftNavColContent" class="rspCollapsedContent"> was present in old pages: https://web.archive.org/web/20161122233413/http://www.mountainproject.com/v/alabama/105905173

but no longer.

        else:
            # use the following links variable if testing from the homepage
            links = response.css('span.destArea a::attr(href)').extract()

<span class="destArea"> was present on old homepage:

https://web.archive.org/web/20171016232313/https://www.mountainproject.com/

but no longer.

@endolith
Author

New URLs probably need something like this:

    relativeURL = '/area/106316122/hawaii'

    start_urls = [domain + relativeURL]
    allowed_domains = ['mountainproject.com']
    rules = [
        Rule(
            LinkExtractor(allow='area/(.+)'),
            callback='parse',
            follow=True
        )
    ]
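
As a quick check of the pattern above: Scrapy's LinkExtractor treats `allow` as a regular expression searched against each absolute URL, so the proposed pattern can be tried against the URLs from the redirect log using plain `re`. This is only a sketch. The helper name `area_id` and the stricter `/area/(\d+)` pattern are illustrative assumptions, not part of the spider.

```python
import re

# URLs taken from the redirect log in this issue.
OLD_URL = 'https://www.mountainproject.com/v/hawaii/106316122'
NEW_URL = 'https://www.mountainproject.com/area/106316122/hawaii'

# Pattern proposed above for the new URL scheme.
PROPOSED = re.compile(r'area/(.+)')

# Hypothetical stricter variant: capture only the numeric ID.
AREA_ID = re.compile(r'/area/(\d+)(?:/|$)')


def area_id(url):
    """Return the numeric area ID from a new-style URL, or None."""
    match = AREA_ID.search(url)
    return match.group(1) if match else None


if __name__ == '__main__':
    # The proposed pattern should accept the new URL and reject the old one.
    print(bool(PROPOSED.search(NEW_URL)))
    print(bool(PROPOSED.search(OLD_URL)))
    print(area_id(NEW_URL))
```

Note that a pattern limited to `area/` only follows area pages; if other page types need to be crawled, they would need their own patterns.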

@endolith
Author

New state pages have

    <div class="col-md-3 left-nav float-md-left mb-2">
        <div class="mp-sidebar">

So probably links = response.css('.left-nav a::attr(href)').extract()?

And on the main page it has

    <div class="col-xs-12">
        <div class="title-with-border-bottom mb-2">
            <h2 class="inline-block mr-half">Rock Climbing Guide</h2>
        </div>
        <div class="row" id="route-guide">

So probably links = response.css('div#route-guide a::attr(href)').extract()?

Still doesn't work, though.

@endolith
Author

endolith commented Jan 15, 2023

DEBUG: Filtered offsite request to 'www.mountainproject.comhttps': <GET https://www.mountainproject.comhttps//www.mountainproject.com/map/106316122/hawaii>

yield scrapy.Request(url, callback=self.parse_coordinates)
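
The filtered hostname above looks like what happens when the domain is prepended to an href that is already absolute. Assuming that is the cause (the code that builds `url` is not shown here), joining against the page URL instead of concatenating handles both relative and absolute hrefs. The sketch below uses the standard library's `urljoin`; inside a Scrapy callback, `response.urljoin(href)` does the same against the response URL. The function name `absolute_url` is made up for illustration.

```python
from urllib.parse import urljoin

PAGE_URL = 'https://www.mountainproject.com/area/106316122/hawaii'


def absolute_url(page_url, href):
    """Resolve href against the page it was found on.

    Relative hrefs are resolved against page_url; an absolute href
    replaces it, so the domain is never duplicated.
    """
    return urljoin(page_url, href)


if __name__ == '__main__':
    # Absolute href, like the map link in the log above.
    print(absolute_url(PAGE_URL, 'https://www.mountainproject.com/map/106316122/hawaii'))
    # Root-relative href, like the old /v/ links.
    print(absolute_url(PAGE_URL, '/v/hawaii/106316122'))
```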

@endolith
Author

I'm not sure why the original code says this:

        if 'Location' not in response.css('#rspCol800 div.rspCol table tr:nth-child(2) td ::text').extract()[0]:
            return response.css('#rspCol800 div.rspCol table tr:nth-child(3) td ::text').extract()[1].strip()
        else:
            return response.css('#rspCol800 div.rspCol table tr:nth-child(2) td ::text').extract()[1].strip()

In the case that it doesn't list Location:, then what is it getting instead?

https://web.archive.org/web/20161115082401/http://www.mountainproject.com/v/central-pillar-of-frenzy/105862930

for example.

(Now in the new layout it's "GPS:", though.)
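
One way to sidestep the row-index guessing above is to look up the row by its label instead of its position. This is only a sketch: it assumes the cell text of each table row has already been extracted into a list of strings (for example, one `::text` query per `tr`), and the sample rows are made up, not copied from a real page. The function name `find_labelled_value` is hypothetical.

```python
def find_labelled_value(rows, labels=('Location', 'GPS')):
    """Return the value cell of the first row whose label cell matches.

    rows   -- list of rows, each a list of cell text strings
    labels -- substrings to look for in the first non-empty cell
    """
    for cells in rows:
        # Drop whitespace-only text nodes and trim the rest.
        cleaned = [c.strip() for c in cells if c.strip()]
        if len(cleaned) >= 2 and any(label in cleaned[0] for label in labels):
            return cleaned[1]
    return None


if __name__ == '__main__':
    # Made-up rows for illustration only.
    sample_rows = [
        ['Type: ', 'Trad'],
        ['\n', 'GPS: ', ' 21.3, -157.8 '],
    ]
    print(find_labelled_value(sample_rows))
```

Searching by label means the same code works whether the row is second or third, and whether the page says "Location:" or "GPS:".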

@endolith
Author

(I've got it working, but I made a bunch of clunky changes with the help of ChatGPT that I don't fully understand)
