I have been working with requests, keeping things as simple as possible: scrape the webpage, put the values into a dict, and print the payload depending on whether we found new values or not.

import json
import re

import requests
from selectolax.parser import HTMLParser

payload = {
    "store": "Not found",
    "name": "Not found",
    "price": "Not found",
    "image": "Not found",
    "sizes": []
}

response = requests.get("https://shelta.se/sneakers/nike-air-zoom-type-whiteblack-cj2033-103")

if response.ok:

    bs4 = HTMLParser(response.text)

    product_name = bs4.css_first('h1[class="product-page-header"]')
    product_price = bs4.css_first('span[class="price"]')
    product_image = bs4.css_first('meta[property="og:image"]')

    if product_name:
        payload['name'] = product_name.text().strip()

    if product_price:
        payload['price'] = "{} Price".format(product_price.text().strip())

    if product_image:
        payload['image'] = product_image.attrs['content']

    try:
        attribute = json.loads(
            '{}'.format(
                re.search(
                    r'var\s*JetshopData\s*=\s*(.*?);',
                    response.text,
                    re.M | re.S
                ).group(1)
            )
        )

        payload['sizes'] = [
            f'{get_value["Variation"][0]}'
            for get_value in attribute['ProductInfo']['Attributes']['Variations']
            if get_value.get('IsBuyable')
        ]

    except Exception:  # noqa
        pass

    del bs4
    print("New payload!", payload)

else:
    print("No new payload!", payload)

The idea is mostly that if we find the values, we replace them in the dict, and if we don't find them, we simply skip them.

Things that concern me:

  1. What happens if one of the if statements fails? By "fails" I mean, for example, that I cannot scrape product_image.attrs['content'] - that would raise an exception that stops the script, which I do not want.
  2. I'm pretty sure that using except Exception: # noqa is bad practice, and I do not know what the best practice for handling it would be.

I would appreciate all kinds of tips and tricks on how I can improve my knowledge of Python!

  • If you do not add the actual link, I cannot help you improve your DOM navigation. If the link does not include an authentication token, there is no security risk.
    – Reinderien
    Commented Apr 10, 2021 at 16:54
  • Hi again @Reinderien - I have now edited in the correct link :) My bad for that. It should be fine and you should be able to run it as well! EDIT: I can confirm that it does work when copy-pasting. Make sure to pip install selectolax and requests.
    Commented Apr 10, 2021 at 16:55

2 Answers

  • You have not described the purpose of payload. If this is a JSON payload going to some other web service, Not found is a poor choice for a missing value and None would be more appropriate
  • You never use payload['store']
  • Your selector h1[class="product-page-header"] uses CSS attribute-selector syntax (it is not XPath); in class-selector syntax it can just be h1.product-page-header
  • I think your regex for JetshopData is unnecessarily permissive. If the format breaks, you should be notified by a parse failure rather than silently letting a changed design through - since the outer dictionary format will likely not be the only thing to change
  • You should constrain your regex to only looking in <script> values rather than through the entire HTML document
  • '{}'.format is redundant
  • Tell requests when you're done with the response via context management; conversely, there is no benefit to del bs4 if you have proper method scope
  • It's likely that you should be looking at all variations instead of just the first
  • Don't blanket-except. If there is a specific exception that you want to ignore, ignore it in a narrow manner.
  • Separate your scraping code from your payload formation code

The following suggested code uses BeautifulSoup because it's what I'm more familiar with and I didn't want to bother installing selectolax:

import json
import re
from dataclasses import dataclass
from pprint import pprint
from typing import Optional, List

import requests
from bs4 import BeautifulSoup


@dataclass
class Product:
    name: Optional[str]
    price: Optional[str]
    image: Optional[str]
    sizes: List[str]

    @staticmethod
    def get_sizes(doc: BeautifulSoup) -> List[str]:
        # The sizes live in a JSON blob assigned to JetshopData inside an
        # inline <script>; anchoring the pattern to the whole tag makes a
        # page redesign fail loudly as a non-match instead of slipping by.
        pat = re.compile(
            r'^<script>var JetshopData='
            r'(\{.*\})'
            r';</script>$',
        )
        for script in doc.find_all('script'):
            match = pat.match(str(script))
            if match is not None:
                break
        else:
            # for/else: the loop found no matching <script>, so no sizes
            return []

        data = json.loads(match[1])
        return [
            variation
            for get_value in data['ProductInfo']['Attributes']['Variations']
            if get_value.get('IsBuyable')
            for variation in get_value['Variation']
        ]

    @classmethod
    def from_page(cls, url: str) -> Optional['Product']:
        with requests.get(url) as response:
            if not response.ok:
                return None
            doc = BeautifulSoup(response.text, 'html.parser')

        name = doc.select_one('h1.product-page-header')
        price = doc.select_one('span.price')
        image = doc.select_one('meta[property="og:image"]')

        # `x and f(x)` stays None when the selector found nothing,
        # so a missing field never raises here.
        return cls(
            name=name and name.text.strip(),
            price=price and price.text.strip(),
            image=image and image['content'],
            sizes=cls.get_sizes(doc),
        )

    @property
    def payload(self) -> dict:
        return {
            "name": self.name or "Not found",
            "price": self.price or "Not found",
            "image": self.image or "Not found",
            "sizes": self.sizes,
        }


def main():
    product = Product.from_page("https://shelta.se/sneakers/nike-air-zoom-type-whiteblack-cj2033-103")
    if product is None:
        print('No new payload')
    else:
        print('New payload:')
        pprint(product.payload)


if __name__ == '__main__':
    main()
  • Hi! I will probably take a day to understand the function, but it seems really awesome, I would say! Really love it so far! However, I do have a small concern about your image=image and image['content'] - basically, if we do not find image['content'] then it would fail and throw a KeyError exception. Maybe we need to do something like "content" in image.attrs?
    Commented Apr 10, 2021 at 21:12
  • That is correct, but what about if it finds image but doesn't find the content attribute? (In our case it will work for sure, but what if there are changes on the webpage where they don't use content?)
    Commented Apr 10, 2021 at 21:38
  • If there are changes on the web page, you're going to have to redesign your scraping anyway. Such is the hazard of scraping - you are not guaranteed to be working with a stable interface.
    – Reinderien
    Commented Apr 10, 2021 at 22:16
  • Right, I don't think there is an actual way to detect whether a webpage has been changed, for example. I could be wrong of course, but I will write a better comment here tomorrow since it's midnight here :) But I can already see there is a huge improvement for sure!
    Commented Apr 10, 2021 at 22:21
  • Thank you very much for the help, by the way! There is very nice stuff here that I would never have thought of, and it was really interesting to see how you managed to do it so nicely!
    Commented Apr 12, 2021 at 8:10

Naming confusion

First of all, there is something confusing in your code: bs4 does not actually represent an instance of BeautifulSoup. Pretty much any Python code based on BeautifulSoup has a line like this:

from bs4 import BeautifulSoup

But in your code, bs4 represents something else: HTMLParser. So I would use a different name, to make it clear that it is indeed not BS, and that the available methods are therefore completely different. I was confused by this:

bs4.css_first

because I am not familiar with such a method being present in BS4, and for good reason. The documentation shows that it behaves like find in BS and returns None if no matching element is found.
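
For example, naming the variable after what it actually is avoids the confusion entirely; a minimal sketch (tree is just an illustrative name):

tree = HTMLParser(response.text)  # a selectolax parse tree, not BeautifulSoup
product_name = tree.css_first('h1.product-page-header')  # None if nothing matches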

You've raised one obvious issue: what happens if one (or more) of the values you are attempting to retrieve is not there? Obviously you should test the results, because sooner or later the website design will change and your code will stop working as expected. Simply test that none of your values is None, using any() or, more straightforwardly:

if None in (product_name, product_price, product_image):

If this condition is met, then your code should stop. In this code you are retrieving only 3 elements, but you could have used a loop (handy if your list is going to be longer). All you need is a basic dict that contains your selector queries, like:

product_search = {
    "name": 'h1[class="product-page-header"]',
    "price": 'span[class="price"]',
    "image": 'meta[property="og:image"]'
}

And if any of the selector queries does not succeed, you break out of the loop immediately, as sketched below.
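
A minimal sketch of that loop, assuming tree is the selectolax parse tree from the question (results is just an illustrative name):

results = {}
for field, query in product_search.items():
    node = tree.css_first(query)  # css_first returns None when nothing matches
    if node is None:
        print(f"No match for {field!r} - the page layout may have changed")
        break
    results[field] = node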

Exception handling

Something that one should never do:

except Exception:  # noqa
    pass

If an exception occurs, at least log it or print it. Do not discard it, do not hide it. This practice can only lead to unreliable applications that are hard to debug. If you can, avoid the exception in the first place: for instance, you can easily anticipate that your selector queries may fail, and it's easy to check the results.
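
If you do catch something, catch it narrowly and record it. A minimal sketch using the standard logging module (the exception types and the extract_sizes helper are illustrative, not from the original code):

import logging

try:
    sizes = extract_sizes(response.text)  # hypothetical helper that parses JetshopData
except (KeyError, json.JSONDecodeError) as e:
    logging.warning("Could not extract sizes: %s", e)
    sizes = []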

There is one family of exceptions you could handle: requests.exceptions. It is perfectly possible that the website will be down sometimes, or that your network connection is down, or that DNS is failing.

In general, it is good to have one generic exception handler for the whole application, and in certain sections of the code you can add limited handling for specific types of exceptions. Thus, requests.get should be in a try block like:

try:
    response = requests.get("https://shelta.se/sneakers/nike-air-zoom-type-whiteblack-cj2033-103")
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")  # report the error...
    raise SystemExit(1)            # ...and stop execution
Note that in this example I am catching the base exception, but you could handle the different types of requests exceptions separately (e.g. requests.exceptions.Timeout), as sketched below.
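
A sketch of that finer-grained handling, assuming url holds the product page address (the timeout value and messages are illustrative; raise_for_status additionally turns HTTP error statuses into exceptions):

try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise on 4xx/5xx status codes
except requests.exceptions.Timeout:
    print("The request timed out")
except requests.exceptions.HTTPError as e:
    print(f"Bad HTTP status: {e}")
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")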

  • Hi! Sorry for the late response! I just got to it. I totally agree with most of the points, and there is stuff I would never have thought about. Regarding the part about a webpage getting a new design, what to do? You wrote if None in (product_name, product_price, product_image): - if I understood it correctly, let's say we do 10 requests in a row and on the 7th scrape they change the web element to something else. Do you mean that by using that code to see if all of these 3 are None, we can tell there has been a web change?
    Commented Apr 11, 2021 at 10:58
  • Simply put, the idea is to check that every element returns a meaningful value, at least not None. If any of the expected fields fails to yield a value, then you know there is something wrong. It could be a site redesign or a bug in your program. In addition to handling requests exceptions, you should also check the status code. It should be 200. If it is 404, 401 or 500, then you have another problem. Last but not least, I recommend spoofing the user agent, otherwise it's obvious to the website that you are a bot. Some sites might even reject you just for that reason.
    – Kate
    Commented Apr 11, 2021 at 16:41
  • Very interesting. Yup, regarding the exception, I will work on that of course. I'm still a bit unsure about the check that an element returns a meaningful value. Do we want to check that all of them have a value, or that at least one of, for example, product_name, product_price, product_image has a value?
    Commented Apr 12, 2021 at 8:11
