1
\$\begingroup\$

I have been working on where I want to improve my knowledge to do a safe JSON scraping. By that is that I could pretty much do a ugly workaround and wrap it into a try-except but that is not the case. I would like to know where it goes wrong and how I can improve the code I have been working with:

import json
from json import JSONDecodeError

import requests

payload = {
    "name": None,
    "price": None,
    "sizes": None
}

headers = {'origin': 'https://www.aplace.com', 'Content-Type': 'application/json'}
with requests.get("http://api.aplace.com/sv/dam/aaltoili-iso-mehu-gren-red-sand", headers=headers) as response:
    if not response.ok:
        print(f"Requests NOT ok -> {payload}")

    try:
        # They added extra comma to fail the JSON. Fixed it by removing duplicate commas
        doc = json.loads(response.text.replace(",,", ",")).get("productData", {})
    except (JSONDecodeError, KeyError, ValueError) as err:
        print(f"Not able to parse json -> {err} -> Payload: {payload}")
        raise

if doc:

    product_name = doc.get('name', {})
    product_brand = doc.get('brandName', {})
    product_price = doc.get('markets', {}).get('6', {}).get('pricesByPricelist', {}).get('3', {}).get('price', {})

    # Added exception incase the [0] is not able to get any value from 'media'
    try:
        product_image = doc.get('media', {})[0].get('sources', {}).get('full', {}).get('url', {})
    except KeyError:
        product_image = None

    product_size = doc.get('items', {})

    try:

        if product_name and product_brand:
            payload['name'] = f"{product_brand} {product_name}"

        if product_price:
            payload['price'] = product_price.replace('\xa0', '')

        if product_image:
            payload['image'] = product_image

        if product_size:

            count = 0
            sizes = []
            sizesWithStock = []
            for value in product_size.values():
                if value.get('stockByMarket', {}).get('6', {}):
                    count += value['stockByMarket']['6']
                    sizes.append(value['name'])
                    sizesWithStock.append(f"{value['name']} - ({value['stockByMarket']['6']})")

            payload['sizes'] = sizes
            payload['sizesWithStock'] = sizesWithStock
            payload['stock'] = count

    except Exception as err:
        print("Create notification if we hit here later on")
        raise

print(f"Finished payload -> {payload}")

My goal here is to cover all the values and if we do not find it inside the json then we use the if statements to see if we have caught a value or not.

Things that annoys me:

  1. Long nested dicts but im not sure if there is any other way, is there a better way?
  2. try-except for product_image, because its a list and I just want to grab the first in the list, is it possible to skip the try-except?
  3. If unexpected error happens, print to the exception but am I doing it correctly?

Hopefully I can get new knowledge from here :)

\$\endgroup\$
2
  • \$\begingroup\$ I would like to know where it goes wrong - does it complete without error? \$\endgroup\$
    – Reinderien
    Commented Apr 12, 2021 at 14:59
  • 1
    \$\begingroup\$ @Reinderien right now the script works without any errors yes, but what if etc JSON is incorrect or unexpected errors happens, thats what I meant "where it goes wrong" \$\endgroup\$ Commented Apr 12, 2021 at 15:28

2 Answers 2

3
\$\begingroup\$
  1. You need functions.
    Functions help to split up your code and make understanding how everything fits together easier.

    You should have a function for:

    1. Getting the page.

    2. Parsing the JSON with your helpful error message.

    3. Getting the payload.

    4. A main function for calling all the other functions.

      We should only call the main function if the script is the entry point. We can do so with the following code:

      if __name__ == "__main__":
          main()
      
  2. You should add a function to walk paths.

    product_price = doc.get('markets', {}).get('6', {}).get('pricesByPricelist', {}).get('3', {}).get('price', {})
    
    # Added exception incase the [0] is not able to get any value from 'media'
    try:
        product_image = doc.get('media', {})[0].get('sources', {}).get('full', {}).get('url', {})
    except KeyError:
        product_image = None
    

    Things would be so much simpler if we could just do:

    price = get(data, "markets", "6", "pricesByPricelist", "3", "price")
    image = get(data, "media", 0, "sources", "full", "url")
    

    To build get is simple.

    1. We take an object, a path and a default value.
    2. For each segment of the path we index the object.
    3. If the object doesn't have the value (__getitem__ raises LookupError) we return the default value.
    4. The default value should be to raise an error.
      We can ensure anything the user passes to the function is returned. By making a new instance of an object we can then check if the default is the instance.
    5. For convenience we can build the get function which defaults to None.
    _SENTINEL = object()
    
    def walk(obj, path, default=_SENTINEL):
        try:
            for segment in path:
                obj = obj[segment]
            return obj
        except LookupError:
            if default is _SENTINEL:
                raise LookupError(f"couldn't walk path; {path}") from None
            else:
                return default
    
    def get(obj, *path):
        return walk(obj, path, default=None)
    
  3. I'd highly recommend using exceptions for control flow in Python.

    We can subclass BaseException to make an exception that prints to the user and then exits the system. By wrapping the main() call in a try we can check if our custom error is raised print and then exit with a status of 1.

    try:
        main()
    except PrettyExitException as e:
        print(e)
        raise SystemExit(1) from None
    
import json
from json import JSONDecodeError

import requests

_SENTINEL = object()


class PrettyExitException(BaseException):
    pass


def get_page():
    headers = {'origin': 'https://www.aplace.com', 'Content-Type': 'application/json'}
    with requests.get("http://api.aplace.com/sv/dam/aaltoili-iso-mehu-gren-red-sand", headers=headers) as response:
        if not response.ok:
            raise PrettyExitException(f"Requests NOT ok -> {payload}")
        return response.text


def parse_json(raw_json):
    try:
        return json.loads(raw_json)
    except (JSONDecodeError, KeyError, ValueError) as err:
        raise PrettyExitException(f"Not able to parse JSON -> {err} -> Payload: {payload}") from None


def walk(obj, path, default=_SENTINEL):
    try:
        for segment in path:
            obj = obj[segment]
        return obj
    except LookupError:
        if default is _SENTINEL:
            raise LookupError(f"couldn't walk path; {path}") from None
        else:
            return default


def get(obj, *path):
    return walk(obj, path, default=None)


def build_payload(data):
    name = get(data, "name")
    brand = get(data, "brandName")
    price = get(data, "markets", "6", "pricesByPricelist", "3", "price")
    image = get(data, "media", 0, "sources", "full", "url")
    size = get(data, "items")
    count = 0
    sizes = []
    sizes_with_stock = []

    if name is not None and brand is not None:
        name = f"{brand} {name}"
    else:
        name = None

    if price is not None:
        price = price.replace("\xa0", "")

    if size is not None:
        for value in size.values():
            stock = get(value, "stockByMarket", "6")
            if stock is not None:
                _name = get(value, "name")
                count += stock
                sizes.append(_name)
                sizes_with_stock.append(f"{_name} - ({stock})")

    return {
        "name": name,
        "price": price,
        "image": image,
        "sizes": sizes,
        "sizesWithStock": sizes_with_stock,
        "stock": count,
    }


def main():
    raw_data = get_page()
    data = parse_json(raw_data.replace(",,", ","))
    payload = build_payload(data.get("productData", {}))
    print(f"Finished payload -> {payload}")


if __name__ == "__main__":
    try:
        main()
    except PrettyExitException as e:
        print(e)
        raise SystemExit(1) from None

\$\endgroup\$
7
  • \$\begingroup\$ Hi! Thank you for the awesome review! I have currently tested your code and it seems like everything returns None aswell as UnboundLocalError: local variable 'stock' referenced before assignment when running it :'( \$\endgroup\$ Commented Apr 12, 2021 at 19:07
  • 1
    \$\begingroup\$ @ProtractorNewbie Oh sorry I made some typos! The code should be good now. \$\endgroup\$
    – Peilonrayz
    Commented Apr 12, 2021 at 19:09
  • \$\begingroup\$ No problem! There is also another small typo in name = get(value, "name") as you already using it at the top :) But another question, wouldn't it be ok to just use if name and brand because name doesnt have a value, then it wont go through the if statement without needing to add the is None? \$\endgroup\$ Commented Apr 12, 2021 at 19:11
  • 1
    \$\begingroup\$ @ProtractorNewbie Darn it, you're correct again! Sorry. Yes you can use if name and brand. I'm unsure what your data looks like so I chose to play things safe and assume any valid string is a valid name/brand. Yes if you think the API returning an empty name should be treated as no value please change the line to whatever works best for you. \$\endgroup\$
    – Peilonrayz
    Commented Apr 12, 2021 at 19:16
  • \$\begingroup\$ Im thinking since if we cannot get the name from your get function, then it will return None either way, thats what I was thinking. So either it finds it or it will return None and by that we could easily just if name... in that case, unless im missing something in between your walk function :) But I really love the function, looks so much better for sure! \$\endgroup\$ Commented Apr 12, 2021 at 19:18
2
\$\begingroup\$
  • You need to get rid of your "payload" antipattern. The dictionary you're constructing isn't particularly a payload; it's just a grab bag of untyped, unorganized data. If not OOP, then at least put this in normal variables; but avoid a dictionary unless you're actually doing something that requires a dictionary - e.g. saving to a JSON file, or sending to a secondary web service.
  • Write a generic function that accepts a product name, rather than just having a pile of global code
  • Consider adding PEP484 type hints
  • Use response.raise_for_status instead of checking ok yourself, particularly since you care about proper exception handling
  • Your global comma replacement is not safe. My guess is that instead of They added extra comma to fail the JSON, they just made a mistake - never attribute to malice that which you can attribute to ignorance. Attempt to replace this more narrowly.
  • You seem to want to use .get('x', {}) as a magic bullet without fully understanding what it does, because you've used it in some places where it's not at all appropriate, such as .get('price', {}). You do not want to default a price to a dictionary.
  • Consider wrapping exceptions in your own type and catching that at the top level
  • Where possible, either put magic constants such as 6 in a constant, or avoid it by searching for "what you actually want"; e.g. a specific currency string.

Suggested

import json
from decimal import Decimal
from json import JSONDecodeError
from pprint import pformat
from typing import Dict, Any, Optional

import requests

MARKET_ID = '6'
CURRENCY = 'SEK'


class APlaceError(Exception):
    pass


def get_api(product: str) -> Dict[str, Any]:
    headers = {
        'Origin': 'https://www.aplace.com',
        'Accept': 'application/json',
    }
    try:
        with requests.get(f'http://api.aplace.com/sv/dam/{product}', headers=headers) as response:
            response.raise_for_status()
            payload = response.text.replace(',,"cached"', ',"cached"')
        doc = json.loads(payload)
    except IOError as err:
        raise APlaceError('Unable to get response from APlace server') from err
    except JSONDecodeError as err:
        raise APlaceError('Unable to parse JSON') from err

    if doc.get('content', {}).get('is404'):
        raise APlaceError(f'Product "{product}" not found')

    return doc


class Product:
    def __init__(self, doc: Dict[str, Any]):
        try:
            data = doc['productData']
        except KeyError as e:
            raise APlaceError('Empty response') from e

        product = data.get('product')
        if product is None:
            self.id: Optional[int] = None
        else:
            self.id = int(product)

        self.uri: Optional[str] = data.get('uri')
        self.name: Optional[str] = data.get('name')
        self.brand_name: Optional[str] = data.get('brandName')
        self.sku: Optional[str] = data.get('sku')
        # etc.

        try:
            prices = next(
                price
                for price in data.get('markets', {})
                .get(MARKET_ID, {})
                .get('pricesByPricelist', {})
                .values()
                if price.get('price', '').endswith(CURRENCY)
            )
            price = prices.get('priceAsNumber')
            if price is None:
                self.price: Optional[float] = None
            else:
                self.price = Decimal(price)
            self.price_with_currency: Optional[str] = prices.get('price')
        except StopIteration:
            self.price = None
            self.price_with_currency = None

        # Added exception incase the [0] is not able to get any value from 'media'
        try:
            self.image_url: Optional[str] = next(
                medium.get('sources', {}).get('full', {}).get('url')
                for medium in data.get('media', [])
            )
        except StopIteration:
            self.image_url = None

        self.stock_by_size = {
            item.get('name'): item.get('stockByMarket').get(MARKET_ID, 0)
            for item in data.get('items', {}).values()
        }

    def __str__(self):
        return (
            f'Brand: {self.brand_name}\n'
            f'Name: {self.name}\n'
            f'Price: {self.price_with_currency}\n'
            f'Image URL: {self.image_url}\n'
            f'Stock for each size: {pformat(self.stock_by_size)}\n'
        )


def main():
    try:
        prod = Product(get_api('aaltoili-iso-mehu-gren-red-sand'))
        print(prod)
    except APlaceError as e:
        print(f'Failed to query product: {e}')


if __name__ == '__main__':
    main()
\$\endgroup\$
6
  • \$\begingroup\$ Hi again! You have been amazing and I really wished honestly that you had a donation that I could donate too! You been really awesome to me and I have been learning so much from you! Honestly! \$\endgroup\$ Commented Apr 12, 2021 at 19:23
  • \$\begingroup\$ The reason im using the payload is that I want to return as a dict, but I could also change it. The plan was to have a prefix of a dict where I could add stuff to the payload and then replace by using payload["name"] as I did but I could also do something like {"name": self.name or _prefix.name_ sort of. I think as you say, I should clear out the payload. But the reason also is that when we raise an error or that the response status code is not OK, I want to return atleast some information to the prefixed payload, if that make sense? \$\endgroup\$ Commented Apr 12, 2021 at 19:25
  • 1
    \$\begingroup\$ I'm glad you appreciate the review :) As to your "at least some information", the desire is good but the execution is off. It's entirely possible to populate a class with only some members; hence the Optional notation that accepts None if a value is absent. \$\endgroup\$
    – Reinderien
    Commented Apr 12, 2021 at 19:30
  • \$\begingroup\$ Indeed! I just thought about for example, lets say that we have a page that is found but the API is not updated fully, meaning example that the value data = doc['productData'] is not in placed in the API yet. So what I thought was that in that exception of it, to return a payload which I could for example strip and split the URL to get end of the URL which usually is the name of a product. then I could know atleast that there is a product with name as a little information to know... - Does that make some sense ? :D \$\endgroup\$ Commented Apr 12, 2021 at 19:35
  • 1
    \$\begingroup\$ Kind of? If all you have is the product URI and everything else fails, then you can just print the product URI. No need for stripping and splitting. \$\endgroup\$
    – Reinderien
    Commented Apr 12, 2021 at 19:48

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Not the answer you're looking for? Browse other questions tagged or ask your own question.