I have been writing a monitoring script that checks for changes on a webpage. When a change has happened, I want to be notified by printing out that there is a difference.

I have also written a spam filter function to handle duplicates that can be caused by a cache issue, where a repo count flips from an old value to a new one and back. Here is the code:

import sys
import threading
import time
from datetime import datetime, timedelta
from typing import Union

import requests
from bs4 import BeautifulSoup
from loguru import logger

URLS: set = set()


def filter_spam(delay: int, sizes: dict, _requests) -> Union[dict, bool]:
    """Filter requests to only those that haven't been made
    previously within our defined cooldown period."""

    # Get filtered set of requests.
    def evaluate_request(r):
        return r not in _requests or datetime.now() - _requests[r] >= timedelta(seconds=delay)

    if filtered := [r for r in sizes if evaluate_request(r)]:
        # Refresh timestamps for requests we're actually making.
        for r in filtered:
            _requests[r] = datetime.now()

        return _requests

    return False


class MonitorProduct:
    def __init__(self):
        self._requests: dict[str, datetime] = {}
        self.previous_state = {}

    def doRequest(self, url):
        while True:
            if url not in URLS:
                logger.info(f'Deleting url from monitoring: {url}')
                sys.exit()

            response = requests.get(url)
            if response.status_code == 200:
                soup = BeautifulSoup(response.text, 'html.parser')

                if soup.find("span", {"data-search-type": "Repositories"}):  # if there are sizes
                    self.compareData({
                        'title': soup.find("input", {"name": "q"})['value'],
                        'repo_count': {soup.find("span", {"data-search-type": "Repositories"}).text.strip(): None}
                    })
                else:
                    logger.info(f"No diff for {url}")
            else:
                logger.info(f"Error for {url} -> {response.status_code}")

            time.sleep(30)

    def compareData(self, data):
        if self.previous_state != data:
            if filter_spam(3600, data['repo_count'], self._requests):
                logger.info(f"The data has been changed to {data}")

        self.previous_state = data


# mocked database
def database_urls() -> set:
    return {
        'https://github.com/search?q=hello+world',
        'https://github.com/search?q=python+3',
        'https://github.com/search?q=world',
        'https://github.com/search?q=wth',
    }


if __name__ == '__main__':
    while True:
        db_urls = database_urls()  # get all urls from database
        diff = db_urls - URLS
        URLS = db_urls  # Replace URLS with db_urls to get the latest urls

        # Start the new URLS
        for url in diff:
            logger.info(f'Starting URL: {url}')
            threading.Thread(target=MonitorProduct().doRequest, args=(url,)).start()

        time.sleep(10)

The code is working pretty well. However, I have written a mocked database that is polled every 10 seconds (if I should raise that interval, please explain why). The reason I wrote a mocked database is to show you how the script works; the real version will read from a database (PostgreSQL through peewee). I also used GitHub as an example because it was the easiest way for me to get the values I want to compare against. I am aware that there is an API for GitHub, but in my situation I would like to build better knowledge of BeautifulSoup, so I use bs4.
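For reference, the real database_urls will look roughly like this with peewee (the connection details and model below are placeholders, not my actual schema):

from peewee import CharField, Model, PostgresqlDatabase

# Placeholder connection and model; the real schema differs.
db = PostgresqlDatabase('monitor', user='postgres', host='localhost')


class MonitoredUrl(Model):
    url = CharField(unique=True)

    class Meta:
        database = db


def database_urls() -> set:
    return {row.url for row in MonitoredUrl.select()}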

To repeat the point: I want this script to notify me whenever something has changed on the webpage.

I hope to get some feedback from you lovely code reviewers!


1 Answer

URLS should not be global.

There's a better name for filter_spam, since it isn't spam you're filtering; but I can't get more inventive than "change filter". The algorithm here is probably inefficient: you have no mechanism to purge requests once they've exceeded your cooldown. Another big problem is that this function, though it lives in the global namespace, mutates a class member. Better to return only a bool, and to make it a method of your class. An improved algorithm could keep a combined FIFO queue and hash table, represented by an ordered dict; a minimal sketch follows.
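For illustration only (the names here are mine, not from the original code; the full suggestion below integrates the same idea):

from collections import OrderedDict
from datetime import datetime

seen: OrderedDict[int, datetime] = OrderedDict()  # insertion order doubles as the FIFO queue


def should_report(value: int, cooldown: float = 3600.0) -> bool:
    now = datetime.now()
    # Evict entries older than the cooldown from the front of the queue.
    while seen and (now - next(iter(seen.values()))).total_seconds() > cooldown:
        seen.popitem(last=False)
    if value in seen:    # O(1) hash-table lookup
        return False
    seen[value] = now    # enqueue at the back
    return True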

It's important that you use a session instead of making an isolated get() every time.

Put your response in a context manager.

Don't check status_code in your case; just check ok.
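Rolled together, the previous three points look something like this (the query is illustrative):

from requests import Session

session = Session()  # one session reuses the TCP connection across requests

with session.get('https://github.com/search', params={'q': 'python'}) as response:
    if not response.ok:  # true for any 4xx/5xx status, not just != 200
        raise RuntimeError(f'Request failed: {response.status_code}')
    html = response.text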

Do as much computation up-front as possible for BeautifulSoup, using both a strainer and a CSS sieve.
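Something like the following, compiled once up-front (the selector is the one used in the full suggestion below; the sample markup is made up):

import soupsieve
from bs4 import BeautifulSoup
from bs4.element import SoupStrainer

STRAINER = SoupStrainer(name='main')                   # only parse the <main> subtree
SIEVE = soupsieve.compile('div.position-relative h3')  # pre-compiled CSS selector

html = '<main><div class="position-relative"><h3>1,550 repository results</h3></div></main>'
soup = BeautifulSoup(markup=html, features='html.parser', parse_only=STRAINER)
heading = SIEVE.select_one(soup)  # None if there are no results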

The span element you're targeting isn't very useful. Have you noticed that it often returns a numeric contraction, like 1M? Look at the h3 instead.

I don't understand why you're passing around your repo_count as a dictionary mapping a single key to None. Don't do this.

Don't write if self.previous_state != data; your filter does that for you.

Don't bake https://github.com/search into your database; only your search term varies so only use that.

It's good that you have a __main__ guard but it isn't enough. All of those symbols are still globally visible. Make a main() function.

You already have a thread context object, MonitorProduct, so why also require a url parameter? That can be moved to the object.

exit() doesn't seem like what you actually want. If you want to exit the thread, just return.

Suggested

import locale
import logging
import soupsieve
import threading
import time
from collections import OrderedDict
from datetime import datetime
from typing import Callable

from bs4 import BeautifulSoup
from bs4.element import SoupStrainer
from requests import Session

logger = logging.getLogger()


class MonitorProduct:
    STRAINER = SoupStrainer(name='main')
    SIEVE = soupsieve.compile('div.position-relative h3')

    def __init__(self, term: str, is_alive: Callable[[str], bool]) -> None:
        self.is_alive = is_alive
        self.term = term
        self.requests: OrderedDict[int, datetime] = OrderedDict()
        self.session = Session()

    def do_request(self) -> None:
        while True:
            if not self.is_alive(self.term):
                logger.info(f'Deleting term from monitoring: "{self.term}"')
                return

            with self.session.get(
                url='https://github.com/search',
                params={'q': self.term},
                headers={'Accept': 'text/html'},
            ) as response:
                if not response.ok:
                    logger.info(f"Error for {self.term} -> {response.status_code}")
                    continue
                soup = BeautifulSoup(markup=response.text, features='html.parser', parse_only=self.STRAINER)

            result_head = self.SIEVE.select_one(soup)
            if result_head:
                result = result_head.text.strip()
                logger.debug(f'Results for {self.term}: {result}')
                repo_count = locale.atoi(result.split(maxsplit=1)[0])
            else:
                repo_count = 0
            self.compare_data(repo_count)

            time.sleep(30)

    def compare_data(self, repo_count: int) -> None:
        if self.filter_dupes(repo_count):
            logger.info(f"The data for term {self.term} has been changed to {repo_count}")

    def filter_dupes(self, repo_count: int, delay: int = 60 * 60) -> bool:
        """Filter requests to only those that haven't been made
        previously within our defined cooldown period."""

        # Purge old entries
        now = datetime.now()
        while self.requests:
            oldest = next(iter(self.requests.values()))
            if (now - oldest).total_seconds() > delay:
                self.requests.popitem(last=False)
            else:
                break

        # Check current entry
        if repo_count in self.requests:
            return False
        self.requests[repo_count] = now
        return True


# mocked database
def database_terms() -> set[str]:
    return {
        'hello world',
        'python 3',
        'world',
        'wth',
    }


def main() -> None:
    logging.basicConfig(level=logging.DEBUG)
    terms: set[str] = set()

    # Needs to match the locale of github.com
    locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

    while True:
        db_terms = database_terms()  # get all terms from database
        diff = db_terms - terms
        terms.clear()
        terms |= db_terms  # mutate in place so the terms.__contains__ passed to threads sees the latest terms

        # Start monitoring the new terms
        for term in diff:
            logger.info(f'Starting URL: {term}')
            threading.Thread(
                target=MonitorProduct(term, terms.__contains__).do_request
            ).start()

        time.sleep(10)


if __name__ == '__main__':
    main()

Output

INFO:root:Starting URL: wth
INFO:root:Starting URL: hello world
INFO:root:Starting URL: world
INFO:root:Starting URL: python 3
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): github.com:443
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): github.com:443
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): github.com:443
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): github.com:443
DEBUG:urllib3.connectionpool:https://github.com:443 "GET /search?q=wth HTTP/1.1" 200 None
DEBUG:root:Results for wth: 1,550 repository results
INFO:root:The data for term wth has been changed to 1550
DEBUG:urllib3.connectionpool:https://github.com:443 "GET /search?q=python+3 HTTP/1.1" 200 None
DEBUG:root:Results for python 3: 101,658 repository results
DEBUG:urllib3.connectionpool:https://github.com:443 "GET /search?q=world HTTP/1.1" 200 None
INFO:root:The data for term python 3 has been changed to 101658
DEBUG:root:Results for world: 1,981,644 repository results
INFO:root:The data for term world has been changed to 1981644
DEBUG:urllib3.connectionpool:https://github.com:443 "GET /search?q=hello+world HTTP/1.1" 200 None
DEBUG:root:Results for hello world: 1,733,873 repository results
INFO:root:The data for term hello world has been changed to 1733873
  • Hello! Thank you for the amazing and very knowledgeable feedback. However, I did find an issue with filter_dupes: it always returns True once the cooldown has elapsed, so it reports a change even when the value is the same. That's why I used if self.previous_state != data; we don't need to run filter_dupes if there is nothing new compared to the previous request. Right now it prints a change whenever the timer is over, even for an unchanged value. Commented Jul 28, 2022 at 8:21
  • Also, when is if not self.is_alive(self.url): ever used? How do we trigger a specific term to be "deleted"? Commented Jul 28, 2022 at 8:27
  • I have been trying the whole day to understand how to exit the thread using your example, and I was not able to. It looks like it keeps using the old terms whenever database_terms shrinks (e.g. if we delete hello world and python 3 from the database, the thread will not exit on the second loop). The reason is that we first pass e.g. {'hello world', 'python 3'}.__contains__ to the first loop, and rebinding terms in main() won't change what was passed. Commented Jul 28, 2022 at 20:23
  • That's my fault; please see edit. – Reinderien Commented Jul 28, 2022 at 23:01
  • Certainly. Isn't that by design? If you exit because the database has removed a term, and then the term is reintroduced, would you not want a thread to be spawned? – Reinderien Commented Jul 29, 2022 at 2:24
