
I wrote a simple script to fetch the HTML from multiple websites. I didn't have any issue with the script until yesterday, when it suddenly started throwing the exception below.

Traceback (most recent call last):
  File "crowling.py", line 45, in <module>
    result = requests.get(url)
  File "/Users/gen/.pyenv/versions/3.7.1/lib/python3.7/site-packages/requests/api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "/Users/gen/.pyenv/versions/3.7.1/lib/python3.7/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Users/gen/.pyenv/versions/3.7.1/lib/python3.7/site-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/gen/.pyenv/versions/3.7.1/lib/python3.7/site-packages/requests/sessions.py", line 685, in send
    r.content
  File "/Users/gen/.pyenv/versions/3.7.1/lib/python3.7/site-packages/requests/models.py", line 829, in content
    self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''
  File "/Users/gen/.pyenv/versions/3.7.1/lib/python3.7/site-packages/requests/models.py", line 754, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ("Connection broken: ConnectionResetError(54, 'Connection reset by peer')", ConnectionResetError(54, 'Connection reset by peer'))

The main part of the script is this:

import requests

c = 0
# urls is the list of URLs as strings
for url in urls:
    result = requests.get(url)
    c += 1
    with open('htmls/p{}.html'.format(c), 'w', encoding='UTF-8') as f:
        f.write(result.text)

The list urls is generated by other code of mine, and I have checked that the URLs are correct. Also, the timing of the exception is not constant: sometimes it stops at the 20th HTML, and sometimes it goes until the 80th and then stops. Since the exception appeared suddenly without any change to the code, I am guessing it is caused by the Internet connection. Still, I want the script to run stably. What are the possible causes of this error?

  • Judging from the exception stack trace, those URLs probably have Unicode characters in them
    – bigbounty
    Commented Aug 5, 2020 at 11:44
  • Can you post some sample URLs you are calling?
    – isopach
    Commented Aug 5, 2020 at 12:13

1 Answer


If you're sure the URLs are correct and it's an intermittent connection problem, you can just retry the connection after failure:

import time
import requests
from requests.exceptions import ChunkedEncodingError

c = 0
# urls is the list of URLs as strings
for url in urls:
    trycnt = 3  # max try count
    while trycnt > 0:
        try:
            result = requests.get(url)
            c += 1
            with open('htmls/p{}.html'.format(c), 'w', encoding='UTF-8') as f:
                f.write(result.text)
            break  # success; go to next URL
        except ChunkedEncodingError as ex:
            trycnt -= 1
            if trycnt <= 0:  # done retrying
                print("Failed to retrieve: " + url + "\n" + str(ex))
            else:
                time.sleep(0.5)  # wait half a second, then retry
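
As a complementary sketch (not part of the original answer), requests can also retry at the transport level through urllib3's Retry class and an HTTPAdapter mounted on a Session. The retry parameters and the 10-second timeout below are illustrative assumptions, not values from the question:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retry = Retry(
    total=3,                                # up to 3 retries per request
    backoff_factor=0.5,                     # exponential backoff between attempts
    status_forcelist=[500, 502, 503, 504],  # also retry on these HTTP status codes
)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)

c = 0
for url in urls:  # urls as in the question
    result = session.get(url, timeout=10)   # illustrative timeout
    c += 1
    with open('htmls/p{}.html'.format(c), 'w', encoding='UTF-8') as f:
        f.write(result.text)

Note that this mainly covers failures while establishing the connection or waiting for a response; an error raised midway through reading the response body, such as the ChunkedEncodingError above, can still slip through, so the explicit try/except loop remains useful.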
  • Somehow my script began to work properly again, also on another Linux server, but the idea of retrying with an except clause is incredible. Thank you so much for your idea.
    – TFC
    Commented Aug 5, 2020 at 17:33
  • Please accept an answer so this post is removed from the "No Answer" list. Thanks.
    – Mike67
    Commented Sep 1, 2020 at 19:11
