1

I have an idx file: https://www.sec.gov/Archives/edgar/daily-index/2020/QTR4/master.20201231.idx

I could open the idx file with following codes one year ago, but the codes don't work now. Why is that? How should I modify the code?

import requests
import urllib
from bs4 import BeautifulSoup

master_data = []
file_url = r"https://www.sec.gov/Archives/edgar/daily-index/2020/QTR4/master.20201231.idx"
byte_data = requests.get(file_url).content
data_format = byte_data.decode('utf-8').split('------')
content = data_format[-1]
data_list = content.replace('\n','|').split('|')

    for index, item in enumerate(data_list):

        if '.txt' in item:
            if data_list[index - 2] == '10-K':
                entry_list = data_list[index - 4: index + 1]
                entry_list[4] = "https://www.sec.gov/Archives/" + entry_list[4]
                master_data.append(entry_list)

print(master_data)

1 Answer 1

0

If you had inspected the contents of the byte_data variable, you would find that it does not have the actual content of the idx file. It is basically present to prevent scraping bots like yours. You can find more information in this answer: Problem HTTP error 403 in Python 3 Web Scraping

In this case, your answer would be to just use the User-Agent in the header for the request.

import requests

master_data = []
file_url = r"https://www.sec.gov/Archives/edgar/daily-index/2020/QTR4/master.20201231.idx"
byte_data = requests.get(file_url, allow_redirects=True, headers={"User-Agent": "XYZ/3.0"}).content

# Your further processing here

On a side note, your processing does not print anything as the if condition is never met for any of the lines, so do not think this solution does not work.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Not the answer you're looking for? Browse other questions tagged or ask your own question.