
I'm new to web scraping in Python and I'm trying to scrape all .htm document links from an SEC EDGAR full-text search. I can see the link in the modal footer, but BeautifulSoup won't parse the href element containing the link.

Is there an easy solution to parse the links of the documents?

[Screenshot: the link as it appears in the page's HTML]

import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/edgar/search/#/q=ex10&category=custom&forms=10-K%252C10-Q%252C8-K'
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, 'html.parser')

for a in soup.find_all(id="open-file"):
    print(a)

1 Answer


That data is loaded dynamically using JavaScript. There is a lot of information about scraping this kind of page (see one of many examples here); in this case, the following should get you there:

import requests
import json
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:93.0) Gecko/20100101 Firefox/93.0',
    'Accept': 'application/json, text/javascript, */*; q=0.01',   
}

data = '{"q":"ex10","category":"custom","forms":["10-K","10-Q","8-K"],"startdt":"2020-10-08","enddt":"2021-10-08"}'
# obviously, you need to change "startdt" and "enddt" as necessary
response = requests.post('https://efts.sec.gov/LATEST/search-index', headers=headers, data=data)

The response is in JSON format. Your URLs are hidden in there:

data = json.loads(response.text)
hits = data['hits']['hits']
for hit in hits:
    cik = hit['_source']['ciks'][0]
    # each hit's _id has the form "accession-number:file_name"
    file_data = hit['_id'].split(":")
    filing = file_data[0].replace('-','')  # strip dashes from the accession number
    file_name = file_data[1]
    url = f'https://www.sec.gov/Archives/edgar/data/{cik}/{filing}/{file_name}'
    print(url)

Output:

https://www.sec.gov/Archives/edgar/data/0001372183/000158069520000415/ex10-5.htm
https://www.sec.gov/Archives/edgar/data/0001372183/000138713120009670/ex10-5.htm
https://www.sec.gov/Archives/edgar/data/0001540615/000154061520000006/ex10.htm
https://www.sec.gov/Archives/edgar/data/0001552189/000165495421004948/ex10-1.htm

etc.
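If you also want to download the documents rather than just print the URLs, a minimal sketch along the same lines could look like the following. The list name urls, the 0.5-second pause, and saving into the current working directory are all my own assumptions, not anything required by the API; the headers are reused from the request above.

# collect the URLs instead of (or in addition to) printing them
urls = []
for hit in hits:
    cik = hit['_source']['ciks'][0]
    filing, file_name = hit['_id'].split(":")
    urls.append(f'https://www.sec.gov/Archives/edgar/data/{cik}/{filing.replace("-","")}/{file_name}')

# download each document, reusing the same headers
import time
for url in urls:
    doc = requests.get(url, headers=headers)
    with open(url.rsplit('/', 1)[-1], 'wb') as f:  # save under the file name from the URL
        f.write(doc.content)
    time.sleep(0.5)  # keep the request rate low; the SEC throttles aggressive clients

If you need more than the first batch of hits, the POST body appears to also accept a "from" offset for paging; you can confirm the exact parameter by watching the network tab while paging through results in the browser.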
