How to crawl a movie with Python?

Question

I am trying to download a movie titled "Shimen," which was showcased at the 60th Golden Horse Awards. I discovered a streaming link:

https://www.fofoyy.com/dianying/96937

I could not find the video source in the page's source code, but when I inspected the page with the browser developer tools (F12) I found two M3U8 files in the network requests. By combining them, I obtained the final URL I need to access:

https://v8.longshengtea.com/yyv8/202310/06/2yJDc3LMsW1/video/2000k_0X1080_64k_25/hls/index.m3u8
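
Combining the two M3U8 files usually means resolving the variant playlist URI listed in the first (master) playlist against the master playlist's own URL. A minimal sketch of that step, assuming a master playlist shaped like the sample below; the `example.com` URL, the sample text, and the `pick_variant_url` helper are hypothetical and not taken from the site in question:

```python
from urllib.parse import urljoin


def pick_variant_url(master_url: str, master_text: str) -> str:
    """Return the absolute URL of the first variant listed in a master playlist."""
    lines = [line.strip() for line in master_text.splitlines()]
    for i, line in enumerate(lines):
        if line.startswith("#EXT-X-STREAM-INF"):
            # The variant URI is the next non-empty line that is not a tag.
            for candidate in lines[i + 1:]:
                if candidate and not candidate.startswith("#"):
                    return urljoin(master_url, candidate)
    raise ValueError("no variant playlist found")


# Hypothetical master playlist, for illustration only.
SAMPLE_MASTER = """#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=2000000,RESOLUTION=1920x1080
video/2000k/hls/index.m3u8
"""

variant_url = pick_variant_url(
    "https://example.com/movie/abc/index.m3u8", SAMPLE_MASTER
)
print(variant_url)
```

`urljoin` handles both relative paths and paths starting with `/`, so the URL does not need to be assembled by hand.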

When I send a GET request to this URL with the requests library, I get back URLs of files ending in .jpeg. I tried downloading these and writing them to files ending in .ts, and some of the resulting videos are playable.
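
One possible reason that only some of the renamed files play is that a segment served as `.jpeg` may carry a few bytes of image header in front of the real MPEG-TS data. This is an assumption about this site, not something the question confirms. MPEG-TS packets are 188 bytes long and each starts with the sync byte `0x47`, so a sketch like the following could locate where the stream starts; `find_ts_offset` and `strip_fake_header` are hypothetical helper names:

```python
TS_PACKET_SIZE = 188  # MPEG-TS packets are 188 bytes long
TS_SYNC_BYTE = 0x47   # every packet starts with this byte


def find_ts_offset(data: bytes, packets_to_check: int = 3) -> int:
    """Return the offset of the first run of TS packets, or -1 if none is found."""
    last_start = len(data) - TS_PACKET_SIZE * (packets_to_check - 1)
    for offset in range(max(last_start, 0)):
        # Require the sync byte at several consecutive packet boundaries,
        # so that a stray 0x47 inside a header is not mistaken for a packet.
        if all(
            data[offset + k * TS_PACKET_SIZE] == TS_SYNC_BYTE
            for k in range(packets_to_check)
        ):
            return offset
    return -1


def strip_fake_header(data: bytes) -> bytes:
    """Drop bytes that precede the first TS packet; otherwise return data unchanged."""
    offset = find_ts_offset(data)
    return data[offset:] if offset > 0 else data
```

If `find_ts_offset` returns -1 for a downloaded file, the file is probably not TS data at all (for example an error page), which would be worth checking before trying to play it.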

I have tried two strategies:

Use a for loop to request each URL in the M3U8 file.

Firstly, it's excessively slow. Secondly, some are successful while others are not.

Use aiohttp to request the URLs asynchronously with coroutines.

Firstly, it's faster. Secondly, all requests fail. Could anyone with experience help me with this? Thank you very much!

My code:

import asyncio
import aiohttp
import aiofiles
import os
import re


async def get_urls_from_m3u8(m3u8_url):
    async with aiohttp.ClientSession() as session:
        async with session.get(m3u8_url) as response:
            if response.status == 200:
                content = await response.text()
                urls = re.findall(r'https?://[^\s]+\.jpeg', content)
                return urls
            else:
                print(f"Failed to fetch M3U8 file, status code: {response.status}")
                return []


async def download_image(url, directory, max_retries=3):
    for attempt in range(max_retries + 1):
        try:
            async with aiohttp.ClientSession() as session:
                async with session.get(url, timeout=30) as response:
                    if response.status == 200:
                        ts_filename = re.search(r'\d+', url.split("/")[-1]).group()
                        ts_filepath = os.path.join(directory, ts_filename)

                        async with aiofiles.open(ts_filepath, 'wb') as ts_file:
                            # Read the response body in chunks
                            while True:
                                chunk = await response.content.read(1024)  # read 1 KB at a time
                                if not chunk:
                                    break
                                await ts_file.write(chunk)

                        print(f"Successfully saved {ts_filename}")
                        return True
                    else:
                        print(f"Failed to download {url}, status code: {response.status}")
                        return False
        except Exception as e:
            if attempt < max_retries:
                print(f"Attempt {attempt + 1} failed. Retrying...")
            else:
                print(f"Error downloading {url}: {e}")
                return False


async def main():
    directory = "downloaded_files_ts"
    if not os.path.exists(directory):
        os.makedirs(directory)

    m3u8_url = 'https://v8.longshengtea.com/yyv8/202310/06/2yJDc3LMsW1/video/2000k_0X1080_64k_25/hls/index.m3u8'
    urls = await get_urls_from_m3u8(m3u8_url)

    tasks = [download_image(url, directory) for url in urls]
    await asyncio.gather(*tasks)


if __name__ == '__main__':
    asyncio.run(main())
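
The regular expression in `get_urls_from_m3u8` only matches absolute URLs ending in `.jpeg`. A slightly more general sketch treats every non-empty line that does not start with `#` as a segment URI and resolves it against the playlist URL, which also covers relative paths. The `parse_segment_urls` name and the sample playlist are illustrative assumptions:

```python
from urllib.parse import urljoin


def parse_segment_urls(playlist_url: str, playlist_text: str) -> list[str]:
    """Return absolute segment URLs in playlist order."""
    urls = []
    for raw_line in playlist_text.splitlines():
        line = raw_line.strip()
        # Lines starting with '#' are tags or comments, not segment URIs.
        if not line or line.startswith("#"):
            continue
        urls.append(urljoin(playlist_url, line))
    return urls


# Hypothetical media playlist, for illustration only.
SAMPLE_PLAYLIST = """#EXTM3U
#EXT-X-TARGETDURATION:4
#EXTINF:4.0,
0000.jpeg
#EXTINF:4.0,
https://cdn.example.com/seg/0001.jpeg
#EXT-X-ENDLIST
"""

segments = parse_segment_urls("https://example.com/hls/index.m3u8", SAMPLE_PLAYLIST)
print(segments)
```

Keeping the URLs in playlist order matters later, because the segments have to be joined in that order.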

"Firstly, it's excessively slow" It is supposed to be slow. This is a deliberate protection against thieves and bots who just try to download everything (in the M3U8) at once. A real viewer would take 1 hour to reach the TS file for a play time of 1-hour later, but a bot will try to access all the TS files within seconds and such "not human viewer" behaviour gets blocked by their server. — VC.One, Commented Sep 27 at 6:52
"Secondly, some are successful while others are not." Don't rely on Exception before doing a download re-try... You need to check manually in your code. Before downloading a new TS file just check that the expected previous file also exists in downloads folder, and if not, then try to re-download that expected previous file again. — VC.One, Commented Sep 27 at 6:52

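The two comments above suggest limiting how fast segments are requested and checking the download folder for missing files instead of relying on exceptions. A minimal sketch of that idea follows, assuming the actual HTTP call is supplied as a coroutine. The names `fetch`, `download_all`, `max_concurrent`, `rounds`, and `delay` are hypothetical, and suitable values depend on how the server throttles clients:

```python
import asyncio
import os
from typing import Awaitable, Callable

Fetch = Callable[[str], Awaitable[bytes]]


def segment_path(directory: str, index: int) -> str:
    """File name derived from the segment's position in the playlist."""
    return os.path.join(directory, f"{index:05d}.ts")


def missing_indexes(directory: str, count: int) -> list[int]:
    """Indexes whose file is absent or empty in the download folder."""
    return [
        i for i in range(count)
        if not os.path.isfile(segment_path(directory, i))
        or os.path.getsize(segment_path(directory, i)) == 0
    ]


async def download_all(urls, directory, fetch: Fetch,
                       max_concurrent: int = 4, rounds: int = 3,
                       delay: float = 1.0) -> list[int]:
    """Download segments a few at a time; return the indexes still missing."""
    os.makedirs(directory, exist_ok=True)
    semaphore = asyncio.Semaphore(max_concurrent)

    async def download_one(index: int) -> None:
        async with semaphore:  # at most max_concurrent requests in flight
            try:
                data = await fetch(urls[index])
            except Exception as exc:
                print(f"segment {index} failed: {exc}")
                return
            with open(segment_path(directory, index), "wb") as f:
                f.write(data)

    missing = missing_indexes(directory, len(urls))
    for _ in range(rounds):
        if not missing:
            break
        await asyncio.gather(*(download_one(i) for i in missing))
        # Decide what to retry by looking at the folder, not at exceptions.
        missing = missing_indexes(directory, len(urls))
        if missing:
            await asyncio.sleep(delay)  # back off before the next round
    return missing
```

With aiohttp, `fetch` could be a small coroutine that reuses one shared `ClientSession`, calls `raise_for_status()` on the response, and returns `await response.read()`. Sharing the session also avoids opening a new one for every segment, as `download_image` does.
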
静逾王 · Accepted Answer · 2024-09-27 09:57:08Z


Use Python only to get the M3U8 link, and then use aria2 to download from that link.
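
One way to read this answer: let Python collect the segment URLs from the playlist, write them to an aria2 input file, and let `aria2c` do the downloading. The sketch below assumes `aria2c` is installed and on the PATH and that the URL list already holds absolute segment URLs. `build_aria2_input`, `run_aria2`, and the file names are hypothetical. The actual `aria2c` call is left commented out because the concurrency the server tolerates is unknown:

```python
import subprocess


def build_aria2_input(urls: list[str]) -> str:
    """Build aria2 input-file text: one URI per line, followed by indented options."""
    lines = []
    for index, url in enumerate(urls):
        lines.append(url)
        # Option lines must start with whitespace; out= sets the saved file name.
        lines.append(f"  out={index:05d}.ts")
    return "\n".join(lines) + "\n"


def run_aria2(input_path: str, directory: str, jobs: int = 4) -> None:
    """Invoke aria2c on the input file with a modest number of parallel downloads."""
    subprocess.run(
        ["aria2c", "-i", input_path, "-d", directory, "-j", str(jobs)],
        check=True,
    )


# Hypothetical segment URLs, for illustration only.
example_urls = [
    "https://example.com/hls/0000.jpeg",
    "https://example.com/hls/0001.jpeg",
]
print(build_aria2_input(example_urls))
# with open("segments.txt", "w") as f:
#     f.write(build_aria2_input(example_urls))
# run_aria2("segments.txt", "downloaded_files_ts")
```

Naming the output files by playlist index keeps them in order, so they can be joined in sequence afterwards.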


