Data Science Project 1


FIN42110: Data Science for Trading and Risk Management

Project Title: Performance Analysis of Formula 1 Teams (Project 1)

Group 12

Harsh Desai - 23205088


Jay Milind Kelkar - 23202493
Runqi Xue - 23206038

1 Introduction

Formula 1 is the pinnacle of motorsport, displaying a fusion of cutting-edge
technology, strategic prowess, and exceptional skill. Formula 1 teams operate
at the forefront of innovation, constantly pushing the boundaries to gain a
competitive edge. This data science project dives into the multifaceted
landscape of Formula 1 by undertaking a comprehensive analysis of team
performance on and off the track.

By combining on-track performance metrics, financial insights, and sentiment
analysis, this project aims to unveil hidden patterns and correlations. The
synthesis of these diverse datasets may yield valuable insights into the
holistic nature of Formula 1 team dynamics. We anticipate uncovering the
strategies that contribute to success, understanding the delicate balance
between financial investment and racing achievement, and gauging the impact
of media sentiment on team morale.

2 Novel Data Set Collection

To build the novel data set, data on Formula 1 teams and drivers from
different sources has been pooled together into a single database.

• The performance data for all teams and drivers has been scraped from the
ergast.com API, which tracks F1 driver and constructor performance.
Historical data on race results, lap times, pit stop times, driver
information, constructor information, and fastest lap times are considered.
The time frame of the data is 2003 to 2023, the most relevant span of data
available.
• The financial data used in this report has been downloaded from Yahoo
Finance. Its time frame is 2021 to 2023, chosen for relevance to future
predictive analysis.
• The data used for textual analysis has been scraped with a Python-based web
scraping tool built on Selenium WebDriver for automated web navigation and
BeautifulSoup for HTML parsing. It collects news articles on four Formula 1
teams (Ferrari, Alpine, Aston Martin, and Mercedes) covering personnel
changes (racers, technical staff, CEOs, team principals), new sponsorships
and partnerships, car model launches, and terminations of sponsorships or
partnerships.

3 Database creation and querying

• For our analysis, two databases have been created to store all the data:
the first, f1 database, contains all the race performance and financial data;
the second, f1 news, contains data from various news sources.
• Queries have been executed to extract data for each table to perform
exploratory data analysis, data cleaning, and model building.
• Further, summary statistics were generated using queries to gain deeper
insights into our novel data set.
• Table 1 displays the number of race wins for every team from 2003 to 2023,
together with their average qualifying position, i.e. the position from which
they start the race.
• Table 2 displays the driver who won the championship in each year from 2003
to 2023 by scoring the most points, along with that driver's team.
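The kind of query behind Table 1 can be sketched against a toy in-memory database. The schema mirrors the results table created in Section 5, while the sample rows and the exact output columns here are illustrative assumptions, not the project's data.

```python
# Sketch of the Table 1 aggregation: wins and average grid position per
# constructor. The sample rows are invented for illustration only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE results
                (driver_id TEXT, constructor_id TEXT,
                 position INTEGER, grid INTEGER, race_year INTEGER)""")
rows = [
    ("HAM", "mercedes", 1, 1, 2019),
    ("HAM", "mercedes", 1, 2, 2019),
    ("VET", "ferrari",  1, 3, 2019),
    ("VET", "ferrari",  2, 1, 2019),
]
conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?, ?)", rows)

query = """
    SELECT constructor_id,
           SUM(position = 1)   AS wins,            -- SQLite sums booleans as 0/1
           ROUND(AVG(grid), 3) AS avg_qualifying
    FROM results
    GROUP BY constructor_id
    ORDER BY wins DESC
"""
table = list(conn.execute(query))
for constructor, wins, avg_grid in table:
    print(constructor, wins, avg_grid)
conn.close()
```

Running it on the sample rows yields two constructors, ordered by wins, each with a win count and a rounded average starting position.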

Table 1: Grand Prix Wins 2003-2023.

Constructor     Number of Wins   Average Qualifying Position
Ferrari               83               6.002
McLaren               47               8.851
Mercedes             116               4.704
Red Bull              92               6.589
Williams               6              11.976
Renault               20              10.017
Brawn                  7               5.242
Lotus                  2              10.777
Toro Rosso             1              13.659
Alpha Tauri            1              10.487
BMW Sauber             1               8.928
Racing Point           1              11.253
Alpine                 1              10.264
Jordan                 1              16.352
Honda                  1              12.361

Table 2: Formula 1 Drivers Championship
2003-2023.

Year Driver Constructor


2003 Michael Schumacher Ferrari
2004 Michael Schumacher Ferrari
2005 Fernando Alonso Renault
2006 Fernando Alonso Renault
2007 Kimi Raikkonen Ferrari
2008 Lewis Hamilton McLaren
2009 Jenson Button Brawn
2010 Sebastian Vettel Red Bull
2011 Sebastian Vettel Red Bull
2012 Sebastian Vettel Red Bull
2013 Sebastian Vettel Red Bull
2014 Lewis Hamilton Mercedes
2015 Lewis Hamilton Mercedes
2016 Nico Rosberg Mercedes
2017 Lewis Hamilton Mercedes
2018 Lewis Hamilton Mercedes
2019 Lewis Hamilton Mercedes
2020 Lewis Hamilton Mercedes
2021 Max Verstappen Red Bull
2022 Max Verstappen Red Bull
2023 Max Verstappen Red Bull

4 Data Cleaning, Checking and Organisation

The required steps to clean, check and organize the data are as follows:

• For track performance analysis we considered parameters such as qualifying
grid position, race finish position, lap time, points scored, fastest lap
time, and driver and constructor information. To understand financial
relevance and position, we considered the stock prices of the publicly traded
owners/partners of Formula 1 teams, and for textual analysis we used news
articles on teams and drivers.
• Raw performance data has been cleaned by checking for missing or abnormal
values and filtering out irrelevant records. To do so, we checked the range
of each variable and identified any outliers.
• To simplify the analysis, the data has been organised around the average
race pace of each driver for each race in every season. Normalisation has
been applied to make the data consistent in format and output.
• The financial data has been organized as monthly stock price data and
aligned with the timeline of performance analysis.
• The textual data from news sources has been cleaned and organised through
stop word removal, stemming, lemmatisation, and tokenisation.
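The cleaning steps above can be sketched as a minimal pipeline. The tiny stop-word list and the naive suffix stemmer below are simplified stand-ins for the NLTK-style stemming and lemmatisation actually used; they exist only to make the sequence of operations concrete.

```python
# Minimal sketch of the text-cleaning pipeline: tokenisation, stop-word
# removal, then a crude suffix stemmer. Both the stop-word set and the
# stemmer are deliberate simplifications for illustration.
import re

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "with", "for", "its"}

def tokenise(text: str) -> list[str]:
    """Lower-case the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def stem(token: str) -> str:
    """Strip a common suffix if the remaining stem is long enough."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text: str) -> list[str]:
    return [stem(t) for t in tokenise(text) if t not in STOP_WORDS]

print(preprocess("Ferrari announced a new sponsorship with AWS"))
```

The same three stages (tokenise, filter, normalise) apply regardless of whether the normalisation step is this toy stemmer or a proper lemmatiser.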

5 Code

# Extracting race results data from the Ergast API
import requests
import sqlite3
import xml.etree.ElementTree as ET

def create_driver_table(conn):
    cursor = conn.cursor()
    cursor.execute('''CREATE TABLE IF NOT EXISTS drivers
                      (id TEXT PRIMARY KEY,
                       first_name TEXT,
                       last_name TEXT,
                       nationality TEXT)''')

def create_constructor_table(conn):
    cursor = conn.cursor()
    cursor.execute('''CREATE TABLE IF NOT EXISTS constructors
                      (id TEXT PRIMARY KEY,
                       name TEXT)''')

def create_track_table(conn):
    cursor = conn.cursor()
    cursor.execute('''CREATE TABLE IF NOT EXISTS tracks
                      (id INTEGER PRIMARY KEY,
                       locality TEXT,
                       country TEXT,
                       name TEXT)''')

def create_results_table(conn):
    cursor = conn.cursor()
    cursor.execute('''CREATE TABLE IF NOT EXISTS results
                      (id INTEGER PRIMARY KEY AUTOINCREMENT,
                       driver_id TEXT,
                       position INTEGER,
                       grid INTEGER,
                       number INTEGER,
                       constructor_id TEXT,
                       race_track_id INTEGER,
                       points INTEGER,
                       race_year DATE,
                       FOREIGN KEY (driver_id) REFERENCES drivers(id),
                       FOREIGN KEY (constructor_id) REFERENCES constructors(id),
                       FOREIGN KEY (race_track_id) REFERENCES tracks(id))''')

def insert_driver_if_not_exists(conn, driver_first_name, driver_last_name, driver_id):
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM drivers WHERE id = ?", (driver_id,))
    driver = cursor.fetchone()

    if driver is None:
        cursor.execute("INSERT INTO drivers (first_name, last_name, id) VALUES (?, ?, ?)",
                       (driver_first_name, driver_last_name, driver_id))
        conn.commit()
        # id is TEXT, so return the key itself (lastrowid would give the integer rowid)
        id = driver_id
        print(f"Driver {driver_first_name} {driver_last_name} inserted into the database.")
    else:
        id = driver[0]
        print("Driver exists already.")

    return id

def insert_constructor_if_not_exists(conn, constructor_name, constructor_id):
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM constructors WHERE id = ?", (constructor_id,))
    exists = cursor.fetchone()

    if exists is None:
        cursor.execute("INSERT INTO constructors (id, name) VALUES (?, ?)",
                       (constructor_id, constructor_name))
        conn.commit()
        # id is TEXT, so return the key itself (lastrowid would give the integer rowid)
        id = constructor_id
        print(f"Constructor '{constructor_name}' with id '{constructor_id}' inserted successfully.")
    else:
        print("Constructor exists already.")
        id = exists[0]

    return id

def insert_track_if_not_exists(conn, locality, country, name):
    cursor = conn.cursor()
    cursor.execute('SELECT * FROM tracks WHERE name = ?', (name,))
    track_exists = cursor.fetchone()

    if track_exists is None:
        cursor.execute('INSERT INTO tracks (locality, country, name) VALUES (?, ?, ?)',
                       (locality, country, name))
        conn.commit()
        id = cursor.lastrowid
        print(f"Track '{name}' inserted successfully.")
    else:
        id = track_exists[0]
        print(f"Track '{name}' already exists in the database.")

    return id

def insert_result(conn, driver_id, position, grid, number, constructor_id, race_track_id,
                  points, race_year):
    cursor = conn.cursor()
    insert_query = '''INSERT INTO results (driver_id, position, grid, number, constructor_id,
                                           race_track_id, points, race_year)
                      VALUES (?, ?, ?, ?, ?, ?, ?, ?)'''
    cursor.execute(insert_query, (driver_id, position, grid, number, constructor_id,
                                  race_track_id, points, race_year))
    conn.commit()
    print("Result inserted successfully.")

def populate_race_db(year, ns, conn):
    base_url = f'http://ergast.com/api/f1/{year}/results'
    # Initial fetch to determine pagination
    response = requests.get(base_url)
    xml_data = response.content
    root = ET.fromstring(xml_data)
    print(f"Fetching data from year - {year}")
    # Pagination details
    total = int(root.attrib['total'])
    limit = int(root.attrib['limit'])
    offset = 0

    # Fetching all results page by page
    while offset < total:
        paginated_url = f"{base_url}?limit={limit}&offset={offset}"
        response = requests.get(paginated_url)
        xml_data = response.content
        root = ET.fromstring(xml_data)
        for race in root.findall(".//mrd:Race", ns):
            circuit = race.find("mrd:Circuit", ns)
            location = circuit.find("mrd:Location", ns)
            year = race.get("season")

            track_name = circuit.find("mrd:CircuitName", ns).text
            track_locality = location.find("mrd:Locality", ns).text
            track_country = location.find("mrd:Country", ns).text

            track_id = insert_track_if_not_exists(conn, track_locality, track_country, track_name)

            for result in race.findall(".//mrd:Result", ns):
                # Extract driver information
                driver = result.find(".//mrd:Driver", ns)
                driver_code = driver.get('code')
                given_name = driver.find("mrd:GivenName", ns).text
                family_name = driver.find("mrd:FamilyName", ns).text
                driver_id = insert_driver_if_not_exists(conn, given_name, family_name, driver_code)

                # Extract constructor information
                constructor = result.find(".//mrd:Constructor", ns)
                constructor_name = constructor.find("mrd:Name", ns).text
                constructor_id = constructor.get("constructorId")
                constructor_id = insert_constructor_if_not_exists(conn, constructor_name,
                                                                  constructor_id)

                # Extract result information
                position = result.get("position")
                points = result.get("points")
                number = result.get("number")
                grid = result.find("mrd:Grid", ns).text

                insert_result(conn, driver_id, position, grid, number, constructor_id,
                              race_track_id=track_id, points=points, race_year=year)

        offset += limit

def print_all_results_group_by_year(conn):
    cursor = conn.cursor()
    query = '''
        SELECT r.race_year, d.first_name, d.last_name, r.position, r.grid, r.number,
               c.name AS constructor_name, t.name AS track_name
        FROM results r
        JOIN drivers d ON r.driver_id = d.id
        JOIN constructors c ON r.constructor_id = c.id
        JOIN tracks t ON r.race_track_id = t.id
        ORDER BY r.race_year, r.id
    '''
    cursor.execute(query)
    results = cursor.fetchall()

    current_year = None
    for result in results:
        (race_year, first_name, last_name, position, grid, number, constructor_name,
         track_name) = result

        if race_year != current_year:
            print(f"\nYear: {race_year}")
            current_year = race_year

        print(f"Driver: {first_name} {last_name}, Position: {position}, Grid: {grid}, "
              f"Number: {number}, Constructor: {constructor_name}, Track: {track_name}")

ns = {'mrd': 'http://ergast.com/mrd/1.5'}

conn = sqlite3.connect('f1_database.db')

create_driver_table(conn)
create_constructor_table(conn)
create_track_table(conn)
create_results_table(conn)

# Loop through years and fetch results (2003 to 2023 inclusive)
for year in range(2003, 2024):
    populate_race_db(year, ns, conn)

print_all_results_group_by_year(conn)

conn.close()

# Extracting lap time data from the Ergast API.
# The driver, constructor, track and results table helpers are identical to
# those in the results script above and are reused here.
import requests
import sqlite3
import xml.etree.ElementTree as ET

def create_laps_table(conn):
    cursor = conn.cursor()
    cursor.execute('''CREATE TABLE IF NOT EXISTS laps
                      (id INTEGER PRIMARY KEY,
                       driver TEXT,
                       position INTEGER,
                       time TEXT,
                       track_id INTEGER,
                       lap_number INTEGER,
                       year DATE,
                       FOREIGN KEY (track_id) REFERENCES tracks(id))''')

def insert_lap(conn, driver, position, time, track_id, lap_number, year):
    cursor = conn.cursor()
    insert_query = '''INSERT INTO laps (driver, position, time, track_id, lap_number, year)
                      VALUES (?, ?, ?, ?, ?, ?)'''
    cursor.execute(insert_query, (driver, position, time, track_id, lap_number, year))
    conn.commit()
    print("Lap inserted successfully.")

def get_laps(year, ns, conn):
    for round in range(1, 22):  # rounds 1 to 21 of the season
        base_url = f'http://ergast.com/api/f1/{year}/{round}/laps'
        # Initial fetch to determine pagination
        response = requests.get(base_url)
        xml_data = response.content
        root = ET.fromstring(xml_data)
        print(f"Fetching data from year - {year}")
        # Pagination details
        total = int(root.attrib['total'])
        limit = int(root.attrib['limit'])
        offset = 0

        # Fetching all laps page by page
        while offset < total:
            paginated_url = f"{base_url}?limit={limit}&offset={offset}"
            response = requests.get(paginated_url)
            xml_data = response.content
            root = ET.fromstring(xml_data)
            for race in root.findall(".//mrd:Race", ns):
                circuit = race.find("mrd:Circuit", ns)
                location = circuit.find("mrd:Location", ns)
                year = race.get("season")

                track_name = circuit.find("mrd:CircuitName", ns).text
                track_locality = location.find("mrd:Locality", ns).text
                track_country = location.find("mrd:Country", ns).text

                track_id = insert_track_if_not_exists(conn, track_locality, track_country,
                                                      track_name)

                lap_list = race.find("mrd:LapsList", ns)
                laps = lap_list.findall("mrd:Lap", ns)
                for lap in laps:
                    timings = lap.findall("mrd:Timing", ns)
                    lap_number = lap.get("number")
                    for timing in timings:
                        driver = timing.get("driverId")
                        position = timing.get("position")
                        time = timing.get("time")
                        insert_lap(conn, driver, position, time, track_id, lap_number, year)

            offset += limit

def print_laps_with_track(conn):
    cursor = conn.cursor()
    query = '''
        SELECT l.id, l.driver, l.position, l.time, l.year, t.locality, t.country, t.name
        FROM laps l
        JOIN tracks t ON l.track_id = t.id
        ORDER BY l.id
    '''
    cursor.execute(query)
    results = cursor.fetchall()

    for row in results:
        lap_id, driver, position, lap_time, year, locality, country, track_name = row
        print(f"Lap ID: {lap_id}, Driver: {driver}, Position: {position}, Time: {lap_time}, "
              f"Year: {year}, Track: {track_name}, Locality: {locality}, Country: {country}")

ns = {'mrd': 'http://ergast.com/mrd/1.5'}

conn = sqlite3.connect('f1_database.db')

create_track_table(conn)
create_laps_table(conn)

# Loop through years and fetch lap data
for year in range(2021, 2023):
    get_laps(year, ns, conn)

print_laps_with_track(conn)

conn.close()
# Inserting financial data into the database
import pandas as pd
import sqlite3

# Reading the CSV file of stock prices
df = pd.read_csv('RACE.csv')

# Clean up column names (the result must be assigned back)
df.columns = df.columns.str.strip()

connection = sqlite3.connect('f1_database.db')
df.to_sql('Ferrari_stock', connection, if_exists='replace')

connection.close()

# Web scraping news articles
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import sqlite3
from random import randint

# Connect to SQLite database
conn = sqlite3.connect('f1_ferrarinews2021.db')
c = conn.cursor()

# Create articles table
c.execute('''CREATE TABLE IF NOT EXISTS articles
             (id INTEGER PRIMARY KEY AUTOINCREMENT, url TEXT, paragraph TEXT)''')

# Setup WebDriver with a User-Agent
options = webdriver.ChromeOptions()
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                     "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3")

service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=options)

# Blacklist keywords/phrases for unwanted paragraphs
blacklist_words = ["cookie disclaimer", "related content", "you may also like", "Subscriber",
                   "cookies", "browser", "aggregated", "anonymous", "advertising", "Internet",
                   "devices", "identifiers", "tracking", "articles", "geolocation", "Apps",
                   "Newsletters", "fraudulent", "reviews"]

# List of URLs
urls = [
    'https://www.carandbike.com/news/f1-ferrari-to-develop-sf21-till-june-2021-2414332',
    'https://www.businesswire.com/news/home/20210617005933/en/Ferrari-Selects-AWS-as-its-Official-Cloud-Provider-to-Power-Innovation-on-the-Road-and-Track',
    'https://www.planetf1.com/news/controversial-mission-winnow-dropped-ferrari',
    'https://www.carandbike.com/news/ferrari-discussing-new-f1-deal-with-philip-morris-despite-mission-winnow-eu-ban-2471177',
    'https://www.carandbike.com/news/ferrari-discussing-new-f1-deal-with-philip-morris-despite-mission-winnow-eu-ban-2471177',
    'https://sportsmintmedia.com/formula-1-ferrari-signs-cloud-partnership-deal-with-amazon-web-services/',
    'https://www.fia.com/news/f1-verstappen-quickest-red-bull-ring-ahead-ferraris-leclerc-and-sainz',
    'https://www.the-race.com/formula-1/ferrari-to-use-generational-new-simulator-for-22-f1-car/',
    'https://www.pmw-magazine.com/news/team-news/ferrari-completes-install-of-new-dil-simulator-for-f1-team.html',
    'https://www.pmw-magazine.com/news/team-news/ferrari-completes-install-of-new-dil-simulator-for-f1-team.html',
    'https://www.racefans.net/2021/08/09/ferrari-power-unit-upgrade-significant-step-f1-2021/',
    'https://us.motorsport.com/f1/news/how-ferraris-new-gearbox-casing-helped-boost-its-f1-aero/6653646/',
    'https://www.formula1.com/en/latest/article.ferrari-to-debut-new-engine-in-russia-forcing-leclerc-to-start-from-back-of.NsUPIl5I66ZIol5eNMGKE.html',
    'https://www.autosport.com/f1/news/sainz-calls-on-ferrari-to-analyse-recent-f1-pit-errors/6727337/',
    'https://www.santander.com/en/press-room/press-releases/2021/12/santander-agrees-a-multi-year-partnership-with-scuderia-ferrari',
    'https://www.the-race.com/formula-1/ferrari-drops-mission-winnow-name-still-in-philip-morris-talks/',
]

# Scrape and store data
for url in urls:
    driver.get(url)
    time.sleep(randint(2, 10))  # Random delay between 2 and 10 seconds

    soup = BeautifulSoup(driver.page_source, 'html.parser')
    article_text = soup.find_all('p')

    for paragraph in article_text:
        skip_paragraph = False
        for word in blacklist_words:
            if word.lower() in paragraph.text.lower():
                skip_paragraph = True
                break  # Exit inner loop if any blacklist word is found

        if not skip_paragraph:
            c.execute("INSERT INTO articles (url, paragraph) VALUES (?, ?)",
                      (url, paragraph.text))

    conn.commit()

# Cleanup
conn.close()
driver.quit()

# Note: this code was reused multiple times with the URLs and file names adjusted.
