Web Scraping


To run a Python command in Jupyter:

● Press SHIFT + ENTER after entering each line to run the command. Pressing ENTER
alone adds a new line below so you can continue typing.

OR

● Click the RUN button


TEST WITH SIMPLE HTML TAGS USING BEAUTIFULSOUP

1. Download simple.txt from i-learn

2. Copy the text, paste it into Notepad++, and save it as an HTML file on your desktop.

3. Open Anaconda Navigator

4. Launch Jupyter Notebook

5. New → Python 3

6. Enter the following codes:

a. from bs4 import BeautifulSoup as bs SHIFT + ENTER

b. test_url = "C:\\Users\\User\\Desktop\\simple.html" SHIFT + ENTER [depends on the
location of the file]

c. soup = bs(open(test_url), 'html.parser') SHIFT + ENTER

d. print (soup) SHIFT + ENTER

e. print (soup.prettify()) SHIFT + ENTER


f. soup.title SHIFT + ENTER

g. soup.body SHIFT + ENTER

h. soup.body.contents[1] SHIFT + ENTER

i. soup.get_text()

j. print (soup.get_text())
k. print (soup.get_text(strip=True))

l. print (soup.get_text(' ', strip=True))

m. soup.findAll('p')

n. soup.findAll('p',{'id':'First content'})
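
The results of the steps above depend on the contents of simple.html, which is not reproduced here. As a minimal sketch, the same calls can be tried on an inline HTML string. The markup below is a made-up stand-in, not the actual i-learn file, so the real output may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for simple.html (the real file's contents may differ).
html = """<html>
<head><title>Simple page</title></head>
<body>
<p id="First content">Hello, this is the first paragraph.</p>
<p id="Second content">This is the second paragraph.</p>
</body>
</html>"""

soup = BeautifulSoup(html, "html.parser")

print(soup.title)               # the whole <title> tag
print(soup.title.string)        # only the text inside <title>
print(soup.body.contents[1])    # contents[0] is the newline right after <body>

# find_all is the current name for findAll; both return a list of tags.
paragraphs = soup.find_all("p")
first = soup.find_all("p", {"id": "First content"})

print(len(paragraphs))
print(first[0].get_text())
print(soup.get_text(" ", strip=True))
```

Parsing a string instead of an opened file makes it easy to experiment with small changes to the HTML and see how each call responds.
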
How to read and write data from/to a file using python

open(filename, file mode), where the file mode is 'w' (write), 'a' (append) or 'r' (read)


1. How to write data into a file. (if the file exists, then the content will be overwritten)

Example : writing a text into file named ‘lineText.txt’


filename = "lineText.txt"                    # specify the file name lineText.txt
f = open(filename, 'w')                      # open the file for writing

for i in range(10):                          # repeat 10 times, just to print the text
    f.write("This is line %d\r\n" % (i+1))   # writes "This is line 1", "This is line 2", ...

f.close()                                    # close the file lineText.txt

%d = prints an integer, which comes from % (i+1)
\r = inserts a carriage return (ENTER key)
\n = new line

2. How to append to an existing file. (if the file exists, the new content will be
appended and the existing content stays intact)

Example : append the text to file named ‘lineText.txt’

filename = "lineText.txt"
f = open(filename, 'a+')

for i in range(5):
    f.write("Appended line %d\r\n" % (i+1))

f.close()
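
The difference between the 'w' and 'a' modes can be checked with a short sketch like the one below. It is an illustration only: it writes into a temporary folder (an assumption made here so no existing file is touched), uses a `with` block instead of an explicit f.close(), and writes "\n" alone, since Python's text mode translates "\n" to the platform's line ending when writing.

```python
import os
import tempfile

# Work in a temporary folder so the sketch does not touch existing files.
folder = tempfile.mkdtemp()
filename = os.path.join(folder, "lineText.txt")

# 'w' creates the file, or empties it first if it already exists.
with open(filename, "w") as f:
    for i in range(10):
        f.write("This is line %d\n" % (i + 1))

# 'a' keeps the existing content and adds the new text at the end.
with open(filename, "a") as f:
    for i in range(5):
        f.write("Appended line %d\n" % (i + 1))

# The with-block closes the file automatically, even if an error occurs.
with open(filename, "r") as f:
    lines = f.read().splitlines()

print(len(lines))    # 10 written lines plus 5 appended lines
print(lines[0])
print(lines[-1])
```

Running the first `with` block a second time would reset the file to 10 lines, because 'w' discards whatever was there before.
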

****Note : you can find the file in the folder anaconda3/script. Since it is a text file,
you can view it using Notepad. You can also view it in the Jupyter Notebook file
explorer: select the file, then select View to see its content.

3. How to read all contents in the file.

Example : read all the contents in a file named ‘lineText.txt’

filename = "lineText.txt"
f = open (filename, 'r')

f1 = f.read()
print (f1)

f.close()

4. How to read content in a file line by line.

Example : read the content in a file named ‘lineText.txt’ line by line

filename = "lineText.txt"
f = open(filename, 'r')

f1 = f.readlines()

for x in f1:
    print (x)

f.close()
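
Each line returned by readlines() still ends with its newline character, so print (x) shows a blank line between entries. A possible variation, sketched below with a small sample file created in a temporary folder (the content is made up for illustration), loops over the file object directly and strips the trailing newline before printing.

```python
import os
import tempfile

# Create a small sample file in a temporary folder (hypothetical content).
filename = os.path.join(tempfile.mkdtemp(), "lineText.txt")
with open(filename, "w") as f:
    f.write("This is line 1\nThis is line 2\nThis is line 3\n")

# A file object can be looped over directly, one line at a time.
cleaned = []
with open(filename, "r") as f:
    for line in f:
        text = line.rstrip("\n")    # drop the trailing newline
        cleaned.append(text)
        print(text)
```

Looping over the file object reads one line at a time, which also avoids loading a very large file into memory all at once.
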

How to start web scraping in Jupyter Notebook

1. Import BeautifulSoup from package bs4

2. Import the URL package (urllib.request), to open the URL address of the website

3. Copy the URL address selected from the website

4. Request to open the connection, read the webpage and download to our machine

5. Read the HTML tags from the webpage (scraped contents)

6. Close the connection to the webpage

7. Parse the contents (turn the webpage HTML into a structure that can be searched)
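
The seven steps above can be sketched as one small function. The function name fetch_and_parse and the sample page are assumptions made for illustration. A local file URL is used in place of a real web address purely so the sketch runs without a network connection; a copied https:// address would be passed in the same way.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen            # step 2: the URL package

from bs4 import BeautifulSoup                 # step 1


def fetch_and_parse(url):
    """Download a page and return it as a BeautifulSoup object."""
    client = urlopen(url)                     # step 4: open the connection
    page_html = client.read()                 # step 5: read the raw HTML
    client.close()                            # step 6: close the connection
    return BeautifulSoup(page_html, "html.parser")    # step 7: parse


# Step 3 would normally be an address copied from the browser.
# Here a hypothetical local page stands in for it.
sample = Path(tempfile.mkdtemp()) / "sample.html"
sample.write_text(
    "<html><head><title>Sample</title></head><body><p>Hi</p></body></html>"
)
page_soup = fetch_and_parse(sample.as_uri())

print(page_soup.title.string)
print(page_soup.p.get_text())
```
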


How to write the scraped data into a file (csv file – excel format)
Example :

Scrape data from the webpage

https://www.newegg.com/Video-Cards-Video-Devices/Category/ID-38?Tpk=graphics%20CARD

and save the data into a file named test.csv (a delimited format that Excel can open).
It will be saved in the anaconda3/script folder.

from bs4 import BeautifulSoup as soup


from urllib.request import urlopen as uReq
my_url = 'https://www.newegg.com/Video-Cards-Video-Devices/Category/ID-38?Tpk=graphics%20CARD'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, 'html.parser')

………..

filename = 'test.csv'
f = open(filename,'w')

f.close()

Example :
To scrape
https://www.newegg.com/Laptops-Notebooks/Category/ID-223?Tid=17489

from bs4 import BeautifulSoup as soup


from urllib.request import urlopen as uReq

my_url = "https://www.newegg.com/Laptops-Notebooks/Category/ID-223?Tid=17489"

uClient = uReq(my_url) #to request the connection to URL specified

my_page = uClient.read() #read the webpage connected

uClient.close() #close the connection once the page has been read

page_soup = soup(my_page, "html.parser") #to parse the webpage content

#to select all tags <div class = item-container>


my_content = page_soup.findAll("div", {"class":"item-container"})

print (my_content) #to display what is in my_content

for x in my_content: #loop through all the contents in the item-container
    model = x.div.div.a.img['title'] #scrape the title of each item – tree navigation
    print (model)

for x in my_content:
    model = x.div.div.a.img['title'] #different title name for each image
    item_desc = x.findAll('a',{'class':'item-title'}) #find all the <a href ......class= item-title
    print(len(item_desc)) #how many contents are there?
    print(item_desc[0]) # Array index always starts with 0

for x in my_content:
    model = x.div.div.a.img['title']
    item_desc = x.findAll('a',{'class':'item-title'})
    print(len(item_desc))
    print(item_desc[0].text) # display only text

for x in my_content:
    model = x.div.div.a.img['title']
    item_desc = x.findAll('a',{'class':'item-title'})
    print ('Model : ' + model)
    print ('Product Name : ' + item_desc[0].text + '\n')

for x in my_content:
    model = x.div.div.a.img['title']
    item_desc = x.findAll('a',{'class':'item-title'})
    shipping = x.findAll('li',{'class':'price-ship'}) #shipping information
    print(shipping[0].text.strip())

for x in my_content:
    model = x.div.div.a.img['title']
    item_desc = x.findAll('a',{'class':'item-title'})
    shipping = x.findAll('li',{'class':'price-ship'})
    print('Model : ' + model)
    print('Product Description : ' + item_desc[0].text)
    print('Shipping : ' + shipping[0].text.strip() + '\n')
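
The loops above assume every item-container has the same tags in the same places. If one item lacks an image title or a shipping entry, x.div.div.a.img['title'] or shipping[0] raises an error and the loop stops. A more defensive variation might look like the sketch below. The HTML is hypothetical markup shaped like the blocks described above, not the live Newegg page, and text_or_default is a helper name invented for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the item-container blocks used above.
# The second item has no shipping entry, to show the fallback.
html = """
<div class="item-container">
  <div><div><a href="#"><img title="Brand A" src="a.png"></a></div></div>
  <a class="item-title" href="#">Laptop A 15 inch</a>
  <li class="price-ship"> Free Shipping </li>
</div>
<div class="item-container">
  <div><div><a href="#"><img title="Brand B" src="b.png"></a></div></div>
  <a class="item-title" href="#">Laptop B 14 inch</a>
</div>
"""

page_soup = BeautifulSoup(html, "html.parser")


def text_or_default(tag, default="N/A"):
    """Return the stripped text of a tag, or a default if the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else default


rows = []
for item in page_soup.find_all("div", {"class": "item-container"}):
    img = item.find("img")    # find() returns None when nothing matches
    model = img["title"] if img is not None and img.has_attr("title") else "N/A"
    desc = text_or_default(item.find("a", {"class": "item-title"}))
    shipping = text_or_default(item.find("li", {"class": "price-ship"}))
    rows.append((model, desc, shipping))
    print(model, "|", desc, "|", shipping)
```

Using find() with a None check trades the short tree-navigation form for a loop that keeps going when a page element is missing.
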

To write this data to a CSV file (a delimited format that Excel can open)
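
One way to do this, sketched here under assumptions, is Python's built-in csv module. The rows below are made-up values standing in for the scraped model, description and shipping text, the column names are invented for illustration, and the file goes into a temporary folder rather than anaconda3/script. The csv module adds quotes around any field that contains a comma, which product descriptions often do.

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for the scraped values.
rows = [
    ("Brand A", 'Laptop A, 15.6" display', "Free Shipping"),
    ("Brand B", "Laptop B 14 inch", "$4.99 Shipping"),
]

filename = os.path.join(tempfile.mkdtemp(), "test.csv")

# newline="" lets the csv module control the line endings itself.
with open(filename, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["model", "product_description", "shipping"])   # header row
    writer.writerows(rows)

# Read the file back to confirm that commas inside fields survived.
with open(filename, newline="", encoding="utf-8") as f:
    loaded = list(csv.reader(f))

print(loaded[0])
print(loaded[1])
```

Writing each line by hand with f.write and commas also works, but any comma inside a description would then split that field into two columns when the file is opened in Excel.
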
