Web Scraping

To run a Python command in Jupyter:
● Press SHIFT + ENTER after typing a command to run it. Pressing ENTER alone only adds a
new line below so you can keep typing in the same cell.
OR
● Click the RUN button.

TEST WITH SIMPLE HTML TAGS USING BEAUTIFULSOUP
1. Download simple.txt from i-learn
2. Copy the text, paste it into Notepad++, and save it as an HTML file (simple.html) on your desktop.
3. Open Anaconda Navigator
4. Launch Jupyter Notebook
5. New 🡪 Python 3
6. Enter the following code, pressing SHIFT + ENTER after each line:
a. from bs4 import BeautifulSoup as bs
b. test_url = "C:\\Users\\User\\Desktop\\simple.html" [depends on the location of the file]
c. soup = bs(open(test_url), 'html.parser')
d. print(soup)
e. print(soup.prettify())
f. soup.title
g. soup.body
h. soup.body.contents[1]
i. soup.get_text()
j. print(soup.get_text())
k. print(soup.get_text(strip=True))
l. print(soup.get_text(' ', strip=True))
m. soup.findAll('p')
n. soup.findAll('p', {'id': 'First content'})
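
The calls in step 6 can also be tried without simple.html by parsing an HTML string directly. The sketch below is a minimal example under that assumption: the markup is a hypothetical stand-in (the real contents of simple.txt are not reproduced here), and find_all is the current BeautifulSoup 4 spelling of findAll.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for simple.html; the real file may differ.
html = """
<html>
  <head><title>Simple page</title></head>
  <body>
    <p id="First content">Hello, world.</p>
    <p id="Second content">Goodbye.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.text                          # text inside <title>
all_text = soup.get_text(" ", strip=True)             # all text, pieces joined by one space
paragraphs = soup.find_all("p")                       # every <p> tag
first = soup.find_all("p", {"id": "First content"})   # <p> tags with a matching id

print(title_text)
print(all_text)
print(len(paragraphs))
print(first[0].text)
```

find_all always returns a list, which is why the last line indexes it with [0] before reading .text.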

How to read and write data from/to a file using Python
open(filename, file mode)

1. How to write data into a file (if the file exists, its content will be overwritten).
Example: writing text into a file named 'lineText.txt'

filename = "lineText.txt"                     # specify the file name lineText.txt
f = open(filename, 'w')                       # open the file for writing
for i in range(10):                           # repeat 10 times, just to print the text
    f.write("This is line %d\r\n" % (i+1))    # writes "This is line 1", "This is line 2", ...
f.close()                                     # close the file lineText.txt

%d = prints the integer number that comes from % (i+1)
\r = inserts a carriage return (ENTER key)
\n = new line
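
A common alternative to calling f.close() by hand is a with block, which closes the file automatically, even if an error occurs inside the loop. The sketch below is a minimal example, not the handout's own method: it writes into a temporary folder (an assumption, so that no existing file is touched) and uses "\n" alone, because text mode normally translates "\n" to the platform's line ending on write, and writing "\r\n" can leave an extra carriage return on Windows.

```python
import os
import tempfile

# Write into a temporary folder so the sketch does not touch existing files.
folder = tempfile.mkdtemp()
filename = os.path.join(folder, "lineText.txt")

with open(filename, "w") as f:               # 'w' creates or overwrites the file
    for i in range(10):
        f.write("This is line %d\n" % (i + 1))
# f has been closed automatically at this point

with open(filename, "r") as f:
    lines = f.read().splitlines()            # file content as a list of lines

print(len(lines))
print(lines[0])
print(lines[-1])
```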
2. How to append to an existing file (if the file exists, the new content is appended and
the existing content stays intact).
Example: append text to the file named 'lineText.txt'

f = open(filename, 'a+')
for i in range(5):
    f.write("Appended line %d\r\n" % (i+1))
f.close()

****Note: you can find the file in the anaconda3/script folder. Since it is a text file,
you can view it using Notepad. You can also view it in the Jupyter Notebook file
explorer: select the file, then select View to see its content.
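
The difference between the 'w' and 'a' modes is easiest to see side by side. The sketch below is a minimal illustration; write_lines and read_lines are hypothetical helper names introduced only for this example, and the file lives in a temporary folder.

```python
import os
import tempfile

filename = os.path.join(tempfile.mkdtemp(), "lineText.txt")


def write_lines(mode, prefix, count):
    """Open filename in the given mode and write count numbered lines."""
    with open(filename, mode) as f:
        for i in range(count):
            f.write("%s line %d\n" % (prefix, i + 1))


def read_lines():
    """Return the file content as a list of lines without newlines."""
    with open(filename, "r") as f:
        return f.read().splitlines()


write_lines("w", "This is", 3)       # file now holds 3 lines
write_lines("a", "Appended", 2)      # 'a' keeps them and adds 2 more
after_append = read_lines()

write_lines("w", "Fresh", 1)         # 'w' discards everything written so far
after_overwrite = read_lines()

print(after_append)
print(after_overwrite)
```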
3. How to read all contents in the file.
Example: read all the contents of the file named 'lineText.txt'

f = open(filename, 'r')
f1 = f.read()
print(f1)
f.close()

4. How to read content in a file line by line.
Example: read the content of the file named 'lineText.txt' line by line

f = open(filename, 'r')
f1 = f.readlines()
for x in f1:
    print(x)
f.close()
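
Each line returned by readlines() still ends with its newline character, so print(x) shows an empty line after every line of text. One way around this, sketched below under the assumption of a small sample file created on the spot, is to loop over the file object directly and strip the trailing newline before printing.

```python
import os
import tempfile

# Create a small sample file so the sketch is self-contained.
filename = os.path.join(tempfile.mkdtemp(), "lineText.txt")
with open(filename, "w") as f:
    f.write("This is line 1\nThis is line 2\nThis is line 3\n")

cleaned = []
with open(filename, "r") as f:
    for line in f:                          # the file object yields one line at a time
        cleaned.append(line.rstrip("\n"))   # drop the trailing newline

for text in cleaned:
    print(text)
```

Looping over the file object also avoids loading the whole file into a list first, which matters for large files.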

How to start web scraping in Jupyter Notebook
1. Import BeautifulSoup from the bs4 package.
2. Import the URL package (urllib.request), which opens URL addresses.
3. Copy the URL address of the page you want from the website.
4. Request a connection to the URL, then read the webpage and download it to your machine.
5. Read the HTML tags from the webpage (the scraped contents).
6. Close the connection to the webpage.
7. Parse the contents (turn the raw HTML into a tree that can be searched).
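
The seven steps above can be sketched as one small function. This is a minimal sketch, not a definitive implementation: fetch_and_parse is a hypothetical name, and a local file opened through a file:// URL stands in for a real website so that the example runs without an internet connection.

```python
from pathlib import Path
from tempfile import TemporaryDirectory
from urllib.request import urlopen       # step 2: the URL package

from bs4 import BeautifulSoup            # step 1


def fetch_and_parse(url):
    """Open url, download its HTML, close the connection, and parse it."""
    with urlopen(url) as client:         # step 4: request the connection
        page_html = client.read()        # step 5: read the raw HTML (bytes)
    # step 6: leaving the with block has closed the connection
    return BeautifulSoup(page_html, "html.parser")   # step 7: parse


# Offline stand-in for step 3: a local HTML file addressed by a file:// URL.
with TemporaryDirectory() as tmp:
    page = Path(tmp) / "demo.html"
    page.write_text(
        "<html><head><title>Demo page</title></head>"
        "<body><p>Hello</p></body></html>",
        encoding="utf-8",
    )
    demo_soup = fetch_and_parse(page.as_uri())

print(demo_soup.title.text)
```

For a real page, pass its https:// address to fetch_and_parse instead of the file:// URL.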

How to write the scraped data into a file (CSV file, a format Excel can open)
Example:
Scrape data from the webpage
https://www.newegg.com/Video-Cards-Video-Devices/Category/ID-38?Tpk=graphics%20CARD
and save the data into a file named test.csv (delimited format that Excel can open). It will
be saved in the anaconda3/script folder.

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

my_url = 'https://www.newegg.com/Video-Cards-Video-Devices/Category/ID-38?Tpk=graphics%20CARD'
uClient = uReq(my_url)                        # request the connection to the URL
page_html = uClient.read()                    # read the webpage
uClient.close()                               # close the connection
page_soup = soup(page_html, 'html.parser')    # parse the webpage content
………..
filename = 'test.csv'
f = open(filename, 'w')                       # open test.csv for writing
f.close()                                     # close the file when done
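
Product names often contain commas, which would split one value across two columns if the line were assembled by hand. Python's standard csv module quotes such values automatically. The sketch below is a minimal example with hypothetical rows and column names; in practice the values would come from page_soup.

```python
import csv
import os
import tempfile

# Hypothetical rows; in practice these would be scraped from the page.
rows = [
    ("Brand A", "Graphics card, 8GB", "Free Shipping"),
    ("Brand B", "Graphics card, 12GB", "$4.99 Shipping"),
]

filename = os.path.join(tempfile.mkdtemp(), "test.csv")

# newline="" lets the csv module control the line endings itself.
with open(filename, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["brand", "product_name", "shipping"])   # header row
    writer.writerows(rows)                                    # one line per product

# Read the file back to check that the commas survived the round trip.
with open(filename, newline="", encoding="utf-8") as f:
    saved = list(csv.reader(f))

print(saved[1])
```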
Example:
To scrape
https://www.newegg.com/Laptops-Notebooks/Category/ID-223?Tid=17489

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

my_url = "https://www.newegg.com/Laptops-Notebooks/Category/ID-223?Tid=17489"
uClient = uReq(my_url)                      # request the connection to the URL specified
my_page = uClient.read()                    # read the connected webpage
uClient.close()                             # close the connection
page_soup = soup(my_page, "html.parser")    # parse the webpage content

# select all tags <div class="item-container">
my_content = page_soup.findAll("div", {"class": "item-container"})
print(my_content)                           # display what is in my_content

for x in my_content:                        # loop through all the contents in the item-container
    model = x.div.div.a.img['title']        # scrape the image title by tree navigation
    print(model)

for x in my_content:
    model = x.div.div.a.img['title']                      # different title for each image
    item_desc = x.findAll('a', {'class': 'item-title'})   # find all <a href=... class="item-title">
    print(len(item_desc))                                 # how many matches are there?
    print(item_desc[0])                                   # list indexes always start at 0

for x in my_content:
    model = x.div.div.a.img['title']
    item_desc = x.findAll('a', {'class': 'item-title'})
    print(len(item_desc))
    print(item_desc[0].text)                              # display only the text
    print('Model : ' + model)
    print('Product Name : ' + item_desc[0].text + '\n')
    shipping = x.findAll('li', {'class': 'price-ship'})   # shipping information
    print(shipping[0].text.strip())

for x in my_content:
    model = x.div.div.a.img['title']
    item_desc = x.findAll('a', {'class': 'item-title'})
    shipping = x.findAll('li', {'class': 'price-ship'})
    print('Model : ' + model)
    print('Product Description : ' + item_desc[0].text)
    print('Shipping : ' + shipping[0].text.strip() + '\n')
To write this data to a CSV file (a delimited format that Excel can open), open the file in write mode and write one line per product.
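
One way to connect the loop over item-container blocks to a CSV file is sketched below. It is a hedged example, not the handout's own code: sample_html is hypothetical markup shaped like the blocks described above (the live page may be structured differently), text_of is a hypothetical helper, and the column names are assumptions. The helper returns an empty string when a tag is missing, so an item without shipping information does not stop the loop.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup shaped like the item-container blocks;
# the live page may be structured differently.
sample_html = """
<div class="item-container">
  <div><div><a href="#"><img title="Brand A"></a></div></div>
  <a class="item-title" href="#">Laptop, 15.6 inch</a>
  <ul><li class="price-ship"> Free Shipping </li></ul>
</div>
<div class="item-container">
  <div><div><a href="#"><img title="Brand B"></a></div></div>
  <a class="item-title" href="#">Laptop, 14 inch</a>
</div>
"""


def text_of(tag):
    """Return the stripped text of tag, or an empty string if tag is None."""
    return tag.text.strip() if tag is not None else ""


page_soup = BeautifulSoup(sample_html, "html.parser")
records = []
for x in page_soup.find_all("div", {"class": "item-container"}):
    image = x.find("img")
    records.append({
        "model": image.get("title", "") if image is not None else "",
        "description": text_of(x.find("a", {"class": "item-title"})),
        "shipping": text_of(x.find("li", {"class": "price-ship"})),
    })

# Written to an in-memory buffer here; replace it with
# open("test.csv", "w", newline="") to produce a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["model", "description", "shipping"])
writer.writeheader()
writer.writerows(records)

print(buffer.getvalue())
```

The second item has no price-ship element, so its shipping column is simply left empty in the output.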
