Web Scraping
● Press SHIFT + ENTER after entering each line to run the command. Pressing ENTER alone adds a new
line below so you can continue typing.
OR
2. Copy the text, paste it into Notepad++, and save it as an HTML file on your desktop.
5. New → Python 3
i. soup.get_text()
j. print (soup.get_text())
k. print (soup.get_text(strip=True))
m. soup.findAll('p')
n. soup.findAll('p',{'id':'First content'})
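The calls listed above can be tried on a small in-memory page. This is a minimal sketch, assuming the bs4 package is installed; the HTML string and its contents are made up for illustration and are not the file saved in the earlier step. It uses find_all, the current BeautifulSoup 4 spelling of findAll (both names accept the same arguments).

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, standing in for the saved HTML file
html = """
<html><body>
<h1>Sample page</h1>
<p id="First content">First paragraph.</p>
<p id="Second content">Second paragraph.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# All text in the document; strip=True trims each piece and drops empty ones
all_text = soup.get_text(strip=True)
print(all_text)

# Every <p> tag in the document
paragraphs = soup.find_all("p")
print(len(paragraphs))

# Only the <p> tag whose id attribute matches exactly
first = soup.find_all("p", {"id": "First content"})
print(first[0].get_text())
```

Note that strip=True joins the pieces of text with nothing in between; pass a separator (for example soup.get_text(" ", strip=True)) if you want spaces between them.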
How to read and write data from/to a file using Python
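filename = "lineText.txt"
f = open(filename, 'w')  # open for writing; creates the file if needed
for i in range(10):  # repeat 10 times, just to print the text
    f.write("This is line %d\r\n" % (i+1))  # writes "This is line 1", "This is line 2", ...
f.close()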
filename = "lineText.txt"
f = open(filename, 'a+')
for i in range(5):
    f.write("Appended line %d\r\n" % (i+1))
f.close()
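The write and append steps can also be written with a with block, which closes the file automatically even if an error occurs. This is a minimal sketch under two assumptions: it writes into a temporary folder so that it does not overwrite lineText.txt in your working folder, and it writes "\n" rather than "\r\n", because text mode already converts "\n" to the platform's line ending.

```python
import os
import tempfile

# Temporary folder (an assumption of this sketch) so existing files stay untouched
folder = tempfile.mkdtemp()
filename = os.path.join(folder, "lineText.txt")

# 'w' creates the file, or empties it if it already exists
with open(filename, "w") as f:
    for i in range(10):
        f.write("This is line %d\n" % (i + 1))

# 'a' keeps the existing content and adds new lines at the end
with open(filename, "a") as f:
    for i in range(5):
        f.write("Appended line %d\n" % (i + 1))

# Read the file back to confirm that both steps took effect
with open(filename, "r") as f:
    lines = f.read().splitlines()

print(len(lines))
print(lines[0])
print(lines[-1])
```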
****Note: you can find the file in the anaconda3/script folder. Since it is a text file,
you can open it in Notepad. You can also locate the file in the Jupyter Notebook
file explorer and select View to see its content.
filename = "lineText.txt"
f = open (filename, 'r')
f1 = f.read()
print (f1)
f.close()
filename = "lineText.txt"
f = open(filename, 'r')
f1 = f.readlines()
for x in f1:
    print(x)
f.close()
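The difference between read() and readlines() is easier to see side by side. The sketch below is illustrative only: the temporary folder and the three sample lines are assumptions, not the file created earlier. It also shows why printing each item from readlines() leaves a blank line between rows: every item still ends with "\n", and print() adds another.

```python
import os
import tempfile

# Build a small sample file in a temporary folder (assumed for this sketch)
folder = tempfile.mkdtemp()
filename = os.path.join(folder, "lineText.txt")
with open(filename, "w") as f:
    f.write("This is line 1\nThis is line 2\nThis is line 3\n")

# read() returns the whole file as one string
with open(filename, "r") as f:
    whole = f.read()

# readlines() returns a list with one string per line, each still ending in "\n"
with open(filename, "r") as f:
    lines = f.readlines()

# Remove the trailing newline before printing to avoid blank lines in the output
for line in lines:
    print(line.rstrip("\n"))
```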
2. Import the URL package, which is used to open the website's URL address
4. Open the connection, read the webpage, and download it to our machine
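The import and request steps might look like the sketch below, assuming the "URL package" is the standard library's urllib.request. The helper name download_page and the sample data: URL are hypothetical; the data: URL is used only so the example runs without a network connection.

```python
from urllib.request import urlopen


def download_page(url):
    """Open a connection to url, read the response, and return it as text."""
    with urlopen(url) as response:   # open the connection
        raw_bytes = response.read()  # download the page content as bytes
    return raw_bytes.decode("utf-8")


# Stand-in address for offline use; in practice pass the page you want to scrape
sample_url = "data:text/html,%3Cp%3EHello%3C%2Fp%3E"
page_html = download_page(sample_url)
print(page_html)
```

The returned string is what you would then hand to BeautifulSoup for parsing.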
Save the data into a file named test.csv (comma-delimited, so it opens in Excel). It will be saved in the
anaconda3/script folder
………..
filename = 'test.csv'
f = open(filename,'w')
f.close()
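One way to fill the file between open() and close() is the standard library's csv module. This is a sketch, not the handout's own code: the column names and the two rows are made-up placeholders, and the file goes into a temporary folder so it does not overwrite an existing test.csv.

```python
import csv
import os
import tempfile

# Temporary folder (an assumption of this sketch) so existing files stay untouched
folder = tempfile.mkdtemp()
filename = os.path.join(folder, "test.csv")

# Hypothetical rows standing in for scraped data
rows = [
    ("Model A", "Example laptop, 15 inch", "Free Shipping"),
    ("Model B", "Example laptop, 13 inch", "$4.99 Shipping"),
]

# newline="" lets the csv module control the line endings itself
with open(filename, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["model", "product_name", "shipping"])  # header row
    writer.writerows(rows)

# Read the file back to check what was saved
with open(filename, newline="") as f:
    saved = list(csv.reader(f))
print(saved)
```

Product names often contain commas. csv.writer quotes such fields, so each name stays in a single column when the file is opened in Excel.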
Example:
To scrape
https://www.newegg.com/Laptops-Notebooks/Category/ID-223?Tid=17489
my_url = "https://www.newegg.com/Laptops-Notebooks/Category/ID-223?Tid=17489"
for x in my_content:
    model = x.div.div.a.img['title']  # different title name for each image
    item_desc = x.findAll('a', {'class': 'item-title'})  # find all the <a href ...... class="item-title"> tags
    print(len(item_desc))  # how many contents are there?
    print(item_desc[0])  # list index always starts at 0
for x in my_content:
    model = x.div.div.a.img['title']
    item_desc = x.findAll('a', {'class': 'item-title'})
    print(len(item_desc))
    print(item_desc[0].text)  # display only text
for x in my_content:
    model = x.div.div.a.img['title']
    item_desc = x.findAll('a', {'class': 'item-title'})
    print('Model : ' + model)
    print('Product Name : ' + item_desc[0].text + '\n')
for x in my_content:
    model = x.div.div.a.img['title']
    item_desc = x.findAll('a', {'class': 'item-title'})
    shipping = x.findAll('li', {'class': 'price-ship'})  # shipping information
    print(shipping[0].text.strip())
for x in my_content:
    model = x.div.div.a.img['title']
    item_desc = x.findAll('a', {'class': 'item-title'})
    shipping = x.findAll('li', {'class': 'price-ship'})
    print('Model : ' + model)
    print('Product Description : ' + item_desc[0].text)
    print('Shipping : ' + shipping[0].text.strip() + '\n')
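The final loop can be tried without a network connection on a small hand-written snippet. Everything in the HTML below is an assumption made for illustration: the item-container class name, the brand names, and the nesting only imitate the structure the loop expects, and the live page may differ. The sketch assumes bs4 is installed and uses find_all, the current spelling of findAll.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the nesting the loop expects
html = """
<div class="item-container">
  <div><div><a href="#"><img title="BrandOne" src="one.png"></a></div></div>
  <a class="item-title" href="#">BrandOne 15.6 inch Laptop</a>
  <ul><li class="price-ship"> Free Shipping </li></ul>
</div>
<div class="item-container">
  <div><div><a href="#"><img title="BrandTwo" src="two.png"></a></div></div>
  <a class="item-title" href="#">BrandTwo 13.3 inch Laptop</a>
  <ul><li class="price-ship"> $4.99 Shipping </li></ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# One element per product; the class name is an assumption of this sketch
my_content = soup.find_all("div", {"class": "item-container"})

records = []
for x in my_content:
    model = x.div.div.a.img["title"]
    item_desc = x.find_all("a", {"class": "item-title"})
    shipping = x.find_all("li", {"class": "price-ship"})
    # Skip products with no item-title link or no price-ship entry
    # instead of raising IndexError on the [0] lookups
    if not item_desc or not shipping:
        continue
    records.append((model, item_desc[0].text, shipping[0].text.strip()))

for model, name, ship in records:
    print("Model : " + model)
    print("Product Description : " + name)
    print("Shipping : " + ship + "\n")
```

Collecting the values in a list first keeps the scraping separate from the output, so the same records could be printed or written to test.csv.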