Unit 11 Application Development Using Python
Program: MCA
Specialization: Core
Semester: 3
Course Name: Application Development using Python
Course Code: 21VMT0C301
Unit Name: Web Scraping
When you want huge amounts of information from a website as quickly as possible, web scraping comes in handy. Web scraping employs intelligent automation techniques to obtain thousands, if not millions, of data sets far faster than manual collection.
Web scraping is a computerised technique for gathering copious volumes of data from websites. The majority of this data is unstructured, in HTML format, and is transformed into structured data in a database or spreadsheet so that it can be used in multiple applications.
Web scraping can be done in a variety of ways, including leveraging specific APIs, online services, or even writing your own code from scratch. Many large websites, including Google, Twitter, Facebook, StackOverflow, and others, expose APIs that let you access their data in a structured form. This is the best option when it is available; however, other websites either lack the technical sophistication or simply don't permit users to access significant volumes of structured data. In that case, it's advisable to employ web scraping to collect data from the website.
Web scraping can be used for competitive analysis, R&D, social media scraping, brand monitoring, lead generation, and more. Web scraping is not illegal in itself, but whether a given scrape is acceptable depends on the website's terms of service and on how the collected data is used.
Mapit.py with Webbrowser module (reference: Automate the boring stuff with python):
The webbrowser module is a practical web browser controller in Python. It offers a high-level interface that enables users to display documents hosted on the Web.
The webbrowser module can also be used as a CLI tool. It accepts the URL as input, and the other parameters are optional: where the browser supports it, the -n and -t options open the URL in a new browser window or tab, respectively.
The given code will open a new tab on the browser with the Google webpage.
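A minimal sketch of such a call (any URL works; Google is used here to match the text):

    import webbrowser

    # Opens the page in a new browser tab where the browser supports it.
    webbrowser.open_new_tab('https://www.google.com')

The command-line equivalent is: python -m webbrowser -t "https://www.google.com"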
Figuring out the URL:
Let’s say you see a place mentioned on a website and you wish to open the address of that place on Google Maps. Here, we will look at a Starbucks located in Colaba, Mumbai. We can do this automatically using Python.
The script will use the command-line arguments rather than the clipboard. If there are no command-line arguments, the program will know to use the contents of the clipboard.
To begin with, you must figure out which URL to use for a given street address. When you use the browser to open http://maps.google.com and search for an address, the URL in the address bar looks something like this:
https://www.google.com/maps/place/Terminal+2,+Navpada,+Vile+Parle+East,+Vile+Parle,+
Mumbai,+Maharashtra+400099/@19.0974424,72.8723077,17z/data=!3m1!4b1!4m5!3m4!1
s0x3be7c842b68282f1:0x200d8c72871da4f1!8m2!3d19.0974373!4d72.8745017
There is a lot of additional text in the URL besides the address. Websites frequently extend URLs in order to track users or to personalise content. However, if you try simply going to:
https://www.google.com/maps/place/Terminal2,%20Vile%20Parle,%20Mumbai
you will find that the correct page still comes up. So a script only needs to append the address to 'https://www.google.com/maps/place/' to open the map.
If you enter this into the command line to launch the programme:
mapit Terminal 2, Navpada, Vile Parle East, Vile Parle, Mumbai, Maharashtra 400099
The variable address will contain the string ‘Terminal 2, Navpada, Vile Parle East, Vile Parle, Mumbai, Maharashtra 400099’.
sys.argv will then contain the values: [‘mapIt.py’, ‘Terminal’, ‘2,’, ‘Navpada,’, ‘Vile’, ‘Parle’, ‘East,’, ‘Vile’, ‘Parle,’, ‘Mumbai,’, ‘Maharashtra’, ‘400099’] (the shell splits the arguments on spaces), so joining everything after the script name with spaces reconstructs the address.
Handling the Clipboard content and launching the browser:
The program will presume the address is on the clipboard if there are no command-line arguments. With pyperclip.paste(), you can retrieve the contents of the clipboard and save them in a variable called address. Lastly, call webbrowser.open() to launch a web browser with the Google Maps URL.
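Putting these pieces together, a sketch of the complete script (the third-party pyperclip package must be installed separately, e.g. with pip):

    #! python3
    # mapIt.py - launches Google Maps in the browser using an address
    # taken from the command line or from the clipboard.

    import webbrowser, sys, pyperclip

    if len(sys.argv) > 1:
        # Get the address from the command-line arguments.
        address = ' '.join(sys.argv[1:])
    else:
        # No arguments given: fall back to the clipboard contents.
        address = pyperclip.paste()

    webbrowser.open('https://www.google.com/maps/place/' + address)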
GET is one of the most popular HTTP methods. The GET method lets you fetch or retrieve data from a given resource. To send a GET request, invoke requests.get().
It returns a requests.Response object.
The status code is the first piece of information you can read from a response; it tells you the status of the request.
An easy example of a status code is 404 Not Found, which means that the resource you are interested in could not be found. Similarly, a 200 OK status indicates that your request was successful. Decisions can be made in code using these status codes.
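A short sketch (the GitHub API endpoint is just a convenient public URL; any reachable URL works):

    import requests

    # Send a GET request and branch on the status code of the response.
    response = requests.get('https://api.github.com')
    print(response.status_code)

    if response.status_code == 200:
        print('Success!')
    elif response.status_code == 404:
        print('Resource not found.')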
HTML:
It is a text file with the .html or .htm extension that contains text plus tags enclosed in angle brackets "<>", which provide the configuration instructions for the web page.
Each HTML document has two sections:
- the rendered content, which the browser displays to the user and which cannot be altered directly;
- the page's source code, which we can edit; this is the part we work with when scraping.
Simply right-click inside the page's text area and select "View source" or "View Frame Source" to view the source code of any HTML document. The page's source code will be displayed in a document opened in the text editor.
Three tags describe and provide basic information about the fundamental structure of an HTML document: <html>, <head>, and <body>. These tags just frame and organise the HTML file; they have no impact on how the content looks.
<!DOCTYPE>:
A doctype, also known as a document type declaration, is a directive that informs the web
browser of the markup language used to create the current page. The Doctype, which is not
an element or tag, informs the browser of the HTML or other markup language version or
standard that is being used in the page.
A DOCTYPE declaration is shown at the top of a web page before any other elements. Every
HTML document is required to have a document type declaration in accordance with the
HTML specification or standards in order to guarantee that the pages are displayed as
intended.
<!DOCTYPE html> is case insensitive in HTML5.
HTML headings:
The heads of a page are specified using an HTML heading tag. HTML defines headers at six
different levels. These six heading elements are designated by the letters h1, h2, h3, h4, h5,
and h6, where h1 denotes the highest level and h6 the lowest.
For the primary heading, use <h1> (the largest in size).
Subheadings are designated using an <h2> element; if there are further sections beneath the subheadings, an <h3> element is used, and so on.
For the smallest heading, use <h6>.
How are headings important?
Headings are used by search engines to index the website's structure and organise its
content. They are used to draw attention to key points. They give us useful information and
describe the document's structure.
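Since this unit is about scraping, here is a quick sketch of pulling every heading level out of a page with BeautifulSoup (the sample HTML is made up):

    from bs4 import BeautifulSoup

    # List every heading, from <h1> down to <h6>, found in a sample page.
    html = '<h1>Main heading</h1><h2>Subheading</h2><h3>Sub-subheading</h3>'
    soup = BeautifulSoup(html, 'html.parser')
    for heading in soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6']):
        print(heading.name, '->', heading.text)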
Some common tags and their purposes:
Tag: <p> (paragraph tag)
Purpose: In HTML, a paragraph is defined by the <p> tag. It has opening and closing tags, so everything between <p> and </p> is treated as a paragraph. Most browsers will still treat a line as a paragraph even if the closing </p> tag is omitted, but this can lead to unexpected results, so closing the tag is both a wise norm and something we must do.

Tag: <center> (centre alignment)
Purpose: In HTML, the <center> tag is used to align content in the middle of a page. HTML5 does not support this tag; instead of the <center> tag, the CSS text-align property is used to determine how the element is aligned. It has no attributes.

Tag: <hr> (horizontal line tag)
Purpose: The HTML tag known as "horizontal rule" (abbreviated "hr") is used to insert horizontal lines between sections of a document. There is no need for a closing tag because the tag is empty, or unpaired.

Tag: <pre> (preserve formatting tag)
Purpose: The <pre> tag in HTML defines a block of preformatted text that retains the tabs, line breaks, spaces, and other formatting that web browsers normally ignore. Although it appears in a fixed-width font, the text inside a <pre> element can be styled using CSS. Both start and end tags are required for the <pre> tag.
Tag: &nbsp; (non-breaking space)
Purpose: An entity, rather than a tag, that inserts a space at which the browser is not allowed to break the line.
As a starting point, we can create a document with a BeautifulSoup object and print a tag from it.
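A minimal sketch of that step (the HTML string is made up):

    from bs4 import BeautifulSoup

    # Build a document from an HTML string and print one of its tags.
    doc = BeautifulSoup('<p>Hello, <b>world</b>!</p>', 'html.parser')
    print(doc.b)       # <b>world</b>
    print(doc.b.text)  # world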
Content can be scraped using a variety of techniques. One of them is BeautifulSoup's select() method. It takes a CSS selector as its argument and pulls content from the elements matching the specified CSS path.
We first import the packages necessary to use the select() method. We then create a sample HTML document containing links and text. We then parse the HTML before extracting contents from the document; the html.parser argument is passed to the BeautifulSoup() constructor. Finally, we extract the contents from the HTML document using BeautifulSoup's select() method. Inside select() there is a CSS selector, such as a class name, that needs to be found.
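A sketch of those steps (the document, class names, and URLs are all made up):

    from bs4 import BeautifulSoup

    # A sample document with links and text.
    html_doc = """
    <p class="story">Once upon a time there were two sisters:
      <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a> and
      <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>.
    </p>
    """
    soup = BeautifulSoup(html_doc, 'html.parser')

    # select() takes a CSS selector; 'a.sister' matches <a> tags by class.
    for link in soup.select('a.sister'):
        print(link['href'])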
However, an error is thrown if an attribute accessed this way is not available. In that case, it is safer to first check whether the attribute is present.
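A small sketch of the safe alternatives (the tag is made up):

    from bs4 import BeautifulSoup

    tag = BeautifulSoup('<a href="/home">Home</a>', 'html.parser').a

    # tag['title'] would raise a KeyError because the attribute is absent;
    # get() and has_attr() let you check safely instead.
    print(tag.get('title'))      # None
    print(tag.has_attr('href'))  # True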
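A sketch of this image-scraping approach (the target URL is a placeholder for whichever site is being scraped):

    import requests
    from bs4 import BeautifulSoup

    # Placeholder target URL; substitute the site whose logo you want.
    url = 'https://www.example.com'
    response = requests.get(url)

    soup = BeautifulSoup(response.text, 'html.parser')
    for img in soup.find_all('img'):
        print(img.get('src'))  # e.g. a '.svg' link for the site's logo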
In the above code, we first import the necessary modules, that is, requests and BeautifulSoup. Then we fetch the URL with requests.get(). Next, we pass the response text into the BeautifulSoup() constructor and search for ‘img’ tags. For a site whose logo is served as an SVG, the output includes a ‘.svg’ link pointing to that logo.
Likewise, we can also use the urlopen module to scrape web images.
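The same idea with the standard library's urlopen (placeholder URL again):

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    # urlopen returns a file-like object that BeautifulSoup can parse.
    page = urlopen('https://www.example.com')
    soup = BeautifulSoup(page, 'html.parser')
    for img in soup.find_all('img'):
        print(img.get('src'))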
IP Blocking:
Similar to physical addresses, IP addresses reveal details about the device and the network
being used to connect.
While you'll typically have the same IP address when connecting devices through your home
network, this address changes if you're using another network outside of your home. It can
also change if you reboot your router or switch Internet providers. IP addresses are not
static, unlike physical addresses.
IPv4 addresses, the most popular type of IP address, employ four sets of up to three integers each, separated by dots; for example, 192.168.0.1.
A machine that serves as a gateway between your computer and the internet is referred to as a proxy server, or simply a "proxy". When you use one, your requests are sent through the proxy, so your IP address is not directly exposed to the website you are scraping.
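A sketch of routing requests through a proxy (the proxy address is a placeholder):

    import requests

    # Both HTTP and HTTPS traffic are routed through the placeholder proxy.
    proxies = {
        'http': 'http://203.0.113.10:8080',
        'https': 'http://203.0.113.10:8080',
    }
    response = requests.get('https://www.example.com', proxies=proxies, timeout=10)
    print(response.status_code)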
Rotating your identity, that is, your IP address, on a regular basis is the best strategy to circumvent IP banning. To avoid having your spider blocked, it is always preferable to use proxy and VPN services, rotate IP addresses, and apply similar precautions. This will reduce the chance of being blocked and banned.
Rotating IP addresses is a simple task if you use Scrapy. You can choose to incorporate the
proxies in your spider using Scrapy.
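A sketch of a Scrapy spider using a per-request proxy; the spider name, target site, and proxy address are illustrative (Scrapy's built-in HttpProxyMiddleware reads the meta['proxy'] key):

    import scrapy

    class ProxySpider(scrapy.Spider):
        name = 'proxy_demo'

        def start_requests(self):
            # Rotating IPs means varying meta['proxy'] between requests.
            yield scrapy.Request(
                'https://quotes.toscrape.com',
                meta={'proxy': 'http://203.0.113.10:8080'},
            )

        def parse(self, response):
            self.logger.info('Fetched %s via proxy', response.url)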
You can also easily integrate one of the many proxy-rotation APIs, such as ScraperAPI, into your web scraping project.