How to parse HTML using BeautifulSoup/Python?

Question

How do I parse the start date and end date values using BeautifulSoup?

<h2 name="PRM-013113-21017-0FSNS" class="pointer">
    <a name="PRM-013113-21017-0FSNS">Chinese New Year Sale<br>
       <span>February 8, 2013 - February 10, 2013</span>
    </a>
</h2>

I want output like date_start = February 8, 2013 and date_end = February 10, 2013. How can I do that? — user683742, Commented Feb 4, 2013 at 8:25

Amyth · Accepted Answer · 2013-02-04 08:34:18Z

Something like this.

import re
from BeautifulSoup import BeautifulSoup

html = '<h2 name="PRM-013113-21017-0FSNS" class="pointer"><a name="PRM-013113-21017-0FSNS">Chinese New Year Sale<br><span>February 8, 2013 - February 10, 2013</span></a></h2>'
date_span = BeautifulSoup(html).findAll('h2', {'class' : 'pointer'})[0].findAll('span')[0]
date = re.findall(r'<span>(.+?)</span>', str(date_span))[0]

(PS: instead of a regex, you can also pass the text=True argument to findAll to get the text directly, as follows.)

from BeautifulSoup import BeautifulSoup

html = '<h2 name="PRM-013113-21017-0FSNS" class="pointer"><a name="PRM-013113-21017-0FSNS">Chinese New Year Sale<br><span>February 8, 2013 - February 10, 2013</span></a></h2>'
date = BeautifulSoup(html).findAll('h2', {'class' : 'pointer'})[0].findAll('span')[0]
date = date.findAll(text=True)[0]

Update:

To have the start and end dates as separate variables, you can simply split the date variable as follows:

from BeautifulSoup import BeautifulSoup

html = '<h2 name="PRM-013113-21017-0FSNS" class="pointer"><a name="PRM-013113-21017-0FSNS">Chinese New Year Sale<br><span>February 8, 2013 - February 10, 2013</span></a></h2>'
date = BeautifulSoup(html).findAll('h2', {'class' : 'pointer'})[0].findAll('span')[0]
date = date.findAll(text=True)[0]
# Get start and end date separately
date_start, date_end = date.split(' - ')

Now the date_start variable contains the starting date and the date_end variable contains the ending date.
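If the two strings need to become real date objects (for comparing or sorting, say), the split step above can be extended roughly as follows. This is only a sketch using the standard library: the helper name split_date_range and its sep and fmt parameters are made up for illustration and are not part of the original answer. It assumes the span text always looks like "Month D, YYYY - Month D, YYYY" and that month names are in English (%B depends on the locale).

```python
from datetime import datetime


def split_date_range(text, sep=" - ", fmt="%B %d, %Y"):
    """Split 'start - end' text and parse each half into a date object.

    Hypothetical helper: the name, separator and format are assumptions
    that match the sample HTML in the question.
    """
    # Split only on the first separator so unpacking always gets two parts
    start_text, end_text = text.split(sep, 1)
    # strptime raises ValueError if either half does not match fmt
    start = datetime.strptime(start_text.strip(), fmt).date()
    end = datetime.strptime(end_text.strip(), fmt).date()
    return start, end


# The text that the BeautifulSoup snippet above extracts from the <span>
date_start, date_end = split_date_range("February 8, 2013 - February 10, 2013")
print(date_start, date_end)  # 2013-02-08 2013-02-10
```

Text in any other layout raises ValueError, so wrap the call in try/except if the pages are not uniform.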

Thanks @Amyth, but I want each date as its own output: date_start = February 8, 2013 and date_end = February 10, 2013 — user683742, Commented Feb 4, 2013 at 8:28
How about simply splitting the date output on ` - `? Check the updated answer. — Amyth, Commented Feb 4, 2013 at 8:34
