beautifulsoup Collecting optional elements and/or their attributes from a series of pages


Example

Consider a situation where you parse a number of pages and want to collect a value from an optional element, one that is present on some pages and absent on others.

Moreover, the element itself is completely ordinary: it has no specific attributes that could locate it uniquely. However, you can reliably select its parent element, and you know the wanted element's position within that nesting level.

from bs4 import BeautifulSoup

soup = BeautifulSoup(SomePage, 'lxml')
html = soup.find('div', class_='base class') # the parent tag; called html_1 or html_2 below, depending on the page

The wanted element is optional, so html can take one of two forms (the # №N marks only number the tags for reference):

html_1 = '''
<div class="base class">    # №0
  <div>Sample text 1</div>  # №1
  <div>Sample text 2</div>  # №2  
  <div>!Needed text!</div>  # №3
</div>

<div>Confusing div text</div>  # №4
'''
        
html_2 = '''
<div class="base class">    # №0
  <div>Sample text 1</div>  # №1
  <div>Sample text 2</div>  # №2  
</div>

<div>Confusing div text</div>  # №4
'''

If you get html_1, you can collect !Needed text! from tag №3 like this:

wanted_tag = html_1.div.find_next_sibling().find_next_sibling() # this gives you the whole tag №3

It first gets div №1, then moves twice to the next tag on the same nesting level to reach №3.

wanted_text = wanted_tag.text # extracting !Needed text!

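The usefulness of this approach shows when you get html_2: the lookup does not raise an error, it returns None. Note that this holds only while every intermediate sibling exists; calling find_next_sibling() or .text on None raises AttributeError: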

print(html_2.div.find_next_sibling().find_next_sibling())
None
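
The lookup above can be wrapped into a small helper that works for both page variants. This is only a sketch under some assumptions: the helper name third_child_text and the two page constants are made up for illustration, the built-in 'html.parser' is used instead of 'lxml' so that no extra parser is required, and the tag name 'div' is passed explicitly to find_next_sibling() so that only <div> siblings are considered.

```python
from bs4 import BeautifulSoup

# Hypothetical page sources modelled on html_1 and html_2 (without the numbering marks).
PAGE_WITH_TARGET = """
<div class="base class">
  <div>Sample text 1</div>
  <div>Sample text 2</div>
  <div>!Needed text!</div>
</div>
<div>Confusing div text</div>
"""

PAGE_WITHOUT_TARGET = """
<div class="base class">
  <div>Sample text 1</div>
  <div>Sample text 2</div>
</div>
<div>Confusing div text</div>
"""


def third_child_text(page):
    """Return the text of the third child <div> of the base div, or None if it is absent."""
    soup = BeautifulSoup(page, "html.parser")
    base = soup.find("div", class_="base class")
    if base is None or base.div is None:
        return None
    tag = base.div  # first child <div>
    for _ in range(2):  # step to the next <div> sibling twice
        tag = tag.find_next_sibling("div")
        if tag is None:  # ran out of siblings: the element is missing
            return None
    return tag.text


print(third_child_text(PAGE_WITH_TARGET))     # !Needed text!
print(third_child_text(PAGE_WITHOUT_TARGET))  # None
```

Checking for None after every step keeps the helper safe even when an earlier sibling is missing too, a case in which the chained one-liner would raise AttributeError.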

Using find_next_sibling() here is crucial because it restricts the search to the same nesting level. If you used find_next() instead, tag №4 would be collected, which is not what you want:

print(html_2.div.find_next().find_next())
<div>Confusing div text</div>
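
The difference between the two calls can be seen side by side in a short, self-contained sketch. It assumes the built-in 'html.parser' and a page shaped like html_2; the variable names are illustrative only, and 'div' is passed explicitly to both methods.

```python
from bs4 import BeautifulSoup

# Hypothetical page shaped like html_2: the wanted third child is absent.
page = """
<div class="base class">
  <div>Sample text 1</div>
  <div>Sample text 2</div>
</div>
<div>Confusing div text</div>
"""

soup = BeautifulSoup(page, "html.parser")
first = soup.find("div", class_="base class").div  # the "Sample text 1" div

# Sibling search stays inside the parent: there is no third child, so the result is None.
second = first.find_next_sibling("div")
by_sibling = second.find_next_sibling("div")

# Document-order search leaves the parent and lands on the unrelated div.
by_document = first.find_next("div").find_next("div")

print(by_sibling)        # None
print(by_document.text)  # Confusing div text
```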

You can also explore find_previous_sibling() and find_previous(), which work in exactly the opposite direction.
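
As a rough illustration of the backward direction, the sketch below walks back from the last child and also shows that find_previous() can leave the nesting level, just like find_next(). The page string and variable names are assumptions made for this example, and 'html.parser' is used so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical page shaped like html_1.
page = """
<div class="base class">
  <div>Sample text 1</div>
  <div>Sample text 2</div>
  <div>!Needed text!</div>
</div>
"""

soup = BeautifulSoup(page, "html.parser")
base = soup.find("div", class_="base class")
children = base.find_all("div", recursive=False)  # direct child divs only
first, last = children[0], children[-1]

# One step back on the same nesting level.
back_one = last.find_previous_sibling("div")
print(back_one.text)  # Sample text 2

# The first child has no previous sibling, so the sibling search returns None,
# while the document-order search climbs out to the parent div.
no_sibling = first.find_previous_sibling("div")
escaped = first.find_previous("div")
print(no_sibling)       # None
print(escaped is base)  # True
```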

All of the described functions have plural variants that collect every matching tag, not just the first one:

find_next_siblings()
find_previous_siblings()
find_all_next()
find_all_previous()
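
A brief sketch of the plural variants follows. It assumes the built-in 'html.parser' and a page shaped like html_1; the names are illustrative. Because find_next_siblings() returns a list-like result, the optional element can also be picked by index after a length check.

```python
from bs4 import BeautifulSoup

# Hypothetical page shaped like html_1.
page = """
<div class="base class">
  <div>Sample text 1</div>
  <div>Sample text 2</div>
  <div>!Needed text!</div>
</div>
<div>Confusing div text</div>
"""

soup = BeautifulSoup(page, "html.parser")
first = soup.find("div", class_="base class").div  # the "Sample text 1" div

# All following <div> tags on the same nesting level.
siblings = first.find_next_siblings("div")
sibling_texts = [tag.text for tag in siblings]
print(sibling_texts)  # ['Sample text 2', '!Needed text!']

# Index 1 is the second following sibling, i.e. the third child; guard against absence.
wanted_text = siblings[1].text if len(siblings) > 1 else None
print(wanted_text)  # !Needed text!

# All following <div> tags in document order, including ones outside the parent.
following_texts = [tag.text for tag in first.find_all_next("div")]
print(following_texts)  # ['Sample text 2', '!Needed text!', 'Confusing div text']
```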