Some challenges a new web scraper may face with Beautiful Soup in examples 👷‍♀️🚸🏗️ – Part 2

This is part two, which is basically the Managing Children block: it turned out too big to fit into the first part.

Managing children

Writing and maintaining a long nested try/except can be tedious, and bs4.NavigableString objects don’t make it easier. For example, for blocks that take the following forms, I want to get the text and information about the tags around it:

<a><mark>One</mark></a>
<a>Two</a>
<mark>Three</mark>
<mark><i>Four</i></mark>
<a><mark><b>Five</b></mark></a>
Six
<p>One<i>Two</i><b>Three<u>Four</u></b>Five</p>
Text before:<i>One</i><i>Two</i>

Testing methods

The test harness I suggest, which keeps the output visually clear, is this:

import bs4

sour = [
    # lines
]

lines = [bs4.BeautifulSoup(x, 'html.parser') for x in sour]

for line in lines:
    print('Default:\t%s\n' % line)

    # example test of a method
    x = method1()
    print('method1():\t%s\n' % '\n\t\t'.join([str(i) for i in x]))

Built-in methods

There are five built-in ways to get at children (two methods and three attributes); let’s test them.

import bs4

sour = [
    # lines
]

lines = [bs4.BeautifulSoup(x, 'html.parser') for x in sour]

for line in lines:
    print('Default:\t%s\n' % line)
    v = line.findChild()
    x = line.findChildren()
    w = list(line.contents)
    y = list(line.children)
    z = list(line.descendants)
    print('findChild():\t%s\n' % v)
    print('findChildren():\t%s\n' % '\n\t\t'.join([str(i) for i in x]))
    print('contents:\t%s\n' % '\n\t\t'.join([str(i) for i in w]))
    print('descendants:\t%s\n' % '\n\t\t'.join([str(i) for i in z]))
    print('children:\t%s\n\n' % '\n\t\t'.join([str(i) for i in y]))

The primary dataset is what we get when we use, say, findAll('p'):

sour = [
    '<p> Child-1 <a>Child-2</a> </p>',
    '<p> <a>Child-1</a> <a>Child-2</a> </p>',
    '<p> Child-1 <a>Child-2 <b>Grandchild-1-1</b></a> </p>',
]

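Running the tests, we see that the bare text Child-1 in the 1st and 3rd lines is not returned as a separate item by findChild(), findChildren() or .contents; only .descendants lists it on its own. For example, for the 3rd line: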

Default:        <p> Child-1 <a>Child-2 <b>Grandchild-1-1</b></a> </p>

findChild():    <p> Child-1 <a>Child-2 <b>Grandchild-1-1</b></a> </p>

findChildren(): <p> Child-1 <a>Child-2 <b>Grandchild-1-1</b></a> </p>
                <a>Child-2 <b>Grandchild-1-1</b></a>
                <b>Grandchild-1-1</b>

contents:       <p> Child-1 <a>Child-2 <b>Grandchild-1-1</b></a> </p>

descendants:    <p> Child-1 <a>Child-2 <b>Grandchild-1-1</b></a> </p>
                 Child-1 
                <a>Child-2 <b>Grandchild-1-1</b></a>
                Child-2 
                <b>Grandchild-1-1</b>
                Grandchild-1-1

What if we remove the <p> tag…

Default:        Child-1 <a>Child-2 <b>Grandchild-1-1</b></a>

findChild():    <a>Child-2 <b>Grandchild-1-1</b></a>

findChildren(): <a>Child-2 <b>Grandchild-1-1</b></a>
                <b>Grandchild-1-1</b>

contents:       Child-1 
                <a>Child-2 <b>Grandchild-1-1</b></a>

descendants:    Child-1 
                <a>Child-2 <b>Grandchild-1-1</b></a>
                Child-2 
                <b>Grandchild-1-1</b>
                Grandchild-1-1

children:       Child-1 
                <a>Child-2 <b>Grandchild-1-1</b></a>
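
The two runs above suggest a pattern: findChild() and findChildren() only ever return tags, while .contents, .children and .descendants also include the text nodes (bs4.NavigableString). The short check below is a sketch of that observation, assuming the same html.parser setup. It uses find_all(True), which, as far as I know, is the current name for what findChildren() does.

```python
import bs4

soup = bs4.BeautifulSoup('Child-1 <a>Child-2 <b>Grandchild-1-1</b></a>', 'html.parser')

# Tag-searching methods skip bare text entirely: only <a> and <b> come back.
tags_only = [t.name for t in soup.find_all(True)]

# .contents keeps the direct children of any kind, text nodes included.
direct_types = [type(n).__name__ for n in soup.contents]

# .descendants walks the whole subtree, text nodes included.
all_nodes = list(soup.descendants)
text_nodes = [str(n) for n in all_nodes if isinstance(n, bs4.NavigableString)]

print(tags_only)
print(direct_types)
print(text_nodes)
```

So if the bare text matters, the tag-searching methods alone will not be enough.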

Interlude: nested tags – the unreusable way

Knowing the possible nesting patterns in advance, I can do it this way:

for line in lines:
    temp = line.find('a')
    if temp:
        temp2 = temp.find('mark')
        if temp2:
            temp3 = temp2.find('b')
            if temp3:
                print('Five' == temp3.get_text())
            else:
                print('One' == temp2.get_text())
        else:
            print('Two' == temp.get_text())
    else:
        temp = line.find('mark')
        if temp:
            temp2 = temp.find('i')
            if temp2:
                print('Four' == temp2.get_text())
            else:
                print('Three' == temp.get_text())
        else:
            print('Six' == line.get_text())

It prints True for every line of the given dataset. You can test it with this piece of code:

import bs4

sour = ['<a><mark>One</mark></a>',
'<a>Two</a>',
'<mark>Three</mark>',
'<mark><i>Four</i></mark>',
'<a><mark><b>Five</b></mark></a>',
'Six',]

lines = [bs4.BeautifulSoup(x, 'html.parser') for x in sour]
# the loops I posted above

It is not reusable and is tied to this particular structure.
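
For comparison, here is a minimal sketch of a less structure-bound variant. It assumes each block has at most one tag per nesting level, as in the six-line dataset above, and simply follows the first direct child tag downwards. The helper name tag_chain is hypothetical and not part of bs4.

```python
import bs4


def tag_chain(soup):
    """Follow the first direct child tag downwards.

    Returns (list of tag names from outermost to innermost, innermost text).
    """
    names = []
    node = soup
    while True:
        # find(True, recursive=False) returns the first direct child Tag, or None
        child = node.find(True, recursive=False)
        if child is None:
            break
        names.append(child.name)
        node = child
    return names, node.get_text()


sour = [
    '<a><mark>One</mark></a>',
    '<a>Two</a>',
    '<mark>Three</mark>',
    '<mark><i>Four</i></mark>',
    '<a><mark><b>Five</b></mark></a>',
    'Six',
]

results = [tag_chain(bs4.BeautifulSoup(x, 'html.parser')) for x in sour]
for names, text in results:
    print(text, names)
```

It still breaks down as soon as a tag has siblings or mixed text, which is what the rest of this post deals with.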

Crafting a method

Obviously, .contents and .descendants are the closest to the desired result. The difference is that .descendants applies .contents recursively until the node is not a Tag anymore, but a NavigableString.
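
To make that relationship concrete, here is a small sketch: a hand-written generator (walk is a hypothetical name) that applies .contents recursively and, for this sample at least, visits exactly the same nodes in the same order as .descendants.

```python
import bs4


def walk(node):
    """Yield every node below `node`, depth first, using only .contents."""
    for child in node.contents:
        yield child
        # Only Tag objects have children of their own; text nodes end the branch.
        if isinstance(child, bs4.Tag):
            yield from walk(child)


soup = bs4.BeautifulSoup('Child-1 <a>Child-2 <b>Grandchild-1-1</b></a>', 'html.parser')

manual = list(walk(soup))
builtin = list(soup.descendants)

# Compare by identity so that equal-looking text nodes are not confused.
same = len(manual) == len(builtin) and all(a is b for a, b in zip(manual, builtin))
print(same)
```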

.descendants

With .descendants it may get hard to determine a family. You can make up a rule like ‘if a parent contains the current node, then the node is its child’. However, in this task I want not only to recognize a family, but also to understand which tag wraps which child.

It can be solved by adding another condition: ‘if a parent contains the current node and the current node is not the previous one’s get_text(), then the node is its child’. Now things get complicated, because the dataset may look like this:

<a>One <b>Two</b></a> Two

What does it turn into…

With this in mind, I spent quite some time on a method and ended up with several versions, none of which satisfied the task enough.
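
The sketch below illustrates why comparing text is fragile for the sample above, and one possible way around it that is not the route taken in this post: asking each text node for its .parents instead. The helper name wrapping_tags is hypothetical.

```python
import bs4


def wrapping_tags(text_node):
    """Names of the tags around a text node, outermost first."""
    return [
        p.name
        for p in reversed(list(text_node.parents))
        # The BeautifulSoup object itself is the document root, not a real tag.
        if not isinstance(p, bs4.BeautifulSoup)
    ]


soup = bs4.BeautifulSoup('<a>One <b>Two</b></a> Two', 'html.parser')
texts = [n for n in soup.descendants if isinstance(n, bs4.NavigableString)]

inner, outer = texts[1], texts[2]

# The two nodes look identical once stripped, so text comparison cannot separate them...
looks_equal = inner.strip() == outer.strip()
# ...but they are different objects with different parents.
inner_tags = wrapping_tags(inner)
outer_tags = wrapping_tags(outer)

print(looks_equal, inner_tags, outer_tags)
```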

.contents

OK, so I have this data:

Default:        Child-1 <a>Child-2 <b>Grandchild-1-1</b></a>

contents:       Child-1 
                <a>Child-2 <b>Grandchild-1-1</b></a>

It looks like good old recursion can help me. I also need to know the generation and the tag names for each and every node.

Method

Sorry in advance for any inconvenience this code and my mumbling excuse for an explanation may cause you. I spent more time testing different approaches than I would ever admit before it finally worked (and a few times I thought it did, but it didn’t). Here is the first version:

import bs4


def parsing(line, level=0):
    type_line = type(line)
    indent = '_'.join(['__' for x in range(0, level)])
    if type_line == bs4.NavigableString:
        line_stripped = str(line).strip()
        if line_stripped:
            print(indent + ' ' + str(line).strip()) # str(level) + 
    elif type_line == bs4.Tag:
       
        line_contents = []
        try: line_contents = line.contents
        except: pass

        if line_contents: parent_level = level + 1
        else: parent_level = level
        for x in line.contents:
            x_contents = []
            try: x_contents = x.contents
            except: pass

            # if type(x) == bs4.NavigableString:
            #     print(indent + str(x).strip() + '\t\t(%s)\t' % str(level) + '\t__ %s' % x)
            #     print(''.join([indent, ' __ '.join(dash_list)]))
            # else:
                # print(x)
            # if type(x) != bs4.NavigableString and :

            first_parent_condition = level == 0
            last_node_condition = len(x_contents) == 1 and x_contents[0] == x.get_text()
            if first_parent_condition or x_contents and not last_node_condition:
                level += 1
            else:
                level = parent_level
            parsing(x, level)
    # else:
    #     print('and I oop')
        # print(type_line)

sour = [
    '<a>Parent-1 <b>Child1-1 <i>GrChild1-1-1</i> Child1-2 <i>GrChild1-2-1</i> Child1-3</b> Parent-2 <b>Child2-1 <i><u>GrChild2-1-1</u></i> <u>GrChild2-1-2</u> </b> Parent-3</a>',
]

lines = [bs4.BeautifulSoup(x, 'html.parser').find('a') for x in sour]
con = list(lines[0].contents)
for x in con:
    parsing(x, 0)  # this version of parsing() takes only (line, level)

The idea is to use .contents and add its elements to the family recursively until there is a bs4.NavigableString.

The output looks like this:

 Parent-1
__ Child1-1
_____ GrChild1-1-1
__ Child1-2
_____ GrChild1-2-1
__ Child1-3
 Parent-2
__ Child2-1
_____ GrChild2-1-1
________ GrChild2-1-1
 Parent-3

Major ideas of this approach are:

  • parent_level = level + 1: this whole block is needed so that successive generations of children start from the level provided by their parent and not from the previous family member. Otherwise, for example, Child1-2 would be at level 2 like GrChild1-1-1.
  • try: line_contents = line.contents: if there are children to process, we deepen the level for them. The same goes for every node of the sub-generation, with x_contents.
  • first_parent_condition: simply ‘if this is a parent’ (we are still at level 0).
  • last_node_condition: respectively, ‘if this member has no children (last in their line)’. This condition eliminates copies of nodes like <i>Child</i>, which tend to be treated as two members of the same generation: <i>Child</i> and Child.
  • But if you are the last one in line or are not a parent, clean your level on the way out, please (the else branch resets level to parent_level).
  • bs4.BeautifulSoup(x, 'html.parser').find('a'): previously I tried parsing the lines without .find('a'), which meant that the lines were bs4.BeautifulSoup objects and not bs4.Tag. I can only vaguely recall the problems it caused, but either way this method is designed for working with bs4.Tag and will not work with a plain bs4.BeautifulSoup object. If you dare remove .find('a') from the list comprehension, the result will look like this:
__ Parent-1
_____ Child1-1
_____ GrChild1-1-1
_____ Child1-2
_____ GrChild1-2-1
_____ Child1-3
_____ Parent-2
________ Child2-1
________ GrChild2-1-1
________ GrChild2-1-2
________ Parent-3
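
A quick way to see the difference between the two kinds of object is the sketch below. It only shows facts about the types; the guess that the extra wrapper is what shifts every line one generation deeper is an assumption on my part.

```python
import bs4

soup = bs4.BeautifulSoup('<a>Parent-1 <b>Child1-1</b></a>', 'html.parser')
tag = soup.find('a')

# The parsed document is its own class, a subclass of Tag.
print(type(soup).__name__, type(tag).__name__)

# An exact type comparison, as used in parsing(), does not match the subclass...
exact_match = type(soup) == bs4.Tag
# ...while isinstance() does.
subclass_match = isinstance(soup, bs4.Tag)
print(exact_match, subclass_match)

# The document object has a placeholder name and holds <a> as its only child,
# so starting from it adds one more layer above the real tags.
print(soup.name, tag.name)
only_child_is_a = len(soup.contents) == 1 and soup.contents[0] is tag
print(only_child_is_a)
```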

If you were paying attention, you will have noticed that GrChild2-1-1 here is technically a great-grandchild, as it is wrapped in two tags, not in one. This is important when it comes to collecting the tags that a member is wrapped in.

So this code recognizes generations within families. In order to get which tags every member of the family is wrapped in, some changes are in order.

But first, let’s return to a simple line:

sour = [
    '<a>Parent-1 <b>Child1-1 <i>GrChild1-1-1</i> Child1-2 </b></a>',
]

And print its .contents (there are only a few of them):

Parent-1 
<b>Child1-1 <i>GrChild1-1-1</i> Child1-2 </b>

Let’s also print the contents of the second item, line by line:

Child1-1 
<i>GrChild1-1-1</i>
 Child1-2 

So this way I can say which tags every member is wrapped in:

import bs4


def parsing(line, level=0, tag_list=None):
    if not tag_list:
        tag_list = []
    type_line = type(line)
    indent = ''.join(['__' for x in range(0, level)])
    try:
        tag_str = '[' + '-'.join(tag_list) + ']'
    except: tag_str = ''

    if type_line == bs4.NavigableString:
        line_stripped = str(line).strip()

        if line_stripped:
            print(indent + ' ' + str(line).strip() + ' ' + tag_str) # str(level) +

    elif type_line == bs4.Tag:

        line_contents = []
        try: line_contents = line.contents
        except: pass

        if line_contents:
            parent_level = level + 1
            # parent_tags = tag_list.copy()
            if type(line_contents[0]) == bs4.NavigableString or len(line_contents) == 1:
                tag_list.append(line.name)

        # if line.name:
        #     if type(line.contents[0]) != bs4.Tag:
        #         tag_list.append(line.name)
        #     else:
        #         tag_list = parent_tags

        for x in line.contents:
            x_contents = []
            try: x_contents = x.contents
            except: pass

            first_parent_condition = level == 0
            last_node_condition = len(x_contents) == 1 and x_contents[0] == x.get_text()

            if first_parent_condition or x_contents and not last_node_condition:
                level += 1
                # tag_list.append(line.name)
            else:
                level = parent_level
            parsing(x, level, tag_list)
    # else:
    #     print('and I oop')
        # print(type_line)

sour = [
    # '<a>Parent-1 <b>Child1-1 <i>GrChild1-1-1</i> Child1-2 <i>GrChild1-2-1</i> Child1-3</b> Parent-2 <b>Child2-1 <i><u>GrChild2-1-1</u></i> <u>GrChild2-1-2</u> </b> Parent-3</a>',
    '<a>Parent-1 <b><s>Child1-1</s> <i><u>GrChild1-1-1</u></i> ChildKek <p>I say what what</p> <u>Child1-2 <s>Cat Kek</s></u> </b> </a>',
]

lines = [bs4.BeautifulSoup(x, 'html.parser').find('a') for x in sour]
con = list(lines[0].contents)
for x in con[1].contents:
# for x in con:
    # print(x)
    parsing(x, 0, [con[1].name])

With the following output:

__ Child1-1 [b-s]
____ GrChild1-1-1 [b-i-u]
 ChildKek [b]
__ I say what what [b-p]
__ Child1-2 [b-u]
____ Cat Kek [b-u-s]
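
If you would rather collect the results than print them, the same idea can be sketched as a generator. This is a simplified alternative, not a drop-in replacement for parsing(): it does not track levels, and it passes the tag path down as an immutable tuple so that sibling branches cannot affect each other. The name wrapped_texts is hypothetical.

```python
import bs4


def wrapped_texts(node, path=()):
    """Yield (text, tag_path) for every non-blank text node under `node`."""
    for child in node.contents:
        if isinstance(child, bs4.Tag):
            # Going one tag deeper: extend the path for this branch only.
            yield from wrapped_texts(child, path + (child.name,))
        elif isinstance(child, bs4.NavigableString):
            text = str(child).strip()
            if text:
                yield text, path


html = ('<a>Parent-1 <b><s>Child1-1</s> <i><u>GrChild1-1-1</u></i> ChildKek '
        '<p>I say what what</p> <u>Child1-2 <s>Cat Kek</s></u> </b> </a>')
root = bs4.BeautifulSoup(html, 'html.parser').find('b')

result = list(wrapped_texts(root, (root.name,)))
for text, path in result:
    print(text, '[' + '-'.join(path) + ']')
```

On this sample the tag paths agree with the bracketed lists in the output above.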
