Some challenges a new web scraper may face with Beautiful Soup, in examples 👷‍♀️🏗️😕 – Part 1

Just a collection of things I believe are worth noticing. It escalated quickly.

I spent a week updating this post, and every other morning was torture – I kept finding fantasy methods so absurd I couldn’t grasp what the heck I had been thinking 12 hours ago.

Set-up

You will need bs4 and requests.

import requests, bs4
link = 'something'
req = requests.get(link).content
soup = bs4.BeautifulSoup(req, 'html.parser').html.body

As a second argument (the parser) to bs4.BeautifulSoup you can use 'html.parser', 'lxml' or 'html5lib'. The differences are:

  • html.parser – Python’s built-in parser; returns the HTML more or less as it is
  • lxml – very fast, but silently skips poorly formed HTML tags, which may create chaos
  • html5lib – parses the page the way a browser would and tries to fix poorly formed HTML tags, which may also create chaos (and it is the slowest of the three)
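To see what that “chaos” can look like, here is a small sketch feeding the same broken fragment to each parser. The exact output depends on which parser versions you have installed; lxml and html5lib are optional extras, so the sketch skips them if they are missing:

```python
import bs4

broken = '<a></p>'  # an unmatched closing tag
for parser in ('html.parser', 'lxml', 'html5lib'):
    try:
        # each parser repairs broken markup in its own way
        print('%-12s %s' % (parser, bs4.BeautifulSoup(broken, parser)))
    except bs4.FeatureNotFound:
        print('%-12s not installed' % parser)
```

html.parser tends to just drop the dangling tag, while lxml and html5lib wrap the result in full <html><body> scaffolding.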

First off, bs4’s documentation is not that long. But just reading it through might not be too useful, so I will incorporate highlights from the docs into this post. For the full method list, however, you will have to go elsewhere.

By the way, if the HTML is unclear, you can prettify it with soup.prettify(). And if you don’t want UTF-8 (for Asian languages or some non-standard web scraping operations), you can pass an encoding to it:

soup.prettify("latin-1")

But I highly suggest not to.
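For reference, prettify() returns a re-indented string, and passing an encoding makes it return bytes in that encoding instead – a quick sketch:

```python
import bs4

soup = bs4.BeautifulSoup('<div><b>hi</b></div>', 'html.parser')
print(soup.prettify())           # str: one tag per line, indented
print(soup.prettify('latin-1'))  # bytes in the requested encoding
```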

Simple data scraping

I want to find something on a page. Say, I am on the SpaceX page on Wikipedia and I want to get the bold title at the beginning – Space Exploration Technologies Corp.

Using Google Chrome Developer Mode

In order to see the tags around the item, I can use Google Chrome’s Developer Mode (F12). By right-clicking on Space Exploration Technologies Corp. and choosing Inspect, I jump to this particular item in the HTML of the page. Most importantly, at the bottom of the inspect window there is a line of the tags this item is wrapped in (left to right):

Like in CSS, # means id and a dot . means class.
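As an aside, bs4 understands these CSS selectors directly through select(). A minimal sketch on a made-up fragment (the id and class names here are invented for illustration):

```python
import bs4

# a made-up fragment just to demonstrate the selector syntax
html = '<div id="content"><p class="intro">Hello</p></div>'
soup = bs4.BeautifulSoup(html, 'html.parser')
print(soup.select('#content'))  # list of tags with id="content"
print(soup.select('.intro'))    # list of tags with class="intro"
```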

Going down the tree and finding things

So, using the code from the set-up, I find objects by their id‘s (#) up until <p>:

import requests, bs4

link = 'https://en.wikipedia.org/wiki/SpaceX'
req = requests.get(link).content
soup = bs4.BeautifulSoup(req, 'html.parser')
res_general = soup.html.body.\
    find(id='content').\
    find(id='bodyContent').\
    find(id='mw-content-text').\
    find('div', {'class': 'mw-parser-output'})

As you see, I used soup.html.body instead of soup.find('html').find('body'), though it is basically the same thing (both return None if the tag is not found). It is a matter of aesthetics and logic for me, and I like it better this way.

You can search by id with find(id='id_name'), as id‘s should be unique. However, I would rather prepare for not-so-well-formed HTML – considering that I already use html.parser as the parser, after all…

Let’s click on each element of the hierarchy and find out what tag it is attached to. For example, here it is <div id='content'>. So instead of find(id='content') I will use find('div', {'id': 'content'}):

import requests, bs4

link = 'https://en.wikipedia.org/wiki/SpaceX'
req = requests.get(link).content
soup = bs4.BeautifulSoup(req, 'html.parser')
res_general = soup.html.body.\
    find('div', {'id':'content'}).\
    find('div', {'id':'bodyContent'}).\
    find('div', {'id':'mw-content-text'}).\
    find('div', {'class': 'mw-parser-output'})

Moving further, <p> has neither an id nor a class. As there can be quite a few of them, I go and check. Yep, here they are.

The one I want is second in line, therefore I use findAll('p') and grab the second one (index 1). The same goes for <b>, it is clearly the first one (findAll is the legacy spelling of find_all; both work):

res_p = res_general.findAll('p')[1].findAll('b')[0]

In order to get just text I will use get_text():

res = res_p.get_text()
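To illustrate on a throwaway fragment (the HTML below is made up): get_text() collects all the text inside a tag, and its strip and separator arguments control the whitespace:

```python
import bs4

soup = bs4.BeautifulSoup('<p> Space <b>Exploration</b> </p>', 'html.parser')
print(soup.find('b').get_text())         # Exploration
print(soup.p.get_text(strip=True))       # SpaceExploration
print(soup.p.get_text(' ', strip=True))  # Space Exploration
```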

All in all, the code looks like this:

import requests, bs4

link = 'https://en.wikipedia.org/wiki/SpaceX'
req = requests.get(link).content
soup = bs4.BeautifulSoup(req, 'html.parser')
res_general = soup.html.body.find('div', {'id':'content'}).\
    find('div', {'id':'bodyContent'}).find('div', {'id':'mw-content-text'}).\
    find('div', {'class': 'mw-parser-output'})
res_p = res_general.findAll('p')[1].findAll('b')[0].get_text()
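The whole chain can be rehearsed offline on a stripped-down imitation of the page. The HTML below is invented; Wikipedia’s real markup is far bigger and may change over time:

```python
import bs4

# invented stand-in for the Wikipedia page, same id/class hierarchy
html = '''<html><body>
<div id="content"><div id="bodyContent"><div id="mw-content-text">
<div class="mw-parser-output">
<p>coordinates and notices live here</p>
<p><b>Space Exploration Technologies Corp.</b> is an aerospace company.</p>
</div></div></div></div>
</body></html>'''
soup = bs4.BeautifulSoup(html, 'html.parser')
res_general = soup.html.body.\
    find('div', {'id': 'content'}).\
    find('div', {'id': 'bodyContent'}).\
    find('div', {'id': 'mw-content-text'}).\
    find('div', {'class': 'mw-parser-output'})
print(res_general.findAll('p')[1].findAll('b')[0].get_text())
# Space Exploration Technologies Corp.
```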

Some calls can be confusing

Like this:

'NavigableString' object has no attribute 'get_text'

It simply means that whatever you got back does not support the method you called – here, a plain text node (NavigableString) instead of a Tag. A related trap: if your soup.find('a', {'class': 'katya'}) did not find anything, it returned None (not -1), and calling get_text() on that raises 'NoneType' object has no attribute 'get_text'.
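Here is a minimal sketch of that None trap (the class name katya is just an example):

```python
import bs4

soup = bs4.BeautifulSoup('<b>hi</b>', 'html.parser')
tag = soup.find('a', {'class': 'katya'})
print(tag)  # None – a miss returns None, it does not raise by itself
# tag.get_text()  # this line would raise the AttributeError
```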

The visually clear test harness I offer you to use is this (fill sour with the HTML fragments you are experimenting on):

import bs4

sour = [
    '<p>first <b>bold</b> and <b>another</b></p>',
    '<p>second, no bold</p>',
]

lines = [bs4.BeautifulSoup(x, 'html.parser') for x in sour]

for line in lines:
    print('Default:\t%s\n' % line)

    # test one method at a time, e.g. find_all('b')
    x = line.find_all('b')
    print("find_all('b'):\t%s\n" % '\n\t\t'.join([str(i) for i in x]))

Don’t forget to make a timer

You don’t want your scraping adventures to be over soon because you got flagged as a DDoS attack lol. For example, you have a list of words you want to search. Implement a timer like this:

import time, random

words = ['devil', 'like', 'you']
for word in words:
    # place your malicious functions here
    time.sleep(random.uniform(0.2, 0.5))
