Just a collection of things I believe are worth noticing. It escalated quickly.
I spent a week updating this post, and every other morning was torture: I kept finding fantasy methods so absurd I couldn't grasp what the heck I was thinking 12 hours ago.
Set-up
You will need bs4 and requests.
import requests, bs4
link = 'something'
req = requests.get(link).content
soup = bs4.BeautifulSoup(req, 'html.parser').html.body
As a second argument (the parser) to bs4.BeautifulSoup you can use 'html.parser', 'lxml' or 'html5lib'. The differences are:
html.parser – returns the HTML as it is
lxml – ignores poorly formed HTML tags, may create chaos
html5lib – tries to fix poorly formed HTML tags, may create chaos
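If you want to see what that means in practice, a quick sketch is to feed the same broken fragment to all three parsers and compare what comes out. Note that lxml and html5lib are separate packages you would have to install; only html.parser ships with Python.
import bs4
broken = '<a></p>'  # an <a> that is "closed" by a stray </p>
for parser in ('html.parser', 'lxml', 'html5lib'):
    print(parser, '->', bs4.BeautifulSoup(broken, parser))
# each parser repairs (or ignores) the bad markup in its own way,
# so the three printed trees will not match each other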
First off, bs4’s documentation is not that long. But just reading it through might not be too useful, so I will incorporate highlights from the docs into this post. However, for the full method list you will have to go elsewhere.
By the way, if the HTML is unclear, you can prettify it with soup.prettify(). Or, if you don't want UTF-8 (for Asian languages or some non-standard web scraping operations), you can pass an encoding to it:
soup.prettify("latin-1")
But I highly suggest not to.
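Here is a tiny sketch of what prettify() does, on a made-up fragment rather than a real page:
import bs4
messy = '<div><p>SpaceX<b>rockets</b></p></div>'
print(bs4.BeautifulSoup(messy, 'html.parser').prettify())
# prints the same tree, but with every tag and string on its own
# indented line, which makes eyeballing the structure much easier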
Simple data scraping
I want to find something on a page. Say I am on the SpaceX page on Wikipedia and I want to get the bold title at the beginning: Space Exploration Technologies Corp.

Using Google Chrome Developer Mode
In order to see the tags around the item, I can use Google Chrome's Developer Mode (F12). By right-clicking on Space Exploration Technologies Corp. and choosing Inspect, I get to see where this particular item sits in the HTML page. Most importantly, at the bottom of the Inspect window there is a line of tags that this item is wrapped in (left to right):

Like in CSS, # means an id and a dot . means a class.
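Since this is literally CSS syntax, it is worth knowing that bs4 can also take such selectors directly via select() and select_one(). A minimal sketch; the selector mirrors the Wikipedia hierarchy we walk through below, so it may need adjusting if the page layout changes:
import requests, bs4
link = 'https://en.wikipedia.org/wiki/SpaceX'
soup = bs4.BeautifulSoup(requests.get(link).content, 'html.parser')
# '#' selects by id, '.' selects by class, exactly like in CSS
res = soup.select_one('#mw-content-text .mw-parser-output')
print(res is not None)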
Going down the tree and finding things
So, using the code from the set-up, I find objects by their id's (#) all the way down to <p>:
import requests, bs4
link = 'https://en.wikipedia.org/wiki/SpaceX'
req = requests.get(link).content
soup = bs4.BeautifulSoup(req, 'html.parser')
res_general = soup.html.body.\
    find(id='content').\
    find(id='bodyContent').\
    find(id='mw-content-text').\
    find('div', {'class': 'mw-parser-output'})
As you see, I used soup.html.body instead of soup.find('html').find('body'), though it is basically the same thing (they both return None if the tag is not found). It is a matter of aesthetics and logic for me and I like it better this way.
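A small sketch of that equivalence, on a made-up fragment; note that neither spelling is safer when a tag is missing, since chaining off a None result raises an AttributeError either way:
import bs4
soup = bs4.BeautifulSoup('<html><body><p>hi</p></body></html>', 'html.parser')
print(soup.html.body.p)                           # attribute-style access
print(soup.find('html').find('body').find('p'))   # the same via find()
# when a tag is absent, both spellings give None at that step:
print(soup.table)          # None
print(soup.find('table'))  # None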
You can search by id's with find(id='id_name'), as id's should be unique. However, I would rather prepare for a not-so-well-formed HTML page, considering that I already use html.parser as a parser after all…
Let’s click on each element of the hierarchy and find out what tag it is attached to. For example, here it is <div id='content'>. So instead of find(id='content') I will use find('div', {'id': 'content'}):
import requests, bs4
link = 'https://en.wikipedia.org/wiki/SpaceX'
req = requests.get(link).content
soup = bs4.BeautifulSoup(req, 'html.parser')
res_general = soup.html.body.\
    find('div', {'id': 'content'}).\
    find('div', {'id': 'bodyContent'}).\
    find('div', {'id': 'mw-content-text'}).\
    find('div', {'class': 'mw-parser-output'})
Moving further, <p> does not have an id or a class. As there can be quite a few of them, I go and check. Yep, here they are. The one I want is second in line, therefore I use findAll('p') and grab the second one. The same goes for <b>; it is clearly the first one:
res_p = res_general.findAll('p')[1].findAll('b')[0]
In order to get just the text, I will use get_text():
res = res_p.get_text()
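If you end up with messier markup, get_text() also takes optional arguments; a small sketch on a made-up fragment:
import bs4
frag = bs4.BeautifulSoup('<p>  Space <b>Exploration</b>  Technologies </p>', 'html.parser')
print(frag.get_text())                 # keeps the raw whitespace
print(frag.get_text(' ', strip=True))  # 'Space Exploration Technologies'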
All in all, the code looks like this:
import requests, bs4
link = 'https://en.wikipedia.org/wiki/SpaceX'
req = requests.get(link).content
soup = bs4.BeautifulSoup(req, 'html.parser')
res_general = soup.html.body.find('div', {'id': 'content'}).\
    find('div', {'id': 'bodyContent'}).find('div', {'id': 'mw-content-text'}).\
    find('div', {'class': 'mw-parser-output'})
res_p = res_general.findAll('p')[1].findAll('b')[0].get_text()
Some calls can be confusing
Like this:
'NavigableString' object has no attribute 'get_text'
It simply means that whatever you found cannot be called with get_text(): here it is a NavigableString, a bare piece of text rather than a Tag. You get a very similar error if your soup.find('a', {'class': 'katya'}) does not find anything, because it then returns None and the message mentions 'NoneType' instead.
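One way to guard against both cases is to check what find() actually returned before calling Tag-only methods; a small sketch (the class name is made up):
import bs4
soup = bs4.BeautifulSoup('<p>plain text</p>', 'html.parser')
result = soup.find('a', {'class': 'katya'})  # nothing matches, so this is None
if isinstance(result, bs4.Tag):
    print(result.get_text())
else:
    print('Got %r instead of a Tag' % result)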
The visually clear template I suggest you use for such tests is this:
import bs4
sour = [
    # lines of HTML you want to experiment on
]
lines = [bs4.BeautifulSoup(x, 'html.parser') for x in sour]
for line in lines:
    print('Default:\t%s\n' % line)
    # method test example: swap method1() for whatever you are trying out
    x = line.method1()
    print('method1():\t%s\n' % '\n\t\t'.join([str(i) for i in x]))
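And here is the same template with the placeholders filled in; the sample lines and the chosen method, find_all('b'), are only illustrations:
import bs4
sour = [
    '<p><b>SpaceX</b> builds <b>rockets</b></p>',
    '<p>no bold here</p>',
]
lines = [bs4.BeautifulSoup(x, 'html.parser') for x in sour]
for line in lines:
    print('Default:\t%s\n' % line)
    x = line.find_all('b')
    print("find_all('b'):\t%s\n" % '\n\t\t'.join([str(i) for i in x]))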
Don’t forget to make a timer
You don’t want your scraping adventures to end early because you got flagged as a DDoS attack lol. For example, say you have a list of words you want to search for. Implement a timer like this:
import time, random
words = ['devil', 'like', 'you']
for word in words:
    # place your malicious functions here
    time.sleep(random.uniform(0.2, 0.5))