More about Web scraping etiquette

Some things to consider. Not the ultimate truth by any means.

A part of my Web-Scraping series!

1. If you are going through a list, randomize it every time you debug. It is one thing for a server to see the same pages being requested all the time and another entirely to see different pages each time.

If you are working with your own files in a directory and you need to randomize them with glob and random.shuffle():

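import os
import random
from glob import glob

path = '.'  # placeholder: the directory with your files
files = glob(os.path.join(path, '*.json'))
random.shuffle(files)  # shuffles the list in place

for file in files:
    pass  # do your magic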

Pay attention to *.json – in this piece I iterate and randomize .json files.

If you are working with a list in a file, use random.shuffle() on the lines:

import random

with open('file.txt', 'r') as f:
    lines = f.read().splitlines()  # one list item per line, no trailing '\n'
    random.shuffle(lines)

2. Log everything you need to know about. Do not overcomplicate it. I prefer to have different files with just the minimum to identify what went wrong. Say, I want to know which products don’t have reviews, so I save the names to the file no_reviews.txt:

# 'a' appends to the file; the with block closes it for you
with open('no_reviews.txt', 'a', encoding='utf-8') as f:
    f.write(product + '\n')

That’s it. I will be able to go through them again later or just look through without working through a massive log.

3. Make a list of things you already processed so you don’t go over them again. This way you won’t have to redo everything all the time. A .txt file is the easy way to do it, but if you need .json for some reason, jeez, go see my post about updating .json.
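
Here is a rough sketch of the .txt approach, assuming one item per line. The file name processed.txt and the helpers load_processed(), mark_processed() and process_all() are made up for illustration, so rename them as you like:

```python
import os

def load_processed(path='processed.txt'):
    """Return the set of items already handled (empty if the file doesn't exist yet)."""
    if not os.path.exists(path):
        return set()
    with open(path, 'r', encoding='utf-8') as f:
        return set(f.read().splitlines())

def mark_processed(item, path='processed.txt'):
    """Append one finished item to the file."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(item + '\n')

def process_all(items, path='processed.txt'):
    """Handle only the items that are not in the file yet; return what was handled this run."""
    done = load_processed(path)
    handled = []
    for item in items:
        if item in done:
            continue  # already processed in an earlier run
        # do your magic here
        mark_processed(item, path)
        done.add(item)
        handled.append(item)
    return handled
```

Each item is written to the file right after it is handled, so if the script crashes halfway, the next run picks up where the last one stopped.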

4. Test reading what you write as much as writing what you read. If you are saving the data to JSON, json.load() will validate it for you every time you read the file back. With any other format, make sure you can actually read what you saved before you waste your time parsing things you can’t process.
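
One way to do this for JSON is to read the file back right after writing it and compare. This is only a sketch: save_and_check() is a hypothetical helper name, and the comparison assumes your data is made of plain dicts with string keys, lists, strings and numbers:

```python
import json

def save_and_check(data, path):
    """Write data as JSON, read it straight back, and make sure nothing was lost."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False)
    with open(path, 'r', encoding='utf-8') as f:
        loaded = json.load(f)  # raises json.JSONDecodeError if the file is broken
    if loaded != data:
        raise ValueError('Data read from ' + path + ' does not match what was written')
    return loaded
```

The comparison also catches quiet conversions: a tuple comes back as a list, for example, so the check raises ValueError and you find out now instead of after a full scraping run.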

5. Use a proxy if needed, idk.
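
If you do need one, here is a small sketch with the standard library only. The function name build_proxy_opener() and the address 127.0.0.1:8080 are placeholders I made up, so swap in a proxy you are actually allowed to use:

```python
import urllib.request

def build_proxy_opener(proxy_url):
    """Return an opener that sends http and https requests through proxy_url."""
    handler = urllib.request.ProxyHandler({'http': proxy_url, 'https': proxy_url})
    return urllib.request.build_opener(handler)

# Placeholder address, not a real proxy
opener = build_proxy_opener('http://127.0.0.1:8080')

# Then fetch pages through it (commented out so nothing is requested here):
# html = opener.open('https://example.com', timeout=10).read()
```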

6. Make a short time.sleep() timer and don’t do things in one go. Let servers rest, man.
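
A possible shape for this, as a sketch: pause after every request, with a slightly random delay. The random part is my own addition on top of plain time.sleep(), and polite_pause(), crawl() and the fetch argument are hypothetical names; fetch stands for whatever function downloads one page in your scraper:

```python
import random
import time

def polite_pause(min_seconds=1.0, max_seconds=3.0):
    """Sleep for a random time between min_seconds and max_seconds; return the delay used."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay

def crawl(urls, fetch, min_seconds=1.0, max_seconds=3.0):
    """Call fetch(url) for each URL, pausing after every call."""
    results = []
    for url in urls:
        results.append(fetch(url))
        polite_pause(min_seconds, max_seconds)
    return results
```

To avoid doing everything in one go, call crawl() on a slice of your list, take a longer break, then continue with the next slice.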
