Some things to consider. Not the ultimate truth by any means.
A part of my Web-Scraping series!
1. If you are going through a list, randomize it every time you debug. It is one thing to see the same pages being requested on every run and another entirely to see different pages each time.
If you are working with your own files in a directory, you can randomize them with glob and random.shuffle():
import os
import random
from glob import glob

files = glob(os.path.join(path, '*.json'))  # path is the directory with your files
random.shuffle(files)
for file in files:
    # do your magic
    ...
Pay attention to *.json – in this example I iterate over and randomize .json files.
If you are working with a list in a file, use random.shuffle() on the lines:
import random

with open('file.txt', 'r') as f:
    lines = f.read().splitlines()
random.shuffle(lines)
2. Log everything you need to know about. Do not overcomplicate it. I prefer to keep separate files with just the minimum needed to identify what went wrong. Say, I want to know which products don’t have reviews, so I save the names to the file no_reviews.txt:
with open('no_reviews.txt', 'a+', encoding='utf-8') as f:
    f.write(product + '\n')
That’s it. I can go through them again later or just skim the file without digging through a massive log.
3. Keep a list of things you have already processed so you don’t go over them again. This way you won’t have to redo everything every time the script restarts. A .txt file is the easy option, but if you need .json for some reason, jeez, go see my post about updating .json. A minimal sketch of the .txt approach is below.
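Something like this works, assuming a processed.txt file and an items list you are looping over (both names are placeholders, not from any particular library):

try:
    # the file may not exist on the first run
    with open('processed.txt', 'r', encoding='utf-8') as f:
        processed = set(f.read().splitlines())
except FileNotFoundError:
    processed = set()

for item in items:  # items is whatever you are iterating over
    if item in processed:
        continue
    # ... scrape the item here ...
    with open('processed.txt', 'a', encoding='utf-8') as f:
        f.write(item + '\n')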
4. Test reading what you write as much as you test writing what you read. If you are saving the data to JSON, json.load(open('yourfile', 'r+')) will validate it for you on every load; otherwise you need to make sure, before you waste time parsing things you can’t process, that you can read back what you parsed. A rough sketch of that check is below.
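One way to do the read-back check right after writing, as a sketch (products.json and data are placeholder names):

import json

# write whatever you parsed
with open('products.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False)

# read it straight back; json.load() raises an error if the file is not valid JSON
with open('products.json', 'r', encoding='utf-8') as f:
    data_check = json.load(f)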
5. Use a proxy if you need one, for example when the site rate-limits or blocks your IP.
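If you use the requests library, passing a proxies dict is one way to do it; the proxy address and url below are just placeholders:

import requests

proxies = {
    'http': 'http://user:pass@proxyhost:port',   # placeholder address
    'https': 'http://user:pass@proxyhost:port',  # placeholder address
}
response = requests.get(url, proxies=proxies, timeout=30)  # url is whatever page you are fetching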
6. Add a short time.sleep() between requests and don’t do everything in one go. Let servers rest, man.
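A minimal sketch, assuming a urls list you are looping over (a randomized delay is a bit gentler on the server than a fixed one):

import random
import time

for url in urls:  # urls is a placeholder for your list of pages
    # ... request and parse the page here ...
    time.sleep(random.uniform(2, 5))  # pause a few seconds between requests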