What not to do

Yesterday, we were assigned to scrape movie data from a website. I was impatient and wanted to jump right in without thoroughly going over the actual website.

When given the simple task of scraping information about a few movies from a website, do not spend the entire day creating a program that spiders and maps every link for the entire website from scratch.

I was collecting tens of thousands of non-duplicate links and eventually got throttled by the website. My new method isn't very applicable to other websites, but it still creates a dictionary of every movie on the site, with all the data related to each movie, and it took about an hour instead of 24…
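For illustration only, here is a minimal sketch of the "dictionary of every movie" idea. It assumes the site exposes a single listing page with one table row per movie. The sample markup, the column names, and the names `MovieTableParser` and `movies_from_table` are all hypothetical, not the actual site or my actual code.

```python
from html.parser import HTMLParser

# Hypothetical listing-page markup; the real site's structure is not shown here.
SAMPLE_HTML = """
<table>
  <tr><th>Title</th><th>Year</th><th>Gross</th></tr>
  <tr><td>Movie A</td><td>2014</td><td>$100</td></tr>
  <tr><td>Movie B</td><td>2015</td><td>$250</td></tr>
</table>
"""


class MovieTableParser(HTMLParser):
    """Collects the text of every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def movies_from_table(html):
    """Return {title: {column: value}} using the first row as the header."""
    parser = MovieTableParser()
    parser.feed(html)
    if not parser.rows:
        return {}
    header, *body = parser.rows
    return {row[0]: dict(zip(header[1:], row[1:])) for row in body if row}


movies = movies_from_table(SAMPLE_HTML)
print(movies)
```

The point is that one targeted page parse, keyed by title, replaces a crawl of the whole site. It only works when the site happens to be laid out this way, which is why the approach doesn't transfer well to other websites.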
