Homework 2: Web Scraping

Homework
Published

January 22, 2024

What’s your favorite movie? Wouldn’t it be nice to find more shows that you might like to watch, based on ones you know you like? Tools that address questions like this are often called “recommender systems.” Powerful, scalable recommender systems are behind many modern entertainment and streaming services, such as Netflix and Spotify. While most recommender systems these days involve machine learning, there are also ways to make recommendations that don’t require such complex tools.

In this Blog Post, you’ll use web scraping to answer the following question:

Which movies or TV shows share actors with your favorite movie or show?

The idea of this question is that, if TV show Y has many of the same actors as TV show X, and you like X, you might also enjoy Y.

This post has two parts. In the first, larger part, you’ll write a web scraper for finding shared actors on TMDB. In the second, smaller part, you’ll use the results from your scraper to make recommendations.

Don’t forget to check the Specifications for a complete list of what you need to do to obtain full credit. As usual, this Blog Post should be printed as PDF from your PIC16B Blog preview screen, and you need to submit any code you wrote as well.

Instructions

1. Setup

1.1. Locate the Starting TMDB Page

Pick your favorite movie, and locate its TMDB page by searching on https://www.themoviedb.org/. For example, my favorite movie is Harry Potter and the Philosopher’s Stone. Its TMDB page is at:

https://www.themoviedb.org/movie/671-harry-potter-and-the-philosopher-s-stone/

Save this URL for a moment.

1.2. Dry-Run Navigation

Now, we’re just going to practice clicking through the navigation steps that our scraper will take.

First, click on the Full Cast & Crew link. This will take you to a page with a URL of the form

<original_url>cast/

Next, scroll until you see the Cast section. Click on the portrait of one of the actors. This will take you to a page with a different-looking URL. For example, the URL for Alan Rickman, who played Severus Snape, is

https://www.themoviedb.org/person/4566-alan-rickman

Finally, scroll down until you see the actor’s Acting section. Note the titles of a few movies and TV shows in this section.

Our scraper is going to replicate this process. Starting with your favorite movie, it’s going to look at all the actors in that movie, and then log all the other movies or TV shows that they worked on.

At this point, it would be a good idea for you to use the Developer Tools on your browser to inspect individual HTML elements and look for patterns among the names you are looking for.

1.3. Initialize Your Project

  1. Open a terminal and type:
conda activate PIC16B-24W
scrapy startproject TMDB_scraper
cd TMDB_scraper

This will create quite a lot of files, but you don’t really need to touch most of them. However, you must submit the entire TMDB_scraper folder for the autograder part.

1.4. Tweak Settings

For now, add the following line to the file settings.py:

CLOSESPIDER_PAGECOUNT = 20

This line just prevents your scraper from downloading too much data while you’re still testing things out. You’ll remove this line later.

Hint: Later on, you may run into 403 Forbidden errors once the website detects that you’re a bot. See these links (link1, link2, link3, link4) for ways to work around that issue. The easiest solution is to change one line in settings.py. You might see this when you run scrapy shell as well, so keep an eye out for 403! Remember, you want your status to be 200 OK. If the site can tell that you’re scraping with Python, it will likely try to block you. One way to change the user agent in the scrapy shell is:

scrapy shell -s USER_AGENT='Scrapy/2.8.0 (+https://scrapy.org)' https://www.themoviedb.org/...

(The user agent shown here is Scrapy’s default, which TMDB may block; substitute a browser-like one.)
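A common permanent fix is to set USER_AGENT in settings.py rather than on the command line. The string below is only an example of a browser-like user agent, not a value the course mandates; any common browser UA string should work:

```python
# In settings.py -- identify as a regular browser instead of Scrapy's default.
# This particular string is just an example; any common browser UA works.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)
```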

2. Write Your Scraper

Create a file inside the spiders directory called tmdb_spider.py. Add the following lines to the file:

# to run 
# scrapy crawl tmdb_spider -o movies.csv -a subdir=671-harry-potter-and-the-philosopher-s-stone

import scrapy

class TmdbSpider(scrapy.Spider):
    name = 'tmdb_spider'
    def __init__(self, subdir=None, *args, **kwargs):
        super().__init__(*args, **kwargs)  # don't forget to initialize the parent Spider
        self.start_urls = [f"https://www.themoviedb.org/movie/{subdir}/"]

Then, you will be able to run your completed spider for a movie of your choice by giving its subdirectory on TMDB website as an extra command-line argument.

Now, implement three parsing methods for the TmdbSpider class.

  • parse(self, response) should assume that you start on a movie page, and then navigate to the Full Cast & Crew page. Remember that this page has URL <movie_url>cast. (You are allowed to hardcode that part.) Once there, the method parse_full_credits(self, response) should be called, by specifying this method in the callback argument to a yielded scrapy.Request. The parse() method does not return any data. This method should be no more than 5 lines of code, excluding comments and docstrings.
  • parse_full_credits(self, response) should assume that you start on the Full Cast & Crew page. Its purpose is to yield a scrapy.Request for the page of each actor listed on the page. Crew members are not included. The yielded request should specify, via its callback argument, that the method parse_actor_page(self, response) is called when the actor’s page is reached. The parse_full_credits() method does not return any data. This method should be no more than 5 lines of code, excluding comments and docstrings.
  • parse_actor_page(self, response) should assume that you start on the page of an actor. It should yield a dictionary with two key-value pairs, of the form {"actor" : actor_name, "movie_or_TV_name" : movie_or_TV_name}. The method should yield one such dictionary for each of the movies or TV shows on which that actor has worked in an “Acting” role.¹ Note that you will need to determine both the name of the actor and the name of each movie or TV show. This method should be no more than 15 lines of code, excluding comments and docstrings.

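In your real spider you will use scrapy’s response.css(...) selectors, but the core pattern — “find every actor link on a page” — can be prototyped offline with the standard library. The snippet below runs on an invented cast-page fragment; the class names and structure here are made up, so inspect the real page with your browser’s Developer Tools to find the actual selectors:

```python
from html.parser import HTMLParser

class ActorLinkParser(HTMLParser):
    """Collect href values of <a> tags that point at /person/ pages."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        href = dict(attrs).get("href", "")
        if tag == "a" and href.startswith("/person/"):
            self.links.append(href)

# Invented fragment standing in for the real Full Cast & Crew page.
TOY_HTML = """
<ol class="people credits">
  <li class="card"><a href="/person/4566-alan-rickman">Alan Rickman</a></li>
  <li class="card"><a href="/person/10980-daniel-radcliffe">Daniel Radcliffe</a></li>
</ol>
"""

parser = ActorLinkParser()
parser.feed(TOY_HTML)
print(parser.links)
```

In scrapy, the same extraction collapses to a one-line response.css(...) call, after which you yield one scrapy.Request per link with callback=self.parse_actor_page.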
Provided that these methods are correctly implemented, you can run the command

scrapy crawl tmdb_spider -o results.csv -a subdir=671-harry-potter-and-the-philosopher-s-stone

to create a .csv file with a column for actors and a column for movies or TV shows for Harry Potter and the Philosopher’s Stone.

Experimentation in the scrapy shell is strongly recommended.

Challenge

If you’re looking for a challenge, think about ways to make your recommendations more accurate. Consider also scraping the number of episodes, or limiting the number of actors you take per show, to make sure you only get the main series cast.

3. Make Your Recommendations

Once your spider is fully written, comment out the line

CLOSESPIDER_PAGECOUNT = 20

in the settings.py file. Then, the command

scrapy crawl tmdb_spider -o results.csv -a subdir=671-harry-potter-and-the-philosopher-s-stone

will run your spider and save a CSV file called results.csv, with columns for actor names and the movies and TV shows in which they appeared.

Once you’re happy with the operation of your spider, compute a sorted list of the top movies and TV shows that share actors with your favorite movie or TV show. For example, it may have two columns: “movie names” and “number of shared actors”.

Feel free to be creative. You can show a pandas data frame, a chart using matplotlib or plotly, or any other sensible display of the results.
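One minimal way to build such a table with pandas is sketched below. The data frame here is a tiny made-up stand-in for results.csv (in practice you would load yours with pd.read_csv("results.csv")); the column names match the dictionaries yielded by parse_actor_page:

```python
import pandas as pd

# Tiny made-up stand-in for the scraped results.csv.
df = pd.DataFrame({
    "actor": ["Alan Rickman", "Alan Rickman", "Emma Watson",
              "Emma Watson", "Daniel Radcliffe"],
    "movie_or_TV_name": ["Die Hard", "Love Actually", "Love Actually",
                         "Little Women", "Love Actually"],
})

# Count distinct actors per movie/show, then sort descending.
counts = (
    df.groupby("movie_or_TV_name")["actor"]
      .nunique()
      .sort_values(ascending=False)
      .reset_index(name="number of shared actors")
)
print(counts)
```

Your favorite movie itself will sit at the top of this ranking (it shares all of its actors with itself), so you may want to drop that row before presenting results.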

4. Blog Post

In your blog post, you should describe how your scraper works, as well as the results of your analysis. When describing your scraper, I recommend dividing it up into the three distinct parsing methods, and discussing them one-by-one. For example:

In this blog post, I’m going to make a super cool web scraper… Here’s how we set up the project…

<implementation of parse()>

This method works by…

<implementation of parse_full_credits()>

To write this method, I…

In addition to describing your scraper, your Blog Post should include a table or visualization of numbers of shared actors.

Remember that this post is still a tutorial, in which you guide your reader through the process of setting up and running the scraper. Don’t forget to tell them how to create the project and run the scraper!

Submission

There will be three Gradescope assignments open for submission: one for the autograder, one for the PDF, and one for files. You have to submit all of them for your homework to be graded.

  • For the autograder, please submit the entire TMDB_scraper/ folder compressed in .zip format. Please compress the outermost folder containing the scrapy.cfg file:

    └── TMDB_scraper   <-- DIRECTLY COMPRESS THIS FOLDER.
        ├── TMDB_scraper
        │   ├── ...
        │   └── spiders
        │       ├── ...
        │       └── tmdb_spider.py
        ├── ...
        └── scrapy.cfg
    
  • For the PDF assignment, please submit your newly-created blog page printed as PDF, with the URL visible. Please make sure your code is visible in full, i.e., not cut off, in your PDF.

  • For the files assignment, please submit any code files you wrote for your homework, except for your scrapy project. All .py, .ipynb, and .qmd files should be included, as well as a .py file converted from any .ipynb file. The grader should be able to reproduce your results from the code you submitted.

    • It must include index.ipynb, the Jupyter Notebook you worked on, and index.py, a Python script converted from it.

Specifications

Format

  1. Please follow the “Submission” section above.

Coding Problem

  1. Each of the three parsing methods is correctly implemented. (autograded)
  2. parse() is implemented in no more than 5 lines.
  3. parse_full_credits() is implemented in no more than 5 lines.
  4. parse_actor_page() is implemented in no more than 15 lines.
  5. A table or list of results or pandas dataframe is shown.
  6. A visualization with matplotlib, plotly, or seaborn is shown.

Style and Documentation

  1. Each of the three parse methods has a short docstring describing its assumptions (e.g. what kind of page it is meant to parse) and its effect, including navigation and data outputs.
  2. Each of the three parse methods has helpful comments for understanding how each chunk of code operates.

Writing

  1. The blog post is written in tutorial format, in engaging and clear English. Grammar and spelling errors are acceptable within reason.
  2. The blog post explains clearly how to set up the project, run the scraper, and access the results.
  3. The blog post explains how each of the three parse methods works.
  4. Blog post has a descriptive title.

Footnotes

  1. [added 2/4 for clarification] Only the works listed in “Acting” section of the actor page.↩︎