What’s your favorite movie? Wouldn’t it be nice to find more shows that you might like to watch, based on ones you know you like? Tools that address questions like this are often called “recommender systems.” Powerful, scalable recommender systems are behind many modern entertainment and streaming services, such as Netflix and Spotify. While most recommender systems these days involve machine learning, there are also ways to make recommendations that don’t require such complex tools.
In this Blog Post, you’ll use web scraping to answer the following question:
What movie or TV shows share actors with your favorite movie or show?
The idea of this question is that, if TV show Y has many of the same actors as TV show X, and you like X, you might also enjoy Y.
This post has two parts. In the first, larger part, you’ll write a web scraper for finding shared actors on TMDB. In the second, smaller part, you’ll use the results from your scraper to make recommendations.
Don’t forget to check the Specifications for a complete list of what you need to do to obtain full credit. As usual, this Blog Post should be printed as PDF from your PIC16B Blog preview screen, and you need to submit any code you wrote as well.
Instructions
1. Setup
1.1. Locate the Starting TMDB Page
Pick your favorite movie, and locate its TMDB page by searching on https://www.themoviedb.org/. For example, my favorite movie is Harry Potter and the Philosopher’s Stone. Its TMDB page is at:
https://www.themoviedb.org/movie/671-harry-potter-and-the-philosopher-s-stone/
Save this URL for a moment.
1.3. Initialize Your Project
- Open a terminal and type:

```
conda activate PIC16B-24W
scrapy startproject TMDB_scraper
cd TMDB_scraper
```
This will create quite a lot of files, but you don’t really need to touch most of them. However, you must submit the entire TMDB_scraper folder for the autograder part.
1.4. Tweak Settings
For now, add the following line to the file `settings.py`:

```python
CLOSESPIDER_PAGECOUNT = 20
```
This line just prevents your scraper from downloading too much data while you’re still testing things out. You’ll remove this line later.
Hint: Later on, you may run into `403 Forbidden` errors once the website detects that you’re a bot. See these links (link1, link2, link3, link4) for how to work around that issue. The easiest solution is changing one line in `settings.py`. You might see this when you run `scrapy shell` as well, so keep an eye out for `403`! Remember, you want your status to be `200 OK`. If they know that you are on Python, they will certainly try to block you. One way to change the user agent in `scrapy shell` is:

```
scrapy shell -s USER_AGENT='Scrapy/2.8.0 (+https://scrapy.org)' https://www.themoviedb.org/...
```

(This user agent is the default one that might be blocked by TMDB.)
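If you do hit `403` errors, one common workaround (an assumption on my part, not the only possible fix) is to set a browser-style user agent in `settings.py`:

```python
# settings.py -- replace Scrapy's default user agent with a browser-style
# one. The exact string below is just an example; any realistic browser
# user agent should work.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
```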
2. Write Your Scraper
Create a file inside the `spiders` directory called `tmdb_spider.py`. Add the following lines to the file:
```python
# to run
# scrapy crawl tmdb_spider -o movies.csv -a subdir=671-harry-potter-and-the-philosopher-s-stone

import scrapy

class TmdbSpider(scrapy.Spider):
    name = 'tmdb_spider'

    def __init__(self, subdir=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [f"https://www.themoviedb.org/movie/{subdir}/"]
```
Then, you will be able to run your completed spider for a movie of your choice by giving its subdirectory on the TMDB website as an extra command-line argument.
Now, implement three parsing methods for the `TmdbSpider` class.

- `parse(self, response)` should assume that you start on a movie page, and then navigate to the Full Cast & Crew page. Remember that this page has url `<movie_url>cast`. (You are allowed to hardcode that part.) Once there, `parse_full_credits(self, response)` should be called, by specifying this method in the `callback` argument to a yielded `scrapy.Request`. The `parse()` method does not return any data. This method should be no more than 5 lines of code, excluding comments and docstrings.
- `parse_full_credits(self, response)` should assume that you start on the Full Cast & Crew page. Its purpose is to yield a `scrapy.Request` for the page of each actor listed on the page. Crew members are not included. The yielded request should specify that the method `parse_actor_page(self, response)` should be called when the actor’s page is reached. The `parse_full_credits()` method does not return any data. This method should be no more than 5 lines of code, excluding comments and docstrings.
- `parse_actor_page(self, response)` should assume that you start on the page of an actor. It should yield a dictionary with two key-value pairs, of the form `{"actor" : actor_name, "movie_or_TV_name" : movie_or_TV_name}`. The method should yield one such dictionary for each of the movies or TV shows on which that actor has worked in an “Acting” role.¹ Note that you will need to determine both the name of the actor and the name of each movie or TV show. This method should be no more than 15 lines of code, excluding comments and docstrings.
Provided that these methods are correctly implemented, you can run the command

```
scrapy crawl tmdb_spider -o results.csv -a subdir=671-harry-potter-and-the-philosopher-s-stone
```

to create a `.csv` file with a column for actors and a column for movies or TV shows for Harry Potter and the Philosopher’s Stone.
Experimentation in the `scrapy shell` is strongly recommended.
Challenge
If you’re looking for a challenge, think about ways that may make your recommendations more accurate. Consider scraping the number of episodes as well or limiting the number of actors you get per show to make sure you only get the main series cast.
3. Make Your Recommendations
Once your spider is fully written, comment out the line

```python
CLOSESPIDER_PAGECOUNT = 20
```

in the `settings.py` file. Then, the command

```
scrapy crawl tmdb_spider -o results.csv -a subdir=671-harry-potter-and-the-philosopher-s-stone
```

will run your spider and save a CSV file called `results.csv`, with columns for actor names and the movies and TV shows in which they featured.
Once you’re happy with the operation of your spider, compute a sorted list of the top movies and TV shows that share actors with your favorite movie or TV show. For example, it may have two columns: one for movie names and one for the number of shared actors.
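One way to compute this ranking from `results.csv` is a short standard-library sketch. It assumes the two columns named above (`actor` and `movie_or_TV_name`); the function name `top_shared` is my own, not part of the assignment:

```python
import csv
from collections import Counter

def top_shared(csv_path, n=10):
    """Rank movies/TV shows by the number of shared actors.

    Assumes a CSV with 'actor' and 'movie_or_TV_name' columns, as
    produced by the spider. Returns a list of (name, count) pairs,
    sorted from most to fewest shared actors.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        # De-duplicate (actor, work) pairs so a repeated row can't
        # double-count the same actor for the same show.
        pairs = {(r["actor"], r["movie_or_TV_name"]) for r in csv.DictReader(f)}
    counts = Counter(work for _, work in pairs)
    return counts.most_common(n)
```

If you load the CSV into pandas instead, `df["movie_or_TV_name"].value_counts()` gives essentially the same ranking in one line.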
Feel free to be creative. You can show a pandas data frame, a chart using `matplotlib` or `plotly`, or any other sensible display of the results.
4. Blog Post
In your blog post, you should describe how your scraper works, as well as the results of your analysis. When describing your scraper, I recommend dividing it up into the three distinct parsing methods, and discussing them one-by-one. For example:
In this blog post, I’m going to make a super cool web scraper… Here’s how we set up the project…
<implementation of parse()>
This method works by…
<implementation of parse_full_credits()>
To write this method, I…
In addition to describing your scraper, your Blog Post should include a table or visualization of numbers of shared actors.
Remember that this post is still a tutorial, in which you guide your reader through the process of setting up and running the scraper. Don’t forget to tell them how to create the project and run the scraper!
Submission
There will be three Gradescope assignments open for submission: one for the autograder, one for the PDF, and one for files. You must submit all three for your homework to be graded.
For the autograder, please submit the entire `TMDB_scraper/` folder compressed in `.zip` format. Please compress the outermost folder, the one containing the `scrapy.cfg` file:

```
└── TMDB_scraper   <-- DIRECTLY COMPRESS THIS FOLDER.
    ├── TMDB_scraper
    │   ├── ...
    │   └── spiders
    │       ├── ...
    │       └── tmdb_spider.py
    ├── ...
    └── scrapy.cfg
```
For the PDF assignment, please submit your newly-created blog page printed as PDF, with the URL visible. Please make sure your code is visible in full, i.e., not cut off, in your PDF.
For the files assignment, please submit any code files you wrote for your homework, except for your scrapy project. All `.py`, `.ipynb`, or `.qmd` files are included. It should include a `.py` file converted from any `.ipynb` file. The grader should be able to reproduce your results from the code you submitted.

- It must include `index.ipynb`, the Jupyter Notebook you worked on, and `index.py`, a Python-script-converted version of it.
Specifications
Format
- Please follow the “Submission” section above.
Coding Problem
- Each of the three parsing methods is correctly implemented. (autograded)
- `parse()` is implemented in no more than 5 lines.
- `parse_full_credits()` is implemented in no more than 5 lines.
- `parse_actor_page()` is implemented in no more than 15 lines.
- A table, list of results, or pandas dataframe is shown.
- A visualization with `matplotlib`, `plotly`, or `seaborn` is shown.
Style and Documentation
- Each of the three `parse` methods has a short docstring describing its assumptions (e.g. what kind of page it is meant to parse) and its effect, including navigation and data outputs.
- Each of the three `parse` methods has helpful comments for understanding how each chunk of code operates.
Writing
- The blog post is written in tutorial format, in engaging and clear English. Grammar and spelling errors are acceptable within reason.
- The blog post explains clearly how to set up the project, run the scraper, and access the results.
- The blog post explains how each of the three `parse` methods works.
- The blog post has a descriptive title.
Footnotes
[added 2/4 for clarification] Only the works listed in the “Acting” section of the actor page.↩︎