All problems due by 11:59 p.m. on Tuesday, April 28, 2026.
In your work on this assignment, make sure to abide by the collaboration policies of the course. All of the problems in this assignment are individual-only problems that you must complete on your own.
If you have questions, please come to office hours, post them on
Piazza, or email cs460-staff@cs.bu.edu.
Make sure to submit your work on Gradescope, following the procedures found at the end of Part I and Part II.
60 points total
In this part of the assignment, you will write queries for a MongoDB version of our movie database – one that uses the data model outlined in lecture.
Create a subfolder called ps5 within your
cs460 folder, and put all of the files for this assignment in that
folder.
In addition, you should download the template that you will use for
your queries by clicking on the following link and saving the file in
your ps5 folder:
If the browser doesn’t allow you to choose where to download the file, right-click the above link and use Save link as... or the equivalent option.
Next, follow our directions to install and configure both MongoDB and the version of the movie database that you will be using.
See our separate instructions for the steps needed to perform your queries.
Remember: When typing a query in MongoDB Compass, you can allow your query to span multiple lines if you use Shift-Return or Shift-Enter at the end of a given line.
If you’re using a Mac, you should disable smart quotes, because they may lead to errors in MongoDB and in our testing. There are instructions for doing so here.
ps5_queries.py is a Python file, so you could use a Python IDE
to edit it, but a regular text editor like TextEdit or Notepad++
would also be fine. However, if you use a text editor, you must
ensure that you save it as a plain-text
file.
Construct the MongoDB method calls needed to solve the problems given
below. Test each method call in MONGOSH to make sure that it works.
Once you have finalized the method call for a given problem, copy
the call into your ps5_queries.py file, putting it between
the triple quotes provided for that query’s variable. We have
included a sample query to show you what the format of your
answers should look like.
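As a rough illustration of the format described above, an entry in ps5_queries.py might look something like the following sketch. The variable name `example_query`, the `movies` collection, and the `name` and `year` field names are assumptions made for illustration only; follow the variable names and the sample query in the actual template.

```python
# Hypothetical sketch of one entry in ps5_queries.py.
# The variable name, collection name, and field names are illustrative
# assumptions; use the ones that appear in the provided template.
example_query = """
    db.movies.find(
        { year: 2000 },
        { name: 1, _id: 0 }
    )
"""

# The method call is stored as plain text, so it can be inspected like any string.
print(example_query.strip())
```

Because the method call lives inside a Python string, Python itself never checks the query's syntax, which is why each call should be tested in MongoDB before it is pasted in.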
Each of the problems must be solved by means of a single query
(i.e., a single method call). The results of the query should
include only the requested information, with no extraneous fields.
In particular, you should exclude the _id field from the results
unless the problem indicates otherwise.
You do not need to worry about the order of the fields in the results, or about where line breaks or spaces appear.
Unless the problem indicates otherwise, you may only use aspects of the MongoDB query language that we discussed in lecture. See our general query-writing guidelines for more details.
Your queries should only use information provided in the problem itself. In addition, they should work for any MongoDB database that follows the schema that we discussed in lecture.
6 points
Make sure to read and follow the guidelines given above.
In Problem Set 2 and Problem Set
3, you wrote SQL
and XQuery queries to find information about the two movies in our
database named West Side Story. Write a MongoDB query to solve a
similar problem. The result of your query should be two documents that
each contain only the year, director names and actor names of one of
the movies named West Side Story. The director name(s) and actor
names for a given movie should appear as elements of an array
containing one or more strings. For example, one of the result
documents should have a field for the director names whose value is
["Steven Spielberg"].
Hints:
In order to obtain arrays of strings for the director and actor
names, you will need to use an aggregation pipeline instead of a
simple find command. The final results of the pipeline should
have only three fields: a year field, a field called directors
for the array of director names, and a field called actors for
the array of actor names.
Don’t forget that when you use dot notation to specify a field in an embedded object, you must surround that field name with quotes.
6 points
In Problem Set 1 and Problem Set 3, you wrote SQL and XQuery queries to find information about two Oscar nominees from the movie Hamnet: Chloe Zhao and Jessie Buckley. (Buckley has since won the Oscar for which she was nominated!) Write a MongoDB query to find the dates of birth and places of birth of these women. The result of your query should be two documents that each contain the name, date of birth, and place of birth of one of these women.
Hints:
This query can be solved using a simple find command.
Make sure that you start with the correct collection of documents!
6 points
In Problem Set
1, you wrote
a SQL query to find information about all Oscar-winning movies
whose names begin with “One”. Write a MongoDB query to solve a
similar problem – finding just the names of those movies. You should
use pattern-matching as needed, and make sure that you only include
movies in which the first word of the name is exactly "One". (By the
way, the movie on which the PS 1 problem was based – One Battle
After Another – did end up winning a number of Oscars, including
Best Picture, but our database hasn’t been updated to include this
movie or its Oscar wins.)
Hints:
You should use the single-purpose aggregation method called
distinct that we covered in lecture, not an
aggregation pipeline. This method will ensure that a given
movie name appears at most once in the results. Recall that
the distinct method produces an array of values – in this
case, an array of strings.
As you did in PS 1, you will need to think about how to construct a pattern that matches movie names whose first word is “One” without also matching movies whose first word merely begins with “One” (e.g., “Onetime”) or movies in which “One” appears somewhere after the very beginning.
Make sure that you start with the correct collection of documents!
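The pattern-matching idea in the hints above can be tried out in plain Python before writing the MongoDB version. The sketch below uses Python's `re` module, whose syntax is closely related to the regular expressions that MongoDB accepts. The titles are made-up test strings, and the pattern shown is only one possible approach; it assumes that the name contains at least one more word after “One”.

```python
import re

# Anchor at the start of the string and require a space right after "One",
# so that the first word is exactly "One".
# Assumption: the name has at least one more word after "One".
first_word_one = re.compile(r"^One ")

# Hypothetical titles used only to exercise the pattern.
titles = ["One Fine Day", "Onetime", "The Chosen One", "One"]

matches = [t for t in titles if first_word_one.search(t)]
print(matches)  # ['One Fine Day']
```

Note that the single-word title "One" is not matched by this particular pattern; whether that case matters depends on the data.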
6 points
It takes time to establish yourself as a director in Hollywood. As a
result, our database includes no examples of directors in their 20s,
very few in their 30s, and a relatively small number in their
40s. Write a query to find the number of directors in our database who
were born in the last 50 years – i.e., whose date of birth is
sometime in 1976 or later. You should use the single-purpose
aggregation method called count that we covered in lecture,
not an aggregation pipeline, and thus the result of your query will
be a single number.
Hints/notes:
You should ignore the “DeprecationWarning” message that you
will receive when you use the count method. Do not use
one of the alternate methods that MongoDB suggests as part of
the warning message, because those methods may not be present
in the version of MongoDB that the Autograder will use.
Don’t forget that the documents that we have created to store information about people include an optional field that allows you to determine whether someone is a director.
As needed, you should review the appropriate collection of documents to remind yourself of the format of people’s dates of birth.
6 points
Building on the previous problem, write a query to find the names and dates of birth of the ten youngest directors in the database. For this query, you will need to use an aggregation pipeline, and the result should be documents containing only two fields: one for the director’s name, and one for the director’s date of birth.
Note: Strictly speaking, you wouldn’t need to use the same selection document as the one that you used for the previous problem, but doing so will cut down on the number of documents that later stages of the aggregation pipeline need to consider.
6 points
In Problem Set 1 and Problem Set 3, you wrote SQL and XQuery queries to compute per-year statistics on top-grossing movies from 2015 until 2025. Write a MongoDB query to solve a similar problem. Your query should create, for each year from 2015 to 2025, a summary document that includes the following fields:
one called avg_rank for the average earnings rank of the top
grossers from that year.
one called best_rank for the earnings rank of the movie
from that year that has earned the most money. Remember that
the lower the earnings rank, the more money the movie has
made.
one called year for the value of the year being summarized.
Sort the results by year, from 2015 to 2025.
Hints/notes:
You should take advantage of the fact that our database only
includes an earnings_rank field for a movie if the movie
is a top grosser.
Because 2020 doesn’t include any top grossers, it will not have a document in the results of the query. Although we were able to include 2020 in the results of our PS 1 and PS 3 queries, doing so isn’t possible using the aspects of MongoDB’s query language that we’ve covered in lecture.
Our database doesn’t currently include any top grossers from 2026. However, your query should still explicitly limit itself to years between 2015 and 2025, so that even after the top grossers from 2026 have been added, it will still only show results for 2015-2025.
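To make the shape of the expected summary documents concrete, here is a minimal plain-Python sketch of the same computation: restrict to the requested years, group by year, and compute an average and a minimum per group. The (year, rank) pairs are invented for illustration and are not taken from the course database; this is a model of the grouping logic, not a MongoDB query.

```python
from collections import defaultdict

# Hypothetical (year, earnings_rank) pairs; not taken from the course database.
top_grossers = [(2015, 12), (2015, 4), (2016, 30), (2016, 10), (2014, 7), (2026, 1)]

ranks_by_year = defaultdict(list)
for year, rank in top_grossers:
    if 2015 <= year <= 2025:  # explicitly keep only the requested range of years
        ranks_by_year[year].append(rank)

# One summary per year that has at least one top grosser, sorted by year.
summaries = [
    {"year": year, "avg_rank": sum(ranks) / len(ranks), "best_rank": min(ranks)}
    for year, ranks in sorted(ranks_by_year.items())
]
print(summaries)
```

As in the hint about 2020, a year with no top grossers never forms a group, so it simply has no summary in the output.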
6 points
In Problem Set 1 and Problem Set 3, you wrote SQL and XQuery queries to find information about directors who have directed at least four of the 200 top-grossing films. Write a MongoDB query to solve the same problem. The results of your query should be documents that each have only three fields:
one called name for the director’s name
one called num_top for the number of top-grossing movies
that the person has directed
one called top_grossers whose value is an array containing the
names of the top-grossing movies that the person has directed.
Sort the result documents in descending order by the number of top-grossing movies. If multiple directors have the same number of top grossers, sort them in ascending order by the name of the director.
Notes/hints:
To avoid combining two directors with the same name, you should group on the value of the entire embedded subdocument for a given director – i.e., the document containing both the director’s id and name. We showed an example of grouping on an entire embedded subdocument in lecture on April 15, and you can find the solutions to that example in the folder on Blackboard for that lecture.
That same lecture example also shows how to extract just a single field value from an object that is used as the basis of a subgroup.
You should use the accumulator that we mentioned in lecture that can be used to construct an array of values for a given subgroup.
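The reason for grouping on the whole embedded subdocument, rather than on the name alone, can be seen in a small plain-Python model. The sketch below groups on an (id, name) pair, collects movie names into a list per group, and then extracts just the name for the output. The `directors` field name, the ids, and all of the names are hypothetical values chosen for illustration.

```python
from collections import defaultdict

# Hypothetical movie documents; the field names, ids, and names are made up.
movies = [
    {"name": "Movie A", "directors": [{"id": "001", "name": "Pat Lee"}]},
    {"name": "Movie B", "directors": [{"id": "002", "name": "Pat Lee"}]},
    {"name": "Movie C", "directors": [{"id": "001", "name": "Pat Lee"}]},
]

# Group on the (id, name) pair so that two different people who share
# a name end up in separate groups.
groups = defaultdict(list)
for movie in movies:
    for director in movie["directors"]:
        groups[(director["id"], director["name"])].append(movie["name"])

# Pull just the name out of the grouping key for the final documents.
results = [
    {"name": key[1], "num_top": len(titles), "top_grossers": titles}
    for key, titles in groups.items()
]

# Descending by count, then ascending by name.
results.sort(key=lambda doc: (-doc["num_top"], doc["name"]))
print(results)
```

Grouping on the name alone would have merged the two people called "Pat Lee" into a single group with three movies.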
6 points
Most of the people in our database were born in the United States, and most international places of birth are connected to very few people. Write a query to find all places of birth outside of the US that are associated with more than 10 people in our database.
The result of your query should be documents with only the following fields:
one called birthplace for the place of birth being summarized
one called num_people for the number of people born there.
Sort the results in descending order by the number of people. If multiple birthplaces have the same number of people, sort them in ascending order by the name of the birthplace.
Important notes:
In our MongoDB database, some people don’t have a
pob field. You should exclude them from the results.
You will need to use pattern-matching to find all places of birth from outside of the US. You should:
Assume that all places of birth from within the United States have the letters “USA” at the very end of their value.
Use the $not operator to test for the absence of a
pattern. For example, if you wanted to test for all
movies whose genre value does not include the letter
“R”, you could use the following selection document:
{ genre: { $not: /R/ } }
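The behavior of a negated pattern test can be modeled in a few lines of Python. The sketch below mirrors the genre example above using made-up documents, and it also models one documented property of MongoDB's $not operator: documents that lack the field entirely satisfy the negated condition as well, which is why a separate check for the field's presence can be needed. The helper `not_matches` is a hypothetical name used only here.

```python
import re

# Hypothetical documents; the third one has no genre field at all.
movies = [
    {"name": "M1", "genre": "AR"},
    {"name": "M2", "genre": "N"},
    {"name": "M3"},
]

def not_matches(doc, field, pattern):
    """Model of a negated pattern test: true when the field is missing
    or when its value does not contain the pattern."""
    value = doc.get(field)
    return value is None or re.search(pattern, value) is None

selected = [m["name"] for m in movies if not_matches(m, "genre", r"R")]
print(selected)  # ['M2', 'M3']

# Requiring the field to be present as well drops the document without it.
selected_with_field = [
    m["name"] for m in movies if "genre" in m and not_matches(m, "genre", r"R")
]
print(selected_with_field)  # ['M2']
```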
6 points
Write a query to find all people who have acted in 3 or more of the animated movies in the database. The final result should consist of documents that each have two fields:
one called num_animated with the number of animated movies in
which the person acted
one called actor with the name of the person.
Sort the results in descending order by the number of movies. If multiple people have the same number of animated movies, sort them in ascending order by the name of the person.
Hints/notes:
To avoid combining two people with the same name, you should take a similar approach to the one that you took in Problem 7.
You may assume that all animated movies have an N somewhere
in the sequence of letters that make up the value of their
genre field.
6 points
Some talented actors have won Oscars for both a main and a supporting role – i.e., either (1) one or more “BEST-ACTOR” awards and one or more “BEST-SUPPORTING-ACTOR” awards, or (2) one or more “BEST-ACTRESS” awards and one or more “BEST-SUPPORTING-ACTRESS” awards. Write a MongoDB query to find them. The result of your query should be documents with the following fields:
one called winner for the name of the person
one called awards whose value is an array containing the
types of Oscars that the person has won. If the actor has
won the same type of award multiple times, it should appear
more than once in the array.
Sort the results by the name of the winner.
Hints/notes:
To avoid combining two people with the same name, you should take a similar approach to the one that you took in Problem 7.
The accumulator that we discussed in lecture for creating an array
of values creates a set of values, which means it does not maintain
duplicate values. As a result, you should use a related accumulator
called $push instead. That way, a given type of award can appear
more than once in a given person’s array of awards if the person
has won that type of award multiple times.
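The difference between the two accumulators mentioned above corresponds to the difference between a set and a list. The following plain-Python sketch uses an invented sequence of award types to show what each style of accumulation keeps; it models the behavior of the set-building accumulator (named `$addToSet` in MongoDB) and of `$push`, and is not itself a MongoDB query.

```python
# Hypothetical award types won by one person, in the order they are encountered.
awards_won = ["BEST-ACTOR", "BEST-SUPPORTING-ACTOR", "BEST-ACTOR"]

# Set-style accumulation: a repeated value is stored only once.
as_set = set()
for award in awards_won:
    as_set.add(award)

# Push-style accumulation: every value is appended, duplicates included.
as_list = []
for award in awards_won:
    as_list.append(award)

print(sorted(as_set))  # ['BEST-ACTOR', 'BEST-SUPPORTING-ACTOR']
print(as_list)         # ['BEST-ACTOR', 'BEST-SUPPORTING-ACTOR', 'BEST-ACTOR']
```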
Log in to Gradescope by clicking the link in the left-hand navigation bar, and click on the box for CS 460.
Submit your ps5_queries.py file using these steps:
Click on PS 5: Part I in the list of assignments. You should see a pop-up window with a box labeled DRAG & DROP. (If you don’t see it, click the Submit or Resubmit button at the bottom of the page.)
Add your file to the box labeled DRAG & DROP. You can either drag and drop the file from its folder into the box, or you can click on the box itself and browse for the file.
Click the Upload button.
You should see a box saying that your submission was successful.
Click the (x) button to close that box.
The Autograder will perform some tests on your file. Once it is done, check the results to ensure that the tests were passed. If one or more of the tests did not pass, the name of that test will be in red, and there should be a message describing the failure. Based on those messages, make any necessary changes. Feel free to ask a staff member for help.
Notes:
You should see results for each query. If you don’t see any results for a given query, it probably means that you have a syntax or logic error in your query, and you should attempt to fix it and resubmit.
You should keep making changes as needed until you get full credit for a given query. There will be no partial credit awarded for an incorrect query.
Make sure that each query is logically correct, and that it will work for any instance of the movie database that follows the data model outlined in lecture. We reserve the right to ultimately run your queries on a slightly different version of the database to ensure that your queries are logically correct.
If needed, use the Resubmit button at the bottom of the page to resubmit your work.
Near the top of the page, click on the box labeled Code. Then click on the name of your file to view its contents. Check to make sure that the file contains the work that you want us to grade.
Important
It is your responsibility to ensure that the correct version of a file is on Gradescope before the final deadline. We will not accept any file after the submission window for a given assignment has closed, so please check your submission carefully using the steps outlined above.
If you are unable to access Gradescope and there is enough
time to do so, wait an hour or two and then try again. If you
are unable to submit and it is close to the deadline, email
your homework before the deadline to
cs460-staff@cs.bu.edu
40 points total
You will complete all of this part of the assignment in a single PDF file. To create it, you should do the following:
Access the template that we have created by clicking on this link and signing into your Google account as needed.
When asked, click on the Make a copy button, which will save a copy of the template file to your Google Drive.
Select File->Rename, and change the name of the file to
ps5_partII.
Add your work for all of the problems from Part II to this file.
Once you have completed Part II, choose File->Download->PDF
document, and save the PDF file on your machine. The resulting
PDF file (ps5_partII.pdf) is the one that you will submit. See
the submission guidelines at the end of Part II.
25 points total
Consider the following sequence of log records written by a system that uses undo-redo logging:
| LSN | record contents |
|---|---|
| 5 | txn: 1; BEGIN |
| 10 | txn: 1; item: A; old: 100; new: 130; olsn: 0 |
| 20 | txn: 2; BEGIN |
| 30 | txn: 2; item: D; old: 400; new: 440; olsn: 0 |
| 40 | txn: 1; item: B; old: 200; new: 220; olsn: 0 |
| 50 | txn: 2; item: D; old: 440; new: 470; olsn: 30 |
| 60 | txn: 2; item: C; old: 300; new: 350; olsn: 0 |
| 70 | txn: 2; COMMIT |
| 80 | txn: 1; item: C; old: 350; new: 390; olsn: 60 |
(9 points) If a crash occurs and log record 80 is the last one
to make it to disk, what steps would be performed during recovery
if the system is performing undo-redo logging and the
on-disk datum LSNs are not consulted? (In other words, you
should assume that the system is not performing logical
logging, and thus you don’t need to worry about redoing or
undoing a change unnecessarily.) Complete the table provided in
ps5_partII to show how each log record would be handled during
both the backward and forward passes.
Guidelines:
In the columns for the backward pass and forward pass, each cell should include one of the following actions:
If the action is undo or redo, you should also include the appropriate assignment (e.g., you would write “X = 800” if data item X is given a value of 800).
(9 points) If a crash occurs and log record 80 is the last one to make it to disk, what steps would be performed during recovery if the system is performing undo-redo logging and the on-disk datum LSNs are consulted (i.e., the system is performing logical logging, despite the presence of the old and new values in the update log records)?
Complete the table provided in ps5_partII to show how each log
record would be handled during both the backward and forward
passes. You should assume that the datum LSNs at the start of
recovery are the following:
In addition, you should assume that the recovery subsystem does not perform any actions that the LSNs indicate are unnecessary.
Guidelines:
In the columns for the backward pass and forward pass, each cell should include one of the following actions:
Important: Make sure that you don’t put skip for cases that are more accurately described using don’t undo or don’t redo.
If the action is undo or redo, you should also include both the assignment for the data item (see above) and the assignment for the datum LSN (e.g., you would write “datumLSN(X) = 100” if the datum LSN of item X is given a value of 100).
(7 points) We will complete the material needed for this part of the question on April 24.
Now assume that a dynamic checkpoint had occurred between log records 50 and 60:
| LSN | record contents |
|---|---|
| 5 | txn: 1; BEGIN |
| 10 | txn: 1; item: A; old: 100; new: 130; olsn: 0 |
| 20 | txn: 2; BEGIN |
| 30 | txn: 2; item: D; old: 400; new: 440; olsn: 0 |
| 40 | txn: 1; item: B; old: 200; new: 220; olsn: 0 |
| 50 | txn: 2; item: D; old: 440; new: 470; olsn: 30 |
| 55 | CHECKPOINT (with appropriate additional info) |
| 60 | txn: 2; item: C; old: 300; new: 350; olsn: 0 |
| 70 | txn: 2; COMMIT |
| 80 | txn: 1; item: C; old: 350; new: 390; olsn: 60 |
You should assume that the checkpoint record includes the appropriate additional information, as discussed in lecture.
How (if at all) would the presence of that checkpoint record change which log records are considered during the backward pass? Explain briefly.
How (if at all) would the presence of that checkpoint record change which log records are considered during the forward pass? Explain briefly.
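For readers who find it helpful to see the two recovery passes as code, the following is a minimal sketch of undo-redo recovery without datum LSNs. It runs on a small hypothetical log that is deliberately different from the one in this problem, and it assumes a textbook formulation: the backward pass undoes updates of transactions with no COMMIT record, and the forward pass redoes updates of committed transactions. The record layout, the `recover` function, and the starting disk state are all assumptions; the version presented in lecture takes precedence wherever it differs.

```python
# Hypothetical log, not the one from the problem.
log = [
    {"lsn": 1, "txn": 1, "type": "BEGIN"},
    {"lsn": 2, "txn": 1, "type": "UPDATE", "item": "X", "old": 10, "new": 15},
    {"lsn": 3, "txn": 2, "type": "BEGIN"},
    {"lsn": 4, "txn": 2, "type": "UPDATE", "item": "Y", "old": 20, "new": 25},
    {"lsn": 5, "txn": 1, "type": "COMMIT"},
    {"lsn": 6, "txn": 2, "type": "UPDATE", "item": "X", "old": 15, "new": 18},
]

def recover(log, disk):
    """Apply a backward (undo) pass and then a forward (redo) pass to disk."""
    committed = set()
    actions = []

    # Backward pass: note commits, undo updates of uncommitted transactions.
    for rec in reversed(log):
        if rec["type"] == "COMMIT":
            committed.add(rec["txn"])
        elif rec["type"] == "UPDATE" and rec["txn"] not in committed:
            disk[rec["item"]] = rec["old"]
            actions.append(("undo", rec["lsn"], rec["item"], rec["old"]))

    # Forward pass: redo updates of committed transactions.
    for rec in log:
        if rec["type"] == "UPDATE" and rec["txn"] in committed:
            disk[rec["item"]] = rec["new"]
            actions.append(("redo", rec["lsn"], rec["item"], rec["new"]))

    return actions

disk = {"X": 18, "Y": 20}  # hypothetical on-disk values at the time of the crash
actions = recover(log, disk)
print(actions)
print(disk)
```

Because old and new values are simply reassigned, performing an undo or redo that was not strictly necessary leaves the same final values, which is the property that lets this version ignore datum LSNs.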
15 points total; 5 points each part
Consider again the following sequence of log records:
| LSN | record contents |
|---|---|
| 5 | txn: 1; BEGIN |
| 10 | txn: 1; item: A; old: 100; new: 130; olsn: 0 |
| 20 | txn: 2; BEGIN |
| 30 | txn: 2; item: D; old: 400; new: 440; olsn: 0 |
| 40 | txn: 1; item: B; old: 200; new: 220; olsn: 0 |
| 50 | txn: 2; item: D; old: 440; new: 470; olsn: 30 |
| 60 | txn: 2; item: C; old: 300; new: 350; olsn: 0 |
| 70 | txn: 2; COMMIT |
| 80 | txn: 1; item: C; old: 350; new: 390; olsn: 60 |
This log was created by a system that uses undo-redo logging. If a crash occurs and log record 80 is the last one to make it to disk, what are all possible on-disk values of each of the data items (A, B, C, and D) after the crash but before recovery?
How would your answer to part 1 change if the system were using redo-only logging instead of undo-redo? Briefly explain the reasons for any changes. (You should assume that none of the data items – A, B, C, D – are on the same page.)
How would your answer to part 1 change if the system were using undo-only logging? Briefly explain the reasons for any changes. (Here again, you should assume that none of the data items are on the same page. In addition, you should assume that when the DBMS forces dirty database pages to disk, it forces only those pages that must go to disk in order for undo-only logging to work correctly.)
Once you have completed Part II in Google Drive, choose
File->Download->PDF document, and save the resulting file
(ps5_partII.pdf) on your machine.
Log in to Gradescope and click on the box for CS 460.
Click on the name PS 5: Part II in the list of assignments. You should see a pop-up window labeled Submit Assignment. (If you don’t see it, click the Submit or Resubmit button at the bottom of the page.)
Choose the Submit PDF option, and then click the Select PDF
button and find the ps5_partII.pdf that you created in step 1.
Then click the Upload PDF button.
You should see an outline of the problems along with thumbnails of the pages from your uploaded PDF. For each problem in the outline:
As you do so, click on the magnifying glass icon for each page and double-check that the pages that you see contain the work that you want us to grade.
Once you have assigned pages to all of the problems in the question outline, click the Submit button in the lower-right corner of the window.
You should see a box saying that your submission was successful.
Click the (x) button to close that box.
You can use the Resubmit button at the bottom of the page to resubmit your work as many times as needed before the final deadline.
Important
It is your responsibility to ensure that the correct version of a file is on Gradescope before the final deadline. We will not accept any file after the submission window for a given assignment has closed, so please check your submission carefully using the steps outlined above.
If you are unable to access Gradescope and there is enough
time to do so, wait an hour or two and then try again. If you
are unable to submit and it is close to the deadline, email
your homework before the deadline to
cs460-staff@cs.bu.edu
Last updated on April 24, 2026.