All problems due by 11:59 p.m. on Tuesday, December 9, 2025.
In your work on this assignment, make sure to abide by the collaboration policies of the course. All of the problems in this assignment are individual-only problems that you must complete on your own.
If you have questions, please come to office hours, post them on
Piazza, or email cs460-staff@cs.bu.edu.
Make sure to submit your work on Gradescope, following the procedures found at the end of Part I and Part II.
60 points total
Create a subfolder called ps5 within your
cs460 folder, and put all of the files for this assignment in that folder.
In this part of the assignment, you will write queries for a MongoDB version of our movie database – one that uses the data model outlined in lecture.
You should begin by following our directions to install and configure both MongoDB and the version of the movie database that you will be using.
In addition, you should download the template that you will use for
your queries by clicking on the following link and saving the file in
your ps5 folder:
If the browser doesn’t allow you to choose where to download the file, right-click the above link and use Save link as... or the equivalent option.
See our separate instructions for the steps needed to perform your queries.
Remember: When typing a query in MongoDB Compass, you can allow your query to span multiple lines if you use Shift-Return or Shift-Enter at the end of a given line.
If you’re using a Mac, you should disable smart quotes, because they may lead to errors in MongoDB and in our testing. There are instructions for doing so here.
ps5_queries.py is a Python file, so you could use a Python IDE
to edit it, but a regular text editor like TextEdit or Notepad++
would also be fine. However, if you use a text editor, you must
ensure that you save it as a plain-text
file.
Construct the MongoDB method calls needed to solve the problems given
below. Test each method call in MONGOSH to make sure that it works.
Once you have finalized the method call for a given problem, copy
the call into your ps5_queries.py file, putting it between
the triple quotes provided for that query’s variable. We have
included a sample query to show you what the format of your
answers should look like.
Each of the problems must be solved by means of a single query
(i.e., a single method call). The results of the query should
include only the requested information, with no extraneous fields.
In particular, you should exclude the _id field from the results
unless the problem indicates otherwise.
You do not need to worry about the order of the fields in the results, nor the places in which line breaks or spaces appear.
Unless the problem indicates otherwise, you may only use aspects of the MongoDB query language that we discussed in lecture. See our general query-writing guidelines for more details.
Your queries should only use information provided in the problem itself. In addition, they should work for any MongoDB database that follows the schema that we discussed in lecture.
Make sure to follow the guidelines above.
In Problem Set 1,
you wrote a SQL query to find the year and runtime of the animated
version of Lilo & Stitch. Write a MongoDB query to find the
same information. Your query should make use of only
two facts: (1) that the movie’s name is Lilo & Stitch, and (2)
that there is an 'N' somewhere in the value of its genre
attribute. The result of your query should be a single
document that contains only the year and runtime of this movie.
The actor Daniel Craig stars in a movie that is now being released in theatres: Wake Up Dead Man, the latest movie in the Knives Out franchise. None of the Knives Out movies are in our database, but a number of Craig’s other movies are, including ones in which he acted alongside the legendary actress Judi Dench. Write a MongoDB query to find the names and years of all movies in our database in which both Daniel Craig and Judi Dench acted. The result of your query should be documents that each contain only the name and year of one of these movies.
Notes/hints:
You may assume that person names are unique.
When applying a condition to a field whose value is an array, the condition will be true if any element of the array satifies the condition, and the condition will be false if none of the elements of the array satisfy the condition.
Don’t forget that when you use dot notation to access a field in an embedded object, you need to surround that field name with quotes.
In Problem Set 2, you wrote a SQL
query to determine the total number of actors born in two
states: New York and California. Write a MongoDB query to solve
the same problem. You should use the single-purpose aggregation
method called count that we covered covered in lecture, not an
aggregation pipeline, and thus the result of your query will be a
single number.
Hints/notes:
Make sure that you start with the correct collection of documents!
You should ignore the “DeprecationWarning” message that you
will receive when you use the count method. Do not use
one of the alternate methods that MongoDB suggests as part of
the warning message, because those methods may not be present
on the version of MongoDB that the Autograder will use.
As you did in Problem Set 2,
you should assume that all people born in New York state have
the string "New York, USA" at the very end of their
place of birth, and all people born in California have the
string "California, USA" at the very end of their
place of birth. Make sure that you use a pattern that limits
itself to those two types of pob values.
Don’t forget that the documents that we have created to store information about people include an optional field that allows you to determine whether someone is an actor.
Building on the previous problem, out of all places of birth from
New York and California, which two are associated with the
largest number of actors? For this query, you will need to use
an aggregation pipeline, and the result should be two documents,
each of which has only two fields: one called pob for the place
of birth, and one called num_actors for the number of actors
from that place of birth.
Hint: Remember that the order of the fields within the result documents doesn’t matter.
Write a MongoDB query to find the names of all movies associated with one or more of the Oscars in our database from the past three years (i.e., from 2023 or later). If a given movie won multiple Oscars, it should appear only once in the results.
Hints:
You should use the single-purpose aggregation method called
distinct that we covered covered in lecture, not an
aggregation pipeline. This method will ensure that a given
movie name appears at most once in the results. Recall that
the distinct method produces an array of values – in this
case, an array of strings.
Make sure that you start with the correct collection of documents!
In Problem Set 1, you wrote a SQL query to find all movies with more than one director. Write a MongoDB to solve the same problem. The results of your query should be documents that each have only three fields:
name for the movie’s nameyear for the movie’s yearnum_directors for the number of directors
of the movie.Sort the results in descending order by the number of directors. If multiple movies have the same number of directors, sort them in ascending order by name.
Notes/hints:
As you did in PS 1, you should assume that the combination of a movie’s name and year is unique.
In order to create subgroups that are based on more than one of an object’s fields, you will need to create an object whose fields have the values that you want to group on – and use that object as the basis of the subgroups.
For example, if we wanted to group on the combination of a movie’s rating and genre, we would do something like this:
{ $group: { _id: { rating: "$rating", genre: "$genre" }, ... }}
In Problem Set 1, you wrote a SQL query to find, for all places of birth associated with a sufficient number of people, the number of people who were born there and the date of birth of the youngest person born there. Write a MongoDB query to solve a similar problem – but this time you should limit yourself to places of birth associated with 20 or more people, and you should sort the results in decreasing order by the number of people.
The result of your query should be documents with only the following fields:
pob for the place of birth being summarizednum_people for the number of
people born thereyoungest_dob for date of birth of the youngest
person in the database who was born there.Important: In PS 1, we excluded people with a pob value of
null from the results of the comparable query. In our MongoDB
database, such people won’t have a pob field, and you should
continue to exclude them from the results.
Write a query to create, for each of the last 10 years (i.e., 2015 or later), a summary document that includes the following fields:
num_movies for the total number of movies from
that yearavg_runtime for the average runtime of movies
from that yearbest_rank for the earnings rank of the movie
from that year that has earned the most money. Remember that
the lower the earnings rank, the more money the movie has
made.year for the value of the year.Sort the results by year, from 2015 to 2025.
Note: If a given year has no top-grossing movies, all of the
movie documents from that year will be missing an earnings_rank
field, and therefore you will end up with a best_rank value of
null for that year. That is to be expected, and you do not
need to do anything to remove it.
In Problem Set 1,
you wrote a SQL query to determine the number of directors in the
database who have not directed any of the movies in the database
from the last 10 years (i.e., 2015 or later). Write a MongoDB
query to solve the same problem. The result of the query should
be a single document whose only field is a field called count
whose value is the number of such directors.
Some talented actors have won Oscars for both a main and a supporting role – i.e., either (1) one or more “BEST-ACTOR” award and one or more “BEST-SUPPORTING-ACTOR” award, or (2) one or more “BEST-ACTRESS” award and one or more “BEST-SUPPORTING-ACTRESS” award. Write a MongoDB query to find them. The result of your query should be documents with the following fields:
one called winner for the name of the person
one called awards whose value is an array containing the
types of Oscars that the person has won. If the actor has
won the same type of award multiple times, it should appear
more than once in the array.
Sort the results by the name of the winner.
Hints/notes:
You may assume that person names are unique.
In lecture, we mentioned an accumulator called $addToSet
that can be used to construct an array of values. However,
because a set does not maintain duplicate values, you should
use a related accumulator called $push instead. That way, a
given type of award can appear more than once in a given
person’s array of awards.
Login to Gradescope by clicking the link in the left-hand navigation bar, and click on the box for CS 460.
Submit your ps5_queries.py file using these steps:
Click on PS 5: Part I in the list of assignments. You should see a pop-up window with a box labeled DRAG & DROP. (If you don’t see it, click the Submit or Resubmit button at the bottom of the page.)
Add your file to the box labeled DRAG & DROP. You can either drag and drop the file from its folder into the box, or you can click on the box itself and browse for the file.
Click the Upload button.
You should see a box saying that your submission was successful.
Click the (x) button to close that box.
The Autograder will perform some tests on your file. Once it is done, check the results to ensure that the tests were passed. If one or more of the tests did not pass, the name of that test will be in red, and there should be a message describing the failure. Based on those messages, make any necessary changes. Feel free to ask a staff member for help.
Notes:
You should see results for each query. If you don’t see any results for a given query, it probably means that you have a syntax or logic error in your query, and you should attempt to fix it and resubmit.
You should keep making changes as needed until you get full credit for a given query. There will be no partial credit awarded for an incorrect query.
Make sure that each query is logically correct, and that it will work for any instance of the movie database that follows the data model outlined in lecture. We reserve the right to ultimately run your queries on a slightly different version of the database to ensure that your queries are logically correct.
If needed, use the Resubmit button at the bottom of the page to resubmit your work.
Near the top of the page, click on the box labeled Code. Then click on the name of your file to view its contents. Check to make sure that the file contains the work that you want us to grade.
Important
It is your responsibility to ensure that the correct version of a file is on Gradescope before the final deadline. We will not accept any file after the submission window for a given assignment has closed, so please check your submission carefully using the steps outlined above.
If you are unable to access Gradescope and there is enough
time to do so, wait an hour or two and then try again. If you
are unable to submit and it is close to the deadline, email
your homework before the deadline to
cs460-staff@cs.bu.edu
40 points total
This part of the assignment will all be completed in a single PDF file. To create it, you should do the following:
Access the template that we have created by clicking on this link and signing into your Google account as needed.
When asked, click on the Make a copy button, which will save a copy of the template file to your Google Drive.
Select File->Rename, and change the name of the file to
ps5_partII.
Add your work for all of the problems from Part II to this file.
Once you have completed Part II, choose File->Download->PDF
document, and save the PDF file on your machine. The resulting
PDF file (ps5_partII.pdf) is the one that you will submit. See
the submission guidelines at the end of Part II.
25 points total
Consider the following sequence of log records written by a system that uses undo-redo logging:
|
LSN |
record contents |
|---|---|
|
5 |
txn: 1; BEGIN |
|
10 |
txn: 1; item: B; old: 200; new: 230; olsn: 0 |
|
20 |
txn: 1; item: A; old: 100; new: 150; olsn: 0 |
|
30 |
txn: 2; BEGIN |
|
40 |
txn: 2; item: C; old: 300; new: 310; olsn: 0 |
|
50 |
txn: 1; item: B; old: 230; new: 290; olsn: 10 |
|
60 |
txn: 2; item: C; old: 310; new: 390; olsn: 40 |
|
70 |
txn: 2; item: D; old: 400; new: 440; olsn: 0 |
|
80 |
txn: 2; COMMIT |
|
90 |
txn: 1; item: D; old: 440; new: 470; olsn: 70 |
(9 points) If a crash occurs and log record 90 is the last one
to make it to disk, what steps would be performed during recovery
if the system is performing undo-redo logging and the
on-disk datum LSNs are not consulted? (In other words, you
should assume that the system is not performing logical
logging, and thus you don’t need to worry about redoing or
undoing a change unnecessarily.) Complete the table provided in
ps5_partII to show how each log record would be handled during
both the backward and forward passes.
Guidelines:
In the columns for the backward pass and forward pass, each cell should include one of the following actions:
If the action is undo or redo, you should also include the appropriate assignment (e.g., you would write X = 800 if data item X is given a value of 800).
(9 points) If a crash occurs and log record 90 is the last one to make it to disk, what steps would be performed during recovery if the system is performing undo-redo logging and the on-disk datum LSNs are consulted (i.e., the system is performing logical logging, despite the presence of the old and new values in the update log records)?
Complete the table provided in ps5_partII to show how each log
record would be handled during both the backward and forward
passes. You should assume that the datum LSNs at the start of
recovery are the following:
In addition, you should assume that the recovery subsystem does not perform any actions that the LSNs indicate are unnecessary.
Guidelines:
In the columns for the backward pass and forward pass, each cell should include one of the following actions:
Important: Make sure that you don’t put skip for cases that are more accurately described using don’t undo or don’t redo.
If the action is undo or redo, you should also include both the assignment for the data item (see above) and the assignment for the datum LSN (e.g., you would write datumLSN(X) = 100 if the datum LSN of item X is given a value of 100).
(7 points) We will complete the material needed for this part of the question on December 5.
Now assume that a dynamic checkpoint had occurred between log records 50 and 60:
|
LSN |
record contents |
|---|---|
|
5 |
txn: 1; BEGIN |
|
10 |
txn: 1; item: B; old: 200; new: 230; olsn: 0 |
|
20 |
txn: 1; item: A; old: 100; new: 150; olsn: 0 |
|
30 |
txn: 2; BEGIN |
|
40 |
txn: 2; item: C; old: 300; new: 310; olsn: 0 |
|
50 |
txn: 1; item: B; old: 230; new: 290; olsn: 10 |
|
55 |
CHECKPOINT (with appropriate additional info) |
|
60 |
txn: 2; item: C; old: 310; new: 390; olsn: 40 |
|
70 |
txn: 2; item: D; old: 400; new: 440; olsn: 0 |
|
80 |
txn: 2; COMMIT |
|
90 |
txn: 1; item: D; old: 440; new: 470; olsn: 70 |
You should assume that the checkpoint record includes the appropriate additional information, as discussed in lecture.
How (if at all) would the presence of that checkpoint record change which log records are considered during the backward pass? Explain briefly.
How (if at all) would the presence of that checkpoint record change which log records are considered during the forward pass? Explain briefly.
15 points total; 5 points each part
Consider again the following sequence of log records:
|
LSN |
record contents |
|---|---|
|
5 |
txn: 1; BEGIN |
|
10 |
txn: 1; item: B; old: 200; new: 230; olsn: 0 |
|
20 |
txn: 1; item: A; old: 100; new: 150; olsn: 0 |
|
30 |
txn: 2; BEGIN |
|
40 |
txn: 2; item: C; old: 300; new: 310; olsn: 0 |
|
50 |
txn: 1; item: B; old: 230; new: 290; olsn: 10 |
|
60 |
txn: 2; item: C; old: 310; new: 390; olsn: 40 |
|
70 |
txn: 2; item: D; old: 400; new: 440; olsn: 0 |
|
80 |
txn: 2; COMMIT |
|
90 |
txn: 1; item: D; old: 440; new: 470; olsn: 70 |
This log was created by a system that uses undo-redo logging. If a crash occurs and log record 90 is the last one to make it to disk, what are all possible on-disk values of each of the data items (A, B, C, and D) after the crash but before recovery?
How would your answer to part 1 change if the system were using redo-only logging instead of undo-redo? Briefly explain the reasons for any changes. (You should assume that none of the data items – A, B, C, D – are on the same page.)
How would your answer to part 1 change if the system were using undo-only logging? Briefly explain the reasons for any changes. (Here again, you should assume that none of the data items are on the same page. In addition, you should assume that when the DBMS forces dirty database pages to disk, it forces only those pages that must go to disk in order for undo-only logging to work correctly.)
Once you have completed Part II in Google Drive, choose
File->Download as->PDF document, and save the resulting file
(ps5_partII.pdf) on your machine.
Login to Gradescope and click on the box for CS 460.
Click on the name PS 5: Part II in the list of assignments. You should see a pop-up window labeled Submit Assignment. (If you don’t see it, click the Submit or Resubmit button at the bottom of the page.)
Choose the Submit PDF option, and then click the Select PDF
button and find the ps5_partII.pdf that you created in step 1.
Then click the Upload PDF button.
You should see an outline of the problems along with thumbnails of the pages from your uploaded PDF. For each problem in the outline:
As you do so, click on the magnifying glass icon for each page and doublecheck that the pages that you see contain the work that you want us to grade.
Once you have assigned pages to all of the problems in the question outline, click the Submit button in the lower-right corner of the window.
You should see a box saying that your submission was successful.
Click the (x) button to close that box.
You can use the Resubmit button at the bottom of the page to resubmit your work as many times as needed before the final deadline.
Important
It is your responsibility to ensure that the correct version of a file is on Gradescope before the final deadline. We will not accept any file after the submission window for a given assignment has closed, so please check your submission carefully using the steps outlined above.
If you are unable to access Gradescope and there is enough
time to do so, wait an hour or two and then try again. If you
are unable to submit and it is close to the deadline, email
your homework before the deadline to
cs460-staff@cs.bu.edu
Last updated on December 12, 2025.