All problems due by 11:59 p.m. on Tuesday, April 30, 2024.
In your work on this assignment, make sure to abide by the collaboration policies of the course. All of the problems in this assignment are individual-only problems that you must complete on your own.
If you have questions, please come to office hours, post them on
Piazza, or email cs460-staff@cs.bu.edu
.
Make sure to submit your work on Gradescope, following the procedures found at the end of Part I and Part II.
50 points total
Create a subfolder called ps5
within your
cs460
folder, and put all of the files for this assignment in that folder.
This part of the assignment will all be completed in a single PDF file. To create it, you should do the following:
Access the template that we have created by clicking on this link and signing into your Google account as needed.
When asked, click on the Make a copy button, which will save a copy of the template file to your Google Drive.
Select File->Rename, and change the name of the file to
ps5_partI
.
Add your work for all of the problems from Part I to this file.
Once you have completed Part I, choose File->Download->PDF
document, and save the PDF file on your machine. The resulting
PDF file (ps5_partI.pdf
) is the one that you will submit. See
the submission guidelines at the end of Part I.
20 points total; 4 points each part
Consider the following CREATE TABLE
commands for a portion of
relational database maintained by a bookseller:
CREATE TABLE Author(id CHAR(5) PRIMARY KEY, name VARCHAR(64), dob DATE); CREATE TABLE Book(isbn CHAR(13) PRIMARY KEY, title VARCHAR(128), publisher VARCHAR(64), num_pages INTEGER, genre VARCHAR(10), numInStock INTEGER); CREATE TABLE Wrote(authorID CHAR(5), book CHAR(13), PRIMARY KEY(authorID, book), FOREIGN KEY authorID REFERENCES Author(id), FOREIGN KEY book REFERENCES Book(isbn)); CREATE TABLE Sales(date VARCHAR(10), time VARCHAR(5), book CHAR(13), numSold INTEGER, PRIMARY KEY(date, time, book), FOREIGN KEY book REFERENCES Book(isbn));
The first two tables store information about authors and books.
The third table (Wrote
) captures relationsips between authors and books.
The fourth table (Sales
), combined with the numInStock
attribute
in Book
, allows the bookseller to keep track of sales and inventory.
You have been asked to create a database that stores this same information in MongoDB.
There are two types of relationships that the database needs to capture: those between authors and books, and those between books and their associated sales data.
One way to capture these relationships would be to take a purely reference-based approach that avoids embedding one type of document in another. Illustrate this approach by showing how the following tuples from the relational database would be represented as one or more documents in MongoDB:
from Author: ('56789', 'Barbara Kingsolver', '1955-04-08') from Book: ('9780063251922', 'Demon Copperhead', 'Harper', 560, 'fiction', 15) from Wrote: ('56789', '9780063251922') from Sales: ('2024-04-11', '10:45', '9780063251922', 2)
Guidelines:
You will need to determine how many documents are needed, how many references are needed, and where the references should go. Make sure to take into account the factors relevant to data modeling that we discussed in lecture.
Your answer should take into account the entire sets of relationships that the database will need to capture, but the only concrete relationship(s) that your answer needs to include are those described by the tuples above.
We encourage you to review the questions related to this problem in the PS 5 FAQ.
Briefly describe one advantage and one disadvantage of the approach to data modeling taken in part 1.
Another way to capture the relationships would be to take an approach that allows you to embed one type of document in another. Illustrate this approach by showing how the tuples from part 1 would be captured as one or more documents in MongoDB. The guidelines from part 1 also apply here.
Briefly describe one advantage and one disadvantage of the approach to data modeling taken in part 3.
Consider the updates that need to happen when the bookseller attempts to sell one or more copies of a given book. Discuss the appropriateness of the data-modeling approaches from parts 1 and 3 in light of this particular use of the database.
18 points total
Consider the following sequence of log records written by a system that uses undo-redo logging:
LSN |
record contents |
---|---|
0 |
txn: 1; BEGIN |
10 |
txn: 1; item: B; old: 200; new: 240; olsn: 0 |
20 |
txn: 1; item: D; old: 400; new: 420; olsn: 0 |
30 |
txn: 2; BEGIN |
40 |
txn: 2; item: C; old: 300; new: 370; olsn: 0 |
50 |
txn: 1; item: B; old: 240; new: 290; olsn: 10 |
60 |
txn: 2; item: A; old: 100; new: 150; olsn: 0 |
70 |
txn: 1; COMMIT |
80 |
txn: 2; item: D; old: 420; new: 480; olsn: 20 |
(7 points) If a crash occurs and log record 80 is the last one
to make it to disk, what steps would be performed during recovery
if the system is performing undo-redo logging and the
on-disk datum LSNs are not consulted? (In other words, you
should assume that the system is not performing logical
logging, and thus you don’t need to worry about redoing or
undoing a change unnecessarily.) Complete the table provided in
ps5_partI
to show how each log record would be handled during
both the backward and forward passes.
(7 points) If a crash occurs and log record 80 is the last one to make it to disk, what steps would be performed during recovery if the system is performing undo-redo logging and the on-disk datum LSNs are consulted (i.e., the system is performing logical logging)?
Complete the table provided in ps5_partI
to show how each log
record would be handled during both the backward and forward
passes. You should assume that the datum LSNs at the start of
recovery are the following:
In addition, you should assume that the recovery subsystem does not perform any actions that the LSNs indicate are unnecessary.
(4 points) Now assume that a dynamic checkpoint had occurred between log records 40 and 50.
a. How (if at all) would that change which log records are considered during the backward pass? Explain briefly.
b. How (if at all) would that change which log records are considered during the forward pass? Explain briefly.
12 points total
Consider again the following sequence of log records:
LSN |
record contents |
---|---|
0 |
txn: 1; BEGIN |
10 |
txn: 1; item: B; old: 200; new: 240; olsn: 0 |
20 |
txn: 1; item: D; old: 400; new: 420; olsn: 0 |
30 |
txn: 2; BEGIN |
40 |
txn: 2; item: C; old: 300; new: 370; olsn: 0 |
50 |
txn: 1; item: B; old: 240; new: 290; olsn: 10 |
60 |
txn: 2; item: A; old: 100; new: 150; olsn: 0 |
70 |
txn: 1; COMMIT |
80 |
txn: 2; item: D; old: 420; new: 480; olsn: 20 |
(4 points) This log was created by a system that uses undo-redo logging. If a crash occurs and log record 80 is the last one to make it to disk, what are all possible on-disk values of each of the data items (A, B, C, and D) after the crash but before recovery?
(4 points) How would your answer to part 1 change if the system were using undo-only logging? Briefly explain the reason for any changes. (You should assume that none of the data items (A, B, C, D) are on the same page. In addition, you should assume that when the DBMS forces dirty database pages to disk, it forces only those pages that must go to disk in order for undo-only logging to work correctly.)
(4 points) How would your answer to part 1 change if the system were using redo-only logging instead of undo-redo? Briefly explain the reason for any changes. (Here again, you should assume that none of the data items are on the same page.)
Once you have completed Part I in Google Drive, choose
File->Download as->PDF document, and save the resulting file
(ps5_partI.pdf
) on your machine.
Login to Gradescope and click on the box for CS 460.
Click on the name PS 5: Part I in the list of assignments. You should see a pop-up window labeled Submit Assignment. (If you don’t see it, click the Submit or Resubmit button at the bottom of the page.)
Choose the Submit PDF option, and then click the Select PDF
button and find the ps5_partI.pdf
that you created in step 1.
Then click the Upload PDF button.
You should see an outline of the problems along with thumbnails of the pages from your uploaded PDF. For each problem in the outline:
As you do so, click on the magnifying glass icon for each page and doublecheck that the pages that you see contain the work that you want us to grade.
Once you have assigned pages to all of the problems in the question outline, click the Submit button in the lower-right corner of the window.
You should see a box saying that your submission was successful.
Click the (x)
button to close that box.
You can use the Resubmit button at the bottom of the page to resubmit your work as many times as needed before the final deadline.
Important
It is your responsibility to ensure that the correct version of a file is on Gradescope before the final deadline. We will not accept any file after the submission window for a given assignment has closed, so please check your submission carefully using the steps outlined above.
If you are unable to access Gradescope and there is enough
time to do so, wait an hour or two and then try again. If you
are unable to submit and it is close to the deadline, email
your homework before the deadline to
cs460-staff@cs.bu.edu
50 points total
In this part of the assignment, you will write queries for a MongoDB version of our movie database – one that uses the data model outlined in lecture.
Click on each of the following links, and save the files in
your ps5
folder:
If the browser doesn’t allow you to choose where to download the files, right-click each link and use Save link as... or the equivalent option.
Follow the appropriate set of instructions to get everything installed and properly configured:
Make sure that you have downloaded the files listed above and installed and configured the software.
Launch MongoDB Compass as needed and click on the Connect
button to connect to the localhost
MongoDB deployment.
Click on the Databases tab, and then click Create database.
In the window that pops up, enter the following:
imdb
movies
and click Create Database.
Choose the imdb
database from the list of databases. You should
see an empty collection called movies
.
After choosing the imdb
database from the list of databases,
click Create collection and create a new collection with the
name people
.
After choosing the imdb
database from the list of databases,
click Create collection and create a new collection with the
name oscars
.
To add documents to the movies
collection:
In MongoDB Compass, select the collection from the list of
collections for imdb
.
Scroll down as needed to find the Import Data button and click it (NOT the Add Data one).
If needed, click Select a file.... (If you already see a pop-up window, this step isn’t needed.)
Choose the JSON file for the collection that you downloaded
above (e.g., for movies
, the file is called movies.json
).
Click Import. If needed, click DONE.
You should see that the collection now includes the documents that were found in the JSON file.
Repeat step 8 to add documents to the oscars
and
people
collections.
Once you have everything installed and configured, you can perform queries by taking the following steps:
Launch MongoDB Compass.
Click the Connect button to connect to the MongoDB server that should
be running on localhost
.
Click on the MONGOSH
bar at the bottom of the window for MongoDB
Compass. Doing so should expand a pane for the MongoDB shell, and
should see the following command prompt:
test>
Enter the following command from the shell prompt:
use imdb
Note: You may need to hit the Enter/Return key twice in order for the command to be accepted. Once it is, the shell’s prompt should be changed to the following:
imdb>
Enter the appropriate method calls for your queries. If you simply enter the method call, you should automatically see up to the first 20 results.
If there are more than 20 results, you can enter the following command to continue iterating through them:
it
If you’re using a Mac, you should disable smart quotes, because they may lead to errors in MongoDB and in our testing. There are instructions for doing so here.
ps5_queries.py
is a Python file, so you could use a Python IDE
to edit it, but a regular text editor like TextEdit or Notepad++
would also be fine. However, if you use a text editor, you must
ensure that you save it as a plain-text
file.
Construct the MongoDB method calls needed to solve the problems given
below. Test each method call in MONGOSH
to make sure that it works.
Once you have finalized the method call for a given problem, copy
the call into your ps5_queries.py
file, putting it between
the triple quotes provided for that query’s variable. We have
included a sample query to show you what the format of your
answers should look like.
Each of the problems must be solved by means of a single query
(i.e., a single method call). The results of the query should
include only the requested information, with no extraneous fields.
In particular, you should exclude the _id
field from the results
unless the problem indicates otherwise.
You do not need to worry about the order of the fields in the results, nor the places in which line breaks or spaces appear.
Unless the problem indicates otherwise, you may only use aspects of the MongoDB query language that we discussed in lecture.
Your queries should only use information provided in the problem itself. In addition, they should work for any MongoDB database that follows the schema that we discussed in lecture.
Make sure to follow the guidelines above.
In Problem Set 1, you wrote a SQL query to find information about Ryan Gosling and America Ferrera. Write a MongoDB query to find the same information. The result of your query should be documents that each contain the name, date of birth and place of birth of one of those two people.
A number of movies in our database are unrated – i.e., they
were never assigned a rating like PG or R. In such cases, we don’t
include a rating
field in the document for the movie. Most of
these unrated movies were released before 1960. Write a query that
finds the names and years of all unrated movies released in or after
1960.
In Problem Set
1, you
wrote a SQL query to find information about all Oscar-winning
movies whose names begin with the word "The"
. Write a MongoDB
query to find only the names of those movies. If a given movie has
won multiple Oscars, it should appear only once in your results.
Hints:
You should use the single-purpose aggregation method called
distinct
that we covered covered in lecture, not an
aggregation pipeline. This method will ensure that a given
name appears at most once.
You will need to use dot notation for the name of the field
that you pass into the distinct
method, and therefore you
will need to surround that field name with quotes.
You will need to use pattern matching to construct a pattern that obtains movie names whose first word is “The” without getting ones whose first word begins with “The” (e.g., “There”) or ones that have “The” somewhere after the very beginning. You should NOT include movies whose name moves the word “The” to the end of the string (e.g., “Usual Suspects, The”).
In Problem Set 1, you wrote a SQL query to find the most recent movie(s) with a G rating – i.e., the name(s) and year of the G-rated movie or movies whose year is closest to this year. Write a MongoDB query to solve the same basic problem, but you should handle ties differently than we did in PS 1. If two or more G-rated movies are tied for being the most recent, your query may produce the name and year of any one of them.
Write a query to find the names and places of birth of all actors in the database who were born in Mexico.
Hints:
You will need to use pattern matching to construct a pattern that obtains places of birth whose last word is “Mexico” without getting ones that include “Mexico” elsewhere in the value (e.g., people born in New Mexico).
If someone is both an actor and a director, they should be included in the results. However, if someone is only in our database as a director, they should not be included.
Write a query to find the number of R-rated movies in the database
with a runtime between 120 and 180 minutes – including ones with
runtimes of exactly 120 and exactly 180. You should use the
single-purpose aggregation method called count
that we covered
covered in lecture, not an aggregation pipeline.
Note: You should ignore the “DeprecationWarning” message that you
will receive when you use the count
method.
Now let’s extend the previous problem. Write a query to find, for each rating, the number of movies with that rating that have a runtime between 120 and 180 minutes, along with some additional information about those movies (see below). This time you will need to use an aggregation pipeline.
The results of the query should consist of documents with the following fields:
one called num_2_3_hours
whose value is the number of movies
with a given rating whose runtime is between 120 and 180
minutes
one called shortest
for the shortest runtime of those movies
one called longest
for the longest runtime of those movies.
one called rating
for the value of the rating itself.
Other requirements:
One of the documents that will be created when you compute
results for each rating is a document in which the rating is
null
; it summarizes movies without a rating. You should
exclude this result document from the final results of the
query. (Note: Be careful that you do not put the word
null
in quotes, since you are looking for the special
literal value null
, not the string "null"
).
Sort the results in increasing alphabetical order by rating
(i.e., the document for movies with a rating of G
should
come first).
It’s possible that not every movie rating has a movie that is between 120 and 180 minutes, and thus there may be some ratings that do not appear in your results.
Write a query to find all places of birth in the United States that are associated with 5 or more directors in our database. The results of the query should consist of documents with the following fields:
one called num_directors
for the number of directors born in
a given place of birth
one called pob
for the name of the place of birth.
The results documents should be sorted by the number of directors, from largest to smallest. If two or more places of birth have the same number of directors, you should secondarily sort their documents in alphabetical order by the name of the place of birth.
Hint: You may assume that all places of birth in the United States have the string “USA” at the very end of their name.
Christopher Nolan won this year’s Best Director Oscar for Oppenheimer. Write a query to create a summary of facts about movies that he directed. The result of the query should be a single document with the following fields:
num_movies
for the total number of movies in our
database that Nolan directedmovies
whose value is an array containing the
names of the movies that he directedavg_runtime
for the average runtime of movies
that he directedtop_rank
for the earnings rank of the movie
that he directed that has earned the most money. Remember that
the lower the earnings rank, the more money the movie has
made.Hint: In lecture, we mentioned an accumulator called $addToSet
that can be used to construct an array of values. It will be
useful for this problem.
Recall that a movie document only includes an earning_rank
field
if it’s one of the 200 top-grossing movies of all time. Find the
top 4 directors when it comes to the number of these top-grossing
movies that they have directed. The final result documents should
have the following fields:
num_top_grossers
for the number of top-grossing
movies a given person directeddirector
for the name of that person.You may assume that director names are unique.
Sort the results in decreasing order by the number of top-grossing movies that the person has directed. If two directors are tied when it comes to the number of top-grossing movies, you should break the tie by sorting by director name in alphabetical order.
Login to Gradescope by clicking the link in the left-hand navigation bar, and click on the box for CS 460.
Submit your ps5_queries.py
file using these steps:
Click on PS 5: Part II in the list of assignments. You should see a pop-up window with a box labeled DRAG & DROP. (If you don’t see it, click the Submit or Resubmit button at the bottom of the page.)
Add your file to the box labeled DRAG & DROP. You can either drag and drop the file from its folder into the box, or you can click on the box itself and browse for the file.
Click the Upload button.
You should see a box saying that your submission was successful.
Click the (x)
button to close that box.
The Autograder will perform some tests on your file. Once it is done, check the results to ensure that the tests were passed. If one or more of the tests did not pass, the name of that test will be in red, and there should be a message describing the failure. Based on those messages, make any necessary changes. Feel free to ask a staff member for help.
Notes:
You should see test results for each query. If you don’t see any results for a given query, it probably means that you have a syntax or logic error in your query, and you should attempt to fix it and resubmit.
You should keep making changes as needed until you get full credit for a given query. There will be no partial credit awarded for an incorrect query.
Make sure that each query is logically correct, and that it will work for any instance of the movie database that follows the data model outlined in lecture. We reserve the right to ultimately run your queries on a slightly different version of the database to ensure that your queries are logically correct.
If needed, use the Resubmit button at the bottom of the page to resubmit your work.
Near the top of the page, click on the box labeled Code. Then click on the name of your file to view its contents. Check to make sure that the file contains the work that you want us to grade.
Important
It is your responsibility to ensure that the correct version of a file is on Gradescope before the final deadline. We will not accept any file after the submission window for a given assignment has closed, so please check your submission carefully using the steps outlined above.
If you are unable to access Gradescope and there is enough
time to do so, wait an hour or two and then try again. If you
are unable to submit and it is close to the deadline, email
your homework before the deadline to
cs460-staff@cs.bu.edu
Last updated on April 26, 2024.