Part I due by 11:59 p.m. on Tuesday, November 12, 2024.
Part II due by 11:59 p.m. on Monday, November 25, 2024.
In your work on this assignment, make sure to abide by the collaboration policies of the course. All of the problems in this assignment are individual-only problems that you must complete on your own.
If you have questions, please come to office hours, post them on Piazza, or email cs460-staff@cs.bu.edu.
Make sure to submit your work on Gradescope, following the procedures found at the end of Part I and Part II.
50 points total
Create a subfolder called ps4 within your cs460 folder, and put all of the files for this assignment in that folder.
This part of the assignment will all be completed in a single PDF file. To create it, you should do the following:
Access the template that we have created by clicking on this link and signing into your Google account as needed.
When asked, click on the Make a copy button, which will save a copy of the template file to your Google Drive.
Select File->Rename, and change the name of the file to ps4_partI.
Add your work for all of the problems from Part I to this file.
Once you have completed Part I, choose File->Download->PDF document, and save the PDF file on your machine. The resulting PDF file (ps4_partI.pdf) is the one that you will submit. See the submission guidelines at the end of Part I.
30 points total; 10 points each part
Consider the following sequence of operations involving four transactions and two data items, A and B:
s1; s2; s3; s4; r2(A); w3(B); r2(B); r3(A); w1(B); r4(B); r1(A); w1(A); c1; c2; c3; c4
where si indicates the start of transaction Ti, which is when its timestamp is assigned, and ci indicates the commit of transaction Ti.
Assume that the transactions are assigned the following timestamps, based on the order in which they start:
and that RTS(A), WTS(A), RTS(B) and WTS(B) are all initially 0.
Given these assumptions:
In the first table for this problem in ps4_partI, we’ve filled in the row for the first operation. Complete the table to show how the system would respond to the remaining read and write requests when it is using regular timestamp-based concurrency control without commit bits.
In the first column of the table, put the requested operation.
In the second column of the table, indicate the response of the system by selecting the correct option from the following list:
In the third column of the table, you should do one of the following:
If the action is ignored or denied, include a brief explanation. For example, if a transaction T7 tried to read item C and its read were too late, you would include something like this:
TS(T7) < WTS(C)
Summarize any changes to the state maintained for items A and B, as we’ve done in the first row. Note that there may be changes in the item’s state even when the action is denied.
If an action is allowed and it doesn’t lead to any changes in the state maintained for the items, simply put “no changes”.
Because commits don’t have an effect when we’re not using commit bits, you should not include the commit actions (c1, c2, etc.) in this table.
You do not need to restart a transaction that is rolled back, which means that you should skip any requests by a transaction that come after the point at which it is rolled back. As a result, you may not need all of the rows in the table.
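As a reminder of the general technique, here is a minimal sketch of the textbook rules for timestamp-based concurrency control without commit bits. The class name, method names, item, and timestamps are all hypothetical, and the sketch leaves out the bookkeeping that accompanies a rollback; the rules given in lecture are the ones you should apply.

```java
// Sketch only: textbook timestamp-ordering rules for a single data item.
// All names and timestamps here are hypothetical.
class TimestampSketch {
    // State maintained for one data item.
    static class Item {
        int rts = 0;  // largest timestamp of a transaction that has read the item
        int wts = 0;  // timestamp of the transaction that wrote the current value
    }

    // Responds to a read request from a transaction with timestamp ts.
    static String read(Item x, int ts) {
        if (ts < x.wts) {
            return "denied";            // read too late: TS(T) < WTS(X)
        }
        x.rts = Math.max(x.rts, ts);    // RTS never decreases
        return "allowed";
    }

    // Responds to a write request from a transaction with timestamp ts.
    static String write(Item x, int ts) {
        if (ts < x.rts) {
            return "denied";            // write too late: TS(T) < RTS(X)
        }
        if (ts < x.wts) {
            return "ignored";           // a later write is already in place
        }
        x.wts = ts;
        return "allowed";
    }

    public static void main(String[] args) {
        Item c = new Item();
        System.out.println(write(c, 50));  // allowed; WTS(C) = 50
        System.out.println(read(c, 40));   // denied, since 40 < WTS(C)
        System.out.println(read(c, 70));   // allowed; RTS(C) = 70
        System.out.println(write(c, 60));  // denied, since 60 < RTS(C)
        System.out.println(write(c, 80));  // allowed; WTS(C) = 80
        System.out.println(write(c, 75));  // ignored, since RTS(C) <= 75 < WTS(C)
    }
}
```

Tracing a short, made-up schedule like the one in main by hand is a good way to check your understanding before you fill in the table.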
Complete the second table that we’ve provided to show the system’s response to this sequence of operations if the DBMS is using regular timestamp-based concurrency control with commit bits.
In addition to the reads and writes, you should include any commit actions that are able to occur, since they may cause a change in state.
In the second column of the table, there are now four options to choose from:
In the third column of the table, make sure to also include changes to the commit bit when appropriate. For example, “c = false” or “c = true”.
If a transaction is made to wait, your table should include an additional row for the operation in question when the wait comes to an end – something like:
request | response of the system |
---|---|
... | |
r7(X) | denied; make wait |
... | |
r7(X) | allowed |
Important: If a transaction is made to wait, it cannot make any forward progress until the wait comes to an end, so you should temporarily skip any operations by that transaction in the schedule.
If more than one transaction is waiting for the same transaction, when the wait comes to an end, you should assume that the waiting transaction with the smaller transaction number (call it T) is allowed to try again first. In addition, if you needed to skip actions by T while it was waiting, you should make as much progress as possible with those skipped actions before you allow the next waiting transaction to try again.
Once again, you do not need to restart a transaction that is rolled back, which means that you should skip any actions by a transaction that come after the point at which it is rolled back. As a result, you may not need all of the rows in the table.
Explain what will happen in response to this sequence of operations if the DBMS uses multiversion timestamp-based concurrency control without commit bits.
In the third column, make sure to specify which version’s state is being updated (e.g., “create A(t) with RTS = 0” or “RTS(A(t)) = ...”, where t is the timestamp of the version). In addition, you should include an item’s version number in any explanation of why an action wasn’t allowed.
Similarly, if a transaction is allowed to read A, make sure to indicate in the second column which version it is allowed to read (e.g., if there was a version with a timestamp of 100 that it was allowed to read, you would say “allowed to read A(100)”).
Because commits don’t have an effect when we’re not using commit bits, you do not need to include the commit actions (c1, c2, etc.) in this table.
Once again, you do not need to restart a transaction that is rolled back, which means that you can skip any requests by a transaction that come after the point at which it is rolled back. As a result, you may not need all of the rows in the table.
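The version bookkeeping described above can be sketched as follows. This is a minimal illustration of the textbook multiversion rules, assuming that each item starts with a single version X(0) whose RTS is 0; the class name, item name, and timestamps are hypothetical, and the lecture notes remain the authority on the exact rules.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch only: the versions of one hypothetical item X, keyed by write timestamp.
class MultiversionSketch {
    // Maps each version's timestamp t to RTS(X(t)).
    final TreeMap<Integer, Integer> rtsByVersion = new TreeMap<Integer, Integer>();

    MultiversionSketch() {
        rtsByVersion.put(0, 0);  // assumed initial version X(0) with RTS = 0
    }

    // A read uses the newest version written at or before ts.
    String read(int ts) {
        Map.Entry<Integer, Integer> v = rtsByVersion.floorEntry(ts);
        rtsByVersion.put(v.getKey(), Math.max(v.getValue(), ts));
        return "allowed to read X(" + v.getKey() + ")";
    }

    // A write is denied if a later transaction has already read the
    // version that the new version would immediately follow.
    String write(int ts) {
        Map.Entry<Integer, Integer> v = rtsByVersion.floorEntry(ts);
        if (ts < v.getValue()) {
            return "denied: TS(T) < RTS(X(" + v.getKey() + "))";
        }
        if (v.getKey() == ts) {
            return "allowed; overwrite X(" + ts + ")";
        }
        rtsByVersion.put(ts, 0);
        return "allowed; create X(" + ts + ") with RTS = 0";
    }

    public static void main(String[] args) {
        MultiversionSketch x = new MultiversionSketch();
        System.out.println(x.write(30));  // allowed; create X(30) with RTS = 0
        System.out.println(x.read(20));   // allowed to read X(0)
        System.out.println(x.read(50));   // allowed to read X(30)
        System.out.println(x.write(40));  // denied: TS(T) < RTS(X(30))
    }
}
```

The sorted map makes the "newest version at or before ts" lookup a single floorEntry() call, which is why a TreeMap is used here rather than a list.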
20 points total
Assume that a database is replicated using synchronous replication across 8 different sites. In other words, there are 8 copies of each item.
Consider the following voting schemes:
Fill in the tables that we’ve provided to answer the following questions:
Would these voting schemes work if the system uses fully distributed locking? Follow the guidelines below.
Would these voting schemes work if the system uses primary-copy locking? Follow the guidelines below.
Guidelines:
If a given scheme would work, put “yes” in the second column of the table and use the third column to specify the inequality or inequalities that allow you to draw that conclusion.
For example, in lecture we considered a voting scheme in which there were 9 copies instead of 8, and a transaction needed to read 5 copies and write 5 copies. One of the relevant inequalities for determining that this scheme works is 5 > 9 - 5. Depending on the type of locking, you might also need to include the inequality 5 > 9/2. (Make sure to adjust your inequalities to the specifics of each scheme – including the fact that there are 8 copies of each item, not 9!)
If a given scheme would not work, put “no” in the second column of the table. In the third column, you should not specify the relevant inequalities. Rather, you should list all possible problematic scenarios that could arise under that scheme. The problematic scenarios depend on the voting scheme and the type of locking, and they could include any of the following:
A transaction may not always read the most up-to-date value of a given item.
Two transactions can get a global exclusive lock for the same item at the same time.
If one transaction has a global exclusive lock for an item, another transaction can get a global shared lock for that item, and vice versa.
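The two inequalities from the lecture example can be checked mechanically. The sketch below is only an illustration: the class and method names are hypothetical, it uses the 9-copy example rather than any of the schemes in this problem, and it does not say which checks a given type of locking requires.

```java
// Sketch only: quorum-overlap checks for n copies, where a transaction
// must read r copies and write w copies. All names are hypothetical.
class VotingSketch {
    // True if every set of r copies shares a copy with every set of w copies.
    static boolean readOverlapsWrite(int n, int r, int w) {
        return r > n - w;
    }

    // True if any two sets of w copies share a copy, i.e., w > n/2.
    static boolean writeOverlapsWrite(int n, int w) {
        return 2 * w > n;  // same as w > n/2, without integer division
    }

    public static void main(String[] args) {
        // The lecture example: 9 copies, read 5, write 5.
        System.out.println(readOverlapsWrite(9, 5, 5));   // true, since 5 > 9 - 5
        System.out.println(writeOverlapsWrite(9, 5));     // true, since 5 > 9/2
        // A made-up scheme that fails both checks: 9 copies, read 2, write 4.
        System.out.println(readOverlapsWrite(9, 2, 4));   // false, since 2 <= 9 - 4
        System.out.println(writeOverlapsWrite(9, 4));     // false, since 4 <= 9/2
    }
}
```

Note that writeOverlapsWrite compares 2 * w with n rather than computing n / 2, because integer division in Java would truncate 9/2 to 4.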
Log in to Gradescope by clicking the link in the left-hand navigation bar. Once you are logged in, click on the box for CS 460.
Submit your ps4_partI.pdf file using these steps:
If you still need to create the PDF file, open your file on Google Drive, choose File->Download->PDF document, and save the PDF file on your machine.
Click on the name PS 4: Part I in the list of assignments. You should see a pop-up window labeled Submit Assignment. (If you don’t see it, click the Submit or Resubmit button at the bottom of the page.)
Choose the Submit PDF option, and then click the Select PDF button and find the ps4_partI.pdf that you created in step 1. Then click the Upload PDF button.
You should see an outline of the problems along with thumbnails of the pages from your uploaded PDF. For each problem in the outline:
As you do so, click on the magnifying glass icon for each page and double-check that the pages that you see contain the work that you want us to grade.
Once you have assigned pages to all of the problems in the question outline, click the Submit button in the lower-right corner of the window.
You should see a box saying that your submission was successful.
Click the (x) button to close that box.
You can use the Resubmit button at the bottom of the page to resubmit your work as many times as needed before the final deadline.
Important
It is your responsibility to ensure that the correct version of a file is on Gradescope before the final deadline. We will not accept any file after the submission window for a given assignment has closed, so please check your submission carefully using the steps outlined above.
If you are unable to access Gradescope and there is enough time to do so, wait an hour or two and then try again. If you are unable to submit and it is close to the deadline, email your homework before the deadline to cs460-staff@cs.bu.edu.
50 points total
Important
You may complete the first two problems from this section with a partner, but the final two problems must be completed on your own.
In this assignment, you will write Java programs to run MapReduce jobs using Apache Hadoop.
Because it can be challenging to install and run Apache Hadoop locally, you will instead be running and testing your programs on Gradescope after eliminating any syntax errors in your code. You will NOT be running them on your own machine.
The Autograder will run Hadoop in local mode, which simulates a distributed cluster. However, the programs that you construct could also be run on a real cluster without modification – and they could handle much more data than we will be using in our testing!
Our problem domain is a social network database that contains information about users and their connections to other users in the network.
The data is stored in one or more plain-text files, where each line of a given file represents a single user and looks something like this:
18,Brown,Matthew,1989-11-05,mbrown@gmail.com,189,305,17,31;569121,235708,32087,188745,549575
From left to right, each line contains the following fields:
The fields are comma-separated, with the exception of a semicolon that appears before the friend list. However, if a user has no friends, the line will not contain a semicolon.
Here’s a subset of lines from the largest input file (users200k.txt), showing a number of different (though not all!) variations:
65,Lewis,James,1965-12-26,jlewis1965@hotmail.com,257,7,227;911255,176121,554966,671982,775492,834730,948609
80,Lawson,Emerald,1997-10-01,270,144,368,201,484,8;40146,44545,285685,734547,861038
201,White,Alexander,1958-08-19,awhite1958@yahoo.com;979442,33777,416988,823482,920887,929242
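One possible way to take such a line apart is to separate the friend list first and then split what remains on commas. The sketch below is only a starting point: the class and method names are hypothetical, and it assumes (based on the sample lines above) that an email address, when present, is the fifth comma-separated field.

```java
// Sketch only: pulling apart one line of user data. All names are hypothetical.
class UserLineSketch {
    // Returns the comma-separated fields that come before the friend list.
    static String[] mainFields(String line) {
        String beforeFriends = line.split(";")[0];  // whole line if there is no ';'
        return beforeFriends.split(",");
    }

    // Returns the number of friend IDs, or 0 if the line has no semicolon.
    static int friendCount(String line) {
        int semi = line.indexOf(';');
        if (semi == -1) {
            return 0;
        }
        return line.substring(semi + 1).split(",").length;
    }

    // Assumption: an email, when present, is the fifth field and contains '@'.
    static boolean hasEmail(String[] fields) {
        return fields.length > 4 && fields[4].indexOf('@') != -1;
    }

    public static void main(String[] args) {
        String line = "80,Lawson,Emerald,1997-10-01,270,144,368,201,484,8;"
                    + "40146,44545,285685,734547,861038";
        String[] fields = mainFields(line);
        System.out.println(fields.length);      // 10 fields before the semicolon
        System.out.println(hasEmail(fields));   // false: the fifth field is "270"
        System.out.println(friendCount(line));  // 5
    }
}
```

Handling the semicolon before the commas keeps the friend IDs from being mistaken for the optional fields that precede them.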
You should begin by downloading the following zip file:
ps4_mapreduce.zip
Unzip/extract the contents of the file.
Depending on your system, after extracting the contents you will either have:
a folder named ps4_mapreduce that contains all of the files that you need for the map-reduce problems
an outer folder called ps4_mapreduce that contains an inner folder named ps4_mapreduce that contains all of the Java files that you need.
Take the ps4_mapreduce folder that actually contains the necessary files and drag it into your ps4 folder so that you can easily find and open it from within VS Code.
Launch VS Code on your laptop.
In VS Code, select the File->Open Folder or File->Open menu option, and use the resulting dialog box to find and open the ps4_mapreduce folder that you placed in your ps4 folder above – the one that contains the provided files. (Note: You must open the folder; it is not sufficient to simply open one of the Java files in the folder.)
The name of the folder should appear in the Explorer pane on the left-hand side of the VS Code window, along with a list of all of its contents.
Review the provided files. We have given you:
WordCount.java - a sample MapReduce program that you are welcome to use as a template for your programs
starter code for each of the problems (Problem3.java, Problem4.java, etc.)
data - a folder containing sample input data files for the problems you will solve. They include users20.txt, which we use by default when testing your programs on Gradescope (see below).
IMPORTANT: You should NOT try to run any of these programs on your own machine! Rather, you should follow the procedures outlined below.
As mentioned above, we will be running and testing your programs on Gradescope instead of on your own machine.
However, you should still use the compiler in VS Code to eliminate any syntax errors in your programs before you try to test them on Gradescope.
Finding and eliminating syntax errors
Click on the name of the program in the Explorer pane to open it in the Editor window.
Open a Terminal window in VS Code, and click on the Problems tab at the top of the Terminal window.
Locate any error messages for the program you’re working on. They begin with a red circle with an X in it. Clicking on the error message will show you the corresponding line from your code.
Note: You may see some warning messages, which begin with a yellow triangle symbol that contains an exclamation mark (!). These can be safely ignored.
Make whatever changes are needed to eliminate all of the error messages.
Remember: do NOT try to run your code on your own machine!
Testing and debugging
Once you have eliminated the syntax errors, save the program in VS Code.
Upload the program to Gradescope. Each problem has its own Gradescope page, and you have two options that you can use for testing:
Upload just the program itself (i.e., the Java file). Doing so will run the program on a small input file and compare the results to the expected results.
Upload the program and one or more input files like the ones we give you in the data folder. Doing so will run the program on those input files and show you the results, but it will not assess their accuracy. By enabling you to create and use your own input files, this testing option will allow you to ensure that your code handles edge cases that may not be present in the default input file.
Review the results displayed in Gradescope.
Fix any logic errors and resubmit until you obtain the correct results.
Important: To assist you in debugging, you can add temporary println statements to your code. If you do so, Gradescope will include the output of those statements as part of the results that it displays.
Employ good programming style. Use appropriate indentation, select descriptive variable names, localize variables, insert blank lines between logical parts of your program, and add comments as necessary to explain what your code does.
Your nested classes for mappers and reducers should extend the built-in Mapper and Reducer classes, as discussed in lecture and shown in WordCount.java. Your mapper classes will override the inherited map() method, and your reducer classes will override the inherited reduce() method. You should not override or make use of the other inherited methods of the Mapper and Reducer classes.
You are welcome to include additional helper methods in your classes as needed, although doing so is not required. You may also include additional helper classes for a given problem, but they should go in the same file as the rest of the code for that problem, and they should also be nested static classes.
You will need to determine how to correctly parse each line in the input file. In particular, you will need to account for the optional fields in a given line. We encourage you to use the following methods as needed:
the String method called split() that we discussed in lecture. (Note: We have imported the java.util package at the top of each file so that you can use the Arrays.toString() method as needed. You may find it useful during debugging, since it can take the array produced by split() and turn it into a string that you can print using a temporary println statement.)
other String methods like indexOf() and substring(); however, you should not use anything that isn’t available in Java 8
the following static methods for converting from a Java String object that is the string representation of a number to the corresponding numeric value:
Integer.parseInt()
Long.parseLong()
Double.parseDouble()
You may assume that the birthdate and email fields are well-formed; that is, the birthdate will always be of the form YYYY-MM-DD, where YYYY is the four-digit birth year, MM is the two-digit month, and DD is the two-digit day. The email, if present, will contain an occurrence of the @ character to separate the user name from the domain.
You should limit yourself to the packages that we’ve imported at the top of each starter file. You must not use classes from any other Java package.
You should only use features of Java that were available in Java 8.
You should not use any global variables (i.e., static class variables) in your programs. Class constants (i.e., static final variables) are fine if needed.
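The String and conversion methods mentioned above can be combined roughly as shown below. This is a sketch for experimentation rather than starter code: the class and helper-method names are hypothetical, and it relies only on the stated assumptions that birthdates have the form YYYY-MM-DD and that emails contain an @ character.

```java
import java.util.Arrays;

// Sketch only: small uses of split(), indexOf(), substring(), and the
// parse methods. The class and helper names are hypothetical.
class StringHelpersSketch {
    // Extracts the birth year from a birthdate of the form YYYY-MM-DD.
    static int birthYear(String dob) {
        return Integer.parseInt(dob.substring(0, dob.indexOf('-')));
    }

    // Returns the portion of an email address that follows the '@'.
    static String afterAt(String email) {
        return email.substring(email.indexOf('@') + 1);
    }

    public static void main(String[] args) {
        String record = "33,Gutierrez,Mary,1961-07-12,mary.gutierrez1@hotmail.com,10,19";
        String[] fields = record.split(",");

        // Temporary debugging output: shows every element produced by split().
        System.out.println(Arrays.toString(fields));

        System.out.println(birthYear(fields[3]));       // 1961
        System.out.println(afterAt(fields[4]));         // hotmail.com
        System.out.println(Long.parseLong(fields[0]));  // 33
    }
}
```

Printing the array with Arrays.toString() is especially helpful for records with missing optional fields, because it shows exactly which position each remaining value ends up in.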
Feel free to use these resources as needed:
13 points; pair-optional
This is one of only two problems in this assignment that you may complete with a partner. See the rules for working with a partner on pair-optional problems for details about how this type of collaboration must be structured.
Write a MapReduce program to find the number of users for each email-address domain. The output of the program should be (key, value) pairs in which the key is an email-address domain and the value is the number of users that have addresses from that domain. For example:
icloud.com 500
gmail.com 250
...
(although you will probably get different numbers than the ones shown above!)
Notes:
In Problem3.java, we have given you a template for your code. You only need to implement the bodies of the map and reduce methods.
To allow for large numbers (i.e., long integers) in the final results, we’ve specified that the values output by the reducer should be of type LongWritable. Make sure that your code uses a variable of type long when determining the counts.
Use the birth-month counter program from lecture as a model for what you should do.
Don’t forget that not all user records include an email address. However, you may assume that at least one of the input records includes an email address.
You may assume that all email addresses include an @ symbol.
If a given user doesn’t have an email address, the map method can simply return without writing anything.
13 points; pair-optional
This is one of only two problems in this assignment that you may complete with a partner. See the rules for working with a partner on pair-optional problems for details about how this type of collaboration must be structured.
Write a MapReduce program to find the email domain with the youngest users – i.e., the domain whose users have the smallest average age. The final output should be a single (key, value) pair in which the key is the email domain with the youngest users, and the value is the average age of that domain’s users, specified as a floating-point number. If there are multiple email domains whose members are tied for the smallest average age, your program may report any one of them.
Notes:
You will need a chain of two MapReduce jobs for this problem. We discussed how to do this in lecture.
The second job will process the results of the first job. Because those results are stored in a text file, the key-value pairs passed into the second map function will have the following format:
The key will be a file-offset value that you should ignore – just as any other map function that processes data stored in a text file typically ignores its keys.
The value will be a Text value consisting of one line from the results file produced by the first job. In other words, it will contain one of the key-value pairs written by the first job’s reduce function, with a single tab character ("\t") between the key and the value.
In the second job, the map function should ensure that all of the (key, value) pairs that it outputs have the same constant key. You may either use a string or an integer for this purpose. Using a constant key will ensure that all of these pairs go to a single reducer task, which will then be able to determine which email domain’s users have the smallest average age.
When determining a user’s age, you should simply subtract the person’s year of birth from 2024, regardless of when their birthday is. For example, if someone has a dob string of "2000-12-20", they should be considered to have an age of 24 (because 2024 - 2000 = 24), even though they haven’t celebrated their birthday yet this year.
When computing the average age for an email domain, make sure to take into account how the division operator works in Java. In order to perform floating-point division, at least one of the two operands must be a floating-point number. One way to ensure this is to use a type cast.
You can convert the string representation of an integer into a value of type int by using the built-in Integer.parseInt() method. Similarly, you can convert the string representation of a floating-point number into a value of type double by using the built-in Double.parseDouble() method.
You may assume that all of the average ages will be less than 120.0.
We have given you templates for the two sets of mapper and reducer classes, but they are more limited than what we provided for the previous problem. In particular:
The extends clauses for the mapper and reducer classes look like this:
extends <Object, Object, Object, Object>
As needed, you should replace a given occurrence of Object with the class name for the appropriate Hadoop data type. See the lecture notes for a reminder of the purpose of each of the four types.
We haven’t given you headers for the map and reduce methods. Make sure that you use the appropriate types for the parameters of these methods, and that you include the same throws clauses that we used for those methods in the previous problem.
In the main method, you should modify as needed the lines that specify the classes for the output keys and output values of the mappers and reducers. You should not change the other lines in this method.
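Several of the notes above involve small pieces of plain Java that are easy to try out separately from Hadoop. The sketch below illustrates three of them: the age rule, floating-point division with a type cast, and splitting a tab-separated line. The class name, method names, and sample values are hypothetical, and none of this is the required structure of your solution.

```java
// Sketch only: plain-Java pieces related to the notes above.
// All names and sample values are hypothetical.
class ChainedJobSketch {
    static final int CURRENT_YEAR = 2024;  // class constant, per the age rule above

    // Age is the current year minus the birth year, regardless of birthday.
    static int age(String dob) {
        return CURRENT_YEAR - Integer.parseInt(dob.substring(0, 4));
    }

    // The cast makes one operand a double, so the division is floating-point.
    static double average(long sum, long count) {
        return (double) sum / count;
    }

    // Splits a line of the form key<TAB>value into its two parts.
    static String[] keyAndValue(String line) {
        return line.split("\t");
    }

    public static void main(String[] args) {
        System.out.println(age("2000-12-20"));  // 24
        System.out.println(7 / 2);              // 3 (integer division truncates)
        System.out.println(average(7, 2));      // 3.5

        String[] kv = keyAndValue("example.org\t31.5");
        double value = Double.parseDouble(kv[1]);
        System.out.println(kv[0] + " -> " + value);  // example.org -> 31.5
    }
}
```

Comparing the output of 7 / 2 with that of average(7, 2) shows why the cast matters when both operands are integers.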
12 points; individual-only; you MUST complete this problem on your own
Recall that a user’s record may optionally include one or more group IDs – ID numbers of groups to which the user belongs. For example, consider the following record:
33,Gutierrez,Mary,1961-07-12,mary.gutierrez1@hotmail.com,10,19;55
It indicates that Mary Gutierrez belongs to two groups: group 10 and group 19.
Write a MapReduce program to find the group with the most members. The final output of the program should be a single (key, value) pair in which the key is the ID of the group with the most members and the value is the number of members of that group. If there are multiple groups that are tied for the most members, your program may report any one of them.
Notes:
As in the previous problem, you will need a chain of two MapReduce jobs for this problem.
Don’t forget that not all user records include a list of groups. However, you may assume that there is at least one record that includes a group, and thus there will always be a final result.
Make sure that your solution is able to accommodate large final counts.
You can convert the string representation of a long integer into a value of type long by using the built-in Long.parseLong() method.
As we did in the previous problem, we have given you templates for the two sets of mapper and reducer classes, and you will need to select and fill in the appropriate class names in their extends clauses, and then define the appropriate map and reduce methods.
Here again, you should modify as needed the lines in the main method that specify the classes for the output keys and output values of the mappers and reducers. You should not change the other lines in this method.
12 points; individual-only; you MUST complete this problem on your own
In the previous problem, you found the group with the most members. In this problem, you will find the user who belongs to the most groups!
The final output of the program should be a single (key, value) pair in which the key is the ID of the user who belongs to the most groups, and the value is the number of groups to which they belong. If there are multiple users who are tied for the most groups, your program may report any one of them.
Unlike the previous problem, you will only need a single map-reduce job, rather than a chain of two jobs. That’s because a given user has only one record in the data files, and thus the initial mapper can determine each user’s number of groups on its own.
Notes:
You will need to take appropriate steps to ensure that all of the key-value pairs output by the mapper tasks are processed by a single reducer task, which will then be able to determine which user belongs to the most groups.
Don’t forget that not all user records include a list of groups. However, in order to handle input files in which no one belongs to any groups, your map method should still write something for every user. That way, the reduce method will still receive some input, and it can conclude that the maximum number of groups is 0!
You may assume that long integers are not needed for the counts of the number of groups to which a user belongs.
Once again, we have given you a template for your code, and you will need to select and fill in the appropriate class names in the extends clauses for the two inner classes, and then define appropriate map and reduce methods.
In addition, you should modify as needed the lines in the main method that specify the classes for the output keys and output values of the mapper and reducer. You should not change the other lines in this method.
You should submit only the following four files:
Problem3.java
Problem4.java
Problem5.java
Problem6.java
Each problem should be submitted to its own Gradescope page.
Here are the steps:
Log in to Gradescope and click on the box for CS 460.
Click on the appropriate problem in the list of assignments. You should see a pop-up window with a box labeled DRAG & DROP. (If you don’t see it, click the Submit or Resubmit button at the bottom of the page.)
Add your files to the box labeled DRAG & DROP. You can either drag and drop the files from their folder into the box, or you can click on the box itself and browse for the files.
Click the Upload button.
You should see a box saying that your submission was successful.
Click the (x) button to close that box.
The Autograder will perform some tests. Once it is done, check the results to ensure that the tests were passed. If one or more of the tests did not pass, the name of that test will be in red, and there should be a message describing the failure. Based on those messages, make any necessary changes. Feel free to ask a staff member for help.
Notes:
Passing the preliminary tests does not guarantee that you will ultimately get full credit for the problem. Additional tests will be run later, and you should perform your own testing following the procedures outlined above to ensure that your code works correctly in all cases.
In particular, don’t forget that you can upload your own input file(s) when you submit your program. We recommend that you try input files that include instances of special cases (e.g., users without email addresses or groups). To do so, you can use one or more of the text files that we have provided in the data subfolder. Either use one of those files in its entirety, or create a new file containing a subset of lines from those files (edited as needed) so that you can focus on particular types of cases.
If needed, use the Resubmit button at the bottom of the page to resubmit your work. Important: Every time that you make a submission, you should submit all of the files for that Gradescope assignment, even if some of them have not changed since your last submission.
Near the top of the page, click on the box labeled Code. Then click on the name of each file to view its contents. Check to make sure that the files contain the code that you want us to grade.
Important
It is your responsibility to ensure that the correct version of a file is on Gradescope before the final deadline. We will not accept any file after the submission window for a given assignment has closed, so please check your submission carefully using the steps outlined above.
If you are unable to access Gradescope and there is enough time to do so, wait an hour or two and then try again. If you are unable to submit and it is close to the deadline, email your homework before the deadline to cs460-staff@cs.bu.edu.
Last updated on November 13, 2024.