alice@bu.edu
and bob@bu.edu
, the unique identifier for their team will be alice_bob
, where, in cases with two members, the two usernames are in ascending alphabetical order and separated by an underscore. For a onemember team, it is simply the BU login name of the sole member (e.g., alice
).https://github.com/DataMechanics/course2016sprprojzero
, adding a single folder named using the group identifier (alice
or alice_bob
) that contains a single ASCII text file members.txt
. Each team member should commit a line to that text file specifying a mapping from their GitHub username and their BU username. For example, if Alice and Bob have the GitHub usernames alicegh
and bobgh
, respectively, then the file should look as follows:
members.txt
file you add above).
def union(R, S):
return R + S
def difference(R, S):
return [t for t in R if t not in S]
def intersect(R, S):
return [t for t in R if t in S]
def project(R, p):
return [p(t) for t in R]
def select(R, s):
return [t for t in R if s(t)]
def product(R, S):
return [(t,u) for t in R for u in S]
def aggregate(R, f):
keys = {r[0] for r in R}
return [(key, f([v for (k,v) in R if k == key])) for key in keys]
def map(f, R):
return [t for (k,v) in R for t in f(k,v)]
def reduce(f, R):
keys = {k for (k,v) in R}
return [f(k1, [v for (k2,v) in R if k1 == k2]) for k1 in keys]
map
operation applies some function f
to every keyvalue tuple and produces zero or more new keyvalue tuples. A reduce
operation collects all values under the same key and performs an aggregate operation f
on those values. Notice that the operation can be applied to any subset of the tuples in any order, so it is often necessary to use an operation that is associated and commutative.
R = [('Alice', 23), ('Bob', 19), ('Carl', 22)]
X = map(lambda k,v: [((k,v), (k,v))] if v > 20 else [], R) # Selection.
Y = reduce(lambda k,vs: k, X) # Keep same tuples (use tuples as unique keys).
R = [('Alice', ('F', 23)), ('Bob', ('M', 19)), ('Carl', ('M', 22))]
X = map(lambda k,v: [(v[0], v[1])], R) # Projection keeps only gender and age.
Y = reduce(lambda k,vs: (k, sum(vs)), X) # Aggregates ages by gender.
R = [('Alice', 23), ('Bob', 19), ('Carl', 22)]
S = [('Alice', 'F'), ('Bob', 'M'), ('Carl', 'M')]
X = map(lambda k,v: [(k, ('Age', v))], R)\
+ map(lambda k,v: [(k, ('Gender', v))], S)
Y = reduce(\
lambda k,vs:\
(k,(vs[0][1], vs[1][1]) if vs[0][0] == 'Age' else (vs[1][1],vs[0][1])),\
X\
)
D
containing some voting results broken down by voter, state where the voter participated in voting, and the candidate they chose:
sum([1 for (person, state, candidate) in D if state == "Massachusetts" and candidate == "Trump"])
prov
Python package to programmatically assemble a course granularity provenance record that conforms to the PROV standard and describes a particular execution of this script.
doc = prov.model.ProvDocument()
doc.add_namespace('alg', 'http://datamechanics.io/algorithm/alice_bob/') # The scripts in / format.
doc.add_namespace('dat', 'http://datamechanics.io/data/alice_bob/') # The data sets in / format.
doc.add_namespace('ont', 'http://datamechanics.io/ontology#')
doc.add_namespace('log', 'http://datamechanics.io/log#') # The event log.
doc.add_namespace('bdp', 'https://data.cityofboston.gov/resource/')
this_script = doc.agent('alg:example', {prov.model.PROV_TYPE:prov.model.PROV['SoftwareAgent'], 'ont:Extension':'py'})
resource = doc.entity('bdp:wc8wnujj',
{'prov:label':'311, Service Requests',
prov.model.PROV_TYPE:'ont:DataResource', 'ont:Extension':'json'}
)
this_run = doc.activity(
'log:a'+str(uuid.uuid4()), startTime, endTime,
{prov.model.PROV_TYPE:'ont:Retrieval', 'ont:Query':'?type=Animal+Found&$select=type,latitude,longitude,OPEN_DT'}
)
doc.wasAssociatedWith(this_run, this_script)
doc.used(this_run, resource, startTime)
found = doc.entity('dat:found', {prov.model.PROV_LABEL:'Animals Found', prov.model.PROV_TYPE:'ont:DataSet'})
doc.wasAttributedTo(found, this_script)
doc.wasGeneratedBy(found, this_run, endTime)
doc.wasDerivedFrom(found, resource, this_run, this_run, this_run)
repo.record(doc.serialize()) # Record the provenance document.
doc.serialize()
.
def dist(p, q):
(x1,y1) = p
(x2,y2) = q
return (x1x2)**2 + (y1y2)**2
def plus(args):
p = [0,0]
for (x,y) in args:
p[0] += x
p[1] += y
return tuple(p)
def scale(p, c):
(x,y) = p
return (x/c, y/c)
M = [(13,1), (2,12)]
P = [(1,2),(4,5),(1,3),(10,12),(13,14),(13,9),(11,11)]
OLD = []
while OLD != M:
OLD = M
MPD = [(m, p, dist(m,p)) for (m, p) in product(M, P)]
PDs = [(p, dist(m,p)) for (m, p, d) in MPD]
PD = aggregate(PDs, min)
MP = [(m, p) for ((m,p,d), (p2,d2)) in product(MPD, PD) if p==p2 and d==d2]
MT = aggregate(MP, plus)
M1 = [(m, 1) for ((m,p,d), (p2,d2)) in product(MPD, PD) if p==p2 and d==d2]
MC = aggregate(M1, sum)
M = [scale(t,c) for ((m,t),(m2,c)) in product(MT, MC) if m == m2]
print(sorted(M))
[(m, p) for ((m,p,d), (p2,d2)) in product(MPD, PD) if p==p2 and d==d2]
first filters product(MPD, PD)
using a selection criteria if p==p2 and d==d2
and then performs a projection from tuples of the form ((m,p,d), (p2,d2))
to tuples of the form (m, p)
to obtain the result.
N = ['a','b','c','d','e','f']
E = [('a','b'),('b','c'),('a','c'),('c','d'),('d','e'),('e','f'),('b','f')]
P = product(N,N)
I = [((x,y),100 if x != y else 0) for (x,y) in P]
D = [((x,y),1) for (x,y) in E]
OUTPUT = aggregate(union(I,D), min)
STEP = []
while sorted(STEP) != sorted(OUTPUT):
STEP = OUTPUT
P = product(STEP, STEP)
NEW = union(STEP,[((x,v),k+m) for (((x,y),k),((u,v),m)) in P if u == y])
OUTPUT = aggregate(NEW, min)
SHORTEST = OUTPUT
db.M.remove({});
db.P.remove({});
db.createCollection("M");
db.createCollection("P");
db.system.js.save({ _id:"dist" , value:function(u, v) {
return Math.pow(u.x  v.x, 2) + Math.pow(u.y  v.y, 2);
}});
function flatten(X) {
db[X].find().forEach(function(x) { db[X].update({_id: x._id}, x.value); });
}
function prod(X, Y, XY) {
db[XY].remove({});
db.createCollection(XY);
db[X].find().forEach(function(x) {
db[Y].find().forEach(function(y) {
db[XY].insert({left:x, right:y});
});
});
}
function union(X, Y, XY) {
db[XY].remove({});
db.createCollection(XY);
db[X].find().forEach(function(x) {
db[XY].insert(x);
});
db[Y].find().forEach(function(y) {
db[XY].insert(y);
});
}
// We'll only perform a single product operation.
// Using mapreduce, we can perform argmax and argmin more easily.
// We can also use mapreduce to compare progress.
db.M.insert([{x:13,y:1}, {x:2,y:12}]);
db.P.insert([{x:1,y:2},{x:4,y:5},{x:1,y:3},{x:10,y:12},{x:13,y:14},{x:13,y:9},{x:11,y:11}]);
db.createCollection("HASHOLD");
db.createCollection("HASHNEW");
var iter = 0;
do {
db.M.mapReduce(
function() { emit("hash", {hash: this.x + this.y})},
function(k, vs) {
var hash = 0;
vs.forEach(function(v) {
hash += v.hash;
});
return {hash: hash};
},
{out: "HASHOLD"}
);
prod("M", "P", "MP");
db.MPD.remove({});
db.MP.mapReduce(
function() { emit(this._id, {m:this.left, p:this.right, d:dist(this.left, this.right)}); },
function() { },
{out: "MPD"}
);
flatten("MPD");
// No need to flatten when multiple mapreduce operations are taking place.
db.MPD2.remove({});
db.MPD.mapReduce(
function() { emit(this.p._id, this); },
function(key, vs) {
var j = 0;
vs.forEach(function(v, i) {
if (v.d < vs[j].d)
j = i;
});
return vs[j];
},
{out: "MPD2"}
);
// Remember that the reduce operation will be applied multiple times, in any order.
// Cannot assume all will come in at once. Must be associative and commutative.
db.MPD2.mapReduce(
function() { emit(this.value.m._id, {x: this.value.p.x, y:this.value.p.y, c:1}); },
function(key, vs) {
var x = 0, y = 0, c = 0;
vs.forEach(function(v, i) {
x += v.x;
y += v.y;
c += v.c;
});
return {x:x, y:y, c:c};
},
{ finalize: function(k, v) { return {x: v.x/v.c, y: v.y/v.c}; },
out: "M"
}
);
flatten("M");
db.M.mapReduce(
function() { emit("hash", {hash: this.x + this.y})},
function(k, vs) {
var hash = 0;
vs.forEach(function(v) {
hash += v.hash;
});
return {hash: hash};
},
{out: "HASHNEW"}
);
var o = db.HASHOLD.find({}).limit(1).toArray()[0].value.hash;
var n = db.HASHNEW.find({}).limit(1).toArray()[0].value.hash;
print(o);
print(n);
print(iter);
iter++;
} while (o != n);
alice_bob
in accordance with Project #0.
https://github.com/DataMechanics/course2016sprprojone
. Note that we may commit and push a few additional fixes or updates to this repository before the deadline, so check for updates and pull them into your fork on a regular basis. Set up a MongoDB and Python environment as necessary for the project. You should be able to run the setup.js
script to prepare the repository, and then to start your MongoDB server with authentication enabled and run the alice_bob
example script.alice_bob
within the toplevel directory of the project. All your scripts for retrieval of data and for transforming data already within the repository should be placed within this folder. Note that the file name for each script should be reasonably descriptive, as it will act as the identifier for the script. The scripts should also follow reasonable modularity and encapsulation practices, and should logically perform related operations. A larger number of smaller simpler, and more reusable scripts that each perform a small task is preferable.README.md
file within your directory (along with any documentation you may write in that file).auth.json
file, and all your scripts should retrieve the credential information from that file when it is passed to them over the command line, as seen below.
auth.json
file and do not include hardcoded authentication credentials in your code. You can use a .gitignore
file to ensure you do not accidentially submit the credentials file. The course staff will use their own authentication credentials when running your code. Your README.md
file should list any idiosyncratic details associated with the credentials needed to run your scripts.README.md
file how to obtain and run those tools).record()
.plan.json
in the directory that encodes the overall plan describing what all the scripts do. It should not have explicit time stamps, but it should be a provenance document conforming to the PROV standard that is the union of the all the PROV documents that the individual scripts generate. The information in this file, together with the scripts, should be sufficient to reproduce the data obtained and generated by the scripts.
 = 

 


import z3
(x1,x2,x3,x4,x5,x6,x7) = [z3.Real('x'+str(i)) for i in range(1,8)]
S = z3.Solver()
# Only allow nonnegative flows.
for x in (x1,x2,x3,x4,x5,x6,x7):
S.add(x >= 0)
# Edge capacity constraints.
S.add(x2 <= 7, x3 <= 8, x4 <= 6)
S.add(x5 <= 3, x6 <= 4, x7 <= 5)
# Constraints derived from graph topology.
S.add(x1 <= x2+x3, x2 <= x4+x5, x3+x4 <= x6, x5+x6 <= x7)
S.add(x1 > 0) # We want a positive flow.
print(S.check())
print(S.model())
flow = 0
for i in range(5, 1, 1):
S.push()
S.add(x1 >= (2**i + flow))
if str(S.check()) != 'unsat':
flow += 2**i
S.pop()
S.add(x1 >= flow)
print(S.model())
def dot(xs, ys):
return sum([x*y for (x,y) in zip(xs, ys)])
x = [x1,x2,x3,x4,x5,x6,x7]
M = [
[ 0,1, 0, 0, 0, 0, 0],
[ 0, 0,1, 0, 0, 0, 0],
[ 0, 0, 0,1, 0, 0, 0],
[ 0, 0, 0, 0,1, 0, 0],
[ 0, 0, 0, 0, 0,1, 0],
[ 0, 0, 0, 0, 0, 0,1],
[ 1, 0, 0, 0, 0, 0, 0],
[ 0, 1, 0, 0, 0, 0, 0],
[ 0, 0, 1, 0, 0, 0, 0],
[ 0, 0, 0, 1, 0, 0, 0],
[ 0, 0, 0, 0, 1, 0, 0],
[ 0, 0, 0, 0, 0, 1, 0],
[ 0, 0, 0, 0, 0, 0, 1],
[1, 1, 1, 0, 0, 0, 0],
[ 0,1, 0, 1, 1, 0, 0],
[ 0, 0,1,1, 0, 1, 0],
[ 0, 0, 0, 0,1,1, 1]
]
b = [7,8,6,3,4,5,
0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0]
S = z3.Solver()
for i in range(len(M)):
S.add(b[i] <= dot(M[i], x))
print(S.check())
print(S.model())
from scipy.optimize import linprog
M = [
[ 1,1,1, 0, 0, 0, 0],
[ 0, 1, 0,1,1, 0, 0],
[ 0, 0, 1, 1, 0,1, 0],
[ 0, 0, 0, 0, 1, 1,1],
]
b = [0, 0, 0, 0]
c = [1, 0, 0, 0, 0, 0, 0]
bounds = [(0, None), (0, 7), (0, 8), (0, 6), (0, 3), (0, 4), (0, 5)]
result = linprog(c, A_ub = M, b_ub = b, bounds=bounds, options={"disp": True})
print(result)
 = 
 
= 
 
= 

 = 
 
 = 
 
= 

 = 

 ≤ 

 ≤ 

 = 

 = 

 = 
 
= 
 
= 

 = 

 = 
 = 
 
 = 

 = 

 = 
 
= 

 = 
 
= 
 
= 

 = 

 = 
 
= 
 
= 
 
= 

 ≤ 
 ≤ 

 = 
 = 

from random import shuffle
from math import sqrt
data = [(18, 28), (24, 18), (27, 31), (14, 15), (46, 23),
(36, 19), (27, 10), (34, 25), (19, 15), (13, 13),
(4, 2), (17, 20), (28, 12), (36, 11), (26, 14),
(19, 19), (24, 13), (25, 6), (20, 8), (17, 22),
(18, 8), (25, 12), (28, 27), (31, 28), (35, 22),
(17, 8), (19, 19), (23, 23), (22, 11)]
x = [xi for (xi, yi) in data]
y = [yi for (xi, yi) in data]
def permute(x):
shuffled = [xi for xi in x]
shuffle(shuffled)
return shuffled
def avg(x): # Average
return sum(x)/len(x)
def stddev(x): # Standard deviation.
m = avg(x)
return sqrt(sum([(xim)**2 for xi in x])/len(x))
def cov(x, y): # Covariance.
return sum([(xiavg(x))*(yiavg(y)) for (xi,yi) in zip(x,y)])/len(x)
def corr(x, y): # Correlation coefficient.
if stddev(x)*stddev(y) != 0:
return cov(x, y)/(stddev(x)*stddev(y))
def p(x, y):
c0 = corr(x, y)
corrs = []
for k in range(0, 2000):
y_permuted = permute(y)
corrs.append(corr(x, y_permuted))
return len([c for c in corrs if abs(c) > c0])/len(corrs)
scipy.stats.pearsonr()
function will return the correlation coefficient and the pvalue.
import scipy.stats
print(scipy.stats.pearsonr(x, y))
 > 
 
 > 
 
 > 

 = 
 = 

 = 

 = 
 
 = 

alice_bob
in accordance with Project #0. Teams consisting of up to four people are permitted for this project.
https://github.com/DataMechanics/course2016sprprojtwo
.
README.md
file within your directory (you may only need to update your existing file).README.md
file how to obtain and run those tools). If you anticipate that an algorithm or technique you employ could be applied to much larger data sets, you should try to implement it as a transformation in the relational model or the MapReduce paradigm. In most cases, the results will consist of new data sets; these should be inserted into the repository along with all the others in the usual way (and should be considered derived data sets).record()
.plan.json
file in your project directory that encodes the overall plan describing what all the scripts do. It should not have explicit time stamps, but it should be a provenance document conforming to the PROV standard that is the union of the all the PROV documents that the individual scripts generate. As before, the information in this file, together with the scripts, should be sufficient to reproduce the data obtained and generated by the scripts.[1]  "World's population increasingly urban with more than half living in urban areas". https://www.un.org/development/desa/en/news/population/worldurbanizationprospects.html 
[2]  Luís M. A. Bettencourt, José Lobo, Dirk Helbing, Christian Kühnert, and Geoffrey B. West. "Growth, innovation, scaling, and the pace of life in cities". Proceedings of the National Academy of Sciences of the United States of America 2007;104(17):73017306. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1852329/ 
[3]  Robert Albright, and Alan Demers, Johannes Gehrke, Nitin Gupta, Hooyeon Lee, Rick Keilty, Gregory Sadowski, Ben Sowell, and Walker White. "SGL: A Scalable Language for Datadriven Games". Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data 2008. http://www.cs.cornell.edu/~sowell/2008sigmodgamesdemo.pdf 
[4]  Walker White, Benjamin Sowell, Johannes Gehrke, and Alan Demers. "Declarative Processing for Computer Games". Proceedings of the 2008 ACM SIGGRAPH Symposium on Video Games 2008. http://www.cs.cornell.edu/~sowell/2008sandboxdeclarative.pdf 
[5]  W3C Working Group. "PROV Model Primer". https://www.w3.org/TR/provprimer/ 
[6]  Robert Ikeda and Jennifer Widom. "Data Lineage: A Survey". http://ilpubs.stanford.edu:8090/918/1/lin_final.pdf 
[7]  Y. Cui and J. Widom. "Lineage Tracing for General Data Warehouse Transformations". The VLDB Journal 2003;12(1):4158. http://ilpubs.stanford.edu:8090/525/1/20015.pdf 
[8]  Gerd Gigerenzer. "Mindless statistics". The Journal of SocioEconomics 2004;33(5):587606. http://www.unh.edu/halelab/BIOL933/papers/2004_Gigerenzer_JSE.pdf 
[9]  Nihar B. Shah and Dengyong Zhou. "Double or Nothing: Multiplicative Incentive Mechanisms for Crowdsourcing". CoRR 2014. http://www.eecs.berkeley.edu/~nihar/publications/double_or_nothing.pdf 
courseYYYYSSSprojNNNN
where YYYY
is the year, SSSS
is the semester (spr
or fal
) and NNNN
is zero
, one
, two
, and so on.
courseYYYYSSSprojzero
) will be a public repository so that those who are not members of the DataMechanics organozation can see and fork it in the steps below, as that will be how the course staff obtain everyone's GitHub usernames.
BUlogin1
, BUlogin1_BUlogin2
, or BUlogin1_BUlogin2_BUlogin3
(depending on the number of members), where the login names BUloginN
are the official Boston University login names for the members, and they are ordered in ascending alphabetical order and separated by underscores. All changes constituting work on the project should be made within this subdirectory and nowhere else, unless otherwise specified in the posted project instructions.
master
branch will represent the group's submission. At some point before the project deadline, the group must submit a pull request to official submit their work. Only the changes that were committed before the merge request is made will be accepted as submitted.mapReduce
operations: https://jira.mongodb.org/browse/SERVER2517 andeval()
and stored procedure capabilities: https://jira.mongodb.org/browse/SERVER17453.GEO2D
index using PyMongo: http://api.mongodb.org/python/current/examples/geo.html, andprov
package:
prov
library for Python on Windows, you may have some trouble installing the lxml
package. Try obtaining the precompiled package here, instead, and installing it using pip install *.whl
;optimize.linprog
module for solving linear optimization problems, and the library is relatively straightforward to install:
pip install numpy
and pip install scipy
should be sufficient.