150 points; pair-optional
Parts I and II are due by 11:59 p.m. Eastern time on Thursday, December 3, 2020
The full project is due by 11:59 p.m. Eastern time on Wednesday, December 9, 2020
Important: This assignment cannot be replaced by the final exam, so it is essential that you complete it.
For the remainder of the semester, you will be working on a final course project. The project is a bit larger in scope than a problem set and is thus worth 150 points. We encourage you to read through the final project description in its entirety before starting.
You may work on the project individually or as a pair. Since this is a large project, we especially encourage you to work in pairs, but please review the collaboration policies of the course before doing so.
If you have questions while working on the final project, please
come to office hours, post them on Piazza, or email
cs111-staff@cs.bu.edu
.
Make sure to submit your work on Gradescope, following the procedures found below.
The project will give you an opportunity to use Python to model, analyze, and score the similarity of text samples. In essence, you will develop a statistical tool that will allow you to identify a particular author or style of writing!
Background
Statistical models of text are one way to quantify how similar one piece of text is to another. Such models were used as evidence that the book The Cuckoo’s Calling was written by J. K. Rowling (using the name Robert Galbraith) in the summer of 2013. Details on that story and the analysis that was performed appear here.
The comparative analysis of Rowling’s works used surprisingly simple techniques to model the author’s style. Possible “style signatures” include frequency-based features like the ones you will implement under Your task below.
Other similar measures of style are also possible. Perhaps the most common application of this kind of feature-based text classification is spam detection and filtering (which you can read more about here, if you’re interested).
Your task
To allow you to compare and classify texts, you will create a statistical model of a body of text using several different techniques. At a minimum, you should integrate five features:
word frequencies
word-length frequencies
stem frequencies
frequencies of different sentence lengths
one other feature of your choice. You might choose one of the features used in the analysis that revealed Rowling’s authorship of The Cuckoo’s Calling. Or you could choose something else entirely (e.g., something based on punctuation) – anything that you can compute/count using Python!
Important: Like the other features in the model, your extra feature must involve maintaining frequencies – i.e., counting how often different things occur.
In addition, the feature should not be too closely tied to a specific type of text. Rather, it must be generic enough to apply to any reasonably long English paragraph.
Computing word lengths from words should be fairly easy, but computing stems, sentence lengths, and your additional feature is a bigger part of the challenge, and offers plenty of room for algorithmic creativity!
In general, the project description below will offer fewer guidelines than the problem sets did. If the guidelines do not explicitly mandate or forbid certain behavior, you should feel free to be creative!
For the first part of the project, you will create an initial version of a TextModel class, which will serve as a blueprint for objects that model a body of text (i.e., a collection of one or more text documents).

To get started, open a new file in Spyder and name it finalproject.py. In this file, declare a class named TextModel. Add appropriate comments to the top of the file.
Write a constructor __init__(self, model_name) that constructs a new TextModel object by accepting a string model_name as a parameter and initializing the following three attributes:

name – a string that is a label for this text model, such as 'JKRowling' or 'Shakespeare'. This will be used in the filenames for saving and retrieving the model. Use the model_name that is passed in as a parameter.

words – a dictionary that records the number of times each word appears in the text.

word_lengths – a dictionary that records the number of times each word length appears.

Each of the dictionaries should be initialized to the empty dictionary ({}). In Part III, you will add dictionaries for the other three features as well.
Write a method __repr__(self) that returns a string that includes the name of the model as well as the sizes of the dictionaries for each feature of the text.

For example, if a model for J. K. Rowling has been set up, the return value of this method may look like:

text model name: J. K. Rowling
  number of words: 2103
  number of word lengths: 17

Here again, information about the other feature dictionaries will eventually be included, but for now only the first two dictionaries are needed.
Notes:
Remember that the __repr__ method should create a single string and return it. You should not use the print function in this method.

Since the returned string should have multiple lines, you will need to add in the newline character ('\n'). Below is a starting point for this method that already adds in some newline characters. You will have to expand upon this to finish the method:

def __repr__(self):
    """Return a string representation of the TextModel."""
    s = 'text model name: ' + self.name + '\n'
    s += '  number of words: ' + str(len(self.words)) + '\n'
    return s
In the final version of your class, the returned string should not include the contents of the dictionaries because they will become very large. However, it may be a good idea to include their contents in the early stages of your code to facilitate small-scale testing.
Write a helper function named clean_text(txt) that takes a string of text txt as a parameter and returns a list containing the words in txt after it has been “cleaned”. This function will be used when you need to process each word in a text individually, without having to worry about punctuation or special characters.
Notes:
Because this is a regular function and not a method, you should define it outside of your TextModel class (e.g., before the class header). When you call clean_text, you should just use the function name; you do not need to prepend a called object. The reason for implementing clean_text as a function rather than a method is that it does not need to access the internals of a TextModel object.
Your clean_text must at least do the following:

remove the following punctuation symbols: . , ? ! ; : "

convert all of the letters to lowercase (which you can do using the string method lower).
You are also welcome to take additional steps as you see fit.
You may find it helpful to use the string method replace. To remind yourself of how it works, try the following in the Python Shell:
>>> s = 'Mr. Smith programs.'
>>> s = s.replace('.', '')
>>> s
'Mr Smith programs'
Instead of using replace, you could use a loop to iteratively look at every character in the string and only keep the characters that are not punctuation. However, you should not use recursion to remove the punctuation, since for large files you will run out of memory from too many recursive calls!
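The replace-based cleaning described above can be sketched as follows. This is a minimal version that handles only the required symbols; you are free to clean the text further:

```python
def clean_text(txt):
    """Return a list of the words in the string txt after it has
    been cleaned: the required punctuation symbols are removed
    and all letters are converted to lowercase.
    """
    for symbol in '.,?!;:"':
        txt = txt.replace(symbol, '')   # remove each punctuation symbol
    return txt.lower().split()          # lowercase, then split into words
```

For example, clean_text('Mr. Smith programs.') returns ['mr', 'smith', 'programs'].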
Write a method add_string(self, s) that adds a string of text s to the model by augmenting the feature dictionaries defined in the constructor. It should not explicitly return a value. Here is some pseudocode to get you started:
def add_string(self, s):
    """Analyzes the string s and adds its pieces
       to all of the dictionaries in this text model.
    """
    # Add code to clean the text and split it into a list of words.
    # *Hint:* Call one of the functions you have already written!
    word_list = ...

    # Template for updating the words dictionary.
    for w in word_list:
        # Update self.words to reflect w:
        # either add a new key-value pair for w
        # or update the existing key-value pair.

    # Add code to update the other feature dictionaries.
For now, you should complete the pseudocode that we’ve given you, and then add code to update the word_lengths dictionary. Later, you will extend the method to update the other dictionaries as well.
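The dictionary-update pattern in the template can be written as a small helper. (update_frequency is a name we have made up for illustration; you could also write the if/else directly inside add_string.)

```python
def update_frequency(freq_dict, key):
    """Record one more occurrence of key in the frequency
    dictionary freq_dict (a hypothetical helper function).
    """
    if key in freq_dict:
        freq_dict[key] += 1     # update the existing key-value pair
    else:
        freq_dict[key] = 1      # add a new key-value pair

# Inside add_string, the same pattern would apply to each feature:
#     for w in word_list:
#         update_frequency(self.words, w)
#         update_frequency(self.word_lengths, len(w))
```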
Write a method add_file(self, filename) that adds all of the text in the file identified by filename to the model. It should not explicitly return a value.
Important: When you open the file for reading, you should specify two additional arguments as follows:
f = open(filename, 'r', encoding='utf8', errors='ignore')
These encoding and errors arguments should allow Python to handle special characters (e.g., “smart quotes”) that may be present in your text files.
Hints:
You may find it helpful to consult the lecture notes on file-reading from a couple of weeks ago, or check out the example code online. Rather than reading the file line-by-line, it makes sense to use the read() method to read the entire file into a single string, and then add that string to your model.

Take advantage of add_string()!
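Putting the hints together, add_file can be as short as this sketch (assuming add_string is already written):

```python
def add_file(self, filename):
    """Add all of the text in the file identified by filename
    to the model. Does not explicitly return a value.
    """
    f = open(filename, 'r', encoding='utf8', errors='ignore')
    text = f.read()        # read the entire file into one string
    f.close()
    self.add_string(text)  # reuse add_string to update the dictionaries
```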
At this point, you are ready to run some initial tests on your methods. Try entering the following commands from the Shell, but remember that the contents of your dictionaries may be printed in a different order.
>>> model = TextModel('A. Poor Righter')
>>> model.add_string("The partiers love the pizza party.")
>>> print(model)
text model name: A. Poor Righter
  number of words: 5
  number of word lengths: 4
>>> model.words
{'party': 1, 'partiers': 1, 'pizza': 1, 'love': 1, 'the': 2}
>>> model.word_lengths
{8: 1, 3: 2, 4: 1, 5: 2}
Important
If you add test code to your Python files, please put it in one or more separate test functions, which you can then call to do the testing. Having test functions is not required. However, you should not have any test code in the global scope (i.e., outside of a function).
You should continue testing your code frequently from this point forward to make sure everything is working correctly at each step. Otherwise, if you reach the end and realize there are errors, it can be very difficult to determine the causes of those errors in such a large program!
Creating a text model can require a lot of computational power and time. Therefore, once we have created a model, we want to be able to save it for later use. The easiest way to do this is to write each of the feature dictionaries to a different file so that we can read them back in at a later time. In this part of the project, you will add methods to your TextModel class that allow you to save and retrieve a text model in this way.
To get you started, here is a function that defines a small dictionary and saves it to a file:
def sample_file_write(filename):
    """A function that demonstrates how to write a
       Python dictionary to an easily-readable file.
    """
    d = {'test': 1, 'foo': 42}  # Create a sample dictionary.
    f = open(filename, 'w')     # Open file for writing.
    f.write(str(d))             # Writes the dictionary to the file.
    f.close()                   # Close the file.
Notice that the file is opened for writing by using a second parameter of 'w' in the open function call. In addition, we write to the file by using the file handle’s write method on a string representation of the dictionary.
Below is a function that reads in a string representing a dictionary from a file, and converts this string (which is a string that looks like a dictionary) to an actual dictionary object. The conversion is performed using a combination of two built-in functions: dict, the constructor for dictionary objects; and eval, which evaluates a string as if it were an expression.
def sample_file_read(filename):
    """A function that demonstrates how to read a
       Python dictionary from a file.
    """
    f = open(filename, 'r')  # Open for reading.
    d_str = f.read()         # Read in a string that represents a dict.
    f.close()

    d = dict(eval(d_str))    # Convert the string to a dictionary.
    print("Inside the newly-read dictionary, d, we have:")
    print(d)
Try saving these functions in a Python file (or even in finalproject.py), and then use the following calls from the Python Shell:
>>> filename = 'testfile.txt'
>>> sample_file_write(filename)
>>> sample_file_read(filename)
Inside the newly-read dictionary, d, we have:
{'test': 1, 'foo': 42}
There should also now be a file named testfile.txt in your current working directory that contains this dictionary.
Now that you know how to write dictionaries to files and read dictionaries from files, add the following two methods to the TextModel class:
Write a method save_model(self) that saves the TextModel object self by writing its various feature dictionaries to files. There will be one file written for each feature dictionary. For now, you just need to handle the words and word_lengths dictionaries.
In order to identify which model and dictionary is stored in a given file, you should use the name attribute concatenated with the name of the feature dictionary. For example, if name is 'JKR' (for J. K. Rowling), then we would suggest using the filenames:

'JKR_words'
'JKR_word_lengths'

In general, the filenames are self.name + '_' + name_of_dictionary. Taking this approach will ensure that you don’t overwrite one model’s dictionary files when you go to save another model.
It may help to use the code for sample_file_write as a starting point for this method, but don’t forget that you should create a separate file for each dictionary.
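For Parts I and II, a sketch of save_model along these lines would be (the list of dictionaries gets extended in Part III):

```python
def save_model(self):
    """Save the TextModel object self by writing each of its
    feature dictionaries to its own file.
    """
    # For now, just the words and word_lengths dictionaries;
    # Part III adds the other three.
    for dict_name, d in [('words', self.words),
                         ('word_lengths', self.word_lengths)]:
        f = open(self.name + '_' + dict_name, 'w')
        f.write(str(d))    # write the string form of the dictionary
        f.close()
```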
Write a method read_model(self) that reads the stored dictionaries for the called TextModel object from their files and assigns them to the attributes of the called TextModel. This is the complementary method to save_model, and you should assume that the necessary files have filenames that follow the naming scheme used in save_model.
It may help to use the code for sample_file_read as a starting point for this method, but note that you should read a separate file for each dictionary. Remember that you can use the dict and eval functions to convert a string that represents a dictionary to an actual dictionary object.
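A matching sketch of read_model, using the same filenames that save_model writes:

```python
def read_model(self):
    """Read the stored dictionaries for this TextModel from
    their files and assign them to its attributes.
    """
    f = open(self.name + '_words', 'r')
    self.words = dict(eval(f.read()))         # string -> dictionary
    f.close()

    f = open(self.name + '_word_lengths', 'r')
    self.word_lengths = dict(eval(f.read()))
    f.close()
```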
Examples:
You can test your save_model and read_model methods as follows:
# Create a model for a simple text, and save the resulting model.
>>> model = TextModel('A. Poor Righter')
>>> model.add_string("The partiers love the pizza party.")
>>> model.save_model()

# Create a new TextModel object with the same name as the original one,
# and assign it to a new variable.
>>> model2 = TextModel('A. Poor Righter')

# Read the dictionaries that were saved for the original model,
# and use them as the dictionaries of `model2`.
>>> model2.read_model()

>>> print(model2)
text model name: A. Poor Righter
  number of words: 5
  number of word lengths: 4

>>> model2.words
{'party': 1, 'partiers': 1, 'pizza': 1, 'love': 1, 'the': 2}
>>> model2.word_lengths
{8: 1, 3: 2, 4: 1, 5: 2}
Login to Gradescope by clicking the link in the left-hand navigation bar, and click on the box for CS 111.
Submit a version of finalproject.py that contains at least your work for Parts I and II. If your file includes incomplete work for Parts III-V that might prevent us from testing your work for Parts I and II, you should copy the file into a different folder (keeping the same name), and remove any code that might interfere with our testing. Test your file before you submit it by running it in Spyder and making calls to your methods/functions from Parts I and II.
IMPORTANT: If you chose to work on the final project with a partner, only one person from the pair should submit the file, and that person should add the other person as a group member following step 6 below.
Here are the steps:
Click on the name of the assignment in the list of assignments. You should see a pop-up window with a box labeled DRAG & DROP. (If you don’t see it, click the Submit or Resubmit button at the bottom of the page.)
Add your file to the box labeled DRAG & DROP. You can either drag and drop the file from its folder into the box, or you can click on the box itself and browse for the file.
Click the Upload button.
You should see a box saying that your submission was successful.
Click the (x) button to close that box.
The Autograder will perform some tests on your file. Once it is done, check the results to ensure that the tests were passed. If one or more of the tests did not pass, the name of that test will be in red, and there should be a message describing the failure. Based on those messages, make any necessary changes. Feel free to ask a staff member for help.
Note: You will not see a complete Autograder score when you submit. That is because additional tests for at least some of the problems will be run later, after the final deadline for the submission has passed. For such problems, it is important to realize that passing all of the initial tests does not necessarily mean that you will ultimately get full credit on the problem. You should always run your own tests to convince yourself that the logic of your solutions is correct.
If you worked with a partner and you are the one who is submitting the file:
Click on the Add Group Member link that appears below your name above the results of the Autograder.
In the pop-up box that appears, click on the Add Member link.
Type your partner’s name or choose it from the drop-down menu.
Click the Save button.
Check to ensure that your partner’s name now appears below your name above the results of the Autograder.
If needed, use the Resubmit button at the bottom of the page to resubmit your work. Important: Every time that you make a submission, you should submit all of the files for that Gradescope assignment, even if some of them have not changed since your last submission.
Near the top of the page, click on the box labeled Code. Then click on the name of the file to view its contents. Check to make sure that the file contains the code that you want us to grade.
Important
It is your responsibility to ensure that the correct version of every file is on Gradescope before the final deadline. We will not accept any file after the submission window for a given assignment has closed, so please check your submissions carefully using the steps outlined above.
If you are unable to access Gradescope and there is enough time to do so, wait an hour or two and then try again. If you are unable to submit and it is close to the deadline, email your homework before the deadline to cs111-staff@cs.bu.edu.
Update your __init__ method so that it initializes attributes for three additional dictionaries:

stems – a dictionary that records the number of times each word stem appears in the text.

sentence_lengths – a dictionary that records the number of times each sentence length (i.e., the number of words in a sentence) appears.

an appropriately named dictionary that records the frequencies of whatever additional feature you have chosen to include in your TextModel (see the section above entitled Your task for possible options).
Important: Like the other features in the model, your extra feature must involve maintaining frequencies – i.e., counting how often different things occur. This means that the value portion of each key-value pair in the dictionary must be an integer.
In addition, the feature that you choose should not be too closely tied to a specific type of text. Rather, it must be generic enough to apply to any reasonably long English paragraph.
Write a helper function named stem(s) that accepts a string s as a parameter. The function should return the stem of s. The stem of a word is the root part of the word, which excludes any prefixes and suffixes. For example:
>>> stem('party')
result: 'parti'
>>> stem('parties')
result: 'parti'
>>> stem('love')
result: 'lov'
>>> stem('loving')
result: 'lov'
Notes:
We will discuss stemming in lecture on 11/30.
Like clean_text, this is a regular function and not a method, so you should define it outside of your TextModel class. When you call it, you should just use the function name; you do not need to prepend a called object.
The stem of a word is not necessarily a word itself!
This function does not have to work perfectly for all possible words and stems. Instead, you should handle a number of common cases for stems that work for many words, as we will discuss in lecture.
The number of different cases that your function is able to handle is up to you. For full credit, your function should handle at least seven distinct cases, each of which applies to multiple words.
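Here is a partial, rule-based sketch that is consistent with the examples above. The particular suffix rules are our own illustrative choices, and they cover fewer cases than your version needs:

```python
def stem(s):
    """Return the stem of the word s, using a handful of
    suffix-based rules (a partial sketch, not a full stemmer).
    """
    if s.endswith('ies'):       # parties -> parti
        s = s[:-3] + 'i'
    elif s.endswith('ing'):     # loving -> lov
        s = s[:-3]
    elif s.endswith('y'):       # party -> parti
        s = s[:-1] + 'i'
    elif s.endswith('e'):       # love -> lov
        s = s[:-1]
    elif s.endswith('s'):       # cats -> cat
        s = s[:-1]
    return s
```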
Extend your add_string method to update the feature dictionaries for word stems, sentence lengths, and your chosen additional feature.
Notes:
You should update the sentence lengths dictionary before you clean the text. Once you remove the punctuation from the string, it will be difficult to count the sentences.
You should make use of the stem function that you wrote above, and you should define any additional helper functions/methods as you see fit. In particular, you may need one or more helper functions related to the dictionary for your chosen additional feature.
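One possible helper for counting sentence lengths before the text is cleaned. It assumes that a sentence ends whenever a word ends in '.', '?', or '!' (an assumption that abbreviations like 'Mr.' will violate), and sentence_lengths_in is a hypothetical name of our own:

```python
def sentence_lengths_in(s):
    """Return a list of the sentence lengths (in words) in the
    string s, assuming sentences end in '.', '?', or '!'.
    """
    lengths = []
    count = 0
    for word in s.split():
        count += 1
        if word[-1] in '.?!':   # this word ends a sentence
            lengths.append(count)
            count = 0
    return lengths
```

Inside add_string, each length in the returned list would then be counted in self.sentence_lengths.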
Update the following methods to incorporate the feature dictionaries for word stems, sentence lengths, and your chosen additional feature:
__repr__
save_model
read_model
Test your new code by performing the following test. Here again, your dictionary contents may be printed in a different order. In addition, your dictionary for stems may be slightly different, depending on how you implemented the stem function:
>>> model = TextModel('A. Poor Righter')
>>> model.add_string("The partiers love the pizza party.")
>>> print(model)
text model name: A. Poor Righter
  number of words: 5
  number of word lengths: 4
  number of stems: 4
  number of sentence lengths: 1
  # info for your additional feature goes here!
>>> model.words
result: {'party': 1, 'partiers': 1, 'pizza': 1, 'love': 1, 'the': 2}
>>> model.word_lengths
result: {8: 1, 3: 2, 4: 1, 5: 2}
>>> model.stems
result: {'parti': 2, 'the': 2, 'pizza': 1, 'lov': 1}
>>> model.sentence_lengths
result: {6: 1}
Optional: Add extra functionality to your TextModel object as you see fit. This may include improving your algorithms for cleaning the strings or finding stems, or it may entail adding even more feature dictionaries beyond the ones that we have required.
We will discuss Parts IV and V in lecture on 11/30.
In this part of the project, you will first implement the core algorithm that will allow you to compare bodies of text. This algorithm will produce a numeric similarity score that measures how similar one body of text is to another, based on one type of feature (e.g., word lengths). You will then compute scores of this type for all five of the features, and use them to classify a piece of text as being more likely to come from one source than another.
The similarity score that we will compute is based on a statistical model known as a Naive Bayes probability model. Despite the “naive” in its name, scores computed using this model have been very successful in distinguishing spam email from non-spam (“ham”) email, among other classification problems.
In essence, the Naive Bayes scoring algorithm works in the following way: You give it feature counts (e.g., word counts) from one body of text and feature counts from a second body of text. The algorithm will then compute the likelihood that the second body of text is from the same source as the first body of text! The reason that the algorithm is called “naive” is that it makes the assumption that each item in a given feature set is independent. For example, it assumes that the appearance of the word “spell” does not depend on the appearance of the word “potter” – and that this independence holds for all pairs of words. This assumption is certainly not true, but that turns out not to matter in many situations!
You can read more details about the use of Naive Bayes probabilities for classification on Wikipedia if you would like to know more, but all of the necessary information is summarized below.
How it works
To illustrate how the Bayesian scoring algorithm works, let’s assume that the only features we care about are the individual words in the texts. As you have already done in your TextModel class, we can use a Python dictionary to model all of the words in a text. The dictionary’s keys are words, and the value for a given word is the number of times that it appears in the text.
For example, let’s assume that we have two text documents:
a source text (which we are pretending was written by Shakespeare!) that has the following dictionary:
shakespeare_dict = {'love': 50, 'spell': 8, 'thou': 42}
This document has 100 words in all: 50 occurrences of the word “love,” 8 of “spell,” and 42 of “thou.”
a mystery text (author unknown) whose dictionary looks like this:
mystery_dict = {'love': 3, 'thou': 1, 'potter': 2, 'spam': 4}
This document has 10 words in all: three occurrences of “love,” one of “thou,” two of “potter,” and four of “spam.”
The Bayesian similarity score between these two texts attempts to measure the likelihood that the ten words in the mystery text come from the same class of text as the 100 words in the source text. (A given class of text could be based on a particular author or publication, or on other characteristics of the texts in question.)
To calculate the score, we first take each word in the mystery text and compute a probability for it that is based on the number of times that it occurs in the source text. If a word in the mystery text doesn’t occur at all in the source text (which would lead to a probability of 0), we instead compute a probability that is based on a “default” word count of 0.5. This will allow us to avoid multiplying by 0 when we compute the final score.
Here are the probabilities for the words in our mystery text:

'love': 50/100 = 0.5
'thou': 42/100 = 0.42
'potter': 0.5/100 = 0.005 (using the default count of 0.5, since 'potter' does not appear in the source text)
'spam': 0.5/100 = 0.005 (again using the default count)

Important: These probabilities have denominators of 100 because the source text has 100 words in it. The denominators should not always be 100!
To compute the similarity score of the mystery text, we need to compute a product in which a given word’s probability is multiplied by itself n times, where n is the number of times that the word appears in the mystery text. In this case, we would do the following:
#             3 "love"      1 "thou"  2 "potter"      4 "spam"
sim_score = (.5*.5*.5)    * (.42)   * (.005*.005)   * (.005*.005*.005*.005)
This similarity score is very small! In practice, these very small values are hard to work with, and they can become so small that Python’s floating-point values cannot accurately represent them! Therefore, instead of using the probabilities themselves, we will use the logs of the probabilities. The log operation transforms multiplications into additions (and exponents into multiplication), so our log-based similarity score would be:
log_sim_score = 3*log(.5) + 1*log(.42) + 2*log(.005) + 4*log(.005)
This results in a more manageable value of around -34.737. (Note that Python’s math.log function uses the natural log (of base e) by default, which is fine for our purposes.)
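You can check this arithmetic directly in Python with math.log:

```python
import math

# Log similarity score of the mystery text against the Shakespeare
# model, using the probabilities computed above.
log_sim_score = (3 * math.log(0.5)       # 3 occurrences of "love"
                 + 1 * math.log(0.42)    # 1 occurrence of "thou"
                 + 2 * math.log(0.005)   # 2 occurrences of "potter"
                 + 4 * math.log(0.005))  # 4 occurrences of "spam"

print(round(log_sim_score, 3))   # -34.737
```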
The resulting similarity score gives us a measure of how similar the mystery text is to the source text. To classify a new mystery text, we compute similarity scores between it and a collection of known texts in order to determine which of the known texts is most likely to be related to the mystery text.
For example, let’s say that we also have the following model for texts by J.K. Rowling:
jkr = {'love': 25, 'spell': 275, 'potter': 700}
Note that there are a total of 1000 words in this model.
We can compute a similarity score for the mystery text and the jkr texts in the same way that we did above.
We get the following probabilities that the words in the mystery text came from the jkr texts:

'love': 25/1000 = 0.025
'thou': 0.5/1000 = 0.0005 (default count, since 'thou' does not appear in the jkr model)
'potter': 700/1000 = 0.7
'spam': 0.5/1000 = 0.0005 (default count)

Thus, the non-log similarity score for 3 occurrences of “love”, 1 occurrence of “thou”, 2 occurrences of “potter”, and 4 occurrences of “spam” would be
sim_score = (.025*.025*.025) * (.0005) * (.7*.7) * (.0005*.0005*.0005*.0005)
This value is also very small! Using logs:
log_sim_score = 3*log(.025) + 1*log(.0005) + 2*log(.7) + 4*log(.0005)
Now the similarity score is approximately -49.784. This value is less than the value that we computed when comparing the mystery text to the Shakespeare text. Therefore, we can conclude that the mystery text is more likely to have come from Shakespeare than from J.K. Rowling.
Your tasks
Write a function (not a method, so it should be outside the class) named compare_dictionaries(d1, d2). It should take two feature dictionaries d1 and d2 as inputs, and it should compute and return their log similarity score. Here is some pseudocode for what you will need to do:

Start the score at zero.

Let total be the total number of words in d1 – not only distinct items, but all of the repetitions of all the items as well. (For example, total for our example shakespeare_dict would be 100.)

For each item in d2:
- Check if the item is in d1.
- If so, add the log of the probability that the item would be chosen at random from everything in d1, multiplied by the number of times it appears in d2.
- If not, add the log of the default probability (0.5 / total), multiplied by the number of times the item appears in d2.

Return the resulting score.
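The pseudocode above translates fairly directly into Python. Here is one possible version (a sketch; your details may differ):

```python
import math

def compare_dictionaries(d1, d2):
    """Return the log similarity score of feature dictionary d2
    with respect to feature dictionary d1.
    """
    score = 0
    total = 0
    for item in d1:              # total count, including repetitions
        total += d1[item]
    for item in d2:
        if item in d1:
            probability = d1[item] / total
        else:
            probability = 0.5 / total    # the default probability
        score += d2[item] * math.log(probability)
    return score
```

With the shakespeare_dict and mystery_dict from the earlier example, compare_dictionaries(shakespeare_dict, mystery_dict) is approximately -34.737.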
Write a method similarity_scores(self, other) that computes and returns a list of log similarity scores measuring the similarity of self and other – one score for each type of feature (words, word lengths, stems, sentence lengths, and your additional feature). You should make repeated calls to compare_dictionaries, and put the resulting scores in a list that the method returns.

Important: In each call to compare_dictionaries, the dictionary belonging to self should be the second parameter of the call. For example:

word_score = compare_dictionaries(other.words, self.words)
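A sketch of the method, shown inside a stripped-down TextModel with only two feature dictionaries (your class has five; compare_dictionaries here is a condensed version of the function from the previous task):

```python
import math

def compare_dictionaries(d1, d2):
    """Log similarity score of d2 with respect to d1."""
    total = sum(d1.values())
    score = 0
    for item in d2:
        prob = d1[item] / total if item in d1 else 0.5 / total
        score += d2[item] * math.log(prob)
    return score

class TextModel:
    def __init__(self, model_name):
        self.name = model_name
        self.words = {}
        self.word_lengths = {}

    def similarity_scores(self, other):
        """Return a list of log similarity scores measuring the
        similarity of self and other, one per feature dictionary.
        Note that self's dictionary is the *second* argument.
        """
        word_score = compare_dictionaries(other.words, self.words)
        word_length_score = compare_dictionaries(other.word_lengths,
                                                 self.word_lengths)
        # ... add one more call for each remaining feature:
        # stems, sentence lengths, and your additional feature.
        return [word_score, word_length_score]
```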
Finally, write a method classify(self, source1, source2) that compares the called TextModel object (self) to two other “source” TextModel objects (source1 and source2) and determines which of these other TextModels is the more likely source of the called TextModel.

You should begin by calling similarity_scores twice:
scores1 = self.similarity_scores(source1)
scores2 = self.similarity_scores(source2)
Next, print the two lists of scores, preceding each list of scores with the name of the source TextModel – for example:
scores for Shakespeare: [-34.737, ...]
scores for J.K. Rowling: [-49.091, ...]
(Note: If you like, you can use the round function to round the printed scores to 2 or 3 places after the decimal, but doing so is not required.)
You should then use these two lists of scores to determine whether the called TextModel is more likely to have come from source1 or source2. One way to do this is to compare corresponding pairs of scores, and determine which of the source TextModels has the larger number of higher scores.
For example, imagine that the two sets of scores are the following:
scores1: [-34.737, -25.132, -55.312, -10.715, -47.125]
scores2: [-49.091, -21.071, -60.154, -16.502, -43.675]
scores1 has a higher score for three of the features (the ones in positions 0, 2, and 3 of the lists), while scores2 has a higher score for only two of the features (the ones in positions 1 and 4). Thus, we conclude that self is more likely to have come from source1.
You should also feel free to take an alternative approach to using the two lists of scores. For example, you could compute a weighted sum of the scores in each list by doing something like this:
weighted_sum1 = 10*scores1[0] + 5*scores1[1] + 7*scores1[2] + ...
weighted_sum2 = 10*scores2[0] + 5*scores2[1] + 7*scores2[2] + ...
You could then base your classification on which source’s weighted sum is larger. One advantage of this approach is that it allows you to adjust the relative importance of the different features – giving certain features a larger impact on the classification.
Your classify method should report its conclusions, using the names of the relevant TextModel objects. For example:

mystery is more likely to have come from Shakespeare

The method does not need to return anything.
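Here is one possible sketch of classify that uses the pairwise-comparison approach (ties here favor source1; a weighted-sum version would replace the counting loop):

```python
def classify(self, source1, source2):
    """Compare self to the models source1 and source2, print the
    two lists of similarity scores, and report which source is
    the more likely origin of self.
    """
    scores1 = self.similarity_scores(source1)
    scores2 = self.similarity_scores(source2)
    print('scores for ' + source1.name + ':', scores1)
    print('scores for ' + source2.name + ':', scores2)

    # Count how many features give each source the higher score.
    wins1 = 0
    wins2 = 0
    for i in range(len(scores1)):
        if scores1[i] > scores2[i]:
            wins1 += 1
        elif scores2[i] > scores1[i]:
            wins2 += 1

    if wins1 >= wins2:
        likely = source1.name
    else:
        likely = source2.name
    print(self.name + ' is more likely to have come from ' + likely)
```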
Testing
Here is one function that you can use to test your TextModel implementation:
# Copy and paste the following function into finalproject.py
# at the bottom of the file, *outside* of the TextModel class.

def test():
    """ your docstring goes here """
    source1 = TextModel('source1')
    source1.add_string('It is interesting that she is interested.')

    source2 = TextModel('source2')
    source2.add_string('I am very, very excited about this!')

    mystery = TextModel('mystery')
    mystery.add_string('Is he interested? No, but I am.')
    mystery.classify(source1, source2)
Here is what we get when we run this function using our TextModel
implementation:
>>> test()
scores for source1: [-16.394, -9.92, -15.701, -1.386, -1.386]
scores for source2: [-17.087, -15.008, -17.087, -1.386, -3.466]
mystery is more likely to have come from source1
Some of your scores will be different than ours (e.g., the third score
in each list, which depends on how you stem the words, and the fifth
score in each list, which depends on which additional feature you
include). Our conclusion is based on a pairwise comparison of the
scores: because source1 has a larger number of higher scores, it is
chosen as the more likely source. If you use the lists of scores in
another way, you may come to a different conclusion, which is fine!
Now that your TextModel class is complete and you have tested its
ability to compare texts, you should choose several bodies of text
from which you can create models and compute similarity scores.
Choose the bodies of text from which you will create your two “source” models. In the example we provided at the start of Part IV, we chose William Shakespeare and J. K. Rowling texts for our source models, and then selected a new mystery text to compare them against. For this part, you should similarly choose two bodies of text for your source models, and in the next part you will choose new texts to compare them against.
Note that we say bodies of text, because a given text model can be based on more than one text document. For example, if you want to build a text model for New York Times articles, you should base it on multiple articles from the Times.
You should choose two bodies of text that allow for meaningful comparisons. For example:
You are welcome to choose whatever texts you like, but as a starting point, here is a link to a text file containing the complete works of William Shakespeare. You should not use this file as it is. You should download it, open it, and remove the text at the beginning and end that explains the file and provides additional information. This is true of any text file(s) that you use – you should inspect them and perform whatever human pre-processing is necessary to clean the file before handling it computationally.
Once you have chosen your two types/bodies of text, you will need to
find the texts that will define the model for each of them. Once you
have obtained and pre-processed the texts, you should create
TextModel objects for them and save them to files using the
save_model method.
Important
Make sure to save your texts as plain-text files.
Note: We encourage you to leave out at least one text from each body of text when creating your models. This will allow you to use it for testing. For example, if your two bodies of text are a collection of articles from the New York Times and a collection of articles from the Wall Street Journal, you can use most of the articles from a given collection to build its text model, but leave out one article from each collection so that you can use it for testing.
Once you have your two source models, you should choose at least four new text documents (or collections of text documents) – texts not used in the creation of your source models – that you would be interested in classifying according to your source models.
For two of your classifications, you can use the texts that you left out when creating your source models. For example, if your source models are for Times and WSJ articles, you can perform tests to see if the Times article that you left out when building your source models is really more similar to the other Times articles than it is to the WSJ articles.
The other two classifications are up to you. For example, you could see if:
Be creative!
For each text/collection of texts that you want to classify, you will
again need to obtain one or more text files, pre-process them, and
create a TextModel object from them. You should then invoke the
classify method on that TextModel to see which of your source models
is the more likely source.
To get started, copy-and-paste the following function inside of
finalproject.py but outside of the TextModel class:
# Copy and paste the following function into finalproject.py
# at the bottom of the file, *outside* of the TextModel class.

def run_tests():
    """ your docstring goes here """
    source1 = TextModel('rowling')
    source1.add_file('rowling_source_text.txt')

    source2 = TextModel('shakespeare')
    source2.add_file('shakespeare_source_text.txt')

    new1 = TextModel('wr100')
    new1.add_file('wr100_source_text.txt')
    new1.classify(source1, source2)

    # Add code for three other new models below.
You should replace the model names and file names in the provided
code with the names of your models and text files. Don't forget
that you can use more than one file to build a given model, in
which case you would call add_file multiple times for that model.
In a plain-text file named reflection.txt, write a brief report
containing the following information:
Your reflection.txt file should be approximately two paragraphs
in length.
Coming soon!
Last updated on December 2, 2020.