150 points; pair-optional
Parts I and II are due by 11:59 p.m. Eastern time on Thursday, December 3, 2020
The full project is due by 11:59 p.m. Eastern time on Wednesday, December 9, 2020
Important: This assignment cannot be replaced by the final exam, so it is essential that you complete it.
For the remainder of the semester, you will be working on a final course project. The project is a bit larger in scope than a problem set and is thus worth 150 points. We encourage you to read through the final project description in its entirety before starting.
You may work on the project individually or as a pair. Since this is a large project, we especially encourage you to work in pairs, but please review the collaboration policies of the course before doing so.
If you have questions while working on the final project, please
come to office hours, post them on Piazza, or email
cs111-staff@cs.bu.edu
.
Make sure to submit your work on Gradescope, following the procedures found below.
The project will give you an opportunity to use Python to model, analyze, and score the similarity of text samples. In essence, you will develop a statistical tool that will allow you to identify a particular author or style of writing!
Background
Statistical models of text are one way to quantify how similar one piece of text is to another. Such models were used as evidence that the book The Cuckoo’s Calling was written by J. K. Rowling (using the name Robert Galbraith) in the summer of 2013. Details on that story and the analysis that was performed appear here.
The comparative analysis of Rowling’s works used surprisingly simple techniques to model the author’s style. Possible “style signatures” include frequency-based features like the ones you will implement under Your task below.
Other similar measures of style are also possible. Perhaps the most common application of this kind of feature-based text classification is spam detection and filtering (which you can read more about here, if you’re interested).
Your task
To allow you to compare and classify texts, you will create a statistical model of a body of text using several different techniques. At a minimum, you should integrate five features:
word frequencies
word-length frequencies
stem frequencies
frequencies of different sentence lengths
one other feature of your choice. You might choose one of the features used in the analysis that revealed Rowling’s authorship of The Cuckoo’s Calling. Or you could choose something else entirely (e.g., something based on punctuation) – anything that you can compute/count using Python!
Important: Like the other features in the model, your extra feature must involve maintaining frequencies – i.e., counting how often different things occur.
In addition, the feature should not be too closely tied to a specific type of text. Rather, it must be generic enough to apply to any reasonably long English paragraph.
Computing word lengths from words should be fairly easy, but computing stems, sentence lengths, and your additional feature is a bigger part of the challenge, and offers plenty of room for algorithmic creativity!
In general, the project description below will offer fewer guidelines than the problem sets did. If the guidelines do not explicitly mandate or forbid certain behavior, you should feel free to be creative!
For the first part of the project, you will create an initial version of a TextModel class, which will serve as a blueprint for objects that model a body of text (i.e., a collection of one or more text documents).

To get started, open a new file in Spyder and name it finalproject.py. In this file, declare a class named TextModel. Add appropriate comments to the top of the file.
Write a constructor __init__(self, model_name) that constructs a new TextModel object by accepting a string model_name as a parameter and initializing the following three attributes:

name – a string that is a label for this text model, such as 'JKRowling' or 'Shakespeare'. This will be used in the filenames for saving and retrieving the model. Use the model_name that is passed in as a parameter.

words – a dictionary that records the number of times each word appears in the text.

word_lengths – a dictionary that records the number of times each word length appears.

Each of the dictionaries should be initialized to the empty dictionary ({}). In Part III, you will add dictionaries for the other three features as well.
Write a method __repr__(self) that returns a string that includes the name of the model as well as the sizes of the dictionaries for each feature of the text.

For example, if a model for J. K. Rowling has been set up, the return value of this method may look like:

text model name: J. K. Rowling
  number of words: 2103
  number of word lengths: 17

Here again, information about the other feature dictionaries will eventually be included, but for now only the first two dictionaries are needed.
Notes:
Remember that the __repr__ method should create a single string and return it. You should not use the print function in this method.

Since the returned string should have multiple lines, you will need to add in the newline character ('\n'). Below is a starting point for this method that already adds in some newline characters. You will have to expand upon this to finish the method:

def __repr__(self):
    """Return a string representation of the TextModel."""
    s = 'text model name: ' + self.name + '\n'
    s += '  number of words: ' + str(len(self.words)) + '\n'
    return s
In the final version of your class, the returned string should not include the contents of the dictionaries because they will become very large. However, it may be a good idea to include their contents in the early stages of your code to facilitate small-scale testing.
Write a helper function named clean_text(txt) that takes a string of text txt as a parameter and returns a list containing the words in txt after it has been “cleaned”. This function will be used when you need to process each word in a text individually, without having to worry about punctuation or special characters.
Notes:
Because this is a regular function and not a method, you should define it outside of your TextModel class (e.g., before the class header). When you call clean_text, you should just use the function name; you do not need to prepend a called object. The reason for implementing clean_text as a function rather than a method is that it does not need to access the internals of a TextModel object.
Your clean_text must at least do the following:

remove the following punctuation symbols: . , ? ! ; : "

convert all of the letters to lowercase (which you can do using the string method lower).
You are also welcome to take additional steps as you see fit.
You may find it helpful to use the string method replace. To remind yourself of how it works, try the following in the Python Shell:
>>> s = 'Mr. Smith programs.'
>>> s = s.replace('.', '')
>>> s
'Mr Smith programs'
Instead of using replace, you could use a loop to iteratively look at every character in the string and only keep the characters that are not punctuation. However, you should not use recursion to remove the punctuation, since for large files you will run out of memory from too many recursive calls!
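The replace-based cleaning described above can be sketched as follows. This is a minimal version that handles only the required symbols; you are free to clean the text further:

```python
def clean_text(txt):
    """Return a list of the words in the string txt after it has
    been cleaned: the required punctuation symbols are removed
    and all letters are converted to lowercase.
    """
    for symbol in '.,?!;:"':
        txt = txt.replace(symbol, '')   # remove each punctuation symbol
    return txt.lower().split()          # lowercase, then split into words
```

For example, clean_text('Mr. Smith programs.') returns ['mr', 'smith', 'programs'].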
Write a method add_string(self, s) that adds a string of text s to the model by augmenting the feature dictionaries defined in the constructor. It should not explicitly return a value. Here is some pseudocode to get you started:
def add_string(self, s):
    """Analyzes the string s and adds its pieces
       to all of the dictionaries in this text model.
    """
    # Add code to clean the text and split it into a list of words.
    # *Hint:* Call one of the functions you have already written!
    word_list = ...

    # Template for updating the words dictionary.
    for w in word_list:
        # Update self.words to reflect w:
        # either add a new key-value pair for w
        # or update the existing key-value pair.

    # Add code to update the other feature dictionaries.
For now, you should complete the pseudocode that we’ve given you, and then add code to update the word_lengths dictionary. Later, you will extend the method to update the other dictionaries as well.
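The dictionary-update pattern in the template can be written as a small helper. (update_frequency is a name we have made up for illustration; you could also write the if/else directly inside add_string.)

```python
def update_frequency(freq_dict, key):
    """Record one more occurrence of key in the frequency
    dictionary freq_dict (a hypothetical helper function).
    """
    if key in freq_dict:
        freq_dict[key] += 1     # update the existing key-value pair
    else:
        freq_dict[key] = 1      # add a new key-value pair

# Inside add_string, the same pattern would apply to each feature:
#     for w in word_list:
#         update_frequency(self.words, w)
#         update_frequency(self.word_lengths, len(w))
```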
Write a method add_file(self, filename) that adds all of the text in the file identified by filename to the model. It should not explicitly return a value.
Important: When you open the file for reading, you should specify two additional arguments as follows:
f = open(filename, 'r', encoding='utf8', errors='ignore')
These encoding and errors arguments should allow Python to handle special characters (e.g., “smart quotes”) that may be present in your text files.
Hints:
You may find it helpful to consult the lecture notes on file-reading from a couple of weeks ago, or check out the example code online. Rather than reading the file line-by-line, it makes sense to use the read() method to read the entire file into a single string, and then add that string to your model.

Take advantage of add_string()!
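Putting the hints together, add_file can be as short as this sketch (assuming add_string is already written):

```python
def add_file(self, filename):
    """Add all of the text in the file identified by filename
    to the model. Does not explicitly return a value.
    """
    f = open(filename, 'r', encoding='utf8', errors='ignore')
    text = f.read()        # read the entire file into one string
    f.close()
    self.add_string(text)  # reuse add_string to update the dictionaries
```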
At this point, you are ready to run some initial tests on your methods. Try entering the following commands from the Shell, but remember that the contents of your dictionaries may be printed in a different order.
>>> model = TextModel('A. Poor Righter')
>>> model.add_string("The partiers love the pizza party.")
>>> print(model)
text model name: A. Poor Righter
  number of words: 5
  number of word lengths: 4
>>> model.words
{'party': 1, 'partiers': 1, 'pizza': 1, 'love': 1, 'the': 2}
>>> model.word_lengths
{8: 1, 3: 2, 4: 1, 5: 2}
Important
If you add test code to your Python files, please put it in one or more separate test functions, which you can then call to do the testing. Having test functions is not required. However, you should not have any test code in the global scope (i.e., outside of a function).
You should continue testing your code frequently from this point forward to make sure everything is working correctly at each step. Otherwise, if you reach the end and realize there are errors, it can be very difficult to determine the causes of those errors in such a large program!
Creating a text model can require a lot of computational power and time. Therefore, once we have created a model, we want to be able to save it for later use. The easiest way to do this is to write each of the feature dictionaries to a different file so that we can read them back in at a later time. In this part of the project, you will add methods to your TextModel class that allow you to save and retrieve a text model in this way.
To get you started, here is a function that defines a small dictionary and saves it to a file:
def sample_file_write(filename):
    """A function that demonstrates how to write a
       Python dictionary to an easily-readable file.
    """
    d = {'test': 1, 'foo': 42}  # Create a sample dictionary.
    f = open(filename, 'w')     # Open file for writing.
    f.write(str(d))             # Writes the dictionary to the file.
    f.close()                   # Close the file.
Notice that the file is opened for writing by using a second parameter of 'w' in the open function call. In addition, we write to the file by using the file handle’s write method on a string representation of the dictionary.
Below is a function that reads in a string representing a dictionary from a file, and converts this string (which is a string that looks like a dictionary) to an actual dictionary object. The conversion is performed using a combination of two built-in functions: dict, the constructor for dictionary objects; and eval, which evaluates a string as if it were an expression.
def sample_file_read(filename):
    """A function that demonstrates how to read a
       Python dictionary from a file.
    """
    f = open(filename, 'r')  # Open for reading.
    d_str = f.read()         # Read in a string that represents a dict.
    f.close()

    d = dict(eval(d_str))    # Convert the string to a dictionary.
    print("Inside the newly-read dictionary, d, we have:")
    print(d)
Try saving these functions in a Python file (or even in finalproject.py), and then use the following calls from the Python Shell:
>>> filename = 'testfile.txt'
>>> sample_file_write(filename)
>>> sample_file_read(filename)
Inside the newly-read dictionary, d, we have:
{'test': 1, 'foo': 42}
There should also now be a file named testfile.txt in your current working directory that contains this dictionary.
Now that you know how to write dictionaries to files and read dictionaries from files, add the following two methods to the TextModel class:
Write a method save_model(self) that saves the TextModel object self by writing its various feature dictionaries to files. There will be one file written for each feature dictionary. For now, you just need to handle the words and word_lengths dictionaries.
In order to identify which model and dictionary is stored in a given file, you should use the name attribute concatenated with the name of the feature dictionary. For example, if name is 'JKR' (for J. K. Rowling), then we would suggest using the filenames:

'JKR_words'
'JKR_word_lengths'

In general, the filenames are self.name + '_' + name_of_dictionary. Taking this approach will ensure that you don’t overwrite one model’s dictionary files when you go to save another model.
It may help to use the code for sample_file_write as a starting point for this method, but don’t forget that you should create a separate file for each dictionary.
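For Parts I and II, a sketch of save_model along these lines would be (the list of dictionaries gets extended in Part III):

```python
def save_model(self):
    """Save the TextModel object self by writing each of its
    feature dictionaries to its own file.
    """
    # For now, just the words and word_lengths dictionaries;
    # Part III adds the other three.
    for dict_name, d in [('words', self.words),
                         ('word_lengths', self.word_lengths)]:
        f = open(self.name + '_' + dict_name, 'w')
        f.write(str(d))    # write the string form of the dictionary
        f.close()
```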
Write a method read_model(self) that reads the stored dictionaries for the called TextModel object from their files and assigns them to the attributes of the called TextModel. This is the complementary method to save_model, and you should assume that the necessary files have filenames that follow the naming scheme used in save_model.
It may help to use the code for sample_file_read as a starting point for this method, but note that you should read a separate file for each dictionary. Remember that you can use the dict and eval functions to convert a string that represents a dictionary to an actual dictionary object.
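A matching sketch of read_model, using the same filenames that save_model writes:

```python
def read_model(self):
    """Read the stored dictionaries for this TextModel from
    their files and assign them to its attributes.
    """
    f = open(self.name + '_words', 'r')
    self.words = dict(eval(f.read()))         # string -> dictionary
    f.close()

    f = open(self.name + '_word_lengths', 'r')
    self.word_lengths = dict(eval(f.read()))
    f.close()
```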
Examples:
You can test your save_model and read_model methods as follows:
# Create a model for a simple text, and save the resulting model.
>>> model = TextModel('A. Poor Righter')
>>> model.add_string("The partiers love the pizza party.")
>>> model.save_model()

# Create a new TextModel object with the same name as the original one,
# and assign it to a new variable.
>>> model2 = TextModel('A. Poor Righter')

# Read the dictionaries that were saved for the original model,
# and use them as the dictionaries of `model2`.
>>> model2.read_model()

>>> print(model2)
text model name: A. Poor Righter
  number of words: 5
  number of word lengths: 4

>>> model2.words
{'party': 1, 'partiers': 1, 'pizza': 1, 'love': 1, 'the': 2}
>>> model2.word_lengths
{8: 1, 3: 2, 4: 1, 5: 2}
Login to Gradescope by clicking the link in the left-hand navigation bar, and click on the box for CS 111.
Submit a version of finalproject.py that contains at least your work for Parts I and II. If your file includes incomplete work for Parts III-V that might prevent us from testing your work for Parts I and II, you should copy the file into a different folder (keeping the same name), and remove any code that might interfere with our testing. Test your file before you submit it by running it in Spyder and making calls to your methods/functions from Parts I and II.
IMPORTANT: If you chose to work on the final project with a partner, only one person from the pair should submit the file, and that person should add the other person as a group member following step 6 below.
Here are the steps:
Click on the name of the assignment in the list of assignments. You should see a pop-up window with a box labeled DRAG & DROP. (If you don’t see it, click the Submit or Resubmit button at the bottom of the page.)
Add your file to the box labeled DRAG & DROP. You can either drag and drop the file from its folder into the box, or you can click on the box itself and browse for the file.
Click the Upload button.
You should see a box saying that your submission was successful.
Click the (x) button to close that box.
The Autograder will perform some tests on your file. Once it is done, check the results to ensure that the tests were passed. If one or more of the tests did not pass, the name of that test will be in red, and there should be a message describing the failure. Based on those messages, make any necessary changes. Feel free to ask a staff member for help.
Note: You will not see a complete Autograder score when you submit. That is because additional tests for at least some of the problems will be run later, after the final deadline for the submission has passed. For such problems, it is important to realize that passing all of the initial tests does not necessarily mean that you will ultimately get full credit on the problem. You should always run your own tests to convince yourself that the logic of your solutions is correct.
If you worked with a partner and you are the one who is submitting the file:
Click on the Add Group Member link that appears below your name above the results of the Autograder.
In the pop-up box that appears, click on the Add Member link.
Type your partner’s name or choose it from the drop-down menu.
Click the Save button.
Check to ensure that your partner’s name now appears below your name above the results of the Autograder.
If needed, use the Resubmit button at the bottom of the page to resubmit your work. Important: Every time that you make a submission, you should submit all of the files for that Gradescope assignment, even if some of them have not changed since your last submission.
Near the top of the page, click on the box labeled Code. Then click on the name of the file to view its contents. Check to make sure that the file contains the code that you want us to grade.
Important
It is your responsibility to ensure that the correct version of every file is on Gradescope before the final deadline. We will not accept any file after the submission window for a given assignment has closed, so please check your submissions carefully using the steps outlined above.
If you are unable to access Gradescope and there is enough time to do so, wait an hour or two and then try again. If you are unable to submit and it is close to the deadline, email your homework before the deadline to cs111-staff@cs.bu.edu.
Update your __init__ method so that it initializes attributes for three additional dictionaries:

stems – a dictionary that records the number of times each word stem appears in the text.

sentence_lengths – a dictionary that records the number of times each sentence length (i.e., the number of words in a sentence) appears.

an appropriately named dictionary that records the frequencies of whatever additional feature you have chosen to include in your TextModel (see the section above entitled Your task for possible options).
Important: Like the other features in the model, your extra feature must involve maintaining frequencies – i.e., counting how often different things occur. This means that the value portion of each key-value pair in the dictionary must be an integer.
In addition, the feature that you choose should not be too closely tied to a specific type of text. Rather, it must be generic enough to apply to any reasonably long English paragraph.
Write a helper function named stem(s) that accepts a string s as a parameter. The function should return the stem of s. The stem of a word is the root part of the word, which excludes any prefixes and suffixes. For example:
>>> stem('party')
result: 'parti'
>>> stem('parties')
result: 'parti'
>>> stem('love')
result: 'lov'
>>> stem('loving')
result: 'lov'
Notes:
We will discuss stemming in lecture on 11/30.
Like clean_text, this is a regular function and not a method, so you should define it outside of your TextModel class. When you call it, you should just use the function name; you do not need to prepend a called object.
The stem of a word is not necessarily a word itself!
This function does not have to work perfectly for all possible words and stems. Instead, you should handle a number of common cases for stems that work for many words, as we will discuss in lecture.
The number of different cases that your function is able to handle is up to you. For full credit, your function should handle at least seven distinct cases, each of which applies to multiple words.
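Here is a partial, rule-based sketch that is consistent with the examples above. The particular suffix rules are our own illustrative choices, and they cover fewer cases than your version needs:

```python
def stem(s):
    """Return the stem of the word s, using a handful of
    suffix-based rules (a partial sketch, not a full stemmer).
    """
    if s.endswith('ies'):       # parties -> parti
        s = s[:-3] + 'i'
    elif s.endswith('ing'):     # loving -> lov
        s = s[:-3]
    elif s.endswith('y'):       # party -> parti
        s = s[:-1] + 'i'
    elif s.endswith('e'):       # love -> lov
        s = s[:-1]
    elif s.endswith('s'):       # cats -> cat
        s = s[:-1]
    return s
```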
Extend your add_string method to update the feature dictionaries for word stems, sentence lengths, and your chosen additional feature.
Notes:
You should update the sentence lengths dictionary before you clean the text. Once you remove the punctuation from the string, it will be difficult to count the sentences.
You should make use of the stem function that you wrote above, and you should define any additional helper functions/methods as you see fit. In particular, you may need one or more helper functions related to the dictionary for your chosen additional feature.
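One possible helper for counting sentence lengths before the text is cleaned. It assumes that a sentence ends whenever a word ends in '.', '?', or '!' (an assumption that abbreviations like 'Mr.' will violate), and sentence_lengths_in is a hypothetical name of our own:

```python
def sentence_lengths_in(s):
    """Return a list of the sentence lengths (in words) in the
    string s, assuming sentences end in '.', '?', or '!'.
    """
    lengths = []
    count = 0
    for word in s.split():
        count += 1
        if word[-1] in '.?!':   # this word ends a sentence
            lengths.append(count)
            count = 0
    return lengths
```

Inside add_string, each length in the returned list would then be counted in self.sentence_lengths.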
Update the following methods to incorporate the feature dictionaries for word stems, sentence lengths, and your chosen additional feature:
__repr__
save_model
read_model
Test your new code by performing the following test. Here again, your dictionary contents may be printed in a different order. In addition, your dictionary for stems may be slightly different, depending on how you implemented the stem function:
>>> model = TextModel('A. Poor Righter')
>>> model.add_string("The partiers love the pizza party.")
>>> print(model)
text model name: A. Poor Righter
  number of words: 5
  number of word lengths: 4
  number of stems: 4
  number of sentence lengths: 1
  # info for your additional feature goes here!
>>> model.words
result: {'party': 1, 'partiers': 1, 'pizza': 1, 'love': 1, 'the': 2}
>>> model.word_lengths
result: {8: 1, 3: 2, 4: 1, 5: 2}
>>> model.stems
result: {'parti': 2, 'the': 2, 'pizza': 1, 'lov': 1}
>>> model.sentence_lengths
result: {6: 1}
Optional: Add extra functionality to your TextModel object as you see fit. This may include improving your algorithms for cleaning the strings or finding stems, or it may entail adding even more feature dictionaries beyond the ones that we have required.
We will discuss Parts IV and V in lecture on 11/30.
In this part of the project, you will first implement the core algorithm that will allow you to compare bodies of text. This algorithm will produce a numeric similarity score that measures how similar one body of text is to another, based on one type of feature (e.g., word lengths). You will then compute scores of this type for all five of the features, and use them to classify a piece of text as being more likely to come from one source than another.
The similarity score that we will compute is based on a statistical model known as a Naive Bayes probability model. Despite the “naive” in its name, scores computed using this model have been very successful in distinguishing spam email from non-spam (“ham”) email, among other classification problems.
In essence, the Naive Bayes scoring algorithm works in the following way: You give it feature counts (e.g., word counts) from one body of text and feature counts from a second body of text. The algorithm will then compute the likelihood that the second body of text is from the same source as the first body of text! The reason that the algorithm is called “naive” is that it makes the assumption that each item in a given feature set is independent. For example, it assumes that the appearance of the word “spell” does not depend on the appearance of the word “potter” – and that this independence holds for all pairs of words. This assumption is certainly not true, but that turns out not to matter in many situations!
You can read more details about the use of Naive Bayes probabilities for classification on Wikipedia if you would like to know more, but all of the necessary information is summarized below.
How it works
To illustrate how the Bayesian scoring algorithm works, let’s assume that the only features we care about are the individual words in the texts. As you have already done in your TextModel class, we can use a Python dictionary to model all of the words in a text. The dictionary’s keys are words, and the value for a given word is the number of times that it appears in the text.
For example, let’s assume that we have two text documents:
a source text (which we are pretending was written by Shakespeare!) that has the following dictionary:
shakespeare_dict = {'love': 50, 'spell': 8, 'thou': 42}
This document has 100 words in all: 50 occurrences of the word “love,” 8 of “spell,” and 42 of “thou.”
a mystery text (author unknown) whose dictionary looks like this:
mystery_dict = {'love': 3, 'thou': 1, 'potter': 2, 'spam': 4}
This document has 10 words in all: three occurrences of “love,” one of “thou,” two of “potter,” and four of “spam.”
The Bayesian similarity score between these two texts attempts to measure the likelihood that the ten words in the mystery text come from the same class of text as the 100 words in the source text. (A given class of text could be based on a particular author or publication, or on other characteristics of the texts in question.)
To calculate the score, we first take each word in the mystery text and compute a probability for it that is based on the number of times that it occurs in the source text. If a word in the mystery text doesn’t occur at all in the source text (which would lead to a probability of 0), we instead compute a probability that is based on a “default” word count of 0.5. This will allow us to avoid multiplying by 0 when we compute the final score.
Here are the probabilities for the words in our mystery text:

'love': 50/100 = 0.5
'thou': 42/100 = 0.42
'potter': 0.5/100 = 0.005 (using the default count of 0.5, since 'potter' does not appear in the source text)
'spam': 0.5/100 = 0.005 (again using the default count)

Important: These probabilities have denominators of 100 because the source text has 100 words in it. The denominators should not always be 100!
To compute the similarity score of the mystery text, we need to compute a product in which a given word’s probability is multiplied by itself n times, where n is the number of times that the word appears in the mystery text. In this case, we would do the following:
#             3 "love"      1 "thou"  2 "potter"      4 "spam"
sim_score = (.5*.5*.5)    * (.42)   * (.005*.005)   * (.005*.005*.005*.005)
This similarity score is very small! In practice, these very small values are hard to work with, and they can become so small that Python’s floating-point values cannot accurately represent them! Therefore, instead of using the probabilities themselves, we will use the logs of the probabilities. The log operation transforms multiplications into additions (and exponents into multiplication), so our log-based similarity score would be:
log_sim_score = 3*log(.5) + 1*log(.42) + 2*log(.005) + 4*log(.005)
This results in a more manageable value of around -34.737. (Note that Python’s math.log function uses the natural log (of base e) by default, which is fine for our purposes.)
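You can check this arithmetic directly in Python with math.log:

```python
import math

# Log similarity score of the mystery text against the Shakespeare
# model, using the probabilities computed above.
log_sim_score = (3 * math.log(0.5)       # 3 occurrences of "love"
                 + 1 * math.log(0.42)    # 1 occurrence of "thou"
                 + 2 * math.log(0.005)   # 2 occurrences of "potter"
                 + 4 * math.log(0.005))  # 4 occurrences of "spam"

print(round(log_sim_score, 3))   # -34.737
```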
The resulting similarity score gives us a measure of how similar the mystery text is to the source text. To classify a new mystery text, we compute similarity scores between it and a collection of known texts in order to determine which of the known texts is most likely to be related to the mystery text.
For example, let’s say that we also have the following model for texts by J.K. Rowling:
jkr = {'love': 25, 'spell': 275, 'potter': 700}
Note that there are a total of 1000 words in this model.
We can compute a similarity score for the mystery text and the jkr texts in the same way that we did above.
We get the following probabilities that the words in the mystery text came from the jkr texts:

'love': 25/1000 = 0.025
'thou': 0.5/1000 = 0.0005 (default count, since 'thou' does not appear in the jkr model)
'potter': 700/1000 = 0.7
'spam': 0.5/1000 = 0.0005 (default count)

Thus, the non-log similarity score for 3 occurrences of “love”, 1 occurrence of “thou”, 2 occurrences of “potter”, and 4 occurrences of “spam” would be
sim_score = (.025*.025*.025) * (.0005) * (.7*.7) * (.0005*.0005*.0005*.0005)
This value is also very small! Using logs:
log_sim_score = 3*log(.025) + 1*log(.0005) + 2*log(.7) + 4*log(.0005)
Now the similarity score is approximately -49.784. This value is less than the value that we computed when comparing the mystery text to the Shakespeare text. Therefore, we can conclude that the mystery text is more likely to have come from Shakespeare than from J.K. Rowling.
Your tasks
Write a function (not a method, so it should be outside the class) named compare_dictionaries(d1, d2). It should take two feature dictionaries d1 and d2 as inputs, and it should compute and return their log similarity score. Here is some pseudocode for what you will need to do:

Start the score at zero.

Let total be the total number of words in d1 – not only distinct items, but all of the repetitions of all the items as well. (For example, total for our example shakespeare_dict would be 100.)

For each item in d2:
- Check if the item is in d1.
- If so, add the log of the probability that the item would be chosen at random from everything in d1, multiplied by the number of times it appears in d2.
- If not, add the log of the default probability (0.5 / total), multiplied by the number of times the item appears in d2.

Return the resulting score.
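The pseudocode above translates fairly directly into Python. Here is one possible version (a sketch; your details may differ):

```python
import math

def compare_dictionaries(d1, d2):
    """Return the log similarity score of feature dictionary d2
    with respect to feature dictionary d1.
    """
    score = 0
    total = 0
    for item in d1:              # total count, including repetitions
        total += d1[item]
    for item in d2:
        if item in d1:
            probability = d1[item] / total
        else:
            probability = 0.5 / total    # the default probability
        score += d2[item] * math.log(probability)
    return score
```

With the shakespeare_dict and mystery_dict from the earlier example, compare_dictionaries(shakespeare_dict, mystery_dict) is approximately -34.737.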
Write a method similarity_scores(self, other) that computes and returns a list of log similarity scores measuring the similarity of self and other – one score for each type of feature (words, word lengths, stems, sentence lengths, and your additional feature). You should make repeated calls to compare_dictionaries, and put the resulting scores in a list that the method returns.

Important: In each call to compare_dictionaries, the dictionary belonging to self should be the second parameter of the call. For example:

word_score = compare_dictionaries(other.words, self.words)
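A sketch of the method, shown inside a stripped-down TextModel with only two feature dictionaries (your class has five; compare_dictionaries here is a condensed version of the function from the previous task):

```python
import math

def compare_dictionaries(d1, d2):
    """Log similarity score of d2 with respect to d1."""
    total = sum(d1.values())
    score = 0
    for item in d2:
        prob = d1[item] / total if item in d1 else 0.5 / total
        score += d2[item] * math.log(prob)
    return score

class TextModel:
    def __init__(self, model_name):
        self.name = model_name
        self.words = {}
        self.word_lengths = {}

    def similarity_scores(self, other):
        """Return a list of log similarity scores measuring the
        similarity of self and other, one per feature dictionary.
        Note that self's dictionary is the *second* argument.
        """
        word_score = compare_dictionaries(other.words, self.words)
        word_length_score = compare_dictionaries(other.word_lengths,
                                                 self.word_lengths)
        # ... add one more call for each remaining feature:
        # stems, sentence lengths, and your additional feature.
        return [word_score, word_length_score]
```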
Finally, write a method classify(self, source1, source2) that compares the called TextModel object (self) to two other “source” TextModel objects (source1 and source2) and determines which of these other TextModels is the more likely source of the called TextModel.

You should begin by calling similarity_scores twice:
scores1 = self.similarity_scores(source1)
scores2 = self.similarity_scores(source2)
Next, print the two lists of scores, preceding each list of scores with the name of the source TextModel – for example:
scores for Shakespeare: [-34.737, ...]
scores for J.K. Rowling: [-49.091, ...]
(Note: If you like, you can use the round function to round the printed scores to 2 or 3 places after the decimal, but doing so is not required.)
You should then use these two lists of scores to determine whether the called TextModel is more likely to have come from source1 or source2. One way to do this is to compare corresponding pairs of scores, and determine which of the source TextModels has the larger number of higher scores.
For example, imagine that the two sets of scores are the following:
scores1: [-34.737, -25.132, -55.312, -10.715, -47.125]
scores2: [-49.091, -21.071, -60.154, -16.502, -43.675]
scores1 has a higher score for three of the features (the ones in positions 0, 2, and 3 of the lists), while scores2 has a higher score for only two of the features (the ones in positions 1 and 4). Thus, we conclude that self is more likely to have come from source1.
You should also feel free to take an alternative approach to using the two lists of scores. For example, you could compute a weighted sum of the scores in each list by doing something like this:
weighted_sum1 = 10*scores1[0] + 5*scores1[1] + 7*scores1[2] + ...
weighted_sum2 = 10*scores2[0] + 5*scores2[1] + 7*scores2[2] + ...
You could then base your classification on which source’s weighted sum is larger. One advantage of this approach is that it allows you to adjust the relative importance of the different features – giving certain features a larger impact on the classification.
Your classify method should report its conclusions, using the names of the relevant TextModel objects. For example:

mystery is more likely to have come from Shakespeare

The method does not need to return anything.
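Here is one possible sketch of classify that uses the pairwise-comparison approach (ties here favor source1; a weighted-sum version would replace the counting loop):

```python
def classify(self, source1, source2):
    """Compare self to the models source1 and source2, print the
    two lists of similarity scores, and report which source is
    the more likely origin of self.
    """
    scores1 = self.similarity_scores(source1)
    scores2 = self.similarity_scores(source2)
    print('scores for ' + source1.name + ':', scores1)
    print('scores for ' + source2.name + ':', scores2)

    # Count how many features give each source the higher score.
    wins1 = 0
    wins2 = 0
    for i in range(len(scores1)):
        if scores1[i] > scores2[i]:
            wins1 += 1
        elif scores2[i] > scores1[i]:
            wins2 += 1

    if wins1 >= wins2:
        likely = source1.name
    else:
        likely = source2.name
    print(self.name + ' is more likely to have come from ' + likely)
```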
Testing
Here is one function that you can use to test your TextModel implementation:
# Copy and paste the following function into finalproject.py
# at the bottom of the file, *outside* of the TextModel class.

def test():
    """ your docstring goes here """
    source1 = TextModel('source1')
    source1.add_string('It is interesting that she is interested.')

    source2 = TextModel('source2')
    source2.add_string('I am very, very excited about this!')

    mystery = TextModel('mystery')
    mystery.add_string('Is he interested? No, but I am.')
    mystery.classify(source1, source2)
Here is what we get when we run this function using our TextModel
implementation:
>>> test()
scores for source1: [-16.394, -9.92, -15.701, -1.386, -1.386]
scores for source2: [-17.087, -15.008, -17.087, -1.386, -3.466]
mystery is more likely to have come from source1
Some of your scores will be different than ours (e.g., the third score
in each list, which depends on how you stem the words, and the fifth
score in each list, which depends on which additional feature you
include). Our conclusion is based on a pairwise comparison of the
scores: because source1 has a larger number of higher scores, it is
chosen as the more likely source. If you use the lists of scores in
another way, you may come to a different conclusion, which is fine!
Now that your TextModel class is complete and you have tested its
ability to compare texts, you should choose several bodies of text
from which you can create models and compute similarity scores.
Choose the bodies of text from which you will create your two “source” models. In the example we provided at the start of Part IV, we chose William Shakespeare and J. K. Rowling texts for our source models, and then selected a new mystery text to compare them against. For this part, you should similarly choose two bodies of text for your source models, and in the next part you will choose new texts to compare them against.
Note that we say bodies of text, because a given text model can be based on more than one text document. For example, if you want to build a text model for New York Times articles, you should base it on multiple articles from the Times.
You should choose two bodies of text that allow for meaningful comparisons. For example:
You are welcome to choose whatever texts you like, but as a starting point, here is a link to a text file containing the complete works of William Shakespeare. You should not use this file as it is. You should download it, open it, and remove the text at the beginning and end that explains the file and provides additional information. This is true of any text file(s) that you use – you should inspect them and perform whatever human pre-processing is necessary to clean the file before handling it computationally.
Once you have chosen your two types/bodies of text, you will need to
find the texts that will define the model for each of them. Once you
have obtained and pre-processed the texts, you should create
TextModel objects for them and save them to files using the
save_model method.
Important
Make sure to save your texts as plain-text files.
Note: We encourage you to leave out at least one text from each body of text when creating your models. This will allow you to use it for testing. For example, if your two bodies of text are a collection of articles from the New York Times and a collection of articles from the Wall Street Journal, you can use most of the articles from a given collection to build its text model, but leave out one article from each collection so that you can use it for testing.
Once you have your two source models, you should choose at least four new text documents (or collections of text documents) – texts not used in the creation of your source models – that you would be interested in classifying according to your source models.
For two of your classifications, you can use the texts that you left out when creating your source models. For example, if your source models are for Times and WSJ articles, you can perform tests to see if the Times article that you left out when building your source models is really more similar to the other Times articles than it is to the WSJ articles.
The other two classifications are up to you. For example, you could see if:
Be creative!
For each text/collection of texts that you want to classify, you will
again need to obtain one or more text files, pre-process them, and
create a TextModel object from them. You should then invoke the
classify method on that TextModel to see which of your source models
is the more likely source.
To get started, copy-and-paste the following function inside of
finalproject.py but outside of the TextModel class:
# Copy and paste the following function into finalproject.py
# at the bottom of the file, *outside* of the TextModel class.

def run_tests():
    """ your docstring goes here """
    source1 = TextModel('rowling')
    source1.add_file('rowling_source_text.txt')

    source2 = TextModel('shakespeare')
    source2.add_file('shakespeare_source_text.txt')

    new1 = TextModel('wr100')
    new1.add_file('wr100_source_text.txt')
    new1.classify(source1, source2)

    # Add code for three other new models below.
You should replace the model names and file names in the provided
code with the names of your models and text files. Don't forget
that you can use more than one file to build a given model, in
which case you would call add_file multiple times for that model.
In a plain-text file named reflection.txt, write a brief report
containing the following information:
Your reflection.txt file should be approximately two paragraphs
in length.
Coming soon!
Last updated on December 2, 2020.