If you don’t see your question here, post it on Piazza or come to office hours! See the links in the navigation bar for both of those options.
What parts of the final project do we need to complete for the milestone that is due as part of PS 10?
You should complete and submit Parts I and II from the final project.
I am having trouble removing punctuation using a loop. What should I do?
It can be difficult to diagnose this issue without seeing your code. In this case, it may be easier to repeatedly call the replace method, as we describe in the problem set. That approach does not require you to manually loop through each character in the string.
Note that you can avoid writing out a separate call to replace for each symbol by using a loop that removes one punctuation symbol at a time.
For example:
    for symbol in """.,?"'!""":    # add additional punctuation here
        # use replace to remove symbol from your text
Note that we use triple quotes to surround the string of punctuation symbols so that we can include a single-quote character and a double-quote character within the string.
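As a minimal sketch of how that loop might be filled in (the function name remove_punctuation is hypothetical, and your own code may be organized differently), keep in mind that replace returns a new string rather than changing the original, so the result must be assigned back to a variable:

```python
def remove_punctuation(text):
    """Return a copy of text with a few punctuation symbols removed."""
    for symbol in """.,?"'!""":
        # Strings are immutable, so keep the string that replace returns.
        text = text.replace(symbol, '')
    return text

print(remove_punctuation('"Hello," she said. It\'s here!'))
# prints: Hello she said Its here
```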
When splitting the text in clean_text, what separator character should we pass in to the split method?
None. Because we want split to split on all whitespace characters (spaces, tabs, and newlines), you should not pass in a separator – not even a space. Rather, you should use empty parentheses when calling split, which will cause it to split on all whitespace characters.
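The difference is easiest to see on a string that mixes several kinds of whitespace. The sample string below is made up for illustration:

```python
text = 'the  quick\tbrown\nfox '

words = text.split()       # no separator: split on any run of whitespace
pieces = text.split(' ')   # explicit space: tabs and newlines are not split

print(words)    # ['the', 'quick', 'brown', 'fox']
print(pieces)   # ['the', '', 'quick\tbrown\nfox', '']
```

With an explicit space, the double space produces an empty string, and the tab and newline stay inside one of the "words".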
My clean_text function splits the text into a list of words, and then it uses a loop to clean each word in the list. However, when I look at the return value, the individual words in the list have not been changed. What am I doing wrong?
Don’t forget that you can only change the internals of a list if you assign something to one of the positions in the list. For example, consider the following code fragment:
    my_words = ['hello', 'world']
    for w in my_words:
        w = w.upper()    # changes w, but *not* my_words!
    print(my_words)
If you run this code fragment, you’ll see that the contents of my_words are unchanged. That’s because the assignment inside the loop changes w, but it doesn’t change the contents of my_words. In order to change the contents of my_words, we would need to use an index-based loop.
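A minimal sketch of such an index-based loop, reusing the my_words list from the fragment above:

```python
my_words = ['hello', 'world']

for i in range(len(my_words)):
    # Assigning to my_words[i] changes the list itself.
    my_words[i] = my_words[i].upper()

print(my_words)   # ['HELLO', 'WORLD']
```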
Is my clean_text function good enough?
The clean_text function should at least remove the specified punctuation symbols and make every letter lowercase. Also remember that clean_text must return a list of words. This means you should split up the cleaned string inside of clean_text. If you are unsure of how to do this, check out the word_frequencies function in the example code online.
These are the minimum requirements. If you have time, you are welcome to take additional steps to further clean the text.
How do I update the words and word_lengths dictionaries?
You should start by reading the pseudocode we’ve given you for add_string.
Note that the for loop in the pseudocode goes through each word in a list called word_list that contains all of the words in the original string. You should complete the body of that loop so that, for each value of the variable w, it updates the frequency of w in the self.words dictionary. What are the keys for that dictionary? How can you correctly update that dictionary in light of the current word w? You may want to review the example code from PS 9 for a reminder of how to update a dictionary.
When you update word_lengths, you will need another loop that loops through every word in word_list, but what are the keys in the word_lengths dictionary? How can you transform a word into a key in this dictionary? Once you answer these questions, you can add the code needed to update word_lengths.
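As a reminder of the general pattern for counting with a dictionary, here is a standalone sketch. It uses plain variables rather than self.words and self.word_lengths, and it assumes that the keys of word_lengths are integer lengths; check the assignment to confirm that this matches what is expected. The sample word_list is made up.

```python
word_list = ['the', 'cat', 'saw', 'the', 'bird']   # made-up sample data

words = {}          # word -> number of times it appears
word_lengths = {}   # length of a word -> number of words with that length

for w in word_list:
    if w not in words:
        words[w] = 1       # first time we have seen this word
    else:
        words[w] += 1      # seen before, so increase its count

for w in word_list:
    length = len(w)        # transform the word into its key
    if length not in word_lengths:
        word_lengths[length] = 1
    else:
        word_lengths[length] += 1

print(words)          # {'the': 2, 'cat': 1, 'saw': 1, 'bird': 1}
print(word_lengths)   # {3: 4, 4: 1}
```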
How do I read from a file?
In lecture, we presented two different ways to read the contents of a file. You can consult the lecture notes from a couple of weeks ago, or check out the example code online. In the problem set, we recommend reading in the entire file into a single string and then adding that string to your model.
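One possible way to read an entire file into a single string is sketched below. The function name read_whole_file and the demo file are hypothetical, and the version shown in lecture may open and close the file differently.

```python
import os
import tempfile

def read_whole_file(filename):
    """Return the entire contents of the named text file as one string."""
    with open(filename, 'r', encoding='utf8') as f:
        return f.read()

# Demo: write a small throwaway file, then read it back.
demo_path = os.path.join(tempfile.gettempdir(), 'faq_demo.txt')
with open(demo_path, 'w', encoding='utf8') as f:
    f.write('Hello there.\nHow are you?\n')

text = read_whole_file(demo_path)
print(len(text.split()))   # 5
os.remove(demo_path)
```

Once the contents are in a single string, that string can be handed to whatever method adds a string to the model.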
How can I test add_file?
In Spyder, open up a new file. It doesn’t matter what you call it, but you must save it in the same directory as your final project. Add a few sentences to the file and save it. Suppose you called the file foo. Try adding the file to a TextModel object.
    model = TextModel("Test")    # you can call the model anything you want
    model.add_file("foo")        # we want to add the file `foo` to the model
(Note: You should replace "foo" with the full name of the file that you saved in Spyder. If Spyder gave the file a .py extension, you should include that .py in the name of the file. If Spyder gave the file a .txt extension, you should include that .txt in the name of the file.)
Now try printing the model and the dictionaries that it contains. Do the right words and frequencies appear in the model? If everything looks good, then your add_file function should be fine. If not, it may be an issue in add_file or in any methods you use inside of the function. You can use debugging print statements to narrow down the cause of the issue.
Why are my save_model and read_model functions not working properly?
Go through the test case we give you in the assignment one step at a time. After you save a model, open one of the dictionary files using a text editor (such as the editor in Spyder). Are the correct dictionaries inside of the files? If so, the issue is likely inside of your read_model function. Remember that after you read the dictionaries from the appropriate file, you must store them somewhere in the TextModel object. Which variable in read_model represents the TextModel object itself? Which attributes should you update so that they refer to the dictionaries you loaded from the files?
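The sketch below illustrates the point about storing what you load. Everything in it is an assumption made for illustration: the TinyModel class, the folder parameter, the file name, and the use of ast.literal_eval to turn the saved text back into a dictionary. Follow the assignment for the actual file names and for the way dictionaries should be written and read.

```python
import ast
import os
import tempfile

class TinyModel:
    """Hypothetical stand-in for TextModel with a single dictionary."""

    def __init__(self, name):
        self.name = name
        self.words = {}

    def save_model(self, folder):
        # Write the dictionary to a file as text.
        path = os.path.join(folder, self.name + '_words')
        with open(path, 'w') as f:
            f.write(str(self.words))

    def read_model(self, folder):
        path = os.path.join(folder, self.name + '_words')
        with open(path, 'r') as f:
            loaded = ast.literal_eval(f.read())
        # Without this assignment, the loaded dictionary is simply lost.
        self.words = loaded

with tempfile.TemporaryDirectory() as folder:
    original = TinyModel('demo')
    original.words = {'the': 2, 'cat': 1}
    original.save_model(folder)

    restored = TinyModel('demo')
    restored.read_model(folder)

print(restored.words)   # {'the': 2, 'cat': 1}
```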
How do we update sentence_lengths
in our add_string
method?
Let’s try to break this problem up into smaller parts. The first thing you can do is split your string into a list of words, but without removing any punctuation. If you were to go through every word in this list, what would it mean if you found a word that ended with a punctuation mark? How could you use this fact to count the number of words in each sentence? You will need to use some type of cumulative computation, and you should be careful to reset your count as needed.
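The cumulative computation described above can be sketched as follows. This standalone function (the name count_sentence_lengths is hypothetical) assumes that a sentence ends at any word whose last character is a period, question mark, or exclamation point, and that the dictionary maps a sentence length to the number of sentences with that length. Trailing words that are not followed by one of those symbols are not counted.

```python
def count_sentence_lengths(text):
    """Map each sentence length (in words) to how many sentences have it."""
    sentence_lengths = {}
    count = 0                       # words seen so far in the current sentence
    for w in text.split():          # split without removing punctuation
        count += 1
        if w[-1] in '.?!':          # this word ends a sentence
            sentence_lengths[count] = sentence_lengths.get(count, 0) + 1
            count = 0               # reset for the next sentence
    return sentence_lengths

print(count_sentence_lengths('I am here. Are you there? Yes!'))
# prints: {3: 2, 1: 1}
```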
When splitting the text in order to determine the sentence lengths, what separator character should we pass in to the split method?
None. Because we want split to split on all whitespace characters (spaces, tabs, and newlines), you should not pass in a separator – not even a space. Rather, you should use empty parentheses when calling split, which will cause it to split on all whitespace characters.
The numbers that my compare_dictionaries function produces seem too negative. What is an acceptable range?
There is no specific range of numbers required other than the fact that similarities should be less than or equal to 0. If you are getting positive numbers, then you should try debugging your similarity score function.
How do I know if my methods are working?
One thing to try is the test function that we give you near the end of Part IV. Copy and paste this function into the bottom of your finalproject.py file – outside of the TextModel class – and try calling test() from the Shell. Compare the scores that you get for source1 and source2 with the ones that we get. Not all of the scores need to match ours, but some of them should – in particular, the first scores (the ones based on the words dictionaries) and the second scores (the ones based on the word_lengths dictionaries) should be the same as ours.
It’s also possible that your classify method may conclude that mystery is more likely to have come from source2 (rather than source1, as our method concludes). That may be fine as well. Just make sure that your classify method is making the correct conclusion for the two lists of five scores that your code produces.
When I test my code using the test() function that you provide or when I submit my code on Gradescope, I seem to be getting an incorrect score for just the word_lengths dictionary. Why would that be the case?
In the instructions for the similarity_scores method, there is a note labeled Important. Review that note, and make sure that your similarity_scores method is following its guidelines. If you aren’t following the guidelines in that note, it’s possible in certain cases for some but not all of the scores to be incorrect.
I get some unexpected results when I compare texts. Is that a problem?
Not necessarily. Depending on the texts that you use to build your models, it’s possible that the classifications may not be correct. For example, your method may claim a GQ article is from Cosmopolitan, or vice versa. This can happen, and it does not necessarily indicate that your code is wrong. As long as you get reasonable results for our test() function from Part IV (see above), you should be fine.
Make sure to document your results in your reflections, including any unexpected results that you encounter. Try to think of reasons for the unexpected results. Does our approach to modeling a body of text miss some important features that differentiate the sources that you used?
Last updated on December 2, 2020.