Project 4D: Wordnet (k!=0)

Lectures needed for this project:

  • Lecture 16 (Extends, Sets, Maps, and BSTs)
  • Lectures 20, 21 (Graph Traversals and Implementations)

Partner policy: No partners. Discussing ideas with other students is allowed, but code sharing is not allowed, and the solutions you submit should be your own work! More details on the policies page.

See here for a video overview of the project.

In this project, you’ll complete your implementation of Wordnet by handling the k != 0 case.

Setup

This setup is identical to that from Project 4C: Wordnet (k == 0). However, you’ll have to copy your implementation from Project 4C as we’ll outline below.

Follow the Assignment Workflow Guide to get started with this assignment. The starter code is in the proj4d folder.

You’ll also need to download the Project 4 data files (they are around 150 MB, which is too large to be pushed to GitHub).

The data for this assignment includes a new set of files in addition to the ones from Project 4A:

  1. ngrams: word history and year history files (Project 4A)
  2. wordnet: synset and hyponym files (Project 4C and 4D)

Download the data files at this link. Note: data files will be released when the skeleton is released.

You should unzip this file into the proj4d directory such that the data folder is at the same level as the src and static folders.

Once you are done with this step, your proj4d directory should look like this:

proj4d
├── data
│   ├── ngrams
│   └── wordnet
├── src
├── static
└── tests

You’ll notice that this skeleton is (almost) the exact same as the Project 4A skeleton. Task TODO of this project uses the TimeSeries and NGramMap classes from Project 4A, which is why we have provided placeholder implementations for those classes. This includes a working implementation of the countHistory method that uses a new library in library-sp24.

The placeholder implementations throw UnsupportedOperationExceptions for some methods. You will not need these methods. The given placeholder implementations are sufficient to complete Project 4C and 4D.

If you want to copy in your own NGramMap and TimeSeries from Project 4A, you can. However, we suggest only doing so after you get a full score on Project 4C and 4D, just in case your implementation has any subtle bugs in it.

After importing library-fa25, the code in NGramMap.java should no longer be red.

Copy in your own implementation and any helper classes you used to complete Project 4C.

Objective: Handling k != 0

In Project 4C, we handled the situation where k = 0, which is the default value when the user does not enter a k value.

Your final objective is to handle the case where the user enters k. k represents the maximum number of hyponyms that we want in our output. For example, if someone enters the word “dog”, and then enters k = 5, your code would return at most 5 words (exactly 5 whenever at least 5 hyponyms of “dog” have non-zero counts).

To choose the 5 hyponyms, you should return the k words which occurred the most times in the time range requested. For example, if someone entered words = "food, cake", startYear = 1950, endYear = 1990, and k = 5, then you would find the 5 most popular words in that time period that are hyponyms of both food and cake. Here, the popularity is defined as the total number of times the word appears over the entire time period. The words should then be returned in alphabetical order. In this case, the answer is [biscuit, cake, kiss, snap, wafer] if we’re using top_49887_words.csv, synsets.txt, and hyponyms.txt.
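The popularity computation described above can be sketched as follows. This is only an illustration, not the required implementation: it assumes a word's history is available as a TreeMap<Integer, Double> from year to count (an assumption about how your TimeSeries is represented), and the class and method names PopularitySketch and totalCount are hypothetical.

```java
import java.util.TreeMap;

// Hypothetical helper, not part of the skeleton.
class PopularitySketch {
    /**
     * Sums the counts recorded for years in [startYear, endYear], inclusive.
     * Assumes countsByYear maps a year to the number of times the word appeared that year.
     */
    static double totalCount(TreeMap<Integer, Double> countsByYear, int startYear, int endYear) {
        if (startYear > endYear) {
            // subMap would throw for an inverted range; treat it as "no occurrences".
            return 0.0;
        }
        double total = 0.0;
        // subMap with both flags true includes both endpoint years.
        for (double count : countsByYear.subMap(startYear, true, endYear, true).values()) {
            total += count;
        }
        return total;
    }
}
```

A word whose total comes out as 0.0 never occurred in the requested range.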

Be sure you are getting the words that appear with the highest counts, not the highest weights. Otherwise, you will run into issues that are very difficult to debug!

Note that if the front end doesn’t supply a year, default values of startYear = 1900 and endYear = 2020 are provided by NGordnetQueryHandler.readQueryMap.

If k = 0, or the user does not enter k (which results in a default value of zero), then the startYear and endYear should be totally ignored.

If a word never occurs in the time frame specified, i.e. the count is zero, it should not be returned. In other words, if k > 0, we should not show any words that do not appear in the ngrams dataset.

If there are no words that have non-zero counts, you should return an empty list, i.e. [].

If there are fewer than k words with non-zero counts, return only those words. For example if you enter the word “potato” and enter “k = 15”, but only 7 hyponyms of potato have non-zero counts, you’d return only 7 words.
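Taken together, the rules above amount to: drop zero-count words, keep the k words with the highest counts, then sort what remains alphabetically. One possible sketch is below. It assumes you have already computed a map from each candidate hyponym to its total count in the requested years. The names TopKSketch and topK are hypothetical, and breaking count ties alphabetically is an assumption that the spec does not state.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the skeleton.
class TopKSketch {
    /** Returns up to k words with the highest non-zero counts, in alphabetical order. */
    static List<String> topK(Map<String, Double> totalCounts, int k) {
        // Keep only words that actually occurred in the time range.
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, Double> entry : totalCounts.entrySet()) {
            if (entry.getValue() > 0) {
                candidates.add(entry.getKey());
            }
        }
        // Highest count first; ties broken alphabetically (assumption, see lead-in).
        Comparator<String> byCountDescending =
                (a, b) -> Double.compare(totalCounts.get(b), totalCounts.get(a));
        candidates.sort(byCountDescending.thenComparing(Comparator.<String>naturalOrder()));
        // Take at most k words; fewer if not enough words have non-zero counts.
        int limit = Math.max(0, Math.min(k, candidates.size()));
        List<String> top = new ArrayList<>(candidates.subList(0, limit));
        // The final answer is reported alphabetically, not by count.
        Collections.sort(top);
        return top;
    }
}
```

Sorting the whole candidate list is the simplest approach. A bounded priority queue would avoid sorting everything, but that is unlikely to matter at this scale.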

Task 1: Nonzero k

Modify your HyponymsHandler and the rest of your implementation to deal with the k != 0 case.

This task will be a little trickier since you’ll need to figure out how to pass information around so that the HyponymsHandler knows how to access a useful NGramMap.

The TimeSeries class we provide in the skeleton code does not support .data(). You can use .values() instead.

Do not make a static NGramMap for this task! It might be tempting to simply make some sort of public static NGramMap that can be accessed from anywhere in your code. This is called a "global variable".

We strongly discourage this way of thinking about programming, and instead suggest that you should be passing an NGramMap to either constructors or methods. We’ll come back to talking about this during the software engineering lectures.
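As a rough illustration of passing the dependency in rather than making it global, the sketch below hands a count source to a handler through its constructor. WordCounts and HandlerSketch are hypothetical stand-ins used only to keep the example self-contained; they are not the skeleton's NGramMap or HyponymsHandler, and your real constructor will likely take other arguments as well.

```java
// Hypothetical stand-in for the parts of NGramMap this sketch needs.
interface WordCounts {
    double totalCount(String word, int startYear, int endYear);
}

// Hypothetical stand-in for a handler that needs access to word counts.
class HandlerSketch {
    // An instance field set once in the constructor, not a public static variable.
    private final WordCounts counts;

    HandlerSketch(WordCounts counts) {
        this.counts = counts;
    }

    /** Looks up popularity through the object that was passed in. */
    double popularity(String word, int startYear, int endYear) {
        return counts.totalCount(word, startYear, endYear);
    }
}
```

With this shape, whoever constructs the handler decides which data it sees, which also makes it easy to pass in a small fake count source when testing.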

Task 2: Autograder Buddy

Copy in the AutograderBuddy implementation you used to complete Project 4C.

Writing Tests

We have not provided any tests for the k != 0 case. We suggest creating a new testing file: tests/TestKNonzeroHyponyms.java.

You can use the sample tests in TestOneWordK0Hyponyms and TestMultiWordK0Hyponyms as a template to create new tests in this new testing file.

You’ll need to construct your own test cases. We provide one above: words = "food, cake", startYear = 1950, endYear = 1990, k = 5.

If you need help figuring out what the expected outputs of your tests should be, you can use the staff solution webpage.

Submission

Try submitting to the autograder. You may or may not pass everything.

  • If you fail a correctness test, this means that there is a case that your local tests did not cover.
  • The autograder will not run unless you fix all your style errors. As a reminder, you can run the style checker in IntelliJ as often as you’d like.
  • You will have a token limit of 8 tokens every 24 hours. We will not reinstate tokens for failing to add/commit/push your code, run style, etc.

Project 4D will be worth 70 points.

Grading breakdown:

  • HyponymsHandler k != 0 (100%)

The score you receive on Gradescope is your final score for this assignment (assuming you followed the collaboration policy).

Optional Extra Features

If you’d like to go above and beyond in this project (and even explore some front-end development), read through the Optional Features spec!

Acknowledgments: The WordNet part of this assignment is loosely adapted from Alina Ene and Kevin Wayne’s Wordnet assignment at Princeton University.