For the term project part of your grade, which is 25% of your overall grade, you may either do the term project or do three extra homework assignments.
You may do the homework assignments individually or in pairs. These homeworks will be graded based largely on your report and will be more open-ended than previous assignments.
We plan the following assignments for the homework option:
- Neural Machine Translation.
- Applying Large Scale Pretrained Transformers for classification
- Text Generation with GPT-2.
The minimum team size is 3, and the maximum team size is 6. You may reach out to the instructors on Piazza about special circumstances regarding team size.
Your project will be a self-designed multi-week team-based effort. Your final project will consist of the following components:
- A formal definition of the problem and a motivation for why it is an interesting challenge for natural language processing. A literature review of past approaches to the problem.
- A commented implementation of the simplest possible solution to the problem. For instance, this could be a majority class baseline or a random baseline.
- A commented implementation of a baseline published in the literature, along with skeleton code obtained by removing the parts that students should implement.
- Extensions that attempt to improve on the baseline, along with a brief (one- to three-paragraph) accompanying write-up for each extension describing the general approach and whether it worked. Teams of size 1,2,3,4,5,6 should do 1,1,2,2,3,3 extensions, respectively. Interesting analysis can count as an extension.
- An evaluation script that can be used to score submissions, as on the class leaderboard. The output of any model implementation should be gradeable with this program.
- A final report summarizing your results.
- A short 5-10 minute presentation to Mark about your project, with 2-5 minutes for questions.
The term project is split into 4 deliverables: the first three are worth 5% each, and the final report and presentation are worth 10%. You don’t have to wait to start working on each part of the project. We encourage you to begin work early, so that you have a polished final product.
If you don’t have an idea about what you’d like to do for the project, or if you’re having trouble coordinating a team, we recommend the homework option.
Milestones and Due Dates
Here are the milestones for the term project:
- Mar 26, 2021 - Deadline to decide on term project versus weekly homework option.
- Apr 9, 2021 - Milestone 1 - Submit a formal project definition and a literature review.
- Apr 16, 2021 - Milestone 2 - Collect your data, and write an evaluation script and a simple baseline.
- Apr 23, 2021 - Milestone 3 - Implement a published baseline. Prepare a draft of your final project presentation.
- Apr 30, 2021 - Milestone 4 - Finish all extensions to the published baseline, submit your final report and final presentation, and schedule a presentation.
Before you begin
If you want to do the project option, you must first declare your intent to do the project and inform us of your team by March 26 via Gradescope. After March 26 and before Milestone 1, you should book a time to discuss your project idea with Mark or one of the TAs to make sure it is appropriate in scope. After March 26, we will open slots for this purpose.
For Milestone 1, you’ll need to create a writeup which includes:
- The problem definition (1 to 2 paragraphs, plus an illustrative example)
- A literature review of three or more papers or textbook sections that describe the problem
- What evaluation metrics you could use to score system outputs
- What type of data you will need to evaluate, and how much data is available
For your literature review, you should read 3-5 research papers that address the problem that you are working on. You should write a 1-2 paragraph summary of each paper, describing the approaches that they used and how well the approaches worked. This milestone is worth 5% of the grade.
For Milestone 2, you will need to:
- Collect your data
- Write an evaluation script
- Write a simple baseline (for instance, a majority class baseline)
This milestone is worth 5% of the grade.
Collect your data
Since most of the projects that we do in this course are data-driven, it’s very important to have your data ready to go at the outset of a project. You should collect all of the data that you’ll need for your term project and split the data into three pieces:
- Training data
- Development data
- Test data
The training data will be used to train the model. The dev data can be used to optimize your system parameters and/or to evaluate different approaches to the problem. The test data is a “blind” test set that will be used in the final evaluation.
If you are basing your term project on a shared task, then usually the data will be collected already, and usually it will be divided into a standard training/dev/test split. If it’s already assembled and split - great! You’re ahead of the game. If you’re not doing a shared task, then you may need to assemble your own data. A good way of creating your own training/dev/test split is to divide the data into chunks that are sized around 80%/10%/10%, where you want to use most of the data for training. It’s important to ensure that the same items don’t appear in more than one of the splits.
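If you need to create your own split, the 80%/10%/10% division described above can be sketched roughly as follows. This is a minimal illustration, not a required implementation: the function name `split_dataset`, its parameters, and the fixed random seed are all hypothetical choices, and it assumes each item is hashable (for example, a string) so that exact duplicates can be dropped before splitting.

```python
import random


def split_dataset(items, train_frac=0.8, dev_frac=0.1, seed=0):
    """Shuffle the unique items and split them into train/dev/test lists."""
    # Drop exact duplicates (keeping the first occurrence) so that the same
    # item cannot end up in more than one split.
    unique_items = list(dict.fromkeys(items))

    # A fixed seed makes the split reproducible across runs.
    rng = random.Random(seed)
    rng.shuffle(unique_items)

    n_train = int(len(unique_items) * train_frac)
    n_dev = int(len(unique_items) * dev_frac)

    train = unique_items[:n_train]
    dev = unique_items[n_train:n_train + n_dev]
    test = unique_items[n_train + n_dev:]  # whatever remains (about 10%)
    return train, dev, test
```

Note that this only removes exact duplicates. If your data contains near-duplicates, or several items drawn from the same source document, you may want to split at the level of the source instead.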
For your M2 deliverables, we’ll ask you to submit your data, plus a markdown file named data.md that describes the format of the data. If your data is very large, then you can submit a sample of the data and give a link to a Google Drive that contains the full data set. Your data.md should describe the number of items in each of your training/dev/test splits.
Write an evaluation script
For the next part of M2, you’ll need to determine a suitable evaluation metric for your task, and implement it. If you’re basing your term project on a shared task, then there is likely an established evaluation metric for the task. You should re-use it. If you’re doing a new task, then you may have to do a literature review in order to determine what metrics are best suited for your task.
You should write an evaluation script that takes two things as input: a system’s output and a corresponding set of gold standard answers. Your script should output a number that quantifies how good the system’s answers are.
For your deliverables, you should include your script, plus an example of how to run it from the command line. You should give a formal definition of the evaluation metric that explains how it is calculated in a markdown file called scoring.md - this file should cite any relevant papers that introduce the metric. You can also cite Wikipedia articles that describe your evaluation metric, and/or link to an overview paper describing the shared task that you’re basing your project on if it defines the metric.
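As a rough sketch of what the core of such a script might look like, the example below computes accuracy for a classification task. It assumes a hypothetical file format with one label per line, where line i of the system output corresponds to line i of the gold file; the function names `read_labels`, `accuracy`, and `score_files` are illustrative only, and your own task may call for a different metric entirely.

```python
def read_labels(path):
    """Read one label per line, stripping surrounding whitespace."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]


def accuracy(predicted, gold):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("system output and gold labels differ in length")
    if not gold:
        raise ValueError("gold labels are empty")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def score_files(predictions_path, gold_path):
    """Score a system output file against a gold standard file."""
    return accuracy(read_labels(predictions_path), read_labels(gold_path))
```

A command-line wrapper (for instance, one built with `argparse`) could call `score_files` on two file paths and print the resulting number, which gives you the command-line example to include in your deliverables.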
Write a simple baseline
As the final part of M2, you should write a simple baseline. This should be the simplest way of producing output for your task. For example, it could be a majority class baseline (like the one that we used in HW1) that determines the majority class from the training data and guesses that class for each item in the test set.
You should write a Python program that will generate the output for the baseline, and you should submit that as simple-baseline.py. You should also include a markdown file named simple-baseline.md that describes your simple baseline, gives sample output, and reports the score of the baseline when you run it on the test set and evaluate it with your scoring script.
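For a classification task, the majority class baseline described above might be sketched as follows. This is only an illustration under the assumption that the training labels are available as a list; the function names `train_majority_baseline` and `predict_majority` are hypothetical, and other tasks will need a different kind of simple baseline.

```python
from collections import Counter


def train_majority_baseline(train_labels):
    """Return the most frequent label in the training data."""
    if not train_labels:
        raise ValueError("training data is empty")
    # most_common(1) returns a list containing one (label, count) pair.
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(majority_label, test_items):
    """Guess the majority label for every item in the test set."""
    return [majority_label for _ in test_items]
```

The predictions can then be written out in whatever format your scoring script expects, so that the baseline is scored in exactly the same way as your later systems.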
What do you need to turn in?
- You should create a directory containing your training/dev/test data (please create a gzipped tar archive of the data). If your data is too large to upload to Gradescope, then you can submit a sample of the training data, plus your complete dev and test sets.
- Please upload a markdown file that describes your data (name it data.md). It should give an example of the data, describe the file format of the data, give a link to the full data set (if you’re uploading a sample), and give a description of where you collected the data from.
- You should describe your evaluation metric in a markdown file called scoring.md. This should give a formal definition of your metric, and relevant citations to where it was introduced. Your scoring.md file should also show how to run your evaluation script on the command line (with example arguments, and example output). The scoring.md file should say whether higher scores are better, or lower scores are better.
- You should include your evaluation script (you can call it score.py if you’re writing it in Python).
- You should upload simple-baseline.py and describe it in simple-baseline.md. Your simple-baseline.md should say what score your evaluation metric gives to the simple baseline for your test set.
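The gzipped tar archive of your data directory mentioned in the list above can be created with Python’s standard `tarfile` module, as in the sketch below. The function name `make_data_archive` and the default archive name `data.tar.gz` are illustrative assumptions, not required names.

```python
import tarfile
from pathlib import Path


def make_data_archive(data_dir, archive_path="data.tar.gz"):
    """Pack data_dir into a gzipped tar archive and return the member names."""
    data_dir = Path(data_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname stores paths relative to the directory name rather than
        # the full path on your machine.
        tar.add(str(data_dir), arcname=data_dir.name)
    # Re-open the archive to list what was actually written.
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()
```

Listing the member names after writing is a quick way to confirm that the train, dev, and test files all made it into the archive before you upload it.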
The goal of Milestone 3 is to implement a published system to establish a strong baseline for your project. You should re-implement the published baseline that you selected. It’s fine to use machine learning packages like PyTorch or sklearn, or NLP software like AllenNLP or spaCy, but you should implement the main algorithms yourself. You should not turn in existing code that implements the baseline.
You should include a baseline.md markdown file that includes step-by-step instructions on how to run your baseline code. Your baseline.md should also report the score for your system on your test and development data, and compare that to your simple baseline.
For Milestone 3, you will also prepare a draft presentation about your project. This can be a recording (10-12 minutes long), or Google slides with presenter’s notes. Your presentation should convey these main ideas:
- What is the topic of your term project? You should clearly explain to your classmates the problem that you selected to work on. Give an illustrative example of the problem first, and then give a more formal definition of the problem.
- What is exciting about your term project? Why did you want to work on this topic?
- How does the topic relate to the class? What new things did you learn?
You may also want to cover topics like this:
- What kind of data is available for this problem? How do you evaluate whether a solution is good or not? If the evaluation metric is not already familiar to the class, then walk through an explanation of how it works.
- What is the performance of a simple baseline, such as a majority class baseline?
- What approaches have people taken in the past? How successful have they been?
- What did you implement for your published baseline?
For Milestone 4, you’ll need to implement several extensions beyond this published baseline. These should be different experiments that you run to try to improve its performance. Teams of size 1,2,3,4,5,6 should do 1,1,2,2,3,3 extensions, respectively.
What do you need to turn in?
- You should also include a 1-2 paragraph explanation of which paper you chose to implement as your published baseline, and why you selected that one.
- You should submit your code for the baseline system. You should also submit a README file explaining how to run it, and reporting its performance on your dev and test set, according to your evaluation metric.
- A link to your draft slides for your final presentation.
This milestone is worth 5% of the grade.
For your final milestone, you’ll complete your extensions to the baseline, and you’ll produce a final writeup for your term project. Teams of size 2,3,4,5,6 should do 1,2,2,3,3 extensions, respectively.
Your final report should be written in the style of a scientific paper, and formatted with this LaTeX style file (which will make it look totally scientific!). Your report should contain the following sections:
- Title. A descriptive title for your term project.
- Authors. A list of team members.
- Abstract. Your abstract should give an overview of your project and your results (~100 words).
- Introduction. Your introduction should contain the following information. (~300-500 words, plus one illustrative example).
- An informal description of the task, and how it relates to NLP/Computational Linguistics (1-2 paragraphs)
- A figure that illustrates the task, or an illustrative example of the type of problem you’re trying to solve. This can be a picture, or an example of an input and output. You should include a caption or a short paragraph that describes what’s happening in your illustration.
- A formal definition of the problem.
- A paragraph describing why you picked this task for your term project.
- Literature Review. You can adapt your literature review from Milestone 1 for this part of your writeup. (~300-500 words, with 3 or more citations).
- If you adapted a shared task for your term project, then you should describe the shared task in your literature review, and cite the overview paper and give a URL to shared task homepage (if applicable).
- For your literature review, you should also cite and summarize 3-5 research papers that address the problem that you are working on. You should write a 1-2 paragraph summary of each paper, describing the approaches that they proposed and what results they got. Be sure to include a full citation of these papers in your Bibliography.
- Experimental Design. Your Experimental Design section should include a description of your training data, your evaluation metric, and your simple baseline model along with its performance on the test set. You can adapt your Milestone 2 submission for this part. (~300-500 words, plus 2 figures/tables, plus 1 or more equations).
- Data. This subsection should describe your training/development/test data. You should give a figure or table with examples from your data (including inputs and output labels). You should include a table that describes the size of your data sets. For example, it should give the number of sentences or words, etc. for each of the splits. You should also characterize the data. For instance, if there’s a skewed distribution over the labels, you should report it here. If your training data comes from a published paper, then cite that paper and explain how they collected the data. If you constructed your data set, then explain in detail how you collected it, and include example code in an appendix.
- Evaluation Metric. This subsection should describe your evaluation metric. You should include an English description of the metric, an equation for how your metric is computed, and a citation for this metric, and some citation(s) that shows what past publication(s) used this metric for the task that you’re working on.
- Simple baseline. You should compute the majority class baseline (or other simple baseline) for your data, and report it in this section. This is a way of characterizing the data and showing the difficulty of the task.
- Experimental Results. In this section, you should describe your implementation of a published baseline, and all of the extensions that you experimented with for your term project, and an error analysis of your system’s output. (~300-500 words).
- Published baseline. In this subsection you should write a detailed description of the published baseline that you implemented and cite the paper that it was published in. (You can update your Milestone 3 submission for this). You should report how well the model performs on your test set using the evaluation metric that you defined in your experimental design section. Does your implementation of the published baseline reach the same level of accuracy as the original paper? If not, why not? Are your results directly comparable – are they on the same test set? If not, why not?
- Extensions. In this subsection, you should describe each of the extensions that you tried. You should include a ~1-2 paragraph description of each extension that explains what you tried, why you tried it, and how it performed compared to your baseline. You should include a table of results, where the rows are the baseline and each of your extensions, and the columns are the performance on the test set (and on the dev set if you measured it). If you did any experiments where you searched over a set of different parameters, then you should include a result on how varying the parameter changed the performance on the dev or test set. Your tables and figures should include detailed captions that explain how to read them.
- Error analysis. In this subsection, you should perform an error analysis for your best performing system. Show examples of the errors that it makes. Can you categorize the types of errors that it makes, and give an estimate of how prevalent each error type is? If your extensions performed better than the published baseline, then show examples of the errors that the published baseline makes that your extensions get correct (and vice versa if your extension introduces some new errors).
- Conclusions. You should write a brief summary of what you accomplished in your term project. Did any of your implementations reach state-of-the-art performance on the task? If not, how close did you come? If not very close, then why not? (~100-300 words).
- Acknowledgements. If you used someone else’s code or you benefited from discussions with one of the TAs, then you should thank them here. Give credit generously! (Optional)
- Appendices. This can include short snippets of code that were relevant to your project, along with a description of what each snippet is doing. It could also include more examples of your training data or your system’s output. (Optional)
I really like examples and good illustrations. If you created some nice visuals for your final presentation slides, then I encourage you to include them in your writeup too. You can submit your images in an images/ subfolder.
What do you need to turn in for Milestone 4?
You should turn in the following items:
- A PDF of your final report
- A PDF of your final project presentation slides.
- A tarball or zip file with all of your code and data. It should contain:
- data/ - a subdirectory containing the training/dev/test splits that you use. If your data is too large to submit, then you can include a README file in this subdirectory that explains how to download your data.
- code/ - a subdirectory containing all code that you developed for your project, including the baseline and extensions, and your evaluation scripts. This should include a README that gives a step-by-step walkthrough of how to run your code, including an example of the command lines to run to reproduce the results that you report.
- output/ - a subdirectory containing your model’s predictions on the test set, along with the gold labels. This should also include a README that shows the command line for running your evaluation script on the output, and an example of the scores that the script returns.
- Schedule a time to present your results to Mark
This final milestone is worth 10%.
You’ve reached the end. Great job!