I was a TA for four semesters (and kinda in the shadows for two summers) for Dr. Leyk's CSCE 221 class. In that time we moved from grading submissions by hand, to grading with scripts, to grading with Mimir. Each method had its own pros and cons, and we learned to work within the limits of each system. At the end of my last semester, it was announced that the university would begin transitioning to Gradescope. As such, we wanted to see how we could integrate this new tool (this reads like we had a choice; we didn't really). As the head TA, and also the guy with experience writing test frameworks for Facebook, and also the guy who had set up many of the assignments and their test cases on Mimir, the responsibility fell on me to figure out how to make this work.
When you set up your assignment, you can choose any of Gradescope's built-in testing frameworks for your project. Of course, there was no option for C++. So I used the next best thing: Gradescope gives you the option of provisioning and configuring a Docker image for grading the assignment. You get a bare-bones Linux install and can do whatever you like inside it. It uses all the good Docker features, so you can run long install scripts once, bake the result into an image, and then, for each student submission, clone the image and run the student's code in seconds. If I were to design a code-grading system, it would look very much like this.
The problem then became writing a testing framework that could take student C++ code (which is quite the superset of the solution code) and run it. Gradescope reads the output of your grader from a JSON file saved in a particular directory. Of course, JSON output means I'm going to use Python. Using Python's `os.system` and `subprocess` facilities, it was possible to compile and run student code using existing utilities such as `make` and `g++`. From there it's just a process of writing test cases, formatting the output, and ensuring correctness (or at least giving students the ability to know when they were right, and to get them to tell the TAs when something went wrong). Since this was just a simple experimental assignment, I didn't integrate Google Test (which we used on Mimir and already had test cases written for), but that was an area planned for future work.
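The core of that loop can be sketched in a few lines of Python. This is a minimal illustration, not the actual grader: the test dictionaries, the `compile_cmd` parameter, and the exact result fields are my assumptions, though the `{"tests": [{"name", "score", "max_score", "output"}]}` shape matches the JSON format Gradescope documents for autograder results.

```python
import json
import subprocess
from pathlib import Path

def run_cmd(cmd, timeout=30):
    """Run a shell command; return (exit code, combined stdout/stderr)."""
    try:
        proc = subprocess.run(cmd, shell=True, capture_output=True,
                              text=True, timeout=timeout)
        return proc.returncode, proc.stdout + proc.stderr
    except subprocess.TimeoutExpired:
        return -1, f"timed out after {timeout}s"

def grade(tests, compile_cmd="make", results_path="results.json"):
    """Compile the submission once, run each test command, and write a
    Gradescope-style results JSON. Field names in `tests` are illustrative."""
    build_rc, build_out = run_cmd(compile_cmd)
    results = []
    for t in tests:
        if build_rc != 0:
            # Student code didn't compile: every test scores zero,
            # but the compiler output is surfaced so they can see why.
            results.append({"name": t["name"], "score": 0,
                            "max_score": t["points"],
                            "output": "compilation failed:\n" + build_out})
            continue
        rc, out = run_cmd(t["cmd"], t.get("timeout", 30))
        results.append({"name": t["name"],
                        "score": t["points"] if rc == 0 else 0,
                        "max_score": t["points"],
                        "output": out})
    Path(results_path).write_text(json.dumps({"tests": results}, indent=2))
    return results
```

A nonzero exit code from the test binary marks the test failed; on Gradescope the results file would land at the path the platform expects rather than the working directory used here.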
Of course, that was not enough. To make things even easier, I wrote multiple Python scripts that could be run locally to test before uploading to Gradescope, creating a Docker image, .... These scripts emulated Gradescope's directory structure and data flow (as best I could): resetting the test files, copying the student submission into the correct spot, and invoking the entry script. Finally, another script would export the test files into the structure that Gradescope expected, allowing for foolproof development of the test cases.
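A local harness along those lines might look like the sketch below. The directory names and the `run_autograder` entry script are assumptions standing in for whatever layout the real scripts emulated; the point is the reset-copy-invoke flow.

```python
import shutil
import subprocess
from pathlib import Path

def run_local(submission_dir, grader_dir, workdir="sandbox"):
    """Emulate the grading data flow locally: start from a pristine copy
    of the grader files, drop the student submission into place, then
    invoke the entry script. Paths here are illustrative, not Gradescope's."""
    work = Path(workdir)
    if work.exists():
        shutil.rmtree(work)                    # reset from any previous run
    shutil.copytree(grader_dir, work)          # fresh copy of the test files
    shutil.copytree(submission_dir, work / "submission")  # student code in place
    (work / "results").mkdir(exist_ok=True)    # where results.json would go
    # Hand off to the same entry script Gradescope would run.
    return subprocess.run(["bash", "run_autograder"], cwd=work).returncode
```

Because every run starts by deleting and recreating the sandbox, a test case that scribbles over its input files can't poison the next run, which is the same guarantee the per-submission Docker clone gives you on the platform.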
Finally, I developed a custom JSON file format for specifying the tests in a user-friendly manner. The format was inspired by what we needed as TAs: commands to run, points for successful completion, and the name to display. I also added some secondary features that we had used on Mimir (descriptions and timeouts) and some features we had wanted (precursor tests). That last one was a little controversial: not running a test because other tests failed somewhat violates the independence of unit tests, and a couple of TAs pointed out as much. However, in the case of unit testing student homework, this is an essential feature. Telling the students where to look for bugs and what to fix first saves many emails and exasperated visits to office hours.
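To make the precursor idea concrete, here is a toy spec and runner in the spirit of that format. The field names (`name`, `cmd`, `points`, `timeout`, `requires`) are illustrative, not the real ones; the `run` callable is injected so the skipping logic can be shown without shelling out.

```python
import json

# A hypothetical test spec: points, display names, timeouts,
# and "requires" as the precursor-test feature described above.
SPEC = """
[
  {"name": "compiles",  "cmd": "make",         "points": 5},
  {"name": "push_back", "cmd": "./tests push", "points": 10,
   "timeout": 10, "requires": ["compiles"]},
  {"name": "pop_back",  "cmd": "./tests pop",  "points": 10,
   "requires": ["compiles", "push_back"]}
]
"""

def plan(spec_json, run):
    """Run tests in order, skipping any whose precursor tests failed.
    `run` maps a command string to True (passed) or False (failed)."""
    passed = {}
    report = []
    for t in json.loads(spec_json):
        missing = [r for r in t.get("requires", []) if not passed.get(r)]
        if missing:
            # Skip, and point the student at the first thing to fix.
            report.append((t["name"], 0, f"skipped (fix {missing[0]} first)"))
            passed[t["name"]] = False
            continue
        ok = run(t["cmd"])
        passed[t["name"]] = ok
        report.append((t["name"], t["points"] if ok else 0,
                       "passed" if ok else "failed"))
    return report
```

The "skipped (fix X first)" message is where the pedagogical value lives: instead of a wall of cascading failures, the student sees one actionable failure and a chain of tests waiting on it.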
I'm not posting this code publicly because it's still being used (somewhat) and I know of at least one security vulnerability. Frankly, the less insight the students have into its inner workings, the better.