
CSCE 313H: Wikipedia Indexing

Coursework Disclaimer: This project is coursework that is reused year to year; details are left somewhat vague so as not to provide answers.

Dr. Dmitri Loguinov has a bit of a reputation in the Texas A&M Computer Science Department; he is known for assigning crazy-hard projects and expecting a lot from his students. CSCE 313H is no exception. The course is titled Introduction to Computer Systems and sits at a level just above an operating systems class; the key learning objectives are I/O, memory management, multi-threading, and synchronization.

In this class we had four projects. The first two focused on multi-threading a graph search. The graph's edges were noised with traversal time (i.e., some get-adjacent calls took significantly longer than others), which introduced interesting synchronization problems, but that is a discussion for another time. The last two projects focused on indexing strings in Wikipedia. Dr. Loguinov gave us Wikipedia as a text file (up to 30 GiB) and asked us to count how many times certain words appeared (project 3) and how many times all words appeared (project 4).

To get the numbers out of the way: counting certain words across the 30 GiB input reached speeds up to 2.5 GiB/s (runtime was very dependent on the input list's size and contents). Counting all words averaged around 370 MiB/s (about 80 s for 30 GiB). This was on a quad-core, eight-thread 2.9 GHz Intel CPU (an i7-7820HQ), and both programs were limited by CPU, not disk speed.

This was a massive foray into optimization from start to finish. Dr. Loguinov required the use of Windows and the Windows API to promote efficiency. Thread priorities were tinkered with, file contents were read into RAM via DMA, and synchronization and copying had to be minimized. All of this was compounded by the students competing with one another for the fastest runtime.
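As a rough illustration (not the assignment's actual code), unbuffered reads through the Windows API are one way to get that DMA-style disk-to-RAM transfer; the file name and chunk size below are placeholders:

```cpp
#include <windows.h>
#include <cstdio>

int main() {
    // Open the file unbuffered so reads bypass the OS file cache and
    // transfer straight from disk into our buffer (DMA-friendly).
    HANDLE file = CreateFileA("wiki.txt", GENERIC_READ, FILE_SHARE_READ,
                              nullptr, OPEN_EXISTING,
                              FILE_FLAG_NO_BUFFERING | FILE_FLAG_SEQUENTIAL_SCAN,
                              nullptr);
    if (file == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "CreateFileA failed: %lu\n", GetLastError());
        return 1;
    }

    // Bump the reader thread's priority so it stays ahead of the workers.
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_ABOVE_NORMAL);

    // FILE_FLAG_NO_BUFFERING requires sector-aligned buffers and read sizes;
    // VirtualAlloc returns page-aligned memory, which satisfies that.
    const DWORD chunkSize = 1 << 20;  // 1 MiB, a placeholder value
    char* buffer = (char*)VirtualAlloc(nullptr, chunkSize,
                                       MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);

    DWORD bytesRead = 0;
    while (ReadFile(file, buffer, chunkSize, &bytesRead, nullptr) && bytesRead > 0) {
        // ... hand the chunk off to a worker thread here ...
    }

    VirtualFree(buffer, 0, MEM_RELEASE);
    CloseHandle(file);
    return 0;
}
```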

The general architecture was to spawn one worker thread per logical CPU core (so eight when running on my laptop) plus one thread for reading the file. The read thread would read chunks of the file from disk (via virtual memory and direct memory access) and push those chunks to the worker threads for processing. At the end of each chunk, a thread would synchronize to report summary statistics, which were printed to the console in real time as the program ran.
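Here is a simplified, portable sketch of that producer-consumer layout; the real version used the Windows API directly, and the per-chunk statistics reporting is omitted:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// A shared queue of chunks: the reader pushes, workers pop.
std::queue<std::string> chunks;
std::mutex m;
std::condition_variable cv;
bool doneReading = false;

void worker() {
    while (true) {
        std::string chunk;
        {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [] { return !chunks.empty() || doneReading; });
            if (chunks.empty()) return;  // reader finished and queue drained
            chunk = std::move(chunks.front());
            chunks.pop();
        }
        // ... count words in `chunk`, then merge into shared totals ...
    }
}

int main() {
    // One worker per hardware thread (eight on the i7-7820HQ).
    unsigned n = std::thread::hardware_concurrency();
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; i++) workers.emplace_back(worker);

    // Reader loop: in the real project this pulled chunks off disk.
    for (int i = 0; i < 100; i++) {
        std::string chunk = /* read next chunk from file */ "";
        {
            std::lock_guard<std::mutex> lock(m);
            chunks.push(std::move(chunk));
        }
        cv.notify_one();
    }
    {
        std::lock_guard<std::mutex> lock(m);
        doneReading = true;
    }
    cv.notify_all();
    for (auto& t : workers) t.join();
    return 0;
}
```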

One interesting problem arose in dealing with the chunk boundaries: when counting words, how do you handle a word that spans two chunks? To solve this, part of the prior chunk was prepended to the front of the current chunk. Care was required to ensure that no words were double-counted and none were missed, and that was a lot of fun to debug when running multiple threads. If you're here because you're doing this assignment and hoping for hints, all I can say is good luck.
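Carrying an unfinished tail between chunks (equivalent to prepending part of the prior chunk) is a standard streaming technique, so a generic sketch, deliberately not assignment-specific, doesn't give much away; the isWordChar helper here is hypothetical:

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: defines what counts as part of a word.
bool isWordChar(char c) { return std::isalnum((unsigned char)c); }

// Process one chunk, carrying any trailing partial word into the next call.
// `carry` holds the unfinished word left over from the previous chunk.
void processChunk(const std::string& chunk, std::string& carry) {
    std::string data = carry + chunk;  // prepend leftover from last chunk
    carry.clear();

    size_t end = data.size();
    // If the chunk ends mid-word, hold that tail back for the next chunk
    // instead of counting it now (avoids splitting one word into two).
    while (end > 0 && isWordChar(data[end - 1])) end--;
    carry = data.substr(end);

    // ... tokenize and count words in data[0, end) exactly once ...
}
```

At end of input, whatever remains in carry is flushed as the final word; the debugging pain comes from keeping that handoff consistent once multiple threads are involved.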

Overall, this was a fun foray into working with large datasets efficiently. Additionally, since this was the first class with major, compounding projects, I learned a lot about writing clean, DRY code. It was, up to that point, the most code I had written for any one class. While it was in C++, it read a lot more like C-style code (Dr. Loguinov is a big fan of C-strings for performance reasons).
