This Week in MLC@Home
Notes for Nov 17 2020
A weekly summary of news and notes for MLC@Home
GPU week(s), part 4: The saga continues.
This coming week we're starting to pivot back to writing and analysis of existing data. There's still a lot to discuss related to GPUs, but we also need to take time to write a bit. Frankly, the issues with the Linux/CUDA client have us stumped for the moment, so taking a week or two to focus on some of the science that's been piling up should help us clear our heads and come back to it with a fresh perspective.
This past week we enabled the release track ("mlds-gpu" application) of the GPU client for Windows and Linux. The good news is that at least the Windows client is working fairly well*, and it's chewing through WUs wonderfully, complementing the CPU crunching that continues unabated. Having a separate app allows us to capture the wildly different RAM requirements for CUDA WUs without penalizing CPU crunchers. The GPU line also allows us to send out longer Dataset 1+2 WUs, which has led to a nice boost in the number of complete Parity* WUs, meaning we finally have over 1000 examples of each network type for Dataset 1 and are on our way to that for Dataset 2. It would be nice to wrap up those two sooner rather than later. We'll continue to release WUs in parallel to both the CPU and GPU queues to keep both groups of users fed.
Not all is well in GPU land though. We released the Linux/CUDA client, but after several days not a single WU had completed without error, so we've pulled it back from production and will try again. This is incredibly frustrating, as it works on our test machine, but on volunteer machines it fails with CUDA errors indicating userspace/driver incompatibilities. Clearly we're not bundling it up correctly. In addition, there have been some strange results regarding CPU utilization with the Windows CUDA client. Users have reported better performance and utilization if they assign two CPUs to the WU instead of one, even though one core remains idle the entire time. There's some speculation in the linked thread, but we should track that down soon as well.
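For anyone who wants to help narrow down the userspace/driver mismatch described above, here is a minimal diagnostic sketch. It is not part of the MLC@Home client; the function name `cuda_environment_report` is made up for illustration. It assumes a PyTorch install may or may not be present and that `nvidia-smi` may or may not be on the PATH, and reports the CUDA runtime version PyTorch was built against next to the installed driver version.

```python
import shutil
import subprocess


def cuda_environment_report():
    """Collect CUDA runtime and driver details; missing pieces stay None."""
    report = {
        "torch_cuda_runtime": None,  # CUDA version PyTorch was built with
        "cuda_available": None,      # whether PyTorch can actually use a GPU
        "driver_version": None,      # NVIDIA driver version from nvidia-smi
    }

    # PyTorch is optional here: skip quietly if it cannot be imported.
    try:
        import torch
        report["torch_cuda_runtime"] = torch.version.cuda
        report["cuda_available"] = torch.cuda.is_available()
    except (ImportError, OSError):
        pass

    # Ask the driver tooling for its version, if it is installed.
    if shutil.which("nvidia-smi"):
        try:
            result = subprocess.run(
                ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
                capture_output=True,
                text=True,
                timeout=10,
            )
            if result.returncode == 0 and result.stdout.strip():
                report["driver_version"] = result.stdout.strip().splitlines()[0]
        except (OSError, subprocess.TimeoutExpired):
            pass

    return report


if __name__ == "__main__":
    for key, value in cuda_environment_report().items():
        print(f"{key}: {value}")
```

A report where `cuda_available` is False while a driver version is present would be consistent with the kind of mismatch described above, though it does not by itself identify the cause.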
All that's to say we're really excited that GPU support is at least partially live and giving us a nice performance boost, but it's also been more of a drain on resources than anticipated, and we need to turn our focus back to the science for a bit before tackling Linux/CUDA again. If any experienced Linux/CUDA devs would like to offer help deploying our PyTorch/CUDA app combination, we'd love for you to contact us and help us troubleshoot.
More specific news below; some of it is even non-GPU related!
We fired up our ARM-based test systems that had fallen off the network to make sure the current ARM app continues to run. We were able to verify that all three of our arm32/arm64 test systems running Debian 10 are crunching fine with the latest client. These are an RPi3 (32-bit), an RPi4 (64-bit), and a CuBox-i4 (32-bit).
The Dataset 1+2 WUs we release in the GPU queue have a larger epoch limit than those in the CPU queue, and have a proportional increase in credit awarded. We may make a similar change in the CPU queue, but it would mean much longer runtimes, so for now we're seeing how it goes in the GPU queue and will make a determination in the future.
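As a rough illustration of "proportional" here, the sketch below scales credit linearly with the epoch limit. The function name and all numbers are hypothetical; the actual epoch limits and credit values used by the project are not stated in this post.

```python
def scaled_credit(base_credit, base_epoch_limit, new_epoch_limit):
    """Scale credit linearly with the epoch limit (illustrative only)."""
    if base_epoch_limit <= 0:
        raise ValueError("base_epoch_limit must be positive")
    return base_credit * new_epoch_limit / base_epoch_limit


# Hypothetical values, not the project's real settings.
cpu_credit = 100.0       # credit for a CPU-queue WU
cpu_epoch_limit = 200    # epoch limit in the CPU queue
gpu_epoch_limit = 800    # larger epoch limit in the GPU queue

gpu_credit = scaled_credit(cpu_credit, cpu_epoch_limit, gpu_epoch_limit)
print(gpu_credit)
```

With these made-up numbers, a WU with four times the epoch limit earns four times the credit.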
We spent some time this week researching how to drop the AppImage (FUSE) requirement on Linux. It's definitely possible, but we're loath to roll out that change, even to the test queue: overall, AppImage hasn't caused too many issues, and we don't want to do anything unnecessary at the moment. We thought it might help with the Linux/CUDA issues, but we no longer think that's true.
Datasets 1, 2, and 3 continue crunching away. GREAT progress so far!
We know some of the web pages are out of date, and we hope to address that this week. Updates queued include: a complete update/redo of the MLDS Dataset page, and an update to the "system requirements" section of the main page to better list minimum software requirements.
If we divide each of the three datasets into 3 releases based on the number of examples in each release (100, 1000, 10000), then we're ready to package up Dataset 1 (100, 1000), Dataset 2 (100), and Dataset 3 (100).
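The release rule above can be sketched as a small check: a release of a given size is ready once every network type in the dataset has at least that many complete examples. The network type names and counts below are hypothetical placeholders, not real project numbers.

```python
RELEASE_SIZES = (100, 1000, 10000)


def ready_releases(examples_per_network_type, release_sizes=RELEASE_SIZES):
    """Return the release sizes that every network type has enough examples for."""
    if not examples_per_network_type:
        return []
    # The network type with the fewest complete examples is the bottleneck.
    fewest = min(examples_per_network_type.values())
    return [size for size in release_sizes if fewest >= size]


# Hypothetical counts of complete examples per network type.
dataset_1 = {"type_a": 1200, "type_b": 1050}
dataset_2 = {"type_a": 640, "type_b": 310}

print(ready_releases(dataset_1))  # sizes 100 and 1000 qualify
print(ready_releases(dataset_2))  # only size 100 qualifies
```

Using the minimum across network types reflects the "examples of each network type" wording earlier in these notes; a different readiness rule would change the result.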
If you aren't aware of the BOINC Network Podcast, the MLC@Home devs lurk there and sometimes contribute. Be sure to check it out if you're interested: https://www.boinc.network/.
We hope to get back to preparing Dataset 4, and writing a tech report/paper to go along with the Dataset releases this week.
Project status snapshot: (note these numbers are approximations)
Tasks ready to send 48470
Tasks in progress 24464
With credit 1190
Registered in past 24 hours 47
With recent credit 2129
Registered in past 24 hours 25
Current GigaFLOPS 33798.72