Sudden unexplained Tesla GPU failures: a corrupt JIT cache

Having just spent several hours trying to debug an issue, I thought I should share it with the internet. It seems fairly obscure and I didn’t find any hints online about what might be causing it.

The Symptoms

  • CUDA programs fail on execution, giving either malloc errors or ‘unknown error’!
  • Nothing important was changed (no new drivers, no software updates, no new CUDA toolkit).
  • Previously working binaries now fail!
  • GPU resets and full system reboots don’t fix the problem.
  • Whole families of GPUs fail across a cluster (e.g. all P100s fail on every node, but the K80s, even in the same nodes, still work).

The final symptom was the key to solving the problem:

  • Everything works fine for other users!
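Because the failures follow the user rather than the machine, per-user driver state is a natural suspect. The sketch below is not taken from the post: it assumes the documented Linux default location of the CUDA JIT cache (`~/.nv/ComputeCache`) and the `CUDA_CACHE_PATH` environment variable that overrides it. The function names `resolve_cache_dir` and `quarantine_cache` are hypothetical helpers invented for illustration. It moves the cache aside instead of deleting it, so the driver should rebuild a fresh one on the next run.

```python
import os
import time
from pathlib import Path


def resolve_cache_dir(env=None, home=None):
    """Return the directory the CUDA driver is expected to use for its JIT cache.

    Assumption: CUDA_CACHE_PATH overrides the location; otherwise the
    Linux default of ~/.nv/ComputeCache applies.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    override = env.get("CUDA_CACHE_PATH")
    return Path(override) if override else home / ".nv" / "ComputeCache"


def quarantine_cache(cache_dir, suffix=None):
    """Rename the cache directory out of the way.

    Returns the new path, or None if there was no cache directory to move.
    Nothing is deleted, so the old cache can be inspected or restored later.
    """
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return None
    suffix = suffix or time.strftime("%Y%m%d-%H%M%S")
    backup = cache_dir.with_name(f"{cache_dir.name}.corrupt-{suffix}")
    cache_dir.rename(backup)
    return backup
```

Calling `quarantine_cache(resolve_cache_dir())` as the affected user and then re-running a failing binary would show whether the cache was at fault. A less invasive first check is to run the binary once with `CUDA_CACHE_DISABLE=1` set in the environment, which tells the driver to bypass the cache entirely.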

Continue reading “Sudden unexplained Tesla GPU failures: a corrupt JIT cache”

SPICE Young Research Leaders Workshop

A couple of weeks ago we held a very successful workshop in Mainz: Young Research Leaders Group Workshop: Insulator spintronics – strong-coupling, coherence and entanglement. Funding was provided by Jairo Sinova’s SPICE center and the MAINZ graduate school.

I co-organized it with So Takei (Queens College of the City University of New York) and Yunshan Cao (University of Electronic Science and Technology of China), and we decided to try to bring together the different areas of insulator spintronics, strong coupling and quantum magnetism. This was quite successful, and I think everyone met at least one person they had not met before (myself included).

One of my favourite aspects of the workshop was the discussion time. The workshop was held in the Mainz Institute for Theoretical Physics, where there is a large coffee room with blackboards, surrounded by side offices. This was perfect for discussions after each session and was used extensively by the participants.

All of the talks, except those under embargo due to unpublished data, are available on the SPICE YouTube channel.

arXiv preprint: Origin of temperature and field dependence of magnetic skyrmion size in ultrathin nanodots

In collaboration with groups from Spain and Italy we have written a paper on how temperature affects the radius of skyrmions. The change in radius is caused by the change in the effective magnetic parameters due to spin fluctuations. We can calculate the temperature dependence of the relevant magnetic parameters, such as anisotropy, exchange stiffness and Dzyaloshinskii-Moriya interaction (DMI), using atomistic spin models.

Interestingly, my friend Unai Atxitia was also conducting a very thorough study of the temperature dependence of the DMI. Using different methods and working independently, we found the same results for the DMI scaling, which is a nice cross-check of each other’s work. I highly recommend reading his paper, which includes some really interesting details such as an emergent anisotropy contribution from the DMI.

arXiv preprint: Non-local magnon transport in the compensated ferrimagnet GdIG

With friends and collaborators at the Walther Meissner Institute (WMI), TU Munich, we have written a paper about non-local magnon transport in Gadolinium Iron Garnet. The experimental measurements show quite complicated behaviour, where the non-local signal changes in response to both the external magnetic field strength and the temperature. We show that different exchange magnon modes of the bulk change frequency in response to the field and temperature depending on their polarization, in qualitative agreement with the measurements.


A reasonably comprehensive OSX backup strategy

It’s good practice to keep backups in different locations to guard against disasters like fire or theft. This is now becoming quite easy to implement using cloud backups. I also use several different pieces of software. This may sound a bit paranoid, but it protects against lots of nasty possibilities. For example, software bugs do happen and can render your backups useless, so there’s more chance of recovering if you have other fallbacks. Different software solutions can also add nice features to your backup strategy, such as a 1:1 bootable copy of your hard drive, which reduces disruption to your work.
Below is a brief outline of the different backup methods I’m using and some of the advantages they each bring to the party.

Continue reading “A reasonably comprehensive OSX backup strategy”