Like Aleph Earth, Nth Century uses a modified version of JC Testud's Video Generation With pix2pix to produce video. The key difference here is that the audio is also produced with Pix2PixHD, which similarly predicts next steps algorithmically. To do this, I encoded the audio as a series of spectrogram rasters and trained Pix2PixHD to predict the next frame from each one. Once trained, I rendered out an image sequence of predicted spectrograms, converted those spectrograms back into playable audio, and stitched the clips back together.
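The encode/decode step can be sketched in Python. This is a minimal illustration using SciPy's STFT, not the tooling from the original pipeline, and it cheats by reusing the original phase; a predicted spectrogram has no phase, so a real decoder would need a phase-reconstruction step such as Griffin-Lim. All function names here are my own illustrative choices:

```python
import numpy as np
from scipy.signal import stft, istft

def audio_to_raster(audio, nperseg=256):
    """Encode mono audio as an 8-bit log-magnitude spectrogram raster.

    Returns the raster plus the scale factor and phase needed to invert it.
    """
    _, _, Z = stft(audio, nperseg=nperseg)
    log_mag = np.log1p(np.abs(Z))           # log compression tames dynamic range
    scale = log_mag.max()
    raster = (255 * log_mag / scale).astype(np.uint8)
    return raster, scale, np.angle(Z)

def raster_to_audio(raster, scale, phase, nperseg=256):
    """Decode an 8-bit spectrogram raster back to audio.

    Reuses the stored phase for simplicity; with model-predicted magnitudes
    you would estimate phase instead (e.g. Griffin-Lim iteration).
    """
    log_mag = raster.astype(np.float64) / 255.0 * scale
    Z = np.expm1(log_mag) * np.exp(1j * phase)
    _, audio = istft(Z, nperseg=nperseg)
    return audio

# Round-trip a 440 Hz test tone through the raster representation.
sr = 8000
t = np.linspace(0, 1, sr, endpoint=False)
tone = np.sin(2 * np.pi * 440 * t)
raster, scale, phase = audio_to_raster(tone)
recovered = raster_to_audio(raster, scale, phase)
```

The 8-bit quantization loses a little fidelity, but the log compression keeps quiet partials from being crushed to zero, which matters when the image model is the thing consuming these frames.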
Here's a sample of the spectrogram data:
I made this video as part of the first year of the AI Creative Practices Research Group at the University of Oregon. A lot of our work culminated in a show called Slippages: AI as Creative Practice.