Text and Audio Generation and Understanding

I have recently come back to text and audio generation using a Hugging Face account with a Gradio interface. I’m not sold on Gradio, but right now it seems sane.

My first use is with Stability AI's Stable Audio, where I've been using both the toolkit and the diffusers options. Last time I used the toolkit, a few months ago, I was happy(ish) with the results. It does seem to have taken a bit of a dive in this instance, with some serious Gaussian noise issues when generating a fire sound. I compared this with diffusers access to the same model (stable-audio-open-1.0), and the latter produced a more realistic sound. I do need to look into why this might be the case.
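For reference, the diffusers route I mean is roughly the sketch below, following the current StableAudioPipeline API. The prompt, step count and seed are just placeholders rather than what I actually ran, and it assumes a CUDA machine (you may also need to accept the model licence on Hugging Face first).

```python
import torch
import soundfile as sf
from diffusers import StableAudioPipeline

# Load the open-weights Stable Audio model in half precision on the GPU
pipe = StableAudioPipeline.from_pretrained(
    "stabilityai/stable-audio-open-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "A crackling fire, close up, with occasional pops"  # placeholder prompt
generator = torch.Generator("cuda").manual_seed(0)            # fix the seed for comparison

audio = pipe(
    prompt,
    negative_prompt="low quality, noise",
    num_inference_steps=200,
    audio_end_in_s=10.0,
    num_waveforms_per_prompt=1,
    generator=generator,
).audios

# Transpose to (samples, channels) and write out at the model's sample rate
sf.write("fire.wav", audio[0].T.float().cpu().numpy(), pipe.vae.sampling_rate)
```

Keeping the seed fixed like this is also the obvious way to do a fair toolkit-versus-diffusers comparison on the same prompt.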

In the meantime, I am having a nose around the preview of the Nvidia Flamingo model, which claims to be an interactive design assistant, so it could be interesting. It would need some serious reading to understand what it is doing. I am wondering if this is a better task for Jupyter rather than another interface. It does pose a question for some work that I am doing on understanding sound through other media. Originally I came across the challenge of vernaculars not being read, but what terms might it generate, and how might they differ?

Update: I've just run a query against an open API interface to the Nvidia model via Hugging Face, and the description it returns is certainly full. It needs more poking, but I think there is something to this.
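If anyone wants to poke at a Gradio-hosted model programmatically rather than through the web UI, gradio_client is the route I'd reach for. The Space name, prompt and endpoint below are placeholders (not the actual interface I queried); `client.view_api()` lists the real signatures for whichever Space you point it at.

```python
from gradio_client import Client, handle_files

# Hypothetical Space name; substitute the Space that actually hosts the model.
client = Client("nvidia/audio-flamingo")

# Argument order and api_name depend on how the Space defines its endpoints;
# print(client.view_api()) before committing to one.
result = client.predict(
    handle_files("fire.wav"),  # the generated audio to describe
    "Describe this sound in as much detail as possible.",
    api_name="/predict",
)
print(result)
```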
