Hi guys, remember me? I'm carykh.
First of all, if you're interested in Machine Learning,
you've got to check out Siraj Raval,
who's also got a YouTube channel dedicated to teaching artificial intelligence stuff.
He's got an actual degree in Computer Science and I don't, so that tells you something.
And he makes videos like this:
(Siraj) Hello world, it's Siraj, and welcome to "Intro to Deep Learning".
In this first episode, we'll predict an animal's body weight, given only the weight of its brain.
(Cary) You mean this thingy?
(Siraj rapping) Yeah, uh huh, you know what it is
(Cary) [forced laughter] uh huh huh huh huh huh
Link in the description.
Anyway, today I'm going to try to get my computer to replicate my voice.
Spoiler! You won't be impressed.
Before we move on, I already know I'm going to get a lot of comments like:
"This has already been done before!", and
"Somebody else got successful results with a different method, so your method is not worth trying!"
And to that I respond: Yeah.
I admit I'm not the first person to try something like this. YouTubers SomethingUnreal and Jon Reeveman have already posted
impressive results of computers using neural networks to imitate human voices.
[warbly computer generated voice]
[different warbly computer generated voice]
And Google itself has tackled this task too, with Google DeepMind's WaveNet program,
producing better text-to-speech results than any other TTS system in the world!
[clear computer generated voice reads the text]
(Screaming noises)
[clear computer generated speechlike sounds]
Also, really advanced text-to-speech systems like Siri or Amazon Alexa aren't really comparable here.
(Siri) Aren't comparable?! What the hell do you mean by that? You little bi-
(Cary interrupts) Because those systems are *already* given which phonetic sounds and words to say.
In addition, they probably use pre-recorded samples from humans speaking, which is, like, cheating.
The point of my project, and all projects I'm mentioning here,
is that the computer has to produce voice-like sounds
*without* directly being taught concepts like words, grammar, and voice inflection.
It all has to arise organically.
So there! I won't shy away from the fact that I'm not the first!
But hey! My goal is not to literally beat Google at their specialty;
it's just to tinker around,
have fun, and learn a little by building everything behind the scenes (but not really). And about me using an unconventional
method: I think that's actually a good thing, because we as a community learn more about what works and what doesn't... (But Cary!)
What? (What is your unconventional method?) I'm glad you asked! See, the three people I mentioned earlier all used recurrent neural...
(Google isn't a person!) What was that? (Google is not a person!)
You think I didn't know that? (Well, you said three people...)
The three *entities* I mentioned earlier all used recurrent neural networks applied to raw audio samples, which just means each
sample was being produced one at a time. And considering that there are anywhere from
11,000 to 16,000 to
48,000 samples per second, they're going to have to go through a lot of steps just to produce a small amount of audio.
Not only will that take a long time (I heard that WaveNet takes 90 minutes to produce one second of audio), but it also strains
the computer's ability to remember past events and discover patterns. But, I mean, let's be clear: these
downsides must not be too serious if Google was still able to produce leading-edge technology with them, right? So maybe I'm just overreacting.
But I am not Google, so I have less brain power and less computer power. For the fun of it, let's try a different approach.
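To put rough numbers on that (my own arithmetic, not figures from the video): a network that emits one raw audio sample per step needs tens of thousands of sequential steps per clip, while a 100-columns-per-second spectrogram, the resolution Cary ends up using, needs only 128 steps for the same 1.28-second clip.

```python
# Sequential steps needed to produce one 1.28-second clip,
# if the model emits one raw audio sample per step.
CLIP_SECONDS = 1.28

for sample_rate in (11_025, 16_000, 48_000):
    steps = int(sample_rate * CLIP_SECONDS)
    print(f"{sample_rate:>6} Hz -> {steps:>6} steps")

# A spectrogram at 100 columns per second covers the same clip in far fewer steps:
print("spectrogram columns:", int(100 * CLIP_SECONDS))  # 128
```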
Enter the Spectrogram
one of the most simple, intuitive
representations of sound known to humankind... and now, you! I just realized that implies you aren't a human, but I mean, that's not surprising;
just look at you. It's a two-dimensional heat map of sorts, where right and left mean forward and backward in time, and up and
down mean higher and lower in pitch.
Lighter colors mean stronger vibrations at that frequency at that time, and darker colors mean weaker
vibrations. If you've watched some of my older videos, you might be thinking: (You've already used spectrograms two times! This is nothing new!)
Well, I see your point, but those images were not true spectrograms. Why?
Well, they could only depict notes coming from one instrument, and in my mind, true
spectrograms should be able to represent any sound in the world.
Anyway, with spectrograms, one second of audio can now be summed up in
just one hundred samples instead of several thousand. This will make pattern recognition easier. But on the other hand,
we've also got a whole new dimension to deal with. Also, we can't perfectly reconstruct the audio out of this, but
we can get kind of close. Just listen.
(Just Listen)
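Cary uses ARSS for this conversion, but the core idea, slicing audio into short windows and measuring the energy at each frequency, can be sketched in plain numpy. (This is a generic linear-frequency STFT magnitude, not ARSS's own per-octave analysis; the function name and parameters are mine.)

```python
import numpy as np

def spectrogram(signal, sample_rate, cols_per_second=100, n_fft=512):
    """Magnitude spectrogram: one column per 1/100 of a second of audio."""
    hop = sample_rate // cols_per_second          # samples between columns
    window = np.hanning(n_fft)                    # taper each slice to reduce leakage
    cols = []
    for start in range(0, len(signal) - n_fft, hop):
        frame = signal[start:start + n_fft] * window
        cols.append(np.abs(np.fft.rfft(frame)))   # energy per frequency bin
    return np.array(cols).T                       # rows = frequency, columns = time

# One second of a 440 Hz sine at 16 kHz -> roughly 100 columns
sr = 16_000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t), sr)
print(spec.shape)  # (257, 97)
```

The brightest row sits near 440 Hz, which is exactly the "lighter colors mean stronger vibrations at that frequency" picture from the video.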
Take-two Cary here. I spent over ten minutes talking about the science of
spectrograms, but I am just not willing to
animate 18 minutes for one video. If you guys want to hear that without the visuals anyway, just let me know.
Unrelated, but can I just point out how
innocently sweet and naive take-one Cary is?
So ignorant of the future... hehe. To summarize: I want to hear my computer imitate my human voice. I
say "human voice" as if I've got some other non-human voice. So here's my game plan: Ten Words of Wisdom is the most mind-blowing,
life-changing experience on this website, but it also happens to be a great source of training
data. From it, I was able to extract just under four hours
of raw audio of me speaking in a natural voice. (I have an IQ of you plus 22, boo-boo!) Okay, a semi-natural
voice. In Python, I divided these four hours into 11,168 snippets, each one 1.28 seconds long.
There's this program called the Analysis and Resynthesis Sound Spectrograph,
written by Michel Rouzic in 2008. Basically, ARSS is a program that can listen to any .wav file and draw its corresponding
spectrogram by figuring out what frequencies are vibrating at
what time. On his website, Michel Rouzic lists some cool
experiments he's tried. Some of these experiments are only possible when you play with the image files instead of the audio files,
including time-stretching audio,
time-squishing audio, stretching pitch intervals,
turning real-world images into sound,
and even converting an image of Lenna into sound and then back into Lenna. But
anyway, the main reason why I love ARSS so much...
*whoops*
*sigh*
Can we start over?
The reason why I chose ARSS, even though it's such old software, is that it's the only program I could find that could be
called from the terminal, allowing me to call it from Python. And it can convert from raw audio into a
spectrogram image and backwards, though not perfectly, for some reasons I described in a 10-minute blurb I cut out.
Anyway, I'm going to set up ARSS with the parameters of 100 pixels per second and 24 pixels per octave, giving us
128 by 128 pixel images: small enough for HyperGAN to train fast, but big enough to depict what's being said. But barely.
Okay, to be honest, the resolution is so low that the audio reconstruction barely
sounds like me... but hey, lo-fi
music is back in fashion these days! Question: why
did I choose a frequency range of 180 to 7000 Hertz if my voice is only about 130 Hertz?
(Yes, this is 130 Hertz.)
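As a sanity check on those parameters (my arithmetic, not anything stated in the video): 128 pixels at 100 pixels per second is 1.28 seconds per image, and 128 pixels at 24 pixels per octave spans about 5⅓ octaves, which, starting from 180 Hz, tops out right around 7000 Hz.

```python
PX = 128              # image width and height, in pixels
PX_PER_SECOND = 100   # ARSS time resolution
PX_PER_OCTAVE = 24    # ARSS frequency resolution
MIN_FREQ = 180.0      # bottom of the frequency range, in Hz

clip_seconds = PX / PX_PER_SECOND    # 1.28 s of audio per image
octaves = PX / PX_PER_OCTAVE         # ~5.33 octaves of frequency range
max_freq = MIN_FREQ * 2 ** octaves   # ~7257 Hz, i.e. "about 7000"

print(clip_seconds, octaves, round(max_freq))
```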
The answer is that I initially took a wider frequency range of my voice, but I wanted to cut out all the stuff that didn't
look useful for articulating the sound. Only this stuff looked useful to me. (Wow, you're being real scientific right there.) Remember,
I'm a tinkerer, not a researcher. I do this to have fun, not to expand humanity's knowledge. Here's my Python code, by the way.
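His on-screen code isn't in the transcript, but the slicing step, chopping hours of audio into 1.28-second snippets (128 pixels at 100 pixels per second), might look roughly like this. A sketch, not his actual code; the function name and filename are made up.

```python
import wave

def slice_wav(path, snippet_seconds=1.28):
    """Chop a mono .wav file into fixed-length snippets of raw frames."""
    with wave.open(path, "rb") as w:
        frames_per_snippet = int(w.getframerate() * snippet_seconds)
        bytes_per_snippet = frames_per_snippet * w.getsampwidth()
        snippets = []
        while True:
            chunk = w.readframes(frames_per_snippet)
            if len(chunk) < bytes_per_snippet:
                break                    # drop the short tail at the end
            snippets.append(chunk)
    return snippets

# e.g. slice_wav("four_hours_of_me_talking.wav") -> roughly 11,000 snippets
```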
Also,
there's this weird trick where you can still hear the fundamental frequency of a sound even when you can only hear the overtones. Check it
out:
*700 Hz*
*800 Hz*
*900 Hz*
*1100 Hz*
*1300 Hz*
*1700 Hz*
*700,800,900,1100,1300,1700 Hz Playing all at once*
100 Hertz isn't playing, but can you hear it anyway?
*700,800,900,1100,1300,1700 Hz Playing all at once again*
Here it is:
*100 Hz*
*700,800,900,1100,1300,1700 Hz Playing all at once again*
*100 Hz*
*700,800,900,1100,1300,1700 Hz Playing all at once again*
*100 Hz*
*700,800,900,1100,1300,1700 Hz Playing all at once again*
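The trick works because 700, 800, 900, 1100, 1300, and 1700 Hz are all multiples of 100 Hz, so their sum repeats 100 times per second and your ear supplies the missing 100 Hz fundamental. A numpy sketch of the chord (mine, not the actual audio from the video):

```python
import numpy as np

sr = 16_000
t = np.arange(sr) / sr                         # one second of time
overtones = [700, 800, 900, 1100, 1300, 1700]  # all multiples of 100 Hz
chord = sum(np.sin(2 * np.pi * f * t) for f in overtones)

# The waveform repeats every 1/100 s even though no 100 Hz tone is present:
period = sr // 100                             # 160 samples at 16 kHz
print(np.allclose(chord[:sr - period], chord[period:]))  # True
```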
This means that even if my voice's
fundamental frequency is technically out of the range of this image, you'll still be able to hear it, because its
overtones are in range. Back to my game plan: I said I'd feed those eleven thousand images as training data
into HyperGAN, which, simply speaking, is a program that looks at thousands of images from the outside world during its training session and then
uses deconvolutional neural networks to generate new images in the same style as the originals. You guys have heard me talk about this
relentlessly in previous videos, but if you haven't, check the description. Let's have HyperGAN train on its own, but stop it every six minutes
to hear how it's doing. HyperGAN is going to give us images back, but
we can convert those images into audio almost
instantaneously by using ARSS. So without further ado, here is the beginning of HyperGAN's attempts to sound just like me.
*Laser noises*
*More laser noises*
Yeah, it sounds pretty bad, but I've got to show you what square one sounded like so you can appreciate it more when
we get to square two.
*More laser noises*
*More laser noises*
*More laser noises*
Sounds like lasers, doesn't it? Speaking of lasers, check out my laser song. I'm such a sellout.
*More laser noises*
*More laser noises*
*More laser noises*
You might have noticed these images have color in them, which means there are three
output color channels, whereas the desired spectrogram images are only grayscale. Why
didn't I edit HyperGAN's code to only output one color channel instead of three? Do I have some innovative reasoning behind it?
*More laser noises*
*More laser noises*
The answer is no. I'm just too dumb and scared to change any of HyperGAN's code. I've subjected you to a lot of
non-human laser sound, but I promise things are about to change.
*More laser noises*
*More laser noises*
*More laser noises*
*More laser noises*
*More laser noises*
*More laser noises*
*Lasers and muffled humans*
*Lasers and muffled humans*
*Lasers and muffled humans*
*Lasers and muffled humans*
*Lasers and distorted humans*
*Lasers and distorted humans*
*Lasers and distorted humans*
*Lasers and distorted humans*
*Lasers and distorted humans*
And just like that, a muffled, somewhat metallic human voice comes out of thin
air. It's not great, and you can't even make out any words I'm saying,
but what's spooky is you can kind of tell it's me!
*Lasers and distorted humans*
*Unrecognizable distorted words*
*Unrecognizable distorted words*
*Unrecognizable distorted words*
*Unrecognizable distorted words*
Progress slowed down a lot after the first night, so I'm going to
skip forward like 20 hours to show you what the final results are.
*Unrecognizable distorted words while screen fades away*
*Unrecognizable distorted words*
*Unrecognizable distorted words*
*Unrecognizable distorted words*
*Unrecognizable distorted words*
*Unrecognizable distorted words*
*Unrecognizable distorted words*
*Unrecognizable distorted words*
*Unrecognizable distorted words*
*Unrecognizable distorted words*
There it is. Those are my final results. Like I warned, you're probably not too impressed, but the audio
definitely cleaned up a bit since 20 hours ago. There's less of that constant buzzing background noise than before, which you can
also see visually by the background becoming darker. But that watery, metallic distortion of my voice itself never went away,
and I don't think it could ever go
away. That's because the image dimensions are so low, and
ARSS, as a program, can't reconstruct audio with perfect
fidelity.
With that all being said, I do think my methodology has
some benefits
over the work of SomethingUnreal and Jon Reeveman. Not Google's DeepMind, though; Google DeepMind's unbeatable.
Because my base unit of time
was the pixel, at one hundredth of a second (1 centisecond), whereas the other two's base unit of time
was the audio sample, at
one eleven-thousandth of a second, my program was able to learn long-term patterns like voice inflection and
sentence starts and stops better. At the beginning, you can hear SomethingUnreal and Jon Reeveman's LSTM
getting stuck on single phonetic sounds for ages,
(screaming noises)
since one vibration in that phonetic sound already takes up several hundred steps: a long time in the RNN's eye, I suppose.
This is pretty obvious, but on one hand, neural nets looking at the microscopic level do better on the microscopic level, with crisp,
enunciated sounds. That's because the crispness
relies on a
few of those audio samples, right on the transitions of sounds, being just right. On the other hand, neural nets looking at the macroscopic level
do better on the macroscopic level, with my generated audio generally getting the flow of sentence intonation
sounding about right. I would say their neural networks sound like a person with a stutter talking just inches away from you,
whereas my neural network
sounds like a person speaking coherently
but standing behind iron doors and underwater. Of course, I'd be able to say that a little more confidently if I could generate
samples longer than 1.28 seconds. So I guess the video's just about
over. Maybe I could talk about ways to improve this voice generation. I'm feeling a
little weary of HyperGAN at this point,
because even though it creates images
super fast, HyperGAN can't understand the passage of time or cause and effect, which means it will only be able to generate audio as
long as the samples I feed into it. It'll never be able to ramble on indefinitely.
Alternatives to HyperGAN that maybe could ramble
indefinitely are PixelCNN, which produced that exquisite jazz music but is slow to generate, and maybe WaveNet's TensorFlow
implementation, modified to create images. Because WaveNet on raw audio has proven to be so, so, so extremely successful, I just
wonder what would happen if it could be used to work on
spectrograms. The only downside is we've got a whole other dimension to work with now. But hey,
computers love dimensions, don't they? (No, they
don't.) That's it for now! Thank you, patrons. You guys are super cool. See you, viewers, next time!
Oh wait, two more announcements! One: if you want to see sloppier videos from me, including a video that's literally
1,000 times the length of this one, check out my new second channel called Humany. And two:
Computery thinks I hog up too much of the spotlight and wants to do the outro for once.
