Hi, my name is Weipeng,
and I am here to talk to you about our work
on using deep neural networks 
for the joint detection and localization of speakers for robots.
Our work emphasizes the application in real HRI scenarios,
where there are multiple overlapping sound sources,
the number of which is unknown.
There is also the robot’s ego-noise.
And the predictions should be made from short inputs,
so that they are produced in real time
and short utterances can be detected.
So how does our neural network work for such a problem?
We consider two types of input features for our neural network.
The first type is the center coefficients of the GCC-PHAT function
computed for each pair of microphones.
And the other type is the GCC-PHAT on a mel-scale filter bank.
Since different sources in general dominate different time-frequency bins,
this feature provides the sub-band information 
that allows easier detection of multiple sources.
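As a rough illustration, such features could be computed along these lines. This is a minimal NumPy sketch, not the paper's exact settings; the FFT length, the number of center coefficients, and the filter-bank matrix are assumptions here.

```python
import numpy as np

def gcc_phat(x1, x2, n_center=51):
    # GCC-PHAT between two microphone signals: PHAT-normalized
    # cross-power spectrum, inverse FFT, center coefficients kept.
    n = len(x1) + len(x2)
    cross = np.fft.rfft(x1, n=n) * np.conj(np.fft.rfft(x2, n=n))
    cross /= np.abs(cross) + 1e-12            # PHAT weighting
    cc = np.fft.irfft(cross, n=n)
    half = n_center // 2
    return np.concatenate((cc[-half:], cc[:half + 1]))

def gcc_phat_fbank(x1, x2, fbank, n_center=51):
    # Sub-band variant: the normalized cross-spectrum is weighted by
    # each mel filter (fbank: n_bands x n_freq_bins, with
    # n_freq_bins = n // 2 + 1) before the inverse FFT.
    n = len(x1) + len(x2)
    cross = np.fft.rfft(x1, n=n) * np.conj(np.fft.rfft(x2, n=n))
    cross /= np.abs(cross) + 1e-12
    half = n_center // 2
    rows = []
    for w in fbank:
        cc = np.fft.irfft(w * cross, n=n)
        rows.append(np.concatenate((cc[-half:], cc[:half + 1])))
    return np.stack(rows)                     # (n_bands, n_center)
```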
For the network output, we propose a likelihood-based coding.
That is, the network outputs the likelihood of a source
coming from each individual direction.
During training, the ideal output is the maximum of Gaussian-like functions
around the ground truth directions of arrival. 
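As a minimal sketch of this encoding, assuming for illustration a 360-point azimuth grid and a Gaussian width that are not the paper's exact values:

```python
import numpy as np

def encode_targets(doas_deg, n_dirs=360, sigma_deg=8.0):
    # Ideal output for one frame: at each candidate direction, the
    # maximum of Gaussian-like curves centered on the ground-truth
    # DOAs, so overlapping sources keep their own peaks.
    grid = np.arange(n_dirs) * (360.0 / n_dirs)
    target = np.zeros(n_dirs)
    for doa in doas_deg:
        # circular angular distance to this ground-truth direction
        d = np.abs((grid - doa + 180.0) % 360.0 - 180.0)
        target = np.maximum(target, np.exp(-d**2 / (2.0 * sigma_deg**2)))
    return target
```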
And during application, as with a spatial spectrum,
we decode the output by finding its peaks as the predictions.
Such output coding allows the detection of an arbitrary number of sources.
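A corresponding decoder could look like this; the peak threshold is a hypothetical choice, and the vector is padded circularly so peaks near 0 and 360 degrees are not missed.

```python
import numpy as np
from scipy.signal import find_peaks

def decode_predictions(output, threshold=0.5):
    # Treat the likelihood vector like a spatial spectrum: local
    # peaks above a threshold become the predicted DOAs, so any
    # number of sources (including zero) can be detected.
    n_dirs = len(output)
    padded = np.concatenate((output[-1:], output, output[:1]))
    peaks, _ = find_peaks(padded, height=threshold)
    return ((peaks - 1) % n_dirs) * (360.0 / n_dirs)
```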
As for the network architecture, we investigate three different types,
including a multilayer perceptron with the GCC-PHAT coefficients as input,
a convolutional neural network which treats the GCC-PHAT on the filter bank as an image,
and a two-stage network which considers the sub-band analysis
before aggregation across all frequencies.
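To give an idea of the simplest variant, here is a minimal PyTorch sketch of the multilayer perceptron. The layer sizes, the 4-microphone array giving 6 pairs, and the sigmoid output are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MLPLocalizer(nn.Module):
    # GCC-PHAT center coefficients from all microphone pairs are
    # flattened into one vector; the output is one likelihood value
    # per candidate direction, matching the coding described above.
    def __init__(self, n_pairs=6, n_coef=51, n_dirs=360, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_pairs * n_coef, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_dirs), nn.Sigmoid(),  # likelihoods in [0, 1]
        )

    def forward(self, x):  # x: (batch, n_pairs * n_coef)
        return self.net(x)
```

Training could then, for example, regress these outputs toward the encoded targets with a mean-squared-error loss.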
To train and evaluate the proposed methods,
we collected a sound source localization dataset with our robot Pepper.
The dataset includes, in total, 24 hours of recordings
of both loudspeakers and human subjects.
We made the dataset available through this link.
It can serve as a benchmark dataset for further studies.
Our proposed methods achieved approximately 90% precision and recall
in both the loudspeaker and the human subject recordings,
outperforming the popular spatial-spectrum-based methods.
We will show some results at the end of the video.
Thank you for watching.
