-For the benefit of those in the room,
as well as whoever may be watching,
I am Kathleen Howell, an
associate dean for engineering.
I'm here to welcome everyone
on behalf of Dean Jamieson
to this event.
So this celebration of faculty
careers actually emerged
from a strategic plan
about five years ago.
And at that time, it was
called the Faculty 2020,
and the focus was on
professional development
at all stages of a
faculty member's career.
It was also interested at that time
in aligning the hiring, promotion,
and tenure processes
as the college's scope was evolving
and its leadership values
were being brought forward.
So there was a desire for a post
review of full professors
that would include
an understanding
of everyone's accomplishments
at this particular time
in their career and an
opportunity to make plans
going forward.
So full professors who
are at least seven years
past promotion present
this type of colloquium
on their achievements and their plans.
And then following this,
they get the opportunity
for a meeting with their head
as well as with the dean.
So the program was
originally piloted in 2013,
and we're now entering
about the fourth year
of actually running the program.
So today we have the
opportunity to celebrate
Professor Saul Gelfand,
who completed his Ph.D. at MIT
in electrical engineering
and computer science.
Prior to coming to Purdue, he
was with Scientific Systems
Incorporated in Cambridge,
as well as Bolt Beranek
and Newman in Cambridge.
So he's been here since 1987.
And as you all know, he's
currently a professor
in electrical and computer engineering.
His interests lie broadly
in modeling, analysis,
and optimization of stochastic
signals and systems.
And his research is in the
areas of digital communication,
statistical signal
processing, optimization
and pattern recognition.
So with that, we'll let you
tell the rest of the story.
-Thank you.
So I just wanted to say that
I was a little bit unclear
on how to structure this talk,
so I decided to talk about things which
sort of maybe I'm most proud of.
Also what I'm currently interested in.
And maybe just overview
some of the other things
rather briefly.
So we'll see how that goes.
Okay, so this is an outline of the talk.
And so I'll first be talking
about stochastic sampling
and optimization.
This is actually work that
grew out of my Ph.D. thesis
at MIT but was done primarily by me
when I came to Purdue.
So the first topic is stochastic sampling.
So, Markov chain sampling methods,
also known as Markov chain Monte Carlo,
or MCMC,
are a basic tool for Bayesian
statistical inference.
Things like MMSE and MAP
estimation, imputation,
validation, those kinds of things.
The idea is to sample from a Markov chain
with a desired invariant distribution.
And this idea was due to
Metropolis, but there has
been a tremendous amount
of work subsequently.
Many other related techniques,
some suitable for parallel implementation.
A related approach, which is
perhaps not as well known,
is sampling from diffusions,
or SDEs,
stochastic differential equations,
and their discrete-time approximations.
So that's for sampling in Euclidean space.
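To make that concrete, here is a minimal, purely illustrative sketch of an Euler discretization of a Langevin SDE; the target density and step size are made-up assumptions, not anything from the talk.

```python
import numpy as np

def langevin_sample(grad_U, x0, eps=0.01, n_steps=10_000, rng=None):
    """Euler discretization of the Langevin SDE dX = -grad U(X) dt + sqrt(2) dW.

    After a burn-in period the iterates are approximately distributed
    according to the density proportional to exp(-U(x))."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - eps * grad_U(x) + np.sqrt(2.0 * eps) * noise
        samples.append(x.copy())
    return np.array(samples)

# Example: sample from a standard Gaussian, where U(x) = ||x||^2 / 2.
samples = langevin_sample(lambda x: x, x0=np.zeros(2))
```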
Okay, so we examined the
relationship between Markov chain
and diffusion sampling algorithms,
and what we showed is that
these suitably interpolated
Markov chain sampling methods
converge weakly to what's
called a Langevin diffusion,
which is like a Brownian
motion, or Wiener process,
except that there is a viscosity term.
Furthermore, we showed that
different types of Markov chains
from different sampling
methods, like the Metropolis
and heat bath methods,
converge to diffusions
running at different time scales.
And so, this is a way of
comparing the different methods,
something people really
weren't able to do previously.
Now, a different approach
one might take
would be to look at the modulus
of the second largest eigenvalue of
the transition matrices,
but actually that's a very difficult thing
to get a handle on.
Okay, so here's a display
of that Markov chain
sampling method.
So, anyway, the idea is this:
it has a transition density,
p(x, y), which you get from
a candidate transition
density, q(x, y).
So starting at x, you get y from q(x, y),
and then you accept it
with probability s(x, y).
And if you don't accept, you
stay at the same place x.
So this p(x, y) expression does that.
Now, the acceptance probability differs
for different Markov
chain sampling methods.
For the Metropolis
method, s is equal to sm.
For the heat bath method,
s is equal to sh.
And there are others as well.
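For illustration only, and not the code from this work, here is a minimal sketch of one step of such a sampler, with a Gaussian candidate q(x, y) and the Metropolis acceptance rule; a heat-bath-style rule is noted in a comment.

```python
import numpy as np

def metropolis_step(x, U, sigma=0.5, rng=None):
    """One step of a Metropolis chain targeting a density proportional to exp(-U(x)).

    Candidate q(x, .): Gaussian centered at x with scale sigma.
    Acceptance s_M(x, y) = min(1, exp(U(x) - U(y)))."""
    rng = np.random.default_rng() if rng is None else rng
    y = x + sigma * rng.standard_normal(np.shape(x))   # draw candidate y from q(x, .)
    accept_prob = min(1.0, np.exp(U(x) - U(y)))        # Metropolis rule s_M
    # A heat-bath (Barker) rule would use 1 / (1 + exp(U(y) - U(x))) instead.
    return y if rng.random() < accept_prob else x      # otherwise stay at x
```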
Okay.
So, how did we, or what did we show?
What is this convergence
of this Markov chain
sampling method to a diffusion?
Well, we parameterized a Markov chain
by a small parameter epsilon.
And it has a transition density q epsilon,
and the way this works, remember,
we're doing things in continuous space,
so it selects a coordinate at random, and
then it performs or selects
a Gaussian perturbation
with mean zero and variance epsilon.
And then we interpolate
that into continuous time
and get this x epsilon.
And what we showed was
under suitable conditions,
this interpolated process,
x epsilon converges
to a Langevin diffusion,
depending upon the Markov
chain sampling method,
for example, for the Metropolis method,
it converges to this xm,
and for the heat bath
method, it converges to xh.
And if you know a little bit
about stochastic differential
equations and diffusions,
it turns out that xm is running
at twice the speed of xh.
So in this sense, Metropolis
runs at twice the speed
of the heat bath and would
therefore be preferred.
So this was a new result.
Okay, next thing I want to talk about
is stochastic optimization.
So, stochastic approximation
is a basic tool
for root finding and
optimization under uncertainty.
And it's a generalization of the classical
non-linear local search methods.
So for optimization, typically,
and that's what we are interested in here,
one performs a
gradient or a Newton-type step,
but the estimates which are
used to get the gradient
and Hessian, or approximations to them,
are noisy or imprecise measurements,
and so we get what's
called stochastic gradient
or stochastic Newton algorithms.
And these were initially
developed by Robbins and Monro
and Kiefer and Wolfowitz, and
there's been a lot of analysis
and application of these methods.
I should say that the
decreasing step size approach
is used for the fixed
parameter identification.
If we use a fixed step size, we can track
time varying parameters.
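To make the decreasing versus fixed step-size distinction concrete, here is a minimal, hypothetical stochastic gradient sketch; the quadratic objective and noise level are made up.

```python
import numpy as np

def stochastic_gradient(noisy_grad, x0, n_steps=5000, a0=1.0, fixed_step=None, rng=None):
    """Stochastic gradient descent: x_{k+1} = x_k - a_k * (noisy gradient at x_k).

    a_k = a0 / (k + 1), decreasing, for fixed-parameter identification;
    a_k = fixed_step, constant, if we want to track time-varying parameters."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    for k in range(n_steps):
        a_k = fixed_step if fixed_step is not None else a0 / (k + 1)
        x = x - a_k * noisy_grad(x, rng)
    return x

# Example: minimize U(x) = ||x||^2 / 2 from noisy gradient measurements.
x_hat = stochastic_gradient(lambda x, rng: x + 0.1 * rng.standard_normal(x.shape),
                            x0=np.ones(3))
```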
Okay, so what we did was
we developed and analyzed
what I'll call a globally
optimal stochastic
gradient algorithm
under fairly general conditions.
And we also developed and
analyzed a continuous-state
Markov chain annealing algorithm,
which is kind of a
different type of algorithm
also used for global optimization.
And we did it by writing
the annealing algorithm
in the form of a globally
optimal stochastic gradient
so this again is based on this connection
that we developed between the
Markov chain sampling methods
and diffusions.
And then we applied this
stuff also to several
global optimization problems
including edge detection
and virus reconstruction.
In the latter case we actually
looked at a lot of other
global optimization methods
as well and compared them.
Okay so here's the classical
stochastic gradient;
it looks like a steepest descent algorithm
but it has this noise psi k,
and a k is a sequence
of positive numbers.
In the modern analysis
there are usually two steps
to analyzing these types of things:
one is you establish some kind
of global stability property,
like boundedness, using a
Foster-Lyapunov criterion.
And then you characterize the limit points
from what's called an
associated ordinary differential equation,
which is z of t,
which is obtained by averaging
the stochastic gradient algorithm.
And under suitable
conditions you can show that,
with a k say going to
zero like one over k,
this Z k converges to
the set S, which is the set of local
minima of U. It converges
with probability one.
So what we did was we modified
the stochastic gradient
algorithm by adding in this
b k W k term here.
And this is sort of done
to escape the local minima.
So everything is the same
as before except the W k
are standard independent
Gaussian random variables,
and the b k is a sequence
which goes to zero
very, very slowly.
Again the analysis is done 
in two steps
and sort of guided by the 
ODE method
and the classical stochastic 
gradient algorithm.
We establish a global
stability property,
in this case tightness, or
boundedness in probability,
and then we characterize the limit points
from what I call an associated SDE,
stochastic differential equation.
And under suitable conditions,
with this b k again
going to zero very, very slowly,
a constant over the square root
of k log log k,
and also b over a being
greater than C zero,
which is some constant
which comes out of analyzing
this diffusion using
Freidlin-Wentzell
large deviations theory,
this X k actually converges
to the set of global
minima, even in the presence of
strict local minima.
Convergence is in probability.
The analysis here is a lot
more delicate
than the classical approach
with the ODE,
because of this very
slowly decreasing noise;
you have to localize the approximation
and the stochastic
differential equation
on very long time intervals, but it can be done.
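Here is a minimal, hypothetical sketch of that idea, a stochastic gradient recursion with an extra slowly decreasing Gaussian term added to escape local minima; the constants and objective are illustrative assumptions only.

```python
import numpy as np

def global_stochastic_gradient(noisy_grad, x0, n_steps=20_000, a0=1.0, c=1.0, rng=None):
    """Stochastic gradient plus a slowly decreasing Gaussian term b_k * W_k,
    added to help the iterates escape strict local minima.

    a_k ~ a0 / k, and b_k ~ c / sqrt(k * log(log(k))) goes to zero very slowly."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    for k in range(3, n_steps + 3):                    # start at k = 3 so log(log(k)) > 0
        a_k = a0 / k
        b_k = c / np.sqrt(k * np.log(np.log(k)))
        w_k = rng.standard_normal(x.shape)             # standard Gaussian W_k
        x = x - a_k * noisy_grad(x, rng) + b_k * w_k
    return x
```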
Okay next thing I want to talk about
is pattern recognition
and machine learning.
So the first thing here
is iterative growing
and pruning of classification trees.
So a classification tree
is an important method
for non-parametric classification.
The trees are usually grown top down
by splitting features in the feature vector
until some termination criterion is met.
You label the terminal
nodes with the class labels.
There are many advantages
to this tree-structured approach.
An important one is interpretability.
How does the classifier work?
What features does it select?
What order does it select in?
What are the thresholds?
So it's used not just
for predictive analysis
but also for feature extraction.
Okay here is the classification tree:
x is the feature vector,
f is a splitting function,
theta is the threshold,
and c hat at the terminal
nodes are the labels.
So the feature vector propagates down the tree
and it gets labelled as the
class of the terminal
node that it lands in.
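A minimal, made-up sketch of that propagation, with hypothetical node fields for the splitting feature, the threshold theta, and the terminal label c hat, might look like this:

```python
class TreeNode:
    """Internal node: split on feature `index` at threshold `theta`.
    Terminal node: carries a class label `label`."""
    def __init__(self, index=None, theta=None, left=None, right=None, label=None):
        self.index, self.theta = index, theta
        self.left, self.right = left, right
        self.label = label                       # c hat at a terminal node

def classify(node, x):
    """Propagate the feature vector x down the tree and return the terminal label."""
    while node.label is None:                    # not yet at a terminal node
        node = node.left if x[node.index] <= node.theta else node.right
    return node.label

# Toy tree: split on x[0] at 0.5, then label the two terminal nodes.
tree = TreeNode(index=0, theta=0.5,
                left=TreeNode(label="class A"),
                right=TreeNode(label="class B"))
print(classify(tree, [0.3, 1.2]))                # -> class A
```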
Okay, CART growing and pruning.
So to avoid overfitting
which is a critical thing
in not just classification but regression,
Breiman, Friedman, Olshen, and
Stone, as part of their famous
CART classification and
regression tree program,
suggested growing a large
tree and pruning it back
rather than using stopping criteria.
This is a kind of global approach,
as opposed to a local approach,
which would be the stopping criteria.
So we grow a large tree
until the terminal nodes are pure,
that is they contain members
only from a single class
after which there's no
point to continue splitting.
Then CART minimizes what's
called a complexity cost
over the pruned sub-trees.
These are sub-trees
with the same root node.
See, if you label all the nodes
in a tree with some class,
then each pruned sub-tree is
actually a classifier itself.
And selecting a pruned
sub-tree is like selecting
a less complex, less
over-fitted classifier.
Anyway, they minimize the complexity
cost over the pruned sub-trees
for all possible values of
the complexity parameter
and then find the complexity parameter
using cross validation.
And they give a very interesting
and efficient algorithm
for performing the minimization
over the pruned sub-trees.
So I wanna talk about this a little bit
because we generalized this
for the Neyman Pearson approach.
So R of T here is the
misclassification rate
of a classification tree
based on the training data set.
And this will be a very biased estimate
because the estimation
is done on the same data
which is used to grow the tree.
So it needs to be penalized
to eliminate the bias,
so this complexity cost
R alpha is introduced,
and T tilde here is the
set of terminal nodes,
and alpha is the complexity parameter.
So as I said CART grows this full tree
and then finds the
optimally pruned sub-tree
which minimizes this complexity cost.
Now they do this by proving the existence of,
and then determining an
efficient algorithm to find,
thresholds alpha k
and pruned sub-trees T k
such that this T zero of
alpha is equal to this
fixed pruned subtree T k when alpha
is in this alpha k interval.
So the whole problem is finite,
you would expect something like this.
The issue is how to
actually find the alpha ks
and the T ks
and that's what they gave an
efficient algorithm to do.
Okay, then we find alpha star and T star,
the optimally pruned
sub-tree, by minimizing an
estimate of the misclassification
rate based on cross validation.
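For illustration only, here is a minimal sketch of the weakest-link idea behind this kind of cost-complexity pruning; it is the textbook CART-style recursion, not the algorithm from this work, and it reuses the hypothetical TreeNode layout from the earlier sketch with an added per-node risk field.

```python
def leaves(node):
    """Terminal nodes of the subtree rooted at `node`."""
    if node.left is None:                      # terminal node
        return [node]
    return leaves(node.left) + leaves(node.right)

def subtree_risk(node):
    """Training misclassification cost R(T_t) of the subtree rooted at `node`."""
    return sum(leaf.risk for leaf in leaves(node))

def weakest_link(root):
    """Internal node with smallest g(t) = (R(t) - R(T_t)) / (|leaves(T_t)| - 1)."""
    best = None
    stack = [root]
    while stack:
        t = stack.pop()
        if t.left is None:
            continue
        g = (t.risk - subtree_risk(t)) / (len(leaves(t)) - 1)
        if best is None or g < best[0]:
            best = (g, t)
        stack += [t.left, t.right]
    return best                                # (alpha_k, node to collapse)

def prune_sequence(root):
    """Collapse the weakest link repeatedly, yielding the thresholds alpha_k
    (the tree is pruned in place, giving the nested subtrees T_k)."""
    alphas = []
    while len(leaves(root)) > 1:
        alpha_k, t = weakest_link(root)
        t.left = t.right = None                # collapse t into a terminal node
        alphas.append(alpha_k)
    return alphas
```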
So what we did, well I should say,
one thing that CART does is
use cross validation or a test set
to find an optimally
pruned sub-tree amongst
a parametric family,
to reduce this to a problem of
parameter estimation over pruned subtrees.
We instead proposed to find
optimally pruned sub-trees
amongst all sub-trees using all the data
in iterative growing and pruning phases.
So we had no need for
complexity parameters here.
And we did that by splitting the data set
and iteratively growing
and pruning based on
alternating subsets and
establishing convergence.
So the key thing was to get the thing set up
in a way that it actually converged.
And then we also gave
an efficient algorithm
for the pruning phases.
So here is an example;
this is from the CART
monograph, on a problem in
waveform classification
they used extensively
to examine the results and
compare them with other methods.
So in the top table we have
the CART and proposed methods,
and what you can see is that
the number of terminal nodes,
that's the first one, is about
the same for the two methods.
The estimate of the risk
based on cross validation
is less for the proposed method
and the actual risk is less as well.
Okay we know the actual risk
because we have the model
and we can compute it.
Furthermore the CPU requirement
is dramatically less
because this generation
of the parametric family
of pruned subtrees turns out
to be very computationally expensive.
The lower table shows the iterations
in the proposed method;
it actually just takes three iterations.
Actually, even after the first iteration,
it's doing better than CART.
However, it's difficult to
determine which pruning method,
or more generally, which classification
tree design, is better.
The result is problem dependent.
This is a very active area of research;
it's been going on for at least 30 years.
Now there are new methods,
things called bagging and
boosting, random forests.
These use ensembles, and the goal
is to have improved
prediction over a single tree.
So you can kinda compare
ensembles with pruning,
where you generate a single tree.
However, single trees
are still widely used
for feature selection because
they can be interpreted;
this is what I was talking about before.
How does the classifier work?
This is actually very important to people
who use these things and
don't just want a black box.
Okay the next thing I wanna talk about
is Neyman Pearson classification trees.
Okay, so classification
trees are usually designed
using a Bayesian approach:
minimize the misclassification loss,
or Bayes risk, in the various phases.
The frequentist approach
is handled by basically
arbitrarily selecting
some class prior values
to generate some suboptimal and incomplete
subset of the receiver
operating characteristic.
This is sort of a crude application
of what's called the
Neyman Pearson lemma.
We proposed an approach to
generate the entire optimally
labeled and pruned ROC
which will then yield
the Neyman Pearson design.
Actually other things too like minimax,
as well as the area under the ROC curve
which is a popular method
to compare classifiers
in the machine learning and
statistics communities.
So we proposed to minimize
what we call a prior
complexity cost, a prior
parameterized complexity cost,
over the pruned sub-trees,
and also the terminal labels,
because they change
when the priors change,
for all possible values of
the priors and complexity.
Then find the complexity parameter
using cross validation,
then extract the receiver
operating characteristic.
So by comparison CART just
minimizes the complexity cost.
So it's sort of a one dimensional thing
whereas we have a two dimensional thing.
So we've got a geometric
aspect to the problem
which wasn't there in CART.
So we give an efficient
algorithm for doing this,
and it turns out that the CART algorithm
is really a kind of special
and much simpler case.
So here is a display
in a little more detail
of what's going on.
So let P d and P f be the detection
and false-alarm probabilities.
They are parameterized now by the prior gamma
of class zero in this two-class problem.
So now the prior parameterized
complexity cost
is this R alpha gamma.
And you're minimizing this
over all the pruned subtrees.
And what we show, we
demonstrate the existence
and then actually determine
an efficient algorithm
to find what turn out
to be convex polygons P k
such that the optimally pruned subtree
is equal to some fixed pruned subtree T k
when alpha gamma is
in one of these convex polygons.
So we then find alpha star of gamma,
and the optimally pruned sub-tree
as a function of gamma,
the prior, minimizing over alpha
using cross validation.
And then we can get the ROC curve
by varying gamma.
That will generate
the whole ROC curve.
Then we can find the ROC regions
I guess you could say.
And then we can find the ROC curve
by finding the boundary
of the convex hull.
So it actually kind of surprises me that
this whole thing could be done.
But I think we've done it.
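As an illustrative sketch only, and leaving out the randomizations and the linear programming sub-problems mentioned later, here is one way the upper boundary of the convex hull of operating points could be extracted:

```python
def _cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def roc_upper_hull(points):
    """Upper boundary of the convex hull of (P_f, P_d) operating points,
    one point per candidate optimally pruned and labeled subtree."""
    pts = sorted(set(points))                 # sort by P_f, then by P_d
    hull = []
    for p in pts:
        # drop points that would make a non-clockwise turn: they lie on or
        # below the boundary and are dominated by other operating points
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull

# Toy example: three operating points plus the trivial (0,0) and (1,1) endpoints.
print(roc_upper_hull([(0.0, 0.0), (0.2, 0.6), (0.3, 0.5), (0.5, 0.9), (1.0, 1.0)]))
# -> [(0.0, 0.0), (0.2, 0.6), (0.5, 0.9), (1.0, 1.0)]
```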
So this is an experiment,
a credit assessment experiment;
this is from the famous UCI
machine learning repository.
So you're trying to determine,
from some training dataset,
whether somebody's
creditworthy or not,
plus or minus.
This is the full tree I was talking about.
So this is the full tree and
we wanna find the ROC curve
of optimally pruned subtrees
or randomizations which yield it.
Okay so the figure on the
left is the alpha gamma space
with the convex polygons, each representing
one particular
optimally pruned subtree.
And the figure on the right
is a magnified view of
the lower right corner.
You can see these convex polygons here.
And most of the action seems
to be going on in that corner.
There are actually 190
of these convex polygons,
corresponding to 190
optimally pruned subtrees
that we generate and then use.
We finally estimate the
probabilities of detection and
false alarm for each of those,
that's the figure on the left
and then we extract the
boundary of the convex hull,
that's the figure on the right.
So the algorithm which
we developed to do this
is actually quite
interesting; there are some
linear programming sub-problems
that need to be solved.
But it seems to work.
Okay next thing I wanna
talk about is incremental
and adaptive regression trees.
So regression trees, like
classification trees, are an important method for
non-parametric or non-linear
regression.
The trees are constructed
like classification trees.
Conventionally there is just
a continuous response value
at each of the terminal nodes,
but we're actually gonna use
a multiple linear regression,
you could even use a generalized
linear model, at each node,
because we want to actually do
piecewise linear regression,
or piecewise linear filtering.
There are various methods
to find the regressions,
split points, and prunings.
Again, like classification trees,
these can be used for prediction,
and also variable selection,
ranking, association, all those things.
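Purely as an illustrative sketch of that piecewise linear idea, where the routing function leaf_of and the data layout are hypothetical, fitting a linear model at each terminal node might look like this:

```python
import numpy as np

def fit_leaf_models(leaf_data):
    """Fit a multiple linear regression, with an intercept, at each terminal node.

    `leaf_data` maps a leaf id to that leaf's (X, y) training pairs; together the
    per-leaf models give a piecewise linear fit over the whole feature space."""
    models = {}
    for leaf, (X, y) in leaf_data.items():
        X1 = np.hstack([X, np.ones((X.shape[0], 1))])   # append an intercept column
        w, *_ = np.linalg.lstsq(X1, y, rcond=None)       # least-squares fit at this leaf
        models[leaf] = w
    return models

def predict(models, leaf_of, x):
    """Route x to its terminal node (via `leaf_of`) and apply that leaf's linear model."""
    w = models[leaf_of(x)]
    return np.append(x, 1.0) @ w
```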
So incremental and
adaptive regression trees.
This is what we worked on.
Incrementally designed and
adaptive regression trees
are important when additional
data becomes available
or the data's not stationary.
Because you don't want
to rebuild the whole tree
especially in some deep learning
problem with a big dataset.
That's not practical and of course
that wouldn't work if
the data's not stationary
and we're trying to track it.
In the literature, people
use heuristics for this,
there's no analysis.
So the basic problem that you have
to come to grips with here
is that, even with stationary
independent data, strong assumptions,
the data at the non-root
nodes has a very complex,
non-stationary and dependent character,
because the splitting is changing at
the nodes due to the new data coming in.
So what we did was we
developed MMSE fixed-gain
stochastic gradient algorithms.
Also adaptive pruning
algorithms, the whole thing
is adaptive.
And we actually were able to
demonstrate the convergence
and specifically how it's
related to the tree depth.
And this actually guided us
in how to formulate the algorithm.
And actually I am very proud of this work;
we developed some new ideas
about analyzing hierarchically structured
stochastic gradient algorithms.
And we also applied it
to some non-linear echo
cancellation and equalization problems.
Here's an example:
equalization of a severe ISI channel.
So these are learning curves.
So on the left plot, this is
the tree structured approach;
as you move up you get the linear equalizer,
and then a second-order polynomial,
and then a third-order
polynomial.
And on the right plot,
this is the asymptotic error rate,
the probability of error.
On the bottom, the best performing
is the tree structured approach,
then the third-order polynomial,
second-order polynomial,
and linear equalizer; the linear
equalizer actually has an error floor.
So what we see there is,
from both the point of view
of convergence rate and
asymptotic error rate,
the tree structured approach works better.
And the obvious thing is to
use a polynomial equalizer.
The reason that doesn't
work is because
you have to use a high
enough order polynomial scheme
to get enough approximation capability,
but then there are so many terms
that it slows down the rate of convergence.
And so you could pick terms
offline, people do that.
But that wouldn't be suitable for
an adaptive implementation.
Okay next thing I want to talk about is,
and still in this pattern
recognition machine learning area
is multilayer neural networks.
So multilayer neural
networks are a basic tool
for non-parametric
classification and regression.
Very popular in the current
deep learning craze.
Multilayer neural networks
consist of weighted linear summations
and non-linear activation function units
arranged in a feedforward network.
There's other types of networks,
recurrent networks,
convolutional networks,
but this is still popular.
These multilayer neural
networks, feedforward networks,
are classically trained with what's called
the Werbos back propagation algorithm,
which is actually a stochastic
gradient algorithm.
Again the field has progressed
quite a bit since this work
and there are new methods.
There is a sort of
pre-training at hidden layers,
which are not input or output layers,
and unsupervised feature
selection at hidden layers,
but the back propagation algorithm
is still the primary tool for training
these types of networks,
especially for fine tuning,
even in deep learning.
So here's a multilayer neural network.
At the top is a neuron;
the activation functions
in that circle there at the top
are some examples,
the second one being the sigmoid,
the classically popular one.
There are some others that
are increasingly popular now.
And at the bottom we have a two layer,
one hidden layer, feedforward
multilayer neural net
with both multiple inputs and outputs.
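As a purely illustrative sketch of that classical training procedure, with made-up dimensions, a squared-error loss, and no bias terms, one back propagation update for such a one-hidden-layer network could look like this:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(W1, W2, x, target, step=0.1):
    """One stochastic-gradient (back propagation) update for a one-hidden-layer
    feedforward net with sigmoid activations and squared-error loss."""
    # Forward pass.
    h = sigmoid(W1 @ x)                        # hidden layer activations
    y = sigmoid(W2 @ h)                        # output layer activations
    # Backward pass: propagate the error back through the layers.
    delta_out = (y - target) * y * (1 - y)     # output-layer error term
    delta_hid = (W2.T @ delta_out) * h * (1 - h)
    # Stochastic gradient step on both weight matrices.
    W2 -= step * np.outer(delta_out, h)
    W1 -= step * np.outer(delta_hid, x)
    return W1, W2

# Toy usage: 3 inputs, 4 hidden units, 2 outputs; weights updated one sample at a time.
rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((2, 4))
W1, W2 = backprop_step(W1, W2, x=np.array([1.0, 0.5, -0.2]), target=np.array([1.0, 0.0]))
```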
Okay, so what we did was
an analysis of this,
and I will try to explain why
what we did was anything
but just applying
the usual theory.
So back propagation is widely applied,
but the analysis is difficult
because it's a complex
non-linear stochastic system.
The standard analysis
actually uses averaging
to determine an associated
ODE, and you linearize
around a candidate equilibrium point
to get some kind of local
asymptotic stability
for the original, for the averaged system,
and then as well for the back
propagation algorithm itself.
But it turns out the
analysis does not explain
the qualitative behavior,
due to the nonlinearity,
which has been observed over time
with back propagation, which
is that there's a long-term
dependence on the initial conditions
and there's also a
drifting of the weights.
So we did something different.
We analyzed back
propagation using a separate
statistical linearization
of each activation unit.
This is like what's called
the describing function method
in non-linear systems analysis.
So, unlike the classical
conventional approach where you linearize
the whole mean vector field,
the algorithm is still nonlinear
and it reflects the
behavior more accurately.
This approach yielded an associated ODE
which turns out to have an unbounded
manifold of equilibria.
And we showed that the
trajectories of the ODE
are bounded and
converge to that manifold.
We could not use
Lyapunov theory; we had to use
LaSalle's theory for this.
I should also point out
that convergence
does not imply boundedness here,
because the manifold
is of infinite extent.
Okay and then empirically
we confirmed that the back propagation
mean vector field actually
has such a manifold
and there was this dependence
on the initial conditions
and drift along the manifold.
So what I've shown here, for
a very simple two layer net,
I think just a single input and output,
so there are two weights, but it illustrates
the general theory,
are some equilibria and trajectories.
So the equilibria for the
average back propagation algorithm
are these hyperbolic looking things
and for the quasi-linearized version,
with the statistical linearization of
each activation unit, we
have similar equilibria.
Also the trajectories are pretty similar.
And in fact it does predict
this type of weight drifting:
depending on the initial conditions,
the back propagation kind of contacts,
or approaches, this equilibrium
manifold in different places
and then subsequently drifts along it.
As far as I know we are
the only ones to have done
this kind of analysis
and to actually explain
this qualitative behavior.
Okay, the next thing I wanna talk about
is the LMS algorithm.
So adaptive algorithms are an approach
to predicting and estimating
some unknown signal,
or identifying or approximating
some unknown model parameters.
The identification and
approximation of things
are not exactly the same thing.
We don't really need a model.
In fact we don't operate
mostly in that setting,
I guess we're more in the
approximation setting.
Anyway, there are some subtle
differences there.
The most well known and most
widely used adaptive algorithm
for MMSE linear estimation is a
stochastic gradient algorithm
called least mean squares, or LMS.
There are many applications of this
when low complexity linear
estimation is appropriate.
And what we did was we
investigated several fundamental
algorithmic and theoretical problems,
motivated by practical issues,
using a constructive approach.
And what do I mean by
constructive approach?
A nonconstructive approach is the following.
Sort of make very weak
assumptions on the data
and show something like there
exists a step size sequence
or a step size with some desirable
asymptotic property.
But it doesn't say how
big the step size is.
Or there really aren't useful details
about the asymptotic property.
In a constructive approach
we make strong assumptions
on the data.
Maybe stationary data,
stationary independent data.
Stationary independent Gaussian data.
And we actually can derive
bounds on the step size
and other information,
practical information
about the performance of the algorithm.
And then what you can do
is you can do simulation
to validate that the
algorithm kinda works this way
even when the data doesn't
obey such strong assumptions.
So this is kind of a useful
approach for engineering.
So here is the LMS
formulation and algorithm.
So y hat is this regression on x.
x is the regressor, we
minimize the mean square error.
There's also a tracking problem
when the data is nonstationary,
and then I've shown the LMS
algorithm at the bottom here.
Alpha is the step size.
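Here is a minimal sketch of that standard recursion, with a made-up fixed step size, just to fix notation:

```python
import numpy as np

def lms(x_seq, y_seq, num_taps, alpha=0.01):
    """Least mean squares: stochastic-gradient adaptation of a linear estimator.

    y_hat_k = w_k' x_k,   w_{k+1} = w_k + alpha * (y_k - y_hat_k) * x_k,
    where x_k is the regressor vector and alpha is the (fixed) step size."""
    w = np.zeros(num_taps)
    errors = []
    for x_k, y_k in zip(x_seq, y_seq):
        e_k = y_k - w @ x_k           # estimation error
        w = w + alpha * e_k * x_k     # LMS weight update
        errors.append(e_k)
    return w, np.array(errors)
```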
So the first contribution
I want to talk about
is what's called noise constrained LMS.
So it turns out that in some problems,
the expected performance can be estimated;
for example in some wireless
communication systems,
assuming that the actual
channel is in the model set
and automatic gain control is used,
the performance is
actually just the noise power.
And this is essentially known;
this actually is the
case in CDMA networks,
where certain types of signaling are done
to monitor the signal to noise ratio.
So what we did is we proposed
using this information in
an adaptive constrained MMSE optimization
to improve the performance of LMS.
It's more general than that.
This methodology can be
used to include components
of model based information
into adaptive algorithms
and so interpolate between
fully adaptive algorithms,
very simple algorithms typically
and fully model based algorithms.
So this is sort of a different approach.
Instead of assuming there's a model,
and then, well, maybe I don't
really know the model, so
scale the model back
and things like that,
we're starting at the bottom
and adding model based information
to a fully adaptive algorithm.
So in the noise constrained,
or performance constrained, minimum mean square
linear estimation problem, what we do is
we minimize a Lagrangian which is formed
from the mean square error and the constraint,
and then, and this is the critical thing,
we penalize, and this sign
turns out to be important,
we penalize the multiplier.
Now why do we do that?
Okay, well the penalty term
is added in since otherwise
it turns out that the critical
values are non-unique,
and actually unbounded,
well, non-unique and unbounded.
In particular, the multiplier is
non-unique and unbounded,
and that creates problems
with an adaptive algorithm.
So by subtracting this penalty term,
penalizing the multiplier,
we get a unique critical value.
Which turns out actually
to be a saddle point.
So since it's a saddle point
we can't use stochastic gradient,
we have to use what's called
a Robbins Monro algorithm.
But we can do it, and adaptively
solve for the solution of what
we call a multiplier penalized
constrained MMSE problem,
which is this algorithm.
It is a type of variable step
size LMS algorithm, where the step size
turns out to be data dependent.
So for stationary
independent Gaussian data
we did a rigorous analysis
and showed the NCLMS weights
and multipliers were
bounded in mean square.
Actually this led us to
look at the general problem
of analyzing LMS-type algorithms
with data dependent step sizes,
which was something that also hadn't been done;
we'll talk about this in a bit.
We also performed an
approximate analysis
that showed that the NCLMS algorithm
achieved larger convergence rates
and smaller asymptotic MSE,
and even in the case
where there was mismatch,
where we hadn't estimated the
performance correctly,
like this sigma squared
term in the constraint,
there was still a performance gain.
We were able to show actually it
was best to overestimate.
So here's an example
of the identification of an ISI channel.
These are learning curves
for the third channel tap.
So best performing is actually,
if you can see this,
this is recursive least squares
which is a much more complex algorithm.
Next to it is the NCLMS.
And then the other
algorithms are various types
of variable step size LMS
from the literature,
which are based mostly on heuristics,
unlike the kind of
principled derivation we gave.
And the worst performing
is the LMS here.
Okay.
So third topic or second topic I guess
in least mean squares type adaptive algorithms
is the general analysis of
variable step size LMS.
So this noise constrained
LMS we just talked about
is a type of variable step size LMS,
but the step size depends on the data sequence.
There are many other types,
based mostly on heuristics.
The idea is you wanna choose
a step size that is large initially,
so you can get fast convergence, and small
eventually to get a
small asymptotic error rate.
It turns out, when we looked carefully,
that a rigorous analysis of
the general data dependent
step size case was unknown in the literature.
In fact it was just assumed
that if the variable
step size satisfied the
same bounds required
for a fixed step
size, then it was stable,
had the same stability as LMS.
And that turns out to be true,
and it's easily shown,
if it's not just a fixed step size
but a deterministic step size.
of a variable step size
LMS including the NCLMS were
actually not just variable
but data dependent.
It's a difficult problem
to analyze because,
unlike the fixed or variable
non-data-dependent case,
you can't get a recursion for
the weight error covariance.
So what we were able
to do was to determine
some non-linear difference equations
which were satisfied by certain bounds
on the weight error covariance,
bounds which were uniform
over an entire class
of data dependent step sizes.
For example, ones with
prior dependence, meaning
they depended on the data
up to the previous time,
and posterior data dependent ones,
which depended on data
up to the current time.
That was the key.
We were then able to
analyze these equations,
determine stability regions
and we showed that the stability region
for a data dependent
step size can actually
be strictly smaller than for a fixed step size,
contrary to the usual assumptions
in the literature.
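As an illustration only, and not the NCLMS rule itself, here is a minimal sketch of an LMS recursion whose step size depends on the data; this particular normalized-style rule uses the current regressor, so it is loosely of the posterior-dependent type described above.

```python
import numpy as np

def vs_lms(x_seq, y_seq, num_taps, alpha_max=0.05):
    """LMS with a data-dependent step size.

    The step at time k is scaled down by the current regressor energy, so the
    step-size sequence depends on the data; many published variable step-size
    rules are heuristics of this general flavor."""
    w = np.zeros(num_taps)
    for x_k, y_k in zip(x_seq, y_seq):
        e_k = y_k - w @ x_k
        alpha_k = alpha_max / (1.0 + x_k @ x_k)   # data-dependent step size
        w = w + alpha_k * e_k * x_k
    return w
```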
So a little bit of detail on this.
Here's the general form of
the variable step size LMS.
Alpha k is the, generally
data dependent, step size.
So we let script A
denote this class of step size
sequences I was talking about.
So it could be fixed,
deterministic, prior data-dependent
or posterior data dependent.
And then we let S
sub script A be the set of step size
intervals for which the weights
are mean square bounded
for all step size sequences in that class
which lie in the interval.
We call this the mean
square stability region
for the class of step
size sequences script A.
Now it's known that for stationary
independent Gaussian data,
strong assumptions in this
constructive approach,
the stability region for
fixed step sizes is just
basically step size
intervals which are bounded
by some parameter alpha star.
Alpha star can be characterized
in terms of the eigenvalues
of the covariance of the regressor.
And it's also easy to show
that this chain of inclusions holds.
So what we showed first for
the case of a single tap
is that the stability region
for the prior step size
was the same as for the fixed step size
but the stability region
for the posterior step size
is strictly contained
in the stability region
for the fixed step size.
And then when we looked at
what turned out to be a much
harder problem of multiple taps
we were able to get bounds on
the prior and posterior step size
regions.
Still, we were able to show
that the posterior step size region
was strictly contained
in the stability region
for the fixed step size.
So here is an example of this.
The stability region
for the fixed step size
is just this kind of
triangular region here.
This point here is alpha star.
This axis is the upper bound.
Any step size which is less
than alpha star for the fixed
step size would be mean square stable.
The stability region for
the posterior step size
is now bounded away from the region
for the fixed step size
and it's kind of
interesting because it shows
that as the upper limit
on the step size gets larger,
so does the lower limit
on the step size interval.
So this interval is kind
of getting more narrow.
And actually, as the upper limit
of the step size interval
approaches the maximum value,
so does the lower limit.
Which means that if you want
to choose, say, the maximum
step size, say to get the
largest rate of convergence,
you essentially have to
choose a fixed step size.
Because there's no wiggle
room between the lower
and upper limit once you
choose such a large step size.
In the multi tap case we
were able to get bounds.
The figure on the left shows such bounds
for minimal eigenvalue spread, on the right
for a considerable eigenvalue spread.
Okay so,
that's all I wanted to
talk about in detail.
Now I wanna talk about
some other stuff briefly.
So I wanted to discuss
some stuff which I've spent
a lot of time here doing,
in fact, realistically half
my career, probably more.
But I've since moved on
in the last several years
to this stuff I was
talking about previously;
I moved back to machine
learning, pattern recognition,
statistics, optimization.
Well there is some overlap.
So this other work involved a lot of work
with optimal and near
optimal model based methods.
All kinds of variants
of AR and ARMA models,
Markov, hidden Markov, hidden semi-Markov,
state space models, things like that.
And here is a list,
I'm not gonna read this.
A list of some of the
things that I worked on.
Maybe I'll mention the last two.
One was this ethanol concentration
and pattern recognition
from an implantable biosensor.
I should say something
about funding, I guess.
So this was funded by an
NIH R01 grant.
And most recently this
harmonic spectral analysis
and pattern recognition
from probed passive devices
which was funded by an
Army MURI grant.
This other earlier work
was actually funded by NSF.
And also the work I was
discussing earlier about adaptive algorithms
and stochastic approximations,
that was also funded by NSF, but also by
the Army Research Office
under a core grant.
I also got some high performance
computers from them
under their DURIP program.
And I think around the year 2000,
I actually had the most
powerful compute servers
in the department.
And I remember the
department at the time
asking me for accounts on those machines
for some of the incoming
faculty, a funny story.
Okay, the other area I worked on,
and mostly moved on from, although
I have some regrets
because this area has now
become very active again.
So I've done a lot of
work on practical problems
and most of it is supported by
industry involving modeling,
algorithm development and analysis
and lots and lots of simulation
of all kinds of different
coding and modulation
and channels and wireless
and satellite communications
and broadcast systems.
Again I'm not going to
read through all these.
The most recent one is this
non-linear channel coding
for satellite channels
with Northrop Grumman,
going on for a few years.
It's kind of an interesting channel model,
it's a peak power
constrained channel model.
Oh, one thing I should say is,
I had a multiyear relationship
with Thomson, initially
Consumer Electronics and then Multimedia,
before they left, and
that laid the foundation
for writing a fairly large
21st century grant,
large at the time anyway.
So I learned a lot from this stuff
but I'm kind of returning
to the pattern recognition
and machine learning area.
My interest in this
area is in what I'd call
modern time series analysis.
I'm using tools from
statistics, machine learning,
optimization and also some
model based signal processing.
Some recent work is discovering
these temporal drinking
patterns from implantable
ethanol biosensors,
already mentioned that.
Also discovering temporal
dietary and physical activity
patterns from surveys and accelerometry.
So these are both problems
in the health area
and there's a lot of these
kinds of problems there.
So we had a little seed grant from NIH
for this accelerometry work
but it's been challenging
to get funding, but we have proposals
and we're optimistic.
On the theoretical side I am
interested in developing
and analyzing classification
and regression trees,
both pruning with a single tree
and averaging with an
ensemble of trees
for classification and
prediction of time series.
And then also developing
and analyzing dynamic
time warping for comparing and clustering
sparse time series.
So dynamic time warping
is a method to compare
time series which are sort of running
at different rates.
This is the kind of thing which happens a lot
when people are involved,
when they speak or eat or move,
and some of these time
series are quite sparse.
Sparseness can kind of 
arise in various ways.
One is, there is like
a missing measurement.
Okay, maybe there's a sensor issue,
or somebody doesn't answer
the questionnaire.
But typically there isn't a lot of that kind of sparseness;
it's kind of a rare event.
But the other way it can
arise is you can have
kind of lots of zeros
in a time series record.
These would be periods
where people aren't speaking
or maybe they're not eating or moving.
That's actually very significant
and so that's what we are looking at
trying to understand fundamental limits
and practical algorithms
for dealing with that kind of sparseness
while maintaining the near optimality
of the dynamic time warping criteria.
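Here is a minimal, purely illustrative dynamic programming sketch of the basic dynamic time warping distance, without any of the sparseness handling being studied here:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D time series.

    A dynamic program finds the minimum-cost alignment, which lets the two
    series run at (locally) different rates."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three allowed alignment moves
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Two series with the same shape at different rates still align closely.
print(dtw_distance([0, 1, 2, 3, 2, 1, 0], [0, 0, 1, 1, 2, 3, 3, 2, 1, 0, 0]))
```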
So we just had a paper
accepted on this subject
at one of the more prestigious
machine learning workshops
and we'll see how that goes.
So I've got a couple
of pages of references,
I guess that's it.
(audience applause).
-Thank you very much,
interesting history.
Do you guys have any questions?
-[Woman] Yeah.
Okay, I have one question:
the algorithm to actually
build Neyman Pearson (mumbles).
Do you have the code for this,
or would you want to
share the code for this?
-We do.
I can't guarantee that
it's in great shape.
But we do.
-Okay.
-Okay.
-Thank you very much.
(audience applause)
