Data is the new oil.
It is the basis for many recent developments,
such as increasingly intelligent smartphones,
improved industrial processes, computers
that win games such as Go and chess, and scientific
discoveries in biology and many other fields.
However, just like oil, data is not useful
by itself; it needs to be processed before
it becomes valuable.
And processing data is not easy.
Let me use an example to illustrate this.
Suppose that a hospital has collected data
regarding the levels of expression of genes
in a number of patients.
A fundamental question for such data is whether
we can find relationships between the expression
of genes and diseases that are present in
the patients, such as breast cancer.
Unfortunately, finding such relationships
is not easy.
An important complication is that for some diseases
we need to consider not only relationships
involving individual genes, but also
combinations of genes.
Even for a dataset involving just 300 genes,
the total number of combinations
can be more than 10 to the power of 90; this
is more than the number of atoms in the universe.
This illustrates the problems that I am studying
in my research.
How can we develop algorithms that find patterns,
models, correlations and associations in large
datasets?
Algorithms that solve these problems in a
generic manner are known as data mining and
machine learning algorithms.
They can be applied to many different types
of data, including but not limited to biological
data.
The problem of finding combinations of genes
is a good example.
In its more generic form, this problem
is known as itemset mining.
In this case, we treat highly expressed genes
as generic items.
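To make this concrete, here is a minimal sketch of that representation. The gene names, expression values, and the threshold of 2.0 are invented purely for illustration; each patient becomes a "transaction" containing the genes that are highly expressed.

```python
# Hypothetical expression levels per patient; the gene names and
# the 2.0 threshold are invented for illustration only.
expression = {
    "patient1": {"BRCA1": 3.1, "TP53": 0.4, "MYC": 2.7},
    "patient2": {"BRCA1": 0.9, "TP53": 2.5, "MYC": 2.9},
}
THRESHOLD = 2.0

# Each patient becomes a transaction: the set of highly expressed genes.
transactions = {
    patient: {gene for gene, level in levels.items() if level > THRESHOLD}
    for patient, levels in expression.items()
}
```

With this representation, finding combinations of highly expressed genes becomes the generic problem of finding sets of items that occur together in many transactions.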
How do my algorithms work?
They often take a similar approach.
They build a search tree.
The leaves of this search tree correspond
to the different possible itemsets.
The internal nodes reflect the decisions whether
or not to include items in the itemsets.
This tree can become very large.
The trick that makes these algorithms efficient
is that we use constraints to remove parts
of the search tree.
A reasonable constraint in our example is
that we are not interested in combinations
of items that never occur together in our data.
We can use this constraint to remove the
parts of the search tree that only contain
such combinations.
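The search tree and this pruning step can be sketched as follows. This is a simplified depth-first version under my own assumptions, not the actual algorithm from my research: each node decides whether to include the next item, and a branch is abandoned as soon as its itemset occurs in too few transactions, since no superset can occur more often.

```python
def mine_itemsets(transactions, items, min_support=1):
    """Depth-first search over the itemset search tree.

    Internal nodes decide whether to include the next item;
    leaves correspond to itemsets. Branches whose itemset occurs
    in fewer than min_support transactions are pruned: adding
    more items can never increase the number of occurrences.
    """
    results = []

    def support(itemset):
        # Number of transactions that contain the whole itemset.
        return sum(1 for t in transactions if itemset <= t)

    def search(current, remaining):
        if support(current) < min_support:
            return  # prune this entire subtree
        if current:
            results.append(frozenset(current))
        for i, item in enumerate(remaining):
            search(current | {item}, remaining[i + 1:])

    search(frozenset(), list(items))
    return results


# Toy data: three transactions over three items.
transactions = [{"A", "B"}, {"A", "C"}, {"A", "B"}]
frequent = mine_itemsets(transactions, ["A", "B", "C"], min_support=2)
```

On this toy data, only the itemsets that occur in at least two transactions survive the pruning.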
Using this trick and many other search tricks,
I developed an algorithm that can find itemsets
in data with thousands of items.
I believe that this approach can be used to
solve many other problems as well.
Consider a problem of explainable artificial
intelligence.
If a computer is used to suggest treatments
for patients, doctors would like to be able
to understand these suggestions.
Consider the problem of fair artificial intelligence.
If a computer is used to make decisions that
involve humans, these decisions should be
fair.
For instance, the computer should not discriminate.
Fairness can be seen as a constraint on the
outcome of a data mining algorithm.
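As a toy illustration of that idea, a fairness requirement can be phrased as a check that a mined pattern must pass before it is reported. The group labels, items, and the 0.1 gap below are all invented, and this demographic-parity-style check is just one simple way to phrase such a constraint, not a full definition of fairness.

```python
def satisfies_parity(records, pattern, group_key, max_gap=0.1):
    """Accept a pattern only if it matches records at a similar
    rate across groups (a demographic-parity-style constraint;
    the 0.1 maximum gap is an arbitrary illustrative choice)."""
    counts = {}
    for record in records:
        group = record[group_key]
        hit = pattern <= record["items"]  # does the pattern match?
        total, hits = counts.get(group, (0, 0))
        counts[group] = (total + 1, hits + hit)
    rates = [hits / total for total, hits in counts.values()]
    return max(rates) - min(rates) <= max_gap


# Invented records: the pattern {"x"} matches group "a" twice
# but group "b" only once, so the rates differ too much.
records = [
    {"group": "a", "items": {"x"}},
    {"group": "a", "items": {"x"}},
    {"group": "b", "items": set()},
    {"group": "b", "items": {"x"}},
]
ok = satisfies_parity(records, {"x"}, "group")
```

A mining algorithm could use such a check as one more constraint during the search, just like the support constraint above.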
Hence, use of search and constraints is a
generic approach.
If you have data, and if you think that search
could be useful to identify patterns in these
data, then don't hesitate to contact us.
