[MUSIC].
>> Good morning. It
is my great pleasure
to introduce today's
speaker, Silu Huang,
who came all the way
from Illinois and
is absolutely not impressed by
what little snow we have here.
Silu is a recipient of
the MSR Ph.D Fellowship,
and she has done a lot of
work in different areas.
All of it interesting.
Today she's going to talk about
Effective Data Versioning,
but she's also done work
while she was interning at
Microsoft research on sampling
with probabilistic
accuracy guarantees.
She's also done work
in the context of AI,
for selecting the best
configuration in which to train
an ML model and more things
than I'm even going to mention.
So she'll maybe briefly mention some of it,
and if you're interested, feel free to talk to her about that later.
Thank you very much and
I'll hand it over to Silu.
>> Thanks for the nice introduction.
Good morning everyone.
I'm Silu Huang and I'm a fifth
year PhD student from UIUC.
It's my great pleasure to be
here and share my thesis work
on Effective Data Versioning
for Collaborative Data Science.
So nowadays, data
science is increasingly
popular and has been recognized
as an emerging field.
So in particular, data
science has lots of
applications in
different areas like finance,
healthcare, manufacturing,
and retail.
So what does the data science life cycle look like? Typically, data science starts from task formulation, and then we acquire data from logging systems, followed by data preparation, including data cleaning and transformation. Then we can try to understand and refine the data through initial data exploration. After gaining a broad picture, we can then further investigate machine learning models. In order to achieve a good machine learning model, data scientists will typically try to incorporate new data and try different data pre-processing and feature engineering operators, as well as different models. After obtaining a satisfactory model, we can then deploy and serve it. This will in turn generate more new data.
So we can see this is
an iterative process.
During this iterative process,
different versions of the same data
set will be generated.
So during my PhD study,
I'm working towards making
data science simpler
and more efficient,
and spanning many of
different stage inside
this data science life-cycle,
like effective data versioning
and automatic data extraction,
interactive data exploration,
modeling, and AutoML.
So I will not have time to cover
each of this topic in detail,
but in this talk I will instead
focusing on effective
data versioning.
So as I mentioned, data scientists routinely generate a number of different versions by applying data pre-processing and feature engineering operators. This is a real example from the biological domain, where people collaboratively construct, curate, and analyze a protein-protein interaction dataset. Here protein one and protein two form a composite primary key, along with some numerical features that indicate the interaction strengths from different sources.
Initially, we have version one. As biologists conduct more lab experiments, new records and columns will be added, leading to a new version, V2 here. Simultaneously, some experts can work on their own working copy, correct and curate the dataset, and periodically merge it back to the main line. Also, data scientists can download and analyze their own working copy, and so on and so forth. So we can see this is a non-linear process with branches and merges. During this process, hundreds to thousands of versions are generated.
So how are these versions managed in each organization? Here I will show how data is managed in a computational biology group at MIT. Typically there are around 20-30 students, postdocs, and researchers, with around 100 terabytes of data shared via a file system. Each terabyte of data costs around $800 per year in maintenance from a local storage provider, for unlimited read and write access. This amounts to around $100,000 per year, which is a large amount for an academic research group.
So when these scientists want to do analysis, they are forced to make a private copy, then modify and analyze it, with ad hoc names assigned to each newly produced version, like "Dataset V1" here. However, such ad hoc management is problematic. Just as shown in this comic: never look into someone else's document folder. These versions are simply lying around in the shared file system, and no one really knows what each one is, what it contains, or how they relate to each other. Specifically, from the user's perspective, there is no true collaboration: there is no easy way to merge someone else's modifications or to share your newly produced results. There is also no querying capability to check dependencies or perform analytics across different versions. From the system's perspective, there is a lot of duplication among the different versions, which is a waste of storage. Actually, we were told that the team members are periodically asked to clean up the disk because of the disk cost constraint.
Also, due to the lack of metadata, there is no easy way to figure out which versions can be deleted and which versions are important. So there is a pressing need for systems to manage these versioned datasets. There are mainly three desired properties for such a management system. The first is that we want compact storage, to reduce the storage cost. The second is that we want efficient versioning capability, to enable true collaboration. Last but not least, we want data manipulation and analytics to help data scientists reason across different versions.
So what about existing version management systems? The first thing that comes to mind may be a source code version control system like Git. However, Git is designed for source code: it cannot scale to large datasets, and it only provides limited data manipulation and analytics functionality. Temporal databases, on the other hand, can only support a linear chain of versions. So the question we ask here is: can we make use of a mature relational database and support versioning on top of it? In this way, we can inherit many of the benefits of a relational database, for example the advanced querying capability, the locking, the indexing, and so on, while also providing versioning capability.
So the rest of my talk is divided into two parts. First, I'm going to tell you about OrpheusDB, which is a bolt-on approach for structured data versioning. Specifically, I will first show you a demo and then dive into some of the details in developing OrpheusDB. After that, the second part is about how OrpheusDB makes certain restrictive assumptions, and how we try to relax some of those assumptions to make the approach more general. These are the two main parts of my talk. After that, I will give a brief introduction to other works I have been doing during the course of my PhD, and I will conclude with some future work.
So first, I will introduce the prototype we built, called OrpheusDB, for structured data versioning. This is joint work with another PhD student, Liqi, two Master's students, Aaron from Chicago, and my advisor Aditya.
So here is the overall architecture of OrpheusDB. OrpheusDB is a lightweight layer on top of a traditional relational database. The user can interact with OrpheusDB using version control commands or SQL commands. OrpheusDB then translates the input statement into SQL queries that can be executed in the underlying database. The underlying database is unmodified and is unaware of the concept of versioning; all the versioning logic is handled inside the intermediate OrpheusDB layer.
So first, let's dive a bit into the query language inside OrpheusDB. First, we support Git-style commands for true collaboration: checkout, commit, merge, and diff. We also support advanced querying capabilities to help data scientists reason about different versions.
Here, I'll simply list some of the useful queries that can be asked. For example, the user can ask: can we find the version that was last committed by Alice? Can we retrieve the first bad version that contains some erroneous tuple? In a data science scenario, maybe we want to find the version with the highest accuracy on some particular data slice, say in the US. Also, maybe we want to find the history of a particular tuple. All of these querying capabilities enable data scientists to track versions, compare different versions, and identify versions satisfying certain properties.
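To make this concrete, here is a minimal sketch of what the first of these queries could look like against the backing Postgres instance, assuming a hypothetical metadata table version_metadata(vid, author, commit_time); this only illustrates the kind of cross-version query, not OrpheusDB's actual schema or syntax.

    import psycopg2  # assumes a reachable Postgres instance backing the version store

    # Hypothetical table: version_metadata(vid, author, commit_time).
    # Find the version last committed by Alice.
    conn = psycopg2.connect(dbname="protein_versions")
    cur = conn.cursor()
    cur.execute(
        """
        SELECT vid
        FROM version_metadata
        WHERE author = %s
        ORDER BY commit_time DESC
        LIMIT 1
        """,
        ("Alice",),
    )
    print(cur.fetchone())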
So before I dive into the details, I will first show a demo. On the left-most part is a file visualizer, where the user can view all the files accessible to this user. The middle part is where the user can type in SQL-like commands or version control commands, and the result is returned in the bottom part. The right-most part is a version visualizer: here each node is a version, and the edges indicate the derivation between versions. At the bottom there are some shortcuts; when you click each button, it will populate the SQL or version control commands here.
I will first show some SQL-like queries. For example, we can view the content of a particular version. We can also try to identify versions that have a certain property. For example, here what we want to find is the versions with fewer than 100 protein pairs whose co-expression interaction is larger than 200. So basically, we click the button, and we find that version three and version four satisfy this property. We can also issue any traditional SQL query on one version or on a set of different versions. Here, we are issuing a group-by query.
Next, I will show some of the version control commands. We can first check out a version, and then use some external tools to process or curate it. Here, I will modify basically three different records and curate them. After modifying it, we can save it and commit this version back to the system, so that others can also visualize or access it. We can see a new version pops up here, and we can compare the difference between versions. Here, we can see there are three differing tuples, and we can also see the meta information of this particular version.
So now that we have looked at the demo, I will dive into some of the details in developing OrpheusDB. There are two major criteria in developing OrpheusDB: the first is that we want compact storage, and the second is that we want efficient versioning capability. In particular, we focus on two fundamental operators: one is commit and the other is checkout. So overall, we have three different criteria: storage consumption, commit latency, and checkout latency. These are the three criteria that guide our design choices inside OrpheusDB.
I will start from a strawman approach. One naive way is to augment each tuple with the version ID, VID here. However, this has two major issues. The first is that the storage can be very large: for example, here this record is repeated twice; basically, each record will be repeated as many times as the number of versions it is present in. The second issue is that checkout can be very costly, because we waste time accessing irrelevant records. For the first issue, we will come up with an optimized representation scheme, and for the second issue, we will propose a partitioning scheme.
Let's start with the data representation. By the way, if you have any questions or anything is not clear, feel free to ask. Yes, I think there is something wrong with the slides; hopefully it will be fine. So one natural way is to combine the tuples and associate each distinct tuple with a version list, VList here. Even though the array data type is not an atomic data type, it is commonly supported in databases, so in this way we can reduce the storage cost. However, the commit can be very costly: if we clone version three as version four and commit version four immediately back to the system, then we basically need to access every single record present in version four and append version four to each of their VLists.
So what about other alternatives? One other way is to separate the version information from the data information, because they are independent of each other. Here we have the data table with an augmented record ID, and we need a mechanism to encode the version information, that is, the mapping from versions to records. One natural way is to organize it based on the record ID, and the other way is to organize it based on the version ID. The first one also suffers from costly commits, because it is similar to the combined table in the previous slide. But for the second one, when you want to commit a new version, you only need to insert one more tuple here, with the version ID and its associated record list. So we eventually chose this split-by-rlist scheme as our data model, and we conducted experiments verifying that split-by-rlist has the best performance in terms of storage and commit time. Okay.
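As a minimal sketch of the split-by-rlist idea (the names and structures below are my own illustration, not the actual OrpheusDB implementation): the data table stores each distinct record exactly once under an immutable record ID, and the versioning table maps each version ID to the array of record IDs it contains, so a commit inserts a single row and a checkout fetches only the listed records.

    # Illustrative in-memory model of the split-by-rlist layout.
    data_table = {}        # rid -> record; each distinct record is stored once
    versioning_table = {}  # vid -> list of rids (the "rlist")
    next_rid = 0

    def commit(vid, records):
        # Reuse rids for records already present, assign new rids otherwise,
        # then insert a single (vid, rlist) row into the versioning table.
        global next_rid
        known = {rec: rid for rid, rec in data_table.items()}
        rlist = []
        for rec in records:
            rid = known.get(rec)
            if rid is None:
                rid, next_rid = next_rid, next_rid + 1
                data_table[rid] = rec
                known[rec] = rid
            rlist.append(rid)
        versioning_table[vid] = rlist  # one insert, independent of the table size

    def checkout(vid):
        # Fetch only the records in this version's rlist.
        return [data_table[rid] for rid in versioning_table[vid]]

    commit("v1", [("p1", "p2", 0.9), ("p1", "p3", 0.4)])
    commit("v2", [("p1", "p2", 0.9), ("p1", "p3", 0.4), ("p2", "p3", 0.7)])
    print(checkout("v2"))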
>> Just out of curiosity: you ultimately have a physical design problem here, so it depends not only on how you represent the data but also on which index structures you use. Can you talk a little bit about that? What did you try out? What works and what doesn't?
>> So you mean different indexing structures for the-
>> Because you have a database system underneath everything, so you can take advantage of whatever indexing it offers.
>> Yes. So basically, for example, for the data table we have an index on the record ID, and for the versioning table we also have an index on the version ID. This is to make it faster when we want to do the join between these two tables. For other indexing structures, we basically let users figure out which indexes they want to build on top of the different tables. Yes.
>> You'll find that most records are in most versions. Once a record has been introduced, I would imagine it's going to be in pretty much every version from that point on, with some exceptions, but generally. So it seems like you could have a scheme that says, here are the exceptions to that rule, here is a list of the versions each record is not in, and then when you commit, it will just automatically be in every version. Did you look at anything like that, making assumptions about usage?
>> We didn't make assumptions on the patterns of the user in terms of different versions. So basically, we keep a list of the different versions this particular record is present in. But maybe what you're suggesting is that we could condense this version list to a starting point and an end point and then say, "it is in this range, except maybe one particular version inside this range." We didn't consider that kind of compression of the VList here. Yes.
>> I read that there can be cases where [inaudible], so what do you do with this kind of schema change?
>> Yes. In terms of schema changes, how we handle them is as follows. For example, suppose there is a new version committed with a schema change. If it is only a deletion of some particular column, then we basically have a table recording, for each version, which attributes are inside that version. If it is an addition of columns, then maybe we need to change the schema. There are two different cases: the first is that we can store it as an additional, separate table, and the other is that we may want to alter the original table. Basically we can use the ALTER command inside the database to do that schema change. Yes, there are two different options here.
>> I may have missed this, but what types of queries are you supporting? Do you have-
>> So we basically support Git-style commands, and we can support SQL-like commands. Yes. Okay, so recall that- okay, sorry.
>> Can you go back to the last slide.
>> This one.
>> After this, the next one. Yes, I probably missed the intuition. Why is the split-by-rlist alternative better in performance?
>> Yes, so as we mentioned, we have three different criteria: the first is storage, the second is commit latency, and the last is checkout latency. In terms of storage, we mentioned that we can reduce the storage by combining the identical tuples. Then in terms of commit: when you need to commit a new version, for the first alternative you need to append V4 to every single record in the upper part, doing an array append for each one. For the second alternative, when you want to commit a new version, you only need to insert one new record indicating the version ID and the rlist. So in this way, you can basically reduce the commit time.
>> So is that actually stored as a list, an array?
>> Yes, yes.
>> Or is it like a pick?
>> A list of IDs. So we are using the array data type inside the database.
>> But that row can be very big; is the overall cost still cheap?
>> Yes, yes.
>> Just a clarification: V3 and V4 have the same rlist? Under what condition would that happen? Why does a new version not have new data?
>> Yes, so basically I'm giving a very naive example here. For example, you clone a version; in practice you would then do some modifications, but I didn't show the modifications here.
>> A modification would get a new RID, because you want to preserve the provenance. Or are you going to use the same RID?
>> A new RID. Each record is immutable, and when you do a modification, it will get a new record ID.
>> Right. So it's the same record ID here as before. Shouldn't it be a new RID in the list?
>> Yes. So here, I'm only showing a naive example of cloning one version into another version. If you do some modification, then it will lead to a new RID, and I will append a new record here, in this data table.
>> It seems to me that what you gain in your commit improvement, you would lose on at least a certain type of query profile. If you are looking at a certain RID and you ask for its entire version history, your split-by-vlist would be optimal for that query, but your [inaudible] would be just as expensive as [inaudible].
>> Yes, that's a good question. In the current implementation, we basically prioritize the commit and checkout operators, and for the other operators we are not as concerned. Yes, it depends on the workload, right?
Okay. So for this one, recall that we have the three criteria: one is storage, one is commit time, and one is checkout time. In the following, I will tell you how we can further optimize the checkout time via a partitioning scheme. As we mentioned, the checkout latency is very large because we are wasting time accessing irrelevant records, even with an index: since the records are scattered across the table, hundreds of thousands of random accesses will eventually degrade into a full scan. For example, here if we want to check out version five, even though it only has three records, we are actually accessing the entire table with all of its records.
So how can we eliminate accessing irrelevant records? What we propose is to partition the big table into sub-tables, so that we can reduce the checkout latency, possibly at the cost of some extra storage consumption. Continuing with the previous example, we can partition the big table into two sub-tables. Partition one will contain all the records that are in version three and version four, while partition two will contain all the records that are in versions one, two, and five. Now, when we want to check out version five, we only need to access the smaller table with only four records; but as a trade-off, we are repeating record three twice across these two tables. So we can see there is a trade-off between the storage cost and the checkout time. So we first define the storage cost and the average checkout cost. The storage cost is basically the total number of records across all partitions, and the checkout cost for each version is the size of the partition this version belongs to. The average checkout cost is obtained by averaging across all the different versions.
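As a small sketch of this cost model (my own illustration, directly following the definitions just stated, with made-up record IDs): the storage cost is the total number of records stored across partitions, counting duplicates, and the average checkout cost averages the partition size over all versions.

    # Each partition maps a version ID to the set of record IDs it contains.
    def partition_size(p):
        return len(set().union(*p.values()))

    def storage_cost(partitions):
        # Total records stored; a record shared by two partitions is counted twice.
        return sum(partition_size(p) for p in partitions)

    def avg_checkout_cost(partitions):
        # Checking out a version costs the size of the partition it lives in.
        costs = [partition_size(p) for p in partitions for _ in p]
        return sum(costs) / len(costs)

    # Rough shape of the example from the talk: partition one holds v3 and v4,
    # partition two holds v1, v2, and v5, and record 3 appears in both partitions.
    parts = [
        {"v3": {1, 2, 3}, "v4": {1, 2, 3, 4}},
        {"v1": {3, 5}, "v2": {3, 5, 6}, "v5": {3, 6, 7}},
    ]
    print(storage_cost(parts), avg_checkout_cost(parts))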
So here comes the problem definition: given a storage threshold Gamma, we want to find a partitioning scheme such that the storage consumption is within the Gamma budget, while minimizing the average checkout cost. We proved that this is an NP-hard problem by a reduction from the 3-partition problem, and in the following, I will show an approximation algorithm for it. Before we introduce our proposed algorithm, let us first look at two extreme cases to get some intuition behind the algorithm.
At one extreme, if we store all the versions inside one single table, then the storage cost is minimized, because each distinct tuple appears exactly once. At the other extreme, if we store each version as a separate table, then the average checkout cost is smallest, because we are not accessing any irrelevant records when checking out a version. If we visualize these two in a plot where the x-axis is the storage cost and the y-axis is the average checkout cost, then they lie in two corners, and what we want to do is find the optimal partitioning scheme in between, such that the storage is within the budget Gamma. The high-level intuition here is that we can group the versions that are similar to each other together, so that we only access a small number of irrelevant records when checking out a version, and the storage cost is not increased too much.
Yes, sorry for the figure here.
If we use a traditional clustering algorithm, it can be very time-consuming, because in each iteration we need to access the total number of records, and the number of records can be as large as one billion. Instead, we propose to operate on the version graph: we partition across the version graph, as shown in the demo. The version graph basically encodes the derivation information and also indicates the similarity among different versions. For example, version one here has seven records and version two has eight records, and they have six records in common. In this way, compared to working directly on the total number of records, this is very lightweight, because the number of versions is much smaller than the number of records.
Then how about the effectiveness? As you may have already noticed, if we work on the version graph, then we are actually exploring a restricted subspace. For example, there is no way to partition the versions in this way, because the other versions inside the blue partition are not connected without including version three from the yellow partition. However, remarkably, even with this restricted search space, we are able to prove approximation guarantees in terms of both the smallest storage cost and the smallest checkout cost, which is good. So we show the algorithm below. It is very simple, but it has a very strong guarantee.
We first define the delta-coherent property. Mathematically, what it means is that the current checkout cost is within one over delta times the smallest possible average checkout cost. The implication is that the versions inside such a partition are very similar to each other.
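Restating that property in symbols, with C(P) the average checkout cost of the current partitioning and C* the smallest possible average checkout cost:

    \[ C(P) \;\le\; \frac{1}{\delta}\, C^{*} \]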
Our algorithm runs as follows. We start from the single-partition case, which has the smallest possible storage cost. Then we check whether it satisfies the delta-coherent property. If it does, we can stop, because the storage is the smallest and the current checkout cost is not too far from the smallest possible checkout cost. Otherwise, we split the partition, and so on and so forth. We can prove that this has an approximation guarantee in terms of both the smallest possible average checkout cost and the smallest possible storage cost.
As an example, if delta is 0.25 and the maximum number of splits along the partitioning process is three, then we can guarantee that the current storage cost is within four times the smallest possible storage cost. Yes?
>> I'm just trying to understand the restriction on this search space. Here's a simple example. Say in each version I'm adding and [inaudible]. This is all I do, so I create a chain of versions, but every other version is identical to each other. So all the even versions are the same. Then the best storage would be to store just two versions. So in that case, isn't it delta coherent? I mean, with a restricted search space that excludes such solutions, what are you going to do?
>> Yes. So in that case, basically, in our current partitioning mechanism we only have this constrained search space, which means we are not able to separate the versions into the odd versions and the even versions. In that case, I guess our algorithm will eventually pick some cut point in between, and then group all the versions in the earlier part inside one particular partition and the others inside another partition, right.
>> So is the chain delta coherent, or is it not delta coherent?
>> Whether it is delta coherent is basically based on whether it satisfies this mathematical inequality. In your case, I guess if it is only a short chain, then it will be delta coherent, but if it is a longer chain, then it may not be.
>> The average will be one minus all the other records. So the average checkout cost is going to be half of N, minus one over N, or something like that.
>> Yes, so in your case, you can imagine it this way: if you group an even version with an odd version, then the partition size basically increases a bit, right? Because some records are deleted and some are added back. So basically, the partition size will be roughly the size of an odd version. Then, when you check out a version, if the number of odd versions or the number of even versions in the partition is very large, then it can possibly violate this delta-coherent property. But if you partition them into, say, ten versions inside one group, then it should satisfy the coherent property. Yes.
>> [inaudible]
>> Yes, sure. Yes, then,
we can't guarantee like
the storage cost and
the average checkout cost.
We can further illustrate this using an example here. Initially, we store all the versions inside one partition. Then we check whether it is delta coherent. For this one, it is not, so we snip the edge with the smallest weight, which results in two partitions. We again check whether each is delta coherent or not. If not, we again snip an edge, resulting in three different partitions. Now each partition satisfies the delta-coherent property, so we can terminate.
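A minimal sketch of this splitting procedure (my own illustration of the idea just walked through, not the exact algorithm from the paper; for brevity it computes partition sizes from the record sets directly, whereas the real algorithm works only with version sizes and edge weights on the version graph, and it assumes the version graph is connected and delta is in (0, 1]):

    from collections import defaultdict, deque

    def components(part, edges):
        # Connected components of the versions in `part` under the remaining edges.
        adj = defaultdict(set)
        for (u, v) in edges:
            if u in part and v in part:
                adj[u].add(v)
                adj[v].add(u)
        seen, comps = set(), []
        for start in part:
            if start in seen:
                continue
            comp, queue = set(), deque([start])
            while queue:
                x = queue.popleft()
                if x in comp:
                    continue
                comp.add(x)
                seen.add(x)
                queue.extend(adj[x] - comp)
            comps.append(comp)
        return comps

    def partition_versions(versions, edges, delta):
        # versions: vid -> set of record IDs; edges: (u, v) -> number of shared records.
        live_edges = dict(edges)
        todo, done = [set(versions)], []
        while todo:
            part = todo.pop()
            size = len(set().union(*(versions[v] for v in part)))
            best_avg = sum(len(versions[v]) for v in part) / len(part)
            if size <= best_avg / delta:      # delta-coherent: stop splitting this part
                done.append(part)
                continue
            inside = [e for e in live_edges if e[0] in part and e[1] in part]
            cut = min(inside, key=lambda e: live_edges[e])   # snip the lightest edge
            del live_edges[cut]
            todo.extend(components(part, live_edges))
        return done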
We further compare this graph-based algorithm with two existing clustering-based algorithms, one agglomerative and one k-means, on a versioning benchmark from a previous paper. We first look at the effectiveness, where the x-axis is the storage size and the y-axis is the checkout time. As an example, this point shows a storage size of around 2.3 gigabytes and a checkout time of around 2.9 seconds. We can see that as the storage size increases, the checkout time is reduced, and our proposed algorithm, LyreSplit, dominates the other two algorithms. Of the other two, k-means actually performs better and is quite close to LyreSplit.
Next, in terms of efficiency: on average, we observe three orders of magnitude speedup compared to the two clustering-based algorithms. Recall that k-means has very good effectiveness; but here it sometimes cannot even finish, as shown by the cutoff here. Lastly, we show the benefits of partitioning. Again, the x-axis shows three different datasets; the red bar is the one without partitioning, and the other two are with a storage budget. We can see that as we double the storage budget from four gigabytes to eight gigabytes, we can reduce the checkout time dramatically.
Any question for
the OrpheusDB part? Yes?
>> You talked about checkout time and the space, right? Have you also done experiments on query processing? [inaudible] say a query on the most recent version; how much overhead do all of these add?
>> So basically, we are optimizing for the checkout time and the commit time. If, for example, as in the case you mentioned, you have more workload on the recent version, then that basically means we have a workload skewed toward that checkout cost. Our algorithm can be generalized to the case where we have a workload over different versions. So for example-
>> [inaudible] I'm just saying, I'm running some [inaudible] on my data. If I didn't have any versioning, let's say it takes 10 seconds to run that query. How much overhead does all the bookkeeping you have to do add?
>> Yes, yes, I understand that. What I mean is, if you have a workload of different queries on different versions, then you can basically say, I have these workloads on different versions. In that case, you would materialize the latest version and store it as a separate table. Then, when you issue a query, you can get better performance.
>> Okay. So this is like a materialization strategy.
>> Yes.
>> Based on the workloads.
>> Based on the workloads. Yes.
>> Just building on something somebody else said. It seems like you're geared towards the worst case of data versioning. If you actually use it, you might see something like a concentration of records that keep getting changed: maybe a new record is more likely to change soon, but an older record is probably not changing that often. If you exploit those patterns, or assume that each inter-version change is going to be small, is there something else that would actually work better? Like in videos, you have I-frames, and then you have lots of delta frames, and then you take another I-frame. Would things like that kick in in this context? Or when you applied this to your datasets, what did you see? Did you see any such patterns?
>> I think one possible way is basically to have a chunk structure: inside each chunk, you may want to group the similar records together and then do the optimization. But in this work, we didn't take those patterns into consideration.
>> Can you talk about this whole organization where you store these version IDs as arrays? If you had millions of records, for example, how would you do query processing? Because it's just an array of record IDs, it's very hard for me to imagine, because you're not really using the database any longer with its built-in indexes and capabilities for locating records. Why don't you just use a first-class temporal version database, or a version index, like an interval index or something, which might make it easier to do query processing on the data?
>> So the first thing is, compared to a temporal database, here the version graph is a DAG, so it is not easy to encode the versions into a range; that is the first thing. The second thing is about the list of versions: we are building our system on top of Postgres, and the array data type is supported in Postgres. It is true that the array data type does not have very good optimization. We are currently exploring some directions for how to rewrite queries involving the arrays so that the query execution can be faster. For example, one nice way is that you can unnest the arrays and then do the joins. Another way is that, for joins inside Postgres, you can take one record ID and do a binary search in the array list. So there are different mechanisms to handle the array optimizations. We are currently still exploring how to rewrite the queries so that the optimizer can make the best decisions. Yes. Okay.
>> Yes. Have you considered using any provenance system? Because what you've just described, the versioning, is like lineage information; it's called [inaudible] provenance. The denormalization you're talking about is basically a denormalized representation of the provenance graph, and that includes several other things. So I was wondering if you're considering any provenance [inaudible].
>> Yes. We didn't consider provenance systems yet, but I guess, actually for this one, you can see it is not so much about provenance. Basically, you have to check the version graph, and then for records that have a primary key, you can possibly check the lineage of a particular record. In addition to that, we are only maintaining some metadata about these particular versions. So I guess it's not very much provenance information.
>> I was thinking about lineage, which is fine-grained, [inaudible] at the data level, so not coarse-grained provenance, but fine-grained provenance.
>> I see.
>> So you have a version; the commit that creates it in your version graph, that's the coarse-grained information. The fine-grained information is which record of the input went to which record of the output. As you're creating that array, [inaudible] is going to be the forward provenance and the [inaudible] list is going to be the backward provenance.
>> I guess it is because of the design of our system: we allow external tools to process the data, so we are actually not able to track the operations that are done inside one particular version. If we could capture the different operators and how the records evolve, then maybe we could use provenance systems. But in this one, for example, as I showed in the demo, we are allowing the users to use external tools to handle the data, so we are not able to capture all the provenance information. But it is a good direction, as I will mention later: if we can capture the code and the workflow and also the data.
>> [inaudible] to avoid all these joins from denormalization, you really don't want to do that kind of pre-processing, tracking the lineage of records and so on. Anyway [inaudible].
>> To add a question to that: at the beginning you talked about the biology lab, and they have all these versions of the data. You've really tuned your system for assuming that people are making changes to the data, like adding tuples, removing tuples, manipulating tuples. But some of the examples you gave on that first slide were things like normalization, where I assume maybe they've normalized some whole column or something. So if someone normalizes a column, it's going to double the database, right? It's going to add a whole new set of tuples.
>> That's a good question. So it depends on how we compute the dataset differences. Actually, for the normalization case, you can consider it as adding a new column: you have two different columns, and then you are basically doing a schema change, in our case.
>> So how do you handle a new column? I guess that would be my question. If someone wanted to add a column, does that double the size of everything? Now you have tuples with or without the column, or can you mark that column as being there for some versions of a tuple and not for others?
>> Yes. So basically, as Byron just asked, in order to handle the schema change, if it is a newly added column, then first of all we have a table recording all the attributes that are inside each particular version. Then, if there is a new column added, we can do two different things. The first is that you can choose to create a separate table, and the other is that you can alter the original table. Then you basically use the database commands, like ALTER TABLE ADD COLUMN, to handle the schema change.
>> Sorry, I think I missed it. If I have just one entry and now I add a column, does it become two entries, one [inaudible] entry and one with the value in the new column, each one versioned in your version table? Or do you have a way of having one entry in the version table that says, one version has this column and the other version doesn't have this column?
>> Yes. So you will have one entry, and then you will have the table recording which attributes are inside this version.
>> I see, okay, cool, thank you.
>> So you didn't talk about that part in this talk, because here we just have row IDs.
>> Yes. Just for simplicity, I only showed the record-level changes; for schema changes, you can refer to the paper, where we talk about it.
Okay, then I will go to the second part. We mentioned that OrpheusDB makes several restrictive assumptions; here we want to relax some of them to make the approach more general. First, I have worked on generalizing the storage representation from structured data to more general data formats. We also try to allow a more general query language that goes beyond SQL. I also worked on a setting where the user is not starting from scratch; instead, they want to bootstrap from versioning work that has already been done. In this talk, I will simply focus on the generalized storage representation. This is joint work with [inaudible], Amit, and Amol from Maryland, as well as my advisor Aditya.
As we mentioned, different from OrpheusDB, here we want to deal with different data formats, ranging from structured and semi-structured to unstructured data. One thing in common with OrpheusDB is that we need to balance between the storage and the checkout costs. So one natural question is: can we just use the mechanism inside OrpheusDB? The problem here is that we don't know what each individual record is if we are handling unstructured datasets. All we have access to is some black-box differencing mechanism that can give us the delta between versions. So this is the assumption we are starting with, and this assumption is general enough to handle different data formats.
Therefore, we are using a delta-based approach. What this means is that we can either store a version in its entirety, that is, materialize it, or store it as the modification from an existing version. So what is a delta here? The delta can be in different formats: for example, we can represent it using a Unix diff, or using a SQL script; for some particular data types we can even use XOR.
how we can represent
different versions using
this delta-based approach.
So for example, we can
draft three versions here,
and we can materialize
version one and store
the other two versions in
a linear train as the modification
from its previous versions.
In this way, the storage cost is
basically 100 plus 30 plus 10.
So 30 here means the delta
between the version
one and version two.
When we want to check out
a particular version, for example,
if we want to check out version
one because it is materialized,
so the checkout cost will
simply be 100 megabyte.
For the second version
and the third version,
we basically need to start
from the materialized version,
and then trace all the way down
to this particular version.
>> Talking about the access cost: you only care about the amount of storage that you have to read; you don't talk about the cost of actually applying the delta.
>> That's a good question. For simplicity, I only show the case where the storage cost is the same as the recreation cost, but actually the storage cost can be different from the recreation cost. We have that in the paper, but I didn't show it here.
Alternatively, we can materialize version one and then store each of the other two versions directly as a modification from version one; we can calculate the storage cost as well as the total recreation cost accordingly. Or we can materialize version two instead of version one, and then store the other two versions as modifications from version two. In this way, we get a smaller storage cost and also a smaller total checkout cost. So we can see there are two criteria here: one is the storage cost and one is the checkout cost. In the following, I will simply focus on the case where we have a constraint on the storage size and we want to minimize the average checkout cost; this is the same setting as in OrpheusDB.
Next, I will show some of our proposed algorithms. We first map this problem definition into a graph setting. Here we have four versions, and the [inaudible] here basically represents the delta between versions. We first introduce a null version, and we connect this null version to all the other versions. Here, 25 basically means the storage cost when version one is materialized.
So in this one,
what we are saying
is we are basically
materialize version
one and version two,
and we are storing version three
as the modification from version
one and storing version four as
modification of version two.
So we can see the total
storage cost is basically
the summation of all the actuates
inside this graph.
The checkout cost, for example,
if we want to recreate version three,
we basically need to start
from this null version and
walk along the path from this
null version to the version three.
Also, we can observe
that this is a tree.
The storage graph must
be a tree because
in order to recreate one version,
we only need one path from
the null version to
this particular version.
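As a small worked sketch of these two costs on a storage tree (illustrative names and weights, apart from the 25 just mentioned): each version points at the version it is stored as a delta from, or at the null version if it is materialized, and the weight on that edge is the size of the delta, or of the full version.

    # parent[v]: the version v is stored as a delta from (None = the null version).
    # weight[v]: size of that delta, or the full size of v if it is materialized.
    parent = {"v1": None, "v2": None, "v3": "v1", "v4": "v2"}
    weight = {"v1": 25, "v2": 30, "v3": 10, "v4": 12}

    def storage_cost(weight):
        # Total storage = sum of all edge weights in the storage tree.
        return sum(weight.values())

    def recreation_cost(v, parent, weight):
        # Checkout cost of v = total weight on the path from the null version to v.
        cost = 0
        while v is not None:
            cost += weight[v]
            v = parent[v]
        return cost

    print(storage_cost(weight))                    # 25 + 30 + 10 + 12
    print(recreation_cost("v3", parent, weight))   # 25 + 10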
With these observations, we can dive into some of the problems. The first one is: if we only care about the storage size and there is no constraint on the recreation cost, then this can basically be reduced to a minimum cost arborescence problem, because the storage size is basically the sum of the edge weights inside the tree. On the other hand, if we want to minimize the recreation cost and there is no constraint on the storage cost, then this is actually a shortest-path tree problem, because the checkout cost for one version is basically the path weight from the null version to that particular version. So what we want to do is balance between these two trees: we want to find a spanning tree that is in between the minimum cost arborescence and the shortest-path tree.
So we propose a local-move greedy algorithm, where we start from the minimum cost arborescence, which has the smallest storage cost. Then we iteratively add more edges, or deltas. In each iteration, we pick the edge that has the largest ratio between the reduction in the recreation cost and the increase in the storage cost. Let's look at one example. This is a spanning tree in an intermediate step: basically, we have materialized version one and version two, and we store all other versions as modifications from version one and version two. We can calculate the ratio for each dotted line, and we find that the edge between the null version and version four has the largest ratio. So we pick this edge and remove the edge from version one to version four. What this essentially means is that instead of storing version four as the delta from version one, we materialize version four instead.
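A minimal sketch of one such local-move step, assuming the current tree and the candidate deltas are given (my own illustration, not the exact implementation; for brevity the recreation saving is counted for the rewired version only, whereas the full algorithm also credits its descendants, and no cycle check is done):

    def path_cost(v, parent, weight):
        # Recreation cost of v: sum of weights along the path from the null version.
        cost = 0
        while v is not None:
            cost += weight[v]
            v = parent[v]
        return cost

    def local_move_step(parent, weight, candidates, budget):
        # candidates: (src, dst) -> delta size, with src=None meaning "materialize dst".
        storage = sum(weight.values())
        best, best_ratio = None, 0.0
        for (src, dst), w in candidates.items():
            if parent[dst] == src:
                continue  # dst is already stored this way
            increase = w - weight[dst]  # the new edge replaces dst's current edge
            src_cost = 0 if src is None else path_cost(src, parent, weight)
            saved = path_cost(dst, parent, weight) - (src_cost + w)
            if increase <= 0 or saved <= 0 or storage + increase > budget:
                continue
            if saved / increase > best_ratio:
                best, best_ratio = (src, dst, w), saved / increase
        if best is not None:
            src, dst, w = best
            parent[dst], weight[dst] = src, w  # rewire dst to its cheaper source
        return best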
We compare this algorithm with some other heuristics. One is a modified Prim's algorithm; Prim's algorithm is the famous algorithm for computing a minimum spanning tree. The second one is from the delay-constrained scheduling literature, and the third one is a Git repack heuristic. Again, the x-axis is the storage cost and the y-axis is the total recreation cost. We can see that as the storage cost increases, the recreation cost is reduced. The two dotted lines mark the smallest possible recreation cost and the smallest possible storage cost. We can see that with a small increase in the storage budget, we can reduce the recreation cost dramatically.
In addition to this problem definition, we also consider different cases, varying the optimization goal and the constraint. Also, as Christian mentioned, we consider the case where the delta, the storage cost, is different from the phi, the recreation cost. We consider those different cases as well.
>> Do you restrict the storage graph [inaudible] to be a tree in all these formulations?
>> Yes.
>> How come that is still optimal? I'm not sure that's the case; [inaudible] could you get a better formulation?
>> So basically, the question is whether the storage graph should be a tree, right? There are cases where you may imagine we don't want a tree, that is, merges: if you merge two versions into one version, then you may want a DAG instead of a tree. But yes, in this problem definition, we are constraining it so that one version is stored as the modification of only one other version instead of multiple versions, right.
>> All right. Also, maybe it's possible to add an edge to make it a non-tree, but still reduce the reconstruction cost for a particular version.
>> So you can think of it this way: if there is no merge, then a tree must be the optimal solution, right?
>> I'm not sure.
>> Because you can think of it this way. First, in terms of the storage cost, if you delete one edge, then the storage cost must be reduced. Second, in terms of the recreation cost, if you can achieve a smaller recreation cost for a particular version using one edge instead of another one, then basically you can remove the other edge, because there is no need to keep it.
>> And that edge will not actually be used for other versions?
>> Yes, it will not be used, because it is like a shortest-path problem.
>> Yes.
>> Otherwise it is sub-optimal: if this version has the optimal recreation cost, then its descendants must also be getting their optimal cost through this particular version. Yes.
>> Yes, because of that larger formulation than a tree. Sorry, I'm asking whether there are any other formulations or combinations here where that might not hold?
>> So as I mentioned, the only exception is if you consider merges; as long as each version is stored as the modification from one particular version, a tree must be the optimal structure. Yes, for all the problem definitions. Yes.
>> I'm sorry to interrupt. We have
to give up the room
a little bit early.
So how about this, let
[inaudible] finish,
hold your questions until the end and
whatever time we have at the
end we'll use for questions.
Sorry, but I want to
make sure she can finish
the talk. Sorry about that.
>> Yes. Thank you. So we have talked about OrpheusDB and also the generalized storage representation. The overall takeaway here is that with only a lightweight layer on top of a traditional database, we can enable effective data versioning and enable data scientists to reason across different versions. The second takeaway is that with very simple but very efficient algorithms, we can improve the performance a lot: in both the OrpheusDB part and the generalized storage representation part, we observed that with a small increase in the storage, we can reduce the checkout time dramatically. The third thing is that different settings may require different algorithms. For example, for OrpheusDB we use record-level de-duplication, but for the generalized storage representation, we use a delta-based approach.
In the following, I will briefly talk about some of the other works I have been doing during the course of my PhD. I have also been involved in several interactive data exploration projects, where we want fast response times for different kinds of queries. In particular, I studied two different visual queries, which are part of the zenvisage project inside our group. For the first one, we want to identify the features that can explain the separability between two different sets. The second one is, given a histogram, we want to find the matching histograms in a fast manner. In addition to the visual queries, I also worked on aggregate queries during my internship here in the DMX group. In addition, I worked on automatic data extraction from logs, where we want to find patterns inside the log datasets and extract structured formats from the unstructured logs.
I also worked on AutoML for large-scale datasets, where we basically draw the intuition from approximate query processing and apply it to AutoML. The last one is from my internship at Google, where we want to do table synthesis and table compression, and where we try to apply deep learning techniques to do attribute ranking.
Last, I will conclude with some future work that can potentially be done. The first thing is that, in addition to data versioning, we also have model versioning during the iterative process. Can we build tools to help data scientists inspect models and track how a model evolves in an efficient manner, such that, for example, users can ask queries like: on which data slice does this model perform worst? Or, why does this model degrade as time evolves? The goal here is to quickly surface the answers to these kinds of queries, instead of letting users examine everything in a brute-force way.
Also, in addition to data versioning, we may want to build a management system for the end-to-end data science life cycle. We want to include not only the data, but also the code as well as the different models, such that we can capture different kinds of information, or, as you mentioned, the different provenance information, so that data scientists can better reason about the workflow, the data, and the models. More broadly, I'm also interested in developing tools to make data science simpler and more efficient in the different stages of the whole life cycle. Yes. Right. That's pretty much it. Thank you.
>> Let's thank the speaker. Now you can ask questions and [inaudible].
>> Yes.
>> Yes. So I have some clarification questions about the black-box delta work. One thing is that you mentioned, when you grow the graph, you start from the null version and branch out to the other versions. So if the deltas along the path from the original data add up to more than the full size of a version, would it make sense to add another edge, so that you can start from multiple versions and grow all of them?
>> So there is a triangle inequality: if you materialize this version, then it will have the smallest cost. If you start with some other existing version and then use the deltas to trace the path from the materialized version to this particular version, then the cost must be bigger than the cost when you materialize this version directly.
>> So that can help: it may be better to materialize a version instead of adding a number of different deltas.
>> What do you mean by a number of different versions?
>> You have a version which is built from version one plus the delta to version two, and then another delta, as you said.
>> Yes.
>> If the cost of materializing version two is cheaper than those steps, would it make sense to just materialize it directly?
>> Yes, but there is a constraint, the storage constraint. Because of the constraint on the storage cost, you may not be able to materialize all the versions, so you can only store the delta instead of materializing a particular version. So basically, inside this problem, there are two parts. The first part is which versions should be materialized, and then, for the others that are not materialized, which delta should I store, that is, from which version should I store the delta. Those are the two major components of this problem.
>> The other question is that, in the example you mentioned, where version two can be version one plus some processing, do you assume you actually have an older version that grows over time into a newer version?
>> Yes, the older version evolves, and that will result in a new version. So basically, you have an old version and you have a new version, and then, if this old version evolves, it will result in a new version.
>> Because I was thinking about whether this is more of an online balancing problem, where you incrementally restructure. How does this interact with the process of creating your versions? Because it seems like, during the process of creating a new version, after any new version, you actually have a limited choice of how to materialize this new version, right? So it's not like you have the full freedom to actually compute a new graph with all the versions.
>> Yes. So basically, the question is whether it is a repack problem or an online maintenance problem, right? For example, in the previous part, we consider online maintenance, and then we can do the repack periodically. So basically, as you mentioned, you only have fairly restricted choices, and then you can use a sub-optimal solution and try to repack periodically. It is the same for this one: you are not forced to make all these decisions immediately, like storing a version as a modification; if you look at the algorithm, you can think of the current tree as an intermediate step, and then you can repack periodically, yes.
>> [inaudible]
the structure of the data.
>> And then reorganize this tree into a more optimized one.
>> As you are adding things, isn't it the case that you might also consider the cost of migration while restructuring?
>> Yes. Yes?
>> So to do these algorithms, do you need the deltas between all pairs of versions already computed for you?
>> Yes, that's also a good question. Currently, in our experiments, we are basically calculating deltas within 10 hops instead of between all pairs. But it depends on how good you want the result to be: there is a trade-off in calculating the deltas between more pairs of versions. So in our experiments, we constrained it to 10 hops.
>> So to do that calculation, do you materialize every version and compute all the deltas to all these materialized versions? Or, when you commit, do you compute all the deltas at that moment?
>> Yes. So basically, you can choose to calculate after you commit. Because when you commit, you may want it to be very fast, you can compute the delta between this version and one particular version and then commit it immediately. Then, after the commit process, you can do the repack, and then you can try to compute more deltas, and then try to do the repack, yes.
>> And do you ever, is it possible to compute a delta from two deltas? Like, you have two deltas, and obviously you could just concatenate them, but is there any assumption, or any advantage, if you could compute a new delta based on these two deltas that might be smaller?
>> So, as I mentioned, there is a triangle inequality among these three versions, but you cannot say exactly what this combined delta will be, right.
>> And there's no access to say, I have two sequential deltas, give me a new delta? It would actually have to materialize it and recompute the delta?
>> Yes.
>> Okay.
>> Yes.
>> Coming back to the overall storage: when you compare OrpheusDB with simply storing everything [inaudible], what do I get? Why would I use it? I'm going to it for better storage, but for the checkout time, what's the high-order benefit?
>> So for OrpheusDB, the starting point, as I showed, was the interview with the MIT group. The starting point is this one: when we talked to some of the computational biology groups, they said that they are wasting a lot of time and a lot of money on storage.
>> To get this clear.
>> Yes.
>> What is the OrpheusDB counterpart? What will we get-
>> So the first thing is, as I mentioned, you can have compact storage. The second thing is you can-
>> [inaudible]
>> So as I mentioned, you can set a budget for the storage consumption. You can set the budget to two times the smallest possible storage, so it depends on how you set the budget. Then, the second thing is, because you have lots of metadata recorded, you know the version graph. So when you want to do some analysis, you can do it based on the information that is maintained inside the system. The other thing is analysis across different versions: for example, in the data science scenario, you may want to see which version has the highest accuracy on some particular data slice, like in the US. Then you can easily issue SQL queries to answer these. So basically, you can think of it as enabling the versioning capability, plus analytics across different versions, checking out versions, and identifying versions that satisfy some certain property, while still having a very compact storage.
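To illustrate the kind of cross-version analytics mentioned here, the snippet below runs a hypothetical query against an in-memory SQLite database. The table and column names (version_metrics, slice, accuracy) are made up for this example and are not OrpheusDB's actual schema; the point is just that once version membership and metadata live in relational tables, "which version has the highest accuracy on the US slice?" is a single SQL query.

    # Hypothetical example of versioned analytics (schema names are illustrative).
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE version_metrics (version_id INTEGER, slice TEXT, accuracy REAL);
        INSERT INTO version_metrics VALUES (1, 'US', 0.81), (2, 'US', 0.86), (2, 'EU', 0.79);
    """)
    best = conn.execute("""
        SELECT version_id, accuracy FROM version_metrics
        WHERE slice = 'US' ORDER BY accuracy DESC LIMIT 1
    """).fetchone()
    print(best)   # -> (2, 0.86)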
>> Do you have an idea, for instance, for this dataset? You mentioned that it is 100,000 per year to curate this dataset, so it will obviously reduce it to a [inaudible].
>> We didn't run that particular experiment on this one, but it basically depends on how many versions there are and how similar they are to each other. So for example, here there are 20 to 30 students, and each time they do some analysis they make a private copy. Obviously, that can be reduced, right.
>> I think it would be interesting to compare to how Git does, right?
>> How Git does, right?
>> Yes.
>> So yes, Git also uses a delta-based approach. For example, in this one we only consider 100 versions. This is the Linux fork dataset. We can see that Git will take around 200 megabytes, while ours, just in terms of the storage cost, not in terms of the recreation cost, is around 116 megabytes, right.
>> That's very interesting. Do you have any more feedback from other users of your system on where it's useful and what they encountered?
>> Sure. So basically, we got the biology teams at our university to try it. Some of them found the querying capabilities very useful, while some of them don't want to use the system, or are reluctant to use it, because they need to change their behavior: every time they generate a new version, they need to commit and check out the version. Another thing is, as I mentioned, we only handle the dataset versioning. It's not a whole workflow covering both the data and the code and the models, so they feel it is very restricted. It cannot capture all the meta information unless the user manually types the meta information into the system, and they don't want to manually type in all the meta information. They want an automatic way to have all the different information captured.
Yes. Also, we got contacted by, I think, the Janelia Research Campus in Virginia, and what they want to do is neuroscience annotation. They basically first use deep learning to annotate the figures, and then they need to do human proofreading of each problematic area. They said it would be very useful for them to keep different versions and to support the advanced querying capabilities. Yes.
>> So let's pretend that this was in production in some kind of enterprise environment that was holding PII, and someone issued a GDPR request that required backing out some information. I imagine the way that would work would probably be by swapping out immutable versions of records with the backed-out information masked. Would there still be the capability to recognize that, between version one and version two of some particular record, a particular field had been changed, even though the actual values in those fields might no longer be present in the canonical version of the records that have been stored? I mean, how do you envision supporting something like a GDPR request for a system like this?
>> So I'm not very familiar with GDPR requests [inaudible].
>> Depending on where the data is stored, there is a law, which a lot of large organizations have adopted, that requires that anyone holding personal data has the capability to back that data out if the people it is about request it. So if you are living in France, you could contact them and say, "I want you to flush everything about me out of your system." So if someone wanted to have their data cleared out of all versions of some record, how would that work? I imagine that's not supported now, but do you have any ideas how the system might be augmented to support something like that?
>> I see. So basically, if we want to flush a particular [inaudible] out, then I guess you can simply find those records and remove them, and then accordingly update the versioning information in the version table. Yes.
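Sketching that answer in code, under an assumed (hypothetical) layout with a data table keyed by record id and a version_records table mapping each version to the record ids it contains; a real deployment would also need to rewrite any stored deltas that still carry the subject's values.

    # Sketch of a "forget this subject" request (schema names are hypothetical).
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE data (rid INTEGER PRIMARY KEY, subject TEXT, payload TEXT);
        CREATE TABLE version_records (version_id INTEGER, rid INTEGER);
        INSERT INTO data VALUES (1, 'alice', 'x'), (2, 'bob', 'y');
        INSERT INTO version_records VALUES (1, 1), (1, 2), (2, 1), (2, 2);
    """)

    def forget_subject(conn, subject):
        # Find the subject's records, drop them from every version's membership,
        # then delete the records themselves.
        rids = [r for (r,) in conn.execute("SELECT rid FROM data WHERE subject = ?", (subject,))]
        conn.executemany("DELETE FROM version_records WHERE rid = ?", [(r,) for r in rids])
        conn.executemany("DELETE FROM data WHERE rid = ?", [(r,) for r in rids])
        conn.commit()

    forget_subject(conn, "alice")
    print(conn.execute("SELECT * FROM version_records").fetchall())  # only bob's rid remains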
>> I have a question.
>> Okay.
>> So do you also compare with the offline optimum in your paper, or was it too expensive to compute the offline optimum because you're using [inaudible]?
>> We do compare. I didn't show it here, but in the paper we formulate it as an Integer Linear Programming problem. Then we compare the greedy algorithm with the Integer Linear Programming method. For that one we are using some solver software, and it can only scale up to a very small size, so we compare the optimal solution with our algorithm at those sizes. Our algorithm actually performs very well.
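For reference, a schematic version of that Integer Linear Program (a sketch of the storage-versus-recreation-cost trade-off, not necessarily the exact formulation in the paper): let $x_{uv} \in \{0,1\}$ indicate that version $v$ is stored as a delta from $u$, with a dummy root $u_0$ so that $x_{u_0 v} = 1$ means $v$ is fully materialized; let $s_{uv}$ be the corresponding storage cost and $r_v$ the recreation cost of $v$ implied by the chosen tree.

    \begin{align*}
    \min \quad & \sum_{v} r_v \\
    \text{s.t.} \quad & \sum_{(u,v)} s_{uv}\, x_{uv} \le \mathcal{B} && \text{(storage budget)} \\
    & \sum_{u} x_{uv} = 1 \quad \forall v && \text{(each version stored exactly once)} \\
    & \{(u,v) : x_{uv} = 1\} \text{ forms a tree rooted at } u_0, \qquad x_{uv} \in \{0,1\}.
    \end{align*}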
>> There was another question following up on that. So the starting point, you said, was 10-hop information about the deltas, right? For the second group, you said you need 10-hop information [inaudible]. Did you vary that in the experiments? If you had only five-hop information versus 10-hop information, [inaudible] information there. Theoretically I know [inaudible]. So do you have an experiment like that?
>> We didn't conduct experiments with the number of hops set to five. In all the experiments, we basically set it at 10. Yes, if the number of deltas is smaller, then obviously it will affect the algorithm, because you will have a smaller, more constrained search space.
>> I'm just curious whether, empirically, five-hop information is enough to give you the same results, or whether an incremental approach is possible where you start with less information, run the algorithm, and then see if you need more.
>> Right. That's a valid point; that is something we can try.
>> Yes, possibly, if you can make use of the triangle inequality as a bound for the deltas that you haven't computed, [inaudible].
>> Right. That is a valid one.
>> So I think how you merge a collaborative effort on a dataset, merge would actually be a very essential functionality. But over the course of the talk, you didn't talk about merging. Is that in your paper, or have you not gotten to merge yet?
>> Yes, we do have merge as a functionality. So currently what we are doing is basically, we trace back to the common ancestor of these two versions. Then we figure out whether there is a conflict in terms of each record. So when you merge two versions, you will specify V1 and V2. If we detect there is a conflict, we first report to the user that there is a conflict; if they still want to do the merge, then we simply use V2 to override the record inside V1. But we have some ongoing work on how to do the conflict resolution in this space.
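A minimal sketch of the record-level merge just described, assuming versions are dictionaries keyed by primary key; the function and variable names are illustrative, not the actual implementation. Records changed on only one branch are taken from that branch; a record changed on both branches is reported as a conflict and resolved by letting V2 win.

    # Illustrative record-level three-way merge against the common ancestor.
    def merge(ancestor, v1, v2):
        merged, conflicts = {}, []
        for key in set(ancestor) | set(v1) | set(v2):
            a, r1, r2 = ancestor.get(key), v1.get(key), v2.get(key)
            if r1 == r2:                      # same on both branches (or deleted in both)
                value = r1
            elif r1 == a:                     # only V2 changed it
                value = r2
            elif r2 == a:                     # only V1 changed it
                value = r1
            else:                             # both changed it: conflict, V2 wins
                conflicts.append(key)
                value = r2
            if value is not None:
                merged[key] = value
        return merged, conflicts

    ancestor = {("p1", "p2"): 0.5}
    v1 = {("p1", "p2"): 0.7}
    v2 = {("p1", "p2"): 0.9, ("p1", "p3"): 0.4}
    print(merge(ancestor, v1, v2))            # conflict on ("p1","p2"), resolved to 0.9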
>> Well, if you [inaudible].
>> Yes.
Exactly, right.
>> [inaudible].
>> All right. No more takers
then [inaudible].
>> Thank you.
