 
 
2
00:00:00,500 --> 00:00:01,620
Hello, my name is Eirik.
Thanks for the introduction.
As Matt said, I was once a
grad student for, I think,
the longest possible
respectable time
you can be at CSAIL
without having to graduate.
Professor David
Karger was my advisor,
and I've been working on
the same project for all
of these eight years.
Now I've continued it now as
this one-man company funded
by an NSF grant.
And I will tell you
about this project.
So I am particularly interested
in this very special class
of business applications.
It's this third panel
in the comic here.
It's the kind of
business applications
that look like this.
This seems like an
exaggeration of something.
But the reason I'm
interested in this
is actually before college,
my first job, actually
my Norwegian military service,
I declared myself pacifist
and got a civil service one-year
job at a public music school
in Norway, where I come from.
And I was an end user of a
software that looked like this.
It's a system specially
designed for managing
public Norwegian music schools.
And what I discovered, as I
later became a real software
engineer, is this is just
one of very many applications
that look exactly the same.
This is like from a system
for managing a dental clinic.
This is actually a system
built by MIT called
Coeus built over 10 years.
They spend about $6 million
building this system
for managing grant proposals.
This is from like the
movie The Bourne Ultimatum.
JIRA you know, SAP--
this is again Coeus, MIT.
So there are these applications.
They all look like this.
And what is this?
They all have the
same user interface.
They are user interfaces
for relational databases.
So you have some relational
database, this technology
we've had since the
late 1970s or '60s
even, for storing data,
storing and processing data.
And then there are all these
user interfaces on top of them.
And they have tables.
They have forms.
They have reports, basically,
lots of different views on top
of the data in the database.
So what's the problem?
The problem is that each
business, or each organization,
or each department
within a business
or project within
an organization
needs a different
database schema.
So the database schema is the
particular data structure,
the business-specific
structure, that you
need to solve the
particular kind
or track a particular
kind of data.
Whether you're a hospital, or
a professional staffing firm,
or a public Norwegian music
school, or the military,
you're going to need a
different database schema
and a different user interface.
So a programmer actually
has to go and hand-code
each one of these screens,
or pages, web pages,
or dialog boxes,
or whatnot, just
to be able to use a
relational database.
And it takes a year or more
just to make one of these apps.
So what people do
instead is they
dump all their data into
Excel, into the spreadsheet.
And that is often a good
solution in some cases.
If you have a
database that consists
of a single table of data,
and you have a single user,
a spreadsheet is
actually pretty good.
The problem is when you are
looking at relational database
use cases, you
typically have something
slightly more complicated.
You have many tables of data.
You have many different
kinds of business entities,
and they're related
through what we
call one-to-many relationships.
Like in the music
school, each student
can have one or two parents.
Each student can be attending
one or more lessons.
Each lesson usually
has one teacher,
but each teacher can
teach multiple lessons.
Each student can have
many rental instruments.
Those are called
one-to-many relationships,
and they're all in every
relational database.
So how do you represent
this in a spreadsheet?
It's not very easy.
It breaks down.
Spreadsheets break
down when you try
to do more complicated
database tasks on them.
So I'm going to talk
about a solution.
And by the way, don't hold
your questions to the end.
Actually, as Matt said,
you can type questions
into the chat box later.
We will probably open
up the audio, too.
 
 
101
00:05:21,440 --> 00:05:25,760
And I love to get questions
as I start doing this demo.
 
 
103
00:05:29,220 --> 00:05:32,120
So what I'm going
to show you today--
it actually comes out of
three research papers.
The most recent one is one
I wrote with David Karger
for the SIGMOD conference.
So those who don't
know SIGMOD, it's
one of the big
database conferences
for database researchers.
And the other one
came from InfoVis.
It's actually in
reverse order here.
Another conference is the
Information Visualization
conference.
Before that, I did a paper
on an editing user interface.
 
 
116
00:06:07,160 --> 00:06:09,230
That's for the CHI conference.
I'm going to talk about
and spend most of the time
on this latest paper, which
is the visual query system.
It's been improved over
the last couple of years.
But the gist of it is
still in the SIGMOD paper.
And so to motivate
this system separately
from the whole
overall replace all
of these custom-made database
applications motivation,
there is a separate
motivation here for just
this visual query system.
What do I mean by a
visual query system?
I mean-- well, suppose you have
a database already, POSTGRES,
MySQL, SQL Server,
Oracle, and you
want to work with it somehow.
So the sort of default
way to interact
with a relational database
is to write SQL queries,
and it's not very user friendly.
We'd like to make
something easier.
Here's an example
of a SQL query you
would use to pull some data
out of a couple of tables
in the database and
do some calculations.
So how can we do something
simpler than that?
So I actually like Excel.
I'm an Excel fan.
I have been using Excel
since high school.
I love Excel, and there
are about a billion users
of Microsoft Excel in the world.
 
 
144
00:07:35,870 --> 00:07:44,910
So this is a big
resource to draw from.
A billion users know how
to use Microsoft Excel,
and I posit that any usable
general-purpose interface
for relational databases-- it
has to somehow look like Excel,
look and feel like Excel, draw
on some of the good design
decisions that made
the modern spreadsheet.
Hey, Eirik?
So we're going to make
something, a user interface,
that looks and feels
like Microsoft Excel
but that can actually let the
user do arbitrary SQL queries.
Eirik?
Yes?
I have a quick question.
Yes?
Is it a wise decision to move to
a NoSQL MongoDB for Big Data AI
application versus a
relational database?
Ah, so this is going
to be interesting.
Because if you think about
what you can do in MongoDB,
for instance, that you cannot
do with a relational database is
you can store JSON, documents,
essentially, hierarchical data.
As it turns out, this is very
important for presentation
purposes.
And I will show you this.
When you are looking at
data, it is very useful
to be able to look at the
data in a hierarchical form.
But in this system,
we're going to assume
that the inputs are actually
flat, flat tables of data,
CSV files, whatever.
And we're going to produce
hierarchical data, JSON data,
just for presentation purposes.
So actually, this system
assumes that, no, actually,
the relational database is all
we need, that it's actually
a very good data model
for storing the data,
the relational data model.
But for presentation
purposes, we're
going to do something cool.
And that is-- so this is a
screenshot of the system.
This is the core part of
the visual query system that
exists in this Ultorg system.
And it looks sort of like
a spreadsheet in the sense
that it has cells,
and it has a cursor
with cells in the middle.
And you see data on the screen,
and it has a formula bar.
But it uses a visualization that
is not your standard tabular
representation that you see in
Excel with rows and columns.
You actually have
hierarchical data model
for visualization purposes.
So you see here in
this example, I've
exaggerated the separator
lines here a little bit
just to show the structure.
Here is a course catalog,
where we have Courses.
Each Course course, has
a number of Readings.
Each course has a
number of Sections.
Each Section, like a
lecture or a precept,
has the number of Meetings.
Each Meeting has a
number of Instructors.
So that-- and we're going to
see that in the live demo, too.
And that's going to give the
system some special powers,
which allows it to visualize
SQL queries because it turns out
SQL queries.
They're also hierarchical
in their structure.
It's just that their result you
usually see as a table of data.
So in this system,
everything we do
is going to be done by
interacting directly
with the data somehow.
 
 
210
00:11:17,210 --> 00:11:21,930
And we're going to have
a couple of pop-up menus.
This is basically the
entire user interface.
And from that, and
in the SIGMOD paper,
we actually show you're going to
be able to construct arbitrary
SQL queries.
I'll get back to that.
So this is the system.
You start out with this
relational database.
So this is a flat SQL
database, not a NoSQL database,
but an actual flat database--
sorry, a database
of flat tables,
relational, with tables
like Courses, Departments--
because this is like an example
course catalog, academic course
catalog kind of example
database-- instructors.
And the first thing I
do is just to show you
that suppose I start
with the query that
is based on this Courses table.
And in fact, I don't
need any other part
of this user interface.
I just need this
central part here.
 
 
231
00:12:19,120 --> 00:12:24,580
The first thing I can do is
to hide and show some columns.
That's sort of your
standard Select query in SQL
and very easy to understand
how you would do that.
But then the interesting
thing happens
when there are other
tables in the database that
are related to this one.
So for instance,
there is a table
called distribution areas, which
is related to the course table.
So in fact, if I
show some columns
here from this
other table, you're
going to see that the course
table has a column called
Distribution Area,
which is joined
to a field in this other table
called Distribution Areas.
So this system
makes it very easy
to join in data from
this other table.
Now, in this case,
we have what is
known as a many-to-one
relationship, which
means that each course has
exactly one distribution area.
But in another case,
another table here,
we have a Reading list,
which is the opposite.
There's a one-to-many.
So each course has a
number of Readings,
and the system
makes it very easy
to introduce another Join
into the table here--
into the query, sorry.
So here we have what is
called a Join in SQL language.
We have joined the Courses
table against Readings,
and I can introduce
another table.
I can introduce the
one called Sections.
So a Section is like a
seminar, lecture, precept.
And each Section, in
turn, has a number
of Meetings and Instructors.
So now if I zoom out a
little bit-- or actually,
I will hide a field here,
show a couple more fields--
all right, so here we
have a list of Courses.
And then for each
course, we have
a list of readings
in the Reading list.
And independently of that,
we have a list of Sections.
Many of the Courses have just
one section, like this one
seminar course here.
But then there are some
that have a lecture
and a precept and a precept.
Each section, in turn, can
have one or more instructors.
Usually, there's just one
instructor per section.
But in some cases, if we scroll
down, we're going to find one,
I think.
Here is, for instance,
a course called
Human Evolution that
has actually no readings
in its Reading list.
But it has a lecture and
a couple of precepts.
This lecture actually
has two lectures.
So and then each section is
also associated with a meeting.
So this is a very sort of
business-specific database
schema, as we call it.
This data structure
here is very specific
to the academic environment,
but the system doesn't care.
It works with any
database schema.
Once we have done some drawings
like this, we can go in,
and we can do other things.
We can put filters on any one of
these fields, for instance, do,
show me only the spring 0607.
This is an old database.
I should probably add a 10
there, but it says 16, 17.
So it sounds like a
more recent data set.
Oh, well.
Or I could filter here
to find only courses
by Brian Kernighan.
 
 
299
00:16:18,080 --> 00:16:24,890
So that's another basic
function in the system.
So that's an example of probably
the two most important actions
in the system, the ones
you use most of the time.
There are a couple
more that are important
in order to be able to
express arbitrary SQL queries.
The other example I
usually give is now, well,
what is it that
really distinguishes
one of these
database applications
from a spreadsheet?
Well, a fundamental difference
between a database application
and a spreadsheet
is in a spreadsheet,
you type in your data, and
it's display back at you
exactly the way you typed it.
And you can sort it.
You can filter it.
You can cut and paste
it, but it's always
displayed in one way.
In a database application,
especially because we
have many different
tables of data,
many relationships
involved, you always
end up needing to
look at the same data
from a number of different
ways, different screens,
different perspectives, they're
called in the system here.
So to make an example of
that, I can look here.
Let's clear the filter there.
Actually, let's start a new
query based on another table.
So we've seen the
instructors here already,
but I'm going to start a new
query based on the instructors.
Eirik?
Yes?
Was that-- we're
using only Inner Join?
That was the question.
Oh, good question, yeah.
So these are not
technically Inner Joins.
They are more akin
to Left Joins.
Although, they are really
something called a Nest Join
because what you will see
here if I scroll down,
somewhere there was a course
that didn't have any readings.
So you can see in this case,
Human Evolution has no reading.
So if you tried to do this in
SQL as sort of the naive way,
you would actually
lose that-- if you
tried to do an Inner Join
between Courses and Readings,
you would actually
lose that row.
It would disappear.
Human Evolution would disappear.
In this system, it doesn't,
because it's not giving back
just the flat table of data,
like a single SQL statement
would give you back.
It's actually giving you
back a different data model.
 
 
349
00:19:01,960 --> 00:19:04,360
It's giving you a nested
relation it's called.
It's really more like JSON.
So we can do those things.
I can show, actually,
and this relates
to the question about NoSQL.
In fact, anything you
see here on the screen
you can think of as JSON.
In fact, here it's
converted to JSON.
Actually, that's this one.
So now, if you look here, I
scroll up, and you will see--
just for debugging purposes
or for export purposes,
you can see the equivalent
JSON format here.
So this is a different beast
than a regular SQL query.
Even though the goal is sort
of to be as expressive a SQL,
in order to do that, we need
to be slightly more expressive
than SQL in terms of the data
model that we are retrieving.
And that we'll be more clear in
the next example, where we'll
sort of take
advantage of that fact
that we have a nested data
model for display purposes.
And we'll use that to see
the inside of a SQL query
rather than just the
result of a SQL query.
 
 
371
00:20:12,520 --> 00:20:17,920
So to do that example,
which will also
include an aggregate function--
so I think from previously
having talked to many
CSAIL Alliances people,
I know a lot of you
are technologists.
Many of you actually
probably know some SQL.
Or if you're a
software engineer,
you certainly know this already.
So then there are other reasons
why you should use this tool
that I can get back to, too.
 
 
381
00:20:41,920 --> 00:20:44,020
So for instance,
in this case, I was
talking about seeing
the same data from two
different perspectives.
This was one.
The other one is
suppose we want to see--
instead of browsing
the course catalog
as a student wanting
to pick their courses,
imagine you're in HR instead
or at the department level
and you want to figure
out teaching load.
Which instructors are
teaching the most courses?
And I use the same features
that we've already used here.
And I will go in, and I
will expand all these Joins
into meetings.
So I'm going the
opposite direction now.
I'm going now from
instructors to meetings.
 
 
397
00:21:29,920 --> 00:21:37,720
And what I want to do is I
want to just total up, sum up
the total amount of time that
each instructor is teaching
every week.
And for that, I will
need a calculation.
You can do calculations in SQL.
You would do select
some group by--
and in Excel, how
would you do this?
Excel-- it's a little tricky.
You'd probably do the formula
here first to figure out
the difference between the
Beginning Time and the End
Time.
So you'd insert a new column.
That's the same thing
we're going to do here.
I just inserted another column.
It's called the formula column.
Because everything we're
doing so far is read-only,
I'm not changing the database,
just changing the query.
And I'm going to
have a formula here
that is going to be something
like End Time minus Beginning
Time, except I haven't
implemented minus on times yet.
So I'm going to use a formula
function called diff Date, so
the difference between the
End Time and the Start Time.
And it's going to give me like
an Excel-style fraction of 24
hours.
I multiply by 24 hours
to get a number of hours.
And I'm going to actually
call this column something.
I'm going to call it
just Hours [INAUDIBLE]..
Hey, Eirik?
Yes?
If the query result
is hierarchical,
can it still be produced by
an existing query execution
engine?
Yes, the-- oh, sorry,
yeah, was there more?
Oh, no, sorry, just the
example was, can I still
use POSTGRES to run a SQL query?
Yes, in fact, this example
here is running on POSTGRES.
It's running on a POSTGRES
instance that comes
bundled with the software.
But you can also connect it
to a SQL Server, or Oracle,
or MySQL.
So behind the scenes
here, the system
is actually generating
SQL queries.
And it's generating
not just one SQL query.
The way you do this to
retrieve this result--
you actually have to do
essentially one SQL query
for each one of the
bold headings here.
So you need to do
one for meetings.
In this case, you don't
really strictly need to.
You could join all
of this together.
It would be a very hairy query.
In this case, you
really don't want
to try to join together
Courses, Readings, and Sections,
because the Readings have
nothing to do with Sections.
And you would get back
a lot of repeated rows.
In fact, it would blow up so
that you would get back you
know a million rows just to
retrieve a couple of courses
if you started expanding
some more here.
So what you end
up doing is you do
one SQL query for
each bold heading
in the hierarchical model here.
And then the system
stitches them together
on the client's side.
There are ways you can write
it as a single SQL query.
You could use the XML
aggregates, for instance,
on Oracle or POSTGRES.
Or you could union
them altogether.
But the thing that works
for every database is you
need to split them
into separate queries.
So in fact, the answer is, yes,
this is executed on POSTGRES.
And in fact, the system
takes a lot of care
to make sure that none
of the calculations
are really done on
the client's side.
And it does a lot
of work to avoid
retrieving more than
exactly what you're
seeing on the screen.
So as I'm scrolling
down here, it's
just doing infinite scrolling,
redoing queries, really
outsourcing all the
hard work to POSTGRES.
 
 
476
00:25:35,970 --> 00:25:40,140
So in this case, I've done
one kind of formula here.
I did a formula at the
level of the Meetings.
And notice that I entered
this formula just once,
and that gets evaluated
for the entire row.
That is not what you
would get in Excel.
So in Excel, you would
actually need a formula
in every single
one of these rows.
And what you would do is you
would type it in in one place,
and then you would drag it down.
And then eventually,
two months later, you
would change the
formula and forget
to drag it down, or not drag it
all the way down, or whatever.
So this is much more like
SQL, where the query is really
defined in terms of
the schema in terms
of the names of the
columns rather than
at the individual cell level.
That's very important for
robustness and for being
able to repeat these things.
So if the data changes, the
query will update again.
Now I'm going to
do another formula,
but I'm going to do it out
at the Instructor's level.
So I'm going to
insert a formula here.
And here I'm going
to say, equals sum.
And I'm going to take the
sum of the hours taught.
Now, this is a bit different
from both Excel and from SQL,
so there's not going to
be any "group by" here.
It's going to figure out
how to do a sum just based
on where I put the formula.
So if I press Enter here,
you're going to get the sums.
Each instructor has a
single sum because I
put the sum, the formula, at
the level of the Instructors.
And that's going to be based
on this grouping hierarchy
that we already have.
So now I can sort on that,
and I could figure out, OK,
who's teaching the
highest number of hours?
And that is Robert Paul
Esperanz except, now we
immediately see something iffy.
Of course, if we we're
just doing this in SQL,
typing a SQL query, all we
would see would be this.
We would see a single flat
table of just the results,
and we see that this
can't possibly be true.
No one is teaching
278 hours a week.
It's not possible.
There's 40 hours a week.
Well, that's not true.
But, oh, OK,
[INAUDIBLE] say 80 hours
if you're like a workaholic.
But in this system,
you're actually
seeing what goes
into the aggregates.
And so I can start looking
here, well, you know,
it seems like the same thing is
happening over and over again.
I can start investigating
and see what's going on.
Like, show us some more fields
from these related tables.
I can see that, actually, we're
including every single semester
that was in the database.
In fact, I need to
do a filter here.
And maybe I want to just filter
on the same semester that I
used before, so
this spring 0607.
Now you see the sum is updated.
Not only is the sum updated,
but there's someone else on top
now.
If I undo that, you see
before it was Robert Esperanz
But now, having
done that correction
and just focusing
on one semester,
it's Michael G. Littman.
And in fact, I can see
some other things here.
Actually, some of these
sections-- they were canceled.
So they don't count.
So I can go, and I can
filter Michael Littman.
Let's exclude all the
canceled sections.
Of course, you have
to know what X means,
but it's very common to have
these cryptic codes somewhere
in your database.
And at least, you
can see them now.
So I exclude that, and you
can see-- if I press space
again to add and
remove the filter,
you can see the total updating.
 
 
554
00:29:46,520 --> 00:29:49,930
Now, if I'm happy
with the result,
maybe I want to do a little
bit of a visualization,
maybe I actually only
want to show the results,
so I can hide this entire level
and just show the results.
So this is the single
SQL query because it's
just a single bold header here.
So now I have a report
that looks a little
more like something you
would print out and send
to your boss.
But if you shared this
with your colleague,
your colleague goes
in and sees, OK, well,
Michael Littman on
top, 24.7 hours.
Do I really trust that?
How is that query written?
Well, I can go in and I can
unhide that level again.
And I can see where
the data came from.
I can see what the
query looks like.
In fact, the way to look
at this visualization
is to think that
the white area--
those are the
contents of the table.
That's the data coming
out of the query,
the result of the query.
But the gray area,
the header, that's
actually the structure
of the query.
So in fact, when you
look at the gray area
here with all the levels and
so on, that is, for those
who took the database course,
the undergrad database course,
this is essentially
a query plan.
It's not necessarily
executed exactly like that,
but this is a relational
algebra, as we call it.
And you can see it
doesn't show everything.
So it doesn't show what
values are filtered on,
but it shows that there
is something here.
There's a filter here.
So if you open the filter, you
will see what's filtered on.
So the complete information
about what the query is
can be seen from this header.
 
 
592
00:31:37,180 --> 00:31:40,660
And then, we can go back to that
if we have a little more time.
 
 
594
00:31:46,850 --> 00:31:52,173
This was the database schema
of the database we just saw.
You could do something
similar here just
to get a visualization
of the schema itself.
You can click a
little button here.
That just sort of hides
all the primitive fields.
And we visualize the
entire schema here
as a tree, which is
a lot more compact
than this kind of schema diagram
thing that I've drawn out here.
But it serves a similar purpose.
So in the academic
paper, the SIGMOD paper,
I had to compare this
to previous systems.
This is one of those things
they teach you in grad school
that I actually didn't know
coming into the PhD program.
You need to-- to
get a publication,
and that's to get
a PhD, you need
to actually do something new.
So it's not enough to
simply create a cool system.
You have to be
able to articulate
what exactly is new here.
And you have to consider--
you don't have to--
usually the commercial
competitors--
they are not so relevant from
the academic point of view,
because they are
usually doing something
that was invented in 1992.
But on the other hand, there can
be an academic paper from 1977
that does exactly
the same thing.
And you can write an
academic publication on it.
So really-- and
you can read more
about this in the SIGMOD paper.
But it comes down to
three requirements,
that this system is the only one
that meets all of these three.
And so there's a long list
of academic prior systems
and some commercial systems
I've compared it to here.
 
 
628
00:33:38,720 --> 00:33:42,410
You have probably seen these
kinds of systems before.
Alteryx is a well-known
commercial system.
Tableau Prep is another one
based on the same concept.
These are what I call
diagram-based systems, or not
just me.
It's a taxonomy from years
ago, the diagram-based system.
So you're manipulating
not the data.
You're actually manipulating
a diagrammatic representation
of the query itself
or the query plan.
So the first thesis
of this paper
was that, now, to
make this usable,
to have a usable
visual query system,
you actually need to be
manipulating the data itself.
So you need to be able
to click on a column
and say, filter on that.
You don't want to
add another box
and say, add the filter box
in the plan or whatever.
You need to be able to interact
directly with the data,
not with some sidebar.
This is a hypothesis of how
to make a usable visual query
system.
The other requirement
here had to do with,
how do you modify a query
that you have already created?
So many of these prior
academic systems,
they actually like to do direct
manipulation of the data.
But then once you have
created the query,
once you've filtered
on this column,
and then hidden the
column, and then maybe done
an aggregate on
it, there's no way
to go back three
steps, four steps,
and change the filter without
undoing all the way back.
So those are some of
the fundamental things
that you can actually
do in this system
that you can't do in the
others, like the ability
to go back here
and change a filter
and have the sum updates that we
specified many, many steps ago.
That's actually a novel
thing with this system.
And then the final and easier
to understand requirement is,
well, we want to be able
to represent arbitrary SQL
queries.
So if you can write it in SQL,
and there is a lot of queries
you can write in
SQL, then you should
be able to construct visually an
equivalent query, a query that
will always return the
same result regardless
of what the data is.
That is the third requirement.
And then here I've
used Tableau, which
is one of my favorite
examples, actually,
because it's a really--
it's a really expressive
system in the sense
of the visualizations
you're able to produce.
Basically any kind
of visualization
that you can plot on an
x-axis and a y-axis you
can make in Tableau.
They look really cool, and
it's by direct manipulation.
Because you can click directly
on the editor or in the data,
and you can make changes to
the query or the visualization.
But it is not in any way
as expressive as SQL.
If you want to do this sort
of arbitrary SQL queries,
then you either have
to type them out
or you have to go back to
this diagram-based prepped
interface.
So the challenge is to have all
of these requirements in one.
And we could do
that in this system.
So this was an academic
paper, so, of course,
we had to evaluate it.
I'm not going to talk
about all of these.
We did user studies
with real users
and measured, how fast can
people do certain queries?
Then the really
interesting takeaway
is that after about 20
minutes of this system--
and some things that were a
little harder to understand
than others, especially where
to put formulas and things
like that-- that was a
little hard for people
to understand initially.
But then after about 20
minutes, people were super quick
with this thing.
People could use most
of these features,
and there are not
that many of them,
and do fairly complex queries.
So we also did a controlled
user study there.
We used Microsoft Access.
I would have loved to
use Tableau as a control,
but it has a famous
DeWitt clause
in its license agreement.
DeWitt was a or is a famous
academic in the database
community that once made
Larry Ellison so mad at him
by publishing a paper
stating that Oracle had
a low performance
compared to other systems
that Larry Ellison put in the
license agreement of Oracle
that you can never
do a performance
study on this system.
It's against the
license agreement.
That's called a DeWitt clause,
and everyone uses it now,
unfortunately.
Anyway.
So the control study worked
out very well, actually.
We scored a lot better than
Microsoft Access 2016's
query builder.
I actually redid this
study sort of informally
on a group of students at
Columbia University last week.
And it seems, even
though this is not
statistically
significant, because they
were only 10 students, it
seems we increased the score
quite a lot.
So it went from
Microsoft Access being
in the bottom sixth
percentile of usability
across the business
tools, so I heard
was the old name of this
system or the old version,
the academic version,
of this system that
was right in the middle of
usability for business tools.
So we could really
make SQL queries
tolerable or making
database queries tolerable.
Now it's quite high up.
Although, this was on quite
technically-proficient users.
So the thing you can do
is you can just count up--
take some queries, and you can
count up the number of clicks
that you do, clicks
and keystrokes.
So we did that, compared
it with two other systems.
Some of the other systems
were a little faster
for certain queries,
but the big difference
was Ultorg, previously
called SIEUFERD,
was the only system
that could do
all of the queries
in the benchmark.
So it's the only one
that was expressive to do
all of the queries.
So it's actually quite--
I got this question at
the SIGMOD conference
presenting this work.
And I anticipated this
question because this
is a very loaded thing
to say, that the system
can do all of SQL.
SQL is huge.
There is a lot of stuff
that you can do with SQL.
So specifically, what
the system could do
is cover SQL-92, which
is the core of SQL
that you learn in
database class.
Or if you take a database
class, and they teach them--
or data science class,
and they teach them SQL,
they will teach you SQL-92.
So its SELECT--
SELECT, WHERE, GROUP
BY, and nested SELECT,
and all of those things.
The nested part is
important, the fact
that you're able to put one SQL
query inside another SQL query.
That is actually very important
for the expressiveness of SQL,
and the system can
model all of those.
So how do you
actually prove that?
 
 
777
00:41:41,610 --> 00:41:45,050
Well, there's something called
a relational algebra, which
is the mathematical formulation
of what SQL gets translated
into for [INAUDIBLE] purposes.
 
 
781
00:41:55,020 --> 00:41:58,110
And there's a prove at the
end of the SIGMOD paper
that shows you how to convert
these visual queries--
or sorry, to convert
relational algebra
into visual representations.
But how do you know which
ones you need to cover SQL-92?
 
 
787
00:42:14,100 --> 00:42:15,832
Well, you have to
talk to someone
who implemented the database.
So I talked to Julian Hyde who
once gave this presentation.
And that's that
SQL-92 compliant.
I tweeted that to him.
It says SQL-92 compliant.
Which relational
algebra operators
did you need to implement
in this Java database?
And he gave me the list.
And I did the similar
thing with someone
else who had implemented
a database from scratch
and actually sold it later
to Tableau, Thomas Neumann,
gave me the list of operations.
So we have this system
that lets you then
express arbitrary
SQL queries from
within this spreadsheet-like
user interface.
And I'm not going to go for
too long before we go to Q&A,
but I want to show you just
a couple of other components
of this system.
And the other is the
automatic layout engine.
So if you recall these
stereotypical relational
database applications,
they consist
of a lot of different
screens, or perspectives,
or views over usually multiple
tables in the database.
And not all of them
are simple tables.
Quite a lot of them-- they are
forms, or they are reports.
So they have some
hierarchy to them.
And it was important
for this system
to be able to do those
kinds of layouts.
And in fact, the fact that the
results here are hierarchical
already--
we saw this hierarchical
structure here
in the course catalog--
this is actually a very
good starting point
for modeling these other kinds
of boring text visualizations.
So for instance, a form--
if you see this structure
here, we could equivalently
show this as a bullet list, like
it has sort of an XML style.
So this is another way to
look at the exact same data,
where we're just
looking at Courses
and the Reading list and the
Sections but in a bullet list.
So that is one other
way to look at it,
but it takes much more space
because a table is actually
more compact--
I'm switching
between the two now--
because you avoid
repeating all the headers.
The problem with the table--
if I start producing a
lot of different fields
and all of these different
pieces of information
about the course, I get a
very, very wide visualization.
I have to scroll horizontally.
This is not the way you
would usually present this
in a tailor-made interface.
It's just not very usable.
So when you start having
to have a lot of fields,
for instance, show
me the 360 degree
of all information we have
about a course for a customer,
or a hospital, or whatever, you
use a different visualization,
typically what we call a Form.
And the automatic
layout engine is
the part of the
system that generates
not only these tables,
but also automatic forms.
So if I switch this
into Form mode,
we're going to see again the
exact same data displayed
as a form.
What is a form?
Well, it's sort of a compromise
between these bullet lists
and tables and a
couple other tricks
we can do to make sure that
all of the data that we saw
in the table before
can be displayed
without horizontal scrolling.
That is good for editing a
single item or for printing.
I've also said use
the word Reports.
So Report is basically what
you get if you take 100 forms
and pile them together
because they have
the same kind of hierarchy.
I can group by place,
things like that,
get a building report.
Again, the kind of things
that you would always
see in these CRUD applications,
but which you could never
actually produce in Excel.
And you can produce it in
Tableau or any other BI
tool that is based on crosstabs,
because this is not a crosstab.
We can do those too, but
this is not your standard BI
visualization.
So there was a whole part
of the system just dedicated
to how to generate these
kinds of visualizations
and automatically
adjusting the layouts,
depending on how much vertical
space you have available,
and so on.
And it's the exact
same user interface.
The same features are
available here as anywhere else
and as in the table layouts.
Michael Littman is still here.
I can filter on that.
And the very last thing I will
show you before I go to Q&A
is the fact that, well,
now that we have a form,
and we have a table view, we can
do that fundamental thing that
really distinguishes
spreadsheets
from database applications,
which is, we can go in here
and pretend that we're doing
some sort of admin task,
where we go into microprocessors
from measurement and control.
And we want to change
something, so we
want to use the form
to change the data.
So I can go in
and I could, say--
oh, actually, I tried
to move my cursor here.
I tried to make a change.
But it says, Not in
data editing mode.
So in Excel, it's very easy
to accidentally edit data.
We don't want to do that.
So far, everything
we've done has
been Read Only, no
changes to the database
until I go to Edit Mode.
And I go in here, and I
say, change this to X.
And maybe change this one to 50.
It's not going to
commit my changes yet.
It's not going to make any
changes to the database
until I press Control-S. At
the moment I press Control-S,
you will see the
data update here
and the data update
here in the other view.
So I've closed this section.
I'm about to close the
section or cancel the section.
If you recall earlier,
we had a filter
that excluded canceled
sections from this total.
So if I press Control-S
here and save this data,
you will see the
result update here.
And in fact, Michael
Littman is no longer on top,
because we closed the section.
And now someone else is on
top, and the total changed.
So with that, you
have a system that
is able to model all of
these different perspectives
over a relational database
that make up the [INAUDIBLE]..
And with that,
I'm actually going
to go to the conclusion
and Q&A. And at this point,
I'm happy to unmute
people, or maybe Matt
can take care of that.
Or I can take questions
by chat as well.
Eirik, I did get one.
And someone was asking,
what are some changes
Ultorg saw as it moved away from
the lab to a more commercial--
Yeah, the main change
is that the last couple
of years of
developing on this has
been concerned with really
just polishing up all the stuff
that was done in grad school.
So in particular-- well, this
is a visual query system,
so you're going to
do database queries.
And you're going to get results.
You're going to get
totals, sums, reports.
Well, you absolutely
want to make sure
that those results are correct.
Because in the
academic version, there
were quite a few bugs in the
system that actually generated
SQL queries that stitch these
things together, and so on.
So I spent probably
about a year [CHUCKLES]
writing unit tests for this
thing, for the query system.
So the visual query system--
sorry, the part, the
evaluator, as I call it,
the part that generates
equal queries and composes
the results back together-- that
needs to be absolutely correct.
The user interface--
you can allow some bugs.
It's like sort of the
in-flight entertainment system
of a Boeing plane versus
the flight controls,
and so on, and the
instrument landing system.
There is one you
cannot accept any bugs.
In the user interface,
you can prioritize
features a little more.
There were certain
features that had
to be added here to complete the
expressiveness of the system,
not from a theoretical
point of view,
but from a usability
point of view.
There were certain things
you could do in theory,
but they were very cumbersome.
 
 
960
00:51:59,750 --> 00:52:03,890
so there had to be easier
ways to for instance group
on fields that are
tables in the database.
It's not really been launched
as a company yet or as a product
yet.
This is still in development,
finally nearing the point
where you can actually use it.
And I would love
to get beta users
and talk to people who might
potentially become customers
in the future.
But for now, this is really
still a research project.
So I'm very interested in
getting additional people
to try it.
 
 
973
00:52:42,350 --> 00:52:45,050
What are some open
research questions
in this area that
are exciting to you?
Are there some that
you're working on
or others in general?
Yeah, so the--
I think the hardest part
was the visual query system,
figuring out how to do
this at a level of SQL-92,
how to express
arbitrary SQL queries.
But there are things
that you cannot get--
do in this system that would be
interesting research questions.
Probably, the research
question that would
be most interesting,
not necessarily
from a commercial
point of view--
but there's something called
a recursive query in SQL.
You can't do them in MySQL,
or maybe now you can.
But POSTGRES and Oracle-- you
could do a kind of query called
[INAUDIBLE] recursive.
And it becomes
important when you
have sort of loops in
your database schema,
such as this company is a
subsidiary of this authority,
this ultimate organization.
That was actually a name
of a database column,
so that's the reason for
the name of the company.
So there is a subsidiary,
a chain of subsidiaries,
and you want to find the
ultimate organization,
that the ultimate owner,
that is something that you
need recursive queries to do.
And then the special
thing about those
is you don't know
beforehand how many joins,
how many steps,
away from the table
you're going to have to do.
So that would be a very
interesting research question,
but not so much a commercially
interesting question,
because it will be very
hard to do, and probably
not worth the time for now.
Then there's some
things that are
interesting both as
a research question
and as a commercial
question, for instance,
doing window functions
properly, cumulative sums.
I have a bank account
with debits and credits,
and I want to find the balance
at the end of the month.
Those are kinds of queries
that you can't do in SQL-92
but which would be good to
get into a system like this.
And then there are some other
questions around the credit
application, the custom
database application
use case, where you
want to figure out
how to do versioning
well, for instance,
to be able to host
tables and have
version control on
those tables, and have
that be reflected in a usable
way in the user interface.
 
 
1024
00:55:20,990 --> 00:55:23,210
Have you considered
visualization
as part of the report view?
And did you get any
feedback from users
that said they wanted that?
Yes.
So the reason you
don't see a lot--
so this is all very text-based,
sort of boring visualizations.
And the reason why you
haven't seen a lot of focus
in this system on that
is because in that part,
I consider that almost
a solved problem.
So Polaris, which was the
system that became Tableau--
that was a PhD thesis
at Stanford in 2003--
that basically
solved that problem.
OK, so basically, plots--
how do we make a model,
a visual interface,
to make arbitrary plots?
So that's why it was not
part of this research effort.
There was a lot more low-hanging
fruit and unpicked fruit
at the boring text
visualization level.
You can also see that in
almost all of these examples
that I've given, there
is actually very little
slick visualization.
OK, there's some teeth here
because it's a dental system.
But you get some bar charts.
The bar chart is the
single most important one.
How useful is really this?
I have strong opinions on this.
But definitely, you
want to get some plots.
And yes, I've been
working on how
to integrate that
into the system
and done some mockups and--
yeah, it's going to be done once
other more pressing issues are
taken care of, such as how
to properly save your data
and share it with other
people, and so on.
 
 
1059
00:57:12,800 --> 00:57:16,930
And we actually have a
question from David Karger.
Hi, David.
You mentioned that you're
beginning to tackle writes,
and you don't update
until someone hits Save.
Is it possible for
edits to conflict
what happens if you edit a
data in that disappearing cell?
Yeah, so actually, this
kind of visualization
gives you a potential
for handling
that in a very neat way.
Because what you
could do is I could--
well, of course, if it
even really conflicts,
the classical conflict here
is that I changed this number
to 25.
And then someone else
changes it to 70.
And then someone else
actually commits,
and now I'm trying
to commit myself.
The correct thing
to do in that case
is to tell the user
that, hey, what
you were looking
at before when you
made the change has changed.
There has been a change
behind your back.
We're sorry, but
someone else changed it.
You should probably
review your changes again.
And there is no semantic
way away from that.
The thing that is good
here in this kind of system
is that it's much less likely
that these conflicts will
have to be resolved manually
than in, say, a spreadsheet.
Because in a spreadsheet, if you
make changes, the spreadsheet--
in Google Docs, for instance.
Google Docs does not
know that this cell is
a salary column for John Smith.
So if you change the
salary of John Smith,
and another user sorts
the entire table,
it has to throw its
hand up and say, I
don't know what happened.
One user changed
the entire table.
The other one change the cell.
The cell is not
where it used to be.
What happened?
You solve it, user.
In this case, we have
a good semantic model
of what each field means.
So it doesn't matter if the
other users sorted the table.
In fact, sorting the table
is a read-only operation.
So it's not going to
cause any conflict.
 
 
1105
00:59:31,135 --> 00:59:34,330
But yeah, the
correct thing to do
is to say to the user, what you
were looking at when you made
the edit is no longer the case.
It's changed.
Here is the new version.
Review it, and see if you still
want to commit your changes
and override the other
person's changes.
 
 
1113
00:59:55,030 --> 00:59:57,360
Thanks so much, Eirik.
I think we're wrapping up.
But actually, I'm going to
unmute David for a second.
Hey, Eirik, I just wanted to
elaborate on the question.
I wasn't actually
thinking about conflicts
between different users,
but self-conflict,
where I edited data in one
cell, and that actually
causes a change to the joins.
So that another cell that I
edited data in disappears.
Oh, yeah, yeah, yeah,
that's a good point.
So I didn't actually show
you inserts and deletes.
Actually, that's
not related here.
So there are various
error conditions here.
Like, if I try to change a
primary key, it's going to say,
you can't change
the primary key.
And if I try to change this
thing that is joined on Course
ID, it says, cannot modify a
value that is joined against
its parent group.
I should probably figure out
how to word these things better.
But yeah, there are
various conditions that
will cause the
system to say, no,
this is not a valid edit
for this particular view.
 
 
1136
01:01:04,290 --> 01:01:05,110
Cool.
I guess you could
actually signify that--
you could actually gray
out, like color out
those cells to tell the
user that they [INAUDIBLE]..
Yeah, you could do that.
Cool.
And there are the things that--
I mean, you could
probably find--
there are some interesting
things that happen within here.
If you start going and
changing the title here,
like the Organic Chemistry
to something Zumbucom,
that change will actually be
reflected in multiple places
in the same view.
So it doesn't quite work like a
spreadsheet, and you can but--
[INAUDIBLE] fascinating.
There are a lot of
interesting things
that happen when you
add editing here.
Yes, cool, thank you.
Thank you.
 
 
1156
01:01:54,690 --> 01:01:56,860
OK, well, it's
about time, I'd say.
But thank you so much,
Eirik, for joining us
today and sharing this
presentation with us.
Thanks for all the
CSAIL Alliances members.
I would love to hear
from you, and I'm
very liberal with giving out
test copies of this thing.
It will work with
your existing database
or with your Excel spreadsheets.
And I'm always
looking for feedback.
