- So I'm honored to be here today
to present my PhD thesis work,
and this was carried out at
Carnegie Mellon University
and it wouldn't have
come together, obviously,
without my great advisers,
Chris Harrison and Scott Hudson,
and my committee members,
Ben Coe and Jodi Forlizzi,
and of course, all the collaborators I had
from Carnegie Mellon
University, Microsoft Research
and other institutions.
So my thesis is titled
On-World Computing,
focused on computational interaction
with everyday surfaces
such as walls, tables,
furniture and so on.
So, ubiquitous computing
seems to be here now.
Like, we seem to actually
have ubiquitous computing
in the form of these smartphones.
Billions of people own smartphones
and these are very, very
sophisticated computers
jammed into very tiny packages
that we carry around with us constantly.
So, of course, that means that smartphones
have made computing disappear
into our lives, right?
But it's not quite true.
If you actually wanna
interact with a digital world,
you have to do it through
that handheld terminal,
and digital content is fully separated
from the physical world.
This isn't really the vision
of ubiquitous computing.
It's trapping the digital
world in these little boxes.
And so for my thesis, I
set out to try to fix this,
to bring computing out of
the small trapped interfaces of today,
and out onto the world around us,
out onto our everyday surfaces,
our walls, tables, furniture and so forth.
This is what I call on-world computing,
this is bringing interaction out
onto natural everyday surfaces
and embedding computational capabilities
into the real world.
So now, one way we could do that
is we could just transform our
environment into computers.
This is the easiest way, right?
So we could replace
tables with touchscreens,
we could replace walls with giant TVs,
countertops with tablets and so on
until everything is interactive.
The obvious downside here is the cost.
You know, although computers
might be cheap now,
they won't be as cheap as
drywall or wood any time soon.
And it's difficult to upgrade.
Imagine telling someone,
well, you have to buy Wall 2.0 next year,
it's the must-have upgrade.
And it still limits you to surfaces
that are actually suitably augmented.
But we already have a technology
that can actually enhance
environments at a distance
and that's the humble light bulb.
So what I'm gonna talk about today
is showing you my research
in taking this basic one pixel light bulb
and turning it into an
interface projector,
something that can actually
put touch interfaces
on any surface in the environment.
So I started my journey with
a system called WorldKit,
and this is an early project that I did
to try to explore the
potential applications
and also the pain points
that we might run into
when trying to build
on-world interactive systems.
So WorldKit's a complete system
and what it lets you do is,
it's kind of a software
toolkit for the world.
Here, what I'm doing is
I've written an application
and now I'm deploying
it into my environment.
This is a status message app for my office.
So there's a touch sensor on the door,
and when the door is closed,
I can let people know
why my door is closed.
So the application program right here
has basically defined the behavior
and controls that the application needs,
and then the user
customizes the application
for their environment
by specifying where each
of those controls go.
So how this actually works is, you know,
there's a projector and a depth sensor
that are fastened together
and calibrated into a single unit
that you point at the surface.
This isn't in a light bulb yet, you'll notice.
And the depth camera here
captures depth images
30 times a second.
This is what one of those
depth images looks like
of a typical sort of office environment.
And unlike an RGB camera,
like a regular, you know,
camera that you've used,
this captures distance
to the nearest object
which is represented here as a gradient
between light and dark.
And at start-up time, what we do is
the WorldKit system will capture
a couple of seconds of background
and use that as a background model.
So you can see here, this is a video of
how that tracking actually
looks in real time,
how we actually track user
input using the depth data.
So we can detect differences
in the pixel depth
from that background model.
So here, the regular green color
is pixels that are at the background level,
dark green is pixels too far into the foreground,
and blue is those pixels that are sort of
close to the background
but still discernible
as being different from it.
So that lets you know where
the user's actually
interacting with the system.
So this is how we do touch
detection in WorldKit.
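Just to make that concrete, here's a rough sketch, in Python, of the kind of background-subtraction touch test I'm describing. The thresholds and frame counts are made-up illustrative numbers, not WorldKit's actual parameters.

```python
import numpy as np

# Hypothetical thresholds in millimeters -- illustrative, not WorldKit's real values.
NOISE_MM = 8            # within this band, treat the pixel as background
TOUCH_MM = 40           # within this band above background, treat it as a candidate touch
BACKGROUND_FRAMES = 60  # roughly two seconds at 30 fps

def build_background(depth_frames):
    """Average a couple of seconds of depth frames into a background model."""
    return np.mean(np.stack(depth_frames[:BACKGROUND_FRAMES]), axis=0)

def classify_pixels(depth, background):
    """Label each pixel: 0 = background, 1 = near-surface (candidate touch), 2 = foreground."""
    diff = background - depth          # positive = closer to the camera than the background
    labels = np.zeros(depth.shape, dtype=np.uint8)
    labels[(diff > NOISE_MM) & (diff <= TOUCH_MM)] = 1   # the "blue" pixels in the video
    labels[diff > TOUCH_MM] = 2                          # the "dark green" foreground pixels
    return labels
```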
And the second component is to do output.
So in order for the
WorldKit system to handle
any surface at any angle
we have to do some sort
of planar rectification,
so the system detects essentially
what orientation the surfaces
are at using the depth camera
and then rectifies the image
so that it appears correct.
So on the left, you can see
what gets projected out of the projector,
and on the right, what you actually see
when you're interacting with the system.
The other thing that's
interesting about this
is it makes it possible for developers
to treat the whole interface
as a collection of 2D planes,
and this makes the development process,
the actual programming
process very, very simple.
But the big difference is
that instead of pixels,
developers have to think in
real world units, millimeters.
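Just to illustrate the idea, here's a rough sketch of that rectification step, assuming OpenCV; the corner points, widget size, and projector resolution are made-up example numbers, not WorldKit's actual calibration.

```python
import numpy as np
import cv2

# Four corners of a widget on the surface, in real-world millimeters
# (the units WorldKit developers work in), and where those corners land
# in the projector's pixel space -- example numbers only.
widget_mm = np.float32([[0, 0], [300, 0], [300, 200], [0, 200]])
projector_px = np.float32([[412, 187], [861, 203], [845, 530], [398, 512]])

# Homography mapping millimeters on the surface to projector pixels,
# so the projected image appears undistorted on the tilted surface.
H = cv2.getPerspectiveTransform(widget_mm, projector_px)

# Render the widget on its own flat, mm-scale canvas, then pre-warp it for projection.
widget_canvas = np.zeros((200, 300, 3), dtype=np.uint8)   # 1 pixel per mm here
cv2.putText(widget_canvas, "Do not disturb", (20, 110),
            cv2.FONT_HERSHEY_SIMPLEX, 1.0, (255, 255, 255), 2)
frame = cv2.warpPerspective(widget_canvas, H, (1280, 720))  # projector resolution
```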
So we've deployed a couple
of applications with this.
So this first application is
sort of a recipe helper.
What this does is, each of these slots
has a different kind of touch sensor here,
allowing the user to visually see
when the recipe they've sort of assembled
has actually been fulfilled.
And this was actually deployed
to Alzheimer's patients
on a real kitchen table
showing that this improved their ability
to complete sequential recipe tasks.
And the second sort of
little demo application here
is my office application.
So this is an augmented office
with a very much younger me.
You can see that the contact
tracking here is pretty rough,
I have to use my whole hand, right?
So this is one of the things
that's sort of a downside to the system,
but I've got a little
sensor on my keyboard here
so that when I put my hands down,
my private calendar pops up.
So this is a kind of
a like nice little way
to augment my environment,
and it's pretty unobtrusive.
But on the other hand,
there are a lot of
challenges with this system.
So first of all, you know,
the contact tracking,
I have to use my whole hand for that
and that's not what we're used to
when we're using touch
interfaces of today.
And the user interaction
was very, very simple,
it's just painting and hand tracking,
not a lot more that I can do with that.
It's up to the interface author
to define any other interactions
that might be possible.
And the system doesn't really
respond to its environment
in any meaningful way either.
I mean, it's an on-world system
but it's not really
interacting with the world.
And finally, obviously that big projector
is pretty unwieldy, so it'd be nice
to try to shrink that down into something
that's really more like a light bulb,
or consider an alternative to projection.
So in the rest of my thesis work,
I set out to improve each of those areas,
so I'll start with the input sensor.
So this is a system called DIRECT,
which stands for Depth-Infrared
Enhanced Contact Tracking.
So what this actually does is it enables
high precision tracking of the fingertips
using only an off-the-shelf depth camera.
And so you can see here,
the filled green circles
are DIRECT tracking the user's fingertips,
and all the previous
methods in the literature
that we compared against
are shown as these hollow circles.
So why does DIRECT perform
so much better than previous methods?
Well previous methods have only
really used depth data alone.
And if you only use depth data,
you run into some really
nasty corner cases.
On the left side, you can
see this is what happens
when I put my palm flat
against the surface.
You can see the fingertips
are barely visible
and they're just sort of
fading into the background.
And on the right side,
this is what happens
if I put my finger at a very steep angle.
You can see the actual
fingertip is not visible at all.
But a lot of these depth cameras
capture infrared data as part
of their normal operation.
So if we pull up that infrared data,
we can see something very different.
Obviously now, the fingers
are very, very visible.
But if you only use infrared, of course,
you can't track how close
things are to the surface,
so the key to DIRECT is
merging both of those.
So in depth, it's easier to
segment the arm and the hand
and detect distance to the table,
and then in infrared, we
can segment the fingertip.
And so what DIRECT actually does is
it locates the arms, the hands
and then the fingertips
in the combined image,
and this kind of hierarchical model
ensures that we can reject anything
that's not really remotely human-like
that's sitting on a random office table.
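Here's a simplified sketch of what that depth-plus-infrared merge might look like; the thresholds and exact segmentation steps are illustrative stand-ins, not DIRECT's real pipeline.

```python
import numpy as np

def find_fingertips(depth, infrared, background,
                    arm_mm=150, touch_mm=30, ir_thresh=180):
    """Toy version of the depth+IR merge. All thresholds are illustrative."""
    height = background - depth              # mm above the table surface

    # 1. Arms: big regions well above the surface -- easy to find in depth.
    arm_mask = height > arm_mm

    # 2. Hands: raised pixels nearer the surface; a real implementation keeps
    #    only connected components that attach to an arm region.
    hand_mask = (height > 5) & (height <= arm_mm)

    # 3. Fingertips: bright skin in the IR image, close to the surface, and
    #    inside a hand region -- IR rescues the flat-palm and steep-angle
    #    cases that depth alone cannot see.
    fingertip_mask = (infrared > ir_thresh) & (height < touch_mm) & hand_mask

    # Only report fingertips if an arm is present at all; the arm -> hand ->
    # fingertip hierarchy rejects random finger-like objects on the table.
    return fingertip_mask if arm_mask.any() else np.zeros_like(fingertip_mask)
```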
And so this is sort of a picture
of what the DIRECT setup looks like.
It's still a depth camera
and it's still a projector,
but you can see that
DIRECT actually offers
significantly better accuracy,
as you can see in the bottom right picture,
without any smoothing applied.
So this is significantly
advancing the state of the art
in terms of touch interaction.
And we ran a user evaluation for DIRECT
using crosshairs and shape tracing,
and you can see here, DIRECT,
as that filled green dot,
performs significantly better
than the comparison methods
we found in the literature.
An important fact, actually,
from our crosshair test:
we found that DIRECT was able
to achieve, on the left there,
an average accuracy of 4.9 millimeters
in terms of mean error.
That's actually getting
competitive with real capacitive,
you know, physical
electrical touchscreens.
That's pretty good
considering we're, you know,
using a camera that's two
meters above the surface.
And for shape tracing, in fact, it's even better,
it's around 2.9 millimeters error.
So this is all achieved
without any temporal smoothing,
without any sub-pixel position correction,
all techniques
that are commonly used
in real touchscreens, so this is enabling
the raw tracking to be
significantly more accurate
than previous systems.
So this is good.
We've actually managed to,
you know, achieve kind of
touchscreen level touch
accuracy with the system.
So that's kind of our first goal achieved.
So the next goal is to
look at interaction,
so building on-world interfaces
that live on complex desk surfaces.
And so, if you actually look
at the existing literature
about desk surfaces,
it kind of implies that
we've gotten rid of everything
that lives on our desk,
it's really clean, clear, no clutter.
But, you know, our desks probably
look a bit more like this.
You know, there's lots
of stuff on the desk,
so digital interfaces that
might have to be projected
onto those environments
are gonna have to contend with
physical objects for space.
So I built a system to try to figure out
how digital interfaces can
actually coexist with these
and to figure out what
kind of interactions
people would want to use
on their desk environment.
So to do this, the first step was to run
an elicitation study to figure out
what people's expectations were
regarding digital interfaces
living on physical desks.
So, for example, what I did is
have people use paper application mockups,
so bring some different kind
of applications to users
and say, okay, where would you
put these on your interface?
How would you expect them to behave?
What kind of interactions do
you wanna perform with them?
So here, for example, the user's placed
a map on the left side 'cause
they don't use it much,
a calendar, B, in the middle
that they can sort of
use and see regularly,
and C and D, which are a
music player and a number pad,
on the laptop so that they can
actually augment the keyboard.
And this is kind of an
interesting fact, right,
we can use digital interfaces
to try to augment physical objects.
And you'll also notice a
funny thing which is that
users have actually arranged
it sort of radially.
This indicates that, you
know, real physical desks
don't necessarily correspond to sort of
nice rectangular grids
like you'd expect on a computer desktop,
but rather, they might be arranged
according to a completely
different coordinate system,
like this polar coordinate system here.
So in the end with the elicitation,
we derived several different controls.
Multi-touch input, window manipulations,
launching, closing, et cetera.
But users also wanted these applications
to actually respond to their environment,
for example so that they could attach to
and follow certain objects,
or move away if physical objects
were to obstruct them.
So to actually implement all
of this into a real system,
I built a system called desktopography,
named after the fact
that it essentially scans
the topographical map of a user's desk,
identifying different
objects in the scene.
And besides the standard
touch interactions,
it supports snapping
and following behaviors
by basically tracking object
edges in the depth map.
And it supports evading and collapsing
by using an optimization method,
sort of like what Ana was talking about,
to deal with actual
changes in the environment.
So this works by defining
different costs and penalties,
by saying, you know, don't put interfaces
over top of physical
objects that already exist,
it'll continually try to update
the layout of the interfaces
to deal with any changes
in the environment.
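As a toy illustration of that cost-based layout idea, here's a sketch; the penalty terms, weights, and coordinate conventions are invented for illustration, not Desktopography's actual objective.

```python
import numpy as np

def placement_cost(candidate_rect, occupancy, current_rect,
                   w_overlap=10.0, w_move=0.1):
    """Score one candidate position for an interface; lower is better.
    occupancy: 2D map of the desk where 1 = a physical object (from the depth scan).
    Weights are made-up example values."""
    x, y, w, h = candidate_rect

    # Penalty for sitting on top of physical objects.
    overlap = occupancy[y:y + h, x:x + w].sum()

    # Penalty for moving far from where the interface currently is,
    # so layouts stay stable unless the desk actually changes.
    movement = np.hypot(x - current_rect[0], y - current_rect[1])

    return w_overlap * overlap + w_move * movement

def best_placement(candidates, occupancy, current_rect):
    """Re-run whenever the depth map changes, so interfaces slide away
    from clutter (the 'evade' behavior) instead of being occluded."""
    return min(candidates, key=lambda r: placement_cost(r, occupancy, current_rect))
```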
And so this is what it kind
of looks like in practice.
So notably, these are basically
standard touch applications,
they could be web apps,
Android applications,
this is, you know, much more standard
than what we did with WorldKit,
which had custom applications
designed specifically for it.
So the user can launch an application
by summoning it like this
so there we've got my calculator
application, for example.
They can now snap it onto the laptop,
and you can see it sort of
shows you the different edges
that you can snap onto when you do this.
And the user can also resize them
with this little nub in the corner,
just like a desktop application.
This is a responsive app,
so it changes layout when its size changes.
And the interesting
thing here is, you know,
all these behaviors are implemented
just with web applications, right?
So here's what we're doing
with the optimization,
and this is basically,
the interfaces have actually moved away,
automatically detecting
that they've been occluded,
and moving to a more suitable
location in the environment.
Okay, so that's a set of interactions,
and there's more work in my thesis,
but I'm going through the
high-level points here:
a set of interactions
that supports both user input
and environmental responses
that work on practical desk environments.
So for the last bit, let's look at
how we can actually do output.
So this is split into two parts.
First off, what I actually looked at was
an alternative to projecting entirely,
so this is augmented reality,
head-mounted augmented reality.
So what happens if, instead of
projecting things on your environment,
we project it to the user's eyes directly?
So if you actually look at
something like the HoloLens, it's a pretty cool device,
you can do bloom gestures,
like this kind of hand gesture thing,
you can do voice input and head gaze.
But one of the things
that's really missing
is the kind of tactile feedback
and precise position input
that you get with touchscreens.
So what I did was to add touch input
into the HoloLens and call that MRTouch,
or Mixed Reality Touch.
And so this is actually really nice,
it actually allows you to co-opt
any surface in your
environment as a touch surface.
So this is essentially how it works.
This video is showing what the user sees,
this is a digital overlay on top of
the user's first person camera view.
It scans the field of view,
it looks for basically
interaction surfaces,
flat planes that are sort of appropriate
to put down interfaces.
And this is showing
sort of a debug output,
this is essentially showing you
which planes have been detected.
And then what the user can do
is just drag out a rectangle
to pull up an app launcher.
And so, this is an interface,
a virtual interface
that lives in the environment
that the user can see and interact with,
and standard touch applications all work.
And it also supports things
like in air gestures as well.
So in this example, I can build
a little blueprint for a building
and just pull it out, just like that.
So this can actually
let you mix touch input,
which is, you know,
positionally very, very precise,
with hand gestures that
expand the input vocabulary
and let you do a lot of
really interesting stuff.
So this is sort of getting
the best of both worlds here,
gesture input and touch input.
So how that actually technically works:
it's not actually that simple
in terms of the technology.
With DIRECT, we had the advantage that
the camera and the table didn't
move relative to each other,
but now the head's constantly in motion,
there's no stability anywhere.
So what this is actually doing is,
every single frame, it's
running a RANSAC algorithm
to figure out where it thinks
the touch interaction plane is,
and then using a DIRECT-like algorithm
to detect the fingertips,
with optimizations, of course, so that
it actually runs on the HoloLens.
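For the curious, here's a bare-bones version of that per-frame RANSAC plane fit; the parameters are illustrative, and MRTouch's real implementation is much more heavily optimized.

```python
import numpy as np

def ransac_plane(points, iters=100, inlier_mm=10.0, rng=np.random.default_rng()):
    """Fit a plane to a cloud of 3D points (in mm) from one depth frame.
    Returns (normal, d) with the plane defined by normal . p + d = 0.
    Parameters are illustrative, not MRTouch's actual settings."""
    best_inliers, best_plane = 0, None
    for _ in range(iters):
        # Pick 3 random points and form the plane through them.
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-6:
            continue                         # degenerate (collinear) sample
        normal /= norm
        d = -normal.dot(p0)

        # Count points within the inlier band of this candidate plane.
        dist = np.abs(points @ normal + d)
        inliers = int((dist < inlier_mm).sum())
        if inliers > best_inliers:
            best_inliers, best_plane = inliers, (normal, d)
    return best_plane
```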
We actually ran a study on this as well,
and found it achieves a similar
five-millimeter mean error.
So we can do the same sort
of tracking as DIRECT,
but now from a constantly moving platform
sensing the environment.
So this is all self-contained,
works entirely on the HoloLens
and actually forms part of the basis
for Microsoft's next
version of the HoloLens.
This is something that's
being deployed for real.
Okay, that's great, but
we forgot one thing.
This augmented reality is great,
but we haven't talked
about the light bulb yet.
So what about that?
So one of the things I actually did was
to try to miniaturize those components,
so that it actually fit into a
very, very small environment.
This is miniaturizing the
projection and sensing components
which was presented last year at CHI.
And I partnered with a
hardware OEM called ASU Tech
to build a super small projection
module and depth sensor.
So in the bottom right, this projector
is small enough to fit
inside a smart watch.
And around it, we built
a quad-core Android-powered smartwatch
that projects a touchscreen onto your arm.
And you're probably asking
how long this lasts,
it's about an hour of
continuous projection
or about a day if you're
intermittently using it
throughout the day.
So this is how it actually
looks when you're using it.
And so the user swipes to unlock,
this is pretty typical, you know,
the usual sort of thing.
And that swipe to unlock
is not merely for security
or sort of, like, play,
it's actually a calibration procedure.
So when the user drags their
finger towards the sensor,
it's telling the system
what angle the arm is at
because that can change as
the user uses the device.
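As a tiny illustration, here's roughly how a swipe could double as an angle calibration; the coordinate conventions are my assumptions for the sketch, not the smartwatch's actual code.

```python
import math

def arm_angle_from_swipe(swipe_points):
    """Estimate the arm's angle relative to the watch from the unlock swipe.
    swipe_points: list of (x, y) finger positions reported by the depth sensor,
    ordered in time as the finger drags toward the sensor. Purely illustrative."""
    (x0, y0), (x1, y1) = swipe_points[0], swipe_points[-1]
    # The swipe runs along the arm, so its direction gives the arm's orientation
    # in the sensor's frame; the projection can then be skewed to match.
    return math.degrees(math.atan2(y1 - y0, x1 - x0))
```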
And then, of course, we support
sort of standard touch
interactions with this,
so using a little tiny
depth sensor that we built
to detect the distance
of the user's finger
away from the smartwatch body.
And so this is fully self-contained.
The projector, the touch sensing,
the computation parts,
we can get all of that in
a very, very small package,
so the next step is to
really collaborate with them,
build a real InfoBulb, something
that has enough brightness,
enough computational power to support
sophisticated on-world interactions.
So that work is ongoing.
But the good news is, you know,
we've covered all the challenges
that I outlined at the start.
So, you know, there's
always more to be done here
but certainly in my
thesis, I think I've pushed
the state of the art in terms of
input, interaction and output.
And I hope I've convinced you today that
on-world interaction is not
only desirable and useful,
but that it's also practically achievable
in the near future.
So, thanks for listening
and a quick plug right here,
I just started my professorship
at the University of British Columbia.
So I'm actually looking for students,
so if you're interested, drop me an email.
And I'll take questions.
- [Woman] Thank you.
(audience applauds)
Let's have one question
before we ask to...
- Nobody has any questions.
- [Woman] Yes, guy at the front.
- [Man] Yeah, it's amazing work, Robert.
I have maybe a naive question,
but obviously the resolution of projection
is a lot lower than what
it might be on a screen,
like, are there particular
applications that you think,
like, where the resolution
doesn't really matter?
What are the kinds of things where
it's not gonna be very viable
if the resolution's not
higher than these things?
- That's a really good question.
So, you know, the recipe application
I think is a great one
where you can sort of
see it at a distance, right?
Where you have it sort of
laid out on your environment.
And certainly for real,
for someone who's not
an Alzheimer's patient,
we would, you know, have
something a bit more compact,
but something that you can use
at a bit of a distance, right?
We're not thinking about things that
you would use right up close.
Certainly there, the resolution
does become a problem,
so you can't do very, very small text.
But I think things that, you know,
you can sort of imagine
the sci-fi vision of like
having, you know, the
weather shown on your wall
when you wake up kinda thing.
Or when you wanna do
some simple interactions
with that kind of content,
or for things like even, you know,
having wallpaper that changes dynamically.
There's all sorts of
applications, I think,
in that sphere that would work.
But I think, I'm not advocating replacing
the digital interactions and
devices that we use today,
but simply having computation
be more ambient, yeah.
