>> 
MARTY: So, what's up? What I want to talk about today is log analysis and visualization. So I'm Raffy. I've worked in the log management and log analysis space for about the last 11 years.
I grew up in Switzerland. I worked at IBM
Research out in Switzerland for a little bit,
I did my master's thesis there and then with
a quick stint at PWC Consulting, I--that's
where I learned how to dress. I moved out
to the US and have worked for ArcSight which
is now HP; they just acquired them about half a year ago. Then I worked for Splunk for a while, which some of you might know; it's a company up in the city, a startup--or actually not a startup anymore, I guess--that also does log management. And about two years ago, I left Splunk
and then started my own company called Loggly,
and I will talk about that a little bit, a
little more. On the way, I sort of discovered
a passion for visualizing security data because
I was always sort of on the security side
of log management. I, at some point figured
out that well, there are ways to actually
generate visual representations of all of
that data. So, I figured out how to visualize
data and I--I really got into that and tried
to work more on how do you do that and how
can I help people visualize their own data
that they have and make their analysis process
more efficient. I wrote a book on
the topic and started the--a community site
called Secviz.org and so I've done a lot of
work in that community and you will see a
bunch of that coming out here. So, the agenda
looks as follows. I want to talk very briefly about what log analysis is--most of you are going to be like, "Yes, that's pretty obvious." I'm going to talk about the history very briefly; how did these tools even come about; what are some of the commercial milestones that we hit in terms of technologies also. I'm going to talk a little bit about log architectures, what sort of differences are out there between log management and SIM also. And then I want to launch into what's working and what's not. And you will realize that the second part, what's not working, is going to be much bigger, because I think we as an industry have failed at this point to really deliver working systems; but I will elaborate on that. Then I'm going to look at what we need in the future, and as part of that, I will show that visualization can actually help us a lot with analyzing and mining our data, and then I'll show you a bunch of examples of security visualization and how that can actually help with sifting through a lot of log files. All right. So
log analysis, this is actually a fun picture
I found when I was up in Seattle. I guess some of you guys are out there on the VC side. When we went to Amazon, they had these logs outside and I had to take a picture; it was kind of cool. So you have probably
seen log files before, they come in all kinds
of formats, some of them are a little more
interpretable than others. Usually they're
fairly crazy, there's all kinds of stuff in
there and if you haven't worked with these
logs before, it's really hard to interpret
what's really going on in there most of the
time. So if you look at a brief history: logging has been around for quite a while, and I'm not sure if it really just started with Eric Allman developing syslog. But back in the 1980s, he coded syslogd, mainly for Sendmail at that point; he just used it for Sendmail itself. And then there was a whole bunch of development around that and it evolved, and at some point in '96, there was the first company, Intellitactics, that actually entered the market with sort of a SIM, a Security Information Management tool, that started taking feeds mostly from intrusion detection systems to help reduce the number of false positives by correlating the data with other data sources. So that was really the first SIM out there. In '97, Tivoli came up with Risk Management; that was actually something that was developed in Zurich, in the team I was working in back there. And that was sort of the first attempt to take security data and start correlating it with the vulnerability information that you have, and that gave you a lot of benefit. And then in the next, I guess, 14
sampling. And then in the next I guess, 14
years, there was a lot going on in terms of
different companies came on the market; there
was like, netForensics and CadetNet and all
these early SIM companies that entered the
market. And in 2000, I--I marked that ArcSight
specifically there because I think that was
probably the most successful SIM, they were
not the first to enter the market but they
were able to learn about the mistakes from
the others. And when--when Arcsight entered
them, they had a lot of benefits and they
came up with a different kind of approach
to dealing with things and they just got sold
for $1.65 billion to HP, which is a fairly
successful outcome for them. And the last
point on there I have is in 2009, I--I marked
Loggly there because we're probably the first
company that does real logging as a service; we're a cloud-based, hosted service where you send the log files to us and we take
care of them for you so you don't have to
install anything anymore. Now, if you look at it not from a timeline or a company perspective, it's interesting to follow how these products or solutions, even the open source tools, developed. Why were they even there? It started really with network management. So the Tivoli event correlator was really a network management tool that helped collect all those SNMP traps, and you had HP OpenView
and all that kind of stuff. So the first problem really came from the network management side: there was a lot of information out there that we wanted to gather. But then the whole SIM world came about, security information management,
where the biggest problem in security at that
point was, you had all these intrusion detection systems and they threw up all these alerts, and a lot of them were false positives, and unfortunately they didn't manage to reduce the number of false positives by themselves. So the SIMs came in and said, "Well, we need
to look at a more holistic picture. We need
to correlate all that data with other data sources
to reduce that amount of data," which I'm
not sure we have really succeeded in doing
because now that you get all the other data
feeds you start looking at all that data as
well and so, you just pile up data that you
look at. So, that's what's happened. So, when
security monitoring came into place, more data sources were used and collected in one central place. And then some of the SIMs started pushing the whole combination of SOC and NOC. So they wanted to combine the security operations center with the network operations center, because historically there was a lot of network management that came in, and there's a lot of interesting information there you can use for your security use cases also. And they wanted to combine that. But I don't think that has really worked that well; there are some installations where these two constituencies are put together in one operations center, but generally it's still separated. There's still the network management part that makes sure that you actually have bandwidth and you can actually communicate, and then you have the security people that make sure that there's no dirt on their lines, basically. Then what happened is interesting as well: this
was all sort of infrastructure related. People
started collecting logs for infrastructure
purposes and maybe for web analytics, but that was a little to the side. But what then happened is that people realized, "Oh, we have all these log files in applications," so they started pumping all these application logs into these SIMs and it just blew up, because there was too much data, the schemas didn't work because it was all relational, and a whole lot of problems came about; and I will elaborate on that a little bit. Here
is a graphic I came up with which I called
the Maturity Scale of Log Management and what
I mean by it is the following, if you look
at companies or people that do log management,
they basically start off sort of on the left hand side, and in the beginning what you do is you collect logs, you centralize them so that you can actually access them in one place instead of logging into every machine. I mean, especially at your scale, you don't want to log in to every machine to look at the logs; you want to have it central. A lot of people just centralize things and then use grep or something to look for the data, which is absolutely non-scalable; it's slow, it's cumbersome to do. So then they start using some log management tool and they do some forensics; they might keep the searches that they do in a text file or something so that they have them and can pull them back out again, so they don't have to write the same command line all the time, or they write little scripts to do things. And then they start sharing these scripts with other people so they can run them. And then you get to the next thing, which is sort of reporting. And that's already quite a big leap you've made, to actually be able to report, because now you have to extract fields or data from the log records that are completely unstructured. And then people set up alerts, right? You have things that you do every day and you're like, "Well, that doesn't make sense if I do that every day. I'll just run a script that summarizes my information and generates alerts." Then you start collecting
more logs, because people realize, "Oh, there's actually real benefit to what I have. What if I had more of these logs?" So you can keep moving over to that right hand side; you start
correlating information and then really far
to the right hand side, you start with some
visual analytics, maybe with some pattern
discovery, or some anomaly detection. I'll
talk about that in a second also. But if you
look at the distribution of companies out
there and what people are doing, I would say
90% are to the left of--probably about here.
They do a little bit of correlation but not
very well and not much. Most of the people are just collecting the logs and centralizing them and maybe searching them, but that's pretty much where people are today. And that has a very interesting impact, because if you think about it--if you're developing a product on this side, or a solution, or a tool, or an algorithm--you don't really have the sites to test it at, because you don't have the data collected that you need to do this, and you don't have the people that understand this side. So you can start moving things more and more to the right hand side to actually enable all this stuff. [PAUSE]
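To make that "left hand side" concrete, here is a minimal sketch of the kind of do-it-yourself forwarding script people start with: tail a local log file and send each new line to a central syslog collector so you don't have to log in to every machine to grep. The host name, port, and file path are hypothetical placeholders; this is an illustration, not a recommendation for production use.

#!/usr/bin/env python3
"""Minimal sketch: tail a local log and forward it to a central syslog host."""

import socket
import time

CENTRAL_SYSLOG = ("loghost.example.com", 514)  # hypothetical central collector
LOGFILE = "/var/log/auth.log"                  # hypothetical local log to forward
FACILITY, SEVERITY = 4, 6                      # auth facility, informational

def forward(line: str, sock: socket.socket) -> None:
    # Very simple BSD-syslog style framing: <PRI> followed by the message.
    pri = FACILITY * 8 + SEVERITY
    sock.sendto(f"<{pri}>{line}".encode("utf-8", "replace"), CENTRAL_SYSLOG)

def tail_and_forward() -> None:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    with open(LOGFILE, "r", errors="replace") as f:
        f.seek(0, 2)                 # start at the end, like `tail -f`
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)      # nothing new yet
                continue
            forward(line.rstrip("\n"), sock)

if __name__ == "__main__":
    tail_and_forward()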
If you look at the tools on how is it done,
on the left hand side it's a lot of do-it-yourself: you use some scripting and maybe a database you throw the stuff into. Then if you move a little bit over, there's log management tools that can take care of things up to the reporting part; you have the collection, there's open source things like logstash or Graylog, and then you have the commercial tools out there. And then if you move more to the right hand side, especially when it comes to correlation, you have things like complex event processing or security information management; again, there's commercial and open source tools. Some people use MapReduce for some of the analysis use cases there. And then on the advanced analytics side, on the right hand side, there's really not much there, especially not log specific. There might be something like a [INDISTINCT] or something that you can use for visual analytics of generic data, and people are trying to push the log data into those tools, which works. If you look at the
tools that are out there, Open Source, there's
a whole bunch. This--this list is definitely
not complete. There's a bunch of tools; there's Graylog2 and logstash, which are probably the most advanced at this point
in terms of Log Management tools. Logstash
is actually written by our Head of Operations.
It's an Open Source tool, it's been out there
for quite a while. There are little tools like Tenshi or Swatch that you can use to look for different patterns in your log files. They just monitor the logs, and they have different queues you can set up and then send alerts or take some action. I marked Snare here; that's probably one of the better solutions to collect logs from Windows, which is always a huge problem--how do you get those Windows logs out, because it's sort of proprietary, they're not really open standards. You can use WMI to get to the data, but that's often a headache for security and there's all kinds of issues there. An interesting tool on here is SEC, the Simple Event Correlator. It's really old at this point; I don't even know when it was coded, it's probably around 10 years old. It's an interesting correlation engine; there are some downsides to it, but if you have a smaller setup somewhere and you want to play with correlation, it's an interesting engine to use. Yes, there's
a whole bunch of other tools. In the commercial
world this list is nowhere near complete but
that's I think sort of the--the major players
that you've probably seen. I don't--I'm not
going to talk about them much more unless
anyone is interested. [PAUSE] So let's talk
about some of the log architectures and how
these tools do some of the internal work.
If you look at Log Management tools and I
make a very clear distinction between Log
Management and then Security Information Management.
So, log management itself is basically trying to collect data from all kinds of data sources, and probably about 90% of the data sources are some kind of a syslog-based transport. But you have these other tools out there that use some kind of a proprietary protocol; Check Point, for example, uses something called OPSEC, which is a whole binary protocol--you need to write a client, there's all kinds of overhead there. Cisco has SDEE for their intrusion detection systems, which used to be RDEP back in the day. You have NetFlow data out there--and I'm putting NetFlow on here and not traffic flows in general because I think 90% of the people that collect flows use NetFlow and not sFlow or anything like that. And then you have databases; that's another data source. For example, ISS RealSecure--if you're security people, the IDS, the intrusion detection system--actually logs into a database, which is very interesting. So you have to query the database for the new records if you want to keep up with the data that's in there, which is an interesting approach, or an interesting challenge also. So then when you collect all this data,
you will index it, generally--that's what the log management tools mostly do--and then they generally add some context to it. So you have, like, an asset database where you can say, "These are my web servers, these are my financial servers, this is the criticality of the machines." Maybe you can add some flavor to usernames or something like that, but it's usually keyed off the IP address space, so you can add more
information there. And then in terms of scaling this, you can either cluster the log management appliances or the software [INDISTINCT] themselves, or you throw the data onto some kind of a SAN or something like that where you actually keep the data itself. Now, what some tools started doing--the problem is, if you collect all this raw data, you really can't do much else than just index it and then do a full-text search on it. So what a lot of these tools started doing is they introduced this kind of collector or agent or something that parses the data and then enables you to do field-based searches. You can say, "I want to have the username as Raffy" and not just look for Raffy anywhere in the log files. And that's sort of what a bunch of these log management tools did, and what they do for basically the main data sources; and it's a huge pain to build all these parsers if you think
about all the data sources that are out there,
there are so many. And building a parser for all of them is just incredibly hard; it takes a lot of time. And these data sources keep changing, you don't have documentation for them, so you can't even generate all the logs--how do you test this? In the intrusion detection system case, how do you generate all the signatures? It's almost impossible to trigger all of them. So there's a lot of challenges associated with this, and a lot of people sort of ignore this problem a little bit and say, "Well, if you have a certain data source, we give you an SDK and you can build those parsers yourself," which is not really a solution in my eyes. We'll talk about that in a second, or a little more also. So, these connectors and agents: a lot
of people when I talk to them, they're like,
"Is your solution agent-based or is it...?"
Like in any RFP, you get that question. Is
it agent-based? Is it agent-less? And there's
always like, "If it's agent-based, that's
bad because you have to install something
on your machine." But what people don't understand
is there's actually benefits to having a piece
of code that does some processing and often
what they really mean--well, is do I have
to install something on the end system to
collect the log files? That's what they really
want to find out. There's always some sort
of a connector agent involved or it should
be. So what these guys do, these connectors or agents--some of the features are: you can batch data, so if you get real-time feeds in general, you can still batch some of them up and maybe every second send a new batch of data, and that helps with compression, it helps with bandwidth consumption, it helps with the encryption of the data. You can sign the data also; you can show that you have integrity, so you can do like a hash chain or something on the log files, you can even do secure timestamps within these connectors. And often,
they are used for failovers so that if you
have a couple of servers where you collect
the data, they can--if one goes down or if
it's overloaded, they can switch over to another
one. Often what they also do is they parse
the data obviously in most cases, but it can
also aggregate. So that it can say, "Well,
if I see a port scan, for example, happening
in our network and the firewall just goes
crazy because I have all these connections,"
you can aggregate it. And say, "Well, I see
this one 100 times and here's the--here's
the data for it." Sometimes what you can also
do is enrich the data with context information
so you configure these agents to have more knowledge, so that they can go out and say, "Well, I'm querying your configuration management database to get the role of the machine and add that to the log files," for example. There's all kinds of other enrichment you can do inside of the agent already, so that you offload that from the central server or manager where you send all the data. And then obviously, they can help you support different protocols, things like OPSEC or SDEE; you have to have a client that actually talks to those sources to do that. Then sometimes, if you install an agent on your end system, that can have benefits: you can, for example, monitor your file system for changes and then report on all of those. You need to have something on the box to do this; otherwise, it's really hard to query your machine for things that have changed on the machine itself. Now, if
you leave the log management side: so we have the log management tools themselves--mostly raw data, sometimes they have a connector that parses the data and normalizes the data.
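As a rough sketch of those connector/agent features--batching, a hash chain for integrity, and aggregation of repeated messages--something like the following toy agent captures the idea; the class, field names, and one-second interval are assumptions for illustration, not any vendor's actual agent.

#!/usr/bin/env python3
"""Toy connector sketch: batch log lines, hash-chain the batches, aggregate duplicates."""

import hashlib
import json
import time
from collections import Counter

class Connector:
    def __init__(self, interval: float = 1.0):
        self.interval = interval
        self.buffer = Counter()          # line -> count, for aggregation
        self.prev_hash = "0" * 64        # start of the hash chain
        self.last_flush = time.monotonic()

    def ingest(self, line: str) -> None:
        self.buffer[line] += 1
        if time.monotonic() - self.last_flush >= self.interval:
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        records = [{"msg": m, "count": c} for m, c in self.buffer.items()]
        payload = json.dumps(records, sort_keys=True)
        # Chain each batch to the previous one so tampering is detectable.
        digest = hashlib.sha256((self.prev_hash + payload).encode()).hexdigest()
        self.ship({"prev": self.prev_hash, "hash": digest, "records": records})
        self.prev_hash = digest
        self.buffer.clear()
        self.last_flush = time.monotonic()

    def ship(self, batch: dict) -> None:
        # Placeholder for sending to the central manager (syslog, HTTPS, ...).
        print(json.dumps(batch))

if __name__ == "__main__":
    c = Connector()
    for _ in range(100):                 # simulated port-scan noise, aggregated to one record
        c.ingest("firewall: DENY tcp 10.0.0.5 -> 10.0.0.9:445")
    c.ingest("sshd: Failed password for raffy from 10.0.0.7")
    c.flush()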
Then you have the SIEMs, which have a very similar sort of architecture, where you get the data in raw form, you have the connectors that parse it; but at this point, all the data needs to be parsed and understood. So you need to have a connector for each and every data source that you have. And if you don't parse, or if the connector doesn't understand certain types of logs or certain log entries, generally it will just discard them and put them in some error log that no one is ever looking at. I had a case, a year and a half ago, I was on site at a large car manufacturer
and I was helping them. They were really mature
in terms of using the tool, they really understood
it. I was really excited to go there, to work
with them on the advanced use-cases and when
I started looking at things, I'm like, "Guys,
you're dropping about 40% of all of your Cisco
router messages because they're not parsed
correctly." And if you think about that, it
can have incredible impact on the whole--down
the line. If you don't see them ever, you
miss all kinds of stuff. Potentially, attacks
that--are just going by and you have never
seen them. What then happens, generally, is there's a relational database schema that the data's being dumped into, with all the pros and cons of that. And then the SIEMs usually take all kinds of external information into account, so they can connect live to a CMDB where you have the configuration of a machine, or to an asset database. Or they can connect to identity management so that you know that these users are all the same. So if, on one system, the account is rmarty and on the other one rraffy, they can correlate all that and I can get additional information on the identities of the users.
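A toy sketch of that identity-mapping step might look like this; the lookup table standing in for the identity management feed, and all the names in it, are made up.

"""Collapse per-system account names into one identity before correlating."""

# Hypothetical feed from identity management: (system, account) -> person
IDENTITY_MAP = {
    ("windows", "rmarty"): "Raffael Marty",
    ("unix", "rraffy"): "Raffael Marty",
    ("mail", "raffy@example.com"): "Raffael Marty",
}

def resolve_identity(system: str, account: str) -> str:
    """Return the canonical identity, or fall back to the raw account name."""
    return IDENTITY_MAP.get((system, account), account)

events = [
    {"system": "windows", "user": "rmarty", "action": "login", "status": "failure"},
    {"system": "unix", "user": "rraffy", "action": "login", "status": "failure"},
]

# Correlate on the resolved identity instead of the raw account string.
for e in events:
    e["identity"] = resolve_identity(e["system"], e["user"])

failures_per_person = {}
for e in events:
    if e["status"] == "failure":
        failures_per_person[e["identity"]] = failures_per_person.get(e["identity"], 0) + 1

print(failures_per_person)   # {'Raffael Marty': 2} -- both accounts counted as one person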
So I mentioned this already: there's generally a fixed schema, it's a relational database, there's a fixed number of types and fields. Now, what happens if you get a new data source? A good example was, I think, when we looked at email systems at some point and there was no field for emails. But we had a username field, so we're like, well, emails are very close to usernames, so you just start overloading these fields, right? So certain connectors now just parse the email into the user field. Well, if you have certain rules that look at usernames and do certain correlation on that, you might suddenly get correlation rules that trigger on all kinds of stuff that you're not intending. So it's just a hack that you start overloading; so how do you do that in a relational database? Usually these clusters get really--like if you have to scale this stuff, it gets really expensive. And it's also really hard to set up the database itself for making historical queries fast. What they usually do is they try to optimize for insert speed, and then the queries are extremely, extremely slow. And there's a bunch of tools out there that are just horrible at querying the data. Now, there are also
benefits to doing this and that's really what
the premise was when these things were built.
One of the things is that you can actually
do real time correlation now based on parsed
data that you can extract. So I can say I
looked at a certain username or a certain
source IP address and say, "Well, if I see
that hitting my server 15 times, then I'm
going to alert right away." I can do statistics
on things, so kind of that correlation theme.
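A minimal sketch of that kind of real-time correlation rule--counting events per parsed source address and alerting at a threshold of 15--could look like this; the field names and the threshold are illustrative.

"""Stream already-parsed events through a simple threshold rule."""

from collections import defaultdict

THRESHOLD = 15
counts = defaultdict(int)   # (src_ip, dst_ip) -> number of events seen

def correlate(event: dict) -> None:
    """Feed one parsed event into the rule; fire an alert at the threshold."""
    key = (event["src_ip"], event["dst_ip"])
    counts[key] += 1
    if counts[key] == THRESHOLD:
        print(f"ALERT: {event['src_ip']} hit {event['dst_ip']} {THRESHOLD} times")

# Simulated stream of parsed events.
for i in range(20):
    correlate({"src_ip": "203.0.113.7", "dst_ip": "10.0.0.5", "name": "failed login"})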
The other interesting thing is now you have
a unified language you can define to access
all of the data. So you know that my source
addresses are always called a certain way--they--I
can address them right away. So it's easier
to write correlations, for example. Now, what
happened--and probably Splunk kind of pioneered this--is they said, "Well, why do we either have full-text search and no access to fields themselves, or you have the RDBMS with all the problems that you have associated with that, with slowness and all that kind of stuff? Why don't we just index all the data and then you can do a full-text search?" So you search for denied, and it basically uses the index to get all the [INDISTINCT] results and gives them back to you. Well, if you want to have field extraction and parsing done, what you do is you get all these results and now you apply your parsers on the results. And then you look at which ones actually match the field you're interested in and you throw the rest away. And that's really what Splunk started
doing and now a lot of the tools out there
are starting to do it the same way, which
makes a lot of sense, but it still doesn't
enable your real time correlation. So you
still have to kind of either fork your data;
one part goes into the index, and on the other part you do the real-time correlation by putting all the parsers in there, applying that, and then running your correlation rules. So that's kind of what's happening now: people started going with these sorts of hybrid approaches.
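Here is a small sketch of that search-time approach: a cheap full-text filter standing in for the index lookup, then a regex parser applied only to the matching results, with unparsed lines simply dropped. The log lines and the pattern are invented for illustration.

"""Full-text filter first, field extraction at search time."""

import re

RAW_LOGS = [
    "Oct 12 10:01:01 fw1 kernel: DENY tcp src=203.0.113.7 dst=10.0.0.5 dpt=445",
    "Oct 12 10:01:02 web1 httpd: GET /index.html 200",
    "Oct 12 10:01:03 fw1 kernel: DENY udp src=198.51.100.2 dst=10.0.0.9 dpt=53",
]

# Step 1: full-text search, the way an index would narrow things down.
candidates = [line for line in RAW_LOGS if "DENY" in line]

# Step 2: apply the parser only to the candidates, extracting fields.
PATTERN = re.compile(r"DENY (?P<proto>\w+) src=(?P<src>\S+) dst=(?P<dst>\S+) dpt=(?P<dport>\d+)")

events = []
for line in candidates:
    m = PATTERN.search(line)
    if m:                       # lines the parser doesn't understand are thrown away
        events.append(m.groupdict())

for e in events:
    print(e)   # e.g. {'proto': 'tcp', 'src': '203.0.113.7', 'dst': '10.0.0.5', 'dport': '445'}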
But having learned all of this over time, the nice thing is also that now there are some companies out there doing this that are starting to use a different approach. They're starting to use more scalable things for all the [INDISTINCT] data movement; now you go to column-based databases, you use all these new technologies. And I'm fairly sure there's going to be a new wave of tools coming out on the market that actually use these new technologies that have kind of been discovered in the last couple of years. Now, another thing that's
interesting to know, and something that these SIEMs are doing, is they're using categorization, or tagging, however you want to look at it, to basically talk about the data. So if you want to look for failed logins, for example, across any kind of data source you have, you would have to have this crazy query and say, well, Windows reports a security:538 event, or in UNIX you have the sshd authentication failure, and in some cases it's actually called sshd failed password. So you have to know all these different ways of how these tools talk about a failed login. And maybe if you add a new tool, you have to go and do that again, and in every correlation rule you would do that again; you always have to spell that out. So what people started doing is you basically define sort of--it's called a taxonomy; it's not really a strict mathematical taxonomy, or you can call it a categorization schema--where you basically map all of your events into this sort of N-dimensional space. What a lot of people are using now is that you have sort of an object that you talk about--for example, a file, or a host on a system, or a service or something--then you have a behavior, an action: what's being done to it. There's a login, it's created, it's deleted. And then you have a status: was it successful or did it fail? And then suddenly you can write queries that are much nicer. You can say, "Well, show me all the objects that have to do with authentication, where the action is a login and it was actually a success." But then you can also say, "Well, show me all the file objects, or all the events that have to do with the file object, that actually failed." So I get all the failures for files. It's very interesting.
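A toy version of such a categorization table and the nicer queries it enables might look like the following; the event names and the (object, action, status) entries are illustrative, not a real vendor taxonomy.

"""Map product-specific event names onto a small (object, action, status) taxonomy."""

# Raw event name -> (object, action, status); entries are made up for illustration.
TAXONOMY = {
    "sshd failed password":        ("account", "login", "failure"),
    "sshd authentication failure": ("account", "login", "failure"),
    "win security logon failed":   ("account", "login", "failure"),
    "fs file deleted":             ("file", "delete", "success"),
}

def categorize(event: dict) -> dict:
    obj, action, status = TAXONOMY.get(event["name"], ("unknown", "unknown", "unknown"))
    return {**event, "object": obj, "action": action, "status": status}

events = [categorize(e) for e in [
    {"name": "sshd failed password", "user": "rmarty", "host": "web1"},
    {"name": "win security logon failed", "user": "rmarty", "host": "dc1"},
    {"name": "fs file deleted", "user": "svc", "host": "web1"},
]]

# "Show me all the failed logins," regardless of which product reported them,
# combined with a regular field like the username.
failed_logins = [e for e in events
                 if e["action"] == "login" and e["status"] == "failure"
                 and e["user"] == "rmarty"]
print(failed_logins)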
And then, obviously, in the queries you can combine searching on these taxonomy entries and regular fields, so you can say, "Show me all the logins from user rmarty," and I will get everything that way. This approach scales a little better; the problem is someone has to build these mapping tables. For example, at ArcSight we had a team of, at this point, probably three or four people that basically maintained this table and added new events to it; as soon as a new data source is supported, you add the new entries there, you map it all. If there are changes to different tools, it tends to be hard because you have to get all these updates all the time. You have to do it in a timely manner, and the problem is, if you don't keep these things up to date--really up to date in production--then you start missing information or events again. If you look for failed logins and there's a new event that talks about that on Windows, and you don't update your mappings, you're going to lose them. Okay,
now the last thing here is--what I quickly
want to mention is what we see happening now
is that everything--a lot of these services
are moving to the Cloud as this--or a lot
of these applications or enterprise features
are moving to the Cloud as a service. And
it makes a lot of economic sense, especially for smaller companies; for bigger installations, there's always the question of whether these services can actually support you. So if you were starting to send me a terabyte of data today, it's probably not something I want. But there's a lot of smaller players that have a few machines and don't really want to set up their own thing, so it makes a lot of economic sense for them. The interesting thing for us, for example, is that we're really elastic, so if we get more people, we just add more machines and it scales really well, versus a regular log management tool that at some point hits some kind of limit because of the way it's architected. The other thing that we're really big on is providing an open platform with all kinds of APIs; you can search the data, you can manage all of your data, all your objects in the system. So that's another trend I'm seeing, with all kinds of enterprise application services that are really open so you can interact with them and automate things. So here's what we do,
and automate things. So here's what we do,
you know, in kind of a schema--it's fairly
simple. You basically retake the data and
either it's syslog or its HTTP posts. A lot
of times in--for example, Google App Engine,
right, you don't have access to sockets so
I can't really open a syslog connection to
someone but what I can do is I can do a web
post. So we accept web post where you sent--can
send the log files. There's all kinds of overhead
and inefficiency is associated with that,
but at least you can get the logs out and
in to us. So we get the data, we write a copy
to an archive so that the customer can actually
get to it untouched and the rest goes into
our index. And they're using Solar Cloud to
index all of the data in full text and then
store it. And then you have APIs to access
the data, we have a wed interface. That's
very interesting because it's basically a
shell inside of your web browser so you actually
interact as if you type. And it's--there's
was a Google mash up, I think it's called
Goosh, a Google shell or something. So it's
kind of mimicking that kind of approach. And
then--yes, I think that's it. Now, I get asked
by people often what tool are we going to
use, should we build it ourselves, should
we use log management tools. And this slide,
I basically built it because a couple of weeks
ago, this guy came to me, he's like, "Well,
we have this log management tool and we're
doing this and this and we want to--we want
to use it for this use-case of looking at..."
I forget what it was, it was one very specific
data source and a very specific use-case of
looking at that data. And he's like, "Can
we make our log management tool do this?"
I'm like, "Why would you even do that? That
doesn't make any sense. Your cost is going
to be so high buying a license for that because
you pay by the volume. Just build your own: use a database for it and parse the data. You have one data source that you need, you can build a parser for that, and it's probably one regex. It's a fairly structured log; just store it in a database and then build a little web interface on it to address exactly that use-case, build your mash-up, use some visualization, an API to generate an image from that." And that's really all he needed. So oftentimes,
you have to think about how many different
use-cases do I have, how many different data
sources do I have. And then you can kind of
map yourself into here--and you guys have
access to the slide, you can look at the details
yourself if you want. So don't fit a square
peg in a round hole. So what's working and
what's not? And I think I already mentioned
a bunch of these before. I think log collection is working today; we can get the data into a central location. We can alert on previously known patterns. We can't really alert on the more intelligent things yet, I believe. And we can solve specific known use-cases for known data sources; but as soon as you have sort of the more dynamic data sources, it's getting kind of iffy whether we can actually deal with it. So what's not working is the way bigger list I have. And this is sort of the beef I have with the whole log management industry. And I think people are not necessarily taking this very seriously. One of the huge, huge problems is that there are so many log
formats out there, they're not documented,
they're not standardized. The standard we're working on is still not released. We've
been working on this for years at this point.
We have all kinds of entities involved. And
if you take two very big operating system vendors and you let them talk about how to standardize log files, you get two really interesting points of view. There are no guidelines out there on how to log, and I think this is a huge problem we have. It's sort of analogous to the security problem, right? Like, we're trying to make things more secure, but you have to go to the source--to the programmers--and actually teach them secure coding; you have to educate them. It's exactly the same thing with logging. You need to teach people early on, like, here are some guidelines and here's how you do it. It's not super complicated; but come up with these guidelines, they're going to help you so much. And I'm fairly sure you guys here actually do that, where you [INDISTINCT] how sort of a logging framework works. I think parsing is completely broken.
And we talked about that. You have regular expressions you have to write, and they're always wrong, and they don't scale, because now you have a different log format coming in. I think we need to find something else there. And then we get into the slightly more hidden problems that a lot of people don't talk about, don't even know about. I think normalization is broken. And what I mean by that is that some log files report an IP address, and other logs report a fully qualified domain name or hostname. How do you correlate that? Well, there's a couple
How do you correlate that? Well, there's couple
of ways to do this. You can say, well, when
I collect the data I map everything into either
an IP or a hostname, right? Let's say you--mapped
into a hostname; so you do a DNS look up at
that point. Well guess what, it's really slow.
A DNS look up takes a lot of time for the
result to come back. And if you have real
time streams coming in, that's--that can be
a problem. Now, if you did it later, the problem
is now the DNS might have changed. You don't
know what the original DNS was. And make sure
you used the right DNS server to actually
resolve; because maybe, depending on which
DNS server you use, you kind of get a different
result. So there's interesting problems there.
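One hedged sketch of the resolve-at-collection-time option, with a cache and a fallback so a slow or failed lookup doesn't silently break the pipeline, might look like this; a real pipeline would need an asynchronous or time-limited resolver, since gethostbyaddr can block.

"""Normalize hosts for correlation: map IPs to hostnames, cache the answers."""

import socket
import time

_cache = {}                      # ip -> (hostname, resolved_at)
CACHE_TTL = 3600                 # re-resolve after an hour; DNS can change

def normalize_host(value: str) -> str:
    """Map an IP address to a hostname if we can; pass hostnames through."""
    try:
        socket.inet_aton(value)  # raises if it's not a dotted-quad IP
    except OSError:
        return value.lower()     # already a hostname
    cached = _cache.get(value)
    if cached and time.time() - cached[1] < CACHE_TTL:
        return cached[0]
    try:
        host = socket.gethostbyaddr(value)[0].lower()   # reverse DNS; can be slow
    except OSError:
        host = value             # lookup failed; keep the raw IP
    _cache[value] = (host, time.time())
    return host

print(normalize_host("127.0.0.1"))        # usually "localhost"
print(normalize_host("web1.example.com")) # passed through unchanged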
And a lot of the SIEMs are actually doing it at correlation time; they're not doing it in the beginning, but they start
doing it later. Then you have the whole username problem: mapping different users and identities to the same entity. And that's not just username-specific; there are all kinds of entities that you address with different names or in different ways, but how do you know they're the same? This whole categorization/taxonomy thing--or tagging, I guess--is interesting. I think it's a good approach to sort of abstract certain things, but it's always out of date. How do you update that? If you have a new signature, like a new IDS signature coming out, you need to update that thing immediately. You can't wait for the vendor to get it done; and usually they don't do it right away, so you need to have subscription models to do that and push things down. And the next problem I see: at ArcSight, we were big on having this--we called it the "Threat Level Formula." For every event, you calculate how important it is. If it's a ten, you have to look at it; if it's a one, forget about it. There's no formula that works. We
tried everything. We looked at context, we looked at the role of the machines, the history of the machines. We looked at the importance of the event itself when it comes in. Incredibly complicated, and it's never right. And I haven't seen a way that works. And then finally, anomaly detection, I think, is voodoo. I've read a lot of research on anomaly detection in log files, and oh my goodness, we could talk at length about that. I haven't seen anything that really works in dynamic environments and dynamic use-cases.
I think one of the other core problems is
also that we don't understand the data. If you're a security analyst, you're sitting in front of a screen and you get all of these data sources feeding in. You might understand your security products, but what happens if you suddenly get a feed from an application? You have no idea what that application exactly does. You have to understand fairly intrinsically how that application works to actually make a determination of what's going on, unless you have really good developers and guidelines, or the logs are actually very descriptive and say, hey, if this happens, we have a security problem. So you have to think about these things when you come up with your guidelines.
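As a hedged illustration of what "very descriptive" logging guidelines could produce, here is one possible convention: the application emits one key-value line per event and states the security relevance itself. The field names are just one possible scheme, not a standard.

"""One possible self-describing, easily parseable application log line."""

import logging

logging.basicConfig(format="%(asctime)s %(message)s", level=logging.INFO)
log = logging.getLogger("shop")

def log_event(object_, action, status, severity, **fields):
    # One line per event, key=value pairs, no free-form prose to regex apart.
    kv = " ".join(f'{k}="{v}"' for k, v in fields.items())
    log.info("object=%s action=%s status=%s severity=%s %s",
             object_, action, status, severity, kv)

# The developer, who knows the application, states the security relevance:
log_event("account", "login", "failure", "high",
          user="rmarty", src_ip="203.0.113.7",
          reason="5th failed password in 2 minutes")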
And there are really no tools out there that support this notion; I think this is absolutely crucial to do. So what do we need in the future? Well, there's definitely more and more data, and the current architectures of these log management tools just don't cut it. So we need to use new approaches, big data stuff, to actually deal with these problems over large data. We're moving more and more into the application layer; people are really interested in monitoring business-layer logic. It's not just about the infrastructure anymore. I'd even say, forget the infrastructure, look at the applications, right? You can detect most security problems up there. I'm exaggerating a little here, but that's where the money is. So how are you going to do all of the parsing? Are we going to have guidelines? How do you do all that? And then, we need to have something that helps analysts, or the people looking at the logs, understand things. And my take there is that, a lot of times, it's not really that I need a specific security analyst looking at my application data to understand what's going on; I need to give the application developers access to their logs and tools so they can very quickly look at them. And if they find something, then they can maybe escalate it. But we need to make that a priority. Okay. So that brings us
to the more fun part here. Let's talk briefly
about data visualization. I think we have
some people here that know about that topic
much more than I do. But when I wrote my book
on--it's called "Applied Security Visualization,"
one of my goals was that, I sort of see this
dichotomy of having security people. They
understand networks and security and policies
and all kinds of technical details really
well. That--that's where I'm coming from.
And then, not that I understand it very well,
but that's where I grew up. But then, you
have the visualization people. And these people
know all kinds of things about perception,
and they know how to make graphs that actually
work, things you can look at and interpret
and make them interactive and all that kind
of stuff. And I feel there's this huge gap
between these two groups. And we need to bring
them together. And my approach was like, well,
how can I educate security people to give
them the minimum knowledge of these visualization things, things they can apply right away?
So I went through a whole bunch of university courses and books and just tried to extract what you need to know about visualization. I'm not interested in how the eye works and optics and all that stuff; that's usually what you find in these courses. I was interested in: how do you make good displays? How do you use color in a good way--like if you have color palettes or schemas, for example? Or some of the principles that Edward Tufte talks about, like reducing non-data ink. Things that are fairly--like, if you look at them, you're like, that makes a lot of sense. But we need to teach technical people how to do that. It's not complicated; but that's the stuff that people should know.
So when I look at visualization, I usually--there's
sort of four important things that I think
we can do with it. It's--one is exploration
and discovery. A lot of people come to me and say, well, why don't you just write an algorithm, a tool, a script that does this? And I'm like, well, you don't know what you're looking for yet. You first have to explore your data set, figure out what you want, and then you can code your thing. But first, you have to get an idea of what's really going on. Then, you can use it for answering certain questions--certain things you just answer much quicker if you have an image that you look at and you're like, oh yes, that makes sense; look, we have all these kinds of traffic going on--instead of writing a tool that does some analysis and comes up with something, because we can use our brain to interpret these things. And often, probably one of the biggest use-cases is communication. If I want to tell you what happened, I can give you an image and say, look here, this machine talked to all these others, and then here we had these things going on, and here's a cluster of something. It's much easier to communicate with images, and it can help you in making decisions. So if you look at
the security visualization field--I'm focusing a lot on security here because that's what I know most about--I think we are absolutely nowhere with security visualization. We've had a few years now where people started talking about it, but we're nowhere. Visualization is generally an afterthought, right? You start collecting the data and you're like, oh, I have all these use-cases and I'm going to write the correlation rules and this and that and reports. But visualization, in the end, is always this afterthought. People start building these tools and then they're sort of plopped on top of an existing solution; they're never the main goal. I talked about the dichotomy. And then, the tools we have out there lack basic capabilities. They're really rudimentary in what they can do, I think, in terms of what we need as security analysts. So here are some quick concepts, things I think are extremely important if you're building a security visualization tool, or a visualization tool in general. One thing is a concept that Ben Shneiderman sort of came up with, which is basically
you have--if you look at data, then usually
one display or one look at the--at the data
doesn't really solve your problem right away.
So if I asked you--or if you asked me to visualize
some of your NetFlow data, I can't just give
you one display and you're like, oh, yes,
that actually--that solves it for me. But
generally, what you want to do is you want
to look at an overview to get an idea of what's
happening. And then, in this graph here, you
see there's two spikes in there in the top--the
two top bars, and then there's some gap. So
that might be what I'm looking for. So I can
then zoom into that, potentially using a completely different display form. I might use link graphs for the next analysis step to see what's really going on in that kind of data. And then, from there, I might see, oh, there's a certain cluster here--what really happened here? And generally, I have to go back to the original data; I might have to pull up my pcaps or my NetFlows to actually understand what's really going on in the data.
So these three things, people need to keep
this in mind that it's really a process to
go through. Some of the things that are interesting
when I--when I try to look at security data,
what I like to have is simultaneous views
on the data--for example, I'm looking at the source IP address distribution, destination IPs, and then I'm looking at maybe the categorization schema: what do I have in terms of what objects are impacted, what are the statuses, what are the actions that are being done on this data. So I can very quickly kind of pivot around. And hopefully, this is interactive, so that I can select certain IP addresses. And I think it's the next one--also here, you use a dynamic coloring schema. I can say, well, I want to look at the top source addresses, and now, as the color, I want to use the severity of the events, or I want to use some other field that I have in my data, so I can very quickly sort of pivot around and understand what's going on in my data.
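A toy sketch of those simultaneous, linked views: the same events summarized three ways, a selection in one view re-filtering all of them, and a severity field driving the coloring. The events and field names are made up.

"""Simultaneous views, linked selection, and color-by-severity on toy events."""

from collections import Counter

events = [
    {"src": "203.0.113.7", "dst": "10.0.0.5", "action": "login", "severity": 8},
    {"src": "203.0.113.7", "dst": "10.0.0.9", "action": "scan",  "severity": 3},
    {"src": "198.51.100.2", "dst": "10.0.0.5", "action": "login", "severity": 5},
]

def views(evts):
    """Build the three 'linked' summaries from the same underlying events."""
    return {
        "by_source":      Counter(e["src"] for e in evts),
        "by_destination": Counter(e["dst"] for e in evts),
        "by_action":      Counter(e["action"] for e in evts),
    }

print("all data:", views(events))

# "Select" one bar in the source view; every other view updates to the selection.
selection = [e for e in events if e["src"] == "203.0.113.7"]
print("selected 203.0.113.7:", views(selection))

# Dynamic coloring: color each source by its worst severity.
color_by_severity = {src: max(e["severity"] for e in events if e["src"] == src)
                     for src in {e["src"] for e in events}}
print("color scale input:", color_by_severity)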
So here are just the different examples of
dynamic coloring. And then, something that
I find incredibly useful is linked views. So you have your different views on the data, and if I select one of those bars, or one of those sectors in the pie chart, I actually see all the displays updating and I see my selection across all of them. So I can very quickly pivot around to figure out what happened in my dataset; and hopefully, I can then filter things out and say, oh, that's interesting, let me get rid of the rest, or let me get rid of exactly this, so that I can interact with the data. And the tools that actually--the screenshots here are from business intelligence tools; this is ADVIZOR Solutions, there's Cognos out there, there are these [INDISTINCT]. There are all these tools that support this, but they're not really built for our security use-cases and our security data sources. So that's always an interesting problem. And then, probably
the biggest--the most important principle
that I find is really--what Edward Tufte said,
is reduce your non-data ink. So the left-hand graph is the same as the right-hand one; but it uses all these different things that clutter the display: the background, the grid, the bounding box, the three-dimensional pyramids. You can reduce all that; but you also see I added some things--the labels on the x axis were not there. So what are these different pyramids showing? Well, you want to have labels; so you don't always want to just remove things, you want to have the information in there that you need. And then, maybe the second most important thing is to use the right way of displaying your data. 3-D pie charts might be really cool looking, but this one is so bad. If you can read the labels--you can barely see it--the red slice there is kind of the small one that has the least height, and it says 108,000; and then the next one is 126,000. If you look at the sectors' width, the red one is bigger than the gray one; but it's 108,000 versus 126,000. So what is that? And how big is the yellow one in that case? Well, if you sort of tilt it to the side and look at it from the side like a bar chart, then it makes sense. But if you tilt it and look at it as a pie chart, it doesn't make any sense anymore. So use the right chart for
your problems. A lot of people have very strong aversions to pie charts. And this is what I found from [INDISTINCT], but I'm not sure if you can read it. Probably you can. Yes. So in general, I also believe that pie charts are often not the solution you want to use. If you just have two numbers and you want to maybe compare them, percentages or something, maybe yes. But if you have more, the slices all look the same, you can't really tell which one is bigger, you have to put labels in there. Why not use a bar chart that shows it much, much quicker and nicer? And then, something that a lot of people--especially junior analysts--are to blame for is they jump to conclusions very, very quickly. And you
always have to sort of question your entire
data collection or the entire data processing
process. And here, this is a network graph
that I created very early on when I was playing
with visualization on security data. This is actually from a honeypot I was collecting data from. And I got this graph and I was super excited; I'm like, we have an attack going on, there's malware in our honeypot. Until I started looking at the data a little more and tried to understand what was going on. So, I'm not sure if you can really read the labels here, can you? Sort of? So, what do you think is going on here? I'm already giving you the "careful" sign, so it's probably a little more obvious, but a lot of people say, "Well, there's something scanning for different services." And what you see here is: the circles are destination IP addresses, so machines, and then the squares are ports that are accessed on those machines. A lot of people jump to the conclusion, "Oh yes. Wow, we have all this different activity there--the malware is trying to scan us on these ports, trying to figure out whether we have something running." Well, the problem here, when I investigated this a little more, is that there's something I call "the source destination confusion," where these destination ports--a lot of them are actually source ports; and what happened at parsing stage was a mistake where someone assumed that the first column is always the source address and the second one is the destination [INDISTINCT], and the same for the ports. But it so happens that if you look at a client-server communication, that's true for the first, the third, the fifth, and so on--the packets from client to server--but from server to client, it shows you those things inverted. So if you look at like a tcpdump, for example, that's exactly what happens to you: every second one is the wrong way around. So, if you look at things, question the process before you jump to conclusions--"Ah, we have an attack going on." And I see that way too often. So, here's
some examples that I collected and generated over time. This is a couple of pictures I have; I visited the Norwegian CERT--they're also the secret service up in Norway--and I taught a class up there on visualizing security data. What they have is this security operations center where they monitor all of the critical infrastructure in Norway--so the power grids, the water, and everything--in terms of IP traffic. So they can see very quickly if there are spikes in certain traffic in certain sectors, like in the water sector for example, or in gas and oil; so they can see attacks across the critical infrastructure. Now, usually when I go into these security operations centers, you hand in your phone and your camera, and if you have a laptop with a camera, you can't bring that in. So they're very strict, and when I asked them whether I could take pictures of their software, they're like, "Yes, absolutely." You know, I was a little baffled, and--what they have is this red button on the desk, and they press that, and what happens is it anonymizes all the screens. So it's all random data; it still shows the real displays, but the data is completely random. So on the top left you have different netflow sensors out there; it's a completely random number of sensors that they show, it's completely random information. But they found that so many people are interested in what they're doing that they want to support that--they want to promote what they do and want people to be able to show pictures or screenshots of it. When I walked around in their office, I found another display, another dashboard, and that one is actually really interesting, because the right hand side on here, I thought, was the most important information that they were broadcasting in their office. If you understand German or Norwegian: Kaffeekanne, it's a coffee pot. They actually have two webcams in front of their coffee pots to measure how much white versus black they see, and they can tell you how much coffee is left in the coffee pot. So that's sometimes important for security as well, to keep your analysts awake. Then--so, here are some actual examples
of security data, or security-relevant data. This is a graph I didn't create; this was Chris Horsley down in Australia.
And what he did is he takes kind of a similar setup to the Norwegian CERT, but they look at a whole bunch of netflow sensors out there, and they're trying to figure out what's just generally going on. And what he does here is he's using a treemap where he's showing the amount of data by sensor and by port number. So the sensors are in color--and I'm not sure if you can really see it down here; on the TV there it's actually really good. So you see, like, blue up on the top left is all one sensor, then you have the red. Then the little rectangles in here are different port numbers, so if you have large rectangles in there, it means there's a lot of traffic happening there. Now, the ones that are sort of highlighted here are port 4445 across all of these different netflow sensors, so there's something obviously going on. Now, the interesting thing here, I thought, what Chris did is he's using the brightness of these rectangles to show the variance over time. So, basically, he looks at how much traffic did I get in the last five minutes versus the last hour; and if he got a lot, then it's much brighter. So changes that are temporally very close are much brighter--so there's a lot of activity going on somewhere in here and a little bit here. So you see different things popping out. And if he sees a certain port across all of these netflow sensors suddenly being really bright, he knows there's an attack happening across all of the different sensors and not just locally to one of them.
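A small sketch of the metric behind that display: for each sensor/port cell, the size is total traffic over the last hour and the "brightness" is the share of it that arrived in the last five minutes, so a sudden burst lights up. The flow records are fabricated; his actual setup will differ.

"""Compute treemap size and recency 'brightness' per (sensor, port) cell."""

import time
from collections import defaultdict

NOW = time.time()
flows = [  # (timestamp, sensor, dst_port, bytes)
    (NOW - 30,   "sensor-blue", 4445, 90_000),
    (NOW - 200,  "sensor-blue", 4445, 10_000),
    (NOW - 2400, "sensor-blue", 80,   50_000),
    (NOW - 100,  "sensor-red",  4445, 70_000),
]

size = defaultdict(int)          # rectangle size: total bytes in the last hour
recent = defaultdict(int)        # bytes in the last five minutes

for ts, sensor, port, nbytes in flows:
    age = NOW - ts
    if age <= 3600:
        size[(sensor, port)] += nbytes
        if age <= 300:
            recent[(sensor, port)] += nbytes

for cell, total in sorted(size.items(), key=lambda kv: -kv[1]):
    brightness = recent[cell] / total if total else 0.0   # 0 = quiet, 1 = all traffic is new
    print(f"{cell[0]:12s} port {cell[1]:<5d} size={total:>7d}  brightness={brightness:.2f}")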
So, I thought it was an interesting way of using treemaps and colors. Here's another
example that I was able to bring home from Norway; it was just a fun example. One of the guys ran a firewall at home, and what you see here is time and then the port number,
and he visualized it basically as: whenever there's activity, you draw a dot; if there's more activity, you just make it brighter, or use a different color--a color scheme for that. And you see there's some scanning going on over here on different port numbers, and you have some horizontal lines; but there were these two guys that knew what this guy was doing, so they injected traffic into the firewall and out came their two faces. I thought that was pretty cool.
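A quick sketch of that time-versus-port view, with dot intensity driven by the amount of activity, might look like this (using matplotlib, with invented records standing in for parsed firewall logs):

"""Draw activity dots by time and destination port, brighter where there are more hits."""

import matplotlib.pyplot as plt

records = [  # (hour_of_day, destination_port, hits)
    (0.2, 22, 1), (1.1, 22, 3), (2.5, 445, 12), (2.6, 445, 30),
    (3.0, 80, 2), (5.4, 3389, 8), (5.5, 3389, 25), (9.9, 53, 1),
]

hours = [r[0] for r in records]
ports = [r[1] for r in records]
colors = [(0, 0, 0, min(1.0, 0.2 + r[2] / 30)) for r in records]  # RGBA, alpha by hit count

fig, ax = plt.subplots(figsize=(6, 3))
ax.scatter(hours, ports, c=colors, marker="s", s=16)
ax.set_xlabel("time (hour of day)")
ax.set_ylabel("destination port")
ax.set_yscale("log")             # ports span a wide range
plt.tight_layout()
plt.show()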
There's other ways of--of analyzing or looking
at firewall data and I'm going to cut these
exams a little short in the interest of time.
Here, what I'm using is just a treemap and
I'm looking at the source address as the outer
most box so everything on the left hand side
here this whole--what is that about a third,
it's a little more, is all one source IP address
connecting to multiple different destinations
that are--that's the next one here and then
on different ports. So, the port number is
in--in inside and on the left hand side what
you see--and if it's red, it's blocked and
if it's green, it was passed. So, what you
see on the left hand side, there was this
man-machine going to a lot of different machines
on my network on--usually, there's just one
port here was echo request. I'm putting that
into the port number category here. So, there's
a whole lot of pings that came in that were
blocked but then what happened on the top
left, some pings came in were blocked, but
other--some pings were passed to the green
square--parts up there. And right after there
was a connection attempt of port 135. That's
a very common behavior for some of the worms
out there back when I generate this example.
The thing with ping machines if they find
one, they start connecting on the windows
share to try to see whether they can access
it and exploit it. So, fortunately the firewall
seemed to be okay--configured. But my question
here would be, well why do some machines allow
pings to come in? So, that's very quickly,
I can see that. The other thing that probably
I want to investigate is this here that--that
really stick out, what's this 427 traffic
that has been blocked? It turned out it was
a missed configured MAC that was trying to
talk Bonjour or the Bonjour Protocol and say,
"Hey, I'm here, talk to me, talk to me." Fortunately,
the firewall blocked it and didn't let it
out to the world. So, it's an interesting
way of looking at a lot of data and kind of
prioritizing or understanding, exploring what
the landscape of your firewall data is.
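One way to build that kind of hierarchy--source, then destination, then port, colored by the firewall action--is with a treemap library that understands paths. A minimal sketch, assuming a hypothetical fw_log.csv with src_ip, dst_ip, dst_port, and action columns and the plotly package:

```python
# Hedged sketch of the firewall treemap: rectangle size is the number of log
# entries, color is whether the firewall blocked or passed the connection.
import pandas as pd
import plotly.express as px

fw = pd.read_csv("fw_log.csv")
agg = (fw.groupby(["src_ip", "dst_ip", "dst_port", "action"])
         .size()
         .reset_index(name="events"))

fig = px.treemap(
    agg,
    path=["src_ip", "dst_ip", "dst_port"],  # hierarchy: source > destination > port
    values="events",                        # big boxes = chatty sources
    color="action",                         # blocked vs. passed
    color_discrete_map={"block": "red", "pass": "green"},
)
fig.show()
```
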
Here's a way where I use sparklines for port
numbers, source IP, and destination IP. You
just see kind of the trend over time, and
you see again on the
port numbers for example there's a whole bunch
that have just like one little tick here,
and I am really curious what that is. It might
be sort of a source/destination confusion;
it might be something else that happens every
now and then, maybe there's someone scanning
me very, very slowly. I might see other patterns
in here. There's a [INDISTINCT] machine that
is very periodic--that 212 machine up there.
It's a very periodic behavior. Why is that?
Is that a DNS server, a backup server,
some kind of scheduled thing going on? So,
it's very interesting to kind of explore time-based
things here.
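The sparkline idea can be sketched as small multiples--one tiny count-over-time line per port--again assuming the hypothetical firewall.csv from above:

```python
# One sparkline per destination port: rare one-off ports and very periodic
# talkers both jump out visually. File and column names are assumptions.
import pandas as pd
import matplotlib.pyplot as plt

fw = pd.read_csv("firewall.csv", parse_dates=["timestamp"])
top_ports = fw["dst_port"].value_counts().head(20).index

fig, axes = plt.subplots(len(top_ports), 1, figsize=(6, 10), sharex=True)
for ax, port in zip(axes, top_ports):
    series = (fw[fw["dst_port"] == port]
              .set_index("timestamp")
              .resample("10min")   # one bucket per ten minutes
              .size())
    ax.plot(series.index, series.values, linewidth=0.8)
    ax.set_yticks([])
    ax.set_ylabel(str(port), rotation=0, labelpad=25, fontsize=7)
axes[-1].set_xlabel("time")
plt.tight_layout()
plt.show()
```
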
Then one of the big problems with intrusion
detection systems is that you have a lot of
false positives and you have to tune your
signatures. So, how do you do
that? What I like to do--what I'm doing here
is I look at the source address as the outer
part of the hierarchy, then you move in to
the destination address, then the signature,
and then I color it by the priority that I've
given these different signatures.
So, I can look for large, dark, or red rectangles
in here to kind of focus my work on, and
I see very quickly the left hand side here
seems to be the stuff that generates a lot
of events so I can start focusing on that.
And often you will see, for example, that
ping-based signatures are triggering a lot,
and then I could go in and say, "Oh well,
I've got to define exceptions because my network
monitoring station is pinging all the machines
in the network and that generated a lot of
data." So you can see these trends and things
very, very quickly and prioritize your work.
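A hedged sketch of that IDS-tuning treemap, assuming a hypothetical ids_alerts.csv with src_ip, dst_ip, signature, and a numeric priority column (higher means more important):

```python
# Hierarchy: source address > destination address > signature.
# Size = number of alerts, color = priority, so large dark/red rectangles are
# the noisy, high-priority areas to tune first. Names are assumptions.
import pandas as pd
import plotly.express as px

alerts = pd.read_csv("ids_alerts.csv")
agg = (alerts.groupby(["src_ip", "dst_ip", "signature"])
             .agg(events=("signature", "size"), priority=("priority", "max"))
             .reset_index())

fig = px.treemap(
    agg,
    path=["src_ip", "dst_ip", "signature"],
    values="events",
    color="priority",
    color_continuous_scale="Reds",
)
fig.show()
```
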
The same with vulnerability scan data, where
basically--I think it's by machine--I'm showing
the different vulnerabilities by port, and
then I associated certain priorities with
those vulnerabilities and said, "Well, the
red ones are the important ones." So I see
very quickly
I have one machine at the top left that's
super vulnerable to all kinds of stuff, the
next one down as well, and then I have one
over here that seems to have a very large
or a very important vulnerability that I should
probably fix, and then there's a whole bunch
of little stuff at the bottom right. But those,
in the end, were probably about 200,000 different
entries in the log files that I had about
vulnerabilities on machines. And you very
quickly see these are the things to focus
on instead of going through and figuring out
what's happening. So what do we need for
visualization in the future? I think I mentioned
some of these things before. We need some
kind of a solution for the entity extraction
or the parsing--I still feel this whole regular
expression-based approach has completely failed.
We need something different. We need dynamic,
interactive displays; what I showed you here
was mostly static. That doesn't really cut
it if you want to investigate something. These
are good screenshots of a certain state of
my investigation that I was happy
with. But if you want to do something, you
really [INDISTINCT] interactive and dynamic
displays. And then something I call computer-aided
intelligence. I don't know, I came up with
that just this morning; I don't know if that's
a good way to put it. But what I mean by that
is, I feel like we don't need to chase these
anomaly detection algorithms that are really
hard to get right. What I think we should
chase is using us--the analysts, the experts--more,
and giving them the tools that they need to
look at the data. Because teaching a tool
that tomorrow at 2 o'clock there's going to
be a maintenance window where a lot of stuff
can go wrong, versus having the analyst who
probably knows it because he saw some email
or something--it's really hard to bridge that
gap, to code all the knowledge that the person
has. So we should build things to support
analysts, so people will look at the data
more, and make it super highly interactive
so that people can [INDISTINCT] around and
really use the data. And then hopefully also
capture the knowledge from these experts when
they work with the systems, so they can say,
"Well, if I have a graph that looks like this,
it's a port scan." Maybe that's something
very simple to detect, but if you have a junior
person coming in, it can help them a lot to
understand that. So capture the knowledge
and make it collaborative so that multiple
people can work on this.
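That port-scan rule really is simple to capture. A minimal sketch--one source hitting many distinct destination ports in a short window--using the hypothetical firewall.csv again and an assumed threshold of 100 ports:

```python
# Flag sources that touch an unusually large number of distinct destination
# ports within a five-minute window. Threshold and column names are assumptions.
import pandas as pd

fw = pd.read_csv("firewall.csv", parse_dates=["timestamp"])
fw["window"] = fw["timestamp"].dt.floor("5min")

fanout = fw.groupby(["window", "src_ip"])["dst_port"].nunique()
suspected_scans = fanout[fanout > 100]
print(suspected_scans)
```
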
So if you want some more information on security
visualization, there is secviz.org, sort of
a portal maintained very loosely. It's really
for people
to submit things they do in visualization
for security purposes. You just submit your
post or your question or whatever. There's
a little library or gallery of graphs in there
that you can look at, from people who have
done work in this area. There's a live CD
that's unfortunately kind of outdated at this
point. But, as Kristen Howard mentioned in
a talk a few months back, the CD was actually
built by a friend--actually one of your co-workers.
He works in Zurich. It's called DAVIX, the
Data Analysis and Visualization Linux, and
it's just a live CD that has all kinds of
visualization tools installed. They're ready
to run so that you don't have to go and compile
them yourself
and then you [INDISTINCT] libraries and all
that stuff. But it's a little out-of-date
at this point, but maybe we can revamp that.
We're looking for people to help us. There's
a Twitter feed, and there's a mailing list
that's absolutely quiet. No one has ever posted
anything on it, but feel free to use it. There's
a bunch of subscribers on there. And with
that, I'm done. I think we're at that point
in the hour--but if there are any questions
or comments, please. Yes.
>> I have a question [INDISTINCT].
>> MARTY: So the question is how does pricing
work--how could pricing work--in a cloud model?
What we do is volume-based. We want to keep
it simple. If you look at the pricing of the
different log management and SIM tools--excuse
me--they use all kinds of different metrics,
like how many data sources do you have, how
many different ones, how many hosts, and for
vulnerability scans it's a little cheaper
because you have a lot of them. We'd rather
just keep it simple. You want to get as close
to utility-based computing as you can. We
have a tiered model with Loggly, whereas we
could do purely utility-based. So whatever
you do, what you get is what you pay for;
that's, I think, the simplest thing to do.
And then sometimes in higher tiers you can
add more features to kind of price that way
also. But volume seems to be the thing to
do. Any other question?
Yes.
>> What's the turn around once they [INDISTINCT]?
>> MARTY: So what we do is basically you send
the logs to us and then we give you the tools
to look at the data again. So it's just fully
indexed data at that point, and we give you
tools to graph it and explore the data. We
don't go and analyze it and tell you everything
that happened in the data. And as for turnaround,
if you're asking what's the time to index,
it's like 10 or 15 seconds and it shows up.
>> [INDISTINCT]?
>> MARTY: Yes, yes, yes. And that's actually
a big problem that we have to solve if we
use something like the Solr indexing engine,
right: how do you do the near-realtime indexing?
Yes. Okay. Yes.
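For what it's worth, Solr does expose a knob for this--the commitWithin mechanism, where posted documents become searchable within a bounded number of milliseconds. This is not Loggly's actual pipeline, just a hedged sketch against a local Solr core assumed to be named "logs":

```python
# Post a log event to Solr and ask for it to be committed (made searchable)
# within ten seconds. Host, core name, and document fields are assumptions.
import json
import requests

docs = [{
    "id": "evt-1",
    "timestamp": "2011-01-01T00:00:00Z",
    "message": "Failed password for root from 10.0.0.5",
}]

resp = requests.post(
    "http://localhost:8983/solr/logs/update",
    params={"commitWithin": 10000},              # milliseconds
    headers={"Content-Type": "application/json"},
    data=json.dumps(docs),
)
resp.raise_for_status()
```
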
>> [INDISTINCT]?
>> MARTY: Have I used machine learning to
do anomaly detection? I haven't. I read a
lot of research like the RAID conference,
Recent Advances in Intrusion Detection, has
a lot of different approaches that people
tried for all kinds of stuff, [INDISTINCT]
networks, and machine learning, whatever,
you name it. What I've read and seen just
hasn't worked. It's hard. It's not that easy
because the environments change, the data
sources change. And even if you try to keep
these things fairly stable, generally you
have to train these systems, and that's where
it falls down. No one has solved that. Like,
how do you train a system in a live environment?
You don't have labeled data.
You don't know what's bad in there so it's
like--every time someone comes to me and says,
"Oh, I have this thing. You have to train
the system." I'm like, "Forget it. It's not
going to happen." And maybe you can prove
me wrong. Maybe you have done some work on
that, but I'd be interested in hearing about
it if anyone has any approaches to it.
>> I used to work in a company [INDISTINCT].
>> MARTY: Yes. Yes.
>> [INDISTINCT].
>> MARTY: Sure.
>> [INDISTINCT].
>> MARTY: Yes.
>> [INDISTINCT].
>> MARTY: That's interesting. It probably
won't scale because your environments are
very dynamic in general. Like, even if you
keep the data source static, you have new
machines being introduced to your network.
You have new users coming online. You have
new applications coming online, so your patterns
keep changing. Even if you keep them static--I
mean, yes, you can probably do--you can go
through all the data and say, "Oh, this, this,
this, this, this, this." But what you will
find also is that a lot--you have big clusters
of things, and those are easy to classify.
But then you have the long tail, and that's
going to kill you. That's going to be so much.
I can guarantee you, every system you're going
to look at, you will have things where people
are like, "We have no idea what this is."
There's one log record here, one log record
there. You have no idea when that happens,
and it takes a lot of time to investigate
what this is and why it's being generated.
The backup solution triggering my IDS every
night--that's not that hard because it's periodic.
But you see a log entry like this and have
no idea; the application just does it, right?
But, yes. If you can make that scale--so is
that labeling of your data?
>> You've introduced that [INDISTINCT].
>> MARTY: Introduce the complex API to let
people label their data. I think yes. Well,
I'm not sure about the complex part. Make
it simple. But I hadn't thought of this until
just a second ago. The question is, how do
you let them--what are the features you let
them label? Like, if
I--if you have your data set, and I have mine,
and you label some of your data based on IP
address, and you say, "Oh, this is the web
server doing this," then I can't use that because
it's your IPs, so you have to abstract it
to the right level I think and that would
be very interesting. If you can come up with
a sort of taxonomy for that and then let them...
>> [INDISTINCT].
>> Yes. Yes.
