Welcome to the 2016 NASA Ames summer series. If a tree falls in a forest and no one is around to hear it, does it make a sound? Another way to look at that question: if a species, a race, or a data point disappears and we do not have a record of it, did it exist? Or does it have any value in the results we see?
Today's presentation is entitled "Searching Harsh Environments" and will be given by Dr. Ophir Frieder. He is a professor of biostatistics, bioinformatics, and biomathematics in the Georgetown University Medical Center. He also holds the Robert and Catherine McDevitt Chair in Computer Science and Information Processing. Besides that, he holds an external position as chief scientific officer of Umbra Health Corp. He received a Bachelor of Science in computer science and communication studies, a Master of Science in computer science and engineering, and a PhD in computer science and engineering, all from the University of Michigan. He is a Fellow of the AAAS, the ACM, the IEEE, and the NAI. Please join me in welcoming Dr. Frieder.
Good morning, which is a little odd for me to say, considering it's really good afternoon for me, but for you it's good morning. Thank you for being here. I see the lights are nice and shiny; I'm not used to bright lights, but we'll do what we can do. Today's talk is about searching in harsh environments. Before I came here, I looked at the variety of talks that are going to be given through this summer series, and this one is not quite in the same spirit as some of the other ones. This will be a little different, but hopefully it will pique your interest in at least something.
Alright, so first of all I have to correct a well-known myth. How many of you use Google? OK. Considering I'm like a stone's throw away, it'd be pretty sad otherwise. How many of you basically used Google today? OK, no surprise. Well, if you give a talk in any search community, what you'll hear is that Google solved it all, and for the most part they actually have solved quite a bit of it. But it isn't exactly the truth, because Google really solved computerized data, and a lot of data isn't computerized; I'll show you what I mean in a second. More so, Google is hardly a social media player. How many of you have actually used Google+? Wow, there are actually three of you; I'm impressed. I've given these talks to large audiences and not even a single hand goes up. So obviously Google+ didn't quite make the splash they were hoping it would make, and obviously social media is here, all over the place. So Google is hardly a social media player.
So what I'm going to talk to you about today, as you can see, is a gamut of things. The way I usually give a presentation is I pick something that I consider more engineering, meaning that a solution to a real problem exists, it's been put together, though not necessarily by me, and it basically is in real use. The other extreme is research, which is something we hope to do in the future, that we hope will go into applied practice, and there I'll talk to you about searching and mining of social media.
But first I'm going to start by talking about what complex document information processing is. I tried to find something that would relate to NASA along the way. Now, I've got to tell you that I'm not an expert in anything that NASA does; in fact, I'm not even a novice. But I do realize, when you look at the brochures and some of the material you find online, that you talk about the mission assurance system, and about timeliness of information: getting you complete information as quickly as possible. Now, complete information really means that you need to get it from all sources, not just the standard computerized ones as you know them. So I'm going to talk about that, and obviously I'm going to talk about the research direction in this area. By the way, feel free to raise your hand and stop me.
All right, so this is a complex document. When I talk about complex documents, I talk about documents like this. As you can see, this is not your standard web document, and it isn't your standard Microsoft Word document. It's got stamps, it's got signatures, it's got logos, it's got a whole variety of things, and that's pretty complex. But this one's a little more complex. This was a government-funded project I was working on, and this one too has logos, and it too has signatures, and it also has different types of components, including tables and handwriting. By the way, can anybody read this? No one can read this. Okay, can anybody tell whether you read it from left to right? I'm glad nobody said yes, because it's Arabic; you read it from right to left. Anyway, the fact of the matter is that this is a complex document, and you need to search it.
Now, fortunately, we know how to search it; we know how to play with it. Like most software development today, your target is to have software that utilizes other people's software, or freely available commodity software, because if you actually tried to redevelop all the software yourself, what you're going to end up with is lots of errors, if you ever succeed at all in getting anything done. So what we do is we capitalize on other people's technology: freely available software, freely available components, some more robust than others, and we try to get an answer. We do know how to deal with handwriting recognition, we do know how to deal with structured data extraction, we do know how to deal with OCR, and we do know how to deal with all of that. The way we deal with all of it is simple: we take the document; we enhance the document, because as you can see it's in pretty bad shape; we layer it; we OCR the various components; we take all the various components and run them through all the various software routines that we know; we put it together; and then we try to either search it or mine it.
Obviously, I can tell you that there's a need to, quote unquote, enhance the document, but I have to prove it to you. You'll take as granted that we take the documents you give me and extract things from them, but what I still owe you is an explanation of why I have to enhance them, and why integration makes any difference at all. So I owe you that explanation; let me start with this. You can guess which side is enhanced and which side is not. Right, I'm looking, people are wondering... Okay, you do know. Very good.
So you can tell which side is enhanced and which is not, and you can tell it's a handwritten document. What this is is a document taken from a diary, an old diary that has gone through, let us say, adverse conditions during World War II. It's stored in a museum, so what we have here is a scanned image of it, and as you can tell, you have to squint at it, and you can't really read it. Even if you did try to do OCR, forget it; you can't get anywhere with it. But how about this one? Now you can get somewhere with it. This one's even a little harder, because it's got a combination of handwritten and typed text, and it shows you that if you enhance it, you really can get somewhere. If you look at the black rectangle, you can't really read anything in the unenhanced version, and you can easily read it in the enhanced one. Anybody want to try to translate it? You speak fluent German? Well, if you speak fluent German, then you could translate it. Fair enough, I can translate it for you if you'd like, but it's a little hard even for me to read, and I'm a little closer.
Anyway, so you'll buy the fact that you need to enhance; you should enhance, because if you don't, things like that you won't be able to process, certainly not with OCR. But I still owe you proof that integration actually makes a difference, that you really do want to break the whole into its separate pieces and sum them together. The best way to show that is this: this is me. It's kind of like a business card; I don't really use business cards anymore, it's an old historical artifact. I have to ask you the question: what positions do I hold? Now, if you look at it and you did not deal with the text, if you only dealt with the non-text components, you cannot answer that. That's a given. But the real question is: at which institution am I? As you can see, without processing the logo, you couldn't answer that. So without integration you couldn't answer either of these two questions.
Let me show you a little bit of what we did. But before we do that, I look at the age of the audience, and it makes one thing very clear for me: I have to address the notion of technology. We built this prototype, and by the time we were done, no one would use it, because technology moves so quickly and so rapidly that it becomes out of date fairly quickly. But, unfortunately or fortunately, what doesn't go out of date are benchmarks. Now I'll show you how sad, or inspiring, whichever way you want to view it, benchmarks are.
In the early days there was a competition called TREC. Anybody ever heard of TREC? Nobody? Ah, more than the number of people that use Google+. TREC is an international evaluation for text retrieval. It's basically not a competition; it's sponsored by NIST. It's a bake-off where everybody gets together: they give you a set of queries, they give you a set of data, and they tell you to go run your system on it and submit the results. You send them to NIST, and NIST basically does the evaluation. It's been running for a long, long time.
In 1993, which, based on some of the audience's ages here, was probably before some of you were born, they had a very, very large collection, a collection that was so large they felt that the academics couldn't really do a lot with it, so they also had a subset of it that they evaluated on. So, anybody want to guess what a large collection was in 1993, a very large collection that academics couldn't handle? What order of magnitude? You're thinking terabytes? No. How about hundreds of terabytes? How about hundreds of gigabytes? It was actually two gigabytes. People couldn't store it, forget about processing it, and they had to rely on a smaller subset collection, which was 500 megabytes. Take the iPhone in your pocket, or whatever device you have, and think how much storage you've got on it, and think of how many songs you have on it. Well, anyway, that benchmark from a long time ago is still used periodically today.
Still used today. So we felt that if we were going to build any benchmarks, we needed to build benchmarks that would stand the test of time. More to the point, we knew we had to build a benchmark because no such systems as we described were available, and no way of evaluating them existed, so we had to spend a lot of time creating our own benchmarks. And you know what the cardinal rule of creating your own benchmark is? You're guaranteed to be the best. You're also guaranteed to be the worst, but you never say that.
So we built a benchmark, and what went into it? Well, in order to have a benchmark, we drew up a set of characteristics. What were the characteristics we felt were important? It had varying inputs, it had varying fonts, it had varying graphical elements; it had variety in everything. And key to it: it had to be free.
When I was growing up and we wanted to buy music, we did something very archaic: we bought records. You've seen records? So we would buy records, and later on we would buy CDs; you've seen CDs. Now, let us say you get music from various different sources. You never pay for data. You don't pay to read the newspaper; you don't pay to search any media. It has to be free. So for anybody to actually use the benchmark, you had to worry about the copyright problems, and you had to make sure that you solved the copyright issue in such a way that the collection was free. So we built a benchmark that's free.
To make a long story short, we built this benchmark, and it was used. It's about 7 million documents, 42 million TIFF images; it's about a hundred gigabytes of OCR output after you've OCR'd it; it's 1.5 terabytes in size. It's still credible, and most importantly, it's still used. It's still there if you want it: you can go to NIST and get it. Basically, what we did was come up with a way of searching this collection. Suffice it to say, we built a simple pipelined system with very simple integration across all the components: you just plug the components together and run it.
Now, without going into details, let me show you why this system is of interest. Here's a query for you. The query says: you want a logo of the American Tobacco Company; the document has to talk about income forecasts; and it has to talk about an income forecast of greater than 500,000. Now, if you look at it, the logo is pretty clear, and if you look closely, the words in the red box, which you probably can't see, say "income forecast." So far pretty straightforward, but the black box says 800,000. The reason I use that as an example is that if you did a pure text search, you wouldn't find this document, even if you found the logo, because in a text search, when you type in the number 500,000, what are you actually doing? You're looking for a string that says "500,000." But in a database search, when you say 500,000, or greater than 500,000, say in SQL, you actually get the results that are greater than 500,000, so this document fits.
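The distinction can be sketched in a few lines. This is a minimal illustration using SQLite, with a hypothetical table of extracted document fields; the schema and values are invented for the example, not taken from the actual prototype:

```python
import sqlite3

# Hypothetical table of fields extracted from documents; the names and
# values are illustrative only.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE docs (doc_id TEXT, logo TEXT, topic TEXT, amount INTEGER)")
con.executemany("INSERT INTO docs VALUES (?, ?, ?, ?)", [
    ("d1", "american_tobacco", "income forecast", 800000),
    ("d2", "american_tobacco", "income forecast", 300000),
])

# A literal text search for the string "500,000" would miss d1 entirely;
# a database comparison treats the extracted amount as a number instead.
rows = con.execute(
    "SELECT doc_id FROM docs WHERE logo = 'american_tobacco' "
    "AND topic = 'income forecast' AND amount > 500000"
).fetchall()
print(rows)  # [('d1',)]
```

The document mentioning 800,000 matches the numeric predicate even though the string "500,000" appears nowhere in it.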
Here's another example: an RJR logo, with "filtration efficiency," and a signature. Why do you care about signatures? When do you sign? Typically, if you are in management in particular, all you do all day long is sign, because that's an authorization. So this is a way of identifying something that has a logo, talks about filtration, and has somebody's authorization on it; hence this document.
Another example: find the five highest signers of dollar allocations. Now, you can imagine there are some organizations that care about payments that are paid out by certain authorizing people. To solve that, you have to identify all the documents signed by given people, you have to sum-total any money they talk about, and then you have to sort out the ones who paid the most. But that's only the relatively easy part of the challenge. How many of you have signed ten things in a row? More of you than that, I'm sure. The problem is, when you sign ten times in a row, how similar is your signature? A hundred times in a row, for that matter? Every time I sign something, it looks different. So the fact of the matter is you have to do signature matching, and unlike simple, straightforward things, signatures are very hard to match. But we cheat. How do you think we cheat? Well, we use different extraction procedures, since we're integrating everything together: some byline is going to have a name, or some header is going to have a name, and that's the way we match it. So these are the five people who paid the most amount of money in some payments on tobacco litigation.
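Once the "cheat" has attached a name to each signed document, the rest of the query is just aggregation. Here is a minimal sketch, with invented (signer, amount) pairs standing in for the extracted documents:

```python
from collections import defaultdict

# Hypothetical (signer, amount) pairs: the name comes from a byline or
# header associated with the signature, not from the signature itself.
payments = [
    ("Dr. Stone", 120_000), ("A. Smith", 40_000),
    ("Dr. Stone", 80_000), ("B. Jones", 150_000),
    ("C. Lee", 60_000), ("D. Kim", 10_000), ("E. Roy", 5_000),
]

totals = defaultdict(int)
for signer, amount in payments:
    totals[signer] += amount              # sum-total per authorizing person

# Sort by total paid and keep the five highest.
top_five = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:5]
print(top_five[0])  # ('Dr. Stone', 200000)
```

The hard part in practice is producing the `payments` list from scanned documents; the aggregation itself is straightforward.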
Here's another example. This is a query about the associations of people: where they've paid, whom they've paid, whom Dr. Stone has paid. Again, not your typical supported search. Of course, when I say that I built it: actually, I didn't do much of it myself; I was just running the effort. These are the people who actually did it, and they did it over time, and over time they've changed their affiliations. Some are students, some are collaborators, some are colleagues, and so on and so forth. And since I am an academic, and we have to worry about publications, and in the new age of universities today a lot of people are looking at patents, there are some patents on this as well. So that was basically middle ground: it was somewhere between research and practice. We were trying to solve a real problem; we got something to the prototype stage, but it hasn't really seen major deployment, or any specific deployment.
The next effort I'll talk about is a little more to the point, and it's actually very relevant to systems based on user reporting. User reporting systems have wildly interesting spellings, wildly interesting phrasing, and wildly interesting grammar, and you have a real problem searching those, like you have in the various reporting systems NASA has. You need to come up with a solution for that. So: searching in adverse conditions. What does it mean? Spelling is difficult. It's getting to be a worse problem because we've got autocorrect for everything, so people learn how to spell less and less well, and then they try to spell using Twitterese or any other encryption mechanism that they call spelling. It gets very difficult; as a faculty member, I look at spelling when people write on exams, and it's getting worse and worse. But it's even worse when you start to spell in foreign languages, particularly if you don't know the foreign language but you're going to do a search that is related to it. One such situation exists in a collection called the Yizkor books, and the other issue arises when you start dealing with medical texts. I'll tell you about both of them independently. So, the Yizkor books: yizkor is a word in Hebrew that means "remember."
It's a collection of books that are scattered around the world, including in the United States Holocaust Memorial Museum. This collection has restricted access, and there's also a section that houses an archive. People come into the archive and say, "I'd like to find information about someone." Okay, fine; who are you looking for? "Oh, they lived in some city." Okay, fine; when is this from? "Oh, this is from the World War II era." Okay, fine; could you be a little bit more specific? Where did they live? "Europe." Not really helpful, but it's a start. Okay, what about: could you tell me what language they spoke? "Oh, yeah, they spoke German." Getting a little better. What was the name of the city they lived in? "I don't know, but it had a 'burg' in it."
Anything else? "I'm not quite sure; I don't know; it kind of sounds like..." So I said, okay, what would be an example of such a city? An example of such a city is Bratislava. Anybody heard of Bratislava? It's in Slovakia. Does "Bratislava" have "burg" in it? I don't know how to spell Bratislava all that well, but it doesn't sound like it has a "burg" in it. But it does, because Pressburg was another name for Bratislava, and that one does have it. So it gets a little difficult. And they did speak German, because the Jews of that area called it Pressburg, and they actually spoke German. So we wanted to build a search system for people who didn't quite know how to pronounce things, didn't quite know how to pin down the location, and yet wanted information. And we built a system that has such an interface.
Most importantly, when they do a search, they can actually find results in a collection of documents that have multiple languages within the same document, let alone different documents in different languages. And it produces a simple ranking. Now, as you can see, it's not doing particularly great, but it is finding things, and the way it finds them is based on a simple algorithm. If you look at the top green box, that's the segmentation section, which breaks things up into somewhat arbitrary pieces. If you look at the bottom green box, that is the traditional approach; it's called n-grams. N-grams are sliding character windows that overlap each other, used to find what you're looking for. The problem is, they're sliding windows, and sliding means contiguous, and if it's contiguous, you cannot find some things that are chopped in the middle. The top approach takes care of that. We built this system; it's in use in the archive section of the Holocaust Museum today, and it uses simple, simplistic rules to break things up, which are kind of crazy-looking, but they are there to add variety and mutations into things.
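The contrast between contiguous n-grams and mutation-style segmentation can be sketched roughly as follows. This is a simplified stand-in, not the actual rules used in the deployed system; the "pieces" function just adds fragments with one interior character deleted:

```python
def ngrams(word, n=3):
    # Traditional approach: contiguous sliding character windows.
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def pieces(word, n=3):
    # Simplified stand-in for rule-based segmentation with mutations:
    # besides the contiguous windows, also generate fragments with one
    # interior character deleted, so a match can survive a change mid-word.
    out = set(ngrams(word, n))
    for i in range(1, len(word) - 1):
        out |= ngrams(word[:i] + word[i + 1:], n)
    return out

# "pressburg" vs. a variant with a stray character in the middle:
a, b = "pressburg", "presxsburg"
shared_ngrams = ngrams(a) & ngrams(b)
shared_pieces = pieces(a) & pieces(b)
print(len(shared_pieces) > len(shared_ngrams))  # True
```

Because the stray character breaks every contiguous window that spans it, plain n-grams lose several matches that the mutated pieces recover.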
Here's a standard way of evaluating it. The way you evaluate these systems is against the standard language-based approach, Daitch-Mokotoff Soundex. It's a Soundex approach: it matches by sound, which doesn't help if you cannot pronounce the word. Anybody speak Czech here? Ah, okay, then you can validate my statement; I don't speak Czech. But there are words that are this long that just forget about vowels; they just don't think they're necessary. For somebody like me to try to pronounce them is a new experience. To make a long story short, the phonetics don't help. The other approach generally used is n-grams, and ours is the one at the bottom.
What you're looking at is a standard way of evaluating search techniques on words: you add a letter, you drop a letter, you replace a letter, or you swap a random pair. The ones in red are the ones with the best score, and when we talk about rank, it's: where in the list did you find it, and what was the average rank over all the collections you tried? The reason there are two numbers is that the bottom number counts only what we found that n-grams also found. The USHMM search finds everything the n-grams search does, plus some other results, so the bottom number restricts to results the n-gram approach also found. As you can tell, these are ugly numbers, but what you want to see is where the trend is. And if you want to get really scary: add two, three, or four random characters; remove two, three, and four; replace two, three, and four; and swap two, three, and four. What you see is that the trend is the same.
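The perturbations used in that evaluation are easy to reproduce. Here is a minimal sketch of a mutation generator for stress-testing fuzzy name search; the function name and parameters are my own for illustration:

```python
import random
import string

def mutate(word, op, k=1, rng=random):
    # Standard stress-test perturbations for fuzzy name search:
    # add, drop, or replace k random letters, or swap k adjacent pairs.
    w = list(word)
    for _ in range(k):
        if op == "add":
            w.insert(rng.randrange(len(w) + 1), rng.choice(string.ascii_lowercase))
        elif op == "drop" and len(w) > 1:
            w.pop(rng.randrange(len(w)))
        elif op == "replace":
            w[rng.randrange(len(w))] = rng.choice(string.ascii_lowercase)
        elif op == "swap" and len(w) > 1:
            i = rng.randrange(len(w) - 1)
            w[i], w[i + 1] = w[i + 1], w[i]
    return "".join(w)

# e.g. two adjacent swaps applied to a city name (seeded for repeatability):
print(mutate("bratislava", "swap", 2, random.Random(0)))
```

Feeding such mutated queries back into each search technique and averaging the rank of the true answer gives exactly the tables described above.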
So this is the algorithm that's used today in the archive section to find names
But one of my appointments is in the Medical Center, and if you're in the Medical Center, you basically want to show that you're doing something of use for medicine. So we tried to deal with transcriptions and how many errors occur in them. By transcription: how many of you have been to a hospital? I mean, been treated in a hospital? Quite a few. How many of you would wish not to be there during the shift changes, when nurses go from 3:00 to 3:30 or from 7:00 to 7:30? The nurses overlap, and the reason they overlap is that they hand off information from one shift to the next. But a lot of the time they write things down quickly, and, well, how many of you have seen doctors' handwriting? They don't write things down; they scribble. The bottom line is that transcription errors in the names and the medications and so on: forget about it. A lot of these errors occur, and they account for quite a few of the problems. Now, not all of them are necessarily transcription errors, but some of them are. So we ran our approach on a medical dictionary; these are the standard characteristics, and we were actually quite happy; we did okay. Again, we did fairly well; in fact, we found almost everything, even when you tried a complete mess of four swaps and so on and so forth.
So that just shows this algorithm works here too. Again, this is basically a straightforward combination of modifications to well-known n-gram solutions, adding a little noise, because noise helps. By the way, how many of you deal with optimization algorithms here? Some of you. There are famous optimization algorithms, things like simulated annealing and genetic algorithms, and you know what a key to their success is? Adding some noise: mutations, random heating. Right, so we did the same thing here, and it worked here too. Again, these are the people involved, and when I say I didn't do much, it's obviously my collaborators who did; we also thank the many users of the system for their comments. And again, being an academic: of course my student very much appreciated the last academic stint, to Santorini in April, a very nice place.
And now I want to leave you with something to show you that we really do some things that are hopefully for the future, and I'm going to talk about search in social media. I don't need to motivate that; I don't need to point you to a NASA program that does it. It's everywhere; everything has to do with social media. So I'm going to talk about public health surveillance. The way it's usually done takes a long time, because it takes a huge amount of human effort. Basically, it works when somebody feels sick and goes to the doctor; after enough time, enough different people go to the doctor, eventually the doctor reports it, and if enough doctors report it, eventually it's caught. But that takes a long time, and therefore you need to expedite it. And the way you expedite almost everything nowadays is social media.
So how do you deal with social media? Well, we're not the first; in fact, many people have done this. Typically, they address a known, specific problem: they'll say, "I'm going to look for influenza," and they'll find influenza, because they know exactly what they're looking for. That's a problem if you don't know what you're looking for. Often they use complex solutions. I firmly believe in the KISS principle; anybody know the KISS principle? I'm sure you do: keep it simple... I'll leave it to you to complete the rest of it. Complex solutions don't usually work, because they take forever and they aren't really heavily adopted. Or people use their own resources that you can't get access to, for example query logs. Query logs are heavily used. Do you have access to a query log? Not really, so you can't do it. So what we wanted to do was change the old way, which asked "Is there a flu?", i.e., looking for something specific, to general questions: is there something occurring, is there something new? And if yes, there is something occurring, then determine: it's the flu.
So what we had was a collection of tweets, two billion tweets, which were given to us by Johns Hopkins. We took those tweets, we partitioned them by time, we then checked whether they were trending, and once we identified what was trending, we checked whether it was something that should alert us. That's the nutshell, but I'll go into more specifics. Here's the tweet collection: two billion. We cleaned out the ones that are not health-related, which left 1.6 million, partitioned over time. We cleaned them up some more: got rid of punctuation, stop words, and so on and so forth. Then we found out which terms are actually associated with one another. If you take tweets like "pounding headache, sore throat" and "low-grade fever" versus the other ones, you see that certain term pairs have a sufficient amount of support for what's going on: "flu, sore throat" had support three, "fever" had support three, "cough" had support two. Basically, we decided to keep only the things above a certain threshold. So that's an example.
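The support-counting step described above can be sketched in a few lines. The tweets here are toy examples I made up, not the Johns Hopkins data, and the threshold is arbitrary:

```python
from collections import Counter
from itertools import combinations

# Toy health-related tweets, illustrative only.
tweets = [
    "pounding headache sore throat",
    "low grade fever sore throat",
    "fever cough sore throat",
    "cough fever",
]

# Support of a term pair = number of tweets containing both terms.
support = Counter()
for t in tweets:
    for pair in combinations(sorted(set(t.split())), 2):
        support[pair] += 1

THRESHOLD = 2                      # keep only sufficiently supported pairs
kept = {pair: c for pair, c in support.items() if c >= THRESHOLD}
print(kept[("sore", "throat")])    # 3
```

Pairs that co-occur in only one tweet, like "pounding headache," fall below the threshold and are cleaned out.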
Then we decided to see if that information was trending, and what we did was look for slope. How many of you are still writing your dissertations, or thinking about writing dissertations, or writing research papers? Okay. I said we look for the change in slope. Bad word. Let me give you a better word: we look for the derivative. Because slope is delta-y over delta-x, not to be confused with the derivative, which is dy/dx. Sounds the same, is the same, but one is easier to publish with, because "derivative" sounds very sophisticated, and "slope" sounds like you're in junior high. So we did derivatives; we didn't do slope. Forget the slide. So here's an example of what we look for with the derivative. Here's the trending decision. This one occurs very frequently, right? "Feel sick" occurred frequently over time. This other one did not occur so frequently, but it trends; the first one does not trend because, although it's high, there's no change to it.
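The trending decision reduces to a discrete derivative over the time partitions. A minimal sketch, with invented counts and an arbitrary threshold:

```python
# Illustrative counts of a term cluster across consecutive time partitions.
def slopes(counts):
    # Discrete "derivative": difference between consecutive windows.
    return [b - a for a, b in zip(counts, counts[1:])]

def is_trending(counts, min_rise=5):
    # A cluster trends when its most recent rise is sharp; a cluster that
    # is frequent but flat does not trend.
    return slopes(counts)[-1] >= min_rise

flat_but_high = [40, 41, 40, 42]   # like "feel sick": frequent, no change
rising = [3, 4, 9, 20]             # infrequent, but changing fast

print(is_trending(flat_but_high), is_trending(rising))  # False True
```

This captures the point of the slide: absolute frequency alone is not the signal; the change in frequency is.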
We took those, and we used Wikipedia to map things onto concepts. Why Wikipedia? Because it's in layman's terms, which is what tweets are written in. Do you see "neoplasm" or "sarcoma" in tweets? No; you see words like "cancer." So we wanted to use plain English. We looked for where a phrase appears in Wikipedia and mapped it onto a concept. So here are the words that mapped: "sore throat," which we knew is an actual medical term. How did we know that? We cheat. See the red circle? These are ICD-9 and ICD-10 codes. What are ICD-9 and ICD-10 codes? Billing codes. Billing codes for medicine; now it's ICD-10, previously it was ICD-9. If it has a billing code, it's a medical condition. Most likely.
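That cheat amounts to a dictionary lookup. Here is a minimal sketch; the mapping is hypothetical and tiny (the real system went through Wikipedia articles and the codes attached to them), though the ICD-10 codes shown are, to my knowledge, real:

```python
# Hypothetical mapping from layman's terms (e.g. Wikipedia article titles)
# to ICD-10 billing codes; only three illustrative entries.
ICD_MAP = {
    "sore throat": "J02.9",   # acute pharyngitis, unspecified
    "headache": "R51",
    "cancer": "C80.1",
}

def is_medical(phrase):
    # The cheat: if a concept carries a billing code, treat it as a
    # medical condition (most likely).
    return phrase.lower() in ICD_MAP

print(is_medical("Sore throat"), is_medical("brunch"))  # True False
```

Any trending cluster whose terms fail this lookup is discarded as non-medical.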
We compared our detection to the flu, and as you can see, there are the corrected Google numbers, the CDC numbers, and ours. You shouldn't compare the scale, but you should see that we detected the trends at the right time. So we were a little optimistic and said maybe there's hope for this.
But I have to give you a word of warning about using tweets. While they are helpful at times, they're not quite as helpful at others. It is true that the landing on the Hudson and the Mumbai terror attacks were detected quickly on Twitter. It is also true, though slightly less accurate, that flu can be detected from tweets. Hurricane Sandy had a bunch of misses; in fact, in some cases the misses pointed to the wrong locations, and some meet-up locations were off to the east of New Jersey, which, if you know geography, is not exactly a very useful place to meet. But my favorite was one of the celebrity deaths, apropos of the Colbert Report. I don't even get to see the Colbert Report anymore. He had a segment where he spoke to Jeff Goldblum "from the dead," because on Twitter, Jeff Goldblum had been killed. The only problem, or rather the only good thing, is that Jeff Goldblum isn't dead. So Colbert talked to Jeff Goldblum from the dead, because obviously Twitter had killed him. So don't necessarily bank on Twitter being right.
So we actually looked to see what the system can do, and we tried it on sinusitis. We noticed that there's a problem around the April and May timeframe. We then looked at allergic response, and we saw that it trended along April and May too, and in fact so did food allergies along that same timeframe. So we said maybe there's something to it. But the truth is that what we used it for is not to validate a situation, because if you had validated "Jeff Goldblum is dead," you would have validated it as true, and it is false. Social media should not necessarily be the guide for validating a claim, but it can be a guide for telling you that something has potentially changed. It may be a false alarm, or it may be reality, and if it's reality, you will detect it by other sources. At the least, it will give you something to do, something to explore. So it is very useful as a hypothesis generator, not so useful as a truth indicator. That's what we did, and by "we" I mean all these people, including Alek Kolcz when he was at Twitter, and so on and so forth along the way. And here are some publications. Again, the student appreciated it: he went not only to Cologne but continued on to Slovenia, a two-and-a-half-week vacation courtesy of yours truly.
So what did we do? I showed you that the whole is greater than the sum of the parts: when you integrate things, you can actually get somewhere and solve problems you couldn't solve otherwise. I showed you that searching, which we all know is very easy, is not so easy if you can't really do the search, if you've got very, very adverse spelling and grammar conditions along the way. And I showed you that social media should be used as a warning mechanism or an alarm indicator, but may not be the ideal source of truth indication.
So let me conclude the way I always conclude: with a bunch of statements. I have three cardinal rules. Rule number one is always finish on time, and the reason it is important to finish on time is that if you don't, people start to wonder whether you knew how to organize your talk. So rule number one: finish on time. Rule number two is always leave room for questions. It is very important to leave room for questions, because if you don't, people start to say, okay, this was a canned speech; the person hadn't rehearsed, and did not want to answer any questions, because they were afraid they'd be asked questions they don't know how to answer. So rule number two: always leave room for questions. But no rule is as important as rule number three. Rule number three is by far the most important: never leave room for too many questions, because if you do, they will realize that you didn't know what you were talking about, and they will ask the questions that you cannot answer.
So I was told that I would have a total of about 45 minutes, and I was told I should leave room for questions. It is now exactly 42 minutes and 25 seconds into the talk. So I thank you very much for coming, and I hereby open the floor for questions.
Great, great rules. So we have time for questions, and maybe some deep questions that will challenge the speaker. If you have a question, please raise your hand and wait for the microphone.
In one of your tables of results you had drop a character, add a character, drop multiple characters, add multiple characters.

Those are different tables, but yes.

So it seems like for misspellings you'd need combinations of those. Did you also include dropping and adding, or maybe replacing?

So we tried; we evaluated using adding a character, dropping a character, replacing a character, and random swaps of characters, and we did it up to a combination of four. So basically, if you actually looked at what we searched with and compared it to what we really meant to search with, it was completely different; you wouldn't recognize some of the terms. And we also used actual real logs of users to compare as well.
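The perturbation scheme described in this answer, adding, dropping, replacing, and swapping characters combined up to four edits, can be sketched roughly as follows. This is an illustrative assumption of how such a corruptor might look, not the actual evaluation code; the function name and alphabet are my own.

```python
import random

# Illustrative sketch of the character-perturbation scheme: add, drop,
# replace, or swap characters, applied up to `num_edits` times.
def corrupt(term, num_edits=1, alphabet="abcdefghijklmnopqrstuvwxyz", rng=random):
    chars = list(term)
    for _ in range(num_edits):
        op = rng.choice(["add", "drop", "replace", "swap"])
        if op == "add":
            # insert a random character at a random position
            chars.insert(rng.randrange(len(chars) + 1), rng.choice(alphabet))
        elif op == "drop" and len(chars) > 1:
            # delete a random character (always keep at least one)
            del chars[rng.randrange(len(chars))]
        elif op == "replace" and chars:
            # overwrite a random character
            chars[rng.randrange(len(chars))] = rng.choice(alphabet)
        elif op == "swap" and len(chars) > 1:
            # transpose two adjacent characters
            i = rng.randrange(len(chars) - 1)
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)
```

With four edits applied, a query term can indeed become hard to recognize, which matches the speaker's point that the corrupted queries looked completely different from the intended ones.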
Right now the FBI and CIA are doing huge numbers of searches to try to figure out if there are people stalking us at this very moment, et cetera. Are you working on any of that, and do some of the approaches that you're using help to root out this evil part of our society?

Am I working on stalking people? I try not to stalk people. I work on technology. Technology is used in various different ways, and I've built various search systems along the way. So I don't actually know what people are using the systems, or the algorithms, that I have built for. So I really can't answer that, but I try not to stalk people, if that's the answer to your question. If I didn't answer your question, ask it again.
I was thinking more of using your technology to read between the lines in some of these communication systems that, you know, organizations like ISIS are supposedly using over the Internet, and whether or not, I mean, you're not directly involved, is what you answered. So you're not working with the FBI and the CIA; as far as you know, your technology is not part of that.

As far as I know, no. But I can tell you that you can search across basically different languages along the way, using different algorithms and different approaches, some of which I've actually used and developed.
So what you are doing is great, but before you can do this you have to have these documents digitized in electronic form, and that's a big barrier, both technically and legally. So not all documents are available for this, and actually maybe too few documents are. What about this barrier?

So the reason we had the first part that I talked about is to try to get quote-unquote scanned documents into a form where you can actually OCR some of them, so you can actually do some of this interpretation. The goal is that if you can do the OCR, and it's a big if, then you can use some of these foreign-language search techniques, or garbled-search techniques, to try to correct some of the OCR errors on top of it, or at least start to search the documents even if they have poor OCR corrections. But yes, digitization is the process, and OCR doesn't always work; in fact, it often fails on some documents from various collections. We've actually tried to help people OCR, and we basically put our hands up and said it's never going to happen, at least never as far as computer science goes, which means five years.
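The garbled-search idea mentioned in this answer, matching a poorly OCR'd query term against a clean vocabulary, can be approximated with a simple edit-distance lookup. This is only an illustrative sketch under that assumption, not the speaker's actual system, and the function names are my own.

```python
# Classic dynamic-programming Levenshtein edit distance.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def fuzzy_lookup(query, vocabulary, max_dist=2):
    # Return vocabulary terms within `max_dist` edits of the
    # (possibly OCR-garbled) query, nearest first.
    hits = sorted((edit_distance(query, term), term) for term in vocabulary)
    return [term for d, term in hits if d <= max_dist]
```

For example, an OCR'd query like "influcnza" would still retrieve "influenza" from a medical vocabulary, since the two differ by a single substitution.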
Are there any other lessons you learned from your work that you can apply to other areas of your life?

Have I ever learned any lessons? I learned a lot of lessons. First and foremost, when you do search technology, or do any evaluation of search development, you had better have a solid benchmark to evaluate your systems with. There is no replacement for gold standards, and every approximation you use falls short of the reality. Now, I'm not saying that you will necessarily always have one, but you should aspire to it. Another thing: get really solid graduate students. Being an academic, I made my career, whatever it is, based on the hard work and ingenuity of my graduate students. I'm indebted to them, and also to my colleagues, but really the fundamental work is done by the graduate students. So I guess: get the resources you need. That's probably the best way of grouping everything together, so that's the short answer for you.
So in your Twitter example you got a very large database from John Hoffman. Is it possible to actually apply your model to live data, and is it something that anyone is planning to do in the future?

So it's interesting you ask. The initial goal was to try to do it on live data; in fact, we had an intern, a collaborator, plus a researcher at Twitter at the time, who was going to try to do it live. We wanted to do it as a real live feed and basically be able to parse any trend situation. For different domains you need the domain, because we have to identify the vocabulary that matches what you're trying to track. That was the intent. It hasn't gone very far; we're trying to do some of that now, but it's still in its infancy, so I can't really tell you whether it will actually work or not. That's why, when I pieced this talk together, it was: what is ripe for reality, what is in use, versus what we hope will eventually be ripe for reality.
So vocabulary was one issue; the second was noise. You need a strong enough signal: you need a broad enough general vocabulary so that you can track enough information, and you need it specific enough so that it can isolate the topic correctly. The problem is that you also have to deal with enough noise on the live stream to be able to detect when it spikes and when it goes down. So isolating noise and non-major topics was the biggest problem. If you have a major topic like influenza, no problem. If you have a topic that deals with foodborne illnesses, we didn't quite get as far, because the geographical coding of foodborne illnesses and of people tweeting about them was not sufficient for us to catch it as of yet. I'm hopeful, but can I guarantee it for you?
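Spike detection against background noise on a live stream, of the kind described in this answer, is often done with a rolling baseline. The sketch below flags days whose count for a tracked vocabulary term jumps well above the trailing window; it is an assumed, generic approach for illustration, not the speaker's actual method.

```python
from statistics import mean, stdev

def detect_spikes(daily_counts, window=7, threshold=3.0):
    # Flag positions whose count exceeds the trailing-window mean by
    # more than `threshold` standard deviations. This is a hypothesis
    # generator, not a truth indicator: flagged spikes still need
    # confirmation from other sources.
    spikes = []
    for i in range(window, len(daily_counts)):
        history = daily_counts[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and (daily_counts[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes
```

For a major topic like influenza the signal easily clears such a threshold; for a sparse, geographically scattered topic like foodborne illness, the counts may never rise far enough above the noise floor, which is exactly the difficulty described above.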
Well, now we're way over, so with that, please join me in thanking Dr. Frieder. [Applause]
