(cheerful acoustic music)
- All right, I'm coming up here
as I feel like I'm Mr. Swag here.
I've got my new little cell phone holder
and water and shades so I'm like soaked
on all the free stuff so I don't know
if a lot of you guys here
are for the free stuff.
But I'm here to, so next
week I'm going home.
I live here so I'm kind of
on vacation all the time.
But next week I'm going home to Minnesota.
(audience member cheers)
Yeah, it's cool right?
Anybody who's here from Minnesota.
So next week goin' home my cousin
is in stand up and I
wanna do that with him.
So this is my first public
speaking, a little practice.
But meanwhile I'm a programmer
so I feel like I have some funny things
to say about programming all right?
So I'd like to say thank Chloe
and because as a programmer who's locked
in a padded cell all
day it's nice to laugh.
And I think we don't laugh
enough as programmers.
So here I am trying to make us laugh.
And you know my boss shoves me away
in that padded cell, says go do x y z.
Little inside joke because I do,
I just did a plot in x y z and I work
at Scripps Institution of Oceanography.
My goal, you know, save the oceans.
So that's what I do on my day job.
And meanwhile I crack
myself up all the time
so I thought hey, might
as well give it a try.
Crackin' some of you all up.
(audience laughs)
So we're in the land of
the hippies, AKA San Diego.
And we like labels so I
actually don't have a button
'cause I don't really care for labels.
But I respect other peoples' labels.
And so you know except for food.
I'm always like is this local, organic,
shade grown, fair trade, you know.
I like my food labels right?
Well another thing of land of the one,
we are one with, we like
to be one with each other.
One with nature, one with ourselves.
Let me tell you about what I
wish we were one with, time.
It's just like, there's just
never enough time right?
And anybody program with time, time zones?
(audience laughs)
Time zones?
Oh those are fun, aren't they?
- [Audience Member] Oh yeah.
- Cause oh man,
there's more than just 24 and
I'm talkin' to normal people,
I'm like oh I know a lot
about time zones man.
There's more than 24, some
of them are 30 minutes
or 45 minutes apart and then
there's Daylight Savings Time.
I am like please can we get rid of this?
I've been lookin' up
on the history of this
'cause I'm like this just is archaic.
This needs to go,
programming, you're just like
oh my God can we not just
all be one with time?
It's called UTC, you know?
I wish, that'd just make my
programming life so much easier.
So I'm new to social media,
I just started my Twitter.
Thank you, I started my Twitter today.
'Cause I was like it's in a spreadsheet
and I love to fill up spreadsheets.
I've been doin'
spreadsheets since I was 10.
So Twitter, all right been meaning
to do this I procrastinate sometimes.
Just need a little push,
all right, made a Twitter.
Twitter is Pico de Loco, one.
'Cause there was already Pico de Loco,
I just made an Instagram
account like a month ago.
Only a few posts, lot of cats, dogs.
You know and I skateboard and so anyways.
So it's me, right, well
I was Pico de Loco.
I do twitter, there's already one.
So I'm Pico de Loco One.
(audience laughs)
Well okay, well all right
so I get on Instagram,
change my name, I am now Pico de Loco One.
'Cause guess what I am crazy.
I was crazy once, they locked
me in a room with bugs.
Bugs make me crazy, I was crazy once.
They locked me in a room with bugs.
Oh yes, 'cause programming bugs.
Programming bugs make me crazy. (laughs)
Especially when you're all alone
tryin' to figure out a
bug, it makes me crazy.
So I'm crazy, I embrace the crazy,
I am proud of my crazy label.
Some people think oh
crazy is derogatory, no.
I'm crazy and I'm proud so,
oh I got one minute left.
Sick I'm goin' fast 'cause I talk fast.
So try to listen fast
'cause I ain't slowin' down.
Can't stop won't stop, social media.
I'm on social media, so
so, I'm new to things.
I got this 360, I wanna get a YouTube.
Haven't done a YouTube
video yet, going to.
I see you, and... (laughs)
(audience laughs)
Focus, focus, focus,
it's what I need to do.
I have a little bit of
ADD, I'm having a hard time
right now staying to this microphone
because I like to move and my boss,
meanwhile, I just sit
in an office all day.
Tryin' to sit on that chair,
I'm tryin' to be real good, I'm
tryin' to be a good employee
tryin' to sit on that chair, sit all day.
Sometimes you get tunnel vision,
you get in code mode, you are
just goin' goin' goin' goin'.
Right, meanwhile it's, oh dang it.
I gotta go to the bathroom, boom.
That's when you figure something out.
Is when you've stepped away, right?
You step away you realize I need
to step away a little
bit more often, you know?
'Cause we all need to step away, anyways.
Last thought, tryin' to save the oceans.
Sometimes there's labels,
respect those labels.
Including recycle labels,
some of y'all usin'
recycle cups or whatever
hey, try to put it in a cup.
Try to put it in a bottle, if you're not,
if it's a recycle, make
sure it goes in the recycle.
Thank you, want a hug?
(audience applauds)
- Yeah very well done.
- Thanks hi everybody, I'm Tim.
I write code for the
Wharton School, various
open source projects and
most importantly for myself.
I helped organize DjangoCon US here
and also the Philadelphia
Python Users Group back home.
I started writing code
when I was six years old
because my mom won a raffle and I've been
doing it ever since
and absolutely love it.
I consider myself very lucky to get paid
to do what I love with
colleagues and friends
that are close enough to be considered
like a second non-DNA family.
So let me show you my
GitHub chart from 2014.
Does that look like the chart of somebody
who likes to write code a lot?
The truth is in 2014 I
wasn't doing what I loved.
I had been slipping
further and further away
from doing what I love for a long time.
And you know what, I still worked hard
and had a lovely home. Squaring
who I thought I was and who I should be
with how I acted was getting
harder and harder every day.
And looking in the mirror was getting
harder and harder every day.
So it was because of this that
I had to change just one thing.
So what happened here, for many many years
I had been trying to change just
one thing on my own and failing.
Finally in April I asked for help
with my alcohol addiction problem.
In May I got outta rehab. (laughs)
And you can see a sudden
uptick in activity there.
And I've been clean and sober
one day at a time ever since.
(audience applauds)
So. (laughs)
Thank you, so what happened here
in July and August, I
made a terrible terrible
tragic mistake and reopened
my World of Warcraft account.
(audience laughs)
So 2016 looks like it might be
a bit of a lighter year
but I was contributing
to several private repositories
and me and my friends from Wharton
were a little busy helping host
this conference you may have heard of,
the year it was in Philadelphia.
But when we take a look at 2017
this is where I really
started to hit my stride.
Having gotten more involved with a bunch
of open source projects and from Wharton
we also started to put
some of the packages
we had been developing
internally on GitHub and PyPI.
And if you take a look
at 2018, it's up to,
so the first year was 21
contributions in the year.
1053 in the last year
so as you can see now,
I'm pretty active again
and doing what I love.
Writing code with a community
of friends old and new.
It's been such a wonderful way to connect
and just such a better life.
So what I just really wanna do here
is ask everybody if there is one thing
that you could change about yourself,
one thing that you don't
like what would it be?
And how much better could your life look
if you could change just that one thing?
It doesn't have to be
an addiction like I had.
It doesn't have to be, only you
will really know what it is when you look
at the mirror at night
but I wanna encourage
everybody here to if you do have
that one thing that you wanna change
about yourself, it is worth doing.
It is worth asking for help for.
I couldn't do it until I asked for help.
And I've had support not just
from the rehab center I went
to and not just the recovery
communities I've been in.
My colleagues at work have
been incredibly supportive.
The Django community which
I've immersed myself in
ever since has been incredibly supportive.
And it's been an amazing amount of support
everywhere I've looked to try to keep
this going one day at a time.
And the same thing could happen for you.
'Cause if you need help
to make that change,
absolutely ask for it 'cause
it really did save my life.
And you know as Gandhi said, we must
be the change we want to see in the world.
So if you have a problem, sometimes
changing one thing can change everything.
And it really did for
me, thank you very much.
(audience applauds)
- Hello everyone, I'm a software developer
and I write software but occasionally
I do not only need to write software,
I also need to write documentation
or put up a website for the
software that I've written.
Both things usually include screenshots
and doing screenshots is cumbersome.
You wanna do them on
different screen resolutions,
you might wanna do them
with different locales.
And once you're all
done having created them
they're outdated and you
have to start over again.
So let's automate this,
and let's automate it
with tools that we already know
or that probably most of
you have already seen.
And the first tool that
we wanna use is Selenium.
You might know Selenium as something
to write front-end tests
for your web application.
Basically a way to remotely control
your browser from a program.
Our next ingredient is Chrome Headless.
Chrome Headless is a way to start Chrome
without requiring a display
or anything attached.
So you can run it on a
server without needing
to have a full blown
desktop operating system.
And then we use pytest,
pytest is a test runner.
But if you look at it
closely it's not only
a test runner but makes it easy
to define tests and run them,
to run them in specific ways.
But basically it's a way to run
functions in a specific way.
So usually you have a test that is just
a Python function prefixed
with test underscore
and that checks for something.
And then you run pytest and it finds
all of those functions in your project
and runs them and informs
you about the result.
But pytest can do so much more.
It can have parameters for
input and fixtures and so on.
Fixtures for example are
things that you wanna,
that should be there
before the test is run.
So for example on this,
in this code snippet,
we are testing the SMTP library.
And we have a fixture that creates
an SMTP connection object
and then we have a test
that declares that it needs
this fixture passed as an input.
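The slide itself is not captured in this transcript, so here is a minimal sketch of the fixture pattern being described. It assumes pytest is installed and that the library in question is Python's standard `smtplib`. The helper name `make_client` is made up for illustration, and the client is deliberately never connected so the example runs without a network.

```python
import smtplib

import pytest


def make_client() -> smtplib.SMTP:
    """Create an SMTP client object; no host is passed, so nothing connects."""
    return smtplib.SMTP()


@pytest.fixture
def smtp_connection():
    # Everything before the yield runs before the test, everything after it
    # runs as cleanup once the test has finished.
    client = make_client()
    yield client
    client.close()


def test_client_starts_disconnected(smtp_connection):
    # Naming the fixture as a parameter is how a test declares it needs it.
    assert smtp_connection.sock is None
```

Running `pytest` in the folder would discover `test_client_starts_disconnected` by its prefix and hand it the object produced by the fixture.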
And for Django there's pytest-django.
And in Django itself
there's LiveServerTestCase
which allows for creating
a test case that exists
as an actual runserver-like
version of your app.
So with all these ingredients,
we already know them, they
are usually used for testing.
In this case we will use
them for screenshotting.
And to make it a bit nicer
we use pytest configuration
to redefine some strings for example.
We wanna define our screenshots
in scene files with short
functions and not in a, we
don't wanna call it a test.
And later we can use
fixtures to create objects.
Because nobody's interested in screenshots
of an empty application, we want
to populate the data
in some way beforehand.
And we can define and as another
fixture we define the Selenium client
that is already logged into our system.
We can also define our screen resolution
in a fixture as well as any other options
that we wanna pass to Chrome.
And then we can define every one
of our screenshots by simply writing
something like a pytest test, specifying
what fixtures we wanna be run before.
And then just calling a utility function
that does a screenshot with Selenium.
And we can run it by just
calling pytest on our folder.
And we're done, and
with pytest parameters
for example you could do
this for every language
that your project supports
or for every theme
that your project supports or whatever.
And you end up with a
bunch of screenshots.
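Putting the ingredients together, a screenshot "scene" might look roughly like the sketch below. This is not the speaker's code. The file and function prefixes (`scene_`, `shot_`), the output folder, the locale list and the window size are all assumptions. `live_server` is the fixture provided by pytest-django, and the `--lang` switch is only one possible way to ask Chrome for a locale. Populating data and logging in are left out.

```python
# Assumed pytest.ini, so pytest collects "shots" instead of tests:
#   [pytest]
#   python_files = scene_*.py
#   python_functions = shot_*
from pathlib import Path

import pytest

OUTPUT_DIR = Path("screens")  # hypothetical output folder


def screenshot_path(name: str, locale: str) -> Path:
    """Where the image for one scene and one locale is written."""
    return OUTPUT_DIR / locale / f"{name}.png"


@pytest.fixture(params=["en", "de"])  # one run per locale; the list is made up
def locale(request):
    return request.param


@pytest.fixture
def chrome(locale):
    # Imported here so this module still imports where Selenium is missing.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument("--window-size=1280,800")
    options.add_argument(f"--lang={locale}")
    driver = webdriver.Chrome(options=options)
    yield driver
    driver.quit()


def take_screenshot(driver, name: str, locale: str) -> Path:
    """Utility that saves the current browser view as a PNG file."""
    target = screenshot_path(name, locale)
    target.parent.mkdir(parents=True, exist_ok=True)
    driver.save_screenshot(str(target))
    return target


def shot_dashboard(live_server, chrome, locale):
    # live_server serves the Django app on a real port for the browser.
    chrome.get(live_server.url + "/")
    take_screenshot(chrome, "dashboard", locale)
```

Because the `locale` fixture is parametrized, every shot function that depends on it runs once per listed locale, which is where the "bunch of screenshots" comes from.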
And for the application I'm working on
you can find the repository
with the screenshot definitions
and also the utility code
that is a bit more complicated
than I've shown here in
this GitHub repository.
And thank you very much,
if you have questions
on that feel free to talk
to me in the hallway.
(audience applauds)
Thank you.
- Hello, it's time for our story.
Today's story is that's not my emoji.
That's not my emoji,
its head is too shiny.
This is two emoji put
together, this is not an emoji.
This is not my emoji, its
face is far too animated.
Can you make a unicorn
sound? (imitating horse)
(audience laughs)
Emojis do not have sound.
This is not my emoji, its existence
as a character in a movie
is far too distressing.
(audience laughs)
You see when a studio loves a
marketing opportunity too much
they can make a terrible movie out of it
where a bunch of the main
characters aren't actually emoji.
(audience laughs)
That's not my emoji, it
is partying far too hard.
You see the pretty bird, he's dancing.
Emoji do not dance.
(audience laughs)
That's not my emoji, its
permutations are under-documented.
(audience laughs)
Can you wave at the kai emoji?
(audience laughs)
Hello.
That's not my emoji, its ligature
is only vendor-implemented.
(audience laughs)
You see the cat drinking coffee?
Cats don't drink coffee.
(audience laughs)
And this only appears on
Windows 10 operating systems.
(audience laughs)
That is my emoji, its
codepoint is so standardized.
And that's the end of our story.
(audience laughs and applauds)
- Okay sorry about that, so yeah.
This is a talk about Crypto, but it's not
a talk about that stupid new thing.
This is a talk about the original Crypto,
which of course is
cryptozoology, the study
of cryptids or legendary
stroke mysterious creatures.
You know we're talking bigfoots
and chupacabras and the
Michigan dog-faced man
and all of these wonderful things.
And the reason I'm interested in cryptids
at the moment is last week I was in Ohio
and I went bat detecting
with my wife Natalie
in the woods in the dark and it was dark
and it was the woods in Ohio.
And we realized that we hadn't
really done our research.
And this is America, there
are weird creatures out here.
We didn't know if we were within the range
of any of these mythical creatures.
And I'd like to know so that
I can greet them and say hi.
So I obviously hit Twitter
and on Twitter I said,
so just out of interest, oops.
Does anyone know where
I can obtain range maps
of cryptozoological creatures like Yetis
and chupacabras and so forth?
(audience laughs)
And then I asked a question about it
on Ask MetaFilter and I realized
that actually no the
internet does not have
a conclusive source of range maps
for different cryptozoological creatures.
So I started a GitHub repository.
And this is...
(audience laughs)
This is a GitHub repository which I would
actively encourage
people to contribute to.
Where I am trying to get a directory
of information on
cryptids and their ranges.
And so the way I'm doing this
is using a file format called GeoJSON.
It's a brilliant format, a way
of representing
geographical shapes in JSON.
So let's take a look at
the Loveland Frogman.
This is my GeoJSON file
for the Loveland Frogman.
The great thing about GitHub is
that GitHub knows how to render GeoJSON.
So it's rendering this shape for me.
But if I click on that shape I can see
that it's got a Wikipedia URL,
the name, it was first sighted in 1955.
Last seen in 1972 and it's a humanoid frog
described as standing
roughly four feet tall.
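For readers who have not seen the format, a GeoJSON file of this kind is a feature with a geometry and free-form properties. The sketch below builds one in Python. The property names and the coordinates are invented for illustration (a small box roughly around Loveland, Ohio) and are not taken from the speaker's repository.

```python
import json

frogman = {
    "type": "Feature",
    "properties": {
        "name": "Loveland Frogman",
        "first_sighted": 1955,
        "last_sighted": 1972,
        "description": "Humanoid frog, roughly four feet tall",
    },
    "geometry": {
        "type": "Polygon",
        # One closed ring of [longitude, latitude] pairs: the first and last
        # positions are identical. These coordinates are made up.
        "coordinates": [[
            [-84.30, 39.25], [-84.22, 39.25], [-84.22, 39.30],
            [-84.30, 39.30], [-84.30, 39.25],
        ]],
    },
}

geojson_text = json.dumps(frogman, indent=2)
```

Note that GeoJSON positions put longitude first and latitude second, which is easy to get backwards.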
So that was my first cryptid, I have been,
you know what I've got some pull requests.
Russell has sent me a pull
request adding the drop bear.
I'm going to merge this right now.
And if I'm lucky it will deploy
by the end of my lightning talk.
(audience laughs)
I should have
done this a minute ago so
we now have a drop bear.
This is very exciting,
so you get the idea.
So then what am I doing with this data?
Well so I'm working, I've been working
on this open source
project called Datasette
which is an application that takes
a SQLite database and gives
you a UI for exploring it.
And it gives you an API for getting
the data back out as well and it's
the perfect application for cryptozoology.
So what I've done here
is this GitHub repository
has Travis set up to run a script
any time I commit anything and the script
reads in the GeoJSON and writes
it into a SQLite database.
It's all of what, 123 lines of code.
There was not a lot to this, and so
it builds a SQLite database
with all of these cryptids.
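The build script itself is not shown in the transcript. As a rough idea of what "read GeoJSON, write SQLite" can look like, here is a sketch using only the standard library. The table layout, the property names and the sample feature are assumptions, and the geometry is simply stored as text rather than as a spatial column.

```python
import json
import sqlite3

# A made-up sample document standing in for one file in the repository.
SAMPLE_GEOJSON = json.dumps({
    "type": "Feature",
    "properties": {"name": "Loveland Frogman", "first_sighted": 1955, "last_sighted": 1972},
    "geometry": {"type": "Point", "coordinates": [-84.26, 39.27]},
})


def build_database(geojson_documents, db_path=":memory:"):
    """Load GeoJSON Feature documents into a single SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS cryptids (
            name TEXT PRIMARY KEY,
            first_sighted INTEGER,
            last_sighted INTEGER,
            geometry TEXT  -- raw GeoJSON geometry, stored as text
        )
        """
    )
    rows = []
    for document in geojson_documents:
        feature = json.loads(document)
        props = feature["properties"]
        rows.append((
            props["name"],
            props.get("first_sighted"),
            props.get("last_sighted"),
            json.dumps(feature["geometry"]),
        ))
    with conn:  # commits once all rows are inserted
        conn.executemany("INSERT OR REPLACE INTO cryptids VALUES (?, ?, ?, ?)", rows)
    return conn


conn = build_database([SAMPLE_GEOJSON])
```

Pointing `db_path` at a file instead of `":memory:"` would produce a database file that a tool such as Datasette could then serve.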
It deploys it and then based on this
I can build out an API so here is,
the nice thing about Datasette,
everything's SQLite.
So the API is just a SQL query.
Here is a SQL query that selects
details of cryptids where the geometry
overlaps the, where within GeomFromText.
So where the geometry overlaps a point.
Here's our current latitude
and longitude right now.
And if I run that query
I get back the Bigfoot.
It turns out America has a
lot of Bigfoot sightings.
So this is now a JSON API which you can
feed latitude and longitude
to and it will tell you
which cryptozoological creatures
you are within range of.
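The real lookup runs as a SQL query inside the database. To show the underlying idea without any spatial extension, the sketch below swaps in a plain-Python ray-casting test for "is this point inside that range". The two sample ranges are invented rectangles, not real data, and the function names are hypothetical.

```python
def point_in_ring(lon, lat, ring):
    """Ray casting: count how many ring edges a ray going east would cross."""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        if (y1 > lat) != (y2 > lat):  # edge spans the point's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def cryptids_near(lon, lat, ranges):
    """Return the names whose range polygon contains the given point."""
    return [name for name, ring in ranges.items() if point_in_ring(lon, lat, ring)]


# Closed rings of [longitude, latitude] pairs; both boxes are made up.
SAMPLE_RANGES = {
    "Bigfoot": [[-118.0, 32.0], [-116.0, 32.0], [-116.0, 34.0],
                [-118.0, 34.0], [-118.0, 32.0]],
    "Loveland Frogman": [[-84.30, 39.25], [-84.22, 39.25], [-84.22, 39.30],
                         [-84.30, 39.30], [-84.30, 39.25]],
}

# Approximate coordinates for downtown San Diego.
nearby = cryptids_near(-117.16, 32.72, SAMPLE_RANGES)
```

This treats coordinates as flat x and y values, which is fine for small shapes but is a simplification compared with a real spatial index.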
Here's Travis, oh look Travis is
building the drop bear right now.
So if we're lucky that'll be built
and deployed in just a moment.
I'll talk about Bigfoot quickly.
It turns out there is already a database
of Bigfoot sightings run by
the Bigfoot Research Organization.
Who are banned from Twitter, if you try
to tweet a link to BFRO.net
you get an error message.
So clearly Twitter are part
of a coverup conspiracy
trying to hide,
(audience laughs)
trying to hide the existence of Bigfoot.
But they've got 3000 sightings.
They publish a KML file, a KML file
has latitudes and longitudes in.
I happen to know that the
range of Bigfoot is 15 miles
from a conversation I had at
the Bigfoot Discovery Museum
outside Santa Cruz,
thoroughly recommended.
If you take those 3000 points and put
a 15 mile radius on them, you can see
that Bigfoots have been sighted
across much of the United States.
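One way to apply a 15 mile radius to a list of sighting points is a great-circle distance check. The sketch below uses the haversine formula with a mean Earth radius of about 3958.8 miles. The sighting coordinates are invented, and this is only a stand-in for whatever the speaker actually did to build the map.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean radius, good enough for this purpose


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two latitude/longitude points, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def within_range(here, sightings, radius_miles=15.0):
    """True if any sighting lies within the radius of the given point."""
    lat, lon = here
    return any(
        haversine_miles(lat, lon, s_lat, s_lon) <= radius_miles
        for s_lat, s_lon in sightings
    )


san_diego = (32.72, -117.16)         # approximate
made_up_sightings = [(32.80, -117.10), (39.27, -84.26)]
bigfoot_nearby = within_range(san_diego, made_up_sightings)
```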
Everywhere in Florida, it turns out
everyone in Florida's seen a Bigfoot.
(audience laughs)
And here in San Diego.
So when I ran that
query with our San Diego
coordinates, I did get back the Bigfoot.
I've got an iOS shortcut
that isn't working
at the moment so I can say
hey Siri, check for cryptids.
That's very useful.
(audience laughs)
And if you want something a
bit more useful, I do have
a version of this that
works for timezones instead.
So please, take a look at the project.
Draw maps of cryptozoological creatures.
Since they probably don't exist,
you don't have to be very accurate.
(audience laughs)
And let me know how it goes on.
And before I get hugged I'm just
gonna quickly see if Russell's
drop bear has deployed yet.
(audience laughs)
Unfortunately it looks
like drop bear did not quite make it.
But in a couple of minutes
there will be a drop bear.
So thank you very much.
(audience applauds)
- All right.
- [Announcer] Begin.
- Thank you, okay this
is the first presentation
I've given in a very long time.
Right this is from something I did
literally eight years ago so I'm
gonna try and do this really quickly.
In 2010 I was at a place called
the Santa Fe Institute
for a summer program
and they asked us to create
various research projects.
So me and some friends
stuck something together
in three weeks which was a
Django app which you'll see.
Four years, that should have said,
yeah four years later I got diagnosed
with a really complicated situation
which is why I've had three surgeries
and why I can't have this part
of my glasses on that side of my face.
Long story, still managed
to get a PhD in sociology.
Now I'm trying to figure out what to do.
It went so well, my most recent surgery
in June that I'm here
and I'm really really
happy to be here.
(audience applauds)
And my GitHub name is
Spool, if any theater nerds
who know Krapp's Last
Tape get that reference,
please see me after
because very few people do.
Right okay so the project was to try
to study how people collaborate creatively
and how they might respond to each other
in making a design and
I'm gonna skip over that
'cause that was a lot
of, again some things
I had before you can ask me stuff about.
But basically if we're working together
and making a design,
what do, how do people
respond to each others' things?
People may be aware of
the whole Reddit project
that came a couple years ago.
This came before that technically in 2010
and you'll see some differences
in the user interface.
So the idea was we printed
a T-shirt at the end.
It was gonna be a black T-shirt by default
and then we were gonna
put some designs on it.
We wrote it in Django and a
language called Processing.
Which is really cool for doing some
user interface and sound experiment stuff.
There was a Processing.js back then
which was really slow, there's now p5.js
which I really recommend
people try if they like to.
What we did is we created a grid
with 64 cells, squares that people
could do their own designs on
and that whole thing became the canvas.
And you could only see
your Moore neighbors.
Can I see a pair of hands or hands
for anyone who knows what
a Moore neighborhood is?
Okay, ah there's one dude, do you
wanna tell everyone what that is?
(audience member murmuring)
Fine, okay.
(audience laughs)
That's cool, you'll see it in a second.
So this is just an
example of how we run it
in the museum recently,
so you get assigned
a square in this big grid and you
can only see the edges of your neighbors.
So if you, there's the ones to the left,
right, top and bottom and then
there's the ones on the corners.
So those eight are the
Moore neighborhood.
So you can press spacebar, do a design.
And then press spacebar
at the end and you see
how your design fits alongside
the designs of everyone else.
Okay, so press play
again, so that was that.
So that was the, you could log back in.
So when we ran it at a museum
you could only do one design and then
see the whole canvas, the way we ran it
as a website is you could log in,
you could only see your neighbors.
Then if you logged in the
next day you might see
how your neighbors changed
and then you could respond.
So we had a couple of problems.
If someone logged in and then left it
for an hour and then logged back in,
their neighbors might have changed.
And then they would
log in and we ended up,
they would accidentally not respond
to the most recent changes
of their neighbors.
We should have done that with websockets.
We also didn't take
into account timezones.
So someone's laptop was
on a different timezone
and they kept overwriting everyone else's.
We didn't know why, that
was really frustrating.
So this is what it looked like.
(cheerful acoustic music)
So each of those cubes are
individual peoples' designs.
Oh the music.
(cheerful acoustic music)
That was when the dude tried to do his own
and then accidentally
overwrote everyone else's.
(audience laughs)
And it's a torus so
the people on each side
can actually see each other and the people
in the tops can see the bottoms.
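The eight surrounding cells plus the wrap-around at the edges can be written down in a few lines. This sketch assumes the 64 cells are laid out as an 8 by 8 grid, which the talk does not state explicitly. The function name is made up.

```python
def moore_neighbors(row, col, rows=8, cols=8):
    """The eight cells around (row, col) on a grid that wraps like a torus."""
    return [
        ((row + d_row) % rows, (col + d_col) % cols)
        for d_row in (-1, 0, 1)
        for d_col in (-1, 0, 1)
        if (d_row, d_col) != (0, 0)  # skip the cell itself
    ]


corner_neighbors = moore_neighbors(0, 0)
```

The modulo is what makes it a torus: the top-left cell at `(0, 0)` gets `(7, 7)` as its diagonal neighbor instead of having only three neighbors.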
(cheerful acoustic music)
Okay, thank you very much.
(audience applauds)
- Well done, well done.
(audience applauds)
Good job.
- So that was the final thing.
If anyone's interested in the music
you just heard that's by a really
incredible composer named Julius Eastman
who sadly died in poverty
in 1990, look him up.
Thank you.
(audience applauds)
- Hello, I'm Ernest, I
really like to tweet.
You might think that
tweeting is my full time job
based on this incredible brand engagement.
But in reality I work for the
Python Software Foundation.
I also tweet a lot and so
if you wanna follow me,
that's where you do it on Twitter.
But yeah the Python Software Foundation.
So the Python Software
Foundation if you're not aware
is a nonprofit that controls
the intellectual property,
copyright, trademark et cetera
for Python the language.
Also it is a grant giving nonprofit.
And so we raise money and we pour
that money back into the community
in order to support events like this,
meetups, and smaller events as well.
There's a larger event associated
with the Python Software Foundation
called PyCon, if you're not familiar
with PyCon it was in Cleveland in 2018.
It's gonna be in Cleveland in 2019.
And I wanna talk briefly about that.
Currently the call for
proposals for PyCon is open.
You can check that URL
out to go submit a talk,
tutorial, education summit,
presentation poster,
or a charlas, or a talk in Spanish.
Speaking of talks in
Spanish, the Python Charlas
came out of the PyCon Hatchery Program
and so this is kinda what I
really wanna pitch to you.
The PyCon Hatchery Program is a way
for your ideas to be realized in PyCon.
So if you've ever been to PyCon
and not seen something
that you wanted to see,
this would be the way
that you would tell us
what you wanna see and perhaps
even propose to do the work to make it so.
So please check out the Hatchery
Program and read more about that.
So you might be saying okay. (stammers)
This is Django Con, so what about Django?
The Python Software
Foundation loves Django.
We actually, so I'm the
Director of Infrastructure.
And so that's like a bunch of services
that are out on python.org, out of that
seven of them are currently
written in Django.
The PyCon website is written
in Django and I love Django.
Admission, I haven't always loved Django.
(audience laughs)
But it's gotten much easier
in the past few years.
And so being here and being among people
who are interested in or experts
in Django has been really exciting.
And so I'm also gonna come up
here and ask for your help.
I frequently tweet and sometimes
I tweet about asking for help in Django.
More often than not when I'm tweeting
about Django it's asking for help.
So help, please, and when I
say this I mean this sincerely.
If you are in this room it is probably
because Django has either made you feel
like you have superpowers or you want
to feel what it feels
like to have superpowers.
And I want you to be involved in Django
and the PSF and PyCon
if any of that's true.
You might just be starting out
and I'd love to work with
you to get you started.
You might be an expert and I would love
to work with you to get a little bit more
of that expertise and so that
we can all share and grow.
So there's also a bonus, it's not Django.
But PyPI is a piece of software
many of you in the room
might be familiar with.
And if you're at all interested
in contributing to PyPI,
we have stickers now.
(audience laughs)
You can... (laughs)
You can check out a microsprint
that's gonna be occurring at lunch
with myself and Dustin
Ingram and that's it.
I'm Ernest and you can
follow me on Twitter there.
(audience applauds)
- Okay so I'm going to talk to you folks
about one plus one equals one
or record deduplication with Python.
This is a 45 minute talk, I
will make it in five minutes.
Not sure if it will work.
(audience laughs)
So real world data is a mess, probably you
dealt with data like this before.
Those are restaurants, restaurant records.
And you see here that
clearly those four records
here from zero to three are duplicates.
'Cause the name is quite similar,
the address is similar, city can vary.
So we have duplicates here,
real world data is a mess.
We don't have unique identifiers.
And the solution is to
perform Deduplication,
also known as record
linkage, to join records
in a fuzzy way using data
like names and addresses.
Mostly we will deal with
those kind of things
but it can be other kinds of data.
And to solve that we should do
some fuzzy comparison of strings.
We can use algorithms like
Jaro-Winkler similarity.
If I compute the Jaro-Winkler similarity
between those two similar
strings I get this high number.
And with those different
strings I get this lower number.
So I can use that as a tip for me
as an indication of
similarity between records.
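The strings and scores from the slide are not in the transcript, so here is a self-contained sketch of Jaro-Winkler similarity in plain Python. It uses the common prefix scale of 0.1 over at most four characters. Library implementations can differ in details (for example, some only apply the prefix bonus above a minimum Jaro score), so treat the numbers as illustrative.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: 1.0 for identical strings, 0.0 for nothing in common."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching if they are at most this far apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        start = max(0, i - window)
        stop = min(len2, i + window + 1)
        for j in range(start, stop):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for strings that share a prefix (up to 4 chars)."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


similar = jaro_winkler("MARTHA", "MARHTA")
different = jaro_winkler("abc", "xyz")
```

`MARTHA` and `MARHTA` are the usual textbook pair for this measure and score close to 1, while strings with no characters in common score 0.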
I can do also fuzzy
comparison of addresses.
And the trick is to geocode them
to latitude longitude and
that'll allow us to clean
irrelevant address variations
like a small variation
on the street number
or something like that.
And to enable the calculation
of geometric distance
using latitude and longitude because
we want to group and match
things that are close together.
And if I geocode those two addresses here,
you can see that although
they have variations
and even typos the latitude
and longitude is the same.
And the zip codes are also the same.
I can grab this from Google
geocoders for example.
Okay now into the process
of Deduplicating a data set.
First we need to preprocess it.
We will use the restaurant data set.
It contains 881 restaurant records
from the Fodor's and Zagat's guides.
And it contains 150 duplicates
and we want to find those duplicates.
The data set looks like this so it comes
with the cluster column which
is the truth about this data.
Of course we will remove
that and we will also
remove the phone column
because it would make
things too easy for us, and we will
be left only with name, address and city.
And we want to Deduplicate
using only that.
First we clean just using some
regexes to clean the name.
We will geocode all
the addresses so we get
the postal code, latitude and
longitude for the addresses.
And then we can move to the next step
on the record linkage
process which is indexing.
We will use the library recordlinkage,
also known as Python
Record Linkage Toolkit.
And we have the cleaned
up records, now we want
the pairs to compare to find matches.
To produce the pairs we
could do a full index.
We could compare all
records against all records.
Of course that's slow but we don't have
enough time to think about a smarter way
to index so we just produce all
records against all records.
And the pairs will look like this.
We compare zero with one, zero
with two and there it goes.
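The talk uses the Record Linkage Toolkit for this step. As a library-free illustration of what a full index produces, the sketch below generates every unordered pair of record ids with `itertools`. The function name is hypothetical.

```python
from itertools import combinations


def full_index(record_ids):
    """Every unordered pair of records: n * (n - 1) / 2 pairs in total."""
    return list(combinations(record_ids, 2))


pairs = full_index(range(4))          # starts with (0, 1), (0, 2), (0, 3), ...
pair_count_for_dataset = len(full_index(range(881)))
```

For the 881 records mentioned above that is 387,640 pairs, which is why a full index gets slow quickly on larger data sets.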
Now running the comparisons,
we want to compare the pairs
to get a comparison vector for each pair.
So a comparison vector looks like this.
The name similarity is 0.5.
The address similarity 0.8 and
that's our comparison vector.
And to compute the comparison vectors,
we define similarity
functions for each column.
We can do that with the
Record Linkage Toolkit.
We use Jaro-Winkler for
name, address, postal
and an exponential decay geometric
similarity between latitude and longitude
and that's what we get if we run that.
Now with the vectors we want to explore
different ways to classify
them as matches and nonmatches.
And we can do some simple
threshold-based classification.
We can compute a weighted
average over those vectors.
And by looking at data we see
that those three are matches.
So we just consider anything with more
than 0.9 of a score as a match.
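A threshold classifier over comparison vectors fits in a few lines. In this sketch the vectors, the column names and the weights are all made up. Only the 0.9 threshold comes from the talk.

```python
def weighted_score(vector, weights):
    """Weighted average of the per-column similarities in one vector."""
    total_weight = sum(weights[column] for column in vector)
    return sum(vector[column] * weights[column] for column in vector) / total_weight


def classify(vectors, weights, threshold=0.9):
    """Return the record pairs whose weighted score is above the threshold."""
    return {
        pair for pair, vector in vectors.items()
        if weighted_score(vector, weights) > threshold
    }


# Hypothetical comparison vectors keyed by (record, record) pair.
sample_vectors = {
    (0, 1): {"name": 1.0, "address": 0.9, "latlng": 1.0},
    (0, 2): {"name": 0.5, "address": 0.8, "latlng": 0.2},
}
sample_weights = {"name": 2.0, "address": 1.0, "latlng": 1.0}

matches = classify(sample_vectors, sample_weights)
```

Here the first pair scores 0.975 and is kept, while the second scores 0.5 and is dropped.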
And if we do that and
compute from the truth
about this data we see that we got
128 true positives and
two false positives.
Only two false positives
and 22 false negatives.
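To put those counts in more familiar terms, precision and recall follow directly from them. The small sketch below only does that arithmetic, using the three numbers quoted above.

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision: share of predicted matches that are real.
    Recall: share of real matches that were found."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


precision, recall = precision_recall(tp=128, fp=2, fn=22)
```

That works out to a precision of roughly 98 percent and a recall of roughly 85 percent, and 128 plus 22 accounts for all 150 known duplicates.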
So it's quite good performance
but there are smarter ways
to solve that problem.
Make sure to check active
learning classification.
It will help you a lot because
it will allow you to build
a training set for your data.
(bells ringing loudly)
Thank you very much.
(audience applauds)
- [Announcer] Thank you.
- Okay so I'm Filipe,
I'm a partner at Vinta.
And let's talk about parks, so I have
an image here of two
layouts for a park trail
or something like that
and I want you to think
about which one of those you think
is a more pleasant trail for a park.
So to be honest this is
very much a trick question.
I didn't give you enough
information to answer that.
And the reason is you don't
know anything about the terrain,
you don't know anything about
if there are trees around.
You have also no idea of
what's going on in that area.
So to answer that kind of questions
you need something like this book.
Which is called A Pattern Language,
by Christopher Alexander,
and it's from 1977.
And this guy, he defined a series
of patterns and things to help architects
to build and design and
do good architecture.
So for example you can
take the pattern 120,
which is about paths and goals.
And it says the layout of paths
will seem right and comfortable
only when it's compatible
with the process of walking.
And the process of walking is far
more subtle than one might imagine.
This is very interesting,
especially because
it can be visualized through
this thing called desire paths.
Desire paths are like this image,
they show how people interact
with the place they live.
Through like these
patterns in the terrain.
So for instance here we're seeing
there are two paths going,
one that goes straight
to the door but there is a fence.
And another one that goes around.
So probably at some point
the fence didn't exist there.
And people just changed the way they
went inside the house after that.
Also this other example, so in this case
people are going around the trees.
And here it's very interesting
because they got the desire path
and actually made it a fixed
path and a proper path.
So one thing to remember and this comes
from this idea of patterns and paths.
Is that when you see a big street,
mainly when it's an old one,
a main avenue in a city.
Probably at some point that was
a bare earth road in the woods.
And maybe before that
there was just a path
where people and animals pass by.
And so those are patterns
and this is from Christopher,
the author of the book and he says,
this idea comes simply
from the observation
that most of the wonderful
places of the world
were not made by architects but by people.
So the book is really very based on this.
On how architecture is made
by people and constructed naturally.
So the book has other patterns.
And for example the path shape one.
There is seat spots, so when you, sorry.
It's missing, okay so there's another part
that's missing here which
is all about language.
So it's a pattern language, we talk
about pattern and language.
It's just like when you have words.
You have words that have
separate meaning by each of them.
And when you group that together
you get a lot more
meaning through language.
So language is as a group, you group words
to convey a lot more meaning.
And that's where we
get to Design Patterns.
So Design Patterns was actually created,
the book we know the
software book we know.
It was actually created
based on Christopher's book.
And I don't have a lot of time
to talk about Design Patterns here
but that's just where the idea comes from.
So let's just jump to some takeaways.
So first Design Patterns are
not created, they emerge.
So just like Christopher's
patterns, our design patterns are just
observation from how people code.
Design Patterns are a
tool for communication
between programmers
through speech and code.
They're not to brag about, they are
to help you communicate
with your teammates.
And the last thing I will leave to you
is that quote that says good architecture
is about improving people's lives.
And most of the time this means
what feels more natural
or more pleasant to us.
And this applies both to
architecture and to programming.
Thank you.
(audience applauds)
- [Announcer] Thank you.
(audience applauds)
(electronic jingle)
