Okay, thanks for coming. Today Dan from Alteryx is going to talk about the election... data mining of the election. So, Dr. Dan Putler: he's the chief scientist at Alteryx, and he's actually responsible for developing and implementing the product roadmap for predictive analytics.
He has over 30 years of experience in developing predictive analytics models for companies and organizations that cover a large number of industry verticals, ranging from the performing arts to B2B financial services, so quite diverse.
He is actually a co-author of the book 'Customer and Business Analytics: Applying Data Mining for Business Decision Making Using R'... okay, very related... which is published by Chapman and Hall/CRC Press. Prior to joining Alteryx, Dan was a professor, so he's from an academic background, of marketing and marketing research at the University of British Columbia's Sauder School of Business. And before that, he was also a professor at Purdue University. So let's welcome Dan.
[Dan Putler]
Thank you. 
[clapping]
Thanks for having me. Let me... you know, 
it's sort of interesting. When I came up with this 
topic,
it was something that I had been working on, 
and I'll explain why we were working on it. 
It's not typically part of my day-to-day activities, 
although it increasingly is a little bit part of my 
day-to-day activities. And we'll talk a little bit 
about 
why we did it. I wanted to present this because it was closest to me at that moment, since I had spent a lot of time working on this particular application. And the one thing that kind of worried me was that everyone would be really sick of this election at this point and no one would really want to hear about it, and then, well, Tuesday happened.
And one of the things that we're now dealing with: just yesterday we got a request through our public relations agency, did we want to talk with a journalist from the AP, and the journalist's topic was that we're becoming really dependent on predictive analytics models, aren't they crap? And for some reason our COO didn't really want me, or us, involved in that conversation.
So I think in some sense, one of the things that we're going to see today is that while the polling data wasn't what it was supposed to be, it turns out the polling data was actually very predictive. And we'll get into that as well, since over the past several days one of the things I've been taking advantage of is the kindness of National Public Radio, who on their website had a really easily dealt-with JSON-format file with all the county-level election results. So one of the things we'll be looking at is how we actually did with this particular election app and how that went. So there are a number of things that we're gonna be covering today.
I have probably two times too much material... so one of the things that my students were always amazed by is how much I can cram into a small period of time. Teaching marketing research in five weeks taught me how to do that. It was painful for them; I'm sure it'll be painful for you. So I'm going to talk a little bit about myself, probably not in the way you would think, as I get into this. So that's the first thing I will cover, to set the scene.
Then what we will be talking about is a little bit about why we sat down and did this. The basic idea of what we're trying to do is take polling data and project results down to really small areas. In the U.S., we can go all the way down to census block groups, because in some sense you have to have some level at which you get reporting of the underlying socio-economic and demographic data. In the U.S., the lowest level you can go down to is a block group, and a block group is an area that typically has somewhere on the order of 1,000 people residing within it. So it is very small geographies that we're talking about. And the method that I'm presenting allows you to sit down and come up with estimates of voter preferences all the way down to that small level, which you can then aggregate up to any level you care to look at, which is different from how existing polling is done.
We'll talk a little bit about what we did with the presidential election app... as a matter of fact, the main way to do that is for me to demo what we did with the presidential election app, if you haven't seen it. And then we'll talk about why we really did it along the way. Altruism is good, but it only gets you so far.
And then we'll talk really briefly about approaches to forecasting election results, largely looking at to what extent you can take those methods and project them down to smaller geographic areas. Then we'll talk about the main part of what we're doing, which is the two-part, bottom-up approach to predicting election outcomes, which starts with really low-level geographic areas and then allows you to build up from that. We'll talk about the process that's required to do that. As the name indicates, it's really two-part in nature, and we'll talk about those two parts.
And then the other thing that we did before... and it's kind of interesting to look at... back in late July, I developed what is known as a fundamentals model of predicting elections, which uses basically historic data on past elections and a bit of other information. We estimated that model; we haven't really presented it before, so I'll present it here, and we'll also see how that model did for this particular election as well. There were little things that were new about that particular application, but nothing earth-shatteringly different.
And then what we'll do is examine the performance of the two different models at the county level for the 2016 presidential election. This is, again, taking advantage of the lovely JSON file that NPR provided us with, which was the easiest scrape I've ever seen to get at the underlying data associated with county-level returns. And there'll be some interesting results associated with that. So, about me: you've got the formal bio, the official Alteryx bio of who I am. Let me tell you a few other things about me. I'm male, I'm sure that's a shock. I'm non-Hispanic white, probably not a shock. I could lie about this, but no, I'm between the ages of 55 and 59. I do have a post-graduate degree, and I reside in Santa Clara County, California.
Now the interesting thing is I also had the following estimated multinomial candidate choice probabilities in last Tuesday's election. There was a 64.3% probability I would vote for Clinton, a 25.9% probability I would vote for Trump, and a 9.8% probability I would vote for a third-party candidate, right? So it turns out, if you look at this, knowing my demographics and where I reside, you can tell a lot about my likely inclination. However, at 65%, or just under 65%, there's still a lot of room for being wrong in guessing about me. So in some sense, one of the things that probably comes out as we deal with the aftermath of the indictment of polling in this past election is: hey, everything's a probability distribution, and it depends on how we happened to sample out of that probability distribution.
And in this particular case, here's my personal probability distribution. Just so you know, if I were a local here, I'd be a little bit less likely to have voted for Clinton, considerably more likely to have voted for Trump, and about the same in terms of voting for a third-party candidate. So there are a number of things that kick in to what's going on here, and we'll take advantage of that.
It also gets us into the underlying nature of what goes on with polling data: we're dealing with data about individual decision-making at the end of the day, and so we're going to take advantage of that to build up the process that we're doing. I'll talk about that a little bit later. The other reason I wanted to go through this example was that it shows that, yeah, there are predictors out there. They do a reasonably good job, but there's still a lot of uncertainty associated with what goes on with those predictors. So now the basic question is: we're trying to do this so you can get down to really small geographic areas in order to get an idea of what's going to go on.
Why would anyone care to do that?
Well, it's interesting to talk about, 
which is one of the reasons why we did it...
but there are a number of things that really begin to matter if you look at running a campaign. Getting accurate small-geographic-area information is important for a campaign's decisions about where to allocate resources for activities that are geographically very specific. When you look at, say, something like buying television advertising time, that's pretty broad; you can survive on very broad geographies, because television markets are not particularly small. I'm assuming that Bloomington is really in the Indianapolis television market. I used to live in Lafayette, which had one TV station, and otherwise you had to go watch the Indy stations as well. So it's a media market that stretches from Bloomington to West Lafayette, and probably a little bit beyond in both directions. For that sort of activity,
you don't really need fine-grained sorts of things, but for things that are really labor-intensive, you typically do need to work at a really small geographic level. So early on in a campaign, when you're trying to persuade hearts and minds, what you're gonna be looking at is where you're going to do door-to-door canvassing, where you want to send your candidate or that candidate's surrogates to talk to people on a door-to-door basis, so you can do the personal sales activity at the door while people are fairly persuadable.
As we get to election day itself, a lot of what goes on is get-out-the-vote efforts. I mean, literally driving little old ladies to the polls becomes a critical deal. So in that particular case, where do you want to make yourself known so that you can get people to the polls? Well, that's going to be in a really micro-geographic area, and so you need to target those efforts accordingly.
Also, while broadcast media works on really broad areas, mobile, bus shelters, buses, that sort of thing, and outdoor billboards are done at a very local level. So where do you want to make sure that you have those local advertisements for your candidate? That's going to be decided on a very low geographic basis, so you're going to need to come up with good choices; you're going to have to be able to micro-target the geographic areas where you want to put that information. You also want to make sure you have plenty of yard signs showing up in the appropriate places, probably places that are 50-50, really on the cusp. You want a large number of yard signs so you can get a herd mentality going in that particular area and potentially swing enough additional votes your way.
So there are a number of things where this is going to come into play. So it does have value in and of itself, but that doesn't explain why we did it; we're not working with any campaign. It turns out we have customers who use our software for doing this sort of thing on both the Republican and the Democratic side. So it's kind of interesting: at the annual users conference, I've talked to one group, and sitting next to them was the other group, and they didn't know it, which was sort of amusing.
[pause]
But if you sit down and look at the real reasons why we did it, well, we did it to create the presidential election app. And so what I'm going to do now real quickly is show that, 'cause it would be a shame if I didn't.
So I preloaded this, but I can do another one. Actually, I will. Having lived in Indiana, I know what will be an interesting contrast to Bloomington in just a moment. So let me go here, and what we can do here is see what is going on in terms of how we calculated what was going to happen in the Bloomington area. Under the hood, this is just using Google's ability to find locations; we had a third-party company work on that part of what we were doing.
The data that you see here are the projections at the census tract level. A census tract typically has somewhere on the order of 4,000 people within it. The data you're seeing here is what's coming out of the model. And so if you look at it here, what we're looking at is how we expect the effects of race, education, age, and gender to influence what's going on. And then if you really have sharp eyes, 'cause I can't really see it from here, so if you take out your binoculars at this moment, you'll see that we expect 9.6% of the voting population to be Asian, 7.9% to be African-American, 3.6% Hispanic, 3.6% other race categories, and 75.3% white. So we have all those sorts of things.
And if you look at it in terms of Bloomington...
[pause]
it's a pretty blue area. Yeah.
[Audience member]
I have a question. So when the Asians are being polled, or any other race, why isn't there a breakdown of race versus age? Everybody in that race did not have the same opinion, so there should be a prediction that in this race, this age group...
[Putler]
Implicitly, there is, and we can talk a little bit 
about what goes on. So underneath the hood, 
what's driving this is actually allowing for 3-way 
interaction effects. So you would definitely have 
the ability to look at the combined effect of 
race, age, and, say, gender. 
[Audience member]
No, no, I was referring to another one. 
So in one race, I now want to drill down to that...
[Putler]
Yeah. We didn't set... we could do that,
but we didn't want to set up a decision cube 
because we wanted to show this really cleanly. 
So does that data exist? Oh, yeah. But...
[Audience member]
Which is what... the candidate 
might want to see.
[Putler]
Yeah. So we have the ability to come up with 
those estimates. We sat down and just broke it
down this way 'cause it was an easy thing to do. 
So that's what's going on here, let me go to  
someplace else just to get a big contrast. 
[pause]
I don't know if they have the same...
[pause]
This proves I used to live in Indiana, right? So we'll sit down and... oh, I have to hit 'Find Me'.
[pause]
Okay. Yes. 
[pause]
So Martinsville, the mirror opposite of Bloomington, at least by Indiana standards. We could go to my neighborhood, which is even a little bluer than Bloomington. And if we go down to the map, what we're going to see is just this incredibly beautiful sea of red... which I guess makes them better Hoosier fans than the people in Bloomington, but that's the way that stuff goes.
We can go look at... yeah, the interesting one: if you want to go see this app, you can get to it at www.alteryx.com/election. Looking up Trump Tower in New York, New York, is sort of amusing. I'm not going to do too much more with this, but you can find areas where you get real checkerboards in the preferences within those areas. Modesto, California, where I grew up, is actually a good town for finding a checkerboard pattern in voter preferences.
So we sat down and created the presidential election app, which raises the question: okay, you didn't do it for a candidate, you created the presidential election app, why did you guys bother to do that? And the real simple answer is 'press'.
So the first one... 'You, Too, Can Be Nate Silver'. This was from 2012, when we did it the first time; we were on the front page of Bloomberg Businessweek. Looking back at it, it's really interesting at this moment why only Samsung builds phones that outsell iPhones... unless they catch on fire. This one is from this election, from Datanami, which is sort of a common place for people to get information about what's going on in the Big Data/Big Analytics world.
We really liked this one, 'cause they went through, in rank order, what they thought was the coolest out there: they started with 538, and then they went to the presidential election app at Alteryx. So we naturally do this from a marketing and public relations point of view, to get people to know who we are, to build positive brand attitude; it's a wonderful thing. Question.
[Audience member]
What was the sample size in the [unknown] area you showed, in Martinsville and Bloomington? What was the total sample size?
[Putler]
Oh, within Martinsville? I have no idea what the sample size was in Martinsville, because what I'm doing is working with polling data and then projecting Martinsville based off the polling data. And I'll get into exactly the size of the sample that I was using. I can tell you there are some really drastic changes going on with polling at this moment that people haven't really digested, and I'll talk a little bit about that. It is one of the things that is kind of interesting at this moment in time.
So how can we make projections? And this actually gets exactly to sample size. What we can do is use traditional polling. By traditional polling, in this particular case, I'm talking about telephone interview polling, which, if you look at 538, Nate Silver's opinion is that it's the gold standard. There are a lot of problems these days doing telephone-based polling; it started to become problematic all the way back in the 1990s, and it's only gotten worse since then. Once upon a time, everyone had a landline. No one had caller ID, it didn't exist, and people tended to pick up the phone and respond to people who called them.
Now, I'm assuming that everyone under a certain age has no landline, is my guess? And for those of us who've got a few gray hairs, do you have a landline? And do you actually ever answer it? No.
[pause]
So that became problematic; caller ID started it, because people would look at the caller ID and say, 'Nope'. And there used to be a whole lot of Mom-and-Pop survey shops... well, they did surveys and they did focus groups, qualitative research. And they kind of went out of business, because it became much more expensive to run survey research when their response rates went tumbling down with caller ID. Now we throw in cell phones,
and it turns out it's interesting: with landlines, it is okay to have a computer dial the phone. With cell phones, it is illegal; you must hand-dial each number. That's the law. And so even interacting with any sort of level of cell phones becomes a much more expensive proposition, because you have to hand-dial the numbers. And then, of course, you get this really abysmal response rate, when people see a number they don't know and go 'click', or say, 'Yeah, I'm not gonna talk to them, let them talk to my voicemail', right? So it's become really difficult to do that. Internet polling is now
becoming more prevalent; both Google and SurveyMonkey were big in this particular election. It turns out, if you look at a traditional telephone sample survey, say a national survey of political opinion on a presidential election, the sample size is typically just over 1,000 people, because it's so expensive to do. And now what they do is try to do the bulk of it using landlines and then fill in the sample with cell phones at some level, to try to make it more representative. But what they can do is take sampling frames of the population, because you can get some sort of linkage between telephone numbers and other things, so you can come up with some sort of an underlying
probability sample. If you look at the online polls... and we worked with SurveyMonkey, so we'll talk a little bit about SurveyMonkey in a moment... the way their polls worked is that if you showed up at SurveyMonkey for whatever reason, a randomly selected set of people were asked whether they would be interested in filling out a presidential poll. And they got a 25% click-through on that, which is remarkably high. But it is a convenience sample. It isn't an easily documented underlying sample, so after the fact they tried to create sampling weights to make it representative of the population. And it turns out a lot of the analytics they did on their part was actually figuring out what the appropriate sampling weights were, to begin to deal with that particular issue.
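(To make the weighting idea concrete, here is a minimal post-stratification sketch in Python. The cells, counts, and population shares are hypothetical placeholders; this is not SurveyMonkey's actual procedure, which the talk notes was considerably more involved.)

```python
import pandas as pd

# Hypothetical respondent counts by demographic cell
sample = pd.DataFrame({
    "cell": ["18-29_F", "18-29_M", "30-44_F", "30-44_M"],
    "n_respondents": [120, 60, 200, 150],
})

# Known population shares for the same cells (e.g., from the Census)
population = pd.DataFrame({
    "cell": ["18-29_F", "18-29_M", "30-44_F", "30-44_M"],
    "pop_share": [0.24, 0.26, 0.25, 0.25],
})

df = sample.merge(population, on="cell")
df["sample_share"] = df["n_respondents"] / df["n_respondents"].sum()
# Weight each cell so the weighted sample matches the population
df["weight"] = df["pop_share"] / df["sample_share"]
print(df[["cell", "weight"]])
```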
Sample sizes are drastically different. Like I said, a typical traditional telephone survey is around 1,000. In this cycle, SurveyMonkey figured that they would have 600,000 responses. We were using 10 days of data at two different times; the second one, which the current model is based off of, had 36,000 people respond. So the sample sizes have become enormously larger, although their population characteristics are skewed, and they know that, and they try to adjust for that, and that kind of complicates things. Now the trick is, even when you do that,
even with those large sample sizes, if you want to go down to a low level, it's really hard. Because if you talk about it in terms of the number of counties in the U.S., there are over 3,100. So basically you can get, at best, ten or so people per county. And then you've got counties like King County, Texas, which has about 50 people in it, and you're probably not going to get anyone there. If you go down to a lower level, census tracts, there are 72,000 census tracts. So when we look at small geographies, there's no way you can talk about running polling in any reasonable sort of way, just because there are too many sub-units involved. And so that becomes problematic, and you can't really use polls directly for this sort of activity.
We've also got election-prediction betting markets: Betfair, Hypermind, PredictIt, a number of others. It turns out that probably about 20 years ago now, the guys in Econ and the Business School at the University of Iowa created the Iowa political stock market, and for a long time that was considered the gold standard for coming up with predictions for elections and other things. It's all the wisdom of the crowd, the market mechanism is a wonderful thing... you can tell economists were involved when that's suddenly what people are talking about, etc., etc. That kind of went in an interesting direction, which I'll show in just a moment. The problem is that while you can run a betting market on an election as a whole, you can't really run a betting market for a small geographic area. So that becomes a bit of a problem.
Fundamentals election models, which we'll talk about in great detail a little bit later, are a little bit better in terms of being able to go down. They can get you safely down to the county level in the U.S., 'cause that's the most consistent level at which election returns are reported. It turns out some states are good about releasing precinct-level data, and once you get down to precinct-level data, you probably have all the granularity you want. But the number of states that release precinct-level data is really small. In California they do, but you have to get special permission from each county to receive that data. So you have to go on a county-by-county basis, not even at the state level; you have to go to each county to get that underlying data, so it's sort of problematic from that point of view.
I just wanted to bring up betting markets for a moment, 'cause this is just too amusing. This is the price of a Donald Trump share through the course of the election. You can see it had hops and bounds, but it had this general gradual trend. I bet you can guess what day this is. Right? So, yeah, not so good. Not that that's directly germane, but it's just too amusing not to include at this point in time.
So let's talk a little bit about what we have done. We've already talked about the fact that, gee, you can't really run polling down to really low levels, and that polling is probably the best way to get at a set of things that are observable, like people's demographic and socio-economic characteristics, and a couple of other things. So what we do is basically a two-step approach. First, we develop a choice model at the individual level that provides the probability that a registered voter will select the Democratic, Republican, or a third-party candidate, based on socio-economic, demographic, and other factors. A good example of that: we did me. Another example is an Asian woman between 30 and 34 years of age with a Bachelor's degree living in Cook County, Illinois. Those sorts of things we can take advantage of.
Now, the fact that it's Cook County... we can't directly use the fact that it's Cook County, because we're back to there being 3,100 counties in the U.S., and we're going to be working with about 36,000 records. So that gets to be a little bit problematic. But it turns out that knowing the county lets you suddenly pop in additional information, which I'll talk about in just a moment. So I've got this now at the individual level. Well, that's nice, but now what I need to do is aggregate it up.
So what I need to do, for each unique demographic and socio-economic profile, is estimate the number of individuals who are eligible to vote, meaning they are of voting age and are U.S. citizens, living in a small geographic area like a census tract or a block group. I need to know how many people have that constellation of characteristics: how many American-citizen Asian women between the ages of 30 and 34 with a Bachelor's degree live in that particular area. It turns out some of that information is directly available from the American Community Survey and from people who repackage that data. However, the full set of information is not available. So what I need to do is estimate the joint density of the socio-economic and demographic variables in a given census tract and take advantage of that. And it turns out there are methods for doing that.
So the second part of this is estimating the joint distribution of the underlying socio-economic and demographic variables. And then, once I have both of those things... and we also adjust for the number of registered voters... once I know the number of people in a particular census tract in Cook County who are Asian women between the ages of 30 and 34 who are U.S. citizens and have a Bachelor's degree, I can take the probabilities for each candidate, multiply by the number of individuals that fall into that profile, and get an estimate of the number of votes for each candidate. I can then turn around and sum across all the profiles to get the estimated number of votes in that particular census tract for each candidate.
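(As a concrete illustration, here is a minimal sketch of that aggregation step in Python. The profile names, head counts, and probabilities are hypothetical placeholders for the outputs of the two steps just described.)

```python
from collections import defaultdict

# Hypothetical inputs: profile head counts per tract (step 2) and the
# choice model's candidate probabilities per profile (step 1)
profile_counts = {
    "tract_0001": {"asian_f_30-34_ba": 12.4, "white_m_55-59_grad": 31.0},
}
choice_probs = {
    "asian_f_30-34_ba": {"dem": 0.71, "rep": 0.18, "third": 0.11},
    "white_m_55-59_grad": {"dem": 0.64, "rep": 0.26, "third": 0.10},
}

votes = defaultdict(lambda: defaultdict(float))
for tract, counts in profile_counts.items():
    for profile, n in counts.items():
        for candidate, p in choice_probs[profile].items():
            # expected votes = head count x choice probability,
            # summed across all profiles in the tract
            votes[tract][candidate] += n * p

print(dict(votes["tract_0001"]))
```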
And there seems to be an A/V problem. Nope.
[pause].
Okay, no, it's okay, I like... kind of like to do 
that, 
turns out. Okay. 
[pause]
So as I already indicated, the voter choice model was estimated using polling data obtained from SurveyMonkey. It turns out we got three different dumps of this data over time. The first one... basically, the election app launched in... when did it launch? Mid-September. For that we had data that ran basically from the end of August to the early part of September. We then received a second dump of data, which we thought was gonna be great; we figured the polls would be settled at that particular moment. That data ran through October 24... and then, of course, that Friday, Comey indicated that maybe they were reopening the investigation of Hillary Clinton.
[Audience member]
So going back to part 2 of the previous slide, when you're creating the data, what was the difference observed between the two datasets collected before the election and the actual dataset collected after the election? Was there a dataset collected after the election?
[Putler]
We did collect data after the... not polling data after the election, but we have actual results from after the election.
[Audience member]
So was it assumed... or was the prediction, which was still in part 2, assumed to be 100% correct, or was there some error? I just wondered if that...
[Putler]
So, what we were really working with was the registered-voter population as the base, and not the actual voters who showed up. So there could be some unknown level of impact that comes with that. But it turns out, as we're gonna show, the bias wasn't very bad. We'll get to that. It was much more correct than the media would have you believe about polling at this moment in time,
let's put it that way. It turns out they asked who you prefer to be president in two different ways. One is what's known as a head-to-head, where they're just asking you directly: are you going to vote for the Republican or the Democratic candidate? A lot of the work that's been done in this area talks about vote shares for the major-party candidates, and at some point we'll do the same thing. But in estimating this model, we also included third parties. We did it both ways: broken down into basically Jill Stein, Gary Johnson, and McMullin, and we estimated models that way, and then we estimated models where we just aggregated the third-party candidates together. The model that I'm presenting is based on the aggregated third-party candidates; it didn't really matter much.
The demographic and socio-economic variables included in the most recent SurveyMonkey polling data had changed on us, because it was syndicated data. One of them we really wish we still had but don't, and that is the underlying marital status of the individual. Never-married women were particularly hostile towards Trump, and we wish we could have had that particular data.
It turns out, as we go through this, there is a curse of dimensionality associated with estimating the underlying joint distributions. Given the demographic variables we have, there are 1,530 unique profiles. When we also threw in marital status and income... income actually is not that predictive, and it has its own challenges which I won't go into... with those two additional variables kicked in, the number of profiles is multiplicative. So we wound up with 1,530 profiles; when we had all of the variables, we had 17,300 unique profiles that we were estimating.
So it turns out the estimate for most groups in a census tract was a fraction of one person. The best you can do is a joint distribution; the expected joint distribution is the underlying expectation you can look at. In this case we have age, which was broken basically into five-year groups; race, which are the groups that I've already talked about and you saw before; gender; and educational attainment. So those are what we looked at for that.
It turns out the nice thing about SurveyMonkey data is, unlike any other polling data I've seen, they ask you your county of residence. And it turns out that's a high enough level that the non-response bias on it isn't too bad; most people were willing to say, 'Yeah, I live in Monroe County'. That turns out to be very useful as we get into it, because we can suddenly begin to augment the data with data that's really more on a county basis.
So, in terms of augmentation... and I'll talk a little bit more about it... it turns out there's a magic variable for doing modeling in politics, which is the partisan voting index, and I'll talk about what the partisan voting index is. A lot of domains have magic variables that suddenly soak up a lot of the variance in what's going on, so this is like feature engineering, and as we go into this, this is heavily feature-engineered. The guys at the Cook Political Report who came up with it don't know what the term 'feature engineering' is, but implicitly, that's what they did.
If you're looking at baseball and you're looking at hitting, nothing beats on-base percentage plus slugging percentage; it's just remarkable in terms of allowing you to predict how many runs a team's gonna score. Within a lot of marketing applications, for direct-mail campaigns on the part of, say, traditional and online retailers, there are metrics known as RFM (Recency, Frequency, and Monetary value), and those are all really critical dimensions. In politics, there is this really great predictor variable, and that turns out to be the PVI, the partisan voting index. So I'm going to spend a little time talking about what that is,
'cause it's just so critical. We also got a trend measure, because the partisan voting index looks only at the past two elections, and we were curious about the longer trend in different counties. It turns out there are distinct, strong trends in what's going on with the PVI in different counties, based on a simple time trend: we computed the correlation between the PVI and time. The way the PVI works, a negative value indicates becoming more Republican, and a positive value indicates becoming more Democratic. The correlation between the PVI and that simple time trend, for most counties in the U.S., is either between -1 and -0.9 or between 0.9 and 1.
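(A minimal sketch of that trend measure in Python; the county's PVI series here is made up for illustration.)

```python
import numpy as np

# One county's PVI over four election cycles (hypothetical numbers,
# drifting Republican, i.e., increasingly negative)
years = np.array([2000, 2004, 2008, 2012])
pvi = np.array([-3.0, -5.5, -8.0, -11.0])

# The trend measure: correlation between the PVI series and time
trend = np.corrcoef(years, pvi)[0, 1]
print(round(trend, 2))  # close to -1: strongly trending Republican
```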
So there's incredible polarization going on at this moment in time. It's something political scientists have called the Big Sort, and you definitely see the Big Sort going on in what's happening. It turns out there are three things that matter for partisanship: Region, Race, and Religion, the Three R's of American politics.
It turns out the Census Bureau doesn't collect any information on religion; it would potentially violate the Establishment Clause, so they just don't want to touch it. But there is an organization whose exact name is escaping me... something like the Association of Statisticians for Religious Organizations, or something along those lines... and every ten years they produce a U.S. religion census. It is based not on talking to individuals, like the U.S. census, but instead on a census of religious congregations in each county: basically, can you tell me how many people are on your congregant list. So that gives you an idea of what's going on. The most recent data available for that is 2010. And there were some indications of differences going on from a religious point of view. Evangelicals and members of the Church of Jesus Christ of Latter-day Saints have been trending increasingly Republican, so that's why we included those variables. There were indications of a dichotomy going on between them in this particular election: in particular, people who are Mormon are not particularly fond of Donald Trump, and we wanted to potentially capture that effect, while Evangelicals were basically coming back to the fold and supporting the Republican candidate.
And then we have a set of state-residence indicator variables, just in case there are spatial elements that we're not picking up any other way. And finally, we have a third measure of partisanship, a little bit different from the partisan voting index, that we got from TargetSmart Communications, who is one of our customers; they work on the Democratic side. There's a company called Deep Root which works on the Republican side and is also one of our customers. So those are the companies we work with.
So, a little bit about the partisan voting index. It was developed by the Cook Political Report to measure the political orientation of voters in a geographic area relative to the country as a whole. So everything's relative, not absolute. It can be applied down to any geographic area for which presidential voting returns are available, and in most places in the U.S., that's the county level. Of course there's an exception to that, which is Alaska, which has no counties. They have boroughs, but it's an optional form of government, and most of Alaska isn't in a borough. The moral equivalents there are known as census statistical areas; some of the time you'll get this information for an area that has no county government whatsoever and is an invention of the Census Bureau purely for reporting purposes. But the fact that they don't have counties means that they can't report data on a county basis. So as a result, Alaska falls out of some of this analysis, just because what we can get is state-level data, not county-level data. As I indicated, in some states you can get precinct data, but they're few and far between. So a lot of what we're going to do is at the county level, because that's the most reliable geography we can go down to at a low level to see if things work. The PVI is based on the past two presidential elections.
[pause]
Some of the stuff that we've done: we've actually created PVIs going back all the way to 1948 using election-returns data. What we're using is a variation on what Cook used for the PVI: we give 75% weight to the most recent election and 25% weight to the prior election. There's a study that I'll talk about a little bit later, by Hummel and Rothschild, where basically, if you back out what they did, the 75/25 split is about right. This is something that 538 uses in what they do, and we've tried it both ways, and yeah, that 75/25 seems to be the right way to break it down.
It only looks at voting for the major-party candidates, so third parties aren't really counted. And it looks at the difference between a local area's results and the national results across those two elections. The specific formula is really very simple: you look at the percentage of the major-party vote a candidate got locally in the 2012 election and subtract the national percentage. For this election, it looks like it's basically gonna be 50/50 when you take the major-party vote, so in this particular case, if you're looking at a county that went 70% for Clinton, this value is gonna turn out to be 20, 'cause it's just the difference in the two percentages. You compute that for this election and for the election before it, and take a weighted average of those two things. It is a remarkable variable.
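(Written out, using the 75/25 weighting described above, and letting $v$ denote a candidate's share of the major-party vote, the computation sketched here for 2016 is:)

$$\mathrm{PVI} = 0.75\,\bigl(v_{2012}^{\text{local}} - v_{2012}^{\text{national}}\bigr) + 0.25\,\bigl(v_{2008}^{\text{local}} - v_{2008}^{\text{national}}\bigr)$$

(So the 70%-Clinton county against a roughly 50% national vote contributes about +20 through the most-recent-election term; positive values indicate a Democratic lean and negative values a Republican lean, matching the sign convention mentioned earlier.)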
[pause]
So, the model itself. This is where the data science-y thing comes in at this particular moment. We used several different modeling algorithms, and we checked a whole bunch of different hyperparameters on those underlying models. We basically did everything using a training set/test set setup: we determined the best model for predicting the test set, using the training set to estimate those models, and then we re-estimated the selected algorithm using all the data. The specific algorithms that we looked at were random forest models, gradient boosting models, and feed-forward neural network models. In terms of the feed-forward neural nets, they were single-hidden-layer perceptron models. We do everything in Alteryx; the underlying statistical algorithms use R, and deep learning packages are only just now coming out for the R language at this moment in time.
Neural network models were not beloved by statisticians for a long time. They're numerically incredibly unstable, and you could get local optima problems all the time, so statisticians tended to shy away from them. Then came the work of Breiman and Friedman on ensemble models of decision trees, and those seemed to be a whole lot more robust. In this particular case, what we find is that the best model is a gradient boosting model with 3-way interactions.
Interestingly enough, with political data... and this is based on what we've seen with it, and what both Deep Root and TargetSmart have seen with it... for some reason, gradient boosting works really well. That's not true in every application, but it seems that gradient boosting, for whatever reason, is really good with political data. And then finally, as I indicated already, we took the best algorithm with the best set of hyperparameters on the test set, applied it to all the data, and that's what we use going forward.
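(A minimal sketch of that select-on-the-test-set, refit-on-everything procedure, using scikit-learn stand-ins for the R algorithms the talk mentions; the file and column names are hypothetical.)

```python
import pandas as pd
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

polls = pd.read_csv("surveymonkey_polls.csv")  # hypothetical file
# One-hot encode the categorical predictors; "pvi" stays numeric
X = pd.get_dummies(polls[["pvi", "race", "age", "gender", "education"]])
y = polls["choice"]  # Democratic / Republican / third party

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# max_depth=3 lets each boosted tree capture up to 3-way interactions
candidates = {
    "gbm_3way": GradientBoostingClassifier(max_depth=3),
    "gbm_2way": GradientBoostingClassifier(max_depth=2),
    "random_forest": RandomForestClassifier(n_estimators=500),
}
scores = {
    name: accuracy_score(y_test, m.fit(X_train, y_train).predict(X_test))
    for name, m in candidates.items()
}
best = max(scores, key=scores.get)

# As in the talk: re-estimate the winning specification on ALL the data
final_model = candidates[best].fit(X, y)
```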
This is really small, I'm sorry about that...
[pause]
By far, the #1 predictor is the PVI. Interestingly enough, it's not about the individual, it's about where the individual resides. So it's using the county-level PVI, and it turns out that, because of the Big Sort, that's a really good predictor of people's preferences in political campaigns. You and your neighbors tend to vote alike, which is one of the things we're running into.
Followed by that is race, and that works exactly the way you'd expect. Followed by that is education: people with post-graduate education are most likely to skew Democratic, and people with high-school-only educations are most likely to skew Republican. People with Associate's degrees are very similar in that skew. People with Bachelor's degrees don't skew as Democratic as people with post-graduate degrees, but they do still skew Democratic, which is one of the big discussions that has shown up in the press.
Then age; age is sort of an interesting thing. Younger voters were much more third-party-oriented; older voters tended to be much more for Trump. If you look at Hillary voters, they didn't have a strong age relationship; it was more third party versus Trump where we were seeing the age effect. 'Who did you vote for instead of Hillary?' I guess is the way to put it.
Gender runs exactly the way you'd expect, or the way it has been talked about, and then we get the interesting effect of religion and area. We find that Evangelicals are much more on the Trump bandwagon, and for the combination of the state of Utah and the percentage of the county population that is Mormon, voters are much more likely to go third party, both Johnson and McMullin. So they definitely have this third-party effect.
Interestingly enough, the other one that comes up a little further down is what's going on with New Mexico. Since Gary Johnson is from New Mexico, there's a boost for him within New Mexico on an individual basis. So that runs pretty much exactly the way you'd expect. Population density kicks in: rural areas go for Trump, urban is better for Clinton, kind of the way everyone thinks the world works, and that's what these models show.
So, the next part of this, and I've got very little time, so let me hurry through it: what we're going to do is calculate the number of people within an area that fit a particular socio-economic and demographic profile, and I said there are 1,530 of those socio-economic and demographic profiles. The way we're going to do that is through iterative proportional fitting, which is not a new method, since it was invented in the 1940s, but it is still a very good method. There have been improvements on that method, although they are computationally very challenging, and they don't buy you that much. Having invented one of those alternative methods, I'm afraid that's the case.
And so the basic idea is that we're going to look at the underlying relationships between those variables using what you could really call a prior table. We're going to say, 'Hey, across the population as a whole, how are those variables interrelated?' At the end of the day, when you create this joint density, what you're really doing is estimating the interior cells of a contingency table. So we create a true contingency table using individual data that we get from the American Community Survey Public Use Microdata Sample (the most recent available is 2010-2014), and we use that to inform us what the underlying relationships between the demographic variables are. Then we take the univariate marginal distributions of the variables we're looking at, so race, gender, etc., and we condition on that prior table to give the estimate the underlying inter-relationships between those demographic variables.
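(A minimal two-variable sketch of iterative proportional fitting in Python: scale a prior table, of the kind built from PUMS, so it matches a tract's known marginals. All numbers are made up for illustration.)

```python
import numpy as np

prior = np.array([[20.0, 30.0],   # e.g., rows = gender,
                  [25.0, 25.0]])  # cols = education (from PUMS)
row_targets = np.array([480.0, 520.0])  # tract marginal for rows
col_targets = np.array([650.0, 350.0])  # tract marginal for columns

table = prior.copy()
for _ in range(100):
    # scale rows to match the row marginals
    table *= (row_targets / table.sum(axis=1))[:, None]
    # scale columns to match the column marginals
    table *= (col_targets / table.sum(axis=0))[None, :]
    if (np.allclose(table.sum(axis=1), row_targets) and
            np.allclose(table.sum(axis=0), col_targets)):
        break

print(np.round(table, 1))  # estimated joint counts for the tract
```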
[Tom Wiggins]
We have a question here from someone online. 
They're asking if you could please explain about 
the Cook partisan index.
[pause]
[Putler]
The Cook partisan index is actually what I explained earlier as the partisan voting index. So I've just described that already.
[pause]
So let me go... I've now gone too far. Okay. So, what we have is this true, complete contingency table of the joint distribution. Then we get the underlying marginal distributions, and where possible we try to get Q1 2016 estimates from Experian CAPE, which is basically predictions and estimates built on top of census data. It turns out in some cases they don't give us exactly the data we want: they basically do education for people age 25 and up, and we need to capture those 18-24, so we have to go to the American Community Survey. Citizenship is not a standard question or a standard summary statistic that comes out of the American Community Survey, but they do a special tabulation for it, so we grab the special tabulation to take advantage of it. So what we try to do for the marginals is take advantage of the Experian estimates and projections where possible, but where we couldn't, we used the American Community Survey data. The most recent data that we could really take advantage of was the 2010-2014 data, since it runs on 5-year summary numbers.
The other model we estimated, really quickly, was the fundamentals model. A fundamentals model uses data from historical elections to predict future elections. It started largely in economics, and a little bit has been done in political science since then. Initially these were done at a national level, and they really focused on how macro-economic variables impacted presidential elections. It turns out they're a bit problematic in social science, because they implicitly assume that what drives people's voting behavior is economic self-interest, and there's a huge amount of literature indicating that's not the case. Probably the leading theory in this is something that John [unknown], who is a political psychologist at UCLA, came up with: a theory known as symbolic predispositions.
These are beliefs that you hold that have really high affective value from an emotional point of view, and those seem to be much more important in driving what's going on. So these models were basically based on the wrong theory of how voters work. In the more recent literature, people make use of smaller areas, such as states, and they begin to look at some of the drivers that underlie the symbolic predispositions. And it turns out the PVI is probably doing a pretty good job of picking up those underlying belief orientations that matter for people's voting decisions. If you want to read more about this, because there's a lot to cover in this area and I have hardly enough time to do it, there's a paper by Hummel and Rothschild, 'Fundamental Models for Forecasting Elections at the State Level', which appeared in Electoral Studies in 2014. The model that I'm about to present is pretty much based on the Hummel and Rothschild model, with a few minor exceptions. The main one is that we went to the county level rather than the state level, but there are also some minor differences in how we wind up using variables.
So we developed a fundamentals model at the county level. We used data from the 1972 through 2012 presidential elections. If you go earlier than that, you get all kinds of weirdness,
[pause]
largely as a result of the Civil Rights Act of 1964, in the wake of which the state of Alabama kept Lyndon Baines Johnson off the ballot in that year's election. Suddenly the South, which had been blue-blue-blue, went for Goldwater because of the Civil Rights Act of 1964, and then trended away from blue. The other thing that goes on is that after '68 you really start to see the modern setup: the Republican Southern strategy starts to kick in in a big way, and that kicks off a regime that has been heading in the same direction for a long time. One thing that can be argued is whether that regime is continuing with the most recent election. Given the results that I'm going to show in just a moment, my answer is yeah, I think it is.
So I think those are the things that are going on with that. The predictors that we use are the PVI, the partisan voting index, and the PVI trend measure. Then the approval rating of the sitting president on or near June 15 of the election year; it turns out there's a lovely site done by a couple of guys at UC Berkeley which has Gallup's presidential approval ratings going back to when they were first asked, which I think was 1946. Then you look at the difference in the GDP growth rate between the second quarter of the year prior to the election and the second quarter of the election year. And it turns out how you code these variables to reflect the impact of who's in power, which party's in power, is very interesting and fairly complicated, so again, the Hummel and Rothschild article goes into how you deal with that issue. And then, whether the incumbent party has held the White House for 8 or more years: do we get a 'throw the bums out' effect?
Oh, yes, I should add: the last one was home-state advantage. There's an indication that you get a home-state advantage if a candidate is from that state; we saw that this year for Johnson in New Mexico. That impact only seems to happen for smaller states, and as a result, I think Donald Trump may be one of the few winners to actually lose his home state.
[pause]
Interesting as that may be, New York was also considered to be Hillary Clinton's home state, since she had been a Senator from New York. And then finally, there was an adjustment for Jimmy Carter included in the model as well, because he was sort of an aberration, being a Southerner, in that particular case.
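(To make the predictor list concrete, here is a hypothetical sketch of a fundamentals feature table; every column name and value is a placeholder, and the party-relative coding of the approval and GDP variables that Hummel and Rothschild describe is only gestured at here.)

```python
import pandas as pd

features = pd.DataFrame({
    "county_fips": ["18105", "18109"],  # e.g., Monroe, Morgan (IN)
    "pvi": [14.2, -21.5],               # positive = Democratic lean
    "pvi_trend": [0.95, -0.97],         # correlation of PVI with time
    "approval_june": [51.0, 51.0],      # sitting president, ~June 15
    "gdp_growth_diff": [0.8, 0.8],      # Q2 election yr - Q2 prior yr
    "incumbent_8yrs": [1, 1],           # 'throw the bums out' flag
    "home_state": [0, 0],               # candidate home-state flag
    "carter_adjust": [0, 0],            # 1976 Jimmy Carter indicator
})
# target: Democratic share of the county's major-party vote
```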
So again, we used a training set/test set approach. The algorithms looked really similar, with the exception that we also included linear regression. The best model was, again, a gradient boosting model with three-way interactions. Yes! The model for political data! And PVI dominates the model, surprise, surprise. So that's what's going on here: approval rating comes second, we get our Jimmy Carter effect in '76, and then GDP growth rate and the 'throw the bums out' incumbent-party-8-years variable are the ones that begin to matter. So these are the two different models. And this is fitted versus actual values. This should be a 45-degree line; it's not, because I didn't have enough time, so that's just a least-squares regression line, because Alteryx will pop that out instantaneously. The least-squares regression line is essentially a 45-degree line, which is a very good sign.
And as you can see... 
[pause]
you've got a little funkiness down here, we've got this little kind of bulge up there, but when you look at these plots, you're gonna go, 'Yeah, this looks pretty good'.
[pause]
And again, this is the data from this election that we're looking at. And this was the interesting one, where I can show you both a residual plot and confusion matrices, because I'm gonna look at which party won in each county. And if you look at actual versus predicted for each model, the 2-step model does slightly better, but only slightly better, than the other one does. But, gee... the hit rate for the 2-step model? 96%. 96%. I don't know if you've done a model with a hit rate of 96%; I don't generally see that unless I've somehow included the left-hand-side variable on the right-hand side. It just works incredibly well. The fundamentals model is nothing to sneeze at either: just under a 95% hit rate. So... very predictable.
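(A minimal sketch of that county-level evaluation, computing the hit rate and a confusion matrix of predicted versus actual winners; the file and column names are hypothetical.)

```python
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

counties = pd.read_csv("county_results.csv")  # hypothetical file
actual = counties["actual_winner"]            # "Clinton" or "Trump"
predicted = counties["predicted_winner"]      # from either model

print("hit rate:", accuracy_score(actual, predicted))
print(confusion_matrix(actual, predicted, labels=["Clinton", "Trump"]))
```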
Overall accuracy, you can see here; these are sort of the standard metrics. Trump accuracy is better than Clinton accuracy across both of the models, and it's worse for the 2-step model. Correlation between actual and fitted: both are floating in at just under 0.95 for the correlation coefficient. Interestingly enough, the fundamentals model does a little bit better in terms of the other goodness-of-fit metrics for the two models.
But if I look across the two of them, I would tend to lean towards the 2-step model, but my preference isn't strong. And I did the forecast. The forecasts you are seeing based on the fundamentals model were done on July 31.
[pause]
Months before the election.
So it's very predictable. And I'm now running over, so let me go real quickly through my last slide. So in this talk, an approach for creating polling-data-based election estimates, forecast for any geographic area, was presented; it relies heavily on demographic and socio-economic data. The approach was applied to data from the 2016 U.S. presidential election, and an analysis was presented that shows its predictive efficacy using county-level election returns.
[pause]
There are still deficiencies in this particular approach. You can tell this was the last slide I did. I'm just getting point estimates here; there are no measures of uncertainty. As we know from this past election, measures of uncertainty matter a great deal.
[pause]
As Nate Silver has pointed out again and again, his last forecast before the election said there was a 30% chance that Trump would win. 30% is non-trivial, and he did win.
[pause]
Eventually, when we look at what's going on here, we're going to have to use some sort of simulation method to come up with those measures of uncertainty. I think it's going to be a little bit complicated, and I can tell that basically lots of computer cycles would be consumed in order to do that, but it's something that really should be done.
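[Editor's note: one way the simulation idea might look, sketched on stand-in numbers; the residual spread and the share distribution are assumptions, not the talk's model.]

```python
import numpy as np

# A minimal Monte Carlo sketch: jitter each county's predicted two-party
# share by the model's residual spread and look at the distribution of
# outcomes. All numbers are stand-ins.
rng = np.random.default_rng(3)
pred_share = rng.uniform(0.2, 0.8, 3111)   # predicted Dem share per county
residual_sd = 0.05                         # assumed residual spread

n_sims = 1000
draws = pred_share + rng.normal(0.0, residual_sd, (n_sims, pred_share.size))
dem_counties = (draws > 0.5).sum(axis=1)   # counties won in each simulation

lo, hi = np.percentile(dem_counties, [5, 95])
print(f"90% interval for counties won: {lo:.0f} to {hi:.0f}")
```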
But having said that, the point estimates are darn good. There are currently a number of stories, the kind I started out this talk with, where the media are saying, 'Well, the polls were horribly wrong, maybe we need to throw this stuff out.'
The polls had problems. But using polling data on a county level, I can get 96% hit rates, which are incredibly high. So are the polls completely broken? No. Were there some minor issues with them? Probably. But it's hardly time to call it in, to say that polling is garbage and we should just ignore it and instead use chicken entrails. That's not really where we are.
[pause] 
So that's just reiterating the point I just gave. It'll be interesting, based off these estimates, to look at where the models went wrong and what the characteristics were of those counties where the models went most wrong. That's probably going to be indicative of what is going on in the underlying mix that really needs to be accounted for now. And so this is how I'm going to actually spend my weekend: as I'm on the Southwest flights from here back to San Jose, we're gonna start to crank on those particular questions.
And this is how I occupy my weekends as well. 
With that, any final questions? Yeah.
[Audience member]
Just building upon that last point, are there any emerging consensuses as to where the models seem to be off... what they seem to have missed?
[pause]
[Putler]
My guess... is there's always been a trick here. We're looking at stuff where our population base is registered voters... that's not necessarily who showed their little selves up at the polls. I think one of the things that's coming out is... what people are gonna look at really heavily is turnout.
[pause]
And the argument the Trump people made... if you really did look at the problem with the polls, what you would say is that the problem with the polls is we had a Bradley effect. The Bradley effect is basically people lying about who they're going to vote for to make themselves look better in the eyes of the interviewer.
[pause]
I don't think... given what I'm seeing here, I don't think there was much of a Bradley effect. There probably was a wee bit, but it was a third or fourth order effect. I think it's gonna be turnout.
And some of the conversations I've had... okay, how many people were, like, gung-ho passionate about Hillary Clinton?
[pause]
Yeah, five people somewhere. How many people were passionate about Donald Trump?
[pause]
Considerably more. Frightening... but considerably more. So what does that mean? Am I gonna sit home on election day or am I gonna show up? Am I gonna bother to fill out... and in California you have this lovely thing: everyone does mail-in ballots. It took us two days to go through ours. We had 15 ballot initiatives on it... all the way to the 'should porn actors be required to wear condoms' deal.
And so you had to make life-and-death decisions like that. So do you get people out, do you get them excited? That becomes really important, and that was problematic. And I think the best thing that has been said is that the Clinton campaign kind of ran on the legacy of Obama; it was sort of 'we're running on vapor in the tank, can we take it forward?' And it's just hard to do that.
And it's really hard to win a third term. There's a penalty that you pay when you're going for a third term. So I think at the end of the day, it's gonna turn out to be turnout. And modeling turnout is very difficult. You can get the data from the CPS, the Current Population Survey, from the Census Bureau, and they ask questions about turnout. But if you look at 2008, you had one part of the population that was really energized, and if you look at 2016, you had another part of the population that was energized. And how do you capture in a model how energized the sides are? That's difficult. Pardon?
[Audience member]
Any thoughts about using social media to analyze that? Because I would think that could serve as a proxy for how people are reacting or what they're saying in regards to it.
[Putler]
Yeah, but, boy, if we want to talk about a convenience sample, it's looking at what's going on with Twitter... or social media. So yeah, potentially something could be done with that, and maybe that's a good thing to look at. It's gonna be hard to say what's gonna go on with it, though. Lemme go here.
[Audience member]
You've probably read one of those many articles like '24 Reasons Why Hillary Lost and Trump Won'. Do you have any sort of insight as far as which variables actually had an impact? I'm talking about things like Comey, or what effect the third-party candidates had on the election, and so on and so forth.
[Putler]
Well... what we can say is... you do get much higher third party... the third-party vote was non-trivial. Actually, it was split enough that none of them hit the threshold that would allow them to get funding next time around. It was also very age-specific. Both the 18-24 and the 25-29 age groups tended to skew much more third party this go.
So that obviously had some sort of effect, 
and that's gonna play into it. This is back to: they just weren't excited about Hillary, so they voted third party. Bernie elicited excitement... and Hillary never did. She survived it in the primaries... she almost, but did not quite, survive it in the general election.
Because the three states that really mattered were Wisconsin, Michigan, and Pennsylvania, and they're all under 2 percentage points in terms of the difference in the popular vote for the two candidates. So the excitement gap was enough to tip it, in my opinion. So I think... excitement turns out to be really critical. If I'm a Democrat, I really want a candidate who excites the population in the next election.
[Audience member]
Yes, so on this subject of perhaps understanding bias in these critical states, given that...
[pause]
given the stakes are enormous... sorry, it's on the subject of understanding bias in the existing polling mechanisms. Given how high the stakes are, have people talked about doing some kind of in-person sampling strategy, where you have census tracts where maybe you want to audit that your polling... your model is not suffering from a Bradley effect? You do very targeted in-person sampling.
[Putler]
Now, the problem with in-person sampling... the problem is this notion of...
[pause]
not wanting to upset the interviewer, which is the underlying driver of the Bradley effect. In my opinion, I think it's likely to be exacerbated if you do in-person sampling that way.
[Audience member]
But even that would show you whether there was a broad Bradley effect, since you'd see people more likely to be doing a Bradley of sorts there than in phone or Internet polling.
[Putler]
I think what's gonna go on is people are gonna look to see... to what extent... so let me back up. Given what I'm seeing, I don't know if there was a lot of range for Bradley. Like I said, I think it was a third or fourth order effect. And what you would do from a marketing research point of view, if I was doing product concept testing and that sort of thing, what people do is they sit down and norm. And I think pollsters in the '80s and '90s would sit down and norm for racial effects, taking this into account. We have similar situations: we do the survey, we get the responses, and then people tend to tinker with them on the edges. That's known as 'norming'. And so what they'll probably do is come up with some sort of a norming adjustment going forward.
[pause]
It may be that it applied just to African-American candidates. Obama won twice, so it seems to have greatly died down. It's hard to say how important it's gonna be going forward. It's quite possible it's sort of a one-off deal... that it's going to be a transitory effect. I think... it depends on what 2018 looks like.
[Audience member]
So I haven't framed this question very well in my mind, but I was just wondering why certain factors are not taken into consideration. For example, the [unknown] factor... the [unknown] factor, which is that if there's a ruling Party A, it's less likely to come back next time, and B is more likely to come in. The second part is that the survey, which is general, is a particular age group [unknown], and it shows that you're less likely to take a survey sitting in an office than I am sitting in a lab. Those kinds of things...
[Putler]
So what they will do with this...
[Audience member]
And there was a lot of... focus on these SurveyMonkey locations. It's not hard to find a location these days, a stream and file location, and you may give your location wherever you are doing anything, you just have to say [unknown]. SurveyMonkey actually takes that into account... it's very, very target-specific. So SurveyMonkey has more users than the extra above 22...
[Putler]
Yeah. So what they do is they know that, they know the biases of what they're dealing with, so they use sampling weights when they do the projections associated with it. So a lot of the analytics they did was about coming up with appropriate sampling weights.
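[Editor's note: a minimal post-stratification sketch of the sampling-weight idea: up- or down-weight respondents so the sample's group shares match known population shares. The groups and numbers are invented for illustration.]

```python
import pandas as pd

# Weight each respondent by (population share of their group) divided by
# (sample share of their group), then take a weighted estimate.
sample = pd.DataFrame({
    "group":   ["18-29"] * 2 + ["30-54"] * 5 + ["55+"] * 3,
    "support": [1, 0, 1, 1, 0, 0, 1, 0, 1, 1],   # 1 = supports candidate A
})
population_share = {"18-29": 0.20, "30-54": 0.40, "55+": 0.40}

sample_share = sample["group"].value_counts(normalize=True)
sample["weight"] = (sample["group"].map(population_share)
                    / sample["group"].map(sample_share))

raw = sample["support"].mean()
weighted = (sample["support"] * sample["weight"]).sum() / sample["weight"].sum()
print(f"raw estimate: {raw:.3f}, weighted estimate: {weighted:.3f}")
```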
[Audience member]
But we don't have the samples for the other part 
of the age group, which is also a considerable 
population. For example, 55+...
[Putler]
No, they get people who are 55+. And what you'll find is that someone who's 55+ who is taking a SurveyMonkey survey gets up-weighted. What they find... the big bias...
[Audience member]
The thing is there's a higher bias on [unknown].
[Putler]
But that's also true of telephone surveys. There was the USC Los Angeles Times survey that was really aberrant from all the other polling data. And when you really dug, there was something in The Upshot at the New York Times that talked about it. It turns out what really drove a lot of the results is that African-Americans tend to be less likely to respond to polling than other population groups, particularly younger African-Americans. So there was a young man, an 18-year-old African-American in Illinois, who was a Trump supporter, and his sampling weight was 30. And so suddenly that drove a lot of what looked like...
[Audience member]
Yes, that's what I'm coming to. When we have a bias in who we sample, it's very hard to model these things so that we get the correct results.
[Putler]
But what I'm gonna say here is that, based on what we're seeing in this data... the polling data's not perfect, but it's not fundamentally flawed. You can sit down and work with polling data and get a 96% hit rate on counties. If you look at the swing in this election... it's expected that Hillary will ultimately get 1% more of the popular vote than Trump.
[pause]
The polls indicated she would get 3.5% more than he would; the difference is 2.5 points. In the 2000 election, the swing was 4 points. So the swing in the popular vote between the polling average and the actual result for this election is within historic norms. So I think they're doing a reasonable job. I mean, is there bias? Yes. They're doing things to adjust for the bias. Are they perfect? No. Are they reasonable? Yes. Could you do better? Probably not. Lots of smart guys think about this a lot.
[pause]
[Audience member]
Sorry, you just kind of answered my question, but I'll ask it anyway. Historical polling data starts at '72, because a realignment had happened before that. I was gonna ask, do you think such a realignment could be going on right now, and what would that mean for next year's polling? Your answer may have just kind of covered that. But secondly, I would ask: if such a realignment were occurring, would polling be sufficient to detect it for upcoming elections?
[pause]
[Putler]
And I... really don't know the answer to that question. I think there may be a sort of... cross-purposes thing. What will go on is that realignments are likely to be captured pretty rapidly in the PVI. In some sense, if I were to say anything about this stuff, it's: look at what they norm on, and they do it on gender, on race, on age, on income. There are several variables that they try to make representative. One thing that's gone on is that income is increasingly less a predictor of your political behavior. As a matter of fact, as I said, we didn't miss losing income as a predictor, because it had some pop to it, but the impacts were not particularly consistent. So what it may mean is that you want to norm... come up with sampling weights on different criteria that currently aren't being used.
Part of the problem is that those criteria aren't collected by the Census Bureau, and what you can get is what the Census Bureau collects. In some sense what we're doing is saying, 'Hey,' y'know... and counties are too small a geography; again, there are 3,100 counties in the U.S.... 3,111. So it gets to be a little bit tricky that way.
But what you could do is put things into PVI groups, so that what you're stratifying on is based on county of residence: you aren't using the county directly, you're using characteristics of that county to do the stratification. But that is one way to sit down and approach it.
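[Editor's note: a sketch of the stratification idea just described, binning counties into PVI groups; the cut points, sign convention, and simulated PVI values are assumptions.]

```python
import numpy as np
import pandas as pd

# Stratify on a county characteristic instead of the county itself:
# bucket counties by PVI and treat the bucket as the stratum.
rng = np.random.default_rng(4)
counties = pd.DataFrame({
    "county_fips": np.arange(3111),
    "pvi": rng.normal(0, 12, 3111),   # simulated; sign convention arbitrary
})
counties["pvi_group"] = pd.cut(
    counties["pvi"],
    bins=[-np.inf, -10, -3, 3, 10, np.inf],
    labels=["strong D", "lean D", "swing", "lean R", "strong R"],
)
print(counties["pvi_group"].value_counts())
# Each respondent inherits the stratum of their county of residence, and
# sampling weights are then computed per stratum as in the earlier sketch.
```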
So at some point... I've met with both Jon Cohen and Mark Blumenthal at SurveyMonkey. A couple of weeks from now, I'll probably sit down with them and go over what we've shown, and there'll be conversations going on between us and the folks at SurveyMonkey about potential ways that you could begin to address some of these issues. But I think what you may want to do is select a slightly different set of variables to do your norming on, or your stratification, or your sample weights on.
[pause]
[Audience member]
Last one. That was excellent work, cheers from us. So how many data scientists from Alteryx worked on this?
[Putler]
On the... creating the data? Oh, on everything? Three of us. So the data... I was the data guy. All the models, etc., etc., grabbing the data, cleaning the data, running the models, I did. And then we had Steve Wagner and Mack [unknown]. Steve Wagner is our Tableau wiz, and he also handled a lot of what we did in terms of the plots in the presidential election app. Mack did a lot in terms of taking the data I provided and getting it appropriately formatted. He also handled the CARTO side. The maps were done using CARTO, which is one of our partners as well, so he was our CARTO wiz.
And then we threw in a bunch of people from our PR and our marketing, and we had QA people, quality assurance people, involved, and we had people from our Creative Services Dept. involved. At some point I kind of went ballistic, because they didn't want gender, even though it was predictive... they'd rather have income, because the number of categories in income matched more aesthetically with the number of categories of everything else. And it's sort of like, 'No, no, no, no.' You don't want to live by the immortal words of David Lee Roth, 'It doesn't matter if you win or lose but how good you looked while you did it'; you'd really rather show what's really going on. And so we had arguments about that one. Gender came back in.
[Audience member]
One last question. So I understand how to design and develop this kind of software, but how do you test this kind of software?
[pause]
[Putler]
I think it's what we just did: seeing how well we did.
[Audience member]
Not in production, before that.
[Putler]
Before that, you'd basically have to sit down and say it's about methodology. We have a methodology that we know works, so going forward we could use the same methodology. We could back it up and replay 2012 and see if that worked.
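[Editor's note: a sketch of what 'replay 2012' could look like as a backtest harness, run on synthetic data; a real run would use the historical county features and returns.]

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Fit on elections before the holdout year, predict the holdout year,
# and score the county calls.
def backtest(df: pd.DataFrame, holdout_year: int = 2012) -> float:
    train, test = df[df["year"] < holdout_year], df[df["year"] == holdout_year]
    feats = [c for c in df.columns if c not in ("year", "dem_share")]
    model = GradientBoostingRegressor(random_state=0)
    model.fit(train[feats], train["dem_share"])
    pred = model.predict(test[feats])
    return ((pred > 0.5) == (test["dem_share"] > 0.5)).mean()

# Synthetic stand-in data so the sketch runs end to end.
rng = np.random.default_rng(5)
df = pd.DataFrame({"year": np.repeat([2004, 2008, 2012], 200),
                   "pvi": rng.normal(0, 10, 600)})
df["dem_share"] = 0.5 + 0.02 * df["pvi"] + rng.normal(0, 0.03, 600)
print(f"2012 hit rate: {backtest(df):.1%}")
```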
[Audience member]
I think my question is a different one. So when we write software, people design it, then we develop it, we test it, and we deploy it. So how does the testing part of such software happen? Do we take out some data from the test data and see if it runs well: training data, [unknown] data, pooled test data, things like that?
[pause]
[Putler]
So it's a little bit... this one's a little bit... 
we know... we basically can do unit testing. 
End-to-end testing is a little hard. And all of this 
stuff, 
we did have QA involved who are testing people 
involved with what was going on. 
But there's still... some of it's hard to test. 
And a lot of it's just literally manual testing, 
like you go to places that you know are funky, 
and it's like, does the data look right?
And there was a fair amount of that that went 
on. 
So our COO had specific places where he 
wanted 
to look to see if... the numbers I was coming up 
with 
made sense. So it passed his sniff test, 
which is not a standard software engineering 
metric, 
but is the one we kind of lived with.
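[Editor's note: a sketch of how the manual 'does the data look right?' checks could be made mechanical; all column names, values, and thresholds are assumptions.]

```python
import pandas as pd

# Cheap sanity assertions to run on every data refresh, alongside manual
# spot-checks of known-funky places.
def sanity_check(results: pd.DataFrame) -> None:
    assert (results["total_votes"] >= 0).all(), "negative vote counts"
    shares = results[["dem_share", "rep_share", "other_share"]].sum(axis=1)
    assert ((shares - 1.0).abs() < 1e-6).all(), "shares don't sum to 1"
    assert results["county_fips"].is_unique, "duplicate county rows"

demo = pd.DataFrame({
    "county_fips": [6059, 48113],   # illustrative FIPS codes
    "total_votes": [1_000_000, 750_000],
    "dem_share":   [0.51, 0.61],
    "rep_share":   [0.44, 0.35],
    "other_share": [0.05, 0.04],
})
sanity_check(demo)
print("all checks passed")
```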
[pause]
[Audience member]
So, not exactly a small-community question, but based on the PVI and the trends there... predict today: in what year will Texas go blue?
[Putler]
That's a... okay, I don't know that one. I can tell you... okay, so it turns out our CEO... lovely... what you don't want to have show up on the front page of The Guardian's website is your CEO at the Trump convention. We had that. And he's in Orange County, and one of the things we always looked at is... I was looking at the data, and I was predicting that either this cycle or the next cycle, Orange County, California, would turn blue. And I did the models and I'm going, 'Oh God, Orange County is blue.' Orange County was blue. So when is Texas gonna go? Well, this year, the election was single digits... in Texas. I can tell you exactly where it tends to occur: Dallas County is getting much bluer. Travis is actually pretty stable; Austin's not going any bluer than it's already blue. It tends to be the coasts that are going blue. I think it's still... at least 20 years away... but it could... definitely happen.
[Ying Ding]
Okay, thank you very much, we have 
another talk at 3 o'clock.
[clapping]
