Did you figure out the concurrent-write architecture — what the common program would look like? Okay, explain. So again, I will repeat. First you read what was there, just in case you need to restore it; everybody does that. Then the write instruction happens, and somebody will win. But the remaining ones — the lower-priority ones — can read it back and tell that something got written that was not the same thing they wrote. So they all know it needs to be unrolled, undone. How do they undo it? They all have the old value, so they are all going to write the old value; one of them will still win — whoever among them is highest priority — and the old value will get written. Now everybody is trying to write the same thing. So, actually, such simulations can happen; they just take more steps. Priority can also be simulated with Common, but not in constant time. So here is a simple PRAM program.
We are not doing initialization over here — how the input got in, how the processors got active, and all that. This is the CRCW PRAM model. The code in the middle, you see, is the same thing you would have seen in a RAM model; the only difference is that the loops have become parallel — pardo. So it says: for i = 1 to n pardo: if A[i] = 1 then output := 1. Here A is a shared-memory array and i is the processor index. This "for every i" is very similar to an OpenMP-style work-shared for loop: every single iteration runs on its own processor, and the reads are exclusive — they are reading different elements of A. This is CRCW, but reads are allowed to be exclusive. If that value turns out to be 1, then they write a 1 into one location, output, which has been agreed to by all the processors. So in the CRCW model the goal, as it says, is to compute the boolean OR of all the elements, and that boolean OR will appear in output: the value should be 1 if any element is 1. I will repeat. The program in the middle says that all the processors are assigned, based on their id, one of the locations of A to check; if that location of A happens to be 1, they write a 1 into a shared memory cell called output.
So let us look at each of the models. The simplest is Arbitrary: somebody's result is going to get written, and nobody is trying to write a 0 — if your value is 1 you write it, otherwise you keep quiet. So among the people who have a 1 and are trying to write, somebody's result gets written; the model does not say whose, but somebody's must, so a 1 will appear. In the case of Priority, again one specific fellow's write will land — whoever's priority is highest — and we still get a 1. And in the third, Common, everybody who writes is writing a 1, so the write is legal. So whatever the model may be, this program is going to compute the boolean OR. That is not always so nice, but in this case it is.
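As a quick sanity check, the three write rules can be simulated for this one-step OR program. This is a hypothetical sketch — the function and rule names are mine, not from the lecture — and it only models how a single concurrent-write step resolves:

```python
def crcw_or(a, rule):
    """One CRCW step: every processor i with a[i] == 1 writes 1 to `out`.
    `rule` selects how concurrent writes to the same cell are resolved."""
    out = 0  # agreed output cell, initialized to 0
    writers = [i for i in range(len(a)) if a[i] == 1]
    if writers:
        if rule == "arbitrary":
            winner = writers[0]        # some unspecified writer wins
            out = 1
        elif rule == "priority":
            winner = min(writers)      # lowest id = highest priority wins
            out = 1
        elif rule == "common":
            # legal only because all writers write the same value, 1
            assert all(a[i] == 1 for i in writers)
            out = 1
    return out

# whatever the rule, the cell ends up holding the boolean OR of a
for rule in ("arbitrary", "priority", "common"):
    assert crcw_or([0, 0, 1, 0, 1], rule) == 1
    assert crcw_or([0, 0, 0], rule) == 0
```

Under all three rules the winning value is the same, which is exactly why the program is model-independent here.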
Now we are going to add a few things together. I have n values in n locations — cells — in the shared memory, and I am going to assume order n processors. How can I add them on an EREW PRAM — exclusive read, exclusive write? Suppose everybody reads one of those n things; then what? At some point two processors would have to read the same value. So instead you start with n/2 processors, each reading two values. Which two values? Processor i reads cells 2i and 2i + 1. That might remind you of something: we are building a tree on this array. So you have n things, everybody reads two of them and sums them together — and where will they write? They can overwrite, assuming the input is not needed any more: processor i writes into cell i. If the input is needed — meaning you cannot overwrite — then you write into a temporary location and from then on keep working in the temporary locations; or, for that matter, you might just copy the original array over into a temporary area and do the summation there. So that was one step: in one step everybody read two things, summed them, and wrote one. How many such steps are needed? log n, right. The places to read from and write to follow the tree structure: at each level you know the indices of your children's values. At the end of it you have one processor reading two things and writing one thing — the answer. At each level some processors turn themselves off; they simply check the level. Each time you keep reducing the number of active processors, and you can do it in many different ways. So there are log n steps involved.
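The rounds just described can be replayed sequentially — a sketch of the "copy into a temporary area" variant mentioned above, with my own variable names:

```python
def erew_sum(a):
    """Simulate the log n rounds of the EREW tree summation.
    Works on a temporary copy so the input is not overwritten."""
    t = list(a)                      # temporary area, as discussed
    n = len(t)
    rounds = 0
    while n > 1:
        half = (n + 1) // 2
        # one PRAM step: processor i reads t[2i] and t[2i+1], writes t[i]
        for i in range(half):
            left = t[2 * i]
            right = t[2 * i + 1] if 2 * i + 1 < n else 0
            t[i] = left + right
        n = half                     # half the processors stay active
        rounds += 1
    return t[0], rounds

total, rounds = erew_sum([1, 2, 3, 4, 5, 6, 7, 8])
assert total == 36 and rounds == 3   # log2(8) = 3 rounds
```

The inner for loop is sequential here, but every iteration touches disjoint cells, which is what makes the round a legal exclusive-read, exclusive-write step.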
Yes — that is a technicality. You can say that the processors that are not participating are all reading from one designated location and writing into another designated location, so that detail does not have to be managed as part of the program itself. In fact, a processor does not have to write something in every step: it might read, do some computation, and not write. So that is probably a good thing to mention: all these sub-steps are optional, which means some processors may be reading while others do nothing. What you cannot do is have someone reading while somebody else is writing and somebody else is doing local computation — the phases stay in lockstep. All right. So the time will be log n.
The number of processors we have used: n/2 at first, and then it kept going down, but the maximum number of processors at any step was n/2, so we have used n/2. The speedup: what is the sequential time to add n things? n steps — n − 1 additions. If you compare to that, you have got an n/log n speedup. These are the metrics you may remember from the earlier slide, with the same statistics: efficiency is speedup over the number of processors — and this is all in order notation, so we are not worrying about n versus n/2. The work done is the number of processors times the number of steps, so the work done is n log n. That is where we are going to now figure out what happens if I have p processors, not n.
We did n operations, but we could not have done all n operations at the same time even using n processors. The work here is the total number of steps multiplied by the total number of processors — so this is not the summation of operations actually performed over time. This "work done" says how much parallel capacity has been used: the idle processors could not have done any other work, and if there had been other work to do, they should have been doing it. We are talking about this machine running this program; at some stage, because your algorithm's computation blows up, you may use lots of processors and then keep them idle for a long time — which is an objection to the algorithm. So if you can manage to keep all the processors busy at every step, you are probably going to get a lower work done, and work done is going to be the most predictive statistic when you run it on an actual machine with only p processors. We will talk about that in more detail later.
But let us look at the other algorithms. This is the membership problem — the membership search problem. I have n things written into n places, and I have a query: does x exist in that input array? We can write it at a high level: in the first step, everybody gets the value x; in the second step, everybody checks whether x exists in its part of the input; and in the last step, together they decide whether somebody found it. Let us look at each of these three steps in the different models — in this case EREW, CREW, and CRCW-Common. Not all of the steps differ across models, but the first and last do. The first step is a broadcast, so that everybody knows what we are looking for, because in EREW x cannot be read by everybody in the same step, whereas with concurrent read everybody reads x in the same first step. So in CREW and CRCW that cost is written as O(1). Here p is the number of processors — in this case p has been given as a parameter, rather than using n processors. We are now going to move from taking programs in this model toward talking about a real implementation of this model.
No — we have to come up with an algorithm to do it: the model has no notion of broadcast. We are not going to assume anything about the interconnect; all information is shared by writing to some place and somebody else reading it, each of which takes a constant amount of time. The interconnect is exactly what connects the processors to the shared memory, but it is hidden: we never directly say "talk to the interconnect," and we never say "talk to that processor." In PRAM there is no send/receive type of instruction — that is what we do in a distributed memory model. In the shared memory model it is a strict, pure memory model: you write to a common place, you read from a common place. So how do you get x to every processor, knowing that they cannot all read the location of x in the first clock? The question was why it is log p (multiplied by 2). The first processor can read x — and we are not worrying about how much space we use; we have plenty of shared memory, so just allocate some temporary area — the first processor reads x and writes it into one more location. Now we have two places where x is. Two processors read those two things and each writes one more copy, so now we have x in four locations; the next step will have x in eight locations. So how many steps to get it into p locations? log p. Now, the local computation: we have determined that every processor will take an n/p section of the input and search for x in it.
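The doubling broadcast can be sketched as follows — a sketch where `cells` stands for the temporary shared area, and each iteration of the while loop is one EREW write round:

```python
def broadcast(x, p):
    """EREW broadcast by doubling: each round doubles the number of
    cells holding x, so ceil(log2 p) write rounds suffice."""
    cells = [None] * p
    cells[0] = x          # processor 0 read x and wrote the first copy
    have, rounds = 1, 0
    while have < p:
        # processors 0..have-1 each read one existing copy, write a new one
        for i in range(min(have, p - have)):
            cells[have + i] = cells[i]
        have = min(2 * have, p)
        rounds += 1
    return cells, rounds

cells, rounds = broadcast(42, 8)
assert cells == [42] * 8 and rounds == 3   # log2(8) = 3 rounds
```

Within one round every processor reads a cell nobody else reads and writes a cell nobody else writes, so the exclusive-read, exclusive-write restriction is respected throughout.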
of the input and search for x in that right
so in over p steps will read in any model
exclusive concurrent all models will read
from their exclusive part of input check if
that value = x if it is not leave a flag of
otherwise turn on the local plan and then
keep doing this on all the locations okay
so at the end of it somebody may have found
it possible that no run of and they have local
variable called f is local memory as much
as i and now everybody needs to agree need
to come to an conclusion about somebody got
a plan i am not taking any mechanism i am
giving three examples 
You initialize the output to 0, and everybody who has found it is going to write a 1 there — all of them. That will work for every variant of concurrent write, so here we do not need to worry about which type of concurrent write it is: although I have written Common, it would be O(1) in all of them. But how about when you have exclusive write — the first and second columns? Then it is the reverse of the logarithmic broadcast: instead of broadcasting you are collecting. It is very similar to the addition — you are not adding, but combining all of these flags — in log p steps. So the totals are log p + n/p + log p on EREW, 1 + n/p + log p on CREW, and 1 + n/p + 1 on CRCW-Common.
Similarly, you can think of the activation: we said there is a start protocol, and that start protocol can be done the same way. We have the starting processor — the one whose id is 0 — and in log p steps, just like we did the log reduction or log expansion, it can tell everybody to become active. Every processor is reading, but they do not know where to read — they do not know where the input is, and they do not know whether the program is running or not — so they are just reading from an agreed location. Processor 0, which knows that the program is actually running and where the data and the input are, is going to tell everybody whatever it has learned, just like a broadcast: "you all start working on this part." All right, so that agreed location is where you write.
So there are some hard-wired things in the model — say, everybody reads location number 0. If concurrent read is allowed, you write to location 0 whatever you want everybody to know, and everybody reads it in the next step. If only exclusive read and write are allowed, everybody reads from their own respective cells, and you have to get the new value written to each of them. So they read on alternate cycles: they keep reading from some location which will be filled with some value at some point, a value which says "now the program is over there" or "now the data is over there." That seeding is going to finish in log p steps if you have p processors; and if you are in an exclusive read model, the initial distribution takes 2·log p, because initially they do this expansion. Any questions on the basics of the model?
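Putting the three phases together with their step counts gives a small cost model — a sketch assuming the table's entries (broadcast, the n/p scan, and the collection), with my own function names:

```python
import math

def member(a, x, p, model="EREW"):
    """Sketch of the three-phase membership search and its step count:
    broadcast x, each processor scans an n/p block, then collect flags."""
    n = len(a)
    chunk = math.ceil(n / p)
    # phase 2: processor i scans its exclusive block, sets a local flag
    flags = [any(v == x for v in a[i * chunk:(i + 1) * chunk])
             for i in range(p)]
    found = any(flags)
    logp = max(1, math.ceil(math.log2(p)))
    bcast = logp if model == "EREW" else 1      # CR models read x at once
    collect = 1 if model == "CRCW" else logp    # CW models agree at once
    return found, bcast + chunk + collect

found, steps = member([5, 3, 9, 1, 7, 2, 4, 8], 7, 4, "CRCW")
assert found and steps == 1 + 2 + 1             # O(1) + n/p + O(1)
```

The scan phase costs n/p in every model; only the first and last phases change, which is exactly the pattern in the table.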
We are running out of time on this session. Work is exactly the number of processors multiplied by the number of steps. Now suppose I have an algorithm and, going one level deeper, in each time step some amount of work is done — the number of processors that were active. In our summation, for example, in the first step n/2 work was done, in the second step only n/4, in the third n/8, and so on. Count the amount of work done at each step with n processors. Now, if I have p processors, I know that whatever work was done in one step was done in parallel by the n processors, which means p processors can also do it in parallel. What they cannot do is mix some of this step's work with some of another step's, because those may depend on each other. So the p processors can only take the w_i operations that were done at step i, allocate them among themselves — each takes about w_i/p of it — and do them. That handles the case where I actually have p processors, not whatever number the algorithm assumed. Now add this up over all the steps: that is the amount of work the p-processor PRAM will actually do, and it also gives the number of steps it takes, because what was done in one step by w_i processors now takes ⌈w_i/p⌉ sub-steps — there are only p processors to share the load. So you add it together, and the summation over i of ⌈w_i/p⌉ is the total. Each term needs a ceiling: if there are ten things to do and three processors, then somebody will be taking four steps, so that step gets done in four sub-steps. The sum of all of that is at most W/p plus t(n), W being the summation of all the w_i and t(n) the number of steps the original PRAM algorithm took with however many processors it assumed. Why? At each step I am taking ⌈w_i/p⌉; when I add the w_i/p parts together, p is the same in every denominator, so all the w_i add up to the total work W, divided by p. How much extra might I have added by taking each term to its ceiling? At most one per step — one minus something, by taking it to the ceiling — and we did this t(n) times, because the original algorithm took t(n) steps. So we have added at most t(n) more: Σ_i ⌈w_i/p⌉ ≤ W/p + t(n).
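This bound can be checked numerically for the summation example, where w_i = n/2, n/4, …, 1 — a sketch with my own variable names:

```python
import math

def brent_steps(work_per_step, p):
    """Steps a p-processor PRAM needs to replay the original schedule:
    ceil(w_i / p) sub-steps for the w_i operations of step i."""
    return sum(math.ceil(w / p) for w in work_per_step)

n = 1024
w = [n // 2 ** k for k in range(1, 11)]   # 512, 256, ..., 1
W, t, p = sum(w), len(w), 16

steps = brent_steps(w, p)
# the counting argument: sum of ceilings <= W/p + t
assert steps <= W / p + t
```

For these numbers the replay takes 67 sub-steps against a bound of about 74, so the ceilings cost well under one extra step per original step on average.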
So tell me, what is W? It is the sum of the w_i. How it relates to the time — we will talk about that too, or maybe you will have a question on it.
What we are going to do next week is actually simulate a general p-processor PRAM with a smaller PRAM. What we just did was only an analysis of how much time it would take; next we will give an algorithm that blindly takes an n-processor program — a program for any number of processors — and runs it on a p-processor PRAM. Similarly, we will take any program that runs on m cells and run it on m' cells. So even if I have only limited memory and a limited number of processors, I can still simulate whatever the PRAM algorithm did, at the cost of the number of steps. But can it be done? Good question — we will find out. All right.
So we were discussing last time this notion of work-time scheduling. We will come back to it — I have not finished talking about it — but we will relate it to some of the other things we will talk about shortly, so I am going to bring it back when it makes sense. The first couple of things to look at are at the theoretical level: does PRAM make sense? We said the amount of memory is infinite — not literally infinite, but as much as you need — and some arbitrary, unlimited number of processors. Is that a handicap? With those assumptions, am I breaking the effectiveness, the applicability, of this model? One of the theorems we are going to prove — or maybe we will not prove it — is this: suppose I give you a PRAM algorithm that uses some f(n) processors, in general p processors, but I want to run it on another PRAM that has only p' processors, which can be less than p. Could I do it with the same amount of shared memory? In fact, we do not say anything about local memory: we are going to use as much local memory as we need in order to do this simulation. We do not restrict local memory, and that makes sense, because we can make local memory quite big — the RAM model assumes unbounded local memory — and that is quite useful. So if I have only p' processors and you give me a p-processor PRAM algorithm, how would the p'-processor PRAM run it? It is kind of like a context switch on each processor. Here is the lemma I should be telling you (Refer Time: 33:08): any problem that can be solved on a p-processor PRAM in t steps can be solved on a p'-processor PRAM in O(t · p/p') steps.
Essentially what you are going to do is this. We are assuming p' is smaller; if it is bigger than p there is not much to be done — we just let the extra processors idle. But if you have fewer processors than the p the algorithm asks for, you have each of the p' processors act as a proxy for p/p' of the original processors. We call the original processors — the ones we are simulating — the simulated processors, p of them, and the p' real processors the available processors. Each available processor will simulate, or work for, p/p' virtual processors. What does it have to do? How did the algorithm work? For every step it said: read from here, do this local computation, write over there. Now we are going to do that in a loop, and nobody is inactive: there are p' processors working, each simulating p/p' of the virtual processors — the assumed number of processors. So say I am one of these processors and I have to simulate p/p' of them; each one had a read, a local computation, and a write in this step. So I have an array of reads, an array of local computations, and an array of writes to do. What do I do? All the reads first, reading them into local memory one after the other. They would have been done in parallel; I am doing them in sequence — it makes no difference, they are all reads. Then I perform the local computations in local memory, one after the other. They would have been done in parallel; doing them in sequence makes no difference, they are all independent — I treat one area of my local memory as the local memory of virtual processor 0, another as that of virtual processor 1, and so on, so they are independent of each other. Similarly, they were all going to write in parallel; I write in sequence. After these many sub-steps, I have simulated them all. How many sub-steps? If you count read, compute, and write as one step together, then I have taken not one step but three times the number of virtual processors in sub-steps — which means as many steps as the number of processors I simulate.
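One original step of p virtual processors can be replayed on p' real processors as three sequential phases — a sketch in which the read/compute/write callbacks are my framing of the "array of reads, computations and writes," assuming exclusive writes:

```python
def simulate_step(shared, p, p_real, read_addr, compute, write_addr):
    """Replay one step of p virtual processors using p_real real ones.
    Each real processor handles a block of ceil(p / p_real) virtual
    processors: all reads first, then all computations, then all writes."""
    per = -(-p // p_real)                       # ceil(p / p_real)
    blocks = [range(r * per, min((r + 1) * per, p)) for r in range(p_real)]
    # phase 1: every real processor does all its virtual reads
    local = {i: shared[read_addr(i)] for b in blocks for i in b}
    # phase 2: local computations, independent of one another
    results = {i: compute(i, v) for i, v in local.items()}
    # phase 3: all the writes (exclusive writes assumed here)
    for i, v in results.items():
        shared[write_addr(i)] = v
    return shared

# virtual processor i doubles cell i and writes it back, 6 virtuals on 2 reals
shared = simulate_step([1, 2, 3, 4, 5, 6], p=6, p_real=2,
                       read_addr=lambda i: i,
                       compute=lambda i, v: 2 * v,
                       write_addr=lambda i: i)
assert shared == [2, 4, 6, 8, 10, 12]
```

Keeping the three phases strictly separated is what preserves the original step's semantics: every virtual read sees the shared memory as it was before any virtual write of the same step.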
How do you simulate a common write — how can it all be done? If we have a Common CRCW PRAM, then we will simulate a Common CRCW PRAM on it. How would you do that? All these writes, if they are going to the same address, should be written only if they are writing the same thing. So you can check that locally first: the values are in your local memory. How much time does it take to check them? As many sub-steps as the processors you simulate; you can brute-force it, and nothing smarter changes the number of steps needed to perform the writes. If it is Arbitrary, do not do anything special — just keep writing. If it is Priority, you might think: write the highest-priority one at the end. But the p' processors act independently: I may have three writes to some memory location, while some other processor has only one write to that location — and it is the highest-priority one. So when does that one get written? That guy writes immediately, and three sub-steps later my write overwrites it. So you have to do somewhat the opposite of what you are saying — which is perfectly fine.
Say this processor wrote its value first, and this other processor is going to write later to the same location. You sort your writes in order of priority: the highest-priority write gets written first. If each processor writes its highest-priority value first, then among those, the hardware's priority picks the higher one. When this processor comes to its second write to that location, it need not do it, because a higher-priority write has already landed — exactly. Then you are perfectly fine: you write to an address only if a higher priority has not written to it before. Note that the writes are not all to one common, conflicting address: say a and b want to write to location 1, and b and c want to write to location 2. First, a and b decide to take the highest-priority value between them and write that first; if anybody's writes coincide — same sub-step, same location — the p' hardware automatically takes care of it; if not, then I simply go ahead and write. Do b and c synchronize? We are assuming that everybody knows how many virtual processors everybody has; in fact, for simplicity of the proof, we can start with the assumption that everybody has the same number — that does not change anything in this example. The basic rule is this — and it is not a higher-priority memory location, which is what I am going to say, it is the higher-priority processor — the higher-priority processor's write must survive. This processor has some things to write, that processor has some things to write, and we have to figure out the order in which to write them, without knowing where the other one is going to write. There are up to p/p' memory locations that you are going to write, since you are simulating p/p' processors.
All right, this is one way to do it. You write the values, in whatever order of these memory locations you care for — the issue was that you do not know in what order the others write, so you just write them in any order — but you have to read each one back, to find out whether your value got written and whether anybody else wants to write that memory location or not. You are going to do this p/p' times. If somebody else wants to write the memory location, you need to know what its priority is: if it is higher, then you do not want to write it again. So now, into every memory location you write what you wanted to write as well as the id of the virtual processor writing it — each memory location becomes a tuple: you write your value and the processor id of whoever wants to write it. Whoever wins out, his value is going to be written there; if you do not win out, you are going to know it, and you are not going to write it again. From these ids you can figure out who should be writing — who is higher. "That part I understand, but it is not only that" — right: the underlying hardware is still priority hardware, but it recognizes priority only among the p' real processors, whereas you need to figure out the priority among the original p processors.
The other processor is going to write it in one of the other sub-steps — that is the problem. If they were writing it in the same sub-step, there is no problem; but when it writes in one of the other sub-steps, it must not overwrite something that someone else wrote, in that same original step, at higher priority. So before writing, it must read, and write only if, among the simulated sub-steps, nobody at higher priority has written to it before. So still only p/p' simulation sub-steps are needed, but you write to a cell by reading it first: each write operation turns into a read, some computation, and a write.
Internally you can distinguish priorities: if I am simulating processors 8, 9, 10, and 11, then I know that 8 has higher priority than 9, 10, and 11 — that comparison is local. But the issue is that I am writing up to p/p' things, and some other processor is also writing up to p/p' things, in some arbitrary order. So what I write here may conflict with what that fellow writes over there, and even if I have higher priority, that fellow may overwrite me. Can you write everything at once? No: it is PRAM hardware with p' processors, so it can only write p' things concurrently — you cannot say "I am going to write all these things in one step." There is another way of simulating, though. Technically, in PRAM there is no limit on the size of a memory cell — nobody says how big a cell is; a cell can be as big as you like, big enough to hold all the processor ids. So you can simulate by having each processor write its part of the data into its own slot: each memory cell becomes an array, and all the slots together tell you what the value is. In that case everybody gets to write into its own slot, but when you read, you use only the highest-priority slot and discard the others — and you can read the entire thing in one shot, because it is one memory cell. So there is that technical way of doing it; the more algorithmic way is simply to add the information of who is writing. All right — is everything clear on simulating with a smaller number of processors?
Before you write, you read — every time. So k writes are not just k writes; they are done as k sub-steps of read, local compute, write. And it is not just conflicts among your own writes that you are interested in: one of your writes may be common with one of the writes of somebody else. You have said "I am going to write in this order" — these are just some three random memory cells, and you write them in this order, while some other processor has some other three random memory cells that it writes in some arbitrary order. You cannot even say "I will write the smallest address, then the bigger, then the biggest" — it makes no difference, because these are independent locations. But one of them might match, and you do not know whether that guy's third write matches your first, or some other combination.
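The (id, value) tuple scheme with a read-back before every later write can be sketched like this — hypothetical code, assuming a smaller virtual id means higher priority and that same-round writes are resolved correctly by the hardware:

```python
def priority_writes(cells, writes):
    """Simulate priority-CRCW writes across virtual processors.
    `writes` maps a virtual id to its (addr, value) list, sorted so each
    id's highest-priority write goes first. Before writing, a processor
    re-reads the cell (the tuple holds the writer's id) and only writes
    if no higher-priority (smaller) id has already claimed it."""
    rounds = max(len(w) for w in writes.values())
    for rnd in range(rounds):
        # within a round, the hardware resolves conflicts by priority
        for vid in sorted(writes):
            if rnd < len(writes[vid]):
                addr, val = writes[vid][rnd]
                owner = cells.get(addr)        # the read-back step
                if owner is None or owner[0] > vid:
                    cells[addr] = (vid, val)   # cell = (writer id, value)
    return cells

# the lecture's problem case: id 9 has three writes, id 0 has one
# high-priority write to the same address 5 — it must not be overwritten
cells = priority_writes({}, {0: [(5, "hi")],
                             9: [(1, "a"), (5, "b"), (2, "c")]})
assert cells[5] == (0, "hi")
```

Because every cell carries its current owner's id, a later sub-step can tell whether its pending write has already lost, which is exactly why each write costs a read-compute-write sub-step.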
All right — what if I do not have enough memory? The same idea will apply. Instead of m memory cells I have only m'. So shared memory is limited, but local memory is not — if you want more local memory, I can give it to you. Before, you were simulating many processors inside one; now you will simulate many memory cells inside one. How do you do that? You basically make a local representation — a copy, a replica — of the shared memory you wanted. You wanted m cells, but you only have m', so I will take the local memories and distribute the m cells into them: each processor gets about m/p local memory cells dedicated to act as the shared memory, because the shared memory has been distributed. But you still have the m' real cells available too. There are m shared memory cells that you wanted to simulate — that you wanted your algorithm to be able to address — but you only got m': the real hardware has only m' shared cells. I want to write m different things, potentially, over the period the program runs, so I take those m things that would have been accessed as shared memory and allocate them among the p processors: you keep a replica of m/p cells, you keep a replica of m/p cells, and the last m/p cells are handed to the last processor. But now, how do you share them?
You still have a real shared memory. As long as one shared memory cell is available per processor, you simply figure out which of your shared-memory cells to expose: you write that cell's value into your real shared cell and keep the others local to you. How do you make sure everybody can see what they need? You write one of your allocated values so the others can see it, and over the period of this step — the original step — you have to make sure everybody gets a chance. You do not know who wants to read what, so you are going to take all the cells you are responsible for and put them in the shared memory, one at a time. So it is basically a loop: now take my memory location 0, now my memory location 1, now my memory location 2 — one at a time. But it is not a tight loop: you are not just writing, because somebody else has to read in between. You turn each one into an entire step: everybody writes their 0th location into the shared memory, and then everybody gets a chance to read — if you wanted to read it, read it; otherwise keep quiet. Again, you are allocating the shared memory among the different processors, and because you are responsible for exposing your part of the shared memory to everybody else, you do that by exposing every single cell you have. Yes — there are that many writes for one original PRAM step: you have taken each write and turned it into a full step. So processor 0 writes its location 0 into shared memory address 0; processor 1 writes its 0th cell into shared memory address 1 (everybody has an array of memory cells now); processor 2 writes its 0th cell into shared memory address 2 — and in one step everybody has written one value. p processors have written p things, and we are saying there are at least m' processors — that is why the number of processors in this PRAM is at least m' — so all the m' real cells have been written by somebody. In the next read step, anybody who wanted to read one of those values reads it, and then all the processors write the second element of their arrays into the shared memory. The writing part has turned into a big loop; after that everybody reads, and the local computation happens as usual. So you are taking the m cells, partitioning them into m' segments — blocks, whatever you want to call them — assigning each block to a processor, and having the processor essentially use its section of the real shared memory as a staging area. After as many iterations as m/m' — because that is how many memory cells each one is responsible for — you have exposed all of the memory, so anybody who wanted to read anything in the original PRAM has got a chance to read the proper value in one of these rounds. All right.
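The staging loop can be sketched as follows — a sketch in which `want`, the map describing who reads which simulated cell, is my own framing:

```python
def staged_reads(blocks, m_real, want):
    """Simulate m = sum(len(b)) shared cells on m_real real cells.
    Processor r holds block r locally; in round k every processor r
    writes its k-th cell into real cell r, then readers pick up values.
    `want` maps a reader to the (block, offset) it wants to read."""
    got = {}
    rounds = max(len(b) for b in blocks)   # = ceil(m / m_real) cells each
    for k in range(rounds):
        real = [None] * m_real
        for r, block in enumerate(blocks):       # one write step
            if k < len(block):
                real[r] = block[k]
        for reader, (blk, off) in want.items():  # one read step
            if off == k:
                got[reader] = real[blk]
        # the next round exposes each processor's (k + 1)-th cell
    return got, rounds

# m = 6 simulated cells held as 3 blocks of 2, staged through 3 real cells
blocks = [[10, 11], [20, 21], [30, 31]]
got, rounds = staged_reads(blocks, 3, {0: (2, 1), 1: (1, 0)})
assert got == {0: 31, 1: 20} and rounds == 2
```

Every simulated cell appears in the real shared memory in exactly one round, so a reader only has to wait for the round matching its cell's offset — which is where the m/m' slowdown factor comes from.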
In terms of evaluating performance, we looked at the notion of how many processors we are going to use, because the algorithm is now saying: if you give me p processors, I am going to do this in time t. So t becomes a function of p, the number of processors, and of n, the input size. Instead of having the one metric we used to have — how much time does it take, is it a log n algorithm or an n² algorithm — we now say things like "it takes log n time if you have n² processors." So how do you compare two algorithms? That is where we use the notion of work, and we will go through a few things now to see why work is what we should be comparing. What it is saying is this: you give me two algorithms; one takes p1 processors and time t1, the other takes p2 processors and time t2. I will say: hold on, tell me how much work each does. One does work w1, the other does work w2, and I am going to take the one that does the lower, or smaller, amount of work. So, for example, if I have something that takes order n time but also does work that is order n, I am going to prefer it over another one that is faster — takes log n time — but whose work is n log n. Generally speaking; we will dig a little deeper into that very shortly. And if two algorithms do the same work, then we will look at the time: you do the same amount of work, but who does it in fewer steps? Fewer steps means the work is more evenly balanced among the steps. We will also be looking at how scalable it is: we are not just saying "if n has this value, the time goes up by this much"; we will ask how well it scales as n goes up. Then there are two notions of speedup. One is absolute speedup, which says: if the best sequential algorithm took time t0 and your parallel algorithm takes time t1, the speedup is t0 over t1. The other is relative — that is why the term is relative speedup. It says: run your algorithm — not the sequential algorithm, your PRAM algorithm — with only one processor available; how much time would it take? If that is t(1), and on the same size-n input with p processors it is t(p), then the relative speedup is t(1) over t(p). Usually relative speedup is going to be the important one — not that absolute speedup is unimportant, but if you have got a good relative speedup, it says you have an algorithm that scales well: as the hardware size changes, it is going to do well. We do not reduce the input size; it is the same algorithm — an algorithm now has two parameters, how many processors it takes and what the input size is. I keep the input size fixed, take the processor count all the way down to one, and then compare that time with the time at some other, higher number of processors. Okay, let us stop here.
