>> GOOD AFTERNOON, EVERYONE.
IT'S MY PLEASURE TO WELCOME YOU
TO A SPECIAL WEDNESDAY AFTERNOON
LECTURE BECAUSE THIS IS THE
MARGARET PITTMAN LECTURE, A
SPECIAL LECTURE GIVEN ONCE A
YEAR TO HONOR MARGARET PITTMAN.
SHE WAS THE FIRST WOMAN
LABORATORY CHIEF AT THE NATIONAL
INSTITUTES OF HEALTH APPOINTED
TO THAT POSITION IN 1957 AFTER A
DISTINGUISHED CAREER HERE AT NIH
AND IN OTHER PLACES LIKE
ROCKEFELLER, WHERE SHE WAS VERY
MUCH A PIONEER IN THE AREA OF
INFECTIOUS DISEASE.
IN FACT, DR. PITTMAN WAS THE
FIRST TO ESTABLISH CAPSULAR
TYPE B OF HAEMOPHILUS INFLUENZAE,
ONE OF SIX TYPES OF H.
INFLUENZAE, AS THE ONE MOST RESPONSIBLE FOR
CHILDHOOD MENINGITIS, SETTING THE
STAGE FOR EFFORTS, SOME DONE HERE,
WHICH HAVE NOW LED TO A
REMARKABLE DECLINE IN THE
INCIDENCE OF THAT TERRIBLE
DISEASE BECAUSE OF THE AVAILABILITY OF A
VACCINE, AND YOU CAN CONNECT THE
STORY BACK TO MARGARET PITTMAN'S
WORK IN THE 1930s.
SHE ALSO WORKED ON SALMONELLA
TYPE B, WORKED OUT ANOTHER
PARTICULAR BACTERIUM, HAEMOPHILUS
AEGYPTIUS, RESPONSIBLE FOR
EPIDEMIC CONJUNCTIVITIS, AND
MADE OTHER OBSERVATIONS
IMPORTANT IN VACCINES.
 SO WE EVERY YEAR CHOOSE SOMEONE
 AND THE CHOICE COMES FROM THE
 ADVICE OF THE SCIENTIFIC
 DIRECTORS AT NIH AND ON THE
 RECOMMENDATION OF THE NIH
 SCIENTIST ADVISORS, TO
 DELIVER THE PITTMAN LECTURE, SOMEONE WHO
 FOLLOWS IN THAT TRADITION OF
 BEING A REMARKABLE WOMAN
 SCIENTIST LEADER.
 TODAY WE'RE FORTUNATE THAT THAT
 ROLE IS BEING PLAYED BY
 PROFESSOR BONNIE BERGER, WHO IS
 PROFESSOR OF APPLIED
 MATHEMATICS AND COMPUTER
 SCIENCE AT M.I.T.
 AND ASSOCIATED WITH THE
 BROAD INSTITUTE.
 BONNIE GOT HER UNDERGRADUATE
 DEGREE AT BRANDEIS AND GOT A
 Ph.D. AND A
 POSTDOCTORAL AT M.I.T., AND HAS BEEN THERE
 EVER SINCE, SINCE 1992 ON THE FACULTY,
 RATHER RAPIDLY ADVANCING FROM
 ASSISTANT TO ASSOCIATE TO FULL
 PROFESSOR, WHERE SHE IS NOW. ALONG
 THE WAY SHE'S BEEN HONORED BY AN
 NSF CAREER AWARD, BY THE
 BIOPHYSICAL SOCIETY'S DAYHOFF
 AWARD FOR RESEARCH, AND BEING
 CHOSEN AS A FELLOW OF THE
 ASSOCIATION FOR COMPUTING
 MACHINERY IN 2004.
 HER WORK IS VERY TIMELY FOR US
 HERE AT NIH, AS WE'RE ALL
 STRUGGLING WITH THE WONDERFUL
 PROBLEM OF HAVING TOO MUCH
 DATA.
 BIG DATA, AS IT'S FEATURED ON
 THE COVER OF NATURE MAGAZINE,
 AS WE TALK AROUND THE TABLE AT
 DIRECTOR MEETINGS ON THURSDAY
 MORNING, BIG DATA AS I'M NOW
 BEING ASKED BY PEOPLE IN THE
 WHITE HOUSE, WHAT ARE YOU GOING
 TO DO ABOUT THIS, SINCE
 EVERYBODY RECOGNIZES THAT WE
 ARE IN A CIRCUMSTANCE OF
 NEEDING TO BE VERY THOUGHTFUL,
 AND CREATIVE, ABOUT HOW WE
 HANDLE THE VERY LARGE
 QUANTITIES OF BIOLOGICAL DATA
 THAT ARE POURING OUT OF MANY
 GENOMICS-BASED STUDIES, IN GENERAL
 AND OF HOW DISEASE WORKS.
 WE NEED INDIVIDUALS CREATIVE IN
 PUTTING TOGETHER ALGORITHMS TO
 ASSIST US IN MINING NUGGETS OUT
 OF THIS LARGE SEA OF
 INFORMATION.
 WE COULD NOT HAVE A BETTER
 PERSON TO DESCRIBE SOME OF THE
 APPROACHES THAT ARE CURRENTLY
 BEING DONE IN THAT REGARD, AND
 WHO IS A LEADER HERSELF IN THAT
 EFFORT THAN TODAY'S SPEAKER.
 SO HER PRESENTATION TODAY IS
 CALLED COMPUTATIONAL BIOLOGY IN
 THE 21st CENTURY, MAKING
 SENSE OF DATA.
 JOIN ME IN WELCOMING PROFESSOR
 BONNIE BERGER.
 [APPLAUSE]
 >> GOOD AFTERNOON.
 DR. COLLINS ALMOST TOOK SOME OF
 MY INTRODUCTION, BUT THAT'S
 FINE.
 ANYWAY, THE MISSION OF OUR
 FIELD IS TO ANSWER BIOLOGICAL
 AND BIOMEDICAL QUESTIONS BY
 USING COMPUTATION IN SUPPORT OF
 OR IN PLACE OF LABORATORY
 PROCEDURES, WITH ONE GOAL BEING
 TO GET MORE ACCURATE ANSWERS AT
 A GREATLY REDUCED COST.
 WE ARE CURRENTLY GENERATING
 MASSIVE DATA SETS, SO MASSIVE
 THAT WITHOUT SMART ALGORITHMS
 WE WON'T BE ABLE TO ANALYZE
 THESE TO DISCOVER PATTERNS THAT
 MIGHT PROVIDE CLUES TO THE
 UNDERLYING BIOLOGICAL
 PROCESSES.
 THROUGHOUT MY TALK, THERE WILL
 BE A COMMON THEME OF TAKING A
 MACROSCOPIC VIEW OR PICTURE OF
 THE DATA THROUGH WHICH WE CAN
 VIEW PROBLEMS LIKE MEDICAL
 GENOMICS AND BIOLOGICAL
 NETWORKS.
 BUT THERE WAS A CHALLENGE HERE,
 AS DR. COLLINS SAID, THE SIZE
 OF THE DATABASES IS GROWING
 ASTRONOMICALLY.
 WE HAVE LOTS OF DATA.
 THE BAD NEWS IS THE PROBLEMS
 THREATEN TO BECOME
 COMPUTATIONALLY INTRACTABLE DUE
 TO THE SHEER ENORMITY OF THE
 DATABASES.
 THINGS WERE BAD ENOUGH WHEN I
 STARTED AROUND 1995 IN THIS
 AREA. BACK THEN THE SIZE WAS
 HALF A MILLION SEQUENCES, THE
 PDB HAD 3,800 PROTEIN
 STRUCTURES, AND WE USED 50,000
 PROTEIN SEQUENCES FOR PARALLEL RESIDUE
 CORRELATION.
 THINGS HAVE GOTTEN WORSE AT AN
 INCREDIBLE RATE.
 RECENTLY THERE'S BEEN AN
 EXPONENTIAL EXPLOSION IN THE
 AMOUNT OF SEQUENCING DATA.
 NOW, IT IS TRUE THAT COMPUTERS
 HAVE GOTTEN A LOT FASTER, AND
 ALSO MORE COST EFFECTIVE.
 AS YOU CAN SEE IN THE GREEN LOG
 SCALE PLOT HERE, THE AMOUNT OF
 PROCESSING YOU CAN DO PER
 DOLLAR OF COMPUTE HARDWARE HAS
 BEEN MORE OR LESS DOUBLING
 EVERY YEAR.
 KNOWN AS MOORE'S LAW.
 BACK IN THE 1990s, THIS WAS
 ENOUGH TO KEEP UP WITH THE PACE
 OF SEQUENCING DATA.
 WHICH IS SHOWN IN BLUE HERE.
 BUT LOOK WHAT HAPPENED AFTER
 THE ADVENT OF NEXT GEN
 SEQUENCING, THE SIZE OF
 DATABASES HAS BEEN GROWING BY A
 FACTOR OF TEN EVERY YEAR.
 NOW, IN THE PAST, WE WOULD DEAL
 WITH SUCH PROBLEMS BY SAYING
 FUTURE COMPUTERS WILL BE FAST
 ENOUGH.
 BUT CLEARLY, THAT'S NOT THE
 CASE.
 SO THIS IS A BIG PROBLEM AND A
 CHALLENGE FOR THE FIELD.
 SO MUCH SO, THAT THERE'S BEEN A
 RECENT NEW YORK TIMES ARTICLE,
 ALSO MANY OTHERS, IDENTIFYING
 THIS KIND OF A PROBLEM, IN FACT
 THEY POINT OUT THAT BGI, THE
 LARGEST GENOME CENTER IN THE
 WORLD WAS SEQUENCING SO MUCH IT
 OVERWHELMED THE INTERNET
 CONNECTION AND IT COSTS MORE TO
 ANALYZE A GENOME THAN TO
 SEQUENCE IT NOW.
 NOW, IT'S TEMPTING TO THINK
 THAT CLOUD COMPUTING WILL SOLVE
 THIS PROBLEM, AS THIS ARTICLE
 ITSELF SUGGESTS.
 BUT THAT'S SIMPLY NOT THE CASE.
 IT MAY SAVE SOME COST, BUT IT
 DOESN'T ADDRESS THE FUNDAMENTAL
 ISSUE.
 THAT IS, IT DOESN'T CHANGE THE
 PROBLEM THAT SEQUENCING DATA IS
 GROWING EXPONENTIALLY FASTER
 THAN COMPUTING POWER PER
 DOLLAR.
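TO MAKE THAT ARITHMETIC CONCRETE, HERE IS A MINIMAL PYTHON SKETCH USING ONLY THE TWO GROWTH RATES QUOTED ABOVE (COMPUTE PER DOLLAR DOUBLING EVERY YEAR, DATABASES GROWING TENFOLD EVERY YEAR). THE FUNCTION NAME `gap_after` AND THE CONSTANT NAMES ARE ASSUMPTIONS MADE FOR ILLUSTRATION. IT ONLY SHOWS HOW QUICKLY THE RATIO BETWEEN THE TWO CURVES WIDENS.

```python
# Growth factors as quoted in the talk; everything else here is illustrative.
COMPUTE_GROWTH_PER_YEAR = 2.0   # processing per dollar roughly doubles each year
DATA_GROWTH_PER_YEAR = 10.0     # sequence databases grow roughly tenfold each year


def gap_after(years: int) -> float:
    """Return how many times faster the data has grown than compute per dollar."""
    return (DATA_GROWTH_PER_YEAR / COMPUTE_GROWTH_PER_YEAR) ** years


if __name__ == "__main__":
    for year in range(6):
        print(f"year {year}: data has outgrown compute by {gap_after(year):,.0f}x")
```

THE RATIO GROWS BY A FACTOR OF FIVE EVERY YEAR, WHICH IS WHY BUYING MORE HARDWARE AT THE SAME PRICE DOES NOT CLOSE THE GAP.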
 SO THE ONLY THING THAT WILL
 ADDRESS THIS ISSUE IS
 FUNDAMENTALLY BETTER ALGORITHMS
 TO MAKE A DIFFERENCE.
 WE NEED ALGORITHMS SO FAST THAT
 IN SOME CASES THEY DON'T --
 THEIR RUNNING TIME DOES NOT
 EVEN GROW LINEARLY WITH THE
 SIZE OF THE DATA.
 AND THAT'S WHAT WE DO.
 WE DEVISE ALGORITHMS THAT DO
 THESE CALCULATIONS FAST AND
 SCALE SO THE COST DOESN'T
 EXPLODE WITH THE SIZE OF THE
 DATABASE.
 ANOTHER THING WE DO IS DESIGN
 ALGORITHMS TO TAKE ADVANTAGE OF
 MASSIVELY GROWING DATA SETS TO
 GAIN NEW BIOLOGICAL INSIGHTS.
 SO DESIGNING EFFICIENT
 ALGORITHMS FOR PROCESSING
 MASSIVE DATA ALLOWS US TO
 PRODUCE SOFTWARE THAT CAN
 ANSWER SOME IMPORTANT
 BIOMEDICAL QUESTIONS IN
 PRACTICE.
 SO IN THIS TALK, I'LL SPEAK
 ABOUT THREE INSTANCES WHERE WE
 HAVE MASSIVE AMOUNTS OF DATA,
 AND HOW WE'RE RESPONDING TO THE
 CHALLENGE OF ANALYZING IT.
 I'LL TALK ABOUT ONE CHALLENGE
 IN LARGE SCALE GENOMICS, ONE
 CHALLENGE IN MEDICAL GENOMICS,
 ONE IN NETWORK BIOLOGY.
 THE SPOTLIGHT THERE WILL BE ON HOW
 BETTER ALGORITHMS CAN MAKE THE
 PROBLEMS TRACTABLE AND GAIN
 INSIGHTS WE WOULDN'T HAVE BEEN
 ABLE TO GAIN.
 LET'S FOCUS FIRST ON LARGE
 SCALE GENOMICS.
 SO CURRENTLY, MANY GENOMICS
 APPLICATIONS REQUIRE US TO
 STORE, ACCESS AND ANALYZE VERY
 LARGE LIBRARIES OF SEQUENCE
 DATA.
 BUT GIVEN THE GROWTH OF SUCH
 DATA THAT I JUST DESCRIBED, WE
 HAVE TO WONDER IF OUR FASTEST
 ALGORITHMS CAN KEEP PACE.
 CLEARLY IF WE JUST WANT TO
 STORE THE DATA WE COULD
 COMPRESS IT WHICH SOME HAVE
 DONE BUT THAT IS NOT GOING TO
 SOLVE ALL OF OUR PROBLEMS
 BECAUSE EVENTUALLY WE HAVE TO
 LOOK AT IT.
 SO THE KEY HERE IS THAT MUCH OF
 THE QUOTE/UNQUOTE NEW DATA IS
 ACTUALLY SIMILAR.
 SO THE QUESTION BECOMES HOW CAN
 WE TAKE ADVANTAGE OF THIS
 REDUNDANCY IN OUR ALGORITHMS
 THAT STORE AND PROCESS THIS
 DATA AT THE SAME TIME?
 WE CALL THIS COMPRESSIVE
 GENOMICS.
 SO NOTICE THAT IN THE ORIGINAL
 SCENARIO HERE, WE COMPRESS THE
 DATA AND THEN DECOMPRESS IT TO
 ANALYZE IT, WHEREAS WITH
 COMPRESSIVE GENOMICS WE
 COMPRESS THE DATA AND OPERATE
 ON THAT WITH NO NEED TO
 DECOMPRESS.
 NOW, IN THE ALGORITHM COMMUNITY,
 WHICH I'M FROM, THIS IS WHAT'S
 KNOWN AS SUCCINCT DATA STRUCTURES
 FOR THE EXACT MATCHING CASE, BUT
 THINGS ARE RARELY THAT EXACT IN OUR
 FIELD.
 AS IT TURNS OUT, WE HAVE
 DATABASES OUT THERE SUCH AS
 WORM BASE THAT HOLD DATA FOR
 MANY CLOSELY RELATED AND NOT SO
 CLOSELY RELATED SPECIES.
 AND THE THOUSAND GENOMES
 PROJECT IS GENERATING LOTS AND
 LOTS OF HIGHLY SIMILAR HUMAN
 SEQUENCE DATA.
 SO HOW SIMILAR IS THIS DATA?
 WELL, HERE IS AN ILLUSTRATION
 OF A SUBTREE, AND THE AMOUNT OF
 NONREDUNDANT DATA FOR EACH
 LEVEL OF THE TREE IS IN BLACK,
 AND THE INDIVIDUAL GENOMES ARE
 COLORED.
 IF YOU LOOK UP HERE, YOU SEE
 THAT THE AMOUNT OF NONREDUNDANT
 DATA IS HALF THE SIZE OF THE
 TOTAL DATABASE.
 AND YOU WOULD EXPECT THAT FOR A
 COLLECTION OF HIGHLY SIMILAR
 GENOMES, YOU COULD GET THE
 AMOUNT OF NONREDUNDANT DATA
 PROPORTIONAL TO ONE OF THE
 GENOMES.
 HOW WE MAKE USE OF THIS
 REDUNDANCY IS AT THE HEART OF
 COMPRESSIVE GENOMICS.
 WE HAVE A NUMBER OF APPLICATION
 AREAS AND I'M NOT GOING TO BE
 ABLE TO GET INTO THEM TODAY BUT
 YOU SHOULD SEE THEM IN PRINT
 SHORTLY.
 THE KEY IS THE RUN TIME IS
 PROPORTIONAL TO THE
 NONREDUNDANT INFORMATION THAT
 WE HAVE IN THE COLLECTION OF
 GENOMES WE CONSIDER, RATHER
 THAN THE FULL DATA SET.
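AS A ROUGH ILLUSTRATION OF SEARCHING THE NONREDUNDANT DATA DIRECTLY, HERE IS A TOY PYTHON SKETCH. IT IS NOT THE METHOD FROM THE TALK: IT ASSUMES EXACT DUPLICATE FIXED-SIZE BLOCKS AND EXACT SUBSTRING QUERIES, AND IT MISSES MATCHES THAT SPAN A BLOCK BOUNDARY. THE NAMES `build_compressed_index` AND `search` AND THE BLOCK SIZE ARE HYPOTHETICAL. THE POINT IS ONLY THAT THE SCAN TOUCHES EACH DISTINCT BLOCK ONCE, HOWEVER MANY GENOMES SHARE IT.

```python
from collections import defaultdict


def build_compressed_index(genomes, block_size):
    """Cut every genome into fixed-size blocks and keep each distinct block once.

    Returns (unique_blocks, occurrences), where occurrences maps a block to the
    list of (genome_name, offset) positions it was seen at.
    """
    occurrences = defaultdict(list)
    for name, sequence in genomes.items():
        for offset in range(0, len(sequence), block_size):
            block = sequence[offset:offset + block_size]
            occurrences[block].append((name, offset))
    return list(occurrences), occurrences


def search(query, unique_blocks, occurrences):
    """Scan only the distinct blocks, then expand each hit to all genomes sharing it."""
    hits = []
    for block in unique_blocks:
        position = block.find(query)
        while position != -1:
            for name, offset in occurrences[block]:
                hits.append((name, offset + position))
            position = block.find(query, position + 1)
    return sorted(hits)


if __name__ == "__main__":
    # Two made-up, nearly identical sequences.
    genomes = {"a": "ACGTACGTTTGA", "b": "ACGTACGTTTGC"}
    blocks, occurrences = build_compressed_index(genomes, block_size=4)
    print(len(blocks), "distinct blocks out of 6 stored blocks")
    print(search("CGT", blocks, occurrences))
```

THE WORK DONE BY `search` IS PROPORTIONAL TO THE NUMBER OF DISTINCT BLOCKS PLUS THE NUMBER OF HITS REPORTED, NOT TO THE TOTAL LENGTH OF ALL GENOMES.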
 SO I'VE JUST TALKED ABOUT HOW
 SUBLINEAR TIME ALGORITHMS THAT
 SCALE WITH NONREDUNDANT DATA
 RATHER THAN THE FULL SET CAN
 HELP US MANAGE THE ENORMOUS
 GROWTH IN BIOLOGICAL DATA.
 SO NOW WHAT I'M GOING TO DO IS
 TALK ABOUT HOW WE CAN GAIN
 MEDICAL INSIGHTS FROM
 LARGE SCALE DATA.
 THIS IS IN PRESS, UNDER THE
 EMBARGO POLICY.
 SO IN THE OLD DAYS, IF YOU WERE
 INTERESTED IN SOME DISEASE, SAY
 BREAST CANCER, YOU WOULD MAP
 THE GENE EXPRESSION PROFILES
 FOR A VARIETY OF GENES, ONTO AN
 EXPRESSION ARRAY, TO LOOK FOR
 PATTERNS OF INTEREST.
 IN THE LAB NEXT DOOR, SOMEONE
 MIGHT BE DOING THE SAME THING
 FOR, LET'S SAY, COLON CANCER.
 BUT YOU WOULD HAVE NO WAY TO
 COMBINE AND INTEGRATE THE
 SEPARATE DISCOVERIES.
 ALL THIS HAS NOW CHANGED.
 WE NOW HAVE DATABASES SUCH AS
 NCBI'S GENE EXPRESSION OMNIBUS,
 WHICH PULLS TOGETHER MANY
 DISPARATE GENE EXPRESSION
 STUDIES.
 SO NOW BECAUSE COMPUTERS ARE
 MUCH FASTER, AND COST LESS, AND
 WE HAVE LOTS AND LOTS OF THESE
 GENE EXPRESSION STUDIES PUBLICLY
 AVAILABLE, WE'RE NO LONGER
 CONFINED TO THE TENS OF
 SAMPLES WE CAN GENERATE IN OUR
 OWN WET LAB.
 BUT NOW AS I'LL SHOW YOU IN
 THIS TALK WE HAVE BEEN ABLE TO
 ANALYZE THOUSANDS OF GENE
 EXPRESSION SAMPLES TO DERIVE
 NOVEL BIOLOGICAL OR MEANINGFUL
 BIOLOGICAL INSIGHTS.
 AND MORE IMPORTANTLY, MANY OF
 THESE INSIGHTS CAN ONLY BE
 GLEANED BY LOOKING AT HUNDREDS
 OF THOUSANDS OR TENS OF
 THOUSANDS OF GENE EXPRESSION
 SAMPLES AT THE SAME TIME.
 SO I JUST HAVE SHOWN YOU A PLOT
 OF THE WHOLE DATABASE.
 AS YOU SAW IT CONSISTS OF
 HUNDREDS OF THOUSANDS OF
 SAMPLES AND WAS INTRINSICALLY
 HIGH DIMENSIONAL.
 WE'RE GOING TO LOOK AT A
 SUBSET, 3,000 SAMPLES, 20,000
 GENES, PROJECTED ONTO TWO
 DIMENSIONS.
 HERE IS THE TWO DIMENSIONAL
 PLOT WITH 3,000 SAMPLES, EACH
 IS A GRAY POINT, WITH COLORS
 WHICH I'LL SPEAK ABOUT IN A
 MOMENT.
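FOR READERS WHO WANT TO SEE WHAT PROJECTING SAMPLES ONTO TWO DIMENSIONS INVOLVES, HERE IS A SMALL PURE-PYTHON SKETCH OF PRINCIPAL COMPONENT ANALYSIS BY POWER ITERATION WITH DEFLATION. THE TALK MENTIONS A TWO-DIMENSIONAL PCA BUT GIVES NO IMPLEMENTATION DETAILS, SO THE FUNCTION NAMES, THE ITERATION COUNT, AND THE FIXED RANDOM SEED ARE ALL ASSUMPTIONS. FOR 20,000 GENES ONE WOULD USE A NUMERICAL LIBRARY RATHER THAN THIS TOY.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def center(rows):
    """Subtract each column's mean so every feature has mean zero."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in rows]


def covariance(centered):
    """Sample covariance matrix (features x features) of centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
            for a in range(d)]


def top_component(matrix, iterations=500, seed=0):
    """Power iteration: (largest eigenvalue, unit eigenvector) of a symmetric PSD matrix."""
    rng = random.Random(seed)
    v = [rng.uniform(0.5, 1.0) for _ in matrix]
    start_norm = math.sqrt(dot(v, v))
    v = [x / start_norm for x in v]
    for _ in range(iterations):
        w = [dot(row, v) for row in matrix]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = dot(v, [dot(row, v) for row in matrix])
    return eigenvalue, v


def project_2d(samples):
    """Return (PC1, PC2) coordinates for every sample (rows = samples, columns = genes)."""
    centered = center(samples)
    cov = covariance(centered)
    lam1, pc1 = top_component(cov)
    d = len(cov)
    # Deflation: remove the first component so the next power iteration finds the second.
    deflated = [[cov[a][b] - lam1 * pc1[a] * pc1[b] for b in range(d)] for a in range(d)]
    _, pc2 = top_component(deflated)
    return [(dot(row, pc1), dot(row, pc2)) for row in centered]
```

EACH SAMPLE ENDS UP AS ONE POINT IN THE PLANE, WHICH IS THE KIND OF PLOT DESCRIBED ABOVE. THE SIGN OF EACH AXIS IS ARBITRARY.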
 NOW, AMAZINGLY, ACROSS ALL
 THESE SAMPLES, WE CAN LEARN
 SOME REALLY INTERESTING THINGS.
 SO THE FIRST THING WE LEARNED
 IS THAT TISSUES OF SIMILAR
 TYPES LOCALIZE ON THIS
 LANDSCAPE.
 SO AS YOU CAN SEE, THERE ARE
 VERY CLEAR BLOOD, BRAIN,
 EPITHELIAL CLUSTERS HERE.
 IN FACT, WE CAN EVEN SEE
 SOMETHING MORE.
 WE GET THAT MORE SPECIFIC TYPES
 CO-LOCALIZE.
 SO IF WE JUST TAKE ONE LEVEL
 DOWN THE EPITHELIAL CLUSTER AND
 ITS SAMPLES, AND WE PROJECT
 THEM ONTO A TWO DIMENSIONAL
 PCA, WE GET THAT GASTROSAMPLES,
 AND THOSE ASSOCIATED WITH
 REPRODUCTIVE HORMONES
 CO-LOCALIZE AND WE HAVE MANY
 EXAMPLES WHERE WE CAN GO
 FURTHER DOWN THE HIERARCHY AND
 SEE SAMPLES FROM SIMILAR TISSUE
 TYPES CO-LOCALIZE.
 OKAY.
 SO THE INTERESTING THING IS IF
 YOU LOOK AT CANCER SAMPLES THEY
 LIE IN THE SAME VICINITIES AS
 NONCANCEROUS COUNTERPARTS BUT
 MORE SPREAD OUT ON THE
 LANDSCAPE.
 SO OUR OVERALL GOAL IS TO
 LEVERAGE THE STRUCTURE IN ORDER
 TO MAP THE TRANSCRIPTOMIC
 LANDSCAPE.
 SO TO DO THIS, WE NEEDED A
 UNIFIED APPROACH WHERE WE COULD
 MAP SAMPLES INTO THEIR
 CORRESPONDING BIOMEDICAL
 PHENOTYPES.
 LET'S SAY LUNG TISSUE OR DUCTAL
 BREAST TISSUE.
 AND FOR THAT WE CONSTRUCTED A
 CURATED MACHINE READABLE
 DATABASE, THAT ALLOWED US TO
 MAP A GIVEN GENE EXPRESSION
 SAMPLE TO ITS BIOMEDICAL
 PHENOTYPES.
 WE USE THE NLM UNIFIED MEDICAL LANGUAGE
 SYSTEM, AND THAT MAPS GENE
 EXPRESSION SAMPLES UP THE
 HIERARCHY.
 SO HAVING SUCH A DATA STRUCTURE
 WHERE WE CAN QUICKLY RETRIEVE
 GENE EXPRESSION SAMPLES ALLOWED
 US TO BE ABLE TO DO A
 MACROSCOPIC ANALYSIS OF A LARGE
 AMOUNT OF DATA.
 NOW THAT WE HAVE THIS DATA
 STRUCTURE WHICH MAPS GENE
 EXPRESSION SAMPLES ONTO THEIR
 BIOMEDICAL PHENOTYPES, WHAT CAN
 WE DO WITH IT?
 ONE THING WE'VE DONE, HERE IS
 OUR DATA STRUCTURE.
 WE TAKE NEW GENE EXPRESSION
 SAMPLES AND WE QUANTIFY HOW
 THEY MAP ONTO OUR
 TRANSCRIPTOMIC LANDSCAPE.
 SO TO DO THIS, WE DEVELOPED A
 CONCEPT ENRICHMENT SCORE BASED
 ON KOLMOGOROV-SMIRNOV STATISTICS OVER THE
 CONCEPT DATABASE.
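THE TALK DOES NOT SPELL OUT HOW THE ENRICHMENT SCORE IS BUILT, ONLY THAT IT RESTS ON A SMIRNOV-TYPE STATISTIC. ASSUMING THAT MEANS THE TWO-SAMPLE KOLMOGOROV-SMIRNOV STATISTIC, HERE IS A MINIMAL PYTHON SKETCH OF THAT BUILDING BLOCK ALONE. THE FUNCTION NAME `ks_statistic` IS HYPOTHETICAL, AND THIS IS NOT THE SCORING FUNCTION USED IN THE WORK DESCRIBED.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest vertical gap between the empirical cumulative
    distribution functions of the two samples (0.0 = identical, 1.0 = disjoint).
    """
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    largest_gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x, then compare the CDFs at x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        largest_gap = max(largest_gap, abs(i / len(a) - j / len(b)))
    return largest_gap
```

ONE WAY TO USE SUCH A STATISTIC, AGAIN AS AN ASSUMPTION, IS TO COMPARE A MEASUREMENT IN SAMPLES CARRYING A CONCEPT LABEL AGAINST THE SAME MEASUREMENT IN ALL OTHER SAMPLES: A LARGE GAP SUGGESTS THE LABEL SEPARATES THE TWO GROUPS.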
 SO THIS STATISTIC ALLOWS US TO
 ANSWER THE QUESTION GIVEN A NEW
 GENE EXPRESSION SAMPLE CAN WE
 ACCURATELY LABEL IT GIVEN THE
 OTHER SAMPLES IN THE DATABASE
 AND THEIR LABELS?
 AND IN FACT OUR ABILITY TO
 CORRECTLY LABEL IT IS QUITE
 STRONG.
 WHEN WE TESTED THIS IN
 LEAVE-ONE-SAMPLE-OUT CROSS VALIDATION,
 THE AVERAGE ACCURACY WAS 92.8%
 AS MEASURED BY THE AREA UNDER
 THE CURVE OVER THE 120
 CONCEPTS IN THE DATABASE.
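FOR REFERENCE, THE AREA UNDER THE CURVE FOR ONE CONCEPT CAN BE COMPUTED AS THE FRACTION OF (LABELED, UNLABELED) SAMPLE PAIRS THAT THE SCORE RANKS CORRECTLY. THE SKETCH BELOW SHOWS THAT CALCULATION IN PYTHON. THE FUNCTION NAME AND THE EXAMPLE NUMBERS ARE MADE UP FOR ILLUSTRATION, AND IN A LEAVE-ONE-OUT SETTING EACH SCORE WOULD BE COMPUTED WITH THAT SAMPLE HELD OUT OF THE DATABASE.

```python
def area_under_curve(scores, labels):
    """AUC as the probability that a positive sample outscores a negative one.

    scores: one number per sample; labels: truthy for samples that carry the concept.
    Ties count as half a correct ranking.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical scores for four samples, two of which carry the concept.
    print(area_under_curve([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```

AN AVERAGE OVER CONCEPTS IS THEN JUST THE MEAN OF THESE PER-CONCEPT VALUES.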
 OUR ABILITY TO PLACE
 NEW GENE EXPRESSION
 SAMPLES ON THIS LANDSCAPE IS
 STRONG; WE CAN DO THIS WITH CONFIDENCE.
 SO WE'VE DEVELOPED A WEB
 RESOURCE BASED ON THIS,
 CONCORDIA, THAT TAKES AN INPUT
 OF EXPRESSION DATA AND RETURNS
 A RANK ORDERED LIST OF THE
 CONCEPTS MOST ASSOCIATED WITH
 IT, AND IT ALSO RETURNS A PLOT
 OF WHERE THE NEW SAMPLE, WHICH
 COMES FROM THE BRAIN,
 FALLS ON THIS LANDSCAPE.
 SO THE SAMPLE WE'RE TRYING TO
 PLACE IS IN BLUE, AND WE'VE
 LABELED THE
 OTHER BRAIN SAMPLES IN THE
 DATABASE IN ORANGE.
 IT'S IN THE MIDDLE OF THE BRAIN
 RANGE.
 THIS IS OUR TRANSCRIPTOMIC
 LANDSCAPE, THE ONE GENERATED
 FROM 3,000 SAMPLES.
 SO AS YOU MIGHT IMAGINE, HAVING
 THE FULL TRANSCRIPTOMIC
 LANDSCAPE CAN BE HELPFUL IN
 DIAGNOSIS.
 JUST BECAUSE CANCER IS IN THE
 BRAIN DOESN'T MEAN IT
 ORIGINATED IN THE BRAIN.
 KNOWING THE ORIGIN CAN BE
 HELPFUL.
 BY BEING ABLE TO PLACE A NEW
 EXPRESSION SAMPLE ON THIS
 TRANSCRIPTOMIC LANDSCAPE WE'RE
 ABLE TO DO SOMETHING REALLY
 IMPORTANT.
 AND THIS IS BECAUSE IN OUR FRAMEWORK
 SAMPLES TEND TO LOOK MORE LIKE
 THEIR TISSUE OF ORIGIN THAN
 THEY LOOK LIKE THE TISSUE
 THEY METASTASIZE TO.
 THESE ARE LUNG CANCER
 METASTASES IN ORANGE, AND THEY
 SCORE FOR THE LUNGS, THEIR
 TISSUE OF ORIGIN, MUCH MORE
 HIGHLY THAN FOR THE BRAIN.
 HERE WE HAVE ANOTHER EXAMPLE,
 WHERE WE HAVE BREAST CANCER
 METASTASES. THEY LOOK MORE
 LIKE BREAST, THE TISSUE OF ORIGIN,
 THAN LUNG, CLOSE BY, OR BONE
 OR BRAIN, AND CONCEPT ENRICHMENT
 SCORES ARE HIGHER FOR BREAST.
 WHILE THESE ARE TWO EXAMPLES,
 WE SEE SIMILAR RESULTS ACROSS A
 VARIETY OF CANCERS.
 SO NOT ONLY CAN WE IDENTIFY THE
 TISSUE OF ORIGIN FOR
 METASTASES, WE CAN IDENTIFY
 WHICH GENES ARE MOST ASSOCIATED
 WITH THEM.
 SO WHAT WE WANT TO DO IS, GIVEN
 A PARTICULAR BIOMEDICAL
 PHENOTYPE, IDENTIFY MARKER
 GENES, FROM AN ENTIRELY
 DIFFERENT DIMENSION OF THE
 TRANSCRIPTOMIC LANDSCAPE.
 WHAT WE'RE ASKING IS -- WE
 WANT TO PINPOINT THE GENES THAT
 ARE MOST RELATED TO A
 PARTICULAR PHENOTYPE AND NOT,
 LET'S SAY, RELATED TO A MORE
 GENERAL PHENOTYPE LIKE CANCER,
 SUCH AS GENES INVOLVED IN CELL
 CYCLE AND CELL ADHESION.
 AND IN FACT, WE'VE DEVELOPED AN
 APPROACH FOR IDENTIFYING MARKER
 GENES WHICH I'M NOT GOING TO
 GET INTO, BUT WE BASICALLY USE A
 FINITE IMPULSE RESPONSE FILTER
 OVER EACH PHENOTYPE, WHICH ALLOWS US
 TO IDENTIFY THE MARKER GENES
 THAT ARE ENRICHED FOR EACH
 PARTICULAR BIOMEDICAL
 PHENOTYPE.
 SO IN SO DOING, FOR EXAMPLE,
 WE'RE ABLE TO FIND ONES MORE
 PARTICULAR TO BREAST CARCINOMA.
 THIS BRINGS UP A MORE GENERAL
 STUDY WHICH ANSWERS WHAT MY
 COLLABORATOR CALLS THE
 INCIDENTALOME.
 WE LOOKED AT WHAT THE MARKER
 GENES WERE FOR CARCINOMA AND 13
 SUBSETS AND FOUND A QUARTER OF
 THE MARKER GENES HAD HIGHER
 MARKER P VALUES FOR CARCINOMA
 THAN THEY DID FOR THE MORE
 PARTICULAR CONCEPTS HERE.
 AND THIS IS IMPORTANT WHEN
 YOU'RE DESIGNING CLINICAL TESTS
 BASED ON MARKER GENES.
 YOU DON'T WANT TO BE USING THE
 GENERAL CANCER ONES FOR, LET'S
 SAY, LOBULAR BREAST CARCINOMA.
 SO WE RAN CONCORDIA AND OUR
 CONCEPT ENRICHMENT SCORE TO
 IDENTIFY MARKER GENES ACROSS
 BREAST CANCER GENES AND WE
 FOUND THAT THERE WERE 74 THAT
 WERE HIGHLY ENRICHED FOR BEING
 UNIQUE TO BREAST CANCER.
 THREE INTERESTING ONES ARE
 LISTED HERE, WHICH SOME OF YOU
 MAY BE FAMILIAR WITH, BUT THEY
 WERE EXTREMELY HIGH SCORING AND
 ARE KNOWN TO BE ASSOCIATED WITH
 BREAST CANCER.
 AND WHEN WE LOOKED AT THE GO
 ENRICHMENT, MEANING THE
 FUNCTIONAL ENRICHMENT FOR THE
 DIFFERENT CONCEPTS, WE SAW THAT
 THEY DID NOT HAVE THE COMMON
 CANCER GO TERMS,
 SUCH AS CELL CYCLE AND CELL
 ADHESION, BUT HAD ONES PARTICULAR
 TO BREAST CANCER, AND IN
 ADDITION, WE FOUND ONES RELATED
 TO CARBOHYDRATE AND LIPID
 METABOLISM.
 IT'S KNOWN THAT WOMEN WITH TYPE
 2 DIABETES MAY HAVE HIGHER
 SUSCEPTIBILITY TO BREAST CANCER
 SO THIS WAS NOT SURPRISING.
 SO WHAT WE WOULD LIKE TO BE
 ABLE TO DO IS USE THESE
 DATABASES AND SYSTEM TO DEVELOP
 DATA MINING ALGORITHMS FROM
 WHICH WE CAN UNDERSTAND THE
 MACROSCOPIC SIGNALS IN THE
 DATA.
 SO IN ONE EXAMPLE HERE, WE TRY
 TO DO THIS FOR STEM-CELL-LIKE
 GENES.
 AND SO WHAT WE DID HERE WAS WE
 LOOKED AT -- WE MADE A WHOLE
 NEW PCA, THIS TIME OVER 200
 GENES, NOT OUR 20,000 GENES
 THAT WE ORIGINALLY STARTED
 WITH.
 AND THESE 200 GENES WERE
 IDENTIFIED AS BEING THE HIGHEST
 SCORING MARKER GENES FOR STEM
 CELL LIKENESS.
 SO WE STILL HAVE OUR 3,000
 SAMPLES.
 BUT WE'VE REDUCED TO A
 200-DIMENSIONAL SPACE AS OPPOSED TO
 A 20,000-GENE-DIMENSIONAL SPACE.
 AND THIS IS THE MAP THAT WE
 GET, AND THEN WE CAN ASK
 OURSELVES, WHERE DO NORMAL GENE
 EXPRESSION SAMPLES, MALIGNANT
 SAMPLES AND STEM CELL-LIKE ONES
 LIE ON THE LANDSCAPE?
 THE STRIKING THING IS WE FIND
 THAT MALIGNANT TUMOR SAMPLES,
 SUCH AS HERE FOR BLOOD,
 LIKE LEUKEMIA, LIE BETWEEN
 NORMAL AND STEM CELL-LIKE ONES.
 WE FIND THAT MALIGNANT TUMOR
 SAMPLES RETAIN SOME
 CHARACTERISTICS CLOSE TO THE
 TISSUE OF ORIGIN BUT ADOPT STEM
 CELL-LIKE PROGRAMMING.
 PEOPLE SUGGESTED THIS IN
 STUDIES BUT HERE WE'RE FINDING
 IT IN TERMS OF ANALYZING
 BLINDLY A MASSIVE AMOUNT OF
 DATA.
 SO IN THE NEXT FEW SLIDES, THE
 RED AREA WILL BE THE NORMAL
 TISSUE SAMPLES, THE GREEN WILL
 BE THE MALIGNANT ONES, AND THEN
 THERE WILL BE THE PLURIPOTENT
 STEM CELLS AND, IN BLUE, MESENCHYMAL.
 THIS IS BLOOD.
 WE GET A SIMILAR PATTERN FOR
 COLON, WITH NORMAL ONES NEAR
 NORMAL, AND THE RED AND
 GREEN-SHADED AREAS, NOW
 CORRESPONDING TO COLON INSTEAD
 OF BLOOD, SHIFTED, WHICH WE
 WOULD EXPECT. THE UNDIFFERENTIATED
 STEM CELLS REMAIN WHERE THEY
 ARE, WHICH WE ALSO HOPED FOR,
 AND WE'RE SHADING THEM ON EACH OF THE
 SLIDES.
 WE GET A SIMILAR PATTERN FOR
 BREAST TISSUE SAMPLES AND A
 SIMILAR PATTERN FOR PROSTATE
 SAMPLES, AND TAKEN ALL TOGETHER,
 WITH THE ADDITION OF BRAIN, WE
 CAN SEE AN OVERALL SUCH PATTERN
 WHERE MALIGNANT SAMPLES ARE
 BETWEEN STEM CELL-LIKE AND NORMAL.
 IF YOU LOOK AT PC 2, IT TURNS OUT THE
 SHADING OF THE RELATIVE TISSUES
 ACTUALLY REFLECTS THEIR
 PLACEMENT ON THE ORIGINAL WHOLE
 TRANSCRIPTOMIC LANDSCAPE OVER
 THE 20,000 GENES THAT I SHOWED
 YOU BEFORE.
 THESE ARE ONLY THE 200 STEM
 CELL-RELATED GENES,
 RECAPITULATING THE STEM CELLS.
 WE WOULD LIKE FOR THIS TO HAVE
 CLINICAL APPLICATIONS, AND
 WE'VE SHOWN THAT WE CAN
 ACTUALLY SHED SOME INSIGHT AS
 TO WHERE THE PRIMARY SITE IS
 FOR A METASTASIS. WE HOPE THIS
 APPROACH PROVIDES A COMPLEMENT
 TO OTHER METHODS.
 WE CAN IDENTIFY MARKER GENES
 SPECIFIC TO A DISEASE, IN
 PARTICULAR BREAST CANCER, AND
 WE'RE HOPING THAT THESE KINDS OF
 METHODS MAY BE HELPFUL IN, YOU
 KNOW, CLINICAL TESTS SUCH AS
 MAMMAPRINT IN THE FUTURE.
 AND IN PRELIMINARY STUDIES,
 WE'VE ALSO BEEN ABLE TO SHOW
 THAT TUMOR GRADE IS CORRELATED
 WITH A STEM CELL LANDSCAPE,
 THAT I JUST SHOWED YOU.
 AND HOPEFULLY IN THE LONG RUN,
 THIS WILL BE HELPFUL IN
 DISAMBIGUATING MID-GRADE
 TUMORS, WHICH ARE SO DIFFICULT
 TO TREAT.
 OOPS, SORRY.
 I WENT -- OKAY.
 SO WE'VE SEEN THAT BY
 SYNTHESIZING A LARGE
 EXPRESSION DATABASE WE HAVE
 INSIGHT WE WOULD NOT BE
 ABLE TO GET FROM ONE PARTICULAR
 MEMBER OF THE DATABASE.
 NOW WE'RE GOING TO LOOK AT HOW
 BY USING NETWORK INFORMATION
 WE'RE GOING TO BE ABLE TO DO
 CROSS SPECIES INFERENCE WHICH
 WE COULD NOT GET FROM SEQUENCE
 DATA ALONE.
 IN PARTICULAR, WE'RE GOING TO
 FOCUS ON PROTEIN-PROTEIN
 INTERACTION NETWORKS.
 SO THESE ARE THE SPECIES FOR
 WHICH WE HAVE THE MOST PPI
 DATA.
 AS YOU CAN SEE, THESE ARE THE
 NUMBER OF PROTEINS IN EACH OF
 THE SPECIES.
 AND THESE ARE THE CURRENTLY
 KNOWN NUMBERS OF INTERACTIONS.
 OF COURSE FOR YEAST WE HAVE A
 LOT OF KNOWN INTERACTIONS,
 WHEREAS FOR MOUSE WE HAVE
 VERY FEW.
 AND HERE IS -- IN THE PAST, THE
 WAY WE WOULD MODEL PPIs, WE
 WOULD TAKE A VERY LOW
 THROUGH-PUT STRUCTURE-BASED
 APPROACH WHERE AT GREAT EXPENSE
 AND WITH LARGE AMOUNTS OF TIME WE
 WOULD BE ABLE TO ANALYZE THE
 STRUCTURE AND CHEMISTRY OF A
 PARTICULAR PROTEIN COMPLEX.
 BUT NOW OVER THE LAST DECADE OR
 SO, THERE'S BEEN A HIGH
 THROUGH-PUT NETWORK-BASED
 APPROACH EMERGING WHERE WE
 MODEL NETWORKS AT LOWER
 RESOLUTION BUT WE COME UP WITH
 A NETWORK WHICH COVERS THE
 ENTIRE SPECIES, OR AT LEAST
 ATTEMPTS TO COVER THE ENTIRE
 SPECIES.
 EACH PROTEIN IS A VERTEX AND
 EACH EDGE REPRESENTS AN
 INTERACTION BETWEEN TWO
 PROTEINS.
 THIS LOW RESOLUTION APPROACH ALLOWS
 US TO COME UP WITH INSIGHTS
 THAT WE COULDN'T NECESSARILY
 GET FROM THE LOW THROUGH-PUT,
 MORE DETAILED
 STRUCTURAL APPROACH.
 SO HERE IS THE YEAST PPI
 NETWORK, THE EARLIEST ONE.
 AND IN SUCH A NETWORK, EVERY
 EDGE IS DETERMINED BY SOME HIGH
 THROUGH-PUT TECHNIQUE.
 SO IN THIS CASE THIS EDGE IS
 DETERMINED BY YEAST 2 HYBRID.
 FORTUNATELY, WE HAVE MORE
 TECHNIQUES COMING ALONG, AND
 MASS SPECTROMETRY IS A GOOD ONE,
 GIVING US NEW INTERACTION
 DATA.
 THERE'S A PROBLEM WITH THIS
 DATA.
 AS YOU MAY HAVE GUESSED FROM MY
 PREVIOUS SLIDE WHERE I TALKED
 ABOUT THE NUMBER OF PROTEINS
 AND INTERACTIONS, COVERAGE IS
 NOT SO GREAT.
 THERE ARE PROTEINS FOR WHICH WE
 HAVE NO INFORMATION AT ALL AS
 TO ANY OF THEIR INTERACTIONS,
 AND THERE ARE LOTS OF FALSE
 POSITIVES.
 AND EACH OF THESE HIGH
 THROUGH-PUT TECHNIQUES GIVES A
 CONFIDENCE IN EACH OF THESE
 EDGES, AND WE WOULD LIKE TO
 ASSIGN BETTER CONFIDENCES,
 THAT'S A NICE OPEN PROBLEM THAT
 WE'RE WORKING ON.
 SO THOSE PROBLEMS LEND
 THEMSELVES TO COMBINATORIAL
 ALGORITHMS.
 TRADITIONAL ONES FROM THE PAST
 DON'T APPLY DIRECTLY WITH LOTS
 OF ERRORS IN THE DATA.
 ONE THING WE CAN DO WITH THE
 NETWORK DATA IS WE CAN COMPARE
 IT ACROSS SPECIES.
 SO THIS IS WHAT'S KNOWN AS
 COMPARATIVE GENOMICS, AND I'M
 SURE YOU'RE VERY FAMILIAR WITH
 IT IN TERMS OF SEQUENCES, WHERE
 WE LOOK AT BIOLOGICAL DATA
 ACROSS SPECIES, WITH THE HOPE
 THAT AREAS OF HIGH CONSERVATION
 CORRESPOND TO FUNCTIONAL PARTS
 OR MODELS OF THE GENOMES.
 WHAT I'M GOING TO SHOW YOU HERE
 IS THAT BY ONLY LOOKING AT PROTEIN
 SEQUENCE INFORMATION WE CAN'T
 GET AS GOOD CORRESPONDENCES OF
 GENOMES ACROSS SPECIES AS WE
 CAN BY USING SEQUENCE
 INFORMATION WITH A NETWORK
 PERSPECTIVE.
 AND MY GROUP -- I JUST GRABBED
 THIS PICTURE FROM SOMEWHERE --
 DID SOME OF THE EARLIEST
 WORK COMPARING MOUSE GENOMES
 AND INFORMATION ACROSS GENOMES,
 AND TURNED THIS TO COMPARING
 NETWORKS.
 ONE REASON WE WANT TO COMPARE
 NETWORKS IS WE WANT TO BE ABLE
 TO TRANSFER ANNOTATIONS
 FROM ONE
 SPECIES TO ANOTHER.
 YOU CAN KNOCK OUT A GENE IN
 MOUSE AND LEARN FROM THAT, BUT YOU
 WOULDN'T WANT TO DO THAT IN
 HUMANS.
 WE NEED A MECHANISM TO TRANSFER
 INFORMATION INTO HUMANS.
 AND ONE TERM FOR THIS IS
 ORTHOLOGY, THE CORRESPONDENCE
 BETWEEN GENES, OR PROTEINS, USED
 INTERCHANGEABLY HERE, ACROSS
 SPECIES.
 BUT WHAT WE WANT IS
 FUNCTIONAL ORTHOLOGY, AND WHAT
 THIS MEANS IS WE WANT PROTEINS
 WHICH ACTUALLY PERFORM THE
 SAME FUNCTION ACROSS SPECIES.
 AND THIS IS A VERY IMPORTANT
 PROBLEM.
 I'M WORKING WITH BIOLOGISTS WHO
 ARE FRUSTRATED WITH THE
 SEQUENCE-ONLY BASED METHODS
 TRADITIONALLY USED FOR THIS.
 AND THEY WANT TO USE OTHER
 INFORMATION TO GET BETTER
 CORRESPONDENCES BECAUSE THE
 SEQUENCE-BASED METHODS TEND TO
 GET LOTS OF FALSE POSITIVES,
 AND THEN THEY ARE NOT GETTING
 CORRECT ANSWERS.
 SO AS I SAID, I'M GOING TO SHOW
 YOU THAT BY USING SEQUENCE AND
 NETWORK INFORMATION, WE CAN GET
 MUCH BETTER MAPPINGS BETWEEN
 GENES ACROSS SPECIES.
 SO THE PROBLEM WE HAVE IS GIVEN
 TWO PROTEIN-PROTEIN INTERACTION
 NETWORKS WE WANT TO FIND FOR A
 NETWORK SOMETHING THAT HAS
 COMPARATIVE STRUCTURE IN THE
 OTHER NETWORK.
 SO FOR ANY PARTICULAR PAIR OF
 NODES WE WANT TO SCORE HOW
 SIMILAR THEY ARE BASED ON
 SEQUENCE AND NETWORK.
 SO FOR A GIVEN NODE IN FLY
 WE WANT TO KNOW WHICH NODES
 HERE IN YEAST HAVE SIMILAR
 FUNCTIONS.
 SO THE WAY THAT WE DO THIS IS
 WE MATCH NEIGHBORHOOD
 TOPOLOGIES.
 THE HEART OF THE ALGORITHM IS
 COMPUTING THE SIMILARITY
 SCORES.
 WE'RE GOING TO GET A HIGH SCORE
 IF THE TWO NODES ARE A GOOD
 MATCH, IF I AND J ARE A GOOD
 MATCH.
 THE INTUITION WE PURSUE: I AND
 J ARE A GOOD MATCH IF THEIR
 SEQUENCES ALIGN AND IF THEIR
 NEIGHBORS ARE A GOOD MATCH.
 IN THE PAST, THIS QUOTE/UNQUOTE
 FUNCTIONAL SIMILARITY SCORE,
 RIJ, WAS BASED MERELY ON
 SEQUENCE SIMILARITY.
 AS I SAID, THAT LEADS TO A LOT
 OF FALSE POSITIVES.
 SO WHAT WE'RE GOING TO DO IS
 ADD A NETWORK COMPONENT TO
 THIS.
 NIJ, WHICH IS A SIMILARITY
 SCORE BETWEEN THE NEIGHBORS OF
 NODES I AND J.
 THEN IT'S GOING TO BE A CONVEX
 COMBINATION OF SEQUENCE AND
 NETWORK SIMILARITY SCORE.
 NOTICE THAT ALPHA IS USER
 DEFINED, ALTHOUGH WE RECOMMEND A
 SETTING. IF ALPHA IS ONE, YOU
 HAVE NO SEQUENCE DATA.
 IF ALPHA IS ZERO YOU HAVE NO
 NETWORK DATA IN THIS EQUATION.
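WRITTEN OUT, THE CONVEX COMBINATION DESCRIBED ABOVE WOULD LOOK LIKE THE FOLLOWING. THE SYMBOL FOR THE SEQUENCE SIMILARITY TERM IS NOT NAMED IN THE TALK, SO `S_{ij}` IS AN ASSUMED NAME; `R_{ij}`, `N_{ij}` AND ALPHA ARE THE SPEAKER'S. THE EQUATION ON THE SLIDE ITSELF IS NOT AVAILABLE, SO THIS IS A RECONSTRUCTION FROM THE SPOKEN DESCRIPTION.

```latex
R_{ij} \;=\; \alpha \, N_{ij} \;+\; (1 - \alpha) \, S_{ij},
\qquad 0 \le \alpha \le 1
```

WITH ALPHA EQUAL TO ONE ONLY THE NETWORK TERM REMAINS, AND WITH ALPHA EQUAL TO ZERO ONLY THE SEQUENCE TERM REMAINS, WHICH MATCHES THE TWO LIMITING CASES JUST MENTIONED.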
 SO IN SUM, THE ALGORITHM TAKES
 TWO NETWORKS, BLUE AND GREEN,
 AND PRODUCES A MAPPING OF ALL
 THE NODES IN THE BLUE AND GREEN
 NETWORKS. THIS MAPPING IS JUST
 THIS MATRIX R, THE SIMILARITY
 SCORES.
 NOTICE THIS MATRIX IS PRETTY
 EMPTY.
 THAT'S BECAUSE LOTS OF THE
 PAIR-WISE SIMILARITY SCORES ARE
 ZERO.
 AS I SAID, WHAT WE WANT TO
 COMPUTE IS THE SEQUENCE
 SIMILARITY SCORE, PLUS THE
 NEIGHBORHOOD SIMILARITY SCORES.
 WE WANT THE NEIGHBORS OF
 SIMILAR NODES TO ALSO BE SIMILAR.
 THIS IS MEASURED BY A WEIGHTED
 SUM OVER THE SIMILARITY SCORES
 OF THE NEIGHBORS OF I AND J.
 WHICH ARE RUV.
 WE DON'T KNOW THE RUV VALUES OF
 THE NEIGHBORS, SIMILARITY
 SCORES OF THE NEIGHBORS, UNTIL
 WE KNOW, UNTIL WE COMPUTED
 THEM.
 SO IT TURNS OUT THAT IT'S NOT
 SO MUCH A PROBLEM BECAUSE THIS
 IS A LINEAR SYSTEM OF
 EQUATIONS THAT WE CAN JUST
 SOLVE AS AN EIGENVALUE PROBLEM.
 IN FACT WE CAN SOLVE IT
 BECAUSE THE R MATRICES ARE SO
 SPARSE.
 THIS LENDS ITSELF TO A RANDOM
 WALK INTERPRETATION. WE HAVE
 A BLUE GRAPH G1 AND GREEN
 GRAPH G2, AND OUR PROBLEM IS
 JUST TAKING A RANDOM WALK ON
 THE TENSOR PRODUCT GRAPH OF G1
 AND G2 SUCH THAT THE
 TRANSITION PROBABILITY OUT OF
 ANY GIVEN PRODUCT NODE UV IS
 THE SAME, IT'S EQUIVALENT, FOR
 ALL THE OUT EDGES OF THAT NODE.
 THAT IS PRECISELY THE TERM IN OUR
 NETWORK SIMILARITY SCORE.
 HERE WE'RE LOOKING AT A SIMPLER
 CASE, WITHOUT THE SEQUENCE
 INFORMATION IN
 THE SIMILARITY SCORE, FOR
 EASE OF COMPUTATION.
 IT TURNS OUT THE STATIONARY
 DISTRIBUTION OF THE RANDOM WALK
 IS THE EIGENVECTOR FOR THE LARGEST
 EIGENVALUE OF
 THE MATRIX, N SQUARED BY N
 SQUARED IN SIZE, THAT RESULTS FROM
 THE TRANSITION PROBABILITIES IN
 THE MATRIX HERE.
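THE FOLLOWING PYTHON SKETCH SHOWS ONE WAY TO COMPUTE SUCH SCORES BY POWER ITERATION WITHOUT EVER BUILDING THE N SQUARED BY N SQUARED MATRIX. IT FOLLOWS THE DESCRIPTION ABOVE: EACH NEIGHBOR PAIR (U, V) SPREADS ITS SCORE EQUALLY OVER ITS OUT EDGES, SO ITS CONTRIBUTION IS DIVIDED BY THE PRODUCT OF THE TWO DEGREES, AND ALPHA MIXES THAT NETWORK TERM WITH A SEQUENCE TERM. THE FUNCTION NAME, THE UNIFORM STARTING VECTOR, THE ITERATION COUNT, AND THE TOY GRAPHS ARE ASSUMPTIONS, AND THE AUTHORS' OWN IMPLEMENTATION MAY DIFFER.

```python
def pairwise_scores(graph1, graph2, sequence_scores, alpha, iterations=100):
    """Iterate R_ij = alpha * sum_{u in N(i), v in N(j)} R_uv / (|N(u)| |N(v)|)
                     + (1 - alpha) * S_ij
    graph1, graph2: dict node -> set of neighbours (undirected, no isolated nodes).
    sequence_scores: dict (i, j) -> non-negative sequence similarity; missing = 0.
    Returns a dict (i, j) -> score, normalised to sum to 1.
    """
    pairs = [(i, j) for i in graph1 for j in graph2]
    total = sum(sequence_scores.get(p, 0.0) for p in pairs)
    if total <= 0.0:
        raise ValueError("need at least one positive sequence score")
    seq = {p: sequence_scores.get(p, 0.0) / total for p in pairs}
    scores = {p: 1.0 / len(pairs) for p in pairs}  # assumed uniform start
    for _ in range(iterations):
        updated = {}
        for i, j in pairs:
            network_term = sum(
                scores[(u, v)] / (len(graph1[u]) * len(graph2[v]))
                for u in graph1[i]
                for v in graph2[j]
            )
            updated[(i, j)] = alpha * network_term + (1.0 - alpha) * seq[(i, j)]
        norm = sum(updated.values())
        scores = {p: value / norm for p, value in updated.items()}
    return scores


if __name__ == "__main__":
    # Two made-up three-node paths with uninformative (all equal) sequence scores.
    g1 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
    g2 = {"x": {"y"}, "y": {"x", "z"}, "z": {"y"}}
    seq = {(i, j): 1.0 for i in g1 for j in g2}
    result = pairwise_scores(g1, g2, seq, alpha=0.5)
    print(max(result, key=result.get))
```

IN THE TOY EXAMPLE THE SEQUENCE SCORES CARRY NO INFORMATION, SO THE TWO MIDDLE NODES ARE PAIRED PURELY BECAUSE THEIR NEIGHBORHOODS MATCH.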
 THIS MAY REMIND YOU OF AN
 ALGORITHM THAT'S OUT THERE,
 GOOGLE'S PAGERANK ALGORITHM, WHICH DOES A
 SIMILAR RANDOM WALK ON A SINGLE
 GRAPH, RATHER THAN A PRODUCT OF
 GRAPHS TO RANK WEB PAGES IN
 ORDER OF IMPORTANCE.
 SO AN EVEN HARDER PROBLEM IS
 MULTIPLE NETWORK ALIGNMENT.
 AND THE REASON THIS IS SO HARD
 IS THE SAME REASON AS FOR
 MULTIPLE SEQUENCE ALIGNMENT, IS
 THAT THE PROBLEM IS EXPONENTIAL
 IN THE NUMBER OF NETWORKS.
 SO BASICALLY WE WANT TO FIND,
 GIVEN MULTIPLE NETWORKS, SOME
 CONSERVED STRUCTURE BETWEEN
 THEM.
 SO AS FOR THE CASE OF SEQUENCE
 ALIGNMENT, WE'RE GOING TO
 APPROXIMATE THIS WITH PAIR-WISE
 NETWORK ALIGNMENTS.
 THE PAIR-WISE NETWORK
 ALIGNMENTS ARE THE R
 MATRICES.
 THE ORTHOLOGS WILL BE THE
 WEIGHTED SUBGRAPHS
 FOR THIS COMPUTATION.
 SO NOTICE WHAT WE'RE ALLOWING:
 ONE SUCH GOOD ALIGNMENT WOULD
 BE ONE NODE FROM PURPLE, ONE
 FROM GREEN AND TWO FROM YELLOW,
 BECAUSE WE CAN HAVE GENE
 DUPLICATION EVENTS.
 WE WANT CROSSINGS RATHER THAN
 ONE-TO-ONE MAPPING.
 QUICKLY, THIS IS HOW THIS
 WORKS.
 WE COMPUTE A SIMILARITY GRAPH
 BETWEEN ALL PAIRS OF NETWORKS.
 AND THEN WHAT WE DO IS WE WANT
 TO FIND STRONGLY SIMILAR
 NEIGHBORS.
 SO WE START WITH A PARTICULAR
 NODE LET'S SAY THE RED ONE HERE
 IN ARNOLD, AND WE WANT TO FIND
 STRONGLY SIMILAR NEIGHBORS TO
 THAT.
 SO THE IDEA THAT WE'RE USING
 HERE IS THAT IF MULTIPLE PAIRS
 OF NETWORKS AGREE THAT NODES, OR
 PROTEINS, ARE RELATED, THEN IN THE
 OTHER NETWORK, EVEN IF WE DON'T
 HAVE AN EDGE THERE, THEY ARE
 PROBABLY RELATED TOO,
 ALTHOUGH OF COURSE IN BIOLOGY
 THERE COULD BE EXCEPTIONS. BUT
 BASICALLY WE'RE HOPING THAT IN
 THIS CLUSTER MOST OF THE EDGES
 HAVE A HIGH SCORE.
 SO THEN WE FIND A STRONGLY
 SIMILAR NEIGHBOR TO THE RED ONE
 IN ARNOLD, AND IN FACT WHAT WE
 REALLY WANT IS A HIGHLY
 WEIGHTED SUBSET OF THAT.
 BECAUSE THAT MEANS THAT MOST OF
 THE CORRESPONDENCES ACROSS SPECIES
 AGREE THESE ARE IMPORTANT, AND
 THEY HAVE THE SAME FUNCTION.
 AND FOR THIS WE USE THE
 PAGERANK-NIBBLE ALGORITHM,
 STARTING WITH OUR RED NODE, A
 RANDOM WALK WITH A TELEPORT
 BACK TO THAT NODE.
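THE RANDOM WALK WITH A TELEPORT BACK TO THE STARTING NODE CAN BE SKETCHED IN PYTHON AS FOLLOWS. THIS SHOWS ONLY THAT WALK (OFTEN CALLED PERSONALIZED PAGERANK), COMPUTED BY PLAIN POWER ITERATION. IT IS NOT THE FULL NIBBLE PROCEDURE, WHICH USES AN APPROXIMATE LOCAL COMPUTATION AND THEN PICKS A SUBSET OF HIGH-RANKED NODES. THE FUNCTION NAME, THE TELEPORT PROBABILITY OF 0.15, AND THE EXAMPLE WEIGHTS ARE ASSUMPTIONS.

```python
def personalized_pagerank(weights, seed, teleport=0.15, iterations=200):
    """Random walk on a weighted graph that jumps back to `seed` with prob. `teleport`.

    weights: dict node -> dict neighbour -> positive edge weight.
    Returns dict node -> stationary probability (sums to 1).
    """
    rank = {node: 0.0 for node in weights}
    rank[seed] = 1.0
    for _ in range(iterations):
        updated = {node: 0.0 for node in weights}
        for node, mass in rank.items():
            out_weight = sum(weights[node].values())
            if out_weight == 0.0:
                updated[seed] += mass  # dead end: send everything back to the seed
                continue
            for neighbour, weight in weights[node].items():
                # Follow an edge with probability proportional to its weight.
                updated[neighbour] += (1.0 - teleport) * mass * weight / out_weight
            updated[seed] += teleport * mass
        rank = updated
    return rank


if __name__ == "__main__":
    # Hypothetical similarity graph: the seed is strongly tied to "a", weakly to "b".
    graph = {
        "seed": {"a": 5.0, "b": 1.0},
        "a": {"seed": 5.0},
        "b": {"seed": 1.0, "c": 1.0},
        "c": {"b": 1.0},
    }
    for node, value in sorted(personalized_pagerank(graph, "seed").items()):
        print(node, round(value, 3))
```

NODES THAT ARE STRONGLY CONNECTED TO THE STARTING NODE COLLECT MOST OF THE PROBABILITY, WHICH IS WHAT MAKES THE HIGH-RANKED SET A CANDIDATE CLUSTER.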
 THIS WILL BE DONE SOON, THE
 TECHNICAL PART.
 SO WE GET A COUPLE OF SUCH
 PAGERANK-NIBBLE-TYPE
 SUBGRAPHS, HIGHLY WEIGHTED
 SUBGRAPHS, AND THEN IF THEY
 HAVE A LOT OF EDGES IN COMMON
 WE MERGE THEM.
 AS THEY DO HERE.
 AND THEN WE REMOVE THEM FROM
 THE GRAPH ON THE NEXT SLIDE,
 HENCE NIBBLE, AND THEN REPEAT.
 SO THAT'S THE ALGORITHM THAT
 ALLOWS US TO DO MULTIPLE
 NETWORK ALIGNMENTS, AT LEAST
 APPROXIMATE IT WITH PAIR-WISE
 NETWORK ALIGNMENT.
 HOW DOES THIS DO?
 THE TROUBLE IN THIS FIELD IS
 THAT THERE'S NO GOLD STANDARD
 DATABASE FOR MEASURING
 ORTHOLOGY.
 IT'S ACTUALLY A HUGE
 PROBLEM, THERE ARE NO GOLD
 STANDARDS.
 WE CAME UP WITH OUR OWN
 MEASURE, NORMALIZED ENTROPY.
 WE SAID THINGS THAT ARE
 ORTHOLOGOUS SHOULD HAVE SIMILAR
 ENRICHMENT TERMS, SHOULD BE
 DOING SIMILAR FUNCTIONS.
 SO WE CAME UP WITH AN ENTROPY
 TERM: YOU WANT MORE NODES TO HAVE
 THE SAME FUNCTION RATHER THAN LOTS OF
 DIFFERENT FUNCTIONS. YOU WANT FEWER
 FUNCTIONS, IS WHAT I MEANT TO
 SAY.
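ONE PLAUSIBLE READING OF THAT MEASURE IS THE SHANNON ENTROPY OF THE FUNCTION LABELS INSIDE A PREDICTED ORTHOLOG CLUSTER, DIVIDED BY THE LOGARITHM OF THE NUMBER OF DISTINCT LABELS. THE PYTHON SKETCH BELOW IMPLEMENTS THAT READING. THE EXACT NORMALIZATION IS NOT GIVEN IN THE TALK, SO BOTH IT AND THE FUNCTION NAME `normalized_entropy` ARE ASSUMPTIONS, AND THE LABELS IN THE EXAMPLE ARE MADE UP.

```python
import math
from collections import Counter


def normalized_entropy(labels):
    """Entropy of the label distribution, scaled to lie between 0 and 1.

    0.0 means every member of the cluster has the same function label;
    1.0 means the labels are spread evenly over all distinct labels present.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one label")
    if len(counts) == 1:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))


if __name__ == "__main__":
    print(normalized_entropy(["transport", "transport", "transport", "kinase"]))
```

UNDER THIS READING A LOWER VALUE IS BETTER, SINCE IT MEANS THE MEMBERS OF A CLUSTER MOSTLY SHARE ONE FUNCTION.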
 BY NORMALIZED ENTROPY WE DID
 BETTER FOR ALL SPECIES AND JUST
 FOR HUMAN AND FLY.
 WE WERE ALSO ABLE TO GET GOOD
 COVERAGE ESPECIALLY FOR THREE
 OR MORE SPECIES AS YOU CAN SEE
 HERE, THE BEST RESULTS ARE
 BOLD-FACED.
 AND WE WERE ABLE TO DO BETTER --
 THERE ARE SOME OTHERS, NETWORKBLAST
 AND GRAEMLIN, AND WE
 DID BETTER THAN THOSE.
 AND THIS TAKES IN THE GENE OR
 PROTEIN I.D. OR ALL SORTS OF
 DIFFERENT TYPES OF I.D.s, AND
 IT TELLS YOU THE FIVE SPECIES
 OR SOME SUBSET,
 AND GIVES YOU A
 LOT OF OTHER INFORMATION ABOUT
 THE ORTHOLOGS AND LINKS TO
 OTHER DATABASES THAT CONTAIN
 INFORMATION ABOUT THE
 ORTHOLOGS.
 SO I'LL PUT UP ONE BIOLOGICAL
 APPLICATION WE'VE BEEN ABLE TO
 GET, USING ISOBASE.
 WE WORK WITH THE SUE LINDQUIST
 LAB, WHICH USES YEAST MODELS TO
 UNDERSTAND PARKINSON'S OR
 ALZHEIMER'S. THESE ARE GENES
 INVOLVED IN TOXICITY IN
 PARKINSON'S. SO SUE GAVE US A
 LIST OF GENES IN YEAST AND
 WANTED TO KNOW WHAT ARE THE GENES
 LIKELY HAVING THE SAME FUNCTION
 IN HUMANS. WE USED ISOBASE
 TO FIND THIS GENE HERE, AND
 MANY OTHERS THAT I'LL TELL YOU
 ABOUT IN A MINUTE, BUT IN
 PARTICULAR THIS GENE HERE WE
 FOUND TO BE ON A PATHWAY THAT
 WAS INVOLVED IN MEDIATED
 TRANSPORT.
 IT TURNS OUT WHEN YOU KNOCK --
 WHEN YOU OVEREXPRESS A PROTEIN
 WHICH IS IMPORTANT IN
 PARKINSON'S DISEASE, THAT THIS
 PATHWAY IS DISRUPTED.
 SO IT'S SOME EVIDENCE THAT THAT
 GENE IS DOING SOMETHING RELATED
 TO PARKINSON'S, AND IN FACT
 USING ISOBASE WE WERE ABLE TO
 FIND 48 HUMAN ORTHOLOGS TO HER
 YEAST COUNTERPARTS, AND 24 OF
 THEM WERE ENTIRELY NEW.
 THEY WEREN'T FOUND BY ANY OF
 THE OTHER ORTHOLOGY PREDICTORS
 OR NETWORK-BASED ONES.
 SO WE HAVE LOTS OF APPLICATIONS
 OF ISORANK AND ISOBASE.
 YOU SAW A FEW OF THESE ALREADY.
 PEOPLE HAVE ALSO USED IT FOR
 METABOLIC NETWORK ALIGNMENT,
 AND WE WERE ABLE TO DO GENETIC
 INTERACTION NETWORK ALIGNMENT,
 WE MAKE THAT AVAILABLE IN
 ISOBASE.
 SO AS I'VE TALKED ABOUT TODAY,
 WE SAW HOW BETTER ALGORITHMS
 CAN MAKE PROBLEMS MORE
 TRACTABLE, ACROSS VARIOUS
 AREAS, AND THEY CAN ALLOW US TO
 GAIN INSIGHTS THAT WE OTHERWISE
 WOULD NOT HAVE BEEN ABLE TO
 GET.
 BUT THIS IS A VERY SMALL
 FRACTION ACTUALLY OF WHAT WE
 CURRENTLY WORK ON.
 AND I'M JUST GOING TO NAME A
 FEW OF OUR RECENT SOFTWARE THAT
 WE'VE PUT OUT TO GIVE YOU AN
 IDEA OF THE OTHER THINGS WE
 WORK ON.
 WE ALSO HAVE DONE A LOT OF WORK
 IN PROTEIN STRUCTURE
 PREDICTION.
 IN FACT WE DEVELOPED THIS
 PROGRAM, MATT, FOR PROTEIN
 STRUCTURE ALIGNMENT, AND IN AN
 INDEPENDENT REVIEW ARTICLE IT
 WAS DEEMED TO BE THE BEST
 PROGRAM FOR PROTEIN STRUCTURE
 ALIGNMENT.
 WE DO ENSEMBLE MODELING, TO
 PREDICT STRUCTURE OR FOLDING
 PATHWAYS OF STRUCTURES, WE WORK
 ON AMYLOIDS, PREDICTING MUTANTS
 AND THEIR STRUCTURES, AND
 CONTINUE TO WORK ON
 COILED COILS.
 WE ALSO WORK ON PREDICTING
 NONCODING RNA STRUCTURE, AND
 LOCATIONS OF MICRORNAs IN
 SEQUENCE DATA, SO WE HAVE A
 COUPLE PROGRAMS, RNAMUTANTS,
 PREDICTING MUTATIONAL EFFECTS
 ON THE STRUCTURE OF RNA, AND A
 PAPER PREDICTING
 NONCODING RNA STRUCTURES,
 BUT TO HIGHLIGHT WE'VE DONE A
 COUPLE PIECES OF WORK ON
 MICRORNA PREDICTION, AND ONE HAS
 THE MINOTAR PROGRAM WHICH LOOKS
 FOR MICRORNA TARGETS.
 SURPRISINGLY THEY ARE MORE
 PREVALENT IN ORFS, AND MORE
 PREVALENT IN SOME SITES.
 WE FOUND LAST SUMMER IN ANOTHER
 PAPER IN GENOME RESEARCH THAT
 MICRORNA TARGETING TARGETS
 SEQUENCE REPEATS IN ORF REGIONS
 THAT WERE PREVIOUSLY NOT KNOWN
 TO BE TARGETED AND MAY SUGGEST
 ROLES IN REGULATION.
 AND WE TEAMED UP WITH DAVID
 BARTEL FOR THIS, HE DID
 EXPERIMENTS TO CONFIRM THIS IN
 HUMANS.
 WE ALSO WORK ON BIOLOGICAL
 NETWORKS.
 WE TAKE PROTEIN SEQUENCES AND
 PREDICT THEIR STRUCTURE AND WE
 WORK ON SIGNALING NETWORK
 RECONSTRUCTION, AND WE ALSO
 INTEGRATE STRUCTURE-BASED
 PREDICTIONS WITH SYSTEMS-WIDE
 SIGNALING NETWORK AND NETWORK
 ANALYSIS.
 WE'RE APPLYING COMPUTATIONAL
 TECHNIQUES TO BIOLOGICAL
 PROBLEMS.
 SO I WANT TO THANK THE PEOPLE
 OF MY GROUP, THE COMPRESSIVE
 GENOMICS WORK WAS DONE BY TWO
 OF MY GRAD STUDENTS AT THE
 TIME, MICHAEL BAYM NOW POST
 DOC AT HMS, AND THE MEDICAL
 GENOMICS WORK DONE BY NATHAN
 PALMER, PATRICK SCHMID, BOTH
 STUDENTS NOW -- NATHAN AT HMS,
 PATRICK IS GOING THERE, AND ZAK
 KOHANE, AND BIOLOGICAL NETWORK
 WORK WAS DONE WITH ISORANK
 WITH HELP, AND MICHAEL BAYM AND
 OTHERS.
 AND DANNY PARK HELPED DO THE
 ISOBASE DATABASE.
 I WANT TO THANK A
 BUNCH OF OTHERS WHO HAVE BEEN
 COLLABORATIVE AND INSTRUMENTAL
 IN THIS WORK.
 THANK YOU.
 [APPLAUSE]
 >> THANKS FOR A VERY
 STIMULATING AND BROAD RANGING
 PRESENTATION.
 THE FLOOR IS OPEN FOR
 QUESTIONS, THE MICROPHONES ARE
 IN THE AISLES.
 PLEASE USE THOSE IF YOU HAVE A
 QUESTION TO POSE.
 YES, SIR?
 >> I WONDER IF YOU WOULD
 SPECULATE WHETHER COMPARING
 NETWORKS ASSOCIATED WITH
 TOXICOLOGY OR TOXICITY IN
 ANIMALS FOR TESTING
 PHARMACEUTICALS COULD BE
 APPLIED TO SEE WHEN
 NETWORKS ARE SIMILAR, WHEN
 ANIMAL CASES WOULD PREDICT
 HUMAN TOXICITY.
 >> YOU SHOULD LOOK AT MODULE
 AND NODE CORRELATIONS BUT
 THAT'S DOABLE.
 THE PROBLEM IS IF YOU'RE
 MISSING DATA YOU DON'T KNOW
 IT'S NOT SO.
 THERE'S PROBABLY A LOT OF
 MISSING DATA.
 >> THERE'S A HUGE ISSUE OF LATE
 FAILURES IN DRUG DEVELOPMENT,
 BECAUSE TOX ISSUES WERE NOT
 IDENTIFIED AT AN EARLY STAGE.
 THIS COULD BE A VALUABLE
 APPROACH.
 >> I WOULD LOVE TO TALK TO YOU
 ABOUT THAT.
 THAT SOUNDS INTERESTING.
 >> I LIKE THE PCA PLOTS WITH
 ALL THE DIFFERENT SAMPLES AND
 HOW YOU WERE ABLE TO STRATIFY.
 YOU ONLY SHOWED THE TWO
 PRINCIPAL COMPONENTS.
 THE OTHER PART, IF YOU LOOKED
 AT INDEPENDENT COMPONENT
 ANALYSIS OR MULTIDIMENSIONAL
 SCALING, DID THAT
 GIVE YOU MORE OR LESS
 INFORMATION ON THE SAMPLES?
 >> WE DID LOOK AT MORE
 COMPONENTS.
 THEY ARE HARD TO PUT ON HERE.
 >> OF COURSE.
 >> FOR THE LEVEL THAT WE'RE
 WORKING AT RIGHT NOW WE DIDN'T
 NEED THEM BUT I -- THERE WAS
 DEFINITELY MORE DATA WHEN YOU
 WENT OUT TO A FEW MORE
 DIMENSIONS.
 I DON'T KNOW ABOUT THE
 MULTIDIMENSIONAL SCALING.
 >> GOTCHA.
 OKAY.
 >> BONNIE, IN THAT ANALYSIS
 WHERE YOU WERE DOING THE LEAVE
 ONE OUT EXPERIMENTS TO SEE IF
 THEY MAPPED TO WHERE THEY
 SHOULD BASED ON CELLULAR BASIS
 OF ORIGIN YOU SAID YOU GOT IT
 RIGHT ABOUT 92.8% OF THE TIME.
 IT WOULD BE INTERESTING TO LOOK
 AT THE ONES FOR YOU, BECAUSE
 THERE MIGHT BE INTERESTING
 BIOLOGY THERE IF A DATA SET
 DIDN'T LAND WHERE YOU EXPECT.
 DO --
 >> YOU'RE SPEAKING LIKE A TRUE
 BIOLOGIST.
 THEY WANT TO KNOW THE CASES
 WHERE COMPUTATIONAL TECHNIQUES
 DON'T WORK.
 >> EXACTLY.
 >> WE DIDN'T.
 WE WERE JUST TRYING TO
 VALIDATE.
 THAT'S A VERY GOOD POINT.
 YEAH, IT ALSO COULD BE A LOT OF
 THAT COULD BE ERRONEOUS
 MAPPING.
 >> COULD BE, RIGHT.
 >> A LOT OF THE SAMPLES ARE
 PROBABLY MISLABELED, AND MAY
 HAVE ENDED UP, THERE MAY BE
 NOISE IN THE DATA.
 OH, YEAH, YEAH, YEAH.
 WE HAD TO CURATE THIS 3,000
 SAMPLE SET.
 WE HAD TO REALLY LOOK AT IT TO
 GET OUR CURATED MACHINE
 READABLE DATABASE.
 IT WAS KIND OF A NIGHTMARE.
 NOW THAT WE HAVE THE 3,000
 WE'VE BEEN ABLE TO RUN IT
 AUTOMATICALLY TO GET A LOT MORE
 TO CHARACTERIZE TENS OF
 THOUSANDS MORE SAMPLES.
 WE DON'T ONLY HAVE 3,000 NOW.
 ANYWAY --
 >> ANOTHER QUESTION?
 >> YES.
 SO AGAIN THE PCA MAP WAS
 SIMILAR TO BARABASI'S MAP.
 CAN YOU EXPLAIN SOME OF THE
 SIMILARITIES BETWEEN THAT
 NETWORK AND WHAT YOU'VE DONE AS
 WELL?
 >> WELL, I DON'T KNOW WHICH
 BARABASI NETWORK YOU'RE TALKING
 ABOUT.
 HE HAS A LOT.
 >> BASICALLY THE DISEASE
 NETWORK WHERE YOU SHOWED --
 >> WHAT DISEASE IS RELATED --
 WE'RE NOT SHOWING DISEASE
 RELATIONSHIPS.
 WE'RE KIND OF HIGHLIGHTING
 SIMILAR TISSUES AND THEN WE'RE
 PLACING DISEASE SAMPLES ON THE
 SAMPLES, ON THOSE MAPS.
 >> YOU'RE BASING IT ON TISSUES.
 >> BASING IT ON TISSUES.
 >> I SEE, I SEE.
 >> AND WE'RE GIVING YOU THE
 PHENOTYPIC SAMPLES THEY ARE
 MOST ENRICHED FOR.
 >> HOW BIG A PROBLEM IS MISSING
 DATA FOR YOUR FUNCTIONAL
 ORTHOLOGY ANALYSIS?
 IT SEEMS LIKE YOU CAN'T GO
 THERE WITHOUT A COMPLETE DATA
 SET.
 >> YOU ARE SO RIGHT.
 IT IS A BIG PROBLEM.
 ESPECIALLY IF YOU'RE TRYING TO
 DO MOUSE DATA.
 FORTUNATELY, WE CAN ADJUST THE
 ALPHA PARAMETER AND WE CAN
 WEIGHT THE SEQUENCE DATA MORE
 IN THE CASES WHERE WE DON'T
 HAVE THE NETWORK DATA.
 BUT THAT'S WHY WE ALSO WENT TO
 GENETIC INTERACTION DATA.
 WE HAD A LOT MORE OF THAT.
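
[Editor's sketch, not from the talk: a small Python illustration of how an alpha parameter can trade off network evidence against sequence evidence, as the answer above describes. The update follows the general form of the published IsoRank iteration, in which a pair's score is alpha times a neighborhood term plus (1 - alpha) times its normalized sequence similarity. The function name, the default values, and the toy inputs are assumptions for illustration. Lowering alpha weights the sequence data more, which is the adjustment described for species with sparse network data.]

```python
def blended_alignment_scores(g1, g2, seq_sim, alpha=0.6, iterations=30):
    """Score node pairs across two networks from topology and sequence.

    g1, g2: dict mapping node -> list of neighbors (undirected graphs).
    seq_sim: dict mapping (node_in_g1, node_in_g2) -> similarity >= 0,
             with at least one positive entry; missing pairs count as 0.
    alpha: weight on the network term; (1 - alpha) goes to sequence.
    """
    pairs = [(i, j) for i in g1 for j in g2]
    total = sum(seq_sim.get(p, 0.0) for p in pairs)
    # Normalized sequence term so the scores sum to 1.
    seq = {p: seq_sim.get(p, 0.0) / total for p in pairs}
    score = dict(seq)
    for _ in range(iterations):
        nxt = {}
        for i, j in pairs:
            # A pair scores well if its neighbors also pair up well.
            topo = sum(score[(u, v)] / (len(g1[u]) * len(g2[v]))
                       for u in g1[i] for v in g2[j])
            nxt[(i, j)] = alpha * topo + (1 - alpha) * seq[(i, j)]
        score = nxt
    return score


# Hypothetical two-node networks, for illustration only.
net_a = {"a": ["b"], "b": ["a"]}
net_b = {"x": ["y"], "y": ["x"]}
similarity = {("a", "x"): 1.0, ("b", "y"): 1.0}
scores = blended_alignment_scores(net_a, net_b, similarity, alpha=0.3)
print(max(scores, key=scores.get))
```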
 >> GOT IT.
 >> THANKS.
 >> WELL, IT'S BEEN A FASCINATING
 CONVERSATION.
 YOU'RE WELCOME TO COME DOWN IF
 YOU WANT TO CONTINUE
 THE CONVERSATION WITH BONNIE.
 YOU'RE WELCOME TO SPEAK WITH
 THE PRESENTER HERE DOWN FRONT.
 LET US THANK THE PRESENTER ONE
 MORE TIME.
 THANK YOU, DR. BERGER.
 [APPLAUSE]
