NIH 011817 BEYOND THE SEA
>> GOOD AFTERNOON, WELCOME TO
TODAY'S BEYOND THE SEA WEBINAR.
MY NAME IS TONY NGUYEN, AND
TODAY WE ARE WELCOMING LISA
FEDERER WHO IS THE RESEARCH DATA
INFORMATIONIST FROM THE NATIONAL
INSTITUTES OF HEALTH LIBRARY.
LISA CURRENTLY SERVES THERE
WHERE SHE PROVIDES TRAINING AND
SUPPORT IN THE MANAGEMENT,
ORGANIZATION, SHARING AND REUSE
OF BIOMEDICAL RESEARCH DATA.
SHE'S THE AUTHOR OF SEVERAL
PEER-REVIEWED ARTICLES AND
EDITOR OF THE MEDICAL LIBRARY
ASSOCIATION GUIDE TO DATA
MANAGEMENT FOR LIBRARIANS.
SHE HOLDS A MASTER'S OF
LIBRARY AND INFORMATION SCIENCE
FROM THE UNIVERSITY OF CALIFORNIA,
LOS ANGELES, AND GRADUATE
CERTIFICATES IN DATA SCIENCE AT
GEORGETOWN UNIVERSITY AND DATA
VISUALIZATION AT NEW YORK
UNIVERSITY.
AND SHE'S A DOCTORAL STUDENT AT
THE UNIVERSITY OF MARYLAND.
EVERYONE, PLEASE WELCOME LISA.
>> THANK YOU, TONY.
I'M GLAD TO BE HERE TODAY TO
SHARE MY PASSION IN TERMS OF
WORKING WITH DATA AND DATA
SCIENCE.
SO, TO GIVE YOU A BRIEF OVERVIEW
OF WHAT WE'RE GOING TO COVER
TODAY IN THE NEXT HOUR, WE'LL TALK A
LITTLE BIT ABOUT WHAT WE MEAN
WHEN WE'RE TALKING ABOUT BIG
DATA, IS THIS SOMETHING THAT IS
JUST A BUZZWORD OR IS THIS
SOMETHING THAT ACTUALLY HAS REAL
MEANING.
WE'LL TALK ABOUT WHAT DATA
SCIENCE IS ALL ABOUT.
IF YOU HEARD THE TERM USED
BEFORE BUT WEREN'T SURE WHAT
EXACTLY THAT ENTAILS HOPEFULLY
AT THE END OF THE SESSION TODAY
YOU'LL HAVE A BETTER IDEA OF
WHAT THAT MEANS.
WE'LL TALK A LITTLE BIT ABOUT
SOME OF THE SPECIFICS OF DATA
SCIENCE INCLUDING THE DATA
SCIENCE PIPELINE WHICH IS SORT
OF LIKE THE DATA SCIENCE VERSION
OF THE SCIENTIFIC METHOD, AND
SOME SPECIFIC DATA SCIENCE
METHODS AND TECHNIQUES THAT ARE
CURRENTLY WIDELY USED IN
BIOMEDICAL RESEARCH APPLICATIONS.
WE'LL TALK A LITTLE BIT ABOUT
SOME OF THE TOOLS THAT DATA
SCIENTISTS USE, AND THEN AT THE
END SORT OF DO A LITTLE, YOU
KNOW, SORT OF FORESIGHT, LOOKING
AHEAD AND TALK ABOUT WHAT ARE
SOME OF THE DEVELOPMENTS THAT
ARE POTENTIALLY COMING UP IN THE
NEXT FIVE TO TEN YEARS THAT WE
SHOULD BE AWARE OF.
WE'LL TALK ABOUT HOW CAN
LIBRARIES BE INVOLVED.
I'M AT THE NIH LIBRARY.
WE'RE THE LIBRARY FOR INTRAMURAL
RESEARCHERS AT THE NIH. ABOUT
80% OF THE NIH BUDGET GOES OUT TO
FUND RESEARCH AT YOUR
INSTITUTION, AND THE REMAINING 20%
GOES TO FUND THE INTRAMURAL
RESEARCH MOSTLY IN BETHESDA AND
OTHER SITES, SO WE'RE THE
LIBRARY THAT SERVES THOSE
RESEARCHERS.
SO VERY SIMILAR PROBABLY TO SOME
RESEARCHERS THAT YOU SERVE AT
YOUR INSTITUTION SO I'LL TALK A
LITTLE BIT ABOUT HOW WE'RE
SUPPORTING DATA MANAGEMENT AND
DATA SCIENCE AT THE NIH.
SO BIG DATA, WHAT IS IT?
WE'VE HEARD THE WORDS BIG DATA
THROWN AROUND IN POPULAR MEDIA
AND THE SCIENTIFIC WORLD.
IS THERE ANY MEANING, IS IT JUST
A BUZZWORD?
I'M SEEING IN
THE CHAT THAT THERE'S A PROBLEM
WITH THE AUDIO FOR SOME PEOPLE.
TONY, AM I SOUNDING OKAY?
>> I CAN HEAR HER FINE.
I'LL MESSAGE YOU DIRECTLY.
>> I WANTED TO DOUBLE CHECK THAT
I'M NOT TALKING TO AN EMPTY ROOM
HERE.
ALL RIGHT.
SO BIG DATA, IS IT JUST A
BUZZWORD?
HOW BIG DO DATA HAVE TO BE TO BE
CONSIDERED BIG DATA?
HOW IS BIG DATA DIFFERENT FROM
SMALL OR MEDIUM DATA AND WHY IS
THAT DISTINCTION IMPORTANT?
ONE VERY BASIC DEFINITION OF BIG
DATA IS THAT THIS WOULD BE A
DATA SET THAT'S TOO LARGE OR TOO
COMPLEX TO BE DEALT WITH USING
TRADITIONAL DATA PROCESSING
ANALYSIS AND STORAGE TECHNIQUES.
IF YOU'VE EVER TRIED TO OPEN A
REALLY BIG SPREADSHEET IN EXCEL
YOU'LL FIND IT SLOWS DOWN OR
CRASHES ENTIRELY. EXCEL CAN ONLY
HANDLE ABOUT ONE MILLION LINES OF
DATA.
THAT'S AN EXAMPLE OF HOW MANY TOOLS
WE'VE TRADITIONALLY USED FOR
COMPUTATION JUST CAN'T HANDLE
LARGE DATASETS.
SO IF WE WANT TO TAKE ADVANTAGE
OF THE WEALTH OF DATA THAT IS
AVAILABLE TO US TODAY, WE HAVE
TO DEVELOP NEW TECHNOLOGIES THAT
MAKE IT POSSIBLE TO DEAL WITH
THESE VERY LARGE DATASETS IN
MEANINGFUL WAYS.
SO, TO EXPAND ON THIS IDEA A
LITTLE BIT, THIS SORT OF CONCEPT
OF BIG DATA, ONE MEANINGFUL WAY
TO THINK ABOUT BIG DATA IS TO
CONSIDER THE FOUR Vs THAT
CHARACTERIZE BIG DATA.
THE FIRST V IS VOLUME, OR THE
AMOUNT OF DATA.
SO THIS IS WHERE THAT CONCEPT OF
BIG IN TERMS OF SIZE COMES IN.
THERE'S NO SINGLE AGREED UPON
CUTOFF SIZE THAT MAKES DATA BIG
VERSUS MEDIUM VERSUS SMALL, BUT,
AGAIN, BIG DATA TYPICALLY
INVOLVES A VOLUME OF DATA THAT'S
LARGE ENOUGH THAT IT CAN'T BE
EFFECTIVELY ANALYZED OR
STORED USING TRADITIONAL
COMPUTATIONAL TOOLS.
HOWEVER, IT'S NOT JUST ABOUT THE
SIZE OF THE DATA.
THE SECOND V IS VARIETY, OR
BRINGING TOGETHER MANY DIFFERENT
TYPES OF DATA THAT ALL WORK
TOGETHER TO HELP RESEARCHERS
MAKE DISCOVERIES.
FOR EXAMPLE, IF WE WERE TO STUDY
CANCER TODAY WE MIGHT BRING
TOGETHER IMAGES FROM ONE OR MORE
IMAGING MODALITIES, MRI OF A
PATIENT, A CT, WE WOULD PROBABLY
HAVE SOME CLINICAL DATA IN THE
FORM OF LAB VALUES, BLOOD TESTS
THAT HAD BEEN DONE.
WE PROBABLY HAVE SOME TEXT FROM
THE CLINICIAN NARRATIVE ABOUT
THE PATIENT AND HOW THEY WERE
DOING ON THE DAY THE CLINICIAN
SAW THEM AND PROBABLY HAVE SOME
OF THE PATIENT'S GENETIC
SEQUENCE.
SO WE HAVE IMAGES, GENETIC
SEQUENCES, CLINICAL DATA, FREE
TEXT, ALL OF THIS WOULD GIVE US
A REALLY COMPREHENSIVE VIEW OF
THE PATIENT AND WE WOULD REALLY
NEED ALL OF THIS DATA TO HAVE
THE SORT OF OVERALL 360 VIEW
THAT WILL ALLOW US TO MAKE
MEANINGFUL DISCOVERIES, BUT IT
ALSO PRESENTS A REALLY BIG
CHALLENGE OF FIGURING OUT HOW DO
WE INTEGRATE ALL OF THIS, HOW DO
WE BRING ALL OF THESE DISPARATE
TYPES OF DATA TOGETHER AND MAKE
SENSE OUT OF ALL OF THIS?
AND THIS IS A REALLY BIG
CHALLENGE THAT RESEARCHERS ARE
FACING.
THE THIRD V IS VELOCITY, THE
SPEED AT WHICH DATA ARE
GENERATED.
WE'RE ABLE TODAY TO GENERATE
DATA AT A REALLY UNPRECEDENTED
RATE.
IN FACT IBM ESTIMATED WE'RE
GENERATING 2.5 BILLION GIGABYTES
OF DATA EVERY SINGLE DAY.
A REALLY GOOD EXAMPLE OF THIS IS
THE LARGE HADRON COLLIDER. THE
DATA GENERATED IN JUST A FEW
SECONDS FROM A SINGLE
EXPERIMENT IS SO MASSIVE THAT IT
TAKES RESEARCHERS ALL AROUND THE
WORLD MONTHS TO ANALYZE JUST
THAT DATA THAT'S GENERATED SO
QUICKLY IN ONE QUICK EXPERIMENT.
FINALLY, THE FOURTH V IS
VERACITY, WHICH HIGHLIGHTS THE
IMPORTANCE OF DATA QUALITY AND
TRUSTWORTHINESS.
THERE'S A LOT OF DATA AVAILABLE
TODAY FROM VARIOUS SOURCES, BUT
NOT ALL OF THE DATA WE CAN
ACCESS IS GOOD OR HIGH QUALITY.
THERE'S THE SAYING IN COMPUTER
SCIENCE ABOUT GARBAGE IN,
GARBAGE OUT, THAT IS TRUE OF
DATA DRIVEN RESEARCH AS WELL.
IF WE WANT RESULTS WE CAN TRUST
AND ARE REPRODUCIBLE AND
VERIFIABLE WE MUST USE HIGH
QUALITY DATA.
WE'VE HEARD ABOUT RETRACTED
ARTICLES, FINDINGS OF RESEARCH
MISCONDUCT, BUT EVEN LOW QUALITY
DATA IN TERMS OF DATA THAT
DOESN'T HAVE THE ADEQUATE
METADATA TO REALLY KNOW EXACTLY HOW
IT WAS COLLECTED OR WHAT EXACTLY
IT MEANS, THESE ARE ALL PROBLEMS
THAT COME ALONG WITH THIS SORT
OF CONCEPT OF BIG DATA.
SO, THAT'S SORT OF, YOU KNOW,
SUMS UP THE DEFINITION OF WHAT
WE'RE TALKING ABOUT WHEN WE TALK
ABOUT BIG DATA.
I DO ALSO WANT TO TAKE A MOMENT
HERE TO EMPHASIZE BIG DATA ISN'T
THE ONLY DATA WE SHOULD BE
PAYING ATTENTION TO.
THERE'S REALLY VERY EXCITING
WORK THAT'S BEING DONE TO
LEVERAGE BIG DATA BUT THERE'S
ALSO A LOT GOING ON WITH SMALL
DATA.
SO MANY OF THE CONCERNS AND
ISSUES I'LL BE TALKING ABOUT
HERE ARE RELEVANT TO ALL
RESEARCHERS WHO WORK WITH
DIGITAL DATA IN ANY FORM,
WHETHER IT'S BIG OR SMALL, OR
WHATEVER DISCIPLINE THEY ARE IN.
AND I DO ALSO WANT TO EMPHASIZE
THERE'S A DISTINCTION BETWEEN
BIG DATA AND DATA SCIENCE,
ALTHOUGH IT SOMETIMES SEEMS THAT
PEOPLE ARE USING THOSE TERMS
INTERCHANGEABLY BUT WE'LL TALK
MORE ABOUT WHAT DATA SCIENCE IS
ABOUT.
BIG DATA GETS ALL THE ATTENTION
BUT THERE'S A LOT OF REALLY
IMPORTANT OPPORTUNITIES TO WORK
WITH RESEARCHERS WHO ARE WORKING
WITH THOSE SMALLER OR MORE
MODERATELY SIZED DATASETS AND
THEY NEED ASSISTANCE WITH THIS
STUFF THAT WE'LL TALK ABOUT AS
WELL.
HOW HAVE WE ARRIVED AT THIS SORT
OF ERA OF BIG DATA AND DATA
SCIENCE?
WHY IS THIS SOMETHING THAT'S
BECOME IMPORTANT IN RESEARCH
TODAY?
PART OF THE REASON WE HAVE SO
MUCH DATA NOW IS THAT WE'RE ABLE
TO GENERATE DATA MORE QUICKLY
AND MORE CHEAPLY THAN EVER
BEFORE.
THE CASE OF GENOMIC DATA
PROVIDES A REALLY USEFUL EXAMPLE
FOR THIS.
SO WHAT YOU'RE SEEING HERE ON
THIS CHART IS THE COST TO
SEQUENCE A SINGLE HUMAN GENOME
FROM 2001 TO 2015.
THIS IS A CHART THAT IS PROVIDED
BY THE NATIONAL HUMAN GENOME
RESEARCH INSTITUTE.
SO BY THE TIME THE FIRST HUMAN
GENOME WAS SEQUENCED IN 2003,
THE HUMAN GENOME PROJECT THAT
YOU'VE PROBABLY HEARD OF, IT
TOOK ABOUT TEN YEARS TO DO THAT,
JUST ONE SINGLE HUMAN GENOME AND
COST ABOUT $2.7 BILLION.
TODAY 15 YEARS LATER WE CAN DO
THE SAME WORK SEQUENCING A
SINGLE HUMAN GENOME IN 26 HOURS,
IT COSTS ABOUT $1000.
SO THIS IS A REALLY SIGNIFICANT
DECREASE IN BOTH TIME AND COST.
AND SO YOU CAN SEE HERE THAT
THIS HAS REALLY DROPPED
PRECIPITOUSLY IN THE LAST
SEVERAL YEARS, FAR OUTPACING THE
WHITE LINE, WHICH SHOWS MOORE'S LAW,
THE IDEA THAT COMPUTING POWER
TYPICALLY DOUBLES EVERY TWO YEARS.
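As a rough back-of-the-envelope check on that comparison, the sketch below asks what a genome would cost today if the price had merely halved every two years. The dollar figures and the 15-year span are the round numbers quoted in the talk; treating Moore's law as "cost halves every two years" is an assumption made only for this illustration, and the function name is hypothetical.

```python
# Round figures quoted in the talk; not exact costs.
START_COST = 2.7e9       # dollars, first human genome
ACTUAL_COST_NOW = 1000   # dollars, about 15 years later
YEARS = 15
HALVING_PERIOD = 2       # assumption: cost halves each time computing power doubles


def moores_law_cost(start_cost, years, halving_period=HALVING_PERIOD):
    """Cost after `years` if it halved once every `halving_period` years."""
    return start_cost / 2 ** (years / halving_period)


expected = moores_law_cost(START_COST, YEARS)
print(f"Cost if it had tracked Moore's law: ${expected:,.0f}")
print(f"Cost quoted in the talk:            ${ACTUAL_COST_NOW:,.0f}")
print(f"Sequencing beat that pace by roughly {expected / ACTUAL_COST_NOW:,.0f}x")
```

Under these assumptions the Moore's-law price would still be in the millions of dollars, which is the gap the chart is showing.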
SO WE'VE, YOU KNOW, VASTLY, YOU
KNOW, OVERACHIEVED REALLY WITH
HOW QUICKLY AND CHEAPLY WE CAN
DO THINGS, AND SO WE'RE ABLE TO
GENERATE ALL THIS DATA.
STORING IT IS BECOMING CHEAPER
AND EASIER SO WE CAN GENERATE
HUGE AMOUNTS OF DATA AND WE CAN
KEEP IT SORT OF ALMOST
INDEFINITELY.
ANOTHER SORT OF DRIVER BEHIND
THIS RISE IN HOW WE'RE ABLE TO
GENERATE DATA SO QUICKLY IS THAT
WE'RE REALLY COLLECTING DATA ALL
AROUND US, ALL THE TIME, MORE
THAN EVER BEFORE.
SENSORS THAT GATHER REALLY
HIGHLY SPECIALIZED DATA USED TO
BE SOMETHING THAT WAS, YOU KNOW,
A BIG BULKY THING, HARD TO TAKE
IN THE FIELD, EXPENSIVE IF YOU
LOST IT.
BUT WE HAVE SENSORS NOW THAT ARE
SO CHEAP AND SO SMALL THAT WE
CAN BASICALLY LIKE WEAR THEM
AROUND, CARRY THEM WITH US
WHEREVER WE GO.
PROBABLY IF YOU ARE LISTENING TO
THIS NOW YOU ARE WITHIN REACH OF
YOUR SMARTPHONE.
MINE IS SITTING RIGHT HERE.
AND THE SMARTPHONE THAT YOU HAVE
IS AN INCREDIBLY SOPHISTICATED
DATA SENSOR AND COLLECTION
DEVICE THAT IS ABLE TO GENERATE
DATA ABOUT YOU AND THE WORLD
AROUND YOU, PRETTY MUCH ALL THE
TIME.
SO YOU CAN GO ONLINE RIGHT NOW
AND GET INSTRUCTIONS ABOUT HOW
TO, YOU KNOW, BUILD AN AIR
POLLUTION SENSOR IN ABOUT AN
HOUR USING COMPONENTS THAT
ALTOGETHER COST LESS THAN $50.
SOME OF YOU MAY HAVE OR BE
WEARING SOME OF THESE WEARABLE
SENSORS LIKE THE FITBIT OR ONE
THAT'S LIKE A NECKLACE, I FORGET
WHAT IT'S CALLED, BUT IT'S THE
SAME IDEA WHERE IT TRACKS YOUR
STEPS AND ALSO TRACKS YOUR
RESPIRATORY RATE AND TELLS YOU IF
YOU'RE STRESSED OUT OR NEED TO
REST MORE AND MEDITATE.
WE HAVE SENSORS AROUND US GIVING
US DATA ABOUT THE WORLD IN
GENERAL BUT WE'RE ABLE TO SAVE
THAT DATA AND THEN TAKE
ADVANTAGE OF THAT FOR SCIENTIFIC
PURPOSES.
THERE'S ALSO BEEN AN INCREASE IN
THE AMOUNT OF WHAT'S CALLED BORN
DIGITAL DATA.
THIS IS DATA THAT ORIGINATES AS
DIGITAL DATA RATHER THAN BEING
CONVERTED OR DIGITIZED LATER.
A REALLY GOOD EXAMPLE OF THIS IS
THE MOVE TO ELECTRONIC HEALTH
RECORDS.
SO CLINICAL DATA USED TO BE
PRIMARILY COLLECTED IN THE FORM
OF PAPER CHARTS, YOU GO TO YOUR
DOCTOR, HE OR SHE WOULD WRITE
STUFF DOWN IN YOUR CHART, IT
WOULD GET FILED.
NOW PROVIDERS ARE USING MOSTLY
ELECTRONIC HEALTH RECORDS.
THE PAPER DATA THAT USED TO BE
COLLECTED BY YOUR CLINICIAN WAS
REALLY UNUSABLE FOR RESEARCHERS
BECAUSE IT SAT THERE IN THE
CHARTS IN THE CLINICIAN'S OFFICE
AND IT WAS REALLY COSTLY AND TIME
CONSUMING TO EITHER TRY TO GO TO
ALL OF THOSE OFFICES AND, YOU
KNOW, COLLECT THAT DATA OR TRY
TO DIGITIZE ALL OF THAT, BUT NOW
RESEARCHERS CAN POTENTIALLY
CONDUCT RESEARCH REALLY EASILY
AND QUICKLY USING ELECTRONIC
HEALTH RECORDS DATA BECAUSE IT'S
ALL ALREADY COLLECTED DIGITALLY,
AND POTENTIALLY AVAILABLE FOR
RESEARCHERS.
SO MANY RESEARCHERS NOW, YOU
KNOW, ARE ABLE TO USE THIS. OF
COURSE PRIVACY ISSUES ARE
ANOTHER CONCERN, BUT HAVING SO
MUCH DIGITAL DATA AVAILABLE
REALLY PRESENTS A LOT OF
PROMISE.
MANY RESEARCHERS ARE ALSO
SWITCHING TO ELECTRONIC LAB
NOTEBOOKS, ELNs, WHEREAS THEY
USED TO RECORD ALL OF THEIR
LAB OBSERVATIONS IN PAPER LAB
NOTEBOOKS.
ANOTHER THING THAT SOME
RESEARCHERS ARE DOING IS INSTEAD
OF USING AN ELN PROGRAM THAT YOU
HAVE TO INPUT ALL OF YOUR DATA
INTO A COMPUTER IS USING
SMARTPENS WHICH WRITE LIKE A
REGULAR PEN AND USE PAPER AND
YOU HAVE THE FEEL OF A PAPER LAB
NOTEBOOK, THE ACTUAL PHYSICAL
PAPER RECORD THAT YOU CAN CARRY
WITH YOU, BUT IT ALSO DIGITALLY
RECORDS EVERY SINGLE MARK SO A
DIGITAL AND SEARCHABLE COPY IS
ALSO CREATED ELECTRONICALLY THAT
YOU CAN THEN, YOU KNOW, SAVE AND
ACCESS AND TAKE WITH YOU.
WE ALSO HAVE APPS THAT MAKE
IT POSSIBLE TO COLLECT ALL SORTS
OF INFORMATION.
THE EXAMPLE THAT YOU SEE ON THE
BOTTOM LEFT-HAND CORNER THERE IS
ONE CALLED CITYNOISE, USING
SMART PHONES TO GATHER
INFORMATION ABOUT NOISE LEVELS
IN THE WILD AND PUBLIC, AND TO
TRACK NOISE POLLUTION IN CITIES.
SO SORT OF TURNING PEOPLE INTO
DATA COLLECTORS ALL AROUND.
SOCIAL MEDIA EVEN HAS POTENTIAL
RESEARCH USES, SO TWITTER AND
FACEBOOK POSTS HAVE BEEN USED TO
DO THINGS LIKE TRACK THE SPREAD
OF DISEASE, YOU KNOW, WHERE
PEOPLE ARE WRITING THAT THEY'VE
GOTTEN THE FLU, SORT OF TRACK
DISEASE THAT WAY.
DRUG USE IS ONE.
THERE WAS A STUDY THAT USED
TWITTER AND LOOKED AT COLLEGE
STUDENTS AND, YOU KNOW, THEM
WRITING ABOUT, OH, I'M TAKING,
YOU KNOW, THIS DRUG RIGHT BEFORE
I HAVE TO GO DO MY FINAL EXAM,
OR WHATEVER.
OTHER EPIDEMIOLOGICAL STUDIES
ARE THINGS RESEARCHERS USED
SOCIAL MEDIA DATA FOR.
SO NOT ONLY ARE WE GENERATING A
LOT OF DATA, BUT MORE AND MORE
OF THIS DATA IS AVAILABLE FOR
RESEARCHERS TO USE AND OFTEN IT
CAN BE EASILY ACCESSED FROM A
REPOSITORY, AS SIMPLE AS GOING
TO A WEB PAGE, SEARCHING FOR THE
DATA YOU WANT AND DOWNLOADING
IT, AND YOU HAVE ACCESS TO IT.
MANY REPOSITORIES ARE FULLY OPEN
AND ACCESSIBLE ALLOWING ANYONE,
ANYWHERE, WITH INTERNET ACCESS
TO ACCESS THE DATA.
OTHERS PARTICULARLY THOSE
CONTAINING ANY SORT OF
INFORMATION ABOUT HUMAN
SUBJECTS, YOU KNOW, SORT OF
HUMAN GENETIC INFORMATION
TYPICALLY REQUIRE USERS TO APPLY
FOR ACCESS SO YOU HAVE TO, YOU
KNOW, COMPLETE AN IRB AND GET
PERMISSION TO ACTUALLY ACCESS
THIS DATA.
THERE ARE A LOT OF DIFFERENT
TYPES OF REPOSITORIES,
SUBJECT-SPECIFIC REPOSITORIES
ARE AVAILABLE TO DISCIPLINES SO
RESEARCHERS KNOW IF YOU WANT TO
LOOK FOR, FOR EXAMPLE, MY
FAVORITE REPOSITORY LOGO IS THE
MOUSE GENOME, I FORGET WHAT THE
I STANDS FOR, BUT THE CUTE
LITTLE MOUSE ON THE TOP
RIGHT-HAND CORNER OF THE SCREEN.
IF YOU WANT MOUSE GENOME DATA
YOU GO TO THE MOUSE GENOME
REPOSITORY.
SOME REPOSITORIES ARE ALSO MORE
BROAD IN FOCUS, SO TAKE PRETTY
MUCH ANY KIND OF DATA, FOR
EXAMPLE, DRYAD, YOU CAN SUBMIT
ANY FORMAT, ANY SUBJECT MATTER,
THEY WILL TAKE IT.
ALSO SOME INSTITUTIONS HAVE
THEIR OWN REPOSITORIES WHERE
RESEARCHERS CAN POST DATA.
FOR EXAMPLE, IF I'M A RESEARCHER
AT UCLA, I WILL PUT MY RESEARCH
IN THE UCLA REPOSITORY.
SO LOTS OF DIFFERENT OPTIONS FOR
PEOPLE THAT EITHER WANT TO FIND
A PLACE TO SHARE THEIR DATA OR
TO LOOK FOR DATA THAT THEY CAN
REUSE AND, AGAIN, ALL OF THIS,
YOU KNOW, AVAILABILITY OF DATA
MAKES IT POSSIBLE FOR US TO SORT
OF LEVERAGE THIS AND MAKE NEW
DISCOVERIES.
A FINAL DRIVER BEHIND THE SORT
OF RISE OF BIG DATA AND DATA
SCIENCE IS SHARING MANDATES.
SOME RESEARCHERS CHOOSE OF THEIR
OWN VOLITION TO SHARE THEIR DATA
BECAUSE THEY FEEL LIKE IT'S A
GOOD THING TO DO TO CONTRIBUTE
TO THE SCIENTIFIC COMMUNITY.
BUT NOT EVERYONE FEELS THAT WAY.
SO SOME DATA ENDS UP IN
REPOSITORIES BECAUSE THE
RESEARCHER IS ACTUALLY REQUIRED
TO SHARE THEIR DATA.
A LOT OF JOURNALS NOW MAKE IT A
REQUIREMENT OF PUBLICATION THAT
THE DATA THAT SUPPORTS AN
ARTICLE HAVE TO BE SHARED AND
AVAILABLE AT THE TIME THE
ARTICLE IS PUBLISHED.
MANY VERY LARGE JOURNALS ARE
DOING THIS SO, FOR EXAMPLE,
NATURE PUBLISHING GROUP, BMJ, PLOS,
PEOPLE HAVE TO BE ABLE TO ACCESS
THE DATA TO SEE HOW CONCLUSIONS
WERE DRAWN.
FUNDERS ALSO HAVE BEGUN TO
REQUIRE DATA SHARING THROUGH NEW
POLICIES THAT APPLY OFTEN TO
FEDERALLY FUNDED RESEARCH BUT
EVEN PRIVATE FUNDERS ARE DOING
THIS.
IF YOU'RE GETTING THIS FEDERAL
FUNDING, YOU'LL BE EXPECTED TO
SHARE YOUR DATA.
SO REGARDLESS OF A RESEARCHER'S
SPECIFIC AREA OF INTEREST, IT'S
VERY LIKELY THAT THEY WILL FACE
SOME SORT OF SHARING REQUIREMENT
WHETHER FROM THE JOURNAL THEY
ARE GOING TO PUBLISH IN OR THE
FUNDER GIVING THEM THE GRANT TO
DO THEIR RESEARCH.
SO NOW I THINK IS REALLY A GOOD
TIME FOR LIBRARIANS TO TALK TO
RESEARCHERS ABOUT THE FACT THAT
THEY WILL BE EXPECTED TO SHARE
THEIR DATA AND TO HAVE IT IN
SOME SORT OF A SHAREABLE FORM
THAT OTHER PEOPLE WILL
UNDERSTAND, YOU KNOW, WHEN IT
GETS TO THE END OF THE PROJECT
AND THEY HAVE TO PUT IT IN A
REPOSITORY.
THE UPSHOT IS THAT AS A RESULT
OF THESE SORTS OF MANDATES, WE
ARE GOING TO HAVE EVEN MORE DATA
AVAILABLE FOR REUSE IN THE
FUTURE.
SO WE'VE TALKED ABOUT BIG DATA,
THE FACT THAT THERE'S A LOT OF
DATA OUT THERE AND AVAILABLE,
AND ONE OF THE WAYS THAT PEOPLE
ARE TAKING ADVANTAGE OF THIS IS
BY USING DATA SCIENCE
METHODOLOGIES.
SO WHAT ARE WE TALKING ABOUT
WHEN WE TALK ABOUT DATA SCIENCE?
TO AN EXTENT, ALL SCIENCE USES
DATA, BUT DATA SCIENCE IS A VERY
PARTICULAR SORT OF APPROACH TO
EXPLORING DATA AND MAKING
DISCOVERIES.
ONE WAY OF THINKING ABOUT IT, IT
REQUIRES AND SORT OF MAKES USE
OF A VARIETY OF DIFFERENT TYPES
OF SKILLS TO BOTH PROCESS DATA,
TO ANALYZE IT, TO VISUALIZE IT,
TO GET SOME SORT OF KNOWLEDGE
OUT OF IT.
SO IT INVOLVES ALL OF THE SORTS
OF SKILLS THAT YOU SEE HERE.
FIRST OF ALL, YOU HAVE TO HAVE
SORT OF A MATH AND STATISTICAL
BACKGROUND TO BE ABLE TO FIND
PATTERNS AND DO ANALYSIS OF
LARGE DATASETS.
YOU ALSO NEED COMPUTATIONAL
SKILLS TO -- MOST DATA SCIENCE
IS DONE WRITING CODE, USING
THINGS LIKE R OR PYTHON.
SORRY, I'M DOING THIS WEBINAR
FROM HOME AND MY DOG, OPHELIA,
DECIDED SHE WANTED TO CHIME IN.
SHE FINDS DATA SCIENCE
FASCINATING.
I APOLOGIZE IF YOU HEARD HER
BARK IN THE BACKGROUND.
YOU'RE BRINGING TOGETHER THIS
MATH AND STATISTICAL EXPERTISE,
COMPUTER EXPERTISE AND YOU NEED
DOMAIN EXPERTISE TO REALITY
CHECK YOUR FINDINGS AND KNOW
WHAT YOU ARE DISCOVERING IN THIS
DATA IS ACTUALLY A TRUE PATTERN
THAT MAKES SENSE WITH SORT OF
THE SCIENCE OF, YOU KNOW, WHAT
WE KNOW ABOUT THE SUBJECT AREA.
OKAY.
SO ANOTHER WAY TO THINK ABOUT
THIS IS YOU MAY HAVE SEEN THIS
VENN DIAGRAM BEFORE.
YOU SORT OF NEED ALL OF THESE
THREE TYPES OF SKILLS TO BRING
TOGETHER TO ACCOMPLISH DATA
SCIENCE AND GET SOME SORT OF
KNOWLEDGE OUT OF DATA.
SO I MENTIONED EARLIER THAT DATA
SCIENCE IS A LITTLE BIT
DIFFERENT THAN THE TRADITIONAL
SCIENTIFIC METHOD.
SO YOU PROBABLY REMEMBER BACK
TO, YOU KNOW, ELEMENTARY SCHOOL
AND YOU LEARNED ABOUT THE
SCIENTIFIC METHOD, AND TYPICALLY
THIS STARTS WITH YOU HAVE A
HYPOTHESIS, YOU OBSERVE THE
WORLD AROUND YOU AND YOU COME UP
WITH AN IDEA ABOUT SOMETHING
THAT YOU THINK IS TRUE.
AND THE IDEA IS TO DEVISE AN
EXPERIMENT THAT ALLOWS YOU TO
TEST THAT HYPOTHESIS, AND YOU DO
THAT EXPERIMENT, YOU GATHER
DATA, AND THEN YOU, YOU KNOW,
SORT OF ANALYZE ALL OF THAT DATA
AND THEN YOU FIND OUT THAT
EITHER, YES, THAT HYPOTHESIS WAS
TRUE AND NOW YOU'VE CONFIRMED
YOUR HYPOTHESIS, OR NO, YOU FIND
THE DATA DON'T SUPPORT YOUR
HYPOTHESIS, YOU REJECT IT AND
MAYBE COME UP WITH A NEW
HYPOTHESIS.
IF I WERE GOING TO TEST A DRUG
IN THIS WAY I WOULD HYPOTHESIZE
THAT, YOU KNOW, DRUG A WOULD,
YOU KNOW, CAUSE A REMISSION IN
CANCER IN PATIENTS.
I WOULD GIVE THE DRUG TO
PATIENTS.
HAVE A CONTROL GROUP, COMPARE
THOSE TWO, AND FIND OUT IF THAT
DRUG ACTUALLY WORKED.
THIS IS AGAIN SORT OF THE
TRADITIONAL SCIENTIFIC METHOD,
YOU'RE PROBABLY FAMILIAR WITH
THIS, MAKES A LOT OF LOGICAL
SENSE.
DATA SCIENCE IS A LITTLE BIT
DIFFERENT IN TERMS OF HOW IT
OFTEN PROGRESSES.
PARTLY BECAUSE A LOT OF TIMES
THERE'S NOT NECESSARILY A
HYPOTHESIS IN ADVANCE.
WE MIGHT HAVE A VERY LARGE
DATASET AND WE DON'T NECESSARILY
KNOW WHAT IS SIGNIFICANT, OR
WHAT WE MIGHT FIND WHEN WE LOOK
AT IT.
BUT BY USING CERTAIN APPROACHES
OR ALGORITHMS OR TECHNIQUES
WE'RE ABLE TO FIND PATTERNS IN
DATA AND THEREFORE GET MEANING
OUT OF IT.
SO ONE OF THE THINGS THAT I'VE
HAD CONVERSATIONS WITH
RESEARCHERS ABOUT BEFORE IS
THAT, YOU KNOW, WHEN THEY ARE
DOING DATA SCIENCE AND THEY WANT
TO GET ACCESS TO MAYBE ONE OF
THOSE DATASETS THAT REQUIRES IRB
APPROVAL THEY ARE SORT OF
STUMPED ABOUT HOW TO GO ABOUT
DOING IRB APPROVAL BECAUSE
USUALLY THAT PROCESS INVOLVES,
YOU KNOW, STATING YOUR
HYPOTHESIS, WHAT IS IT YOU'RE
LOOKING FOR, AND THEY DON'T
NECESSARILY KNOW IN ADVANCE
EXACTLY WHAT THEY ARE LOOKING
FOR BUT THEY HAVE A FEELING THEY
CAN FIND, YOU KNOW, SOMETHING
SIGNIFICANT IN THAT DATASET.
SO, WHILE DATA SCIENCE SORT OF
BUILDS ON AND, YOU KNOW, USES
THIS TRADITIONAL SCIENTIFIC
METHOD, I THINK A BETTER WAY OF
THINKING ABOUT THE PROCESS THAT
DATA SCIENTISTS GO THROUGH IS
THIS DATA SCIENCE PIPELINE.
SO THE WAY THAT DATA SCIENTISTS
TYPICALLY WORK IS THAT WE START
WITH DATA INGESTION, WHICH
BASICALLY INVOLVES GETTING DATA
FROM SOME SOURCE.
THIS MIGHT BE-- IN SOME CASES
COLLECTING DATA YOURSELF BUT
MORE TYPICALLY IT WOULD BE
LOCATING DATA IN A REPOSITORY,
IT MIGHT BE ASKING FOR DATA FROM
A COLLABORATOR, BUT SOMEHOW YOU
ACQUIRE DATA FOR YOUR ANALYSIS.
A HUGE PART OF THE WORK OF DATA
SCIENCE IS THIS SECOND STEP,
WHICH IS LABELED HERE AS DATA
MUNGING OR WRANGLING, ALSO SORT
OF DATA CLEANING, DATA
PROCESSING.
ABOUT 80% OF THE WORK OF DATA
SCIENCE IS JUST GETTING THE DATA
INTO SOME SORT OF A FORMAT
THAT'S USABLE FOR YOUR ANALYSIS.
PART OF THE REASON THIS IS SO
TIME CONSUMING IS THAT OFTEN
WE'RE REUSING DATA COLLECTED
FOR, YOU KNOW, SOME OTHER
PURPOSE, MOST LIKELY BY SOMEONE
ELSE, SO IT'S NOT NECESSARILY IN THE
FORM I MIGHT NEED TO DO MY
ANALYSIS.
IF I WANTED TO USE PATIENT DATA
COLLECTED IN AN ELECTRONIC
HEALTH RECORD, THAT DATA WAS
COLLECTED WITH THE PURPOSE IN
MIND OF PROVIDING A RECORD OF
THAT PATIENT'S HEALTH FOR THE
CLINICIAN, NOT NECESSARILY FOR A
RESEARCHER.
SO THE WAY THAT I WOULD LIKE TO
SEE THE DATA ORGANIZED IS NOT
NECESSARILY THE SAME WAY AS, YOU
KNOW, THE DATA WAS COLLECTED AND
THE WAY THAT A CLINICIAN WOULD
FIND IT USEFUL.
SO, A BIG PART OF THAT WORK IS
THAT DATA MUNGING OR WRANGLING
OR PROCESSING, HOWEVER YOU WANT
TO CALL IT.
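To make the munging step a little more concrete, here is a minimal sketch of cleaning a few messy records using only the Python standard library. The field names, the sample rows, and the two date formats are all hypothetical, invented for illustration; real exports from a health record system will look different and need their own rules.

```python
from datetime import datetime

# Hypothetical raw rows, as they might come out of a clinical export.
raw_rows = [
    {"patient_id": " 001 ", "visit_date": "01/18/2017", "glucose": "98 mg/dL"},
    {"patient_id": "002", "visit_date": "2017-01-19", "glucose": "N/A"},
    {"patient_id": "003", "visit_date": "1/20/2017", "glucose": "110"},
]

# Assumption: only these two date layouts occur in the data.
DATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d")


def parse_date(text):
    """Try each known date format; return an ISO date string or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def parse_glucose(text):
    """Strip the unit and return a float, or None when the value is missing."""
    value = text.replace("mg/dL", "").strip()
    try:
        return float(value)
    except ValueError:
        return None


def clean_row(row):
    """Return one record with trimmed ID, ISO date, and numeric lab value."""
    return {
        "patient_id": row["patient_id"].strip(),
        "visit_date": parse_date(row["visit_date"]),
        "glucose_mg_dl": parse_glucose(row["glucose"]),
    }


clean_rows = [clean_row(r) for r in raw_rows]
for row in clean_rows:
    print(row)
```

Even in a toy case like this, most of the code is about reconciling formats rather than analysis, which is why this step tends to take so much of the time.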
ONCE THE DATA IS IN A FORM
YOU'RE READY TO USE,
COMPUTATION AND ANALYSIS IS
CARRIED OUT, TYPICALLY THIS
INVOLVES STATISTICAL KNOWLEDGE
USING STATISTICAL TECHNIQUES
THAT YOU MAY HAVE HEARD OF
IF YOU'VE EVER TAKEN A
BIOSTATISTICS CLASS.
AND THEN ONCE THAT ANALYSIS IS
DONE, WE'RE ABLE TO DEVELOP SOME
SORT OF MODEL OF, YOU KNOW,
BASICALLY HOW THIS DATA WORKS,
WHAT ARE THE PATTERNS, WHAT ARE
THE IMPLICATIONS FOR SORT OF
REAL WORLD PRACTICE, AND HAVE
THOSE, YOU KNOW, APPLICATIONS
FOR THAT IN THE REAL WORLD.
I'LL TALK A LITTLE BIT MORE
ABOUT SOME SPECIFIC EXAMPLES OF
THIS THAT I THINK WILL MAKE
SENSE.
AND FINALLY WE HAVE TO SHARE OUR
RESULTS.
AND SO REPORTING AND
VISUALIZATION WOULD BE THAT
FINAL STEP WHERE WE, YOU KNOW,
CREATE MAYBE SOME SORT OF A
DASHBOARD THAT WE CAN SHARE THE
DATA, CREATE A FINAL REPORT,
WHATEVER IT MIGHT BE.
AS WITH MOST SCIENTIFIC
PROCESSES, YOU KNOW, WE HAVE
THESE IN THE MODEL, WE HAVE
THESE SORT OF, YOU KNOW,
DISTINCTIVE BOXES AND ARROWS
GOING BACK AND FORTH.
IN REAL LIFE IT'S VERY RARELY
THE CASE THAT YOU NEATLY, YOU
KNOW, FOLLOW THESE STEPS IN THIS
EXACT ORDER, AND ONE STEP STARTS
AS SOON AS ANOTHER ONE ENDS,
IT'S TYPICALLY AN ITERATIVE PROCESS
BUT THESE ARE THE FIVE BASIC
STEPS INVOLVED IN DOING DATA
SCIENCE.
SO WHAT ARE WE TALKING ABOUT
WHEN WE ACTUALLY TALK ABOUT THE
TECHNIQUES THAT WE'RE USING?
ONE OF THE VERY COMMON
TECHNIQUES THAT'S USED IN DATA
SCIENCE IS CALLED MACHINE
LEARNING.
THIS IS BASICALLY TAKING DATA
AND ANALYZING THEM USING A
PARTICULAR ALGORITHM.
THERE ARE LOTS AND LOTS OF
DIFFERENT ALGORITHMS AVAILABLE
DEPENDING ON THE TYPE OF DATA
AND RELATIONSHIPS.
A MODEL IS DEVELOPED FROM THE
ALGORITHM TO EXPLAIN PATTERNS IN
THE DATA AND POTENTIALLY ALLOW
PREDICTIONS ABOUT CERTAIN DATA.
THERE ARE TWO TYPES OF MACHINE
LEARNING.
THE FIRST IS CALLED SUPERVISED
LEARNING.
IN SUPERVISED LEARNING, WE START
WITH A DATASET OF WHAT'S KNOWN
AS TRAINING DATA.
SO THIS WOULD BE DATA THAT MAYBE
ARE CATEGORIZED IN A FEW
DIFFERENT GROUPS, AND LABELED BY
A HUMAN.
SO IN THIS EXAMPLE WE HAVE I
THINK THESE ARE X-RAYS OF LUNGS
WITH I THINK IT'S TUBERCULOSIS
THAT THESE PEOPLE HAVE, THE ONES
LABELED SICK, AND THEN WE HAVE
HEALTHY CONTROLS.
SO WE WOULD PUT THIS DATA, WE
WOULD HAVE AN EXPERT LABEL THE
DATA, THIS PERSON HAS GOT THE
ILLNESS, THIS PERSON DOES NOT.
AND THEN ONCE WE'VE LABELED
THOSE, WE CAN PUT THAT DATA
INTO A MACHINE LEARNING
ALGORITHM AND THE COMPUTER
BASICALLY LEARNS WHAT FEATURES
OF AN IMAGE TEND TO DISTINGUISH
THE HEALTHY LUNG FROM A DISEASED
LUNG.
IN THE CASE OF IMAGES, TYPICALLY
THAT'S DONE BY ACTUALLY LOOKING
PIXEL BY PIXEL, SO IT'S LIKE
PIXEL 55 IN A HEALTHY LUNG IS
TYPICALLY BLACK, IN A SICK LUNG
IT'S TYPICALLY WHITE.
IN OTHER TYPES OF DATA, NON-IMAGE
TYPE DATA, FEATURES MIGHT BE
THINGS LIKE, FOR EXAMPLE, IF WE
WERE LOOKING AT HOUSE PRICES,
THINGS LIKE THE SIZE, THE SQUARE
FOOTAGE OF THE HOUSE, HOW MANY
BEDROOMS IT HAS, HOW MANY
BATHROOMS IT HAS.
THESE WOULD ALL BE CONSIDERED
FEATURES.
AND THE ALGORITHM WOULD LOOK AT
THESE VARIOUS FEATURES AND FROM
A VERY LARGE SET OF LABELED DATA
TRY TO FIGURE OUT WHAT ARE THE
FEATURES THAT ARE MOST
PREDICTIVE TO ALLOW US TO SAY
WHETHER, YOU KNOW, IN THIS CASE
THE PERSON IS SICK OR HEALTHY,
OR PREDICT HOW MUCH ANOTHER
HOUSE WOULD GO FOR ON THE
MARKET.
SO ONCE YOU PUT THIS TRAINING
DATA IN TO A SUPERVISED LEARNING
SORT OF TECHNIQUE, THEN YOU
WOULD BE ABLE TO GIVE THE
COMPUTER ANOTHER IMAGE AND IT
SHOULD BE ABLE TO PREDICT MORE
OR LESS ACCURATELY WHETHER THAT
IMAGE IS OF A SICK OR HEALTHY
PERSON.
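The idea of learning from labeled examples and then predicting a label for a new one can be sketched in a few lines. This is a toy nearest-centroid classifier, chosen only because it is short; it is not the method used for the lung images on the slide. The two numeric features per example and all the values are made up for illustration.

```python
import math

# Hypothetical labeled training data: (feature vector, label).
# The two features are invented stand-ins for measurements taken from an image.
training_data = [
    ((0.20, 0.10), "healthy"),
    ((0.25, 0.15), "healthy"),
    ((0.30, 0.12), "healthy"),
    ((0.70, 0.60), "sick"),
    ((0.75, 0.55), "sick"),
    ((0.80, 0.65), "sick"),
]


def train(examples):
    """Learn one centroid (mean feature vector) per label."""
    sums, counts = {}, {}
    for features, label in examples:
        total = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            total[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: tuple(v / counts[label] for v in total)
        for label, total in sums.items()
    }


def predict(model, features):
    """Return the label whose centroid is closest to the new example."""
    return min(model, key=lambda label: math.dist(model[label], features))


model = train(training_data)
print(predict(model, (0.22, 0.14)))  # a new, unlabeled example
print(predict(model, (0.78, 0.58)))  # another new example
```

The "model" here is just the pair of learned centroids; real image classifiers learn far more complicated relationships, but the train-then-predict shape is the same.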
ONE OF THE PROBLEMS THAT WE DO
HAVE TO DEAL WITH IN SUPERVISED
MACHINE LEARNING IS WHAT'S
CALLED OVERFITTING.
SO IT'S RARELY THE CASE THAT WE
HAVE LIKE A COMPLETELY NEAT LINE
BETWEEN, YOU KNOW, FOR EXAMPLE
SUBJECT A, SUBJECT B, THE SICK,
THE HEALTHY, WHATEVER.
IN THIS EXAMPLE HERE THIS IS
ACTUALLY SUPPOSED TO BE AN
ANIMATION, NOT IN THIS WEBEX,
BUT YOU CAN SEE WE HAVE THE
BLACK LINE THAT IS A BASIC GOOD,
YOU KNOW, WAY TO DIVIDE THE TWO
GROUPS.
SOME BLUES END UP IN THE RED
SIDE, SOME REDS END UP IN THE
BLUE SIDE, BUT THIS BLACK LINE
IS SOMETHING THAT A GOOD AND
ACCURATE MACHINE LEARNING
ALGORITHM MIGHT SUGGEST AS A
DIVIDING LINE.
YOU KNOW, WE WOULDN'T BE 100%
ACCURATE IF WE WERE TO GET SHOWN
THIS DATASET AND TRY TO GUESS
THE RED VERSUS THE BLUE, BUT THE
PROBLEM IS IF WE END UP WITH ONE
OF THESE LIKE REALLY CRAZY SORT
OF GREEN LINES WE'VE OVERFIT OUR
MODEL TO A PARTICULAR DATASET.
SO WE'VE MANAGED TO CREATE SOME
SORT OF A, YOU KNOW, CRAZY KIND
OF ALGORITHM THAT WOULD 100%
ACCURATELY PREDICT THIS
PARTICULAR DATASET, BUT IF WE
LOOK AT A DIFFERENT DATASET
WE'RE NOT GOING TO HAVE THE SAME
SORTS OF OUTLIERS, SO THIS MODEL
IS NOT GOING TO BE VERY ACCURATE
AT PREDICTING FOR OTHER DATASETS
OR FOR THE SORT OF LARGER
POPULATION.
SO, THE WAY THAT WE SORT OF, YOU
KNOW, PROTECT AGAINST
OVERFITTING TO A PARTICULAR
DATASET AND THEREFORE ENDING UP
WITH AN ALGORITHM THAT DOESN'T
REALLY SCALE TO OTHER DATA IS BY
USING TRAINING AND TEST DATA.
SO WHEN WE HAVE OUR DATASET WE
HAVE, YOU KNOW, THINK BACK TO
LUNGS.
LET'S SAY WE HAVE 1000 LUNGS THAT A
CLINICIAN LABELED AS BEING SICK
OR HEALTHY.
RATHER THAN FEEDING ALL 1000
LUNGS INTO THE ALGORITHM WE
SPLIT INTO TRAINING DATA AND
TEST DATA.
80%, 800 OF OUR LUNGS, WE ARE
GOING TO FEED INTO THAT
ALGORITHM.
BUT WE'RE GOING TO SAVE BACK 200
OF THOSE LUNGS, AND WE'RE GOING
TO USE THAT AS TEST DATA TO
FIGURE OUT IF THE MODEL THAT WE
DEVELOPED IS ACTUALLY GOOD, HOW
ACCURATE IT IS AT GUESSING ON
SIMILAR DATA, THIS TEST DATA,
OR IF WE'VE ENDED UP OVERFITTING
OUR DATA, OR OVERFITTING OUR
MODEL TO OUR DATA IN THIS
PARTICULAR CASE.
SO TYPICALLY, AGAIN, THIS 80/20
SPLIT IS HOW WE DO IT, AND
HAVING 20% TEST DATA ALLOWS US
TO BE MORE CONFIDENT THE MODEL
WE DEVELOPED IS ACCURATE WHEN WE
PUT IT OUT IN THE REAL WORLD AND
ARE TRYING TO DO THIS ON
UNLABELED DATA.
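The 80/20 split described above might look something like the following sketch. The 1000 "lungs" are simulated with a single made-up numeric feature, and the "model" is just a threshold halfway between the two class averages; both are assumptions chosen to keep the example short, and the helper names are hypothetical rather than taken from any particular library.

```python
import random
from statistics import mean


def split_train_test(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data and hold back `test_fraction` of it."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predict, examples):
    """Fraction of examples where predict(feature) matches the label."""
    hits = sum(1 for feature, label in examples if predict(feature) == label)
    return hits / len(examples)


# Simulated stand-in for 1000 clinician-labeled images (one invented feature).
rng = random.Random(42)
labeled = []
for _ in range(1000):
    label = rng.choice(["sick", "healthy"])
    center = 0.7 if label == "sick" else 0.3
    labeled.append((rng.gauss(center, 0.15), label))

train_set, test_set = split_train_test(labeled)

# "Training": use ONLY the training data to pick a dividing threshold.
sick_mean = mean(f for f, label in train_set if label == "sick")
healthy_mean = mean(f for f, label in train_set if label == "healthy")
threshold = (sick_mean + healthy_mean) / 2


def predict(feature):
    return "sick" if feature > threshold else "healthy"


print(f"train size: {len(train_set)}, test size: {len(test_set)}")
print(f"accuracy on training data:      {accuracy(predict, train_set):.2f}")
print(f"accuracy on held-out test data: {accuracy(predict, test_set):.2f}")
```

If the accuracy on the held-out data were much worse than on the training data, that would be a warning sign of overfitting.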
SO THAT IS SUPERVISED LEARNING.
WHAT IF WE DON'T KNOW IN ADVANCE
NECESSARILY WHAT OUR CLUSTERS
ARE OR WHAT OUR CATEGORIES MIGHT
BE?
WE CAN DO SOMETHING CALLED
UNSUPERVISED LEARNING, ALSO
KNOWN AS CLUSTERING.
SO IN CLUSTERING, WE START WITH
UNLABELED DATA.
WE DON'T NECESSARILY KNOW-- WE
HAVE ALL THESE DATA POINTS.
WE DON'T KNOW WHICH PERSON IS
SICK, WHICH PERSON IS HEALTHY.
IN THIS CASE WE'RE TRYING TO
DECIDE BETWEEN BLUE, RED AND
GREEN DOTS.
IN THE CASE OF UNSUPERVISED
LEARNING WE COULD FEED THAT
UNLABELED DATA INTO THE
ALGORITHM, IT WOULD LOOK AT
THOSE FEATURES AND TRY AND
FIGURE OUT WHAT ARE THE SORT OF,
YOU KNOW, CHARACTERISTICS THAT
MAKE DIFFERENT ITEMS SIMILAR TO
EACH OTHER, AND HOPEFULLY SORT
THOSE INTO CORRECT GROUPS.
SO, AGAIN, WE DON'T KNOW
NECESSARILY IN ADVANCE WHAT
THOSE GROUPS MIGHT BE, BUT THE
ALGORITHM SORT OF FIGURES OUT
AND SPITS OUT AND SAYS, OKAY,
HERE ARE YOUR GROUPS, ONE, TWO
AND THREE.
I AS A SUBJECT MATTER EXPERT CAN
COME BACK IN AND SAY, GROUP ONE
IS DOTS THAT ARE BLUE, GROUP TWO
IS DOTS THAT ARE RED.
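Here is a minimal sketch of clustering unlabeled points, using a bare-bones version of the k-means algorithm. K-means is one common clustering method, picked here as an assumption; the talk does not say which algorithm is behind the slide. The points are invented, and the starting centers are chosen with a simple "farthest point" rule so the result is repeatable.

```python
import math


def farthest_point_init(points, k):
    """Pick k starting centers: the first point, then repeatedly the point
    farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers


def k_means(points, k, iterations=10):
    """Group points into k clusters; returns a list of k lists of points."""
    centers = farthest_point_init(points, k)
    groups = []
    for _ in range(iterations):
        # Assign every point to its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Move each center to the mean of the points assigned to it.
        centers = [
            tuple(sum(axis) / len(group) for axis in zip(*group))
            if group else centers[i]
            for i, group in enumerate(groups)
        ]
    return groups


# Unlabeled data: the algorithm is never told which clump a point belongs to.
points = [
    (1.0, 1.1), (0.9, 1.0), (1.2, 0.8),
    (5.0, 5.2), (5.1, 4.9), (4.8, 5.0),
    (9.0, 1.0), (9.2, 1.1), (8.9, 0.8),
]

groups = k_means(points, k=3)
for number, group in enumerate(groups, start=1):
    print(f"group {number}: {group}")
```

The output only says "group 1, group 2, group 3"; deciding what each group means is still up to the subject matter expert.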
EXAMPLE OF UNSUPERVISED LEARNING
APPLICATION THAT I WAS WORKING
ON WITH ONE OF MY COLLEAGUES IS
TAKING ARTICLES AND LOOKING AT
THEIR MeSH HEADINGS AND TRYING
TO SORT OF FIGURE OUT DIFFERENT
CATEGORIES OF ARTICLES IN THIS
REALLY, REALLY LARGE DATASET OF
ARTICLES.
SO BY LOOKING AT ALL OF THE
DIFFERENT MeSH TERMS, THAT
ARTICLES WERE TAGGED WITH, THIS
ALGORITHM WAS ABLE TO FIGURE OUT
SIMILAR ARTICLES BY LOOKING AT
SORT OF PATTERNS OF, YOU KNOW,
MeSH TERMS THAT WERE OFTEN
CITED TOGETHER AND THESE SORTS
OF THINGS.
SO, AGAIN, THE MACHINE DIDN'T
NECESSARILY KNOW WHAT THOSE
PATTERNS MEANT, BUT WE AS
SUBJECT MATTER EXPERTS COULD GO
BACK AND SAY, OH, YEAH, THIS IS
A GROUP OF ARTICLES THAT'S ABOUT
-- YOU KNOW, STEM CELL THERAPY
OR WHATEVER IT MIGHT BE.
THAT'S UNSUPERVISED LEARNING.
THE DISTINCTION OF SUPERVISED
VERSUS UNSUPERVISED COMES FROM
WHETHER THE DATA IS LABELED OR
NOT.
SO IN UNSUPERVISED LEARNING WE
HAVE THE UNLABELED DATA, WHEREAS
IN SUPERVISED LEARNING WE HAVE
THE LABELED DATA.
SO ANOTHER APPLICATION IN DATA
SCIENCE IS TEXT MINING, AND SORT
OF RELATED TO THAT IS NATURAL
LANGUAGE PROCESSING, OR NLP.
THESE ARE TECHNIQUES THAT
BASICALLY HELP US TO FIND
MEANING IN TEXTUAL DATA.
WE CAN GET A LOT OUT OF NUMBERS
AND IMAGES, BUT LANGUAGE IS
HARDER TO WORK WITH.
NATURAL LANGUAGE IS COMPLEX AND
DIFFICULT FOR COMPUTERS TO MAKE
SENSE OF, ESPECIALLY TRUE WHEN
TALKING ABOUT CLINICAL DATA.
A CLINICIAN MIGHT SAY SOMETHING
LIKE PATIENT DENIED CHEST PAIN.
AND SO, YOU KNOW, THE WORD
DENIED IS A NEGATIVE, WE KNOW
THAT THAT MEANS THERE IS NOT
CHEST PAIN, BUT IF WE JUST SAW
CHEST PAIN, WE WOULD HAVE TO
UNDERSTAND THAT MEANING OF
DENIED, SO THERE'S A LOT OF LIKE
VERY COMPLICATED STRUCTURES IN
NATURAL LANGUAGE.
BUT NLP TECHNIQUES MAKE IT
POSSIBLE FOR COMPUTERS TO
ESSENTIALLY PARSE OUT THE
LANGUAGE INTO PARTS AND
UNDERSTAND THE RELATIONSHIPS BETWEEN
THOSE PARTS.
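The "patient denied chest pain" problem can be illustrated with a deliberately simple rule: look for a negation word in the few words just before the finding. This is a toy sketch, not how production clinical NLP tools work; the cue list, the three-word window, and the function names are all assumptions made for illustration.

```python
import re

# Hypothetical, very short list of negation cues. Real clinical NLP systems
# rely on much richer rule sets or trained models.
NEGATION_CUES = {"denied", "denies", "no", "not", "without"}
WINDOW = 3  # how many words before the finding to check for a cue


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def finding_status(sentence, finding):
    """Return 'negated', 'affirmed', or 'absent' for a finding phrase."""
    words = tokenize(sentence)
    target = tokenize(finding)
    for start in range(len(words) - len(target) + 1):
        if words[start:start + len(target)] == target:
            preceding = words[max(0, start - WINDOW):start]
            if any(word in NEGATION_CUES for word in preceding):
                return "negated"
            return "affirmed"
    return "absent"


print(finding_status("Patient denied chest pain.", "chest pain"))
print(finding_status("Patient reports chest pain on exertion.", "chest pain"))
print(finding_status("Patient is resting comfortably.", "chest pain"))
```

A search for the words "chest pain" alone would match the first two sentences equally; the extra step of checking the surrounding words is what separates them.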
SO, FOR EXAMPLE, HERE IS A
SENTENCE THAT HAS BEEN PARSED,
AND WHAT ARE CALLED ENTITIES,
WITHIN THE TEXT HAVE BEEN
EXTRACTED.
THINGS LIKE SHAKESPEARE IS AN
ENTITY, HAMLET IS AN ENTITY.
ENGLISH IS AN ENTITY.
AND THEIR RELATIONSHIPS TO EACH
OTHER HAVE BEEN DETERMINED.
FOR EXAMPLE, HAMLET HAS BEEN
DETERMINED TO BE A TYPE OF A
BOOK, AND IT HAS A RELATIONSHIP
WITH SHAKESPEARE IN THAT
SHAKESPEARE WROTE HAMLET.
SO ALL OF THESE THINGS ARE SORT
OF NECESSARY TO PARSE OUT AND
UNDERSTAND THE RELATIONSHIPS TO
REALLY GET THE MEANING OUT OF
THE SENTENCE.
SO THESE TECHNIQUES, YOU KNOW,
ALSO CAN BE USEFUL IN A VARIETY
OF BIOMEDICAL APPLICATIONS, ONE
LIKE I MENTIONED IS ELECTRONIC
HEALTH RECORDS, SO THE NARRATIVE
TEXT THAT A CLINICIAN WOULD HAVE
WRITTEN ABOUT A PATIENT AND
OTHER TYPES OF TEXT EXIST AS
WELL THAT PEOPLE USE THESE SORTS
OF NLP TECHNIQUES ON AS WELL.
THE NATIONAL LIBRARY OF MEDICINE
DOES DO SOME RESEARCH ON NLP IN
TERMS OF LOOKING AT WHETHER IT
IS POSSIBLE TO DO SOME AUTOMATIC
CLASSIFICATION OF ARTICLES, AS
YOU PROBABLY KNOW ACTUAL PEOPLE
HAVE TO GO THROUGH AND READ
ARTICLES AND FIGURE OUT WHAT
MeSH TERMS ARE ASSIGNED.
IF IT WAS POSSIBLE FOR A
COMPUTER TO DO THAT IT WOULD BE
QUICKER THAN HAVING A PERSON GO
THROUGH AND LOOK.
AGAIN, IT'S A VERY COMPLICATED
TASK BECAUSE OF THE FACT THIS IS
WRITTEN IN NATURAL LANGUAGE.
SO THOSE ARE A FEW OF THE
TECHNIQUES THAT ARE COMMONLY
USED IN DATA SCIENCE.
WHAT ARE THE WAYS PEOPLE GO
ABOUT THAT?
LIKE I MENTIONED EARLIER, EXCEL,
GENERALLY NOT A WORKABLE TOOL
FOR DOING DATA SCIENCE BECAUSE
IT CAN'T HANDLE THE LARGER SORT
OF AMOUNT OF DATA THAT MOST DATA
SCIENTISTS ARE TYPICALLY WORKING
WITH.
AS WELL AS THE FACT THAT MOST
RESEARCHERS ARE NOT USING SORT
OF POINT AND CLICK PROGRAMS
BECAUSE ESSENTIALLY YOU'RE
LIMITED IN TERMS OF WHAT YOU CAN
DO.
THE WAY THAT A POINT AND CLICK
PROGRAM WORKS IS THAT IT IS--
IT CONTAINS A SET NUMBER OF
THINGS YOU CAN DO, THERE ARE
BUTTONS YOU CLICK ON TO DO THEM.
IF YOU WANT TO DO SOMETHING
OTHER THAN THAT, SORT OF OUTSIDE
OF THAT, COME UP WITH YOUR OWN
ANALYSIS, WRITE YOUR OWN
ALGORITHM, YOU'RE NOT ABLE TO DO
THAT.
MOST DATA SCIENTISTS AND PEOPLE
WORKING EXTENSIVELY WITH DATA
ARE USING MORE CODE-BASED TOOLS.
SO THE SOFTWARE PACKAGES ON THE
LEFT ARE COMMONLY USED BUT THESE
ARE PROPRIETARY, SO THEIR OUTPUT
IS IN A CLOSED FORMAT ONLY
ACCESSED USING THAT SOFTWARE.
IF I GET DATA, DO MY ANALYSIS
AND USE SAS OR SPSS AND WANT TO
SHARE THAT WITH SOMEONE ELSE,
THEY ALSO HAVE TO HAVE SAS OR
SPSS.
SOFTWARE IS EXPENSIVE, IT LIMITS
WHO CAN GET ACCESS.
MANY DATA SCIENTISTS ARE MOVING
TO OPEN-SOURCE SOLUTIONS WHICH
ARE FREE FOR ONE THING, THAT'S
ALWAYS A GOOD THING.
AND THEY ARE MORE WIDELY
APPLICABLE.
BECAUSE IT'S FREE, ANYBODY CAN
DOWNLOAD R OR PYTHON OR JULIA.
YOU DON'T HAVE TO PAY FOR IT.
IF I WANT TO USE YOUR DATA, NO
PROBLEM.
VERY EASY.
SO THESE THREE EXAMPLES THAT
I'VE LISTED HERE ARE R, PYTHON
AND JULIA, ALL PROGRAMMING
LANGUAGES THAT ARE COMMONLY USED
WITHIN THE DATA SCIENCE
COMMUNITY.
AGAIN, RATHER THAN BEING A
PROGRAM THAT WE POINT AND CLICK
TO DO OUR ANALYSIS, WE WRITE ALL
OF OUR ANALYSIS OR OUR
VISUALIZATION AS THE CASE MAY
BE, USING CODE.
SO IT CAN BE TIME CONSUMING AND
DIFFICULT SOMETIMES TO LEARN A
PROGRAMMING LANGUAGE ESPECIALLY
FOR RESEARCHERS THAT DON'T HAVE
EXTENSIVE COMPUTER SCIENCE
BACKGROUND BUT I THINK MOST
RESEARCHERS I WORKED WITH, I
MYSELF PROGRAM IN R, FIND IT
WORTH THE EFFORT IN THE LONG RUN
BECAUSE KNOWING A PROGRAMMING
LANGUAGE ALLOWS YOU TO DO REALLY
HIGHLY SPECIALIZED ANALYSES AND
CREATE CUSTOMIZED VISUALIZATION
AND YOU'RE REALLY SORT OF NO
LONGER LIMITED TO A SET MENU OF
ITEMS AVAILABLE IN A PROGRAM.
YOU'RE ABLE TO DO REALLY
ANYTHING THAT YOU WANT WITH YOUR
DATA.
SO, A LOT OF DIFFERENT FREE
OPTIONS ARE AVAILABLE FOR
LEARNING ALL OF THIS KIND OF
STUFF.
LOTS OF MOOCS, MASSIVE OPEN
ONLINE COURSES FOR LEARNING
THIS, TONS OF BOOKS.
IT'S TIME CONSUMING, BUT IF
YOU'RE SOMEONE THAT WORKS WITH
DATA ON, YOU KNOW, A REGULAR
BASIS, ULTIMATELY I THINK WORTH
IT.
SO ONE OF THE REASONS WHY I
THINK CODING IS USEFUL IS LIKE I
SAID, CUSTOMIZABILITY.
THIS IS AN ACTUAL REAL EXAMPLE
OF SOME DATA THAT A RESEARCHER
CAME TO ME WITH, THIS IS HIS
PLOT FROM EXCEL, KIND OF A MESS.
EACH COLORED LINE IS A DIFFERENT
SUBJECT AND THEY DIDN'T HAVE THE
SAME NUMBER OF-- I THINK THESE
WERE BLOOD DRAWS OR SOMETHING.
THEY DIDN'T HAVE THE SAME NUMBER
OF BLOOD DRAWS FROM EACH
PATIENT, AND SO BECAUSE OF THAT
WE GOT THIS SORT OF LIKE
NONSENSE LOOKING CHART, AND IT
WAS REALLY DIFFICULT TO FIGURE
OUT WHAT WAS GOING ON HERE.
THIS IS ALL HE COULD COME UP
WITH IN EXCEL.
HE WAS NOT HAPPY WITH THIS.
HE ASKED IF I COULD HELP HIM
COME UP WITH A BETTER-LOOKING
CHART USING R, AND THIS IS WHAT
I CAME UP WITH.
SO, YOU KNOW, IT TOOK A LONG
TIME, WELL, NOT THAT LONG
REALLY, BUT IT TOOK A WHILE TO
WRITE THIS.
IT TOOK LONGER THAN CLICKING ONE
BUTTON IN EXCEL.
BUT WE WERE ABLE TO, YOU KNOW,
END UP WITH THESE CHARTS WHERE
YOU COULD MORE EASILY SEE WHAT
WAS ACTUALLY GOING ON.
AND I CAME UP WITH TWO DIFFERENT
VERSIONS, JUST BY SLIGHTLY
MODIFYING THE CODE, AND HE WAS
ABLE TO PICK THE ONE THAT HE
LIKED.
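THE R CODE FOR THOSE CHARTS IS
NOT REPRODUCED IN THIS
TRANSCRIPT.
AS A ROUGH PYTHON SKETCH OF ONE
WAY TO HANDLE SUBJECTS WITH
DIFFERENT NUMBERS OF BLOOD
DRAWS, EACH MEASUREMENT CAN BE
KEPT AS ITS OWN RECORD AND THEN
GROUPED BY SUBJECT.
THE DATA VALUES AND NAMES BELOW
ARE MADE UP FOR ILLUSTRATION.

```python
from collections import defaultdict

# Made-up (subject, day, value) records. Subject B has fewer draws
# than subject A, as in the situation described in the talk.
draws = [
    ("A", 0, 5.1), ("A", 7, 4.8), ("A", 14, 4.2),
    ("B", 0, 6.0), ("B", 14, 5.5),
]


def series_by_subject(records):
    """Group records into one time-ordered series per subject."""
    series = defaultdict(list)
    for subject, day, value in records:
        series[subject].append((day, value))
    return {subject: sorted(points) for subject, points in series.items()}


for subject, points in series_by_subject(draws).items():
    print(subject, points)
```

BECAUSE EACH SUBJECT KEEPS ITS
OWN DAYS, A LINE CAN BE DRAWN
PER SUBJECT WITHOUT ASSUMING
EVERYONE HAS THE SAME NUMBER OF
POINTS.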
SO, YOU KNOW, AGAIN IF WE WERE
DOING ANYTHING THAT IS MORE
COMPLICATED THAN JUST, YOU KNOW,
A SIMPLE LINE CHART OR BAR CHART
OR BASIC, YOU KNOW, GIVE ME THE
AVERAGE OF THESE NUMBERS,
TYPICALLY A PROGRAMMING LANGUAGE
IS GOING TO BE A MORE EFFICIENT
WAY TO DO THAT THAN SOMETHING
LIKE EXCEL.
ANOTHER REASON THAT CODING
LANGUAGES ARE SORT OF EMPHASIZED
IN THE DATA SCIENCE AND OVERALL
GENERAL SCIENTIFIC COMMUNITIES
IS THE SORT OF ABILITY TO
REPLICATE AND REPRODUCE RESEARCH
THAT COMES OUT OF THIS.
SO IF YOU WERE TO DO SOME SORT
OF AN ANALYSIS IN EXCEL, AND YOU
WANTED TO-- OR MAYBE YOU WERE
GOING TO CLEAN UP YOUR DATA.
YOU WOULD HAVE TO SORT OF
MANUALLY RECORD EVERYTHING THAT
YOU HAD DONE.
LIKE, FOR EXAMPLE, I'M GOING TO,
YOU KNOW, I HAVE COLLECTED MY
MALE SUBJECTS AS M, FEMALE CODED
AS F.
I WANT TO CHANGE THAT TO MALE
AND FEMALE.
I COULD DO THAT WITH FIND AND
REPLACE BUT NEED TO RECORD THAT
I HAD DONE THAT TO INDICATE I'VE
MADE THIS CHANGE.
SO IT CAN BE TIME CONSUMING TO
REALLY DOCUMENT EVERYTHING THAT
YOU'VE DONE IN TERMS OF ANALYSIS
AND PROCESSING, IF YOU'RE USING
POINT AND CLICK PROGRAMS.
IF YOU'RE WRITING CODE BASICALLY
AS YOU GO YOU'RE CREATING YOUR
DOCUMENTATION.
ALL THE CODE EXPLAINS EXACTLY
WHAT YOU'VE DONE.
AND IF YOU'RE COMMENTING YOUR
CODE, LIKE YOU CAN SEE IN LINE
19, THIS LINE STARTS WITH THE
TWO POUND SIGNS, THIS EXPLAINS
WHAT I'M ABOUT TO DO IN THE NEXT
LINE.
I CAN GO BACK AND LOOK AND SORT
OF REMEMBER WHAT I DID LAST
WEEK.
OTHER PEOPLE CAN GO BACK AND
LOOK AND SEE WHAT I WAS DOING
WITH THIS CODE.
AND EVERYTHING IS SORT OF CLEAR
AND DOCUMENTED.
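TO MAKE THE M AND F EXAMPLE
CONCRETE, HERE IS A MINIMAL
PYTHON SKETCH OF THAT RECODING
STEP WRITTEN AS COMMENTED CODE.
IT IS NOT THE CODE SHOWN ON THE
SLIDE.
THE COLUMN NAME SEX AND THE
FUNCTION NAME ARE ASSUMPTIONS.

```python
# Map the single-letter codes used at collection time to full words.
SEX_LABELS = {"M": "Male", "F": "Female"}


def recode_sex(rows):
    """Return new rows with 'M'/'F' replaced by 'Male'/'Female'.

    Values that are not in SEX_LABELS are left unchanged, and the
    original rows are not modified.
    """
    return [{**row, "sex": SEX_LABELS.get(row["sex"], row["sex"])}
            for row in rows]


subjects = [{"id": 1, "sex": "M"}, {"id": 2, "sex": "F"}]
print(recode_sex(subjects))
```

THE SCRIPT ITSELF IS THE RECORD
OF THE CHANGE, SO NOTHING HAS TO
BE WRITTEN DOWN SEPARATELY THE
WAY IT WOULD AFTER A FIND AND
REPLACE.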
IF I WAS TO SHARE THIS WITH
SOMEBODY, THEY COULD EASILY
RECREATE AS LONG AS THEY HAD THE
DATA, EASILY RECREATE THE CHARTS
THAT I HAD JUST SHOWN YOU.
THIS IS THE CODE I USED TO
CREATE THOSE TWO CHARTS.
AND I COULD, YOU KNOW, EASILY
SHARE THIS.
THERE ARE SITES LIKE GITHUB, FOR
EXAMPLE, WHERE PEOPLE SHARE CODE
AND YOU CAN FREELY SHARE THAT
AND ANYBODY CAN GO AND DOWNLOAD
IT.
ALL RIGHT.
SO WE'VE TALKED A LITTLE BIT
ABOUT WHAT DATA, BIG DATA IS.
WHAT DATA SCIENCE IS ALL ABOUT.
WHAT WE'RE DOING SORT OF NOW IN
THE CURRENT, YOU KNOW, CLIMATE
OF SCIENTIFIC RESEARCH.
BUT WHAT SHOULD WE BE THINKING
ABOUT AS WE LOOK AHEAD, YOU
KNOW, STRATEGICALLY PLAN FOR
WHAT MIGHT BE COMING IN TERMS OF
DATA?
THIS IS A RAPIDLY EVOLVING
FIELD, THINGS CHANGE VERY
QUICKLY FROM ONE YEAR TO THE
NEXT.
BOTH THE TECHNOLOGIES CHANGE,
NEW THINGS ALL THE TIME,
POLICIES CHANGE.
FUNDERS, JOURNALS MAY BE COMING
OUT WITH POLICIES PEOPLE HAVE TO
COMPLY WITH.
THINGS CHANGED SIGNIFICANTLY
EVEN IN THE LAST FIVE YEARS AND
WILL CONTINUE TO DO SO VERY
LIKELY FOR THE FORESEEABLE
FUTURE.
SO, YOU KNOW, WHAT SHOULD WE BE
THINKING ABOUT AS WE LOOK AHEAD?
ONE THING THAT WE DO KNOW FOR
SURE IS THAT DATA IS ONLY GOING
TO GET BIGGER.
WE HAVE A LOT OF DATA RIGHT NOW,
AND WE CAN EXPECT THAT WE'RE
GOING TO GET EVEN MORE.
SO, YOU KNOW, ONE OF THE
IMPLICATIONS OF THIS IS THAT AS
A COMMUNITY WE NEED TO THINK
ABOUT HOW WE CAN DO BETTER AT
PRESERVING, SHARING, MAKING SURE
THE DATA IS DISCOVERABLE SO
OTHERS CAN MAKE USE OF IT, BUT
ALSO THINKING ABOUT WHAT
RESEARCHERS NEED TO DO TO STORE
THEIR OWN DATA IN A WAY
THAT WILL, YOU KNOW, MAKE IT
ACCESSIBLE TO OTHERS AND MAKE IT
ACCESSIBLE, YOU KNOW, FIVE OR
TEN YEARS DOWN THE LINE.
FUNDING IS SOMETHING THAT
RESEARCHERS MIGHT HAVE TO THINK
ABOUT IF WE'RE GOING TO NEED TO
SAVE THIS DATA, YOU KNOW, IT'S
DEFINITELY CHEAPER THAN IT USED
TO BE BUT IT'S NOT FREE SO HOW
ARE WE GOING TO FUND SAVING A
TERABYTE OF DATA FOR, YOU KNOW,
TEN YEARS?
SO THESE ARE ALL THINGS THAT
RESEARCHERS REALLY ALREADY
SHOULD BE THINKING ABOUT, AS
THEY START TO GET INTO THE
PROCESS OF COLLECTING OR REUSING
DATA.
ANOTHER THING THAT I THINK IS
GOING TO BE REALLY SIGNIFICANT
MOVING FORWARD IS FIGURING OUT
WAYS WE CAN LEVERAGE
UNSTRUCTURED DATA.
BASICALLY THESE ARE TECHNIQUES
THAT HELP US DEAL WITH DATA NOT
AS EASY TO ANALYZE.
STRUCTURED DATA ARE THINGS LIKE
SPREADSHEETS, THEY ARE
CATEGORIZED, THEY ARE
WELL-DESCRIBED.
WE KNOW WHAT WE'RE LOOKING AT,
BUT THIS IS REALLY ONLY THE TIP
OF THE ICEBERG.
UNSTRUCTURED DATA IS MORE LIKE
STUFF LIKE THAT FREE TEXT THAT
NATURAL LANGUAGE THAT WE TALKED
ABOUT.
THERE ARE SORT OF PRELIMINARY
TECHNIQUES THAT ALLOW US TO DO
SOME ANALYSIS BUT THERE'S MORE
THAT NEEDS TO BE DONE FOR US TO
BE ABLE TO FULLY TAKE ADVANTAGE
OF THIS UNSTRUCTURED DATA
BECAUSE, AGAIN, THIS IS REALLY A
HUGE BULK OF DATA THAT'S
AVAILABLE.
AND SO AS RESEARCHERS, YOU KNOW,
WANT TO BE ABLE TO TAKE
ADVANTAGE OF THIS WE NEED TO
WORK ON DEVELOPING NEW METHODS
FOR HANDLING THIS TYPE OF
UNSTRUCTURED DATA AND EXTRACTING
MEANING FROM IT SO I EXPECT A
LOT OF DEVELOPMENT IN THIS SORT
OF ARENA IN THE NEXT 5-10 YEARS.
ANOTHER THING WE NEED TO THINK
ABOUT IS TRAINING DATA
SCIENTISTS BECAUSE TO DO THIS
TYPE OF SCIENCE THINK BACK TO
THE SLIDE WHERE WE HAD THE VENN
DIAGRAM WITH THE THREE PARTS.
IT'S NOT ENOUGH JUST TO BE A
SUBJECT MATTER EXPERT.
YOU ALSO HAVE TO HAVE THE
STATISTICS, YOU ALSO HAVE TO
HAVE THE PROGRAMMING EXPERTISE.
SO, ONE WAY OF THINKING ABOUT
THIS WHICH COMES FROM A BLOG I
HAVE LINKED TO AT THE BOTTOM IS
TO SORT OF THINK OF THE MODERN
RESEARCHER AS PI-SHAPED, THE
CLASSIC WAS T-SHAPED WITH A
DEPTH OF KNOWLEDGE IN A SPECIFIC
FIELD BUT NOT A REALLY BROAD
BREADTH OF KNOWLEDGE, THE MODERN
RESEARCHER HAS TO HAVE BOTH THAT
DEPTH OF KNOWLEDGE IN THEIR
SPECIALTY AND IN FACT IN
COMPUTING AND HAVE THAT SORT OF
BREADTH OF KNOWLEDGE TO BE ABLE
TO ENCOMPASS ALL OF THOSE
DIFFERENT IDEAS.
SO THE TYPES OF TRAINING THAT
PEOPLE ARE GOING TO NEED TO
ACCOMPLISH THIS ARE DIFFERENT
FROM WHAT PEOPLE MIGHT HAVE DONE
IN GRADUATE PROGRAMS THUS FAR.
SO THE WAY THAT A LOT OF
APPROACHES HAVE BEEN TAKEN TO
TRAINING SORT OF NEXT GENERATION
OF DATA SCIENTISTS IS EITHER
TAKING PEOPLE THAT ALREADY HAVE
THAT SUBJECT MATTER EXPERTISE,
SO PEOPLE WHO HAVE GOTTEN
PhDs IN SCIENCE, M.D.s OR
CLINICIANS, FILLING IN COMPUTER
SCIENCE TRAINING, OR THE
ALTERNATIVE IS TO TAKE PEOPLE
THAT ARE MORE TRADITIONALLY
TRAINED IN COMPUTER SCIENCE AND
TO TEACH THEM ABOUT THE DOMAIN
KNOWLEDGE.
YOU KNOW, GIVE THEM A COURSE IN
MOLECULAR BIOLOGY OR SOMETHING.
SO, YOU KNOW, AGAIN, TO HAVE A
WORKING, YOU KNOW, COMMUNITY OF
DATA SCIENTISTS YOU REALLY NEED
PEOPLE THAT HAVE ALL OF THOSE
TECHNIQUES AND SO I THINK, YOU
KNOW, COMING UP WITH TRAINING
PROGRAMS TO BE ABLE TO ADDRESS
THAT IS GOING TO BE REALLY
IMPORTANT MOVING FORWARD.
AND FOR US AS LIBRARIANS,
THINKING ABOUT HOW WE CAN
SUPPORT THOSE PROGRAMS I THINK
IS ALSO A REALLY SIGNIFICANT,
YOU KNOW, CONCERN.
SO I'VE TALKED A LOT ABOUT SORT
OF GENERALLY SPEAKING WHAT THIS
IS ALL ABOUT.
HOPEFULLY AS I'VE BEEN TALKING
YOU MAYBE HAVE GOTTEN INSPIRED
ABOUT THINGS YOU CAN DO AT YOUR
LIBRARY.
WHAT WE DO AT MY LIBRARY IS A
FEW THINGS.
I AM THE HEAD OF THE DATA
SERVICES PROGRAM.
I HAVE A SMALL TEAM WHO ASSISTS
BUT I'M REALLY THE ONLY
FULL-TIME PERSON THAT'S DEVOTED
ENTIRELY TO DATA.
AND BECAUSE OF THAT, THERE'S,
YOU KNOW, ONLY ONE OF ME AND A
LOT OF THE RESEARCHERS.
OUR PROGRAM FOCUSES ON TRAINING,
ESPECIALLY CLASSROOM TRAINING.
I DO A LOT OF CLASSES ON USING
SOME OF THE TOOLS AND METHODS
THAT I'VE DISCUSSED HERE.
OFTEN THOSE CLASSES WILL LEAD TO
CONSULTATIONS, SO I'LL TEACH
SOMEBODY HOW TO USE R OR
WHATEVER THE TOOL MIGHT BE THAT
WE'RE TALKING ABOUT, AND THEN
THEY COME AND MEET WITH ME ONE
ON ONE TO GET ADDITIONAL HELP ON
HOW TO ACTUALLY APPLY IT TO
THEIR PARTICULAR USE CASE.
ANOTHER THING THAT WE DO IS
ASSISTANCE LOCATING DATASETS FOR
REANALYSIS OR MACHINE LEARNING
ALGORITHMS.
AS WELL AS, IT'S NOT ON THE
SLIDE, BUT HELPING PEOPLE TO
SHARE THEIR DATA.
SO, YOU KNOW, THIS IS A SOMEWHAT
NEW REQUIREMENT FOR MANY
RESEARCHERS, THAT THEY SHARE
THEIR DATA, AND SO THEY DON'T
NECESSARILY KNOW LIKE, YOU KNOW,
HOW DO I SHARE MY DATA, WHERE DO
I SHARE MY DATA, SO THAT'S
SOMETHING THAT WE CAN ALSO
CONSULT ON.
AND THEN I THINK A REALLY
IMPORTANT THING IS THIS IDEA OF
COMMUNITY BUILDING.
YOU KNOW, I THINK LIKE A
LOT OF INSTITUTIONS, NIH IS VERY
SILOED, THERE'S A LOT OF PEOPLE
THAT HAVE EXPERTISE AND INTEREST
BUT THEY ARE IN DIFFERENT
DEPARTMENTS OR DIFFERENT
INSTITUTES AND DON'T TALK TO
EACH OTHER SO I THINK THE
LIBRARY IS A GREAT PLACE TO
BRING TOGETHER DIFFERENT PEOPLE
THAT WOULDN'T NORMALLY TALK TO
EACH OTHER, GET EVERYONE IN THE
SAME ROOM, AND SORT OF BUILD
COMMUNITIES SO WE CAN, YOU KNOW,
CREATE COLLABORATIONS, BUILD ON
EXPERTISE PEOPLE ALREADY HAVE.
WE'VE STARTED DOING THINGS LIKE
HAVING EVENTS, HAVING LIKE A
MENTORING PROGRAM FOR SCIENTISTS
WHO ARE, YOU KNOW, MAYBE A
LITTLE BIT MORE EXPERIENCED AND
WANT TO HELP OUT OTHERS.
SO ALL OF THESE THINGS I THINK
ARE REALLY IMPORTANT IN SORT OF,
YOU KNOW, GIVING RESEARCHERS THE
TOOLS THAT THEY MIGHT NEED TO,
YOU KNOW, TAKE ADVANTAGE OF
WHAT'S ALREADY AVAILABLE TO
THEM.
SO THAT BRINGS US TO THE END.
WE HAVE A LITTLE BIT OF TIME IF
ANYBODY WANTS TO ASK ANY
QUESTIONS.
I ALSO INVITE YOU IF YOU'RE
INTERESTED TO FOLLOW UP WITH ME
WITH ANY QUESTIONS.
MY E-MAIL AND TWITTER HANDLE ARE
LISTED THERE.
I ALSO HAVE MY BLOG LISTED
THERE.
THIS IS ENTIRELY UNOFFICIAL.
IT IS IN NO WAY ASSOCIATED WITH
THE NIH.
IT'S KIND OF A NERD BLOG IN
GENERAL.
I TALK ABOUT DATA SCIENCE STUFF
THERE SO A LOT OF STUFF I'VE
KIND OF TALKED ABOUT HERE, AS
WELL AS-- I'M IN A Ph.D.
PROGRAM, SO I WRITE ABOUT MY
RESEARCH WHICH UNSURPRISINGLY IS
ABOUT DATA.
I ALSO WRITE ABOUT MY DOG SO YOU
CAN SEE PICTURES OF THE ANNOYING
DOG BARKING DURING THE WEBINAR.
SO ANY QUESTIONS?
OKAY.
I SEE A FEW.
SO SOMEONE WHO WORKS IN A
HOSPITAL LIBRARY DOING HELP
LOCATING DATASETS, HOW TO TAKE
FURTHER STEPS.
I'M EXCITED WHEN HOSPITAL
LIBRARIANS ARE GETTING
INTERESTED IN DATA.
I'VE NEVER WORKED IN THAT
PARTICULAR ARENA BUT I WILL SAY
THERE'S A REALLY GOOD CHAPTER IN
THE MLA DATA MANAGEMENT GUIDE I
EDITED, BY A HOSPITAL LIBRARIAN
WHO
BASICALLY GOT INTERESTED IN
DOING THIS AND, YOU KNOW, WHAT
SHE COULD DO TO HELP IN THE
LIBRARY AT THE HOSPITAL.
SHE ENDED UP BUILDING-- THERE'S
THE-- I ALWAYS GET THE ACTUAL
NAME INCORRECT, IT'S THE NEW
ENGLAND COLLABORATIVE DATA
MANAGEMENT CURRICULUM OR
SOMETHING LIKE THAT.
IF YOU GOOGLE NEW ENGLAND DATA
MANAGEMENT YOU'LL FIND IT.
SHE I BELIEVE BUILT OFF OF THAT
AND DID INSTRUCTION ON DATA
MANAGEMENT, SO LOOKING AT HOW TO
HELP PROMOTE DATA MANAGEMENT
EXPERTISE IN HER LIBRARY, SO
THOSE ARE SOME POTENTIAL IDEAS.
BUT, YEAH, I'M REALLY EXCITED
THAT A HOSPITAL LIBRARIAN IS
INTERESTED IN THAT.
SO A QUESTION ABOUT NIH DATASETS
AND WHERE TO FIND THEM.
SO THAT'S A GOOD QUESTION.
ONE THAT'S NOT EASILY ANSWERED.
BECAUSE THE NIH FUNDS SO MANY
TYPES OF RESEARCH, THE DATA IS,
YOU KNOW, VERY DIVERSE AND
EXISTS IN A LOT OF DIFFERENT
LOCATIONS.
AND SO THERE'S NOT LIKE ONE
CENTRAL NIH DATA REPOSITORY AT
THIS TIME.
HOWEVER, I WILL SAY THAT ONE OF
THE THINGS THAT THE NIH FUNDED
RECENTLY IS THE DEVELOPMENT OF
WHAT'S CALLED DataMed, LIKE THE
PubMed FOR DATA.
SO THE IDEA IS THAT, YOU KNOW,
EVENTUALLY WE WILL HAVE ONE
CENTRAL LOCATION WHERE NOT
NECESSARILY ALL THE DATA IS
SAVED THERE BUT IT WILL BE LIKE
PubMed IN THAT WE CAN GO AND
SEARCH FOR SOMETHING AND LOCATE
A LOT OF DIFFERENT DATA FROM
DIFFERENT SOURCES.
SO THAT'S NOT CURRENTLY AVAILABLE
BUT THAT IS SOMETHING TO LOOK
FOR IN THE FUTURE.
FOR NOW THERE IS A WEBSITE THAT
IF YOU GOOGLE NIH-- I DON'T
REMEMBER WHAT YOU HAVE TO GOOGLE
TO FIND IT BUT THERE'S A LIST OF
NIH DATA REPOSITORIES, IT'S NOT
IN ANY WAY, YOU KNOW,
EXHAUSTIVE, BUT SOME DATASETS
ARE LOCATED THERE.
LET'S SEE.
ONE QUESTION FROM-- ANY
PARTICULAR MOOCs OR BOOKS THAT
YOU'VE USED THAT YOU WOULD
RECOMMEND TO OTHERS WHO WOULD
LIKE TO GET INTO R BUT HAVE
LIMITED CODING EXPERIENCE?
SO HOW I GOT INTO STUFF WAS I
USED THE COURSERA DATA SCIENCE
SPECIALIZATION.
I KNOW MANY PEOPLE WHO HAVE
TAKEN THAT SPECIALIZATION, IT
WAS A GOOD WAY TO GET STARTED.
I FOUND IT FRUSTRATING.
I OFTEN DIDN'T KNOW WHAT I WAS
DOING.
AND SO I WAS REALLY DISCOURAGED.
BUT THE THING THAT WORKED REALLY
WELL FOR ME, MAYBE THIS IS JUST
MY LEARNING STYLE, I PREFERRED
TO LEARN FROM A BOOK AND THE
BOOK THAT I USED IS CALLED "R
FOR EVERYONE."
AND I DON'T RECALL THE NAME OF
THE AUTHOR OFFHAND BUT IT WAS I
THOUGHT A REALLY GOOD BOOK
BECAUSE IT DIDN'T ASSUME A WHOLE
LOT OF COMPUTER SCIENCE
EXPERTISE, AND, YOU KNOW, IT WAS
VERY APPROACHABLE TO SOMEONE WHO
LIKE MYSELF DID NOT HAVE ANY
COMPUTER SCIENCE BACKGROUND WHEN
I FIRST SORT OF STARTED DOING
THIS.
I SHOULD HAVE INCLUDED THIS LINK
ON THE SLIDE BUT MAYBE TONY, YOU
CAN HELP ME FOLLOW UP WITH LINKS
FOR PEOPLE.
THE WEBSITE FOR MY LIBRARY FOR
DATA SERVICES PAGE HAS ALL THE
HANDOUTS FROM THE CLASSES I
GIVE AND GOES THROUGH EVERY STEP
OF THE EXERCISES AND WE'RE ALSO
WORKING ON RECORDING THE CLASSES
AND DEVELOPING SOME VIDEO
TUTORIALS.
IF YOU WANT SOMETHING FROM A
LIBRARIAN POINT OF VIEW, THOSE
WOULD BE SOMETHING THAT YOU
MIGHT FIND USEFUL AND THEY ARE
AVAILABLE FOR YOU.
SO, THANK YOU, SOMEONE MENTIONED
HEALTHDATA.GOV, WHICH I SHOULD
HAVE POINTED OUT.
HEALTHDATA.GOV IS THE SUBSET OF
DATA.GOV, CONTAINING
HEALTH-RELATED DATASETS, ALSO
NOT EXHAUSTIVE.
I'VE FOUND IN THE PAST IT TENDS
TO BE MORE STUFF OF SORT OF
LOCAL AND STATE HEALTH
DEPARTMENTS, BUT THERE'S
DEFINITELY USEFUL DATA THERE.
A QUESTION HOW SAS IS DIFFERENT
THAN R?
SAS, SPSS, MATLAB, THEY ARE
SIMILAR, THEY DO SOME OF THE
SAME THINGS AS R EXCEPT A COUPLE
THINGS.
SO THEY ARE POINT AND CLICK BUT
ALSO WRITE CODE AS YOU GO.
SO YOU CAN EITHER WRITE THE CODE
YOURSELF IN SAS, OR YOU CAN LIKE
JUST DO THE POINT AND CLICK
STUFF AND IT WILL WRITE THE CODE
AND DOCUMENT WHAT YOU'VE DONE.
IT'S A NICE INTERMEDIATE ON
THE WAY TO DOING CODE BUT NOT
THERE YET.
YOU SEE WHAT THE CODE IS BECAUSE
IT'S GENERATED FOR YOU, IT'S
NICE BECAUSE IT'S DOCUMENTED.
THE BIG DIFFERENCE AS I SAID IS
THE FACT THAT SAS IS
PROPRIETARY, COSTS MONEY, AND
NOT NECESSARILY EVERYBODY HAS
IT.
THAT'S SORT OF A STICKING POINT
FOR SOME PEOPLE.
I, YOU KNOW, AM VERY MUCH IN
FAVOR OF OPEN SOURCE.
I ALWAYS ENCOURAGE PEOPLE TO,
YOU KNOW, USE THOSE SORTS OF
TOOLS WHEN THEY CAN.
THE OTHER THING IS ESPECIALLY
FOR PEOPLE THAT ARE STUDENTS, IF
THEY GO TO THE TROUBLE TO LEARN
SAS IT'S VERY POSSIBLE THAT AT
THEIR NEXT-- YOU KNOW, THEY
GRADUATE, POSTDOC SOMEWHERE, GET
A JOB SOMEWHERE, AND THEY
LEARNED SAS BUT THE NEW PLACE
USES SPSS, SO LEARNING R OR
PYTHON THAT ARE OPEN SOURCE IS
I FEEL LIKE A MUCH MORE
TRANSFERABLE SKILL THAN USING
SAS OR SPSS.
IT ALSO DEPENDS ON THE TYPE OF
ANALYSIS THAT YOU WANT TO DO.
I'VE FOUND THAT SAS AND SPSS ARE
GOOD FOR TIME SERIES ANALYSIS
THAT'S MORE COMPLICATED IN R, SO
DEPENDING ON THE AREA THAT A
RESEARCHER IS WORKING IN, THAT
MIGHT BE A CONSIDERATION.
SO THANK YOU TO EVERYONE THAT
HAS PUT LINKS TO STUFF I WAS
MENTIONING IN THE CHAT BOX.
SO THE NIH LIBRARY CANVAS GUIDES
IS THE DATA SERVICES PAGE I
MENTIONED.
THERE'S A TAB FOR EACH OF THE
DIFFERENT CLASSES I TEACH, INTRO
TO R OBVIOUSLY WOULD BE THE
PLACE TO START.
AND LINKS TO A COUPLE BOOKS.
WE'RE JUST ABOUT OUT OF TIME.
TONY, ANY LAST THINGS YOU NEED
TO WRAP UP WITH?
>> I HAVE POSTED INTO THE CHAT
BOX THE LINKS FOR PEOPLE TO
EVALUATE THIS WEBINAR AND AFTER
YOU'VE COMPLETED THE EVALUATION YOU
WILL RECEIVE YOUR MLA C.E.
CERTIFICATE.
I DID READ EARLIER THAT THERE
WERE PEOPLE WHO HAD ISSUES
REGARDING A POP-UP BLOCKER.
SO THAT THE LINKS MIGHT NOT
NECESSARILY WORK FOR THEM
THROUGH WEBEX, SO I WILL GO
AHEAD AND GATHER ALL OF THESE
LINKS AND I WILL GO AHEAD AND
COPY THEM AND PUT THEM IN AN
E-MAIL FOR EVERYONE.
THIS WEBINAR WAS RECORDED, AND
WILL BE UPLOADED ONTO THE NN/LM
YOUTUBE PAGE.
IF THERE'S SOMETHING YOU MISSED
YOU CAN GO TO OUR YOUTUBE
CHANNEL, THERE ARE VIDEOS
FROM REGIONAL OFFICES WHO HAVE
DONE WEBINARS AND CLASSES YOU
CAN GO TO.
IF YOU'D LIKE TO SUBSCRIBE TO
OUR YOUTUBE CHANNEL, PLEASE DO
SO.
THIS WEBINAR WILL HOPEFULLY BE
UPLOADED BY NEXT WEEK.
SO ALL OF THE LINKS THAT WERE
LISTED HERE, AS WELL AS THE
UPLOAD OF THE VIDEO, WILL BE
AVAILABLE SHORTLY.
EVERYONE, I WANT TO THANK YOU
FOR ATTENDING.
LISA, THANK YOU SO MUCH FOR
DOING THIS WEBINAR.
I LEARNED A WHOLE LOT, AND IT'S
GOOD TO HEAR FROM OPHELIA AGAIN.
THANK YOU AGAIN.
AND FEEL FREE TO E-MAIL ME IF
YOU HAVE ANY QUESTIONS.
AND ALSO FEEL FREE TO CONTACT
LISA IN REGARDS TO DATA SCIENCE.
THANKS AGAIN.
