Today I will be talking to you about our chapter on text mining and web mining.
The learning outcomes from this chapter are:
Describe text mining and the need for text mining.
Differentiate among text analytics, text mining, and data mining.
Discuss the different application areas for text mining.
Describe the process of carrying out a text mining project.
Recognize the different methods to introduce structure to text-based data.
Describe sentiment analysis.
Develop familiarity with popular applications of sentiment analysis.
Discuss the common methods for sentiment analysis.
What is text mining?
Text mining is the process of finding patterns of useful information in large amounts of unstructured data.
What is the difference between data mining and text mining?
Both of them seek novel and useful patterns, and both are semi-automated processes.
The difference is in the nature of the data.
For data mining, the data is structured, as you find in a database.
For text mining, the data is unstructured, as you find in Word documents, PDF files, text excerpts, XML files, and so on.
In text mining, you first impose structure on the data and then mine the structured data.
There are several benefits of text mining.
For example: in law, court orders; in academic research; in finance, quarterly reports; in medicine, discharge summaries; in biology, molecular interactions; in technology, patent files; and in marketing, customer comments.
Next we will be talking about application areas in text mining, which are information extraction, topic tracking, summarization, categorization, clustering, concept linking, and question answering.
In information extraction, we identify key phrases and relationships within text.
Topic tracking is based on the documents that the user views; from these, text mining can predict other documents of interest.
Summarization means summarizing a document to save time on the part of the reader.
Categorization is identifying the main theme of a document and placing it in a predefined set of categories.
Clustering is grouping similar documents without having a predefined set of categories.
Concept linking means connecting related documents by identifying their shared concepts.
And question answering is finding the best answer to a given question through pattern matching.
Next we will be going over some of the text mining terms.
Unstructured versus structured data: unstructured data has no specific format, such as text, and structured data has a predetermined format, such as a database.
Corpus: a large unstructured set of text prepared for conducting knowledge discovery.
Terms: a term is a single word or multi-word phrase extracted directly from the corpus by natural language processing.
Concepts are features generated from a collection of documents by statistical, rule-based, or manual methods.
Stemming: the process of reducing inflected words to their base or root form.
Stop words are noise words that are filtered out prior to processing of text.
Then synonyms and polysemes, which are also called homonyms.
Synonyms are different words with identical meaning, such as movie, film, and motion picture.
Polysemes, or homonyms, are spelled the same but have different meanings, such as bow, which can be a weapon, the front of a ship, a bend, or a hair bow.
Lastly, tokenizing: a token is a categorized block of text in a sentence.
The assignment of meaning to blocks of text is called tokenizing.
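To make tokenizing, stop words, and stemming a little more concrete, here is a minimal Python sketch. The stop-word list, the token categories, and the suffix-stripping rules are all assumptions made for illustration; they are not from the chapter, and a real project would use an established stemmer and a much longer stop-word list.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "are", "in"}

# Each token is categorized by the role it plays in the text.
# The three categories here are an assumption chosen for the example.
TOKEN_PATTERN = re.compile(
    r"(?P<NUMBER>\d+)|(?P<WORD>[A-Za-z]+)|(?P<PUNCT>[^\w\s])"
)


def tokenize(text):
    """Split text into (category, token) pairs."""
    return [(m.lastgroup, m.group()) for m in TOKEN_PATTERN.finditer(text)]


def naive_stem(word):
    """Strip a few common English suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize, keep words only, drop stop words, then stem what is left."""
    words = [tok.lower() for cat, tok in tokenize(text) if cat == "WORD"]
    return [naive_stem(w) for w in words if w not in STOP_WORDS]


print(tokenize("Pay 20 dollars."))
print(preprocess("The miners are mining the mined texts"))
```

Note that this crude stemmer maps "mining" and "mined" to the same root but leaves "miners" as "miner", which is one reason real stemmers use more careful rules.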
Then we have the term dictionary, which is a collection of terms specific to a narrow field.
Word frequency is the number of times a word is found in a document.
Part-of-speech tagging is the marking of words in a text as nouns, verbs, adjectives, etcetera.
Morphology studies the internal structure of words.
Term-by-document matrix: this represents the relationships between terms and documents.
Singular value decomposition: this is a method to reduce the term-by-document matrix to a manageable size.
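As a rough illustration of word frequency and the term-by-document matrix, the sketch below counts words in a small made-up corpus and arranges the counts with one row per term and one column per document. The corpus and the function names are hypothetical, and the sketch stops before any singular value decomposition step.

```python
from collections import Counter

# Hypothetical three-document corpus, already lowercased for simplicity.
corpus = {
    "doc1": "text mining finds patterns in text",
    "doc2": "data mining finds patterns in databases",
    "doc3": "web mining studies web data",
}


def word_frequency(document):
    """Count how many times each word occurs in one document."""
    return Counter(document.split())


def term_by_document_matrix(docs):
    """Return (terms, document names, matrix) with one row per term."""
    counts = {name: word_frequency(text) for name, text in docs.items()}
    terms = sorted(set().union(*counts.values()))
    # A Counter returns 0 for a word that never appears in a document.
    matrix = [[counts[name][term] for name in docs] for term in terms]
    return terms, list(docs), matrix


terms, doc_names, matrix = term_by_document_matrix(corpus)
print(" " * 10, doc_names)
for term, row in zip(terms, matrix):
    print(f"{term:<10}", row)
```

Even with three short documents most entries are zero, which hints at why a reduction step is needed once the corpus grows.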
Next we will be talking about text mining applications.
Marketing applications enable better customer relationship management through analysis of customer comments, blogs, call center data, etcetera.
Security applications enable deception detection.
In medicine and biology, text mining identifies molecular interactions.
Text mining is also becoming very popular in academics, where it can help with better retrieval of data.
Next is the text mining process.
The input is text and databases.
The output is context-specific knowledge used for decision making.
The controls, also called constraints, are software and hardware limitations, privacy issues, and linguistic limitations, that is, difficulties related to the processing of text.
The mechanisms are techniques and software tools.
The primary purpose of text mining is to process unstructured or textual data, along with structured data if relevant and available, to extract meaningful and actionable patterns for better decision making.
Some of the commercial text mining software tools are SPSS PASW Text Miner and SAS Enterprise Miner.
ClearForest.com offers text analysis and visualization tools.
The IBM Intelligent Miner Data Mining Suite includes data and text mining tools.
There is also data visualization software such as Tableau.
And by the way, we will be using Tableau in this class later on in the assignments.
Next we will be talking about web mining.
Web mining is the process of discovering intrinsic relationships from web data.
Web content mining is the extraction of useful information from web pages.
Web crawlers are used to read through the content of a website automatically.
Web structure mining is the process of extracting useful information from the links embedded in web documents.
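As a small illustration of where web structure mining starts, the sketch below pulls the link targets out of one HTML page using Python's standard html.parser module. The sample page and the class and function names are made up for the example; a real project would fetch many pages and build a graph from the links.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag that the parser sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the link targets found in one HTML document, in page order."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Hypothetical page used only for this example.
sample_page = """
<html><body>
  <a href="https://example.com/products">Products</a>
  <a href="/support">Support</a>
  <p>No link here.</p>
</body></html>
"""

print(extract_links(sample_page))
```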
Then web usage mining: this is the extraction of useful information from data generated through web page visits and transactions.
Three types of data are generated through web page visits.
One is server logs, another is user profiles, and then there is metadata, such as usage data, page attributes, etcetera.
We will also talk about clickstream analysis.
The information collected by web servers can help us better understand user behavior, and analysis of this data is called clickstream analysis.
Clickstream analysis can help in several ways.
For example, it is useful for knowing when visitors access a site.
If a company knew that seventy percent of the software downloads from its website occurred between 7 pm and 11 pm, it could plan for better customer support and network bandwidth during those hours.
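The download example above can be sketched in a few lines of Python. The log format, the file path, and the ten log entries are all invented for illustration; real web server logs have a different layout and would need their own parsing.

```python
from collections import Counter
from datetime import datetime

# Hypothetical download log: "date time path". Not a real server log format.
download_log = [
    "2024-03-01 09:15:00 /downloads/app.zip",
    "2024-03-01 14:40:10 /downloads/app.zip",
    "2024-03-01 19:05:30 /downloads/app.zip",
    "2024-03-01 19:45:00 /downloads/app.zip",
    "2024-03-01 20:10:12 /downloads/app.zip",
    "2024-03-01 20:55:48 /downloads/app.zip",
    "2024-03-01 21:30:00 /downloads/app.zip",
    "2024-03-01 22:05:05 /downloads/app.zip",
    "2024-03-01 22:59:59 /downloads/app.zip",
    "2024-03-01 23:20:00 /downloads/app.zip",
]


def downloads_by_hour(lines):
    """Count log entries per hour of the day (0 to 23)."""
    hours = Counter()
    for line in lines:
        date_part, time_part, _path = line.split(" ", 2)
        stamp = datetime.strptime(f"{date_part} {time_part}", "%Y-%m-%d %H:%M:%S")
        hours[stamp.hour] += 1
    return hours


def share_in_window(hours, start_hour, end_hour):
    """Fraction of entries with start_hour <= hour < end_hour."""
    total = sum(hours.values())
    in_window = sum(n for h, n in hours.items() if start_hour <= h < end_hour)
    return in_window / total if total else 0.0


hours = downloads_by_hour(download_log)
# 7 pm to 11 pm corresponds to hours 19 through 22 inclusive.
print(f"Share of downloads between 7 pm and 11 pm: {share_in_window(hours, 19, 23):.0%}")
```

With this made-up log, seven of the ten downloads fall in the evening window, which matches the seventy percent figure used in the example.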
Lastly we will be talking about sentiment analysis.
What is sentiment analysis, and how does it relate to text mining?
Sentiment analysis tries to answer the question "What do people feel about a certain topic?" by digging into the opinions of many, using a variety of automated tools.
It is also known as opinion mining, subjectivity analysis, and appraisal extraction.
Sentiment analysis shares many characteristics and techniques with text mining.
However, unlike text mining, which categorizes text by conceptual categories of topics, sentiment classification generally deals with two classes, positive versus negative, or with a range of polarity or a range in strength of opinion.
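One simple way to picture the positive versus negative classes and the range of polarity is a word-list scorer, sketched below. This is only an illustration under stated assumptions: the two word lists are tiny and made up, the "neutral" label for a zero score is an addition for the example, and this is not presented as the method the chapter has in mind.

```python
import re

# Hypothetical miniature opinion lexicons; real lexicons hold thousands of words.
POSITIVE = {"good", "great", "love", "excellent", "helpful"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "slow"}


def polarity_score(text):
    """Return a score from -1.0 (all negative) to +1.0 (all positive)."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(w in POSITIVE for w in words)
    negatives = sum(w in NEGATIVE for w in words)
    scored = positives + negatives
    # Text with no opinion words gets a score of 0.0.
    return (positives - negatives) / scored if scored else 0.0


def classify(text):
    """Map the polarity score to a class label."""
    score = polarity_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"  # Assumption: a tie or no opinion words counts as neutral.


for comment in [
    "Great product, excellent support",
    "Terrible and slow",
    "Good screen but poor battery",
]:
    print(comment, "->", classify(comment), polarity_score(comment))
```

A scorer like this ignores negation and sarcasm, so "not good" would still count as positive.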
Now, what are the sources of data for sentiment analysis?
Common data sources for sentiment analysis include opinion-rich internet resources such as social media outlets (Twitter, Facebook, etcetera), online review sites, personal blogs, online communities, chat rooms, newsgroups, and search engine logs.
For many companies, customer call center transcripts, emails, product reviews, and price comparison portals provide another rich source of sentiment data.
Some of the popular sentiment analysis applications are customer relationship management, or CRM, and customer experience management, which are popular voice of the customer (VOC) applications.
Other application areas include voice of the market (VOM) and voice of the employee (VOE).
This is all about the chapter on text mining, web mining, and sentiment analysis.
Thank you.
