The simplest approach to the problem (and the most commonly used so far) is to split sentences into tokens. Put simply, while words carry abstract and subjective meanings for the people who use and receive them, tokens have an objective interpretation: an ordered sequence of characters (or bytes). Once the sentences are split, the order of the tokens is disregarded. This approach to the problem is known as the bag-of-words model.
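To make this concrete, here is a minimal sketch in base R (the sentence is just an illustration, unrelated to the corpus used below): the text is lowercased, split on whitespace into tokens, and reduced to a table of counts in which the original word order is lost.
# minimal bag-of-words sketch in base R (illustrative only)
sentence <- "doctors visit hospitals and hospitals employ doctors"
tokens <- unlist(strsplit(tolower(sentence), "\\s+"))  # split on whitespace
table(tokens)                                          # token counts; word order is discarded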
A term frequency is a dictionary in which a weight is assigned to each token. In the first example, we construct a term frequency matrix from a corpus (a collection of documents) with the R package tm.
require(tm)
# four toy documents, collected into a character vector
doc1 <- "drugs hospitals doctors"
doc2 <- "smog pollution environment"
doc3 <- "doctors hospitals healthcare"
doc4 <- "pollution environment water"
corpus <- c(doc1, doc2, doc3, doc4)
# wrap the character vector as a tm Corpus
tm_corpus <- Corpus(VectorSource(corpus))
In this example, we created a corpus of class Corpus, defined by the package tm, using two functions: Corpus and VectorSource, where VectorSource returns a VectorSource object from a character vector. The object tm_corpus is a list of our documents with additional (and optional) metadata describing each document.
str(tm_corpus)
List of 4
$ 1:List of 2
..$ content: chr "drugs hospitals doctors"
..$ meta :List of 7
.. ..$ author : chr(0)
.. ..$ datetimestamp: POSIXlt[1:1], format: "2017-06-03 00:31:34"
.. ..$ description : chr(0)
.. ..$ heading : chr(0)
.. ..$ id : chr "1"
.. ..$ language : chr "en"
.. ..$ origin : chr(0)
.. ..- attr(*, "class")= chr "TextDocumentMeta"
..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
[truncated]
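Besides str, individual documents can be inspected with the accessor functions that tm (via the NLP package) provides; a quick sketch:
content(tm_corpus[[1]])   # the text of the first document
meta(tm_corpus[[1]])      # its metadata (id, language, timestamp, ...)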
Once we have a Corpus, we can preprocess the tokens it contains to improve the quality of the final output (the term frequency matrix). To do this we use the tm function tm_map, which, similarly to the apply family of functions, transforms the documents in the corpus by applying a function to each of them.
tm_corpus <- tm_map(tm_corpus, tolower)                             # lowercase all text
tm_corpus <- tm_map(tm_corpus, removeWords, stopwords("english"))   # drop common English stopwords
tm_corpus <- tm_map(tm_corpus, removeNumbers)                       # drop digits
tm_corpus <- tm_map(tm_corpus, PlainTextDocument)                   # re-wrap as text documents
tm_corpus <- tm_map(tm_corpus, stemDocument, language="english")    # stem each token
tm_corpus <- tm_map(tm_corpus, stripWhitespace)                     # collapse extra whitespace
tm_corpus <- tm_map(tm_corpus, PlainTextDocument)
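As a side note, in recent versions of tm, base R functions such as tolower return plain character vectors rather than text documents, which is why the pipeline above re-wraps the corpus with PlainTextDocument (a step that also loses the original document ids, as the column names of the matrix below will show). An alternative sketch is to wrap such functions with tm's content_transformer, which preserves the corpus structure:
# alternative to the tolower line above, keeping documents intact
tm_corpus <- tm_map(tm_corpus, content_transformer(tolower))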
Following these transformations, we finally create the term frequency matrix with
tdm <- TermDocumentMatrix(tm_corpus)
which gives a
<<TermDocumentMatrix (terms: 8, documents: 4)>>
Non-/sparse entries: 12/20
Sparsity : 62%
Maximal term length: 9
Weighting : term frequency (tf)
that we can view by transforming it into a matrix with
as.matrix(tdm)
Docs
Terms character(0) character(0) character(0) character(0)
doctor 1 0 1 0
drug 1 0 0 0
environ 0 1 0 1
healthcar 0 0 1 0
hospit 1 0 1 0
pollut 0 1 0 1
smog 0 1 0 0
water 0 0 0 1
Each row reports the frequency of a token - which, as you may have noticed, has been stemmed (e.g. environment to environ) - in each document (4 documents, 4 columns).
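The stemming itself is done by the Porter/Snowball stemmer, which tm relies on via the SnowballC package; assuming SnowballC is installed, the stems can be checked directly on individual words:
require(SnowballC)
# wordStem applies the Snowball (Porter) stemmer to each word
wordStem(c("environment", "doctors", "hospitals"), language = "english")
# [1] "environ" "doctor"  "hospit"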
In the previous lines, we have weighted each token/document pair with its absolute frequency (i.e. the number of times the token appears in the document).
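Other weighting schemes can be plugged in at construction time. As a sketch, a tf-idf weighted matrix can be obtained by passing tm's weightTfIdf function in the control list:
# term frequency-inverse document frequency weighting instead of raw counts
tdm_tfidf <- TermDocumentMatrix(tm_corpus, control = list(weighting = weightTfIdf))
as.matrix(tdm_tfidf)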