This report is for the Data Science Capstone Project course on Coursera, delivered by The Johns Hopkins University. The purpose of this report is to show some exploratory analysis of the data sets given (which contain mainly unstructured text data) and build a simple n-gram model which will be worked across the whole course in order to improve its efficiency, as well as its compatibility, it may be used in order to predict the next word using other languages as well.
We will be using three text files that contain a series of characters that represent words written by users in three distinct formats:
The three formats have their own language, the different languages included are the following:
There are some problems when developing a model which predict the next word, for example, the multiple contractions in French or the multiple nouns in German. So only the English language will be worked on this first report, using bar plots and charts in order to represent the differences between the three text files that correspond to the selected language.
In order to load the data, a function is used so the process can be automated, the fileRead function will read all the lines of the document using the readLines function, we will store the data in three different variables in order to treat the data separately, the data stored in these character variables will be processed later to work with these unstructured data.
fileRead <- function(file) {
camino <- paste('data/final/en_US/en_US.', file, '.txt', sep = '') # The path of the files
con <- file(camino, 'r') # Connection to the file is opened
doc <- readLines(con, skipNul = TRUE) # The document is read depending on the connection opened
close(con) # The connection is closed
return(doc)
}
blogs_file <- fileRead('blogs')
news_file <- fileRead('news')
twitter_file <- fileRead('twitter')
The data is very messy to visualize, because it mostly contains a series of characters and it is not divided by sentences, the following data frame contains the size, number of lines, as well as the number of words of each file, it is easier to visualize, and illustrates more clearly the unique features of the three text files, as well as their common characteristics.
dfsum <- data.frame(
File = c('Blogs', 'News', 'Twitter'),
Size = c(file.size('data/final/en_US/en_US.blogs.txt')/1024/1024,
file.size('data/final/en_US/en_US.news.txt')/1024/1024,
file.size('data/final/en_US/en_US.twitter.txt')/1024/1024),
Lines = c(length(blogs_file), length(news_file), length(twitter_file)),
Words = c(sum(count_words(blogs_file)),
sum(count_words(news_file)),
sum(count_words(twitter_file)))
)
dfsum
## File Size Lines Words
## 1 Blogs 200.4242 899288 38154238
## 2 News 196.2775 77259 2693898
## 3 Twitter 159.3641 2360148 30218166
As the data frame reflect, the twitter file is the most lightest file, but includes many lines comparing to the other two files, this is due to the fact that it represents a lot of tweets separately. This feature of separate lines is seen in the blogs file, because it contains multiple entries that use a lot of colloquial language that represent the almost 40 million words used. Although the news file has the the least amount of lines among the three, it uses more complex words, so it uses the least amount of lines and the least amount of words and still weights the same as the blogs file.
In the next section Bar plots of the summary, bar plots will be used in order to spot these differences graphically
The following bar plots explain more clearly the differences of the three text files. Each bar plot addresses one of the three features previously presented, and a specific color-coding will be used among the three graphs to identify each file.
colors <- c('gray75', 'lightsalmon', 'lightblue') # Colors will be used
The size among the files is very similar and does not represent a big difference, we can see that the twitter file is slightly lower by 40 Mb, but they contain similar size of information, although differently distributed.
colors <- c('gray75', 'lightsalmon', 'lightblue') # Colors will be used
with(dfsum, barplot(Size, names.arg = File,
ylab = 'Space occupied by the file (in MB)',
xlab = 'Name of the file',
main = 'Size of each file',
col = colors))
The bar plot representing the number of lines of each document gives us a better view on how the data is distributed, twitter has the most lines but it is because on the structure of the data (which is mainly given by mini entries of text).
with(dfsum, barplot(Lines/1000000, names.arg = File,
ylab = 'Number of lines (#) in Millions',
xlab = 'Name of the file',
main = 'Number of lines of each file',
col = colors))
The number of words plot is quite interesting, because the news file contains very little words comparing to the other files, this may be due to the type of vocabulary used or due to the inconsistencies of the data. As data coming from a news webpage may be cleaner and more structured when compared on data provided by blogs or twitter, users could use special characters and there is not as many filters as news data has.
with(dfsum, barplot(Words/1000000, names.arg = File,
ylab = 'Number of Words (#) in Millions',
xlab = 'Name of the file',
main = 'Number of words per file',
col = colors))
A corpus will be created so the data can be processed on a more specialized, structured way. A corpus is a set of documents which contains each of the separate files. The encoding used is UTF-8 because some of the files contained various special characters that made impossible to use the more traditional, structured ASCII encoding.
camino <- 'data/final/en_US' # Only the location of the text files is required
engCorpus <- Corpus(DirSource(directory = camino, mode = 'text', encoding = 'UTF-8'))
As I have been saying, the data is very messy and contains a lot of punctuation marks, as well as special characters. So it requires cleaning in order to do the proper text mining, and the further text prediction. To achieve this objective, a function was created by the name of cleanCorp, which uses functions included in the tm package that make some transformations to a corpus variable.
The transformations used are the following:
These transformations improve the efficiency of a text-predicting model because it formats all the words in a proper way, this gives the same weight to all the words, even though they are capitalized or they contained a punctuation mark, reducing duplicates.
Also, for this report only the news file will be analyzed completely, because it contains more structured unstructured data than the other files, it also contains fewer words and fewer lines, which makes the computation feasible and significant.
newsCorpus <- NA
cleanCorp <- function(corpus, lower){
corp <- tm_map(corpus, removePunctuation)
corp <- tm_map(corp, stripWhitespace)
corp <- tm_map(corp, content_transformer(tolower))
return(corp)
}
newsCorpus <- cleanCorp(engCorpus[2]) # The second index represent the news file
A term document matrix is a matrix which contains all the words included in a document, as well as its frequencies, it will be useful further in the report in order to analyze the text and create the n-grams. I have created a function that transforms a corpus into a term document matrix type of file, and removes some elements (in this case words) that are sparse.
tdmGenerator <- function(corp){
tdm <- TermDocumentMatrix(corp)
tdm <- removeSparseTerms(tdm, 0.7) # Maximal allowed sparsity
return(tdm)
}
Now we will make an analysis of the frequency of words within the news file, and know how many top words do we need to include in order to cover most of the vocabulary used, this depends on the most frequent words and represents how some words make up the whole file. And also represents how the data is still messy, having some terms with only 1 frequency.
I have decided to subset the tdm (term document matrix) and only include words which appeared, at least, 20 times. So that the top words are significant, however, the words that were not included on the analysis are still counted towards the total word count, this will present a realistic view on how a few words compose most of the text on the file.
frequency_plots <- function(corpus, file){
tdm <- tdmGenerator(corpus) # Call to the tdmGenerator helper function
df_freq <- data.frame(Palabra = tdm$dimnames$Terms, Frecuencia = tdm$v)
total_freq <- sum(df_freq[,2])
df_freq <- df_freq[df_freq$Frecuencia > 20, ] # Subset of the data which appeared > 20 times
df_freq <- df_freq[order(df_freq[,2], decreasing = TRUE),]
total_words <- length(df_freq[,1])
max_val <- total_words*0.33 # Using one third of the total words
step <- 1 # A step of 1 is used to get an accurate plot
df_freq <- mutate(df_freq, Ratio = (Frecuencia/total_freq) * 100)
df <- data.frame('Top_1' = df_freq[1,3])
for (num in c(seq(1, max_val, step))){
current_num <- paste('Top_', num, sep = '')
df[,current_num] <- round(sum(df_freq[1:num, 3]),4)
}
xdata <- c(0, 1, seq(2,max_val, step))
ydata <- c(0, df[,1:length(df)])
ydata <- as.numeric(unlist(ydata))
plot(ydata ~ xdata, xlim = c(0,max_val), ylim = c(0,100),
main = paste('Number of words chosen and total of words covered in', file),
xlab = 'Number (#) of words chosen', ylab = 'Percent (%) of total frequencies covered')
fit <- lm(ydata ~ xdata)
fit2 <- loess(ydata ~ xdata, degree = 2, span = 0.95)
abline(h = 70, col = 'blue') # Threshold of 70% of the frequencies of the whole document
lines(xdata, fit$fitted.values, col = 'red')
lines(xdata, fit2$fitted, col = 'orange')
legend(legend = c('Points','Linear model', 'Second-order model'), 'topleft',
col = c('black', 'red', 'orange'), pch = c(20,-1,-1), lty = c(0,1,1))
}
frequency_plots(newsCorpus, 'news')
It is quite impressive that with less of a third of the total words, more than the 70% of the total frequencies are achieved, mainly because the first words are the most used by the English language, the so called stop words featuring the word the. The relationship between the percent of the total frequencies and the number of words chosen could not be explained by the linear model (red line), while a loess line using 2 as the degree of the polynomials used was not good enough either.
As it can be seen, there is a great learning curve on the beginning on the graph, this demonstrates that most frequent words are very important and represent most of the vocabulary used by the users. The next data set represent the top 10 words used on the news file, the second column is the frequency in which they are used, and the third column the percentage which these frequencies represent on the whole document.
head(df_freq, 10)
## Palabra Frecuencia Ratio
## 1 the 151153 7.0521320
## 2 and 68159 3.1799982
## 3 for 26944 1.2570882
## 4 that 26281 1.2261555
## 5 with 19736 0.9207947
## 6 said 19164 0.8941077
## 7 was 17615 0.8218382
## 8 his 12099 0.5644860
## 9 from 11630 0.5426045
## 10 but 11574 0.5399918
As I have said before, the top words correspond to stop words that correspond to prepositions, these are usually removed when doing binary classification or sentiment analysis because they provide little information about the author of a text or whether a text corresponds to a certain type of text. Although, for the purposes of this investigation (text predicting), these stop words are key in order to know how a person will write next, so these words will be included in order to know the most probable next word.
The following code chunks create a simple n-gram model in order to know the next word depending on the most common pairs of words, we will use be using all the three files, because we will use random samples and get only 25% of the total lines depending on the file taken. This will be done in order to illustrate how the model work. Although random samples will be used, a seed will be placed for reproducible research purposes.
The n-grams created are the following:
The ngram function provided by the ngram package requires that the data to be in a certain format, and to do not include NA values (this values can be obtained via whitespaces). So the tokenizers package will be used to tokenize (count the frequency of each word) each file, then these NA values will be removed so that the ngram function can operate properly.
First we get our random sample and set the n to 2 (bigram), and remove the NA values, these values represent a small percentage of the whole file, but it affects our n-gram performance and creation.
## Creating a 2-gram model using 25% of the blogs file
set.seed(1023)
n <- 2
num_lines <- dfsum[dfsum$File == 'Blogs', 'Lines'] * .25 # Considering 25% of the total lines
doc <- blogs_file[sample(1:num_lines)]
bitokens <- unlist(tokenize_ngrams(doc, n = n))
print(paste('Ratio of NAs in the bitokens: ', round(sum(is.na(bitokens))/length(bitokens) * 100, digits = 4), '%', sep = ''))
## [1] "Ratio of NAs in the bitokens: 0.0264%"
As the NA are being removed, then we input our bitokens to the ngram function.
bitoken_na <- na.omit(bitokens)
bigram <- ngram(bitoken_na, n)
head(get.phrasetable(bigram), 10)
## ngrams freq prop
## 1 of the 46590 0.005011116
## 2 in the 38212 0.004109997
## 3 to the 21562 0.002319160
## 4 on the 18854 0.002027894
## 5 to be 16976 0.001825901
## 6 and the 14520 0.001561739
## 7 for the 14324 0.001540657
## 8 i was 12510 0.001345548
## 9 and i 12319 0.001325004
## 10 it is 11944 0.001284670
These are the most common pairs from the blogs file, they mainly consist of the and a preposition, because these are the most common types of words in the English language, the informal repetitive language used in blogs also affect on these results.
For the trigram model, a similar methodology is used, but we are focusing on the twitter file and changing the n value to 3, instead of 2. The seed had been modified to maintain the integrity of this new model and reinforce which is the seed that leads to the current calculation.
## Creating a 3-gram model using 25% of the twitter file
set.seed(1022)
n <- 3
num_lines <- dfsum[dfsum$File == 'Twitter', 'Lines'] * .25 # Considering 25% of the total lines
doc <- twitter_file[sample(1:num_lines)]
tritokens <- unlist(tokenize_ngrams(doc, n = n))
print(paste('Ratio of NAs in the tritokens: ', round(sum(is.na(tritokens))/length(tritokens) * 100, digits = 4), '%', sep = ''))
## [1] "Ratio of NAs in the tritokens: 0.2541%"
There are also NA values on this file, but these are not significant to the whole file.
tritoken_na <- na.omit(tritokens)
trigram <- ngram(tritoken_na, n)
head(get.phrasetable(trigram), 10)
## ngrams freq prop
## 1 thanks for the 5843 0.0009165853
## 2 looking forward to 2249 0.0003527983
## 3 thank you for 2196 0.0003444842
## 4 i love you 2053 0.0003220519
## 5 for the follow 1962 0.0003077769
## 6 i want to 1870 0.0002933449
## 7 going to be 1866 0.0002927175
## 8 can't wait to 1794 0.0002814229
## 9 a lot of 1552 0.0002434606
## 10 to be a 1500 0.0002353034
Looking at these pairs, we can grasp some frequent words used on twitter, these correspond to thanks for the (which is probably followed by follow), or for the follow. This denotes an important factor, that the source of data affects which is the most probable word, the type of language is different depending on the sources of information.
This factor makes that the model take one of the following routes:
The first option can make the model very accurate when addressing a certain type of user, while the second model can fit every type of user, but having (probably) a poor performance when compared to the specialized model.
This is the last model, we will be using the news file because it has more structured sentences, and this is an important factor when creating a great number of n-grams, these words are the least probable because they must be together, but are more accurate because they consider more words than the previous n-grams.
## Creating a 4-gram model using 25% of the news file
set.seed(1021)
n <- 4
num_lines <- dfsum[dfsum$File == 'News', 'Lines'] * .25 # Considering 25% of the total lines
doc <- news_file[sample(1:num_lines)]
foutokens <- unlist(tokenize_ngrams(doc, n = n))
print(paste('Ratio of NAs in the tritokens: ', round(sum(is.na(foutokens))/length(foutokens) * 100, digits = 4), '%', sep = ''))
## [1] "Ratio of NAs in the tritokens: 0.0843%"
There are also a few NA values involved, but they can be removed via the na.omit function.
foutoken_na <- na.omit(foutokens)
fougram <- ngram(foutoken_na, n)
head(get.phrasetable(fougram), 10)
## ngrams freq prop
## 1 for the first time 50 8.097533e-05
## 2 the end of the 46 7.449731e-05
## 3 at the end of 41 6.639977e-05
## 4 said in a statement 39 6.316076e-05
## 5 the rest of the 39 6.316076e-05
## 6 in the middle of 36 5.830224e-05
## 7 one of the most 35 5.668273e-05
## 8 when it comes to 32 5.182421e-05
## 9 is one of the 31 5.020471e-05
## 10 at the university of 31 5.020471e-05
Quite interesting results given by these last n-gram, the bias included by the language is observed in this file, as said in a statement appears as a frequent group of words. The group of four sequential words are not frequently encountered (as it can be seen on the low prop on the prop column), but the text predicting model will first start with the biggest n-gram created, and if it does not found a match then it will go the (n-1)-gram model, and so on until it finds a match at least with the monomer.