Moving around sparse matrices of text data - the limitations of as.h2o
This post is the resolution of a challenge I first wrote about in late 2016: moving large sparse data from an R environment onto an H2O cluster for machine learning purposes. In that post, I experimented with functionality recently added by the H2O team to their supporting R package - the ability of as.h2o() to interpret a sparse Matrix object from R and convert it to an H2O frame. The Matrix and as.h2o method is OK for medium-sized data but broke down on my hardware with a larger dataset - a bag of words from New York Times articles with 300,000 rows and 102,000 columns. Cell entries are the number of times a particular word is used in the document represented by a row, and most cells are empty, so my 12GB laptop has no problem managing the data in a sparse format like Matrix from the Matrix package or a simple triplet matrix from the slam package. I'm not sure what as.h2o does under the hood in converting from Matrix to an H2O frame, but it's too much for my laptop.
My motivation for this is that I want to use R for convenient pre-processing of textual data using the tidytext approach, but H2O for high powered machine learning. tidytext makes it easy to create a sparse matrix with cast_dtm or cast_sparse, but uploading this to H2O can be a challenge.
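For example, here is a minimal sketch of cast_sparse in action; the toy word counts are made up for illustration:

library(dplyr)
library(tidytext)

# Toy word counts, one row per document-word pair
word_counts_toy <- tibble(document = c("a", "a", "b"),
                          word = c("apple", "banana", "apple"),
                          n = c(2L, 1L, 5L))

# cast_sparse() returns a sparse dgCMatrix from the Matrix package
m <- word_counts_toy %>% cast_sparse(document, word, n)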
How to write from R into SVMLight format
After some to-and-fro on Stack Overflow, the best advice was to export the sparse matrix from R into a SVMLight/LIBSVM format text file, then read it into H2O with h2o.importFile(..., parse_type = "SVMLight"). This turned the problem from a difficult and possibly intractable memory management challenge into a difficult and possibly intractable data formatting and file writing challenge: how to efficiently write files in SVMLight format.
SVMLight format combines a data matrix with some modelling information, ie the response value of a model, or "label" as it is (slightly oddly, I think) often called in this world. Instead of a more conventional sparse matrix format, which might convey information in row-column-value triples, each line starts with the label and is followed by column:value pairs. It looks like this:
1 10:3.4 123:0.5 34567:0.231
0.2 22:1 456:0.3
That example is equivalent to two rows of a sparse matrix with at least 34,567 columns. The first row has 1 as the response value, 3.4 in the 10th column of explanatory variables, 0.5 in the 123rd column and 0.231 in the 34,567th column; the second row has 0.2 as the response value, 1 in the 22nd column and 0.3 in the 456th column.
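To make the correspondence concrete, here is the same example built as an R sparse matrix plus a response vector - this is the structure those two SVMLight lines encode:

library(Matrix)

y <- c(1, 0.2)
X <- sparseMatrix(i = c(1, 1, 1, 2, 2),          # row indices
                  j = c(10, 123, 34567, 22, 456), # column indices
                  x = c(3.4, 0.5, 0.231, 1, 0.3), # non-zero values
                  dims = c(2, 34567))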
Writing data from R into this format is a known problem, discussed in this Q&A on Stack Overflow. Unfortunately, the top rated answer to that question, e1071::write.svm, is reported as being slow, and it is also integrated into a workflow that requires you to first fit a Support Vector Machine model to the data, a step I wanted to avoid. That Q&A led me to a GitHub repo by zygmuntz that had an (also slow) solution for writing dense matrices into SVMLight format, but that didn't help me as my data were too large for R to hold in dense format. So I wrote my own version for taking simple triplet matrices and writing SVMLight format. My first version depended on nested paste statements applied to each row of the data and was still too slow at scale, but with the help of yet another Stack Overflow interaction and some data.table wizardry by @Roland, the expected time to write my 300,000 by 83,000 New York Times matrix (having removed stop words) came down from several centuries to two minutes.
I haven't turned this into a package - it would seem better to find an existing package to add it to than to create a package just for this one function; any ideas appreciated. The functions are available on GitHub. One function creates a big character vector; the second writes that vector to file. This means multiple copies of the data need to be held in R, and hence there are memory limitations, but it is much, much faster than writing the file one line at a time (seconds rather than years in my test cases).
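To give a flavour of the approach, here is a minimal sketch of the two functions; the names and details are illustrative rather than the exact code in the repo, and it assumes a slam simple triplet matrix stm and a response vector y:

library(data.table)
library(slam)

# Build one SVMLight-formatted string per row of a simple triplet matrix.
# data.table does the heavy lifting: one grouped paste() per row instead of
# millions of separate calls.
stm_to_svmlight <- function(stm, y) {
  stopifnot(inherits(stm, "simple_triplet_matrix"), length(y) == nrow(stm))
  dt <- data.table(i = stm$i, j = stm$j, v = stm$v)
  setkey(dt, i, j)  # SVMLight wants feature indices in ascending order
  rows <- dt[, .(features = paste(j, v, sep = ":", collapse = " ")), by = i]
  out <- rep("", nrow(stm))  # all-zero rows still need a label-only line
  out[rows$i] <- rows$features
  paste(y, out)
}

# Writing the whole character vector in one call is what makes this fast
# compared to writing the file line by line
write_svmlight <- function(stm, y, file) {
  writeLines(stm_to_svmlight(stm, y), con = file)
}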
Example application - spam detection with the Enron emails
Although I’d used the New York Times bags of words from the UCI machine learning dataset repository for testing the scaling up of this approach, I actually didn’t have anything I wanted to analyse that data for in H2O. So casting around for an example use case I decided on using the Enron email collection for spam detection, first analysed in a 2006 conference paper by V. Metsis, I. Androutsopoulos and G. Paliouras. As well as providing one of the more sensational corporate scandals of recent times, the Enron case has blessed data scientists with one of the largest published sets of emails collected from their natural habitat.
The original authors classified the emails as spam or ham and saved these pre-processed data for future use and reproducibility. I’m not terribly knowledgeable (or interested) in spam detection, so please take the analysis below as a crude and naive example only.
Data
First the data need to be downloaded and unzipped. The files are stored as six Tape ARchive (.tar.gz) files.
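A minimal sketch of that step; the URL pattern is my assumption based on the Enron-Spam page and may have changed since this was written:

# Download and extract the six pre-processed Enron-Spam archives.
# NOTE: this URL is an assumption and may have moved.
baseurl <- "http://www2.aueb.gr/users/ion/data/enron-spam/preprocessed/"
for (i in 1:6) {
  f <- paste0("enron", i, ".tar.gz")
  download.file(paste0(baseurl, f), destfile = f, mode = "wb")
  untar(f)
}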
This creates six folders with the names enron1, enron2 etc, each with a spam and a ham subfolder containing numerous text files. The files look like this example piece of ham (ie non-spam; a legitimate email), chosen at random:
Subject: re : creditmanager net meeting
aidan ,
yes , this will work for us .
vince
" aidan mc nulty " on 12 / 16 / 99 08 : 36 : 14 am
to : vince j kaminski / hou / ect @ ect
cc :
subject : creditmanager net meeting
vincent , i cannot rearrange my schedule for tomorrow so i would like to
confirm that we will have a net - meeting of creditmanager on friday 7 th of
january at 9 . 30 your time .
regards
aidan mc nulty
212 981 7422
The pre-processing has removed duplicates, emails sent to themselves, some of the headers, etc.
Importing the data into R and making tidy data frames of documents and word counts is made easy by Silge and Robinson's tidytext package, which I never tire of saying is a game changer for convenient analysis of text by statisticians:
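The sketch below shows the shape of that step; the details are illustrative, and it assumes the six enron folders sit in the working directory:

library(tidyverse)
library(tidytext)

# One row per email, with the whole text collapsed into a single string
files <- list.files(pattern = "\\.txt$", recursive = TRUE)
emails <- tibble(id = basename(files),
                 text = map_chr(files, ~ paste(readLines(.x, warn = FALSE),
                                               collapse = " ")))

# One row per word per email, then counts of each word in each email
words <- emails %>% unnest_tokens(word, text)
word_counts <- words %>% count(id, word)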
As well as basic word counts, I wanted to experiment with other characteristics of emails, such as the number of words and the number and proportion of stop words (frequently used words like "and" and "the"). I create a traditional data frame with a row for each email, identified by id, and columns indicating whether it is SPAM as well as those other characteristics of interest.
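A minimal sketch of how such a summary frame can be built; the details are illustrative, and the spam/ham flag is inferred from the file names, which end in .spam.txt or .ham.txt in this corpus:

data(stop_words)  # tidytext's stop word lexicon

email_summary <- words %>%
  group_by(id) %>%
  summarise(number_words = n(),
            number_stop_words = sum(word %in% stop_words$word)) %>%
  left_join(emails %>% transmute(id, number_characters = nchar(text)),
            by = "id") %>%
  mutate(SPAM = ifelse(grepl("spam", id), "spam", "ham"),
         proportion_stop_words = number_stop_words / number_words) %>%
  select(id, SPAM, number_characters, number_words, number_stop_words,
         proportion_stop_words)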
Source: local data frame [33,702 x 6]
Groups: id [33,702]
id SPAM number_characters number_words number_stop_words
<chr> <chr> <int> <int> <int>
1 0001.1999-12-10.farmer.ham.txt ham 28 4 0
2 0001.1999-12-10.kaminski.ham.txt ham 24 4 3
3 0001.2000-01-17.beck.ham.txt ham 3486 559 248
4 0001.2000-06-06.lokay.ham.txt ham 3603 536 207
5 0001.2001-02-07.kitchen.ham.txt ham 322 48 18
6 0001.2001-04-02.williams.ham.txt ham 1011 202 133
7 0002.1999-12-13.farmer.ham.txt ham 4194 432 118
8 0002.2001-02-07.kitchen.ham.txt ham 385 64 40
9 0002.2001-05-25.SA_and_HP.spam.txt spam 990 170 80
10 0002.2003-12-18.GP.spam.txt spam 1064 175 63
I next make my sparse matrix as a document term matrix (which is a special case of a simple triplet matrix from the slam package), with a column for each word (having first limited myself to interesting words):
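Something like the following; dropping stop words here is my stand-in for the actual filtering of "interesting" words:

library(tm)  # cast_dtm() produces a tm DocumentTermMatrix

dtm <- word_counts %>%
  anti_join(stop_words, by = "word") %>%
  cast_dtm(document = id, term = word, value = n)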
Now we can load our two datasets onto an H2O cluster for analysis:
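A minimal sketch of the loading step, using the (hypothetical) writer sketched earlier; the memory setting is illustrative, and it assumes the rows of email_summary are in the same order as the rows of dtm:

library(h2o)
h2o.init(max_mem_size = "8G")

# Write the document term matrix to SVMLight with the spam indicator as the
# label, then import it
write_svmlight(dtm, y = as.integer(email_summary$SPAM == "spam"),
               file = "enron.svm")
enron_h2o <- h2o.importFile("enron.svm", parse_type = "SVMLight")

# After import the label is the first column; name it and make it a factor
# so H2O treats the models as classification
names(enron_h2o)[1] <- "SPAM"
enron_h2o$SPAM <- as.factor(enron_h2o$SPAM)

# The small frame of other email characteristics is fine for as.h2o()
extras <- as.h2o(email_summary[, c("number_words", "number_stop_words",
                                   "proportion_stop_words")])
enron_h2o <- h2o.cbind(enron_h2o, extras)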
I now have an H2O frame with 33,602 rows and 26,592 columns; most of the columns represent words and the cells are counts, but some columns represent other variables such as the number of stop words.
Analysis
To give H2O a workout, I decided to fit four different types of models trying to understand which emails were ham and which spam:
- generalized linear model, with elastic net regularization to help cope with the large number of explanatory variables
- random forest
- gradient boosting machine
- neural network
I split the data into training, validation and testing subsets, with the idea that the validation set would be used for choosing tuning parameters, and the testing set used as a final comparison of the predictive power of the final models. As things turned out, I didn't have the patience to do much in the way of tuning. This probably counted against the latter three of my four models, because I'm pretty confident better performance would be possible with more careful choice of some of the meta parameters. The eventual results from my not-particularly-tuned models are discussed below.
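For orientation, here is a minimal sketch of the splitting and fitting calls, assuming the combined enron_h2o frame from above; the tuning values shown are illustrative placeholders, not my actual settings:

# 60/20/20 split into training, validation and test sets
splits <- h2o.splitFrame(enron_h2o, ratios = c(0.6, 0.2), seed = 123)
train <- splits[[1]]; valid <- splits[[2]]; test <- splits[[3]]

y <- "SPAM"
x <- setdiff(names(train), y)

mod_glm <- h2o.glm(x, y, training_frame = train, validation_frame = valid,
                   family = "binomial", alpha = 0.5, lambda_search = TRUE)
mod_rf  <- h2o.randomForest(x, y, training_frame = train,
                            validation_frame = valid, ntrees = 200)
mod_gbm <- h2o.gbm(x, y, training_frame = train, validation_frame = valid)
mod_dl  <- h2o.deeplearning(x, y, training_frame = train,
                            validation_frame = valid,
                            hidden = c(200, 200), epochs = 10)

# Compare predictive power on the held-out test set
sapply(list(mod_glm, mod_rf, mod_gbm, mod_dl),
       function(m) h2o.auc(h2o.performance(m, newdata = test)))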
The humble generalized linear model (GLM) performs pretty well, clearly outperformed only by the neural network. The GLM has a big advantage in interpretability too. Here are the most important variables for the GLM in predicting spam (NEG means that a higher count of the word makes an email less likely to be spam):
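A sketch of how a table like this can be pulled out of the model; the mapping of H2O's generated C-numbered column names back to words, including the off-by-one offset for the label column, is my assumption:

# Assumes feature j of the SVMLight file became column j + 1 of the frame
# (C1 held the label)
lookup <- data.frame(names = paste0("C", seq_len(ncol(dtm)) + 1),
                     word = colnames(dtm), stringsAsFactors = FALSE)

top_coefs <- as.data.frame(mod_glm@model$coefficients_table) %>%
  filter(names != "Intercept", coefficients != 0) %>%
  mutate(sign = ifelse(coefficients < 0, "NEG", "POS"),
         coefficients = abs(coefficients)) %>%
  left_join(lookup, by = "names") %>%
  arrange(sign, desc(coefficients))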
names coefficients sign word
1 C9729 1.1635213 NEG enron
2 C25535 0.6023054 NEG vince
3 C1996 0.5990230 NEG attached
4 C15413 0.4524011 NEG louise
5 C19891 0.3905246 NEG questions
6 C11478 0.2993239 NEG gas
7 proportion_stop_words 0.2935112 NEG <NA>
8 C12866 0.2774074 NEG hourahead
9 C7268 0.2600452 NEG daren
10 C16257 0.2497282 NEG meter
11 C12878 0.2439315 NEG houston
12 C21441 0.2345008 NEG sally
13 C16106 0.2179897 NEG meeting
14 C12894 0.1965571 NEG hpl
15 C16618 0.1909332 NEG monday
16 C11270 0.1873195 NEG friday
17 C8553 0.1704185 NEG doc
18 C7386 0.1673093 NEG deal
19 C21617 0.1636310 NEG schedule
20 C4185 0.1510748 NEG california
21 C12921 0.3695006 POS http
22 C16624 0.2132104 POS money
23 C22597 0.2034031 POS software
24 C15074 0.1957970 POS life
25 C5394 0.1922659 POS click
26 C17683 0.1915608 POS online
27 C25462 0.1703109 POS viagra
28 C16094 0.1605989 POS meds
29 C21547 0.1583438 POS save
30 C9483 0.1498732 POS email
So, who knew, emails containing the words “money”, “software”, “life”, “click”, “online”, “viagra” and “meds” are (or at least were in the time of Enron - things may have changed) more likely to be spam.