# Moving largish data from R to H2O - spam detection with Enron emails

## At a glance:

I finally solve my problem of writing large sparse matrices from R into SVMLight format for importing to H2O; and demonstrate application with spam detection trained on the Enron email data comparing a generalized linear model, random forest, gradient boosting machine, and deep neural network.

18 Feb 2017

## Moving around sparse matrices of text data - the limitations of as.h2o

This post is the resolution of a challenge I first wrote about in late 2016: moving large sparse data from an R environment onto an H2O cluster for machine learning purposes. In that post, I experimented with functionality recently added by the H2O team to their supporting R package: the ability for as.h2o() to interpret a sparse Matrix object from R and convert it to an H2O frame. The Matrix and as.h2o method is OK for medium-sized data but broke down on my hardware with a larger dataset - a bag of words from New York Times articles with 300,000 rows and 102,000 columns. Cell entries are the number of times a particular word is used in the document represented by a row and are mostly empty, so my 12GB laptop has no problem managing the data in a sparse format like Matrix from the Matrix package or a simple triplet matrix from the slam package. I’m not sure what as.h2o does under the hood in converting from Matrix to an H2O frame, but it’s too much for my laptop.

My motivation for this is that I want to use R for convenient pre-processing of textual data using the tidytext approach, but H2O for high-powered machine learning. tidytext makes it easy to create a sparse matrix with cast_dtm or cast_sparse, but uploading this to H2O can be a challenge.

### How to write from R into SVMLight format

After some to-and-fro on Stack Overflow, the best advice was to export the sparse matrix from R into a SVMLight/LIBSVM format text file, then read it into H2O with h2o.importFile(..., parse_type = "SVMLight"). This turned the problem from a difficult and possibly intractable memory management challenge into a difficult and possibly intractable data formatting and file writing challenge - how to efficiently write files in SVMLight format.

SVMLight format combines a data matrix with some modelling information, i.e. the response value of a model, or “label” as it is (slightly oddly, I think) often called in this world. Instead of a more conventional sparse matrix format which might convey information in row-column-value triples, it writes one line per row, made up of the label followed by column:value indicators. It looks like this:

1 10:3.4 123:0.5 34567:0.231
0.2 22:1 456:0.3


That example is equivalent to two rows of a sparse matrix with at least 34,567 columns. The first row has 1 as the response value, 3.4 in the 10th column of explanatory variables, 0.5 in the 123rd column, and 0.231 in the 34,567th column; the second row has 0.2 as the response value, 1 in the 22nd column, and 0.3 in the 456th column.
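To make the mapping concrete, here is a minimal base R sketch that reads the two example lines back into a label vector and row-column-value triplets. The object names (`svm_lines`, `triplets`) are mine, chosen only for illustration.

```r
# The two example rows in SVMLight format
svm_lines <- c("1 10:3.4 123:0.5 34567:0.231",
               "0.2 22:1 456:0.3")

# Split each line on spaces: the first token is the label,
# the remaining tokens are column:value pairs
tokens <- strsplit(svm_lines, " ", fixed = TRUE)
labels <- as.numeric(vapply(tokens, `[`, character(1), 1))

# Turn the column:value pairs into row-column-value triplets
pairs_per_row <- lapply(tokens, function(x) x[-1])
row_index <- rep(seq_along(pairs_per_row), lengths(pairs_per_row))
pairs <- strsplit(unlist(pairs_per_row), ":", fixed = TRUE)

triplets <- data.frame(
  row    = row_index,
  column = as.integer(vapply(pairs, `[`, character(1), 1)),
  value  = as.numeric(vapply(pairs, `[`, character(1), 2))
)

print(labels)
print(triplets)
```

The five rows of `triplets` are exactly the five non-empty cells described above, which is all a sparse format needs to store.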

Writing data from R into this format is a known problem discussed in this Q&A on Stack Overflow. Unfortunately, the top rated answer to that question, e1071::write.svm, is reported as being slow, and it is also integrated into a workflow that requires you to first fit a Support Vector Machine model to the data, a step I wanted to avoid. That Q&A led me to a GitHub repo by zygmuntz that had an (also slow) solution for writing dense matrices into SVMLight format, but that didn’t help me as my data were too large for R to hold in dense format. So I wrote my own version for taking simple triplet matrices and writing SVMLight format. My first version depended on nested paste statements that were applied to each row of the data and was still too slow at scale, but with the help of yet another Stack Overflow interaction and some data table wizardry by @Roland I was able to reduce the expected time for writing my 300,000 by 83,000 New York Times matrix (having removed stop words) from several centuries to two minutes.

I haven’t turned this into a package - it would seem better to find an existing package to add it to than to create a package just for this one function; any ideas are appreciated. The functions are available on GitHub but in case I end up moving them, here they are in full. One function creates a big character vector; the second writes that to file. This means multiple copies of the data need to be held in R, which creates memory limitations, but it is much faster than writing one line at a time (seconds rather than years in my test cases).
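As a rough illustration of the two-step idea (build one big character vector, then write it in a single call), here is a minimal base R sketch. It is not the data table based code referred to above: the helper name `triplets_to_svmlight` is hypothetical, and the inputs are plain vectors laid out like the `i`, `j` and `v` components of a simple triplet matrix.

```r
# Hypothetical helper (not the author's function): build one SVMLight line per
# row from triplet vectors i (row), j (column), v (value) and a label vector y
triplets_to_svmlight <- function(i, j, v, y) {
  # Integer indices avoid scientific notation such as "1e+05" in the output
  i <- as.integer(i)
  j <- as.integer(j)

  # Sort so that column indices increase within each row
  ord <- order(i, j)
  i <- i[ord]; j <- j[ord]; v <- v[ord]

  # Build every "column:value" pair in one vectorised call
  pairs <- paste0(j, ":", v)

  # Collapse the pairs belonging to each row into a single string;
  # rows with no non-zero cells keep an empty feature list
  features <- character(length(y))
  collapsed <- tapply(pairs, i, paste, collapse = " ")
  features[as.integer(names(collapsed))] <- collapsed

  # Prefix each row with its label and drop any trailing space
  trimws(paste(y, features))
}

# A 3 x 5 toy matrix in which the second row is entirely zero
i <- c(1, 1, 3, 3)
j <- c(4, 2, 1, 5)
v <- c(2, 1, 3, 1)
y <- c(1, 0, 1)

svm_lines <- triplets_to_svmlight(i, j, v, y)
print(svm_lines)

# Step two: write the whole character vector to disk in one call
out_file <- tempfile(fileext = ".svmlight")
writeLines(svm_lines, out_file)
```

Because the pasting is vectorised over all non-zero cells rather than looped row by row, the cost is mostly memory: the triplets, the pairs and the final lines all exist at once, which is the trade-off described above.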

## Example application - spam detection with the Enron emails

Although I’d used the New York Times bags of words from the UCI machine learning dataset repository for testing the scaling up of this approach, I actually didn’t have anything I wanted to analyse that data for in H2O. So casting around for an example use case I decided on using the Enron email collection for spam detection, first analysed in a 2006 conference paper by V. Metsis, I. Androutsopoulos and G. Paliouras. As well as providing one of the more sensational corporate scandals of recent times, the Enron case has blessed data scientists with one of the largest published sets of emails collected from their natural habitat.

The original authors classified the emails as spam or ham and saved these pre-processed data for future use and reproducibility. I’m not terribly knowledgeable about (or interested in) spam detection, so please take the analysis below as a crude and naive example only.

### Data

First the data need to be downloaded and unzipped. The files are stored as six Tape ARchive (tar) files.

This creates six folders with the names enron1, enron2 etc; each with a spam and a ham subfolder containing numerous text files. The files look like this example piece of ham (ie non-spam; a legitimate email), chosen at random:

Subject: re : creditmanager net meeting
aidan ,
yes , this will work for us .
vince
" aidan mc nulty " on 12 / 16 / 99 08 : 36 : 14 am
to : vince j kaminski / hou / ect @ ect
cc :
subject : creditmanager net meeting
vincent , i cannot rearrange my schedule for tomorrow so i would like to
confirm that we will have a net - meeting of creditmanager on friday 7 th of
january at 9 . 30 your time .
regards
aidan mc nulty
212 981 7422


The pre-processing has removed duplicates, emails sent to themselves, some of the headers, etc.

Importing the data into R and making tidy data frames of documents and word counts is made easy by Silge and Robinson’s tidytext package, which I never tire of saying is a game changer for convenient analysis of text by statisticians:

As well as basic word counts, I wanted to experiment with other characteristics of emails such as the number of words and the number and proportion of stopwords (frequently used words like “and” and “the”). I create a traditional data frame with a row for each email, identified by id, and columns indicating whether it is SPAM and those other characteristics of interest.

Source: local data frame [33,702 x 6]
Groups: id [33,702]

id  SPAM number_characters number_words number_stop_words
<chr> <chr>             <int>        <int>             <int>
1      0001.1999-12-10.farmer.ham.txt   ham                28            4                 0
2    0001.1999-12-10.kaminski.ham.txt   ham                24            4                 3
3        0001.2000-01-17.beck.ham.txt   ham              3486          559               248
4       0001.2000-06-06.lokay.ham.txt   ham              3603          536               207
5     0001.2001-02-07.kitchen.ham.txt   ham               322           48                18
6    0001.2001-04-02.williams.ham.txt   ham              1011          202               133
7      0002.1999-12-13.farmer.ham.txt   ham              4194          432               118
8     0002.2001-02-07.kitchen.ham.txt   ham               385           64                40
9  0002.2001-05-25.SA_and_HP.spam.txt  spam               990          170                80
10        0002.2003-12-18.GP.spam.txt  spam              1064          175                63
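Per-email characteristics like those in the table above could be computed along the following lines. This is a base R sketch on two made-up messages, not the tidytext pipeline used for the real data; the tiny stop word list and the way characters and words are counted are assumptions for illustration only.

```r
# Two made-up messages standing in for emails; ids and text are invented
emails <- data.frame(
  id   = c("a.ham.txt", "b.spam.txt"),
  text = c("yes , this will work for us .",
           "click here and save money on the software"),
  stringsAsFactors = FALSE
)

# A tiny illustrative stop word list; a real analysis would use a fuller one
stop_words_small <- c("this", "will", "for", "us", "and", "on", "the", "here")

# Extract lower-case word tokens from each message
lower_text <- tolower(emails$text)
words <- regmatches(lower_text, gregexpr("[a-z']+", lower_text))

# One row per email, one column per characteristic
emails$number_characters     <- nchar(emails$text)
emails$number_words          <- lengths(words)
emails$number_stop_words     <- vapply(words,
                                       function(w) sum(w %in% stop_words_small),
                                       integer(1))
emails$proportion_stop_words <- emails$number_stop_words / emails$number_words

print(emails[, c("id", "number_words", "number_stop_words", "proportion_stop_words")])
```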


I next make my sparse matrix as a document term matrix (which is a special case of a simple triplet matrix from the slam package), with a column for each word (having first limited myself to interesting words).
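The shape of that object can be sketched in base R with a plain list that mirrors the `i`, `j`, `v`, `nrow`, `ncol` and `dimnames` components of a simple triplet matrix. The word counts below are invented, and the filter (keep words that appear in at least two documents) is an arbitrary stand-in for whatever makes a word “interesting”.

```r
# Made-up tidy word counts: one row per (document, word) pair
counts <- data.frame(
  id   = c("a.txt", "a.txt", "b.txt", "b.txt", "c.txt", "c.txt"),
  word = c("meeting", "gas", "gas", "money", "meeting", "click"),
  n    = c(2L, 1L, 4L, 3L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Keep only words used in at least two documents (illustrative threshold)
docs_per_word <- table(counts$word)
keep   <- names(docs_per_word)[docs_per_word >= 2]
counts <- counts[counts$word %in% keep, ]

# Row and column positions come from matching against the sorted ids and words
docs  <- sort(unique(counts$id))
vocab <- sort(unique(counts$word))

dtm <- list(
  i = match(counts$id, docs),      # row index of each non-zero cell
  j = match(counts$word, vocab),   # column index of each non-zero cell
  v = counts$n,                    # the count stored in that cell
  nrow = length(docs),
  ncol = length(vocab),
  dimnames = list(docs, vocab)
)

# Expand to a dense matrix only to check the result; this is feasible
# here because the toy example is tiny
dense <- matrix(0, dtm$nrow, dtm$ncol, dimnames = dtm$dimnames)
dense[cbind(dtm$i, dtm$j)] <- dtm$v
print(dense)
```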

Now we can load our two datasets onto an H2O cluster for analysis:

I now have an H2O frame with 33602 rows and 26592 columns; most of the columns representing words and the cells being counts; but some columns representing other variables such as number of stopwords.

### Analysis

To give H2O a workout, I decided to fit four different types of models trying to understand which emails were ham and which spam:

• generalized linear model, with elastic net regularization to help cope with the large number of explanatory variables
• random forest
• gradient boosting machine
• neural network
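For reference, the elastic net penalty mentioned for the generalized linear model is usually written as below. The symbols are my notation, not taken from the analysis: $\ell(\beta)$ is the log-likelihood, $N$ the number of observations, $\lambda \ge 0$ the overall penalty strength, and $\alpha \in [0, 1]$ the mix between the two penalties.

$$
\hat{\beta} = \arg\min_{\beta}\; -\frac{1}{N}\,\ell(\beta) + \lambda\left(\alpha\,\lVert\beta\rVert_1 + \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\right)
$$

With $\alpha = 1$ this is the lasso, which can set coefficients exactly to zero; with $\alpha = 0$ it is ridge regression, which only shrinks them. The lasso part is what helps when there are tens of thousands of word columns.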

I split the data into training, validation and testing subsets, with the idea that the validation set would be used for choosing tuning parameters, and the testing set used as a final comparison of the predictive power of the final models. As things turned out, I didn’t have the patience to do much in the way of tuning. This probably counted against the latter three of my four models, because I’m pretty confident better performance would be possible with more careful choice of some of the meta parameters. Here are the eventual results from my not-particularly-tuned models:

The humble generalized linear model (GLM) performs pretty well; it is clearly outperformed only by the neural network. The GLM has a big advantage in interpretability too. Here are the most important variables for the GLM in predicting spam (NEG means that a higher count of the word makes an email less likely to be spam):

                   names coefficients sign       word
1                  C9729    1.1635213  NEG      enron
2                 C25535    0.6023054  NEG      vince
3                  C1996    0.5990230  NEG   attached
4                 C15413    0.4524011  NEG     louise
5                 C19891    0.3905246  NEG  questions
6                 C11478    0.2993239  NEG        gas
7  proportion_stop_words    0.2935112  NEG       <NA>
9                  C7268    0.2600452  NEG      daren
10                C16257    0.2497282  NEG      meter
11                C12878    0.2439315  NEG    houston
12                C21441    0.2345008  NEG      sally
13                C16106    0.2179897  NEG    meeting
14                C12894    0.1965571  NEG        hpl
15                C16618    0.1909332  NEG     monday
16                C11270    0.1873195  NEG     friday
17                 C8553    0.1704185  NEG        doc
18                 C7386    0.1673093  NEG       deal
19                C21617    0.1636310  NEG   schedule
20                 C4185    0.1510748  NEG california
21                C12921    0.3695006  POS       http
22                C16624    0.2132104  POS      money
23                C22597    0.2034031  POS   software
24                C15074    0.1957970  POS       life
25                 C5394    0.1922659  POS      click
26                C17683    0.1915608  POS     online
27                C25462    0.1703109  POS     viagra
28                C16094    0.1605989  POS       meds
29                C21547    0.1583438  POS       save
30                 C9483    0.1498732  POS      email


So, who knew, emails containing the words “money”, “software”, “life”, “click”, “online”, “viagra” and “meds” are (or at least were in the time of Enron - things may have changed) more likely to be spam.

Here’s the code for the analysis all together:
