You should all go watch Branagh’s Hamlet (1996)
Earlier this year I watched Kenneth Branagh’s Hamlet (1996) and wow, I cannot recommend this movie enough. Not only is it by far the best Hamlet I have ever seen (on stage or screen), it has a fair claim to being the best Shakespeare ‘full stop’ (or ‘period’ as our American cousins say), and makes it to my list of favourite films of any sort. It is gorgeously filmed on 70 mm film (including the spectacular setting of Blenheim Palace) and 242 minutes long, with brilliant direction, amazing atmosphere and the ability to surface all the subplots and character quirks. Despite my knowing the story backwards - having seen so many renderings of it, and with every second line feeling like a quotation built into my cultural DNA - this film moved me in new ways and to depths that I rarely reach with any work of art.
And yes, it does have a star-studded cast (perhaps excessively star-studded - Robin Williams and Gerard Depardieu seem wasted, and imagine using John Gielgud and Judi Dench for walk-on non-speaking parts), but the most important thing is that these people can really act. Branagh himself is a faultless tour-de-force, Kate Winslet is perfect as Ophelia, Julie Christie a wonderful Gertrude, and Derek Jacobi’s Claudius is spot-on. (Well noticed, observers of quality theatrical film and TV - Derek Jacobi became famous for the title role in I, Claudius - but this is unrelated). Richard Briers seems to be made to play Polonius, with the perfect combination of humour, pomposity and pathos.
Some of the greatness of the movie comes from seeing famous stars in what are sometimes seen as also-ran parts, in ways that make you see the greatness of both the role and the actor - Billy Crystal as a gravedigger, and Charlton Heston bringing dignity and depth to the leader of the players.
This movie’s interpretation of the story plays up the role and dark aspects of Hamlet senior, whose statue broods over Elsinore and whose relationship to ‘our’ Hamlet is mirrored in Fortinbras senior and junior… ok I don’t have time to explain that here, but well, it’s complicated, right? Which is the point.
Anyway, “do yourself a favour” as Molly Meldrum would have said, and see this movie.
Now, I need something data-related if I am going to justify having a blog on this. So I thought I would tie my refreshed enthusiasm for the full text of Hamlet with a familiarisation project that has been on my to-do list for months (understand biterm topic modelling). As it happened, events (in the form of a blog post on another topic) overtook the topic model focus. I think the biggest interest here turns out to actually be the start of a generalisable, efficient way of representing text from a play in a tidy data model. So let’s have a go.
Basic structure of the raw data
When I first download one of the five or more versions of Hamlet from Project Gutenberg, it looks like this:
  [1] "HAMLET, PRINCE OF DENMARK"
  [2] ""
  [3] "by William Shakespeare"
  [4] ""
  [5] ""
  [6] ""
  [7] ""
  [8] "PERSONS REPRESENTED."
  [9] ""
 [10] "Claudius, King of Denmark."
 [11] "Hamlet, Son to the former, and Nephew to the present King."
 [12] "Polonius, Lord Chamberlain."
 [13] "Horatio, Friend to Hamlet."
 [14] "Laertes, Son to Polonius."
 [15] "Voltimand, Courtier."
 [16] "Cornelius, Courtier."
 [17] "Rosencrantz, Courtier."
 [18] "Guildenstern, Courtier."
 [19] "Osric, Courtier."
 [20] "A Gentleman, Courtier."
 [21] "A Priest."
 [22] "Marcellus, Officer."
 [23] "Bernardo, Officer."
 [24] "Francisco, a Soldier"
 [25] "Reynaldo, Servant to Polonius."
 [26] "Players."
 [27] "Two Clowns, Grave-diggers."
 [28] "Fortinbras, Prince of Norway."
 [29] "A Captain."
 [30] "English Ambassadors."
 [31] "Ghost of Hamlet's Father."
 [32] ""
 [33] "Gertrude, Queen of Denmark, and Mother of Hamlet."
 [34] "Ophelia, Daughter to Polonius."
 [35] ""
 [36] "Lords, Ladies, Officers, Soldiers, Sailors, Messengers, and other"
 [37] "Attendants."
 [38] ""
 [39] "SCENE. Elsinore."
 [40] ""
 [41] ""
 [42] ""
 [43] "ACT I."
 [44] ""
 [45] "Scene I. Elsinore. A platform before the Castle."
 [46] ""
 [47] "[Francisco at his post. Enter to him Bernardo.]"
 [48] ""
 [49] "Ber."
 [50] "Who's there?"
 [51] ""
 [52] "Fran."
 [53] "Nay, answer me: stand, and unfold yourself."
 [54] ""
 [55] "Ber."
 [56] "Long live the king!"
 [57] ""
 [58] "Fran."
 [59] "Bernardo?"
 [60] ""
 [61] "Ber."
 [62] "He."
 [63] ""
 [64] "Fran."
 [65] "You come most carefully upon your hour."
 [66] ""
 [67] "Ber."
 [68] "'Tis now struck twelve. Get thee to bed, Francisco."
 [69] ""
 [70] "Fran."
 [71] "For this relief much thanks: 'tis bitter cold,"
 [72] "And I am sick at heart."
 [73] ""
 [74] "Ber."
 [75] "Have you had quiet guard?"
 [76] ""
 [77] "Fran."
 [78] "Not a mouse stirring."
 [79] ""
 [80] "Ber."
 [81] "Well, good night."
 [82] "If you do meet Horatio and Marcellus,"
 [83] "The rivals of my watch, bid them make haste."
 [84] ""
 [85] "Fran."
 [86] "I think I hear them.--Stand, ho! Who is there?"
 [87] ""
 [88] "[Enter Horatio and Marcellus.]"
 [89] ""
 [90] "Hor."
 [91] "Friends to this ground."
 [92] ""
 [93] "Mar."
 [94] "And liegemen to the Dane."
 [95] ""
 [96] "Fran."
 [97] "Give you good-night."
 [98] ""
 [99] "Mar."
[100] "O, farewell, honest soldier;"
So this is interesting. Here are some things we need to take into account in putting this into a data model that loses no or minimal information from its structure and content:
- We have about 40 lines of preamble, then we are on to a combination of dialogue and stage directions.
- The play is structured into Acts and Scenes. Scenes have numbers and descriptions (eg Scene I. Elsinore. A platform before the Castle.) but Acts just have numbers. Scenes are strictly hierarchically ordered under Acts (ie there is no Scene that continues over the boundary between two Acts).
- The stage directions are in square brackets (eg [Enter Horatio and Marcellus]). A stage direction can be seen as applying until it is countermanded (for example, once Francisco and Bernardo have entered, they should be considered as present on stage until we are told otherwise).
- A continuous line of speech can go over several lines, such as Francisco in lines 71 and 72 of the above. The line breaks may be significant (and sometimes certainly are when the line is in verse), or not.
- The character speaking is referred to by abbreviation (eg Fran.), which needs decoding to be fully human-readable (an actor learning the play, or a reader familiar with it, would know that Hor. is Horatio, but only by mental decoding - something that should be done in the data model).
Here is the R code that sets up my session, including preparation for things I haven’t talked about yet, downloads the play and gets us to this point of looking at the first 100 lines:
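As a minimal sketch of the very first step (not the original session code), the idea is to hold the raw text as a data frame with one row per original line and to record each line's position before anything is filtered out. The object names `raw_lines`, `hamlet_sketch` and `hamlet_nonblank` are hypothetical, and the text is hard-coded from the extract above rather than downloaded.

```r
library(dplyr)

# Hypothetical stand-in for the downloaded text: a few lines in the same
# shape as the raw extract
raw_lines <- c("ACT I.", "",
               "Scene I. Elsinore. A platform before the Castle.", "",
               "[Francisco at his post. Enter to him Bernardo.]", "",
               "Ber.", "Who's there?")

# One row per original line, with its position preserved
hamlet_sketch <- tibble(text = raw_lines) %>%
  mutate(original_line_number = row_number())

# Drop empty lines only after numbering, so the numbers still match the source
hamlet_nonblank <- hamlet_sketch %>%
  filter(text != "")

hamlet_nonblank
```

Numbering before filtering is what lets later tables refer back to the original text unambiguously.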
With one annoying exception, the stage directions are all single lines so we can pick them up easily with a regular expression that looks for lines that begin and finish in square brackets.
> # Number of times each stage direction used
> hamlet %>%
+   filter(grepl("^\\[.*\\]$", text)) %>%
+   count(text, name = "times_used", sort = TRUE)
# A tibble: 129 x 2
   text                                   times_used
   <chr>                                       <int>
 1 [Exit.]                                        19
 2 [Exeunt.]                                      13
 3 [Sings.]                                       11
 4 [Enter Polonius.]                               5
 5 [Exit Polonius.]                                5
 6 [Enter Hamlet.]                                 4
 7 [Exeunt Rosencrantz and Guildenstern.]          4
 8 [Dies.]                                         3
 9 [Enter Rosencrantz and Guildenstern.]           3
10 [Exit Ghost.]                                   3
Here is the exception - it comes from near the end, when a whole bunch of people (too many to list in a single line) come in with weapons to be used in the duel between Hamlet and Laertes:
Scenes and Acts
Similarly, we can use a regular expression to find with nearly 100% accuracy the beginnings of Scenes and Acts. Note that the capitalisation of Act is inconsistent:
> # Scenes and acts. Sometimes upper case sometimes not
> hamlet %>%
+   filter(grepl("^Scene\\s", text, ignore.case = TRUE) | grepl("^Act\\s", text, ignore.case = TRUE))
# A tibble: 26 x 2
   text                                               original_line_number
   <chr>                                                             <int>
 1 ACT I.                                                               43
 2 Scene I. Elsinore. A platform before the Castle.                     45
 3 Scene II. Elsinore. A room of state in the Castle.                  378
 4 Scene III. A room in Polonius's house.                              822
 5 Scene IV. The platform.                                            1029
 6 Scene V. A more remote part of the Castle.                         1211
 7 Act II.                                                            1560
 8 Scene I. A room in Polonius's house.                               1562
 9 Scene II. A room in the Castle.                                    1779
10 ACT III.                                                           2744
# ... with 16 more rows
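Once the heading rows can be found, one way to turn them into columns is to copy each heading into its own column and fill it downwards. The following is a sketch on toy data under that assumption, not the code used for the real table; `toy` and `toy_tagged` are hypothetical names.

```r
library(dplyr)
library(tidyr)

# Toy data in the same shape as the raw text
toy <- tibble(text = c(
  "ACT I.",
  "Scene I. Elsinore. A platform before the Castle.",
  "Who's there?",
  "Scene II. Elsinore. A room of state in the Castle.",
  "Though yet of Hamlet our dear brother's death",
  "Act II.",
  "Scene I. A room in Polonius's house.",
  "Give him this money and these notes, Reynaldo."
))

toy_tagged <- toy %>%
  mutate(
    is_act   = grepl("^Act\\s", text, ignore.case = TRUE),
    is_scene = grepl("^Scene\\s", text, ignore.case = TRUE),
    act      = ifelse(is_act, text, NA_character_),
    scene    = ifelse(is_scene, text, NA_character_)
  ) %>%
  # carry each heading down to the rows beneath it
  fill(act, scene, .direction = "down") %>%
  # drop the heading rows themselves, keeping only the other lines
  filter(!is_act, !is_scene) %>%
  select(text, act, scene)

toy_tagged
```

Because every Act heading is immediately followed by a Scene heading in this text, filling both columns downwards gives the right Scene for every remaining line.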
Generally, a line indicating a new speaker can be identified by the pattern of “capital letter - lower case letters - full stop - end of line”. So it is straightforward to pull out lines that meet this pattern with a regular expression and count them:
> hamlet %>%
+   filter(grepl("^[A-Z][a-z]+\\.$", text)) %>%
+   count(text, name= "number_speeches", sort = TRUE) %>%
+   slice(1:20)
# A tibble: 20 x 2
   text   number_speeches
   <chr>            <int>
 1 Ham.               358
 2 Hor.               108
 3 King.              102
 4 Pol.                86
 5 Queen.              69
 6 Laer.               62
 7 Oph.                57
 8 Ros.                45
 9 Mar.                31
10 Guil.               29
11 Osr.                25
12 Ber.                19
13 Ghost.              14
14 Rey.                13
15 Fran.                8
16 Capt.                7
17 Fort.                4
18 All.                 3
19 Danes.               3
20 Gent.                3
358 distinct speeches to learn if you want to play Hamlet, versus (for example) 29 for Guildenstern. Basically that simple search-and-count works, although there are annoyingly some context-specific uses of All.; and there are a few spoken lines (Good.) that get mixed up into this pattern. It’s very difficult to identify an algorithm that can tell these apart, particularly because (for example) Good. is a plausible abbreviation of a character name. I think only a human could be sure of the difference in a specific context, so we will need a bit of careful manual coding of some parts.
To decode abbreviations into character names and add some useful metadata for later, I created by hand a vector of the eight main characters (using my human interpretation to decide on what constituted “main”), and a table that relates the abbreviations (King., etc) to full names of the character (“Claudius, King of Denmark.”) and human-friendly abbreviations for plots and the like using the term more commonly used today (“Claudius”).
I also made a vector of words that looked like character names and feature as single word lines in the play but aren’t referring to characters:
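To illustrate the shape of these hand-made objects, here is a small sketch with only a few entries. The names `speaker_lookup`, `not_speakers` and `toy_decoded` are hypothetical, and the contents are a tiny illustrative subset, not the full tables used for the analysis.

```r
library(dplyr)

# Hypothetical lookup table: a few rows typed by hand for illustration
speaker_lookup <- tibble(
  speaker_abb = c("Ham.", "King.", "Hor."),
  speaker     = c("Hamlet, Son to the former, and Nephew to the present King.",
                  "Claudius, King of Denmark.",
                  "Horatio, Friend to Hamlet."),
  speaker_sh  = c("Hamlet", "Claudius", "Horatio")
)

# Hypothetical vector of one-word spoken lines that only look like labels
not_speakers <- c("Good.")

toy <- tibble(text = c("King.", "Good.", "Ham.", "Hor."))

toy_decoded <- toy %>%
  # a speaker label matches the pattern AND is not on the exclusion list
  mutate(is_speaker = grepl("^[A-Z][a-z]+\\.$", text) &
           !text %in% not_speakers) %>%
  # decode abbreviations; rows with no match get NA in the new columns
  left_join(speaker_lookup, by = c("text" = "speaker_abb"))

toy_decoded
```

Keeping the exceptions in a plain vector makes the manual judgement calls visible and easy to revise.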
A table of data with “spoken line” as grain
I use these lookup tables and vectors to create my first tidier version of the data, hamlet_lines. In this version, each row of the table is a complete line of dialogue. Rows of the original data that are not dialogue but refer to the Act, Scene, stage direction or who is speaking have been pivoted wider to appear as columns. Here is how that object looks:
> hamlet_lines
# A tibble: 3,945 x 11
   text        original_line_nu~ speaker_abb speaker   speaker_sh main_character last_stage_direction act    scene      new_speaker_thi~ line_number_thi~
   <chr>                   <int> <chr>       <chr>     <chr>      <lgl>          <chr>                <chr>  <chr>      <lgl>                       <dbl>
 1 Who's ther~                50 Ber.        Bernardo~ Bernardo   FALSE          Francisco at his po~ ACT I. Scene I. ~ TRUE                            1
 2 Nay, answe~                53 Fran.       Francisc~ Francisco  FALSE          Francisco at his po~ ACT I. Scene I. ~ TRUE                            1
 3 Long live ~                56 Ber.        Bernardo~ Bernardo   FALSE          Francisco at his po~ ACT I. Scene I. ~ TRUE                            1
 4 Bernardo?                  59 Fran.       Francisc~ Francisco  FALSE          Francisco at his po~ ACT I. Scene I. ~ TRUE                            1
 5 He.                        62 Ber.        Bernardo~ Bernardo   FALSE          Francisco at his po~ ACT I. Scene I. ~ TRUE                            1
 6 You come m~                65 Fran.       Francisc~ Francisco  FALSE          Francisco at his po~ ACT I. Scene I. ~ TRUE                            1
 7 'Tis now s~                68 Ber.        Bernardo~ Bernardo   FALSE          Francisco at his po~ ACT I. Scene I. ~ TRUE                            1
 8 For this r~                71 Fran.       Francisc~ Francisco  FALSE          Francisco at his po~ ACT I. Scene I. ~ TRUE                            1
 9 And I am s~                72 Fran.       Francisc~ Francisco  FALSE          Francisco at his po~ ACT I. Scene I. ~ FALSE                           2
10 Have you h~                75 Ber.        Bernardo~ Bernardo   FALSE          Francisco at his po~ ACT I. Scene I. ~ TRUE                            1
# ... with 3,935 more rows
Note that we have kept the successive lines of text spoken by a single character as per the original - for instance rows 71 and 72 of the original data, spoken by Francisco, are now rows 8 and 9 but are still two separate rows. A logical flag new_speaker_this_line helps us identify such cases for future use.
Here’s the chunk of code that creates this table, with “line” as its granularity, from the original messy data. Most of this code is dealing with various quirks in the data, such as 2 Clown (for “second clown”, or second gravedigger) having a different structure of character abbreviation from most of the roles.
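A stripped-down sketch of the core idea, ignoring all the quirks: mark which rows are stage directions or speaker labels, copy them into their own columns, fill downwards, and then drop the marker rows so only dialogue remains. This is an illustration on toy data with hypothetical object names (`toy`, `toy_lines`), not the full tidying code.

```r
library(dplyr)
library(tidyr)

toy <- tibble(text = c(
  "[Francisco at his post. Enter to him Bernardo.]",
  "Ber.", "Who's there?",
  "Fran.", "For this relief much thanks: 'tis bitter cold,",
  "And I am sick at heart.",
  "Ber.", "Have you had quiet guard?"
)) %>%
  mutate(original_line_number = row_number())

toy_lines <- toy %>%
  mutate(
    is_direction = grepl("^\\[.*\\]$", text),
    is_speaker   = grepl("^[A-Z][a-z]+\\.$", text),
    last_stage_direction = ifelse(is_direction, text, NA_character_),
    speaker_abb  = ifelse(is_speaker, text, NA_character_),
    # a new speech starts at every speaker label
    speech_number = cumsum(is_speaker)
  ) %>%
  # carry the latest direction and speaker down to the dialogue rows
  fill(last_stage_direction, speaker_abb) %>%
  filter(!is_direction, !is_speaker) %>%
  group_by(speech_number) %>%
  mutate(line_number_this_speech = row_number(),
         new_speaker_this_line   = line_number_this_speech == 1) %>%
  ungroup() %>%
  select(text, original_line_number, speaker_abb, last_stage_direction,
         new_speaker_this_line, line_number_this_speech)

toy_lines
```

In this toy version Francisco's two-line speech stays as two rows, with new_speaker_this_line TRUE only on the first of them.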
This tidy table of data, at the granularity of line of dialogue, makes it easy to count lines per speaker, Act, Scene, etc. For example, we can see that Act II, Scene II has the most spoken lines (590):
> count(hamlet_lines, act, scene)
# A tibble: 20 x 3
   act      scene                                                   n
   <chr>    <chr>                                               <int>
 1 ACT I.   Scene I. Elsinore. A platform before the castle.      190
 2 ACT I.   Scene II. Elsinore. A room of state in the castle.    280
 3 ACT I.   Scene III. A room in Polonius's house.                141
 4 ACT I.   Scene IV. The platform.                               101
 5 ACT I.   Scene V. A more remote part of the castle.            212
 6 Act II.  Scene I. A room in Polonius's house.                  130
 7 Act II.  Scene II. A room in the castle.                       590
 8 ACT III. Scene I. A room in the castle.                        196
 9 ACT III. Scene II. A hall in the castle.                       383
10 ACT III. Scene III. A room in the castle.                      102
11 ACT III. Scene IV. Another room in the castle.                 236
12 ACT IV.  Scene I. A room in the castle.                         46
13 ACT IV.  Scene II. Another room in the castle.                  28
14 ACT IV.  Scene III. Another room in the castle.                 71
15 ACT IV.  Scene IV. A plain in Denmark.                          68
16 ACT IV.  Scene V. Elsinore. A room in the castle.              228
17 ACT IV.  Scene VI. Another room in the castle.                  29
18 ACT IV.  Scene VII. Another room in the castle.                212
19 ACT V.   Scene I. A churchyard.                                283
20 ACT V.   Scene II. A hall in the castle.                       419
A table of data with “spoken word” as grain
My next step is to make a table or data frame with spoken word as the grain. Here I want not just bags of words, but the original sequence preserved, and whether a word was originally at the beginning of a line, as well as who is speaking (and all the different abbreviations of their name and groups that person belongs to), the last stage direction, the Act and Scene, etc. For future use I also want a stemmed version of each word (eg so “answered”, “answering” and “answers” all reduce to their stem “answer”), and a flag of whether each word is a stop word or not.
Here’s how that table is going to look, capturing all 30,000 or so words spoken in the uncut version of Hamlet:
> hamlet_words %>% select(contains("word"), everything())
# A tibble: 30,022 x 20
   word  stopword word_stem new_speaker_thi~ word_number_thi~ word_number word_number_thi~ word_number_thi~ original_line_n~ speaker_abb
   <chr> <lgl>    <chr>     <lgl>                       <dbl>       <int>            <int>            <int>            <int> <chr>
 1 who's TRUE     who'      TRUE                            1           1                1                1               50 Ber.
 2 there TRUE     there     FALSE                           2           2                2                2               50 Ber.
 3 nay   FALSE    nai       TRUE                            1           3                3                3               53 Fran.
 4 answ~ FALSE    answer    FALSE                           2           4                4                4               53 Fran.
 5 me    TRUE     me        FALSE                           3           5                5                5               53 Fran.
 6 stand FALSE    stand     FALSE                           4           6                6                6               53 Fran.
 7 and   TRUE     and       FALSE                           5           7                7                7               53 Fran.
 8 unfo~ FALSE    unfold    FALSE                           6           8                8                8               53 Fran.
 9 your~ TRUE     yourself  FALSE                           7           9                9                9               53 Fran.
10 long  FALSE    long      TRUE                            1          10               10               10               56 Ber.
# ... with 30,012 more rows, and 10 more variables: speaker <chr>, speaker_sh <fct>, main_character <lgl>, last_stage_direction <chr>,
#   act <chr>, scene <chr>, new_speaker_this_line <lgl>, line_number_this_speech <dbl>, speech_number <int>, top_20_char <lgl>
Note that the hamlet_words data frame has about 30,000 rows - one for each word in Hamlet, excluding stage directions.
And here is how I made that table out of the line-grain table above:
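As a rough sketch of going from line grain to word grain while keeping the original order, the following splits each line on whitespace, strips surrounding punctuation and numbers the words. It uses a deliberately tiny, hypothetical stop word vector (`my_stopwords`) and skips stemming altogether; the object names are illustrative and this is not the code that produced the table above.

```r
library(dplyr)
library(tidyr)

toy_lines <- tibble(
  text = c("Who's there?", "Nay, answer me: stand, and unfold yourself."),
  speaker_abb = c("Ber.", "Fran."),
  original_line_number = c(50L, 53L)
)

# Hypothetical, deliberately tiny stop word list
my_stopwords <- c("who's", "there", "me", "and", "yourself")

toy_words <- toy_lines %>%
  # split each line into a list of lower-case tokens, then one row per token
  mutate(word = strsplit(tolower(text), "\\s+")) %>%
  unnest(cols = word) %>%
  # strip punctuation at either end, keeping apostrophes
  mutate(word = gsub("^[^a-z']+|[^a-z']+$", "", word)) %>%
  filter(word != "") %>%
  mutate(word_number = row_number(),
         stopword    = word %in% my_stopwords) %>%
  group_by(original_line_number) %>%
  mutate(word_number_this_line = row_number()) %>%
  ungroup() %>%
  select(word, stopword, word_number, word_number_this_line,
         original_line_number, speaker_abb)

toy_words
```

Because every word keeps its sequence number and its original line number, the text can always be reassembled in order from this table.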
How many words each?
OK, it’s time to do something fun now that we have tidied the data. Let’s start with just counting words. Out of the 30,000 words in Hamlet, who says how many?
You might have noticed in the code the char_summary data frame I make along the way there. It has some summary information on each character such as the number of words and speeches, words per speech, stop words, proportion of words that aren’t stopwords, etc. Here’s what that looks like. We see that (of course), Hamlet himself has the most words to say; in fact, just over a third of all the words in Hamlet are spoken by the titular character:
|Marcellus and Bernardo||9||2||4.500000||6||0.3333333||6|
Hang on, didn’t we see earlier that Hamlet had 358 speeches, and now we find only 346? That’s an anomaly related to how the data tidying worked. The 358 came from counting lines that appeared as Ham.; the 346 is more complex and looks for when Hamlet starts speaking after someone else had. My first guess at the cause of the anomaly is scene breaks where Hamlet was the last person speaking in the previous scene, and also first in the new one. But this would need checking. I don’t have the inclination to fix that now, so let’s just note this as an example of the sort of detail that needs fixing.
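To see how the two counting methods can disagree, here is a toy illustration of the guessed-at mechanism (it does not confirm that this is the actual cause in the real data). The vector `labels` is a hypothetical sequence of speaker labels in which Hamlet has two labels in a row, as would happen across a scene break.

```r
# Hypothetical sequence of speaker labels; the second and third labels are
# both Hamlet, as if a scene break fell between them
labels <- c("Hor.", "Ham.", "Ham.", "King.", "Ham.")

# Method 1: count every line that is a "Ham." label
label_count <- sum(labels == "Ham.")

# Method 2: count only the times the speaker changes to Hamlet,
# by collapsing consecutive repeats into runs
runs <- rle(labels)
change_count <- sum(runs$values == "Ham.")

label_count   # 3
change_count  # 2
```

Any back-to-back labels for the same character reduce the second count but not the first.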
It’s interesting to look at the most distinctive words spoken by each character. The plot below shows the proportion of each character’s words that are a particular word stem, divided by that proportion for the whole play. Effectively, this is the term frequency - inverse document frequency (TFIDF). Some of this is interesting (Polonius talks about his ‘daughter’ Ophelia; the Ghost talks about ‘blood’, ‘foul’, and being ‘beneath’), some not so much (Hamlet’s distinctive words in particular seem to have little message for me).
Here’s the code to calculate those TFIDF values (I am using dplyr to do this explicitly so I can be sure of what is going on, rather than using specialist functions) and draw that chart of distinctive words:
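A minimal sketch of the ratio as described in words above (a character's proportion for a stem, divided by the whole-play proportion), computed explicitly with dplyr on made-up toy data. The object names and the toy word stems are hypothetical, and this may differ in detail from the calculation behind the chart.

```r
library(dplyr)

# Made-up toy data: one row per spoken word stem
toy_words <- tibble(
  speaker_sh = c(rep("Polonius", 4), rep("Ghost", 4)),
  word_stem  = c("daughter", "daughter", "lord", "the",
                 "blood", "foul", "the", "the")
)

# Proportion of all words in the (toy) play that are each stem
overall <- toy_words %>%
  count(word_stem, name = "n_overall") %>%
  mutate(prop_overall = n_overall / sum(n_overall))

distinctive <- toy_words %>%
  count(speaker_sh, word_stem, name = "n_char") %>%
  group_by(speaker_sh) %>%
  # proportion of this character's words that are each stem
  mutate(prop_char = n_char / sum(n_char)) %>%
  ungroup() %>%
  left_join(overall, by = "word_stem") %>%
  # values above 1 mean the character over-uses the stem
  mutate(ratio = prop_char / prop_overall) %>%
  arrange(speaker_sh, desc(ratio))

distinctive
```

In the toy data, common stems such as "the" get a ratio near or below 1 for a character who uses them no more than average, while stems used by only one character score higher.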
How long are characters’ speeches
Beyond simple counts and averages of words, lines and speeches, it might be interesting to see the distribution of the lengths of the speeches that different characters deliver. Here’s how that looks for nine of the main characters:
The main thing I spot here is that the Ghost of Hamlet’s father gets only a few speeches but they tend to be long and uninterrupted compared to the other characters. Gertrude, Ophelia and Horatio tend to have short speeches less than 100 words, acting as foils for their interlocutors (normally Hamlet himself). Hamlet of course has the full range - many short bits of snappy dialogue, but also three or four lengthy 400+ word speeches.
The longest speech is Hamlet’s in Act II Scene II. In this speech he reflects on the impromptu play just put on by the Players; contrasts their acted emotion and passion with his own indecision; considers he must still be unsure whether the Ghost is misleading him about the villainy of Claudius; and explains his plan to use a play (within a play) to test Claudius’ conscience (paragraph breaks added by hand):
“Ay, so, God b’ wi’ ye!
Now I am alone. O, what a rogue and peasant slave am I! Is it not monstrous that this player here, But in a fiction, in a dream of passion, Could force his soul so to his own conceit That from her working all his visage wan’d; Tears in his eyes, distraction in’s aspect, A broken voice, and his whole function suiting With forms to his conceit? And all for nothing! For Hecuba? What’s Hecuba to him, or he to Hecuba, That he should weep for her? What would he do, Had he the motive and the cue for passion That I have? He would drown the stage with tears And cleave the general ear with horrid speech; Make mad the guilty, and appal the free; Confound the ignorant, and amaze, indeed, The very faculties of eyes and ears.
“Yet I, A dull and muddy-mettled rascal, peak, Like John-a-dreams, unpregnant of my cause, And can say nothing; no, not for a king Upon whose property and most dear life A damn’d defeat was made. Am I a coward? Who calls me villain? breaks my pate across? Plucks off my beard and blows it in my face? Tweaks me by the nose? gives me the lie i’ the throat As deep as to the lungs? who does me this, ha? ‘Swounds, I should take it: for it cannot be But I am pigeon-liver’d, and lack gall To make oppression bitter; or ere this I should have fatted all the region kites With this slave’s offal: bloody, bawdy villain! Remorseless, treacherous, lecherous, kindless villain! O, vengeance!
“Why, what an ass am I! This is most brave, That I, the son of a dear father murder’d, Prompted to my revenge by heaven and hell, Must, like a whore, unpack my heart with words And fall a-cursing like a very drab, A scullion! Fie upon’t! foh!–About, my brain!
“I have heard That guilty creatures, sitting at a play, Have by the very cunning of the scene Been struck so to the soul that presently They have proclaim’d their malefactions; For murder, though it have no tongue, will speak With most miraculous organ, I’ll have these players Play something like the murder of my father Before mine uncle: I’ll observe his looks; I’ll tent him to the quick: if he but blench, I know my course.
“The spirit that I have seen May be the devil: and the devil hath power To assume a pleasing shape; yea, and perhaps Out of my weakness and my melancholy,– As he is very potent with such spirits,– Abuses me to damn me: I’ll have grounds More relative than this.–the play’s the thing Wherein I’ll catch the conscience of the king. [Enter King, Queen, Polonius, Ophelia, Rosencrantz, and Guildenstern.]”
Here’s the R code to plot the distribution of speech lengths, and extract in full an original speech that has been identified from our tidied row-grain data frame:
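As a small illustration of the counting step only (no plot), this sketch measures speech lengths from word-grain rows and pulls the longest speech back out as text. The data and object names are hypothetical toy examples, and the real extraction may work from the line-grain table instead.

```r
library(dplyr)

# Toy word-grain data: one row per spoken word
toy_words <- tibble(
  speaker_sh    = c("Bernardo", "Bernardo", "Francisco", "Francisco",
                    "Francisco", "Bernardo"),
  speech_number = c(1L, 1L, 2L, 2L, 2L, 3L),
  word          = c("who's", "there", "nay", "answer", "me", "he")
)

# Number of words in each speech
speech_lengths <- toy_words %>%
  count(speaker_sh, speech_number, name = "words_in_speech")

# The single longest speech
longest <- speech_lengths %>%
  arrange(desc(words_in_speech)) %>%
  slice(1)

# Rebuild the text of that speech from the word rows, in original order
longest_text <- toy_words %>%
  filter(speech_number == longest$speech_number) %>%
  pull(word) %>%
  paste(collapse = " ")

longest_text
```

The speech_lengths table is also the natural input for a histogram or density plot per character.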
Incidentally, this exercise drew my attention to another problem - the final stage direction here, [Enter King, Queen, Polonius, Ophelia, Rosencrantz, and Guildenstern.], had not been correctly identified as such in my tidying program. So another detail to fix (I count about four of these so far…).
Who interacts with whom?
Something that’s of interest that we can find out with tidied, structured data that is difficult otherwise is the interactions between different characters. A play features (mostly) a single character talking at a time, and can be analysed as a series of hand-offs from one character to another. We can count each transition and turn it into a network chart. For illustrative purposes here, I am ignoring the direction of these transitions, and just counting the absolute number. This gives us a nice illustration of who interacts (either interrupting, or simply taking turns) with whom:
What I like about this graph is how it shows the relationship between Hamlet and his wingman Horatio as core to the play; while also making clear who are the other major speaking characters (Claudius, Polonius, Laertes, Gertrude, etc) and the relationships between them.
Here’s the code for that network graph, made simple by the wonder of Thomas Pedersen’s ggraph R package.
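Before any plotting, the hand-offs need to be counted. A sketch of that counting step on a made-up sequence of speeches, ignoring direction by ordering each pair of names alphabetically (the object names `toy_speeches` and `edges` are hypothetical):

```r
library(dplyr)

# Made-up sequence of speeches, one row per speech, in play order
toy_speeches <- tibble(
  speaker_sh = c("Hamlet", "Horatio", "Hamlet", "Claudius", "Hamlet", "Horatio")
)

edges <- toy_speeches %>%
  # who spoke immediately before this speech?
  mutate(previous = lag(speaker_sh)) %>%
  filter(!is.na(previous), previous != speaker_sh) %>%
  # order each pair alphabetically so that direction is ignored
  mutate(from = pmin(previous, speaker_sh),
         to   = pmax(previous, speaker_sh)) %>%
  count(from, to, name = "handoffs", sort = TRUE)

edges
```

The result is an edge list with one row per pair of characters and a weight, which is the usual starting point for building a network object to plot.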
Topics of the play
Finally, some bi-term topic modelling. In the various short speeches (from 1 word to 500), which word stems tend to come together? Here’s a first look at this.
Does this help? Well, not so much. I probably need some more thinking, certainly about how many topics to use for starters.
I originally intended this topic modelling to be the main point of this blog, but in the six months that this post has sat in “it’s good, but it needs fixing” purgatory I have since used bi-term topic modelling elsewhere, and no longer feel great motivation to finish this idea off immediately. I do think there is big potential in using topic modelling to understand topics within a Shakespeare play (ie treating each speech as a document), not just between the plays (treating each play as a document). Most examples using Shakespeare for topic modelling are of the latter variety, probably because (as today’s post shows) there’s a fair bit of wrangling required before one can start analysing speeches as documents.
Anyway, here is the absurdly short code to fit the bi-term topic model.
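For readers new to the idea, a biterm is an unordered pair of words that occur together in the same short text, and a biterm topic model works with these pairs. The sketch below only illustrates what the biterms of one tiny document look like, using base R; it is not the model-fitting code, and the `stems` vector is a hypothetical example.

```r
# A short "document": the word stems of one speech
stems <- c("long", "live", "king")

# All unordered pairs of words in the document, one pair per row
biterms <- t(combn(stems, 2))
colnames(biterms) <- c("term1", "term2")

biterms
```

A speech of n words yields n * (n - 1) / 2 such pairs, which is part of why the approach suits short texts like the many brief speeches in a play.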
Well, that’s it. I’ve had a good go at tidying one Shakespeare play into a data model that loses minimal original information, and puts us in good shape to analyse words, topics, interactions between characters, how Acts and Scenes and stage directions work, etc. There are a handful of small details I haven’t fixed, but the overall project works. It wouldn’t be hard to extend this to work with other Shakespeare plays, and maybe for plays in general.
The process of doing this has produced one piece of analysis I like, which is the network graph of hand-offs (one character speaking after the other). I think that has potential, to show at a glance what the characters are doing in a play. I’d like to try that elsewhere.
In the meantime, that’s all folks. Take care out there, wear a mask, and don’t go out if you don’t have to. Unless you’re in New Zealand or Western Australia, in which case, sure, go out, see if the rest of us mind, just don’t rub it in.