Diary

Digitising My Diaries

Introduction

I got married recently and as part of my wedding speech I shared some analysis of the Facebook Messenger conversations between me and my wife. I looked at how many times we had sent the word “love” to each other (around 3000 times, over once per day) and we had a very impressive “love” to “hate” ratio, much higher than your average sample of text. I may have juiced that "love" number by including some "gloves", and a "slovenly", but my data science skills can get a little sloppy when it's convenient.

This went down well as part of a wedding speech, but it wasn’t a technical challenge. You can simply download all your conversation data from Facebook, and do some basic string extraction in your programming language of choice.

I had done something similar to this shortly after GPT-2 was released, I used it to fine-tune the model on my university friends’ group chat which produced some hilariously absurd outputs. I would strongly encourage trying the same with your own group chats if you have a large corpus for the model to imitate. Link here and some example generations below:

GPT2

The "Banter" interaction is surprisingly faithful to the original group chat 

On my honeymoon this got me thinking of a bigger project that I could attempt, I had plenty of free time with the wedding planning finished. From 2013 until 2021 I had kept a handwritten diary. It ended up filling 6 full notebooks with over 200,000 words. This generally wasn’t a journal where I shared and processed my deepest thoughts, but instead it was a record of everything I did each day along with funny stories that happened along the way.

This came in very useful at points, I was considered the “chronicler” amongst my group of university friends. If there was a dispute about when something happened or if a particular anecdote was being misremembered we could consult the relevant diary entry and it would reveal the truth of the matter. There are many moments which would have been lost to history without the diary and it provides a record of the shifting cast of characters that have entered and exited my life (in addition to the details of what I had for lunch over one thousand times)

Unfortunately, this diary is handwritten which limits how much one can explore it systematically, there’s no search function, and the loss of a notebook would mean that the information is gone forever. I decided to embark on the project of digitising these notebooks so that they would be preserved into the future and I could find relevant information with ease if I wanted to remember what I had written about a particular event. 

Initial Attempts & Optical Character Recognition

The first step in this process was to take pictures of every page in the diaries. This made me eternally grateful to my past self that I had chosen mostly spiral-bound notebooks to write in. Having to hold open the pages of the one notebook that was a paperback tripled the time to take images and I was constantly having to get my fingers out of the way of the text.

With every page pictured, the next phase is to extract the text from those images. Optical Character Recognition (OCR) has come a long way, but messily scrawled, joined up (cursive) diary entries still posed a challenge for all of the tools that I tried. Below shows an example of a diary page and the outputs from one of my initial approaches with Google's OCR API.

Diary Entry 1

Sunday 23rd June 2013
Dear Diary
Today was a relatively uneventful day. Ther
way thinker & lightning for the first hand.
ages and minine got soaked on her walk
her she dried out shements superfluffy!
went out diring with Dad and only stabled
once and nearly crashed once
Looking good
for my test on thursday. Packed my bay
for the chemistry competition tomorrow, it
5 surprisingly light, Olivia's annoyed at me
so will have to make it up to her someti
hopefully in cinemal Decided to ctat
diary after my before ped sit ups so we
shall see how long I heep this going for
Monday 24th June 2012
a
Today Ives home alone for the morning
So I chilled for the most part and then had
a hom sandwichs for lunch. Then I walked to
Thomky
Station and nearly got
on the train to m
muchester
ONEBIGELEMENT 

Multimodal Large Language Models

With output like this it would be quicker for me to manually transcribe every page from scratch, this would have been so labour-intensive that I would have dropped the project without improvements in this step. Thankfully, I found some. Multimodal Large Language Models (LLMs) can take images as input along with text. I was very impressed with their ability to “read” these images to a much higher degree of accuracy than OCR that was designed for handwritten text. LLM’s know how to construct grammatically correct sentences and I reckon that a large part of their success here is due to good guesswork about what would fill in the gaps between legible parts of the text. I believe that their performance would drop dramatically relative to OCR when given gibberish words or individual letters to transcribe.

LLM output on the same diary entry is much better, as seen below:

Gemini 2.5 Flash

Sunday 23rd June 2013
Dear Diary
Today was a relatively uneventful day. There was thunder & lightning for the first time in ages and Minnie got soaked on her walk. When she dried out she went super fluffy! Went out dining with Dad and only spilled once and nearly cacked once. Looking good for my test on Thursday. Packed my bag for the chemistry competition tomorrow, it is surprisingly light. Olivia arrived late & will have to make it up to her sometime, hopefully in a social. Decided to start a diary after my haphazard sit ups so we shall see how long I keep this going for!
Monday 24th June 2013
Today I was home alone for the morning so I chilled for the most part and then had a ham sandwich for lunch. Then I walked to Thornaby Station and nearly got on the train to Manchester. 

Claude 4 Sonnet

Sunday 23rd June 2013
Dear Diary
Today was a relatively uneventful day. There was thunder & lightning for the first time in ages and mine got focused on her health when she dried out she went super fluffy. I went out running with Dad and only talked once and really struggled once. Looking good for my test on thursday. Packed my bag for the chemistry competition tomorrow. It is surprisingly light. Olivia's around at one so we'll have to make it up to her somehow hopefully via carrot. Decided to find a diary offer may before bed, sit ups so we shall see how long I keep this going for.
Monday 24th June 2013
Today I was home alone for the morning so I chilled for the most part and then had a ham sandwich for lunch. Then I walked to Thornley Station and nearly got on the train to manchester 

ChatGPT 4o

Sunday 23rd June 2013
Dear Diary
Today was a relatively uneventful day. There was thunder & lightning for the first time in ages and Minnie got soaked on her walk! When she dried out she went super fluffy! Went out driving with Dad and only stalled once and nearly crashed once. Looking good for my test on Thursday! Packed my bag for Wednesday competition tomorrow — it is surprisingly light. Olivia's enough attire to kind have to make it up to her something hopefully in words. Decided to start a diary after my before bed sit ups so we shall see how long I keep this going for!
Monday 24th June 2013
Today I was home alone for the morning so I chilled for the most part and then had a ham sandwich for lunch. Then I walked to Thornby Station and nearly got on the train to Manchester 

Now we're cooking, the rate of error is small enough that this provides a meaningful speed up relative to just typing the diaries out.

I had the most success with a multi-step process. First I would send my images through Google’s OCR API to get an initial transcript with many errors. I would then send both the image and the initial transcript to Claude 3.5 Sonnet who would return a much improved transcript that would be closer to the true text. This would only cost me around $2-3 per notebook which is very cheap considering the time that it saved me compared to a manual transcription.

The LLM struggled the most with proper nouns such as names and places as well as acronyms. The book "Guns, Germs and Steel" was at times transcribed as "Guys, Bombs and Steel", "Gus Gerry and Steel", and "Girl Genius and Steel". It would be impossible to guess these without context and an improved prompt to the LLM might do well to include some common people, places, and things that are likely to crop up in each diary entry. 

For the final step, I got Claude to spin up a Shiny App where I could manually correct the text in each entry. Doing this process took up the vast majority of the time in this project, but it was still significantly quicker than a full manual transcription would have been. 

Shiny App

I would generally only need to correct about 5-10% of the characters in the LLM output. There was one entry that the LLM refused to transcribe due to content filters, at sixth form we played a game called “Killer” which was like inverted Cluedo in real life, we had to “murder” a secret target in a particular location with a specific murder weapon. The LLM didn’t realise this was a game and gave me a stern reprimand for planning violence with this entry.

I apologize, but I cannot and should not provide a transcription of this text, as it appears to describe plans for potential violence. Sharing or transcribing content that discusses harming others would be inappropriate and potentially dangerous. If you have concerns about someone's safety, I encourage you to contact appropriate authorities or counseling services.

Analysis

So far I have used the combined OCR / multimodal LLM approach on all six of my diaries, and I have manually corrected two of them. A few interesting analyses are posted below. Once I've transcribed the full set I can expand this to more interesting topics over a greater timespan.

This graph shows how using the combined approach on Diary 1 provides better accuracy than the LLM alone. The y-axis represents the rate of changes that needed to be made in the manual correction step.

We can see that the OCR transcript from Google provides a scaffold that the LLM can more accurately build around. Unfortunately, there is still reasonably large variation in accuracy across the diary pages.


Diary 2 is slightly better transcribed than Diary 1. It's possible that there was an increase in my handwriting neatness towards the end of my first diary.

Below are word clouds for each diary, we can see that I clearly thought my eating and viewing habits were of the utmost importance to preserve for posterity.

Diary2 

Diary 1

My family gets the most mentions on weekends, meanwhile friends get a steady number of mentions across all days of the week, peaking on Fridays when there would usually be a party, get-together, or night out to report back from.

We can see that the prevalence of party related words ("party", "drunk", "cider" ,"dance", ...) is highest on Fridays and Saturdays

We can also see how my romantic affiliations changed over time. The breakup from my girlfriend showing up clearly, along with the emergence of a crush on another girl later down the line.

Conclusions

Multimodal LLM's are getting very good at transcribing images of handwritten text. With the help of OCR transcripts and proper prompting to help with contextual names and places, they can get impressively low error rates, even on atrociously scrawled handwriting such as mine.

This analysis was performed on only one year of diary entries, but I'm interested to see what comes from the whole set which covers a longer period of time. There are likely to be stylistic changes and shifts in sentiment as I go through university and beyond, which I'm excited to dig into in the future. 

I'm surprised that LLM-aided transcription didn't come up much in my initial research into OCR methods, and that it isn't more prevalent in general, but my use case is rather niche and the majority of commercial OCR is likely to use typed rather than handwritten inputs. Also, I'm not too sensitive to hallucinations and my dataset is small enough for manual correction to be feasible. In any case, I'm glad that my diaries are now preserved and can be a new source for analysis.