Digitising Surmelian

Our Eastern Armenian language class was coming to a close for the evening when Nelli asked, “Are there any books that you could recommend about the Armenian Genocide?” Our teacher, Gagik, answered that one of the best is Leon Surmelian’s I Ask You, Ladies and Gentlemen, a memoir of a young boy living through the times of the Armenian Genocide. It was a bestseller when it was first published in 1945, but had not been republished since 1946. However! The Armenian Institute had a plan to republish it. This plan involved a number of volunteers each being given a chapter of the book and typing it up.

“Have you considered OCR?” I asked.

The first OCR (Optical Character Recognition) device was invented in 1870, and many subsequent devices followed for assisting the visually impaired and for converting text to telegraph codes. However, it’s safe to say that OCR in its more modern form, scanning printed text and converting it to digital form, has been around since the 1970s. With the technological advances since then, it should be pretty good by now, wouldn’t you think?

After a little test, I found myself with the 1946 British publication and a scanner, in the Armenian Institute office under the Gulbenkian Hall, next to the beautiful St Sarkis church. Some hours later, I had scans of all 224 pages.

1.png

That looks pretty good to me, but I’m not a computer.

2.png

It turns out that computers are not as good as I thought at reading printed text. The text on the facing pages and the angle of the text were confusing the computer. But with a little bit of work cleaning the images, I got this.

3.png

That’s better! It captured the text nicely. But we’re still seeing a few oddities. You can see that where a word was broken across two lines, in the case of “mem-bers”, it could become two words. It even saw a space in the middle of “arrested”. These kinds of issues are common with older fonts. Modern OCR performs much better with modern computer fonts.
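For anyone curious how the broken-word problem could be tidied up in bulk, here is a minimal sketch. It is not what I actually ran; it assumes the raw OCR output is plain text that still has the original line breaks, and the function name is made up for illustration.

```python
import re

# A word fragment, a hyphen at the end of a line, then the rest of the word
# at the start of the next line (optional stray spaces on either side).
HYPHEN_BREAK = re.compile(r"(\w)-[ \t]*\n[ \t]*(\w)")

def rejoin_broken_words(text: str) -> str:
    """Join words that were split across two lines with a hyphen."""
    return HYPHEN_BREAK.sub(r"\1\2", text)

print(rejoin_broken_words("the mem-\nbers were arrested"))
# prints: the members were arrested
```

One caveat: a genuinely hyphenated word that happens to break at its hyphen would lose the hyphen too, so the results would still need a read-through.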

But I was pleased with the results. As you can see from the screenshots, my word processor was highlighting misspelled words, so it was quick to go through the book and fix those. You might have noticed that there were a number of spaces before punctuation marks such as commas and full stops. Again, a word processor is pretty good at searching for all instances of spaces before punctuation.
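The same search for spaces before punctuation can be written as a regular expression. This is only a sketch of the idea, assuming the text has been exported as plain text; the sample sentence and the function name are invented for the example.

```python
import re

# One or more spaces or tabs sitting directly before a punctuation mark.
SPACE_BEFORE_PUNCT = re.compile(r"[ \t]+([,.;:!?])")

def tidy_punctuation(text: str) -> str:
    """Remove stray whitespace that OCR left before punctuation marks."""
    return SPACE_BEFORE_PUNCT.sub(r"\1", text)

sample = "They were arrested , one by one , and taken away ."
print(tidy_punctuation(sample))
# prints: They were arrested, one by one, and taken away.
```

Note that this would also close up a deliberately spaced ellipsis (“. . .”), so it is worth checking whether the book uses those before running anything like it over the whole text.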

That spellchecking even helped me to spot a number of misspellings in the original book.

4.png

When reprinting a book, there is a desire to remain true to the original text. However, it’s safe to say that blatant misspellings such as those above were not the intention of the author. This brought me into a greyer area of reprinting a book. It became obvious to me that the English language has changed quite a lot since 1946. One common change in English is that two words often found together become hyphenated, and later they often merge into one word.

5.png

When reading Shakespeare, I expect many differences in linguistic style, so some archaic spellings are expected. However, in a modern text such as this, such archaic spellings can be jarring and can distract the reader from the story being told. This leads us to consider what should be kept in “period” language, and what should be updated to enable a more readable text. Does that original spelling add anything to the text, or does it just make it feel dated or more difficult to read?

Similarly, some words are spelled differently these days:

In English-speaking countries, we would now say “Ramadan”, not “Ramazan”. Is there a benefit to using Surmelian’s spelling? It is presumably based on the pronunciation that he was familiar with as a child. But at the same time, “Ramazan” could confuse a reader.

What do you think? Do you feel that the 1946 spellings are important? Or do you feel that they might distract, or prevent acceptance amongst new readers? I would love to hear your thoughts.

Another spelling issue worth noting is that the book that I scanned was a 1946 British printing. This had a benefit in that Surmelian had had some time to make small edits to his text after the first edition. Unfortunately, as you might expect, it also meant that British English spellings were used. For example, “harbor” had been changed to “harbour”, and “color” to “colour”.

In reprinting this book, we felt that it was important to make use of American English spellings, as that was the chosen language of the author.
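Restoring the American spellings is the kind of job that could be scripted with a lookup table. The sketch below is an illustration rather than the process we followed: the mapping holds only the two examples mentioned above, and a real pass would need a much longer, hand-reviewed list. The function names are hypothetical.

```python
import re

# Deliberately tiny mapping, limited to the two examples from the text.
BRITISH_TO_AMERICAN = {
    "harbour": "harbor",
    "colour": "color",
}

# Whole-word matches only, ignoring case.
PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, BRITISH_TO_AMERICAN)) + r")\b",
    re.IGNORECASE,
)

def _match_case(replacement: str, original: str) -> str:
    """Give the replacement the same capitalisation as the original word."""
    if original.isupper():
        return replacement.upper()
    if original[0].isupper():
        return replacement.capitalize()
    return replacement

def to_american(text: str) -> str:
    """Swap the listed British spellings for their American equivalents."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        return _match_case(BRITISH_TO_AMERICAN[word.lower()], word)
    return PATTERN.sub(swap, text)

print(to_american("The Colour of the harbour"))
# prints: The Color of the harbor
```

Because only whole words are matched, plurals and derived forms such as “harbours” or “colourful” would each need their own entry. That is more typing, but it avoids surprising replacements inside longer words.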

One of the trickier spelling mistakes to spot is where a word is a real word, but should be spelled differently. A word processor will not help spot these. The phrase “taught like tightly drawn copper wires” should have used the word “taut”, not “taught”.

8.png

Among these, I’m particularly fond of “Benediction cassock”. I suspect that Surmelian wrote this correctly (after all, he spent time living with monks), but that a copy editor didn’t recognise the word “Benedictine” and “corrected” it.

All these little challenges were a lot of fun to work through, and there was still proofreading to do. I generated PDFs of the book and these were shared around Armenian Institute staff and volunteers. While the team read the book and sent any errors they found, I was at last able to read it for pure enjoyment. And I have to say that I did enjoy reading it.

Now the book is available again in print and as an ebook. So please read it yourself, enjoy it, and if you find any errors, I would love to hear about them.


By Stephen Masters