The occasional ramblings of a freelance lexicographer

Monday, January 07, 2013

The future of dictionaries (2): lexicographer versus computer

Some 20-odd years ago, as a young Linguistics undergraduate, I became interested in the concept of computers ‘understanding’ human language. I did my undergraduate dissertation on Natural Language Processing (NLP), considering how far computers might go in really understanding language in all its subtle, complex, nuanced detail, and holding up the talking computer HAL, from Kubrick’s 1968 film 2001: A Space Odyssey, as what I then suggested was an unobtainable goal. I went on to start a Master’s course in Computer Speech and Language Processing. I only lasted a term – mainly because I discovered I really hated all the computer programming involved, but also because I was disappointed to find that most of the course seemed to revolve around the speech processing side (i.e. voice recognition) and the language processing component came down to rather vague theoretical discussions that didn’t go much beyond my basic undergraduate research. Okay, that may be a bit of a distorted recollection of the actual course content, but I was only 21 and that’s how you see things when you’re barely out of your teens!

Obviously, in the intervening decades, technology has come on in leaps and bounds. Speech recognition has improved immeasurably - I'm actually dictating this blog post using speech recognition software and while it's not perfect, it's considerably more impressive than my early efforts at programming! I have to hold up my hands here and admit that I haven't kept up to date with developments in NLP, but I suspect progress has been much slower; we're still an incredibly long way from communicating with our technology in the same fluent way we can chat to our friends.

So, what’s any of this got to do with dictionaries? Well, let me try and explain my train of thought, triggered by the announcement by Macmillan back at the start of November that they are to stop printing paper dictionaries and focus on their online content:

  • If publishers aren't actually selling paper dictionaries but are mostly focusing on a free online service, how much are they going to be prepared to spend on the time-consuming and labour-intensive work of lexicography?
  • Of course, they'll be looking into other related income streams, selling dictionary data for other uses, and online advertising, but without a tangible, on-the-shelf product, will that justify quite the same budget?
  • Reduced budgets often suggest a drive towards more automation, something we've already seen with the emergence of developments such as ‘TickBox lexicography’.
  • Will more automation and “more efficient” ways of working inevitably lead to a drop in standards?

Clever developments in making the dictionary compilation process more automated do supposedly speed it up, for example, by automatically selecting ‘good’ dictionary examples from a corpus, to save a human lexicographer having to trawl through by hand. But any lexicographer who's worked with them will know that they only work to a degree and only speed things up to a certain extent … probably not quite compensating for the increased rate expected of said lexicographer without a drop in quality.
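
For the curious, here is a very rough sketch of the kind of thing an automatic ‘good example’ selector might do. It is not how any publisher's actual tool works; the function names, the thresholds and the idea of a `common_words` list are all assumptions made up for illustration. It simply favours corpus sentences that contain the headword, are of moderate length, look like complete sentences and use mostly familiar vocabulary.

```python
import re

# Hypothetical thresholds, chosen purely for illustration.
MIN_WORDS, MAX_WORDS = 6, 20


def score_example(sentence, headword, common_words):
    """Return a rough 0-1 score for a candidate example sentence.

    Assumption: the headword is matched as an exact word form; there is
    no lemmatisation, so 'slot' would not match 'slotted'.
    """
    words = re.findall(r"[a-z']+", sentence.lower())
    target = headword.lower()
    if target not in words:
        return 0.0

    score = 0.0
    # Prefer sentences of moderate length.
    if MIN_WORDS <= len(words) <= MAX_WORDS:
        score += 0.4
    # Prefer text that looks like a complete sentence.
    if sentence[:1].isupper() and sentence.rstrip().endswith((".", "!", "?")):
        score += 0.2
    # Prefer sentences whose other words are mostly familiar vocabulary.
    others = [w for w in words if w != target]
    if others:
        known = sum(w in common_words for w in others) / len(others)
        score += 0.4 * known
    return round(score, 2)


def rank_examples(sentences, headword, common_words, top_n=3):
    """Return the top_n (score, sentence) pairs, best first."""
    scored = [(score_example(s, headword, common_words), s) for s in sentences]
    scored = [pair for pair in scored if pair[0] > 0]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_n]
```

Even in a toy like this, you can see where the human comes in: nothing here knows whether a sentence actually illustrates the right sense of the word, or whether it refers to something that has quietly gone out of date.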

And then there's the whole established process for keeping dictionaries up to date. Currently, most dictionaries undergo a revision and a new edition every five years or so. This is a long, slow, and labour-intensive process that involves a team of lexicographers (mostly freelancers nowadays) going through the whole A-Z, looking at each entry and checking whether it needs updating. This doesn’t just involve adding trendy new buzzwords like ‘omnishambles’ or whatever – which are rarely of much use, or interest, to the average foreign learner anyway. There are all kinds of more subtle changes in the usage of existing words, sometimes due to linguistic trends and sometimes just as a result of changes in the real world. As one commenter on the Macmillan dictionaries blog pointed out, MED still contains an entry for Inland Revenue as the name of the UK tax authority, even though it changed its name to HMRC in 2005. And having done a quick search myself, I found it also has a couple of example sentences that, rather unhelpfully in a digital age, refer to cassettes (‘She slotted another tape into the cassette player.’ at slot into; ‘He quickly undid the screws that held the cassette together.’ at undo).

And I’m not just trying to pick holes in Macmillan here; all dictionaries naturally date as language and usage change. Thus the need for new editions. And there are changes in style and presentation too as different aspects of language come to the fore within language research and teaching. More information about collocations has become de rigueur over recent years, for example. And whilst corpora are wonderful tools for researching collocational information, it still takes a team of lexicographers to trawl through each entry and decide where it’s worth adding a bolded collocate, or in some cases, whether a particularly strong collocation should actually be shown as a phrase or an idiom.

Which comes back to where I started … computer technology can do lots of wonderful things, but for me, when it comes to language, there still needs to be a human drudge working their way through that data to make intelligent decisions about what to present in a dictionary and how. In a world of online-only dictionaries, will dictionary departments have the clout to take on a team of lexicographers to do those regular sweeps through the database or will they just have a couple of people on the lookout for interesting, newsworthy nuggets that give the appearance of being “up-to-date”?
