Digging into Data:
Broadcast News and Scholarly Communication
Overview
Concordia University’s BBC World Service collection comprises 8400 hours of audio, covering about 18 years of news broadcasts. Manual indexing of this material, even at ten times the speed of the BBC standard for shot-listing TV news, would take on the order of five person-years. Beyond this, the BBC holds another 300 000 hours of radio content, and a total of one million hours of broadcast content. The proposed project not only “takes advantage of the large scale of the chosen digital dataset”, i.e. the work done by Concordia, but also puts together a form of data mining that can cope with this scale of unwritten material, by automatically creating a way in: an index and finding aid.
The 8400 hours we will index cover 18 years of news bulletins. The corpus comprises nearly 20 thousand half-hour news bulletins, an estimated quarter-million individual news ‘items’, and an estimated 50 million words. One could dip into this data manually, but digging into it requires the data mining that this project proposes.
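These figures are consistent with simple back-of-envelope arithmetic on the 8400 hours. A minimal sketch follows (in Python); the working rates of roughly 100 spoken words per minute, 13 items per bulletin and 1800 indexing hours per person-year are illustrative assumptions, not project measurements.

    # Back-of-envelope checks on the corpus-size estimates (illustrative assumptions only)
    hours = 8400
    bulletins = hours * 2        # 16,800 half-hour bulletins ("nearly 20 thousand")
    words = hours * 60 * 100     # ~50 million words, assuming ~100 spoken words per minute
    items = bulletins * 13       # ~220,000 items, assuming ~13 news items per bulletin
    person_years = hours / 1800  # indexing at real time, at ~1800 working hours per year
    print(f"{bulletins:,} bulletins, {words:,} words, {items:,} items, {person_years:.1f} person-years")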
The ongoing partnership for this research consists of:
- Concordia University’s Centre for Broadcasting Studies (CCBS), which holds the World Service News collection and understands it – the content, users and uses;
- Southampton University’s IT Innovation (ITI), which has worked with the BBC archive and BBC R&D on a range of projects, and has strong skills in advanced information technology, particularly in information processing;
- the BBC itself, which is not seeking funding but has a vital interest in the project, and can supply supplementary information (other cataloguing, other World Service records, background information), plus its long experience in related technology, beginning with the EC-sponsored THISL project in speech recognition for access to broadcast news (1997-2000);
- Glasgow Caledonian University (GCU), which has been working with the BBC on educational access to audio archive content for about seven years (the JISC-NSF-sponsored Spoken Word project). GCU, like the BBC, is acting in an advisory capacity, particularly in the area of moving BBC radio material into a formal digital repository.
We view ourselves as a natural partnership. CCBS has the material, has digitised it and worked with it, and understands it best. But the material is BBC content in origin; the BBC has very large amounts of related material, and other information (data and technical experience on radio news) that should prove essential. ITI has the advanced skills in information processing and data mining technology, plus long practical experience working with BBC archives and BBC R&D on digital processing issues related to audiovisual archive content (PrestoSpace, PrestoPRIME, Avatar-m).
Finally, the BBC’s experience of cross-Atlantic collaboration comes from the GCU Spoken Word project. That project gave the BBC experience of university research on audiovisual content, and also gave us all our experience of digital repository technology (and of associated issues such as access control and rights for use of BBC content in higher education).
How the data was compiled
The main dataset will be created by the project. It consists of 18 years of BBC World Service broadcasts: the North American service on short wave, recorded off-air in Canada during 1968-1986. This collection is unique; the BBC did not archive its World Service broadcasts, nor does it have transcripts. These 18 years cover a range of major 20th-century events, from the Prague Spring to Chernobyl. The dataset, though large, is simply a pilot for the much larger amounts of broadcast radio news now being digitised and made ready for online access. The BBC alone has about 100 000 hours of radio news, and as much again of TV news. The Concordia material is already digitised, as mono .wav files, currently stored on 2100 CDs.
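A rough sense of the digital volume involved can be had from the CD count. The sketch below assumes roughly 700 MB per CD and, for comparison, 16-bit mono PCM at 22.05 kHz; both figures are illustrative assumptions, not confirmed specifications of the Concordia transfers.

    # Rough data-volume estimate for the digitised collection (assumed figures)
    cds = 2100
    mb_per_cd = 700                           # assumed CD-R capacity
    total_tb = cds * mb_per_cd / 1e6          # ~1.5 TB for the whole collection
    mb_per_hour = cds * mb_per_cd / 8400      # ~175 MB per hour of audio
    pcm_mb_per_hour = 22050 * 2 * 3600 / 1e6  # ~159 MB/hour for 16-bit mono PCM at 22.05 kHz
    print(f"{total_tb:.2f} TB total, {mb_per_hour:.0f} MB/hour (PCM reference: {pcm_mb_per_hour:.0f} MB/hour)")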
Ancillary datasets we will use include the BBC catalogue (Infax), which can be searched to find related BBC broadcast TV and radio content. Note that because the BBC did not archive the World Service broadcasts, no detailed metadata for them exists in the catalogue; searching Infax is therefore a way to find related content, not more information on the Concordia dataset.
Concordia has full rights to on-site access to its recordings.