The header contains extra-linguistic information on each speech. It is made up of a number of fields, which provide information about the transcript file, the speech and the speaker.
The following is an example of the template we use:
speech number: 017
text length: short
number of words: 69
words per minute: 172
source text delivery: impromptu
speaker: Cox, Patrick
mother tongue: yes
political function: President of the European Parliament
political group: ELDR
topic: Procedure & Formalities
specific topic: speeches on matters of political importance
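Since each line of the header is a field name followed by a colon and a value, a header of this kind can be read into a simple lookup table. The following is only a minimal sketch in Python, assuming one field per line and the first colon as the separator; the function name parse_header is hypothetical and is not part of EPIC.

```python
# The field lines are copied from the example header above.
EXAMPLE_HEADER = """\
speech number: 017
text length: short
number of words: 69
words per minute: 172
source text delivery: impromptu
speaker: Cox, Patrick
mother tongue: yes
political function: President of the European Parliament
political group: ELDR
topic: Procedure & Formalities
specific topic: speeches on matters of political importance
"""


def parse_header(text):
    """Return a dict mapping each field name to its value (both as strings)."""
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        # Split on the first colon only, so values may contain colons.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return fields


header = parse_header(EXAMPLE_HEADER)
print(header["speaker"])
print(header["number of words"])
```

Values are kept as strings so that codes such as 017 keep their leading zero; numeric fields can be converted when needed.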
The first group of four fields (date, speech number, language and type) contains a reference code, which is used to classify the speeches. The first number (25) indicates the day, the second (02) the month (in this case, February) and the third the year (04, that is, 2004). The letter that follows, (m) or (p), indicates whether the speech was delivered during a morning or an afternoon sitting (in this particular case, in the afternoon). The next number (017 in our example) is a sequential number that we assign to each speech.
The abbreviations "en", "it" and "es" indicate, respectively, a speech in English, Italian or Spanish. "org" and "int" indicate whether it is an original speech (i.e. a source text) or an interpretation (i.e. a target text). If it is an interpreted speech, we indicate both source and target languages, for example "int-en-it" means that the speech was interpreted from English into Italian.
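The components of the reference code described above can be decoded mechanically. The sketch below is illustrative only: the labels "org-en" and "int-en-it" follow the text, but the hyphen-separated layout assumed for the date part ("25-02-04-p") is a guess made for this example, as are the function and class names.

```python
from dataclasses import dataclass
from typing import Optional

LANGUAGES = {"en": "English", "it": "Italian", "es": "Spanish"}


@dataclass
class SpeechType:
    kind: str                        # "org" (source text) or "int" (target text)
    source_language: str             # language of the original speech
    target_language: Optional[str]   # None for original speeches


def parse_type_label(label):
    """Decode a label such as 'org-en' or 'int-en-it'."""
    parts = label.split("-")
    if parts[0] == "org" and len(parts) == 2:
        kind, source, target = "org", parts[1], None
    elif parts[0] == "int" and len(parts) == 3:
        kind, source, target = "int", parts[1], parts[2]
    else:
        raise ValueError(f"unrecognised label: {label!r}")
    for code in (source, target):
        if code is not None and code not in LANGUAGES:
            raise ValueError(f"unknown language code: {code!r}")
    return SpeechType(kind, source, target)


def parse_date_code(code):
    """Decode day, month, year and sitting.

    ASSUMPTION: the parts are written as 'DD-MM-YY-s' with hyphens, and
    two-digit years belong to the 2000s (04 means 2004).
    """
    day, month, year, sitting = code.split("-")
    if sitting not in ("m", "p"):
        raise ValueError(f"unknown sitting code: {sitting!r}")
    return {
        "day": int(day),
        "month": int(month),
        "year": 2000 + int(year),
        "sitting": "morning" if sitting == "m" else "afternoon",
    }


print(parse_type_label("int-en-it"))
print(parse_date_code("25-02-04-p"))
```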
This reference code is followed by a number of fields containing information on the speech, namely duration, text length and speed. We have recorded the exact figures indicating the number of seconds (timing), the number of words and the words per minute (calculated by dividing the number of words by the duration expressed in minutes, that is, the number of seconds divided by 60).
We have also classified the duration of speeches as short, medium or long (short: < 120 secs; medium: 121 - 360 secs; long: > 360 secs).
The same applies to text length, classified as short, medium or long (short: < 300 words; medium: 301 - 1000 words; long: > 1000 words).
Speed was classified as low, medium or high (low: < 130 w/m; medium: 131 - 160 w/m; high: > 160 w/m).
It must be pointed out that these values were calculated on the basis of the present corpus of speeches and can therefore only be considered representative of this type of material, that is, speeches delivered during a specific group of plenary sittings of the European Parliament. In other contexts (e.g. the Italian conference interpreting market), a speech lasting 5 minutes (300 seconds) would be considered short rather than medium, since simultaneous interpreters normally work in shifts of about 30 minutes. Likewise, a speech delivered at an average speed of 150 w/m is fast (not medium) by normal conference interpreting standards. However, owing to the specific rules for the allocation of speaking time in European Parliament sittings (click on Source Texts in the left-hand side bar for more information), most MEPs try to say as much as possible in the shortest possible time and therefore tend to speak very fast. In this sense, and in this particular context, 150 w/m can be considered a medium speed.
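The speed calculation and the three-way classifications described above can be sketched as follows. This is an illustration, not the procedure actually used to build the corpus: the function names are hypothetical, and values falling exactly on a limit (for instance, exactly 120 seconds or exactly 300 words) are assigned to the medium class here, which is an assumption.

```python
def words_per_minute(n_words, duration_secs):
    """Number of words divided by the duration expressed in minutes."""
    if duration_secs <= 0:
        raise ValueError("duration must be positive")
    return n_words / (duration_secs / 60)


def classify(value, lower_limit, upper_limit, labels):
    """Return labels[0] below lower_limit, labels[2] above upper_limit,
    and labels[1] otherwise (limit values included: an assumption)."""
    if value < lower_limit:
        return labels[0]
    if value > upper_limit:
        return labels[2]
    return labels[1]


def classify_duration(secs):
    return classify(secs, 120, 360, ("short", "medium", "long"))


def classify_length(n_words):
    return classify(n_words, 300, 1000, ("short", "medium", "long"))


def classify_speed(wpm):
    return classify(wpm, 130, 160, ("low", "medium", "high"))


# Values from the example header: 69 words at 172 words per minute.
print(classify_length(69), classify_speed(172))

# A hypothetical speech of 750 words lasting 300 seconds (5 minutes).
speed = words_per_minute(750, 300)
print(speed, classify_duration(300), classify_speed(speed))
```

With these thresholds, the five-minute speech mentioned above is classified as medium in duration, and 150 w/m as medium in speed.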
Other information related to the speech includes source text delivery (that is, mode of presentation of the source speech), classified as impromptu, read or mixed. This information is recorded in the transcripts of interpreted speeches as well, since it is important to know whether the source text was read or improvised when analysing the target text.
We have grouped the speeches on the basis of macro-categories indicating the general topic of each speech and we have also recorded the specific topic under discussion in the debate. Specific topics are varied, ranging from the Parmalat fraud case to human rights in Afghanistan. A full list of specific topics, with corresponding clip numbers, is available in the archive (click on Multimedia Archive in the left-hand side bar).
The next fields in the header contain information on the speaker: name, gender, country of origin, mother tongue, political function and political group. When the speaker is an interpreter, no values are assigned to the fields name, country, political function and political group (indicated as NA, that is, not assigned).
The labels "European Commission" and "European Council" indicate that the speaker is either a Commissioner or a European Council Minister: in both cases, we record the field of action of the Commissioner or the Council configuration in the space reserved for comments at the end of the header.
European Commission's areas of responsibility:
- Agriculture and Fisheries
- Administrative Reform
- Enterprise and Information Society
- Internal Market
- Development and Humanitarian Aid
- External Relations
- Health and Consumer Protection
- Education and Culture
- Justice and Home Affairs
- Employment and Social Affairs
- Regional Policy
- Economic and Monetary Affairs
- Relations with the European Parliament, Transport and Energy
- President of the European Commission.
European Council configurations:
- General Affairs and External Relations
- Economic and Financial Affairs
- Cooperation in the fields of Justice and Home Affairs
- Employment, Social Policy, Health and Consumer Affairs
- Transport, Telecommunications and Energy
- Agriculture and Fisheries
- Education, Youth and Culture.
Finally, the label "guest" indicates that the speaker does not belong to a European Union institution: s/he could be a head of state or government, an intellectual, a politician from a country outside the EU, etc.
The last field is the space reserved for comments. As mentioned above, this space is used to add information on Commissioners and European Council Ministers. It is also used to indicate whether the speaker has a noticeable accent (Scottish, Welsh, Irish; Andalusian, Latin American), to comment on any technical problems in the recordings and to record any unusual features of the speech that may be useful for later analysis.
EPIC can be queried by carrying out a simple query or an advanced query (see Advanced Query how-to) in each sub-corpus or in the aligned corpora. From the Project Description page, users can choose between Source texts, Target texts and Aligned texts. For example, by clicking on Source texts, it is possible to access the three sub-corpora "org-en", "org-it" and "org-es", that is, the English, Italian and Spanish source texts, respectively.
If a user wants to query the sub-corpus of English source texts (org-en), s/he can either query the whole sub-corpus (that is, search for all occurrences of a word or phrase in all the English source texts) or restrict the search to a number of texts by selecting one of the search options. The search parameters are based on the header fields (see above) and refer either to speech features or speaker features.
The "Duration" search parameter makes it possible to search for a certain phrase only in short, medium or long speeches (less than 2 minutes, between 2 and 6 minutes and more than 6 minutes, respectively).
Similarly, the "Text length" parameter makes it possible to restrict the query on the basis of the number of words in each speech. The user can select all the texts which are less than 300 words long, or those between 300 and 1000 words, or those with more than 1000 words.
The "Speed" parameter enables users to choose speeches delivered at low, medium or high speed (<130 words per minute, 130-160 w/m and >160 w/m; see the section on the header above for more details).
The "Source text Delivery" option makes it possible to filter speeches according to delivery mode: read, impromptu, and mixed.
The "Topic" search parameter enables users to select texts according to the following macro-categories:
- Agriculture & Fisheries
- Economics & Finance
- Procedure & Formalities
- Society & Culture
- Science & Technology
Users can also select speeches on the basis of speaker characteristics. If the speaker is a source text speaker (that is, not an interpreter), s/he is always in one of the following categories:
- President of the European Parliament
- Vice-President of the European Parliament
- European Commission
- European Council
When the speaker is an MEP, we also indicate the political group to which s/he belongs, which clearly does not apply to other types of speakers. The political groups represented in the European Parliament are as follows:
- PPE-DE (Group of the European People's Party (Christian Democrats) and European Democrats)
- PSE (Socialist Group in the European Parliament)
- ALDE (Group of the Alliance of Liberals and Democrats for Europe)
- Verts/ALE (Group of the Greens/European Free Alliance)
- GUE/NGL (Confederal Group of the European United Left - Nordic Green Left)
- IND/DEM (Independence/Democracy Group)
- UEN (Union for Europe of the Nations Group)
- NI (Non-attached Members)
It is also possible to select the speaker's gender and country of origin.
The field "Mother tongue" can be used to select speeches made by native speakers only or, vice versa, by non-native speakers. This is particularly relevant for the speeches in English, which is often used as a lingua franca by non-native speakers (e.g. Commissioners and Council Ministers normally use English when they visit the Parliament).
The various search parameters can be combined to further restrict the speeches on which a query is to be carried out. For example, one can select all the English source texts on economics and finance delivered by non-native speakers from the European Commission. Similarly, one may wish to query a section of EPIC made up of speeches on employment issues delivered by MEPs belonging to the Socialist Group, or to carry out separate searches on the speeches delivered by Irish speakers and UK speakers, and so on.
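Combining search parameters amounts to keeping only the speeches whose header matches every selected value. The sketch below illustrates this logic on a list of header records; it does not reproduce the EPIC query interface. The function name, the field names and the three sample records are all invented for the example.

```python
def select_speeches(headers, **criteria):
    """Return the headers whose fields equal every given criterion."""
    return [
        header
        for header in headers
        if all(header.get(field) == wanted for field, wanted in criteria.items())
    ]


# Invented sample records (not taken from the corpus).
speeches = [
    {"id": "A", "language": "en", "mother_tongue": "no",
     "political_function": "European Commission", "topic": "Economics & Finance"},
    {"id": "B", "language": "en", "mother_tongue": "yes",
     "political_function": "MEP", "topic": "Economics & Finance"},
    {"id": "C", "language": "en", "mother_tongue": "no",
     "political_function": "European Commission", "topic": "Society & Culture"},
]

# English source texts by non-native speakers from the European Commission
# dealing with economics and finance.
selection = select_speeches(
    speeches,
    language="en",
    mother_tongue="no",
    political_function="European Commission",
    topic="Economics & Finance",
)
print([speech["id"] for speech in selection])
```

Each additional criterion can only narrow the selection, and calling the function with no criteria returns every speech, which corresponds to querying the whole sub-corpus.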
The display options available in EPIC allow users to display words, lemmas, POS tags, and the transcript showing how the words were actually uttered, including mispronounced words (click on Transcription Conventions for more details).