Wednesday 26 December 2012

More analysis of text


26/12/12 07:25 [Wednesday]
A week or two ago I started on a program to analyse text from Hansard (the printed, and nowadays online, record of what is said in Parliament). The Hansard text is easily available and is likely to be a careful representation of what was said; that is, it will be relatively error-free. Further, the speeches reported are made by educated people speaking fairly modern English. My real interest is in analysing the language to try to learn grammatical rules from exemplars. This follows my efforts some months ago with the Bible, another source of English text that is easily available, relatively error-free, in computer-readable form.
My Hansard program has not yet got very far. It first has to identify the extraneous elements in the text, such as column headers, the time of day, subject headings and names of speakers. The first genuinely language-related task it performs is to report whether the Hansard text (block by block) includes any quote-marks. Even this is not entirely straightforward, as an apostrophe is printed the same as a single quote-mark. My guess (and I have not gathered much empirical evidence yet) would be that Hansard does not often include instances of embedded direct speech. Identifying quote-marks is an easy way of finding sentences nested within another sentence, but (I believe) direct quoting sometimes occurs in English (and certainly in other languages) without the use of quote-marks, one example being where thoughts expressed in language are reported. Sometimes italicisation is used as a near equivalent to quote-marks.
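The quote-mark check and the apostrophe difficulty can be illustrated with a minimal sketch in Python (my choice for illustration; the language of the actual program is not stated here). The function name `has_quote_marks` and the rule "a single quote-mark with a letter on both sides is an apostrophe" are assumptions, not the method the program uses.

```python
import re

# Assumed heuristic: a single quote-mark with a letter on both sides
# (as in "don't") is an apostrophe, not a quote-mark.
APOSTROPHE = re.compile(r"(?<=[A-Za-z])['\u2019](?=[A-Za-z])")

# Straight and curly quote characters, single and double.
QUOTE_CHARS = "\"'\u2018\u2019\u201c\u201d"

def has_quote_marks(block):
    """Return True if the block still contains a quote character
    after word-internal apostrophes have been removed."""
    stripped = APOSTROPHE.sub("", block)
    return any(ch in QUOTE_CHARS for ch in stripped)
```

This heuristic is deliberately crude: a plural possessive such as "Members' interests" has no letter after the apostrophe, so it would be reported as a quote-mark. Resolving such cases would need more context than a single character on each side.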
The next step is to separate the sentences in each block of text. (These are considerations I went through with my Bible text program, and I might end up with general procedures for analysing blocks of English text from any source.) Briefly, I will repeat that a sentence ends with a terminator (full-stop, question-mark or exclamation-mark, with ellipses not being a clear case). Occasionally, however - even setting aside embedded quoted passages - question-marks and exclamation-marks occur within the scope of a single sentence, and full-stops can terminate abbreviations within a sentence (especially 'hon.' for 'honourable' in Hansard).
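As a rough illustration of the sentence-separation step, here is a small Python sketch that treats a word ending in a terminator as the end of a sentence unless the word is a known abbreviation. The function name `split_sentences` and the abbreviation list are hypothetical; only 'hon.' comes from the text above, and the other entries are assumed for the example.

```python
# Hypothetical abbreviation list: 'hon.' is the case named in the text,
# the remaining entries are assumptions for illustration.
ABBREVIATIONS = {"hon.", "mr.", "mrs.", "dr."}

def split_sentences(block):
    """Split a block of text into sentences at words that end in a
    terminator, skipping words that are known abbreviations."""
    sentences = []
    current = []
    for word in block.split():
        current.append(word)
        ends_with_terminator = word[-1] in ".?!"
        if ends_with_terminator and word.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep any trailing words that lack a terminator.
        sentences.append(" ".join(current))
    return sentences
```

For example, `split_sentences("My hon. Friend is right. Will the Minister respond?")` gives two sentences, with 'hon.' left inside the first. The sketch does not handle the harder cases mentioned above: a question-mark or exclamation-mark inside a sentence still causes a split, and a terminator followed by a closing quote-mark is not recognised.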
My ideas then tend towards comparing sentences to see whether repeated frameworks can be identified, which might lead to noticing grammatically equivalent words or phrases (in the sense that nouns are grammatically equivalent to one another, as are other parts of speech, though the traditional way of parsing sentences considerably simplifies such equivalences).
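One possible reading of "repeated frameworks" can be sketched in Python: blank out each word of a sentence in turn, and record which words are seen filling the same gap in otherwise identical sentences. This is only an assumed interpretation, limited to single-word gaps; the function name `find_frames` is hypothetical.

```python
from collections import defaultdict

def find_frames(sentences):
    """Map each one-gap frame (a tuple of words with "_" marking the gap)
    to the set of words seen filling that gap."""
    frames = defaultdict(set)
    for sentence in sentences:
        # Lower-case and drop the final terminator before splitting.
        words = sentence.lower().rstrip(".?!").split()
        for i, word in enumerate(words):
            frame = tuple(words[:i] + ["_"] + words[i + 1:])
            frames[frame].add(word)
    # Keep only frames whose gap was filled by two or more different words.
    return {frame: fillers for frame, fillers in frames.items()
            if len(fillers) > 1}
```

Given "The Minister is right.", "The Member is right." and "The Minister is wrong.", the sketch groups 'minister' with 'member' and 'right' with 'wrong'. Real text would rarely repeat so exactly, so gaps spanning whole phrases would be needed before this became useful.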
I commend all this to readers.
26/12/12 10:08
I am pleased to show this debug output:
Word preceding seeming full-stop: 'statement.'
Word preceding seeming full-stop: 'that.'
Word preceding seeming full-stop: 'hon.'
Word preceding seeming full-stop: 'hon.'
Word preceding seeming full-stop: 'all.'
Word preceding seeming full-stop: 'ESA.'
Word preceding seeming full-stop: 'assessment.'
Word preceding seeming full-stop: 'hon.'
Word preceding seeming full-stop: 'moon.'
Word preceding seeming full-stop: 'reform.'
Word preceding seeming full-stop: 'time.'
Word preceding seeming full-stop: 'needs.'
Prog ends OK: 31 lines were read from file
