Aim: The aim of this investigation is to determine whether statistical measures can give information about the authorship of a text. I will compare two books: the first aimed at an adult audience, and the second aimed at children and therefore written at a lower reading level.
I will compare the complexity of the two books statistically, by calculating the average number of words per sentence and the average number of letters per word in each book. Using this information I will then calculate confidence intervals, which are what I will compare between the two books. I have chosen to investigate these two measures as neither is affected by font size or by the number of lines on a page or words per line.
The final outcome of this investigation should provide the evidence needed to determine which of the two books is more complex, i.e. which uses longer words and sentences.
The two books I have used are:
“Diana, Her True Story” – Andrew Morton (174 pages, max 39 lines per page)
“A Series of Unfortunate Events” – Lemony Snicket (190 pages, max 21 lines per page)
Both books are from the same decade, which makes for a fair comparison: there is no bias from period differences in language and structure. Had I used a book written before the 20th century, it might contain the kind of language spoken and written in that period, possibly using longer or shorter words and sentences.
For each book I have taken a sample of size 50 for both the number of words per sentence and the number of letters per word. This gives 50 pieces of data per measure for each book (i.e. 50 values for words per sentence and 50 values for letters per word for the adult book, and the same for the children’s book).
I believe a sample of size 50 is large enough to represent each book as a whole (the parent population): 30 is commonly taken as the minimum sample size for the Normal approximation to hold, and I have exceeded this. A larger sample gives more accurate results and narrower confidence intervals, and the sample mean is a good unbiased estimator of the population mean; the larger the sample, the better it represents the parent population.
Since I do not know the actual distribution of the parent population (the number of letters per word, or the number of words per sentence, throughout the book), I have chosen a large sample size of 50. This helps ensure that the distribution of the sample means is approximately Normal, so I can state my confidence to a given level within a given range.
As shown on the sampling distribution of the mean, the sample mean I calculate can fall anywhere on this Normal curve. However, I am able to calculate confidence intervals, which state, to a given percentage of confidence, an interval around my sample mean within which the population mean is likely to lie. All of this rests on the Central Limit Theorem.
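The confidence interval calculation described above can be sketched in Python. The sample values below are hypothetical, purely to illustrate the arithmetic; they are not data taken from either book. The Normal critical value 1.96 is the standard choice for a 95% interval, justified here by the Central Limit Theorem and the sample size of 50.

```python
import math
import statistics

# Hypothetical sample of 50 words-per-sentence counts (illustrative only,
# NOT the actual data collected from either book).
sample = [12, 18, 9, 22, 15, 11, 25, 14, 8, 19] * 5  # 50 values

n = len(sample)
mean = statistics.mean(sample)
# Sample standard deviation (divisor n - 1), an unbiased-style estimate
# of the population standard deviation.
s = statistics.stdev(sample)

# Standard error of the mean.
se = s / math.sqrt(n)

# 95% confidence interval using the Normal critical value z = 1.96.
z = 1.96
lower, upper = mean - z * se, mean + z * se
print(f"mean = {mean:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

The same calculation, with 1.96 swapped for a different critical value, gives intervals at other confidence levels (e.g. 2.576 for 99%).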
In order to obtain the sample of size 50 for the number of words per sentence, I:
* Generated 50 random numbers for each book to pick random pages. The adult book has 174 pages and the children’s book has 190. When generating random numbers on a calculator I used the whole number and ignored the decimals, e.g. if 163.691 came up I would use page 163. I generated numbers up to one more than the page count, e.g. up to 191 for the children’s book, to give the last page an equal chance of occurring: the highest number that can then be generated is 190.999 (to 3 d.p.), which truncates to page 190. The number must therefore be between 0 and 175 for the adult book and between 0 and 191 for the children’s book.
* Next randomly generated the sentence number on that page, according to the number of sentences on it, e.g. if a page has 11 sentences then I generated numbers up to 12.
* I counted the number of words in that sentence and recorded them in a table that I constructed.
* Completed the above procedure for both books.
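The calculator procedure above can be sketched in Python. The `random_index` helper below mimics the truncation method described: generate a real number up to one more than the maximum, drop the decimals, and discard anything below 1. The seed is set only so the illustration is reproducible; it is not part of the original method.

```python
import random

random.seed(1)  # reproducibility for this illustration only

def random_index(max_value):
    """Mimic the calculator method: generate a real number up to
    max_value + 1 and truncate the decimals, so every whole number
    from 1 to max_value is equally likely. Values below 1 are
    discarded and regenerated, since there is no page 0."""
    while True:
        x = random.uniform(0, max_value + 1)
        index = int(x)  # ignore the decimals, e.g. 163.691 -> 163
        if index >= 1:
            return index

# Example: pick 50 random pages from the 190-page children's book.
pages = [random_index(190) for _ in range(50)]
print(pages[:10])
```

The same helper serves for the sentence number on a page: `random_index(11)` picks one of 11 sentences with equal probability.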
Then, to obtain the samples for the number of letters per word for each book, I:
* Used the same method as before to generate the random page numbers.
* Then used my calculator to randomly generate a line number according to the number of lines on the page, again ignoring the decimals.
* I used a calculator to randomly generate the word number on that line according to the number of words on that line.
* Counted the number of letters in the randomly generated word and recorded it in my table.
* I then completed this for both the adult and children’s book.
* All the numbers had to be randomly generated so that every page, line, word and sentence had an equal chance of being selected.
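The page-then-line-then-word selection above can be sketched as nested random choices. `random.randint` is used here as a shorthand equivalent to the truncation method described earlier; the line layout below is hypothetical, not taken from either book.

```python
import random

random.seed(2)  # reproducibility for this illustration only

# Hypothetical layout of one randomly chosen page: the number of words
# on each of its lines (illustrative only, not real book data).
words_per_line = [11, 9, 12, 10, 8, 11, 13, 9, 10, 12]

# Pick a random line on the page, then a random word on that line,
# giving every line and every word an equal chance of selection.
line = random.randint(1, len(words_per_line))
word = random.randint(1, words_per_line[line - 1])
print(f"line {line}, word {word}")
```

Because each stage is uniform given the previous one, every word on the page that the layout describes can be reached by some sequence of draws.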
Assumptions: When generating the random numbers on my calculator there were several factors to take into consideration. Any number less than 1 was discarded, as neither book has a page 0. If the generated number was unsuitable, e.g. a page with no text on it, a sentence number exceeding the number of sentences on the page, a line number of 30 on a page with only 25 lines, or a word number beyond the number of words on the line, then a new number was generated that did comply with the maximum value for that statistic. Other rules that I needed to fix before proceeding, in order to produce fair and unbiased results, were:
* If a page has only one sentence, or a line only one word, then that sentence or word is used without generating a further number, as it has still been randomly selected via the page number.
* If the randomly chosen sentence is the last on the page and it overlaps two pages, the sentence is still counted in full, from the page on which it starts to the page on which it finishes.
* A sentence ends with . ? ! or …, unless these characters are part of a name, web address, etc., e.g. Dr Who or www.webaddress.com.
* The first sentence on a page is counted as the first full sentence on that page, not a mid-sentence carried over from the previous page; it therefore begins after a . ? ! or …. The start of a new chapter or paragraph may also provide the first sentence.
* When counting the number of words in a sentence or line, a numeral such as 5 is not counted as a word, and standalone characters such as . , ” ( ) ‘ ? ! : ; + – % are not counted as words either.
* A word split by a hyphen, e.g. co-ordination, is classed as one word, and words in direct speech are counted. When counting the letters in a word, only letters of the alphabet count: the hyphen in co-ordination, for example, is not counted as a letter.
These rules must be specified so that I treat every piece of data in exactly the same way, preventing a biased approach.
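The counting rules above can be expressed as two small helper functions. This is a sketch of the rules in Python, not part of the original calculator-based method; `count_letters` and `is_countable_word` are names introduced here for illustration.

```python
def count_letters(word):
    """Count only letters of the alphabet: hyphens, apostrophes and
    other punctuation are ignored, so 'co-ordination' has 12 letters."""
    return sum(1 for ch in word if ch.isalpha())

def is_countable_word(token):
    """A token counts as a word only if it contains at least one
    letter; bare numerals like '5' and stray punctuation marks
    ( . , ? ! etc. ) are not counted as words."""
    return any(ch.isalpha() for ch in token)
```

Applying the same functions to every sampled token enforces the rules consistently, which is exactly the point of specifying them in advance.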