GRA6036 Multivariate Statistics

HOME EXAM IN GRA6036

MULTIVARIATE STATISTICS

This home exam has five assignments. Each assignment carries the same number of points (one fifth of the exam total), but the sub-questions within an assignment may carry different proportions of that assignment's points. Where this is the case, the weights are specified in the text.

You must use SPSS for the first four assignments and Lisrel for the fifth assignment.

Your report must have three parts.

(A) The main part of the report. This part has a page limit of 20 pages.
(B) A mathematical part, containing the calculations for 3a, 3b, 3c, 3d and 3e. There is no page limit for this part, but be concise and do not include irrelevant material. You may write this part either with a word processor or by hand.
(C) A technical appendix with the Lisrel code and relevant Lisrel output for assignment 5, as well as SPSS output that verifies your analysis. There is no page limit for this part.

The SPSS output in part (C) will be used only to verify the correctness of the analysis, and you may not write any further text in this part. The 20 pages available for part (A) must therefore include all material needed to grade your assignment, including all relevant tables and graphs.

Note that points will be deducted for exceeding the page limit on part (A). You will therefore need to write concisely and spend time formatting the material in a sensible way.

You will be judged not only on your technical skill, but also on how clear and well formulated your answers are, as well as your presentation of statistical tables and figures. If there are parts that are unclear, you will not be given the benefit of the doubt.

Assignment 1. The file “smoking_and_socioeconomic_status.sav” contains part of the results of a survey of n = 585 people in the USA. The original study focused on many different issues, but here we study only two variables: socioeconomic class (SES) and whether the individual was a smoker (SMOKE). SES is coded into 5 ordinal categories, with 1 indicating low SES and 5 indicating high SES. SMOKE is coded as 1 = current smoker, 0 = not a current smoker. In addition to these variables, the dataset contains dummy variables for smoking within each socioeconomic class. These dummies are named smoke1, smoke2, . . ., smoke5 and encode exactly the same information as the SMOKE and SES variables. Note that while the i’th elements of the SMOKE and SES variables correspond to the same individual, the i’th elements of, say, smoke1 and smoke3 do not.


(a) (counts 1/2) Describe the main features of the dataset. Fit an appropriate logistic regression model to address the question of whether there is a statistically significant connection between socioeconomic class and smoking. Use SES=1 as the reference category. Explain how the predicted probabilities of being a smoker vary with SES and show that these correspond to the group means of smoke1, smoke2, . . ., smoke5, within approximation error.
(b) (counts 1/2) Let $\hat p_1, \hat p_2, \ldots, \hat p_5$ be the proportions of smokers in socioeconomic classes 1, 2, . . ., 5 respectively. Let $\hat d_i = \hat p_i - \hat p_{i-1}$. Because it appears that the proportion of smokers decreases with socioeconomic status, we expect (and indeed have) that $\hat d_i > 0$ for each i = 5, 4, 3, 2. It might therefore seem to be a good idea to assess if

$$D = \sum_{i=2}^{5} \hat d_i$$

is statistically significantly different from zero in order to test if the probability of being a smoker stays constant with socioeconomic class.

It can be shown that D is, for large samples, approximately Normal, irrespective of the true probabilities of being a smoker in the various socioeconomic classes. Identify $\mathrm{E}\,D$ and $\sigma^2 = \operatorname{Var} D$. Suppose we are given a consistent estimator of $\sigma^2$ (which you need not identify), denoted by $\hat\sigma^2$. Explain which aspect of the logistic regression model in the previous point can be tested by comparing

$$T = \frac{D}{\sqrt{\hat\sigma^2}}$$

to what is expected from a standard Normal distribution.
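As an illustration only, the following minimal Python sketch uses hypothetical counts (not the survey data) to show two arithmetic facts behind this assignment. First, a logistic regression whose only covariates are the SES dummies is saturated, so its maximum likelihood fit reproduces the group proportions. Second, a sum of successive differences telescopes. The names `counts`, `p_hat`, `coef`, `predicted` and `d_hat` are assumptions made for the sketch; the exam itself requires SPSS.

```python
import math

# Hypothetical (smokers, group size) per SES class. These are NOT the survey data.
counts = {1: (27, 90), 2: (28, 120), 3: (25, 130), 4: (18, 125), 5: (10, 110)}


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))


def inv_logit(z):
    """Inverse of the logit: maps a log-odds value back to a probability."""
    return 1 / (1 + math.exp(-z))


# Observed proportion of smokers in each class.
p_hat = {k: s / n for k, (s, n) in counts.items()}

# With SES=1 as reference and one dummy per remaining class, the model has as
# many parameters as groups, so the MLE is available in closed form:
# intercept = logit(p_hat[1]); coefficient for class k = difference in log-odds.
intercept = logit(p_hat[1])
coef = {k: logit(p_hat[k]) - intercept for k in (2, 3, 4, 5)}

# Predicted probabilities from the fitted model.
predicted = {1: inv_logit(intercept)}
predicted.update({k: inv_logit(intercept + b) for k, b in coef.items()})

for k in sorted(p_hat):
    print(f"SES={k}: group proportion={p_hat[k]:.4f}, predicted={predicted[k]:.4f}")

# Successive differences and their sum, which telescopes to p_hat[5] - p_hat[1].
d_hat = {i: p_hat[i] - p_hat[i - 1] for i in range(2, 6)}
D = sum(d_hat.values())
print(f"D = {D:.4f}, p_hat[5] - p_hat[1] = {p_hat[5] - p_hat[1]:.4f}")
```

In SPSS the same coefficients would come from an iterative fit, which is why the agreement with the group means holds only within approximation error there.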

Assignment 2. Table 1 gives a dataset of academic achievement scores under three different training regimens. Each individual was first given an aptitude test, the result of which is recorded as the x-variable in the table. The person was then randomly assigned to one of the three training regimens. After the training was completed, the academic achievement score was recorded as the y-variable in the table.

                Training method
       A            B            C
    y     x      y     x      y     x
    6     3      8     4      6     3
    4     1      9     5      7     2
    5     3      7     5      7     2
    3     1      9     4      7     3
    4     2      8     3      8     4
    3     1      5     1      5     1
    6     4      7     2      7     4

Table 1. Achievement scores

Analyze the dataset and test whether there is a statistically significant difference between the three training regimens when taking individual aptitude into account.

Hint: To get SPSS to perform an F-test, you will find it useful to use SPSS’ block estimation system: Let “Method” be set to enter, and specify the small model as usual. Then press “Next” where it says “Block 1 of 1”. You are now in block 2, and you may here specify the covariates that are to be included in the larger model – in addition to the ones that are in block 1.
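The block comparison described in the hint corresponds to an F-test between two nested linear models. The following is a minimal Python sketch of that arithmetic using the values in Table 1. It is not SPSS output, and the helper names `solve` and `rss` and the choice of method A as reference category are assumptions made for the illustration. The small model contains an intercept and x; the larger model adds dummies for methods B and C.

```python
# Table 1 as (y, x) pairs per training method.
data = {
    "A": [(6, 3), (4, 1), (5, 3), (3, 1), (4, 2), (3, 1), (6, 4)],
    "B": [(8, 4), (9, 5), (7, 5), (9, 4), (8, 3), (5, 1), (7, 2)],
    "C": [(6, 3), (7, 2), (7, 2), (7, 3), (8, 4), (5, 1), (7, 4)],
}


def solve(a, b):
    """Solve a * beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    beta = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (m[i][n] - tail) / m[i][i]
    return beta


def rss(X, y):
    """Residual sum of squares of the OLS fit of y on the columns of X."""
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in X]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


y, X_small, X_large = [], [], []
for method, pairs in data.items():
    for yi, xi in pairs:
        y.append(float(yi))
        X_small.append([1.0, float(xi)])  # block 1: intercept and aptitude
        X_large.append([1.0, float(xi), float(method == "B"), float(method == "C")])

rss_small = rss(X_small, y)
rss_large = rss(X_large, y)
q = len(X_large[0]) - len(X_small[0])  # number of added parameters
df_resid = len(y) - len(X_large[0])    # residual degrees of freedom, large model
f_stat = ((rss_small - rss_large) / q) / (rss_large / df_resid)
print(f"F({q}, {df_resid}) = {f_stat:.3f}")
```

The statistic is compared with an F distribution with `q` and `df_resid` degrees of freedom. SPSS reports this comparison as the F change between the two blocks.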

Assignment 3. Linear regression using the ordinary least squares estimates is intimately related to the process of averaging numbers and (therefore) to linear transformations. We will here explore what this means, how this relationship comes about and contrast it to a slightly different formulation of the regression problem that does not lead to a linear solution. We will then derive some very useful consequences of the linearity of the OLS estimates and see how they can be applied in a statistical investigation. Note that 3g contains 50% of the points in this assignment.

(a) (Mathematical, counts 1/12) Suppose we observe $Y_1, Y_2, \ldots, Y_n$ that are independent and
