
5.5 Examples

5.5.2 Drug Data

For the column categories, the non-trivial component values are 0.2103389, 0.08148318, 0.07267963 and 0.02452164, which are arranged in descending order as expected from Property 2 of Subsection 5.4.6. Therefore, Figure 5.2 is constructed using the linear principal axis, with a principal inertia value of 0.2103389, and the dispersion principal axis, with a principal inertia value of 0.08148318, for the row categories. Together these axes account for 75% of the variation of the drugs tested, compared with 98.2% of the variation in the patients' judgement of the drug. As the table is only small, this suggests that there is a higher order component which may help to explain more of this row variation. In fact the third (cubic) component contributes 18.7% of this variation.
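The percentages quoted above can be recovered from the four component values. The following is a minimal sketch, assuming that each percentage is the component's share of the sum of the four non-trivial values; the variable names are illustrative only.

```python
# Non-trivial component values quoted in the text (assumed here to sum to
# the total variation being partitioned).
components = [0.2103389, 0.08148318, 0.07267963, 0.02452164]

total = sum(components)

# Share of the variation displayed by the linear and dispersion axes.
first_two_share = 100 * (components[0] + components[1]) / total

# Share attributable to the third (cubic) component.
third_share = 100 * components[2] / total

print(f"linear + dispersion axes: {first_two_share:.1f}%")
print(f"cubic component: {third_share:.1f}%")
```

Under this assumption the two shares agree with the 75% and 18.7% reported in the text.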

[Plot: drug and rating category co-ordinates; horizontal axis: Principal Axis 1]

Figure 5.2 : Two-way Ordinal Correspondence (VERSION 2) of Table 5.4

Applying VERSION 2 of the singly ordered correspondence analysis, $Z_{(1)1} = -0.45648$, $Z_{(1)2} = -0.26016$, $Z_{(1)3} = 0.16505$ and $Z_{(1)4} = 0.036956$.

Therefore, using equation (5.87) we can see that

$$0.30467 = (-0.45648)^2 + (-0.26016)^2 + (0.16505)^2 + (0.036956)^2$$

Therefore the first principal inertia obtained from the simple correspondence plot can be partitioned into linear, dispersion and higher order components. From the above calculation, the dominant source of the first singular value is the linearity of the ordered column categories. That is, the first principal axis from the classical correspondence analysis of Table 5.5 best describes the difference in the mean values in the judgement of the drugs used.
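The partition of the first principal inertia can be checked numerically. The sketch below assumes that the fourth Z value is 0.036956, the reading for which the squared terms reproduce the first principal inertia of 0.304667 reported in Table 5.6; the variable names are illustrative.

```python
# Z values for the first principal axis. The fourth value is an assumed
# reading (0.036956), chosen because it reproduces the reported inertia.
z_values = [-0.45648, -0.26016, 0.16505, 0.036956]

squares = [z ** 2 for z in z_values]
first_principal_inertia = sum(squares)

# Percentage of the first principal inertia due to each component
# (linear, dispersion, then the two higher order components).
shares = [100 * s / first_principal_inertia for s in squares]

print(f"first principal inertia: {first_principal_inertia:.5f}")
for order, share in enumerate(shares, start=1):
    print(f"component {order}: {share:.1f}%")
```

With these values the linear component accounts for more than half of the first principal inertia, which is the sense in which it is the dominant source.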

Consider the four drug categories. Figure 5.2 shows the variation of these drugs in terms of the singular values. It shows that drugs C and D occupy similar positions, so these two drugs have a similar effect on the patients. These two drugs have quite a different effect from drugs A and B, which themselves differ from one another. Observing the proximity of the patients' ratings to drugs C and D, it appears that these drugs tend to be judged Good to Poor.

The ratings, which have a dominant and significant location component, vary mainly along the first principal axis. The categories Excellent and Very Good have a similar first co-ordinate, but different co-ordinates along the second principal axis. Therefore, while these two categories are similar in terms of their mean values, they are quite different in terms of their spread across each of the four drugs tested. Similarly, Figure 5.2 shows that Poor and Good are similar in terms of their mean values across the four drugs, but are spread out slightly differently.

Tables 5.5 and 5.6 detail the contribution that each drug and rating category makes to the axes in the correspondence plot of Figure 5.2.

Consider the contribution of the drugs to the row location component. In Figure 5.2 drug B is the furthest away from the origin and so is more likely than the other drugs to contribute to the lack of independence between the drugs and their effect on the patients. In fact drug B contributes more to the row location component (38.29%) than the other drugs, while contributing 67.79% of the variation in the dispersion component.

                  Principal Axis 1           Principal Axis 2
Drug Tested    Contribution  % Contr.    Contribution  % Contr.
Drug A           0.027053     12.86        0.000111      0.14
Drug B           0.08053      38.29        0.055237     67.79
Drug C           0.048842     23.22        0.01229      15.08
Drug D           0.053914     25.63        0.013845     16.99
Total            0.210338    100           0.081483    100

Table 5.5 : Contribution of the Drugs Tested on Figure 5.2
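The percentage columns of Table 5.5 can be reproduced from the raw contributions. This is a small sketch, assuming each percentage is the drug's contribution divided by the column total; `percentage_contributions` is a hypothetical helper introduced for illustration.

```python
# Contributions of each drug to the two axes, as listed in Table 5.5.
axis1 = {"Drug A": 0.027053, "Drug B": 0.08053,
         "Drug C": 0.048842, "Drug D": 0.053914}
axis2 = {"Drug A": 0.000111, "Drug B": 0.055237,
         "Drug C": 0.01229, "Drug D": 0.013845}


def percentage_contributions(contributions):
    """Express each contribution as a percentage of the axis total."""
    total = sum(contributions.values())
    return {name: 100 * value / total for name, value in contributions.items()}


pct_axis1 = percentage_contributions(axis1)
pct_axis2 = percentage_contributions(axis2)

for drug in axis1:
    print(f"{drug}: {pct_axis1[drug]:.2f}% (axis 1), {pct_axis2[drug]:.2f}% (axis 2)")
```

The same helper could be applied to the rating contributions of Table 5.6.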

                  Principal Axis 1           Principal Axis 2
Rating         Contribution  % Contr.    Contribution  % Contr.
Poor             0.013575      4.46        0.001242      1.61
Fair             0.075016     24.62        0.035757     46.23
Good             0.019534      6.41        0.024315     31.44
Very Good        0.056151     18.43        0.004059      5.25
Excellent        0.140391     46.08        0.011968     15.47
Total            0.304667    100           0.077342    100

Table 5.6 : Contribution of the Drug Rating Tested on Figure 5.2

Drug A is the closest drug to the origin and therefore contributes to the independence hypothesis. This also shows its relative lack of effect on the row location and dispersion components: 12.9% and 0.14% respectively.

Table 5.5 shows that drugs C and D, which are positioned close to one another in Figure 5.2, contribute roughly the same to the location and dispersion row components. Drug A, by contrast, contributes little to whatever variation there is between the drug categories. Thus, its profile co-ordinate lies close to the origin of the display.

Table 5.6 quantifies the dominance of the column category Excellent on the first principal axis. Figure 5.2 displays this category on the far left hand side of the display, while Table 5.6 shows that it contributes 46.1% of the columns' first singular value. The second principal axis is dominated by the category Fair, which contributes 46.2% of the columns' second singular value. Therefore these two categories do not contribute to the independence hypothesis, unlike Poor, which contributes 4.5% of the first and 1.6% of the second principal axis.

5.5.3 Hospital Data

Consider again the hospital data of Table 2.9. As the row and column variables are ordinal in nature a doubly ordered correspondence analysis is applicable. Subsection 2.10.3 showed that there is a significant association between a patient's frequency of visitors and their length of stay in hospital, while Figure 2.5 showed that those who are visited less than once a month and those who are never visited have similar profiles. It also showed that the categories for the patients' length of stay in hospital were different, although the simple correspondence plot of Figure 2.5 gives no indication as to what this difference is.

Applying natural scores {1, 2, 3} to both the row and column orthogonal polynomials, the table of inertia values is given by Table 5.7. The two-dimensional ordinal correspondence plot of Table 2.9, constructed using the first (location) and second (dispersion) principal axes, is optimal and is given by Figure 5.3.

Table 5.7 shows that, for the row profiles, there is a significant and dominant location component and a non-significant dispersion component. Therefore, the dominant source of variation between the "Frequency of Visiting" categories is due to the difference in the mean category values. These conclusions can be seen in Figure 5.3, where the row categories are spread out along the first principal axis, and to a far lesser extent they are also spread along the second principal axis.

                    Value      df   P-value
Row Inertia
  Location         0.26253      2   0
  Dispersion       0.00392      2   0.7720
Column Inertia
  Location         0.22601      2   0
  Dispersion       0.04044      2   0.9799
Total Inertia      0.26645      4   0

Table 5.7 : Table of Row and Column Inertia Values for Table 2.9

Table 5.7 shows that the only significant variation in the column categories is due to the difference in the column mean values. As the dispersion component is not significant, their spread is not an important feature in the column variation. Thus a one-dimensional correspondence plot, consisting of the first principal axis, will represent the most significant variation in the column categories. Figure 5.3 shows that the column categories are spread out along the first principal axis with a near zero second co-ordinate, reflecting the near zero dispersion component value.

[Plot: row categories (regularly visited, less than once a month, never visited) and column categories (2-10 YEARS, 10-20 YEARS, AT LEAST 20 YEARS); horizontal axis: First Principal Axis; vertical axis: Second Principal Axis]

Figure 5.3 : Ordinal Correspondence Plot of Table 2.9

The only significant bivariate moment is the linear-by-linear association, which has a zero P-value. This is as expected since the only significant row and column inertia is the linear component. Therefore, the longer a person stays in hospital the fewer visitors they receive. This linear-by-linear association can be seen in Figure 5.3.

5.5.4 Dream Data

Consider the contingency table given by Table 5.8, which was originally seen in Maxwell (1961) and analysed by Ritov & Gilula (1993). It cross-classifies 223 boys based on the age group to which they belong and the severity of disturbance of their dreams.

                          Age Group
             5-7    8-9   10-11   12-13   14-15   Total
Low   1        7     10      23      28      32     100
      2        4     15       9       9       5      42
      3        3     11      11      12       4      41
High  4        7     13       7      10       3      40
Total         21     49      50      59      44     223

Table 5.8 : Cross-classification of 223 Boys according to Age and Severity of Dream Disturbance

The Pearson chi-squared statistic for Table 5.8 is 31.67938, which has a P-value of 0.00155. Therefore, there is a strong association between the age of the child and the severity of the disturbance of their dreams. The total inertia of the contingency table is therefore 0.14206.
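The statistic, its P-value and the total inertia can be recomputed from the counts in Table 5.8. The sketch below is only an illustration: it takes the rows to be the four severity ratings and the columns to be the five age groups, as the table is printed, and it uses the closed form of the chi-squared tail probability that holds when the degrees of freedom are even. The function names are hypothetical.

```python
import math

# Counts from Table 5.8: rows are severity ratings 1 (Low) to 4 (High),
# columns are the age groups 5-7, 8-9, 10-11, 12-13 and 14-15.
table = [
    [7, 10, 23, 28, 32],
    [4, 15, 9, 9, 5],
    [3, 11, 11, 12, 4],
    [7, 13, 7, 10, 3],
]


def pearson_chi_squared(counts):
    """Return the Pearson chi-squared statistic and the sample size."""
    n = sum(sum(row) for row in counts)
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    statistic = 0.0
    for i, row in enumerate(counts):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    return statistic, n


def chi_squared_tail_even_df(x, df):
    """Upper tail probability of a chi-squared variable with even df."""
    if df % 2 != 0:
        raise ValueError("closed form used here requires even df")
    half = x / 2
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k          # (x/2)^k / k!
        total += term
    return math.exp(-half) * total


chi2, n = pearson_chi_squared(table)
df = (len(table) - 1) * (len(table[0]) - 1)
p_value = chi_squared_tail_even_df(chi2, df)
total_inertia = chi2 / n

print(f"chi-squared = {chi2:.3f}, df = {df}, P-value = {p_value:.5f}")
print(f"total inertia = {total_inertia:.5f}")
```

The total inertia is simply the statistic divided by the sample size of 223, which is how the value 0.14206 arises.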

Applying a doubly ordered correspondence analysis, the correspondence plot of Table 5.8 is given by Figure 5.4.

The partition of the total inertia into the row and column location, dispersion and error inertia are given in Table 5.9.

Consider the inertia values for the age groups. The row location component, with a P-value of 0.0002 is highly significant and the only significant source of variation for these categories. Therefore, the variation in the age group categories is due to the difference in their mean values.

The spread of the age group categories across the "Dream Disturbance Severity" variable is not a significant source of variation. These conclusions are evident from looking at the position of the row profile co-ordinates in the ordinal correspondence plot of Figure 5.4. The first principal axis represents 68.30% of the row variation, while the second principal axis accounts for 17.88% of the variation. Therefore, the two-dimensional plot visually accounts for 86.18% of the total variation in the row categories.

These conclusions can be verified by observing the relative position of the age group profile co-ordinates in Figure 5.4. It shows that the co-ordinates of the age group profiles are dominated by the first principal axis.

[Plot: age group categories (5-7, 8-9, 10-11, 12-13, 14-15) and dream disturbance ratings (1 to 4); horizontal axis: First Principal Axis]

Figure 5.4 : Two-Dimensional Ordinal Correspondence Plot of Table 5.8

                    Value      df   P-value
Row Inertia
  Location         0.09703      4   0.0002
  Dispersion       0.0254       4   0.2257
  Error            0.01963      4   0.3573
Column Inertia
  Location         0.10116      3   0.0001
  Dispersion       0.01722      3   0
  Error            0.02368      6   0.2793
Total Inertia      0.14206     12   0.0016

Table 5.9 : Row and Column Inertia Values for Table 5.8
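The entries of Table 5.9 can be checked for internal consistency. This sketch assumes that the row components and the column components each sum to the total inertia, and that the percentages quoted for the principal axes are the location and dispersion shares of that total; the dictionary names are illustrative.

```python
n = 223  # number of boys classified in Table 5.8

row_components = {"Location": 0.09703, "Dispersion": 0.0254, "Error": 0.01963}
column_components = {"Location": 0.10116, "Dispersion": 0.01722, "Error": 0.02368}
total_inertia = 0.14206

row_total = sum(row_components.values())
column_total = sum(column_components.values())

# Percentage of the row variation carried by each row component.
row_shares = {name: 100 * value / total_inertia
              for name, value in row_components.items()}

# Multiplying the total inertia by n recovers the chi-squared statistic.
chi_squared_total = n * total_inertia

print(f"row total = {row_total:.5f}, column total = {column_total:.5f}")
print(f"location share = {row_shares['Location']:.2f}%")
print(f"dispersion share = {row_shares['Dispersion']:.2f}%")
print(f"chi-squared = {chi_squared_total:.5f}")
```

Both partitions return the total inertia of 0.14206, and the location and dispersion shares match the 68.30% and 17.88% quoted for the two principal axes.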

Chapter 6

A Comparative Study of Different Scoring Schemes

6.1 Introduction

Beh (1997) and Chapter 5 discussed a correspondence analysis technique which involves partitioning the chi-squared statistic into bivariate moments so that one can obtain location, dispersion and higher order components for the row and column categories. These then offer an informative explanation of how categories compare. However, when Rayner & Best (1995), Beh (1997) and Chapter 5 considered such a partition they restricted their analysis to integer valued, or natural, scores 1, 2, 3, and so on.

The problem with such a scoring scheme is that it assumes that the categories, which we know to be ordered, are equally spaced. In general this may not be the case. Four common objective and subjective scoring methods are discussed in this chapter and a comparison of component values and profile co-ordinates in the correspondence plot is made. This chapter is based in part on the work presented by Beh (1998).

Often when comparing the ordinal correspondence plots obtained using two different scoring schemes, there appears to be a reflection and/or a rotation between the profile co-ordinates. Sections 6.8 and 6.9 consider these issues.

6.2 Equal Scores

Suppose a doubly ordered two-way contingency table, N, has two consecutive equal valued column scores $s_J(j)$ and $s_J(j+1)$ for $1 \le j < c$. Then $b_v(j) = b_v(j+1)$ and so:

$$Z_{uv} = \sqrt{n} \sum_{i=1}^{I} \left[ p_{i1}\, a_u(i)\, b_v(1) + \cdots + (p_{ij} + p_{i,j+1})\, a_u(i)\, b_v(j) + \cdots + p_{iJ}\, a_u(i)\, b_v(J) \right] \qquad (6.1)$$

Hence, the $Z_{uv}$ terms of (6.1) apply to a transformed data set where the columns with equal scores are combined to form a single column. The same argument can be made when a contingency table has equal scores associated with only the row categories, or when equal row scores and equal column scores are applied.

In general, construct a contingency table N' so that the row categories with equal scores are combined, and the column categories with equal scores are combined. If there are $k_1$ identical row scores, such that $2 \le k_1 \le I$, and $k_2$ identical column scores, such that $2 \le k_2 \le J$, then N' is an $I' \times J'$ contingency table, where $I' = I - (k_1 - 1)$ and $J' = J - (k_2 - 1)$. Gilula & Krieger (1989) suggested that when combining certain row and/or column categories the chi-squared statistic of (4.35) can then be further partitioned so that:

$$X^2_{(I-1)(J-1)} = X^2_{(I'-1)(J'-1)} + X^2_{(I-1)(J-1)-(I'-1)(J'-1)} \qquad (6.2)$$

The first term on the RHS of (6.2) can be partitioned by using orthogonal polynomials, and is the chi-squared statistic for N'. As N' is an $I' \times J'$ contingency table, the rank of $B_*$ is reduced from J to J', while the rank of $A_*$ is reduced from I to I'. The second term represents the difference between the profiles of combined categories. If this term is significant, then equal scores should not be applied to that particular data set. If this term is zero, then all rows combined are homogeneous, and so too are the combined columns.

Suppose there are $k_1$ identical row scores; then u in (6.1) will be at most I'. When $k_1 = I$, $B_*$ is a $c \times 1$ matrix with trivial unit elements. Thus, using such a scoring scheme for the correspondence analysis of Beh (1997) is not advised due to the triviality of $B_*$.

A further generalisation can be made when there are k lots of identical scores, with $m_\alpha$, for $\alpha = 1, \ldots, k$, being the number of identical scores in the $\alpha$'th lot. For example, k = 2 for the scoring scheme of the six row categories, 1, 1, 2, 2, 2, 3, with $m_1 = 2$ for the 1's and $m_2 = 3$ for the 2's. When there are k lots of identical row scores the number of rows in N' is

$$I' = I - \sum_{\alpha=1}^{k} (m_\alpha - 1).$$
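The construction of N' from lots of identical row scores can be sketched in a few lines. This is an illustration only: the counts in `table` are hypothetical, `merge_equal_scores` is a name introduced here, and the scores are assumed to be ordered so that identical scores are adjacent.

```python
from itertools import groupby


def merge_equal_scores(table, row_scores):
    """Combine consecutive rows of `table` that share the same score."""
    merged_table, merged_scores, start = [], [], 0
    for score, lot in groupby(row_scores):
        lot_size = len(list(lot))
        rows = table[start:start + lot_size]
        # Add the rows of the lot together, column by column.
        merged_table.append([sum(column) for column in zip(*rows)])
        merged_scores.append(score)
        start += lot_size
    return merged_table, merged_scores


# Scoring scheme from the text: six row categories scored 1, 1, 2, 2, 2, 3.
row_scores = [1, 1, 2, 2, 2, 3]
# Hypothetical 6 x 2 table of counts, used only to show the merging.
table = [[4, 6], [5, 5], [7, 3], [6, 4], [8, 2], [9, 1]]

merged_table, merged_scores = merge_equal_scores(table, row_scores)

# Number of rows in N' from the formula I' = I - sum(m_alpha - 1).
lot_sizes = [len(list(lot)) for _, lot in groupby(row_scores)]
rows_by_formula = len(row_scores) - sum(m - 1 for m in lot_sizes if m > 1)

print(merged_table, merged_scores, rows_by_formula)
```

For this scoring scheme the two lots have sizes 2 and 3, so the six rows collapse to 6 - (1 + 2) = 3 rows, matching the number of rows in the merged table.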

6.3 Approximately Equal Scores

Suppose that two consecutive column scores are approximately equal, for the g'th and h'th column categories, so that $s_J(g) \approx s_J(h) = s_J(g) + \varepsilon$ where $\varepsilon$ is very small. For example, two scores, say, 1 and 1.0001 are approximately equal where $s_J(g) = 1$ and $\varepsilon = 0.0001$.

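By substituting two approximately equal column scores, $s_J(g)$ and $s_J(h)$, into equations (4.21)-(4.23) and simplifying, then:

$$B_v = B'_v + \varepsilon\, p_{\cdot h}\, b^2_{v-1}(h)$$

$$C_v = C'_v + \varepsilon\, p_{\cdot h}\, b_{v-1}(h)\, b_{v-2}(h)$$

and

$$A_v^{-2} = \sum_{j=1}^{J'} p'_{\cdot j}\, s_J(j)^2\, b'^{\,2}_{v-1}(j) - (B'_v)^2 - (C'_v)^2 + 2\varepsilon\, p_{\cdot h} \left[ s_J(g)\, b^2_{v-1}(h) - B'_v\, b^2_{v-1}(h) - C'_v\, b_{v-1}(h)\, b_{v-2}(h) \right] + \varepsilon^2\, p_{\cdot h}\, b^2_{v-1}(h) \left[ 1 - p_{\cdot h}\, b^2_{v-1}(h) - p_{\cdot h}\, b^2_{v-2}(h) \right]$$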

where the superfix dashes refer to the values calculated for N'. Therefore, as $\varepsilon$ approaches zero, the orthogonal polynomials calculated for N approach the values of those calculated for N'. Note that Section 6.2 dealt with $\varepsilon = 0$ (identical scores). Therefore, the chi-squared statistic for N will approach the chi-squared statistic for N' as $\varepsilon$ approaches zero.

This analysis of approximately equal scores also applies to the row scores.
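The limiting behaviour can be illustrated numerically. The sketch below assumes that equations (4.21)-(4.23), which are not reproduced in this section, define the orthogonal polynomials through a three-term recurrence of the usual kind, $b_v(j) = A_v[(s_J(j) - B_v)\, b_{v-1}(j) - C_v\, b_{v-2}(j)]$. The column proportions and scores are hypothetical, and `orthonormal_polynomials` and `max_gap` are names introduced for the example.

```python
import math


def orthonormal_polynomials(scores, weights, max_order):
    """Return [b_0, ..., b_max_order] from the assumed three-term recurrence."""
    size = len(scores)
    previous = [0.0] * size   # b_{-1}, taken as zero
    current = [1.0] * size    # b_0, the trivial polynomial
    polynomials = [current]
    for _ in range(max_order):
        b_coef = sum(p * s * b1 * b1
                     for p, s, b1 in zip(weights, scores, current))
        c_coef = sum(p * s * b1 * b2
                     for p, s, b1, b2 in zip(weights, scores, current, previous))
        norm_sq = (sum(p * s * s * b1 * b1
                       for p, s, b1 in zip(weights, scores, current))
                   - b_coef ** 2 - c_coef ** 2)
        a_coef = 1.0 / math.sqrt(norm_sq)
        new = [a_coef * ((s - b_coef) * b1 - c_coef * b2)
               for s, b1, b2 in zip(scores, current, previous)]
        polynomials.append(new)
        previous, current = current, new
    return polynomials


# Hypothetical column proportions for N; columns 2 and 3 are merged in N'.
weights_full = [0.10, 0.20, 0.30, 0.25, 0.15]
weights_merged = [0.10, 0.50, 0.25, 0.15]
scores_merged = [1.0, 2.0, 3.0, 4.0]
kept_columns = [0, 1, 3, 4]   # columns of N that correspond to columns of N'


def max_gap(epsilon, max_order=2):
    """Largest difference between the polynomials of N and those of N'."""
    scores_full = [1.0, 2.0, 2.0 + epsilon, 3.0, 4.0]
    full = orthonormal_polynomials(scores_full, weights_full, max_order)
    merged = orthonormal_polynomials(scores_merged, weights_merged, max_order)
    return max(abs(full[v][j] - merged[v][k])
               for v in range(max_order + 1)
               for k, j in enumerate(kept_columns))


for epsilon in (1e-1, 1e-3, 1e-6):
    print(epsilon, max_gap(epsilon))
```

Under these assumptions the gap between the two sets of polynomial values shrinks as the perturbation is reduced, and it vanishes (up to rounding) when the perturbation is exactly zero, which is the identical-scores case of Section 6.2.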

6.4 Scoring Methods

The problem of determining the difference between categories of a variable, or of scaling, has long been investigated. Many of the multi-dimensional analytic techniques deal with this problem. Those who have contributed to the calculation of scales, especially those of an ordinal nature, include Bradley, Katti and Coons (1962), who described a method of scaling for categorical data which maximises the variation of the categories. Armitage (1955) discussed methods of measuring the ordinal trend of categories. Becker and Clogg (1989) discussed the use of different scores for Goodman's RC model of (2.60).

There are many different scoring methods available for ordered categories that are applicable to correspondence analysis, such as those discussed by Parsa & Smith (1993), Ritov & Gilula (1993) and Schriever (1983). These scoring schemes will not be considered as they require some technical computations. Instead four relatively simple and more popular scoring techniques will be briefly investigated. They are (i) natural
