Reprinted from Interchange, Vol. 12, Winter, 1981, with permission.
Follow Through is a large compensatory education program that operated in scores of communities across the United States throughout the seventies and that continues, on a reduced scale, today. During its most active phase, it was conducted as a massive experiment, involving planned variation of education approaches and collection of uniform data at each site. The main evaluation of outcomes was carried out by Abt Associates, Inc. (a private consulting firm, based in Cambridge, Massachusetts) on the second and third cohorts of children who reached third grade in the program, having entered in kindergarten or first grade. In a series of voluminous reports, Abt Associates presented analyses indicating that among the various education approaches tried, only those emphasizing "basic skills" showed positive effects when compared to Non-Follow Through treatments. House, Glass, McLean, and Walker (1978a) published a critique of the Abt Associates evaluation, along with a small reanalysis that found essentially no significant differences in effectiveness among the planned variations in educational approaches. Because of the great social importance attached to educational programs for disadvantaged groups and because no other large-scale research on the topic is likely to materialize in the near future, the Follow Through experiment deserves continuing study. The study reported here is an attempt, through more sharply focused data analysis, to obtain a more definitive answer to the question of whether different educational approaches led to different achievement outcomes.
Is it possible that the Follow Through planned variation experiment
has yielded no findings of value? Is it possible, after years of
effort and millions of dollars spent on testing different approaches,
that we know nothing more than we did before about ways to educate
disadvantaged children? This is the implicit conclusion of the widely
publicized critique by House, Glass, McLean, and Walker (1978a,
1978b). House et al found no evidence that the various Follow Through
models differed in effectiveness from one another or from Non-Follow
Through programs. The only empirical finding House et al were willing
to credit was that there was great variation in results from one
Follow Through site to another. This conclusion, as we shall show, is
no more supportable than the conclusions House et al rejected.
Accordingly, if we were to follow House, Glass, McLean, and Walker's
lead, we should have to conclude that there are no substantive
findings to be gleaned from the largest educational experiment ever
conducted.
It would be a serious mistake, however, to take the critique by House
et al as any kind of authoritative statement about what is to be
learned from Follow Through. The committee assembled by House was
charged with reviewing the Abt Associates evaluation of Follow
Through (Stebbins et al, 1977), not with carrying out an inquiry of
their own. More or less, the committee stayed within the limits of
this charge, criticizing a variety of aspects of the design,
execution, and data analysis of the experiment. Nowhere in their
report do the committee take up the constructive problem that Abt
Associates had to face or that any serious inquiries will have to
face. Given the weaknesses of the Follow Through experiment, how can
one go about trying to extract worthwhile findings from it?
In this paper we try to deal constructively with one aspect of the
Follow Through experiment: the comparison of achievement test results
among the various sponsored approaches. We try to show that if this
comparison is undertaken with due cognizance of the limitations of
the Follow Through experiment, it is possible to derive some strong,
warranted, and informative conclusions. We do not present our
research as a definitive, and certainly not as a complete, inquiry
into Follow Through results. We do hope to show, however, that the
conclusion implied by the House committee (that the Follow Through
experiment is too flawed to yield any positive findings) is gravely
mistaken.
Delimiting the Problem
Although Project Follow Through has numerous shortcomings as an
experiment, the seriousness of these shortcomings varies greatly
depending on what questions are asked of the data. One shortcoming
was in the outcome measures used, particularly in their limited range
compared to the range of objectives pursued by Follow Through
sponsors. The House committee devotes the largest part of its
critique to this shortcoming, although it is a shortcoming that
limits only the range of conclusions that may be drawn. House et al
allow, for instance, that the Metropolitan Achievement Test was
"certainly a reasonable choice for the material it covers" (1978a, p.
138). Accordingly, Follow Through's shortcomings as to outcome
measures ought not to stand in the way of answering questions that
are put in terms appropriate to the measures that were used.
Another shortcoming, recognized by all commentators on Follow
Through, is the lack of strictly comparable control groups. Follow
Through and Non-Follow Through groups at the same site differed from
one another in uncontrolled and only partly measurable ways, and the
differences themselves varied from site to site. This circumstance
makes it difficult to handle questions having to do with whether
children benefited from being in Follow Through, because such
questions require using Non-Follow Through data as a basis for
inferring how Follow Through children would have turned out had they
not been in Follow Through.
Much of the bewildering complexity of the Abt Associates' analyses
results from attempts to make up statistically for the lack of
experimental comparability. We do not intend to examine those
attempts except to note one curiosity. The difficulty of evaluating
"benefits" holds whether one is asking about the effects of Follow
Through as a whole, the effects of a particular model, or the effect
of a Follow Through program at a single site. The smaller the unit,
however, the more vulnerable the results are likely to be to a
mismatch between Follow Through and Non-Follow Through groups. On the
one hand, to the extent that mismatches are random, they should tend
to average out in larger aggregates. On the other hand, at a
particular site, the apparent success or failure of a Follow Through
program could depend entirely on a fortuitously favorable or
unfavorable match with a Non-Follow Through group.
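The averaging argument can be illustrated with a small simulation. The sketch below is illustrative only. It assumes that mismatch errors are independent across sites, normally distributed, and of equal variance, none of which is established for the Follow Through data. The function name simulate_mismatch and all parameter values are hypothetical.

```python
import random
import statistics

def simulate_mismatch(n_sites, n_trials=2000, mismatch_sd=1.0, seed=0):
    """Return (SD of one site's mismatch error, SD of the average over n_sites)."""
    rng = random.Random(seed)
    single, averaged = [], []
    for _ in range(n_trials):
        # One purely random mismatch error per site (assumed independent).
        errors = [rng.gauss(0.0, mismatch_sd) for _ in range(n_sites)]
        single.append(errors[0])                # error in a single-site comparison
        averaged.append(sum(errors) / n_sites)  # error in the model-level average
    return statistics.pstdev(single), statistics.pstdev(averaged)

single_sd, avg_sd = simulate_mismatch(n_sites=10)
print(f"single site: {single_sd:.2f}, ten-site average: {avg_sd:.2f}")
```

Under these assumptions the error in a ten-site average is roughly one third the size of the error in any single-site comparison, which is why a site-by-site tally cannot be more trustworthy than the average it is drawn from.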
For unknown reasons, both the Abt Associates and the House committee
analysts have assumed the contrary of the point just made. While
acknowledging, for instance, that the prevalence of achievement test
differences in favor of Non-Follow Through groups could reflect
mismatch, they are able to make with confidence statements like
"Seven of the ten Direct Instruction sites did better than the
comparison classes but three of the Direct Instruction sites did
worse" (House et al, 1978a, p. 154). Such a statement is nonsense
unless one believes that at each of the ten sites a valid comparison
between Follow Through and Non-Follow Through groups could be made.
But if House et al believe that, how could they then believe that the
average of those ten comparisons is invalid? This is like arguing
that IQ tests give an invalid estimate of the mean intelligence level
of disadvantaged children and then turning around and using those
very tests to classify individual disadvantaged children as
retarded.
There is an important class of questions that may be investigated,
however, without having to confront the problem of comparability
between Follow Through and Non-Follow Through groups. These are
questions involving the comparison of Follow Through models with one
another. A representative question of this kind would be: how did the
Follow Through models compare with one another in reading achievement
test scores at the end of third grade? There are problems in
answering such a question, but the lack of appropriate control groups
is not one of them. We can, if we choose, simply ignore the
Non-Follow Through groups in dealing with questions of this sort.
Questions about the relative performance of different Follow Through
models are far from trivial. The only positive conclusions drawn by
Abt Associates relate to questions of this kind, and the House
committee's report is largely devoted to disputing those conclusions,
that is, disputing Abt's conclusions that Follow Through models
emphasizing basic skills achieved better results than others in basic
skills and in self-concept. The models represented in Follow Through
cover a wide range of educational philosophies and approaches to
education. Choose any dimension along which educational theories
differ and one is likely to find Follow Through models in the
neighborhood of each extreme. This is not to say that the Follow
Through models are so well distinguished that they provide clean
tests of theoretical issues in education. But the differences that
are there (like, for instance, the difference between an approach
based on behavior modification principles and an approach modeled on
the English infant school) offer at least the possibility of finding
evidence relevant to major ideological disputes within education.
Unscrambling the Methodology
The Abt Associates analysts were under obligation to try to answer
the whole range of questions that could be asked about Follow Through
effects. In order to do this in a coherent way, they used one kind of
statistic that could be put to a variety of uses. This is the measure
they called "effect size," an adjusted mean difference between the
Follow Through and Non-Follow Through subjects at a site. Without
getting into the details of how effect size was computed, we may
observe that this measure is more suitable for some purposes than for
others. For answering questions about benefits attributable to Follow
Through, some such measure as effect size is necessary. For comparing
one Follow Through model with another, however, the effect size
statistic has the significant disadvantage that unremoved error due
to mismatch between a Follow Through and Non-Follow Through group is
welded into the measure itself. As we noted in the preceding section,
comparisons of the effectiveness of Follow Through models with one
another do not need to involve Non-Follow Through data. Because
effect size measures will necessarily include some error due to
mismatch (assuming that covariance adjustments cannot possibly remove
all such error), these measures will contain "noise" that can be
avoided when making comparisons among Follow Through models.
The Abt Associates analysts used several different ways of computing
effect size, the simplest of which is called the "local" analysis.
This method amounts to using the results for each cohort of subjects
at each site as a separate experiment, carrying out a covariance
analysis of Follow Through and Non-Follow Through differences as if
no other sites or cohorts existed. Although this analysis has a
certain elegance, it clearly does not take full advantage of the
information available; the "pooled" analysis used by Abt, which uses
data on whole cohorts to calculate regression coefficients and at the
same time includes dummy variables to take care of site-specific
effects, is much superior in this respect. The House committee,
however, chose to use effect size measures based on the "local"
analysis in their own comparison of models. In doing so, they used
the least powerful of the Abt effect size measures, all of which are
weakened (to unknown degrees) by error due to mismatch.
In their comparisons of Follow Through models, Abt Associates
analysts calculated the significance of effects at different sites,
using individual subjects at the sites as the units of analysis, and
then used the distribution of significant positive and negative
effects as an indicator of the effectiveness of the models. The House
committee argued, on good grounds we believe, that the appropriate
unit of analysis should have been sites rather than individual
children. To take only the most obvious argument on this issue, the
manner of implementing a Follow Through model is a variable of great
presumptive significance, and it is most reasonably viewed as varying
from site to site rather than from child to child. Having made this
wise decision, however, the House committee embarked on what must be
judged either an ill-considered or an excessively casual reanalysis
of Follow Through data. Although the reanalysis of data by the House
committee occupies only a small part of their report and is presented
by them with some modesty, we believe their reanalysis warrants
severe critical scrutiny. Without that reanalysis, the House
committee's report would have amounted to nothing more than a call
for caution in interpreting the findings of the Abt Associates
analysts. With the reanalysis, the House committee seems to be
declaring that there are no acceptable findings to be interpreted.
Thus a great deal hinges on the credibility of their reanalysis.
Let us therefore consider carefully what the House committee did in
their reanalysis. First, they used site means rather than individual
scores as the unit of analysis. This decision automatically reduced
the Follow Through planned variation experiment from a very large
one, with an N of thousands, to a rather small one, with an N in the
neighborhood of one hundred. As previously indicated, we endorse this
decision. However, it seems to us that when one has opted to convert
a large experiment into a small one, it is important to make certain
adjustments in strategy. This the House committee failed to do. If an
experiment is very large, one can afford to be cavalier about
problems of power, since the large N will presumably make it possible
to detect true effects against considerable background noise. In a
small experiment, one must be watchful and try to control as much
random error as possible in order to avoid masking a true effect.
However, instead of trying to perform the most powerful analysis
possible in the circumstances, the House committee weakened their
analysis in a number of ways that seem to have no warrant. First,
they chose to compare Follow Through models on the basis of Follow
Through/Non-Follow Through differences, thus unnecessarily adding
error variance associated with the Non-Follow Through groups. Next,
they chose to use adjusted differences based on the "local" analysis,
thus maximizing error due to mismatch. Next, they based their
analysis on only a part of the available data. They excluded data
from the second kindergarten-entering cohort, one of the largest
cohorts, even though these data formed part of the basis for the
conclusions they were criticizing. This puzzling exclusion reduced
the number of sites considered, thus reducing the likelihood of
finding significant differences. Finally, they divided each
effect-size score by the standard deviation of test scores in the
particular cohort in which the effect was observed. This manipulation
served no apparent purpose. Minor though its effects may be, they
would be in the direction of adding further error variance to the
analysis.
The upshot of all these methodological choices was that, while the
House group's reanalysis largely confirmed the ranking of models
arrived at by Abt Associates, it showed the differences to be small
and insignificant. Given the House committee's methodology, this
result is not surprising. The procedures they adopted were not biased
in the sense of favoring one Follow Through model over another; hence
it was to be expected that their analysis, using the same effect
measures as Abt, would replicate the rankings obtained by Abt. (The
rank differences shown in Table 7 of the House report are probably
mostly the result of the House committee's exclusion of data from one
of the cohorts on which the Abt rankings were based.) On the other
hand, the procedures adopted by the House committee all tended in the
direction of maximizing random error, thus tending to make
differences appear small and insignificant.
The analysis to be reported here is of the same general type as that
carried out by the House committee. Like the House committee, we use
site means rather than scores for individuals as the unit of
analysis. The differences in procedure all arise from our effort to
minimize random error and thus achieve the most powerful analysis
possible. The following are the main differences between our analysis
and the House et al analysis:
1. We used site means for Follow Through groups as the dependent variable, using other site-level scores as covariates. The House committee used locally adjusted site-level differences between Follow Through and Non-Follow Through groups as the dependent variable, with covariance adjustments having been made on an individual basis. Our procedure appears to have been endorsed in advance by the House committee. They state: "For the sake of both inferential validity and proper covariance adjustment, the classroom is the appropriate unit of analysis" (House et al, 1978a, p. 153). While the House committee followed their own prescription in using site-level scores as dependent variables, they failed to follow it when it came to covariance adjustments.
2. When we used Non-Follow Through scores, we entered them as covariates along with other covariates. The procedure adopted by the House committee amounted, in effect, to arbitrarily assigning Non-Follow Through mean scores a regression weight of 1 while giving all other variables empirically determined regression weights. We could not see any rational basis for such a deviation from ordinary procedures for statistical adjustment.
3. We combined all data from one site as a single observation, regardless of cohort. The House committee appear to have treated different cohorts from the same site as if they were different sites. This seemed to us to violate the rationale for analyzing data at the site level in the first place.
4. We restricted the analysis to models having data on 6 or more sites. To include in the analysis models having as few as 2 sites, as the House committee did, would, it seemed to us, reduce the power of the statistical tests to an absurd level.
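The second point in the list above, that using a Follow Through/Non-Follow Through difference score amounts to fixing the regression weight of the NFT mean at 1, can be sketched numerically. The data below are synthetic, and the "true" weight of 0.4 is an arbitrary assumption chosen for illustration, not an estimate from Follow Through. The helper names fit_slope and residual_variance are hypothetical.

```python
import random
import statistics

def fit_slope(x, y):
    """Least-squares slope of y on x (intercept fitted implicitly)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residual_variance(x, y, slope):
    """Variance of y - slope * x around its own mean."""
    resid = [b - slope * a for a, b in zip(x, y)]
    return statistics.pvariance(resid)

rng = random.Random(1)
# Synthetic site means: NFT scores, and FT scores that depend on them
# with an assumed weight of 0.4 plus independent noise.
nft = [rng.gauss(50, 10) for _ in range(60)]
ft = [30 + 0.4 * v + rng.gauss(0, 5) for v in nft]

slope_hat = fit_slope(nft, ft)
var_fixed = residual_variance(nft, ft, 1.0)         # difference score FT - NFT
var_fitted = residual_variance(nft, ft, slope_hat)  # empirically determined weight

print(f"estimated weight: {slope_hat:.2f}")
print(f"error variance, weight fixed at 1: {var_fixed:.1f}")
print(f"error variance, fitted weight:     {var_fitted:.1f}")
```

Because least squares minimizes the residual variance, the fitted weight can never leave more unexplained variance than the fixed weight of 1. How much is lost by fixing the weight depends on how far the true weight is from 1, which this sketch simply assumes.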
The data analysis that followed from the above-mentioned decisions
was quite straightforward and conventional. The dependent variable
was always the mean score for a site on one or more Metropolitan
Achievement Test subtests, averaged over all subjects in cohorts II
and III for whom data were reported in the Abt Associates reports.
Models, which were represented by between 6 and 12 sites each, were compared
by analysis of covariance, using some or all of the following
covariates:
SES: An index of socio-economic status calculated by Abt for each
cohort at each site. When more than one cohort represented a site, an
n-weighted mean was computed.
EL: An index of ethnic and linguistic difference from the
mainstream, treated in a manner similar to SES.
WRAT: Wide Range Achievement Test, administered to Follow Through
students near the time of entry. Taken as a general measure of academic
readiness.
NFT: Mean score of local Non-Follow Through students on the dependent
variable under analysis. As a covariate, NFT scores may be expected
to control for unmeasured local or regional characteristics affecting
scholastic achievement.
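The n-weighted mean used to combine cohorts at a site is simply each cohort's value weighted by its number of subjects. A minimal sketch follows. The function name and the example values are hypothetical and are not taken from the Abt reports.

```python
def n_weighted_mean(values, ns):
    """Combine cohort-level values into one site-level value, weighting by cohort n."""
    if not values or len(values) != len(ns):
        raise ValueError("values and ns must be non-empty and of equal length")
    total_n = sum(ns)
    if total_n <= 0:
        raise ValueError("total n must be positive")
    return sum(v * n for v, n in zip(values, ns)) / total_n

# Hypothetical site with two cohorts: index values 10.0 and 20.0,
# based on 30 and 10 children respectively.
site_ses = n_weighted_mean([10.0, 20.0], [30, 10])
print(site_ses)
```

The larger cohort pulls the site value toward its own index, so the combined figure reflects the children actually tested rather than treating cohorts as equal.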
Two other covariates were tried to a limited extent: Raven
Progressive Matrices scores (which, though obtained after rather than
before treatment, might be regarded as primarily reflecting
individual differences variance not affected by treatment) and a
score indicating the number of years of Follow Through treatment
experienced by subjects at a site (most Follow Through groups entered
in kindergarten, thus receiving four years of Follow Through
treatment; but some entered in first grade and received only three
years). Our overall strategy for use of analysis of covariance was as
follows: recognizing that reasonable cases could be made for and
against the use of this covariate or that, we would try various
combinations and, in the end, would take seriously only those results
that held up over a variety of reasonable covariate sets.
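A site-level analysis of covariance of this general kind can be expressed as a comparison of two regression models: one with covariates only, and one that adds dummy variables for the Follow Through models. The sketch below is an assumption about form, not a reconstruction of the computations actually performed. It uses synthetic data, one covariate, and three hypothetical models labelled A, B, and C, and it omits the Newman-Keuls follow-up tests.

```python
import random

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def residual_ss(rows, y):
    """Residual sum of squares from a least-squares fit of y on the design rows."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, t in zip(rows, y))

def ancova_f(y, covariates, groups):
    """F statistic for group differences after adjusting for covariates."""
    labels = sorted(set(groups))
    reduced = [[1.0] + list(c) for c in covariates]  # intercept + covariates
    # The full design adds a dummy column for every group except the first.
    full = [r + [1.0 if g == lab else 0.0 for lab in labels[1:]]
            for r, g in zip(reduced, groups)]
    ss_reduced, ss_full = residual_ss(reduced, y), residual_ss(full, y)
    df_groups = len(labels) - 1
    df_error = len(y) - len(full[0])
    f_stat = ((ss_reduced - ss_full) / df_groups) / (ss_full / df_error)
    return f_stat, df_groups, df_error

rng = random.Random(7)
effects = {"A": 1.0, "B": 0.0, "C": -1.0}  # hypothetical model effects
groups, covariates, y = [], [], []
for label, effect in effects.items():
    for _ in range(8):  # eight hypothetical sites per model
        ses = rng.gauss(0.0, 1.0)
        groups.append(label)
        covariates.append([ses])
        y.append(2.0 * ses + effect + rng.gauss(0.0, 0.5))

f_stat, df_groups, df_error = ancova_f(y, covariates, groups)
print(f"F({df_groups}, {df_error}) = {f_stat:.1f}")
```

Trying "various combinations" of covariates, as described above, corresponds to rerunning such a comparison with different columns in the covariate list and checking whether the model differences persist.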
Results
Differences in achievement test performance. Two analyses of
covariance will be reported here, with others briefly summarized.
Figure 1 displays adjusted and standardized means from what we call
the "full" analysis of covariance, that is, an analysis using the four
main covariates (SES, EL, WRAT, and NFT) described in the preceding
section. The virtue of this analysis is that it controls for all the
main variables that previous investigators have tried, in one way or
other, to control for in comparing Follow Through models.
Table 1* notes pair-wise differences which are significant at the .05
level by Newman-Keuls tests.
Figure 2* and Table 2* show comparable data for what we call the
"conservative" analysis.
This analysis is conservative in the sense that it eliminates
covariates for which there are substantial empirical and/or rational
grounds for objection. Grounds for objecting to the NFT variable as a
covariate have been amply documented in Abt reports and echoed in the
report of the House committee (House et al, 1978a); they will not be
repeated here. Use of WRAT as a covariate has been objected to on
grounds that it is not, as logically required, antecedent to
treatment (Becker & Carnine, Ref. Note 1); that is, the WRAT,
though nominally a pretest, was in fact administered at a time when
at least one of the models had already purportedly taught a
significant amount of the content touched on by the WRAT. While we
would not suppose the SES and EL variables to be above reproach, we
have not encountered criticisms suggesting their use would seriously
bias results, whereas not to control for these variables would
unquestionably leave the results biased in favor of models serving
less disadvantaged populations. Accordingly, we have chosen them as
the conservative set of covariates.
Other analyses, not reported, used different combinations of
covariates from among those mentioned in the preceding section. In
every case, these analyses yielded adjusted scores intermediate
between those obtained from the "full" and the "conservative"
analyses. Consequently, the results shown in Figures 1 and 2 may be
taken to cover the full range of those observed.
In every analysis, differences between models were significant at or
beyond the .05 level on every achievement variable, almost all beyond
the .01 level. As Figures 1 and 2 show, models tended to perform
about the same on every achievement variable. Thus there is little
basis for suggesting that one model is better at one thing, another
at another.
The relative standing of certain models, particularly the Tucson
Early Education Model, fluctuated considerably depending on the
choice of covariates.1 Two models, however, were at or near the top
on every achievement variable, regardless of the covariates used;
these were Direct Instruction and Behavior Analysis. Two models were
at or near the bottom on every achievement variable, regardless of
the covariates used; these were the EDC Open Education Model and
Responsive Education. Differences between the two top models and the
two bottom models were in most cases statistically significant by
Newman-Keuls tests.
Variability between sites. The only empirical finding that the House
committee was willing to credit was that there was enormous
variability of effects from site to site within Follow Through
models. In their words: "Particular models that worked well in one
town worked poorly in another. Unique features of the local settings
had more effect on achievement than did the models" (House et al,
1978a, p. 156). This conclusion has recently been reiterated by
the authors of the Abt evaluation report (St. Pierre, Anderson,
Proper, & Stebbins, 1978) in almost the same words.
The ready acceptance of this conclusion strikes us as most puzzling.
It is conceivable that all of the variability between sites within
models is due to mismatch between Follow Through and Non-Follow
Through groups. This is unlikely, of course, but some of the
variability between sites must be due to this factor, and unless we
know how much, it is risky to make statements about the real
variability of effects. Furthermore there is, as far as we are aware,
no evidence whatever linking achievement to "unique features of the
local setting." This seems to be pure conjecture: a plausible
conjecture, no doubt, but not something that should be paraded as an
empirical finding.
Our analyses provide some basis for looking at the between-site
variability question empirically. Follow Through sites varied
considerably in factors known to be related to
achievement: socioeconomic status, ethnic composition, WRAT pretest
scores, etc. To say that the variance in achievement due to these
factors was greater than the variance due to model differences may be
true but not very informative. It amounts to nothing more than the
rediscovery of individual differences and is irrelevant to the
question of how much importance should be attached to variation among
Follow Through models. To say that differences in educational method
are trivial because their effects are small in comparison to the
effect of demographic characteristics is as absurd as saying that
diet is irrelevant to children's weight because among children weight
variations due to diet are small in comparison to weight variations
due to age.
Figure 1*
Standardized adjusted mean Metropolitan Achievement Test scores
obtained from "full" covariance analysis (rounded to the nearest even
tenth).
The variability issue may be more cogently formulated as follows:
considering only the variance in achievement that cannot be accounted
for by demographic and other entering characteristics of students,
what part of that variance can be explained by differences in Follow
Through models and what part remains unexplained? Our analyses
provide an approximate answer to this question, since covariance
adjustments act to remove variance among sites due to entering
characteristics. Depending on the achievement test variable
considered and on the covariates used, we found model differences to
account for roughly between 17 and 55 per cent of the variance not
attributable to covariates (as indexed by ω²).
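The exact formula used for this index is not given here. One common estimator of omega squared for a one-way fixed-effects design, shown only as an assumption about the form intended, is the following, where k is the number of models compared and the sums of squares are taken to be those remaining after covariance adjustment:

```latex
\hat{\omega}^2 =
  \frac{SS_{\text{models}} - (k - 1)\, MS_{\text{error}}}
       {SS_{\text{total}} + MS_{\text{error}}}
```

On this reading, the index is the estimated proportion of covariate-adjusted variance among site means that is associated with model membership.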
Figure 2*
Standardized adjusted mean Metropolitan Achievement Test scores
obtained from "conservative" covariance analysis (rounded to the
nearest even tenth).
These results are shown graphically in Figures 1 and 2. Adjusted mean
scores are displayed there in units of the standard deviation of
residual site means. Thus, to take the most extreme case, in Figure 2
the adjusted mean score of Direct Instruction sites on Language Part
B is 3.6 standard deviations above the adjusted mean score of EDC
Open Education sites, that is, 3.6 standard deviations of between-site
residual variability; in other words, an enormous difference compared
to differences between sites within models. That is the most extreme
difference, but in no case is the adjusted difference between highest
and lowest model less than 1.4 standard deviations. Although what
constitutes a "large" effect must remain a matter of judgment, we
know of no precedent according to which treatment effects of this
size could be considered small in relation to the unexplained
variance.
Treatment effects on other variables. Although the principal concern
of this study was with achievement test differences, the method of
analysis is adaptable to studying differences in other outcomes as
well. Accordingly we ran several briefer analyses, looking at what
Abt Associates call "cognitive / conceptual" and "affective"
outcomes.
Two kinds of measures used in the Follow Through evaluation were
regarded by Abt Associates as reflecting "cognitive / conceptual"
outcomes: Raven's Progressive Matrices (a nonverbal intelligence
test) and several Metropolitan subtests judged to measure indirect
cognitive consequences of learning. The House committee objected to
Progressive Matrices on grounds that it is insensitive to school
instruction. This rather begs the question of effects of
cognitively-oriented teaching, however. True, Progressive Matrices
performance may be insensitive to ordinary kinds of school
instruction, but does that mean it will be insensitive to novel
instructional approaches claiming to be based on cognitive theories
and declaring such objectives as "the ability to reason" and "logical
thinking skills in four major cognitive areas (classification,
seriation, spatial relations and temporal relations)"? It seems that
this should be an empirical question.
If it is an empirical question, the answer is negative. Using the
same kinds of covariance analyses as were used on the achievement
test variables, we found no statistically significant differences
between Follow Through models in Progressive Matrices performance.
This finding is consistent with the Abt Associates' analyses, which
show few material effects on this test, and more negative than
positive ones.
Among Metropolitan subtests the most obviously "cognitive" are
Reading (which is, in effect, paragraph comprehension) and
Mathematics Problem-Solving. As indicated in Figures 1 and 2, our
analyses show differences among models on these subtests that are
similar in trend to those found on the other subtests. They tend,
however, to be of lesser magnitude. The most obvious explanation for
the lesser magnitude of difference on these subtests is the same as
that offered by House et al for the absence of differences on
Progressive Matrices: that these subtests, reflecting more general
differences in intellectual ability, are less sensitive to
instruction. There is, however, a further hypothesis that should be
tested. Conceivably, certain models (let us say those that avowedly
emphasize "cognitive" objectives) are doing a superior job of teaching
the more cognitive aspects of reading and mathematics, but the
effects are being obscured by the fact that performance on the
appropriate subtests depends on mechanical proficiency as well as on
higher-level cognitive capabilities. If so, these hidden effects
might be revealed by using performance on the more "mechanical"
subtests as covariates.
This we did. Model differences in Reading (comprehension) performance
were examined, including Word Knowledge as a covariate. Differences
in Mathematics Problem Solving were examined, including Mathematics
Computation among the covariates. In both cases the analyses of
covariance revealed no significant differences among models. This is
not a surprising result, given the high correlation among
Metropolitan subtests. Taking out the variance due to one subtest
leaves little variance in another. Yet it was not a foregone
conclusion that the results would be negative. If the models that
proclaimed cognitive objectives actually achieved those objectives,
it would be reasonable to expect those achievements to show up in our
analyses.
The same holds true for performance on the affective measures
included in the Follow Through evaluation. The Abt Associates'
analyses show that the ranking of models on affective measures
corresponds closely to their ranking on achievement measures. House
et al point out, however, that the instruments used place heavy
demands on verbal skills. Conceivably, therefore, if reading ability
were controlled statistically, the results might tell a different
story. We analyzed scores on the Coopersmith Self-Concept Inventory,
including reading subtest scores along with the other covariates. The
result showed no significant difference among models on the
Coopersmith. This finding could mean either that there are no
differences between models in effects on self-concept or that
self-concept among disadvantaged third-graders is sufficiently
dependent on reading ability that, when one statistically removes
reading ability differences, one at the same time removes genuine
self-concept differences. We know of no way to resolve this ambiguity
with the available data. One thing is clear, however: removing
effects due to reading achievement does not in any way yield results
either favoring models that emphasize self-concept or disfavoring
models that emphasize academic objectives.
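The second reading of this ambiguity can be illustrated with a small simulation. It is a sketch with synthetic data, not a reanalysis: the two groups, the effect sizes, and the helper name group_coefficient are assumptions chosen for the example. Self-concept is generated so that it depends on group membership only through reading ability; the raw group difference is then substantial, yet it largely disappears once reading is entered as a covariate.

```python
import numpy as np


def group_coefficient(y, group, covariate=None):
    """Least-squares coefficient on a 0/1 group indicator, optionally adjusted."""
    columns = [np.ones(len(y)), np.asarray(group, dtype=float)]
    if covariate is not None:
        columns.append(np.asarray(covariate, dtype=float))
    design = np.column_stack(columns)
    beta, *_ = np.linalg.lstsq(design, np.asarray(y, dtype=float), rcond=None)
    return float(beta[1])


rng = np.random.default_rng(0)
n = 2000  # hypothetical number of children per group
group = np.repeat([0.0, 1.0], n)

# Group 1 reads better by one standard deviation (assumed effect).
reading = group + rng.normal(0.0, 1.0, size=2 * n)
# Self-concept depends on reading only; there is no direct group effect.
self_concept = 0.8 * reading + rng.normal(0.0, 1.0, size=2 * n)

unadjusted = group_coefficient(self_concept, group)
adjusted = group_coefficient(self_concept, group, covariate=reading)
print(f"unadjusted difference: {unadjusted:.2f}")
print(f"adjusted for reading:  {adjusted:.2f}")
```

An adjusted coefficient near zero is also what one would obtain if the groups did not differ in self-concept at all, so the adjusted result by itself cannot separate the two interpretations.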
Discussion
Before attempting to give any interpretation of Follow Through
results, we must emphasize the main finding of our study: that there
were results. Follow Through models were found to differ
significantly on every subtest of the Metropolitan Achievement
Test.
Let us briefly compare our findings with those of Abt Associates and
the House committee.
1. We disagree with both Abt and House et al in that we do not find
variability among sites to be so great that it overshadows
variability among models. It appears that a large part of the
variability observed by Abt and House et al was due to demographic
factors and experimental error. Once this variability is brought
under control, it becomes evident that differences between models are
quite large in relation to the unexplained variability within
models.
2. Our findings on the ranking of Follow Through models on
achievement variables are roughly in accord with those of the House
committee, but we differ from the House committee in finding
significant differences among models on all achievement variables
whereas they found almost none. The similarities are no doubt due to
the fact that the two analyses used the same basic units: site-level
means. The difference in significance of outcomes is apparently due
to the variety of ways (previously discussed) in which our analysis
was more powerful than theirs.
3. The Abt Associates' results indicate that among major Follow
Through models, there is only one "winner" in the sense of having a
preponderance of positive effects-namely, Direct Instruction. All
other models showed predominately null or negative effects. Our
results are not exactly comparable in that we compared Follow Through
models only with one another and not with Non-Follow Through groups;
consequently we cannot speak of "positive" or "negative" effects.
However, our results show two models to be above average on all
achievement subtests and two models to be below average on all
subtests. Thus our results may be said to indicate two
"winners" (Direct Instruction and Behavior Analysis) and two
"losers" (EDC Open Education and Responsive Education).
We put the words "winners" and "losers" in quotation marks because,
of course, Follow Through was not a contest with the object of
attaining the highest possible achievement test scores. It simply
happens that the outcomes on which Follow Through models are found to
differ are achievement test scores. That other criteria might have
shown different winners and losers (a point heavily emphasized by the
House committee) must remain a conjecture for which all the available
evidence is negative. What we have are achievement test differences,
and we must now turn to the question of what those differences might
mean.
It lies outside the scope of this paper to discuss the importance of
scholastic achievement itself. The more immediate issue is whether
the observed differences in achievement test scores reflect actual
differences in mastery of reading, mathematics, spelling, and
language.
One obvious limitation that must be put on the results is that the
Metropolitan Achievement Test, like all other standardized
achievement batteries, covers less than the full range of achievement
objectives. As House et al point out, the test does not cover "even
such straightforward skills as the ability to read aloud, to write a
story, or to translate an ordinary problem into numbers" (1978b, p.
473). This much is certainly true, but House et al then go on to say,
"it would be reckless to suppose that the results of the testing
indicate the attainment of these broader goals" (p. 473). "Reckless"
is far too strong a word here (see Note 2). From all we know about the
intercorrelation of scholastic skills, one could be fairly confident
in assuming that children who perform above average on the MAT would
also perform above average on tests of the other skills mentioned. A
glance again at Figures 1 and 2 tells us that achievements in a
variety of areas tend to go together. Given the homogeneous drift of
scores downward from left to right in those figures, it is hard to
imagine another set of achievement measures in mathematical and
language skills that would show a trend in the opposite direction.
Such a trend cannot be declared impossible, of course, but if House
et al expect us to take such a possibility seriously, then they ought
to provide some evidence to make it plausible.
A more serious kind of charge is that the MAT is biased in favor of
certain kinds of programs. If true, this could mean that the observed
test score differences between models reflect test bias and not true
differences on the achievement variables that the test is supposed to
measure. We must be very careful, however, in using the term bias.
One sometimes hears in discussions of Follow Through statements that
the MAT is biased in favor of models that teach the sort of content
measured by the MAT. This is a dangerous slip in usage of the word
bias and must be avoided. It makes no sense whatever to call it bias
when an achievement test awards higher scores to students who have
studied the domain covered by the test than to students who have not.
It would be a very strange achievement test if it did not.
It is meaningful, however, to say that an achievement test is biased
in its sampling of a domain of content, but even here one must be
careful not to abuse the term. The Mathematics Concepts subtest of the
MAT, for instance, is a hodge-podge of knowledge items drawn from
"old math," "new math," and who knows what. For any given
instructional program, it will likely be found that the test calls
for knowledge of material not covered by that program, but that
doesn't mean the test is biased against the program. The test
obviously represents a compromise that cannot be fully satisfactory
to any program. The only ground for a charge of bias would be that
the compromise was not even-handed. Investigating such a charge would
require a thorough comparison of content coverage in the test and
content coverage in the various Follow Through programs. It does no
good to show that for a particular program there are discrepancies
between content covered and content tested. The same might be equally
true of every program.
As far as the Follow Through evaluation goes, the only MAT subtest to
which a charge of content bias might apply (we have no evidence that
it does) is Mathematics Concepts. The other subtests all deal with
basic skills in language and mathematics. Different programs might
teach different methods of reading or doing arithmetic, and they
might give different amounts of emphasis to these skills, but the
skills tested on the MAT are all ones that are appropriate to test
regardless of the curriculum. Even if a particular Follow Through
model did not teach arithmetic computation at all, it would still be
relevant in an assessment of that program to test students'
computational abilities; other people care about computation, even if
the Follow Through sponsor does not. The reason why Mathematics
Concepts may be an exception is that, while everyone may care about
mathematical concepts, different people care about different ones,
and so a numerical score on a hodge-podge of concepts may not be
informative.
While such skill tests as those making up the bulk of the MAT are
relatively immune to charges of content bias, they can be biased in
other ways. They may, perhaps, be biased in the level of cognitive
functioning that they tap within a skill area. The House committee
implies such a bias when they say, "the selection of measures favors
models that emphasize rote learning of the mechanics of reading,
writing, and arithmetic" (House et al, 1978a, p. 145). This is a
serious charge and, if true, would go some way toward discrediting
the findings.
But House et al offer no support for this charge, and on analysis it
seems unlikely that they could. Their statement rests on three
assumptions for which we know of no support: (1) that "the mechanics
of reading, writing, and arithmetic" can be successfully taught by
rote; (2) that there were Follow Through models that emphasized rote
learning (the model descriptions provided by Abt give no suggestion
that this is true; see Note 3); and (3) that the MAT measures skills in such a
way that the measurement favors children who have learned those
skills by rote rather than through a meaningful process. We must
conclude, in fact, that since the House committee could not have been
so naive as to hold all three of these assumptions, they must have
introduced the word "rote" for rhetorical effect only. Take the word
out and their statement reduces to an unimpressive complaint about
the limited coverage of educational objectives in the Follow Through
evaluation.
A final way in which skill tests might be biased is in the form of
the test problems. Arithmetic computation problems, for instance,
might be presented in notation that was commonly employed in some
programs and not in others; or reading test items might use formats
similar to those used in the instructional materials of one program
and not another. Closely related to this is the issue of "teaching
for the test," when this implies shaping the program to fit incidental
features of a test such as item formats. We may as well throw in here
the issue of test-wiseness itself as a program outcome (that is, the
teaching of behaviors which, whether intended to do so or not, help
children perform well on tests), since it bears on the overall problem
of ways in which a program might achieve superior test scores without
any accompanying superiority in actual learning of content. In short,
children in some programs might simply get better at taking
tests.
If one looks at the Direct Instruction and Behavior Analysis models,
with their emphasis on detailed objectives and close monitoring of
student progress, and compares them to EDC Open Education, with its
disavowal of performance objectives and repudiation of standardized
testing, it is tempting to conclude in the absence of any evidence
that the former models must surely have turned out children better
prepared to look good on tests, regardless of the children's true
states of competence. Without wishing to prejudge the issue, we must
emphasize that it is an empirical question to what extent children
schooled in the various Follow Through models were favored or
disfavored with respect to the process of testing itself.
In general, children involved in the Follow Through evaluation were
subjected to more standardized testing than is normal. Since studies
of test-wiseness indicate rapidly diminishing returns from increasing
amounts of familiarization with testing (Cronbach, 1960), there is
presumptive evidence against claims that differential amounts of
test-taking among models could be significant in accounting for
test-score differences. It should be possible to investigate this
matter with Follow Through data, though not from the published data.
Children in the final Follow Through evaluation had been subjected to
from two to five rounds of standardized testing. Accordingly it
should be possible to evaluate the effect of frequency of previous
testing on third-grade test scores.
There are, however, numerous ways in which Follow Through experience
could affect children's behavior during testing. The amount of
experience that children in any program had with actual test-taking
is probably trivial in comparison to the amount of experience some
children got in doing workbook pages and similar sorts of
paper-and-pencil activities. And the nature of these activities might
have varied from ones calling for constructed responses, quite unlike
those on a multiple-choice test, to ones that amounted virtually to a
daily round of multiple-choice test-taking. Programs vary not only in
the amount of evaluation to which children are subjected but also in
the manner of evaluation: covert, which might have little
effect on the children; face-to-face and oral; or carried out
through group testing. Finally, given that testing conditions in the
Follow Through evaluation were not ideal, it is probably relevant how
well children in the various programs learned to cheat
effectively, that is, to copy from the right neighbor.
Some or most of these variables could be extracted from available
information, and it would be then possible to carry out analyses
showing the extent to which they account for test scores and for the
score differences between models. Only through such a multivariate
empirical investigation could we hope to judge how seriously to take
suggestions that the score differences among models were artifactual.
Until that time, insinuations about "teaching for the test" must be
regarded as mere prejudice.
What Do The Results Mean?
What we have tried to establish so far is that there are
significant achievement test differences between Follow Through
models and that, so far as we can tell at present, these test score
differences reflect actual differences in school learning. Beyond
this point, conclusions are highly conjectural. Although our main
purpose in this paper has been simply to clarify the empirical
results of the Follow Through experiment, we shall venture some
interpretive comments, if for no other purpose than to forestall
possible misinterpretations.
The two high-scoring models according to our analysis are Direct
Instruction and Behavior Analysis; the two low-scoring are EDC Open
Education and Responsive Education. If there is some clear meaning to
the Follow Through results, it ought to emerge from a comparison of
these two pairs of models. On the one hand, distinctive
characteristics of the first pair are easy to name: sponsors of both
the Direct Instruction and Behavior Analysis models call their
approaches "behavioral" and "structured" and both give a high
priority to the three R's. EDC and Responsive Education, on the other
hand, are avowedly "child-centered." Although most other Follow
Through models could also claim to be child-centered, these two are
perhaps the most militantly so and most opposed to what Direct
Instruction and Behavior Analysis stand for.
Thus we have, if we wish it, a battle of the philosophies, with the
child-centered philosophy coming out the loser on measured
achievement, as it has in a number of other experiments (Bennett,
1976; Stallings, 1975; Bell and Switzer, 1973; Bell, Zipursky &
Switzer, 1976). This is interesting if one is keen on ideology, but
it is not very instructive if one is interested in improving an
educational program. Philosophies don't teach kids. Events teach
kids, and it would be instructive to know what kinds of events make
the difference in scholastic achievement that we have observed.
The teaching behavior studies of Brophy & Good (1974), Rosenshine
(1976), and Stallings & Kaskowitz (1974) are helpful on this
point. Generally they contrast direct with informal teaching styles,
a contrast appropriate to the two kinds of models we are comparing.
Consistently it is the more direct methods, involving clear
specifications of objectives, clear explanations, clear corrections
of wrong responses, and a great deal of "time on task," that are
associated with superior achievement test performance. The effects
tend to be strongest with disadvantaged children.
These findings from teacher observation studies are sufficiently
strong and consistent that we may reasonably ask what if anything
Follow Through results add to them. They add one very important
element, the element of experimental change. The teacher observation
studies are correlational. They show that teachers who do x get
better achievement results than those who do y. The implication is
that if the latter teachers switched from doing y to doing x, they
would get better results, too; but correlational studies can't
demonstrate that. Perhaps teachers whose natural inclination is to do
y will get worse results if they try to do x. Or maybe teachers who
do y can't or, worse, won't do x. Or maybe x and y don't even matter;
they only serve as markers for unobserved factors that really make
the difference.
The Follow Through experiment serves, albeit imperfectly, to resolve
these uncertainties. Substantial resources were lavished on seeing to
it that teachers didn't just happen to use direct or informal methods
according to their inclinations but rather that they used them
according to the intent of the model sponsors. The experimental
control was imperfect because communities could choose what Follow
Through model to adopt, and in some cases, we understand, teachers
could volunteer to participate. Nevertheless, it seems safe to assume
that there was some sponsor effect on teacher behavior in all
instances, so that some teachers who would naturally do x were
induced to do y and vice versa. Thus, with tentativeness, we can
infer from Follow Through results that getting teachers of
disadvantaged children to use more direct instructional methods as
opposed to more informal ones will lead to superior achievement in
commonly tested basic skills.
Before concluding, however, that what accounts for the superior
achievement test scores of Direct Instruction and Behavior Analysis
sites is their use of direct teaching methods, we should consider a
more profound way in which these two models are distinguished from
the others. These models are distinctive not only at the level of
immediately observable teacher behavior but also at a higher level
which may be called the systemic. One may observe a lesson in which
the teacher manifests all the usual signs of direct teaching: lively
manner, clear focus on instructional objectives, frequent eliciting
of response from students, etc. One may return weeks later to find
the same teacher with the same class manifesting the same direct
teaching behavior, and still teaching the same lesson! The fault here
is at the systemic level: the teacher is carrying out sorts of
activities that should result in learning but is failing to organize
and regulate them in such a way as to converge on the intended
objectives.
More effective teachers (and this includes the great majority)
function according to a convergent system. Consider a bumbling Mr.
Chips introducing his pupils to multiplication by a two-digit
multiplier. He demonstrates the procedure at the chalkboard and then
discovers that most of the students cannot follow the procedure
because they have forgotten or never learned their multiplication
facts. So he backs up and reviews these facts, then demonstrates the
algorithm again and assigns some practice problems. Performance is
miserable, so he teaches the lesson again. By this time some children
get it, and they teach others. With a bit of help, most of the class
catches on. Mr. Chips then gives special tutoring, perhaps with use
of supplementary concrete materials, to the handful of students who
haven't yet got it. Finally everyone has learned the multiplication
algorithm except for the slowest pupils in the class, who, as a matter
of fact, haven't yet learned to add either.
Although none of the procedures used by Mr. Chips are very efficient,
he applies them in a convergent way so that eventually almost all the
children reach the instructional objective. Some of his procedures
may not have a convergent effect at all. For instance, he may assign
practice worksheets to pupils who haven't yet grasped the algorithm,
and the result is that they merely practice their mistakes (a
divergent activity). But the overall effect is convergent. Given more
efficient activities, convergence on the instructional goal might be
more rapid and it might include the pupils who fail at the hands of
Mr. Chips. But the difference in effectiveness, averaged over all
pupils, would probably not be great. This convergent property of
teaching no doubt contributes, as Stephens (1967) has suggested, to
the scarcity of significant differences between teaching methods.
Unless severely constrained, most teachers will see to it that, one
way or another, their students reach certain goals by the end of the
term.
We suggest that teaching performance of the kind just described be
taken as baseline and that innovative educational practices, such as
those promoted by the Follow Through sponsors, be judged in relation
to that baseline. What would happen to the teaching of our Mr. Chips
if he came under the supervision of a Follow Through sponsor? It
seems fairly clear that his system for getting students to reach
certain goals by the end of the term would be enhanced if he took
guidance from a Direct Instruction or Behavior Analysis sponsor but
that it might well be disrupted by guidance from one of the more
child-centered sponsors.
What Direct Instruction and Behavior Analysis provide are more fully
developed instructional systems than teachers normally employ. They
provide more systematic ways of determining whether children have the
prerequisite skills before a new step in learning is undertaken, more
precise ways of monitoring what each child is learning or failing to
learn, and more sophisticated instructional moves for dealing with
children's learning needs. Open Education and Responsive Education,
on the other hand, because of their avowed opposition to making
normative comparisons of students or thinking in terms of deficits,
will tend to discourage those activities whereby teachers normally
discover when children are not adequately prepared for a new step in
learning or when a child has mislearned or failed to learn something.
Also, because of their preference for indirect learning activities,
these models will tend to make teaching less sharply focused on
achieving specific learnings and remedying specific lacks.
Of course, child-centered educators will wish to describe the matter
differently, arguing that they do have a well-developed system for
promoting learning; but it is a different kind of system pursuing
different kinds of goals from those pursued by the direct
instructional approaches. They will point out that child-centered
teachers devote a great deal of effort to identifying individual
pupils' learning needs and to providing learning experiences to meet
these needs; it is just that their efforts are more informal and
intuitive, less programmed. Child-centered education, they will
argue, is different, not inferior.
One is inclined automatically to assent to this live-and-let-live
assessment, which relegates the differences between educational
methods to the realm of personal values and ideology. But surely the
Follow Through experiment and any comparative evaluation will have
been in vain if we take this easy way out of the dilemma of educating
disadvantaged children.
This easy way of avoiding confrontation between the two approaches
can be opposed on both empirical and theoretical grounds.
Empirically, child-centered approaches have been unable to
demonstrate any off-setting advantages to compensate for their poor
showing in teaching the three R's. House et al (1978a) have argued
that the selection of measures used in the Follow Through evaluation
did not give child-centered approaches adequate opportunity to
demonstrate their effects. This may be true to a degree, but it is
certainly not true that child-centered approaches had no opportunity
to demonstrate effects relevant to their purposes. One had better not
be a perfectionist when it comes to educational evaluation. No
measure is perfectly correlated to one's objectives. The most one can
hope for is a substantial correlation between obtained scores on the
actual measures and true scores on the ideally appropriate measures
that one wishes existed but do not.
When child-centered educators purport to increase the self-esteem of
disadvantaged children and yet fail to show evidence of this on the
Coopersmith Self-Concept Inventory, we may ask: what real and
substantial changes in self-esteem would one expect to occur that
would not be reflected in changes on the Coopersmith? Similarly for
reasoning and problem-solving. If no evidence of effect shows on a
test of non-verbal reasoning, or a reading comprehension test loaded
with inferential questions, or on a mathematical problem solving
test, we must ask: why not? What kinds of real, fundamental
improvements in logical reasoning abilities would fail to be
reflected in any of these tests?
If these remarks are harsh, it is only because we believe that the
question of how best to educate disadvantaged children is
sufficiently serious that a policy of live-and-let-live needs to be
replaced by a policy of put-up-or-shut-up. Certainly the cause of
educational betterment is not advanced by continual appeal to
nonexistent measures having zero or negative correlations with
existing instruments purporting to measure the same thing. Among the
numerous faults that we have found with the House committee's report,
their use of this appeal is the only one that deserves the label of
sophistry.
Critique of the Child-centered Approach
What follows is an attempt at a constructive assessment of the
child-centered approach as embodied in the Open Education and
Responsive Education models. By constructive we mean that we take
seriously the goals of these models and that our interest is in
realizing the goals rather than in scrapping them in favor of others.
These remarks are by way of preface to the following observation:
child-centered approaches have evolved sophisticated ways of managing
informal educational activities but they have remained at a primitive
level in the design of means to achieve learning objectives.
We are here distinguishing between two levels at which a system of
teaching may be examined. At the management level, an open classroom
and a classroom running according to a token economy, for example,
are radically different, and while there is much to dispute in
comparing them, it is at least clear that both represent highly
evolved systems. When we consider the instructional design level,
however, the difference is more one-sided. Child-centered approaches
rely almost exclusively on a form of instruction that
instructionally-oriented approaches use only when nothing better can
be found.
This primitive form of instruction may be called relevant activity.
Relevant activity is what teachers must resort to when there is no
available way to teach children how to do something, no set of
learning activities that clearly converge on an objective. This is
the case, for instance, with reading comprehension. Although there
are some promising beginnings, there is as yet no adequate
"how-to-do-it" scheme for reading comprehension. Accordingly, the
best that can be done is to engage students in activities relevant to
reading comprehension, for instance reading selections and answering
questions about the selections. Such activities are relevant in that
they entail reading comprehension, but they cannot be said to teach
reading comprehension.
For many other areas of instruction, however, more sophisticated
means have been developed. There are, for instance, ways of teaching
children how to decode in reading and how to handle equalities and
inequalities in arithmetic (Engelmann, Ref. Note 2). The
instructional approaches used in Direct Instruction and Behavior
Analysis reflect years of analysis and experimentation devoted to
finding ways of going beyond relevant activity to forms of
instruction that get more directly at cognitive skills and
strategies. This effort has been successful in some areas, not so
successful in others, but the effort goes on. Meanwhile,
child-centered approaches have tended to fixate on the primitive
relevant-activity form of instruction for all their instructional
objectives.
The contrast of sophistication in management and naiveté in
instruction is visible in any well-run open classroom. The behavior
that meets the eye is instantly appealing (children quietly absorbed
in planning, studying, experimenting, making things), and one has to
marvel at the skill and planning that have achieved such a blend of
freedom and order. But look at the learning activities themselves and
one sees a hodge-podge of the promising and the pointless, of the
excessively repetitious and the excessively varied, of tasks that
require more thinking than the children are capable of and tasks that
have been cleverly designed to require no mental effort at all (like
exercise sheets in which all the problems on the page have the same
answer). The scatteredness is often appalling. There is a little bit
of phonics here and a little bit of phonics there, but never a
sufficiently coherent sequence to enable a kid to learn how to use
this valuable tool. Materials have been chosen for sensorial appeal
or suitability to the system of management. There is a predilection
for cute ideas. The conceptual analysis of learning problems tends to
be vague and irrelevant, big on name-dropping and low on
incisiveness.
There does not appear to be any intrinsic reason why child-centered
educators should have to remain committed to primitive instructional
approaches. So far, child-centered educators have been able to gain
reassurance from the fact that for the objectives they
emphasize (objectives in comprehension, thinking, and feeling), their
approaches are no more ineffective than anyone else's. But even this
defense may be crumbling. Instructional designers, having achieved
what appears to be substantial success in improving the teaching of
decoding in reading, basic mathematical concepts and operations,
spelling, and written English syntax, are now turning more of their
attention to the kinds of goals emphasized by child-centered
educators. Unless thinkers and experimenters committed to
child-centered education become more sophisticated about instruction
and start devoting more attention to designing learning activities
that actually converge on objectives, they are in danger of becoming
completely discredited. That would be too bad. Child-centered
educators have evolved a style of school life that has much in its
favor. Until they develop an effective pedagogy to go with it,
however, it does not appear to be an acceptable way of teaching
disadvantaged children.
*Graphs and tables in this article could not be reproduced clearly in
electronic format.
Notes:
1. Reduced analyses were performed, dropping TEEM and Cognitive
Curriculum from the analysis. These were the two most unstable models
in the sense of shifting most in relative performance depending on
the choice of covariates. Moreover, Cognitive Curriculum had deviant
relations between criteria and covariates, showing for instance
negative relationships between achievement and SES. The only effect
of removing these models, however, was to increase the number of
significant differences between the two top scoring models and the
other models.
2. Examined closely, the House et al statement is a bit slippery.
Since the MAT is a norm-referenced (not a criterion-referenced)
test, it is of course "reckless" to infer any particular attainments
at all from test scores. All we know is how a person or group
performs in comparison to others. If, for example, the criterion for
"ability to write a story" is set high enough, it would be reckless
to suppose that any third-grader had attained it.
3. The obvious targets for the charge of emphasizing rote learning
are Direct Instruction and Behavior Analysis. However, the Direct
Instruction sponsors explicitly reject rote memorization (Bock,
Stebbins, & Proper, 1977, p. 65) and the Behavior Analysis model
description makes no mention of it. House, Glass, McLean, and Walker
seem to have fallen into the common fallacy here of equating direct
instruction with rote learning. If they are like most university
professors, they probably rely extensively on direct instruction
themselves and yet would be offended by the suggestion that this
means they teach by rote.
Reference Notes:
1. Becker, W.C., & Carnine, D.W. Direct Instruction-A
behavior-based model for comprehensive educational intervention with
the disadvantaged. Paper presented at the VIII Symposium on Behavior
Modification, Caracas, Venezuela, February, 1978. Division of Teacher
Education, University of Oregon, Eugene, Oregon.
2. Engelmann, S. Direct Instruction. Seminar presentation. AERA,
Toronto, March, 1978.
References
Bell, A.E., & Switzer, F. (1973). Factors related to
pre-school prediction of academic achievement: Beginning reading in
open area vs. traditional classroom systems. Manitoba Journal of
Education, 8, 22-27.
Bell, A.E., Zipursky, M.A., & Switzer, F. (1977). Informal or
open-area education in relation to achievement and personality.
British Journal of Educational Psychology, 46, 235-243.
Bennett, N. (1976). Teaching styles and pupil progress. Cambridge,
Mass.: Harvard University Press.
Brophy, J.E., & Good, T.L. (1974). Teacher-student relationships:
Causes and consequences. New York: Holt, Rinehart & Winston.
Cronbach, L.J. (1960). Essentials of psychological testing. (2nd
ed.). New York: Harper & Brothers.
House, E.R., Glass, G.V., McLean, L.F., & Walker, D.F. (1978a). No
simple answer: Critique of the "Follow Through" evaluation. Harvard
Educational Review, 48(2), 128-160.
House, E.R., Glass, G.V., McLean, L.F., & Walker, D.F. (1978b).
Critiquing a Follow Through evaluation. Phi Delta Kappan, 59(7),
473-474.
Rosenshine, B. (1976). Classroom instruction. In Seventy-fifth
Yearbook of the National Society for the Study of Education (Part 1).
Chicago: University of Chicago Press.
St. Pierre, R.G., Anderson, R.B., Proper, E.C., & Stebbins, L.B.
(1978). That Follow Through evaluation. Phi Delta Kappan, 59(10),
729.
Stallings, J.A., & Kaskowitz, D.H. (1974). Follow Through
classroom observation evaluation, 1972-1973. Menlo Park, Cal.:
Stanford Research Institute.
Stallings, J. (1975). Implementation and child effects of teaching
practices in Follow Through classrooms. Monographs of the Society for
Research in Child Development, 40(7-8, Serial No. 163).
Stebbins, L.B., St. Pierre, R.G., Proper, E.C., Anderson, R.B., &
Cerva, T.R. (1977). A planned variation model. Vol. IV-A Effects of
Follow Through models. U.S. Office of Education.
Stephens, J. (1967). The process of schooling. New York: Holt,
Rinehart & Winston.