Commentary on Anderson & Lebiere
Abstract: 60 words
Main Text: 945 words
References: 77 words
Total Text: 1102 words
Anderson and Lebiere undertake the daunting task of evaluating cognitive architectures with the goal of identifying their strengths and weaknesses. The authors are right about the risks of proposing a psychological theory based on a single evaluation criterion. What if the several micro-theories proposed to meet different criteria do not fit together coherently? What if a theory proposed for language understanding and inference is inconsistent with the theory for language learning or development? What if a theory for playing chess does not respect the known computational limits of the brain? The answer, according to Newell and to Anderson & Lebiere, is to evaluate a cognitive theory along multiple criteria such as flexibility of behavior, learning, evolution, knowledge integration, brain realization, and so forth. By bringing multiple sources of evidence to bear on a single theory, one is protected from ``overfitting,'' a problem that occurs when the theory has too many degrees of freedom relative to the available data. While this strategy is uncontroversial when applied to testable hypotheses, I believe it does not work well for evaluating cognitive architectures.
Science progresses by proposing testable theories and testing them. The problem with cognitive architectures is that they are not themselves theories but high-level languages used to implement theories, with only some weak architectural constraints. Moreover, these languages are computationally universal and thus equivalent to one another, in the sense that each can simulate the others. How does one evaluate or falsify such universal languages? Are the multiple criteria listed by the authors sufficient to rule out anything at all, rather than simply suggesting areas for improvement? The authors' grading scheme is telling in this respect. It only grades how well an architecture satisfies one criterion relative to another, and does not say how to choose between two architectures. Nor can one duck the question by suggesting that we choose an architecture based on the criterion we are most interested in explaining; this is precisely the original problem that Newell was trying to address with his multiple criteria.
The authors suggest that timing constraints and memory limitations imply that one cannot just program arbitrary models in ACT-R. But that still leaves room for an infinite variety of models, and ACT-R cannot tell us how to choose among them. By analogy with programming languages, it is possible to design an infinite variety of cognitive architectures, and to implement an infinite variety of models in each. Can we ever collect enough evidence to choose one over another?
This suggests to me that a cognitive theory must be carefully distinguished from its concrete implementation and the underlying architecture. Just as a programming language can implement any given algorithm, a cognitive architecture can instantiate any cognitive theory (albeit with some variation in time efficiency). This should not count as evidence for the validity of the architecture itself, any more than the good performance of an algorithm should count as evidence for the validity of the programming language. Cognitive science can make better progress by carefully distinguishing the algorithm from the architecture and confining its claims to those parts of the algorithm that are in fact responsible for the results. Consider, for example, ACT-R's theory of past-tense learning by children; more specifically, the empirical observation that the exceptions tend to be high-frequency words. Anderson and Lebiere attribute this to the fact that only high-frequency words develop enough base-level activation to be retrieved in ACT-R. In more general terms, only high-frequency words provide sufficient training data for the system to learn an exception. How much of this explanation is due to the particulars of the ACT-R theory, as opposed to being a necessary consequence of learning in a noisy domain? If any learning system that operates in a noisy environment needs more training data to learn an exception, why should this be counted as evidence for the ACT-R theory? Similar criticisms can be leveled against other cognitive architectures and mechanisms, such as Soar and chunking, or connectionism and backpropagation.
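The point about noisy learning can be illustrated with a toy simulation that takes no stand on ACT-R's mechanisms. The lexicon, frequencies, noise rate, and memorization threshold below are all hypothetical choices made for illustration: a learner that memorizes an irregular form only after enough clean exposures, and otherwise applies the regular +ed rule, will retain frequent exceptions and regularize rare ones, whatever its internal machinery.

```python
import random

random.seed(0)

# Hypothetical toy lexicon: (verb, correct past tense, relative frequency,
# is_irregular). "go" is a frequent irregular; "strive" a rare one.
LEXICON = [
    ("walk",   "walked", 50, False),
    ("go",     "went",   80, True),
    ("play",   "played", 40, False),
    ("strive", "strove",  1, True),
]

def simulate(n_exposures=500, noise=0.1, threshold=10):
    """Memorize an irregular form only after `threshold` clean exposures;
    otherwise fall back on the regular +ed rule."""
    verbs = [v for v, _, _, _ in LEXICON]
    weights = [f for _, _, f, _ in LEXICON]
    counts = {}
    for _ in range(n_exposures):
        v = random.choices(verbs, weights=weights)[0]
        if random.random() > noise:          # a noisy exposure is lost
            counts[v] = counts.get(v, 0) + 1
    produced = {}
    for v, p, _, irregular in LEXICON:
        if irregular and counts.get(v, 0) >= threshold:
            produced[v] = p                  # exception retained
        else:
            produced[v] = v + "ed"           # regular rule applied
    return produced

out = simulate()
# With these settings the frequent irregular ("go") is almost always
# retained, while the rare irregular ("strive") is regularized.
```

The frequency effect here falls out of the threshold-plus-noise setup alone; nothing in the sketch depends on base-level activation or any other architectural commitment.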
In other words, even when multiple criteria are used to evaluate a cognitive architecture, there remains an explanatory gap (or a leap of faith) between the evidence presented and the paradigm used to explain it. To guard against such over-interpretation of the evidence, Ohlsson and Jewett propose ``abstract computational models,'' which are computational models designed to test a particular hypothesis without taking a stand on all the details of a cognitive architecture [Ohlsson & Jewett, 1997]. Similar concerns are expressed by Pat Langley, who argues that the source of explanatory power often lies not in the particular cognitive architecture being advanced but in some other factor, such as the choice of features or the problem formulation [Langley, 1999]. Put another way, there are multiple levels of explanation for a phenomenon such as past-tense learning or categorization: the computational-theory level, the algorithmic level, and the implementation level. The computational-theory level is concerned with what is to be computed, while the algorithmic level is concerned with how [Marr, 1982]. The cognitive architecture belongs to the implementation level, which lies below the algorithmic level. It is an open question where the explanatory power of an implementation mostly lies. Only by paying careful attention to these different levels of explanation, and by evaluating them appropriately, can we discern the truth. One place to begin is to propose specific hypotheses about the algorithmic structure of the task at hand and to evaluate them using a variety of sources of evidence. This may, however, mean that we have to put aside the problem of evaluating cognitive architectures for now, or forever.
Langley, P. (1999). Concrete and abstract models of category learning. Proceedings of the 21st Annual Conference of the Cognitive Science Society (pp. 288--293). Vancouver, BC, Canada: Lawrence Erlbaum.
Marr, D. (1982). Vision (Epilogue, pp. 335--361). New York: W. H. Freeman and Company.
Ohlsson, S., & Jewett, J. J. (1997). Simulation models and the power law of learning. Proceedings of the 19th Annual Conference of the Cognitive Science Society (pp. 584--589). Stanford, CA: Lawrence Erlbaum.