Current Status of the TeX Math Recognizer




The following excerpt is taken from a paper in progress by Professor Richard Fateman and graduate student Eylon Caspi:

In a subsequent test on some 10,740 formulas from the same source [Gradshteyn and Ryzhik, Table of Integrals, Series, and Products] (extracted from the file GRAD.DAT on the CD-ROM), a slightly modified version of the recognizer announced 5906 errors and presumed success for the remaining 4834. Again, the true success rate is lower, since the presumed successes include some 170 formulas with unhandled derivative forms, as well as numerous unflagged, semantically questionable forms. Of the 5906 reported errors, some 1878 are due to unrecognized control sequences that we have not yet considered, including matrix constructions, equation alignment sequences, and macros for many special function names. An additional 804 errors are due to \hbox constructions with unrecognized contents, including more special function names and embedded narrative comments. We suspect, therefore, that simply handling more special function names (and their more complicated superscripting and subscripting) would allow the engine to recognize several thousand additional formulas. Other errors are due to formula forms not handled by the grammar, including some 300 ellipsis constructions (\dots, \cdots, \ldots) and forms with unexpected punctuation or bracing. (We do not count the 300 ellipses as "unrecognized control sequences" since we recognize them; we just do not know what to do with them!) Note that the stated error counts are in fact estimates obtained by tallying parser error diagnostics, and may therefore be inaccurate due to cascading of errors.
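
As a rough illustration of the kind of tallying described above, the following Python sketch sorts TeX formulas into the error categories named in the excerpt. It is not the recognizer's actual code: the KNOWN set, the category names, and the classify and tally functions are all hypothetical, and the sketch assigns each formula a single category, so it does not model cascading errors.

```python
import re
from collections import Counter

# Hypothetical set of control words a recognizer might already handle.
# The real recognizer's vocabulary is not given in the excerpt.
KNOWN = {"frac", "sqrt", "int", "sum", "sin", "cos", "pi", "infty"}

# Ellipsis forms are recognized as tokens but have no grammar rule,
# so they are tallied separately from unrecognized control sequences.
ELLIPSES = {"dots", "cdots", "ldots"}

# A TeX control word is a backslash followed by one or more letters.
CONTROL_WORD = re.compile(r"\\([A-Za-z]+)")


def classify(formula: str) -> str:
    """Assign one coarse category to a formula (assumed priority order)."""
    names = CONTROL_WORD.findall(formula)
    if "hbox" in names:
        return "hbox"
    if any(name in ELLIPSES for name in names):
        return "ellipsis"
    if any(name not in KNOWN for name in names):
        return "unrecognized"
    return "presumed_success"


def tally(formulas):
    """Count how many formulas fall into each category."""
    return Counter(classify(f) for f in formulas)


sample = [
    r"\int_0^\infty \frac{\sin x}{x} dx",
    r"\sum_{k=1}^n a_k + \cdots",
    r"\hbox{for } x > 0",
    r"\Gamma(x)",
]
counts = tally(sample)
```

On the four sample formulas, each category receives exactly one formula. A tally of this kind counts categories, not root causes, which is one reason the figures quoted above are described as estimates.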





Last updated: 1/29/98
Comments to: eylon@cs.berkeley.edu