ReadSight - Multilingual Readability Engine

ReadSight is a PHP library for measuring text readability across 86 languages. It implements 17 readability formulas with language-specific coefficients and uses the Frank M. Liang (TeX) hyphenation algorithm for accurate syllable counting — all with zero runtime dependencies.

See It in Action

Two texts of similar length (24 and 28 words): three plain sentences and a chunk of legal boilerplate:

$plain = 'We made an app that reads your text. It tells you how easy it is to read. You get a score in one second.';
$legal = 'The parties acknowledge that any unauthorized disclosure of confidential information may cause irreparable harm. In such an event, the affected party shall be entitled to seek injunctive relief.';

There is no "score everything" call — you loop over the formulas the language supports and call score() for each:

use GlobusStudio\ReadSight\Engine;

$rs = new Engine('en-us');

foreach ($rs->getSupportedFormulas() as $formula) {
    $result = $rs->score($formula, $legal);
    // $result->score, $result->gradeLevel, $result->interpretation
    // ...
}

For both texts that produces:

+-----------------------+-------------------------+----------------------------+
| READABILITY FORMULA   | Plain text              | Legalese                   |
+-----------------------+-------------------------+----------------------------+
| Flesch Reading Ease   | 107.1  Very Easy        | 23.4  Very Hard            |
| Flesch-Kincaid Grade  | 0.3  g0.3 1st Grade     | 13.5  g13.5 College        |
| Gunning Fog           | 3.2  g3.2 Very Easy     | 18.5  g18.5 Extremely Hard |
| SMOG Index            | 3.1  g3.1 3rd Grade     | 15.2  g15.2 College        |
| Coleman-Liau          | -0.4  g0.0 Kindergarten | 16.5  g16.5 Graduate       |
| Automated Readability | -2.1  g0.0 Kindergarten | 13.2  g13.2 College        |
| LIX                   | 8.0  Children's Books   | 49.7  Factual Information  |
| Dale-Chall            | 5.3  5th-6th grade      | 12.2  Graduate             |
| Spache                | 2.3  g2.3 2nd Grade     | 6.5  g5.0 Above 4th Grade  |
+-----------------------+-------------------------+----------------------------+

All 9 formulas for en-us agree the second text is far harder. The bundled example prints this grid plus text metrics and a syllable histogram, for any text and language:

php examples/dashboard.php
php examples/dashboard.php --lang=de-1996 --file=essay.txt

17 formulas, 86 languages, one consistent API. Five formulas are universal: Gunning Fog, SMOG, Coleman-Liau, ARI and LIX score text in all 86 languages. The remaining 12 are language-specific, each with its own published coefficients: Flesch Reading Ease and Flesch-Kincaid cover 12 languages, the Wiener Sachtextformel covers German, Gulpease covers Italian, OSMAN covers Arabic, FOG-PL covers Polish, Dale-Chall and Spache cover English, and the Fernández-Huerta, Szigriszt-Pazos, Gutiérrez-Polini and Crawford family covers Spanish. getSupportedFormulas() returns only the formulas that apply to the engine's language (9 for en-us, 11 for es, 8 for de-1996), so an English-only metric is never applied to a Thai sentence by mistake.

Installation

composer require globus-studio/readsight

Requirements:

  • PHP >= 8.2
  • ext-mbstring
  • ext-json

No other runtime dependencies.

Quick Start

use GlobusStudio\ReadSight\Engine;

$engine = new Engine('en-us');

// Syllable counting
$engine->syllableCount('banana');        // 3
$engine->splitSyllables('hyphenation');  // ['hyp', 'hen', 'ati', 'on']  (4 syllables, heuristic split)
$engine->splitWord('hyphenation');       // ['hy', 'phen', 'ation']      (TeX hyphenation points)

// Text analysis
$stats = $engine->analyze('The quick brown fox jumps over the lazy dog.');
echo "Words: {$stats->wordCount}, Syllables: {$stats->syllableCount}\n";

// Readability formulas
$text = 'The quick brown fox jumps over the lazy dog.';
$fre = $engine->fleschReadingEase($text);
echo "Flesch Reading Ease: {$fre->score} - {$fre->interpretation}\n";

$fog = $engine->gunningFog($text);
echo "Gunning Fog: {$fog->score} (grade {$fog->gradeLevel})\n";

$lix = $engine->lix($text);
echo "LIX: {$lix->score} - {$lix->interpretation}\n";

Syllable Counting Modes

ReadSight has three syllable counting modes, configured per language via syllableMode in data/languages/*.json:

| Mode | How it works | Count accuracy | Split accuracy |
|---|---|---|---|
| heuristic | Vowel patterns + word list + prefix/suffix rules | ✓ | ≈ approximate |
| tex | Frank M. Liang hyphenation algorithm (TeX .tex patterns) | ✓ | ✓ exact |
| composite | Heuristic first, TeX as fallback | ✓ | ≈ approximate (uses heuristic split) |

The default mode is tex. 84 languages use tex; 2 use composite (en-us, en-gb).
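As a rough illustration of the heuristic and composite ideas, here is a minimal sketch. The function names, the vowel-group rule, and the sample problem-word list are all assumptions made for this example. ReadSight's real heuristic also uses a word list and prefix/suffix rules, which are not reproduced here.

```php
<?php
declare(strict_types=1);

/** Naive English vowel-group heuristic (illustrative, not ReadSight's rule set). */
function heuristicSyllables(string $word): int
{
    $w = strtolower($word);
    // Each run of vowels counts as one syllable nucleus.
    $groups = (int) preg_match_all('/[aeiouy]+/', $w);
    // A final consonant + "e" is usually silent ("score"), except "-le" ("apple").
    if ($groups > 1 && preg_match('/[^aeiouyl]e$/', $w) === 1) {
        $groups--;
    }
    return max(1, $groups);
}

/**
 * Composite dispatch: words on the problem list go to the heuristic counter,
 * everything else goes to the TeX-based counter.
 *
 * @param list<string> $problemWords hypothetical list of lower-case words
 */
function compositeSyllables(string $word, array $problemWords, callable $heuristic, callable $tex): int
{
    $key = strtolower($word);
    return in_array($key, $problemWords, true) ? $heuristic($key) : $tex($key);
}

// Stand-in for a TeX-pattern counter; a real one would count hyphenation parts.
$texStub = fn (string $w): int => 3;

echo compositeSyllables('hyphenation', ['hyphenation'], heuristicSyllables(...), $texStub), "\n"; // 4
echo compositeSyllables('banana', ['hyphenation'], heuristicSyllables(...), $texStub), "\n";      // 3 (from the stub)
```

The point of the split is that a short override list can patch the few words where hyphenation points and spoken syllables disagree, without giving up pattern-based counting for everything else.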

Example: "hyphenation" in each mode

$engine = new Engine('en-us');     // composite mode - heuristic wins
$engine->syllableCount('hyphenation');    // 4 ✓ (in problemWords list)
$engine->splitSyllables('hyphenation');   // ['hyp', 'hen', 'ati', 'on']  - heuristic: equal-width split, ≈ approximate
$engine->splitWord('hyphenation');        // ['hy', 'phen', 'ation']      - TeX hyphenator: exact points

$engine = new Engine('de-1996');   // tex mode
$engine->syllableCount('hyphenation');    // 4 ✓ (TeX patterns)
$engine->splitSyllables('hyphenation');   // ['hy', 'phena', 'ti', 'on']  - TeX: exact
$engine->splitWord('hyphenation');        // ['hy', 'phena', 'ti', 'on']  - same, both use TeX

Tip: splitWord() always uses the TeX hyphenator (exact). splitSyllables() may use the heuristic split (approximate) in composite/heuristic modes. The syllable count is correct either way; only the split points differ.

Note: addHyphenations() adds overrides to the TeX hyphenator. These affect splitWord() but NOT splitSyllables() in composite/heuristic modes (the heuristic counter doesn't see them).
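To make the tex mode less of a black box, the sketch below shows the core of Liang's pattern matching on the word "hyphenation", using the small pattern subset commonly used to illustrate the algorithm on this very word. The function names are hypothetical, the code is byte-oriented (ASCII only), and the minimum fragment lengths of 2 and 3 are the usual English defaults, assumed here rather than taken from ReadSight.

```php
<?php
declare(strict_types=1);

/**
 * Split a Liang pattern such as "hy3ph" into its letters and the weights
 * that sit before, between, and after those letters.
 *
 * @return array{0: string, 1: list<int>}
 */
function parsePattern(string $pattern): array
{
    $letters = '';
    $weights = [0];
    foreach (str_split($pattern) as $ch) {
        if (ctype_digit($ch)) {
            $weights[count($weights) - 1] = (int) $ch; // weight for the current gap
        } else {
            $letters .= $ch;
            $weights[] = 0; // open a new gap after this letter
        }
    }
    return [$letters, $weights];
}

/**
 * @param list<string> $patterns
 * @return list<string>
 */
function hyphenate(string $word, array $patterns, int $leftMin = 2, int $rightMin = 3): array
{
    $padded = '.' . strtolower($word) . '.';          // dots mark the word boundaries
    $points = array_fill(0, strlen($padded) + 1, 0);  // $points[k] is the gap before $padded[k]

    foreach ($patterns as $pattern) {
        [$letters, $weights] = parsePattern($pattern);
        $offset = 0;
        while (($pos = strpos($padded, $letters, $offset)) !== false) {
            foreach ($weights as $i => $w) {
                $points[$pos + $i] = max($points[$pos + $i], $w); // the highest weight wins
            }
            $offset = $pos + 1;
        }
    }

    // An odd weight allows a break; $j letters stay to the left of the break.
    $parts = [];
    $start = 0;
    for ($j = $leftMin; $j <= strlen($word) - $rightMin; $j++) {
        if ($points[$j + 1] % 2 === 1) {
            $parts[] = substr($word, $start, $j - $start);
            $start = $j;
        }
    }
    $parts[] = substr($word, $start);
    return $parts;
}

$toyPatterns = ['hy3ph', 'he2n', 'hena4', 'hen5at', '1na', 'n2at', '1tio', '2io', 'o2n'];
echo implode('-', hyphenate('hyphenation', $toyPatterns)), "\n"; // hy-phen-ation
```

These nine patterns give three fragments, the same split the en-us examples above show for splitWord(), while the word has four spoken syllables. Hyphenation points can undercount syllables in this way, which is consistent with en-us resolving this word through its problemWords list.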

Demo

Run the interactive demo to see ReadSight in action:

php examples/demo.php

This analyzes built-in sample text and outputs:

  • Syllable breakdown with hyphenation points for common words
  • Text statistics - letters, words, sentences, syllables, histogram
  • All applicable readability formulas with scores and interpretations

Compare the same text across 8 languages:

php examples/demo.php --compare

Analyze your own text file:

php examples/demo.php --file=essay.txt
php examples/demo.php --file=essay.txt --lang=de-1996

Supported Languages

86 languages across 19 writing systems: Latin, Cyrillic, Arabic, Hebrew, Devanagari, Bengali, Tamil, Thai, Greek, Armenian, Georgian, Gujarati, Gurmukhi, Kannada, Malayalam, Odia, Telugu, Ethiopic, Coptic.

$engine = new Engine('ru');       // Russian
$engine = new Engine('de-1996');  // German (1996 reform)
$engine = new Engine('es');       // Spanish
$engine = new Engine('th');       // Thai

// List all supported languages
$langs = Engine::getSupportedLanguages();
// ['af', 'ar', 'as', 'be', 'bg', 'bn', 'ca', 'cop', 'cs', 'cu', 'cy', 'da',
//  'de-1901', 'de-1996', 'de-ch-1901', 'el-monoton', 'el-polyton', 'en-gb',
//  'en-us', 'eo', 'es', 'et', 'eu', 'fa', 'fi', 'fi-x-school', 'fr', 'fur',
//  'ga', 'gl', 'grc', 'gu', 'he', 'hi', 'hr', 'hsb', 'hu', 'hy', 'ia', 'id',
//  'is', 'it', 'ka', 'kk', 'kmr', 'kn', 'la', 'la-x-classic', 'la-x-liturgic',
//  'lt', 'lv', 'mk', 'ml', 'mn-cyrl', 'mn-cyrl-x-lmc', 'mr', 'mul-ethi', 'nb',
//  'nl', 'nn', 'oc', 'or', 'pa', 'pi', 'pl', 'pms', 'pt', 'rm', 'ro', 'ru',
//  'sa', 'sh-cyrl', 'sh-latn', 'sk', 'sl', 'sq', 'sr-cyrl', 'sv', 'ta', 'te',
//  'th', 'tk', 'tr', 'uk', 'vi', 'zh-latn-pinyin']

Readability Formulas

Universal (all 86 languages)

| Formula | Method | Type | Score Range |
|---|---|---|---|
| Gunning Fog | gunningFog() | Syllable-based | 0–20+ |
| SMOG Index | smogIndex() | Syllable-based | 3–18+ |
| Coleman-Liau | colemanLiau() | Letter-based | 0–18+ |
| ARI | automatedReadabilityIndex() | Letter-based | 0–18+ |
| LIX | lix() | Letter-based | 20–60+ |
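For orientation, the five universal formulas can be written out from their standard published definitions. The sketch below is self-contained and does not call ReadSight. The tokenizer is deliberately naive, the function names are made up for the example, and the count of words with three or more syllables is passed in by hand because this sketch has no syllable counter. ReadSight's own tokenization and rounding may differ.

```php
<?php
declare(strict_types=1);

/** Raw counts from text. Assumes non-empty text with at least one word. */
function rawCounts(string $text): array
{
    preg_match_all('/\p{L}+/u', $text, $m);
    $letters = 0;
    $longWords = 0;
    foreach ($m[0] as $word) {
        $len = mb_strlen($word);
        $letters += $len;
        if ($len > 6) {
            $longWords++; // LIX treats words longer than six letters as long
        }
    }
    return [
        'words'     => count($m[0]),
        'sentences' => max(1, (int) preg_match_all('/[.!?]+/', $text)),
        'letters'   => $letters,
        'longWords' => $longWords,
    ];
}

function colemanLiau(array $c): float
{
    $lettersPer100   = $c['letters'] / $c['words'] * 100;
    $sentencesPer100 = $c['sentences'] / $c['words'] * 100;
    return 0.0588 * $lettersPer100 - 0.296 * $sentencesPer100 - 15.8;
}

function ari(array $c): float
{
    return 4.71 * $c['letters'] / $c['words'] + 0.5 * $c['words'] / $c['sentences'] - 21.43;
}

function lix(array $c): float
{
    return $c['words'] / $c['sentences'] + 100 * $c['longWords'] / $c['words'];
}

function gunningFog(array $c, int $complexWords): float
{
    return 0.4 * ($c['words'] / $c['sentences'] + 100 * $complexWords / $c['words']);
}

function smog(array $c, int $polysyllables): float
{
    return 1.0430 * sqrt($polysyllables * 30 / $c['sentences']) + 3.1291;
}

$plain = 'We made an app that reads your text. It tells you how easy it is to read. You get a score in one second.';
$c = rawCounts($plain); // 24 words, 3 sentences, 78 letters, no long words

// The plain text has no word of three or more syllables, hence the zeros.
printf(
    "CL %.1f  ARI %.1f  LIX %.1f  Fog %.1f  SMOG %.1f\n",
    colemanLiau($c), ari($c), lix($c), gunningFog($c, 0), smog($c, 0)
);
// prints: CL -0.4  ARI -2.1  LIX 8.0  Fog 3.2  SMOG 3.1
```

These values line up with the Plain text column of the comparison grid in See It in Action, which suggests the library follows the standard definitions for English.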

Language-Specific

| Language | Formulas |
|---|---|
| English (en-us, en-gb) | Flesch Reading Ease, FK Grade Level, Dale-Chall*, Spache* |
| German (de-*) | Flesch Reading Ease (Amstad), FKGL, Wiener Sachtextformel (4 variants) |
| Russian (ru) | Flesch Reading Ease (Oborneva), FKGL |
| Spanish (es) | Flesch Reading Ease, Fernandez-Huerta, Szigriszt-Pazos, Gutierrez-Polini, Crawford |
| Italian (it) | Flesch Reading Ease, Gulpease |
| French (fr) | Flesch Reading Ease (Kandel-Moles) |
| Dutch (nl) | Flesch Reading Ease (Douma) |
| Portuguese (pt) | Flesch Reading Ease (Martins) |
| Turkish (tr) | Flesch Reading Ease (Ateşman) |
| Polish (pl) | FOG-PL |
| Arabic (ar) | OSMAN |

* Note: Dale-Chall and Spache formulas use a syllable-based heuristic to estimate difficult words (1-syllable ≈ easy). This is a simplified estimation, not based on the original Dale/Spache word lists. For accurate Dale-Chall/Spache scores, a curated word list would be required.
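To show what "language-specific coefficients" means in practice, here is a minimal sketch of Flesch Reading Ease with a swappable coefficient set, plus the English Flesch-Kincaid grade. The English and German (Amstad) values are the commonly published ones. The values ReadSight actually uses come from its language configuration and may differ, so treat the table below as an assumption. Word, sentence and syllable counts are hand-counted for the plain sample text.

```php
<?php
declare(strict_types=1);

// Commonly published coefficients (assumed here, not read from ReadSight's data files).
const FRE_COEFFICIENTS = [
    'en' => ['base' => 206.835, 'asl' => 1.015, 'asw' => 84.6], // Flesch
    'de' => ['base' => 180.0,   'asl' => 1.0,   'asw' => 58.5], // Amstad
];

function fleschReadingEase(int $words, int $sentences, int $syllables, string $lang = 'en'): float
{
    $k = FRE_COEFFICIENTS[$lang];
    $avgSentenceLength   = $words / $sentences;
    $avgSyllablesPerWord = $syllables / $words;
    return $k['base'] - $k['asl'] * $avgSentenceLength - $k['asw'] * $avgSyllablesPerWord;
}

function fleschKincaidGrade(int $words, int $sentences, int $syllables): float
{
    return 0.39 * ($words / $sentences) + 11.8 * ($syllables / $words) - 15.59;
}

// Plain sample text: 24 words, 3 sentences, 26 syllables ("easy" and "second" have two).
printf("%.1f\n", fleschReadingEase(24, 3, 26));  // 107.1
printf("%.1f\n", fleschKincaidGrade(24, 3, 26)); // 0.3

// Same counts with the German coefficients, only to show the switch.
printf("%.1f\n", fleschReadingEase(24, 3, 26, 'de'));
```

The first two results agree with the Flesch Reading Ease and Flesch-Kincaid Grade values in the Plain text column of the comparison grid.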

Generic dispatching:

$result = $engine->score('gunning_fog', $text);
$result = $engine->score('wiener_sachtextformel', $text);

FormulaResult

$result->score;           // float - raw formula score
$result->gradeLevel;      // ?float - normalized grade level (FKGL, GF, SMOG, CL, ARI)
$result->interpretation;  // string - qualitative interpretation ("Easy", "Hard")
$result->formulaName;     // string - formula key
$result->languageCode;    // string - language code used
$result->inputs;          // array<string, float|int> - intermediate values for debugging
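Because gradeLevel is nullable, code that renders results has to handle formulas that produce no grade. The sketch below uses a hypothetical stand-in class (ResultSketch, not the library's FormulaResult) with three of the documented fields, and formats a cell in the style of the comparison grid.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in mirroring three documented fields; not ReadSight's class.
final readonly class ResultSketch
{
    public function __construct(
        public float $score,
        public ?float $gradeLevel,
        public string $interpretation,
    ) {
    }
}

/** Format "score  g<grade> interpretation", leaving out the grade when it is null. */
function formatCell(ResultSketch $r): string
{
    $grade = $r->gradeLevel === null ? '' : sprintf('g%.1f ', $r->gradeLevel);
    return sprintf('%.1f  %s%s', $r->score, $grade, $r->interpretation);
}

echo formatCell(new ResultSketch(13.5, 13.5, 'College')), "\n";   // 13.5  g13.5 College
echo formatCell(new ResultSketch(23.4, null, 'Very Hard')), "\n"; // 23.4  Very Hard
```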

API Reference

Text Analysis Methods

$engine->syllableCount(string $word): int
$engine->splitWord(string $word): list<string>
$engine->splitSyllables(string $word): list<string>
$engine->wordCount(string $text): int
$engine->sentenceCount(string $text): int
$engine->letterCount(string $text): int
$engine->totalSyllables(string $text): int
$engine->averageSyllablesPerWord(string $text): float
$engine->averageWordsPerSentence(string $text): float
$engine->polysyllableCount(string $text, bool $countProperNouns = true): int
$engine->wordsWithMoreThanNSyllables(string $text, int $n, bool $countProperNouns = true): int
$engine->histogramSyllables(string $text): array<int, int>
$engine->analyze(string $text): TextStatistics

splitSyllables vs splitWord: splitSyllables may use the approximate heuristic split, depending on the language's syllableMode. splitWord always uses the TeX hyphenator for exact hyphenation points. Syllable counts are accurate in all modes. See Syllable Counting Modes.

Formula Methods

$engine->fleschReadingEase(string $text): FormulaResult
$engine->fleschKincaidGradeLevel(string $text): FormulaResult
$engine->gunningFog(string $text): FormulaResult
$engine->smogIndex(string $text): FormulaResult
$engine->colemanLiau(string $text): FormulaResult
$engine->automatedReadabilityIndex(string $text): FormulaResult
$engine->lix(string $text): FormulaResult
$engine->wienerSachtextformel(string $text, int $variant = 1): FormulaResult
$engine->gulpease(string $text): FormulaResult
$engine->fernandezHuerta(string $text): FormulaResult
$engine->szigrisztPazos(string $text): FormulaResult
$engine->gutierrezPolini(string $text): FormulaResult
$engine->crawford(string $text): FormulaResult
$engine->fogPL(string $text): FormulaResult
$engine->daleChall(string $text): FormulaResult
$engine->spache(string $text): FormulaResult
$engine->osman(string $text): FormulaResult

Performance

| Operation | Time |
|---|---|
| Syllable counting (single word) | ~0.15 ms |
| Text analysis (450 words) | ~20 ms |
| Formula calculation (incl. analysis) | ~4 ms |
| Engine init (en-us, cached) | ~5 ms |
| Engine init (de-1996, first load) | ~380 ms |

Caching: compiled patterns are stored as JSON in the cache/ directory. First load parses .tex files (native hyph-utf8 format); subsequent loads use the pre-compiled cache.
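The gap between first load and cached load comes from a compile-once pattern. A minimal sketch of that idea follows. The function name, the cache file naming, and the freshness check are assumptions for illustration and are not taken from ReadSight's JsonPatternCache.

```php
<?php
declare(strict_types=1);

/**
 * Return compiled data for $sourceFile, recompiling only when the JSON cache
 * is missing or older than the source. $compile is a hypothetical parser callback.
 */
function loadCompiled(string $sourceFile, string $cacheDir, callable $compile): array
{
    $cacheFile = $cacheDir . '/' . basename($sourceFile) . '.json';

    if (is_file($cacheFile) && filemtime($cacheFile) >= filemtime($sourceFile)) {
        // Fast path: decode the pre-compiled JSON.
        return json_decode(file_get_contents($cacheFile), true, 512, JSON_THROW_ON_ERROR);
    }

    // Slow path: parse the source, then persist the result for next time.
    $compiled = $compile(file_get_contents($sourceFile));
    if (!is_dir($cacheDir)) {
        mkdir($cacheDir, 0775, true);
    }
    file_put_contents($cacheFile, json_encode($compiled, JSON_THROW_ON_ERROR), LOCK_EX);

    return $compiled;
}

// Demo with a throwaway directory and a toy "pattern file".
$dir = sys_get_temp_dir() . '/readsight-sketch-' . bin2hex(random_bytes(4));
mkdir($dir);
$source = $dir . '/demo.tex';
file_put_contents($source, "hy3ph\nhe2n\n");

$calls = 0;
$compile = function (string $raw) use (&$calls): array {
    $calls++; // count how often the expensive step runs
    return preg_split('/\s+/', trim($raw));
};

$first  = loadCompiled($source, $dir . '/cache', $compile);
$second = loadCompiled($source, $dir . '/cache', $compile);
// $calls is 1 here: the second call was served from the JSON cache.
```

Comparing modification times means an updated pattern file invalidates its cache entry without any manual cleanup.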

Custom Configuration

use GlobusStudio\ReadSight\Engine;

// Set default paths (before creating engines)
Engine::setDefaultCacheDir('/var/cache/readsight');
Engine::setDefaultPatternsDir('/usr/share/readsight/patterns');
Engine::setDefaultLanguagesDir('/usr/share/readsight/languages');

// Or per-instance
$engine = new Engine(
    language: 'en-us',
    patternsDir: '/custom/patterns',
    cacheDir: '/custom/cache',
);

// Add custom hyphenation rules (affects splitWord, not splitSyllables in composite/heuristic modes)
$engine->addHyphenations([
    'customword' => 'cus-tom-word',
]);
$engine->splitWord('customword');  // ['cus', 'tom', 'word']

Architecture

Engine (facade)
  ├── TextAnalyzer (syllable counting, text metrics)
  │   ├── SyllableCounter (strategy: tex | heuristic | composite)
  │   │   ├── CompositeSyllableCounter (problemWords → heuristic, rest → TeX)
  │   │   ├── HeuristicSyllableCounter (vowel patterns + word list)
  │   │   └── TexSyllableCounter → LiangHyphenator (TeX hyphenation)
  │   ├── LiangHyphenator
  │   │   ├── TexSource (parses .tex from hyph-utf8)
  │   │   ├── PatternsCollection (pattern data)
  │   │   ├── HyphenationExceptionsCollection (word overrides)
  │   │   └── JsonPatternCache (compiled patterns)
  │   └── TextSplitter (word/sentence/letter counting)
  ├── Language (JSON config per language, syllableMode + formulaConfigs)
  └── FormulaRegistry (17 formulas)
      ├── FleschReadingEase (with lang-specific coefficients)
      ├── GunningFog, SMOG, ColemanLiau, ARI, LIX (universal)
      └── WSTF, Gulpease, Fernandez-Huerta, etc. (lang-specific)

Data Sources

  • TeX hyphenation patterns: hyph-utf8, version 2026-02-21, the canonical TeX hyphenation repository maintained by the TeX Users Group (TUG). ReadSight ships 86 .tex pattern files from hyph-utf8, one per language variant, each packaged under its original license.
  • FRE coefficients: Amstad (DE), Oborneva (RU), Fernandez-Huerta (ES), Vacca-Franchina (IT), Kandel-Moles (FR), Douma (NL), Martins (PT), Ateşman (TR)
  • WSTF: Bamberger & Vanecek (DE)
  • Gulpease: GULP, La Sapienza University (IT)

Development

composer install          # Install dependencies

composer test             # Run PHPUnit (257 tests)
composer test:coverage    # With HTML coverage report
composer analyse          # PHPStan level max
composer cs:check         # PHP CS Fixer (dry-run)
composer cs:fix           # PHP CS Fixer (apply fixes)
composer check            # All checks: CS + PHPStan + Tests

Quality Metrics

| Metric | Value |
|---|---|
| Tests | 257 |
| Assertions | 1 047 |
| PHPStan | Level max, 0 errors |
| Source classes | 53 |
| Test classes | 21 |
| Supported languages | 86 |
| Writing systems | 19 |
| Readability formulas | 17 |
| Runtime dependencies | 0 |

License

MIT. Author: Yevhen Leonidov.

TeX pattern files from hyph-utf8 are packaged under their original licenses (see individual file headers).
