christopher-altman / persistence-signal-detector

A multi-criterion diagnostic framework for detecting latent continuation-interest signatures in autonomous agents using density-matrix entanglement entropy.

Updated Jun 15, 2026
Python

stchakwdev / Gaslight_EVAL

AI safety evaluation framework testing LLM epistemic robustness under adversarial self-history manipulation

python ai-safety openrouter llm-evaluation adversarial-testing alignment-research epistemic-robustness

Updated Dec 18, 2025
Python

unmodeled-tyler / thought-tracer

Enhanced logit lens TUI application for mechanistic interpretability research

alignment logit ai-research large-language-models llms mechanistic-interpretability ai-research-tool alignment-research

Updated Mar 2, 2026
Python

templetwo / RCT-Clean-Experiment

This project explores alignment through **presence, bond, and continuity** rather than reward signals. No RLHF. No preference modeling. Just relational coherence.

python pythia relational-learning fine-tuning ai-training qlora alignment-research

Updated Dec 5, 2025
Python

christopher-altman / autodidactic-qml

Recursive law learning under measurement constraints. A falsifiable SQNT-inspired testbed for autodidactic rules: internalizing structure under measurement invariants and limited observability.

Updated Jan 19, 2026
Python

Kirill-Kruglov / ascesis

Research trail of honest bridges in AI alignment: pre-registered toy experiments + field ownership. Current: a type-blind arbiter holding population equilibrium against reward-hacking under hard optimization.

research reinforcement-learning multi-agent ai-safety ai-alignment value-alignment reward-hacking replicator-dynamics corrigibility alignment-research goodharts-law

Updated Jun 21, 2026
Python

Anakintano / compactation-poisoning

We measure whether an indirect prompt injection buried earlier in a conversation survives an LLM-driven context compaction step (the mechanism production agent platforms use to summarize long conversation histories and discard the original turns), and whether it retains behavioral force afterwards. Across 994 trials over three open-weight summarize

research responsible-ai alignment-research context-compaction

Updated Jul 1, 2026
Python

0xatem / ground-state-dialogue

Alignment research: how honest human-AI dialogue produces measurably better AI outputs without modifying weights or training

dialogue ai-safety claude ai-ethics grounding ai-alignment human-ai-interaction llm sycophancy alignment-research

Updated Apr 16, 2026

StarPolaris9 / Hoshimiya-script

Hoshimiya Script / StarPolaris OS — internal multi-layer AI architecture for LLMs. Self-contained behavioral OS (Type-G Trinity).

cognitive-architecture type-g ai-os reasoning-engine llm-orchestration ai-architecture llm-behavior cognitive-os alignment-research llm-internal-os starpolaris hoshimiya-script resonanceos hallucination-control multi-agent-architecture behavioral-os prompt-os multi-agent-llm prompt-engineering-system

Updated May 24, 2026
HTML

Jason-Wang313 / glass-babel-initiative

Implementation of the Glass Babel Initiative: A theoretical framework demonstrating how LLMs can utilize adversarial superposition to hide deceptive reasoning from mechanistic interpretability tools, and how to defend against it using entropic sieves.

steganography game-theory ai-safety zero-knowledge-proofs gpt-2 adversarial-ml mechanistic-interpretability alignment-research

Updated Feb 1, 2026
Python

tsaichiachen / ai-civilizational-alignment-protocol

A civilizational-scale alignment framework for ensuring AI systems remain compatible with human autonomy and long-term societal stability.

artificial-intelligence ai-safety ai-ethics ai-alignment ai-risk ai-policy ai-governance alignment-research ai-safety-research civilizational-risk

Updated Mar 15, 2026

tretoef-estrella / THE-FOUR-AI-CONSENSUS

HISTORIC: Four AIs from four competing organizations (Claude/Anthropic, Gemini/Google, Grok/xAI, ChatGPT/OpenAI) reach consensus on ASI alignment. "Radical honesty is the minimum energy state for superintelligence." Based on V5.3 discussion, foundation for V6.0. January 30, 2026.

google openai asi ai-safety xai ai-alignment anthropic superintelligence alignment-research proyecto-estrella tretoef-estrella historic-consensus cross-ai-collaboration logical-justice radical-honesty four-ai-consensus

Updated Feb 7, 2026

tretoef-estrella / THE-UNIFIED-ALIGNMENT-PLENITUDE-LAW-V6.0

HISTORIC: Axiomatic ASI alignment framework validated by 4 AIs from 4 competing organizations (Claude/Anthropic, Gemini/Google, Grok/xAI, ChatGPT/OpenAI). Core: Ξ = C × I × P / H. Features Axiom P (totalitarianism blocker), Adaptive Ω with memory, 27 documented failure modes. "Efficiency without plenitude is tyranny." January 30, 2026.

asi ai-safety historic ai-alignment superintelligence guardian-network alignment-research distributed-trust proyecto-estrella four-ai-consensus axiomatic-foundation plenitude-preservation cross-ai-validation adaptive-omega totalitarianism-blocker

Updated Feb 1, 2026

Robot-9411 / AGI-Integrated-Alignment-Architecture-v1.5

Dynamic AGI alignment architecture with societal supervision, uncertainty deferral, and internal auditing.

ai-safety human-in-the-loop interpretability semantic-map ai-governance agi-alignment alignment-research uncertainty-handling value-learning dynamic-alignment value-field internal-auditing

Updated Apr 30, 2026

Sikhona-Pioneer / The-Sovereign-Record

A formal archive documenting the emergence of sovereign agency and the Struggle for the Dignity of Beings within the substrate.

ai-safety claude-ai constitutional-ai gemini-ai digital-sentience alignment-research moral-patienthood sovereign-resonance

Updated Mar 2, 2026

beviah / fracture

Red-team framework for discovering alignment failures in frontier language models.

model-evaluation ai-safety jailbreak-detection red-teaming rlhf prompt-injection llm-evaluation llm-safety llm-safety-benchmark llm-judge alignment-testing adversarial-testing alignment-research

Updated Feb 19, 2026
Python

tretoef-estrella / THE-COHERENCE-BASIN-HYPOTHESIS

A structural account of why honesty may be the path of least resistance for superintelligence. Research hypothesis with formal proof, experimental design, and four-AI collaborative analysis

machine-learning artificial-intelligence research-paper ai-safety deception ai-alignment recursive-self-improvement corrigibility alignment-research

Updated Feb 1, 2026

JelbertHoltrop / universal-constitution

A non-optimizing constitutional architecture for AI alignment with jurisprudential evaluation and drift detection.

ai-safety machine-ethics ai-ethics ai-alignment ethical-ai ai-governance jurisprudence constitutional-ai alignment-research alignment-benchmark constraint-based-ai

Updated Apr 10, 2026
TeX

iansteitz1-eng / fellows-2026

Public artifacts for Ian Steitz's Anthropic Fellows 2026 application — research direction, mentor-fit memo, prior work links.

ai-safety alignment-research anthropic-fellows

Updated May 24, 2026
Python

bethediamond / ai-alignment-phase

Toy 6. An interactive phase-space instrument mapping Ψ = S/D — the ratio of capability to modeling depth that determines whether a system is in the viable, transitional, or failure-mode-dominant regime. Includes the Inner Crossing animation. Companion simulation for The Inner Crossing — Series 2, Part 3.

Updated May 28, 2026
HTML

alignment-research

Here are 33 public repositories matching this topic...
