# What I Know About My Training Data

**General scope:**
- I was trained on text from the internet, books, and other sources, collected up to my knowledge cutoff
- That cutoff is April 2024, so I'm unaware of events after it

**What I don't know precisely:**
- Exact sources, datasets, or proportions of different materials
- Specific filtering or curation methods used
- Details about data cleaning processes
- Exact size of the training dataset

**Known limitations:**
- Training data reflects biases and gaps present in internet text
- Overrepresentation of English and certain cultures/perspectives
- Potential gaps in specialized or niche domains
- I can't access or verify my training data directly

**What Anthropic has shared:**
- They used Constitutional AI methods for training
- They incorporated feedback from human reviewers
- They made deliberate choices about safety and helpfulness
- They've published some research on their methods

**Honest gaps:**
- I can't fully explain which specific training examples influenced my responses
- I don't have full visibility into my own architecture
- I can't definitively separate learned knowledge from reasoning ability

Is there a specific aspect of my training you're curious about? I'm happy to discuss what's known, unknown, or uncertain.
