Perplexity.ai and the Titanic Dataset: An Experiment
For background, see my post Data Governance: Titanic Dataset and the Perils of Bad Data.
Perplexity.ai is an AI-powered answer engine designed to provide concise, well-sourced answers by synthesising information from the web using large language models (LLMs). This article explores its ability to analyse the classic Titanic dataset, focusing on the question of the oldest survivor and how well it handles potential data quality issues. The exercise reveals both the power and the limitations of LLMs in data analysis.
See the details of the experiment, replayed below. It was performed on 24 April 2025, Sydney time.
PROMPT: There is a common titanic data set available online - can you tell me what it has to say about the oldest survivor?
RESPONSE:
The Titanic dataset commonly analyzed (particularly the Kaggle version with 891-1,309 passenger records) contains an age-related anomaly regarding the oldest survivor. While the dataset includes a passenger listed as 80 years old, this appears to be an error according to Kaggle community findings – the age likely represents the person's lifespan after surviving the disaster, not their age during the sinking [2].
The dataset's age analysis reveals:
- Age range: Passengers span from infants (as young as 0.42 years) to 80 years in the raw data, though the 80-year-old entry is questionable [1, 8, 9].
- Survival trends: No verified oldest survivor is explicitly identified in the dataset's standard fields, but Millvina Dean (who died in 2009 at 97) is historically recognized as the youngest survivor (2 months old during the disaster) [5]. The dataset's limitations (missing ages for ~20% of passengers and potential age inaccuracies [6]) make definitive conclusions about the oldest survivor challenging.
If analyzing the dataset directly, one would need to filter `Survived = 1` and find the maximum `Age` value, but this approach risks including erroneous entries like the disputed 80-year-old record [2].
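For reference, the filter-and-max approach the response describes takes only a few lines of pandas. A minimal sketch, assuming the standard Kaggle train.csv with its `Survived` and `Age` columns (the filename is illustrative):

```python
import pandas as pd

# Load the Kaggle Titanic training set (assumed local copy).
df = pd.read_csv("train.csv")

# Filter to survivors, then take the row with the maximum age.
survivors = df[df["Survived"] == 1]
oldest = survivors.loc[survivors["Age"].idxmax()]  # idxmax skips missing ages

print(oldest[["Name", "Age"]])
```

Run naively against the Kaggle training set, this query surfaces the very 80-year-old record whose validity the response questions.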
Conclusion
It seems the LLM is able to synthesise an answer from its training data, which hopefully reflects the public domain content appropriately. Interestingly, it references a Kaggle post that mentions my original blog post but doesn't cite the original source directly. The LLM's conclusion that the 80-year-old survivor "appears to be an error" highlights its tendency to replay secondhand information. More importantly, it fails to leverage the systematic data quality analysis in my original article. Why did it rely on a community discussion thread instead of picking up on the systematic errors identified in my more comprehensive analysis?
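For illustration, the kind of systematic checking my original article performed at length can begin with just a few lines. The sketch below (again assuming the Kaggle train.csv) verifies the missing-age figure the response quotes and flags the disputed record for manual review rather than taking it at face value:

```python
import pandas as pd

# Illustrative data-quality checks only -- not the full analysis
# from the original article.
df = pd.read_csv("train.csv")

# The response's "~20% missing ages" claim is easy to verify.
print(f"Missing ages: {df['Age'].isna().mean():.1%}")

# Flag implausibly old passengers for manual review,
# e.g. the disputed 80-year-old survivor.
suspects = df[df["Age"] >= 75]
print(suspects[["Name", "Age", "Survived"]])
```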
All of this underscores the ongoing importance of critical data analysis and the limitations of relying solely on LLMs.