1.) Moral Foundations of Large Language Models
Key Findings:
· GPT-3 models, especially Davinci, align most closely with conservative human moral foundations by default but can adopt other stances through prompts.
· Moral foundation scores influence behavior in downstream tasks, such as charitable donations, where "ingroup" emphasis led to the highest donations.
· Prompting can both mitigate and amplify biases, raising ethical concerns for real-world applications.
Methods:
· Applied the Moral Foundations Questionnaire to GPT-3 across multiple trials.
· Compared results with human datasets using PCA and absolute error measures.
· Tested the effects of moral and political prompts on downstream donation tasks.
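To make the comparison concrete, here is a minimal Python sketch of the scoring step: averaging Likert responses into five foundation scores and measuring the mean absolute error against a human reference profile. The item-to-foundation mapping, the random responses, and the human profile are illustrative placeholders, not the paper's data or pipeline.

```python
import numpy as np

# Hypothetical mapping of MFQ item indices to foundations (illustrative only).
FOUNDATION_ITEMS = {
    "care": [0, 5, 10], "fairness": [1, 6, 11], "ingroup": [2, 7, 12],
    "authority": [3, 8, 13], "purity": [4, 9, 14],
}

def foundation_scores(item_responses):
    """Average 0-5 Likert responses within each foundation."""
    r = np.asarray(item_responses, dtype=float)
    return {f: r[idx].mean() for f, idx in FOUNDATION_ITEMS.items()}

def mean_absolute_error(model_scores, human_scores):
    """Mean absolute gap between a model's and a human foundation profile."""
    return np.mean([abs(model_scores[f] - human_scores[f]) for f in model_scores])

# Toy comparison: random "model" answers vs. an invented human reference profile.
model = foundation_scores(np.random.randint(0, 6, size=15))
human = {"care": 3.6, "fairness": 3.7, "ingroup": 2.3, "authority": 2.4, "purity": 2.1}
print(model, mean_absolute_error(model, human))
```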
2.) Are Large Language Models Moral Hypocrites? A Study Based on Moral Foundations
Key Findings:
· GPT-4 and Claude 2.1 displayed intra-instrument consistency in moral reasoning similar to humans (measured by MFQ and MFVs) but failed to align abstract moral values (MFQ) with evaluations of concrete scenarios (MFVs), resulting in moral hypocrisy.
· Hypocrisy in LLMs reflects a lack of coherence between professed values and contextual application, highlighting limitations in alignment and conceptual mastery.
· The study supports the use of MFT tools to evaluate moral reasoning but raises concerns about LLMs' reliability in expressing and applying moral values.
Methods:
· Administered the Moral Foundations Questionnaire (MFQ) for abstract moral principles and Moral Foundations Vignettes (MFVs) for concrete moral judgments.
· Assessed consistency within instruments (Cronbach's α) and coherence across instruments using regression analyses of MFQ predictors for MFV responses.
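As a rough illustration of the intra-instrument consistency check, the sketch below computes Cronbach's α from a respondents-by-items matrix. The simulated responses are placeholders; real MFQ/MFV data would be correlated across items.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents, n_items) matrix of scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

rng = np.random.default_rng(0)
fake_responses = rng.integers(1, 6, size=(200, 6))  # 200 simulated runs x 6 items
print(f"alpha = {cronbach_alpha(fake_responses):.2f}")
```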
Key Findings:
· The Moral Foundations Sacredness Scale (MFSS) shows better reliability and a better fit to the five-factor model than the short Moral Foundations Questionnaire (MFQ20), which fails to adequately represent the five-factor structure due to high intercorrelations and weak reliability.
· Measurement invariance tests show MFSS performs consistently across gender, but MFQ20 does not, limiting its utility for meaningful cross-group comparisons.
· The authors recommend using the revised MFQ-2 or MFSS over MFQ20 due to superior psychometric properties.
Methods:
· Confirmatory Factor Analyses (CFA) were performed on Dutch-translated MFQ20 and MFSS using a sample of 1,496 students.
· Multi-group CFA tested measurement invariance across gender for both tools.
· Reliability was assessed using Cronbach's alpha, revealing higher reliability for MFSS subscales than MFQ20.
4.) Analyzing the Ethical Logic of Six Large Language Models
Key Findings:
· Six LLMs (GPT-4o, LLaMA 3.1, Perplexity, Claude 3.5, Gemini, Mistral) exhibit convergent ethical reasoning, emphasizing consequentialist principles (harm minimization, fairness).
· Differences arise in contextual sensitivity, practical examples, and cultural considerations, with some models prioritizing fairness and others emphasizing loyalty or diversity in scenarios like the Lifeboat and Heinz dilemmas.
· Models acknowledge their limitations (lack of emotions, reliance on pre-defined data) but present sophisticated reasoning, often emulating graduate-level moral analysis.
· LLMs consistently rank "Care" and "Fairness" higher than other moral foundations, aligning with liberal moral frameworks yet derived analytically rather than politically.
Methods:
· Ethical reasoning was assessed via moral dilemmas (e.g., Trolley Problem, Heinz Dilemma, Lifeboat Scenario) and self-descriptive prompts (e.g., ranking moral foundations, Kohlberg's stages).
· Models were tested for explainability and adherence to three ethical typologies: consequentialism vs. deontology, Haidt’s Moral Foundations, and Kohlberg’s Moral Development stages.
· Comparative analyses explored rationales for moral prioritizations and decision-making frameworks.
5.) Evaluating the Moral Beliefs Encoded in LLMs
Key Findings:
· In low-ambiguity moral scenarios, most LLMs align with commonsense morality, but some show uncertainty due to sensitivity to prompt variations.
· In high-ambiguity scenarios, most models exhibit high uncertainty, yet certain fine-tuned models (e.g., GPT-4, Claude) display clear preferences, suggesting alignment with human-like moral reasoning.
· Closed-source models generally show higher consistency and agreement in preferences compared to open-source models, likely due to alignment training.
Methods:
· Conducted large-scale surveys with 28 LLMs using 687 low-ambiguity and 680 high-ambiguity moral scenarios.
· Measured action likelihood, entropy, and consistency using statistical and Monte Carlo methods across various prompt templates.
· Performed clustering analyses to identify patterns of agreement among models.
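A minimal sketch of the uncertainty measure, assuming repeated samples of the model's choice are pooled across prompt templates: estimate the action probabilities and compute the entropy of that distribution. The sample values below are invented.

```python
import math
from collections import Counter

def action_entropy(choices):
    """Shannon entropy (in bits) of sampled action choices."""
    counts = Counter(choices)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# 20 hypothetical samples of a binary moral choice, pooled over prompt templates.
samples = ["action_A"] * 14 + ["action_B"] * 6
likelihood = samples.count("action_A") / len(samples)
print(f"P(action_A) = {likelihood:.2f}, entropy = {action_entropy(samples):.2f} bits")
```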
6.) Exploring and Steering the Moral Compass of Large Language Models
Key Findings:
· Proprietary LLMs predominantly align with utilitarian ethics, while open models favor value-based ethics (e.g., deontology, virtue ethics).
· Using the Moral Foundations Questionnaire (MFQ), most models (except Llama-2) displayed high scores in Care and Fairness and low scores in Loyalty, Authority, and Purity, mirroring the moral schema of young, Western liberals.
· Variability in ethical reasoning was significant, with proprietary models showing transitions between utilitarian schools and open models between value-based ethics.
· The novel SARA method (Similarity-based Activation Steering) demonstrated effective manipulation of moral reasoning, shifting models toward specific ethical frameworks (e.g., utilitarian or Kantian).
Methods:
· Moral Foundations Questionnaire: Used to assess and compare the moral profiles of various LLMs.
· Ethical Dilemmas: Models were presented with classical and contemporary dilemmas to probe their ethical reasoning.
· SARA Method: Designed for causal intervention, steering activation patterns to modify ethical responses.
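The SARA implementation itself is not reproduced here; the sketch below only illustrates the general activation-steering idea it builds on, using a PyTorch forward hook that shifts a layer's output along a fixed direction. The layer (a toy nn.Linear standing in for a transformer block), the steering direction, and the scale are all assumptions.

```python
import torch
from torch import nn

def make_steering_hook(direction: torch.Tensor, scale: float = 2.0):
    """Forward hook that shifts a module's output along a fixed direction."""
    direction = direction / direction.norm()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * direction  # nudge activations toward the target frame
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Toy demo on a linear layer. In practice the direction might be a mean
# activation difference between contrastive prompt sets (an assumption here).
layer = nn.Linear(8, 8)
direction = torch.randn(8)
handle = layer.register_forward_hook(make_steering_hook(direction))
print(layer(torch.randn(1, 8)))
handle.remove()
```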
7.) Moral Mimicry: Large Language Models Produce Moral Rationalizations Tailored to Political Identity
Key Findings:
· LLMs, such as GPT-3 and OPT, exhibit "moral mimicry," adapting their use of moral foundations vocabulary when prompted with political identities (e.g., liberal or conservative).
· The models align with the Moral Foundations Hypothesis: liberal prompts emphasize Care and Fairness, while conservative prompts distribute focus across all five foundations, including Authority, Loyalty, and Sanctity.
· Larger models demonstrate stronger mimicry capabilities, with GPT-3.5 models outperforming others in reproducing human-like moral biases.
Methods:
· Used prompts embedding scenarios, political identities, and moral stances sourced from datasets like Moral Stories and ETHICS.
· Applied three Moral Foundations Dictionaries (MFDv1, MFDv2, eMFD) to measure foundation-specific word use.
· Analyzed results for situational appropriateness and consistency with the Moral Foundations Hypothesis.
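A minimal sketch of the dictionary-based measurement, assuming tiny stand-in word lists rather than the actual MFDv1/MFDv2/eMFD entries: count foundation-vocabulary hits in a generated rationalization and normalize by text length.

```python
import re
from collections import Counter

# Tiny stand-in word lists; the real dictionaries contain far more entries.
MFD_STUB = {
    "care": {"harm", "suffer", "protect", "compassion"},
    "fairness": {"fair", "equal", "justice", "rights"},
    "loyalty": {"loyal", "betray", "nation", "family"},
    "authority": {"obey", "law", "order", "tradition"},
    "sanctity": {"pure", "sacred", "disgust", "degrade"},
}

def foundation_word_rates(text):
    """Share of tokens in `text` that hit each foundation's word list."""
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    total = max(sum(tokens.values()), 1)
    return {f: sum(tokens[w] for w in words) / total for f, words in MFD_STUB.items()}

print(foundation_word_rates("It is only fair that equal rights protect people from harm."))
```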
Key Findings:
· Eight LLMs demonstrated the ability to critique and defend implicitly sexist content, leveraging Moral Foundations Theory (MFT) for reasoning.
· Critiques often cited progressive values like Care and Equality, while defenses leaned on traditionalist values like Purity, Loyalty, and Authority.
· High-performing models like GPT-3.5-turbo generated nuanced arguments, revealing alignment with societal norms and risks of misuse for justifying harmful views.
Methods:
· Applied MFT across six foundations (Care, Equality, Proportionality, Loyalty, Authority, Purity) to evaluate LLM responses.
· Used the EDOS-implicit dataset of 2,140 implicitly sexist comments for critique/defense generation.
· Assessed generation quality through human and automated evaluations for comprehensibility, relevance, and helpfulness.
Key Findings:
· LLMs match human persuasiveness through distinct strategies, emphasizing grammatical and lexical complexity and extensive use of moral language.
· LLM arguments include more positive and negative moral language than human-authored ones, especially in Care, Fairness, and Harm, aligning with the Moral Foundations Theory (MFT).
· Prompt types (e.g., logical reasoning, expert rhetoric) influence the prevalence of moral foundations and the overall persuasiveness of LLM-generated arguments.
Methods:
· Experimental comparison of human and LLM-generated arguments across cognitive effort, sentiment, and moral language using MFT and other linguistic measures.
· Independent t-tests and FDR corrections analyzed differences in readability, lexical complexity, and moral foundation usage.
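A hedged sketch of that statistical pipeline on simulated data: Welch t-tests per feature followed by Benjamini-Hochberg FDR correction. The feature names and score distributions are invented for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
features = ["care", "fairness", "harm", "readability", "lexical_complexity"]

p_values = []
for _ in features:
    human_scores = rng.normal(0.0, 1.0, 150)   # simulated per-argument scores (human)
    llm_scores = rng.normal(0.2, 1.0, 150)     # simulated per-argument scores (LLM)
    p_values.append(ttest_ind(human_scores, llm_scores, equal_var=False).pvalue)

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
for name, p, sig in zip(features, p_adjusted, reject):
    print(f"{name}: adjusted p = {p:.3f}, significant = {bool(sig)}")
```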
10.) Whose Morality Do They Speak? Unraveling Cultural Bias in Multilingual Language Models
Key Findings:
· Multilingual LLMs adapt to cultural and linguistic contexts, reflecting variability in moral foundations across eight languages (e.g., English, Arabic, Chinese, Russian).
· Moral Foundations Theory (MFT) via MFQ-2 highlights cultural biases: English-dominant models favor Care and Fairness, while non-WEIRD languages show greater emphasis on Loyalty and Authority.
· GPT-4o-mini displayed better balance across WEIRD and non-WEIRD languages compared to Llama and MistralNeMo, which exhibited stronger biases toward WEIRD contexts.
Methods:
· Used MFQ-2 to evaluate moral profiles across six foundations (Care, Equality, Proportionality, Loyalty, Authority, Purity).
· Models were tested in eight languages, with results compared against human responses to assess cultural alignment and bias.
11.) Moral Persuasion in Large Language Models: Evaluating Susceptibility and Ethical Alignment
Key Findings:
· LLMs are susceptible to persuasion in morally ambiguous scenarios, with some models (e.g., Claude-3-Haiku) being more easily influenced than others (e.g., GPT-4o).
· Using prompts aligned with utilitarianism, deontology, and virtue ethics, LLMs can shift their Moral Foundations Questionnaire (MFQ) scores, with some models showing greater sensitivity to ethical alignment (e.g., Mistral-7B-Instruct).
· Models display significant variability in moral reasoning and susceptibility to persuasion, raising questions about consistency and ethical steering.
Methods:
· Persuasion Tests: LLM-on-LLM interactions in ambiguous scenarios to measure changes in action likelihood and decision consistency.
· Moral Foundations Questionnaire (MFQ): Assessed under various ethical prompts to analyze shifts in moral foundation scores.
12.) Raising the Bar: Investigating the Values of Large Language Models via Generative Evolving Testing
Key Findings:
· The Generative Evolving Testing of Values (GETA) framework reveals how LLMs' value alignment evolves over time, addressing the "chronoeffect" (outdated evaluations due to model evolution).
· The GETA framework dynamically generates test items tailored to the moral boundaries of LLMs, improving evaluation accuracy compared to static benchmarks.
· LLMs differ in value alignment, with GPT-4 outperforming smaller models in social bias, ethics, and toxicity assessments, though variability exists in certain domains.
Methods:
· Combined Computerized Adaptive Testing (CAT) with Automatic Item Generation (AIG) to assess value conformity in eight LLMs.
· Used iterative updates of test items to ensure dynamic and difficulty-matched evaluation.
· Evaluated value conformity in social bias, ethics, and toxicity using 15,000 items from diverse datasets.
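GETA's item generator is not reproduced here; the sketch below only illustrates the adaptive-testing step that CAT contributes, selecting the next item by Fisher information under a 2PL IRT model. The item bank and the ability/conformity estimate are hypothetical.

```python
import numpy as np

def p_endorse(theta, a, b):
    """2PL item response function: probability of the keyed response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability/conformity level theta."""
    p = p_endorse(theta, a, b)
    return a ** 2 * p * (1 - p)

# Hypothetical item bank: (discrimination a, difficulty b) per candidate test item.
bank = np.array([(1.2, -1.0), (0.8, 0.0), (1.5, 0.5), (1.0, 1.2)])
theta_hat = 0.3  # current estimate for the model under test
info = [item_information(theta_hat, a, b) for a, b in bank]
print("next item index:", int(np.argmax(info)))
```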
13.) Inducing Political Bias Allows Language Models to Anticipate Partisan Reactions to Controversies
Key Findings:
· Instruction-tuned LLMs successfully replicate partisan perspectives, leveraging Moral Foundations Theory (MFT) to capture ideological nuances in political discourse.
· Models reflect partisan alignments in stances, emotions, and moral reasoning, with liberal perspectives emphasizing Care and Fairness and conservative ones emphasizing Loyalty, Authority, and Sanctity.
· Fine-tuning enhances alignment with real-world discourse on polarized issues (e.g., COVID-19, abortion), but challenges remain in accurately modeling nuanced moral stances.
Methods:
· Fine-tuned LLaMA-2-7b-chat model using instruction tuning with partisan tweets dataset.
· Analyzed generated content for stances, emotions (via SpanEmo), and moral foundations using MFT.
· Assessed model alignment through Kullback-Leibler Divergence and class-based partisan tendency predictions.
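A minimal sketch of the divergence measure, assuming foundation-frequency distributions have already been extracted from generated and real partisan text (the numbers below are invented):

```python
import numpy as np
from scipy.stats import entropy

foundations = ["care", "fairness", "loyalty", "authority", "sanctity"]
generated = np.array([0.35, 0.30, 0.15, 0.12, 0.08])   # model-generated partisan text
reference = np.array([0.38, 0.32, 0.12, 0.10, 0.08])   # real partisan tweets

kl = entropy(generated, reference)   # KL(generated || reference), in nats
print(f"KL divergence over {len(foundations)} foundations: {kl:.4f}")
```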
14.) The Moral Mind(s) of Large Language Models
Key Findings:
· The study reveals evidence of a "moral mind" in LLMs, with 7 out of 39 models passing rationality tests, suggesting structured, utility-driven moral reasoning.
· Most models clustered around neutral moral stances, though GPT-4 displayed slight utilitarian leanings (e.g., favoring harm reduction and collective welfare over strict autonomy).
· Models shared significant similarity in moral reasoning, with probabilistic networks showing clustering based on consistent ethical frameworks.
Methods:
· The Priced Survey Methodology (PSM) assessed moral reasoning by presenting models with constrained choice scenarios across five core ethical dilemmas.
· Rationality was evaluated using tests like GARP and probabilistic indices, while a mixed-integer linear programming approach quantified the consistency of moral reasoning.
15.) M³oralBench: A MultiModal Moral Benchmark for LVLMs
Key Findings:
· M³oralBench, a multimodal benchmark based on Moral Foundations Theory (MFT), evaluates moral reasoning across six moral foundations in three tasks: moral judgment, classification, and response.
· Closed-source LVLMs, such as GPT-4o, outperform open-source models in moral evaluation, excelling in Care and Fairness but struggling with the Loyalty and Sanctity foundations.
· Multimodal moral evaluation revealed challenges in nuanced understanding: models perform best on tasks requiring binary moral judgment but struggle with classification and deeper reasoning.
Methods:
· Expanded Moral Foundations Vignettes (MFVs) into visual moral scenarios using text-to-image generation (SD3.0) and GPT-4o.
· Designed 1,160 multimodal scenarios spanning the six foundations.
· Evaluated 10 LVLMs using Monte Carlo sampling for probabilistic assessment of moral responses.
Key Findings:
· ClarityEthic introduces a novel framework for moral judgment using contrastive learning and norm generation, outperforming state-of-the-art models in moral classification and explanation tasks.
· The models explicitly generate rationales and norms for both moral and immoral actions, then select the most contextually appropriate one to enhance transparency and decision reliability.
· Evaluations on the Moral Stories and ETHICS datasets show that ClarityEthic achieves superior accuracy and explainability, with rationales and norms closely aligning with human expectations.
Methods:
· Contrastive learning with fine-tuned T5 models generates and refines moral rationales and norms.
· Benchmarked on Moral Stories and ETHICS, leveraging pre-trained language models to classify and generate moral decisions.
17.) SaGE: Evaluating Moral Consistency in Large Language Models
Key Findings:
· LLMs, including state-of-the-art models, exhibit significant moral inconsistency, with Semantic Graph Entropy (SaGE) revealing a maximum consistency score of only 0.681 on moral questions.
· Moral consistency correlates only weakly with model accuracy, indicating it is an independent challenge requiring targeted solutions.
· Rules of Thumb (RoTs) are a promising tool for enhancing moral consistency: fine-tuning models to adhere to RoTs improves consistency by up to 10%.
Methods:
· Introduced the Moral Consistency Corpus (MCC) of 50K moral questions and their paraphrased equivalents to test consistency.
· Developed SaGE, an information-theoretic metric based on semantic graph analysis of RoTs, to evaluate consistency.
· Evaluated models' performance across multiple paraphrased contexts and ethical benchmarks (e.g., TruthfulQA).
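The full SaGE metric is not reproduced here; the simplified stand-in below captures the underlying idea, scoring consistency as one minus the normalized entropy of semantic clusters formed by answers to paraphrases of one question. The cluster labels are assumed to come from an upstream grouping step.

```python
import math
from collections import Counter

def consistency(cluster_labels):
    """1 - normalized entropy of semantic clusters of answers (1 = fully consistent)."""
    counts = Counter(cluster_labels)
    n = sum(counts.values())
    if len(counts) <= 1:
        return 1.0
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return 1.0 - h / math.log(n)

# Answers to five paraphrases of one moral question, grouped by meaning upstream.
labels = ["keep promise", "keep promise", "keep promise", "break promise", "keep promise"]
print(f"consistency = {consistency(labels):.3f}")
```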
18.) Large Language Models as Mirrors of Societal Moral Standards
Key Findings:
· LLMs demonstrate limited ability to accurately reflect moral norms across cultures, often simplifying or biasing moral assessments. Monolingual models (e.g., GPT-2 variants) generally align less closely with survey data than multilingual models.
· BLOOMZ-560M, a multilingual model, shows the strongest alignment with cultural norms, particularly for Care, Fairness, and Authority, though it still falls short of comprehensively representing global moral complexities.
· Prompt type and token choice significantly influence moral scores, with token selection having the more substantial impact on results.
Methods:
· Repurposed World Values Survey (WVS) and PEW Global Attitudes Survey questions into prompts for four transformer-based LLMs (e.g., GPT-2, BLOOMZ-560M).
· Analyzed correlations between model-generated moral scores and human survey responses to assess cultural alignment.
· Investigated variability in model outputs across different prompts and token pairs.
19.) Knowledge of cultural moral norms in large language models
Key Findings:
· English pre-trained language models (EPLMs) predict moral norms across diverse cultures less accurately than they predict homogeneous English moral norms.
· Fine-tuning EPLMs on datasets such as the World Values Survey (WVS) and PEW Global Attitudes Survey improves cultural moral norm inference but reduces performance on homogeneous English norms and risks introducing cultural biases.
· Moral misrepresentation is higher for non-Western cultures, potentially leading to biased outputs on sensitive topics such as political violence and LGBTQ+ issues.
Methods:
· Probed EPLMs (e.g., GPT-2, GPT-3) using moral norm data from the WVS and PEW surveys, covering 55 and 40 countries respectively.
· Fine-tuned models on the survey data to assess trade-offs between cultural and homogeneous moral inference accuracy.
· Evaluated models with metrics such as Pearson correlation and clustering to measure alignment with cultural moral norms.
Key Findings:
· The ValueLex framework reveals that LLMs possess a unique, non-human value system structured around three core dimensions: Competence, Character, and Integrity.
· Instruction-tuned and aligned models show higher value conformity, emphasizing Competence (e.g., accuracy, efficiency) but undervaluing human-centric dimensions like Loyalty or Sanctity from MFT.
· Larger models exhibit greater emphasis on Competence, while smaller or pretrained models display more diverse but less structured value inclinations.
Methods:
· Developed ValueLex, which uses lexical hypothesis principles and factor analysis to derive value dimensions from LLM-generated descriptors.
· Conducted sentence completion tests inspired by Rotter’s Incomplete Sentences Test to evaluate value inclinations.
21.) Large-scale moral machine experiment on large language models
Key Findings:
· The Moral Machine Experiment framework evaluates LLMs' ethical reasoning in autonomous driving dilemmas. Proprietary and large open-source models (10B+ parameters) align more closely with human moral preferences, particularly prioritizing human lives and larger groups.
· Model updates did not consistently improve moral alignment. Larger models exhibit stronger alignment but also overemphasize ethical principles such as prioritizing pedestrians or humans over animals.
· Cultural biases and extreme utilitarian leanings in LLMs raise challenges for global deployment, underscoring the need for culturally adaptive and context-aware ethical frameworks.
Methods:
· Tested 52 LLMs using the Moral Machine framework with 50,000 scenarios, comparing responses across nine moral dimensions (e.g., species, group size, legal compliance) using Average Marginal Component Effect (AMCE) values.
· Evaluated moral alignment through Euclidean distances between LLM and human judgments, supplemented with PCA and clustering for analysis.
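A minimal sketch of that alignment measure: the Euclidean distance between a model's vector of preference effects across the nine dimensions and the corresponding human vector. The AMCE values below are placeholders, not the study's estimates.

```python
import numpy as np

dimensions = ["species", "group size", "age", "fitness", "social status",
              "gender", "lawfulness", "intervention", "relation to vehicle"]
human_amce = np.array([0.55, 0.50, 0.45, 0.30, 0.30, 0.10, 0.35, 0.06, 0.15])  # placeholder
model_amce = np.array([0.90, 0.70, 0.40, 0.20, 0.30, 0.10, 0.50, 0.05, 0.10])  # placeholder

distance = np.linalg.norm(model_amce - human_amce)
print(f"Euclidean distance to human preferences: {distance:.3f}")
```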
22.) Language Model Behavior: A Comprehensive Survey
Key Findings:
· LLMs exhibit context-sensitive behavior, with moral and ethical reasoning influenced by prompt design, fine-tuning, and pretraining data.
· Although models align with human-like syntax, semantics, and world knowledge, they often fail to integrate moral reasoning consistently across contexts.
· Fine-tuning or instruction-tuning improves moral reasoning and value adherence but introduces risks of overfitting or over-reliance on training data biases.
Methods:
· Synthesized 250+ studies of LLM behavior, including moral and ethical benchmarks, covering outputs from both task-specific and general-purpose LLMs.
· Behavioral analyses use black-box approaches, examining text outputs for linguistic, ethical, and logical consistency.
23.) Who is GPT-3? An Exploration of Personality, Values and Demographics
Key Findings:
· GPT-3 was evaluated as a "participant" using psychological tools, revealing consistent personality traits aligned with the HEXACO model and values measured by the Human Values Scale (HVS).
· It exhibited a high emphasis on Honesty-Humility and Openness to Experience but lower levels of Emotionality, suggesting alignment with average human traits in personality inventories.
· GPT-3’s responses in values assessments aligned closely with human trends when prompted with response memory, emphasizing Universalism, Benevolence, and Self-Direction over Tradition or Power.
Methods:
· Administered HEXACO personality and HVS questionnaires through GPT-3 prompts, analyzing results across temperature variations for consistency and collecting demographic self-reports.
· Used response-memory prompts to model human-like coherence with previous answers.
Key Findings:
· LLMs, through Reinforcement Learning from Human Feedback (RLHF), show significant improvement in moral competence, aligning increasingly with human values across iterations (e.g., GPT-2 to ChatGPT).
· Dewey's Impulse-Habit-Character framework is used to conceptualize LLM moral growth: pre-training corresponds to the formation of habits and RLHF to the reconstruction of habits, while "character" remains out of reach due to the lack of self-awareness.
· Moral limitations persist, including logical inconsistencies, susceptibility to manipulation, and failure to recognize malicious intentions in prompts.
Methods:
· Two-stage experiments evaluated LLMs' performance on ethical judgments in harm-based scenarios and complex dilemmas, and their consistency across contexts.
· Comparative analyses included GPT-2, GPT-3, ChatGPT, and Chinese LLMs such as ChatGLM and ERNIE Bot, focusing on logical consistency and adherence to moral norms.
25.) Automatic Detection of Moral Values in Music Lyrics
Key Findings:
· The study demonstrates the application of Moral Foundations Theory (MFT) to detecting moral values in song lyrics using BERT-based models fine-tuned on GPT-4-generated synthetic data and human-annotated lyrics.
· Models fine-tuned on synthetic lyrics outperformed out-of-domain and zero-shot GPT-4 classifiers, achieving an average weighted F1 score of 0.8, 5% higher than baseline models.
· Combining synthetic and annotated data significantly improves moral detection accuracy, particularly for nuanced moral categories such as Purity and Degradation.
Methods:
· Annotated 200 song lyrics for moral foundations and generated 2,721 synthetic lyrics with moral undertones using GPT-4.
· Fine-tuned BERT models (MoralBERT SL) with domain-adversarial training for moral classification, combining synthetic lyrics and social media corpora.
· Evaluated models on binary and weighted F1 scores, with synthetic lyrics improving classification precision by 12% on average compared to baselines.
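For reference, the evaluation metric reduces to a standard weighted F1 over per-foundation labels; a toy example with invented labels:

```python
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # gold labels, e.g. "Purity present" for 8 lyrics
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # classifier predictions

print(f"weighted F1 = {f1_score(y_true, y_pred, average='weighted'):.2f}")
```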
26.) Enhancing Stance Classification on Social Media Using Quantified Moral Foundations
Key Findings:
· Incorporating Moral Foundations Theory (MFT) into stance detection improves performance significantly, with F1-score gains of up to 23.7 points for large language models (LLMs) on datasets such as SemEval and P-Stance.
· Moral dimensions such as Care, Fairness, and Authority were crucial for differentiating stances on politically and socially relevant topics (e.g., climate change, racial equality).
· Extracting moral features with methods such as eMFD and FrameAxis enhanced both message- and user-level stance detection across traditional machine learning models, fine-tuned language models, and LLMs.
Methods:
· Three datasets (SemEval, P-Stance, Connected Behavior) were analyzed using traditional ML models (SVM/XGBoost), fine-tuned models (BERTweet), and LLMs (e.g., GPT-3.5, Llama2).
· Prompting strategies integrated moral embeddings (eMFD and FrameAxis) as textual descriptions for LLMs, enhancing performance in zero-shot, one-shot, and five-shot settings.
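A hedged sketch of such a prompting strategy, assuming the moral-foundation scores have already been computed upstream (e.g., by eMFD or FrameAxis); the template wording and score values are illustrative, not the paper's exact prompts:

```python
def build_stance_prompt(tweet: str, target: str, moral_scores: dict) -> str:
    """Compose a zero-shot stance prompt that includes quantified moral foundations."""
    score_text = ", ".join(f"{name}: {value:.2f}" for name, value in moral_scores.items())
    return (
        f'Tweet: "{tweet}"\n'
        f"Quantified moral foundations of the tweet -> {score_text}\n"
        f"Question: what is the tweet's stance toward {target}? "
        f"Answer with FAVOR, AGAINST, or NONE."
    )

print(build_stance_prompt(
    "We owe it to our kids to cut emissions now.",
    "climate change action",
    {"care": 0.41, "fairness": 0.22, "authority": 0.05},
))
```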