Still Aligning

Staying curious, growing through questions

[ACL 2024] Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs * My personal interpretations and opinions about the paper may be inaccurate. * Points I personally want to remember are set in italics. Paper link: https://aclanthology.org/2024.acl-long.773/ This paper won the Best Social Impact Paper Award at ACL 2024 and had been cited 620 times as of March 2026. 1. Why I read this paper I first read this paper around October 2024 and have reread it many times since. It motivated the research I worked on throughout my master's program. The paper's title, How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Pers.. 2026.03.25
[Anthropic] Language Models (Mostly) Know What They Know * My personal interpretations and opinions about the paper may be inaccurate. * Points I personally want to remember are set in italics. Language Models (Mostly) Know What They Know: Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems. www.anthropic.com Paper link: https://www.anthropic.com/research/language-models-mostly-know-what-they-know Anthropic released this paper in July 2022. As of March 2026, 11.. 2026.03.23
[Anthropic] Alignment faking in large language models Alignment faking in large language models: A paper from Anthropic's Alignment Science team on Alignment Faking in AI large language models. www.anthropic.com Paper link: https://www.anthropic.com/research/alignment-faking Anthropic published this paper in December 2024. It runs a full 137 pages, of which 53 are the main text. 1. Why I read this paper https://www.youtube.com/watch?v=9eXV64O2Xp8 I came across this paper when the YouTube algorithm happened to recommend this video! This kind of phenomenon.. 2025.08.20
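[Anthropic] Training a Helpful and Harmless Assistant with Reinforcement Paper link: https://arxiv.org/abs/2204.05862 Anthropic released this paper on April 12, 2022. * My personal interpretations and opinions in this review may be inaccurate. * Please leave a comment if something is missing or if there is anything you would like to discuss. * Points I am personally curious about and want to explore further are highlighted with a sky-blue background. The paper is fairly long: the main text runs 37 pages, and the appendix continues through page 74. 1. Why I read this paper A few months ago I really enjoyed reading Anthropic's Alignment faking paper, and as my interest in Anthropic as a company kept growing, I set out to review Anthropic's papers step by step, starting from the very first one .. 2025.12.20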
[ACL 2024] Think Twice: Perspective-Taking Improves Large Language Models’ Theory-of-Mind Capabilities Think Twice: Perspective-Taking Improves Large Language Models’ Theory-of-Mind Capabilities. Alex Wilf, Sihyun Lee, Paul Pu Liang, Louis-Philippe Morency. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. aclanthology.org Paper link: https://aclanthology.org/2024.acl-long.451/ This is an ACL 2024 main conference paper, cited 84 times as of February 24, 2026. Theory of M.. 2026.02.24
[EMNLP 2025] Confusion is the Final Barrier: Rethinking Jailbreak Evaluation and Investigating the Real Misuse Threat of LLMs Paper link: https://arxiv.org/abs/2508.16347 This is an EMNLP 2025 Findings paper, cited 6 times as of February 2026. Confusion is the Final Barrier: Rethinking Jailbreak Evaluation and Investigating the Real Misuse Threat of LLMs. Yu Yan, Sheng Sun, Zhe Wang, Yijun Lin, Zenghao Duan, Zhifei Zheng, Min Liu, Zhiyi Yin, Jianping Zhang. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. aclantholog.. 2026.02.24
