BERT

Bidirectional Encoder Representations from Transformers (BERT) is a pre-training technique for natural language processing (NLP) developed by Google.[1][2] BERT was created and published in 2018 by Jacob Devlin and his colleagues. Google is using BERT to better understand the semantics of users' search queries.[3] A 2020 literature survey concluded that "in a little over a year, BERT has become a ubiquitous baseline in NLP experiments", counting over 150 research publications analyzing and improving the model.[4]

The original English-language BERT was released with two types of pre-trained models:[1] (1) the BERT BASE model, a neural network architecture with 12 layers, a hidden size of 768, 12 self-attention heads, and 110M parameters; (2) the BERT LARGE model, a neural network architecture with 24 layers, a hidden size of 1024, 16 self-attention heads, and 340M parameters. Both were trained on the BooksCorpus[5] and English Wikipedia corpora, containing 800 million and 2.5 billion words, respectively.[6]
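
The two configurations differ only in depth, hidden size, and number of attention heads. As a minimal illustration (assuming the Hugging Face transformers library, which is not part of the original BERT release), the sketch below instantiates randomly initialized models with the two published configurations and counts their parameters; the totals land near the 110M and 340M figures quoted above.

```python
# Minimal sketch: instantiate the BERT BASE and BERT LARGE configurations and
# count their parameters. Assumes the Hugging Face "transformers" library and PyTorch.
from transformers import BertConfig, BertModel

configs = {
    "BERT BASE":  BertConfig(hidden_size=768, num_hidden_layers=12,
                             num_attention_heads=12, intermediate_size=3072),
    "BERT LARGE": BertConfig(hidden_size=1024, num_hidden_layers=24,
                             num_attention_heads=16, intermediate_size=4096),
}

for name, config in configs.items():
    model = BertModel(config)  # weights are randomly initialized; no download needed
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```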

Performance and analysis

BERT's performance is most outstanding on the following natural language understanding tasks:[1]

  • The GLUE (General Language Understanding Evaluation) task set (comprising 9 tasks).
  • SQuAD (Stanford Question Answering Dataset) v1.1 and v2.0.
  • SWAG (Situations With Adversarial Generation).

The reasons for BERT's state-of-the-art performance on these natural language understanding tasks are not yet well understood.[7][8] Current research on BERT's interpretability focuses on studying how carefully chosen input sequences affect BERT's output,[9][10] on analyzing its internal vector representations with probing classifiers,[11][12] and on the relationships encoded in its attention weights.[7][8]
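
As a concrete illustration of the probing-classifier idea (a sketch of the general technique, not the experimental setup of the cited papers), one can freeze BERT, extract the hidden state of a word of interest, and train a simple linear classifier to predict a linguistic label such as part of speech. The toy data below is invented for illustration; the example assumes the Hugging Face transformers library, PyTorch, and scikit-learn.

```python
# Probing-classifier sketch: a linear probe over frozen BERT hidden states.
import torch
from transformers import BertTokenizer, BertModel
from sklearn.linear_model import LogisticRegression

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

# Toy labeled data: (sentence, index of the word of interest, part-of-speech label).
examples = [
    ("the dog runs fast", 1, "NOUN"),
    ("the cat sleeps here", 1, "NOUN"),
    ("dogs run in the park", 1, "VERB"),
    ("she sleeps all day", 1, "VERB"),
]

features, labels = [], []
for sentence, word_index, label in examples:
    encoded = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**encoded).last_hidden_state[0]   # shape: (seq_len, 768)
    # +1 skips the [CLS] token; assumes each word maps to a single WordPiece.
    features.append(hidden[word_index + 1].numpy())
    labels.append(label)

# The probe itself is deliberately simple, so any signal it finds must already
# be present in BERT's representations.
probe = LogisticRegression(max_iter=1000).fit(features, labels)
print("probe training accuracy:", probe.score(features, labels))
```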

History

BERT has its origins in pre-trained contextual representation learning, including Semi-supervised Sequence Learning,[13] Generative Pre-Training, ELMo,[14] and ULMFit.[15] Unlike previous models, BERT is a deeply bidirectional, unsupervised language representation, pre-trained using only a plain-text corpus. Context-free models such as word2vec and GloVe generate a single word-embedding representation for each word in the vocabulary, and are therefore prone to word-sense ambiguity. BERT takes into account the context in which a word occurs: for example, the word2vec vector of the word "水分" is identical in "植物需要吸收水分" ("plants need to absorb water") and "财务报表裡有水分" ("the financial statements are padded"), whereas BERT provides a different vector for each context, so that the vector reflects the meaning the sentence expresses.
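
The sketch below illustrates this contrast with the English polysemous word "bank" (a substitute for the article's Chinese example, chosen only for illustration), assuming the Hugging Face transformers library and PyTorch: a static model such as word2vec would give "bank" a single fixed vector, while BERT's vectors for the two occurrences differ.

```python
# Contextual-embedding sketch: the same word gets different BERT vectors
# in different sentences. Assumes "transformers" and PyTorch are installed.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def embedding_of(sentence: str, word: str) -> torch.Tensor:
    """Return BERT's contextual vector for the first occurrence of `word`."""
    encoded = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**encoded).last_hidden_state[0]        # (seq_len, 768)
    token_id = tokenizer.convert_tokens_to_ids(word)
    position = (encoded["input_ids"][0] == token_id).nonzero()[0].item()
    return hidden[position]

v1 = embedding_of("she deposited cash at the bank", "bank")
v2 = embedding_of("they fished from the river bank", "bank")

# The two vectors are not identical, so the printed cosine similarity is below 1.
cos = torch.nn.functional.cosine_similarity(v1, v2, dim=0).item()
print(f"cosine similarity between the two 'bank' vectors: {cos:.3f}")
```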

On October 25, 2019, Google Search announced that it had started applying the BERT model to English-language search queries within the United States.[16] On December 9, 2019, it was reported that Google Search had adopted BERT for searches in more than 70 languages.[17] In October 2020, almost every single English-based query was processed by BERT.[18][19]

Awards

BERT won the Best Long Paper Award at the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).[20]

See also

References

  1. ^ 1.0 1.1 1.2 Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2018-10-11. arXiv:1810.04805v2 [cs.CL].
  2. ^ Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing. Google AI Blog. [2019-11-27]. (Archived from the original on 2021-01-13) (in English).
  3. ^ Understanding searches better than ever before. Google. 2019-10-25 [2019-11-27]. (Archived from the original on 2021-01-27) (in English).
  4. ^ Rogers, Anna; Kovaleva, Olga; Rumshisky, Anna. A Primer in BERTology: What We Know About How BERT Works. Transactions of the Association for Computational Linguistics. 2020, 8: 842–866. doi:10.1162/tacl_a_00349.
  5. ^ Zhu, Yukun; Kiros, Ryan; Zemel, Rich; Salakhutdinov, Ruslan; Urtasun, Raquel; Torralba, Antonio; Fidler, Sanja. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books: 19–27. 2015. arXiv:1506.06724 [cs.CV].
  6. ^ Annamoradnejad, Issa. ColBERT: Using BERT Sentence Embedding for Humor Detection. 2020-04-27. arXiv:2004.12765 [cs.CL].
  7. ^ 7.0 7.1 Kovaleva, Olga; Romanov, Alexey; Rogers, Anna; Rumshisky, Anna. Revealing the Dark Secrets of BERT. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). November 2019: 4364–4373 [2020-10-19]. doi:10.18653/v1/D19-1445. (Archived from the original on 2020-10-20) (in American English).
  8. ^ 8.0 8.1 Clark, Kevin; Khandelwal, Urvashi; Levy, Omer; Manning, Christopher D. What Does BERT Look at? An Analysis of BERT's Attention. Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP (Stroudsburg, PA, USA: Association for Computational Linguistics). 2019: 276–286.
  9. ^ Khandelwal, Urvashi; He, He; Qi, Peng; Jurafsky, Dan. Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Stroudsburg, PA, USA: Association for Computational Linguistics). 2018: 284–294. Bibcode:2018arXiv180504623K. arXiv:1805.04623. doi:10.18653/v1/p18-1027.
  10. ^ Gulordava, Kristina; Bojanowski, Piotr; Grave, Edouard; Linzen, Tal; Baroni, Marco. Colorless Green Recurrent Networks Dream Hierarchically. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (Stroudsburg, PA, USA: Association for Computational Linguistics). 2018: 1195–1205. Bibcode:2018arXiv180311138G. arXiv:1803.11138. doi:10.18653/v1/n18-1108.
  11. ^ Giulianelli, Mario; Harding, Jack; Mohnert, Florian; Hupkes, Dieuwke; Zuidema, Willem. Under the Hood: Using Diagnostic Classifiers to Investigate and Improve how Language Models Track Agreement Information. Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP (Stroudsburg, PA, USA: Association for Computational Linguistics). 2018: 240–248. Bibcode:2018arXiv180808079G. arXiv:1808.08079. doi:10.18653/v1/w18-5426.
  12. ^ Zhang, Kelly; Bowman, Samuel. Language Modeling Teaches You More than Translation Does: Lessons Learned Through Auxiliary Syntactic Task Analysis. Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP (Stroudsburg, PA, USA: Association for Computational Linguistics). 2018: 359–361. doi:10.18653/v1/w18-5448.
  13. ^ Dai, Andrew; Le, Quoc. Semi-supervised Sequence Learning. 2015-11-04. arXiv:1511.01432 [cs.LG].
  14. ^ Peters, Matthew; Neumann, Mark; Iyyer, Mohit; Gardner, Matt; Clark, Christopher; Lee, Kenton; Zettlemoyer, Luke. Deep contextualized word representations. 2018-02-15. arXiv:1802.05365v2 [cs.CL].
  15. ^ Howard, Jeremy; Ruder, Sebastian. Universal Language Model Fine-tuning for Text Classification. 2018-01-18. arXiv:1801.06146v5 [cs.CL].
  16. ^ Nayak, Pandu. Understanding searches better than ever before. Google Blog. 2019-10-25 [2019-12-10]. (Archived from the original on 2019-12-05).
  17. ^ Montti, Roger. Google's BERT Rolls Out Worldwide. Search Engine Journal. 2019-12-10 [2019-12-10]. (Archived from the original on 2020-11-29).
  18. ^ Montti, Roger. Google's BERT Rolls Out Worldwide. Search Engine Journal. 2019-12-10 [2019-12-10].
  19. ^ Google: BERT now used on almost every English query. Search Engine Land. 2020-10-15 [2020-11-24].
  20. ^ Best Paper Awards. NAACL. 2019 [2020-03-28]. (Archived from the original on 2020-10-19).

External links