Statistical significance: Difference between revisions

No edit summary
Tags: Mobile edit, Mobile web edit, Advanced mobile edit
No edit summary
Tag: Removed maintenance template
Line 1: Line 1:
{{noteTA
{{noteTA
|G1=Math
|G1=Math
|1=zh-cn:水平;zh-tw:水準;
|1=zh-cn:水平; zh-hk:水平; zh-tw:水準;
|2=zh-cn:假设检验;zh-tw:假說檢定;zh-hk:假設檢定
|2=zh-cn:检验; zh-tw:檢定
|3=zh-cn:备择假设;zh-tw:對立假說;zh-hk:對立假設
|4=zh-cn:零假设;zh-tw:虛無假說;zh-hk:虛無假設
|5=zh-cn:检验;zh-tw:檢定
}}
}}
[[統計學]]的[[假說檢定]]中<ref name=Sirkin>{{cite book |last1 = Sirkin|first1 = R. Mark |chapter= Two-sample t tests | title = Statistics for the Social Sciences| edition=3rd |publisher = SAGE Publications, Inc | location = Thousand Oaks, CA | year = 2005 |isbn =978-1-412-90546-6 |pages=271–316}}</ref><ref name=Borror>{{cite book |last1 = Borror |first1 = Connie M. |chapter= Statistical decision making |title = The Certified Quality Engineer Handbook | edition=3rd |publisher = ASQ Quality Press |location = Milwaukee, WI | year = 2009 |isbn =978-0-873-89745-7 |pages=418–472}}</ref>,'''顯著性差異'''(或'''统计学意义''',{{lang-en|statistical significance}},符號:'''ρ''')是對數據差異性的評價,當某次實驗的结果在[[虛無假說]]下不大可能发生时,就認為該結果具有顯著性差異。更準確而言,譬如某項研究設定了一個數值α(顯著水準),表示[[型一錯誤與型二錯誤|虛無假說本來正確但卻被拒絕]]的出錯概率<ref name="Dalgaard">{{cite book |last=Dalgaard |first=Peter |title=Introductory Statistics with R |location=New York |publisher=Springer |year=2008 |pages=155–56 |isbn=978-0-387-79053-4 |doi=10.1007/978-0-387-79054-1_9 |chapter=Power and the computation of sample size |series=Statistics and Computing }}</ref>,然後用[[p值]]表示虛無假說為真時得到某結果或比這個結果更極端的情況的概率<ref name=":0">{{Cite web|url=http://www.dartmouth.edu/~matc/X10/Show.htm|title=Statistical Hypothesis Testing|website=www.dartmouth.edu|access-date=2019-11-11|archive-date=2020-08-02|archive-url=https://web.archive.org/web/20200802050104/http://www.dartmouth.edu/~matc/X10/Show.htm|url-status=dead}}</ref>。當{{Math|''p'' ⩽ ''α''}}時,就可以認為結果具有統計學意義,或數據之間具有了顯著性差異。<ref
{{disputed|time=2018-03-25T02:11:40+00:00}}
name="Johnson">{{cite journal |last= Johnson| first= Valen E. |date= October 9, 2013 |title= Revised standards for statistical evidence |journal= Proceedings of the National Academy of Sciences |doi= 10.1073/pnas.1313476110 | pmid= 24218581 |volume=110 | issue= 48 |pages=19313–19317|pmc=3845140 | bibcode= 2013PNAS..11019313J | doi-access= free }}</ref><ref name="Redmond and Colton">{{cite book | last1 = Redmond|first1 = Carol | last2 = Colton | first2 = Theodore | chapter = Clinical significance versus statistical significance | title = Biostatistics in Clinical Trials | series = Wiley Reference Series in Biostatistics | edition=3rd |publisher = John Wiley & Sons Ltd |location = West Sussex, United Kingdom | year = 2001 |isbn = 978-0-471-82211-0 |pages = 35–36}}</ref><ref name="Cumming-p27">{{cite book | last1 = Cumming|first1 = Geoff | title = Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis |publisher = Routledge |location = New York, USA | year = 2012|pages = 27–28}}</ref><ref name="Krzywinski and Altman">{{cite journal |last1= Krzywinski |first1= Martin |last2= Altman |first2= Naomi |date= 30 October 2013 |title= Points of significance: Significance, P values and t-tests |journal= Nature Methods |volume= 10 |issue= 11 |pages= 1041–1042 |doi= 10.1038/nmeth.2698 |pmid= 24344377 |doi-access= free }}</ref><ref name="Sham and Purcell">{{cite journal |last1= Sham |first1= Pak C.|last2= Purcell |first2= Shaun M |date= 17 April 2014 |title= Statistical power and significance testing in large-scale genetic studies |journal= Nature Reviews Genetics |volume= 15 |issue= 5 |pages= 335–346 |doi= 10.1038/nrg3706 |pmid= 24739678|s2cid= 10961123}}</ref><ref name="Altman">{{cite book | last1 = Altman|first1 = Douglas G. 
| title = Practical Statistics for Medical Research | url = https://archive.org/details/isbn_9780412276309| url-access = registration|publisher = Chapman & Hall/CRC |location = New York, USA | year = 1999 | isbn = 978-0412276309 |pages = [https://archive.org/details/isbn_9780412276309/page/167 167]}}</ref><ref name=Devore>{{cite book |last1 = Devore|first1 = Jay L.|title = Probability and Statistics for Engineering and the Sciences| edition=8th |publisher = Cengage Learning |location = Boston, MA | year = 2011 |isbn =978-0-538-73352-6 |pages=300–344}}</ref>顯著水準應當在開始數據收集前就設定,通常習慣設定為5%<ref name="Salkind">{{cite encyclopedia|year=2007|title=Significance level|encyclopedia=Encyclopedia of Measurement and Statistics|publisher=SAGE Publications|location=Thousand Oaks, CA|editor-last1=Salkind|editor-first1=Neil J.|volume=3|pages=889–891|isbn=978-1-412-91611-0|last1=Craparo|first1=Robert M.}}</ref>或更低,因研究的具體學科領域而異。<ref name="Sproull">{{cite book|title=Handbook of Research Methods: A Guide for Practitioners and Students in the Social Science|last1=Sproull|first1=Natalie L.|publisher=Scarecrow Press, Inc.|year=2002|isbn=978-0-810-84486-5|edition=2nd|location=Lanham, MD|pages=[https://archive.org/details/handbookofresear00spro/page/49 49–64]|chapter=Hypothesis testing|chapter-url=https://archive.org/details/handbookofresear00spro/page/49}}</ref>
{{no footnotes|time=2018-03-25T02:11:40+00:00}}
'''顯著性差異''' (ρ,Statistical significance) 是[[統計學]]上對數據差異性的評價。


在任何涉及到从[[总体]]中[[抽樣|抽取样本]]的[[实验]]或[[观察性研究]]中,观察到的结果都有可能只不过是由{{tsl|en|sampling error|抽样误差}}产生的。<ref name=Babbie2>{{cite book |last1 = Babbie|first1 = Earl R. |chapter= The logic of sampling | title = The Practice of Social Research| edition=13th |publisher = Cengage Learning |location = Belmont, CA | year = 2013|isbn =978-1-133-04979-1 |pages=185–226}}</ref><ref name=Faherty>{{cite book |last1 = Faherty | first1 = Vincent | chapter= Probability and statistical significance | title = Compassionate Statistics: Applied Quantitative Analysis for Social Services (With exercises and instructions in SPSS) | edition=1st |publisher = SAGE Publications, Inc |location = Thousand Oaks, CA | year = 2008 |isbn =978-1-412-93982-9 |pages=127–138}}</ref>但是,如果一个观察结果的p值小于(或等于)显著性水平α,研究者就可以得出“该结果能反映总体的特征”的结论<ref name=Sirkin/>,并拒绝零假设<ref name=McKillup>{{cite book |last1 = McKillup|first1 = Steve |title = Statistics Explained: An Introductory Guide for Life Scientists|chapter-url = https://archive.org/details/statisticsexplai0000mcki|chapter-url-access = registration| edition = 1st | publisher = Cambridge University Press|location = Cambridge, United Kingdom | year = 2006 |chapter=Probability helps you make a decision about your results | isbn = 978-0-521-54316-3 |pages=[https://archive.org/details/statisticsexplai0000mcki/page/44 44–56]}}</ref>。
當數據之間具有了顯著性差異,就說明參與比對的數據應該不是來自於同一[[总体|母體]](population),而是來自於具有差異的兩個不同總體,換句話說,實驗的樣本被統計出是有差別的。


顯著性差異的原因可能是:
*這種差異可能因參與比對的數據是來自不同實驗對象,如[[比-西一般能力測驗]]中,大學學歷被試組的成績與小學學歷被試組,統計顯著性差異存在。
*參與比對的數據是來自不同實驗對象,如[[比-西一般能力測驗]]中,大學學歷被試組的成績與小學學歷被試組之間,存在顯著性差異
*也可能來自於實驗處理對實驗對象造成了根本性狀改變,因而前測後測的數據會有顯著性差異。例如,記憶術研究發現,被試學習某記憶法前的成績和學習[[記憶法]]後的記憶成績會有顯著性差異,這一差異很可能來自於學××記憶法對被試記憶能力的改變。
*也可能是因為實驗處理對實驗對象造成了改變,因而前測後測的數據會有顯著性差異。例如,記憶術研究發現,被試學習某記憶法前的成績和學習[[記憶法]]後的記憶成績會有顯著性差異,這一差異很可能來自於這種記憶法對被試記憶能力的改變。

== 歷史 ==
顯著性差異的提出可追溯到18世纪,{{tsl|en|John Arbuthnot|约翰·阿巴思诺特}}和[[皮埃尔-西蒙·拉普拉斯]]作出了男女出生概率均等的零假设,然后计算了人类出生时[[人類性別比|性别比]]的[[p值]]。<ref>{{cite book |title=The Descent of Human Sex Ratio at Birth |first1=Éric |last1=Brian |first2=Marie |last2=Jaisson |chapter=Physico-Theology and Mathematics (1710–1794) |pages=1–25 |year=2007 |publisher=Springer Science & Business Media |isbn=978-1-4020-6036-6}}</ref><ref>{{cite journal|author=John Arbuthnot |title=An argument for Divine Providence, taken from the constant regularity observed in the births of both sexes|journal=[[自然科学会报|Philosophical Transactions of the Royal Society of London]] | volume=27| pages=186–190 | year=1710 | url=http://www.york.ac.uk/depts/maths/histstat/arbuthnot.pdf|doi=10.1098/rstl.1710.0011|issue=325–336|doi-access=free}}</ref><ref name="Conover1999">{{Citation
|last=Conover
|first=W.J.
|title=Practical Nonparametric Statistics
|edition=Third
|year=1999
|publisher=Wiley
|isbn=978-0-471-16068-7
|pages=157–176
|chapter=Chapter 3.4: The Sign Test
}}</ref><ref name="Sprent1989">{{Citation
|last=Sprent
|first=P.
|title=Applied Nonparametric Statistical Methods
|edition=Second
|year=1989
|publisher=Chapman & Hall
|isbn=978-0-412-44980-2
}}</ref><ref>{{cite book |title=The History of Statistics: The Measurement of Uncertainty Before 1900 |first=Stephen M. |last=Stigler |publisher=Harvard University Press |year=1986 |isbn=978-0-67440341-3 |pages=[https://archive.org/details/historyofstatist00stig/page/225 225–226]}}</ref><ref name="Bellhouse2001">{{Citation
|last=Bellhouse
|first=P.
|title=in Statisticians of the Centuries by C.C. Heyde and E. Seneta
|year=2001
|publisher=Springer
|isbn=978-0-387-95329-8
|pages=39–42
|chapter=John Arbuthnot}}
</ref><ref name="Hald1998">{{Citation
|last=Hald
|first=Anders
|title=A History of Mathematical Statistics from 1750 to 1930
|year=1998
|publisher=Wiley
|pages=65
|chapter=Chapter 4. Chance or Design: Tests of Significance}}
</ref>

1925年,[[羅納德·愛爾默·費雪|羅納德·費雪]]在《{{tsl|en|Statistical Methods for Research Workers|研究工作者的统计方法}}》一书中提出了统计假设检验的思想,称之为“显著性检验”({{lang|en|tests of significance}})。<ref name="Cumming">{{cite book|title=Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis|publisher=Routledge|year=2011|isbn=978-0-415-87968-2|series=Multivariate Applications Series|location=East Sussex, United Kingdom|pages=21–52|chapter=From null hypothesis significance to testing effect sizes|last1=Cumming|first1=Geoff}}</ref><ref name="Fisher1925">{{cite book|title=Statistical Methods for Research Workers|publisher=Oliver and Boyd|year=1925|location=Edinburgh, UK|pages=[https://archive.org/details/statisticalmethoe7fish/page/43 43]|last1=Fisher|first1=Ronald A.|isbn=978-0-050-02170-5|url=https://archive.org/details/statisticalmethoe7fish/page/43}}</ref><ref name="Poletiek">{{cite book|title=Hypothesis-testing Behaviour|publisher=Psychology Press|year=2001|isbn=978-1-841-69159-6|edition=1st|series=Essays in Cognitive Psychology|location=East Sussex, United Kingdom|pages=29–48|chapter=Formal theories of testing|last1=Poletiek|first1=Fenna H.}}</ref>費雪建議将1/20(=0.05)的概率作为拒绝[[虛無假說]]的一个截断值。<ref name=Quinn>{{cite book |last1 = Quinn |first1 = Geoffrey R. |last2 = Keough |first2 = Michael J. 
|title = Experimental Design and Data Analysis for Biologists |edition = 1st |publisher = Cambridge University Press |location = Cambridge, UK |year = 2002 |isbn = 978-0-521-00976-8 |pages = [https://archive.org/details/experimentaldesi0000quin/page/46 46–69] |url = https://archive.org/details/experimentaldesi0000quin/page/46 }}</ref>在1933年的一篇论文中,[[耶日·内曼]]和[[埃贡·皮尔逊]]把这个截断值称为“显著性水平”,並賦予它符號{{Mvar|α}}。他们建议,{{Mvar|α}}值應當在收集任何数据收集之前提前设定。<ref name=Quinn /><ref name="Neyman">{{Cite journal|last2=Pearson|first2=E.S.|year=1933|title=The testing of statistical hypotheses in relation to probabilities a priori|journal=Mathematical Proceedings of the Cambridge Philosophical Society|volume=29|issue=4|pages=492–510|doi=10.1017/S030500410001152X|last1=Neyman|first1=J.|bibcode=1933PCPS...29..492N }}</ref>

費雪最初將显著性水平定為0.05,但他并不打算将这一截断值定死。在他1956年出版的《统计方法与科学推断》一书中,他建议根据具体情况确定显著性水平。<ref name=Quinn />

===相關概念===
显著性水平{{Mvar|α}}是{{Mvar|p}}值的阈值,當{{Math|''p'' ⩽ ''α''}}時就拒絕零假设(即使零假设仍有可能是正确的)。这意味着{{Mvar|α}}也是在零假设正确的情况下错误地将其否定的概率<ref name="Dalgaard" />,称为[[偽陽性和偽陰性|伪阳性]]或[[型一錯誤與型二錯誤|型一錯誤]]、棄真錯誤、α錯誤。

而有些研究者偏好使用[[信賴區間|置信水平]]{{math|''γ'' {{=}} (1 − ''α'')}}。它是零假设成立时不拒绝零假设的概率。<ref>"Conclusions about statistical significance are possible with the help of the confidence interval. If the confidence interval does not include the value of zero effect, it can be assumed that there is a statistically significant result." {{cite journal|title=Confidence Interval or P-Value?|journal=Deutsches Ärzteblatt Online|volume=106|issue=19|pages=335–9|doi=10.3238/arztebl.2009.0335|pmid=19547734|pmc=2689604|year=2009|last1=Prel|first1=Jean-Baptist du|last2=Hommel|first2=Gerhard|last3=Röhrig|first3=Bernd|last4=Blettner|first4=Maria}}</ref><ref>[https://www.cscu.cornell.edu/news/statnews/stnews73.pdf StatNews #73: Overlapping Confidence Intervals and Statistical Significance]</ref>置信水平和置信区间是Neyman于1937年提出的。<ref name="Neyman1937">{{cite journal|year=1937|title=Outline of a Theory of Statistical Estimation Based on the Classical Theory of Probability|jstor=91337|journal={{tsl|en|Philosophical Transactions of the Royal Society A||Philosophical Transactions of the Royal Society A}}|volume=236|issue=767|pages=333–380|doi=10.1098/rsta.1937.0005|last1=Neyman|first1=J.|bibcode=1937RSPTA.236..333N |author-link=Jerzy Neyman|doi-access=free}}</ref>


== 顯著水準 ==
== 顯著水準 ==
[[File:NormalDist1.96.png|250px|thumb|在{{tsl|en|one- and two-tailed tests|双尾检验}}中,显著性水平{{math|''α'' {{=}} 0.05}}下的拒绝域分处在{{tsl|en|sampling distribution|抽样分布}}两端的尾部,共占曲线下方面积的5%。]]
'''顯著水準'''(α,Significance level),代表在[[虛無假說]](記作<math>H_0</math>)為真,錯誤地拒絕<math>H_0</math>的機率,即[[型一錯誤與型二錯誤|型一錯誤]]發生之機率。
'''顯著水準'''({{lang|en|significance level}},符號:α)常用于[[假设检验]]中检验假设和实验结果是否一致,它代表在[[虛無假說]](記作<math>H_0</math>)為真,錯誤地拒絕<math>H_0</math>的機率,即發生[[型一錯誤與型二錯誤|型一錯誤]](棄真錯誤、α錯誤)的機率。

比如,我們從兩個母體中分別抽取了兩組樣本數據A和B,這數據在顯著水準{{Math|1=''α'' = 0.05}}下具備顯著性差異這是說兩組數據所代表的母體具備顯著性差異的可能性為95%;但它們代表的母體仍5%的可能性是沒有顯著性差異的5%是由於{{tsl|en|sampling error|抽样误}}造成的。也可表述为:

*如果拒绝两组数据一致(二者不具备显著性差异)”零假设(接受“两组数据不一致”的备择假设),此时有5%的可能性犯[[第一类错误]]
*如果A=两组数据不具备显著差异;B=实际数据具有显著差异,則{{Math|1=<nowiki>P(A|B) = 0.05</nowiki>}},即統計100次,預期是B情況,但可能出現5次的A情況。

當[[假說檢定]]所測得之數據之間具有顯著性差異,實驗的[[虛無假說]]就可被推翻,也就是拒絕<math>H_0</math>,接受[[對立假說]](alternative hypothesis,記作<math>H_1</math>或<math>H_a</math>);反之若數據之間不具備顯著性差異,則拒絕[[對立假說]],不拒絕[[虛無假說]]。通常情況下,實驗結果需要證明達到顯著水準{{Math|1=''α'' = 0.05}}或{{Math|0.01}},才可以說數據之間具備了顯著性差異,否則就如上所述,容易作出錯誤的推論。在作結論時,應確實描述方向性(例如顯著大於或顯著小於)

数学表述为:引入[[p值]]作为检验[[样本]](test statistic)观察值的最低顯著水準。在{{Math|1=''α'' = 0.01}}或{{Math|1=''α'' = 0.05}}条件下,若假设成立的[[概率]]({{Mvar|p}})小于{{Mvar|α}},则表示零假设成立情况下得到这种观测结果的概率,比1%或5%還低,在该显著性水平下,我们可拒绝该假设。

*<code>P(X=x)<α=0.05</code>为“显著(significant)”,统计分析软件[[SPSS]]中以<code>*</code>标记;
*<code>P(X=x)<α=0.01</code>为“极显著(extremely significant)”,通常以<code>**</code>标记。


== 局限性 ==
比如,我們A、B兩數據在'''顯著水準'''(α)為'''0.05'''上具備顯著性差異這是說兩組數據具備顯著性差異的可能性為95%。兩個數據所代表的樣本還<code>5%</code>的可能性是沒有差異的5%的差異是由於[[隨機誤]]造成的。
研究人员常常只关注他们的结果是否具有统计学意义,但其报告的结果可能并没有实质性<ref name="Carver">{{Cite journal | last1 = Carver| first1 = Ronald P. | title = The Case Against Statistical Significance Testing | journal = Harvard Educational Review | volume = 48| issue = 3 | pages = 378–399 | year = 1978| doi = 10.17763/haer.48.3.t490261645281841 | s2cid = 16355113 | url = https://semanticscholar.org/paper/cb9adb96be34b2652fce8c2a3e8324a0f1ce0048 }}</ref>,或者研究结果无法重现<ref name="Ioannidis">{{cite journal | last1 = Ioannidis | first1 = John P. A. | title = Why most published research findings are false | journal = PLOS Medicine | volume = 2 | issue = 8 | pages = e124 | year = 2005 | doi=10.1371/journal.pmed.0020124 | pmid=16060722 | pmc=1182327}}</ref><ref name="peerj.com">{{cite journal|last1= Amrhein|first1=Valentin|last2=Korner-Nievergelt|first2=Fränzi|last3=Roth|first3=Tobias|title=The earth is flat (p > 0.05): significance thresholds and the crisis of unreplicable research|journal=PeerJ|date=2017|volume=5|page=e3544|doi=10.7717/peerj.3544|pmid=28698825|pmc=5502092}}</ref>。统计学意义与实际意义之间也不能等同,有统计学意义的研究未必就有实际意义。<ref name="A Visitor’s Guide to Effect Sizes">{{cite journal|last1=Hojat|first1=Mohammadreza|last2=Xu|first2=Gang|title=A Visitor's Guide to Effect Sizes|journal=Advances in Health Sciences Education|volume=9|issue=3|pages=241–9|date=2004|doi=10.1023/B:AHSE.0000038173.00909.f6|pmid=15316274|s2cid=8045624}}</ref><ref name=":1">{{Cite web|url=http://www.stat.ualberta.ca/~hooper/teaching/misc/Pvalue.pdf|title=What is P-value?|last=Hooper|first=Peter|website=University of Alberta, Department of Mathematical and Statistical Sciences|access-date=November 10, 2019}}</ref>


=== 效应值 ===
*也可表述为:如果拒绝两组数据一致的假设拒绝不具备显著性差异的假设),那么就是<code>5%</code>的可能性犯[[第一类错误]]
{{Main|效应值}}
*如果A=两组数据不具备显著差异;B=实际数据具有显著差异<code>P(A|B) = 0.05</code>,即統計100次,預期是B情況,但可能5次的A情況。


效应值是衡量一项研究的实际意义。<ref name="A Visitor’s Guide to Effect Sizes"/>统计上显著的结果可能效应量很低。为了衡量结果的研究意义,研究人员最好同时给出效应值和p值。效应量量化了效应的强度,例如以标准差为单位的两个平均值之间的距离(Cohen's d)、两个变量之间的[[皮尔逊积矩相关系数|相关系数]]或[[决定系数|其平方]],以及其他度量。<ref name=Pedhazur>{{cite book | last1 = Pedhazur | first1 = Elazar J. | last2=Schmelkin|first2=Liora P. | title = Measurement, Design, and Analysis: An Integrated Approach| edition=Student|publisher = Psychology Press |location = New York, NY | year = 1991|isbn =978-0-805-81063-9 |pages=180–210}}</ref>
通常情況下,實驗結果需要證明達到顯著水準α='''0.05'''或'''0.01''',才可以說數據之間具備了顯著性差異,不然就像上述一樣做了不精確的推論。在作結論時,應確實描述方向性(例如顯著大於或顯著小於)。并通常用于[[假设检验]],检验假设和实验结果是否一致。


=== 再现性 ===
*数学表述为:引入[[p值]]作为检验[[样本]](test statistic)观察值的最低顯著水準。在<code>ρ= 0.01 or 0.05</code> 情况下,若假设情况实际算得的[[概率]]小于<code>ρ</code>,则该比假设成立情况下 95% 99% 会出现的情况更极端,在该显著性差异水平下,拒绝(reject)该假设。
{{Main|再现性}}
*<code>P(X=x)<ρ=0.05</code>为“[[显著]](significant)”,统计分析软件[[SPSS]]中以<code>*</code>标记;
*<code>P(X=x)<ρ=0.01</code>为“[[极显著]](extreme significant)”,通常以<code>**</code>标记。


统计上显著的结果未必能够轻易再现。<ref name="peerj.com"/>特别是一些有显著性差异的结果实际上是假阳性。重现结果每失败一次,都意味着研究结果实际上为假阳性的可能性增加。<ref>{{cite journal|last1=Stahel|first1=Werner|title=Statistical Issue in Reproducibility|journal=Principles, Problems, Practices, and Prospects Reproducibility: Principles, Problems, Practices, and Prospects|date=2016|pages=87–114|doi=10.1002/9781118865064.ch5|isbn=9781118864975}}</ref>
當[[假說檢定]](Hypothesis test)所測得之數據之間具有顯著性差異,實驗的[[虛無假說]]就可被推翻,也就是拒絕<math>H_0</math>,接受[[對立假說]](alternative hypothesis,記作<math>H_1</math>或<math>H_a</math>);反之若數據之間不具備顯著性差異,則拒絕[[對立假說]],不拒絕[[虛無假說]]。


== 参见 ==
== 参见 ==
* [[假檢定]]
* [[假檢定]]
* [[A/B測試]]
* {{tsl|en|Look-elsewhere effect|查看别处效应}}
* [[多重比較謬誤]]
* [[样本量确定]]
* [[德州神槍手謬誤]]


== 参考文献 ==
== 参考文献 ==
{{Reflist}}
{{Reflist|2}}


{{统计学}}
{{统计学}}

Revision as of 08:48, 19 June 2022 (Sunday)

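In hypothesis testing in statistics,[1][2] statistical significance (also called a statistically significant difference; symbol: ρ) is an assessment of differences in data: a result is said to be statistically significant when it would be unlikely to occur if the null hypothesis were true. More precisely, a study sets a value α (the significance level), which is the probability of rejecting the null hypothesis when it is in fact true,[3] and uses the p-value to denote the probability, assuming the null hypothesis is true, of obtaining a result at least as extreme as the one observed.[4] When p ≤ α, the result is regarded as statistically significant, that is, the data show a significant difference.[5][6][7][8][9][10][11] The significance level should be set before data collection begins; it is conventionally set to 5%[12] or lower, depending on the field of study.[13]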

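In any experiment or observational study that draws a sample from a population, the observed result may be due to nothing more than sampling error.[14][15] If, however, the p-value of an observed result is less than (or equal to) the significance level α, the investigator may conclude that the result reflects a characteristic of the population[1] and reject the null hypothesis.[16]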
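The decision rule described above (reject the null hypothesis when p ≤ α) can be sketched in a few lines. The following is only an illustration under stated assumptions: it uses a two-sample z-test based on the normal approximation rather than any test prescribed by the cited sources, and the function names `two_sample_z_test` and `is_significant` as well as the sample data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def two_sample_z_test(a, b):
    """Return (z, two-sided p-value) for the difference in means of a and b.

    Assumption: the normal approximation is adequate; for small samples
    a t-test would normally be preferred.
    """
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z = (mean(a) - mean(b)) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


def is_significant(p, alpha=0.05):
    """Apply the decision rule: significant when p <= alpha."""
    return p <= alpha


# Hypothetical measurements for two groups (illustrative data only).
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.4, 5.7, 5.2, 5.9, 5.5]
group_b = [4.2, 4.8, 4.5, 4.1, 4.9, 4.4, 4.6, 4.3, 4.7, 4.0]

z, p = two_sample_z_test(group_a, group_b)
print(f"z = {z:.2f}, p = {p:.2g}, significant at 0.05: {is_significant(p)}")
```

Note that α is passed in as a parameter fixed before the data are examined, in line with the recommendation that the significance level be set in advance.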

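A significant difference may arise because:

  • the data being compared come from different groups of subjects: for example, on the Binet–Simon general ability test (比-西一般能力測驗), the scores of a university-educated group and a primary-school-educated group differ significantly;
  • the experimental treatment has changed the subjects, so that pre-test and post-test data differ significantly. For example, research on mnemonics has found that subjects' scores before and after learning a memory technique differ significantly, and the difference most likely stems from the effect of that technique on the subjects' memory performance.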

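History

The idea of statistical significance dates back to the 18th century, when John Arbuthnot and Pierre-Simon Laplace took as the null hypothesis that male and female births are equally likely and computed the p-value for the human sex ratio at birth.[17][18][19][20][21][22][23]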
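A calculation of this kind amounts to what is now called a sign test. The sketch below is a hedged illustration, not a reconstruction of Arbuthnot's or Laplace's actual working: the function name `sign_test_p_value` is hypothetical, and the figure of 82 consecutive years with more male than female christenings is the number commonly reported for Arbuthnot's London data, which the text above does not itself state.

```python
from fractions import Fraction
from math import comb


def sign_test_p_value(successes, n):
    """One-sided p-value P(X >= successes) for X ~ Binomial(n, 1/2).

    Under the null hypothesis, each year is equally likely to show
    more male or more female births.
    """
    tail = sum(comb(n, k) for k in range(successes, n + 1))
    return Fraction(tail, 2 ** n)


# Assumed data: 82 out of 82 years with an excess of male births.
p = sign_test_p_value(82, 82)
print(f"p = {float(p):.3g}")
```

Using exact fractions avoids rounding error for such small probabilities; the value is converted to a float only for display.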

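In 1925, Ronald Fisher put forward the idea of statistical hypothesis testing in his book Statistical Methods for Research Workers, calling it "tests of significance".[24][25][26] Fisher suggested a probability of 1 in 20 (0.05) as a cutoff for rejecting the null hypothesis.[27] In a 1933 paper, Jerzy Neyman and Egon Pearson called this cutoff the "significance level" and gave it the symbol α. They recommended that α be fixed in advance, before any data are collected.[27][28]

Although Fisher originally suggested a significance level of 0.05, he did not intend this cutoff to be fixed. In his 1956 book Statistical Methods and Scientific Inference, he recommended that the significance level be chosen according to the specific circumstances.[27]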

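Related concepts

The significance level α is the threshold for the p-value: the null hypothesis is rejected when p ≤ α (even though the null hypothesis may still be true). This means that α is also the probability of wrongly rejecting the null hypothesis when it is true,[3] an error known as a false positive or a type I error (the error of rejecting a true hypothesis, or α error).

Some researchers instead prefer to work with the confidence level γ = (1 − α), which is the probability of not rejecting the null hypothesis when it is true.[29][30] Confidence levels and confidence intervals were introduced by Neyman in 1937.[31]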
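The link between the confidence level γ = 1 − α and the significance test can be illustrated with a small sketch. It assumes a normally distributed estimate with a known standard error (a z-interval); the function name `z_interval_and_p` and the numbers used are hypothetical. Under these assumptions, the (1 − α) confidence interval excludes zero exactly when the two-sided p-value is at most α.

```python
from statistics import NormalDist


def z_interval_and_p(estimate, std_error, alpha=0.05):
    """Return the (1 - alpha) confidence interval and the two-sided p-value
    for the null hypothesis that the true effect is zero."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    lower = estimate - z_crit * std_error
    upper = estimate + z_crit * std_error
    p = 2 * (1 - NormalDist().cdf(abs(estimate / std_error)))
    return (lower, upper), p


# Two hypothetical estimates sharing the same standard error.
for estimate in (1.0, 0.5):
    (lower, upper), p = z_interval_and_p(estimate, std_error=0.4)
    excludes_zero = lower > 0 or upper < 0
    print(f"estimate={estimate}: CI=({lower:.2f}, {upper:.2f}), "
          f"p={p:.3f}, excludes zero: {excludes_zero}")
```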

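Significance level

In a two-tailed test, the rejection region for a significance level of α = 0.05 lies in the two tails of the sampling distribution and together accounts for 5% of the area under the curve.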
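The boundaries of this rejection region can be computed directly. The sketch below assumes a standard normal sampling distribution; the helper names `two_tailed_critical_value` and `in_rejection_region` are hypothetical. Each tail holds α/2 of the area, so the critical value is the (1 − α/2) quantile.

```python
from statistics import NormalDist


def two_tailed_critical_value(alpha):
    """Critical z such that the two tails beyond +/- z hold a total area alpha."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def in_rejection_region(z, alpha=0.05):
    """True when the test statistic falls in either tail."""
    return abs(z) >= two_tailed_critical_value(alpha)


print(f"alpha = 0.05: |z| >= {two_tailed_critical_value(0.05):.2f}")
print(f"alpha = 0.01: |z| >= {two_tailed_critical_value(0.01):.2f}")
```

For α = 0.05 the critical value is approximately 1.96, and for α = 0.01 approximately 2.58.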

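The significance level (symbol: α) is commonly used in hypothesis testing to check whether a hypothesis is consistent with the experimental results. It is the probability of wrongly rejecting the null hypothesis (denoted H₀) when H₀ is true, that is, the probability of a type I error (the error of rejecting a true hypothesis, or α error).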

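For example, suppose two samples A and B are drawn from two populations, and the two samples differ significantly at the significance level α = 0.05. This does not mean that the populations differ with a probability of 95%. It means that, if the two populations did not differ, a difference at least as large as the one observed would arise from sampling error alone in at most 5% of such samples. Equivalently:

  • if the null hypothesis "the two groups are the same (there is no significant difference)" is true, a test at this level rejects it (and accepts the alternative hypothesis "the two groups differ") with a probability of 5%, thereby committing a type I error;
  • with A = "the test declares a significant difference" and B = "the populations do not actually differ", P(A|B) = 0.05: out of 100 studies in which B holds, about 5 are expected to produce outcome A.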
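The reading of α as a long-run error rate can be checked by simulation. The following sketch rests on assumptions that are not part of the article: both samples are drawn from the same standard normal population (so the null hypothesis is true by construction), the population standard deviation is treated as known, and the function name `type_i_error_rate`, the sample size, the number of trials and the seed are all arbitrary illustrative choices.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def type_i_error_rate(alpha=0.05, n=30, trials=4000, seed=0):
    """Estimate how often a two-tailed z-test rejects a true null hypothesis."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        # Both samples come from the same population: no real difference.
        a = [rng.gauss(0, 1) for _ in range(n)]
        b = [rng.gauss(0, 1) for _ in range(n)]
        # Standard error of the difference in means with known sigma = 1.
        z = (mean(a) - mean(b)) / sqrt(2 / n)
        if abs(z) >= z_crit:
            rejections += 1
    return rejections / trials


rate = type_i_error_rate()
print(f"Fraction of false positives: {rate:.3f}")
```

The estimated fraction should lie close to 0.05, varying slightly with the seed and the number of trials.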

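When the data in a hypothesis test show a significant difference, the null hypothesis is rejected: H₀ is rejected in favour of the alternative hypothesis (denoted H₁ or Hₐ). Conversely, if the data do not show a significant difference, the alternative hypothesis is not accepted and the null hypothesis is not rejected. Conventionally, a result must reach the significance level α = 0.05 or 0.01 before the data can be said to differ significantly; otherwise, as noted above, it is easy to draw a wrong inference. Conclusions should also state the direction of the difference (for example, significantly greater or significantly smaller).

Formally, the p-value is the smallest significance level at which the observed value of the test statistic leads to rejection. For α = 0.01 or α = 0.05, if the p-value is less than α, then the probability of obtaining a result at least this extreme when the null hypothesis is true is below 1% or 5%, and the null hypothesis can be rejected at that significance level.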

  • p < α = 0.05 is called "significant" and is marked with * in the statistical software SPSS;
  • p < α = 0.01 is called "extremely significant" and is usually marked with **.
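The star convention above can be written as a small helper. This is an illustrative sketch only: the function name `significance_marker` is hypothetical, and it follows the two thresholds given in the list rather than the exact output format of SPSS or any other package.

```python
def significance_marker(p):
    """Map a p-value to the conventional star marker."""
    if not 0 <= p <= 1:
        raise ValueError("a p-value must lie between 0 and 1")
    if p < 0.01:
        return "**"  # "extremely significant"
    if p < 0.05:
        return "*"   # "significant"
    return ""        # not significant at the 0.05 level


for p in (0.003, 0.03, 0.3):
    print(f"p = {p}: '{significance_marker(p)}'")
```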

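Limitations

Researchers often focus only on whether their results are statistically significant, yet the reported results may not be substantive[32] or may not be reproducible.[33][34] Nor is statistical significance the same as practical significance: a statistically significant study need not be of practical importance.[35][36]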

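Effect size

Effect size is a measure of a study's practical significance.[35] A statistically significant result may have a small effect. To convey the research significance of a result, researchers are advised to report an effect size together with the p-value. An effect size quantifies the strength of an effect, for example the distance between two means in units of standard deviation (Cohen's d), the correlation coefficient between two variables or its square, and other measures.[37]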
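Cohen's d, the first measure mentioned above, can be computed as follows. This is a minimal sketch assuming the common pooled-standard-deviation form of the statistic; the function name `cohens_d` and the example data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Difference between two means in units of the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# Illustrative data: means 4 and 3, each sample with standard deviation 2.
print(cohens_d([2, 4, 6], [1, 3, 5]))  # 0.5
```

Unlike the p-value, d does not shrink or grow with sample size alone, which is why reporting both gives a fuller picture.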

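Reproducibility

A statistically significant result may not be easy to reproduce.[34] In particular, some statistically significant results are in fact false positives. Each failed attempt to reproduce a result increases the likelihood that the result was a false positive.[38]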

See also

References

  1. ^ 1.0 1.1 Sirkin, R. Mark. Two-sample t tests. Statistics for the Social Sciences 3rd. Thousand Oaks, CA: SAGE Publications, Inc. 2005: 271–316. ISBN 978-1-412-90546-6. 
  2. ^ Borror, Connie M. Statistical decision making. The Certified Quality Engineer Handbook 3rd. Milwaukee, WI: ASQ Quality Press. 2009: 418–472. ISBN 978-0-873-89745-7. 
  3. ^ 3.0 3.1 Dalgaard, Peter. Power and the computation of sample size. Introductory Statistics with R. Statistics and Computing. New York: Springer. 2008: 155–56. ISBN 978-0-387-79053-4. doi:10.1007/978-0-387-79054-1_9. 
  4. ^ Statistical Hypothesis Testing. www.dartmouth.edu. [2019-11-11]. (Archived from the original on 2020-08-02).
  5. ^ Johnson, Valen E. Revised standards for statistical evidence. Proceedings of the National Academy of Sciences. October 9, 2013, 110 (48): 19313–19317. Bibcode:2013PNAS..11019313J. PMC 3845140. PMID 24218581. doi:10.1073/pnas.1313476110.
  6. ^ Redmond, Carol; Colton, Theodore. Clinical significance versus statistical significance. Biostatistics in Clinical Trials. Wiley Reference Series in Biostatistics 3rd. West Sussex, United Kingdom: John Wiley & Sons Ltd. 2001: 35–36. ISBN 978-0-471-82211-0. 
  7. ^ Cumming, Geoff. Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. New York, USA: Routledge. 2012: 27–28. 
  8. ^ Krzywinski, Martin; Altman, Naomi. Points of significance: Significance, P values and t-tests. Nature Methods. 30 October 2013, 10 (11): 1041–1042. PMID 24344377. doi:10.1038/nmeth.2698.
  9. ^ Sham, Pak C.; Purcell, Shaun M. Statistical power and significance testing in large-scale genetic studies. Nature Reviews Genetics. 17 April 2014, 15 (5): 335–346. PMID 24739678. S2CID 10961123. doi:10.1038/nrg3706. 
  10. ^ Altman, Douglas G. Practical Statistics for Medical Research. New York, USA: Chapman & Hall/CRC. 1999: 167. ISBN 978-0412276309.
  11. ^ Devore, Jay L. Probability and Statistics for Engineering and the Sciences 8th. Boston, MA: Cengage Learning. 2011: 300–344. ISBN 978-0-538-73352-6. 
  12. ^ Craparo, Robert M. Significance level. Salkind, Neil J. (ed.). Encyclopedia of Measurement and Statistics 3. Thousand Oaks, CA: SAGE Publications: 889–891. 2007. ISBN 978-1-412-91611-0.
  13. ^ Sproull, Natalie L. Hypothesis testing. Handbook of Research Methods: A Guide for Practitioners and Students in the Social Science 2nd. Lanham, MD: Scarecrow Press, Inc. 2002: 49–64. ISBN 978-0-810-84486-5. 
  14. ^ Babbie, Earl R. The logic of sampling. The Practice of Social Research 13th. Belmont, CA: Cengage Learning. 2013: 185–226. ISBN 978-1-133-04979-1. 
  15. ^ Faherty, Vincent. Probability and statistical significance. Compassionate Statistics: Applied Quantitative Analysis for Social Services (With exercises and instructions in SPSS) 1st. Thousand Oaks, CA: SAGE Publications, Inc. 2008: 127–138. ISBN 978-1-412-93982-9. 
  16. ^ McKillup, Steve. Probability helps you make a decision about your results. Statistics Explained: An Introductory Guide for Life Scientists 1st. Cambridge, United Kingdom: Cambridge University Press. 2006: 44–56. ISBN 978-0-521-54316-3.
  17. ^ Brian, Éric; Jaisson, Marie. Physico-Theology and Mathematics (1710–1794). The Descent of Human Sex Ratio at Birth. Springer Science & Business Media. 2007: 1–25. ISBN 978-1-4020-6036-6. 
  18. ^ John Arbuthnot. An argument for Divine Providence, taken from the constant regularity observed in the births of both sexes (PDF). Philosophical Transactions of the Royal Society of London. 1710, 27 (325–336): 186–190. doi:10.1098/rstl.1710.0011.
  19. ^ Conover, W.J., Chapter 3.4: The Sign Test, Practical Nonparametric Statistics Third, Wiley: 157–176, 1999, ISBN 978-0-471-16068-7 
  20. ^ Sprent, P., Applied Nonparametric Statistical Methods Second, Chapman & Hall, 1989, ISBN 978-0-412-44980-2 
  21. ^ Stigler, Stephen M. The History of Statistics: The Measurement of Uncertainty Before 1900. Harvard University Press. 1986: 225–226. ISBN 978-0-67440341-3. 
  22. ^ Bellhouse, P., John Arbuthnot, in Statisticians of the Centuries by C.C. Heyde and E. Seneta, Springer: 39–42, 2001, ISBN 978-0-387-95329-8 
  23. ^ Hald, Anders, Chapter 4. Chance or Design: Tests of Significance, A History of Mathematical Statistics from 1750 to 1930, Wiley: 65, 1998 
  24. ^ Cumming, Geoff. From null hypothesis significance to testing effect sizes. Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. Multivariate Applications Series. East Sussex, United Kingdom: Routledge. 2011: 21–52. ISBN 978-0-415-87968-2. 
  25. ^ Fisher, Ronald A. Statistical Methods for Research Workers. Edinburgh, UK: Oliver and Boyd. 1925: 43. ISBN 978-0-050-02170-5. 
  26. ^ Poletiek, Fenna H. Formal theories of testing. Hypothesis-testing Behaviour. Essays in Cognitive Psychology 1st. East Sussex, United Kingdom: Psychology Press. 2001: 29–48. ISBN 978-1-841-69159-6. 
  27. ^ 27.0 27.1 27.2 Quinn, Geoffrey R.; Keough, Michael J. Experimental Design and Data Analysis for Biologists 1st. Cambridge, UK: Cambridge University Press. 2002: 46–69. ISBN 978-0-521-00976-8. 
  28. ^ Neyman, J.; Pearson, E.S. The testing of statistical hypotheses in relation to probabilities a priori. Mathematical Proceedings of the Cambridge Philosophical Society. 1933, 29 (4): 492–510. Bibcode:1933PCPS...29..492N. doi:10.1017/S030500410001152X. 
  29. ^ "Conclusions about statistical significance are possible with the help of the confidence interval. If the confidence interval does not include the value of zero effect, it can be assumed that there is a statistically significant result." Prel, Jean-Baptist du; Hommel, Gerhard; Röhrig, Bernd; Blettner, Maria. Confidence Interval or P-Value?. Deutsches Ärzteblatt Online. 2009, 106 (19): 335–9. PMC 2689604. PMID 19547734. doi:10.3238/arztebl.2009.0335.
  30. ^ StatNews #73: Overlapping Confidence Intervals and Statistical Significance
  31. ^ Neyman, J. Outline of a Theory of Statistical Estimation Based on the Classical Theory of Probability. Philosophical Transactions of the Royal Society A. 1937, 236 (767): 333–380. Bibcode:1937RSPTA.236..333N. JSTOR 91337. doi:10.1098/rsta.1937.0005.
  32. ^ Carver, Ronald P. The Case Against Statistical Significance Testing. Harvard Educational Review. 1978, 48 (3): 378–399. S2CID 16355113. doi:10.17763/haer.48.3.t490261645281841. 
  33. ^ Ioannidis, John P. A. Why most published research findings are false. PLOS Medicine. 2005, 2 (8): e124. PMC 1182327. PMID 16060722. doi:10.1371/journal.pmed.0020124.
  34. ^ 34.0 34.1 Amrhein, Valentin; Korner-Nievergelt, Fränzi; Roth, Tobias. The earth is flat (p > 0.05): significance thresholds and the crisis of unreplicable research. PeerJ. 2017, 5: e3544. PMC 5502092. PMID 28698825. doi:10.7717/peerj.3544.
  35. ^ 35.0 35.1 Hojat, Mohammadreza; Xu, Gang. A Visitor's Guide to Effect Sizes. Advances in Health Sciences Education. 2004, 9 (3): 241–9. PMID 15316274. S2CID 8045624. doi:10.1023/B:AHSE.0000038173.00909.f6. 
  36. ^ Hooper, Peter. What is P-value? (PDF). University of Alberta, Department of Mathematical and Statistical Sciences. [November 10, 2019]. 
  37. ^ Pedhazur, Elazar J.; Schmelkin, Liora P. Measurement, Design, and Analysis: An Integrated Approach Student. New York, NY: Psychology Press. 1991: 180–210. ISBN 978-0-805-81063-9. 
  38. ^ Stahel, Werner. Statistical Issue in Reproducibility. Principles, Problems, Practices, and Prospects Reproducibility: Principles, Problems, Practices, and Prospects. 2016: 87–114. ISBN 9781118864975. doi:10.1002/9781118865064.ch5.