Text mining (文字探勘): Difference between revisions

Wikipedia, the free encyclopedia
Yoyofreeman (talk | contribs)
Content expansion


In Taiwan, 威知資訊 is a company that provides customer solutions based on text mining technology, and it holds Republic of China invention patent No. 153789 for related technology [http://www.webgenie.com.tw/webgenie_1-3.htm]. Products it has developed with this technology since 1999 include a search engine, automatic document classification, and external data extraction, and the company is regarded as well established in this field.

From the English Wikipedia article: [[en:Text mining]]
{{Moresources|date=2008-09-26}}

'''Text mining''', sometimes also called ''text [[data mining]]'' and roughly equivalent to ''[[text analytics]]'', refers generally to the process of deriving high-quality [[information]] from text. High-quality information is typically derived by devising patterns and trends through means such as [[pattern recognition|statistical pattern learning]]. Text mining usually involves structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a [[database]]), deriving patterns within the structured data, and finally evaluating and interpreting the output. 'High quality' in text mining usually refers to some combination of [[relevance (information retrieval)|relevance]], [[Novelty (patent)|novelty]], and interestingness. Typical text mining tasks include [[text categorization]], [[text clustering]], [[concept mining|concept/entity extraction]], production of granular taxonomies, [[sentiment analysis]], [[document summarization]], and entity relation modeling (''i.e.'', learning relations between [[Named entity recognition|named entities]]).
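
The three stages described above (structuring the input, deriving patterns, evaluating the output) can be illustrated with a minimal sketch. It assumes a toy three-document corpus, a hand-picked stop-word list, and TF-IDF weighting as the pattern-deriving step; none of these choices comes from the article, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; real systems use much larger ones.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}

def structure(text):
    """Structuring step: lowercase, tokenize, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def tf_idf(docs):
    """Pattern step: weight terms by term frequency times inverse document frequency."""
    tokenized = [structure(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        if not tokens:
            weights.append({})
            continue
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

def top_terms(weights, k=2):
    """Evaluation step: list the k highest-weighted terms of each document."""
    return [sorted(w, key=lambda t: (-w[t], t))[:k] for w in weights]

# Toy corpus, invented for illustration.
docs = [
    "The gene regulates the expression of the protein.",
    "The movie is a thrilling story of love and war.",
    "Protein expression in the cell is a story of regulation.",
]
weights = tf_idf(docs)
```

Calling `top_terms(weights)` then ranks, for each document, the terms that are frequent in it but rare elsewhere in the corpus; terms shared across documents, such as "story", receive lower weights.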

==History==
Labour-intensive manual text-mining approaches first surfaced in the mid-1980s, but technological advances have enabled the field to advance swiftly during the past decade. Text mining is an [[interdisciplinary]] field that draws on [[information retrieval]], [[data mining]], [[machine learning]], [[statistics]], and [[computational linguistics]]. As most information (reportedly over 80%) is currently stored as text, text mining is believed to have high potential commercial value.
Increasing interest is being paid to multilingual data mining: the ability to gain information across languages and cluster similar items from different linguistic sources according to their meaning.

== Sentiment analysis ==
[[Sentiment analysis]] may, for example, involve analysing movie reviews to estimate how favorable a review is toward a movie.<ref>{{Cite conference
| author = Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan
| title = Thumbs up? Sentiment Classification using Machine Learning Techniques
| booktitle = Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)
| pages = 79&ndash;86
| year = [[2002]]
| url = http://www.cs.cornell.edu/home/llee/papers/sentiment.pdf
}}</ref>
Such an analysis may require a labeled data set or labeling of the [[Affect_(psychology)|affectivity]] of words.
A resource describing the affectivity of words has been built for [[WordNet]].<ref>{{Cite journal
| author = Alessandro Valitutti, Carlo Strapparava, Oliviero Stock
| title = Developing Affective Lexical Resources
| journal = [[PsychNology Journal]]
| year = [[2005]]
| volume = 2
| issue = 1
| pages = 61&ndash;83
| url = http://www.psychnology.org/File/PSYCHNOLOGY_JOURNAL_2_1_VALITUTTI.pdf
}}</ref>
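
A word-level affect lexicon can be turned into a crude review scorer. The sketch below assumes a tiny hand-made lexicon with invented polarity values and a simple negation rule; it is neither the WordNet-based resource nor the machine-learning classifiers cited above.

```python
import re

# Hypothetical affect lexicon: word -> polarity in [-1, 1]. Values are invented.
AFFECT_LEXICON = {
    "great": 1.0, "enjoyable": 0.8, "good": 0.6,
    "boring": -0.7, "bad": -0.6, "awful": -1.0,
}
NEGATORS = {"not", "never", "no"}

def review_score(text):
    """Average polarity of lexicon words; a directly preceding negator flips the sign."""
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = []
    for i, token in enumerate(tokens):
        if token in AFFECT_LEXICON:
            polarity = AFFECT_LEXICON[token]
            if i > 0 and tokens[i - 1] in NEGATORS:
                polarity = -polarity
            scores.append(polarity)
    # Reviews with no lexicon words are treated as neutral.
    return sum(scores) / len(scores) if scores else 0.0
```

A positive score suggests a favorable review and a negative score an unfavorable one. A labeled data set, as mentioned above, would be needed to check how well such scores agree with human judgments.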

==Applications==
Recently, text mining has received attention in many areas.

===Security applications===
One of the largest text mining applications that exists is probably the classified [[ECHELON]] surveillance system. Additionally, many text mining software packages such as [[AeroText]], [[Attensity]], [[SPSS]] and [[Expert System]] are marketed towards security applications, particularly analysis of plain text sources such as Internet news.

In 2007, [[Europol]]'s Serious Crime division developed an analysis system to track transnational organized crime. This Overall Analysis System for Intelligence Support (OASIS) integrates some of the most advanced text analytics and text mining technologies available on the market at the time. The system reportedly helped Europol make significant progress in supporting law enforcement objectives at the international level.<ref>{{cite web| url=http://www.ialeia.org/awards| title="IALEIA-LEIU Annual Conference in Boston on April 9, 2008"}}</ref>

=== Biomedical applications ===
A range of text mining applications in the biomedical literature has been described.<ref>{{Cite journal
| author = K. Bretonnel Cohen & Lawrence Hunter
| title = Getting Started in Text Mining
| journal = [[PLoS Computational Biology]]
| month = January
| year = [[2008]]
| volume = 4
| issue = 1
| pages = e20
| doi = 10.1371/journal.pcbi.0040020
| url = http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.0040020
}}</ref>
One example is [[PubGene]], which combines biomedical text mining with network visualization as an Internet service.<ref>{{Cite journal
| author = Tor-Kristian Jenssen, Astrid Lægreid, Jan Komorowski & Eivind Hovig
| title = A literature network of human genes for high-throughput analysis of gene expression
| journal = [[Nature Genetics]]
| volume = 28
| pages = 21&ndash;28
| year = [[2001]]
| doi = 10.1038/ng0501-21
| pmid = 11326270
| url = http://www.nature.com/ng/journal/v28/n1/abs/ng0501_21.html
| doi_brokendate = 2008-06-20
}}
* Summary: {{Cite journal
| author = Daniel R. Masys
| title = Linking microarray data to the literature
| journal = [[Nature Genetics]]
| volume = 28
| pages = 9&ndash;10
| year = [[2001]]
| pmid = 11326264
| doi = 10.1038/ng0501-9
| doi_brokendate = 2008-06-20
}}</ref>
Another example, which combines ontologies with text mining, is [http://www.gopubmed.org GoPubMed.org].<ref>{{Cite journal
| author = Andreas Doms, Michael Schroeder
| title = GoPubMed: exploring PubMed with the Gene Ontology
| journal = [[Nucleic Acids Research]]
| volume = 33
| pages = W783–W786
| year = [[2005]]
| doi = 10.1093/nar/gki470
| pmid = 15980585
}}</ref>
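
One simple way to connect text mining with network visualization is to link two genes whenever they are mentioned in the same abstract. The sketch below assumes a hypothetical list of gene symbols and invented toy abstracts; it illustrates the general co-occurrence idea only and does not describe how PubGene or GoPubMed is actually implemented.

```python
import re
from collections import Counter
from itertools import combinations

# Assumed dictionary of gene symbols to look for (illustrative only).
GENE_SYMBOLS = {"TP53", "BRCA1", "MDM2", "EGFR"}

def cooccurrence_edges(abstracts):
    """Count, for each pair of genes, the number of abstracts mentioning both."""
    edges = Counter()
    for abstract in abstracts:
        tokens = set(re.findall(r"[A-Za-z0-9]+", abstract))
        mentioned = sorted(tokens & GENE_SYMBOLS)
        for pair in combinations(mentioned, 2):
            edges[pair] += 1
    return edges

# Toy abstracts, invented for illustration.
abstracts = [
    "MDM2 binds TP53 and promotes its degradation.",
    "Loss of TP53 and BRCA1 is frequent in these tumours.",
    "TP53 levels rise when MDM2 is inhibited.",
]
edges = cooccurrence_edges(abstracts)
```

Each key of `edges` is an edge of the literature network and its count can serve as the edge weight when the network is drawn.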

===Marketing applications===
Text mining is starting to be used in marketing as well, more specifically in analytical [[customer relationship management]]. [http://www.textmining.UGent.be Coussement and Van den Poel] (2008) apply it to improve [[predictive analytics]] models for customer churn ([[customer attrition]]).<ref>{{Cite journal
| author = Kristof Coussement and Dirk Van den Poel
| title = Integrating the Voice of Customers through Call Center Emails into a Decision Support System for Churn Prediction
| journal = Information and Management
| month = forthcoming
| year = [[2008]]
| url = http://www.textmining.ugent.be
}}</ref>

===Academic applications===
Text mining is important to publishers who hold large [[database]]s of information that require [[Index (database)|indexing]] for retrieval. This is particularly true in scientific disciplines, in which highly specific information is often contained within written text. Initiatives have therefore been taken, such as [[Nature (journal)|Nature's]] proposal for an Open Text Mining Interface (OTMI) and the [[National Institutes of Health|NIH's]] common Journal Publishing [[Document Type Definition]] (DTD), that would provide semantic cues to machines to answer specific queries contained within text without removing publisher barriers to public access.

Academic institutions have also become involved in the text mining initiative:

The [[National Centre for Text Mining]], a collaborative effort between the Universities of [[University of Manchester|Manchester]] and [[University of Liverpool|Liverpool]], provides customised tools and research facilities and offers advice to the academic community.
It is funded by the [[Joint Information Systems Committee]] (JISC) and two of the UK [[Research Council]]s.
Its initial focus was on text mining in the [[biology|biological]] and [[biomedical]] sciences, and its research has since expanded into the [[social science]]s.

In the United States, the [[UC Berkeley School of Information|School of Information]] at the [[University of California, Berkeley]] is developing a program called BioText to assist bioscience researchers in text mining and analysis.

== Software and applications==
Research and development departments of major companies, including [[IBM]] and [[Microsoft]], are researching text mining techniques and developing programs to further automate the mining and analysis processes.
Text mining software is also being researched by different companies working in the area of search and indexing in general as a way to improve their results.
Many companies provide commercial text mining software:

{{Cleanup|date=September 2008}}
* [[AeroText]] - provides a suite of text mining applications for content analysis. Content used can be in multiple languages.
* [[Autonomy Corporation|Autonomy]] - suite of text mining, clustering and categorization solutions for a variety of industries.
* [[Endeca Technologies]] - provides software to analyze and cluster unstructured text.
* [[Expert System S.p.A.]] - suite of semantic technologies and products for developers and knowledge managers.
* [[Fair Isaac]] - provider of decision management solutions powered by advanced analytics (includes text analytics).
* [[Inxight]] - provider of text analytics, search, and unstructured visualization technologies. (Inxight was bought by [[Business Objects (company)|Business Objects]], which was in turn bought by [[SAP AG]] in 2008.)
* [[Pervasive Data Integrator]] - includes Extract Schema Designer, which lets the user identify structural patterns in reports, HTML, emails, etc. by pointing and clicking, for extraction into any database.
* [[RapidMiner|RapidMiner/YALE]] - open-source data and text mining software for scientific and commercial use.
* [[SAS_System|SAS]] - solutions including SAS Text Miner and Teragram: commercial text analytics, natural language processing, and taxonomy software used for [[Information Management]]. [http://www.sas.com/technologies/analytics/datamining/textminer/]
* [[SPSS]] - provider of SPSS Text Analysis for Surveys, Text Mining for Clementine, LexiQuest Mine and LexiQuest Categorize, commercial text analytics software that can be used in conjunction with SPSS Predictive Analytics Solutions.
* [[Thomson Data Analyzer]] - Enables complex analysis on patent information, scientific publications and news.
* [[LexisNexis]] - provider of business intelligence solutions based on an extensive news and company information content set. Through its recent acquisition of Datops, LexisNexis is applying its search and retrieval expertise to the text and data mining field. [http://www.lexisnexisanalytics.com]

===Open-source software and applications===
* [[General Architecture for Text Engineering|GATE]] - natural language processing and language engineering tool.
* [[RapidMiner|YALE/RapidMiner]] with its Word Vector Tool plugin - data and text mining software.
* tm [http://cran.r-project.org/web/packages/tm/index.html] [http://www.jstatsoft.org/v25/i05] - text mining in the [[R programming language]]

==Implications==
Until recently websites most often used text-based lexical searches; in other words, users could find documents only by the words that happened to occur in the documents. Text mining may allow searches to be directly answered by the [[semantic web]]; users may be able to search for content based on its meaning and context, rather than just by a specific word.
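
The limitation of lexical search can be seen in a minimal inverted index, sketched here with assumed toy documents: a query matches only documents that contain the exact query words, so a document about an "automobile" is missed by a search for "car".

```python
import re
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(doc_id)
    return index

def lexical_search(index, query):
    """Return the ids of documents containing every word of the query."""
    words = re.findall(r"[a-z]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Toy documents, invented for illustration.
docs = {
    1: "Used car prices fell last year.",
    2: "The automobile market is shrinking.",
}
index = build_index(docs)
```

Here `lexical_search(index, "car")` finds only document 1, although document 2 is about the same topic. Meaning-based search would need some additional mapping between related terms, which this sketch deliberately leaves out.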

Additionally, text mining software can be used to build large dossiers of information about specific people and events. For example, by using software that extracts specific facts about businesses and individuals from news reports, large datasets can be built to facilitate [[social network analysis]] or [[counter-intelligence]]. In effect, the text mining software may act in a capacity similar to an [[intelligence analyst]] or [[research librarian]], albeit with a more limited scope of analysis.

Text mining is also used in some email [[spam filter]]s as a way of determining the characteristics of messages that are likely to be advertisements or other unwanted material.
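
One common way to learn such characteristics is a naive Bayes classifier over word counts. The sketch below is an assumed illustration with invented training messages and add-one smoothing; the article does not say which method any particular spam filter uses, and the class name is hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Split a message into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayesFilter:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = Counter()

    def train(self, text, label):
        self.word_counts[label].update(tokenize(text))
        self.message_counts[label] += 1

    def log_score(self, text, label):
        # Assumes at least one training message was seen for each label.
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_words = sum(self.word_counts[label].values())
        total_messages = sum(self.message_counts.values())
        # Log prior plus the smoothed log likelihood of each token.
        score = math.log(self.message_counts[label] / total_messages)
        for word in tokenize(text):
            count = self.word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        return score

    def classify(self, text):
        return max(("spam", "ham"), key=lambda label: self.log_score(text, label))

# Invented training messages.
spam_filter = NaiveBayesFilter()
spam_filter.train("Buy cheap pills now", "spam")
spam_filter.train("Cheap loans, act now", "spam")
spam_filter.train("Meeting agenda for Monday", "ham")
spam_filter.train("Lunch on Monday?", "ham")
```

With this toy training set, words such as "cheap" and "now" push a message toward the spam label, while "Monday" and "agenda" push it toward the legitimate ("ham") label.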

==Notes==
{{Reflist}}

==References==
* Ronen Feldman and James Sanger, ''The Text Mining Handbook'', Cambridge University Press, ISBN 9780521836579
* Anne Kao and Steve R. Poteet (editors), ''Natural Language Processing and Text Mining'', Springer, ISBN 184628175X
* Manu Konchady, ''Text Mining Application Programming'' (Programming Series), Charles River Media, ISBN 1584504609
* M. Ikonomakis, S. Kotsiantis, V. Tampakas, "Text Classification Using Machine Learning Techniques", ''WSEAS Transactions on Computers'', Issue 8, Volume 4, August 2005, pp. 966-974 [http://www.math.upatras.gr/~esdlab/en/members/kotsiantis/Text%20Classification%20final%20journal.pdf]

==See also==
*[[Approximate nonnegative matrix factorization]], an algorithm used for text mining
*[[BioCreative]] text mining evaluation in biomedical literature
*[[Business intelligence]]
*[[Computational linguistics]]
*[[Concept Mining]]
*[[Data mining]]
*[[Information retrieval]]
*[[Name resolution]]
*[[Natural language processing]]
*[[Stop words]]
*[[Text analytics]]
*[[Text classification]], sometimes considered a subtask of text mining
*[[UIMA]], the Unstructured Information Management Architecture from IBM
*[[Web mining]], a task that may involve text mining (e.g. first find appropriate web pages by classifying crawled web pages, then extract the desired information from the text content of these pages considered relevant).
*[[w-shingling]]

== External links==
* [http://www.itl.nist.gov/iaui/894.02/related_projects/muc/ MUC]
* [http://projects.ldc.upenn.edu/ace/ ACE (LDC)]
* [http://www.itl.nist.gov/iad/894.01/tests/ace/ ACE (NIST)]
* [http://www.arts-humanities.net/text_mining Text mining discussion group]
* [http://portal.tapor.ca Text Analysis Portal for Research (TAPoR)]
* [http://textanalytics.wikidot.com/ Text Analytics Wiki]
* [http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.0040020;jsessionid=7C485EC9A7B5B0B48AB12894E268DB7A Getting started in text mining]
* [http://erabaki.ehu.es/jjga/pimiento Pimiento], a text mining application framework written in Java


[[Category:Artificial intelligence applications]]
[[Category:Data mining]]
[[Category:Computational linguistics]]



[[de:Textmining]]

Revision as of 07:36, 26 September 2008 (Friday)

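Text mining (文字探勘), also known as 文字採礦, intelligent text analysis, text data mining, or knowledge discovery in text, generally refers to extracting useful and important information and knowledge from unstructured text. Text mining is a young field that draws on information retrieval, data mining, machine learning, statistics, and computational linguistics. Because most information (over 80%) is stored as text, text mining is considered to have high potential commercial value.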
