TY - GEN
T1 - A multi-strategy learning approach to competitor identification
AU - Ruan, Tong
AU - Lin, Yeli
AU - Wang, Haofen
AU - Pan, Jeff Z.
N1 - This work is funded by the National Key Technology R&D Program through project No. 2013BAH11F03
PY - 2015
Y1 - 2015
N2 - Competitor identification aims to find the competitors of an entity in a given field and is key to the success of market intelligence. Manually collecting competitors is labor-intensive and time-consuming, so automatic approaches have been proposed for this purpose. However, these approaches face two main challenges. First, competitor information may be contained not only in semi-structured sources such as lists or tables but also in free text, and this diversity of sources makes competitor identification difficult. Second, competitors do not always appear under their full names; name variants further increase the diversity and make the task more challenging. In this paper, we propose a novel unsupervised approach to identify competitors from prospectuses based on a multi-strategy learning algorithm. More precisely, we first extract competitors from lists using predefined heuristic rules. By leveraging redundancies among competitor information in lists, tables, and texts, these competitors are fed as seeds to distantly supervise the learning process, which finds table columns and text patterns containing competitors. The whole process is performed iteratively: in each iteration, newly discovered high-confidence competitors from the various sources are treated as new seeds for bootstrapping. The experimental results show the effectiveness of our approach without human intervention or external knowledge bases. Moreover, the approach significantly outperforms traditional named entity recognition approaches.
AB - Competitor identification aims to find the competitors of an entity in a given field and is key to the success of market intelligence. Manually collecting competitors is labor-intensive and time-consuming, so automatic approaches have been proposed for this purpose. However, these approaches face two main challenges. First, competitor information may be contained not only in semi-structured sources such as lists or tables but also in free text, and this diversity of sources makes competitor identification difficult. Second, competitors do not always appear under their full names; name variants further increase the diversity and make the task more challenging. In this paper, we propose a novel unsupervised approach to identify competitors from prospectuses based on a multi-strategy learning algorithm. More precisely, we first extract competitors from lists using predefined heuristic rules. By leveraging redundancies among competitor information in lists, tables, and texts, these competitors are fed as seeds to distantly supervise the learning process, which finds table columns and text patterns containing competitors. The whole process is performed iteratively: in each iteration, newly discovered high-confidence competitors from the various sources are treated as new seeds for bootstrapping. The experimental results show the effectiveness of our approach without human intervention or external knowledge bases. Moreover, the approach significantly outperforms traditional named entity recognition approaches.
KW - Competitor mining
KW - Distant supervision
KW - Unsupervised learning
KW - Wrapper induction
UR - http://www.scopus.com/inward/record.url?scp=84928902021&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-15615-6_15
DO - 10.1007/978-3-319-15615-6_15
M3 - Published conference contribution
AN - SCOPUS:84928902021
VL - 8943
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 197
EP - 212
BT - Joint International Semantic Technology Conference
A2 - Supnithi, T
A2 - Yamaguchi, T
A2 - Pan, J
A2 - Wuwongse, V
A2 - Buranarach, M
PB - Springer-Verlag
T2 - 4th Joint International Conference on Semantic Technology, JIST 2014
Y2 - 9 November 2014 through 11 November 2014
ER -