Focused Crawling with Scalable Ordinal Regression Solvers

Abstract

In this paper we propose a novel, scalable, clustering-based Ordinal Regression formulation, which is an instance of a Second Order Cone Program (SOCP) with one Second Order Cone (SOC) constraint. The main contribution of the paper is a fast algorithm, CB-OR, which solves the proposed formulation more efficiently than general-purpose solvers. A second contribution is to pose the problem of focused crawling as a large-scale Ordinal Regression problem and solve it using the proposed CB-OR. Focused crawling is an efficient mechanism for discovering resources of interest on the web. Posing focused crawling as an Ordinal Regression problem avoids the need for a negative class and a topic hierarchy, which are the main drawbacks of existing focused crawling methods. Experiments on large synthetic and benchmark datasets show the scalability of CB-OR. Experiments also show that the proposed focused crawler outperforms the state-of-the-art.
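To make the crawling-as-ranking idea concrete, the following is a minimal sketch of how a threshold-style ordinal regression model could order a crawl frontier. It is not CB-OR and does not reproduce the paper's SOCP formulation: the linear scorer, the weight vector `w`, the sorted `thresholds`, the feature vectors, and the function names `predict_rank` and `crawl_order` are all hypothetical and chosen only for illustration.

```python
import heapq
from bisect import bisect_right


def predict_rank(w, thresholds, x):
    """Map a feature vector to an ordinal rank in 1..len(thresholds)+1.

    Assumes a linear scorer and ascending thresholds (illustrative only).
    The rank is one plus the number of thresholds the score reaches.
    """
    score = sum(wi * xi for wi, xi in zip(w, x))
    return bisect_right(thresholds, score) + 1


def crawl_order(pages, w, thresholds):
    """Return URLs ordered from highest predicted rank to lowest.

    `pages` maps a URL to its (hypothetical) feature vector. heapq is a
    min-heap, so ranks are negated to pop the highest rank first.
    """
    heap = [(-predict_rank(w, thresholds, feats), url)
            for url, feats in pages.items()]
    heapq.heapify(heap)
    order = []
    while heap:
        _, url = heapq.heappop(heap)
        order.append(url)
    return order


if __name__ == "__main__":
    w = [1.0, 0.5]            # assumed learned weights
    thresholds = [0.0, 1.0]   # assumed learned rank boundaries (3 ranks)
    pages = {"a": [-1.0, 0.0], "b": [2.0, 0.0], "c": [0.5, 0.0]}
    print(crawl_order(pages, w, thresholds))
```

Because only ordered relevance levels are needed, a scheme like this requires neither an explicit negative class nor a topic hierarchy, which is the property the abstract highlights.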

Cite

Text

Babaria et al. "Focused Crawling with Scalable Ordinal Regression Solvers." International Conference on Machine Learning, 2007. doi:10.1145/1273496.1273504

Markdown

[Babaria et al. "Focused Crawling with Scalable Ordinal Regression Solvers." International Conference on Machine Learning, 2007.](https://mlanthology.org/icml/2007/babaria2007icml-focused/) doi:10.1145/1273496.1273504

BibTeX

@inproceedings{babaria2007icml-focused,
  title     = {{Focused Crawling with Scalable Ordinal Regression Solvers}},
  author    = {Babaria, Rashmin and Nath, J. Saketha and Krishnan, S. and Sivaramakrishnan, K. R. and Bhattacharyya, Chiranjib and Murty, M. Narasimha},
  booktitle = {International Conference on Machine Learning},
  year      = {2007},
  pages     = {57--64},
  doi       = {10.1145/1273496.1273504},
  url       = {https://mlanthology.org/icml/2007/babaria2007icml-focused/}
}