Directed Web Crawling Using Machine Learning

Directed web crawling takes a specific expression of a topic and directs a web spider to locate documents relevant to that topic. Spiders may be directed by the content of the documents they examine, by the underlying link topology, or by meta-information about a document or URL. However, current content-based methods are weak. Researchers typically use cosine-based vector models to evaluate content, which requires choosing a similarity threshold to decide whether a document is relevant. Typically the same threshold is used for all topics rather than being varied per topic, and a good threshold value is difficult to determine.
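To make the threshold problem concrete, the following is a minimal sketch of a cosine-based relevance check. It is an illustration only, not the method of any particular crawler: the function names, the whitespace tokenizer, and the default threshold of 0.3 are all assumptions chosen for the example.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a bag-of-words term-frequency vector (assumed tokenizer:
    lowercase, split on whitespace)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_relevant(topic, document, threshold=0.3):
    """Hypothetical relevance test: one fixed threshold for every topic.
    The value 0.3 is arbitrary, which is exactly the weakness described."""
    score = cosine_similarity(term_vector(topic), term_vector(document))
    return score >= threshold
```

Because the threshold is a single hand-picked constant, a value that works for a narrow topic with distinctive vocabulary may admit too many or too few documents for a broad one.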

The Johns Hopkins University Applied Physics Lab has developed an approach that improves content-based methods and is also compatible with other criteria, such as link-based techniques. It will be used primarily to locate textual resources, although it may also apply to other forms of electronic information. The approach depends on two techniques: one for characterizing the content of documents, and one that uses machine learning methods to classify documents.
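The description does not say which content representation or learning method is used. As a rough sketch of the general idea only, the example below characterizes documents as bags of words and trains a small Naive Bayes classifier on labeled examples, so the relevance decision is learned per topic instead of coming from a fixed threshold. The class name, the tokenizer, and the choice of Naive Bayes are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Assumed content characterization: lowercase whitespace tokens."""
    return text.lower().split()


class NaiveBayesRelevance:
    """Hypothetical topic classifier: multinomial Naive Bayes with
    add-one smoothing. A stand-in for whatever learner is actually used."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}
        self.vocab = set()

    def train(self, text, relevant):
        """Add one labeled example (relevant=True or False)."""
        tokens = tokenize(text)
        self.word_counts[relevant].update(tokens)
        self.doc_counts[relevant] += 1
        self.vocab.update(tokens)

    def _log_score(self, tokens, label):
        # Smoothed class prior.
        total_docs = sum(self.doc_counts.values())
        score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
        # Smoothed word likelihoods; the extra +1 reserves mass for
        # words never seen in training.
        total_words = sum(self.word_counts[label].values())
        denom = total_words + len(self.vocab) + 1
        for tok in tokens:
            score += math.log((self.word_counts[label][tok] + 1) / denom)
        return score

    def is_relevant(self, text):
        """Return True if the relevant class scores higher."""
        tokens = tokenize(text)
        return self._log_score(tokens, True) > self._log_score(tokens, False)
```

A crawler could call `is_relevant` on each fetched page and follow outgoing links only from pages classified as relevant; a link-based score could be combined with this decision, consistent with the stated compatibility with link-based techniques.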

Type of Offer: Licensing
