Ordindeks webcrawler

PHP webcrawler with N-Gram-Based Text Categorization


Project Details

The project consists of a web crawler and an associated web application that presents the collected data.

The PHP web crawler runs 7 threads and saves only Danish-language data. The language filtering is based on N-Gram-Based Text Categorization, supported by additional custom code. It runs on a remote Linux server hosted by DigitalOcean.
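The language check can be illustrated with a minimal sketch of the out-of-place measure described by Cavnar and Trenkle in "N-Gram-Based Text Categorization". This is not the project's actual code: the function names (ngramProfile, outOfPlaceDistance, detectLanguage), the profile size of 300, the n-gram lengths 1 to 3, and the tiny training sentences are all assumptions made for illustration. A real profile would be trained on far more text, and the input is assumed to be valid UTF-8.

```php
<?php
declare(strict_types=1);

/**
 * Build a ranked character n-gram profile for a text.
 * Returns a map from n-gram to rank (0 = most frequent).
 * Hypothetical helper; assumes valid UTF-8 input.
 */
function ngramProfile(string $text, int $maxN = 3, int $topK = 300): array
{
    // Lowercase (multibyte-aware when mbstring is available).
    $text = function_exists('mb_strtolower')
        ? mb_strtolower($text, 'UTF-8')
        : strtolower($text);

    // Split on anything that is not a letter, so æ, ø and å are kept.
    $words = preg_split('/[^\p{L}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    $counts = [];
    foreach ($words as $word) {
        // Pad with "_" so word boundaries become part of the n-grams.
        $chars = preg_split('//u', '_' . $word . '_', -1, PREG_SPLIT_NO_EMPTY);
        $len = count($chars);
        for ($n = 1; $n <= $maxN; $n++) {
            for ($i = 0; $i + $n <= $len; $i++) {
                $gram = implode('', array_slice($chars, $i, $n));
                $counts[$gram] = ($counts[$gram] ?? 0) + 1;
            }
        }
    }

    arsort($counts); // most frequent n-grams first
    return array_flip(array_slice(array_keys($counts), 0, $topK));
}

/**
 * Out-of-place distance: sum of rank differences between two profiles.
 * N-grams missing from the language profile get a fixed maximum penalty.
 */
function outOfPlaceDistance(array $docProfile, array $langProfile): int
{
    $maxPenalty = count($langProfile);
    $distance = 0;
    foreach ($docProfile as $gram => $rank) {
        $distance += isset($langProfile[$gram])
            ? abs($rank - $langProfile[$gram])
            : $maxPenalty;
    }
    return $distance;
}

/** Return the key of the language profile closest to the text. */
function detectLanguage(string $text, array $langProfiles): string
{
    $docProfile = ngramProfile($text);
    $best = '';
    $bestDistance = PHP_INT_MAX;
    foreach ($langProfiles as $lang => $profile) {
        $distance = outOfPlaceDistance($docProfile, $profile);
        if ($distance < $bestDistance) {
            $bestDistance = $distance;
            $best = (string) $lang;
        }
    }
    return $best;
}

// Usage with toy training data (assumed; far too small for real use).
$profiles = [
    'da' => ngramProfile('Det er en god dag, og vi skal ud at gå en tur i skoven. Jeg håber, at vejret bliver godt, så børnene kan lege udenfor.'),
    'en' => ngramProfile('It is a good day and we are going for a walk in the forest. I hope the weather stays nice so the children can play outside.'),
];
echo detectLanguage('Vi skal købe æbler og brød i dag', $profiles), PHP_EOL;
```

A crawler built this way would store a page only when the closest profile is the Danish one. In practice a distance threshold is also useful, so that pages in languages without a profile are not forced into the nearest match.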

Because the crawler is scoped to Danish websites, I spent a good amount of time working out how to avoid overloading the server. This was done mostly with regular expressions that filter the frontier (the queue of URLs to parse).
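The project's actual expressions are not shown here, so the following is only a sketch of how regular expressions might keep the frontier small. The patterns, the .dk-only rule, the list of skipped file extensions, and the function names (shouldEnqueue, filterFrontier) are all assumptions for illustration.

```php
<?php
declare(strict_types=1);

// Hypothetical patterns, not the ones used in the original project.
const ALLOWED_HOST = '/\.dk$/i'; // only hosts under the .dk top-level domain
const SKIPPED_PATH = '/\.(jpe?g|png|gif|svg|css|js|pdf|zip|mp3|mp4)$/i'; // non-HTML resources

/** Decide whether a URL is worth adding to the frontier. */
function shouldEnqueue(string $url): bool
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false; // malformed URL, or no host (e.g. mailto: links)
    }
    if (!in_array(strtolower($parts['scheme']), ['http', 'https'], true)) {
        return false;
    }
    if (preg_match(ALLOWED_HOST, $parts['host']) !== 1) {
        return false;
    }
    return preg_match(SKIPPED_PATH, $parts['path'] ?? '/') !== 1;
}

/** Drop duplicates and unwanted URLs, preserving the original order. */
function filterFrontier(array $urls): array
{
    $seen = [];
    $frontier = [];
    foreach ($urls as $url) {
        if (isset($seen[$url]) || !shouldEnqueue($url)) {
            continue;
        }
        $seen[$url] = true;
        $frontier[] = $url;
    }
    return $frontier;
}

// Usage: only the first URL survives the filter.
print_r(filterFrontier([
    'https://example.dk/om-os',
    'https://example.dk/logo.png',
    'https://example.com/',
    'https://example.dk/om-os',
]));
```

Rejecting URLs before they enter the queue means the server never spends a request or a parse on pages that would be discarded anyway. Note that a top-level-domain rule is a coarse filter: Danish pages hosted outside .dk would be missed.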

Because web crawling is something of a grey zone, it was hard to find direct information about how to build a web crawler and read information from the internet. I therefore spent a lot of time researching what Google tells developers about making websites readable for its web crawlers. I managed to build the web crawler from the information collected from Google, and along the way I gained a lot of experience with SEO.

See an early stage of the crawler running in Windows CMD