I already have a C# crawler built on .NET technologies (a very old IE engine, a MySQL connector, and a lot of tools). The crawler stores and loads data from a MySQL database, where I have the following key tables (I have a lot of tables, but these are the ones that matter for this question):
Site, SiteSearch, Data
Site: Holds the HOME_URL and a lot of other useful data about each site.
SiteSearch: Each row in this table is a different search for a site. A site can have as many search types as needed; a search type applies to only one site.
Data: The mined data is stored here. Each record has a lot of useful mined fields, and the pictures belonging to it are stored on the main server. MySQL runs on the database server, and the crawler runs on many local servers. A rough sketch of these tables is shown below.
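For clarity, this is roughly how I picture the three tables as C# model classes. Only HOME_URL comes from the description above; the other field names are placeholders, and the real tables have many more columns.

    using System.Collections.Generic;

    // Rough sketch of the three key tables; names other than HOME_URL are placeholders.
    public class Site
    {
        public int Id { get; set; }
        public string HomeUrl { get; set; }            // HOME_URL
        public List<SiteSearch> Searches { get; set; } // a site can have many search types
    }

    public class SiteSearch
    {
        public int Id { get; set; }
        public int SiteId { get; set; }                // each search type belongs to exactly one site
        public string SearchSettings { get; set; }     // placeholder for the download/search settings
    }

    public class Data
    {
        public int Id { get; set; }
        public int SiteSearchId { get; set; }
        public Dictionary<string, string> MinedFields { get; set; } // placeholder for the mined fields
        public List<string> PicturePaths { get; set; }              // pictures are stored on the main server
    }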
Parts of the project:
Config: Here the user configures the Site and SiteSearch settings. (It's a CRUD application that lets the user define download settings.)
Crawler: Manages SiteSearch starts
SiteSearch start: Runs one SiteSearch: it starts the old IE engine and downloads data from a Site using a given search
FTP: Uploads the pictures to the main server and deletes deprecated pictures from the local server
SERVER: Deletes deprecated pictures from the main server
Main problems with the current version:
1.) SiteSearch inserts a row whenever a Data record is found, which slows the database down (a sketch of this pattern follows this list)
2.) The IE engine is very old, and I would like to use something better, such as a Chrome or Firefox engine
3.) The system is Windows-based, and I would like a platform-independent system
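To illustrate problem 1: this is roughly the current pattern, with one INSERT (and one round trip to the remote MySQL server) per found record. The FoundData type, table name, and column names are placeholders for illustration.

    using System.Collections.Generic;
    using MySql.Data.MySqlClient;   // MySQL Connector/NET

    // Placeholder for a record found during a search; the real fields are the mined data.
    class FoundData { public int SiteSearchId; public string MinedField1; }

    static class CurrentPattern
    {
        // Every found record triggers its own INSERT, i.e. its own round trip
        // to the database server, which is what makes the database slow under load.
        public static void StoreFoundData(MySqlConnection connection, IEnumerable<FoundData> foundItems)
        {
            foreach (var item in foundItems)
            {
                using (var cmd = new MySqlCommand(
                    "INSERT INTO Data (SiteSearchId, MinedField1) VALUES (@searchId, @field1)",
                    connection))
                {
                    cmd.Parameters.AddWithValue("@searchId", item.SiteSearchId);
                    cmd.Parameters.AddWithValue("@field1", item.MinedField1);
                    cmd.ExecuteNonQuery(); // one round trip per found record
                }
            }
        }
    }

The planned XML feed (described below) would avoid this, since each search collects its results locally and sends one file to the server when it completes.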
Planned crawler:
1.) Config: This will work the same way as the old Config, but in a platform-independent way
2.) Crawler: This will work the same way as the old Crawler, but in a platform-independent way; it will also delete deprecated pictures from the local server.
3.) SiteSearch start: I would like a better, more modern engine than the old built-in IE engine, and I would like to use tabs instead of windows in it. I also want the SiteSearch to send an XML file to the server whenever a search completes, containing all the data and pictures (a sketch of the producer side follows this list).
4.) FTP: This won't be needed, because the SiteSearch will provide the data feed to the main server.
5.) SERVER: This will delete deprecated pictures from the main server and will also do all the scheduled database work based on the received XML files.
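As an illustration of point 3, this is roughly how I imagine the feed being produced on the crawler side. The SearchResult/DataRecord shapes and field names are made up, and System.Xml.Serialization is just one possible way to write the file.

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Xml.Serialization;

    // Hypothetical shape of the feed sent to the server after a completed search.
    // Element and field names are placeholders; pictures could be embedded or
    // referenced by file name, depending on size.
    public class SearchResult
    {
        public int SiteSearchId { get; set; }
        public DateTime CompletedAt { get; set; }
        public List<DataRecord> Records { get; set; } = new List<DataRecord>();
    }

    public class DataRecord
    {
        public string MinedField1 { get; set; }
        public List<string> PictureFiles { get; set; } = new List<string>();
    }

    public static class FeedWriter
    {
        // Serializes one completed search into an XML file that the SERVER part
        // can later pick up and turn into scheduled database work.
        public static void WriteFeed(SearchResult result, string path)
        {
            var serializer = new XmlSerializer(typeof(SearchResult));
            using (var stream = File.Create(path))
            {
                serializer.Serialize(stream, result);
            }
        }
    }

The SERVER part could deserialize the same classes and do its database work in bulk, one file per completed search, instead of one insert per found record.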
My questions are:
1.) What language should I use?
2.) What engine should I use?
3.) Are there some technologies which will help me rebuild my crawler?
4.) What is the total cost of these technologies?
Thank you for the answers.