Problem: to find exercises and their answers for Mathematics lectures at the University of Helsinki
Practical problems
- to make a list of .com sites that have a Disallow rule in robots.txt
- to make a list of the sites from (1) that contain *.pdf files
- to make a list of the sites from (2) whose PDF files contain the word "analyysi"
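To make step (1) concrete, here is a minimal sketch of the Disallow check. It assumes the robots.txt text has already been downloaded, and the names `disallowed_paths`, `has_disallow` and `sample` are made up for illustration. It looks only at the Disallow lines and ignores which User-agent they belong to.

```python
def disallowed_paths(robots_txt):
    """Return the non-empty Disallow paths found in a robots.txt body."""
    paths = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow":
            value = value.strip()
            if value:  # an empty Disallow value blocks nothing
                paths.append(value)
    return paths


def has_disallow(robots_txt):
    """True if the robots.txt body blocks at least one path."""
    return bool(disallowed_paths(robots_txt))


# Hypothetical robots.txt content, used only to show the call.
sample = """\
User-agent: *
Disallow: /private/   # keep crawlers out
Disallow:
Allow: /public/
"""
print(disallowed_paths(sample))
print(has_disallow(sample))
```

A site would go on list (1) when `has_disallow` returns True for its robots.txt.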
Suggestions for practical problems
- Problem 3: to write a program that scrapes data from PDF files
Questions
- How can you search for registered .com sites?
- How would you solve practical problems 1 & 2 with Python's defaultdict and BeautifulSoup?
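As a starting point for the second question, here is a rough sketch of how a defaultdict could group PDF links by site. It assumes the HTML pages are already fetched, and it uses the standard library's `html.parser` as a stand-in for BeautifulSoup so that it runs without extra packages; BeautifulSoup's `find_all("a")` could presumably replace the small parser class. The names `LinkCollector`, `pdf_links_by_site` and `pages`, and the example.com URLs, are made up for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def pdf_links_by_site(pages):
    """Map the host of each page to the set of PDF URLs it links to.

    `pages` is an iterable of (page_url, html_text) pairs.
    """
    pdfs = defaultdict(set)  # missing hosts start with an empty set
    for page_url, html in pages:
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.hrefs:
            absolute = urljoin(page_url, href)  # resolve relative links
            if urlparse(absolute).path.lower().endswith(".pdf"):
                pdfs[urlparse(page_url).netloc].add(absolute)
    return pdfs


# Hypothetical pages, used only to show the call.
pages = [
    ("http://example.com/courses/",
     '<a href="analyysi1.pdf">Analyysi I</a> <a href="/index.html">Home</a>'),
    ("http://example.com/old/",
     '<a href="http://example.com/courses/analyysi1.pdf">Analyysi I</a>'),
]
print(dict(pdf_links_by_site(pages)))
```

Using a set as the default value means that the same PDF linked from several pages of one site is counted only once.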