Overall Plan

Get my class information to automatically optimize and select my uni class timetable

Overall Algorithm

  1. Log on to the website using its Enterprise Sign On Engine login
  2. Find my current semester and its related subjects (set up in advance)
  3. Navigate to the right page and get the data for each related subject (lecture, practical and workshop times)
  4. Strip the data of useless information
  5. Rank the classes that are closer to each other higher, and the ones on random days lower
  6. Solve for the best timetable
  7. Output a detailed list of the BEST CASE information
  8. Output a detailed list of the possible class information (some classes might be full, for example)
  9. Get the program to select the best classes automatically
  10. Keep checking to see if we can achieve the best case from step 7

Step 6 in detail: get all the classes, using the lectures as the focal point (highest ranked, and only one per subject), and try to arrange the other classes around them.
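Steps 1-4 might be sketched with the standard library alone. Everything below (the URLs and the form-field names) is a placeholder; you would need to inspect your university's actual login form with your browser's dev tools to find the real ones:

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Placeholder URLs and field names; replace with the real ones from
# your university's login form.
LOGIN_URL = "https://example.edu/sso/login"
TIMETABLE_URL = "https://example.edu/timetable/semester"

def fetch_timetable_html(username, password):
    """Log in (step 1) and fetch the raw timetable page (steps 2-3)."""
    # A cookie-aware opener keeps the session cookie between requests.
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    form = urllib.parse.urlencode({"username": username, "password": password})
    opener.open(LOGIN_URL, data=form.encode("utf-8"))  # POST the login form
    return opener.open(TIMETABLE_URL).read().decode("utf-8")
```

The returned HTML would then go to a parser for step 4 (stripping the useless information).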

Questions

Can anyone supply me with links to something similar to this, hopefully written in Python? Regarding step 6: what data structure would you recommend for storing this information? A linked list, where each node is a uniclass object? Should I write all the information to a text file?

I am thinking of setting up uniclass with the following attributes:

  • Subject
  • Rank
  • Time
  • Type
  • Teacher
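
One way to model those attributes is a plain class (a dataclass in modern Python), with an ordinary Python list as the container; the field values below are made up for illustration:

```python
from dataclasses import dataclass

@dataclass
class UniClass:
    subject: str   # e.g. "COMP1001"
    rank: int      # higher = more preferred slot
    time: str      # e.g. "Mon 10:00-11:00"; a (day, start, end) tuple works too
    type: str      # "lecture", "practical" or "workshop"
    teacher: str

# A plain list is enough as the container:
timetable = [
    UniClass("COMP1001", 5, "Mon 10:00-11:00", "lecture", "Dr. Smith"),
    UniClass("COMP1001", 3, "Tue 14:00-15:00", "practical", "Ms. Jones"),
]
```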

I am hardly experienced in Python and thought this would be a good learning project to try to accomplish. Thanks for any help and links provided to get me started; I'm open to edits to tag this appropriately or whatever else is necessary (not sure what this falls under other than programming and Python).

EDIT: can't really get the proper formatting I want for this SO post ><

A: 

BeautifulSoup was mentioned here a few times, e.g. get-list-of-xml-attribute-values-in-python.

Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful:

  1. Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.
  2. Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.
  3. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding.

Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. You can tell it "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose URLs match 'foo.com'", or "Find the table heading that's got bold text, then give me that text."

Valuable data that was once locked up in poorly-designed websites is now within your reach. Projects that would have taken hours take only minutes with Beautiful Soup.
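
A minimal sketch of that in code, using made-up HTML in place of a real timetable page (requires the beautifulsoup4 package):

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Stand-in HTML; a real timetable page would be fetched after logging in.
html = """
<table>
  <tr><th>Type</th><th>Day</th><th>Time</th></tr>
  <tr><td>Lecture</td><td>Mon</td><td>10:00</td></tr>
  <tr><td>Practical</td><td>Wed</td><td>14:00</td></tr>
</table>
<a class="externalLink" href="http://foo.com/x">foo</a>
"""

soup = BeautifulSoup(html, "html.parser")

# "Find all the links of class externalLink":
links = soup.find_all("a", class_="externalLink")

# Pull the cell text out of each data row, skipping the header row:
rows = [[td.get_text() for td in tr.find_all("td")]
        for tr in soup.find_all("tr")[1:]]
```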

gimel
+2  A: 

Depending on how far you plan on taking #6, and how big the dataset is, it may be non-trivial; it certainly smacks of NP-hard global optimisation to me...

Still, if you're talking about tens (rather than hundreds) of nodes, a fairly dumb algorithm should give good enough performance.

So, you have two constraints:

  1. A total ordering on the classes by score; this is flexible.
  2. Class clashes; this is not flexible.

What I mean by flexible is that you can go to more spaced out classes (with lower scores), but you cannot be in two classes at once. Interestingly, there's likely to be a positive correlation between score and clashes; higher scoring classes are more likely to clash.

My first pass at an algorithm:

# Greedy pass: take the highest-scoring classes first and keep each one
# that doesn't clash with those already selected.
selected_classes = []
classes = sorted(classes, key=lambda c: c.score, reverse=True)
for clas in classes:
    if not clas.clashes_with(selected_classes):
        selected_classes.append(clas)

Working out clashes might be awkward if classes are of uneven lengths, start at strange times and so on. Mapping start and end times into a simplified representation of "blocks" of time (every 15 minutes / 30 minutes or whatever you need) would make it easier to look for overlaps between the start and end of different classes.
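
A minimal sketch of that block-mapping idea: represent each class as the set of (day, block-index) pairs it occupies, so a clash check is just a set intersection:

```python
BLOCK = 30  # minutes per block; shrink to 15 if classes start on the quarter-hour

def blocks(day, start_min, end_min):
    """Map a class on `day` running [start_min, end_min) to a set of block ids."""
    first = start_min // BLOCK
    last = (end_min - 1) // BLOCK
    return {(day, b) for b in range(first, last + 1)}

def clashes(a, b):
    """Two classes clash iff they occupy any common block."""
    return bool(a & b)

# Mon 10:00-11:00 vs Mon 10:30-11:30 overlap; classes on different days never do.
lecture = blocks("Mon", 10 * 60, 11 * 60)
prac = blocks("Mon", 10 * 60 + 30, 11 * 60 + 30)
```

Uneven lengths and odd start times then fall out for free, since each class just covers more or fewer blocks.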

Alabaster Codify
A: 

There are waaay too many questions here.

Please break this down into subject areas and ask specific questions about each one, focusing on one at a time. Please define your terms: "best" doesn't mean anything without some specific measurement to optimize.

Here's what I think I see in your list of topics.

  1. Scraping HTML

    1 Logon to the website using its Enterprise Sign On Engine login

    2 Find my current semester and its related subjects (pre setup)

    3 Navigate to the right page and get the data from each related subject (lecture, practical and workshop times)

    4 Strip the data of useless information

  2. Some algorithm to "rank" based on "closer to each other" looking for a "best time". Since these terms are undefined, it's nearly impossible to provide any help on this.

    5 Rank the classes which are closer to each other higher, the ones on random days lower

    6 Solve a best time table solution

  3. Output something.

    7 Output me a detailed list of the BEST CASE information

    8 Output me a detailed list of the possible class information (some might be full for example)

  4. Optimize something, looking for "best". Another undefinable term.

    9 Get the program to select the best classes automatically

    10 Keep checking to see if we can achieve 7.

BTW, Python has "lists". Whether or not they're "linked" doesn't really enter into it.
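
For example, ordinary list operations (append, sort by an attribute, filter) cover everything this timetable project needs; the dicts below are stand-ins for whatever class objects you end up with:

```python
classes = [
    {"subject": "COMP1001", "rank": 5, "type": "lecture"},
    {"subject": "MATH1002", "rank": 3, "type": "practical"},
]
classes.append({"subject": "COMP1001", "rank": 4, "type": "workshop"})

# Sort highest-ranked first; no pointer juggling required.
classes.sort(key=lambda c: c["rank"], reverse=True)
lectures = [c for c in classes if c["type"] == "lecture"]
```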

S.Lott