I'm having trouble getting my bot to log in to a MediaWiki install on the intranet. I believe it is due to the HTTP authentication protecting the wiki.

Facts:

  1. The wiki root is: https://local.example.com/mywiki/
  2. When visiting the wiki with a web browser, a popup comes up asking for enterprise credentials (I assume this is basic access authentication)

This is what I have in my user-config.py:

mylang = 'en'
family = 'mywiki'
usernames['mywiki']['en'] = u'Bot'
authenticate['local.example.com'] = ('user', 'pass')

This is what I have in mywiki_family.py:

# -*- coding: utf-8  -*-
import family, config

# The Wikimedia family that is known as mywiki
class Family(family.Family):
  def __init__(self):
      family.Family.__init__(self)
      self.name = 'mywiki'
      self.langs = { 'en' : 'local.example.com'}

  def scriptpath(self, code):
      return '/mywiki'

  def version(self, code):
      return '1.13.5'

  def isPublic(self):
      return False

  def hostname(self, code):
      return 'local.example.com'

  def protocol(self, code):
      return 'https'

  def path(self, code):
      return '/mywiki/index.php'

When I execute login.py -v -v, I get this:

urllib2.urlopen(urllib2.Request('https://local.example.com/w/index.php?title=Special:Userlogin&useskin=monobook&action=submit', wpSkipCookieCheck=1&wpPassword=XXXX&wpDomain=&wpRemember=1&wpLoginattempt=Aanmelden%20%26%20Inschrijven&wpName=Bot, {'Content-type': 'application/x-www-form-urlencoded', 'User-agent': 'PythonWikipediaBot/1.0'})):
(Redundant traceback info here)
urllib2.HTTPError: HTTP Error 401: Unauthorized

(I'm not sure why it has 'local.example.com/w' instead of '/mywiki'.)

I thought it might be trying to authenticate to local.example.com instead of local.example.com/mywiki, so I changed the authenticate line to:

authenticate['local.example.com/mywiki'] = ('user', 'pass')

But then I get an HTTP 401.2 error back from IIS:

You do not have permission to view this directory or page using the credentials that you supplied because your Web browser is sending a WWW-Authenticate header field that the Web server is not configured to accept.

Any help on how to get this working would be appreciated.

Update: After fixing my family file, it now says:

Getting information for site mywiki:en ('http error', 401, 'Unauthorized', ) WARNING: Could not open 'https://local.example.com/mywiki/index.php?title=Non-existing_page&action=edit&useskin=monobook'. Maybe the server or your connection is down. Retrying in 1 minutes...

I looked at the HTTP headers on a plain urllib2.urlopen call, and the server is sending WWW-Authenticate: Negotiate and WWW-Authenticate: NTLM. I'm guessing urllib2, and thus pywikipedia, doesn't support this?
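For anyone who wants to repeat that check, here is a minimal sketch of how the offered schemes could be read from the header values. The helper names `offered_auth_schemes` and `needs_ntlm_support` are made up for illustration, and the code assumes one challenge per header value, which is what the headers above look like.

```python
def offered_auth_schemes(header_values):
    """Return the lower-cased scheme name of each WWW-Authenticate value.

    Assumption: one challenge per header value, as in the headers quoted above.
    """
    schemes = []
    for value in header_values:
        parts = value.strip().split(None, 1)
        if parts:
            schemes.append(parts[0].lower())
    return schemes


def needs_ntlm_support(header_values):
    """True when the server offers Negotiate or NTLM but not Basic."""
    schemes = offered_auth_schemes(header_values)
    return 'basic' not in schemes and ('ntlm' in schemes or 'negotiate' in schemes)


print(offered_auth_schemes(['Negotiate', 'NTLM']))  # ['negotiate', 'ntlm']
print(needs_ntlm_support(['Negotiate', 'NTLM']))    # True
```

With urllib2 the header values would probably come from the caught HTTPError, for example `e.headers.getheaders('WWW-Authenticate')` (on Python 3, `e.headers.get_all('WWW-Authenticate')`).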

Update: Added a tasty bounty for help in getting this to work. I can authenticate using python-ntlm. How do I integrate this into pywikipedia?

A: 

I am guessing the problem you have is that the server expects basic authentication and you are not handling that in your client. Michael Foord wrote a good article about handling basic authentication in Python.
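As a rough illustration of what basic authentication amounts to on the wire, the sketch below builds the Authorization header value by hand. The function name `basic_auth_header` is hypothetical, and this only applies if the server really does ask for the Basic scheme.

```python
import base64


def basic_auth_header(user, password):
    """Build the value of an HTTP 'Authorization' header for the Basic scheme."""
    raw = ('%s:%s' % (user, password)).encode('utf-8')
    return 'Basic ' + base64.b64encode(raw).decode('ascii')


# The placeholder credentials from the question's user-config.py
print(basic_auth_header('user', 'pass'))  # Basic dXNlcjpwYXNz
```

The resulting string would be sent as the `Authorization` request header. If the server answers with a different scheme in WWW-Authenticate, this header will not help.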

You did not provide enough information for me to be sure about this, so if that does not work, please provide some additional information, such as a network dump of your connection attempt.

Heikki Toivonen
No :) pywikipedia handles authentication correctly. You just need to configure it properly :)
NicDumZ
+4  A: 

Hello!

Well, the fact that login.py tries accessing '/w' instead of your path shows that there is a family configuration issue.

Your code is indented strangely. Is scriptpath a member of the new Family class? As in:

class Family(family.Family):
    def __init__(self):
        family.Family.__init__(self)
        self.name = 'mywiki'
        self.langs = { 'en' : 'local.example.com'}

    def scriptpath(self, code):
        return '/mywiki'

    def version(self, code):
        return '1.13.5'

    def isPublic(self):
        return False

    def hostname(self, code):
        return 'local.example.com'

    def protocol(self, code):
        return 'https'

?

I believe that something is wrong with your family file. A good way to check is to run this in a Python console:

import wikipedia
site = wikipedia.getSite('en', 'mywiki')
print site.login_address()

As long as the relative address is wrong, showing '/w' instead of '/mywiki', it means that the family file is still not configured correctly, and that the bot won't work :)
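To make the symptom concrete, here is a self-contained sketch of how a URL built from the family methods can keep showing '/w' even when scriptpath is overridden. `BaseFamily`, `OnlyScriptpath`, `WithPath` and `login_url` are hypothetical stand-ins, not pywikipedia's real classes, and the hard-coded default in `path()` is an assumption chosen to mirror the behaviour seen in the question.

```python
class BaseFamily(object):
    """Hypothetical stand-in for a family base class (NOT pywikipedia's own).

    Assumption: path() has its own default and does not consult scriptpath().
    """

    def protocol(self, code):
        return 'http'

    def hostname(self, code):
        return 'example.org'

    def scriptpath(self, code):
        return '/w'

    def path(self, code):
        return '/w/index.php'


class OnlyScriptpath(BaseFamily):
    """Overrides scriptpath() but leaves path() at its default."""

    def protocol(self, code):
        return 'https'

    def hostname(self, code):
        return 'local.example.com'

    def scriptpath(self, code):
        return '/mywiki'


class WithPath(OnlyScriptpath):
    """Also overrides path(), as in the family file shown in the question."""

    def path(self, code):
        return '/mywiki/index.php'


def login_url(fam, code='en'):
    """Assemble a login URL from protocol, hostname and path."""
    return '%s://%s%s?title=Special:Userlogin' % (
        fam.protocol(code), fam.hostname(code), fam.path(code))


print(login_url(OnlyScriptpath()))  # still contains /w/index.php
print(login_url(WithPath()))        # contains /mywiki/index.php
```

If the real base class behaves like this stand-in, overriding `path(self, code)` in the family file is what changes the printed address.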

Update: how to integrate ntlm in pywikipedia?

I just had a look at the basic example here. I would integrate the code before that line in login.py:

response = urllib2.urlopen(urllib2.Request(self.site.protocol() + '://' + self.site.hostname() + address, data, headers))

You want to write something like:

from ntlm import HTTPNtlmAuthHandler

user = r'DOMAIN\User'  # raw string so the backslash is kept literally
password = "Password"
url = self.site.protocol() + '://' + self.site.hostname()

passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, url, user, password)
# create the NTLM authentication handler
auth_NTLM = HTTPNtlmAuthHandler.HTTPNtlmAuthHandler(passman)

# create and install the opener
opener = urllib2.build_opener(auth_NTLM)
urllib2.install_opener(opener)

response = urllib2.urlopen(urllib2.Request(self.site.protocol() + '://' + self.site.hostname() + address, data, headers))

I would test this and integrate it directly into the pywikipedia codebase if only I had an NTLM setup available...
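The password manager part can at least be checked without an NTLM server. The sketch below is only an illustration: `build_password_manager` is a hypothetical helper, the credentials are placeholders, and the import fallback is there so the same code runs on Python 2 (urllib2) and Python 3 (urllib.request). It shows that credentials registered for the site root are also found for URLs below it, such as the /mywiki path.

```python
try:
    import urllib2 as urlrequest  # Python 2, as used by pywikipedia here
except ImportError:
    import urllib.request as urlrequest  # Python 3 equivalent


def build_password_manager(base_url, user, password):
    """Register one set of credentials for base_url, whatever the realm."""
    passman = urlrequest.HTTPPasswordMgrWithDefaultRealm()
    passman.add_password(None, base_url, user, password)
    return passman


# Placeholder credentials, same shape as in the snippet above
passman = build_password_manager('https://local.example.com',
                                 r'DOMAIN\User', 'Password')

# Found for a URL below the registered root
print(passman.find_user_password(
    None, 'https://local.example.com/mywiki/index.php'))

# Not found for a different host: (None, None)
print(passman.find_user_password(None, 'https://other.example.com/'))
```

The same `passman` object is what would be handed to the NTLM handler, so a lookup that returns `(None, None)` for the wiki URL points at the registered URL rather than at NTLM itself.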

Whatever happens, please do not vanish with your solution: we at pywikipedia are interested in it :)

NicDumZ
This was part of the problem, +1. I was missing the "def path(self, code)" line in the family part of the code. Apparently the "scriptpath" section wasn't doing it.
Jake
I found the line it is choking on: f = uo.open(url, data) in method getUrl. After I forced it to use the authenticateUrlOpener (and introduced the ntlm handler) it throws an exception "list index out of range" when I go to open it. The url looks fine and data is None, so not sure why it's freaking out here.
Jake
I can't help if you don't give me the complete traceback...
NicDumZ