
Wiki Engine in Python from Scratch

This is a sample wiki engine, written in Python. It's meant as a learning aid, not a real tool: it lacks most features, can serve only one user at a time and stores all the page contents in memory, so they are gone when you restart it.

#!/usr/bin/python
# -*- coding: utf-8 -*-
import BaseHTTPServer, urllib, re

class Handler(BaseHTTPServer.BaseHTTPRequestHandler):
    template = u"""<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd"><html><head><title>%s</title>
</head><body><h1>%s</h1><pre>%s</pre><form action="" method="POST"
class="editor"><div><textarea name="text">%s</textarea><input type="submit"
value="Save"></div></form></body></html>"""

    def escape_html(self, text):
        """Replace special HTML characters with HTML entities"""
        return text.replace(
            "&", "&amp;").replace(">", "&gt;").replace("<", "&lt;")

    def link_repl(self, match):
        """Return HTML for link"""
        title = match.group(1)
        if title in self.server.pages:
            return u"""<a href="%s">%s</a>""" % (title, title)
        return u"""%s<a href="%s">?</a>""" % (title, title)

    def do_HEAD(self):
        """Send response headers"""
        self.send_response(200)
        self.send_header("content-type", "text/html;charset=utf-8")
        self.end_headers()

    def do_GET(self):
        """Send page text"""
        self.do_HEAD()
        page = self.escape_html(urllib.unquote(self.path.strip('/')))
        text = self.escape_html(self.server.pages.get(page, "Empty..."))
        parsed = re.sub(r"\[\[([^]]+)\]\]", self.link_repl, text)
        self.wfile.write(self.template % (page, page, parsed, text))

    def do_POST(self):
        """Save new page text and display it"""
        length = int(self.headers.getheader('content-length'))
        if length:
            text = self.rfile.read(length)
            page = self.escape_html(urllib.unquote(self.path.strip('/')))
            self.server.pages[page] = urllib.unquote_plus(text[5:])
        self.do_GET()

if __name__ == '__main__':
    server = BaseHTTPServer.HTTPServer(("127.0.0.1", 8080), Handler)
    server.pages = {}
    server.serve_forever()

To try this wiki, just run it with a Python 2 interpreter on your computer (the BaseHTTPServer module became part of http.server in Python 3), and point your web browser to http://127.0.0.1:8080.
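The link markup is the one part of the listing that gets no explanation of its own. As a rough sketch only, here is the same regular expression pulled out into a standalone function. The names render_links and LINK_RE and the sample pages dictionary are made up for this illustration and are not part of the engine:

```python
import re

# Same pattern as in the listing: [[ followed by anything but ] and then ]]
LINK_RE = re.compile(r"\[\[([^]]+)\]\]")

def render_links(text, pages):
    """Replace [[Title]] with a link, marking missing pages with a '?'."""
    def link_repl(match):
        title = match.group(1)
        if title in pages:
            # The page exists: the title itself becomes the link.
            return '<a href="%s">%s</a>' % (title, title)
        # The page does not exist yet: plain title plus a "?" link to create it.
        return '%s<a href="%s">?</a>' % (title, title)
    return LINK_RE.sub(link_repl, text)

pages = {"Home": "Welcome!"}  # hypothetical content
html_text = render_links("See [[Home]] and [[Missing]].", pages)
# html_text == 'See <a href="Home">Home</a> and Missing<a href="Missing">?</a>.'
```

Like the listing, this sketch puts the title into the href as it is, without URL quoting, so it is only good enough for simple titles.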

This engine uses the built-in web server from Python's standard library, BaseHTTPServer, so that you don't need to set up your own web server or look for hosting services just to play with it. We provide this server with a custom request handler that supports three kinds of requests: HEAD, GET and POST (the do_HEAD, do_GET and do_POST methods in the listing).

There is a lot of space for improvement:

I'm also writing down in detail how I came up with this code (minus obvious errors and some frustration with empty POSTs) at Step By Step Wiki Engine.


Ok, so you've seen that 50-line wiki, but you would like to know how I actually wrote it? It's not any special feat. Actually, writing exceptionally small programs, although it takes much more time, seems to me to be easier than writing elaborate code that does the same thing, mostly because there is less room for bugs. Anyway, I thought it could be beneficial to show how you actually do it, not just the end result. So here goes.

This wiki engine was intended to be used as a learning tool, and I wanted it to work out of the box wherever possible (in this case, wherever Python is available). Because getting a hosting service with Python is not trivial, and setting up your own web server may be too complicated on various operating systems, I decided that the engine must contain its own web server. I knew there was a simple web server implementation in the Python standard library, but I didn't know how to use it. So, naturally, the first step was a simple test server:

import BaseHTTPServer

handler = BaseHTTPServer.BaseHTTPRequestHandler
server = BaseHTTPServer.HTTPServer(("127.0.0.1", 8080), handler)
server.serve_forever()

This code is a basic HTTP server that just runs on port 8080 of our local host (127.0.0.1 always points to the local computer; it's sometimes called a loopback address) and responds with an error to every request:

Error response

Error code 501.
Message: Unsupported method ('GET').
Error code explanation: 501 = Server does not support this operation.

You can terminate it by pressing Ctrl+C twice in the console where you run it. I've chosen port 8080, not the default one, 80, because there may already be a server running on that port, and on most systems you would need administrator privileges to use it. I tell it to run only on the loopback interface, and not on all interfaces, for security reasons: I don't want anyone from the outside connecting to my experimental program. I can replace "127.0.0.1" with just an empty string "" to make it respond on all interfaces later.

The reason why it responds with an error is obvious: it doesn't know how to do anything else. The handler we used is a blank slate and doesn't do anything useful yet. To make it do something, we need to extend it, and we can do that by making our own handler that inherits everything from BaseHTTPRequestHandler but in addition defines code to handle GET and other methods. So, the next step is a simple "hello world":

import BaseHTTPServer

class Handler(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("content-type", "text/plain")
        self.end_headers()
        self.wfile.write("Hello world!")

server = BaseHTTPServer.HTTPServer(("127.0.0.1", 8080), Handler)
server.serve_forever()

I only implemented the do_GET method, because that's the default way web browsers "get" web pages. All the rest of the Handler is inherited from BaseHTTPRequestHandler. I don't even know what code may be required in there, but I know (from the Python documentation) that there are several useful methods in there, including the three used above: send_response, send_header and end_headers.

There is also a file-like object defined in the handler, called wfile, that I can write to in order to send things to the web browser. I use it to send a "hello world" message. Directing our web browser to any address beginning with http://localhost:8080/ gives us:

Hello world!
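The listings in this text are Python 2. For readers on Python 3, here is a rough sketch of the same "hello world" handler using the renamed module http.server. The fetch helper is made up for this illustration (it is not part of the wiki): it starts a throwaway server on a free loopback port and asks it for one page, so the handler can be checked without a browser.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("content-type", "text/plain")
        self.end_headers()
        self.wfile.write(b"Hello world!")  # Python 3 wants bytes here

    def log_message(self, format, *args):
        pass  # keep the console quiet for this demonstration

def fetch(path):
    """Serve exactly one request on a free loopback port and return the reply."""
    server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: the OS picks one
    server.timeout = 5  # give up instead of waiting forever if nothing arrives
    worker = threading.Thread(target=server.handle_request)
    worker.start()
    conn = http.client.HTTPConnection(
        "127.0.0.1", server.server_address[1], timeout=5)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        return response.status, response.read()
    finally:
        conn.close()
        worker.join()
        server.server_close()

status, body = fetch("/any/address")  # every path gets the same answer
```

Whatever path is requested, status should be 200 and body should be the bytes of "Hello world!", just like in the browser.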

Now we can display pages, and changing the content type is not a problem. But it would be nice to show different pages depending on the URL used. We can get that information from the path attribute:

import BaseHTTPServer

class Handler(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("content-type", "text/html; charset=utf-8")
        self.end_headers()
        self.wfile.write("""<html><head><title>Sample</title></head><body>
<h1>%s</h1>
Hello world!
</body></html>""" % self.path)

server = BaseHTTPServer.HTTPServer(("127.0.0.1", 8080), Handler)
server.serve_forever()

Nice, it displays the page title as intended. We can easily remove the slashes from the beginning, and optionally also from the end, using strip. There is, however, a problem for non-English speakers. Try this URL: http://localhost:8080/Łączka and you will see something like this:

/%C5%81%C4%85czka

Hello world!

Not exactly as intended. What is happening? Only a small set of characters is allowed to appear inside a URL, and all other characters have to be encoded in the form of their numeric codes, prefixed with %. We set our character set to utf-8, so the URL is encoded as utf-8. We just need to decode these characters. Fortunately, there is a ready-made function that does that in the Python standard library, in urllib.

import BaseHTTPServer, urllib

class Handler(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("content-type", "text/html; charset=utf-8")
        self.end_headers()
        page = urllib.unquote(self.path.strip('/'))
        self.wfile.write("""<html><head><title>Sample</title></head><body>
<h1>%s</h1>
Hello world!
</body></html>""" % page)

server = BaseHTTPServer.HTTPServer(("127.0.0.1", 8080), Handler)
server.serve_forever()
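To see the decoding on its own, outside of the server, here is a small sketch. It assumes Python 3, where the function lives in urllib.parse rather than directly in urllib, and page_name is just a name picked for this illustration:

```python
from urllib.parse import unquote

def page_name(path):
    """Turn a request path such as '/%C5%81%C4%85czka' into a page title."""
    # Strip the slashes first, then decode the %XX sequences (UTF-8 by default).
    return unquote(path.strip("/"))

title = page_name("/%C5%81%C4%85czka")  # the Polish word from the example URL
markup = page_name("/a%3Ci%3Ec")        # decodes to the text a<i>c
```

Here title comes out as "Łączka". Note that the second call gives back raw angle brackets, so the decoded text is not yet safe to paste into HTML.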

This takes care of all the special characters. Let's see how this works: http://localhost:8080/a<i>c. Weird, the characters are eaten, together with the "i". Let's look into the text that our web browser got from the server: using the "view page source" option present in most modern browsers, we can see:

<html><head><title>Sample</title></head><body>
<h1>a<i>c</h1>
Hello world!
</body></html>

The <i> is there all right, so what is going on here? Wait, that "c" looks a little weird, slanted as if it was italic. What was the "<i>" in HTML for? Right, these characters are treated as HTML markup, not as content. What can we do to avoid that? The standard procedure is to encode the three special characters, "&", "<" and ">", as so-called entities. There is a list of available HTML entities, but we only need "&amp;", "&lt;" and "&gt;" (they are derived from the names "ampersand", "less than" and "greater than"). Note that we must replace the "&" first, otherwise we would break the ampersands in the other entities. Note also that if you don't escape any and all user-provided text in your web applications, you are opening a security hole and enabling cross-site scripting attacks (so-called XSS) and various tricks with styles.

import BaseHTTPServer, urllib

class Handler(BaseHTTPServer.BaseHTTPRequestHandler):
    def escape_html(self, text):
        return text.replace("&", "&amp;").replace(">", "&gt;").replace("<", "&lt;")

    def do_GET(self):
        self.send_response(200)
        self.send_header("content-type", "text/html; charset=utf-8")
        self.end_headers()
        page = self.escape_html(urllib.unquote(self.path.strip('/')))
        self.wfile.write("""<html><head><title>Sample</title></head><body>
<h1>%s</h1>
Hello world!
</body></html>""" % page)

server = BaseHTTPServer.HTTPServer(("127.0.0.1", 8080), Handler)
server.serve_forever()

I made it into a separate method because it will come in handy several times. Looking at this now, I should have made it a function outside of the class, not a method, but that's not so important at the moment.
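Why the ampersand has to go first is easiest to see by doing it wrong on purpose. The sketch below writes the escaping as a plain function and adds a deliberately broken variant, escape_html_wrong_order, which exists only for this comparison. The last line assumes Python 3.2 or newer, where the standard library has html.escape:

```python
import html

def escape_html(text):
    """Escape &, < and >, replacing the ampersand first."""
    return text.replace("&", "&amp;").replace(">", "&gt;").replace("<", "&lt;")

def escape_html_wrong_order(text):
    """Same replacements, but the ampersand last (a deliberate mistake)."""
    return text.replace(">", "&gt;").replace("<", "&lt;").replace("&", "&amp;")

sample = "a<i>c & d"
good = escape_html(sample)             # 'a&lt;i&gt;c &amp; d'
bad = escape_html_wrong_order(sample)  # 'a&amp;lt;i&amp;gt;c &amp; d'

# With quote=False the standard library function escapes the same three characters.
same_as_stdlib = good == html.escape(sample, quote=False)
```

In the broken variant the ampersands of the freshly made entities get escaped a second time, so the browser would show "&lt;" as literal text instead of "<".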

The next step is to show some text for different pages. We will use a dictionary to store the page text. We don't need to save it into files or store in a database, because our server is running all the time. If it was a PHP or CGI script, then it would be restarted with every request, so we couldn't cheat like that.

import BaseHTTPServer, urllib

class Handler(BaseHTTPServer.BaseHTTPRequestHandler):
    def escape_html(self, text):
        return text.replace("&", "&amp;").replace(">", "&gt;").replace("<", "&lt;")

    def do_GET(self):
        self.send_response(200)
        self.send_header("content-type", "text/html; charset=utf-8")
        self.end_headers()
        page = self.escape_html(urllib.unquote(self.path.strip('/')))
        self.wfile.write("""<html><head><title>Sample</title></head><body>
<h1>%s</h1>
<pre>%s</pre>
</body></html>""" % (page, self.server.pages.get(page, "empty")))

server = BaseHTTPServer.HTTPServer(("127.0.0.1", 8080), Handler)
server.pages = { "":'Hello <a href ="world">world</a>!', "hello":'world' }
server.serve_forever()

Initially I put the pages dictionary in the Handler class itself, but later decided to move it to the server object: mostly because I can initialize it more easily this way, and also because I'm not sure whether various more advanced implementations of the HTTP server also use only a single instance of the handler. Anyway, our web application now displays "Hello world!" with a link on the first page, "world" on the page titled "hello" and "empty" on all the others. I've put the text of the page in a <pre> block to preserve all whitespace and newlines.

It's all good, but it's not a wiki if you can't edit it. So we need an editor for our pages. I decided to put it on the same page as the rendered text, so that we don't need any special page names to indicate that we want the editor, not the page itself. Of course, special addresses will have to be introduced sooner or later if you want to have more advanced features. But I don't care about this for now, let's just have our wiki working first.

import BaseHTTPServer, urllib

class Handler(BaseHTTPServer.BaseHTTPRequestHandler):
    def escape_html(self, text):
        return text.replace("&", "&amp;").replace(">", "&gt;").replace("<", "&lt;")

    def do_GET(self):
        self.send_response(200)
        self.send_header("content-type", "text/html; charset=utf-8")
        self.end_headers()
        page = self.escape_html(urllib.unquote(self.path.strip('/')))
        text = self.server.pages.get(page, "empty")
        self.wfile.write("""<html><head><title>Sample</title></head><body>
<h1>%s</h1>
<pre>%s</pre><form action="" method="POST">
<p><textarea>%s</textarea>
<input type="submit" value="Save"></p></form>
</body></html>""" % (page, text, text))

server = BaseHTTPServer.HTTPServer(("127.0.0.1", 8080), Handler)
server.pages = { "":'Hello <a href ="world">world</a>!', "hello":'world' }
server.serve_forever()

Ok, let's see: I have added the textarea to the HTML, and now I retrieve the text of the page into a variable, because I have to use it twice. Let's see how it works: it displays the editor just fine, but as soon as you press "Save", you can see:

Error response

Error code 501.

Message: Unsupported method ('POST').

Error code explanation: 501 = Server does not support this operation.

Looks familiar? Of course, we only have the GET method implemented, and not the POST. We need to make a do_POST method:

import BaseHTTPServer, urllib

class Handler(BaseHTTPServer.BaseHTTPRequestHandler):
    def escape_html(self, text):
        return text.replace("&", "&amp;").replace(">", "&gt;").replace("<", "&lt;")

    def do_GET(self):
        self.send_response(200)
        self.send_header("content-type", "text/html; charset=utf-8")
        self.end_headers()
        page = self.escape_html(urllib.unquote(self.path.strip('/')))
        text = self.server.pages.get(page, "empty")
        self.wfile.write("""<html><head><title>Sample</title></head><body>
<h1>%s</h1>
<pre>%s</pre><form action="" method="POST">
<p><textarea>%s</textarea>
<input type="submit" value="Save"></p></form>
</body></html>""" % (page, text, text))

    def do_POST(self):
        self.do_GET()

server = BaseHTTPServer.HTTPServer(("127.0.0.1", 8080), Handler)
server.pages = { "":'Hello <a href ="world">world</a>!', "hello":'world' }
server.serve_forever()

This should do it. Yes, I am lazy. But it's almost done, we only need some code that would actually take the text that is posted and save it into our pages dictionary. We can read that text from the self.rfile file of the handler. Piece of cake.

import BaseHTTPServer, urllib

class Handler(BaseHTTPServer.BaseHTTPRequestHandler):
    def escape_html(self, text):
        return text.replace("&", "&amp;").replace(">", "&gt;").replace("<", "&lt;")

    def do_GET(self):
        self.send_response(200)
        self.send_header("content-type", "text/html; charset=utf-8")
        self.end_headers()
        page = self.escape_html(urllib.unquote(self.path.strip('/')))
        text = self.server.pages.get(page, "empty")
        self.wfile.write("""<html><head><title>Sample</title></head><body>
<h1>%s</h1>
<pre>%s</pre><form action="" method="POST">
<p><textarea>%s</textarea>
<input type="submit" value="Save"></p></form>
</body></html>""" % (page, text, text))

    def do_POST(self):
        text = self.rfile.read()
        page = self.escape_html(urllib.unquote(self.path.strip('/')))
        self.server.pages[page] = text
        self.do_GET()

server = BaseHTTPServer.HTTPServer(("127.0.0.1", 8080), Handler)
server.pages = {}
server.serve_forever()

So, I read the text from the rfile, and assign it to the appropriate page name. I need to compute the page name again, because it's not known yet at this point. For now, I just copied the relevant line, but if I had to do it a third time, I would move it to a separate function, or store it in an attribute. I also removed the initial text from the pages, as we are going to be able to edit them; at least that's the plan.

Well, but the new code doesn't work. When you hit "Save" it just keeps on waiting to load the page, until it times out. What's wrong? I kept on struggling with this for a long while. For some reason, the wiki engine keeps on waiting for the text, but never receives any. I thought that it might be trying to read too much: let's check how much there is to read first. We can find that in the request headers:

import BaseHTTPServer, urllib

class Handler(BaseHTTPServer.BaseHTTPRequestHandler):
    def escape_html(self, text):
        return text.replace("&", "&amp;").replace(">", "&gt;").replace("<", "&lt;")

    def do_GET(self):
        self.send_response(200)
        self.send_header("content-type", "text/html; charset=utf-8")
        self.end_headers()
        page = self.escape_html(urllib.unquote(self.path.strip('/')))
        text = self.server.pages.get(page, "empty")
        self.wfile.write("""<html><head><title>Sample</title></head><body>
<h1>%s</h1>
<pre>%s</pre><form action="" method="POST">
<p><textarea>%s</textarea>
<input type="submit" value="Save"></p></form>
</body></html>""" % (page, text, text))

    def do_POST(self):
        length = int(self.headers.getheader('content-length'))
        text = self.rfile.read(length)
        page = self.escape_html(urllib.unquote(self.path.strip('/')))
        self.server.pages[page] = text
        self.do_GET()

server = BaseHTTPServer.HTTPServer(("127.0.0.1", 8080), Handler)
server.pages = {}
server.serve_forever()
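As for why the earlier, unbounded read got stuck: most likely because read() with no argument waits for the end of the file, and on a network connection that only comes when the other side hangs up, which the browser won't do while it is still waiting for our answer. The sketch below imitates this with a pair of connected sockets standing in for the browser and the server. It is Python 3, and read_known_length is a name invented for this illustration:

```python
import socket

def read_known_length(body):
    """Send body over an open connection and read exactly len(body) bytes back."""
    browser, server = socket.socketpair()
    try:
        # The "browser" sends the form data but keeps the connection open.
        browser.sendall(body)
        rfile = server.makefile("rb")
        try:
            # rfile.read() without a size would wait here for end-of-file.
            # Asking for the announced number of bytes returns right away.
            return rfile.read(len(body))
        finally:
            rfile.close()
    finally:
        server.close()
        browser.close()

data = read_known_length(b"text=empty")
```

In the real handler the number of bytes comes from the content-length header instead of len(body).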

Now, reading only as many bytes as the content-length header announces gives us an interesting result: the page is erased when we try to save it, there is no text! What is happening? Then I looked into the source of the page, and noticed the error: the textarea tag has no name on it! Nameless fields are not passed in the form data, that's why we get no content! Just giving a name to the textarea tag fixes it:

import BaseHTTPServer, urllib

class Handler(BaseHTTPServer.BaseHTTPRequestHandler):
    def escape_html(self, text):
        return text.replace("&", "&amp;").replace(">", "&gt;").replace("<", "&lt;")

    def do_GET(self):
        self.send_response(200)
        self.send_header("content-type", "text/html; charset=utf-8")
        self.end_headers()
        page = self.escape_html(urllib.unquote(self.path.strip('/')))
        text = self.server.pages.get(page, "empty")
        self.wfile.write("""<html><head><title>Sample</title></head><body>
<h1>%s</h1>
<pre>%s</pre><form action="" method="POST">
<p><textarea name="text">%s</textarea>
<input type="submit" value="Save"></p></form>
</body></html>""" % (page, text, text))

    def do_POST(self):
        length = int(self.headers.getheader('content-length'))
        text = self.rfile.read(length)
        page = self.escape_html(urllib.unquote(self.path.strip('/')))
        self.server.pages[page] = text
        self.do_GET()

server = BaseHTTPServer.HTTPServer(("127.0.0.1", 8080), Handler)
server.pages = {}
server.serve_forever()

Let's try and save the default "empty" text and see what comes out:

text=empty

Of course, the data needs to contain the variable names! The different pieces are separated with the "&" character, and "name=" is prepended to every one of them (with "name" replaced by the actual form field name, of course). I was pretty tired after the previous problem, so I did something wrong. Something very wrong. I just remove the first 5 characters of the data, the ones that are supposed to contain the substring "text=". When you write a real web application, you can't depend on the exact format of what the web browser sends back to you like that! Even if you limit the size of the form fields, even if you do some client-side validation with JavaScript, even if the form absolutely must contain certain fields, you can't rely on it: browsers sometimes behave weirdly, users disable JavaScript for better performance and security, and, most important, you can always receive forged requests from bots or from users trying to hack your site. That's why you always need to re-check the validity of the data you receive, and never use any data directly in your code unquoted, be it an SQL query, binary files or HTML output. Well, at least we can escape the HTML with our ready-made escaping function.

import BaseHTTPServer, urllib

class Handler(BaseHTTPServer.BaseHTTPRequestHandler):
    def escape_html(self, text):
        return text.replace("&", "&amp;").replace(">", "&gt;").replace("<", "&lt;")

    def do_GET(self):
        self.send_response(200)
        self.send_header("content-type", "text/html; charset=utf-8")
        self.end_headers()
        page = self.escape_html(urllib.unquote(self.path.strip('/')))
        text = self.server.pages.get(page, "empty")
        self.wfile.write("""<html><head><title>Sample</title></head><body>
<h1>%s</h1>
<pre>%s</pre><form action="" method="POST">
<p><textarea name="text">%s</textarea>
<input type="submit" value="Save"></p></form>
</body></html>""" % (page, text, text))

    def do_POST(self):
        length = int(self.headers.getheader('content-length'))
        text = self.rfile.read(length)
        page = self.escape_html(urllib.unquote(self.path.strip('/')))
        self.server.pages[page] = self.escape_html(text[5:])
        self.do_GET()

server = BaseHTTPServer.HTTPServer(("127.0.0.1", 8080), Handler)
server.pages = {}
server.serve_forever()

The output still doesn't look right. Saving text "Hello world!" gives us:

Hello+world%21

We need to decode these quoted characters, in a similar way to what we did with URLs, with an additional twist: all the spaces are converted to "+" characters, so we have to decode them too. There is a ready-made function for this in urllib:

import BaseHTTPServer, urllib

class Handler(BaseHTTPServer.BaseHTTPRequestHandler):
    def escape_html(self, text):
        return text.replace("&", "&amp;").replace(">", "&gt;").replace("<", "&lt;")

    def do_GET(self):
        self.send_response(200)
        self.send_header("content-type", "text/html; charset=utf-8")
        self.end_headers()
        page = self.escape_html(urllib.unquote(self.path.strip('/')))
        text = self.server.pages.get(page, "empty")
        self.wfile.write("""<html><head><title>Sample</title></head><body>
<h1>%s</h1>
<pre>%s</pre><form action="" method="POST">
<p><textarea name="text">%s</textarea>
<input type="submit" value="Save"></p></form>
</body></html>""" % (page, text, text))

    def do_POST(self):
        length = int(self.headers.getheader('content-length'))
        text = self.rfile.read(length)
        page = self.escape_html(urllib.unquote(self.path.strip('/')))
        self.server.pages[page] = self.escape_html(urllib.unquote_plus(text[5:]))
        self.do_GET()

server = BaseHTTPServer.HTTPServer(("127.0.0.1", 8080), Handler)
server.pages = {}
server.serve_forever()

Now it looks good! The next step is adding links.
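The text[5:] shortcut is the part I would not keep in a real application. A less fragile way, sketched here under the assumption of Python 3 (where the parser lives in urllib.parse; as far as I know Python 2 has the same function in the urlparse module), is to let the standard library split the form data and pick the field by name. The helper form_field is hypothetical and takes the body as a string, so bytes read from rfile would have to be decoded first:

```python
from urllib.parse import parse_qs

def form_field(body, name, default=""):
    """Return the first value of one form field from a urlencoded body."""
    # parse_qs splits on "&", separates names from values and undoes
    # both the %XX quoting and the "+" for spaces.
    fields = parse_qs(body, keep_blank_values=True)
    values = fields.get(name)
    return values[0] if values else default

saved = form_field("text=Hello+world%21", "text")    # 'Hello world!'
missing = form_field("other=1", "text")              # '' (field not sent)
tricky = form_field("x=1&text=a%26b%3Cc", "text")    # 'a&b<c'
```

This no longer cares where in the data the field appears or whether it is there at all. The result is still raw user input, so it has to go through escape_html before it lands in the page.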

To be continued…