Sunday, December 31, 2023

IRC Forth Chatbot

After putting in a lot of work on the 2023 Project Goals for my presentation at HHC 2023, I decided to take a break and work on a few other things for a while. One of those things is a chatbot that connects to IRC and runs Forth code it receives from users. My inspiration for this is the geordi bot, which evaluates C++ code and also runs on IRC. Testing short programs and showing the results immediately is really useful for demonstrating language concepts. Although my Forth chatbot was a really fun project to work on and taught me a lot about several different technologies, I decided not to finish it in the end.

The chatbot is split into two parts. The main program is a Python script that connects to IRC, handles communication, and runs the terminal interface that manages the bot. The second part is a shared library written in C that the Python script loads and calls into to run the Forth code submitted by users. The library is written in C to maximize performance.

The IRC communication is handled by the irc library for Python. In general, I prefer to avoid libraries like these and write things from scratch where practical, but IRC communication seemed like too big a project to recreate myself. In the past, relying on pre-existing libraries like this one has led to a lot of headaches and frustration, which is why I avoid them when possible. Sometimes it's not apparent that a library is missing some key piece of functionality until the project is pretty far along. When you have to abandon something you worked hard on, it makes you wish you had just created the needed library functionality yourself since that would have been faster in the end. Another annoying thing is when the documentation for the library is incomplete or just plain wrong. All in all, the irc library worked reliably and as expected with only a few annoyances. Thankfully, the source is laid out really well, so I could look through it to figure out what was going wrong. The library provides an IRC bot class that can be subclassed, which is really neat. Methods like on_join and on_welcome are called by the library's event handler as necessary, so defining them in the subclass builds an event-driven system. Unfortunately, on_disconnect doesn't fire when the disconnection comes from a dropped connection, which happens a few times a day where I live now. To get around this, my script detects a dropped connection by regularly pinging the IRC server and watching for the reply through the on_pong event. The connect method of the IRC bot class also doesn't work correctly, so the script calls the internal _connect method of the class's parent, which is generally not a good idea but necessary in this case.
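The ping-based detection of dropped connections can be sketched roughly as follows. This is not the script's actual code: the PingWatchdog class, its method names, and the interval and timeout values are all hypothetical, and the sketch deliberately leaves out the irc library so that only the timing logic is shown.

```python
import time


class PingWatchdog:
    """Decides when to send a PING and when to treat the link as dropped.

    Hypothetical helper: the names and default timings are assumptions
    for illustration, not part of the irc library.
    """

    def __init__(self, interval=60.0, timeout=30.0, clock=time.monotonic):
        self.interval = interval      # seconds of quiet before sending a PING
        self.timeout = timeout        # seconds to wait for the matching PONG
        self.clock = clock            # injectable so the logic can be tested
        self.last_ping = None         # send time of the outstanding PING, if any
        self.last_pong = clock()      # last time the server was known to be alive

    def poll(self):
        """Return "ping", "dropped", or None. Call this from the main loop."""
        now = self.clock()
        if self.last_ping is not None:
            # A PING is outstanding: either keep waiting or give up.
            if now - self.last_ping > self.timeout:
                self.reset()
                return "dropped"
            return None
        if now - self.last_pong >= self.interval:
            self.last_ping = now
            return "ping"
        return None

    def pong_received(self):
        """Call this from the bot's on_pong handler."""
        self.last_ping = None
        self.last_pong = self.clock()

    def reset(self):
        """Call this after reconnecting so the timers start fresh."""
        self.last_ping = None
        self.last_pong = self.clock()
```

In a bot built this way, the main loop would call poll() regularly, send a PING to the server when it returns "ping", and start a reconnect when it returns "dropped". The on_pong handler would only need to call pong_received().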

The irc library's example implementation of a chatbot starts an event handler and never returns from it. In my case, the chatbot also needs to accept keyboard input so that there's a way to issue commands to it on the backend. The event handler has a method that processes only the pending IRC events and returns immediately, which is very convenient. The keyboard input also needs to yield regularly so that IRC events can be processed while waiting on more keyboard input. The best way to do that in a Linux terminal is with the curses package, which worked well for my 6502 Interactive Assembler. There is a curses mode where keyboard input has a timeout, which solves this problem. Since curses is primarily used for outputting formatted text, I added a nice text interface, which I might have skipped if there were an easier way to get keyboard timeouts without curses.
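A minimal sketch of this kind of loop, assuming a 100 ms keyboard timeout. The function names are hypothetical, and the process_events argument stands in for whatever call drains the pending IRC events (in the irc library this is probably the reactor's process_once method, though the post does not name it). The line reader takes plain callables so it can be exercised without a terminal.

```python
BACKSPACE_KEYS = (8, 127, 263)   # 263 is curses.KEY_BACKSPACE
ENTER_KEYS = (10, 13)
NO_KEY = -1                      # what getch() returns when the timeout expires


def read_line_with_polling(get_key, process_events):
    """Collect one line of input, servicing IRC events between key reads.

    get_key must return NO_KEY when its timeout expires with no input,
    which is how a curses window behaves after window.timeout(ms).
    """
    buf = []
    while True:
        process_events()                 # handle pending IRC traffic first
        key = get_key()                  # blocks for at most the timeout
        if key == NO_KEY:
            continue
        if key in ENTER_KEYS:
            return "".join(buf)
        if key in BACKSPACE_KEYS:
            if buf:
                buf.pop()
        elif 32 <= key < 127:            # printable ASCII only
            buf.append(chr(key))


def curses_main(stdscr, process_events):
    """Wire the reader to a real curses window (needs a terminal)."""
    stdscr.timeout(100)                  # getch() now waits at most 100 ms
    while True:
        line = read_line_with_polling(stdscr.getch, process_events)
        if line == "quit":
            break
        stdscr.addstr(f"command: {line}\n")


if __name__ == "__main__":
    import curses
    curses.wrapper(curses_main, lambda: None)   # no-op event processing
```

Separating the reader from curses is a design choice made for this sketch only. It keeps the timeout behavior visible in one place: the loop never blocks longer than one key timeout before returning to the IRC events.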

The Forth code submitted by users is executed by a shared library written in C, which the Python script imports, to maximize performance. Setting this up was very interesting and may be useful for future projects. The "shared" nature of the library means it's shared as necessary between running programs so that each program doesn't have to include its own copy. This is a bit confusing since on Linux, each process maps the library into its own address space and gets a private copy of the library's writable data, so processes don't share any library state even when the read-only code is shared in memory. Although it wasn't needed for this project, one really interesting thing about this arrangement is that the library can call functions defined in the calling program since linking happens when the program loads rather than only when the library is compiled. Setting up a shared library with gcc was straightforward. Several of the tutorials online had overly complicated setup instructions. The following simplified process works just as well:
    
    gcc -c -O3 -fpic testlib.c
    gcc -shared -o libtestlib.so testlib.o
    gcc -O2 -o main -Wl,-rpath,'$ORIGIN' main.c libtestlib.so

The first two lines produce a shared library that Python can import. The third line compiles a C program that links in the library, which could be useful for testing but is not necessary for Python. By default, the dynamic linker on Linux looks only in certain standard locations for shared libraries. The -rpath argument on the third line, set to $ORIGIN, tells the compiled program to also look in its own directory for the library. One frustration while testing was intermittent segmentation faults in the C program. These were caused by leaving the program that loaded the shared library running while making changes to the library and recompiling it. It seems the running program was trying to use the new version of the library, which caused it to crash.

The shared library is imported into Python using the ctypes library, which also allows the definition of the return type and arguments for imported functions. As its name suggests, the package also offers variables with the same data types as C, so you can use 8-bit char or 32-bit int variables, for example, directly in Python code. If I had known about this before, I might have used ctypes for other projects like my 6502 Interactive Assembler where working with native types is easier. Once the C function that interprets the Forth code is imported into Python, it can be set to run in a separate process using the multiprocessing library, which can also allocate memory to be shared between Python and C. Running the C function in its own process lets the Python script continue to accept keyboard input and process IRC communication while the function is interpreting the user's Forth code. The Python script also keeps track of how long the function has been running so it can be terminated if it takes too long or is stuck in an endless loop.
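The combination of ctypes, shared memory, and a time limit might look roughly like the sketch below. Several things here are assumptions: the C library's strlen stands in for the Forth interpreter's entry point (which the post does not name), ctypes.CDLL(None) is used to reach it on Linux instead of loading a library like libtestlib.so, and the "fork" start method is chosen explicitly, so this is Linux-only. The helper names are hypothetical.

```python
import ctypes
import multiprocessing as mp

# Stand-in for ctypes.CDLL("./libtestlib.so"): on Linux, CDLL(None) exposes
# the symbols already loaded into the Python process, including libc.
lib = ctypes.CDLL(None)
lib.strlen.argtypes = [ctypes.c_char_p]   # declare the argument types
lib.strlen.restype = ctypes.c_size_t      # declare the return type

# Assumption: Linux with the "fork" start method, so the child process
# inherits the loaded library and the shared memory mapping directly.
ctx = mp.get_context("fork")


def interpret(source, result):
    """Stand-in for the C interpreter call: stores strlen(source)."""
    result.value = lib.strlen(source)


def spin(source, result):
    """Simulates user code stuck in an endless loop."""
    while True:
        pass


def run_with_timeout(target, source, timeout_s):
    """Run target in its own process; return its result or None on timeout."""
    result = ctx.RawValue(ctypes.c_uint32, 0)   # shared between both processes
    proc = ctx.Process(target=target, args=(source, result))
    proc.start()
    proc.join(timeout_s)                        # wait, but not forever
    if proc.is_alive():
        proc.terminate()                        # SIGTERM ends the runaway call
        proc.join()
        return None
    return result.value if proc.exitcode == 0 else None


if __name__ == "__main__":
    print(run_with_timeout(interpret, b"1 2 + .", 5.0))   # length of the source
    print(run_with_timeout(spin, b"", 0.2))               # None: timed out
    print(ctypes.c_uint8(300).value)                      # wraps like a C uint8
```

A real bot would not block in join() the way this sketch does. It would check proc.is_alive() and the elapsed time from its main loop, so keyboard input and IRC events keep being serviced while the code runs.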

The Forth interpreter implemented in C receives pointers to memory that's allocated in Python for holding the user's data, program input, and output. Passing data between the two was very smooth. Each user has 64K of memory to use for data and for holding their Forth programs. The Forth itself uses 32-bit cells and is token-threaded. I put a lot of thought into how to keep the system secure since people can submit any Forth code they like to execute. All addresses in the Forth system are 16-bit and refer only to the 64K of allocated memory. In theory, no native address should ever make its way into the Forth system or be in any way accessible from Forth. All memory accesses by Forth primitives are passed through guard functions to prevent things like buffer overflows. All changes to the 64K of user memory happen first to a copy, which is only kept if the code executed to the end with no errors. All of the functionality described so far has been implemented along with about 20 primitives. The chatbot can now receive input from users and return the result including any errors.
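The guard functions and the copy-then-commit scheme can be illustrated with a short sketch. The real versions live in the C library and are not shown in the post, so everything here is an assumption made for illustration: the function names, the little-endian cell layout, and the choice to raise an error on a bad address rather than wrap or mask it.

```python
MEM_SIZE = 0x10000   # 64K of user memory, addressed with 16-bit addresses
CELL = 4             # 32-bit cells


class ForthError(Exception):
    """Raised when user code attempts something the guards reject."""


def check_range(addr, length):
    """Guard: every access must fall entirely inside the 64K block."""
    if addr < 0 or addr + length > MEM_SIZE:
        raise ForthError(f"address out of range: {addr:#x}")


def fetch_cell(mem, addr):
    """Read one 32-bit cell (little-endian is an assumption)."""
    check_range(addr, CELL)
    return int.from_bytes(mem[addr:addr + CELL], "little")


def store_cell(mem, addr, value):
    """Write one 32-bit cell, truncating the value to 32 bits."""
    check_range(addr, CELL)
    mem[addr:addr + CELL] = (value & 0xFFFFFFFF).to_bytes(CELL, "little")


def run_guarded(user_mem, program):
    """Run program against a copy; commit the copy only on success."""
    scratch = bytearray(user_mem)        # all writes land here first
    try:
        output = program(scratch)
    except ForthError as err:
        return f"error: {err}"           # user_mem is left untouched
    user_mem[:] = scratch                # no errors: keep the changes
    return output
```

For example, a program that stores a value at 0x200 and then tries to store a cell at 0xFFFE fails the range check on the second store, so run_guarded returns an error string and the first store is discarded along with the scratch copy.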

After doing all of this work, I started to look into how to run a container or virtual machine to host the chatbot on. Almost all of my programming projects now run on an old workstation running Linux which is more than powerful enough to run a small Debian VM to host the chatbot. It took me a long time to go down the rabbit hole of how to secure Debian, and there's still a lot left to learn. Even though the chatbot will connect to IRC as a client and never technically run as a server, I've learned enough to be wary of allowing the chatbot program to run arbitrary code from the internet even if the system is completely sandboxed by my C program. The risk of making a mistake in my code that might let an attacker access my home network doesn't seem worth it at this point. Although it would have been nice to understand all of this before I spent so much time on this project, it was still interesting to learn new things about shared libraries, Python ctypes, and Linux network security.

2 comments:

  1. Hi Joey,

    Interesting idea and nice read again. Could VMA help you with the security concerns?
    I was also amazed with your amount of work given to keyboards variants in HHC 2023 presentation.

    Good luck in 2024.
    KR
    P

  2. Hi Pane,
    Thanks a lot! Maybe VMA is something that would help. I think I'll hold off on running anything like that from my home network for now. Thanks for the tip though.

    I read several of the articles on your MSP430 blog years ago and got a lot out of them.

    Take care,
    Joey
