Hanging network in XEN with bridging

This article is about a strange network problem that occurs when using bridging in a XEN environment.

Symptoms

  • A DomU is no longer reachable from the network.
  • The problem persists after restarting the DomU (in every variant: reboot, shutdown+create, destroy+create).
  • Possibly: the DomU still has a virtual NIC, but when pings are sent from the DomU via the console, only ARP requests are seen on the virtual NIC in Dom0, with no replies.
  • Possibly: there are log entries such as: xenbr1: port 4(domu_eth0) entering disabled state.


Endian 2.3 in Xen on Debian Lenny

This article is about setting up Endian 2.3 on a XEN machine without hardware virtualization (hvm).

We work in a shared office space. It’s not one of those big anonymous spaces you can rent on a monthly basis, but it does not belong to us alone. We share it with p4930 (architects) and control-b (we work together). However, some of the architects use Windows machines (the rest of us use either Linux or Mac), and these caught a virus. The ugly bugger happened to be a spam bot and sent out a lot of UCE via our VDSL uplink. Telekom (T-Online, our internet provider) decided to block outgoing SMTP on port 25, which was quite reasonable, but it was just the beginning of a lot more problems.

Until then, our network setup was shared between the architects and our side (control-b and fortrabbit), using the architects’ uplink as a fallback if our VDSL went offline. This was mainly the legacy of the previous tenant, who wired the whole office with a 100 Mbit network hidden in nice panels around the office space. The main router was an old Linksys with DD-WRT installed, not fully capable of routing the full downstream bandwidth of 50 Mbit, but working and quite easy to configure.

Now it was time to go separate ways, and we decided to go with Endian on our side.

We wanted it to be a VM on our XEN-based office server. The server does not support hardware virtualization, so the “normal” approach (HVM) did not work. So here is what we did.

Using OpenVPN as fallback connection

This article is about how to set up a fallback connection to reach your network if your NAT fails (e.g. your primary router with NAT dies and your failover without NAT comes up).

A week ago, our internet provider in the office (T-Online, Deutsche Telekom) decided to cut our connection, or at least could no longer keep it stable. Since then, we see the sync LED of the VDSL modem going on and off, and it really sucks. Luckily we are in an office complex with multiple WAN uplinks (from other companies) and could thus obtain WLAN access from our neighbours. Of course, this is not a good solution (their uplink has a tenth of our bandwidth), especially because we run some services (e.g. Nagios, some SSH servers and so on) NATed from our office router/firewall. Hopping onto the next best WLAN allows us to keep working but does not allow us to access our office server remotely (I could not and would not configure “foreign” routers).

So we needed a solution and found one using OpenVPN. It’s quite simple and took me about an hour to set up (mostly due to the ups and downs of our VDSL).

The general idea is to establish a VPN connection from the office to a remote machine of ours and then NAT from there back into the office. The VPN connection is established either via our “real” uplink or via the failover. It should not matter which; that’s the point.


Perl and BerkeleyDB

In our mail servers (Postfix) we use a lot of policy servers, which are custom Perl servers, e.g. granting sender permissions to external mail servers or building live statistics of mail throughput. However, I recently ran into problems starting one of those that uses a Berkeley DB backend (aka bdb, via BerkeleyDB and BerkeleyDB::Env with DB_INIT_MPOOL from CPAN). When starting it manually, I always got “Lock table is out of available locker entries”, which could not really be the case, because the Perl policy server was the one and only process accessing this database and the server had just been set up: no traffic, a single newborn thread. Checking with db_stat revealed that bdb really thinks all lockers are in use (1000 of 1000):

1025	Last allocated locker ID
0x7fffffff	Current maximum unused locker ID
5	Number of lock modes
1000	Maximum number of locks possible
1000	Maximum number of lockers possible
1000	Maximum number of lock objects possible
500	Number of current locks
501	Maximum number of locks at any one time
1000	Number of current lockers
1000	Maximum number of lockers at any one time
33	Number of current lock objects
34	Maximum number of lock objects at any one time
1504	Total number of locks requested
1004	Total number of locks released
0	Total number of locks upgraded
1000	Total number of locks downgraded
0	Lock requests not available due to conflicts, for which we waited
0	Lock requests not available due to conflicts, for which we did not wait
0	Number of deadlocks
0	Lock timeout value
0	Number of locks that have timed out
0	Transaction timeout value
0	Number of transactions that have timed out
712KB	The size of the lock region
0	The number of region locks that required waiting (0%)

Well, the solution was to clear the environment directory on restart, because the environment state, including these counters, is (of course) persistent! So I removed every “__db.*” file in the environment directory and everything worked like a charm again.
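
The cleanup described above can be sketched in a few lines of Python. This is only an illustration, not the code the policy server actually uses: the function name clear_bdb_region_files and the idea of calling it from a start script are assumptions, and it must only run while no process has the environment open.

```python
from pathlib import Path


def clear_bdb_region_files(env_dir):
    """Delete the __db.* files in env_dir and return the removed file names.

    Hypothetical helper: only call it while no process has the
    Berkeley DB environment open, e.g. right before the server starts.
    """
    removed = []
    for path in sorted(Path(env_dir).glob("__db.*")):
        # Only touch regular files; the database files themselves
        # do not match the __db.* pattern and stay in place.
        if path.is_file():
            path.unlink()
            removed.append(path.name)
    return removed
```

Calling clear_bdb_region_files with the environment directory before the server opens BerkeleyDB::Env would mirror the manual fix; the database files themselves are left alone.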

Building FUSE s3fs Debian Package

I wanted to play with S3 via s3fs on FUSE but couldn’t find a Debian package. Normally this means a lot of work writing a couple of build and rules files, which scares me off, but I thought: just give it a try. So I did, and it was so easy that I have to share it.


WordPress, Flash Uploader and headaches

I was helping my brother create his new website based on WordPress. He has a lot of images to upload, and the Flash Uploader simply didn’t work. All we got was a plain red “HTTP Error”. A lot of Google research suggested that we either had wrong directory permissions (impossible: it is my server, each VirtualHost runs as a single Unix user and has rw access to all of its own directories) or were running mod_security (which we don’t). However, I spent a lot of time testing with all the browsers in my portfolio (FF 2, 3, 3.5, Opera 9.x, Google Chrome, IEx), but always with the same result.


Detaching from Postfix MailQ

You may know this scenario: one sunny morning, some guy using your mail servers decides to send a really big mailing via you and has forgotten to be polite and remove all the recipients who no longer exist (as in: address closed) and, even less polite, all the recipients who want to be removed! Yes, this is spam, and it runs through your mail servers.

Now you have to clean your queue and remove all those unwanted mails. That’s where the problem starts: with the standard Postfix tools it’s quite slow and a terrible lot of work. The “normal” approach, to simply run mailq, copy the ID and remove it via postsuper -d ID, of course reaches its limits at, let’s say, 10 mails. Then it becomes a pain in the ass. OK, what else can you do? Remove all deferred mails, as in postsuper -d ALL deferred, but is this really what you want? There could also be a lot of mails from other clients (which are queued because of the big mailing).

OK, here is what I wanted: to remove all mails from a certain sender (a single email address or a whole domain) to a certain recipient (in some cases; sometimes all recipients), but, and this is really important to me, not simply remove those mails. I want to save them, so that I can re-inject them later on or zip them and give them back to the sender.
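
To illustrate the selection step only, here is a minimal Python sketch that picks queue IDs by sender and, optionally, recipient from text in the usual mailq layout. It is not the tool this entry is about: the function names parse_mailq and select_queue_ids are made up for this example, the regular expression assumes the common mailq output format, and saving or deleting the mails is deliberately left out.

```python
import re

# A queue entry line: queue ID, optionally followed by "*" (active) or
# "!" (on hold), then size, arrival time and sender address.
ENTRY_RE = re.compile(
    r"^(?P<id>[0-9A-Za-z]+)[*!]?\s+(?P<size>\d+)\s+"
    r"(?P<date>\w{3}\s+\w{3}\s+\d+\s+[\d:]+)\s+(?P<sender>\S+)\s*$"
)


def parse_mailq(text):
    """Turn mailq-style text into a list of {id, sender, recipients} dicts."""
    entries = []
    current = None
    for line in text.splitlines():
        match = ENTRY_RE.match(line)
        if match:
            current = {"id": match.group("id"),
                       "sender": match.group("sender"),
                       "recipients": []}
            entries.append(current)
        elif not line.strip():
            current = None  # a blank line ends the current entry
        elif current is not None and line[:1].isspace():
            # Indented lines below an entry are its recipients.
            current["recipients"].append(line.strip())
        # Anything else (header, "(reason)" lines, footer) is ignored.
    return entries


def select_queue_ids(entries, sender, recipient=None):
    """Return IDs whose sender (and, if given, a recipient) matches.

    A pattern starting with "@" matches a whole domain,
    anything else must equal the full address.
    """
    def matches(address, pattern):
        address, pattern = address.lower(), pattern.lower()
        if pattern.startswith("@"):
            return address.endswith(pattern)
        return address == pattern

    selected = []
    for entry in entries:
        if not matches(entry["sender"], sender):
            continue
        if recipient is None or any(matches(r, recipient)
                                    for r in entry["recipients"]):
            selected.append(entry["id"])
    return selected
```

The returned IDs are the ones you would otherwise copy by hand from the mailq listing; what happens to those mails next (saving, then removing) is a separate step.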


Good old simple quotas

Recently we ran into some storage issues on our office NAS. We have a RAID5 (software, mdadm) with 4 disks (+1 hot spare), about 1.2 TB of net space, and some users decided to store their “non-work stuff” there as well. Because of my latest studies of ZFS (a post will follow soon) I was very aware of what is possible and what I wanted: quotas. Because of our hosting work, I use many different techniques for achieving quotas (for example our mail server running Dovecot, whose quotas are based on the LDAP user database), which usually fit the needs of the particular application much better. However, our office NAS (sadly) has to export storage over multiple protocols (NFS, SMB, AFS, ..), so I had to find another approach. A colleague of mine reminded me of the most simplistic approach: filesystem quotas ;)

I found a lot of articles on the matter; however, most were focused on a single aspect of the topic and didn’t cover “the whole” thing (or at least not from my point of view). So I had to read and google a lot until I fully understood what I need and what I can do.

Thus, here is my summary.


Firefox 3.0 and SSDs

For quite a while now I have been fighting with Firefox on my Acer Aspire One. The problem is that my netbook has an early-generation SSD. Apart from the fact that the maximum write throughput is around 5 MB/s, the whole system freezes as soon as several small write operations are pending.

The laptop runs Ubuntu 8.04. Out of sheer paranoia it is also fully encrypted (which of course reduces I/O performance even further). My user partition is on an SD card, so that is also where the FF user data lives. Throughput is higher there (~15 MB/s), but the many-writes problem is just the same.

I have read and applied pretty much every guide on reducing I/O load. All of that brought considerable improvements, but one problem simply did not go away: every now and then FF just froze! It was reproducible and happened particularly often on JavaScript-heavy pages (such as the WordPress admin). I have finally found a way to fix that as well.

First, all the measures you can read about everywhere on the net and that should be applied in any case:

  • Limit the history to at most 3 days (Edit -> Preferences -> Privacy -> History)
  • Disable the Firefox anti-phishing settings (Edit -> Preferences -> Security, then switch off both “Tell me if the site I’m visiting is a suspected attack site” and “.. forgery site”. Caution: only do this if you are sure what it does and can deal with the consequences!)
  • Move the cache into RAM: in about:config create the new key browser.cache.disk.parent_directory (right-click -> New -> String) and set its value to /dev/shm

As a result, FF should now feel noticeably smoother. Unfortunately, the nerve-racking freezes (input stops responding for several seconds up to a few minutes, sometimes affecting the whole system) still cannot be fixed this way. The solution is surprisingly simple. You have to add one more key in about:config: toolkit.storage.synchronous (right-click -> New -> Integer), with the value 0. Caution: as far as I have read, this can under some circumstances lead to complete loss of the history and so on.

After restarting FF everything should be applied. The perceived behaviour is still not like on a system with a normal HD, but at least I can work with it properly again.
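
Instead of clicking through about:config, the two keys can also be kept in a user.js file in the Firefox profile directory, which Firefox reads on startup. The following Python sketch only renders the file content; the function name render_user_js and the PREFS dictionary are made up for this example, and where the profile directory lives is left to you.

```python
import json

# The two about:config keys from the text above and their values.
PREFS = {
    "browser.cache.disk.parent_directory": "/dev/shm",
    "toolkit.storage.synchronous": 0,
}


def render_user_js(prefs):
    """Render a dict of preferences as user_pref(...) lines for user.js."""
    lines = []
    for name, value in prefs.items():
        # json.dumps produces valid JavaScript literals for the
        # string and integer values used here.
        lines.append("user_pref({}, {});".format(json.dumps(name),
                                                 json.dumps(value)))
    return "\n".join(lines) + "\n"
```

The same caution applies as for setting the keys by hand: toolkit.storage.synchronous = 0 trades safety of the history data for responsiveness.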

For those interested: this key switches off fsync for the databases (places.sqlite etc.). The problem is that the sqlite library apparently synchronizes “too often”, which results in high I/O load (or rather a high number of I/O operations, which on an early SSD leads to long I/O wait).

FastCGI for the Best, suPHP for the Rest

Like every shared hoster, we too were confronted with the question: how do we solve the PHP problem?

What this problem is and why it cannot be solved with mod_php is already described sufficiently on the net.

Back then I quickly became friends with the mod_fastcgi + php-cgi combination, and it has now been in production for 5 months. Over time, however, one big drawback has emerged: the resource consumption of the FastCGI solution is simply enormous!

To separate our customers from each other as simply and securely as possible while still staying flexible, we assigned each virtual host its own Unix user, under whose privileges the FastCGI servers run (if PHP is used). So we do not even share PHP processes between two virtual hosts that belong to the same customer.
The largest vhost currently runs with more than 20 concurrent servers, and that number will be doubled in the foreseeable future because it fully uses all of them. The problem starts with the small ones, the very small ones, which sit idle in RAM with a single process and do nothing at all. And that is almost all of them, apart from a handful of sites that actually get visitors.

The envisaged solution therefore has to meet three goals:

  1. It must guarantee security in a shared hosting environment (at least separate user processes), which mod_php does not achieve
  2. Good performance that, above all, scales upwards for large websites
  3. Economical use of resources for small and mostly idle websites

Fortunately we do not need a one-fits-all solution; we can simply use two different techniques. For large, heavily frequented sites we keep using mod_fastcgi. The “constant” RAM usage of the persistent processes is not the bottleneck there. For small, rarely visited sites we now use suPHP instead. Once again, though, not everything works the way it should. The suPHP Debian packages do not support one (IMO the most important) feature: “suPHP_UserGroup <user> <group>” to define the user and group for a virtual host. So I built them myself:

Packages for i386 are available here.
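
With a suPHP build that supports the directive, each small virtual host needs only a couple of lines naming its Unix user and group. As a rough sketch of how such a fragment might be generated per vhost, here is a small Python helper. The function name render_suphp_fragment and the example user are hypothetical; the rest of the vhost configuration, and how PHP files are mapped to suPHP, is not shown.

```python
def render_suphp_fragment(user, group):
    """Render the suPHP lines for one virtual host.

    user and group are the Unix account the PHP scripts of this
    vhost should run as (hypothetical values in the example below).
    """
    for value in (user, group):
        # Reject empty names and names with whitespace, which would
        # silently change the meaning of the directive.
        if not value or any(ch.isspace() for ch in value):
            raise ValueError("user and group must be non-empty "
                             "and contain no whitespace")
    return "suPHP_Engine on\nsuPHP_UserGroup {} {}\n".format(user, group)
```

For example, render_suphp_fragment("web42", "web42") yields the two lines to place inside that customer's VirtualHost block.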