<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xml:base="http://www.theindustrystandard.com" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
 <title>The Industry Standard - Experimental container support for 2.6.24 - Comments</title>
 <link>http://www.theindustrystandard.com/news/2007/11/07/experimental-container-support-2-6-24</link>
 <description>Comments for &quot;Experimental container support for 2.6.24&quot;</description>
 <language>en</language>
<item>
 <title>Experimental container support for 2.6.24</title>
 <link>http://www.theindustrystandard.com/news/2007/11/07/experimental-container-support-2-6-24</link>
 <description>&lt;p&gt;By Jonathan Corbet, IDG News Service&lt;/p&gt;
&lt;p&gt;&quot;Containers&quot; are a form of lightweight virtualization as represented by projects like OpenVZ. While virtualization creates a new virtual machine upon which the guest system runs, container implementations work by making walls around groups of processes. The result is that, while virtualized guests each run their own kernel (and can run different operating systems than the host), containerized systems all run on the host&#039;s kernel. So containers lack some of the flexibility of full virtualization, but they tend to be quite a bit more efficient.&lt;/p&gt;
&lt;p&gt;As of 2.6.23, virtualization is quite well supported on Linux, at least for the x86 architecture. Containers, by contrast, lag a little behind. It turns out that, in many ways, containers are harder to implement than virtualization is. A container implementation must wrap a namespace layer around every global resource found in the kernel, and there are a lot of these resources: processes, filesystems, devices, firewall rules, even the system time. Finding ways to wrap all of these resources in a way which satisfies the needs of the various container projects out there, and which also does not irritate kernel developers who may have no interest in containers, has been a bit of a challenge.&lt;/p&gt;
&lt;p&gt;Full container support will get quite a bit closer once the 2.6.24 kernel is released. The merger of a number of important patches in this development cycle fills in some important pieces, though a certain amount of work remains to be done.&lt;/p&gt;
&lt;p&gt;Once upon a time, there was a patch set called process containers. The containers subsystem allows an administrator (or administrative daemon) to group processes into hierarchies of containers; each hierarchy is managed by one or more &quot;subsystems.&quot; The original &quot;containers&quot; name was considered to be too generic - this code is an important part of a container solution, but it&#039;s far from the whole thing. So containers have now been renamed &quot;control groups&quot; (or &quot;cgroups&quot;) and merged for 2.6.24.&lt;/p&gt;
&lt;p&gt;Control groups need not be used for containers; for example, the group scheduling feature (also merged for 2.6.24) uses control groups to set the scheduling boundaries. But it makes sense to pair control groups with the management of the various namespaces and resource management in general to create a framework for a containers implementation.&lt;/p&gt;
&lt;p&gt;The management of control groups is straightforward. The system administrator starts by mounting a special cgroup filesystem, associating the subsystems of interest with the filesystem at mount time. There can be more than one such filesystem mounted, as long as each subsystem is attached to at most one of them. So the administrator could create one cgroup filesystem to manage scheduling and a completely different one to associate processes with namespaces.&lt;/p&gt;
&lt;p&gt;Once the filesystem is mounted, specific groups are created by making directories within the cgroup filesystem. Putting a process into a control group is a simple matter of writing its process ID into the tasks virtual file in the cgroup directory. Processes can be moved between control groups at will.&lt;/p&gt;
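&lt;p&gt;The workflow just described can be sketched in a few lines of Python. This is a minimal illustration rather than a management tool: the helper names create_group() and attach_task() are hypothetical, and the code assumes that a cgroup filesystem has already been mounted at the directory passed in as root. It uses only the two operations described above, making a directory and writing a process ID into the tasks file.&lt;/p&gt;

```python
import os


def create_group(root, name):
    """Create a control group by making a directory under the cgroup mount."""
    path = os.path.join(root, name)
    os.makedirs(path, exist_ok=True)
    return path


def attach_task(group_path, pid):
    """Move a process into a group by writing its PID to the tasks file."""
    with open(os.path.join(group_path, "tasks"), "w") as tasks:
        tasks.write(str(pid))
```

&lt;p&gt;With the filesystem mounted at some directory (the mount point /dev/cgroup is an assumption made for this example), calling attach_task(create_group(&quot;/dev/cgroup&quot;, &quot;batch&quot;), os.getpid()) would place the calling process into a new group named batch.&lt;/p&gt;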
&lt;p&gt;The concept of a process ID has gotten more complicated, though, since the PID namespace code was also merged. A PID namespace is a view of the processes on the system. On a &quot;normal&quot; Linux system, there is only the global PID namespace, and all processes can be found there. On a system with PID namespaces, different processes can have very different views of what is running on the system. When a new PID namespace is created, the only visible process is the one which created that namespace; it becomes, in essence, the init process for that namespace. Any descendants of that process will be visible in the new namespace, but they will never be able to see anything running outside of that namespace.&lt;/p&gt;
&lt;p&gt;Virtualizing process IDs in this way complicates a number of things. A process which creates a namespace remains visible to its parent in the old namespace - and it may not have the same process ID in both namespaces. So processes can have more than one ID, and the same process ID may be found referring to different processes in different namespaces. For example, it is fairly common in containers implementations to have the per-namespace init process have ID 1 in its namespace.&lt;/p&gt;
&lt;p&gt;What all of this means is that process IDs only make sense when placed into a specific context. That, in turn, sets a trap for any kernel code which works with process IDs; any such code must take care to maintain the association between a process ID and the namespace in which it is defined. To make life easier (and safer), the containers developers have been working for some time to eliminate (to the greatest extent possible) use of process IDs within the kernel itself. Kernel code should use task_struct pointers (which are always unambiguous) to refer to specific processes; a process ID, instead, has become a cookie for communication with user space, and not much more.&lt;/p&gt;
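&lt;p&gt;A toy model may help make the point concrete. The Python sketch below is not kernel code; the PidNamespace and Task classes are hypothetical stand-ins invented for this illustration, with Task playing the role of task_struct. It assumes the simplest possible allocation scheme (IDs handed out sequentially from 1) and shows one process carrying two different IDs, while the same number refers to different processes in different namespaces.&lt;/p&gt;

```python
class PidNamespace:
    """Toy model of a PID namespace: hands out IDs and remembers its parent."""

    def __init__(self, parent=None):
        self.parent = parent
        self.next_id = 1
        self.tasks = {}  # maps an ID in this namespace to a task object

    def register(self, task):
        pid = self.next_id
        self.next_id += 1
        self.tasks[pid] = task
        task.ids[self] = pid


class Task:
    """Stand-in for task_struct: exactly one object per process."""

    def __init__(self, name, namespace):
        self.name = name
        self.ids = {}  # maps each namespace that can see this task to its ID
        # A task is visible in its own namespace and in every ancestor.
        ns = namespace
        while ns is not None:
            ns.register(self)
            ns = ns.parent


root = PidNamespace()
Task("init", root)
Task("sshd", root)
container = PidNamespace(parent=root)
c_init = Task("container-init", container)

print(c_init.ids[container])                # 1: init of the new namespace
print(c_init.ids[root])                     # 3: same process, seen from outside
print(root.tasks[1] is container.tasks[1])  # False: ID 1 names two processes
```

&lt;p&gt;The Task object itself is the only handle that means the same thing everywhere, which is the reasoning behind preferring task_struct pointers inside the kernel.&lt;/p&gt;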
&lt;p&gt;This job of cleaning up PID use is not complete at this point. In fact, the process ID namespace work has a great many loose ends in general, to the point that some of the developers do not think that it is really ready to be used yet. In particular, there is concern that some of the management APIs could change, breaking code which is written for the 2.6.24 API. Adding new user-space APIs is always problematic in this regard: getting an API right is hard, and getting it right the first time is even harder. But user-space APIs are supposed to stay constant once they are merged; there is no provision for any sort of stabilization period where things can change. For PID namespaces, what&#039;s likely to happen is that the feature will be marked &quot;experimental&quot; in the hope that nobody will use it in its 2.6.24 form.&lt;/p&gt;
&lt;p&gt;Also merged for 2.6.24 is the network namespace patch. The idea behind this code is to allow processes within each namespace to have an entirely different view of the network stack. That includes the available interfaces, routing tables, firewall rules, and so on. These patches are in a relatively early state; they add the infrastructure to track different namespaces, but not a whole lot more. Quite a few internal networking APIs have been changed to take a namespace parameter, but, in most cases, the code simply fails any operation which is attempted in anything other than the default, root namespace. There is a new &quot;veth&quot; virtual network device which can be used to create tunnels between namespaces.&lt;/p&gt;
&lt;p&gt;The PID and network namespace patches have added a couple of lines to &amp;lt;linux/sched.h&amp;gt;:&lt;/p&gt;
&lt;p&gt;    #define CLONE_NEWPID        0x20000000      /* New pid namespace */&lt;/p&gt;
&lt;p&gt;    #define CLONE_NEWNET        0x40000000      /* New network namespace */&lt;/p&gt;
&lt;p&gt;These entries highlight an interesting problem: the CLONE_ flags are passed to the kernel as a 32-bit value. As of this writing, there are only two bits left for new flags. So the containers developers are going to run out of flags; how they plan to deal with that problem is not clear at this point.&lt;/p&gt;
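&lt;p&gt;To see where the two new flags sit within that 32-bit word, the short Python sketch below computes their bit positions. The flag values are copied from the header lines quoted above; bit_position() is a hypothetical helper written for this illustration, and the sketch makes no claim about which other bits are already taken.&lt;/p&gt;

```python
CLONE_NEWPID = 0x20000000  # values as quoted from the header above
CLONE_NEWNET = 0x40000000


def bit_position(flag):
    """Return the index of the single set bit in a flag value."""
    position = flag.bit_length() - 1
    assert flag == 2 ** position, "expected exactly one bit to be set"
    return position


print(bit_position(CLONE_NEWPID))  # 29
print(bit_position(CLONE_NEWNET))  # 30
# Flags are combined with bitwise OR; bit 31 is the top of a 32-bit word.
print(hex(CLONE_NEWPID | CLONE_NEWNET))  # 0x60000000
```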
&lt;p&gt;These developers are also working on the management of containers, and, in particular, how to move between them. One of the things likely to come out of that work in the near future is a proposal for a new system call:&lt;/p&gt;
&lt;p&gt;    int hijack(unsigned long clone_flags, int which, int id);&lt;/p&gt;
&lt;p&gt;This system call behaves much like clone() in that it creates a new process, but with an interesting twist. The new process created by clone() takes all of its resources - including namespaces - from the calling process; these resources will be copied or shared as directed by the clone_flags argument. A call to hijack(), instead, obtains all of those resources from the process whose ID is given in the id parameter. So it is possible to write a little program which forks via a hijack() call and runs a shell in the resulting child process; that shell will be running with all of the namespaces of the hijacked process.&lt;/p&gt;
&lt;p&gt;To make life easier for people working with containers, the which parameter was added in recent versions of this API. If which is passed as 1, the call treats id as a process ID, as described above. A value of 2, instead, says that id is actually an open file descriptor for the tasks file in a cgroup control directory. In this case, hijack() finds the lead process for that control group and obtains resources from there.&lt;/p&gt;
&lt;p&gt;This system call is new, and it has not seen a whole lot of review outside of the containers mailing list. So chances are that some changes will be requested once it becomes more widely visible; among other things, a name change might be called for. In general, there is a lot yet to be done with the containers code, but progress is visibly being made. There will come a point where the mainline kernel comes equipped with complete container capabilities.&lt;/p&gt;
</description>
 <comments>http://www.theindustrystandard.com/news/2007/11/07/experimental-container-support-2-6-24#comments</comments>
 <category domain="http://www.theindustrystandard.com/taxonomy/term/1402">IDGNS</category>
 <category domain="http://www.theindustrystandard.com/taxonomy/term/1599">Linux</category>
 <category domain="http://www.theindustrystandard.com/taxonomy/term/1558">non-Windows</category>
 <category domain="http://www.theindustrystandard.com/taxonomy/term/1615">Open source</category>
 <category domain="http://www.theindustrystandard.com/taxonomy/term/1556">Operating systems</category>
 <category domain="http://www.theindustrystandard.com/taxonomy/term/1520">Software</category>
 <category domain="http://www.theindustrystandard.com/taxonomy/term/5667">Software &amp;amp; Web</category>
 <category domain="http://www.theindustrystandard.com/taxonomy/term/98">Breaking News</category>
 <pubDate>Wed, 07 Nov 2007 09:09:24 -0800</pubDate>
 <dc:creator>IDG News Service</dc:creator>
 <guid isPermaLink="false">75390 at http://www.theindustrystandard.com</guid>
</item>
</channel>
</rss>
