Orchard: A Simple Alternative to XS
----------------------------------------------------------------------
**** Introduction
# This session is an overview of the Mostly-C runtime of Orchard and its interface to Perl.
* Mostly-C: a C-based runtime
# Mostly-C is a simple extension to C using a preprocessor and a small runtime.
# Mostly-C was created to optimize Orchard interfaces, to make writing optimized modules simpler, and to allow those modules to be reused in different languages.
* Orchard: a collection of design patterns
# In Perl, Orchard is primarily just "smart hashes": hashes with a lot of extra features.
# Several broad features, or design patterns, are consistently applied across many different sets of data.
# High-level Orchard will be covered in tomorrow's session "Orchard: A New API for XML".
# The next slide will describe the particular features of Orchard that needed to be optimized in C.
----------------------------------------------------------------------
**** Orchard
* Nodes (hashes)
# Nodes in Orchard are virtually the same as objects in most OO languages, except that the data model is emphasized over the behavior model.
# Orchard nodes can still have behavioral methods.
# Properties (attributes) are the primary focus of nodes.
* Accessors (those hashes are tied)
# All accesses to properties can be caught and handled specially, or
# can be passed on to storage handlers:
# in-memory, on-disk, or remote.
* Namespaces (hash keys)
# In Orchard, inspired by XML, every property name can be qualified by a namespace, which is a URI.
# This allows properties from different sources or schemas to appear on the same node.
# It also allows user data to be stored on nodes without fear of conflict.
* [$namespace_uri, $local_name]
# A namespace-qualified name is a two-element array containing a URI and a simple name.
* $NS->local_name
# A convenience module is available that lets you use a namespace variable and method syntax.
# Orchard supports "symbols", and those symbols can also be namespace qualified.
* Array accessors
# Orchard arrays are often specialized with either storage backends or data models, so they are also "smart arrays".
* Parent pointers and circular references
# Many Orchard data sets, XML for example, are tree-oriented and include parent references, which are circular.
# Orchard automatically frees circular references.
* Storage backends
* In memory, on disk, remote
# Orchard supports pluggable storage modules that let data models be stored anywhere.
* XML and non-XML node sets
# All of Orchard's features are shared among all types of nodes.
# Many XML tools, such as queries, transforms, and schemas, can be applied to non-XML node sets.
----------------------------------------------------------------------
**** Mostly-C Background
* Developed to make Perl XML simpler and faster
# Orchard was developed to make XML simpler. Mostly-C was developed to make Orchard perform acceptably, and it just happens to be faster than pure Perl.
* Objective-C
# Mostly-C's runtime is similar (but not identical) to Objective-C's:
# all methods are virtual, all methods get passed self and a selector, and undefined methods can be caught.
# Mostly-C would have used Objective-C if it were more cross-platform and had garbage collection.
* "C++ in C" (W3C, Gnome)
# Since Mostly-C is not Objective-C, it comes closer to the "C++ in C" style used in Gnome and common in W3C tools.
# Mostly-C uses a preprocessor to make this style easier, and to add dynamic dispatch and attribute access syntax.
* SGML Groves
* Storage backends
* Wide variety of node types with shared functionality
* Scalable schemas
# SGML Groves are the inspiration for Orchard in general, and are where the "smart data model" idea comes from.
# Mostly-C makes that practical within scripting languages.
* Shared code between languages
# Mostly-C code can be reused among many dynamic languages.
# This increases both the benefit of using Mostly-C and the audience of those who would use it.
* Modern APIs and design patterns
# Orchard and Mostly-C use many techniques and design patterns that have only recently become widespread, such as dynamic compiling and loading, XML tools, namespaces, internationalization (I18N), and pluggable functionality.
* Extension language first, elegance second
# One thing should be clear, though: Mostly-C isn't a "new language" set to take over the world. It's an extension language for existing languages like Perl, Python, Tcl, and Ruby.
# This means that there are rough spots in Mostly-C, particularly where it clings to native C when other OO C extensions don't.
# On the other hand, this also reduces the amount of new syntax and runtime, making Mostly-C much simpler (closer to the "C++ in C" style of writing code than to a whole new language).
----------------------------------------------------------------------
**** Example of Mostly-C code
    @namespace itr urn:to-be-determined
# Defines a prefix 'itr' bound to a namespace 'urn:to-be-determined'
    @class XML_Element(Node)
# Defines a class 'XML_Element', subclassed from 'Node'
    typedef id XML_Element;
# A rough spot: every Mostly-C class has a corresponding C structure, all of which derive from the 'id' structure
    @new() {
        self = Node__new(NULL, 0);
        self->isa_ = &XML_Element_isa_;
        @self.contents = List__new(NULL, 0);
        return self;
    }
# Creates a new XML_Element object from the superclass, initializes its class and instance information, and returns it
# More C roughness in 'Node__new()' and '&XML_Element_isa_'
# All 'new()' methods are straight C functions, for now
    @.itr:node_type {
        return XML_ElementType;
    }
# A property accessor that returns the node's type (a symbol).
# This property is namespace qualified: (urn:to-be-determined, node_type)
----------------------------------------------------------------------
**** Example of Mostly-C code (cont.)
    @.name {
        id prefix = @self.prefix;
        id local_name = @self.local_name;
        if (MOC_NIL == local_name) {
            local_name = @"";
        }
        if (MOC_NIL != prefix) {
            id str;
            str = @prefix.concat(@":");
            str = @str.concat(local_name);
            return str;
        } else {
            return @local_name.copy();
        }
    }
# This is the read accessor for the XML element name.
# XML element names may contain a prefix and a local name; this read accessor builds the full name from the "raw" properties 'prefix' and 'local_name'.
----------------------------------------------------------------------
**** Example of Mostly-C code (cont.)
    @.name= {
        const char *name = String_s(value);
        char *localname;
        if (has_prefix(name, &localname)) {
            @self.prefix = Symbol__new_sn(name, localname-name-1);
            @self.local_name = Symbol__new_s(localname);
        } else {
            @self.prefix = MOC_NIL;
            @self.local_name = Symbol__new_s(name);
        }
        return self;
    }
# This is the write accessor for the XML element name.
# It does the inverse, splitting the element name into prefix and local_name and storing them separately.
----------------------------------------------------------------------
**** Class Summary
# Mostly-C has very few core classes at this point, mostly just support for nodes, arrays, some simple types, and XML.
* Node and mapping types
# Node and mapping types are the backbone of Orchard, where most of Orchard's interfaces and patterns are applied.
# Nodes have properties, storage plug-ins, and all of Orchard's "smart" features
* Sequence types
# Basic arrays, which may have some smarts to them when necessary
* Simple types
# Integers, strings, nil
* Node sets
# Each node set contains a cluster of classes, one for each type of node.
# At some point, node sets may just be compiled from schema languages.
# Most nodes still allow user-defined or external properties to be stored on them.
----------------------------------------------------------------------
**** Syntax -- accessors, literals, and symbols
# '@' introduces all Mostly-C syntax that is not C syntax.
* Read accessors
    @variable.property
    @variable.prefix:property
# Read accessors can be used wherever C allows expressions
* Write accessors
    @variable.property =
    @variable.prefix:property =
# Write accessors are the left-hand sides of assignments
# Write accessors are converted to methods taking the value to be stored
* Methods
    @variable.method([parameters])
    @variable.prefix:method([parameters])
# Methods can also be namespace qualified, since
# class implementations can be dynamically modified or extended
* Literal objects
    @"string"
    @1234
# Literal objects are shortcuts for their equivalent C functions
* Symbols
    @prefix:symbol
# Symbols are tuples of namespace URI and local name.
# Symbols are cached, so each one is stored only once.
# Symbols can therefore be compared as C pointers.
# *NOTE* This symbol syntax has not yet been added to Mostly-C
----------------------------------------------------------------------
**** Syntax -- class definition
* Namespace prefix declaration
    @namespace prefix uri
* Class declaration
    @class classname(superclass [, ...])
# Multiple inheritance is currently supported, but will likely go away in favor of dynamically loaded overlays
* Read accessor method
    @.property { ... }
    @.prefix:property { ... }
* Write accessor method
    @.property= { ... }
    @.prefix:property= { ... }
* Instance method
    @.method([parameters]) { ... }
    @.prefix:method([parameters]) { ...
}
----------------------------------------------------------------------
**** Runtime
# These are the main characteristics of the Mostly-C runtime
* Dynamic methods and accessors
# Every class has a dispatch table
# Methods are dispatched inline, so they are as fast as C++ virtual methods
# Accessors are really GET_ and PUT_ methods
* Default accessors are storage methods
# If a read or write accessor is not defined for a node, the access is passed to the storage backend
* All objects are of C type 'id'
# As in Perl, the class of an object doesn't matter until you try to call a method on it; it will then throw an exception if the method is undefined
* Methods passed 'self' and method selector
* Constants and symbols
* @moc:nil, @moc:true, @moc:false, @uri:type
# *NOTE* Symbols are not yet implemented
* Garbage collected
* Perl bridge handles reference counting
* Primitive type conversions are C functions
# All Mostly-C methods take and return only 'id' type values, so conversions to and from non-id types are done with straight C functions.
# There is an intrinsic namespace for Mostly-C defined properties and symbols, 'urn:to-be-determined', conventionally declared with the prefix 'itr'.
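# A minimal sketch of the dispatch-table idea on this slide, written in plain C rather than Mostly-C. Every name in it (class_info, node_get, node_put, the property enum, the toy storage array) is an assumption made for illustration and is not the actual Mostly-C runtime API.
# It shows a per-class table indexed by selector, accessors that receive 'self' and the selector, and the fall-through to a storage backend when a class defines no accessor.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical selectors: one per property, used as table indexes. */
enum property { PROP_NODE_TYPE, PROP_TITLE, PROP_COUNT };

typedef struct object *id;                 /* every object has C type 'id' */
typedef id (*getter_fn)(id self, enum property sel);
typedef id (*putter_fn)(id self, enum property sel, id value);

/* Per-class dispatch table; a NULL slot means "no accessor defined". */
struct class_info {
    getter_fn get[PROP_COUNT];
    putter_fn put[PROP_COUNT];
};

struct object {
    const struct class_info *isa_;         /* points at the class table */
    id storage[PROP_COUNT];                /* toy in-memory storage backend */
};

/* Stand-in for a symbol: only its address matters, compared by pointer. */
static struct object element_type_symbol;

/* Default accessors: plain reads and writes against the storage backend. */
static id storage_get(id self, enum property sel) {
    return self->storage[sel];
}

static id storage_put(id self, enum property sel, id value) {
    self->storage[sel] = value;
    return self;
}

/* A class-defined read accessor, receiving self and the selector. */
static id element_get_node_type(id self, enum property sel) {
    (void)self;
    (void)sel;
    return &element_type_symbol;
}

/* The element class overrides only the node_type read accessor. */
static const struct class_info element_class = {
    .get = { [PROP_NODE_TYPE] = element_get_node_type }
};

/* Dispatch: use the class accessor if present, else the storage backend. */
static id node_get(id self, enum property sel) {
    assert(self != NULL && sel < PROP_COUNT);
    getter_fn fn = self->isa_->get[sel];
    return fn ? fn(self, sel) : storage_get(self, sel);
}

static id node_put(id self, enum property sel, id value) {
    assert(self != NULL && sel < PROP_COUNT);
    putter_fn fn = self->isa_->put[sel];
    return fn ? fn(self, sel, value) : storage_put(self, sel, value);
}
```

# For an object whose isa_ points at element_class, node_get(obj, PROP_NODE_TYPE) runs the class's own accessor, while node_put(obj, PROP_TITLE, v) followed by node_get(obj, PROP_TITLE) goes through the storage fallback. Indexing the table by selector is one way to keep dispatch as cheap as a virtual call; the real runtime may do this differently.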
----------------------------------------------------------------------
**** Mapping Perl to Mostly-C
* Mostly-C nodes act like Perl blessed hashes
# Orchard nodes were intended to look just like data stored in hashes, but with magic added in
# Ordinary hashes should work well in most places in Orchard, in place of any other "live" node
* $node->method() calls @node.method()
# *NOTE* Perl does not yet have a convenient way to call namespace-qualified methods, but those are not used in core Orchard
# Namespace-qualified methods use name mangling of the URI and local name, so there is a brute-force way
* $node->{'foo'} accesses @node.foo
* $node->{[$uri, 'name']} accesses @node.prefix:name
# Convenience methods are available to make this simpler (as will be shown in the upcoming example)
# Tuple accessors can be stored in Perl variables
* Mostly-C accessors intercept fetches and stores
# ...all based on ties.
* Mostly-C arrays are wrapped as tied arrays
* Mostly-C simple values convert to Perl scalars
* Property names are case and underscore mapped
* 'NamespaceURI' <--> 'namespace_uri'
# This allows different languages to maintain their own styles.
----------------------------------------------------------------------
**** Trivial Perl Example
# As shown earlier in the Mostly-C code, Name has read and write accessors that split the name across the raw Prefix and LocalName properties
* Creating and accessing an XML Element
    use Orchard::XML::NonOpt;

    my $factory = Orchard::XML::NonOpt::Document->new();
    my $element = $factory->createElement();
    $element->{Name} = 'Fu:Bar';     # write accessor sets Prefix and LocalName
    $element->{Prefix} = 'Foo';      # change only the raw prefix
    die if ($element->{Name} ne 'Foo:Bar');   # read accessor rebuilds the name
----------------------------------------------------------------------
**** RSS example
# There's nothing really unique about this code from a Perl perspective, which is one of the most important aspects of Orchard.
# What can't easily be shown in a small code example is that a variety of features and storage options are available with any node type, and are shared with all other node sets.
# This example shows the convenience function for using namespace-qualified properties, in $DC.
    use Orchard qw{namespace};
    use Orchard::RSS;

    my $channel = Orchard::RSS->load('http://MonkeyFist.com/rss1.php3');
    my $DC = namespace('http://purl.org/dc/elements/1.1/');

    print "Site: " . $channel->{Title} . "\n";
    print "URL: " . $channel->{Link} . "\n";
    print "Description: " . $channel->{Description} . "\n";
    print "Copyright: " . $channel->{$DC->rights} . "\n";
    print "Language: " . $channel->{$DC->language} . "\n";
    print "Publisher: " . $channel->{$DC->publisher} . "\n";
    print "\n";
    print "Items:\n";
    foreach my $item (@{ $channel->{Items} }) {
        print " Title: " . $item->{Title} . "\n";
        print " Link: " . $item->{Link} . "\n";
        print " Description: " . $item->{Description} . "\n";
        print " Link creator: " . $item->{$DC->creator} . "\n";
        print " Link date: " . $item->{$DC->date} . "\n";
        print "\n";
    }
----------------------------------------------------------------------
**** Performance
* XML Parsing
* Baseline (1x) is Expat with no handlers
# By not defining handlers, Expat only parses the file and does not spend any time calling user code.
# Bigger numbers are the number of times slower than the baseline.
* Mostly-C SAX handlers -- 3.5x
* XML::Parser -- 12.7x -- Mostly-C calling Perl SAX is the same
# In this case, it's the crossing of the Perl boundary on every event that takes most of the time, which is why XML::Parser and Perl SAX are about the same.
* SAXDriver::XMLParser, pure Perl Orchard -- 398.3x
* Mostly-C's raison d'etre!
# Getting Orchard features in pure Perl is excruciatingly slow, due to tied hashes and optional accessors.
# Mostly-C is the optimization of that.
* XML in memory
# Because Orchard allows accessors, the underlying storage model can be optimized, sometimes extensively, while still providing the same view to the user.
* Baseline (1x) is XML file size
# ...on disk.
# Bigger numbers are the number of times larger than the file size.
* Mostly-C XML "fast and small" nodes -- 7.8x
# "Fast and small" nodes use C structure fields for the primary properties, and accessors to read and write them, thus saving the space for keys.
# "Fast and small" nodes can still take user properties by allocating a storage node.
* XML::Parser's Tree style -- 9.3x
# XML::Parser's Tree style uses a simple array structure of the element name and contents (sub-arrays), and does not support any other XML information items such as comments, processing instructions, namespaces, or declarations.
* XML::Grove (hashes) -- 19.1x
# XML::Grove, a precursor to Orchard, uses plain hashes to store the properties of nodes.
# This figure is typical of most Perl XML tree modules (XML::DOM, XML::XPath)
* Parsing speed is at the Perl interface barrier
# Mostly-C optimizes the "smart" part of nodes, and is doing its job perfectly if it does that and is no slower than pure Perl.
* Port Perl code to Mostly-C for speed
# On the other hand, porting specific functions to Mostly-C can improve the speed 2x-100x, depending on what you're doing.
# Support for Inline will make this easy to do on a function-by-function basis.
* Tree sizes can shrink, possibly below 1x
# In the case of XML, there are several techniques for sharing redundant data that can be used to reduce the tree size further.
# Shrinking the memory footprint is a tradeoff with performance, and the savings are easily lost if significant user data is added to nodes.
----------------------------------------------------------------------
**** Mostly-C and Inline
* Very complementary!
* Plans for Inline::MostlyC
* Inline makes using C code in Perl easy
* Mostly-C code is portable, object-based, and GC'd
* Mostly-C can borrow from Inline
# Mostly-C Perl bridge likely to be rewritten in Inline
# Use Inline alone when writing straightforward C functions or interfaces to existing libraries
# Use Mostly-C when wrapping a data-access library using Orchard patterns
# Use Mostly-C when module or library wrapper is significant enough to be used by multiple languages
----------------------------------------------------------------------
**** Not covered in this session
* Language interface API
# Mostly-C language bindings interface with Orchard via a C interface.
* In-depth behaviors shared by Orchard types
# Many of the features of Orchard are part of the implementation of Orchard modules and don't really affect the design or use of Mostly-C, and thus aren't described here.
# High-level Orchard will be covered in tomorrow's session "Orchard: A New API for XML".
* Using Mostly-C by itself
# Mostly-C can be used standalone, outside of any hosting language
* Writing portable classes
# As new language bindings are developed, there will need to be guides for writing Mostly-C modules that can be used well by multiple languages.
----------------------------------------------------------------------
**** Where we're headed
* Schema languages for accessors and validation
# Schema languages will save writing many accessors for validating data as it's stored.
* Language bindings
# Python, Tcl, Ruby
* Dynamic loading
# Currently, all Mostly-C modules have to be pre-processed at the same time and recompiled, and the Makefile is a headache.
# In the next major release after this one, dynamic compilation and loading will be used, and all at runtime (unless precompiled).
* Storage drivers
# Only the in-memory storage driver is implemented.
# Storage drivers for on-disk and remote storage will be written.
# Wrappers for binary file formats are a special form of storage driver.
* Meta classes
# Meta classes will provide information for dynamic loading, properties and schemas, and introspection
* Mini-runtime kit (bundle into Perl?)
# Once modules are separated out by dynamic compilation and loading, the core of Orchard becomes incredibly tiny.
# This core, for example, could be bundled into Perl and allow one to install a Mostly-C module as easily as a Perl module.
* Optimizations and conveniences
# Very few optimizations have been made in the current versions of Mostly-C and the Perl binding.
# More convenience functions or syntax would make writing Mostly-C code that much simpler.
----------------------------------------------------------------------
**** Resources and contact
* http://Orchard.SourceForge.net/
* C++ in C: http://www.w3.org/Library/User/Style/Cpp.html
* Ken MacLeod
* Matt Sergeant