
January 13, 2010

Insight series: parallel processing

This is the first in a series of mini-tutorials designed to provide additional insight into specific technical features and capabilities of the expressor semantic data integration system. We're starting with an examination of our implementation of parallel processing.

expressor uses three compatible and combinable techniques for parallel processing to provide high scalability while maintaining high performance:


Pipeline parallelism

With pipeline parallelism, an arbitrary set of contiguous records is processed sequentially, passing from operator to operator. Like the stages of a command-line pipe, each expressor operator in a drawing (data flow diagram) simultaneously processes a different record in the same record stream.

In the following diagram, while the furthest downstream operator is processing record 1, its upstream neighbor is processing record 2, the next upstream operator is processing record 3, and so on. An application can therefore process as many records in parallel as it has operators.

[Figure: expressor pipeline parallelism]
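The pipeline idea can be sketched in a few lines of Python. This is only an illustration of the technique, not expressor's implementation: the names `run_stage` and `run_pipeline` are hypothetical, and threads connected by queues stand in for expressor operators, which run in separate processes.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the record stream


def run_stage(func, inbox, outbox):
    """Apply func to each record from inbox and pass the result downstream."""
    while True:
        record = inbox.get()
        if record is _DONE:
            outbox.put(_DONE)  # propagate end-of-stream to the next stage
            return
        outbox.put(func(record))


def run_pipeline(records, stages):
    """Run records through the stages, one thread per stage, preserving order."""
    # One queue in front of each stage, plus one for the final output.
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(stages)
    ]
    for t in threads:
        t.start()

    # Feed the stream; while stage 2 handles record 1, stage 1 can take record 2.
    for record in records:
        queues[0].put(record)
    queues[0].put(_DONE)

    results = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results
```

Calling `run_pipeline([1, 2, 3], [lambda x: x + 1, lambda x: x * 2])` passes every record through both stages in order. Because each stage has its own thread and its own input queue, a downstream stage can work on an earlier record while an upstream stage is already handling a later one.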

Depth parallelism

With depth parallelism, multiple input and output processor motors run simultaneously. One motor does not need to finish its processing before another motor can begin. Depth parallelism allows a high-speed CPU, or multiple CPUs or cores, to simultaneously service multiple slower I/O connections.

The following figure illustrates depth parallelism where multiple input and output motors run simultaneously. In the drawing, two input motors feed data records into a transform operator (e.g. a joiner or joiner-multi operator). The last operator splits the records into multiple channels (see partition parallelism), which are delivered to multiple output motors. The operators also use pipeline parallelism and all the motors and operators run simultaneously in separate processes.

[Figure: expressor depth parallelism, example 1]

An alternative depth parallelism design is to use multiple data processing pathways in a single drawing. Again, all the input and output motors and operators can run simultaneously in separate processes.

[Figure: expressor depth parallelism, example 2]
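As a rough sketch of the first design, the Python below runs two input steps at the same time and hands their results to a joiner. Everything here is an assumption made for illustration: `read_source`, `join_on_key` and `run_depth_parallel` are hypothetical names, in-memory lists stand in for slow I/O connections, and a thread pool stands in for the separate processes expressor uses.

```python
from concurrent.futures import ThreadPoolExecutor


def read_source(rows):
    """Hypothetical input step: stands in for a slow file or database read."""
    return list(rows)


def join_on_key(left, right):
    """Hypothetical joiner: pair up (key, value) rows that share a key."""
    right_by_key = {key: value for key, value in right}
    return [
        (key, value, right_by_key[key])
        for key, value in left
        if key in right_by_key
    ]


def run_depth_parallel(left_rows, right_rows):
    """Start both input steps together, then join what they return."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Neither read has to finish before the other one begins.
        left_future = pool.submit(read_source, left_rows)
        right_future = pool.submit(read_source, right_rows)
        return join_on_key(left_future.result(), right_future.result())
```

For example, `run_depth_parallel([(1, "a"), (2, "b")], [(1, "x"), (3, "y")])` reads both inputs concurrently and joins the rows that share key 1. With real I/O, the benefit comes from overlapping the waiting time of the two reads.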

Partition parallelism

With partition parallelism (also called data parallelism), multiple I/O channels are set up between sequential data processing operators. As with depth parallelism, a high-speed CPU, or multiple CPUs or cores, can simultaneously service multiple slower I/O channels. Each operator in the deployed application is serviced by multiple channels and can process multiple data records concurrently.

[Figure: expressor partition parallelism]

There is a significant difference between partition and depth parallelism. Partition parallelism uses multiple channels in a single instance of a motor or operator, while depth parallelism uses multiple distinct instances of identical or different motors. In the above drawing, each operator and motor is running in a single, separate process with four data processing threads. In the depth parallelism diagrams (see previous section), multiple instances of the input and output motors are running in multiple processes, each with, in this example, only a single data processing thread.
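A minimal Python sketch of partitioning, assuming a round-robin split and four worker threads inside one process: the names `partition` and `run_partitioned` are hypothetical, and the split strategy is an assumption, since this post does not say how expressor assigns records to channels.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, channels):
    """Deal records round-robin into the given number of channels."""
    return [records[i::channels] for i in range(channels)]


def run_partitioned(records, func, channels=4):
    """Apply func to every record, one worker thread per channel."""
    parts = partition(list(records), channels)
    with ThreadPoolExecutor(max_workers=channels) as pool:
        # A single operator (func) is serviced by several channels at once.
        return list(pool.map(lambda part: [func(r) for r in part], parts))
```

Calling `run_partitioned(range(8), lambda x: x * 10, channels=4)` returns one output list per channel. Unlike the depth parallelism case, there is only one operator here; the parallelism comes from splitting its record stream across channels.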

This concludes my brief description of expressor's parallel processing techniques.

Bill Kehoe, lead engineer, expressor

Posted by expressor software at January 13, 2010 8:30 AM
