Practical DevOps for Big Data/Deployment-Specific Modelling

Introduction

Data-intensive applications (DIAs) and the Big Data assets they manipulate are key to industrial innovation. However, going data-intensive requires considerable effort not only in design but also in system and infrastructure configuration and deployment, which still rely on heavy manual fine-tuning and trial and error. We outline abstractions and automations that support data-intensive deployment and operation in an automated DevOps fashion, featuring Infrastructure-as-Code and TOSCA.

Infrastructure-as-Code and TOSCA reflect the DevOps tactic of adopting source-code practices in infrastructure design as well. More specifically, Infrastructure-as-Code envisions the definition, versioning, evaluation, testing, and so on of source code for infrastructure designs, just as application code is defined, versioned, evaluated, and tested. TOSCA, which stands for "Topology and Orchestration Specification for Cloud Applications", is the OASIS standard definition language for Infrastructure-as-Code.

The DDSM (DICE Deployment-Specific Model) makes it possible to express the deployment of DIAs on the Cloud using UML.

Technical Overview

On the one hand, Infrastructure-as-Code (IasC) is a typical DevOps tactic that offers standard ways to specify the deployment infrastructure and support operations concerns using human-readable notations. The IasC paradigm features: (a) domain-specific languages (DSLs) for Cloud application specification, such as TOSCA, to program the way a Cloud application should be deployed; (b) specific executors, called orchestrators, that consume IasC blueprints and automate the deployment they describe.
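To make the split between blueprint and orchestrator concrete, the following is a minimal Python sketch. It is neither a TOSCA parser nor a real orchestrator: the blueprint is a plain dictionary only loosely inspired by the idea of node templates with requirements, and the node names, the `requires` key and the `deployment_order` function are hypothetical. The toy orchestrator simply derives an order in which the nodes could be deployed so that each node comes after the nodes it requires.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical, heavily simplified blueprint: each node lists the nodes it requires.
blueprint = {
    "vm": {"type": "Compute", "requires": []},
    "zookeeper": {"type": "Service", "requires": ["vm"]},
    "storm": {"type": "Service", "requires": ["vm", "zookeeper"]},
}


def deployment_order(blueprint):
    """Return node names ordered so that requirements are deployed first."""
    # TopologicalSorter expects a mapping from each node to its predecessors.
    graph = {name: set(spec["requires"]) for name, spec in blueprint.items()}
    return list(TopologicalSorter(graph).static_order())


order = deployment_order(blueprint)
print(order)
```

A real orchestrator does far more (provisioning, configuration, error handling); the point here is only that the blueprint is data and the orchestrator is the program that interprets it.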

On the other hand, the DDSM framework is a UML-based modelling framework built on the MODAClouds4DICER meta-model, a transposition and extension of the MODACloudsML meta-model adapted for data-intensive deployment. MODACloudsML is a language for modelling the provisioning and deployment of multi-cloud applications using a component-based approach. The main motivation for adopting such a language on top of TOSCA is to make the design methodology TOSCA-independent: the designer does not have to be a TOSCA expert, or even be aware of TOSCA, but only has to follow the proposed methodology. Moreover, MODACloudsML serves essentially the same purpose as the TOSCA standard but at a higher level of abstraction, which makes it more user-friendly.

The Figure below shows an extract of the MODAClouds4DICER meta-model. The main concepts are inherited directly from MODACloudsML. A MODACloudsML model is a set of Components, which can be owned either by a Cloud provider (ExternalComponents) or by the application provider (InternalComponents). A Component can be an application, a platform, or a physical host. While an ExternalComponent can only provide Ports and ExecutionPlatforms, an InternalComponent can also require them, since it is controlled by the application provider. Ports and ExecutionPlatforms serve to connect Components to each other. ProvidedPorts and RequiredPorts are linked by means of a Relationship, while ProvidedExecutionPlatforms and RequiredExecutionPlatforms are linked by means of an ExecutionBinding. The latter can be seen as a particular type of relationship between two Components, stating that one of them executes the other.

MODACloudsML has been extended to capture concepts specific to data-intensive applications, e.g. systems that such applications commonly rely on, such as NoSQLStorage solutions and ParallelProcessingPlatforms, which are typically composed of a MasterNode and one or more SlaveNodes.
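The core meta-model concepts described above can be sketched as a handful of Python classes. This is an illustration under simplifying assumptions, not the MODAClouds4DICER meta-model itself: ports and execution platforms are reduced to plain strings that match by name, and the example components (`vm`, `storm`, `app`) and their port names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    """An application, a platform or a physical host."""
    name: str
    provided_ports: List[str] = field(default_factory=list)
    provided_platforms: List[str] = field(default_factory=list)


@dataclass
class ExternalComponent(Component):
    """Owned by a Cloud provider: it can only provide."""


@dataclass
class InternalComponent(Component):
    """Owned by the application provider: it can also require."""
    required_ports: List[str] = field(default_factory=list)
    required_platforms: List[str] = field(default_factory=list)


@dataclass
class Relationship:
    """Links a RequiredPort of one Component to a ProvidedPort of another."""
    requirer: InternalComponent
    provider: Component
    port: str

    def is_valid(self) -> bool:
        return (self.port in self.requirer.required_ports
                and self.port in self.provider.provided_ports)


@dataclass
class ExecutionBinding:
    """States that `host` executes `guest` through a matching platform."""
    guest: InternalComponent
    host: Component
    platform: str

    def is_valid(self) -> bool:
        return (self.platform in self.guest.required_platforms
                and self.platform in self.host.provided_platforms)


# Hypothetical example: a provider-owned VM executes a Storm platform,
# which a user-owned application connects to through a port.
vm = ExternalComponent("vm", provided_platforms=["os"])
storm = InternalComponent("storm", provided_ports=["submit"], required_platforms=["os"])
app = InternalComponent("app", required_ports=["submit"])

binding = ExecutionBinding(guest=storm, host=vm, platform="os")
link = Relationship(requirer=app, provider=storm, port="submit")
print(binding.is_valid(), link.is_valid())
```

Note how only `InternalComponent` carries the `required_*` fields, mirroring the rule that an ExternalComponent can provide but not require.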

DDSM UML Deployment Profile

Building on the technical overview above, we now elaborate on the essential DDSM stereotypes, which are listed in the Table below.

DDSM main stereotypes
#	Stereotype	Meaning
1.	InternalNode	Services that are managed and deployed by the application owner
2.	ExternalNode	Services that are owned and managed by a third-party provider
3.	VMsCluster	A cluster of virtual machines
4.	PeerToPeerPlatform	A data-intensive platform operating according to the peer-to-peer style
5.	MasterSlavePlatform	A data-intensive platform operating according to the master-slave style
6.	StormCluster	An instance of a Storm cluster
7.	CassandraCluster	An instance of a Cassandra cluster
8.	BigDataJob	The actual DIA to be executed
9.	JobSubmission	Deployment association between a BigDataJob and its corresponding execution environment

DDSM distinguishes between InternalNode, i.e. services that are managed and deployed by the application owner, and ExternalNode, i.e. services that are owned and managed by a third-party provider (see the providerType property of the ExternalNode stereotype). Both the InternalNode and ExternalNode stereotypes extend the UML meta-class Node.

The VMsCluster stereotype is defined as a specialisation of ExternalNode, as renting computational resources such as virtual machines is one of the main services (so-called Infrastructure-as-a-Service) offered by Cloud providers. VMsCluster also extends the Device UML meta-class, since a cluster of VMs logically represents a single computational resource with processing capabilities, upon which applications and services may be deployed for execution. A VMsCluster has an instances property representing its replication factor, i.e., the number of VMs composing the cluster. VMs in a cluster are all of the same size (in terms of amount of memory, number of cores, clock frequency), which can be defined by means of the VMSize enumeration.

Alternatively, the user can specify lower and upper bounds for the VMs' characteristics (e.g. minCore/maxCore, minRam/maxRam), assuming the employed Cloud orchestrator is then able to select, according to some criteria, the best Cloud offer that matches the specified bounds. The VMsCluster stereotype is fundamental to providing DDSM users with the right level of abstraction, so that they can model the deployment of DIAs without having to deal with the complexity exposed by the underlying distributed computing infrastructure. In fact, a user just has to model her clusters of VMs as stereotyped Devices that can have nested InternalNodes representing the hosted distributed platforms. Furthermore, a specific OCL constraint requires each InternalNode to be contained in a Device holding the VMsCluster stereotype, since by definition an InternalNode has to be deployed and managed by the application provider, which therefore has to provide the necessary hosting resources.
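The two ideas in this paragraph, matching bounds against Cloud offers and the containment constraint on InternalNodes, can be illustrated with a small Python sketch. It is not the DDSM profile or its OCL library: the `Offer` catalogue, the cheapest-offer criterion and the function names are assumptions made for the example, and only the bounds (minCore/maxCore, minRam/maxRam) and the instances property echo names used in the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Offer:
    """Hypothetical Cloud offer; names and prices are invented for the example."""
    name: str
    cores: int
    ram_gb: int
    hourly_price: float


@dataclass
class InternalNode:
    name: str


@dataclass
class VMsCluster:
    name: str
    instances: int  # replication factor: number of VMs in the cluster
    min_core: int
    max_core: int
    min_ram: int
    max_ram: int
    hosted: List[InternalNode] = field(default_factory=list)


def pick_offer(cluster: VMsCluster, offers: List[Offer]) -> Optional[Offer]:
    """Return the cheapest offer within the cluster's bounds, or None."""
    matching = [
        o for o in offers
        if cluster.min_core <= o.cores <= cluster.max_core
        and cluster.min_ram <= o.ram_gb <= cluster.max_ram
    ]
    return min(matching, key=lambda o: o.hourly_price, default=None)


def unhosted_nodes(nodes: List[InternalNode], clusters: List[VMsCluster]) -> List[str]:
    """Names of InternalNodes not nested in any VMsCluster (constraint violations)."""
    hosted = {n.name for c in clusters for n in c.hosted}
    return [n.name for n in nodes if n.name not in hosted]


offers = [
    Offer("small", 2, 4, 0.05),
    Offer("medium", 4, 8, 0.10),
    Offer("large", 8, 16, 0.20),
]
storm, cassandra = InternalNode("storm"), InternalNode("cassandra")
cluster = VMsCluster("cluster1", instances=2, min_core=4, max_core=8,
                     min_ram=8, max_ram=16, hosted=[storm])

chosen = pick_offer(cluster, offers)
violations = unhosted_nodes([storm, cassandra], [cluster])
print(chosen.name, violations)
```

Here `cassandra` is reported as a violation because it is not nested in any cluster, which is the situation the OCL constraint is meant to rule out at modelling time.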

We then define DIA-specific deployment abstractions, i.e. the PeerToPeerPlatform and MasterSlavePlatform stereotypes, as further specialisations of InternalNode. These two stereotypes allow the modelling language to capture the key differences between the two general types of distributed architecture. For instance, the MasterSlavePlatform stereotype makes it possible to indicate a dedicated host for the master node, since it might require more computational resources. By extending these deployment abstractions, we implemented a set of technology modelling elements (StormCluster, CassandraCluster, etc.), one for each supported technology. DIA execution engines (e.g. Spark or Storm) also extend UML ExecutionEnvironment, so as to distinguish the platforms to which DIA jobs can be submitted. Each technology element makes it possible to model deployment aspects specific to a given technology, such as platform-specific configuration parameters or dependencies on other technologies; where these are mandatory, they are enforced by means of OCL constraints.

The BigDataJob stereotype represents the actual application that can be submitted for execution to any of the available execution engines. It is defined as a specialisation of UML Artefact, since it corresponds to the DIA executable artefact. It makes it possible to specify job-specific information, for instance the artifactUrl from which the application executable can be retrieved.

The JobSubmission stereotype, which extends UML Deployment, is used to specify additional deployment options of a DIA. For instance, it makes it possible to specify job scheduling options, such as how many times the job has to be submitted and the time interval between two subsequent submissions. In this way the same DIA job can be deployed in multiple instances using different deployment options. An additional OCL constraint requires each BigDataJob to be connected by means of a JobSubmission to a UML ExecutionEnvironment that holds a stereotype extending either MasterSlavePlatform or PeerToPeerPlatform.
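As a rough illustration of the scheduling options and of the constraint just described, the sketch below models a job, an execution environment and a submission in Python. The field names `repetitions` and `interval_seconds`, the artefact URL and the helper functions are hypothetical, since the actual property names of the JobSubmission stereotype are not given here. The choice of MasterSlavePlatform as the base of the Storm environment is likewise an assumption of the example.

```python
from dataclasses import dataclass
from typing import List

PLATFORM_STEREOTYPES = {"MasterSlavePlatform", "PeerToPeerPlatform"}


@dataclass
class ExecutionEnvironment:
    name: str
    stereotype_bases: List[str]  # stereotypes that the applied stereotype extends


@dataclass
class BigDataJob:
    name: str
    artifact_url: str


@dataclass
class JobSubmission:
    job: BigDataJob
    target: ExecutionEnvironment
    repetitions: int = 1       # hypothetical name: how many times to submit
    interval_seconds: int = 0  # hypothetical name: gap between submissions

    def submission_offsets(self) -> List[int]:
        """Seconds after the first submission at which each submission happens."""
        return [i * self.interval_seconds for i in range(self.repetitions)]


def is_well_formed(job: BigDataJob, submissions: List[JobSubmission]) -> bool:
    """True if the job has a submission targeting a platform environment."""
    return any(
        s.job is job and bool(PLATFORM_STEREOTYPES & set(s.target.stereotype_bases))
        for s in submissions
    )


# Assumption for the example: the Storm environment extends MasterSlavePlatform.
storm = ExecutionEnvironment("storm", ["MasterSlavePlatform"])
job = BigDataJob("wikistats", "http://example.org/wikistats.jar")
submission = JobSubmission(job, storm, repetitions=3, interval_seconds=600)

offsets = submission.submission_offsets()
print(offsets, is_well_formed(job, [submission]))
```

Because the scheduling options live on the submission rather than on the job, the same `BigDataJob` can appear in several `JobSubmission` objects with different options.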

UML Deployment Modelling: The WikiStats Example

We showcase the profile by applying it to model the deployment of a simple DIA called Wikistats, a streaming application that processes Wikimedia articles to elicit statistics on their contents and structure. The application uses Apache Storm as its stream processing engine and Apache Cassandra as its storage technology. Wikistats is a simple example of a DIA that needs multiple, heterogeneous, distributed platforms such as Storm and Cassandra. Moreover, Storm depends on Apache Zookeeper. The Wikistats application itself is a Storm application (a streaming job) packaged in a deployable artefact. The Figure below shows the DDSM for the Wikistats example.

In this example scenario, all the necessary platforms are deployed within the same cluster of 2 large-sized VMs from an OpenStack installation. Each of the required platform elements is modelled as a Node annotated with the corresponding technology-specific stereotype. In particular, Storm is modelled as an ExecutionEnvironment, as it is the application engine that executes the actual Wikistats application code. At this point, fine-tuning of the Cloud infrastructure and of the various platforms is the key aspect supported by DDSM. The technology stereotypes make it possible to configure each platform so that different configurations can be tested easily and quickly over multiple deployments, enabling the continuous architecting of DIAs. The dependency of Storm on Zookeeper is enforced via the previously discussed OCL constraint library, which comes installed automatically with the DDSM profile. The deployment of the Wikistats application is modelled as an Artefact annotated with the BigDataJob stereotype and linked to the StormCluster element by a Deployment dependency stereotyped as a JobSubmission. Finally, BigDataJob and JobSubmission can be used to elaborate details about the Wikistats job and how it is scheduled.
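The Wikistats deployment described above can be written down as plain data to show what the model captures and how a dependency check of the kind performed by the OCL library might behave. This is only a sketch: the dictionary layout, the keys `host` and `submitted_to`, and the `ZookeeperCluster` stereotype name are assumptions for illustration, and the check is ordinary Python rather than OCL.

```python
# Hypothetical rule table: stereotype -> stereotypes that must also be in the model.
REQUIRES = {"StormCluster": {"ZookeeperCluster"}}

# The Wikistats scenario as plain data (layout and key names are illustrative).
model = {
    "cluster": {"stereotype": "VMsCluster", "instances": 2,
                "size": "Large", "provider": "OpenStack"},
    "zookeeper": {"stereotype": "ZookeeperCluster", "host": "cluster"},
    "storm": {"stereotype": "StormCluster", "host": "cluster"},
    "cassandra": {"stereotype": "CassandraCluster", "host": "cluster"},
    "wikistats": {"stereotype": "BigDataJob", "submitted_to": "storm"},
}


def missing_dependencies(model):
    """Return (element, missing stereotype) pairs for unmet technology dependencies."""
    present = {element["stereotype"] for element in model.values()}
    problems = []
    for name, element in model.items():
        for needed in sorted(REQUIRES.get(element["stereotype"], ())):
            if needed not in present:
                problems.append((name, needed))
    return problems


print(missing_dependencies(model))

# Dropping Zookeeper from the model makes the Storm dependency fail.
without_zookeeper = {k: v for k, v in model.items() if k != "zookeeper"}
print(missing_dependencies(without_zookeeper))
```

Changing a single value, such as `instances` or `size` on the cluster, is the kind of quick reconfiguration across deployments that the technology stereotypes are meant to support.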

Conclusion

DICE UML-based deployment modelling revolves around DDSM, a refined UML profile for specifying Infrastructure-as-Code using simple UML Deployment Diagrams stereotyped with appropriate data-intensive augmentations.