This is a guest post written for the Computer Weekly Open Source Insider blog by Daniel Jones in his capacity as CTO of EngineerBetter, a UK-based member of the Cloud Foundry Foundation that specialises in Cloud Foundry, BOSH and Concourse.
NOTE: BOSH is an open source project that offers a toolchain for release engineering, deployment and lifecycle management of large-scale distributed services. Concourse is an open source, pipeline-based CI system that focuses on simplicity, usability and reproducibility.
Jones writes as follows…
Software that is well-packaged requires a lower cognitive load of the engineers using it, allowing them to focus on solving business problems.
If a service has only a single responsibility then we have very few things to remember about it. Single-purpose tools are more efficient, too. We can make dual-necked musical instruments that combine a guitar and a bass, but we can’t play both necks at the same time, and we still have to carry the weight of both even when we only need one.
Multiple modes are dangerous
Software that can do many things requires us to remember more about its capabilities and its state. This is the reason that interfaces with multiple modes are dangerous in critical applications – plane crashes have been caused by one button that does different things depending on the state of the aircraft.
Humans prefer things that always work the same way. How many times have you failed to log into a system because you accidentally had CAPS lock on?
Software with a single responsibility can be reused more effectively, and by defining well-thought-out boundaries with clear interfaces it can be changed and iterated upon with minimal impact on other elements of the system.
Way back in 1995, long before microservices were the latest craze, Robert C. Martin outlined a set of packaging principles that describe how to achieve this reusability and independence: if you can’t re-use them together, they shouldn’t live together; things that change together live together; things that are used together live together.
These principles combined with a platform that can offer self-service zero-downtime deployments allow engineers to continuously deliver business value in discrete reusable chunks. Being able to pick from a catalogue of stable and well-defined components enables agility – imagine if architects had to build every house fitting from scratch instead of picking them from a catalogue.
Any ‘unit of currency’ in software engineering benefits from having a single concern that is separate from those of its peers: this enables them to evolve independently and so change frequently.
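To make the single-responsibility idea concrete, here is a minimal illustrative sketch (not from Jones’s post; all names are invented for the example) of two single-purpose components behind clear interfaces, each free to change or be reused independently:

```python
# Toy illustration: two single-purpose components behind clear interfaces.
# Either can change or be reused independently, echoing Martin's
# "things that change together live together" principle.

class PriceCalculator:
    """Single concern: turn a quantity and unit price into a total."""
    def total(self, quantity, unit_price):
        return quantity * unit_price

class ReceiptFormatter:
    """Single concern: render a total as text. Knows nothing about pricing."""
    def render(self, total):
        return f"Total due: {total:.2f}"

# Composition happens at the boundary, not inside either component.
calc = PriceCalculator()
fmt = ReceiptFormatter()
print(fmt.render(calc.total(3, 2.50)))  # Total due: 7.50
```

Because neither class knows about the other, a change to how receipts are rendered cannot ripple into the pricing logic, and vice versa.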
Cloud Foundry’s buildpacks
Logically then this discussion brings us to Cloud Foundry’s buildpacks.
These are a good example of a clear separation of concerns: you provide the app code and the platform provides the operating system, base filesystem and language runtime.
This separation of concerns allows app code to have a ‘change cadence’ independent of the underlying layers, meaning kernel patches can happen underneath deployed applications.
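As a purely hypothetical sketch of this division of labour, the ‘detect’ phase of a buildpack-style platform might look something like the following; the marker files and function names are illustrative, not Cloud Foundry’s actual implementation:

```python
import os
import tempfile

# Hypothetical sketch of a buildpack "detect" phase: the platform inspects
# the pushed app directory and chooses a language runtime, so the app code
# never specifies an operating system or base filesystem.
MARKERS = {
    "requirements.txt": "python",
    "package.json": "node",
    "pom.xml": "java",
}

def detect_runtime(app_dir):
    for marker, runtime in MARKERS.items():
        if os.path.exists(os.path.join(app_dir, marker)):
            return runtime
    raise RuntimeError("no buildpack detected")

# Usage: a directory containing only app code and a dependency manifest.
app = tempfile.mkdtemp()
open(os.path.join(app, "requirements.txt"), "w").close()
print(detect_runtime(app))  # python
```

The point of the sketch is what is absent: nothing in the app directory names an OS, a base image or a kernel version, which is precisely what lets those layers be patched underneath the application.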
Technologies such as Docker have blended these responsibilities, causing developers to need to care about what operating system and native libraries are available to their applications – after years of the industry striving for more abstraction and increased decoupling!
What should we expect from Microsoft Build or //build/ 2017 then?
One thing you may notice is that rather than posting this on the Computer Weekly Developer Network (CWDN) column, we have subtly shifted the position of this story to run on Computer Weekly Open Source Insider and its accompanying TechTarget sites.
For it must surely be said, Microsoft is inevitably going to focus on its continued moves in open source at this year’s developer-focused convention and exhibition.
Yes you can continue to chide Microsoft’s commercial impetus for having ‘got the open religion’, but the firm’s moves in this space have been arguably without too many faltering steps.
Okay not every product has been as thoroughbred a release as some purists would have liked, but Microsoft is still turning the ship around to a degree.
Looking at 2017, we can be sure to hear more about Xamarin, more about Microsoft Cognitive Services and more about bots, bots and bots. Last year there wasn’t a keynote or a breakout without lots and lots of bots.
Microsoft’s Scott Guthrie has noted this month that the current preview SDK for the Windows Creators Update is now feature complete. The firm has also announced Azure Managed Disks and VM scale sets, which are supposed to bring the ease-of-use and scale benefits of PaaS to IaaS developers.
We can also hope to hear more on the Windows Bug Tracker, more on the MSFT approach to DevOps mechanics and how its software ‘actually does stuff’ in real DevOps terms, and perhaps more on the Project Neon Windows 10 UI refresh.
According to Guthrie, “For over 25 years Microsoft has been focused on bringing the developer community together with tech leaders like Bill Gates and Satya Nadella — at PDC and Build, from LA to San Francisco and now back to the clouds in Seattle. With more than 5,000 developers joining us in person and millions following via live stream, Build and Seattle will be the hub for what’s next.”
Visual Studio 2017
Let’s not forget, March 7 2017 sees Visual Studio 2017 arrive… so there should be plenty of content focused here.
Oh and there will be Internet of Things (IoT), obviously… and Microsoft may even explain what the term Microsoft Cloud could come to mean in term of Windows and its wider approach to its core operating system… although that last point is very much just conjecture at this stage.
Yahoo!’s Big ML (machine learning) team, comprising Lee Yang, Jun Shi, Bobbie Chern and Andy Feng, has confirmed that it is offering TensorFlowOnSpark to the community. This is the latest open source framework for distributed deep learning on big-data clusters.
The team says that it has found that in order to gain insight from massive amounts of data, they needed to deploy distributed deep learning. But (and here comes the reason for the new release) they also say that existing DL frameworks often require setting up separate clusters for deep learning, forcing them to create multiple programs for a machine learning pipeline.
Having separate clusters requires the team to transfer large datasets between them, they say… and this introduces unwanted system complexity and end-to-end learning latency.
“Last year we addressed scale-out issues by developing and publishing CaffeOnSpark, our open source framework that allows distributed deep learning and big-data processing on identical Spark and Hadoop clusters,” the team confirms.
The team says it uses CaffeOnSpark to improve NSFW image detection and to automatically identify eSports game highlights from live-streamed video.
With the community’s feedback and contributions, CaffeOnSpark has been upgraded with LSTM support, a new data layer, training and test interleaving, a Python API, and deployment on Docker containers.
“This has been great for our Caffe users, but what about those who use the deep learning framework TensorFlow? We’re taking a page from our own playbook and doing for TensorFlow what we did for Caffe,” they say.
After TensorFlow’s initial publication, Google released an enhanced TensorFlow with distributed deep learning capabilities in April 2016.
In October 2016, TensorFlow introduced HDFS support. Outside of the Google cloud, however, users still needed a dedicated cluster for TensorFlow applications. TensorFlow programs could not be deployed on existing big-data clusters, thus increasing the cost and latency for those who wanted to take advantage of this technology at scale.
According to the team, “To address this limitation, several community projects wired TensorFlow onto Spark clusters. SparkNet added the ability to launch TensorFlow networks in Spark executors. Databricks proposed TensorFrame to manipulate Apache Spark’s DataFrames with TensorFlow programs. While these approaches are a step in the right direction, after examining their code, we learned we would be unable to get the TensorFlow processes to communicate with each other directly, we would not be able to implement asynchronous distributed learning, and we would have to expend significant effort to migrate existing TensorFlow programs.”
The new framework, TensorFlowOnSpark (TFoS), is meant to enable distributed TensorFlow execution on Spark and Hadoop clusters.
TensorFlowOnSpark supports all types of TensorFlow programs, enabling both asynchronous and synchronous training and inferencing. It supports model parallelism and data parallelism, as well as TensorFlow tools such as TensorBoard on Spark clusters.
The team also says that any TensorFlow program can be easily modified to work with TensorFlowOnSpark. Typically, fewer than 10 lines of Python code need to change. Many developers at Yahoo who use TensorFlow have easily migrated TensorFlow programs for execution with TensorFlowOnSpark.
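TensorFlowOnSpark itself needs a Spark cluster to demonstrate, but the underlying pattern it distributes, synchronous data-parallel training, can be sketched in a few lines of plain Python. This is a toy example with no Spark or TensorFlow; every name in it is illustrative:

```python
# Toy pure-Python sketch of synchronous data-parallel training, the
# pattern TensorFlowOnSpark distributes across Spark executors: each
# "worker" computes a gradient on its own data shard, and the shards'
# gradients are averaged into one update.

def gradient(w, shard):
    # Gradient of mean squared error for the 1-D model y = w * x.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def sync_step(w, shards, lr=0.1):
    # One synchronous step: every worker's gradient is averaged
    # before the shared weight is updated.
    grads = [gradient(w, shard) for shard in shards]
    return w - lr * sum(grads) / len(grads)

# Data generated from y = 2x, split across two "workers".
shards = [[(1, 2), (2, 4)], [(3, 6), (4, 8)]]
w = 0.0
for _ in range(50):
    w = sync_step(w, shards)
print(round(w, 3))  # 2.0
```

Asynchronous training, which TFoS also supports, would instead let each worker apply its gradient as soon as it is computed, without waiting for the others.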
Progress has open sourced its Telerik UI for Universal Windows Platform (UWP) native UI controls for building Windows apps. As well as expanding its donations to the .NET ecosystem and foundation, the firm is effectively here giving 20+ UI for Universal Windows Platform controls under an Apache license.
Grande fromage of developer tooling at Progress is Faris Sweis. Claiming that Progress Telerik UI for UWP can reduce coding time, he explains that the tool itself is meant to remove the repetitive, tactical elements of development to focus on solving business challenges.
So now, .NET developers can access (and of course further contribute to) these more than 20 UI controls to create Windows apps that share a single codebase and run on a multitude of Microsoft devices.
These include some of the more popular business and enterprise-critical controls such as Grids, Charts, DataForm and ListView.
Principal program manager lead at Microsoft Tim Heuer is on the record saying that native UI controls for building Universal Windows apps are important to developers creating applications for use across devices.
Progress Telerik UI for UWP is part of Progress Telerik DevCraft—the most complete .NET toolbox for web, mobile and desktop development.
The Computer Weekly Open Source Insider blog takes a closer look at Bloomberg — as previously reported, Bloomberg L.P. produces technology for financial markets. The firm’s software includes tools that can be used to track trading indices, perform financial ‘asset swapping’ functions… and provide broker platforms (Bloomberg Tradebook) that work to perform what is known as multi-asset execution technology and algorithmic trading.
The latest milestone in open source development at Bloomberg is the incorporation of the Learning-to-Rank (LTR) plug-in into the Apache Solr 6.4.0 enterprise search platform.
What is Apache Solr?
Solr is built on top of the Apache Lucene search engine library and provides distributed search and index replication; it powers the search and navigation features of many of the world’s largest Internet sites. With the Learning-to-Rank (LTR for short) contrib module, users can configure and run machine-learned ranking models in Solr.
The release of this plug-in marks the culmination of a year’s worth of close collaboration between two groups of Bloomberg software engineers, in London and New York, and the open source project’s community to make it easier to re-rank search results using machine learning.
The original goal was to improve both Federated Search and News Search on the Bloomberg Terminal. A Solr-based Search-as-a-Service platform drives search for multiple functions on the Terminal and Learning-to-Rank algorithms are responsible for the quality of many of its search results. Any time users perform a search, they expect to instantly find the most relevant companies, people and news.
In New York, the re-ranking requirements of the News Search team were similar, but not identical. As the engineers talked with colleagues, other teams also came forward asking for their own Solr-based re-ranking frameworks.
How does the Learning-to-Rank plug-in work?
In the Information Retrieval field, Learning-to-Rank techniques are used to improve the relevance of users’ search results. First, a search query is made for documents that match the user’s search terms. The top N results of the original search query are then re-ranked using new scores computed by applying the trained machine learning model.
Since these machine learning queries are more computationally intensive—slow and expensive, in other words—using the ranking from the second query on just a subset of results helps improve performance, while delivering relevant results.
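The two-phase pattern described above can be sketched in plain Python. The scoring functions below are illustrative stand-ins, not Solr’s actual LTR API:

```python
# Toy sketch of Learning-to-Rank re-ranking. Phase 1: a cheap match
# score over the whole index. Phase 2: a costlier "learned" model
# rescores only the top N candidates.

def cheap_score(doc, query):
    # Stand-in for term-frequency-style first-pass matching.
    return sum(doc["text"].split().count(t) for t in query.split())

def learned_score(doc, query):
    # Stand-in for an expensive trained model combining features.
    return 0.7 * cheap_score(doc, query) + 0.3 * doc["popularity"]

def search(docs, query, top_n=2):
    first_pass = sorted(docs, key=lambda d: cheap_score(d, query), reverse=True)
    head, tail = first_pass[:top_n], first_pass[top_n:]
    # Only the head is re-scored by the expensive model.
    reranked = sorted(head, key=lambda d: learned_score(d, query), reverse=True)
    return reranked + tail

docs = [
    {"id": "a", "text": "solr search engine", "popularity": 9},
    {"id": "b", "text": "solr solr ranking", "popularity": 1},
    {"id": "c", "text": "unrelated document", "popularity": 5},
]
print([d["id"] for d in search(docs, "solr")])  # ['a', 'b', 'c']
```

Note that document ‘a’ overtakes ‘b’ only in the second phase, while ‘c’ is never touched by the expensive model at all, which is where the performance saving comes from.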
The effort to integrate the Learning-to-Rank plug-in into the upstream project was led by Apache Lucene/Solr committer Christine Poerschke, a senior software engineer in the News Search team in London. Last month, Poerschke was named to the Apache Lucene Project Management Committee (PMC), becoming the first Bloomberg employee to be invited to join any Apache PMC. In this new role, she is part of a group of developers around the globe that provides oversight of the project for the Apache Software Foundation (ASF), decides the release strategy, appoints new committers and sets community and technical direction for their project.
Benefits for engineers
After a year-long period of on-and-off iterative code revisions, public comments and documentation, the Learning-to-Rank plug-in is now part of the Solr 6.4.0 release. The plug-in provides an easy-to-use framework to deploy machine learning models into Solr. Now search engineers, both inside and outside Bloomberg, can use the plug-in and their own machine learning models to improve their search solutions. This allows engineering teams to focus on their specific domain, rather than spend time building and maintaining their own re-ranking infrastructure.
With the inclusion of the Learning-to-Rank plug-in as part of Solr, the project’s worldwide community has taken on the responsibility for maintaining and extending this technology. This collaborative open development means that, in the future, the community – which includes several Bloomberg engineers who are active contributors, developers at other companies, as well as independent search experts – will be able to integrate their own extensions and improvements to the plug-in. Those updates will then automatically ship to all Learning-to-Rank plug-in users as part of future Solr releases.
The opportunity for Bloomberg engineers to participate in important and interesting open source projects also (it is argued) has other benefits. Search results ranking is a relatively difficult technical problem. Taking on, and then contributing the results of, this kind of challenge is validating and rewarding for Bloomberg’s engineers and it is also of interest to many prospective Bloomberg engineers.
Cloudera is an interesting company. Interesting in that it bills itself as a data management, analytics and machine learning specialist… three ‘disciplines’ that one might have expected to find in three different firms.
Given this supposed breadth, the firm now welcomes Apache Kudu (as many readers will know, an open source storage engine for fast analytics on fast-moving data), shipping as a generally available component within Cloudera Enterprise 5.10.
What is fast data?
All data is fast really… but we use the term to capture the idea that data is time-sensitive in the first place, i.e. we don’t want data to reside in the ‘data lake’ where it sits unstructured, full of potential but essentially unused.
As TechTarget defines it, “The term fast data is often associated with self-service BI and in-memory databases. The concept plays an important role in native cloud applications that require low latency and depend upon the high I/O capability that all-flash or hybrid flash storage arrays provide.”
Kudu simplifies the path to real-time analytics, allowing users to act quickly on data as-it-happens to make better business decisions.
Complex lambda architecture (mixed workloads)
“Real-time data analysis has been a challenge for enterprises because it required a complex lambda architecture to merge together real-time stream processing and batch analytics. Kudu eases that architecture with a single storage engine that addresses both needs,” said Charles Zedlewski, senior vice president of products at Cloudera. “The high-demand workloads in place today, which include a growing number of new machine-learning models, can identify cybersecurity threats, predict maintenance issues in the Industrial Internet of Things (IIoT), and bring much more accuracy to all types of online reporting.”
Kudu was designed to take advantage of modern hardware such as solid-state storage and more affordable RAM.
Further here… we know that Kudu is purpose-built for fast, large-scale analytic scans over rapidly updating data – necessary for handling time-series data, machine data analytics, online reporting and other analytic or operational workloads.
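The workload just described, keyed updates arriving alongside full-table analytic scans, can be sketched with a toy in-memory table in Python. This is illustrative only, not the Kudu API:

```python
# Toy in-memory sketch (not the Kudu API) of the workload Kudu targets:
# rapid keyed upserts of time-series rows combined with analytic scans
# over the whole table.

table = {}  # primary key -> row, so updates are cheap and in place

def upsert(key, metric, value):
    table[key] = {"metric": metric, "value": value}

def scan_average(metric):
    # Analytic scan: aggregate over every row in the table.
    vals = [r["value"] for r in table.values() if r["metric"] == metric]
    return sum(vals) / len(vals)

upsert(("host1", 1), "cpu", 40)
upsert(("host2", 1), "cpu", 60)
upsert(("host1", 1), "cpu", 50)  # late correction, updated in place
print(scan_average("cpu"))  # 55.0
```

In an append-only file store, the late-arriving correction would force a rewrite or a merge at read time; a single engine that handles both the upsert and the scan is what lets Kudu collapse the lambda architecture mentioned below.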
I’ve seen the future… and it’s mixed data workloads
“Incorporating Apache Kudu into CDH will greatly simplify execution of the mixed workloads our customers increasingly utilise once they migrate their enterprise data warehouse and real-time streams to Hadoop. The Cloudera-certified StreamSets Data Collector natively supports Kudu as a plug-and-play dataflow destination, and StreamSets Dataflow Performance Manager helps assure the continuous availability and accuracy of the data flowing into Kudu,” said Arvind Prabhakar, chief technology officer at StreamSets.
For additional background here, back in September 2015, Cloudera announced the public beta release of Apache Kudu and, two months later, Cloudera donated Kudu to the Apache Software Foundation (ASF) to open it to the broader development community.
Open source database company MariaDB Corporation is on a politically charged mission to bring data analytics to the people, or so it says.
Bourgeois analyst cognoscenti
The firm has grandiosely asserted that Business Intelligence (BI) and data analytics used to be the sole preserve of bourgeois, well-funded business analysts who lavished their IT budgets on the likes of Oracle or Teradata.
MariaDB says it wants to put an end to this plutocracy with its open alternative – one that, ahem, still charges commercial licence fees, albeit at levels that are probably always lower than those of its more corporate competitors.
MariaDB ColumnStore 1.0 was actually made available at the end of 2016 and exists as an open source columnar storage engine.
Open source OLAP
Through the general availability of its ColumnStore product, MariaDB will open source OLAP and make it more accessible for businesses that are put off by high costs, says the firm.
ColumnStore will be made available as an AWS AMI and on Canonical Ubuntu, Debian and CentOS.
Two imperfect options
According to MariaDB, “Enterprises continue to be confronted with two imperfect options for big data analytics: build a costly, proprietary data warehouse on premise with vendors like Teradata and Vertica, or get locked into potentially uncontrollable costs with cloud-based solutions like Redshift.”
MariaDB’s ‘veep’ of engineering David Thompson backs up all these assertions by further claiming that his firm’s technology costs on average 90% less per TB per year than the leading data warehouses.
How can you (a company, we mean) be an open source as-a-Service company? That’s how Platform9 describes itself.
In truth (and for some clarity) the firm is in fact a SaaS managed hybrid cloud firm with a keen focus on open source, so perhaps some re-direction is needed in terms of how the technology proposition is being made here.
The company actually specialises in self-service provisioning, orchestration, access to open APIs, and an ability to integrate with major automation frameworks.
Platform9’s private cloud solution has been described as “programmatic DevOps” to help accelerate build, test, release cycles.
Platform9 (yes, we know, it’s almost impossible to even read the name without thinking “Plan 9 From Outer Space” in your head) has this month rolled out its Managed Kubernetes service… said to be an infrastructure-agnostic managed form of SaaS.
Managed Kubernetes is deployed and managed entirely as a SaaS solution, across on-premises and public cloud infrastructure.
Kubernetes has emerged as the standard for container orchestration and microservices, but projects are often hampered by the prohibitively steep learning curve required to effectively use it and the technical complexity needed to fully integrate and manage production Kubernetes environments.
The company also introduced Fission, an open source, serverless framework built on Kubernetes.
Drastically simplified operational model
The key sell here is that these offerings feature what is billed as a drastically simplified operational and consumption model that eliminates the steep learning curve currently associated with Kubernetes.
“SaaS-managed delivery makes Kubernetes accessible to a much larger audience at a time when many development teams are committing to microservices as their cloud-native development paradigm,” said Sirish Raghuram, chief executive officer at Platform9.
“We have built our reputation on our OpenStack-as-a-service offering, which remains a core focus. While enterprises will be running virtualised workloads on OpenStack for years to come, though, there’s growing demand for platforms that offer a choice of virtualisation, microservices or both. Microservices in particular require a more intuitive, managed approach that reduces time-to-value for Kubernetes projects and work on any choice of infrastructure: on-premises, in the cloud or across multiple clouds,” added Raghuram.
Platform9’s Managed Kubernetes may indeed work for DevOps and IT teams: it allows integration across any combination of cloud platforms (or on-premises infrastructure) without re-engineering a single line of code, or worrying about backend configuration and maintenance.
Software supply chain automation company Sonatype is hanging out the flags to celebrate the fact that it has experienced a 300 percent growth in the use of its Nexus Repository over the past three years.
The reason for the growth? The firm thinks it is down to growing concern about security vulnerabilities in open source components and containers.
“There is increasing evidence that more organizations are taking software supply chain automation and component security seriously,” said Wayne Jackson, CEO of Sonatype. “Specifically, DevOps-native organisations are embracing tools such as Nexus Repository and Nexus Firewall to automatically block bad components from entering into their mission-critical applications.”
State of the nation
According to the firm’s own State of the Software Supply Chain Report, 1 in 15 open source components used in production applications has at least one known security vulnerability.
The company now claims that organisations that rely on the Nexus Repository to house open source software components and containerized applications have gained new visibility into the quality of components flowing through their software supply chains.
In 2016 alone, Nexus Repository saw a 40 percent increase in the use of its Repository Health Check feature. Today, 23,000 organisations utilise Repository Health Check every day to automatically analyse security, licensing, and architectural risks across 58 million components living inside local Nexus Repository Managers.
Now we know that 2017 is the year when open source finally grows up… at least that’s what some people are saying.
The recent security issues regarding MongoDB (it was reported that 28,000 databases were held to ransom by hackers) should not dampen open source’s success in the enterprise. Let’s remember that Microsoft has moved to validate the open source movement by joining the Linux Foundation.
The latest ClusterControl features automate multiple application environments including:
“Always on” Databases
Standard MySQL replication is still the most widely used method of replicating data between database hosts, yet Galera and Group Clustering have proven to be more reliable. With the addition of MySQL replication, ClusterControl now covers a wide range of application use cases that require high availability.
Added support for larger MongoDB sharded cluster deployments means ClusterControl provides a way of managing MongoDB in polyglot environments, helping enterprises circumvent vendor lock-in.
Enabling the DBA
Extension of load balancing technology via ProxySQL, HAProxy and MaxScale. This gives professionals control over how applications access their databases. It also enables query caching, negating the need to fetch every answer straight from the database.
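The query-caching idea mentioned above can be illustrated with a toy TTL cache in Python. This is a sketch of the general technique only, not how ProxySQL actually implements it:

```python
import time

# Toy sketch of result caching as done by database proxies (illustrative
# only, not ProxySQL's implementation): repeated identical queries within
# the TTL are served from the cache instead of hitting the database.

class QueryCache:
    def __init__(self, ttl_seconds=5.0):
        self.ttl = ttl_seconds
        self.store = {}   # query text -> (expiry time, rows)
        self.db_hits = 0  # how often the real database was queried

    def execute(self, query, run_on_db):
        entry = self.store.get(query)
        if entry and entry[0] > time.monotonic():
            return entry[1]  # fresh cached result, database untouched
        rows = run_on_db(query)
        self.db_hits += 1
        self.store[query] = (time.monotonic() + self.ttl, rows)
        return rows

cache = QueryCache(ttl_seconds=60)
fake_db = lambda q: [("row1",), ("row2",)]
cache.execute("SELECT * FROM users", fake_db)
cache.execute("SELECT * FROM users", fake_db)  # served from cache
print(cache.db_hits)  # 1
```

The trade-off, as with any such cache, is staleness: until the TTL expires, applications may read results that the database has since updated.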
ClusterControl aims to provide a single interface to deploy and manage open source databases so that DevOps teams don’t have to “cobble together” a combination of tools, utilities and scripts that need constant updates and maintenance.