Data Cleaning & Related Platforms

1. Cascading (Apache-licensed)
The Cascading Ecosystem is a collection of applications, languages, and APIs for developing data-intensive applications. 
At the ecosystem core is Cascading, a Java API for defining complex data flows and integrating those flows with back-end systems, and a query planner for mapping and executing logical flows onto a computing platform. 
There are quite a few extensions to Cascading providing integrations with popular systems, testing frameworks, and tools that leverage Cascading. 
Sitting on top of the Cascading API are languages and tools to simplify the development of data-intensive applications. For Scala developers, see Scalding. For Clojure developers, see Cascalog. For SQL developers, see Lingual. Java developers can use the raw Cascading API directly or Fluid, a fluent interface built on it.
Sitting below the Cascading query planner are platform providers and rules for mapping data flows onto a given platform like Apache Hadoop, Apache Tez, Apache Flink, or simply locally in memory (suitable for many streaming applications). 
Learn more from the User Guide, the most recent Cascading and Scalding books, or the tutorials and example applications. To learn about Cascading internals, see this post on the 3.x query planner.
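
The split between a logical flow and the platform that runs it can be illustrated with a small conceptual sketch. This is plain Python, not the Cascading API: the `Flow` class, its `each` and `filter` methods, and the `run_local` function are hypothetical names invented for illustration. It only shows the idea of describing a flow first and handing it to a "platform" (here, local memory) afterwards.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple


@dataclass
class Flow:
    """Hypothetical logical flow: an ordered list of steps, not tied to any platform."""
    steps: List[Tuple[str, Callable]] = field(default_factory=list)

    def each(self, fn: Callable) -> "Flow":
        # Record a per-record transformation; nothing is executed yet.
        self.steps.append(("each", fn))
        return self

    def filter(self, predicate: Callable) -> "Flow":
        # Record a per-record filter; nothing is executed yet.
        self.steps.append(("filter", predicate))
        return self


def run_local(flow: Flow, source: Iterable) -> list:
    """Hypothetical 'local in-memory' platform: maps each logical step onto a lazy iterator."""
    stream = iter(source)
    for kind, fn in flow.steps:
        stream = map(fn, stream) if kind == "each" else filter(fn, stream)
    return list(stream)


# Describe the flow once: trim whitespace, drop empty records, lowercase.
flow = Flow().each(str.strip).filter(bool).each(str.lower)

# Execute it on the in-memory "platform".
result = run_local(flow, ["  Alice ", "", " BOB"])
print(result)
```

Because the flow is only a description, a different runner could in principle execute the same steps on another back end, which is the role the query planner and platform providers play in Cascading.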

2. Talend 
Talend Open Studio for Data Integration helps you get your data to the right place, in the right form, at the right time. The leading open source ETL solution for data warehousing and business intelligence, Talend Open Studio for Data Integration is: 
  • Powerful and versatile. Transform, move, and synchronize data across all your heterogeneous sources and targets 
  • Easy to use. Start productive work right away with an intuitive interface rich in modelling tools, job-building components, and more than 900 data connectors, including the Cloud 
  • Proven in the field. Hundreds of thousands of users manage their critical data with Talend Open Studio for Data Integration, from SMBs to some of the largest corporations in the world 
  • Ready to start today. Talend Open Studio for Data Integration is free to download and use, for as long as you want. No budget battles or endless delays – just faster, easier data integration, starting today 

3. Pentaho 
Pentaho is business intelligence (BI) software that provides data integration, OLAP services, reporting, information dashboards, data mining and extract, transform, load (ETL) capabilities. It is headquartered in Orlando, Florida. Pentaho was acquired by Hitachi Data Systems in 2015. On September 19, 2017, Pentaho became part of Hitachi Vantara, a new company that unifies the operations of Pentaho, Hitachi Data Systems and Hitachi Insight Group.
  • Internet of Things Analytics. Expect better business outcomes – from improved customer satisfaction to higher profitability – with the power of IoT analytics.
  • Big Data Integration and Analytics. Drive maximum value from your data with a complete platform for full data integration and business analytics. 
  • Pentaho Data Integration. Quickly and easily deliver the best data to your business and IT users – no coding or complexity required. 
  • Business Analytics. Empower business users with interactive, real-time visual data analysis and predictive modelling, with minimal IT support. 

4. Sqoop (Apache) 
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. You can use Sqoop to import data from external structured datastores into Hadoop Distributed File System or related systems like Hive and HBase. Conversely, Sqoop can be used to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses. 
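
The two directions described above (database to files, then files back to a database) can be sketched in miniature. This is a conceptual analogy in plain Python, not Sqoop itself: it assumes in-memory SQLite databases stand in for the relational stores and an in-memory CSV buffer stands in for files on HDFS. The function names `import_table` and `export_table` and the `customers` table are hypothetical.

```python
import csv
import io
import sqlite3


def import_table(conn, table, out):
    """'Import' direction: dump every row of a table to delimited text.

    The table name is interpolated directly, so it must come from trusted code.
    """
    writer = csv.writer(out)
    count = 0
    for row in conn.execute(f"SELECT * FROM {table}"):
        writer.writerow(row)
        count += 1
    return count


def export_table(conn, table, src, n_cols):
    """'Export' direction: load delimited text back into an existing table."""
    placeholders = ", ".join("?" * n_cols)
    rows = list(csv.reader(src))
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    return len(rows)


# Source relational database (stand-in for an operational RDBMS).
source_db = sqlite3.connect(":memory:")
source_db.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
source_db.executemany(
    "INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")]
)

# Delimited-text staging area (stand-in for files in HDFS).
staging = io.StringIO(newline="")
imported = import_table(source_db, "customers", staging)

# Target database (stand-in for an enterprise data warehouse).
staging.seek(0)
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
exported = export_table(warehouse, "customers", staging, n_cols=2)

rows = warehouse.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
print(imported, exported, rows)
```

The sketch moves rows one at a time in a single process. Sqoop's value lies in doing this kind of transfer in bulk against Hadoop, which the toy version does not attempt to reproduce.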
