How to estimate work hours for a typical data profiling and cleansing effort? Any Estimate calculators in Excel can be shared?

Data quality
Microsoft Excel
An acquaintance asked me to help him with an estimator to logically estimate the person hours required for data profiling and cleansing. His company assumes its data quality will be bad. Now that there is a need to check and cleanse the data, the question is how to arrive at the person hours. The legacy system is an AS/400.

Answer Wiki

You can check out this link for some good info: DATA PROFILING
Here is an excerpt from it:
Some tools are free software and open source; however, many, but not all, free data profiling tools are open source projects. In general, their functionality is more limited than that of commercial products, and they may not offer free telephone or online support. Furthermore, their documentation is not always thorough. However, some small companies still use these free tools instead of expensive commercial software, considering the benefits that free tools provide.
1. Aggregate Profiler Tool
Aggregate Profiler (AP) is an open source project developed in Java [7]. AP supports both traditional databases and big data, such as Hadoop or Hive, and it offers statistical analysis, pattern matching, distribution charts, basket analysis, etc. AP also supports data generation, data preparation, data masking features, and address correction for data quality projects. Moreover, this tool offers data validation (metadata profiling, analytical profiling, and structural profiling) and data quality (removing duplicate data, null values, and dirty data).
2. Talend Open Studio for Data Quality
Talend Open Studio for Data Quality (TOSDQ) [8] is also based on Java and is a mature open source tool. TOSDQ offers a navigator interface to access databases and data files. This tool supports catalog analysis, time correlation analysis, column analysis, table analysis, column correlation analysis, and schema analysis; it also supports column functional dependency, redundancy analysis, numerical correlation analysis, nominal correlation analysis, connection analysis, column set analysis, and match analysis. Furthermore, TOSDQ reports several different types of statistical indicators, including simple statistics, text statistics, summary statistics, pattern frequency statistics, Soundex frequency statistics, phone number statistics, and fraud detection (Benford's law frequency).
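To make concrete what "column analysis" and "pattern frequency statistics" amount to in tools like these, here is a minimal, illustrative Python sketch (standard library only; the sample phone-number column is hypothetical, not taken from any of the tools above) that computes null counts, distinct values, completeness, and a pattern-frequency distribution:

```python
import re
from collections import Counter

def profile_column(values):
    """Compute simple column-level profiling statistics."""
    total = len(values)
    nulls = sum(1 for v in values if v is None or str(v).strip() == "")
    non_null = [str(v) for v in values if v is not None and str(v).strip() != ""]

    def pattern(s):
        # Pattern frequency: digits -> 9, letters -> A, keep punctuation
        return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", s))

    return {
        "total": total,
        "nulls": nulls,
        "completeness": (total - nulls) / total if total else 0.0,
        "distinct": len(set(non_null)),
        "patterns": Counter(pattern(v) for v in non_null),
    }

# Hypothetical phone-number column, e.g. from a legacy AS/400 extract
phones = ["555-1234", "555 9876", None, "5551234", "555-1234", ""]
stats = profile_column(phones)
# stats["patterns"] shows three competing formats: 999-9999, 999 9999, 9999999
```

Pattern distributions like this are often the quickest way to size a cleansing effort: each distinct pattern in a supposedly uniform column is a standardization rule someone has to write and test.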
3. DataCleaner
DataCleaner [9] is a commercial tool for data profiling and data cleaning, but it has a free version which offers multiple data profiling functions, including pattern matching, boolean analysis, weekday distribution, completeness analysis, value matcher, character set distribution, value distribution, date gap analysis, unique key check, date/time analysis, string analysis, number analysis, referential integrity, and reference data matching.
Commercial data profiling products usually come packaged in data governance suites. These products have multiple functions, high performance, and strong capabilities; they can connect to other suites to provide comprehensive solutions for customers. Moreover, this software is not only powerful, but end users can also find online services and telephone support.
1. IBM InfoSphere Information Analyzer
IBM InfoSphere Information Analyzer (IIA) [10] is part of IBM's data governance suite that includes InfoSphere Blueprint Director, Metadata Workbench, DataStage, QualityStage, Data Click, Business Glossary, and Information Services Director. IIA supports column analysis (statistics, distribution, cardinality, and value analysis), identifying keys and relationships, discovering redundant data, comparing data and structures through history baselines, analyzing data via data rules, and importing and exporting data rules.
2. Informatica Data Profiling
Informatica Data Profiling is a key component of PowerCenter [11]. This profiling software supports aggregate functions (count null values, calculate averages, get maximum or minimum values, and get lengths of strings), candidate key evaluation (unique or non-unique), distinct value count, domain inference, functional dependency analysis, redundancy evaluation, and row count. In addition, users can add business rules (verbose mode) or configure profile functions in this tool.
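As a hedged illustration of what the aggregate functions and candidate-key evaluation above boil down to (a sketch of the technique, not Informatica's actual implementation; the table and column names are hypothetical), a column is a unique candidate key when its distinct non-null count equals the row count:

```python
def aggregate_profile(rows, column):
    """Aggregate profiling: null count, min/max, string lengths,
    distinct value count, and candidate-key evaluation (unique vs non-unique)."""
    values = [r.get(column) for r in rows]
    non_null = [v for v in values if v is not None]
    lengths = [len(str(v)) for v in non_null]
    distinct = len(set(non_null))
    return {
        "row_count": len(rows),
        "null_count": len(values) - len(non_null),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
        "distinct_count": distinct,
        # Unique candidate key: every row has a distinct non-null value
        "candidate_key": distinct == len(rows),
    }

# Hypothetical customer rows from a legacy extract
rows = [
    {"cust_id": 1, "region": "EU"},
    {"cust_id": 2, "region": "US"},
    {"cust_id": 3, "region": "EU"},
]
profile = aggregate_profile(rows, "cust_id")
```

Running the same check over `region` would report a non-unique column, which is how these tools distinguish key candidates from descriptive attributes.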
3. Oracle Enterprise Data Quality
Oracle Enterprise Data Quality (EDQ) [12] permits address verification, profiling data (files, databases, and spreadsheets), standardization, audit reviews (incorrect values, missing data, inconsistencies, duplicate records, and key quality metrics), matching and merging columns (duplicate prevention, de-duplication, consolidation, and integration), and case management (data reviewing). Furthermore, the tool can utilize prebuilt templates or user-defined rules to profile data. EDQ can also connect to other Oracle data governance products, including Oracle Data Integrator and Oracle Master Data Management.
4. SAP Information Steward
SAP Information Steward can improve information quality and governance [13] via the Data Insight module (data profiling and data quality monitoring), the Metadata Management module (metadata analysis), the Metapedia module (business term taxonomy), and the Cleansing Package Builder (cleansing rules). The Data Insight module can define validation rules, determine profiling (column, address, uniqueness, dependency, and redundancy), import and export metadata, and create views [13].
5. SAS DataFlux Data Management Studio
SAS DataFlux Data Management Studio (DDMS) [14] is a data governance suite that consists of data profiling, master data management, and data integration. This data profiling tool covers key analysis (primary and foreign keys), pattern frequency distribution analysis, redundant data analysis, and data profiling reports.
6. Collibra Data Stewardship Manager
Collibra Data Stewardship Manager (DSM) [15] is a module of Collibra's Data Governance Center, which also includes the Business Semantic Glossary (BSG) and Reference Data Accelerator (RDA) modules. DSM also provides historical data quality reports around trend analysis and reports to understand the impact of resolved data issues. In addition, DSM provides a fully configurable data quality reporting dashboard by bringing together data quality rules and metrics calculated in one or multiple sources (data quality tools, databases, and big data).
Several academic papers offer data quality measurement methods. In [16], the authors mention information quality dimensions including accessibility, completeness, and security; however, these dimensions focus only on information quality for data governance. [17] separates data quality problems according to how a data administrator might view them: for example, single-source, multi-source, instance level, and schema level. The authors state that these problems could be solved by ETL, but such procedures only improve the quality of data after it has been collected and stored in a data warehouse, making the solution inflexible. [18] presents algorithms for calculating data quality measures, such as free-of-error, completeness, and appropriate-amount-of-data.
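Measures like those in [18] are typically simple ratios. As a hedged sketch (my own formulation for illustration, not the paper's exact definitions), completeness can be taken as the fraction of populated cells, and free-of-error as the fraction of populated cells that pass a validation rule:

```python
def completeness(cells):
    """Completeness: fraction of cells that are populated (non-null, non-blank)."""
    total = len(cells)
    filled = sum(1 for c in cells if c is not None and str(c).strip() != "")
    return filled / total if total else 0.0

def free_of_error(cells, is_valid):
    """Free-of-error: fraction of populated cells passing a validation rule."""
    populated = [c for c in cells if c is not None and str(c).strip() != ""]
    if not populated:
        return 0.0
    return sum(1 for c in populated if is_valid(c)) / len(populated)

# Hypothetical 5-digit postal-code column with one missing and one malformed value
codes = ["12345", "9021", None, "54321"]
c = completeness(codes)                                   # 3 of 4 cells populated
e = free_of_error(codes, lambda s: len(s) == 5 and s.isdigit())
```

Ratios like these are also a reasonable basis for an effort estimate: profiling a sample, then multiplying (1 - completeness) and (1 - free-of-error) counts by a per-record remediation time gives a first-cut person-hours figure.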
