Abstract
With the democratization of data science libraries and frameworks, most data scientists manage and generate their data analytics pipelines using a collection of scripts (e.g., Python, R). This marks a shift from traditional applications that communicate back and forth with a DBMS that stores and manages the application data. While code debuggers have reached impressive maturity over the past decades, they fall short in assisting users to explore data-driven what-if scenarios (e.g., split the training set into two and build two ML models). Those scenarios, while doable programmatically, are a substantial burden for users to manage themselves. Dagger (Data Debugger) is an end-to-end data debugger that abstracts key data-centric primitives to enable users to quickly identify and mitigate data-related problems in a given pipeline. Dagger was motivated by a series of interviews we conducted with data scientists across several organizations. A preliminary version of Dagger has been incorporated into Data Civilizer 2.0 to help physicians at the Massachusetts General Hospital process complex pipelines.
Original language | English |
---|---|
Publication status | Published - 2020 |
Event | 10th Annual Conference on Innovative Data Systems Research, CIDR 2020 - Amsterdam, Netherlands (Duration: 12 Jan 2020 → 15 Jan 2020) |
Conference
Conference | 10th Annual Conference on Innovative Data Systems Research, CIDR 2020 |
---|---|
Country/Territory | Netherlands |
City | Amsterdam |
Period | 12 Jan 2020 → 15 Jan 2020 |