Automating Data Science with dabl

Technology, 2023-12-19

Data Science

dabl stands for Data Analysis Baseline Library. Its goal is to automate supervised learning and reduce the boilerplate of common tasks. When building any predictive model, the data has to be cleaned, analyzed, and run through many models with different parameter settings to find the best accuracy, which takes many lines of code and a lot of person-hours. dabl handles all of these tasks in very few lines of code, saving time and money for anyone who handles large amounts of data every day.

The main idea behind the library is to let data scientists spend more time thinking about the problem statement and building custom analyses instead of repeating the same traditional steps every time. dabl takes inspiration from scikit-learn and auto-sklearn. Let's dive in.

Installation and import

We only need to install and import one library, dabl, for all the necessary tasks.

!pip install dabl
import dabl

Getting data

dabl ships with a few datasets that can be loaded directly as DataFrames. We can also read any external data in the usual pandas way. Here we will work with a DataFrame containing the adult census dataset.

df = dabl.datasets.load_adult()
df.head()

Data cleaning

As we all know, the very first step is data cleaning. dabl tries to detect the type of each column in the dataset and apply the appropriate conversions. Its aim is to get the data clean enough for visualization and modeling. We can also perform custom cleaning if required.

dabl.clean(X, type_hints=None, return_types=False, target_col=None, verbose=0)


X: DataFrame


type_hints: semantic types (continuous, categorical, ordinal, text, etc.) to use for specific columns if automatic detection fails

return_types: whether to return the inferred types

target_col: string. Target columns are never dropped

data_clean = dabl.clean(df, type_hints={"capital-gain": "continuous"})
data_clean
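To give a feel for what a type hint and a cleaning pass accomplish, here is a minimal standard-library sketch. It is not dabl's implementation: `resolve_types`, `split_dirty_numeric`, the "detected" dictionary, and the sample values are all hypothetical, chosen only to illustrate how a user hint can override a detected type and how a messy numeric column might be separated into numbers and leftover strings.

```python
def resolve_types(detected, type_hints=None):
    """Return the detected types, with any user-supplied hints taking priority."""
    return {**detected, **(type_hints or {})}


def split_dirty_numeric(values):
    """Split a messy column into parsed numbers and the raw entries that failed to parse."""
    numbers, leftovers = [], []
    for value in values:
        try:
            numbers.append(float(value))
            leftovers.append(None)
        except (TypeError, ValueError):
            # Not a number (for example "?" or a missing entry): keep it aside.
            numbers.append(None)
            leftovers.append(value)
    return numbers, leftovers


# Hypothetical detection result in which one column was guessed wrongly.
detected = {"age": "continuous", "capital-gain": "categorical"}
resolved = resolve_types(detected, type_hints={"capital-gain": "continuous"})

# Hypothetical raw values for a column that is mostly numeric.
raw = ["0", "2174", "?", "14084", "n/a", None]
numbers, leftovers = split_dirty_numeric(raw)
```

Keeping the unparsed entries separately, instead of silently dropping them, leaves room to treat them as missing values or as their own category later.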

Describing the dataset

Traditionally we call .info() on the dataset to get initial insights. Additionally, we can use dabl.detect_types(), which infers the semantic type of each column.

dabl.detect_types(X, type_hints=None, max_int_cardinality='auto', dirty_float_threshold=0.9, near_constant_threshold=0.95, target_col=None, verbose=0)


Look at the documentation for more details.


dabl.detect_types(df)
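The thresholds in the signature above hint at how this kind of detection can work. The following is a toy heuristic written with the standard library only, not dabl's actual logic: the function `guess_column_type`, the type labels, the default `max_int_cardinality=10`, and the sample columns are assumptions made for illustration. The parameter names are borrowed from the signature, but their exact semantics in dabl may differ.

```python
from collections import Counter


def is_float(value):
    """Return True if the value can be parsed as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def guess_column_type(values, max_int_cardinality=10,
                      dirty_float_threshold=0.9, near_constant_threshold=0.95):
    """Guess a semantic type for one column using simple fraction thresholds."""
    present = [v for v in values if v is not None and v != ""]
    if not present:
        return "useless"
    # A column dominated by a single value carries little information.
    top_count = Counter(present).most_common(1)[0][1]
    if top_count / len(present) >= near_constant_threshold:
        return "useless"
    float_fraction = sum(is_float(v) for v in present) / len(present)
    if float_fraction == 1.0:
        numbers = [float(v) for v in present]
        few_integers = (all(n.is_integer() for n in numbers)
                        and len(set(numbers)) <= max_int_cardinality)
        return "categorical" if few_integers else "continuous"
    # Mostly numeric, with a few stray strings such as "?".
    if float_fraction >= dirty_float_threshold:
        return "dirty_float"
    return "categorical"


# Hypothetical sample columns, loosely modeled on a census-style table.
columns = {
    "age": [39, 50, 38, 53, 28, 37, 49, 52, 31, 42, 30, 23],
    "education-num": [9, 13, 9, 10, 13, 9, 10, 13, 9, 14, 9, 13],
    "workclass": ["Private", "State-gov", "Private", "Self-emp"] * 3,
    "native-country": ["United-States"] * 29 + ["Mexico"],
    "hours": [str(30 + i) for i in range(19)] + ["?"],
}
detected = {name: guess_column_type(vals) for name, vals in columns.items()}
```

A heuristic like this can misjudge a column, for example an integer column with few distinct values that is really a count, which is the situation type_hints is meant to cover.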

Exploratory data analysis

dabl.plot() gives a quick insight into the data. However, dabl does not guarantee that it will surface every interesting aspect of the data. It provides very high-level insights, such as the important features, their interactions, and how difficult the problem is. For any specific analysis, you still have to do traditional custom plotting.

dabl.plot(X, y=None, target_col=None, type_hints=None, scatter_alpha='auto', scatter_size='auto', verbose=10, plot_pairwise=True, **kwargs)


dabl.plot(df, target_col="income")

Amazing, isn't it? We get quite a good amount of insight from a single short line of code.

Model building

SimpleClassifier tries to find the best-fitting model by running several baseline models on subsampled data. Because dabl is inspired by scikit-learn, it lets us specify the data to fit in scikit-learn style. There are two ways of doing it: pass X and y separately, or pass the whole DataFrame together with target_col.

model = dabl.SimpleClassifier(random_state=0)
X = data_clean.drop("income", axis=1)
y = data_clean.income
model.fit(X, y)

or

model = dabl.SimpleClassifier(random_state=0).fit(data_clean, target_col="income")

output:


As we can see, it tried several models with different parameter settings to find the best-fitting model and its accuracy score. SimpleClassifier also performs preprocessing such as missing value imputation and one-hot encoding. We can inspect the model using dabl.explain().

dabl.explain(model)
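The idea of running several baselines on subsampled data can be illustrated with a small standard-library sketch. This is not how SimpleClassifier is implemented: the two baselines (a majority-class predictor and a one-feature lookup rule), the function names, the 50/50 split, and the synthetic rows are all assumptions chosen to keep the example short.

```python
import random
from collections import Counter, defaultdict


def majority_baseline(train_rows, target):
    """Always predict the most frequent class seen in training."""
    most_common = Counter(r[target] for r in train_rows).most_common(1)[0][0]
    return lambda row: most_common


def one_feature_rule(train_rows, target, feature):
    """Predict the majority class observed for each value of a single feature."""
    fallback = Counter(r[target] for r in train_rows).most_common(1)[0][0]
    by_value = defaultdict(Counter)
    for r in train_rows:
        by_value[r[feature]][r[target]] += 1
    table = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
    return lambda row: table.get(row[feature], fallback)


def accuracy(predict, rows, target):
    """Fraction of rows whose predicted class matches the true class."""
    return sum(predict(r) == r[target] for r in rows) / len(rows)


def compare_baselines(rows, target, sample_size, seed=0):
    """Fit cheap baselines on a random subsample and score them on held-out rows."""
    rng = random.Random(seed)
    sample = rng.sample(rows, min(sample_size, len(rows)))
    split = len(sample) // 2
    train, valid = sample[:split], sample[split:]
    candidates = {"majority": majority_baseline(train, target)}
    for feature in (k for k in rows[0] if k != target):
        candidates[f"rule({feature})"] = one_feature_rule(train, target, feature)
    scores = {name: accuracy(p, valid, target) for name, p in candidates.items()}
    return max(scores, key=scores.get), scores


# Synthetic data: income depends only on education; shift is mostly noise.
rows = [
    {
        "education": "Masters" if i % 10 < 3 else "HS-grad",
        "shift": "day" if i % 2 == 0 else "night",
        "income": ">50K" if i % 10 < 3 else "<=50K",
    }
    for i in range(200)
]
best, scores = compare_baselines(rows, target="income", sample_size=100)
```

Working on a subsample keeps each candidate cheap to fit, so many simple models can be compared quickly before investing in a more expensive one.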

Are human data scientists really required?

dabl is really interesting and automates a lot, but it is still in development and has a very small set of features and functions. I suggest going through the list of APIs that dabl provides. Personally, I think it will still take a long time before a fully functional, feature-rich version is released, and once it is, the industry will have to trust and accept it. Data science is a field in which unique datasets and problem statements or requirements appear every day, so at some point human intervention is necessary. All I'm trying to say is that automated data science will be the future, but not anytime soon, so you can keep focusing on upskilling or learning data science. Remember to always upskill yourself, and take this article as a wake-up call every time you think you have enough knowledge and can stop learning new skills.

Limitations of dabl

Right now (at the time of writing this article), dabl does not handle text data, time-series data, or neural network models. Image, audio, and video data are completely out of scope.

Future goals (at the time of writing this article)

Ready-made visualizations

Model diagnostics

Efficient model search

Type detection

Automatic preprocessing

Portfolios of well-performing pipelines

Translated from: https://medium.com/towards-artificial-intelligence/automating-data-science-with-dabl-76acb7344727