Here's a puzzle for you: from within the language, how do you list the sizes of the built-in datasets in R or Python?
I'm new enough to R that I wasn't able to figure that one out.

Raw Python doesn't have built-in datasets, but sklearn does. (There are more in Seaborn and it's possible to load the R sets from Python.) Here's my first shot; the multiple special cases smell bad to me:

from sklearn import datasets
loaders = [f for f in dir(datasets)
           if f.startswith('load')
           and f != 'load_boston'
           and f != 'load_files'
           and not f.startswith('load_sample_image')
           and not f.startswith('load_svmlight_file')]
[(f[5:], len(eval('datasets.' + f + '().data'))) for f in loaders]

Output:

[('breast_cancer', 569), ('diabetes', 442), ('digits', 1797), ('iris', 150), ('linnerud', 20), ('wine', 178)]

My second pass is easier to read, but it squelches errors (gasp!) and still shows you the warning about the deprecated boston dataset:

from sklearn import datasets
names = []
sizes = []
for f in dir(datasets):
    try:
        sizes.append(eval('len(datasets.' + f + '().data)'))
        names.append(f[5:])
    except:
        pass
print(list(zip(names, sizes)))
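
A possible third variant, offered as a sketch rather than a tested fix: look each loader up with getattr instead of eval, and catch only the exception types a zero-argument call seems likely to raise. The helper name `dataset_sizes`, the chosen exception tuple, and the `fake` stand-in namespace are all assumptions made for illustration; the stand-in is there so the snippet runs without scikit-learn installed.

```python
from types import SimpleNamespace


def dataset_sizes(module, errors=(TypeError, AttributeError, ImportError)):
    """Return (name, row_count) for every load_* callable in `module`
    that can be called with no arguments and has a `.data` attribute."""
    sizes = []
    for attr in sorted(dir(module)):
        if not attr.startswith("load_"):
            continue
        loader = getattr(module, attr)  # plain attribute lookup, no eval
        try:
            sizes.append((attr[len("load_"):], len(loader().data)))
        except errors:
            # Skip loaders that need arguments, lack .data, or refuse to load.
            continue
    return sizes


# Hypothetical stand-in for a datasets module (not part of scikit-learn).
fake = SimpleNamespace(
    load_tiny=lambda: SimpleNamespace(data=[[1, 2], [3, 4], [5, 6]]),
    load_needs_path=lambda path: SimpleNamespace(data=[]),  # TypeError when called bare
    load_no_data=lambda: SimpleNamespace(images=[0, 1]),    # AttributeError on .data
)

print(dataset_sizes(fake))  # [('tiny', 3)]
```

With scikit-learn available, `dataset_sizes(datasets)` after `from sklearn import datasets` would be the intended call. Which exceptions each loader actually raises may vary between versions, so the `errors` tuple may need adjusting.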

This is something of a weird reflective programming task in both languages, so neither can be blamed for inelegance.

@data_science

@peterdrake

ls(as.environment("package:stats")) |>
sapply(function(x) { object.size(get(x, as.environment("package:stats"))) } )

@hhmacedo A step in the right direction, but this appears to be about the memory size of functions in the stats package.

I'm interested in the number of data points in each of the datasets in the datasets package.

@peterdrake I'm sorry but I don't know why I changed from "datasets" to "stats". And changing from memory size to the length, the function would be:

ls(as.environment("package:datasets")) |>
sapply(function(x) { length(unlist(get(x, as.environment("package:datasets")))) } )

@hhmacedo I'm curious as to how the iris dataset comes to have a "length" of 600. Yes, there are 150 rows, each with 4 features, but what about the "Species" column?

@peterdrake iris and iris3 are organised differently: the former has a Species column (5*150=750), while the latter is a 3D array of 4*50 observations, with Species as the third dimension (4*50*3=600).
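
The arithmetic above can be checked with a few lines of Python. This sketch only multiplies out the shapes as described in this thread (they are typed in by hand, not read from R), and it contrasts the flattened cell count with the number of data points asked about earlier.

```python
from math import prod

# Shapes as described in the thread (assumed, not loaded from R).
iris_shape = (150, 5)     # 150 rows; 4 measurements plus Species
iris3_shape = (50, 4, 3)  # 50 observations x 4 measurements x 3 species

# What length(unlist(...)) counts: every cell.
flattened = {"iris": prod(iris_shape), "iris3": prod(iris3_shape)}

# What "number of data points" means: one per flower.
rows = {"iris": iris_shape[0], "iris3": iris3_shape[0] * iris3_shape[2]}

print(flattened)  # {'iris': 750, 'iris3': 600}
print(rows)       # {'iris': 150, 'iris3': 150}
```

So the flattened length overstates the row count by the number of columns, and by a different factor for each layout.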

@hhmacedo "Ah-h-h," as Frank Herbert would say, "-h-h-h-h-h-h-h."

Thanks!
