Here's a puzzle for you: from within the language, how do you list the sizes of the built-in datasets in #Rstats or #Python?
I'm new enough to R that I wasn't able to figure that one out.
Raw Python doesn't have built-in datasets, but sklearn does. (There are more in Seaborn and it's possible to load the R sets from Python.) Here's my first shot; the multiple special cases smell bad to me:
from sklearn import datasets
loaders = [f for f in dir(datasets)
           if f.startswith('load')
           and f != 'load_boston'
           and f != 'load_files'
           and not f.startswith('load_sample_image')
           and not f.startswith('load_svmlight_file')]
[(f[5:], len(eval('datasets.' + f + '().data'))) for f in loaders]
Output:
[('breast_cancer', 569), ('diabetes', 442), ('digits', 1797), ('iris', 150), ('linnerud', 20), ('wine', 178)]
My second pass is easier to read, but it squelches errors (gasp!) and still shows you the warning about the deprecated boston dataset:
from sklearn import datasets
names = []
sizes = []
for f in dir(datasets):
    try:
        sizes.append(eval('len(datasets.' + f + '().data)'))
        names.append(f[5:])
    except:
        pass
print(list(zip(names, sizes)))
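For what it's worth, here is a third-pass sketch (not from the thread; dataset_sizes and the fake stand-in module are mine) that drops eval in favor of getattr, squelches only the one failure we expect (a loader that needs arguments), and suppresses the deprecation warning instead of blacklisting names by hand. It's demoed on a stand-in namespace so it runs without sklearn, but passing sklearn's datasets module should work the same way:

```python
import warnings
from types import SimpleNamespace

def dataset_sizes(module, prefix='load_'):
    """List (name, n_rows) for every zero-argument loader in `module`
    whose name starts with `prefix`."""
    results = []
    for name in dir(module):
        if not name.startswith(prefix):
            continue
        loader = getattr(module, name)          # no eval needed
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')     # hide deprecation noise
            try:
                data = loader().data
            except TypeError:                   # loader needs arguments
                continue
        results.append((name[len(prefix):], len(data)))
    return results

# Stand-in for sklearn.datasets so the sketch is self-contained:
fake = SimpleNamespace(
    load_iris=lambda: SimpleNamespace(data=[0] * 150),
    load_wine=lambda: SimpleNamespace(data=[0] * 178),
    load_files=lambda path: None,  # takes an argument, like sklearn's
)
print(dataset_sizes(fake))  # [('iris', 150), ('wine', 178)]
```

Catching only TypeError keeps the "gasp!" factor down: any genuinely unexpected failure still surfaces.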
This is something of a weird reflective programming task in both languages, so neither can be blamed for inelegance.
@hhmacedo Perfect!
@hhmacedo I'm curious as to how the iris dataset comes to have a "length" of 600. Yes, there are 150 rows, each with 4 features, but what about the "Species" column?
@peterdrake iris and iris3 are organised in different ways: the former is a data frame with a Species column (5*150=750), while the latter is a 3-d array of 4*50 observations, with Species as the third dimension (4*50*3=600).
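The arithmetic in that reply checks out (a trivial sketch; the shapes are those of R's iris and iris3):

```python
# iris: a data frame with 150 rows and 5 columns (4 features + Species)
iris_len = 5 * 150       # 750
# iris3: a 50 x 4 x 3 array, with the 3 species as the third dimension
iris3_len = 4 * 50 * 3   # 600
print(iris_len, iris3_len)  # 750 600
```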
@hhmacedo "Ah-h-h," as Frank Herbert would say, "-h-h-h-h-h-h-h."
Thanks!
@peterdrake I'm sorry, but I don't know why I changed from "datasets" to "stats". Switching from memory size to length, the function would be:
ls(as.environment("package:datasets")) |>
  sapply(function(x) { length(unlist(get(x, as.environment("package:datasets")))) })