"""
    Dataset <: AbstractDataset

An `AbstractDataset` that stores a set of named columns.

The columns are normally `AbstractVector`s stored in memory.

# Constructors
```julia
Dataset(pairs::Pair...; makeunique::Bool=false)
Dataset(pairs::AbstractVector{<:Pair}; makeunique::Bool=false)
Dataset(ds::AbstractDict)
Dataset(; kwargs...)
Dataset(columns::AbstractVecOrMat, names::Union{AbstractVector, Symbol};
        makeunique::Bool=false)
Dataset(table)
Dataset(::DatasetRow)
```
# Keyword arguments
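- `makeunique` : if `false` (the default), an error will be raised if duplicate
  column names are found; if `true`, duplicate names are suffixed with `_i`
  (`i` starting at 1 for the first duplicate).

(Note that not all constructors support this keyword argument.)
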
# Details on behavior of different constructors

A vector of `Pair`s, a list of `Pair`s as positional arguments, or a list of
keyword arguments can be passed. In each case a pair maps a column name to a
column value, and the column name must be a `Symbol` or a string. Alternatively,
a dictionary can be passed to the constructor, in which case its entries define
the column name and column value pairs. If the dictionary is a `Dict`, then
column names are sorted in the returned `Dataset`.

In all the constructors described above, a column value can be a vector, which
is consumed as is, or an object of any other type (except `AbstractArray`). In
the latter case the passed value is automatically repeated to fill a new vector
of the appropriate length. As a special case, values stored in a `Ref` or a
`0`-dimensional `AbstractArray` are unwrapped and then treated in the same way.
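
The recycling rule above can be sketched as follows. This is a minimal
illustration, assuming the package is loaded as `InMemoryDatasets` and that
`size` works on a `Dataset` as it does for other tabular types; the column names
are arbitrary.

```julia
using InMemoryDatasets

# `b` is a scalar, so it is repeated to match the length of `a` (3 rows)
ds = Dataset(a = [1, 2, 3], b = 0)

# a `Ref` is unwrapped first and its content is then repeated the same way,
# so each row of `c` holds the wrapped two-element vector
ds_ref = Dataset(a = [1, 2, 3], c = Ref([10, 20]))

size(ds), size(ds_ref)
```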

It is also allowed to pass a vector of vectors or a matrix as the first
argument. In this case the second argument must be a vector of `Symbol`s or
strings specifying column names, or the symbol `:auto` to generate column names
`x1`, `x2`, ... automatically.

If a single positional argument is passed to a `Dataset` constructor, it is
assumed to be of a type that implements the
[Tables.jl](https://github.com/JuliaData/Tables.jl) interface, which is used to
materialize the returned `Dataset`.

Finally, a `Dataset` can be constructed from a `DatasetRow`.

# Notes

The `allowmissing` function is called on all columns passed to the constructor
before they are added to the output data set.

By default an error will be raised if duplicates in column names are found. Pass
`makeunique=true` keyword argument (where supported) to accept duplicate names,
in which case they will be suffixed with `_i` (`i` starting at 1 for the first
duplicate).
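
As a hedged illustration of the note above, the sketch below passes duplicate
names to the vector of `Pair`s constructor, which accepts `makeunique`. It
assumes `names` returns the column names of a `Dataset`; per the note, the
second `a` is expected to be renamed to `a_1`.

```julia
using InMemoryDatasets

# without `makeunique=true` the duplicated name `:a` would raise an error
ds = Dataset([:a => [1, 2], :a => [3, 4]], makeunique=true)

# inspect the resulting column names
names(ds)
```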

If an `AbstractRange` is passed to a `Dataset` constructor as a column, it is
always collected to a `Vector`. As a general rule, all functions in
InMemoryDatasets.jl materialize `AbstractRange` values to a `Vector` before
storing them in a `Dataset`.

`Dataset` can store only columns that use 1-based indexing. Attempting
to store a vector using non-standard indexing raises an error.

The `Dataset` type is designed to allow column types to vary and to be changed
dynamically after it is constructed.

# Examples
```jldoctest
julia> Dataset((a=[1, 2], b=[3, 4])) # Tables.jl table constructor
2×2 Dataset
 Row │ a         b
     │ identity  identity
     │ Int64?    Int64?
─────┼────────────────────
   1 │        1         3
   2 │        2         4

julia> Dataset([(a=1, b=0), (a=2, b=0)]) # Tables.jl table constructor
2×2 Dataset
 Row │ a         b
     │ identity  identity
     │ Int64?    Int64?
─────┼────────────────────
   1 │        1         0
   2 │        2         0

julia> Dataset("a" => 1:2, "b" => 0) # Pair constructor
2×2 Dataset
 Row │ a         b
     │ identity  identity
     │ Int64?    Int64?
─────┼────────────────────
   1 │        1         0
   2 │        2         0

julia> Dataset([:a => 1:2, :b => 0]) # vector of Pairs constructor
2×2 Dataset
 Row │ a         b
     │ identity  identity
     │ Int64?    Int64?
─────┼────────────────────
   1 │        1         0
   2 │        2         0

julia> Dataset(Dict(:a => 1:2, :b => 0)) # dictionary constructor
2×2 Dataset
 Row │ a         b
     │ identity  identity
     │ Int64?    Int64?
─────┼────────────────────
   1 │        1         0
   2 │        2         0

julia> Dataset(a=1:2, b=0) # keyword argument constructor
2×2 Dataset
 Row │ a         b
     │ identity  identity
     │ Int64?    Int64?
─────┼────────────────────
   1 │        1         0
   2 │        2         0

julia> Dataset([[1, 2], [0, 0]], [:a, :b]) # vector of vectors constructor
2×2 Dataset
 Row │ a         b
     │ identity  identity
     │ Int64?    Int64?
─────┼────────────────────
   1 │        1         0
   2 │        2         0

julia> Dataset([1 0; 2 0], :auto) # matrix constructor
2×2 Dataset
 Row │ x1        x2
     │ identity  identity
     │ Int64?    Int64?
─────┼────────────────────
   1 │        1         0
   2 │        2         0
```
"""
struct Dataset <: AbstractDataset
    columns::Vector{AbstractVector}
    colindex::Index
    attributes::Attributes

    # the inner constructor should not be used directly
    function Dataset(columns::Union{Vector{Any}, Vector{AbstractVector}},
                     colindex::Index; copycols::Bool=true)
        if length(columns) == length(colindex) == 0
            return new(AbstractVector[], Index(Dict{Symbol, Int}(), Symbol[], Dict{Int, Function}(), Int[], Int[], false, colindex.perm, colindex.starts, 1, false), Attributes())
        elseif length(columns) != length(colindex)
            throw(DimensionMismatch("Number of columns ($(length(columns))) and number of " *
                                    "column names ($(length(colindex))) are not equal"))
        end

        # find the common length of all vector columns
        len = -1
        firstvec = -1
        for (i, col) in enumerate(columns)
            if col isa AbstractVector
                if len == -1
                    len = length(col)
                    firstvec = i
                elseif len != length(col)
                    n1 = _names(colindex)[firstvec]
                    n2 = _names(colindex)[i]
                    throw(DimensionMismatch("column :$n1 has length $len and column " *
                                            ":$n2 has length $(length(col))"))
                end
            end
        end
        len == -1 && (len = 1) # we got no vectors so make one row of scalars

        # threads are used only when there are more than 100 columns;
        # it is not a good idea to use threads when we have many rows (memory-wise)
        if length(columns) > 100
            Threads.@threads for i in eachindex(columns)
                columns[i] = _preprocess_column(columns[i], len, copycols)
            end
        else
            for i in eachindex(columns)
                columns[i] = _preprocess_column(columns[i], len, copycols)
            end
        end

        # only columns using 1-based indexing can be stored
        for (i, col) in enumerate(columns)
            firstindex(col) != 1 && _onebased_check_error(i, col)
        end

        new(convert(Vector{AbstractVector}, columns), colindex, Attributes())
    end
end

function _preprocess_column(col::Any, len::Integer, copycols::Bool)
    if col isa AbstractRange
        return allowmissing(collect(col))
    elseif col isa AbstractVector
        if isa(col, BitVector)
            return convert(Vector{Union{Bool, Missing}}, col)
        else
            _res = allowmissing(col)
            if copycols
                _res === col ? copy(_res) : _res
            else
                _res
            end
        end
    elseif col isa Union{AbstractArray{<:Any, 0}, Ref}
        x = col[]
        return fill!(allocatecol(Union{Missing, typeof(x)}, len), x)
    elseif col isa AbstractArray
        throw(ArgumentError("adding AbstractArray other than AbstractVector " *
                            "as a column of a data set is not allowed"))
    else
        return fill!(allocatecol(Union{Missing, typeof(col)}, len), col)
    end
end

# Create a Dataset as a copy of an existing Dataset
Dataset(df::Dataset) = copy(df)

# Create a Dataset from `Symbol => value` pairs passed as positional arguments
function Dataset(pairs::Pair{Symbol, <:Any}...; makeunique::Bool=false)::Dataset
    colnames = [Symbol(k) for (k, v) in pairs]
    columns = Any[v for (k, v) in pairs]
    return Dataset(columns, Index(colnames, makeunique=makeunique))
end

# Create a Dataset from `string => value` pairs passed as positional arguments
function Dataset(pairs::Pair{<:AbstractString, <:Any}...; makeunique::Bool=false)::Dataset
    colnames = [Symbol(k) for (k, v) in pairs]
    columns = Any[v for (k, v) in pairs]
    return Dataset(columns, Index(colnames, makeunique=makeunique))
end

# Create a Dataset from a vector of pairs
# this is needed as a workaround for Tables.jl dispatch
function Dataset(pairs::AbstractVector{<:Pair}; makeunique::Bool=false)
    if isempty(pairs)
        return Dataset()
    else
        if !(all(((k, v),) -> k isa Symbol, pairs) || all(((k, v),) -> k isa AbstractString, pairs))
            throw(ArgumentError("All column names must be either Symbols or strings (mixing is not allowed)"))
        end
        colnames = [Symbol(k) for (k, v) in pairs]
        columns = Any[v for (k, v) in pairs]
        return Dataset(columns, Index(colnames, makeunique=makeunique))
    end
end

# Create a Dataset from a dictionary mapping column names to column values
function Dataset(d::AbstractDict)
    if all(k -> k isa Symbol, keys(d))
        colnames = collect(Symbol, keys(d))
    elseif all(k -> k isa AbstractString, keys(d))
        colnames = [Symbol(k) for k in keys(d)]
    else
        throw(ArgumentError("All column names must be either Symbols or strings (mixing is not allowed)"))
    end

    colindex = Index(colnames)
    columns = Any[v for v in values(d)]
    df = Dataset(columns, colindex)
    # a `Dict` has no defined key order, so columns are sorted by name
    d isa Dict && select!(df, sort!(propertynames(df)))
    return df
end

# Create a Dataset from keyword arguments (`copycols` is treated as an option,
# `makeunique` is rejected, every other keyword becomes a column)
function Dataset(; kwargs...)
    if isempty(kwargs)
        Dataset([], Index())
    else
        cnames = Symbol[]
        columns = Any[]
        copycols = true
        for (kw, val) in kwargs
            if kw === :copycols
                if val isa Bool
                    copycols = val
                else
                    throw(ArgumentError("the `copycols` keyword argument must be Boolean"))
                end
            elseif kw === :makeunique
                throw(ArgumentError("the `makeunique` keyword argument is not allowed " *
                                    "in Dataset(; kwargs...) constructor"))
            else
                push!(cnames, kw)
                push!(columns, val)
            end
        end
        Dataset(columns, Index(cnames), copycols=copycols)
    end
end

# Create a Dataset from a vector of columns and a vector of `Symbol` column names
function Dataset(columns::AbstractVector, cnames::AbstractVector{Symbol};
                 makeunique::Bool=false, copycols::Bool=true)::Dataset
    if !(eltype(columns) <: AbstractVector) && !all(col -> isa(col, AbstractVector), columns)
        throw(ArgumentError("columns argument must be a vector of AbstractVector objects"))
    end
    return Dataset(collect(AbstractVector, columns),
                   Index(convert(Vector{Symbol}, cnames), makeunique=makeunique),
                   copycols=copycols)
end

# Same as above, with column names given as strings
Dataset(columns::AbstractVector, cnames::AbstractVector{<:AbstractString};
        makeunique::Bool=false, copycols::Bool=true) =
    Dataset(columns, Symbol.(cnames), makeunique=makeunique, copycols=copycols)

# Create a Dataset from a vector of vectors and a vector of `Symbol` column names
Dataset(columns::AbstractVector{<:AbstractVector}, cnames::AbstractVector{Symbol};
        makeunique::Bool=false, copycols::Bool=true)::Dataset =
    Dataset(collect(AbstractVector, columns),
            Index(convert(Vector{Symbol}, cnames), makeunique=makeunique),
            copycols=copycols)

# Same as above, with column names given as strings
Dataset(columns::AbstractVector{<:AbstractVector}, cnames::AbstractVector{<:AbstractString};
        makeunique::Bool=false, copycols::Bool=true) =
    Dataset(columns, Symbol.(cnames); makeunique=makeunique, copycols=copycols)

# Create a Dataset from a vector of columns with automatically generated names
function Dataset(columns::AbstractVector, cnames::Symbol; copycols::Bool=true)
    if cnames !== :auto
        throw(ArgumentError("if the first positional argument to Dataset " *
                            "constructor is a vector of vectors and the second " *
                            "positional argument is passed then the second " *
                            "argument must be a vector of column names or :auto"))
    end
    return Dataset(columns, gennames(length(columns)), copycols=copycols)
end

# Create a Dataset from a matrix; each matrix column becomes a data set column
Dataset(columns::AbstractMatrix, cnames::AbstractVector{Symbol}; makeunique::Bool=false) =
    Dataset(AbstractVector[columns[:, i] for i in 1:size(columns, 2)], cnames,
            makeunique=makeunique, copycols=false)

# Same as above, with column names given as strings
Dataset(columns::AbstractMatrix, cnames::AbstractVector{<:AbstractString};
        makeunique::Bool=false) =
    Dataset(columns, Symbol.(cnames); makeunique=makeunique)

# Create a Dataset from a matrix with automatically generated column names
function Dataset(columns::AbstractMatrix, cnames::Symbol)
    if cnames !== :auto
        throw(ArgumentError("if the first positional argument to Dataset " *
                            "constructor is a matrix and a second " *
                            "positional argument is passed then the second " *
                            "argument must be a vector of column names or :auto"))
    end
    return Dataset(columns, gennames(size(columns, 2)), makeunique=false)
end

# Discontinued constructors
# Each method below only throws an informative error

Dataset(matrix::Matrix) =
    throw(ArgumentError("`Dataset` constructor from a `Matrix` requires " *
                        "passing :auto as a second argument to automatically " *
                        "generate column names: `Dataset(matrix, :auto)`"))

Dataset(vecs::Vector{<:AbstractVector}) =
    throw(ArgumentError("`Dataset` constructor from a `Vector` of vectors requires " *
                        "passing :auto as a second argument to automatically " *
                        "generate column names: `Dataset(vecs, :auto)`"))

Dataset(column_eltypes::AbstractVector{T}, cnames::AbstractVector{Symbol},
        nrows::Integer=0; makeunique::Bool=false) where T<:Type =
    throw(ArgumentError("`Dataset` constructor with passed eltypes is " *
                        "not supported. Pass explicitly created columns to a " *
                        "`Dataset` constructor instead."))

Dataset(column_eltypes::AbstractVector{<:Type}, cnames::AbstractVector{<:AbstractString},
        nrows::Integer=0; makeunique::Bool=false) =
    throw(ArgumentError("`Dataset` constructor with passed eltypes is " *
                        "not supported. Pass explicitly created columns to a " *
                        "`Dataset` constructor instead."))

"""
    copy(ds::Dataset)

Copy data set `ds`.

> This function uses `copy` rather than `deepcopy` internally, so it is not safe to use when observations are mutable: mutable values remain shared between `ds` and its copy.
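
The caveat above can be illustrated with a column whose observations are
themselves vectors. This is a minimal sketch, assuming the package is loaded as
`InMemoryDatasets` and that `ds[row, col]` returns the stored observation; the
column name `x` is arbitrary.

```julia
using InMemoryDatasets

# each observation in column `x` is a mutable vector
ds = Dataset(x = [[1, 2], [3]])
ds2 = copy(ds)

# mutate an observation through the original data set
push!(ds[1, :x], 99)

# checks whether both data sets hold the very same inner vector object
ds[1, :x] === ds2[1, :x]
```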
"""
function Base.copy(ds::Dataset)
    # TODO currently if the observation is mutable, copying data set doesn't protect it
    # build a new Dataset from copies of the columns and of the index
    newds = Dataset(copy(_columns(ds)), copy(index(ds)))
    setinfo!(newds, _attributes(ds).meta.info[])
    return newds
end