v0.5 Diadem

v0.5.3 (12.01.2024)

Highlights

  • Added experimental keras.Model.compile(..., sol_compile=True, sol_vdims=[...]) support
  • Bringing back CUDA support! 🎉
  • Fixed random numbers using MKL/VSL within PyTorch >2.0 on X86.
  • Significantly improved TensorFlow RNN performance and numerical stability.
  • Added sol.config['vdims'] = ... to enable SOL's vdims system also with torch.compile(...), which cannot pass such information to the backend compiler.
  • You can now limit the number of parallel JIT processes using the SOL_JIT_THREADS env var, in case you encounter out-of-memory crashes while compiling (see the sketch after this list).
  • SOL no longer segfaults during compilation when more than 50% of system memory is already in use.
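As an illustration, here is a minimal sketch combining these additions. The Keras model is a placeholder, and the exact value format for the vdims settings is an assumption; please consult the vdims documentation:

```python
import os
os.environ["SOL_JIT_THREADS"] = "4"   # limit parallel JIT processes before SOL starts compiling

import sol
from tensorflow import keras

# Global vdims default, e.g. for torch.compile(...), which cannot forward
# such information to the backend compiler (value format assumed):
sol.config['vdims'] = [True]

# Experimental direct SOL compilation from Keras:
model = keras.Sequential([keras.layers.Dense(10, input_shape=(32,))])
model.compile(optimizer="adam", loss="mse",
              sol_compile=True,   # hand the model to SOL
              sol_vdims=[True])   # assumed: marks the batch dimension as variable
```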

Breaking Changes

  • Due to the increasing number of unresolved issues in the TensorFlow PluggableDevice API (e.g., #55497, #57095, #60883 or #60895), we decided to no longer maintain our veda-tensorflow extension. Therefore, you can no longer use with tf.device("/VE:0"):. Instead, please use Transparent Offloading via sol.device.set('ve', 0), as shown in the sketch below. We are sorry for the inconvenience, but we see no commitment from the TensorFlow team to accept our bugfixes, nor to fix the issues themselves.
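A migration sketch; sol_model stands for a model previously returned by sol.optimize(...) and x for an input tensor, both placeholders:

```python
import sol

# Before (no longer supported):
#   with tf.device("/VE:0"):
#       y = model(x)

# After: Transparent Offloading onto VE device 0
sol.device.set('ve', 0)
y = sol_model(x)  # executed on the NEC SX-Aurora through SOL
```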

Closed Issues

  • #1305 [PyTorch] Can't optimize TorchScriptModules
  • #1302 [TF] Unable to broadcast (1 * #0) and 2 in TF_Issue#1201
  • #1301 [Compiler] Investigate why NCC or ISPC sometimes stop when compiling many jobs
  • #1300 [TF] BCE: O3_D0_Output is not stored in ctx
  • #1299 [TF] Segfault when running RNN test suite
  • #1297 [TF] Prefixsum: 'T4_D0_Input (Reverse) has no src!'
  • #1294 [TF] TF_Issue#985 returns wrong results
  • #1293 [TF] tf.nn.max_pool_with_argmax returns wrong indices
  • #1291 [DFP] New Cache Planning fails in simple expansion example
  • #1288 [Runtime] Unify Shutdown/Unload methods in device runtimes
  • #1287 [CMake] Prevent SOL to run out of memory when compiling dependencies
  • #1286 [DFP] CUDA uses too much shared memory if we use too large group_cnt
  • #1283 [DFP] Deprecate renderRootCache
  • #1280 [PyTorch] Expected a value of type 'Optional[float]' for argument 'value' but instead found type 'int'. in 'dm_nfnet_f0'
  • #1279 [PyTorch] Check v2.1.2 compatibility
  • #1278 [PyTorch] PY_Issue_964 fails with `Buffer T25_D0_Output is already chained with C1_D0_Output`
  • #1277 [PyTorch] Investigate "object has no attribute or method '__ne__'" when compiling multiple models
  • #1274 [Runtime] deprecated sol::runtime::Tensor and use sol::Tensor instead
  • #1273 [CAPI] move sol/autotune.h to sol/capi/autotune.h
  • #1272 [SDK] Update GEMM example with new autotune routine
  • #1271 [DNNL] Deprecate current Autotune impl and move to handle
  • #1270 [NVIDIA] Fix NVIDIA profiling
  • #1267 [Tests] Warn always about OMP issues
  • #1266 [HLIR] Arithmetic DType detection fails for S64/F32, should be F32 and not F64
  • #1265 [HLIR] Cast -> Cast optimization
  • #1262 [DFP] transform `X = Y / broadcast(Z)` to `ZZ = 1/Z; X = Y * broadcast(ZZ)`
  • #1260 [CUBLAS] Fix SegFault in autotune
  • #1259 [Tests] Mark if it's an IDENTICAL match (==)
  • #1258 [MKL] Deprecate current Autotune impl and move to handle
  • #1257 [DNNL] Deprecate current autotune implementation and move into DNNL-handle
  • #1256 [CUBLAS] move autotune into handle, similar to VEBLAS and share impl.
  • #1254 [PyTorch] PY_Rand test case runs always on CPU?
  • #1253 posix_spawn fails when cmd contains ""
  • #1252 [JIT] Refactor API
  • #1249 [CUDA] Compile backend APIs on target machine, to link to correct CUDA version
  • #1248 [JIT] Out of memory when calling any JIT compiler, when already >50% of system memory is filled.
  • #1247 [PyTorch] Can't trace Inception and GoogleNet using torch.compile(..., backend='sol')
  • #1246 SoftMax can produce NAN in Bloom
  • #1245 [HLIR] hlir::cast({...}) casts away float of constants e.g. in pow(float, int)
  • #1244 [ISPC] Cast from float to bool results always in False
  • #1243 [Python] implement sol.set_vdims(...) as global default.
  • #1242 [PyTorch] Don't duplicate parameters while optimizing (to prevent crashes)
  • #1241 [PyTorch] Check if we can automatically interleave SOL and non-SOL function using torch.compile
  • #1239 [PyTorch] enable parsing of `torch.autograd.Function` constructs
  • #1235 [HLIR] Constant -> PrefixSum -> Arange
  • #1233 [PyTorch] Add bitwise_x
  • #1232 [SDK] Add FindSOL.cmake script
  • #1231 [SDK] Fix GCC9 compatibility
  • #1229 [SDK] Compiled libraries fail during loading due to missing CXX11 abi symbols
  • #1225 [SDK] Add Examples
  • #1222 [DNNL] Upgrade v3.3.1
  • #1220 [HLIR/DFP] Split AvgPooling into SumPooling + AvgPoolingNorm
  • #1219 [PyTorch] PyTorch Module tries to be compiled with WITH_CUDA=1 even if no CUDA is available.
  • #1218 [PyTorch] Test v2.1.1
  • #1217 [TensorFlow] Test v2.15.0
  • #1216 [TensorFlow] Test v2.14.1
  • #1215 [DFP] unvectorized LoopStack in Reduction
  • #1213 [UI] Add AutoTuner to ProgressBar
  • #1212 [HLIR] Remove Unpooling::requiresZero as it's obsolete
  • #1210 [HLIR] Missed cluster fusion opportunity in Alexnet BWD pass
  • #1209 [DFP] Split Total and Bias Reductions
  • #1208 [DFP] Input Loop Merging
  • #1207 [Refactoring] Refactor Network* in HLIR to Network&
  • #1205 [DFP] Split Full-Reduction Layers
  • #1203 [Keras] Does the call to sol_fwd really require the output shapes to be provided?
  • #1202 [Keras] Simple example disables ALL vdims
  • #1201 [API] Don't duplicate outputs anymore, but instead do the duplication in Wrapper
  • #1200 [API] Store sol_network* net, sol_wrapper_container* inputs, sol_wrapper_container* outputs in Optimizer
  • #1199 [Runtime] Enable to use "DefaultStreams" if framework does not provide any
  • #1198 [PyTorch] No need to pass dtypes to `module.cpp` as they get initialized in YAAL
  • #1197 [Runtime] Remove fetching of all tensor data in `module.cpp` and instead implement as lazy call in `handle.cpp` that gets executed in `sol_ptr` if needed.
  • #1196 [Runtime] Move handling of malloc/free/... to the C-API that is directly accessible in sol_handle without going through `runtime::Handle`, and instead use these direct callbacks also in `runtime::Handle`
  • #1195 [PyTorch] huge overhead when model has many parameters/buffers
  • #1194 [CUDA] add `__launch_bounds__` in generated code
  • #1192 [SDK] Add DNN and DFP headers to adv-sdk
  • #1187 [AT_MT19937] Vectorize 64Bit
  • #1184 [DFP] Too much shared memory in PY_Norms
  • #1183 [PyTorch] Wrong gradient dtype in torch.where(cond, float32, int64)
  • #1182 [Tests] Fix Remaining Errors
  • #1181 [HLIR] Move Rand, Bernoulli and Dropout algorithm selection to respective backends
  • #1179 [TF] wrong indices for tf.nn.max_pool_with_argmax in some situations
  • #1178 [TF] AvgPooling produces deviating gradients
  • #1177 [HLIR] In PY_Norms LayerNorm we have CONSTANTs that get stored as COPY in fwd pass for training
  • #1176 [PyTorch] Fix LayerNorm Weight Gradient
  • #1175 [HLIR] Fix BatchNorm training inaccuracies
  • #1174 [HLIR] Don't create copies of Broadcast!
  • #1172 [HLIR] Implement Functional Sigmoid and Tanh using high level operators, to make benefit of HLIR optimization passes.
  • #1171 [TensorFlow] Performance Regression in DNNL Conv
  • #1170 [PyTorch] Fix AlphaDropout
  • #1169 [PyTorch] Fix U8 dtype in PyTorch Handle
  • #1168 [CUDNN] Fix NHWC Conv execution
  • #1167 [DNNL] Revert using PYPI packages, as these do not keep up with Github releases
  • #1164 [PyTorch] Test v2.1.0
  • #1161 [TF] Test v2.14.0
  • #1160 [TF] Test v2.13.1
  • #1159 [PyTorch] Always use torch.compile(...) for parsing models for better compatibility
  • #1158 [Python] Upgrade importlib usage
  • #1156 [PyTorch] X86 Dropouts fail
  • #1154 [DFP] Move Cache to beginning of parent Cores (can this be stored within the LoopStack?)
  • #1152 [Keras] Can we get better estimates than `1` when Input-Size is variable?
  • #1151 [TESTS] Add option to force comparison with host system
  • #1150 [CUDA] Enable sub-warp grouping with sizes 1, 2, 4, 8 and 16
  • #1148 [VEDNN] Port to new SOL-namespace API
  • #1147 [DNN-RNN] Evaluate if, instead of copying the input of the activations, we can copy the output.
  • #1146 [OMP] TF causes MKL to run very slow when using more threads than cores
  • #1145 [CUBLAS/CUDNN] Store handle as thread_local.
  • #1143 [HLIR] Mask does not need gradient, although it is indicated as having gradient in debug output
  • #1142 [PyTorch] Use at::Tensor in `module.cpp` and `handle.cpp` instead of doing the allocations on our own.
  • #1140 [DNN-RNN] Verify bwd SimpleRNN with masking=True for normal and normed activations
  • #1138 [DNN-RNN] Implement new Special Layers using GEMM
  • #1137 [CUBLAS] Investigate CUBLAS overlap error in TF
  • #1136 [DNN-RNN] Remove dB0 + dB1 entirely from RNNDeActivation
  • #1135 [DNN] use a Macro to manage `using` centralized
  • #1134 [DNN-GCC] Investigate why LSTM_Bwd activationX is taking most of the execution time
  • #1133 [CUBLAS] use 64bit API
  • #1132 [DNN-RNN] Evaluate if permuting the WH matrix to a more vector-pleasant format improves performance
  • #1131 [RNN-NCC] Improve performance of VE RNN kernels
  • #1130 [VE] Test if NCC v5 has solved the constexpr problem
  • #1117 [TF] `unsupported operand type(s) for *: 'int' and 'NoneType'` in TF_Tile
  • #1116 [CUDA DFP] Wrong results in TF_Reverse
  • #1114 [Compiler] Verify that LayerOutputs have dests and are not dangling
  • #1113 [PyTorch] PY_Norms fails using `-bs 5 -vbs -d nvidia`
  • #1112 [CURAND] Store random states per GPU
  • #1111 [CUFFT] Cache Plans
  • #1107 [TF] Can we use `op.skip_input_indices` to replace requires_grad?
  • #1105 [TF] ValueError: The inequality of unknown TensorShapes is undefined. in TF v2.6.5 when running BWD pass
  • #1103 [CUDNN] Catch invalid padding modes for Conv, e.g., first layer of AlexNet
  • #1101 [DNN-RNN] Optimize TF specific RNN Bwd Kernels
  • #1100 [Runtime] Warn if user does not specify sol.set_seed
  • #1099 [DNN-RNN] Race Condition in accumulating bias
  • #1097 [HLIR+DFP] Chaining multiple Padding Layers fails
  • #1094 [PyTorch] Module does not compile when CUDA is not installed.
  • #1093 [TF] RNN fails with `Layer T13_D0 is already registered to this network!`
  • #1091 [DNN-RNN] Enable RecurrentDropout in SimpleRNN
  • #1090 [HLIR] UnPooling throws `Assertion 'inSize(kd) <= outSize(kd)' failed!`
  • #1089 [RNN] Performance
  • #1087 [CUDA] Make CUDA Handler Impl ThreadSafe
  • #1086 [TF] Check Compatibility v2.13.0
  • #1083 [PyTorch] Investigate overhead when executing with CUDA
  • #1082 [HLIR] Don't do DFP Bernoulli or BatchNormInference-style fake Algos
  • #1081 [PyTorch] CUDA specific Dropout impl
  • #1080 [PyTorch] torch.mm
  • #1079 [PyTorch] GPU Bernoulli
  • #1078 [DFP] Reenable Grouped Cores
  • #1077 [DFP] identifier "L20m5_L21" is undefined
  • #1076 [PyTorch] aten::full
  • #1075 [ISPC] Evaluate if we really need an ISPC version of AT_MT19937
  • #1072 [PyTorch] File BugReport that DeviceType::CUDA does not register to c10::GetAllocator(...)
  • #1071 [DNNL] store temporary memory consumption in Algo
  • #1070 [DFP] Optimize initUsesVDims by using traverseBool
  • #1069 [DFP] Fix Operation Type::Condition
  • #1068 [PyTorch] Wrong results in PY_Rand on x86
  • #1067 [DFP] Remove double cast
  • #1066 [C++] gen::string::join
  • #1064 [Tests] Report correct GPU name
  • #1063 [DFP] If same non-cluster value is read, create a copy node to enforce loop merging within DFP?
  • #1061 [HLIR] PrefixSum of len 1 == memcpy
  • #1056 [DNNL] Upgrade 3.1.1
  • #1055 [TF] modify keras.Model.compile to enable direct SOL compilation
  • #1054 [CUBLAS] port to new API
  • #1053 [CUDNN] port to new API
  • #1052 [CUDA] complete all open issues
  • #1051 [CUDA] Port to new DFP API
  • #1050 [CUDA] add TF_PHILOX support
  • #1049 [CURAND] add AT_PHILOX support
  • #1047 [OpenSSL] Upgrade 1.1.1u
  • #1046 [YAAL] Switch `sol::dnn::X::rand::create` to `sol_random_seed`-callback
  • #1045 [DNNL] '[DNNL ERROR] The operation failed because of incorrect function arguments.' in Reorder with VDIMS after API upgrade
  • #1044 [TF] DenseNet: Segfault
  • #1042 [VEBLAS] Performance regression with BatchSize 2-7 on AlexNet
  • #1041 [Compiler] prevent compilation from overlapping the autotuning phase
  • #1040 [HLIR] Move BatchNormMean/Var to DFP module
  • #1039 [SQLITE] Upgrade v3.42.0
  • #1038 [HLIR] Broadcasting using VDims, where one is 1 and other is != 1 causes VDims merge conflict
  • #1037 [PyTorch] Reduce Testcase, all var(False) fail
  • #1036 [PyTorch] Test v2.0.1 compatibility
  • #1035 [ISPC] Upgrade v1.20.0
  • #1033 [DFP] Circular Schedule in tf.layers.ReduceBool testcases
  • #1031 [DNNL] Bugfix Dilated Conv
  • #1016 [PyTorch] add aten::eye
  • #1015 [RNN] Investigate NAN in non "linear" cases
  • #1013 [PyTorch] Bernoulli(scalar) produces different results on X86 and VE
  • #1008 [DFP] Error: Ignoring redeclaration of symbol "L212IN0__11_10L" in Interpolation TestCase
  • #993 [TF] Check if we handle local seeds correctly when we use multiple NNs, and if they match the seeds generated in the train methods, etc.
  • #985 [HLIR] We can use `transform::duplicates` on Rand if they have same local seeds, and there is no dependency between them.
  • #982 [OMP] Parallelize tf_philox across groups * sequences
  • #979 [HLIR] Refactor RAND system
  • #974 [DNN-RNN] Support NSC input format without permutations
  • #972 [DNN-RNN] Remove WH input if H == 0 and seq == 1
  • #965 [HLIR] Improve Duplicates detection
  • #958 [ISPC] Evaluate if we can use new ISPC Template feature
  • #955 [DFP] PY_Issue_945 creates non vectorized implementation
  • #943 [RNN] Make Masks INT8, because they don't need to be vectorized anyway
  • #921 [DFP] Loop Lookup "Chicken/Egg"-Problem preventing GPT2
  • #917 [HPTT] Add ENABLE_AVX flag
  • #910 [PyTorch] Investigate Performance regression Resnet18 vs Resnet50
  • #891 [TF] Investigate using tf.function for inner_call
  • #886 [CUDA] Add Custom Layer Support
  • #884 [TF] RNN with stateful=True sometimes produces deviations in "recurrent_kernel" within the 2nd training iteration
  • #874 [HLIR] Merge into atomic Slice -> Broadcast operation
  • #872 [DNNL] Add Reorder time to Conv Autotuning measurements
  • #869 [YAAL] Experiment to perform sol_shape_checks using OMP
  • #857 [DNNL] Upgrade v3.1
  • #819 [TF] add keras.layers.GroupNormalization (requires v2.11.0)
  • #785 [VEBLAS] Analyze performance of different parallelization strategies and small/big batchsizes
  • #769 [PyTorch] RegNet Inference Accuracy Problems on AVX512
  • #768 [DFP] DFP sometimes allocates too much stack memory
  • #655 [TF] check accuracy issues on AlexNet running on X86
  • #650 [TF] tf.keras.applications.mobilenet.Mobilenet accuracy problem
  • #588 [TF] LayerNormalization
  • #476 [PyTorch] Einsum
  • #474 [CUDA] WhereTrue
  • #473 [CUDA] Value2VDim
  • #467 [CUDA] PrefixSum
  • #405 [CUFFT] FFT
  • #400 [CUDA] add Struct functions
  • #213 [VE] Add checks to jit-ncc to verify that necessary loops really get vectorized
  • #176 [CUDA] RNN
  • #85 [HLIR] H-Merge identical parallel layers
  • #79 [DFP/Performance] H-Merge (i.e. BERT BT)
v0.5.2 (28.04.2023)

Highlights

  • Deprecated DL4J and Unikraft support.
  • Significantly improved compatibility of SOL integration into PyTorch and TensorFlow.
  • Experimental Custom Layer support for PyTorch and TensorFlow.
  • Lots of internal bugfixes and improvements, e.g., improved code generation of loop indices to reduce recomputation within compute kernels.
  • Added more compiler-specific env vars.
  • Added the SOL_CWD env var, enabling users to change the directory that SOL uses as its working directory.
  • SOL now implements the same random number generators as PyTorch and TensorFlow.
  • New compiler::deterministic config option to trade accuracy of the model for more performance. See configs.
  • Preliminary support for TensorBoard profiler. Set SOL_PROFILE=TENSORBOARD:FILENAME. Results will be stored in FILENAME.
  • You can now use torch.compile(model, backend='sol') as an alternative to sol.optimize(...) in PyTorch 2.0 and newer! See here for more details, and the sketch after this list.
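A combined sketch of the torch.compile integration and the TensorBoard profiler. The model is a placeholder, and the config key used for compiler::deterministic is an assumption derived from SOL's config syntax:

```python
import os
os.environ["SOL_PROFILE"] = "TENSORBOARD:profile"  # results will be stored in "profile"

import torch
import sol  # assumption: importing SOL registers the 'sol' backend

sol.config["compiler::deterministic"] = False  # assumed key syntax; trades accuracy for performance, see the configs docs

model = torch.nn.Linear(8, 4)
sol_model = torch.compile(model, backend='sol')  # alternative to sol.optimize(...)
print(sol_model(torch.randn(2, 8)))
```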

Known Issues

  • Since TensorFlow v2.10.0, issue #57095 causes problems within the PluggableDevice API: tensors that need to be placed in host memory randomly appear on the executing device. This problem got worse in v2.12.0 and can cause random segfaults when running TensorFlow workloads on the NEC SX-Aurora. This is a problem within TensorFlow that we cannot fix. Unfortunately, it doesn't seem that the TensorFlow team is going to fix this problem any time soon, although other vendors face the same problem. If you encounter random segfaults using TF + VE, please downgrade to TensorFlow v2.9.0.

  • torch.dropout(...) and torch.bernoulli(scalar) return different random numbers than SOL. This is caused by PyTorch issue #94388, which uses a different random number generator for Bernoulli/Dropout. In PyTorch v1.* this even causes different random numbers on Intel and AMD CPUs. So far, SOL only supports identical random numbers for torch.rand(...) and torch.bernoulli(Tensor); see the sketch below.
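A short sketch of which calls currently produce matching random numbers. The seeding is illustrative; sol.set_seed is referenced elsewhere in these notes, and its exact signature is assumed:

```python
import torch
import sol

sol.set_seed(42)        # assumed signature
torch.manual_seed(42)

x = torch.rand(4, 4)    # supported: identical random numbers as SOL
m = torch.bernoulli(x)  # supported: Tensor argument

# Not yet matching SOL (PyTorch #94388 uses a different generator):
#   torch.dropout(x, 0.5, True)
#   torch.bernoulli(torch.tensor(0.5))
```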

Closed Issues

  • #1034 [DFP-NCC] Error in PyTorch Narrow testcase
  • #1032 [DNNL] Upgrade 2.7.4
  • #1030 [PyTorch] Could not cast attribute 'num_batches_tracked' to type Tensor: Unable to cast to Tensor
  • #1029 [DimMapper] Assertion 'A.size() == B.size()' failed! in ShuffleNet
  • #1028 [PyTorch] aten::tensor
  • #1027 [SegFault] Investigate random SegFault
  • #1026 [DFP] ASSERT(p->groups() == 1 || p->groups() == p->inChannels()) failed
  • #1025 [HLIR] unable to find gradient in many TIMM networks
  • #1024 [HLIR] Conv: Assertion '!wdims.hasVDims()' failed!
  • #1023 [HLIR] Incompatible shapes within Arithmetic in GCResNet
  • #1022 [DNNL] The operation failed because of ...
  • #1021 [Parser] Can't initialize sol.hlir.Dim with [64.]
  • #1020 [Python] Make calls to sol.optimize(...) recoverable when an exception occurred
  • #1019 [PyTorch] Add support for "same" and "valid" in ConvXD
  • #1018 [PyTorch] add aten::Bool
  • #1017 [PyTorch] add aten::unbind
  • #1014 [RNN/GCC] Investigate performance regression for RNN with GCC 9.X
  • #1012 [TF] Fix TF executing Model
  • #1011 [Python] add check for OMP problems to `python3 -m sol`
  • #1010 [Python] `python3 -m sol` stopped working
  • #1009 [GCC] Remove GLIBCXX11ABI == 0
  • #1007 [Boost] Upgrade 1.82.0
  • #1006 [DNN/RNN] Fix NAN if Softmax + Linear is used as activation functions
  • #1005 [HLIR] remove duplicates not working in some situations
  • #1004 [HLIR] Transform GEMM with in==1 or out==1 to basic operations
  • #1003 [DNN/GCC] Move dnn::mkl::RNN* to dnn::gcc::RNN*
  • #1002 [DNN/RNN] Fix Memory Consumption Reporting
  • #1001 [RNN] Split (De)RNN into (De)RNNActivation and GEMM/(De)RNNDropout
  • #1000 [RNN] Fix #nan for large values of pow, tanh, ...
  • #998 [Sleef] CMake install libsleefgnuabi and add to Illyrian
  • #997 [MKL/RNN] Fix GCC auto-vectorization problems
  • #996 [OpenMP] Investigate Performance regression when sklearn is loaded
  • #994 [DFP] Remove "Buffer" from list of supported layers
  • #992 [WHL] Add backend-noop to nec-sol-core package
  • #989 [TF] Test v2.12.0
  • #988 [Python] private variables such as _DType__determine will be normal attributes in 3.11
  • #987 [Python] Replace `imp` with `importlib` in HLIR parser
  • #986 [SQLite] Upgrade v3.41.2
  • #983 [TF] Test v2.11.1
  • #977 [PyTorch] Upgrade v2.0.0
  • #976 [RNN] Remove unnecessary layer inputs + template params from RNN impl.
  • #975 [TF] SimpleRNN mismatch of Bias gradient with dropout != 0 and activation == ReLU
  • #973 [DNN-RNN] Report correct temporary memory allocations
  • #970 [SQLite] Upgrade v3.41.1
  • #969 [HUGO] Upgrade v0.111.3
  • #968 [RNN] Evaluate if we can move the (I @ WI) * DI part outside of the RNN cell
  • #967 [RNN] Compilation fails for Channelsize == 1
  • #966 [Python] Port sol.tests to termcolor
  • #964 [HLIR] YAAL does not allocate memory for Buffer
  • #963 [C-API] Encode requires_grad as int64_t and set specific bits, instead of wasting an entire array with that
  • #962 [HUGO] Upgrade v0.111.2
  • #961 [HLIR] Circular Cluster construction in SimpleRNN testcase
  • #960 [TF] Get RNN Dropouts correct
  • #959 [HLIR] add axis attribute to Rand
  • #957 [ISPC] Upgrade to v1.19.0
  • #956 [PyTorch] SOL's model parameters don't get updated through the optimizer
  • #953 [Keras] RNN recurrent_dropout needs to use separate Rand objects for each sequence
  • #951 [Profiler] Allow to specify output file
  • #950 [Keras] Investigate influence of MASKING on loss computation
  • #949 [Keras] Problem handling input_mask/output_mask propagation
  • #948 [HLIR] Transform where to pure arithmetic operation
  • #947 [RNN] doublecheck recurrent_dropout implementation
  • #946 [SQLite] Upgrade to v3.41.0
  • #945 [VE] Resnet101 does not converge
  • #944 [Python] upgrade np dtypes to new standard
  • #942 [RNN] Fix return_sequences + masking training
  • #941 [DFP-ISPC] Error: Ambiguous use of overloaded function "sol_dfp_ispc_max".
  • #940 [TF] Investigate differences in bwd pass for Average and MaxPooling
  • #939 [Keras] Dropouts are not disabled in inference
  • #936 [Keras] Debug Dropout behavior in basic CNN
  • #935 [Keras] Fix automatic setting of vdims=[True]
  • #934 [GCC] Add vdims support
  • #933 [VE] Fix TensorList synchronization
  • #932 [TF] Fix performance problem when using evaluate, predict or fit
  • #931 [TF] Investigate "Optimization loop failed" warning
  • #927 [PyTorch] MaskedFill, LogicalNot: found expected boolean, found int8_t
  • #926 [HLIR] Cast::copy
  • #925 [OpenSSL] Upgrade to v1.1.1t
  • #924 [Backends] Implement MT19937 and Philox Random number generators
  • #923 [Devices] Implement deterministic RAND mode
  • #922 [PyTorch] Fix TorchScript errors in PyTorchic BERT testcase
  • #916 [HLIR] Remove duplicates illegally merges Prefixsum with different shapes
  • #915 [DFP] Reorders can prevent vectorization in DFP
  • #911 [Profiler] Add TensorBoard integration
  • #909 [PyTorch] Disable torch.bernoulli within accuracy test runs
  • #907 [HLIR] Add Constant -> Reduction optimizations
  • #906 [ISPC] investigate why some random numbers are always zero
  • #905 [PyTorch] Debug Efficientnet BatchNorm buffers in training
  • #904 [PyTorch] Debug ConvNext wrong gradients in `layer_scale`
  • #903 [TF] Implement testcases for predict and evaluate
  • #902 [PyTorch] enable PyTorch to make a clone of the model, if the original is a PyTorch model itself
  • #901 [TF] Find workaround for TF giving identical names to RNN states in LSTM
  • #900 [CAPI] Evaluate if we still need CAPI lazy init
  • #899 [PyTorch] Debug Testcases
  • #898 [HLIR] merge duplicates where dims are identical when squeezed
  • #897 [PyTorch] list index out of range, when SOL does not use all parameters
  • #896 [PyTorch] "Missing Parameter" when SOL does not use all model parameters
  • #895 [TF] RNN inaccuracies
  • #894 [TF] LSTM.stateful = True uses identical name `lstm/Variable:0` as name for H and C
  • #893 [RNN] Fix RNN zero_output_for_mask
  • #890 [DL4J] Remove code as it is abandoned
  • #889 [Unikraft] Remove code as it is abandoned
  • #888 [DNNL] Pure Permutation Reorders can result in Segfault on X86
  • #885 [HLIR] don't Derivative::copy Reorders if their src is a Param
  • #883 [DNNL] Upgrade v2.7.3
  • #882 [RNN] Remove OM, seems we do not need it.
  • #881 [TF] analyze TF-TRT integration and check if we can do that with SOL
  • #878 [PyTorch] add requires_grad to module.cpp
  • #877 [PyTorch] Unroll Module/Sequence structure within the Renderer
  • #876 [ONNX] "SET training TO EVAL ONCE WE FIX #527"
  • #875 [DFP] Optimize placement of LookupCheck
  • #871 [DNNL] AutoTuning for inference sometimes chooses different layouts than training causing unnecessary reorders
  • #870 [DFP] Optimized Broadcast
  • #868 [DNNL] Add permute support
  • #867 [DFP] Loop Merging merges loop that should be flagged as unmergeable
  • #866 [Keras] change keras.evaluate to execute the training forward pass, not the inference pass
  • #865 [SQLITE] Upgrade v3.40.1
  • #864 [HUGO] Upgrade v0.109.0
  • #862 [PyTorch] Split handle.cpp into handle and module
  • #861 [VE] Backport VE to new runtime API
  • #860 [PyTorch] Move device check to set_tensors
  • #858 [YAAL] Move Shape and DType Checks to YAAL
  • #856 [TensorFlow] Setting KerasView.training needs to set the weight's training value
  • #855 [HLIR] Investigate to set Grads dynamically, similar to VDIMS
  • #854 [HLIR] Consider to not move weight dims from Reorder, if numel differ
  • #853 [CAPI] Disable LazyInit. No longer needed after we do lazy compilation of framework modules.
  • #852 [DFP] complete new indexing
  • #850 [PyTorch] Upgrade 1.13.1
  • #849 [Boost] Upgrade 1.81.0
  • #848 [RNN] implement Softmax Activations
  • #847 [Hugo] Upgrade 0.108.0
  • #846 Make Enums final
  • #845 Determine Absolute SOL_PATH once at startup, to prevent errors when users change CWD at runtime
  • #843 [TF] similar layer names 'lstm_8/concat:0' and 'lstm_8/transpose:0' cause 'duplicate layer' within KerasModel
  • #842 [PyTorch] Evaluate to replace Python Wrapper with torch.jit.compiled version, that uses torch.ops.sol.call instead
  • #841 [Report] Refactor reporting API to not require IDs and instead use str labels
  • #835 Do not compile all framework handlers at startup
  • #832 Add Custom Layer Support
  • #831 [PyTorch] integrate sol into torch.compile(...)
  • #830 [PyTorch] Add TIMM models to test suite
  • #828 [Python] add option to reset handlers
  • #821 [PyTorch] fix RNN with Hidden inputs
  • #820 [DNNL] Upgrade to 2.7.2
  • #818 [Hugo] Upgrade 1.106.0
  • #816 [SQLite] Upgrade 3.40.0
  • #813 [JIT] Add [N/NV]CPATH, CPLUS_INCLUDE, C_INCLUDE, ... to respective compilers
  • #812 [HLIR] Remove old scheduling algorithm
  • #811 [HLIR] Move Cluster::srcs, etc. to Device::initSchedule
  • #810 [TF] Enable keras.supports_masking?
  • #809 [PyTorch] Missing layer: MaskedFill
  • #807 [TF] Enable model.layers
  • #806 [Installer] Error installing veda-pytorch due to '==' instead of '~=' version matching
  • #805 [TF] Keep Keras Model name
  • #804 [TF] Can we mimic the Keras output behavior using nn.Identity layers?
  • #803 [TF] Enable Keras users to get a "view" using "get_layer(name=...)" and then do "set_weights(...)"
  • #802 [Installer] Debug passwords with "!"
  • #800 [OpenSSL] upgrade to 1.1.1s
  • #799 [HUGO] upgrade to 1.105.0
  • #796 [PyTorch] Upgrade 1.13.0
  • #793 [WHL] Device Meta Packages `nec-sol-meta-x86` that installs all dependencies for the given device.
  • #792 [WHL] Add nec-sol-omp as dependency to device-x86 and device-ve
  • #791 [WHL] Add jit-dot and jit-python to nec-sol-core requires-dist
  • #789 [DNNL] Upgrade v2.7.1
  • #788 [ISPC] Upgrade v1.18.1
  • #779 [Dependencies] Automatically install symlinks for VEDA and TUNGL to CMAKE_INSTALL_PREFIX
  • #777 [AutoTuner] Store results of previous runs and reinit Algo objects directly from DB
  • #773 [DFP] Check why sometimes linear loops don't get stored in IDX
  • #744 [TF+RNN] some RNN hyperparameter constellations produce wrong gradients
  • #715 [HLIR] Transform TF RNN cells to HLIR RNN cells
  • #639 [TF] RNN stateful=True results in "Unable to fetch values for ..."
  • #637 [PyTorch] Build own SOL C-style AutoGrad Function Wrapper
  • #527 [HLIR] GPT-2 Backward causes Stack Overflow in Cluster initialization
  • #230 [PyTorch] Can we clone methods from the original model into the sol_model, i.e. model.doSomething()
  • #179 [PyTorch] Evaluate if using a CPP instead of Python function yields less overhead
v0.5.1 (17.10.2022)

Highlights

  • General
    • The sol.check_version() command can be used to check for new versions.
    • Form to apply for SOL4VE closed beta.
    • Improved command-line based installer.
  • PyTorch
    • Calls to sol.optimize(...) from PyTorch no longer require example inputs. Instead, the model gets parsed the very first time it is executed (see the sketch after this list).
    • Limited support for torch.einsum(...)
    • Support for GANs through MaxUnPooling and TransposedConv layers.
    • Automatic detection of when to use torch.jit.script and when to fall back to torch.jit.trace
    • Improved handling of inline operations
  • TensorFlow
    • RNN support
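A minimal sketch of the new lazy parsing behavior; the model and input are placeholders:

```python
import torch
import sol

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU())
sol_model = sol.optimize(model)    # no example inputs required anymore
y = sol_model(torch.randn(2, 8))   # parsed and compiled on the very first execution
```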

Closed Issues

  • #775 [OpenSSL] Upgrade to 1.1.1r
  • #774 [PyTorch] v1.12.1 breaks VE complex number support
  • #772 [PyTorch] SqueezeNet training accuracy problems on VE
  • #771 [DFP] Investigate Unvectorizable DType in SqueezeNet Training on VE
  • #765 [PyTorch] automatic fallback to jit.trace if model parsing fails with jit.script
  • #764 [Docs] add Pytorch lazy optimizer documentation
  • #763 [PyTorch] enable kwargs for trace=True cases
  • #762 [PyTorch] add lazy optimizer
  • #761 [PyTorch] add torch.triu
  • #760 [Docs] Update supported layers
  • #759 [Core] Disable Progress bar if not running in interactive shell
  • #758 [Hugo] Upgrade to v0.103.0
  • #756 [VE] find alternative for "constexpr" in RNN implementation
  • #754 [DFP] Implement Interpolate autoSqueeze
  • #753 [Config] Remove unused config options
  • #752 [Docs] Update Docs for v0.5.1
  • #751 [NEC-SOL] Add "test access" option to nec-sol
  • #750 [NEC-SOL] Prevent installing Device Support Packages that users don't have access to in the first place.
  • #748 [DNNL] Upgrade 2.6.2
  • #747 [NEC-SOL] add --verbose flag and report access to urls.
  • #745 [TF+RNN] verify RNNSimple results
  • #743 Add TF v2.10.0 Support
  • #739 [Hugo] upgrade v0.102.1
  • #735 [Boost] Upgrade to 1.80.0
  • #733 [OMP] Add Heuristics to sol_parallel_for and sol_parallel_simd
  • #732 [HLIR] add option to remove unused model parameters
  • #731 [HLIR] Make RNN sequence length variable
  • #730 [Keras] add tf.ensure_shape to KerasLayer
  • #729 [Keras] Support named inputs
  • #728 [HLIR] Repair Where 2 min/max transformation
  • #726 [DNN/RNN] evaluate not using BLAS
  • #725 [TF] Solve Threadblocking Issue on X86
  • #723 [HLIR] derive(Slice) == Slice, if it's a reverse-slice
  • #720 [TF2RNN] Handle cases where only OH is used without slicing
  • #719 [RNN] Move sol.hlir.rnn API to C++ space
  • #718 [DNN/RNN] Improve Handling of O and OH
  • #717 [RNN] Masking Support
  • #716 [RNN] Recurrent Dropout Support
  • #714 [HLIR] remove dropout layers from inference executions
  • #713 [TF] fix keras (alpha_)dropouts in inference mode
  • #712 [TF] enable to change input shapes using sol.optimize(..., shapes={...}, ...)
  • #711 [DNNL] Upgrade 2.6.1
  • #705 [OpenSSL] Upgrade 1.1.1q
  • #704 [NVIDIA] JIT compile 64 and 128 Bit memset functions
  • #703 [Numpy] Numpy Runtime can cause SegFault when an NDArray gets freed after SOL already destroyed the handlers during shutdown
  • #702 [ONNX] add MaxUnPool
  • #701 [PyTorch] add MaxUnPooling
  • #700 [HLIR] check if MaxUnPooling always uses "max" instead of "+=" in fwd Pass
  • #699 [HLIR] Add DeSampling optimization to DeConv
  • #698 [Core] Fix Conv::Transform::Subsampling when applied in Bwd Pass
  • #697 [PyTorch] accuracy problem in RNN with S>1 in v1.12.0
  • #695 [PyTorch] add aten::pad
  • #694 [PyTorch] add aten::index
  • #688 [PyTorch] Upgrade to 1.12.0
  • #687 [PyTorch] Add Swin Transformer Tests
  • #686 [NEC-SOL] Pure command line mode?
  • #685 [DFP] fix unvectorized read loops
  • #684 [DFP-NCC] remove struct operator and add restricted keyword.
  • #683 [NCC] Improve DFP unrolling
  • #682 [PyTorch] Add PyTorch Lightning Support
  • #673 [OpenSSL] Upgrade 1.1.1p
  • #672 [JIT-NCC] In debug, warn about obstructive functions, and other vectorization problems.
  • #671 [VE] Add _Pragma("_NEC always_inline") into DFP-NCC headers
  • #670 [API] Add linker script
  • #669 [JIT] Add linker script
  • #668 [API] Replace all remaining extern "C" with SOL_API
  • #665 [CUDNN] Get Bundle from NVIDIA
  • #664 [MKL] Switch to PIP MKL Package instead of Bundling
  • #663 [Runtime] Add Mutex to runtime::device::Network to prevent parallel executions as in TF.
  • #662 [TF/X86] Warn user if tf.config.threading.set_inter_op_parallelism_threads is not set to 1, as this has a negative impact on SOL performance
  • #661 [Python] Replace any remaining 'print' with 'tungl.info'
  • #660 [Web] Add SOL4VE Registration Form
  • #658 [PyTorch+TF] Use unified implementation of framework::Handle for all devices
  • #657 [TF] Add option to enable grad on inputs
  • #654 [OpenSSL] Upgrade 1.1.1o
  • #653 [Hugo] Upgrade 0.100.2
  • #652 [Installer] Upgrade fails with 'set' object has no attribute 'append'
  • #649 [TF] Unable to fetch values for...
  • #647 [Core] Lock DB when executing cache::clear
  • #646 [Docs] Update TF SDK docs
  • #645 [HLIR] Transformation to remove duplicate Permutes is not working correctly.
  • #644 [TF] RNN Seq2Seq testcase exposes #3 as vdim
  • #643 [Hugo] Upgrade v0.100.1
  • #642 [TF] GRU
  • #641 [TF] LSTM
  • #640 [TF] SimpleRNN
  • #638 [TF] Evaluate unified module instead of compiling new modules for each network
  • #636 [Profiler] Rework API
  • #635 [Installer] Not Showing PyTorch/TensorFlow properly
  • #634 [NCC] investigate why OpenMP is not working with -std=c++17
  • #633 [MKL] RNN
  • #631 [TF] Upgrade 2.9.1
  • #630 [Deployment] change binary2obj to objcopy
  • #629 [PyTorch] Sum of BOOL needs to be cast to INT DType
  • #626 [Python] Evaluate if using CFFI is less verbose compared to CTypes
  • #625 [PyTorch] Evaluate if replacing sol.runtime.set_tensor with a CPP function yields less overhead.
  • #621 [HLIR] ZeroCopy Layer, that tries to not duplicate outputs in frameworks.
  • #609 [ONNX] GroupNorm: illegal view transformation
  • #608 [TF] Add tf.keras.applications testcases
  • #607 [TF] Upgrade 2.8.2, 2.7.4, 2.6.5
  • #601 [VEDA] VEDA_ERROR_UNKNOWN_CONTEXT thrown when calling vedaDevicePrimaryCtxRetain
  • #600 [PyTorch] Missing Primitive: aten::bernoulli
  • #597 [Core] Wrong Rendering of Output in Inception
  • #593 [X86+VE] IRFFT2D accuracy issues
  • #582 [Core] Show Update Message
  • #547 [PyTorch] Remove RNN Fix for when PyTorch v1.12.0 is released
  • #541 [DFP] improve size-1 write loop removal
  • #540 [DFP] Split BatchSize == 1 outer loops if there are multiple inner
  • #498 [Python API] Add option to override requires_grad
  • #283 [TensorFlow] automatically unload network, when all instances of the network have been destroyed
  • #214 [VEBLAS] Better parallelize RNN for small BatchSizes
  • #209 [PyTorch] automatically unload network, when all instances of the network have been destroyed
  • #123 [Layers] ConvTranspose/Deconvolution
v0.5.0rc2 (23.05.2022)

This is the 2nd release candidate of SOL v0.5.0 Diadem. It brings back support for ONNX and Numpy (runtime), and adds lots of bugfixes and improvements.

Highlights

  • ONNX frontend
  • Numpy runtime
  • Printing StackTrace when exception occurs in SOL
  • Improved printing of model signature
  • The installer no longer shows packages you don't have access to

Closed Issues

  • #624 [FFT] duplicate symbol "vdims"
  • #623 [TF] tf.nn.max_pool_with_argmax indices differ again...
  • #619 [Hugo] Upgrade 0.99.0
  • #617 [ONNX] "Unable to fetch values for " in ShuffleNet
  • #616 [ONNX] "can't 384 / 13 because it's no divider" in X86 Igornet and Layers[Repeat]
  • #615 [ONNX] "Unable to find dimension PO1" in X86 MNasNet and EfficientNet
  • #614 [ONNX] Cumsum accuracy issues
  • #613 [Core] Improve calculation of memory consumption estimation
  • #612 [TF] TF casts shape of [1,1,1,1] to [] during training
  • #611 [NCC] Segfault in networks using VDims
  • #610 [ONNX] CumSum
  • #606 [ISPC] Wrong initialization of VDims
  • #605 [DB] Verify that we can open multiple instances of SOL using the same DB file
  • #604 [X86+FFT] Accuracy Error in PY_IRFFT2D
  • #603 [X86+PyTorch] Memory is already allocated
  • #602 [X86+FFT] [DNNL ERROR] The operation failed because of incorrect function arguments.
  • #599 [PyTorch] Missing Primitive "aten::empty"
  • #598 [HLIR] VDims cause promotion to F64 in BatchNorm
  • #595 [VE] RNN Segfault in Testcases
  • #594 [Tests] Report % of values that exceed threshold
  • #592 [PyTorch] Models without parameters can create a training context if 'model.training = False' is not set
  • #591 [Docs] Update documentation for v0.5.0rc2
  • #590 [Core] "axis needs to be between 0 and 0 but found 1" in DeReduce
  • #589 [TF] Can't use Keras models with scalar parameters
  • #587 [TF] Ordering of Parameters in KerasLayer is not stable
  • #586 [VE] Investigate possible Memleak in PyTorch-VE Runtime
  • #584 [PyTorch] Add Error Handling when User provides too many/too few arguments to function
  • #583 [JsonCPP] Upgrade to 1.9.5
  • #581 [ISPC] Upgrade to 1.18.0
  • #580 [CMake] Change min GCC to 10.X
  • #579 [Hugo] upgrade 0.98.0
  • #578 [PyTorch] RNNCell parsing fails in PyTorch 1.11.0
  • #577 [PyTorch] "illegal view transformation" when parsing RNN
  • #576 [DB] automatically recover from "database image malformed" exceptions
  • #575 [VE] RNN does not compile with NCC 3.4.2
  • #574 [Runtime] SegFault in Context::Destroy when starting another training context.
  • #573 [Core] Print StackTrace when SOL Exceptions are thrown
  • #572 [VE] Value2VDim
  • #571 [X86] Value2VDim
  • #570 [VE] WhereTrue
  • #569 [X86] WhereTrue
  • #568 [VE] PrefixSum
  • #567 [X86] PrefixSum
  • #566 [HLIR] return NAN instead of throwing error when dividing by 0
  • #563 [HLIR] Remove Unnecessary Permute
  • #562 [Runtime] do not reinit Params if the model has been run before
  • #561 [PyTorch] add Mobilenet V3 testcases
  • #560 [PyTorch] add efficientnet testcases
  • #559 [PyTorch] add ConvNext Testcases
  • #558 [PyTorch] add RegNet testcases
  • #556 [HLIR] show actual input/output shapes, not the "possible"
  • #555 [Installer] Cache Password for PIP
  • #554 [Installer] hide packages that user does not have access to
  • #553 [Runtime] Context needs to distinguish between offloading and framework handle!
  • #552 [Numpy] Implement Lazy Allocations in Numpy executor
  • #551 [HLIR] Fix Memory Consumption to report Copies not as Outputs
  • #550 [NCC/OMP] cannot explicitly instantiate std::tuple in sol parallel simd with NCC 3.3 or newer
  • #549 [Compiler/Runtime] Encode/Check framework version, device compute capability, ... in compiled NN library name
  • #548 [SQLite] Upgrade to 3.38.2
  • #546 [DNNL] Upgrade 2.6
  • #545 [TF] SOL's MaxPooling makes other choices than TF's implementation in training
  • #544 [TF] implement save/load methods in Keras model
  • #543 [TF] fix BatchNorm Assignment
  • #539 [DNNL] Illegal Operation in Resnext BWD Pass
  • #538 [Python] Don't store size of Tensor in Python but always request from C++ space
  • #522 [DNNL] Performance Problem in ConvBwdData Param Reorder
  • #514 [PyTorch] Upgrade 1.11.0
  • #512 [HLIR] Copy Buffers during Training, not LayerOutputs
  • #507 [TF] Which momentum does TF use in BatchNorm?
  • #492 [TF/VE] Check if Optimizers and Loss functions are implemented
  • #438 [TF] Update 2.8.0
  • #429 [ONNX] Add Handler option
  • #425 [HLIR] Remove unnecessary Copies in RNN -> Copy -> Output i.e. for Workspace
  • #276 [ONNX] set that params require grad, so we can train ONNX models
  • #201 [DFP] Double check if we correctly report the scratchpad memory
  • #186 [ONNX] use sol.internal.Tensor operators instead of explicit calls
  • #166 [PyTorch] torch.nn.InstanceNorm2d missing
  • #165 [PyTorch] torch.nn.GroupNorm missing
  • #113 [ONNX] PRelu
  • #112 [ONNX] Gather
  • #108 [ONNX] ERROR getsym_handler
  • #74 [DFP] Performance: implement Input Loop Merging
v0.5.0rc1 (23.03.2022)

SOL v0.5.0 Diadem is the next major release of SOL, already containing over 160 closed issues! We further switched to a new rolling-release model using release candidates, to push out fixes and changes more often.

This first release candidate DOES NOT contain support for: NVIDIA devices, ONNX or Deployment! These features will be reenabled in later release candidates.

Highlights

Breaking Changes

  • We modified the sol.optimize(model, args, kwargs={}, framework=None, **fwargs) call. If you have more than one input, you now need to pass them using the args argument as a list or tuple, or using the kwargs argument as a dictionary. This was necessary to be more compliant with the AI frameworks (see the sketch after this list).
  • sol.optimize(..., batchsize=...) has been removed in favor of the new variable dimensions system. Please look here for more details.
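A migration sketch, assuming a model with two inputs x and y (all placeholders):

```python
import sol

# Old (v0.4.x):
#   sol_model = sol.optimize(model, x, y, batchsize=32)

# New: pass multiple inputs via `args` (list/tuple) or `kwargs` (dict);
# `batchsize` is replaced by the variable dimensions (vdims) system.
sol_model = sol.optimize(model, args=[x, y])
# or, with named inputs:
sol_model = sol.optimize(model, kwargs={"x": x, "y": y})
```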

Closed Issues

  • #538 [Python] Don't store size of Tensor in Python but always request from C++ space
  • #537 [OpenSSL] Upgrade to 1.1.1n
  • #535 [PyTorch] create OMP Symlink in sol-framework-pytorch-x86
  • #534 [DNNL] Upgrade 2.5.3
  • #533 [SQLITE] Upgrade 3.38.0
  • #531 [Core] Report Peak Memory in Output during Compilation Phase for each Pass
  • #530 [Python] increase min Python Version from 3.6 to 3.7 because of TF 2.8.0 requirement
  • #529 [PyTorch] Upgrade v1.10.2
  • #524 [PyTorch] Check why we get weird JIT accuracy errors in HuggingFace Bert training
  • #523 [Python] deprecated "copy_parameters" and just do it always.
  • #521 [DNNL] implement memory layout AutoTuning for Conv
  • #520 [CMake] Enable Release Candidates
  • #519 [VE] If NC++ is not found, we need to disable VE entirely, not just skip setting the VE_LD_LIBRARY_PATH
  • #518 [VEDA] upgrade 1.2.0
  • #517 [Runtime] segfault when an inference Context is still open and get cleaned up during shutdown
  • #516 [PyTorch] Find a solution for conflicting GOMP implementation
  • #515 [OMP] Limit number of threads to match number of hardware cores, not number of hardware threads
  • #511 [Runtime] Store raw pointer in sol_tensor
  • #510 [DFP] Unroll Interpolation using TmpVars
  • #509 [Runtime] Evaluate if TBB is better than OMP
  • #508 [Runtime] Remove sol::runtime::Tensor
  • #505 [DFP] Remove Rand from DFP and implement in ISPC/CUDA and VEASL
  • #500 [DFP] input loop merge
  • #499 [Python API] Change sol.optimize(*args, **kwargs) to use one variable
  • #496 [DFP] Ensure that in ISPC each core uses its own random state
  • #494 [DFP] Remove GLoops
  • #493 [DFP] Don't allocate stack memory in CORES loops.
  • #491 [HLIR] Fix weird TorchVision structure of DenseNet
  • #490 [TF] ValueError: Data cardinality is ambiguous: x sizes: 1, 100, 100
  • #489 [TF] OperatorNotAllowedInGraphError: iterating over `tf.Tensor` is not allowed in Graph execution. Use Eager execution or decorate this function with @tf.function.
  • #488 [HLIR] Add verification for DTypes
  • #487 [PyTorch] Testcase "MemberTensors" fails in Training
  • #486 [TF] can we pass the sol_ctx as attribute instead of scalar tensor?
  • #485 [TF] Enable models with more than 255 model parameter tensors
  • #483 [Core] Buffer Chaining fails if there is a view in between the concat operations
  • #480 [PyTorch] add strided slice support
  • #479 [SQLite] Upgrade to 3.37.2
  • #478 [TF] add missing StridedSlice features
  • #477 [ISPC] Upgrade to 1.17.0
  • #475 [DNNL] Upgrade to v2.5.2
  • #471 [X86] correctly detect vector instructions
  • #470 [TF] tf.math.cumsum
  • #469 [ISPC] migrate dnn_ispc to ispc, as it does not use any dnn components
  • #468 [Core] Verify that all layers have properly been assigned an algorithm before generating code
  • #466 [PyTorch] add torch.cumsum
  • #465 [TF] add tf.Where
  • #464 [Core] remove NUMEL through transform if VDIMS get removed
  • #463 [Core] add input to NUMEL
  • #462 [TF] backport new sol.hlir.reduce API
  • #461 [Core] Revise sol_tensor using real shape and separate numel field
  • #460 [DFP] BackendRenderer missing DTYPE
  • #459 [SQLite] Upgrade to 3.37.1
  • #458 [AVEO] Upgrade to 2.10.0
  • #457 [PyTorch] Support multivalued constants
  • #456 [HLIR] Bugfix GEMM batchHelper function
  • #455 [VDIMS] PyTorch AvgPool2D#3 with two different shapes after each other fail because of identical Hash
  • #454 [Pytorch] Upgrade 1.10.1
  • #453 [Docs] Add TensorFlow Native to VE Chapter
  • #452 [DNNL] Upgrade to 2.5.1
  • #451 [Core] Remove Tensor Class
  • #450 [Docs] Correct VDim case with #2 == 5
  • #449 [Deprecated] YaalUpdate and YaalSeed
  • #448 [VDIMS] option to enable/disable VDIM usage, disable by default.
  • #447 [DFP] Fix BatchNorm BWD Pass
  • #446 [DFP] Flag Loops where DataStride() == 1 for ALL data to be used for SIMD
  • #445 [PyTorch] torch.reshape missing
  • #444 [Core] Print NN Input/Output Signature
  • #443 [PyTorch] argmin + argmax missing
  • #442 [TF] wrong results in activation layers
  • #439 [PIP] Enable manylinux2014 compatible build process
  • #437 [DNNL] Upgrade 2.4.4
  • #436 nec-sol: download.pytorch.org may also need to be trusted
  • #435 nec-sol: codec can't display license agreement
  • #434 upon program exit, malloc(): unsorted double linked list corrupted
  • #433 [TF] Enable to parse tf.Module and tf.saved_models
  • #432 [TF] Upgrade to 2.6.1
  • #431 [PyTorch] torch.rand* and torch.randint* missing
  • #430 [Pytorch] Enable Models without inputs
  • #428 [PyTorch] Add Handler option
  • #427 [TensorFlow] Add Handler option
  • #426 [HLIR] Add Custom Layer Support
  • #423 [PyTorch] Possible Segfault in Py_InfNAN on X86
  • #422 [Core] Add Option to register Operations at runtime
  • #421 [Core] Missing library libcrypto.so.10 on Ubuntu 20.04
  • #420 [PyTorch] Debug v1.10.0 print problems with native-ve
  • #419 [PyTorch] Upgrade RNN API to new HLIR version
  • #418 [CUDNN] Upgrade to 8.2.4
  • #417 [CUDA] Upgrade to 11.3
  • #416 [PyTorch] Upgrade to v1.10.0
  • #415 [CORE] use SQLite Transactions to lock SOL Cache in multiple processes
  • #414 [DNNL] Upgrade to 2.4.2
  • #413 [VE] replace sol_ve_copy with VEDAMemset and VEDAMemcopy
  • #412 [VEDA] Add device side memset
  • #411 [VE] Add ComplexDTypes to Device API
  • #410 [TBB] CMake install script fails on first run.
  • #408 [VEDA] memset 128
  • #407 [VEDA] CMake ASL FFTW
  • #401 [DNNL] Upgrade to 2.4.1
  • #399 [NCC] Add Struct Functions
  • #398 [X86] Check Handle::memset performance!
  • #397 [VEDA] improve S8, S16 vedaMemset Performance
  • #396 [TBB] Upgrade to 2021.4.0
  • #395 [DNNL] Upgrade to 2.4
  • #394 [PyTorch] torch.complex missing
  • #393 [PyTorch] torch.imag missing
  • #391 [PyTorch] sol.optimize(model, torch.Tensor) stopped working?
  • #389 [NEC-SOL] automatically detect VENV
  • #388 [PyTorch] use torch.jit.trace instead of torch.jit.script to support Transformers
  • #387 [HLIR/DFP] Change Dropout to always use input rand
  • #386 [NEC-SOL] UTF-8 encoding problem
  • #385 [DFP] Bad Loop Merging in multi-path kernels
  • #384 [DFP] Wrong vectorization in Bias Sum Case
  • #383 [YAAL] Improve Scheduling
  • #382 [PyTorch] Upgrade to 1.9.1
  • #381 [Tests] Restructure TESTS package, so the PyTorch package does not load TF and vice versa.
  • #379 [NNPACK] Deprecate
  • #376 [ISPC] Check OpenMP implementation performance for small batchsizes
  • #375 [TensorFlow] Upgrade to 2.6.0
  • #364 [DFP] Fix Tile backward pass for X86
  • #363 [PyTorch] aten::repeat missing
  • #362 [PyTorch] aten::tile missing
  • #360 [PyTorch] aten::smooth_l1_loss
  • #355 [PyTorch] HuggingFace transformers broken
  • #332 [TF] Enabled Delayed Allocation in Native-VE
  • #329 [PyTorch] Add BatchNorm num_batches_tracked
  • #321 [API] change sol_external_malloc to use shapes instead of accumulated sizes
  • #307 [HLIR] Input > Dropout > Output produce wrong results in Inference
  • #306 [TF] tf.nn.max_pool_with_argmax returns wrong indices
  • #302 [PyTorch] Use VE instead of HIP device
  • #299 [PyTorch] Slicing with [0:0] should return empty Tensor
  • #293 [HLIR] Constant with more than 1 element
  • #289 [TF] Remove sol_model.convert("device")
  • #285 [DFP] Remove Nests
  • #284 [DFP] Implement Linear
  • #271 [TensorFlow] enable lazy allocations
  • #267 [Core] Remove OptimizerType
  • #259 [IgorNet] Can't run Inference (division by zero)
  • #239 [PyTorch] add torch.nn.functional.interpolate
  • #225 [DNNL] Upgrade to 2.3.2
  • #212 [HLIR] SIGN on Unsigned is 1
  • #211 [HLIR] Abs on Unsigned is Noop
  • #208 [PyTorch] some networks can't use variable BatchSize
  • #203 [DFP] Buffered Dropout >> Narrow allocates too little memory
  • #202 [Compiler] Show Progress Bar already during Code generation Phase
  • #199 [HLIR] Detect Permutations in front of GEMM and merge them into the GEMM by changing the layout.
  • #198 [HLIR] Initialize Gradients with 0 and make all multi-connections to REQURIESACC
  • #193 [PyTorch] "Only one dimension for View can use summation." in ShuffleNet when using Variadic BatchSizes
  • #187 [DFP] BatchNorm does not support NOT tracing runtime metrics
  • #184 [PyTorch] No Gradient for inputs which get disconnected
  • #183 [PyTorch] LPPool Backward Pass results do not match
  • #182 [DFP] Split BatchNorm into two operations to better fit the actual computations and to remove the workarounds in DFP module
  • #177 [VEBLAS] Memory Estimation: RNN
  • #163 [PyTorch] torch.Tensor.repeat missing
  • #161 [PyTorch] Interpolation Layers missing
  • #157 [Autotuning] Device reports out of memory, as SOL does not use the framework's memory allocator during auto-tuning
  • #139 [Core] Enable to save/load SOL models
  • #135 [HLIR] Pytorchic BERT can't use variable batchsize because of the view?
  • #131 [DFP] segfault generating RReLU layer
  • #120 [DType] Add Complex DTypes
  • #117 [PyTorch] FFT
  • #81 [Variable BatchSize] GPT-2 can't use variable batchsize
  • #80 [Clusters] Erroneous accumulation in GPT-2 BWD Pass
  • #71 [DFP] missing layer: MaxUnPooling
  • #70 [DFP] LeakyRelu: Remove IF in generated code
  • #35 [X86] Does not show correct used memory