27.07.2024 v0.6.0
Highlights
	- Added experimental support for vLLM.
 
	- Added experimental support for CUDAGraphs in PyTorch.
 
	- Added BFloat16 and Float16 support for X86 and NVIDIA.
 
	- Added FlashAttn-like kernel fusion for X86, NVIDIA and SX-Aurora.
 
	- Improved torch.compile(...) integration.
	- SOL no longer aborts execution but properly throws Python exceptions that can be caught using try: ... except sol.Exception as e: ... (see the sketch below).
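	The new error handling can be combined with the improved torch.compile integration mentioned above. The following minimal Python sketch assumes SOL is installed and importable as sol, and that it registers a torch.compile backend under the name "sol"; the backend string is an assumption, not a confirmed API detail.

	    import torch
	    import sol  # SOL's Python package

	    model = torch.nn.Linear(64, 32)
	    x = torch.randn(8, 64)

	    try:
	        # Backend name "sol" is an assumption; check the SOL documentation
	        # for the exact string registered with torch.compile.
	        compiled = torch.compile(model, backend="sol")
	        y = compiled(x)
	    except sol.Exception as e:
	        # As of v0.6.0, SOL raises sol.Exception instead of aborting the process.
	        print(f"SOL reported an error: {e}")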
 
Breaking Changes
 
	- sol.config['compiler::profile'] has been deprecated. Use the SOL_PROFILER environment variable instead (see the sketch below).
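
	A minimal sketch of the replacement, assuming SOL reads the variable at import time and that "1" is an accepted enabling value (both are assumptions):

	    import os

	    # Replaces the deprecated sol.config['compiler::profile'] switch.
	    os.environ["SOL_PROFILER"] = "1"  # "1" is an assumed enabling value

	    import sol  # assumption: set the variable before importing SOL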
 
Known Issues
	- No BFloat16 and Float16 support for SX-Aurora.
 
	- Performance regressions on SX-Aurora (e.g., ConvNext).
 
	- No gradient computations yet for Interpolation layers.
 
 
Closed Issues
- #1555 	[NVIDIA] NAN in Albert model
 
- #1554 	[PyTorch] Scriptparser uses deprecated imp package
 
- #1550 	[PyTorch] torch.cross
 
- #1548 	[PyTorch] Unknown at::device
 
- #1547 	[VDims] Can't use the same vidx twice! (128 * #2 * #2)
 
- #1546 	[VE] User specified an unsupported autocast device_type 've'
 
- #1545 	[BuildChain] CentOS7 mirrors are deprecated, we might need to upgrade to manylinux_2_28
 
- #1544 	[Python] Destroy sol_optimizer* when error is thrown
 
- #1543 	[VE] Invalid results when large values get passed to exp
 
- #1542 	[Core] Uncaught sol::Exception when autotuning triggers an assertion
 
- #1540 	[VE] PY_Issue_1410 fails
 
- #1538 	[Python] Allow determinism to be overwritten by user
- #1537 	[PyTorch] Finalize Determinism
 
- #1536 	[ISPC] Make Prefixsum as template
 
- #1535 	[DNN] AutoTuner Cross GEMM performance
 
- #1534 	[PyTorch] torch.rand/rand_like F16/BF16 on X86
 
- #1532 	[Wrapper] Don't call sol_call if graph does not contain any nodes
 
- #1530 	[DNNL] PostOp Bias
 
- #1529 	[Core] Improve Exception Handling within TF
 
- #1528 	[PyTorch] aten::extend
 
- #1527 	[ONNX] change s_tensors to scope, and store s_opset in scope
 
- #1526 	[ONNX] LRN
 
- #1525 	[ONNX] Hub Tests
 
- #1520 	[DFP] No Write/Read loops in 376:628
 
- #1519 	[PyTorch] Can't expand [1, 3, 80, 80, 2] to [80, 80, 2] in YOLO perf run
 
- #1514 	[VE] Finalize VE3 support
 
- #1512 	[DNN] Add GEMM upcasting API to support e.g. FP16/BF16 on CPU/VE
 
- #1508 	[Transformers] Persimmon/Qwen2/XLMRobertaXL Accuracy
 
- #1507 	[TIMM] Upgrade to v1.0.7
 
- #1505 	[VE] Set NCC -stdlib=compat
 
- #1504 	[Parser] Implement lazy Numpy eval
 
- #1503 	[DFP] Conv Accuracy
 
- #1502 	[DNNL] Enable F16 and BF16
 
- #1501 	[DFP] Tanh(large number) == nan
 
- #1498 	[NVIDIA] Limit usage of TensorCores to suitable shapes
 
- #1497 	[ISPC] Unable to store varying in uniform var
 
- #1496 	[TIMM] fix models with lit != idims.end() error
- #1494 	[HLIR] Add Constraint Registry that allows to serialize them
 
- #1493 	[DNNL] Upgrade 3.5
 
- #1492 	[PyTorch] Unify Script and FX parser
 
- #1490 	[SLEEF] Upgrade v3.6.1
 
- #1489 	[PyTorch/Lightning] Update Testcase to new API
 
- #1487 	[PyTorch] Test v2.3.1
 
- #1483 	[DFP] AccessorPinning fails in PY_REDUCE training
 
- #1481 	[PyTorch] HuggingFace LLama: This model uses variable number of arguments, falling back to torch.jit.trace
- #1480 	[DFP] CUDA pow(-0.5, 2.0) results in NaN
 
- #1477 	[DNN] Add timeouts for GEMM and Transpose AutoTune
 
- #1476 	[DFP] remove stack_alloc calls
 
- #1475 	[AutoTuner] Re-enable caching with new AT scheme
 
- #1474 	[HLIR] SDPAttention: Cannot modify VDims at this point anymore!
 
- #1472 	[DFP] Revise CPU+VE cores caching
 
- #1471 	[CUDA] Don't abort if libcuda.so is not found (e.g. in Docker build)
 
- #1470 	[Torchvision] vgg11: T18_D0 violates DNNL post op requirements as src is of type Add
 
- #1469 	[CUDNN] Could not load library libcudnn_cnn_infer.so.8. Error: /usr/lib64/libcudnn_cnn_infer.so.8: undefined symbol
 
- #1468 	[Runtime] Free Persistent Data in case user reruns training fwd pass without bwd pass
 
- #1466 	[DFP] DFPBackend::lowerIR(Layer* l, Cast* p) causes infinite loop when using VE
 
- #1464 	[DFP/Nvidia] LLama using -dt tf32 results in illegal view transformation
 
- #1462 	[PyTorch] Test v2.3.0
 
- #1459 	[NCC] add -march ARCH to NCC command
- #1457 	[DFP] Unvectorized OnlineSoftMax in SDP
 
- #1456 	[PyTorch] If FlashAttn enabled for F32 SDP, then also set GEMM::TF32
 
- #1454 	[VE] VEDA_ERROR_VEO_COMMAND_EXCEPTION when running VEDNN GEMM AutoTune
 
- #1452 	[Wrapper] Add options to wrapper::Attributes to either override or accumulate
 
- #1451 	[VE] ve::trace=True does not compile
 
- #1450 	[DFP] Transform Numel->Cast->Broadcast to Numel->Cast with correct dims
 
- #1446 	[DFP] Test TensorCore-like GEMM implementations
 
- #1443 	[DNN] Implement GEMM autotune swapping inputs
 
- #1442 	[PyTorch] add Tensor.uniform_
 
- #1441 	[PyTorch] add one_hot
 
- #1440 	[DFP] HMerge identical Pooling/Conv and Reduce layers
 
- #1439 	[DFP] Reduce Cores LoopAccessors
 
- #1437 	[DFP] Rework Combine Input
 
- #1435 	[Accuracy] Investigate SoftMax Gradient problem
 
- #1432 	[Runtime] pass Model.training parameter as runtime parameter
 
- #1431 	[DFP/ISPC] add iterations to sol_dfp_reduce_x(..., count_iterations)
 
- #1430 	[HLIR] move Bernoulli::transform(Device&) to respective backends
 
- #1429 	[DNNL] Upgrade 3.4.1
 
- #1427 	[PyTorch] VLLM support
 
- #1425 	[Docs] Update max TF version
 
- #1424 	[HLIR] SDPAttention layer
 
- #1422 	[DFP] Rework Reduce::transform as it underutilizes
 
- #1421 	[ISPC] Add PyTorch_DETERMINISM compile flag, as they don't use FP64 sleef functions!
 
- #1419 	[PyTorch] Test v2.2.2
 
- #1417 	[DFP/X86] Accuracy GELU
 
- #1416 	[DFP] Loop+AccLoop fusion
 
- #1415 	[PyTorch] NaN when training MNist example on CUDA
 
- #1413 	[PyTorch] Pass on fwargs and vdims to torch.compile Backend
 
- #1412 	[TestBench] Add Default DType to perf Output
 
- #1411 	[PyTorch] add tensordot
 
- #1410 	[CUDA] Wrong result in CUDA Transpose
 
- #1409 	[PyTorch] evaluate FP64 GEMM uses tensor cores
 
- #1407 	[RNN] Add Determinism to API
 
- #1406 	[Profiler] MemTrace reports wrong total
 
- #1405 	[NVIDIA] LayerNorm accuracy
 
- #1403 	[NVIDIA] consider to remove cross-warp-reduction support
 
- #1402 	[DFP] OnlineSoftMax::derive
 
- #1399 	[Runtime/HLIR] Include Model Inputs in runtime::RequiresGrad
 
- #1398 	[DNNL] Use PostOp Activations in Inference
 
- #1397 	[Numpy] Adopt new VDims system
 
- #1396 	[Compiler/Runtime] Remove special "INF" case of Derivative
 
- #1395 	[TF] Missing TF handler for DivNoNan
 
- #1394 	[PyTorch] YOLO accuracy
 
- #1393 	[TF] Test v2.15.1
 
- #1391 	[Runtime] Consider separating INF and Training
 
- #1388 	[DFP] Still unused Loop Unpacks, e.g. in AlexNet
 
- #1387 	[DFP] Check Merged FOR-FOR loops, that cause unpacking of Loops (e.g. AlexNet)
 
- #1386 	[TIMM] fix vit_base_patch16_384
 
- #1385 	[Profiler] Reporting to file not working
 
- #1384 	[NVIDIA] Evaluate new __reduce_xxx_sync function for CC >=8
 
- #1383	[DFP] Improve cache planning
 
- #1382 	[DFP] Unnecessary cast in FP16 Mul->Mul
 
- #1381 	[DFP] Expected [S64] for sol::compiler::backend::dfp::Indices but found [S32]
 
- #1378 	[HLIR] Inherit Cast in MaxPooling.Indices -> Cast -> ...
 
- #1376 	[Runtime] Enable to use sol.device.set(...) from different Threads to run multiple devices in parallel
- #1373 	[CUDA] SOL crashes with "invalid device context" when using Streamlit
 
- #1372 	[HLIR] Evaluate if using determinism instead of sol.hlir.RandType is sufficient
 
- #1371 	[DFP] Upcast pure intermediate results in FP16/BF16 to FP32
 
- #1370 	[PyTorch] Test v2.2.1
 
- #1369 	[DFP] Add transformation to upcast internal data types from f16 to f32.
 
- #1367 	[VE] SegFault Norms(64)
 
- #1366 	[CUDNN] CUDNN_STATUS_BAD_PARAM in TF RegNet
 
- #1365 	[NVIDIA] PY_Reduce(6/12) argmin/argmax fails with F64
 
- #1364 	[NVIDIA] PY_Issue_1316 fails with F64
 
- #1363 	[Tests] Enable non-fp32 dtypes in testsuites
 
- #1360 	[DNN] use sol_sync instead of sol::dnn:XXX::sync
- #1359	[CUDA] performance of cublasGEAM not optimal, e.g. in TransposeSwap testcase
 
- #1358 	[DFP] Fix elementwise cases that don't get CoresSIMD assigned
 
- #1357 	[Runtime] Implement sol_tensor_swap to skip Noop Reorders
 
- #1356 	[Sleef] Upgrade v3.6
 
- #1355 	[PyTorch] Performance Issues in CNNs with BS=1 on X86
 
- #1351 	[PyTorch] Fix 'PY_Padding' on VE
 
- #1350 	[PyTorch] Fix 'PY_Norms(3)' on VE
 
- #1349 	[PyTorch] Fix 'PY_Addbmm' on VE
 
- #1348 	[PyTorch] Fix 'PY_Matmul#T#Batched' on VE
 
- #1346 	[PyTorch] Fix 'PY_CatRewire' -> nan on VE
 
- #1344 	[YAAL] Return if any of the checks fail with error code
 
- #1343 	[TF] Test v2.15.0
 
- #1342 	[Runtime] Trigger recompilation if non-dynamic dimension changes instead of crashing
 
- #1341 	[NVIDIA] Fix library detection if cu11 and cu12 packages are installed
 
- #1338 	[Rand] Numpy Random Number Generator
 
- #1337 	[HLIR] Add tensor[condition] operator
- #1336 	[HLIR] Remove (Layer*) arguments from Operation, as they know their layer via m_layer!
- #1335 	[BLAS] Fix decision making on AMD EPYC for new Autotuning
 
- #1334 	[BLAS] Unify OpenBLAS, MKL, DNNL and AOCLBLAS BLAS Interface
 
- #1333 	[OpenBLAS] Add Backend
 
- #1332 	[AutoTuner] Consider Backend Specific "number of runs" and "not improved"
 
- #1331 	[AutoTuner] Add option to poll performance for a layer from within another layer's tuning cycle
 
- #1330 	[PyTorch] Capture ::c10::Error errors in handle and rethrow as sol::Exception
 
- #1329 	[AutoTuner] AutoTuner Cache does not allow rerunning Reorder -> GEMM, profiling when identical GEMM layer but without previous Reorder was executed
 
- #1328 	[PyTorch] Test v2.2.0
 
- #1327 	[Numpy] Executor overrides input
 
- #1326 	[MKL/VEBLAS] Evaluate if using GEMV when bs==1 is better
 
- #1323 	[Profiler] Fix Total Bytes
 
- #1322 	[DNN] Evaluate other GEMM tuning strategies
 
- #1320 	[DNN] repair GEMM autotuning
 
- #1319 	[VE] Attach to host profiler
 
- #1318 	[PyTorch] Set executing device in model without inputs and parameters
 
- #1317 	[PyTorch] Adjust executor to also copy MemberTensors that are no Buffers to device
 
- #1316 	[PyTorch] aten::scaled_dot_product_attention
 
- #1315 	[TF] Read accuracy modifying values, e.g. tf32 execution
 
- #1314 	[PyTorch+VE] might cause segfault when exiting
 
- #1313 	[PyTorch] Using shape of not used tensor
 
- #1312 	[CAPI] Unify SOL dtypes for generated code
 
- #1311 	[NCC] Don't expect nc++, ... to be installed in /opt/nec/ve/bin/
 
- #1310 	[Installer] Add option to renew license
 
- #1309 	[Installer] Download option not working, as PIP downloads only Wheels, not Source packages
 
- #1308 	[HLIR] Tensors should only evaluate value_op if values and value_op are not None
 
- #1307 	[HLIR] Parser implemented non existing np functions, e.g. np.erf, np.acos, ...
 
- #1306 	[TF] Check Resnet50 CPU performance
 
- #1304 	[Core] Progress bar breaks, when Backend Handles get compiled during optimization process
 
- #1298 	[DNNL] "Unsupported dnnl_format_tag_t: POI/2/true" in tf/regnet
 
- #1296 	[Python] Remove SOL_CONSTANT Params
 
- #1289 	[DFP] Store ReAlloc not as instruction but directly within LoopStack
 
- #1285 	[DFP] Group WriteBacks for better performance in case of MasterStacks
 
- #1284 	[PyTorch] Read Torch accuracy + determinism values and attach them to the layers
 
- #1282 	[DFP] Minimize Loop Index Calculations
 
- #1275	[ISPC] Investigate impact of setting different target gang-sizes in ISPC compilation
 
- #1269 	[PyTorch] Implement FlashAttention
 
- #1268 	[CUBLAS] FP16 + using advanced flags
 
- #1264 	[DFP] Implement Online normalizer calculation for softmax
- #1250 	[AOCL] Add new Backend
 
- #1240	[PyTorch] enable Torch Compile style fwd+bwd within one pass
 
- #1238 	[PyTorch] Bloom Support and Optimizations
 
- #1236 	[HLIR] Reorder -> GEMM transform
 
- #1223	[DNNL] Enable NVIDIA GPUs
 
- #1193 	[PyTorch/JIT] Add torch.jit.freeze to test-suite
 
- #1191 	[VDims] Autodetect VDims
 
- #1190 	[Profiler] Trace Memory Allocations
 
- #1186 	[VE] Fix AutoCast to 32Bit/64Bit vars to enable vectorization
 
- #1185	[NCC] v5.0.2 fails in PY_Norms, PY_Reduce and PY_TorchNorm
 
- #1180	[HLIR/DFP] Enable to store SOL_CONSTANT in model
 
- #1153 	[PyTorch] Investigate torch.fx for improving the parser
 
- #1060	[CUDA] Implement transpose as series of calls to cublasgeam
 
- #1043	[TF] "Unable to find SOL_CONSTANT"
 
- #999 	[ISPC] Improve SLEEF integration
 
- #991 	[Python] Improve performance of sol.optimize
 
- #920 	[PyTorch] Add more Einsum testcases
 
- #913 	[DFP] Rework stack memory caches
 
- #798	[NVIDIA] Change dependencies to use NVIDIA PIP packages
 
- #787	[AutoTuner] Think about choosing algorithms not solely based on performance, but also on their neighborhood, to increase chances of fusion.
 
- #766	[TF] Accuracy for BatchNorm in Efficientnet
 
- #696	[PyTorch] add aten::roll
 
- #651 	[TF] tf.keras.applications.efficientnet.EfficientNet + V2 accuracy problems
 
- #622 	[ISPC] ISPC wrongly casts double to uint8
 
- #504 	[DFP] fix removal of unnecessary accumulators
 
- #503 	[DFP] Memory Pinning
 
- #502	[DFP] Operation Inlining
 
- #497 	[HLIR] Add PassThrough Node
 
- #495 	[VEBLAS] Evaluate SOL's BatchedGEMM versus new NLC BatchedGEMM
 
- #390 	[VEDNN] readd GEMM
 
- #367 	[PyTorch] Add YOLO Test Case
 
- #366	[ONNX] Add YOLO TestCase
 
- #319	[PyTorch] Add option to parse "if self.training:" diverging paths
 
- #197 	[DFP] Reorder: Fill
 
- #196 	[DFP] Reorder: Narrow
 
- #104 	[All] FP16, BFloat16
 
- #69 	[DFP] Performance: BatchNorm Welford Algorithm
 
 