v0.6 Electra

v0.6.1 (29.07.2024)

Closed Issues

  • #1566 [Installer] add nec-sol --version
  • #1565 [Installer] Handle conflicting framework dependencies
  • #1564 [Installer] Handle local version strings for veda-pytorch
  • #1560 [VE] Numerical instability in Sigmoid

v0.6.0 (27.07.2024)

Highlights

  • Added experimental support for vLLM.
  • Added experimental support for CUDAGraphs in PyTorch.
  • Added BFloat16 and Float16 support for X86 and NVIDIA.
  • Added FlashAttn-like kernel fusion for X86, NVIDIA and SX-Aurora.
  • Improved torch.compile(...) integration.
  • SOL no longer aborts execution but properly throws Python exceptions that can be caught using try: ... except sol.Exception as e: ... (a minimal sketch follows this list).

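The following is a minimal sketch of the new exception handling, assuming sol.optimize(model, example_input) is the optimization entry point; the exact call signature is an assumption and may differ between SOL releases:

    import torch
    import sol

    model = torch.nn.Linear(8, 4)
    x = torch.randn(2, 8)

    try:
        # Since v0.6.0, failures raise a Python exception instead of aborting the process.
        optimized = sol.optimize(model, x)  # call signature assumed for illustration
        y = optimized(x)
    except sol.Exception as e:
        print(f"SOL optimization failed: {e}")
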
Breaking Changes

  • sol.config['compiler::profile'] has been deprecated; use the environment variable SOL_PROFILER instead (see the sketch below).

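A minimal migration sketch; the accepted values of SOL_PROFILER are an assumption, consult the SOL documentation of your release:

    import os

    # New since v0.6.0: configure the profiler via the environment,
    # before SOL compiles the model.
    os.environ["SOL_PROFILER"] = "1"  # treating "1" as an on/off flag is an assumption

    # Deprecated:
    # sol.config['compiler::profile'] = True
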
Known Issues

  • No BFloat16 and Float16 support for SX-Aurora.
  • Performance regressions on SX-Aurora (e.g., ConvNext).
  • No gradient computations yet for Interpolation layers.

Closed Issues

  • #1555 [NVIDIA] NAN in Albert model
  • #1554 [PyTorch] Scriptparser uses deprecated imp package
  • #1550 [PyTorch] torch.cross
  • #1548 [PyTorch] Unknown at::device
  • #1547 [VDims] Can't use the same vidx twice! (128 * #2 * #2)
  • #1546 [VE] User specified an unsupported autocast device_type 've'
  • #1545 [BuildChain] CentOS7 mirrors are deprecated, we might need to upgrade to manylinux_2_28
  • #1544 [Python] Destroy sol_optimizer* when error is thrown
  • #1543 [VE] Invalid results when large values get passed to exp
  • #1542 [Core] Uncaught sol::Exception when autotuning triggers an assertion
  • #1540 [VE] PY_Issue_1410 fails
  • #1538 [Python] Allow determinism to be overwritten by user
  • #1537 [PyTorch] Finalize Determinism
  • #1536 [ISPC] Make Prefixsum as template
  • #1535 [DNN] AutoTuner Cross GEMM performance
  • #1534 [PyTorch] torch.rand/rand_like F16/BF16 on X86
  • #1532 [Wrapper] Don't call sol_call if graph does not contain any nodes
  • #1530 [DNNL] PostOp Bias
  • #1529 [Core] Improve Exception Handling within TF
  • #1528 [PyTorch] aten::extend
  • #1527 [ONNX] change s_tensors to scope, and store s_opset in scope
  • #1526 [ONNX] LRN
  • #1525 [ONNX] Hub Tests
  • #1520 [DFP] No Write/Read loops in 376:628
  • #1519 [PyTorch] Can't expand [1, 3, 80, 80, 2] to [80, 80, 2] in YOLO perf run
  • #1514 [VE] Finalize VE3 support
  • #1512 [DNN] Add GEMM upcasting API to support e.g. FP16/BF16 on CPU/VE
  • #1508 [Transformers] Persimmon/Qwen2/XLMRobertaXL Accuracy
  • #1507 [TIMM] Upgrade to v1.0.7
  • #1505 [VE] Set NCC -stdlib=compat
  • #1504 [Parser] Implement lazy Numpy eval
  • #1503 [DFP] Conv Accuracy
  • #1502 [DNNL] Enable F16 and BF16
  • #1501 [DFP] Tanh(large number) == nan
  • #1498 [NVIDIA] Limit usage of TensorCores to suitable shapes
  • #1497 [ISPC] Unable to store varying in uniform var
  • #1496 [TIMM] fix models with lit != idims.end() error
  • #1494 [HLIR] Add Constraint Registry that allows to serialize them
  • #1493 [DNNL] Upgrade 3.5
  • #1492 [PyTorch] Unify Script and FX parser
  • #1490 [SLEEF] Upgrade v3.6.1
  • #1489 [PyTorch/Lightning] Update Testcase to new API
  • #1487 [PyTorch] Test v2.3.1
  • #1483 [DFP] AccessorPinning fails in PY_REDUCE training
  • #1481 [PyTorch] HuggingFace LLama: This model uses variable number of arguments, falling back to torch.jit.trace
  • #1480 [DFP] CUDA pow(-0.5, 2.0) results in NaN
  • #1477 [DNN] Add timeouts for GEMM and Transpose AutoTune
  • #1476 [DFP] remove stack_alloc calls
  • #1475 [AutoTuner] Re-enable caching with new AT scheme
  • #1474 [HLIR] SDPAttention: Cannot modify VDims at this point anymore!
  • #1472 [DFP] Revise CPU+VE cores caching
  • #1471 [CUDA] Don't abort if libcuda.so is not found (e.g. in Docker build)
  • #1470 [Torchvision] vgg11: T18_D0 violates DNNL post op requirements as src is of type Add
  • #1469 [CUDNN] Could not load library libcudnn_cnn_infer.so.8. Error: /usr/lib64/libcudnn_cnn_infer.so.8: undefined symbol
  • #1468 [Runtime] Free Persistent Data in case user reruns training fwd pass without bwd pass
  • #1466 [DFP] DFPBackend::lowerIR(Layer* l, Cast* p) causes infinite loop when using VE
  • #1464 [DFP/Nvidia] LLama using -dt tf32 results in illegal view transformation
  • #1462 [PyTorch] Test v2.3.0
  • #1459 [NCC] add -march ARCH to NCC command
  • #1457 [DFP] Unvectorized OnlineSoftMax in SDP
  • #1456 [PyTorch] If FlashAttn enabled for F32 SDP, then also set GEMM::TF32
  • #1454 [VE] VEDA_ERROR_VEO_COMMAND_EXCEPTION when running VEDNN GEMM AutoTune
  • #1452 [Wrapper] Add options to wrapper::Attributes to either override or accumulate
  • #1451 [VE] ve::trace=True does not compile
  • #1450 [DFP] Transform Numel->Cast->Broadcast to Numel->Cast with correct dims
  • #1446 [DFP] Test TensorCore-like GEMM implementations
  • #1443 [DNN] Implement GEMM autotune swapping inputs
  • #1442 [PyTorch] add Tensor.uniform_
  • #1441 [PyTorch] add one_hot
  • #1440 [DFP] HMerge identical Pooling/Conv and Reduce layers
  • #1439 [DFP] Reduce Cores LoopAccessors
  • #1437 [DFP] Rework Combine Input
  • #1435 [Accuracy] Investigate SoftMax Gradient problem
  • #1432 [Runtime] pass Model.training parameter as runtime parameter
  • #1431 [DFP/ISPC] add iterations to sol_dfp_reduce_x(..., count_iterations)
  • #1430 [HLIR] move Bernoulli::transform(Device&) to respective backends
  • #1429 [DNNL] Upgrade 3.4.1
  • #1427 [PyTorch] VLLM support
  • #1425 [Docs] Update max TF version
  • #1424 [HLIR] SDPAttention layer
  • #1422 [DFP] Rework Reduce::transform as it underutilizes
  • #1421 [ISPC] Add PyTorch_DETERMINISM compile flag, as they don't use FP64 sleef functions!
  • #1419 [PyTorch] Test v2.2.2
  • #1417 [DFP/X86] Accuracy GELU
  • #1416 [DFP] Loop+AccLoop fusion
  • #1415 [PyTorch] NaN when training MNist example on CUDA
  • #1413 [PyTorch] Pass on fwargs and vdims to torch.compile Backend
  • #1412 [TestBench] Add Default DType to perf Output
  • #1411 [PyTorch] add tensordot
  • #1410 [CUDA] Wrong result in CUDA Transpose
  • #1409 [PyTorch] evaluate FP64 GEMM uses tensor cores
  • #1407 [RNN] Add Determinism to API
  • #1406 [Profiler] MemTrace reports wrong total
  • #1405 [NVIDIA] LayerNorm accuracy
  • #1403 [NVIDIA] consider to remove cross-warp-reduction support
  • #1402 [DFP] OnlineSoftMax::derive
  • #1399 [Runtime/HLIR] Include Model Inputs in runtime::RequiresGrad
  • #1398 [DNNL] Use PostOp Activations in Inference
  • #1397 [Numpy] Adopt new VDims system
  • #1396 [Compiler/Runtime] Remove special "INF" case of Derivative
  • #1395 [TF] Missing TF handler for DivNoNan
  • #1394 [PyTorch] YOLO accuracy
  • #1393 [TF] Test v2.15.1
  • #1391 [Runtime] Consider separating INF and Training
  • #1388 [DFP] Still unused Loop Unpacks, e.g. in AlexNet
  • #1387 [DFP] Check Merged FOR-FOR loops, that cause unpacking of Loops (e.g. AlexNet)
  • #1386 [TIMM] fix vit_base_patch16_384
  • #1385 [Profiler] Reporting to file not working
  • #1384 [NVIDIA] Evaluate new __reduce_xxx_sync function for CC >=8
  • #1383 [DFP] Improve cache planning
  • #1382 [DFP] Unnecessary cast in FP16 Mul->Mul
  • #1381 [DFP] Expected [S64] for sol::compiler::backend::dfp::Indices but found [S32]
  • #1378 [HLIR] Inherit Cast in MaxPooling.Indices -> Cast -> ...
  • #1376 [Runtime] Enable to use sol.device.set(...) from different Threads to run multiple devices in parallel
  • #1373 [CUDA] SOL crashes with "invalid device context" when using Streamlit
  • #1372 [HLIR] Evaluate if using determinism instead of sol.hlir.RandType is sufficient
  • #1371 [DFP] Upcast pure intermediate results in FP16/BF16 to FP32
  • #1370 [PyTorch] Test v2.2.1
  • #1369 [DFP] Add transformation to upcast internal data types from f16 to f32.
  • #1367 [VE] SegFault Norms(64)
  • #1366 [CUDNN] CUDNN_STATUS_BAD_PARAM in TF RegNet
  • #1365 [NVIDIA] PY_Reduce(6/12) argmin/argmax fails with F64
  • #1364 [NVIDIA] PY_Issue_1316 fails with F64
  • #1363 [Tests] Enable non-fp32 dtypes in testsuites
  • #1360 [DNN] use sol_sync instead of sol::dnn:XXX::sync
  • #1359 [CUDA] performance of cublasGEAM not optimal, e.g. in TransposeSwap testcase
  • #1358 [DFP] Fix elementwise cases that don't get CoresSIMD assigned
  • #1357 [Runtime] Implement sol_tensor_swap to skip Noop Reorders
  • #1356 [Sleef] Upgrade v3.6
  • #1355 [PyTorch] Performance Issues in CNNs with BS=1 on X86
  • #1351 [PyTorch] Fix 'PY_Padding' on VE
  • #1350 [PyTorch] Fix 'PY_Norms(3)' on VE
  • #1349 [PyTorch] Fix 'PY_Addbmm' on VE
  • #1348 [PyTorch] Fix 'PY_Matmul#T#Batched' on VE
  • #1346 [PyTorch] Fix 'PY_CatRewire' -> nan on VE
  • #1344 [YAAL] Return if any of the checks fail with error code
  • #1343 [TF] Test v2.15.0
  • #1342 [Runtime] Trigger recompilation if non-dynamic dimension changes instead of crashing
  • #1341 [NVIDIA] Fix library detection if cu11 and cu12 packages are installed
  • #1338 [Rand] Numpy Random Number Generator
  • #1337 [HLIR] Add tensor[condition] operator
  • #1336 [HLIR] Remove (Layer*) arguments from Operation, as they know their layer via m_layer!
  • #1335 [BLAS] Fix decision making on AMD EPYC for new Autotuning
  • #1334 [BLAS] Unify OpenBLAS, MKL, DNNL and AOCLBLAS BLAS Interface
  • #1333 [OpenBLAS] Add Backend
  • #1332 [AutoTuner] Consider Backend Specific "number of runs" and "not improved"
  • #1331 [AutoTuner] Add option to poll performance for a layer from within another layer's tuning cycle
  • #1330 [PyTorch] Capture ::c10::Error errors in handle and rethrow as sol::Exception
  • #1329 [AutoTuner] AutoTuner Cache does not allow rerunning Reorder -> GEMM, profiling when identical GEMM layer but without previous Reorder was executed
  • #1328 [PyTorch] Test v2.2.0
  • #1327 [Numpy] Executor overrides input
  • #1326 [MKL/VEBLAS] Evaluate if using GEMV when bs==1 is better
  • #1323 [Profiler] Fix Total Bytes
  • #1322 [DNN] Evaluate other GEMM tuning strategies
  • #1320 [DNN] repair GEMM autotuning
  • #1319 [VE] Attach to host profiler
  • #1318 [PyTorch] Set executing device in model without inputs and parameters
  • #1317 [PyTorch] Adjust executor to also copy MemberTensors that are no Buffers to device
  • #1316 [PyTorch] aten::scaled_dot_product_attention
  • #1315 [TF] Read accuracy modifying values, e.g. tf32 execution
  • #1314 [PyTorch+VE] might cause segfault when exiting
  • #1313 [PyTorch] Using shape of not used tensor
  • #1312 [CAPI] Unify SOL dtypes for generated code
  • #1311 [NCC] Don't expect nc++, ... to be installed in /opt/nec/ve/bin/
  • #1310 [Installer] Add option to renew license
  • #1309 [Installer] Download option not working, as PIP downloads only Wheels, not Source packages
  • #1308 [HLIR] Tensors should only evaluate value_op if values and value_op are not None
  • #1307 [HLIR] Parser implemented non-existent np functions, e.g. np.erf, np.acos, ...
  • #1306 [TF] Check Resnet50 CPU performance
  • #1304 [Core] Progress bar breaks, when Backend Handles get compiled during optimization process
  • #1298 [DNNL] "Unsupported dnnl_format_tag_t: POI/2/true" in tf/regnet
  • #1296 [Python] Remove SOL_CONSTANT Params
  • #1289 [DFP] Store ReAlloc not as instruction but directly within LoopStack
  • #1285 [DFP] Group WriteBacks for better performance in case of MasterStacks
  • #1284 [PyTorch] Read Torch accuracy + determinism values and attach them to the layers
  • #1282 [DFP] Minimize Loop Index Calculations
  • #1275 [ISPC] Investigate impact of setting different target gang-sizes in ISPC compilation
  • #1269 [PyTorch] Implement FlashAttention
  • #1268 [CUBLAS] FP16 + using advanced flags
  • #1264 [DFP] Implement Online normalizer calculation for softmax
  • #1250 [AOCL] Add new Backend
  • #1240 [PyTorch] enable Torch Compile style fwd+bwd within one pass
  • #1238 [PyTorch] Bloom Support and Optimizations
  • #1236 [HLIR] Reorder -> GEMM transform
  • #1223 [DNNL] Enable NVIDIA GPUs
  • #1193 [PyTorch/JIT] Add torch.jit.freeze to test-suite
  • #1191 [VDims] Autodetect VDims
  • #1190 [Profiler] Trace Memory Allocations
  • #1186 [VE] Fix AutoCast to 32Bit/64Bit vars to enable vectorization
  • #1185 [NCC] v5.0.2 fails in PY_Norms, PY_Reduce and PY_TorchNorm
  • #1180 [HLIR/DFP] Enable to store SOL_CONSTANT in model
  • #1153 [PyTorch] Investigate torch.fx for improving the parser
  • #1060 [CUDA] Implement transpose as series of calls to cublasgeam
  • #1043 [TF] "Unable to find SOL_CONSTANT"
  • #999 [ISPC] Improve SLEEF integration
  • #991 [Python] Improve performance of sol.optimize
  • #920 [PyTorch] Add more Einsum testcases
  • #913 [DFP] Rework stack memory caches
  • #798 [NVIDIA] Change dependencies to use NVIDIA PIP packages
  • #787 [AutoTuner] Think about choosing algorithms not solely based on performance, but also on their neighborhood, to increase chances of fusion.
  • #766 [TF] Accuracy for BatchNorm in Efficientnet
  • #696 [PyTorch] add aten::roll
  • #651 [TF] tf.keras.applications.efficientnet.EfficientNet + V2 accuracy problems
  • #622 [ISPC] ISPC wrongly casts double to uint8
  • #504 [DFP] fix removal of unnecessary accumulators
  • #503 [DFP] Memory Pinning
  • #502 [DFP] Operation Inlining
  • #497 [HLIR] Add PassThrough Node
  • #495 [VEBLAS] Evaluate SOL's BatchedGEMM versus new NLC BatchedGEMM
  • #390 [VEDNN] re-add GEMM
  • #367 [PyTorch] Add YOLO Test Case
  • #366 [ONNX] Add YOLO TestCase
  • #319 [PyTorch] Add option to parse "if self.training:" diverging paths
  • #197 [DFP] Reorder: Fill
  • #196 [DFP] Reorder: Narrow
  • #104 [All] FP16, BFloat16
  • #69 [DFP] Performance: BatchNorm Welford Algorithm