27.07.2024 v0.6.0
Highlights
	- Added experimental support for vLLM.
 
	- Added experimental support for CUDAGraphs in PyTorch.
 
	- Added BFloat16 and Float16 support for X86 and NVIDIA.
 
	- Added FlashAttn-like kernel fusion for X86, NVIDIA and SX-Aurora.
 
	- Improved torch.compile(...) integration.
	- SOL no longer aborts execution but properly throws Python exceptions that can be caught using try: ... except sol.Exception as e: ... (see the sketch below).
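	The new error handling can be combined with the improved torch.compile integration mentioned above. The following minimal Python sketch assumes SOL is installed and importable as sol, and that it registers a torch.compile backend under the name "sol"; the backend string is an assumption, not a confirmed API detail.

	    import torch
	    import sol  # SOL's Python package

	    model = torch.nn.Linear(64, 32)
	    x = torch.randn(8, 64)

	    try:
	        # Backend name "sol" is an assumption; check the SOL documentation
	        # for the exact string registered with torch.compile.
	        compiled = torch.compile(model, backend="sol")
	        y = compiled(x)
	    except sol.Exception as e:
	        # As of v0.6.0, SOL raises sol.Exception instead of aborting the process.
	        print(f"SOL reported an error: {e}")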
 
Breaking Changes
 
	- sol.config['compiler::profile'] has been deprecated. Use the SOL_PROFILER environment variable instead (see the sketch below).
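
	A minimal sketch of the replacement, assuming SOL reads the variable at import time and that "1" is an accepted enabling value (both are assumptions):

	    import os

	    # Replaces the deprecated sol.config['compiler::profile'] switch.
	    os.environ["SOL_PROFILER"] = "1"  # "1" is an assumed enabling value

	    import sol  # assumption: set the variable before importing SOL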
 
Known Issues
	- No BFloat16 and Float16 support for SX-Aurora.
 
	- Performance regressions on SX-Aurora (e.g., ConvNext).
 
	- No gradient computations yet for Interpolation layers.
 
 
Closed Issues
- #1555 	[NVIDIA] NAN in Albert model
 
- #1554 	[PyTorch] Scriptparser uses deprecated imp package
 
- #1550 	[PyTorch] torch.cross
 
- #1548 	[PyTorch] Unknown at::device
 
- #1547 	[VDims] Can't use the same vidx twice! (128 * #2 * #2)
 
- #1546 	[VE] User specified an unsupported autocast device_type 've'
 
- #1545 	[BuildChain] CentOS7 mirrors are deprecated, we might need to upgrade to manylinux_2_28
 
- #1544 	[Python] Destroy sol_optimizer* when error is thrown
 
- #1543 	[VE] Invalid results when large values get passed to exp
 
- #1542 	[Core] Uncaught sol::Exception when autotuning triggers an assertion
 
- #1540 	[VE] PY_Issue_1410 fails
 
- #1538 	[Python] Allow determinism to be overwritten by user
- #1537 	[PyTorch] Finalize Determinism
 
- #1536 	[ISPC] Make Prefixsum as template
 
- #1535 	[DNN] AutoTuner Cross GEMM performance
 
- #1534 	[PyTorch] torch.rand/rand_like F16/BF16 on X86
 
- #1532 	[Wrapper] Don't call sol_call if graph does not contain any nodes
 
- #1530 	[DNNL] PostOp Bias
 
- #1529 	[Core] Improve Exception Handling within TF
 
- #1528 	[PyTorch] aten::extend
 
- #1527 	[ONNX] change s_tensors to scope, and store s_opset in scope
 
- #1526 	[ONNX] LRN
 
- #1525 	[ONNX] Hub Tests
 
- #1520 	[DFP] No Write/Read loops in 376:628
 
- #1519 	[PyTorch] Can't expand [1, 3, 80, 80, 2] to [80, 80, 2] in YOLO perf run
 
- #1514 	[VE] Finalize VE3 support
 
- #1512 	[DNN] Add GEMM upcasting API to support e.g. FP16/BF16 on CPU/VE
 
- #1508 	[Transformers] Persimmon/Qwen2/XLMRobertaXL Accuracy
 
- #1507 	[TIMM] Upgrade to v1.0.7
 
- #1505 	[VE] Set NCC -stdlib=compat
 
- #1504 	[Parser] Implement lazy Numpy eval
 
- #1503 	[DFP] Conv Accuracy
 
- #1502 	[DNNL] Enable F16 and BF16
 
- #1501 	[DFP] Tanh(large number) == nan
 
- #1498 	[NVIDIA] Limit usage of TensorCores to suitable shapes
 
- #1497 	[ISPC] Unable to store varying in uniform var
 
- #1496 	[TIMM] fix models with lit != idims.end() error
- #1494 	[HLIR] Add Constraint Registry that allows to serialize them
 
- #1493 	[DNNL] Upgrade 3.5
 
- #1492 	[PyTorch] Unify Script and FX parser
 
- #1490 	[SLEEF] Upgrade v3.6.1
 
- #1489 	[PyTorch/Lightning] Update Testcase to new API
 
- #1487 	[PyTorch] Test v2.3.1
 
- #1483 	[DFP] AccessorPinning fails in PY_REDUCE training
 
- #1481 	[PyTorch] HuggingFace LLama: This model uses variable number of arguments, falling back to torch.jit.trace
- #1480 	[DFP] CUDA pow(-0.5, 2.0) results in NaN
 
- #1477 	[DNN] Add timeouts for GEMM and Transpose AutoTune
 
- #1476 	[DFP] remove stack_alloc calls
 
- #1475 	[AutoTuner] Re-enable caching with new AT scheme
 
- #1474 	[HLIR] SDPAttention: Cannot modify VDims at this point anymore!
 
- #1472 	[DFP] Revise CPU+VE cores caching
 
- #1471 	[CUDA] Don't abort if libcuda.so is not found (e.g. in Docker build)
 
- #1470 	[Torchvision] vgg11: T18_D0 violates DNNL post op requirements as src is of type Add
 
- #1469 	[CUDNN] Could not load library libcudnn_cnn_infer.so.8. Error: /usr/lib64/libcudnn_cnn_infer.so.8: undefined symbol
 
- #1468 	[Runtime] Free Persistent Data in case user reruns training fwd pass without bwd pass
 
- #1466 	[DFP] DFPBackend::lowerIR(Layer* l, Cast* p) causes infinite loop when using VE
 
- #1464 	[DFP/Nvidia] LLama using -dt tf32 results in illegal view transformation
 
- #1462 	[PyTorch] Test v2.3.0
 
- #1459 	[NCC] add -march ARCH to NCC command
- #1457 	[DFP] Unvectorized OnlineSoftMax in SDP
 
- #1456 	[PyTorch] If FlashAttn enabled for F32 SDP, then also set GEMM::TF32
 
- #1454 	[VE] VEDA_ERROR_VEO_COMMAND_EXCEPTION when running VEDNN GEMM AutoTune
 
- #1452 	[Wrapper] Add options to wrapper::Attributes to either override or accumulate
 
- #1451 	[VE] ve::trace=True does not compile
 
- #1450 	[DFP] Transform Numel->Cast->Broadcast to Numel->Cast with correct dims
 
- #1446 	[DFP] Test TensorCore-like GEMM implementations
 
- #1443 	[DNN] Implement GEMM autotune swapping inputs
 
- #1442 	[PyTorch] add Tensor.uniform_
 
- #1441 	[PyTorch] add one_hot
 
- #1440 	[DFP] HMerge identical Pooling/Conv and Reduce layers
 
- #1439 	[DFP] Reduce Cores LoopAccessors
 
- #1437 	[DFP] Rework Combine Input
 
- #1435 	[Accuracy] Investigate SoftMax Gradient problem
 
- #1432 	[Runtime] pass Model.training parameter as runtime parameter
 
- #1431 	[DFP/ISPC] add iterations to sol_dfp_reduce_x(..., count_iterations)
 
- #1430 	[HLIR] move Bernoulli::transform(Device&) to respective backends
 
- #1429 	[DNNL] Upgrade 3.4.1
 
- #1427 	[PyTorch] VLLM support
 
- #1425 	[Docs] Update max TF version
 
- #1424 	[HLIR] SDPAttention layer
 
- #1422 	[DFP] Rework Reduce::transform as it underutilizes
 
- #1421 	[ISPC] Add PyTorch_DETERMINISM compile flag, as they don't use FP64 sleef functions!
 
- #1419 	[PyTorch] Test v2.2.2
 
- #1417 	[DFP/X86] Accuracy GELU
 
- #1416 	[DFP] Loop+AccLoop fusion
 
- #1415 	[PyTorch] NaN when training MNist example on CUDA
 
- #1413 	[PyTorch] Pass on fwargs and vdims to torch.compile Backend
 
- #1412 	[TestBench] Add Default DType to perf Output
 
- #1411 	[PyTorch] add tensordot
 
- #1410 	[CUDA] Wrong result in CUDA Transpose
 
- #1409 	[PyTorch] evaluate FP64 GEMM uses tensor cores
 
- #1407 	[RNN] Add Determinism to API
 
- #1406 	[Profiler] MemTrace reports wrong total
 
- #1405 	[NVIDIA] LayerNorm accuracy
 
- #1403 	[NVIDIA] consider to remove cross-warp-reduction support
 
- #1402 	[DFP] OnlineSoftMax::derive
 
- #1399 	[Runtime/HLIR] Include Model Inputs in runtime::RequiresGrad
 
- #1398 	[DNNL] Use PostOp Activations in Inference
 
- #1397 	[Numpy] Adopt new VDims system
 
- #1396 	[Compiler/Runtime] Remove special "INF" case of Derivative
 
- #1395 	[TF] Missing TF handler for DivNoNan
 
- #1394 	[PyTorch] YOLO accuracy
 
- #1393 	[TF] Test v2.15.1
 
- #1391 	[Runtime] Consider separating INF and Training
 
- #1388 	[DFP] Still unused Loop Unpacks, e.g. in AlexNet
 
- #1387 	[DFP] Check Merged FOR-FOR loops, that cause unpacking of Loops (e.g. AlexNet)
 
- #1386 	[TIMM] fix vit_base_patch16_384
 
- #1385 	[Profiler] Reporting to file not working
 
- #1384 	[NVIDIA] Evaluate new __reduce_xxx_sync function for CC >=8
 
- #1383	[DFP] Improve cache planning
 
- #1382 	[DFP] Unnecessary cast in FP16 Mul->Mul
 
- #1381 	[DFP] Expected [S64] for sol::compiler::backend::dfp::Indices but found [S32]
 
- #1378 	[HLIR] Inherit Cast in MaxPooling.Indices -> Cast -> ...
 
- #1376 	[Runtime] Enable to use sol.device.set(...) from different Threads to run multiple devices in parallel
- #1373 	[CUDA] SOL crashes with "invalid device context" when using Streamlit
 
- #1372 	[HLIR] Evaluate if using determinism instead of sol.hlir.RandType is sufficient
 
- #1371 	[DFP] Upcast pure intermediate results in FP16/BF16 to FP32
 
- #1370 	[PyTorch] Test v2.2.1
 
- #1369 	[DFP] Add transformation to upcast internal data types from f16 to f32.
 
- #1367 	[VE] SegFault Norms(64)
 
- #1366 	[CUDNN] CUDNN_STATUS_BAD_PARAM in TF RegNet
 
- #1365 	[NVIDIA] PY_Reduce(6/12) argmin/argmax fails with F64
 
- #1364 	[NVIDIA] PY_Issue_1316 fails with F64
 
- #1363 	[Tests] Enable non-fp32 dtypes in testsuites
 
- #1360 	[DNN] use sol_sync instead of sol::dnn:XXX::sync
- #1359	[CUDA] performance of cublasGEAM not optimal, e.g. in TransposeSwap testcase
 
- #1358 	[DFP] Fix elementwise cases that don't get CoresSIMD assigned
 
- #1357 	[Runtime] Implement sol_tensor_swap to skip Noop Reorders
 
- #1356 	[Sleef] Upgrade v3.6
 
- #1355 	[PyTorch] Performance Issues in CNNs with BS=1 on X86
 
- #1351 	[PyTorch] Fix 'PY_Padding' on VE
 
- #1350 	[PyTorch] Fix 'PY_Norms(3)' on VE
 
- #1349 	[PyTorch] Fix 'PY_Addbmm' on VE
 
- #1348 	[PyTorch] Fix 'PY_Matmul#T#Batched' on VE
 
- #1346 	[PyTorch] Fix 'PY_CatRewire' -> nan on VE
 
- #1344 	[YAAL] Return if any of the checks fail with error code
 
- #1343 	[TF] Test v2.15.0
 
- #1342 	[Runtime] Trigger recompilation if non-dynamic dimension changes instead of crashing
 
- #1341 	[NVIDIA] Fix library detection if cu11 and cu12 packages are installed
 
- #1338 	[Rand] Numpy Random Number Generator
 
- #1337 	[HLIR] Add tensor[condition] operator
- #1336 	[HLIR] Remove (Layer*) arguments from Operation, as they know their layer via m_layer!
- #1335 	[BLAS] Fix decision making on AMD EPYC for new Autotuning
 
- #1334 	[BLAS] Unify OpenBLAS, MKL, DNNL and AOCLBLAS BLAS Interface
 
- #1333 	[OpenBLAS] Add Backend
 
- #1332 	[AutoTuner] Consider Backend Specific "number of runs" and "not improved"
 
- #1331 	[AutoTuner] Add option to poll performance for a layer from within another layer's tuning cycle
 
- #1330 	[PyTorch] Capture ::c10::Error errors in handle and rethrow as sol::Exception
 
- #1329 	[AutoTuner] AutoTuner Cache does not allow rerunning Reorder -> GEMM, profiling when identical GEMM layer but without previous Reorder was executed
 
- #1328 	[PyTorch] Test v2.2.0
 
- #1327 	[Numpy] Executor overrides input
 
- #1326 	[MKL/VEBLAS] Evaluate if using GEMV when bs==1 is better
 
- #1323 	[Profiler] Fix Total Bytes
 
- #1322 	[DNN] Evaluate other GEMM tuning strategies
 
- #1320 	[DNN] repair GEMM autotuning
 
- #1319 	[VE] Attach to host profiler
 
- #1318 	[PyTorch] Set executing device in model without inputs and parameters
 
- #1317 	[PyTorch] Adjust executor to also copy MemberTensors that are no Buffers to device
 
- #1316 	[PyTorch] aten::scaled_dot_product_attention
 
- #1315 	[TF] Read accuracy modifying values, e.g. tf32 execution
 
- #1314 	[PyTorch+VE] might cause segfault when exiting
 
- #1313 	[PyTorch] Using shape of not used tensor
 
- #1312 	[CAPI] Unify SOL dtypes for generated code
 
- #1311 	[NCC] Don't expect nc++, ... to be installed in /opt/nec/ve/bin/
 
- #1310 	[Installer] Add option to renew license
 
- #1309 	[Installer] Download option not working, as PIP downloads only Wheels, not Source packages
 
- #1308 	[HLIR] Tensors should only evaluate value_op if values and value_op are not None
 
- #1307 	[HLIR] Parser implemented non existing np functions, e.g. np.erf, np.acos, ...
 
- #1306 	[TF] Check Resnet50 CPU performance
 
- #1304 	[Core] Progress bar breaks, when Backend Handles get compiled during optimization process
 
- #1298 	[DNNL] "Unsupported dnnl_format_tag_t: POI/2/true" in tf/regnet
 
- #1296 	[Python] Remove SOL_CONSTANT Params
 
- #1289 	[DFP] Store ReAlloc not as instruction but directly within LoopStack
 
- #1285 	[DFP] Group WriteBacks for better performance in case of MasterStacks
 
- #1284 	[PyTorch] Read Torch accuracy + determinism values and attach them to the layers
 
- #1282 	[DFP] Minimize Loop Index Calculations
 
- #1275	[ISPC] Investigate impact of setting different target gang-sizes in ISPC compilation
 
- #1269 	[PyTorch] Implement FlashAttention
 
- #1268 	[CUBLAS] FP16 + using advanced flags
 
- #1264 	[DFP] Implement Online normalizer calculation for softmax
- #1250 	[AOCL] Add new Backend
 
- #1240	[PyTorch] enable Torch Compile style fwd+bwd within one pass
 
- #1238 	[PyTorch] Bloom Support and Optimizations
 
- #1236 	[HLIR] Reorder -> GEMM transform
 
- #1223	[DNNL] Enable NVIDIA GPUs
 
- #1193 	[PyTorch/JIT] Add torch.jit.freeze to test-suite
 
- #1191 	[VDims] Autodetect VDims
 
- #1190 	[Profiler] Trace Memory Allocations
 
- #1186 	[VE] Fix AutoCast to 32Bit/64Bit vars to enable vectorization
 
- #1185	[NCC] v5.0.2 fails in PY_Norms, PY_Reduce and PY_TorchNorm
 
- #1180	[HLIR/DFP] Enable to store SOL_CONSTANT in model
 
- #1153 	[PyTorch] Investigate torch.fx for improving the parser
 
- #1060	[CUDA] Implement transpose as series of calls to cublasgeam
 
- #1043	[TF] "Unable to find SOL_CONSTANT"
 
- #999 	[ISPC] Improve SLEEF integration
 
- #991 	[Python] Improve performance of sol.optimize
 
- #920 	[PyTorch] Add more Einsum testcases
 
- #913 	[DFP] Rework stack memory caches
 
- #798	[NVIDIA] Change dependencies to use NVIDIA PIP packages
 
- #787	[AutoTuner] Think about choosing algorithms not solely based on performance, but also on their neighborhood, to increase chances of fusion.
 
- #766	[TF] Accuracy for BatchNorm in Efficientnet
 
- #696	[PyTorch] add aten::roll
 
- #651 	[TF] tf.keras.applications.efficientnet.EfficientNet + V2 accuracy problems
 
- #622 	[ISPC] ISPC wrongly casts double to uint8
 
- #504 	[DFP] fix removal of unnecessary accumulators
 
- #503 	[DFP] Memory Pinning
 
- #502	[DFP] Operation Inlining
 
- #497 	[HLIR] Add PassThrough Node
 
- #495 	[VEBLAS] Evaluate SOL's BatchedGEMM versus new NLC BatchedGEMM
 
- #390 	[VEDNN] readd GEMM
 
- #367 	[PyTorch] Add YOLO Test Case
 
- #366	[ONNX] Add YOLO TestCase
 
- #319	[PyTorch] Add option to parse "if self.training:" diverging paths
 
- #197 	[DFP] Reorder: Fill
 
- #196 	[DFP] Reorder: Narrow
 
- #104 	[All] FP16, BFloat16
 
- #69 	[DFP] Performance: BatchNorm Welford Algorithm
 
 