This paper presents a fine-tuned implementation of the quicksort algorithm for highly parallel multicore NVIDIA graphics processors. The described approach focuses on algorithmic and implementation-level improvements to achieve enhanced performance. Several fine-tuning techniques are explored to identify the best combination of improvements for the quicksort algorithm on GPUs. The results show that this approach significantly reduces execution time and improves algorithmic metrics, such as the number of iterations of the algorithm and the number of operations performed, compared to its predecessors. The experiments are conducted on an NVIDIA graphics card, taking several distributions of input data into account. The findings suggest that this fine-tuning approach can enable efficient and fast sorting on GPUs for a wide range of applications.
In this article, an upgraded version of CUDA-Quicksort, an iterative implementation of the quicksort algorithm suitable for highly parallel multicore graphics processors, is described and evaluated. Three key changes that lead to improved performance are proposed. The main goal, which was successfully achieved, was to provide an implementation that scales with the size of the data sets and with the number of cores on modern GPU architectures. The proposed changes also lead to a significant reduction in execution time. The execution times were measured on an NVIDIA graphics card, taking the possible distributions of the input data into account.
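The papers' actual CUDA kernels are not reproduced in these abstracts; as a point of reference, the following is a minimal sequential sketch, in Python, of the iterative (explicit-stack) quicksort that such GPU implementations start from. The function name and pivot choice are illustrative assumptions, not taken from the papers.

```python
def iterative_quicksort(a):
    """In-place iterative quicksort. The explicit stack replaces
    recursion; on a GPU, each stacked sub-partition becomes an
    independent unit of work and the partition loop is done
    cooperatively by many threads."""
    stack = [(0, len(a) - 1)]
    while stack:
        lo, hi = stack.pop()
        if lo >= hi:
            continue
        pivot = a[hi]                   # simple last-element pivot (illustrative)
        i = lo
        for j in range(lo, hi):         # Lomuto partition step
            if a[j] < pivot:
                a[i], a[j] = a[j], a[i]
                i += 1
        a[i], a[hi] = a[hi], a[i]       # place pivot at its final position
        stack.append((lo, i - 1))       # sub-partitions become new work units
        stack.append((i + 1, hi))
    return a
```

The iterative form matters because GPU kernels cannot rely on deep recursion; partitions are instead queued and dispatched in waves.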
Education and society always lag behind the technical state of the art. General computer literacy took decades to gain public acceptance after computers became available. Smartphones have entered our lives and become an extension of the human body, yet we still do not know how to apply them properly in education. Artificial intelligence is an exciting technology that adapts educational experiences to different learning groups, teachers and tutors. Intelligent Management Systems (IMS) are not a novelty in education, though. There have been many experiments, but they have all somehow stalled due to immature technology or misinterpretation. We can now see a new impetus for AI in education, and its impact will soon be very noticeable. In education, AI can personalize learning, connect and create innovative learning content, perform tutoring in intelligent tutoring systems, help pupils with special needs, help teachers assess, give students access to learning content, and translate educational content between languages, removing language barriers. This article explores the different possibilities of using AI in education.
In this paper, three variants of the Floyd-Warshall (FW) All Pairs Shortest Path (APSP) algorithm are presented and compared: the sequential implementation, the parallel implementation using the Nvidia CUDA API, and the blocked parallel version of the FW algorithm. A performance analysis between these three approaches, as well as between the individual phases of the parallel algorithm, is provided. The performance of these algorithms has been measured on regular as well as embedded GPU hardware, and a significant speedup has been achieved. Additionally, this paper shows that blocked data access results in significant energy savings of up to 72% on embedded hardware.
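For reference, the baseline sequential Floyd-Warshall variant compared above is the classic triple loop; a minimal Python sketch follows (the matrix representation is an assumption for illustration). The blocked version tiles this same computation so that each tile fits in fast memory, which is the source of the reported energy savings.

```python
def floyd_warshall(dist):
    """All-pairs shortest paths on an adjacency matrix (list of lists).
    dist[i][j] is the edge weight, float('inf') if there is no edge.
    Each k-phase is what the CUDA version launches as a kernel; the
    blocked variant processes the matrix in tiles for data locality."""
    n = len(dist)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = dist[i][k] + dist[k][j]
                if via_k < dist[i][j]:
                    dist[i][j] = via_k   # shorter path through vertex k
    return dist
```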
This paper presents the speedup achieved through parallelization of code for computing π. The codes are implemented in C# with the .NET framework and in C with OpenMP, on a machine with an i7 processor. Parallelizing this code, an embarrassingly parallel problem, should in theory yield linear speedup, but as shown in this paper, that is not what was observed. The differences in speedup between OpenMP and the Task Parallel Library (hereinafter TPL) are demonstrated by measuring speedup in different scenarios. The problem remains the same throughout the scenarios, but the number of iterations and the number of active cores are varied. Finally, results comparing the execution times of the serial and parallel computations are presented. Ultimately, the results show that OpenMP is the recommended parallelization tool for problems similar to the one considered in this paper.
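The abstract does not state which π formula is used; a common choice for this benchmark is the midpoint-rule integral of 4/(1+x²) on [0,1], sketched below in Python under that assumption. The iteration range is split into equal chunks, mimicking an OpenMP static schedule or a set of TPL tasks, and the partial sums are then reduced.

```python
def pi_partial(start, stop, n):
    """One worker's slice of the midpoint-rule integral of 4/(1+x^2)
    on [0,1]; each OpenMP thread or TPL task computes one such slice."""
    h = 1.0 / n
    return h * sum(4.0 / (1.0 + ((i + 0.5) * h) ** 2)
                   for i in range(start, stop))

def pi_approx(n, workers=4):
    """Split the n iterations into equal chunks (a static schedule),
    then reduce the partial sums -- the embarrassingly parallel shape."""
    bounds = [n * w // workers for w in range(workers + 1)]
    return sum(pi_partial(bounds[w], bounds[w + 1], n)
               for w in range(workers))
```

Because the chunks are independent and the only shared step is the final reduction, this is the structure for which linear speedup is expected in theory.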
At the beginning of the new century, we started an educational project aimed at the joint development and shared use of teaching materials for software engineering education. The aim was to transfer knowledge as well as to save expenses. Over the years, our cooperation covered not only the development of the course, but also its delivery, e.g. through guest lecturing. This paper reports on the experience gained in a multi-country project. Both success factors and problems are outlined.
This paper shows, by example, how to determine an acceptable response time of an information system (IS) as part of a Service Level Agreement (SLA) in cloud-based information systems (CBIS), for randomly selected requirements and hardware infrastructure. The analyzed IS is implemented on three different database management systems (DBMS): MS SQL Server, a commercial relational database management system (RDBMS); MySQL Server, an open source RDBMS; and MongoDB, an open source document-oriented DBMS. Our analysis covers databases that differ in implementation and license, is not directly tied to the hosting mechanism, and is based on cloud-hosted DBMSs. Response times for the different databases are compared in order to determine the most suitable DBMS for implementing a cloud-based IS. We show that open source DBMSs are an acceptable replacement for commercial DBMSs, and that a cloud-based IS is much cheaper and easier to implement than an IS hosted on local private infrastructure.
Discovering the true requirements for computer system performance has always been tricky and time-consuming. Data size, the number of simultaneous queries and their mutual isolation play key roles in defining the need for computer resources. The response time of an average application is crucial from the customer's point of view. This paper explains the lessons learned from benchmarking response time.
Evolutionary strategies are heuristic, guided-search evolutionary algorithms, widely used as an optimization technique for computationally intensive problems. Python is a high-level programming language known for code readability, reusability and ease of use, making it a preferable choice for quick and robust software development, although it falls short in the areas of performance and concurrency. Emerging technologies such as the Anaconda Accelerate Python compiler attempt to combine Python's ease of use with both declarative and explicit parallelization and high performance for computationally intensive problems. In this paper, an example master-slave parallel implementation of the evolutionary strategy ES(μ,λ) in Python is given, and its performance on CPU and GPU is analyzed.
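The paper's own parallel code is not reproduced here; the following is a minimal sequential sketch of a (μ,λ) evolutionary strategy in Python, with the sigma-decay schedule, function names and test objective all being illustrative assumptions. In a master-slave layout, the fitness evaluations of the λ offspring are the part farmed out to workers (CPU cores or the GPU).

```python
import random

def es_mu_lambda(f, dim, mu=5, lam=20, sigma=0.3, gens=60, seed=1):
    """Minimal (mu, lambda) evolutionary strategy minimising f.
    Comma selection: parents are replaced by the best mu of the
    lam offspring each generation. The offspring evaluations are
    the embarrassingly parallel hot spot for a master-slave design."""
    rng = random.Random(seed)
    parents = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(mu)]
    for _ in range(gens):
        offspring = [[x + rng.gauss(0, sigma) for x in rng.choice(parents)]
                     for _ in range(lam)]
        offspring.sort(key=f)          # evaluation: distribute to slaves here
        parents = offspring[:mu]       # comma selection: offspring only
        sigma *= 0.95                  # simple step-size decay (assumption)
    return min(parents, key=f)

best = es_mu_lambda(lambda v: sum(x * x for x in v), dim=3)
```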
MapReduce as a programming model is considered one of the biggest improvements in massive data processing that utilizes parallelization. The increasing amount of data being processed and stored has caused a need to investigate more efficient solutions to common problems, one of which is performing a join operation on two interconnected datasets. In this paper, a classic sequential solution to this problem is compared with a MapReduce approach, with the intent of discovering the relative advantages of the two. The sequential application's runtime is shown to be prohibitively slow even for datasets of sizes considered negligible by today's standards. Furthermore, a MapReduce cluster of five Amazon EC2 nodes is shown to process, in the same time period, ten times more data than the sequential application.
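The abstract does not specify the join strategy; a common MapReduce formulation is the reduce-side equi-join, sketched below in plain Python on one machine (the function name and record format are illustrative assumptions). The map phase tags each record with its origin, the shuffle groups records by join key, and the reduce emits the cross product per key.

```python
from collections import defaultdict

def map_reduce_join(left, right):
    """Reduce-side equi-join sketch over (key, value) records.
    On a real cluster the grouping below is the distributed shuffle;
    each reducer then handles a disjoint subset of the keys."""
    # map + shuffle: group records by join key, keeping origins apart
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    # reduce: pair every left value with every right value per key
    return [(key, l, r)
            for key, (ls, rs) in groups.items()
            for l in ls for r in rs]
```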
This paper examines the efficiency of different load granularities in ray tracing. Benchmarking of image rendering using a ray tracing algorithm under different load balancing scenarios is presented. An open source Monte Carlo ray tracer was modified to enable measurement of individual thread execution times. It is shown that, with specific load balancing scenarios, individual threads' idle time can be reduced and thus overall program execution time improved. The presented results are analyzed and future work is suggested.
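The load-granularity trade-off described above can be sketched generically: the image is split into row chunks, and the chunk size is the granularity knob. This is not the paper's modified ray tracer; `trace_pixel` is a hypothetical per-pixel callback standing in for the actual Monte Carlo shading.

```python
from concurrent.futures import ThreadPoolExecutor

def render_rows(rows, trace_pixel, width):
    """Worker: trace one chunk of image rows (trace_pixel is hypothetical)."""
    return [[trace_pixel(x, y) for x in range(width)] for y in rows]

def render(width, height, trace_pixel, chunk, workers=4):
    """Split the image into row chunks of the given size. Smaller chunks
    reduce thread idle time on scenes with uneven per-pixel cost, at the
    price of more scheduling overhead -- the trade-off being benchmarked."""
    chunks = [range(y, min(y + chunk, height))
              for y in range(0, height, chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda rows: render_rows(rows, trace_pixel, width),
                         chunks)
    return [row for part in parts for row in part]   # reassemble in order
```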
This paper proposes a parallel implementation of the Ant Colony System (ACS) algorithm for automated combinational circuit design. Ant Colony System is one of the most popular and widely used Ant Colony Optimization (ACO) algorithms, and one of the most widely used heuristic algorithms in general. As digital logic circuits become more complex, efficient circuit design becomes a priority and the use of heuristic methods is unavoidable. Unfortunately, the optimization problems have become so large that even the most powerful heuristic algorithms cannot solve them on a single CPU. To be able to tackle the problem, a parallel version of ACS is needed, and this paper presents a CUDA (Compute Unified Device Architecture) C implementation.
Vertex coloring is a special case of the graph coloring problem and is of great importance in many applications. Vertex coloring assigns colors to the vertices of a graph using a minimal number of colors (k), so that adjacent vertices have different colors. The paper presents a hybrid implementation of the Simulated Annealing algorithm for k-coloring the vertices of a graph. The programming has been done using the CUDA toolkit. To determine the speedup achieved by parallelization, a sequential implementation of the problem has been used as a baseline. The results of the CUDA-based program and the sequential implementation are analyzed. This hybrid implementation shows significant improvement and can serve as a basis for future work.
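As a point of reference for the approach described above, here is a minimal sequential sketch of Simulated Annealing for k-coloring in Python; the cooling schedule, move operator and all parameters are illustrative assumptions, not the paper's settings. In a CUDA version, the conflict count over the edge list is the natural candidate for parallel reduction.

```python
import math
import random

def sa_coloring(n, edges, k, steps=20000, t0=2.0, cooling=0.9995, seed=0):
    """Simulated annealing for k-coloring: minimise the number of
    conflicting edges, accepting worse moves with Boltzmann probability.
    Returns (colors, remaining_conflicts); 0 means a proper coloring."""
    rng = random.Random(seed)
    colors = [rng.randrange(k) for _ in range(n)]
    conflicts = lambda: sum(colors[u] == colors[v] for u, v in edges)
    cost, t = conflicts(), t0
    for _ in range(steps):
        if cost == 0:
            break                        # proper k-coloring found
        v, c = rng.randrange(n), rng.randrange(k)
        old, colors[v] = colors[v], c    # recolor one vertex
        new_cost = conflicts()           # parallel reduction on the GPU
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / t):
            cost = new_cost              # accept the move
        else:
            colors[v] = old              # reject: restore previous color
        t *= cooling                     # geometric cooling (assumption)
    return colors, cost
```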
Finding efficient vehicle routes is an important logistics problem which has been studied for several decades. Metaheuristic algorithms offer some solutions to that problem. This paper deals with a GPU implementation of the ant colony optimization (ACO) algorithm, which can be used to find the best vehicle route between designated points. The algorithm is applied to finding the shortest path in several directed graphs. It is embarrassingly parallel, since each ant constructs a candidate solution independently. Results of the sequential and parallelized implementations of the algorithm are presented. A discussion of implementing ACO using OpenMP and CUDA provides a basis for analyzing the different results achieved on those two platforms.
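The ant-per-worker structure described above can be sketched sequentially; the following Python sketch of ACO for shortest paths uses illustrative parameter values and a simple pheromone rule, not the paper's exact configuration. Each iteration of the inner ant loop is independent, which is what maps one ant to one OpenMP thread or CUDA thread; only the pheromone update touches shared state.

```python
import random

def aco_shortest_path(graph, src, dst, ants=20, iters=50, rho=0.5, seed=0):
    """ACO sketch for shortest paths in a weighted digraph {u: {v: w}}.
    Ants build paths independently (the embarrassingly parallel part);
    evaporation and deposit on the pheromone table tau are shared."""
    rng = random.Random(seed)
    tau = {(u, v): 1.0 for u in graph for v in graph[u]}
    best, best_len = None, float('inf')
    for _ in range(iters):
        tours = []
        for _ in range(ants):                      # one ant = one worker
            node, path, visited = src, [src], {src}
            while node != dst:
                choices = [v for v in graph[node] if v not in visited]
                if not choices:
                    path = None                    # dead end, discard ant
                    break
                weights = [tau[(node, v)] / graph[node][v] for v in choices]
                node = rng.choices(choices, weights)[0]
                path.append(node)
                visited.add(node)
            if path:
                length = sum(graph[a][b] for a, b in zip(path, path[1:]))
                tours.append((path, length))
                if length < best_len:
                    best, best_len = path, length
        for edge in tau:                           # evaporation
            tau[edge] *= (1 - rho)
        for path, length in tours:                 # deposit: shorter = more
            for a, b in zip(path, path[1:]):
                tau[(a, b)] += 1.0 / length
    return best, best_len
```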
Ant colony optimization (ACO) is one of the algorithms used for distributed control and optimization. Compared to static methods, these algorithms are more flexible and robust in dynamic environments such as Internet traffic and standard telephony. These problems belong to the class of hard problems because of the huge space of possible solutions that must be searched in reasonable time. The traveling salesman problem (TSP) is one such hard problem. Since solving these problems takes a long time, this paper presents an attempt to decrease execution time through parallelization on multicore processors. OpenMP was used as the main parallelization tool, and a certain speedup was achieved.