Academia.eduAcademia.edu

Enhancing virtual switching system of on-chip networks

2015

the huge datacenters in the world used in the social networks, researching and computing centers, and storage devices have faced fundamental problem reflected in the need for high throughput to switch between the large numbers of virtual machines installed in these datacenters. The increased use of cloud applications will increase the need for greater numbers of virtual machines. In order to meet the requirements of this increase, it must be increased the hardware capabilities in the datacenters, through increasing the number of cores in the servers. Therefore, there is a need to provide network architecture that can connect all of these cores and chips very quickly. Consequently, it is necessary to enhance the virtual switching system to achieve the maximum integration and efficiency between the network segments, and analysis the scheduling and virtual queuing algorithms depending on the decisions from the experiences implemented using high-level source code and hardware descriptio...

International Journal of Scientific & Engineering Research, Volume 6, Issue 9, September-2015 ISSN 2229-5518 1888 Enhancing virtual switching system of on-chip networks Rony Kassam, Talal Al-Aateky, Radwan Dandah Abstract— the huge datacenters in the world - used in the social networks, researching and computing centers, and storage devices have faced fundamental problem reflected in the need for high throughput to switch between the large numbers of virtual machines installed in these datacenters. The increased use of cloud applications will increase the need for greater numbers of virtual machines. In order to meet the requirements of this increase, it must be increased the hardware capabilities in the datacenters, through increasing the number of cores in the servers. Therefore, there is a need to provide network architecture that can connect all of these cores and chips very quickly. Consequently, it is necessary to enhance the virtual switching system to achieve the maximum integration and efficiency between the network segments, and analysis the scheduling and virtual queuing algorithms depending on the decisions from the experiences implemented using high-level source code and hardware description scheme with Xilinx platform. Index Terms— Datacenters, Hardware description scheme, Network on Chip, Network architecture, Virtual switching system, MPSoC, 3 Xilinx, SDF , MAMPSx. ——————————  —————————— 1 INTRODUCTION T he Computer architecture of the datacenters consists of multiple processors and memories connected through interconnection networks, which have multiple schemes like mesh, torus, tree, and others. These schemes have built physically and logically on the chip, through advanced technologies such as NoC (Network on Chip). By using this network, it can balance the load of the computing and the data exchange among the installed processing elements in the datacenter. Especially with the increasing of the number of these elements, and the need to achieve the accurate level of the effective work of the datacenter, so there is not any additional processing elements consumption if there is no need. In the case of defining the requirements through the application installed in the datacenter, there will be proper mapping between the application and hardware architecture. It will be effective mapping between the VMs (virtual machines), the NoC routers, and processing elements, which will be responsible for operating these machines, switching, and routing among them, taking into the consideration throughput and power consumption. In this research, enhancing and optimization have performed on NoC to achieve the maximum integration and efficiency between the network segments, and analysis the scheduling and virtual queuing algorithms depending on the decisions from the experiences implemented using C source code and hardware description scheme with Xilinx platform. Research steps have been set to find the correct practical environment to carry out scenarios that give the desired results. The application and load installed on the environment were descripted by using the resources from Netmap[1], in order to test the application and identify the source file that must be relied upon. It was also analyzed the link between VMs in the virtual switching system, by implementing nSwitch[2], and then come with the necessary to use MPSoC (Multiprocessor System on Chip) architecture, in order to reach a valid and workable results. Then, a theoretical and practical evaluation has been studied on routers, tiles, and network schemes of the NoC, to choose the description scheme, simulator, and development tool in this research [3]. It has been identified SDF3 (Synchronous Data Flow For Free) as a description scheme, MAMPSx (Multi-Application MultiProcessor Synthesis) tool to generate the mapped MPSoC architecture, Xilinx platform in order to get the results, and Matlab to build the diagrams. IJSER ———————————————— • Eng. Rony Kassam – MSc Student – Department of Computer Systems and Networks – Faculty of Information Engineering – Tishreen University – Lattakia – Syria, PH-00963988225669. E-mail: [email protected] • Dr. Talal Al-Aateky – Assistant Professor – Department of Computer Systems and Networks – Faculty of Information Engineering – Tishreen University – Lattakia – Syria, PH-00963470858. E-mail: [email protected] • Dr. Radwan Dandah – Professor – Department of Computer Systems and Networks – Faculty of Information Engineering – Tishreen University – Lattakia – Syria, PH-00963944535235. E-mail: [email protected] 2 BACKGROUND AND INITIAL EXPERIMENTS 2.1 Application and load To define the application and the load installed on the practical environment, it has been relied on the Netmap[1], which is Ethernet switch design for VMs, it is provide high-speed connections between these machines depending only on the software. There is a problem faced by the network ports, especially in real systems and hardware, it is the latency of system calls and memory allocations for each packet. That is, because using specific API (Application Programming Interface) for the socket, more specifically the use of “libpcap” package. Therefore, many systems calls and memory operations will execute only to access to the kernel level. The solution used in Netmap[1] greatly facilitate the process, by forming data paths between the wire and the application at the user level; making the application very close to the hardware, this principle is the goal that we seek in this research. During the installation of Netmap on Ubuntu 14.10, according IJSER © 2015 http://www.ijser.org International Journal of Scientific & Engineering Research, Volume 6, Issue 9, September-2015 ISSN 2229-5518 to the installation guide, it appears a problem about the need for the kernel source and header files, and the need for the driver of the network interface. Since, it is necessary to install the source packages, because it will rebuild the driver of the network interface, in order to insert the “netmap.ko” module into the system. After that, a new problem appears when rebuilding the driver, because Netmap supports only specific network interface. Because it is not available, it was set a parameter “no-drivers” to generate virtual network interface. All of that because the need to identify the applications that deal with Netmap, which are very the research. It is chosen packet generator, the source file “pkt-gen.c”, because, this application can send the packets in certain sizes from sender to receiver on the same environment through physical layer. Because, this generator is independent on the processors scheme, but its dependence on the number of existing processors or cores, and also its dependence on the network interface, therefore, it has been modified to deal with the processing elements and routers within the NoC. Taking into consideration the use of appropriate tools in order to coordinate the work of this generator to be as a valid input for the hardware description scheme and the other tools. 2.2 VMs networking technology About the technologies that connect VMs with the virtual switching system, many enhancements of the SR-IOV (Single Root I/O Virtualization) have been done. SR-IOV is unable to support switching between two VMs on the same computer only through software switching. nSwitch is one of These enhancements to support hardware switching between VMs. nSwitch technique was compared with software switching vSwitch, and IEEE 802.1Qbg and 802.1Qbh pSwitch techniques[2]. As a result, using the vSwitch, which implemented by Citrix XenServer 6.5 and OpenVSwitch. Due to the heavy load on the CPU and the I/O queues to transfer between two VMs, the maximum throughput was 744 Mbps, and this value will be affected by increasing the VMs. About the IEEE 802.1Qbg and 802.1Qbh pSwitch technologies, and because the traffic was transferred through the physical network interface, so the throughput will be like the network interface throughput; 1 Gbps, and this value will be affected by the external transfer. The nSwitch technology has exceeded the delay caused by the network interface in the previous case, and therefore the throughput does not suffer from any limitations, only as to the PCIe 3.0 32-bit bandwidth, which is eight GigaTransfer/s. Starting with nSwitch experience in this research, by reforming the network interface to enable two PFs (Physical Function), which are PF0 and PF1, one for each processing unit. Each function can take number of VFs (Virtual Function). Installing this interface using Xilinx Vivado 2014.2, then building the design depending on the operating guide and experience within this platform [5]. It has been used Xilinx Virtex-7 FPGA, then set two PFs and six VFs with PCIe X8 Gen3, also set interface AXI4-stream 256-bit, memory 4KB 32bit, and finally use read and write “Dword” 32bit transactions for memory mapping. During the work, it shows the necessary for a physical testing interface, but because of the difficulty to provide it, it has 1889 been resorting to implement this interface virtually within the “.bin” file. Taking the advantage of Xilinx ability to run specific Linux operating system on the virtual interface, but it did not succeed in this experience, because it works only with some virtualized boards like MPSoC boards. Therefore, it has to do the work on MPSoC architecture to prepare the NoC within. This architecture can load the operating system, and at the same time can be modified to perfectly fit what we want without having to provide PCIe interface, only integrate the work into the board directly. 2.3 NoC architecture This architecture benefited from the OSI (Open Systems Interconnection) model, in order to transfer between the specific components IPs (Intellectual property) [6], which can be processors, memories, and others. The layers of the NoC are NI (Network Interface), PL (Physical link), R (Router), and IP [6]. Where, NI (data-link layer in the OSI) tasks’ are messages encapsulating and caching, PL (physical layer in the OSI) tasks’ are Signals and Phits (Physical Unit) transferring, R (network layer in the OSI) tasks’ are network scheming, routing and switching, and IP (transport layer in the OSI) is responsible for data transferring between the specific components. Within R, there are many technologies, starting from VCTlite [4], which is a new implementation of the VCT (Virtual Cut Through) adapted to multicore and multiprocessors on the chip. It can take advantage of the VCT, which is characterized by supporting broadcast and multicast. However, VCT requires large buffer size. This is not suitable for on-chip applications, but it is suitable for WH (Worm Hole), which deal with small buffer size. Therefore, the target is to reach the best technology that takes features of the two technologies. It is VCTlite, which uses buffer size such as the size of control message, whereas data message must be packetized [4]. IJSER The basic architecture of this router equipped with input buffers, as well as five-stage pipelining: IC (Input Controller), RT (Routing), VA-SA (Virtual channel And Switch Allocator), XB (Crossbar), and LT (Link Traversal) (see Fig. 1). Each router consists of a number of input and output ports, one port communicates with the processing element, while the remaining ports communicate with neighbors routers, as to the specified network scheme [4]. About output units [7], which track the status of the receiving VC, by using a group of registers. "input_vc" records the VCs reserved for this output, "idle" indicates whether the VC has received the flit tail of last packet, and "Credits" records a certain value reflects the possibility of the port to proceed in its function; if it contain the value “1” then can transfer one flit and so on. Each PL allocates a set of VCs in order to support the coherence protocol that is used to coordinate between caches. It has allocated one buffer to each VC with IC. In design view, the buffer is implemented by conventional shift register. Despite the fact that this structure is not effective in terms of power consumption, it does not introduce any additional circuits, which lead to increased router latency. VCs are grouped, therefore any message can use the VCs of the same group, and thus the traffic in different VCs groups cannot be mixed. This IJSER © 2015 http://www.ijser.org International Journal of Scientific & Engineering Research, Volume 6, Issue 9, September-2015 ISSN 2229-5518 constrain is very necessary in order to avoid the case of a deadlock within coherence protocol. 1890 address these challenges. It is design flow, which takes application and architecture scheme and generates MPSoC architecture with hardware and software models suitable for Xilinx. MAMPSx includes the stages, as follow: BONES is a model, depended on an algorithm recursively improving the design and quality time. DSE (Design) is an effective way to check MPSoC design in order to achieve performance and area conditions. GEN (Generate) is a structured approach, which generates multi-processor systems with the hardware infrastructure for a specific application, suitable with Xilinx platform [8]. The target of this research is to prepare an effective laboratory environment, includes general application capable to form and modify network packets. In addition, to link the application through description languages, which map it on a specific hardware environment, and at the same time, it is not limited to a specific development platform [9]. Therefore, according to [1], packet generator has been selected as an application, but this application is dealing directly with the operating system, so it has been modified in order to be deal with the laboratory environment in this research. Fig. 1. NoC router components. The parameters of this router [4], VCs number is 5, link width is 3 bytes, Flit size is 3 bytes, input buffers can store five Flits to achieve short RTT (Round Trip Time) between neighboring routers. This time is affected by propagation and flow processing delay. So small buffer size will appear bubbles in the communication stream [7]. For flow control, it was implemented Stop&Go protocol to control the Flits between neighboring routers. About routing stage on each input port, it has been achieved to support a routing algorithm DOR (Dimension Order Routing) [4]. The VA of the input port determines which VC will compete with other VCs of other input ports, in order to reach the XB. The SA to be achieved for each output port, and use round-robin arbitrator [4]. SA permissions will control for the entire packet, in order to give priority for the VCs that got permission through the switch [7]. To improve performance, VA-SA stage does not depend on Flit-level arbitration, as the case in WH, but they continue to route the Flit from the current message as long as there are Flits ready to send. However, this is a critical stage for the router, because it determines the operating frequency for this router. To reduce power consumption, clock gating is used [7]. By reprogramming “pkt-gen.c”, detailing its components and data streams within, and linking them with the description schemes. Then the files were initially created “sdf3PInoc.opt”, “archgraph-noc.xml”, and “usecase.xml”, which represents the settings required by the tool SDF3. These files are modified for each scenario, in order to be examined gradually. After that, all of the files become an input for MAMPSx tool. IJSER 2.4 Specific MPSoC generation Providing a design to evaluate the throughput constrained applications on the MPSoC architecture [8]. It was merged many schemes and tools to apply the load on Xilinx development platform. It was depended on SDF, which represents the application, so it can be mapped with hardware units, and determines the throughput in the most difficult status. After the application modeling, it was used C language in order to implement the scheme depending on a certain hardware architecture. Using these inputs, it is generated MPSoC architecture designed specifically for the requirements and the schemes of the application. Then take advantage of the MAMPS tool in order to operate this architecture with Xilinx. In order to achieve the best performance, taking into account time to market requirement, it presents MAMPSx to Beginning with the data stream analysis, a consistency of the scheme is analyzed to prevent the deadlock in the application by examining its resources, also by using the following rule: the scheme of “archgraph-noc” is consistent, if the repetition vector is not equal to the null vector. It has been planned MCM (Maximum Cycle Mean) analysis, to determine the throughput in the scheme, in order to organize tokens registering and releasing of the Actors. The scheme is analyzed in terms of repetition vector, in order to calculate all the comparisons between allocated buffers of channels identified in the scheme and the mapped maximum throughput. In order to make the resources of this model far from the unlimited, it is determined the number of edges within the Actors, where this number will be adjusted through the various scenarios. At this stage, in order to reduce the use of resources and provide guarantees on the application throughput in MPSoC system, it is used iterative design flow consisted of four stages [10]. it automatically and frequently does all stages, but the only manual stage is to determine the application scheme “usecase.xml”, NoC scheme “archgraph-noc.xml” within the setting file. As a final step, it is planned to export several formats fit with the display, as well as fit with the input of MAMPSx tool. SDF3 output files are placed into a folder that includes “archgraph-noc.xml”, “usecase.xml”, and the source files that implement the actors of the Packet generator. Then, it is run the tool “mamps_template” on the applica- IJSER © 2015 http://www.ijser.org International Journal of Scientific & Engineering Research, Volume 6, Issue 9, September-2015 ISSN 2229-5518 tion, which was named within the setting files, as “RTRpktgen”. The generated files are “RTRpktgen.tar.gz” and “RTRpktgen.backup.tar.gz”, where they contain the full Xilinx project. In order to get the executable file on the MPSoC, it is compiled through “make” command. Thus, the system has become ready for evaluation within the Xilinx development platform according to various scenarios. At each scenario, it will be back to the beginning of the experiment and make adjustments on the parts of the MAMPSx. This modification has named RTRNoC, which separates the formation of the NoC from the IP types within the different version of Xilinx platform. In addition, it supports Stop&Go flow control protocol. After doing all the modification, it will be started from the beginning of the experiment steps, to analysis and generate. The part that concerns us greatly in the research is the part located in the generated folder “pcores”, which contains a hardware description by VHDL (VHSIC (Very High Speed Integrated Circuit) Hardware Description Language) for the hardware components of the NoC, NI, and PE which is represented by the MicroBlaze processor. 3 EVALUATIONS AND RESULTS Talking about the evaluation scenarios, and identifying the parameters and metrics of the experiments, to reach for the enhancement. It is implemented 218 scenario; each scenario has 40 input parameters as their categories. In the scenario (X for example), it is modified some parameters’ values, according to the results from the scenario (X-1), and so on. The number of modified parameters are 18 parameters, whereas the other parameters remain constant. For system’s metrics that determine the effectiveness and quality of the system are frequency, area occupied by the chip containing the whole system, latency of links, protocols and components consisting the system, power consumed. The input parameters can be classified as following: 1891 creases significantly. For power consumption, it increases significantly even up to the point where getting a bit, the reason is the use of clock isolation, there is not additional tiles participated in the work. Fig. 2. System output related to network scheme and its parameters. The chip area increases in the case of 2D-Mesh and torus linearly with the increase in the number of tiles. The latency decreases at first, but with the increase in area, it increases significantly. Power consumption and latency in torus are less than the case of the 2D-Mesh, the reason is the low maximum distance between tiles, because of the additional links on the edges in the network scheme. Area dramatically increased in the case of the tree, the reason is the design of a multi-level tree. For the latency and power consumption in the tree, the reason is bandwidth increasing in the main branches of the tree. IJSER * Fixed parameters related to the application: “executionTime” consumed by the actor in certain MicroBlaze (or number of the Actors that do the same function). “MemoryElementSize” of the data memory “.data” and code memory “.code”. Tokens number of channels within the application “channelTokensNum”, which expresses the relationship weight between two actors. * Variable parameters related to network scheme: the number of nodes “tilesNum”, and their topology “NetworkTopology”. Certain link delay “tilesConnectionDelay”, work frequency within the network “networkFrequency”, and “linkWidth”. After making scenarios that include variable values of these parameters, the resulting chart, see Fig. 2. We note, with increasing the number of tiles in the network scheme, the frequency decreases, because it is inverse proportionality to the time consuming during the transition. The area increases, but have little increase in the first (more than doubled), and then become a big increase (two and more). With this increase, latency reduced slightly, but after that, it in- * Fixed and variable parameters related to network interface: the fixed includes, identify network interface model “niModel” that communicates with the PE. The variable includes, determine the number of input and output channels of the interface, “nrInputConnection” and “nrOutputConnection”, as well as determine the bandwidth of input and output, “InBandwidth” and “outBandwidth”. These parameters have been introduced into the scenarios, after choosing four schemes from the Fig. 2, these schemes are 2D-Mesh Auto, 2DMesh 325MHz, Tree Auto, and Torus 325MHz. And, it is given series of compound values for the previous parameters respectively (4,4,48,48 - 8,8,96,96 - 16,16,192,192 - 32,32,384,384); the interface bandwidth is proportionate with the number of input and output channels. Notice that, if the number of tiles is “4” as shown in Fig. 3, increasing in frequency, decreasing of latency and stabling in power consumption and area, with the increasing in input and output bandwidth of network interface. However, increased frequency stands at a value close to 600MHz when bandwidth close to 200 Mbps, as well as decreased latency stands at a value close to 30ns. IJSER © 2015 http://www.ijser.org International Journal of Scientific & Engineering Research, Volume 6, Issue 9, September-2015 ISSN 2229-5518 1892 Fig. 5. System output related to network interface parameters, tiles=64 Fig. 3. System output related to network interface parameters, tiles=4 If the number of tiles is “16”, as in Fig. 4, increasing in frequency, decreasing of latency and stabling in power consumption and area, with the increasing in input and output bandwidth of network interface. However, increased frequency stands at a value close to 200MHz when bandwidth close to 200 Mbps, as well as decreased latency stands at a value close to 150ns. These parameters have been introduced into the scenarios, after choosing four schemes from the Fig. 3, these schemes are 2D-Mesh Auto, 2D-Mesh 325MHz, Tree Auto, Torus 325MHz. And, it is given compound values for network interface (8,8,96,96), because the charts in Fig. 3. 4. 5 show that these values are what make the system outputs in case of changing. It has been given multiple values for the parameters of the MicroBlaze and after noticing the results, it selects “wheelsize=1000” and the size “64KB” for the memories in the next scenarios. IJSER Fig. 4. System output related to network interface parameters, tiles=16 When tiles number is “64” as in Fig. 5, increasing in frequency, decreasing of latency and stabling in power consumption and area, with the increasing in input and output bandwidth of network interface. However, increased frequency stands at a value close to 50MHz when bandwidth close to 200 Mbps, as well as decreased latency stands at a value close to 1400ns. * Fixed and variable parameters related to MicroBlaze PE: identify the arbitration type of the scheduler “arbitrationMcroBlz” inside Microbalze. The variable parameters are determining the “wheelsize”, and the size of the data and instruction memories, “dmemSize” and “imemSize”. * Fixed and variable parameters related to the router: The type of coherence protocol “coherencyProType”, the design of input buffers and registers “inpuBufferDesginType” “registerType”, determine the protocol to avoid a deadlock “deadlockAvoidPro”, flow control protocol “flowControlPro”. Routing algorithm “routingAlgo”, the switching mechanism within the router “switchAllocate”, the type of arbitration within the router “routerArbitration”, a mechanism to reduce power consumption “powerSaving”. The variable parameters are the number of ports of the router “routerPorts”, the number of virtual channels for each physical link “VCsNum”, “bufferSize”, “flitSize”, “linkWidth”, and the size of the main packet “packetSize”. The number of router’s ports relies on the network scheme, in the case of the tree, the number is four, while in the 2D-Mesh and the torus, the number is five. To determine the number of VCs, many experiments have been done using different values of VCs (2, 5, and 10). Notice from the results, the use of values (2, 5) more effective. The last four parameters have been given series of compound values for respectively (6,3,3,6 - 15,3,3,1515,5,5,15 - 9,3,3,64). Each of previews compound values is considered as a model of router architecture. The model (15, 3, 3, 15) represents VCTlite [4]. Notice that in [4], it was planned to use 2D-Mesh network scheme only, while this research includes an implementation of VCTlite architecture and protocols and install it in our laboratory environment, in addition to input new parameters IJSER © 2015 http://www.ijser.org International Journal of Scientific & Engineering Research, Volume 6, Issue 9, September-2015 ISSN 2229-5518 1893 and schemes. The other models are contributions of the research after doing many experiments. If the number of tiles is “4” as in Fig. 6, notice that the model RTR155515 that implemented in tree scheme and two VCs achieves an improvement in frequency while maintaining the area and latency, but with a small increasing in power consumption. In the three models, which implemented in tours scheme with frequency lock and five VCs, notice improving the latency and small increasing in power consumption. Fig. 7. System output related to PE and router parameters, tiles=16 IJSER Fig. 6. System output related to PE and router parameters, tiles=4 Note if the number of tiles is “16” as in Fig. 7. The three models, which implemented in 2D-Mesh scheme with frequency lock and five VCs, achieve an improvement in latency while maintaining of area and power consumption. However, there is little increase in power consumption only in RTR155515 model. Note if the number of tiles is “64” as in Fig. 9, the model RTR6336 that implemented in tree scheme and two VCs achieves an improvement in frequency, area, latency, and power consumption. In addition, the same model implemented in torus scheme with frequency lock with different VCs’ number achieves an improvement in area and power consumption while maintaining the frequency and latency. . Fig. 8. System output related to PE and router parameters, tiles=64 4 CONCLUSION It is prepared laboratory environment, which enables to build any hardware emulator. By using the tools MAMPS, SDF3, and RTRNoC, we can provide all the necessary sources and descriptions files that reflect certain hardware for analyzing through any development platform, such as Xilinx. In adIJSER © 2015 http://www.ijser.org International Journal of Scientific & Engineering Research, Volume 6, Issue 9, September-2015 1894 ISSN 2229-5518 dition, we provide a design and an implementation for an ap- [8] R. Jordans, F. Siyoum, S. Stuijk, A. Kumar, H. Corporaal, An automated flow to map throughput constrained applications to a mpsoc, Proceedplication “RTRpktgen”, which generates packets, and deings of Bringing Theory to Practice: Predictability and Performance in signed to fit MPSoC architecture that supports NoC. Embedded Systems, DATE Workshop PPES. (Pages: 47–58 Year of PubliThrough this environment, it has been working on a wide cation: 2011). range of scenarios, which depend on each other in order to [9] S. Fernando, A. Kumar, H. Corporaal, Mampsx: A design methodology achieve a better evaluation and optimization for the paramefor rapid system-level exploration, synthesis of heterogeneous soc on fpga. In ters of the virtual switching system of on-chip networks. Eindhoven University of Technology and National University of Singapore, 2012. A set of conclusions has been reached. It can improve the frequency in a few tiles network scheme by increasing the [10] A. H. Ghamarian, M. C. W. Geilen, S. Stuijk, T. Basten, B. D. Theelen, M. R. Mousavi, A. J. M. Moonen, M. J. G. Bekooij, Throughput analysis buffer size within the router, being equal to the size of the of synchronous data flow graphs, Proceedings of the Sixth International main packet, being greater than twice of the size of the Flit, Conference on Application of Concurrency to System Design, ACSD, IEEE and using this combination with the tree unlocked frequency Computer Society (Pages: 25–36 Year of Publication: 2006 ISBN: 0-7695and a small number of VCs. For improving the delay, we can 2556-3). use any combination of router parameters with the torus locked frequency. By increasing the number of tiles to average limit, we must go to the 2D-Mesh unlocked frequency with average number of VCs, to achieve improvement on the frequency, latency, and power consumption. By dramatically increasing the number of tiles, it is necessary to reduce the size of the buffer, to be equal to the size of the main packet, also to be equal to twice of the size of the Flit, and to use this combination with the tree unlocked frequency and a small number of VCs. To achieve improvement in area and power consumption also follow the previews structure but with the torus locked frequency. ACKNOWLEDGMENT IJSER This work was supported by Tishreen University, faculty of Information Engineering, department of Computer Systems and Networks Engineering. Also thanks to Dr. Akash Kumar, PhD (TU Eindhoven/NUS); 2009 for online helping. REFERENCES [1] [2] [3] [4] [5] [6] [7] L. Rizzo, M. Landi, Netmap: Memory mapped access to network devices, SIGCOMM Comput. Commun. Rev. Vol. 41, n. 4, pp. 422-423, 2011. J. Bardgett, C. C. Zou, nswitching: Virtual machine aware relay hardware switching to improve intra-nic virtual machine traffic, Proceedings of IEEE International Conference on Communications (Pages: 2700-2705 Year of Publication: 2012 ISBN: 978-1-4577-2052-9). A. B. Achballah, S. B. Saoud, A survey of network-on-chip tools, International Journal of Advanced Computer Science and Applications IJACSA, Vol. abs/1312.2976, pp. 4-9, 2013. A. Roca, J. Flich, F. Silla, J. Duato, Vctlite: Towards an efficient implementation of virtual cut-through switching in on-chip networks, Proceedings of International Conference on High Performance Computing, HiPC (Pages: 1-12 Year of Publication: 2010). V. Surabhi, Designing with sr-iov capability of xilinx virtex-7 pci express gen3 integrated block, Report in Xilinx, Inc, 2013. URL: http://www.xilinx.com/support/documentation/application_notes/xa pp1177-pcie-gen3-sriov.pdf A. B. Achballah, S. B. Saoud, The design of a network-on-chip architecture based on an avionic protocol, International Journal of Advanced Computer Science and Applications IJACSA. Vol. abs/1401.4891, 2014. S. Ma, Z. Wang, Z. Liu, N. D. E. Jerger, Leaving one slot empty: Flit120 bubble flow control for torus cache-coherent nocs, IEEE Trans. Computers. Vol. 64, n. 3, pp. 763–777, 2015. IJSER © 2015 http://www.ijser.org