May 27, 2024


Science It Works

New data science platform accelerates Python queries

Credit: CC0 public domain

Researchers at Brown University and MIT have developed a new data science framework that allows users to process data in the Python programming language. Generally, you do not have to pay the “performance tax” associated with user-friendly languages.

A new framework called Tuplex allows you to: Process data Queries written in Python are up to 90 times faster than industry standard data systems such as Apache Spark and Dask. The research team unveiled the system in a study presented at the premier data processing conference, SIGMOD 2021, making the software freely available to everyone.

“Python is the leading programming language used by data science practitioners,” said Malte Schwarzkopf, an assistant professor of computer science at Brown and one of the developers at Tuplex. “It makes a lot of sense. Python is widely taught in college and is an easy language to get started with. But when it comes to data science, it’s related to Python because the platform can’t handle Python efficiently. There is a huge performance tax. Backend. “

Platforms like Spark Data analysis Distribute tasks across multiple processor cores or machines in a data center.It Parallel processing Users can process huge datasets that suffocate a single computer. Users interact with these platforms by entering their own queries that contain custom logic written as “user-defined functions” or UDFs. The UDF specifies custom logic, such as extracting the number of bedrooms from the text of the real estate list for a query that searches all real estate lists in the United States and selects all real estate lists with three bedrooms. ..

Due to its simplicity, Python is the language of choice for creating UDFs in Data science community. In fact, the Tuplex team cites a recent poll showing that 66{13aab5633489a05526ae1065595c074aeca3e93df6390063fabaebff206207ec} of data platform users use Python as their primary language. The problem is that the analytics platform has a problem handling these bits of Python code efficiently.

The data platform is written in a high-level computer language that is compiled before execution. A compiler is a program that takes a computer language and converts it into machine code. This is a set of instructions that a computer processor can execute quickly. However, Python is not precompiled. Instead, the computer interprets the Python code line by line while the program is running, which can significantly reduce performance.

“These frameworks need to break out of efficient execution of compiled code and jump to the Python interpreter to run the Python UDF,” Schwarzkopf said. “The process can be 100 times less efficient than running the compiled code.”

If you can compile your Python code, it will be much faster. However, researchers have been trying to develop a generic Python compiler for years with little success. Therefore, instead of writing a generic Python compiler, researchers designed Tuplex to compile a highly specialized program for specific queries and general input data. Rare input data, which makes up a small part of the instance, is isolated and referenced by the interpreter.

“This process is called dual-case processing because it divides the data into two cases,” said Leonhard Spiegelberg, co-author of the study that describes Tuplex. “This simplifies compilation issues because we only have to worry about a single data type and a set of common assumptions. In this way, high productivity and fast execution speed 2 You can take advantage of the strengths of one world. “

Also, the run-time benefits can be significant.

“Our research shows that we can reduce the wait time for output to 10 minutes,” says Schwarzkopf. “So it’s really a substantial improvement in performance.”

In addition to speeding things up, researchers say Tuplex also has an innovative way to handle anomalous data. Large datasets are often cluttered with corrupted records and data fields that do not follow the rules. For example, in real estate data, the number of bedrooms is either a number or a spelled out number. Such inconsistencies may be sufficient to crash some data platforms. However, Tuplex extracts those anomalies and puts them aside to avoid crashes. When the program is run, the user has the option to repair those anomalies.

“We believe this can have a significant impact on productivity for data scientists,” says Schwarzkopf. “It’s really a big problem that you don’t have to run out to drink a cup of coffee while waiting for output, and don’t run the program for an hour just to crash before it’s done.”

AI for code facilitates collaborative and open scientific discoveries

For more information:
paper:… 21-sigmod-tuplex.pdf


Provided by
Brown University

Quote: The new data science platform is a Python query retrieved from on July 6, 2021 (2021, July 1). ) Speed ​​up

This document is subject to copyright. No part may be reproduced without written permission, except for fair transactions for personal investigation or research purposes. The content is provided for informational purposes only.