Problems writing data #612

Open
tschm opened this issue Jul 9, 2018 · 3 comments

Comments


tschm commented Jul 9, 2018

I am a bit disappointed and surprised by the write speed I observe with my InfluxDB. However, I am probably doing something very wrong and would appreciate any pointers in the right direction.

I am using the latest Docker image.

I have a DataFrame with 10 columns and 10000 rows, i.e. 100,000 data points. Writing them to the database took about 30 seconds! I am running Ubuntu on a machine with 16 GB of RAM and an SSD.

        import numpy as np
        import pandas as pd

        x = pd.date_range(start=pd.Timestamp("2010-01-01"), periods=10000, freq="D")
        y = pd.DataFrame(index=x, data=np.random.randn(10000, 10))
        print(y)

I write column by column (all into the same measurement, using the column name as a tag), e.g.

        for key, data in y.items():
            # every column represents a different tag
            self.client.series_upsert(ts=data, tags={"Global tag": "Peter Maffay", "name": key}, field="random", measurement="measure")
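
A single bulk write of the whole frame would avoid the per-column round trips. Here is a rough sketch of that idea, assuming a DataFrameClient connected to the same database (the measurement, tag, and field names mirror the snippet above; the connection details are hypothetical):

        import numpy as np
        import pandas as pd
        from influxdb import DataFrameClient

        # hypothetical connection details
        client = DataFrameClient(host="localhost", port=8086, database="mydb")

        # reshape wide (time x column) into long rows: one row per (time, column) pair
        long_form = y.stack().reset_index()
        long_form.columns = ["time", "name", "random"]
        long_form["name"] = long_form["name"].astype(str)  # tag values must be strings
        long_form = long_form.set_index("time")

        # one call writes all 100,000 points; the client batches internally
        client.write_points(long_form, "measure",
                            tags={"Global tag": "Peter Maffay"},  # tags shared by every row
                            tag_columns=["name"],                 # per-row tag from the column name
                            field_columns=["random"],
                            batch_size=10000, protocol="line")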

I have tried other methods (e.g. the SeriesHelper), but the speed has never really picked up.
Here's the decisive fragment from my own client (I inherit from your standard client):

    def series_upsert(self, ts, tags, field, measurement):
        if len(ts) > 0:
            json_body = [{'measurement': measurement, 'time': t, 'fields': {field: float(x)}}
                         for t, x in ts.items()]
            self.influxclient.write_points(json_body, time_precision="s", tags=tags, batch_size=10000)
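
For comparison, a variant of the fragment above that builds line-protocol strings instead of per-point JSON dicts. This is a sketch, assuming the wrapped client is influxdb-python's InfluxDBClient; as far as I can tell the tags= argument only applies to the JSON protocol, so the tags are embedded in each line:

    def series_upsert_lines(self, ts, tags, field, measurement):
        # hypothetical alternative to series_upsert: emit line protocol directly
        if len(ts) > 0:
            # spaces in tag keys/values must be escaped in line protocol
            tag_str = ",".join(
                "{}={}".format(str(k).replace(" ", "\\ "), str(v).replace(" ", "\\ "))
                for k, v in tags.items())
            lines = [
                "{},{} {}={} {:d}".format(measurement, tag_str, field,
                                          float(x), int(t.timestamp()))
                for t, x in ts.items()]
            # with protocol='line', write_points sends the strings as-is
            self.influxclient.write_points(lines, time_precision="s",
                                           batch_size=10000, protocol="line")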

tschm commented Jul 9, 2018


epa095 commented Jul 18, 2018

5.2.0 was released a day after you reported this; try that one. Also try 5.0.0 and see whether it is equally slow.
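
To pin and verify which client version is actually in use (a quick sketch, assuming a pip-managed environment):

    # pick a version to compare, e.g.:
    #   pip install influxdb==5.0.0
    #   pip install influxdb==5.1.0
    import influxdb
    print(influxdb.__version__)  # confirm what the interpreter actually imports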


flikka commented Jul 19, 2018

We also see massive performance problems when writing large-ish data frames with version 5.1.0. One example (see below) takes 5 seconds on my local laptop to write a million lines (one column) with version 5.0.0, but with version 5.1.0 it takes minutes (in fact I gave up before it finished). On the current master the performance is back to the 5.0.0 level. The current latest version on PyPI, 5.2.0, is broken (see #616), so I'll use 5.0.0 or master for now. I guess the fix in #617 will reach PyPI soon, though.

I have no idea what the cause is; perhaps something is wrong with our InfluxDB setup itself...

The example below produces this output on master, and similar numbers on 5.0.0:

    InfluxDB python version: 5.0.0
    Number of points: 10000
    Time: 0.20903187998919748
    Number of points: 100000
    Time: 0.6698060079943389
    Number of points: 1000000
    Time: 5.531810616987059

But for version 5.1.0 this is how it looks:

    InfluxDB python version: 5.1.0
    Number of points: 10000
    Time: 3.2126710579905193
    Number of points: 100000
    Time: 34.53647285097395
    Number of points: 1000000
    <Aborted here, didn't bother to wait>

Example code used (tried against both a 1.5.x and a 1.6.0 InfluxDB instance):

import influxdb
from influxdb import DataFrameClient
import numpy as np
import pandas as pd
from timeit import default_timer as timer

host = '0.0.0.0'
port = 8086
user = 'admin'
password = ''
db_name = 'test'

def simple_test(num_points, batch_size):
    client = DataFrameClient(host, port, user, password, db_name)
    x = pd.date_range(start=pd.Timestamp("2010-01-01"), periods=num_points, freq="S")
    y = pd.DataFrame(index=x, data=np.random.randn(num_points, 1))

    client.create_database(db_name)
    client.write_points(y, "perf-test", batch_size=batch_size, protocol='line')
    client.drop_database(db_name)

if __name__ == '__main__':
    print("InfluxDB python version: {}".format(influxdb.__version__))
    num_points_list = [10000, 100000, 1000000]
    batch_size = 10000
    for num_points in num_points_list:
        print("Number of points: {}".format(num_points))
        start = timer()
        simple_test(num_points, batch_size)
        end = timer()
        elapsed = end - start
        print("Time: {}".format(elapsed))
