Reading data from CSV into dataframe with multiple delimiters efficiently
I have an awkward CSV file which has multiple delimiters: the delimiter for the non-numeric part is ',', for the numeric part ';'. I want to construct a dataframe only out of the numeric part as efficiently as possible.



I have made 5 attempts: among them, utilising the converters argument of pd.read_csv, using regex with engine='python', using str.replace. They are all more than 2x slower than reading the entire CSV file with no conversions. This is prohibitively slow for my use case.



I understand the comparison isn't like-for-like, but it does demonstrate that the overall poor performance is not driven by I/O. Is there a more efficient way to read the data into a numeric Pandas dataframe, or the equivalent NumPy array?



The string below can be used for benchmarking.



# Python 3.7.0, Pandas 0.23.4

from io import StringIO
import pandas as pd
import numpy as np  # needed for the np.array_equal check below
import csv

# strings in first 3 columns are of arbitrary length
x = '''ABCD,EFGH,IJKL,34.23;562.45;213.5432
MNOP,QRST,UVWX,56.23;63.45;625.234
'''*10**6

def csv_reader_1(x):
    df = pd.read_csv(x, usecols=[3], header=None, delimiter=',',
                     converters={3: lambda x: x.split(';')})
    return df.join(pd.DataFrame(df.pop(3).values.tolist(), dtype=float))

def csv_reader_2(x):
    df = pd.read_csv(x, header=None, delimiter=';',
                     converters={0: lambda x: x.rsplit(',')[-1]}, dtype=float)
    return df.astype(float)

def csv_reader_3(x):
    return pd.read_csv(x, usecols=[3, 4, 5], header=None, sep=',|;', engine='python')

def csv_reader_4(x):
    with x as fin:
        reader = csv.reader(fin, delimiter=',')
        L = [i[-1].split(';') for i in reader]
    return pd.DataFrame(L, dtype=float)

def csv_reader_5(x):
    with x as fin:
        return pd.read_csv(StringIO(fin.getvalue().replace(';', ',')),
                           sep=',', header=None, usecols=[3, 4, 5])


Checks:



res1 = csv_reader_1(StringIO(x))
res2 = csv_reader_2(StringIO(x))
res3 = csv_reader_3(StringIO(x))
res4 = csv_reader_4(StringIO(x))
res5 = csv_reader_5(StringIO(x))

print(res1.head(3))
#        0       1         2
# 0  34.23  562.45  213.5432
# 1  56.23   63.45  625.2340
# 2  34.23  562.45  213.5432

assert all(np.array_equal(res1.values, i.values) for i in (res2, res3, res4, res5))


Benchmarking results:



%timeit csv_reader_1(StringIO(x))  # 5.31 s per loop
%timeit csv_reader_2(StringIO(x))  # 6.69 s per loop
%timeit csv_reader_3(StringIO(x))  # 18.6 s per loop
%timeit csv_reader_4(StringIO(x))  # 5.68 s per loop
%timeit csv_reader_5(StringIO(x))  # 7.01 s per loop
%timeit pd.read_csv(StringIO(x))   # 1.65 s per loop


Update



I'm open to using command-line tools as a last resort; to that end, I have included such an answer. My hope is that there is a pure-Python or Pandas solution with comparable efficiency.

python pandas performance csv dataframe

asked Jan 4 at 18:07 by jpp, edited Jan 13 at 15:50

  • Have you considered using regular expressions for multiple delimiters? For example: link 1, link 2. Not sure if it'd be any faster.
    – chris, Jan 4 at 18:39

  • @chris, Now I have (see edit); regex with engine='python' is ~8x slower than pd.read_csv with no converters.
    – jpp, Jan 4 at 18:45

  • @jpp, what if you use engine='c', as the documentation suggests: "The C engine is faster while the Python engine is currently more feature-complete."
    – pygo, Jan 4 at 19:28

  • @pygo, the docs explain that regex separators only work with engine='python'. No go.
    – jpp, Jan 4 at 20:12

  • What is stopping you from just replacing all the ';' with ',' in the CSV file and importing it normally?
    – DrMcCleod, Jan 7 at 12:37
6 Answers
Use a command-line tool



By far the most efficient solution I've found is to use a specialist command-line tool to replace ";" with "," and then read into Pandas. Pandas or pure Python solutions do not come close in terms of efficiency.



Essentially, using Cython or a tool written in C / C++ is likely to outperform Python-level manipulations.



For example, using Find And Replace Text:



import os

os.chdir(r'C:\temp')                       # change to the directory containing the file
os.system('fart.exe -c file.csv ";" ","')  # run FART, replacing ';' with ','

df = pd.read_csv('file.csv', usecols=[3, 4, 5], header=None)  # read converted file into Pandas

answered Jan 7 at 13:23 by jpp

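A sketch of the stream-based variant suggested in the comments below, assuming a Unix-like system with sed on PATH (this is my illustration, not part of the original answer): the substituted output is piped straight into Pandas, so the file is never rewritten on disk, and the exit status is checked, unlike with os.system.

import subprocess
import pandas as pd

# Hedged sketch: stream sed's output into Pandas without rewriting file.csv
proc = subprocess.Popen(['sed', 's/;/,/g', 'file.csv'], stdout=subprocess.PIPE)
df = pd.read_csv(proc.stdout, usecols=[3, 4, 5], header=None)
proc.stdout.close()
if proc.wait() != 0:  # unlike os.system, the exit code is checked explicitly
    raise RuntimeError('sed failed')
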
  • Better yet, use streams instead of overwriting the file.
    – ivan_pozdeev, Jan 13 at 3:56

  • Btw use subprocess.check_call instead of os.system because it checks the exit code.
    – ivan_pozdeev, Jan 13 at 3:58

  • @ivan_pozdeev, Can you expand on how to use streams instead of overwriting the file? Is there an example of this elsewhere on SO?
    – jpp, Jan 13 at 15:53

  • @jpp I wonder if you saw my answer before accepting your own.
    – Leon, Jan 15 at 11:39

  • @Leon, Whoops, 8 answers and I missed yours! Will test it..
    – jpp, Jan 15 at 11:45

How about using a generator to do the replacement, and combining it with an appropriate decorator to get a file-like object suitable for pandas?



import io
import pandas as pd

# strings in first 3 columns are of arbitrary length
x = '''ABCD,EFGH,IJKL,34.23;562.45;213.5432
MNOP,QRST,UVWX,56.23;63.45;625.234
'''*10**6

def iterstream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """
    http://stackoverflow.com/a/20260030/190597 (Mechanical snail)
    Lets you use an iterable (e.g. a generator) that yields bytestrings as a
    read-only input stream.

    The stream implements Python 3's newer I/O API (available in Python 2's io
    module).

    For efficiency, the stream is buffered.
    """
    class IterStream(io.RawIOBase):
        def __init__(self):
            self.leftover = None
        def readable(self):
            return True
        def readinto(self, b):
            try:
                l = len(b)  # We're supposed to return at most this much
                chunk = self.leftover or next(iterable)
                output, self.leftover = chunk[:l], chunk[l:]
                b[:len(output)] = output
                return len(output)
            except StopIteration:
                return 0    # indicate EOF
    return io.BufferedReader(IterStream(), buffer_size=buffer_size)

def replacementgenerator(haystack, needle, replace):
    for s in haystack:
        if s == needle:
            yield str.encode(replace)
        else:
            yield str.encode(s)

csv = pd.read_csv(iterstream(replacementgenerator(x, ";", ",")), usecols=[3, 4, 5])


Note that we convert the string (or its constituent characters) to bytes through str.encode, as this is required for use by Pandas.



This approach is functionally identical to the answer by Daniele, except that we replace values on the fly, as they are requested, instead of all in one go.

answered Jan 10 at 20:55 by Max Snijders

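One likely reason for the slow timing reported in the comments below: iterating over a Python string yields single characters, so readinto is fed one byte at a time. A hedged variant (my own sketch, not from the original answer) that slices the string into large chunks instead:

def chunked_replacement_generator(haystack, needle, replace, chunk_size=1 << 16):
    # Yield large encoded chunks rather than single characters, so each
    # readinto() call receives a sizable buffer. Safe here because the
    # needle is a single character and cannot straddle a chunk boundary.
    for start in range(0, len(haystack), chunk_size):
        yield haystack[start:start + chunk_size].replace(needle, replace).encode()

df = pd.read_csv(iterstream(chunked_replacement_generator(x, ';', ',')),
                 usecols=[3, 4, 5], header=None)
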
  • Any benchmarks?
    – ivan_pozdeev, Jan 13 at 4:02

  • Nice idea, but this one clocks in at 2min 1s!
    – jpp, Jan 13 at 19:35

If this is an option, substituting the character ; with , in the string is faster.
I have written the string x to a file test.dat.



def csv_reader_4(x):
    with open(x, 'r') as f:
        a = f.read()
    return pd.read_csv(StringIO(unicode(a.replace(';', ','))), usecols=[3, 4, 5])


The unicode() function was necessary to avoid a TypeError in Python 2.
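
Under Python 3 (the version used for the question's benchmarks), str is already unicode, so a hedged equivalent sketch simply drops the wrapper:

def csv_reader_4_py3(x):
    # Python 3 variant: no unicode() wrapper needed
    with open(x, 'r') as f:
        a = f.read()
    return pd.read_csv(StringIO(a.replace(';', ',')), usecols=[3, 4, 5])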



Benchmarking:



%timeit csv_reader_2('test.dat')  # 1.6 s per loop
%timeit csv_reader_4('test.dat')  # 1.2 s per loop

answered Jan 7 at 10:59 by Daniele

  • This is causing MemoryError for me, presumably because it requires reading everything in, effectively, twice? Once into a and then into pd.DataFrame.
    – jpp, Jan 7 at 11:07

  • I think a.replace creates a copy. Unfortunately, I do not see an easy way to avoid this without using more sophisticated tools like Cython.
    – Daniele, Jan 7 at 12:32

A very fast one: 3.51 s is the result. Simply make csv_reader_4 the below; it converts the StringIO to str, replaces ';' with ',', and reads the dataframe with sep=',':

def csv_reader_4(x):
    with x as fin:
        reader = pd.read_csv(StringIO(fin.getvalue().replace(';', ',')),
                             sep=',', header=None)
    return reader

The benchmark:

%timeit csv_reader_4(StringIO(x))  # 3.51 s per loop

answered Jan 13 at 6:25 by U9-Forward

  • Have you tested relative performance on consistent hardware / setup? I see this as one of the slower solutions; I have updated my question with benchmarking for this.
    – jpp, Jan 13 at 15:51

  • @jpp Ugh, your timings are different from mine; I am on Windows.
    – U9-Forward, Jan 14 at 3:00

  • @U9-Forward I improved your approach by doing the replacement during read() operations: stackoverflow.com/a/54176770/6394138
    – Leon, Jan 14 at 6:47

Python has powerful features for manipulating data, but don't expect performance from pure Python. When performance is needed, C and C++ are your friends: any fast library in Python is written in C/C++. It is quite easy to use C/C++ code from Python; have a look at the SWIG utility (http://www.swig.org/tutorial.html). You can write a C++ class containing fast utilities and use it in your Python code when needed.
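
That said, some byte-level operations in CPython already execute in C. A hedged sketch (my addition, not part of this answer) that keeps the substitution entirely in C code by working on bytes, in the same spirit as Daniele's answer but without decoding to str first:

import io
import pandas as pd

def csv_reader_bytes(path):
    # bytes.replace runs in C, so the substitution is not Python-level work;
    # only one extra in-memory copy of the data is made
    with open(path, 'rb') as f:
        data = f.read().replace(b';', b',')
    return pd.read_csv(io.BytesIO(data), usecols=[3, 4, 5], header=None)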

answered Jan 14 at 7:28 by sancelot

In my environment (Ubuntu 16.04, 4GB RAM, Python 3.5.2) the fastest method was (the prototypical¹) csv_reader_5 (taken from U9-Forward's answer), which ran only less than 25% slower than reading the entire CSV file with no conversions. I improved that approach by implementing a filter/wrapper that replaces the char in the read() method:

class SingleCharReplacingFilter:

    def __init__(self, reader, oldchar, newchar):
        def proxy(obj, attr):
            a = getattr(obj, attr)
            if attr in ('read',):
                def f(*args):
                    return a(*args).replace(oldchar, newchar)
                return f
            else:
                return a

        for a in dir(reader):
            if not a.startswith("_") or a == '__iter__':
                setattr(self, a, proxy(reader, a))

def csv_reader_6(x):
    with x as fin:
        return pd.read_csv(SingleCharReplacingFilter(fin, ";", ","),
                           sep=',', header=None, usecols=[3, 4, 5])

The result is a little better performance compared to reading the entire CSV file with no conversions:

In [3]: %timeit pd.read_csv(StringIO(x))
605 ms ± 3.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit csv_reader_5(StringIO(x))
733 ms ± 3.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: %timeit csv_reader_6(StringIO(x))
568 ms ± 2.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

¹ I call it prototypical because it assumes that the input stream is of StringIO type (since it calls .getvalue() on it).
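
Unlike csv_reader_5, the wrapper above never calls .getvalue(), so a hedged usage sketch with a file on disk should also work ('test.dat' is the hypothetical file name reused from Daniele's answer; this is my untested sketch, not part of the original answer):

# Assumed usage with a real file rather than StringIO
with open('test.dat') as fin:
    df = pd.read_csv(SingleCharReplacingFilter(fin, ';', ','),
                     sep=',', header=None, usecols=[3, 4, 5])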

answered by Leon

  • On Python 3.7, Pandas 0.23.4, I'm getting ValueError: Invalid file path or buffer object type: <class '__main__.SingleCharReplacingFilter'> on the pd.read_csv line. Any ideas?
    – jpp, Jan 15 at 23:26

  • @jpp Pandas 0.23.4 has an extra requirement for an object to be considered file-like: it must have an __iter__ method. I updated my answer to reflect that.
    – Leon, Jan 16 at 8:52

  • Sorry for the delay. I timed this; it took 1s longer than csv_reader_1 on my setup (4.28s for _1 vs 5.28s for _6). I'm using input x = """...""" * 10**6 as per my question, Python 3.7.0, Pandas 0.23.4, Windows. I understand this is going to be platform / setup dependent.
    – jpp, Jan 20 at 4:47














    Your Answer






    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "1"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54044022%2freading-data-from-csv-into-dataframe-with-multiple-delimiters-efficiently%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    6 Answers
    6






    active

    oldest

    votes








    6 Answers
    6






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    6














    Use a command-line tool



    By far the most efficient solution I've found is to use a specialist command-line tool to replace ";" with "," and then read into Pandas. Pandas or pure Python solutions do not come close in terms of efficiency.



    Essentially, using CPython or a tool written in C / C++ is likely to outperform Python-level manipulations.



    For example, using Find And Replace Text:



    import os

    os.chdir(r'C:temp') # change directory location
    os.system('fart.exe -c file.csv ";" ","') # run FART with character to replace

    df = pd.read_csv('file.csv', usecols=[3, 4, 5], header=None) # read file into Pandas





    share|improve this answer
























    • Better yet, use streams instead of overwriting the file.

      – ivan_pozdeev
      Jan 13 at 3:56











    • Btw use subprocess.check_call instead of os.system because it checks the exit code.

      – ivan_pozdeev
      Jan 13 at 3:58











    • @ivan_pozdeev, Can you expand on how to use streams instead of overwriting the file? Is there an example of this elsewhere on SO?

      – jpp
      Jan 13 at 15:53











    • @jpp I wonder if you saw my answer before accepting your own

      – Leon
      Jan 15 at 11:39











    • @Leon, Woops, 8 answers and I missed yours! Will test it..

      – jpp
      Jan 15 at 11:45
















    6














    Use a command-line tool



    By far the most efficient solution I've found is to use a specialist command-line tool to replace ";" with "," and then read into Pandas. Pandas or pure Python solutions do not come close in terms of efficiency.



    Essentially, using CPython or a tool written in C / C++ is likely to outperform Python-level manipulations.



    For example, using Find And Replace Text:



    import os

    os.chdir(r'C:temp') # change directory location
    os.system('fart.exe -c file.csv ";" ","') # run FART with character to replace

    df = pd.read_csv('file.csv', usecols=[3, 4, 5], header=None) # read file into Pandas





    share|improve this answer
























    • Better yet, use streams instead of overwriting the file.

      – ivan_pozdeev
      Jan 13 at 3:56











    • Btw use subprocess.check_call instead of os.system because it checks the exit code.

      – ivan_pozdeev
      Jan 13 at 3:58











    • @ivan_pozdeev, Can you expand on how to use streams instead of overwriting the file? Is there an example of this elsewhere on SO?

      – jpp
      Jan 13 at 15:53











    • @jpp I wonder if you saw my answer before accepting your own

      – Leon
      Jan 15 at 11:39











    • @Leon, Woops, 8 answers and I missed yours! Will test it..

      – jpp
      Jan 15 at 11:45














    6












    6








    6







    Use a command-line tool



    By far the most efficient solution I've found is to use a specialist command-line tool to replace ";" with "," and then read into Pandas. Pandas or pure Python solutions do not come close in terms of efficiency.



    Essentially, using CPython or a tool written in C / C++ is likely to outperform Python-level manipulations.



    For example, using Find And Replace Text:



    import os

    os.chdir(r'C:temp') # change directory location
    os.system('fart.exe -c file.csv ";" ","') # run FART with character to replace

    df = pd.read_csv('file.csv', usecols=[3, 4, 5], header=None) # read file into Pandas





    share|improve this answer













    Use a command-line tool



    By far the most efficient solution I've found is to use a specialist command-line tool to replace ";" with "," and then read into Pandas. Pandas or pure Python solutions do not come close in terms of efficiency.



    Essentially, using CPython or a tool written in C / C++ is likely to outperform Python-level manipulations.



    For example, using Find And Replace Text:



    import os

    os.chdir(r'C:temp') # change directory location
    os.system('fart.exe -c file.csv ";" ","') # run FART with character to replace

    df = pd.read_csv('file.csv', usecols=[3, 4, 5], header=None) # read file into Pandas






    share|improve this answer












    share|improve this answer



    share|improve this answer










    answered Jan 7 at 13:23









    jppjpp

    103k2167117




    103k2167117













    • Better yet, use streams instead of overwriting the file.

      – ivan_pozdeev
      Jan 13 at 3:56











    • Btw use subprocess.check_call instead of os.system because it checks the exit code.

      – ivan_pozdeev
      Jan 13 at 3:58











    • @ivan_pozdeev, Can you expand on how to use streams instead of overwriting the file? Is there an example of this elsewhere on SO?

      – jpp
      Jan 13 at 15:53











    • @jpp I wonder if you saw my answer before accepting your own

      – Leon
      Jan 15 at 11:39











    • @Leon, Woops, 8 answers and I missed yours! Will test it..

      – jpp
      Jan 15 at 11:45



















    • Better yet, use streams instead of overwriting the file.

      – ivan_pozdeev
      Jan 13 at 3:56











    • Btw use subprocess.check_call instead of os.system because it checks the exit code.

      – ivan_pozdeev
      Jan 13 at 3:58











    • @ivan_pozdeev, Can you expand on how to use streams instead of overwriting the file? Is there an example of this elsewhere on SO?

      – jpp
      Jan 13 at 15:53











    • @jpp I wonder if you saw my answer before accepting your own

      – Leon
      Jan 15 at 11:39











    • @Leon, Woops, 8 answers and I missed yours! Will test it..

      – jpp
      Jan 15 at 11:45

















    Better yet, use streams instead of overwriting the file.

    – ivan_pozdeev
    Jan 13 at 3:56





    Better yet, use streams instead of overwriting the file.

    – ivan_pozdeev
    Jan 13 at 3:56













    Btw use subprocess.check_call instead of os.system because it checks the exit code.

    – ivan_pozdeev
    Jan 13 at 3:58





    Btw use subprocess.check_call instead of os.system because it checks the exit code.

    – ivan_pozdeev
    Jan 13 at 3:58













    @ivan_pozdeev, Can you expand on how to use streams instead of overwriting the file? Is there an example of this elsewhere on SO?

    – jpp
    Jan 13 at 15:53





    @ivan_pozdeev, Can you expand on how to use streams instead of overwriting the file? Is there an example of this elsewhere on SO?

    – jpp
    Jan 13 at 15:53













    @jpp I wonder if you saw my answer before accepting your own

    – Leon
    Jan 15 at 11:39





    @jpp I wonder if you saw my answer before accepting your own

    – Leon
    Jan 15 at 11:39













    @Leon, Woops, 8 answers and I missed yours! Will test it..

    – jpp
    Jan 15 at 11:45





    @Leon, Woops, 8 answers and I missed yours! Will test it..

    – jpp
    Jan 15 at 11:45













    2





    +25









    How about using a generator to do the replacement, and combining it with an appropriate decorator to get a file-like object suitable for pandas?



    import io
    import pandas as pd

    # strings in first 3 columns are of arbitrary length
    x = '''ABCD,EFGH,IJKL,34.23;562.45;213.5432
    MNOP,QRST,UVWX,56.23;63.45;625.234
    '''*10**6

    def iterstream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """
    http://stackoverflow.com/a/20260030/190597 (Mechanical snail)
    Lets you use an iterable (e.g. a generator) that yields bytestrings as a
    read-only input stream.

    The stream implements Python 3's newer I/O API (available in Python 2's io
    module).

    For efficiency, the stream is buffered.
    """
    class IterStream(io.RawIOBase):
    def __init__(self):
    self.leftover = None
    def readable(self):
    return True
    def readinto(self, b):
    try:
    l = len(b) # We're supposed to return at most this much
    chunk = self.leftover or next(iterable)
    output, self.leftover = chunk[:l], chunk[l:]
    b[:len(output)] = output
    return len(output)
    except StopIteration:
    return 0 # indicate EOF
    return io.BufferedReader(IterStream(), buffer_size=buffer_size)

    def replacementgenerator(haystack, needle, replace):
    for s in haystack:
    if s == needle:
    yield str.encode(replace);
    else:
    yield str.encode(s);

    csv = pd.read_csv(iterstream(replacementgenerator(x, ";", ",")), usecols=[3, 4, 5])


    Note that we convert the string (or its constituent characters) to bytes through str.encode, as this is required for use by Pandas.



    This approach is functionally identical to the answer by Daniele except for the fact that we replace values "on-the-fly", as they are requested instead of all in one go.






    share|improve this answer



















    • 1





      Any benchmarks?

      – ivan_pozdeev
      Jan 13 at 4:02











    • Nice idea, but this one clocks in at 2min 1s !

      – jpp
      Jan 13 at 19:35
















    2





    +25









    How about using a generator to do the replacement, and combining it with an appropriate decorator to get a file-like object suitable for pandas?



    import io
    import pandas as pd

    # strings in first 3 columns are of arbitrary length
    x = '''ABCD,EFGH,IJKL,34.23;562.45;213.5432
    MNOP,QRST,UVWX,56.23;63.45;625.234
    '''*10**6

    def iterstream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """
    http://stackoverflow.com/a/20260030/190597 (Mechanical snail)
    Lets you use an iterable (e.g. a generator) that yields bytestrings as a
    read-only input stream.

    The stream implements Python 3's newer I/O API (available in Python 2's io
    module).

    For efficiency, the stream is buffered.
    """
    class IterStream(io.RawIOBase):
    def __init__(self):
    self.leftover = None
    def readable(self):
    return True
    def readinto(self, b):
    try:
    l = len(b) # We're supposed to return at most this much
    chunk = self.leftover or next(iterable)
    output, self.leftover = chunk[:l], chunk[l:]
    b[:len(output)] = output
    return len(output)
    except StopIteration:
    return 0 # indicate EOF
    return io.BufferedReader(IterStream(), buffer_size=buffer_size)

    def replacementgenerator(haystack, needle, replace):
    for s in haystack:
    if s == needle:
    yield str.encode(replace);
    else:
    yield str.encode(s);

    csv = pd.read_csv(iterstream(replacementgenerator(x, ";", ",")), usecols=[3, 4, 5])


    Note that we convert the string (or its constituent characters) to bytes through str.encode, as this is required for use by Pandas.



    This approach is functionally identical to the answer by Daniele except for the fact that we replace values "on-the-fly", as they are requested instead of all in one go.






    share|improve this answer



















    • 1





      Any benchmarks?

      – ivan_pozdeev
      Jan 13 at 4:02











    • Nice idea, but this one clocks in at 2min 1s !

      – jpp
      Jan 13 at 19:35














    2





    +25







    2





    +25



    2




    +25





    How about using a generator to do the replacement, and combining it with an appropriate decorator to get a file-like object suitable for pandas?



    import io
    import pandas as pd

    # strings in first 3 columns are of arbitrary length
    x = '''ABCD,EFGH,IJKL,34.23;562.45;213.5432
    MNOP,QRST,UVWX,56.23;63.45;625.234
    '''*10**6

    def iterstream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """
    http://stackoverflow.com/a/20260030/190597 (Mechanical snail)
    Lets you use an iterable (e.g. a generator) that yields bytestrings as a
    read-only input stream.

    The stream implements Python 3's newer I/O API (available in Python 2's io
    module).

    For efficiency, the stream is buffered.
    """
    class IterStream(io.RawIOBase):
    def __init__(self):
    self.leftover = None
    def readable(self):
    return True
    def readinto(self, b):
    try:
    l = len(b) # We're supposed to return at most this much
    chunk = self.leftover or next(iterable)
    output, self.leftover = chunk[:l], chunk[l:]
    b[:len(output)] = output
    return len(output)
    except StopIteration:
    return 0 # indicate EOF
    return io.BufferedReader(IterStream(), buffer_size=buffer_size)

    def replacementgenerator(haystack, needle, replace):
    for s in haystack:
    if s == needle:
    yield str.encode(replace);
    else:
    yield str.encode(s);

    csv = pd.read_csv(iterstream(replacementgenerator(x, ";", ",")), usecols=[3, 4, 5])


    Note that we convert the string (or its constituent characters) to bytes through str.encode, as this is required for use by Pandas.



    This approach is functionally identical to the answer by Daniele except for the fact that we replace values "on-the-fly", as they are requested instead of all in one go.






    share|improve this answer













    How about using a generator to do the replacement, and combining it with an appropriate decorator to get a file-like object suitable for pandas?



    import io
    import pandas as pd

    # strings in first 3 columns are of arbitrary length
    x = '''ABCD,EFGH,IJKL,34.23;562.45;213.5432
    MNOP,QRST,UVWX,56.23;63.45;625.234
    '''*10**6

    def iterstream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """
    http://stackoverflow.com/a/20260030/190597 (Mechanical snail)
    Lets you use an iterable (e.g. a generator) that yields bytestrings as a
    read-only input stream.

    The stream implements Python 3's newer I/O API (available in Python 2's io
    module).

    For efficiency, the stream is buffered.
    """
    class IterStream(io.RawIOBase):
    def __init__(self):
    self.leftover = None
    def readable(self):
    return True
    def readinto(self, b):
    try:
    l = len(b) # We're supposed to return at most this much
    chunk = self.leftover or next(iterable)
    output, self.leftover = chunk[:l], chunk[l:]
    b[:len(output)] = output
    return len(output)
    except StopIteration:
    return 0 # indicate EOF
    return io.BufferedReader(IterStream(), buffer_size=buffer_size)

    def replacementgenerator(haystack, needle, replace):
    for s in haystack:
    if s == needle:
    yield str.encode(replace);
    else:
    yield str.encode(s);

    csv = pd.read_csv(iterstream(replacementgenerator(x, ";", ",")), usecols=[3, 4, 5])


    Note that we convert the string (or its constituent characters) to bytes through str.encode, as this is required for use by Pandas.



    This approach is functionally identical to the answer by Daniele except for the fact that we replace values "on-the-fly", as they are requested instead of all in one go.







    share|improve this answer












    share|improve this answer



    share|improve this answer










    answered Jan 10 at 20:55









    Max SnijdersMax Snijders

    24129




    24129








    • 1





      Any benchmarks?

      – ivan_pozdeev
      Jan 13 at 4:02











    • Nice idea, but this one clocks in at 2min 1s !

      – jpp
      Jan 13 at 19:35














    • 1





      Any benchmarks?

      – ivan_pozdeev
      Jan 13 at 4:02











    • Nice idea, but this one clocks in at 2min 1s !

      – jpp
      Jan 13 at 19:35








    1




    1





    Any benchmarks?

    – ivan_pozdeev
    Jan 13 at 4:02





    Any benchmarks?

    – ivan_pozdeev
    Jan 13 at 4:02













    Nice idea, but this one clocks in at 2min 1s !

    – jpp
    Jan 13 at 19:35





    Nice idea, but this one clocks in at 2min 1s !

    – jpp
    Jan 13 at 19:35











    1














    If this is an option, substituting the character ; with , in the string is faster.
    I have written the string x to a file test.dat.



    def csv_reader_4(x):
    with open(x, 'r') as f:
    a = f.read()
    return pd.read_csv(StringIO(unicode(a.replace(';', ','))), usecols=[3, 4, 5])


    The unicode() function was necessary to avoid a TypeError in Python 2.



    Benchmarking:



    %timeit csv_reader_2('test.dat')  # 1.6 s per loop
    %timeit csv_reader_4('test.dat') # 1.2 s per loop





    share|improve this answer
























    • This is causing MemoryError for me, presumably because it requires reading everything in, effectively, twice? Once into a and then into pd.DataFrame.

      – jpp
      Jan 7 at 11:07











    • I think a.replace creates a copy. Unfortunately, I do not see an easy way to avoid this without using more sophisticated tools like cython

      – Daniele
      Jan 7 at 12:32


















    1














    If this is an option, substituting the character ; with , in the string is faster.
    I have written the string x to a file test.dat.



    def csv_reader_4(x):
    with open(x, 'r') as f:
    a = f.read()
    return pd.read_csv(StringIO(unicode(a.replace(';', ','))), usecols=[3, 4, 5])


    The unicode() function was necessary to avoid a TypeError in Python 2.



    Benchmarking:



    %timeit csv_reader_2('test.dat')  # 1.6 s per loop
    %timeit csv_reader_4('test.dat') # 1.2 s per loop





    share|improve this answer
























    • This is causing MemoryError for me, presumably because it requires reading everything in, effectively, twice? Once into a and then into pd.DataFrame.

      – jpp
      Jan 7 at 11:07











    • I think a.replace creates a copy. Unfortunately, I do not see an easy way to avoid this without using more sophisticated tools like cython

      – Daniele
      Jan 7 at 12:32
















    1












    1








    1







    If this is an option, substituting the character ; with , in the string is faster.
    I have written the string x to a file test.dat.



    def csv_reader_4(x):
    with open(x, 'r') as f:
    a = f.read()
    return pd.read_csv(StringIO(unicode(a.replace(';', ','))), usecols=[3, 4, 5])


    The unicode() function was necessary to avoid a TypeError in Python 2.



    Benchmarking:



    %timeit csv_reader_2('test.dat')  # 1.6 s per loop
    %timeit csv_reader_4('test.dat') # 1.2 s per loop





    share|improve this answer













    If this is an option, substituting the character ; with , in the string is faster.
    I have written the string x to a file test.dat.



    def csv_reader_4(x):
    with open(x, 'r') as f:
    a = f.read()
    return pd.read_csv(StringIO(unicode(a.replace(';', ','))), usecols=[3, 4, 5])


    The unicode() function was necessary to avoid a TypeError in Python 2.



    Benchmarking:



    %timeit csv_reader_2('test.dat')  # 1.6 s per loop
    %timeit csv_reader_4('test.dat') # 1.2 s per loop






    share|improve this answer












    share|improve this answer



    share|improve this answer










    answered Jan 7 at 10:59









    DanieleDaniele

    32419




    32419













    • This is causing MemoryError for me, presumably because it requires reading everything in, effectively, twice? Once into a and then into pd.DataFrame.

      – jpp
      Jan 7 at 11:07











    • I think a.replace creates a copy. Unfortunately, I do not see an easy way to avoid this without using more sophisticated tools like cython

      – Daniele
      Jan 7 at 12:32





















    • This is causing MemoryError for me, presumably because it requires reading everything in, effectively, twice? Once into a and then into pd.DataFrame.

      – jpp
      Jan 7 at 11:07











    • I think a.replace creates a copy. Unfortunately, I do not see an easy way to avoid this without using more sophisticated tools like cython

      – Daniele
      Jan 7 at 12:32



















    This is causing MemoryError for me, presumably because it requires reading everything in, effectively, twice? Once into a and then into pd.DataFrame.

    – jpp
    Jan 7 at 11:07





    This is causing MemoryError for me, presumably because it requires reading everything in, effectively, twice? Once into a and then into pd.DataFrame.

    – jpp
    Jan 7 at 11:07













    I think a.replace creates a copy. Unfortunately, I do not see an easy way to avoid this without using more sophisticated tools like cython

    – Daniele
    Jan 7 at 12:32







    I think a.replace creates a copy. Unfortunately, I do not see an easy way to avoid this without using more sophisticated tools like cython

    – Daniele
    Jan 7 at 12:32













    1














    A very very very fast one, 3.51 is the result, simply just make csv_reader_4 the below, it simply converts StringIO to str, then replaces ; with ,, and reads the dataframe with sep=',':



    def csv_reader_4(x):
    with x as fin:
    reader = pd.read_csv(StringIO(fin.getvalue().replace(';',',')), sep=',',header=None)
    return reader


    The benchmark:



    %timeit csv_reader_4(StringIO(x)) # 3.51 s per loop





    share|improve this answer
























    • Have you tested relative performance on consistent hardware / setup? I see this one of the slower solutions, I have updated my question with benchmarking for this.

      – jpp
      Jan 13 at 15:51











    • @jpp Ugh, your timings are different from mine, i am on Windows.

      – U9-Forward
      Jan 14 at 3:00











    • @U9-Forward I improved your approach by doing the replacement during read() operations: stackoverflow.com/a/54176770/6394138

      – Leon
      Jan 14 at 6:47
















    1














    A very very very fast one, 3.51 is the result, simply just make csv_reader_4 the below, it simply converts StringIO to str, then replaces ; with ,, and reads the dataframe with sep=',':



    def csv_reader_4(x):
    with x as fin:
    reader = pd.read_csv(StringIO(fin.getvalue().replace(';',',')), sep=',',header=None)
    return reader


    The benchmark:



    %timeit csv_reader_4(StringIO(x)) # 3.51 s per loop





    share|improve this answer
























    • Have you tested relative performance on consistent hardware / setup? I see this one of the slower solutions, I have updated my question with benchmarking for this.

      – jpp
      Jan 13 at 15:51











    • @jpp Ugh, your timings are different from mine, i am on Windows.

      – U9-Forward
      Jan 14 at 3:00











    • @U9-Forward I improved your approach by doing the replacement during read() operations: stackoverflow.com/a/54176770/6394138

      – Leon
      Jan 14 at 6:47














    1












    1








    1







    A very very very fast one, 3.51 is the result, simply just make csv_reader_4 the below, it simply converts StringIO to str, then replaces ; with ,, and reads the dataframe with sep=',':



    def csv_reader_4(x):
    with x as fin:
    reader = pd.read_csv(StringIO(fin.getvalue().replace(';',',')), sep=',',header=None)
    return reader


    The benchmark:



    %timeit csv_reader_4(StringIO(x)) # 3.51 s per loop





    share|improve this answer













    A very very very fast one, 3.51 is the result, simply just make csv_reader_4 the below, it simply converts StringIO to str, then replaces ; with ,, and reads the dataframe with sep=',':



    def csv_reader_4(x):
    with x as fin:
    reader = pd.read_csv(StringIO(fin.getvalue().replace(';',',')), sep=',',header=None)
    return reader


    The benchmark:



    %timeit csv_reader_4(StringIO(x)) # 3.51 s per loop






    share|improve this answer












    share|improve this answer



    share|improve this answer










    answered Jan 13 at 6:25









    U9-ForwardU9-Forward

    18.6k51744




    18.6k51744













    • Have you tested relative performance on consistent hardware / setup? I see this one of the slower solutions, I have updated my question with benchmarking for this.

      – jpp
      Jan 13 at 15:51











    • @jpp Ugh, your timings are different from mine, i am on Windows.

      – U9-Forward
      Jan 14 at 3:00











    • @U9-Forward I improved your approach by doing the replacement during read() operations: stackoverflow.com/a/54176770/6394138

      – Leon
      Jan 14 at 6:47



















    • Have you tested relative performance on consistent hardware / setup? I see this one of the slower solutions, I have updated my question with benchmarking for this.

      – jpp
      Jan 13 at 15:51











    • @jpp Ugh, your timings are different from mine, i am on Windows.

      – U9-Forward
      Jan 14 at 3:00











    • @U9-Forward I improved your approach by doing the replacement during read() operations: stackoverflow.com/a/54176770/6394138

      – Leon
      Jan 14 at 6:47

















    Have you tested relative performance on consistent hardware / setup? I see this one of the slower solutions, I have updated my question with benchmarking for this.

    – jpp
    Jan 13 at 15:51





    Have you tested relative performance on consistent hardware / setup? I see this one of the slower solutions, I have updated my question with benchmarking for this.

    – jpp
    Jan 13 at 15:51













    @jpp Ugh, your timings are different from mine, i am on Windows.

    – U9-Forward
    Jan 14 at 3:00





    @jpp Ugh, your timings are different from mine, i am on Windows.

    – U9-Forward
    Jan 14 at 3:00













    @U9-Forward I improved your approach by doing the replacement during read() operations: stackoverflow.com/a/54176770/6394138

    – Leon
    Jan 14 at 6:47





    @U9-Forward I improved your approach by doing the replacement during read() operations: stackoverflow.com/a/54176770/6394138

    – Leon
    Jan 14 at 6:47











    1














    Python has powerfull features to manipulate data, but don't expect performance using python.When performance is needed , C and C++ are your friend .
    Any fast library in python is written in C/C++. It is quite easy to use C/C++ code in python, have a look at swig utility (http://www.swig.org/tutorial.html) . You can write a c++ class that may contain some fast utilities that you will use in your python code when needed.






    share|improve this answer




























      1














      Python has powerfull features to manipulate data, but don't expect performance using python.When performance is needed , C and C++ are your friend .
      Any fast library in python is written in C/C++. It is quite easy to use C/C++ code in python, have a look at swig utility (http://www.swig.org/tutorial.html) . You can write a c++ class that may contain some fast utilities that you will use in your python code when needed.






      share|improve this answer


























        1












        1








        1







        Python has powerfull features to manipulate data, but don't expect performance using python.When performance is needed , C and C++ are your friend .
        Any fast library in python is written in C/C++. It is quite easy to use C/C++ code in python, have a look at swig utility (http://www.swig.org/tutorial.html) . You can write a c++ class that may contain some fast utilities that you will use in your python code when needed.






        share|improve this answer













        Python has powerfull features to manipulate data, but don't expect performance using python.When performance is needed , C and C++ are your friend .
        Any fast library in python is written in C/C++. It is quite easy to use C/C++ code in python, have a look at swig utility (http://www.swig.org/tutorial.html) . You can write a c++ class that may contain some fast utilities that you will use in your python code when needed.







        share|improve this answer












        share|improve this answer



        share|improve this answer










        answered Jan 14 at 7:28









        sancelotsancelot

        933622




        933622























            1














            In my environment (Ubuntu 16.04, 4GB RAM, Python 3.5.2) the fastest method was (the prototypical1) csv_reader_5 (taken from U9-Forward's answer) which ran only less than 25% slower than reading the entire CSV file with no conversions. I improved that approach by implementing a filter/wrapper that replaces the char in the read() method:



            class SingleCharReplacingFilter:

            def __init__(self, reader, oldchar, newchar):
            def proxy(obj, attr):
            a = getattr(obj, attr)
            if attr in ('read'):
            def f(*args):
            return a(*args).replace(oldchar, newchar)
            return f
            else:
            return a

            for a in dir(reader):
            if not a.startswith("_") or a == '__iter__':
            setattr(self, a, proxy(reader, a))

            def csv_reader_6(x):
            with x as fin:
            return pd.read_csv(SingleCharReplacingFilter(fin, ";", ","),
            sep=',', header=None, usecols=[3, 4, 5])


            The result is a little better performance compared to reading the entire CSV file with no conversions:



            In [3]: %timeit pd.read_csv(StringIO(x))
            605 ms ± 3.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

            In [4]: %timeit csv_reader_5(StringIO(x))
            733 ms ± 3.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

            In [5]: %timeit csv_reader_6(StringIO(x))
            568 ms ± 2.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)




            1 I call it prototypical because it assumes that the input stream is of StringIO type (since it calls .getvalue() on it).






            edited Jan 16 at 8:50

            answered Jan 14 at 6:41

            Leon













            • On Python 3.7, Pandas 0.23.4, I'm getting ValueError: Invalid file path or buffer object type: <class '__main__.SingleCharReplacingFilter'> on the pd.read_csv line. Any ideas?

              – jpp
              Jan 15 at 23:26











            • @jpp Pandas 0.23.4 has an extra requirement for an object to be considered file-like: it must also have an __iter__ method. I updated my answer to reflect that. (See the sketch after these comments for a quick way to check this.)

              – Leon
              Jan 16 at 8:52











            • Sorry for the delay. I timed this; it took 1s longer than csv_reader_1 on my setup (4.28s for _1 vs 5.28s for _6). I'm using input x = """..."""*10**6 as per my question, Python 3.7.0, Pandas 0.23.4, Windows. I understand this is going to be platform / setup dependent.

              – jpp
              Jan 20 at 4:47
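
            That file-like requirement can be checked directly with pandas' public helper pd.api.types.is_file_like (available since Pandas 0.20). A quick illustration, not from the original thread:

            from io import StringIO
            import pandas as pd

            buf = StringIO('a,b\n1,2\n')

            class NoIter:
                # read() is present, but __iter__ is not, so Pandas 0.23.4
                # rejects instances of this class as file-like.
                def read(self, *args):
                    return buf.read(*args)

            print(pd.api.types.is_file_like(buf))       # True
            print(pd.api.types.is_file_like(NoIter()))  # False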








































