I want to create a pipeline in make.com that first converts my pdf to excel and then runs a python code to get the data in a standardized format. All the pdfs have different column names which mean the same. Some have a few extra columns. My aim is to have all files in the same format. I have attached snippets of my pipeline at the end. I don’t know if the step I perform to get my data in a standardized format is correct or not.
- How to read the excel file returned by cloudconvert in my python code?
- Where do I store the data returned by the python code? (df[df_ordered]) I want to download this data as excel again and store it on one drive
This is my python code:
import pandas as pd
import numpy as np
data = pd.read_excel({{3.data}})
def std_cols(df):
standard_format = {
‘Item’: [‘PRODUCT DESCRIPTION’, ‘PRODUCT NAME’],
‘Opening stock quantity’: [‘OP STOCK’, ‘OPENING STOCK’, ‘OP QTY’
, ‘OPENING\nSTOCK’, ‘Op.Bal.\nQty.’
],
‘Opening stock amount’: [‘OP AMT’],
‘Purchase quantity’: [‘PURCHASE QUANTITY’, ‘PURCHASE QTY’,
‘PUR QTY’, ‘PURCHASE\nQUANTITY’,
‘Receipt\nQty.’],
‘Purchase amount’: ,
‘Purchase return quantity’: [‘P/R\nQUANTITY’],
‘Purchase return amount’: ,
‘Sales quantity’: [‘SALE\nQUANTITY’, ‘Issue Qty.’],
‘Sales amount’: ,
‘Sales return quantity’: [‘SALE RETURN\nQUANTITY’],
‘Sales return amount’: ,
‘Closing stock quantity’: [‘CLOSING\nSTOCK’, ‘Closing Balance’
],
‘Closing stock amount’: ,
‘Rate’: [‘RATE’],
‘Expiry Stock’: [‘EXPIRY\nSTOCK’, ‘Near Expiry’],
}
actual_vs_std_col = {}
for (x, y) in standard_format.items():
for c in df.columns:
if c in y:
actual_vs_std_col[c] = x
necessary_columns = [‘Item’, ‘Opening stock quantity’,
‘Purchase quantity’, ‘Sales quantity’,
‘Closing stock quantity’]
df_columns = [col[1] for col in actual_vs_std_col.items()]
if all(item in df_columns for item in necessary_columns):
drop_col = [col for col in df.columns if col
not in list(actual_vs_std_col)]
df.drop(columns=drop_col, inplace=True)
df.rename(columns=actual_vs_std_col, inplace=True)
new_cols = [new for new in list(standard_format) if new
not in [c[1] for c in actual_vs_std_col.items()]]
for new_c in new_cols:
df[new_c] = 0
df_ordered = list(standard_format)
return df[df_ordered]
result = {‘data’: std_cols(data)}
This is the error that I get: