Error Parsing a CSV File

Goal: Download a CSV file from a website and extract only specific data from it into a Google sheet on my Google Drive.

So far in the scenario I have used the HTTP (Get a File) Module and linked it to the CSV (Parse CSV) Module.

When I run the scenario I encounter this error:

The operation failed with an error. Number of columns on line 2696 does not match header

Steps Attempted:

  • Searching this forum

  • Trying the AI assistant to point me in the right direction

  • Running down various rabbit holes based on what I have read on forums and from the AI assistant responses

Through all this, I found a few similar problems people have had to mine but was unable to use that information to solve my issue.

So far in trying to solve this I learned:

  • What a line terminator is and that this CSV file in particular uses CRLF

  • That Make wants to see LF, and that may be one of the issues I need to figure out how to solve.

  • What trailing commas are in a CSV file and how the CSV parser may be tripping over them when trying to parse the file

  • I still have a lot to learn

This is the link to the CSV file in question, and I’ve included some screenshots to hopefully provide enough information to help solve my issue.

Thank you in advance for all your time and help.


Welcome to the Make community!

As the error says, line 2696 onwards are in a different format.

Screenshot_2024-06-15_100615

Also there are supposed to be 9 columns but then it becomes 4.

Fix this.

samliewrequest private consultation

Join the Make Fans Discord server to chat with other makers!

1 Like

Hey @samliew thanks for the reply.

What tool can I use to determine the same information in your screenshot?

I used Visual Studio Code on a recommendation to determine the line terminators and format. VSC showed trailing commas, instead of nothing like your screenshot shows, which led me to believe that maybe it was empty columns creating the issue.

Your solution “Fix this” concerning the columns becoming 4, points me in a general direction but opens more questions than answers for a new make user like myself.

Based on the research I have been doing since the original post of this error, I have come across a few solutions that were based on utilizing the CSV text field at the bottom of the settings pop-up menu for the module.

I have seen where you and others have suggested regular expressions to handle some of these errors. From what I can tell utilizing a regex like replace(; ; ) modies how the incoming data is handled by the Parse CSV module.

Is this the right direction to pursue? a regex that can handle and fix the column mismatch?

Or is there something like using a text parser or other module/tool in make that is better suited for fixing this mismatch?

I understand the value of leading someone to an answer rather than giving it to them, but any additional direction from you or anyone else in the community is very much appreciated.

That was Notepad++. I can also open it in VSCode with the same output:

Screenshot_2024-06-16_140632

What I meant was to fix this at the source where this file was being generated.

What was the software that outputs this file? Was datasets merged into a single file?

Or perhaps it was converted from an Excel file that uses different formats from row 2696 onwards?

This is not the right solution here. Regex is for matching/extracting text of a similar format.

You can’t really fix it with regex when the output may be “corrupted” in unknown ways.

Why was it “corrupted” in the first place?

samliewrequest private consultation

Join the Make Fans Discord server to chat with other makers!

1 Like

It does look like the Parse CSV module expects to have each row have the number of columns indicated in the configuration. If there are columns missing it refuses to parse the whole line, so starting with row 2696, since there are no trailing commas for empty columns the Parse CSV breaks.

How to fix it:

  1. For rows with less or more than 9 rows you need to have a policy on what you think should be done. This is non trivial because you could have other anomalies in the data. One policy could be to delete any rows in the CSV data that have less or more than 9 columns of data.

  2. See if there is another way to ingest this data to convert it into data without using Parse CSV.

The main issue with a text parser (which you could use) is it would explode the number of operations you would use to the number of lines in the CSV file so I really don’t think you want this.

So it’s back to 1 or find another CSV parsing module in another app that does not depend on the number of columns to be always availabel in every row. In theory the best policy would be to import UP to the max number of columns and keep empty columns not made avavailable in the file empty even if there are no trailing commas in the file. This is after all how Excel would handle a file like this without complaint.

The below scenario snippet simply pads the lines that end with " with the right number of empty commas, assuming these rows are in correct. You could potentially create a regex to find rows with less than 9 columns and pad with the right number of commas but they would need to be at the end which is the only policy you could follow for rows like this.

Module Export

You can copy and paste this module export into your scenario. This will paste the modules shown in my screenshots above.

  1. Expand the “View Scenario JSON Blueprint” viewer and click the copy button when you mouseover the top-right of the code block
    Screenshot_2024-01-17_200117
  2. Enter your scenario editor. Press ESC to close any dialogs. Paste your clipboard (eg CTRLV) to add the blueprint code directly into the Make scenario canvas. Each module should appear in your canvas.
  3. Click on each imported module and save it for validation. You may be prompted to remap some variables or select some connections before you can run the scenario.
View Scenario JSON Blueprint
{
    "subflows": [
        {
            "flow": [
                {
                    "id": 1,
                    "module": "http:ActionGetFile",
                    "version": 3,
                    "parameters": {
                        "handleErrors": false
                    },
                    "mapper": {
                        "url": "https://www.freddiemac.com/pmms/docs/PMMS_history.csv",
                        "method": "get",
                        "serializeUrl": false,
                        "shareCookies": false
                    },
                    "metadata": {
                        "designer": {
                            "x": 0,
                            "y": 0
                        },
                        "restore": {},
                        "parameters": [
                            {
                                "name": "handleErrors",
                                "type": "boolean",
                                "label": "Evaluate all states as errors (except for 2xx and 3xx )",
                                "required": true
                            }
                        ],
                        "expect": [
                            {
                                "name": "url",
                                "type": "url",
                                "label": "URL",
                                "required": true
                            },
                            {
                                "name": "serializeUrl",
                                "type": "boolean",
                                "label": "Serialize URL",
                                "required": true
                            },
                            {
                                "name": "method",
                                "type": "hidden",
                                "label": "Method"
                            },
                            {
                                "name": "shareCookies",
                                "type": "boolean",
                                "label": "Share cookies with other HTTP modules",
                                "required": true
                            }
                        ]
                    }
                },
                {
                    "id": 2,
                    "module": "regexp:Replace",
                    "version": 1,
                    "parameters": {},
                    "mapper": {
                        "pattern": "\"{{carriagereturn}}{{newline}}",
                        "value": "\",,,,,{{carriagereturn}}{{newline}}",
                        "global": true,
                        "sensitive": true,
                        "multiline": true,
                        "singleline": false,
                        "text": "{{1.data}}"
                    },
                    "metadata": {
                        "designer": {
                            "x": 300,
                            "y": 0
                        },
                        "restore": {
                            "expect": {
                                "global": {
                                    "mode": "chose"
                                },
                                "sensitive": {
                                    "mode": "chose"
                                },
                                "multiline": {
                                    "mode": "chose"
                                },
                                "singleline": {
                                    "mode": "chose"
                                }
                            }
                        },
                        "expect": [
                            {
                                "name": "pattern",
                                "type": "text",
                                "label": "Pattern",
                                "required": true
                            },
                            {
                                "name": "value",
                                "type": "text",
                                "label": "New value"
                            },
                            {
                                "name": "global",
                                "type": "boolean",
                                "label": "Global match",
                                "required": true
                            },
                            {
                                "name": "sensitive",
                                "type": "boolean",
                                "label": "Case sensitive",
                                "required": true
                            },
                            {
                                "name": "multiline",
                                "type": "boolean",
                                "label": "Multiline",
                                "required": true
                            },
                            {
                                "name": "singleline",
                                "type": "boolean",
                                "label": "Singleline",
                                "required": true
                            },
                            {
                                "name": "text",
                                "type": "text",
                                "label": "Text"
                            }
                        ]
                    }
                },
                {
                    "id": 3,
                    "module": "csv:ParseCSV",
                    "version": 1,
                    "parameters": {
                        "colCount": 9,
                        "csvContainsHeaders": true,
                        "delimiterType": "other",
                        "relax": false,
                        "delimiter": ","
                    },
                    "mapper": {
                        "csv": "{{2.text}}"
                    },
                    "metadata": {
                        "designer": {
                            "x": 600,
                            "y": 0,
                            "messages": [
                                {
                                    "category": "last",
                                    "severity": "warning",
                                    "message": "A transformer should not be the last module in the route."
                                }
                            ]
                        },
                        "restore": {
                            "parameters": {
                                "delimiterType": {
                                    "label": "Other"
                                }
                            }
                        },
                        "parameters": [
                            {
                                "name": "colCount",
                                "type": "number",
                                "label": "Number of columns",
                                "required": true
                            },
                            {
                                "name": "csvContainsHeaders",
                                "type": "boolean",
                                "label": "CSV contains headers",
                                "required": true
                            },
                            {
                                "name": "delimiterType",
                                "type": "select",
                                "label": "Delimiter",
                                "required": true,
                                "validate": {
                                    "enum": [
                                        ",",
                                        "\t",
                                        "other"
                                    ]
                                }
                            },
                            {
                                "name": "relax",
                                "type": "boolean",
                                "label": "Preserve quotes inside unquoted field",
                                "required": true
                            },
                            {
                                "name": "delimiter",
                                "type": "text",
                                "label": "Delimiter character",
                                "validate": {
                                    "max": 1,
                                    "min": 1
                                },
                                "required": true
                            }
                        ],
                        "expect": [
                            {
                                "name": "csv",
                                "type": "text",
                                "label": "CSV",
                                "required": true
                            }
                        ],
                        "interface": [
                            {
                                "name": "col1",
                                "label": "date",
                                "type": "text"
                            },
                            {
                                "name": "col2",
                                "label": "pmms30",
                                "type": "text"
                            },
                            {
                                "name": "col3",
                                "label": "pmms30p",
                                "type": "text"
                            },
                            {
                                "name": "col4",
                                "label": "pmms15",
                                "type": "text"
                            },
                            {
                                "name": "col5",
                                "label": "pmms15p",
                                "type": "text"
                            },
                            {
                                "name": "col6",
                                "label": "pmms51",
                                "type": "text"
                            },
                            {
                                "name": "col7",
                                "label": "pmms51p",
                                "type": "text"
                            },
                            {
                                "name": "col8",
                                "label": "pmms51m",
                                "type": "text"
                            },
                            {
                                "name": "col9",
                                "label": "pmms51spread",
                                "type": "text"
                            }
                        ]
                    }
                }
            ]
        }
    ],
    "metadata": {
        "version": 1
    }
}
4 Likes

@samliew When I click on the link to the file that I am pointing the HTTP get module to, it appears to just be a webpage with the CSV data displayed. As far as what software is being used to output the file I am not sure how I would determine that. It is a public site that updates this CSV data on a weekly basis.

So my understanding is if I cannot control how the data is being output, I would have to pre-process it and get it in the format the Parse CSV module is looking for.

Am I on the right path to pre-processing it with Python or something similar? Or is there another module in Make that is able to accomplish that task?

Thank you for your continued responses, I appreciate your time and help.

Yes, then see Alex’s reply above.

2 Likes

@alex.newpath Thank you so much for your reply. I copied and pasted it into the scenario to check it out and it worked perfectly.

I spent a little time digging through what was entered into the Text Parser Pattern and New Value sections of the module to understand better how this functions, so my apologies for the late response to your post.
image

If I understand everything correctly:

  • using " before the values tells the Text Parser Module to look for exactly these regular expressions

  • Carriagereturn and newline are placeholder for the regex: \r and \n respectively

  • By specifying /r/n without trailing commas, the Text Parser will replace all those instances with 5 trailing commas. Which effectively fills in the 5 missing columns as empty and allows the Parse CSV to function as intended?

The quote “ mark in is just part of the text pattern we look for. I’m not using a regular expression there but just looking for the trailing quote with the carriage return line feed combination that appears at the end of lines that appear to be short.

The rest of your assumptions are correct.

2 Likes