I have a large (2 Million rows) csv file exported from a SQL Server database. I don't have access to the database, and there's some newline character within a column, which makes it difficult to process in R.
Sample data like this:
playerID,yearID,gameNum,gameID,teamID,lgID,GP,startingPos
gomezle01,1933,0,ALS193307060,NYA,AL,1,1
ferreri01,1933,0,ALS193307060,BOS,AL,1,2
gehrilo01,1933,0
,ALS193307060,NYA,AL,1,3
gehrich01,1933,0,ALS193307060,DET,AL,1,4
dykesji01,1933,0,ALS193307060,CHA,AL,1,5
cronijo01,1933,0,ALS193307060
,WS1,AL,1,6
chapmbe01,1933,0,ALS193307060,NYA,AL,1,7
simmoal01,1933,0,ALS193307060,CHA,AL,1,8
ruthba01,1933,0,ALS193307060,NYA,AL,1,9
In above sample data, some lines breaks into two lines. How to clean it up?
Update
- The csv file was generated on Microsoft platform. So the line ending is
^m. I made up the sample data on Linux - ended in\n. It's my bad, I didn't make it clear. But I guess I can substitute\nwith\r\n. - The line break doesn't always appear before a comma, it appears randomly even within a field like
.
Bill
Gates.
Solved
Step 1: Remove ^M in the middle of line:
perl -pe 's/\r(?!\n)//g'
Ref: https://stackoverflow.com/questions/6081465/how-to-remove-carriage-returns-in-the-middle-of-a-line
Step 2: Substitute \n, with , (See @jimmij 's answer below.)
perl -p00e 's/\n,/,/g'
^Mor\r\n. I didn't make it clear, my bad. (I made the data sample on Linux.) No. The split is not always before comma, it even appears within a field likeBill \r\n Gates. I know this is really stupid. Data cleaning is always frustrated. – Nick Aug 09 '15 at 01:26^Min the middle of the line, and then using your perl code. The00option is not so clear, can you explain it briefly? What does a paragraph mean? – Nick Aug 09 '15 at 02:370nuances in http://unix.stackexchange.com/questions/192485/how-exactly-does-perls-0-option-work . In fact in this case single0should be enough. – jimmij Aug 09 '15 at 10:21