Apache Spark is a widely used platform for processing structured and unstructured data. For structured data, it supports many basic data types, such as integer, long, double, and string. Spark also supports more complex types, such as Date and Timestamp, which developers often find hard to reason about. In this blog post, we take a deep dive into the Date and Timestamp types to help you fully understand their behavior and avoid some common issues. In summary, this blog covers four parts:
● The definition of the Date type and the calendar associated with it, including the calendar switch in Spark 3.0.
● The definition of the Timestamp type and how it relates to time zones, including the details of time zone offset resolution and the subtle behavior changes in the new Java 8 time API that Spark 3.0 uses.
● The common APIs used in Spark to construct date and timestamp values.
● Common pitfalls and best practices for collecting date and timestamp values on the Spark driver.
Date and Calendar
The definition of a date is very simple: it is a combination of the year, month, and day fields, like (year=2012, month=12, day=31). However, the values of the year, month, and day fields have constraints that ensure the date value is a valid day in the real world. For example, the month value must be between 1 and 12, the day value must be between 1 and 28/29/30/31 (depending on the year and month), and so on.
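To make this concrete, here is a minimal scala-shell sketch using the standard java.time API, which enforces exactly these field constraints:

scala> java.time.LocalDate.of(2012, 12, 31)
res0: java.time.LocalDate = 2012-12-31

scala> java.time.LocalDate.of(2012, 2, 30)  // February has no 30th day
java.time.DateTimeException: Invalid date 'FEBRUARY 30'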
Those constraints are defined by one of many possible calendars. Some of them, like the Lunar calendar, are used only in specific regions. Some of them, like the Julian calendar, are used only in history. The de facto international standard is the Gregorian calendar, which is used for civil purposes almost everywhere in the world. It was introduced in 1582 and is extended to dates before 1582 as well. This extended calendar is known as the Proleptic Gregorian calendar.
Starting from version 3.0, Spark uses the Proleptic Gregorian calendar, which is already used by other data systems such as pandas, R, and Apache Arrow. Before Spark 3.0, a hybrid of the Julian and Gregorian calendars was used: the Julian calendar for dates before 1582 and the Gregorian calendar for dates after 1582. This was inherited from the legacy java.sql.Date API, which was superseded in Java 8 by java.time.LocalDate, which also uses the Proleptic Gregorian calendar.
Notably, time zones are not considered in the Date type.
Timestamp and time zone
The Timestamp type extends the Date type with new fields: hour, minute, second (which can have a fractional part), and a global, session-scoped time zone. Together they define a concrete time instant on Earth. For example, (year=2012, month=12, day=31, hour=23, minute=59, second=59.123456) with session time zone UTC+01:00. When writing timestamp values to non-text data sources such as Parquet, the values are just instants (like timestamps in UTC) that carry no information about the time zone. If you write a timestamp value under one session time zone and read it back under a different one, you may see different values of the hour, minute, and second fields, but they still refer to the same concrete time instant.
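As an illustration, here is a hedged spark-shell sketch; spark.sql.session.timeZone is the standard configuration key for the session time zone, and the exact rendering below assumes a Spark 3.0 shell:

scala> spark.conf.set("spark.sql.session.timeZone", "UTC")

scala> val ts = spark.sql("SELECT TIMESTAMP '2012-12-31 23:59:59.123456' AS ts")

scala> ts.show(false)
+--------------------------+
|ts                        |
+--------------------------+
|2012-12-31 23:59:59.123456|
+--------------------------+

scala> spark.conf.set("spark.sql.session.timeZone", "UTC+01:00")

scala> ts.show(false)  // same instant, rendered in the new session time zone
+--------------------------+
|ts                        |
+--------------------------+
|2013-01-01 00:59:59.123456|
+--------------------------+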
The hour, minute, and second fields have standard ranges: 0–23 for hours and 0–59 for minutes and seconds. Spark supports fractional seconds with up to microsecond precision; the valid range for the fractional part is 0 to 999,999 microseconds.
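For example, the make_timestamp SQL function (available since Spark 3.0) builds a timestamp from exactly these fields, including a fractional seconds part; a brief sketch:

scala> spark.sql("SELECT make_timestamp(2012, 12, 31, 23, 59, 59.123456) AS ts").show(false)
+--------------------------+
|ts                        |
+--------------------------+
|2012-12-31 23:59:59.123456|
+--------------------------+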
Time zones
At any concrete moment, you can observe many different wall clock values, depending on the time zone. Conversely, any wall clock value can represent many different time instants. The time zone offset is what allows us to unambiguously bind a local timestamp to a time instant. Time zone offsets are usually defined as offsets in hours from Greenwich Mean Time (GMT) or UTC+0 (Coordinated Universal Time). Such a representation of time zone information eliminates ambiguity, but it is inconvenient for end users. Users prefer to point to a location around the globe, such as America/Los_Angeles or Europe/Paris.
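Both forms are representable with the java.time classes that Spark 3.0 relies on; a quick sketch:

scala> java.time.ZoneOffset.ofHours(-8)  // a fixed offset from UTC
res0: java.time.ZoneOffset = -08:00

scala> java.time.ZoneId.of("America/Los_Angeles")  // a region name carrying the full offset rules
res1: java.time.ZoneId = America/Los_Angeles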
Time Zone offsets
This extra degree of abstraction over zone offsets makes life easier, but it brings its own problems. For example, we now have to maintain a special time zone database to map time zone names to offsets. Since Spark runs on the JVM, this mapping is delegated to the Java standard library, which loads data from the Internet Assigned Numbers Authority Time Zone Database (IANA TZDB).
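For instance, a region name is resolved to a concrete offset at a given local timestamp using the JVM's bundled copy of the TZDB; a minimal java.time sketch (Europe/Paris and the date below are just illustrative choices):

scala> java.time.ZoneId.of("Europe/Paris")
  .getRules.getOffset(java.time.LocalDateTime.parse("2020-01-01T00:00"))
res0: java.time.ZoneOffset = +01:00

Furthermore, the mapping mechanism in Java's standard library has some nuances that influence Spark's behavior. Below we highlight some of them.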
● Since Java 8, the JDK has provided a new API for date-time manipulation and time zone offset resolution, and Spark migrated to this new API in version 3.0. Although the mapping of time zone names to offsets has the same source, IANA TZDB, it is implemented differently in Java 8 and higher versus Java 7.
● As an example, let's take a look at a timestamp before the year 1883 in the America/Los_Angeles time zone: 1883-11-10 00:00:00. This year stands out from others because on November 18, 1883, all North American railroads switched to a new standard time system that henceforth governed their schedules.
● Using the Java 7 time API, we can obtain the time zone offset at that local timestamp as -08:00:
scala> java.time.ZoneId.systemDefault
res0: java.time.ZoneId = America/Los_Angeles

scala> java.sql.Timestamp.valueOf("1883-11-10 00:00:00").getTimezoneOffset / 60.0
res1: Double = 8.0
The functions in the Java 8 API return a different result:
scala> java.time.ZoneId.of("America/Los_Angeles")
  .getRules.getOffset(java.time.LocalDateTime.parse("1883-11-10T00:00"))
res2: java.time.ZoneOffset = -07:52:58
Historical Data of Time Zones
The example shows that the Java 8 functions are more precise and take historical data from IANA TZDB into account. After switching to the Java 8 time API, Spark 3.0 benefited from the improvement automatically and became more precise in how it resolves time zone offsets.
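To see this in action, here is a hedged spark-shell sketch using Spark's to_utc_timestamp function; with the session time zone set to UTC, Spark 3.0 should surface the historical -07:52:58 offset, whereas Spark 2.4 would apply -08:00 instead:

scala> spark.conf.set("spark.sql.session.timeZone", "UTC")

scala> spark.sql("SELECT to_utc_timestamp(TIMESTAMP '1883-11-10 00:00:00', 'America/Los_Angeles') AS utc").show(false)
+-------------------+
|utc                |
+-------------------+
|1883-11-10 07:52:58|
+-------------------+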
The Proleptic Gregorian calendar in Spark 3.0
As we mentioned earlier, Spark 3.0 also switched to the Proleptic Gregorian calendar for the Date type, and the same is true for the Timestamp type. The ISO SQL:2016 standard declares the valid range for timestamps to be from 0001-01-01 00:00:00 to 9999-12-31 23:59:59.999999. Spark 3.0 fully conforms to the standard and supports all timestamps in this range. Compared to Spark 2.4 and earlier, the following sub-ranges are worth highlighting:
● 0001-01-01 00:00:00 .. 1582-10-03 23:59:59.999999. Spark 2.4 uses the Julian calendar and does not conform to the standard. Spark 3.0 fixes the problem and applies the Proleptic Gregorian calendar in internal operations on timestamps, such as getting the year, month, or day. Because of the different calendars, some dates that exist in Spark 2.4 do not exist in Spark 3.0. For example, 1000-02-29 is not a valid date because 1000 is not a leap year in the Gregorian calendar (see the snippet after this list). Also, Spark 2.4 resolves time zone names to zone offsets incorrectly for this timestamp range.
● 1582-10-04 00:00:00 .. 1582-10-14 23:59:59.999999. This is a valid range of local timestamps in Spark 3.0, in contrast to Spark 2.4, where such timestamps do not exist.
● 1582-10-15 00:00:00 .. 1899-12-31 23:59:59.999999. Spark 3.0 resolves time zone offsets correctly using historical data from IANA TZDB. Compared to Spark 3.0, Spark 2.4 can resolve zone offsets from time zone names incorrectly in some cases, as shown in the example above.
● 1900-01-01 00:00:00 .. 2036-12-31 23:59:59.999999. Both Spark 3.0 and Spark 2.4 conform to the ANSI SQL standard and use the Gregorian calendar in date-time operations such as getting the day of the month.
● 2037-01-01 00:00:00 .. 9999-12-31 23:59:59.999999. Due to JDK bug #8073446, Spark 2.4 can resolve time zone offsets, and in particular daylight saving time offsets, incorrectly. Spark 3.0 does not suffer from this defect.
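As a small illustration of the first sub-range above, plain java.time calls (which use the same Proleptic Gregorian calendar as Spark 3.0) reject 1000-02-29 while accepting a genuine Gregorian leap day:

scala> java.time.LocalDate.of(1000, 2, 29)
java.time.DateTimeException: Invalid date 'February 29' as '1000' is not a leap year

scala> java.time.LocalDate.of(2000, 2, 29)
res0: java.time.LocalDate = 2000-02-29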
Conclusion
I hope this deep dive gives you a clear picture of how Apache Spark handles dates and timestamps, and of how the calendar and time API changes in Spark 3.0 make that handling more precise and standard-compliant.